# <center><font color='red'>JanataHack: Machine Learning for IoT Hackathons</font></center>

<center><img src='https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/jantahack_-thumbnail-1200x1200-90.jpg'/></center>

### Hackathon Introduction

IoT devices are becoming popular nowadays. The widespread use of IoT yields huge amounts of raw data. This data can be effectively __processed by using machine learning__ to derive many __useful insights__ that can become game changers and affect our lives deeply.
<br>


- This __#JantaHack challenge__ gives you an opportunity to work with __sensor data__ and solve an interesting __IoT__ problem.


### Problem Statement

You are working with the government to transform your city into a __smart city__. The vision is to convert it into a digital and intelligent city to improve the efficiency of services for the citizens. One of the problems faced by the government is __traffic__. You are a data scientist working to manage the traffic of the city better and to __provide input on infrastructure planning for the future__.

The government wants to implement a __robust traffic system__ for the city by being prepared for __traffic peaks__. They want to understand the __traffic patterns of the four junctions__ of the city.
- Traffic patterns on holidays, as well as on various other occasions during the year, differ from normal working days. This is important to take into account for your forecasting. 

### TASK

### <center><font color='red'>Predict traffic patterns in each of these four junctions for the next 4 months.</font></center>

<br>
- The sensors on each of these junctions were collecting data at different times, hence you will see traffic data from different time periods. To add to the complexity, some of the junctions have provided limited or sparse data requiring thoughtfulness when creating future projections. Depending upon the historical data of 20 months, the government is looking to you to deliver accurate traffic projections for the coming four months. Your algorithm will become the foundation of a larger transformation to make your city smart and intelligent.


#### Evaluation Metric
- The evaluation metric for this competition is __Root Mean Squared Error (RMSE)__
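
RMSE is the square root of the mean of the squared differences between actual and predicted values, so large errors are penalised more heavily than small ones. A minimal plain-Python sketch follows; the function name `rmse` and the sample counts are illustrative and not part of the competition kit.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up vehicle counts: the errors are 2, 0 and -2
print(rmse([15, 13, 10], [13, 13, 12]))
```

A lower value is better; a perfect forecast scores 0.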

In [97]:
# I am using Google Colab, so first install the dependencies used later in the notebook.

# Install xgboost
!pip install xgboost



In [98]:
# install statsmodels

! pip install statsmodels



In [99]:
# import all the other required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
import math
from datetime import datetime
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
print ("Required--Libraries--Loaded")

# Library for model training
import xgboost as xgb

Required--Libraries--Loaded


In [0]:
# Upload the train and test data, then read the files.
# The paths below are Google Colab paths.
train= pd.read_csv('/content/train.csv')
test= pd.read_csv('/content/test.csv')

In [101]:
print(train.shape)
train.head()

(48120, 4)


Unnamed: 0,DateTime,Junction,Vehicles,ID
0,2015-11-01 00:00:00,1,15,20151101001
1,2015-11-01 01:00:00,1,13,20151101011
2,2015-11-01 02:00:00,1,10,20151101021
3,2015-11-01 03:00:00,1,7,20151101031
4,2015-11-01 04:00:00,1,9,20151101041


### Data Feature structure

<center><img src="https://raw.githubusercontent.com/AIVenture0/ML-of-IOT-JantaHack-AnalyticsVidhya-Hackathons-5th-Rank/master/features.jpg"/></center>

- The `ID` column carries no predictive signal, so it is better to drop it at the beginning.

- Drop the train ID.
- Save the test IDs separately for the final submission.

In [0]:
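# Store the test IDs for junction 3 and for the other junctions separately.
# The reason becomes clear in the modelling section, where junction 3 gets its own model.
# (The original referenced an undefined `test_data`; the DataFrame is named `test`.)
ID_test_three= test[test['Junction']==3]['ID']
ID_test_remain= test[test['Junction']!=3]['ID']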

In [0]:
train=train.drop('ID',axis=1)
test_Id=test['ID']
test=test.drop('ID',axis=1)

In [104]:
print(test.shape)
test.head()

(11808, 2)


Unnamed: 0,DateTime,Junction
0,2017-07-01 00:00:00,1
1,2017-07-01 01:00:00,1
2,2017-07-01 02:00:00,1
3,2017-07-01 03:00:00,1
4,2017-07-01 04:00:00,1


## Data Preprocessing

In [105]:
train.dtypes

DateTime    object
Junction     int64
Vehicles     int64
dtype: object

Convert the `DateTime` column from `object` to `datetime64`.

In [0]:
train['DateTime']=pd.to_datetime(train['DateTime'])

test['DateTime']=pd.to_datetime(test['DateTime'])

In [107]:
train.dtypes
test.dtypes

DateTime    datetime64[ns]
Junction             int64
dtype: object

In [108]:
train.head()

Unnamed: 0,DateTime,Junction,Vehicles
0,2015-11-01 00:00:00,1,15
1,2015-11-01 01:00:00,1,13
2,2015-11-01 02:00:00,1,10
3,2015-11-01 03:00:00,1,7
4,2015-11-01 04:00:00,1,9


In [109]:
test.head()

Unnamed: 0,DateTime,Junction
0,2017-07-01 00:00:00,1
1,2017-07-01 01:00:00,1
2,2017-07-01 02:00:00,1
3,2017-07-01 03:00:00,1
4,2017-07-01 04:00:00,1


## Feature analysis and Feature generation

In [0]:
# training data
train['day'] = train['DateTime'].dt.weekday
train['hour'] = train['DateTime'].dt.hour
train['month'] = train['DateTime'].dt.month

train['Year']= train['DateTime'].dt.year%10
train['DayofWeek']=train['DateTime'].dt.day_name()
train['Week']=train['DateTime'].dt.isocalendar().week.astype('int64')  # .dt.week was removed in pandas 2.0

# Test data
test['day'] = test['DateTime'].dt.weekday
test['hour'] = test['DateTime'].dt.hour
test['month'] = test['DateTime'].dt.month

test['Year']= test['DateTime'].dt.year%10
test['DayofWeek']=test['DateTime'].dt.day_name()
test['Week']=test['DateTime'].dt.isocalendar().week.astype('int64')  # .dt.week was removed in pandas 2.0



### Creating an ordinal feature from the time series data

`Timestamp.toordinal()` returns the proleptic Gregorian ordinal of a given Timestamp object, i.e. a running day count in which January 1 of year 1 has ordinal 1.

[Check more about .toordinal()](https://www.geeksforgeeks.org/python-pandas-timestamp-toordinal/)

Adding this feature to the data improved my leaderboard rank considerably.

In [111]:
print(train['DateTime'][0].toordinal())

735903
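
Note that `toordinal()` counts whole days and ignores the time of day, so every hourly row of a given date receives the same ordinal. The sketch below illustrates this with the standard-library `datetime` class, which `pandas.Timestamp` subclasses; the variable names are illustrative.

```python
from datetime import datetime

midnight = datetime(2015, 11, 1, 0)   # first row of the training data
four_am = datetime(2015, 11, 1, 4)    # same date, later hour
next_day = datetime(2015, 11, 2, 0)

print(midnight.toordinal())                           # 735903, matching the output above
print(four_am.toordinal() == midnight.toordinal())    # True: the hour is ignored
print(next_day.toordinal() - midnight.toordinal())    # 1: one step per calendar day
```

The feature therefore acts as a day-level trend index; the hour-of-day information comes from the separate `hour` column.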


To scale it, we divide by a common constant, 730000.

- As an alternative, try `MinMaxScaler` and see whether the results improve.
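
A min-max variant could look like the sketch below. It is written in plain Python to show the arithmetic; `fit_min_max` and `scale` are hypothetical helper names, and the last training day is an assumption (the day before the test set starts). The bounds come from the training dates only, so test dates, which lie in the future, map to values above 1.

```python
from datetime import date

def fit_min_max(dates):
    """Return the smallest and largest day ordinal seen in the training dates."""
    ordinals = [d.toordinal() for d in dates]
    return min(ordinals), max(ordinals)

def scale(d, lo, hi):
    """Map a date's ordinal linearly so that lo -> 0.0 and hi -> 1.0."""
    return (d.toordinal() - lo) / (hi - lo)

# Assumed training range: first training row through the day before the test set starts
train_dates = [date(2015, 11, 1), date(2017, 6, 30)]
lo, hi = fit_min_max(train_dates)

print(scale(date(2015, 11, 1), lo, hi))        # 0.0
print(scale(date(2017, 6, 30), lo, hi))        # 1.0
print(scale(date(2017, 7, 1), lo, hi) > 1.0)   # True: test dates fall outside the fitted range
```

With scikit-learn, fitting `MinMaxScaler` on the training column and calling `transform` on the test column should give the same mapping.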

In [0]:
# With this we scale the day count.
train["DayCount"] = train["DateTime"].apply(lambda m: m.toordinal()/730000) 
test["DayCount"] = test["DateTime"].apply(lambda m: m.toordinal()/730000)

In [113]:
train.head()

Unnamed: 0,DateTime,Junction,Vehicles,day,hour,month,Year,DayofWeek,Week,DayCount
0,2015-11-01 00:00:00,1,15,6,0,11,5,Sunday,44,1.008086
1,2015-11-01 01:00:00,1,13,6,1,11,5,Sunday,44,1.008086
2,2015-11-01 02:00:00,1,10,6,2,11,5,Sunday,44,1.008086
3,2015-11-01 03:00:00,1,7,6,3,11,5,Sunday,44,1.008086
4,2015-11-01 04:00:00,1,9,6,4,11,5,Sunday,44,1.008086


Traffic patterns differ considerably between working days and holidays.

In [0]:
# Derive a holiday flag from the day of the week.
# For this data, only Sunday is treated as a holiday.

train['Is_holiday']=train['DayofWeek'].map({'Sunday':1,'Monday':0,'Tuesday':0,'Wednesday':0,'Thursday':0,'Friday':0,'Saturday':0})

test['Is_holiday']=test['DayofWeek'].map({'Sunday':1,'Monday':0,'Tuesday':0,'Wednesday':0,'Thursday':0,'Friday':0,'Saturday':0})
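
The mapping above is equivalent to checking whether the weekday index is 6 (Sunday). The sketch below shows that rule in plain Python and how it could be extended with public holidays. The `extra_holidays` parameter and the example holiday date are hypothetical, since the notebook itself only flags Sundays.

```python
from datetime import date

def is_holiday(d, extra_holidays=frozenset()):
    """Return 1 for Sundays or for dates listed in extra_holidays, else 0."""
    # date.weekday() uses Monday=0 ... Sunday=6, the same convention as pandas' dt.weekday
    return int(d.weekday() == 6 or d in extra_holidays)

print(is_holiday(date(2015, 11, 1)))   # 1: a Sunday (first row of the training data)
print(is_holiday(date(2015, 11, 2)))   # 0: a Monday

# Hypothetical public holiday, used only to illustrate the extension
print(is_holiday(date(2015, 11, 2), extra_holidays={date(2015, 11, 2)}))   # 1
```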

In [115]:
test.dtypes

DateTime      datetime64[ns]
Junction               int64
day                    int64
hour                   int64
month                  int64
Year                   int64
DayofWeek             object
Week                   int64
DayCount             float64
Is_holiday             int64
dtype: object

In [116]:
train.columns

Index(['DateTime', 'Junction', 'Vehicles', 'day', 'hour', 'month', 'Year',
       'DayofWeek', 'Week', 'DayCount', 'Is_holiday'],
      dtype='object')

In [117]:
test.columns

Index(['DateTime', 'Junction', 'day', 'hour', 'month', 'Year', 'DayofWeek',
       'Week', 'DayCount', 'Is_holiday'],
      dtype='object')

In [0]:
# train=train.drop(['DayofWeek','DayCount'],axis=1)

# test=test.drop(['DayofWeek','DayCount'],axis=1)

The data for junctions 1, 2 and 4 follow a similar pattern, while the data at junction 3 is much more erratic. So we create two separate models.

- Model 1: creates predictions for junctions 1, 2 and 4.
- Model 2: creates predictions for junction 3.
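
The routing idea can be sketched independently of XGBoost: send each test row to the model for its junction group, then collect the predictions keyed by `ID`. The two stand-in "models" below are hypothetical constant predictors, used only so the sketch runs on its own, and the IDs are made up.

```python
def predict_by_junction(rows, model_others, model_three):
    """Route each row to the model for its junction group; return {ID: prediction}."""
    predictions = {}
    for row in rows:
        model = model_three if row["Junction"] == 3 else model_others
        predictions[row["ID"]] = model(row)
    return predictions

# Hypothetical stand-ins for the two trained regressors
model_others = lambda row: 20.0
model_three = lambda row: 5.0

test_rows = [
    {"ID": 101, "Junction": 1, "hour": 0},
    {"ID": 102, "Junction": 3, "hour": 0},
    {"ID": 103, "Junction": 4, "hour": 0},
]

print(predict_by_junction(test_rows, model_others, model_three))
# {101: 20.0, 102: 5.0, 103: 20.0}
```

Keying predictions by `ID` keeps each prediction attached to its row regardless of the order in which the groups are processed.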



In [0]:
test_three= test[test['Junction']==3]
test_remain= test[test['Junction']!=3]

### Training the model

In [120]:
train.columns

Index(['DateTime', 'Junction', 'Vehicles', 'day', 'hour', 'month', 'Year',
       'DayofWeek', 'Week', 'DayCount', 'Is_holiday'],
      dtype='object')

In [0]:
y_1 = train['Vehicles']
data_1 = train[['Junction', 'day', 'hour', 'month', 'DayCount','Year','Is_holiday']]
test_three_data = test_three[['Junction', 'day', 'hour', 'month', 'DayCount', 'Year','Is_holiday']]

# y_1 = train['Vehicles']
# data_1 = train[['Junction', 'day', 'hour', 'month','Year','Week','Is_holiday','DayCount']]
# test_three_data = test_three[['Junction', 'day', 'hour', 'month','Year','Week','Is_holiday','DayCount']]

In [0]:
# Custom evaluation function: RMSE between the true labels and the predictions.
def evaluation_error(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', math.sqrt(mean_squared_error(labels, preds))

In [123]:
# split data to training and validation sets
X_train, X_test, y_train, y_test = train_test_split(data_1, y_1, test_size=0.02, random_state=1)#28776
# Convert the data to DMatrix, the data structure that XGBoost's native API trains on.

d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)
d_test = xgb.DMatrix(test_three_data)

params = {  
    "n_estimators": 300 ,  # not used by xgb.train; the round count is the 5000 passed to it below
    "max_depth": 6 ,                  
    "learning_rate": 0.0035 ,    
    "colsample_bytree": 1 ,     
    "subsample": 1 ,            
    "gamma": 0.15 ,              
    'reg_alpha': 10 ,           
    "min_child_weight": 4 ,     
    }

# watchlist to see how trained model is performing over validation data and training data.
train_valid = [(d_train, 'train'), (d_valid, 'valid')]

# Train xgboost regressor
reg = xgb.train(params, d_train, 5000, train_valid,  feval = evaluation_error, maximize=False, verbose_eval=50)
# Predict from xgboost regressor
Vehicles_test_3 = reg.predict(d_test)

[0]	train-rmse:30.3911	valid-rmse:28.4291	train-error:30.3909	valid-error:28.4291
[50]	train-rmse:25.7613	valid-rmse:24.1567	train-error:25.7613	valid-error:24.1567
[100]	train-rmse:21.9116	valid-rmse:20.598	train-error:21.9116	valid-error:20.598
[150]	train-rmse:18.7208	valid-rmse:17.6624	train-error:18.7207	valid-error:17.6624
[200]	train-rmse:16.0863	valid-rmse:15.2347	train-error:16.0863	valid-error:15.2347
[250]	train-rmse:13.9196	valid-rmse:13.2372	train-error:13.9196	valid-error:13.2372
[300]	train-rmse:12.1467	valid-rmse:11.6004	train-error:12.1466	valid-error:11.6004
[350]	train-rmse:10.7085	valid-rmse:10.2822	train-error:10.7085	valid-error:10.2822
[400]	train-rmse:9.55204	valid-rmse:9.22996	train-error:9.55206	valid-error:9.22996
[450]	train-rmse:8.62815	valid-rmse:8.39176	train-error:8.62816	valid-error:8.39176
[500]	train-rmse:7.89633	valid-rmse:7.73044	train-error:7.89634	valid-error:7.73044
[550]	train-rmse:7.32702	valid-rmse:7.22048	train-error:7.32704	valid-error:7.220

In [0]:
y = train[train['Junction']!=3]['Vehicles']
data = train[train['Junction']!=3][['Junction', 'day', 'hour', 'month', 'DayCount','Year','Is_holiday']]
sub_test = test_remain[['Junction', 'day', 'hour', 'month', 'DayCount','Year','Is_holiday']]

# y = train[train['Junction']!=3]['Vehicles']
# data = train[train['Junction']!=3][['Junction', 'day', 'hour', 'month','Year','Week','Is_holiday','DayCount']]
# sub_test = test_remain[['Junction', 'day', 'hour', 'month','Year','Week','Is_holiday','DayCount']]

In [125]:
# split data to training and validation sets
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.03, random_state=15)

# Convert the data to DMatrix, the data structure that XGBoost's native API trains on.

d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)
d_test = xgb.DMatrix(sub_test)

params = {  
    "n_estimators": 250 ,  # not used by xgb.train; the round count is the 5500 passed to it below
    "max_depth": 6 ,              
    "learning_rate": 0.0075 ,
    "colsample_bytree": 1 , 
    "subsample": 1 ,        
    "gamma": 0.21 ,              
    'reg_alpha': 1 ,   
    "min_child_weight": 2 ,     
}

# watchlist to see how trained model is performing over validation data and training data.
train_valid = [(d_train, 'train'), (d_valid, 'valid')]

# Train xgboost regressor
reg = xgb.train(params, d_train, 5500, train_valid,  feval = evaluation_error, maximize=False, verbose_eval=50)
# Predict from xgboost regressor
Vehicles_test_others = reg.predict(d_test)

[0]	train-rmse:34.4917	valid-rmse:34.828	train-error:34.4914	valid-error:34.828
[50]	train-rmse:23.9986	valid-rmse:24.1911	train-error:23.9986	valid-error:24.1911
[100]	train-rmse:16.8791	valid-rmse:16.9557	train-error:16.8791	valid-error:16.9557
[150]	train-rmse:12.0805	valid-rmse:12.0533	train-error:12.0806	valid-error:12.0533
[200]	train-rmse:8.90388	valid-rmse:8.8123	train-error:8.90389	valid-error:8.8123
[250]	train-rmse:6.85926	valid-rmse:6.73915	train-error:6.85925	valid-error:6.73915
[300]	train-rmse:5.57524	valid-rmse:5.4597	train-error:5.57524	valid-error:5.4597
[350]	train-rmse:4.79844	valid-rmse:4.70457	train-error:4.79843	valid-error:4.70457
[400]	train-rmse:4.33297	valid-rmse:4.25935	train-error:4.33297	valid-error:4.25935
[450]	train-rmse:4.05221	valid-rmse:3.99956	train-error:4.05222	valid-error:3.99956
[500]	train-rmse:3.87713	valid-rmse:3.84337	train-error:3.87714	valid-error:3.84337
[550]	train-rmse:3.75663	valid-rmse:3.73827	train-error:3.75663	valid-error:3.73827
[

In [0]:
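# Save the test submission.
# IDs and predictions are concatenated in the same order: junctions 1, 2, 4 first, then junction 3.
# Series.append was removed in pandas 2.0, so use pd.concat instead.
Submission_id= pd.concat([ID_test_remain, ID_test_three])

Pred_Vehicles= np.concatenate((Vehicles_test_others,Vehicles_test_3))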


future_prediction = pd.DataFrame({
        "ID":Submission_id,
        "Vehicles": Pred_Vehicles 
    })
future_prediction.to_csv('submission.csv', index=False, encoding='utf-8')

The competition had a strict two-day timeline. I made 10 submissions in total, each with a slightly different approach in the same notebook, so I am unable to provide all the notebooks.

The core idea is the same in every one of them. The only difference is the parameter tuning for the XGBoost algorithm.

