# BigQuery-Geotab Intersection Congestion

We’ve all been there: Stuck at a traffic light, only to be given mere seconds to pass through an intersection, behind a parade of other commuters. Imagine if you could help city planners and governments anticipate traffic hot spots ahead of time and reduce the stop-and-go stress of millions of commuters like you.

# Table of contents
- [Imports and initial exploration](#imports)

- [Exploratory Data Analysis](#eda)
    - [Time features](#hmw)
    - [Exploring street features](#streetfeatures)
    - [Latitude and Longitude](#latlon)
    
- [Preprocessing](#prepro)

- [Baseline model](#baseline)

## Imports and initial exploration
<a id='imports'></a>

In [None]:

!pip install pandarallel

import pandas as pd
import pandarallel
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mplleaflet
from collections import Counter

import json

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from tensorflow import keras

from mlxtend.regressor import StackingRegressor

pandarallel.pandarallel.initialize(progress_bar=True)

sns.set_style('darkgrid')
sns.set_palette('deep')

np.random.seed(15)

In [None]:
train = pd.read_csv('../input/bigquery-geotab-intersection-congestion/train.csv')
test = pd.read_csv('../input/bigquery-geotab-intersection-congestion/test.csv')
sample = pd.read_csv('../input/bigquery-geotab-intersection-congestion/sample_submission.csv')
with open('../input/bigquery-geotab-intersection-congestion/submission_metric_map.json') as f:
    submission_metric_map = json.load(f)

## Exploratory Data Analysis
<a id='eda'></a>

### Time features
<a id='hmw'></a>

The time features are probably strongly correlated with the values we want to predict. Let's visualize this interaction.

In [None]:
time_features = ['Hour', 'Month', 'Weekend']

In [None]:
fig, axes = plt.subplots(2,1, figsize=[15,10])

sns.countplot(data=train[train['Weekend']==0], hue='City', x='Hour', ax=axes[0],);
sns.countplot(data=train[train['Weekend']==1], hue='City', x='Hour', ax=axes[1]);
axes[0].legend([])
axes[1].legend(loc=[-0.2,0.7])
axes[0].set_title("Weekdays")
axes[1].set_title("Weekends")
fig.set_dpi(500)

In [None]:
sns.countplot(x='Month', hue='City', data=train)
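
The count plots above show how many observations fall into each hour and month, not how the targets themselves vary. A minimal sketch of how the relationship with one target could be summarised, using a tiny made-up frame in place of `train` (the numbers are invented for illustration; only the column names follow the competition data):

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    'Hour': [8, 8, 8, 14, 14, 23],
    'Weekend': [0, 0, 1, 0, 1, 0],
    'TotalTimeStopped_p80': [30.0, 50.0, 10.0, 20.0, 12.0, 0.0],
})

# Mean of the target for each hour, with weekdays and weekends side by side.
hourly = (toy.groupby(['Weekend', 'Hour'])['TotalTimeStopped_p80']
             .mean()
             .unstack('Weekend'))
print(hourly)
```

On the real data, the same table could be passed to `sns.heatmap` or plotted as lines to compare weekday and weekend profiles.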

### Exploring street features
<a id='streetfeatures'></a>

In [None]:
street_features = ['EntryStreetName', 'ExitStreetName', 'EntryHeading', 'ExitHeading', 'Path']

`Path` is clearly just a concatenation of the other street features, so we can drop it.
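
One way to check this claim before dropping the column is to rebuild `Path` from its parts and count the matches. The sketch below uses toy rows and assumes that the parts are joined with `'_'`; the separator should be confirmed on the real data.

```python
import pandas as pd

# Toy rows shaped like the competition data; the '_' separator is an assumption.
toy = pd.DataFrame({
    'EntryStreetName': ['Main Street', 'Oak Avenue'],
    'EntryHeading': ['N', 'SE'],
    'ExitStreetName': ['Main Street', 'Pine Road'],
    'ExitHeading': ['N', 'E'],
})
toy['Path'] = ['Main Street_N_Main Street_N', 'Oak Avenue_SE_Pine Road_E']

rebuilt = (toy['EntryStreetName'] + '_' + toy['EntryHeading'] + '_'
           + toy['ExitStreetName'] + '_' + toy['ExitHeading'])

# Fraction of rows where Path carries no information beyond its parts.
match_rate = (rebuilt == toy['Path']).mean()
print(match_rate)
```

Rows with a missing street name produce `NaN` in `rebuilt` and therefore count as mismatches, so they would need separate handling.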

In [None]:
train.drop('Path', axis=1, inplace=True)
test.drop('Path', axis=1, inplace=True)

Each cardinal direction can be encoded as
$$
\frac{\theta}{\pi}
$$
where $\theta$ is the angle, in radians and measured clockwise, between north and the direction we want to encode.
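
As a small sketch of where the encoded values come from: with compass bearings given in degrees, $\theta/\pi$ is the same as degrees divided by 180. The `bearings_deg` and `encoded` names are illustrative only.

```python
# Compass bearings in degrees, measured clockwise from north.
bearings_deg = {'N': 0, 'NE': 45, 'E': 90, 'SE': 135,
                'S': 180, 'SW': 225, 'W': 270, 'NW': 315}

# theta / pi with theta in radians equals degrees / 180.
encoded = {name: deg / 180 for name, deg in bearings_deg.items()}
print(encoded)
```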

In [None]:
directions = {
    'N': 0,
    'NE': 1/4,
    'E': 1/2,
    'SE': 3/4,
    'S': 1,
    'SW': 5/4,
    'W': 3/2,
    'NW': 7/4
}

In [None]:
train['EntryHeading'] = train['EntryHeading'].map(directions)
train['ExitHeading'] = train['ExitHeading'].map(directions)

In [None]:
test['EntryHeading'] = test['EntryHeading'].map(directions)
test['ExitHeading'] = test['ExitHeading'].map(directions)

In [None]:
train['diffHeading'] = (train['ExitHeading']-train['EntryHeading'])
test['diffHeading'] = (test['ExitHeading']-test['EntryHeading'])
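
The raw difference treats the headings as points on a line, so a slight right turn from NW (7/4) to N (0) comes out as -7/4 rather than +1/4. One possible refinement, which is not part of the original pipeline, is to wrap the difference into [-1, 1) so that its sign and size describe the turn. `wrap_heading_diff` is a hypothetical helper name.

```python
import numpy as np

def wrap_heading_diff(entry, exit_):
    """Signed turn in units of pi, wrapped into the interval [-1, 1)."""
    diff = np.asarray(exit_, dtype=float) - np.asarray(entry, dtype=float)
    return (diff + 1.0) % 2.0 - 1.0

# NW -> N (right turn), N -> NW (left turn), N -> S (U-turn).
turns = wrap_heading_diff([1.75, 0.0, 0.0], [0.0, 1.75, 1.0])
print(turns)
```

With this convention positive values are clockwise (right) turns and negative values are counter-clockwise (left) turns.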

### Looking at street names

In [None]:
word_count = Counter()
for name in train['EntryStreetName']:
    if pd.isna(name):
        continue
    for word in name.split():
        word_count[word]+=1
        
for name in train['ExitStreetName']:
    if pd.isna(name):
        continue
    for word in name.split():
        word_count[word]+=1

In [None]:
sorted(word_count.items(),key=lambda item: item[1], reverse=True)[:20]

Let's use the following road types: Street, Avenue, Road, Boulevard, Broad and Drive

After looking up their differences on the <a href='https://360.here.com/2016/12/30/whats-the-difference-between-a-road-a-street-and-an-avenue/'>internet</a>, I found that Avenue and Street are basically the same thing.

a) Street (for any thoroughfare) 

b) Road (for any thoroughfare) 

c) Way (for major roads - also appropriate for pedestrian routes) 

d) Avenue (for residential roads) 

e) Drive (for residential roads) 

f) Grove (for residential roads) 

g) Lane (for residential roads) 

h) Gardens (for residential roads) subject to there being no confusion with any local open space 

i) Place (for residential roads) 

j) Crescent (for a crescent shaped road) 

k) Court/Close (for a cul-de-sac only) 

l) Square (for a square only) 

m) Hill (for a hillside road only) 

n) Circus (for a large roundabout) 

o) Vale (for residential roads) 

p) Rise (for residential roads) 

q) Row (for residential roads) 

r) Wharf (for residential roads) 

s) Mews (for residential roads) 

t) Mead (for residential roads) 

u) Meadow (for residential roads)

In [None]:
road_encoding = {
    'Street': 0,
    'St': 0,
    'Avenue': 1,
    'Ave': 1,
    'Boulevard': 2,
    'Road': 3,
    'Drive': 4,
    'Lane': 5,
    'Tunnel': 6,
    'Highway': 7,
    'Way': 8,
    'Parkway': 9,
    'Parking': 9,
    'Oval': 10,
    'Square': 11,
    'Place': 12,
    'Bridge': 13,
    'Unknown': 14
}

In [None]:
def encode(x):
    if pd.isna(x):
        return road_encoding['Unknown']
    for road in road_encoding.keys():
        if road in x:
            return road_encoding[road]
        
    return road_encoding['Unknown']

In [None]:
train['EntryType'] = train['EntryStreetName'].parallel_apply(encode)
train['ExitType'] = train['ExitStreetName'].parallel_apply(encode)
test['EntryType'] = test['EntryStreetName'].parallel_apply(encode)
test['ExitType'] = test['ExitStreetName'].parallel_apply(encode)
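
Because `encode` uses substring tests, a short key such as `'St'` also matches names like `'Stanford Avenue'`, which is then labelled as a street. A hedged alternative, assuming that the road type appears as a whole word in the name, is to compare whole tokens and scan from the end of the name. `encode_by_token` and the reduced `road_types` table are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Reduced, illustrative table; the notebook's road_encoding could be used instead.
road_types = {'Street': 0, 'St': 0, 'Avenue': 1, 'Ave': 1, 'Road': 3, 'Unknown': 14}

def encode_by_token(name):
    """Return the code of the last whole word in `name` that is a known road type."""
    if pd.isna(name):
        return road_types['Unknown']
    for word in reversed(name.split()):
        if word in road_types:
            return road_types[word]
    return road_types['Unknown']

print(encode_by_token('Stanford Avenue'))  # matched on the whole word 'Avenue'
print(encode_by_token('Avenue Road'))      # the last matching word wins
```

Scanning from the end reflects the assumption that the type usually follows the name, as in "Stanford Avenue"; names that end in a direction suffix are still handled because non-matching words are skipped.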

In [None]:
train['EqualStreets'] = (train['EntryStreetName']==train['ExitStreetName'])
test['EqualStreets'] = (test['EntryStreetName']==test['ExitStreetName'])

### Latitude and Longitude
<a id='latlon'></a>

In [None]:
plt.figure(figsize=[10,10])
tmp = train[train['City']=='Boston'].groupby(['Latitude', 'Longitude'])['RowId'].count().reset_index()
sns.kdeplot(tmp['Longitude'], tmp['Latitude'])

mplleaflet.display()

In [None]:
cities = train['City'].unique()
scalers_lat = {}
scalers_lon = {}
for city in cities:
    latitudes = np.array(train[train['City']==city]['Latitude']).reshape(-1,1)
    longitudes = np.array(train[train['City']==city]['Longitude']).reshape(-1,1)
    scalers_lat[city] = StandardScaler().fit(latitudes)
    scalers_lon[city] = StandardScaler().fit(longitudes)

In [None]:
# transform returns a (1, 1) array, so [0, 0] extracts the scalar value
train['Latitude'] = train.parallel_apply(lambda row: scalers_lat[row['City']].transform(np.array(row['Latitude']).reshape(1,1))[0, 0], axis=1)

In [None]:
train['Longitude'] = train.parallel_apply(lambda row: scalers_lon[row['City']].transform(np.array(row['Longitude']).reshape(1,1))[0, 0], axis=1)

In [None]:
test['Latitude'] = test.parallel_apply(lambda row: scalers_lat[row['City']].transform(np.array(row['Latitude']).reshape(1,1))[0, 0], axis=1)

In [None]:
test['Longitude'] = test.parallel_apply(lambda row: scalers_lon[row['City']].transform(np.array(row['Longitude']).reshape(1,1))[0, 0], axis=1)

In [None]:
sns.kdeplot(train['Longitude'])
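
The row-wise `parallel_apply` calls in this section invoke a scaler once per row. A sketch of an equivalent vectorised formulation for the training frame, assuming the goal is simply per-city standardisation: `StandardScaler` divides by the population standard deviation, so `ddof=0` is used to match it. The toy frame and its coordinates are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for `train`; the coordinates are invented.
toy = pd.DataFrame({
    'City': ['Boston', 'Boston', 'Boston', 'Atlanta', 'Atlanta'],
    'Latitude': [42.30, 42.35, 42.40, 33.70, 33.80],
})

# Per-city mean and population standard deviation, broadcast back to each row.
grouped = toy.groupby('City')['Latitude']
city_mean = grouped.transform('mean')
city_std = grouped.transform(lambda s: s.std(ddof=0))
scaled = (toy['Latitude'] - city_mean) / city_std

# Cross-check one city against StandardScaler.
boston = toy.loc[toy['City'] == 'Boston', ['Latitude']].to_numpy()
reference = StandardScaler().fit_transform(boston).ravel()
print(np.allclose(scaled[toy['City'] == 'Boston'], reference))
```

For the test set the training means and deviations still have to be reused, for instance through the fitted `scalers_lat` and `scalers_lon`, so this shortcut only replaces the training-side calls.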

## Preprocessing
<a id='prepro'></a>

Let's create a new dataframe with the following new features: TotalTimeStopped, DistanceToFirstStop and Percentile.

Reshaping the data this way, with one row per original row and percentile, lets us use the percentile as a feature, which can help boost the model.

In [None]:
new_train_columns = ['IntersectionId', 'Latitude', 'Longitude', 'EntryStreetName',
       'ExitStreetName', 'EntryHeading', 'ExitHeading', 'Hour', 'Weekend', 'DistanceToFirstStop',
       'Month', 'TotalTimeStopped', 'Percentile', 'City', 'diffHeading', 'EntryType', 'ExitType', 'EqualStreets']

In [None]:
new_test_columns = ['IntersectionId', 'Latitude', 'Longitude', 'EntryStreetName',
       'ExitStreetName', 'EntryHeading', 'ExitHeading', 'Hour', 'Weekend',
       'Month', 'Percentile', 'City', 'diffHeading', 'EntryType', 'ExitType', 'EqualStreets']

In [None]:
new_train = pd.DataFrame(columns=new_train_columns)

In [None]:
new_test = pd.DataFrame(columns=new_test_columns)

In [None]:
for per in [20, 40, 50, 60, 80]:
    new_df = train.copy()
    new_df['TotalTimeStopped'] = new_df['TotalTimeStopped_p'+str(per)]
    new_df['DistanceToFirstStop'] = new_df['DistanceToFirstStop_p'+str(per)]
    new_df['Percentile'] = per  # scalar is broadcast to every row
    new_df.drop(['TotalTimeStopped_p20', 'TotalTimeStopped_p40',
       'TotalTimeStopped_p50', 'TotalTimeStopped_p60', 'TotalTimeStopped_p80',
       'TimeFromFirstStop_p20', 'TimeFromFirstStop_p40',
       'TimeFromFirstStop_p50', 'TimeFromFirstStop_p60',
       'TimeFromFirstStop_p80', 'DistanceToFirstStop_p20',
       'DistanceToFirstStop_p40', 'DistanceToFirstStop_p50',
       'DistanceToFirstStop_p60', 'DistanceToFirstStop_p80', 'RowId'], axis=1,inplace=True)
    new_train = pd.concat([new_train, new_df], sort=True)

In [None]:
for per in [20, 50, 80]:
    new_df = test.copy()
    new_df['Percentile'] = per  # scalar is broadcast to every row
    new_test = pd.concat([new_test, new_df], sort=True)

In [None]:
new_train = pd.concat([new_train.drop('City', axis=1), pd.get_dummies(new_train['City'])], axis=1)

In [None]:
new_test = pd.concat([new_test.drop('City', axis=1), pd.get_dummies(new_test['City'])], axis=1)

In [None]:
new_train = new_train.reindex(sorted(new_train.columns), axis=1)
new_test = new_test.reindex(sorted(new_test.columns), axis=1)

In [None]:
new_test = new_test.sort_values(by=['RowId', 'Percentile'])

In [None]:
X_train = np.array(new_train.drop(['EntryStreetName', 'ExitStreetName', 'IntersectionId', 
                                   'TotalTimeStopped', 'DistanceToFirstStop'], axis=1), dtype=np.float32)
X_test = np.array(new_test.drop(['EntryStreetName', 'ExitStreetName', 'IntersectionId', 
                                 'RowId'], axis=1), dtype=np.float32)

In [None]:
y_train = np.array(new_train[['TotalTimeStopped', 'DistanceToFirstStop']], dtype=np.float32)

## Baseline model
<a id='baseline'></a>

In [None]:
from tensorflow.keras import backend as K
def rmse(y_true, y_pred):
    return K.sqrt(K.mean((y_true-y_pred)**2))

In [None]:
def get_model():
    x = keras.layers.Input(shape=[X_train.shape[1]])
    fc1 = keras.layers.Dense(units=45)(x)
    act1 = keras.layers.PReLU()(fc1)
    bn1 = keras.layers.BatchNormalization()(act1)
    dp1 = keras.layers.Dropout(0.15)(bn1)
    concat1 = keras.layers.Concatenate()([x, dp1])
    fc2 = keras.layers.Dense(units=60)(concat1)
    act2 = keras.layers.PReLU()(fc2)
    bn2 = keras.layers.BatchNormalization()(act2)
    dp2 = keras.layers.Dropout(0.2)(bn2)
    concat2 = keras.layers.Concatenate()([concat1, dp2])
    fc3 = keras.layers.Dense(units=40)(concat2)
    act3 = keras.layers.PReLU()(fc3)
    bn3 = keras.layers.BatchNormalization()(act3)
    dp3 = keras.layers.Dropout(0.2)(bn3)
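    # The layer must be instantiated first and then called on the list of tensors
    concat3 = keras.layers.Concatenate()([concat2, dp3])
    # Linear output: both targets are unbounded regression values, so softmax does not fit
    output = keras.layers.Dense(units=2)(concat3)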
    model = keras.models.Model(inputs=[x], outputs=[output])
    return model

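from tensorflow.keras.callbacks import EarlyStopping
# Assumption: RAdam is never imported in this notebook; tensorflow-addons' RectifiedAdam accepts the same arguments
from tensorflow_addons.optimizers import RectifiedAdam as RAdam

def train_model(X_train, y_train, X_val, y_val, batch_size=1024):
    # batch_size was undefined in the original cell; 1024 is an assumed default
    model = get_model()
    model.compile(optimizer=RAdam(warmup_proportion=0.1, min_lr=1e-7), loss='mse', metrics=[rmse])
    er = EarlyStopping(patience=20, min_delta=1e-4, restore_best_weights=True, monitor='val_loss')
    model.fit(X_train, y_train, epochs=200, callbacks=[er], validation_data=(X_val, y_val), batch_size=batch_size)
    return model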

In [None]:
rkf = RepeatedKFold(n_splits=5, n_repeats=5)

models = []

for tr_idx, vl_idx in rkf.split(X_train, y_train):
    
    x_tr, y_tr = X_train[tr_idx], y_train[tr_idx]
    x_vl, y_vl = X_train[vl_idx], y_train[vl_idx]
    
    model = train_model(x_tr, y_tr, x_vl, y_vl)
    models.append(model)

In [None]:
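# Average across models (axis 0), keeping one (TotalTimeStopped, DistanceToFirstStop) pair per test row
y_pred = np.mean([model.predict(X_test) for model in models], axis=0)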

In [None]:
l = []
for i in range(1920335):
    for j in [0,3,1,4,2,5]:
        l.append(str(i)+'_'+str(j))
sample['TargetId'] = l
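
The order `[0, 3, 1, 4, 2, 5]` follows from how the predictions are laid out. After sorting by `RowId` and `Percentile`, each test row contributes three consecutive prediction rows (p20, p50, p80), each holding `[TotalTimeStopped, DistanceToFirstStop]`, so flattening interleaves the two targets. The sketch below assumes that `submission_metric_map` assigns ids 0 to 2 to `TotalTimeStopped` at p20, p50 and p80, and ids 3 to 5 to `DistanceToFirstStop` at the same percentiles; this should be checked against the loaded JSON. `metric_ids` and `target_ids` are illustrative names.

```python
# Assumed id layout; verify against submission_metric_map before relying on it.
metric_ids = {
    ('TotalTimeStopped', 20): 0, ('TotalTimeStopped', 50): 1, ('TotalTimeStopped', 80): 2,
    ('DistanceToFirstStop', 20): 3, ('DistanceToFirstStop', 50): 4, ('DistanceToFirstStop', 80): 5,
}

def target_ids(n_rows):
    """TargetIds in the order produced by flattening (row, percentile, target)."""
    ids = []
    for row in range(n_rows):
        for per in (20, 50, 80):
            for target in ('TotalTimeStopped', 'DistanceToFirstStop'):
                ids.append(f'{row}_{metric_ids[(target, per)]}')
    return ids

print(target_ids(2))
```

Deriving the order from the mapping rather than hard-coding it makes the dependency on the column order of `y_train` explicit.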

In [None]:
sample['Target'] = y_pred.reshape(-1)

In [None]:
sample['temp_1'] = sample['TargetId'].parallel_apply(lambda x : int(x.split('_')[0]))
sample['temp_2'] = sample['TargetId'].parallel_apply(lambda x : int(x.split('_')[1]))
sample = sample.sort_values(by=['temp_1', 'temp_2'])
del sample['temp_1']
del sample['temp_2']

In [None]:
sample.to_csv('sample_submission.csv', index=False)

In [None]:
submission_metric_map