Introduction
This dataset contains information about traffic congestion at intersections in major US cities. This notebook attempts to build a model that accurately predicts future congestion. The notebook is structured as follows:

* Exploratory Analysis: This stage explores the data, renames labels as appropriate and determines what kind of pre-processing is needed (whether there are empty data points, the distribution of the data, the entropy of each feature).
* Preprocessing: This stage pre-processes the data into a form from which the model can make accurate predictions.
* Algorithm selection and implementation
* Model evaluation and optimization stage
* Custom model creation (optional)

In [None]:
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import folium
from folium.plugins import HeatMap

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# Setting up BigQuery library
PROJECT_ID = 'bigquery-ml-geotab'
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
from google.cloud import storage
storage_client = storage.Client(project=PROJECT_ID)
from google.cloud import automl_v1beta1 as automl
automl_client = automl.AutoMlClient()
from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

# Load the BigQuery magic commands
%load_ext google.cloud.bigquery

Exploratory Analysis


In [None]:
# Read data
df_train = pd.read_csv('/kaggle/input/bigquery-geotab-intersection-congestion/train.csv')
df_test = pd.read_csv('/kaggle/input/bigquery-geotab-intersection-congestion/test.csv')

In [None]:
df_train.info()

Our test data, however, has a different shape from our training data:

In [None]:
df_test.info()
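
To see exactly which columns the two files do not share, a small helper along these lines could be used. This is a minimal sketch: `column_difference` is a hypothetical helper name, and the toy frames and their `Target` column are made up for illustration. In the notebook it would be called with `df_train` and `df_test`.

```python
import pandas as pd


def column_difference(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return the column names that appear in only one of the two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
    }


# Toy frames for illustration only (hypothetical columns, not the real schema).
toy_train = pd.DataFrame({"RowId": [1, 2], "City": ["A", "B"], "Target": [0.5, 1.5]})
toy_test = pd.DataFrame({"RowId": [3], "City": ["A"]})

diff = column_difference(toy_train, toy_test)
print(diff)
```

Columns that exist only in the training set are typically the prediction targets, so this list is a quick way to confirm what the model is expected to output.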

Training data has 27 features in total while testing data has 13. Some are labeled as objects since they hold string data, while the rest are numerical. Next, let's see which data is missing, and how much. We'll start with numerical data:

In [None]:
df_train.isnull().sum()

There are only two numerical features that have missing values. We'll have to fill in these values later, in the preprocessing stage. Now let's see how much categorical data is missing:

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
obj_df = df_train.select_dtypes(include=['object'])
# Keep only the rows with at least one missing value, then count the non-null entries per column
obj_df[obj_df.isnull().any(axis=1)].count()

Looks like EntryHeading, ExitHeading, Path and City have the same number of null values. For each of these columns, that number represents about 1.5% of our training data. This seems like a small slice of our data, but we can't guarantee that ignoring these rows won't affect the final outcome. We'll try to fill in these values later as well.
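
As a cross-check on these numbers, it may help to count missing values per column directly. In pandas, `count()` reports non-null entries, so calling it on rows filtered with `isnull().any(axis=1)` shows how many values are present in the incomplete rows, not how many are missing. The sketch below uses `isnull().sum()` instead. `missing_summary` is a hypothetical helper, and the toy frame (including the `StreetName` column) is illustrative only.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": 100 * n_missing / len(df),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data for illustration only; "StreetName" is a hypothetical column.
toy = pd.DataFrame({
    "EntryHeading": ["N", "S", "E", "W"],
    "StreetName": ["Main St", None, "Oak Ave", None],
})

result = missing_summary(toy)
print(result)
```

In the notebook, `missing_summary(df_train.select_dtypes(include=['object']))` would restrict the check to the categorical columns.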

To gain a better understanding of the data, let's look at the distribution of each feature. We will start with the row count per city, since that is the easiest to digest.
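
One way to compute this per-city count is sketched below. It assumes only that the frame has a `City` column; `rows_per_city` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd


def rows_per_city(df: pd.DataFrame, city_col: str = "City") -> pd.Series:
    """Number of rows recorded for each city, sorted by city name."""
    return df[city_col].value_counts().sort_index()


# Toy data for illustration only, not real counts.
toy = pd.DataFrame({"City": ["Philadelphia", "Chicago", "Philadelphia"]})

counts = rows_per_city(toy)
print(counts)
```

In the notebook, `rows_per_city(df_train).plot.bar()` would draw the corresponding bar chart.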

We can clearly see that Philadelphia has the highest row count of all cities. The data is therefore unevenly distributed, and this could affect our models. However, a total count per city doesn't tell us much more, so let's count the number of unique intersection IDs per city instead. IntersectionId is the ID given to each intersection where traffic data is being measured.

In [None]:
# Number of unique intersection IDs per city (plot.bar() returns a matplotlib Axes)
ax = df_train.groupby('City')['IntersectionId'].nunique().sort_index().plot.bar()
ax.set_title('# of intersections per city in train set', fontsize=15)
ax.set_ylabel('# of intersections', fontsize=15)
ax.set_xlabel('City', fontsize=17);

Interesting: although Philadelphia has the highest row count, Chicago has the highest number of unique intersections. Does this mean that Chicago has more traffic than Philly? Not necessarily. Chicago can have more intersections while Philadelphia has more traffic in total.
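
To put the two counts side by side, the row count and the number of unique intersections can be computed in one pass, together with the average number of rows per intersection. This is a minimal sketch assuming the `City` and `IntersectionId` columns used above. `city_coverage` is a hypothetical helper and the toy values are illustrative, not results.

```python
import pandas as pd


def city_coverage(df: pd.DataFrame) -> pd.DataFrame:
    """Rows, unique intersections and rows per intersection for each city."""
    coverage = df.groupby("City").agg(
        rows=("IntersectionId", "size"),
        intersections=("IntersectionId", "nunique"),
    )
    coverage["rows_per_intersection"] = coverage["rows"] / coverage["intersections"]
    return coverage


# Toy data for illustration only.
toy = pd.DataFrame({
    "City": ["Philadelphia"] * 4 + ["Chicago"] * 3,
    "IntersectionId": [1, 1, 1, 2, 10, 11, 12],
})

coverage = city_coverage(toy)
print(coverage)
```

A city with many rows but few intersections is being sampled more densely per intersection, which is a statement about data volume and not necessarily about traffic.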

To explore this, let's look at how the records are distributed by hour of day and by month:

In [None]:
# Distribution of records by hour of day and by month, split by city
plt.figure(figsize=(15,12))

plt.subplot(211)
g = sns.countplot(x="Hour", data=df_train, hue='City', dodge=True)
g.set_title("Distribution by hour and city", fontsize=20)
g.set_ylabel("Count",fontsize= 17)
g.set_xlabel("Hours of Day", fontsize=17)
# Leave 15% headroom above the tallest bar
g.set_ylim(0, max(p.get_height() for p in g.patches) * 1.15)

plt.subplot(212)
g1 = sns.countplot(x="Month", data=df_train, hue='City', dodge=True)
g1.set_title("Distribution by month and city", fontsize=20)
g1.set_ylabel("Count", fontsize=17)
g1.set_xlabel("Months", fontsize=17)
# Leave 15% headroom above the tallest bar
g1.set_ylim(0, max(p.get_height() for p in g1.patches) * 1.15)

plt.subplots_adjust(hspace = 0.3)

plt.show()

Again, Philly comes out on top in the count of traffic records. However, having more data is still not enough to support the theory that Philly has more traffic. We could check this assumption by seeing how much actual stopping there is in Philly traffic compared with other cities' traffic.
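
Such a check could be sketched as a per-city summary of a stopped-time measure. The column name `TotalTimeStopped_p50` is an assumption about the training file and does not appear in the cells above, so it is exposed as a parameter. `stopped_time_by_city` is a hypothetical helper, and the toy values are made up for illustration, not results.

```python
import pandas as pd


def stopped_time_by_city(df: pd.DataFrame,
                         stop_col: str = "TotalTimeStopped_p50") -> pd.DataFrame:
    """Mean and median of a stopped-time column per city, highest mean first.

    stop_col is an assumed column name; pass whichever stopped-time
    column the training data actually provides.
    """
    return (
        df.groupby("City")[stop_col]
        .agg(["mean", "median"])
        .sort_values("mean", ascending=False)
    )


# Toy data for illustration only.
toy = pd.DataFrame({
    "City": ["Philadelphia", "Philadelphia", "Chicago", "Chicago"],
    "TotalTimeStopped_p50": [0.0, 10.0, 20.0, 40.0],
})

stops = stopped_time_by_city(toy)
print(stops)
```

Because this summary is an average per record, it is not inflated by a city simply having more rows, which is the distinction the paragraph above is after.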