# **Group 13**:
Evan Garcia, Jacob Ramos, Casey Kwinn, Daniel Cook

# **Project**: Traffic Volume

Traffic is a daily problem that most people face far too often. The average driver in America spends 293 hours behind the wheel each year.

Our group's goal is to predict metro traffic volume based on:
- Hourly weather features
- Temperature
- Holidays
- Date/Time

We will use a dataset that records hourly metro traffic volume under different conditions. We want to fit and compare several regression models to find out which attributes/features have the most noticeable effect on traffic. Some techniques we plan to use are K-Means clustering and cross-validation.
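As a rough illustration of the cross-validation step mentioned above, the sketch below scores a linear regression with 5-fold cross-validation. The data is synthetic and the feature names (`temp`, `hour`) and the target formula are assumptions chosen for illustration; this is not the project's final model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the real traffic dataset)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "temp": rng.uniform(250, 312, n),
    "hour": rng.integers(0, 24, n),
})
# Hypothetical target: volume rises with hour and temperature, plus noise
demo["traffic_volume"] = 100 * demo["hour"] + 5 * demo["temp"] + rng.normal(0, 200, n)

X = demo[["temp", "hour"]]
y = demo["traffic_volume"]

# 5-fold cross-validation: fit on 4 folds, score (R^2) on the held-out fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores.mean())
```

Averaging the per-fold scores gives a less optimistic estimate than scoring on the same rows the model was trained on.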

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from sklearn.cluster import KMeans
df = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
df = df.fillna(0)
df

This is our dataset, which has 9 columns: `holiday`, `temp`, `rain_1h`, `snow_1h`, `clouds_all`, `weather_main`, `weather_description`, `date_time`, and `traffic_volume`. To start our EDA, we first want to find out how many unique values each column has.

In [None]:
df2 = df.copy()  # work on a copy so later changes to df do not leak into df2
df2.nunique(axis=0)

Next we check the count, mean, min, max, and other summary statistics of the dataset, and display them in fixed-point notation instead of scientific notation.

In [None]:
df2.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
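To make the formatting step concrete, here is a small sketch on made-up numbers (the `demo` frame is hypothetical). It contrasts the `'f'` format code used above with `'e'`, which is the one that produces scientific notation.

```python
import pandas as pd

# Tiny made-up frame with one very small and one very large value
demo = pd.DataFrame({"value": [0.000015, 12345678.9]})

# format(x, 'f') renders fixed-point text with six decimal places
as_fixed = demo["value"].apply(lambda x: format(x, "f"))
# format(x, 'e') would render scientific notation instead
as_sci = demo["value"].apply(lambda x: format(x, "e"))

print(as_fixed.tolist())
print(as_sci.tolist())
```

Note that the formatted values are strings, so this is best kept as a display step rather than applied to data that still needs arithmetic.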

To go further in examining the data, we will focus on getting a better understanding of the variables and values.

In [None]:
df2.weather_main.unique()

In [None]:
df2.snow_1h.unique()

In [None]:
df2.holiday.unique()

In [None]:
df2.clouds_all.unique()

In [None]:
df2.weather_description.unique()

While examining the unique values, we noticed that some weather descriptions are nearly identical, such as 'thunderstorm with light drizzle' and 'thunderstorm with drizzle'. This is redundant, so we reclassify some of these descriptions.

In [None]:
def clean_weather_description(row):
    
    simplifythunderstormrain = ['thunderstorm with light rain', 'thunderstorm with rain', 'thunderstorm with heavy rain' ]
    simplifythunderstormdrizzle = ['thunderstorm with drizzle', 'thunderstorm with light drizzle']
    simplifydrizzle = ['light intensity drizzle', 'drizzle', 'heavy intensity drizzle']
    simplifyskyclear = ['sky is clear','Sky is Clear']
    simplifyheavyrain = ['heavy intensity rain', 'very heavy rain']
    simplifylightsnow = ['light shower snow', 'light snow']
    simplifyshowerrain = ['proximity shower rain', 'shower rain', 'light intensity shower rain']
    simplifysquall = ['freezing rain', 'light rain and snow', 'SQUALLS']
    
    if row.weather_description in simplifythunderstormrain:
        return 'thunderstorm with rain'   
    if row.weather_description in simplifythunderstormdrizzle:
        return 'thunderstorm with drizzle' 
    if row.weather_description in simplifydrizzle:
        return 'drizzle' 
    if row.weather_description in simplifyskyclear:
        return 'clear'
    if row.weather_description in simplifyheavyrain:
        return 'heavy rain'
    if row.weather_description in simplifylightsnow:
        return 'light snow'
    if row.weather_description in simplifyshowerrain:
        return 'shower rain'
    if row.weather_description in simplifysquall:
        return 'squall'
    return row.weather_description


def clean_df(df_in):
    """Return a copy of df_in with a reclassified 'weather_description' column."""
    df_cleaned = df_in.copy()
    df_cleaned['weather_description'] = df_cleaned.apply(clean_weather_description, axis=1)
    return df_cleaned


# Get df with reclassified 'weather_description' column
df_cleaned = clean_df(df2)
df_cleaned.weather_description.unique()

As shown, we merged several redundant descriptions into a single label.
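The same reclassification could also be expressed as a lookup table instead of a chain of `if` statements. The sketch below is one possible alternative, shown on a made-up sample with only a few of the groups; `weather_map` and `sample` are hypothetical names.

```python
import pandas as pd

# Map each redundant label to the label it should collapse into
# (only a few of the groups, for brevity)
weather_map = {
    "thunderstorm with light drizzle": "thunderstorm with drizzle",
    "light intensity drizzle": "drizzle",
    "heavy intensity drizzle": "drizzle",
    "sky is clear": "clear",
    "Sky is Clear": "clear",
}

# Small made-up sample standing in for the weather_description column
sample = pd.Series(["Sky is Clear", "heavy intensity drizzle", "mist"])

# Series.replace leaves values that are not in the mapping untouched
cleaned = sample.replace(weather_map)
print(cleaned.tolist())
```

A dictionary keeps all the groupings in one place and avoids a row-by-row `apply`, which may matter on a dataset with tens of thousands of rows.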

Going through the dataset, we realized that the rain and snow 1-hour columns (`rain_1h` and `snow_1h`) carry very little information in the cluster graphs we made, because they are equal to 0 most of the time. We will therefore drop these columns.
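One way to check the "mostly zero" observation before dropping anything is to compute the share of zero rows per column. This is a minimal sketch on a made-up frame (`demo` is hypothetical); the real columns would be substituted in.

```python
import pandas as pd

# Made-up stand-in for the rain_1h and snow_1h columns
demo = pd.DataFrame({
    "rain_1h": [0.0, 0.0, 0.25, 0.0],
    "snow_1h": [0.0, 0.0, 0.0, 0.0],
})

# Comparing to 0 gives booleans; their mean is the fraction of zero rows
zero_share = (demo == 0).mean()
print(zero_share)
```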

In [None]:
df_cleaned = df_cleaned.drop(columns=['rain_1h', 'snow_1h'])
df_cleaned.nunique(axis=0)

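The next step in the EDA is to remove some outliers, in the `temp` column and others. We keep only rows with a temperature between 250 and 312 (the dataset records temperature in Kelvin), a traffic volume above 1000, and a timestamp after 2013.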

In [None]:
df_cleaned = df_cleaned[df_cleaned['temp'].between(250.00, 312.00)]
df_cleaned = df_cleaned[df_cleaned['traffic_volume'] > 1000]
df_cleaned = df_cleaned[df_cleaned['date_time'] > '2013-12-31 23:00:00']

column_to_move = df_cleaned.pop('traffic_volume')
df_cleaned.insert(0, 'traffic_volume', column_to_move)
df_cleaned[['date' ,'time']] = df_cleaned.date_time.str.split(expand=True)
df_cleaned = df_cleaned.drop('date_time', axis=1)

df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
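The cell above splits `date_time` as text. As a possible alternative, the sketch below parses the same text format into timestamps and derives numeric features. The rows are made up, and the `hour` and `day_of_week` column names are assumptions for illustration.

```python
import pandas as pd

# Made-up timestamps in the same 'YYYY-MM-DD HH:MM:SS' text format
demo = pd.DataFrame({
    "date_time": ["2014-01-01 08:00:00", "2014-01-04 17:00:00"],
    "traffic_volume": [5200, 3100],
})

# Parse the text into real timestamps
parsed = pd.to_datetime(demo["date_time"])

# Derive numeric features that a model can use directly
demo["hour"] = parsed.dt.hour
demo["day_of_week"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
print(demo[["hour", "day_of_week"]])
```

Numeric hour and weekday columns can be fed to regression or clustering models, which a text `time` column cannot.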

Here is the cleaned dataset that we will work with in the main part.

In [None]:
df_cleaned

Now that the dataset has been cleaned, we can perform some basic exploratory data analysis to see what we're dealing with. We can start with some bar graphs showing the relationship between average traffic volume and the other variables in the dataset.

To start, holidays seem to have, on average, significantly lower traffic volume than non-holidays. Although this makes sense, the dataset contains only a few holiday entries, so the sample size is small.

In [None]:
df_pt = df_cleaned.pivot_table(values='traffic_volume',index='holiday',aggfunc=np.mean)
df_pt.plot(kind='bar')
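Because the holiday sample is small, it may help to show the row count next to each mean. The sketch below does this on made-up rows; the `'None'` label for non-holiday hours is an assumption about how the column is encoded.

```python
import pandas as pd

# Made-up rows; 'None' marks a non-holiday hour (an assumption about the data)
demo = pd.DataFrame({
    "holiday": ["None", "None", "None", "Labor Day"],
    "traffic_volume": [4000, 5000, 3000, 1500],
})

# Report the mean together with the number of rows behind it
summary = demo.groupby("holiday")["traffic_volume"].agg(["mean", "count"])
print(summary)
```

A mean backed by a single row should be read with much more caution than one backed by thousands.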

Next, it appears that temperature and cloud coverage have little effect on traffic volume, although there does seem to be a small correlation between lower temperature and lower traffic volume on average.

In [None]:
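df_pt2 = df_cleaned.copy()  # copy so the helper column is not added to df_cleaned
# 5-degree bins covering the full filtered temperature range (250 to 312)
df_pt2['temp_range'] = pd.cut(x=df_pt2['temp'], bins=list(range(250, 320, 5)), include_lowest=True)
df_pt2.pivot_table(values='traffic_volume',index='temp_range',aggfunc=np.mean).plot(kind='bar')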

In [None]:
df_pt3 = df_cleaned.copy()  # copy so the helper column is not added to df_cleaned
# 10-point bins covering the full 0 to 100 cloud-cover range
df_pt3['clouds_range'] = pd.cut(x=df_pt3['clouds_all'], bins=list(range(0, 110, 10)), include_lowest=True)
#df_pt3['clouds_range'].value_counts()
df_pt3.pivot_table(values='traffic_volume',index='clouds_range',aggfunc=np.mean).plot(kind='bar')
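To put a number on a weak relationship like the one described for temperature, a Pearson correlation coefficient can be computed alongside the bar charts. The sketch below uses synthetic data in which volume depends only weakly on temperature; the numbers are invented and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: volume depends only weakly on temperature
rng = np.random.default_rng(1)
temp = rng.uniform(250, 312, 1000)
volume = 3000 + 5 * (temp - 281) + rng.normal(0, 1500, 1000)
demo = pd.DataFrame({"temp": temp, "traffic_volume": volume})

# Pearson correlation: near 0 means a weak linear relationship, near +/-1 a strong one
r = demo["temp"].corr(demo["traffic_volume"])
print(round(r, 3))
```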

In [None]:
df_pt4 = df_cleaned.pivot_table(values='traffic_volume',index='weather_main',aggfunc=np.mean)
df_pt4.plot(kind='bar')

As for weather, there tends to be lower traffic volume during light and heavy snowfall, whereas traffic volume seems mostly unaffected by other weather patterns.

In [None]:
df_pt5 = df_cleaned.pivot_table(values='traffic_volume',index='weather_description',aggfunc=np.mean)
df_pt5.plot(kind='bar')

Finally, time of day has a very obvious impact on traffic volume. Late hours of the night have much lower traffic volume, whereas working hours have much higher volume, with peaks occurring at the start and end of the workday.

In [None]:
df_pt6 = df_cleaned.pivot_table(values='traffic_volume',index='time',aggfunc=np.mean)
df_pt6.plot(kind='bar')

## **Main Part**

For our project, we are going to use K-Means clustering on our dataset to find clusters that show which attributes have the most influence on metropolitan traffic. (Just testing and experimenting with the data.) **Ignore for now, still cleaning dataset**

In [None]:
test_df = df_cleaned

In [None]:
x = df[['temp', 'traffic_volume']].copy()
x

In [None]:
kmean = KMeans(n_clusters=3)
# fit_predict fits the model and returns each row's cluster label in one call
identified_clusters = kmean.fit_predict(x)
identified_clusters

In [None]:
data_with_clusters = df.copy()
data_with_clusters['Clusters'] = identified_clusters 
plt.scatter(data_with_clusters['temp'],data_with_clusters['traffic_volume'],c=data_with_clusters['Clusters'],cmap='rainbow')
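K-Means groups points by Euclidean distance, so a column with a much larger numeric range (traffic volume, in the thousands) can dominate one with a smaller range (temperature, spanning a few dozen units). The sketch below shows one common remedy, standardizing the features first. It uses synthetic data, and the value ranges are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the temp / traffic_volume columns
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "temp": rng.uniform(250, 312, 300),
    "traffic_volume": rng.uniform(0, 7000, 300),
})

# Rescale each column to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(demo)

# Cluster on the scaled features; random_state makes the run repeatable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
demo["cluster"] = kmeans.fit_predict(scaled)
print(demo["cluster"].value_counts())
```

Whether scaling is appropriate depends on the question being asked; it is shown here only as an option to compare against the unscaled clusters.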

This is just to see whether an elbow test can work.

In [None]:
wcss = []
for i in range(1,7):
    kmean = KMeans(i)
    kmean.fit(x)
    wcss_iter = kmean.inertia_
    wcss.append(wcss_iter)

number_clusters = range(1,7)
plt.plot(number_clusters,wcss)
plt.title('Elbow Method: WCSS vs. Number of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
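For reference, WCSS (within-cluster sum of squares) is the total squared distance from each point to the centroid of its own cluster, which is what scikit-learn exposes as `inertia_`. The sketch below recomputes it by hand on synthetic blobs; the blob centers are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated 2-D blobs
rng = np.random.default_rng(3)
points = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in ([0, 0], [5, 5], [10, 0])
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# WCSS: squared distance from each point to the centroid of its own cluster
own_centroids = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = ((points - own_centroids) ** 2).sum()

print(wcss_manual, kmeans.inertia_)
```

WCSS always decreases as clusters are added, so the elbow plot looks for the point where the decrease levels off rather than for a minimum.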

We did temperature and traffic; now we do clouds and traffic.

In [None]:
y = df[['clouds_all', 'traffic_volume']].copy()
y

In [None]:
kmean2 = KMeans(n_clusters=3)
# fit_predict fits the model and returns each row's cluster label in one call
identified_clusters2 = kmean2.fit_predict(y)
identified_clusters2

In [None]:
data_with_clusters2 = df.copy()
data_with_clusters2['Clusters'] = identified_clusters2 
plt.scatter(data_with_clusters2['clouds_all'],data_with_clusters2['traffic_volume'],c=data_with_clusters2['Clusters'],cmap='rainbow')

In [None]:
#df['traffic_volume'] = df['traffic_volume'].astype(float)
#kmeans = KMeans(n_clusters=3, random_state=0).fit(df['traffic_volume'])

# add the cluster labels to the dataframe
#df['traffic_level'] = kmeans.labels_
from sklearn.cluster import KMeans

# reshape the data to have two dimensions
data = np.array(df['traffic_volume']).reshape(-1, 1)
sorted_data = np.sort(df['traffic_volume'])
initial_centroids = np.array([
    [sorted_data[int(0 * len(sorted_data) / 100)]],
    [sorted_data[int(50 * len(sorted_data) / 100)]],
    [sorted_data[int(100 * (len(sorted_data)-1) / 100)]]
])

# fit KMeans model with specified initial centroids
# n_init=1 because explicit starting centroids only need a single run
kmeans = KMeans(n_clusters=3, init=initial_centroids, n_init=1).fit(data)

# add the cluster labels to the dataframe
df['traffic_level'] = kmeans.labels_
print(df[['traffic_level', 'traffic_volume']])
# sort the DataFrame by 'traffic_volume' column
df = df.sort_values('traffic_volume')

# normalize the 'traffic_volume' column between 0 and 1
df['normalized_traffic_volume'] = (df['traffic_volume'] - df['traffic_volume'].min()) / (df['traffic_volume'].max() - df['traffic_volume'].min())

# plot the line chart
plt.plot(df['normalized_traffic_volume'], df['traffic_level'])
plt.xlabel('Normalized Traffic Volume')
plt.ylabel('Traffic Level')
plt.title('Traffic Level vs. Normalized Traffic Volume')
plt.show()
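The idea behind seeding the centroids at the minimum, median, and maximum can be illustrated on a toy example. In one dimension, sorted starting centroids should stay sorted as long as no cluster becomes empty, so labels 0, 1, 2 can be read as low, medium, high. The volumes and the `names` mapping below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic volumes forming three loose groups (made-up numbers)
volumes = np.array([200, 350, 500, 2800, 3000, 3300, 5900, 6200, 6500], dtype=float)
data = volumes.reshape(-1, 1)

# Start the centroids at the minimum, median, and maximum volume
init = np.array([[volumes.min()], [np.median(volumes)], [volumes.max()]])
kmeans = KMeans(n_clusters=3, init=init, n_init=1).fit(data)

# With sorted starting centroids, label 0 is expected to be the lowest group
names = {0: "low", 1: "medium", 2: "high"}
levels = pd.Series(kmeans.labels_).map(names)
print(levels.tolist())
```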

In [None]:
rain = df[['rain_1h', 'traffic_volume']].copy()
rain

In [None]:
kmean3 = KMeans(n_clusters=3)
# fit_predict fits the model and returns each row's cluster label in one call
identified_clusters3 = kmean3.fit_predict(rain)
data_with_clusters3 = df.copy()
data_with_clusters3['Clusters'] = identified_clusters3
plt.scatter(data_with_clusters3['rain_1h'],data_with_clusters3['traffic_volume'],c=data_with_clusters3['Clusters'],cmap='rainbow')

In [None]:
snow = df[['snow_1h', 'traffic_volume']].copy()
snow

In [None]:
kmean4 = KMeans(n_clusters=3)
# Cluster on the snow data (the original predicted on `rain` by mistake)
identified_clusters4 = kmean4.fit_predict(snow)
data_with_clusters4 = df.copy()
data_with_clusters4['Clusters'] = identified_clusters4
plt.scatter(data_with_clusters4['snow_1h'],data_with_clusters4['traffic_volume'],c=data_with_clusters4['Clusters'],cmap='rainbow')

In [None]:
time = df[['date_time', 'traffic_volume']].copy()
time