#### Project guidelines 

https://github.com/suneman/socialdata2021/wiki/Final-Project

# Motivation.


### What is your dataset?
* Hourly traffic data for some of Copenhagen's busiest roads. The data were collected between 2005 and 2014 and are available on the [Copenhagen Open Traffic Data page](https://www.opendata.dk/city-of-copenhagen/faste-trafiktaellinger#resource-faste_trafikt%C3%A6llinger_2008.xlsx)
* We merged the yearly files into one dataset, stored [here](https://github.com/LarsBryld/socialdata/blob/main/cph_traffic_2005-2009_original.csv) and [here](https://github.com/LarsBryld/socialdata/blob/main/cph_traffic_2010-2014_original.csv) (GitHub does not accept uploads larger than 25MB, so we had to split the dataset into 2 separate files)

* The weather data have been downloaded from the [Danish Meteorological Institute archive](https://www.dmi.dk/vejrarkiv/)

* Copenhagen GeoJson polygons (districts maps) have been downloaded from [here](https://giedriusk.carto.com/tables/copenhagen_districts/public)


### Why did you choose this/these particular dataset(s)?
We were interested in:
   - describing CPH traffic flows over time and space
   - identifying patterns in the data that allow traffic volumes to be classified by road
   - combining Copenhagen traffic data with weather data to try to predict traffic volumes over time and space

### What was your goal for the end user's experience?
Building a tool that allows easy visualization of Copenhagen traffic volumes/flows across time and space





# Basic stats. Let's understand the dataset better

### Traffic dataset stats

The original dataset has **183k rows** and **30 columns**; after cleaning, preprocessing and transformation it grows to about 1.4 million rows.
The original columns contain the following information:
   - Vej-Id: these markers end with the values "T", "+" or "-":
        - "T" is the Total number of vehicles detected for each hour/road/detection point;
        - "+" are the vehicles moving in the direction of increasing house numbers (1, 2, 3, ...). We assumed that house numbering starts at the city center and increases with the distance from it, so intuitively these vehicles should be the ones leaving the city. The CPH traffic data owners confirmed that this is often, but not always, true; our visualizations show that it holds for most roads;
        - "-" should then be the vehicles entering the city (vehicles that physically move "against" the house numbering);
   - Vejnavn contains the Road Names;
   - UTM geographical coordinates of the traffic detection points;
   - Date of detection;
   - Hour of detection: there are 24 traffic columns (one per hour) for each date row;
  
### Our choices in data cleaning, preprocessing and data transformation
  
The original data have been transformed in the following ways:
   - Vej-Id data have been manipulated to isolate only the last element: "T", "+", "-";
   - UTM coordinates have been transformed into Latitude and Longitude coordinates;
   - hourly column data have been converted into rows. This has increased the number of rows by a factor of 24, to around 4.4m rows;
   - Vej-Id markers have been used to move "Leaving" and "Entering" traffic data from rows to columns. This has allowed us to create 3 new features: Leaving vehicles, Entering vehicles and "Net Traffic Flow", which are very useful for showing hourly traffic patterns and for our ML classification tool. This transformation has of course reduced the number of rows to one third, around 1.45m rows;
   - we have added random noise to Latitude and Longitude and stored the result in 2 new columns ("Lat_rand", "Lon_rand"). This was necessary to facilitate spatial data visualizations;
   - finally, we have created new features to visualize the data by different timeframes: daily, weekly, yearly, etc.;
   - as part of our data preprocessing we have also deleted all the empty or otherwise irrelevant columns of data.
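
The two reshaping steps listed above (hours from columns to rows, then directions from rows to columns) can be sketched on a toy table. The column names `road`, `direction`, `date` and all values are made up for illustration, and the sketch uses `pivot_table` rather than the index alignment used in the notebook code, so it is an alternative illustration rather than the actual pipeline.

```python
import pandas as pd

# Toy wide table: one row per road/direction/date, one column per hour.
# Only two hour columns are shown; all names and values are illustrative.
wide = pd.DataFrame({
    "road": ["A", "A", "A"],
    "direction": ["T", "+", "-"],
    "date": ["2008-01-01"] * 3,
    "00": [10, 6, 4],
    "01": [8, 3, 5],
})

# Step 1: hours from columns to rows (rows multiply by the number of hour columns).
long = wide.melt(id_vars=["road", "direction", "date"],
                 var_name="hour", value_name="vehicles")

# Step 2: directions from rows to columns (rows shrink to one third).
flows = (long.pivot_table(index=["road", "date", "hour"],
                          columns="direction", values="vehicles")
             .rename(columns={"T": "total", "+": "leaving", "-": "entering"})
             .reset_index())

# Net flow follows the notebook's convention: entering minus leaving.
flows["net_flow"] = flows["entering"] - flows["leaving"]
print(flows)
```

With 24 hour columns instead of 2, step 1 multiplies the row count by 24 and step 2 divides it by 3, which matches the 183k to 4.4m to 1.45m progression described above.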


### Do we need to add info about Maps/Weather data here???


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import utm
import folium
import json

Downloading data and removing/adding features

In [None]:
# Downloading faste-trafiktaellinger-2008_clean (to be changed if we get data directly form website)
#df = pd.read_csv("C:/Users/User/Dropbox/DTU/02806 Social data analysis and visualization/cph_traffic_2005-2014_original.csv",
#                 parse_dates = ['Dato'],encoding='ISO-8859-1')

df1 = pd.read_csv("https://raw.githubusercontent.com/LarsBryld/socialdata/main/cph_traffic_2005-2009_original.csv",
                 parse_dates = ['Dato'],encoding='ISO-8859-1')
df2 = pd.read_csv("https://raw.githubusercontent.com/LarsBryld/socialdata/main/cph_traffic_2010-2014_original.csv",
                 parse_dates = ['Dato'],encoding='ISO-8859-1')

# ignore_index avoids duplicate row labels coming from the two source files
df = pd.concat((df1, df2), ignore_index=True)

# cleaning Vej-Id for more clear traffic directions
df['Vej-Id'] = df['Vej-Id'].str.split(n=4).str[-1]

#change the Hours headers
for i in range(7,31):
    df = df.rename(columns={df.columns[i]: df.columns[i].split('.')[1].split('-')[0]})
    df[df.columns[i]] = df[df.columns[i]].str.replace(',', '').fillna(0).astype('float')

### converting UTM coordinates into Latitude/Longitude using the "utm" library: https://pypi.org/project/utm/
# first we create a function that converts one row of UTM zone 32 coordinates
# (positional access via .iloc, because the row is labelled by column name)
def uf(x):
    return utm.to_latlon(x.iloc[0], x.iloc[1], 32, 'T')
# then we apply this function to the UTM coordinates in the file
#df['LatLon'] = df[['Easting','Northing']].apply(uf, axis=1)
df[['Lat', 'Lon']] = pd.DataFrame(df[['(UTM32)','(UTM32).1']].apply(uf, axis=1).tolist(), index=df.index)

# removing the unwanted columns
df = df.drop(columns = ['Unnamed: 0','Spor','(UTM32)','(UTM32).1'])

# converting hours data columns into rows
df = df.melt(id_vars=["Vej-Id","Vejnavn","Dato","Lat","Lon"],
        value_vars=['00','01','02','03','04','05','06','07','08','09','10','11','12',
                    '13', '14','15', '16','17', '18','19','20','21','22','23'],
        var_name="Hour", 
        value_name="Vehicles")

### moving rows data for Vehicles Entering the City, Leaving the City and Net Traffic Flows into columns
# Selecting only Entering Vehicles (and creating a unique index)
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'index' column
df_ent = df[df['Vej-Id'] == '-'].copy()
df_ent['index'] = df_ent['Vejnavn'] + df_ent['Dato'].dt.strftime('%Y-%m-%d') + df_ent['Hour']
df_ent = df_ent.set_index('index')
# Selecting only Leaving Vehicles (and creating a unique index)
df_ex = df[df['Vej-Id'] == '+'].copy()
df_ex['index'] = df_ex['Vejnavn'] + df_ex['Dato'].dt.strftime('%Y-%m-%d') + df_ex['Hour']
df_ex = df_ex.set_index('index')
# Selecting only Total Vehicles on the roads (and creating a unique index)
df = df[df['Vej-Id'] == 'T'].copy()
df['index'] = df['Vejnavn'] + df['Dato'].dt.strftime('%Y-%m-%d') + df['Hour']
df = df.set_index('index')
# adding columns for Vehicles Entering the City, Leaving and Net Traffic Flows
df['Entering Vehicles'] = df_ent['Vehicles']
df['Leaving Vehicles'] = df_ex['Vehicles']
df['Net Traffic Flow'] = df['Entering Vehicles'] - df['Leaving Vehicles']
# renaming the Total Vehicles column
df = df.rename(columns={"Vehicles": "Total Vehicles"})

# randomizing Latitude and longitude points
mu, sigma1 = 0, 0.0015
mu, sigma2 = 0, 0.003
noise1 = np.random.normal(mu, sigma1, [len(df),1])
noise2 = np.random.normal(mu, sigma2, [len(df),1]) 
df[['Lat_rand']] = df[['Lat']] + noise1
df[['Lon_rand']] = df[['Lon']] + noise2

# Add Day of the Week, Day, Week, Month, Year
df["DayName"] = df['Dato'].dt.day_name()
df["WeekDay"] = df['Dato'].dt.weekday
df["DayOfMonth"] = df['Dato'].dt.day
# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
df["Week"] = df['Dato'].dt.isocalendar().week
df["Month"] = df['Dato'].dt.month
df["Year"] = df['Dato'].dt.year

# removing the 'Vej-Id' columns to avoid confusion (now all traffic data are in the columns for "-" , "+" and "T")
df = df.drop(columns = ['Vej-Id'])

#df.head(10)

## Key points/plots from our exploratory data analysis

### Total traffic distribution by Road


Below is the total number of vehicles recorded on each Road over the whole period, in descending order.
Both the list and the plot below show significant differences in traffic volumes among the roads available in the dataset.

In [None]:
totcount = df.groupby('Vejnavn')['Total Vehicles'].sum().sort_values(ascending=False)
pd.DataFrame(totcount.values, index = list(totcount.index), columns =['Total Vehicles']) 

In [None]:
pd.DataFrame(totcount.values, index = list(totcount.index), columns =['Vehicles']).plot.bar();

## Traffic distribution over time


### Monthly distribution of total vehicles (All roads)

The main observable pattern is the drop in traffic in July and December, which is probably due to the Danish holiday season.

In [None]:
df.groupby('Month')['Total Vehicles'].sum().plot.bar();

### Monthly distribution per Road (Total vehicles)

* below we plot the monthly distribution of total vehicles per road. The following roads show some monthly drops that could be due to data quality issues (missing data?):
    - Englandsvej
    - Hareskovvej
    - Roskildevej
    - Vejlands Allé
    - Åboulevard

In [None]:
# select the column before aggregating, so non-numeric columns are never summed
m = df.groupby(["Month", "Vejnavn"])["Total Vehicles"].sum().unstack()
#m

In [None]:
m.plot(kind='bar', subplots=True, figsize=(15,60), layout=(9,4));


### Weekly distribution of total vehicles (All roads)

#### The main observable pattern is the drop in total traffic at the weekend

In [None]:
df.groupby('WeekDay')['Total Vehicles'].sum().plot.bar();

### Weekly distribution per Road (Total vehicles)
#### The plots below show 2 main exceptions to the usual weekend drop in traffic:
* **Kalvebod Brygge**, where the drop happens on Mondays and Tuesdays. This could be due to a data quality issue: we need to check whether it is still the case when we include all years (now we are only working with 2008 data)
* **Jagtvej**, which shows a much smaller weekend drop than other roads


In [None]:
w = df.groupby(["WeekDay", "Vejnavn"])["Total Vehicles"].sum().unstack()
#w

In [None]:
w.plot(kind='bar', subplots=True, figsize=(15,60), layout=(9,4));


### Day of Month distribution of total vehicles (All roads)

The pattern that is easily observed is that the 31st day of the month shows around half the traffic volume of the other days of the month. This is of course because only 7 months out of 12 have 31 days.
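
The size of this calendar effect can be checked without any traffic data. The sketch below simply counts how often each day-of-month number occurs between 2005 and 2014; dividing each day's total by these counts (or, as an assumption about how one might adapt the plot, using a mean instead of a sum) would remove the artefact.

```python
from collections import Counter
from datetime import date, timedelta

# Count how often each day-of-month number occurs in 2005-2014 (inclusive).
counts = Counter()
day = date(2005, 1, 1)
while day <= date(2014, 12, 31):
    counts[day.day] += 1
    day += timedelta(days=1)

# Days 1-28 occur in every month; 29, 30 and 31 occur less often.
print(counts[1], counts[29], counts[30], counts[31])  # 120 112 110 70
```

Day 31 occurs 70 times against 120 for days 1 to 28, so on the calendar alone its total would be about 58% of a typical day's total.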

In [None]:
df.groupby('DayOfMonth')['Total Vehicles'].sum().plot.bar()

### Day of the month distribution per Road (Total vehicles)

* when plotting the day-of-the-month distribution for each road we notice that a few roads show strange patterns that could signal some data quality issues:
    - Hareskovvej
    - Roskildevej

In [None]:
d = df.groupby(["DayOfMonth", "Vejnavn"])["Total Vehicles"].sum().unstack()
#d

In [None]:
d.plot(kind='bar', subplots=True, figsize=(15,60), layout=(9,4));

### Hourly distribution per Road (Total vehicles)
**Nearly all roads share the same pattern in hourly traffic flows:**
- Midnight to 5am: very low traffic
- 6-8am: people go to work and vehicle volumes increase rapidly for 3 hours, followed by a slight slowdown for a couple of hours
- 11 to 15-16: traffic volumes start growing again until they peak when people start going back home from work
- 17-18: vehicle numbers drop steadily. Dinner time in Denmark
- 19-23: the traffic flows slowly decrease

In [None]:
h = df.groupby(["Hour", "Vejnavn"])["Total Vehicles"].sum().unstack()
#h

In [None]:
h.plot(kind='bar', subplots=True, figsize=(15,60), layout=(9,4));

### Hourly distribution per Road (Net Entry-Exit flows)
**NOTE: the Exit-Entry analysis is based on one big assumption:** house numbering on each road follows an ascending order that starts at the City center and increases with the distance from the City center

**Most roads share the same pattern shapes, but not all:**
- Midnight to 4am: most roads show cars exiting the city
- 5-9am: vehicles entering the city are the majority, and inward volumes increase constantly until they peak around 9-10
- 11 to 15-16: inward traffic volumes start dwindling until outbound vehicles start taking over from 15
- 15-18: the majority of vehicles are the ones exiting the city
- 19-23: no clear pattern: some roads show the majority of vehicles entering the city again, while others show the majority of cars exiting the city, depending on the timeframe
- some roads, like Roskildevej and Torvegade, show the opposite pattern to the one described above. This is probably because these roads lead to specific locations that attract a high number of workers, respectively **Roskilde and Amager**
- some other roads instead show a systematic net inward or a systematic net outward flow of vehicles during the whole day. These are, respectively:
  - Englandsvej (inward flow)
  - Islands Brygge (inward flow)
  - Skt. Kjelds Gade
  - Sølvgade
  - Molbechsvej (outward flow)
  - Vejlands Allé (outward flow)

In [None]:
hd = df.groupby(["Hour", "Vejnavn"])["Net Traffic Flow"].sum().unstack()
#hd

In [None]:
hd.plot(kind='bar', subplots=True, figsize=(15,60), layout=(9,4));

### Yearly distribution of total vehicles (All roads)

* when plotting the Yearly distribution of the total traffic across all roads between 2005 and 2014 we notice a strange pattern: the traffic increases from 2005 to 2008 and then decreases steadily from 2008 to 2014.
* before we draw any conclusion about this, we should run some additional checks on data quality
* we will see that plotting the yearly distribution of traffic for each road reveals some severe issues with the data

In [None]:
df.groupby('Year')['Total Vehicles'].sum().plot.bar()

## Yearly distribution per Road

As you can see from the plots below, the Yearly distribution of total traffic data raises some concerns about the quality of the data:

* some roads are missing entire years of data:
    - Englandsvej
    - Hareskovvej
    - Jagtvej
    - Kalkbrænderihavnsgade
    - Molbechsvej
    - Mozartsvej
    - Roskildevej
    - Skt. Kjelds Gade
    - Vejlands Allé
    - Åboulevard
    - Ørestads Boulevard
* most other roads have suspicious drops in the total traffic numbers in some years. 



In [None]:
y = df.groupby(["Year", "Vejnavn"])["Total Vehicles"].sum().unstack()
#y

In [None]:
y.plot(kind='bar', subplots=True, figsize=(15,60), layout=(9,4));

# Using visualizations to check for missing data

### Counting the number of data rows (days of data) for each Road

The previous plots have already suggested that there are both actual and potential issues with Traffic data quality, so we decided to run some sanity checks on the data.

* as a simple completeness test we checked how many Roads have around 365 rows (days) of data for each year. The plots below confirmed our suspicions:
    - no Road has the full 365 rows of data for all the available years
    - only a few Roads have 365 rows of data in the years 2011-2014, which is the time window for which the Weather data are available:
        - Ellebjergvej
        - Gadelandet
        - Islands Brygge
        - Kalvebod Brygge
        - Torvegade
* we will use this finding when we build our ML test later in the notebook, to make sure that the data issues don't affect our Classifier. **Basically we will select only the Roads above for our ML exercise**
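
The completeness test described above can also be turned into an explicit filter. The following is a minimal sketch on made-up records (roads "A" and "B", the column names and the 200-day gap are all hypothetical); it counts distinct days per road and year and keeps only roads that reach the expected number of days in every year.

```python
import calendar
import pandas as pd

# Toy daily records: road "A" is complete for 2011, road "B" stops after 200 days.
days = pd.date_range("2011-01-01", "2011-12-31", freq="D")
records = pd.concat([
    pd.DataFrame({"road": "A", "date": days}),
    pd.DataFrame({"road": "B", "date": days[:200]}),
], ignore_index=True)

# Number of distinct days with data per road and year.
coverage = (records.assign(year=records["date"].dt.year)
                   .groupby(["road", "year"])["date"].nunique()
                   .reset_index(name="days"))

# Expected number of days per year (366 in leap years, otherwise 365).
coverage["expected"] = [366 if calendar.isleap(int(y)) else 365 for y in coverage["year"]]
coverage["complete"] = coverage["days"] >= coverage["expected"]

# A road qualifies only if all of its years are complete.
per_road = coverage.groupby("road")["complete"].all()
complete_roads = per_road[per_road].index.tolist()
print(complete_roads)  # ['A']
```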

In [None]:
yc = df.groupby(["Year", "Vejnavn"]).count()["Total Vehicles"].unstack()/24
yc

In [None]:
yc.plot(kind='bar', subplots=True, figsize=(15,60), layout=(9,4));

# Visualization of Traffic volumes for each CPH district over time

* first we load the CPH districts from a GeoJson file
* then we map CPH roads to the relevant districts
* then we add the district information to the main DataFrame
* then we group the district data by timeframe: yearly and hourly seem like interesting timeframes to investigate
* finally we represent the data on an interactive Choropleth Map
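
The notebook relies on shapely's `contains` for the road-to-district mapping. To show what such a lookup does conceptually, here is a plain-Python ray-casting sketch on two made-up square "districts". It is not shapely's implementation, it ignores holes and MultiPolygons, and the helper names `point_in_ring` and `find_district` are hypothetical.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the closed ring of [lon, lat] pairs?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_district(lon, lat, geojson):
    """Return the name of the first Polygon feature containing the point, else None."""
    for feature in geojson["features"]:
        geom = feature["geometry"]
        if geom["type"] != "Polygon":
            continue  # MultiPolygons are skipped in this sketch
        if point_in_ring(lon, lat, geom["coordinates"][0]):
            return feature["properties"]["name"]
    return None


# Two made-up unit-square "districts" side by side.
toy = {"features": [
    {"properties": {"name": "West"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
    {"properties": {"name": "East"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
]}

print(find_district(0.5, 0.5, toy))  # West
print(find_district(1.5, 0.5, toy))  # East
print(find_district(5.0, 5.0, toy))  # None
```

Returning `None` for points outside every polygon keeps one result per road, which matters when the results are later joined back to the list of roads.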

In [None]:
# first we upload CPH districts polygons from GeoJson file
import urllib.request, json 

with urllib.request.urlopen("https://raw.githubusercontent.com/LarsBryld/socialdata/main/copenhagen_districts.geojson") as url:
    cph_districts = json.loads(url.read().decode())

# we extract the districts names from the GeoJson file
districts = []
for i in range(len(cph_districts["features"])):
    districts.append(cph_districts["features"][i]['properties']['name'])
    
#districts

In [None]:
# then we create a unique list of CPH roads, with corresponding Longitude and Latitude
# (one row per road, so each road name stays paired with its own coordinates)
dfu = df[['Vejnavn','Lon','Lat']].drop_duplicates(subset='Vejnavn').reset_index(drop=True)
#dfu

In [None]:
# finally we find in what district each road falls, using the code from this example:
# https://stackoverflow.com/questions/57727739/how-to-determine-if-a-point-is-inside-a-polygon-using-geojson-and-shapely

from shapely.geometry import shape, Point

# build each district polygon once instead of once per road
polygons = [(f['properties']['name'], shape(f['geometry'])) for f in cph_districts["features"]]

dist = []

for i in range(len(dfu)):
    point = Point(dfu['Lon'][i], dfu['Lat'][i])
    # first district containing the point, or None if no district matches,
    # so that dist always has exactly one entry per road
    dist.append(next((name for name, poly in polygons if poly.contains(point)), None))


In [None]:
# then we add the district name to our list of unique roads
dfu = pd.concat((dfu, pd.DataFrame(dist,columns=['District'])), axis=1)
dfu

In [None]:
# adding the District info to the original DF
df = pd.merge(df, dfu[['Vejnavn','District']], on="Vejnavn")
df

In [None]:
# grouping traffic data by Year and District
dfc = df.groupby(["Year", "District"])["Total Vehicles"].mean().unstack()
dfc = dfc.reset_index()
dfc

In [None]:
# for the Plotly Choropleth to work we need to put all columns into rows

dfc = dfc.melt(id_vars=['Year'],
        value_vars=dfc.columns.values[1:],
        var_name="District", 
        value_name="Total Vehicles")
dfc

In [None]:
# creating an interactive map for CPH traffic
import plotly.express as px

max_value = dfc['Total Vehicles'].max()
fig = px.choropleth(dfc, locations='District',
                    geojson=cph_districts, featureidkey="properties.name",
                           color='Total Vehicles',
                           color_continuous_scale="Viridis",
                           range_color=(0, max_value),
                           projection="mercator",
                    animation_frame="Year", animation_group="District"
                          )

fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
# grouping traffic data by Hour and District
dfh = df.groupby(["Hour", "District"])["Total Vehicles"].mean().unstack()
dfh = dfh.reset_index()
dfh

In [None]:
# for the Plotly Choropleth to work we need to put all columns into rows

dfh = dfh.melt(id_vars=['Hour'],
        value_vars=dfh.columns.values[1:],
        var_name="District", 
        value_name="Total Vehicles")
dfh

In [None]:
# creating an interactive map for CPH traffic
import plotly.express as px

max_value = dfh['Total Vehicles'].max()
min_value = dfh['Total Vehicles'].min()
fig = px.choropleth(dfh, locations='District',
                    geojson=cph_districts, featureidkey="properties.name",
                           color='Total Vehicles',
                           color_continuous_scale="Viridis",
                           range_color=(min_value, max_value),
                           projection="mercator",
                    animation_frame="Hour", animation_group="District"
                          )

fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

# Data Analysis
### Describe your data analysis and explain what you've learned about the dataset.
Our data visualizations have shown us that:
- there are huge differences in traffic across different roads in Copenhagen. This is intuitive because different roads serve different purposes: the big traffic arteries have been built to carry most of the daily traffic in and out of the city, while smaller ones are only used for local traffic. A good example is the pair Ellebjergvej and Mozartsvej. These 2 roads are very close to each other but they serve completely different purposes:
    - Ellebjergvej allows people to get in and out of the city and is one of the busiest roads in our dataset, while
    - Mozartsvej only serves local traffic and is the second least busy road in the dataset.
- Monthly traffic is fairly stable across the year, with 2 main exceptions during the Danish holidays, when traffic slows down compared to the other months: July and December
- Weekly traffic in Copenhagen decreases at the weekend, as expected, for all roads
- Hourly (total) traffic shows a fairly consistent pattern across all roads: very low traffic during the night, then an increase in the early hours of the day (6-8am) when people go to work. It keeps growing until it reaches a peak around 15-16, when people start going back home from work. Between 16 and 18 the traffic decreases quickly, and after dinner time (7pm) it slows down a lot.
- Hourly net flows also show an interesting and consistent pattern:
    - for most roads the net traffic flows out of the city during the night;
    - the majority of vehicles start flowing into the city from 4am, and the process continues until it peaks around 10-11: people living outside the city go to their workplaces or to businesses located inside the city;
    - after 11am the net inflow of cars starts going down and
    - around 15 the majority of vehicles on the roads are the ones leaving the city: the end of the working day is approaching.
    - there is only one big exception to the above net traffic flow: Roskildevej shows a pattern opposite to all other roads. This probably suggests that the number of Copenhageners working in Roskilde is higher than the number of people living in Roskilde and working in CPH
    - a handful of other roads **(......)** don't show the hourly net flow of traffic that we have just described, and instead show a one-sided traffic flow across the whole day. This is probably because our assumption about identifying the traffic direction from the Vej-Id doesn't hold for these roads.


# Machine Learning

## ML Exercise 1: classifying traffic data by Roads by using hourly Net Traffic Flows

The visualization of Hourly Net Flows (vehicles entering minus vehicles leaving the city) above shows that, for example, Roskildevej has a very different pattern from other roads: cars on Roskildevej leave the city in the morning and come back in the evening. This is probably because the number of Copenhageners working in Roskilde is higher than the number of Roskilde residents working in Copenhagen.

Based on this visual information we have built a classifier that can distinguish traffic flows on Roskildevej from those on roads with a completely different hourly traffic flow (we have chosen Ellebjergvej for our example).

Our classifier uses Random Forests and yields around 80% accurate predictions for both the training and test data.


In [None]:
# Test without Weather data

### data preprocessing

# selecting our sample: focussing on data for 2011, 2012, 2013 and 2 roads only
dfml = df[(df['Year'].isin([2011,2012,2013]))
         & (df['Vejnavn'].isin(['Ellebjergvej','Roskildevej']))]  

# keeping only: the road name (target), Net Traffic Flow and Hour-of-the-day
dfml = dfml[['Vejnavn','Net Traffic Flow', 'Hour']]
#dfml

In [None]:
# encoding data with LabelEncoder
from sklearn.preprocessing import LabelEncoder

#creating labelencoder
le = LabelEncoder()
# encoding string labels (Vejnavn is the target variable)
labels = le.fit_transform(dfml['Vejnavn'])

# encoding string features (not necessary)
#NetTraffic =le.fit_transform(dfml['Hour'])

features=dfml[['Net Traffic Flow', 'Hour']]

#Split Train/test datasets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, random_state=42)

# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train,  y_train)

# measuring the classification performance of the RF classifier through cross_val_score
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(clf, X_train,  y_train, cv=10))

In [None]:
# Measuring RF prediction performance (in the Test sample)
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test) # 0: Ellebjergvej, 1: Roskildevej (LabelEncoder sorts class names)
print(classification_report(y_test, y_pred))#, target_names=target_names))

## ML Exercise 2: predict traffic volumes with weather data

### Importing CPH weather data

In [None]:
dfw = pd.read_csv("https://raw.githubusercontent.com/LarsBryld/socialdata/main/cph_weather.csv",
                 parse_dates = ['DateTime'])
dfw

### Merging Traffic and Weather data

In [None]:
dfm = pd.merge(df, dfw, left_on='Dato',right_on='DateTime')
dfm

In [None]:
roadsforML = ['Ellebjergvej',
              'Gadelandet',
              'Islands Brygge',
              'Kalvebod Brygge',
              'Torvegade'
]
yearsforML = [2011, 2012, 2013]

dfml = dfm[(dfm['Vejnavn'].isin(roadsforML)) & 
           (dfm['Year'].isin(yearsforML)) &
           (dfm['WeekDay']<=4) &
           (~dfm['Month'].isin([1,7,12]))]

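# daily averages of the numeric columns only (text columns such as Vejnavn cannot be averaged)
dfml = dfml.groupby(['Dato']).mean(numeric_only=True)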

dfml

### let's have a look at the data before we check our ML engine:
* the subset of data that we selected seems to be stable across the 3 years that we are investigating, rather than decreasing as we saw when we looked at the overall dataset

In [None]:
dfml.groupby('Year')['Total Vehicles'].sum().plot.bar()

### when investigating the distribution of days across low, medium and high traffic, we notice some class imbalance in our dataset
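
The imbalance is partly a consequence of how the classes are built: equal-width bins (as produced by `pd.cut(..., bins=3)`) follow the shape of the distribution. The sketch below uses made-up daily totals to contrast equal-width bins with equal-frequency bins from `pd.qcut`. The latter is only one possible alternative, shown for illustration, and it changes what "low", "medium" and "high" mean.

```python
import pandas as pd

# Made-up daily totals: mostly busy days plus a few quiet ones.
totals = pd.Series([100, 120, 300, 500, 510, 520, 530, 540, 550, 560, 570, 590])

# Equal-width bins: class sizes follow the distribution of the data.
equal_width = pd.cut(totals, bins=3, labels=False)
width_counts = equal_width.value_counts().sort_index().tolist()
print(width_counts)  # [2, 1, 9]

# Equal-frequency bins: roughly the same number of days per class.
equal_freq = pd.qcut(totals, q=3, labels=False)
freq_counts = equal_freq.value_counts().sort_index().tolist()
print(freq_counts)  # [4, 4, 4]
```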

In [None]:
dfml['Total Vehicles'].plot.hist(bins=3, alpha=0.5)

### the Random Forest Classifier on the current data shows a modest predictive power of 59%

In [None]:
# creating labels by splitting the continuous traffic totals into 3 equal-width bins
labels = pd.cut(dfml['Total Vehicles'], bins=3, retbins=True, labels=False)[0]

# encoding string features is not necessary
features = dfml[["LowTemp",
               "HighTem",
               "MidTemp",
               "AirPressure",
               "Rain",
               "LowWind",
               "MidWind",
               "HighWind",
               "Sunshine"]]

#Split Train/test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, random_state=42)

# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train,  y_train)



# measuring the classification performance of the RF classifier through cross_val_score
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(clf, X_train,  y_train, cv=10))

In [None]:
# Measuring RF prediction performance (in the Test sample)
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test) # 0: low, 1: medium, 2: high traffic
print(classification_report(y_test, y_pred))#, target_names=target_names))

### one way to try to improve the classifier is to resample the data to eliminate the class imbalance

### so we need to increase the sample size of classes 0 and 1 (low and medium traffic respectively) and then see if the classifier improves

### how do we do it? first we add the labels to our dataframe

In [None]:
dfml['Labels'] = labels
#dfml

### then we:
- split the dataframe into Train and Test samples
- re-sample each category to obtain an equally weighted Train sample
- train the model on the balanced dataset and finally check its predictive power on the original Test data
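
The steps above can be sketched on a toy training set. The data, the `label` column name and the target size of 5 are made up for illustration; the point is only that resampling with replacement is applied to the training rows, leaving the test rows untouched.

```python
import pandas as pd

# Made-up training set with imbalanced labels (0 = low, 1 = medium, 2 = high traffic).
train = pd.DataFrame({
    "Rain": [0.0, 1.2, 0.0, 3.4, 0.1, 0.0, 2.2, 0.0],
    "label": [0, 1, 2, 2, 2, 2, 2, 1],
})

target_size = 5  # hypothetical per-class size for this toy example

# Draw target_size rows per class, with replacement, from the training rows only.
balanced = pd.concat(
    [group.sample(n=target_size, replace=True, random_state=1)
     for _, group in train.groupby("label")],
    ignore_index=True,
)
balanced_counts = balanced["label"].value_counts().sort_index().tolist()
print(balanced_counts)  # [5, 5, 5]
```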

In [None]:
#Split Train/test datasets
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, labels, test_size=0.33, random_state=42)

In [None]:
# the training dataset now contains both features and labels
TrainSet = pd.concat((y_train1,X_train1), axis=1)
TrainSet

In [None]:
# now we fix the class imbalance by re-sampling (with replacement), increasing the training dataset to 1000 elements for each class
# setting Max Class Size equal to 1,000
mcs = 1000

# extracting random samples of equally sized classes
# (DataFrame.append was removed in pandas 2.0, so the samples are combined with pd.concat)
dfs = pd.concat([TrainSet[TrainSet['Total Vehicles'] == c].sample(n=mcs, random_state=1, replace=True)
                 for c in (0, 1, 2)])

dfs

In [None]:
dfs['Total Vehicles'].plot.hist(bins=5, alpha=0.5)

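### The data resampling drastically increases the in-sample predictive power of the classifier to nearly 100%

Because the classes were oversampled with replacement before cross-validation, copies of the same row can land in both the training and the validation folds, so this in-sample score is optimistic.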
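
A small simulation helps explain why a cross-validated score on oversampled data tends to be optimistic. The sketch assumes 60 distinct original rows (a made-up number) oversampled to 1000 and split into 10 shuffled folds; it then counts how many validation rows have an exact copy among the rows used for training.

```python
import random

random.seed(0)

# 60 distinct original rows (hypothetical), oversampled with replacement to 1000 rows.
original_ids = list(range(60))
oversampled = random.choices(original_ids, k=1000)

# Split the oversampled rows into 10 folds.
random.shuffle(oversampled)
folds = [oversampled[i::10] for i in range(10)]

# For each fold, count validation rows that also appear in the other nine folds.
leaked = 0
for i, validation in enumerate(folds):
    training = {row for j, fold in enumerate(folds) if j != i for row in fold}
    leaked += sum(row in training for row in validation)

leak_share = leaked / len(oversampled)
print(f"{leak_share:.0%} of validation rows have an exact copy in the training folds")
```

When almost every validation row has a twin in the training folds, a Random Forest can effectively memorise the answer, which is why the score on the untouched Test sample is the more informative number.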

In [None]:
#taking the labels from the df
labels1 = dfs['Total Vehicles']

# taking weather features from the df
features1 = dfs[["LowTemp",
               "HighTem",
               "MidTemp",
               "AirPressure",
               "Rain",
               "LowWind",
               "MidWind",
               "HighWind",
               "Sunshine"]]

#Split Train/test datasets
#from sklearn.model_selection import train_test_split
#X_train1, X_test1, y_train1, y_test1 = train_test_split(features, labels, test_size=0.33, random_state=42)

# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

clf1 = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
#clf1 = clf1.fit(X_train1, y_train1)
clf1 = clf1.fit(features1, labels1)

# measuring the classification performance of the RF classifier through cross_val_score
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(clf1, features1, labels1, cv=10))

### the predictive power on the Test sample, however, has not increased much: only from 59% to 61%
* this result leads us to conclude that the weather has only modest power in predicting the traffic flow
* this is probably due to 2 issues:
    - there is not enough granularity in the data: "Vehicles" can include both cars and bicycles, for example, while the choice of vehicle could depend on the weather
    - the decision to drive or cycle (or to use any other vehicle) often depends on reasons that are independent of the weather (family commitments, job needs, etc.)
* considering all the above, we still believe that the weather has some predictive power for traffic volumes, and that it could be improved with higher data granularity

In [None]:
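# note: cross_val_score refits clones of clf1 on folds of the Test sample; the score of the
# already-trained clf1 on the Test sample is given by the classification report in the next cell
np.mean(cross_val_score(clf1, X_test1, y_test1, cv=10))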

In [None]:
# Measuring RF prediction performance (in the Test sample)
from sklearn.metrics import classification_report

y_pred1 = clf1.predict(X_test1) # 0: low, 1: medium, 2: high traffic
print(classification_report(y_test1, y_pred1))#, target_names=target_names))

### Let's find which weather data are driving the classification
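
Permutation importance measures how much a model's score drops when a single feature column is shuffled. The notebook uses eli5 for this; the plain-Python sketch below only illustrates the idea on a toy "model" and is not eli5's implementation. The helper names and the data are made up.

```python
import random

def accuracy(model, X, y):
    """Share of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Drop in accuracy when one feature column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for k in range(n_features):
        column = [row[k] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only column k replaced by its shuffled values.
        X_perm = [row[:k] + [value] + row[k + 1:] for row, value in zip(X, column)]
        importances.append(baseline - accuracy(model, X_perm, y))
    return importances

# Toy "model": predicts class 1 when feature 0 exceeds 0.5 and ignores feature 1.
def model(row):
    return int(row[0] > 0.5)

X = [[i % 2, (i * 7) % 5] for i in range(100)]  # feature 0 alternates 0/1
y = [row[0] for row in X]                        # the label depends on feature 0 only

importances = permutation_importance(model, X, y, n_features=2)
print(importances)
```

Shuffling the ignored feature leaves the accuracy unchanged (importance 0), while shuffling the feature the model depends on lowers it.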

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(clf, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

In [None]:
perm = PermutationImportance(clf1, random_state=1).fit(X_test1, y_test1)
eli5.show_weights(perm, feature_names = X_test1.columns.tolist())

# Genre. Which genre of data story did you use?
* Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
* Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

# Visualizations.
* Explain the visualizations you've chosen.
* Why are they right for the story you want to tell?

# Discussion. Think critically about your creation
* What went well?
    - We managed to get all the data we were looking for from open data facilities: traffic, weather, map polygons
    - The cleaning/processing/merging of all data for visualization purposes
    - We were able to extract useful information from our visualizations
    - The Machine Learning exercises gave us some hints about the predictability of traffic flows
    - The website representation was far from obvious to address, but we managed to do it, although it required a lot of effort
    - We believe that our story representation allows the end user to understand the main features/information that the data contain
* What is still missing? What could be improved? Why?
    - with more granularity in the data (a split between different vehicle types, for example) we might have had a better Classifier/Predictor of Traffic volumes based on Weather conditions
    - to the same end we could/should have normalised the weather data by month, although the impact of this is far from obvious
    - .....
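
The month-wise normalisation mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions: the toy table and the column names `Month` and `Temperature` are hypothetical, and a z-score within each calendar month is one possible choice of normalisation, not something we have tested on our data.

```python
import pandas as pd

# Hypothetical toy weather table; "Month" and "Temperature" are illustrative names.
weather = pd.DataFrame({
    "Month": [1, 1, 1, 7, 7, 7],
    "Temperature": [-2.0, 0.0, 2.0, 18.0, 20.0, 22.0],
})

def zscore_by_month(frame, column):
    """Return the column standardised within each calendar month."""
    grouped = frame.groupby("Month")[column]
    # transform() broadcasts each month's statistic back to the original rows.
    return (frame[column] - grouped.transform("mean")) / grouped.transform("std")

weather["Temperature_z"] = zscore_by_month(weather, "Temperature")
print(weather)
```

After this step a cold July day and a cold January day get comparable values, so the classifier sees "unusual for the season" rather than the season itself.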

# Contributions. Who did what?
You should write (just briefly) which group member was mainly responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took the lead role on certain portions of the work. That's what you should explain).

In [None]:
df

In [None]:
df.groupby(['WeekDay','Vejnavn']).sum()#["Net Traffic Flow"]
#df.groupby(["Year", "Vejnavn"]).count()["Total Vehicles"].unstack()/24

In [None]:
fig = px.bar(w, x="WeekDay", y="Net Traffic Flow", facet_col='Vejnavn', facet_col_wrap=5)

fig.show()

In [None]:
import plotly.express as px
# Define data_canada before it is displayed, so the cells run top to bottom
data_canada = px.data.gapminder().query("country == 'Canada'")
data_canada

In [None]:
fig = px.bar(data_canada, x='year', y='pop')
fig.show()

In [None]:
data_italia = px.data.gapminder().query("country == 'Italy'")
data_italia

In [None]:
data = pd.concat((data_canada,data_italia))
data

In [None]:
import plotly.express as px
fig = px.bar(data, x='year', y='pop')
fig.show()

# Additional space data visualizations using Folium

# Radhusplads

In [None]:
import folium

map_hooray = folium.Map([55.6761, 12.5683], tiles = "Stamen Toner", zoom_start=12.5)

folium.Marker([55.6761, 12.5683], 
              popup='RadHus Plads', 
              icon=folium.Icon(color='blue')
             ).add_to(map_hooray)

map_hooray

# Visualizing some traffic data (randomized locations)

In [None]:
# DayOfMonth 1-31 covers every day, so only Hour, Month and Year actually restrict the selection
df1 = df[(df['DayOfMonth'].isin(range(1, 32)))
         & (df['Hour'].isin(['07','08']))
         & (df['Month'].isin([6, 7]))
         & (df['Year'] == 2012)]

df1

In [None]:
map2 = folium.Map([55.6761, 12.5683], tiles = "Stamen Toner", zoom_start=12)

folium.Marker([55.6761, 12.5683], 
              popup='City Hall', 
              icon=folium.Icon(color='blue')
             ).add_to(map2)

for i in range(len(df1)):
    folium.Circle(location=[df1.iloc[i]['Lat_rand'], df1.iloc[i]['Lon_rand']],
                  popup=df1.iloc[i]['Month'],
                  radius=4, #data.iloc[i]['value']*10000,
                  color='crimson',
                  fill=True,
                  fill_color='crimson'
                 ).add_to(map2)

map2

# Heatmap

In [None]:
from folium.plugins import HeatMap

map_hooray = folium.Map([55.6761, 12.5683], tiles = "Stamen Toner", zoom_start=12)

# Filter the DF for rows, then columns, then remove NaNs
heat_df = df1[['Lat', 'Lon']]
#heat_df = heat_df.dropna(axis=0, subset=['Y','X'])



# Convert the coordinates to a list of [lat, lon] pairs
heat_data = heat_df[['Lat', 'Lon']].values.tolist()

# Plot it on the map
HeatMap(heat_data).add_to(map_hooray)

# Display the map
map_hooray

# HeatMapWithTime  (Weekdays - NON random locations) 
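
`HeatMapWithTime` expects one list of `[lat, lon]` pairs per time step. The sketch below shows one way to build that nested list with a pandas `groupby` instead of row-by-row iteration. The toy rows and the helper name `frames_by_step` are hypothetical; only the column names `Lat`, `Lon` and `WeekDay` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself uses df1 with 'Lat', 'Lon' and 'WeekDay'.
points = pd.DataFrame({
    "Lat": [55.67, 55.68, 55.69, 55.70],
    "Lon": [12.56, 12.57, 12.58, 12.59],
    "WeekDay": [0, 0, 2, 6],
})

def frames_by_step(frame, step_col, steps, lat_col="Lat", lon_col="Lon"):
    """Build one list of [lat, lon] pairs per time step (empty if a step has no rows)."""
    clean = frame.dropna(subset=[lat_col, lon_col, step_col])
    groups = {key: grp[[lat_col, lon_col]].values.tolist()
              for key, grp in clean.groupby(step_col)}
    return [groups.get(step, []) for step in steps]

heat_data = frames_by_step(points, "WeekDay", range(7))
print([len(frame) for frame in heat_data])
```

The result can be passed to `plugins.HeatMapWithTime` in the same way as the `heat_data` list built in the cell below; steps without observations simply yield an empty frame.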

In [None]:
from folium import plugins

map_hooray = folium.Map([55.6761, 12.5683], tiles = "Stamen Toner", zoom_start=12)

# Select the coordinate columns; .copy() avoids pandas' SettingWithCopyWarning when 'Weight' is added below
heat_df = df1[['Lat', 'Lon','Lat_rand', 'Lon_rand']].copy()

# Create weight column from the weekday (cast to float, so WeekDay must be numeric)
heat_df['Weight'] = df1['WeekDay'].astype(float)
heat_df = heat_df.dropna(axis=0, subset=['Lat','Lon', 'Weight'])

# One list of [lat, lon] pairs per weekday (0-6)
heat_data = [[[row['Lat'],row['Lon']] for index, row in heat_df[heat_df['Weight'] == i].iterrows()] for i in range(0,7)]

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.8)
hm.add_to(map_hooray)
# Display the map
map_hooray

# HeatMapWithTime  (Weekdays - randomized locations)

In [None]:
from folium import plugins

map_hooray = folium.Map([55.6761, 12.5683], tiles = "Stamen Toner", zoom_start=12)

heat_rn = pd.concat([heat_df['Lat_rand'], heat_df['Lon_rand'], heat_df['Weight']], axis=1)

# One list of [lat, lon] pairs per weekday (0-6), using the randomized locations
heat_data_rn = [[[row['Lat_rand'],row['Lon_rand']] for index, row in heat_rn[heat_rn['Weight'] == i].iterrows()] for i in range(0,7)]

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data_rn,auto_play=True,max_opacity=0.8)
hm.add_to(map_hooray)
# Display the map
map_hooray

# HeatMapWithTime  (Hours - only 2 roads) - we could skip this one

In [None]:
# DayOfMonth 1-31 covers every day, so only Vejnavn, Month and Year actually restrict the selection
df3 = df[(df['DayOfMonth'].isin(range(1, 32)))
         & (df['Vejnavn'].isin(['Roskildevej','Ellebjergvej']))
#         & (df['Hour'].isin(['07','08']))
         & (df['Month'].isin([6, 7]))
         & (df['Year'] == 2012)]
#        & (df['Vej-Id'] == 'T')]

#df3

In [None]:
map3 = folium.Map([55.6761, 12.5683], tiles = "Stamen Toner", zoom_start=12)

folium.Marker([55.6761, 12.5683], 
              popup='City Hall', 
              icon=folium.Icon(color='blue')
             ).add_to(map3)

for i in range(len(df3)):
    folium.Circle(location=[df3.iloc[i]['Lat_rand'], df3.iloc[i]['Lon_rand']],
                  popup=df3.iloc[i]['Month'],
                  radius=4, #data.iloc[i]['value']*10000,
                  color='crimson',
                  fill=True,
                  fill_color='crimson'
                 ).add_to(map3)

map3

In [None]:
from folium import plugins

map_hooray = folium.Map([55.6761, 12.5683], tiles = "Stamen Toner", zoom_start=12)

# Select the randomized coordinates; .copy() avoids pandas' SettingWithCopyWarning when 'Weight' is added below
heat_df = df3[['Lat_rand', 'Lon_rand']].copy()

# Create weight column from the hour of the day (cast to float)
heat_df['Weight'] = df3['Hour'].astype(float)
heat_df = heat_df.dropna(axis=0, subset=['Lat_rand','Lon_rand', 'Weight'])

# One list of [lat, lon] pairs per hour (0-23)
heat_data = [[[row['Lat_rand'],row['Lon_rand']] for index, row in heat_df[heat_df['Weight'] == i].iterrows()] for i in range(0,24)]

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.8)
hm.add_to(map_hooray)
# Display the map
map_hooray