# Business School of AI
## LiveLab September 21 : Machine Learning on mobility data
- set up your environment and upload your dataset
- clean your data
- exploratory data analysis
- find dependencies between variables
- data visualisation
- predicting trip duration with machine learning
- refine your predictive performance with a feature reduction technique

#### this data analysis is done on the data record file https://data.louisvilleky.gov/dataset/dockless-vehicles/resource/e36546f6-888b-4e66-8a87-9b68cab471e6

In [None]:
# setup your coding environment
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns



#### note: if you miss any library at the import, you can install it with
!pip install libraryname

#### to import the dataset from your drive to your colab, you'll need to execute the following code:

- it will ask you to copy/paste a validation code on this notebook
- it will say "Mounted at /content/gdrive" when done.

In [None]:
from google.colab import drive 
drive.mount("/content/gdrive")

In [None]:
# load your dataset from the mounted drive
dockless = pd.read_csv('/content/gdrive/My Drive/DocklessTripOpenData_10.csv')

In [None]:
# check dataset size
dockless.shape

In [None]:
# preview dataset content
dockless.head()

### 1. Clean dataset: 
- rename header in lowercase (show error in calling it first)
- delete outliers (tripdistance == 0.000) 
- rename DayOfWeek values

In [None]:
# you want to check if trip duration column has null values
dockless['Tripduration'].isna().sum()
# it raises a KeyError.
# the column name is spelled with a different mix of uppercase/lowercase letters (a common mistake)

In [None]:
# to avoid errors in spelling, keep it simple
# rename all your headers with lowercase
dockless.columns= dockless.columns.str.lower()
dockless.head()

In [None]:
# now check null values with simplified spelling
dockless['tripduration'].isna().sum()
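
The case mismatch and its fix can be reproduced on a tiny frame. This is only a minimal sketch: the mixed-case header names below are assumptions for illustration, not necessarily the ones in the Louisville file.

```python
import pandas as pd

# toy frame with mixed-case headers (assumed names, for illustration only)
toy = pd.DataFrame({"TripDuration": [5, 12], "TripDistance": [0.4, 1.1]})

try:
    toy["Tripduration"]      # wrong case: pandas column labels are case-sensitive
    lookup_failed = False
except KeyError:
    lookup_failed = True

# normalise every header to lowercase, then the simple spelling works
toy.columns = toy.columns.str.lower()
null_count = toy["tripduration"].isna().sum()
```

Lowercasing once at the start means every later cell can use the same simple spelling.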

In [None]:
# there are no null values recorded (it means each data point has a value recorded)
# but imagine you want to get rid of meaningless data points (such as trip distance = 0.00)
(dockless['tripdistance']==0).sum()

In [None]:
# delete data points where trip distance == 0 (it might affect your prediction performance)
dockless = dockless[dockless['tripdistance'] != 0]

In [None]:
# check for dataset size again
dockless.shape
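
The filtering step relies on a boolean mask. A small sketch on made-up values shows how many rows the mask removes before you commit to it:

```python
import pandas as pd

# made-up trips; two of them have a zero distance
toy = pd.DataFrame({"tripdistance": [0.0, 0.8, 2.5, 0.0, 1.2],
                    "tripduration": [1, 6, 15, 2, 9]})

mask = toy["tripdistance"] != 0          # True for the rows worth keeping
n_dropped = int((~mask).sum())           # number of rows that will be removed
cleaned = toy[mask].reset_index(drop=True)
```

Counting the dropped rows first is a cheap check that the filter does what you expect.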

In [None]:
# get unique values from dayofweek column
dockless['dayofweek'].unique()
# there are 7 days in the week, alright! but which one is 1st in their calendar ?

In [None]:
# import datetime module
import datetime

# from your dataset preview, fill the date where dayofweek == 1
print(datetime.date(2019,9,22).strftime('%A'))

In [None]:
# create a dictionary of values
days = {1: "Sunday",
        2: "Monday",
        3: "Tuesday",
        4: "Wednesday",
        5: "Thursday",
        6: "Friday",
        7: "Saturday"
       }

# replace numbers by words in column 'dayofweek'
dockless['dayofweek'] = dockless['dayofweek'].replace(days)
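
As an alternative to typing the dictionary by hand, the same mapping can be built from the standard library. This sketch assumes, as found above, that code 1 stands for Sunday; `calendar.day_name` returns English names under the default locale.

```python
import calendar
import datetime

import pandas as pd

# calendar.day_name starts on Monday (index 0); shift so that code 1 maps to Sunday
days = {code: calendar.day_name[(code + 5) % 7] for code in range(1, 8)}

# sanity check against a known date: 22 September 2019
check = datetime.date(2019, 9, 22).strftime("%A")

# apply the mapping to a few made-up codes
codes = pd.Series([1, 2, 7])
names = codes.map(days)
```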

In [None]:
dockless.head()

In [None]:
# check hournum column
dockless['hournum'].unique()

In [None]:
# check number of hours listed
len(dockless['hournum'].unique())

In [None]:
# create a dictionary of values
midnight = {24: 0,
            '24:00': '00:00'}

# hournum is read as an integer number, not as a string. convert it first
dockless['hournum'] = dockless['hournum'].astype(object)

# replace 24 with 0
dockless['hournum'] = dockless['hournum'].replace(midnight)
# same with starttime and endtime columns
dockless['starttime'] = dockless['starttime'].replace(midnight)
dockless['endtime'] = dockless['endtime'].replace(midnight)

In [None]:
dockless['hournum'].unique()
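
If `hournum` only holds whole hours, a modulo is a possible alternative to the dictionary replacement, and it keeps the column as integers. This is a sketch on made-up values, assuming every entry lies between 0 and 24.

```python
import pandas as pd

hours = pd.Series([1, 13, 24, 23, 24], name="hournum")  # made-up values

# 24 and 0 are the same hour on a 24-hour clock, so take the remainder
hours_fixed = hours % 24
n_distinct = hours_fixed.nunique()
```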

### 2. Exploratory Data Analysis:
- look at some distribution, calculate mean and standard dev
- plot histogram of distribution depending on day of the week => which day is more crowded?
- same with hour => which hour of the day is more crowded?

In [None]:
# calculate the mean of trip duration
dockless['tripduration'].mean()

In [None]:
# check standard deviation
dockless['tripduration'].std()
# pretty scattered distribution! 

In [None]:
# check more descriptive statistics
dockless['tripduration'].describe()

In [None]:
# a trip of 3167 ?
print(3167/60) # how many hours 
print(3167/60/24) # how many days
print(3167/9) # how many times the median value ?
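
A single extreme value is enough to pull the mean far away from the median. The sketch below uses made-up durations (only 3167 is taken from the output above) to show the effect:

```python
import numpy as np

durations = np.array([5, 7, 9, 9, 12, 3167])  # made-up minutes, one extreme trip

mean_all = durations.mean()                        # dragged up by the outlier
median_all = np.median(durations)                  # barely affected
mean_without = durations[durations < 240].mean()   # mean once the outlier is gone
```

This is why the median is often a safer summary than the mean when outliers are suspected.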

In [None]:
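# the max value does not make sense. It must be an outlier, affecting our stats.
# visualise variable distribution
# (histplot replaces distplot, which is deprecated in recent seaborn releases)
x = dockless['tripduration']
ax = sns.histplot(x, kde=True, stat='density')
plt.show()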

In [None]:
# in a sub-dataset delete trip duration values above 240 min (=4 hours)
dockless_sub = dockless[dockless['tripduration'] < 240]

In [None]:
# plot again
x = dockless_sub['tripduration']
ax = sns.histplot(x, kde=True, stat='density')
plt.show()

In [None]:
# tighten to 90 min (= 1.5 hours)
dockless_sub = dockless[dockless['tripduration'] < 90]

In [None]:
# plot again
x = dockless_sub['tripduration']
ax = sns.histplot(x, kde=True, stat='density')
plt.show()
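
Before settling on a cutoff, it can help to check how much data each threshold keeps. A minimal sketch on made-up durations:

```python
import pandas as pd

durations = pd.Series([3, 5, 8, 9, 12, 25, 60, 95, 300, 3167])  # made-up minutes

# the mean of a boolean Series is the fraction of True values
shares = {cutoff: float((durations < cutoff).mean()) for cutoff in (240, 90)}

for cutoff, share in shares.items():
    print(f"< {cutoff} min keeps {share:.0%} of trips")
```

If a cutoff discards a large share of the rows, it is probably removing real trips rather than outliers.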

In [None]:
# count trips per day
dockless['dayofweek'].value_counts()

In [None]:
# plot the trips per day
dockless['dayofweek'].value_counts().plot(kind='bar')

In [None]:
# plot the use of vehicles by hours of the day
dockless['hournum'].value_counts().plot(kind='bar')

In [None]:
# same but sorted
dockless['hournum'].hist(bins = 24)

In [None]:
# is every day similar ?
dockless['hournum'].hist(by=dockless['dayofweek'], sharex=True, sharey=True)
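
The same day-by-hour comparison can be made numerically with a cross-tabulation. This sketch uses a made-up frame with the same two column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "dayofweek": ["Monday", "Monday", "Saturday", "Saturday", "Saturday"],
    "hournum":   [8, 17, 14, 14, 15],
})

# rows = day, columns = hour, cells = number of trips
counts = pd.crosstab(toy["dayofweek"], toy["hournum"])

# row-normalise to compare the shape of each day regardless of its volume
profile = pd.crosstab(toy["dayofweek"], toy["hournum"], normalize="index")
```

Each row of `profile` sums to 1, so busy and quiet days can be compared on the same scale.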

### 3. Look for dependencies between variables:
- correlation matrix with pearson coefficient
- heatmap

In [None]:
# the Pearson coefficient measures the linear correlation between two variables (from -1 to 1)
# compute it for each pair of numerical variables (text columns are left out)
pearson = dockless.select_dtypes(include='number').corr(method='pearson')
pearson

In [None]:
# visualize correlation between variables through a heatmap
sns.heatmap(pearson, annot=True)
plt.show()
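
To see what the coefficient actually computes, here is a sketch that derives it by hand on made-up values and compares it with the pandas result:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"tripdistance": [0.5, 1.0, 2.0, 3.5, 5.0],
                    "tripduration": [4.0, 7.0, 11.0, 20.0, 24.0]})  # made-up values

x = toy["tripdistance"].to_numpy()
y = toy["tripduration"].to_numpy()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# same value as read from the pandas correlation matrix
r_pandas = toy.corr(method="pearson").loc["tripdistance", "tripduration"]
```

A value close to 1 means the two variables grow together almost linearly; it says nothing about causation.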

### 4. Data visualization:
- plot the relationship of the most correlated variables (scatter plot of X / Y variables) => linear regression by adding a straight line on plot
- point lat/long on folium map
- draw edges between origin/destination
- aggregate datapoints to visualise it all

In [None]:
# scatter plot the relationship between trip distance and trip duration
sns.scatterplot(data=dockless, x="tripdistance", y="tripduration")

In [None]:
# plot a subset inside the crowded window
sns.scatterplot(data=dockless.loc[(dockless['tripdistance']<20) & (dockless['tripduration']<500)], x="tripdistance", y="tripduration")

In [None]:
# add a linear regression line to the scatterplot
sns.jointplot(x="tripdistance", y="tripduration", data=dockless, kind='reg',joint_kws={'line_kws':{'color':'black'}})
# as we can see, the longer the distance, the longer the duration

In [None]:
# visualise spatial distribution of data points
# jointplot creates its own figure, so size it with `height` instead of plt.figure
g = sns.jointplot(x=dockless.startlatitude.values, y=dockless.startlongitude.values, height=10)
g.set_axis_labels('Latitude', 'Longitude', fontsize=12)
sns.despine()
plt.show()
# there is hyper-concentration in center city and a few outliers

In [None]:
dockless.shape
# there are lots of datapoints

In [None]:
# random selection of 1000 data points
sample = dockless.sample(n=1000)
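
Without a seed, the sample changes on every run, and so do the maps built from it. A sketch of a reproducible draw, where the seed value 42 is an arbitrary choice:

```python
import pandas as pd

toy = pd.DataFrame({"trip": range(100)})   # stand-in for the full dataset

# the same random_state returns the same rows on every run
sample_a = toy.sample(n=10, random_state=42)
sample_b = toy.sample(n=10, random_state=42)
```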

In [None]:
# what does our sample data look like on this graph?
# jointplot creates its own figure, so size it with `height` instead of plt.figure
g = sns.jointplot(x=sample.startlatitude.values, y=sample.startlongitude.values, height=10)
g.set_axis_labels('Latitude', 'Longitude', fontsize=12)
sns.despine()
plt.show()

In [None]:
# visualise the spatial data points on maps
import folium
from folium.plugins import HeatMap
from folium.plugins import HeatMapWithTime

In [None]:
# enter Louisville coordinates
Louisville=[38.2527,-85.7585]

In [None]:
# map Louisville
map_louisville = folium.Map(location=Louisville,
                            zoom_start=12)
map_louisville

In [None]:
# map origin points
map_origin = folium.Map(location=Louisville,
                            zoom_start=12)

for lat, lng in zip(sample['startlatitude'],
                    sample['startlongitude']):
    folium.CircleMarker([lat, lng],
                        radius=2, 
                        stroke=False, 
                        fill_color='blue',
                        fill_opacity=0.5).add_to(map_origin)

map_origin

In [None]:
# cluster points to make the map more readable
map_cluster = folium.Map(location=Louisville,
                            zoom_start=12)

cluster = folium.plugins.MarkerCluster().add_to(map_cluster)

for lat, lng in zip(sample['startlatitude'],
                    sample['startlongitude']):
    folium.Marker([lat, lng]).add_to(cluster)
        
map_cluster

In [None]:
# visualise data points on a heatmap
heatmap = folium.Map(location=Louisville,
                            zoom_start=12)

HeatMap(data=sample[['startlatitude', 'startlongitude']].groupby(['startlatitude','startlongitude']).sum().reset_index().values.tolist(),
       radius=8, max_zoom=12).add_to(heatmap)

heatmap

In [None]:
# map destination points
map_destination = folium.Map(location=Louisville,
                            zoom_start=12)

for lat, lng in zip(sample['endlatitude'],
                    sample['endlongitude']):
    folium.CircleMarker([lat, lng],
                        radius=2, 
                        stroke=False, 
                        fill_color='red',
                        fill_opacity=0.5).add_to(map_destination)

map_destination

In [None]:
map_trip = folium.Map(location=Louisville,
                            zoom_start=12)

for i, row in sample.iterrows():
    folium.CircleMarker([row['startlatitude'], row['startlongitude']],
                        radius=4,
                        stroke=False,
                        fill_color='blue',
                        fill_opacity=0.7).add_to(map_trip)
    
    folium.CircleMarker([row['endlatitude'], row['endlongitude']],
                        radius=4,
                        stroke=False,
                        fill_color='red',
                        fill_opacity=0.7).add_to(map_trip)
    
    folium.PolyLine([[row['startlatitude'], row['startlongitude']],
                    [row['endlatitude'], row['endlongitude']]],
                    color="#000000"  # line colour is set with `color`
                   ).add_to(map_trip)

map_trip
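
Each line on this map is a straight edge between an origin and a destination. Its length can be estimated with the haversine formula; the helper below is a hypothetical illustration (its name and the Earth radius of 6371 km are assumptions), not part of the original notebook.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# a point compared with itself, and one degree of longitude along the equator
same_point = haversine_km(38.2527, -85.7585, 38.2527, -85.7585)
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

The function also accepts whole columns, for example the four start/end coordinate columns of `sample`, and returns one distance per row.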

### 5. Machine learning on mobility data
- Recall the research question
- Define X-y axes accordingly 
- split dataset in train/test subsets
- 1st run (compare accuracy scores) 
- plot predictions/actual data => want to refine your scores
- plot feature importance
- reduce feature redundancy
- 2nd run (compare accuracy scores) => better accuracy with refined datasets

# Problem statement
What data do we have at our disposal?
- origin/destination points
- trip duration
- trip distance
- day of the week
- hour of the day

Can we predict the hour of a trip from the day of the week? NO. As seen above, all days have the same hourly distribution.
Can we predict the day of the week from the trip duration? We might imagine that users ride longer trips on the weekend because they have more time, but so far we haven't found any evidence of this in the correlation matrix.
Can we predict the trip duration from the trip distance? YES. Is it meaningful? Yes: GPS navigation apps estimate your time of arrival as soon as you enter a destination.

### We will use machine learning to predict the trip duration from a distance between origin and destination points.

#### since trip duration is a continuous numerical value, we will use regression models of ML
if we had a categorical value to predict, we would use classification models instead

In [None]:
dockless.dtypes

In [None]:
# define X-y axis, excluding non-numerical values
y = dockless['tripduration'] # dependent variable
X = dockless.select_dtypes(exclude=['object']).drop(['tripduration'], axis=1)

In [None]:
# split dataset in train/test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 100)
# test_size = 0.33 keeps about 1/3 of the rows for testing and 2/3 for training, selected randomly
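
A quick way to confirm which subset gets which share is to split a toy array and count the rows. The sketch below uses 100 made-up rows with the same `test_size` and `random_state`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)   # 100 rows, 2 made-up features
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=100
)

n_train, n_test = len(X_tr), len(X_te)
```

The larger share goes to training; the test rows are held back and never seen during fitting.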

In [None]:
# linear regression
from sklearn import linear_model

lr = linear_model.LinearRegression()
lr_model =lr.fit(X_train, y_train)

lr_pred = lr_model.predict(X_test)

In [None]:
# calculate statistical measures
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import statistics

MAE = mean_absolute_error(y_test, lr_pred)
MSE = mean_squared_error(y_test, lr_pred)
RMSE = np.sqrt(MSE)
R2 = r2_score(y_test, lr_pred)
print("MAE: %6.2f" % (MAE))
print("MSE: %6.2f" % (MSE))
print("RMSE: %6.2f" % (RMSE))
print("R2: %6.2f" % (R2))
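
To make these four scores less of a black box, here is a sketch that computes them by hand with NumPy on made-up actual and predicted values:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])   # made-up actual durations
y_hat = np.array([12.0, 18.0, 33.0, 37.0])    # made-up predictions

errors = y_true - y_hat
mae = np.abs(errors).mean()        # average size of the error
mse = (errors ** 2).mean()         # squaring penalises large errors more
rmse = np.sqrt(mse)                # back in the unit of the target

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
```

MAE and RMSE are expressed in the unit of the target, so they are the easiest scores to explain to a non-technical audience.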

In [None]:
# Visualizing model performance
plt.scatter(y_test, lr_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predictions')

# Ideal predictions plot
plt.plot(y_test,y_test,'r')

In [None]:
# Plotting residuals
fig = plt.figure(figsize=(10,5))
residuals = (y_test- lr_pred)
sns.histplot(residuals, kde=True, stat='density')

#### not bad at all for a first run, but we will try to make it better (MAE < 5)
- one way is to use a tree-based estimator such as a decision tree regressor to identify which features contribute the most to the trip duration.

In [None]:
# import library
from sklearn.tree import DecisionTreeRegressor

# define the model with DecisionTreeRegressor
model = DecisionTreeRegressor()
# fit the model
model.fit(X_train, y_train)

In [None]:
importance = model.feature_importances_

print(importance[0])

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
    
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.show()
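
Feature indices are hard to read, so it may help to attach the column names to the scores. The sketch below uses synthetic data in which only `tripdistance` drives the target; the `noise_a` and `noise_b` columns are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "tripdistance": rng.uniform(0.1, 10, size=200),
    "noise_a": rng.normal(size=200),   # unrelated to the target
    "noise_b": rng.normal(size=200),   # unrelated to the target
})
y_toy = 5.0 * X_toy["tripdistance"] + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)

# pair each score with its column name and sort from most to least important
ranked = pd.Series(tree.feature_importances_, index=X_toy.columns)
ranked = ranked.sort_values(ascending=False)
```

`ranked.plot(kind='bar')` then gives a bar chart labelled with names instead of positions.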

#### pretty clear that trip distance is the only feature of valuable importance to predict trip duration

In [None]:
# recursive feature elimination
from sklearn.feature_selection import RFE

# define method
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=1)
# fit the model
rfe.fit(X,y)
# transform the data
X_rfe = rfe.transform(X)
print("num features: %d" % rfe.n_features_)
print("selected features: %s" % rfe.support_)
print("feature ranking: %s" % rfe.ranking_)
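
The boolean mask and the ranking are easier to read once they are mapped back to column names. This is a sketch on synthetic data, with made-up noise columns, where only `tripdistance` matters:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_toy = pd.DataFrame({
    "noise_a": rng.normal(size=150),                  # unrelated to the target
    "tripdistance": rng.uniform(0.1, 10, size=150),
    "noise_b": rng.normal(size=150),                  # unrelated to the target
})
y_toy = 5.0 * X_toy["tripdistance"] + rng.normal(scale=0.5, size=150)

selector = RFE(estimator=DecisionTreeRegressor(random_state=0), n_features_to_select=1)
selector.fit(X_toy, y_toy)

selected = X_toy.columns[selector.support_].tolist()    # names of the kept features
ranking = dict(zip(X_toy.columns, selector.ranking_))   # 1 = kept, higher = dropped earlier
```

A stricter variant would fit the selector on the training rows only, so that the test rows play no part in choosing the features.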

In [None]:
# train/test split the new dataset
X_train, X_test, y_train, y_test = train_test_split(X_rfe, y, test_size = 0.33, random_state = 100)

In [None]:
# linear regression on refined dataset
lr = linear_model.LinearRegression()
lr_model =lr.fit(X_train, y_train)

lr_pred2 = lr_model.predict(X_test)

In [None]:
# performance metrics
MAE2 = mean_absolute_error(y_test, lr_pred2)
MSE2 = mean_squared_error(y_test, lr_pred2)
RMSE2 = np.sqrt(MSE2)
R22 = r2_score(y_test, lr_pred2)
print("MAE: %6.2f" % (MAE2))
print("MSE: %6.2f" % (MSE2))
print("RMSE: %6.2f" % (RMSE2))
print("R2: %6.2f" % (R22))
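
Copying the metrics cell for every run makes it easy to mix variables from different runs. One possible safeguard is a small helper that returns all scores at once; the function name `evaluate` and the values below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return the four scores together so they always come from the same run."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_true, y_pred),
    }

y_true = np.array([10.0, 20.0, 30.0, 40.0])                  # made-up actual values
run_1 = evaluate(y_true, np.array([12.0, 18.0, 33.0, 37.0]))
run_2 = evaluate(y_true, np.array([11.0, 19.0, 31.0, 39.0]))

# one column per run makes the comparison easy to read
comparison = pd.DataFrame({"run 1": run_1, "run 2": run_2})
```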

#### Reducing features didn't make the predictions better.
It seems that the other variables, such as the origin/destination positions, also play a small role in the trip duration.

- another way to improve our machine learning algorithm will be to delete outliers from the dataset

In [None]:
# remember dockless_sub with trip duration < 90 min ? redefine X-y axis
y_sub = dockless_sub['tripduration'] # dependent variable
X_sub = dockless_sub.select_dtypes(exclude=['object']).drop(['tripduration'], axis=1)

In [None]:
# split sub dataset in train/test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sub, y_sub, test_size = 0.33, random_state = 100)

In [None]:
# linear regression
from sklearn import linear_model

lr = linear_model.LinearRegression()
lr_model =lr.fit(X_train, y_train)

lr_pred = lr_model.predict(X_test)

In [None]:
# calculate statistical measures
MAE = mean_absolute_error(y_test, lr_pred)
MSE = mean_squared_error(y_test, lr_pred)
RMSE = np.sqrt(MSE)
R2 = r2_score(y_test, lr_pred)
print("MAE: %6.2f" % (MAE))
print("MSE: %6.2f" % (MSE))
print("RMSE: %6.2f" % (RMSE))
print("R2: %6.2f" % (R2))

### better but we can try to improve it again

In [None]:
# as most trips are short, keep only trips with a trip distance < 20
dockless_sub = dockless_sub[dockless_sub['tripdistance'] < 20]

In [None]:
# redefine X-y axis on the tightened subset (trip duration < 90 min and trip distance < 20)
y_sub = dockless_sub['tripduration'] # dependent variable
X_sub = dockless_sub.select_dtypes(exclude=['object']).drop(['tripduration'], axis=1)

In [None]:
# split sub dataset in train/test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sub, y_sub, test_size = 0.33, random_state = 100)

In [None]:
# linear regression
from sklearn import linear_model

lr = linear_model.LinearRegression()
lr_model =lr.fit(X_train, y_train)

lr_pred = lr_model.predict(X_test)

In [None]:
# calculate statistical measures
MAE = mean_absolute_error(y_test, lr_pred)
MSE = mean_squared_error(y_test, lr_pred)
RMSE = np.sqrt(MSE)
R2 = r2_score(y_test, lr_pred)
print("MAE: %6.2f" % (MAE))
print("MSE: %6.2f" % (MSE))
print("RMSE: %6.2f" % (RMSE))
print("R2: %6.2f" % (R2))

# from 7.80 MAE we decreased to 5.98.

## Conclusion:
- this is how we program intelligent machines, by testing combinations of data and looping through datasets to reach the best predictive performances.
- when your accuracy score reaches a threshold value (usually >95%) you can deploy your model for production on completely new datasets and assume the predictions will be statistically valid.
- in a more advanced fashion, we could make predictions on time or spatial data: for example trying to predict the car traffic, or the destination of a trip depending on its origin and duration.
### Artificial intelligences are probabilities and statistics performed by performative brains (machines)!

## => Now your turn! Do the same with "Louisville-Dockless-Trips.csv" file. Follow the notebook line by line, fill the code with your data specificities