# Uber Fares 🚙🚙
In this exercise, we'll use Random Forests to estimate the price of an Uber ride.

## Importing libraries and dataset
0. Import the usual libraries and read the dataset from this URL:
"https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/Decision+trees/uber.csv"

In [173]:
import pandas as pd
import datetime as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    ConfusionMatrixDisplay,
    RocCurveDisplay,
)
import matplotlib.pyplot as plt
import warnings

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio


from sklearn.feature_selection import  SequentialFeatureSelector
from sklearn.metrics import classification_report
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


In [146]:
dataset = pd.read_csv("https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/Decision+trees/uber.csv")
dataset.head()

Unnamed: 0.1,Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,48462598,2015-05-07 10:24:44.0000004,13.0,2015-05-07 10:24:44 UTC,-73.971664,40.797035,-73.958939,40.777649,1
1,6637611,2014-07-09 09:14:04.0000002,5.5,2014-07-09 09:14:04 UTC,-73.991635,40.749855,-73.98825,40.741341,2
2,8357193,2013-11-11 18:51:00.000000240,8.5,2013-11-11 18:51:00 UTC,-73.982352,40.777042,-73.995912,40.759757,1
3,40466112,2014-05-22 01:54:00.00000069,19.0,2014-05-22 01:54:00 UTC,-73.991455,40.7517,-73.936357,40.812327,1
4,35405035,2011-06-21 23:37:33.0000002,7.7,2011-06-21 23:37:33 UTC,-73.974749,40.756255,-73.952276,40.778332,1


## Basic exploring and cleaning
1. Display basic statistics about the dataset. Do you notice any inconsistent values?

In [147]:
dataset.shape

(20000, 9)

In [148]:
print(dataset.describe(include="all").T) # transpose


                     count unique                          top freq  \
Unnamed: 0         20000.0    NaN                          NaN  NaN   
key                  20000  20000  2015-05-07 10:24:44.0000004    1   
fare_amount        20000.0    NaN                          NaN  NaN   
pickup_datetime      20000  19967      2012-08-28 14:03:00 UTC    2   
pickup_longitude   20000.0    NaN                          NaN  NaN   
pickup_latitude    20000.0    NaN                          NaN  NaN   
dropoff_longitude  20000.0    NaN                          NaN  NaN   
dropoff_latitude   20000.0    NaN                          NaN  NaN   
passenger_count    20000.0    NaN                          NaN  NaN   

                             mean              std        min          25%  \
Unnamed: 0         27679493.67825  16011228.617798     3949.0  13834759.75   
key                           NaN              NaN        NaN          NaN   
fare_amount             11.358151          9.89199     

In [149]:
print(dataset.duplicated().sum())


0


In [150]:
print(dataset.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         20000 non-null  int64  
 1   key                20000 non-null  object 
 2   fare_amount        20000 non-null  float64
 3   pickup_datetime    20000 non-null  object 
 4   pickup_longitude   20000 non-null  float64
 5   pickup_latitude    20000 non-null  float64
 6   dropoff_longitude  20000 non-null  float64
 7   dropoff_latitude   20000 non-null  float64
 8   passenger_count    20000 non-null  int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 1.4+ MB
None


2. Drop the useless columns and the rows containing outliers.

In [151]:
dataset = dataset.iloc[:,2:]

In [152]:
# drop the rows containing outliers (here, fares that are not strictly positive)
to_keep = (dataset["fare_amount"] > 0)
dataset = dataset.loc[to_keep, :]
dataset.shape


(19998, 7)
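
The filter above only looks at `fare_amount`. As a minimal sketch, the same boolean-mask pattern can be extended to other columns. The mini-sample and the thresholds below are hypothetical and only illustrate the idea; whether such rows actually occur in the Uber dataset should be checked against `describe()` first.

```python
import pandas as pd

# Hypothetical mini-sample reusing the dataset's column names
sample = pd.DataFrame({
    "fare_amount": [13.0, -2.5, 8.5, 7.7],
    "pickup_longitude": [-73.97, -73.99, 0.0, -73.97],
    "pickup_latitude": [40.79, 40.74, 0.0, 40.75],
    "dropoff_longitude": [-73.95, -73.98, -73.99, -73.95],
    "dropoff_latitude": [40.77, 40.74, 40.75, 40.77],
    "passenger_count": [1, 2, 1, 0],
})

lat_cols = ["pickup_latitude", "dropoff_latitude"]
lon_cols = ["pickup_longitude", "dropoff_longitude"]

# Assumed rules: positive fare, valid GPS ranges, no (0, 0) placeholders,
# at least one passenger
mask = (
    (sample["fare_amount"] > 0)
    & sample[lat_cols].abs().le(90).all(axis=1)
    & sample[lon_cols].abs().le(180).all(axis=1)
    & sample[lat_cols + lon_cols].ne(0).all(axis=1)
    & (sample["passenger_count"] > 0)
)
cleaned = sample.loc[mask, :]
print(cleaned.shape)
```

Combining the conditions into a single mask keeps the filtering rules in one place, which makes them easier to review and adjust.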

## Feature engineering
### Dealing with datetime objects
3. Convert the `pickup_datetime` column into datetime format. Use pandas' [dt accessor](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) to create the following columns:
* Year
* Month
* Day
* Weekday: contains the **name** of the day of week

Then, you can drop the column `pickup_datetime`.

In [153]:
dataset["pickup_datetime"] = pd.to_datetime(dataset["pickup_datetime"])
dataset.head() 

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,13.0,2015-05-07 10:24:44+00:00,-73.971664,40.797035,-73.958939,40.777649,1
1,5.5,2014-07-09 09:14:04+00:00,-73.991635,40.749855,-73.98825,40.741341,2
2,8.5,2013-11-11 18:51:00+00:00,-73.982352,40.777042,-73.995912,40.759757,1
3,19.0,2014-05-22 01:54:00+00:00,-73.991455,40.7517,-73.936357,40.812327,1
4,7.7,2011-06-21 23:37:33+00:00,-73.974749,40.756255,-73.952276,40.778332,1


In [154]:
dataset["year"] = dataset["pickup_datetime"].dt.year
dataset["month"] = dataset["pickup_datetime"].dt.month
dataset["day"] = dataset["pickup_datetime"].dt.day
dataset["weekday"] = dataset["pickup_datetime"].dt.day_name()
dataset.drop("pickup_datetime", axis=1, inplace=True)
dataset.head()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,year,month,day,weekday
0,13.0,-73.971664,40.797035,-73.958939,40.777649,1,2015,5,7,Thursday
1,5.5,-73.991635,40.749855,-73.98825,40.741341,2,2014,7,9,Wednesday
2,8.5,-73.982352,40.777042,-73.995912,40.759757,1,2013,11,11,Monday
3,19.0,-73.991455,40.7517,-73.936357,40.812327,1,2014,5,22,Thursday
4,7.7,-73.974749,40.756255,-73.952276,40.778332,1,2011,6,21,Tuesday
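
As a small self-contained illustration of the `dt` accessor, the sketch below parses the first two timestamps shown above and extracts the same parts. The `hour` column is an extra that the exercise does not ask for; it is included only as an assumption about what else might be worth extracting.

```python
import pandas as pd

# First two pickup timestamps from the dataset preview
ts = pd.Series(pd.to_datetime(["2015-05-07 10:24:44 UTC", "2014-07-09 09:14:04 UTC"]))

parts = pd.DataFrame({
    "year": ts.dt.year,
    "month": ts.dt.month,
    "day": ts.dt.day,
    "weekday": ts.dt.day_name(),  # name of the day of the week
    "hour": ts.dt.hour,           # optional extra, not part of the exercise
})
print(parts)
```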


### Haversine formula

It would be very useful to compute the ride distance from the GPS coordinates. The [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula) lets us do this 🤓:

$$
d = 2r \arcsin \big(\sqrt{\sin^2(\frac{\phi_2 - \phi_1}{2}) + \cos \phi_1 \cos \phi_2 \sin^2(\frac{\lambda_2 - \lambda_1}{2})} \big)
$$

where:
* $d$ is the ride distance in kilometers
* $r$ is the Earth's radius in kilometers
* $\phi_1$ is the pickup latitude in radians
* $\phi_2$ is the dropoff latitude in radians
* $\lambda_1$ is the pickup longitude in radians
* $\lambda_2$ is the dropoff longitude in radians

We've implemented for you a function that computes this formula for one ride with coordinates `lon_1`, `lon_2`, `lat_1` and `lat_2`:

In [155]:
def haversine(lon_1, lon_2, lat_1, lat_2):
    # Convert degrees to radians
    lon_1, lon_2, lat_1, lat_2 = map(np.radians, [lon_1, lon_2, lat_1, lat_2])

    diff_lon = lon_2 - lon_1
    diff_lat = lat_2 - lat_1

    # Haversine formula, with the Earth's radius r = 6371 km
    distance_km = 2 * 6371 * np.arcsin(
        np.sqrt(np.sin(diff_lat / 2.0) ** 2 + np.cos(lat_1) * np.cos(lat_2) * np.sin(diff_lon / 2.0) ** 2)
    )

    return distance_km

4. Apply the `haversine` function to the whole dataset to create a new column `ride_distance`. [This stackoverflow post](https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe?answertab=trending#tab-top) might help you!

In [156]:
dataset["ride_distance"] = dataset.apply(lambda row: haversine(row["pickup_longitude"], row["dropoff_longitude"], row["pickup_latitude"], row["dropoff_latitude"]),axis=1) 
dataset.head()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,year,month,day,weekday,ride_distance
0,13.0,-73.971664,40.797035,-73.958939,40.777649,1,2015,5,7,Thursday,2.407225
1,5.5,-73.991635,40.749855,-73.98825,40.741341,2,2014,7,9,Wednesday,0.988729
2,8.5,-73.982352,40.777042,-73.995912,40.759757,1,2013,11,11,Monday,2.235651
3,19.0,-73.991455,40.7517,-73.936357,40.812327,1,2014,5,22,Thursday,8.183379
4,7.7,-73.974749,40.756255,-73.952276,40.778332,1,2011,6,21,Tuesday,3.099698
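
Because `haversine` only uses NumPy functions, it should also accept whole columns at once, which avoids the row-by-row `apply`. The sketch below repeats the function so that it runs on its own and checks it on the first two rides of the preview; the `rides` frame is a hypothetical stand-in for `dataset`.

```python
import numpy as np
import pandas as pd

def haversine(lon_1, lon_2, lat_1, lat_2):
    # Convert degrees to radians
    lon_1, lon_2, lat_1, lat_2 = map(np.radians, [lon_1, lon_2, lat_1, lat_2])
    diff_lon = lon_2 - lon_1
    diff_lat = lat_2 - lat_1
    # Earth's radius: 6371 km
    return 2 * 6371 * np.arcsin(
        np.sqrt(np.sin(diff_lat / 2.0) ** 2 + np.cos(lat_1) * np.cos(lat_2) * np.sin(diff_lon / 2.0) ** 2)
    )

# First two rides from the preview above
rides = pd.DataFrame({
    "pickup_longitude": [-73.971664, -73.991635],
    "pickup_latitude": [40.797035, 40.749855],
    "dropoff_longitude": [-73.958939, -73.98825],
    "dropoff_latitude": [40.777649, 40.741341],
})

# Vectorized call: each argument is a full column
rides["ride_distance"] = haversine(
    rides["pickup_longitude"],
    rides["dropoff_longitude"],
    rides["pickup_latitude"],
    rides["dropoff_latitude"],
)
print(rides["ride_distance"])
```

On a 20,000-row dataset both approaches are fast enough, but the vectorized call usually scales better to larger data.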


## Preprocessing
5. Separate the target from the features

In [157]:
target_name = "fare_amount"

X = dataset.drop(target_name, axis = 1)
Y = dataset.loc[:,target_name]

Y.head()



0    13.0
1     5.5
2     8.5
3    19.0
4     7.7
Name: fare_amount, dtype: float64

In [158]:
X.head()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,year,month,day,weekday,ride_distance
0,-73.971664,40.797035,-73.958939,40.777649,1,2015,5,7,Thursday,2.407225
1,-73.991635,40.749855,-73.98825,40.741341,2,2014,7,9,Wednesday,0.988729
2,-73.982352,40.777042,-73.995912,40.759757,1,2013,11,11,Monday,2.235651
3,-73.991455,40.7517,-73.936357,40.812327,1,2014,5,22,Thursday,8.183379
4,-73.974749,40.756255,-73.952276,40.778332,1,2011,6,21,Tuesday,3.099698


6. Detect names of numeric/categorical features

In [159]:
numeric_features = X.select_dtypes(include="number").columns
numeric_features

Index(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'passenger_count', 'year', 'month', 'day',
       'ride_distance'],
      dtype='object')

In [160]:
categorical_features = X.select_dtypes(exclude="number").columns
categorical_features

Index(['weekday'], dtype='object')

7. Make a train/test split with test_size = 0.2

In [161]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

8. Perform all the necessary preprocessing.

Hint: in this exercise, we'll first create a baseline model with a multivariate **linear regression**. So don't forget to make all the transformations that are required for this kind of model 😉

In [162]:
numeric_transformer = Pipeline(
  steps=[
    # ("imputer_num", SimpleImputer(strategy="median")),  
    ("scaler_num" , StandardScaler()),
  ]
)


In [163]:
categorical_transformer = Pipeline(
  steps=[
    # ("imputer_cat", SimpleImputer(strategy="most_frequent")),  
     ("encoder_cat", OneHotEncoder()), # note: no drop="first" here
  ]
)

In [164]:
preprocessor = ColumnTransformer(
  transformers=[
      ("num", numeric_transformer,     numeric_features),
      ("cat", categorical_transformer, categorical_features),
  ]
)



In [165]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
X_train[0:5]


array([[-0.14344502,  0.14080045, -0.14430518,  0.12085588, -0.52674365,
         0.14178757, -1.22949054,  0.03136581, -0.04966598,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ],
       [-0.14549195,  0.14195277, -0.14828083,  0.11791826, -0.52674365,
         0.68019666,  1.08407898,  0.84070488, -0.0471955 ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.14693462,  0.13853306, -0.1486907 ,  0.11680219, -0.52674365,
        -1.47343971, -0.36190197,  0.72508501, -0.05173695,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [-0.14296573,  0.14131223, -0.1442259 ,  0.11449285, -0.52674365,
         0.68019666,  1.37327517, -0.66235339, -0.04180048,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ],
       [-0.14341738,  0.14246957, -0
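
The transformed array has 16 columns: the 9 scaled numeric features followed by 7 one-hot columns, one per weekday. The sketch below reproduces this column arithmetic on a hypothetical three-column toy frame, and also shows what `drop="first"` would change. Dropping one level is a common choice for linear models to avoid perfectly collinear dummy columns; it is shown here as an option, not as what the notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: 2 numeric columns, 1 categorical with 3 levels
toy = pd.DataFrame({
    "ride_distance": [2.4, 1.0, 8.2, 3.1],
    "passenger_count": [1, 2, 1, 1],
    "weekday": ["Thursday", "Wednesday", "Monday", "Thursday"],
})

def build_preprocessor(drop=None):
    # Same structure as the notebook's preprocessor, without the Pipeline wrappers
    return ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["ride_distance", "passenger_count"]),
        ("cat", OneHotEncoder(drop=drop), ["weekday"]),
    ])

full = build_preprocessor().fit_transform(toy)                  # 2 + 3 columns
reduced = build_preprocessor(drop="first").fit_transform(toy)   # 2 + 2 columns
print(full.shape, reduced.shape)
```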

# No LabelEncoder on y: the target is continuous (regression)

## Baseline: Linear Regression
9. Train a linear regression model and evaluate its performance. Is it satisfactory?

In [166]:
lin_reg_model = LinearRegression()
lin_reg_model.fit(X_train, Y_train)

Y_train_pred = lin_reg_model.predict(X_train)
Y_test_pred = lin_reg_model.predict(X_test)


lin_reg_model.score(X_train, Y_train)



0.02419859579741468

In [167]:
lin_reg_model.score(X_test, Y_test)


0.017058651115981593

## Random Forest
10. Train a Random Forest model with default hyperparameters. Is the performance better?

In [168]:
rf_regressor = RandomForestRegressor()
rf_regressor.fit(X_train, Y_train)


In [169]:
Y_train_pred = rf_regressor.predict(X_train)
Y_test_pred = rf_regressor.predict(X_test)

In [170]:
print("score training set : ", rf_regressor.score(X_train, Y_train))
print("score test set     : ", rf_regressor.score(X_test, Y_test))


score training set :  0.9687913290366996
score test set     :  0.765760487189884


### Grid search
11. Use grid search to tune the model's hyperparameters. You can try the following values:

```
params = {
    'max_depth': [10, 12, 14],
    'min_samples_split': [4, 8],
    'n_estimators': [60, 80, 100]
}
```



In [171]:
params = {
  'max_depth'         : [10, 12, 14],
  'min_samples_split' : [4, 8],
  'n_estimators'      : [60, 80, 100]
}

gridsearch = GridSearchCV(rf_regressor, param_grid=params, cv=3)  
# gridsearch = GridSearchCV(rf_regressor, param_grid=params, cv=3, verbose=2)  
gridsearch.fit(X_train, Y_train)


In [172]:
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best validation R2   : ", gridsearch.best_score_)

Best hyperparameters :  {'max_depth': 12, 'min_samples_split': 8, 'n_estimators': 80}
Best validation R2   :  0.7666427220949809
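
To get a feel for the cost of this search, the number of candidates can be counted before fitting anything. The sketch below uses scikit-learn's `ParameterGrid` on the same grid; `n_folds` is simply the `cv=3` value used above.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "max_depth": [10, 12, 14],
    "min_samples_split": [4, 8],
    "n_estimators": [60, 80, 100],
}

n_folds = 3                                # same as cv=3 in the grid search
n_candidates = len(ParameterGrid(params))  # 3 * 2 * 3 combinations
n_fits = n_candidates * n_folds            # cross-validation fits

print(n_candidates, n_fits)
```

With the default `refit=True`, the grid search then trains one more model on the whole training set using the best combination.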


### Performances
12. Display the R2-score and the [mean absolute error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?highlight=mean%20absolute%20error#sklearn.metrics.mean_absolute_error) on train set and test set. What do you think of this model? Would it be interesting to use it to estimate the fares on new data?

In [175]:
print("R2 score on training set            : ", gridsearch.score(X_train, Y_train))
print("R2 score on test set                : ", gridsearch.score(X_test, Y_test))

Y_train_pred = gridsearch.predict(X_train)
Y_test_pred = gridsearch.predict(X_test)

print("Mean Absolute Error on training set : ", mean_absolute_error(Y_train, Y_train_pred))
print("Mean Fare on training set           : ", Y_train.mean())

print("Mean Absolute Error on test set     : ", mean_absolute_error(Y_test, Y_test_pred))
print("Mean Fare on test set               : ", Y_test.mean())
print("Standard-deviation on test set      : ", Y_test.std())

R2 score on training set            :  0.8943738472056981
R2 score on test set                :  0.7759783206969804
Mean Absolute Error on training set :  1.6856426840790661
Mean Fare on training set           :  11.330150643830478
Mean Absolute Error on test set     :  2.16849539022123
Mean Fare on test set               :  11.481740000000002
Standard-deviation on test set      :  9.631479542206511


R2 score on training set :  0.8773412391126457
R2 score on test set :  0.7793227198490924
Predictions on training set...
...Done.
[ 6.17142586  9.3272303   4.84357674 ...  7.81504034 24.47627732
 10.67468317]

Predictions on test set...
...Done.
[22.10284768  7.84187096 18.5937745  ...  5.27930482  8.80036975
  7.59332401]

Mean Absolute Error on training set :  1.8398203933642716
Mean Fare on training set :  11.330150643830464

Mean Absolute Error on test set :  2.1609093314552403
Mean Fare on test set :  11.481740000000038
Standard-deviation on test set :  9.631479542206511
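
One way to read these numbers is to compare the mean absolute error with the mean fare. The sketch below first checks the definition of the MAE on a few hypothetical predictions, then computes that ratio from the test-set values printed by the grid-searched model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Definition check on hypothetical predictions
y_true = np.array([13.0, 5.5, 8.5, 19.0, 7.7])
y_pred = np.array([11.0, 6.5, 8.0, 16.0, 9.0])
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

# Test-set values copied from the printed output above
test_mae = 2.16849539022123
test_mean_fare = 11.481740000000002
relative_error = test_mae / test_mean_fare

print(mae_manual, mae_sklearn)
print(f"MAE as a share of the mean fare: {relative_error:.1%}")
```

An average error of a bit under a fifth of the mean fare, together with the gap between the train and test scores, suggests the model is usable as a rough estimator but still overfits somewhat.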


## Feature importance
13. Make a bar plot with the importances of each feature. Are you surprised?

In [179]:
column_names = []
for name, step, features_list in preprocessor.transformers_:  
  if name == "num":  
    features = features_list
  else:  
    features = step.get_feature_names_out()
  column_names.extend(features)  # concatenate features names

column_names

['pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count',
 'year',
 'month',
 'day',
 'ride_distance',
 'weekday_Friday',
 'weekday_Monday',
 'weekday_Saturday',
 'weekday_Sunday',
 'weekday_Thursday',
 'weekday_Tuesday',
 'weekday_Wednesday']

In [177]:
# Create a pandas DataFrame
feature_importance = pd.DataFrame(
  index=column_names,
  data=gridsearch.best_estimator_.feature_importances_,
  columns=["feature_importances"],
)

feature_importance = feature_importance.sort_values(by="feature_importances")

In [182]:
fig = px.bar(feature_importance, orientation="h")
fig.update_layout(
    showlegend=False, margin={"l": 120}  
)
fig.show()
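
When reading this plot, it helps to remember that a forest's `feature_importances_` are normalized: they sum to 1, so each bar is a share of the total. The sketch below illustrates this on hypothetical synthetic data in which only the first feature drives the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 3 features, only feature 0 influences the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
importances = model.feature_importances_

print(importances, importances.sum())
```

These impurity-based importances are computed on the training data, so they describe what the forest relied on rather than what generalizes.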

# It would be worth checking what happens after dropping the longitudes and latitudes that were used to compute ride_distance

14. Would the model be able to make good predictions if we hadn't included the ride distance by hand? Train a new Random Forest model (with grid search) by dropping the `ride_distance` column from the features, and conclude.

# TO BE FINISHED
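
One possible way to set up this last experiment is sketched below. It runs on a small synthetic stand-in for the cleaned dataset, so the data, the crude distance proxy and the reduced grid are all assumptions made to keep the example fast; the scores it prints say nothing about the real Uber data. The preprocessing and the forest are wrapped in a single `Pipeline`, which is a design choice of this sketch and differs from the separate `fit_transform` calls used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned dataset (hypothetical values)
rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame({
    "pickup_longitude": rng.uniform(-74.02, -73.93, n),
    "pickup_latitude": rng.uniform(40.70, 40.82, n),
    "dropoff_longitude": rng.uniform(-74.02, -73.93, n),
    "dropoff_latitude": rng.uniform(40.70, 40.82, n),
    "weekday": rng.choice(["Monday", "Friday", "Sunday"], n),
})
# Crude distance proxy for the synthetic data, not the Haversine distance
data["ride_distance"] = 100 * np.hypot(
    data["dropoff_longitude"] - data["pickup_longitude"],
    data["dropoff_latitude"] - data["pickup_latitude"],
)
data["fare_amount"] = 2.5 + 1.5 * data["ride_distance"] + rng.normal(0, 0.5, n)

# Drop the target AND the hand-made distance feature
X = data.drop(columns=["fare_amount", "ride_distance"])
y = data["fare_amount"]

numeric_features = X.select_dtypes(include="number").columns
categorical_features = X.select_dtypes(exclude="number").columns

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
model = Pipeline(steps=[
    ("prep", preprocessor),
    ("rf", RandomForestRegressor(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Reduced grid to keep the sketch fast; the exercise's grid can be used instead
params = {"rf__max_depth": [4, 8], "rf__n_estimators": [20, 40]}
search = GridSearchCV(model, param_grid=params, cv=3)
search.fit(X_train, y_train)

print("Best hyperparameters : ", search.best_params_)
print("R2 score on test set : ", search.score(X_test, y_test))
```

Comparing the resulting test R2 and MAE with those of the model that includes `ride_distance` shows how much the forest manages to recover from the raw coordinates alone.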

Found numeric features  ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'year', 'month', 'day']
Found categorical features  ['weekday']


Dividing into train and test sets...
...Done.



Performing preprocessings on train set...
       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  \
8152         -73.966328        40.757692         -73.958325         40.768067   
1380         -73.988100        40.764807         -74.001052         40.746947   
14079        -74.003445        40.743692         -74.005457         40.738923   
7725         -73.961230        40.760852         -73.957473         40.722320   
14918        -73.966034        40.767998         -73.954902         40.783116   

       passenger_count  year  month  day    weekday  
8152                 1  2012      2   16   Thursday  
1380                 1  2013     10   23  Wednesday  
14079                1  2009      5   22     Friday  
7725                 1  2013     11   10     Sunday  
14918                1  2014      4    2  Wednesday  
...Done.
[[-0.14344502  0.14080045 -0.14430518  0.12085588 -0.52674365  0.14178757
  -1.22949054  0.03136581  0.          0.          0.          1

Grid search...
Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV] END .max_depth=10, min_samples_split=4, n_estimators=60; total time=   2.6s
[CV] END .max_depth=10, min_samples_split=4, n_estimators=60; total time=   2.5s
[CV] END .max_depth=10, min_samples_split=4, n_estimators=60; total time=   2.5s
[CV] END .max_depth=10, min_samples_split=4, n_estimators=80; total time=   3.4s
[CV] END .max_depth=10, min_samples_split=4, n_estimators=80; total time=   3.4s
[CV] END .max_depth=10, min_samples_split=4, n_estimators=80; total time=   3.4s
[CV] END max_depth=10, min_samples_split=4, n_estimators=100; total time=   4.2s
[CV] END max_depth=10, min_samples_split=4, n_estimators=100; total time=   4.2s
[CV] END max_depth=10, min_samples_split=4, n_estimators=100; total time=   4.3s
[CV] END .max_depth=10, min_samples_split=8, n_estimators=60; total time=   2.6s
[CV] END .max_depth=10, min_samples_split=8, n_estimators=60; total time=   2.6s
[CV] END .max_depth=10, min_sampl

R2 score on training set :  0.850107087831722
R2 score on test set :  0.713239229422019
Predictions on training set...
...Done.
[ 7.45782206  8.61356464  7.42620888 ...  7.48197772 25.77105996
  8.8216215 ]

Predictions on test set...
...Done.
[14.02542084  8.62151493  7.51974234 ...  7.44094536  7.82347779
  8.03063538]

Mean Absolute Error on training set :  2.4255949901205986
Mean Fare on training set :  11.330150643830464

Mean Absolute Error on test set :  2.860988261156146
Mean Fare on test set :  11.481740000000038
Standard-deviation on test set :  9.631479542206511
