# Taxi fare prediction

Machine learning models are well suited to prediction tasks. A common application is estimating the fare of a taxi ride from data on past rides. This project uses rides recorded between 2009 and 2015 to predict the fare. The available fields are:
- fare amount
- pickup time
- pickup longitude
- pickup latitude
- dropoff longitude
- dropoff latitude
- number of passengers

The project covers exploratory data analysis (EDA), data cleaning and preparation, linear regression and Random Forest models, and hyperparameter tuning.

# 1) Importing libraries


In [40]:
import pandas as pd
import numpy as np
import seaborn as srn
import statistics as sts
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from math import sqrt
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
import joblib
from pathlib import Path

# 2) Inspecting the data


In [41]:
dataset = pd.read_csv('data_driver.csv', sep=',')

dataset.head()

| | Unnamed: 0 | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 24238194 | 2015-05-07 19:52:06.0000003 | 7.5 | 2015-05-07 19:52:06 UTC | -73.999817 | 40.738354 | -73.999512 | 40.723217 | 1 |
| 1 | 27835199 | 2009-07-17 20:04:56.0000002 | 7.7 | 2009-07-17 20:04:56 UTC | -73.994355 | 40.728225 | -73.99471 | 40.750325 | 1 |
| 2 | 44984355 | 2009-08-24 21:45:00.00000061 | 12.9 | 2009-08-24 21:45:00 UTC | -74.005043 | 40.74077 | -73.962565 | 40.772647 | 1 |
| 3 | 25894730 | 2009-06-26 08:22:21.0000001 | 5.3 | 2009-06-26 08:22:21 UTC | -73.976124 | 40.790844 | -73.965316 | 40.803349 | 3 |
| 4 | 17610152 | 2014-08-28 17:47:00.000000188 | 16.0 | 2014-08-28 17:47:00 UTC | -73.925023 | 40.744085 | -73.973082 | 40.761247 | 5 |
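The `Unnamed: 0` column in the output above is an index that was saved into the CSV file. As a minimal sketch, assuming a small in-memory stand-in for the file (the `csv_text` string below is illustrative, not the real data), `index_col=0` tells `pd.read_csv` to use that column as the index instead of loading it as a feature:

```python
import io

import pandas as pd

# Tiny stand-in for the CSV file; the first, unnamed column is a saved index.
csv_text = (
    ",key,fare_amount,passenger_count\n"
    "24238194,2015-05-07 19:52:06.0000003,7.5,1\n"
    "27835199,2009-07-17 20:04:56.0000002,7.7,1\n"
)

# Default read: the unnamed column is kept as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

This notebook keeps the default read and drops the column later, which gives the same feature set.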


In [42]:
dataset.shape

(200000, 9)

In [43]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         200000 non-null  int64  
 1   key                200000 non-null  object 
 2   fare_amount        200000 non-null  float64
 3   pickup_datetime    200000 non-null  object 
 4   pickup_longitude   200000 non-null  float64
 5   pickup_latitude    200000 non-null  float64
 6   dropoff_longitude  199999 non-null  float64
 7   dropoff_latitude   199999 non-null  float64
 8   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB


In [44]:
dataset.describe()

| | Unnamed: 0 | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|
| count | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 199999.0 | 199999.0 | 200000.0 |
| mean | 27712500.0 | 11.359955 | -72.527638 | 39.935885 | -72.525292 | 39.92389 | 1.684535 |
| std | 16013820.0 | 9.901776 | 11.437787 | 7.720539 | 13.117408 | 6.794829 | 1.385997 |
| min | 1.0 | -52.0 | -1340.64841 | -74.015515 | -3356.6663 | -881.985513 | 0.0 |
| 25% | 13825350.0 | 6.0 | -73.992065 | 40.734796 | -73.991407 | 40.733823 | 1.0 |
| 50% | 27745500.0 | 8.5 | -73.981823 | 40.752592 | -73.980093 | 40.753042 | 1.0 |
| 75% | 41555300.0 | 12.5 | -73.967154 | 40.767158 | -73.963658 | 40.768001 | 2.0 |
| max | 55423570.0 | 499.0 | 57.418457 | 1644.421482 | 1153.572603 | 872.697628 | 208.0 |


- 8 variables (9 columns counting the leftover index column Unnamed: 0)
- The target variable is fare_amount (the fare charged for the ride)
- fare_amount contains outliers (minimum of -52 and maximum of 499)
- dropoff_longitude and dropoff_latitude each have one missing value
- passenger_count contains zeros, which makes no sense for a ride

- The 'key' variable can be removed, since it only serves as an identifier
- The 'pickup_datetime' variable must be converted to the datetime format
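The observations above can be turned into explicit counts. The following is a minimal sketch on a made-up four-row frame (the `sample` values are illustrative and only mimic the problems seen in `describe()` and `info()`); the check names in `checks` are hypothetical labels, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the problems seen in describe()/info(); values are illustrative.
sample = pd.DataFrame({
    "fare_amount": [7.5, -52.0, 499.0, 8.5],
    "dropoff_longitude": [-73.99, np.nan, -73.96, -3356.67],
    "dropoff_latitude": [40.72, np.nan, 40.77, 40.75],
    "passenger_count": [1, 0, 208, 2],
})

dropoff = sample[["dropoff_longitude", "dropoff_latitude"]]
checks = {
    # Rows where either dropoff coordinate is missing.
    "missing_dropoff": int(dropoff.isna().any(axis=1).sum()),
    # Fares that are zero or negative.
    "non_positive_fare": int((sample["fare_amount"] <= 0).sum()),
    # Rides recorded with no passengers.
    "zero_passengers": int((sample["passenger_count"] == 0).sum()),
    # Longitudes outside the valid [-180, 180] range.
    "longitude_out_of_range": int((sample["dropoff_longitude"].abs() > 180).sum()),
}
print(checks)
```

Running the same expressions on `dataset` would quantify each issue before deciding how to filter it.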

In [45]:
dataset = dataset.drop(['key'], axis=1)

dataset['pickup_datetime'] = pd.to_datetime(dataset['pickup_datetime'])

min_date = dataset['pickup_datetime'].min()
max_date = dataset['pickup_datetime'].max()
print(min_date)
print(max_date)

2009-01-01 01:15:22+00:00
2015-06-30 23:40:39+00:00


In [46]:
# Correlation of each feature with the target variable

correlation = dataset.corr()
correlation['fare_amount'].sort_values(ascending=False)

fare_amount          1.000000
pickup_datetime      0.122769
pickup_longitude     0.010457
passenger_count      0.010150
dropoff_longitude    0.008986
Unnamed: 0           0.000589
pickup_latitude     -0.008481
dropoff_latitude    -0.011014
Name: fare_amount, dtype: float64

The correlations with the target variable are still weak. The next step is therefore to derive new features: the trip distance in miles, computed from the pickup and dropoff coordinates, and the day of the week, hour, minute and millisecond of the pickup time (the last three measured from midnight).

In [47]:
def calcular_diferenca_em_milissegundos(data_hora):
    inicio_do_dia = data_hora.replace(hour=0, minute=0, second=0, microsecond=0)
    diferenca_em_milissegundos = int((data_hora - inicio_do_dia).total_seconds() * 1000)
    return diferenca_em_milissegundos

# Apply the function to the 'pickup_datetime' column and create the new 'pickup_millisecond' column
dataset['pickup_millisecond'] = dataset['pickup_datetime'].apply(calcular_diferenca_em_milissegundos)

def calcular_diferenca_em_minutos(data_hora):
    inicio_do_dia = data_hora.replace(hour=0, minute=0, second=0, microsecond=0)
    diferenca_em_minutos = int((data_hora - inicio_do_dia).total_seconds() / 60)
    return diferenca_em_minutos

# Apply the function to the 'pickup_datetime' column and create the new 'pickup_minutes' column
dataset['pickup_minutes'] = dataset['pickup_datetime'].apply(calcular_diferenca_em_minutos)

def calcular_diferenca_em_horas(data_hora):
    inicio_do_dia = data_hora.replace(hour=0, minute=0, second=0, microsecond=0)
    diferenca_em_horas = int((data_hora - inicio_do_dia).total_seconds() / 3600)
    return diferenca_em_horas

# Apply the function to the 'pickup_datetime' column and create the new 'pickup_hours' column
dataset['pickup_hours'] = dataset['pickup_datetime'].apply(calcular_diferenca_em_horas)
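The three helper functions above run once per row. As a sketch of an alternative, the same time-of-day features can be computed for the whole column at once with the pandas `.dt` accessor. The two timestamps below are taken from the rows shown earlier; the `features` frame is an illustrative name, and this version assumes whole-second timestamps (it ignores any sub-second part):

```python
import pandas as pd

# Two pickup times from the rows shown earlier, parsed as UTC.
ts = pd.Series(pd.to_datetime(
    ["2015-05-07 19:52:06", "2009-06-26 08:22:21"], utc=True
))

# Seconds elapsed since midnight, computed for the whole column at once.
seconds = ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second

features = pd.DataFrame({
    "pickup_millisecond": seconds * 1000,
    "pickup_minutes": seconds // 60,
    "pickup_hours": ts.dt.hour,
    "week_day": ts.dt.weekday,  # Monday == 0
})
print(features)
```

On a 200,000-row frame this avoids 600,000 Python-level function calls, which is usually noticeably faster than `apply`.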

In [48]:
for i, row in dataset.iterrows():
    dt = row['pickup_datetime']
    dataset.at[i, 'week_day'] = dt.weekday()
    x = (row['dropoff_longitude'] - row['pickup_longitude']) * 54.6 # 1 degree of longitude ~ 54.6 miles
    y = (row['dropoff_latitude'] - row['pickup_latitude']) * 69.0   # 1 degree of latitude ~ 69 miles
    distance = sqrt(x**2 + y**2)
    dataset.at[i, 'distance'] = distance
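The loop above applies a flat-earth approximation: each coordinate difference is scaled to miles and the two legs are combined with Pythagoras. A minimal vectorized sketch of the same formula, using the same two constants and the first two rows printed by `dataset.head()`, could look like this (the `trips` frame and the constant names are illustrative):

```python
import numpy as np
import pandas as pd

# First two rows from dataset.head(), with coordinates as displayed there.
trips = pd.DataFrame({
    "pickup_longitude": [-73.999817, -73.994355],
    "pickup_latitude": [40.738354, 40.728225],
    "dropoff_longitude": [-73.999512, -73.994710],
    "dropoff_latitude": [40.723217, 40.750325],
})

MILES_PER_DEG_LON = 54.6  # same constants as the loop above
MILES_PER_DEG_LAT = 69.0

# East-west and north-south legs in miles.
dx = (trips["dropoff_longitude"] - trips["pickup_longitude"]) * MILES_PER_DEG_LON
dy = (trips["dropoff_latitude"] - trips["pickup_latitude"]) * MILES_PER_DEG_LAT

# Straight-line distance: sqrt(dx**2 + dy**2), element-wise.
trips["distance"] = np.hypot(dx, dy)
print(trips["distance"].tolist())
```

The results should closely match the `distance` values the notebook computes for these rows; small differences are expected because the coordinates shown in the output are rounded.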

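With the new features in place, we recompute the correlations with the target. At this stage the values are still low (distance is only 0.011), because the outliers have not been removed yet: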

In [49]:
correlation = dataset.corr()
correlation['fare_amount'].sort_values(ascending=False)

fare_amount           1.000000
pickup_datetime       0.122769
distance              0.011112
pickup_longitude      0.010457
passenger_count       0.010150
dropoff_longitude     0.008986
week_day              0.007501
Unnamed: 0            0.000589
pickup_latitude      -0.008481
dropoff_latitude     -0.011014
pickup_hours         -0.021473
pickup_minutes       -0.021806
pickup_millisecond   -0.021808
Name: fare_amount, dtype: float64

We drop the low-correlation columns that are no longer needed: the leftover index and the raw coordinates, which distance now summarizes:

In [50]:

dataset.drop(columns=['Unnamed: 0', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'], inplace=True)
dataset.head()

| | fare_amount | pickup_datetime | passenger_count | pickup_millisecond | pickup_minutes | pickup_hours | week_day | distance |
|---|---|---|---|---|---|---|---|---|
| 0 | 7.5 | 2015-05-07 19:52:06+00:00 | 1 | 71526000 | 1192 | 19 | 3.0 | 1.044567 |
| 1 | 7.7 | 2009-07-17 20:04:56+00:00 | 1 | 72296000 | 1204 | 20 | 4.0 | 1.525023 |
| 2 | 12.9 | 2009-08-24 21:45:00+00:00 | 1 | 78300000 | 1305 | 21 | 0.0 | 3.196405 |
| 3 | 5.3 | 2009-06-26 08:22:21+00:00 | 3 | 30141000 | 502 | 8 | 4.0 | 1.045342 |
| 4 | 16.0 | 2014-08-28 17:47:00+00:00 | 5 | 64020000 | 1067 | 17 | 3.0 | 2.878848 |


We drop rows with passenger_count below 1 or above 6 (outliers):

In [51]:
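# .copy() makes the filtered frame independent, so the column assignment below does not trigger SettingWithCopyWarning
dataset = dataset[(dataset['passenger_count'] <= 6) & (dataset['passenger_count'] > 0)].copy()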

dataset['pickup_datetime'] = pd.to_datetime(dataset['pickup_datetime'])

In [52]:
dataset.describe(percentiles=[0.01, 0.1, 0.15, 0.25, 0.5, 0.75, 0.95, 0.975]).transpose()

| | count | mean | std | min | 1% | 10% | 15% | 25% | 50% | 75% | 95% | 97.5% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fare_amount | 199290.0 | 11.36671 | 9.910588 | -52.0 | 3.3 | 4.5 | 5.0 | 6.0 | 8.5 | 12.5 | 30.33 | 43.49325 | 499.0 |
| passenger_count | 199290.0 | 1.689493 | 1.30542 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 | 5.0 | 6.0 |
| pickup_millisecond | 199290.0 | 50361330.0 | 23482530.0 | 0.0 | 780000.0 | 12676800.0 | 25260000.0 | 33851250.0 | 52680000.0 | 70260000.0 | 82729000.0 | 84489780.0 | 86399000.0 |
| pickup_minutes | 199290.0 | 839.093 | 391.3796 | 0.0 | 13.0 | 211.0 | 421.0 | 564.0 | 878.0 | 1171.0 | 1378.0 | 1408.0 | 1439.0 |
| pickup_hours | 199290.0 | 13.49255 | 6.51627 | 0.0 | 0.0 | 3.0 | 7.0 | 9.0 | 14.0 | 19.0 | 22.0 | 23.0 | 23.0 |
| week_day | 199290.0 | 3.049395 | 1.946746 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 5.0 | 6.0 | 6.0 | 6.0 |
| distance | 199290.0 | 15.05396 | 557.4972 | 0.0 | 0.0 | 0.4584712 | 0.5709057 | 0.768395 | 1.339259 | 2.436893 | 6.45923 | 9.762905 | 161505.0 |


- Rows with fare_amount of 2 or less, or of 45 or more, are dropped (outliers)
- Rows with distance equal to 0, or of 10 miles or more, are dropped (outliers)

In [53]:
dataset = dataset[(dataset['fare_amount'] > 2) & (dataset['fare_amount'] < 45)]
dataset = dataset[(dataset['distance'] > 0) & (dataset['distance'] < 10)]

Recomputing the correlations shows a drastic improvement for distance (from 0.011 to 0.876) once the outliers are removed:

In [54]:
correlation = dataset.corr()
correlation['fare_amount'].sort_values(ascending=False)

fare_amount           1.000000
distance              0.876391
pickup_datetime       0.142341
passenger_count       0.013886
week_day              0.009700
pickup_hours         -0.022775
pickup_millisecond   -0.023121
pickup_minutes       -0.023123
Name: fare_amount, dtype: float64
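The jump in the distance correlation after filtering is typical of Pearson correlation, which is very sensitive to extreme values. The sketch below illustrates the effect on synthetic data only: the linear fare rule and its coefficients are made up for the demonstration, and five rows are corrupted with a huge distance to imitate the invalid coordinates in the raw data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic trips: fare grows roughly linearly with distance (made-up coefficients).
distance = rng.uniform(0.5, 10.0, size=1000)
fare = 2.5 + 2.0 * distance + rng.normal(0.0, 1.5, size=1000)

# Pearson correlation on the clean data.
clean_corr = np.corrcoef(distance, fare)[0, 1]

# Corrupt a handful of rows with absurd distances and recompute.
bad_distance = distance.copy()
bad_distance[:5] = 150000.0
dirty_corr = np.corrcoef(bad_distance, fare)[0, 1]

print(f"clean: {clean_corr:.3f}  with outliers: {dirty_corr:.3f}")
```

A few corrupted rows out of a thousand are enough to hide a strong linear relationship, which is why the filtering step matters before any correlation-based feature selection.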

# 3) Treinamento de modelos

## Splitting the data into training and test sets

In [55]:
# Features: every column except the target and the raw timestamp
X = dataset.drop(columns=['fare_amount', 'pickup_datetime'])
Y = dataset['fare_amount']

x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state=7)

## Building regression models

We compare three regression models: Linear Regression, Random Forest and Extra Trees. We also tune the hyperparameters of the Random Forest model and check which model performs best.
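Every model in the comparison goes through the same fit, predict and score steps. As a sketch, those steps can be wrapped in one helper; the `evaluate` function and the synthetic data below are illustrative assumptions and are not part of the original notebook, which scores each model in its own cell.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, x_train, x_test, y_train, y_test):
    """Fit the model and return MAE, MSE and R2 on the test split."""
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    return {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }


# Synthetic stand-in data: the target depends linearly on the first feature.
rng = np.random.default_rng(7)
X_demo = rng.uniform(0.0, 10.0, size=(500, 3))
y_demo = 2.5 + 2.0 * X_demo[:, 0] + rng.normal(0.0, 1.0, size=500)
splits = train_test_split(X_demo, y_demo, test_size=0.30, random_state=7)
xtr, xte, ytr, yte = splits

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=7),
    "Extra Trees Regressor": ExtraTreesRegressor(n_estimators=50, random_state=7),
}
demo_scores = {name: evaluate(m, xtr, xte, ytr, yte) for name, m in models.items()}
for name, metrics in demo_scores.items():
    print(name, metrics)
```

Fixing `random_state` on the tree ensembles, as done here, also makes the scores reproducible between runs.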

In [56]:
scores = {}

In [57]:
linear_regression_model = LinearRegression()
linear_regression_model.fit(x_train, y_train)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [58]:
LR_predict = linear_regression_model.predict(x_test)

In [59]:
LR_r2_score = r2_score(y_test, LR_predict)
LR_MAE = mean_absolute_error(y_test, LR_predict)
LR_MSE = mean_squared_error(y_test, LR_predict)

print('r2_score: ', LR_r2_score)
print('MAE: ', LR_MAE)
print('MSE: ', LR_MSE)

scores['Linear Regression'] = {'MAE': LR_MAE, 'MSE': LR_MSE, 'R2': LR_r2_score}

r2_score:  0.7677961652448972
MAE:  1.993897390250624
MSE:  9.6968619511865


In [60]:
random_forest_model = RandomForestRegressor()
random_forest_model.fit(x_train, y_train)

RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True)


In [61]:
RF_predict = random_forest_model.predict(x_test)

In [62]:
RF_r2_score = r2_score(y_test, RF_predict)
RF_MAE = mean_absolute_error(y_test, RF_predict)
RF_MSE = mean_squared_error(y_test, RF_predict)

print('r2_score: ', RF_r2_score)
print('MAE: ', RF_MAE)
print('MSE: ', RF_MSE)

scores['Random Forest'] = {'MAE': RF_MAE, 'MSE': RF_MSE, 'R2': RF_r2_score}

r2_score:  0.7690057135660799
MAE:  2.027107805544224
MSE:  9.646351058004363


In [63]:
forest = ExtraTreesRegressor()
forest.fit(x_train, y_train)
importancies = forest.feature_importances_
importancies

array([0.019699  , 0.04899931, 0.04693099, 0.0184599 , 0.02795251,
       0.8379583 ])
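The importance array above has no labels. Its values follow the column order of `X`, so pairing them with the column names makes the ranking readable. The sketch below hard-codes the six feature names in the order they appear after the earlier column drops, together with the importances printed above; in the notebook itself, `x_train.columns` would supply the names.

```python
import numpy as np
import pandas as pd

# Column order of X in this notebook, and the importances printed above.
feature_names = ["passenger_count", "pickup_millisecond", "pickup_minutes",
                 "pickup_hours", "week_day", "distance"]
importances = np.array([0.019699, 0.04899931, 0.04693099,
                        0.0184599, 0.02795251, 0.8379583])

# Label each importance and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)

# Features above a 0.045 importance threshold.
selected = ranked[ranked > 0.045].index.tolist()
print(selected)
```

Selecting by name rather than by position also keeps the selection correct if the column order changes.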

In [64]:
forest_predict = forest.predict(x_test)

forest_r2_score = r2_score(y_test, forest_predict)
forest_MAE = mean_absolute_error(y_test, forest_predict)
forest_MSE = mean_squared_error(y_test, forest_predict)

print('r2_score: ', forest_r2_score)
print('MAE: ', forest_MAE)
print('MSE: ', forest_MSE)

scores['Extra Trees Regressor'] = {'MAE': forest_MAE, 'MSE': forest_MSE, 'R2': forest_r2_score}

r2_score:  0.7461094832608046
MAE:  2.1260974389443623
MSE:  10.602500575117142


### Using the most important features

We retrain the Random Forest model using only the features with the highest importance scores, taken from the Extra Trees model above.

In [65]:
important_features = np.array(importancies)
indexes = np.where(important_features > 0.045)[0]
important_feature_columns = x_train.iloc[:, indexes]
x_test_important_feature = x_test.iloc[:, indexes]

In [66]:
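# Drop pickup_minutes: it encodes the same time of day as pickup_millisecond at a coarser resolution
important_feature_columns = important_feature_columns.drop('pickup_minutes', axis=1)
x_test_important_feature = x_test_important_feature.drop('pickup_minutes', axis=1)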

In [67]:
random_forest_importancies_model = RandomForestRegressor()
random_forest_importancies_model.fit(important_feature_columns, y_train)

RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True)


In [68]:
RF_importancies_predict = random_forest_importancies_model.predict(x_test_important_feature)

In [69]:
RF_importancies_r2_score = r2_score(y_test, RF_importancies_predict)
RF_importancies_MAE = mean_absolute_error(y_test, RF_importancies_predict)
RF_importancies_MSE = mean_squared_error(y_test, RF_importancies_predict)

print('r2_score: ', RF_importancies_r2_score)
print('MAE: ', RF_importancies_MAE)
print('MSE: ', RF_importancies_MSE)

scores['Random Forest Importancies'] = {'MAE': RF_importancies_MAE, 'MSE': RF_importancies_MSE, 'R2': RF_importancies_r2_score}

r2_score:  0.7538625112621841
MAE:  2.0940031197346713
MSE:  10.278733130396219


### Tuning the hyperparameters of the Random Forest model

In [70]:
random_grid = {
    'max_depth': [None, 16],
    'min_samples_leaf': [1, 3],
    'n_estimators': [300, 400, 600]
}

rf = RandomForestRegressor()

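# The grid has only 2 * 2 * 3 = 12 combinations, so n_iter=100 is capped at 12 and every combination is tried
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=7, n_jobs=-1)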

In [71]:
rf_random.fit(important_feature_columns, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits




RandomizedSearchCV(estimator=RandomForestRegressor(), param_distributions={'max_depth': [None, 16], 'min_samples_leaf': [1, 3], 'n_estimators': [300, 400, ...]}, n_iter=100, scoring=None, n_jobs=-1, refit=True, cv=3, verbose=2, pre_dispatch='2*n_jobs', random_state=7)

RandomForestRegressor(n_estimators=600, criterion='squared_error', max_depth=16, min_samples_split=2, min_samples_leaf=3, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True)
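Because the search space above contains only 12 combinations, the randomized search ends up trying all of them, which is what an exhaustive grid search does. The sketch below swaps in `GridSearchCV` to make that explicit; it first counts the combinations of the notebook's grid with `ParameterGrid`, then runs a much smaller grid (`small_grid`, an illustrative choice) on synthetic data so the example finishes quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# The grid used in this notebook.
param_grid = {
    "max_depth": [None, 16],
    "min_samples_leaf": [1, 3],
    "n_estimators": [300, 400, 600],
}
# 2 * 2 * 3 combinations, matching the "12 candidates" in the log above.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)

# Demo on synthetic data with a much smaller grid so it runs quickly.
rng = np.random.default_rng(7)
X_demo = rng.uniform(0.0, 10.0, size=(200, 2))
y_demo = 2.5 + 2.0 * X_demo[:, 0] + rng.normal(0.0, 1.0, size=200)

small_grid = {"max_depth": [None, 4], "n_estimators": [10, 20]}
search = GridSearchCV(RandomForestRegressor(random_state=7), small_grid, cv=3)
search.fit(X_demo, y_demo)

# Best combination found, and the model refit on all the data with it.
print(search.best_params_)
best_model = search.best_estimator_
```

`RandomizedSearchCV` becomes the better choice once the grid is too large to enumerate, since `n_iter` then bounds the number of fits.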


In [72]:
# Use the model refit with the best parameters found by the search
best_random_estimator = rf_random.best_estimator_
rf_random_predict = best_random_estimator.predict(x_test_important_feature)

rf_random_r2_score = r2_score(y_test, rf_random_predict)
rf_random_MAE = mean_absolute_error(y_test, rf_random_predict)
rf_random_MSE = mean_squared_error(y_test, rf_random_predict)

print('r2_score: ', rf_random_r2_score)
print('MAE: ', rf_random_MAE)
print('MSE: ', rf_random_MSE)

scores['Random Forest Tuning'] = {'MAE': rf_random_MAE, 'MSE': rf_random_MSE, 'R2': rf_random_r2_score}

r2_score:  0.7809627644053806
MAE:  1.9516790193829032
MSE:  9.147023079832508


In [73]:
df_scores = pd.DataFrame(scores).T
df_scores

| | MAE | MSE | R2 |
|---|---|---|---|
| Linear Regression | 1.993897 | 9.696862 | 0.767796 |
| Random Forest | 2.027108 | 9.646351 | 0.769006 |
| Extra Trees Regressor | 2.126097 | 10.602501 | 0.746109 |
| Random Forest Importancies | 2.094003 | 10.278733 | 0.753863 |
| Random Forest Tuning | 1.951679 | 9.147023 | 0.780963 |


The Random Forest model with hyperparameter tuning achieved the best results on all three metrics (MAE 1.95, MSE 9.15, R2 0.781), making it the best choice among the models compared.
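MSE is expressed in squared fare units, which makes it hard to read next to MAE. Taking the square root gives the RMSE, which is in the same units as the fare. The sketch below applies that conversion to the MSE values copied from the comparison table; RMSE itself is an added metric that the original notebook does not report.

```python
from math import sqrt

# MSE values copied from the comparison table above.
mse = {
    "Linear Regression": 9.696862,
    "Random Forest": 9.646351,
    "Extra Trees Regressor": 10.602501,
    "Random Forest Importancies": 10.278733,
    "Random Forest Tuning": 9.147023,
}

# RMSE is the square root of MSE, expressed in fare units.
rmse = {name: sqrt(value) for name, value in mse.items()}

# Print models from lowest to highest RMSE.
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

The ranking is the same as for MSE, since the square root preserves order.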

# 4) Conclusions

- This project applied several data science techniques: data cleaning, feature engineering, feature selection and predictive modeling.
- The analysis clarified the variables and how they relate to each other, which made it possible to build a model that predicts ride fares.
- The main steps of the project were:
  - Identifying the features that most influence the model;
  - Comparing different models analytically;
  - Tuning hyperparameters.