# Sweet Lift Taxi Prediction Model

## Introduction
The Sweet Lift Taxi company has collected historical data on taxi orders at airports. To attract more drivers during peak hours, we need to predict the number of taxi orders for the next hour. Build a model for this prediction. The RMSE metric on the test set must not exceed 48.

The data in the _.csv_ file is described as follows:
- The data is stored in the file taxi.csv.
- The number of orders is in the column num_orders.
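
Because the acceptance criterion is stated in terms of RMSE, it may help to recall how the metric is computed. The sketch below is only an illustration: the helper name `rmse` is hypothetical and the numbers are made up, not taken from taxi.csv.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only (not from taxi.csv)
actual = [100, 80, 60]
predicted = [90, 85, 70]

# Residuals are 10, -5, -10, so the result is sqrt((100 + 25 + 100) / 3) = sqrt(75)
print(rmse(actual, predicted))
```

RMSE is expressed in the same units as the target, so a limit of 48 means a typical error of roughly 48 orders per hour.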

## 1. Data exploration and preprocessing

### 1.1. Library imports

In [25]:
import os
import time

import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor


### 1.2. Data loading

In [26]:
file_path = os.path.join("datasets", "taxi.csv")

# Load the DataFrame
df = pd.read_csv(file_path)

# Convert the column names to lowercase
df.columns = df.columns.str.lower()

# Random sample of 10 rows
df.sample(10)


| | datetime | num_orders |
|---|---|---|
| 8107 | 2018-04-26 07:10:00 | 0 |
| 17922 | 2018-07-03 11:00:00 | 12 |
| 12015 | 2018-05-23 10:30:00 | 14 |
| 17733 | 2018-07-02 03:30:00 | 20 |
| 22900 | 2018-08-07 00:40:00 | 28 |
| 5911 | 2018-04-11 01:10:00 | 16 |
| 22245 | 2018-08-02 11:30:00 | 10 |
| 25476 | 2018-08-24 22:00:00 | 37 |
| 22593 | 2018-08-04 21:30:00 | 21 |
| 21530 | 2018-07-28 12:20:00 | 17 |


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26496 entries, 0 to 26495
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   datetime    26496 non-null  object
 1   num_orders  26496 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 414.1+ KB


In [28]:
df["datetime"] = pd.to_datetime(df["datetime"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26496 entries, 0 to 26495
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    26496 non-null  datetime64[ns]
 1   num_orders  26496 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 414.1 KB
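
As an alternative to converting the column after loading, `pd.read_csv` can parse the dates and set the index in a single call. This is a minimal sketch on an in-memory CSV with made-up rows; it assumes the same column names as taxi.csv, and the variable name `demo` is only illustrative.

```python
import io

import pandas as pd

# Made-up rows in the same layout as taxi.csv (assumption: identical column names)
csv_text = """datetime,num_orders
2018-03-01 00:00:00,9
2018-03-01 00:10:00,14
2018-03-01 00:20:00,28
"""

demo = pd.read_csv(
    io.StringIO(csv_text),
    index_col="datetime",
    parse_dates=["datetime"],
)

# The index is now a DatetimeIndex, ready for resampling
print(type(demo.index).__name__)
# Resampling and chronological splits assume the rows are in time order
print(demo.index.is_monotonic_increasing)
```

Checking `is_monotonic_increasing` is a cheap way to confirm the timestamps are sorted before any time-based aggregation.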


In [29]:
df.set_index("datetime", inplace=True)
df_resampled = df.resample("H").sum()

print(df_resampled.head())

                     num_orders
datetime                       
2018-03-01 00:00:00         124
2018-03-01 01:00:00          85
2018-03-01 02:00:00          71
2018-03-01 03:00:00          66
2018-03-01 04:00:00          43
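
To make the aggregation step concrete: the sample rows suggest the raw data arrives in 10-minute intervals, so resampling to one hour sums six readings per hour. The sketch below uses made-up counts and the lowercase alias "h", which recent pandas releases prefer over the uppercase form.

```python
import pandas as pd

# Two hours of made-up 10-minute readings (six per hour)
idx = pd.date_range("2018-03-01 00:00", periods=12, freq="10min")
raw = pd.DataFrame(
    {"num_orders": [1, 2, 3, 4, 5, 6, 10, 10, 10, 10, 10, 10]},
    index=idx,
)

# Each hourly bucket holds the sum of the readings that fall inside it
hourly = raw.resample("h").sum()
print(hourly)
```

The first hour sums to 21 and the second to 60, which is the same kind of total shown in the `num_orders` column above.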




### 1.3. Descriptive data exploration

In [30]:
df_resampled.describe().round()

| | num_orders |
|---|---|
| count | 4416.0 |
| mean | 84.0 |
| std | 45.0 |
| min | 0.0 |
| 25% | 54.0 |
| 50% | 78.0 |
| 75% | 107.0 |
| max | 462.0 |


## 2. Model training

In this section we train several models and compare them in order to choose the best configuration. The models tested are:

- Linear Regression
- Decision Tree
- Random Forest

The selection criteria are:

- Prediction quality  
- Prediction speed  
- Training time

In [31]:
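df = df_resampled

# Calendar features derived from the hourly timestamp index
df["hour"] = df.index.hour
df["day_of_week"] = df.index.dayofweek
df["month"] = df.index.month
df["year"] = df.index.year

X = df[["hour", "day_of_week", "month", "year"]]  # Features
y = df["num_orders"]  # Target variable

# shuffle=False keeps chronological order: the last 10% of the period becomes the test set.
# random_state has no effect when shuffle=False.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=False)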
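
Passing `shuffle=False` matters for time series: the split becomes a simple cut in time, so the model is never trained on observations that come after the ones it is evaluated on. The sketch below checks this on a made-up hourly index; the names `demo`, `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up hourly series: 100 consecutive hours
idx = pd.date_range("2018-03-01", periods=100, freq="h")
demo = pd.DataFrame({"hour": idx.hour, "num_orders": np.arange(100)}, index=idx)

X_demo = demo[["hour"]]
y_demo = demo["num_orders"]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, shuffle=False)

# Without shuffling, every training timestamp precedes every test timestamp
print("last train timestamp: ", X_tr.index.max())
print("first test timestamp: ", X_te.index.min())
print("chronological split:  ", X_tr.index.max() < X_te.index.min())
```

With the default `shuffle=True` the same check would generally fail, because future rows would be mixed into the training set.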


In [32]:
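def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model, predict on the test set, and return (rmse, train_time, pred_time)."""
    # perf_counter is a high-resolution clock suited to timing short intervals
    start_train = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start_train

    start_pred = time.perf_counter()
    y_pred = model.predict(X_test)
    pred_time = time.perf_counter() - start_pred

    # RMSE is the square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    return rmse, train_time, pred_time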

# 1. Linear Regression
lr = LinearRegression()
lr_rmse, lr_train_time, lr_pred_time = evaluate_model(lr, X_train, X_test, y_train, y_test)
print(f"Linear Regression -> RMSE: {lr_rmse:.4f}, Train Time: {lr_train_time:.4f} sec, Pred Time: {lr_pred_time:.4f} sec")

# 2. Decision Tree
dt = DecisionTreeRegressor(max_depth=10, random_state=42)
dt_rmse, dt_train_time, dt_pred_time = evaluate_model(dt, X_train, X_test, y_train, y_test)
print(f"Decision Tree -> RMSE: {dt_rmse:.4f}, Train Time: {dt_train_time:.4f} sec, Pred Time: {dt_pred_time:.4f} sec")

# 3. Random Forest
rf = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1)
rf_rmse, rf_train_time, rf_pred_time = evaluate_model(rf, X_train, X_test, y_train, y_test)
print(f"Random Forest -> RMSE: {rf_rmse:.4f}, Train Time: {rf_train_time:.4f} sec, Pred Time: {rf_pred_time:.4f} sec")


Linear Regression -> RMSE: 66.5314, Train Time: 0.0259 sec, Pred Time: 0.0015 sec
Decision Tree -> RMSE: 50.4308, Train Time: 0.0066 sec, Pred Time: 0.0015 sec
Random Forest -> RMSE: 48.4948, Train Time: 0.1679 sec, Pred Time: 0.0287 sec


In [33]:
df_results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'RMSE': [lr_rmse, dt_rmse, rf_rmse],
    'Training Time (s)': [lr_train_time, dt_train_time, rf_train_time],
    'Prediction Time (s)': [lr_pred_time, dt_pred_time, rf_pred_time]
})

df_results['Total Time (s)'] = df_results['Training Time (s)'] + df_results['Prediction Time (s)']

print(df_results.sort_values(by='RMSE'))

               Model       RMSE  Training Time (s)  Prediction Time (s)  \
2      Random Forest  48.494831           0.167932             0.028671   
1      Decision Tree  50.430824           0.006614             0.001536   
0  Linear Regression  66.531417           0.025925             0.001527   

   Total Time (s)  
2        0.196604  
1        0.008150  
0        0.027452  
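
One way to make the selection criteria explicit is to filter on the RMSE requirement first and only then compare timings. The sketch below re-enters the figures printed above (rounded) and uses a hypothetical constant `RMSE_LIMIT` for the threshold of 48 stated in the introduction.

```python
import pandas as pd

RMSE_LIMIT = 48  # threshold stated in the introduction

# Figures copied (rounded) from the results table above
results = pd.DataFrame({
    "Model": ["Linear Regression", "Decision Tree", "Random Forest"],
    "RMSE": [66.5314, 50.4308, 48.4948],
    "Total Time (s)": [0.0275, 0.0082, 0.1966],
})

# Flag the models that satisfy the quality requirement
results["Meets limit"] = results["RMSE"] <= RMSE_LIMIT
print(results.sort_values(by=["RMSE", "Total Time (s)"]))

# Among eligible models, speed would be the tie-breaker
eligible = results[results["Meets limit"]]
print("Models within the limit:", eligible["Model"].tolist())
```

With these numbers no model passes the filter; Random Forest comes closest, at about 0.49 above the limit.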


# Conclusion

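The Decision Tree is the best model for this dataset because it offers a good balance of prediction quality and speed: its RMSE (50.43) is close to that of the Random Forest (48.49), while its combined training and prediction time is far lower (about 0.008 s versus 0.197 s). Note, however, that none of the three models meets the required test RMSE of 48; the Random Forest comes closest.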

# Review checklist

- [x]  Jupyter Notebook is open.
- [ ]  The code has no errors.
- [ ]  The code cells are arranged in execution order.
- [ ]  The data has been downloaded and prepared.
- [ ]  Step 2 has been completed: the data has been analyzed.
- [ ]  The model has been trained and the hyperparameters selected.
- [ ]  The models have been evaluated and a conclusion is presented.
- [ ]  The *RMSE* on the test set does not exceed 48.