# Práctica Calificada 2
## Predicción de tarifas de taxis
El objetivo de esta evaluación es construir un modelo de aprendizaje que sea capaz de predecir la tarifa que cobra un taxi de acuerdo a cierta información de entrada.


In [2]:
import pandas as pd
import numpy as np
print("Pandas = ", pd.__version__)
print("Numpy = ", np.__version__)

Pandas =  1.1.5
Numpy =  1.19.5


# Obteniendo del conjunto de datos

In [3]:
df_train = pd.read_csv("./train.csv", nrows=10000000)

In [4]:
df_train.tail()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
9999995,2012-08-12 01:18:00.000000208,5.7,2012-08-12 01:18:00 UTC,-73.999464,40.728452,-73.993299,40.7421,2
9999996,2013-08-07 10:28:00.000000147,5.5,2013-08-07 10:28:00 UTC,-73.968467,40.759367,-73.964967,40.769027,1
9999997,2013-10-29 08:29:00.00000082,14.0,2013-10-29 08:29:00 UTC,-73.997952,40.733717,-73.973448,40.759122,5
9999998,2012-04-07 16:41:33.0000004,10.5,2012-04-07 16:41:33 UTC,-73.9927,40.752021,-73.964705,40.772849,1
9999999,2010-03-30 19:27:00.00000066,8.5,2010-03-30 19:27:00 UTC,-73.96539,40.768572,-73.998188,40.761073,1


Tenemos las siguientes columnas

* ID: cadena que identifica de manera única a cada registro
* pickup_datetime: timestamp indicando cuando el viaje a empezado
* pickup_longitude: número real indicando la ubicación en longitud en donde el viaje empezó
* pickup_latitude: número real indicando la ubicación en latitud en donde el viaje empezó
* dropoff_longitude: número real indicando la ubicación en longitud en donde el viaje terminó
* dropoff_latitude: número real indicando la ubicación en latitud en donde el viaje terminó
* passenger_count: número entero indicando el número de pasajeros en el servicio de taxi
* fare_amount: número real indicando el costo del taxi. Esta es la variable a predecir.

Borramos la columna ID ya que no es un caracterisitica que nos interese

In [5]:
df_train = df_train.drop(columns=['key'])

In [6]:
df_train.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,10000000.0,10000000.0,10000000.0,9999931.0,9999931.0,10000000.0
mean,11.33854,-72.50775,39.91934,-72.50897,39.91913,1.684793
std,9.79993,12.99421,9.322539,12.87532,9.23728,1.323423
min,-107.75,-3439.245,-3492.264,-3426.601,-3488.08,0.0
25%,6.0,-73.99207,40.73491,-73.99139,40.73403,1.0
50%,8.5,-73.98181,40.75263,-73.98016,40.75316,1.0
75%,12.5,-73.9671,40.76712,-73.96367,40.7681,2.0
max,1273.31,3457.626,3344.459,3457.622,3351.403,208.0


In [7]:
df_train.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 7 columns):
 #   Column             Non-Null Count     Dtype  
---  ------             --------------     -----  
 0   fare_amount        10000000 non-null  float64
 1   pickup_datetime    10000000 non-null  object 
 2   pickup_longitude   10000000 non-null  float64
 3   pickup_latitude    10000000 non-null  float64
 4   dropoff_longitude  9999931 non-null   float64
 5   dropoff_latitude   9999931 non-null   float64
 6   passenger_count    10000000 non-null  int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 534.1+ MB


De lo anterior se puede observar que tenemos valores nulos, entonces verificamos cuando valores nulos hay por cada columna:

In [8]:
print(df_train.isnull().sum())

fare_amount           0
pickup_datetime       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude    69
dropoff_latitude     69
passenger_count       0
dtype: int64


In [9]:
# Eliminando
df_train = df_train.dropna(how='any', axis=0)

# Analizando Fare_amount

In [10]:
df_train[['fare_amount']].describe()

Unnamed: 0,fare_amount
count,9999931.0
mean,11.33849
std,9.799845
min,-107.75
25%,6.0
50%,8.5
75%,12.5
max,1273.31


Obervamos que el el minimo del monto de tarifa es negativo, veamos cuantos montos de tarifa menores o iguales a cero, tenemos: 

In [11]:
len(df_train[df_train['fare_amount'] <= 0].index)

689

In [12]:
df_train[df_train['fare_amount'] <= 0]

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2039,-2.9,2010-03-09 23:37:10 UTC,-73.789450,40.643498,-73.788665,40.641952,1
2486,-2.5,2015-03-22 05:14:27 UTC,-74.000031,40.720631,-73.999809,40.720539,1
10002,0.0,2010-02-15 14:26:01 UTC,-73.987115,40.738808,-74.005911,40.713960,1
13032,-3.0,2013-08-30 08:57:10 UTC,-73.995062,40.740755,-73.995885,40.741357,4
27891,0.0,2015-05-15 21:40:28 UTC,-74.077927,40.805714,-74.077919,40.805721,1
...,...,...,...,...,...,...,...
9891251,-5.7,2010-03-26 22:26:10 UTC,-73.989857,40.739000,-73.995942,40.744332,1
9895476,-2.5,2015-05-10 22:07:39 UTC,-73.789360,40.646481,-73.791451,40.645355,1
9914973,0.0,2010-03-10 15:45:34 UTC,-73.977401,40.763754,-74.185760,40.693433,1
9951612,-2.5,2010-03-24 22:22:10 UTC,-74.010215,40.720048,-74.010188,40.719817,5


Solo queremos los datos donde el monto de la tarifa sea mayor que cero:


In [13]:
df_train = df_train[df_train['fare_amount'] > 0]

In [14]:
df_train[['fare_amount']].describe()

Unnamed: 0,fare_amount
count,9999242.0
mean,11.33966
std,9.798609
min,0.01
25%,6.0
50%,8.5
75%,12.5
max,1273.31


# Analizando la longitud y latidud

El rango de la latidud es de -90 hasta 90 grados, mientras que el grado de la longitud es de -180 hasta los 180 grados.



In [15]:
df_train = df_train[(df_train['pickup_longitude'] >= -180) & (df_train['pickup_longitude'] <= 180)]

In [16]:
df_train = df_train[(df_train['pickup_latitude'] >= -90) & (df_train['pickup_latitude'] <= 90)]

In [17]:
df_train = df_train[(df_train['dropoff_longitude'] >= -180) & (df_train['dropoff_longitude'] <= 180)]

In [18]:
df_train = df_train[(df_train['dropoff_latitude'] >= -90) & (df_train['dropoff_latitude'] <= 90)]

In [19]:
def distancia(df):
  # Radio medio de la tierra en Km
  R = 6371.0
  # Conversion a radianes
  lt1 = np.radians(df.pickup_latitude)
  lg1 = np.radians(df.pickup_longitude)
  lt2 = np.radians(df.dropoff_latitude)
  lg2 = np.radians(df.dropoff_longitude)
  # Defirencia entre latitudes y longitudes
  dlt = lt2 - lt1
  dlg = lg2 - lg1
  # Haversine
  hav = np.sin(dlt / 2)**2 + np.cos(lt1) * np.cos(lt2) * np.sin(dlg / 2)**2
  d = 2 * R * np.arccos(np.sqrt(hav))
  return d

In [20]:
df_train['distancia'] = distancia(df_train)

In [21]:
df_train = df_train.drop(columns=['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'])

# Analizando pickup_datetime

In [22]:
df_train['pickup_datetime']

0          2009-06-15 17:26:21 UTC
1          2010-01-05 16:52:16 UTC
2          2011-08-18 00:35:00 UTC
3          2012-04-21 04:30:42 UTC
4          2010-03-09 07:51:00 UTC
                    ...           
9999995    2012-08-12 01:18:00 UTC
9999996    2013-08-07 10:28:00 UTC
9999997    2013-10-29 08:29:00 UTC
9999998    2012-04-07 16:41:33 UTC
9999999    2010-03-30 19:27:00 UTC
Name: pickup_datetime, Length: 9998766, dtype: object

In [23]:
df_train['pickup_datetime'] = df_train['pickup_datetime'].str.replace(" UTC", "")

In [24]:
df_train['pickup_datetime'] = pd.to_datetime(df_train['pickup_datetime'])

In [25]:
df_train['año'] = df_train.pickup_datetime.dt.year
df_train['mes'] = df_train.pickup_datetime.dt.month
df_train['dia'] = df_train.pickup_datetime.dt.day
df_train['hora'] = df_train.pickup_datetime.dt.hour

In [26]:
df_train = df_train.drop(columns=['pickup_datetime'])

In [27]:
df_train.head()

Unnamed: 0,fare_amount,passenger_count,distancia,año,mes,dia,hora
0,4.5,1,20014.056032,2009,6,15,17
1,16.9,1,20006.636662,2010,1,5,16
2,5.7,2,20013.697271,2011,8,18,0
3,7.7,1,20012.287526,2012,4,21,4
4,5.3,1,20013.087639,2010,3,9,7


In [27]:
predictors = ['passenger_count', 'distancia', 'año', 'mes', 'dia', 'hora']
salida = 'fare_amount'

X = df_train[predictors]
y = df_train[salida]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=22)

rf = RandomForestRegressor(n_estimators=500,
                            oob_score = True,
                            random_state=1,
                            max_depth=8)
rf.fit(X_train, y_train)

In [None]:
from joblib import dump, load
dump(rf, 'Modelo.joblib')