# **Introduction**
Hello, 
In this notebook we tackle a simple regression task: predicting the fare amount for taxi trips in New York City. The notebook focuses on handling outliers, cleaning the data, engineering a few features, and then training a random forest regressor.

# **Table of contents:** 

1. Loading data and EDA

2. Data Processing

    2.1 Data cleaning: Dealing with outliers

    2.2 Feature Engineering
    
    2.3 Normalization

    2.4 Data processing pipeline

3. Training, predictions and submitting results

    3.1 Training

    3.2 Predictions
    
    3.3 Submitting results



In [None]:
# setting up the libraries that we will need 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV, cross_val_score
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

## **1. Loading data and EDA:**

The New York City taxi fare prediction dataset has about 55M rows; we will work with only the first 2M rows due to memory constraints.
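
As a minimal sketch of how partial loading keeps memory in check, the snippet below reads only the first rows of a tiny in-memory CSV and stores the numeric columns in narrower types. The toy data and the `float32`/`uint8` choices are assumptions made for illustration; they are not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical values).
csv_text = """key,fare_amount,pickup_longitude,pickup_latitude,passenger_count
a,4.5,-73.844,40.721,1
b,16.9,-74.016,40.711,1
c,5.7,-73.982,40.761,2
d,7.7,-73.987,40.733,1
"""

# nrows stops parsing after the requested number of rows;
# dtype stores the numbers in narrower types than the 64-bit defaults.
sample = pd.read_csv(
    io.StringIO(csv_text),
    nrows=3,
    dtype={
        "fare_amount": "float32",
        "pickup_longitude": "float32",
        "pickup_latitude": "float32",
        "passenger_count": "uint8",
    },
)

print(sample.shape)
print(sample.dtypes)
```

Since the coordinates are later rounded to 3 decimal places, the reduced precision of `float32` should be acceptable here, but that is worth checking on the real data.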

In [None]:
# loading the data
train = pd.read_csv("../input/new-york-city-taxi-fare-prediction/train.csv", nrows= 2000000)
test = pd.read_csv("../input/new-york-city-taxi-fare-prediction/test.csv")
train.name = "Train"
test.name = "Test"

In [None]:
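def corr(df):
    # correlate numeric columns only; recent pandas versions raise on string columns such as key
    corr_matrix = df.select_dtypes(include="number").corr()
    print(corr_matrix["fare_amount"].sort_values(ascending=False))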

In [None]:
#data exploration
train.head()

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
corr(train)

In [None]:
test.describe()

In [None]:
coordinates_columns = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
data_set = [train, test]
for coord in coordinates_columns:
    for data in data_set:
        maxi = data[coord].max()
        mini = data[coord].min()
        print("Range of {} in {} data is : ({:.3f}, {:.3f})".format(coord, data.name, mini, maxi))
del data_set

In [None]:
test.passenger_count.value_counts()

In [None]:
train.passenger_count.value_counts()

In [None]:
#Negative fare amount
train.loc[train['fare_amount']<0].shape

New York State extends roughly from 71° 47' 25" W to 79° 45' 54" W in longitude and from 40° 29' 40" N to 45° 0' 42" N in latitude, which comfortably contains NYC. However, the training data contains noisy values, for example longitude ~ 3457 and latitude ~ 3344, whereas the coordinate ranges in the test data are much more compatible with NYC. The same goes for passenger_count (values of ~ 280 or 0) and for negative fares.
In the Data Processing section we will delete those outliers.

In [None]:
# create a fare amount category attribute with five bins to better understand this attribute
train["fare_amount_1"] = pd.cut(train["fare_amount"],
                                bins=[0., 6.0, 12., 48., 150., np.inf],
                                labels=[1, 2, 3, 4, 5])

In [None]:
train["fare_amount_1"].hist()

In [None]:
train.fare_amount_1.value_counts()

Most fares fall between 6 and 12 USD, followed by roughly equal shares in the 0 to 6 USD and 12 to 48 USD ranges. Fares above 150 USD are unrealistic; although there are only around 200 of them, they might affect our model, so we will delete them in the data processing section.
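
To see how the binning treats boundary values, here is a small sketch that applies the same bin edges to a few hand-picked fares. The fare values are made up for illustration. It relies on the default behaviour of `pd.cut`, where intervals are closed on the right.

```python
import numpy as np
import pandas as pd

# Hand-picked fares (hypothetical), including a negative and a zero fare.
fares = pd.Series([-2.5, 0.0, 4.5, 6.0, 9.0, 30.0, 200.0])

# Intervals are right-inclusive by default:
# (0, 6], (6, 12], (12, 48], (48, 150], (150, inf]
bins = [0.0, 6.0, 12.0, 48.0, 150.0, np.inf]
categories = pd.cut(fares, bins=bins, labels=[1, 2, 3, 4, 5])

print(categories.tolist())
# Fares <= 0 fall outside every interval and come back as NaN.
print(categories.isna().sum())
```

A fare of exactly 6.0 lands in bin 1, and non-positive fares get no bin at all, so they do not show up in the histogram or in `value_counts()`.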

In [None]:
#a dictionary with NYC coordinates from test data that will be used in deleting outliers
coordinates = {'min_long': min(test.pickup_longitude.min(), test.dropoff_longitude.min()),
              'max_long': max(test.pickup_longitude.max(), test.dropoff_longitude.max()),
              'min_lat': min(test.pickup_latitude.min(), test.dropoff_latitude.min()),
              'max_lat' : max(test.pickup_latitude.max(), test.dropoff_latitude.max()),}

In [None]:
# we will use plt.xlim to limit the axes while plotting , to get a better observation of the data
city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

In [None]:
train[(train.pickup_longitude >= coordinates['min_long']) & 
      (train.pickup_longitude <= coordinates['max_long']) & 
      (train.pickup_latitude >= coordinates['min_lat']) & 
      (train.pickup_latitude <= coordinates['max_lat']) &
      (train.dropoff_longitude >= coordinates['min_long']) &
      (train.dropoff_longitude <= coordinates['max_long']) &
      (train.dropoff_latitude >= coordinates['min_lat']) & 
      (train.dropoff_latitude<= coordinates['max_lat'])].plot(
        kind ='scatter', x='pickup_longitude', y='pickup_latitude',s=.02, alpha =0.4)

plt.ylim(city_lat_border)
plt.xlim(city_long_border)

The plot of the pickup locations in the training data resembles a map of NYC. We can spot particular places with high pickup density, notably:
* JFK airport
* LaGuardia airport

NYC is served by another airport, Newark (EWR).
This information could help us generate more features: pickup and dropoff flags for each airport.

## **2. Data Processing**
### 2.1 Data Cleaning: Dealing with outliers
We noticed in the previous section that our data contains many outliers (unrealistic values) that would affect our model negatively. Here we remove them:
* Delete rows whose fare amount is not positive or is above 150 USD
* Delete rows with an unrealistic passenger count (more than 8)
* Delete rows with unrealistic coordinates
* Round coordinates to 3 decimal places (to save computing power)
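
The steps above can be sketched on a toy frame before wiring them into a transformer. The bounding box `BOX`, the helper `clean`, and the sample rows are hypothetical and only illustrate the idea; the notebook derives its real bounds from the test set.

```python
import pandas as pd

# Hypothetical bounding box, for illustration only.
BOX = {"min_long": -74.3, "max_long": -72.9, "min_lat": 40.5, "max_lat": 41.7}

toy = pd.DataFrame({
    "fare_amount":      [7.5, -3.0, 300.0, 12.0],
    "pickup_longitude": [-73.98712, -73.99, -73.95, 3457.0],
    "pickup_latitude":  [40.75044, 40.74, 40.78, 40.70],
})

def clean(df, box):
    """Keep rows with a plausible fare and pickup point, then round coordinates."""
    mask = (
        (df["fare_amount"] > 0) & (df["fare_amount"] < 150)
        & df["pickup_longitude"].between(box["min_long"], box["max_long"])
        & df["pickup_latitude"].between(box["min_lat"], box["max_lat"])
    )
    out = df.loc[mask].copy()  # copy so the input frame is left untouched
    coord_cols = ["pickup_longitude", "pickup_latitude"]
    out[coord_cols] = out[coord_cols].round(3)
    return out

cleaned = clean(toy, BOX)
print(len(toy) - len(cleaned), "rows removed")
print(cleaned)
```

Building one boolean mask and filtering once avoids creating several intermediate frames, which matters with millions of rows.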

In [None]:
class DataCleaning (BaseEstimator, TransformerMixin):
    def __init__ (self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        l=len(X)
        if X.name == 'Train':
            X = X[(X.fare_amount>0) & (X.fare_amount<150)]
            X = X.drop("fare_amount_1", axis=1) # we don't need this feature anymore
            X = X[(X.passenger_count>=0) &(X.passenger_count<=8)]
            X = X[(X.pickup_longitude>=coordinates['min_long']) & (X.pickup_longitude<=coordinates['max_long'])]
            X = X[(X.pickup_latitude>=coordinates['min_lat']) & (X.pickup_latitude<=coordinates['max_lat'])]
            X = X[(X.dropoff_longitude>=coordinates['min_long']) & (X.dropoff_longitude<=coordinates['max_long'])]
            X = X[(X.dropoff_latitude>=coordinates['min_lat']) & (X.dropoff_latitude<=coordinates['max_lat'])]
        # round all four coordinate columns to 3 decimal places in one vectorized call
        coord_cols = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
        X[coord_cols] = X[coord_cols].round(3)
        print(l - len(X), "rows have been deleted")
        return X

### 2.2 Feature Engineering:
All we have is a handful of features, too few to train a decent model, so we will add a few attributes:
#### 2.2.1 Extracting date and time
The date attribute is not useful in its raw format, so from it we will generate 5 more attributes: year, month, day, day of the week, and hour.
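
As a quick illustration of the `.dt` accessor used for this extraction, the sketch below parses two made-up timestamps and pulls out the five parts. The sample strings are assumptions; the real column may carry extra characters (such as fractional seconds) that need a matching format.

```python
import pandas as pd

# Two hypothetical timestamps in the "%Y-%m-%d %H:%M:%S" layout.
stamps = pd.Series(["2009-06-15 17:26:21", "2012-12-31 23:59:59"])
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,                 # day of the month, 1 to 31
    "dayOftheWeek": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
    "hour": parsed.dt.hour,
})
print(parts)
```

Note that `dt.day` and `dt.month` are easy to mix up when copying lines, so it is worth checking that the `day` column really goes up to 31.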

In [None]:
class date_time_extraction (BaseEstimator, TransformerMixin):
    def __init__ (self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['key'] = pd.to_datetime(X.key, format="%Y-%m-%d %H:%M:%S")
        X['year'] = X.key.dt.year
        X['month'] = X.key.dt.month
        X['day'] = X.key.dt.day
        X['dayOftheWeek'] = X.key.dt.dayofweek
        X["hour"]=X.key.dt.hour
        X.drop("pickup_datetime", axis=1, inplace=True)
        return X           

#### 2.2.2 Airport pickup and dropoff
As mentioned earlier, NYC has 3 airports. We all know that taxis frequently pick up and drop off passengers at this kind of location, so for each airport we will add 2 columns: pickup and dropoff.
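
An airport flag boils down to a bounding-box test. The sketch below shows a vectorized variant using `Series.between`, which can be much faster than a row-by-row `apply` on millions of rows. The helper name `in_box` and the two sample trips are hypothetical; the JFK bounds are the ones used in this notebook.

```python
import pandas as pd

# JFK bounding box (same values as used in this notebook).
JFK = {"min_lng": -73.835, "min_lat": 40.619, "max_lng": -73.740, "max_lat": 40.665}

def in_box(lat, lng, box):
    """Return a 0/1 Series: 1 where (lat, lng) lies inside the box, edges included."""
    inside = (
        lat.between(box["min_lat"], box["max_lat"])
        & lng.between(box["min_lng"], box["max_lng"])
    )
    return inside.astype(int)

# Two made-up pickups: one inside the JFK box, one in Manhattan.
trips = pd.DataFrame({
    "pickup_latitude":  [40.641, 40.758],
    "pickup_longitude": [-73.778, -73.985],
})
trips["pickup_JFK"] = in_box(trips["pickup_latitude"], trips["pickup_longitude"], JFK)
print(trips)
```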

In [None]:
#nyc_airports coordinates
nyc_airports={'JFK':{'min_lng':-73.835,
                     'min_lat':40.619,
                     'max_lng':-73.740, 
                     'max_lat':40.665},
              
              'EWR':{'min_lng':-74.192,
                     'min_lat':40.670, 
                     'max_lng':-74.153, 
                     'max_lat':40.708},
              
        'LaGuardia':{'min_lng':-73.889, 
                     'min_lat':40.766, 
                     'max_lng':-73.855, 
                     'max_lat':40.793}
                }

In [None]:
# a function to assign 1 if we have a pick up or drop off from an airport
def Airport(latitude, longitude, airport_name):
    if (latitude>=nyc_airports[airport_name]['min_lat'] and
      latitude<=nyc_airports[airport_name]['max_lat'] and
      longitude>=nyc_airports[airport_name]['min_lng'] and
      longitude<=nyc_airports[airport_name]['max_lng']):
        return 1
    else:
        return 0
        

In [None]:
class Airport_data (BaseEstimator, TransformerMixin):
    def __init__ (self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['pickup_JFK']=X.apply(lambda row:Airport(row['pickup_latitude'],row['pickup_longitude'],'JFK'),axis=1)
        X['dropoff_JFK']=X.apply(lambda row:Airport(row['dropoff_latitude'],row['dropoff_longitude'],'JFK'),axis=1)
        X['pickup_EWR']=X.apply(lambda row:Airport(row['pickup_latitude'],row['pickup_longitude'],'EWR'),axis=1)
        X['dropoff_EWR']=X.apply(lambda row:Airport(row['dropoff_latitude'],row['dropoff_longitude'],'EWR'),axis=1)
        X['pickup_la_guardia']=X.apply(lambda row:Airport(row['pickup_latitude'],row['pickup_longitude'],'LaGuardia'),axis=1)
        X['dropoff_la_guardia']=X.apply(lambda row:Airport(row['dropoff_latitude'],row['dropoff_longitude'],'LaGuardia'),axis=1)
        return X    
        
        

#### 2.2.3 Distance
Another important feature that we can add is the trip distance. We can also add the latitude distance, the absolute difference between the pickup and dropoff latitudes, and likewise the longitude distance.
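
For the trip distance, a common choice is the haversine (great-circle) formula. The sketch below is a hypothetical helper, `haversine_miles`, written in the textbook sin² form, which is algebraically equivalent to the cosine form because (1 - cos x) / 2 = sin²(x / 2). The Earth radius of 6371 km and the sample points are assumptions.

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    earth_radius_km = 6371.0  # mean Earth radius (assumed)
    km = 2 * earth_radius_km * np.arcsin(np.sqrt(a))
    return km * 0.6213712  # kilometres to miles

# First trip: one degree of latitude due north. Second trip: pickup equals dropoff.
pickup_lat = np.array([40.0, 40.641])
pickup_lon = np.array([-74.0, -73.778])
dropoff_lat = np.array([41.0, 40.641])
dropoff_lon = np.array([-74.0, -73.778])

d = haversine_miles(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(d)
```

One degree of latitude comes out at roughly 69 miles, and a trip with identical endpoints gives 0, which makes for two handy sanity checks.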

In [None]:
# Haversine (great-circle) distance between two points given in degrees, returned in miles
# (12742 km is the Earth's diameter; 0.6213712 converts km to miles)
def trip_distance(lat1, lat2, lon1,lon2):
    p = 0.017453292519943295 # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

In [None]:
class distance (BaseEstimator, TransformerMixin):
    def __init__ (self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # trip_distance only uses NumPy operations, so it can be applied to whole columns at once
        X['trip_distance'] = trip_distance(X['pickup_latitude'], X['dropoff_latitude'],
                                           X['pickup_longitude'], X['dropoff_longitude'])
        X["diff_lat"]=abs(X.pickup_latitude-X.dropoff_latitude)
        X["diff_long"]=abs(X.pickup_longitude-X.dropoff_longitude)
        return X

### 2.3 Normalization:
Now that all our features are ready, we will rescale them to smaller ranges so that our model trains faster.
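
One design point worth keeping in mind when scaling: if the scaling constants are learned in `fit()` and reused in `transform()`, the train and test sets end up on exactly the same scale. The sketch below shows that pattern with a hypothetical `MaxScaler` class on a toy `hour` column. It is an alternative design shown for illustration, not the approach taken in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MaxScaler(BaseEstimator, TransformerMixin):
    """Hypothetical helper: divide selected columns by maxima learned in fit()."""

    def __init__(self, columns=("hour",)):
        self.columns = columns

    def fit(self, X, y=None):
        # Remember the training maxima so that transform() can reuse them.
        self.max_ = X[list(self.columns)].max()
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        cols = list(self.columns)
        X[cols] = X[cols] / self.max_
        return X

train_toy = pd.DataFrame({"hour": [0, 12, 23]})
test_toy = pd.DataFrame({"hour": [6, 20]})

scaler = MaxScaler(columns=("hour",)).fit(train_toy)
scaled = scaler.transform(test_toy)
print(scaled)
```

Here the test hours are divided by 23, the maximum seen in training, even though the largest test value is 20.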

In [None]:
class Normalization (BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X.year = (X.year - 2009)
        max_year = X.year.max()
        X.month = X.month + X.year*12
        X.year = X.year/max_year
        max_month = X.month.max()
        X.month = X.month/max_month
        max_day = X.day.max()
        X.day = X.day/max_day
        max_hour = X.hour.max()
        X.hour = X.hour / max_hour
        X.pickup_latitude = coordinates['max_lat'] - X.pickup_latitude
        X.dropoff_latitude = coordinates['max_lat'] - X.dropoff_latitude
        X.pickup_longitude = coordinates['max_long'] - X.pickup_longitude
        X.dropoff_longitude = coordinates['max_long'] - X.dropoff_longitude
        return X

### 2.4 Data Processing pipeline:


In [None]:
Data_processing_pipeline = Pipeline([
    ('cleaning', DataCleaning()),
    ('Date and time extraction', date_time_extraction()),
    ('Pick up or drop off in an airport', Airport_data()),
    ('useful distances', distance()),
    ('Normalization', Normalization())   
])

In [None]:
X_train= Data_processing_pipeline.fit_transform(train)
labels = X_train["fare_amount"].copy()
X_train = X_train.drop("fare_amount", axis=1) # drop labels for training set
X_train = X_train.drop("key", axis=1) 
X_test = Data_processing_pipeline.transform(test)  # the pipeline is already fitted; only transform the test set
X_test = X_test.drop("key", axis=1) 
del train, test

## **3. Training, predictions and submitting results**
### 3.1 Training
I tried several different models and decided to work with RandomForestRegressor, as it is a basic model that gives acceptable results. Because the training set is large, fitting even a single model takes a long time.
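
Because every extra grid point multiplies the training time, it can help to rehearse the search on a small problem first. The sketch below runs a reduced grid on synthetic data and converts the negated MSE score into an RMSE. The synthetic features, the grid values, and `cv=3` are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))              # synthetic features
y = 2.5 + 3.0 * X[:, 0] + rng.normal(0, 0.5, 200)  # synthetic "fare"

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [5, 20], "max_features": [1, 3]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# The score is a negated MSE, so flip the sign before taking the square root.
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, round(best_rmse, 3))
```

The same sign flip applies to `grid_search.best_score_` on the real data when an RMSE in USD is easier to interpret than a negative MSE.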

In [None]:
param_grid = [
    {'n_estimators': [3, 10, 30, 40, 50], 'max_features': [2, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]


In [None]:
# instantiate the model
forest_reg = RandomForestRegressor()
# grid search with 5-fold cross-validation to find the best set of hyperparameters
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                          scoring = 'neg_mean_squared_error',
                          return_train_score=True)

In [None]:
#fitting and selecting best set of parameters
grid_search.fit(X_train, labels)
best_model = grid_search.best_estimator_
grid_search.best_params_

### 3.2 Predictions

In [None]:
final_predictions = best_model.predict(X_test)

### 3.3 Submitting results


In [None]:
SS = pd.read_csv("../input/new-york-city-taxi-fare-prediction/sample_submission.csv")
SS['fare_amount']= final_predictions
SS.to_csv('SS.csv',index=False)