# What is Machine Learning?

"Machine learning is the science of getting computers to act without being explicitly
programmed." (Stanford University)

# Dataset

https://www.kaggle.com/c/new-york-city-taxi-fare-prediction - New York City Taxi Fare Prediction (Kaggle)

# Part 0 - Exploring the data

In [None]:
# Import necessary python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime

In [None]:
# Load 1 million rows from the csv file
train=pd.read_csv("train.csv",nrows=1000000)

In [None]:
# Taking a look at the data
train.head()

In [None]:
train.describe()

In [None]:
# Get rides with a passenger count of 0
no_passenger_rides = train[train['passenger_count']==0]
no_passenger_rides.head()

In [None]:
# Count number of such instances.
len(no_passenger_rides)

In [None]:
# Get rows with negative fare amount
negative_fare_rides = train[train['fare_amount']<0]

In [None]:
negative_fare_rides

In [None]:
# Convert negative fare amounts to positive ones
train['fare_amount'] = train['fare_amount'].abs()

In [None]:
# Re-check for negative fares; the result should now be empty
negative_fare_rides = train[train['fare_amount']<0]

In [None]:
negative_fare_rides

In [None]:
train.describe()

In [None]:
# Drop rows with at least one NaN value
train.dropna(inplace=True)

In [None]:
train.describe()

In [None]:
train.head()

In [None]:
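# Calculates the haversine (great-circle) distance in miles between two points
# given as latitude/longitude in degrees
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295 # Pi/180, converts degrees to radians
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    # 12742 km is the Earth's diameter (2 * 6371 km); 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))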
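
A quick sanity check of the distance function can catch unit or constant mistakes before the feature is used for modelling. The sketch below redefines the same function so that it runs on its own, and the coordinates are arbitrary illustrative values, not rows from the dataset. One degree of latitude along a meridian should come out at roughly 69 miles, and a point compared with itself should give 0.

```python
import numpy as np

# Same haversine helper as in the cell above, repeated so this snippet is self-contained
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p) / 2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# Illustrative coordinates (assumed, not taken from train.csv)
one_degree_lat = distance(40.0, -74.0, 41.0, -74.0)  # same longitude, 1 degree apart
same_point = distance(40.7, -74.0, 40.7, -74.0)      # identical pickup and dropoff

print(one_degree_lat, same_point)
```

Because the function only uses NumPy operations, it accepts whole pandas columns as well as single numbers, which is how it is applied to `train` in the next cell.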

In [None]:
train['distance_miles']=distance(train.pickup_latitude,train.pickup_longitude,train.dropoff_latitude,train.dropoff_longitude)

In [None]:
train.head()

In [None]:
# Absolute longitude difference between dropoff and pickup
def get_absolute_difference_longitude(a):
    return abs(a['dropoff_longitude']-a['pickup_longitude'])
# Absolute latitude difference between dropoff and pickup
def get_absolute_difference_latitude(a):
    return abs(a['dropoff_latitude']-a['pickup_latitude'])

In [None]:
# The helper works on whole columns, so pass the frame directly instead of a row-by-row apply
train['abs_longitude']=get_absolute_difference_longitude(train)

In [None]:
# The helper works on whole columns, so pass the frame directly instead of a row-by-row apply
train['abs_latitude']=get_absolute_difference_latitude(train)

In [None]:
train.head()

Distance histogram

In [None]:
train.distance_miles.hist(bins=100)
plt.xlabel('distance miles')
plt.title('Histogram of ride distances in miles')
plt.show()

Distance histogram for rides under 50 miles

In [None]:
under_50_miles=train[train['distance_miles']<50]

In [None]:
under_50_miles.distance_miles.hist(bins=100)
plt.xlabel('distance miles')
plt.title('Histogram of ride distances under 50 miles')
plt.show()

In [None]:
# Returns the weekday (Sunday=0 ... Saturday=6) for each of the given dates
from dateutil import parser
def weekday(dates):
    # pandas numbers Monday=0 ... Sunday=6, so shift by one to keep Sunday=0
    parsed = pd.to_datetime(pd.Series(dates))
    return ((parsed.dt.dayofweek + 1) % 7).tolist()

In [None]:
train['pickup_weekday']=weekday(train.pickup_datetime)

In [None]:
# Returns the hour of the day (0-23) for each of the given dates
def hour(dates):
    # Parse the whole column at once instead of one string at a time
    return pd.to_datetime(pd.Series(dates)).dt.hour.tolist()

In [None]:
parser.parse('2009-06-15 17:26:21 UTC').hour

In [None]:
pick_up_hours=hour(train.pickup_datetime)

In [None]:
train['pick_up_hour']=pick_up_hours

In [None]:
train.head()

# Part 1 - Linear Regression

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm - Linear Regression


In [None]:
X=train[['distance_miles','passenger_count','pickup_weekday']]
y=train['fare_amount']

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
lr=linear_model.LinearRegression()
lr.fit(X_train,y_train)

In [None]:
lr.coef_

In [None]:
lr.intercept_
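
To make the meaning of `coef_` and `intercept_` concrete, here is a minimal sketch on synthetic, noise-free data. The numbers (2.5 per mile plus a base of 3.0) are assumptions chosen for illustration and are not estimates from the taxi dataset. It shows that a prediction is the features multiplied by the coefficients, plus the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): fare = 2.5 * miles + 3.0, with no noise
miles = np.array([[1.0], [2.0], [4.0], [8.0]])
fare = 2.5 * miles[:, 0] + 3.0

model = LinearRegression().fit(miles, fare)

# A prediction is the dot product of the features with coef_, plus intercept_
manual = miles @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(miles)))
```

With real, noisy fares the fitted values will not match an underlying rule exactly, but each coefficient can still be read as the expected change in fare for a one-unit change in that feature, holding the other features fixed.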

In [None]:
y_pred=lr.predict(X_train)

In [None]:
print("Root Mean Squared Error for training data: ",pow(mean_squared_error(y_train, y_pred),0.5))

In [None]:
y_pred_test=lr.predict(X_test)

In [None]:
print("Root Mean Squared Error for testing data: ",pow(mean_squared_error(y_test, y_pred_test),0.5))

# Part 2 - Decision Tree Regressor

Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
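
The idea of a regression tree can be seen on a very small example. The sketch below uses made-up toy data (short rides costing about 5, long rides about 20) and limits the tree to a single split with `max_depth=1`. Both the data and the depth limit are assumptions for illustration only. Each leaf then predicts the mean of the training targets that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data (hypothetical): three short rides and three long rides
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([4.0, 5.0, 6.0, 19.0, 20.0, 21.0])

# max_depth=1 allows one split, so the tree has exactly two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Print the learned rule as text: one threshold on distance, one value per leaf
print(export_text(stump, feature_names=["distance_miles"]))

# A short ride lands in the low-fare leaf, a long ride in the high-fare leaf
preds = stump.predict(np.array([[2.5], [9.0]]))
print(preds)
```

Deeper trees repeat the same step inside each branch, which lets them approximate more complicated relationships at the cost of fitting the training data more closely.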

https://www.youtube.com/watch?v=p17C9q2M00Q - Decision Tree (Classifier)

https://www.youtube.com/watch?v=zvUOpbgtW3c - Decision Tree (Regressor)

In [None]:
# Same features and target as in Part 1; the train/test split from Part 1 is reused below
X=train[['distance_miles','passenger_count','pickup_weekday']]
y=train['fare_amount']

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
clf = DecisionTreeRegressor(random_state=12)

In [None]:
clft=clf.fit(X_train,y_train)

In [None]:
y_pred=clf.predict(X_train)

In [None]:
print("Root Mean Squared Error for training data: ",pow(mean_squared_error(y_train, y_pred),0.5))

In [None]:
y_pred_test=clf.predict(X_test)

In [None]:
print("Root Mean Squared Error for testing data: ",pow(mean_squared_error(y_test, y_pred_test),0.5))
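
When comparing the two errors above, it helps to know that a decision tree with no depth limit can memorise its training data. The sketch below illustrates this on synthetic data. The linear fare rule, the noise level, and `max_depth=4` are all assumptions for illustration, and the numbers it prints say nothing about the taxi dataset itself.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the taxi data (assumed): fare grows with distance, plus noise
dist = rng.uniform(0, 20, size=(2000, 1))
fare = 2.5 * dist[:, 0] + 3.0 + rng.normal(0, 2.0, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(dist, fare, test_size=0.2, random_state=42)

def rmse(model, X, y):
    # Root mean squared error of the model's predictions on (X, y)
    return np.sqrt(mean_squared_error(y, model.predict(X)))

# An unrestricted tree versus one limited to depth 4
deep = DecisionTreeRegressor(random_state=12).fit(X_tr, y_tr)
shallow = DecisionTreeRegressor(max_depth=4, random_state=12).fit(X_tr, y_tr)

for name, model in [("unrestricted", deep), ("max_depth=4", shallow)]:
    print(name, "train:", rmse(model, X_tr, y_tr), "test:", rmse(model, X_te, y_te))
```

A training error near zero next to a much larger test error is a typical sign of overfitting. Limiting `max_depth` or raising `min_samples_leaf` are common ways to reduce that gap.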