# Project Introduction

The dataset used for this project contains trip records gathered by the New York City Taxi & Limousine Commission (TLC). Many variables are recorded for each trip, as listed in the data dictionary below. The main goal of this project is to build a model that predicts the "total_amount", or final cost, of a trip before the trip is taken. This price could then be shown to prospective riders to increase sales, since buyers are more likely to buy a service when they know what they are getting. For the purposes of this project, we assume that riders enter the relevant information (mainly the pickup and dropoff locations, from which many other variables can be derived) before booking the ride, the same way they would with Uber, for example.

### Data Dictionary / Feature Definitions

**ID**: Trip identification number.

**VendorID**: A code indicating the TPEP provider that provided the record.

        1= Creative Mobile Technologies, LLC;

        2= VeriFone Inc.

**tpep_pickup_datetime**: The date and time when the meter was engaged.

**tpep_dropoff_datetime**: The date and time when the meter was disengaged.

**Passenger_count**: The number of passengers in the vehicle. This is a driver-entered value.

**Trip_distance**: The elapsed trip distance in miles reported by the taximeter.

**PULocationID**: TLC Taxi Zone in which the taximeter was engaged.

**DOLocationID**: TLC Taxi Zone in which the taximeter was disengaged.

**RateCodeID**: The final rate code in effect at the end of the trip.

        1= Standard rate

        2= JFK

        3= Newark

        4= Nassau or Westchester

        5= Negotiated fare

**Store_and_fwd_flag**: This flag indicates whether the trip record was held in vehicle memory before being sent to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.

        Y= store and forward trip

        N= not a store and forward trip

**Payment_type**: A numeric code signifying how the passenger paid for the trip.

        1= Credit card

        2= Cash

        3= No charge

        4= Dispute

        5= Unknown

        6= Voided trip

**Fare_amount**: The time-and-distance fare calculated by the meter.

**Extra**: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.

**MTA_tax**: $0.50 MTA tax that is automatically triggered based on the metered rate in use.

**Improvement_surcharge**: $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.

**Tip_amount**: Tip amount - This field is automatically populated for credit card tips. Cash tips are not included.

**Tolls_amount**: Total amount of all tolls paid in trip.

#### _Let's start_ by taking an overview of the features and their importance to the main goal, and see if there are any columns that can be preemptively dropped so as not to waste time exploring them. (This would obviously be impractical on a dataset with a lot of features; however, I am including it here because this is a portfolio project, so my thought process may be important to see.)

**Trip ID**: This is a steadily increasing integer count of trips that I do not believe is even specific to each vehicle (i.e. each vehicle tracking its own ride numbers), so this feature cannot be used in any way to predict the final cost of the ride.

**Vendor ID**: There are two vendors in this dataset, and they may charge different rates for their taxis.

**Pickup/Dropoff Datetime**: These values can be used to analyze whether certain times of day have more or less demand and thus may incur a premium due to the laws of supply and demand. If certain times can be shown to be more expensive, then we can use this to predict the final cost.
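To make the time-of-day idea concrete, here is a minimal sketch of how the pickup timestamp could be turned into features. The two-row `sample` frame and the new column names `pickup_hour` and `pickup_dayofweek` are hypothetical and only for illustration; the real notebook would apply the same steps to the loaded data.

```python
import pandas as pd

# Hypothetical two-row sample; the timestamps are made up for illustration.
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["2017-03-01 08:55:43", "2017-03-04 17:10:05"],
})

# The column is read in as text (dtype "object"), so parse it first.
sample["tpep_pickup_datetime"] = pd.to_datetime(sample["tpep_pickup_datetime"])

# Derive simple time features that can later be compared against cost.
sample["pickup_hour"] = sample["tpep_pickup_datetime"].dt.hour
sample["pickup_dayofweek"] = sample["tpep_pickup_datetime"].dt.dayofweek  # Monday = 0

print(sample)
```

Grouping the cost by `pickup_hour` or `pickup_dayofweek` would then show whether some times are systematically more expensive.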

**Passenger Count**: Since we don't know the ins and outs of the policies at these taxi companies, drivers may be permitted to charge a premium for certain numbers of riders, so this feature will need investigation as it may be useful.

**Trip Distance**: We would expect this to be one of the features with the highest correlation with the final cost of the ride.
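One way to check this expectation is a simple Pearson correlation between distance and cost. The sketch below uses a small made-up `sample` frame, so the resulting number says nothing about the real dataset; it only shows the call.

```python
import pandas as pd

# Hypothetical values, chosen only to demonstrate the calculation.
sample = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 4.0, 10.0],
    "total_amount": [7.5, 11.0, 14.8, 18.3, 40.0],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = sample["trip_distance"].corr(sample["total_amount"])
print(f"Pearson r = {r:.3f}")
```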

**Pickup/Dropoff Location**: These values can be used to analyze whether certain zones have more or less traffic and thus take the taxi driver longer to navigate, making the ride run longer, which would increase the final cost of the ride.

**Ratecode**: This variable will need investigation as to how much of an effect it has on the final cost.

**Store & Forward Flag**: This variable is not expected to be valuable in predicting the final cost since it only tracks information that is gathered after the ride has taken place.

**Payment Type**: We will need to check whether it makes a difference if a rider pays by cash or card.
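A quick way to run this check is to compare average cost per payment type. Because the data dictionary notes that cash tips are not recorded, the sketch below compares both the raw total and a tip-free total. The `sample` values and the `subtotal` column are hypothetical here.

```python
import pandas as pd

# Hypothetical sample: payment_type 1 = credit card, 2 = cash.
sample = pd.DataFrame({
    "payment_type": [1, 1, 2, 2],
    "total_amount": [16.0, 12.0, 10.0, 8.0],
    "tip_amount": [3.0, 2.0, 0.0, 0.0],
})

# Remove tips so that unrecorded cash tips do not distort the comparison.
sample["subtotal"] = sample["total_amount"] - sample["tip_amount"]

by_payment = sample.groupby("payment_type")[["total_amount", "subtotal"]].mean()
print(by_payment)
```

If the gap between payment types shrinks once tips are removed, the difference is mostly a recording effect rather than a real price difference.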

**Fare Amount**: Since this feature is made up of other features that will be included in the final model, it is not necessary to include it; doing so would give undue additional weight to the features comprising it.

**Extra**: This is a variable recorded at the ride's end and not available beforehand, but since it only tracks certain premiums based on the time of day, we can make binary columns in the dataset to track it instead.
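The binary columns could look something like the sketch below. The hour windows (weekdays 4pm to 8pm for rush hour, 8pm to 6am for overnight) are my assumption about when these surcharges apply and should be verified against the TLC rate rules. The `pickups` frame and the `rush_hour` and `overnight` column names are hypothetical.

```python
import pandas as pd

# Hypothetical pickup times covering each case.
pickups = pd.DataFrame({"tpep_pickup_datetime": pd.to_datetime([
    "2017-03-01 17:30:00",  # Wednesday, late afternoon
    "2017-03-01 23:15:00",  # Wednesday, late night
    "2017-03-04 17:30:00",  # Saturday, late afternoon
    "2017-03-01 11:00:00",  # Wednesday, midday
])})

hour = pickups["tpep_pickup_datetime"].dt.hour
is_weekday = pickups["tpep_pickup_datetime"].dt.dayofweek < 5  # Monday = 0

# ASSUMPTION: rush hour surcharge applies on weekdays from 16:00 up to 20:00.
pickups["rush_hour"] = (is_weekday & hour.between(16, 19)).astype(int)

# ASSUMPTION: overnight surcharge applies from 20:00 up to 06:00.
pickups["overnight"] = ((hour >= 20) | (hour < 6)).astype(int)

print(pickups)
```

Both flags depend only on the pickup time, so they are available before the ride starts, unlike the recorded `extra` value.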

**MTA Tax**: A nearly perfectly static 50 cent charge applied after the ride ends. Its value is already included in the final cost, so this column is not necessary.

**Improvement Surcharge**: A static 30 cent charge on every ride; it provides no predictive power since it never changes.

**Tip Amount**: Since tips are controlled by the rider, they can't reliably be predicted in a general sense, and probably not even on a per-rider basis, even if we knew who was taking the ride and could remember their previous tips. We will need to create a new column in the dataset that tracks the subtotal of the ride (total_amount without tips) and then try to predict that.

**Tolls Amount**: This is another variable we would not have before the ride is completed. Since we do not have the data or capabilities within this setting to gather route information and estimate what tolls may be paid during the trip, we will have to accept this value being added to the total and remove what variance we can with the pickup and dropoff locations, since they may correlate with the tolls.

**Total Amount**: As mentioned above, we will be creating a new final cost column called subtotal, which is this value with tips subtracted.
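The subtotal described here is a single subtraction. A minimal sketch on a hypothetical two-row `sample` frame:

```python
import pandas as pd

# Hypothetical values for illustration only.
sample = pd.DataFrame({
    "total_amount": [16.56, 9.80],
    "tip_amount": [2.76, 0.00],
})

# New target column: total cost with the tip removed, rounded to cents.
sample["subtotal"] = (sample["total_amount"] - sample["tip_amount"]).round(2)

print(sample)
```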

**In conclusion**, the features we will remove are ID, Store_and_fwd_flag, Fare_amount, Extra, MTA_tax, Improvement_surcharge, Tip_amount, Tolls_amount, and Total_amount.
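As a sketch of that removal step: the subtotal target has to be computed before `total_amount` and `tip_amount` are dropped. The one-row `trips` frame is hypothetical, and I am assuming the lowercase column names from the `df.info()` listing, with the trip ID appearing as `Unnamed: 0`.

```python
import pandas as pd

# Hypothetical one-row frame; column names are assumed from df.info().
trips = pd.DataFrame({
    "Unnamed: 0": [1], "VendorID": [2], "trip_distance": [3.34],
    "store_and_fwd_flag": ["N"], "fare_amount": [13.0], "extra": [0.0],
    "mta_tax": [0.5], "tip_amount": [2.76], "tolls_amount": [0.0],
    "improvement_surcharge": [0.3], "total_amount": [16.56],
})

# Build the target first, while its inputs still exist.
trips["subtotal"] = trips["total_amount"] - trips["tip_amount"]

to_drop = [
    "Unnamed: 0", "store_and_fwd_flag", "fare_amount", "extra", "mta_tax",
    "improvement_surcharge", "tip_amount", "tolls_amount", "total_amount",
]

# errors="ignore" skips any name that is missing instead of raising.
features = trips.drop(columns=to_drop, errors="ignore")

print(features.columns.tolist())
```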

In [1]:
# Imports
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from matplotlib import pyplot as plt
from datetime import datetime, date, timedelta

In [2]:
df = pd.read_csv("taxicab_original.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  to

In [5]:
df["tolls_amount"].value_counts()

tolls_amount
0.00     21525
5.76       847
5.54       239
10.50       21
12.50       11
2.64        10
2.54         6
11.52        3
16.26        3
16.50        2
16.00        2
8.50         2
18.00        2
15.50        2
18.28        1
8.40         1
16.20        1
5.45         1
2.70         1
16.62        1
8.16         1
5.16         1
8.00         1
4.32         1
11.75        1
15.58        1
17.50        1
6.00         1
13.45        1
5.49         1
17.28        1
5.44         1
2.16         1
13.00        1
19.10        1
18.26        1
15.00        1
6.32         1
Name: count, dtype: int64
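To put these counts in perspective, the sketch below works out what share of trips had no tolls and how much of the data the three most common values cover. The numbers are copied from the output above and from the row count reported by `df.info()`; the variable names are mine.

```python
# Counts copied from the value_counts() output and the df.info() row count.
n_total = 22699
n_zero = 21525                 # trips with tolls_amount == 0.00
n_top3 = 21525 + 847 + 239     # 0.00, 5.76 and 5.54 combined

share_zero = n_zero / n_total
share_top3 = n_top3 / n_total

print(f"No tolls:         {share_zero:.1%}")   # 94.8%
print(f"Top three values: {share_top3:.1%}")   # 99.6%

# On the loaded DataFrame, the first share would be (df["tolls_amount"] == 0).mean()
```

With so few trips paying any toll, most of the toll variance sits in a small number of fixed amounts.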