# Project Overview and Purpose

This project is part of the Google Advanced Data Analytics Professional Certificate.

We aim to conduct a data analytics project for the New York City Taxi and Limousine Commission (TLC). New York City TLC is an agency responsible for licensing and regulating New York City's taxi cabs and for-hire vehicles. The agency needs to develop a regression model that helps estimate taxi fares before the ride, based on data that TLC has gathered. 


As part of this project, we analyze the relationship between fare amount and payment type. Specifically, we want to find out whether paying the fare by card is associated with a higher fare amount.


The purpose of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. The A/B test results should help find ways to generate more revenue for taxi cab drivers.


In [2]:
# import libraries

import numpy as np
import pandas as pd
import scipy.stats  # import the stats submodule explicitly; a bare `import scipy` may not expose scipy.stats


In [8]:
Taxi_data = pd.read_csv("C:\\Users\\Amirhossein Hosseini\\OneDrive - Queen's University\\Coursera_Google_Advanced_Data_Analytics_Professional\\2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)


In [9]:
Taxi_data.head(10)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8
23345809,2,03/25/2017 8:34:11 PM,03/25/2017 8:42:11 PM,6,2.3,1,N,161,236,1,9.0,0.5,0.5,2.06,0.0,0.3,12.36
37660487,2,05/03/2017 7:04:09 PM,05/03/2017 8:03:47 PM,1,12.83,1,N,79,241,1,47.5,1.0,0.5,9.86,0.0,0.3,59.16
69059411,2,08/15/2017 5:41:06 PM,08/15/2017 6:03:05 PM,1,2.98,1,N,237,114,1,16.0,1.0,0.5,1.78,0.0,0.3,19.58
8433159,2,02/04/2017 4:17:07 PM,02/04/2017 4:29:14 PM,1,1.2,1,N,234,249,2,9.0,0.0,0.5,0.0,0.0,0.3,9.8
95294817,1,11/10/2017 3:20:29 PM,11/10/2017 3:40:55 PM,1,1.6,1,N,239,237,1,13.0,0.0,0.5,2.75,0.0,0.3,16.55


`ID`: Trip identification number. The CSV header leaves this first column unnamed, so pandas labels it `Unnamed: 0`.

`VendorID`: A code indicating the TPEP provider that provided the record.  


`tpep_pickup_datetime`: The date and time when the meter was engaged. 

`tpep_dropoff_datetime`: The date and time when the meter was disengaged. 

`passenger_count`: The number of passengers in the vehicle. This is a driver-entered value.

`trip_distance`: The elapsed trip distance in miles reported by the taximeter.

`PULocationID`: TLC Taxi Zone in which the taximeter was engaged.

`DOLocationID`: TLC Taxi Zone in which the taximeter was disengaged.

`RatecodeID`: The final rate code in effect at the end of the trip.

1 = Standard rate

2 = JFK

3 = Newark

4 = Nassau or Westchester

5 = Negotiated fare

6 = Group ride

`store_and_fwd_flag`: This flag indicates whether the trip record was held in vehicle memory before being sent to the vendor (known as "store and forward") because the vehicle did not have a connection to the server.

Y = store and forward trip

N = not a store and forward trip

`payment_type`: A numeric code signifying how the passenger paid for the trip.

1 = Credit card

2 = Cash

3 = No charge

4 = Dispute

5 = Unknown

6 = Voided trip

`fare_amount`: The time-and-distance fare calculated by the meter.

`extra`: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.

`mta_tax`: $0.50 MTA tax that is automatically triggered based on the metered rate in use.

`improvement_surcharge`: $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.

`tip_amount`: The tip amount. This field is automatically populated for credit card tips; cash tips are not included.

`tolls_amount`: Total amount of all tolls paid during the trip.

`total_amount`: The total amount charged to passengers. Does not include cash tips.
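The coded columns in the data dictionary above can be translated into readable labels before analysis. The following is a minimal sketch, not part of the original notebook: the `sample` frame is a small hypothetical stand-in for `Taxi_data`, and the names `RATE_CODES`, `PAYMENT_TYPES`, `rate_label`, and `payment_label` are chosen here for illustration only.

```python
import pandas as pd

# Code-to-label lookups copied from the data dictionary
RATE_CODES = {1: "Standard rate", 2: "JFK", 3: "Newark",
              4: "Nassau or Westchester", 5: "Negotiated fare", 6: "Group ride"}
PAYMENT_TYPES = {1: "Credit card", 2: "Cash", 3: "No charge",
                 4: "Dispute", 5: "Unknown", 6: "Voided trip"}

# Tiny hypothetical frame standing in for Taxi_data
sample = pd.DataFrame({"RatecodeID": [1, 2, 5], "payment_type": [1, 2, 4]})

# Series.map returns NaN for any code that is missing from the lookup,
# which makes unexpected codes easy to spot
decoded = sample.assign(
    rate_label=sample["RatecodeID"].map(RATE_CODES),
    payment_label=sample["payment_type"].map(PAYMENT_TYPES),
)
print(decoded)
```

Keeping the labels in separate columns leaves the original integer codes available for filtering and grouping.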

In [6]:
Taxi_data.describe(include='all')

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,,22687,22688,,,,2,,,,,,,,,,
top,,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,,2,2,,,,22600,,,,,,,,,,
mean,56758490.0,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,32744930.0,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,12127.0,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,28520560.0,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,56731500.0,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,85374520.0,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


In [7]:
# Inspect dtypes and non-null counts; there are no null values in the dataframe
Taxi_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  to

In [100]:
""" We are interested in the relationship between payment type and the fare amount the customer pays. 
    One approach is to look at the average fare amount for each payment type.""";

Payment_types = {1: 'Credit card', 2: 'Cash', 3: 'No charge', 4: 'Dispute'}

# Mean fare_amount for each payment type
mean_fare_amount = Taxi_data.groupby('payment_type')['fare_amount'].mean().reset_index()

# Start indexing from 1 instead of 0 so the index matches the integer payment_type codes
mean_fare_amount.index = range(1, len(mean_fare_amount) + 1)

# value_counts() is indexed by payment_type code, so it aligns with the 1-based index above
# (this relies on the codes present in the data being exactly 1 to 4)
mean_fare_amount['count'] = Taxi_data['payment_type'].value_counts()

# Replace the numeric codes with readable labels
mean_fare_amount['payment_type'] = mean_fare_amount['payment_type'].replace(Payment_types)

mean_fare_amount


Unnamed: 0,payment_type,fare_amount,count
1,Credit card,13.429748,15265
2,Cash,12.213546,7267
3,No charge,12.186116,121
4,Dispute,9.913043,46


Based on the averages shown, customers who pay by credit card appear to pay a higher fare amount on average than customers who pay in cash or by other means. However, this difference might be due to chance in the sample rather than a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a hypothesis test.
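To make the idea of a difference "due to chance" concrete, one option is a permutation check: shuffle the group labels many times and see how often a gap as large as the observed one appears. The sketch below is illustrative only. It uses synthetic numbers rather than the TLC data, and it is a different technique from the t-test used in this notebook; all variable names are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two fare samples (NOT the TLC data)
card = rng.normal(loc=13.4, scale=5.0, size=500)
cash = rng.normal(loc=12.2, scale=5.0, size=300)

observed = card.mean() - cash.mean()

# Shuffle the pooled fares and re-split them to see how large a gap
# chance alone produces when the labels carry no information
pooled = np.concatenate([card, cash])
n_perm = 2000
diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    diffs[i] = shuffled[:card.size].mean() - shuffled[card.size:].mean()

# Two-sided p-value: share of shuffles with a gap at least as extreme as observed
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed difference: {observed:.3f}, permutation p-value: {p_value:.4f}")
```

A small share of shuffles reaching the observed gap suggests the gap is unlikely to come from sampling noise alone.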

# A/B test

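We conduct an A/B test using a two-sample t-test at a significance level of 5%. The test call sets `equal_var=False`, so it is Welch's version of the test, which does not assume equal variances in the two groups.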

`Null hypothesis`: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

`Alternative hypothesis`: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.


Note that because both samples are large, the sampling distribution of the difference in means is approximately normal, so the t-test is appropriate here.
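For readers who want to see what the two-sample test computes when equal variances are not assumed, the sketch below works through Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `scipy.stats.ttest_ind`. The samples are synthetic (not the TLC data) and the variable names are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples with unequal sizes and variances (NOT the TLC data)
a = rng.normal(13.0, 6.0, size=400)
b = rng.normal(12.0, 4.0, size=250)

# Squared standard error of each sample mean
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic: difference in means over the unpooled standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# Same test via SciPy; equal_var=False selects Welch's version
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The manual and SciPy values should agree up to floating-point error.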

In [99]:
scipy.stats.ttest_ind(Taxi_data.loc[Taxi_data['payment_type'] == 1, 'fare_amount'],
                      Taxi_data.loc[Taxi_data['payment_type'] == 2, 'fare_amount'],
                      equal_var=False)



Ttest_indResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12)

# Results and recommendations

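The p-value (about 6.8e-12) is far below the 0.05 significance level, so we reject the null hypothesis. Because the test statistic is positive and the credit card group is the first sample, we conclude that the average fare amount is higher for customers who use credit cards than for customers who use cash.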
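Because the conclusion is directional (higher fares with credit cards), a one-sided test is a natural companion to the two-sided one. The sketch below is an assumption-laden illustration on synthetic data, not a rerun of the analysis. It uses the `alternative` parameter of `scipy.stats.ttest_ind`, which requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic fares (NOT the TLC data)
card = rng.normal(13.4, 5.0, size=500)
cash = rng.normal(12.2, 5.0, size=300)

# Two-sided Welch test, H1: mean(card) != mean(cash)
two_sided = stats.ttest_ind(card, cash, equal_var=False)

# One-sided Welch test, H1: mean(card) > mean(cash)
one_sided = stats.ttest_ind(card, cash, equal_var=False, alternative="greater")

alpha = 0.05
print(f"two-sided p = {two_sided.pvalue:.4g}, one-sided p = {one_sided.pvalue:.4g}")
print("reject H0 (one-sided):", one_sided.pvalue < alpha)
```

When the statistic is positive, the one-sided p-value is half the two-sided one, so a two-sided rejection with a positive statistic already implies the directional conclusion.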


The key business insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers.