## Project Goal

Leadership has asked us to add one item to the initial project scope: a detailed statistical analysis of payment type. Specifically, do customers who pay by credit card pay higher fare amounts than those who pay cash?

The New York City TLC team is asking us to:

* Consider the relationship between fare amount and payment type.
* Test the hypothesis that customers who use a credit card pay higher fare amounts.
* If the relationship between credit card payment and fare amount is statistically significant, discuss next steps: what strategies could our team implement to encourage customers to pay by credit card?

### Task
* Conduct an A/B test to analyze the relationship between fare amount and payment type

#### The goal is split into three parts

* Part 1: Imports and data loading
* Part 2: Conduct EDA and hypothesis testing
* Part 3: Communicate insights with stakeholders

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import os
import math


In [5]:
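# Build the path to the raw CSV relative to the current working directory
cwd = os.getcwd()
data_path = os.path.join(cwd, "Raw_data", "2017_Yellow_Taxi_Trip_Data.csv")
# Use the first CSV column (Unnamed: 0) as the DataFrame index
raw_tlc = pd.read_csv(data_path, index_col=0)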

raw_tlc.head(10)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8
23345809,2,03/25/2017 8:34:11 PM,03/25/2017 8:42:11 PM,6,2.3,1,N,161,236,1,9.0,0.5,0.5,2.06,0.0,0.3,12.36
37660487,2,05/03/2017 7:04:09 PM,05/03/2017 8:03:47 PM,1,12.83,1,N,79,241,1,47.5,1.0,0.5,9.86,0.0,0.3,59.16
69059411,2,08/15/2017 5:41:06 PM,08/15/2017 6:03:05 PM,1,2.98,1,N,237,114,1,16.0,1.0,0.5,1.78,0.0,0.3,19.58
8433159,2,02/04/2017 4:17:07 PM,02/04/2017 4:29:14 PM,1,1.2,1,N,234,249,2,9.0,0.0,0.5,0.0,0.0,0.3,9.8
95294817,1,11/10/2017 3:20:29 PM,11/10/2017 3:40:55 PM,1,1.6,1,N,239,237,1,13.0,0.0,0.5,2.75,0.0,0.3,16.55


payment_type is encoded in integers:

1: Credit card
2: Cash
3: No charge
4: Dispute
5: Unknown
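As a small illustration of the encoding above, the integer codes can be mapped to readable labels before grouping or plotting. This is only a sketch: the names `PAYMENT_LABELS` and `label_payment_type` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Code-to-label mapping taken from the list above.
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}


def label_payment_type(codes: pd.Series) -> pd.Series:
    """Return the readable label for each payment_type code (NaN if unmapped)."""
    return codes.map(PAYMENT_LABELS)


# Toy codes, only to show the call; not rows from the TLC data.
demo_codes = pd.Series([1, 2, 1, 4], name="payment_type")
print(label_payment_type(demo_codes).tolist())
```

On the real data, the same call could be applied as `label_payment_type(raw_tlc["payment_type"])`.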

In [12]:
payment_null_mask = raw_tlc["payment_type"].isna()

payment_null_mask.sum()  # there are no nulls in payment_type

np.int64(0)

In [13]:
raw_tlc[["payment_type","fare_amount"]].describe()

,payment_type,fare_amount
count,22699.0,22699.0
mean,1.336887,13.026629
std,0.496211,13.243791
min,1.0,-120.0
25%,1.0,6.5
50%,1.0,9.5
75%,2.0,14.5
max,4.0,999.99


In [40]:
payment_type_desc = raw_tlc.groupby("payment_type").agg({"fare_amount": ["count", "sum", "mean", "std", "max", "min"]})

payment_type_desc

,fare_amount,fare_amount,fare_amount,fare_amount,fare_amount,fare_amount
payment_type,count,sum,mean,std,max,min
1,15265,205005.1,13.429748,13.848964,999.99,0.0
2,7267,88755.84,12.213546,11.68994,450.0,0.0
3,121,1474.52,12.186116,14.894232,65.5,-4.5
4,46,456.0,9.913043,24.162943,52.0,-120.0


In [38]:
payment_type_desc["fare_amount"].loc[1:2,["mean","sum"]]

payment_type,mean,sum
1,13.429748,205005.1
2,12.213546,88755.84


In [39]:
raw_tlc.shape

(22699, 17)

In [43]:
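# Split the trips by payment type (1 = credit card, 2 = cash)
cash_data = raw_tlc[raw_tlc["payment_type"] == 2]
credit_card_data = raw_tlc[raw_tlc["payment_type"] == 1]

# Draw 1,000 trips per group (with replacement), using fixed seeds for reproducibility
cash_sample = cash_data.sample(n=1000, replace=True, random_state=13500)
credit_card_sample = credit_card_data.sample(n=1000, replace=True, random_state=18750)

# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"Credit Card sample mean: {credit_card_sample['fare_amount'].mean():.4f}")
print(f"Cash sample mean: {cash_sample['fare_amount'].mean():.4f}")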

Credit Card sample mean: 13.3697
Cash sample mean: 12.6753


# State the hypotheses

* H0: There is **no** difference in mean fare_amount between customers who pay by credit card and customers who pay cash
* Ha: There is a difference in mean fare_amount between customers who pay by credit card and customers who pay cash
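The hypotheses above are two-sided, while the project goal asks specifically whether credit card customers pay *higher* fares. A directional version of the test could be sketched as follows, assuming SciPy 1.6 or later (where `stats.ttest_ind` accepts the `alternative` argument). The helper name `one_sided_welch` and the toy fare lists are hypothetical and are not part of the analysis in this notebook.

```python
from scipy import stats


def one_sided_welch(credit_fares, cash_fares):
    """Welch's t-test with Ha: mean(credit_fares) > mean(cash_fares)."""
    return stats.ttest_ind(
        credit_fares, cash_fares, equal_var=False, alternative="greater"
    )


# Tiny made-up fare lists, only to show the call; not taken from the TLC data.
toy_credit = [10.0, 12.0, 14.0, 16.0, 18.0]
toy_cash = [9.0, 10.0, 11.0, 12.0, 13.0]

one_sided = one_sided_welch(toy_credit, toy_cash)
two_sided = stats.ttest_ind(toy_credit, toy_cash, equal_var=False)

print(f"one-sided p = {one_sided.pvalue:.4f}, two-sided p = {two_sided.pvalue:.4f}")
```

When the t statistic is positive, the one-sided p-value is half the two-sided one, so the choice of alternative should be made before looking at the result.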

In [44]:
# Set the significance level
significance_level = 0.05

# Welch's t-test: equal_var=False does not assume equal variances in the two groups
t_score, p_value = stats.ttest_ind(a=credit_card_sample["fare_amount"], b=cash_sample["fare_amount"], equal_var=False)

print(f"The t_score for this two tail test is: {t_score:.4f}")
print(f"The p_value for this test is: {p_value:.4f}")

The t_score for this two tail test is: 1.0672
The p_value for this test is: 0.2860


* In this sample of 1,000 trips per payment type, the p_value of 0.2860 is higher than the significance level of 0.05, so I fail to reject the null hypothesis

* This means the sample provides no statistically significant evidence that mean fare_amount differs by payment type
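One reason a sample of this size may fail to detect a difference is that the standard error of the difference in means shrinks as the group sizes grow. The sketch below compares the Welch standard error for 1,000 trips per group with the standard error for the full group sizes. It assumes the group standard deviations from the grouped summary table earlier in the notebook are a fair stand-in for the sample standard deviations; the function name `welch_standard_error` is hypothetical.

```python
import math

# Standard deviations of fare_amount from the grouped summary table
# (payment_type 1 = credit card, 2 = cash). Used here as an approximation
# for the standard deviations within each 1,000-trip sample.
std_credit = 13.848964
std_cash = 11.68994


def welch_standard_error(std_a, n_a, std_b, n_b):
    """Standard error of the difference between two independent group means."""
    return math.sqrt(std_a**2 / n_a + std_b**2 / n_b)


se_sample = welch_standard_error(std_credit, 1000, std_cash, 1000)
se_full = welch_standard_error(std_credit, 15265, std_cash, 7267)

print(f"SE with 1,000 trips per group: {se_sample:.4f}")
print(f"SE with all trips:             {se_full:.4f}")
```

Under these assumptions the standard error for the 1,000-trip samples is roughly three times larger, so the same difference in means produces a much smaller t statistic.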

***
* Repeat the test with the full set of records for both payment types

In [45]:

# Set the significance level
significance_level = 0.05

# Welch's t-test on all credit card and cash trips (no sampling)
t_score, p_value = stats.ttest_ind(a=credit_card_data["fare_amount"], b=cash_data["fare_amount"], equal_var=False)

print(f"The t_score for this two tail test is: {t_score:.4f}")
print(f"The p_value for this test is: {p_value:.4f}")

The t_score for this two tail test is: 6.8668
The p_value for this test is: 0.0000


* However, when I use the full data instead of a sample of 1,000 trips per group, the p_value is almost 0, which is lower than the 5% significance level, so in this scenario I reject the null hypothesis

* This means there is statistically significant evidence that the mean fare amount differs by payment method
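As a cross-check, the full-data t statistic can be recomputed from the grouped summary table alone (count, mean, and std for payment types 1 and 2) with `stats.ttest_ind_from_stats`. This is a sketch rather than part of the original analysis; the dictionary names `credit` and `cash` are hypothetical, and the numbers are copied from the table as rounded in the output.

```python
from scipy import stats

# Summary statistics of fare_amount copied from the grouped table
# (payment_type 1 = credit card, 2 = cash).
credit = {"mean": 13.429748, "std": 13.848964, "n": 15265}
cash = {"mean": 12.213546, "std": 11.68994, "n": 7267}

# Welch's t-test from summary statistics (equal_var=False).
result = stats.ttest_ind_from_stats(
    mean1=credit["mean"], std1=credit["std"], nobs1=credit["n"],
    mean2=cash["mean"], std2=cash["std"], nobs2=cash["n"],
    equal_var=False,
)
t_from_summary, p_from_summary = result.statistic, result.pvalue

print(f"t = {t_from_summary:.4f}, p = {p_from_summary:.2e}")
```

The statistic agrees with the value of about 6.87 reported by the full-data test, which suggests the two computations are consistent.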