# Conduct A/B Testing

## Part [01] : Plan

### Research question
Consider a research question now, at the start of this task.

**response:** The research question for this data project: “Is there a relationship between total fare amount and payment type?”

### Task 1. Imports and data loading

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [2]:
import pandas as pd 
from scipy import stats

In [3]:
taxi_data = pd.read_csv("data/2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

In [4]:
taxi_data.head(5)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


## Part [02] : Analyze and Conduct

Data professionals use descriptive statistics for Exploratory Data Analysis, to quickly explore and understand large amounts of data. In this case, computing descriptive statistics helps me quickly compare the average total fare amount among different payment types.


### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
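
For readability, the integer codes can be mapped to labels before grouping or plotting. The following is a minimal sketch, assuming a small hypothetical stand-in DataFrame (`sample`) rather than the real `taxi_data`; the column name `payment_label` is also an assumption made for illustration.

```python
import pandas as pd

# Code-to-label mapping taken from the note above
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical stand-in for taxi_data (values are illustrative only)
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 3, 4],
    "fare_amount": [13.0, 16.5, 6.5, 9.0, 12.0],
})

# Series.map replaces each code with its label; unmapped codes become NaN
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
print(sample)
```

The same `map` call should work on `taxi_data['payment_type']`, assuming the column holds only the integer codes listed above.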

In [6]:
# descriptive stats for EDA
taxi_data.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
mean,1.556236,1.642319,2.913313,1.043394,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,1.285231,3.653171,0.708391,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,1.0,0.99,1.0,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,1.0,1.61,1.0,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,2.0,3.06,1.0,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8
max,2.0,6.0,33.96,99.0,265.0,265.0,4.0,999.99,4.5,0.5,200.0,19.1,0.3,1200.29


I am interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type.

In [7]:
taxi_data.groupby('payment_type')[['fare_amount']].mean()

payment_type,fare_amount
1,13.429748
2,12.213546
3,12.186116
4,9.913043


Based on the averages shown, it appears that customers who pay by credit card tend to pay a larger fare amount than customers who pay in cash. However, this difference might arise from random sampling variation rather than from a true difference in fare amount. To assess whether the difference is statistically significant, I conduct a hypothesis test.

### Task 3. Hypothesis testing

**Null hypothesis**: There is no difference in average fare between customers who use credit cards and customers who use cash. 

**Alternative hypothesis**: There is a difference in average fare between customers who use credit cards and customers who use cash.

I choose `5%` as the significance level and proceed with a `two-sample t-test`.

The steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis
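
The four steps above can be sketched end to end as follows. This is an illustration only: the two samples are synthetic, generated with a fixed seed, and are not the taxi data; the names `group_a`, `group_b`, `alpha`, and `decision` are assumptions introduced for the example.

```python
import numpy as np
from scipy import stats

# Step 1: H0: the two group means are equal; Ha: the group means differ

# Synthetic stand-in samples (NOT the taxi data), generated with a fixed seed
rng = np.random.default_rng(0)
group_a = rng.normal(loc=13.4, scale=10.0, size=500)
group_b = rng.normal(loc=12.2, scale=10.0, size=500)

# Step 2: choose a significance level
alpha = 0.05

# Step 3: find the p-value (equal_var=False requests Welch's t-test)
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Step 4: reject or fail to reject the null hypothesis
decision = "reject" if result.pvalue < alpha else "fail to reject"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}: {decision} the null hypothesis")
```

Storing the significance level in a variable keeps the decision rule explicit instead of leaving the comparison to be done by eye.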

In [8]:
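# Hypothesis test (A/B test) at the 5% significance level
# Group A: fares paid by credit card; group B: fares paid in cash
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']

# equal_var=False runs Welch's t-test, which does not assume equal variances
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)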

TtestResult(statistic=np.float64(6.866800855655372), pvalue=np.float64(6.797387473030518e-12), df=np.float64(16675.48547403633))
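
Because `equal_var = False` is passed, `ttest_ind` performs Welch's t-test, which does not assume the two groups share a variance; this is also why the reported `df` is not a whole number. The sketch below shows the formulas behind the statistic and the degrees of freedom using only the standard library. The helper `welch_t` is hypothetical (it is not part of SciPy), and the two short samples are made up for illustration.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(a), len(b)
    # Sample variances (n - 1 denominator), each divided by its sample size
    var_a = statistics.variance(a) / n_a
    var_b = statistics.variance(b) / n_b
    # Difference in means scaled by its estimated standard error
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Made-up fares for illustration only
credit_sample = [13.0, 16.0, 6.5, 20.5, 9.0]
cash_sample = [16.5, 8.0, 7.5, 12.0]

t_stat, dof = welch_t(credit_sample, cash_sample)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

On the same inputs, these values should agree with the `statistic` and `df` fields returned by `stats.ttest_ind(..., equal_var=False)`, up to floating-point rounding.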

**response:** Since the `p-value` is far smaller than the significance level of `5%`, I `reject` the null hypothesis.

*Notice the 'e-12' at the end of the pvalue result: it is scientific notation, so the p-value is about 6.8 × 10⁻¹², many orders of magnitude below 0.05.*

I conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.