**Automatidata – A/B Test on Payment Type vs Fare Amount (Course 4)**

**Project Summary**

This project explores whether the payment method (credit card vs cash) is associated with differences in average fare amount for NYC Yellow Taxi trips.

The analysis uses descriptive statistics and a two-sample t-test to determine whether the observed difference is statistically significant.


**Table of Contents:**

1- Import Required Libraries & Load the Dataset

2- Exploratory Data Analysis (EDA)


*   2.1 Descriptive Statistics
*   2.2 Average Fare Amount by Payment Type



3- Hypothesis Testing

*   3.1 Define Hypotheses
*   3.2 Perform Two-Sample T-Test


4- Communicating Insights

* 4.1 Business Insight

* 4.2 Limitations & Assumptions

5- Final Summary



---



**1- Import necessary libraries & load the dataset.**

In [1]:
# Run this cell only when loading the data from Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
# Load dataset into dataframe
taxi_data = pd.read_csv("/content/drive/MyDrive/Data & Research üìâ/Courses/Google Advanced Analytics Certificate/4- Power of statistics /Project/Code/2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

How can descriptive statistics help at this stage?

Descriptive statistics help me check for missing values and see the range of values for each variable. They also show the spread of the data through the standard deviation. From there, we can take a closer look at the two variables we're interested in: the fare amount and the payment type.
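As a minimal sketch of those checks, the cell below counts missing values and summarizes the fare column. The `sample` dataframe and its values are hypothetical stand-ins (including one deliberately missing fare); in the notebook the same calls would be applied to `taxi_data`.

```python
import pandas as pd

# Hypothetical miniature sample with the two columns of interest;
# in the notebook, `taxi_data` would be used instead of `sample`.
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 2, 1],
    "fare_amount": [13.0, 16.5, 6.5, None, 20.5],
})

# Count missing values per column
missing_counts = sample.isna().sum()

# Summary statistics (count, mean, std, min, quartiles, max) for the fare
fare_summary = sample["fare_amount"].describe()

print(missing_counts)
print(fare_summary)
```

`describe()` skips missing values, so its `count` can be lower than the number of rows whenever a column has gaps.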

**2- Exploratory Data Analysis**

In [4]:
taxi_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [5]:
taxi_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float64
 1

In [6]:
# check the dataset characteristics
taxi_data.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,10/17/2017 10:54:24 AM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


We are interested in the relationship between payment type and the fare amount the customer pays. One way to explore it is to compare the average fare amount for each payment type.

In [10]:
# check the average fare amount for each payment type
taxi_data.groupby('payment_type')['fare_amount'].mean()

payment_type,fare_amount
1,13.429748
2,12.213546
3,12.186116
4,9.913043


**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
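To make grouped output easier to read, the integer codes can be mapped to their labels before grouping. This is only a sketch: `PAYMENT_LABELS`, `sample`, and the `payment_label` column are hypothetical names introduced here, and the fares are made up for illustration.

```python
import pandas as pd

# Code-to-label mapping taken from the note above
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical sample; in the notebook this would be `taxi_data`
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 2, 3],
    "fare_amount": [13.0, 16.5, 6.5, 9.5, 12.0],
})

# Add a readable label column, then average the fare per label
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
mean_fare_by_label = sample.groupby("payment_label")["fare_amount"].mean()

print(mean_fare_by_label)
```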



The averages suggest that credit card trips carry a higher fare on average than cash trips (about 13.43 vs 12.21). However, this observed difference could simply be due to random variation in the sample. To determine whether the difference reflects a real underlying effect, a hypothesis test is performed.


**3- Hypothesis Testing**

3.1- Define the Hypotheses

H₀: There is no difference in mean fare between credit card and cash users.

Hₐ: The mean fare differs between the two payment types.


The steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis
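Step 4 above comes down to a single comparison between the p-value and the chosen significance level. The helper below is a hypothetical illustration of that rule (the name `decide` is not part of the notebook or of SciPy), using the common convention of rejecting when the p-value is strictly below the level.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Reject the null hypothesis only when the p-value is below alpha
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))              # below the default 0.05 level
print(decide(0.20))              # above the default 0.05 level
print(decide(0.03, alpha=0.01))  # same p-value, stricter level
```

Note that the significance level is fixed before looking at the p-value, which is why it appears as step 2 rather than being tuned afterwards.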



3.2- Two-Sample T-Test

In [8]:
# set the significance level
significance_level= 0.05

In [9]:
# conduct a two-sided Welch's t-test (equal_var=False does not assume equal variances)
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)

TtestResult(statistic=np.float64(6.866800855655372), pvalue=np.float64(6.797387473030518e-12), df=np.float64(16675.48547403633))

The p-value (about 6.8e-12) is far below the 0.05 significance level, so we reject the null hypothesis: a difference in average fare this large would be very unlikely if the true means were equal.
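To make the `statistic` and `df` fields in the result above less opaque, here is a minimal sketch of the Welch computation that `equal_var=False` selects, written with the standard library only. The function name `welch_t` and the two small groups are hypothetical; the numbers are not taken from the taxi dataset.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    n_a, n_b = len(a), len(b)
    # Sample variance of each group divided by its size
    va = statistics.variance(a) / n_a
    vb = statistics.variance(b) / n_b
    # Difference in means scaled by its estimated standard error
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


# Hypothetical fares, not taken from the dataset
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [1.0, 2.0, 3.0]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

Because the degrees of freedom come from an approximation rather than from `n - 2`, they are generally not a whole number, which is why the notebook output reports a fractional `df`.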

**4- Communicating Insights**

4.1- Business Insight

Because customers who pay by credit card pay a higher fare on average than those who pay in cash, the company could launch a marketing campaign that encourages more customers to pay by card. This can be achieved through offers, promotions, and perks specific to card payments.

4.2- Limitations and Assumptions
* The observed behavior was inferred only from the available data, which might not reflect the behavior of the overall population.

* The difference in average fare between the two payment types is not very large.

* The interpretation treats customers as if they were randomly assigned to a payment type, but in reality payment choice might be driven by trip characteristics (e.g., longer trips encouraging credit card use).
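The limitations above note that the difference in fare is not very large. One conventional way to put a number on that is Cohen's d, the difference in means divided by a pooled standard deviation. The sketch below is an assumption about how this could be measured, not something the notebook computes. `cohens_d` and the sample values are hypothetical, and the pooled form is itself an approximation when group variances differ.

```python
import math
import statistics


def cohens_d(a, b):
    """Return Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Hypothetical fares, not taken from the dataset
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [1.0, 2.0, 3.0]

d = cohens_d(group_a, group_b)
print(d)
```

By Cohen's rule of thumb, values near 0.2 are considered small, 0.5 medium, and 0.8 large. As a rough indication only, dividing the observed gap in means (about 1.22) by the overall fare standard deviation from `describe()` (about 13.24) gives roughly 0.09; the per-group standard deviations would be needed for a proper estimate.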

**5- Final Summary**

This A/B test shows a statistically significant difference in average fare between credit card and cash payments. While the effect size is small, the findings provide useful direction for promotional strategies and further investigation into customer behavior.