1. What is your research question for this data project? 

The primary research question for this data project is: Do customers who pay with credit cards have a higher fare amount compared to customers who pay with cash?

This question aims to analyze whether there is a significant difference in fare amounts based on the payment type.

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [12]:
#data preparation
taxi_data = pd.read_csv("C://Users//hp//Desktop//PYTHON//Model Development//2017_Yellow_Taxi_Trip_Data.csv")
taxi_data.head()

,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
1,35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
2,106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
3,38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
4,30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing
descriptive statistics help you learn more about your data in this stage of your analysis?

Computing descriptive statistics helps you learn more about your data by:

Summarization: Provides a quick overview of key metrics like mean, median, and standard deviation.

Comparison: Facilitates easy comparison between different groups (e.g., credit card vs. cash).

Identifying Patterns: Reveals the distribution shape and data characteristics, such as skewness and outliers.

Informed Hypothesis Testing: Helps choose appropriate statistical tests based on data properties.

Visual Insights: Supports the creation of visualizations that illustrate data trends and distributions clearly.

In short, descriptive statistics offer crucial insights that guide further analysis and inform decision-making.
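
As a rough illustration of the points above, the sketch below computes a summary, the skewness, and a simple outlier count. It runs on synthetic right-skewed values standing in for fare_amount, not on the real taxi data; the lognormal parameters and the 1.5 * IQR outlier rule are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for fare_amount: right-skewed synthetic fares, not the real data
rng = np.random.default_rng(0)
fares = pd.Series(rng.lognormal(mean=2.3, sigma=0.6, size=1000), name="fare_amount")

summary = fares.describe()   # count, mean, std, min, quartiles, max
skewness = fares.skew()      # a value above 0 indicates a long right tail

# Flag outliers with the 1.5 * IQR rule (one common convention among several)
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = fares[(fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)]

print(summary)
print(f"skewness: {skewness:.2f}, outliers flagged: {len(outliers)}")
```

A mean well above the median, together with positive skewness, is the kind of signal that would suggest checking test assumptions before comparing groups.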

In [13]:
#data cleaning
taxi_data.isnull().sum()

Unnamed: 0               0
VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
dtype: int64

In [14]:
taxi_data.dtypes

Unnamed: 0                 int64
VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
RatecodeID                 int64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object

In [17]:
#Data exploration
#Use descriptive statistics to conduct Exploratory Data Analysis (EDA).
#Note: In the dataset, payment_type is encoded in integers:
#  1: Credit card
#  2: Cash
#  3: No charge
#  4: Dispute
#  5: Unknown


#map payment types
payment_mapping = { 1: 'Credit Card', 2: 'Cash', 3: 'No Charge', 4: 'Dispute', 5: 'Unknown'}
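# Assign the mapped labels back to the column; map() returns a new Series
taxi_data['payment_type'] = taxi_data['payment_type'].map(payment_mapping)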

taxi_data[['payment_type', 'fare_amount']].head()

,payment_type,fare_amount
0,Credit Card,13.0
1,Credit Card,16.0
2,Credit Card,6.5
3,Credit Card,20.5
4,Cash,16.5


In [21]:
# Calculate descriptive statistics for fare_amount by payment type
taxi_data.groupby('payment_type')['fare_amount'].describe()

payment_type,count,mean,std,min,25%,50%,75%,max
Cash,7267.0,12.213546,11.68994,0.0,6.0,9.0,14.0,450.0
Credit Card,15265.0,13.429748,13.848964,0.0,7.0,9.5,15.0,999.99
Dispute,46.0,9.913043,24.162943,-120.0,5.0,8.5,17.625,52.0
No Charge,121.0,12.186116,14.894232,-4.5,2.5,7.0,15.0,65.5


Based on the averages shown, it appears that customers who pay by credit card tend to pay a larger
fare amount than customers who pay in cash. However, this difference might arise from random
sampling, rather than being a true difference in fare amount. To assess whether the difference is
statistically significant, you conduct a hypothesis test.
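
The idea that a gap in sample means can appear by chance alone can be illustrated with a small simulation. The sketch below draws two groups repeatedly from one shared synthetic fare distribution, so any gap between them is pure sampling noise. The population, the group sizes, and the helper name `sample_gap` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical population: one shared fare distribution for everyone (no true difference)
population = rng.lognormal(mean=2.3, sigma=0.6, size=100_000)

def sample_gap(n_card=500, n_cash=250):
    """Difference in mean fare between two random samples from the same population."""
    card = rng.choice(population, size=n_card)
    cash = rng.choice(population, size=n_cash)
    return card.mean() - cash.mean()

# Repeat the sampling many times to see how large chance gaps can get
gaps = np.array([sample_gap() for _ in range(1000)])
print(f"mean gap: {gaps.mean():.3f}, std of gaps: {gaps.std():.3f}")
print(f"share of samples with |gap| > 0.5: {(np.abs(gaps) > 0.5).mean():.2%}")
```

Even with no true difference, individual samples show nonzero gaps, which is why a formal test is needed before reading anything into the observed averages.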

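Hypothesis testing

H0 (null hypothesis): There is no difference in the average fare amount between customers who pay by credit card and customers who pay in cash.

HA (alternative hypothesis): There is a difference in the average fare amount between customers who pay by credit card and customers who pay in cash.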

In [22]:
# Filter data for relevant payment types (Credit Card and Cash)
filtered_data = taxi_data[taxi_data['payment_type'].isin(['Credit Card', 'Cash'])]

In [23]:
# Separate the fare amounts
credit_card_fares = filtered_data[filtered_data['payment_type'] == 'Credit Card']['fare_amount']
cash_fares = filtered_data[filtered_data['payment_type'] == 'Cash']['fare_amount']

In [24]:
# Perform a two-sample Welch's t-test (equal_var=False does not assume equal variances)
t_stat, p_value = stats.ttest_ind(credit_card_fares, cash_fares, equal_var=False)

In [26]:
print(t_stat,p_value)

6.866800855655372 6.797387473030518e-12


In [27]:
# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    decision = "Reject the null hypothesis (H0)"
else:
    decision = "Fail to reject the null hypothesis (H0)"

print(decision)

Reject the null hypothesis (H0)


At the 5% significance level, there is a statistically significant difference in average fare amounts between credit card and cash customers, with credit card fares higher on average.
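
Statistical significance does not say how large the difference is. As a hedged follow-up, the sketch below re-derives the Welch test from the summary statistics in the describe() output above and adds a 95% confidence interval and Cohen's d. The variable names are illustrative, and the pooled-standard-deviation form of Cohen's d is one of several possible effect-size choices.

```python
import math
from scipy import stats

# Summary statistics copied from the describe() output above
n_card, mean_card, sd_card = 15265, 13.429748, 13.848964
n_cash, mean_cash, sd_cash = 7267, 12.213546, 11.689940

# Welch's t-test from summary statistics (should match the raw-data result)
t_stat, p_value = stats.ttest_ind_from_stats(
    mean_card, sd_card, n_card,
    mean_cash, sd_cash, n_cash,
    equal_var=False,
)

# Standard error of the difference and Welch-Satterthwaite degrees of freedom
v_card, v_cash = sd_card**2 / n_card, sd_cash**2 / n_cash
se_diff = math.sqrt(v_card + v_cash)
df = (v_card + v_cash) ** 2 / (v_card**2 / (n_card - 1) + v_cash**2 / (n_cash - 1))

# 95% confidence interval for the difference in mean fares (credit card minus cash)
diff = mean_card - mean_cash
margin = stats.t.ppf(0.975, df) * se_diff
ci_low, ci_high = diff - margin, diff + margin

# Cohen's d with a pooled standard deviation, as a rough effect-size measure
pooled_sd = math.sqrt(
    ((n_card - 1) * sd_card**2 + (n_cash - 1) * sd_cash**2) / (n_card + n_cash - 2)
)
cohens_d = diff / pooled_sd

print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
print(f"difference = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"Cohen's d = {cohens_d:.3f}")
```

With these figures the gap is about $1.22 per trip and Cohen's d falls well below 0.2, the value often used as the benchmark for a small effect, so the difference is statistically detectable but modest in size.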

Communicate insights with stakeholders
1. What business insight(s) can you draw from the result of your hypothesis test?
Payment Impact: Credit card trips have a higher average fare than cash trips (about $13.43 vs. $12.21). If this relationship is causal, encouraging credit card use could boost revenue for drivers.

Targeted Strategies: Opportunities to market promotions or incentives for credit card payments.

Policy Considerations: NYC TLC might implement policies to enhance credit card transactions for better cash flow.

2. Consider why this A/B test project might not be realistic, and what assumptions had to be
made for this educational project.
Random Assignment: Assumes customers were randomly assigned to payment types, whereas in reality customers choose how to pay, so the groups may differ in other ways.

Controlled Environment: Assumes external factors (e.g., trip conditions) do not influence fare amounts, which might not be realistic.

Data Integrity: Relies on the accuracy of fare and payment data, which could have anomalies.

Sample Size: The dataset may not fully represent the diverse NYC taxi customer base.
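
Some of these concerns can be probed with simple sensitivity checks. The sketch below uses a synthetic stand-in for the taxi data, with two injected anomalies resembling the negative and extreme fares seen in the describe() output. It re-runs the Welch test after dropping implausible fares and again on fare per mile. The fare bounds, the 0.1-mile cutoff, and the helper `welch_by_payment` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the taxi data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
n = 2000
demo = pd.DataFrame({
    "payment_type": rng.choice(["Credit Card", "Cash"], size=n, p=[0.68, 0.32]),
    "trip_distance": rng.gamma(shape=2.0, scale=1.5, size=n),
})
# Fare depends on distance only, so any raw gap between payment types here is noise
demo["fare_amount"] = 2.5 + 2.5 * demo["trip_distance"] + rng.normal(0, 1.5, size=n)
# Inject anomalies like those in the describe() output (a negative and an extreme fare)
demo.loc[0, "fare_amount"] = -120.0
demo.loc[1, "fare_amount"] = 999.99

def welch_by_payment(df, column):
    """Welch's t-test of `column` between credit card and cash trips."""
    card = df.loc[df["payment_type"] == "Credit Card", column]
    cash = df.loc[df["payment_type"] == "Cash", column]
    return stats.ttest_ind(card, cash, equal_var=False)

# Sensitivity check 1: drop implausible fares (the bounds are an assumption)
plausible = demo[(demo["fare_amount"] > 0) & (demo["fare_amount"] < 200)].copy()

# Sensitivity check 2: compare fare per mile to partially account for trip length
plausible = plausible[plausible["trip_distance"] > 0.1].copy()
plausible["fare_per_mile"] = plausible["fare_amount"] / plausible["trip_distance"]

raw_result = welch_by_payment(demo, "fare_amount")
clean_result = welch_by_payment(plausible, "fare_amount")
per_mile_result = welch_by_payment(plausible, "fare_per_mile")

print("raw fares:     ", raw_result)
print("cleaned fares: ", clean_result)
print("fare per mile: ", per_mile_result)
```

If the conclusion on the real data changed materially after such filtering or normalization, that would suggest the original result is sensitive to anomalies or to differences in trip length between the two groups.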