# Project Context: Automatidata and NYC TLC

This project is being undertaken by a data professional at Automatidata, a data consulting firm. The current project for the New York City Taxi & Limousine Commission (NYC TLC) has reached its midpoint, with the completion of a project proposal, initial Python coding work, and exploratory data analysis.

A new request has been received from the New York City TLC, communicated by Uli King, Automatidata’s project manager. This request involves analyzing the relationship between fare amount and payment type. Subsequently, a follow-up assignment from Luana, a senior data analyst, specifies the need to conduct an A/B test to address this request.




# Statistical Analysis Project: A/B Testing and Hypothesis Testing

This activity focuses on applying statistical methods to analyze and interpret data, covering fundamental concepts such as descriptive statistics and hypothesis testing. The provided data will be explored, and A/B tests, along with hypothesis testing, will be conducted.

**Project Purpose:** The purpose of this project is to demonstrate proficiency in preparing, creating, and analyzing A/B tests. The A/B test results are intended to identify strategies for increasing revenue for taxi cab drivers.

**Important Note on Data Assumption:** For the scope of this exercise, it is assumed that the sample data originates from an experiment where customers were randomly assigned to one of two groups: 1) customers required to pay with a credit card, or 2) customers required to pay with cash. This assumption is crucial for drawing causal conclusions regarding how payment method influences fare amount.

**Project Goal:** The goal is to apply descriptive statistics and hypothesis testing using Python. Specifically, the objective of this A/B test is to sample data and analyze whether a relationship exists between payment type and fare amount. For example, the analysis aims to discover if customers who use credit cards tend to pay higher fare amounts than customers who use cash.

This activity is structured into four parts:

**Part 1: Imports and Data Loading**
* **Necessary Data Packages for Hypothesis Testing:** The essential packages for conducting hypothesis testing include `numpy` for numerical operations, `pandas` for data manipulation, and functions from `scipy.stats` for statistical tests.

**Part 2: Conduct EDA and Hypothesis Testing**
* **Contribution of Descriptive Statistics to Data Analysis:**
    Descriptive statistics are valuable for quickly exploring and understanding large datasets.
    * **Mean:** The mean (average) provides a general indication of the typical fare in the dataset, offering insight into the overall trend of the data. While useful for understanding the average amount customers are paying, it can be influenced by outliers.
    * **Median:** The median represents the "middle" fare when the data is sorted. Its utility lies in its robustness to extreme values, providing a more accurate reflection of the typical fare in the presence of outliers.
    * **Standard Deviation:** The standard deviation quantifies the dispersion or variability of fares around the mean. A small standard deviation suggests consistency in fares, while a large standard deviation indicates greater spread and variation. This metric helps in understanding the consistency or variability of fares relative to the average.
    * **Identifying Trends:** By computing and comparing the mean or median fare for both credit card and cash payments, trends can be identified. For instance, if the mean fare for credit card payments is higher than for cash payments, it suggests a tendency for credit card users to pay more on average.
    * **Checking for Outliers:** Outliers can be identified by comparing the mean with the maximum and minimum values. Significant discrepancies between the mean and these extreme values suggest the presence of outliers—data points that deviate significantly from the usual pattern and may require further investigation.

* **Formulation of Null and Alternative Hypotheses:**
    The hypotheses for this A/B test are formulated as follows:
    * **Null Hypothesis (H₀):** There is no difference in the average fare amount between credit card and cash payments.
    * **Alternative Hypothesis (H₁):** There is a difference in the average fare amount between credit card and cash payments.

**Part 3: Communicate Insights with Stakeholders**
* **Key Business Insights from the A/B Test:** What significant business insights emerged from the results of the A/B test?
* **Proposed Business Recommendations:** Based on the test results, what actionable business recommendations can be proposed?

**Part 4: Evaluate and Share Results**

# **Conduct an A/B test**


 **PACE stages** 

   * Plan
   * Analyze
   * Construct
   * Execute

## PACE: Plan

In this stage, the research question for this data project is considered, which will subsequently inform the formulation of null and alternative hypotheses for the hypothesis test.


**Research Question: Does the payment type (credit card or cash) affect the taxi fare amount?**


### Task 1. Imports and data loading

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [11]:
import pandas as pd
import numpy as np
from scipy import stats

In [13]:
# Load dataset into dataframe
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)


## PACE: **Analyze and Construct**

In this stage, the focus shifts to deeper data analysis and the construction of analytical components.

**Utility of Descriptive Statistics in Analysis:**
    
Descriptive statistics are fundamental for Exploratory Data Analysis (EDA) and play a crucial role in learning more about the data during this analytical stage.

**In general, descriptive statistics are useful because they let you quickly explore and understand large amounts of data.**

**Mean:** The mean (average) gives a general idea of the typical fare in the dataset and is useful for understanding the overall trend of the data. For taxi fares, the mean shows the average amount customers are paying. However, the mean can be affected by very high or very low values (outliers), which is why it is often helpful to also consider the median and standard deviation for a fuller picture of the data.

**Median:** The median is the "middle" fare when the data is sorted. It is useful because it is barely affected by extreme values, so it reflects the typical fare more accurately when there are outliers.

**Standard Deviation:** The standard deviation shows how much the fares deviate (or vary) from the mean. If the standard deviation is small, the fares are mostly close to the mean (indicating consistency). If it is large, the fares are more spread out, with some much higher or lower than the mean (indicating more variation). In short, the standard deviation describes how consistent or varied the fares are relative to the average fare.

**Identifying Trends:** Calculating the mean or median fare for both credit card and cash payments lets us compare whether one payment method generally results in higher or lower fares. For example, if the mean fare for credit card payments is higher than for cash payments, this suggests that credit card users tend to pay more on average. Comparing the median fare for each payment type also helps identify trends, especially when there are outliers, because the median is less affected by extreme values.

**Checking for Outliers:** To identify outliers, we compare the mean with the maximum and minimum values. If the maximum or minimum is far from the mean, there might be outliers. For example, if the average fare is 30 but the maximum fare is 1000 or the minimum fare is 1, these extreme values may not be typical and could be outliers.

**Mean vs. Range:** A big difference between the mean and the maximum or minimum suggests that most data points are close to the mean but a few extreme values (either very high or very low) lie far from it. These extreme values could be outliers: values that do not follow the usual pattern of the data and might need further investigation.
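To make the effect of an outlier on these statistics concrete, here is a minimal sketch. The fare values are hypothetical (they are not drawn from the TLC dataset), and the cutoff of 100 used to drop the extreme value is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical fares (not from the TLC dataset): nine typical rides and one extreme value.
fares = pd.Series([6.5, 7.0, 8.0, 9.0, 9.5, 10.0, 11.0, 12.5, 14.5, 999.99])

summary = {
    "mean": fares.mean(),
    "median": fares.median(),
    "std": fares.std(),  # sample standard deviation (ddof=1)
    "min": fares.min(),
    "max": fares.max(),
}

# Same statistics with the extreme value removed (arbitrary cutoff, for comparison only).
trimmed = fares[fares < 100]
trimmed_summary = {
    "mean": trimmed.mean(),
    "median": trimmed.median(),
    "std": trimmed.std(),
}

print(summary)
print(trimmed_summary)
```

In this toy series, the single extreme fare moves the mean and standard deviation a great deal while the median barely changes, which is the behavior described above.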


### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown



In [22]:
taxi_data.head(5)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [24]:
taxi_data.shape

(22699, 17)

22699 rows and 17 columns

In [27]:
taxi_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float64
 1

In [29]:
# Missing values
taxi_data.isnull().sum()


VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
dtype: int64

* No null values in any column

In [32]:
# summary statistics

taxi_data.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
mean,1.556236,1.642319,2.913313,1.043394,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,1.285231,3.653171,0.708391,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,1.0,0.99,1.0,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,1.0,1.61,1.0,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,2.0,3.06,1.0,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8
max,2.0,6.0,33.96,99.0,265.0,265.0,4.0,999.99,4.5,0.5,200.0,19.1,0.3,1200.29


**About `fare_amount`**

**Mean (13.03):**
* The average fare is approximately $13.03. However, the mean can be influenced by extreme values (outliers), such as very high or very low fares.

**Median (9.5):**
* The middle value of the fares is $9.50. Because the median is lower than the mean, some high outliers are likely pulling the mean upward.

**Standard Deviation (13.24):**
* The standard deviation (13.24) is slightly larger than the mean (13.03). A standard deviation close to or larger than the mean suggests high variability: fares are spread out widely, with many values far from the average fare.

**Minimum Value (-$120):**
* The mean fare of $13.03 is reasonable for a typical taxi ride. A negative fare is not only far from the mean but also logically invalid, since fares cannot be negative. It suggests a data error, such as a data-entry mistake, an incorrectly recorded refund, or an issue in data collection, and should be flagged.

**Maximum Value ($999.99):**
* The maximum is almost 77 times the mean fare of $13.03. It could be a legitimate fare for an exceptionally long trip, but it should be checked for validity.

**25% (6.5):**
* 25% of fares are at or below $6.50, indicating that many fares are relatively low.

**75% (14.5):**
* 75% of fares are at or below $14.50, so only the top quarter of fares exceed this value.

**Overall Insights**
* The middle 50% of fares fall between $6.50 (25%) and $14.50 (75%), with a median of $9.50, so typical fares sit in a fairly narrow range.
* The mean being higher than the median suggests the presence of high outliers that skew the average upward.
* The data contains anomalies, such as the negative fare, which may need to be cleaned or excluded during analysis.

**Next Steps**

**Data Cleaning:**
* Investigate, and remove where appropriate, the negative fares and unusually high fares.

**Further Analysis:**
* Examine the distribution of the data (e.g., with a histogram) to confirm the skewness caused by outliers.
* Consider using robust measures like the median and interquartile range (IQR) to summarize the data if the outliers significantly affect the analysis.
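As one possible way to act on these next steps, the sketch below flags negative fares and applies an IQR-based rule to a small hypothetical series. The values are made up for illustration, and the 1.5 × IQR threshold (Tukey's rule) is an assumed convention, not something the project specifies. In the notebook, `fares` would be `taxi_data['fare_amount']`.

```python
import pandas as pd

# Hypothetical fares for illustration only (not the TLC sample).
fares = pd.Series([-120.0, 6.5, 7.0, 8.0, 9.5, 9.5, 11.0, 14.5, 16.0, 52.0, 999.99])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Tukey's rule (assumed threshold): flag values more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_negative = fares < 0
is_outlier = (fares < lower_fence) | (fares > upper_fence)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("Negative fares:", fares[is_negative].tolist())
print("Flagged by IQR rule:", fares[is_outlier].tolist())
```

Flagging rather than deleting keeps the decision about what to exclude separate from the detection step, which is useful when some extreme fares may be legitimate.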

**About `payment_type`**

* In the dataset, `payment_type` is encoded in integers:
    * 1: Credit card
    * 2: Cash
    * 3: No charge
    * 4: Dispute
    * 5: Unknown
* For a categorical variable such as `payment_type`, summary statistics such as mean, standard deviation (std), minimum, and maximum do not make sense, because these calculations are meant for numerical data, not categories.
* Instead, use the pandas `value_counts()` method to determine how often each payment type appears.


In [36]:
taxi_data['payment_type'].value_counts()

payment_type
1    15265
2     7267
3      121
4       46
Name: count, dtype: int64

**The counts show the frequency of each payment type:**

* Credit Card (1): 15,265 trips, the most common payment method.
* Cash (2): 7,267 trips, the second most frequent method.
* No Charge (3): 121 trips, likely complimentary or promo rides.
* Dispute (4): 46 trips, where payments were contested. (A "dispute" often means the payment is in question and might not be finalized until the issue is resolved.)

**Insights:**

* Most trips are paid with credit cards, followed by cash.
* Rare scenarios like "No Charge" and "Disputes" might need further investigation.
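The raw counts are easier to read as labeled proportions. The sketch below rebuilds the counts from the `value_counts()` output so that it runs without the CSV; the `PAYMENT_LABELS` name is a hypothetical helper introduced here, with the code-to-label mapping taken from the note on `payment_type` encoding.

```python
import pandas as pd

# Hypothetical helper: code-to-label mapping from the payment_type encoding note.
PAYMENT_LABELS = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute", 5: "Unknown"}

# Counts copied from the value_counts() output, so this block runs without the CSV.
counts = pd.Series({1: 15265, 2: 7267, 3: 121, 4: 46}, name="count")

# Convert counts to shares of all trips and swap the integer codes for labels.
shares = (counts / counts.sum()).rename(index=PAYMENT_LABELS)
print(shares.round(4))
```

With the full dataframe loaded, an equivalent one-liner would be `taxi_data['payment_type'].map(PAYMENT_LABELS).value_counts(normalize=True)`. Credit card trips make up about 67% of the sample and cash trips about 32%.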

In [39]:
taxi_data.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


**We are interested in the relationship between payment type and the fare amount the customer pays.** One approach is to look at the average fare amount for each payment type. 

In [43]:
taxi_data.groupby('payment_type')['fare_amount'].mean()

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64

* The dataset represents a sample of yellow taxi trips taken in 2017, not every trip in New York City. As a result, the difference observed between fare amounts for credit card and cash payments may be due to random sampling, rather than reflecting a true difference across all trips.

* Based on the averages, customers who pay with a credit card tend to have slightly higher fare amounts compared to those paying with cash. However, this difference might not be significant—it could simply be a result of the sample selected.

* To determine whether this difference is statistically significant and not just a random finding, a hypothesis test is needed. The test assesses whether the observed difference is large enough to be unlikely under chance alone. Even a small difference could have practical significance, and the hypothesis test provides evidence on whether the difference is likely to hold beyond this sample of data.
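Before testing, it can help to compare more than the mean for the two groups of interest. The following sketch uses a tiny hypothetical dataframe (made-up values, same column names as the TLC data) to show one way of restricting the data to credit card and cash trips and summarizing each group.

```python
import pandas as pd

# Tiny hypothetical frame with the same column names as the TLC data (values are made up).
trips = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2, 3],
    "fare_amount": [13.0, 16.0, 6.5, 16.5, 9.0, 7.5, 12.0],
})

# Keep only credit card (1) and cash (2), the two groups compared in the A/B test.
ab = trips[trips["payment_type"].isin([1, 2])]

# Group size, center, and spread per payment type.
by_type = ab.groupby("payment_type")["fare_amount"].agg(["count", "mean", "median", "std"])
print(by_type)
```

Reporting the count and standard deviation alongside the mean gives a sense of how much each group average might vary from sample to sample.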


### Task 3. Hypothesis Testing

Prior to conducting the hypothesis test, it is essential to clearly define the null and alternative hypotheses. These hypotheses frame the statistical question being investigated regarding the relationship between payment type and fare amount.

**Hypotheses for this project are as follows:**

* **Null Hypothesis ($H_0$):** There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

* **Alternative Hypothesis ($H_1$):** There is a difference in the average fare amount between customers who use credit cards and customers who use cash.


### Conducting a Two-Sample t-test

This step focuses on performing a two-sample t-test. The standard procedure for conducting a hypothesis test involves the following steps:

1.  **State the Null Hypothesis ($H_0$) and the Alternative Hypothesis ($H_1$).**
2.  **Choose a Significance Level ($\alpha$).**
3.  **Find the p-value.**
4.  **Reject or Fail to Reject the Null Hypothesis.**

We choose 5% as the significance level and proceed with a two-sample t-test.

In [52]:
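# Two-sample t-test (A/B test) comparing fares paid by credit card vs. cash
alpha = 0.05  # significance level
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']
# equal_var=False: do not assume the two groups have equal variances
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)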

TtestResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12, df=16675.48547403633)

* The p-value, 6.797387473030518e-12, is a very small number written in scientific notation, equivalent to 0.000000000006797387473030518. Because it is far smaller than the 5% significance level, we reject the null hypothesis.

* We conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.
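The call to `stats.ttest_ind` with `equal_var=False` performs Welch's t-test, which does not assume equal variances in the two groups. The sketch below computes the same statistic by hand and compares it with SciPy. The simulated fares and the `welch_t` helper are assumptions introduced for illustration; they are not the TLC sample or part of the project code.

```python
import numpy as np
from scipy import stats

# Simulated fares (an assumption for illustration; not the TLC sample).
rng = np.random.default_rng(seed=0)
credit_card = rng.normal(loc=13.4, scale=10.0, size=500)
cash = rng.normal(loc=12.2, scale=11.0, size=300)


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df


t_manual, df_manual = welch_t(credit_card, cash)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)  # two-sided p-value

result = stats.ttest_ind(a=credit_card, b=cash, equal_var=False)
print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

The two-sided p-value matches the alternative hypothesis stated in Task 3, which asks only whether the averages differ, not which one is larger.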


## PACE: **Execute**


### Task 4. Communicate insights with stakeholders


This stage involves reflecting on the execution phase of the project, particularly focusing on the insights derived from the hypothesis test and the underlying assumptions.

* **Business Insight(s) from Hypothesis Test:**
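    Credit card fares averaged about $13.43 versus $12.21 for cash, and the hypothesis test found this difference to be statistically significant. Under the project's random-assignment assumption, the key business insight is that encouraging customers to pay with credit cards has the potential to generate more revenue for taxi cab drivers.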

* **Realism of A/B Test and Assumptions:**
    It is important to consider why this A/B test project might not be entirely realistic in a real-world scenario and to acknowledge the assumptions made for this educational project. In a true A/B test, the groups must be randomly selected to ensure the results are fair and unbiased. However, in this specific scenario, the dataset does not reflect a real-world experiment where individuals were randomly assigned to use either cash or credit card for payment.

    The project's premise assumes that, for the sake of the A/B test, riders were randomly grouped into two categories (cash vs. credit card) and instructed to pay using one specified method. In reality, passengers typically choose their payment method based on personal preference and availability (e.g., carrying sufficient cash or having a card). Therefore, the fundamental assumption of random assignment, which is crucial for drawing causal conclusions, does not align with how payment methods typically operate in real life.
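For contrast with the observational data used here, the following is a minimal sketch of what random assignment could look like in a designed experiment. The rider IDs, group sizes, and seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical rider IDs; a real experiment would use actual rider or trip identifiers.
rider_ids = np.arange(1000)

# Shuffle the riders, then split the shuffled order in half so that
# group membership does not depend on any rider characteristic.
shuffled = rng.permutation(rider_ids)
credit_card_group = shuffled[:500]
cash_group = shuffled[500:]

print(len(credit_card_group), len(cash_group))
```

Because assignment depends only on the shuffle, differences between the groups other than payment method should balance out on average, which is what supports a causal reading of the test result.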
