# Statistical analysis: A/B Testing
I am a data professional at Automatidata, a data consulting firm. The current project is for our newest client, the New York City Taxi & Limousine Commission.

## Project objectives
Use A/B testing to analyze the relationship between fare amount and payment type.

In this project, I will practice using statistics to analyze and interpret data. The activity covers fundamental concepts such as descriptive statistics and hypothesis testing. I will explore the data provided and conduct A/B and hypothesis testing.  
<br/>   
**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. My A/B test results aim to find ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this project, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.
  
*This project has three parts:*

**Part 1:** Imports and data loading
* Includes loading packages necessary for hypothesis testing.

**Part 2:** Conduct EDA and hypothesis testing
* Compute descriptive statistics to support the data analysis.

* Formulate the null hypothesis and the alternative hypothesis.

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from the A/B test?

* What are the proposed business recommendations based on the results?

### Task 1. Imports and data loading
Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [21]:
import pandas as pd
from scipy import stats

In [22]:
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

### Task 2. Data exploration
Use descriptive statistics to conduct Exploratory Data Analysis (EDA).

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown

In [23]:
taxi_data.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


In [24]:
# Assuming payment_type is the column containing the numeric codes (1 for credit card, 2 for cash and so on...)
payment_type_mapping = {1: 'Credit Card', 2: 'Cash', 3: 'No charge', 4: 'Dispute', 5: 'Unknown'}

# Create a new column 'payment_type_str' by replacing values based on the mapping
taxi_data['payment_type_str'] = taxi_data['payment_type'].replace(payment_type_mapping)

# Display the DataFrame with the new column
print(taxi_data[['payment_type', 'payment_type_str']].head())

           payment_type payment_type_str
24870114              1      Credit Card
35634249              1      Credit Card
106203690             1      Credit Card
38942136              1      Credit Card
30841670              2             Cash


Count the number of trips recorded for each payment type.

In [25]:
taxi_data.groupby(['payment_type','payment_type_str'])['fare_amount'].count().reset_index()

   payment_type payment_type_str  fare_amount
0             1      Credit Card        15265
1             2             Cash         7267
2             3        No charge          121
3             4          Dispute           46


Compute the mean fare amount for each payment type.

In [26]:
taxi_data.groupby(['payment_type','payment_type_str'])['fare_amount'].mean().reset_index()

   payment_type payment_type_str  fare_amount
0             1      Credit Card    13.429748
1             2             Cash    12.213546
2             3        No charge    12.186116
3             4          Dispute     9.913043


Based on the averages shown, customers who pay by credit card appear to pay a larger fare amount than customers who pay in cash. However, this difference might arise from random sampling rather than reflect a true difference in fare amount. A hypothesis test is needed to assess whether the difference is statistically significant.
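
To make the idea of a difference arising from random sampling more concrete, here is a minimal sketch of a permutation test on synthetic data. The fare values, group sizes, and the helper name `permutation_p_value` are assumptions made for illustration; they are not part of the taxi dataset or of the analysis in this notebook.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)

# Synthetic fares (assumed values, for illustration only).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=13.4, scale=5.0, size=300)
group_b = rng.normal(loc=12.2, scale=5.0, size=300)

p_same = permutation_p_value(group_a, group_a.copy())  # identical groups
p_diff = permutation_p_value(group_a, group_b)         # groups with shifted means
print(f"identical groups: p = {p_same:.3f}")
print(f"shifted groups:   p = {p_diff:.3f}")
```

The p-value estimates how often a difference at least as large as the observed one shows up when the group labels carry no information. The t-test used later in this notebook answers the same question with a distributional approximation instead of reshuffling.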

### Task 3. Hypothesis testing

**Null hypothesis**: There is no difference in average fare between customers who use credit cards and customers who use cash.  
**Alternative hypothesis**: There is a difference in average fare between customers who use credit cards and customers who use cash.

**Objective:** To conduct a two-sample t-test.  

Steps for conducting a hypothesis test: 

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 

**The hypotheses**  

$H_0$: There is `no difference` in the average fare amount between customers who use credit cards and customers who use cash.  
$H_A$: There is `a difference` in the average fare amount between customers who use credit cards and customers who use cash.  

Choose `5% as the significance level` and proceed with a two-sample t-test.

In [27]:
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']
result = stats.ttest_ind(a=credit_card, b=cash, equal_var=False)
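# The t-statistic is a unitless ratio, so it is printed without a percent sign
print(f"t-statistic: {result.statistic:.2f}")
# The p-value is converted from a proportion to a percentage for display
print(f"P-value: {(result.pvalue*100):.10f}%")

t-statistic: 6.87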
P-value: 0.0000000007%
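
Passing `equal_var=False` to `stats.ttest_ind` requests Welch's t-test, which does not assume that the two groups have equal variances. As a rough sketch of what that computes, the block below rebuilds the statistic by hand on synthetic samples and compares it with SciPy's result. The helper name `welch_t` and the sample values are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Return Welch's t-statistic and degrees of freedom (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean, based on the unbiased sample variance.
    var_a = a.var(ddof=1) / a.size
    var_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (a.size - 1) + var_b ** 2 / (b.size - 1)
    )
    return t_stat, dof

# Synthetic samples with unequal sizes and spreads (assumed values).
rng = np.random.default_rng(0)
sample_a = rng.normal(13.0, 12.0, size=500)
sample_b = rng.normal(12.0, 10.0, size=250)

t_manual, dof = welch_t(sample_a, sample_b)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-sided p-value
scipy_result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy:  t = {scipy_result.statistic:.4f}, p = {scipy_result.pvalue:.6f}")
```

The two lines should agree up to floating-point rounding, which is a quick way to check that the test being run is the one intended.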


There are two main rules for drawing a conclusion about a hypothesis test:   
* If `p-value` < `significance level`, **reject** the null hypothesis.
* If `p-value` > `significance level`, **fail to reject** the null hypothesis.
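
The decision rules can be written as a small function. This is only a sketch: the name `decide` is hypothetical, and treating a p-value exactly equal to the significance level as "fail to reject" is an assumed convention. The function expects the p-value as a proportion, so a value printed as a percentage must be divided by 100 first.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule to a p-value given as a proportion (hypothetical helper)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be a proportion between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    # Assumed convention: a tie (p_value == alpha) does not reject.
    return "fail to reject the null hypothesis"

# The notebook prints the p-value as a percentage, so convert it back first.
p_value_percent = 0.0000000007
print(decide(p_value_percent / 100, alpha=0.05))
```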


In this scenario, the p-value of 0.0000000007% is less than the significance level of 5%, so we **reject** the null hypothesis.

We can conclude based on the hypothesis testing that there is a `statistically significant difference` in the average fare amount between customers who use credit cards and customers who use cash.

*In conclusion, ask yourself the following questions:*

1. What business insight(s) can you draw from the result of your hypothesis test?
2. Consider why this A/B test project might not be realistic, and what assumptions had to be made for this educational project.

### What are the proposed business recommendations based on the results

1.   The key business insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers. 

2.   This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it. The data was not collected this way, so the entries had to be treated as if they had been randomly assigned to groups in order to perform an A/B test. This dataset does not account for other likely explanations. For example, riders might not carry lots of cash, so it's easier to pay for longer/farther trips with a credit card. In other words, it's far more likely that fare amount determines payment type, rather than vice versa. 
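
One way to probe the alternative explanation above is to compare trip distance across payment types, and then compare fares within distance bands. The sketch below does this on synthetic data in which fare depends only on distance and longer trips are more likely to be paid by card. Every number in it, the `trips` frame, and the band edges are assumptions for illustration; only the column names are borrowed from the taxi dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

# Synthetic trips (assumed): distance drives both the fare and the payment choice.
trip_distance = rng.exponential(scale=3.0, size=n)
p_card = 1 / (1 + np.exp(-(trip_distance - 2.0)))         # longer trip, more likely card
payment_type = np.where(rng.random(n) < p_card, 1, 2)     # 1: credit card, 2: cash
fare_amount = 2.5 + 2.5 * trip_distance + rng.normal(0, 1, size=n)

trips = pd.DataFrame({
    "payment_type": payment_type,
    "trip_distance": trip_distance,
    "fare_amount": fare_amount,
})

# Overall means: card fares look higher even though payment type never enters the fare.
summary = trips.groupby("payment_type")[["trip_distance", "fare_amount"]].mean()
print(summary)

# Mean fare by payment type within distance bands (illustrative band edges).
trips["distance_band"] = pd.cut(
    trips["trip_distance"], bins=[0, 1, 2, 4, 8, np.inf], include_lowest=True
)
within = (
    trips.groupby(["distance_band", "payment_type"], observed=True)["fare_amount"]
    .mean()
    .unstack()
)
print(within)
```

If the overall gap is much larger than the gaps inside each band, trip distance is a plausible driver of the difference, and the payment type itself may explain little of it.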