The data, supplied by a client, records day-to-day taxi and limousine service trips. The client has asked us to analyze the data and find the relationship between fare amount and payment type.

**Note**:
1. For the purpose of this notebook, we assume we have been asked to analyze the relationship between fare amount and payment type (two features in the given dataset).

2. We also assume that this sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.



The **goal** is to apply descriptive statistics and hypothesis testing in Python. The goal of this A/B test is to sample the data and analyze whether there is a relationship between payment type and fare amount. For example: discover whether customers who use credit cards pay higher fare amounts than customers who use cash.

## **Part 1.** Import & load data

In [20]:
# importing packages & libraries


import pandas as pd                  # data manipulation
import numpy as np                   # numerical computing
import matplotlib.pyplot as plt      # data visualisation
from scipy import stats              # statistical tests

In [3]:
# import the data

data = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')

In [4]:
data.head()       # view the data

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
1,35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
2,106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
3,38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
4,30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [5]:
# how many rows & columns are there?

data.shape

(22699, 18)

## **Part 2.** Conduct EDA

**Note**: In the dataset, `payment_type` is encoded as integers:


- 1: Credit card
- 2: Cash
- 3: No charge
- 4: Dispute
- 5: Unknown
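
As a minimal sketch of how this encoding can be used, the snippet below translates the integer codes into labels and counts the trips per label. The `toy` DataFrame and the `labels` name are hypothetical and made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy payment_type codes (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({"payment_type": [1, 1, 2, 1, 3, 2, 4]})

# Same code-to-label mapping as in the note above
labels = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute", 5: "Unknown"}

# Translate the codes into labels, then count how many trips fall under each label
toy["payment_label"] = toy["payment_type"].map(labels)
counts = toy["payment_label"].value_counts()
print(counts)
```

Running the same two steps on the real `data` would show how many trips each payment type contributes, which is useful to know before comparing group means.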

In [8]:
data.describe()

Unnamed: 0.1,Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
mean,56758490.0,1.556236,1.642319,2.913313,1.043394,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,32744930.0,0.496838,1.285231,3.653171,0.708391,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,12127.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,28520560.0,1.0,1.0,0.99,1.0,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,56731500.0,2.0,1.0,1.61,1.0,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,85374520.0,2.0,2.0,3.06,1.0,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8
max,113486300.0,2.0,6.0,33.96,99.0,265.0,265.0,4.0,999.99,4.5,0.5,200.0,19.1,0.3,1200.29


We are interested in the **fare_amount** and **payment_type** columns. Let's find the average fare for each payment type.

In [12]:
# Creating a mapping dictionary for payment_type

payment_type_mapping = {
    1: 'Credit card',
    2: 'Cash',
    3: 'No charge',
    4: 'Dispute',
    5: 'Unknown'
}

In [14]:
payment_type_avg_fare = data.groupby('payment_type')['fare_amount'].mean()

# mapping payment type
payment_type_avg_fare.index = payment_type_avg_fare.index.map(payment_type_mapping)

# average fares of payment types
payment_type_avg_fare

payment_type,fare_amount
Credit card,13.429748
Cash,12.213546
No charge,12.186116
Dispute,9.913043


As the average fares show, customers who pay by credit card appear to pay a larger fare amount than customers who pay in cash.

However, this difference might arise from random sampling, rather than being a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a hypothesis test.
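
To illustrate how random sampling alone can produce a gap between two group means, here is a small simulation. Everything in it is an assumption made for illustration: the gamma-distributed "population" of fares, its parameters, and the sample size of 500 are made up and are not derived from the dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical population of fares sharing ONE distribution (made-up parameters)
population = rng.gamma(shape=2.0, scale=6.5, size=100_000)

# Two random samples drawn from the SAME population
sample_a = rng.choice(population, size=500, replace=False)
sample_b = rng.choice(population, size=500, replace=False)

# The sample means differ even though there is no true difference between the groups
diff = sample_a.mean() - sample_b.mean()
print(f"mean A = {sample_a.mean():.2f}, mean B = {sample_b.mean():.2f}, difference = {diff:.2f}")
```

A hypothesis test asks whether the observed difference is larger than what this kind of sampling noise would plausibly produce.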

## **Part 3.**  Hypothesis Testing

We can see there is a difference between the average fares of the payment types, so we consider the following hypotheses:
- **Null Hypothesis:** There is no difference in average fare between customers who use credit cards and customers who use cash.
- **Alternative hypothesis**: There is a difference in average fare between customers who use credit cards and customers who use cash.

**Significance level:** 5%.
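
Stated symbolically, the hypotheses and significance level above can be written as follows. The symbols $\mu_{\text{credit}}$ and $\mu_{\text{cash}}$ (the population mean fares of the two groups) are notation introduced here for illustration.

$$
H_0:\ \mu_{\text{credit}} = \mu_{\text{cash}}
\qquad
H_A:\ \mu_{\text{credit}} \neq \mu_{\text{cash}}
\qquad
\alpha = 0.05
$$

Because the alternative is "not equal" rather than "greater than", the test is two-sided.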

For the purpose of this notebook, we treat this hypothesis test as the main component of the A/B test.

**2-sample t-test**

In [17]:
# filter fare_amount by credit card data

credit_card = data[data['payment_type'] == 1]['fare_amount']
credit_card

Unnamed: 0,fare_amount
0,13.0
1,16.0
2,6.5
3,20.5
5,9.0
...,...
22692,19.0
22693,7.5
22695,52.0
22697,10.5


In [22]:
# filter fare amount by cash

cash = data[data['payment_type'] == 2]['fare_amount']
cash

Unnamed: 0,fare_amount
4,16.5
8,9.0
18,5.0
20,6.5
27,5.5
...,...
22673,5.0
22675,7.5
22688,4.0
22694,4.0


In [23]:
# t-test

stats.ttest_ind(a = credit_card, b = cash, equal_var = False)

TtestResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12, df=16675.48547403633)

**Note** the `e-12` at the end of the p-value: the result is in scientific notation, so the p-value is about 6.8 × 10⁻¹².

Since the p-value is far smaller than the 5% significance level, we reject the null hypothesis.

We conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.
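
To make explicit what `equal_var = False` does, the sketch below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind`. The two small arrays `group_a` and `group_b` are made-up numbers for illustration, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Made-up fare samples for illustration only (not values from the dataset)
group_a = np.array([12.5, 8.0, 22.0, 15.5, 9.5, 30.0, 11.0, 7.0])
group_b = np.array([6.0, 9.0, 14.5, 5.5, 8.5, 10.0, 7.5])

# Sample variance of each group divided by its size (no equal-variance assumption)
va = group_a.var(ddof=1) / group_a.size
vb = group_b.var(ddof=1) / group_b.size

# Welch's t statistic: difference in means over its estimated standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual = (va + vb) ** 2 / (va ** 2 / (group_a.size - 1) + vb ** 2 / (group_b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# The same test via SciPy, as used in the notebook
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

alpha = 0.05  # significance level
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```

The manual and SciPy values should agree up to floating-point error, and the final line applies the same decision rule used above: reject the null hypothesis when the p-value is below the significance level.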

## **Part 4.** Communicate Insights to Stakeholders

- What key insight(s) emerged from your A/B test?
- What business recommendations do you propose based on your results?

This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, passengers always complied with it. The data was not collected this way, so we had to assume that the data entries were randomly assigned to the two groups in order to perform an A/B test.

1. Under the random-assignment assumption, the key insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers.

2. This dataset analysis does not account for other likely explanations. For example, riders might not carry lots of cash, so it's easier to pay for longer/farther trips with a credit card. In other words, it's far more likely that fare amount determines payment type, rather than vice versa.
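
One way to probe this alternative explanation is to compare average trip distance alongside average fare for the two payment types. The sketch below does this on a tiny made-up DataFrame; the rows are invented for illustration, and only the column names (`payment_type`, `trip_distance`, `fare_amount`) follow the dataset.

```python
import pandas as pd

# Made-up trips for illustration only; column names follow the dataset
trips = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "trip_distance": [3.3, 1.8, 4.4, 1.0, 0.9, 2.1],
    "fare_amount": [13.0, 16.0, 16.5, 6.5, 5.0, 9.0],
})

# Keep only credit card (1) and cash (2) trips
subset = trips[trips["payment_type"].isin([1, 2])]

# Average distance and fare side by side for each payment type
summary = subset.groupby("payment_type")[["trip_distance", "fare_amount"]].mean()
summary.index = summary.index.map({1: "Credit card", 2: "Cash"})
print(summary)
```

If the same comparison on the real `data` showed that credit card trips are also longer on average, trip distance would be a plausible confounder of the fare difference.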