# Automatidata project 

**Scenario** 

The New York City Taxi & Limousine Commission has tasked the Automatidata team with analyzing the relationship between fare amount and payment type. A follow-up email includes your specific assignment: to conduct an A/B test.


**The purpose** is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. The A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this exercise, we assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example, determine whether customers who use credit cards pay higher fare amounts than customers who use cash.
  
*This activity has three parts:*

**Part 1:** Imports and data loading

**Part 2:** Conduct EDA and hypothesis testing

**Part 3:** Communicate insights with stakeholders

### Imports and data loading

In [1]:
import pandas as pd
from scipy import stats

In [2]:
# Load dataset into dataframe
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

### Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

**Note:** In the dataset, `payment_type` is encoded as integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
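
To make grouped output easier to read, the integer codes can be mapped to the labels listed above. The following is a minimal sketch: the `PAYMENT_LABELS` dictionary, the `payment_label` column, and the small `toy` frame are illustrative names introduced here, and `toy` is a made-up stand-in for `taxi_data`.

```python
import pandas as pd

# Code-to-label mapping taken from the note above
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical stand-in for taxi_data, used only to keep the example self-contained
toy = pd.DataFrame({"payment_type": [1, 2, 1, 3, 4, 1, 2]})

# Add a readable label column without altering the original integer column
toy["payment_label"] = toy["payment_type"].map(PAYMENT_LABELS)

# Count rows per label, mirroring value_counts() on the integer codes
label_counts = toy["payment_label"].value_counts()
print(label_counts)
```

The same `.map(PAYMENT_LABELS)` call should work on `taxi_data['payment_type']`, assuming the column contains only the codes listed above.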



In [4]:
taxi_data['payment_type'].value_counts()

1    15265
2     7267
3      121
4       46
Name: payment_type, dtype: int64

We are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type. 

In [5]:
# Select the column before aggregating so non-numeric columns are not averaged
taxi_data.groupby('payment_type')[['fare_amount']].mean()

| payment_type | fare_amount |
|---|---|
| 1 | 13.429748 |
| 2 | 12.213546 |
| 3 | 12.186116 |
| 4 | 9.913043 |


Based on the averages shown, customers who pay by credit card appear to pay a larger fare amount, on average, than customers who pay in cash. However, this difference might be due to random sampling variation rather than a true difference in fare amount. To assess whether the difference is statistically significant, we will conduct a hypothesis test.
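
Before testing, it can help to look at group size and spread alongside the mean, since all three feed into the test statistic. The sketch below assumes a made-up miniature frame named `toy` in place of `taxi_data`; the fare values are invented for illustration only.

```python
import pandas as pd

# Hypothetical miniature of taxi_data; the values are made up for illustration
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2, 3],
    "fare_amount": [10.0, 14.0, 18.0, 8.0, 12.0, 13.0, 9.0],
})

# Keep only credit card (1) and cash (2), the two groups being compared
two_groups = toy[toy["payment_type"].isin([1, 2])]

# Sample size, mean, and sample standard deviation of the fare for each group
summary = two_groups.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])
print(summary)
```

Noticeably different standard deviations between the groups are one reason to prefer a test that does not assume equal variances.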


### Hypothesis testing


Steps for conducting a hypothesis test: 


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 
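
The four steps above can be sketched as a single function. This is one possible arrangement, not a prescribed workflow: `two_sample_test` is a hypothetical helper name, the default `alpha=0.05` is an assumption, and the samples are synthetic draws rather than the taxi data.

```python
import numpy as np
from scipy import stats

def two_sample_test(sample_a, sample_b, alpha=0.05):
    """Run the four steps for a two-sided test of equal means (hypothetical helper)."""
    # Step 1: H0 says the two means are equal; HA says they differ (two-sided)
    # Step 2: the significance level is the alpha argument
    # Step 3: find the p-value with a two-sample t-test that allows unequal variances
    result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    # Step 4: reject H0 only when the p-value falls below alpha
    reject = bool(result.pvalue < alpha)
    return result.statistic, result.pvalue, reject

# Synthetic samples with made-up parameters, for illustration only
rng = np.random.default_rng(0)
group_a = rng.normal(loc=13.0, scale=10.0, size=500)
group_b = rng.normal(loc=13.0, scale=10.0, size=500)

t_stat, p_value, reject_null = two_sample_test(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

Fixing `alpha` before looking at the p-value keeps the decision rule independent of the result.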


#### 1. Stating the null and alternative hypotheses

$H_0$: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.

#### 2. Choosing a significance level

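The significance level is 5%, and the test is a two-sample t-test. Because the two groups are not assumed to have equal variances, the test is run in its unequal-variance (Welch) form.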
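
For reference, the unequal-variance (Welch) form of the two-sample t statistic, which is what `equal_var=False` selects in `scipy.stats.ttest_ind`, can be written as follows. The symbols are notation introduced here: $\bar{x}_1, \bar{x}_2$ are the sample means of the credit card and cash groups, $s_1^2, s_2^2$ are their sample variances, and $n_1, n_2$ are their sample sizes.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

Here $\nu$ is the Welch-Satterthwaite approximation to the degrees of freedom, and the two-sided p-value is the probability that a t-distributed variable with $\nu$ degrees of freedom exceeds $|t|$ in absolute value.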

#### 3. Finding the p-value

In [6]:
card = taxi_data[taxi_data['payment_type'] == 1]
cash = taxi_data[taxi_data['payment_type'] == 2]

stats.ttest_ind(a=card['fare_amount'], b=cash['fare_amount'], equal_var=False)

Ttest_indResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12)

The p-value is 6.797387473030518e-12, or approximately 0.0000000000068.
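
To see where a result like this comes from, the statistic and p-value of an unequal-variance t-test can be recomputed from group means, variances, and sizes, then compared with what `stats.ttest_ind` returns. The sketch below is illustrative: `welch_t_test` is a hypothetical helper, and `card_like` and `cash_like` are synthetic samples with made-up parameters, not the taxi data.

```python
import numpy as np
from scipy import stats

def welch_t_test(x, y):
    """Two-sided Welch's t-test computed from summary statistics (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = x.size, y.size
    # Estimated variance of each sample mean, using the unbiased variance (ddof=1)
    v_x = x.var(ddof=1) / n_x
    v_y = y.var(ddof=1) / n_y
    # Difference in means divided by its estimated standard error
    t_stat = (x.mean() - y.mean()) / np.sqrt(v_x + v_y)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v_x + v_y) ** 2 / (v_x ** 2 / (n_x - 1) + v_y ** 2 / (n_y - 1))
    # Two-sided p-value from the t distribution's survival function
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# Synthetic samples (made-up parameters), not the taxi dataset
rng = np.random.default_rng(42)
card_like = rng.normal(loc=13.4, scale=13.0, size=400)
cash_like = rng.normal(loc=12.2, scale=11.0, size=250)

manual_t, manual_df, manual_p = welch_t_test(card_like, cash_like)
scipy_result = stats.ttest_ind(card_like, cash_like, equal_var=False)

# The manual and SciPy values should agree up to floating-point error
print(manual_t, scipy_result.statistic)
print(manual_p, scipy_result.pvalue)
```

With large samples, even a modest difference in means yields a very small p-value, because the standard error in the denominator shrinks as the group sizes grow.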

#### 4. Rejecting or failing to reject the null hypothesis

The p-value is much lower than the chosen significance level of 5%. This means we **reject the null hypothesis** and conclude that there is a statistically significant difference in the average fare amount between customers who pay with credit card and customers who pay with cash.

## Conclusion

1.   The key business insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers. 

2.   This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it. The data was not collected this way, so we had to assume that the entries were randomly assigned to the two groups in order to treat the analysis as an A/B test. The dataset also does not account for other likely explanations. For example, riders might not carry much cash, so it is easier to pay for longer trips with a credit card. In other words, it is far more likely that fare amount determines payment type than the reverse.