# Project Demonstrating Statistical Analysis

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests.
  
**The goal** is to apply descriptive statistics and hypothesis testing in Python.

<br/>  

**Part 1:** Imports and data loading

**Part 2:** Conduct hypothesis testing

**Part 3:** Communicate insights with stakeholders


## Part 1: Imports and data loading

In [1]:
import pandas as pd
from scipy import stats

In [2]:
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

payment_type encoding:
1: Credit card
2: Cash
3: No charge
4: Dispute
5: Unknown

In [3]:
taxi_data.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


In [4]:
# Average total amount for each payment type
taxi_data.groupby('payment_type')['total_amount'].mean()

payment_type
1    17.663577
2    13.545821
3    13.579669
4    11.238261
Name: total_amount, dtype: float64

Customers who pay by credit card tend to pay more on average, although a difference like this can arise from random sampling alone. To assess whether it is statistically significant, we will conduct a hypothesis test.
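Group means alone do not show how many trips are in each group or how much the amounts vary, and both matter when judging whether a gap could be sampling noise. The following is a minimal sketch of one way to look at this, using a tiny made-up DataFrame named `sample` (an assumption for illustration, not the real `taxi_data`):

```python
import pandas as pd

# Tiny made-up sample standing in for taxi_data (not the real data).
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "total_amount": [18.0, 22.5, 15.3, 12.0, 9.8, 14.6],
})

# Count, mean, and standard deviation of total_amount per payment type.
summary = sample.groupby("payment_type")["total_amount"].agg(["count", "mean", "std"])
print(summary)
```

The same `agg` call should work on `taxi_data` directly, assuming the column names shown in the `describe` output.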


## Part 2: Hypothesis testing

Null hypothesis: There is no difference in the average total amount between customers who pay by credit card and customers who pay with cash.

Alternative hypothesis: There is a difference in the average total amount between customers who pay by credit card and customers who pay with cash.

Significance level: 5%

In [6]:
# Two-sample t-test without assuming equal variances (Welch's t-test)
stats.ttest_ind(a=taxi_data[taxi_data['payment_type'] == 1]['total_amount'],
                b=taxi_data[taxi_data['payment_type'] == 2]['total_amount'],
                equal_var=False)

Ttest_indResult(statistic=20.34644022783838, pvalue=4.5301445359736376e-91)

The p-value is far below 0.05, so we reject the null hypothesis and conclude that the average total amount differs between customers who pay by credit card and those who pay with cash.
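To make the test statistic less of a black box, here is a rough sketch of what a two-sample t-test with unequal variances (Welch's t-test, which is what `equal_var = False` selects) computes. The helper `welch_t` and the two short lists are made up for illustration and are not part of the original analysis; the sketch returns only the statistic and the approximate degrees of freedom, not the p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each group mean (sample variance uses n - 1).
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up totals, not taken from the taxi data.
card = [18.0, 22.5, 15.3, 30.1, 12.4]
cash = [12.0, 9.8, 14.6, 8.5, 11.2]
t_stat, dof = welch_t(card, cash)
print(t_stat, dof)
```

A positive statistic means the first group has the larger sample mean, which matches the sign of the result reported above for credit card versus cash.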

## Part 3: Communicate insights with stakeholders

*In conclusion, ask yourself the following questions:*

Business insight: customers who pay by card tend to pay more per trip, on average, than customers who pay with cash.

An A/B test may not be the most realistic framing for this project: every rider had to pay for the trip, payment method was not randomly assigned, and other possible reasons why people pay by card were not included. This is a good example of correlation rather than causation. For example, on longer trips people might not carry enough cash to pay for the trip and have to resort to a credit card. Tips are also recorded in a separate column, so a follow-up test should try to remove some of these factors, for example by using the **fare_amount** column instead of the **total_amount** column.
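One possible way to set up the follow-up comparison on **fare_amount** is sketched below. The helper `card_vs_cash_ttest` and the small `demo` DataFrame are hypothetical and exist only to show the mechanics; the numbers are invented and say nothing about the real data.

```python
import pandas as pd
from scipy import stats


def card_vs_cash_ttest(df, column, alpha=0.05):
    """Welch's t-test of `column` for credit card (1) vs. cash (2) trips."""
    card = df.loc[df["payment_type"] == 1, column]
    cash = df.loc[df["payment_type"] == 2, column]
    result = stats.ttest_ind(a=card, b=cash, equal_var=False)
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_null": bool(result.pvalue < alpha),
    }


# Made-up rows standing in for taxi_data; the values are illustrative only.
demo = pd.DataFrame({
    "payment_type": [1, 1, 1, 1, 2, 2, 2, 2],
    "fare_amount": [14.0, 9.5, 21.0, 12.5, 13.5, 10.0, 20.5, 12.0],
    "total_amount": [17.8, 12.3, 26.1, 15.9, 14.3, 10.8, 21.3, 12.8],
})

# Run the same test on both columns so the results can be compared side by side.
for column in ("fare_amount", "total_amount"):
    print(column, card_vs_cash_ttest(demo, column))
```

Calling `card_vs_cash_ttest(taxi_data, "fare_amount")` would apply the same test to the real data, assuming the column names shown earlier. Even then, the result would describe an association, not the effect of the payment method itself.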