# Predicting Taxi Fare

**Introduction:**
This project focuses on predicting taxi fares based on relevant features extracted from a dataset containing information about taxi rides. The primary objective is to develop a machine learning model capable of estimating taxi fares accurately.

**Task:**
1. Perform exploratory data analysis (EDA) to understand the distribution and relationships of key variables, including fare amounts, trip distances, and time of day.
2. Preprocess the data by handling missing values and outliers and by engineering features.
3. Develop a predictive model using machine learning algorithms to forecast taxi fares based on input features.
4. Evaluate the performance of the predictive model using appropriate metrics and refine it as necessary to improve accuracy.


# Statistical Analysis 
This Jupyter Notebook project focuses on performing statistical analysis to interpret data from a dataset containing information about taxi rides. The primary objective is to apply fundamental statistical concepts, such as descriptive statistics and hypothesis testing, to gain insights into the relationship between payment type and fare amount.

## Overview

We aim to conduct a comprehensive analysis of the provided dataset to understand the impact of payment type (credit card vs. cash) on taxi fare amounts. By utilizing statistical techniques, we seek to uncover patterns, trends, and potential correlations between these variables. The ultimate goal is to derive actionable insights that can help taxi cab drivers optimize revenue generation strategies.


# **Conduct an A/B test**
The research question is: “Is there a relationship between total fare amount and payment type?”

### Task 1. Imports and data loading

In [1]:
import pandas as pd
from scipy import stats

In [2]:
# Load the 2017 Yellow Taxi trip data (adjust the path for your machine)
taxi_data = pd.read_csv('C:/Users/Windows/Documents/Automatidata/2017_Yellow_Taxi_Trip_Data.csv')

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).

**Note:** In the dataset, `payment_type` is encoded as integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
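
Because the integer codes are easy to misread in grouped output, it can help to attach readable labels. The following is a minimal sketch on a small made-up frame (the `sample` data and the `payment_label` column name are illustrative assumptions, not part of the original notebook); the label dictionary follows the encoding listed above.

```python
import pandas as pd

# Labels taken from the payment_type encoding listed above.
PAYMENT_TYPE_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Tiny made-up frame standing in for taxi_data (values are illustrative only).
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 3, 4],
    "fare_amount": [13.0, 9.5, 52.0, 6.5, 8.0],
})

# Add a readable label column next to the integer code.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_TYPE_LABELS)

# Group by the label instead of the raw code.
mean_by_label = sample.groupby("payment_label")["fare_amount"].mean()
print(mean_by_label)
```

The same `map` call could be applied to `taxi_data['payment_type']` so that later `groupby` output shows names rather than codes.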


In [3]:
# descriptive stats code for EDA
taxi_data.describe(include='all')

| statistic | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22699.0 | 22699.0 | 22699 | 22699 | 22699.0 | 22699.0 | 22699.0 | 22699 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 |
| unique | | | 22687 | 22688 | | | | 2 | | | | | | | | | | |
| top | | | 07/03/2017 3:45:19 PM | 10/18/2017 8:07:45 PM | | | | N | | | | | | | | | | |
| freq | | | 2 | 2 | | | | 22600 | | | | | | | | | | |
| mean | 56758490.0 | 1.556236 | | | 1.642319 | 2.913313 | 1.043394 | | 162.412353 | 161.527997 | 1.336887 | 13.026629 | 0.333275 | 0.497445 | 1.835781 | 0.312542 | 0.299551 | 16.310502 |
| std | 32744930.0 | 0.496838 | | | 1.285231 | 3.653171 | 0.708391 | | 66.633373 | 70.139691 | 0.496211 | 13.243791 | 0.463097 | 0.039465 | 2.800626 | 1.399212 | 0.015673 | 16.097295 |
| min | 12127.0 | 1.0 | | | 0.0 | 0.0 | 1.0 | | 1.0 | 1.0 | 1.0 | -120.0 | -1.0 | -0.5 | 0.0 | 0.0 | -0.3 | -120.3 |
| 25% | 28520560.0 | 1.0 | | | 1.0 | 0.99 | 1.0 | | 114.0 | 112.0 | 1.0 | 6.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 8.75 |
| 50% | 56731500.0 | 2.0 | | | 1.0 | 1.61 | 1.0 | | 162.0 | 162.0 | 1.0 | 9.5 | 0.0 | 0.5 | 1.35 | 0.0 | 0.3 | 11.8 |
| 75% | 85374520.0 | 2.0 | | | 2.0 | 3.06 | 1.0 | | 233.0 | 233.0 | 2.0 | 14.5 | 0.5 | 0.5 | 2.45 | 0.0 | 0.3 | 17.8 |


We are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type. 

In [4]:
taxi_data.groupby('payment_type')['fare_amount'].mean()

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64

Based on the averages shown, customers who pay by credit card appear to pay a larger fare on average than customers who pay in cash. However, this difference might arise from random sampling rather than reflect a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a hypothesis test.
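
Whether a gap between two means could plausibly come from sampling noise depends on the group sizes and the spread within each group, not only on the means. As a rough sketch, the per-group summary can be extended with counts, standard deviations, and standard errors. The `rides` frame below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up rides standing in for taxi_data (illustrative values only).
rides = pd.DataFrame({
    "payment_type": [1, 1, 1, 1, 2, 2, 2],
    "fare_amount": [10.0, 14.0, 12.0, 20.0, 9.0, 11.0, 13.0],
})

# Count, mean, sample standard deviation, and standard error of the mean per group.
summary = rides.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std", "sem"])
print(summary)
```

A small difference in means paired with large standard errors is weak evidence on its own, which is what motivates a formal test.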

### Task 3. Hypothesis testing

Our goal in this step is to conduct a two-sample t-test. The steps are:
1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis


$H_0$: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.

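We choose 5% as the significance level and proceed with a two-sample t-test. Because the call uses `equal_var=False`, this is Welch's t-test, which does not assume that the two groups have equal variances.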
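
For reference, the statistic behind this test can be written out. With $\bar{x}_i$, $s_i^2$, and $n_i$ denoting the sample mean, sample variance, and size of group $i$ (notation introduced here for illustration, not used in the original notebook), the unequal-variance form that `stats.ttest_ind` computes when `equal_var=False` is:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

Here $\nu$ is the approximate degrees of freedom, which is generally not an integer. The two-sided p-value is the probability that a $t$-distributed variable with $\nu$ degrees of freedom exceeds $|t|$ in absolute value.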

In [5]:
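# Hypothesis test (A/B test) at a 5% significance level
significance_level = 0.05

# Fare amounts for credit card (payment_type == 1) and cash (payment_type == 2) rides
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']

# Two-sample t-test; equal_var=False does not assume equal group variances
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)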

TtestResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12, df=16675.48547403633)

Since the p-value (about 6.8e-12) is far smaller than the 5% significance level, we reject the null hypothesis.

We conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.
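
To make the decision rule concrete, the sketch below recomputes an unequal-variance t-test by hand on two small made-up samples, cross-checks it against `stats.ttest_ind`, and compares the p-value with the significance level. The sample values are illustrative assumptions and are not drawn from the taxi dataset.

```python
import math
from statistics import mean, variance

from scipy import stats

# Made-up fare samples (illustrative only, not drawn from the dataset).
credit_card = [14.5, 9.0, 22.0, 13.5, 31.0, 8.5, 17.0, 12.0]
cash = [7.5, 10.0, 6.5, 12.5, 9.0, 8.0, 11.0]

significance_level = 0.05  # the 5% level used in this analysis

# Squared standard error of each group mean (sample variance / group size).
n1, n2 = len(credit_card), len(cash)
se1 = variance(credit_card) / n1
se2 = variance(cash) / n2

# Unequal-variance t statistic and approximate degrees of freedom.
t_stat = (mean(credit_card) - mean(cash)) / math.sqrt(se1 + se2)
dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

# Two-sided p-value from the t distribution's survival function.
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Cross-check against SciPy's own implementation.
result = stats.ttest_ind(a=credit_card, b=cash, equal_var=False)

# Decision rule: reject H0 only when the p-value falls below the chosen level.
decision = "reject H0" if p_value < significance_level else "fail to reject H0"
print(t_stat, dof, p_value, decision)
```

Fixing the significance level before looking at the p-value keeps the decision rule from being adjusted after the fact.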

### Task 4. Communicate insights with stakeholders

1.   The key business insight is that encouraging customers to pay with credit cards may generate more revenue for taxi cab drivers.

2.   This analysis assumes that passengers were required to pay one way or the other and that, once informed of this requirement, they always complied. The data was not collected this way, so entries had to be treated as if they had been randomly assigned to the two payment groups in order to perform an A/B test. The dataset also does not account for other likely explanations. For example, riders might not carry much cash, so longer trips are easier to pay for with a credit card. In other words, it is far more likely that fare amount determines payment type than the other way around.
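
One way to probe the alternative explanation above is to check whether trip length also differs by payment type. The sketch below does this on a made-up frame; the `rides` values are illustrative assumptions, while the column names `payment_type`, `trip_distance`, and `fare_amount` match the dataset.

```python
import pandas as pd

# Made-up rides standing in for taxi_data (illustrative values only).
rides = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "trip_distance": [5.2, 8.1, 2.4, 1.1, 0.9, 3.0],
    "fare_amount": [18.5, 27.0, 10.0, 6.5, 6.0, 12.0],
})

# Keep only credit card (1) and cash (2) rides, as in the hypothesis test.
rides = rides[rides["payment_type"].isin([1, 2])]

# Compare average trip distance and fare side by side for each payment type.
summary = rides.groupby("payment_type")[["trip_distance", "fare_amount"]].mean()
print(summary)
```

If credit card rides are also longer on average in the real data, trip distance would be a plausible confounder, and the fare difference alone would not show that payment type drives revenue.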