# **Project Name**    - NYC taxi driver



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

This project uses a dataset with key details of taxi trips, including pickup and dropoff locations, passenger counts, and trip duration. The dataset includes both geographic data (longitude and latitude for pickup/dropoff points) and temporal data (pickup and dropoff times). The goal is to analyze the data and build predictive models that can help optimize taxi services. Exploring relationships between features such as pickup_datetime, trip_duration, and the location data can uncover insights into traffic patterns, ride durations, and passenger behavior. The project aims to provide recommendations for improving operational efficiency, pricing models, and service quality.

# **GitHub Link -**

https://github.com/HeMANSC/HeMANSC/blob/b4b77c7084230c853c024ba7e81d75e09a5ec641/NYC_taxi_project.ipynb

# **Problem Statement**


The project aims to **improve taxi services** by **predicting trip duration** and **optimizing vehicle dispatch** to reduce idle time. Dynamic pricing models adjust fares based on trip factors, **balancing supply and demand**. **Analyzing peak times** helps optimize scheduling and staffing. **Identifying inefficient routes** improves customer satisfaction, while anomaly detection ensures data integrity and reduces fraud.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from geopy.distance import geodesic

from scipy.stats import ttest_ind

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from scipy.stats.mstats import winsorize

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, GridSearchCV


### Dataset Loading

In [None]:
# Load Dataset
# Optional: pass nrows=15000 to read_csv for faster experimentation
df = pd.read_csv('/content/Copy of NYC Taxi Data.csv')

In [None]:
df.shape
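The guidelines ask for exception handling, so here is a minimal sketch of a loader that fails with a clear message when the file or expected columns are missing. The helper `load_trips` and the list `REQUIRED_COLUMNS` are hypothetical names introduced for illustration; the column names are the ones used later in this notebook, and the sample row is made up.

```python
import io
import pandas as pd

# Hypothetical list: columns that later cells in this notebook rely on.
REQUIRED_COLUMNS = [
    "vendor_id", "pickup_datetime", "dropoff_datetime", "passenger_count",
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
]

def load_trips(source, nrows=None):
    """Read the trip CSV and check that the expected columns are present."""
    try:
        trips = pd.read_csv(source, nrows=nrows)
    except FileNotFoundError as exc:
        raise FileNotFoundError(f"Dataset not found: {source}") from exc
    missing = [c for c in REQUIRED_COLUMNS if c not in trips.columns]
    if missing:
        raise ValueError(f"Dataset is missing columns: {missing}")
    return trips

# Tiny in-memory CSV standing in for the real file (made-up values).
sample_csv = io.StringIO(
    "vendor_id,pickup_datetime,dropoff_datetime,passenger_count,"
    "pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude\n"
    "1,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.98,40.76,-73.96,40.77\n"
)
sample = load_trips(sample_csv)
print(sample.shape)
```

In the notebook itself, the same helper could be called with the CSV path in place of `sample_csv`.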

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
df.dropna( inplace=True)

In [None]:
df.dropna(axis=0, inplace=True)  # inplace=True returns None, so the result is not assigned

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Id**: Unique identifier for each trip or record.

**Vendor_id**: Identifier for the taxi or service provider (e.g., Vendor A or B).

**Passenger_count**: Number of passengers for the trip.

**Store_and_fwd_flag**: Indicates whether the trip data was stored locally and forwarded later (Y) or sent in real-time (N).

**Trip_duration**: Duration of the trip, in seconds in the raw data; the data wrangling step recomputes it in minutes.

**Distance_km**: Distance of the trip in kilometers.

**Duration in mins**: Duration of the trip in minutes (converted from seconds).

**Pick date**: Date of trip pickup.

**Pick time**: Time of trip pickup.

**Drop date**: Date of trip drop-off.

**Drop time**: Time of trip drop-off.

**Pickup_datetime**: The exact date and time when the trip began.

**Dropoff_datetime**: The exact date and time when the trip ended.

**Pickup_longitude**: The geographic longitude coordinate where the trip started.

**Pickup_latitude**: The geographic latitude coordinate where the trip started.

**Dropoff_longitude**: The geographic longitude coordinate where the trip ended.

**Dropoff_latitude**: The geographic latitude coordinate where the trip ended.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Calculate distances and save them in a new column

df['distance_km'] = df.apply(
    lambda row: geodesic(
        (row['pickup_latitude'], row['pickup_longitude']),
        (row['dropoff_latitude'], row['dropoff_longitude'])
    ).kilometers,
    axis=1
)
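Applying `geodesic` row by row can be slow on a large dataset. As an alternative sketch (not the method used above), the haversine formula can be evaluated on whole NumPy arrays at once. It assumes a spherical Earth, so results differ slightly from geopy's ellipsoidal `geodesic`. The function name `haversine_km` and the sample coordinates are illustrative assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Made-up coordinates in the New York area, for illustration only.
pickup_lat = np.array([40.7580, 40.6413])
pickup_lon = np.array([-73.9855, -73.7781])
dropoff_lat = np.array([40.7484, 40.7580])
dropoff_lon = np.array([-73.9857, -73.9855])

distances = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)
```

With the DataFrame above, the four coordinate columns could be passed directly, since pandas Series work with NumPy functions.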



In [None]:

# Convert 'pickup_datetime' and 'dropoff_datetime' to datetime format
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'], errors='coerce')

# Extract useful features from 'pickup_datetime' and 'dropoff_datetime'
df['pickup_year'] = df['pickup_datetime'].dt.year
df['pickup_month'] = df['pickup_datetime'].dt.month
df['pickup_day'] = df['pickup_datetime'].dt.day
df['pickup_hour'] = df['pickup_datetime'].dt.hour
df['pickup_minute'] = df['pickup_datetime'].dt.minute
df['pickup_dayofweek'] = df['pickup_datetime'].dt.dayofweek  # Monday=0, Sunday=6

df['dropoff_year'] = df['dropoff_datetime'].dt.year
df['dropoff_month'] = df['dropoff_datetime'].dt.month
df['dropoff_day'] = df['dropoff_datetime'].dt.day
df['dropoff_hour'] = df['dropoff_datetime'].dt.hour
df['dropoff_minute'] = df['dropoff_datetime'].dt.minute
df['dropoff_dayofweek'] = df['dropoff_datetime'].dt.dayofweek

# Recompute trip duration in minutes from the timestamps (this overwrites the original column)
df['trip_duration'] = (df['dropoff_datetime'] - df['pickup_datetime']).dt.total_seconds() / 60



In [None]:
#Create zones for pickup and dropoff
grid_size = 0.01
df['pickup_zone'] = list(zip((df['pickup_latitude'] // grid_size).astype(float),
                     (df['pickup_longitude'] // grid_size).astype(float)))

df['dropoff_zone'] = list(zip((df['dropoff_latitude'] // grid_size).astype(float),
                     (df['dropoff_longitude'] // grid_size).astype(float)))

In [None]:
count_duplicates = df.duplicated().sum()

print("Number of duplicate rows:", count_duplicates)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

### What all manipulations have you done and insights you found?

1. Function to calculate the distance in km.

2. Converting trip duration from seconds to minutes.

3. Create separate columns for date and time.

4. Function to map times into 30-minute intervals.

5. Day of the week of each pickup.

6. Create zones for pickup and dropoff.

7. Dropped all columns that are no longer needed.
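Item 4 mentions mapping times into 30-minute intervals, but the code for it is not shown in this notebook. One possible way to do it, offered only as a sketch, is to floor each timestamp to the nearest half hour. The column names `pickup_interval` and `interval_label` and the sample timestamps are assumptions.

```python
import pandas as pd

# Made-up pickup times for illustration.
pickups = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-03-14 17:24:55",
        "2016-03-14 17:31:02",
        "2016-03-14 23:59:59",
    ])
})

# Round each timestamp down to the start of its 30-minute window.
pickups["pickup_interval"] = pickups["pickup_datetime"].dt.floor("30min")
# Keep only the time of day as a readable label, e.g. "17:30".
pickups["interval_label"] = pickups["pickup_interval"].dt.strftime("%H:%M")
print(pickups)
```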


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
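# Chart - 1 visualization code
from IPython.display import display
# Only the last expression in a cell is shown automatically, so display both tables explicitly
display(df.describe().T)
display(df.describe(include='object').T)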

##### 1. Why did you pick the specific chart?

Descriptive statistics give a brief overview of the data.




##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
for i in df.select_dtypes(include = 'number').columns:
    sns.histplot(data=df, x=i)
    plt.show()

In [None]:
# Chart - 2 visualization code
columns_to_plot = ['passenger_count', 'distance_km', 'trip_duration']

for col in columns_to_plot:
    sns.boxplot(data=df, x=col)
    plt.show()

##### 1. Why did you pick the specific chart?

Histograms and box plots show each variable's distribution and make outliers easy to spot.

##### 2. What is/are the insight(s) found from the chart?

This identifies anomalies, like very long durations for short distances.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight will help to understand trip efficiency.
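The box plots flag outliers using the interquartile range (IQR). A minimal sketch of the same rule in code is shown here, assuming the common 1.5 × IQR threshold; the helper `iqr_bounds` and the sample durations are made up for illustration.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up trip durations in minutes; the last one is suspiciously long.
durations = pd.Series([5.0, 7.0, 8.0, 9.0, 10.0, 12.0, 240.0], name="trip_duration")

low, high = iqr_bounds(durations)
outliers = durations[(durations < low) | (durations > high)]
print(f"Bounds: {low} to {high}")
print(outliers)
```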

#### Chart - 3

In [None]:
# Chart - 3 visualization code
for i in ['vendor_id', 'passenger_count', 'store_and_fwd_flag', 'distance_km', 'trip_duration']:
    sns.scatterplot(data=df, x=i, y='trip_duration')
    plt.show()




##### 1. Why did you pick the specific chart?

To understand trip usage patterns based on passenger_count.

##### 2. What is/are the insight(s) found from the chart?

This can reveal if most trips are solo or group trips, influencing resource allocation or pricing strategies.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This will help in classifying passengers for better offers and pricing.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
df.groupby('vendor_id')[['trip_duration', 'distance_km']].mean()


##### 1. Why did you pick the specific chart?

To compare trip_duration, distance_km, or passenger_count between vendor_id values.

##### 2. What is/are the insight(s) found from the chart?

This helps determine if one vendor provides longer or shorter trips on average, indicating potential differences in service types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Count pickups per zone, hour, and day
pickup_counts = df.groupby(['pickup_zone', 'pickup_hour', 'pickup_day']).size().reset_index(name='pickup_count')
# Count dropoffs per zone, hour, and day
dropoff_counts = df.groupby(['dropoff_zone', 'pickup_hour', 'pickup_day']).size().reset_index(name='dropoff_count')
# Identify zones with the most pickups for each hour and day
max_pickup_zones = pickup_counts.loc[pickup_counts.groupby(['pickup_hour', 'pickup_day'])['pickup_count'].idxmax()]
# Identify zones with the most dropoffs for each hour and day
max_dropoff_zones = dropoff_counts.loc[dropoff_counts.groupby(['pickup_hour', 'pickup_day'])['dropoff_count'].idxmax()]

heatmap_data = pickup_counts.pivot_table(index='pickup_day', columns='pickup_hour', values='pickup_count', aggfunc='sum')

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, cmap='Reds', annot=False)
plt.title('Heatmap of Pickups by Hour and Day')
plt.xlabel('Hour of Day')
plt.ylabel('Day of Month')  # pickup_day holds the day of the month, not the weekday
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
pickup_by_hour = heatmap_data.sum(axis=0)
# Plot the line chart
plt.figure(figsize=(12, 6))
pickup_by_hour.plot(kind='line', marker='o', color='red')
plt.title('Total Pickups by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Total Pickups')
plt.xticks(range(0, 24))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
%pip install folium

In [None]:
df.columns

In [None]:
df=df.interpolate()
df

In [None]:
# Chart - 7 visualization code
import folium
from folium.plugins import HeatMap

df = pd.DataFrame(df)

# Ensure the datetime column is in datetime format
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

# Filter data for a specific high-demand hour (e.g., 8 AM to 9 AM)
filtered_df = df[(df['pickup_datetime'].dt.hour == 8)]

# Prepare data for heatmap: [[lat, lon], ...] (no weights, so every pickup counts equally)
heatmap_data = filtered_df[['pickup_latitude', 'pickup_longitude']].values.tolist()

# Create a map centered on a location (e.g., New York City)
map_center = [filtered_df['pickup_latitude'].mean(), filtered_df['pickup_longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=12, scrollWheelZoom=False, zoom_control=False)

# Add the heatmap layer
#HeatMap(heatmap_data).add_to(m)
HeatMap(heatmap_data, opacity=0.6).add_to(m)

# Display the map
m

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(8, 5))
df['passenger_count'].value_counts().sort_index().plot(kind='bar', color='orange')
plt.title('Passenger Count Distribution')
plt.xlabel('Number of Passengers')
plt.ylabel('Number of Trips')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Analyze typical passenger counts (e.g., solo trips vs. group trips).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
plt.figure(figsize=(10, 6))
df['pickup_hour'].value_counts().sort_index().plot(label='Pickups', color='blue')
df['dropoff_hour'].value_counts().sort_index().plot(label='Dropoffs', color='green')
plt.title('Pickups and Dropoffs by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Trips')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Compare pickup and dropoff trends over the course of the day.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(8, 5))
df['vendor_id'].value_counts().plot(kind='bar', color=['blue', 'red'])
plt.title('Number of Trips by Vendor')
plt.xlabel('Vendor ID')
plt.ylabel('Number of Trips')
plt.show()

##### 1. Why did you pick the specific chart?

To compare trip volumes between vendors.

##### 2. What is/are the insight(s) found from the chart?

Compare trip volumes across different vendors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
zone_counts = df['pickup_zone'].value_counts()

# Plot a pie chart for the distribution of trips by pickup zone
plt.figure(figsize=(8, 8))
zone_counts.head(10).plot(kind='pie', autopct='%1.1f%%', startangle=90, cmap='Set3')
plt.title('Top 10 Pickup Zones Distribution', size=16)
plt.ylabel('')  # Remove the y-axis label
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart shows how trips are split among the ten busiest pickup zones, which makes it easy to see which zones dominate demand.

##### 2. What is/are the insight(s) found from the chart?

Questions to answer:

1. Are there any zones that have a high frequency of trips from and to specific other zones?

2. Are there zones that rarely have pickups or dropoffs? (This might indicate areas that are underserved or rarely visited.)

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# Count pickups per day of the week
pickup_by_day = df.groupby('pickup_dayofweek').size()

# Plotting the bar chart
plt.figure(figsize=(10, 6))
ax = pickup_by_day.plot(kind='bar', color='skyblue', alpha=0.8)

# Set the title and labels
plt.title('Total Pickups by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Total Pickups')

# Set x-axis ticks and labels
ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], rotation=45)

# Add grid for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the number of trips per month
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='pickup_month', palette='Set2')
plt.title('Number of Trips per Month', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

# Plot average trip duration per month
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='pickup_month', y='trip_duration', palette='Blues')
plt.title('Average Trip Duration per Month', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average Trip Duration (minutes)', fontsize=12)
plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

# Plot the average number of passengers per month
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='pickup_month', y='passenger_count', palette='coolwarm')
plt.title('Average Number of Passengers per Month', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average Passengers', fontsize=12)
plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

# Heatmap to show trip counts across months and days
plt.figure(figsize=(12, 6))
monthly_day_counts = pd.crosstab(df['pickup_month'], df['pickup_dayofweek'])
sns.heatmap(monthly_day_counts, annot=True, cmap='YlGnBu', fmt='d', linewidths=0.5)
plt.title('Trips per Month and Day of Week', fontsize=16)
plt.xlabel('Day of the Week', fontsize=12)
plt.ylabel('Month', fontsize=12)
plt.xticks(ticks=np.arange(7) + 0.5, labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])  # heatmap cells are centered at 0.5, 1.5, ...
plt.show()




##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Trips per Month: Identifies seasonal variations in trip demand.

Average Trip Duration per Month: Highlights how trip durations vary by month.

Average Passengers per Month: Shows trends in group ride frequency.

Trips per Month and Day: Reveals patterns in trip activity across different days and months, helping identify peak days and times.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
selected_columns = [
    'vendor_id', 'passenger_count', 'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude', 'trip_duration', 'distance_km',
    'pickup_year', 'pickup_month', 'pickup_day', 'pickup_hour', 'pickup_minute',
    'pickup_dayofweek', 'dropoff_year', 'dropoff_month', 'dropoff_day',
    'dropoff_hour', 'dropoff_minute', 'dropoff_dayofweek'
]

# Filter the DataFrame to include only the selected columns
selected_df = df[selected_columns]

# Calculate the correlation matrix
corrheat = selected_df.corr()

# Plot the heatmap
plt.figure(figsize=(12, 10))  # Set figure size
sns.heatmap(corrheat, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, cbar=True)

# Display the plot
plt.title('Correlation Heatmap of Selected Numeric Columns')
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap helps identify relationships between variables, spot multicollinearity, simplify data exploration, aid in feature selection, and improve model performance through insights.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(selected_df)
plt.suptitle('Pairplot of Selected Numeric Columns', size=16)
plt.show()

##### 1. Why did you pick the specific chart?

Visualize Pairwise Relationships

Explore Distributions of Individual Variables

Identify Correlations

Spot Outliers

##### 2. What is/are the insight(s) found from the chart?

The pairplot reveals correlations between variables like trip duration and distance, aiding predictive modeling and resource planning. It helps identify multicollinearity, outliers, and non-linear relationships, improving data quality. Insights into time-based patterns guide demand forecasting, pricing strategies, and operational efficiency, supporting better business decision-making and customer service.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

A t-test checks whether there is a statistically significant difference in mean trip_duration between two groups, here the two vendors (vendor 1 and vendor 2).

**Null hypothesis**: There is no significant difference in the means of trip_duration between vendor 1 and vendor 2.

**Alternative hypothesis**: The means of trip_duration between vendor 1 and vendor 2 are significantly different.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
group1 = df[df['vendor_id'] == 1]['trip_duration']
group2 = df[df['vendor_id'] == 2]['trip_duration']

# Perform the t-test
stat, p_value = ttest_ind(group1, group2, equal_var=False)  # Use `equal_var=True` if variances are equal

# Display the results
print(f"T-Statistic: {stat}")
print(f"P-Value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("Reject the null hypothesis: The means of trip_duration are significantly different between the two groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference in means of trip_duration between the two groups.")

##### Which statistical test have you done to obtain P-Value?

t-test

##### Why did you choose the specific statistical test?

A t-test is a statistical test used to compare the means of two groups to determine whether they are significantly different from each other.

Purpose of the T-Test
The purpose is to test if the average (mean) value of a continuous variable (e.g., trip_duration) differs between two groups (e.g., vendor 1 vs. vendor 2).

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

I want to test if there is a correlation between distance_km and trip_duration.

**Null Hypothesis**: There is no correlation between distance_km and trip_duration.

**Alternative Hypothesis** : There is a significant correlation between distance_km and trip_duration.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Perform Pearson Correlation
corr, p_value = pearsonr(df['distance_km'], df['trip_duration'])

print(f"Correlation Coefficient: {corr}")
print(f"P-Value: {p_value}")

if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant correlation between distance_km and trip_duration.")
else:
    print("Fail to reject the null hypothesis: No significant correlation between distance_km and trip_duration.")

##### Which statistical test have you done to obtain P-Value?

**Correlation Test**

##### Why did you choose the specific statistical test?

Pearson's correlation test measures the strength of the linear relationship between two continuous variables, which matches the hypothesis stated above for distance_km and trip_duration.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

I want to test if two continuous variables (e.g., distance_km and trip_duration) are correlated:

Use Pearson’s correlation for linear relationships or Spearman’s correlation for monotonic relationships that may not be linear.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr, spearmanr

# Pearson correlation
pearson_corr, pearson_p = pearsonr(df['distance_km'], df['trip_duration'])
print(f"Pearson Correlation: {pearson_corr}, P-value: {pearson_p}")

# Spearman correlation
spearman_corr, spearman_p = spearmanr(df['distance_km'], df['trip_duration'])
print(f"Spearman Correlation: {spearman_corr}, P-value: {spearman_p}")

##### Which statistical test have you done to obtain P-Value?

Pearson’s correlation and Spearman’s correlation test

##### Why did you choose the specific statistical test?

**Pearson's correlation** shows a very weak linear relationship, but the small p-value indicates statistical significance.

**Spearman's correlation** suggests a strong monotonic relationship, with a significant p-value, though not linear.
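To illustrate how the two coefficients can disagree, the sketch here uses a synthetic relationship that always increases but is far from a straight line. The data are made up and are not meant to reproduce the results reported above, where outliers may also play a role.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data: y always increases with x, but exponentially rather than linearly.
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2)

pearson_corr, _ = pearsonr(x, y)    # measures linear association
spearman_corr, _ = spearmanr(x, y)  # measures monotonic association via ranks

print(f"Pearson: {pearson_corr:.3f}")
print(f"Spearman: {spearman_corr:.3f}")
```

Because Spearman's coefficient works on ranks, any strictly increasing relationship scores 1, while Pearson's coefficient drops when the increase is not linear.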

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing values are found in data at this stage.


### 2. Handling Outliers

In [None]:


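# Keep an unmodified copy of the data for reference
df_cap = df.copy()
# Cap the lowest and highest 5% of values (winsorization)
df['distance_km'] = winsorize(df_cap['distance_km'], limits=(0.05, 0.05))
df['trip_duration'] = winsorize(df_cap['trip_duration'], limits=(0.05, 0.05))

df['vendor_id'] = df_cap['vendor_id']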

In [None]:
print(df_cap['distance_km'].shape)
print(df_cap['trip_duration'].shape)

In [None]:
df.shape
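Winsorizing replaces extreme values with the nearest value inside the chosen limits instead of dropping rows, which is why the shape stays the same. A small sketch on made-up numbers follows; it uses 10% limits rather than the 5% used above so that the effect is visible on only ten values.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up values with one extreme observation at each end.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# Cap the lowest 10% and the highest 10% of the values.
capped = np.asarray(winsorize(values, limits=(0.1, 0.1)))

print(capped)
print(len(values) == len(capped))  # no rows are removed
```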

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
df.dtypes

In [None]:


# Label encode 'store_and_fwd_flag'
label_encoder = LabelEncoder()
df['store_and_fwd_flag'] = label_encoder.fit_transform(df['store_and_fwd_flag'])
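The cell above label-encodes the flag. For comparison, this sketch shows label encoding next to one-hot encoding on a few made-up values. For a two-category column, label encoding already yields a single 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up flag values for illustration.
flags = pd.Series(["N", "Y", "N", "N"], name="store_and_fwd_flag")

# Label encoding: each category becomes an integer (classes are sorted, so N -> 0, Y -> 1).
encoder = LabelEncoder()
encoded = encoder.fit_transform(flags)
print(encoded, list(encoder.classes_))

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(flags, prefix="store_and_fwd_flag")
print(one_hot)
```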

In [None]:
df.info()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Label encoding on 'store_and_fwd_flag': it is a binary Y/N flag, so the two categories map to 0 and 1.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & Remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation/Feature construction


In [None]:
# Manipulate Features to minimize feature correlation and create new features


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation/Feature extraction

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:


# List of numerical columns
numerical_cols = ['dropoff_longitude', 'dropoff_latitude', 'distance_km', 'pickup_hour', 'dropoff_hour']

# Scale the numerical features
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
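The cell above fits the scaler on the full dataset before splitting. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not influence training. The array and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with two columns on different scales.
rng = np.random.default_rng(42)
features = rng.normal(loc=[5.0, 12.0], scale=[2.0, 6.0], size=(200, 2))

X_tr, X_te = train_test_split(features, test_size=0.2, random_state=42)

# Learn the mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```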

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
df.columns

In [None]:
columns_to_drop = ['id', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude','pickup_zone', 'dropoff_zone','pickup_datetime', 'dropoff_datetime'
       ]

# Drop the columns
df = df.drop(columns=columns_to_drop)

# Display the DataFrame after dropping the columns
print(df.head())

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
df.info()

In [None]:


# Define features and target
X = df.drop('trip_duration', axis=1)  # Features
y = df['trip_duration']  # Target variable

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
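# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Express R-squared as a percentage of variance explained
# (this is a regression score, not classification accuracy)
accuracy = r2 * 100
print(f"R-squared (as %): {accuracy:.2f}%")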


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.
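One way to build such a score chart is to collect several regression metrics in one place. The sketch below is illustrative: `regression_report` is a hypothetical helper, and the small arrays stand in for `y_test` and `y_pred`. R-squared is the share of variance explained, while RMSE and MAE are in the target's own units (seconds for `trip_duration`), which makes them easier to relate to the business problem.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return common regression metrics as a dict (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MSE': mse,
        'RMSE': float(np.sqrt(mse)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
    }


# Stand-in values (assumptions), not results from this notebook
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])

report = regression_report(y_true_demo, y_pred_demo)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Calling the same helper for each model and stacking the resulting dicts into a table would give a side-by-side comparison.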

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Initialize the model
model = LinearRegression()
# Run 10-fold cross-validation on the training set, scoring each fold by R-squared
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring='r2')
# Report the per-fold scores and their spread
print(f"Cross-Validation R-squared Scores: {cv_scores}")
print(f"Mean R-squared Score: {cv_scores.mean():.2f}")
print(f"Standard Deviation of R-squared: {cv_scores.std():.2f}")

# Mean cross-validated R-squared as a percentage
accuracy = cv_scores.mean() * 100
print(f"Cross-validated R-squared (as %): {accuracy:.2f}%")


##### Which hyperparameter optimization technique have you used and why?

Answer Here.
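Plain `LinearRegression` exposes no regularization strength to tune, so the cell above only cross-validates it. If tuning is wanted for a linear model, one option (a swapped-in technique, not what this notebook does) is `Ridge`, whose `alpha` can be searched with `GridSearchCV`. The sketch below uses synthetic data from `make_regression` as a stand-in for `X_train` and `y_train`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption)
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Candidate regularization strengths (illustrative values)
alpha_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search, scored by R-squared
ridge_search = GridSearchCV(Ridge(), alpha_grid, cv=5, scoring='r2')
ridge_search.fit(X_syn, y_syn)

print(f"Best alpha: {ridge_search.best_params_['alpha']}")
print(f"Best CV R-squared: {ridge_search.best_score_:.3f}")
```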

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:


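from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Train a small random forest (10 trees) on the training set
rf_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_train, y_train)

# Predict and evaluate R-squared on the test set
r2 = r2_score(y_test, rf_model.predict(X_test))
print(f"R-squared: {r2:.2f}")

# Express R-squared as a percentage of variance explained
accuracy = r2 * 100
print(f"R-squared (as %): {accuracy:.2f}%")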

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Cross-validation for Random Forest (5 folds, scored by R-squared)
cv_scores = cross_val_score(RandomForestRegressor(random_state=42), X_train, y_train, cv=5, scoring='r2')
print(f"CV Mean R-squared: {cv_scores.mean():.2f}, Std: {cv_scores.std():.2f}")

# Hyperparameter tuning with GridSearchCV over tree count and depth
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {'n_estimators': [10, 50], 'max_depth': [5, 10]},
    cv=3,
    scoring='r2',
)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV R-squared: {grid_search.best_score_:.2f}")

##### Which hyperparameter optimization technique have you used and why?

**GridSearchCV**

GridSearchCV evaluates every combination in a given parameter grid with cross-validation and keeps the combination with the best mean score. Here it tunes the number of trees (`n_estimators`) and the maximum tree depth (`max_depth`) of the random forest.
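After the search, the tuned model still needs to be scored on data it has not seen. A minimal sketch, assuming synthetic data from `make_regression` in place of the taxi features: with the default `refit=True`, `GridSearchCV` retrains the best combination on the whole training set and exposes it as `best_estimator_`, which can then be evaluated on the test split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the taxi features and target (assumption)
X_syn, y_syn = make_regression(
    n_samples=300, n_features=4, noise=5.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Same grid as in the cell above
rf_grid = {'n_estimators': [10, 50], 'max_depth': [5, 10]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), rf_grid, cv=3, scoring='r2'
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refitted on all of X_tr with the best parameters
best_rf = search.best_estimator_
test_r2 = r2_score(y_te, best_rf.predict(X_te))

print(f"Best Parameters: {search.best_params_}")
print(f"Test R-squared: {test_r2:.2f}")
```

Comparing this test score with the untuned model's test score is what shows whether the tuning actually helped.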

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
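# ML Model - 3 Implementation
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the decision tree
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = dt_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Decision Tree MSE: {mse:.2f}")
print(f"Decision Tree R-squared: {r2:.2f}")

# Express R-squared as a percentage of variance explained
accuracy = r2 * 100
print(f"R-squared (as %): {accuracy:.2f}%")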



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Baseline: 5-fold cross-validated R-squared with default settings
dt_model = DecisionTreeRegressor(random_state=42)
cv_scores = cross_val_score(dt_model, X_train, y_train, cv=5, scoring='r2')
print(f"CV Mean R-squared: {cv_scores.mean():.2f}")

# Define the parameter grid.
# 'gini' and 'entropy' are classification criteria and raise an error here;
# a regressor takes 'squared_error' or 'friedman_mse' (scikit-learn 1.0+).
param_grid = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'criterion': ['squared_error', 'friedman_mse']
}

# Grid Search
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

# Output the best hyperparameters and score
print(f"Best Params: {grid_search.best_params_}")
print(f"Best CV R-squared: {grid_search.best_score_:.2f}")


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
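One explainability option that ships with scikit-learn is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below is an illustration under stated assumptions, not this notebook's result: the data is synthetic, the column names mirror features mentioned in this project, and the target is constructed so that `distance_km` dominates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix (assumption)
rng = np.random.default_rng(42)
n = 400
X_demo = pd.DataFrame({
    'distance_km': rng.uniform(0.5, 20.0, size=n),
    'pickup_hour': rng.integers(0, 24, size=n).astype(float),
    'passenger_count': rng.integers(1, 7, size=n).astype(float),
})
# Constructed target (assumption): duration driven mainly by distance,
# slightly by hour, and not at all by passenger count
y_demo = (
    120.0 * X_demo['distance_km']
    + 5.0 * X_demo['pickup_hour']
    + rng.normal(0.0, 30.0, size=n)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Mean drop in R-squared on the test split when each column is shuffled
result = permutation_importance(
    rf, X_te, y_te, scoring='r2', n_repeats=5, random_state=42
)
importances = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Tree-based models also expose `feature_importances_`, which is cheaper to read but is computed from the training data rather than from held-out rows.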

# **Conclusion**

In conclusion, this project leverages data analysis to address key business challenges in the taxi and ride-hailing industry. By optimizing resource allocation, implementing dynamic pricing, and improving service efficiency, businesses can enhance operational performance and customer satisfaction. Additionally, detecting anomalies ensures data integrity, minimizing fraud and errors. Overall, these insights enable more informed decision-making, leading to improved profitability and a competitive edge in the market.

Three regression models (LinearRegression, RandomForestRegressor, and DecisionTreeRegressor) were also applied to predict trip duration, with the reported model accuracy (R-squared expressed as a percentage) reaching up to 95%.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***