<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
  <title>Project 1 - Predicting Trip Duration</title>
  <style>
    body {
      margin: 0;
      font-family: 'Segoe UI', sans-serif;
      background-color: #f4f6f8;
    }

    .project-container {
      background: linear-gradient(135deg, #2ecc71, #27ae60);
      padding: 60px 40px;
      border-radius: 25px;
      color: white;
      text-align: center;
      max-width: 1000px;
      margin: 50px auto;
      box-shadow: 0 8px 24px rgba(0,0,0,0.25);
      transition: transform 0.3s ease, box-shadow 0.3s ease;
    }

    .project-container:hover {
      transform: translateY(-5px);
      box-shadow: 0 12px 28px rgba(0,0,0,0.35);
    }

    .project-container h1 {
      font-size: 3.2rem;
      margin-bottom: 25px;
    }

    .project-container h1 span {
      color: #a2f78d;
    }

    .project-container p {
      font-size: 1.25rem;
      line-height: 1.8;
      color: #d0f0c0;
      max-width: 800px;
      margin: 0 auto;
    }

    @media (max-width: 768px) {
      .project-container {
        padding: 40px 20px;
      }

      .project-container h1 {
        font-size: 2.2rem;
      }

      .project-container p {
        font-size: 1.1rem;
      }
    }
  </style>
</head>
<body>
  <main>
    <div class="project-container">
      <h1><span>Predicting</span> Trip Duration (EDA)</h1>
      <p>
        The analysis uncovers patterns in travel time based on location, time, distance, and passenger count. It involves thorough data cleaning, visualization, and modeling to improve prediction accuracy.
      </p>
    </div>
  </main>
</body>
</html>


<div style="text-align: center; margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #18e972ff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  " 
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Import Libraries
  </span>
</div>
<a id="import-data"></a>


In [9]:
import numpy as np
import pandas as pd
import joblib
import logging
from pathlib import Path
from typing import Optional, Tuple, Union, Dict, Any
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import QuantileTransformer, MinMaxScaler, StandardScaler, PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_regression
import warnings
warnings.filterwarnings("ignore")
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import scipy.stats as stats
import math
from plotly.subplots import make_subplots
from pathlib import Path





<div style="text-align: center; margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #18d25cff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  " 
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Import Data
  </span>
</div>
<a id="import-data"></a>


In [10]:
path= Path(r'D:\ML\Homework for ML\Projects\Project_1\New_Project\dataset\sample')
df= pd.read_csv(path)

In [11]:
# Train , Test = train_test_split(df, test_size=0.02, random_state=42 ,shuffle=True)

In [12]:
# df_train=pd.DataFrame(Train)
# df_test =pd.DataFrame(Test)

In [13]:
# df_train.head(50000).to_csv(f'dataset\sample')

<!-- 💠 Section Title -->
<div style="
  text-align: center; 
  margin: 50px auto 30px auto;
">
  <span style="
    display: inline-block;
    background: rgba(46, 204, 113, 0.95);
    backdrop-filter: blur(6px);
    color: white;
    font-family: 'Segoe UI', sans-serif;
    font-size: 1.8rem;
    font-weight: 700;
    padding: 12px 30px;
    border-radius: 15px;
    box-shadow: 0 8px 22px rgba(0, 0, 0, 0.25);
  ">
    📄 Dataset Fields Description
  </span>
</div>

<!-- 📋 Stylish Field Table -->
<table style="
  border-collapse: collapse;
  width: 95%;
  max-width: 1100px;
  margin: 30px auto;
  font-family: 'Segoe UI', sans-serif;
  font-size: 1rem;
  border-radius: 12px;
  overflow: hidden;
  box-shadow: 0 6px 18px rgba(0,0,0,0.12);
">
  <thead>
    <tr style="background: linear-gradient(135deg, #27ae60, #2ecc71); color: white;">
      <th style="padding: 16px 14px; text-align: left;">🧾 Field</th>
      <th style="padding: 16px 14px; text-align: left;">📌 Description</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: #ffffff;">
      <td><code>id</code></td>
      <td>Unique identifier for each trip record.</td>
    </tr>
    <tr style="background-color: #f7fdf9;">
      <td><code>vendor_id</code></td>
      <td>Code for the trip provider (e.g., 1 or 2).</td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td><code>pickup_datetime</code></td>
      <td>Timestamp when the trip started (meter engaged).</td>
    </tr>
    <tr style="background-color: #f7fdf9;">
      <td><code>dropoff_datetime</code></td>
      <td>Timestamp when the trip ended (meter disengaged).</td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td><code>passenger_count</code></td>
      <td>Number of passengers in the cab, recorded by the driver.</td>
    </tr>
    <tr style="background-color: #f7fdf9;">
      <td><code>pickup_longitude</code> / <code>pickup_latitude</code></td>
      <td>Coordinates for where the trip began.</td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td><code>dropoff_longitude</code> / <code>dropoff_latitude</code></td>
      <td>Coordinates for where the trip ended.</td>
    </tr>
    <tr style="background-color: #f7fdf9;">
      <td><code>store_and_fwd_flag</code></td>
      <td>Indicates if data was stored due to no server connection: <code>Y</code> = stored, <code>N</code> = sent live.</td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td><code>trip_duration</code></td>
      <td>Total trip time in seconds (target variable).</td>
    </tr>
  </tbody>
</table>


In [14]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,629258,id2577727,1,2016-03-05 10:55:36,2016-03-05 11:12:16,1,-74.004593,40.741821,-73.971588,40.755577,N,1000
1,882834,id1467851,1,2016-04-02 20:02:54,2016-04-02 20:26:33,2,-73.994034,40.726749,-73.975769,40.770561,N,1419
2,1067278,id1937181,2,2016-01-11 07:41:39,2016-01-11 07:48:58,2,-73.930069,40.767254,-73.925339,40.752518,N,439
3,1061532,id0223311,1,2016-06-20 18:21:28,2016-06-20 18:28:48,1,-74.009727,40.720703,-74.007622,40.70322,N,440
4,676892,id0188577,1,2016-03-26 14:12:35,2016-03-26 14:19:55,3,-74.004265,40.742401,-73.988586,40.746269,N,440


In [15]:
df.shape

(50000, 12)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          50000 non-null  int64  
 1   id                  50000 non-null  object 
 2   vendor_id           50000 non-null  int64  
 3   pickup_datetime     50000 non-null  object 
 4   dropoff_datetime    50000 non-null  object 
 5   passenger_count     50000 non-null  int64  
 6   pickup_longitude    50000 non-null  float64
 7   pickup_latitude     50000 non-null  float64
 8   dropoff_longitude   50000 non-null  float64
 9   dropoff_latitude    50000 non-null  float64
 10  store_and_fwd_flag  50000 non-null  object 
 11  trip_duration       50000 non-null  int64  
dtypes: float64(4), int64(4), object(4)
memory usage: 4.6+ MB


In [17]:
df.columns.tolist()

['Unnamed: 0',
 'id',
 'vendor_id',
 'pickup_datetime',
 'dropoff_datetime',
 'passenger_count',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'store_and_fwd_flag',
 'trip_duration']

<div style="text-align: center; margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #17d85eff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  " 
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Missing and duplicate values
  </span>
</div>
<a id="import-data"></a>


In [18]:
df.isnull().sum()

Unnamed: 0            0
id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64

In [19]:
df.duplicated().sum()

np.int64(0)

In [20]:
for col in ['Unnamed: 0','id','dropoff_datetime','vendor_id'] :
    if col in df.columns:
        df.drop(columns=[col] , inplace=True)

<div style="text-align: center; margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #17d85eff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  " 
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Feature Engineering & EDA
  </span>
</div>
<a id="import-data"></a>


<div style="margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #1774d8ff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  "
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Time Features 
  </span>
</div>
<a id="import-data"></a>


In [21]:
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

# Basic time features
df['pickup_year'] = df['pickup_datetime'].dt.year
df['pickup_month'] = df['pickup_datetime'].dt.month
df['pickup_quarter'] = df['pickup_datetime'].dt.quarter
df['pickup_day'] = df['pickup_datetime'].dt.day
df['pickup_hour'] = df['pickup_datetime'].dt.hour
df['pickup_minute'] = df['pickup_datetime'].dt.minute
df['pickup_dayofweek'] = df['pickup_datetime'].dt.dayofweek
df['pickup_dayofyear'] = df['pickup_datetime'].dt.dayofyear
df['pickup_week_of_year'] = df['pickup_datetime'].dt.isocalendar().week.astype(int)

# Enhanced time categorization
df['pickup_is_weekend'] = (df['pickup_dayofweek'] >= 5).astype(int)
df['is_morning_rush'] = df['pickup_hour'].between(7, 9).astype(int)
df['is_evening_rush'] = df['pickup_hour'].between(16, 19).astype(int)
df['is_lunch_hour'] = df['pickup_hour'].between(11, 14).astype(int)
df['is_late_night'] = df['pickup_hour'].between(22, 23).astype(int)
df['is_early_morning'] = df['pickup_hour'].between(0, 6).astype(int)
df['is_rush_hour'] = ((df['is_morning_rush'] + df['is_evening_rush']) > 0).astype(int)
df['is_night'] = df['pickup_hour'].apply(lambda x: 1 if (x >= 22 or x <= 5) else 0)

# Day-specific patterns
df['is_monday'] = (df['pickup_dayofweek'] == 0).astype(int)
df['is_friday'] = (df['pickup_dayofweek'] == 4).astype(int)
df['is_sunday'] = (df['pickup_dayofweek'] == 6).astype(int)

# Time interactions
df['weekend_rush_hour'] = ((df['pickup_is_weekend'] == 1) & (df['is_rush_hour'] == 1)).astype(int)
df['weekday_night'] = ((df['pickup_is_weekend'] == 0) & (df['is_night'] == 1)).astype(int)
df['weekend_night'] = ((df['pickup_is_weekend'] == 1) & (df['is_night'] == 1)).astype(int)

# Cyclical encoding for time features
df['hour_sin'] = np.sin(2 * np.pi * df['pickup_hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['pickup_hour'] / 24)
df['day_sin'] = np.sin(2 * np.pi * df['pickup_dayofweek'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['pickup_dayofweek'] / 7)
df['month_sin'] = np.sin(2 * np.pi * df['pickup_month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['pickup_month'] / 12)
df['minute_sin'] = np.sin(2 * np.pi * df['pickup_minute'] / 60)
df['minute_cos'] = np.cos(2 * np.pi * df['pickup_minute'] / 60)

# Quarter hour features
df['quarter_hour'] = (df['pickup_minute'] // 15).astype(int)
df['quarter_hour_sin'] = np.sin(2 * np.pi * df['quarter_hour'] / 4)
df['quarter_hour_cos'] = np.cos(2 * np.pi * df['quarter_hour'] / 4)
# Drop original datetime column
df.drop(columns=['pickup_datetime'], inplace=True)

<div style="margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #1774d8ff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  "
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Time Features  (EDA)
  </span>
</div>
<a id="import-data"></a>


In [2]:
# import plotly.express as px
# green_shades = ['#006400', '#228B22', '#32CD32', '#90EE90']
# df_monthly_avg = df.groupby(['pickup_year', 'pickup_month'])['trip_duration'].mean().reset_index()

# fig = px.line(
#     df_monthly_avg,
#     x='pickup_month',
#     y='trip_duration',
#     color='pickup_year',
#     template='plotly_dark',
#     title='Average Trip Duration by Month and Year',
#     labels={
#         'pickup_month': 'Month',
#         'trip_duration': 'Average Trip Duration (seconds)',
#         'pickup_year': 'Year'
#     },
#     color_discrete_sequence=green_shades
# )

# fig.update_layout(title_x=0.5)
# # print(df_monthly_avg)
# fig.show()



![Plot Image](figures/newplot.png)


| Year | Month | Trip Duration (seconds) |
|-------|-------|------------------------|
| 2016  | 1     | 948                    |
| 2016  | 2     | 890                    |
| 2016  | 3     | 949                    |
| 2016  | 4     | 934                    |
| 2016  | 5     | 1035                   |
| 2016  | 6     | 1034                   |

### `Key insights`:

- Trip times are almost steady from January to April (around 890 to 950 seconds).
- Trip times increase noticeably in May and June (around 1035 seconds).
- All data is from the year 2016.
- The increase might be due to seasonal changes or more traffic in May and June.


In [3]:
# # Aggregate average trip duration by year and quarter
# df_quarterly_avg = df.groupby(['pickup_year', 'pickup_quarter'])['trip_duration'].mean().reset_index()

# fig = px.line(
#     df_quarterly_avg,
#     x='pickup_quarter',
#     y='trip_duration',
#     color='pickup_year',
#     template='plotly_dark',
#     title='Average Trip Duration by Quarter and Year',
#     labels={
#         'pickup_quarter': 'Quarter',
#         'trip_duration': 'Average Trip Duration (seconds)',
#         'pickup_year': 'Year'
#     },
#     color_discrete_sequence=px.colors.qualitative.Set2
# )

# fig.update_layout(title_x=0.5)
# # print(df_quarterly_avg)
# fig.show()


![Plot Image](figures/newplot_1.png)



| Year | Quarter | Average Trip Duration (seconds) |
|-------|---------|--------------------------------|
| 2016  | 1       | 929.36                         |
| 2016  | 2       | 999.70                         |

### `Key Insights` :

- The average trip duration increased from about **929 seconds** in the first quarter to about **1000 seconds** in the second quarter of 2016.


In [4]:
# # Aggregate average trip duration by year and day of month
# df_day_avg = df.groupby(['pickup_year', 'pickup_day'])['trip_duration'].mean().reset_index()

# fig = px.line(
#     df_day_avg,
#     x='pickup_day',
#     y='trip_duration',
#     color='pickup_year',
#     template='plotly_dark',
#     title='Average Trip Duration by Day of Month and Year',
#     labels={
#         'pickup_day': 'Day of Month',
#         'trip_duration': 'Average Trip Duration (seconds)',
#         'pickup_year': 'Year'
#     },
#     color_discrete_sequence=px.colors.qualitative.Pastel
# )

# fig.update_layout(title_x=0.5)
# # print(df_day_avg)
# fig.show()


![Plot Image](figures/newplot_2.png)



| Day | Trip Duration (seconds) |
|------|------------------------|
| 1    | 837                    |
| 2    | 1086                   |
| 3    | 1036                   |
| ...  | ...                    |
| 31   | 925                    |

### `Key Insights` :

- Trip times change a lot from day to day.
- Some days have longer trips (over 1000 seconds).
- Some days have shorter trips (around 840 seconds).
- These changes could be because of traffic or other daily factors.
- Data is only for January 2016.



In [5]:

# rush_hours_df = df.groupby(['is_morning_rush', 'is_evening_rush']).size().reset_index(name='count')

# rush_hours_df['Morning Label'] = rush_hours_df['is_morning_rush'].apply(lambda x: 'Morning Rush' if x == 1 else 'Non-Morning')
# rush_hours_df['Evening Label'] = rush_hours_df['is_evening_rush'].apply(lambda x: 'Evening Rush' if x == 1 else 'Non-Evening')

# color_map = {
#     'Evening Rush': '#32a885',   
#     'Non-Evening': '#0dc6de'     
# }

# fig = px.bar(
#     rush_hours_df,
#     x='Morning Label',
#     y='count',
#     color='Evening Label',
#     barmode='group',
#     title='Trip Counts: Morning Rush vs Evening Rush',
#     labels={'Morning Label': 'Time Window', 'count': 'Number of Trips', 'Evening Label': 'Evening Rush'},
#     color_discrete_map=color_map,
#     template='plotly_dark'
# )
# fig.update_layout(title_x=0.5)
# # print(rush_hours_df)
# fig.show()



![Plot Image](figures/newplot_3.png)




| is_morning_rush | is_evening_rush | Count  | Morning Label | Evening Label  |
|-----------------|-----------------|--------|---------------|----------------|
| 0               | 0               | 32468  | Non-Morning   | Non-Evening    |
| 0               | 1               | 10967  | Non-Morning   | Evening Rush   |
| 1               | 0               | 6565   | Morning Rush  | Non-Evening    |


### `Key Points:`

- Most trips (32,468) happen outside of rush hours (neither morning nor evening rush).
- Evening rush trips account for 10,967 trips.
- Morning rush trips are 6,565 trips.
- This shows evening rush trips are more frequent than morning rush trips in the data.


In [6]:
# time_periods = ['is_late_night', 'is_early_morning', 'is_rush_hour', 'is_night']
# rows = []

# for period in time_periods:
#     avg_duration = df[df[period] == 1]['trip_duration'].mean()
#     rows.append({'Time Period': period, 'Avg Trip Duration': avg_duration})

# df_time_summary = pd.DataFrame(rows)

# colors = ['#0ccc33', '#00fca4', '#09e0c4', '#12b6db']

# fig = px.bar(
#     df_time_summary,
#     x='Time Period',
#     y='Avg Trip Duration',
#     title='Average Trip Duration by Time Period',
#     labels={'Avg Trip Duration': 'Average Trip Duration (seconds)'},
#     color='Time Period',
#     color_discrete_sequence=colors,
#     template='plotly_dark'  
# )

# fig.update_layout(title_x=0.5)
# # print(df_time_summary)
# fig.show()


![Plot Image](figures/newplot_4.png)



| Time Period       | Average Trip Duration (seconds) |
|-------------------|--------------------------------|
| Late Night        | 965.12                         |
| Early Morning     | 964.89                         |
| Rush Hour         | 929.88                         |
| Night             | 974.79                         |

### `Key Points`:

- Trips during **Rush Hour** have the shortest average duration (~930 seconds).
- Trips during **Night** and **Late Night** have longer average durations (~965 to 975 seconds).
- Early Morning trips have a similar average duration to Late Night trips (~965 seconds).
- This suggests trips during rush hour might be shorter due to possibly shorter distances or heavier traffic affecting trip duration differently.


In [7]:


# df['pickup_is_weekend'] = df['pickup_dayofweek'] >= 5

# rows = []
# for is_weekend in [False, True]:
#     count = df[df['pickup_is_weekend'] == is_weekend].shape[0]
#     label = 'Weekend' if is_weekend else 'Weekday'
#     rows.append({'Day Type': label, 'Number of Trips': count})

# df_weekend_summary = pd.DataFrame(rows)

# colors = ['#4bdb12', '#12db9f']  

# fig = px.bar(
#     df_weekend_summary,
#     x='Day Type',
#     y='Number of Trips',
#     title='Number of Trips: Weekday vs Weekend',
#     labels={'Number of Trips': 'Number of Trips', 'Day Type': 'Day Type'},
#     color='Day Type',
#     color_discrete_sequence=colors,
#     template='plotly_dark'
# )

# fig.update_layout(title_x=0.5)

# # print(df_weekend_summary)
# fig.show()


![Plot Image](figures/newplot_5.png)



| Day Type | Number of Trips |
|----------|-----------------|
| Weekday  | 35,756          |
| Weekend  | 14,244          |

### `Key Points`:

- Most trips happen on **weekdays** (35,756 trips).
- Fewer trips occur on **weekends** (14,244 trips).
- This shows that weekdays have about 2.5 times more trips than weekends.


In [8]:
# from matplotlib.colors import LinearSegmentedColormap, rgb2hex

# days = [0,1,2,3,4,5,6]
# day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# rows = []
# for day in days:
#     avg_duration = df[df['pickup_dayofweek'] == day]['trip_duration'].mean()
#     rows.append({'Day': day_names[day], 'Avg Trip Duration': avg_duration})

# df_day_avg = pd.DataFrame(rows)

# cmap = LinearSegmentedColormap.from_list("custom_cmap", ['#77f25e', '#5ef2f2'])
# colors = [rgb2hex(cmap(i)) for i in np.linspace(0, 1, 7)]

# fig = px.bar(
#     df_day_avg,
#     x='Day',
#     y='Avg Trip Duration',
#     title='Average Trip Duration by Day of Week',
#     labels={'Avg Trip Duration': 'Average Trip Duration (seconds)', 'Day': 'Day of Week'},
#     color='Day',
#     color_discrete_sequence=colors,
#     template='plotly_dark'
# )

# fig.update_layout(title_x=0.5)
# # print(df_day_avg)
# fig.show()


![Plot Image](figures/newplot_6.png)



| Day       | Average Trip Duration (seconds) |
|-----------|---------------------------------|
| Monday    | 827.10                          |
| Tuesday   | 984.30                          |
| Wednesday | 1017.09                         |
| Thursday  | 994.76                          |
| Friday    | 985.17                          |
| Saturday  | 971.54                          |
| Sunday    | 953.17                          |

### `Key Points`:

- **Monday** has the shortest average trip duration (~827 seconds).
- Trip duration increases from Tuesday to Wednesday, peaking on Wednesday (~1017 seconds).
- Duration slightly decreases towards the weekend but remains above 950 seconds.
- Weekdays generally have longer trips compared to Monday.


In [9]:

# fig = px.scatter(
#     df,
#     x='hour_sin',
#     y='hour_cos',
#     color='trip_duration',
#     title='Cyclical Hour Encoding vs Trip Duration',
#     labels={'hour_sin': 'Hour (sin)', 'hour_cos': 'Hour (cos)', 'trip_duration': 'Trip Duration (sec)'},
#     color_continuous_scale=px.colors.sequential.Plasma,
#     hover_data={'trip_duration': ':.2f', 'hour_sin': False, 'hour_cos': False}  
# )

# fig.update_traces(marker=dict(size=7, opacity=0.7))  
# fig.update_layout(
#     title_x=0.5,
#     xaxis_title='Hour (sin)',
#     yaxis_title='Hour (cos)',
#     template='plotly_dark',
    
# )
# fig.show()


![Plot Image](figures/newplot_7.png)


In [10]:
# fig = px.box(
#     df,
#     x='quarter_hour',
#     y='trip_duration',
#     title='Trip Duration Distribution by Quarter Hour',
#     labels={'quarter_hour': 'Quarter Hour Segment', 'trip_duration': 'Trip Duration (seconds)'},
#     color='quarter_hour',
#     color_discrete_sequence=px.colors.qualitative.Pastel
# )
 
# fig.update_layout(
#     title_x=0.5,
#     xaxis_title='Quarter Hour Segment',
#     yaxis_title='Trip Duration (seconds)',
#     template='plotly_dark'
# )
# fig.show()


![Plot Image](figures/newplot_8.png)


<div style="margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #1774d8ff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  "
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Distance & Location Features
  </span>
</div>
<a id="import-data"></a>


In [31]:

# Basic distance calculations
df['delta_longitude'] = df['dropoff_longitude'] - df['pickup_longitude']
df['delta_latitude'] = df['dropoff_latitude'] - df['pickup_latitude']
df['euclidean_distance'] = np.sqrt(df['delta_longitude']**2 + df['delta_latitude']**2)
df['manhattan_distance'] = np.abs(df['delta_longitude']) + np.abs(df['delta_latitude'])




def haversine_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    R = 6371  # Earth's radius in kilometers
    
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    
    a = (np.sin(dphi/2)**2 + 
            np.cos(phi1) * np.cos(phi2) * np.sin(dlambda/2)**2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    
    return R * c


# Haversine distance
df['haversine_distance'] = haversine_distance(
    df['pickup_latitude'], df['pickup_longitude'],
    df['dropoff_latitude'], df['dropoff_longitude']
)

# Distance transformations and interactions
df['haversine_distance_sq'] = df['haversine_distance'] ** 2
df['haversine_distance_cube'] = df['haversine_distance'] ** 3
df['log_haversine_distance'] = np.log1p(df['haversine_distance'])
df['sqrt_haversine_distance'] = np.sqrt(df['haversine_distance'])
df['inv_haversine_distance'] = 1 / (df['haversine_distance'] + 1e-6)

# Distance ratios and relationships
df['euclid_haversine_ratio'] = df['euclidean_distance'] / (df['haversine_distance'] + 1e-6)
df['manhattan_haversine_ratio'] = df['manhattan_distance'] / (df['haversine_distance'] + 1e-6)
df['manhattan_euclidean_ratio'] = df['manhattan_distance'] / (df['euclidean_distance'] + 1e-6)

# Bearing and direction features
df['bearing'] = np.degrees(np.arctan2(df['delta_longitude'], df['delta_latitude']))
df['bearing_sin'] = np.sin(np.radians(df['bearing']))
df['bearing_cos'] = np.cos(np.radians(df['bearing']))

# Enhanced direction features
df['trip_direction'] = (df['bearing'] + 360) % 360
df['trip_dir_sin'] = np.sin(np.radians(df['trip_direction']))
df['trip_dir_cos'] = np.cos(np.radians(df['trip_direction']))

# Directional categories
df['is_north_bound'] = (df['delta_latitude'] > 0).astype(int)
df['is_south_bound'] = (df['delta_latitude'] < 0).astype(int)
df['is_east_bound'] = (df['delta_longitude'] > 0).astype(int)
df['is_west_bound'] = (df['delta_longitude'] < 0).astype(int)

# Absolute deltas with more features
df['abs_delta_longitude'] = df['delta_longitude'].abs()
df['abs_delta_latitude'] = df['delta_latitude'].abs()
df['delta_ratio'] = df['abs_delta_latitude'] / (df['abs_delta_longitude'] + 1e-6)

# Speed estimation features
df['estimated_speed_40'] = df['haversine_distance'] / (40/3600)  # Assume 40 km/h
df['estimated_speed_30'] = df['haversine_distance'] / (30/3600)  # Assume 30 km/h
df['estimated_speed_50'] = df['haversine_distance'] / (50/3600)  # Assume 50 km/h



In [32]:
NYC_LANDMARKS = {
    'times_square': {'lat': 40.7580, 'lon': -73.9855},
    'central_park': {'lat': 40.7812, 'lon': -73.9665},
    'brooklyn_bridge': {'lat': 40.7061, 'lon': -73.9969},
    'wall_street': {'lat': 40.7074, 'lon': -74.0113},
    'empire_state': {'lat': 40.7484, 'lon': -73.9857}
}
NYC_CENTER = {'lat': 40.7580, 'lon': -73.9855}
JFK_AIRPORT = {'lat': 40.6413, 'lon': -73.7781}
LGA_AIRPORT = {'lat': 40.7769, 'lon': -73.8740}
EWR_AIRPORT = {'lat': 40.6895, 'lon': -74.1745}

# Distance to NYC center
df['pickup_center_distance'] = haversine_distance(
    df['pickup_latitude'], df['pickup_longitude'],
    NYC_CENTER['lat'], NYC_CENTER['lon']
)
df['dropoff_center_distance'] = haversine_distance(
    df['dropoff_latitude'], df['dropoff_longitude'],
    NYC_CENTER['lat'], NYC_CENTER['lon']
)
df['total_center_distance'] = df['pickup_center_distance'] + df['dropoff_center_distance']
df['center_distance_diff'] = df['dropoff_center_distance'] - df['pickup_center_distance']

# Distance to airports
for airport_name, coords in [('JFK', JFK_AIRPORT), ('LGA', LGA_AIRPORT), ('EWR', EWR_AIRPORT)]:
    df[f'dist_pickup_{airport_name}'] = haversine_distance(
        df['pickup_latitude'], df['pickup_longitude'],
        coords['lat'], coords['lon']
    )
    df[f'dist_dropoff_{airport_name}'] = haversine_distance(
        df['dropoff_latitude'], df['dropoff_longitude'],
        coords['lat'], coords['lon']
    )
    df[f'is_pickup_near_{airport_name}'] = (df[f'dist_pickup_{airport_name}'] < 2).astype(int)
    df[f'is_dropoff_near_{airport_name}'] = (df[f'dist_dropoff_{airport_name}'] < 2).astype(int)
    df[f'airport_{airport_name}_trip'] = (df[f'is_pickup_near_{airport_name}'] | df[f'is_dropoff_near_{airport_name}']).astype(int)

# Distance to landmarks
for landmark_name, coords in NYC_LANDMARKS.items():
    df[f'pickup_dist_{landmark_name}'] = haversine_distance(
        df['pickup_latitude'], df['pickup_longitude'],
        coords['lat'], coords['lon']
    )
    df[f'dropoff_dist_{landmark_name}'] = haversine_distance(
        df['dropoff_latitude'], df['dropoff_longitude'],
        coords['lat'], coords['lon']
    )

# Borough/zone approximations (simplified)
df['pickup_is_manhattan'] = (
    (df['pickup_latitude'].between(40.700, 40.800)) & 
    (df['pickup_longitude'].between(-74.020, -73.930))
).astype(int)

df['dropoff_is_manhattan'] = (
    (df['dropoff_latitude'].between(40.700, 40.800)) & 
    (df['dropoff_longitude'].between(-74.020, -73.930))
).astype(int)

df['manhattan_trip'] = (df['pickup_is_manhattan'] & df['dropoff_is_manhattan']).astype(int)
df['to_manhattan'] = ((~df['pickup_is_manhattan'].astype(bool)) & df['dropoff_is_manhattan'].astype(bool)).astype(int)
df['from_manhattan'] = (df['pickup_is_manhattan'].astype(bool) & (~df['dropoff_is_manhattan'].astype(bool))).astype(int)



<div style="margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #1774d8ff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  "
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Distance & Location Features (EDA)
  </span>
</div>
<a id="import-data"></a>


In [11]:
# fig = px.scatter(
#     df,
#     x='haversine_distance',
#     y='trip_duration',
#     color='trip_duration',  
#     size_max=10,
#     title='Trip Duration vs. Haversine Distance',
#     labels={
#         'haversine_distance': 'Haversine Distance (km)',
#         'trip_duration': 'Trip Duration (sec)'
#     },
#     opacity=0.6,
#     color_continuous_scale=px.colors.sequential.Turbo,  
#     hover_data={'trip_duration': ':.2f', 'haversine_distance': ':.2f'}
# )

# fig.update_traces(marker=dict(size=7))
# fig.update_layout(
#     template='plotly_dark',
#     title_x=0.5,
#     xaxis_title='Haversine Distance (km)',
#     yaxis_title='Trip Duration (sec)'
# )

# fig.show()


![Plot Image](figures/newplot_9.png)


In [12]:
# df['distance_ratio'] = df['manhattan_distance'] / df['euclidean_distance']

# fig = px.histogram(
#     df,
#     x='distance_ratio',
#     nbins=50,
#     title='Distribution of Manhattan-to-Euclidean Distance Ratio',
#     labels={'distance_ratio': 'Manhattan / Euclidean Ratio'},
#     color_discrete_sequence=['#12db9f'],
#     opacity=0.8,
# )

# fig.update_layout(
#     template='plotly_dark',
#     title_x=0.5,
#     xaxis_title='Manhattan / Euclidean Ratio',
#     yaxis_title='Count',
#     bargap=0.05
# )

# fig.show()


![Plot Image](figures/newplot_10.png)


In [13]:

# fig = px.scatter(
#     df, 
#     x='bearing', 
#     y='trip_duration',
#     title='Trip Duration vs. Bearing (Direction)',
#     labels={
#         'bearing': 'Trip Direction (degrees)',
#         'trip_duration': 'Trip Duration (sec)'
#     },
#     opacity=0.6,
#     color='trip_duration',  
#     color_continuous_scale='Turbo', 
#     hover_data={'trip_duration': ':.2f', 'bearing': ':.2f'}
# )

# fig.update_traces(marker=dict(size=5))
# fig.update_layout(
#     template='plotly_dark',
#     title_x=0.5,
#     xaxis_title='Trip Direction (degrees)',
#     yaxis_title='Trip Duration (sec)'
# )

# fig.show()


![Plot Image](figures/newplot_11.png)


In [14]:

# def calculate_bearing(lat1, lon1, lat2, lon2):
#     lat1 = np.radians(lat1)
#     lat2 = np.radians(lat2)
#     diff_long = np.radians(lon2 - lon1)
#     x = np.sin(diff_long) * np.cos(lat2)
#     y = np.cos(lat1) * np.sin(lat2) - (np.sin(lat1) * np.cos(lat2) * np.cos(diff_long))
#     initial_bearing = np.degrees(np.arctan2(x, y))
#     compass_bearing = (initial_bearing + 360) % 360
#     return compass_bearing

# df['bearing'] = calculate_bearing(
#     df['pickup_latitude'], df['pickup_longitude'],
#     df['dropoff_latitude'], df['dropoff_longitude']
# )

# bins = [i * 30 for i in range(13)]  
# labels = [f'{i}°' for i in bins[:-1]]  

# df['bearing_bin'] = pd.cut(df['bearing'], bins=bins, labels=labels, include_lowest=True, right=False)

# df_bearing_avg = df.groupby('bearing_bin', observed=True)['trip_duration'].mean().reset_index()

# df_bearing_avg['bearing_bin'] = pd.Categorical(
#     df_bearing_avg['bearing_bin'], 
#     categories=labels, 
#     ordered=True
# )

# fig = px.line_polar(
#     df_bearing_avg,
#     r='trip_duration',
#     theta='bearing_bin',
#     line_close=True,
#     title='Average Trip Duration by Direction',
#     labels={'trip_duration': 'Avg Duration (sec)', 'bearing_bin': 'Direction'},
#     color_discrete_sequence=['#00cc96']
# )

# fig.update_traces(line=dict(width=3), marker=dict(size=7, color='white'))  
# fig.update_layout(
#     polar=dict(
#         radialaxis=dict(showticklabels=True, ticks='', gridcolor='gray'),
#         angularaxis=dict(direction="clockwise", rotation=90)
#     ),
#     template='plotly_dark',
#     title_x=0.5
# )
# # print(df_bearing_avg)
# fig.show()


![Plot Image](figures/newplot_12.png)


#### `Average Trip Duration by Bearing Bin`

| Bearing Bin | Average Trip Duration (seconds) |
|-------------|----------------------------------|
| 0°          | 901.43                           |
| 30°         | 860.83                           |
| 60°         | 1039.41                          |
| 90°         | 1013.47                          |
| 120°        | 1041.37                          |
| 150°        | 971.91                           |
| 180°        | 855.84                           |
| 210°        | 990.81                           |
| 240°        | 1201.99                          |
| 270°        | 1000.55                          |
| 300°        | 1215.43                          |
| 330°        | 932.93                           |

- The **longest durations** occur at **300° (1215 s)** and **240° (1202 s)**.
- The **shortest durations** are at **180° (856 s)** and **30° (861 s)**.
- This might reflect traffic or directional patterns in the city layout.


In [15]:
# fig = px.box(
#     df,
#     x='airport_JFK_trip',  #
#     y='trip_duration',
#     title='Trip Duration for JFK-related Trips vs Others',
#     labels={'airport_JFK_trip': 'JFK Trip (1=Yes, 0=No)', 'trip_duration': 'Duration (sec)'},
#     color='airport_JFK_trip',
#     color_discrete_map={1: '#EF553B', 0: '#636EFA'},  
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     xaxis_title='JFK Trip',
#     yaxis_title='Trip Duration (sec)',
#     legend_title_text='Is JFK Trip?'
# )

# fig.show()


![Plot Image](figures/newplot_13.png)


In [16]:

# df['is_early_morning'] = df['pickup_hour'].between(0, 6).astype(int)
# df['is_late_night'] = df['pickup_hour'].between(22, 23).astype(int)

# early_morning_airport = df.groupby('airport_JFK_trip')['is_early_morning'].mean().reset_index()
# late_night_airport = df.groupby('airport_JFK_trip')['is_late_night'].mean().reset_index()

# airport_times = pd.DataFrame({
#     'Airport Trip': ['No', 'Yes'] * 2,
#     'Time Period': ['Early Morning'] * 2 + ['Late Night'] * 2,
#     'Proportion': list(early_morning_airport['is_early_morning']) + list(late_night_airport['is_late_night'])
# })

# colors = {'No': '#636EFA', 'Yes': '#EF553B'}  

# fig = px.bar(
#     airport_times,
#     x='Time Period',
#     y='Proportion',
#     color='Airport Trip',
#     barmode='group',
#     title='Proportion of JFK Trips During Early Morning and Late Night',
#     color_discrete_map=colors,
#     labels={'Proportion': 'Proportion of Trips', 'Time Period': 'Time Period'},
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     yaxis=dict(tickformat=".0%"), 
#     legend_title_text='JFK Trip',
#     font=dict(size=14)
# )
# # print(airport_times)
# fig.show()


![Plot Image](figures/newplot_14.png)


### `Airport Trips by Time of Day`

| Airport Trip | Time Period    | Proportion |
|--------------|---------------|------------|
| No           | Early Morning | 14%        |
| Yes          | Early Morning | 15%        |
| No           | Late Night    | 10%        |
| Yes          | Late Night    | 9%         |

- In the early morning, airport trips are about 15% of trips.
- At late night, airport trips are about 9% of trips.
- Airport trips happen a bit more in early morning than late night.


In [39]:
airport_summary = df.groupby('airport_JFK_trip').agg({
    'trip_duration': 'mean',
    'haversine_distance': 'mean'
}).reset_index()

airport_summary['airport_JFK_trip'] = airport_summary['airport_JFK_trip'].map({0: 'No', 1: 'Yes'})
# print(airport_summary)


### `JFK Airport Trips vs Others`

| Airport JFK Trip | Average Trip Duration (seconds) | Average Distance (km) |
|------------------|---------------------------------|-----------------------|
| No               | 917                             | 3.0                   |
| Yes              | 2624                            | 18.1                  |

#### `Key Points:`

- Trips **to/from JFK airport** are much longer in duration (about 2624 seconds) compared to non-JFK trips (about 917 seconds).
- JFK trips also cover a much greater average distance (~18 km vs ~3 km).

In [17]:
# colors = {'Long Trips': '#EF553B', 'Short Trips': '#636EFA'}  

# # Define long trips threshold, e.g., top 25% duration
# long_trip_threshold = df['trip_duration'].quantile(0.75)
# df['is_long_trip'] = (df['trip_duration'] >= long_trip_threshold).astype(int)

# landmark_cols = [col for col in df.columns if col.startswith('pickup_dist_') or col.startswith('dropoff_dist_')]

# landmark_freq_long = {}
# landmark_freq_short = {}

# for col in landmark_cols:
#     near_col = (df[col] < 0.5).astype(int)
#     landmark_freq_long[col] = near_col[df['is_long_trip'] == 1].sum()
#     landmark_freq_short[col] = near_col[df['is_long_trip'] == 0].sum()

# import pandas as pd
# freq_df = pd.DataFrame({
#     'Landmark': [col.replace('pickup_dist_', '').replace('dropoff_dist_', '').replace('_', ' ').title() for col in landmark_cols],
#     'Long Trips': list(landmark_freq_long.values()),
#     'Short Trips': list(landmark_freq_short.values())
# })


# fig = px.bar(
#     freq_df.melt(id_vars='Landmark', value_vars=['Long Trips', 'Short Trips'], var_name='Trip Type', value_name='Count'),
#     x='Landmark',
#     y='Count',
#     color='Trip Type',
#     barmode='group',
#     title='Landmark Frequencies in Long vs Short Trips',
#     color_discrete_map=colors,
#     labels={'Count': 'Number of Trips', 'Landmark': 'Landmark', 'Trip Type': 'Trip Type'},
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     xaxis_tickangle=-45,
#     font=dict(size=13),
    
#     legend_title_text='Trip Type'
# )
# # print(freq_df)
# fig.show()


![Plot Image](figures/newplot_15.png)


### `Landmark Trip Counts`

| Landmark         | Long Trips | Short Trips |
|------------------|------------|-------------|
| Times Square     | 576, 508   | 1787, 1483  |
| Central Park     | 75, 43     | 200, 177    |
| Brooklyn Bridge  | 2, 10      | 9, 18       |
| Wall Street      | 283, 288   | 427, 482    |
| Empire State     | 397, 390   | 1552, 1366  |


<!-- - Each landmark appears twice, likely from different time periods or categories. -->
- Times Square and Empire State have the highest counts for both long and short trips.
- Brooklyn Bridge has the lowest counts overall.


In [18]:
# colors = {
#     'Other': '#636EFA',            
#     'Intra-Manhattan': '#EF553B', 
#     'To Manhattan': '#00CC96',    
#     'From Manhattan': '#AB63FA'   
# }

# df['zone_type'] = 'Other'
# df.loc[df['manhattan_trip'] == 1, 'zone_type'] = 'Intra-Manhattan'
# df.loc[df['to_manhattan'] == 1, 'zone_type'] = 'To Manhattan'
# df.loc[df['from_manhattan'] == 1, 'zone_type'] = 'From Manhattan'

# fig = px.box(
#     df,
#     x='zone_type',
#     y='trip_duration',
#     title='Trip Duration by Zone Type',
#     labels={'zone_type': 'Trip Zone Type', 'trip_duration': 'Trip Duration (sec)'},
#     color='zone_type',
#     color_discrete_map=colors,
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     xaxis_title='Trip Zone Type',
#     yaxis_title='Trip Duration (seconds)',
#     legend_title_text='Zone Type',
#     font=dict(size=14)
# )

# fig.show()


![Plot Image](figures/newplot_16.png)


In [19]:

# colors = {
#     'Other': '#636EFA',            
#     'Intra-Manhattan': '#EF553B', 
#     'To Manhattan': '#00CC96',    
#     'From Manhattan': '#AB63FA'   
# }
# df['estimated_speed'] = df['haversine_distance'] / (df['trip_duration'] / 3600)  # km/h

# speed_avg = df.groupby('zone_type')['estimated_speed'].mean().reset_index()
# fig = px.bar(
#     speed_avg,
#     x='zone_type',
#     y='estimated_speed',
#     title='Average Estimated Speed by Zone Type',
#     labels={'zone_type': 'Trip Zone Type', 'estimated_speed': 'Speed (km/h)'},
#     color='zone_type',
#     color_discrete_map=colors,
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     xaxis_title='Trip Zone Type',
#     yaxis_title='Average Estimated Speed (km/h)',
#     legend_title_text='Zone Type',
#     font=dict(size=14)
# )
# # print(speed_avg)
# fig.show()


![Plot Image](figures/newplot_17.png)


In [20]:
# colors = {
#     'Other': '#636EFA',
#     'Intra-Manhattan': '#EF553B',
#     'To Manhattan': '#00CC96',
#     'From Manhattan': '#AB63FA'
# }
# zone_counts = df['zone_type'].value_counts().reset_index()
# zone_counts.columns = ['zone_type', 'count']

# fig = px.pie(
#     zone_counts,
#     values='count',
#     names='zone_type',
#     title='Proportion of Trips by Zone Type',
#     color='zone_type',
#     color_discrete_map=colors,
#     template='plotly_dark',
#     hole=0.3  
# )

# fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=14)

# fig.update_layout(title_x=0.5)
# # print(zone_counts)
# fig.show()



![Plot Image](figures/newplot_18.png)


### `Average Estimated Speed by Zone Type`

| Zone Type         | Average Speed (km/h) |
|-------------------|----------------------|
| From Manhattan    | 21.74                |
| Intra-Manhattan   | 12.84                |
| Other             | 20.03                |
| To Manhattan      | 21.26                |

#### `Key Points:`

- Trips **from Manhattan** and **to Manhattan** have the highest average speeds (~21 km/h).
- Trips **within Manhattan** are slower, averaging about 12.8 km/h.
- Trips in **other zones** average around 20 km/h.
- This suggests traffic or conditions inside Manhattan slow down trips compared to traveling in/out of Manhattan.


In [21]:
# # Assume Haversine distance in km is already calculated
# df['duration_30_kph'] = (df['haversine_distance'] / 30) * 3600  # seconds
# df['duration_40_kph'] = (df['haversine_distance'] / 40) * 3600
# df['duration_50_kph'] = (df['haversine_distance'] / 50) * 3600

# # Melt the DataFrame for plotting
# speed_df = df[['trip_duration', 'duration_30_kph', 'duration_40_kph', 'duration_50_kph']].sample(5000)
# melted = speed_df.melt(var_name='Assumption', value_name='Duration')
# colors = {
#     'trip_duration': '#12db9f',       # Actual
#     'duration_30_kph': '#636efa',     # 30 km/h
#     'duration_40_kph': '#ef553b',     # 40 km/h
#     'duration_50_kph': '#ab63fa'      # 50 km/h
# }

# fig = px.box(
#     melted,
#     x='Assumption',
#     y='Duration',
#     title='Actual vs. Estimated Trip Durations (30/40/50 km/h)',
#     labels={
#         'Duration': 'Trip Duration (seconds)',
#         'Assumption': 'Speed Assumption'
#     },
#     color='Assumption',
#     color_discrete_map=colors,
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     # boxmode='group'
# )
# fig.show()


![Plot Image](figures/newplot_19.png)


In [22]:
# colors = {
#     'Intra-Manhattan': '#12db9f',
#     'To Manhattan': '#3bd9c1',
#     'From Manhattan': '#5ef2f2',
#     'Other': '#1774d8'
# }

# fig = px.box(
#     df,
#     x='zone_type',
#     y='estimated_speed',
#     title='Estimated Speed by Zone Type',
#     labels={
#         'zone_type': 'Trip Zone Type',
#         'estimated_speed': 'Speed (km/h)'
#     },
#     color='zone_type',
#     color_discrete_map=colors,
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     # boxmode='group'
# )

# fig.show()


![Plot Image](figures/newplot_20.png)


In [23]:
# import plotly.figure_factory as ff

# speed_data = [
#     df[df['zone_type'] == 'Intra-Manhattan']['estimated_speed'].dropna(),
#     df[df['zone_type'] == 'To Manhattan']['estimated_speed'].dropna(),
#     df[df['zone_type'] == 'From Manhattan']['estimated_speed'].dropna()
# ]

# group_labels = ['Intra-Manhattan', 'To Manhattan', 'From Manhattan']

# colors = ['#12db9f', '#ab63fa', '#1774d8']

# fig = ff.create_distplot(
#     speed_data,
#     group_labels,
#     show_hist=False,
#     colors=colors
# )

# fig.update_layout(
#     title='Distribution of Estimated Speeds by Zone',
#     title_x=0.5,
#     template='plotly_dark',
#     xaxis_title='Estimated Speed (km/h)',
#     yaxis_title='Density',
#     legend_title='Zone Type'
# )
# # print(speed_data)
# fig.show()


![Plot Image](figures/newplot_21.png)


In [24]:
# df['log_haversine'] = np.log1p(df['haversine_distance'])  
# df['sqrt_haversine'] = np.sqrt(df['haversine_distance'])
# df['inv_haversine'] = 1 / (df['haversine_distance'] + 1e-6)  

# fig = px.scatter(
#     df.sample(3000),
#     x='log_haversine',
#     y='trip_duration',
#     trendline='ols',
#     title='Log(Haversine Distance) vs. Trip Duration',
#     labels={
#         'log_haversine': 'Log(Haversine Distance)',
#         'trip_duration': 'Trip Duration (seconds)'
#     },
#     color_discrete_sequence=['#1fa5f9']
# )

# fig.update_layout(
#     template='plotly_dark',
#     title_x=0.5,
#     xaxis_title='Log(Haversine Distance)',
#     yaxis_title='Trip Duration (seconds)',
#     showlegend=False
# )

# fig.show()


![Plot Image](figures/newplot_22.png)


In [25]:

# df['euclidean_haversine_ratio'] = df['euclidean_distance'] / (df['haversine_distance'] + 1e-6)

# fig = px.box(
#     df,
#     x='zone_type',
#     y='euclidean_haversine_ratio',
#     title='Euclidean / Haversine Ratio by Zone Type',
#     labels={
#         'euclidean_haversine_ratio': 'Euclidean / Haversine Ratio',
#         'zone_type': 'Trip Zone Type'
#     },
#     color='zone_type',
#     color_discrete_sequence=px.colors.qualitative.Set1
# )

# fig.update_layout(
#     template='plotly_dark',
#     title_x=0.5,
#     xaxis_title='Trip Zone Type',
#     yaxis_title='Euclidean / Haversine Ratio',
#     showlegend=False
# )

# fig.show()


![Plot Image](figures/newplot_23.png)


<div style="margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #1774d8ff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  "
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
   ㅤPassenger Features
  </span>
</div>
<a id="import-data"></a>


In [49]:


# Clean passenger count (cap at reasonable values)
df['passenger_count'] = df['passenger_count'].clip(1, 6)

# Passenger interactions with other features
df['distance_passenger'] = df['haversine_distance'] * df['passenger_count']
df['distance_per_passenger'] = df['haversine_distance'] / df['passenger_count']
df['hour_passenger'] = df['pickup_hour'] * df['passenger_count']
df['weekend_passenger'] = df['pickup_is_weekend'] * df['passenger_count']
df['rush_hour_passenger'] = df['is_rush_hour'] * df['passenger_count']

# Passenger categories
df['is_solo_trip'] = (df['passenger_count'] == 1).astype(int)
df['is_couple_trip'] = (df['passenger_count'] == 2).astype(int)
df['is_group_trip'] = (df['passenger_count'] > 2).astype(int)
df['is_large_group'] = (df['passenger_count'] > 4).astype(int)

# Passenger count transformations
df['log_passenger_count'] = np.log1p(df['passenger_count'])
df['passenger_count_sq'] = df['passenger_count'] ** 2




<div style="margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #1774d8ff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  "
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
   ㅤPassenger Features (EDA)
  </span>
</div>
<a id="import-data"></a>


In [26]:

# df['passenger_category'] = 'Group'
# df.loc[df['passenger_count'] == 1, 'passenger_category'] = 'Solo'
# df.loc[df['passenger_count'] == 2, 'passenger_category'] = 'Couple'
# df.loc[df['passenger_count'] > 4, 'passenger_category'] = 'Large Group'

# colors = {
#     'Solo': '#EF553B',       
#     'Couple': '#00CC96',      
#     'Group': '#636EFA',      
#     'Large Group': '#AB63FA' 
# }

# fig = px.box(
#     df,
#     x='passenger_category',
#     y='trip_duration',
#     color='passenger_category',
#     title='Trip Duration by Passenger Category',
#     labels={'trip_duration': 'Duration (seconds)', 'passenger_category': 'Passenger Category'},
#     color_discrete_map=colors,
#     template='plotly_dark'
# )

# fig.update_layout(
#     title_x=0.5,
#     xaxis_title='Passenger Category',
#     yaxis_title='Trip Duration (seconds)',
#     # boxmode='group',
#     font=dict(size=14)
# )

# fig.show()



![Plot Image](figures/newplot_24.png)


In [27]:

# avg_passengers_hour = df.groupby('pickup_hour')['passenger_count'].count().reset_index()

# fig = px.line(
#     avg_passengers_hour,
#     x='pickup_hour',
#     y='passenger_count',
#     title='Passenger Count by Hour of Day',
#     labels={'pickup_hour': 'Hour of Day', 'passenger_count': 'Average Passenger Count'},
#     template='plotly_dark',
#     color_discrete_sequence=['#00cc96'],  
#     markers=True
# )

# fig.update_layout(
#     title_x=0.5,
#     xaxis=dict(dtick=1), 
#     yaxis_title='Passenger Count',
#     font=dict(size=13)
# )
# # print(avg_passengers_hour)
# fig.show()


![Plot Image](figures/newplot_25.png)


#### `Number of Passengers by Pickup Hour`

| Pickup Hour | Passenger Count |
|-------------|-----------------|
| 0           | 1,805           |
| 1           | 1,301           |
| 2           | 968             |
| 3           | 743             |
| 4           | 533             |
| 5           | 519             |
| 6           | 1,101           |
| 7           | 1,917           |
| 8           | 2,344           |
| 9           | 2,304           |
| 10          | 2,223           |
| 11          | 2,383           |
| 12          | 2,492           |
| 13          | 2,456           |
| 14          | 2,577           |
| 15          | 2,535           |
| 16          | 2,234           |
| 17          | 2,633           |
| 18          | 2,973           |
| 19          | 3,127           |
| 20          | 2,803           |
| 21          | 2,874           |
| 22          | 2,749           |
| 23          | 2,406           |

- Passenger count is lowest between 4 AM and 5 AM.
- Passenger count rises significantly from 7 AM onwards.
- Peak passenger counts happen between 17:00 (5 PM) and 20:00 (8 PM).


In [52]:
# Distance and time interactions
df['distance_hour'] = df['haversine_distance'] * df['pickup_hour']
df['distance_dayofweek'] = df['haversine_distance'] * df['pickup_dayofweek']
df['distance_weekend'] = df['haversine_distance'] * df['pickup_is_weekend']
df['distance_rush'] = df['haversine_distance'] * df['is_rush_hour']
df['distance_night'] = df['haversine_distance'] * df['is_night']

# Center distance interactions
df['center_dist_hour'] = df['pickup_center_distance'] * df['pickup_hour']
df['center_dist_weekend'] = df['pickup_center_distance'] * df['pickup_is_weekend']

# Speed-related interactions (more realistic estimates)
df['rush_hour_speed_penalty'] = df['haversine_distance'] * df['is_rush_hour'] * 1.5
df['weekend_speed_bonus'] = df['haversine_distance'] * df['pickup_is_weekend'] * 0.8
df['night_speed_bonus'] = df['haversine_distance'] * df['is_night'] * 0.7


In [53]:
df['store_and_fwd_flag'] = df['store_and_fwd_flag'].map({'N': 0, 'Y': 1})


In [54]:
boolean_cols = ['pickup_year', 'is_morning_rush', 'is_evening_rush', 'is_lunch_hour',
                 'is_late_night', 'is_early_morning', 'is_night', 'is_monday', 
                 'is_friday', 'is_sunday', 'weekend_rush_hour', 'weekday_night', 'weekend_night', 
                 'is_pickup_near_JFK', 'is_dropoff_near_JFK', 'airport_JFK_trip', 'is_pickup_near_LGA',
                'is_dropoff_near_LGA', 'airport_LGA_trip', 'is_pickup_near_EWR', 'is_dropoff_near_EWR', 
                'airport_EWR_trip', 'pickup_is_manhattan', 'dropoff_is_manhattan', 'manhattan_trip', 
                'to_manhattan', 'from_manhattan', 'is_couple_trip', 'is_group_trip', 
                'is_large_group', 'distance_night', 'night_speed_bonus', 'pickup_year','store_and_fwd_flag']

numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.difference(boolean_cols)


<div style="text-align: center; margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #17d85eff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  " 
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
Outliers
  </span>
</div>
<a id="import-data"></a>


In [55]:
def calculate_outliers_iqr(df, columns=None):
    """
    Calculate the percentage of outliers (%) using IQR for each specified column
    and print a summary.
    """
    outliers_dict = {}

    for col in columns:
        if col in df.columns:
            # 1. Calculate Q1 and Q3
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)

            # 2. Calculate IQR
            IQR = Q3 - Q1

            # 3. Calculate lower and upper bounds
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            # 4. Identify outliers
            outlier_indices = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index.tolist()
            outlier_percentage = (len(outlier_indices) / len(df)) * 100

            # 5. Save the results
            outliers_dict[col] = outlier_percentage

            # 6. Print the summary
            print(f"{col}: {outlier_percentage:.2f}% outliers")

    return outliers_dict

outliers_percentage = calculate_outliers_iqr(df, numeric_cols)


abs_delta_latitude: 7.33% outliers
abs_delta_longitude: 9.63% outliers
bearing: 0.00% outliers
bearing_cos: 0.00% outliers
bearing_sin: 0.00% outliers
center_dist_hour: 6.44% outliers
center_dist_weekend: 15.61% outliers
center_distance_diff: 10.13% outliers
day_cos: 0.00% outliers
day_sin: 0.00% outliers
delta_latitude: 7.54% outliers
delta_longitude: 9.59% outliers
delta_ratio: 11.20% outliers
dist_dropoff_EWR: 4.54% outliers
dist_dropoff_JFK: 11.13% outliers
dist_dropoff_LGA: 4.37% outliers
dist_pickup_EWR: 5.28% outliers
dist_pickup_JFK: 8.22% outliers
dist_pickup_LGA: 5.27% outliers
distance_dayofweek: 8.59% outliers
distance_hour: 9.32% outliers
distance_passenger: 10.00% outliers
distance_per_passenger: 9.02% outliers
distance_ratio: 0.00% outliers
distance_rush: 9.48% outliers
distance_weekend: 14.09% outliers
dropoff_center_distance: 7.06% outliers
dropoff_dist_brooklyn_bridge: 2.96% outliers
dropoff_dist_central_park: 3.52% outliers
dropoff_dist_empire_state: 6.13% outliers
d

In [56]:
def  cap_outliers(df, numeric_cols, boolean_cols):
    """Cap outliers in numeric columns using IQR method."""
    df = df.copy()
    for col in numeric_cols:
        if col in df.columns:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            df[col] = df[col].clip(lower, upper)
    return df[numeric_cols.tolist() + boolean_cols]

In [57]:
train_df , test_df = train_test_split(df, test_size=0.1, random_state=42 ,shuffle=True)

train_df = cap_outliers(train_df, numeric_cols, boolean_cols)

test_df = cap_outliers(test_df, numeric_cols, boolean_cols)


## ✅ Train-Test Split Before Outlier Treatment

To maintain data integrity and avoid any form of data leakage, we **split the dataset into training and testing subsets before applying outlier handling**:

```python
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42, shuffle=True)

train_df = cap_outliers(train_df, numeric_cols, boolean_cols)
test_df = cap_outliers(test_df, numeric_cols, boolean_cols)


<div style="text-align: center; margin: 30px 0;">
  <span style="
    font-family: 'Playfair Display', serif;
    background-color: #17d85eff;
    color: white;
    font-size: 1.5em;
    font-weight: 600;
    font-style: italic;
    padding: 12px 24px;
    border-radius: 12px;
    display: inline-block;
    box-shadow: 0 4px 12px rgba(18, 17, 17, 0.15);
    transition: transform 0.2s ease, box-shadow 0.2s ease;
    cursor: default;
  " 
  onmouseover="this.style.transform='scale(1.03)'; this.style.boxShadow='0 6px 16px rgba(0,0,0,0.2)'"
  onmouseout="this.style.transform='scale(1)'; this.style.boxShadow='0 4px 12px rgba(0,0,0,0.15)'"
  >
    Features selection
  </span>
</div>
<a id="import-data"></a>


In [58]:
# Calculate correlation of all numeric columns with the target 'trip_duration'
correlation = train_df[numeric_cols].corr()['trip_duration']

# Select features whose absolute correlation with the target is between 0.2 and 0.9
selected_corr = correlation[correlation.abs().between(0.2, 0.9)]

# Drop the target column itself from the selected features (if present)
selected_corr = selected_corr.drop('trip_duration', errors='ignore')

# Print the selected features sorted by correlation value in descending order
# print("Features with |correlation| between 0.2 and 0.9:")
# print(selected_corr.sort_values(ascending=False))

# Get the list of selected feature names
top_features = selected_corr.index.tolist()
top_features.append('trip_duration')
# Use the selected features to filter the original dataframe
# X_selected = df[top_features]


In [59]:
train_df =train_df[top_features + boolean_cols]
test_df = test_df[top_features + boolean_cols]

In [60]:
train_df.columns

Index(['abs_delta_latitude', 'abs_delta_longitude', 'center_dist_hour',
       'dist_pickup_JFK', 'distance_dayofweek', 'distance_hour',
       'distance_passenger', 'distance_per_passenger', 'distance_rush',
       'dropoff_center_distance', 'dropoff_dist_central_park',
       'dropoff_dist_empire_state', 'dropoff_dist_times_square',
       'duration_30_kph', 'duration_40_kph', 'duration_50_kph',
       'estimated_speed_30', 'estimated_speed_40', 'estimated_speed_50',
       'euclidean_distance', 'haversine_distance', 'haversine_distance_cube',
       'haversine_distance_sq', 'inv_haversine', 'inv_haversine_distance',
       'is_long_trip', 'log_haversine', 'log_haversine_distance',
       'manhattan_distance', 'pickup_center_distance',
       'pickup_dist_central_park', 'pickup_dist_empire_state',
       'pickup_dist_times_square', 'rush_hour_speed_penalty', 'sqrt_haversine',
       'sqrt_haversine_distance', 'total_center_distance', 'trip_duration',
       'pickup_year', 'pickup_yea

## 🔍 Feature Selection Based on Correlation with Target

To ensure our model focuses on the most informative features, we performed correlation-based feature selection using the training dataset. This helps reduce dimensionality and potential noise, leading to better model performance.
