#  New York City Taxi and Limousine Commission Project

---

**This notebook applies the PACE strategy to the 2017_Yellow_Taxi_Trip_Data dataset to build a predictive model for the total amount of a taxi trip.**

---

## **PACE: Plan**

This stage is where I have my first contact with the dataset and start planning what to do with the data to achieve my goals. I normally start by asking the dataset some questions. Let's get started. (Remember, these first answers are preliminary and may not be fully accurate.)

### What are the data columns most relevant to your deliverable?

  - tpep_pickup_datetime, tpep_dropoff_datetime: To calculate the trip duration 
  - Trip_distance: The elapsed trip distance in miles reported by the taximeter.
  - Fare_amount: The time-and-distance fare calculated by the meter.
  - Tip_amount: The tip amount. This field is automatically populated for credit card tips; cash tips are not included.
  - Tolls_amount: Total amount of all tolls paid during the trip.
  - Total_amount: The total amount charged to passengers. Does not include cash tips.
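
One way to sanity-check which columns matter for the target is to rank the numeric columns by their correlation with `total_amount`. The sketch below is only an illustration: the helper name `rank_by_correlation` is hypothetical, the column names are assumed to be lowercase as in the code cells, and the sample values are made up rather than taken from the real dataset.

```python
import pandas as pd


def rank_by_correlation(df: pd.DataFrame, target: str = "total_amount") -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")
    # Correlation of every numeric column with the target, excluding the target itself
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength while keeping the sign of each correlation
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Tiny synthetic frame (made-up values, only for illustration)
sample = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fare_amount": [5.0, 8.0, 11.0, 14.0, 17.0],
    "tip_amount": [1.0, 0.0, 2.0, 0.0, 3.0],
    "tolls_amount": [0.0, 0.0, 0.0, 5.76, 0.0],
    "total_amount": [6.0, 8.0, 13.0, 19.76, 20.0],
})

ranking = rank_by_correlation(sample)
print(ranking)
```

On the real data, the same call would be `rank_by_correlation(df_taxi)` after loading the CSV. Correlation only captures linear relationships, so it is a first filter rather than a final feature selection.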

### What are the data columns most irrelevant to your deliverable?

  - Store_and_fwd_flag: shows no correlation with any other column in the dataset


### Is there missing or incomplete data in your dataset?
  No, but there is inconsistent data. The affected columns are:
  - Fare_amount: values 0 or negative 
  - Extra: negative values and wrong values (greater than one)
  - Mta_tax: negative values
  - Improvement_surcharge: values 0 or negative
  - Total_amount: values 0 or negative
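
The inconsistencies listed above can be counted before any rows are removed. This is a minimal sketch, assuming lowercase column names as in the code cells; the helper `count_invalid` is hypothetical and the sample values are made up.

```python
import pandas as pd


def count_invalid(df: pd.DataFrame) -> pd.Series:
    """Count the rows that violate each validity rule from the list above."""
    rules = {
        "fare_amount <= 0": df["fare_amount"] <= 0,
        "extra < 0": df["extra"] < 0,
        "extra > 1": df["extra"] > 1,
        "mta_tax < 0": df["mta_tax"] < 0,
        "improvement_surcharge <= 0": df["improvement_surcharge"] <= 0,
        "total_amount <= 0": df["total_amount"] <= 0,
    }
    # Summing a boolean mask gives the number of True entries
    return pd.Series({name: int(mask.sum()) for name, mask in rules.items()})


# Tiny synthetic frame (made-up values, not taken from the real dataset)
sample = pd.DataFrame({
    "fare_amount": [10.0, 0.0, -2.5, 7.0],
    "extra": [0.5, 0.0, -1.0, 4.5],
    "mta_tax": [0.5, 0.5, -0.5, 0.5],
    "improvement_surcharge": [0.3, 0.0, -0.3, 0.3],
    "total_amount": [11.3, 0.0, -4.3, 12.3],
})

invalid_counts = count_invalid(sample)
print(invalid_counts)
```

Knowing how many rows each rule affects helps decide whether dropping them is acceptable or whether a rule is too strict.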
  
### Which EDA practices will be required to begin this project?
  - Data cleaning & Validity checks
  - Temporal Analysis
  - Univariate Analysis (Distribution of Target)
  - Bivariate Analysis (Feature Relationships)
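
As an illustration of the temporal analysis item, the pickup timestamp can be split into hour and day-of-week features and the target averaged per group. This is only a sketch under assumptions: `add_time_features`, `pickup_hour`, and `pickup_day` are hypothetical names, and the three sample trips are made up.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame,
                      pickup_col: str = "tpep_pickup_datetime") -> pd.DataFrame:
    """Return a copy of df with pickup hour and weekday name columns added."""
    out = df.copy()
    pickup = pd.to_datetime(out[pickup_col])
    out["pickup_hour"] = pickup.dt.hour
    out["pickup_day"] = pickup.dt.day_name()
    return out


# Tiny synthetic frame (made-up trips, only for illustration)
sample = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2017-03-25 08:55:43",
        "2017-03-25 08:10:00",
        "2017-04-11 14:53:28",
    ],
    "total_amount": [10.0, 20.0, 12.0],
})

features = add_time_features(sample)
# Average total amount per pickup hour
mean_by_hour = features.groupby("pickup_hour")["total_amount"].mean()
print(mean_by_hour)
```

The same grouping on `pickup_day` would show whether the total amount varies by weekday.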
  

## **PACE: Analyze**

In this stage I start the data cleaning and validity checks and decide which steps I have to take to achieve the best results. We can continue with some questions.


### What steps need to be taken to perform EDA in the most effective way to achieve the project goal?
  #### First: I profile my data with ydata-profiling to start my analysis (this applies to the Plan phase too)


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

# Load the raw trip records
df_taxi = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv")
# Drop the ID and VendorID columns, which are not used in this analysis
df_taxi.drop(columns=["ID", "VendorID"], inplace=True)
# Generate an exploratory HTML profile of the raw data
profile_taxi = ProfileReport(df_taxi, title="Taxi analysis with ydata", explorative=True)
profile_taxi.to_file("taxi_analysis_report.html")

#### Second: I clean the inconsistent data I found in the Plan phase

In [37]:
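# Drop the flag column identified as irrelevant in the Plan phase
df_taxi.drop(columns=['store_and_fwd_flag'], errors='ignore', inplace=True)
# Parse the pickup and dropoff timestamps
df_taxi['tpep_pickup_datetime'] = pd.to_datetime(df_taxi['tpep_pickup_datetime'])
df_taxi['tpep_dropoff_datetime'] = pd.to_datetime(df_taxi['tpep_dropoff_datetime'])
# Keep only rows with a strictly positive fare
df_taxi = df_taxi[df_taxi['fare_amount'] > 0]
# Keep 0 < extra <= 1 (note: the strict > 0 also drops rows where extra is exactly 0)
df_taxi = df_taxi[df_taxi['extra'] > 0]
df_taxi = df_taxi[df_taxi['extra'] <= 1]
# Keep strictly positive tax (note: this also drops rows where mta_tax is exactly 0)
df_taxi = df_taxi[df_taxi['mta_tax'] > 0]
# Keep strictly positive surcharge and total
df_taxi = df_taxi[df_taxi['improvement_surcharge'] > 0]
df_taxi = df_taxi[df_taxi['total_amount'] > 0]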
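
The same filters could also be written as a single boolean mask, which makes it easy to report how many rows are removed. This is a sketch rather than the cell that produced the results: `apply_validity_filters` is a hypothetical helper, the thresholds mirror the cell above (including the strict `> 0` on `extra` and `mta_tax`), and the sample values are made up.

```python
import pandas as pd


def apply_validity_filters(df: pd.DataFrame):
    """Return the rows that pass every validity check and the number dropped."""
    mask = (
        (df["fare_amount"] > 0)
        & (df["extra"] > 0)
        & (df["extra"] <= 1)
        & (df["mta_tax"] > 0)
        & (df["improvement_surcharge"] > 0)
        & (df["total_amount"] > 0)
    )
    # ~mask selects the rows that fail at least one check
    return df[mask].copy(), int((~mask).sum())


# Tiny synthetic frame (made-up values, only for illustration)
sample = pd.DataFrame({
    "fare_amount": [10.0, -1.0, 5.0, 8.0],
    "extra": [0.5, 0.5, 0.0, 1.0],
    "mta_tax": [0.5, 0.5, 0.5, 0.5],
    "improvement_surcharge": [0.3, 0.3, 0.3, 0.3],
    "total_amount": [11.3, -0.2, 5.8, 9.8],
})

cleaned, n_dropped = apply_validity_filters(sample)
print(f"kept {len(cleaned)} rows, dropped {n_dropped}")
```

In this sample the third row is dropped only because `extra` is exactly 0, which shows how much the strict comparison can cost on data where a zero surcharge is common.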


#### Third: I add a new column to help analyze the data


In [64]:
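# Trip duration as a timedelta, then converted to minutes
df_taxi['duration'] = df_taxi['tpep_dropoff_datetime'] - df_taxi['tpep_pickup_datetime']
df_taxi['duration_minutes'] = df_taxi['duration'].dt.total_seconds() / 60
# Drop trips with zero or negative duration
df_taxi = df_taxi[df_taxi['duration_minutes'] > 0]
# The intermediate timedelta column is no longer needed
df_taxi.drop(columns=['duration'], inplace=True)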

#### Then I generate another report with ydata-profiling and plot some graphs to identify outliers and reanalyze the dataset

In [65]:
profile_taxi = ProfileReport(df_taxi, title="Taxi analysis with ydata", explorative=True)
profile_taxi.to_file("taxi_analysis_report_cleaned.html")



In [66]:
# Define columns to check for outliers
cols_to_check = ['trip_distance', 'fare_amount', 'total_amount', 'duration_minutes']

# 1. Visualization: Boxplots
plt.figure(figsize=(10, 15))
for i, col in enumerate(cols_to_check):
    plt.subplot(2, 2, i+1)
    sns.boxplot(y=df_taxi[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.savefig('outlier_boxplots.png')

# 2. Statistical Identification: IQR Method
outlier_stats = {}
for col in cols_to_check:
    Q1 = df_taxi[col].quantile(0.25)
    Q3 = df_taxi[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df_taxi[(df_taxi[col] < lower_bound) | (df_taxi[col] > upper_bound)]
    outlier_stats[col] = {
        'IQR': IQR,
        'Lower Bound': lower_bound,
        'Upper Bound': upper_bound,
        'Num Outliers': len(outliers),
        'Percentage': (len(outliers) / len(df_taxi)) * 100
    }

# Print stats
print("Outlier Statistics (IQR Method):")
for col, stats in outlier_stats.items():
    print(f"\n{col}:")
    print(f"  IQR: {stats['IQR']:.2f}")
    print(f"  Bounds: {stats['Lower Bound']:.2f} to {stats['Upper Bound']:.2f}")
    print(f"  Outliers: {stats['Num Outliers']} ({stats['Percentage']:.2f}%)")

# 3. Standalone boxplot of the target variable
plt.figure(figsize=(10, 15))
sns.boxplot(y=df_taxi['total_amount'])
plt.title('Boxplot of total_amount')
plt.tight_layout()
plt.savefig('outlier_boxplots_fixed.png')

plt.figure(figsize=(10, 6))
sns.scatterplot(x='trip_distance', y='total_amount', data=df_taxi, alpha=0.5)
plt.title('Trip Distance vs Total Amount (Outlier Detection)')
plt.xlabel('Trip Distance (miles)')
plt.ylabel('Total Amount ($)')
plt.savefig('scatter_outliers_fixed.png')

Outlier Statistics (IQR Method):

trip_distance:
  IQR: 2.20
  Bounds: -2.30 to 6.50
  Outliers: 987 (9.28%)

fare_amount:
  IQR: 7.75
  Bounds: -5.12 to 25.88
  Outliers: 800 (7.52%)

total_amount:
  IQR: 9.00
  Bounds: -4.70 to 31.30
  Outliers: 793 (7.45%)

duration_minutes:
  IQR: 10.77
  Bounds: -9.60 to 33.47
  Outliers: 453 (4.26%)
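
The IQR logic used above can be wrapped in small reusable helpers, so the same rule can be applied to any column or used to build a filter later. This is a minimal sketch: `iqr_bounds` and `flag_outliers` are hypothetical names, the multiplier `k=1.5` matches the cell above, and the sample series is made up.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) outlier bounds using the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value is outside the bounds."""
    lower, upper = iqr_bounds(values, k)
    return (values < lower) | (values > upper)


# Tiny synthetic series (made-up values, only for illustration)
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

lower, upper = iqr_bounds(sample)
flags = flag_outliers(sample)
print(f"bounds: {lower:.2f} to {upper:.2f}, outliers: {int(flags.sum())}")
```

Note that all four lower bounds printed above are negative while the cleaned columns only contain positive values, so in practice only the upper bound flags rows here.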
