# A first look at the dataset

In [71]:
from taxipred.backend.data_processing import TaxiData
taxidata = TaxiData()

In [72]:
# use info() to see the column names, the number of null values, and the dtypes
taxidata.df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Trip_Distance_km       950 non-null    float64
 1   Time_of_Day            950 non-null    object 
 2   Day_of_Week            950 non-null    object 
 3   Passenger_Count        950 non-null    float64
 4   Traffic_Conditions     950 non-null    object 
 5   Weather                950 non-null    object 
 6   Base_Fare              950 non-null    float64
 7   Per_Km_Rate            950 non-null    float64
 8   Per_Minute_Rate        950 non-null    float64
 9   Trip_Duration_Minutes  950 non-null    float64
 10  Trip_Price             951 non-null    float64
dtypes: float64(7), object(4)
memory usage: 86.1+ KB


In [73]:
# look at the first rows of the dataset to better understand the columns
taxidata.df.head(10)


Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.8,0.32,53.82,36.2624
1,47.59,Afternoon,Weekday,1.0,High,Clear,,0.62,0.43,40.57,
2,36.87,Evening,Weekend,1.0,High,Clear,2.7,1.21,0.15,37.27,52.9032
3,30.33,Evening,Weekday,4.0,Low,,3.48,0.51,0.15,116.81,36.4698
4,,Evening,Weekday,3.0,High,Clear,2.93,0.63,0.32,22.64,15.618
5,8.64,Afternoon,Weekend,2.0,Medium,Clear,2.55,1.71,0.48,89.33,60.2028
6,3.85,Afternoon,Weekday,4.0,High,Rain,3.51,1.66,,5.05,11.2645
7,43.44,Evening,Weekend,3.0,,Clear,2.97,1.87,0.23,,101.1216
8,30.45,Morning,Weekday,3.0,High,Clear,2.77,1.78,0.34,110.33,
9,35.7,Afternoon,Weekday,2.0,Low,Rain,3.39,1.52,0.47,,75.5657


In [74]:
import numpy as np
# compute pairwise correlations between the numeric columns
# I suspect a couple of these columns correlate abnormally close to 1
matrix = taxidata.df.select_dtypes(include=np.number).corr()
print(matrix)

                       Trip_Distance_km  Passenger_Count  Base_Fare  \
Trip_Distance_km               1.000000        -0.048397   0.032218   
Passenger_Count               -0.048397         1.000000   0.022932   
Base_Fare                      0.032218         0.022932   1.000000   
Per_Km_Rate                   -0.017041         0.030213   0.003092   
Per_Minute_Rate               -0.025902         0.034068  -0.019150   
Trip_Duration_Minutes         -0.022102         0.022845   0.012035   
Trip_Price                     0.849123        -0.014223   0.035533   

                       Per_Km_Rate  Per_Minute_Rate  Trip_Duration_Minutes  \
Trip_Distance_km         -0.017041        -0.025902              -0.022102   
Passenger_Count           0.030213         0.034068               0.022845   
Base_Fare                 0.003092        -0.019150               0.012035   
Per_Km_Rate               1.000000         0.029241               0.027199   
Per_Minute_Rate           0.029241       

## My Feature Selection Plan and Justification

### The Main Predictor
After looking at the data, **`Trip_Distance_km`** is clearly the biggest factor for **`Trip_Price`**. The correlation matrix shows a strong positive correlation of about 0.85 between the two, so it's the main feature I'll be using.

***
### Dropping Columns to Avoid Data Leakage

I'm dropping several columns to ensure my model is realistic and doesn't "cheat" by looking at parts of the answer.

**Fare Component Columns (`Base_Fare`, Rates, etc.)**

My initial thought was that **`Base_Fare`**, **`Per_Km_Rate`**, and **`Per_Minute_Rate`** are used to calculate the final price. The low correlation values were confusing, so I decided to manually verify this to be sure.

First, I needed a complete row of data to work with, so I chose **Row 0** since it had no missing values. Based on the column names, I pieced together the most likely formula:

`Total Price = Base_Fare + (Trip_Distance_km * Per_Km_Rate) + (Trip_Duration_Minutes * Per_Minute_Rate)`

I then plugged in the numbers from Row 0 to test this theory:

* **Base Fare:** `3.56`
* **Distance Cost:** `19.35 km * 0.80` = `15.48`
* **Duration Cost:** `53.82 min * 0.32` = `17.2224`

When I summed these components, the result was **36.2624**, a perfect match for the actual **`Trip_Price`**. This confirms, at least for this row, that the price is computed directly from these columns, which is the data leakage I suspected.
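The arithmetic above can be reproduced in a few lines. This is a minimal sketch: `fare_from_components` is a hypothetical helper name, not part of `taxipred`, and the numbers are copied by hand from Row 0.

```python
import math


def fare_from_components(base_fare, distance_km, per_km_rate,
                         duration_min, per_minute_rate):
    """Hypothetical helper: price implied by the assumed fare formula."""
    return base_fare + distance_km * per_km_rate + duration_min * per_minute_rate


# Row 0 values, copied from the head() output above
predicted = fare_from_components(
    base_fare=3.56,
    distance_km=19.35,
    per_km_rate=0.80,
    duration_min=53.82,
    per_minute_rate=0.32,
)
actual = 36.2624

# Compare with a tolerance, since floating-point sums rarely match exactly
matches = math.isclose(predicted, actual, abs_tol=1e-9)
print(round(predicted, 4), matches)
```

Running the same comparison over every row that has no missing values would be a stronger check than a single row.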

**The Trip Duration Problem**

I'm also dropping **`Trip_Duration_Minutes`**. This was a tricky one since duration and price are clearly connected. However, the column in this dataset is the *actual* time the trip took, which is something I'd only know *after* it's over. For my model to be realistic, it has to predict the price from stuff I'd know at the start.

If I had start and stop locations, I would have used an API to get an *estimated* duration and used that as a feature. Since I don't have that, using the actual duration is just cheating.

***
### Final Approach

Based on this, I'll move forward using **`Trip_Distance_km`** together with **`Time_of_Day`**, **`Day_of_Week`**, **`Passenger_Count`**, and **`Traffic_Conditions`** to build the model.

### Repairing Key Columns Using the Fare Formula

Now that the exact mathematical formula connecting the fare components has been identified, I can use it as a powerful tool for data repair.

By algebraically rearranging this formula, I can calculate and fill in missing values for my two key columns: the target variable **`Trip_Price`** and the main feature **`Trip_Distance_km`**. The process is deterministic, so a missing value is recovered exactly whenever every other term in the formula is present for that row. This salvages valuable rows that would otherwise be dropped.
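To make the rearrangement concrete, here is a minimal sketch of the idea on a single row. It is not the actual implementation of `repair_taxi_data`; `repair_row` and `is_missing` are hypothetical helpers, and rows are represented as plain dictionaries for illustration.

```python
import math


def is_missing(value):
    """True for None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def repair_row(row):
    """Hypothetical helper: fill Trip_Price or Trip_Distance_km from the fare formula.

    A value is only filled when every other term in the formula is known.
    """
    r = dict(row)
    rates = ["Base_Fare", "Per_Km_Rate", "Per_Minute_Rate", "Trip_Duration_Minutes"]
    if any(is_missing(r[c]) for c in rates):
        return r  # not enough information to solve for anything

    time_cost = r["Trip_Duration_Minutes"] * r["Per_Minute_Rate"]
    price_missing = is_missing(r["Trip_Price"])
    distance_missing = is_missing(r["Trip_Distance_km"])

    if price_missing and not distance_missing:
        # Forward formula: price = base + distance * km_rate + time cost
        r["Trip_Price"] = r["Base_Fare"] + r["Trip_Distance_km"] * r["Per_Km_Rate"] + time_cost
    elif distance_missing and not price_missing and r["Per_Km_Rate"] != 0:
        # Rearranged: distance = (price - base - time cost) / km_rate
        r["Trip_Distance_km"] = (r["Trip_Price"] - r["Base_Fare"] - time_cost) / r["Per_Km_Rate"]
    return r


# Row 4 from head(): distance is missing, everything else is known
row4 = {"Trip_Distance_km": math.nan, "Base_Fare": 2.93, "Per_Km_Rate": 0.63,
        "Per_Minute_Rate": 0.32, "Trip_Duration_Minutes": 22.64, "Trip_Price": 15.618}
# Row 1 from head(): price AND base fare are missing, so it cannot be solved
row1 = {"Trip_Distance_km": 47.59, "Base_Fare": math.nan, "Per_Km_Rate": 0.62,
        "Per_Minute_Rate": 0.43, "Trip_Duration_Minutes": 40.57, "Trip_Price": math.nan}

repaired4 = repair_row(row4)
repaired1 = repair_row(row1)
print(round(repaired4["Trip_Distance_km"], 2), is_missing(repaired1["Trip_Price"]))
```

Row 4 can be solved because only the distance is unknown. Row 1 stays unrepaired because two terms of the formula are missing at once, which is why a repair of this kind cannot be expected to fill every gap.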

In [75]:
from taxipred.backend.data_repair import repair_taxi_data
repaired_df = repair_taxi_data(taxidata.df)

--- Before Repair ---
Missing values in key columns:
Trip_Price          49
Trip_Distance_km    50
dtype: int64
-------------------------
--- After Repair ---
Missing values in key columns:
Trip_Price          17
Trip_Distance_km     6
dtype: int64
-------------------------


### Drop useless columns
Now I create a DataFrame with only the columns I want to keep.

In [76]:
columns_to_drop = [
    "Base_Fare",
    "Per_Km_Rate",
    "Per_Minute_Rate",
    "Trip_Duration_Minutes",
]
df_final_columns = repaired_df.drop(columns=columns_to_drop)

df_final_columns.head()

Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,36.2624
1,47.59,Afternoon,Weekday,1.0,High,Clear,
2,36.87,Evening,Weekend,1.0,High,Clear,52.9032
3,30.33,Evening,Weekday,4.0,Low,,36.4698
4,8.64,Evening,Weekday,3.0,High,Clear,15.618


### Identifying the makeup of nulls
I've concluded that there are a fair number of null values in the dataset.
I want to see how they are spread out: whether any rows have an overwhelming number of null values, or whether it's limited to 1 column per row.

In [77]:
# count nulls per row, then show the percentage of rows for each null count
df_final_columns.isnull().sum(axis=1).value_counts(normalize=True) * 100

0    75.3
1    22.1
2     2.6
Name: proportion, dtype: float64

### Dropping 2.6% of the dataset
I've decided that dropping the 2.6% of rows that contain 2 null values is an acceptable loss.

I will try to repair the remaining data using machine learning.

In [78]:
# keep only the rows with at most one null value
df_with_max_1_null = df_final_columns[df_final_columns.isnull().sum(axis=1) < 2]
print(df_with_max_1_null.info())


<class 'pandas.core.frame.DataFrame'>
Index: 974 entries, 0 to 999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Trip_Distance_km    968 non-null    float64
 1   Time_of_Day         927 non-null    object 
 2   Day_of_Week         938 non-null    object 
 3   Passenger_Count     935 non-null    float64
 4   Traffic_Conditions  930 non-null    object 
 5   Weather             937 non-null    object 
 6   Trip_Price          962 non-null    float64
dtypes: float64(3), object(4)
memory usage: 60.9+ KB
None
