# A first look at the dataset

In [None]:
from taxipred.backend.data_processing import TaxiData
from taxipred.utils.constants import ORIGINAL_CSV_PATH, ALTERED_CSV_PATH
taxidata = TaxiData(ORIGINAL_CSV_PATH)

In [None]:
# use info() to see the column names, the number of null values, and the dtypes
taxidata.df.info()


In [None]:
# check the value counts to see whether any categorical values are misspelled
from taxipred.backend.data_processing import find_categorical_columns
cat_cols = find_categorical_columns(taxidata.df)
for name in cat_cols:
    print(taxidata.df[name].value_counts())

### No misspelled values in the categorical columns
The value counts show that none of the actual values inside the categorical columns are misspelled.

In [None]:
# look at the first rows of the dataset to better understand the columns
taxidata.df.head(10)


In [None]:
import numpy as np
# figure out the important correlations
# I suspect a correlation abnormally close to 1 in a couple of these columns
matrix = taxidata.df.select_dtypes(include=np.number).corr()
print(matrix)

## My Feature Selection Plan and Justification

### The Main Predictor
After looking at the data, it's clear that **`Trip_Distance_km`** is the biggest factor for the **`Trip_Price`**. My correlation check supported this with a strong positive value, so it's the main feature I'll be using.

***
### Dropping Columns to Avoid Data Leakage

I'm dropping several columns to ensure my model is realistic and doesn't "cheat" by looking at parts of the answer.

**Fare Component Columns (`Base_Fare`, Rates, etc.)**

My initial thought was that **`Base_Fare`**, **`Per_Km_Rate`**, and **`Per_Minute_Rate`** are used to calculate the final price. The low correlation values were confusing, so I decided to manually verify this to be sure.

First, I needed a complete row of data to work with, so I chose **Row 0** since it had no missing values. Based on the column names, I pieced together the most likely formula:

`Total Price = Base_Fare + (Trip_Distance_km * Per_Km_Rate) + (Trip_Duration_Minutes * Per_Minute_Rate)`

I then plugged in the numbers from Row 0 to test this theory:

* **Base Fare:** `3.56`
* **Distance Cost:** `19.35 km * 0.80` = `15.48`
* **Duration Cost:** `53.82 min * 0.32` = `17.2224`

When I summed these components, the result was **36.2624**, which was a perfect match for the actual **`Trip_Price`**. This test confirmed that the price is a direct result of these columns, proving the data leakage I suspected.
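
A single row is a thin basis for the formula, so a natural follow-up is to run the same check over every row that has all six values. The snippet below is a minimal sketch of that idea, assuming pandas and NumPy. The helper name `check_fare_formula` and the `0.01` tolerance are illustrative choices rather than part of `taxipred`, and the one-row frame simply holds the Row 0 numbers quoted above.

```python
import numpy as np
import pandas as pd

# Columns that appear in the suspected fare formula.
FORMULA_COLUMNS = [
    "Base_Fare", "Trip_Distance_km", "Per_Km_Rate",
    "Trip_Duration_Minutes", "Per_Minute_Rate", "Trip_Price",
]


def check_fare_formula(df: pd.DataFrame, tol: float = 0.01) -> pd.Series:
    """For each row with no missing formula values, report whether the
    formula reproduces Trip_Price within `tol` (hypothetical helper)."""
    complete = df.dropna(subset=FORMULA_COLUMNS)
    rebuilt = (
        complete["Base_Fare"]
        + complete["Trip_Distance_km"] * complete["Per_Km_Rate"]
        + complete["Trip_Duration_Minutes"] * complete["Per_Minute_Rate"]
    )
    matches = np.isclose(
        rebuilt.to_numpy(), complete["Trip_Price"].to_numpy(), rtol=0.0, atol=tol
    )
    return pd.Series(matches, index=complete.index)


# Row 0 as quoted in the text above.
row_0 = pd.DataFrame([{
    "Base_Fare": 3.56,
    "Trip_Distance_km": 19.35,
    "Per_Km_Rate": 0.80,
    "Trip_Duration_Minutes": 53.82,
    "Per_Minute_Rate": 0.32,
    "Trip_Price": 36.2624,
}])
print(check_fare_formula(row_0))
```

On the real data the call would be `check_fare_formula(taxidata.df).mean()`, which gives the share of complete rows where the formula holds.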

**The Trip Duration Problem**

I'm also dropping **`Trip_Duration_Minutes`**. This was a tricky one since duration and price are clearly connected. However, the column in this dataset is the *actual* time the trip took, which is something I'd only know *after* it's over. For my model to be realistic, it has to predict the price from stuff I'd know at the start.

If I had start and stop locations, I would have used an API to get an *estimated* duration and used that as a feature. Since I don't have that, using the actual duration is just cheating.

***
### Final Approach

Based on this, I'll move forward using **`Trip_Distance_km`** and my categorical features **`Time_of_Day`**, **`Day_of_Week`**, **`Passenger_Count`**, and **`Traffic_Conditions`** to build the model.

### Repairing Key Columns Using the Fare Formula

Now that the exact mathematical formula connecting the fare components has been identified, I can use it as a powerful tool for data repair.

By algebraically rearranging this formula, it's possible to calculate and fill in missing values for my key columns: the target variable **`Trip_Price`** and the main feature **`Trip_Distance_km`**. This is a deterministic process, so whenever every other term of the formula is present in a row, the missing value is repaired with 100% accuracy, salvaging valuable rows that would otherwise be dropped.
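
The rearrangement can be sketched as follows. This only illustrates the algebra, assuming pandas. It is not the implementation behind `repair_data_using_algebra`, and the function name `repair_with_formula` is made up for the example. Solving the formula for distance gives `Trip_Distance_km = (Trip_Price - Base_Fare - Trip_Duration_Minutes * Per_Minute_Rate) / Per_Km_Rate`.

```python
import numpy as np
import pandas as pd


def repair_with_formula(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Trip_Price or Trip_Distance_km from the fare formula.

    Hypothetical helper: a row is only repaired when every other term of
    the formula is present, otherwise the computed value is NaN and the
    gap stays. Assumes Per_Km_Rate is never zero.
    """
    out = df.copy()
    duration_cost = out["Trip_Duration_Minutes"] * out["Per_Minute_Rate"]

    # Forward direction: compute the price from its components.
    price = out["Base_Fare"] + out["Trip_Distance_km"] * out["Per_Km_Rate"] + duration_cost
    out["Trip_Price"] = out["Trip_Price"].fillna(price)

    # Inverse direction: solve the same formula for the distance.
    distance = (out["Trip_Price"] - out["Base_Fare"] - duration_cost) / out["Per_Km_Rate"]
    out["Trip_Distance_km"] = out["Trip_Distance_km"].fillna(distance)
    return out


# Two copies of Row 0 from the text, each with one key value blanked out.
damaged = pd.DataFrame({
    "Base_Fare": [3.56, 3.56],
    "Trip_Distance_km": [19.35, np.nan],
    "Per_Km_Rate": [0.80, 0.80],
    "Trip_Duration_Minutes": [53.82, 53.82],
    "Per_Minute_Rate": [0.32, 0.32],
    "Trip_Price": [np.nan, 36.2624],
})
repaired = repair_with_formula(damaged)
print(repaired[["Trip_Distance_km", "Trip_Price"]])
```

A row that is missing both the price and the distance cannot be repaired this way, because one equation cannot recover two unknowns.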

In [None]:
taxidata.repair_data_using_algebra()


### Using imputation
By using imputation I can fill in the remaining nulls, as long as there isn't more than one null value per row.
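
The general idea can be sketched like this. It is a simplified stand-in, not the code behind `repair_using_imputation`: it assumes all columns are numeric (categorical columns would need encoding first) and swaps in a plain least-squares fit from NumPy as the model. The function name `impute_single_nulls` is hypothetical.

```python
import numpy as np
import pandas as pd


def impute_single_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Fill rows that have exactly one null, treating each column in turn
    as the target and all other columns as features (illustrative sketch)."""
    out = df.copy()
    for target in out.columns:
        features = [c for c in out.columns if c != target]
        # Rows where only the target is missing and every feature is present.
        todo = out[target].isna() & out[features].notna().all(axis=1)
        complete = out.dropna()
        # Skip if there is nothing to fill or too little data to fit on.
        if not todo.any() or len(complete) <= len(features):
            continue
        # Fit target ~ intercept + features on the fully observed rows.
        X = np.column_stack(
            [np.ones(len(complete)), complete[features].to_numpy(dtype=float)]
        )
        y = complete[target].to_numpy(dtype=float)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Predict the missing target values from the observed features.
        X_todo = np.column_stack(
            [np.ones(int(todo.sum())), out.loc[todo, features].to_numpy(dtype=float)]
        )
        out.loc[todo, target] = X_todo @ coef
    return out


# Toy data with b = 2 * a + 1; one row misses b, one row misses both values.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "b": [3.0, 5.0, 7.0, np.nan, 11.0, np.nan],
})
filled = impute_single_nulls(toy)
print(filled)
```

The row with two nulls is left alone, because no column in it has a full set of features to predict from. That is the reason for the one-null-per-row limit.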

In [None]:
# iterate over each column as the target, using the rest as features, until no more nulls can be filled
# this was the most time-consuming portion of my project
taxidata.repair_using_imputation()

### Drop useless columns
Now I drop the columns that won't be included in the model for predicting trip prices.

In [None]:
columns_to_drop = [
    "Base_Fare",
    "Per_Km_Rate",
    "Per_Minute_Rate",
    "Trip_Duration_Minutes",
]
taxidata.drop_columns(columns_to_drop)


### Remaining nulls
Checking how the remaining nulls are distributed.

In [None]:
taxidata.df.info()

In [None]:
# percentage of rows, grouped by how many nulls each row contains
taxidata.df.isnull().sum(axis=1).value_counts(normalize=True) * 100

### Dropping 2.3% of the dataset
I've decided that dropping the 2.3% of the dataset that contains 2 null values per row is an acceptable loss.
The reason is that the columns with missing values are too important to the data to keep rows where those values are missing.
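
To make the size of that loss concrete, the share of rows that `dropna()` removes can be computed before dropping anything. A minimal sketch, assuming pandas: `share_of_rows_with_nulls` is a hypothetical helper and the four-row frame is toy data, not the taxi dataset.

```python
import numpy as np
import pandas as pd


def share_of_rows_with_nulls(df: pd.DataFrame) -> float:
    """Percentage of rows with at least one null, i.e. what dropna() removes."""
    return float(df.isnull().any(axis=1).mean() * 100)


# Toy frame: one of the four rows is missing two values.
toy_rows = pd.DataFrame({
    "Trip_Distance_km": [5.0, 12.5, np.nan, 8.0],
    "Passenger_Count": [1.0, 2.0, np.nan, 3.0],
})
print(share_of_rows_with_nulls(toy_rows))
print(len(toy_rows), "->", len(toy_rows.dropna()))
```

On the notebook's data, `share_of_rows_with_nulls(taxidata.df)` should agree with the total of the non-zero rows in the distribution printed above.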

In [None]:
taxidata.df = taxidata.df.dropna()
taxidata.df.info()

### Exporting to CSV
Exporting the new dataset to CSV so it can be ingested by the actual price-prediction ML model.

In [None]:
taxidata.to_csv(ALTERED_CSV_PATH)

### Testing new dataset
Simply loading it back in for test purposes.

In [None]:
import pandas as pd
df = pd.read_csv(ALTERED_CSV_PATH)
df.info()