# Basic EDA

In [70]:
from taxipred.backend.data_processing import TaxiData

taxidata = TaxiData()



In [71]:
# use info() to see the column names, the number of null values, and the dtypes
taxidata.df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Trip_Distance_km       950 non-null    float64
 1   Time_of_Day            950 non-null    object 
 2   Day_of_Week            950 non-null    object 
 3   Passenger_Count        950 non-null    float64
 4   Traffic_Conditions     950 non-null    object 
 5   Weather                950 non-null    object 
 6   Base_Fare              950 non-null    float64
 7   Per_Km_Rate            950 non-null    float64
 8   Per_Minute_Rate        950 non-null    float64
 9   Trip_Duration_Minutes  950 non-null    float64
 10  Trip_Price             951 non-null    float64
dtypes: float64(7), object(4)
memory usage: 86.1+ KB


In [72]:
# look at the first rows of the dataset to better understand the columns
taxidata.df.head(10)


Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.8,0.32,53.82,36.2624
1,47.59,Afternoon,Weekday,1.0,High,Clear,,0.62,0.43,40.57,
2,36.87,Evening,Weekend,1.0,High,Clear,2.7,1.21,0.15,37.27,52.9032
3,30.33,Evening,Weekday,4.0,Low,,3.48,0.51,0.15,116.81,36.4698
4,,Evening,Weekday,3.0,High,Clear,2.93,0.63,0.32,22.64,15.618
5,8.64,Afternoon,Weekend,2.0,Medium,Clear,2.55,1.71,0.48,89.33,60.2028
6,3.85,Afternoon,Weekday,4.0,High,Rain,3.51,1.66,,5.05,11.2645
7,43.44,Evening,Weekend,3.0,,Clear,2.97,1.87,0.23,,101.1216
8,30.45,Morning,Weekday,3.0,High,Clear,2.77,1.78,0.34,110.33,
9,35.7,Afternoon,Weekday,2.0,Low,Rain,3.39,1.52,0.47,,75.5657


In [73]:
import numpy as np
# figure out important correlations
# suspecting a correlation abnormally close to 1 for a couple of these columns
matrix = taxidata.df.select_dtypes(include=np.number).corr()
print(matrix)

                       Trip_Distance_km  Passenger_Count  Base_Fare  \
Trip_Distance_km               1.000000        -0.048397   0.032218   
Passenger_Count               -0.048397         1.000000   0.022932   
Base_Fare                      0.032218         0.022932   1.000000   
Per_Km_Rate                   -0.017041         0.030213   0.003092   
Per_Minute_Rate               -0.025902         0.034068  -0.019150   
Trip_Duration_Minutes         -0.022102         0.022845   0.012035   
Trip_Price                     0.849123        -0.014223   0.035533   

                       Per_Km_Rate  Per_Minute_Rate  Trip_Duration_Minutes  \
Trip_Distance_km         -0.017041        -0.025902              -0.022102   
Passenger_Count           0.030213         0.034068               0.022845   
Base_Fare                 0.003092        -0.019150               0.012035   
Per_Km_Rate               1.000000         0.029241               0.027199   
Per_Minute_Rate           0.029241       

## My Feature Selection Plan and Justification

### The Main Predictor
After looking at the data, **`Trip_Distance_km`** is clearly the biggest factor for the **`Trip_Price`**. The correlation matrix backs this up with a coefficient of about 0.85 between the two, so it's the main feature I'll be using.
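A minimal sketch of how the correlation check can be turned into a ranking of features against the target. The helper name `rank_by_target_correlation` and the synthetic `demo` frame (where the price is simply made up as distance times a constant plus noise) are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration, NOT the real dataset).
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "Trip_Distance_km": rng.uniform(1, 50, n),
    "Passenger_Count": rng.integers(1, 5, n).astype(float),
    "Base_Fare": rng.uniform(2, 5, n),
})
# Made-up price: proportional to distance plus random noise.
demo["Trip_Price"] = 1.2 * demo["Trip_Distance_km"] + rng.normal(0, 8, n)


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every numeric column with `target`, strongest first."""
    corr = df.select_dtypes(include=np.number).corr()
    return corr[target].drop(target).sort_values(
        key=lambda s: s.abs(), ascending=False
    )


ranking = rank_by_target_correlation(demo, "Trip_Price")
print(ranking)
```

On the real data the same call would be `rank_by_target_correlation(taxidata.df, "Trip_Price")`.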

***
### Dropping Columns to Avoid Data Leakage

I'm dropping several columns to ensure my model is realistic and doesn't "cheat" by looking at parts of the answer.

**Fare Component Columns (`Base_Fare`, Rates, etc.)**

My initial thought was that **`Base_Fare`**, **`Per_Km_Rate`**, and **`Per_Minute_Rate`** are used to calculate the final price. The low correlation values were confusing, so I decided to manually verify this to be sure.

First, I needed a complete row of data to work with, so I chose **Row 0** since it had no missing values. Based on the column names, I pieced together the most likely formula:

`Total Price = Base_Fare + (Trip_Distance_km * Per_Km_Rate) + (Trip_Duration_Minutes * Per_Minute_Rate)`

I then plugged in the numbers from Row 0 to test this theory:

* **Base Fare:** `3.56`
* **Distance Cost:** `19.35 km * 0.80` = `15.48`
* **Duration Cost:** `53.82 min * 0.32` = `17.2224`

When I summed these components, the result was **36.2624**, an exact match for the actual **`Trip_Price`**. So, at least for this row, the price is a direct result of these columns, which confirms the data leakage I suspected.
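The Row 0 check can be repeated on more rows. This sketch applies the same formula to rows 0, 2, 3 and 5 of the `head(10)` output above, which are the rows whose numeric fare columns are all present. The helper names `formula_price` and `formula_matches` and the tolerance are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Rows 0, 2, 3 and 5 copied from the head(10) output (numeric fare columns only).
sample = pd.DataFrame(
    {
        "Trip_Distance_km": [19.35, 36.87, 30.33, 8.64],
        "Base_Fare": [3.56, 2.70, 3.48, 2.55],
        "Per_Km_Rate": [0.80, 1.21, 0.51, 1.71],
        "Per_Minute_Rate": [0.32, 0.15, 0.15, 0.48],
        "Trip_Duration_Minutes": [53.82, 37.27, 116.81, 89.33],
        "Trip_Price": [36.2624, 52.9032, 36.4698, 60.2028],
    },
    index=[0, 2, 3, 5],
)


def formula_price(df: pd.DataFrame) -> pd.Series:
    """Price implied by the suspected fare formula."""
    return (
        df["Base_Fare"]
        + df["Trip_Distance_km"] * df["Per_Km_Rate"]
        + df["Trip_Duration_Minutes"] * df["Per_Minute_Rate"]
    )


def formula_matches(df: pd.DataFrame, atol: float = 1e-6) -> pd.Series:
    """True where the formula reproduces Trip_Price (rows with NaN give False)."""
    close = np.isclose(formula_price(df), df["Trip_Price"], atol=atol)
    return pd.Series(close, index=df.index)


matches = formula_matches(sample)
print(matches)
```

Running `formula_matches(taxidata.df).sum()` on the full frame would show how many rows the formula reproduces, which is a stronger check than a single row.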

**The Trip Duration Problem**

I'm also dropping **`Trip_Duration_Minutes`**. This was a tricky one since duration and price are clearly connected. However, the column in this dataset is the *actual* time the trip took, which is something I'd only know *after* it's over. For my model to be realistic, it has to predict the price from stuff I'd know at the start.

If I had start and stop locations, I would have used an API to get an *estimated* duration and used that as a feature. Since I don't have that, using the actual duration is just cheating.

***
### Final Approach

Based on this, I'll move forward using **`Trip_Distance_km`**, **`Passenger_Count`**, and my categorical features **`Time_of_Day`**, **`Day_of_Week`**, and **`Traffic_Conditions`** to build the model.
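A rough sketch of how this feature set might be assembled into a model-ready matrix. One-hot encoding via `pd.get_dummies`, treating `Passenger_Count` as numeric, and the helper name `build_feature_matrix` are all assumptions on my part. The three sample rows are rows 0, 2 and 5 of the `head(10)` output.

```python
import pandas as pd

NUMERIC_FEATURES = ["Trip_Distance_km", "Passenger_Count"]
CATEGORICAL_FEATURES = ["Time_of_Day", "Day_of_Week", "Traffic_Conditions"]


def build_feature_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the selected columns and one-hot encode the categorical ones."""
    selected = df[NUMERIC_FEATURES + CATEGORICAL_FEATURES]
    return pd.get_dummies(selected, columns=CATEGORICAL_FEATURES)


sample = pd.DataFrame({
    "Trip_Distance_km": [19.35, 36.87, 8.64],
    "Passenger_Count": [3.0, 1.0, 2.0],
    "Time_of_Day": ["Morning", "Evening", "Afternoon"],
    "Day_of_Week": ["Weekday", "Weekend", "Weekend"],
    "Traffic_Conditions": ["Low", "High", "Medium"],
    "Trip_Duration_Minutes": [53.82, 37.27, 89.33],  # present, but not selected
})

X = build_feature_matrix(sample)
print(X.columns.tolist())
```

The dropped leakage columns never reach `X` because only the listed features are selected.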

### Repairing Key Columns Using the Fare Formula

Now that the exact mathematical formula connecting the fare components has been identified, I can use it as a powerful tool for data repair.

By algebraically rearranging this formula, I can calculate and fill in missing values for my key columns: the target variable **`Trip_Price`** and the main feature **`Trip_Distance_km`**. The process is deterministic, so a value can be recovered exactly whenever every other term of the formula is present in the row, salvaging valuable rows that would otherwise be dropped.
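As a worked example of the rearrangement, row 4 of the `head(10)` output has a missing `Trip_Distance_km` but every other fare column. Solving the formula for distance recovers it. The function name `solve_distance` is just an illustrative helper.

```python
def solve_distance(price, base_fare, km_rate, minute_rate, duration):
    """Fare formula rearranged for distance; only valid when km_rate != 0."""
    return (price - base_fare - duration * minute_rate) / km_rate


# Row 4 from the head(10) output, where Trip_Distance_km is NaN
distance = solve_distance(
    price=15.618,
    base_fare=2.93,
    km_rate=0.63,
    minute_rate=0.32,
    duration=22.64,
)
print(round(distance, 2))  # 8.64
```

The same rearrangement cannot be used when `Per_Km_Rate` is zero or when any of the other terms is itself missing.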

In [None]:
import pandas as pd
import numpy as np

COLUMNS = {
    "DISTANCE": "Trip_Distance_km",
    "BASE_FARE": "Base_Fare",
    "KM_RATE": "Per_Km_Rate",
    "MIN_RATE": "Per_Minute_Rate", 
    "DURATION": "Trip_Duration_Minutes",
    "PRICE": "Trip_Price"
}

def _log_repair_status(df: pd.DataFrame, stage: str, repaired_cols: list):
    """Internal helper function to log the status of missing values."""
    print(f"--- {stage} ---")
    print("Missing values in key columns:")
    print(df[repaired_cols].isnull().sum())
    print("-" * 25)

def repair_taxi_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Repairs price and distance columns using a known fare calculation formula
    and logs the changes.

    This function modifies the DataFrame in place.
    """
    # Log the initial state before any changes are made
    _log_repair_status(df, 'Before Repair', [COLUMNS["PRICE"], COLUMNS["DISTANCE"]])

    # --- Repair Missing Trip Price ---
    price_components = [
        COLUMNS["BASE_FARE"], COLUMNS["DISTANCE"], COLUMNS["KM_RATE"],
        COLUMNS["DURATION"], COLUMNS["MIN_RATE"]
    ]
    mask_repair_price = df[COLUMNS["PRICE"]].isnull() & df[price_components].notnull().all(axis=1)

    if mask_repair_price.any():
        df.loc[mask_repair_price, COLUMNS["PRICE"]] = (
            df.loc[mask_repair_price, COLUMNS["BASE_FARE"]] +
            (df.loc[mask_repair_price, COLUMNS["DISTANCE"]] * df.loc[mask_repair_price, COLUMNS["KM_RATE"]]) +
            (df.loc[mask_repair_price, COLUMNS["DURATION"]] * df.loc[mask_repair_price, COLUMNS["MIN_RATE"]])
        )

    # --- Repair Missing Trip Distance ---
    distance_components = [
        COLUMNS["BASE_FARE"], COLUMNS["KM_RATE"], COLUMNS["MIN_RATE"],
        COLUMNS["DURATION"], COLUMNS["PRICE"]
    ]
    mask_repair_distance = df[COLUMNS["DISTANCE"]].isnull() & df[distance_components].notnull().all(axis=1)
    mask_repair_distance &= (df[COLUMNS["KM_RATE"]] != 0)

    if mask_repair_distance.any():
        df.loc[mask_repair_distance, COLUMNS["DISTANCE"]] = (
            (df.loc[mask_repair_distance, COLUMNS["PRICE"]] -
             df.loc[mask_repair_distance, COLUMNS["BASE_FARE"]] -
             (df.loc[mask_repair_distance, COLUMNS["DURATION"]] * df.loc[mask_repair_distance, COLUMNS["MIN_RATE"]])) /
            df.loc[mask_repair_distance, COLUMNS["KM_RATE"]]
        )
    
    # Log the final state after all repairs are attempted
    _log_repair_status(df, 'After Repair', [COLUMNS["PRICE"], COLUMNS["DISTANCE"]])
    
    return df

# --- HOW TO USE THE FUNCTION ---

# 1. Build a small sample DataFrame (replace with your actual data)
data = {
    COLUMNS["DISTANCE"]: [19.35, 47.59, 36.87, np.nan, 8.64, np.nan],
    COLUMNS["BASE_FARE"]: [3.56, 2.95, 2.70, 3.48, 2.55, 3.10],
    COLUMNS["KM_RATE"]: [0.80, 0.62, 1.21, 0.51, 1.71, 0.90],
    COLUMNS["MIN_RATE"]: [0.32, 0.43, 0.15, 0.15, 0.48, 0.30],
    COLUMNS["DURATION"]: [53.82, 40.57, 37.27, 116.81, 89.33, 60.0],
    COLUMNS["PRICE"]: [36.2624, np.nan, 52.9032, 36.4698, 60.2028, 62.0]
}
taxidata_df = pd.DataFrame(data)

# 2. Run the repair function. It will now print the before/after status automatically.
repaired_df = repair_taxi_data(taxidata_df)



--- Before Repair ---
Missing values in key columns:
Trip_Price          1
Trip_Distance_km    2
dtype: int64
-------------------------
--- After Repair ---
Missing values in key columns:
Trip_Price          0
Trip_Distance_km    0
dtype: int64
-------------------------


In [None]:

repaired_df = repair_taxi_data(taxidata.df)

--- Before Repair ---
Missing values in key columns:
Trip_Price          49
Trip_Distance_km    50
dtype: int64
-------------------------
--- After Repair ---
Missing values in key columns:
Trip_Price          17
Trip_Distance_km     6
dtype: int64
-------------------------


Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.80,0.32,53.82,36.2624
1,47.59,Afternoon,Weekday,1.0,High,Clear,,0.62,0.43,40.57,
2,36.87,Evening,Weekend,1.0,High,Clear,2.70,1.21,0.15,37.27,52.9032
3,30.33,Evening,Weekday,4.0,Low,,3.48,0.51,0.15,116.81,36.4698
4,8.64,Evening,Weekday,3.0,High,Clear,2.93,0.63,0.32,22.64,15.6180
...,...,...,...,...,...,...,...,...,...,...,...
995,5.49,Afternoon,Weekend,4.0,Medium,Clear,2.39,0.62,0.49,58.39,34.4049
996,45.95,Night,Weekday,4.0,Medium,Clear,3.12,0.61,,61.96,62.1295
997,7.70,Morning,Weekday,3.0,Low,Rain,2.08,1.78,,54.18,33.1236
998,47.56,Morning,Weekday,1.0,Low,Clear,2.67,0.82,0.17,114.94,61.2090


### Identifying the Makeup of Nulls
There are a fair number of null values in the dataset.
I want to see how spread out they are: whether some rows have an overwhelming number of null values, or whether the nulls are limited to a single column.
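One possible way to profile the nulls is to count them per column and then count how many rows have 0, 1, 2, ... nulls. The helper name `null_profile` is an assumption. The sample uses four columns from rows 0 to 4 of the `head(10)` output.

```python
import numpy as np
import pandas as pd


def null_profile(df: pd.DataFrame):
    """Return (nulls per column, number of rows having k nulls)."""
    per_column = df.isnull().sum()
    per_row = df.isnull().sum(axis=1).value_counts().sort_index()
    return per_column, per_row


# Rows 0-4 of the head(10) output, restricted to four columns
sample = pd.DataFrame({
    "Trip_Distance_km": [19.35, 47.59, 36.87, 30.33, np.nan],
    "Weather": ["Clear", "Clear", "Clear", None, "Clear"],
    "Base_Fare": [3.56, np.nan, 2.70, 3.48, 2.93],
    "Trip_Price": [36.2624, np.nan, 52.9032, 36.4698, 15.618],
})

per_column, per_row = null_profile(sample)
print(per_column)
print(per_row)
```

If most rows land in the 0 or 1 bucket, the nulls are scattered across the dataset and not concentrated in a few badly broken rows.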

In [76]:
df_with_max_1_null = taxidata.df[taxidata.df.isnull().sum(axis=1) <2]
df_with_max_1_null.info()
df_with_max_1_null["Weather"]

<class 'pandas.core.frame.DataFrame'>
Index: 917 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Trip_Distance_km       917 non-null    float64
 1   Time_of_Day            876 non-null    object 
 2   Day_of_Week            887 non-null    object 
 3   Passenger_Count        886 non-null    float64
 4   Traffic_Conditions     881 non-null    object 
 5   Weather                883 non-null    object 
 6   Base_Fare              886 non-null    float64
 7   Per_Km_Rate            887 non-null    float64
 8   Per_Minute_Rate        883 non-null    float64
 9   Trip_Duration_Minutes  888 non-null    float64
 10  Trip_Price             917 non-null    float64
dtypes: float64(7), object(4)
memory usage: 86.0+ KB


0      Clear
2      Clear
3        NaN
4      Clear
5      Clear
       ...  
995    Clear
996    Clear
997     Rain
998    Clear
999    Clear
Name: Weather, Length: 917, dtype: object

In [77]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np



feature_columns = ["Trip_Distance_km", "Passenger_Count"]
X = df_with_max_1_null[feature_columns]

y = df_with_max_1_null["Trip_Price"]  # target variable

# Initialize the model
model = LinearRegression()

# Split data into training and testing sets for robust evaluation
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Train the model using the training data
model.fit(Xtrain, ytrain)

# 2. Make predictions on the unseen test data
predictions = model.predict(Xtest)

# 3. Evaluate the model's performance
mse = mean_squared_error(ytest, predictions)
r2 = r2_score(ytest, predictions)

print("Model Performance on Test Data:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

# Inspect the learned parameters: one coefficient per feature
print("\nModel Parameters:")
for name, coef in zip(feature_columns, model.coef_):
    print(f"Coefficient for {name}: {coef:.2f}")
print(f"Intercept: {model.intercept_:.2f}")

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
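The error comes from NaN values left in the feature columns: the `info()` output for `df_with_max_1_null` shows `Passenger_Count` with only 886 non-null values out of 917 rows. One possible fix, sketched here, is to drop the rows that are missing any model column before splitting. Imputing the values is the alternative the error message suggests. The helper name `drop_incomplete_rows` and the sample frame (with one `Passenger_Count` deliberately blanked out) are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_incomplete_rows(df: pd.DataFrame, feature_columns, target: str) -> pd.DataFrame:
    """Keep only the rows where every feature and the target are present."""
    return df.dropna(subset=list(feature_columns) + [target])


feature_columns = ["Trip_Distance_km", "Passenger_Count"]

# Hypothetical sample: one Passenger_Count is blanked out to mimic the problem
sample = pd.DataFrame({
    "Trip_Distance_km": [19.35, 36.87, 30.33, 8.64],
    "Passenger_Count": [3.0, np.nan, 4.0, 2.0],
    "Trip_Price": [36.2624, 52.9032, 36.4698, 60.2028],
})

clean = drop_incomplete_rows(sample, feature_columns, "Trip_Price")
X = clean[feature_columns]
y = clean["Trip_Price"]
print(len(sample), "->", len(clean), "rows")
```

With `X` and `y` built this way from `df_with_max_1_null`, the `train_test_split` and `model.fit` calls from the cell above no longer receive NaN input.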