# Flight Price Prediction with Scikit-learn (Gradient Boosting)

In this notebook, we:
- Preprocess the dataset using pandas and sklearn
- Train a Gradient Boosting Regressor
- Evaluate the model using standard regression metrics
- Prepare results for comparison with a PySpark version

In [1]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [2]:
# Load Dataset
df = pd.read_csv("/kaggle/input/flight-price-prediction/Clean_Dataset.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | SpiceJet | SG-8709 | Delhi | Evening | zero | Night | Mumbai | Economy | 2.17 | 1 | 5953 |
| 1 | 1 | SpiceJet | SG-8157 | Delhi | Early_Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5953 |
| 2 | 2 | AirAsia | I5-764 | Delhi | Early_Morning | zero | Early_Morning | Mumbai | Economy | 2.17 | 1 | 5956 |
| 3 | 3 | Vistara | UK-995 | Delhi | Morning | zero | Afternoon | Mumbai | Economy | 2.25 | 1 | 5955 |
| 4 | 4 | Vistara | UK-963 | Delhi | Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5955 |


### Step 1: Clean & Preprocess the Data

In [3]:
# Drop columns not used as features: "Unnamed: 0" (a copy of the row index)
# and "flight" (the flight code)
df.drop(columns=["Unnamed: 0", "flight"], inplace=True)

# Encode 'stops' to ordinal
stop_mapping = {
    "zero": 0,
    "one": 1,
    "two_or_more": 2
}
df["stops"] = df["stops"].map(stop_mapping)

# Encode 'class' as a binary flag (Economy = 0, Business = 1)
df["class"] = df["class"].map({"Economy": 0, "Business": 1})

# Check for nulls
print("Missing values:\n", df.isnull().sum())

Missing values:
 airline             0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64
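
One caveat with the dictionary mappings above: `Series.map` returns `NaN` for any label that is missing from the dictionary, so a typo or a new category would only show up later in the null check. The sketch below is one possible guard, not part of the original notebook; the helper name `map_strict` is hypothetical and the data is a toy column.

```python
import pandas as pd


def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers, raising if any label is missing from `mapping`.

    `map_strict` is a hypothetical helper introduced for illustration.
    """
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped labels in '{series.name}': {sorted(unknown)}")
    return series.map(mapping)


# Toy column using the three labels from the notebook
stops = pd.Series(["zero", "one", "two_or_more", "one"], name="stops")
stop_mapping = {"zero": 0, "one": 1, "two_or_more": 2}

encoded = map_strict(stops, stop_mapping)
print(encoded.tolist())  # [0, 1, 2, 1]
```

Failing early keeps an unexpected label from turning into a missing value that the model pipeline would then reject or silently mishandle.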


### Step 2: Feature Encoding with OneHotEncoder
We'll use `ColumnTransformer` to one-hot encode only the nominal categorical features; all other columns pass through unchanged.

In [4]:
# Define features and target
X = df.drop("price", axis=1)
y = df["price"]

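# Nominal columns to one-hot encode
categorical_cols = ["airline", "source_city", "departure_time", "arrival_time", "destination_city"]
# Already-numeric columns (listed for reference; they reach the model via remainder="passthrough")
numeric_cols = ["stops", "class", "duration", "days_left"]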

# Column Transformer for preprocessing
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols)
], remainder="passthrough")

### Step 3: Train-Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 4: Build and Train Gradient Boosting Model

In [6]:
# Build pipeline
model = Pipeline([
    ("preprocess", preprocessor),
    ("gb", GradientBoostingRegressor(random_state=42))
])

# Fit model
model.fit(X_train, y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



### Step 5: Evaluate the Model

In [7]:
from math import sqrt

# Predictions
y_pred = model.predict(X_test)

# Evaluation metrics
rmse = sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R² Score: {r2:.4f}")

RMSE: 4998.02
MAE: 2947.88
R² Score: 0.9515
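
For reference, the three metrics can be reproduced by hand. The sketch below uses four made-up values (not the flight data) to show the definitions that `mean_squared_error`, `mean_absolute_error`, and `r2_score` implement.

```python
import numpy as np

# Made-up targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat  # [1, 0, -1, -2]

# RMSE: square root of the mean squared residual -> sqrt(6 / 4)
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# MAE: mean absolute residual -> 4 / 4
mae_manual = np.mean(np.abs(residuals))

# R^2: 1 - (residual sum of squares / total sum of squares) -> 1 - 6 / 20
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f}")   # RMSE: 1.22
print(f"MAE: {mae_manual:.2f}")     # MAE: 1.00
print(f"R² Score: {r2_manual:.4f}") # R² Score: 0.7000
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE, and a wide gap between the two (as in the results above) usually points to a tail of larger errors.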


## Summary

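- **Preprocessing:** One-hot encoded the five nominal columns, mapped `stops` and `class` to integers, and passed `duration` and `days_left` through unchanged.
- **Model:** `GradientBoostingRegressor` from `sklearn` with default hyperparameters (`random_state=42`).
- **Evaluation:** On the 20% held-out test set, RMSE = 4998.02, MAE = 2947.88, and R² = 0.9515.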
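
To make these numbers available for the PySpark comparison mentioned at the top, one option is to write them to a small JSON file. This is a sketch under assumptions: the helper `save_metrics`, the file name `sklearn_gb_metrics.json`, and the temporary-directory location are all hypothetical, and the metric values are copied from the output above.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(path: Path, model_name: str, rmse: float, mae: float, r2: float) -> dict:
    """Write one model's test metrics to `path` as JSON and return the record."""
    record = {"model": model_name, "rmse": rmse, "mae": mae, "r2": r2}
    path.write_text(json.dumps(record, indent=2))
    return record


# Hypothetical output location; adjust to wherever the comparison notebook reads from
out_path = Path(tempfile.gettempdir()) / "sklearn_gb_metrics.json"

record = save_metrics(
    out_path,
    model_name="sklearn_gradient_boosting",
    rmse=4998.02,
    mae=2947.88,
    r2=0.9515,
)

# Read the file back, as a later comparison step might
loaded = json.loads(out_path.read_text())
print(loaded)
```

In the notebook itself, the `rmse`, `mae`, and `r2` variables from Step 5 could be passed in directly instead of the literal values.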