
In [1]:
#Step - 1 Analysing the Dataset
import pandas as pd
import numpy as np

In [2]:
#Load file
flight_ticket=pd.read_csv("/content/Flight_Ticket_Prediction_Dataset.csv")

In [3]:
flight_ticket.shape

(256006, 12)

In [4]:
print(flight_ticket.columns)

Index(['Unnamed: 0', 'airline', 'flight', 'source_city', 'departure_time',
       'stops', 'arrival_time', 'destination_city', 'class', 'duration',
       'days_left', 'price'],
      dtype='object')


In [5]:
flight_ticket

Unnamed: 0.1,Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1.0,5953.0
1,1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1.0,5953.0
2,2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1.0,5956.0
3,3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1.0,5955.0
4,4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1.0,5955.0
...,...,...,...,...,...,...,...,...,...,...,...,...
256001,256001,Air_India,AI-503,Bangalore,Evening,one,Evening,Hyderabad,Business,25.50,2.0,59253.0
256002,256002,Vistara,UK-808,Bangalore,Early_Morning,one,Evening,Hyderabad,Business,9.00,2.0,59948.0
256003,256003,Vistara,UK-808,Bangalore,Early_Morning,one,Evening,Hyderabad,Business,11.92,2.0,59948.0
256004,256004,Vistara,UK-818,Bangalore,Evening,one,Morning,Hyderabad,Business,14.00,2.0,59948.0


In [6]:
print(flight_ticket.head())

   Unnamed: 0   airline   flight source_city departure_time stops  \
0           0  SpiceJet  SG-8709       Delhi        Evening  zero   
1           1  SpiceJet  SG-8157       Delhi  Early_Morning  zero   
2           2   AirAsia   I5-764       Delhi  Early_Morning  zero   
3           3   Vistara   UK-995       Delhi        Morning  zero   
4           4   Vistara   UK-963       Delhi        Morning  zero   

    arrival_time destination_city    class  duration  days_left   price  
0          Night           Mumbai  Economy      2.17        1.0  5953.0  
1        Morning           Mumbai  Economy      2.33        1.0  5953.0  
2  Early_Morning           Mumbai  Economy      2.17        1.0  5956.0  
3      Afternoon           Mumbai  Economy      2.25        1.0  5955.0  
4        Morning           Mumbai  Economy      2.33        1.0  5955.0  


In [7]:
#Step - 2 Preprocessing
#missing values check
flight_ticket.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
airline,0
flight,0
source_city,0
departure_time,0
stops,0
arrival_time,0
destination_city,0
class,1
duration,1


In [8]:
flight_ticket.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256006 entries, 0 to 256005
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        256006 non-null  int64  
 1   airline           256006 non-null  object 
 2   flight            256006 non-null  object 
 3   source_city       256006 non-null  object 
 4   departure_time    256006 non-null  object 
 5   stops             256006 non-null  object 
 6   arrival_time      256006 non-null  object 
 7   destination_city  256006 non-null  object 
 8   class             256005 non-null  object 
 9   duration          256005 non-null  float64
 10  days_left         256005 non-null  float64
 11  price             256005 non-null  float64
dtypes: float64(3), int64(1), object(8)
memory usage: 23.4+ MB


In [9]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [10]:
print(flight_ticket.columns.tolist())

['Unnamed: 0', 'airline', 'flight', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class', 'duration', 'days_left', 'price']


"In preprocessing, I first identify unnecessary columns, such as index IDs and unique flight codes, that add no predictive value and can be dropped. I then separate categorical and numerical features, because they need different preprocessing techniques. I also checked for missing values: class, duration, days_left and price each have one missing entry, so I can decide whether to drop or impute them. This step ensures the dataset is clean and ready for encoding and model building."

In [11]:
#Step 3: Encoding Categorical Features with One-Hot

"I dropped only index-like or unique identifier columns (Unnamed: 0, flight) as they don’t contribute to prediction. However, I kept categorical variables like airline, source_city, and class, since they influence the price and will be encoded later."

In [12]:
#Drop unwanted columns
flight_ticket = flight_ticket.drop(["Unnamed: 0","flight"], errors="ignore", axis=1)
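
The preprocessing notes above separate categorical from numerical features and check for missing values. A minimal sketch of that inspection on a small made-up frame; the `toy` data and the names `categorical_cols`, `numerical_cols` and `missing_counts` are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Toy frame with the same kinds of columns as the flight dataset (values are made up).
toy = pd.DataFrame({
    "airline": ["SpiceJet", "Vistara", "AirAsia"],
    "class": ["Economy", "Economy", None],
    "duration": [2.17, 2.25, None],
    "price": [5953.0, 5955.0, 5956.0],
})

# Numeric columns are treated as numerical; everything else as categorical.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column to decide between dropping and imputing.
missing_counts = toy.isnull().sum()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
print(missing_counts[missing_counts > 0])
```

With only a handful of missing rows out of several hundred thousand, dropping them is usually simpler than imputing, which is the choice the notebook makes later.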

In [13]:
# Encode categorical columns directly(one-hot)
flight_ticket_encoded = pd.get_dummies(flight_ticket, drop_first=True)

flight_ticket_encoded.head()


Unnamed: 0,duration,days_left,price,airline_Air_India,airline_GO_FIRST,airline_Indigo,airline_SpiceJet,airline_Vistara,source_city_Chennai,source_city_Delhi,...,arrival_time_Late_Night,arrival_time_Morning,arrival_time_Night,destination_city_Chennai,destination_city_Delhi,destination_city_Hyderaba,destination_city_Hyderabad,destination_city_Kolkata,destination_city_Mumbai,class_Economy
0,2.17,1.0,5953.0,False,False,False,True,False,False,True,...,False,False,True,False,False,False,False,False,True,True
1,2.33,1.0,5953.0,False,False,False,True,False,False,True,...,False,True,False,False,False,False,False,False,True,True
2,2.17,1.0,5956.0,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,True,True
3,2.25,1.0,5955.0,False,False,False,False,True,False,True,...,False,False,False,False,False,False,False,False,True,True
4,2.33,1.0,5955.0,False,False,False,False,True,False,True,...,False,True,False,False,False,False,False,False,True,True


In [17]:
# Drop rows with missing values
flight_ticket_encoded.dropna(inplace=True)

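# Convert boolean dummy columns to 0/1.
# Note: astype(int) on the whole frame also truncates the float columns (duration 2.17 becomes 2).
flight_ticket_encoded = flight_ticket_encoded.astype(int)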

flight_ticket_encoded.head()

Unnamed: 0,duration,days_left,price,airline_Air_India,airline_GO_FIRST,airline_Indigo,airline_SpiceJet,airline_Vistara,source_city_Chennai,source_city_Delhi,...,arrival_time_Late_Night,arrival_time_Morning,arrival_time_Night,destination_city_Chennai,destination_city_Delhi,destination_city_Hyderaba,destination_city_Hyderabad,destination_city_Kolkata,destination_city_Mumbai,class_Economy
0,2,1,5953,0,0,0,1,0,0,1,...,0,0,1,0,0,0,0,0,1,1
1,2,1,5953,0,0,0,1,0,0,1,...,0,1,0,0,0,0,0,0,1,1
2,2,1,5956,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,1
3,2,1,5955,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,1,1
4,2,1,5955,0,0,0,0,1,0,1,...,0,1,0,0,0,0,0,0,1,1
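
As the output above shows, casting the whole frame to `int` also turns `duration` 2.17 into 2. If the fractional hours should be kept, one option is to cast only the dummy columns. A minimal sketch on made-up data, assuming a pandas version where `get_dummies` returns boolean columns (as in the output earlier); `toy`, `encoded` and `bool_cols` are illustrative names:

```python
import pandas as pd

# Small made-up frame with one float column and one categorical column.
toy = pd.DataFrame({
    "duration": [2.17, 2.33],
    "class": ["Economy", "Business"],
})
encoded = pd.get_dummies(toy, drop_first=True)

# Cast only the boolean dummy columns to int; float columns keep their decimals.
bool_cols = encoded.select_dtypes(include="bool").columns
encoded = encoded.astype({col: int for col in bool_cols})

print(encoded)
print(encoded.dtypes)
```

This variant would change the model inputs, so the metrics reported further down would not be expected to match exactly.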


In [16]:
# Drop rows with missing values (This step will be moved after encoding)
# flight_ticket.dropna(inplace=True)

"Since the dataset contains categorical variables like airline, source_city, and class, I applied one-hot encoding to convert them into numerical format. I used pd.get_dummies with drop_first=True, which encodes the categorical columns and keeps the numerical features as they are; a ColumnTransformer with OneHotEncoder (imported earlier) would be an alternative. This ensures the model can process both types of data correctly."

In [18]:
#Step -4 Train_Test_Split
from sklearn.model_selection import train_test_split

In [19]:
#Separate features(x), and target(y)
x=flight_ticket_encoded.drop("price", axis=1) #all columns except target
y=flight_ticket_encoded["price"] #target column

In [20]:
#Split into Train(80%) and Test(20%)
x_train, x_test, y_train, y_test = train_test_split(
    x,y, test_size=0.2, random_state=42
)
print("Train shape:", x_train.shape, y_train.shape)
print("Test shape:", x_test.shape, y_test.shape)

Train shape: (204804, 31) (204804,)
Test shape: (51201, 31) (51201,)
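
The split above fixes `random_state=42`, which makes the partition reproducible. A small sketch of that behaviour on made-up arrays (`x_toy` and `y_toy` are illustrative, not the flight data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed produce the same partition.
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.2, random_state=42)
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(x_toy, y_toy, test_size=0.2, random_state=42)

print("Train shape:", x_tr.shape, "Test shape:", x_te.shape)
print("Same test rows on both runs:", np.array_equal(x_te, x_te2))
print("Train/test overlap:", set(y_tr) & set(y_te))
```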


"In this step, I separated the dataset into features (X) and target (y), where the target is price. Then I used train_test_split to divide the data into training (80%) and testing (20%) sets. This helps in evaluating how well the model generalizes to unseen data."

In [21]:
#Step - 5 Model Training (Linear Regression)

In [22]:
from sklearn.linear_model import LinearRegression

In [28]:
lr=LinearRegression()
lr.fit(x_train, y_train)
lr_pred = lr.predict(x_test)

In [29]:
#Evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

In [30]:
print("MAE:", mean_absolute_error(y_test, lr_pred))
print("RMSE:", np.sqrt(mean_squared_error( y_test, lr_pred)))
print("R2:", r2_score(y_test, lr_pred))

MAE: 3806.437395154933
RMSE: 5836.652813681286
R2: 0.9074763029731632
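
For reference, the three metrics printed above can be computed by hand. The sketch below uses made-up prices and predictions (`y_true` and `y_pred` are illustrative, not model output) and follows the standard definitions of MAE, RMSE and R²:

```python
import numpy as np

# Made-up true prices and predictions, for illustration only.
y_true = np.array([5953.0, 5956.0, 5955.0, 59948.0])
y_pred = np.array([6000.0, 5900.0, 6100.0, 58000.0])

errors = y_true - y_pred

# MAE: average absolute error, in the same unit as the price.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalises large misses more.
rmse = np.sqrt(np.mean(errors ** 2))

# R2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE :", mae)
print("RMSE:", rmse)
print("R2  :", r2)
```

RMSE is never smaller than MAE, and a large gap between them (as in the Linear Regression output above) suggests a few predictions are far off.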


In [31]:
#RandomForest Model
from sklearn.ensemble import RandomForestRegressor

In [32]:
#Random Forest
rf = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=42)
rf.fit(x_train, y_train)
rf_preds = rf.predict(x_test)

In [33]:
from xgboost import XGBRegressor

In [34]:
# XGBoost
xgb = XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42, verbosity=0)
xgb.fit(x_train, y_train)
xgb_preds = xgb.predict(x_test)

In [35]:
#Step 7 – Evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

print("--- Random Forest ---")
print("MAE :", mean_absolute_error(y_test, rf_preds))
print("RMSE:", np.sqrt(mean_squared_error(y_test, rf_preds)))
print("R2  :", r2_score(y_test, rf_preds))

print("\n--- XGBoost ---")
print("MAE :", mean_absolute_error(y_test, xgb_preds))
print("RMSE:", np.sqrt(mean_squared_error(y_test, xgb_preds)))
print("R2  :", r2_score(y_test, xgb_preds))

--- Random Forest ---
MAE : 1836.0036576434218
RMSE: 3296.2556168696847
R2  : 0.9704901486160951

--- XGBoost ---
MAE : 1877.60595703125
RMSE: 3311.4149543661847
R2  : 0.9702181220054626


"I trained both Random Forest and XGBoost models. Random Forest achieved an R² of about 0.9705 and XGBoost about 0.9702, so the two are practically tied and each explains around 97% of the variance in flight ticket prices, compared with about 91% for Linear Regression. Random Forest has the slightly lower MAE (about ₹1,836 versus ₹1,878), making it marginally the best-performing model at this stage. The MAE is the average absolute error in prediction, which is quite reasonable for real-world flight pricing."

In [39]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

In [40]:
xgb = XGBRegressor(random_state=42)

param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 15],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.8, 1],
    "colsample_bytree": [0.8, 1]
}


In [41]:
random_search = RandomizedSearchCV(xgb, param_distributions=param_dist,
                                   n_iter=10, scoring="r2", cv=3, random_state=42, n_jobs=-1)
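
With `n_iter=10`, the search above tries only a sample of the possible settings. A short sketch of the arithmetic, using scikit-learn's `ParameterGrid` purely to count combinations (the variable names `n_combinations` and `n_fits` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same search space as defined in the notebook.
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 15],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.8, 1],
    "colsample_bytree": [0.8, 1],
}

# Total number of distinct combinations: 3 * 3 * 3 * 2 * 2.
n_combinations = len(ParameterGrid(param_dist))

# RandomizedSearchCV fits n_iter candidates on each of the cv folds.
n_iter, cv = 10, 3
n_fits = n_iter * cv

print(n_combinations)  # 108
print(n_fits)          # 30 (plus one final refit on the full training set)
```

So the reported best parameters are the best of 10 sampled candidates, not necessarily the best of all 108.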


In [42]:
random_search.fit(x_train, y_train)

In [43]:
print("Best Parameters:", random_search.best_params_)
print("Best Score (CV R2):", random_search.best_score_)

Best Parameters: {'subsample': 1, 'n_estimators': 200, 'max_depth': 10, 'learning_rate': 0.2, 'colsample_bytree': 0.8}
Best Score (CV R2): 0.9819304943084717


In [44]:
# Evaluate on test set
best_xgb = random_search.best_estimator_
xgb_preds = best_xgb.predict(x_test)

print("Tuned XGB R2:", r2_score(y_test, xgb_preds))

Tuned XGB R2: 0.9835482835769653


"After tuning XGBoost with RandomizedSearchCV, the model’s performance improved further. The test R² increased from about 0.970 to 0.984, meaning the model can explain about 98% of the variance in ticket prices. The unexplained share of variance drops from roughly 3.0% to 1.6%, which shows that hyperparameter tuning still has a measurable impact on model accuracy."

In [45]:
import joblib

# Save tuned XGBoost model
joblib.dump(best_xgb, "xgboost_flight_model.pkl")
print("Model saved as xgboost_flight_model.pkl")


Model saved as xgboost_flight_model.pkl
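
A saved model can be restored with `joblib.load`. The sketch below shows the round trip on a tiny stand-in model so that it runs on its own; the `LinearRegression` model, the toy arrays and the temporary path are assumptions for illustration. For the real file, the input passed to `predict` would need the same 31 encoded feature columns, in the same order, as `x_train`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (not the tuned XGBoost model from the notebook).
x_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(x_toy, y_toy)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same = np.allclose(model.predict(x_toy), restored.predict(x_toy))
print("Predictions match after reload:", same)
```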
