# Data Preprocessing & Modeling
The main objectives of preprocessing & modeling are as follows:

1. Removing useless or irrelevant columns and modifying some features:

Checking the dataset for irrelevant columns that do not help the prediction, and looking for duplicated rows.

2. Data Splitting:

Dividing the dataset into features (x) and the target (y).

3. Handling Categorical Variables:

Handling categorical variables using different methods like binary encoding and label encoding.

4. Encoder Creation:

Creating an encoder using a column transformer to use it in a pipeline with different encoder types and then checking model accuracy.

5. Pipeline Creation and Model Evaluation:

Creating a pipeline using different encoders, scalers, and various types of models for evaluation.

6. Hyperparameter Tuning:

Utilizing RandomizedSearchCV for hyperparameter tuning for better performance and generalization of the model.

7. Model Saving and Deployment Preparation:

Using joblib to dump the model with the best estimator, column names (features), and creating a dummy file.

8. Deployment step.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from category_encoders import BinaryEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler,StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import cross_validate,GridSearchCV,RandomizedSearchCV
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error,make_scorer
from datasist.structdata import detect_outliers

import joblib

In [2]:
# loading the dataset.
data = pd.read_csv('cleaned_data.csv')
data

Unnamed: 0,airline,source,destination,route,duration,total_stops,price,year,month,day,day_name,holiday,season,dep_hour,dep_minute,dep_time,arrival_hour,arrival_minute,arrival_time
0,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,445,2,7662,2019,1,5,Saturday,1,Winter,5,50,Early Morning,13,15,Afternoon
1,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,325,1,6218,2019,12,5,Thursday,0,Winter,18,5,Evening,23,30,Night
2,IndiGo,Banglore,Delhi,BLR → NAG → DEL,285,1,13302,2019,1,3,Thursday,0,Winter,16,50,Evening,21,35,Night
3,SpiceJet,Kolkata,Banglore,CCU → BLR,145,0,3873,2019,6,24,Monday,0,Rainy,9,0,Morning,11,25,Morning
4,Multiple carriers,Delhi,Cochin,DEL → BOM → COK,470,1,8625,2019,5,27,Monday,0,Summer,11,25,Morning,19,15,Evening
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8417,Air Asia,Kolkata,Banglore,CCU → BLR,150,0,4107,2019,9,4,Wednesday,0,Rainy,19,55,Evening,22,25,Night
8418,Air India,Kolkata,Banglore,CCU → BLR,155,0,4145,2019,4,27,Saturday,1,Summer,20,45,Night,23,20,Night
8419,Jet Airways,Banglore,Delhi,BLR → DEL,180,0,7229,2019,4,27,Saturday,1,Summer,8,20,Morning,11,20,Morning
8420,Vistara,Banglore,Delhi,BLR → DEL,160,0,12648,2019,1,3,Thursday,0,Winter,11,30,Morning,14,10,Afternoon


In [3]:
# Getting information about the data
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8422 entries, 0 to 8421
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   airline         8422 non-null   object
 1   source          8422 non-null   object
 2   destination     8422 non-null   object
 3   route           8422 non-null   object
 4   duration        8422 non-null   int64 
 5   total_stops     8422 non-null   int64 
 6   price           8422 non-null   int64 
 7   year            8422 non-null   int64 
 8   month           8422 non-null   int64 
 9   day             8422 non-null   int64 
 10  day_name        8422 non-null   object
 11  holiday         8422 non-null   int64 
 12  season          8422 non-null   object
 13  dep_hour        8422 non-null   int64 
 14  dep_minute      8422 non-null   int64 
 15  dep_time        8422 non-null   object
 16  arrival_hour    8422 non-null   int64 
 17  arrival_minute  8422 non-null   int64 
 18  arrival_

## 1. Removing Useless or Irrelevant Columns:

In [4]:
# statistical description for numeric features
round(data.describe(),2)

Unnamed: 0,duration,total_stops,price,year,month,day,holiday,dep_hour,dep_minute,arrival_hour,arrival_minute
count,8422.0,8422.0,8422.0,8422.0,8422.0,8422.0,8422.0,8422.0,8422.0,8422.0,8422.0
mean,527.9,0.72,8446.46,2019.0,5.24,14.38,0.23,11.65,24.32,14.39,24.56
std,447.17,0.65,4415.85,0.0,2.61,8.82,0.42,5.53,18.71,6.56,16.99
min,75.0,0.0,1759.0,2019.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0
25%,170.0,0.0,4839.75,2019.0,3.0,5.0,0.0,7.0,5.0,10.0,10.0
50%,390.0,1.0,7318.0,2019.0,5.0,15.0,0.0,10.0,25.0,16.0,25.0
75%,750.0,1.0,11264.0,2019.0,6.0,21.0,0.0,17.0,40.0,19.0,35.0
max,2820.0,3.0,79512.0,2019.0,12.0,27.0,1.0,23.0,55.0,23.0,55.0


In [5]:
# statistical description for categorical features
data.describe(include="O")

Unnamed: 0,airline,source,destination,route,day_name,season,dep_time,arrival_time
count,8422,8422,8422,8422,8422,8422,8422,8422
unique,12,5,5,112,7,3,6,6
top,Jet Airways,Delhi,Cochin,DEL → BOM → COK,Thursday,Summer,Morning,Evening
freq,2575,3511,3511,1924,1795,4396,2365,2296


* route largely duplicates total_stops (the number of stops can be read off the route string), so we are going to drop route.
* year has only one unique value (2019, with a standard deviation of 0), so we are going to drop it.
* day_name, season, holiday, dep_time, and arrival_time are all derived from existing columns, so they are dropped as well.
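
The first bullet can be checked directly: the stop count is recoverable from the route string by counting airport codes. The following is a minimal sketch using the five sample rows shown in the data preview above; the helper name `stops_from_route` is hypothetical and not part of the notebook.

```python
# (route, total_stops) pairs copied from the first five rows of the data preview.
samples = [
    ("CCU → IXR → BBI → BLR", 2),
    ("CCU → NAG → BLR", 1),
    ("BLR → NAG → DEL", 1),
    ("CCU → BLR", 0),
    ("DEL → BOM → COK", 1),
]


def stops_from_route(route: str) -> int:
    """Hypothetical helper: stops = airports on the route minus origin and destination."""
    airports = [code.strip() for code in route.split("→")]
    return len(airports) - 2


# Collect every sample where the derived stop count disagrees with total_stops.
mismatches = [(r, s) for r, s in samples if stops_from_route(r) != s]
print(mismatches)  # an empty list means total_stops is determined by route for these rows
```

The reverse does not hold: route also names the intermediate airports, so dropping it trades that detail for a much lower-cardinality feature (route has 112 unique values).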

In [6]:
# Removing route, year, and the derived columns
data.drop(['route','year','season','dep_time','arrival_time','day_name','holiday'],axis=1,inplace=True)

In [7]:
# checking for duplicated rows in the data
data.duplicated().sum()

0

In [8]:
# getting copy of the data set
df = data.copy()

In [9]:
# checking data info after dropping the useless columns.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8422 entries, 0 to 8421
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   airline         8422 non-null   object
 1   source          8422 non-null   object
 2   destination     8422 non-null   object
 3   duration        8422 non-null   int64 
 4   total_stops     8422 non-null   int64 
 5   price           8422 non-null   int64 
 6   month           8422 non-null   int64 
 7   day             8422 non-null   int64 
 8   dep_hour        8422 non-null   int64 
 9   dep_minute      8422 non-null   int64 
 10  arrival_hour    8422 non-null   int64 
 11  arrival_minute  8422 non-null   int64 
dtypes: int64(9), object(3)
memory usage: 789.7+ KB


In [10]:
# checking the airline feature for low-frequency airlines to be replaced with 'other'
df.airline.value_counts()

Jet Airways                          2575
IndiGo                               1875
Air India                            1243
Multiple carriers                    1093
SpiceJet                              762
Vistara                               395
Air Asia                              273
GoAir                                 184
Multiple carriers Premium economy      13
Jet Airways Business                    5
Vistara Premium economy                 3
Trujet                                  1
Name: airline, dtype: int64

In [11]:
df.airline.unique()

array(['Air India', 'IndiGo', 'SpiceJet', 'Multiple carriers',
       'Jet Airways', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [12]:
# replacing low-frequency airlines with 'other'.
airline_list = ['Air India', 'IndiGo', 'SpiceJet', 'Multiple carriers','Jet Airways', 'GoAir', 'Vistara', 'Air Asia']
df.airline = df.airline.apply(lambda x: x if x in airline_list else 'other')
df.airline.value_counts()

Jet Airways          2575
IndiGo               1875
Air India            1243
Multiple carriers    1093
SpiceJet              762
Vistara               395
Air Asia              273
GoAir                 184
other                  22
Name: airline, dtype: int64
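
The hard-coded airline_list could also be derived from the counts. Below is a minimal sketch, assuming a frequency cutoff of 100 rows; the threshold and the names `MIN_COUNT` and `group_rare` are illustrative choices, not from the notebook. The counts are copied from the value_counts() output above.

```python
from collections import Counter

# Airline frequencies as printed by df.airline.value_counts() before grouping.
airline_counts = Counter({
    "Jet Airways": 2575, "IndiGo": 1875, "Air India": 1243,
    "Multiple carriers": 1093, "SpiceJet": 762, "Vistara": 395,
    "Air Asia": 273, "GoAir": 184,
    "Multiple carriers Premium economy": 13, "Jet Airways Business": 5,
    "Vistara Premium economy": 3, "Trujet": 1,
})

MIN_COUNT = 100  # assumed cutoff; any value between 14 and 184 gives the same grouping


def group_rare(counts, min_count):
    """Merge every category seen fewer than min_count times into 'other'."""
    grouped = Counter()
    for name, n in counts.items():
        grouped[name if n >= min_count else "other"] += n
    return grouped


grouped = group_rare(airline_counts, MIN_COUNT)
print(grouped["other"])  # 22, matching the 'other' row above
```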

## 2. Data Splitting:

In [13]:
# Dividing the dataset into features (x) and the target (y)
x = df.drop('price',axis=1)
y = df['price']

## 3. Handling Categorical Variables:

In [14]:
cat_columns = x.select_dtypes(include = ['object']).columns.tolist()
cat_columns 

['airline', 'source', 'destination']

In [15]:
# As all categorical features are nominal I am going to try Binary Encoding
Binary_encoder = BinaryEncoder(cols=cat_columns,return_df=True)
x_encoded = Binary_encoder.fit_transform(x)
x_encoded.head()

Unnamed: 0,airline_0,airline_1,airline_2,airline_3,source_0,source_1,source_2,destination_0,destination_1,destination_2,duration,total_stops,month,day,dep_hour,dep_minute,arrival_hour,arrival_minute
0,0,0,0,1,0,0,1,0,0,1,445,2,1,5,5,50,13,15
1,0,0,1,0,0,0,1,0,0,1,325,1,12,5,18,5,23,30
2,0,0,1,0,0,1,0,0,1,0,285,1,1,3,16,50,21,35
3,0,0,1,1,0,0,1,0,0,1,145,0,6,24,9,0,11,25
4,0,1,0,0,0,1,1,0,1,1,470,1,5,27,11,25,19,15


## 4. Encoder Creation:

In [16]:
# Creating the encoder using ColumnTransformer with remainder="passthrough" to pass numerical features
binary_encoder = ColumnTransformer(transformers=[("Binary",BinaryEncoder(), ['airline', 'source', 'destination'])], remainder="passthrough")

## 5. Pipeline Creation and Model Evaluation:
* Creating a pipeline using the encoder and a scaler with different types of models

In [17]:
# Creating list of models to compare between them
models = []
models.append(("LR", LinearRegression()))
models.append(("DTR", DecisionTreeRegressor()))
models.append(("KNNR", KNeighborsRegressor()))
models.append(("RF", RandomForestRegressor()))
models.append(("XGR", XGBRegressor()))

In [18]:
for model in models:
    steps = []
    steps.append(('binary_encoder',binary_encoder))
    steps.append(('Scaler', StandardScaler()))
    steps.append(model)
    pipline = Pipeline(steps=steps) 
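    # Note: the "Accuracy" printed below is the R² score. The error keys are named neg_*,
    # but make_scorer() with default settings returns the plain (positive) MAE/MSE.
    scoring = {'r2_score': make_scorer(r2_score),'neg_mean_absolute_error': make_scorer(mean_absolute_error),'neg_mean_squared_error': make_scorer(mean_squared_error)}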
    result = cross_validate(pipline, x,y, cv=5, scoring=scoring, return_train_score=True)
    print(f"{model[0]} has Mean Training Accuracy of {round(result['train_r2_score'].mean(),2)}")
    print(f"{model[0]} has Mean Test Accuracy of {round(result['test_r2_score'].mean(),2)}")
    print(f"{model[0]} has Mean Test mean_absolute_error score of {round(result['test_neg_mean_absolute_error'].mean(),2)}")
    print(f"{model[0]} has Mean Test mean_squared_error score of {round(result['test_neg_mean_squared_error'].mean(),2)}")
    print("*" * 100)

LR has Mean Training Accuracy of 0.53
LR has Mean Test Accuracy of 0.53
LR has Mean Test mean_absolute_error score of 1998.88
LR has Mean Test mean_squared_error score of 9163060.92
****************************************************************************************************
DTR has Mean Training Accuracy of 0.97
DTR has Mean Test Accuracy of 0.69
DTR has Mean Test mean_absolute_error score of 1293.22
DTR has Mean Test mean_squared_error score of 6014704.88
****************************************************************************************************
KNNR has Mean Training Accuracy of 0.8
KNNR has Mean Test Accuracy of 0.7
KNNR has Mean Test mean_absolute_error score of 1479.42
KNNR has Mean Test mean_squared_error score of 5749612.38
****************************************************************************************************
RF has Mean Training Accuracy of 0.95
RF has Mean Test Accuracy of 0.8
RF has Mean Test mean_absolute_error score of 1137.61
RF has Mean Test

* From the above, XGR has the best scores: an R² of 0.94 on train and 0.83 on test, a Mean Absolute Error of 1129.75, and a Mean Squared Error of 3281954.61.
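
The "Accuracy" lines printed above are R² values, not classification accuracy. As a reminder of what the three reported metrics measure, here is a library-free sketch on made-up numbers; the toy `y_true` and `y_pred` values are illustrative and not taken from the dataset.

```python
def regression_metrics(y_true, y_pred):
    """Compute R², MAE and MSE from two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # average absolute error
    mse = sum(e * e for e in errors) / n           # average squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "mse": mse}


metrics = regression_metrics([3, 5, 7], [2, 5, 9])
print(metrics["r2"], metrics["mae"])  # 0.375 1.0
```

MAE is in the same unit as the price, while MSE is in squared units, so its square root (roughly 1,800 for an MSE of 3281954.61) is the easier number to compare with the MAE.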

In [19]:
for model in models:
    steps = []
    steps.append(('binary_encoder',binary_encoder))
    steps.append(('Scaler', RobustScaler()))
    steps.append(model)
    pipline = Pipeline(steps=steps) 
    # Same scoring as the StandardScaler run: R² plus positive MAE/MSE (despite the neg_* key names)
    scoring = {'r2_score': make_scorer(r2_score),'neg_mean_absolute_error': make_scorer(mean_absolute_error),'neg_mean_squared_error': make_scorer(mean_squared_error)}
    result = cross_validate(pipline, x,y, cv=5, scoring=scoring, return_train_score=True)
    print(f"{model[0]} has Mean Training Accuracy of {round(result['train_r2_score'].mean(),2)}")
    print(f"{model[0]} has Mean Test Accuracy of {round(result['test_r2_score'].mean(),2)}")
    print(f"{model[0]} has Mean Test mean_absolute_error score of {round(result['test_neg_mean_absolute_error'].mean(),2)}")
    print(f"{model[0]} has Mean Test mean_squared_error score of {round(result['test_neg_mean_squared_error'].mean(),2)}")
    print("*" * 100)

LR has Mean Training Accuracy of 0.53
LR has Mean Test Accuracy of 0.53
LR has Mean Test mean_absolute_error score of 1998.66
LR has Mean Test mean_squared_error score of 9163190.72
****************************************************************************************************
DTR has Mean Training Accuracy of 0.97
DTR has Mean Test Accuracy of 0.7
DTR has Mean Test mean_absolute_error score of 1293.44
DTR has Mean Test mean_squared_error score of 5836039.65
****************************************************************************************************
KNNR has Mean Training Accuracy of 0.82
KNNR has Mean Test Accuracy of 0.72
KNNR has Mean Test mean_absolute_error score of 1469.12
KNNR has Mean Test mean_squared_error score of 5497879.0
****************************************************************************************************
RF has Mean Training Accuracy of 0.95
RF has Mean Test Accuracy of 0.8
RF has Mean Test mean_absolute_error score of 1135.74
RF has Mean Test

* RobustScaler shows marginally better results than StandardScaler for the XGR model:
    - mean_absolute_error from 1129.75 to 1129.61
    - mean_squared_error from 3281954.61 to 3281397.92
* I am going to use RobustScaler.
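
The two scalers differ only in the statistics they use: StandardScaler centers on the mean and divides by the standard deviation, while RobustScaler (with its default settings) centers on the median and divides by the interquartile range, which makes it less sensitive to outliers such as very long flights. Below is a standard-library sketch on the five duration values from the data preview; the function names are hypothetical, and the results should match the scikit-learn scalers only under their default settings.

```python
import statistics

# duration values (minutes) of the first five rows in the data preview
durations = [445, 325, 285, 145, 470]


def standard_scale(values):
    """(x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """(x - median) / interquartile range, quartiles by linear interpolation."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


print(robust_scale(durations))    # [0.75, 0.0, -0.25, -1.125, 0.90625]
print(standard_scale(durations))
```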

## 6. Hyperparameter Tuning:
* Utilizing RandomizedSearchCV for hyperparameter tuning for better performance and generalization of the model.
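
RandomizedSearchCV does not try every combination; it draws n_iter candidates from the parameter space. When every parameter is given as a list, as in the grids used in this section, candidates are drawn without replacement. The sketch below counts the combinations in the first grid and draws 100 of them with the standard library. It only illustrates the sampling idea and is not how scikit-learn implements it internally.

```python
import itertools
import random

# Same values as the first parameter grid in this section.
param_grid = {
    "XGR__n_estimators": [50, 100, 200],
    "XGR__learning_rate": [0.01, 0.1, 0.2, 0.3],
    "XGR__max_depth": [3, 5, 7],
    "XGR__subsample": [0.8, 0.9, 1.0],
    "XGR__colsample_bytree": [0.8, 0.9, 1.0],
}

# Enumerate the full grid: 3 * 4 * 3 * 3 * 3 combinations.
names = list(param_grid)
all_candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*param_grid.values())]
print(len(all_candidates))  # 324

# Draw 100 distinct candidates, analogous to n_iter=100.
rng = random.Random(42)
sampled = rng.sample(all_candidates, k=100)
print(sampled[0])
```

With cv=5, 100 candidates mean 500 pipeline fits (plus one final refit), versus 1,620 for an exhaustive grid search over the same grid.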

In [20]:
steps = []
steps.append(('binary_encoder',binary_encoder))
steps.append(('Scaler', RobustScaler()))
steps.append(("XGR", XGBRegressor()))
pipline = Pipeline(steps=steps) 

In [21]:
# Define the parameter grid to search
param_grid = {
    'XGR__n_estimators': [50, 100, 200],
    'XGR__learning_rate': [0.01, 0.1, 0.2, 0.3],
    'XGR__max_depth': [3, 5, 7],
    'XGR__subsample': [0.8, 0.9, 1.0],
    'XGR__colsample_bytree': [0.8, 0.9, 1.0],
}

# Define the scoring metrics
scoring = {'r2': 'r2',
           'neg_mean_absolute_error': 'neg_mean_absolute_error',
           'neg_mean_squared_error': 'neg_mean_squared_error'}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(pipline, param_distributions=param_grid, n_iter=100,
                                   scoring=scoring, refit='r2', cv=5, random_state=42)

In [23]:
# Fit the model
random_search.fit(x, y)

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('binary_encoder',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('Binary',
                                                                               BinaryEncoder(),
                                                                               ['airline',
                                                                                'source',
                                                                                'destination'])])),
                                             ('Scaler', RobustScaler()),
                                             ('XGR',
                                              XGBRegressor(base_score=None,
                                                           booster=None,
                                                           callbacks=None,
      

In [24]:
# Display the best parameters and corresponding scores
print("Best Parameters:", random_search.best_params_)
print("Best R-squared:", random_search.best_score_)
print("Best MAE:", -random_search.cv_results_['mean_test_neg_mean_absolute_error'][random_search.best_index_])
print("Best MSE:", -random_search.cv_results_['mean_test_neg_mean_squared_error'][random_search.best_index_])

Best Parameters: {'XGR__subsample': 1.0, 'XGR__n_estimators': 200, 'XGR__max_depth': 5, 'XGR__learning_rate': 0.3, 'XGR__colsample_bytree': 0.9}
Best R-squared: 0.8403076032470473
Best MAE: 1123.8470226148313
Best MSE: 3103373.338506357


In [25]:
# Define the parameter grid to search
param_grid = {
    'XGR__n_estimators': [400,500,600],
    'XGR__learning_rate': [0.01, 0.1, 0.2, 0.3],
    'XGR__max_depth': [7, 9,12,15],
    'XGR__subsample': [0.8, 0.9, 1.0],
    'XGR__colsample_bytree': [0.8, 0.9, 1.0],
}

# Define the scoring metrics
scoring = {'r2': 'r2',
           'neg_mean_absolute_error': 'neg_mean_absolute_error',
           'neg_mean_squared_error': 'neg_mean_squared_error'}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(pipline, param_distributions=param_grid, n_iter=200,
                                   scoring=scoring, refit='r2', cv=5, random_state=42)

# Fit the model
random_search.fit(x, y)

# Display the best parameters and corresponding scores
print("Best Parameters:", random_search.best_params_)
print("Best R-squared:", random_search.best_score_)
print("Best MAE:", -random_search.cv_results_['mean_test_neg_mean_absolute_error'][random_search.best_index_])
print("Best MSE:", -random_search.cv_results_['mean_test_neg_mean_squared_error'][random_search.best_index_])

Best Parameters: {'XGR__subsample': 0.8, 'XGR__n_estimators': 600, 'XGR__max_depth': 9, 'XGR__learning_rate': 0.01, 'XGR__colsample_bytree': 0.8}
Best R-squared: 0.8408466505242934
Best MAE: 1083.5949883582057
Best MSE: 3089763.2108200593


In [26]:
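# With refit='r2', best_estimator_ has already been refit on all of x and y using the best parameters
model = random_search.best_estimator_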

In [27]:
model.fit(x,y)

Pipeline(steps=[('binary_encoder',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('Binary', BinaryEncoder(),
                                                  ['airline', 'source',
                                                   'destination'])])),
                ('Scaler', RobustScaler()),
                ('XGR',
                 XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=0.8, early_stopping_rounds=None,
                              enable_categori...
                              gamma=0, gpu_id=-1, grow_policy='depthwise',
                              importance_type=None, interaction_constraints='',
                              learning_rate=0.01, max_bin=256,
                              max_cat_to_onehot=4, max_delta_step=0,
                              max_depth=9, max_lea

## 7. Model Saving and Deployment Preparation:

In [38]:
# using joblib to dump the model
joblib.dump(model,'model.pkl')

['model.pkl']

In [39]:
features = x.columns.tolist()
features 

['airline',
 'source',
 'destination',
 'duration',
 'total_stops',
 'month',
 'day',
 'dep_hour',
 'dep_minute',
 'arrival_hour',
 'arrival_minute']

In [40]:
# using joblib to dump features names
joblib.dump(features,'features.pkl')

['features.pkl']
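
The saved feature list can double as a schema check at prediction time, so that a forgotten input fails loudly instead of reaching the model as a missing value. Here is a minimal sketch with a hypothetical helper `build_row`; the feature names are the ones listed above, and the example values come from the first row of the data preview.

```python
# Feature names in training order, as saved to features.pkl.
features = ["airline", "source", "destination", "duration", "total_stops",
            "month", "day", "dep_hour", "dep_minute",
            "arrival_hour", "arrival_minute"]


def build_row(features, **inputs):
    """Return the inputs as a dict ordered like `features`, or raise if any are missing or extra."""
    missing = [f for f in features if f not in inputs]
    unexpected = [k for k in inputs if k not in features]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return {f: inputs[f] for f in features}


row = build_row(
    features,
    airline="Air India", source="Kolkata", destination="Banglore",
    duration=445, total_stops=2, month=1, day=5,
    dep_hour=5, dep_minute=50, arrival_hour=13, arrival_minute=15,
)
print(list(row) == features)  # True
```

The resulting dict can be turned into a one-row frame with pd.DataFrame([row], columns=features) before calling model.predict.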

In [41]:
x.airline.unique()

array(['Air India', 'IndiGo', 'SpiceJet', 'Multiple carriers',
       'Jet Airways', 'GoAir', 'Vistara', 'Air Asia', 'other'],
      dtype=object)

## 8. Model deployment

In [42]:
%%writefile app.py

import streamlit as st
import pandas as pd
import joblib
import sklearn
import category_encoders
from datetime import datetime, time

model = joblib.load('model.pkl')
features = joblib.load('features.pkl')

def make_prediction(airline,source,destination,duration,total_stops,month,day,dep_hour,dep_minute,arrival_hour,arrival_minute):
    # Build a single-row DataFrame with the same columns the model was trained on
    df_pred = pd.DataFrame(columns=features)
    df_pred.at[0,'airline'] = airline
    df_pred.at[0,'source'] = source
    df_pred.at[0,'destination'] = destination
    df_pred.at[0,'duration'] = duration
    df_pred.at[0,'total_stops'] = total_stops
    df_pred.at[0,'month'] = month
    df_pred.at[0,'day'] = day
    df_pred.at[0,'dep_hour'] = dep_hour
    df_pred.at[0,'dep_minute'] = dep_minute
    df_pred.at[0,'arrival_hour'] = arrival_hour
    df_pred.at[0,'arrival_minute'] = arrival_minute
    result = model.predict(df_pred)
    return result[0]


    
def main():
    st.title('India Airline Ticket Price prediction')
    # Departure Date input
    dep_date = st.date_input("Select Departure date", datetime.today())
    # Departure Time input
    dep_time = st.time_input("Select Departure time", time())
    selected_dep_datetime = datetime.combine(dep_date, dep_time)
    # Display the selected departure datetime
    st.write("Selected Departure DateTime:", selected_dep_datetime)
    # The combined value is already a datetime, so its fields can be read directly
    month = selected_dep_datetime.month
    day = selected_dep_datetime.day
    dep_hour = selected_dep_datetime.hour
    dep_minute = selected_dep_datetime.minute
    
    # Arrival Date input
    arrival_date = st.date_input("Select Arrival date", datetime.today())
    # Arrival Time input
    arrival_time = st.time_input("Select Arrival time", time())
    selected_arrival_datetime = datetime.combine(arrival_date, arrival_time)
    # Display the selected arrival datetime
    st.write("Selected Arrival DateTime:", selected_arrival_datetime)
    arrival_hour = selected_arrival_datetime.hour
    arrival_minute = selected_arrival_datetime.minute
    
    # Calculate the duration
    total_duration = selected_arrival_datetime - selected_dep_datetime
    # Convert the duration to minutes
    duration = total_duration.total_seconds() / 60
    st.write("Trip Duration in minutes:", duration)
    
    airline = st.selectbox("airline company",['Air India', 'IndiGo', 'SpiceJet', 'Multiple carriers',
       'Jet Airways', 'GoAir', 'Vistara', 'Air Asia', 'other'])
    source = st.selectbox('Source City',['Kolkata', 'Banglore', 'Delhi', 'Chennai', 'Mumbai'])
    destination = st.selectbox('Destination City',['Banglore', 'Delhi', 'Cochin', 'Kolkata', 'Hyderabad'])
    total_stops = st.selectbox('Select Trip total stops (0 for direct or non-stop flight)',[0, 1, 2, 3])
    
    if st.button("Predict"):
        result = make_prediction(airline,source,destination,duration,total_stops,month,day,dep_hour,dep_minute,arrival_hour,arrival_minute)
        st.write("Ticket price in Indian Rupee (₹)")
        st.text(result)
        
main()

Overwriting app.py
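
One edge case in the app above: if the selected arrival is earlier than the departure, the computed duration is negative, a value the model never saw in training (the minimum duration in the data is 75 minutes). Below is a minimal sketch of a guard, with a hypothetical helper `duration_minutes`; how the app should react (error message, disabled button) is left open.

```python
from datetime import datetime


def duration_minutes(departure: datetime, arrival: datetime) -> float:
    """Trip duration in minutes; raises if arrival is not after departure."""
    minutes = (arrival - departure).total_seconds() / 60
    if minutes <= 0:
        raise ValueError("Arrival must be later than departure.")
    return minutes


# Departure and arrival times of the first row in the data preview (05:50 to 13:15).
dep = datetime(2019, 1, 5, 5, 50)
arr = datetime(2019, 1, 5, 13, 15)
print(duration_minutes(dep, arr))  # 445.0, matching that row's duration
```

In the Streamlit app, the ValueError could be caught and reported with st.error before a prediction is attempted.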
