## Setup

**python -m venv ./venv**

Run the command above in your terminal to create a virtual environment, activate it, and then run the cell below. If you would rather not use a virtual environment, just run the cell below to install the required packages into your current Python environment.

In [1]:
%pip install pandas numpy matplotlib seaborn scikit-learn tensorflow

Collecting tensorflow
  Downloading tensorflow-2.15.0-cp311-cp311-win_amd64.whl.metadata (3.6 kB)
Collecting tensorflow-intel==2.15.0 (from tensorflow)
  Downloading tensorflow_intel-2.15.0-cp311-cp311-win_amd64.whl.metadata (5.1 kB)
Collecting absl-py>=1.0.0 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading gast-0.5.4-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)

Alternatively, install the dependencies from requirements.txt:

In [None]:
%pip install -r requirements.txt
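
Some scikit-learn arguments used later in this notebook differ between releases, so it can help to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installedVersions` is hypothetical, and the package list simply mirrors the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# packages named in the install command above
PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "scikit-learn", "tensorflow"]

def installedVersions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installedVersions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this in the notebook kernel (rather than a separate terminal) reports the environment the cells actually use.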

## Process 1: Data Cleaning

In [1]:
#import data
import pandas as pd
import numpy as np

data = pd.read_csv('./Raw Data/Clean_Dataset.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955


In [2]:
#drop 1st column
data = data.drop(data.columns[0], axis=1)
data.head()

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955


In [3]:
#CREATE NEW COLUMN TO CONVERT PRICES TO USD

#conversion rate from INR to USD
conversion_rate = 0.012

#function to convert price from INR to USD
def convert_to_usd(price_inr):
    return price_inr * conversion_rate

#create new column for USD price, applying conversion and round to the nearest cent
data['priceUSD'] = data['price'].apply(convert_to_usd).apply(lambda x: round(x, 2))

#rename current price column to indicate it is INR (Indian Rupee)
data = data.rename(columns={'price': 'priceINR'})

data.head()

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,priceINR,priceUSD
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953,71.44
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953,71.44
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956,71.47
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955,71.46
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955,71.46
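
The two chained `.apply` calls above run a Python function once per row. As a sketch of an alternative, the same conversion can be written as a single vectorized expression. The toy frame below is an assumption for illustration; it reuses three INR prices from the rows shown above and the same conversion rate.

```python
import pandas as pd

# toy frame with three INR prices taken from the rows shown above
toy = pd.DataFrame({"price": [5953, 5956, 5955]})

conversion_rate = 0.012  # same INR -> USD rate as in the cell above

# multiply the whole column at once, then round to the nearest cent
toy["priceUSD"] = (toy["price"] * conversion_rate).round(2)
print(toy)
```

On a dataset with many rows this form is usually faster, and it produces the same rounded values for these rows.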


In [4]:
#CONVERT STOPS VALUES FROM STRING TO NUMERICAL

#mapping dictionary
number_mapping = {'zero': 0, 'one': 1, 'two_or_more': 2}

# Convert string versions of numbers to numerical values
data['stops'] = data['stops'].map(number_mapping)

data.head()

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,priceINR,priceUSD
0,SpiceJet,SG-8709,Delhi,Evening,0,Night,Mumbai,Economy,2.17,1,5953,71.44
1,SpiceJet,SG-8157,Delhi,Early_Morning,0,Morning,Mumbai,Economy,2.33,1,5953,71.44
2,AirAsia,I5-764,Delhi,Early_Morning,0,Early_Morning,Mumbai,Economy,2.17,1,5956,71.47
3,Vistara,UK-995,Delhi,Morning,0,Afternoon,Mumbai,Economy,2.25,1,5955,71.46
4,Vistara,UK-963,Delhi,Morning,0,Morning,Mumbai,Economy,2.33,1,5955,71.46
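
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes NaN without raising an error. The sketch below shows a quick check for that, using a made-up column that contains one label (`three`) the mapping does not cover.

```python
import pandas as pd

number_mapping = {"zero": 0, "one": 1, "two_or_more": 2}

# toy column with one label ("three") that the mapping does not cover
stops = pd.Series(["zero", "one", "two_or_more", "three"])
mapped = stops.map(number_mapping)

# labels missing from the dictionary become NaN instead of raising an error
unmapped = stops[mapped.isna()].unique().tolist()
print("unmapped labels:", unmapped)
```

The same effect appears if the mapping cell is run twice: on the second run the column already holds 0, 1, and 2, none of which are keys in `number_mapping`, so every value turns into NaN.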


## Process 2: Building the models

In [11]:
# necessary imports for the next steps
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

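# the cleaned DataFrame still has snake_case column names, so rename them to the camelCase names used below
data = data.rename(columns={
    'source_city': 'sourceCity',
    'departure_time': 'departureTime',
    'arrival_time': 'arrivalTime',
    'destination_city': 'destinationCity',
    'days_left': 'daysLeft',
})

# this part encodes categorical variables using OneHotEncoder
# (sparse_output replaces the older sparse argument and needs scikit-learn 1.2 or newer)
encoder = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
categoricalFeatures = ['airline', 'sourceCity', 'departureTime', 'stops', 'arrivalTime', 'destinationCity', 'class']
encodedFeatures = encoder.fit_transform(data[categoricalFeatures])

# this part concatenates encoded categorical features with numerical features
numericalFeatures = data[['duration', 'daysLeft', 'priceUSD']].to_numpy()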
features = np.concatenate([encodedFeatures, numericalFeatures], axis=1)

# this part splits the data into features and target variable
X = features[:, :-1]  # all features except priceUSD
y = features[:, -1]   # target variable is the price in USD

# this part splits the dataset into training and testing set
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.2, random_state=42)

# linear regression model
linearReg = LinearRegression()
linearReg.fit(XTrain, yTrain)
yPredLr = linearReg.predict(XTest)

# evaluate linear regression model
print("Linear Regression Model Performance:")
print("Mean Squared Error (MSE):", mean_squared_error(yTest, yPredLr))
print("Coefficient of Determination (R^2):", r2_score(yTest, yPredLr))

# random forest regressor model
randomForestReg = RandomForestRegressor(n_estimators=100, random_state=42)
randomForestReg.fit(XTrain, yTrain)
yPredRf = randomForestReg.predict(XTest)

# evaluate random forest regressor model
print("\nRandom Forest Regressor Model Performance:")
print("Mean Squared Error (MSE):", mean_squared_error(yTest, yPredRf))
print("Coefficient of Determination (R^2):", r2_score(yTest, yPredRf))




Linear Regression Model Performance:
Mean Squared Error (MSE): 6583.8376650293485
Coefficient of Determination (R^2): 0.91130428869681

Random Forest Regressor Model Performance:
Mean Squared Error (MSE): 1128.8521300717916
Coefficient of Determination (R^2): 0.98479240410731
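
MSE is expressed in squared USD, which is hard to read as a price error. Taking the square root gives the RMSE in plain USD. The sketch below does that arithmetic on the two MSE values copied from the output above; the variable names are only for illustration.

```python
import math

# MSE values copied from the output above (units: USD squared)
mseLinear = 6583.8376650293485
mseForest = 1128.8521300717916

# RMSE is the square root of MSE, which brings the error back to USD
rmseLinear = math.sqrt(mseLinear)
rmseForest = math.sqrt(mseForest)

print(f"Linear Regression RMSE: {rmseLinear:.2f} USD")
print(f"Random Forest RMSE: {rmseForest:.2f} USD")
```

On this single train/test split, that works out to roughly 81 USD for linear regression and roughly 34 USD for the random forest.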


In [14]:
import numpy as np

categoricalFeatures = ['airline', 'sourceCity', 'departureTime', 'stops', 'arrivalTime', 'destinationCity', 'class']
# the 'duration' and 'daysLeft' features are numerical and handled separately

# 'stops' is numeric here because the cleaning step mapped it to 0/1/2 before training
newData = {
    'airline': 'Vistara',
    'sourceCity': 'Delhi',
    'departureTime': 'Morning',
    'stops': 0,  # numeric, to match the cleaned training data
    'arrivalTime': 'Afternoon',
    'destinationCity': 'Mumbai',
    'class': 'Economy',
    'duration': 2.5,
    'daysLeft': 10
}

def predictPrice(newData, model):
    # accept the raw string labels as well and map them to the numeric codes used in training
    stopsMapping = {'zero': 0, 'one': 1, 'two_or_more': 2}
    row = dict(newData, stops=stopsMapping.get(newData['stops'], newData['stops']))

    # construct the input data for prediction by matching the training data structure
    newDataProcessed = pd.DataFrame([row])[categoricalFeatures]

    # encode the categorical values with the encoder fitted on the training data
    newDataEncoded = encoder.transform(newDataProcessed)

    # append the numerical features in the same order used for training
    completeFeatures = np.hstack((newDataEncoded, [[row['duration'], row['daysLeft']]]))

    # predict using the provided model
    predictedPrice = model.predict(completeFeatures)
    return predictedPrice[0]


linearRegPredictedPrice = predictPrice(newData, linearReg)
print(f"Predicted Flight Price (LinReg): USD {linearRegPredictedPrice:.2f}")

randomForestRegPredictedPrice = predictPrice(newData, randomForestReg)
print(f"Predicted Flight Price (RFReg): USD {randomForestRegPredictedPrice:.2f}")

Predicted Flight Price (LinReg): USD 27.89
Predicted Flight Price (RFReg): USD 86.43
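
The prediction helper above has to rebuild the feature row by hand in exactly the order used for training. One way to avoid that bookkeeping is to bundle the encoder and the model into a single scikit-learn `Pipeline`. The following is a minimal sketch, not the notebook's method: the data is a tiny made-up sample, and the names `toy`, `pricePipeline`, and `newFlight` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# tiny made-up sample; the values are illustrative only, not taken from the dataset
toy = pd.DataFrame({
    "airline": ["Vistara", "SpiceJet", "AirAsia", "Vistara", "SpiceJet", "AirAsia"],
    "stops": [0, 1, 0, 2, 1, 0],
    "duration": [2.25, 5.0, 2.17, 9.5, 4.75, 2.33],
    "daysLeft": [1, 10, 20, 5, 30, 15],
    "priceUSD": [71.46, 90.10, 60.25, 130.40, 55.30, 58.90],
})

categorical = ["airline", "stops"]
numerical = ["duration", "daysLeft"]

# one-hot encode the categorical columns and pass the numerical ones through unchanged
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

# the pipeline applies the same preprocessing in fit() and in predict()
pricePipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pricePipeline.fit(toy[categorical + numerical], toy["priceUSD"])

# a new flight is just a one-row DataFrame with the same column names
newFlight = pd.DataFrame([{"airline": "Vistara", "stops": 0, "duration": 2.5, "daysLeft": 10}])
predicted = pricePipeline.predict(newFlight)
print(f"Predicted price: USD {predicted[0]:.2f}")
```

With this layout the target column never passes through the feature matrix, and swapping `LinearRegression` for another regressor only changes one line.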




## Process 3: Testing and improving the models

In [9]:
%pip install xgboost

Collecting xgboost
  Downloading xgboost-2.0.3-py3-none-win_amd64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.3-py3-none-win_amd64.whl (99.8 MB)

In [10]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

# Initialize an XGBoost regressor object
xgb_reg = xgb.XGBRegressor(objective ='reg:squarederror', 
                           colsample_bytree = 0.3, 
                           learning_rate = 0.1,
                           max_depth = 5, 
                           alpha = 10, 
                           n_estimators = 100)

# Fit the regressor to the training set
xgb_reg.fit(XTrain, yTrain)

# Predict on the test set
y_pred_xgb = xgb_reg.predict(XTest)

# Evaluate the model (XTrain/XTest/yTrain/yTest come from the train_test_split cell)
mse_xgb = mean_squared_error(yTest, y_pred_xgb)
r2_xgb = r2_score(yTest, y_pred_xgb)

print("XGBoost Regressor Model Performance:")
print("Mean Squared Error (MSE):", mse_xgb)
print("Coefficient of Determination (R^2):", r2_xgb)

XGBoost Regressor Model Performance:
Mean Squared Error (MSE): 3509.1572340479674
Coefficient of Determination (R^2): 0.9527255663362059


In [15]:
from sklearn.model_selection import cross_val_score

linearScores = cross_val_score(linearReg, X, y, cv=5, scoring='neg_mean_squared_error')
linearRmseScores = np.sqrt(-linearScores)

forestScores = cross_val_score(randomForestReg, X, y, cv=5, scoring='neg_mean_squared_error')
forestRmseScores = np.sqrt(-forestScores)

print("Linear Regression RMSE scores:", linearRmseScores)
print("Random Forest Regressor RMSE scores:", forestRmseScores)

Linear Regression RMSE scores: [ 66.7640571   52.17395103  58.00590151 113.46358387 137.29694748]
Random Forest Regressor RMSE scores: [ 29.02420269  28.08694974  28.99588015 102.44775832 137.52382268]


In [16]:
from sklearn.model_selection import cross_validate

scoring = ['neg_mean_squared_error', 'r2']
linearResults = cross_validate(linearReg, X, y, cv=5, scoring=scoring, return_train_score=True)
forestResults = cross_validate(randomForestReg, X, y, cv=5, scoring=scoring, return_train_score=True)

linearRmse = np.sqrt(-linearResults['test_neg_mean_squared_error'].mean())
forestRmse = np.sqrt(-forestResults['test_neg_mean_squared_error'].mean())

print("Linear Regression:")
print("Average RMSE:", linearRmse)
print("Average R^2:", linearResults['test_r2'].mean())

print("\nRandom Forest Regressor:")
print("Average RMSE:", forestRmse)
print("Average R^2:", forestResults['test_r2'].mean())

Linear Regression:
Average RMSE: 91.9442023709165
Average R^2: -0.2952834909663406

Random Forest Regressor:
Average RMSE: 79.85034394904606
Average R^2: 0.5447142754196432
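
The cross-validated scores are far worse than the single-split scores from Process 2. One possible reason, offered here as an assumption rather than a verified diagnosis: passing `cv=5` for a regressor uses `KFold` without shuffling, so each fold is a consecutive block of rows, whereas `train_test_split` shuffles by default. The per-fold RMSE values above are much higher for the last two folds, which would be consistent with the CSV rows being grouped in some order. The sketch below shows how to pass an explicitly shuffled `KFold`; the data is synthetic and the names `XToy`, `yToy`, and `shuffledCv` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic data, made up for illustration: rows are sorted so one group comes last
rng = np.random.default_rng(42)
group = np.repeat([0, 1], [60, 40])
duration = rng.uniform(1, 10, size=100)
XToy = np.column_stack([group, duration])
yToy = 70 + 400 * group + 5 * duration + rng.normal(0, 5, size=100)

model = LinearRegression()

# cv=5 on a regressor means consecutive, unshuffled folds
plainScores = cross_val_score(model, XToy, yToy, cv=5, scoring="r2")

# an explicit KFold with shuffle=True mixes the rows before splitting
shuffledCv = KFold(n_splits=5, shuffle=True, random_state=42)
shuffledScores = cross_val_score(model, XToy, yToy, cv=shuffledCv, scoring="r2")

print("unshuffled R^2 per fold:", plainScores.round(3))
print("shuffled R^2 per fold:  ", shuffledScores.round(3))
```

Whether shuffling changes the scores on the flight data depends on how the rows are actually ordered, so it is worth comparing both variants before drawing conclusions about the models.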


## Process 4 (extra): Using GridSearch

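*This grid search took too long to run and was interrupted before it finished, so bestModel is never defined and the final cell below raises a NameError.*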

In [17]:
from sklearn.model_selection import GridSearchCV

paramGrid = {
    'n_estimators': [100, 200, 300],
    'max_features': [1.0, 'sqrt'],  # 1.0 (all features) stands in for 'auto', which scikit-learn 1.3 removed
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

gridSearch = GridSearchCV(randomForestReg, paramGrid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
gridSearch.fit(XTrain, yTrain)

print("Best parameters:", gridSearch.best_params_)
bestModel = gridSearch.best_estimator_

KeyboardInterrupt: 
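
To see why the search did not finish, it helps to count how many models it has to train. The sketch below does the arithmetic with the standard library only; `valuesPerParam` is a hypothetical name, and its numbers are the counts of candidate values per hyperparameter in the grid above.

```python
from math import prod

# number of candidate values per hyperparameter in the grid above
valuesPerParam = {
    "n_estimators": 3,
    "max_features": 2,
    "max_depth": 4,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
}
cvFolds = 5

# every combination is fitted once per fold, plus one final refit on the best parameters
combinations = prod(valuesPerParam.values())
totalFits = combinations * cvFolds + 1

print("parameter combinations:", combinations)
print("total model fits:", totalFits)
```

That is 216 combinations and 1,081 random forest fits, each with 100 to 300 trees. Two common ways to cut the wall time are passing `n_jobs=-1` to `GridSearchCV` so fits run in parallel, or switching to `RandomizedSearchCV` with a fixed `n_iter` so that only a sample of the combinations is tried.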

In [18]:
finalPredictions = bestModel.predict(XTest)
finalMse = mean_squared_error(yTest, finalPredictions)
finalRmse = np.sqrt(finalMse)

print("Final RMSE on Test Set:", finalRmse)

NameError: name 'bestModel' is not defined