In this notebook we build a model to predict flight prices, walking through the life cycle of a data science project along the way. The data comes from Kaggle.

Here's the link to the dataset: https://www.kaggle.com/nikhilmittal/flight-fare-prediction-mh


So, let's begin...

1. Import Libraries
2. Load Train Data
3. Basic EDA
4. Data_Cleaning & Feature Extraction
5. Handling Categorical Data
    * Airline
    * Source & Destination
    * Route & Additional_Info
    * Total_Stops
6. Load Test data
7. Preprocessing
8. Feature Selection
9. Feature Importance using ExtraTreesRegressor
10. Model Building
    * RandomForestRegressor

In [None]:
# To read Excel files with pandas, install openpyxl first
!pip install openpyxl

# Import Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

# Load Train Data

In [None]:
train_data = pd.read_excel(r'../input/flight-fare-prediction-mh/Data_Train.xlsx')
# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', None)
train_data.head()

In [None]:
# get insights from data
train_data.info()

As we can see, the data has many independent features of object type and one dependent feature, Price, of integer type, which we will be predicting.

We need to convert the object columns to numeric values before building the model.

In [None]:
train_data.Duration.value_counts()

In [None]:
#Let's have a look at the shape of our data
train_data.shape

In [None]:
#Drop na values
train_data.dropna(inplace = True)

In [None]:
#Check the shape again
train_data.shape

Comparing the two shapes shows that only one row contained missing values and was dropped.
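
Before dropping rows, it can help to look at which rows actually contain missing values. The snippet below is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from the dataset); the same two lines can be applied to `train_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are invented for illustration)
toy = pd.DataFrame({
    "Route": ["BLR-DEL", np.nan, "DEL-COK"],
    "Total_Stops": ["non-stop", np.nan, "1 stop"],
    "Price": [3897, 7480, 13882],
})

# Rows that contain at least one missing value
rows_with_na = toy[toy.isnull().any(axis=1)]
print(rows_with_na)

# Dropping them removes exactly that many rows
cleaned = toy.dropna()
print("rows dropped:", len(toy) - len(cleaned))
```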

In [None]:
#Let's see if there are any null values
train_data.isnull().sum()

# EDA

We will convert the data so that it is suitable for building the model.
First, convert the object-typed date to datetime and separate it into day and month.

In [None]:
# Parse the date once, then extract day and month from it
journey_date = pd.to_datetime(train_data.Date_of_Journey, format='%d/%m/%Y')
train_data['Journey_day'] = journey_date.dt.day
train_data['Journey_month'] = journey_date.dt.month

In [None]:
train_data.head()

In [None]:
#Now drop the column
train_data.drop(['Date_of_Journey'], axis = 1, inplace = True)

In [None]:
#Convert and extract hour, minute from departure time, then drop the column
train_data['Dep_hour'] = pd.to_datetime(train_data.Dep_Time).dt.hour
train_data['Dep_minute'] = pd.to_datetime(train_data.Dep_Time).dt.minute
train_data.drop(['Dep_Time'], axis = 1, inplace = True)
train_data.head()

In [None]:
#Convert and extract hour, minute from arrival time, then drop the column
train_data['Arrival_hour'] = pd.to_datetime(train_data.Arrival_Time).dt.hour
train_data['Arrival_minute'] = pd.to_datetime(train_data.Arrival_Time).dt.minute
train_data.drop(['Arrival_Time'], axis = 1, inplace = True)
train_data.head()

In [None]:
# Duration is the time taken by the plane to reach its destination
# It is the difference between Departure Time and Arrival Time

# Assigning and converting Duration column into list
duration = list(train_data["Duration"])

for i in range(len(duration)):
    if len(duration[i].split()) != 2:    # Check if duration contains only hour or mins
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m"   # Adds 0 minute
        else:
            duration[i] = "0h " + duration[i]           # Adds 0 hour

duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep = "h")[0]))    # Extract hours from duration
    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1]))   # Extracts only minutes from duration
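
As an alternative to the loop above, the same hours and minutes can be pulled out with pandas string methods. This is only a sketch on a small hand-written Series (the sample strings are assumptions that mimic the `Duration` format); it is not the approach used in the rest of this notebook.

```python
import pandas as pd

# Sample strings in the same "Xh Ym" style as the Duration column
durations = pd.Series(["2h 50m", "19h", "45m", "5h 25m"])

# Extract the number before "h" and before "m"; a missing part becomes NaN, then 0
hours = pd.to_numeric(
    durations.str.extract(r"(\d+)\s*h", expand=False), errors="coerce"
).fillna(0).astype(int)
mins = pd.to_numeric(
    durations.str.extract(r"(\d+)\s*m", expand=False), errors="coerce"
).fillna(0).astype(int)

# A single numeric feature is also possible: total duration in minutes
total_minutes = hours * 60 + mins
print(pd.DataFrame({"hours": hours, "mins": mins, "total": total_minutes}))
```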

In [None]:
#Create new columns and store the values
train_data['Duration_hours'] = duration_hours
train_data['Duration_mins'] = duration_mins

In [None]:
#Drop the column
train_data.drop(['Duration'], axis = 1, inplace = True)

In [None]:
train_data.head()

# Handling Categorical Values

We will convert the remaining object columns to numeric values.

In [None]:
train_data.Airline.value_counts()

In [None]:
#Let's visualize the price of all airlines
sns.catplot(y = 'Price', x = 'Airline', data = train_data.sort_values('Price', ascending = False), kind = 'boxen', height = 6, aspect = 3)

As we can see, Jet Airways has high prices; the rest are quite similar in price.

Since Airline, Source, and Destination are nominal categorical data, we will one-hot encode them.
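
To make the effect of `drop_first = True` concrete, here is a small sketch on a toy `Source` column (the city values are chosen for illustration). With three categories, only two dummy columns are kept; a row with all dummies off stands for the dropped first category.

```python
import pandas as pd

# Toy column with three distinct categories
toy = pd.DataFrame({"Source": ["Delhi", "Kolkata", "Banglore", "Delhi"]})

# Passing a one-column DataFrame (double brackets) prefixes the new columns with "Source_"
encoded = pd.get_dummies(toy[["Source"]], drop_first=True)

# Categories are sorted alphabetically, so "Banglore" is the dropped baseline
print(encoded.columns.tolist())
print(encoded)
```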

In [None]:
Airline = train_data[["Airline"]]

Airline = pd.get_dummies(Airline, drop_first= True)

Airline.head()

In [None]:
train_data['Source'].value_counts()

In [None]:
#Visualize the prices at the source 
sns.catplot(y = 'Price', x = 'Source', data = train_data.sort_values('Price', ascending = False), kind = 'boxen', height = 4, aspect = 3)
plt.show()

In [None]:
Source = train_data[['Source']]
Source = pd.get_dummies(Source, drop_first = True)
Source.head()

In [None]:
train_data['Destination'].value_counts()

In [None]:
Destination = train_data[['Destination']]
Destination = pd.get_dummies(Destination, drop_first = True)
Destination.head()

In [None]:
train_data['Route']

In [None]:

# Additional_Info holds no information for almost 80% of the rows
# Route and Total_Stops carry the same information, so Route is redundant

train_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)

In [None]:
train_data.head()

In [None]:
train_data['Total_Stops'].value_counts()

In [None]:
# Total_Stops is ordinal categorical data, so we encode it with an explicit ordered mapping
# Each key is replaced by its corresponding integer value (applied to this column only)

train_data['Total_Stops'] = train_data['Total_Stops'].map({'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4})

In [None]:
train_data.head()

In [None]:
#combine all data
data_train = pd.concat([train_data, Airline, Source, Destination], axis = 1)

In [None]:
data_train.head()

In [None]:
data_train.drop(['Airline','Source', 'Destination'], axis = 1, inplace = True)
data_train.head()

In [None]:
data_train.shape

# Load Test Data

Load the test data and perform all the preprocessing steps we did for train data
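
One way to keep train and test preprocessing in sync is to wrap a step in a function and then align the encoded columns. The sketch below uses two tiny made-up frames; `add_date_parts`, `train_toy`, and `test_toy` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def add_date_parts(df):
    """Return a copy with Journey_day and Journey_month derived from Date_of_Journey."""
    out = df.copy()
    parsed = pd.to_datetime(out["Date_of_Journey"], format="%d/%m/%Y")
    out["Journey_day"] = parsed.dt.day
    out["Journey_month"] = parsed.dt.month
    return out.drop(columns=["Date_of_Journey"])

# Invented rows; the test frame is missing one airline that train contains
train_toy = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "01/05/2019"],
                          "Airline": ["IndiGo", "Air India"]})
test_toy = pd.DataFrame({"Date_of_Journey": ["06/06/2019"],
                         "Airline": ["IndiGo"]})

# Apply the same function to both frames, then one-hot encode
train_X = pd.get_dummies(add_date_parts(train_toy), columns=["Airline"])
test_X = pd.get_dummies(add_date_parts(test_toy), columns=["Airline"])

# Align test to train: missing dummy columns are added as 0, order matches train
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_X)
```

Aligning with `reindex` matters because a model fitted on the train columns expects the same columns, in the same order, at prediction time.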

In [None]:
test_data = pd.read_excel(r'../input/flight-fare-prediction-mh/Test_set.xlsx')
test_data.head()

# Preprocessing

In [None]:
print('Test data info:')
print('_' * 80)
print(test_data.info())

print('\n \n Null values:')
print('_' * 80)
test_data.dropna(inplace = True)
print(test_data.isnull().sum())

# EDA of Test Data

In [None]:
# Date_of_Journey
test_data['Journey_day'] = pd.to_datetime(test_data.Date_of_Journey, format = '%d/%m/%Y').dt.day
test_data['Journey_month'] = pd.to_datetime(test_data.Date_of_Journey, format = '%d/%m/%Y').dt.month
test_data.drop(['Date_of_Journey'], axis = 1, inplace = True)

# Dep_Time (column names match the train data: Dep_hour, Dep_minute)
test_data["Dep_hour"] = pd.to_datetime(test_data["Dep_Time"]).dt.hour
test_data["Dep_minute"] = pd.to_datetime(test_data["Dep_Time"]).dt.minute
test_data.drop(["Dep_Time"], axis = 1, inplace = True)

# Arrival_Time (column names match the train data: Arrival_hour, Arrival_minute)
test_data["Arrival_hour"] = pd.to_datetime(test_data.Arrival_Time).dt.hour
test_data["Arrival_minute"] = pd.to_datetime(test_data.Arrival_Time).dt.minute
test_data.drop(["Arrival_Time"], axis = 1, inplace = True)

In [None]:
# Duration
Duration = list(test_data['Duration'])

for i in range(len(Duration)):
    if len(Duration[i].split()) != 2:
        if 'h' in Duration[i]:
            Duration[i] = Duration[i].strip() + ' 0m'
        else:
            Duration[i] = '0h ' + Duration[i]
        
Duration_hour = []
Duration_mins = []
for i in range(len(Duration)):
    Duration_hour.append(int(Duration[i].split(sep = 'h')[0]))
    Duration_mins.append(int(Duration[i].split(sep = 'm')[0].split()[-1]))

# Use the same column names as in the train data
test_data['Duration_hours'] = Duration_hour
test_data['Duration_mins'] = Duration_mins
test_data.drop(['Duration'], axis = 1, inplace = True)

In [None]:
# Categorical data
# Pass a one-column DataFrame (double brackets) so the dummy columns get the same
# "Airline_", "Source_", "Destination_" prefixes as in the train data

print("Airline")
print("-"*75)
print(test_data["Airline"].value_counts())
Airline = pd.get_dummies(test_data[["Airline"]], drop_first = True)

print()

print("Source")
print("-"*75)
print(test_data["Source"].value_counts())
Source = pd.get_dummies(test_data[["Source"]], drop_first = True)

print()

print("Destination")
print("-"*75)
print(test_data["Destination"].value_counts())
Destination = pd.get_dummies(test_data[["Destination"]], drop_first = True)

In [None]:
# Additional_Info holds no information for almost 80% of the rows
# Route and Total_Stops carry the same information, so Route is redundant
test_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)

In [None]:
# Encode Total_Stops with the same ordered mapping as in the train data
test_data['Total_Stops'] = test_data['Total_Stops'].map({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4})

In [None]:
#Combine all data:  test_data + Airline + Source + Destination
data_test = pd.concat([test_data, Airline, Source, Destination], axis = 1)

data_test.drop(["Airline", "Source", "Destination"], axis = 1, inplace = True)

print("Shape of test data : ", data_test.shape)

In [None]:
data_test.head()

# Feature Selection

Find the features that contribute most and relate well to the target variable. Some feature selection methods are:

**heatmap**

**feature_importance_**

**SelectKBest**
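
Of the three methods listed, `SelectKBest` is the only one not demonstrated later in this notebook, so here is a minimal sketch of it. The data is synthetic (one informative column and two noise columns are assumptions made up for the example), and `f_regression` is one possible scoring function for a numeric target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends only on the "informative" column
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "informative": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y_demo = 3.0 * X_demo["informative"] + rng.normal(scale=0.1, size=200)

# Keep the single best feature according to the univariate F-test
selector = SelectKBest(score_func=f_regression, k=1).fit(X_demo, y_demo)

selected = X_demo.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(selected)
print(scores)
```

On the real data, the same call could be made with `x` and `y` once they are defined, choosing `k` to taste.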

In [None]:
data_train.shape

Separate independent and dependent data

In [None]:
#Independent data
x = data_train.drop(['Price'], axis = 1)
x.head()

In [None]:
#Dependent data
y = data_train['Price']
y.head()

In [None]:
# Find the correlation between the independent and dependent attributes

plt.figure(figsize = (18,18))
sns.heatmap(data_train.corr(), annot = True, cmap = 'Blues')
plt.show()
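
A large heatmap can be hard to read, so it is sometimes easier to look only at the correlation of each feature with the target. The sketch below does this on a small invented frame (the numbers are made up and carry no meaning); on the real data the same expression can be applied to `data_train`.

```python
import pandas as pd

# Invented numbers, only to show the mechanics
toy = pd.DataFrame({
    "Total_Stops": [0, 1, 2, 1, 0, 2],
    "Duration_hours": [2, 7, 19, 5, 3, 22],
    "Journey_day": [1, 9, 12, 24, 6, 18],
    "Price": [3900, 7600, 13800, 6200, 4100, 14500],
})

# Correlation of every feature with Price, strongest (by absolute value) first
corr_with_price = (
    toy.corr()["Price"]
    .drop("Price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```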

In [None]:
# Important feature using ExtraTreesRegressor
from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor()
model.fit(x, y)

In [None]:
print(model.feature_importances_)

In [None]:
#plot graph of feature importances for better visualization

plt.figure(figsize = (10, 10))
feat_imp = pd.Series(model.feature_importances_, index = x.columns)
feat_imp.nlargest(20).plot(kind = 'barh')
plt.show()

# Fitting model using Random Forest


Split the data using train_test_split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

Create model and fit the data

In [None]:
from sklearn.ensemble import RandomForestRegressor
reg_rf = RandomForestRegressor()
reg_rf.fit(X_train, y_train)

Predict on the X_test data

In [None]:
y_pred = reg_rf.predict(X_test)

Check the score

In [None]:
reg_rf.score(X_train, y_train)

In [None]:
reg_rf.score(X_test, y_test)

Plot the distribution of the errors (y_test - y_pred)

In [None]:
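# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(y_test - y_pred, kde = True)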
plt.show()

In [None]:
plt.scatter(y_test, y_pred, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()

Get the MAE, MSE, and RMSE scores

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
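
For reference, the three metrics are simple functions of the errors: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of MSE. The sketch below computes them by hand with NumPy on three made-up values (`y_true` and `y_hat` are illustrative names, not variables from this notebook).

```python
import numpy as np

# Three invented target values and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_true - y_hat            # [-10, 10, -30]
mae = np.mean(np.abs(errors))      # (10 + 10 + 30) / 3
mse = np.mean(errors ** 2)         # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                # same units as the target

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

Because RMSE is in the same units as the target, it is usually the easiest of the three to compare against typical prices.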

In [None]:
metrics.r2_score(y_test, y_pred)

# Hyperparameter Tuning

* Choose one of the following methods for hyperparameter tuning
    * RandomizedSearchCV --> Fast
    * GridSearchCV
* Assign the hyperparameters in the form of a dictionary
* Fit the model
* Check the best parameters and best score
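
This notebook uses RandomizedSearchCV; for comparison, the steps above can be sketched with GridSearchCV, which tries every combination in the grid. The data here is synthetic (`make_regression`) and the grid is deliberately tiny so it runs quickly; both are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Hyperparameters as a dictionary: 2 x 2 = 4 combinations
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
# best_score_ is the negated MSE, so flip the sign before taking the square root
best_rmse = (-grid.best_score_) ** 0.5
print("Best CV RMSE:", best_rmse)
```

The same sign convention applies to `rf_random.best_score_` when `scoring = 'neg_mean_squared_error'` is used.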

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
#Randomized Search CV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 1200, num = 12 )]

# Number of features to consider at every split
# 'auto' was removed in scikit-learn 1.3; for regressors it meant "use all features", i.e. 1.0
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num= 6)]

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10] 

In [None]:
#create random grid

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

In [None]:
# Random search of parameters using 5-fold cross-validation,
# sampling 10 different combinations (n_iter = 10)
rf_random = RandomizedSearchCV(estimator = reg_rf, param_distributions = random_grid,
                               scoring = 'neg_mean_squared_error', n_iter = 10, cv = 5,
                               verbose = 2, random_state = 42, n_jobs = 1)

Fit the model

In [None]:
rf_random.fit(X_train, y_train)

Find the best params

In [None]:
rf_random.best_params_

Get the predictions on test data 

In [None]:
predictions = rf_random.predict(X_test)

Plot the distribution of the errors (y_test - predictions)

In [None]:
plt.figure(figsize = (8,8))
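# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(y_test - predictions, kde = True)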
plt.show()

In [None]:
plt.figure(figsize = (8,8))
plt.scatter(y_test, predictions, alpha = 0.5)
plt.xlabel('y_test')
plt.ylabel('y_pred')
plt.show()

Get the MAE, MSE, and RMSE scores

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

# Save the model to reuse it again

In [None]:
import pickle
# Open a file where you want to store the model; the with-block closes it after writing
with open('flight_rf.pkl', 'wb') as file:
    # Dump the fitted search object to that file
    pickle.dump(rf_random, file)

In [None]:
# Load the saved model back; the with-block closes the file after reading
with open('flight_rf.pkl', 'rb') as flight_model:
    forest = pickle.load(flight_model)

In [None]:
y_prediction = forest.predict(X_test)

In [None]:
metrics.r2_score(y_test, y_prediction)

In [None]:
!pip freeze > '../working/flight_requirements.txt'