## Part 3. Hands-on exercise

In this exercise, you are required to build a regression model using the random forests algorithm.

The problem to be solved is predicting the price of flights.

Please download the flight price dataset from Learn.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Load and explore the dataset
- Preprocess the dataset
- Build and evaluate a regression model using random forests with default hyper-parameters
- Hyper-parameter tuning through cross-validation for random forests

In [2]:
# import all libraries used in this notebook here
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# suppress all warnings
warnings.filterwarnings("ignore")


### Task 1 and 2

In [1]:
"""load dataset and encoding cualitative values"""
data = pd.read_csv("flight_price_dataset.csv")

data1 = data

# List of categorical columns
categorical_columns = ["airline", "departure_time",	"arrival_time", "source_city", "stops", "destination_city", "class"]

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Iterate over each categorical column and apply label encoding
for col in categorical_columns:
    data1[col] = label_encoder.fit_transform(data1[col])

# Displaying the DataFrame after label encoding
#data1.head()

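The loop above reuses a single `LabelEncoder`, so after the last column is fitted the mappings for the earlier columns are no longer available. A minimal sketch of one alternative, keeping one encoder per column so integer codes can be mapped back to labels. The `toy` frame, its values, and the `encoders` dictionary are hypothetical and only stand in for the flight dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the flight data; all values are made up.
toy = pd.DataFrame({
    "airline": ["Beta", "Gamma", "Beta", "Alpha"],
    "class": ["Economy", "Business", "Economy", "Economy"],
    "price": [5953, 25612, 5956, 5955],
})

encoded = toy.copy()  # copy so the original strings stay available
encoders = {}         # one fitted encoder per categorical column

for col in ["airline", "class"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(encoded[col])

# classes_ lists the original labels in code order (code 0 first)
airline_labels = list(encoders["airline"].classes_)

# inverse_transform maps integer codes back to the original strings
decoded_airline = list(encoders["airline"].inverse_transform(encoded["airline"]))

print(airline_labels)
print(decoded_airline)
```

Keeping the fitted encoders also makes it possible to encode new, unseen flights with exactly the same mapping later on.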

In [81]:
# General information of the dataset
print("General Information of the Dataset:")
print(data.info())


General Information of the Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        300153 non-null  int64  
 1   airline           300153 non-null  int32  
 2   flight            300153 non-null  object 
 3   source_city       300153 non-null  int32  
 4   departure_time    300153 non-null  int32  
 5   stops             300153 non-null  int32  
 6   arrival_time      300153 non-null  int32  
 7   destination_city  300153 non-null  int32  
 8   class             300153 non-null  int32  
 9   duration          300153 non-null  float64
 10  days_left         300153 non-null  int64  
 11  price             300153 non-null  int64  
dtypes: float64(1), int32(7), int64(3), object(1)
memory usage: 19.5+ MB
None


In [82]:
# Unique values in each column
print("\nUnique Values in Each Column:")
for column in data.columns:
    unique_values = data[column].unique()
    print(f"Column: {column}")
    print(f"Unique Values: {unique_values}")
    print()


Unique Values in Each Column:
Column: Unnamed: 0
Unique Values: [     0      1      2 ... 300150 300151 300152]

Column: airline
Unique Values: [4 0 5 2 3 1]

Column: flight
Unique Values: ['SG-8709' 'SG-8157' 'I5-764' ... '6E-7127' '6E-7259' 'AI-433']

Column: source_city
Unique Values: [2 5 0 4 3 1]

Column: departure_time
Unique Values: [2 1 4 0 5 3]

Column: stops
Unique Values: [2 0 1]

Column: arrival_time
Unique Values: [5 4 1 0 2 3]

Column: destination_city
Unique Values: [5 0 4 3 1 2]

Column: class
Unique Values: [1 0]

Column: duration
Unique Values: [ 2.17  2.33  2.25  2.08 12.25 16.33 11.75 14.5  15.67  3.75  2.5   5.83
  8.    6.   14.67 16.17 18.   23.17 24.17  8.83  4.5  15.25 11.   19.08
 22.83 26.42 17.75 19.58 26.67 15.17 20.83 11.42 22.25 26.   21.75  3.83
  4.42  7.67  8.33 10.42 23.75 19.5   6.5  12.42 21.08 28.17 28.25  9.25
 17.92  7.08 13.83  7.58 15.83 24.42  4.17  4.25  5.08 29.33 17.   27.17
 24.75  5.75 12.75 13.75 17.83  5.5  23.83  5.   26.5  12.83  8.9
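Printing every unique value becomes unwieldy for continuous columns such as `duration`. A compact per-column summary is often enough for exploration. The sketch below assumes a small made-up frame named `toy`; on the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical stand-in for a few encoded columns of the dataset.
toy = pd.DataFrame({
    "airline": [4, 0, 5, 4],
    "stops": [2, 2, 0, 1],
    "duration": [2.17, 2.33, 12.25, 2.17],
})

# One row per column: dtype, number of distinct values, number of missing values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "n_missing": toy.isna().sum(),
})
print(summary)
```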

In [83]:
""" Split the data into features (X) and target (y)"""

feature_columns = ["airline", "departure_time", "arrival_time", "source_city", "stops", "destination_city", "class", "duration", "days_left"]
feature_df = data1[feature_columns]
target_df = data1["price"]


# display the first five rows of the features
feature_df.head()


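|   | airline | departure_time | arrival_time | source_city | stops | destination_city | class | duration | days_left |
|---|---------|----------------|--------------|-------------|-------|------------------|-------|----------|-----------|
| 0 | 4 | 2 | 5 | 2 | 2 | 5 | 1 | 2.17 | 1 |
| 1 | 4 | 1 | 4 | 2 | 2 | 5 | 1 | 2.33 | 1 |
| 2 | 0 | 1 | 1 | 2 | 2 | 5 | 1 | 2.17 | 1 |
| 3 | 5 | 4 | 0 | 2 | 2 | 5 | 1 | 2.25 | 1 |
| 4 | 5 | 4 | 4 | 2 | 2 | 5 | 1 | 2.33 | 1 |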


In [84]:
"""train test split"""
X_train, X_test, y_train, y_test = train_test_split(
    feature_df.values, # call `.values` to convert the features from pd.DataFrame to np.array
    target_df.values, # call `.values` to convert the target from pd.Series to np.array
    train_size = 0.7, # 70% for training, 30% for test
    random_state = 0 # controls the shuffling, set to zero for reproducibility
)

### Task 3
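Task 3 asks for a random forest regressor with default hyper-parameters, evaluated on the held-out test set. A minimal sketch of that step is shown here on synthetic data so that it runs on its own. The names `X_tr`, `X_te`, `y_tr`, `y_te` and `rf_default` are illustrative; in the notebook, the arrays from the train/test split above would take their place.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the nine flight features and the price target.
X, y = make_regression(n_samples=500, n_features=9, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# Default hyper-parameters; random_state is fixed only for reproducibility
rf_default = RandomForestRegressor(random_state=0)
rf_default.fit(X_tr, y_tr)

# Evaluate with regression metrics on the held-out data
y_hat = rf_default.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_hat))
r2 = r2_score(y_te, y_hat)
print(f"RMSE: {rmse:.2f}")
print(f"R^2: {r2:.3f}")
```

The default model gives a baseline against which any tuned model can be compared.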

In [85]:
"""hyper-parameters to search"""

param_dict = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3],
    'max_features': [None, 'sqrt', 'log2'],
    'min_impurity_decrease': [0.0, 0.1, 0.2]
}

# hyper-parameter tuning through cross-validation
grid_clf = GridSearchCV(
    estimator = DecisionTreeClassifier(),
    param_grid = param_dict,
    scoring = 'f1_weighted',
    refit = True,
    cv = 5,
    verbose = 1,
    n_jobs = -1
)
grid_clf.fit(X_train, y_train)

Fitting 5 folds for each of 729 candidates, totalling 3645 fits


KeyboardInterrupt: 
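The exhaustive search above had to be interrupted: 729 candidates with 5 folds means 3645 model fits on roughly 210,000 training rows. One way to bound the cost is to sample a fixed number of candidates with `RandomizedSearchCV`. The sketch below runs on synthetic data; `X_demo`, `y_demo`, the parameter values, and `n_iter=5` are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training split.
X_demo, y_demo = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=0)

# Candidate values to sample from (illustrative, not tuned for the flight data)
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 3],
    "max_features": [1.0, "sqrt", "log2"],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,         # only 5 sampled candidates instead of the full grid
    scoring="r2",     # regression score
    cv=3,             # 5 candidates x 3 folds = 15 fits, plus one refit
    random_state=0,
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print("Best hyper-parameters:", search.best_params_)
print("Best mean CV R^2:", search.best_score_)
```

The number of fits is `n_iter * cv`, so the runtime can be budgeted before the search starts.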

In [None]:
""" obtain the best hyper-parameters and the best score """
print('Best hyper-parameters:', grid_clf.best_params_)
print('Best score:', grid_clf.best_score_)

In [None]:
"""Predict and classification metrics"""

# predict categories for test dataset
y_pred = grid_clf.predict(X_test)

# obtain classification metrics using `classification_report`
print(classification_report(y_test, y_pred))

In [None]:
""" inference with unseen data"""
X_unseen = np.array([5.2, 3.1, 1.3, 0.2])
print("The predicted category of the unseen data:", grid_clf.predict(X_unseen.reshape(-1, 4)))
print("The predicted probabilities:", grid_clf.predict_proba(X_unseen.reshape(-1, 4)))

### Task 4