
## **Criteria**

**Hackathon Generic Approach:**

1.	Load Data (Both Train & Test Separately)
2.	Understand the Data (check each of the following in both the train & test)

  a.	Check Head & Tail
  
  b.	Info, Describe
  
  c.	Null Values

  d.	Bad data like ‘$’ or ‘#’ in numerical column or any other unwanted character
3.	Clean the data

  a.	Treat Missing Values in both the train & test
  
  b.	Remove bad data values in both the train & test
  
  c.	Encode the object variables in both the train & test
  
  d.	Feature Engineer (if needed)

  e.	Scale/Normalise (if needed)
4.	Make a simple model using any algorithm

  a.	Fit the model on Train
  
  b.	Predict on the Test

  c.	Store the predicted values in an array
5.	Submission (Approach 1)

  a.	Import the sample submission file

  b.	Check that your test data and the sample submission file have the same ID column sequence (if not, sort them so that each ID’s predicted value lines up with the corresponding ID in the submission file)
  
  c.	Replace the “target” column in Submission File with predicted array

  d.	Export the file to a CSV

  e.	Make sure it has the same headers as the sample submission file

  f.	Upload on the platform and check your score
6.	Submission (Approach 2)

  a.	From your Original Test Set take the ID

  b.	Create a new data frame with ID and corresponding predicted values
  
  c.	Export that data frame to CSV

  d.	The number of rows (including the header) and columns must match the sample submission; otherwise the platform will not accept it

  e.	Upload on the platform and check your score
7.	Now go back to step 3 (this is an iterative process)
  
  a.	Check if Scaling approach change helps

  b.	Check if feature engineering helps

  c.	Try removing unnecessary variables (feature importances) & check if it helps

  d.	Try grid search

  e.	Try advanced model tuning techniques (like non-parametric ensemble methods)

**A few pointers to take care of:**
  1.	Do not drop rows with null values from the test set; the submission must keep one row per test ID.
  2.	Whatever preprocessing step you perform on the training set must also be performed on the test set.
  3.	Try using **`n_jobs=-1`** when fitting the model so that it runs in parallel and fits faster. This can take up all your computational resources, so your PC may slow down for any other task while the model is fitting.
  4.	Make copies of the datasets at every checkpoint so you don’t have to restart from the beginning. You can read the latest checkpoint dataset directly and continue from there.
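
The two submission approaches above can be sketched roughly as follows. This is a minimal sketch, assuming the sample submission has exactly two columns (an ID column followed by the target). The helper `build_submission`, the toy frames, and the target column name `Overall_Experience` are illustrative assumptions, not the platform's specification.

```python
import numpy as np
import pandas as pd


def build_submission(test_ids, predictions, sample_submission):
    """Return a frame with the same columns and row order as the sample file."""
    # Assumption: first column is the ID, second column is the target.
    id_col, target_col = sample_submission.columns[:2]
    # Pair every test ID with its predicted value.
    submission = pd.DataFrame({id_col: np.asarray(test_ids),
                               target_col: np.asarray(predictions)})
    # Reorder rows to follow the sample file's ID sequence
    # (.loc raises KeyError if an ID from the sample file is missing).
    submission = (submission.set_index(id_col)
                  .loc[sample_submission[id_col].to_numpy()]
                  .rename_axis(id_col)
                  .reset_index())
    # The platform expects the same shape and headers as the sample file.
    assert submission.shape == sample_submission.shape
    assert list(submission.columns) == list(sample_submission.columns)
    return submission


# Toy data standing in for the real sample submission and predictions.
sample_demo = pd.DataFrame({"ID": [3, 1, 2], "Overall_Experience": [0, 0, 0]})
demo_submission = build_submission([1, 2, 3], [1, 0, 1], sample_demo)
print(demo_submission)
# demo_submission.to_csv("submission.csv", index=False)  # export without the index
```

Writing the file with `index=False` keeps pandas from adding an extra unnamed column, which would otherwise change the header row.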

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## **Load & view**


In [None]:
train_travel = pd.read_csv('/content/drive/MyDrive/Hackathon/Train_feedback/Traveldata_train_(1).csv')
train_travel.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,98800001,Female,Loyal Customer,52.0,,Business,272,0.0,5.0
1,98800002,Male,Loyal Customer,48.0,Personal Travel,Eco,2200,9.0,0.0
2,98800003,Female,Loyal Customer,43.0,Business travel,Business,1061,77.0,119.0
3,98800004,Female,Loyal Customer,44.0,Business travel,Business,780,13.0,18.0
4,98800005,Female,Loyal Customer,50.0,Business travel,Business,1981,0.0,0.0


In [None]:
train_survey = pd.read_csv('/content/drive/MyDrive/Hackathon/Train_feedback/Surveydata_train_(1).csv')
train_survey.head()

Unnamed: 0,ID,Overall_Experience,Seat_comfort,Seat_Class,Arrival_time_convenient,Catering,Platform_location,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,98800001,0,need improvement,Green Car,excellent,excellent,very convinient,good,need improvement,acceptable,need improvement,need improvement,acceptable,need improvement,good,need improvement,poor
1,98800002,0,poor,Ordinary,excellent,poor,need improvement,good,poor,good,good,excellent,need improvement,poor,need improvement,good,good
2,98800003,1,need improvement,Green Car,need improvement,need improvement,need improvement,need improvement,good,excellent,excellent,excellent,excellent,excellent,good,excellent,excellent
3,98800004,0,acceptable,Ordinary,need improvement,,need improvement,acceptable,need improvement,acceptable,acceptable,acceptable,acceptable,acceptable,good,acceptable,acceptable
4,98800005,1,acceptable,Ordinary,acceptable,acceptable,manageable,need improvement,good,excellent,good,good,good,good,good,good,good


In [None]:
test_travel = pd.read_csv('/content/drive/MyDrive/Hackathon/Train_feedback/Traveldata_test.csv')
test_travel.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,99900001,Female,,36.0,Business travel,Business,532,0.0,0.0
1,99900002,Female,disloyal Customer,21.0,Business travel,Business,1425,9.0,28.0
2,99900003,Male,Loyal Customer,60.0,Business travel,Business,2832,0.0,0.0
3,99900004,Female,Loyal Customer,29.0,Personal Travel,Eco,1352,0.0,0.0
4,99900005,Male,disloyal Customer,18.0,Business travel,Business,1610,17.0,0.0


In [None]:
test_survey = pd.read_csv('/content/drive/MyDrive/Hackathon/Train_feedback/Surveydata_test.csv')
test_survey.head()

Unnamed: 0,ID,Seat_comfort,Seat_Class,Arrival_time_convenient,Catering,Platform_location,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,99900001,acceptable,Green Car,acceptable,acceptable,manageable,need improvement,excellent,good,excellent,excellent,excellent,excellent,good,excellent,poor
1,99900002,extremely poor,Ordinary,good,poor,manageable,acceptable,poor,acceptable,acceptable,excellent,acceptable,good,acceptable,excellent,acceptable
2,99900003,excellent,Ordinary,excellent,excellent,very convinient,excellent,excellent,excellent,need improvement,need improvement,need improvement,need improvement,good,need improvement,excellent
3,99900004,acceptable,Green Car,excellent,acceptable,very convinient,poor,acceptable,excellent,poor,acceptable,need improvement,excellent,excellent,excellent,poor
4,99900005,excellent,Ordinary,extremely poor,excellent,need improvement,excellent,excellent,excellent,excellent,,acceptable,excellent,excellent,excellent,excellent


## **EDA**

In [None]:
print(train_travel.shape)
print(train_survey.shape)
print(test_travel.shape)
print(test_survey.shape)

(94379, 9)
(94379, 17)
(35602, 9)
(35602, 16)
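
The travel and survey tables have the same number of rows in both train and test, which suggests one survey row per traveller. If the two tables are later combined, a merge on `ID` with validation is one way to confirm that assumption. The sketch below uses a hypothetical helper `merge_on_id` and toy frames; it is not a step taken from the notebook.

```python
import pandas as pd


def merge_on_id(travel, survey, id_col="ID"):
    """Join the travel and survey tables one-to-one on the ID column."""
    # validate="one_to_one" raises MergeError if an ID repeats on either side.
    merged = travel.merge(survey, on=id_col, how="inner", validate="one_to_one")
    # An inner join drops unmatched IDs, so equal lengths mean every row matched.
    if not (len(merged) == len(travel) == len(survey)):
        raise ValueError("travel and survey tables do not cover the same IDs")
    return merged


# Toy frames standing in for train_travel and train_survey.
travel_demo = pd.DataFrame({"ID": [1, 2, 3], "Age": [52.0, 48.0, 43.0]})
survey_demo = pd.DataFrame({"ID": [3, 1, 2], "Overall_Experience": [1, 0, 0]})
merged_demo = merge_on_id(travel_demo, survey_demo)
print(merged_demo)
```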


In [None]:
print(train_travel.info())
print(train_survey.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID                      94379 non-null  int64  
 1   Gender                  94302 non-null  object 
 2   CustomerType            85428 non-null  object 
 3   Age                     94346 non-null  float64
 4   TypeTravel              85153 non-null  object 
 5   Travel_Class            94379 non-null  object 
 6   Travel_Distance         94379 non-null  int64  
 7   DepartureDelay_in_Mins  94322 non-null  float64
 8   ArrivalDelay_in_Mins    94022 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 6.5+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   ID                    

In [None]:
train_travel.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,94379.0,98847190.0,27245.014865,98800001.0,98823595.5,98847190.0,98870784.5,98894379.0
Age,94346.0,39.41965,15.116632,7.0,27.0,40.0,51.0,85.0
Travel_Distance,94379.0,1978.888,1027.961019,50.0,1359.0,1923.0,2538.0,6951.0
DepartureDelay_in_Mins,94322.0,14.64709,38.138781,0.0,0.0,0.0,12.0,1592.0
ArrivalDelay_in_Mins,94022.0,15.00522,38.439409,0.0,0.0,0.0,13.0,1584.0
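
The delay columns are strongly right-skewed: the median is 0 and the 75th percentile is 12 to 13 minutes, while the maximum is close to 1,600. If the feature-engineering step calls for it, a log transform is one common option. The helper `add_log_columns` below is a hypothetical sketch on toy values, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def add_log_columns(df, columns):
    """Return a copy of df with a log1p-transformed version of each listed column."""
    out = df.copy()
    for col in columns:
        # log1p(x) = log(1 + x), so a delay of 0 stays at 0.
        out[f"{col}_log"] = np.log1p(out[col])
    return out


# Toy values taken from the min, 75th percentile, and max shown above.
delays_demo = pd.DataFrame({"DepartureDelay_in_Mins": [0.0, 12.0, 1592.0]})
logged_demo = add_log_columns(delays_demo, ["DepartureDelay_in_Mins"])
print(logged_demo)
```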


In [None]:
train_survey.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,94379.0,98847190.0,27245.014865,98800001.0,98823595.5,98847190.0,98870784.5,98894379.0
Overall_Experience,94379.0,0.5466576,0.497821,0.0,0.0,1.0,1.0,1.0


## **Bad data**

#### **train_travel**

In [None]:
train_travel.columns

Index(['ID', 'Gender', 'CustomerType', 'Age', 'TypeTravel', 'Travel_Class',
       'Travel_Distance', 'DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins'],
      dtype='object')

In [None]:
train_travel.Gender.unique()

array(['Female', 'Male', nan], dtype=object)

In [None]:
train_travel.CustomerType.unique()

array(['Loyal Customer', 'disloyal Customer', nan], dtype=object)

In [None]:
train_travel.Age.unique()

array([52., 48., 43., 44., 50., 56., 65., 22., 57., 25., 26., 47., 33.,
       54.,  9., 68., 24., 23., 10., 55., 36., 62., 39., 29., 76., 30.,
       41.,  7., 32., 46., 35., 38., 61., 49., 21., 34., 27., 18., 37.,
       45., 63., 42., 13., 60., 64., 73., 20., 40., 58., 28., 19., 59.,
       31., 53., 17., 77., 69., 16., 70., 51., 66., 67., 14., 11., 12.,
        8., 71., 15., 80., 72., 85., nan, 74., 75., 79., 78.])

In [None]:
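# 39.42 is the Age mean rounded to two decimals (see describe() above); map it to 39.
# Assign the result back instead of calling replace(..., inplace=True) on a column selection.
train_travel['Age'] = train_travel['Age'].round(2).replace(39.42, 39)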

In [None]:
train_travel.TypeTravel.unique()

array([nan, 'Personal Travel', 'Business travel'], dtype=object)

In [None]:
train_travel.Travel_Class.unique()

array(['Business', 'Eco'], dtype=object)

In [None]:
train_travel.Travel_Distance.unique()

array([ 272, 2200, 1061, ..., 5652, 6655, 4156])

In [None]:
train_travel.DepartureDelay_in_Mins.unique()

array([0.000e+00, 9.000e+00, 7.700e+01, 1.300e+01, 1.000e+00, 2.000e+01,
       4.900e+01, 1.400e+01, 2.200e+01, 2.000e+00, 1.000e+02, 1.100e+01,
       4.200e+01, 8.000e+00, 6.500e+01, 7.600e+01, 1.800e+01, 2.700e+01,
       6.700e+01, 6.200e+01, 2.500e+01, 5.900e+01, 7.000e+00, 5.700e+01,
       1.500e+01, 3.000e+00, 5.000e+00, 1.600e+01, 7.500e+01, 1.130e+02,
       1.000e+01, 8.400e+01, 1.700e+01, 4.000e+00, 9.900e+01, 6.600e+01,
       1.250e+02, 9.600e+01, 2.300e+01, 6.100e+01, 6.400e+01, 1.900e+01,
       4.100e+01, 1.180e+02, 5.400e+01, 6.900e+01, 1.490e+02, 1.810e+02,
       6.000e+00, 3.200e+01, 8.100e+01, 1.590e+02, 1.260e+02, 2.900e+01,
       1.540e+02, 2.800e+01, 3.800e+01, 3.700e+01, 5.000e+01, 4.700e+01,
       3.000e+01, 9.800e+01, 4.000e+01, 1.200e+01, 1.310e+02, 1.070e+02,
       6.300e+01, 3.500e+01, 3.400e+01, 3.300e+01, 1.080e+02, 2.100e+01,
       9.100e+01, 7.400e+01, 5.100e+01, 1.060e+02, 2.600e+01, 1.010e+02,
       4.400e+01, 5.800e+01, 3.100e+01, 6.800e+01, 

In [None]:
train_travel.ArrivalDelay_in_Mins.unique()

array([5.000e+00, 0.000e+00, 1.190e+02, 1.800e+01, 3.000e+00, 3.400e+01,
       4.900e+01, 1.000e+00, 5.200e+01, 8.500e+01, 7.000e+00, 8.000e+00,
       1.400e+01, 9.300e+01, 3.000e+01, 2.000e+01, 1.540e+02, 7.400e+01,
       2.000e+00, 1.000e+01, 6.400e+01, 6.800e+01, 3.600e+01, 4.000e+00,
       2.300e+01, 3.800e+01, 7.600e+01, 1.500e+01, 1.200e+02, 1.000e+02,
       6.000e+00, 1.110e+02, 5.700e+01, 2.600e+01, 4.300e+01, 1.070e+02,
       8.800e+01, 2.100e+01, 1.900e+01, 3.500e+01, 5.000e+01, 5.500e+01,
       3.100e+01, 1.300e+01, 1.020e+02, 1.200e+01, 8.200e+01, 5.100e+01,
       7.000e+01, 6.100e+01, 1.700e+01, 1.030e+02, 2.200e+01, 1.390e+02,
             nan, 8.600e+01, 9.000e+01, 1.640e+02, 1.150e+02, 3.700e+01,
       1.440e+02, 7.100e+01, 1.100e+01, 9.000e+00, 3.200e+01, 4.200e+01,
       1.050e+02, 4.800e+01, 1.260e+02, 1.240e+02, 6.500e+01, 2.700e+01,
       1.600e+01, 6.000e+01, 1.310e+02, 1.300e+02, 8.900e+01, 6.600e+01,
       7.900e+01, 1.100e+02, 4.000e+01, 7.200e+01, 
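
Scanning `unique()` output by eye works for short lists, but stray characters such as `$` or `#` are easy to miss in long ones. A rough sketch of an automated check follows; `find_bad_numeric_values` is a hypothetical helper, and the demo values are made up for illustration.

```python
import pandas as pd


def find_bad_numeric_values(series):
    """Return the distinct non-missing entries that cannot be parsed as numbers."""
    as_text = series.dropna().astype(str).str.strip()
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    parsed = pd.to_numeric(as_text, errors="coerce")
    return sorted(as_text[parsed.isna()].unique())


# Made-up column containing two bad entries.
demo_values = pd.Series(["12", "$40", "7.5", None, "#", "300"])
bad_demo = find_bad_numeric_values(demo_values)
print(bad_demo)  # ['#', '$40']
```

A column that pandas already read as `int64` or `float64` cannot contain such characters, so this check matters mainly for columns that were unexpectedly loaded as `object`.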

#### **train_survey**

In [None]:
# Define the function select_unique_values_2
def select_unique_values_2(train_survey):
    """
    This function takes a Pandas DataFrame as input and returns two dictionaries:
    one containing unique values for object columns and the other for numerical columns.
    """
    unique_object_vals_2 = {}
    unique_numeric_vals_2 = {}

    for col in train_survey.columns:
        if train_survey[col].dtype == 'object':
            unique_object_vals_2[col] = train_survey[col].unique().tolist()
        else:
            unique_numeric_vals_2[col] = train_survey[col].unique().tolist()

    return unique_object_vals_2, unique_numeric_vals_2

# Call the function
unique_object_vals_2, unique_numeric_vals_2 = select_unique_values_2(train_survey)

# Print unique values for object columns
print("Unique object values:")
for col, vals in unique_object_vals_2.items():
    print(f"{col}: {vals}")

# Print unique values for numerical columns
print("\nUnique numerical values:")
for col, vals in unique_numeric_vals_2.items():
    print(f"{col}: {vals}")

Unique object values:
Seat_comfort: ['need improvement', 'poor', 'acceptable', 'good', 'excellent', 'extremely poor', nan]
Seat_Class: ['Green Car', 'Ordinary']
Arrival_time_convenient: ['excellent', 'need improvement', 'acceptable', nan, 'good', 'poor', 'extremely poor']
Catering: ['excellent', 'poor', 'need improvement', nan, 'acceptable', 'good', 'extremely poor']
Platform_location: ['very convinient', 'need improvement', 'manageable', 'Inconvinient', 'Convinient', nan, 'very inconvinient']
Onboardwifi_service: ['good', 'need improvement', 'acceptable', 'excellent', 'poor', 'extremely poor', nan]
Onboard_entertainment: ['need improvement', 'poor', 'good', 'excellent', 'acceptable', 'extremely poor', nan]
Online_support: ['acceptable', 'good', 'excellent', 'poor', nan, 'need improvement', 'extremely poor']
Onlinebooking_Ease: ['need improvement', 'good', 'excellent', 'acceptable', 'poor', nan, 'extremely poor']
Onboard_service: ['need improvement', 'excellent', 'acceptable', 'good', 

#### **test_travel**

In [None]:
# Reuse select_unique_values_2 from the train_survey cell; it accepts any DataFrame
unique_object_vals_2, unique_numeric_vals_2 = select_unique_values_2(test_travel)

# Print unique values for object columns
print("Unique object values:")
for col, vals in unique_object_vals_2.items():
    print(f"{col}: {vals}")

# Print unique values for numerical columns
print("\nUnique numerical values:")
for col, vals in unique_numeric_vals_2.items():
    print(f"{col}: {vals}")

Unique object values:
Gender: ['Female', 'Male', nan]
CustomerType: [nan, 'disloyal Customer', 'Loyal Customer']
TypeTravel: ['Business travel', 'Personal Travel', nan]
Travel_Class: ['Business', 'Eco']

Unique numerical values:
ID: [99900001, 99900002, 99900003, 99900004, 99900005, 99900006, 99900007, 99900008, 99900009, 99900010, 99900011, 99900012, 99900013, 99900014, 99900015, 99900016, 99900017, 99900018, 99900019, 99900020, 99900021, 99900022, 99900023, 99900024, 99900025, 99900026, 99900027, 99900028, 99900029, 99900030, 99900031, 99900032, 99900033, 99900034, 99900035, 99900036, 99900037, 99900038, 99900039, 99900040, 99900041, 99900042, 99900043, 99900044, 99900045, 99900046, 99900047, 99900048, 99900049, 99900050, 99900051, 99900052, 99900053, 99900054, 99900055, 99900056, 99900057, 99900058, 99900059, 99900060, 99900061, 99900062, 99900063, 99900064, 99900065, 99900066, 99900067, 99900068, 99900069, 99900070, 99900071, 99900072, 99900073, 99900074, 99900075, 99900076, 999000

#### **test_survey**

In [None]:
# Reuse select_unique_values_2 from the train_survey cell; it accepts any DataFrame
unique_object_vals_2, unique_numeric_vals_2 = select_unique_values_2(test_survey)

# Print unique values for object columns
print("Unique object values:")
for col, vals in unique_object_vals_2.items():
    print(f"{col}: {vals}")

# Print unique values for numerical columns
print("\nUnique numerical values:")
for col, vals in unique_numeric_vals_2.items():
    print(f"{col}: {vals}")

Unique object values:
Seat_comfort: ['acceptable', 'extremely poor', 'excellent', 'poor', 'need improvement', 'good', nan]
Seat_Class: ['Green Car', 'Ordinary']
Arrival_time_convenient: ['acceptable', 'good', 'excellent', 'extremely poor', nan, 'need improvement', 'poor']
Catering: ['acceptable', 'poor', 'excellent', 'need improvement', 'good', nan, 'extremely poor']
Platform_location: ['manageable', 'very convinient', 'need improvement', 'Inconvinient', 'Convinient', nan]
Onboardwifi_service: ['need improvement', 'acceptable', 'excellent', 'poor', 'good', 'extremely poor', nan]
Onboard_entertainment: ['excellent', 'poor', 'acceptable', 'good', 'need improvement', 'extremely poor', nan]
Online_support: ['good', 'acceptable', 'excellent', 'need improvement', 'poor', nan]
Onlinebooking_Ease: ['excellent', 'acceptable', 'need improvement', 'poor', 'good', nan, 'extremely poor']
Onboard_service: ['excellent', 'need improvement', 'acceptable', nan, 'good', 'poor']
Leg_room: ['excellent', 'a
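
Comparing the printed lists for train and test shows that some labels occur in only one of them: for example, `Platform_location` contains `'very inconvinient'` in the training survey but not in the test survey. A sketch of how that comparison could be automated is given below; `category_differences` is a hypothetical helper, and the demo frames are toy data.

```python
import pandas as pd


def category_differences(train_df, test_df):
    """Map each shared non-numeric column to the labels found in only one frame."""
    report = {}
    shared = [col for col in train_df.columns if col in test_df.columns]
    for col in shared:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            continue  # only label columns are compared
        train_vals = set(train_df[col].dropna().unique())
        test_vals = set(test_df[col].dropna().unique())
        if train_vals != test_vals:
            report[col] = {"train_only": sorted(train_vals - test_vals),
                           "test_only": sorted(test_vals - train_vals)}
    return report


# Toy frames; the label spellings follow the dataset.
train_demo = pd.DataFrame({"Platform_location": ["manageable", "very inconvinient", None],
                           "Seat_Class": ["Ordinary", "Green Car", "Ordinary"]})
test_demo = pd.DataFrame({"Platform_location": ["manageable", "Convinient"],
                          "Seat_Class": ["Green Car", "Ordinary"]})
diff_demo = category_differences(train_demo, test_demo)
print(diff_demo)
```

Labels that appear on only one side matter for the encoding step, because an encoder fitted on one table may assign different codes than one fitted on the other.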

In [None]:
train_t = train_travel.copy()
train_s = train_survey.copy()
test_t = test_travel.copy()
test_s = test_survey.copy()

## **Missing values**

In [None]:
print('train travel:', train_t.isnull().sum())

print('train survey:', train_s.isnull().sum())

train travel: ID                           0
Gender                      77
CustomerType              8951
Age                         33
TypeTravel                9226
Travel_Class                 0
Travel_Distance              0
DepartureDelay_in_Mins      57
ArrivalDelay_in_Mins       357
dtype: int64
train survey: ID                            0
Overall_Experience            0
Seat_comfort                 61
Seat_Class                    0
Arrival_time_convenient    8930
Catering                   8741
Platform_location            30
Onboardwifi_service          30
Onboard_entertainment        18
Online_support               91
Onlinebooking_Ease           73
Onboard_service            7601
Leg_room                     90
Baggage_handling            142
Checkin_service              77
Cleanliness                   6
Online_boarding               6
dtype: int64


In [None]:
print('test travel:', test_t.isnull().sum())

print('test survey:', test_s.isnull().sum())

test travel: ID                           0
Gender                      30
CustomerType              3383
Age                         11
TypeTravel                3448
Travel_Class                 0
Travel_Distance              0
DepartureDelay_in_Mins      29
ArrivalDelay_in_Mins       123
dtype: int64
test survey: ID                            0
Seat_comfort                 22
Seat_Class                    0
Arrival_time_convenient    3325
Catering                   3357
Platform_location            12
Onboardwifi_service          12
Onboard_entertainment         8
Online_support               26
Onlinebooking_Ease           18
Onboard_service            2872
Leg_room                     25
Baggage_handling             40
Checkin_service              22
Cleanliness                   2
Online_boarding               2
dtype: int64
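
Raw counts are easier to judge next to the share of rows they represent. The following sketch reports both per column; `missing_summary` is a hypothetical helper shown on a toy frame.

```python
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts,
                            "percent": (counts / len(df) * 100).round(2)})
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with gaps in two of its three columns.
gaps_demo = pd.DataFrame({"a": [1.0, None, 3.0, None],
                          "b": [1, 2, 3, 4],
                          "c": ["x", None, "y", "z"]})
summary_demo = missing_summary(gaps_demo)
print(summary_demo)
```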


In [None]:
print(train_t.shape)
print(train_s.shape)
print(test_t.shape)
print(test_s.shape)

(94379, 9)
(94379, 17)
(35602, 9)
(35602, 16)


In [None]:
from sklearn.impute import SimpleImputer


#### **train_travel**

In [None]:
def impute_missing_values_with_mean(train_t):
    # Ensure the dataframe is of type DataFrame
    if not isinstance(train_t, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    # Iterate over each column in the DataFrame
    for column in train_t.select_dtypes(include=['number']).columns:
        # Calculate the mean of the column, excluding NaN values
        mean_value = train_t[column].mean()
        # Impute missing values with the mean (assign back; chained inplace fillna is unreliable)
        train_t[column] = train_t[column].fillna(mean_value)

    return train_t

# Example usage:
# Impute missing values with the mean
train_t_imputed_num = impute_missing_values_with_mean(train_t)

train_t_imputed_num.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,98800001,Female,Loyal Customer,52.0,,Business,272,0.0,5.0
1,98800002,Male,Loyal Customer,48.0,Personal Travel,Eco,2200,9.0,0.0
2,98800003,Female,Loyal Customer,43.0,Business travel,Business,1061,77.0,119.0
3,98800004,Female,Loyal Customer,44.0,Business travel,Business,780,13.0,18.0
4,98800005,Female,Loyal Customer,50.0,Business travel,Business,1981,0.0,0.0


In [None]:
def impute_missing_values_with_mode(train_t):
    # Ensure the dataframe is of type DataFrame
    if not isinstance(train_t, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    # Iterate over each column in the DataFrame
    for column in train_t.select_dtypes(include=['object', 'category']).columns:
        # Calculate the mode of the column
        mode_value = train_t[column].mode()[0]
        # Impute missing values with the mode (assign back; chained inplace fillna is unreliable)
        train_t[column] = train_t[column].fillna(mode_value)

    return train_t

# Impute missing values with the mode
train_t_imputed_cat = impute_missing_values_with_mode(train_t)

train_t_imputed_cat.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,98800001,Female,Loyal Customer,52.0,Business travel,Business,272,0.0,5.0
1,98800002,Male,Loyal Customer,48.0,Personal Travel,Eco,2200,9.0,0.0
2,98800003,Female,Loyal Customer,43.0,Business travel,Business,1061,77.0,119.0
3,98800004,Female,Loyal Customer,44.0,Business travel,Business,780,13.0,18.0
4,98800005,Female,Loyal Customer,50.0,Business travel,Business,1981,0.0,0.0


In [None]:
train_t.isnull().sum().sum()

0

#### **train_survey**

In [None]:
def impute_missing_values_with_mode(train_s):
    # Ensure the dataframe is of type DataFrame
    if not isinstance(train_s, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    # Iterate over each column in the DataFrame
    for column in train_s.select_dtypes(include=['object', 'category']).columns:
        # Calculate the mode of the column
        mode_value = train_s[column].mode()[0]
        # Impute missing values with the mode (assign back; chained inplace fillna is unreliable)
        train_s[column] = train_s[column].fillna(mode_value)

    return train_s

# Impute missing values with the mode
train_s_imputed_cat = impute_missing_values_with_mode(train_s)

train_s_imputed_cat.head()

Unnamed: 0,ID,Overall_Experience,Seat_comfort,Seat_Class,Arrival_time_convenient,Catering,Platform_location,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,98800001,0,need improvement,Green Car,excellent,excellent,very convinient,good,need improvement,acceptable,need improvement,need improvement,acceptable,need improvement,good,need improvement,poor
1,98800002,0,poor,Ordinary,excellent,poor,need improvement,good,poor,good,good,excellent,need improvement,poor,need improvement,good,good
2,98800003,1,need improvement,Green Car,need improvement,need improvement,need improvement,need improvement,good,excellent,excellent,excellent,excellent,excellent,good,excellent,excellent
3,98800004,0,acceptable,Ordinary,need improvement,acceptable,need improvement,acceptable,need improvement,acceptable,acceptable,acceptable,acceptable,acceptable,good,acceptable,acceptable
4,98800005,1,acceptable,Ordinary,acceptable,acceptable,manageable,need improvement,good,excellent,good,good,good,good,good,good,good


In [None]:
train_s.isnull().sum().sum()

0

#### **test_travel**

In [None]:
def impute_missing_values_with_mean(test_t):
    # Ensure the dataframe is of type DataFrame
    if not isinstance(test_t, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    # Iterate over each column in the DataFrame
    for column in test_t.select_dtypes(include=['number']).columns:
        # Use the training-set mean (global train_t) so test gets the same fill value as train
        mean_value = train_t[column].mean()
        # Impute missing values with that mean (assign back instead of inplace fillna)
        test_t[column] = test_t[column].fillna(mean_value)

    return test_t

# Impute missing values with the mean
test_t_imputed_num = impute_missing_values_with_mean(test_t)

test_t_imputed_num.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,99900001,Female,,36.0,Business travel,Business,532,0.0,0.0
1,99900002,Female,disloyal Customer,21.0,Business travel,Business,1425,9.0,28.0
2,99900003,Male,Loyal Customer,60.0,Business travel,Business,2832,0.0,0.0
3,99900004,Female,Loyal Customer,29.0,Personal Travel,Eco,1352,0.0,0.0
4,99900005,Male,disloyal Customer,18.0,Business travel,Business,1610,17.0,0.0


In [None]:
def impute_missing_values_with_mode(test_t):
    # Ensure the dataframe is of type DataFrame
    if not isinstance(test_t, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    # Iterate over each column in the DataFrame
    for column in test_t.select_dtypes(include=['object', 'category']).columns:
        # Use the training-set mode (global train_t) so test gets the same fill value as train
        mode_value = train_t[column].mode()[0]
        # Impute missing values with that mode (assign back instead of inplace fillna)
        test_t[column] = test_t[column].fillna(mode_value)

    return test_t

# Impute missing values with the mode
test_t_imputed_cat = impute_missing_values_with_mode(test_t)

test_t_imputed_cat.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,99900001,Female,Loyal Customer,36.0,Business travel,Business,532,0.0,0.0
1,99900002,Female,disloyal Customer,21.0,Business travel,Business,1425,9.0,28.0
2,99900003,Male,Loyal Customer,60.0,Business travel,Business,2832,0.0,0.0
3,99900004,Female,Loyal Customer,29.0,Personal Travel,Eco,1352,0.0,0.0
4,99900005,Male,disloyal Customer,18.0,Business travel,Business,1610,17.0,0.0


In [None]:
test_t.isnull().sum().sum()

0

#### **test_survey**

In [None]:
def impute_missing_values_with_mode(test_s):
    # Ensure the dataframe is of type DataFrame
    if not isinstance(test_s, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    # Iterate over each column in the DataFrame
    for column in test_s.select_dtypes(include=['object', 'category']).columns:
        # Use the training-set mode (global train_s) so test gets the same fill value as train
        mode_value = train_s[column].mode()[0]
        # Impute missing values with that mode (assign back instead of inplace fillna)
        test_s[column] = test_s[column].fillna(mode_value)

    return test_s

# Impute missing values with the mode
test_s_imputed_cat = impute_missing_values_with_mode(test_s)

test_s_imputed_cat.head()

Unnamed: 0,ID,Seat_comfort,Seat_Class,Arrival_time_convenient,Catering,Platform_location,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,99900001,acceptable,Green Car,acceptable,acceptable,manageable,need improvement,excellent,good,excellent,excellent,excellent,excellent,good,excellent,poor
1,99900002,extremely poor,Ordinary,good,poor,manageable,acceptable,poor,acceptable,acceptable,excellent,acceptable,good,acceptable,excellent,acceptable
2,99900003,excellent,Ordinary,excellent,excellent,very convinient,excellent,excellent,excellent,need improvement,need improvement,need improvement,need improvement,good,need improvement,excellent
3,99900004,acceptable,Green Car,excellent,acceptable,very convinient,poor,acceptable,excellent,poor,acceptable,need improvement,excellent,excellent,excellent,poor
4,99900005,excellent,Ordinary,extremely poor,excellent,need improvement,excellent,excellent,excellent,excellent,good,acceptable,excellent,excellent,excellent,excellent


In [None]:
test_s.isnull().sum().sum()

0
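
The four imputation cells above share one idea: learn the fill values from the training data and apply them to both train and test. That idea can be sketched as a pair of small functions. The names `learn_fill_values` and `apply_fill_values` are hypothetical, and skipping the `ID` column is an assumption made here for illustration.

```python
import pandas as pd


def learn_fill_values(train_df, skip=("ID",)):
    """Learn one fill value per column from the training frame only."""
    fill_values = {}
    for col in train_df.columns:
        if col in skip:
            continue
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill_values[col] = train_df[col].mean()     # mean for numeric columns
        else:
            fill_values[col] = train_df[col].mode()[0]  # most frequent label otherwise
    return fill_values


def apply_fill_values(df, fill_values):
    """Return a copy of df with missing entries replaced by the learned values."""
    known = {col: val for col, val in fill_values.items() if col in df.columns}
    return df.fillna(value=known)


# Toy frames standing in for the train and test travel tables.
train_fill_demo = pd.DataFrame({"ID": [1, 2, 3],
                                "Age": [20.0, None, 40.0],
                                "Gender": ["F", "F", None]})
test_fill_demo = pd.DataFrame({"ID": [4, 5],
                               "Age": [None, 30.0],
                               "Gender": [None, "M"]})
fills_demo = learn_fill_values(train_fill_demo)
test_filled_demo = apply_fill_values(test_fill_demo, fills_demo)
print(test_filled_demo)
```

Because `apply_fill_values` returns a new frame, the original tables stay available as checkpoints.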

In [None]:
train_tr = train_t.copy()
train_su = train_s.copy()
test_tr = test_t.copy()
test_su = test_s.copy()

## **Encoding**

#### **train_travel**

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

In [None]:
categorical_columns_1 = train_tr.select_dtypes(include=['object', 'category']).columns
categorical_columns_1

Index(['Gender', 'CustomerType', 'TypeTravel', 'Travel_Class'], dtype='object')

In [None]:
categorical_columns_1 = ['Gender', 'CustomerType', 'TypeTravel', 'Travel_Class']
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each categorical column
for column in categorical_columns_1:
    train_tr[column] = label_encoder.fit_transform(train_tr[column])

train_tr.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,98800001,0,0,52.0,0,0,272,0.0,5.0
1,98800002,1,0,48.0,1,1,2200,9.0,0.0
2,98800003,0,0,43.0,0,0,1061,77.0,119.0
3,98800004,0,0,44.0,0,0,780,13.0,18.0
4,98800005,0,0,50.0,0,0,1981,0.0,0.0


In [None]:
#train_t['Gender'] = label_encoder.fit_transform(train_t['Gender'])
#train_t['CustomerType'] = label_encoder.fit_transform(train_t['CustomerType'])
#train_t['TypeTravel'] = label_encoder.fit_transform(train_t['TypeTravel'])
#train_t['Travel_Class'] = label_encoder.fit_transform(train_t['Travel_Class'])

#### **train_survey**

In [None]:
categorical_columns_2 = train_su.select_dtypes(include=['object', 'category']).columns
categorical_columns_2

Index(['Seat_comfort', 'Seat_Class', 'Arrival_time_convenient', 'Catering',
       'Platform_location', 'Onboardwifi_service', 'Onboard_entertainment',
       'Online_support', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room',
       'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding'],
      dtype='object')

In [None]:
categorical_columns_2 = [
    'Seat_comfort', 'Seat_Class', 'Arrival_time_convenient', 'Catering',
    'Platform_location', 'Onboardwifi_service', 'Onboard_entertainment',
    'Online_support', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room',
    'Baggage_handling', 'Checkin_service', 'Cleanliness', 'Online_boarding'
]

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each categorical column
for column in categorical_columns_2:
    train_su[column] = label_encoder.fit_transform(train_su[column])

train_su.head()

Unnamed: 0,ID,Overall_Experience,Seat_comfort,Seat_Class,Arrival_time_convenient,Catering,Platform_location,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,98800001,0,4,0,1,1,4,3,4,0,4,4,0,3,3,4,5
1,98800002,0,5,1,1,5,3,3,5,3,3,1,4,4,4,3,3
2,98800003,1,4,0,4,4,3,4,3,1,1,1,1,1,3,1,1
3,98800004,0,0,1,4,0,3,0,4,0,0,0,0,0,3,0,0
4,98800005,1,0,1,0,0,2,4,3,1,3,3,3,2,3,3,3
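`LabelEncoder` assigns integer codes in the sorted (alphabetical) order of the labels, so for rating-style survey columns the codes do not necessarily follow the rating order. A minimal sketch of an explicit ordinal mapping follows. The rating labels and the toy frame `toy_su` are hypothetical stand-ins, since the raw category values are not shown in this section; replace them with the categories actually present in the data.

```python
import pandas as pd

# Hypothetical rating labels, ordered from worst to best (assumption:
# replace with the categories actually present in the survey columns).
rating_order = ["Extremely Poor", "Poor", "Needs Improvement",
                "Acceptable", "Good", "Excellent"]
rating_map = {label: rank for rank, label in enumerate(rating_order)}

# Toy stand-in for a slice of train_su.
toy_su = pd.DataFrame({
    "Seat_comfort": ["Good", "Poor", "Excellent"],
    "Catering": ["Acceptable", "Extremely Poor", "Good"],
})

encoded = toy_su.copy()
for column in ["Seat_comfort", "Catering"]:
    # Labels missing from rating_map become NaN, which makes typos visible.
    encoded[column] = toy_su[column].map(rating_map)

print(encoded)
```

With a mapping like this, a larger code always means a better rating, which tree splits and linear models can both exploit.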


#### **test_travel**

In [None]:
categorical_columns_3 = test_tr.select_dtypes(include=['object', 'category']).columns
categorical_columns_3

Index(['Gender', 'CustomerType', 'TypeTravel', 'Travel_Class'], dtype='object')

In [None]:
categorical_columns_3 = ['Gender', 'CustomerType', 'TypeTravel', 'Travel_Class']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each categorical column
for column in categorical_columns_3:
    test_tr[column] = label_encoder.fit_transform(test_tr[column])

test_tr.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins
0,99900001,0,0,36.0,0,0,532,0.0,0.0
1,99900002,0,1,21.0,0,0,1425,9.0,28.0
2,99900003,1,0,60.0,0,0,2832,0.0,0.0
3,99900004,0,0,29.0,1,1,1352,0.0,0.0
4,99900005,1,1,18.0,0,0,1610,17.0,0.0


#### **test_survey**

In [None]:
categorical_columns_4 = test_su.select_dtypes(include=['object', 'category']).columns
categorical_columns_4

Index(['Seat_comfort', 'Seat_Class', 'Arrival_time_convenient', 'Catering',
       'Platform_location', 'Onboardwifi_service', 'Onboard_entertainment',
       'Online_support', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room',
       'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding'],
      dtype='object')

In [None]:
categorical_columns_4 = ['Seat_comfort', 'Seat_Class', 'Arrival_time_convenient', 'Catering',
                         'Platform_location', 'Onboardwifi_service', 'Onboard_entertainment',
                         'Online_support', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room',
                         'Baggage_handling', 'Checkin_service', 'Cleanliness',
                         'Online_boarding']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each categorical column
for column in categorical_columns_4:
    test_su[column] = label_encoder.fit_transform(test_su[column])

test_su.head()

Unnamed: 0,ID,Seat_comfort,Seat_Class,Arrival_time_convenient,Catering,Platform_location,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,99900001,0,0,0,0,2,4,1,2,1,1,1,1,2,1,5
1,99900002,2,1,3,5,2,0,5,0,0,1,0,2,0,1,0
2,99900003,1,1,1,1,4,1,1,1,4,3,4,3,2,3,1
3,99900004,0,0,1,0,4,5,0,1,5,0,4,1,1,1,5
4,99900005,1,1,2,1,3,1,1,1,1,2,0,1,1,1,1
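The cells above fit a fresh `LabelEncoder` on the train and test tables separately. The codes agree only if both tables contain exactly the same set of categories in every column. A safer pattern is to fit the encoder once on train and reuse it on test. The sketch below swaps in `OrdinalEncoder`, which handles several columns at once and can flag unseen categories; the toy frames and category values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for train_tr / test_tr (assumption: the real tables have more columns).
toy_train = pd.DataFrame({"Gender": ["Female", "Male", "Female"],
                          "Travel_Class": ["Eco", "Business", "Eco"]})
toy_test = pd.DataFrame({"Gender": ["Male", "Female", "Other"],
                         "Travel_Class": ["Business", "Eco", "Eco"]})

cat_cols = ["Gender", "Travel_Class"]

# Fit once on train; categories unseen in train are encoded as -1 in test.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = pd.DataFrame(encoder.fit_transform(toy_train[cat_cols]),
                         columns=cat_cols, index=toy_train.index)
test_enc = pd.DataFrame(encoder.transform(toy_test[cat_cols]),
                        columns=cat_cols, index=toy_test.index)

print(test_enc)
```

The `-1` code for the unseen `"Other"` value makes the mismatch visible instead of silently shifting the other codes.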


## **Merge datasets**

In [None]:
# Merging the travel and survey data on 'ID'
train_data = pd.merge(train_tr, train_su, on='ID')
train_data.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins,Overall_Experience,...,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,98800001,0,0,52.0,0,0,272,0.0,5.0,0,...,3,4,0,4,4,0,3,3,4,5
1,98800002,1,0,48.0,1,1,2200,9.0,0.0,0,...,3,5,3,3,1,4,4,4,3,3
2,98800003,0,0,43.0,0,0,1061,77.0,119.0,1,...,4,3,1,1,1,1,1,3,1,1
3,98800004,0,0,44.0,0,0,780,13.0,18.0,0,...,0,4,0,0,0,0,0,3,0,0
4,98800005,0,0,50.0,0,0,1981,0.0,0.0,1,...,4,3,1,3,3,3,2,3,3,3


In [None]:
test_data = pd.merge(test_tr, test_su, on='ID')
test_data.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins,Seat_comfort,...,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,99900001,0,0,36.0,0,0,532,0.0,0.0,0,...,4,1,2,1,1,1,1,2,1,5
1,99900002,0,1,21.0,0,0,1425,9.0,28.0,2,...,0,5,0,0,1,0,2,0,1,0
2,99900003,1,0,60.0,0,0,2832,0.0,0.0,1,...,1,1,1,4,3,4,3,2,3,1
3,99900004,0,0,29.0,1,1,1352,0.0,0.0,0,...,5,0,1,5,0,4,1,1,1,5
4,99900005,1,1,18.0,0,0,1610,17.0,0.0,1,...,1,1,1,1,2,0,1,1,1,1
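`pd.merge` defaults to an inner join, so an ID missing from either table is dropped without warning, and a duplicated ID multiplies rows. A small sketch of a guarded merge, using toy frames (`toy_tr`, `toy_su`) as assumed stand-ins for the travel and survey tables:

```python
import pandas as pd

toy_tr = pd.DataFrame({"ID": [1, 2, 3], "Age": [52.0, 48.0, 43.0]})
toy_su = pd.DataFrame({"ID": [1, 2, 3], "Seat_comfort": [4, 5, 4]})

# validate raises pandas.errors.MergeError if either side has duplicate IDs.
merged = pd.merge(toy_tr, toy_su, on="ID", how="inner", validate="one_to_one")

# An inner join silently drops unmatched IDs, so compare the row counts.
assert len(merged) == len(toy_tr) == len(toy_su)
print(merged.shape)
```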


## **Drop 'ID'**

In [None]:
# Drop ID column
train_df = train_data.drop('ID', axis=1)
test_df = test_data.drop('ID', axis=1)

In [None]:
train_df.head()

Unnamed: 0,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins,Overall_Experience,Seat_comfort,...,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,0,0,52.0,0,0,272,0.0,5.0,0,4,...,3,4,0,4,4,0,3,3,4,5
1,1,0,48.0,1,1,2200,9.0,0.0,0,5,...,3,5,3,3,1,4,4,4,3,3
2,0,0,43.0,0,0,1061,77.0,119.0,1,4,...,4,3,1,1,1,1,1,3,1,1
3,0,0,44.0,0,0,780,13.0,18.0,0,0,...,0,4,0,0,0,0,0,3,0,0
4,0,0,50.0,0,0,1981,0.0,0.0,1,0,...,4,3,1,3,3,3,2,3,3,3


In [None]:
test_df.head()

Unnamed: 0,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins,Seat_comfort,Seat_Class,...,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,0,0,36.0,0,0,532,0.0,0.0,0,0,...,4,1,2,1,1,1,1,2,1,5
1,0,1,21.0,0,0,1425,9.0,28.0,2,1,...,0,5,0,0,1,0,2,0,1,0
2,1,0,60.0,0,0,2832,0.0,0.0,1,1,...,1,1,1,4,3,4,3,2,3,1
3,0,0,29.0,1,1,1352,0.0,0.0,0,0,...,5,0,1,5,0,4,1,1,1,5
4,1,1,18.0,0,0,1610,17.0,0.0,1,1,...,1,1,1,1,2,0,1,1,1,1


In [None]:
print('Train data:', train_df.shape)
print('Test data:', test_df.shape)

Train data: (94379, 24)
Test data: (35602, 23)


In [None]:
train_df.columns

Index(['Gender', 'CustomerType', 'Age', 'TypeTravel', 'Travel_Class',
       'Travel_Distance', 'DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins',
       'Overall_Experience', 'Seat_comfort', 'Seat_Class',
       'Arrival_time_convenient', 'Catering', 'Platform_location',
       'Onboardwifi_service', 'Onboard_entertainment', 'Online_support',
       'Onlinebooking_Ease', 'Onboard_service', 'Leg_room', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding'],
      dtype='object')

In [None]:
test_df.columns

Index(['Gender', 'CustomerType', 'Age', 'TypeTravel', 'Travel_Class',
       'Travel_Distance', 'DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins',
       'Seat_comfort', 'Seat_Class', 'Arrival_time_convenient', 'Catering',
       'Platform_location', 'Onboardwifi_service', 'Onboard_entertainment',
       'Online_support', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room',
       'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding'],
      dtype='object')
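Rather than comparing the two column listings by eye, the check can be made explicit. The scaler and the models later receive plain arrays, so the test columns must match the train features in both name and order. A minimal sketch on toy frames (the frames are assumptions; only the column names come from the data above):

```python
import pandas as pd

target = "Overall_Experience"
toy_train_df = pd.DataFrame({"Gender": [0, 1], "Age": [52.0, 48.0], target: [0, 1]})
toy_test_df = pd.DataFrame({"Gender": [0, 0], "Age": [36.0, 21.0]})

feature_cols = [c for c in toy_train_df.columns if c != target]

# Same names in the same order; a mismatch would feed features into the wrong slots.
columns_match = feature_cols == list(toy_test_df.columns)
print(columns_match)
```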

## **Feature Engineering**

In [None]:
#train_df['TotalDelay'] = train_df['DepartureDelay_in_Mins'] + train_df['ArrivalDelay_in_Mins']
#test_df['TotalDelay'] = test_df['DepartureDelay_in_Mins'] + test_df['ArrivalDelay_in_Mins']


In [None]:
#train_df = train_df.drop(['DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins'], axis=1)
#test_df = test_df.drop(['DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins'], axis=1)

In [None]:
train_df.columns

Index(['Gender', 'CustomerType', 'Age', 'TypeTravel', 'Travel_Class',
       'Travel_Distance', 'DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins',
       'Overall_Experience', 'Seat_comfort', 'Seat_Class',
       'Arrival_time_convenient', 'Catering', 'Platform_location',
       'Onboardwifi_service', 'Onboard_entertainment', 'Online_support',
       'Onlinebooking_Ease', 'Onboard_service', 'Leg_room', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding'],
      dtype='object')

In [None]:
test_df.columns

Index(['Gender', 'CustomerType', 'Age', 'TypeTravel', 'Travel_Class',
       'Travel_Distance', 'DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins',
       'Seat_comfort', 'Seat_Class', 'Arrival_time_convenient', 'Catering',
       'Platform_location', 'Onboardwifi_service', 'Onboard_entertainment',
       'Online_support', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room',
       'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding'],
      dtype='object')

## **Splitting data**

In [None]:
X_train = train_df.drop('Overall_Experience', axis=1)
y_train = train_df['Overall_Experience']

In [None]:
# Split the data into training and validation sets
#X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
y_train.head()

Unnamed: 0,Overall_Experience
0,0
1,0
2,1
3,0
4,1
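The split cell above is commented out, so every model below is scored on the rows it was trained on. A hedged sketch of a stratified hold-out split follows; the synthetic `toy_X` / `toy_y` data are assumptions standing in for `X_train` / `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
toy_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
toy_y = pd.Series([0] * 40 + [1] * 60, name="Overall_Experience")

# stratify keeps the 0/1 ratio the same in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
print(X_tr.shape, X_val.shape, y_val.mean())
```

Fitting on `X_tr` and scoring on `X_val` gives an accuracy estimate that is comparable to the leaderboard score.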


## **Scaling**

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)  # .select_dtypes(include=[np.number]))

In [None]:
test_scaled = scaler.transform(test_df)

In [None]:
#scaler = MinMaxScaler()

# Fit and transform the data
#scaled_data = scaler.fit_transform(df)

# Convert the scaled data back to a DataFrame
#df_scaled = pd.DataFrame(scaled_data, columns=df.columns)


In [None]:
X_train_scale = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scale.head()

Unnamed: 0,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins,Seat_comfort,Seat_Class,...,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,-0.985222,-0.445082,0.832369,-0.625411,-1.046703,-1.660469,-0.384165,-0.260781,0.796421,-0.994811,...,0.371984,1.054582,-1.408074,1.023649,1.065589,-1.412999,1.305267,0.500143,1.233882,1.602103
1,1.014999,-0.445082,0.567712,1.59895,0.95538,0.215099,-0.148112,-0.391103,1.3486,1.005216,...,0.371984,1.671699,0.42844,0.416178,-0.79453,1.057109,2.220299,1.067487,0.578789,0.456212
2,-0.985222,-0.445082,0.236891,-0.625411,-1.046703,-0.892926,1.635398,2.710567,0.796421,-0.994811,...,0.944672,0.437466,-0.795903,-0.798764,-0.79453,-0.795472,-0.524796,0.500143,-0.731396,-0.689679
3,-0.985222,-0.445082,0.303056,-0.625411,-1.046703,-1.166284,-0.0432,0.078057,-1.412296,1.005216,...,-1.346079,1.054582,-1.408074,-1.406235,-1.41457,-1.412999,-1.439828,0.500143,-1.386489,-1.262624
4,-0.985222,-0.445082,0.700041,-0.625411,-1.046703,0.002054,-0.384165,-0.391103,-1.412296,1.005216,...,0.944672,0.437466,-0.795903,0.416178,0.445549,0.439582,0.390235,0.500143,0.578789,0.456212


In [None]:
test_scale = pd.DataFrame(test_scaled, columns=test_df.columns)
test_scale.head()

Unnamed: 0,Gender,CustomerType,Age,TypeTravel,Travel_Class,Travel_Distance,DepartureDelay_in_Mins,ArrivalDelay_in_Mins,Seat_comfort,Seat_Class,...,Onboardwifi_service,Onboard_entertainment,Online_support,Onlinebooking_Ease,Onboard_service,Leg_room,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,-0.985222,-0.445082,-0.226258,-0.625411,-1.046703,-1.40754,-0.384165,-0.391103,-1.412296,-0.994811,...,0.944672,-0.796767,-0.183731,-0.798764,-0.79453,-0.795472,-0.524796,-0.067201,-0.731396,1.602103
1,-0.985222,2.246775,-1.218722,-0.625411,-1.046703,-0.538825,-0.148112,0.338702,-0.307938,1.005216,...,-1.346079,1.671699,-1.408074,-1.406235,-0.79453,-1.412999,0.390235,-1.201888,-0.731396,-1.262624
2,1.014999,-0.445082,1.361683,-0.625411,-1.046703,0.829911,-0.384165,-0.391103,-0.860117,1.005216,...,-0.773391,-0.796767,-0.795903,1.023649,0.445549,1.057109,1.305267,-0.067201,0.578789,-0.689679
3,-0.985222,-0.445082,-0.689408,1.59895,0.95538,-0.60984,-0.384165,-0.391103,-1.412296,-0.994811,...,1.517359,-1.413884,-0.795903,1.63112,-1.41457,1.057109,-0.524796,-0.634544,-0.731396,1.602103
4,1.014999,2.246775,-1.417214,-0.625411,-1.046703,-0.358856,0.061712,-0.391103,-0.860117,1.005216,...,-0.773391,-0.796767,-0.795903,-0.798764,-0.174491,-1.412999,-0.524796,-0.634544,-0.731396,-0.689679


## **Model Building - RF**

In [None]:
# Create the RandomForest model with parallel processing
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model using parallel processing
rf.fit(X_train_scale, y_train)



In [None]:
y_train_pred = rf.predict(X_train_scale)
# Accuracy on the rows the model was fitted on (no hold-out set is used here)
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred)}')

Training Accuracy: 0.999989404422594
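The accuracy printed above is computed on the same rows the forest was fitted on, so a value near 1.0 mostly reflects memorisation. Cross-validation gives a more honest estimate. The sketch below uses synthetic data from `make_classification` as an assumed stand-in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with some label noise (flip_y).
X_toy, y_toy = make_classification(n_samples=400, n_features=10,
                                   flip_y=0.2, random_state=42)

rf_toy = RandomForestClassifier(n_estimators=50, random_state=42)
rf_toy.fit(X_toy, y_toy)

# Score on the training rows versus the mean score on held-out folds.
train_acc = accuracy_score(y_toy, rf_toy.predict(X_toy))
cv_acc = cross_val_score(rf_toy, X_toy, y_toy, cv=5, scoring="accuracy").mean()

print(f"Training accuracy: {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```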


#### **Test Pred**

In [None]:
test_pred = rf.predict(test_scale)
test_pred

array([1, 1, 1, ..., 1, 1, 0])

In [None]:
y_test = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': test_pred})
y_test.head()

Unnamed: 0,ID,Overall_Experience
0,99900001,1
1,99900002,1
2,99900003,1
3,99900004,0
4,99900005,1
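Before exporting, the submission can be checked against the sample submission file, as the criteria at the top require (same headers, same ID sequence). A minimal sketch; `align_to_sample` is a hypothetical helper, and the toy `sample` frame and its column names are assumptions.

```python
import pandas as pd

def align_to_sample(pred_df, sample_df, id_col="ID"):
    """Reorder pred_df to the ID sequence of sample_df and check the layout."""
    # A left merge keeps the row order of the sample file.
    aligned = sample_df[[id_col]].merge(pred_df, on=id_col, how="left",
                                        validate="one_to_one")
    if aligned.isna().any().any():
        raise ValueError("some sample IDs have no prediction")
    if list(aligned.columns) != list(sample_df.columns):
        raise ValueError("column headers differ from the sample file")
    return aligned

# Toy stand-ins for the sample submission and the predictions.
sample = pd.DataFrame({"ID": [3, 1, 2], "Overall_Experience": [0, 0, 0]})
preds = pd.DataFrame({"ID": [1, 2, 3], "Overall_Experience": [1, 0, 1]})

submission = align_to_sample(preds, sample)
print(submission)
```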


In [None]:
from google.colab import files

y_test.to_csv('submission_rf.csv', index=False)

# Download the file
files.download('submission_rf.csv')


## **Model Building - RF-tuned**

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
model = RandomForestClassifier(random_state=42,  n_jobs=-1)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

rf_tuned = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy')
rf_tuned.fit(X_train_scale, y_train)

In [None]:
model = RandomForestClassifier(random_state=42,  n_jobs=-1)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

rf_tuned_fe = GridSearchCV(estimator=model, param_grid=param_grid, cv = 3, scoring='accuracy')
rf_tuned_fe.fit(X_train_scale, y_train)

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],              # Number of trees in the forest
    'max_depth': [10, 20, 30, None],             # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],             # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],               # Minimum number of samples required to be at a leaf node
    'max_features': ['sqrt', 'log2'],            # Features considered per split ('auto' was removed in scikit-learn 1.3; for classifiers it equalled 'sqrt')
    'bootstrap': [True, False],
}

# Create the Random Forest model (named rf_base so the fitted `rf` above is not overwritten)
rf_base = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search_rf = GridSearchCV(estimator=rf_base, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)

# Fit the model
grid_search_rf.fit(X_train_scale, y_train)

In [None]:
y_train_pred2 = rf_tuned.predict(X_train_scale)
# Accuracy on the training rows themselves, not on held-out data
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred2)}')

Training Accuracy: 1.0


In [None]:
y_train_pred_rf = rf_tuned_fe.predict(X_train_scale)
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred_rf)}')

Training Accuracy: 1.0
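A fitted `GridSearchCV` already stores a cross-validated score, which is a better guide than accuracy on the training rows. The sketch below shows how to read `best_params_` and `best_score_`; the synthetic data and the small grid are assumptions chosen so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid={"max_depth": [3, None], "min_samples_split": [2, 10]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# best_score_ is the mean accuracy on held-out folds for the best combination.
print(search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.3f}")
```

Applied to the searches above, `rf_tuned.best_score_` and `rf_tuned_fe.best_score_` would be the numbers to compare across models.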


### **Test pred**

In [None]:
test_pred2 = rf_tuned_fe.predict(test_scale)
test_pred2

array([1, 1, 1, ..., 1, 1, 0])

In [None]:
y_test2 = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': test_pred2})
y_test2.head()

Unnamed: 0,ID,Overall_Experience
0,99900001,1
1,99900002,1
2,99900003,1
3,99900004,0
4,99900005,1


In [None]:
from google.colab import files # 0.9383462  0.9368294

y_test2.to_csv('submission_rf_tuned.csv', index=False)

# Download the file
files.download('submission_rf_tuned.csv')


## **Model Building - Gradient Boosting Classifiers (XGBoost)**

In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost model
xgb_model = XGBClassifier(random_state=42, n_jobs=-1)

# Train the model on the training data
xgb_model.fit(X_train_scale, y_train)

In [None]:
y_train_pred3 = xgb_model.predict(X_train_scale)
# Accuracy on the training rows themselves, not on held-out data
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred3)}')

Training Accuracy: 0.970184045179542


### **Test pred**

In [None]:
test_pred3 = xgb_model.predict(test_scale)
test_pred3

array([1, 1, 1, ..., 1, 1, 0])

In [None]:
y_test3 = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': test_pred3})
y_test3.head()

Unnamed: 0,ID,Overall_Experience
0,99900001,1
1,99900002,1
2,99900003,1
3,99900004,0
4,99900005,1


In [None]:
y_test3.to_csv('submission.csv', index=False)

In [None]:
from google.colab import files

y_test3.to_csv('submission_xgb.csv', index=False)

# Download the file
files.download('submission_xgb.csv')


## **Model Building - Gradient Boosting Classifiers (XGBoost) - tuned**

In [None]:
xgb_model_tuned = XGBClassifier(random_state=42, n_jobs=-1)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

grid_xgb_tuned = GridSearchCV(estimator=xgb_model_tuned, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_xgb_tuned.fit(X_train_scale, y_train)

In [None]:
from xgboost import XGBClassifier
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

xgb = XGBClassifier(random_state=42, n_jobs=-1)

param_dist = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'min_child_weight': randint(1, 10),
    'gamma': uniform(0, 0.5)
}

random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_dist,
    n_iter=100,  # Number of parameter settings sampled
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=2
)

random_search.fit(X_train_scale, y_train)
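A note on the distributions used above: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[loc, scale]`, and `randint(low, high)` excludes `high`. So `uniform(0.6, 0.4)` covers 0.6 to 1.0, and `uniform(0.01, 0.3)` covers 0.01 to 0.31. A short check, with variable names chosen only for illustration:

```python
from scipy.stats import randint, uniform

subsample_dist = uniform(0.6, 0.4)   # loc=0.6, scale=0.4 -> values in [0.6, 1.0]
depth_dist = randint(3, 10)          # integers 3..9; the upper bound is exclusive

subsample_draws = subsample_dist.rvs(size=1000, random_state=42)
depth_draws = depth_dist.rvs(size=1000, random_state=42)

print(subsample_draws.min(), subsample_draws.max())
print(depth_draws.min(), depth_draws.max())
```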

In [None]:
y_train_pred4 = grid_xgb_tuned.predict(X_train_scale)
# Accuracy on the training rows themselves, not on held-out data
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred4)}')

Training Accuracy: 0.9802816304474512


In [None]:
y_train_pred_random = random_search.predict(X_train_scale)
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred_random)}')

Training Accuracy: 0.9792114771294461


### **Test pred**

In [None]:
test_pred_random = random_search.predict(test_scale)
test_pred_random

array([1, 1, 1, ..., 1, 1, 0])

In [None]:
y_test_random = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': test_pred_random})
y_test_random.head()

Unnamed: 0,ID,Overall_Experience
0,99900001,1
1,99900002,1
2,99900003,1
3,99900004,0
4,99900005,1


In [None]:
# y_test4 was never defined; export the random-search predictions built above
y_test_random.to_csv('submission.csv', index=False)

In [None]:
from google.colab import files

y_test_random.to_csv('submission_xgb_random.csv', index=False)

# Download the file
files.download('submission_xgb_random.csv')


## **Model Building - Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model on the training data
lr.fit(X_train_scale, y_train)

In [None]:
y_train_pred_lr = lr.predict(X_train_scale)
# Accuracy on the training rows themselves, not on held-out data
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred_lr)}')

Training Accuracy: 0.7524449294864324


### **Test pred**

In [None]:
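# Predict with the logistic regression model fitted above (the original cell called rf.predict)
test_pred_lr = lr.predict(test_scale)
test_pred_lr

In [None]:
y_test_lr = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': test_pred_lr})
y_test_lr.head()

In [None]:
y_test_lr.to_csv('submission_lr.csv', index=False)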

## **Model Building - Logistic Regression - tuned**

In [None]:
model = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model on the training data
model.fit(X_train_scale, y_train)
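The cell above refits the same untuned model. One possible way to actually tune logistic regression is a grid search over the regularisation strength `C`, with the scaler placed inside a `Pipeline` so it is refit on each training fold. This is a sketch under assumptions: the grid values are hypothetical and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling inside the pipeline is refit on each training fold during the search.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000, random_state=42)),
])

# Hypothetical grid over the inverse regularisation strength C.
lr_search = GridSearchCV(pipe, param_grid={"lr__C": [0.01, 0.1, 1.0, 10.0]},
                         cv=5, scoring="accuracy")
lr_search.fit(X_toy, y_toy)

print(lr_search.best_params_)
print(f"Mean CV accuracy: {lr_search.best_score_:.3f}")
```

With the pipeline approach the search would be fitted on the unscaled `X_train` rather than `X_train_scale`, since scaling happens inside each fold.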

### **Test pred**

## **Model Building - AdaBoost**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a base Decision Tree classifier
base_estimator = DecisionTreeClassifier(max_depth=1)  # Typically, max_depth=1 is used as a weak learner

# Create the AdaBoost classifier
# `base_estimator` was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4
adb = AdaBoostClassifier(estimator=base_estimator, n_estimators=50, random_state=42)

# Fit the model on the training data
adb.fit(X_train_scale, y_train)



In [None]:
y_train_pred_ab = adb.predict(X_train_scale)
# Accuracy on the training rows themselves, not on held-out data
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred_ab)}')

Training Accuracy: 0.8907913836764535


## **Model Building - AdaBoost - Tuned**

In [None]:
base_estimator = DecisionTreeClassifier(random_state=42)

# AdaBoost classifier (`base_estimator` was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4)
adb_clf = AdaBoostClassifier(estimator=base_estimator, random_state=42)

# Parameters for tuning; AdaBoost has no 'bootstrap' option, so it is not listed
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1, 10],
    'estimator__max_depth': [1, 2, 3, 4, 5],  # Parameter of the wrapped decision tree
}

# Grid Search
grid_search_adb = GridSearchCV(estimator=adb_clf, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the model
grid_search_adb.fit(X_train_scale, y_train)



In [None]:
y_train_pred_ab_tuned = grid_search_adb.predict(X_train_scale)
# Accuracy on the training rows themselves, not on held-out data
print(f'Training Accuracy: {accuracy_score(y_train, y_train_pred_ab_tuned)}')

Training Accuracy: 0.9561766918488223


#### **Test**


In [None]:
test_pred_ab_tuned = grid_search_adb.predict(test_scale)
test_pred_ab_tuned

array([1, 1, 1, ..., 1, 1, 0])

In [None]:
y_test_ab_tuned = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': test_pred_ab_tuned})
y_test_ab_tuned.head()

Unnamed: 0,ID,Overall_Experience
0,99900001,1
1,99900002,1
2,99900003,1
3,99900004,0
4,99900005,1


In [None]:
from google.colab import files

y_test_ab_tuned.to_csv('submission_ab_tuned.csv', index=False)

# Download the file
files.download('submission_ab_tuned.csv')   # 0.937
