# **Problem Description**

**Goal:**
The goal is to predict whether a passenger was satisfied with their overall experience of traveling on the Shinkansen Bullet Train.

**Dataset:** 

The problem consists of 2 separate datasets: **Travel data** and **Survey data**. **Travel data** has information about the passengers and the attributes of the Shinkansen train on which they traveled. **Survey data** is aggregated survey data describing the post-service experience. You are expected to treat both datasets as raw data and perform any necessary data cleaning and validation steps.

The data has been split into two groups and provided in the Dataset folder, which contains the train and test data separately:

- Train_Data
- Test_Data

**Target Variable:** Overall_Experience (1 represents ‘satisfied’, and 0 represents ‘not satisfied’)

The **training set** can be used to build your machine learning model. The training set has labels for the target column - Overall_Experience.

The **testing set** should be used to see how well your model performs on unseen data. For the test set, you are expected to predict the ‘Overall_Experience’ label for each passenger.
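The description mentions two raw tables, but this notebook loads an already combined file. As a minimal sketch of how such a combination might be done, assuming both tables share the `ID` column seen later in the data, the toy frames below are joined one-to-one. The frame names and the toy values are illustrative assumptions, not the real files.

```python
import pandas as pd

# Toy stand-ins for the two raw tables; the real files and their full
# column lists are not shown in this notebook.
travel = pd.DataFrame({"ID": [1, 2, 3], "Age": [52, 48, 43]})
survey = pd.DataFrame({"ID": [1, 2, 3], "Seat_Comfort": [2, 3, 2]})

# validate="one_to_one" raises an error if an ID is duplicated on either side,
# which doubles as a basic data-validation step.
merged = travel.merge(survey, on="ID", how="inner", validate="one_to_one")
print(merged.shape)
```

An inner join keeps only passengers present in both tables; whether that is the right choice depends on how complete the real survey data is.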


# **Mounting the drive**

In [None]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Importing Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer

#Importing NN Libraries
from tensorflow.keras.layers import Dense, Input, Dropout, BatchNormalization, Activation, LeakyReLU, ReLU
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras import optimizers

### Loading the Data

In [None]:
# Storing the path of the data file from the Google drive
path = '/content/drive/MyDrive/Hackathon/Datasets/'

In [None]:
#train_data = pd.read_csv(path+'complete_dataset_training.csv')
train_data = pd.read_csv('complete_dataset_training.csv')

In [None]:
#test_data = pd.read_csv(path+'complete_dataset_test.csv')
test_data = pd.read_csv('complete_dataset_test.csv')
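The loaded files carry `Unnamed: 0` style columns, which pandas creates when a CSV was saved with its index and then read back without `index_col`. A small sketch of one way to strip them right after loading, using an in-memory CSV as an assumed stand-in for the real file:

```python
import io

import pandas as pd

# In-memory stand-in for a CSV that was saved together with its index column.
csv_text = ",ID,Seat_Comfort\n0,98800001,2\n1,98800002,3\n"
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Keep every column whose name does not start with "Unnamed".
cleaned = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))
```

Passing `index_col=0` to `pd.read_csv`, as a later cell in this notebook does, is an alternative when only one such column exists.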

### Understanding the Data

In [None]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,ID,Overall_Experience,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,...,Cleanliness,Online_Boarding,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins
0,0,98800001.0,0.0,2.0,0.0,6.0,6.0,6.0,5.0,2.0,...,2.0,3.0,1.0,1.0,52.0,1.8,0.0,272.0,0.0,5.0
1,1,98800002.0,0.0,3.0,1.0,6.0,3.0,1.0,5.0,3.0,...,5.0,5.0,2.0,1.0,48.0,1.0,1.0,2200.0,9.0,0.0
2,2,98800003.0,1.0,2.0,0.0,2.0,2.0,1.0,2.0,5.0,...,6.0,6.0,1.0,1.0,43.0,2.0,0.0,1061.0,77.0,119.0
3,3,98800004.0,0.0,4.0,1.0,2.0,4.4,1.0,4.0,2.0,...,4.0,4.0,1.0,1.0,44.0,2.0,0.0,780.0,13.0,18.0
4,4,98800005.0,1.0,4.0,1.0,4.0,4.0,4.0,2.0,5.0,...,5.0,5.0,1.0,1.0,50.0,2.0,0.0,1981.0,0.0,0.0


In [None]:
train_data.tail()

Unnamed: 0.1,Unnamed: 0,ID,Overall_Experience,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,...,Cleanliness,Online_Boarding,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins
53662,53662,98853663.0,0.0,5.0,0.0,6.0,5.0,5.0,6.0,5.0,...,5.0,6.0,2.0,1.0,33.0,1.0,1.0,2146.0,0.0,0.0
53663,53663,98853664.0,1.0,2.0,1.0,2.0,2.0,1.0,6.0,6.0,...,5.0,4.0,2.0,1.0,41.0,2.0,0.0,364.0,1.0,0.0
53664,53664,98853665.0,1.0,6.0,0.0,4.0,4.0,4.0,6.0,6.0,...,4.0,6.0,2.0,1.0,41.0,2.0,1.0,1377.0,38.0,17.0
53665,53665,98853666.0,0.0,2.0,1.0,2.0,2.0,4.0,3.0,2.0,...,5.0,3.0,2.0,2.0,30.0,2.0,0.0,1889.0,75.0,74.0
53666,53666,98853667.0,1.0,3.0,0.0,3.0,6.0,3.0,6.0,,...,,,,,,,,,,


In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53667 entries, 0 to 53666
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               53667 non-null  int64  
 1   ID                       53667 non-null  float64
 2   Overall_Experience       53667 non-null  float64
 3   Seat_Comfort             53667 non-null  float64
 4   Seat_Class               53667 non-null  float64
 5   Arrival_Time_Convenient  53667 non-null  float64
 6   Catering                 53667 non-null  float64
 7   Platform_Location        53667 non-null  float64
 8   Onboard_Wifi_Service     53667 non-null  float64
 9   Onboard_Entertainment    53666 non-null  float64
 10  Online_Support           53666 non-null  float64
 11  Ease_of_Online_Booking   53666 non-null  float64
 12  Onboard_Service          53666 non-null  float64
 13  Legroom                  53666 non-null  float64
 14  Baggage_Handling      

In [None]:
train_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,53667.0,26833.0,15492.472785,0.0,13416.5,26833.0,40249.5,53666.0
ID,53667.0,98826830.0,15492.472785,98800001.0,98813417.5,98826834.0,98840250.5,98853667.0
Overall_Experience,53667.0,0.5460711,0.497878,0.0,0.0,1.0,1.0,1.0
Seat_Comfort,53667.0,3.772713,1.447433,1.0,2.0,4.0,5.0,6.0
Seat_Class,53667.0,0.4975497,0.499999,0.0,0.0,0.0,1.0,1.0
Arrival_Time_Convenient,53667.0,3.981396,1.481592,1.0,3.0,4.0,5.0,6.0
Catering,53667.0,3.800995,1.430309,1.0,3.0,4.0,5.0,6.0
Platform_Location,53667.0,3.782656,1.63175,1.0,3.0,4.0,5.0,6.0
Onboard_Wifi_Service,53667.0,4.15809,1.436495,1.0,3.0,4.0,5.0,6.0
Onboard_Entertainment,53666.0,4.321884,1.423301,1.0,3.0,5.0,5.0,6.0


In [None]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35602 entries, 0 to 35601
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               35602 non-null  int64  
 1   ID                       35602 non-null  float64
 2   Seat_Comfort             35602 non-null  float64
 3   Seat_Class               35602 non-null  float64
 4   Arrival_Time_Convenient  35602 non-null  float64
 5   Catering                 35602 non-null  float64
 6   Platform_Location        35602 non-null  float64
 7   Onboard_Wifi_Service     35602 non-null  float64
 8   Onboard_Entertainment    35602 non-null  float64
 9   Online_Support           35602 non-null  float64
 10  Ease_of_Online_Booking   35602 non-null  float64
 11  Onboard_Service          35602 non-null  float64
 12  Legroom                  35602 non-null  float64
 13  Baggage_Handling         35602 non-null  float64
 14  CheckIn_Service       

In [None]:
# Split the data into features and target variable
X = train_data.drop(['ID', 'Overall_Experience', 'Unnamed: 0'], axis=1)
y = train_data['Overall_Experience']
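The `tail()` and `info()` outputs above show that the last training row has missing values in several columns, and those would flow into `X` unchanged. A minimal sketch of checking for and dropping incomplete rows before the split, on a toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row, mimicking the last row of train_data.
toy = pd.DataFrame({
    "Seat_Comfort": [2.0, 3.0, 4.0, 3.0],
    "Age": [52.0, 48.0, np.nan, 44.0],
    "Overall_Experience": [0.0, 0.0, 1.0, 1.0],
})

# Count missing values per column before deciding how to handle them.
print(toy.isnull().sum())

# Drop incomplete rows first, then split into features and target so that
# both stay aligned.
complete = toy.dropna()
X_toy = complete.drop(columns=["Overall_Experience"])
y_toy = complete["Overall_Experience"].astype(int)
```

Dropping is reasonable when only a handful of rows are affected; imputation (for example with the `KNNImputer` imported earlier) is the usual alternative when many rows have gaps.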

In [None]:
X.head()

Unnamed: 0,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,...,Cleanliness,Online_Boarding,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins
0,2.0,0.0,6.0,6.0,6.0,5.0,2.0,4.0,2.0,2.0,...,2.0,3.0,1.0,1.0,52.0,1.8,0.0,272.0,0.0,5.0
1,3.0,1.0,6.0,3.0,1.0,5.0,3.0,5.0,5.0,6.0,...,5.0,5.0,2.0,1.0,48.0,1.0,1.0,2200.0,9.0,0.0
2,2.0,0.0,2.0,2.0,1.0,2.0,5.0,6.0,6.0,6.0,...,6.0,6.0,1.0,1.0,43.0,2.0,0.0,1061.0,77.0,119.0
3,4.0,1.0,2.0,4.4,1.0,4.0,2.0,4.0,4.0,4.0,...,4.0,4.0,1.0,1.0,44.0,2.0,0.0,780.0,13.0,18.0
4,4.0,1.0,4.0,4.0,4.0,2.0,5.0,6.0,5.0,5.0,...,5.0,5.0,1.0,1.0,50.0,2.0,0.0,1981.0,0.0,0.0
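The preview above mixes columns on very different scales: the ratings run from 1 to 6, while `Travel_Distance` reaches into the thousands. This notebook feeds the raw values to the models. One option, shown here only as a sketch on toy numbers, is to standardize the features with statistics learned from the training data and reuse them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins: column 0 is a rating, column 1 is a travel distance.
train_toy = np.array([[2.0, 272.0], [3.0, 2200.0], [4.0, 780.0]])
test_toy = np.array([[4.0, 532.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only ...
train_scaled = scaler.fit_transform(train_toy)
# ... and apply the same transformation to the test data.
test_scaled = scaler.transform(test_toy)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training set alone avoids leaking test-set statistics into the model.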


# **Building the NN Architecture**

**MODEL 1**

In [None]:
# Define the neural network architecture
model1 = Sequential()

#1st Fully connected layer
model1.add(Dense(64, activation='relu', input_shape=(X.shape[1],)))
model1.add(BatchNormalization())
model1.add(Dropout(0.5))

# 2nd fully connected layer
model1.add(Dense(32, activation='relu'))
model1.add(BatchNormalization())
model1.add(Dropout(0.5))

# Final dense layer
model1.add(Dense(1, activation='sigmoid'))

In [None]:
model1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                1536      
                                                                 
 batch_normalization (BatchN  (None, 64)               256       
 ormalization)                                                   
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 batch_normalization_1 (Batc  (None, 32)               128       
 hNormalization)                                                 
                                                                 
 dropout_1 (Dropout)         (None, 32)                0
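The parameter counts in the summary can be reproduced by hand, which is a quick check that the network sees the expected 23 input features. A dense layer has one weight per input per unit plus one bias per unit, and a Keras batch normalization layer stores four values per unit (gamma, beta, moving mean, moving variance). The helper names below are illustrative.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units


def batchnorm_params(n_units: int) -> int:
    """Gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * n_units


n_features = 23  # number of feature columns left after dropping ID, target and index

print(dense_params(n_features, 64))  # 1536, matches "dense"
print(batchnorm_params(64))          # 256, matches "batch_normalization"
print(dense_params(64, 32))          # 2080, matches "dense_1"
print(batchnorm_params(32))          # 128, matches "batch_normalization_1"
```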

In [None]:
# Compile the model
model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# **Training the Model**

In [None]:
# Train the model
model1.fit(X, y, epochs=20, batch_size=32, validation_split=0.2)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f5ed0c21a90>
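Keras takes the `validation_split` fraction from the last rows of the data, before any shuffling. If the file is ordered in some meaningful way (it is at least sorted by `ID`), the validation set may not be representative. A sketch of shuffling features and labels together beforehand, on toy data; the seed and names are assumptions:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and y.
X_toy = pd.DataFrame({"Seat_Comfort": [2.0, 3.0, 2.0, 4.0, 4.0]})
y_toy = pd.Series([0, 0, 1, 0, 1], name="Overall_Experience")

# One shared permutation keeps each row paired with its own label.
rng = np.random.default_rng(42)
order = rng.permutation(len(X_toy))
X_shuffled = X_toy.iloc[order].reset_index(drop=True)
y_shuffled = y_toy.iloc[order].reset_index(drop=True)
```

The shuffled pair can then be passed to `fit` with `validation_split` exactly as before.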

In [None]:
# Preprocess the test data
X_test = test_data.drop(['ID','Unnamed: 0'], axis=1)

In [None]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35602 entries, 0 to 35601
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Seat_Comfort             35602 non-null  float64
 1   Seat_Class               35602 non-null  float64
 2   Arrival_Time_Convenient  35602 non-null  float64
 3   Catering                 35602 non-null  float64
 4   Platform_Location        35602 non-null  float64
 5   Onboard_Wifi_Service     35602 non-null  float64
 6   Onboard_Entertainment    35602 non-null  float64
 7   Online_Support           35602 non-null  float64
 8   Ease_of_Online_Booking   35602 non-null  float64
 9   Onboard_Service          35602 non-null  float64
 10  Legroom                  35602 non-null  float64
 11  Baggage_Handling         35602 non-null  float64
 12  CheckIn_Service          35602 non-null  float64
 13  Cleanliness              35602 non-null  float64
 14  Online_Boarding       

In [None]:
X_test.head()

Unnamed: 0,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,...,Cleanliness,Online_Boarding,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins
0,4.0,0.0,4.0,4.0,4.0,2.0,6.0,5.0,6.0,6.0,...,6.0,3.0,1.0,1.0,36.0,2.0,0.0,532.0,0.0,0.0
1,1.0,1.0,5.0,3.0,4.0,4.0,3.0,4.0,4.0,6.0,...,6.0,4.0,1.0,2.0,21.0,2.0,0.0,1425.0,9.0,28.0
2,6.0,1.0,6.0,6.0,6.0,6.0,6.0,6.0,2.0,2.0,...,2.0,6.0,2.0,1.0,60.0,2.0,0.0,2832.0,0.0,0.0
3,4.0,0.0,6.0,4.0,6.0,3.0,4.0,6.0,3.0,4.0,...,6.0,3.0,1.0,1.0,29.0,1.0,1.0,1352.0,0.0,0.0
4,6.0,1.0,1.0,6.0,1.0,6.0,6.0,6.0,6.0,5.0,...,6.0,6.0,2.0,2.0,18.0,2.0,0.0,1610.0,17.0,0.0


In [None]:
# Make predictions on the test data
y_pred = model1.predict(X_test)



In [None]:
# Convert predictions to binary labels
y_pred = np.round(y_pred)
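`np.round` rounds halves to the nearest even number, so a predicted probability of exactly 0.5 becomes 0. An explicit threshold makes the decision rule visible and easy to change. A small sketch on made-up probabilities:

```python
import numpy as np

# Made-up sigmoid outputs with the same (n, 1) shape that model.predict returns.
probs = np.array([[0.2], [0.5], [0.81]])

rounded = np.round(probs)                    # 0.5 is rounded down to 0.0
labels = (probs >= 0.5).astype(int).ravel()  # 0.5 counts as class 1 here

print(rounded.ravel(), labels)
```

For this data the two approaches only differ on exact ties, which are rare in practice; the main benefit is that the threshold can be tuned.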

In [None]:
print(y_pred)

[[1.]
 [0.]
 [1.]
 ...
 [0.]
 [1.]
 [0.]]


In [None]:
print(test_data['ID'].shape)

(35602,)


In [None]:
print(y_pred.flatten().shape)

(35602,)


In [None]:
# Save the predictions to a CSV file
# note that y_pred is a 2D array, so we need to flatten it to convert to 1D
predictions = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': y_pred.flatten().astype(int)})
#predictions.to_csv(path+'predictions.csv', index=False)
predictions.to_csv('predictions.csv', index=False)
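`test_data.info()` reports `ID` as `float64`, so the saved file will contain IDs such as `98800001.0`. Whether the scoring system accepts that is not stated here, so the following is only a sketch of casting the IDs to integers and checking the result before saving. The toy values are assumptions.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-ins for test_data['ID'] and the rounded predictions.
ids = pd.Series([98800001.0, 98800002.0], name="ID")
labels = np.array([[1.0], [0.0]])

submission = pd.DataFrame({
    "ID": ids.astype("int64"),                         # drop the trailing ".0"
    "Overall_Experience": labels.ravel().astype(int),  # 2D -> 1D integers
})

# Basic sanity checks before writing the file.
if submission.isnull().any().any():
    raise ValueError("submission contains missing values")
if not submission["Overall_Experience"].isin([0, 1]).all():
    raise ValueError("labels must be 0 or 1")

buffer = io.StringIO()  # stands in for a file path
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```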

**MODEL 2**

In [None]:
# Define the neural network architecture
model2 = Sequential()

#1st Fully connected layer
model2.add(Dense(128, activation='relu', input_shape=(X.shape[1],)))
model2.add(BatchNormalization())
model2.add(Dropout(0.2))

# 2nd fully connected layer
model2.add(Dense(64, activation='relu'))
model2.add(BatchNormalization())
model2.add(Dropout(0.2))

# 3rd fully connected layer
model2.add(Dense(32, activation='relu'))
model2.add(BatchNormalization())
model2.add(Dropout(0.2))

# Final dense layer
model2.add(Dense(1, activation='sigmoid'))

In [None]:
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 128)               3072      
                                                                 
 batch_normalization_2 (Batc  (None, 128)              512       
 hNormalization)                                                 
                                                                 
 dropout_2 (Dropout)         (None, 128)               0         
                                                                 
 dense_4 (Dense)             (None, 64)                8256      
                                                                 
 batch_normalization_3 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dropout_3 (Dropout)         (None, 64)               

# **Compiling and training the model**

In [None]:
# Compile the model
learning_rate = 0.001
optimizer = optimizers.Adam(learning_rate=learning_rate)
model2.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
model2.fit(X, y, epochs=25, batch_size=32, validation_split=0.2)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7f5ec01927c0>

In [None]:
# Make predictions on the test data
y_pred = model2.predict(X_test)

# Convert predictions to binary labels
y_pred = np.round(y_pred)



In [None]:
# Save the predictions to a CSV file
predictions = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': y_pred.flatten().astype(int)})
#predictions.to_csv(path+'predictions2.csv', index=False)
predictions.to_csv('predictions2.csv', index=False)

**MODEL 3**

In [None]:
# Define the neural network architecture
model3 = Sequential()

#1st Fully connected layer
model3.add(Dense(128, activation='relu', input_shape=(X.shape[1],)))
model3.add(BatchNormalization())
model3.add(Dropout(0.2))

# 2nd and 3rd fully connected layers
model3.add(Dense(64, activation='relu'))
model3.add(Dense(64, activation='relu'))

# 4th fully connected layer
model3.add(Dense(32, activation='relu'))
model3.add(BatchNormalization())
model3.add(Dropout(0.2))

# Final dense layer
model3.add(Dense(1, activation='sigmoid'))

In [None]:
model3.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_7 (Dense)             (None, 128)               3072      
                                                                 
 batch_normalization_5 (Batc  (None, 128)              512       
 hNormalization)                                                 
                                                                 
 dropout_5 (Dropout)         (None, 128)               0         
                                                                 
 dense_8 (Dense)             (None, 64)                8256      
                                                                 
 dense_9 (Dense)             (None, 64)                4160      
                                                                 
 dense_10 (Dense)            (None, 32)                2080      
                                                      

# **Compiling and training the model**

In [None]:
# Compile the model
learning_rate = 0.002
optimizer = optimizers.Adam(learning_rate=learning_rate)
model3.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
model3.fit(X, y, epochs=25, batch_size=32, validation_split=0.2)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7f5ec04ecf70>

In [None]:
# Make predictions on the test data
y_pred = model3.predict(X_test)

# Convert predictions to binary labels
y_pred = np.round(y_pred)



In [None]:
# Save the predictions to a CSV file
predictions = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': y_pred.flatten().astype(int)})
#predictions.to_csv(path+'predictions3.csv', index=False)
predictions.to_csv('predictions3.csv', index=False)

**MODEL 4**

In [None]:
# Define the neural network architecture
model4 = Sequential()

#1st Fully connected layer
model4.add(Dense(128, activation='relu', input_shape=(X.shape[1],)))
model4.add(BatchNormalization())
model4.add(Dropout(0.2))

# 2nd, 3rd, and 4th fully connected layers (the 3rd uses a linear activation)
model4.add(Dense(64, activation='relu'))
model4.add(Dense(64, activation='linear'))
model4.add(Dense(48, activation='relu'))
model4.add(Dropout(0.2))

# 5th fully connected layer
model4.add(Dense(32, activation='relu'))
model4.add(BatchNormalization())
model4.add(Dropout(0.2))

# 6th fully connected layer
model4.add(Dense(48, activation='relu'))
model4.add(BatchNormalization())
model4.add(Dropout(0.2))

# Final dense layer
model4.add(Dense(1, activation='sigmoid'))

In [None]:
model4.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_26 (Dense)            (None, 128)               3072      
                                                                 
 batch_normalization_14 (Bat  (None, 128)              512       
 chNormalization)                                                
                                                                 
 dropout_14 (Dropout)        (None, 128)               0         
                                                                 
 dense_27 (Dense)            (None, 64)                8256      
                                                                 
 dense_28 (Dense)            (None, 64)                4160      
                                                                 
 dense_29 (Dense)            (None, 48)                3120      
                                                      

# **Compiling and training the model**

In [None]:
# Compile the model
learning_rate = 0.002
optimizer = optimizers.Adam(learning_rate=learning_rate)
model4.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
model4.fit(X, y, epochs=25, batch_size=32, validation_split=0.2)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7f5ead10cbe0>

In [None]:
# Make predictions on the test data
y_pred = model4.predict(X_test)

# Convert predictions to binary labels
y_pred = np.round(y_pred)



In [None]:
# Save the predictions to a CSV file
predictions = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': y_pred.flatten().astype(int)})
#predictions.to_csv(path+'predictions4.csv', index=False)
predictions.to_csv('predictions4.csv', index=False)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
X.isnull().sum()

Seat_Comfort               0
Seat_Class                 0
Arrival_Time_Convenient    0
Catering                   0
Platform_Location          0
Onboard_Wifi_Service       0
Onboard_Entertainment      1
Online_Support             1
Ease_of_Online_Booking     1
Onboard_Service            1
Legroom                    1
Baggage_Handling           1
CheckIn_Service            1
Cleanliness                1
Online_Boarding            1
Gender                     1
Customer_Type              1
Age                        1
Type_Travel                1
Travel_Class               1
Travel_Distance            1
Departure_Delay_in_Mins    1
Arrival_Delay_in_Mins      1
dtype: int64

In [None]:
test_x = X_test.copy()
test_x.drop(columns=["Departure_Delay_in_Mins","Arrival_Delay_in_Mins"], axis=1, inplace=True)

In [None]:
trial_data = pd.read_csv("complete_dataset_training.csv", index_col=0)
trial_data.isnull().sum()


ID                         0
Overall_Experience         0
Seat_Comfort               0
Seat_Class                 0
Arrival_Time_Convenient    0
Catering                   0
Platform_Location          0
Onboard_Wifi_Service       0
Onboard_Entertainment      0
Online_Support             0
Ease_of_Online_Booking     0
Onboard_Service            0
Legroom                    0
Baggage_Handling           0
CheckIn_Service            0
Cleanliness                0
Online_Boarding            0
Gender                     0
Customer_Type              0
Age                        0
Type_Travel                0
Travel_Class               0
Travel_Distance            0
Departure_Delay_in_Mins    0
Arrival_Delay_in_Mins      0
dtype: int64

In [None]:
y = trial_data["Overall_Experience"]
trial_data.drop(columns=["ID", "Departure_Delay_in_Mins", "Overall_Experience", "Arrival_Delay_in_Mins"], axis=1, inplace=True)
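The training frame and `test_x` are trimmed in separate cells, so it is easy for their columns to drift apart. scikit-learn estimators expect the same features in the same order at fit and predict time. A sketch of an explicit check and reordering, on toy frames with assumed values:

```python
import pandas as pd

# Toy stand-ins for trial_data and test_x, with the columns in a different order.
train_toy = pd.DataFrame({"Seat_Comfort": [2.0, 3.0], "Age": [52.0, 48.0]})
test_toy = pd.DataFrame({"Age": [36.0], "Seat_Comfort": [4.0]})

# Fail early if the two frames do not contain the same set of columns.
missing = set(train_toy.columns) - set(test_toy.columns)
extra = set(test_toy.columns) - set(train_toy.columns)
if missing or extra:
    raise ValueError(f"column mismatch: missing={missing}, extra={extra}")

# Reorder the test columns to match the training frame.
test_aligned = test_toy[train_toy.columns]
print(list(test_aligned.columns))
```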

In [None]:
trial_data.head()

Unnamed: 0,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,...,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance
0,2.0,0.0,6.0,6.0,6.0,5.0,2.0,4.0,2.0,2.0,...,2.0,5.0,2.0,3.0,1.0,1.0,52.0,1.8,0.0,272.0
1,3.0,1.0,6.0,3.0,1.0,5.0,3.0,5.0,5.0,6.0,...,3.0,2.0,5.0,5.0,2.0,1.0,48.0,1.0,1.0,2200.0
2,2.0,0.0,2.0,2.0,1.0,2.0,5.0,6.0,6.0,6.0,...,6.0,5.0,6.0,6.0,1.0,1.0,43.0,2.0,0.0,1061.0
3,4.0,1.0,2.0,4.4,1.0,4.0,2.0,4.0,4.0,4.0,...,4.0,5.0,4.0,4.0,1.0,1.0,44.0,2.0,0.0,780.0
4,4.0,1.0,4.0,4.0,4.0,2.0,5.0,6.0,5.0,5.0,...,5.0,5.0,5.0,5.0,1.0,1.0,50.0,2.0,0.0,1981.0


In [None]:
knn_tuned = KNeighborsClassifier(n_neighbors=6)
knn_tuned.fit(trial_data, y)

KNeighborsClassifier(n_neighbors=6)
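The notebook does not show how `n_neighbors=6` was chosen. One common approach, sketched here on synthetic data rather than the real training set, is to compare a few candidate values with cross-validation. The candidate list and the synthetic dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for trial_data and y.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

candidate_k = [3, 5, 6, 9]
scores = {}
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this value of k.
    scores[k] = cross_val_score(model, X_toy, y_toy, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because k-nearest neighbours relies on distances, the result can change considerably once the features are put on a common scale.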

In [None]:
y_pred = knn_tuned.predict(test_x)
# predict() already returns class labels, so rounding leaves them unchanged
y_pred = np.round(y_pred)

In [None]:
# Save the predictions to a CSV file
predictions = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': y_pred.flatten().astype(int)})
#predictions.to_csv(path+'predictions5.csv', index=False)
predictions.to_csv('predictions5.csv', index=False)

In [None]:
trial_data.head()

Unnamed: 0,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,...,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance
0,2.0,0.0,6.0,6.0,6.0,5.0,2.0,4.0,2.0,2.0,...,2.0,5.0,2.0,3.0,1.0,1.0,52.0,1.8,0.0,272.0
1,3.0,1.0,6.0,3.0,1.0,5.0,3.0,5.0,5.0,6.0,...,3.0,2.0,5.0,5.0,2.0,1.0,48.0,1.0,1.0,2200.0
2,2.0,0.0,2.0,2.0,1.0,2.0,5.0,6.0,6.0,6.0,...,6.0,5.0,6.0,6.0,1.0,1.0,43.0,2.0,0.0,1061.0
3,4.0,1.0,2.0,4.4,1.0,4.0,2.0,4.0,4.0,4.0,...,4.0,5.0,4.0,4.0,1.0,1.0,44.0,2.0,0.0,780.0
4,4.0,1.0,4.0,4.0,4.0,2.0,5.0,6.0,5.0,5.0,...,5.0,5.0,5.0,5.0,1.0,1.0,50.0,2.0,0.0,1981.0


In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
trial_data.drop(columns=["Seat_Class"], axis=1, inplace=True)
test_x.drop(columns=["Seat_Class"], axis=1, inplace=True)

In [None]:
# Build random forest model
# rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
# rf_model.fit(trial_data, y)

In [None]:
# Build 2nd random forest model
rf_model = RandomForestClassifier(n_estimators=120, max_depth=22, random_state=42)
rf_model.fit(trial_data, y)

RandomForestClassifier(max_depth=22, n_estimators=120, random_state=42)

In [None]:
# Make predictions on the test data
rf_pred = rf_model.predict(test_x)

In [None]:
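# Convert predictions to binary labels.
# rf_model.predict() already returns class labels, so this rounding changes nothing;
# run it only when the random forest predictions are used on their own, otherwise skip it.
rf_pred = np.round(rf_pred)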

In [None]:
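# Save the predictions to a CSV file
# Saved under a new name so the Model 4 file 'predictions4.csv' is not overwritten
predictions = pd.DataFrame({'ID': test_data['ID'], 'Overall_Experience': rf_pred.flatten().astype(int)})
#predictions.to_csv(path+'predictions6.csv', index=False)
predictions.to_csv('predictions6.csv', index=False)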

In [None]:
print(pd.DataFrame(rf_model.feature_importances_, columns = ["Imp"], index = trial_data.columns).sort_values(by = 'Imp', ascending = False))

                              Imp
Onboard_Entertainment    0.222947
Seat_Comfort             0.133469
Ease_of_Online_Booking   0.088368
Online_Support           0.060689
Legroom                  0.042390
Customer_Type            0.039399
Travel_Distance          0.037413
Catering                 0.037406
Travel_Class             0.037249
Online_Boarding          0.037166
Age                      0.032695
Onboard_Service          0.030738
Gender                   0.028577
Cleanliness              0.026795
CheckIn_Service          0.026290
Type_Travel              0.025821
Baggage_Handling         0.025600
Arrival_Time_Convenient  0.024169
Platform_Location        0.020468
Onboard_Wifi_Service     0.017591
Seat_Class               0.004759
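Random forest importances sum to 1, so a running total shows how much of the model's split quality is concentrated in the top features. The sketch below reuses the five largest values from the output above; it is a way of reading the table, not an additional result.

```python
import pandas as pd

# The five largest importances, copied from the printed table.
importances = pd.Series({
    "Onboard_Entertainment": 0.222947,
    "Seat_Comfort": 0.133469,
    "Ease_of_Online_Booking": 0.088368,
    "Online_Support": 0.060689,
    "Legroom": 0.042390,
})

# Running total from the most to the least important feature.
cumulative = importances.sort_values(ascending=False).cumsum()
print(cumulative.round(3))
```

The first three features alone account for roughly 44% of the total importance, and the first five for about 55%.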
