# 🚄 **Shinkansen Bullet Train Passenger Experience Prediction**  

This notebook presents the **data preprocessing, feature engineering, and model training** steps used to predict passenger satisfaction (`Overall_Experience`) for the **Shinkansen Bullet Train**. The dataset consists of two parts: **Survey Data** capturing passenger feedback and **Travel Data** containing travel-related details. Both datasets were merged to create a comprehensive feature set.  

The project required thorough **data cleaning**, including handling missing values, encoding categorical variables, and aligning the test dataset with the training set. Several machine learning models were tested, and after **hyperparameter tuning and cross-validation**, **XGBoost** was selected as the best-performing model.  

This notebook focuses only on the **final and optimized approach** that achieved the highest accuracy in the competition. Other models and preprocessing techniques explored during the process are excluded for clarity. 🚀

## **Implementation Plan**
To build an effective classification model, the following steps are performed:

1. **Data Loading & Cleaning** – Handle missing values and preprocess the data.
2. **Feature Engineering** – Encode categorical variables with one-hot encoding.
3. **Model Selection** – Train multiple models.
4. **Hyperparameter Tuning** – Optimize model parameters using **GridSearchCV**.
5. **Model Evaluation** – Compare models using **cross-validation accuracy**.
6. **Prediction & Submission** – Generate predictions for test data and format the results for submission.

---

## **Let's Get Started!**
Below is the Python implementation of the solution:

In [39]:
# Importing necessary libraries
import pandas as pd 
import numpy as np 
# Data preprocessing
from sklearn.preprocessing import OneHotEncoder
# Model selection and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score
# Machine learning models
from xgboost.sklearn import XGBClassifier
# Model persistence
import joblib

## Loading the Survey Data (Training Set)

The **survey dataset** (`Surveydata_train.csv`) is loaded into a Pandas DataFrame.  
- This dataset contains **passenger feedback** on various aspects of their travel experience.  
- It includes responses to **multiple survey questions**, including the **target variable** (`Overall_Experience`).  

This data will later be merged with the **travel dataset** to create a complete training dataset.


In [40]:
df_survey = pd.read_csv('Surveydata_train.csv')

In [41]:
df_survey.head()

Unnamed: 0,ID,Overall_Experience,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,Legroom,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding
0,98800001,0,Needs Improvement,Green Car,Excellent,Excellent,Very Convenient,Good,Needs Improvement,Acceptable,Needs Improvement,Needs Improvement,Acceptable,Needs Improvement,Good,Needs Improvement,Poor
1,98800002,0,Poor,Ordinary,Excellent,Poor,Needs Improvement,Good,Poor,Good,Good,Excellent,Needs Improvement,Poor,Needs Improvement,Good,Good
2,98800003,1,Needs Improvement,Green Car,Needs Improvement,Needs Improvement,Needs Improvement,Needs Improvement,Good,Excellent,Excellent,Excellent,Excellent,Excellent,Good,Excellent,Excellent
3,98800004,0,Acceptable,Ordinary,Needs Improvement,,Needs Improvement,Acceptable,Needs Improvement,Acceptable,Acceptable,Acceptable,Acceptable,Acceptable,Good,Acceptable,Acceptable
4,98800005,1,Acceptable,Ordinary,Acceptable,Acceptable,Manageable,Needs Improvement,Good,Excellent,Good,Good,Good,Good,Good,Good,Good


## Checking for Missing Values in the Survey Data

To ensure data quality, we check for **missing values** in the `Surveydata_train.csv` dataset.  
- This helps identify columns that require **imputation or removal** before merging with the travel dataset.  
- The output will show **True** for columns that have missing values and **False** for those that are complete.

Handling missing values properly ensures a **clean and reliable dataset** for model training.


In [42]:
df_survey.isna().any()

ID                         False
Overall_Experience         False
Seat_Comfort                True
Seat_Class                 False
Arrival_Time_Convenient     True
Catering                    True
Platform_Location           True
Onboard_Wifi_Service        True
Onboard_Entertainment       True
Online_Support              True
Ease_of_Online_Booking      True
Onboard_Service             True
Legroom                     True
Baggage_Handling            True
CheckIn_Service             True
Cleanliness                 True
Online_Boarding             True
dtype: bool

## Counting Missing Values in the Survey Data

After identifying which columns have missing values, we now count **how many columns contain missing data** in `Surveydata_train.csv`.  
- This helps assess the **extent of missing data** and decide on appropriate handling strategies.  
- A result of `0` means there are **no missing values**, while a nonzero result indicates **some columns require imputation or cleaning**.

This step ensures that we maintain **data integrity** before merging with the travel dataset.


In [43]:
df_survey.isna().any().sum()

np.int64(14)
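
The count above says how many columns have gaps, but not how large each gap is. A minimal sketch of a per-column summary follows; the `toy` frame and its values are made up for illustration and stand in for `df_survey`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_survey (values are made up for illustration)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Seat_Comfort": ["Good", np.nan, "Poor", "Good"],
    "Catering": [np.nan, np.nan, "Excellent", "Good"],
})

# Number of missing cells per column, and the same as a share of all rows
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
# Keep only columns that actually have gaps, worst first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to the real data, a table like this makes it easier to see which columns are missing only a handful of values and which are missing several thousand.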

## Checking the Dimensions of the Survey Data

To understand the **size of the dataset**, we check its shape using `.shape`.  
- This returns a tuple **(number of rows, number of columns)** in `Surveydata_train.csv`.  
- Knowing the dataset's size helps in **memory management, merging datasets, and preprocessing decisions**.

This step ensures we have a clear understanding of the **survey data structure** before further processing.


In [44]:
df_survey.shape

(94379, 17)

## Loading the Travel Data (Training Set)

The **travel dataset** (`Traveldata_train.csv`) is loaded into a Pandas DataFrame.  
- This dataset contains **passenger details and travel-related attributes**, such as age, delays, and travel class.  
- It provides important contextual data that will later be merged with the **survey dataset** to build a comprehensive training dataset.

Loading this dataset is essential for combining **travel data with passenger feedback** for model training.


In [45]:
df_travel = pd.read_csv('Traveldata_train.csv')

In [46]:
df_travel.head()

| | ID | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 98800001 | Female | Loyal Customer | 52.0 | NaN | Business | 272 | 0.0 | 5.0 |
| 1 | 98800002 | Male | Loyal Customer | 48.0 | Personal Travel | Eco | 2200 | 9.0 | 0.0 |
| 2 | 98800003 | Female | Loyal Customer | 43.0 | Business Travel | Business | 1061 | 77.0 | 119.0 |
| 3 | 98800004 | Female | Loyal Customer | 44.0 | Business Travel | Business | 780 | 13.0 | 18.0 |
| 4 | 98800005 | Female | Loyal Customer | 50.0 | Business Travel | Business | 1981 | 0.0 | 0.0 |


## Checking for Missing Values in the Travel Data

Before merging with the survey dataset, we check for **missing values** in `Traveldata_train.csv`.  
- This helps identify columns that may need **imputation or removal** during preprocessing.  
- The output will show **True** for columns containing missing values and **False** for those that are complete.

Handling missing values properly ensures a **clean and reliable dataset** for training.


In [47]:
df_travel.isna().any()

ID                         False
Gender                      True
Customer_Type               True
Age                         True
Type_Travel                 True
Travel_Class               False
Travel_Distance            False
Departure_Delay_in_Mins     True
Arrival_Delay_in_Mins       True
dtype: bool

## Counting Missing Values in the Travel Data

To assess the **extent of missing data**, we count how many columns contain missing values in `Traveldata_train.csv`.  
- A result of **0** means there are **no missing values**, ensuring a clean dataset.  
- A nonzero result indicates that **some columns require imputation or cleaning** before merging with the survey dataset.  

This step helps in deciding how to **handle missing data efficiently** during preprocessing.


In [48]:
df_travel.isna().any().sum()

np.int64(6)

## Checking the Dimensions of the Travel Data

To understand the **size of the travel dataset**, we check its shape using `.shape`.  
- This returns a tuple **(number of rows, number of columns)** in `Traveldata_train.csv`.  
- Knowing the dataset's size helps in **memory management, merging datasets, and preprocessing decisions**.  

This step ensures we have a clear understanding of the **travel data structure** before further processing.


In [49]:
df_travel.shape

(94379, 9)

## Verifying Row Count Consistency Between Survey and Travel Data

Before merging, we check if the **survey dataset** (`Surveydata_train.csv`) and the **travel dataset** (`Traveldata_train.csv`) have the same number of rows.  
- Equal row counts are a quick sanity check for a **one-to-one relationship** between survey responses and travel details; on their own they do not prove that the `ID` values match.  
- If the result is **True**, both datasets have the same number of rows and the merge on `ID` can proceed.  
- If **False**, it indicates a mismatch that may require **further investigation** before merging.

Ensuring dataset consistency is crucial for creating a reliable training dataset.


In [50]:
df_survey.shape[0] == df_travel.shape[0]

True
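
Matching row counts do not guarantee matching passengers, so a stricter check compares the `ID` values themselves. The following is a small sketch on made-up frames; `survey` and `travel` are illustrative stand-ins for `df_survey` and `df_travel`.

```python
import pandas as pd

# Toy stand-ins for df_survey and df_travel (made-up IDs)
survey = pd.DataFrame({"ID": [101, 102, 103], "Overall_Experience": [0, 1, 1]})
travel = pd.DataFrame({"ID": [103, 101, 102], "Age": [52.0, 48.0, 43.0]})

# Each ID should appear once per table...
ids_unique = survey["ID"].is_unique and travel["ID"].is_unique
# ...and both tables should cover exactly the same set of IDs (order may differ)
same_ids = set(survey["ID"]) == set(travel["ID"])

print(ids_unique, same_ids)
```

If both flags are true, every survey row has exactly one travel row and vice versa, regardless of row order.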

## Exploring the Structure of the Survey Data

To understand the **data types** and **missing values**, we use `.info()` on `Surveydata_train.csv`.  
- This displays **column names, data types, and non-null value counts** for each feature.  
- Helps identify **categorical vs. numerical columns** for appropriate preprocessing.  
- Highlights **missing values**, guiding data cleaning and imputation strategies.

This step ensures a better understanding of the dataset before merging and further preprocessing.


In [51]:
df_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   ID                       94379 non-null  int64 
 1   Overall_Experience       94379 non-null  int64 
 2   Seat_Comfort             94318 non-null  object
 3   Seat_Class               94379 non-null  object
 4   Arrival_Time_Convenient  85449 non-null  object
 5   Catering                 85638 non-null  object
 6   Platform_Location        94349 non-null  object
 7   Onboard_Wifi_Service     94349 non-null  object
 8   Onboard_Entertainment    94361 non-null  object
 9   Online_Support           94288 non-null  object
 10  Ease_of_Online_Booking   94306 non-null  object
 11  Onboard_Service          86778 non-null  object
 12  Legroom                  94289 non-null  object
 13  Baggage_Handling         94237 non-null  object
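 14  CheckIn_Service          94302 non-null  object
 15  Cleanliness              94373 non-null  object
 16  Online_Boarding          94373 non-null  object
dtypes: int64(2), object(15)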

## Statistical Summary of the Survey Data

To gain insights into the **distribution and characteristics** of features in `Surveydata_train.csv`, we use `.describe(include='all')`.  
- This provides **summary statistics** for **numerical** and **categorical** columns.  
- For **numerical features**, it shows values like **mean, min, max, and standard deviation**.  
- For **categorical features**, it displays **unique values, most frequent categories, and their counts**.  

This step helps in **understanding data trends, detecting anomalies, and preparing for preprocessing**.


In [52]:
df_survey.describe(include='all')

Unnamed: 0,ID,Overall_Experience,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,Legroom,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding
count,94379.0,94379.0,94318,94379,85449,85638,94349,94349,94361,94288,94306,86778,94289,94237,94302,94373,94373
unique,,,6,2,6,6,6,6,6,6,6,6,6,5,6,6,6
top,,,Acceptable,Green Car,Good,Acceptable,Manageable,Good,Good,Good,Good,Good,Good,Good,Good,Good,Good
freq,,,21158,47435,19574,18468,24173,22835,30446,30016,28909,27265,28870,34944,26502,35427,25533
mean,98847190.0,0.546658,,,,,,,,,,,,,,,
std,27245.01,0.497821,,,,,,,,,,,,,,,
min,98800000.0,0.0,,,,,,,,,,,,,,,
25%,98823600.0,0.0,,,,,,,,,,,,,,,
50%,98847190.0,1.0,,,,,,,,,,,,,,,
75%,98870780.0,1.0,,,,,,,,,,,,,,,


## Exploring the Structure of the Travel Data

To understand the **data types, missing values, and column structure** in `Traveldata_train.csv`, we use `.info()`.  
- Displays **column names, data types, and non-null value counts** for each feature.  
- Helps distinguish between **categorical and numerical columns** for preprocessing.  
- Identifies **missing values**, guiding imputation or data cleaning strategies.  

This step ensures that we properly handle the travel dataset before merging it with the survey data.


In [53]:
df_travel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       94379 non-null  int64  
 1   Gender                   94302 non-null  object 
 2   Customer_Type            85428 non-null  object 
 3   Age                      94346 non-null  float64
 4   Type_Travel              85153 non-null  object 
 5   Travel_Class             94379 non-null  object 
 6   Travel_Distance          94379 non-null  int64  
 7   Departure_Delay_in_Mins  94322 non-null  float64
 8   Arrival_Delay_in_Mins    94022 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 6.5+ MB


## Statistical Summary of the Travel Data

To analyze the **distribution and key statistics** of numerical features in `Traveldata_train.csv`, we use `.describe()`.  
- Provides **summary statistics** such as **mean, min, max, standard deviation, and percentiles** for numerical columns.  
- Helps detect **outliers, missing values, and skewed distributions** in travel-related features.  
- Supports **feature scaling decisions** and guides **data preprocessing strategies**.  

This step ensures a **better understanding of the dataset** before merging and model training.


In [54]:
df_travel.describe()

| | ID | Age | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins |
|---|---|---|---|---|---|
| count | 94379.0 | 94346.0 | 94379.0 | 94322.0 | 94022.0 |
| mean | 98847190.0 | 39.419647 | 1978.888185 | 14.647092 | 15.005222 |
| std | 27245.01 | 15.116632 | 1027.961019 | 38.138781 | 38.439409 |
| min | 98800000.0 | 7.0 | 50.0 | 0.0 | 0.0 |
| 25% | 98823600.0 | 27.0 | 1359.0 | 0.0 | 0.0 |
| 50% | 98847190.0 | 40.0 | 1923.0 | 0.0 | 0.0 |
| 75% | 98870780.0 | 51.0 | 2538.0 | 12.0 | 13.0 |
| max | 98894380.0 | 85.0 | 6951.0 | 1592.0 | 1584.0 |


## Merging Survey and Travel Data

To create a **comprehensive training dataset**, we merge the **survey data** (`Surveydata_train.csv`) with the **travel data** (`Traveldata_train.csv`) using the **common key** `ID`.  
- Ensures each passenger’s **travel details** and **survey responses** are combined into a single dataset.  
- The `on='ID'` argument matches rows by **passenger ID**; because `pd.merge` defaults to an inner join, only IDs present in both datasets are kept.  
- The merged dataset `df` will be used for **data preprocessing and model training**.

This step is crucial for integrating **passenger experience and travel attributes** into a unified dataset.


In [55]:
df = pd.merge(df_survey, df_travel, on='ID')

In [56]:
df.head()

Unnamed: 0,ID,Overall_Experience,Seat_Comfort,Seat_Class,Arrival_Time_Convenient,Catering,Platform_Location,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,...,Cleanliness,Online_Boarding,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins
0,98800001,0,Needs Improvement,Green Car,Excellent,Excellent,Very Convenient,Good,Needs Improvement,Acceptable,...,Needs Improvement,Poor,Female,Loyal Customer,52.0,,Business,272,0.0,5.0
1,98800002,0,Poor,Ordinary,Excellent,Poor,Needs Improvement,Good,Poor,Good,...,Good,Good,Male,Loyal Customer,48.0,Personal Travel,Eco,2200,9.0,0.0
2,98800003,1,Needs Improvement,Green Car,Needs Improvement,Needs Improvement,Needs Improvement,Needs Improvement,Good,Excellent,...,Excellent,Excellent,Female,Loyal Customer,43.0,Business Travel,Business,1061,77.0,119.0
3,98800004,0,Acceptable,Ordinary,Needs Improvement,,Needs Improvement,Acceptable,Needs Improvement,Acceptable,...,Acceptable,Acceptable,Female,Loyal Customer,44.0,Business Travel,Business,780,13.0,18.0
4,98800005,1,Acceptable,Ordinary,Acceptable,Acceptable,Manageable,Needs Improvement,Good,Excellent,...,Good,Good,Female,Loyal Customer,50.0,Business Travel,Business,1981,0.0,0.0
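
Because the default inner join silently drops unmatched IDs, it can be useful to make the merge report problems explicitly. The sketch below is one way to do that with the `validate` and `indicator` options of `pd.merge`; the frames are made up, and the deliberately mismatched IDs are an assumption for illustration only.

```python
import pandas as pd

survey = pd.DataFrame({"ID": [1, 2, 3], "Overall_Experience": [0, 1, 1]})
travel = pd.DataFrame({"ID": [1, 2, 4], "Age": [52.0, 48.0, 43.0]})

# how="outer" keeps unmatched rows from both sides so they can be inspected;
# validate raises MergeError if an ID repeats on either side;
# indicator adds a "_merge" column saying where each row came from.
merged = pd.merge(survey, travel, on="ID", how="outer",
                  validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["ID", "_merge"]])
```

An empty `unmatched` frame on the real data would confirm that the inner join used in the notebook loses no passengers.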


## Saving and Reloading the Merged Dataset

After merging the **survey** and **travel** datasets, we save the **combined dataset** as `df.csv`.  
- This ensures that the processed dataset is stored for future use **without re-running the merge operation**.  
- The dataset is then **reloaded** to verify the saved file and continue with preprocessing.

This step helps maintain **efficiency** and avoids unnecessary recomputation in later stages.


In [57]:
df.to_csv('df.csv', index=False)

In [58]:
# Load the dataset from a CSV file into a Pandas DataFrame
df = pd.read_csv('df.csv')

##  Data Preprocessing: Feature Encoding & Missing Value Handling

### **Overview**
This step focuses on **transforming categorical features into a machine-learning-friendly format** and **handling missing values** to ensure a clean and consistent dataset for model training.

### **🔹 Key Steps in This Preprocessing Phase**
1. **Categorical Feature Encoding:**
   - The dataset includes **ordinal features** (ordered categories like `Seat_Comfort`) and **nominal features** (unordered categories like `Gender`).
   - These features are **converted into numerical format** using **One-Hot Encoding**, ensuring all categorical variables are represented in a way suitable for machine learning algorithms.
   - The **original categorical columns** are then removed, and the newly encoded columns are added to the dataset.

2. **Handling Missing Values:**
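   - **Categorical Features:** `pd.get_dummies` creates no indicator column for missing values, so a row with a missing category gets **0 in every binary column** of that feature; the following `fillna(0)` is only a safeguard.
   - **Numerical Features:**
     - The **Age** column is filled with the fixed value **40**, the median age in the training data (the mean is about 39.4).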
     - Delay-related features like **`Departure_Delay_in_Mins`** and **`Arrival_Delay_in_Mins`** are filled with their **mean values** to avoid data loss.
   - A **dictionary of imputed values** is saved to maintain consistency when processing test data.

3. **Saving Preprocessing Information:**
   - The **column names** of the transformed dataset are saved to ensure alignment with the test set during inference.
   - A separate file (`numerical_means.csv`) is created to store the fill values (the fixed Age value and the computed delay means), which will be applied to the test set for consistency.
   - The final **cleaned dataset** (`df_cleaned_one_hot.csv`) is saved for training.

### **🔹 Why Is This Step Important?**
✔️ **Prepares categorical data** for ML models that require numerical inputs.  
✔️ **Handles missing values systematically**, preventing model biases due to inconsistent data.  
✔️ **Ensures train-test alignment**, avoiding discrepancies when applying the model to new data.  

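After this step, we will have a **fully cleaned dataset**, ready for model training. No feature scaling is applied in this notebook; tree-based models such as XGBoost split on thresholds and do not depend on feature scale.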


In [59]:

# Define Ordinal and Nominal Columns
# Ordinal features (ordered categories)
ordinal_cols = [
    'Seat_Comfort', 'Arrival_Time_Convenient', 'Catering', 'Onboard_Wifi_Service',
    'Onboard_Entertainment', 'Online_Support', 'Ease_of_Online_Booking',
    'Onboard_Service', 'Legroom', 'Baggage_Handling', 'CheckIn_Service',
    'Cleanliness', 'Online_Boarding', 'Platform_Location'
]
# Nominal features (unordered categories)
nominal_cols = ['Gender', 'Customer_Type', 'Type_Travel', 'Travel_Class', 'Seat_Class']

# Dictionary to Store Filling Values
filling_values = {}

# 🚀 Step 1: One-Hot Encode Ordinal and Nominal Columns
# Combine all categorical columns
all_cat_cols = ordinal_cols + nominal_cols
# Apply one-hot encoding (convert categorical values to binary features)
df_cat_encoded = pd.get_dummies(df[all_cat_cols], drop_first=False, dtype=int)

# Replace NA with 0 in One-Hot Encoded Columns
df_cat_encoded.fillna(0, inplace=True)

# 🚀 Step 2: Drop Original Categorical Columns and Add Encoded Columns
df.drop(columns=all_cat_cols, inplace=True)
df = pd.concat([df, df_cat_encoded], axis=1)

# 🚀 Step 3: Fill Numerical Columns
numerical_cols = ['Departure_Delay_in_Mins', 'Arrival_Delay_in_Mins', 'Age']
numerical_fill_values = {}

# Fill 'Age' with a fixed value of 40 (the training median)
df['Age'] = df['Age'].fillna(40)
numerical_fill_values['Age'] = 40

# Fill Other Numerical Columns with Their Mean
for col in ['Departure_Delay_in_Mins', 'Arrival_Delay_in_Mins']:
    mean_value = df[col].mean()
    df[col] = df[col].fillna(mean_value)
    numerical_fill_values[col] = mean_value

# Save Numerical Means for Reuse in Test Set
numerical_means_df = pd.DataFrame(list(numerical_fill_values.items()), columns=["Column", "Mean_Value"])
numerical_means_df.to_csv("numerical_means.csv", index=False)

# 🚀 Step 4: Save Column Names for Future Consistency (Test Set)
column_names = df.columns.tolist()
pd.DataFrame(column_names, columns=["Column"]).to_csv("one_hot_columns.csv", index=False)

# Save the Cleaned Training Dataset
df.to_csv("df_cleaned_one_hot.csv", index=False)

# ✅ Final Status Update
print("✅ Training set cleaned and saved.")
print("✅ 'one_hot_columns.csv' and 'numerical_means.csv' saved for test alignment.")


✅ Training set cleaned and saved.
✅ 'one_hot_columns.csv' and 'numerical_means.csv' saved for test alignment.
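
The way missing categories are handled by the encoding step can be seen on a tiny example. This is a sketch on a made-up single-column frame, not the notebook's data; the `dummy_na=True` variant is shown only as an alternative and is not what the cell above uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Catering": ["Good", np.nan, "Poor"]})

# Default behaviour: no indicator column for NaN, so the missing row is all zeros
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)

# dummy_na=True adds an explicit "missing" indicator column instead
encoded_na = pd.get_dummies(toy, dummy_na=True, dtype=int)
print(encoded_na.columns.tolist())
```

With the default, "missing" is encoded implicitly as the all-zero pattern, which is why the encoded frame contains no `NaN` left to fill.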


## Loading the Test Dataset

After preprocessing the training dataset, we now load the **test dataset**, for which the trained model will generate predictions.

### **🔹 What is the Test Dataset?**
- The test dataset (`df_test.csv`) contains **passenger and travel-related features**, just like the training dataset, but **without the target variable** (`Overall_Experience`).
- Our goal is to **preprocess this dataset in the same way** as the training data, ensuring consistency before making predictions.

### **🔹 Why is This Step Important?**
✔️ Ensures the **test dataset matches the train dataset** in terms of feature transformations.  
✔️ Lets the competition **score the model on unseen data**, since the test labels are not available locally.  
✔️ Allows us to **generate predictions** for submission or further analysis.  

Next, we will preprocess `df_test` using the **same transformations applied to the training dataset** to maintain consistency.


In [60]:
# Load the test dataset from a CSV file into a Pandas DataFrame
df_test = pd.read_csv('df_test.csv')

## Preprocessing the Test Dataset

To ensure consistency with the training data, we apply the same preprocessing steps to the test dataset.  
- **One-hot encoding** is used to convert categorical features into numerical format.  
- Missing values in **numerical columns** are filled using **precomputed means from training data**.  
- The test dataset is **aligned with the training feature set**, adding missing columns (set to 0) and removing extra ones.  
- Finally, the cleaned test dataset is saved as `"df_test_one_hot_encoded.csv"` for model predictions.  


In [61]:

# Load Numerical Means from Training
numerical_means = pd.read_csv("numerical_means.csv").set_index("Column")["Mean_Value"].to_dict()

# Step 1: One-Hot Encode Ordinal and Nominal Columns
all_cat_cols = [
    'Seat_Comfort', 'Arrival_Time_Convenient', 'Catering', 'Onboard_Wifi_Service',
    'Onboard_Entertainment', 'Online_Support', 'Ease_of_Online_Booking',
    'Onboard_Service', 'Legroom', 'Baggage_Handling', 'CheckIn_Service',
    'Cleanliness', 'Online_Boarding', 'Platform_Location',
    'Gender', 'Customer_Type', 'Type_Travel', 'Travel_Class', 'Seat_Class'
]

df_test_cat_encoded = pd.get_dummies(df_test[all_cat_cols], drop_first=False, dtype=int)

# Replace NA with 0 in One-Hot Encoded Columns
df_test_cat_encoded.fillna(0, inplace=True)

# Step 2: Drop Original Categorical Columns and Add Encoded Columns
df_test.drop(columns=all_cat_cols, inplace=True)
df_test = pd.concat([df_test, df_test_cat_encoded], axis=1)

# Step 3: Fill Numerical Columns Using Training Means
numerical_cols = ["Age", "Departure_Delay_in_Mins", "Arrival_Delay_in_Mins"]

# Fill 'Age' with 40
df_test['Age'] = df_test['Age'].fillna(40)

# Fill Other Numerical Columns with Training Means
for col in ['Departure_Delay_in_Mins', 'Arrival_Delay_in_Mins']:
    mean_value = numerical_means.get(col, df_test[col].mean())
    df_test[col] = df_test[col].fillna(mean_value)

# Step 4: Align Columns with Training Set
train_columns = pd.read_csv("one_hot_columns.csv")["Column"].tolist()

# Add Missing Columns (Set to 0)
for col in train_columns:
    if col not in df_test.columns:
        df_test[col] = 0

# Drop Extra Columns
df_test = df_test[train_columns]

# Save Processed Test Dataset
df_test.to_csv("df_test_one_hot_encoded.csv", index=False)

print("✅ Test set cleaned using training means and saved.")


✅ Test set cleaned using training means and saved.
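
The add-missing-then-select alignment in Step 4 can also be written as a single `reindex` call. The sketch below assumes a made-up list of training columns and a toy encoded test frame; the column names are illustrative and are not read from `one_hot_columns.csv`.

```python
import pandas as pd

# Hypothetical training columns, as they would be read from one_hot_columns.csv
train_columns = ["Age", "Gender_Female", "Gender_Male", "Travel_Class_Eco"]

# Toy encoded test frame: lacks Gender_Male, has an extra unseen category
test_encoded = pd.DataFrame({
    "Age": [30.0, 41.0],
    "Gender_Female": [1, 1],
    "Travel_Class_Eco": [0, 1],
    "Travel_Class_Premium": [1, 0],
})

# reindex adds missing columns filled with 0, drops extras, and fixes the order
aligned = test_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The result has exactly the training columns in the training order, which matters because the model expects the same feature layout at prediction time.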


In [62]:
# Remove the 'ID' column from the dataset as it is not needed for model training
del df['ID']

In [63]:
# Define features (X) and target (y)
X_train = df.drop(columns=["Overall_Experience"])  # Training features
y_train = df["Overall_Experience"]  # Training target

In [64]:
# Remove the 'ID' column from the test dataset as it is not required for model prediction
del df_test['ID']

In [65]:
# Remove the placeholder 'Overall_Experience' column: the test data has no target, and the column-alignment step added it filled with 0
del df_test['Overall_Experience']

## Hyperparameter Tuning for XGBoost

To optimize the performance of **XGBoost**, a hyperparameter tuning process was conducted through **multiple iterations** of Grid Search.  
- The grid below reflects the **final iteration**: all parameters are fixed at values chosen in earlier searches except `colsample_bytree`, which is searched over three values.  
- A **Grid Search with 5-fold Cross-Validation** is performed to systematically evaluate different combinations of hyperparameters.  
- The model is trained using **`X_train`** and validated using cross-validation, with accuracy as the evaluation metric.  

The results will display **the best hyperparameter combination** along with the highest cross-validation accuracy achieved.


In [66]:
# Define hyperparameter grid for XGBoost
param_grid_xgb = {
    'n_estimators': [500, ],
    'max_depth': [20],
    'min_child_weight': [1],
    'learning_rate': [0.05],
    'subsample': [0.8],
    'colsample_bytree': [0.75, 0.8, 0.85]
}

# Perform Grid Search with Cross-Validation for XGBoost
print("Running Grid Search for XGBoost...")
xgb_s = XGBClassifier(random_state=42)
grid_search_xgb_s = GridSearchCV(xgb_s, param_grid_xgb, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search_xgb_s.fit(X_train, y_train)

print(f"🔹 XGBoost Best Accuracy: {grid_search_xgb_s.best_score_:.4f}")
print(f"🔹 XGBoost Best Params: {grid_search_xgb_s.best_params_}")

Running Grid Search for XGBoost...
Fitting 5 folds for each of 3 candidates, totalling 15 fits
🔹 XGBoost Best Accuracy: 0.9585
🔹 XGBoost Best Params: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 20, 'min_child_weight': 1, 'n_estimators': 500, 'subsample': 0.8}
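
The printout above reports only the winning combination. To see how close the other candidates were, the fitted search object's `cv_results_` attribute can be turned into a table. The following sketch uses synthetic data and a small decision tree as stand-ins (both are assumptions, chosen so the example runs quickly); the same pattern should apply to `grid_search_xgb_s`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small tree stand in for X_train/y_train and XGBoost
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

# One row per candidate: mean and spread of the fold accuracies, plus the rank
results = pd.DataFrame(grid.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of whether the best candidate is meaningfully better or within fold-to-fold noise.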


## Training the Final XGBoost Model

After performing **hyperparameter tuning**, the best parameters identified from **Grid Search** are now used to train the **final XGBoost model**.  
- The model is initialized with the **optimal hyperparameters** obtained from cross-validation.  
- It is then **trained on the full training dataset (`X_train`, `y_train`)** to maximize performance.  

This trained model will be used to make the **final predictions** on the test data.


In [67]:
final_xgb = XGBClassifier(**grid_search_xgb_s.best_params_, random_state=42)
final_xgb.fit(X_train, y_train)
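
The notebook imports `joblib` for model persistence but does not show the call. A minimal sketch of saving and restoring a fitted model follows; a small scikit-learn tree stands in for `final_xgb`, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A small scikit-learn model stands in for final_xgb
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Write the fitted model to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")  # hypothetical file name
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = (restored.predict(X) == model.predict(X)).all()
print(same)
```

Persisting the model this way would allow predictions to be regenerated later without repeating the grid search.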

## Making Final Predictions with XGBoost

Now that the **final XGBoost model** has been trained with the best hyperparameters, we use it to make predictions on the **preprocessed test dataset** (`df_test`).  
- The model predicts whether each passenger was **satisfied (1) or not satisfied (0)** based on the provided features.  
- These predictions will be used for evaluation and final submission.

The variable `final_predictions_xgb_40` stores the predicted values for the test dataset.


In [68]:
final_predictions_xgb_40 = final_xgb.predict(df_test)

In [69]:
final_predictions_xgb_40

array([1, 1, 1, ..., 1, 1, 0], shape=(35602,))

## Loading the Cleaned Test Dataset with IDs

To prepare for the final submission, we **reload the cleaned test dataset** (`df_test_cleaned.csv`) into `df_test_ID`.  
- This dataset contains the **passenger IDs** that were removed during preprocessing.  
- The IDs will be used to **match predictions with the correct passengers** for the final output file.


## Creating the Final Submission File

Now, we generate the **submission file** containing predictions for each passenger:  
- A new DataFrame (`submission_xgb_40`) is created with two columns:  
  - **"ID"** → Passenger IDs from the cleaned test dataset.  
  - **"Overall_Experience"** → Predicted satisfaction levels (1 = Satisfied, 0 = Not Satisfied).  
- The file is saved as `"submission_xgb_40.csv"`, ready for final evaluation or submission.

This ensures the **predictions are properly linked to the original test IDs** for accurate reporting.


In [67]:
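# Reload the cleaned test dataset to recover the passenger IDs
# (assumes its rows are in the same order as df_test)
df_test_ID = pd.read_csv("df_test_cleaned.csv")

# Create submission DataFrame
submission_xgb_40 = pd.DataFrame({
    "ID": df_test_ID["ID"],  # Use the same IDs from the test file
    "Overall_Experience": final_predictions_xgb_40
})

submission_xgb_40.to_csv("submission_xgb_40.csv", index=False)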
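
Since predictions are paired with IDs purely by position, a few checks before writing the file can catch misalignment early. The sketch below uses made-up IDs and predictions; `test_ids` and `predictions` are illustrative stand-ins for `df_test_ID` and `final_predictions_xgb_40`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the ID frame and the prediction array
test_ids = pd.DataFrame({"ID": [99900001, 99900002, 99900003]})
predictions = np.array([1, 0, 1])

# Predictions are matched to IDs by position, so lengths must agree
assert len(test_ids) == len(predictions)

submission = pd.DataFrame({"ID": test_ids["ID"], "Overall_Experience": predictions})

# Basic checks before writing the file
ok = (
    submission["ID"].is_unique
    and submission["Overall_Experience"].isin([0, 1]).all()
    and not submission.isna().any().any()
)
print(ok)
```

These checks do not prove the row order is right, but they rule out length mismatches, duplicate IDs, and out-of-range labels.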