# Shinkansen Travel Experience

---------------
## **Context:**
---------------
This problem statement is based on the Shinkansen Bullet Train in Japan, and passengers’ experience with that mode of travel. This machine-learning exercise aims to determine the relative importance of each parameter with regard to its contribution to the passengers’ overall travel experience. The dataset contains a random sample of individuals who travelled on this train. The on-time performance of the trains along with passenger information is published in a file named ‘Traveldata_train.csv’. These passengers were later asked to provide their feedback on various parameters related to the travel, along with their overall experience. These collected details are made available in the survey report labelled ‘Surveydata_train.csv’.

In the survey, each passenger was explicitly asked whether they were satisfied with their overall travel experience or not, and that is captured in the data of the survey report under the variable labelled ‘Overall_Experience’. 

The objective of this problem is to understand which parameters play an important role in swaying passenger feedback towards the positive end of the scale. You are provided with test data containing the travel data and the survey data of passengers. Both the test data and the training data were collected at the same time and belong to the same population.

---------------
## **Goal:**
---------------
The goal of the problem is to predict whether a passenger was satisfied or not, considering their overall experience of traveling on the Shinkansen Bullet Train.

---------------
## **Dataset:**
---------------
The problem consists of 2 separate datasets: Travel data & Survey data. Travel data has information related to passengers and attributes of the Shinkansen train in which they traveled. The survey data is aggregated data of surveys indicating the post-service experience. You are expected to treat both these datasets as raw data and perform any necessary data cleaning/validation steps as required.

The data has been split into two groups and provided in the Dataset folder. The folder contains both train and test data separately.

- Train_Data
- Test_Data

Target Variable: Overall_Experience (1 represents ‘satisfied’, and 0 represents ‘not satisfied’)

The training set can be used to build your machine-learning model. The training set has labels for the target column - Overall_Experience.

The testing set should be used to see how well your model performs on unseen data. For the test set, it is expected to predict the ‘Overall_Experience’ level for each participant.

Data Dictionary:
All the data is self-explanatory. The survey levels are explained in the Data Dictionary file.

Submission File Format: You will need to submit a CSV file with exactly 35,602 entries plus a header row. The file should have exactly two columns:

- ID
- Overall_Experience (contains 0 & 1 values; 1 represents ‘Satisfied’, and 0 represents ‘Not Satisfied’)
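
A small check along these lines can catch format problems before uploading. This is only a sketch, not an official validator: `check_submission` is a hypothetical helper, and the tiny in-memory file below stands in for the real submission, which would be checked with `expected_rows=35602`.

```python
import csv
import io

def check_submission(csv_file, expected_rows):
    """Return a list of format problems found in a submission CSV (empty if none)."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    problems = []
    if header != ["ID", "Overall_Experience"]:
        problems.append(f"unexpected header: {header}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if any(len(row) != 2 or row[1] not in {"0", "1"} for row in rows):
        problems.append("Overall_Experience must be 0 or 1 in every row")
    return problems

# Made-up three-row file used only to demonstrate the check
sample = io.StringIO("ID,Overall_Experience\n1,1\n2,0\n3,1\n")
print(check_submission(sample, expected_rows=3))
```

An empty list means no problems were found; each string in a non-empty list describes one issue to fix.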

---------------
## **Evaluation Criteria:**
---------------
Accuracy Score: The evaluation metric is simply the percentage of predictions made by the model that turned out to be correct. This is also called the accuracy of the model. It will be calculated as the total number of correct predictions (True Positives + True Negatives) divided by the total number of observations in the dataset.
 
In other words, the best possible accuracy is 100% (or 1), and the worst possible accuracy is 0%.
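
As a quick illustration of the formula above, the following sketch computes accuracy from true and predicted labels in plain Python. The label lists are made-up values, and `accuracy` is a hypothetical helper name rather than part of the competition material.

```python
def accuracy(y_true, y_pred):
    """Return (TP + TN) / total for binary labels coded as 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (tp + tn) / len(y_true)

# Made-up labels: 1 = satisfied, 0 = not satisfied
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(y_true, y_pred)  # 6 of the 8 predictions match
print(score)
```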

To avoid local installation issues in Jupyter, it is advised to use **Google Colab** for this project.

Let's start by mounting Google Drive on Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive') 

In [None]:
# Basic python libraries
import numpy as np
import pandas as pd

# Python libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### **Loading the data**

In [None]:
# Read the train and test datasets and the data dictionary.
# Adjust these paths to wherever the CSV files are stored (on Colab, files on the mounted drive live under /content/drive/).
Traveldata_train = pd.read_csv(r"C:\Users\pc\Desktop\Shinkansen Travel Experience\Traveldata_train.csv")
Surveydata_train = pd.read_csv(r"C:\Users\pc\Desktop\Shinkansen Travel Experience\Surveydata_train.csv")
Traveldata_test = pd.read_csv(r"C:\Users\pc\Desktop\Shinkansen Travel Experience\Traveldata_test.csv")
Surveydata_test = pd.read_csv(r"C:\Users\pc\Desktop\Shinkansen Travel Experience\Surveydata_test.csv")
Data_Dictionary = pd.read_csv(r"C:\Users\pc\Desktop\Shinkansen Travel Experience\Data_Dictionary.csv")

In [None]:
# ensure that the data is loaded correctly
print(Traveldata_train.shape)
print(Surveydata_train.shape)
print(Traveldata_test.shape)
print(Surveydata_test.shape)


### **Exploring the data**

In [None]:
Traveldata_train.head()

In [None]:
Surveydata_train.head()

In [None]:
Traveldata_train.info()

In [None]:
Surveydata_train.info()

In [None]:
Traveldata_train.describe()

In [None]:
Surveydata_train.describe()

In [None]:
# Visualize missing values using a heatmap for better understanding:
plt.figure(figsize=(12, 8))
sns.heatmap(Traveldata_train.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Traveldata_train')
plt.show()

In [None]:
# Check for class imbalance in the target variable (Overall_Experience).
# The target lives in the survey data; the merged frame is only created later.
print(Surveydata_train["Overall_Experience"].value_counts(normalize=True))
plt.figure(figsize=(10, 6))
sns.countplot(x='Overall_Experience', data=Surveydata_train, palette='viridis')
plt.title('Class Distribution of Overall Experience')
plt.xlabel('Overall Experience')
plt.ylabel('Count')
plt.show()

#### **Now let's merge the travel and survey data for the train and test sets**

In [None]:
# Merge the Datasets
train_data = pd.merge(Traveldata_train, Surveydata_train, on="ID")
test_data = pd.merge(Traveldata_test, Surveydata_test, on="ID")
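
Because each passenger should appear exactly once in both files, it can be worth confirming that assumption while merging. The sketch below uses two small made-up frames (the `Travel_Distance` column and all values are illustrative, not taken from the dataset) together with pandas' `validate` and `indicator` options to flag duplicate or unmatched IDs.

```python
import pandas as pd

# Made-up stand-ins for the travel and survey tables
travel = pd.DataFrame({"ID": [1, 2, 3], "Travel_Distance": [120, 450, 300]})
survey = pd.DataFrame({"ID": [1, 2, 4], "Overall_Experience": [1, 0, 1]})

# validate="one_to_one" raises MergeError if an ID repeats in either table;
# how="outer" with indicator=True records which table each row came from.
merged = pd.merge(travel, survey, on="ID", how="outer",
                  validate="one_to_one", indicator=True)

# Rows present in only one table would be silently dropped by the default inner merge
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["ID", "_merge"]])
```

If `unmatched` is empty on the real data, the default inner merge used above loses no passengers.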

#### **Cleaning the datasets**

In [None]:
# Check for missing values in the merged dataset
train_data.isnull().sum()

In [None]:
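# Impute missing values with training-set statistics: mean for numeric columns, most frequent value for categorical ones
fill_values = {**train_data.mode().iloc[0].to_dict(), **train_data.mean(numeric_only=True).to_dict()}
train_data = train_data.fillna(fill_values)
test_data = test_data.fillna(fill_values)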

In [None]:
# drop the columns with missing values
# train_data = train_data.dropna(axis=1, how='any')

In [None]:
# Check for duplicates in the merged dataset
train_data.duplicated().sum()

In [None]:
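# Convert categorical variables to numerical codes, fitting each encoder on train and test values together
from sklearn.preprocessing import LabelEncoder
for col in train_data.select_dtypes(include="object").columns:
    encoder = LabelEncoder().fit(pd.concat([train_data[col], test_data[col]]).astype(str))
    train_data[col] = encoder.transform(train_data[col].astype(str))
    test_data[col] = encoder.transform(test_data[col].astype(str))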

#### **Exploratory Data Analysis (EDA)**

In [None]:
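# Plot the distribution of the target variable
sns.countplot(x="Overall_Experience", data=train_data)
plt.show()
# Compare one numeric feature across the two classes (here simply the first numeric column other than ID and the target)
feature = train_data.select_dtypes("number").columns.difference(["ID", "Overall_Experience"])[0]
sns.boxplot(x="Overall_Experience", y=feature, data=train_data)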

In [None]:
# Identify important features using correlation or feature importance techniques. 
# For example, using correlation matrix
correlation_matrix = train_data.corr(numeric_only=True)

In [None]:
# Correlation heatmap:
plt.figure(figsize=(12, 8))
sns.heatmap(train_data.corr(numeric_only=True), annot=True, cmap="coolwarm")

In [None]:
# Pairplot for selected features:
sns.pairplot(train_data, hue="Overall_Experience")

In [None]:
# Identifying categorical and numerical features separately for better analysis.
categorical_features = train_data.select_dtypes(include=['object']).columns.tolist()
numerical_features = train_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
print("Categorical Features: ", categorical_features)
print("Numerical Features: ", numerical_features)


In [None]:
# Consider using OneHotEncoder for categorical variables with more than two categories.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
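
One way to put these imports to work is a single preprocessing-plus-model pipeline, so that imputation and encoding learned on the training rows are applied unchanged to validation and test rows. The following is a minimal sketch on made-up data; the column names `Seat_Class` and `Age` and all values are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data with one categorical and one numeric column, including gaps
X_toy = pd.DataFrame({
    "Seat_Class": ["Green", "Ordinary", np.nan, "Green", "Ordinary", "Green"],
    "Age": [34.0, np.nan, 51.0, 27.0, 45.0, 62.0],
})
y_toy = pd.Series([1, 0, 0, 1, 0, 1])

categorical_cols = ["Seat_Class"]  # hypothetical column names
numeric_cols = ["Age"]

preprocess = ColumnTransformer([
    # Fill missing categories with the most frequent one, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
    # Fill missing numbers with the training median
    ("num", SimpleImputer(strategy="median"), numeric_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
clf.fit(X_toy, y_toy)

# A new row with an unseen category and a missing age is still handled
X_new = pd.DataFrame({"Seat_Class": ["Business"], "Age": [np.nan]})
pred = clf.predict(X_new)
print(pred)
```

Keeping the preprocessing inside the pipeline also means that `cross_val_score` and `GridSearchCV` refit the imputers and encoder on each fold, which avoids leaking validation statistics into training.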

In [None]:
# Scale numerical features using StandardScaler or MinMaxScaler for models sensitive to feature scaling (e.g., Logistic Regression, SVM).
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score, roc_curve, auc)


#### **Model Building**

In [None]:
# Split the training data into training and validation sets:
from sklearn.model_selection import train_test_split
X = train_data.drop(columns=["Overall_Experience"])
y = train_data["Overall_Experience"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train a classification model (e.g., Logistic Regression, Random Forest, Gradient Boosting):
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

In [None]:
# Evaluate the model on the validation set:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))

In [None]:
# feature importance visualization for Random Forest:
feature_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feature_importances.nlargest(10).plot(kind="barh")

In [None]:
# cross-validation to ensure the model's performance is consistent:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("Cross-Validation Accuracy:", scores.mean())

#### **Test the Model**

In [None]:
# Use the trained model to predict the Overall_Experience for the test dataset:
test_predictions = model.predict(test_data)

In [None]:
# Save the predictions in the required format:
submission = pd.DataFrame({"ID": test_data["ID"], "Overall_Experience": test_predictions})
submission.to_csv("submission.csv", index=False)

#### **Iteration and Improvement**

In [None]:
# Experiment with different models (e.g., XGBoost, LightGBM).
from xgboost import XGBClassifier
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_val)
print("XGBoost Accuracy:", accuracy_score(y_val, y_pred_xgb))

In [None]:
# Use the trained XGBoost model to predict the Overall_Experience for the test dataset:
test_predictions_xgb = xgb_model.predict(test_data)

In [None]:
# Save the predictions in the required format:
submission_xgb = pd.DataFrame({"ID": test_data["ID"], "Overall_Experience": test_predictions_xgb})
submission_xgb.to_csv("submission_xgb.csv", index=False)

In [None]:
# Perform hyperparameter tuning using GridSearchCV or RandomizedSearchCV.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_val)
print("Best Model Accuracy:", accuracy_score(y_val, y_pred_best))

In [None]:
# Use the best model to predict the Overall_Experience for the test dataset:
test_predictions_best = best_model.predict(test_data)

In [None]:
# Add a comparison of model performances (e.g., Random Forest vs. XGBoost vs. Voting Classifier) in a table or bar chart.
model_performance = pd.DataFrame({
    'Model': ['Random Forest', 'XGBoost', 'Best Model'],
    'Accuracy': [accuracy_score(y_val, y_pred), accuracy_score(y_val, y_pred_xgb), accuracy_score(y_val, y_pred_best)]
})
model_performance.plot(x='Model', y='Accuracy', kind='bar', legend=False)
plt.title('Model Performance Comparison')
plt.ylabel('Accuracy')
plt.show()


In [None]:
# Save the best model predictions in the required format:
submission_best = pd.DataFrame({"ID": test_data["ID"], "Overall_Experience": test_predictions_best})
submission_best.to_csv("submission_best.csv", index=False)

In [None]:
# Use feature selection techniques to improve performance.
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(RandomForestClassifier(random_state=42))
selector.fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
X_val_selected = selector.transform(X_val)
X_test_selected = selector.transform(test_data)

In [None]:
# train the model again with selected features:
model_selected = RandomForestClassifier(random_state=42)
model_selected.fit(X_train_selected, y_train)
y_pred_selected = model_selected.predict(X_val_selected)
print("Selected Features Model Accuracy:", accuracy_score(y_val, y_pred_selected))

In [None]:
# Use the trained model with selected features to predict the Overall_Experience for the test dataset: 
test_predictions_selected = model_selected.predict(X_test_selected)

In [None]:
# Save the selected-features predictions in the required format:
submission_selected = pd.DataFrame({"ID": test_data["ID"], "Overall_Experience": test_predictions_selected})
submission_selected.to_csv("submission_selected.csv", index=False)


In [None]:
# Use ensemble methods to combine predictions from multiple models.
from sklearn.ensemble import VotingClassifier
voting_model = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(random_state=42)),
    ('xgb', XGBClassifier(random_state=42))
], voting='soft')
voting_model.fit(X_train, y_train)
y_pred_voting = voting_model.predict(X_val)
print("Voting Classifier Accuracy:", accuracy_score(y_val, y_pred_voting))
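
With `voting='soft'`, the ensemble averages each model's predicted class probabilities and picks the class with the highest mean. The arithmetic can be sketched in plain Python; the probabilities below are made up for a single passenger, and `soft_vote` is a hypothetical helper, not a scikit-learn function.

```python
def soft_vote(probabilities_per_model):
    """Average per-class probabilities across models and return (means, winning class)."""
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    means = [sum(p[c] for p in probabilities_per_model) / n_models
             for c in range(n_classes)]
    winner = max(range(n_classes), key=lambda c: means[c])
    return means, winner

# Made-up probabilities for one passenger: [P(not satisfied), P(satisfied)]
rf_proba = [0.40, 0.60]
xgb_proba = [0.70, 0.30]
means, predicted = soft_vote([rf_proba, xgb_proba])
print(means, predicted)
```

Here the two models disagree, and the more confident one decides the outcome. Hard voting would instead count one vote per model, leaving a tie.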


In [None]:
# Use the voting model to predict the Overall_Experience for the test dataset:
test_predictions_voting = voting_model.predict(test_data)


In [None]:
# Save the voting model's predictions in the required format:
submission = pd.DataFrame({"ID": test_data["ID"], "Overall_Experience": test_predictions_voting})
submission.to_csv("submission_voting.csv", index=False)

#### **Submission and Evaluation**

In [None]:
import shap

# Initialize the SHAP explainer for your model
explainer = shap.TreeExplainer(model)  # Use TreeExplainer for tree-based models like Random Forest

In [None]:
# Compute SHAP values for the validation dataset
shap_values = explainer.shap_values(X_val)

In [None]:
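# For binary classification, plot class 1; older shap versions return a list per class, newer ones a single 3-D array
shap.summary_plot(shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1], X_val)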

In [None]:
# Ensure the submission file matches the required format.
print(submission.head())
submission.info()
print(submission.shape)  # the required file has 35,602 rows and 2 columns
print(submission.isnull().sum())
print("Duplicate IDs:", submission["ID"].duplicated().sum())
submission.to_csv("submission_final.csv", index=False)


In [None]:
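# Explain the prediction for a single instance with the SHAP explainer created above
instance = X_val.iloc[[0]]  # Select a single row, kept as a DataFrame
instance_values = explainer.shap_values(instance)
class1 = instance_values[1][0] if isinstance(instance_values, list) else instance_values[0, :, 1]

# List the features that pushed this prediction most strongly towards or away from class 1
print(pd.Series(class1, index=X_val.columns).sort_values(key=abs, ascending=False).head(10))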

In [None]:
# Submit the file; the accuracy score is evaluated from the test dataset and the submission file.
# The test set has no 'Overall_Experience' column, so accuracy cannot be computed locally on it.
# Use the validation accuracies above to compare the models and choose which predictions to submit.
# The evaluation results can then be used to analyze the model's performance and improve it further,
# for example with more data, better features, or better hyperparameters.