# Exploring Mental Health Data Competition

## Goal 
To use data from a mental health survey to explore factors that may cause individuals to experience depression.

## Dataset Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Depression Survey/Dataset for Analysis dataset. Feature distributions are close to, but not exactly the same, as the original. Feel free to use the original dataset as part of this competition, both to explore differences as well as to see whether incorporating the original in training improves model performance.

### Notes:

A number of data artifacts have been left in the synthetic dataset.
This is not a particularly difficult dataset to model. It may be interesting to focus on different ways to visualize the dataset.
### Files

* train.csv - the training dataset; `Depression` is the binary target (either 0 or 1)
* test.csv - the test dataset; your objective is to predict the target `Depression` for each row
* sample_submission.csv - a sample submission file in the correct format

## Evaluation

Submissions are evaluated using **Accuracy Score**.

### **Contents:** 
 1. [Imports and data loading](#1)
 2. [Exploratory Data Analysis](#2)
 3. [Data Preprocessing](#3)
 4. [Building a Machine Learning Model](#4)
 
 
 <a id="1"></a>
## 1. Imports and data loading

In [None]:
# Import packages for data manipulation
import pandas as pd
import numpy as np
from scipy import stats

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# This lets us see all of the columns, preventing Jupyter from truncating them.
pd.set_option("display.max_columns", None)

# Import packages for data modeling
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,classification_report

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from xgboost import XGBClassifier

# Function that helps plot feature importance
from xgboost import plot_importance

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load dataset into the training dataframe
df_train = pd.read_csv("/kaggle/input/playground-series-s4e11/train.csv",  index_col='id')
df_train.head()

In [None]:
# Load dataset into the test dataframe
df_test= pd.read_csv("/kaggle/input/playground-series-s4e11/test.csv",  index_col='id')
df_test.head()

This dataset presents a typical scenario in machine learning where some columns are specific to sub-groups within the dataset. In this case, students and working professionals share some columns and have others that are unique to each group. To handle this, we split the data into two subgroups:
* The column **Working Professional or Student** indicates the subgroup.

<a id="2"></a>
## 2. Exploratory Data Analysis

 - Understand your variables
 - Clean your dataset (missing data, redundant data, outliers)

### Gather basic information about the data

In [None]:
# Summary information
df_train.info()

In [None]:
df_train.describe()

### Check missing values

In [None]:
# Check for missing values
df_train.isna().sum()

### Check duplicates

In [None]:
# Check for duplicates
df_train.duplicated().sum()

### Check outliers

In [None]:
# Create a boxplot to visualize distribution of `age` and detect any outliers
plt.figure(figsize=(7,2))
plt.title('Distribution of age', fontsize=10)
sns.boxplot(x=df_train['Age'])
plt.show()

There are no outliers in the "age" distribution.

In [None]:
# Determine the number of rows containing outliers
Q1 = df_train["Age"].quantile(0.25)
Q3 = df_train["Age"].quantile(0.75)

age_iqr = Q3 - Q1

age_upper_limit = Q3 + 1.5 * age_iqr
age_lower_limit = Q1 - 1.5 * age_iqr

age_outliers = df_train[(df_train["Age"] > age_upper_limit) | (df_train["Age"] < age_lower_limit)]

print(f"Number of rows containing outliers: {age_outliers.shape[0]}")
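
The IQR rule used above can be wrapped in a small helper so that it is easy to reuse for other numeric columns. This is a minimal sketch: the function name `count_iqr_outliers` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing values."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return int(mask.sum())

# Made-up ages, with one value far above the rest
toy_ages = pd.Series([18, 22, 25, 31, 40, 47, 60, 120])
n_outliers = count_iqr_outliers(toy_ages)
print(f"Number of rows containing outliers: {n_outliers}")
```

In the notebook, the same helper could be called as `count_iqr_outliers(df_train["Age"])`.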

Now, we will examine the variables and create plots to visualize relationships, especially for the non-numerical variables in the data. This gives us insight into how they relate to the depression variable and helps us choose which variables to encode and which ones to drop.

### Distribution of the depression by gender

In [None]:
# Create histogram to compare depression by gender or by student versus worker professional
fig, ax = plt.subplots(1, 2, figsize = (10,4))
sns.histplot(data=df_train, x="Gender", hue="Depression", multiple="dodge",
              shrink=.6, ax = ax[0])
ax[0].set_title("Distribution of depression by gender", fontsize=10)

sns.histplot(data=df_train, x="Working Professional or Student", hue="Depression", multiple="dodge",
              shrink=.6, ax = ax[1])
ax[1].set_title("Distribution of depression by Working Professional or Student", fontsize=10)

The histograms show no visible association between depression and gender. However, there is a clear difference between working professionals and students: only a small percentage of working professionals experience depression, while more than half of the students do.
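
The group difference described above can also be quantified as a depression rate per group instead of being read off bar heights. A minimal sketch on a made-up toy frame (the values are illustrative and are not taken from the competition data):

```python
import pandas as pd

# Toy data: only the column names are borrowed from the dataset
toy = pd.DataFrame({
    "Working Professional or Student": [
        "Student", "Student", "Student",
        "Working Professional", "Working Professional",
        "Working Professional", "Working Professional",
    ],
    "Depression": [1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 target is the share of positive rows in each group
rates = toy.groupby("Working Professional or Student")["Depression"].mean()
print(rates)
```

Applied to `df_train`, the same `groupby(...)["Depression"].mean()` call would give the actual rates for each subgroup.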

### Distribution of the depression by age

In [None]:
# Create histogram to compare depression by age
plt.figure(figsize=(5,4))
sns.histplot(data=df_train, x="Age", hue="Depression",
             hue_order=[0, 1], shrink=.6, bins = 10)
plt.title("Distribution of depression by Age", fontsize=10);

The plot shows that the younger the person is, the more likely it is that the person is suffering from depression.

### Distribution of the depression by sleep duration

In [None]:
# Create histogram to compare depression by sleep duration
plt.figure(figsize=(5,4))
sleep_duration_4 = df_train["Sleep Duration"].value_counts()[df_train["Sleep Duration"].value_counts() > 1000].index

# Step 2: Filter rows where 'Sleep Duration' is in the sleep_duration_4 list
df_sleep_duration = df_train[df_train["Sleep Duration"].isin(sleep_duration_4)]
sns.histplot(data=df_sleep_duration, x="Sleep Duration", hue="Depression",
             hue_order=[0, 1], shrink=.6, bins = 4)
plt.xticks(rotation = 90)
plt.title("Distribution of depression by Sleep Duration", fontsize=10);

The people sleeping less than 5 hours are more likely to have depression.

### Distribution of the depression by Dietary habits

In [None]:
# Create histogram to compare depression by Dietary Habits
plt.figure(figsize=(5,4))
dietary_habits_3 = df_train["Dietary Habits"].value_counts()[df_train["Dietary Habits"].value_counts() > 1000].index

# Step 2: Filter rows where 'Dietary Habits' is in the dietary_habits_3 list
df_dietary_habits = df_train[df_train["Dietary Habits"].isin(dietary_habits_3)]
sns.histplot(data=df_dietary_habits, x="Dietary Habits", hue="Depression",
             hue_order=[0, 1], shrink=.6, bins = 3)
plt.xticks(rotation = 90)
plt.title("Distribution of depression by Dietary Habits", fontsize=10);

An unhealthy diet is associated with a higher likelihood of depression.

### Distribution of the depression by Degree

Split the dataset into 2 groups: 
* Students 
* Professionals

In [None]:
# Use .copy() so that later in-place edits do not act on a view of df_train
students = df_train[df_train['Working Professional or Student'] == 'Student'].copy()
professionals = df_train[df_train['Working Professional or Student'] == 'Working Professional'].copy()

In [None]:
# Create histogram to compare depression by Degree
plt.figure(figsize=(5,4))
degree_students = students["Degree"].value_counts()[students["Degree"].value_counts() > 1000].index

# Step 2: Filter rows where 'Degree' is in the degree list
df_degree = students[students["Degree"].isin(degree_students)]
sns.histplot(data=df_degree, x="Degree", hue="Depression",
             hue_order=[0, 1], shrink=.6, bins = 3)
plt.xticks(rotation = 90)
plt.title("Distribution of depression by Degree for students", fontsize=10);

In [None]:
# Create histogram to compare depression by Degree for professionals
plt.figure(figsize=(5,4))
degree_professionals = professionals["Degree"].value_counts()[professionals["Degree"].value_counts() > 1000].index

# Step 2: Filter rows where 'Degree' is in the degree list
df_degree = professionals[professionals["Degree"].isin(degree_professionals)]
sns.histplot(data=df_degree, x="Degree", hue="Depression",
             hue_order=[0, 1], shrink=.6, bins = 3)
plt.xticks(rotation = 90)
plt.title("Distribution of depression by Degree for professionals", fontsize=10);

It looks like the most depressed people are the ones with the Class12 degree. Among the other categories, depressed people look fairly evenly distributed.

### Distribution of the depression by suicidal thoughts and family history of mental illness

In [None]:
# Create histogram to compare depression by suicidal thoughts and family history of mental illness
fig, ax = plt.subplots(1, 2, figsize = (10,4))
sns.histplot(data=df_train, x="Have you ever had suicidal thoughts ?", hue="Depression", multiple="dodge",
              shrink=.6, ax = ax[0])
ax[0].set_title("Distribution of depression by suicidal thoughts ? ", fontsize=10)

sns.histplot(data=df_train, x="Family History of Mental Illness", hue="Depression", multiple="dodge",
              shrink=.6, ax = ax[1])
ax[1].set_title("Distribution of depression by Family History of Mental Illness", fontsize=10)

Family History of Mental Illness does not appear to be strongly correlated with depression. However, suicidal thoughts show an obvious correlation.

### The heatmap of correlation between the depression and other variables.

Check for strong correlations between variables in the dataset.

In [None]:
# Eliminate columns that contain non-numeric data.
corr_df = students.select_dtypes(include=['float64', 'int64'])

# Plot a correlation heatmap
plt.figure(figsize=(16, 9))
sns.heatmap(corr_df.corr(), cmap='Blues', annot=True)
plt.title("Correlation Heatmap for students", fontsize=14)

In [None]:
# Eliminate columns that contain non-numeric data.
corr_df = professionals.select_dtypes(include=['float64', 'int64'])

# Plot a correlation heatmap
plt.figure(figsize=(16, 9))
sns.heatmap(corr_df.corr(), cmap='Blues', annot=True)
plt.title("Correlation Heatmap for professionals", fontsize=14)

From the correlation heatmaps we can conclude that Academic Pressure, Work Pressure, CGPA, Work/Study Hours, and Financial Stress are positively correlated with the depression variable, while Study or Job Satisfaction and Age are negatively correlated with it.
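
To make a heatmap easier to read, the correlations with the target can be pulled out and sorted. The sketch below uses synthetic data in which the target is constructed to rise with financial stress and fall with job satisfaction; only the column names are borrowed from the dataset, and the numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
stress = rng.integers(1, 6, size=n)        # synthetic 1-5 scale
satisfaction = rng.integers(1, 6, size=n)  # synthetic 1-5 scale

# Synthetic target: more stress and less satisfaction push it towards 1
score = stress - satisfaction + rng.normal(0, 1, size=n)
toy = pd.DataFrame({
    "Financial Stress": stress,
    "Job Satisfaction": satisfaction,
    "Depression": (score > 0).astype(int),
})

# Keep only the target column of the correlation matrix, sorted ascending
target_corr = toy.corr()["Depression"].drop("Depression").sort_values()
print(target_corr)
```

In the notebook, the equivalent call would be `corr_df.corr()["Depression"].drop("Depression").sort_values()` for each subgroup.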
 
<a id="3"></a> 
## 3. Data Preprocessing

Based on the "Working Professional or Student" column we are creating two separate subsets:
* Student-specific columns: **Academic Pressure**, **CGPA**, **Study Satisfaction**
* Working Professional-specific columns: **Profession**, **Work Pressure**, **Job Satisfaction**

In [None]:
# Numeric columns that are specific to each subgroup (imputed and scaled below)
student_features = ['Academic Pressure', 'Study Satisfaction']
professional_features = ['Work Pressure', 'Job Satisfaction']

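First, we will drop some columns that are irrelevant to our model training:
* Name;
* City;
* Profession.

In addition, each subgroup drops CGPA, the subgroup indicator column, and the columns that apply only to the other subgroup.

In [None]:
# Drop Name, City, Profession, CGPA, the subgroup indicator, and the other group's columns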
students.drop(["Name", "City", "Profession", "CGPA",
               "Work Pressure", "Job Satisfaction", 
               'Working Professional or Student'], axis = 1, inplace = True)
professionals.drop(["Name", "City", "Profession", 
                    "Academic Pressure", "Study Satisfaction", 
                    "CGPA", 'Working Professional or Student'], axis = 1, inplace = True)

# Same for test data
df_test.drop(["Name", "City", "Profession"], axis = 1, inplace = True)

# Use .copy() so that the in-place drops below do not act on a view of df_test
students_test = df_test[df_test['Working Professional or Student'] == 'Student'].copy()
professionals_test = df_test[df_test['Working Professional or Student'] == 'Working Professional'].copy()

students_test.drop(["Work Pressure", "Job Satisfaction", 'Working Professional or Student', "CGPA"], axis = 1, inplace = True)
professionals_test.drop(["Academic Pressure", "Study Satisfaction", "CGPA", 'Working Professional or Student'], axis = 1, inplace = True)

### The prediction target.

The predicted variable is “Depression” and it is already a binary variable.

### Handle Missing Values.

Columns like "Academic Pressure" and "CGPA" have many missing values for non-students. Similarly, "Work Pressure" has missing values for non-professionals.


In [None]:
# Create pipelines for each group
student_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

professional_pipeline = Pipeline([
    # fill_value is only used when strategy='constant'
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', StandardScaler())
])

# Apply transformations
students[student_features] = student_pipeline.fit_transform(students[student_features])
professionals[professional_features] = professional_pipeline.fit_transform(professionals[professional_features])

# Apply transformation for test data
students_test[student_features] = student_pipeline.transform(students_test[student_features])
professionals_test[professional_features] = professional_pipeline.transform(professionals_test[professional_features])
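
A note on `SimpleImputer`: the `fill_value` argument only takes effect when `strategy="constant"`. With the default strategy, missing values are replaced by the column mean and `fill_value` is ignored. The sketch below shows the difference on a made-up single-column array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])  # toy column with one missing value

# Default strategy is "mean": fill_value is ignored here
mean_filled = SimpleImputer(fill_value=0).fit_transform(X)

# fill_value is used only with strategy="constant"
const_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

print(mean_filled.ravel())
print(const_filled.ravel())
```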

In [None]:
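# Fill any remaining missing values (the imputers above already handled these
# columns, so this cell acts only as a safeguard)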
students['Academic Pressure'].fillna(students['Academic Pressure'].mean(), inplace=True)
students['Study Satisfaction'].fillna(students['Study Satisfaction'].mean(), inplace=True)

professionals['Work Pressure'].fillna(professionals['Work Pressure'].mean(), inplace=True)
professionals['Job Satisfaction'].fillna(professionals['Job Satisfaction'].mean(), inplace=True)

# Test data: Fill missing values
students_test['Academic Pressure'].fillna(students_test['Academic Pressure'].mean(), inplace=True)
students_test['Study Satisfaction'].fillna(students_test['Study Satisfaction'].mean(), inplace=True)

professionals_test['Work Pressure'].fillna(professionals_test['Work Pressure'].mean(), inplace=True)
professionals_test['Job Satisfaction'].fillna(professionals_test['Job Satisfaction'].mean(), inplace=True)

In [None]:
# Work Pressure and Job Satisfaction were already imputed by the pipeline above;
# these fills are a safeguard and should not change anything
professionals['Work Pressure'] = professionals['Work Pressure'].fillna(0)
professionals['Job Satisfaction'] = professionals['Job Satisfaction'].fillna(0)

# Fix for Dietary Habits (small number of missing values)
most_freq_diet_stud = students['Dietary Habits'].mode()[0]
students['Dietary Habits'].fillna(most_freq_diet_stud, inplace=True)
most_freq_diet_prof = professionals['Dietary Habits'].mode()[0]
professionals['Dietary Habits'].fillna(most_freq_diet_prof, inplace=True)

# Fix for Financial Stress (small number of missing values)
median_fin_stress_stud = students['Financial Stress'].median()
students['Financial Stress'].fillna(median_fin_stress_stud, inplace=True)
median_fin_stress_prof = professionals['Financial Stress'].median()
professionals['Financial Stress'].fillna(median_fin_stress_prof, inplace=True)

# Fix for Degree(small number of missing values)
most_freq_degree = df_train['Degree'].mode()[0]
professionals["Degree"].fillna(most_freq_degree, inplace=True)

In [None]:
print(students.isna().sum())
print(professionals.isna().sum())

In [None]:
# Same for test data

# Fix for Dietary Habits (small number of missing values)
most_freq_diet_stud_test = students_test['Dietary Habits'].mode()[0]
students_test['Dietary Habits'] = students_test['Dietary Habits'].fillna(most_freq_diet_stud_test)
most_freq_diet_prof_test = professionals_test['Dietary Habits'].mode()[0]
professionals_test['Dietary Habits'] = professionals_test['Dietary Habits'].fillna(most_freq_diet_prof_test)

# Fix for Financial Stress (small number of missing values)
median_fin_stress_stud_test = students_test['Financial Stress'].median()
students_test['Financial Stress'] = students_test['Financial Stress'].fillna(median_fin_stress_stud_test)
median_fin_stress_prof_test = professionals_test['Financial Stress'].median()
professionals_test['Financial Stress'] = professionals_test['Financial Stress'].fillna(median_fin_stress_prof_test)

# Fix for Degree (small number of missing values)
most_freq_degree_test = df_test['Degree'].mode()[0]
professionals_test["Degree"] = professionals_test["Degree"].fillna(most_freq_degree_test)

# Fill whatever is still missing with 0. This runs last so that it does not
# pre-empt the column-specific fills above.
students_test.fillna(0, inplace=True)
professionals_test.fillna(0, inplace=True)

In [None]:
print(students_test.isna().sum())
print(professionals_test.isna().sum())

###  Cleaning the variables for encoding.

We will use the following non-numerical variables that need to be encoded in order to be used in the model training:
* Sleep Duration;
* Dietary Habits.

In [None]:
less_than_5 = ["Less than 5 hours", "3-4 hours", "4-5 hours", "2-3 hours", "1-6 hours", "No", "45", "3-6 hours", "1-3 hours", "than 5 hours" ]
hours_5_6 = ["5-6 hours", "4-6 hours" ]
hours_6_7 = ["6-7 hours", "6-8 hours", "moderate"]
hours_7_8 = ["7-8 hours"]
more_than_8 = ["More than 8 hours", "9-11 hours", "10-11 hours", "8-9 hours", "49 hours" ]

all_my_lists = less_than_5 + hours_5_6 + hours_6_7 + hours_7_8 + more_than_8

students.loc[students['Sleep Duration'].isin(less_than_5), 'Sleep Duration'] = 'Less than 5 hours'
students.loc[students['Sleep Duration'].isin(hours_5_6), 'Sleep Duration'] = '5-6 hours'
students.loc[students['Sleep Duration'].isin(hours_6_7), 'Sleep Duration'] = '6-7 hours'
students.loc[students['Sleep Duration'].isin(more_than_8), 'Sleep Duration'] = 'More than 8 hours'

professionals.loc[professionals['Sleep Duration'].isin(less_than_5), 'Sleep Duration'] = 'Less than 5 hours'
professionals.loc[professionals['Sleep Duration'].isin(hours_5_6), 'Sleep Duration'] = '5-6 hours'
professionals.loc[professionals['Sleep Duration'].isin(hours_6_7), 'Sleep Duration'] = '6-7 hours'
professionals.loc[professionals['Sleep Duration'].isin(more_than_8), 'Sleep Duration'] = 'More than 8 hours'

# Fill the rest with most frequent values
students.loc[~students['Sleep Duration'].isin(all_my_lists), 'Sleep Duration'] = students['Sleep Duration'].mode()[0]
professionals.loc[~professionals['Sleep Duration'].isin(all_my_lists), 'Sleep Duration'] = professionals['Sleep Duration'].mode()[0]

In [None]:
# Same for test data
students_test.loc[students_test['Sleep Duration'].isin(less_than_5), 'Sleep Duration'] = 'Less than 5 hours'
students_test.loc[students_test['Sleep Duration'].isin(hours_5_6), 'Sleep Duration'] = '5-6 hours'
students_test.loc[students_test['Sleep Duration'].isin(hours_6_7), 'Sleep Duration'] = '6-7 hours'
students_test.loc[students_test['Sleep Duration'].isin(more_than_8), 'Sleep Duration'] = 'More than 8 hours'

professionals_test.loc[professionals_test['Sleep Duration'].isin(less_than_5), 'Sleep Duration'] = 'Less than 5 hours'
professionals_test.loc[professionals_test['Sleep Duration'].isin(hours_5_6), 'Sleep Duration'] = '5-6 hours'
professionals_test.loc[professionals_test['Sleep Duration'].isin(hours_6_7), 'Sleep Duration'] = '6-7 hours'
professionals_test.loc[professionals_test['Sleep Duration'].isin(more_than_8), 'Sleep Duration'] = 'More than 8 hours'

# Fill the rest with most frequent values
professionals_test.loc[~professionals_test['Sleep Duration'].isin(all_my_lists), 'Sleep Duration'] = professionals_test['Sleep Duration'].mode()[0]
students_test.loc[~students_test['Sleep Duration'].isin(all_my_lists), 'Sleep Duration'] = students_test['Sleep Duration'].mode()[0]
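
The repeated `.loc[...]` assignments above could be condensed into a single lookup table and one helper applied to all four frames. This is a sketch under assumptions: `SLEEP_MAP` is a shortened, hypothetical version of the lists used in the notebook, and `normalize_sleep` is an illustrative name.

```python
import pandas as pd

# Hypothetical, shortened mapping from raw labels to canonical bins
SLEEP_MAP = {
    "Less than 5 hours": "Less than 5 hours",
    "3-4 hours": "Less than 5 hours",
    "5-6 hours": "5-6 hours",
    "4-6 hours": "5-6 hours",
    "7-8 hours": "7-8 hours",
    "More than 8 hours": "More than 8 hours",
    "9-11 hours": "More than 8 hours",
}

def normalize_sleep(col: pd.Series, fallback: str) -> pd.Series:
    """Map raw labels to canonical bins; unmapped or missing labels get `fallback`."""
    return col.map(SLEEP_MAP).fillna(fallback)

# Toy input including an unmapped label and a missing value
raw = pd.Series(["3-4 hours", "4-6 hours", "Pune", None, "9-11 hours"])
cleaned = normalize_sleep(raw, fallback="7-8 hours")
print(cleaned.tolist())
```

Keeping the mapping in one place means the training and test frames are guaranteed to end up with the same set of categories.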

In [None]:
# Fill the rest with most frequent values
diet_list = ["Moderate", "Healthy", "Unhealthy"]
students.loc[~students['Dietary Habits'].isin(diet_list), 'Dietary Habits'] = students['Dietary Habits'].mode()[0]
professionals.loc[~professionals['Dietary Habits'].isin(diet_list), 'Dietary Habits'] = professionals['Dietary Habits'].mode()[0]

# Same for test data
students_test.loc[~students_test['Dietary Habits'].isin(diet_list), 'Dietary Habits'] = students_test['Dietary Habits'].mode()[0]
professionals_test.loc[~professionals_test['Dietary Habits'].isin(diet_list), 'Dietary Habits'] = professionals_test['Dietary Habits'].mode()[0]

Keep the most frequent "Degree" values for students and for professionals, and group all other values under "Other".

In [None]:
# Replace the less frequent values with "Other"
students.loc[~students['Degree'].isin(degree_students), 'Degree'] = "Other"
professionals.loc[~professionals['Degree'].isin(degree_professionals), 'Degree'] = "Other"

# Same for test data: reuse the categories kept for the training data so that
# train and test share the same set of Degree values
students_test.loc[~students_test['Degree'].isin(degree_students), 'Degree'] = "Other"
professionals_test.loc[~professionals_test['Degree'].isin(degree_professionals), 'Degree'] = "Other"

<a id="4"></a>
## 4. Building a Machine Learning Model

 - Fit the model that predicts the outcome variable using two or more independent variables
 - Check model assumptions
 - Evaluate the model



### Types of models most appropriate for this task.
For this project, I choose XGBoost models to predict depression. This model is effective in handling complex relationships and non-linear patterns.
 
**XGBoost**:
 - XGBoost is known for strong accuracy and fast training on tabular data, especially with large datasets.
 - It can handle imbalanced data, which matters here because only a small share of working professionals report depression.
 - XGBoost optimizes computation and performs well with missing data or noisy inputs.
 - It can be easily fine-tuned, which gives more flexibility in improving model performance.
 
### Splitting the data

In [None]:
# Separate the dataset into labels (y) and features (X).
# Define the X (predictor) variables X1 for students and X2 for professionals
X1 = students.drop("Depression",axis=1)
X2 = professionals.drop("Depression",axis=1)
# Define the y (target) variable
y1 = students["Depression"]
y2 = professionals["Depression"]

# Split the data into training and validation sets
X1_train, X1_valid, y1_train, y1_valid = train_test_split(X1, y1, test_size=0.25, 
                                                    stratify=y1, random_state=24)
X2_train, X2_valid, y2_train, y2_valid = train_test_split(X2, y2, test_size=0.25, 
                                                    stratify=y2, random_state=24)

X1_test = students_test.copy()
X2_test = professionals_test.copy()
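
The `stratify` argument used above keeps the class ratio the same in the training and validation parts, which matters when one class is much rarer than the other. A minimal sketch on made-up labels with a 20% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 negatives and 20 positives
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=24
)

# Both parts keep the 20% positive rate of the full data
print(y_tr.mean(), y_va.mean())
```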

### Encoding

The non-numeric variables need to be encoded. Here they are: 
* *Gender*, 
* *Sleep Duration*, 
* *Degree*, 
* *Dietary Habits*, 
* *Have you ever had suicidal thoughts ?*, 
* *Family History of Mental Illness*.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encode_columns =      ["Gender", 
                       "Sleep Duration",
                       "Degree",
                       "Dietary Habits", 
                       "Have you ever had suicidal thoughts ?", 
                       "Family History of Mental Illness"]

# Use a separate encoder per subgroup so that each validation and test set is
# transformed with the encoder fitted on its own training data.
# Categories not seen during fitting are encoded as -1 instead of raising an error.
encoder_students = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X1_train[encode_columns] = encoder_students.fit_transform(X1_train[encode_columns])
X1_valid[encode_columns] = encoder_students.transform(X1_valid[encode_columns])

encoder_professionals = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X2_train[encode_columns] = encoder_professionals.fit_transform(X2_train[encode_columns])
X2_valid[encode_columns] = encoder_professionals.transform(X2_valid[encode_columns])

# Apply to the test data as well
X1_test[encode_columns] = encoder_students.transform(X1_test[encode_columns])
X2_test[encode_columns] = encoder_professionals.transform(X2_test[encode_columns])
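
When an encoder is fitted on training data and later applied to validation or test data, categories that were not seen during fitting need a defined behaviour. The sketch below shows one option that `OrdinalEncoder` offers, using made-up toy frames; the category values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({
    "Degree": ["B.Com", "Class 12", "Other"],
    "Gender": ["Male", "Female", "Male"],
})
test = pd.DataFrame({
    "Degree": ["Class 12", "PhD"],  # "PhD" never appears in train
    "Gender": ["Female", "Male"],
})

# Unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)
encoded = enc.transform(test)
print(encoded)
```

Categories are ordered alphabetically within each column, so the integer codes carry no meaning beyond identifying the category.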

### Tune the Model

In [None]:
# Define xgb to be your XGBClassifier.
xgb = XGBClassifier(objective="binary:logistic", random_state = 24)

### Define the parameters for hyperparameter tuning

To identify suitable parameters for the XGBoost model, first define the parameters for hyperparameter tuning. Specifically, define a range of values for max_depth, min_child_weight, learning_rate, n_estimators, subsample, and colsample_bytree.
Consider a more limited range for each parameter to allow for timely iteration and model training.

In [None]:
# Define parameters for tuning as `cv_prof_params` and `cv_stud_params`.
cv_prof_params = { "max_depth": [6],
             "min_child_weight": [15],
             "learning_rate": [0.15],
             "n_estimators": [60],
             "subsample": [0.5],
             "colsample_bytree": [0.6]}

cv_stud_params = { "max_depth": [4, 6, 8],
             "min_child_weight": [25],
             "learning_rate": [0.05, 0.15],
             'n_estimators': [60, 100, 200],
             'subsample': [0.6, 0.8],
             'colsample_bytree': [0.4, 0.6]}

# Construct the GridSearch.
# Model for students
xgb_cv_student = GridSearchCV(xgb, cv_stud_params, scoring = "accuracy", cv = 5, n_jobs = 5, refit = "accuracy")

# Model for professionals
xgb_cv_professional = GridSearchCV(xgb, cv_prof_params, scoring = "accuracy", cv = 5, n_jobs = 5, refit = "accuracy")
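
The cost of a grid search is the number of parameter combinations multiplied by the number of cross-validation folds. A short sketch using scikit-learn's `ParameterGrid` on a copy of the students grid defined above shows how quickly this grows:

```python
from sklearn.model_selection import ParameterGrid

# Copy of the students grid from the cell above
grid = {
    "max_depth": [4, 6, 8],
    "min_child_weight": [25],
    "learning_rate": [0.05, 0.15],
    "n_estimators": [60, 100, 200],
    "subsample": [0.6, 0.8],
    "colsample_bytree": [0.4, 0.6],
}

n_candidates = len(ParameterGrid(grid))  # product of the list lengths
n_fits = n_candidates * 5                # cv = 5 folds, excluding the final refit
print(n_candidates, n_fits)
```

Adding a single extra value to any list multiplies the total, which is why a limited range per parameter keeps training time manageable.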

In [None]:
%%time
# fit the GridSearch model to training data
xgb_students = xgb_cv_student.fit(X1_train, y1_train)
xgb_students

In [None]:
%%time
# fit the GridSearch model to training data
xgb_professionals = xgb_cv_professional.fit(X2_train, y2_train)
xgb_professionals

### Results and evaluation

In [None]:
# Apply the students model to the validation data. Call this output "y1_pred".
y1_pred = xgb_students.predict(X1_valid)

In [None]:
print(xgb_students.best_params_)
print(xgb_students.best_score_)

In [None]:
# Compute the validation accuracy of the students model
accuracy1 = accuracy_score(y1_valid, y1_pred)

print(accuracy1)

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X1_train , y1_train)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_smote, y_smote)
rf_score = rf_model.score(X1_valid , y1_valid)
print(f"Random Forest Accuracy: {rf_score}")

In [None]:
# Apply the professionals model to predict on the validation data.
y2_pred = xgb_professionals.predict(X2_valid)

print(xgb_professionals.best_params_)
print(xgb_professionals.best_score_)

### Evaluate XGBoost model’s performance

In [None]:
# Compute the validation accuracy of the professionals model
accuracy2 = accuracy_score(y2_valid, y2_pred)

print(accuracy2)

- The **XGBoost models** reach high accuracy on the validation sets.

This result suggests that XGBoost is a suitable model for predicting depression from this survey data.

#### **Gain clarity with the confusion matrix**

In [None]:
# Construct and display the confusion matrix for students.
cm = confusion_matrix(y1_valid, y1_pred, labels=xgb_students.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm,
                             display_labels = xgb_students.classes_)
# Plot the visual in-line.
disp.plot()

In [None]:
# Construct and display the confusion matrix for professionals.
cm = confusion_matrix(y2_valid, y2_pred, labels=xgb_professionals.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm,
                             display_labels = xgb_professionals.classes_)
# Plot the visual in-line.
disp.plot()

The confusion matrices indicate that the models perform very well, with a low number of false predictions and, in particular, a very low **false positive** rate.

Comparing the validation accuracy with the cross-validated `best_score_` printed earlier is a quick check for overfitting, and the remaining false negatives and false positives indicate where the models might be improved.

### Evaluate Feature Importance

In [None]:
import shap

best_model = xgb_students.best_estimator_
explainer = shap.Explainer(best_model, X1_valid)
shap_values = explainer(X1_valid)

# Visualize
shap.summary_plot(shap_values, X1_valid)

In [None]:
best_model2 = xgb_professionals.best_estimator_
explainer2 = shap.Explainer(best_model2, X2_valid)
shap_values2 = explainer2(X2_valid)

# Visualize
shap.summary_plot(shap_values2, X2_valid)

### Visualize most important features

In [None]:
# Plot the relative feature importance of the predictor variables in the students model.
plot_importance(xgb_students.best_estimator_)

In [None]:
# Plot the relative feature importance of the predictor variables in the professionals model.
plot_importance(xgb_professionals.best_estimator_)

### Adjust the Decision Threshold

In [None]:
# Predict probabilities
y_valid_probs = xgb_students.predict_proba(X1_valid)[:, 1]

# Adjust threshold
threshold = 0.45  # Lower threshold to reduce FNs
y_valid_pred = (y_valid_probs >= threshold).astype(int)

# Reevaluate
print(confusion_matrix(y1_valid, y_valid_pred))
print(classification_report(y1_valid, y_valid_pred))
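
Rather than fixing the threshold by hand, a small sweep over candidate thresholds shows how accuracy responds. The sketch below uses made-up labels and probabilities as stand-ins for `y1_valid` and `y_valid_probs`; the candidate range is an arbitrary assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy stand-ins for the validation labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
probs = np.array([0.10, 0.42, 0.48, 0.46, 0.90, 0.72, 0.20, 0.32])

# Evaluate accuracy at each candidate threshold
thresholds = np.linspace(0.30, 0.70, 9)
scores = {
    round(float(t), 2): accuracy_score(y_true, (probs >= t).astype(int))
    for t in thresholds
}

best_t = max(scores, key=scores.get)
print(scores)
print(f"Best threshold on this toy data: {best_t}")
```

A threshold chosen this way should be selected on validation data only, since tuning it on the same rows used for the final report would overstate the result.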

### Combine Predictions
Combine the predictions for students and professionals back into a single array.

In [None]:
# Initialize a predictions Series filled with zeros, using the same index as df_test
test_preds = pd.Series(0, index=df_test.index)

# Predict for students if there are any
if not X1_test.empty:
    student_test_probs  = xgb_students.predict_proba(X1_test)[:, 1]
    # Adjust threshold
    threshold = 0.45  # Lower threshold to reduce FNs
    student_test_predictions = (student_test_probs >= threshold).astype(int)
    # Assign predictions for students
    test_preds.loc[X1_test.index] = student_test_predictions

# Predict for professionals if there are any
if not X2_test.empty:
    professional_test_predictions = xgb_professionals.predict(X2_test)
    # Assign predictions for professionals
    test_preds.loc[X2_test.index] = professional_test_predictions

# Convert predictions to integers
test_preds = test_preds.astype(int)

### Evaluate Combined Predictions
The test set has no Depression labels, so the combined predictions cannot be scored directly. As a proxy, print the cross-validated accuracy of each model.

In [None]:
print(xgb_students.best_score_)
print(xgb_professionals.best_score_)
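
A single combined figure can be estimated on the validation data by concatenating the labels and predictions of both subgroups before scoring. The arrays below are made-up stand-ins for `y1_valid`, `y1_pred`, `y2_valid` and `y2_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy stand-ins for the students' validation labels and predictions
y1_valid_toy = np.array([1, 0, 1, 1])
y1_pred_toy = np.array([1, 0, 0, 1])

# Toy stand-ins for the professionals' validation labels and predictions
y2_valid_toy = np.array([0, 0, 0, 1, 0, 0])
y2_pred_toy = np.array([0, 0, 0, 0, 0, 0])

# Concatenating weights each subgroup by its number of rows
combined_true = np.concatenate([y1_valid_toy, y2_valid_toy])
combined_pred = np.concatenate([y1_pred_toy, y2_pred_toy])
combined_acc = accuracy_score(combined_true, combined_pred)
print(f"Combined validation accuracy: {combined_acc:.2f}")
```

Because the professionals subgroup is larger, its accuracy dominates the combined figure, which is closer to what the leaderboard metric measures than either per-group score alone.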

## Prepare the test data and make the submittable predictions

In [None]:
# Save test predictions to file
output = pd.DataFrame({"id": df_test.index,
                       "Depression": test_preds})
output.to_csv("submission.csv", index = False)

In [None]:
output.head()