# Modeling Depression using Classification Algorithms

This machine learning analysis aims to shed light on the following questions:
- which **key factors** affect depression?
- which students need the **most help**?

Clear answers to these two questions would allow us to **design policies** that better prevent depression and address its associated risk factors.

In [None]:
#importing data analysis and visualisation packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 8) 
plt.rcParams['font.size'] = 14

In [None]:
#importing machine learning packages

from sklearn.neighbors import KNeighborsClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV

In [None]:
df = pd.read_csv("depression_after_eda.csv")

In [None]:
#to check which column in the dataset contains string values

df.applymap(type).eq(str).any()

In [None]:
#the features Degree and City contain many unique values, which would make the model more complex while adding little explanatory power.

print(df["Degree"].nunique())
print(df["City"].nunique())

In [None]:
#for now, I exclude City and Degree from the X features, which leaves 11 selected features.

X = df[["Gender","Age","Academic Pressure","CGPA","Study Satisfaction", "Sleep Duration","Dietary Habits","Suicidal Thoughts", "Work_study_hours","Financial Stress", "Family History"]]
y = df["Depression"]

In [None]:
#checking imbalances: the two classes depression=0 and depression=1 do not seem imbalanced.

print(y.value_counts(normalize=True))

**One-hot encoding**

Each category of a categorical feature is converted into a binary column (1/0) indicating whether the observation belongs to that category, allowing the model to understand the categorical data numerically.
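
As a minimal sketch of what this encoding produces, the toy example below one-hot encodes a single categorical column. The category labels and ages are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: the labels below are illustrative assumptions, not values from the dataset.
toy = pd.DataFrame({
    "Dietary Habits": ["Healthy", "Moderate", "Unhealthy", "Healthy"],
    "Age": [21, 25, 19, 30],
})

# drop_first=True drops the first category (here "Healthy", first alphabetically),
# which becomes the baseline that the remaining binary columns are compared against.
encoded = pd.get_dummies(toy, columns=["Dietary Habits"], drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True`, an observation in the baseline category is represented by zeros in all of the remaining dummy columns.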

In [None]:
#one-hot encoding on the categorical variables: Gender, Sleep Duration, Dietary Habits, Suicidal Thoughts, Family History

X = pd.get_dummies(columns=["Gender", "Sleep Duration", "Dietary Habits", "Suicidal Thoughts", "Family History"], drop_first=True, data=X)

In [None]:
# 1st and most used hold-out technique: train/test split

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=52, stratify=y)

In [None]:
print(X.shape)
print(X.columns)

## 1) K-Nearest Neighbours

KNN (K-Nearest Neighbors) is a simple, non-parametric machine learning algorithm used for classification. It predicts the label of a new observation based on the majority class of its k nearest neighbors in the training data.

I define the following pipeline with StandardScaler and KNeighborsClassifier. StandardScaler standardizes our data by subtracting the mean from each feature and dividing by its standard deviation. Scaling features is important because KNN uses Euclidean distance to find the nearest neighbours, so a feature measured on a larger scale would otherwise dominate the distance.
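
To illustrate why scaling matters for distance-based models, here is a small sketch with two hypothetical features on very different scales (one on a 0-100 scale, one on a 0-1 scale). The numbers are invented; the point is only that the nearest neighbour of the first row changes once both features are standardised.

```python
import numpy as np

# Hypothetical data: column 0 is on a 0-100 scale, column 1 on a 0-1 scale.
X_toy = np.array([
    [50.0, 0.1],
    [52.0, 0.9],
    [60.0, 0.1],
    [10.0, 0.5],
    [90.0, 0.5],
])

def standardise(X):
    """Subtract each column's mean and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def nearest_to_first(X):
    """Return the index of the row closest to row 0 in Euclidean distance."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_neighbour = nearest_to_first(X_toy)                  # driven by the large-scale column
scaled_neighbour = nearest_to_first(standardise(X_toy))  # both columns weigh in equally
print(raw_neighbour, scaled_neighbour)
```

On the raw data, the large-scale column decides which row is closest; after standardisation, the difference in the small-scale column counts just as much.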

In [None]:
# I create a pipeline that scales the data and then fits the classifier.
# When pipeline.fit(X_train, y_train) is called, the following happens under the hood:
# 1. StandardScaler().fit(X_train): learns the mean and std from the training data
# 2. StandardScaler().transform(X_train): scales X_train using those parameters
# 3. KNeighborsClassifier().fit(...): trains on the scaled X_train

pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

In [None]:
#tuning the model to understand when the testing error is minimised

k_range=list(range(1,101))
training_error=[]
testing_error=[]

for k in k_range:
    pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(k))])
    pipeline.fit(X_train, y_train)
    
    y_pred_class = pipeline.predict(X_test)
    testing_accuracy=metrics.accuracy_score(y_test, y_pred_class)
    testing_error.append(1-testing_accuracy)
    
    y_pred_class = pipeline.predict(X_train)
    training_accuracy=metrics.accuracy_score(y_train, y_pred_class)
    training_error.append(1-training_accuracy)

In [None]:
# I compare the testing and training error to see which value of k minimises the testing errors.

knn_error = pd.DataFrame(list(zip(k_range, training_error, testing_error)), columns=["k","training_error","testing_error"])
knn_error.set_index("k").sort_values(by="testing_error", ascending=True)

The number of nearest neighbours that minimises the testing error is k=88.
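
Choosing k directly on the test set can make the reported test accuracy slightly optimistic. As an alternative sketch, k can be selected by cross-validation on the training data only, keeping the test set for a final check. The example below uses synthetic data from `make_classification` and a smaller k range, both of which are assumptions for illustration rather than the setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the depression dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# "knn__n_neighbors" addresses the n_neighbors parameter of the "knn" pipeline step.
grid = GridSearchCV(pipe, param_grid={"knn__n_neighbors": list(range(1, 31, 2))},
                    cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)  # k is chosen using the training folds only

best_k = grid.best_params_["knn__n_neighbors"]
test_accuracy = grid.score(X_te, y_te)  # the test set is touched once, at the end
print(best_k, round(test_accuracy, 3))
```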

In [None]:
# plotting both errors against k (rows kept in ascending order of k so the lines run left to right)
knn_error.set_index("k").plot();
plt.savefig("knn_errors.png", dpi=300, bbox_inches="tight")
plt.show()

In [None]:
#Computing the model accuracy with k=88

pipeline_knn = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(88))])
pipeline_knn.fit(X_train, y_train)
y_pred_knn=pipeline_knn.predict(X_test)
print(f'The accuracy of the KNN model is {metrics.accuracy_score(y_test, y_pred_knn):.4f}')

In [None]:
#accuracy of the null model (most frequent class)
#alternatively: y_test.value_counts(normalize=True)

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.predict(X_test)
print(f'The accuracy of the null model is: {dummy_clf.score(X_test, y_test):.3f}')

Because the accuracy of the simple KNN is higher than the accuracy of the null model, our model performs better than this baseline. Our model correctly predicts about 84% of observations.

In [None]:
# Confusion matrix to investigate TP, TN, FP, FN

class_names = ['depressed_no', 'depressed_yes']
ConfusionMatrixDisplay.from_estimator(pipeline_knn, X_test, y_test,
                                 display_labels=class_names, cmap=plt.cm.Blues);
plt.savefig("Confusion matrix from KNN", dpi=300, bbox_inches="tight")
plt.show()

**Key metrics for classification**
- **accuracy**: fraction of correct predictions overall, (TP+TN)/(TP+TN+FP+FN);
- **precision**: fraction of positive predictions that are correct, TP/(TP+FP);
- **recall (or sensitivity)**: fraction of actual positives that are correctly predicted, TP/(TP+FN);
- **specificity**: fraction of actual negatives that are correctly predicted, TN/(TN+FP).
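
As a quick worked example of these definitions, the sketch below computes the four metrics from a set of hypothetical confusion-matrix counts. The counts are invented for illustration and are not the results of the KNN model.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 80, 20, 10, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions over all predictions
precision = tp / (tp + fp)                   # correct positives over predicted positives
recall = tp / (tp + fn)                      # correct positives over actual positives
specificity = tn / (tn + fp)                 # correct negatives over actual negatives

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, specificity={specificity:.3f}")
```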

In [None]:
#classification report to investigate key metrics

print(metrics.classification_report(y_test, y_pred_knn))

**ROC-AUC curve**
- ROC: it's a curve that shows how well your model distinguishes between positive and negative classes at every possible threshold, by plotting the true positive rate (recall) against the false positive rate (FPR). When a model gives probabilities (like 0.8 or 0.2), you can choose any cutoff (say 0.5) to decide what's "positive". Each point on the curve corresponds to a different decision threshold. Low threshold → almost everything predicted as positive (high recall, high FPR). High threshold → almost nothing predicted as positive (low recall, low FPR).
- AUC condenses the curve into a single number between 0 and 1, representing how well the model separates the two classes overall.

In [None]:
RocCurveDisplay.from_estimator(pipeline_knn, X_test, y_test);

The model is very good at ranking depressed (positive) cases above not depressed (negative) ones: in 92% of all random positive-negative pairs, the model gives the positive sample a higher score than the negative one.
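
This pairwise reading of the AUC can be checked directly. The sketch below, on a handful of made-up labels and scores, counts the share of positive-negative pairs in which the positive gets the higher score and compares it with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half a win.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so both numbers equal 8/9.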

In [None]:
# copy first to avoid a possible SettingWithCopyWarning, then attach predictions and true labels
X_test = X_test.copy()
X_test["y_pred_knn"] = y_pred_knn
X_test["y"] = y_test
X_test["y_pred=y"] = X_test["y_pred_knn"] == X_test["y"]

In [None]:
# model's accuracy corresponds to checking when y=y_pred

X_test["y_pred=y"].value_counts(normalize=True)

In [None]:
#investigating false negatives

false_negatives_knn=X_test[(X_test["y_pred_knn"]==0)&(X_test["y"]==1)]
display(false_negatives_knn.describe())
display(false_negatives_knn.head())

In [None]:
#investigating true positives

true_positives = X_test[(X_test["y_pred_knn"]==1)&(X_test["y"]==1)]
display(true_positives.describe())
display(true_positives.head())

By comparing TP and FN, we can see that false negatives, relative to true positives, have a higher average age, lower average academic pressure, higher average study satisfaction, lower work study hours and lower financial stress. Because of this, the model incorrectly classified 334 observations whose features are not normally associated with depression.

In [None]:
#investigating false positives

false_positives_knn = X_test[(X_test["y_pred_knn"]==1)&(X_test["y"]==0)]
display(false_positives_knn.describe())
display(false_positives_knn.head())

In [None]:
#investigating true negatives

true_negatives_knn = X_test[(X_test["y_pred_knn"]==0)&(X_test["y"]==0)]
display(true_negatives_knn.describe())
display(true_negatives_knn.head())

By comparing FP and TN, we can see that FP, relative to TN, have a lower average age, higher average academic pressure, lower average study satisfaction, higher average work study hours and higher financial stress. Because of this, the model incorrectly classified 786 observations that are not depressed but have traits similar to those of depressed students.

In [None]:
# Visually comparing TP and FN

columns = ["Age", "Academic Pressure", "CGPA", "Study Satisfaction", "Work_study_hours", "Financial Stress"]
for column in columns:
    plt.figure(figsize=(8, 4))
    plt.hist(false_negatives_knn[column], bins=20, alpha=0.5, label='False Negatives', edgecolor="red", linewidth=2, histtype='step', fill=False, density=True)
    plt.hist(true_positives[column], bins=20, alpha=0.5, label='True Positives', edgecolor="green", linewidth=2, histtype='step', fill=False,  density=True)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Density')
    plt.legend()
    plt.tight_layout()
    plt.show()

The main differences between TP and FN are driven by 1) age, 2) academic pressure, 3) study satisfaction, 4) financial stress:
- TP are younger, whereas FN are in general older;
- TP are concentrated at higher levels of academic pressure (more people feeling higher academic pressure);
- TP are concentrated at lower levels of study satisfaction (fewer and fewer people feeling higher study satisfaction);
- TP are concentrated at higher levels of financial stress compared to FN.

In [None]:
# Visually comparing TN and FP

columns = ["Age", "Academic Pressure", "CGPA", "Study Satisfaction", "Work_study_hours", "Financial Stress"]
for column in columns:
    plt.figure(figsize=(8, 4))
    plt.hist(true_negatives_knn[column], bins=20, alpha=0.5, label='True Negative', edgecolor="blue", linewidth=2, histtype='step', fill=False, density=True)
    plt.hist(false_positives_knn[column], bins=20, alpha=0.5, label='False Positives', edgecolor="orange", linewidth=2, histtype='step', fill=False,  density=True)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Density')
    plt.legend()
    plt.tight_layout()
    plt.show()

The main differences between TN and FP are driven by 1) age, 2) academic pressure, 3) study satisfaction, 4) financial stress:
- TN are usually older;
- TN are concentrated at lower levels of academic pressure (more people feeling less academic pressure);
- TN are concentrated at higher levels of study satisfaction (more people feel satisfied by their studies);
- TN are concentrated at lower levels of financial stress (non-depressed people tend not to be financially stressed).

### Improving the KNN model

#### 1) Applying cross-validation (2nd hold-out technique)

In [None]:
#I average the five accuracy values I get for each fold

kf = KFold(n_splits=5, shuffle=True, random_state=2023)
scores=cross_val_score(pipeline_knn, X, y, cv=kf, scoring='accuracy')
print(f'The accuracy of the model using cross-validation is: {np.mean(scores):.4f}')

I get a slightly lower accuracy than with the train/test split, so this does not improve on the earlier estimate; cross-validation changes how the model is evaluated, not the model itself.

#### 2) Adding Degree and City to the features

In [None]:
X_full = df[["Gender","City","Degree","Age","Academic Pressure","CGPA","Study Satisfaction", "Sleep Duration","Dietary Habits","Suicidal Thoughts", "Work_study_hours","Financial Stress", "Family History"]]
X_full = pd.get_dummies(columns=["Gender", "City","Degree","Sleep Duration", "Dietary Habits", "Suicidal Thoughts", "Family History"], drop_first=True, data=X_full)
X_full_train, X_full_test, y_train, y_test = train_test_split(X_full,y, random_state=52, stratify=y)

In [None]:
k_range=list(range(1,101))
training_error=[]
testing_error=[]

for k in k_range:
    pipeline_full = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(k))])
    pipeline_full.fit(X_full_train, y_train)
    
    y_pred_knn_full=pipeline_full.predict(X_full_test)
    testing_accuracy=metrics.accuracy_score(y_test, y_pred_knn_full)
    testing_error.append(1-testing_accuracy)
    
    y_pred_knn_full=pipeline_full.predict(X_full_train)
    training_accuracy=metrics.accuracy_score(y_train, y_pred_knn_full)
    training_error.append(1-training_accuracy)

In [None]:
pipeline_knn_full = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(94))])
pipeline_knn_full.fit(X_full_train, y_train)
y_pred_knn_full=pipeline_knn_full.predict(X_full_test)
print(f' The accuracy of the KNN model is: {metrics.accuracy_score(y_test, y_pred_knn_full):.4f}')


Adding cities and degrees reduces our accuracy (from 0.84 to 0.67).

#### 3) Checking the model's best variable combination using SequentialFeatureSelector

In [None]:
pipeline = Pipeline([('scaler', StandardScaler()),('knn', KNeighborsClassifier(n_neighbors=88))])
sfs = SequentialFeatureSelector(pipeline, n_features_to_select='auto', direction='forward', scoring='accuracy', cv=5, n_jobs=-1)
sfs.fit(X, y)

In [None]:
# To see which features were selected

selected_features = X.columns[sfs.get_support()]
print("Best feature combination:", list(selected_features))

The best variable combination: **age, academic pressure, study satisfaction, dietary habits, suicidal thoughts, work study hours and financial stress**.

In [None]:
X_best = df[["Age","Academic Pressure","Study Satisfaction","Dietary Habits","Suicidal Thoughts", "Work_study_hours","Financial Stress"]]
X_best = pd.get_dummies(columns=["Dietary Habits", "Suicidal Thoughts"], drop_first=True, data=X_best)
X_best_train, X_best_test, y_train, y_test = train_test_split(X_best,y, random_state=52, stratify=y)
pipeline_knn_best = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(88))])
pipeline_knn_best.fit(X_best_train, y_train)
y_pred_knn_best=pipeline_knn_best.predict(X_best_test)
print(f' The accuracy of the KNN model after selecting the best combination of variables is: {metrics.accuracy_score(y_test, y_pred_knn_best):.4f}')

In [None]:
class_names = ['depressed_no', 'depressed_yes']
ConfusionMatrixDisplay.from_estimator(pipeline_knn_best.fit(X_best_train, y_train), X_best_test, y_test,
                                 display_labels=class_names, cmap=plt.cm.Blues);
plt.show()

## 2) Logistic Regression

Logistic Regression is a parametric classification algorithm (unlike KNN), used to predict the probability that an observation belongs to a certain class, typically binary (0 or 1). Its outputs can be interpreted as probabilities of class membership, which can then be converted into class predictions.
Logistic regression models the log-odds as a linear combination of the features. With a single feature $x$: $$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$ This can be rearranged into the **logistic function**: $$p = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$
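
A small numeric sketch of these two formulas: starting from hypothetical coefficients and a hypothetical feature value (none of them estimated from the data), the log-odds are computed first and then converted into a probability and a class prediction.

```python
import math

def logistic(log_odds):
    """Logistic function: 1 / (1 + e^(-z)), which equals e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients and feature value, for illustration only.
beta_0, beta_1 = -1.0, 0.5
x = 4.0

log_odds = beta_0 + beta_1 * x          # linear combination: -1.0 + 0.5 * 4.0 = 1.0
p = logistic(log_odds)                  # probability of class 1
predicted_class = int(p >= 0.5)         # default 0.5 cutoff

print(f"log-odds={log_odds:.2f}, p={p:.3f}, class={predicted_class}")
```

Taking `math.log(p / (1 - p))` recovers the log-odds, which is the first equation read in the other direction.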


Logistic regression slightly outperforms KNN here, with an accuracy of 84.5%.

In [None]:
pipeline_logreg = Pipeline([('scaler', StandardScaler()), ('logreg', LogisticRegression())])
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=51)
pipeline_logreg.fit(X_train, y_train)
y_pred_logreg = pipeline_logreg.predict(X_test)
print(f'The accuracy of the logistic regression is {pipeline_logreg.score(X_test, y_test):.4f}')

In [None]:
class_names = ['depressed_no', 'depressed_yes']
ConfusionMatrixDisplay.from_estimator(pipeline_logreg, X_test, y_test,
                                 display_labels=class_names,
                                 cmap="Reds");
plt.title("Confusion Matrix");
plt.savefig("Confusion Matrix from logistic regression", dpi=300, bbox_inches="tight");
plt.show()

In [None]:
metrics.RocCurveDisplay.from_estimator(pipeline_logreg, X_test, y_test);

In [None]:
# extracting the fitted logistic regression step from the pipeline to access its coefficients

logreg = pipeline_logreg.named_steps["logreg"]
coefs = logreg.coef_

**Feature importance**

In [None]:
feature_names = X_train.columns
coef_df = pd.DataFrame({'Feature': feature_names,'Magnitude': np.abs(coefs[0])})
coef_df.sort_values(by="Magnitude", ascending=False)

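The features with the largest coefficient magnitudes, and thus the strongest effect on the odds of an observation belonging to class 1, are, in order of magnitude, **1) suicidal thoughts, 2) academic pressure, 3) financial stress, 4) age, 5) dietary habits**. Because absolute values are used, this ranking shows the strength of each effect but not its direction.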
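
To read the direction of an effect as well as its size, one option is to exponentiate a signed coefficient into an odds ratio. The sketch below uses two hypothetical coefficients with invented names and values; since the features here are standardised, each odds ratio would refer to a one-standard-deviation increase in the feature.

```python
import math

# Hypothetical signed coefficients on standardised features (not the fitted values).
coefficients = {"feature_a": 0.8, "feature_b": -0.3}

# exp(beta) is the factor by which the odds of class 1 are multiplied
# when the feature increases by one unit (one standard deviation here).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of class 1)")
```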

In [None]:
coef_df.sort_values(by="Magnitude", ascending=False).set_index('Feature')['Magnitude'].plot(kind='bar', figsize=(10,4))
plt.title('Coefficient Magnitudes (Logistic Regression)')
plt.ylabel('Absolute Value of Coefficient')
plt.show()

**SHAP values**

SHAP values (short for SHapley Additive exPlanations) tell you how much each feature contributes to a model's prediction for a given observation. They are based on Shapley values from cooperative game theory: a feature's contribution is its average marginal effect on the prediction across different combinations of the other features. They can be computed efficiently for linear models (logistic/linear regression) and for tree-based models such as Decision Trees, Random Forests, XGBoost and LightGBM.
The explainer is the SHAP object that knows how your model works and computes SHAP values for it.

Unlike the Gini importance, which is tree-specific, SHAP values show how much each feature pushes a specific prediction up or down.
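
For a linear model, and assuming the features are treated as independent, the SHAP value of a feature reduces to its coefficient times the feature's deviation from its mean. The sketch below checks the additive property on synthetic data with made-up coefficients: the base value plus the SHAP values adds up to the model output (in log-odds) for that observation. It uses plain NumPy rather than the `shap` package.

```python
import numpy as np

rng = np.random.default_rng(0)
X_bg = rng.normal(size=(200, 3))        # synthetic background data
beta = np.array([1.5, -0.7, 0.2])       # hypothetical coefficients
intercept = -0.4                        # hypothetical intercept

def log_odds(X):
    """Linear model output on the log-odds scale."""
    return intercept + X @ beta

x = X_bg[0]                                   # the observation to explain
phi = beta * (x - X_bg.mean(axis=0))          # one contribution per feature
base_value = log_odds(X_bg).mean()            # average model output over the background

# Additivity: base value + sum of contributions = model output for x
print(base_value + phi.sum(), log_odds(x))
```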

In [None]:
explainer = shap.Explainer(pipeline_logreg.named_steps['logreg'], 
                           pipeline_logreg.named_steps['scaler'].transform(X))
shap_values = explainer(pipeline_logreg.named_steps['scaler'].transform(X))
shap_values.feature_names = list(X.columns);

In [None]:
np.shape(shap_values.values)

**Waterfall plot**

Plot for a single observation

In [None]:
shap.plots.waterfall(shap_values[0])

**Absolute mean SHAP plot**

Features that have high positive or negative contributions will have large mean absolute SHAP values. In this case: suicidal thoughts, academic pressure, financial stress, unhealthy dietary habits.

In [None]:
shap.plots.bar(shap_values);

**Beeswarm plot**

Visualisation of all SHAP values and their direction

In [None]:
shap.plots.beeswarm(shap_values)

## 3) Decision Trees

A Decision Tree is a non-parametric supervised machine learning algorithm used for both classification and regression tasks. It models decisions and their possible consequences as a tree-like structure of if-then-else conditions, helping the algorithm learn patterns in the data to make predictions.

- The tree starts at the **root node**, which contains the full dataset.
- At each node, the algorithm selects the **best feature and threshold** to split the data into branches (scikit-learn's trees always use binary splits). The goal is to make the resulting child nodes as homogeneous as possible. For regression, the model picks the split that gives the **lowest error** (e.g. RMSE). For classification, the model picks the split that most reduces the **Gini index**, which for two classes ranges from 0 (pure node) to 0.5 (evenly mixed node).
- This splitting process continues recursively until a stopping condition is met (e.g., max depth reached, or nodes are pure).
- The final output is a leaf node, which provides the prediction:
    - A class label in classification tasks,
    - A numerical value in regression tasks.
- Decision trees are not sensitive to the scale of the input features, so standardisation/normalisation is not required, unlike for KNN, logistic regression, linear regression, KMeans.
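
The splitting step can be sketched in a few lines: compute the Gini index of a node, then try candidate thresholds on one feature and keep the one with the lowest weighted Gini of the two children. The feature values and labels below are invented, and the helper names `gini` and `best_threshold` are my own, not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini index 1 - sum(p_k^2): 0 for a pure node, 0.5 for a 50/50 binary node."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return float(1.0 - np.sum(p ** 2))

def best_threshold(feature, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted Gini)."""
    values = np.unique(feature)
    best = (None, np.inf)
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (float(t), float(weighted))
    return best

# Hypothetical toy node: an academic-pressure-like score and a binary label.
pressure = np.array([1, 2, 2, 3, 4, 5, 5, 4])
depressed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

threshold, impurity = best_threshold(pressure, depressed)
print(gini(depressed), threshold, impurity)
```

In this toy node the parent is evenly mixed (Gini 0.5), and splitting at 3.5 separates the two classes perfectly, so both children are pure.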

In [None]:
max_depth_range = list(range(1,200))
testing_error=[]
training_error=[]

for depth in max_depth_range:
    treeclf = DecisionTreeClassifier(max_depth=depth, random_state=52)
    treeclf.fit(X_train, y_train)
    y_pred_class = treeclf.predict(X_test)
    testing_accuracy=metrics.accuracy_score(y_test, y_pred_class)
    testing_error.append(1-testing_accuracy)    
    
    y_pred_class = treeclf.predict(X_train)
    training_accuracy=metrics.accuracy_score(y_train, y_pred_class)
    training_error.append(1-training_accuracy)

In [None]:
treeclf_error = pd.DataFrame(list(zip(max_depth_range, training_error, testing_error)), columns=["max_depth","training_error","testing_error"])
treeclf_error.sort_values("testing_error")

#max_depth=8 leads to the lowest testing error

In [None]:
treeclf = DecisionTreeClassifier(random_state=1, max_depth=8)
treeclf.fit(X_train,y_train)

In [None]:
y_pred = treeclf.predict(X_test)
accuracy_tree = metrics.accuracy_score(y_test, y_pred)
print(f' accuracy: {accuracy_tree:.4f}')

**Gini feature importance**

Each time a feature is used to split a node in a tree, it causes a reduction in Gini impurity (the Gini index measures how mixed the classes at a node are). The total reduction in impurity attributable to a feature is its Gini importance. A higher Gini importance means that the feature is more influential in making splits that reduce impurity, and therefore in making our model more accurate.

Gini importance is specific to trees, and does not indicate in what direction the feature affects predictions (negatively or positively). It is also biased towards continuous variables or those with many unique values.

In [None]:
# calculating Gini importance

pd.DataFrame({'feature':X.columns, 'importance':treeclf.feature_importances_}).sort_values("importance", ascending=False)

Using the decision tree algorithm, the features with the highest Gini importance are suicidal thoughts, academic pressure, financial stress, age, dietary habits.

In [None]:
export_graphviz(treeclf, out_file='Decision_Tree', feature_names=X.columns)

**SHAP values**

In [None]:
explainer = shap.TreeExplainer(treeclf)
shap_values_dt = explainer(X)
shap_values_dt.feature_names = list(X.columns);

In [None]:
np.shape(shap_values_dt.values)

In [None]:
shap.plots.bar(shap_values_dt[:, :, 1])

In [None]:
shap.plots.beeswarm(shap_values_dt[:,:,1])

## 4) Random Forest

A Random Forest is a supervised machine learning algorithm that consists of a collection (or "ensemble") of many individual decision trees, typically trained using a method called bagging (each tree is trained on a bootstrap sample of the training data, drawn at random with replacement). In addition, each split considers only a random subset of the features, which increases diversity among the trees. It is used for both classification and regression tasks and is known for its accuracy, robustness, and ability to handle large datasets with high dimensionality.

The idea behind a random forest is **to reduce the risk of overfitting** that is often associated with a single decision tree, while maintaining high predictive performance. This is achieved by building many decision trees during training and combining their outputs to make a final prediction:
- For **classification**, it predicts the class that is the majority vote among all trees.
- For **regression**, it predicts the average of the outputs from all trees.
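
The idea can be sketched by hand: fit several decision trees, each on a bootstrap sample, and combine their predictions by majority vote. This is a simplified illustration on synthetic data, not how the model in this notebook is fitted; note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the depression dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw row indices at random, with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" makes each split consider a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

votes = np.stack([tree.predict(X_te) for tree in trees])  # shape: (n_trees, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)          # majority vote per observation
ensemble_accuracy = float((majority == y_te).mean())
print(f"ensemble accuracy: {ensemble_accuracy:.3f}")
```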

In [None]:
rf_model = RandomForestClassifier(n_estimators=300, max_depth=10,random_state=52)

In [None]:
#computing the best estimator
n_estimator_range = range(10, 300, 10)
mean_scores = []

for n in n_estimator_range:
    rf = RandomForestClassifier(n_estimators=n, random_state=52)
    scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
    mean_scores.append(scores.mean())

In [None]:
plt.plot(n_estimator_range, mean_scores);
plt.xlabel('n_estimators');
plt.ylabel('Cross-Validated Accuracy');
plt.title('Choosing n_estimators');

In [None]:
#fitting the model

rf_model.fit(X_train, y_train)
y_pred_rf=rf_model.predict(X_test)

In [None]:
#computing accuracy, other metrics and confusion matrix.

print("Accuracy Random Forest:", metrics.accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n", metrics.classification_report(y_test, y_pred_rf))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

**Gini feature importance**

In [None]:
feature_importance = pd.DataFrame({"feature": X_train.columns,"importance": rf_model.feature_importances_
}).sort_values(by="importance", ascending=False)
feature_importance

The features with the highest Gini importance are suicidal thoughts, academic pressure, CGPA, age and financial stress.

In [None]:
#using GridSearch to select the best parameters

params = {"max_depth": [3, 5, 10, 20, 30], "n_estimators": [100, 150, 200, 250, 300]}
search = GridSearchCV(RandomForestClassifier(random_state=52), param_grid=params, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

**SHAP values**

In [None]:
explainer = shap.TreeExplainer(rf_model)
shap_values_rf = explainer(X)

In [None]:
np.shape(shap_values_rf.values)

In [None]:
shap.plots.bar(shap_values_rf[:,:,1])

In [None]:
shap.plots.beeswarm(shap_values_rf[:,:,1])

### 5) *for future model improvement* Gradient Boosting Machines (XGBoost, LightGBM, CatBoost)
Unfortunately I was not able to import the relevant packages to implement these algorithms. 

https://medium.com/@weidagang/essential-python-for-machine-learning-xgboost-4b662cf19fcd

## 6) Results

### What **key factors** affect depression?

- **KNN**: model accuracy is 0.8405. The best variable combination leading to the highest accuracy is age, academic pressure, study satisfaction, dietary habits, suicidal thoughts, work study hours and financial stress.
- **logistic regression**: model accuracy is 0.8455. The strongest coefficients are the ones associated with the following features: suicidal thoughts, academic pressure, financial stress, age, unhealthy dietary habits. Features with large SHAP values include suicidal thoughts, academic pressure, financial stress, unhealthy dietary habits.
- **decision tree**: accuracy is 0.8311. The features with the highest Gini importance are suicidal thoughts, academic pressure, financial stress, age, dietary habits.
- **random forest**: accuracy is 0.8466. The features with the highest Gini importance are suicidal thoughts, academic pressure, CGPA, age, financial stress.