In [10]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder, FunctionTransformer, OrdinalEncoder
from sklearn.compose import ColumnTransformer

from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier

from imblearn.over_sampling import SMOTE, RandomOverSampler, BorderlineSMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

### Experiments to find best sampling techniques

In [7]:
# Load your data
df = pd.read_csv('../data/depression_data.csv')
df = df.drop(columns=['Name'])

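# Log scaling for Income (applied before splitting features and target).
# Note: 'Income' is not listed in numeric_cols below, so the ColumnTransformer
# (default remainder='drop') does not pass it on to the models.
df['Income'] = np.log1p(df['Income'])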

# Splitting features and target
X = df.drop(['History of Mental Illness'], axis=1)
y = df['History of Mental Illness'].map({'Yes': 1, 'No': 0})

# Columns setup
categorical_cols = ['Marital Status', 'Education Level', 'Smoking Status', 'Physical Activity Level',
                    'Employment Status', 'Alcohol Consumption', 'Dietary Habits', 'Sleep Patterns',
                    'History of Substance Abuse', 'Family History of Depression', 'Chronic Medical Conditions']
numeric_cols = ['Age', 'Number of Children']

# One hot encoding without drop
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(drop=None), categorical_cols)
    ])

# Define models
models = {
    'Dummy': DummyClassifier(strategy='most_frequent'),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'LightGBM': LGBMClassifier(),
    'NeuralNetwork': MLPClassifier(hidden_layer_sizes=(64, 32), early_stopping=True)
}

# Sampling techniques
sampling_methods = {
    'None': None,
    'RandomOverSampler': RandomOverSampler(),
    'RandomUnderSampler': RandomUnderSampler(),
    'SMOTE': SMOTE(),
    'BorderlineSMOTE': BorderlineSMOTE(),
    'ADASYN': ADASYN()
}

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply preprocessing to the training data before resampling
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

# Iterate through each sampling method
for sampling_name, sampler in sampling_methods.items():
    if sampler is not None:
        X_resampled, y_resampled = sampler.fit_resample(X_train_transformed, y_train)
    else:
        X_resampled, y_resampled = X_train_transformed, y_train

    print(f"\nSampling Technique: {sampling_name}")
    
    # Iterate through each model
    for model_name, model in models.items():
        model.fit(X_resampled, y_resampled)
        y_pred = model.predict(X_test_transformed)

        # Print metrics
        print(f"\nModel: {model_name}")
        print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
        print(f"Precision: {precision_score(y_test, y_pred, zero_division=1):.4f}")
        print(f"Recall: {recall_score(y_test, y_pred, zero_division=1):.4f}")
        print(f"F1 Score: {f1_score(y_test, y_pred, zero_division=1):.4f}")
        print("Classification Report:")
        print(classification_report(y_test, y_pred, zero_division=1))


Sampling Technique: None

Model: Dummy
Accuracy: 0.6954
Precision: 1.0000
Recall: 0.0000
F1 Score: 0.0000
Classification Report:
              precision    recall  f1-score   support

           0       0.70      1.00      0.82     86319
           1       1.00      0.00      0.00     37812

    accuracy                           0.70    124131
   macro avg       0.85      0.50      0.41    124131
weighted avg       0.79      0.70      0.57    124131


Model: DecisionTree
Accuracy: 0.5967
Precision: 0.3252
Recall: 0.3014
F1 Score: 0.3129
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.73      0.71     86319
           1       0.33      0.30      0.31     37812

    accuracy                           0.60    124131
   macro avg       0.51      0.51      0.51    124131
weighted avg       0.59      0.60      0.59    124131


Model: RandomForest
Accuracy: 0.6186
Precision: 0.3323
Recall: 0.2498
F1 Score: 0.2852
Classification Re

### Experiment Insights and Takeaways

After conducting multiple experiments with various models and sampling techniques, here is a summary of our findings and the rationale behind our final choice:

#### 1. **Sampling Techniques:**

We experimented with several sampling techniques to address the class imbalance in the dataset, namely:
- No sampling (baseline)
- Random Over-Sampling
- Random Under-Sampling
- SMOTE (Synthetic Minority Over-sampling Technique)
- BorderlineSMOTE
- ADASYN (Adaptive Synthetic Sampling)

Among these, **Random Over-Sampling** and **SMOTE** provided the best results in terms of model performance, especially when paired with Logistic Regression and Random Forest models. The two were close on recall and F1, with neither ahead for every model: the Decision Tree and the neural network scored marginally higher F1 with Random Over-Sampling. SMOTE generates synthetic samples of the minority class rather than duplicating existing rows, which can reduce the risk of overfitting.

#### 2. **Model Performance Across Techniques:**

- **Dummy Classifier**: 
    - As expected, the dummy classifier set the baseline accuracy (~70%) but never predicted the minority class, so its recall and F1 score are 0. Its reported precision of 1.0 is an artifact of `zero_division=1`, since the model makes no positive predictions. This shows why accuracy alone is misleading here and why balancing techniques are needed.

- **Decision Tree**: 
    - The Decision Tree performed better than the dummy classifier on F1, but its performance was still weak. With **Random Over-Sampling** it achieved an F1 score of ~0.33, and with SMOTE a marginally lower ~0.32. Its performance remained below that of the best models, as it often struggles with class imbalance.

- **Random Forest**: 
    - Random Forest performed on par with the Decision Tree. Its best results came with **SMOTE**, where the F1 score reached ~0.32. It also showed limitations in handling class imbalance, especially when undersampling was used.

- **Logistic Regression**: 
    - Logistic Regression consistently performed well across all sampling techniques. Its performance improved the most with **SMOTE**, reaching an F1 score of ~0.42. This model balanced simplicity with interpretability, making it a strong candidate for the final solution.

- **LightGBM**: 
    - LightGBM models performed similarly across sampling methods but exhibited very low recall for the minority class. Despite a high precision, LightGBM struggled to predict the minority class (history of mental illness) effectively. Its best F1 score was observed with **SMOTE** (~0.16).

- **Neural Network**: 
    - The neural network reached F1 scores of ~0.42 with SMOTE and ~0.44 with Random Over-Sampling, but it was computationally more expensive. Because its gain over a simpler model like Logistic Regression was at most marginal, it was not chosen as the final model.
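
To make the Dummy Classifier baseline concrete, the sketch below recomputes its metrics from the test-set class counts shown in the classification report (86,319 negatives and 37,812 positives). The helper name `metrics_from_counts` is hypothetical, and undefined ratios are mapped to 0 here, whereas the notebook passes `zero_division=1`.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts.

    Undefined ratios (zero denominator) are reported as 0.0 in this sketch.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# A most-frequent classifier predicts class 0 for every test row,
# so all 37,812 positives become false negatives.
baseline = metrics_from_counts(tp=0, fp=0, fn=37812, tn=86319)
print([round(m, 4) for m in baseline])
```

The accuracy matches the reported 0.6954, while recall and F1 are 0 because no positive case is ever predicted.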

#### 3. **Final Model Choice: Logistic Regression + SMOTE**

The Logistic Regression model with **SMOTE** consistently provided the best balance between precision, recall, and F1 scores. Here’s a quick breakdown:

- **Accuracy**: ~61.68%
- **Precision**: ~38.97%
- **Recall**: ~45.65%
- **F1 Score**: ~42.05%
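
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two figures above. The function name `f1_from` is hypothetical, and the inputs are the rounded values listed here, so agreement is only expected to about four decimal places.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for Logistic Regression + SMOTE
f1 = f1_from(0.3897, 0.4565)
print(f"{f1:.4f}")
```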

While Random Over-Sampling provided comparable results, we opted for **SMOTE** as the final sampling technique due to its ability to generate synthetic samples rather than duplicating data. This choice reduces the risk of overfitting, which can occur when oversampling simply duplicates minority class instances.

#### 4. **Rationale for SMOTE Over Random Over-Sampling**

Both Random Over-Sampling and SMOTE performed similarly, but SMOTE is the more robust approach in principle. Random Over-Sampling creates exact copies of minority class instances, which can lead the model to memorize the duplicated points rather than generalize to unseen data. SMOTE instead generates synthetic instances by interpolating between real minority class instances and their nearest minority neighbours, giving a more diverse representation of the minority class and lowering, though not eliminating, the risk of overfitting.
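
The difference between the two oversamplers can be illustrated with a toy NumPy sketch. It is not imbalanced-learn's implementation: it only shows the core idea that random over-sampling repeats existing rows, while SMOTE-style interpolation places a new point on the segment between a minority sample and one of its nearest minority neighbours. The data, the function names, and the choice of `k=2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples (4 points, 2 features); values are made up.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])


def random_oversample(X, n_new, rng):
    """Draw n_new rows with replacement: every new row is an exact copy."""
    idx = rng.integers(0, len(X), size=n_new)
    return X[idx]


def smote_like_sample(X, n_new, k, rng):
    """Interpolate between a sample and one of its k nearest neighbours."""
    # Pairwise Euclidean distances; the diagonal is set to inf so a point
    # is never chosen as its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X[base] + gap * (X[chosen] - X[base])


copies = random_oversample(X_min, 3, rng)
synthetic = smote_like_sample(X_min, 3, k=2, rng=rng)
print("duplicated rows:\n", copies)
print("interpolated rows:\n", synthetic)
```

Every row in `copies` already exists in `X_min`, whereas the rows in `synthetic` lie between existing minority points.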

### Conclusion

For our final solution, we will move forward with **SMOTE**. This gives us a good balance between performance and simplicity, with the added benefit of reducing overfitting risks.


### Experiments to find best model and parameters

In [12]:
# Load data
df = pd.read_csv('../data/depression_data.csv')
df = df.drop(columns=['Name'])

# Log scaling for Income (applied before X is extracted so the change is reflected in X)
df['Income'] = np.log1p(df['Income'])

# Splitting features and target
X = df.drop(['History of Mental Illness'], axis=1)
y = df['History of Mental Illness'].map({'Yes': 1, 'No': 0})

# Columns setup
categorical_cols = ['Marital Status', 'Education Level', 'Smoking Status', 'Physical Activity Level',
                    'Employment Status', 'Alcohol Consumption', 'Dietary Habits', 'Sleep Patterns',
                    'History of Substance Abuse', 'Family History of Depression', 'Chronic Medical Conditions']
# Note: 'Income' is not listed here, so the ColumnTransformer below drops it (remainder='drop')
numeric_cols = ['Age', 'Number of Children']

# One hot encoding and scaling
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(drop=None), categorical_cols)
    ])

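# Train-test split (80-20), done before fitting the preprocessor to avoid test-set leakage
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the preprocessor on the training data only, then transform both sets
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)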

# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Define models and hyperparameters for GridSearch
model_params = {
    'DecisionTree': {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            'classifier__max_depth': [10, 20, 30, None],
            'classifier__min_samples_split': [2, 5, 10]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(random_state=42),
        'params': {
            'classifier__max_iter': [100, 250, 500],
            'classifier__C': [0.01, 0.1, 1, 10],
            'classifier__penalty': ['l1', 'l2'],
            'classifier__solver': ['liblinear', 'saga']
        }
    },
    'LightGBM': {
        'model': LGBMClassifier(random_state=42),
        'params': {
            'classifier__n_estimators': [50, 100, 200],
            'classifier__max_depth': [10, 20, 30, None],
            'classifier__learning_rate': [0.01, 0.05, 0.1]
        }
    },
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'classifier__n_estimators': [50, 100, 150],
            'classifier__max_depth': [10, 20, 30, None],
            'classifier__min_samples_split': [2, 5, 10]
        }
    }
}


# Iterate through each model, apply GridSearchCV and evaluate
for model_name, mp in model_params.items():
    print(f"\nModel: {model_name}")
    
    # Create pipeline for model with classifier (preprocessor is already applied)
    clf = Pipeline(steps=[('classifier', mp['model'])])
    
    # Perform Grid Search with Cross-Validation
    grid_search = GridSearchCV(clf, mp['params'], cv=5, scoring='f1', n_jobs=-1)
    grid_search.fit(X_resampled, y_resampled)
    
    # Get the best model from grid search
    best_model = grid_search.best_estimator_
    print("--------------- Best Model: ------------------")
    print(best_model)
    
    # Predictions on the test set
    y_pred = best_model.predict(X_test)
    
    # Print evaluation metrics
    print(f"Best parameters found: {grid_search.best_params_}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))


Model: DecisionTree
--------------- Best Model: ------------------
Pipeline(steps=[('classifier', DecisionTreeClassifier(random_state=42))])
Best parameters found: {'classifier__max_depth': None, 'classifier__min_samples_split': 2}
Accuracy: 0.5933
Precision: 0.3287
Recall: 0.3178
F1 Score: 0.3232
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.71      0.71     57471
           1       0.33      0.32      0.32     25283

    accuracy                           0.59     82754
   macro avg       0.52      0.52      0.52     82754
weighted avg       0.59      0.59      0.59     82754


Model: LogisticRegression
--------------- Best Model: ------------------
Pipeline(steps=[('classifier',
                 LogisticRegression(C=0.01, random_state=42,
                                    solver='liblinear'))])
Best parameters found: {'classifier__C': 0.01, 'classifier__max_iter': 100, 'classifier__penalty': 'l2', 'classifier__solver'

### Experiment Insights and Key Takeaways

In this experiment, we evaluated four models: Decision Tree, Logistic Regression, LightGBM, and Random Forest. The goal was to determine the best-performing model based on accuracy, precision, recall, and F1 score, while also considering model explainability and robustness.

#### 1. **Decision Tree:**
- **Best Parameters**: `{'classifier__max_depth': None, 'classifier__min_samples_split': 2}`
- **Performance**: 
  - **Accuracy**: 59.33%
  - **Precision**: 32.87%
  - **Recall**: 31.78%
  - **F1 Score**: 32.32%
  
The Decision Tree performed poorly compared to the other models. Its accuracy is below 60% and its F1 score is only ~32%, and it showed limited ability to predict the minority class (history of mental illness), with a recall of just ~32%. This indicates that the model struggles to generalize, even with SMOTE applied to address the class imbalance.

#### 2. **Logistic Regression:**
- **Best Parameters**: `{'classifier__C': 0.01, 'classifier__max_iter': 100, 'classifier__penalty': 'l2', 'classifier__solver': 'liblinear'}`
- **Performance**: 
  - **Accuracy**: 61.72%
  - **Precision**: 39.19%
  - **Recall**: 45.86%
  - **F1 Score**: 42.26%
  
Logistic Regression emerged as the best-performing model in this experiment by F1 score (~42%), with the highest recall of the four models. Its simplicity and interpretability make it a strong candidate for the final model. Although its accuracy is slightly lower than LightGBM's, its recall (the ability to identify positive cases of mental illness) is notable, making it a reasonable choice for this problem.

#### 3. **LightGBM:**
- **Best Parameters**: `{'classifier__learning_rate': 0.05, 'classifier__max_depth': 30, 'classifier__n_estimators': 50}`
- **Performance**: 
  - **Accuracy**: 63.92%
  - **Precision**: 40.00%
  - **Recall**: 36.21%
  - **F1 Score**: 38.01%
  
LightGBM achieved the highest accuracy (63.92%) among the models, but its recall for the minority class was lower than Logistic Regression's (36.21% vs. 45.86%). While LightGBM offers powerful predictive performance, it tends to be less interpretable than simpler models like Logistic Regression. This makes it less desirable for this task, where model explainability is important.

#### 4. **Random Forest:**
- **Best Parameters**: `{'classifier__max_depth': 30, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 150}`
- **Performance**: 
  - **Accuracy**: 60.61%
  - **Precision**: 34.03%
  - **Recall**: 30.82%
  - **F1 Score**: 32.34%
  
Random Forest provided moderate performance, similar to the Decision Tree model. Despite its slightly higher precision, the recall for the minority class remained relatively low (30.82%), which limits its effectiveness for identifying individuals with a history of mental illness. Its F1 score (32%) also indicates that it struggles to balance precision and recall effectively for this problem.

### Final Model Choice: Logistic Regression

Based on the performance of the models, **Logistic Regression** is the best model for our problem. Here are the key reasons for this choice:
- **Balanced Performance**: Logistic Regression provided the best balance between precision and recall, resulting in an F1 score of ~42%. This balance is crucial for the task, where both false positives and false negatives can have significant implications.
- **Explainability**: Logistic Regression is highly interpretable, allowing us to understand how each feature contributes to the predictions. This is important in real-world applications, especially in health-related domains, where decision-making transparency is essential.
- **Simplicity**: Logistic Regression is computationally efficient and straightforward, making it easier to deploy and scale. While LightGBM offers marginally higher accuracy (63.92% vs. 61.72%), it is more complex and less interpretable, and Random Forest scored lower on both accuracy and F1.
- **Consistent Results**: Across both experiments (previous and current), Logistic Regression consistently performed well, especially when paired with SMOTE for addressing the class imbalance.
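
The comparison above can be tabulated directly from the reported test-set figures. The snippet below only re-sorts the numbers already listed in this section; the dictionary name `results` is a placeholder.

```python
# Test-set metrics (in %) as reported for each tuned model above
results = {
    "DecisionTree":       {"accuracy": 59.33, "precision": 32.87, "recall": 31.78, "f1": 32.32},
    "LogisticRegression": {"accuracy": 61.72, "precision": 39.19, "recall": 45.86, "f1": 42.26},
    "LightGBM":           {"accuracy": 63.92, "precision": 40.00, "recall": 36.21, "f1": 38.01},
    "RandomForest":       {"accuracy": 60.61, "precision": 34.03, "recall": 30.82, "f1": 32.34},
}

# Rank by F1, the metric used for scoring in the grid search
ranked = sorted(results, key=lambda name: results[name]["f1"], reverse=True)
best_accuracy = max(results, key=lambda name: results[name]["accuracy"])

for name in ranked:
    m = results[name]
    print(f"{name:<20} F1={m['f1']:.2f}  recall={m['recall']:.2f}  acc={m['accuracy']:.2f}")
print("Highest accuracy:", best_accuracy)
```

Sorting by F1 puts Logistic Regression first, while sorting by accuracy would favour LightGBM, which is the trade-off weighed in the list above.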

### Conclusion

For our final model, we will proceed with **Logistic Regression**, using the best parameters found in this experiment (`C = 0.01, max_iter = 100, penalty = 'l2', solver = 'liblinear'`). This model offers the best combination of performance, simplicity, and explainability for predicting whether an individual has a history of mental illness.
