We are given a dataset containing information about people from India. The target column is Depression. The task is to preprocess the data and build a model that predicts depression (binary classification).

Team members

Есакова Елизавета

Моттуева Уруйдана

Сапожков Николай

Шабанова Надежда

Collaboration
Coordination:

-----------------------------

Participant 1 hands the preprocessed data and the pipeline to Participant 2.

Participant 2 hands the configured GridSearchCV to Participant 3.

Participant 3 trains the model and hands the results to Participant 4.

Participant 4 analyzes and presents the final results.

----------------------

Tools:

Telegram,
Google Colab.

Review:

Each participant reviews their own code and the other participants' code.

The code and results are finalized jointly.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('final_depression_dataset_1.csv')

print(df.info())
print(df.head(5))
print(df['Depression'].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2556 entries, 0 to 2555
Data columns (total 19 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Name                                   2556 non-null   object 
 1   Gender                                 2556 non-null   object 
 2   Age                                    2556 non-null   int64  
 3   City                                   2556 non-null   object 
 4   Working Professional or Student        2556 non-null   object 
 5   Profession                             1883 non-null   object 
 6   Academic Pressure                      502 non-null    float64
 7   Work Pressure                          2054 non-null   float64
 8   CGPA                                   502 non-null    float64
 9   Study Satisfaction                     502 non-null    float64
 10  Job Satisfaction                       2054 non-null   float64
 11  Slee
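
The `df.info()` output lists 502 non-null values for the study-related columns (`Academic Pressure`, `CGPA`, `Study Satisfaction`) and 2054 for the work-related ones, and 502 + 2054 = 2556. That suggests the gaps follow the `Working Professional or Student` column rather than being random. A minimal sketch of how this could be checked, using a small made-up frame (`toy` and its values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data only: students lack work columns, professionals lack study columns.
toy = pd.DataFrame({
    'Working Professional or Student': ['Student', 'Working Professional',
                                        'Student', 'Working Professional'],
    'Academic Pressure': [3.0, np.nan, 5.0, np.nan],
    'Work Pressure': [np.nan, 2.0, np.nan, 4.0],
})

# Share of missing values per column, split by group.
missing_share = (
    toy.drop(columns='Working Professional or Student')
       .isna()
       .groupby(toy['Working Professional or Student'])
       .mean()
)
print(missing_share)
```

On the real data the same expression would be applied to `df`. A share of 1.0 for a whole group would mean the value is structurally absent, which is worth keeping in mind when a median is imputed for it.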

In [None]:
# Fill missing values with the column median. Assigning the result back avoids
# the chained `df[col].fillna(..., inplace=True)` pattern, which pandas reports
# with a FutureWarning and which stops modifying `df` in pandas 3.0.
df['CGPA'] = df['CGPA'].fillna(df['CGPA'].median())
df['Study Satisfaction'] = df['Study Satisfaction'].fillna(df['Study Satisfaction'].median())
df['Job Satisfaction'] = df['Job Satisfaction'].fillna(df['Job Satisfaction'].median())
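
Calling `fillna(..., inplace=True)` on `df[col]` acts on an intermediate object, so pandas recommends either assigning the filled column back or making one DataFrame-level call with a `{column: value}` mapping. A small sketch of both forms on made-up values (`toy`, `filled_a` and `filled_b` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'CGPA': [7.0, np.nan, 9.0],
    'Job Satisfaction': [np.nan, 2.0, 4.0],
})

# Form 1: assign the filled column back to the frame.
filled_a = toy.copy()
filled_a['CGPA'] = filled_a['CGPA'].fillna(filled_a['CGPA'].median())

# Form 2: one DataFrame-level call with a {column: value} mapping.
medians = toy[['CGPA', 'Job Satisfaction']].median()
filled_b = toy.fillna(medians.to_dict())

print(filled_a)
print(filled_b)
```

Both forms leave `toy` itself unchanged. Since the pipeline defined later already contains a `SimpleImputer(strategy='median')` for the numeric columns, this manual step may be redundant in the notebook.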

In [None]:
df['Sleep Duration'] = df['Sleep Duration'].astype('category')
df['Dietary Habits'] = df['Dietary Habits'].astype('category')

In [None]:
X = df.drop('Depression', axis=1)
y = df['Depression']

In [None]:
numeric_features = ['Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
                   'Study Satisfaction', 'Job Satisfaction', 'Work/Study Hours',
                   'Financial Stress']
categorical_features = ['Gender', 'City', 'Working Professional or Student',
                       'Profession', 'Sleep Duration', 'Dietary Habits', 'Degree',
                       'Have you ever had suicidal thoughts ?', 'Family History of Mental Illness']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier(random_state=42))])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9375
              precision    recall  f1-score   support

          No       0.95      0.98      0.96       421
         Yes       0.89      0.74      0.81        91

    accuracy                           0.94       512
   macro avg       0.92      0.86      0.88       512
weighted avg       0.94      0.94      0.94       512
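
The test set is imbalanced (421 `No` against 91 `Yes`), so accuracy is easier to read next to the score of a model that always predicts `No`. A short worked calculation from the support counts in the report above:

```python
# Support counts taken from the classification report above.
support_no, support_yes = 421, 91
total = support_no + support_yes

# Accuracy of always predicting the majority class.
majority_baseline = support_no / total

# Number of correct predictions implied by the reported accuracy.
reported_accuracy = 0.9375
correct = round(reported_accuracy * total)

print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"Correct predictions: {correct} of {total}")
```

The baseline is roughly 0.82, so the model's 0.9375 is a real gain, but the recall of 0.74 on `Yes` shows that most of the remaining errors are missed positive cases.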



In [None]:
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)

best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
print("Best model accuracy:", accuracy_score(y_test, y_pred_best))
print(classification_report(y_test, y_pred_best))

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best parameters: {'classifier__max_depth': None, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 200}
Best model accuracy: 0.93359375
              precision    recall  f1-score   support

          No       0.94      0.98      0.96       421
         Yes       0.88      0.73      0.80        91

    accuracy                           0.93       512
   macro avg       0.91      0.85      0.88       512
weighted avg       0.93      0.93      0.93       512
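
`GridSearchCV` selects by accuracy when no `scoring` is given, and here the tuned model does not beat the untuned one on the test set (0.9336 against 0.9375). One option, offered as an assumption and not something the notebook does, is to select on the F1 score of the `Yes` class and to include `class_weight` in the grid. A self-contained sketch on synthetic data (`X_demo`, `y_demo` and the small grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data with string labels, mimicking 'No'/'Yes'.
X_demo, y_num = make_classification(n_samples=300, n_features=8,
                                    weights=[0.8, 0.2], random_state=42)
y_demo = np.where(y_num == 1, 'Yes', 'No')

# Score each candidate by F1 on the minority class instead of accuracy.
f1_yes = make_scorer(f1_score, pos_label='Yes')

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={
        'n_estimators': [50],
        'max_depth': [None, 5],
        'class_weight': [None, 'balanced'],
    },
    scoring=f1_yes,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated F1 (Yes): {search.best_score_:.3f}")
```

With the notebook's pipeline the grid keys would carry the `classifier__` prefix, for example `classifier__class_weight`.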



In [None]:
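# Read the one-hot column names from the tuned pipeline, the same object
# the importances below come from.
onehot_columns = list(best_model.named_steps['preprocessor'].
                     named_transformers_['cat'].
                     named_steps['onehot'].
                     get_feature_names_out(categorical_features))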

feature_names = numeric_features + onehot_columns

importances = best_model.named_steps['classifier'].feature_importances_

feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values('importance', ascending=False)

print(feature_importance_df.head(20))

                                               feature  importance
0                                                  Age    0.172049
1                                    Academic Pressure    0.073579
6                                     Work/Study Hours    0.049963
111           Have you ever had suicidal thoughts ?_No    0.047003
112          Have you ever had suicidal thoughts ?_Yes    0.045510
7                                     Financial Stress    0.043818
41   Working Professional or Student_Working Profes...    0.039474
2                                        Work Pressure    0.035138
5                                     Job Satisfaction    0.034833
3                                                 CGPA    0.034600
4                                   Study Satisfaction    0.031016
40             Working Professional or Student_Student    0.029562
74                                  Profession_Teacher    0.024545
95                                     Degree_Class 12    0.02
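
One-hot encoding splits each categorical column's importance across its categories, so a column such as `Have you ever had suicidal thoughts ?` appears twice in the table. A hedged sketch of summing the parts back onto the source column follows. `aggregate_importances` is a hypothetical helper, and the demo values are the four rows copied from the output above:

```python
import pandas as pd

def aggregate_importances(feature_names, importances, categorical_features):
    """Sum one-hot importances back onto the column they were derived from."""
    sources = []
    for name in feature_names:
        # A one-hot column is named '<original column>_<category>';
        # anything that matches no categorical column is kept as is.
        source = next((col for col in categorical_features
                       if name.startswith(col + '_')), name)
        sources.append(source)
    frame = pd.DataFrame({'source': sources, 'importance': importances})
    return frame.groupby('source')['importance'].sum().sort_values(ascending=False)

demo_names = ['Age',
              'Have you ever had suicidal thoughts ?_No',
              'Have you ever had suicidal thoughts ?_Yes',
              'Financial Stress']
demo_importances = [0.172049, 0.047003, 0.045510, 0.043818]

aggregated = aggregate_importances(
    demo_names, demo_importances, ['Have you ever had suicidal thoughts ?'])
print(aggregated)
```

Prefix matching is a simplification: it would misattribute columns if one categorical column name were a prefix of another followed by an underscore.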

In [None]:
import joblib

joblib.dump(best_model, 'depression_prediction_model.pkl')

['depression_prediction_model.pkl']
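
A quick way to confirm that a saved model survives the round trip is to reload it and compare predictions. The sketch below uses a small synthetic model and a temporary directory, so it does not touch `depression_prediction_model.pkl`. All names in it are illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(demo_model, path)      # serialize the fitted estimator
    restored = joblib.load(path)       # read it back into a new object

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo),
                                  restored.predict(X_demo))
print("Round trip preserved predictions:", same_predictions)
```

Files written by `joblib` are pickle-based, so they should be loaded only from trusted sources and with a scikit-learn version compatible with the one used for saving.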

In [None]:
import pandas as pd

# Example of new data
new_data = pd.DataFrame({
    'Name': ['Test User'],
    'Gender': ['Male'],
    'Age': [30],
    'City': ['Mumbai'],
    'Working Professional or Student': ['Working Professional'],
    'Profession': ['Engineer'],
    'Academic Pressure': [3],
    'Work Pressure': [4],
    'CGPA': [7.5],
    'Study Satisfaction': [4],
    'Job Satisfaction': [3],
    'Sleep Duration': ['7-8 hours'],
    'Dietary Habits': ['Moderate'],
    'Degree': ['BE'],
    'Have you ever had suicidal thoughts ?': ['No'],
    'Work/Study Hours': [8],
    'Financial Stress': [3],
    'Family History of Mental Illness': ['No']
})

In [None]:
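# `model` is only defined in a later cell (loaded from disk), so use the
# tuned pipeline that is already in memory.
prediction = best_model.predict(new_data)
prediction_proba = best_model.predict_proba(new_data)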

print(f"Prediction: {'Depression' if prediction[0] == 'Yes' else 'No Depression'}")
print(f"Probability: {prediction_proba[0][1]*100:.2f}% chance of depression")

Prediction: No Depression
Probability: 4.50% chance of depression
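
The cell above reads the depression probability from column 1 of `predict_proba`. scikit-learn orders those columns as in the fitted classifier's `classes_`, so the index can be looked up instead of hard-coded. A minimal sketch with a toy classifier (`demo_model` and its data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data with the same string labels as the notebook.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes'])
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Find which predict_proba column belongs to the 'Yes' class.
yes_index = list(demo_model.classes_).index('Yes')
proba_yes = demo_model.predict_proba(np.array([[4.5]]))[0][yes_index]

print(demo_model.classes_, yes_index)
print(f"P(Yes) = {proba_yes:.3f}")
```

With labels `No` and `Yes` the lookup returns 1, which matches the hard-coded index, but it keeps working if the labels are ever recoded.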


In [None]:
import joblib
import pandas as pd

model = joblib.load('depression_prediction_model.pkl')

user_data = {
    'Name': input("Enter name: "),
    'Gender': input("Gender (Male/Female): "),
    'Age': int(input("Age: ")),
    'City': input("City: "),
    'Working Professional or Student': input("Working Professional or Student: "),
    'Profession': input("Profession: "),
    'Academic Pressure': int(input("Academic Pressure (1-5): ")),
    'Work Pressure': int(input("Work Pressure (1-5): ")),
    'CGPA': float(input("CGPA (0-10): ")),
    'Study Satisfaction': int(input("Study Satisfaction (1-5): ")),
    'Job Satisfaction': int(input("Job Satisfaction (1-5): ")),
    'Sleep Duration': input("Sleep Duration: "),
    'Dietary Habits': input("Dietary Habits (Healthy/Moderate/Unhealthy): "),
    'Degree': input("Degree: "),
    'Have you ever had suicidal thoughts ?': input("Suicidal thoughts (Yes/No): "),
    'Work/Study Hours': int(input("Work/Study Hours: ")),
    'Financial Stress': int(input("Financial Stress (1-5): ")),
    'Family History of Mental Illness': input("Family History of Mental Illness (Yes/No): ")
}

new_data = pd.DataFrame([user_data])

prediction = model.predict(new_data)
probabilities = model.predict_proba(new_data)

result = "likely has depression" if prediction[0] == 'Yes' else "likely does not have depression"
confidence = probabilities[0][1]*100 if prediction[0] == 'Yes' else probabilities[0][0]*100

print(f"\nResult: The model predicts that {user_data['Name']} {result}.")
print(f"Confidence: {confidence:.2f}%")

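
The interactive cell converts each answer with `int(...)` or `float(...)` directly, so a single typo raises `ValueError` and discards everything entered so far. A hedged sketch of a validating prompt follows. `parse_int_in_range` and `ask_int` are hypothetical helpers, and the 1-5 range comes from the prompts in the cell:

```python
def parse_int_in_range(text, low, high):
    """Return text as an int within [low, high], or raise ValueError."""
    value = int(text.strip())
    if not low <= value <= high:
        raise ValueError(f"expected a value from {low} to {high}, got {value}")
    return value

def ask_int(prompt, low, high, read=input):
    """Keep asking until the answer parses and falls inside the range."""
    while True:
        try:
            return parse_int_in_range(read(prompt), low, high)
        except ValueError as err:
            print(f"Invalid input: {err}")

# Non-interactive demonstration: feed scripted answers instead of typing.
scripted = iter(['9', 'abc', '4'])
work_pressure = ask_int("Work Pressure (1-5): ", 1, 5, read=lambda _: next(scripted))
print("Accepted:", work_pressure)
```

In the notebook, a call such as `ask_int("Work Pressure (1-5): ", 1, 5)` could replace `int(input(...))`. The `read` parameter exists only so the helper can be exercised without a keyboard.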