# Depression on Study

![stress](../images/stress.jpg)

Studying can be stressful. Long study hours combined with high score targets can trigger significant levels of stress, so awareness of stress in students matters. This project builds a model that classifies depression in students, which can help with monitoring whether a student is experiencing depression and with suggesting next steps. Depression should be identified early so that affected individuals do not remain trapped in a depressive phase, which could lead to extreme actions such as self-harm or even suicide. If the model detects depression, users can contact a psychologist right away for a consultation.

Source of the data: https://www.kaggle.com/datasets/hopesb/student-depression-dataset/data

## Import the relevant libraries

In [121]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import pickle

## Load the data

In [4]:
data = pd.read_csv("../data/Student Depression Dataset.csv")
data

Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.90,5.0,0.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,140685,Female,27.0,Surat,Student,5.0,0.0,5.75,5.0,0.0,5-6 hours,Unhealthy,Class 12,Yes,7.0,1.0,Yes,0
27897,140686,Male,27.0,Ludhiana,Student,2.0,0.0,9.40,3.0,0.0,Less than 5 hours,Healthy,MSc,No,0.0,3.0,Yes,0
27898,140689,Male,31.0,Faridabad,Student,3.0,0.0,6.61,4.0,0.0,5-6 hours,Unhealthy,MD,No,12.0,2.0,No,0
27899,140690,Female,18.0,Ludhiana,Student,5.0,0.0,6.88,2.0,0.0,Less than 5 hours,Healthy,Class 12,Yes,10.0,5.0,No,1


## Preprocessing

The objectives of this part are:
- Standardize feature names and values: lowercase everything and use underscores instead of spaces in feature names
- Check for missing values and drop the affected rows
- Identify and exclude less important features
- Drop rows with unknown or `others` values
- Simplify the values of the `sleep_duration_hours` feature

In [8]:
# standardizing text
data.columns = data.columns.str.lower().str.replace(' ', '_')
data = data.rename(columns={'have_you_ever_had_suicidal_thoughts_?': 'suicidal_thoughts', 
                           'sleep_duration': 'sleep_duration_hours'})

str_columns = list(data.dtypes[data.dtypes == 'object'].index)

# lowercase the values of every text column
for cat in str_columns:
    data[cat] = data[cat].str.lower()

data

Unnamed: 0,id,gender,age,city,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration_hours,dietary_habits,degree,suicidal_thoughts,work/study_hours,financial_stress,family_history_of_mental_illness,depression
0,2,male,33.0,visakhapatnam,student,5.0,0.0,8.97,2.0,0.0,5-6 hours,healthy,b.pharm,yes,3.0,1.0,no,1
1,8,female,24.0,bangalore,student,2.0,0.0,5.90,5.0,0.0,5-6 hours,moderate,bsc,no,3.0,2.0,yes,0
2,26,male,31.0,srinagar,student,3.0,0.0,7.03,5.0,0.0,less than 5 hours,healthy,ba,no,9.0,1.0,yes,0
3,30,female,28.0,varanasi,student,3.0,0.0,5.59,2.0,0.0,7-8 hours,moderate,bca,yes,4.0,5.0,yes,1
4,32,female,25.0,jaipur,student,4.0,0.0,8.13,3.0,0.0,5-6 hours,moderate,m.tech,yes,1.0,1.0,no,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,140685,female,27.0,surat,student,5.0,0.0,5.75,5.0,0.0,5-6 hours,unhealthy,class 12,yes,7.0,1.0,yes,0
27897,140686,male,27.0,ludhiana,student,2.0,0.0,9.40,3.0,0.0,less than 5 hours,healthy,msc,no,0.0,3.0,yes,0
27898,140689,male,31.0,faridabad,student,3.0,0.0,6.61,4.0,0.0,5-6 hours,unhealthy,md,no,12.0,2.0,no,0
27899,140690,female,18.0,ludhiana,student,5.0,0.0,6.88,2.0,0.0,less than 5 hours,healthy,class 12,yes,10.0,5.0,no,1


In [10]:
# checking if there is missing data
data.isnull().sum()

id                                  0
gender                              0
age                                 0
city                                0
profession                          0
academic_pressure                   0
work_pressure                       0
cgpa                                0
study_satisfaction                  0
job_satisfaction                    0
sleep_duration_hours                0
dietary_habits                      0
degree                              0
suicidal_thoughts                   0
work/study_hours                    0
financial_stress                    3
family_history_of_mental_illness    0
depression                          0
dtype: int64

In [12]:
data[data['financial_stress'].isnull()]

Unnamed: 0,id,gender,age,city,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration_hours,dietary_habits,degree,suicidal_thoughts,work/study_hours,financial_stress,family_history_of_mental_illness,depression
4458,22377,female,32.0,varanasi,student,3.0,0.0,5.64,1.0,0.0,5-6 hours,healthy,bca,no,12.0,,no,1
13596,68910,male,29.0,hyderabad,student,2.0,0.0,8.94,3.0,0.0,less than 5 hours,unhealthy,b.ed,no,12.0,,yes,0
19266,97610,female,20.0,kolkata,student,1.0,0.0,6.83,1.0,0.0,5-6 hours,healthy,mbbs,no,9.0,,yes,0


Financial stress is related to depression: the higher it is, the more stress a student is under, which may lead to depression. Because this feature matters and only three rows are missing it, we drop the rows with a missing `financial_stress` value.
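A minimal sketch of the same step on a toy frame, assuming we only want to drop rows where `financial_stress` itself is missing. `dropna(subset=...)` is a standard pandas option; the toy values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up); only financial_stress contains a missing value.
toy = pd.DataFrame({
    "age": [32.0, 29.0, 20.0, 25.0],
    "financial_stress": [1.0, np.nan, 3.0, 5.0],
    "depression": [1, 0, 0, 1],
})

# Count missing values per column, as in the notebook.
missing_per_column = toy.isnull().sum()

# Drop only the rows where financial_stress is missing.
toy_clean = toy.dropna(subset=["financial_stress"])

print(missing_per_column)
print(toy_clean.shape)
```

Restricting the drop to one column keeps rows that might be missing values in features that are discarded later anyway. In this dataset the result is the same, because `financial_stress` is the only column with missing values.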

In [17]:
data = data.dropna()
data

Unnamed: 0,id,gender,age,city,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration_hours,dietary_habits,degree,suicidal_thoughts,work/study_hours,financial_stress,family_history_of_mental_illness,depression
0,2,male,33.0,visakhapatnam,student,5.0,0.0,8.97,2.0,0.0,5-6 hours,healthy,b.pharm,yes,3.0,1.0,no,1
1,8,female,24.0,bangalore,student,2.0,0.0,5.90,5.0,0.0,5-6 hours,moderate,bsc,no,3.0,2.0,yes,0
2,26,male,31.0,srinagar,student,3.0,0.0,7.03,5.0,0.0,less than 5 hours,healthy,ba,no,9.0,1.0,yes,0
3,30,female,28.0,varanasi,student,3.0,0.0,5.59,2.0,0.0,7-8 hours,moderate,bca,yes,4.0,5.0,yes,1
4,32,female,25.0,jaipur,student,4.0,0.0,8.13,3.0,0.0,5-6 hours,moderate,m.tech,yes,1.0,1.0,no,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,140685,female,27.0,surat,student,5.0,0.0,5.75,5.0,0.0,5-6 hours,unhealthy,class 12,yes,7.0,1.0,yes,0
27897,140686,male,27.0,ludhiana,student,2.0,0.0,9.40,3.0,0.0,less than 5 hours,healthy,msc,no,0.0,3.0,yes,0
27898,140689,male,31.0,faridabad,student,3.0,0.0,6.61,4.0,0.0,5-6 hours,unhealthy,md,no,12.0,2.0,no,0
27899,140690,female,18.0,ludhiana,student,5.0,0.0,6.88,2.0,0.0,less than 5 hours,healthy,class 12,yes,10.0,5.0,no,1


In [15]:
data.columns

Index(['id', 'gender', 'age', 'city', 'profession', 'academic_pressure',
       'work_pressure', 'cgpa', 'study_satisfaction', 'job_satisfaction',
       'sleep_duration_hours', 'dietary_habits', 'degree', 'suicidal_thoughts',
       'work/study_hours', 'financial_stress',
       'family_history_of_mental_illness', 'depression'],
      dtype='object')

In [19]:
categories = ['gender', 'profession', 'academic_pressure', 'work_pressure', 
              'study_satisfaction', 'job_satisfaction', 'sleep_duration_hours', 
              'dietary_habits', 'degree', 'suicidal_thoughts', 'financial_stress', 
              'family_history_of_mental_illness', 'depression']

for category in categories:
    print(data[category].value_counts())
    print('\n\n')

gender
male      15546
female    12352
Name: count, dtype: int64



profession
student                   27867
architect                     8
teacher                       6
digital marketer              3
content writer                2
chef                          2
doctor                        2
pharmacist                    2
civil engineer                1
ux/ui designer                1
educational consultant        1
manager                       1
lawyer                        1
entrepreneur                  1
Name: count, dtype: int64



academic_pressure
3.0    7461
5.0    6296
4.0    5155
1.0    4800
2.0    4177
0.0       9
Name: count, dtype: int64



work_pressure
0.0    27895
5.0        2
2.0        1
Name: count, dtype: int64



study_satisfaction
4.0    6359
2.0    5838
3.0    5820
1.0    5449
5.0    4422
0.0      10
Name: count, dtype: int64



job_satisfaction
0.0    27890
2.0        3
4.0        2
1.0        2
3.0        1
Name: count, dtype: int64



sleep_durati

After inspecting the features, we exclude those that give little information for training the model because one value dominates them, such as `profession` (dominated by `student`), `work_pressure`, `job_satisfaction`, and `degree`. We also drop rows whose value for a feature is `others`, because that effectively means unknown.

Other features such as `id` and `city` are also dropped because they are not relevant for modelling.
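One way to make the "dominated by one value" check explicit is to compute, for each column, the share of rows taken by its most frequent value. The sketch below does this on made-up data; the helper name `top_value_share` and the 0.8 threshold are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def top_value_share(frame: pd.DataFrame) -> pd.Series:
    """Return, for each column, the share of rows taken by its most frequent value."""
    shares = {
        col: frame[col].value_counts(normalize=True).iloc[0]
        for col in frame.columns
    }
    return pd.Series(shares).sort_values(ascending=False)

# Toy data (made up): 'profession' is dominated by one value, 'gender' is not.
toy = pd.DataFrame({
    "profession": ["student"] * 9 + ["teacher"],
    "gender": ["male"] * 5 + ["female"] * 5,
})

shares = top_value_share(toy)
dominated = shares[shares > 0.8].index.tolist()  # 0.8 is an arbitrary threshold

print(shares)
print(dominated)
```

Applied to the real data, a call such as `top_value_share(data[categories])` would rank the categorical features by how concentrated they are.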

In [22]:
clean_data = data[['gender', 'age', 'academic_pressure', 'cgpa', 
                   'study_satisfaction', 'sleep_duration_hours', 'dietary_habits',  
                   'suicidal_thoughts', 'work/study_hours', 'financial_stress',
                   'family_history_of_mental_illness', 'depression']]

clean_data

Unnamed: 0,gender,age,academic_pressure,cgpa,study_satisfaction,sleep_duration_hours,dietary_habits,suicidal_thoughts,work/study_hours,financial_stress,family_history_of_mental_illness,depression
0,male,33.0,5.0,8.97,2.0,5-6 hours,healthy,yes,3.0,1.0,no,1
1,female,24.0,2.0,5.90,5.0,5-6 hours,moderate,no,3.0,2.0,yes,0
2,male,31.0,3.0,7.03,5.0,less than 5 hours,healthy,no,9.0,1.0,yes,0
3,female,28.0,3.0,5.59,2.0,7-8 hours,moderate,yes,4.0,5.0,yes,1
4,female,25.0,4.0,8.13,3.0,5-6 hours,moderate,yes,1.0,1.0,no,0
...,...,...,...,...,...,...,...,...,...,...,...,...
27896,female,27.0,5.0,5.75,5.0,5-6 hours,unhealthy,yes,7.0,1.0,yes,0
27897,male,27.0,2.0,9.40,3.0,less than 5 hours,healthy,no,0.0,3.0,yes,0
27898,male,31.0,3.0,6.61,4.0,5-6 hours,unhealthy,no,12.0,2.0,no,0
27899,female,18.0,5.0,6.88,2.0,less than 5 hours,healthy,yes,10.0,5.0,no,1


In [24]:
# exclude rows whose value is 'others'
clean_data = clean_data[clean_data['sleep_duration_hours'] != 'others']
clean_data = clean_data[clean_data['dietary_habits'] != 'others']
clean_data

Unnamed: 0,gender,age,academic_pressure,cgpa,study_satisfaction,sleep_duration_hours,dietary_habits,suicidal_thoughts,work/study_hours,financial_stress,family_history_of_mental_illness,depression
0,male,33.0,5.0,8.97,2.0,5-6 hours,healthy,yes,3.0,1.0,no,1
1,female,24.0,2.0,5.90,5.0,5-6 hours,moderate,no,3.0,2.0,yes,0
2,male,31.0,3.0,7.03,5.0,less than 5 hours,healthy,no,9.0,1.0,yes,0
3,female,28.0,3.0,5.59,2.0,7-8 hours,moderate,yes,4.0,5.0,yes,1
4,female,25.0,4.0,8.13,3.0,5-6 hours,moderate,yes,1.0,1.0,no,0
...,...,...,...,...,...,...,...,...,...,...,...,...
27896,female,27.0,5.0,5.75,5.0,5-6 hours,unhealthy,yes,7.0,1.0,yes,0
27897,male,27.0,2.0,9.40,3.0,less than 5 hours,healthy,no,0.0,3.0,yes,0
27898,male,31.0,3.0,6.61,4.0,5-6 hours,unhealthy,no,12.0,2.0,no,0
27899,female,18.0,5.0,6.88,2.0,less than 5 hours,healthy,yes,10.0,5.0,no,1


In [26]:
# Simplify the value within sleep_duration_hours
clean_data.sleep_duration_hours.value_counts()

sleep_duration_hours
less than 5 hours    8304
7-8 hours            7343
5-6 hours            6178
more than 8 hours    6043
Name: count, dtype: int64

In [30]:
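sleep_duration_map = {'less than 5 hours': '<5',
                      '5-6 hours': '5-6',
                      '7-8 hours': '7-8',
                      'more than 8 hours': '>8'}
# assign() returns a new frame, which avoids a possible SettingWithCopyWarning on the filtered frame
clean_data = clean_data.assign(
    sleep_duration_hours=clean_data['sleep_duration_hours'].map(sleep_duration_map))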

clean_data.sleep_duration_hours.value_counts()

sleep_duration_hours
<5     8304
7-8    7343
5-6    6178
>8     6043
Name: count, dtype: int64

## EDA

In [49]:
from sklearn.metrics import mutual_info_score

In [57]:
clean_data.columns

Index(['gender', 'age', 'academic_pressure', 'cgpa', 'study_satisfaction',
       'sleep_duration_hours', 'dietary_habits', 'suicidal_thoughts',
       'work/study_hours', 'financial_stress',
       'family_history_of_mental_illness', 'depression'],
      dtype='object')

In [73]:
features = ['gender', 'age', 'academic_pressure', 'cgpa', 'study_satisfaction',
       'sleep_duration_hours', 'dietary_habits', 'suicidal_thoughts',
       'work/study_hours', 'financial_stress','family_history_of_mental_illness']

def mutual_info_depression_score(series):
    # mutual information between one feature and the depression label
    return mutual_info_score(series, clean_data.depression)

mi = clean_data[features].apply(mutual_info_depression_score)
mi.sort_values(ascending=False)



suicidal_thoughts                   0.154711
academic_pressure                   0.122142
financial_stress                    0.068693
age                                 0.031684
work/study_hours                    0.023162
dietary_habits                      0.021896
study_satisfaction                  0.014407
cgpa                                0.009658
sleep_duration_hours                0.004981
family_history_of_mental_illness    0.001424
gender                              0.000002
dtype: float64

The mutual information scores show that `gender` tells us almost nothing about `depression`: its score (about 0.000002) is far too small to be useful. We therefore drop `gender` from the dataset.
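To give a feel for the scale of these scores, here is a small sketch with made-up labels. A feature that determines the target exactly reaches the entropy of the target (ln 2 for a balanced binary label, in nats), while a feature spread evenly across both classes scores zero.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Toy labels (made up): a balanced binary target.
target = [0, 0, 1, 1]

# A feature that determines the target exactly.
informative = ["no", "no", "yes", "yes"]
# A feature whose values are spread evenly across both classes.
uninformative = ["a", "b", "a", "b"]

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(mi_informative)    # equals ln(2), the entropy of the target in nats
print(mi_uninformative)  # zero up to floating-point error
```

Against that scale, the score of `gender` sits at the "uninformative" end, while `suicidal_thoughts` and `academic_pressure` carry the most information about the target.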

In [36]:
clean_data.describe()

Unnamed: 0,age,academic_pressure,cgpa,study_satisfaction,work/study_hours,financial_stress,depression
count,27868.0,27868.0,27868.0,27868.0,27868.0,27868.0,27868.0
mean,25.820942,3.141345,7.656252,2.943663,7.157923,3.140161,0.585546
std,4.905716,1.381616,1.470663,1.361012,3.707154,1.437275,0.492636
min,18.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,21.0,2.0,6.29,2.0,4.0,2.0,0.0
50%,25.0,3.0,7.77,3.0,8.0,3.0,1.0
75%,30.0,4.0,8.92,4.0,10.0,4.0,1.0
max,59.0,5.0,10.0,5.0,12.0,5.0,1.0


The truly numerical features are `age`, `cgpa`, and `work/study_hours`. The remaining columns in this summary are ordinal, except `depression`, which is the binary target. From the numerical features we can observe:

1. The `age` distribution is fairly balanced, with a median of 25 and a mean of about 25.8.
2. The mean and median of `cgpa` are both around 7.7, which suggests that most students are not failing.
3. At least half of the students work or study 8 hours or more: the median of `work/study_hours` is 8.

In [76]:
final_data = clean_data.drop('gender', axis=1)
final_data

Unnamed: 0,age,academic_pressure,cgpa,study_satisfaction,sleep_duration_hours,dietary_habits,suicidal_thoughts,work/study_hours,financial_stress,family_history_of_mental_illness,depression
0,33.0,5.0,8.97,2.0,5-6,healthy,yes,3.0,1.0,no,1
1,24.0,2.0,5.90,5.0,5-6,moderate,no,3.0,2.0,yes,0
2,31.0,3.0,7.03,5.0,<5,healthy,no,9.0,1.0,yes,0
3,28.0,3.0,5.59,2.0,7-8,moderate,yes,4.0,5.0,yes,1
4,25.0,4.0,8.13,3.0,5-6,moderate,yes,1.0,1.0,no,0
...,...,...,...,...,...,...,...,...,...,...,...
27896,27.0,5.0,5.75,5.0,5-6,unhealthy,yes,7.0,1.0,yes,0
27897,27.0,2.0,9.40,3.0,<5,healthy,no,0.0,3.0,yes,0
27898,31.0,3.0,6.61,4.0,5-6,unhealthy,no,12.0,2.0,no,0
27899,18.0,5.0,6.88,2.0,<5,healthy,yes,10.0,5.0,no,1


## Modeling

In [78]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

In [127]:
df_full_train, df_test = train_test_split(final_data, test_size=0.2, random_state=12)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=12)
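The two calls above produce a 60/20/20 split: 20% is held out for testing, and 25% of the remaining 80% (which is 20% of the whole) becomes the validation set. The sketch below checks the resulting sizes on a stand-in frame with the same number of rows as the cleaned data (27868, taken from the `describe()` output); the frame itself is synthetic.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as the cleaned data (27868).
toy = pd.DataFrame({"x": range(27868)})

# First hold out 20% for testing, then 25% of the remaining 80% for validation.
full_train, test = train_test_split(toy, test_size=0.2, random_state=12)
train, val = train_test_split(full_train, test_size=0.25, random_state=12)

sizes = {"train": len(train), "val": len(val), "test": len(test)}

print(sizes)
print({name: round(n / len(toy), 3) for name, n in sizes.items()})
```

The validation and test sizes agree with the `support` totals of 5574 shown in the classification reports.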

In [129]:
y_train = df_train.depression.values
y_val = df_val.depression.values
y_test = df_test.depression.values

del df_train['depression']
del df_val['depression']
del df_test['depression']

In [131]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [133]:
train_dict = df_train.to_dict(orient='records')
val_dict = df_val.to_dict(orient='records')
test_dict = df_test.to_dict(orient='records')

In [134]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)
X_test = dv.transform(test_dict)

### Logistic Regression

In [138]:
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_val)
print(f'Training accuracy {round(lr.score(X_train, y_train), 3)}')
print(classification_report(y_val, y_pred))

Training accuracy 0.846
              precision    recall  f1-score   support

           0       0.82      0.80      0.81      2304
           1       0.86      0.88      0.87      3270

    accuracy                           0.85      5574
   macro avg       0.84      0.84      0.84      5574
weighted avg       0.84      0.85      0.84      5574



In [140]:
# Tuning Logistic Regression Model
reg = [0.01, 0.1, 1, 10, 100]  # candidate values for C (inverse regularization strength)
max_iter = [100, 200, 300]

scores = []

for c in reg:
    for n_iter in max_iter:  # n_iter avoids shadowing the built-in iter()
        lr = LogisticRegression(C=c, max_iter=n_iter).fit(X_train, y_train)

        y_pred = lr.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        scores.append((c, n_iter, auc))

df_scores = pd.DataFrame(scores, columns=['C', 'max_iter', 'auc'])
df_scores.sort_values(by='auc', ascending=False)[:10]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,C,max_iter,auc
3,0.1,100,0.923181
1,0.01,200,0.923178
2,0.01,300,0.923178
0,0.01,100,0.923175
4,0.1,200,0.923134
5,0.1,300,0.923134
6,1.0,100,0.92311
7,1.0,200,0.92311
8,1.0,300,0.92311
9,10.0,100,0.923101
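The convergence warning in the output suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling option, on synthetic data with features on different scales, could look like the following. This is not what the notebook does, and its effect on the AUC for the real data has not been checked here; the variable names and the data-generating rule are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(25, 5, size=500),      # an age-like column
    rng.integers(0, 2, size=500),     # a 0/1 indicator column
])
noise = rng.normal(0, 3, size=500)
y_demo = (X_demo[:, 0] + 10 * X_demo[:, 1] + noise > 30).astype(int)

# Standardize features before fitting so lbfgs works on comparable scales.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

# Number of solver iterations actually used by the final step.
n_iterations = scaled_lr[-1].n_iter_[0]
print(n_iterations)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the same object can be applied to validation and test data without leaking their statistics.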


In [151]:
# Best parameter for Logistic Regression
lr = LogisticRegression(C=0.1, max_iter=100).fit(X_train, y_train)
y_pred = lr.predict(X_val)
print(f'Training accuracy {round(lr.score(X_train, y_train), 3)}')
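# note: y_pred holds hard class labels from predict(), so this AUC is not
# comparable with the probability-based AUC in the tuning table
print(f'AUC {round(roc_auc_score(y_val, y_pred), 3)}')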
print(classification_report(y_val, y_pred))

Training accuracy 0.845
AUC 0.838
              precision    recall  f1-score   support

           0       0.83      0.79      0.81      2304
           1       0.86      0.88      0.87      3270

    accuracy                           0.85      5574
   macro avg       0.84      0.84      0.84      5574
weighted avg       0.84      0.85      0.84      5574



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Random Forest Classifier 

In [145]:
scores = []

for d in [5, 10, 15]:
    for n in range(10, 201, 10):
        rf = RandomForestClassifier(n_estimators=n, max_depth=d, random_state=1)
        rf.fit(X_train, y_train)
    
        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        scores.append((d, n, auc))

df_scores = pd.DataFrame(scores, columns=['max_depth', 'n_estimators', 'auc'])
df_scores.sort_values(by='auc', ascending=False)[:10]

Unnamed: 0,max_depth,n_estimators,auc
37,10,180,0.917502
34,10,150,0.917493
36,10,170,0.917487
35,10,160,0.917457
39,10,200,0.91744
33,10,140,0.917431
38,10,190,0.917418
31,10,120,0.91738
30,10,110,0.917341
32,10,130,0.91731


In [153]:
# Best parameter for Random Forest Classifier
rf = RandomForestClassifier(n_estimators=180, max_depth=10, random_state=1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_val)
print(f'Training accuracy {round(rf.score(X_train, y_train), 3)}')
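# note: y_pred holds hard class labels from predict(), so this AUC is not
# comparable with the probability-based AUC in the tuning table
print(f'AUC {round(roc_auc_score(y_val, y_pred), 3)}')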
print(classification_report(y_val, y_pred))

Training accuracy 0.891
AUC 0.834
              precision    recall  f1-score   support

           0       0.83      0.78      0.80      2304
           1       0.85      0.88      0.87      3270

    accuracy                           0.84      5574
   macro avg       0.84      0.83      0.84      5574
weighted avg       0.84      0.84      0.84      5574



### Picking the best model

After training and tuning, `LogisticRegression` is the best model: its best validation AUC in the tuning table is about 0.923, against about 0.918 for the random forest. We therefore retrain it on `df_full_train`, then export the model together with the `DictVectorizer`.
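The AUC values in the tuning tables (about 0.92) are computed from `predict_proba`, while the AUC lines printed next to the classification reports (about 0.84) are computed from `predict`. A small sketch with made-up numbers shows why the two are not comparable: thresholding the scores into hard labels throws away the ranking information that AUC measures.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for six samples.
y_true = [0, 0, 0, 1, 1, 1]
y_proba = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard labels at a 0.5 threshold, similar to what predict() returns.
y_label = [int(p >= 0.5) for p in y_proba]

auc_from_proba = roc_auc_score(y_true, y_proba)
auc_from_labels = roc_auc_score(y_true, y_label)

print(round(auc_from_proba, 3))
print(round(auc_from_labels, 3))
```

When models are compared by AUC, passing the positive-class probabilities (`predict_proba(X)[:, 1]`) keeps the comparison consistent with the tuning tables.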

In [155]:
y_full_train = df_full_train.depression.values
del(df_full_train['depression'])
df_full_train = df_full_train.reset_index(drop=True)
full_train_dict = df_full_train.to_dict(orient='records')
X_full_train = dv.transform(full_train_dict)

In [157]:
# Best parameter for Logistic Regression
lr = LogisticRegression(C=0.1, max_iter=100).fit(X_full_train, y_full_train)
y_pred = lr.predict(X_test)

print(f'Training accuracy {round(lr.score(X_full_train, y_full_train), 3)}')
# note: y_pred holds hard class labels from predict(), not predicted probabilities
print(f'AUC {round(roc_auc_score(y_test, y_pred), 3)}')
print(classification_report(y_test, y_pred))

Training accuracy 0.846
AUC 0.845
              precision    recall  f1-score   support

           0       0.84      0.80      0.82      2280
           1       0.86      0.89      0.88      3294

    accuracy                           0.85      5574
   macro avg       0.85      0.85      0.85      5574
weighted avg       0.85      0.85      0.85      5574



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [159]:
# Export the dv and model
with open('../model/dv n model.bin', 'wb') as f_out:
    pickle.dump((dv, lr), f_out)
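A sketch of how the exported pair could be loaded and used to score one student. To stay self-contained it trains a tiny stand-in model on made-up records and writes it to a temporary file; the names `dv_demo`, `lr_demo`, the file name, and the two-feature records are all assumptions, and a real record would need the same features as `final_data`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in for the trained objects (made-up records, not the real data).
records = [
    {"academic_pressure": 5.0, "suicidal_thoughts": "yes"},
    {"academic_pressure": 1.0, "suicidal_thoughts": "no"},
    {"academic_pressure": 4.0, "suicidal_thoughts": "yes"},
    {"academic_pressure": 2.0, "suicidal_thoughts": "no"},
]
labels = [1, 0, 1, 0]

dv_demo = DictVectorizer(sparse=False)
lr_demo = LogisticRegression().fit(dv_demo.fit_transform(records), labels)

# Save the pair the same way as the notebook, here to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "dv_model_demo.bin")
with open(path, "wb") as f_out:
    pickle.dump((dv_demo, lr_demo), f_out)

# Load the pair back and score one new record.
with open(path, "rb") as f_in:
    dv_loaded, lr_loaded = pickle.load(f_in)

student = {"academic_pressure": 5.0, "suicidal_thoughts": "yes"}
probability = lr_loaded.predict_proba(dv_loaded.transform([student]))[0, 1]
print(round(probability, 3))
```

Storing the vectorizer and the model in one file keeps the feature encoding and the coefficients in sync when the model is served elsewhere.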