<a href="https://colab.research.google.com/github/LucasFelipeNunes/Exercicios-Inteligencia-Artificial/blob/main/Atividade_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Atividade de Inteligência Artificial

## Autoria

**Alunos:** Lucas Felipe da Silva Nunes e Luiz Gustavo Duarte Chagas

**Professor:** Cristóvão José Dias da Cunha

**Curso:** Análise e Desenvolvimento de Sistemas (6º ADS)

**Faculdade de Tecnologia de Guaratinguetá Professor João Mod**

## Escopo

Neste projeto, é analisada uma [base de dados](https://www.kaggle.com/datasets/bhadramohit/mental-health-dataset), no formato CSV, que contém informações sobre pacientes que foram registrados em um hospital, incluindo dados pessoais, profissionais e a informação de se eles tem uma condição de saúde mental ou não.

Este notebook busca analisar como pode se melhor treinar uma Inteligência Artificial para analisar esse tipo de dados e fazer predições com base neles.

## Importação das Bibliotecas

Para isso, primeiramente, importa-se as bibliotecas que serão utilizadas - incluiundo a do próprio Google Drive, em que se está armazenado o banco de dados.

In [1]:
!pip install lazypredict
!pip install scikit-learn



In [2]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

from lazypredict.Supervised import LazyClassifier, LazyRegressor
from sklearn.model_selection import train_test_split

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Funções Auxiliares

Além disso, inclui-se como função auxiliar uma corretora de cores, para facilitar a visualização dos dados na tabela de correlação que será analisada posteriormente.

### Função para Colorir a Tabela de Correlação

In [4]:
def color_corr ( value_str ):
  try:
    value = float(value_str)
    if value >= 0.6 or value <= -0.6:
      color = 'red'
    else:
      color = 'blue'
  except:
          color = 'blue'
  return 'color: %s' % color

## Importação de Dados

Começando o fluxo principal do código, a primeira coisa que se faz é importar os dados da base.

In [34]:
df=pd.read_csv('/content/drive/My Drive/mental_health_dataset_corrigido.csv', delimiter=';')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

In [35]:
print(df.columns)

Index(['User_ID', 'Age', 'Gender', 'Occupation', 'Country',
       'Mental_Health_Condition', 'Severity', 'Consultation_History',
       'Stress_Level', 'Sleep_Hours', 'Work_Hours', 'Physical_Activity_Hours'],
      dtype='object')


Para se visualizar o formato da base de dados, pode se imprimir as suas primeiras linhas.

In [36]:
df.head(20)

Unnamed: 0,User_ID,Age,Gender,Occupation,Country,Mental_Health_Condition,Severity,Consultation_History,Stress_Level,Sleep_Hours,Work_Hours,Physical_Activity_Hours
0,1,36,Non-binary,Sales,Canada,No,,Yes,Medium,7.1,46,5
1,2,34,Female,Education,UK,Yes,Low,No,Low,7.5,47,8
2,3,65,Non-binary,Sales,USA,Yes,High,No,Low,8.4,58,10
3,4,34,Male,Other,Australia,No,,No,Medium,9.8,30,2
4,5,22,Female,Healthcare,Canada,Yes,Low,No,Medium,4.9,62,5
5,6,64,Non-binary,IT,UK,Yes,High,No,High,6.3,34,0
6,7,26,Female,Engineering,UK,No,,No,Medium,5.1,58,6
7,8,57,Male,IT,UK,Yes,Medium,Yes,High,4.2,57,9
8,9,25,Male,Education,USA,Yes,Low,No,Medium,6.7,67,9
9,10,65,Non-binary,Healthcare,India,No,,Yes,Low,8.9,57,6


## Retirando Colunas Inutilzadas

Como se pode ver, a tabela contém uma coluna que não diz nada sobre o paciente ou sua condição mental: a User_ID. Portanto, pode-se removê-la.

In [37]:
df.drop(['User_ID'],axis=1,inplace=True)

Além disso, a coluna **Severity** não será utilizada por este modelo; afinal, não se sabe a serveridade de uma condição mental antes que descubra, primeiro, se ela existe. Removendo esta coluna:

In [38]:
df.drop(['Severity'],axis=1,inplace=True)

Após este processo, as colunas do base de dados ficam conforme se segue:

In [39]:
df.head()

Unnamed: 0,Age,Gender,Occupation,Country,Mental_Health_Condition,Consultation_History,Stress_Level,Sleep_Hours,Work_Hours,Physical_Activity_Hours
0,36,Non-binary,Sales,Canada,No,Yes,Medium,7.1,46,5
1,34,Female,Education,UK,Yes,No,Low,7.5,47,8
2,65,Non-binary,Sales,USA,Yes,No,Low,8.4,58,10
3,34,Male,Other,Australia,No,No,Medium,9.8,30,2
4,22,Female,Healthcare,Canada,Yes,No,Medium,4.9,62,5


## Transformando Colunas Categóricas em Numéricas

Também é importante se analisar as colunas categóricas: ou seja, aquelas que contém uma enumeração de valores que podem estar registradas. A intenção e transformá-las em numéricas, para facilitar a análise do modelo. Nesta base de dados, essas colunas são:

In [40]:
df.select_dtypes(include=['object'])

Unnamed: 0,Gender,Occupation,Country,Mental_Health_Condition,Consultation_History,Stress_Level
0,Non-binary,Sales,Canada,No,Yes,Medium
1,Female,Education,UK,Yes,No,Low
2,Non-binary,Sales,USA,Yes,No,Low
3,Male,Other,Australia,No,No,Medium
4,Female,Healthcare,Canada,Yes,No,Medium
...,...,...,...,...,...,...
495,Non-binary,Healthcare,Canada,Yes,Yes,High
496,Female,Sales,UK,Yes,Yes,Medium
497,Male,IT,Canada,Yes,Yes,High
498,Non-binary,Healthcare,India,Yes,Yes,Medium


Para se fazer este processo, primeiro atribui-se as colunas a uma variável **colunas_object**.

In [41]:
colunas_object = list(df.select_dtypes(include=['object']).columns)
colunas_object

['Gender',
 'Occupation',
 'Country',
 'Mental_Health_Condition',
 'Consultation_History',
 'Stress_Level']

### Transformando-as em Colunas Númericas (Técnica Dummie)

Depois, utiliza-se este objeto para orientar a função **get_dummies** para transformar as colunas categóricas em numéricas pela Técnica Dummie.

In [42]:
df = pd.get_dummies( df, columns = colunas_object )

Após estes processos, as colunas da tabela ficam da seguinte forma:

In [43]:
df.head()

Unnamed: 0,Age,Sleep_Hours,Work_Hours,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,...,Country_Other,Country_UK,Country_USA,Mental_Health_Condition_No,Mental_Health_Condition_Yes,Consultation_History_No,Consultation_History_Yes,Stress_Level_High,Stress_Level_Low,Stress_Level_Medium
0,36,7.1,46,5,False,False,True,False,False,False,...,False,False,False,True,False,False,True,False,False,True
1,34,7.5,47,8,True,False,False,False,True,False,...,False,True,False,False,True,True,False,False,True,False
2,65,8.4,58,10,False,False,True,False,False,False,...,False,False,True,False,True,True,False,False,True,False
3,34,9.8,30,2,False,True,False,False,False,False,...,False,False,False,True,False,True,False,False,False,True
4,22,4.9,62,5,True,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,True


## Removendo Dados com Alta Correlação

Como se pode notar, as colunas que permitem apenas o registro de valores em uma enumeração entre dois possíveis geram colunas númericas completamente inversamente correlacionadas. Desta forma, pode-se retirar uma das colunas númericas geradas neste caso. Analisando a tabela de correlação para ver quais são os casos:

In [44]:
df.corr().style.applymap(color_corr)

Unnamed: 0,Age,Sleep_Hours,Work_Hours,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,Occupation_Finance,Occupation_Healthcare,Occupation_IT,Occupation_Other,Occupation_Sales,Country_Australia,Country_Canada,Country_Germany,Country_India,Country_Other,Country_UK,Country_USA,Mental_Health_Condition_No,Mental_Health_Condition_Yes,Consultation_History_No,Consultation_History_Yes,Stress_Level_High,Stress_Level_Low,Stress_Level_Medium
Age,1.0,0.004539,0.004613,-0.02711,-0.098657,0.042083,0.047862,0.015614,0.06302,-8.3e-05,-0.010011,-0.004438,0.014918,0.05947,-0.072802,0.074515,-0.042096,-0.063433,-0.008776,-0.008934,0.023685,0.021145,0.030686,-0.030686,0.020709,-0.020709,0.07281,0.032301,-0.097738
Sleep_Hours,0.004539,1.0,-0.244355,0.058446,0.004045,-0.120724,0.024852,0.188767,0.09679,-0.093376,0.046696,-0.16439,0.011855,0.279747,-0.007785,0.110292,-0.016549,-0.004995,0.012947,0.100934,-0.157637,-0.002598,0.663832,-0.663832,0.100059,-0.100059,-0.062763,0.008455,0.047098
Work_Hours,0.004613,-0.244355,1.0,0.05625,-0.007479,0.040106,0.027443,-0.126034,-0.120198,0.017931,-0.016505,0.121333,-0.053885,-0.096439,0.052646,-0.061182,0.057854,0.002724,0.032074,-0.038836,-0.064501,0.036891,-0.17383,0.17383,-0.151829,0.151829,-0.060297,0.053727,-0.001431
Physical_Activity_Hours,-0.02711,0.058446,0.05625,1.0,-0.034237,0.001162,0.018629,0.028962,-0.091576,0.025233,0.007144,-0.018232,0.043671,0.03309,-0.004726,-0.001586,0.008665,-0.084598,0.031279,0.025297,-0.014089,0.044673,0.027034,-0.027034,-0.067983,0.067983,-0.041272,-0.010586,0.047496
Gender_Female,-0.098657,0.004045,-0.007479,-0.034237,1.0,-0.438847,-0.474472,-0.155517,0.033608,-0.012041,0.010275,-0.078215,-0.026601,-0.005031,0.077885,-0.021972,-0.053032,0.089904,0.084695,-0.06386,0.01769,-0.083125,0.016855,-0.016855,0.062637,-0.062637,0.056928,-0.04256,-0.007007
Gender_Male,0.042083,-0.120724,0.040106,0.001162,-0.438847,1.0,-0.472239,-0.154785,0.003637,0.043554,-0.074889,0.123246,0.025516,-0.004001,-0.12864,0.025666,0.024865,-0.091784,-0.076012,-0.032299,-0.033797,0.153499,-0.051509,0.051509,-0.10924,0.10924,-0.124316,0.040689,0.068782
Gender_Non-binary,0.047862,0.024852,0.027443,0.018629,-0.474472,-0.472239,1.0,-0.16735,-0.048715,-0.036228,0.042122,-0.001186,-0.003947,-0.002357,0.045883,-0.019363,0.051201,-0.023956,0.029494,0.048925,-0.007984,-0.053851,0.007521,-0.007521,-0.009268,0.009268,0.057366,-0.002378,-0.048524
Gender_Prefer not to say,0.015614,0.188767,-0.126034,0.028962,-0.155517,-0.154785,-0.16735,1.0,0.026499,0.011942,0.044083,-0.090469,0.01076,0.023754,0.0071,0.033683,-0.050526,0.05459,-0.081148,0.095227,0.050312,-0.03075,0.055705,-0.055705,0.116054,-0.116054,0.017176,0.009105,-0.024577
Occupation_Education,0.06302,0.09679,-0.120198,-0.091576,0.033608,0.003637,-0.048715,0.026499,1.0,-0.156131,-0.081797,-0.146665,-0.121165,-0.072351,-0.177634,0.011511,-0.044608,0.036223,0.018205,-0.043261,0.041328,-0.036764,0.046199,-0.046199,0.037401,-0.037401,0.016998,-0.028917,0.014498
Occupation_Engineering,-8.3e-05,-0.093376,0.017931,0.025233,-0.012041,0.043554,-0.036228,0.011942,-0.156131,1.0,-0.139265,-0.249707,-0.206292,-0.123182,-0.302433,0.058281,-0.027029,-0.104262,-0.034673,0.031566,0.077866,0.023605,-0.018521,0.018521,0.096559,-0.096559,0.046237,0.075851,-0.118708


Pode-se notar que existem duas colunas categóricas que geram tal correlação: **Mental_Health_Condition** e **Consultation_History**, ambas que tem como opções apenas **Yes** e **No**. Portanto, pode-se remover uma dessas duas colunas numéricas geradas por cada campo (aqui escolheu-se remover as da opção **No**).

In [45]:
df.drop(['Mental_Health_Condition_No', 'Consultation_History_No'], axis=1, inplace=True)

Além disso, vale-se observar também as colunas com mais de duas opções. Caso as opções sejam compreensivas, de forma que não tenha como existir outra opção além das apresentadas na base de dados, pode-se remover uma das colunas - já que a abscência de valores em todas as outras implicaria, necessariamente, no preenchimento dela.

As colunas **Occupation** e **Country** cumprem este requisito, por causa da opção **Other**. Então, pode-se remover a coluna com esta opção. Além disso, a **Stress_Level** também cumpre: neste contexto, não há outra opção para o nível de estresse além de baixo (**Low**), médio (**Medium**) ou alto (**High**). Portanto, pode-se remover uma de suas colunas também (aqui removeu-se a do **High**).

In [46]:
df.drop(['Country_Other'], axis=1, inplace=True)
df.drop(['Occupation_Other'], axis=1, inplace=True)
df.drop(['Stress_Level_High'], axis=1, inplace=True)

Após as remoções, os campos do dataset ficaram como se segue.

In [47]:
df.head()

Unnamed: 0,Age,Sleep_Hours,Work_Hours,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,...,Country_Australia,Country_Canada,Country_Germany,Country_India,Country_UK,Country_USA,Mental_Health_Condition_Yes,Consultation_History_Yes,Stress_Level_Low,Stress_Level_Medium
0,36,7.1,46,5,False,False,True,False,False,False,...,False,True,False,False,False,False,False,True,False,True
1,34,7.5,47,8,True,False,False,False,True,False,...,False,False,False,False,True,False,True,False,True,False
2,65,8.4,58,10,False,False,True,False,False,False,...,False,False,False,False,False,True,True,False,True,False
3,34,9.8,30,2,False,True,False,False,False,False,...,True,False,False,False,False,False,False,False,False,True
4,22,4.9,62,5,True,False,False,False,False,False,...,False,True,False,False,False,False,True,False,False,True


## Separando Dados de Treino e Teste

Após isso, pode-se separar os dados da base em dados de treinamento do modelo (que correspondem a 80% do total), e dados de teste (que correspondem a 20% do total).

In [48]:
train , test = train_test_split(df, train_size=0.8, random_state=64)

## Separando as Variáveis Alvo (Y) das Características (X)

Feito isso, pode-se separar o campo da variável alvo Y dos campos das características X. Neste caso, como se busca observar a tendência de condições de saúde mental, separa-se o campo **Mental_Health_Condition_Yes** como o Y, dos outros campos X.

In [49]:
train_x = train.drop(columns=['Mental_Health_Condition_Yes'], axis=1)
train_y = train['Mental_Health_Condition_Yes']

In [50]:
test_x = test.drop(columns=['Mental_Health_Condition_Yes'], axis=1)
test_y = test['Mental_Health_Condition_Yes']

Visualizando os resultados destes processos:

In [51]:
train_x.head()

Unnamed: 0,Age,Sleep_Hours,Work_Hours,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,...,Occupation_Sales,Country_Australia,Country_Canada,Country_Germany,Country_India,Country_UK,Country_USA,Consultation_History_Yes,Stress_Level_Low,Stress_Level_Medium
462,45,5.3,63,4,False,False,True,False,False,False,...,True,True,False,False,False,False,False,True,True,False
403,35,5.3,60,4,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
415,36,5.2,64,6,False,True,False,False,False,False,...,False,True,False,False,False,False,False,True,False,True
127,55,6.5,67,0,False,False,True,False,False,True,...,False,False,True,False,False,False,False,True,False,True
417,31,5.0,60,4,True,False,False,False,False,True,...,False,False,True,False,False,False,False,False,False,False


In [52]:
train_y.head()

Unnamed: 0,Mental_Health_Condition_Yes
462,True
403,True
415,True
127,False
417,True


In [53]:
test_x.head()

Unnamed: 0,Age,Sleep_Hours,Work_Hours,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,...,Occupation_Sales,Country_Australia,Country_Canada,Country_Germany,Country_India,Country_UK,Country_USA,Consultation_History_Yes,Stress_Level_Low,Stress_Level_Medium
240,41,4.6,63,1,False,False,True,False,False,False,...,False,False,False,False,False,True,False,True,False,False
340,46,5.6,60,8,False,True,False,False,False,False,...,True,False,False,False,False,False,True,True,True,False
215,31,7.3,60,6,True,False,False,False,False,True,...,False,False,False,False,True,False,False,False,True,False
57,54,6.0,71,10,False,False,True,False,True,False,...,False,False,False,False,False,True,False,True,False,True
244,31,7.3,55,8,True,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,True


In [54]:
test_y.head()

Unnamed: 0,Mental_Health_Condition_Yes
240,True
340,True
215,False
57,False
244,False


## Treinando o Modelo

Nesta etapa, treina-se o modelo com os algorítimos supervisionados de classificação. Foram escolhidos seis algoritmos para o treinamento:

- Regressão Logística
- Análise Discriminante Linear
- Árvore de Decisão
- K Vizinhos Mais Próximos
- Máquinas de Vetores de Suporte
- Floresta Aleatória

In [55]:
LR  = LogisticRegression(solver='lbfgs', max_iter=1000).fit(train_x, train_y)
LDA = LinearDiscriminantAnalysis().fit(train_x, train_y)
DT  = DecisionTreeClassifier().fit(train_x, train_y)
KN  = KNeighborsClassifier().fit(train_x, train_y)
SVM = SVC().fit(train_x,train_y)
RF  = RandomForestClassifier().fit(train_x,train_y)

## Métricas da Matriz de Confusão

Feito o treinamento, pode-se obter as métricas de cada algoritmo com os dados de teste

In [56]:
metricas = {'Acurácia': [LR.score(test_x,test_y),
                         LDA.score(test_x,test_y),
                         DT.score(test_x,test_y),
                         KN.score(test_x,test_y),
                         SVM.score(test_x,test_y),
                         RF.score(test_x,test_y)],
            'Precisão': [precision_score(test_y, LR.predict(test_x)),
                      precision_score(test_y, LDA.predict(test_x)),
                      precision_score(test_y, DT.predict(test_x)),
                      precision_score(test_y, KN.predict(test_x)),
                      precision_score(test_y, SVM.predict(test_x)),
                      precision_score(test_y, RF.predict(test_x))],
            'Revocação' : [recall_score(test_y, RF.predict(test_x)),
                    recall_score(test_y, LDA.predict(test_x)),
                    recall_score(test_y, DT.predict(test_x)),
                    recall_score(test_y, KN.predict(test_x)),
                    recall_score(test_y, SVM.predict(test_x)),
                    recall_score(test_y, RF.predict(test_x))],
            'F1' : [f1_score(test_y, RF.predict(test_x)),
                    f1_score(test_y, LDA.predict(test_x)),
                    f1_score(test_y, DT.predict(test_x)),
                    f1_score(test_y, KN.predict(test_x)),
                    f1_score(test_y, SVM.predict(test_x)),
                    f1_score(test_y, RF.predict(test_x))]
        }

dados = pd.DataFrame(metricas,
                     columns = ['Acurácia', 'Precisão','Revocação','F1'],
                     index=['LR','LDA','DT','KNN','SVM','RF'])

dados

Unnamed: 0,Acurácia,Precisão,Revocação,F1
LR,0.85,0.87,0.92,0.94
LDA,0.86,0.89,0.92,0.9
DT,0.85,0.89,0.9,0.9
KNN,0.74,0.76,0.93,0.84
SVM,0.72,0.72,1.0,0.84
RF,0.91,0.96,0.92,0.94


In [57]:
print(confusion_matrix(test_y, SVM.predict(test_x)))

[[ 0 28]
 [ 0 72]]


## Análise de Resultados e Remodelagem

Como se pode ver, os modelos ficaram com uma média de 50% em todos os casos.

In [58]:
def categorize_age(age):
    if age < 18:
        return 'Under 18'
    elif age <= 34:
        return '18-34'
    elif age <= 49:
        return '35-49'
    elif age <= 64:
        return '50-64'
    else:
        return '65+'

df['Age_Range'] = df['Age'].apply(categorize_age)
df.drop(['Age'],axis=1,inplace=True)

df.head()


Unnamed: 0,Sleep_Hours,Work_Hours,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,Occupation_Finance,...,Country_Canada,Country_Germany,Country_India,Country_UK,Country_USA,Mental_Health_Condition_Yes,Consultation_History_Yes,Stress_Level_Low,Stress_Level_Medium,Age_Range
0,7.1,46,5,False,False,True,False,False,False,False,...,True,False,False,False,False,False,True,False,True,35-49
1,7.5,47,8,True,False,False,False,True,False,False,...,False,False,False,True,False,True,False,True,False,18-34
2,8.4,58,10,False,False,True,False,False,False,False,...,False,False,False,False,True,True,False,True,False,65+
3,9.8,30,2,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,18-34
4,4.9,62,5,True,False,False,False,False,False,False,...,True,False,False,False,False,True,False,False,True,18-34


In [59]:
def categorize_work_hours(work_hours):
    if work_hours < 30:
        return '30-'
    elif work_hours <= 35:
        return '30-35'
    elif work_hours <= 40:
        return '35-40'
    elif work_hours <= 45:
        return '40-45'
    else:
        return '45+'

df['Work_Hours_Range'] = df['Work_Hours'].apply(categorize_work_hours)
df.drop(['Work_Hours'],axis=1,inplace=True)

df.head()

Unnamed: 0,Sleep_Hours,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,Occupation_Finance,Occupation_Healthcare,...,Country_Germany,Country_India,Country_UK,Country_USA,Mental_Health_Condition_Yes,Consultation_History_Yes,Stress_Level_Low,Stress_Level_Medium,Age_Range,Work_Hours_Range
0,7.1,5,False,False,True,False,False,False,False,False,...,False,False,False,False,False,True,False,True,35-49,45+
1,7.5,8,True,False,False,False,True,False,False,False,...,False,False,True,False,True,False,True,False,18-34,45+
2,8.4,10,False,False,True,False,False,False,False,False,...,False,False,False,True,True,False,True,False,65+,45+
3,9.8,2,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,18-34,30-35
4,4.9,5,True,False,False,False,False,False,False,True,...,False,False,False,False,True,False,False,True,18-34,45+


In [60]:
def categorize_sleep_hours(sleep_hours):
    if sleep_hours < 6:
        return '6-'
    elif sleep_hours <= 8:
        return '6-8'
    elif sleep_hours <= 10:
        return '8-10'
    else:
        return '10+'

df['Sleep_Hours_Range'] = df['Sleep_Hours'].apply(categorize_sleep_hours)
df.drop(['Sleep_Hours'],axis=1,inplace=True)

df.head()

Unnamed: 0,Physical_Activity_Hours,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,Occupation_Finance,Occupation_Healthcare,Occupation_IT,...,Country_India,Country_UK,Country_USA,Mental_Health_Condition_Yes,Consultation_History_Yes,Stress_Level_Low,Stress_Level_Medium,Age_Range,Work_Hours_Range,Sleep_Hours_Range
0,5,False,False,True,False,False,False,False,False,False,...,False,False,False,False,True,False,True,35-49,45+,6-8
1,8,True,False,False,False,True,False,False,False,False,...,False,True,False,True,False,True,False,18-34,45+,6-8
2,10,False,False,True,False,False,False,False,False,False,...,False,False,True,True,False,True,False,65+,45+,8-10
3,2,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,18-34,30-35,8-10
4,5,True,False,False,False,False,False,False,True,False,...,False,False,False,True,False,False,True,18-34,45+,6-


In [61]:
def categorize_physical_activity_hours(physical_activity_hours):
    if physical_activity_hours < 4:
        return '4-'
    elif physical_activity_hours <= 8:
        return '4-8'
    elif physical_activity_hours <= 12:
        return '8-12'
    elif physical_activity_hours <= 16:
        return '12-16'
    else:
        return '20+'

df['Physical_Activity_Hours_Range'] = df['Physical_Activity_Hours'].apply(categorize_physical_activity_hours)
df.drop(['Physical_Activity_Hours'],axis=1,inplace=True)

df.head()

Unnamed: 0,Gender_Female,Gender_Male,Gender_Non-binary,Gender_Prefer not to say,Occupation_Education,Occupation_Engineering,Occupation_Finance,Occupation_Healthcare,Occupation_IT,Occupation_Sales,...,Country_UK,Country_USA,Mental_Health_Condition_Yes,Consultation_History_Yes,Stress_Level_Low,Stress_Level_Medium,Age_Range,Work_Hours_Range,Sleep_Hours_Range,Physical_Activity_Hours_Range
0,False,False,True,False,False,False,False,False,False,True,...,False,False,False,True,False,True,35-49,45+,6-8,4-8
1,True,False,False,False,True,False,False,False,False,False,...,True,False,True,False,True,False,18-34,45+,6-8,4-8
2,False,False,True,False,False,False,False,False,False,True,...,False,True,True,False,True,False,65+,45+,8-10,8-12
3,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,18-34,30-35,8-10,4-
4,True,False,False,False,False,False,False,True,False,False,...,False,False,True,False,False,True,18-34,45+,6-,4-8


## Conclusões

Tendo em vista as métricas apresentadas sobre os modelos treinados, algumas das principais conclusões que pode-se obter são:
- O treinamento por ??? é o melhor, pois...
- Os campos que mais interferem no resultado final são...