# Machine Learning Project

In this notebook, you will work on a different dataset and create a classifier model using the techniques you learned from the previous notebooks. Here, you are free to try different things, such as models and techniques, to extract information and train your model.

### Personal Key Indicators of Heart Disease

You are going to work on a [Kaggle dataset](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?select=heart_2020_cleaned.csv) to classify the presence of heart disease in patients given their health indicators.
The link describes each feature. Note that the dataset is imbalanced: one class has far more samples than the other.
There are many techniques for dealing with this, but the most important thing here is to investigate the data and train a model. We will learn about dealing with these problems along the way.

In [1]:
# Import your libraries here.
import pandas as pd
import sklearn
import seaborn as sns

In [2]:
data_path = 'datasets/heart_2020_cleaned.csv'
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


Here, we have a series of features and one label (**HeartDisease**). You must train a model using only the features. Feel free to remove or add features if you think it is necessary. If you are feeling confident and this project is too easy for you, try changing the problem to classify *SkinCancer* cases, for example. In this scenario, *HeartDisease* becomes a feature and the label is *SkinCancer*. Will you obtain similar, better, or worse results?

In [3]:
df.columns

Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
       'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',
       'Asthma', 'KidneyDisease', 'SkinCancer'],
      dtype='object')

In [4]:
# Create dummy variables to feed into the logistic regression.
sex = pd.get_dummies(df["Sex"], drop_first=True)
Smoking = pd.get_dummies(df["Smoking"], drop_first=True)
AlcoholDrinking = pd.get_dummies(df["AlcoholDrinking"], drop_first=True)
Stroke = pd.get_dummies(df["Stroke"], drop_first=True)
Asthma = pd.get_dummies(df["Asthma"], drop_first=True)
KidneyDisease = pd.get_dummies(df["KidneyDisease"], drop_first=True)
diffWalking = pd.get_dummies(df["DiffWalking"], drop_first=True)
diabetic = pd.get_dummies(df["Diabetic"], drop_first=True)
phys_activity = pd.get_dummies(df["PhysicalActivity"], drop_first=True)




In [5]:
sex.value_counts()

Male 
False    167805
True     151990
Name: count, dtype: int64

In [6]:
df.drop(["Stroke", "Sex", "Smoking", "AlcoholDrinking", "Asthma", "KidneyDisease"], axis=1, inplace=True)


In [7]:
df = pd.concat([df, Stroke, sex, Smoking, AlcoholDrinking, Asthma, KidneyDisease, diffWalking, diabetic, phys_activity], axis=1)

Logistic regression is similar to linear regression, but instead of predicting a continuous value, we predict whether something is true or false. Instead of fitting a line to the data, logistic regression fits an S-shaped (sigmoid) curve. We predict a binary variable from quantitative variables: x = quantitative variables (discrete or continuous), y = binary variable.
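
To make the S-shaped curve concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression applies to a linear combination of the features. The weights, bias, and feature values are made up for illustration and do not come from the fitted model.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights and bias, chosen only for illustration.
weights = [0.8, -0.5]
bias = -1.0

p = predict_proba([1.0, 0.0], weights, bias)
label = "Yes" if p >= 0.5 else "No"  # 0.5 is the usual default threshold
print(round(p, 3), label)
```

The output of `sigmoid` is read as a probability, and the class is chosen by comparing it with a threshold.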

In [8]:
df.columns

Index(['HeartDisease', 'BMI', 'PhysicalHealth', 'MentalHealth', 'DiffWalking',
       'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth',
       'SleepTime', 'SkinCancer', 'Yes', 'Male', 'Yes', 'Yes', 'Yes', 'Yes',
       'Yes', 'No, borderline diabetes', 'Yes', 'Yes (during pregnancy)',
       'Yes'],
      dtype='object')
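
The repeated `Yes` labels in this index come from calling `pd.get_dummies` on each column separately, which makes the features hard to tell apart. One possible alternative, sketched below on a tiny made-up frame, is to pass the column list to a single `pd.get_dummies` call so that each dummy is prefixed with the name of its source column. The toy values are illustrative only.

```python
import pandas as pd

# Tiny made-up sample with the same kind of categorical columns as the dataset.
toy = pd.DataFrame({
    "Smoking": ["Yes", "No", "Yes"],
    "Stroke": ["No", "No", "Yes"],
    "Sex": ["Female", "Male", "Male"],
})

# One call encodes every listed column and prefixes each dummy with
# the name of the column it came from, so the names stay unique.
encoded = pd.get_dummies(toy, columns=["Smoking", "Stroke", "Sex"], drop_first=True)
print(list(encoded.columns))
```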

In [9]:
df.drop(['BMI', 'PhysicalHealth', 'MentalHealth', 'DiffWalking',
       'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth',
       'SleepTime', 'SkinCancer'], axis=1, inplace=True)

In [10]:
df.columns

Index(['HeartDisease', 'Yes', 'Male', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
       'No, borderline diabetes', 'Yes', 'Yes (during pregnancy)', 'Yes'],
      dtype='object')

Now let's split the data into training and test sets.

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('HeartDisease',axis=1), # Drop the class column.
                                                    df['HeartDisease'], test_size=0.30, 
                                                    random_state=101)
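
Because the classes are imbalanced, a plain random split can leave the training and test sets with slightly different class ratios. A hedged sketch of one option is the `stratify` argument of `train_test_split`, shown here on synthetic labels rather than the notebook's data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 90 "No" labels and 10 "Yes" labels.
X = [[i] for i in range(100)]
y = ["No"] * 90 + ["Yes"] * 10

# stratify=y keeps the 90/10 class ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=101, stratify=y
)

print(Counter(y_tr), Counter(y_te))
```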

In [13]:
X_train.head()

Unnamed: 0,Yes,Male,Yes.1,Yes.2,Yes.3,Yes.4,Yes.5,"No, borderline diabetes",Yes.6,Yes (during pregnancy),Yes.7
214119,False,False,False,False,False,False,False,False,False,False,True
168731,False,True,False,False,False,False,True,False,True,False,True
227567,False,True,True,False,False,False,False,False,False,False,True
25154,False,True,False,False,True,False,False,False,False,False,True
220917,False,False,True,False,False,False,False,False,False,False,True


In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [16]:
predictions = logmodel.predict(X_test)

In [17]:
from sklearn.metrics import classification_report

In [18]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

          No       0.92      0.99      0.95     87658
         Yes       0.52      0.07      0.12      8281

    accuracy                           0.91     95939
   macro avg       0.72      0.53      0.54     95939
weighted avg       0.88      0.91      0.88     95939



As Luan pointed out, the data is imbalanced, and this affects the F1-score. The support column shows that there are many more people without heart disease than with it. That is good news in general, but it makes the analysis harder.
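
A quick worked example of why accuracy alone misleads here: using the support values from the report above, a model that always answers `No` would already reach about 91% accuracy while detecting no one with heart disease.

```python
# Support values copied from the classification report above.
support_no = 87658
support_yes = 8281

total = support_no + support_yes
baseline_accuracy = support_no / total  # accuracy when always predicting "No"
baseline_recall_yes = 0 / support_yes   # no positive case is ever found

print(f"{baseline_accuracy:.2f}", baseline_recall_yes)
```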

In [19]:
# Split the dataset by class.
com_doenca = df[df["HeartDisease"] == "Yes"]
sem_doenca = df[df["HeartDisease"] == "No"]

In [20]:
com_doenca.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27373 entries, 5 to 319790
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   HeartDisease             27373 non-null  object
 1   Yes                      27373 non-null  bool  
 2   Male                     27373 non-null  bool  
 3   Yes                      27373 non-null  bool  
 4   Yes                      27373 non-null  bool  
 5   Yes                      27373 non-null  bool  
 6   Yes                      27373 non-null  bool  
 7   Yes                      27373 non-null  bool  
 8   No, borderline diabetes  27373 non-null  bool  
 9   Yes                      27373 non-null  bool  
 10  Yes (during pregnancy)   27373 non-null  bool  
 11  Yes                      27373 non-null  bool  
dtypes: bool(11), object(1)
memory usage: 721.7+ KB


In [21]:
sem_doenca.info()

<class 'pandas.core.frame.DataFrame'>
Index: 292422 entries, 0 to 319794
Data columns (total 12 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   HeartDisease             292422 non-null  object
 1   Yes                      292422 non-null  bool  
 2   Male                     292422 non-null  bool  
 3   Yes                      292422 non-null  bool  
 4   Yes                      292422 non-null  bool  
 5   Yes                      292422 non-null  bool  
 6   Yes                      292422 non-null  bool  
 7   Yes                      292422 non-null  bool  
 8   No, borderline diabetes  292422 non-null  bool  
 9   Yes                      292422 non-null  bool  
 10  Yes (during pregnancy)   292422 non-null  bool  
 11  Yes                      292422 non-null  bool  
dtypes: bool(11), object(1)
memory usage: 7.5+ MB


In [22]:
from sklearn.utils import resample

In [23]:
sem_doenca = resample(sem_doenca, replace=True,
                      n_samples=round(len(sem_doenca) / 10),
                      random_state=42)

In [24]:
sem_doenca.shape

(29242, 12)

In [25]:
com_doenca.shape

(27373, 12)

Now the dataset contains roughly as many people without heart disease (29,242) as with it (27,373).
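
The resampling cell above draws the majority class with `replace=True`, so some rows may appear more than once. As an alternative sketch (an assumption, not the notebook's method), the majority class can be downsampled without replacement to exactly the minority size using `DataFrame.sample`. The frame below is synthetic.

```python
import pandas as pd

# Synthetic imbalanced frame: 20 "No" rows and 4 "Yes" rows.
toy = pd.DataFrame({
    "HeartDisease": ["No"] * 20 + ["Yes"] * 4,
    "feature": range(24),
})

minority = toy[toy["HeartDisease"] == "Yes"]
majority = toy[toy["HeartDisease"] == "No"]

# Sample without replacement so that no majority row is duplicated.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)

balanced = pd.concat([minority, majority_down])
print(balanced["HeartDisease"].value_counts())
```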

In [26]:
novo_df = pd.concat([com_doenca, sem_doenca])
novo_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 56615 entries, 5 to 966
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   HeartDisease             56615 non-null  object
 1   Yes                      56615 non-null  bool  
 2   Male                     56615 non-null  bool  
 3   Yes                      56615 non-null  bool  
 4   Yes                      56615 non-null  bool  
 5   Yes                      56615 non-null  bool  
 6   Yes                      56615 non-null  bool  
 7   Yes                      56615 non-null  bool  
 8   No, borderline diabetes  56615 non-null  bool  
 9   Yes                      56615 non-null  bool  
 10  Yes (during pregnancy)   56615 non-null  bool  
 11  Yes                      56615 non-null  bool  
dtypes: bool(11), object(1)
memory usage: 1.5+ MB


In [27]:
X_train, X_test, y_train, y_test = train_test_split(novo_df.drop('HeartDisease',axis=1), # Drop the class column.
                                                    novo_df['HeartDisease'], test_size=0.30, 
                                                    random_state=101)

In [28]:
logmodel = LogisticRegression(class_weight="balanced")
logmodel.fit(X_train,y_train)

In [29]:
predictions = logmodel.predict(X_test)

In [30]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

          No       0.68      0.79      0.73      8755
         Yes       0.73      0.60      0.66      8230

    accuracy                           0.70     16985
   macro avg       0.70      0.70      0.70     16985
weighted avg       0.70      0.70      0.70     16985



This is much better: recall for the `Yes` class rose from 0.07 to 0.60. Next, I will try a classifier based on Naive Bayes.
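
The F1-score is the harmonic mean of precision and recall, which is why low recall drags it down so sharply. The worked example below recomputes the F1-score of the `Yes` class in both logistic regression reports from their rounded precision and recall values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values for the "Yes" class, copied from the two reports.
f1_before = f1(0.52, 0.07)  # model trained on the original imbalanced data
f1_after = f1(0.73, 0.60)   # model trained after undersampling

print(round(f1_before, 2), round(f1_after, 2))  # prints: 0.12 0.66
```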

In [31]:
from sklearn.naive_bayes import BernoulliNB

In [32]:
modelo = BernoulliNB()

In [33]:
modelo.fit(X_train, y_train)

In [34]:
from sklearn.metrics import accuracy_score

In [35]:
y_predicao = modelo.predict(X_test)
print(accuracy_score(y_test, y_predicao))

0.6995584339122756


In [36]:
print(classification_report(y_test, y_predicao))

              precision    recall  f1-score   support

          No       0.68      0.78      0.73      8755
         Yes       0.73      0.61      0.66      8230

    accuracy                           0.70     16985
   macro avg       0.70      0.70      0.70     16985
weighted avg       0.70      0.70      0.70     16985



I will try bootstrap aggregation (bagging) to see whether I can get a better F1-score.
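
A minimal sketch of what bagging could look like with scikit-learn's `BaggingClassifier`, assuming logistic regression as the base estimator and synthetic data from `make_classification` in place of the notebook's features. The number of estimators is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=101)

# Each of the 10 models is fit on a bootstrap sample of the training set,
# and their predictions are combined into a single prediction.
bagging = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10, random_state=0)
bagging.fit(X_tr, y_tr)

y_pred = bagging.predict(X_te)
print(classification_report(y_te, y_pred))
```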

In [37]:
df.Male.head()

0    False
1    False
2     True
3    False
4    False
Name: Male, dtype: bool

In [38]:
from sklearn.cluster import MiniBatchKMeans
from imblearn.over_sampling import SMOTE

In [70]:
# The installed imbalanced-learn version rejects n_jobs with
# "TypeError: SMOTE.__init__() got an unexpected keyword argument 'n_jobs'",
# so the argument is omitted.
smote = SMOTE(random_state=0)

In [74]:
x = df.drop(["HeartDisease", 'No, borderline diabetes', 'Yes (during pregnancy)', 'Male'], axis=1).astype("float")
y = df["HeartDisease"]
y.value_counts()



HeartDisease
No     292422
Yes     27373
Name: count, dtype: int64

In [75]:
x, y = smote.fit_resample(x, y)

In [76]:
y.value_counts()

HeartDisease
No     292422
Yes    292422
Name: count, dtype: int64

In [77]:
X_train, X_test, y_train, y_test = train_test_split(x, # Features; the class column was already dropped.
                                                    y, test_size=0.30, 
                                                    random_state=101)

In [79]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))


              precision    recall  f1-score   support

          No       0.66      0.79      0.72     88075
         Yes       0.74      0.59      0.66     87379

    accuracy                           0.69    175454
   macro avg       0.70      0.69      0.69    175454
weighted avg       0.70      0.69      0.69    175454



In [80]:
from imblearn.over_sampling import KMeansSMOTE

In [82]:
x = df.drop(["HeartDisease", 'No, borderline diabetes', 'Yes (during pregnancy)', 'Male'], axis=1).astype("float")
y = df["HeartDisease"]
y.value_counts()

HeartDisease
No     292422
Yes     27373
Name: count, dtype: int64

In [83]:
KSMOTE  = KMeansSMOTE(cluster_balance_threshold=0.1)

In [84]:
X_KSMOTE, y_KSMOTE = KSMOTE.fit_resample(x, y)

In [85]:
y_KSMOTE.value_counts()

HeartDisease
Yes    292423
No     292422
Name: count, dtype: int64

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X_KSMOTE, # Resampled features; the class column was already dropped.
                                                    y_KSMOTE, test_size=0.30, 
                                                    random_state=101)

In [89]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

          No       0.90      0.87      0.89     88157
         Yes       0.88      0.90      0.89     87297

    accuracy                           0.89    175454
   macro avg       0.89      0.89      0.89    175454
weighted avg       0.89      0.89      0.89    175454
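
One caveat worth checking: in the SMOTE and KMeansSMOTE cells above, resampling happens before `train_test_split`, so synthetic samples also land in the test set and the scores may not reflect performance on real, imbalanced data. A hedged sketch of the alternative order follows: split first, then rebalance only the training part. To stay self-contained it uses synthetic data and plain random oversampling via `sklearn.utils.resample`, swapped in for SMOTE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (about 90% class 0), standing in for the real set.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0)

# 1) Split first, keeping the original class ratio in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=101, stratify=y)

# 2) Oversample the minority class in the training set only.
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# 3) Fit on the balanced training data, evaluate on the untouched test set.
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(classification_report(y_te, model.predict(X_te)))
```

With this order, the test-set support keeps the original class ratio, so the report describes how the model would behave on data distributed like the source dataset.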

