# Machine Learning Project

In this notebook, you will work on a different dataset and create a classifier model using the techniques you learned in the previous notebooks. You are free to experiment with different models and techniques to extract information and train your model.

### Personal Key Indicators of Heart Disease

You are going to work on a [Kaggle dataset](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?select=heart_2020_cleaned.csv) to classify the presence of heart disease in patients given their health indicators.
The linked page describes each feature. Notice that the dataset is imbalanced, which means that there are more samples from one class than from the other.
There are many techniques for dealing with this, but here the most important thing is to investigate the data and train a model. We will learn about dealing with these problems along the way.
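
As a quick way to see how imbalanced a label column is, the class counts and class shares can be inspected with `value_counts`. The snippet below is a minimal sketch on a toy `Series`; the values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the HeartDisease column (hypothetical values, not the real data).
labels = pd.Series(["No"] * 9 + ["Yes"] * 1, name="HeartDisease")

# Absolute number of samples per class.
counts = labels.value_counts()

# Relative frequency of each class (sums to 1).
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, the same two calls on `df["HeartDisease"]` would show how strongly one class dominates.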

In [3]:
# Import your libraries here.
import pandas as pd
import sklearn
import seaborn as sns

In [4]:
data_path = 'datasets/heart_2020_cleaned.csv'
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


Here, we have a series of features and one label (**HeartDisease**). You must train a model using only the features. Feel free to remove or add new features if you think it is necessary. If you are feeling confident and this project is too easy for you, try changing the problem to classify *SkinCancer* cases, for example. In this scenario, *HeartDisease* becomes a feature and the label is *SkinCancer*. Will you obtain similar, better, or worse results?

In [5]:
df.columns

Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
       'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',
       'Asthma', 'KidneyDisease', 'SkinCancer'],
      dtype='object')

In [6]:
# Create dummy variables to feed into the logistic regression.
sex = pd.get_dummies(df["Sex"], drop_first=True)
Smoking = pd.get_dummies(df["Smoking"], drop_first=True)
AlcoholDrinking = pd.get_dummies(df["AlcoholDrinking"], drop_first=True)
Stroke = pd.get_dummies(df["Stroke"], drop_first=True)
Asthma = pd.get_dummies(df["Asthma"], drop_first=True)
KidneyDisease = pd.get_dummies(df["KidneyDisease"], drop_first=True)



In [7]:
sex.value_counts()

Male 
False    167805
True     151990
Name: count, dtype: int64

In [8]:
df.drop(["Stroke", "Sex", "Smoking", "AlcoholDrinking", "Asthma", "KidneyDisease"], axis=1,inplace=True)


In [9]:
df = pd.concat([df, Stroke, sex, Smoking, AlcoholDrinking, Asthma, KidneyDisease], axis=1)

Logistic regression is similar to linear regression, but instead of predicting something continuous, we predict whether something is true or false. Instead of fitting a line to the data, logistic regression fits an S-shaped curve. We predict binary variables using quantitative variables: x = quantitative variables (discrete or continuous), y = binary variables.
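
The S-shaped curve is the logistic (sigmoid) function, which maps any real number to a value between 0 and 1 that can be read as a probability. The sketch below is only an illustration: the input values and the 0.5 cut-off are assumptions chosen for the example, not something fitted on this dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w * x + b) for three samples.
z = np.array([-4.0, 0.0, 4.0])

# Scores become probabilities of the positive class.
probs = sigmoid(z)

# A probability of at least 0.5 is turned into the label 1, otherwise 0.
labels = (probs >= 0.5).astype(int)

print(probs)
print(labels)
```

Strongly negative scores end up close to 0, strongly positive scores close to 1, and a score of exactly 0 gives a probability of 0.5.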

In [10]:
df.columns

Index(['HeartDisease', 'BMI', 'PhysicalHealth', 'MentalHealth', 'DiffWalking',
       'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth',
       'SleepTime', 'SkinCancer', 'Yes', 'Male', 'Yes', 'Yes', 'Yes', 'Yes'],
      dtype='object')
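
Note that the column list above now contains several columns that are all named `Yes`, because each separate `get_dummies` call names its output after the category value. One possible alternative, sketched here on a small made-up frame rather than on the real data, is to pass the column names through the `columns` argument so that pandas prefixes each dummy with its source column.

```python
import pandas as pd

# Toy frame with a few of the binary columns (hypothetical values).
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "Smoking": ["Yes", "No", "Yes"],
    "Stroke": ["No", "No", "Yes"],
})

# Encode all listed columns at once; each dummy is named <column>_<category>,
# and drop_first=True drops the first category of each column.
encoded = pd.get_dummies(toy, columns=["Sex", "Smoking", "Stroke"], drop_first=True)

print(encoded.columns.tolist())
# ['Sex_Male', 'Smoking_Yes', 'Stroke_Yes']
```

With unique names, each feature can be selected or dropped unambiguously later on.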

In [13]:
df.drop(['BMI', 'PhysicalHealth', 'MentalHealth', 'DiffWalking',
       'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth',
       'SleepTime', 'SkinCancer'], axis=1, inplace = True)

In [14]:
df.columns

Index(['HeartDisease', 'Yes', 'Male', 'Yes', 'Yes', 'Yes', 'Yes'], dtype='object')

Now let's split the data into a training set and a test set.

In [16]:
from sklearn.model_selection import train_test_split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('HeartDisease',axis=1), # Drop the class column.
                                                    df['HeartDisease'], test_size=0.30, 
                                                    random_state=101)

In [20]:
X_train.head()

Unnamed: 0,Yes,Male,Yes.1,Yes.2,Yes.3,Yes.4
214119,False,False,False,False,False,False
168731,False,True,False,False,False,False
227567,False,True,True,False,False,False
25154,False,True,False,False,True,False
220917,False,False,True,False,False,False


In [21]:
from sklearn.linear_model import LogisticRegression

In [25]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [26]:
predictions = logmodel.predict(X_test)

In [23]:
from sklearn.metrics import classification_report

In [27]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

          No       0.92      1.00      0.95     87658
         Yes       0.52      0.03      0.06      8281

    accuracy                           0.91     95939
   macro avg       0.72      0.51      0.51     95939
weighted avg       0.88      0.91      0.88     95939
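
The very low recall for the `Yes` class is consistent with the class imbalance mentioned at the start of the notebook. One option that scikit-learn offers is the `class_weight="balanced"` setting of `LogisticRegression`, which weights samples inversely to their class frequency. The sketch below compares it with the default on synthetic one-dimensional data; the data, the 90/10 class ratio, and the variable names are assumptions made for the example, and the numbers it prints say nothing about the heart disease dataset itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 900 negatives around 0.0, 100 positives around 1.5.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(1.5, 1.0, 100)]).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

# stratify=y keeps the same class ratio in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Default model versus a model that up-weights the minority class.
plain = LogisticRegression().fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Recall of the positive (minority) class on the held-out data.
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

print("recall (default):  ", recall_plain)
print("recall (balanced): ", recall_balanced)
```

Higher recall for the minority class usually comes at the cost of lower precision, so both metrics should be checked when trying this on the real data.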

