<a href="https://colab.research.google.com/github/MarkNCI/AI-Ml-Diploma/blob/main/MHayden_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment Description
I am using the Personal Key Indicators of Heart Disease dataset from [Kaggle](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease). It is drawn from an annual telephone survey conducted by the United States Centers for Disease Control and Prevention (CDC); the 2020 survey had over 400,000 participants.

## Column Descriptions

```
HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI).
BMI: Body Mass Index (BMI).
Smoking: Have you smoked at least 100 cigarettes in your entire life?
AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week).
Stroke: (Ever told) (you had) a stroke?
PhysicalHealth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (0-30 days).
MentalHealth: Thinking about your mental health, for how many days during the past 30 days was your mental health not good? (0-30 days).
DiffWalking: Do you have serious difficulty walking or climbing stairs?
Sex: Are you male or female?
AgeCategory: Fourteen-level age category (later converted to the mean age of each band).
Race: Imputed race/ethnicity value.
Diabetic: (Ever told) (you had) diabetes?
PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job.
GenHealth: Would you say that in general your health is...
SleepTime: On average, how many hours of sleep do you get in a 24-hour period?
Asthma: (Ever told) (you had) asthma?
KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
SkinCancer: (Ever told) (you had) skin cancer?

```



In [1]:
# Load Libraries
from google.colab import files
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Upload heart disease dataset (Google Colab) ##
uploaded = files.upload()

Saving heart_2020_cleaned.csv to heart_2020_cleaned (2).csv


In [None]:
# Load dataset
df = pd.read_csv('/content/heart_2020_cleaned.csv')
df.info()

# Dataset Details

In [None]:
print(df.columns,'\n')
df.head()

In [None]:
# Count nulls
df.isna().sum()

In [None]:
# Categorical columns
categorical = ['HeartDisease', 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'Race', 
               'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer']

for cat in categorical:
  print(cat)
  print(np.unique(df[cat].values))

In [None]:
# Stats per Categorical column: one pie chart per column
for cat in categorical:
  counts = df[cat].value_counts()
  colors = sns.color_palette('pastel')[0:len(counts)]
  # Take the labels from counts.index so each label matches its wedge
  # (unique() returns values in order of appearance, not in count order)
  plt.pie(counts, labels=counts.index, colors=colors, autopct='%.0f%%')
  plt.title(cat)
  plt.show()

# Feature Extraction

In [None]:
# Convert each age band to its approximate mean age
print(df['AgeCategory'].unique())
age_means = {'18-24': 21, '25-29': 27, '30-34': 32, '35-39': 37, '40-44': 42,
             '45-49': 47, '50-54': 52, '55-59': 57, '60-64': 62, '65-69': 67,
             '70-74': 72, '75-79': 77, '80 or older': 80}
# Map to integers so Age can be used as a numeric feature;
# any category missing from the dictionary becomes NaN
df['Age'] = df['AgeCategory'].map(age_means)
print(df['Age'].unique())

In [None]:
# Numerical columns: scale by the column maximum so the largest value becomes 1
numerical = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
for num in numerical:
  df[num] = df[num]/df[num].max()

df[numerical]
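
Dividing by the maximum only stretches a column over the full 0 to 1 range when its smallest value is 0. A possible alternative is min-max scaling, sketched below on a small made-up frame. The helper name `min_max_scale` and the toy values are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

def min_max_scale(frame, cols):
    """Return a copy of frame with each listed column rescaled to [0, 1]."""
    out = frame.copy()
    for col in cols:
        lo, hi = out[col].min(), out[col].max()
        # A constant column has no range, so set it to 0 instead of dividing by zero
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    return out

# Made-up values for illustration only (not rows from the dataset)
toy = pd.DataFrame({'BMI': [20.0, 25.0, 40.0], 'SleepTime': [4.0, 7.0, 10.0]})
scaled = min_max_scale(toy, ['BMI', 'SleepTime'])
print(scaled)
```

With this variant the smallest value in each column maps to 0 and the largest to 1, whereas dividing by the maximum keeps the ratio between values unchanged.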

In [None]:
# One Hot Encoding 
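
One way this step could look is sketched below: Yes/No columns are mapped to 1/0 and columns with several values are expanded with `pd.get_dummies`. The helper name `encode_categoricals`, the choice of which columns count as binary, and the toy rows are all assumptions for illustration. The unique values printed earlier should be checked first, since a column that looks binary may hold more than two values.

```python
import pandas as pd

def encode_categoricals(frame, binary_cols, multi_cols):
    """Map Yes/No columns to 1/0 and one-hot encode multi-valued columns."""
    out = frame.copy()
    for col in binary_cols:
        # Values other than 'Yes'/'No' would become NaN here
        out[col] = out[col].map({'Yes': 1, 'No': 0})
    # get_dummies adds one indicator column per category, named '<column>_<value>'
    return pd.get_dummies(out, columns=multi_cols, dtype=int)

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'Smoking': ['Yes', 'No', 'No'],
    'GenHealth': ['Good', 'Fair', 'Good'],
    'BMI': [22.5, 31.0, 27.3],
})
encoded = encode_categoricals(toy, binary_cols=['Smoking'], multi_cols=['GenHealth'])
print(encoded.columns.tolist())
```

On the real data the call might be `encode_categoricals(df, binary_cols=[...], multi_cols=['Race', 'GenHealth'])`, with the binary list filled in after inspecting each column.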



```
TODO:
1) Feature selection: What do I use?
2) Features with multiple values --> 1 hot encoding?
3) Stats for numerical columns
```
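
For item 3 of the TODO, a minimal sketch of per-column statistics is given below. It combines `DataFrame.describe` with skewness and the share of zero values, which may be informative for day-count columns such as PhysicalHealth. The helper name `numeric_summary` and the toy values are assumptions for illustration. On the real data it would be best run before the columns are normalised.

```python
import pandas as pd

def numeric_summary(frame, cols):
    """Count, mean, std and quartiles per column, plus skew and share of zeros."""
    summary = frame[cols].describe().T          # one row per column
    summary['skew'] = frame[cols].skew()
    summary['zero_share'] = (frame[cols] == 0).mean()
    return summary

# Made-up values for illustration only (not rows from the dataset)
toy = pd.DataFrame({
    'BMI': [22.5, 31.0, 27.3, 24.1],
    'PhysicalHealth': [0.0, 30.0, 0.0, 2.0],
    'SleepTime': [7.0, 5.0, 8.0, 6.0],
})
stats = numeric_summary(toy, ['BMI', 'PhysicalHealth', 'SleepTime'])
print(stats[['mean', 'min', 'max', 'zero_share']])
```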

