# CATEGORICAL DATA EXERCISE

# Imports

In [None]:
import pandas as pd

# Problem: The Heart Dataset

## Dataset Description

File name: 'D3_Heart_Dataset.csv'

This dataset has been obtained from Kaggle: https://www.kaggle.com/fedesoriano/heart-failure-prediction

The data contains 918 observations with 12 attributes as described below:
1. Age: patient's age, range: 28 to 77.
2. Sex: patient's gender, M (79%), F (21%).
3. ChestPainType: ASY (54%), NAP (22%), Other (24%).
4. RestingBP: resting blood pressure, range: 0 to 200.
5. Cholesterol: serum cholesterol, range: 0 to 603.
6. FastingBS: fasting blood sugar, 0 or 1.
7. RestingECG: resting electrocardiogram results, Normal (60%), LVH (20%), Other (19%).
8. MaxHR: maximum heart rate achieved, range: 60 to 202.
9. ExerciseAngina: exercise-induced angina, Y (371, 40%), N (547, 60%).
10. OldPeak: ST depression (old peak), range: -2.6 to 6.2.
11. ST_Slope: slope of the ST segment, Up, Flat, or Down.
12. HeartDisease: target, 1 or 0.

The last column, HeartDisease, is the target: it indicates whether heart disease is present and is to be predicted from the remaining 11 attributes.

This is a binary classification problem.

The dataset contains categorical columns that must be encoded before modelling; otherwise it is clean.

## Loading Data

In [None]:
#Reading the file into a dataframe
data=pd.read_csv('D3_Heart_Dataset.csv')
#Displaying the read contents
data

## Exploring Data

In [None]:
#Finding datatype of data
type(data)

In [None]:
data.info()
#This information shows that each column has 918 entries.
#None of the columns contains a 'null' value.
#There are 5 attributes with the datatype 'object' (strings).
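
The text columns reported by `info()` can also be listed programmatically. The following is a minimal sketch on a small made-up frame: `toy` and its three rows are hypothetical and are not taken from the CSV file. It selects every non-numeric column, which should give the same list as reading the `info()` output by eye.

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the heart dataset
toy = pd.DataFrame({
    'Age': [40, 49, 37],
    'Sex': ['M', 'F', 'M'],
    'ExerciseAngina': ['N', 'N', 'Y'],
    'HeartDisease': [0, 1, 0],
})

# select_dtypes filters columns by dtype; excluding 'number' leaves the text columns
categorical_columns = toy.select_dtypes(exclude='number').columns.tolist()
print(categorical_columns)
```

Running the same call on `data` should list the columns that still need encoding.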

## Encoding Categorical Data

### Consider the 'ExerciseAngina' column first and apply dummy variable encoding.
- This is a binary variable: the presence of exercise-induced angina (Y) counts towards an increased risk of heart disease, and its absence (N) does not.
- It can be encoded using dummy variable encoding.

In [None]:
#The value_counts method counts how many times each distinct value occurs in a column
data['ExerciseAngina'].value_counts()
#This shows that there are two possible values for this attribute: 'Y' or 'N'. 
#Also there are 547 entries for 'N' and 371 entries for 'Y'.

In [None]:
#The simplest way to encode 'ExerciseAngina' as a dummy variable is to use the replace method.
#A dictionary maps each old label to its new numeric value in a single call.
data['ExerciseAngina']=data['ExerciseAngina'].replace({'Y':1,'N':0})
data
#Observe that the values of the column 'ExerciseAngina' have been changed to 0 and 1.
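
An alternative to `replace` is `Series.map`, which makes unexpected labels easy to spot because any value missing from the dictionary becomes `NaN`. The sketch below uses a hypothetical five-row frame (`toy`) rather than the real data, and the names `mapping` and `unmapped` are illustrative only.

```python
import pandas as pd

# Hypothetical sample of the 'ExerciseAngina' column
toy = pd.DataFrame({'ExerciseAngina': ['N', 'Y', 'N', 'Y', 'N']})

# Dummy variable encoding: one binary column, Y -> 1 and N -> 0
mapping = {'Y': 1, 'N': 0}
encoded = toy['ExerciseAngina'].map(mapping)

# map() yields NaN for any label that is not a key of the dictionary,
# so counting NaN values reveals labels the mapping did not cover
unmapped = int(encoded.isna().sum())

# Safe to cast to int here because nothing was left unmapped
toy['ExerciseAngina'] = encoded.astype(int)
print(unmapped)
print(toy)
```

Unlike `replace`, running `map` a second time on the already encoded column would turn every value into `NaN`, so such a cell should be run only once.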

### Now consider the 'ChestPainType' column and apply ordinal encoding.

In [None]:
data['ChestPainType'].value_counts()
#This column contains 4 different values: ASY, NAP, ATA and TA.

Let us use ordinal encoding as follows:
- TA (typical angina): 1
- ATA (atypical angina): 2
- NAP (non-anginal pain): 3
- ASY (asymptomatic): 4

In [None]:
#A single dictionary maps each chest pain type to its ordinal value.
data['ChestPainType']=data['ChestPainType'].replace({'TA':1,'ATA':2,'NAP':3,'ASY':4})
data
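
The same ordinal scheme can be expressed with an ordered categorical, which keeps the chosen order in one list. This is a minimal sketch on a hypothetical five-row frame (`toy`); the order list simply restates the TA, ATA, NAP, ASY ranking chosen above.

```python
import pandas as pd

# Hypothetical sample of the 'ChestPainType' column
toy = pd.DataFrame({'ChestPainType': ['ATA', 'NAP', 'ASY', 'TA', 'ASY']})

# The position in this list defines the ordinal rank of each label
order = ['TA', 'ATA', 'NAP', 'ASY']
cat = pd.Categorical(toy['ChestPainType'], categories=order, ordered=True)

# Categorical codes start at 0, so add 1 to match the 1..4 scheme.
# A label missing from 'order' gets code -1 and would therefore show up as 0.
toy['ChestPainType'] = cat.codes + 1
print(toy)
```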

### Now consider the 'Sex' column and apply one-hot encoding.

In [None]:
#get_dummies is a simple pandas function that can achieve this task.
#The result is assigned back to data so that the encoded columns are kept.
data=pd.get_dummies(data, columns=['Sex'])
data
#The resulting table has two one-hot encoded columns, Sex_F and Sex_M, in place of the single column Sex.
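
One-hot encoding creates one column per category, while `drop_first=True` keeps one column fewer, which is the usual dummy variable form. The sketch below compares the two on a hypothetical three-row frame (`toy`) that mimics the `Sex` column from the dataset description. The `dtype=int` argument is used here only so that the output shows 0 and 1.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric column
toy = pd.DataFrame({'Sex': ['M', 'F', 'M'], 'Age': [40, 49, 37]})

# One-hot encoding: one 0/1 column per category (Sex_F and Sex_M)
one_hot = pd.get_dummies(toy, columns=['Sex'], dtype=int)

# Dropping the first category leaves a single column (Sex_M);
# Sex_F is implied whenever Sex_M is 0
reduced = pd.get_dummies(toy, columns=['Sex'], drop_first=True, dtype=int)

print(one_hot)
print(reduced)
```

For a two-valued column, the reduced form carries the same information as the two one-hot columns.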

In [None]:
#All the changes that we have made so far are made on the dataframe, and not in the original csv file.
#The to_csv method can be used to save the dataframe into a csv file.
#index=False prevents the row index from being written as an extra column.
data.to_csv('D3_Heart_Dataset_Clean.csv', index=False)
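
By default, `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The following sketch shows the difference on a hypothetical two-row frame (`toy`). It writes to strings instead of files, which `to_csv` does when no path is given.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the encoded dataset
toy = pd.DataFrame({'Age': [40, 49], 'HeartDisease': [0, 1]})

# With no path, to_csv returns the CSV text instead of writing a file
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Reading the first version back shows the saved index as an extra column
reloaded = pd.read_csv(io.StringIO(with_index))

print(without_index.splitlines()[0])
print(list(reloaded.columns))
```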

In [None]:
#TASK FOR YOU
#Try different types of encoding on the remaining categorical features.