# AIM
* To perform data cleaning on heart disease diagnostic data.

# **Data Exploration and Cleaning**
**We will analyze the data to find the following:**

* Missing Values
* All The Numerical Variables
* Distribution of the Numerical Variables
* Outliers
* Relationship between the independent features and the dependent feature (num)
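
One common way to flag outliers in a numerical column is the 1.5 × IQR rule. The sketch below is only an illustration under stated assumptions: `iqr_outliers` is a hypothetical helper, and the values are made-up toy numbers, not rows from the heart disease dataset.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy cholesterol-like values, invented for illustration only
toy = pd.Series([200, 210, 220, 230, 240, 600], name="chol")
mask = iqr_outliers(toy)

# Show only the values flagged as outliers
print(toy[mask])
```

The same mask could be computed per numerical column of `df` (for example `chol` or `trestbps`) to decide whether any rows need a closer look.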

In [8]:
# import modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Display all the columns of the dataframe

pd.set_option('display.max_columns', None)

In [9]:
# load the data

# Set the path to the raw data folder
raw_data_path = 'C:\\Users\\prath\\Heart-Disease-Diagnostic-Analysis\\Data\\Raw\\'

# Load the heart_disease_dataset.csv file into a pandas DataFrame
df = pd.read_csv(raw_data_path + 'heart_disease_dataset.csv')

### **Dataset Information**

In [10]:
df.head()

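| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 | 0 |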


In [11]:
# column names

df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')

**There are thirteen features in the dataset:**

* age: The person's age in years

* sex: The person's gender (1 = male, 0 = female)

* cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)

* trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)

* chol: The person's cholesterol measurement in mg/dl

* fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

* restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

* thalach: The person's maximum heart rate achieved

* exang: Exercise induced angina (1 = yes; 0 = no)

* oldpeak: ST depression induced by exercise relative to rest

* slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)

* ca: The number of major vessels (0-3) colored by fluoroscopy

* thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)

**The target variable in the dataset:**
* num: Heart disease (0 = no, 1 = yes)

### **Check for missing values**

In [12]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64

**Conclusion: There are no null values in the data.**
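
Note that `isnull()` counts only true `NaN`/`None` entries. If a file encodes missing entries with a placeholder string, those cells are not counted until the placeholder is converted. The sketch below illustrates this on a toy frame; the `'?'` placeholder and the values are assumptions for illustration, and nothing here shows that the dataset above contains such placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame; '?' is an assumed placeholder for a missing entry
toy = pd.DataFrame({
    "ca": ["0", "3", "?"],
    "thal": ["6", "?", "7"],
    "age": [63, 67, 37],
})

# Before conversion, isnull() sees no missing values at all
print(toy.isnull().sum())

# Convert the placeholder to NaN, then count again
cleaned = toy.replace("?", np.nan)
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

A quick look at `df.dtypes` can hint at this situation: a column that should be numeric but is loaded as `object` often contains such placeholder strings.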

### **Converting Important Numerical Data into Categorical Data**


In [13]:
# converting column 'num'  data [0,1] into two categories

def heart_disease(row):
    if row==0:
        return 'Absent'
    elif row==1:
        return 'Present'


#Applying converted data into our dataset with new column - Heart_Disease

df['Heart_Disease']=df['num'].apply(heart_disease)


In [14]:
# converting column 'sex'  data [0,1] into two categories

def gender(row):
    if row==1:
        return 'Male'
    elif row==0:
        return 'Female'


#Applying converted data into our dataset with new column - Gender

df['Gender']=df['sex'].apply(gender)
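
The two helper functions above translate one code to one label, so the same conversion can also be written with `Series.map` and a dictionary. This is a sketch of an alternative on a toy frame (values invented for illustration), not a replacement the notebook itself uses.

```python
import pandas as pd

# Toy frame with the same 0/1 coding as the 'num' and 'sex' columns
toy = pd.DataFrame({"num": [0, 1, 1], "sex": [1, 0, 1]})

# Each dictionary maps a numeric code to its label
toy["Heart_Disease"] = toy["num"].map({0: "Absent", 1: "Present"})
toy["Gender"] = toy["sex"].map({1: "Male", 0: "Female"})

print(toy)
```

As with the `apply` version, any code missing from the dictionary ends up as a missing value in the new column, so a follow-up `isnull()` check on the new columns is a cheap sanity test.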

In [15]:
# Converting column 'age' data into three categories

def age_range(row):
    # 29 to 39 years
    if row>=29 and row<40:
        return 'Young Age'
    # 40 to 54 years
    elif row>=40 and row<55:
        return 'Middle Age'
    # 55 years and above
    elif row>=55:
        return 'Elder Age'

#Applying converted data into our dataset with new column - Age_Range

df['Age_Range']=df['age'].apply(age_range)
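
Binning of this kind can also be expressed with `pd.cut`, which makes the bin edges explicit so that every boundary age falls into exactly one category. The sketch below assumes half-open bins `[29, 40)`, `[40, 55)` and `[55, inf)`, and uses toy ages rather than the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages chosen to sit on and around the bin edges
toy_ages = pd.Series([29, 39, 40, 54, 55, 77], name="age")

bins = [29, 40, 55, np.inf]
labels = ["Young Age", "Middle Age", "Elder Age"]

# right=False makes each bin include its left edge and exclude its right edge
age_range_binned = pd.cut(toy_ages, bins=bins, labels=labels, right=False)

print(age_range_binned.tolist())
```

Ages below the first edge (under 29 here) are left as missing by `pd.cut`, which mirrors how the function-based version returns nothing for them.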

In [20]:
# Save processed data to a CSV file

df.to_csv(r'C:\Users\prath\Heart-Disease-Diagnostic-Analysis\heart_disease_processed_data.csv', index=False)
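
Hard-coded absolute paths only work on one machine. A minimal sketch of a more portable approach with `pathlib` follows; the temporary `project_root` and the `Data/Processed` folder are hypothetical stand-ins for illustration, not the project's actual layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical project root; a temporary folder keeps the sketch self-contained
project_root = Path(tempfile.mkdtemp())

# Assumed output folder, created if it does not exist yet
processed_dir = project_root / "Data" / "Processed"
processed_dir.mkdir(parents=True, exist_ok=True)

out_path = processed_dir / "heart_disease_processed_data.csv"

# Toy frame standing in for the processed DataFrame
toy = pd.DataFrame({"age": [63, 41], "Age_Range": ["Elder Age", "Middle Age"]})
toy.to_csv(out_path, index=False)

# Read the file back to confirm the round trip
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Building the path from parts with `/` avoids escaping backslashes and works the same way on Windows, macOS and Linux.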