### Performing Automated Exploratory Data Analysis on a Cardiovascular Dataset
 The dataset contains patient records with the following attributes:
 - ID : Patient identification number - numeric value
 - Age : In days - numeric value
 - Gender : Categorical - 1. Female 2. Male
 - Height : In cm - numeric value
 - Weight : In kg - numeric value
 - AP_Hi : Systolic blood pressure - numeric value
 - AP_Low : Diastolic blood pressure - numeric value
 - Cholesterol : Categorical - 1. Normal 2. Above normal 3. Well above normal
 - Glucose : Categorical - 1. Normal 2. Above normal 3. Well above normal
 - Smoke : Binary - Smokes or not
 - Alcohol : Binary - Consumes alcohol or not
 - Active : Binary - Active lifestyle or not
 - Cardio : Target variable - Binary - Presence or absence of cardiovascular disease
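
The categorical codes listed above can be turned into readable labels when inspecting records. The following is a minimal sketch on made-up rows; the dictionaries `GENDER_LABELS` and `LEVEL_LABELS` are hypothetical helpers built from the descriptions above, and the column names follow the CSV header (`gender`, `cholesterol`, `gluc`).

```python
import pandas as pd

# Hypothetical label maps, built from the attribute descriptions above
GENDER_LABELS = {1: "Female", 2: "Male"}
LEVEL_LABELS = {1: "Normal", 2: "Above normal", 3: "Well above normal"}

# Made-up records using the column names found in the CSV header
sample = pd.DataFrame({
    "gender": [2, 1, 1],
    "cholesterol": [1, 3, 2],
    "gluc": [1, 1, 3],
})

# Series.map replaces each code with its label; the coded columns are kept as-is
decoded = sample.assign(
    gender_label=sample["gender"].map(GENDER_LABELS),
    cholesterol_label=sample["cholesterol"].map(LEVEL_LABELS),
    gluc_label=sample["gluc"].map(LEVEL_LABELS),
)
print(decoded)
```

The numeric codes are left in place because models generally expect numeric inputs; the labels are only for reading the data.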
 

Steps Involved in EDA

**Cleaning the Data**
- Remove unwanted columns such as ID (the patient ID has no influence on whether or not the patient has cardiovascular disease).

**Performing Transformation in the Data**
- Age is recorded in days. Age in years is easier to read and interpret, so we convert it from days to years.

**Removing Outliers/Anomalies** 
- Remove outliers, that is, values far outside the plausible range. For example, records whose ap_hi or ap_lo value is 300 or more are dropped (a value like 13000 does not make sense in these columns), as are records whose value is 30 or less. This helps build a better, more efficient ML model.

**Using Pandas Profiling**
- The auto EDA tool used here is pandas-profiling, which helps us understand the data: the datatype of each column, the number of distinct values in a column, the range of values in each column, and the interactions and correlations between columns.
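
To make the items in that list concrete, the sketch below computes a few of them by hand with plain pandas on a made-up frame. It only illustrates what kind of summary the report contains; it is not how pandas-profiling computes its report, and the `toy` data is invented for the example.

```python
import pandas as pd

# Made-up values, only to illustrate the summaries
toy = pd.DataFrame({
    "height": [168, 156, 165, 169],
    "weight": [62.0, 85.0, 64.0, 82.0],
    "cholesterol": [1, 3, 3, 1],
})

dtypes = toy.dtypes                # datatype held by each column
distinct = toy.nunique()           # number of distinct values per column
ranges = toy.agg(["min", "max"])   # range of values per column
correlations = toy.corr()          # pairwise Pearson correlations

print(dtypes, distinct, ranges, correlations, sep="\n\n")
```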

Installing pandas-profiling to perform auto EDA

In [None]:
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Using cached https://github.com/pandas-profiling/pandas-profiling/archive/master.zip (22.0 MB)


Import Statements

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

Reading the patient CSV file with pandas

In [None]:
# The file is semicolon-separated, so the delimiter is set explicitly
file = 'https://raw.githubusercontent.com/Dhanasree-Rajamani/Data-Mining/main/Data%20Mining%20Assignment%202/DataMining_Datasets/cardio_train.csv'
df = pd.read_csv(file, delimiter=";")
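
The `delimiter=";"` argument matters because `read_csv` splits on commas by default. A small sketch, using an in-memory string with made-up rows in place of the real file, shows the difference:

```python
import io
import pandas as pd

# Two made-up rows in a semicolon-separated layout (not taken from the dataset)
raw = "id;age;gender\n0;18250;2\n1;20075;1\n"

with_default = pd.read_csv(io.StringIO(raw))                    # default delimiter is ","
with_semicolon = pd.read_csv(io.StringIO(raw), delimiter=";")   # split on ";"

print(with_default.shape)    # each row lands in a single column
print(with_semicolon.shape)  # three separate columns
```

Without the argument, the whole header is read as one column name, so checking `df.columns` right after loading is a quick way to catch a wrong delimiter.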

Understanding the columns present in the file

In [None]:
df.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

The number of rows and columns in the dataset

In [None]:
df.shape

(39988, 13)

Convert age from days to years to improve readability

In [None]:
df['age_in_years'] = df['age']//365
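
Floor division keeps only completed years. The sketch below works through a few made-up day counts to show the effect; the division by 365.25 is an assumed alternative for fractional years and is not used in this notebook.

```python
import pandas as pd

# Illustrative day counts, not taken from the dataset
age_days = pd.Series([18250, 18614, 20075])

# Whole completed years, same operation as in the cell above
age_years_floor = age_days // 365

# Fractional years, allowing for leap years (an alternative, not used here)
age_years_fractional = age_days / 365.25

print(age_years_floor.tolist())
print(age_years_fractional.round(2).tolist())
```

Note that 18614 days is one day short of 51 × 365, so floor division still reports 50.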

Remove outlier values in the systolic and diastolic pressure columns to build a better model

In [None]:
# Flag implausible blood-pressure readings (30 or less, or 300 or more)
bad_ap_hi = (df['ap_hi'] <= 30) | (df['ap_hi'] >= 300)
bad_ap_lo = (df['ap_lo'] <= 30) | (df['ap_lo'] >= 300)

# Keep only the rows where both readings are plausible
df = df[~(bad_ap_hi | bad_ap_lo)]
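
The filter rule can be checked on a small made-up frame. This is a sketch only: the helper `plausible` is a hypothetical name, and the rows are invented to cover the boundary cases (exactly 30 and exactly 300 are both dropped).

```python
import pandas as pd

# Made-up readings covering normal values, an extreme value, and both boundaries
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "ap_hi": [120, 13000, 300, 110, 140],
    "ap_lo": [80, 90, 100, 30, 90],
})

def plausible(series):
    """Return True where a reading lies strictly between 30 and 300."""
    return (series > 30) & (series < 300)

# A row is kept only when both pressure readings are plausible
kept = toy[plausible(toy["ap_hi"]) & plausible(toy["ap_lo"])]
print(kept["id"].tolist())
```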

Checking number of rows in the dataset

In [None]:
df.shape

(39285, 14)
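
Comparing the two `df.shape` outputs shows how many records the filter removed. A short worked calculation, using the row counts printed in this notebook:

```python
rows_before = 39988  # rows reported by df.shape before filtering
rows_after = 39285   # rows reported by df.shape after filtering

removed = rows_before - rows_after
share = removed / rows_before

print(f"{removed} rows removed ({share:.1%} of the data)")
```

The column count rises from 13 to 14 because `age_in_years` was added before the filter.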

Dropping unwanted columns: id does not influence the target value, and age is now redundant with age_in_years

In [None]:
df = df.drop(columns=['id'])

In [None]:
df = df.drop(columns=['age'])

In [None]:
df = df.reset_index(drop = True)
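
Filtering rows leaves gaps in the index, which is why the index is reset here. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up readings; the middle row is removed by the filter
toy = pd.DataFrame({"ap_hi": [120, 13000, 110]})
filtered = toy[toy["ap_hi"] < 300]

# drop=True discards the old index instead of keeping it as a new column
renumbered = filtered.reset_index(drop=True)

print(filtered.index.tolist())    # gap where the row was removed
print(renumbered.index.tolist())  # contiguous again
```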

Viewing the first 5 records of the dataset

In [None]:
df.head()

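|   | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | age_in_years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 50 |
| 1 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 55 |
| 2 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 51 |
| 3 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 48 |
| 4 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 47 |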


Performing the auto EDA step with pandas-profiling

In [None]:
profile = ProfileReport(df, title='Pandas Profiling report for patient cardio dataset', explorative = True)

Saving the pandas-profiling report to a file

In [None]:
profile.to_file("output_cardio_pandasProfiling.html")
