# 1. Importing libraries and loading the data

In [54]:
import pandas as pd
import matplotlib.pyplot as plt

In [55]:
df=pd.read_csv("data/train.csv")
df.head(10)

Unnamed: 0,id,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,0,58,1,4,152,239,0,0,158,1,3.6,2,2,7,Presence
1,1,52,1,1,125,325,0,2,171,0,0.0,1,0,3,Absence
2,2,56,0,2,160,188,0,2,151,0,0.0,1,0,3,Absence
3,3,44,0,3,134,229,0,2,150,0,1.0,2,0,3,Absence
4,4,58,1,4,140,234,0,2,125,1,3.8,2,3,3,Presence
5,5,38,1,4,138,283,0,0,147,1,1.6,2,2,7,Presence
6,6,59,1,4,130,246,0,2,152,0,0.8,2,2,3,Presence
7,7,60,0,3,120,245,0,0,151,0,1.2,1,0,3,Absence
8,8,48,0,4,140,212,0,2,125,0,0.0,1,0,3,Absence
9,9,44,0,4,150,197,0,0,150,0,0.0,2,0,3,Absence


In [56]:
df.columns

Index(['id', 'Age', 'Sex', 'Chest pain type', 'BP', 'Cholesterol',
       'FBS over 120', 'EKG results', 'Max HR', 'Exercise angina',
       'ST depression', 'Slope of ST', 'Number of vessels fluro', 'Thallium',
       'Heart Disease'],
      dtype='object')

# 2. Brief information of features in the dataset

👴 **Age:-** Age of the patient (in years)

🚻 **Sex:-** Sex of the patient (1 = male, 0 = female)

❤️‍🩹 **Chest pain type:-** Type of chest pain experienced by the patient. Chest pain type helps differentiate between cardiac and non-cardiac causes.

-   Typical categories:
    -   1 → Typical Angina: chest pain related to reduced blood flow to the heart (strong indicator of heart disease).
    -   2 → Atypical Angina: chest pain that is not classic but may still be heart-related.
    -   3 → Non-anginal Pain: chest pain not related to heart disease.
    -   4 → Asymptomatic: no chest pain, but heart disease may still be present.

👉 Patients with typical angina (1) or asymptomatic (4) cases often show higher heart disease risk in datasets.

🩸 **BP:-** Resting blood pressure (mm Hg)

🧪 **Cholesterol:-** Serum cholesterol level (mg/dl)

🍬 **FBS over 120:-** Fasting blood sugar > 120 mg/dl (1 = yes, 0 = no)

📉 **EKG results:-** Resting electrocardiogram results

-   An EKG measures the electrical activity of the heart.
-   Indicates possible heart abnormalities.
-   Typical values:
    -   0 → Normal
    -   1 → ST-T wave abnormality (may indicate ischemia or heart strain)
    -   2 → Left ventricular hypertrophy (enlarged heart muscle)

👉 Abnormal EKG results (1 or 2) may suggest underlying heart disease.

🏃‍♂️ **Max HR:-** Maximum heart rate achieved during exercise

😣 **Exercise angina:-** Exercise-induced angina (1 = yes, 0 = no)

📉 **ST depression:-** ST depression induced by exercise relative to rest

📉 **Slope of ST:-**

-   Refers to the slope of the ST segment in an ECG during peak exercise.
-   It indicates how the heart responds to stress.
-   Typical values (this dataset codes them 1 to 3):
    -   1 → Upsloping (generally normal)
    -   2 → Flat (may indicate risk)
    -   3 → Downsloping (stronger indicator of heart disease)

🩻 **Number of vessels fluro:-**

-   Number of major blood vessels (0–3) colored by fluoroscopy.
-   Fluoroscopy is an imaging technique that shows blood flow in coronary arteries.
-   Higher values usually indicate more blocked vessels.

☢️ **Thallium:-**

-   Result of a Thallium stress test (nuclear imaging test).
-   It shows how well blood flows to the heart muscle.
-   Encoded values in this dataset:
    -   3 → Normal
    -   6 → Fixed defect (no blood flow even at rest → previous damage)
    -   7 → Reversible defect (blood flow reduced under stress but normal at rest)
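The code descriptions above are easy to check against the data by listing the distinct values in each encoded column. The following is a minimal sketch on a hand-built sample holding the first five rows shown earlier; `sample` and `observed_codes` are illustrative names, and on the real data the same loop would run over `df`.

```python
import pandas as pd

# Hypothetical sample: the coded columns of the first five rows from df.head(10)
sample = pd.DataFrame({
    "Chest pain type": [4, 1, 2, 3, 4],
    "EKG results": [0, 2, 2, 2, 2],
    "Slope of ST": [2, 1, 1, 2, 2],
    "Thallium": [7, 3, 3, 3, 3],
})

# Collect the sorted distinct codes present in each column
observed_codes = {
    col: sorted(sample[col].unique().tolist()) for col in sample.columns
}

for col, codes in observed_codes.items():
    print(f"{col}: {codes}")
```

Any code that appears here but not in the descriptions (or the other way round) is a sign that the documentation and the encoding have drifted apart.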


### 2.1 Metadata about the dataset

In [57]:
print("Number of records:",df.shape[0])
print("Number of features:",df.shape[1])

Number of records: 630000
Number of features: 15


In [58]:
print("Checking for missing values & datatype of the columns")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 630000 entries, 0 to 629999
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       630000 non-null  int64  
 1   Age                      630000 non-null  int64  
 2   Sex                      630000 non-null  int64  
 3   Chest pain type          630000 non-null  int64  
 4   BP                       630000 non-null  int64  
 5   Cholesterol              630000 non-null  int64  
 6   FBS over 120             630000 non-null  int64  
 7   EKG results              630000 non-null  int64  
 8   Max HR                   630000 non-null  int64  
 9   Exercise angina          630000 non-null  int64  
 10  ST depression            630000 non-null  float64
 11  Slope of ST              630000 non-null  int64  
 12  Number of vessels fluro  630000 non-null  int64  
 13  Thallium                 630000 non-null  int64  
 14  Hear
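
`df.info()` reports non-null counts, which have to be compared against the row count by eye. A per-column count of missing values can be easier to scan. This is a small sketch on a toy frame; the `toy` data is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative only)
toy = pd.DataFrame({"Age": [58, 52, np.nan], "BP": [152, 125, 160]})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```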

In [59]:
df.describe()

Unnamed: 0,id,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium
count,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0
mean,314999.5,54.136706,0.714735,3.312752,130.497433,245.011814,0.079987,0.98166,152.816763,0.273725,0.716028,1.455871,0.45104,4.618873
std,181865.479132,8.256301,0.451541,0.851615,14.975802,33.681581,0.271274,0.998783,19.112927,0.44587,0.948472,0.545192,0.798549,1.950007
min,0.0,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0
25%,157499.75,48.0,0.0,3.0,120.0,223.0,0.0,0.0,142.0,0.0,0.0,1.0,0.0,3.0
50%,314999.5,54.0,1.0,4.0,130.0,243.0,0.0,0.0,157.0,0.0,0.1,1.0,0.0,3.0
75%,472499.25,60.0,1.0,4.0,140.0,269.0,0.0,2.0,166.0,1.0,1.4,2.0,1.0,7.0
max,629999.0,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0


# 3. Feature mapping for Exploratory Data Analysis 

#### Many columns with the "int64" datatype are really categorical: they represent discrete medical categories, not continuous quantities.
**So I treat them as categorical features during EDA and will apply appropriate encoding during modeling.**
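
One way to make this treatment explicit in pandas is to cast the coded columns to the `category` dtype. The sketch below uses a toy frame; the list `categorical_cols` is an assumption about which columns qualify, not something this notebook defines.

```python
import pandas as pd

# Toy frame mimicking a few of the dataset's columns (illustrative values)
toy = pd.DataFrame({
    "Age": [58, 52, 56],
    "Sex": [1, 1, 0],
    "Chest pain type": [4, 1, 2],
    "Thallium": [7, 3, 3],
})

# Assumed list of columns that hold category codes rather than quantities
categorical_cols = ["Sex", "Chest pain type", "Thallium"]

# astype with a dict converts only the listed columns
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```

This notebook maps the codes to readable labels instead, which serves the same purpose for plotting and grouping.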

In [60]:
df.head()

Unnamed: 0,id,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,0,58,1,4,152,239,0,0,158,1,3.6,2,2,7,Presence
1,1,52,1,1,125,325,0,2,171,0,0.0,1,0,3,Absence
2,2,56,0,2,160,188,0,2,151,0,0.0,1,0,3,Absence
3,3,44,0,3,134,229,0,2,150,0,1.0,2,0,3,Absence
4,4,58,1,4,140,234,0,2,125,1,3.8,2,3,3,Presence


In [61]:
#drop id column
df_eda = df.drop("id",axis=1)

In [62]:
df_eda

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,58,1,4,152,239,0,0,158,1,3.6,2,2,7,Presence
1,52,1,1,125,325,0,2,171,0,0.0,1,0,3,Absence
2,56,0,2,160,188,0,2,151,0,0.0,1,0,3,Absence
3,44,0,3,134,229,0,2,150,0,1.0,2,0,3,Absence
4,58,1,4,140,234,0,2,125,1,3.8,2,3,3,Presence
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
629995,56,0,1,110,226,0,0,132,0,0.0,1,0,7,Absence
629996,54,1,4,128,249,1,2,150,0,0.0,2,0,3,Absence
629997,67,1,4,130,275,0,0,149,0,0.0,1,2,7,Presence
629998,52,1,4,140,199,0,2,157,0,0.0,1,0,6,Presence


In [63]:
df_eda.head()

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,58,1,4,152,239,0,0,158,1,3.6,2,2,7,Presence
1,52,1,1,125,325,0,2,171,0,0.0,1,0,3,Absence
2,56,0,2,160,188,0,2,151,0,0.0,1,0,3,Absence
3,44,0,3,134,229,0,2,150,0,1.0,2,0,3,Absence
4,58,1,4,140,234,0,2,125,1,3.8,2,3,3,Presence


In [64]:
df_eda['Exercise angina'].unique()

array([1, 0], dtype=int64)

In [65]:
df_eda.columns

Index(['Age', 'Sex', 'Chest pain type', 'BP', 'Cholesterol', 'FBS over 120',
       'EKG results', 'Max HR', 'Exercise angina', 'ST depression',
       'Slope of ST', 'Number of vessels fluro', 'Thallium', 'Heart Disease'],
      dtype='object')

In [66]:
df_eda['Sex']=df_eda['Sex'].map({1:"male",0:"female"})
df_eda['Chest pain type']=df_eda['Chest pain type'].map({1:"Typical Angina",2:"Atypical Angina",3:"Non-anginal pain",4:"Asymptomatic"})
df_eda['FBS over 120']=df_eda['FBS over 120'].map({1:"yes",0:"no"})
df_eda['EKG results']=df_eda['EKG results'].map({0:"Normal",1:"ST-T wave abnormality",2:"Left ventricular hypertrophy"})
df_eda['Exercise angina']=df_eda['Exercise angina'].map({1:"yes",0:"no"})
df_eda['Slope of ST']=df_eda['Slope of ST'].map({1:"Upsloping",2:"Flat",3:"Downsloping"})
df_eda['Thallium']=df_eda['Thallium'].map({3:"Normal",6:"Fixed defect",7:"Reversible defect"})

In [67]:
df_eda.head()

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,58,male,Asymptomatic,152,239,no,Normal,158,yes,3.6,Flat,2,Reversible defect,Presence
1,52,male,Typical Angina,125,325,no,Left ventricular hypertrophy,171,no,0.0,Upsloping,0,Normal,Absence
2,56,female,Atypical Angina,160,188,no,Left ventricular hypertrophy,151,no,0.0,Upsloping,0,Normal,Absence
3,44,female,Non-anginal pain,134,229,no,Left ventricular hypertrophy,150,no,1.0,Flat,0,Normal,Absence
4,58,male,Asymptomatic,140,234,no,Left ventricular hypertrophy,125,yes,3.8,Flat,3,Normal,Presence
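
`Series.map` with a dict returns `NaN` for any value that has no key in the dict, so a missing code would silently blank out rows. A hedged sketch of a check for that, using toy data and a deliberately incomplete mapping (`incomplete_map` is hypothetical):

```python
import pandas as pd

# Toy column containing a code (3) that the mapping below does not cover
slope = pd.Series([1, 2, 3, 2], name="Slope of ST")
incomplete_map = {1: "Upsloping", 2: "Flat"}  # 3 is left out on purpose

mapped = slope.map(incomplete_map)

# Rows that became NaN point back to the original codes lacking a label
unmapped_codes = sorted(slope[mapped.isna()].unique().tolist())

print("Codes without a label:", unmapped_codes)
```

Running the same check on each mapped column of `df_eda` (comparing against the corresponding column of `df`) would confirm that every code received a label.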


# 4. Saving this dataframe for Exploratory Data Analysis in the next notebook

In [68]:
df_eda.to_csv("feature_engineered_data_for_eda.csv",index=False)
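
CSV files carry no dtype information, so the label columns come back as plain strings when the file is read again. The sketch below shows one way to restore them as categoricals on load; it writes toy data to a temporary file, and the path and column choice are illustrative rather than taken from the next notebook.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_eda (illustrative values)
toy = pd.DataFrame({"Age": [58, 52], "Sex": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_eda.csv")
    toy.to_csv(path, index=False)

    # The dtype argument asks read_csv to parse the column as categorical
    reloaded = pd.read_csv(path, dtype={"Sex": "category"})

print(reloaded.dtypes)
```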