This Jupyter notebook illustrates how exploratory data analysis (EDA) in Python helps to inspect raw data, clean it, and make sense of it.

Here we install and import the modules needed to explore the data.

In [17]:
%pip install scikit-Learn

Collecting scikit-Learn
  Downloading scikit_learn-1.2.1-cp310-cp310-macosx_12_0_arm64.whl (8.4 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting scipy>=1.3.2
  Downloading scipy-1.10.0-cp310-cp310-macosx_12_0_arm64.whl (28.8 MB)
Collecting joblib>=1.1.1
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-Learn
Successfully installed joblib-1.2.0 scikit-Learn-1.2.1 scipy-1.10.0 threadpoolctl-3.1.0


In [18]:
import numpy as np
import pandas as pd
import re  # standard-library regular expressions; the third-party regex package is not installed above
from sklearn.preprocessing import LabelEncoder

Inspect the first few rows of the data.

In [19]:
heart_df = pd.read_csv('heart_disease.csv')
print(heart_df.head())

    age     sex  trestbps   chol                cp  exang  fbs  thalach  \
0  63.0    male     145.0  233.0    typical angina    0.0  1.0    150.0   
1  67.0    male     160.0  286.0      asymptomatic    1.0  0.0    108.0   
2  67.0    male     120.0  229.0      asymptomatic    1.0  0.0    129.0   
3  37.0    male     130.0  250.0  non-anginal pain    0.0  0.0    187.0   
4  41.0  female     130.0  204.0   atypical angina    0.0  0.0    172.0   

  heart_disease  
0       absence  
1      presence  
2      presence  
3       absence  
4       absence  


Inspect the data types of the data frame's columns.

In [20]:
print(heart_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    object 
 2   trestbps       303 non-null    float64
 3   chol           303 non-null    float64
 4   cp             303 non-null    object 
 5   exang          303 non-null    float64
 6   fbs            303 non-null    float64
 7   thalach        303 non-null    float64
 8   heart_disease  303 non-null    object 
dtypes: float64(6), object(3)
memory usage: 21.4+ KB
None


Find out which kinds of cp (chest pain) the patients report.

In [21]:
print(heart_df.cp.unique())

['typical angina' 'asymptomatic' 'non-anginal pain' 'atypical angina']


Convert cp to an ordered categorical so that the chest pain types are ranked by severity.

In [22]:
cp_order = ['non-anginal pain', 'typical angina', 'atypical angina', 'asymptomatic']
heart_df.cp = pd.Categorical(heart_df.cp, categories=cp_order, ordered=True)
print(heart_df.cp.unique())

['typical angina', 'asymptomatic', 'non-anginal pain', 'atypical angina']
Categories (4, object): ['non-anginal pain' < 'typical angina' < 'atypical angina' < 'asymptomatic']
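As a minimal sketch of what the ordering buys us, the toy series below (four illustrative values, not the real heart_df) uses the same category order. The names `cp`, `order`, `codes` and `above_typical` are made up for this example.

```python
import pandas as pd

# Toy stand-in for heart_df.cp; one row per chest pain type.
cp = pd.Series(['typical angina', 'asymptomatic', 'non-anginal pain', 'atypical angina'])

# Same category order as in the cell above.
order = ['non-anginal pain', 'typical angina', 'atypical angina', 'asymptomatic']
cp = pd.Series(pd.Categorical(cp, categories=order, ordered=True))

# The integer codes follow the declared order, not alphabetical order.
codes = cp.cat.codes.tolist()

# Ordered categories can be compared against a category label.
above_typical = cp[cp > 'typical angina'].tolist()

print(codes)
print(above_typical)
```

Comparisons such as `cp > 'typical angina'` only work because `ordered=True` was passed; on an unordered categorical pandas raises a TypeError.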


The "heart_disease" column takes only two values, so it is a good candidate for a binary variable. Here we encode it as 1 (presence) and 0 (absence).

In [23]:
print(heart_df.heart_disease.unique())
heart_df.heart_disease=heart_df.heart_disease.apply(lambda x:1 if x=='presence' else 0)
print(heart_df.head())
print(heart_df.info())

['absence' 'presence']
    age     sex  trestbps   chol                cp  exang  fbs  thalach  \
0  63.0    male     145.0  233.0    typical angina    0.0  1.0    150.0   
1  67.0    male     160.0  286.0      asymptomatic    1.0  0.0    108.0   
2  67.0    male     120.0  229.0      asymptomatic    1.0  0.0    129.0   
3  37.0    male     130.0  250.0  non-anginal pain    0.0  0.0    187.0   
4  41.0  female     130.0  204.0   atypical angina    0.0  0.0    172.0   

   heart_disease  
0              0  
1              1  
2              1  
3              0  
4              0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   age            303 non-null    float64 
 1   sex            303 non-null    object  
 2   trestbps       303 non-null    float64 
 3   chol           303 non-null    float64 
 4   cp             303 non-null    ca
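One caveat about the encoding above: the lambda sends every value other than 'presence' to 0, so a misspelled label would silently count as absence. The sketch below shows a stricter alternative with `Series.map`, on a toy series with a deliberately misspelled value. The names `status`, `lambda_encoded`, `mapped` and `unmapped_rows` are made up for this example.

```python
import pandas as pd

# Toy stand-in for heart_df.heart_disease; the last value is a deliberate typo.
status = pd.Series(['absence', 'presence', 'presence', 'Presence'])

# Same rule as the cell above: anything that is not 'presence' becomes 0.
lambda_encoded = status.apply(lambda x: 1 if x == 'presence' else 0)

# An explicit mapping leaves unknown labels as NaN, so they can be spotted.
mapped = status.map({'absence': 0, 'presence': 1})
unmapped_rows = mapped[mapped.isna()].index.tolist()

print(lambda_encoded.tolist())
print(unmapped_rows)
```

In this notebook unique() shows only 'absence' and 'presence', so both approaches give the same result here.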

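Encode cp as integers with LabelEncoder. Note that LabelEncoder assigns codes in alphabetical order of the labels (asymptomatic = 0, atypical angina = 1, non-anginal pain = 2, typical angina = 3), not in the severity order defined earlier.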

In [27]:
le=LabelEncoder()
heart_df.cp=le.fit_transform(heart_df.cp)
print(heart_df.head())

    age     sex  trestbps   chol  cp  exang  fbs  thalach  heart_disease
0  63.0    male     145.0  233.0   3    0.0  1.0    150.0              0
1  67.0    male     160.0  286.0   0    1.0  0.0    108.0              1
2  67.0    male     120.0  229.0   0    1.0  0.0    129.0              1
3  37.0    male     130.0  250.0   2    0.0  0.0    187.0              0
4  41.0  female     130.0  204.0   1    0.0  0.0    172.0              0
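To see where the codes in this output come from, the sketch below compares LabelEncoder with the categorical's own codes on a toy series (one row per chest pain type, not the real data). The `astype(str)` call and the names `le_codes`, `label_to_code` and `cat_codes` are choices made for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = ['typical angina', 'asymptomatic', 'non-anginal pain', 'atypical angina']
order = ['non-anginal pain', 'typical angina', 'atypical angina', 'asymptomatic']
cp = pd.Series(pd.Categorical(labels, categories=order, ordered=True))

# LabelEncoder sorts the distinct labels and numbers them from 0.
le = LabelEncoder()
le_codes = le.fit_transform(cp.astype(str)).tolist()
label_to_code = {label: code for code, label in enumerate(le.classes_)}

# The categorical's own codes keep the declared severity order instead.
cat_codes = cp.cat.codes.tolist()

print(label_to_code)
print(le_codes)
print(cat_codes)
```

If the severity order matters for later analysis, `heart_df.cp.cat.codes` would be one way to keep it; the rest of this notebook relies on the LabelEncoder codes, where 3 means typical angina.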


Build a data frame of the patients with typical angina (cp == 3), referred to below as acute angina.

In [31]:
angina_df=heart_df[heart_df.cp==3].reset_index(drop=True)
print(angina_df.head())
#print(angina_df.info())

    age     sex  trestbps   chol  cp  exang  fbs  thalach  heart_disease
0  63.0    male     145.0  233.0   3    0.0  1.0    150.0              0
1  64.0    male     110.0  211.0   3    1.0  0.0    144.0              0
2  58.0  female     150.0  283.0   3    0.0  1.0    162.0              0
3  66.0  female     150.0  226.0   3    0.0  0.0    114.0              0
4  69.0  female     140.0  239.0   3    0.0  0.0    151.0              0


Display the average, maximum, and minimum age of patients with acute angina, along with their count and their share of all patients.

In [43]:
print('avg patient age suffering from angina:')
print(int(angina_df.age.mean()),'years')
print('max patient age suffering from angina:')
print(int(angina_df.age.max()),'years')
print('min patient age suffering from angina:')
print(int(angina_df.age.min()),'years')
print("Total patients suffering:")
print(angina_df.age.count())
print("% of patients with acute angina:"+str(100*(angina_df.age.count()/heart_df.sex.count())))

avg patient age suffering from angina:
55 years
max patient age suffering from angina:
69 years
min patient age suffering from angina:
34 years
Total patients suffering:
23
% of patients with acute angina:7.590759075907591
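The same statistics can be collected in one call with `Series.agg`. This is a sketch on the five ages shown in the angina_df.head() output, so the numbers differ from the full-data results above; `ages` and `summary` are names made up for this example.

```python
import pandas as pd

# Ages of the first five angina patients, copied from the head() output.
ages = pd.Series([63.0, 64.0, 58.0, 66.0, 69.0], name='age')

# One call returns all four statistics as a Series indexed by statistic name.
summary = ages.agg(['mean', 'min', 'max', 'count'])

print(summary)
```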


Count how many angina sufferers are male and how many are female.

In [46]:
angina_m_fm_df = angina_df.groupby(['sex']).age.count().reset_index()
angina_m_fm_df.rename(columns={"age":"count"},inplace=True)    # rename() returns a new DataFrame by default; inplace=True modifies angina_m_fm_df directly instead
print(angina_m_fm_df)

      sex  count
0  female      4
1    male     19
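A shorter route to the same kind of table is `groupby(...).size()`, which counts rows per group and lets `reset_index` name the count column directly. The sketch uses a toy frame built from the five head() rows; `angina_toy` and `counts_df` are names made up for this example.

```python
import pandas as pd

# Toy stand-in for angina_df, built from the first five rows shown earlier.
angina_toy = pd.DataFrame({
    'sex': ['male', 'male', 'female', 'female', 'female'],
    'age': [63.0, 64.0, 58.0, 66.0, 69.0],
})

# size() counts rows per group; reset_index(name=...) names the new column.
counts_df = angina_toy.groupby('sex').size().reset_index(name='count')

print(counts_df)
```

Unlike `.age.count()`, `size()` also counts rows whose age is missing. The two agree here because info() reports no null values.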


List the patients who have both acute angina and heart disease.

In [50]:
# Keep only the rows with cp == 3 and heart disease present
ang_heart_df=heart_df[(heart_df.cp==3)&(heart_df.heart_disease==1)].reset_index(drop=True)
print(ang_heart_df)

    age   sex  trestbps   chol  cp  exang  fbs  thalach  heart_disease
0  65.0  male     138.0  282.0   3    0.0  1.0    174.0              1
1  59.0  male     170.0  288.0   3    0.0  0.0    159.0              1
2  59.0  male     160.0  273.0   3    0.0  0.0    125.0              1
3  38.0  male     120.0  231.0   3    1.0  0.0    182.0              1
4  61.0  male     134.0  234.0   3    0.0  0.0    145.0              1
5  59.0  male     134.0  204.0   3    0.0  0.0    162.0              1
6  45.0  male     110.0  264.0   3    0.0  0.0    132.0              1
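The combined filter can also be written with `DataFrame.query`, which some readers find easier to scan. A minimal sketch on a toy frame (illustrative rows, not the full dataset); `toy`, `mask_df` and `query_df` are names made up for this example.

```python
import pandas as pd

# Toy stand-in for heart_df with only the columns the filter needs.
toy = pd.DataFrame({
    'cp': [3, 0, 3, 2],
    'heart_disease': [0, 1, 1, 0],
    'age': [63.0, 67.0, 65.0, 37.0],
})

# Boolean-mask form: each condition needs its own parentheses around it.
mask_df = toy[(toy.cp == 3) & (toy.heart_disease == 1)].reset_index(drop=True)

# Equivalent query-string form.
query_df = toy.query('cp == 3 and heart_disease == 1').reset_index(drop=True)

print(mask_df)
print(mask_df.equals(query_df))
```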


Count these patients, and count how many of them are older than 60.

In [52]:
print("Total patient count with heart problems and acute angina:",ang_heart_df.age.count())
print("Total patient count with above problems and age>60yrs:",len(ang_heart_df[ang_heart_df.age>60]))

Total patient count with heart problems and acute angina: 7
Total patient count with above problems and age>60yrs: 2


Find the percentage of patients with a chol value above 240.

In [54]:
print("% of patients with chol count > 240:",100*(len(heart_df[heart_df.chol>240])/heart_df.age.count()))

% of patients with chol count > 240: 50.165016501650165
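Because True counts as 1 and False as 0, the mean of a boolean mask is the fraction of rows that satisfy the condition. The sketch below applies this to the five chol values from the head() output, so the result is not the full-data percentage; `chol` and `pct_high` are names made up for this example.

```python
import pandas as pd

# chol values of the first five patients, copied from the head() output.
chol = pd.Series([233.0, 286.0, 229.0, 250.0, 204.0])

# Fraction of values above 240, expressed as a percentage.
pct_high = 100 * (chol > 240).mean()

print(pct_high)
```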


Group the patients with chol > 240 by sex.

In [58]:
high_chol_df=heart_df[heart_df.chol>240].reset_index(drop=True)
#print(high_chol_df.head())
high_chol_sex_df=high_chol_df.groupby(['sex']).age.count().reset_index()
high_chol_sex_df.rename(columns={'age':'count'},inplace=True)
print(high_chol_sex_df)

      sex  count
0  female     58
1    male     94
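Raw counts depend on how many men and women are in the data, so the share within each sex may be easier to compare. A sketch on a toy frame built from the five head() rows (far too small to say anything about the real data); `toy`, `high` and `pct_by_sex` are names made up for this example.

```python
import pandas as pd

# Toy stand-in for heart_df, built from the first five rows shown earlier.
toy = pd.DataFrame({
    'sex': ['male', 'male', 'male', 'male', 'female'],
    'chol': [233.0, 286.0, 229.0, 250.0, 204.0],
})

# Boolean mask of high-chol patients, averaged within each sex.
high = toy.chol > 240
pct_by_sex = high.groupby(toy.sex).mean() * 100

print(pct_by_sex)
```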
