# **Analyzing Cardiovascular Disease Data**

In [35]:
!wget https://raw.githubusercontent.com/PranjalAgni/mlcourse.ai/master/data/mlbootcamp5_train.csv

Redirecting output to ‘wget-log.1’.


There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

| Feature | Variable Type | Variable      | Value Type |
|---------|--------------|---------------|------------|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code (1 or 2; this notebook treats 2 as male) |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

All of the dataset values were collected at the moment of medical examination.

In [0]:
# import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Disable warnings
import warnings
warnings.filterwarnings('ignore')

# plotly for data viz
import plotly
plotly.tools.set_credentials_file(username='PranjalAgni' , api_key='26Hi3ELeX3Sk7feuAOnj')

In [37]:
my_data = pd.read_csv('mlbootcamp5_train.csv', sep=';')
my_data.head()

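|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|----|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |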
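
The `age` column is stored in days, which is hard to read at a glance. A minimal sketch of converting it to years, using only the first five rows shown in the `head()` output; the `age_years` column name and the 365.25-day year are assumptions made for illustration:

```python
import pandas as pd

# First five rows of the dataset, copied from the head() output.
sample = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "age": [18393, 20228, 18857, 17623, 17474],  # age in days
})

# Assumption: a year is approximated as 365.25 days.
DAYS_PER_YEAR = 365.25

# Hypothetical helper column with the age in years, rounded to one decimal.
sample["age_years"] = (sample["age"] / DAYS_PER_YEAR).round(1)
print(sample)
```

The same expression should work unchanged on the full `my_data` frame.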


In [38]:
my_data.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

In [39]:
my_data.isnull().sum()

id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64

In [40]:
# Number of male smokers (gender code 2 is treated as male)
my_data[(my_data['gender'] == 2) & (my_data['smoke'] == 1)]['smoke'].count()

5356

In [41]:
# Number of female smokers (gender code 1 is treated as female)
my_data[(my_data['gender'] == 1) & (my_data['smoke'] == 1)]['smoke'].count()

813

*Number of male smokers = 5356*
*Number of female smokers = 813*
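
Both counts can also be obtained in a single call with `pd.crosstab`. The sketch below uses a tiny made-up frame (not the real dataset) just to show the shape of the result; applied to `my_data`, the same call should give one row per gender code and one column per `smoke` value:

```python
import pandas as pd

# Tiny illustrative frame with made-up rows (NOT the real dataset).
toy = pd.DataFrame({
    "gender": [1, 1, 1, 2, 2, 2, 2],
    "smoke":  [0, 1, 0, 1, 1, 0, 0],
})

# Rows: gender code, columns: smoke flag, cells: number of people.
smoke_table = pd.crosstab(toy["gender"], toy["smoke"])
print(smoke_table)

# Number of smokers (smoke == 1) for gender code 2.
print(smoke_table.loc[2, 1])
```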

**How many men and women are present in the dataset?**

In [42]:
# Number of men
my_data[my_data['gender'] == 2]['gender'].count()

24470

In [43]:
# Number of women present 
my_data[my_data['gender'] == 1]['gender'].count()

45530

*Number of women = 45530*
*Number of men = 24470*
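
The feature table does not say which gender code stands for which gender. One way to sanity-check the mapping used here (2 for men, 1 for women) is to compare mean heights per code, under the assumption that men are taller on average. A rough sketch on the five rows from the `head()` output; five rows are far too few to conclude anything, so the same lines would need to be run on `my_data`:

```python
import pandas as pd

# gender and height for the first five rows, copied from the head() output.
sample = pd.DataFrame({
    "gender": [2, 1, 1, 2, 1],
    "height": [168, 156, 165, 169, 156],
})

# Mean height per gender code.
mean_height = sample.groupby("gender")["height"].mean()
print(mean_height)

# Assumption: the code with the larger mean height corresponds to men.
likely_male_code = mean_height.idxmax()
print(likely_male_code)
```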

**Which gender consumes alcohol more often?**

In [44]:
# Number of men consuming alcohol
my_data[(my_data['alco'] == 1) & (my_data['gender'] == 2)]['gender'].count()

2603

In [45]:
# Number of women consuming alcohol
my_data[(my_data['alco'] == 1) & (my_data['gender'] == 1)]['gender'].count()

1161

*Men consume alcohol more often than women (2603 men vs. 1161 women), even though the dataset contains fewer men.*
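
Since the two groups differ in size, rates may be a fairer comparison than raw counts. A small sketch that turns the counts reported in the cells above into percentages; the variable names are illustrative:

```python
# Counts taken from the outputs of the cells above.
men_alco, men_total = 2603, 24470
women_alco, women_total = 1161, 45530

# Share of each group that reports alcohol intake, in percent.
men_rate = men_alco / men_total * 100
women_rate = women_alco / women_total * 100

print(f"{men_rate:.1f}% of men, {women_rate:.1f}% of women")
```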

**Difference between the percentages of male and female smokers**

In [46]:
# Percentage of men who smoke (counts taken from the cells above)
male_smokers = 5356
total_males = 24470
percentage_male_smokers = (male_smokers/total_males) * 100
percentage_male_smokers

21.88802615447487

In [47]:
# Percentage of women who smoke (counts taken from the cells above)
female_smokers = 813
total_females = 45530
percentage_female_smokers = (female_smokers/total_females) * 100
percentage_female_smokers

1.7856358444981333

In [48]:
# Difference in percentage points between male and female smokers
difference_smokers = percentage_male_smokers - percentage_female_smokers
round(difference_smokers)

20
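
The hard-coded counts can be avoided by letting pandas compute the rates per group. The sketch below rebuilds only the `gender` and `smoke` columns from the counts reported above (so it runs without the CSV file) and then takes the group means; with the real data, the `groupby` line could presumably be applied to `my_data` directly:

```python
import numpy as np
import pandas as pd

# gender code -> (group size, number of smokers), from the cell outputs above.
counts = {2: (24470, 5356), 1: (45530, 813)}

# Rebuild a two-column frame that has the same counts as the dataset.
frames = []
for code, (total, smokers) in counts.items():
    smoke = np.concatenate([np.ones(smokers, dtype=int),
                            np.zeros(total - smokers, dtype=int)])
    frames.append(pd.DataFrame({"gender": code, "smoke": smoke}))
df = pd.concat(frames, ignore_index=True)

# The mean of a 0/1 column is the share of ones; multiply by 100 for percent.
smoking_rate = df.groupby("gender")["smoke"].mean() * 100

# Difference in percentage points (men minus women).
diff_pp = smoking_rate.loc[2] - smoking_rate.loc[1]
print(round(diff_pp))
```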

**Difference between the median ages of smokers and non-smokers (in months)**

In [0]:
# Median age of smokers (in days), converted to months
# (a month is approximated as 30 days)
smokers_days = my_data[my_data['smoke'] == 1]['age'].median()
smokers_months = smokers_days // 30

In [0]:
# Median age of non-smokers (in days), converted to months
# (a month is approximated as 30 days)
nsmokers_days = my_data[my_data['smoke'] == 0]['age'].median()
nsmokers_months = nsmokers_days // 30

In [51]:
# Difference between the two median ages, in months
diff = nsmokers_months - smokers_months
diff

20.0

*Difference between the median ages of non-smokers and smokers = 20 months*
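
The cells above floor each median to whole 30-day months before subtracting, which can shift the result slightly. An alternative sketch subtracts the medians in days first and converts once, using an average month length of 365.25 / 12 days. The ages below are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Made-up ages in days (NOT the real dataset), three smokers and three non-smokers.
toy = pd.DataFrame({
    "smoke": [1, 1, 1, 0, 0, 0],
    "age":   [18000, 19000, 20000, 19300, 19600, 19900],
})

# Median age in days per smoke flag.
median_days = toy.groupby("smoke")["age"].median()

# Assumption: average month length of 365.25 / 12 days.
DAYS_PER_MONTH = 365.25 / 12

# Non-smokers minus smokers, converted to months after subtracting.
diff_months = (median_days.loc[0] - median_days.loc[1]) / DAYS_PER_MONTH
print(round(diff_months, 1))
```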

**New feature: BMI**

In [52]:
# BMI = weight (kg) / height (m)^2; height is stored in cm, so divide by 100
bmi = [w / ((h / 100) ** 2) for w, h in zip(my_data['weight'], my_data['height'])]
bmi

[21.9671201814059,
 34.927679158448385,
 23.507805325987146,
 28.71047932495361,
 23.011176857330703,
 29.384676110696898,
 37.72972534382733,
 29.983587930816814,
 28.44095497516423,
 25.282569898869724,
 28.01022373166206,
 20.04744562130375,
 22.03856749311295,
 31.244992789617044,
 28.997893837184456,
 37.85830178474852,
 25.95155709342561,
 20.82999519307803,
 28.67262607522348,
 21.338210638622158,
 31.239414355075468,
 27.993022029291247,
 36.05191475725044,
 18.49112426035503,
 23.529411764705884,
 27.76709812465291,
 24.243918474687703,
 30.853209920493647,
 23.95122659311947,
 25.909456951787405,
 43.7044745057232,
 24.859073561850078,
 23.73323840037973,
 28.515624999999993,
 27.39817568244846,
 20.70081674131507,
 31.020408163265305,
 26.026174895895306,
 27.43484224965706,
 25.71166207529844,
 25.153150229218223,
 21.461936624163616,
 23.59700420642249,
 24.919900320398717,
 21.0828132906055,
 24.38652644413961,
 40.77122389879591,
 24.44727891156463,
 22.857142857142858,


In [53]:
# Making BMI a dataframe
bmi_feature = pd.DataFrame({'bmi': bmi})
bmi_feature

|   | bmi |
|---|-----|
| 0 | 21.967120 |
| 1 | 34.927679 |
| 2 | 23.507805 |
| 3 | 28.710479 |
| 4 | 23.011177 |
| 5 | 29.384676 |
| 6 | 37.729725 |
| 7 | 29.983588 |
| 8 | 28.440955 |
| 9 | 25.282570 |


In [54]:
# Median of the new BMI feature (select the column so the result is a scalar)
median_bmi = bmi_feature['bmi'].median()
print('Median of BMI: {:.6f}'.format(median_bmi))

Median of BMI: 26.374068
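
The list comprehension and the separate `bmi_feature` frame can be replaced by a vectorized column assignment. A minimal sketch on the first five rows from the `head()` output; the `bmi` column name is an illustrative choice, and the same two lines should work on `my_data`:

```python
import pandas as pd

# height (cm) and weight (kg) for the first five rows, from the head() output.
sample = pd.DataFrame({
    "height": [168, 156, 165, 169, 156],
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
})

# BMI = weight (kg) / height (m)^2, computed for all rows at once.
sample["bmi"] = sample["weight"] / (sample["height"] / 100) ** 2

# Series.median() returns a plain scalar, so it formats cleanly.
median_bmi = sample["bmi"].median()
print(f"Median BMI of the sample: {median_bmi:.2f}")
```

Keeping BMI as a column of the main frame also makes later grouping by `gender`, `smoke`, or `cardio` straightforward.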
