In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


#### Each task is worth 1 point; you need 12 points to pass the assignment

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued: 0 (no presence) or 1 (presence).

https://www.kaggle.com/ronitf/heart-disease-uci

___

In [210]:
import pandas as pd
import numpy as np

In [211]:
data = pd.read_csv('heart_pandas.csv')

In [212]:
data.head(5)

Unnamed: 0,age,sex,chest_pain_type,resting_blood_pressure,cholesterol,fasting_blood_sugar,rest_ecg,max_heart_rate_achieved,exercise_induced_angina,st_depression,st_slope,num_major_vessels,thalassemia,target
0,63,male,non-anginal pain,145,233,greater than 120mg/ml,normal,150,no,2.3,upsloping,0,normal,1
1,37,male,atypical angina,130,250,lower than 120mg/ml,ST-T wave abnormality,187,no,3.5,upsloping,0,fixed defect,1
2,41,female,typical angina,130,204,lower than 120mg/ml,normal,172,no,1.4,flat,0,fixed defect,1
3,56,male,typical angina,120,236,lower than 120mg/ml,ST-T wave abnormality,178,no,0.8,flat,0,fixed defect,1
4,57,female,typical angina,120,354,lower than 120mg/ml,ST-T wave abnormality,163,yes,0.6,flat,0,fixed defect,1


#### Feature description

**age**: The person's age in years

**sex**: The person's sex (1 = male, 0 = female)

**cp**: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)

**trestbps**: The person's resting blood pressure (mm Hg on admission to the hospital)

**chol**: The person's cholesterol measurement in mg/dl

**fbs**: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

**restecg**: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

**thalach**: The person's maximum heart rate achieved

**exang**: Exercise induced angina (1 = yes; 0 = no)

**oldpeak**: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)

**slope**: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)

**ca**: The number of major vessels (0-3)

**thal**: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)

**target**: Heart disease (0 = no, 1 = yes)

> 1. age 
> 2. sex 
> 3. chest pain type (4 values) 
> 4. resting blood pressure 
> 5. serum cholesterol in mg/dl
> 6. fasting blood sugar > 120 mg/dl
> 7. resting electrocardiographic results (values 0,1,2)
> 8. maximum heart rate achieved 
> 9. exercise induced angina 
> 10. oldpeak = ST depression induced by exercise relative to rest 
> 11. the slope of the peak exercise ST segment 
> 12. number of major vessels (0-3) colored by fluoroscopy
> 13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect

___

In [213]:
data.dtypes

age                          int64
sex                         object
chest_pain_type             object
resting_blood_pressure       int64
cholesterol                  int64
fasting_blood_sugar         object
rest_ecg                    object
max_heart_rate_achieved      int64
exercise_induced_angina     object
st_depression              float64
st_slope                    object
num_major_vessels            int64
thalassemia                 object
target                       int64
dtype: object

**How many men are in the dataset? How many women? (sex)**

In [214]:
data['sex'].value_counts()

male      207
female     96
Name: sex, dtype: int64

**What percentage of the dataset is men? (Solve in one line without using the previous result. Don't forget the percent sign.)**

In [215]:
data['sex'].value_counts(normalize=True)*100

male      68.316832
female    31.683168
Name: sex, dtype: float64
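
The task asks for a single value with a percent sign, while the cell above prints the shares of both sexes as plain floats. A minimal sketch of one possible one-liner, run on a small made-up frame instead of `heart_pandas.csv` (the `toy` data and the `male_share` name are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 3 men and 1 woman.
toy = pd.DataFrame({"sex": ["male", "male", "female", "male"]})

# The mean of a boolean mask is the share of True values;
# the ".2%" format spec multiplies by 100 and appends the percent sign.
male_share = f"{(toy['sex'] == 'male').mean():.2%}"
print(male_share)
```

On the real data the same expression would be applied to `data['sex']`.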

**How many men have heart disease? How many women have heart disease?**

In [216]:
data.groupby(['sex', 'target'])['target'].count()

sex     target
female  0          24
        1          72
male    0         114
        1          93
Name: target, dtype: int64
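
The grouped output above also lists the healthy patients, which the question does not ask about. One possible way to keep only the patients with heart disease is to filter first and count afterwards. This is a sketch on made-up rows (the `toy` frame is hypothetical), not the only way to do it:

```python
import pandas as pd

# Hypothetical rows; target == 1 means heart disease is present.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "target": [1, 1, 0, 0, 1],
})

# Keep only sick patients, then count how many of them belong to each sex.
sick_by_sex = toy.loc[toy["target"] == 1, "sex"].value_counts()
print(sick_by_sex)
```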

**What share of all patients are men without heart disease?**

In [217]:
data[(data['target']==0) & (data['sex']=='male')].sex.count()/data.shape[0]*100

37.62376237623762

**How old is the youngest patient with heart disease?**

In [218]:
data.query('target > 0').agg({'age' : 'min'})

age    29
dtype: int64

In [219]:
data[data['target'] > 0].agg({'age': 'min'})  # second way

age    29
dtype: int64
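
Both variants above return a one-element Series because `agg` is called with a dict. If a plain number is preferred, the column can be selected first and reduced directly. A small sketch under the assumption of a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame({
    "age": [29, 45, 61, 34],
    "target": [1, 0, 1, 1],
})

# Select the age column for sick patients only, then take the minimum.
youngest_sick = toy.loc[toy["target"] == 1, "age"].min()
print(youngest_sick)
```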

**How old is the oldest patient without heart problems?**

In [220]:
data.query('target < 1').agg({'age' : 'max'})

age    77
dtype: int64

**How old is the youngest woman with heart disease?**

In [221]:
data[(data['target']==1) & (data['sex']=='female')].agg({'age' : 'min'})

age    34
dtype: int64

**What is the average age of the women?**

In [222]:
data[(data['sex']=='female')].agg({'age' : 'mean'})

age    55.677083
dtype: float64

**What are the mean and standard deviation of age for those with heart disease?**

In [223]:
data.groupby(['target'])['age'].agg(['mean', 'std'])  # not split by sex

Unnamed: 0_level_0,mean,std
target,Unnamed: 1_level_1,Unnamed: 2_level_1
0,56.601449,7.962082
1,52.49697,9.550651


In [224]:
data.groupby(['sex', 'target'])['age'].agg(['mean', 'std'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std
sex,target,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0,59.041667,4.964913
female,1,54.555556,10.265337
male,0,56.087719,8.385155
male,1,50.903226,8.682897


**Is it true that people without heart disease have a below-average cholesterol level? (chol)**

In [225]:
data[(data['target']<1)].agg({'cholesterol' : 'mean'}) < data['cholesterol'].mean() #the answer is no

cholesterol    False
dtype: bool

In [226]:
data.query('target < 1').agg({'cholesterol' : 'mean'}) < data['cholesterol'].mean()

cholesterol    False
dtype: bool
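
The comparison above yields a boolean Series with a single entry. Comparing two scalar means gives one plain True/False instead, which may read more directly as an answer. A sketch on made-up numbers (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows; target == 0 means no heart disease.
toy = pd.DataFrame({
    "cholesterol": [200, 280, 240, 220],
    "target": [0, 0, 1, 1],
})

healthy_mean = toy.loc[toy["target"] == 0, "cholesterol"].mean()  # 240.0
overall_mean = toy["cholesterol"].mean()                          # 235.0

# True only if healthy patients are below the overall average.
below_average = healthy_mean < overall_mean
print(below_average)
```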

**Output rest_ecg statistics for all numeric features: the maximum and mean values (use groupby(), solve in one line)**

In [227]:
data.dtypes

age                          int64
sex                         object
chest_pain_type             object
resting_blood_pressure       int64
cholesterol                  int64
fasting_blood_sugar         object
rest_ecg                    object
max_heart_rate_achieved      int64
exercise_induced_angina     object
st_depression              float64
st_slope                    object
num_major_vessels            int64
thalassemia                 object
target                       int64
dtype: object

In [228]:
data.groupby(['rest_ecg'])[['age', 'resting_blood_pressure', 'cholesterol', 'max_heart_rate_achieved', 'st_depression', 'num_major_vessels', 'target']].agg(['max', 'mean']).T

Unnamed: 0,rest_ecg,ST-T wave abnormality,left ventricular hypertrophy,normal
age,max,71.0,76.0,77.0
age,mean,52.914474,61.0,55.687075
resting_blood_pressure,max,180.0,180.0,200.0
resting_blood_pressure,mean,129.065789,140.5,134.027211
cholesterol,max,354.0,327.0,564.0
cholesterol,mean,237.269737,261.75,255.142857
max_heart_rate_achieved,max,194.0,140.0,202.0
max_heart_rate_achieved,mean,151.960526,125.75,147.904762
st_depression,max,5.6,4.4,6.2
st_depression,mean,0.879605,2.725,1.159184
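
The numeric columns were typed out by hand in the cell above. As an alternative sketch, `select_dtypes(include="number")` can pick them automatically, so the list does not have to be read off `data.dtypes`. The `toy` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical rows with one grouping column, one other text column
# and two numeric columns.
toy = pd.DataFrame({
    "rest_ecg": ["normal", "normal", "ST-T wave abnormality"],
    "sex": ["male", "female", "male"],
    "age": [63, 41, 37],
    "cholesterol": [233, 204, 250],
})

# Collect the names of all numeric columns instead of listing them manually.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Max and mean of every numeric column per rest_ecg value, transposed
# so that each (feature, statistic) pair is a row.
stats = toy.groupby("rest_ecg")[numeric_cols].agg(["max", "mean"]).T
print(stats)
```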


**Determine who has the higher average exercise-induced ST depression: men with heart disease or women without heart disease (st_depression)**

In [229]:
data.groupby(['sex', 'target'])['st_depression'].agg(['mean'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean
sex,target,Unnamed: 2_level_1
female,0,1.841667
female,1,0.554167
male,0,1.531579
male,1,0.605376


Answer: the average ST depression is higher among women without heart disease (1.84 versus 0.61 for men with heart disease)

**Compute the maximum and minimum cholesterol level for each value of chest_pain_type, rest_ecg, and thalassemia. Write the code efficiently; loops are allowed**

In [230]:
data.groupby('chest_pain_type')['cholesterol'].agg(['max', 'min'])

Unnamed: 0_level_0,max,min
chest_pain_type,Unnamed: 1_level_1,Unnamed: 2_level_1
atypical angina,564,126
non-anginal pain,298,182
typical angina,409,131


In [231]:
data.groupby('rest_ecg')['cholesterol'].agg(['max', 'min'])

Unnamed: 0_level_0,max,min
rest_ecg,Unnamed: 1_level_1,Unnamed: 2_level_1
ST-T wave abnormality,354,126
left ventricular hypertrophy,327,197
normal,564,149


In [232]:
data.groupby('thalassemia')['cholesterol'].agg(['max', 'min'])

Unnamed: 0_level_0,max,min
thalassemia,Unnamed: 1_level_1,Unnamed: 2_level_1
fixed defect,417,141
normal,318,169
reversable defect,564,126


I couldn't do it in one line; I didn't figure out how to use loops here :(
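
The three cells above differ only in the grouping column, so one possible way to use a loop is to iterate over the column names. The sketch below uses a dict comprehension and then stacks the per-column tables with `pd.concat`. The `toy` frame and the level names `feature` and `value` are made up for illustration:

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame({
    "chest_pain_type": ["typical angina", "atypical angina",
                        "typical angina", "atypical angina"],
    "rest_ecg": ["normal", "normal", "ST-T wave abnormality", "normal"],
    "thalassemia": ["normal", "fixed defect", "fixed defect", "normal"],
    "cholesterol": [233, 250, 204, 354],
})

group_cols = ["chest_pain_type", "rest_ecg", "thalassemia"]

# One max/min table per grouping column, keyed by the column name.
tables = {
    col: toy.groupby(col)["cholesterol"].agg(["max", "min"])
    for col in group_cols
}

# Stack the tables; the dict keys become the outer index level.
summary = pd.concat(tables, names=["feature", "value"])
print(summary)
```

On the real data the same loop would run over `data` instead of `toy`.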

**How many values can each categorical feature take?**

In [233]:
data.dtypes

age                          int64
sex                         object
chest_pain_type             object
resting_blood_pressure       int64
cholesterol                  int64
fasting_blood_sugar         object
rest_ecg                    object
max_heart_rate_achieved      int64
exercise_induced_angina     object
st_depression              float64
st_slope                    object
num_major_vessels            int64
thalassemia                 object
target                       int64
dtype: object

In [234]:
data[['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg', 'exercise_induced_angina', 'st_slope', 'thalassemia']].nunique()

sex                        2
chest_pain_type            3
fasting_blood_sugar        2
rest_ecg                   3
exercise_induced_angina    2
st_slope                   2
thalassemia                3
dtype: int64
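
As with the numeric columns, the categorical ones do not have to be listed by hand. A sketch, assuming that every non-numeric column counts as categorical (true for the dtypes printed above, but an assumption in general); the `toy` frame is made up:

```python
import pandas as pd

# Hypothetical rows: two text columns and one numeric column.
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "st_slope": ["flat", "flat", "upsloping"],
    "age": [63, 41, 37],
})

# Drop numeric columns and count distinct values in what remains.
n_values = toy.select_dtypes(exclude="number").nunique()
print(n_values)
```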

**Which categorical feature has the strongest class imbalance?**

In [242]:
obj = data[['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg', 'exercise_induced_angina', 'st_slope', 'thalassemia']].describe().T
obj

Unnamed: 0,count,unique,top,freq
sex,303,2,male,207
chest_pain_type,303,3,typical angina,193
fasting_blood_sugar,303,2,lower than 120mg/ml,258
rest_ecg,303,3,ST-T wave abnormality,152
exercise_induced_angina,303,2,no,204
st_slope,303,2,upsloping,161
thalassemia,303,3,fixed defect,166


In [263]:
pd.DataFrame(obj, columns=["count", "freq", "percent"])

Unnamed: 0,count,freq,percent
sex,303,207,
chest_pain_type,303,193,
fasting_blood_sugar,303,258,
rest_ecg,303,152,
exercise_induced_angina,303,204,
st_slope,303,161,
thalassemia,303,166,


In [264]:
result=pd.DataFrame(obj, columns=["count", "freq", "percent"])
values=result['freq']/result['count']*100
result["percent"]=values
result

Unnamed: 0,count,freq,percent
sex,303,207,68.316832
chest_pain_type,303,193,63.69637
fasting_blood_sugar,303,258,85.148515
rest_ecg,303,152,50.165017
exercise_induced_angina,303,204,67.326733
st_slope,303,161,53.135314
thalassemia,303,166,54.785479


Answer: the strongest class imbalance is in 'fasting_blood_sugar', since the more frequent of its two unique values accounts for 85.15% of the rows
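
The same share of the most frequent value can be computed without going through `describe()`. The sketch below uses `value_counts(normalize=True)` per column on a made-up `toy` frame. Treating the top-class share as the imbalance measure is the assumption carried over from the answer above, and it compares features with different numbers of categories only loosely:

```python
import pandas as pd

# Hypothetical rows: sex is split 3/2, fasting_blood_sugar is split 4/1.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "fasting_blood_sugar": ["lower", "lower", "lower", "lower", "greater"],
    "age": [63, 37, 41, 56, 57],
})

categorical = toy.select_dtypes(exclude="number")

# Share (in percent) of the most frequent value in each categorical column.
top_share = categorical.apply(
    lambda col: col.value_counts(normalize=True).max() * 100
)

# The column whose dominant value covers the largest share of rows.
most_imbalanced = top_share.idxmax()
print(top_share)
print(most_imbalanced)
```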