# Heart Attack Risk Prediction

**Research goal:**
- Build a model that predicts the risk of a heart attack

**Research steps:**
- Data import;
- Dataset exploration;
- Data preprocessing;
- Model training.

## Importing libraries

In [17]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading data

In [18]:
pth1 = './datasets/heart_train.csv'

In [19]:
if not os.path.exists(pth1):
    # Fail fast: without the file, train_df would be undefined below
    raise FileNotFoundError(f'Dataset not found: {pth1}')

train_df = pd.read_csv(pth1)
train_df.head()

Unnamed: 0.1,Unnamed: 0,Age,Cholesterol,Heart rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,...,Physical Activity Days Per Week,Sleep Hours Per Day,Heart Attack Risk (Binary),Blood sugar,CK-MB,Troponin,Gender,Systolic blood pressure,Diastolic blood pressure,id
0,0,0.359551,0.732143,0.074244,1.0,1.0,1.0,1.0,1.0,0.535505,...,3.0,0.333333,0.0,0.227018,0.048229,0.036512,Male,0.212903,0.709302,2664
1,1,0.202247,0.325,0.047663,1.0,1.0,0.0,0.0,1.0,0.06869,...,3.0,0.833333,0.0,0.150198,0.017616,0.000194,Female,0.412903,0.569767,9287
2,2,0.606742,0.860714,0.055912,1.0,0.0,1.0,1.0,1.0,0.944001,...,2.0,1.0,0.0,0.227018,0.048229,0.036512,Female,0.23871,0.22093,5379
3,3,0.730337,0.007143,0.053162,0.0,0.0,1.0,0.0,1.0,0.697023,...,0.0,0.333333,1.0,0.227018,0.048229,0.036512,Female,0.348387,0.267442,8222
4,4,0.775281,0.757143,0.021998,0.0,0.0,1.0,0.0,1.0,0.412878,...,5.0,1.0,1.0,0.227018,0.048229,0.036512,Male,0.619355,0.44186,4047


In [20]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8685 entries, 0 to 8684
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Unnamed: 0                       8685 non-null   int64  
 1   Age                              8685 non-null   float64
 2   Cholesterol                      8685 non-null   float64
 3   Heart rate                       8685 non-null   float64
 4   Diabetes                         8442 non-null   float64
 5   Family History                   8442 non-null   float64
 6   Smoking                          8442 non-null   float64
 7   Obesity                          8442 non-null   float64
 8   Alcohol Consumption              8442 non-null   float64
 9   Exercise Hours Per Week          8685 non-null   float64
 10  Diet                             8685 non-null   int64  
 11  Previous Heart Problems          8442 non-null   float64
 12  Medication Use      

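Conclusion
- The DataFrame has 8685 rows;
- Several columns (for example Diabetes, Family History, Smoking, Obesity) have 8442 non-null values, i.e. 243 missing entries each.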
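
The missing values noted above can be summarized per column rather than read off the `info()` output. The sketch below is a minimal illustration: the helper name `missing_summary` and the toy frame are assumptions made for the example, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and shares, keeping only columns with gaps."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "share": n_missing / len(df),
    })
    # Keep only columns that actually have missing values, largest first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame for illustration only (not the real dataset)
toy_df = pd.DataFrame({
    "Age": [0.36, 0.20, 0.61, 0.73],
    "Diabetes": [1.0, np.nan, 1.0, 0.0],
    "Smoking": [1.0, np.nan, np.nan, 1.0],
})
summary = missing_summary(toy_df)
print(summary)
```

In the notebook the same helper could be called as `missing_summary(train_df)` to list every column with gaps and the share of rows affected.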

## Data preprocessing

### Dropping the "Unnamed: 0" feature

In [21]:
train_df.drop('Unnamed: 0', axis=1, inplace=True)
train_df.columns

Index(['Age', 'Cholesterol', 'Heart rate', 'Diabetes', 'Family History',
       'Smoking', 'Obesity', 'Alcohol Consumption', 'Exercise Hours Per Week',
       'Diet', 'Previous Heart Problems', 'Medication Use', 'Stress Level',
       'Sedentary Hours Per Day', 'Income', 'BMI', 'Triglycerides',
       'Physical Activity Days Per Week', 'Sleep Hours Per Day',
       'Heart Attack Risk (Binary)', 'Blood sugar', 'CK-MB', 'Troponin',
       'Gender', 'Systolic blood pressure', 'Diastolic blood pressure', 'id'],
      dtype='object')
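
A column named "Unnamed: 0" usually appears when a CSV was saved together with its row index. Assuming that is the case here (the source does not confirm it), a possible alternative is to read that column as the index at load time. The sketch uses a toy in-memory CSV, not the real file.

```python
import io

import pandas as pd

# Toy CSV mimicking a file saved with its index (first header cell is empty)
csv_text = ",Age,Cholesterol,id\n0,0.36,0.73,2664\n1,0.20,0.33,9287\n"

# index_col=0 turns the first column into the index instead of "Unnamed: 0"
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(toy_df.columns.tolist())
```

Either approach leaves the same set of feature columns. Dropping the column explicitly, as the notebook does, keeps the default `RangeIndex`.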

### Handling duplicates

#### Checking for exact duplicates

In [25]:
train_df.duplicated().sum()

0

#### Checking for implicit duplicates

In [28]:
# Look for duplicated id values
train_df['id'].duplicated().sum()

0

In [29]:
# Look for rows that are duplicated across all features except id
train_df.drop('id', axis=1).duplicated().sum()

0

Summary of the **"Data preprocessing"** section:
- Dropped the **"Unnamed: 0"** feature as uninformative;
- No duplicates were found: neither exact ones, nor repeated `id` values, nor rows identical apart from `id`.
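
The three duplicate checks from this section can be bundled into one helper. This is a sketch under stated assumptions: the function name `duplicate_report` and the toy frame are hypothetical and do not come from the original notebook.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Count duplicates three ways: full rows, repeated ids, rows equal apart from the id."""
    return {
        "full_rows": int(df.duplicated().sum()),
        "ids": int(df[id_col].duplicated().sum()),
        "rows_ignoring_id": int(df.drop(columns=id_col).duplicated().sum()),
    }


# Toy frame for illustration only: rows 0 and 2 differ only by id
toy_df = pd.DataFrame({
    "Age": [0.36, 0.20, 0.36],
    "Gender": ["Male", "Female", "Male"],
    "id": [1, 2, 3],
})
report = duplicate_report(toy_df)
print(report)
```

The toy frame shows why the third check matters: two records can carry different `id` values and still describe identical feature values, which neither of the first two checks detects.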


In [31]:
train_df.columns

Index(['Age', 'Cholesterol', 'Heart rate', 'Diabetes', 'Family History',
       'Smoking', 'Obesity', 'Alcohol Consumption', 'Exercise Hours Per Week',
       'Diet', 'Previous Heart Problems', 'Medication Use', 'Stress Level',
       'Sedentary Hours Per Day', 'Income', 'BMI', 'Triglycerides',
       'Physical Activity Days Per Week', 'Sleep Hours Per Day',
       'Heart Attack Risk (Binary)', 'Blood sugar', 'CK-MB', 'Troponin',
       'Gender', 'Systolic blood pressure', 'Diastolic blood pressure', 'id'],
      dtype='object')