# Course-End Project: Healthcare

## Problem statement:
Cardiovascular diseases are the leading cause of death globally. It is therefore
necessary to identify their causes and develop a system that predicts heart attacks
effectively. The dataset below contains information on factors that might affect
cardiovascular health.

## Task to be performed:
1. Preliminary analysis:

    a. Perform preliminary data inspection and report the findings on the structure of the data, missing values, duplicates, etc.

    b. Based on these findings, remove duplicates (if any) and treat missing values using an appropriate strategy

2. Prepare a report about the data explaining the distribution of the disease and the related factors using the steps listed below:

    a. Get a preliminary statistical summary of the data and explore the measures of central tendencies and spread of the data

    b. Identify the data variables which are categorical and describe and explore these variables using the appropriate tools, such as count plot

    c. Study the occurrence of CVD across the Age category

    d. Study the composition of all patients with respect to the Sex category

    e. Study if one can detect heart attacks based on anomalies in the resting blood pressure (trestbps) of a patient

    f. Describe the relationship between cholesterol levels and a target variable

    g. State what relationship exists between peak exercising and the occurrence of a heart attack

    h. Check if thalassemia is a major cause of CVD

    i. List how the other factors determine the occurrence of CVD

    j. Use a pair plot to understand the relationship between all the given variables

3. Build a baseline model to predict the risk of a heart attack using a logistic regression and random forest and explore the results while using correlation analysis and logistic regression (leveraging standard error and p-values from statsmodels) for feature selection

In [1]:
# Standard library
import io
import os
import warnings

# Third-party
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, read_table
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error

# Suppress library warnings to keep notebook output readable
warnings.filterwarnings('ignore')

%matplotlib inline



## 1. Preliminary analysis:

### a. Perform preliminary data inspection and report the findings on the structure of the data, missing values, duplicates, etc.

In [2]:
#import data
df = pd.read_excel('/Users/michaeldionne/Library/CloudStorage/Dropbox/AI_ML Bootcamp/Caltech-AI-Machine-Learning-Bootcamp/Course5_Machine Learning/Final Project/1645792390_cep1_dataset.xlsx')
df.head()


   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [3]:
#Check number of columns and rows, and data types
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [5]:
df.dtypes


age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [6]:
#check for missing values
df.isnull().sum()


age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
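
The separate `dtypes`, `isnull()` and uniqueness checks can also be gathered into one per-column table, which is convenient when reporting findings. The following is a minimal sketch, not part of the original notebook: `inspection_report` is a hypothetical helper name, and `toy` is made-up data standing in for the real dataset.

```python
import numpy as np
import pandas as pd


def inspection_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, missing share (%), and unique count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean() * 100,
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })


# Hypothetical toy data, used only to illustrate the helper.
toy = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "sex": [1, 1, 0, 1],
})

report = inspection_report(toy)
print(report)
```

On the real data the same call would be `inspection_report(df)`; a low `n_unique` on an integer column is also a quick hint that the column is a coded category rather than a measurement.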

In [7]:
#check for duplicates
df.duplicated().sum()


1
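
Before dropping a duplicate it can help to look at the rows involved. `duplicated()` flags only the later copies by default, while `keep=False` flags every member of a duplicate group. A minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy data in which rows 1 and 3 are identical.
toy = pd.DataFrame({
    "age": [63, 38, 41, 38],
    "chol": [233, 175, 204, 175],
})

# Default behaviour: only the second copy is counted.
n_dupes = toy.duplicated().sum()

# keep=False marks both copies so they can be inspected side by side.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence; resetting the index
# avoids a gap in the row labels afterwards.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Whether to reset the index is a design choice: leaving the gap preserves the original row labels, while resetting keeps positional and label-based indexing aligned.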

In [8]:
#remove duplicates
df.drop_duplicates(inplace=True)


In [9]:
# Confirm that no duplicates remain
df.duplicated().sum()


0

In [10]:
# df.isnull().sum() above reports no missing values, so this step is a no-op for this dataset.
# If gaps did exist, the mean would suit only the continuous columns; for integer-coded
# categorical columns (e.g. sex, cp, thal) the mode is a safer choice.
df = df.fillna(df.mean())
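
Filling every column with its mean can produce invalid values in integer-coded categorical columns (a fractional `thal` code, for example). One alternative is to choose the fill value by column type. The sketch below is illustrative only: the `toy` frame is made-up data, and the split into `continuous_cols` and `categorical_cols` is an assumption that would need to be set for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 204.0, 236.0],  # continuous measurement
    "thal": [1.0, 2.0, 2.0, np.nan],        # integer-coded category
})

continuous_cols = ["chol"]    # assumption: treated as continuous
categorical_cols = ["thal"]   # assumption: treated as categorical

imputed = toy.copy()

# Median is less sensitive to outliers than the mean.
for col in continuous_cols:
    imputed[col] = imputed[col].fillna(imputed[col].median())

# Mode keeps the filled value inside the set of valid category codes.
for col in categorical_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```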
