<a id="Introduction"></a>
# Introduction

According to the National Heart, Lung and Blood Institute:

> Heart disease is a catch-all phrase for a variety of conditions that affect the heart’s structure and function. Coronary heart disease is a type of heart disease that develops when the arteries of the heart cannot deliver enough oxygen-rich blood to the heart. __It is the leading cause of death in the United States__.

(Emphasis mine. Source: https://www.nhlbi.nih.gov/health-topics/espanol/enfermedad-coronaria)

Also, according to the World Health Organization, cardiovascular diseases are the __leading cause of death globally__ (source: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)).

In this notebook we try to learn enough about this topic to understand the [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci) dataset and build simple models that predict whether a patient has heart disease, based on features such as the heart rate during exercise or the cholesterol level in the blood.


In [14]:
library(corrplot)

In [15]:
# Libraries
library(ggplot2)
library(tidyverse)

# Read the data
data <- read.csv('../input/heart-disease/heart.csv')



In [16]:
library('caTools')

In [17]:
head(data)

In [18]:
# Display the number of rows and columns, then the structure and a summary of the data
nrow(data)
ncol(data)
str(data)
summary(data)

In [19]:
# Drop the variables that are not used in this analysis (thal and ca)
data <- subset(data, select = -c(thal, ca))
# Convert the categorical variables to R factors
data$sex <- as.factor(data$sex)
data$target <- as.factor(data$target)
data$cp <- as.factor(data$cp)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)

In [20]:
summary(data)

To get a better view of the data, we divide the variable `age` into three groups:

- Young_age (<45)
- Middle_age (45-55)
- Old_age (>55)
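The three groups described above can also be built in one step with base R's `cut()`. This is a minimal sketch, assuming ages are whole numbers; the `ages` vector is made-up illustration data, not taken from `heart.csv`.

```r
# Hypothetical ages for illustration only (not from heart.csv)
ages <- c(29, 44, 45, 50, 55, 56, 63, 77)

# Assuming integer ages: Young = <45, Middle = 45-55, Old = >55.
# cut() uses right-closed intervals by default: (-Inf,44], (44,55], (55,Inf]
age_group <- cut(ages,
                 breaks = c(-Inf, 44, 55, Inf),
                 labels = c("Young", "Middle", "Old"))

# Number of people in each group
table(age_group)
```

Because every age falls into exactly one interval, nobody is left out of the counts, including people aged exactly 55.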

In [21]:
# Group the different ages in three groups (young, middle, old)
# Middle includes age 55 so that no patient is left out of the counts
Young <- data[which(data$age < 45), ]
Middle <- data[which(data$age >= 45 & data$age <= 55), ]
Old <- data[which(data$age > 55), ]
groups <- data.frame(age_group = c("Young", "Middle", "Old"), group_count = c(nrow(Young), nrow(Middle), nrow(Old)))

# Plot the size of each age group
ggplot(groups, aes(x=age_group, y=group_count, fill=age_group)) + 
  ggtitle("Age Analysis") +
  xlab("Age Group")  +
  ylab("Group Count") +
  geom_bar(stat="identity") +
  scale_fill_discrete(name = "Age Group", labels = c("Middle", "Old", "Young"))

As the graph above shows, the old group is the largest in the dataset and the young group is the smallest.
This plot only counts patients per age group, so we cannot yet tell who has heart disease and who does not. Next, let's name the levels of the sex variable.

In [22]:
levels(data$sex) <- c("Female", "Male")
# Bar plot for sex
ggplot(data, aes(x= sex, fill= target)) + 
  geom_bar() +
  xlab("Gender") +
  ylab("Gender Count") +
  ggtitle("Analysis of Gender") +
  scale_fill_discrete(name = "Heart disease", labels = c("No", "Yes"))


With the graph above, I wanted to divide the whole population by age group (young, middle and old), split each group's bar by sex to show the proportion of males and females, and finally see whether each group has the disease. The plot as written covers only the last part: it shows the count for each sex, colored by heart disease status.
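The breakdown described above (age group, then sex, then disease status) can be approximated by faceting the bar plot by age group. The sketch below is an illustration under assumptions, not the notebook's original method: `toy` is a small made-up data frame that reuses the dataset's column names (`age`, `sex`, `target`), and the age boundaries assume whole-number ages.

```r
library(ggplot2)

# Hypothetical data for illustration only (not from heart.csv)
toy <- data.frame(
  age    = c(34, 41, 48, 52, 55, 58, 61, 67, 70, 43),
  sex    = factor(c("Male", "Female", "Male", "Male", "Female",
                    "Male", "Female", "Male", "Male", "Female")),
  target = factor(c(1, 1, 0, 1, 1, 0, 0, 1, 0, 1))
)

# Young = <45, Middle = 45-55, Old = >55 (assuming integer ages)
toy$age_group <- cut(toy$age,
                     breaks = c(-Inf, 44, 55, Inf),
                     labels = c("Young", "Middle", "Old"))

# One panel per age group, one bar per sex, each bar split by disease status
p <- ggplot(toy, aes(x = sex, fill = target)) +
  geom_bar() +
  facet_wrap(vars(age_group)) +
  labs(title = "Heart disease by sex within each age group",
       x = "Gender", y = "Count") +
  scale_fill_discrete(name = "Heart disease", labels = c("No", "Yes"))
p
```

Faceting keeps the three age groups on a shared y axis, so the sizes of the groups and the split inside each bar can be compared at a glance.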

In [23]:
levels(data$cp) <- c("Asymptomatic", "Atypical angina", "No angina", "Typical angina")
# Bar plot of chest pain type, filled by target
ggplot(data, aes(x= cp, fill=target)) + 
  geom_bar() +
  xlab("Chest Pain Type") +
  ylab("Count") +
  ggtitle("Chest Pain Type by Heart Disease Status") +
  scale_fill_discrete(name = "Heart disease", labels = c("No", "Yes"))

The graph above shows the distribution of the chest pain types. The type with the most patients who have the disease is No angina. In the Asymptomatic group, fewer people have the disease. By proportion, we can also see that Atypical angina, No angina and Typical angina are the three types in which sick patients make up a large share of the group.
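Raw counts depend on group size, so the comparison "by proportion" is easier to read from a table of within-group shares. Here is a minimal sketch using base R's `table()` and `prop.table()`; the `cp` and `target` vectors are made-up illustration data, so the printed numbers say nothing about the real dataset.

```r
# Hypothetical data for illustration only (not from heart.csv)
cp     <- factor(c("Asymptomatic", "Asymptomatic", "Asymptomatic", "Asymptomatic",
                   "Atypical angina", "Atypical angina",
                   "No angina", "No angina", "No angina",
                   "Typical angina"))
target <- factor(c(0, 0, 0, 1, 1, 1, 1, 1, 0, 1), labels = c("No", "Yes"))

# Counts of disease status within each chest pain type
counts <- table(cp, target)

# Row-wise proportions: the shares for each chest pain type sum to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

With the notebook's data frame, the same idea would be `prop.table(table(data$cp, data$target), margin = 1)`. In ggplot2, `geom_bar(position = "fill")` draws the equivalent plot.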

Next, I am going to draw a pie chart of the chest pain types, across male and female patients, to get a better view of the proportion of people affected by each type.

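Variable encoding:

- `sex`: 1 = male; 0 = female
- `age`: age in years
- `cp`: chest pain type
  - Value 0: typical angina
  - Value 1: atypical angina
  - Value 2: non-anginal pain
  - Value 3: asymptomatic

Note that the code in this notebook labels the `cp` levels in a different order (0 = Asymptomatic, 1 = Atypical angina, 2 = No angina, 3 = Typical angina).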

In [24]:
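## 0 = "Asymptomatic", 1 = "Atypical angina", 2 = "No angina", 3 = "Typical angina"
# prop: each patient's share of the total resting blood pressure (%);
# ypos: cumulative midpoint of prop. Neither is used by the pie chart below.
data2 <- data %>%
  arrange(desc(cp)) %>%
  mutate(prop = trestbps / sum(trestbps) * 100) %>%
  mutate(ypos = cumsum(prop) - 0.5 * prop)
# Count patients per chest pain type and display the table
data3 <- data2 %>% count(cp)
data3
# Basic pie chart: one slice per chest pain type, sized by patient count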
ggplot(data3, aes(x="", y=n , fill=cp)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + 
  theme(legend.position="none") +
  geom_label(aes(label = cp),
             color = "white",size = 8,
             position = position_stack(vjust = 0.5),
             show.legend = FALSE)


In this pie chart we can see that the largest share of patients belongs to the Asymptomatic group.

In [25]:
ggplot(data, aes(x = age, y = chol)) +
    geom_line() +
    facet_wrap(facets = vars(cp)) + 
    labs(title = "Observed cholesterol based on age of individual",
        x = "Age (years)",
        y = "Serum cholesterol (mg/dl)")

The graph shows, for each chest pain type, how cholesterol varies with age. For Asymptomatic patients, the highest cholesterol occurs between the ages of 55 and 65. For Atypical angina, it occurs at about age 55. For No angina, it occurs between the ages of 65 and 70. Finally, for Typical angina, it occurs between the ages of 50 and 55.
**To conclude, the highest cholesterol peak of all groups is found in the No angina group, between the ages of 65 and 70.**
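
Peaks read off a plot can be cross-checked numerically by picking, for each chest pain type, the row with the highest cholesterol. Below is a minimal base R sketch using a made-up data frame `toy` that reuses the dataset's column names (`cp`, `age`, `chol`); its values are illustrative only and are not results from `heart.csv`.

```r
# Hypothetical data for illustration only (not from heart.csv)
toy <- data.frame(
  cp   = c("Asymptomatic", "Asymptomatic", "Atypical angina",
           "Atypical angina", "No angina", "No angina"),
  age  = c(57, 62, 55, 41, 67, 50),
  chol = c(250, 310, 290, 210, 400, 230)
)

# For each chest pain type, keep the row with the maximum cholesterol
peaks <- do.call(rbind, lapply(split(toy, toy$cp), function(g) {
  g[which.max(g$chol), ]
}))
peaks
```

Applied to the notebook's `data`, the same pattern would return one row per `cp` level with the age at which the maximum `chol` was observed.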