# Using age and sex to predict the presence of heart disease in patients in Hungary

### Introduction

- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.
- Clearly state the question you will try to answer with your project.
- Identify and describe the dataset that will be used to answer the question.

Preliminary exploratory data analysis:

- Demonstrate that the dataset can be read from the web into R.
- Clean and wrangle your data into a tidy format.
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis, and how many rows have missing data.
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

Methods:

- Explain how you will conduct your data analysis and which variables/columns you will use. Note that you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?
- Describe at least one way that you will visualize the results.

Expected outcomes and significance:

- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?
### Introduction: 

Heart disease refers to a range of conditions that affect the heart. The term is often used interchangeably with "cardiovascular disease," which generally refers to conditions involving narrowed or blocked blood vessels that can lead to a heart attack. Heart disease is a leading cause of death globally: in the U.S., it causes one in five deaths and costs 239.9 billion dollars each year. Many risk factors, such as age and sex, are associated with the presence of heart disease in patients. Our goal is therefore to classify patients as having or not having heart disease based on these risk factors. The dataset we will use for this is the 1988 Hungary Heart Disease database.

In [24]:
library(tidyverse)
library(repr)
library(tidymodels)
library(RColorBrewer)
options(repr.matrix.max.rows = 6)

In [14]:
# Read the data from the URL and assign column names

url <- "https://raw.githubusercontent.com/ANGUO17/dsci-100-2023w2-group-06/main/heart%2Bdisease/reprocessed.hungarian.data"

data <- read.table(url)   
colnames(data) <- c("age", "sex", "cp", "trestbps", "chol" , "fbs", 
                    "restecg", "thalach", "exang", "oldpeak", "slope", "ca",
                    "thal", "num")

# Columns: age; sex; chest pain type (angina, abnang, notang, asympt);
# trestbps (resting blood pressure); cholesterol; fasting blood sugar > 120 mg/dl
# (true or false); resting ECG (norm, abn, hyper); max heart rate;
# exercise-induced angina (true or false); oldpeak; slope (up, flat, down);
# number of vessels colored (???); thal (norm, fixed, rever). Finally, the
# class is either healthy (buff) or with heart disease (sick).
    
head(data)




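|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 40 | 1 | 2 | 140 | 289 | 0 | 0 | 172 | 0 | 0.0 | -9 | -9 | -9 | 0 |
| 2 | 49 | 0 | 3 | 160 | 180 | 0 | 0 | 156 | 0 | 1.0 | 2 | -9 | -9 | 1 |
| 3 | 37 | 1 | 2 | 130 | 283 | 0 | 1 | 98 | 0 | 0.0 | -9 | -9 | -9 | 0 |
| 4 | 48 | 0 | 4 | 138 | 214 | 0 | 0 | 108 | 1 | 1.5 | 2 | -9 | -9 | 3 |
| 5 | 54 | 1 | 3 | 150 | -9 | 0 | 0 | 122 | 0 | 0.0 | -9 | -9 | -9 | 0 |
| 6 | 39 | 1 | 3 | 120 | 339 | 0 | 0 | 170 | 0 | 0.0 | -9 | -9 | -9 | 0 |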
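Several columns in the rows printed above contain the value -9, which is not a plausible measurement for cholesterol or a vessel count. The sketch below counts how often -9 appears in each column. It assumes that -9 is this file's missing-value code, and it uses a hypothetical data frame `sample_rows` that copies a subset of the columns from the six rows shown above, so it runs without downloading the file.

```r
library(dplyr)

# First six rows of the raw data, copied from the head(data) output above
# (only a subset of the columns is kept to stay short)
sample_rows <- data.frame(
  age   = c(40, 49, 37, 48, 54, 39),
  chol  = c(289, 180, 283, 214, -9, 339),
  slope = c(-9, 2, -9, 2, -9, -9),
  ca    = c(-9, -9, -9, -9, -9, -9),
  thal  = c(-9, -9, -9, -9, -9, -9),
  num   = c(0, 1, 0, 3, 0, 0)
)

# Assumption: -9 marks a missing value rather than a real measurement.
# Count the -9 entries in every column.
missing_counts <- sample_rows |>
  summarise(across(everything(), ~ sum(.x == -9)))

missing_counts
```

Running the same `summarise()` call on the full `data` frame would show which columns are too sparse to be useful predictors.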


In [43]:
#3  age: age in years
#4  sex: sex (1 = male; 0 = female)
#9  cp: chest pain type
#     -- Value 1: typical angina
#     -- Value 2: atypical angina
#     -- Value 3: non-anginal pain
#     -- Value 4: asymptomatic
#10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)
#12 chol: serum cholesterol in mg/dl
#16 fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
#19 restecg: resting electrocardiographic results
#     -- Value 0: normal
#     -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
#     -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
#32 thalach: maximum heart rate achieved
#38 exang: exercise induced angina (1 = yes; 0 = no)
#40 oldpeak: ST depression induced by exercise relative to rest
#41 slope: the slope of the peak exercise ST segment
#     -- Value 1: upsloping
#     -- Value 2: flat
#     -- Value 3: downsloping
#44 ca: number of major vessels (0-3) colored by fluoroscopy
#51 thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
#58 num: diagnosis of heart disease (angiographic disease status)
#     -- Value 0: < 50% diameter narrowing
#     -- Value 1: > 50% diameter narrowing
#     (in any major vessel: attributes 59 through 68 are vessels)


# Wrangle the data: recode the categorical columns as labelled factors
data_processed <- data |>
    mutate(
        sex = fct_recode(as_factor(sex), "male" = "1", "female" = "0"),
        cp = fct_recode(as_factor(cp), "typical angina" = "1", "atypical angina" = "2",
                        "non-anginal pain" = "3", "asymptomatic" = "4"),
        fbs = fct_recode(as_factor(fbs), "true" = "1", "false" = "0"),
        restecg = fct_recode(as_factor(restecg), "normal" = "0", "abnormal" = "1"),
        exang = fct_recode(as_factor(exang), "yes" = "1", "no" = "0"),
        slope = fct_recode(as_factor(slope), "upsloping" = "1", "flat" = "2", "downsloping" = "3"),
        thal = fct_recode(as_factor(thal), "normal" = "3", "fixed defect" = "6", "reversible defect" = "7"),
        # Collapse num values 1-4 into a single "Sick" class; 0 is "Healthy"
        diagnosis = fct_recode(as_factor(num), "Sick" = "1", "Sick" = "2", "Sick" = "3",
                               "Sick" = "4", "Healthy" = "0")
    ) |>
    # Keep only the columns used in the analysis
    select(age, sex, chol, diagnosis)


    
    


head(data_processed)

|   | age | sex | chol | diagnosis |
|---|---|---|---|---|
|   | `<int>` | `<fct>` | `<int>` | `<fct>` |
| 1 | 40 | male | 289 | Healthy |
| 2 | 49 | female | 180 | Sick |
| 3 | 37 | male | 283 | Healthy |
| 4 | 48 | female | 214 | Sick |
| 5 | 54 | male | -9 | Healthy |
| 6 | 39 | male | 339 | Healthy |
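Row 5 above still carries a cholesterol value of -9 after wrangling, which would distort any mean computed from `chol`. The sketch below shows one possible way to handle this. It assumes that -9 is the dataset's missing-value code, and it uses a hypothetical data frame `sample_processed` built from the six rows shown above rather than the full dataset. It replaces -9 with `NA` using `dplyr::na_if()` and then builds a small per-class summary table.

```r
library(dplyr)

# Rows copied from the head(data_processed) output above
sample_processed <- data.frame(
  age = c(40, 49, 37, 48, 54, 39),
  sex = factor(c("male", "female", "male", "female", "male", "male")),
  chol = c(289, 180, 283, 214, -9, 339),
  diagnosis = factor(c("Healthy", "Sick", "Healthy", "Sick", "Healthy", "Healthy"))
)

# Assumption: -9 is the missing-value code, so turn it into NA
sample_clean <- sample_processed |>
  mutate(chol = na_if(chol, -9))

# Per class: number of rows, rows with missing chol, and mean chol of the rest
chol_summary <- sample_clean |>
  group_by(diagnosis) |>
  summarise(
    n = n(),
    n_missing_chol = sum(is.na(chol)),
    mean_chol = mean(chol, na.rm = TRUE)
  )

chol_summary
```

Applied to the full data, the same `na_if()` step could be added to the wrangling pipeline before `select()`, so that later summaries and plots do not treat -9 as a real cholesterol reading.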
