# Title (decide later)

## Introduction

Heart disease, also known as cardiovascular disease, is the leading cause of death worldwide, according to the [WHO](https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death). The term covers several conditions that affect the heart, the most common of which causes blood vessels to narrow.

The predictive question we wish to answer is: ***“What factors contribute the most to the presence of heart disease?”***

To answer it, we will analyze the heart disease data set from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease), donated to the repository on June 30, 1988. The data set contains observations of patients in Cleveland, Hungary, Switzerland, and VA Long Beach. We will focus on the **Cleveland data set** to answer our question.

## Preliminary exploratory data analysis

Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 

Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

Notes for the proposal:
- Find the variables that we want to model.
- Explain what the variables mean and what relationships already exist between them.

Variables to test:
- age
- sex
- cp
- trestbps
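
A plot comparing the distributions of the candidate predictors, as suggested earlier in this section, might look like the sketch below. It is only an illustration: the function name `plot_predictor_distributions` is hypothetical, the six stand-in rows are copied from the data preview shown later in this section rather than from a real training set, and treating every nonzero `num` as "present" is an assumption. Histograms suit the numeric predictors (`age`, `trestbps`); the categorical ones (`sex`, `cp`) would be better served by bar charts.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

# Stand-in rows copied from the data preview; not a real training set.
example_training <- tibble(
    age      = c(63, 67, 67, 37, 41, 56),
    trestbps = c(145, 160, 120, 130, 130, 120),
    num      = c(0, 2, 1, 0, 0, 0)
)

# Hypothetical helper: one histogram panel per numeric predictor,
# coloured by whether heart disease is present (assumed: num > 0).
plot_predictor_distributions <- function(training, predictors) {
    training |>
        mutate(diagnosis = if_else(num > 0, "present", "absent")) |>
        pivot_longer(all_of(predictors),
                     names_to = "predictor", values_to = "value") |>
        ggplot(aes(x = value, fill = diagnosis)) +
        geom_histogram(bins = 15, alpha = 0.6, position = "identity") +
        facet_wrap(vars(predictor), scales = "free_x") +
        labs(x = "Predictor value", y = "Number of patients",
             fill = "Heart disease")
}

distribution_plot <- plot_predictor_distributions(example_training,
                                                  c("age", "trestbps"))
distribution_plot
```

Overlaying the two classes in each panel makes it easier to see whether a predictor separates them at all, which is one way to judge whether it is worth keeping.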

In [1]:
library(tidyverse)
set.seed(29)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [4]:
# Read the Cleveland data from the web. The file has no header row, so
# read_csv() assigns the default names X1 to X14, which are then renamed.
new_col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                   "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
old_col_names <- paste0("X", 1:14)

heart_data <- read_csv("https://raw.githubusercontent.com/Mr-Slope/DSCI-100_Group_Project/main/processed.cleveland.data",
                       col_names = FALSE) |>
    rename_with(~ new_col_names, all_of(old_col_names))

head(heart_data)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


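| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |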
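
In the preview, `ca` and `thal` are read as character columns even though the values shown are numeric. A likely cause, assumed here rather than checked against the full file, is that missing values are stored as a placeholder string such as `?`. The sketch below shows one way to convert such columns to numeric with proper `NA` values. The tibble `example_rows` and the helper `parse_placeholder_columns` are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in rows: the "?" placeholder is an assumption about
# how the raw file marks missing values.
example_rows <- tibble(
    age  = c(63, 67, 52),
    ca   = c("0.0", "3.0", "?"),
    thal = c("6.0", "?", "3.0")
)

# Replace the placeholder with NA, then convert the columns to numeric.
parse_placeholder_columns <- function(data, columns, placeholder = "?") {
    data |>
        mutate(across(all_of(columns),
                      ~ as.numeric(na_if(.x, placeholder))))
}

cleaned_rows <- parse_placeholder_columns(example_rows, c("ca", "thal"))
cleaned_rows
```

If the assumption holds, passing `na = "?"` to `read_csv()` should give the same result at read time, so that both columns are parsed as numeric from the start.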


The columns of the data set are:

1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    - 1 = typical angina
    - 2 = atypical angina
    - 3 = non-anginal pain
    - 4 = asymptomatic
4. trestbps: resting blood pressure (mmHg)
5. chol: serum cholesterol (mg/dl)
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    - 0 = normal
    - 1 = ST-T wave abnormality
    - 2 = probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise-induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: slope of the peak exercise ST segment
    - 1 = upsloping
    - 2 = flat
    - 3 = downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal:
    - 3 = normal
    - 6 = fixed defect
    - 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status), recorded as an integer from 0 to 4 (the second preview row has num = 2)
    - 0 = < 50% diameter narrowing; no heart disease present
    - 1 to 4 = > 50% diameter narrowing; heart disease present
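
The training-data summary table called for at the start of this section could be sketched as follows. This is a minimal illustration under several assumptions: the 75% training proportion, the helper names `split_train_test` and `summarize_training`, and the choice of `age`, `trestbps` and `chol` as predictors are all placeholders, and the six preview rows stand in for the full data set. The split here is a plain random sample, not stratified by class.

```r
library(dplyr)
library(tibble)

set.seed(29)

# Stand-in data: a subset of columns from the six preview rows.
preview_rows <- tibble(
    age      = c(63, 67, 67, 37, 41, 56),
    trestbps = c(145, 160, 120, 130, 130, 120),
    chol     = c(233, 286, 229, 250, 204, 236),
    num      = c(0, 2, 1, 0, 0, 0)
)

# Hold out a test set; the 75% training proportion is an assumption.
split_train_test <- function(data, prop = 0.75) {
    with_id  <- data |> mutate(row_id = row_number())
    training <- with_id |> slice_sample(prop = prop)
    testing  <- with_id |> anti_join(training, by = "row_id")
    list(training = select(training, -row_id),
         testing  = select(testing, -row_id))
}

# One row per class: number of observations, predictor means, and the
# number of rows with at least one missing value.
summarize_training <- function(training, predictors) {
    training |>
        mutate(heart_disease = num > 0,
               has_missing   = !complete.cases(training)) |>
        group_by(heart_disease) |>
        summarize(n_obs = n(),
                  across(all_of(predictors), ~ mean(.x, na.rm = TRUE),
                         .names = "mean_{.col}"),
                  rows_with_missing = sum(has_missing))
}

heart_split      <- split_train_test(preview_rows)
training_summary <- summarize_training(heart_split$training,
                                       c("age", "trestbps", "chol"))
training_summary
```

Only `heart_split$training` is summarized, so nothing about the held-out rows leaks into the exploratory analysis. On the real data, a stratified split (for example with `initial_split()` from the rsample package) would keep the class proportions similar in both sets.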

## Methods


Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable, think: is this a useful variable for prediction?

Describe at least one way that you will visualize the results.

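We will predict `num` from (insert variables here), treating it as a two-class outcome:
- num = 0 means that the patient does not have heart disease
- num ≥ 1 means that the patient has heart disease (the raw column also contains values above 1, such as the 2 in the second preview row)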
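
The two-class reading of `num` can be made explicit before modelling. The sketch below is one possible approach, not a settled design: the column name `diagnosis`, the labels `absent` and `present`, and the helper `add_diagnosis` are hypothetical, and collapsing every nonzero value of `num` into "present" is an assumption based on the variable description.

```r
library(dplyr)
library(tibble)

# Stand-in data: the num values from the six preview rows.
preview_rows <- tibble(num = c(0, 2, 1, 0, 0, 0))

# Collapse num into a two-level factor. Treating every nonzero value as
# "present" is an assumption, not something confirmed by the data yet.
add_diagnosis <- function(data) {
    data |>
        mutate(diagnosis = factor(if_else(num > 0, "present", "absent"),
                                  levels = c("absent", "present")))
}

labelled_rows <- add_diagnosis(preview_rows)
count(labelled_rows, diagnosis)
```

Storing the outcome as a factor rather than a number signals to modelling functions that this is a classification problem, and fixing the level order keeps "absent" as the reference class.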

## Expected outcomes and significance


What do you expect to find?

What impact could such findings have?

What future questions could this lead to?
