# Project Proposal
Group 60 - Danyal, Ruth, Thomas, Paul

## Title: Predicting Heart Disease Based on Data

## Introduction (Ruth)

## Preliminary exploratory data analysis (Thomas)
Notes:
* Although every variable appears numeric in the data files, the documentation indicates that many of them are actually categorical. So far we have only been shown how to work with numeric variables, so we will either wait for further explanation in class or omit the categorical variables from our model.
* Following the previous experiments mentioned on the website, I used only the reduced datasets with 14 columns. I also collapsed the diagnosis values 1 to 4 into a single "true" value in a new column (`disease`) and dropped the original column (`num`). Unlike the previous experiments, however, I combined the data from all 4 locations into one table and recorded each row's location in another new column (`location`).
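One possible way to keep the categorical variables, should the course cover it, is to convert them into indicator (dummy) columns before modelling. The following is only a sketch on a hypothetical toy table (`toy` and its values are made up for illustration). It assumes the `recipes` package that is attached with tidymodels, and it is not part of our analysis yet.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical toy data: cp is a categorical code stored as a factor
toy <- tibble(
    age = c(63, 67, 41, 56),
    cp = as_factor(c("1", "4", "2", "4")),
    disease = as_factor(c("no", "yes", "no", "yes"))
)

# step_dummy() replaces each factor predictor with 0/1 indicator columns,
# leaving one level out as the reference level
toy_baked <- recipe(disease ~ ., data = toy) |>
    step_dummy(all_nominal_predictors()) |>
    prep() |>
    bake(new_data = NULL)
toy_baked
```

Because `cp` has three levels in this toy table, it is replaced by two indicator columns, while `age` and `disease` are left unchanged.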

In [1]:
# Setup
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)
set.seed(9248)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

In [2]:
# Obtain data from external source and combine into one table
names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
types <- "nnnnnnnnnnnnnn"
cleveland <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                      col_names = names, col_types = types) |>
    mutate(location = "Cleveland")
hungary <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data",
                    col_names = names, col_types = types) |>
    mutate(location = "Hungary")
switzerland <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data",
                        col_names = names, col_types = types) |>
    mutate(location = "Switzerland")
longbeach <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data",
                      col_names = names, col_types = types) |>
    mutate(location = "Long Beach")
complete <- bind_rows(cleveland, hungary, switzerland, longbeach)

# Clean data
heart_disease <- complete |>
    mutate(sex = as_factor(ifelse(sex == 1, "male", "female")),
           cp = as_factor(cp),
           fbs = as.logical(fbs),
           restecg = as_factor(restecg),
           exang = as.logical(exang),
           slope = as_factor(slope),
           thal = as_factor(thal),
           disease = num != 0,
           location = as_factor(location)) |>
    select(-num)
heart_disease

“One or more parsing issues, see `problems()` for details”
“One or more parsing issues, see `problems()` for details”
“One or more parsing issues, see `problems()` for details”
“One or more parsing issues, see `problems()` for details”


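| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | location | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<lgl>` | `<fct>` | `<dbl>` | `<lgl>` | `<dbl>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<lgl>` |
| 63 | male | 1 | 145 | 233 | TRUE | 2 | 150 | FALSE | 2.3 | 3 | 0 | 6 | Cleveland | FALSE |
| 67 | male | 4 | 160 | 286 | FALSE | 2 | 108 | TRUE | 1.5 | 2 | 3 | 3 | Cleveland | TRUE |
| 67 | male | 4 | 120 | 229 | FALSE | 2 | 129 | TRUE | 2.6 | 2 | 2 | 7 | Cleveland | TRUE |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 55 | male | 4 | 122 | 223 | TRUE | 1 | 100 | FALSE | 0 | NA | NA | 6 | Long Beach | TRUE |
| 58 | male | 4 | NA | 385 | TRUE | 2 | NA | NA | NA | NA | NA | NA | Long Beach | FALSE |
| 62 | male | 2 | 120 | 254 | FALSE | 2 | 93 | TRUE | 0 | NA | NA | NA | Long Beach | TRUE |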
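The "parsing issues" warnings above most likely come from the `?` characters that these files appear to use for missing values, since `?` cannot be parsed as a number. A small sketch, assuming that is the cause: passing `na` to `read_csv()` tells readr to treat `?` as missing up front. The inline data and the three-column layout below are made up for illustration.

```r
library(tidyverse)

# Hypothetical two-row excerpt; "?" stands in for a missing value
toy_csv <- I("63,1,0\n67,?,?\n")

toy <- read_csv(toy_csv,
                col_names = c("age", "sex", "ca"),
                col_types = "nnn",
                na = c("", "NA", "?"))

# With "?" declared as a missing-value marker, the fields become NA
# and no parsing problems are recorded
problems(toy)
colSums(is.na(toy))
```

The resulting values would be the same as before (the unparseable fields already end up as `NA`), but the warnings should no longer appear.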


In [3]:
# Training/testing split
heart_split <- initial_split(heart_disease, prop = 0.75, strata = disease)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)
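
Because the split is stratified on `disease`, the share of positive cases should be roughly the same in the training and testing sets. A quick sanity check, sketched here on a hypothetical toy table (`toy`) rather than the real data:

```r
library(tidyverse)
library(tidymodels)
set.seed(9248)

# Hypothetical toy outcome: 40 of 100 observations have the disease
toy <- tibble(id = 1:100,
              disease = rep(c(TRUE, FALSE), times = c(40, 60)))

toy_split <- initial_split(toy, prop = 0.75, strata = disease)

# Compare the share of positive cases in each part of the split
split_check <- bind_rows(training = training(toy_split),
                         testing = testing(toy_split),
                         .id = "set") |>
    group_by(set) |>
    summarize(n = n(), prop_disease = mean(disease))
split_check
```

The same summary could be run on `heart_training` and `heart_testing` to confirm that the real split behaves as expected.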

In [4]:
# Count rows in each column with missing data
heart_missing <- heart_training |>
    map_df(function(x) sum(is.na(x)))
heart_missing

# Number of observations with and without heart disease from each location
disease_count <- heart_training |>
    group_by(location, disease) |>
    summarize(count = n()) |>
    pivot_wider(names_from = disease, values_from = count)
disease_count

# Select only columns with numeric data and calculate mean of each column
heart_numeric <- heart_training |>
    select(age, trestbps, chol, thalach, oldpeak, ca)
heart_mean <- heart_numeric |>
    map_df(mean, na.rm = TRUE)
heart_mean

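| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | location | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 0 | 0 | 0 | 44 | 22 | 68 | 1 | 40 | 40 | 43 | 219 | 449 | 353 | 0 | 0 |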
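
Counts of missing values are easier to judge relative to the number of rows. A minimal sketch of computing the proportion missing per column, using a hypothetical toy table (`toy`) in place of `heart_training`:

```r
library(tidyverse)

# Hypothetical toy data with some missing values
toy <- tibble(
    age = c(63, 67, NA, 55),
    ca = c(0, NA, NA, NA),
    thal = c("6", NA, "7", "3")
)

# mean(is.na(x)) is the fraction of entries in a column that are missing
toy_missing <- toy |>
    summarize(across(everything(), ~ mean(is.na(.x))))
toy_missing
```

Columns with a large share of missing values may be candidates to leave out of the model.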


`summarise()` has grouped output by 'location'. You can override using the
`.groups` argument.


| location | FALSE | TRUE |
|---|---|---|
| `<fct>` | `<int>` | `<int>` |
| Cleveland | 125 | 110 |
| Hungary | 136 | 79 |
| Switzerland | 7 | 87 |
| Long Beach | 40 | 105 |


| age | trestbps | chol | thalach | oldpeak | ca |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 53.40348 | 131.9752 | 197.7316 | 138.7442 | 0.9113003 | 0.625 |
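
The counts by location are easier to compare as proportions. The sketch below re-enters the training-set counts from the location table above by hand, so it does not depend on the downloaded data, and computes the share of observations with heart disease at each location. The column names `no_disease`, `total`, and `prop_disease` are chosen here for illustration.

```r
library(tidyverse)

# Training-set counts copied from the location table above
disease_count <- tribble(
    ~location,     ~no_disease, ~disease,
    "Cleveland",   125,         110,
    "Hungary",     136,         79,
    "Switzerland", 7,           87,
    "Long Beach",  40,          105
)

# Share of each location's observations that have heart disease
disease_prop <- disease_count |>
    mutate(total = no_disease + disease,
           prop_disease = disease / total)
disease_prop
```

In these counts the Switzerland group consists almost entirely of positive cases (87 of 94), so location may be strongly associated with the outcome.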


## Methods (Paul)

## Expected outcomes and significance (Danyal)