# Using Symptoms to Predict the Sex of an Individual with Heart Disease
DSCI 100-006, Group 2

## Introduction

Can the sex of an individual who is diagnosed with heart disease be accurately predicted by age and cholesterol level?

Heart disease, the leading cause of death worldwide, arises when the heart and blood vessels are compromised. It is commonly characterized by a build-up of plaque in the blood vessels, which restricts blood flow to the heart. High blood pressure, smoking, obesity, high cholesterol, and physical inactivity are common risk factors for cardiovascular disease. Importantly, symptoms and risk factors differ significantly between men and women. Given these differences, can a patient's sex be predicted from the characteristics they present with, particularly their age and cholesterol level? To answer this question, we will use the dataset from “International application of a new probability algorithm for the diagnosis of coronary artery disease”. The dataset provides a table of variables describing each patient's symptoms. We will use age and cholesterol level as predictors of the patient's sex.


In [24]:
### Run this cell before continuing.
library(tidyverse)   # tidyverse 2.0.0
library(tidymodels)  # tidymodels 1.1.1
library(repr)
# source() fails with "cannot open the connection" when cleanup.R is missing,
# so load it only if the file exists.
if (file.exists("cleanup.R")) source("cleanup.R")
# Show at most 10 rows when printing data frames
options(repr.matrix.max.rows = 10)


In [20]:
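library(readr)

# Processed Cleveland heart disease data (UCI Machine Learning Repository)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# The file has no header row, so readr names the columns X1 to X14
read_data <- read_csv(url, col_names = FALSE)

# Keep age (X1), cholesterol (X5) and sex (X2); recode sex from 1/0 to labels
heart_data <- read_data |>
    select(age = X1, chol = X5, sex = X2) |>
    mutate(sex = ifelse(sex == 1, "male", "female"))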

head(heart_data)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| age | chol | sex |
|-----|------|-----|
| `<dbl>` | `<dbl>` | `<chr>` |
| 63 | 233 | male |
| 67 | 286 | male |
| 67 | 229 | male |
| 37 | 250 | male |
| 41 | 204 | female |
| 56 | 236 | male |


In [21]:
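# Hold out 25% of the rows for testing; stratify so both sets keep a similar sex ratio.
# The split is random: call set.seed() first to make it reproducible.
heart_split <- initial_split(heart_data, prop = 0.75, strata = sex)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)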

## Preliminary exploratory data analysis
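
One check that fits here is the class balance of the training data, since a classifier can look accurate simply by favouring the majority class. The following is a minimal sketch, assuming a data frame with a `sex` column such as `heart_train` from the split above. `class_balance` is a hypothetical helper name introduced for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper (an assumption, not from the original notebook):
# counts the rows in each class and reports each class's share in percent.
class_balance <- function(data) {
    data |>
        count(sex, name = "n") |>
        mutate(percent = 100 * n / sum(n))
}

# Intended use, once the split has been run:
# class_balance(heart_train)
```

Because the split is stratified on `sex`, the percentages in the training and testing sets should come out close to each other.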



## Methods

The `age` and `chol` (cholesterol) columns will be used as predictors in a classification model that predicts the class `sex`.
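
The model type is not fixed above. As one possible shape for it, the sketch below fits a K-nearest neighbours classifier with tidymodels. The choice of K-nearest neighbours, the default of K = 5, the standardization steps, and the helper name `fit_sex_knn` are all assumptions made for illustration; the `kknn` package must be installed for the engine to run.

```r
library(tidymodels)

# Sketch only: K-nearest neighbours and k = 5 are assumptions, not choices
# stated in this proposal. `fit_sex_knn` is a hypothetical helper name.
fit_sex_knn <- function(train_data, k = 5) {
    # tidymodels classifiers need a factor outcome, and sex is stored as text
    train_data <- train_data |>
        mutate(sex = as.factor(sex))

    # Put age and chol on the same scale so neither dominates the distance
    knn_recipe <- recipe(sex ~ age + chol, data = train_data) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
        set_engine("kknn") |>
        set_mode("classification")

    # Bundle preprocessing and model, then fit on the training data
    workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_spec) |>
        fit(data = train_data)
}

# Intended use, once the split has been run:
# knn_fit <- fit_sex_knn(heart_train)
# predict(knn_fit, heart_test)
```

In practice K would be tuned by cross-validation on the training set before any predictions are made on the testing set.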

A scatterplot of cholesterol against age will be generated from the training data, with points colour-coded by class (`sex`).
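
A minimal sketch of such a plot, assuming a data frame with `age`, `chol`, and `sex` columns such as `heart_train`. The helper name `plot_age_chol`, the point transparency, and the text size are illustrative assumptions; the axis units follow the dataset's documentation (age in years, serum cholesterol in mg/dl).

```r
library(ggplot2)

# Hypothetical helper (an assumption, not from the original notebook):
# scatterplot of cholesterol against age, with points coloured by sex.
plot_age_chol <- function(train_data) {
    ggplot(train_data, aes(x = age, y = chol, colour = sex)) +
        geom_point(alpha = 0.6) +
        labs(
            x = "Age (years)",
            y = "Serum cholesterol (mg/dl)",
            colour = "Sex"
        ) +
        theme(text = element_text(size = 12))
}

# Intended use, once the split has been run:
# plot_age_chol(heart_train)
```

If the two colours overlap heavily in the plot, that is an early sign that age and cholesterol alone may not separate the classes well.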

In [55]:
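# Mean age and cholesterol in the training set, plus the number of rows with a
# missing value in either column. The count is the same in every row, so
# map_df(mean) returns it unchanged alongside the two column means.
summary_table <- heart_train |>
    select(age, chol) |>
    mutate("missing data count" = sum(is.na(age) | is.na(chol))) |>
    map_df(mean)
summary_table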


| age | chol | missing data count |
|-----|------|--------------------|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 54.13717 | 246.615 | 0 |


## Expected Outcomes and Predictions

