# Group Proposal #

**[GitHub Repo](https://github.com/AnishkaFP/dsci-100-2023w1-group-029)**

## Predicting Thalassemia (Heart Disease) ##

## Introduction ## 
This project examines the presence of heart disease in individuals as a function of various clinical factors. The study focuses specifically on thalassemia, a blood disorder in which the body has too little hemoglobin, the protein that enables red blood cells to carry oxygen. Based on the 12 factors outlined and described below, the study will show whether a patient developed thalassemia.

## Goal: ##
Our project seeks to predict heart disease in patients from a series of easily accessible clinical metrics, with the intent of identifying likely-affected patients early so that they can be referred for further testing.

To do so, we will make use of a [simplified version](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) of a [dataset by UC Irvine](https://archive.ics.uci.edu/dataset/45/heart+disease).


| Variable                           | Coding   | Description                                                 | Type          |
|------------------------------------|----------|-------------------------------------------------------------|---------------|
| Age                                | age      | Age of the patient                                         | Integer       |
| Sex                                | sex      | Sex of the patient                                         | Categorical, 0-1 |
| Chest pain type                    | cp       | Type of chest pain experienced by the patient              | Categorical, 0-3 |
| Resting blood pressure (mmHg)      | trestbps | Blood pressure of the patient at rest                       | Integer       |
| Serum cholesterol in mg/dl         | chol     | Cholesterol level in the patient                            | Integer       |
| Fasting blood sugar > 120 mg/dl    | fbs      | Is the blood sugar level of the patient above 120 mg/dl?   | Categorical, 0-1 |
| Resting electrocardiographic results | restecg  | Result of the electrocardiogram of the patient              | Categorical, 0-2 |
| Maximum heart rate achieved (bpm)  | thalach  | Heart rate achieved during exercise                          | Integer       |
| Exercise induced angina             | exang    | Did the patient suffer from angina due to exercise?         | Categorical, 0-1 |
| ST depression                      | oldpeak  | ST depression induced by exercise relative to rest on the electrocardiogram | Numeric |
| ST Slope                           | slope    | The slope of the peak exercise ST segment                   | Categorical, 0-2 |
| Major vessels                      | ca       | Number of major vessels colored by fluoroscopy              | Integer       |
| Thalassemia                        | thal     | Whether the patient is affected by Thalassemia              | Categorical, 0-3 |
| Target                             | target   | Presence of heart disease in the patient                    | Categorical, 0-1 |


## Preliminary Exploratory Data Analysis ##

**Load R Library**

In [1]:
# load libraries (tidyverse already attaches dplyr)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


**Read Data Set**

In [2]:
# read data
heart_data_raw <- read_csv("heart.csv")
head(heart_data_raw)

Rows: 1025 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


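| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| 58 | 0 | 0 | 100 | 248 | 0 | 0 | 122 | 0 | 1.0 | 1 | 0 | 2 | 1 |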
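
The codes printed above can be compared with the ranges listed in the variable table. The following is a minimal sketch of such a range check, not part of the original notebook: `heart_head` and `allowed` are hypothetical names, the rows are the six shown by `head()`, and only a subset of the coded columns is included. In the notebook itself, `heart_data_raw` would take the place of `heart_head`.

```r
library(tibble)

# The six rows printed by head(), restricted to a few coded columns.
heart_head <- tribble(
  ~sex, ~cp, ~fbs, ~restecg, ~exang, ~target,
  1, 0, 0, 1, 0, 0,
  1, 0, 1, 0, 1, 0,
  1, 0, 0, 1, 1, 0,
  1, 0, 0, 1, 0, 0,
  0, 0, 1, 1, 0, 0,
  0, 0, 0, 0, 0, 1
)

# Allowed codes, taken from the "Type" column of the variable table.
allowed <- list(
  sex = 0:1, cp = 0:3, fbs = 0:1,
  restecg = 0:2, exang = 0:1, target = 0:1
)

# For each coded column, TRUE if every observed value is an allowed code.
in_range <- vapply(
  names(allowed),
  function(col) all(heart_head[[col]] %in% allowed[[col]]),
  logical(1)
)
print(in_range)
```

A `FALSE` entry would point to a column whose documented range and actual codes disagree.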


**Clean and Wrangle Data into a Tidy Format**

In [3]:
# check for missing values
missing_values <- colSums(is.na(heart_data_raw))
print(data.frame(missing_values))

         missing_values
age                   0
sex                   0
cp                    0
trestbps              0
chol                  0
fbs                   0
restecg               0
thalach               0
exang                 0
oldpeak               0
slope                 0
ca                    0
thal                  0
target                0
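
The same per-column count can also be written in tidyverse style, which returns a tibble rather than a named vector. This is a sketch only: `toy` is a hypothetical data frame with two deliberately missing values, not data from `heart.csv`, used so that the counts are not all zero.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with two deliberately missing values (not from heart.csv).
toy <- tibble(
  age  = c(52, NA, 70),
  chol = c(212, 203, NA),
  sex  = c(1, 1, 0)
)

# Count the NA entries in every column; the result is a one-row tibble.
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))
print(na_counts)
```

Applied to `heart_data_raw`, the same pipeline should report the same counts as `colSums(is.na(...))`, since both count the `NA` entries in each column.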


In [5]:
# check the variable types of each column
variable_types0 <- sapply(heart_data_raw, class)
print(data.frame(variable_types0))

         variable_types0
age              numeric
sex              numeric
cp               numeric
trestbps         numeric
chol             numeric
fbs              numeric
restecg          numeric
thalach          numeric
exang            numeric
oldpeak          numeric
slope            numeric
ca               numeric
thal             numeric
target           numeric


In [7]:
# convert the categorical variables to factors
heart_data_raw <- heart_data_raw |>
  mutate(across(c(sex, cp, fbs, restecg, exang, slope, thal, target), as_factor))
# check the data type of mutated data
variable_types1 <- sapply(heart_data_raw, class)
print(data.frame(variable_types1))

         variable_types1
age              numeric
sex               factor
cp                factor
trestbps         numeric
chol             numeric
fbs               factor
restecg           factor
thalach          numeric
exang             factor
oldpeak          numeric
slope             factor
ca               numeric
thal              factor
target            factor
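
The read-time message from `read_csv()` suggests specifying column types, so one possible alternative to converting after loading is to declare the categorical columns at read time. This is a minimal sketch, assuming readr 2.0 or later (for literal input via `I()`). The three inlined rows are copied from the `head()` output and stand in for `heart.csv`, and `sample_csv` and `heart_sample` are hypothetical names.

```r
library(readr)

# Three rows copied from the head() output, inlined so the sketch runs
# without heart.csv; in the notebook the path "heart.csv" would be used.
sample_csv <- I(paste(
  "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target",
  "52,1,0,125,212,0,1,168,0,1.0,2,2,3,0",
  "53,1,0,140,203,1,0,155,1,3.1,0,0,3,0",
  "70,1,0,145,174,0,1,125,1,2.6,0,0,3,0",
  sep = "\n"
))

# Read the categorical columns as factors and everything else as doubles.
heart_sample <- read_csv(
  sample_csv,
  col_types = cols(
    .default = col_double(),
    sex = col_factor(), cp = col_factor(), fbs = col_factor(),
    restecg = col_factor(), exang = col_factor(), slope = col_factor(),
    thal = col_factor(), target = col_factor()
  )
)

# Inspect the resulting column classes.
print(sapply(heart_sample, class))
```

With no `levels` argument, `col_factor()` orders levels by first appearance in the file, which may differ from the ordering produced by `as_factor()`. Passing explicit `levels` would make the ordering deterministic.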
