# Individual Assignment 1 - Hanxi Chen

## Data

#### Background
The datasets contain clinical and noninvasive test results of patients undergoing angiography at clinics in four areas in 1988 (the Cleveland Clinic, the Hungarian Institute of Cardiology in Budapest, the University Hospitals in Zurich and Basel in Switzerland, and the Veterans Administration Medical Center in California). The data were originally collected to apply a model derived from 303 patients who underwent angiography at the Cleveland Clinic in Cleveland, Ohio, to estimate probabilities of angiographic coronary disease for patients at the other three sites. In our case, the raw data are four separate .data files, one each for Cleveland, Hungary, Switzerland, and VA. For this project, they were converted to .csv format and modified to contain the corresponding columns. We combine the four files into a single dataset and add a new variable, location, to identify the four sites.

Reference link to dataset package: Heart Disease - https://archive.ics.uci.edu/dataset/45/heart+disease

#### Dataset Overview
Number of observations in the dataset: 1071 in total (303 from Cleveland, 425 from Hungary, 143 from Switzerland, and 200 from VA)

Number of observations after removing missing values: 674 in total (303 from Cleveland, 269 from Hungary, and 102 from VA)


Number of variables: 15

Description of variables:
- location: institution where the patient underwent angiography, categorical, no missing value
- age: age of the patients in years, numerical, no missing value
- sex: sex of the patients, categorical, no missing value
- cp: chest pain type, 4 types coded as 1, 2, 3, 4, categorical, no missing value
- trestbps: resting blood pressure (on admission to the hospital) in mm Hg, numerical, has missing values
- chol: serum cholesterol in mg/dl, numerical, has missing values
- fbs: fasting blood sugar > 120 mg/dl, categorical, has missing values
- restecg: resting electrocardiographic results (values 0,1,2), categorical, has missing values
- thalach: maximum heart rate achieved, numerical, has missing values
- exang: exercise-induced angina, categorical, has missing values
- oldpeak: ST depression induced by exercise relative to rest, numerical, has missing values
- slope: the slope of the peak exercise ST segment, numerical, has missing values
- ca: number of major vessels (0-3) colored by fluoroscopy, numerical, has missing values
- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect, categorical, has missing values
- num: diagnosis of heart disease (our target variable), 0 = no disease and any value from 1 to 4 = disease (the raw files record it as an integer from 0 to 4), categorical, no missing value
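
The rows printed in the supporting code show `num` values above 1, so treating `num` as a disease/no-disease target requires a recoding step. The following is a minimal sketch of one way to do that, assuming every positive value indicates disease; the vector `num_raw` and the factor labels are illustrative and are not taken from the real data.

```r
# Toy values standing in for the raw `num` column (illustrative only)
num_raw <- c(0, 2, 1, 0, 3, 0, 4)

# Collapse to a binary target: 0 stays 0, any positive value becomes 1
num_binary <- as.integer(num_raw > 0)

# Store as a factor so modelling functions treat it as categorical
num_factor <- factor(num_binary,
                     levels = c(0, 1),
                     labels = c("no disease", "disease"))

# Tabulate the two classes
table(num_factor)
```

Applied to the combined data, the same idea would be `heart$num <- as.integer(heart$num > 0)`, run before any modelling.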

Variables of interest: location, age, sex, cp, trestbps, chol, restecg, and num (response variable).

My choice of variables is guided by the need to conduct a comprehensive investigation into the likelihood of angiographic coronary disease across diverse clinical settings. I selected 'location' to account for potential regional disparities in healthcare practices. 'Age' and 'sex' are fundamental demographic factors widely recognized for their influence on heart diseases. 'Chest Pain Type' (cp), 'Resting Blood Pressure' (trestbps), 'Serum Cholesterol' (chol), and 'Resting ECG Results' (restecg) are included due to their clinical relevance as diagnostic indicators. The variable 'num' (diagnosis of heart disease) is our response variable, central to the research question, to predict coronary disease likelihood. This selection of variables collectively enables us to explore the associations and regional variations in coronary disease, aiming to develop predictive models for informed decision-making and risk assessment.

## Question

- How accurately can we predict the likelihood of coronary disease (variable 'num') based on the location, age, sex, chest pain type, resting blood pressure, serum cholesterol, and resting ECG results of patients from four specific regions?

Our analysis will involve the development of a predictive model to estimate the likelihood of angiographic coronary disease based on these variables. Additionally, we will explore regional variations and demographic influences on heart disease risk. This research question is focused on both prediction, as we seek to build a predictive model, and inference, as we aim to gain insights into the factors influencing the likelihood of coronary disease diagnosis in different locations and demographic groups. 
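
To make "how accurately" concrete, here is a minimal sketch of one possible workflow: fit a model on part of the data and measure accuracy on held-out rows. The assignment does not name a model, so logistic regression via `glm()` is an assumption here, as are the simulated data frame `toy`, the 75/25 split, and the 0.5 classification threshold.

```r
set.seed(1)
n <- 200

# Simulated stand-in for a few predictors (not the real heart data)
toy <- data.frame(
  age = round(rnorm(n, mean = 54, sd = 9)),
  sex = factor(sample(c(0, 1), n, replace = TRUE)),
  trestbps = round(rnorm(n, mean = 130, sd = 15))
)

# Simulated binary outcome that depends on age and sex
p <- plogis(-6 + 0.1 * toy$age + 0.8 * (toy$sex == 1))
toy$num <- rbinom(n, size = 1, prob = p)

# Hold out a quarter of the rows for testing
test_idx <- sample(n, size = n / 4)
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Fit a logistic regression on the training rows
fit <- glm(num ~ age + sex + trestbps, data = train, family = binomial)

# Predicted probabilities on the test rows, thresholded at 0.5
pred_prob <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(pred_prob > 0.5)

# Share of test rows classified correctly
accuracy <- mean(pred_class == test$num)
accuracy
```

The fitted coefficients (`summary(fit)`) serve the inference side of the question, while the held-out accuracy serves the prediction side.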

Regarding the data, we will remove observations with missing values during the data wrangling step of a later assignment. Because every "chol" value for the Switzerland patients is missing, we expect the "location" variable to contain only three categories in the later analysis.
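
The effect described above can be seen on a small example: when every value of one column is missing for a group, dropping incomplete rows removes that group entirely. This is a sketch on a hypothetical five-row data frame `toy`, not the real data.

```r
# Hypothetical rows: both Switzerland rows lack a chol value
toy <- data.frame(
  location = c("cleaveland", "cleaveland", "switzerland", "switzerland", "va"),
  chol = c(233, 286, NA, NA, 260),
  stringsAsFactors = FALSE
)

# Locations present before and after dropping incomplete rows
before <- table(toy$location)
after <- table(na.omit(toy)$location)

names(before)  # "cleaveland" "switzerland" "va"
names(after)   # "cleaveland" "va"
```

Here `location` is a character column, so the empty group disappears from the table. If it were stored as a factor, the unused level would remain with a count of zero until `droplevels()` is applied.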

## Supporting Code: 

In [101]:
# load the required packages
library("tidyverse")
library("dplyr")

In [102]:
# read in the data 
cleaveland <- read.csv("data/processed_Cleaveland.csv") %>% mutate(location = "cleaveland")
hungarian <- read.csv("data/processed_Hungarian.csv") %>% mutate(location = "hungarian")
switzerland <- read.csv("data/processed_Switzerland.csv") %>% mutate(location = "switzerland")
va <- read.csv("data/processed_VA.csv") %>% mutate(location = "va")

# replace missing values "?" and "0" for chol with NA
cleaveland[cleaveland == "?"] <- NA
hungarian[hungarian == "?"] <- NA
switzerland[switzerland == "?"] <- NA
va[va == "?"] <- NA
cleaveland$chol <- ifelse(cleaveland$chol == 0, NA, cleaveland$chol) # replace missing values where chol = 0 with NA
hungarian$chol <- ifelse(hungarian$chol == 0, NA, hungarian$chol)
switzerland$chol <- ifelse(switzerland$chol == 0, NA, switzerland$chol)
va$chol <- ifelse(va$chol == 0, NA, va$chol)

# combine all raw datasets into a single dataset of interest
heart <- rbind(cleaveland, hungarian, switzerland, va)
colnames(heart)[1] <- "age" # change the name of first variable to a tidy name

# select variables of interest
heart <- heart %>% select(location, age, sex, cp, trestbps, chol, restecg, num)
heart

location,age,sex,cp,trestbps,chol,restecg,num
<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<int>
cleaveland,63,1,1,145,233,2,0
cleaveland,67,1,4,160,286,2,2
cleaveland,67,1,4,120,229,2,1
cleaveland,37,1,3,130,250,0,0
cleaveland,41,0,2,130,204,2,0
cleaveland,56,1,2,120,236,0,0
cleaveland,62,0,4,140,268,2,3
cleaveland,57,0,4,120,354,0,0
cleaveland,63,1,4,130,254,2,2
cleaveland,53,1,4,140,203,2,1
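
The type row above lists `trestbps`, `chol`, and `restecg` as `<chr>`, most likely because the "?" placeholders caused those columns to be read as text. A minimal sketch of converting such columns to numeric after the placeholders are replaced, using a hypothetical data frame `toy`:

```r
# Hypothetical columns read as text because of "?" placeholders
toy <- data.frame(
  trestbps = c("145", "?", "120"),
  chol = c("233", "286", "?"),
  stringsAsFactors = FALSE
)

# Replace the placeholder with NA, as in the cell above
toy[toy == "?"] <- NA

# Convert the text columns to numeric; NA entries stay NA
num_cols <- c("trestbps", "chol")
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

# Inspect the resulting column types
str(toy)
```

An alternative is to pass `na.strings = "?"` to `read.csv()`, which marks the placeholders as missing at read time so the columns are parsed as numbers from the start.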


In [103]:
# count the number of NAs for each variable
na_counts <- colSums(is.na(heart)) 
data.frame(NA_Count = na_counts)

variable,NA_Count
,<dbl>
location,0
age,0
sex,0
cp,0
trestbps,59
chol,202
restecg,2
num,0
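
The counts above are totals across all sites. To see where the missing values are concentrated, they can also be broken down by location. The following is a sketch using base R's `aggregate()` on a hypothetical data frame `toy`; the values are made up for illustration.

```r
# Hypothetical rows with missing values spread unevenly across sites
toy <- data.frame(
  location = c("cleaveland", "cleaveland", "switzerland",
               "switzerland", "va", "va"),
  trestbps = c(145, 160, NA, 130, 120, NA),
  chol = c(233, 286, NA, NA, 260, 254)
)

# Count missing values per column within each location
na_by_location <- aggregate(
  is.na(toy[c("trestbps", "chol")]),
  by = list(location = toy$location),
  FUN = sum
)
na_by_location
```

On the real data, the same call with `heart` in place of `toy` would show which site contributes most of the missing `chol` values.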


In [106]:
# drop the missing values
heart <- na.omit(heart)

# count the number of observations for each location
table(heart$location)


cleaveland  hungarian         va 
       303        269        102 

In [108]:
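# save the cleaned dataset; row.names = FALSE avoids writing an extra index column
write.csv(heart, "heart.csv", row.names = FALSE)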