In [1]:
library(dplyr) #Loading all libraries 
library(tidyr)
library(purrr)
library(forcats)
library(readr)
library(readxl)
library(ggplot2)
library(cowplot)
library(repr)
library(RPostgres)
library(RSQLite)
library(workflows)
library(recipes)
library(parsnip)
library(DBI)
library(tidyverse)
library(tidymodels)
library(gridExtra)
library(janitor)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


“package ‘cowplot’ was built under R version 4.3.2”

Attaching package: ‘recipes’


The following object is masked from ‘package:stats’:

    step


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ stringr   1.5.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()    masks stats::filter()
✖ stringr::fixed()   masks recipes::fixed()
✖ dplyr::lag()       masks stats::lag()
✖ lubridate::stamp() masks cowplot::stamp()
ℹ Use 

In [2]:
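# Make sure the target folder exists before downloading into it
dir.create("data", showWarnings = FALSE)
download.file("https://raw.githubusercontent.com/An-Dao/dsci_project/main/data/healthcare_dataset%202.csv","data/healthcare_data_read.csv")
health_data <- read_csv("data/healthcare_data_read.csv")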

# Define age ranges
age_ranges <- c(0, 20, 30, 40, 50, 60, 70, 80, Inf)
age_labels <- c("0-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81+")

compress_health_data <- health_data |> 
    clean_names() |>
    select(-c(date_of_admission:discharge_date,name,medication,test_results))|>
    mutate(age_range = cut(age, breaks = age_ranges, labels = age_labels, include.lowest = TRUE))

compress_health_data |> head(10)
write.csv(compress_health_data, "data/medical_condition_data.csv", row.names = FALSE)

Rows: 10000 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): Name, Gender, Blood Type, Medical Condition, Doctor, Hospital, In...
dbl   (3): Age, Billing Amount, Room Number
date  (2): Date of Admission, Discharge Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [None]:
set.seed(2000) 


health_split <- initial_split(compress_health_data, prop = 3/4, strata = medical_condition)
health_training <- training(health_split)
health_testing <- testing(health_split)

In [None]:
# Exploratory data analysis using only the training data:
# count the observations in each level of blood type, gender, age range,
# and medical condition, then count every combination of the four.

count_blood_type <- health_training |>
    group_by(blood_type)|>
    summarize(count = n())
count_blood_type 

count_gender <- health_training |>
    group_by(gender)|>
    summarize(count = n())
count_gender


count_age_range <- health_training |>
    group_by(age_range)|>
    summarize(count = n())
count_age_range

count_med_cond <- health_training |>
    group_by(medical_condition) |>
    summarize(count = n())
count_med_cond

summary_data <- health_training |>
    group_by(age_range, blood_type, gender, medical_condition) |>
    summarize(count = n(), .groups = "drop")  # drop grouping so later steps see a plain tibble
summary_data

In [None]:

# Patient counts by age group, split by gender, one panel per medical condition
training_plot <- summary_data |>
    ggplot(aes(x = age_range, y = count, fill = gender)) +
    geom_bar(stat = "identity", position = "dodge") +
    facet_wrap(~ medical_condition, scales = "free") +
    labs(title = "Distribution of Male and Female Patients by Age Group for Each Medical Condition", x = "Age Group", y = "Number of Patients") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Patient counts by blood type, split by gender, one panel per medical condition
training_plot2 <- summary_data |>
    ggplot(aes(x = blood_type, y = count, fill = gender)) +
    geom_bar(stat = "identity", position = "dodge") +
    facet_wrap(~ medical_condition, scales = "free") +
    labs(title = "Distribution of Male and Female Patients by Blood Type for Each Medical Condition", x = "Blood Type", y = "Number of Patients") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
training_plot
training_plot2

**Classification Analysis of Patients with Asthma Based on Age, Gender, and Blood Type**

**Introduction**
<br> 
    Understanding human health is pivotal to improving health within society. Healthcare is the system through which human health is improved, using practices and studies aimed at the prevention and treatment of illness. For our group project, we will analyze a dataset from Kaggle called “Healthcare Dataset”. This dataset contains simulated patient records that resemble real-life healthcare files. The columns record information about each patient, their date of admission, and the services provided for their condition. Through classification, which uses labeled past data to categorize new observations, we will answer the predictive question: “Will a patient be diagnosed with asthma or not, based on their age, gender, and blood type?” We will use the K-nearest neighbors classification algorithm to make this prediction.
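
As a rough illustration of how K-nearest neighbors classification works on predictors like ours, here is a minimal sketch. It runs on synthetic data invented for this example (not the Kaggle records), and it uses `knn()` from the `class` package rather than a tidymodels workflow, purely to keep the example small. The data frame `toy`, the choice of k = 5, and the share of asthma cases are all assumptions.

```r
library(class)  # provides knn(); a recommended package bundled with most R installs

set.seed(2000)

# Synthetic stand-in for the cleaned patient table (hypothetical, NOT the real data)
n <- 200
toy <- data.frame(
  age        = sample(18:85, n, replace = TRUE),
  gender     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  blood_type = factor(sample(c("A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"),
                             n, replace = TRUE)),
  asthma     = factor(sample(c("asthma", "not_asthma"), n, replace = TRUE,
                             prob = c(1/6, 5/6)))
)

# KNN measures distance, so categorical predictors are dummy-encoded as 0/1 columns
x <- model.matrix(~ age + gender + blood_type, data = toy)[, -1]

# 3/4 of the rows for training, the rest for testing
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]

# Standardize with the training set's mean and sd so that no predictor
# dominates the distance and no test information leaks into training
centers <- colMeans(x_train)
sds     <- apply(x_train, 2, sd)
x_train <- scale(x_train, center = centers, scale = sds)
x_test  <- scale(x_test,  center = centers, scale = sds)

# Each test patient gets the majority label of its 5 nearest training patients
pred <- knn(train = x_train, test = x_test, cl = toy$asthma[train_idx], k = 5)

accuracy <- mean(pred == toy$asthma[-train_idx])
accuracy
```

Because the labels here are assigned at random, the accuracy only reflects the class imbalance; the point is the sequence of steps (encode, split, scale, predict), not the number.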

**Preliminary Exploratory Data Analysis**
<br> 
    Our data comes from the web source https://www.kaggle.com/datasets/prasad22/healthcare-dataset/data, which we cannot read directly into Jupyter Notebook. We therefore downloaded the file, added it to our GitHub repository, and read it from the URL of the raw file on GitHub. The data is read in and assigned to the object "health_data". Although the data table is tidy, many variables are unnecessary for our project, so we reduced the table to age, gender, blood_type, medical_condition, and age_range. The age_range column splits the data into 8 groups: 0-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, and 81+. Every other group spans 10 years, but we combined 0-10 and 11-20 because those ages have too few observations. To split the data into training and testing sets, we set a random seed of 2000 and used a 3/4 split stratified by medical_condition, with three quarters of the rows going to the training set and the remainder to the testing set. We counted the observations in each level of each variable and concluded that the data are fairly evenly distributed across all of the groups. The main table is "summary_data", which counts the number of patients with each medical condition by "age_range", "blood_type", and "gender". We then visualize "summary_data" as bar graphs showing the distribution of male and female patients for each medical condition. The graphs let us verify that male and female patients are evenly distributed across every "age_range" for each medical condition. This even spread should help our model's precision, since no single group dominates the training data.
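
The age grouping described above can be checked with a small sketch. It reuses the breaks and labels from our code and applies them to a handful of made-up ages (the vector `ages` is hypothetical) chosen to sit on the group boundaries.

```r
# Breaks and labels as defined in the notebook
age_ranges <- c(0, 20, 30, 40, 50, 60, 70, 80, Inf)
age_labels <- c("0-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81+")

# Hypothetical ages placed on and around the boundaries
ages <- c(0, 20, 21, 30, 31, 80, 81, 95)

# cut() builds intervals that are open on the left and closed on the right,
# e.g. (20, 30]; include.lowest = TRUE also admits the lowest break, age 0
binned <- cut(ages, breaks = age_ranges, labels = age_labels, include.lowest = TRUE)

data.frame(age = ages, age_range = binned)
```

With whole-number ages the labels match the intervals exactly: 20 falls in "0-20" and 21 in "21-30". A fractional age such as 20.5 would land in "21-30", so the labels assume ages are recorded as integers.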

**Methods**

**Expected Outcomes and Significance** 

**Contributors:**
<br>*An Dao*, *Moya Ku*