# <div style="text-align: center;">Project 3: Classification</div>

## **1. Problem Description (Business Understanding)**

The ongoing COVID-19 pandemic has revealed stark disparities in how counties across the United States—and particularly in Texas—have been impacted. Research has consistently shown that differences in health outcomes are influenced by a complex interaction of socioeconomic, demographic, and behavioral factors (CDC, 2021; Benitez et al., 2020). State and local health departments, including the Texas Department of State Health Services (DSHS), require reliable models to anticipate where resources and interventions are most needed.

The goal of this project is to build predictive models that can accurately identify counties with high COVID-19 mortality rates using a range of population and health-related features. These include income inequality, poverty rates, commuting patterns, ability to work remotely, food insecurity, and access to healthcare (Raifman & Raifman, 2020). By understanding which variables most strongly correlate with severe outcomes, policymakers can prioritize data-driven strategies for prevention and resource allocation.

Key stakeholders include:
- **Public health agencies** (e.g., DSHS, CDC)
- **Local governments** seeking to allocate medical supplies, testing, or vaccination campaigns
- **Epidemiologists and data analysts** responsible for surveillance and predictive monitoring

**Research Questions:**
1. Which counties in Texas are most vulnerable to COVID-19 mortality based on social and economic indicators?
2. Can predictive models classify counties as low, medium, or high risk with high accuracy?
3. What are the most important features that drive model predictions, and how can they guide policy interventions?

This project is important because it translates raw COVID-19 case data into actionable intelligence. A well-performing model will not only help in short-term response planning but also inform longer-term policy decisions to reduce structural vulnerabilities before future public health emergencies (Bertsimas et al., 2020).

### References (APA Style)

- Benitez, J. A., Courtemanche, C., & Yelowitz, A. (2020). *Racial and ethnic disparities in COVID-19: Evidence from six large cities*. National Bureau of Economic Research. https://doi.org/10.3386/w27592  
- Bertsimas, D., Boussioux, L., Cory-Wright, R., Delarue, A., Digalakis Jr, V., & Lami, O. (2020). *From predictions to prescriptions: A data-driven response to COVID-19*. Health Care Management Science, 24(2), 1–22. https://doi.org/10.1007/s10729-020-09549-8  
- Centers for Disease Control and Prevention (CDC). (2021). *COVID-19 racial and ethnic health disparities*. https://www.cdc.gov/coronavirus/2019-ncov/community/health-equity/race-ethnicity.html  
- Raifman, M. A., & Raifman, J. R. (2020). *Disparities in the population at risk of severe illness from COVID-19 by race/ethnicity and income*. American Journal of Preventive Medicine, 59(1), 137–139. https://doi.org/10.1016/j.amepre.2020.04.003  

```r
# Define all the required packages
# ("cluster" is loaded in the next cell, so it is listed here as well;
#  "stringr" is installed as part of the tidyverse)
packages <- c(
  "tidyverse", "ggrepel", "ggcorrplot", "DT", "gridExtra",
  "sf", "modeest", "factoextra", "kableExtra", "reshape2",
  "knitr", "caret", "car", "lubridate", "tigris",
  "mclust", "cluster", "dbscan"
)

# Install any that aren't already installed
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
  install.packages(packages[!installed])
}
```

```r
# List of required libraries
libs <- c(
  "tidyverse", "ggrepel", "ggcorrplot", "DT", "gridExtra",
  "sf", "modeest", "knitr", "factoextra", "reshape2",
  "kableExtra", "stringr", "caret", "car", "lubridate",
  "tigris", "mclust", "cluster", "dbscan"
)

# Load each library
invisible(lapply(libs, library, character.only = TRUE))
```

```r
# Define base directory (machine-specific; uncomment the line that matches your setup)
#base_dir <- "../../../../../COVID-19/"
#base_dir <- "C:/Users/leona/OneDrive/CSMS/Data Mining (CS7331)/Projects/Project 1/COVID-19/"
base_dir <- "/Users/salissa/Desktop/Data Mining/Datasets/COVID-19"

# Use file.path() for better compatibility
global_mobility_path <- file.path(base_dir, "Global_Mobility_Report.csv")
covid_cases_census_path <- file.path(base_dir, "c19_census.csv")
covid_cases_tx_path <- file.path(base_dir, "c19_tx.csv")
```

```r
# Custom function to read CSV files
read_data <- function(file_path, dataset_name) {
  # Fail early with a clear message if the file is missing
  # (stop() already prefixes its message with "Error")
  if (!file.exists(file_path)) {
    stop("File not found -> ", file_path)
  }
  cat(paste0("\n--- Loading ", dataset_name, " ---\n"))
  # Suppress col_type messages
  read_csv(file_path, show_col_types = FALSE)
}

# Load datasets
global_mobility <- read_data(global_mobility_path, "Global Mobility Data")
c19_census <- read_data(covid_cases_census_path, "COVID-19 Cases and Census Data")
covid_cases_tx <- read_data(covid_cases_tx_path, "COVID-19 Cases for Texas")

cat("\n--- All datasets successfully loaded! ---\n")
```

```
--- Loading Global Mobility Data ---

--- Loading COVID-19 Cases and Census Data ---

--- Loading COVID-19 Cases for Texas ---

--- All datasets successfully loaded! ---
```


### 2. Data Preparation [40 points]
- Define your classes (e.g., more than x corona-related cases or fatalities per population of 10,000 per week). Explain why you defined the classes this way. You should look at the data to answer this question. [10]
- Combine files as needed to prepare the data set for classification. You will need a single table with a class attribute to learn a model. [10]
- Identify predictive features, create additional features, and deal with missing data (for classification models that cannot handle missing data). [20]
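
As a rough illustration of the combine-and-impute steps listed above, the sketch below joins a case table to a census table on a county key and fills missing numeric values with the column median. It uses base R on toy data so that it runs on its own. The column names (`county_fips`, `deaths`, `total_pop`, `median_income`) are assumptions made for illustration and may not match the columns in the project files.

```r
# Toy stand-ins for the case and census tables (all values are made up)
cases <- data.frame(
  county_fips = c("48001", "48003", "48005"),
  deaths      = c(12, 40, 7)
)
census <- data.frame(
  county_fips   = c("48001", "48003", "48005"),
  total_pop     = c(58000, 18000, 87000),
  median_income = c(45000, NA, 52000)
)

# Left join: keep every county in the case table, attach census features
county_tbl <- merge(cases, census, by = "county_fips", all.x = TRUE)

# Replace missing values in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the imputation to every numeric column
num_cols <- sapply(county_tbl, is.numeric)
county_tbl[num_cols] <- lapply(county_tbl[num_cols], impute_median)

print(county_tbl)
```

Median imputation is only one option. To avoid leaking information from the test set, the medians would normally be computed on the training rows and then reused for the test rows.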


### 2.1 Defining Classes [10 points]
To predict county-level vulnerability to a future virus similar to COVID-19, we defined three classes based on historical COVID-19 death rates per 100,000 people. This outcome reflects the severity of pandemic impact, which public health departments use to allocate resources and prioritize interventions.

**Class Definition:**

- Low Risk: ≤ 100 deaths per 100k
- Medium Risk: > 100 and ≤ 200 deaths per 100k
- High Risk: > 200 deaths per 100k

**Rationale:**

- These thresholds align with natural breakpoints in the observed distribution of the data (see the histogram in Figure 1).
- They offer clear, interpretable groupings that are useful for health policy decision-making.
- Classification models generally benefit from roughly balanced classes. We chose the thresholds after inspecting the distribution, and they help keep the three classes close to balanced.
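
The class definition above can be sketched in a few lines of base R. This is a minimal example on toy data, not the project's actual pipeline: the column names `deaths` and `total_pop` are assumed, and the four counties are invented so that the boundary case (exactly 100 deaths per 100k) is visible.

```r
# Toy county table; `deaths` and `total_pop` are assumed column names
counties <- data.frame(
  county_name = c("County A", "County B", "County C", "County D"),
  deaths      = c(5, 20, 30, 90),
  total_pop   = c(10000, 20000, 20000, 30000)
)

# Cumulative deaths per 100,000 residents
counties$deaths_per_100k <- counties$deaths * 1e5 / counties$total_pop

# Right-closed intervals: (-Inf, 100], (100, 200], (200, Inf)
counties$risk_class <- cut(
  counties$deaths_per_100k,
  breaks = c(-Inf, 100, 200, Inf),
  labels = c("low", "medium", "high"),
  right  = TRUE
)

print(counties)

# Class counts, to check how balanced the resulting classes are
print(table(counties$risk_class))
```

Because `right = TRUE`, a county with exactly 100 deaths per 100k falls into the low class, which matches the "≤ 100" rule. Running `table()` on the real data is a quick way to confirm whether the chosen thresholds give reasonably balanced classes.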


### 3. Modeling [50 points]
- Prepare the data for training, testing, and hyperparameter tuning. [5]
- Using the training data, create at least three different classification models (different techniques). Discuss each model and the advantages of each classification method for your classification task. [30]
- Assess each model's performance (use training/test data, cross-validation, etc., as appropriate). [15]
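
One way the prepare, fit, and assess steps could fit together is sketched below. It is a hedged example, not the models chosen for this project: it builds a synthetic three-class data set, performs a stratified 80/20 split in base R, fits a single classification tree with `rpart`, and reports a confusion matrix, overall accuracy, and per-class recall on the held-out rows. The feature names (`poverty_rate`, `pct_uninsured`) and the helper `stratified_split()` are hypothetical.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)  # make the random data and split reproducible

# Synthetic stand-in for the county table (all columns are made up)
n_per_class  <- 50
class_levels <- c("low", "medium", "high")
toy <- data.frame(
  risk_class    = factor(rep(class_levels, each = n_per_class),
                         levels = class_levels),
  poverty_rate  = c(rnorm(n_per_class, 10, 3),
                    rnorm(n_per_class, 18, 3),
                    rnorm(n_per_class, 26, 3)),
  pct_uninsured = c(rnorm(n_per_class, 12, 4),
                    rnorm(n_per_class, 17, 4),
                    rnorm(n_per_class, 22, 4))
)

# Stratified split: sample the same fraction of rows within each class
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_split(toy$risk_class, p = 0.8)
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Fit a classification tree on the training rows only
tree_fit <- rpart(risk_class ~ ., data = train_set, method = "class")

# Evaluate on the held-out rows
pred     <- predict(tree_fit, newdata = test_set, type = "class")
conf_mat <- table(predicted = pred, actual = test_set$risk_class)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
recall_by_class <- diag(conf_mat) / colSums(conf_mat)

print(conf_mat)
print(round(c(accuracy = accuracy), 3))
print(round(recall_by_class, 3))
```

Per-class recall is worth reporting alongside accuracy, since missing a high-risk county is likely to be more costly than a false alarm. The `caret` package loaded earlier offers `createDataPartition()` for the same kind of stratified split and `trainControl(method = "cv")` for cross-validated tuning, which may be more convenient once several model types are compared.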


### 4. Evaluation [5 points]
Discuss how useful your model is for your chosen stakeholders. How would you assess the model's value if it were used?

### 5. Deployment [5 points]
- How would your model be used in practice? What actions would be taken based on your model? How often would the model be updated?