# <div style="text-align: center;">Project 1: Data & Visualization</div>

I have extracted data about the number of cases, demographics, and social distancing and put the data sets on Canvas (see above). You can always get more data directly from the link above.

Some general example questions we are interested in answering throughout the semester are:

- What is the trend in different areas (states, counties) of the US?
- Is social distancing done, and is it working?
- Can we identify regions that do particularly well? Why did they do well?
- Can we predict the development in a region given the data from other areas?
- What actions can we recommend to provide an effective emergency response?
- During the projects, you will come up with your own questions.

In this project, we will focus on cleaning and understanding the data. You need to work on the following steps of the CRISP-DM framework:

# 1. Problem Description (Business Understanding) [10 points]
[3 points]
Describe the Problem: 
- What is COVID-19, and what is social distancing and flattening the curve? 
- Why is it important to look at data about the virus spread, hospitalizations, and available resources? 

[7 points]
- Choose a stakeholder for whom you analyze and, later on, model the data. 
- Define some questions that are important for this stakeholder. 
- What decisions can your stakeholders make, and how would they affect COVID-19 outcomes? 
- Brainstorm this a lot since this choice will guide your exploration of this and all the following projects. 
- Make sure you can produce actionable recommendations for these questions using your data later in your report. 

## 1 Problem Description
##### 1.1 *What is COVID-19, and what are social distancing and flattening the curve?*
- COVID-19, caused by the SARS-CoV-2 virus, is a respiratory illness that emerged in late 2019 and led to a global pandemic. It spreads primarily through respiratory droplets and airborne transmission, causing symptoms ranging from mild respiratory issues to severe complications such as pneumonia, organ failure, and death, particularly in high-risk populations ([cdc.gov](https://www.cdc.gov/covid/about/index.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fcoronavirus%2F2019-ncov%2Fprevent-getting-sick%2Fhow-covid-spreads.html)). 
- Social distancing refers to reducing close physical interactions between individuals to prevent viral spread. This includes staying at least six feet apart, avoiding large gatherings, and minimizing non-essential travel ([who.int](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public)).
- Flattening the curve refers to implementing measures, such as social distancing, mask mandates, lockdowns, and remote work policies, to slow the virus's spread. The goal is to reduce the peak number of cases, preventing healthcare systems from becoming overwhelmed and ensuring medical resources remain available for critical patients ([michiganmedicine.org](https://www.michiganmedicine.org/health-lab/flattening-curve-covid-19-what-does-it-mean-and-how-can-you-help?)).

##### 1.2 *Why is it important to look at data about the virus spread, hospitalizations, and available resources?*
Analyzing COVID-19 data is essential for evidence-based decision-making in public health. Monitoring infection rates, hospital capacity, and medical resource availability helps in:
- Identifying outbreak hotspots – Tracking new cases allows health agencies to implement targeted restrictions before widespread community transmission occurs ([jhu.edu](https://coronavirus.jhu.edu/data)).
- Resource allocation – Data on hospitalizations guide the distribution of ICU beds, ventilators, and medical personnel where they are needed most ([nih.gov](https://www.nhlbi.nih.gov/covid)).
- Assessing policy effectiveness – Evaluating trends before and after interventions (e.g., lockdowns, mask mandates, and vaccine rollouts) determines which strategies work best ([research.unc.edu](https://research.unc.edu/2020/10/01/the-importance-of-covid-19-data-collection-and-transmission)).
- Protecting vulnerable populations – Disaggregated data by age, race, socioeconomic status, and pre-existing conditions ensure at-risk groups receive priority care and vaccinations ([cdc.gov](https://covid.cdc.gov/covid-data-tracker/#datatracker-home)).

##### 1.3 *Choose a stakeholder for whom you analyze and, later on, model the data*
- The chosen stakeholder is the public health department of Dallas County, TX: **Dallas County Health and Human Services (DCHHS)**. DCHHS is responsible for monitoring public health trends, implementing COVID-19 policies, and allocating healthcare resources within Dallas County. It provides COVID-19 testing, vaccinations, hospital coordination, and outbreak management ([DCHHS](https://www.dallascounty.org/departments/dchhs)).
- This stakeholder is ideal for this analysis because:
    - They oversee pandemic response efforts for a major metropolitan area with a diverse population.
    - They make critical decisions on public health measures affecting millions of residents in Dallas County.
    - Their policies impact local businesses, healthcare facilities, schools, and public services.

##### 1.4 *Define some questions that are important for this stakeholder.*
To effectively support DCHHS in decision-making, this analysis focuses on the following key questions:
- What are the current COVID-19 infection rates across different city neighborhoods?
    - This helps identify outbreak hotspots and areas needing immediate intervention.
- Which demographic groups are experiencing higher hospitalization rates?
    - Understanding vulnerable populations allows for more targeted healthcare and vaccination efforts.
- How effective have social distancing measures been in reducing virus transmission?
    - This will guide decisions on continuing, adjusting, or relaxing restrictions.
- What is the current capacity of local hospitals, and how close are they to being overwhelmed?
    - This informs resource allocation and helps prevent healthcare system collapse.
- What percentage of the population has been vaccinated, and how does this correlate with infection rates?
    - This helps determine if vaccination efforts need to be expanded or if additional measures are required.
- How have previous public health mandates (e.g., mask requirements, business restrictions) impacted case trends?
    - This is important for evaluating the effectiveness of past strategies and shaping future actions.

- Why Are These Questions Important?
    - These questions are critical because they allow DCHHS to:
        - Identify and address COVID-19 hotspots to implement targeted interventions before widespread transmission occurs.
        - Allocate medical resources more efficiently based on the current needs and capacity of local hospitals.
        - Evaluate the effectiveness of public health measures to decide whether they should be intensified, relaxed, or restructured.
        - Protect vulnerable populations by ensuring that healthcare services, such as vaccination and treatment, are prioritized based on demographic risk factors.
    - Answering these questions will enable DCHHS to make informed, data-driven decisions that can directly improve public health outcomes in Dallas County.

##### 1.5 *What decisions can your stakeholders make, and how would they affect COVID-19 outcomes?*
DCHHS plays a critical role in pandemic response by making data-driven decisions that directly impact COVID-19 transmission, healthcare system strain, and mortality rates. Their key decision areas include:
- Implementing or Adjusting Social Distancing Measures
    - By analyzing infection trends, DCHHS can tighten or relax mask mandates, capacity limits, and stay-at-home orders.
    - Example: During the Delta variant surge (2021), Dallas County reinstated indoor mask mandates based on rising case counts, reducing the growth rate of hospitalizations ([dallasnews.com](https://www.dallasnews.com/news/2021/08/11/dallas-county-judge-clay-jenkins-issues-new-mandate-requiring-masks-in-schools-businesses/)).

- Allocating Medical Resources Efficiently
    - Hospitalization and ICU data guide staffing and resource distribution across Dallas County hospitals.
    - Example: In July 2020, data-driven planning helped prevent hospital overload by redistributing ventilators to high-need areas in Dallas hospitals ([texastribune.org](https://www.texastribune.org/2020/07/14/texas-hospitals-coronavirus/)).

- Targeting Vaccination and Outreach Programs
    - Demographic analysis ensures that high-risk and underserved communities receive vaccines early.
    - Example: DCHHS launched community-based vaccination clinics in South Dallas, where vaccination rates were low and case rates were high ([parklandhealth.org](https://www.parklandhealth.org/community-calendar/dchhs-popup-clinics-2176)).

- Public Health Communication Strategies
    - Data helps shape effective health campaigns to encourage compliance with safety measures.
    - Example: DCHHS used real-time mobility data to warn against holiday travel surges in December 2020, reducing travel-related infections ([nbcdfw.com](https://www.nbcdfw.com/news/local/holiday-travel-surge-amid-rising-covid-19-cases/2843212/)).

Each of these decisions directly affects COVID-19 outcomes by controlling transmission, reducing hospital burden, and saving lives.

##### 1.6 *What data is needed?*
For this analysis, several types of data will be necessary:
- COVID-19 case data: Information on the number of infections, recoveries, and deaths, segmented by location, age, and other demographics.
- Hospitalization data: Data on ICU capacity, ventilator use, and hospital admissions related to COVID-19.
- Social distancing and mobility data: Information on people's movement patterns to understand compliance with public health mandates and detect emerging risks (e.g., increased mobility during holidays).
- Vaccination rates: Data on vaccination coverage, particularly broken down by demographic groups, to assess disparities in vaccine access and uptake.
- Public health intervention data: Insights on mask mandates, social distancing measures, and other interventions, along with their timeline and geographic implementation.

By collecting and analyzing these data points, DCHHS can develop a clearer picture of the pandemic’s impact, forecast future trends, and make decisions that mitigate harm to the community.

# 2. Data Understanding [45 points]
You must include all three provided datasets in your analysis!

[9 points]
- Describe what data is available. 
- Choose 5-10 important variables for the questions you have identified in the section above. 
- Describe the type of data (scale, values, etc.) of the most critical variables in the data. 

[9 points]
- Verify data quality: 
    - Are there missing values? 
    - Duplicate Data? 
    - Outliers? 
    - Are those mistakes? 
    - How can these be fixed? 
- Ensure your report states how much data is removed and how much you have left. 
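
The quality checks listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the required method: the records, the `county` and `cases` field names, and the `quality_report` helper are all hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers. A flagged value still needs a manual look to decide whether it is a mistake or a real extreme.

```python
from statistics import quantiles

# Hypothetical county records; ids and counts are invented for illustration.
records = [
    {"county": "C01", "cases": 95},
    {"county": "C02", "cases": 100},
    {"county": "C03", "cases": 105},
    {"county": "C04", "cases": None},   # missing value
    {"county": "C05", "cases": 110},
    {"county": "C02", "cases": 100},    # repeats an earlier county id
    {"county": "C06", "cases": 115},
    {"county": "C07", "cases": 120},
    {"county": "C08", "cases": 5000},   # suspiciously large value
]


def quality_report(rows, key="county", value="cases"):
    """Count duplicates, missing values, and IQR outliers in a list of dicts."""
    # Keep the first row seen for each key; later repeats count as duplicates.
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)

    # Drop rows whose value is missing.
    complete = [row for row in unique if row[value] is not None]

    # Flag values more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = quantiles([row[value] for row in complete], n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [row[key] for row in complete
                if not (q1 - spread <= row[value] <= q3 + spread)]

    return {
        "rows_total": len(rows),
        "duplicates_removed": len(rows) - len(unique),
        "missing_removed": len(unique) - len(complete),
        "rows_left": len(complete),
        "outliers_flagged": outliers,
    }


report = quality_report(records)
print(report)
```

The `rows_total`, `duplicates_removed`, `missing_removed`, and `rows_left` entries give the "how much was removed and how much is left" numbers the report has to state.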

[9 points]
- Give appropriate statistics (range, mode, mean, median, variance, etc.) for the most important variables in these files and describe what they mean or if you find something interesting. 
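
As a rough sketch of the statistics named above, the standard library's `statistics` module covers most of them. The `cases_per_1000` values are invented and stand in for any numeric variable chosen from the provided datasets.

```python
from statistics import mean, median, mode, variance

# Hypothetical cases per 1,000 residents for five counties (invented values).
cases_per_1000 = [12, 15, 15, 18, 40]

summary = {
    "min": min(cases_per_1000),
    "max": max(cases_per_1000),
    "range": max(cases_per_1000) - min(cases_per_1000),
    "mean": mean(cases_per_1000),
    "median": median(cases_per_1000),
    "mode": mode(cases_per_1000),
    "variance": variance(cases_per_1000),  # sample variance (n - 1 denominator)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy sample the mean is larger than the median, which hints at a right-skewed distribution driven by the single large value. That is the kind of observation worth describing next to the numbers.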

[9 points]
- Visually explore the chosen attributes appropriately. 
- Provide an interpretation for each graph. 
- Explain why you chose the visualization for each attribute type. 

[9 points]
- Explore relationships between attributes: 
    - Look at the attributes and then use cross-tabulation, correlation, group-wise averages, box plots, etc., as appropriate. 
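
Two of the techniques mentioned above, correlation and group-wise averages, can be sketched as follows. The county rows and the field names `type`, `mobility_drop_pct`, and `cases_per_1000` are assumptions made up for this example, and `pearson` and `group_means` are small hypothetical helpers written out by hand so the computation is visible.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical county rows; all values are invented for illustration.
counties = [
    {"type": "urban", "mobility_drop_pct": 35, "cases_per_1000": 40},
    {"type": "urban", "mobility_drop_pct": 30, "cases_per_1000": 50},
    {"type": "rural", "mobility_drop_pct": 20, "cases_per_1000": 10},
    {"type": "rural", "mobility_drop_pct": 15, "cases_per_1000": 20},
    {"type": "rural", "mobility_drop_pct": 10, "cases_per_1000": 30},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def group_means(rows, group_key, value_key):
    """Average of value_key within each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


r = pearson([c["mobility_drop_pct"] for c in counties],
            [c["cases_per_1000"] for c in counties])
means = group_means(counties, "type", "cases_per_1000")
print(f"correlation: {r:.2f}")
print(means)
```

A correlation computed this way only describes a linear association across counties. It does not show that distancing caused a change in cases, so any interpretation in the report should say so.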

# 3. Data Preparation [30 points]
[30 points]
- Create a data set with objects as rows and features as columns. 
- Use US counties as the objects. 
- The data set must be included in your report. 
- Provide a table in your Word document that shows the data (values) for the first 10 rows for all the features you have selected and/or created. 
- Interesting additional features may be, for example: 
    - When was the first case reported?
    - How (densely) populated is the county?
    - What resources does a county have (money, hospital)?
    - What is the social distancing response, and how long did it take after the first case?
- You can come up with more critical questions for your chosen stakeholder.
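
One possible way to assemble such a county-level table is sketched below, with one row per county and derived features for the first reported case, population density, and the delay until a distancing response. All inputs are hypothetical: the county ids, dates, populations, areas, the `distancing_start` dates, and the `build_county_table` helper are invented for illustration and do not reflect the layout of the provided datasets.

```python
from datetime import date

# Hypothetical inputs: ids, dates, and numbers are all invented.
daily_cases = [  # (county_id, report_date, cumulative_cases)
    ("C01", date(2020, 3, 10), 1),
    ("C01", date(2020, 3, 11), 3),
    ("C02", date(2020, 3, 15), 0),
    ("C02", date(2020, 3, 18), 2),
]
demographics = {
    "C01": {"population": 100_000, "area_sq_mi": 50.0},
    "C02": {"population": 20_000, "area_sq_mi": 400.0},
}
distancing_start = {"C01": date(2020, 3, 24), "C02": date(2020, 3, 20)}


def build_county_table(cases, demo, response):
    """Return one dict per county with the original and derived features."""
    # Earliest date with at least one cumulative case, per county.
    first_case = {}
    for county, day, total in cases:
        if total > 0 and (county not in first_case or day < first_case[county]):
            first_case[county] = day

    table = []
    for county in sorted(demo):
        info = demo[county]
        first = first_case.get(county)   # None if no case was reported
        start = response.get(county)     # None if no response date is known
        table.append({
            "county": county,
            "population": info["population"],
            "density_per_sq_mi": info["population"] / info["area_sq_mi"],
            "first_case": first,
            "days_to_response": (start - first).days if first and start else None,
        })
    return table


county_table = build_county_table(daily_cases, demographics, distancing_start)
for row in county_table[:10]:   # first 10 rows, as the report table requires
    print(row)
```

Keeping every feature in a single row per county makes it straightforward to print the first 10 rows for the report and to reuse the same table in later projects.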

# 4. Recommendations [15 points]
[15 points]
- Formulate some recommendations for the questions developed in Section 1 based on the results in Sections 2 and 3. 
- Make sure your recommendations are based on data and are actionable for the stakeholder (i.e., the stakeholder has the power to execute the recommendations). 

# Required Appendix
A list that specifies which parts of the project each team member worked on, either as the lead or as a supporter.