
Name - Gayathri Pavani Guntamukkala,
Course Number - HDS 5230,  
Assignment Number - Week 05 Dask Programming Assignment.

In [69]:
!pip install "dask[dataframe]"




INTERPRETATION: This installs dask[dataframe], which includes dask-expr to avoid warnings related to query planning.

In [70]:
import dask.dataframe as dd

INTERPRETATION: Dask is imported to handle large-scale data processing in a parallelized manner.

In [71]:
# Define column data types explicitly to avoid mismatches
dtype_mapping = {
    "cases": "float64",
    "county": "object",
    "deaths": "float64",
    "recovered": "float64",
    "state": "object",
    "country": "object",
    "date": "object",
    "aggregate": "object",
    "city": "object",
    "population": "float64"
}
# Load the dataset using Dask
df = dd.read_csv("/content/timeseries.csv", dtype=dtype_mapping, low_memory=False)

# Check the first few rows
df.head()

Unnamed: 0,name,level,city,county,state,country,population,lat,long,url,...,recovered,active,tested,hospitalized,hospitalized_current,discharged,icu,icu_current,growthFactor,date
0,"Antwerp, Flanders, Belgium",county,,Antwerp,Flanders,Belgium,1847486.0,51.2485,4.7175,https://epistat.wiv-isp.be/,...,,,,,,,,,,2020-01-22
1,"Antwerp, Flanders, Belgium",county,,Antwerp,Flanders,Belgium,1847486.0,51.2485,4.7175,https://epistat.wiv-isp.be/,...,,,,,,,,,1.0,2020-01-23
2,"Antwerp, Flanders, Belgium",county,,Antwerp,Flanders,Belgium,1847486.0,51.2485,4.7175,https://epistat.wiv-isp.be/,...,,,,,,,,,1.0,2020-01-24
3,"Antwerp, Flanders, Belgium",county,,Antwerp,Flanders,Belgium,1847486.0,51.2485,4.7175,https://epistat.wiv-isp.be/,...,,,,,,,,,1.0,2020-01-25
4,"Antwerp, Flanders, Belgium",county,,Antwerp,Flanders,Belgium,1847486.0,51.2485,4.7175,https://epistat.wiv-isp.be/,...,,,,,,,,,1.0,2020-01-26


INTERPRETATION: The dictionary explicitly defines the data type of selected columns so that every partition is read consistently. Dask infers types from a sample at the start of the file, so a column that looks empty or integer-like early on can fail later when text or missing values appear. `cases`, `deaths`, `recovered`, and `population` are read as `float64`, which can hold missing values (NaN), while `county`, `state`, `country`, `date`, `aggregate`, and `city` are read as `object` (strings). The dataset is then loaded lazily into a Dask DataFrame with this `dtype_mapping`. `low_memory=False` is passed through to the pandas parser; it disables chunked type inference, which avoids mixed-type guesses at the cost of more memory, and it is largely redundant once `dtype` is given explicitly. Finally, `.head()` reads only the first partition and displays the first five rows, which is enough to inspect the structure and confirm that the file loaded correctly.
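The effect of an explicit dtype mapping can be seen on a very small sample. The sketch below uses plain pandas (whose CSV parser Dask calls for each partition) and a hypothetical three-row CSV string, not the real `timeseries.csv`; the column subset is chosen only for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for /content/timeseries.csv.
csv_text = (
    "county,state,cases,deaths\n"
    "Washoe County,Nevada,1,\n"
    ",Nevada,5,0\n"
    "Clark County,Nevada,,2\n"
)

# Same idea as dtype_mapping in the notebook, restricted to four columns.
dtype_mapping = {
    "county": "object",
    "state": "object",
    "cases": "float64",
    "deaths": "float64",
}

sample = pd.read_csv(io.StringIO(csv_text), dtype=dtype_mapping)

# Count columns are float64, so empty fields become NaN instead of raising.
print(sample.dtypes)
print(sample)
```

Reading count columns as `float64` rather than an integer type is what allows the empty fields to load as NaN.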

In [72]:
# Filter rows where 'country' is 'United States' and 'state' is not null
df_states = df[(df["country"] == "United States") & (df["state"].notnull())]

INTERPRETATION: The code filters the Dask DataFrame (`df`) to keep only rows where the `country` column equals "United States" and the `state` column is not null. This isolates the U.S. data and ensures that every remaining row carries valid state information, which is what the state-level analyses that follow require.

In [73]:
df_states.head()

Unnamed: 0,name,level,city,county,state,country,population,lat,long,url,...,recovered,active,tested,hospitalized,hospitalized_current,discharged,icu,icu_current,growthFactor,date
24958,"Washoe County, Nevada, United States",county,,Washoe County,Nevada,United States,471519.0,40.582,-119.5885,https://services.arcgis.com/iCGWaR7ZHc5saRIl/a...,...,,1.0,,,,,,,,2020-03-05
24959,"Washoe County, Nevada, United States",county,,Washoe County,Nevada,United States,471519.0,40.582,-119.5885,https://services.arcgis.com/iCGWaR7ZHc5saRIl/a...,...,,1.0,,,,,,,1.0,2020-03-06
24960,"Washoe County, Nevada, United States",county,,Washoe County,Nevada,United States,471519.0,40.582,-119.5885,https://services.arcgis.com/iCGWaR7ZHc5saRIl/a...,...,,,,,,,,,1.0,2020-03-07
24961,"Washoe County, Nevada, United States",county,,Washoe County,Nevada,United States,471519.0,40.582,-119.5885,https://services.arcgis.com/iCGWaR7ZHc5saRIl/a...,...,,,,,,,,,2.0,2020-03-08
24962,"Washoe County, Nevada, United States",county,,Washoe County,Nevada,United States,471519.0,40.582,-119.5885,https://services.arcgis.com/iCGWaR7ZHc5saRIl/a...,...,,,,,,,,,1.0,2020-03-09


INTERPRETATION: The code is used to display the first few rows of the DataFrame df_states.

In [74]:
# Filter data for the specified time period
time_filtered_df = df_states[(df_states['date'] >= '2020-01-01') & (df_states['date'] <= '2021-02-28')]

INTERPRETATION: The code filters the dataset to include only COVID-19 data for U.S. states within the time period January 1, 2020, to February 28, 2021.
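The filter compares the `date` column as strings, because `date` was loaded with the `object` dtype. That works here because ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in the same order as they do chronologically. A minimal check of that property, using a few made-up dates and only the standard library:

```python
from datetime import date

start, end = "2020-01-01", "2021-02-28"

# Hypothetical sample dates, two of them outside the window.
samples = ["2019-12-31", "2020-01-22", "2020-12-05", "2021-02-28", "2021-03-01"]

# Plain string comparison, as in the notebook's filter.
by_string = [s for s in samples if start <= s <= end]

# The same filter after parsing each value into a real date object.
lo, hi = date.fromisoformat(start), date.fromisoformat(end)
by_date = [s for s in samples if lo <= date.fromisoformat(s) <= hi]

print(by_string)
print(by_string == by_date)
```

The shortcut would not hold for formats such as `MM/DD/YYYY`, where parsing to datetimes first would be the safer choice.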

In [75]:
# Group by state and compute total deaths and average population
state_stats = time_filtered_df.groupby('state').agg({
    'deaths': 'sum',
    'population': 'mean'
}).compute()

INTERPRETATION: The code performs a grouped aggregation on the `time_filtered_df` DataFrame, which contains COVID-19 data filtered for the chosen time period. It groups the data by the `state` column, so that all rows belonging to the same state are processed together, and calculates two metrics per state: the total deaths (sum of the `deaths` column) and the average population (mean of the `population` column over all of that state's rows). Up to this point Dask has only built a task graph; `.compute()` executes it and returns the result as an ordinary pandas DataFrame. This aggregated data supports state-level analysis, such as comparing COVID-19 impacts across states or computing per-capita metrics like mortality rates. Dask processes the partitions in parallel, so the full dataset never has to be held in memory at once.

In [76]:
# Calculate per-capita mortality
state_stats['per_capita_mortality'] = state_stats['deaths'] / state_stats['population']
state_stats.head()

state,deaths,population,per_capita_mortality
Nevada,62459.0,494295.554677,0.12636
Virginia,186685.0,137362.997694,1.359063
Washington,184760.0,514111.314877,0.359377
Alabama,93139.0,155294.044143,0.599759
Alaska,1604.0,82083.801964,0.019541


INTERPRETATION: This calculates the per-capita mortality for each state by dividing the total number of deaths (`state_stats['deaths']`) by the average population (`state_stats['population']`) and stores the result in a new column, `per_capita_mortality`, within the `state_stats` DataFrame. The `.head()` method then displays the first few rows for a quick inspection. Several of the resulting values exceed 1 (for example, 1.359063 for Virginia), which cannot be a literal deaths-per-person rate: the numerator sums the `deaths` value of every row in the period, and the denominator is a mean over those same rows rather than the state's total population. The metric is therefore best read as a relative index for comparing states, not as an absolute mortality rate.
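To see how a sum-over-rows numerator and a mean-over-rows denominator can inflate the ratio, here is a toy sketch in pandas. It assumes, hypothetically, that `deaths` holds cumulative counts reported daily per county; the notebook does not state this, and the state name, counties, and numbers are invented.

```python
import pandas as pd

# Hypothetical data: one state, two counties, three days of cumulative deaths.
toy = pd.DataFrame({
    "state": ["X"] * 6,
    "county": ["A", "A", "A", "B", "B", "B"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"] * 2,
    "deaths": [1.0, 2.0, 3.0, 0.0, 1.0, 1.0],
    "population": [100.0] * 3 + [300.0] * 3,
})

# Same aggregation as the notebook: sum every row, average every row.
naive = toy.groupby("state").agg({"deaths": "sum", "population": "mean"})
naive_ratio = naive.loc["X", "deaths"] / naive.loc["X", "population"]

# One alternative: take each county's latest cumulative value,
# then add deaths and populations across counties.
latest = toy.sort_values("date").groupby(["state", "county"]).last()
by_state = latest.groupby("state").agg({"deaths": "sum", "population": "sum"})
alt_ratio = by_state.loc["X", "deaths"] / by_state.loc["X", "population"]

print(naive_ratio, alt_ratio)
```

Under these assumptions the notebook-style ratio comes out four times larger than the alternative; whether the alternative is appropriate for the real dataset depends on how its `deaths` and `population` columns are actually defined.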

In [77]:
# Rank states by per-capita mortality
state_stats['rank'] = state_stats['per_capita_mortality'].rank(ascending=False)
#Display the ranked states
print(state_stats.sort_values('rank'))

                                 deaths    population  per_capita_mortality  \
state                                                                         
New York                      3852431.0  6.023470e+05              6.395701   
Michigan                       880814.0  2.748462e+05              3.204753   
Louisiana                      423430.0  1.548027e+05              2.735288   
Illinois                       765763.0  3.746646e+05              2.043863   
New Jersey                    1710535.0  8.421302e+05              2.031200   
Georgia                        309920.0  1.529650e+05              2.026085   
Pennsylvania                   758770.0  4.142725e+05              1.831572   
Virginia                       186685.0  1.373630e+05              1.359063   
Mississippi                    104608.0  7.710448e+04              1.356705   
Indiana                        217660.0  1.608219e+05              1.353423   
Ohio                           308006.0  2.908617e+0

INTERPRETATION: The code ranks states based on their **per-capita mortality**, which is calculated as the ratio of total deaths to the average population. It adds a new column, `rank`, to the `state_stats` DataFrame, where each state is assigned a rank based on its per-capita mortality value. The `ascending=False` parameter ensures that states with higher per-capita mortality receive lower (more severe) rank values. Finally, the DataFrame is sorted by the `rank` column and displayed, showing the states ordered from highest to lowest per-capita mortality. This ranking helps identify states that were most severely affected by COVID-19 relative to their population size.
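A small illustration of how `rank(ascending=False)` behaves, including ties, may help when reading the ranking. The series below is hypothetical; `method="dense"` is shown only as one alternative to the default tie handling.

```python
import pandas as pd

# Hypothetical per-capita values for four states; A and C are tied.
rates = pd.Series({"A": 0.5, "B": 2.0, "C": 0.5, "D": 1.0})

# Default method="average": tied entries share the mean of their positions.
ranks = rates.rank(ascending=False)

# method="dense": tied entries share a rank and no rank numbers are skipped.
dense_ranks = rates.rank(ascending=False, method="dense")

print(ranks.sort_values())
print(dense_ranks.sort_values())
```

With the default method, ranks are floats because tied values receive fractional ranks, which is why the notebook's ranks print as `1.0`, `2.0`, and so on.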

In [78]:
# Extract year and month from the date column
time_filtered_df['month'] = dd.to_datetime(time_filtered_df['date']).dt.to_period('M')



INTERPRETATION: The code extracts the **year and month** from the `date` column in the `time_filtered_df` DataFrame and stores it in a new column called `month`. It first converts the `date` column to a datetime format using `dd.to_datetime()` to ensure proper date manipulation. Then, it uses the `.dt.to_period('M')` method to transform the datetime values into a **year-month format** (e.g., `2020-01` for January 2020). This new `month` column is useful for grouping or analyzing data on a monthly basis, enabling time-based aggregations or trend analysis over specific months. The use of Dask ensures that this operation is performed efficiently, even on large datasets.
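The same conversion can be tried on a few values in plain pandas, which exposes the same `.dt.to_period('M')` accessor. The dates here are made up for illustration.

```python
import pandas as pd

# Hypothetical date strings in the same YYYY-MM-DD format as the dataset.
dates = pd.Series(["2020-01-22", "2020-01-31", "2021-02-28"])

# Parse to datetimes, then truncate each value to its year-month period.
months = pd.to_datetime(dates).dt.to_period("M")

print(months.astype(str).tolist())
```

Both January dates collapse to the same period, which is what makes the column usable as a monthly grouping key.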

In [79]:
# Group by state and month, and compute total deaths and cases
monthly_stats = time_filtered_df.groupby(['state', 'month']).agg({
    'deaths': 'sum',
    'cases': 'sum'
}).compute()

INTERPRETATION: This code performs a **grouped aggregation** on the `time_filtered_df` DataFrame, which contains COVID-19 data filtered for a specific time period. It groups the data by both `state` and `month`, ensuring that rows are organized by state and further subdivided by each month. For each state-month combination, it calculates two key metrics: the **total deaths** by summing the `deaths` column and the **total cases** by summing the `cases` column. The `.compute()` method is used to execute the computation. The result is a new DataFrame, `monthly_stats`, where each row represents a state-month pair, with columns for total deaths and total cases. This aggregated data is useful for analyzing trends over time, such as monthly changes in cases and deaths, and can support further calculations like case fatality rates (CFR). The use of Dask ensures efficient processing of large datasets through parallel and distributed computing.

In [80]:
# Calculate CFR
monthly_stats['cfr'] = monthly_stats['deaths'] / monthly_stats['cases']*100

INTERPRETATION: The code calculates the **Case Fatality Rate (CFR)** for each entry in the `monthly_stats` DataFrame. CFR is a metric used to measure the severity of a disease by determining the proportion of deaths relative to the total number of confirmed cases. Here, `monthly_stats['deaths']` represents the total deaths, and `monthly_stats['cases']` represents the total confirmed cases. The result is stored in a new column, `monthly_stats['cfr']`, which contains the CFR values expressed as percentages. This calculation is essential for understanding the lethality of COVID-19 over time or across different regions, providing insights into the effectiveness of healthcare responses and the severity of the pandemic.

In [81]:
# Handle cases where 'cases' is zero or missing:
# 0/0 yields NaN and deaths/0 yields inf, so map inf to NaN before filling
monthly_stats['cfr'] = monthly_stats['cfr'].replace([float('inf'), float('-inf')], float('nan')).fillna(0)


INTERPRETATION: This code cleans the `cfr` (Case Fatality Rate) column of the `monthly_stats` DataFrame for state-month pairs where the rate is undefined. Dividing by zero cases produces **NaN** when deaths are also zero and **infinity** when deaths are positive, and missing inputs also produce NaN. The `.replace()` call converts infinite values to NaN, and `.fillna(0)` then sets every NaN to `0`, so the column contains only finite numbers and can be used safely in later aggregations or visualizations. Note that `0` here is a placeholder for "CFR undefined" rather than a measured fatality rate of zero, so these entries should be interpreted with care, especially when computing month-to-month changes.
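The difference between the two division-by-zero outcomes is easy to verify on a three-row example. The numbers are hypothetical, and `numpy` is imported only for its `inf` and `nan` constants.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a normal month, a month with no cases and no deaths,
# and a month with a recorded death but zero recorded cases.
toy = pd.DataFrame({"deaths": [2.0, 0.0, 1.0], "cases": [100.0, 0.0, 0.0]})

cfr = toy["deaths"] / toy["cases"] * 100

# fillna alone replaces the NaN from 0/0 but leaves the inf from 1/0.
only_filled = cfr.fillna(0)

# Mapping inf to NaN first makes fillna cover both situations.
cleaned = cfr.replace([np.inf, -np.inf], np.nan).fillna(0)

print(cfr.tolist())
print(only_filled.tolist())
print(cleaned.tolist())
```

An alternative design is to leave undefined rates as NaN, since most pandas aggregations skip NaN by default; which choice is better depends on how the matrix is used afterwards.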

In [82]:
# Reshape the data into a states x months matrix (nominally 50 x 14)
cfr_matrix = monthly_stats['cfr'].unstack()

# Display the CFR matrix
print(cfr_matrix)

month                          2020-03    2020-04    2020-05   2020-06  \
state                                                                    
Nevada                        2.033319   4.051924   4.970358  3.830004   
Virginia                      1.406402   2.865308   3.315829  2.862225   
Washington                    5.051079   5.255217   5.308562  4.396312   
Alabama                       0.532313   2.830899   3.889270  2.962907   
Alaska                        0.335008   2.314519   2.196905  1.303247   
Arizona                       0.000000   1.486545   1.992175  0.211513   
Arkansas                      0.915656   1.911450   2.129628  1.515155   
California                    2.006735   3.479974   3.983350  3.178666   
Colorado                      0.939250   2.636616   5.372019  5.419220   
Connecticut                   1.814771   6.477626   9.016204  9.383106   
Delaware                      1.334107   2.734038   3.574849  4.164733   
Florida                       0.842669

INTERPRETATION: The code reshapes the `cfr` column of `monthly_stats`, which holds the monthly **case fatality rate (CFR)** for each state, into a states x months matrix, nominally **50 (states) x 14 (months)**. This is achieved with the `.unstack()` method, which pivots the inner index level (`month`) into columns, so that each row represents a state, each column a month, and each value the CFR for that state and month. The actual shape depends on the data: the printed matrix begins at 2020-03, and the later ranking output includes territories such as the Northern Mariana Islands, so the result is not exactly 50 x 14. Any state-month pair absent from `monthly_stats` appears as NaN. Finally, the matrix is displayed with `print(cfr_matrix)`. This layout is convenient for analyzing temporal patterns in CFR and for further computations, such as ranking states based on changes in CFR over time.
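As a minimal sketch of the reshaping step, the example below unstacks a hypothetical three-entry series indexed by `(state, month)`. The values are invented, and one state-month pair is deliberately missing.

```python
import pandas as pd

# Hypothetical CFR values; Nevada has no entry for 2020-03.
idx = pd.MultiIndex.from_tuples(
    [("Alaska", "2020-03"), ("Alaska", "2020-04"), ("Nevada", "2020-04")],
    names=["state", "month"],
)
cfr = pd.Series([0.3, 2.3, 4.1], index=idx, name="cfr")

# Pivot the inner index level (month) into columns.
matrix = cfr.unstack()

print(matrix)
print(matrix.shape)
```

The shape comes from the distinct states and months present in the index, and the missing pair shows up as NaN, so checking `cfr_matrix.shape` is a quick way to confirm the expected dimensions.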

In [83]:
# Compute month-to-month changes in CFR
cfr_changes = cfr_matrix.diff(axis=1)

# Aggregate changes (e.g., sum of absolute changes)
cfr_changes_aggregated = cfr_changes.abs().sum(axis=1)

# Rank states based on aggregated changes
ranked_states = cfr_changes_aggregated.rank(ascending=False)

# Display the ranked states
print(ranked_states.sort_values())

state
Michigan                         1.0
Illinois                         2.0
Northern Mariana Islands         3.0
New Jersey                       4.0
Connecticut                      5.0
Massachusetts                    6.0
Washington                       7.0
Pennsylvania                     8.0
New Hampshire                    9.0
Missouri                        10.0
California                      11.0
Rhode Island                    12.0
Florida                         13.0
Oklahoma                        14.0
Wisconsin                       15.0
United States Virgin Islands    16.0
Arizona                         17.0
New York                        18.0
Nevada                          19.0
Ohio                            20.0
Louisiana                       21.0
Alabama                         22.0
Colorado                        23.0
South Carolina                  24.0
Maryland                        25.0
Maine                           26.0
North Carolina                  

INTERPRETATION: The code computes **month-to-month changes in the Case Fatality Rate (CFR)** for each state, aggregates these changes by summing their absolute values, and ranks the states based on the total magnitude of changes. The results are displayed to show which states experienced the most significant fluctuations in CFR over time. This analysis helps identify states with varying COVID-19 outcomes, potentially reflecting differences in healthcare response or testing rates. The use of absolute values ensures both increases and decreases in CFR are equally considered in the ranking.
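The three steps (difference, absolute sum, rank) can be traced by hand on a tiny matrix. The two states and their CFR values below are hypothetical.

```python
import pandas as pd

# Hypothetical CFR matrix: rows are states, columns are consecutive months.
matrix = pd.DataFrame(
    {"2020-03": [1.0, 2.0], "2020-04": [3.0, 2.5], "2020-05": [2.0, 2.0]},
    index=["A", "B"],
)

# Month-to-month change along each row; the first column has no predecessor.
changes = matrix.diff(axis=1)

# Total absolute movement per state (NaN in the first column is skipped).
volatility = changes.abs().sum(axis=1)

# Rank 1 goes to the state whose CFR moved the most.
ranks = volatility.rank(ascending=False)

print(changes)
print(volatility)
print(ranks)
```

State A moves by 2.0 and then 1.0 for a total of 3.0, while state B moves by 0.5 twice for a total of 1.0, so A ranks first. Note that this measure rewards fluctuation in either direction and says nothing about whether CFR ended higher or lower.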

Using a **parallelized/distributed approach** (via Dask) suits this assignment because the COVID-19 time series is a **large dataset** that is filtered and grouped before any result is needed. Dask splits the file into partitions, applies the filtering, grouping, and aggregation steps to those partitions in parallel, and only materializes the small aggregated results when `.compute()` is called, which keeps memory use low. The same code scales from a single machine to a cluster. The later steps (per-capita mortality, ranking, CFR, reshaping, and month-to-month changes) operate on those small pandas results and do not need parallelization. In general, parallelization pays off for **large datasets** and **complex workflows** but adds overhead that is unnecessary for small datasets or trivial tasks.