# Lecture Activity

In this activity, you will work in small groups to analyze one of the following datasets. Your tasks are as follows:

1. Choose the most suitable method for handling the missing data from the techniques discussed today (either dropping or imputing).
2. Discuss the trade-offs of the approach you selected, such as how much data is lost by dropping or what assumptions the imputed values introduce.
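
As a reminder of what the two options look like in pandas, here is a minimal sketch on a made-up toy table. The column names `a` and `b` and all values are invented for illustration and are not taken from either activity dataset. Mean imputation is only one of several possible imputation choices.

```python
import numpy as np
import pandas as pd

# Made-up toy values (assumption: not from the activity datasets)
toy = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [10.0, 20.0, np.nan, 40.0]}
)

# Count missing entries per column before deciding on a strategy
missing_per_column = toy.isna().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute each column's gaps with that column's mean
imputed = toy.fillna(toy.mean())

print(missing_per_column)
print(dropped)
print(imputed)
```

Dropping keeps only fully observed rows, so the table shrinks. Imputing keeps every row but replaces the gaps with estimated values.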

## 1. Weather Data Example

This dataset contains daily temperature records for Columbia, Missouri, spanning from January 15, 2024, to November 15, 2024. The temperatures are provided in both standard and metric units. The data was collected by the [University of Missouri Weather Station](https://www.ncei.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00231801/detail) and sourced from the National Centers for Environmental Information (NCEI), a division of NOAA. You can access the original data source here: [NCEI Climate Data](https://www.ncei.noaa.gov/cdo-web/).

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/images/weather_station_figure.jpg" alt="University of Missouri Weather Station" width="500">
</center>

**Dataset Details:** Each file includes daily precipitation and daily minimum and maximum temperature readings, with some files containing intentional missing values (NaNs) for exercises in data cleaning and imputation.

- **Metric Units** ([data_2024_metric_missing.csv](https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/data_2024_metric_missing.csv)):
  - **PRCP**: Daily Precipitation (mm)
  - **TMIN**: Minimum Daily Temperature (°C)  
  - **TMAX**: Maximum Daily Temperature (°C)  

- **Standard Units** ([data_2024_standard_missing.csv](https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/data_2024_standard_missing.csv)):  
  - **PRCP**: Daily Precipitation (inches)
  - **TMIN**: Minimum Daily Temperature (°F)  
  - **TMAX**: Maximum Daily Temperature (°F)  

**Note:** This dataset is intended for **educational use only**. Although based on historical data, it has been modified for training purposes to illustrate techniques for handling missing data in Python and may not be suitable for operational weather analysis or forecasting.
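
Because daily temperatures usually change gradually from one day to the next, time-based interpolation is one candidate imputation method a group might consider here. The sketch below uses a made-up six-day table that only borrows the `TMIN` and `TMAX` column names. The values and the `limit=1` setting are assumptions for illustration, not a recommended answer for the activity.

```python
import numpy as np
import pandas as pd

# Made-up toy readings on a daily DatetimeIndex (assumption: not real station data)
dates = pd.date_range("2024-01-15", periods=6, freq="D")
toy_weather = pd.DataFrame(
    {
        "TMIN": [20.0, np.nan, 24.0, 26.0, np.nan, 30.0],
        "TMAX": [40.0, 42.0, np.nan, np.nan, 48.0, 50.0],
    },
    index=dates,
)

# Interpolate along the time axis, filling at most one consecutive NaN per gap
filled = toy_weather.interpolate(method="time", limit=1)

print(filled)
print(filled.isna().sum())
```

With `limit=1`, single-day gaps are filled, while the second day of a two-day gap stays missing. This makes it visible where the gap may be too long to trust a straight-line estimate.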

In [None]:
# Import pandas for data manipulation
import pandas as pd

# Set unit system: 'standard' (Fahrenheit, inches) or 'metric' (Celsius, mm)
unit_system = 'standard'
# unit_system = 'metric'  # Uncomment to switch to metric

# URL of the CSV file with climate data (includes missing values)
link = f'https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/data_2024_{unit_system}_missing.csv'

# Load the CSV file into a DataFrame, using the first column as the index
climate_data = pd.read_csv(link, index_col=0)

# Convert the index to datetime format for time series analysis
climate_data.index = pd.to_datetime(climate_data.index)

# Display the DataFrame to check data loading
display(climate_data)

In [None]:
# --- Your Analyses ---

## 2. Missouri Monthly Unemployment Claims by Industry

This dataset represents monthly unemployment claims in Missouri across five industries from August 2011 to October 2024. It includes both complete and missing-value variants, making it useful for exploring temporal patterns in unemployment and developing imputation techniques for missing data.

This dataset is sourced from the [State of Missouri Data Portal](https://data.mo.gov/). You can access the original data source here: [Missouri Monthly Unemployment Claims by Industry](https://data.mo.gov/Labor/Missouri-Monthly-Unemployment-Claims-By-Industry/cj66-t7xq/about_data).

**Dataset Details:** This dataset includes monthly unemployment claims with some entries containing intentional missing values (NaNs) for exercises in data cleaning and imputation.

- **Unemployment Claims Dataset** ([monthly_unemployment_missing.csv](https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/monthly_unemployment_missing.csv))

**Industries Included:**  
- Administrative & Support/Waste Management/Remediation Services
- Manufacturing
- Construction
- Health Care & Social Assistance
- Accommodation & Food Services

**Note:** This dataset is intended for **educational use only**. Although based on historical data, it has been modified for training purposes to illustrate techniques for handling missing data in Python and may not be suitable for operational analysis or forecasting.
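
One way to ground the discussion of an imputation method is to hide a few known values, impute them, and measure how far the estimates land from the truth. The sketch below does this on a made-up monthly series. The series name `claims`, its values, and the two candidate methods are assumptions for illustration. The actual column names in the CSV may differ.

```python
import numpy as np
import pandas as pd

# Made-up monthly series (assumption: not real claims data)
months = pd.date_range("2011-08-01", periods=8, freq="MS")
complete = pd.Series(
    [100.0, 120.0, 140.0, 160.0, 150.0, 140.0, 130.0, 120.0],
    index=months,
    name="claims",
)

# Hide two known values to simulate missing entries
hidden = [months[2], months[5]]
masked = complete.copy()
masked.loc[hidden] = np.nan

# Candidate imputation methods to compare
candidates = {
    "forward_fill": masked.ffill(),
    "linear": masked.interpolate(method="linear"),
}

# Mean absolute error of each method on the hidden positions only
errors = {
    name: (filled.loc[hidden] - complete.loc[hidden]).abs().mean()
    for name, filled in candidates.items()
}

print(errors)
```

On this toy series the linear method recovers the hidden values exactly because the neighbors were chosen to be evenly spaced. Real claims data with seasonal swings or sudden spikes may rank the methods differently.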

In [None]:
# Import pandas for data manipulation
import pandas as pd

# URL of the CSV file with unemployment claims data (includes missing values)
link = 'https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/monthly_unemployment_missing.csv'

# Load the CSV file into a DataFrame, using the first column as the index
unemployment_data = pd.read_csv(link, index_col=0)

# Convert the index to datetime format for time series analysis
unemployment_data.index = pd.to_datetime(unemployment_data.index)

# Display the DataFrame to check data loading
display(unemployment_data)

In [None]:
# --- Your Analyses ---