# 1. Weather Data Example

This dataset contains daily temperature records for Columbia, Missouri, spanning from October 1 to October 10, 2024. The temperatures are presented in both standard and metric units. The data was collected by the [University of Missouri weather station](https://www.ncei.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00231801/detail) and sourced from the National Centers for Environmental Information (NCEI), a division of NOAA. You can access the original data source here: [NCEI Climate Data](https://www.ncei.noaa.gov/cdo-web/).

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/images/weather_station_figure.jpg" alt="University of Missouri Weather Station" width="500">
</center>

**Dataset Details:** Each file contains daily minimum and maximum temperature readings. Both files linked below include intentionally introduced missing values (NaNs) for exercises in data cleaning and imputation.

- **Metric Units** ([data_10day_metric_missing.csv](https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/data_10day_metric_missing.csv)):  
  - **TMIN**: Minimum Daily Temperature (°C)  
  - **TMAX**: Maximum Daily Temperature (°C)  

- **Standard Units** ([data_10day_standard_missing.csv](https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/data_10day_standard_missing.csv)):  
  - **TMIN**: Minimum Daily Temperature (°F)  
  - **TMAX**: Maximum Daily Temperature (°F)  

**Note**: This dataset is intended for **educational use only**. Although based on historical data, it has been modified for training purposes to illustrate techniques for handling missing data in Python and may not be suitable for operational weather analysis or forecasting.

## 1.1 Reading Data

<font color='Green'><b>Example:</b></font> This example demonstrates how to load climate data from a CSV file into a pandas DataFrame. We set the unit system (either 'standard' for Fahrenheit or 'metric' for Celsius), construct the URL for the CSV file with missing values, and read the data into a DataFrame. The index is then converted to datetime format for time series analysis, and the DataFrame is displayed to verify successful loading.

<div class="alert alert-block alert-warning">
Pandas can read data from many sources, such as CSV files, Excel workbooks, and SQL databases. For CSV files, the standard function is <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html" target="_blank">`pandas.read_csv()`</a>.
</div>

In [1]:
# Import pandas for data manipulation
import pandas as pd

# Set unit system: 'standard' (Fahrenheit) or 'metric' (Celsius)
unit_system = 'standard'
# unit_system = 'metric'  # Uncomment to switch to metric

# URL of the CSV file with climate data (includes missing values)
link = f'https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/refs/heads/main/data_files/data_10day_{unit_system}_missing.csv'

# Load the CSV file into a DataFrame, using the first column as the index
climate_data = pd.read_csv(link, index_col=0)

# Convert the index to datetime format for time series analysis
climate_data.index = pd.to_datetime(climate_data.index)

# Display the DataFrame to check data loading
display(climate_data)

Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,,
2024-10-04,,83.0
2024-10-05,62.0,89.0
2024-10-06,,79.0
2024-10-07,47.0,72.0
2024-10-08,,
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0
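
As a hedged, offline sketch of the same loading step, the snippet below reads a few rows copied from the table above out of an in-memory string instead of the URL. It assumes the file's first header cell is empty, as the `Unnamed: 0` label suggests; `csv_text` and `sample` are illustrative names that do not appear in the notebook. Passing `parse_dates=True` together with `index_col=0` asks `read_csv` to parse the index as dates, which folds the separate `pd.to_datetime` step into the read.

```python
from io import StringIO

import pandas as pd

# Inline copy of the first five rows shown above, so the sketch runs offline.
# Assumption: the real file has an empty first header cell, like this one.
csv_text = """,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,,
2024-10-04,,83.0
2024-10-05,62.0,89.0
"""

# index_col=0 uses the date column as the index;
# parse_dates=True converts that index to datetimes during the read.
sample = pd.read_csv(StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)
print(sample.dtypes)
```

Empty fields are read as NaN, so both temperature columns come back as floating-point columns.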


## 1.2 Identifying Missing Data


<font color='Green'><b>Example:</b></font> This example demonstrates how to identify missing values in the `climate_data` DataFrame. The `isnull()` (or `isna()`) method is used to create a boolean DataFrame, where `True` indicates the presence of a missing value. The resulting DataFrame is displayed to visualize the locations of missing data.


<div class="alert alert-block alert-warning">
In pandas, the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html" target="_blank">`pandas.DataFrame.isna()`</a> and <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html#pandas.DataFrame.isnull" target="_blank">`pandas.DataFrame.isnull()`</a> methods check for missing values in a DataFrame or Series. `isnull()` is an alias of `isna()`, so the two can be used interchangeably. Both return a boolean mask with the same shape as the input, where `True` marks a missing value and `False` marks a non-missing value.
</div>

In [2]:
# Identify missing values in the DataFrame
missing_values = climate_data.isnull()

# Display the boolean DataFrame showing missing values
display(missing_values)

Unnamed: 0,TMIN,TMAX
2024-10-01,False,False
2024-10-02,False,False
2024-10-03,True,True
2024-10-04,True,False
2024-10-05,False,False
2024-10-06,True,False
2024-10-07,False,False
2024-10-08,True,True
2024-10-09,False,False
2024-10-10,False,False


The following image shows two tables comparing temperature data for a series of dates in October 2024:

- **Left Table**: Contains the minimum (TMIN) and maximum (TMAX) temperatures for each date, with some values marked as "NaN" (Not a Number), indicating missing data. This represents the original data.
- **Right Table**: A corresponding boolean table where "True" indicates the presence of "NaN" in the left table for the respective date and temperature type, and "False" indicates the absence of "NaN". This represents the output of applying `isna()` or `isnull()` to the DataFrame, where each "True" value corresponds to a "NaN" in the original data.

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/DSA_Lecture/f58baee17355cef9aaec59c85257e9b27b903b6a/images/figure_example_isnull.jpg" alt="Temperature Data Comparison" width="650">
</center>

In [3]:
# Count missing values for each column.
missing_values_count = climate_data.isnull().sum().to_frame('Missing Count')

# Calculate the percentage of missing values.
missing_values_count['Missing Percentage'] = (missing_values_count['Missing Count'] * 100 / len(climate_data)).round(2)

# Display the missing values count and percentage.
display(missing_values_count)

Unnamed: 0,Missing Count,Missing Percentage
TMIN,4,40.0
TMAX,2,20.0
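
The column totals above say how much is missing but not where. A minimal sketch of the row-wise view follows, rebuilding the standard-unit table inline so it runs without network access; `rows_with_missing` and `missing_per_row` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Standard-unit values copied from the table in Section 1.1.
climate_data = pd.DataFrame(
    {
        "TMIN": [49, 44, np.nan, np.nan, 62, np.nan, 47, np.nan, 46, 49],
        "TMAX": [72, 74, np.nan, 83, 89, 79, 72, np.nan, 78, 80],
    },
    index=pd.date_range("2024-10-01", periods=10, freq="D"),
)

# any(axis=1) collapses the boolean mask to one flag per row:
# True when at least one column is missing on that date.
rows_with_missing = climate_data[climate_data.isna().any(axis=1)]

# sum(axis=1) counts the missing cells on each date.
missing_per_row = climate_data.isna().sum(axis=1)

print(rows_with_missing)
print(missing_per_row[missing_per_row > 0])
```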


# 2. Common Methods for Handling Missing Data

## 2.1 Dropping Missing Values
### 2.1.1 Listwise Deletion

Listwise deletion removes an entire row from the dataset if it contains any missing values in any column.


<font color='Green'><b>Example:</b></font> This example demonstrates how to remove rows with any missing values from the `climate_data` DataFrame using the `dropna()` method. The original DataFrame is displayed alongside the modified DataFrame after listwise deletion, allowing for a clear comparison of the data before and after the operation.

<div class="alert alert-block alert-warning">
The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html" target="_blank">`pandas.DataFrame.dropna()`</a> method removes missing values from a DataFrame. Depending on its `axis` argument, it drops rows or columns that contain `NaN` (Not a Number) values.
</div>

In [4]:
# Remove rows with any missing values
listwise_deleted = climate_data.dropna(how='any')

print("Original Data:")
display(climate_data)
print("Data After Listwise Deletion:")
display(listwise_deleted)

Original Data:


Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,,
2024-10-04,,83.0
2024-10-05,62.0,89.0
2024-10-06,,79.0
2024-10-07,47.0,72.0
2024-10-08,,
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0


Data After Listwise Deletion:


Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-05,62.0,89.0
2024-10-07,47.0,72.0
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0
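
`dropna()` also accepts less aggressive settings than `how='any'`. The sketch below, which rebuilds the standard-unit table inline, shows two of them: `how='all'` drops a row only when every column is missing, and `subset` restricts the check to the named columns. The variable names are illustrative and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Standard-unit values copied from the table in Section 1.1.
climate_data = pd.DataFrame(
    {
        "TMIN": [49, 44, np.nan, np.nan, 62, np.nan, 47, np.nan, 46, 49],
        "TMAX": [72, 74, np.nan, 83, 89, 79, 72, np.nan, 78, 80],
    },
    index=pd.date_range("2024-10-01", periods=10, freq="D"),
)

# Drop a row only when both TMIN and TMAX are missing.
all_missing_dropped = climate_data.dropna(how="all")

# Drop a row only when TMAX is missing, whatever TMIN holds.
tmax_required = climate_data.dropna(subset=["TMAX"])

print(len(climate_data), len(all_missing_dropped), len(tmax_required))
```

With this table both variants keep 8 of the 10 rows, because TMAX is missing only on the two dates where TMIN is missing as well.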


### 2.1.2 Pairwise Deletion

Pairwise deletion uses all available data for each analysis without removing entire records. It only excludes specific missing values when calculating statistics.


<font color='Green'><b>Example:</b></font> This example demonstrates the calculation of mean temperatures using two different methods for handling missing values. First, pairwise deletion is used to compute the mean of `TMIN` and `TMAX`, ignoring NaN values with the `mean()` method. Next, listwise deletion is applied by removing any rows with missing values before calculating the means again. Both results are displayed for comparison.

<div class="alert alert-block alert-warning">
The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html" target="_blank">`pandas.DataFrame.mean`</a> method calculates the average of the values in a Series or DataFrame and ignores NaN values by default (`skipna=True`).
</div>

In [5]:
# Using Pairwise Deletion
# `skipna=True` (the default) excludes NaN values when computing the result.
mean_tmin_pairwise = climate_data['TMIN'].mean(skipna=True)
mean_tmax_pairwise = climate_data['TMAX'].mean(skipna=True)

# Print results using f-strings
print(f"Mean TMIN (using pairwise deletion): {mean_tmin_pairwise:.2f}")
print(f"Mean TMAX (using pairwise deletion): {mean_tmax_pairwise:.2f}")

# Using Listwise Deletion
listwise_deleted = climate_data.dropna()
mean_tmin_listwise = listwise_deleted['TMIN'].mean()
mean_tmax_listwise = listwise_deleted['TMAX'].mean()

# Print results using f-strings
print(f"\nMean TMIN (after listwise deletion): {mean_tmin_listwise:.2f}")
print(f"Mean TMAX (after listwise deletion): {mean_tmax_listwise:.2f}")

Mean TMIN (using pairwise deletion): 49.50
Mean TMAX (using pairwise deletion): 78.38

Mean TMIN (after listwise deletion): 49.50
Mean TMAX (after listwise deletion): 77.50
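
The two approaches differ in how many observations feed each statistic. A small sketch, using the standard-unit table rebuilt inline, makes the sample sizes explicit; `pairwise_counts` and `listwise_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

# Standard-unit values copied from the table in Section 1.1.
climate_data = pd.DataFrame(
    {
        "TMIN": [49, 44, np.nan, np.nan, 62, np.nan, 47, np.nan, 46, 49],
        "TMAX": [72, 74, np.nan, 83, 89, 79, 72, np.nan, 78, 80],
    },
    index=pd.date_range("2024-10-01", periods=10, freq="D"),
)

# Pairwise deletion: each column keeps its own non-missing values.
pairwise_counts = climate_data.count()

# Listwise deletion: every column is limited to the fully observed rows.
listwise_rows = len(climate_data.dropna())

print(pairwise_counts)
print(f"Rows left after listwise deletion: {listwise_rows}")
```

Pairwise deletion uses 6 TMIN and 8 TMAX values, while listwise deletion leaves 6 rows for both. The TMIN mean is identical under both methods because every incomplete row is missing TMIN, so the same six values are averaged either way. The TMAX mean changes because listwise deletion discards the two valid TMAX readings on October 4 and October 6.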


## 2.2 Imputation Methods

### 2.2.1 Constant Fill

Constant fill replaces missing values with a single specified value, such as zero or a summary statistic of the column (for example, its mean or median).

<font color='Green'><b>Example:</b></font> This example demonstrates how to fill missing values in the `climate_data` DataFrame using constant fill methods. First, we calculate the mean for each feature and use it to replace NaN values with the `fillna()` method. The mean values are printed for reference. Next, we perform a similar operation using the median, filling missing values with the median for each feature. The modified DataFrames after both constant fill operations are displayed for comparison.

<div class="alert alert-block alert-warning">
The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html" target="_blank">`pandas.DataFrame.fillna()`</a> method in pandas is used to replace missing values (NaNs) in a Series or DataFrame with specified values.
</div>

In [6]:
# Constant Fill with Mean
climate_data_constant_mean = climate_data.copy()

# Fill missing values in each column with that column's mean (rounded to 2 decimals).
for feat in climate_data.columns:
    mean_value = climate_data[feat].mean(skipna=True).round(2)
    print(f'Mean of {feat} = {mean_value:.2f}')
    climate_data_constant_mean[feat] = climate_data_constant_mean[feat].fillna(mean_value)

print("Data After Constant Fill (Mean):")
display(climate_data_constant_mean)

Mean of TMIN = 49.50
Mean of TMAX = 78.38
Data After Constant Fill (Mean):


Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,49.5,78.38
2024-10-04,49.5,83.0
2024-10-05,62.0,89.0
2024-10-06,49.5,79.0
2024-10-07,47.0,72.0
2024-10-08,49.5,78.38
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0


In [7]:
# Constant Fill with Median
climate_data_constant_median = climate_data.copy()

# Fill missing values in each column with that column's median.
for feat in climate_data.columns:
    median_value = climate_data[feat].median(skipna=True)
    climate_data_constant_median[feat] = climate_data_constant_median[feat].fillna(median_value)

print("Data After Constant Fill (Median):")
display(climate_data_constant_median)

Data After Constant Fill (Median):


Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,48.0,78.5
2024-10-04,48.0,83.0
2024-10-05,62.0,89.0
2024-10-06,48.0,79.0
2024-10-07,47.0,72.0
2024-10-08,48.0,78.5
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0
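
The per-column loops above can also be written without a loop. The sketch below assumes the same standard-unit table, rebuilt inline, and relies on `fillna()` accepting a Series whose labels match the column names, so each column is filled with its own statistic. `mean_filled` and `median_filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Standard-unit values copied from the table in Section 1.1.
climate_data = pd.DataFrame(
    {
        "TMIN": [49, 44, np.nan, np.nan, 62, np.nan, 47, np.nan, 46, 49],
        "TMAX": [72, 74, np.nan, 83, 89, 79, 72, np.nan, 78, 80],
    },
    index=pd.date_range("2024-10-01", periods=10, freq="D"),
)

# climate_data.mean() is a Series indexed by column name (TMIN, TMAX);
# fillna() aligns it with the columns and fills each one separately.
mean_filled = climate_data.fillna(climate_data.mean())
median_filled = climate_data.fillna(climate_data.median())

print(mean_filled.loc["2024-10-03"])
print(median_filled.loc["2024-10-03"])
```

Unlike the loop in the mean example, this version does not round the fill value, so the TMAX gaps receive 78.375 rather than 78.38.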


### 2.2.2 Backward Fill (bfill)

Backward fill replaces each missing value with the next valid observation in the same column.
  
<font color='Green'><b>Example:</b></font> This example demonstrates how to fill missing values in the `climate_data` DataFrame using the backward fill method (`bfill`). This method propagates the next valid observation backward to replace NaN values. The modified DataFrame after applying backward fill is displayed for review.

<div class="alert alert-block alert-warning">
The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.bfill.html" target="_blank">`pandas.DataFrame.bfill()`</a> method in pandas is used to fill missing values (NaNs) in a Series or DataFrame by using the next valid observation to fill the gap. This method can be particularly useful for time series data where you want to propagate the next available value backward to replace NaNs.
</div>

In [8]:
# Apply backward fill
climate_data_bfill = climate_data.bfill()

print("Data After Backward Fill:")
display(climate_data_bfill)

Data After Backward Fill:


Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,62.0,83.0
2024-10-04,62.0,83.0
2024-10-05,62.0,89.0
2024-10-06,47.0,79.0
2024-10-07,47.0,72.0
2024-10-08,46.0,78.0
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0
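
Two details of `bfill()` do not show up in the table above: a gap at the end of the data has no later value to borrow, and the `limit` argument caps how many consecutive NaNs are filled. The toy Series below is invented for illustration and is not part of the weather dataset.

```python
import numpy as np
import pandas as pd

# Toy series: a two-value gap in the middle and one missing value at the end.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Plain backward fill: the interior gap is filled with 4.0,
# but the trailing NaN stays because nothing follows it.
full = s.bfill()

# limit=1 fills at most one consecutive NaN per gap,
# namely the one directly before the valid value.
capped = s.bfill(limit=1)

print(full.tolist())
print(capped.tolist())
```

Capping the fill can be useful when a value should not be carried across long gaps, for example when several consecutive days are missing.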


### 2.2.3 Forward Fill (ffill)

Forward fill replaces each missing value with the last valid observation that precedes it in the same column.

<font color='Green'><b>Example:</b></font> This example demonstrates how to fill missing values in the `climate_data` DataFrame using the forward fill method (`ffill`). This method propagates the last valid observation forward to replace NaN values. The modified DataFrame after applying forward fill is displayed for review.

<div class="alert alert-block alert-warning">
The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html" target="_blank">`pandas.DataFrame.ffill()`</a> method in pandas is used to fill missing values (NaNs) in a Series or DataFrame by propagating the last valid observation forward. This method is particularly useful for time series data where you want to carry the previous value forward to replace NaNs.
</div>

In [9]:
# Apply Forward Fill
climate_data_ffill = climate_data.ffill(axis=0)

print("Data After Forward Fill:")
display(climate_data_ffill)

Data After Forward Fill:


Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,44.0,74.0
2024-10-04,44.0,83.0
2024-10-05,62.0,89.0
2024-10-06,62.0,79.0
2024-10-07,47.0,72.0
2024-10-08,47.0,72.0
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0


In [10]:
# Identify rows where the forward-filled data differs from the original data.
# NaN never compares equal to anything, so every originally missing cell counts as a difference.
different_rows = (climate_data_ffill != climate_data).any(axis=1)

# Use .loc[] with the boolean index to select the rows.
rows_with_differences = climate_data_ffill.loc[different_rows]

# Display the rows where the data differs.
display(rows_with_differences)

Unnamed: 0,TMIN,TMAX
2024-10-03,44.0,74.0
2024-10-04,44.0,83.0
2024-10-06,62.0,79.0
2024-10-08,47.0,72.0
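
Forward fill has the mirror-image limitation of backward fill: a missing value at the start of the data has no earlier observation to carry forward. One common workaround, sketched here on an invented toy Series, is to forward fill first and then backward fill whatever remains. This is an illustration under that assumption, not a step taken in the notebook.

```python
import numpy as np
import pandas as pd

# Toy series with a missing first value and one interior gap.
s = pd.Series([np.nan, 2.0, np.nan, 5.0])

# Forward fill alone leaves the leading NaN untouched.
forward_only = s.ffill()

# Chaining bfill() afterwards fills the leading gap from the first valid value.
forward_then_backward = s.ffill().bfill()

print(forward_only.tolist())
print(forward_then_backward.tolist())
```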


### 2.2.4 Interpolation

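Linear interpolation estimates each missing value by assuming a straight line between the nearest known data points on either side. With `method='linear'`, pandas ignores the index and treats the rows as equally spaced, which suits this dataset because the records are daily.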

<font color='Green'><b>Example:</b></font> This example demonstrates how to fill missing values in the `climate_data` DataFrame using linear interpolation. The modified DataFrame after applying linear interpolation is displayed for review.

<div class="alert alert-block alert-warning">
The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html" target="_blank">`pandas.DataFrame.interpolate()`</a> method in pandas is used to fill missing values (NaNs) in a Series or DataFrame by performing interpolation, which estimates values based on surrounding data points. This method can be configured with various interpolation techniques, such as linear, polynomial, or time-based methods, making it particularly useful for time series and numerical data where a smooth transition between values is desired.
</div>

In [11]:
# Interpolate missing values
climate_data_interpolated = climate_data.interpolate(method='linear')
print("Data after interpolation:")
display(climate_data_interpolated)

Data after interpolation:


Unnamed: 0,TMIN,TMAX
2024-10-01,49.0,72.0
2024-10-02,44.0,74.0
2024-10-03,50.0,78.5
2024-10-04,56.0,83.0
2024-10-05,62.0,89.0
2024-10-06,54.5,79.0
2024-10-07,47.0,72.0
2024-10-08,46.5,75.0
2024-10-09,46.0,78.0
2024-10-10,49.0,80.0
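
When the dates are not evenly spaced, the choice of `method` matters. The sketch below uses an invented three-point Series with a `DatetimeIndex` to contrast `method='linear'`, which treats rows as equally spaced, with `method='time'`, which weights by the actual time between observations. The values and dates are hypothetical.

```python
import numpy as np
import pandas as pd

# Irregular dates: one day before the gap, three days after it.
idx = pd.to_datetime(["2024-10-01", "2024-10-02", "2024-10-05"])
s = pd.Series([10.0, np.nan, 50.0], index=idx)

# 'linear' ignores the index, so the gap lands halfway between 10 and 50.
by_position = s.interpolate(method="linear")

# 'time' uses the DatetimeIndex: October 2 is 1/4 of the way from
# October 1 to October 5, so the estimate is 10 + 0.25 * (50 - 10).
by_time = s.interpolate(method="time")

print(by_position.iloc[1], by_time.iloc[1])
```

The positional estimate is 30.0 and the time-weighted estimate is 20.0. For the daily records in this notebook the two methods give the same result, since every row is one day apart.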
