<div style="display: flex; justify-content: space-between; align-items: center; margin-bottom: 40px; margin-top: 0;">
    <div style="flex: 0 0 auto; margin-left: 0; margin-bottom: 0; margin-top: 0;">
        <img src="./pics/UCSD Logo.png" alt="UCSD Logo" style="width: 179px; margin-bottom: 0px; margin-top: 20px;">
    </div>
    <div style="flex: 0 0 auto; margin-left: auto; margin-bottom: 0; margin-top: 20px;">
        <img src="./pics/ndp-logo.png" alt="NDP Logo" style="width: 200px; margin-bottom: 0px;">
    </div>
    <div style="flex: 0 0 auto; margin-left: auto; margin-bottom: 0; margin-top: 20px;">
        <img src="./pics/sdsc-logo.png" alt="SDSC Logo" style="width: 200px; margin-bottom: 0px;">
    </div>
</div>

<h1 style="text-align: center; font-size: 48px; margin-top: 0;">Onboarding Module</h1>

This module is designed to provide an onboarding experience and introduce you to working with NDP Modules. The problem and dataset presented here align with research areas explored by multiple collaborators of the National Data Platform, including the [WORDS team](https://words.sdsc.edu/) at the San Diego Supercomputer Center.

The problem and data used in this demo module were originally developed as part of the [Big Data Specialization](https://www.coursera.org/specializations/big-data#courses) offered by UC San Diego on Coursera.

## The Data

The file `daily_weather.csv` is a comma-separated file that contains weather data. This data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected over three years, from September 2011 to September 2014, so that different seasons and weather conditions are represented.

Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements in the morning, plus one measurement, namely relative humidity, in the afternoon.

Each row in `daily_weather.csv` captures weather data for a separate day. Each row, or sample, consists of the following variables:

| Variable                  | Description                                                | Unit of Measure           |
|---------------------------|------------------------------------------------------------|---------------------------|
| number                    | Unique number for each row                                 | NA                        |
| air_pressure_9am          | Air pressure averaged over a period from 8:55am to 9:04am | hectopascals              |
| air_temp_9am             | Air temperature averaged over a period from 8:55am to 9:04am | degrees Fahrenheit        |
| avg_wind_direction_9am    | Wind direction averaged over a period from 8:55am to 9:04am | degrees, 0 = North, increasing clockwise |
| avg_wind_speed_9am       | Wind speed averaged over a period from 8:55am to 9:04am    | miles per hour            |
| max_wind_direction_9am    | Wind gust direction averaged over a period from 8:55am to 9:04am | degrees, 0 = North, increasing clockwise |
| max_wind_speed_9am       | Wind gust speed averaged over a period from 8:55am to 9:04am | miles per hour            |
| rain_accumulation_9am     | Amount of rain accumulated in the 24 hours prior to 9am   | millimeters               |
| rain_duration_9am         | Amount of time rain was recorded in the 24 hours prior to 9am | seconds                   |
| relative_humidity_9am     | Relative humidity averaged over a period from 8:55am to 9:04am | percent               |
| relative_humidity_3pm     | Relative humidity averaged over a period from 2:55pm to 3:04pm | percent               |

### The Task

In this onboarding module, our goal is to predict whether a day is **humid** or not. The idea is to use the morning weather values as features and to derive the label from the afternoon measurement of relative humidity. We define a **humid day** as one where relative humidity at 3 PM is at least 25%.
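
To make the labeling rule concrete, here is a minimal sketch that applies the 25% cutoff to a handful of made-up afternoon readings. The values and the name `HUMIDITY_THRESHOLD` are illustrative assumptions and do not come from the dataset.

```python
import pandas as pd

HUMIDITY_THRESHOLD = 25  # percent, taken from the task definition

# Hypothetical afternoon readings chosen to bracket the threshold
sample = pd.Series([12.4, 24.9, 25.0, 61.3], name="relative_humidity_3pm")

# 1 = humid day, 0 = not humid; ">=" matches "at least 25%"
is_humid = (sample >= HUMIDITY_THRESHOLD).astype(int)
print(is_humid.tolist())  # [0, 0, 1, 1]
```

A reading of exactly 25.0 counts as humid, which is why the comparison is `>=` and not `>`.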

Before diving into model training, let's first explore our dataset.

## Exploratory Analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
# We load the data into a pandas df and drop the column 'number' since we don't have a use for it
df = pd.read_csv("data/daily_weather.csv").drop(columns=['number'], errors='ignore') 

In [None]:
# Let's look at our data, including a few rows and the number of columns and rows
df

In [None]:
# Let's display the summary statistics for each of the columns
df.describe()
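
The `count` row of `df.describe()` only hints at missing values, since it reports non-null entries per column. A direct count is often easier to read. The sketch below uses a small hypothetical frame named `toy` in place of `df`; the same call should work on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the real frame is loaded from daily_weather.csv
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.1],
    "avg_wind_speed_9am": [2.1, 4.6, np.nan],
    "relative_humidity_3pm": [36.2, 19.4, np.nan],
})

# Number of missing entries in each column
missing = toy.isna().sum()
print(missing)
```

Running `df.isna().sum()` on the loaded data would show which columns need imputation later on.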

In [None]:
# Now, let's look at the distribution of each of the columns

num_cols = len(df.columns)
num_rows = (num_cols // 3) + (num_cols % 3 > 0) 

fig, axes = plt.subplots(num_rows, 3, figsize=(15, 4 * num_rows))
axes = axes.flatten()  # Flatten to iterate easily

# Create a histogram for each column
for i, col in enumerate(df.columns):
    sns.histplot(df[col].dropna(), kde=True, ax=axes[i], bins=30)
    axes[i].set_title(col)
    axes[i].set_xlabel("")
    
# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

## Predicting whether a day is humid or not

Now that we have taken a look at our data, we will train a [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) to predict whether a day can be considered humid or not, based on morning data.

In [None]:
# Drop days with a missing target; otherwise the comparison below would silently label them 0
df = df.dropna(subset=['relative_humidity_3pm'])

# We drop relative humidity at 9am since it is highly correlated with our target column
X = df.drop(columns=['relative_humidity_3pm', 'relative_humidity_9am'])

# We set a threshold of 25%. Any day with 3pm humidity of at least 25% is considered humid
threshold = 25
y = (df["relative_humidity_3pm"] >= threshold).astype(int)

In [None]:
# We impute missing feature values with each column's median
X = X.fillna(X.median())
# y needs no imputation: the threshold comparison always yields 0 or 1, never NaN

In [None]:
# We use a test size of 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
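
The imputation step above computes medians over the full dataset, including rows that later land in the test set. One way to keep test data out of preprocessing is to wrap the imputer and the classifier in a scikit-learn `Pipeline`, so the medians are learned from the training split only. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `pipe`); it is not the module's own workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the weather dataset
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = (X_demo["a"] + X_demo["b"] > 0).astype(int)
X_demo.iloc[::10, 0] = np.nan  # blank out every 10th value in column "a"

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The imputer's medians are fitted on X_tr only, then reused on X_te
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=42),
)
pipe.fit(X_tr, y_tr)
print(f"Test accuracy: {pipe.score(X_te, y_te):.2f}")
```

On a dataset with few missing values the difference is usually small, but the pipeline form also makes cross-validation safer.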

### Model Evaluation

In [None]:
# First, we make predictions on the test set
y_pred = clf.predict(X_test)

# Then we evaluate the model by calculating its accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
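
Accuracy is easier to interpret next to a trivial baseline, because a model that always predicts the majority class can already score well when the classes are imbalanced. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the 70/30 split is an assumption for illustration, not the class balance of the weather data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 70 humid days (1) and 30 non-humid days (0)
y_true_demo = np.array([1] * 70 + [0] * 30)
X_unused = np.zeros((100, 1))  # DummyClassifier ignores the feature values

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_unused, y_true_demo)

baseline_acc = accuracy_score(y_true_demo, baseline.predict(X_unused))
print(f"Baseline accuracy: {baseline_acc:.2f}")  # Baseline accuracy: 0.70
```

Fitting the same baseline on `X_train`, `y_train` and scoring it on the test split gives a floor that the decision tree should clearly beat.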

In [None]:
# Let's look at our predictions in more detail with a confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(6, 5))
cax = ax.matshow(cm, cmap="Blues")

plt.colorbar(cax)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, str(cm[i, j]), ha='center', va='center', color='black', fontsize=12)
        
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")

plt.show()
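
The four cells of a binary confusion matrix can be turned into precision and recall for the humid class. The sketch below works on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). It relies on scikit-learn's layout, in which rows are true labels and columns are predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = humid, 0 = not humid)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the days predicted humid, how many were humid
recall = tp / (tp + fn)     # of the humid days, how many were found
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.80, recall=0.67
```

The same unpacking applied to `cm` from the cell above would give these metrics for the trained classifier.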

### Can you build a better model? 

Use the cell below to improve upon our current benchmark. You can experiment with different model types, adjust hyperparameters, or engineer new features to enhance performance.
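
As one possible starting point, the sketch below tunes a `RandomForestClassifier` with `GridSearchCV`. It runs on synthetic data from `make_classification`, and the parameter grid is an arbitrary assumption; to try it on the weather data, swap in `X_train`, `X_test`, `y_train`, and `y_test` from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical), not the weather dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Small illustrative grid; the values are arbitrary starting points
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 5-fold cross-validation on the training split selects the best combination
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

Because the search only sees the training split, the test accuracy remains an honest estimate of how the chosen model generalizes.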

In [None]:
# Your code here