<div style="text-align: left; background-color: black; color: white; padding: 10px;">
    <h1 style="font-size: 48px;">rAIn</h1>
    <p style="font-size: 16px; color: grey;">Australia Festivals Inc.</p>
</div>


<div style="background-color: #f0f0f0; border: 2px solid black; padding: 20px;">
    <strong style="text-decoration: underline;">ABOUT</strong>
    <p>rAIn is an all-in-one rainfall prediction application for locations across Australia. It draws on 10 years' worth of environmental observations from 49 weather stations. The data is loaded, processed for machine learning, and used to train and evaluate a model for each location.</p>
    <strong style="text-decoration: underline;">HOW TO USE rAIn</strong>
    <p>Navigate to the User Interface section towards the bottom of this page, choose a location, and receive a prediction of whether it will rain the next day, with model accuracy above 75%.</p>
</div>


<div id="loading-data" style="text-align: center; background-color: black; color: white; padding: 10px;">
    <h2>LOADING LIBRARIES & DATA</h2>
</div>


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import ipywidgets as widgets
from ipywidgets import Button, VBox, Image
from IPython.display import display
import requests
import warnings

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression


In [None]:
# suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# import and read the data set from GitHub 
# weatherAUS sourced from Kaggle
url = 'https://raw.githubusercontent.com/JDollWGU/rAInCapstone/main/weatherAUS.csv'

# read the data and store in a dataframe
df = pd.read_csv(url)
print(df.head())

<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> Raw data is displayed.
</div>


In [None]:
df.info()

<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> The index range and number of rows in the data set, the data type and non-null count of each column, and the memory usage are listed. Note that many columns have fewer non-null values than there are rows, so data is missing and must be handled before modeling.
</div>


<div id="data-processing" style="text-align: center; background-color: black; color: white; padding: 10px;">
    <h2>DATA PROCESSING</h2>
</div>

In [None]:
# begin to understand data relationships
# only include columns with numbers
numeric_df = df.select_dtypes(include='number')

# calculate the correlation matrix
corr_matrix = numeric_df.corr()

# create a heatmap 
plt.figure(figsize=(18, 18))
sns.heatmap(corr_matrix, annot=True, cmap='Blues', fmt=".2f", square=True)
plt.title('Correlation Matrix of Numeric Environmental Variables')
plt.show()

<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> A heatmap of the correlation matrix shows how the variables relate to each other and how strong each relationship is. A darker blue with a value closer to 1 indicates a strong positive relationship, and the lightest blue with a value closer to -1 indicates a strong negative relationship. Values near 0 indicate a weak linear relationship or none.
    <br><br>Columns with data types other than numbers have not been included, such as Date, Location, WindGustDir, WindDir9am, WindDir3pm, RainToday, and RainTomorrow.
</div>
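
<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>SKETCH:</strong> A minimal illustration of how to read correlation values, using a small made-up table rather than the weather data. The column names and numbers are assumptions chosen so the coefficients come out at the two extremes and at zero.
</div>

```python
import pandas as pd

# Toy data (hypothetical values, not taken from weatherAUS)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "rises_with_x": [2, 4, 6, 8, 10],   # increases in step with x
    "falls_with_x": [10, 8, 6, 4, 2],   # decreases in step with x
    "unrelated": [1, -1, 0, -1, 1],     # no linear relationship with x
})

# Pearson correlation of every column against every other column
corr = toy.corr()

# Correlation of each column with x
print(corr["x"].round(2))
```

Both the rising and the falling column are strongly related to x. Only the sign differs, which is why values near -1 should not be read as weak.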


In [None]:
# define an empty dictionary to store location data
location_tables = {}

# iterate over unique location names
for location_name in df['Location'].unique():
    # find data for the current location
    location_data = df[df['Location'] == location_name]

    # store in dictionary with location name as the key
    location_tables[location_name] = location_data

# assign a unique integer code to each location (in the order listed below, which is not alphabetical)
location_name_mapping = {
    'Albury': 1, 'BadgerysCreek': 2, 'Cobar': 3, 'CoffsHarbour': 4, 'Moree': 5,
    'Newcastle': 6, 'NorahHead': 7, 'NorfolkIsland': 8, 'Penrith': 9, 'Richmond': 10,
    'Sydney': 11, 'SydneyAirport': 12, 'WaggaWagga': 13, 'Williamtown': 14, 'Wollongong': 15, 'Canberra': 16,
    'Tuggeranong': 17, 'MountGinini': 18, 'Ballarat': 19, 'Bendigo': 20, 'Sale': 21,
    'MelbourneAirport': 22, 'Melbourne': 23, 'Mildura': 24, 'Nhil': 25, 'Portland': 26,
    'Watsonia': 27, 'Dartmoor': 28, 'Brisbane': 29, 'Cairns': 30, 'GoldCoast': 31,
    'Townsville': 32, 'Adelaide': 33, 'MountGambier': 34, 'Nuriootpa': 35, 'Woomera': 36,
    'Albany': 37, 'Witchcliffe': 38, 'PearceRAAF': 39, 'PerthAirport': 40, 'Perth': 41,
    'SalmonGums': 42, 'Walpole': 43, 'Hobart': 44, 'Launceston': 45, 'AliceSprings': 46, 'Darwin': 47,
    'Katherine': 48, 'Uluru': 49
}

# convert date column to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# split date to year/month/day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# delete original date column
df.drop(columns=['Date'], inplace=True)

# assign values to wind directions according to their degree value
wind_direction_mapping = {
    'N': 0, 'NNE': 22.5, 'NE': 45, 'ENE': 67.5, 'E': 90, 'ESE': 112.5,
    'SE': 135, 'SSE': 157.5, 'S': 180, 'SSW': 202.5, 'SW': 225,
    'WSW': 247.5, 'W': 270, 'WNW': 292.5, 'NW': 315, 'NNW': 337.5
}

# map wind values
df['WindGustDir'] = df['WindGustDir'].map(wind_direction_mapping)
df['WindDir9am'] = df['WindDir9am'].map(wind_direction_mapping)
df['WindDir3pm'] = df['WindDir3pm'].map(wind_direction_mapping)

# convert RainToday/RainTomorrow columns from Yes/No to 1/0
df['RainToday'] = df['RainToday'].map({'Yes': 1, 'No': 0})
df['RainTomorrow'] = df['RainTomorrow'].map({'Yes': 1, 'No': 0})

# map the location names in the dataframe to their numerical codes
df['Location'] = df['Location'].map(location_name_mapping)

# make date be first in columns
columns = ['Year', 'Month', 'Day'] + [col for col in df.columns if col not in ['Year', 'Month', 'Day']]
df = df[columns]

print(df.head())

<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> The non-numeric (object) columns in the raw data (Date, Location, WindGustDir, WindDir9am, WindDir3pm, RainToday, and RainTomorrow) are converted to numeric data types so that the model can use them. Date is split into Year, Month, and Day columns.
</div>
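
<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>SKETCH:</strong> A small illustration of the same mapping approach on a made-up table. The rows are assumptions, and only three compass points are mapped to keep it short. It shows that values missing from the raw data stay missing after mapping, so they still need handling later.
</div>

```python
import pandas as pd

# Toy rows (hypothetical values); None stands for a missing observation
toy = pd.DataFrame({
    "WindGustDir": ["N", "SSE", "W", None],
    "RainToday": ["Yes", "No", "No", None],
})

# Subset of the compass-point-to-degrees mapping, for illustration only
direction_degrees = {"N": 0, "SSE": 157.5, "W": 270}

# Series.map replaces each value with its dictionary entry;
# values without an entry (including missing ones) become NaN
toy["WindGustDir"] = toy["WindGustDir"].map(direction_degrees)
toy["RainToday"] = toy["RainToday"].map({"Yes": 1, "No": 0})

print(toy)
print(toy.isna().sum())
```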


In [None]:
# create a table for each location and its given variables
modified_location_tables = {}

# group by location
grouped = df.groupby('Location')

# handle missing data
for location, group in grouped:
    # drop columns with more than 95% missing values for the current location
    missing_percentage = group.isnull().mean()
    columns_to_drop = missing_percentage[missing_percentage > 0.95].index
    location_data = group.drop(columns=columns_to_drop)
    
    # delete duplicate rows before imputing, since filling gaps with the median
    # can make distinct rows look identical
    location_data = location_data.drop_duplicates()

    # replace missing values with the median of that column at that location
    # exclude Year, Month, Day, RainToday, and RainTomorrow as the averages shouldn't be taken
    numerical_columns = location_data.select_dtypes(include='number').columns
    for column in numerical_columns:
        if column not in ['Year', 'Month', 'Day', 'RainToday', 'RainTomorrow']:
            median_value = location_data[column].median()
            # assign the result back; an in-place fillna on a column selection may not update the table
            location_data[column] = location_data[column].fillna(median_value)

    # delete any remaining rows with missing data
    location_data = location_data.dropna()

    # store new table in the dictionary with location as the key
    modified_location_tables[location] = location_data

# calculate total null values
total_null_values = sum(data.isnull().sum().sum() for data in modified_location_tables.values())
print("Total null values:", total_null_values)

# get the location name
def get_location_name(location_key):
    return list(location_name_mapping.keys())[list(location_name_mapping.values()).index(location_key)]

# print each location in modified_location_tables
for location, data in modified_location_tables.items():
    # use get_location_name
    location_name = get_location_name(location)
    print(f"Location: {location_name}")
    print(data.head())
    print("-" * 50)

<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> Each modified location table displays the remaining valid rows and columns, along with the transformed data points. The total null count printed at the top confirms that no missing values remain in the modified data set.
</div>
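
<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>SKETCH:</strong> A compact illustration of per-location median imputation on a made-up table. It uses <code>groupby(...).transform("median")</code> instead of an explicit loop, which is an alternative formulation and not the code used in this notebook. The values are assumptions.
</div>

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two locations, one gap in each
toy = pd.DataFrame({
    "Location": [1, 1, 1, 2, 2, 2],
    "Humidity3pm": [40.0, np.nan, 60.0, 80.0, 90.0, np.nan],
})

# Median of each location's observed values, repeated on every row of that location
location_median = toy.groupby("Location")["Humidity3pm"].transform("median")

# Fill each gap with the median of its own location, not the overall median
toy["Humidity3pm"] = toy["Humidity3pm"].fillna(location_median)

print(toy)
```

The gap at location 1 is filled with 50.0 and the gap at location 2 with 85.0, so a dry inland station is never filled with values typical of a humid coastal one.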


In [None]:
# initialize an empty list to store the rainfall average for each location
average_rainfall_per_location = []

# iterate through each modified table
for location, modified_table in modified_location_tables.items():
    # group data by year and calculate average rainfall
    average_rainfall_by_year = modified_table.groupby('Year')['Rainfall'].mean().reset_index()
    
    # get the corresponding location name
    location_name = get_location_name(location)
    
    # add location information to dataframe
    average_rainfall_by_year['Location'] = location_name
    
    # add the average to each location table
    average_rainfall_per_location.append(average_rainfall_by_year)

# form a single dataframe
average_rainfall_per_location = pd.concat(average_rainfall_per_location)

# reshape the data to have years as rows and locations as columns
average_rainfall_per_location_pivot = average_rainfall_per_location.pivot(index='Year', columns='Location', values='Rainfall')

# calculate average rainfall across all locations for each year
overall_average_rainfall = average_rainfall_per_location.groupby('Year')['Rainfall'].mean().reset_index()

# plot the line plot
plt.figure(figsize=(16, 10))
colors = sns.color_palette('Blues', n_colors=len(average_rainfall_per_location_pivot.columns))
for i, location in enumerate(average_rainfall_per_location_pivot.columns):
    sns.lineplot(data=average_rainfall_per_location_pivot[location], label=location, color=colors[i])
sns.lineplot(data=overall_average_rainfall, x='Year', y='Rainfall', color='red', label='Overall Average', linewidth=3) 
plt.title('Average Rainfall per Location per Year')
plt.xlabel('Year')
plt.ylabel('Average Rainfall (mm)')
plt.legend(title='Location', loc='upper left', bbox_to_anchor=(1, 1))
plt.tight_layout()
plt.show()


<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> A line plot showing the average rainfall (in mm) per location per year. The red line shows the average rainfall across all locations for each year.
</div>


<div id="modeling" style="text-align: center; background-color: black; color: white; padding: 10px;">
    <h2>MODELING & EVALUATION</h2>
</div>

In [None]:
# initiate modeling
print("Data Processing...")

# initialize an empty dictionary to store models for each location
logistic_models = {}

# initialize an empty list to store accuracy
accuracy_data = []

# iterate through modified tables for each location
for location, modified_table in modified_location_tables.items():
    # STEP 1 - split the data
    # exclude raintomorrow as it's the variable in question
    X = modified_table.drop(columns=['RainTomorrow'])
    y = modified_table['RainTomorrow']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # STEP 2 - train the model
    model = LogisticRegression(penalty='l2', solver='saga', max_iter=8500)
    model.fit(X_train, y_train)

    # STEP 3 - evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    accuracy_formatted = f"{accuracy:.2f}%"
    
    # get location name
    location_name = get_location_name(location)
    
    # store model and accuracy with location
    logistic_models[location] = (model, accuracy)
    
    # add accuracy to the accuracy data list
    accuracy_data.append({'Location': location_name, 'Accuracy': accuracy})
    

# finish loop
print("Data Processing Complete!")

# create dataframe from the accuracy data list
accuracy_df = pd.DataFrame(accuracy_data)

# sort by descending order
accuracy_df = accuracy_df.sort_values(by='Accuracy', ascending=False)

# calculate average accuracy
average_accuracy = np.mean(accuracy_df['Accuracy'])
print(f"Average Accuracy: {average_accuracy:.2f}%")

# plot the accuracy scores
plt.figure(figsize=(20, 12))
sns.barplot(x='Accuracy', y='Location', data=accuracy_df, palette='Blues')
plt.xlabel('Accuracy (%)')
plt.ylabel('Location')
plt.title('Model Accuracies by Location')
plt.xlim(0, 100)  # Set the limit of the x-axis to 0-100
plt.grid(axis='x')

# add labels to each bar
for index, value in enumerate(accuracy_df['Accuracy']):
    plt.text(value + 1, index, f"{value:.2f}%", color='black', va='center')

# add average accuracy line
plt.axvline(average_accuracy, color='red', linestyle='--', linewidth=2)
plt.text(average_accuracy + 1, len(accuracy_df) - 1, f'Average Accuracy: {average_accuracy:.2f}%', color='red', va='center')

plt.show()


<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> Model training may take several minutes. The bar plot shows the accuracy of the logistic regression model for each location on its held-out test set (20% of that location's rows). The dashed red line marks the average accuracy across all locations.
</div>


In [None]:
# iterate through each location
for location, model_accuracy_tuple in logistic_models.items():
    # get location name
    location_name = get_location_name(location)
    
    # split the data for each specific location
    modified_table = modified_location_tables[location]
    X = modified_table.drop(columns=['RainTomorrow']) 
    y = modified_table['RainTomorrow']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # get the model from the tuple
    model = model_accuracy_tuple[0]
    
    # predict the test set results
    # predict() already returns class labels (0 or 1), so no thresholding is needed
    y_pred = model.predict(X_test)
    
    # print reports
    print(f"Classification Report for {location_name}:")
    print(classification_report(y_test, y_pred))


<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> Classification reports for each location.
    <ul>
        <li><strong>Precision:</strong> Ratio of correctly predicted rain occurrences to the total predicted rain occurrences.</li>
        <li><strong>Recall:</strong> Ratio of correctly predicted rain occurrences to all actual rain occurrences.</li>
        <li><strong>F1-score:</strong> The harmonic mean of precision and recall.</li>
        <li><strong>Accuracy:</strong> The overall accuracy of the model's predictions.</li>
        <li><strong>Support:</strong> The number of instances where it didn't rain and the number of instances where it did.</li>
    </ul>
</div>
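
<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>SKETCH:</strong> A worked illustration of precision, recall, and F1-score on ten made-up labels (1 = rain, 0 = no rain). The labels are assumptions chosen to keep the arithmetic easy to follow. The last lines check that F1 matches the harmonic mean 2PR / (P + R) and differs from the simple average.
</div>

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 rainy days and 6 dry days
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# 2 rainy days predicted correctly, 1 dry day wrongly predicted as rain,
# 2 rainy days missed
precision = precision_score(y_true, y_pred)  # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)        # 2 / (2 + 2)
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall, not the arithmetic mean
harmonic_mean = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"harmonic mean={harmonic_mean:.3f} arithmetic mean={arithmetic_mean:.3f}")
```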

In [None]:
# get the location names
def get_location_name(location_key):
    return list(location_name_mapping.keys())[list(location_name_mapping.values()).index(location_key)]

# iterate through each location
for location_key, (model, _) in logistic_models.items():
    # get the modified tables
    modified_table = modified_location_tables[location_key]
    
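    # recreate the held-out test split used during training (same test_size and random_state)
    # so the matrix is not computed on rows the model was trained on
    X = modified_table.drop(columns=['RainTomorrow'])
    y = modified_table['RainTomorrow']
    _, X_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # predict using the model; predict() already returns class labels (0 or 1)
    y_pred = model.predict(X_test)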
    
    # calculate the confusion matrices
    cf_matrix = confusion_matrix(y_test, y_pred)
    
    # plot matrices
    plt.figure(figsize=(12, 8))
    sns.heatmap(cf_matrix / np.sum(cf_matrix), annot=True, cmap='Blues', annot_kws={'size': 15})
    location_name = get_location_name(location_key)
    plt.title(f'Confusion Matrix for {location_name}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()


<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
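    <strong>ABOVE:</strong> Confusion matrices for each location. Each cell is shown as a proportion of all predictions for that location. Rows are the true labels and columns are the predicted labels, with "no rain" (0) listed first.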
    <ul>
        <li><strong>True Negatives (TN):</strong> Top Left. The model correctly predicted no rain when it didn't rain.</li>
        <li><strong>False Positives (FP):</strong> Top Right. The model incorrectly predicted rain when it didn't rain.</li>
        <li><strong>False Negatives (FN):</strong> Bottom Left. The model incorrectly predicted no rain when it actually rained.</li>
        <li><strong>True Positives (TP):</strong> Bottom Right. The model correctly predicted rain when it actually rained.</li>
    </ul>
</div>
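
<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>SKETCH:</strong> A small check of the cell layout returned by scikit-learn's <code>confusion_matrix</code>, using ten made-up labels (1 = rain, 0 = no rain). Rows follow the true label and columns the predicted label, with label 0 first. The labels are assumptions for illustration.
</div>

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 rainy days and 6 dry days
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Rows are true labels, columns are predicted labels, label 0 comes first
cm = confusion_matrix(y_true, y_pred)

# Flattening row by row gives: top left, top right, bottom left, bottom right
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Dividing by the total gives the proportions shown in the heatmaps
print(cm / cm.sum())
```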


<div id="user" style="text-align: center; background-color: black; color: white; padding: 10px;">
    <h2>USER INTERFACE</h2>
</div>

In [None]:
# get location names
def get_location_name(location_key):
    return list(location_name_mapping.keys())[list(location_name_mapping.values()).index(location_key)]

# make the dropdown with location names
location_dropdown = widgets.Dropdown(
    options=[(get_location_name(location), location) for location in modified_location_tables.keys()],
    description='Location:'
)

# predict button
predict_button = Button(description='Predict', button_style='info')

output_label = widgets.HTML(value='', layout={'margin': '20px 0'})

# apply predict weather function
def predict_weather(b):
    location = location_dropdown.value
    model, accuracy = logistic_models[location]
    latest_data = modified_location_tables[location].iloc[-1].drop('RainTomorrow').to_frame().T
    prediction = model.predict(latest_data)[0]
    rain_prediction = "Yes" if prediction == 1 else "No"
    
    output_label.value = f'<div style="font-size: 16px;"><b>Will it rain tomorrow?</b> {rain_prediction}<br><b>Model Accuracy:</b> {accuracy:.2f}%</div>'
    
    # if yes, rain photo
    if rain_prediction == "Yes":
        image_url = "https://static.vecteezy.com/system/resources/thumbnails/023/155/507/small/drop-of-water-on-a-white-background-rain-drop-3d-illustration-vector.jpg"  # URL of the rain image
    # if no, sun photo
    else:
        image_url = "https://t4.ftcdn.net/jpg/02/81/12/37/360_F_281123779_WSopbvuFrjfZs9EBX1jLcaEl3m1OnP29.jpg"  # URL of the sun image
        
    # get the image
    image_content = requests.get(image_url).content

    # update widget
    image_widget.value = image_content

# attach function to button
predict_button.on_click(predict_weather)

# default image
default_image_url = "https://img.freepik.com/free-vector/doodle-australia-map_1034-834.jpg?size=338&ext=jpg&ga=GA1.1.2082370165.1716940800&semt=ais_user"
default_image_content = requests.get(default_image_url).content
image_widget = Image(value=default_image_content, width="200px", height="200px")

# output widget
display(VBox([image_widget, location_dropdown, predict_button, output_label]))


<div style="background-color: #f0f0f0; border: 2px solid black; padding: 10px;">
    <strong>ABOVE:</strong> To use rAIn, select a location from the dropdown menu and click 'Predict'. A yes or no answer to whether it will rain the next day appears, along with the accuracy of that location's model.
</div>


<div id="citations" style="text-align: center; background-color: black; color: white; padding: 10px;">
    <h2>CITATIONS</h2>
</div>

<pre>
    Australia map. (n.d.). Freepik. Retrieved June 2, 2024, from <a href="https://img.freepik.com/free-vector/doodle-australia-map_1034-834.jpg?size=338&ext=jpg&ga=GA1.1.2082370165.1716940800&semt=ais_user">https://img.freepik.com/free-vector/doodle-australia-map_1034-834.jpg?size=338&ext=jpg&ga=GA1.1.2082370165.1716940800&semt=ais_user</a><br>
    Rain drop. (n.d.). Vecteezy. Retrieved June 2, 2024, from <a href="https://static.vecteezy.com/system/resources/thumbnails/023/155/507/small/drop-of-water-on-a-white-background-rain-drop-3d-illustration-vector.jpg">https://static.vecteezy.com/system/resources/thumbnails/023/155/507/small/drop-of-water-on-a-white-background-rain-drop-3d-illustration-vector.jpg</a><br>
    Sun. (n.d.). t4. Retrieved June 2, 2024, from <a href="https://t4.ftcdn.net/jpg/02/81/12/37/360_F_281123779_WSopbvuFrjfZs9EBX1jLcaEl3m1OnP29.jpg">https://t4.ftcdn.net/jpg/02/81/12/37/360_F_281123779_WSopbvuFrjfZs9EBX1jLcaEl3m1OnP29.jpg</a><br>
    Young, J., & Adamyoung. (n.d.). <em>Rain in Australia: Predict next-day rain in Australia</em> [Data set]. Kaggle. Retrieved June 2, 2024, from <a href="https://www.kaggle.com/jsphyg/weather-dataset-rattle-package">https://www.kaggle.com/jsphyg/weather-dataset-rattle-package</a><br>
</pre>