# Binary Confusion Matrix: Predicting Next-Day Rain in Australia

Author: Data-Git-Hub <br>
GitHub Project Repository Link: https://github.com/Data-Git-Hub/applied-ml-data-git-hub <br>
Kaggle Link: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package <br>
17 March 2025 <br>

### Introduction
In this project, we will construct a Binary Confusion Matrix to evaluate the performance of a classification model using the weatherAUS.csv dataset from Kaggle. The dataset contains weather-related variables collected in Australia. For this analysis, we will focus solely on the last two columns of the dataset, RainToday and RainTomorrow, which this notebook treats as the actual and predicted classifications of a binary event (rain or no rain). We also limit the sample to the first 1,000 complete records. <br>

A Confusion Matrix is a fundamental tool used in classification problems to summarize model performance by comparing predicted and actual outcomes. It consists of four key components: <br>

True Positives (TP): Instances where the model correctly predicts a positive class (rain predicted and it actually rained). <br>
False Positives (FP): Instances where the model incorrectly predicts a positive class (rain predicted, but it did not rain). <br>
False Negatives (FN): Cases where the model fails to predict the positive class (rain not predicted, but it actually rained). <br>
True Negatives (TN): Cases where the model correctly predicts a negative class (rain not predicted, and it did not rain). <br>

In this context, the Actual Positives (P) represent the cases where rain actually occurred, while the Actual Negatives (N) represent cases where it did not rain. By analyzing the distribution of these values in the confusion matrix, we can assess the accuracy and reliability of the predictive model. <br>
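
To make the four components concrete, here is a minimal sketch that counts them with scikit-learn's `confusion_matrix`. The labels `y_actual` and `y_predicted` are hypothetical toy values (1 = rain, 0 = no rain) and are not taken from weatherAUS.csv.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (1 = rain, 0 = no rain); not taken from weatherAUS.csv
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted, labels=[0, 1]).ravel()

actual_positives = tp + fn  # P: days on which it actually rained
actual_negatives = tn + fp  # N: days on which it did not rain

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"P={actual_positives}, N={actual_negatives}")
```

Passing `labels=[0, 1]` explicitly fixes the row and column order, so the unpacking into `tn, fp, fn, tp` does not depend on which classes happen to appear in the data.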

This project will involve: <br>

- Extracting the relevant data columns from the dataset. <br>
- Constructing a Binary Confusion Matrix based on the actual and predicted values. <br>
- Analyzing model performance using standard evaluation metrics derived from the confusion matrix. <br>

By the end of this analysis, we will have a clearer understanding of how well the model distinguishes between rainy and non-rainy days, providing insight into its effectiveness for weather prediction. <br>

### Imports
Python libraries are collections of pre-written code that provide specific functionalities, making programming more efficient and reducing the need to write code from scratch. These libraries cover a wide range of applications, including data analysis, machine learning, web development, and automation. Some libraries, such as os, sys, math, json, and datetime, come built-in with Python as part of its standard library, providing essential functions for file handling, system operations, mathematical computations, and data serialization. Other popular third-party libraries, like pandas, numpy, matplotlib, seaborn, and scikit-learn, must be installed separately and are widely used in data science and machine learning. The extensive availability of libraries in Python's ecosystem makes it a versatile and powerful programming language for various domains. <br>

Pandas is a powerful data manipulation and analysis library that provides flexible data structures, such as DataFrames and Series. It is widely used for handling structured datasets, enabling easy data cleaning, transformation, and aggregation. Pandas is essential for data preprocessing in machine learning and statistical analysis. <br>
https://pandas.pydata.org/docs/ <br>

NumPy (Numerical Python) is a foundational library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of mathematical functions to operate on these arrays efficiently. NumPy is a key component in scientific computing and machine learning. <br>
https://numpy.org/doc/stable/ <br>

Matplotlib is a widely used data visualization library that allows users to create static, animated, and interactive plots. It provides extensive tools for generating various chart types, including line plots, scatter plots, histograms, and bar charts, making it a critical library for exploratory data analysis. <br>
https://matplotlib.org/stable/contents.html <br>

Seaborn is a statistical data visualization library built on top of Matplotlib, designed for creating visually appealing and informative plots. It simplifies complex visualizations, such as heatmaps, violin plots, and pair plots, making it easier to identify patterns and relationships in datasets. <br>
https://seaborn.pydata.org/ <br>

Scikit-learn provides a variety of tools for machine learning, including data preprocessing, model selection, and evaluation. It contains essential functions for building predictive models and analyzing datasets. <br>
sklearn.metrics: This module provides various performance metrics for evaluating machine learning models. <br>
https://scikit-learn.org/stable/modules/model_evaluation.html<br>

In [41]:
#Imports

# Data handling and manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning utilities
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Display settings
%matplotlib inline

file_path = "C:/Projects/applied-ml-data-git-hub/D3/Data/weatherAUS.csv"
df = pd.read_csv(file_path)
# Select only the last two columns
df = df.iloc[:, -2:]  # Assuming last two columns contain actual and predicted values


### Project Outline
This project systematically analyzes the weatherAUS dataset and constructs a Binary Confusion Matrix to evaluate the performance of a predictive model. By following a structured workflow, we ensure a thorough understanding of the data and an accurate assessment of the model's ability to classify rainy and non-rainy days. <br>

#### Section 1: Load and Explore the Data (Base Model)
The first step is to load the weatherAUS dataset from its file path (D3/Data/weatherAUS.csv) and examine its structure. This includes: <br>

- Reading the dataset using pandas. <br>
- Inspecting column names, data types, and the number of missing values. <br>
- Displaying a basic statistical summary to understand data distribution. <br>

This foundational step ensures that we understand the dataset before building our model. <br>

#### Section 2: Data Preprocessing (Base Model)
Since we are only using the last two columns, we will: <br>

- Extract the relevant columns representing actual and predicted values. <br>
- Handle any missing or inconsistent data entries. <br>
- Convert categorical values (if necessary) into numerical representations. <br>

Cleaning the dataset ensures that the confusion matrix calculation is accurate. <br>

#### Section 3: Constructing the Binary Confusion Matrix (Base Model)
Using sklearn.metrics, we will generate a Binary Confusion Matrix that summarizes the model’s performance. This matrix will show: <br>

- True Positives (TP): Correctly predicted rainy days. <br>
- False Positives (FP): Incorrectly predicted rainy days. <br>
- False Negatives (FN): Missed rainy days. <br>
- True Negatives (TN): Correctly predicted non-rainy days. <br>

This confusion matrix will be visualized using Seaborn’s heatmap, making it easier to interpret model performance. <br>
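
As a rough illustration of the heatmap step, the sketch below draws a 2x2 matrix with `seaborn.heatmap`. The counts in `cm` and the names in `class_names` are hypothetical placeholders, not results from this dataset, and the `Agg` backend line is only there so the snippet also runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical counts in scikit-learn's layout: [[TN, FP], [FN, TP]]
cm = np.array([[3, 1],
               [1, 3]])
class_names = ["No rain", "Rain"]  # assumed display names for classes 0 and 1

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=class_names, yticklabels=class_names, ax=ax)
ax.set_xlabel("Predicted")  # columns are predicted classes
ax.set_ylabel("Actual")     # rows are actual classes
ax.set_title("Binary confusion matrix (toy data)")
fig.tight_layout()
```

Rows are actual classes and columns are predicted classes, which matches the orientation scikit-learn uses for its confusion matrix.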

#### Section 4: Model Evaluation Metrics
To assess the effectiveness of our model, we will calculate key performance metrics derived from the confusion matrix: <br>

- Base Model Evaluation <br>
    Accuracy = (TP + TN) / (TP + TN + FP + FN) <br>
    Precision = TP / (TP + FP) (How many predicted rainy days were correct?) <br>
    Recall (Sensitivity) = TP / (TP + FN) (How well did the model capture actual rainy days?) <br>
    F1-score = 2 × (Precision × Recall) / (Precision + Recall), the harmonic mean of Precision and Recall. <br>
    This baseline performance provides a reference point for improvement. <br>

- Raise the Bar Model <br>
    To enhance the model's performance, we will explore: <br>

    Feature Engineering: Adding additional weather attributes (e.g., humidity, wind speed) to improve classification. <br>
    Threshold Optimization: Adjusting the probability threshold for predicting rain to reduce false positives or false negatives. <br>
    Advanced Models: Exploring machine learning algorithms beyond a simple threshold-based classifier, such as Logistic Regression or Decision Trees. <br>

- Lower the Bar Model <br>
    To compare against a weaker alternative, we will: <br>

    Use a Naïve Baseline Approach: Randomly predicting "rain" or "no rain" based on overall dataset proportions. <br>
    Ignore Preprocessing: Using raw data without handling missing values or categorical transformations. <br>
    Fixed Thresholds Without Optimization: Sticking to an arbitrary classification threshold without adjustment. <br>
    By contrasting the Raise the Bar Model and Lower the Bar Model against the Base Model, we gain deeper insights into what contributes to a stronger predictive model. <br>
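
The base-model formulas listed above can be written out directly from the four confusion-matrix counts. This is a minimal sketch: `binary_metrics` is a hypothetical helper name, the counts passed to it are made up for illustration, and returning 0.0 when a denominator is zero is one possible convention rather than the only one.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, not results from weatherAUS.csv
metrics = binary_metrics(tp=30, fp=10, fn=20, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these toy counts, accuracy (0.850) is noticeably higher than recall (0.600), which shows why accuracy alone can flatter a model when the negative class dominates.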

#### Section 5: Conclusion and Next Steps (Base Model, Raise the Bar Model, Lower the Bar Model)
The final section will summarize findings, highlighting: <br>

- The effectiveness of the base model in predicting rainy days. <br>
- How performance improved in the Raise the Bar Model. <br>
- The limitations observed in the Lower the Bar Model. <br>
- Potential improvements, such as adjusting decision thresholds or using more advanced models. <br>

By following this structured approach, we ensure that our Binary Confusion Matrix accurately represents the model’s performance, guiding data-driven improvements. <br>

### Section 1: Load and Explore the Data (Base Model)
- Limit the data to a sample of 1,000 records and clean it by removing NaN and null entries, so that both columns hold only binary labels. <br>

In [42]:
# Display basic dataset structure
pd.set_option("display.max_columns", None)  # Show all columns
pd.set_option("display.expand_frame_repr", False)  # Prevent column wrapping

# Display first few rows
print("First ten rows of the dataset:")
display(df.head(10))  # Using display() for better rendering in Jupyter

# Display column names
print("\nColumn names in the dataset:")
print(df.columns.to_list())  # Converts to list for better readability

# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum().to_string())  # Prevent truncation

# Rename the two columns (RainToday, RainTomorrow) for clarity
df.columns = ["Actual", "Predicted"]

# Drop rows with missing values
df = df.dropna()

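# Limit to the first 1,000 complete data pairs; copy so that later column
# assignments act on an independent DataFrame rather than a slice
df = df.head(1000).copy()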

# Display dataset info after cleaning
print("\nDataset info after cleaning and selection:")
df.info()  # Prints a summary of dataset structure

# Display summary statistics
print("\nSummary statistics of the dataset:")
display(df.describe())  # Using display() for better Jupyter output formatting

First ten rows of the dataset:


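|   | RainToday | RainTomorrow |
|---|-----------|--------------|
| 0 | No | No |
| 1 | No | No |
| 2 | No | No |
| 3 | No | No |
| 4 | No | No |
| 5 | No | No |
| 6 | No | No |
| 7 | No | No |
| 8 | No | Yes |
| 9 | Yes | No |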



Column names in the dataset:
['RainToday', 'RainTomorrow']

Missing values in each column:
RainToday       3261
RainTomorrow    3267

Dataset info after cleaning and selection:
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 0 to 1024
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Actual     1000 non-null   object
 1   Predicted  1000 non-null   object
dtypes: object(2)
memory usage: 23.4+ KB

Summary statistics of the dataset:


|        | Actual | Predicted |
|--------|--------|-----------|
| count  | 1000 | 1000 |
| unique | 2 | 2 |
| top    | No | No |
| freq   | 773 | 773 |
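
The summary above reports "No" as the most frequent label, 773 times out of 1,000 rows in the Actual column. A class imbalance of that size matters when reading accuracy, because a classifier that always predicts the majority label already scores that share. The sketch below illustrates this with a hypothetical series built to mirror the 773-of-1,000 frequency; it does not read the real dataset.

```python
import pandas as pd

# Hypothetical labels mirroring the 773-of-1,000 "No" frequency reported above
actual = pd.Series(["No"] * 773 + ["Yes"] * 227, name="Actual")

# Share of each class
class_share = actual.value_counts(normalize=True)

# A majority-class baseline predicts the most frequent label for every row,
# so its accuracy equals the share of that label
majority_label = class_share.idxmax()
baseline_accuracy = (actual == majority_label).mean()

print(class_share)
print(f"Always predicting '{majority_label}' gives accuracy {baseline_accuracy:.3f}")
```

Any model evaluated on data with this balance should be compared against that majority-class figure rather than against 50%.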


### Section 2: Data Preprocessing (Base Model)
- Convert the "Yes"/"No" labels to the integers 1 and 0 for machine learning compatibility.
- Verify that only the binary values 0 and 1 remain in both columns.

In [43]:
# Convert categorical values to numerical (if applicable)
df["Actual"] = df["Actual"].map({"No": 0, "Yes": 1})
df["Predicted"] = df["Predicted"].map({"No": 0, "Yes": 1})

# Drop any remaining NaN values after conversion
df = df.dropna()

# Ensure the dataset only contains binary values (0 and 1)
valid_values = [0, 1]
df = df[df["Actual"].isin(valid_values) & df["Predicted"].isin(valid_values)]

# Display unique values to verify
print("\nUnique values in 'Actual' column after conversion:", df["Actual"].unique())
print("Unique values in 'Predicted' column after conversion:", df["Predicted"].unique())

# Verify the final shape of the dataset
print("\nFinal dataset shape after preprocessing:", df.shape)

# Display first few rows to confirm successful preprocessing
print("\nFirst ten rows of preprocessed dataset:")
display(df.head(10))



Unique values in 'Actual' column after conversion: [0 1]
Unique values in 'Predicted' column after conversion: [0 1]

Final dataset shape after preprocessing: (1000, 2)

First ten rows of preprocessed dataset:


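|   | Actual | Predicted |
|---|--------|-----------|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 1 |
| 9 | 1 | 0 |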
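
The second `dropna()` in the preprocessing cell is there because `Series.map` returns NaN for any value that is not a key in the mapping dictionary. The sketch below shows that behavior on a hypothetical series containing one unexpected spelling and one missing value. Neither case was observed in the output above; both are included only to illustrate the mechanism.

```python
import pandas as pd

# Hypothetical raw labels, including one unexpected spelling and one missing value
raw = pd.Series(["No", "Yes", "yes", None, "No"], name="Actual")

# Series.map returns NaN for any value that is not a key in the mapping
encoded = raw.map({"No": 0, "Yes": 1})
unmapped_count = int(encoded.isna().sum())
print("Rows that failed to map:", unmapped_count)

# Dropping those rows and casting leaves clean integer labels
clean = encoded.dropna().astype(int)
print(clean.tolist())
```

Checking the count of unmapped rows before dropping them is a cheap way to notice label variants that would otherwise be discarded silently.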


### Section 3: Constructing the Binary Confusion Matrix (Base Model)
- 