<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# Weather ML Models — Classification & Regression Report

**Short description:**  
This notebook loads a dataset from the Australian Government's Bureau of Meteorology, preprocesses it (one-hot encoding and numeric casting), trains several models (Linear Regression, KNN, Decision Tree, Logistic Regression, SVM), and reports evaluation metrics (MAE, MSE, and R2 for regression; Accuracy, Jaccard Index, F1-Score, and LogLoss for classification).

**Objectives**
- Load and inspect the Weather_Data dataset.
- Prepare categorical features using one-hot encoding and convert data to numeric.
- Train multiple supervised models (regression and classification) and compare performance metrics.
- Demonstrate train/test splitting and use standard performance metrics to assess models.
- Provide a concise results table summarizing model performance for easy comparison.

**Models:**
1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

**Evaluation:**

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

**Notice about documentation:**  
The original notebook submission (course assignment) was kept intact. I have **only modified documentation (comments, headings, markdown)** and made **minimal, necessary corrections** to ensure the notebook runs without errors. All rights related to the lab/workshop design and original exercise belong exclusively to **IBM Corporation**. This notebook includes additional documentation for clarity, but the intellectual property of the original exercise is retained by IBM.

---

## Table of contents

1. Dependencies & execution instructions  
2. Data loading & initial inspection  
3. Preprocessing (one-hot encoding, replacements, type casting)  
4. Feature / target split and train/test split  
5. Model training (Linear Regression, KNN, Decision Tree, Logistic Regression, SVM)  
6. Model evaluation metrics and results table  
7. Notes & reproducibility

---

## About The Dataset

The data originally come from the Australian Government's Bureau of Meteorology, and the latest data can be obtained from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday', and the target is 'RainTomorrow'. It was gathered from the Rattle repository at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## 1) Dependencies & execution instructions

This section installs and imports required Python packages.  

**Recommended local execution steps:**

1. Create and activate a Python virtual environment:
   - `python -m venv venv`
   - `source venv/bin/activate` (macOS / Linux) or `venv\Scripts\activate` (Windows)
2. Install dependencies:
   - `pip install -r requirements.txt`
3. Launch Jupyter Notebook:
   - `jupyter notebook`
4. Open this notebook and run cells top-to-bottom.
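
Before running the cells, it can help to confirm that the packages imported below are available in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the module list are assumptions made for illustration (the contents of `requirements.txt` are not shown in this notebook).

```python
import importlib.util

# Top-level import names used by this notebook
# (scikit-learn is imported under the name "sklearn").
REQUIRED_MODULES = ["pandas", "numpy", "sklearn"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported missing, re-running `pip install -r requirements.txt` inside the activated virtual environment is the first thing to try.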


In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

## 2) Data loading & initial inspection

This section reads the CSV from the course URL into a pandas DataFrame and displays the first rows to inspect the dataset schema and contents.


In [2]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

df = pd.read_csv(path)
df.head()

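|     | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |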
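
Beyond `df.head()`, a quick look at the shape, column types, and missing-value counts is often useful before preprocessing. The sketch below runs on a small made-up frame so that it is self-contained; the helper `summarize` and the sample values are illustrative assumptions, not part of the original notebook. The same function could be called on `df`.

```python
import pandas as pd

# Toy stand-in for the weather data; values are illustrative only.
sample = pd.DataFrame({
    "MinTemp": [19.5, 21.6, None],
    "WindGustDir": ["W", "W", "ESE"],
    "RainTomorrow": ["Yes", "No", "Yes"],
})


def summarize(frame):
    """Collect basic facts: shape, per-column dtype, and missing-value counts."""
    return {
        "shape": frame.shape,
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing": frame.isna().sum().to_dict(),
    }


summary = summarize(sample)
print(summary["shape"])
print(summary["dtypes"])
print(summary["missing"])
```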


## 3) Preprocessing

This section:
- Converts categorical columns into one-hot encoded columns (`pd.get_dummies`),
- Replaces 'Yes'/'No' strings with 1/0,
- Drops the 'Date' column,
- Casts the whole DataFrame to float to ensure models receive numeric arrays.

Perform one-hot encoding to convert the categorical variables into binary indicator columns.

In [3]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
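
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (the values are made up for illustration). Each listed column is replaced by one indicator column per category, named `<column>_<category>`, while the other columns are left untouched.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (illustrative values).
toy = pd.DataFrame({
    "Rainfall": [15.6, 0.0, 6.6],
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "ESE", "W"],
})

# Same call shape as in the notebook, applied to the toy columns.
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])
print(list(encoded.columns))
print(encoded)
```

In recent pandas versions the indicator columns are boolean rather than integer, which is one reason the later cast to float is useful.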

Replace the 'No'/'Yes' values of the 'RainTomorrow' column with 0/1, turning it from a categorical column into a binary one.

In [4]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

Drop the 'Date' column and cast all remaining columns to float.


In [5]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [6]:
df_sydney_processed = df_sydney_processed.astype(float)
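
The remaining preprocessing steps can be illustrated on a toy frame. Note that this sketch uses `Series.map` on the target column only, which is an alternative to the frame-wide `replace` used above, not the notebook's own method; it limits the substitution to a single column. The toy values are made up.

```python
import pandas as pd

# Toy frame mimicking the relevant columns (illustrative values).
toy = pd.DataFrame({
    "Date": ["2/1/2008", "2/2/2008"],
    "Rainfall": [15.6, 6.0],
    "RainTomorrow": ["Yes", "No"],
})

# Map only the target column, so other columns cannot be altered by accident.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})

# Drop the date string and cast everything that is left to float.
toy = toy.drop(columns="Date").astype(float)
print(toy.dtypes)
print(toy)
```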

## 4) Features and target split, train/test split

The notebook separates the features (all columns except 'RainTomorrow') and the target `Y = RainTomorrow`. It performs a train/test split (80/20) and prepares data for model training.

In [7]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

In [8]:
x_train, x_test, y_train, y_test = train_test_split(features,Y,test_size=0.2,random_state=1)
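
The 80/20 split can be checked on a small synthetic array. The sketch below also passes `stratify=y`, which is an option the notebook does not use; it is shown here only as an assumption worth considering when classes are imbalanced, because it keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic samples with 2 features; 25% of the labels are positive.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# test_size=0.2 holds out 4 of the 20 rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```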

## 5) Model training

Models trained in this notebook, in order:
- LinearRegression (regression)
- KNeighborsClassifier (classification)
- DecisionTreeClassifier (classification)
- LogisticRegression (classification, with probability outputs)
- SVM (classification)

Each model is trained on the training set and used to predict on the test set. Predictions and relevant probability outputs are produced where applicable.

### 5.1 Linear Regression


Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [9]:
LinearReg = LinearRegression().fit(x_train,y_train)

Use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.

In [10]:
predictions = LinearReg.predict(x_test)

Using `predictions` and the `y_test` Series, calculate the value of each metric with the appropriate function.


In [11]:
LinearRegression_MAE = metrics.mean_absolute_error(y_test,predictions)
LinearRegression_MSE = metrics.mean_squared_error(y_test,predictions)
LinearRegression_R2 = metrics.r2_score(y_test,predictions)

Show the MAE, MSE, and R2 for the linear model in tabular form using a DataFrame.


In [12]:
Report = pd.DataFrame(data={'MAE': [LinearRegression_MAE], 'MSE': [LinearRegression_MSE],
                            'R2': [LinearRegression_R2]})
Report

|     | MAE | MSE | R2 |
| --- | --- | --- | --- |
| 0 | 0.265566 | 0.123146 | 0.386278 |
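
For reference, the three regression metrics can be computed by hand. The sketch below uses a tiny made-up example (not the notebook's data) and follows the standard definitions: MAE is the mean absolute error, MSE the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Tiny illustrative example: binary targets and continuous predictions,
# similar in kind to fitting a linear model on a 0/1 target.
y_true = np.array([1.0, 0.0, 0.0, 1.0])
y_pred = np.array([0.8, 0.1, 0.4, 0.6])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                      # mean absolute error
mse = np.mean(errors ** 2)                         # mean squared error
ss_res = np.sum(errors ** 2)                       # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, r2)
```

These should agree with `metrics.mean_absolute_error`, `metrics.mean_squared_error`, and `metrics.r2_score` on the same inputs.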


### 5.2 KNN


Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [13]:
KNN = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)

Use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [14]:
predictions = KNN.predict(x_test)

Using `predictions` and the `y_test` Series, calculate the value of each metric with the appropriate function.


In [15]:
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)
print(f"Accuracy Score: {KNN_Accuracy_Score}, Jaccard Index: {KNN_JaccardIndex}, F1 Score: {KNN_F1_Score}")

Accuracy Score: 0.8122137404580153, Jaccard Index: 0.39408866995073893, F1 Score: 0.5653710247349824
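
The Jaccard Index and F1-Score for the positive class are both functions of the same confusion counts, which is why they move together. The sketch below (plain Python, made-up labels, hypothetical helper names) computes both from true positives, false positives, and false negatives, and checks the identity F1 = 2J / (1 + J) against the KNN values printed above.

```python
def binary_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def jaccard(tp, fp, fn):
    """Jaccard Index: TP / (TP + FP + FN); 0.0 if the denominator is zero."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1(tp, fp, fn):
    """F1-Score: 2TP / (2TP + FP + FN); 0.0 if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn = binary_counts(y_true, y_pred)
toy_jaccard = jaccard(tp, fp, fn)
toy_f1 = f1(tp, fp, fn)
print(toy_jaccard, toy_f1)

# Consistency check using the KNN Jaccard Index reported above.
knn_jaccard = 0.39408866995073893
knn_f1_from_jaccard = 2 * knn_jaccard / (1 + knn_jaccard)
print(knn_f1_from_jaccard)
```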


### 5.3 Decision Tree


Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).


In [16]:
Tree = DecisionTreeClassifier().fit(x_train, y_train)

Use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [17]:
predictions = Tree.predict(x_test)

Using `predictions` and the `y_test` Series, calculate the value of each metric with the appropriate function.


In [18]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)
print(f"Accuracy Score: {Tree_Accuracy_Score}, Jaccard Index: {Tree_JaccardIndex}, F1 Score: {Tree_F1_Score}")

Accuracy Score: 0.76793893129771, Jaccard Index: 0.3795918367346939, F1 Score: 0.5502958579881657
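
Because `DecisionTreeClassifier()` is created without a `random_state`, the scores above may vary slightly between runs (scikit-learn permutes features randomly at each split). The sketch below, on synthetic data, shows how fixing `random_state` makes two fits agree; `max_depth=3` is an arbitrary illustrative choice, and neither parameter is set in the notebook itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 samples, 5 features, label driven by the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Two trees with identical settings and the same seed.
tree_a = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

same = np.array_equal(tree_a.predict(X), tree_b.predict(X))
print(same, tree_a.get_depth())
```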


### 5.4 Logistic Regression


Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [19]:
LR = LogisticRegression(solver='liblinear').fit(x_train, y_train)

Use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save the results as two arrays, `predictions` and `predict_proba`.


In [20]:
predictions = LR.predict(x_test)

In [21]:
predict_proba = LR.predict_proba(x_test)

Using `predictions`, `predict_proba`, and the `y_test` Series, calculate the value of each metric with the appropriate function.


In [22]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)
print(f"Accuracy Score: {LR_Accuracy_Score}, Jaccard Index: {LR_JaccardIndex}, F1 Score: {LR_F1_Score}, Log Loss: {LR_Log_Loss}")

Accuracy Score: 0.8366412213740458, Jaccard Index: 0.5091743119266054, F1 Score: 0.6747720364741642, Log Loss: 0.3804510672347215
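
Log Loss is the only metric here that uses predicted probabilities rather than hard labels. As a minimal sketch, the binary form can be written in plain Python; `binary_log_loss` is a hypothetical helper, the inputs are made up, and `p_positive` plays the role of the positive-class column of `predict_proba` (assumed to be the second column).

```python
import math


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Made-up labels and probabilities for illustration.
y_true = [1, 0, 1, 0]
p_positive = [0.9, 0.2, 0.6, 0.4]
loss = binary_log_loss(y_true, p_positive)

# Reference point: always predicting 0.5 gives ln(2), about 0.693.
baseline = binary_log_loss([1, 0], [0.5, 0.5])
print(loss, baseline)
```

Lower is better; a value below the ln(2) baseline indicates the probabilities carry more information than a coin flip.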


### 5.5 SVM


Create and train an SVM model called SVM using the training data (`x_train`, `y_train`).


In [23]:
SVM = svm.SVC().fit(x_train, y_train)

Use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [24]:
predictions = SVM.predict(x_test)

Using `predictions` and the `y_test` Series, calculate the value of each metric with the appropriate function.


In [25]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)
print(f"Accuracy Score: {SVM_Accuracy_Score}, Jaccard Index: {SVM_JaccardIndex}, F1 Score: {SVM_F1_Score}")

Accuracy Score: 0.7221374045801526, Jaccard Index: 0.0, F1 Score: 0.0
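
A Jaccard Index and F1-Score of exactly 0 mean the model produced no true positives. One likely reading (an assumption, since the predictions themselves are not shown) is that the SVC predicted "no rain" for every test day, in which case accuracy simply equals the share of negative days. The sketch below shows that pattern on made-up labels; `diagnose` is a hypothetical helper.

```python
def diagnose(y_true, y_pred):
    """Summarize predictions to spot a classifier that never predicts class 1."""
    n = len(y_true)
    negatives = sum(1 for t in y_true if t == 0)
    return {
        "predicted_positives": sum(1 for p in y_pred if p == 1),
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / n,
        "majority_rate": max(negatives, n - negatives) / n,
    }


# Made-up example: 7 negative days, 3 positive days, all predicted negative.
y_true = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]
y_pred = [0] * 10
result = diagnose(y_true, y_pred)
print(result)
```

If the same pattern shows up for `SVM.predict(x_test)`, scaling the features before fitting (for example `StandardScaler` combined with `svm.SVC()` via `make_pipeline`) is a common thing to try, since RBF-kernel SVMs are sensitive to feature scale. Whether it helps on this dataset has not been tested here.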


## 6) Model evaluation and results

Evaluation metrics computed in the notebook:
- Regression metrics for LinearRegression: MAE, MSE, R².
- Classification metrics for classifiers: Accuracy, Jaccard Index, F1-Score.
- Logistic Regression also reports Log Loss using predicted probabilities.

A final summary `Report` DataFrame aggregates metrics for easy comparison.

Show the Accuracy, Jaccard Index, F1-Score, and LogLoss for all of the above classifiers in tabular form using a DataFrame.

\*LogLoss is reported only for the Logistic Regression model.


In [28]:
data = {'Accuracy': [KNN_Accuracy_Score,Tree_Accuracy_Score,LR_Accuracy_Score,SVM_Accuracy_Score],
        'Jaccard Index': [KNN_JaccardIndex,Tree_JaccardIndex,LR_JaccardIndex,SVM_JaccardIndex],
        'F1-Score': [KNN_F1_Score,Tree_F1_Score,LR_F1_Score,SVM_F1_Score],
        'LogLoss': [np.nan,np.nan,LR_Log_Loss,np.nan]}
Report = pd.DataFrame(data=data)
Report

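|     | Accuracy | Jaccard Index | F1-Score | LogLoss |
| --- | --- | --- | --- | --- |
| 0 | 0.812214 | 0.394089 | 0.565371 | NaN |
| 1 | 0.767939 | 0.379592 | 0.550296 | NaN |
| 2 | 0.836641 | 0.509174 | 0.674772 | 0.380451 |
| 3 | 0.722137 | 0.0 | 0.0 | NaN |

Rows 0 to 3 correspond to KNN, Decision Tree, Logistic Regression, and SVM, in that order.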
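
For easier reading, the same table can be built with model names as the index instead of the default integer labels. The sketch below is a presentational variant, not the notebook's own cell; it hard-codes the rounded values reported above so that it runs on its own, whereas in the notebook the score variables would be used directly.

```python
import numpy as np
import pandas as pd

# Values copied (rounded) from the results table; row order follows the
# notebook: KNN, Decision Tree, Logistic Regression, SVM.
report = pd.DataFrame(
    {
        "Accuracy": [0.812214, 0.767939, 0.836641, 0.722137],
        "Jaccard Index": [0.394089, 0.379592, 0.509174, 0.0],
        "F1-Score": [0.565371, 0.550296, 0.674772, 0.0],
        "LogLoss": [np.nan, np.nan, 0.380451, np.nan],
    },
    index=["KNN", "Decision Tree", "Logistic Regression", "SVM"],
)

# Rank models by F1-Score and pick the best one.
ranked = report.sort_values("F1-Score", ascending=False)
best = report["F1-Score"].idxmax()
print(ranked)
print(best)
```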


## 7) Notes & reproducibility

- The notebook reads the dataset from the course URL; for offline execution download the CSV and update `path` to the local file.
- Required packages are listed in `requirements.txt`. Install them into a virtual environment before running.
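
One way to support both online and offline runs is to prefer a local copy of the CSV when it exists and fall back to the course URL otherwise. This is a minimal sketch: the local location `data/Weather_Data.csv` and the helper `resolve_data_path` are hypothetical, chosen only for illustration.

```python
from pathlib import Path

DATA_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv"
)
# Hypothetical location for a downloaded copy of the dataset.
LOCAL_COPY = Path("data") / "Weather_Data.csv"


def resolve_data_path(local_copy=LOCAL_COPY, url=DATA_URL):
    """Return the local file path if it exists, otherwise the remote URL."""
    return str(local_copy) if Path(local_copy).is_file() else url


path = resolve_data_path()
print(path)
```

The resulting `path` can then be passed to `pd.read_csv(path)` exactly as in the data loading cell.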


<h2>About the Authors:</h2> 

<a href="https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01">Joseph Santarcangelo</a>

### Other Contributors

[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By    | Change Description          |
| ----------------- | ------- | ------------- | --------------------------- |
| 2022-06-22        | 2.0     | Svitlana K.   | Deleted GridSearch and Mock |

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>
