# Neural Networks
From the previous lecture we have learnt the key concepts behind neural networks. To reinforce your understanding of these concepts we are once again looking at the **Rain in Australia (weatherAUS)** dataset as to create a model that predicts rainfall given meteorological features.

## Imports
Below are the associated imports for the project. Don't be afraid to add more if to improve the quality of your model.

In [1]:
import torch
import torch.nn as nn
from torch.nn import Module
from torch.optim import Optimizer
from torch import Tensor
from torch.utils.data import Dataset, DataLoader, random_split
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from typing import List, Dict, Tuple
type LossFN = Union[Module]

## Dataset
Lets have a look into what features and aspects are found within our dataset. A method we've used already is `head` however another useful one to keep note of is `info` which gives information on the amount of `NaN` values. Where a nan value is just any data that isn't available.

For a more complex project it is advised to spend more time analysing the data as you may be able to extract important information useful for a model. For the sake of brevity we'll focus on filling and encoding data.

In [2]:
dataset = pd.read_csv("./data/weatherAUS.csv")
dataset.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

## **(Question 1)** Data-Preprocessing
### **(Part A)** Feature Extraction
Before we go about wrangling our data we first may want to choose features we believe are most important as to decrease the complexity of the training data. This will involves removing specific columns. Remember there are tradeoffs between both, with less features our model will train faster but at the cost of possibly missing important information. So when choosing be careful not to remove anything too vital. 

**Optionally remove columns to simplify the dataset**, make sure not to go overboard with this.

<details>
  <summary>Hint</summary>
    The function <code>dataset.drop([...], axis=1, inplace=True)</code> may be useful.
</details>

In [4]:
# Remove columns if you want.
dataset.drop(["Date"], axis=1, inplace=True) # Lazy Here

### **(Part B)** Not Available Data
From running `info` and `head` above we may see many incomplete fields within our dataset. For example despite their being 145460 entries only 142199 have a rain today value. Our first order of business is correcting these mistakes. There are two main approaches to this:
- Removing unused data.
- Replacing unused data (possibly with a calculation such as mean or median).

Be aware of the downsides of both strategies, when removing data you decrease its size which can mean you decrease the model's accuracy. However by filling the data you may be adding bias and simplifying its complexity. 

**Preprocess the data as to ensure there are no `NaN` values**.

<details>
  <summary>Hint</summary>
    Some functions which may be useful depending on the approach used are:
    <ul>
        <li><code>dataset[column].fillna(...)</code></li>
        <li><code>dataset[column].dropna()</code></li>
    </ul>
</details>

In [5]:
# Place modifications to the dataset here...
numeric_columns = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm']
for col in numeric_columns:
    dataset[col].fillna(dataset[col].median())
dataset.dropna(inplace=True)

To verify that you've removed all `NaN` data ensure values are 0 by running the below code.

In [6]:
dataset.isna().sum() # Verify by counting na

Location         0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

### **(Part $\beta$)** Dates
This part is optional depending on if you removed the dates. In order to make use of our dates better it is best we split them into individual arguments rather than keeping them as one value. This will be easier to train on and be more generalisable in the long term.

**Split the dates into Days, Months, Years and remove Dates**. Given you haven't removed dates already.

In [7]:
# Apply modifications to the dataset here

This will print nothing when you have no longer have a date field.

In [8]:
if "Date" in dataset.columns:
    print("Please remove date")

### **(Part C)** Encoding Arguments
Looking back at the `dataset.info()` we see that a lot of data we are reliant on has the datatype `object`rather than boolean/integer. This data is mostly string labels, such as compass directions. This will cause problems later on as to generate weights and biases we need to encode these as numeric values. This can be done through two main approaches:
- Label encoding, where every category is encoded as an integer. This can be done automatically or done by using a relevant number to the data, such as the amount of appearences. 
- On hot encoding, where each category is represented as $n$ Boolean columns, where $n$ is the amount of categories. Below is an example with `RainToday`.

| ID | RainToday |
| ---- | ----- |
| 0 | Yes |
| 1 | No |

| ID | Yes | No |
| --- | ---- | ----- |
| 0 | 1 | 0 |
| 1 | 0 | 1 |

**Encode object arguments as integer values**, keep in mind columns that you don't encode need to be dropped as objects aren't a valid in neural networks.

<details>
  <summary>Hint</summary>
    Some functions which may be useful depending on the approach used are:
    <ul>
        <li><code>pd.get_dummies</code></li>
        <li><code>dataset[...].map({})</code></li>
        <li><code>OneHotEncoder</code> from sklearn</li>
        <li><code>LabelEncoder</code> from sklearn</li>
    </ul>
</details>

In [9]:
# Encode Dataset...
for x in ("RainToday", "RainTomorrow"):
    dataset[x] = dataset[x].map({"Yes": 1, "No": 0})
try:
    encoded_cols = pd.get_dummies(dataset[['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm']])
    dataset = pd.concat([dataset, encoded_cols], axis=1)
    dataset.drop(['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm'], axis=1, inplace=True)
except: pass
boolean_columns = dataset.select_dtypes(include=bool).columns
dataset[boolean_columns] = dataset[boolean_columns].astype(int)

Given that you have gone about doing the above task nothing should be printed.

In [10]:
non_encoded_cols = dataset.select_dtypes(include=['object']).columns
if len(non_encoded_cols) > 0:
    print(f"The following rows need to be encoded:\n{non_encoded_cols}")

## Pytorch Datasets
After ensuring our dataset has the appropriate preprocessing done we can now turn it into a format that can be used by Torch. This is done with the `dataset` abstract class provided by Torch. For now we won't go too indepth just know that the below code creates a class that is able to access our data and then puts it into a iterator that allows for it to be returned by calling the `next(data)` command.

In [11]:
class WeatherAUS(Dataset):
    def __init__(self, data: pd.DataFrame):
        """Intialise the dataset with key information"""
        # If you want to modularise your model nicely it is fairly common to put
        # your initial preprocessing here. 
        labels = dataset["RainTomorrow"] # Use Rain Tomorrow as output
        d = data.drop(["RainTomorrow"], axis=1) # Remove it from dataset
        self.data = torch.tensor(d.values, dtype=torch.float32) # Convert data into tensors
        labels = torch.tensor(labels.values, dtype=torch.float32)
        self.labels = labels.view(-1, 1) # Turn Tensor[64] to Tensor[64, 1]

    def __len__(self) -> int:
        """Get length of the data"""
        return len(self.data)

    def __getitem__(self, idx: int) -> Tuple[Tensor, Tensor]:
        """Return the respective data at the index"""
        return self.data[idx], self.labels[idx]

# Apply wrapper to the dataset
data = WeatherAUS(dataset)

## **(Question 2)** Splitting Data
Now that we have our dataset we need to split it into two sets, one for training and one for testing. This will enable us to use the training data to find weights and biases for our model, and the testing to see how accuracy and loss has changed between training sessions.

**Split into a training dataset and a test dataset**, use variables `train_dataset` and `test_dataset` as output. Pytorch provides a useful function `random_split` which has been imported.

<details>
  <summary>Hint</summary>
    Remember that we can use <code>len(data)</code> to get the length of our dataset.
</details>

In [12]:
train_size = int(0.8*len(data))
test_size = len(data) - train_size
train_dataset, test_dataset = random_split(data, [train_size, test_size])

After splitting the data we wrap it in Torch's DataLoader class to allow it to be iterated by a model. If all this was successful no error should be produced

In [13]:
batch_size = 64
train_dataloader = DataLoader(train_dataset, batch_size)
test_dataloader = DataLoader(test_dataset, batch_size)

In [14]:
ttensor, tlabel = next(iter(test_dataloader))
if ttensor.shape[0] != batch_size and tlabel.shape[0] != batch_size:
    print(f"Tensor incorrect with shape:\n{ttensor.shape}\n{tlabel.shape}")
    print(ttensor, "\n", tlabel)

Notice how the `batch_size` results in an array of 64 elements. Training in batches is a common as it is more memory efficient than loading the entire dataset and trains faster than only loading one element at a time.

## Device
Before we can go about creating a model we need to make sure we know what device we are going to put the model on. The below snippet decides if the device being used is cuda, mps or the cpu. With modern GPU's being optimised for general purpose mathematical operations it can be efficiently used to train models. Keep in mind if you do use your gpu that the model size fits within the vram of the device.

In [15]:
# Find device being used
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


## **(Question 3)** Model
While you won't be able to analyse how your model performs yet it is still a good idea to learn the process of model creation. The below code provided is the start of a very basic model. Once you fix the issues indicated you can start to experiment with it.

**Fill in the gaps and comeback to this section later to create a better model**.

In [16]:
# Define model
class WeatherNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(len(ttensor[0]), 32)
        self.fc2 = nn.Linear(32, 32)
        self.fc3 = nn.Linear(32, 16)
        self.dropout1 = nn.Dropout(0.25)
        self.fc4 = nn.Linear(16, 8)
        self.dropout2 = nn.Dropout(0.5)
        self.fc5 = nn.Linear(8, 1)

    def forward(self, x: Tensor):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.dropout1(x)
        x = torch.relu(self.fc4(x))
        x = self.dropout2(x)
        x = torch.sigmoid(self.fc5(x))
        return x

model = WeatherNet().to(device)

## Loss Function & Optimiser
Loss functions can be ran like `loss_fn(prediction, answer)`. Optimisers have the functions `optimiser.step()` to use computed gradient values to changes a models parameters and `optimiser.zero_grad()` to reset these gradients.

These are both values that can be changed. With the optimiser determining the rate at which the model learns the parameters, and loss 

In [17]:
loss_fn = nn.MSELoss() # Loss function
optimiser = torch.optim.SGD(model.parameters(), lr=0.0001) # Gradient Descent

## **(Question 4)** Training
In order for all the inner weights and biases inside of the neural network to be found we first need train the model. This process generally follows a process of:

1. Set model to `model.train()`
2. Iterate through batches within the data loader.
3. Assign the iterated values to a device `X, y = X.to(device), y.to(device)`.
4. Compute the prediction error (*forward pass*).
5. Run backpropogation (*back pass*).

**Finish the train function to follow along with the above instructions**.
<details>
  <summary>Forward Pass Hint</summary>
    Remember how the ML models functioned, we started by computing the predicted model. For our problem we can simply run <code>model(...)</code>, followed by computing the loss.
</details>
<details>
  <summary>Back Pass Hint</summary>
    Backpropogation can be computed by using <code>loss.backward()</code>. These gradients can then be used to update parameters by running <code>optimiser.step()</code> before reseting gradient computations to zero <code>optimiser.zero_grad()</code>.
</details>

In [18]:
def train(dataloader: DataLoader, model: WeatherNet, loss_fn: LossFN, optimiser: Optimizer):
    n = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimiser.step()
        optimiser.zero_grad()

        # Print loss occasionally
        if batch % 100 == 0:
            print(f"Batch {batch}, Loss: {loss.item()}")

## **(Question 5)** Testing
Once we have finished running training we can now test the model against our testing dataset. This will provide us with a look into its current loss and accuracy. This can be done similar to training. Just avoid the backpropogation step as our goal isn't to change the model as this stage.

**Create a test function that computes the loss and accuracy**.

<details>
  <summary>Hint</summary>
    The loss and accuracy can be found by iterating through the data and adding the loss <code>loss_fn(pred, y)</code> and accuracy <code>pred.argmax(1)==y</code>.
</details>

In [19]:
def test(dataloader: DataLoader, model: WeatherNet, loss_fn: LossFN):
    """
    Test the model to see how it compares against the test portion of the
    dataset
    """
    n = len(dataloader.dataset)
    n_batch = len(dataloader)
    model.eval() # Set Torch to evaluation mode
    loss = 0
    acc = 0
    with torch.no_grad(): # This step removes gradient descent modifications
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            loss += loss_fn(pred, y).item()
            acc += (pred.argmax(1) == y).type(torch.float).sum().item()
    loss /= n_batch
    acc /= n
    print(f"Test Error: \nAccuracy: {(acc):>0.1f}%, Avg loss: {loss:>8f} \n")

## Results
Given that the last two functions are correct you should be able to run the below loop. This will run the training and test phases creating a model.

In [30]:
epochs = 50
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimiser)
    test(test_dataloader, model, loss_fn)
print("Done!")

Epoch 1
-------------------------------
Batch 0, Loss: 0.15621289610862732
Batch 100, Loss: 0.2074325829744339
Batch 200, Loss: 0.18956616520881653
Batch 300, Loss: 0.21230560541152954
Batch 400, Loss: 0.2714022696018219
Batch 500, Loss: 0.19510801136493683
Batch 600, Loss: 0.14476263523101807
Batch 700, Loss: 0.2078814059495926
Test Error: 
Accuracy: 49.7%, Avg loss: 0.161335 

Epoch 2
-------------------------------
Batch 0, Loss: 0.1433093547821045
Batch 100, Loss: 0.20733773708343506
Batch 200, Loss: 0.21770885586738586
Batch 300, Loss: 0.1936275064945221
Batch 400, Loss: 0.24502038955688477
Batch 500, Loss: 0.20431651175022125
Batch 600, Loss: 0.15416623651981354
Batch 700, Loss: 0.17253081500530243
Test Error: 
Accuracy: 49.7%, Avg loss: 0.161120 

Epoch 3
-------------------------------
Batch 0, Loss: 0.16956141591072083
Batch 100, Loss: 0.21775877475738525
Batch 200, Loss: 0.17557395994663239
Batch 300, Loss: 0.21311374008655548
Batch 400, Loss: 0.25120809674263
Batch 500, Loss

If you kept track of your loss/accuracy with a variable you can use the helper function below to plot them.

In [29]:
def plot_loss_acc(loss: List[int], acc: List[int]):
    """Helper to plot loss and accuracy if you store them in a list"""
    fig, (ax0, ax1) = plt.subplots(1,2,figsize=(16,5))
    ax0.plot(acc, 's-')
    ax0.set_title(f'Final test accuracy {acc[-1]:.2f}%')
    ax0.set_ylabel('Accuracy%')
    ax0.set_xlabel('Epochs')
    ax1.plot(loss, 's-')
    ax1.set_title(f'Final test loss {loss[-1]:.2f}')
    ax1.set_ylabel('Loss')
    ax1.set_xlabel('Epochs')

Congratulations you have trained a neural network. Most likely the results from this will be less than satisfactory, with it being possilby lower than 50%. Now this is only the start of your journey, every single variable whether your *loss function*, *optimiser (and its learning rate)*, *extracted features*, *model layers*, and more can be changed as to create a better model. This process if called **hyperparameter tuning** and is a large part of machine learning. If you ever want to get better at ML you must be willing to spend the time the improve your model. Therefore the last task is:

**Improve your model to have a better accuracy**. Tune respective parameters as to produce the best result possible.