# Neural Network from scratch in PyTorch

### 1. Load and inspect data

In [530]:
import pandas as pd

data_path = 'data.csv'
data = pd.read_csv(data_path)       # dataframe

print(data)

   x_0  x_1  x_2  y
0  1.0    0    0  0
1  0.0    0    5  0
2  1.0    1    3  1
3  0.0    1    1  0
4  0.0    1    1  1
5  0.0    1    2  0
6  1.0    0    1  1
7  1.1    0    1  0
8  1.0    0    0  1


#### Features, data points

In [531]:
# FEATURES: measurable properties/attributes we can use to predict
# check number of data points (rows) and number of features (columns except target 'y' in this case)

print(data.shape)       # tuple (rows, columns)

(9, 4)


- 9 rows: 9 data points (samples)
- 4 columns
    - 3 features (the fourth column, 'y', is the target)
    - 3 x 9 = 27 feature values in total

#### Range of features

By knowing the range of each feature, we can apply proper normalization (transform feature values to a standard scale) to ensure all features contribute proportionally during training.
- For example, if the range of one feature is 1000x larger than another, then during loss minimization the gradients associated with the larger-scaled feature will likely be larger. This disproportion can cause the optimizer to overemphasize that feature even if it has little influence on the prediction, which skews the weight updates and can slow or destabilize training

In [532]:
# determine range of each feature:      max - min

features_columns = [col for col in data if col != 'y']
features_ranges = {}

for feature in features_columns:
    min_val = data[feature].min()
    max_val = data[feature].max()
    features_ranges[feature] = float(max_val - min_val)

print("Range of features: ")
for features_range in features_ranges.items():
    print(features_range)

Range of features: 
('x_0', 1.1)
('x_1', 1.0)
('x_2', 5.0)
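
The same ranges can also be computed without an explicit loop. The sketch below rebuilds the table printed earlier as an in-memory DataFrame (an assumption made so the example runs without `data.csv`) and subtracts the column-wise minimum from the column-wise maximum.

```python
import pandas as pd

# Stand-in for data.csv, copied from the table printed above
data = pd.DataFrame({
    "x_0": [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.1, 1.0],
    "x_1": [0, 0, 1, 1, 1, 1, 0, 0, 0],
    "x_2": [0, 5, 3, 1, 1, 2, 1, 1, 0],
    "y":   [0, 0, 1, 0, 1, 0, 1, 0, 1],
})

features = data.drop(columns="y")                     # every column except the target
feature_ranges = (features.max() - features.min()).to_dict()

print(feature_ranges)
```

Here `x_2` spans about five times the range of `x_0` and `x_1`, which is the kind of imbalance that scaling is meant to remove.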


### Model and package selection

- Because the target column consists of 0s and 1s, this is a binary classification problem (predicting y from x features)
    - Use a multi-layer perceptron (MLP)

- Use pytorch to define, train and evaluate the model
- Use scikit-learn to split the data

In [533]:
import torch

# separate features and target
features_values = data[features_columns].values
target_values = data['y'].values

# convert to tensors    (tensors: multidimensional homogeneous data structures, good for parallelism and with many optimized operations in packages like pytorch)
features_values = torch.tensor(features_values, dtype=torch.float32)
target_values = torch.tensor(target_values, dtype=torch.float32)

### `**` Normalization / scale data `**`

StandardScaler standardizes features by rescaling them to have a mean of 0 and a standard deviation of 1.
- Standardization does NOT change the shape of the distribution: it does NOT transform the data into a Gaussian distribution, it only shifts and rescales the values
    - i.e. the DISTRIBUTION of the original data remains the same, but the values are shifted so that 0 is the mean and rescaled so that the standard deviation is 1

<br><br>
$x' = \frac{x - \mu}{\sigma}$

**Where:**

- $x$ = original data point  
- $\mu$ = mean of the feature (before standardization)  
- $\sigma$ = standard deviation of the feature (before standardization)  


In [534]:
"""
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_values = scaler.fit_transform(features_values)
"""

'\nfrom sklearn.preprocessing import StandardScaler\n\nscaler = StandardScaler()\nfeatures_values = scaler.fit_transform(features_values)\n'
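
If the scaling step is enabled, two details are worth keeping in mind: `fit_transform` returns a NumPy array rather than a tensor, and fitting the scaler on all rows lets test-set statistics leak into training. The sketch below is one possible arrangement, not the notebook's own pipeline: it splits first, fits the scaler on the training rows only, and converts back to tensors. The arrays `X` and `y` and the `random_state=0` seed are assumptions made for a self-contained example.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's features/target (values copied from the printed table)
X = np.array([[1.0, 0, 0], [0.0, 0, 5], [1.0, 1, 3], [0.0, 1, 1], [0.0, 1, 1],
              [0.0, 1, 2], [1.0, 0, 1], [1.1, 0, 1], [1.0, 0, 0]], dtype=np.float32)
y = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1], dtype=np.float32)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)   # learn mean and std from the training rows only
x_test = scaler.transform(x_test)         # reuse the training statistics on the test rows

# Back to float32 tensors so TensorDataset/DataLoader can consume them
x_train = torch.tensor(x_train, dtype=torch.float32)
x_test = torch.tensor(x_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)
```

After this, each training feature has mean 0 and (population) standard deviation 1, while the test features are only approximately standardized because they reuse the training statistics.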

### Split the data (80% train, 20% test)

In [535]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features_values, target_values, test_size = 0.2)

### Create pytorch Dataset and Dataloader

- Dataset: stores the samples and their labels
- DataLoader: wraps an iterable around the `Dataset` to enable easy access to the samples. It serves the data to the model in batches and can load them in parallel with worker processes

<br>

- Batch: a subset of the training data processed together in one forward/backward pass
    - batch size value depends on memory constraints, model size, dataset size, etc

In [536]:
from torch.utils.data import TensorDataset, DataLoader

# create datasets
train_dataset = TensorDataset(x_train, y_train)         
test_dataset = TensorDataset(x_test, y_test)

# create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

"""
**ADD SHUFFLE**
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)       # shuffle tells the dataloader to draw samples in random order (not the order of the dataset)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
"""

x, y = next(iter(test_loader))      # take the first batch from the test loader

print(x)        # feature tensor of shape [batch_size, 3]
print(x.shape)  # torch.Size([2, 3]): the test split holds only 2 samples, each with 3 features
print(y)        # label tensor, one label per sample
print(y.shape)  # torch.Size([2])

# With 9 rows and test_size=0.2, the train split holds the other 7 samples,
# so train_loader yields a single batch of shape [7, 3] with 7 labels (batch_size=32 exceeds the dataset size).


tensor([[1., 1., 3.],
        [1., 0., 0.]])
torch.Size([2, 3])
tensor([1., 1.])
torch.Size([2])


### Neural Network

<i>Note: Layer sizes decrease gradually to funnel information</i>

**Architecture**

- Input layer: (# features) neurons
<br><br>
- Hidden layer 1: 64 neurons (about 21x the 3 input features, well above a 2-3x rule of thumb)
<br><br>
- Hidden layer 2: 32 neurons (half previous layer for gradual dimension reduction)
<br><br>
- Output layer: 1 neuron (for binary classification)
<br><br>
- Activation function: **ReLU**. max(0, x) - returns x if positive, 0 otherwise
    - mitigates the vanishing gradient problem (when the gradients used to update the network become very small, so the network learns too slowly or not at all)
<br><br>
- Sigmoid: squash output between 0 and 1 (for binary classification problem)
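
As a quick illustration of the two activations listed above, the sketch below applies them to a few hand-picked values (the input tensor is arbitrary).

```python
import torch

z = torch.tensor([-2.0, 0.0, 3.0])

relu_out = torch.relu(z)         # negatives become 0, positives pass through unchanged
sigmoid_out = torch.sigmoid(z)   # every value is squashed into the open interval (0, 1)

print(relu_out)      # tensor([0., 0., 3.])
print(sigmoid_out)   # the middle value is exactly 0.5, since sigmoid(0) = 0.5
```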

In [537]:
import torch.nn as nn

"""
In this cell, see improvements at comments with `** IMPROVEMENTS:`

- Dropout: during training, each neuron is zeroed out with probability p (a hyperparameter) for that forward pass. 
    since zeroing neurons reduces the expected magnitude of the activations, the remaining neurons are scaled up by 1/(1-p)
            
"""


class NeuralNetwork(nn.Module):         # nn.Module is the base class for all neural networks. Our model will be a subclass that inherits this superclass
    def __init__(self, input_size):     # input_size: number of the features, `len(features_columns)`
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(input_size, 64),  
            # **  IMPROVEMENT: nn.BatchNorm1d(64),          # normalizes layer inputs during training  (can help with unstable loss values)
            nn.ReLU(),                                  
            # **  IMPROVEMENT: nn.Dropout(0.2),             # dropout: randomly sets some neurons to 0 (deactivates them) during training (helps when little data, easy to overfit)
            # **  IMPROVEMENT: nn.ReLU(),                          
            nn.Linear(64, 32),
            nn.ReLU(),
            # **  IMPROVEMENT: nn.Dropout(0.2),
            nn.Linear(32, 1),
            nn.Sigmoid()            
        )

    def forward(self, x):
        return self.model(x)
    

# initialize
model = NeuralNetwork(len(features_columns))
model

NeuralNetwork(
  (model): Sequential(
    (0): Linear(in_features=3, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=1, bias=True)
    (5): Sigmoid()
  )
)
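
The dropout layer mentioned in the commented-out improvements behaves differently in training and evaluation mode. The following sketch, using an arbitrary all-ones input and a fixed seed, shows that `nn.Dropout(p=0.2)` zeroes some entries during training and scales the survivors by 1/(1-p) = 1.25, while in evaluation mode it passes the input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)             # fixed seed so the dropped positions are repeatable
drop = nn.Dropout(p=0.2)
x = torch.ones(1000)

drop.train()                     # training mode: dropout is active
out_train = drop(x)              # each entry is either 0 or 1 / (1 - 0.2) = 1.25

drop.eval()                      # evaluation mode: dropout is a no-op
out_eval = drop(x)

print(out_train[:10])
print(torch.equal(out_eval, x))
```

This is why the training loop switches between `model.train()` and `model.eval()`: the same layers must behave differently in the two phases.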

### Loss function

Measure how inaccurate model predictions are and give gradient direction for optimization. The model **learns by MINIMIZING the loss function** (i.e. minimizing its errors).

<br><br>
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

**Where:**

- $n$ = number of samples  
- $y_i$ = actual (true) value  
- $\hat{y}_i$ = predicted value  
  
*scaled by* $\frac{1}{n}$ *so the loss is an average and does not grow with the number of samples*
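
To connect the formula to PyTorch, the sketch below computes MSE by hand and compares it with `nn.MSELoss`. It does the same for binary cross-entropy (`nn.BCELoss`), the loss usually preferred for binary classification. The labels and predicted probabilities are made-up values for illustration.

```python
import torch
import torch.nn as nn

y_true = torch.tensor([0.0, 1.0, 1.0, 0.0])   # hypothetical labels
y_prob = torch.tensor([0.1, 0.8, 0.6, 0.4])   # hypothetical sigmoid outputs

# Mean squared error: average of squared differences
mse_manual = ((y_true - y_prob) ** 2).mean()
mse_torch = nn.MSELoss()(y_prob, y_true)

# Binary cross-entropy: average negative log-likelihood of the true labels
bce_manual = -(y_true * torch.log(y_prob) + (1 - y_true) * torch.log(1 - y_prob)).mean()
bce_torch = nn.BCELoss()(y_prob, y_true)

print(mse_manual.item(), mse_torch.item())   # both approximately 0.0925
print(bce_manual.item(), bce_torch.item())
```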


In [538]:
loss = nn.MSELoss()

""" 
# For binary classification, pytorch's BCE:
loss = nn.BCELoss()
"""

" \n# For binary classification, pytorch's BCE:\nloss = nn.BCELoss()\n"

### Gradient Descent

Gradient descent is an optimization algorithm used to iteratively adjust parameters in order to minimize the loss function. 
- Computes the gradient (partial derivatives) of the loss function w.r.t the parameters
- Updates parameters in the direction of steepest descent (negative gradient)

<br><br>
Learning rate (lr): scaling factor that controls how much the model updates the weights at each step
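
The update rule can be sketched by hand on a toy problem. The example below minimizes the hypothetical one-parameter loss $(w - 3)^2$ with plain gradient descent, using autograd for the derivative. It is only an illustration of the idea: the notebook itself uses `optim.Adam`, an adaptive variant that also rescales each step.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # single parameter, starting at 0
lr = 0.1                                    # learning rate (step size)

for step in range(100):
    loss = (w - 3.0) ** 2                   # toy loss with its minimum at w = 3
    loss.backward()                         # compute d(loss)/dw into w.grad
    with torch.no_grad():
        w -= lr * w.grad                    # step against the gradient
    w.grad.zero_()                          # clear the gradient for the next step

print(w.item())   # very close to 3.0
```

With `lr = 0.1`, each step shrinks the distance to the minimum by a factor of 0.8. A much smaller rate would need many more steps to get equally close.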

In [539]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=0.0001)

""" Increase learning rate for faster convergence (convergence: reach optimal/stable model performance)
optimizer = optim.Adam(model.parameters(), lr=0.01)
"""

' Increase learning rate for faster convergence (convergence: reach optimal/stable model performance)\noptimizer = optim.Adam(model.parameters(), lr=0.01)\n'

## Train model

- 1 **epoch**: 1 complete pass through the entire training dataset
<br><br>

The model's predicted outputs look like, for example, tensor([0.5198]).
<br><br>
This is the probability produced by the sigmoid activation. In binary classification, it's common for the model to output a continuous value between 0 and 1 (through sigmoid). These probabilities are then typically thresholded (often at 0.5) to decide the final binary class (0 or 1)
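
A minimal sketch of the thresholding step, assuming three example probabilities (only the first value, 0.5198, comes from the text above; the others and the labels are invented for illustration):

```python
import torch

probs = torch.tensor([0.5198, 0.12, 0.93])    # sigmoid outputs
labels = torch.tensor([1.0, 1.0, 1.0])        # hypothetical true classes

preds = (probs >= 0.5).float()                # threshold at 0.5 -> tensor([1., 0., 1.])
accuracy = (preds == labels).float().mean().item() * 100

print(preds)
print(f"{accuracy:.1f}%")                     # 2 of 3 correct
```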

In [540]:
for epoch in range(20):
    model.train()                                   # training mode (enables dropout/batch norm updates if present)

    for x_batch, y_batch in train_loader:           # for each batch (data/x and label/y) in the train dataset
        optimizer.zero_grad()                       # reset gradients to zero before each batch
        y_pred = model(x_batch).squeeze(1)          # [batch, 1] -> [batch], so predictions line up with y_batch
        loss_val = loss(y_pred, y_batch)            # calculate loss (MSE)
        # print(f"Loss: {loss_val.item():.4f}")
        loss_val.backward()                         # backpropagation: compute gradients
        optimizer.step()                            # update model weights from the gradients, scaled by the learning rate

    model.eval()                                    # evaluation mode

    with torch.no_grad():                           # disable gradient tracking (saves memory, faster inference)
        correct = 0
        total = 0
        for x_batch, y_batch in test_loader:
            y_pred = model(x_batch).squeeze(1)      # shape [batch, 1] -> [batch], e.g. tensor([0.3274, 0.1942])
            predicted = (y_pred >= 0.5).float()     # boolean tensor([False, False]) -> tensor([0., 0.])
            total += y_batch.size(0)                # add batch size
            correct += (predicted == y_batch).sum().item()   # count matches

        accuracy = correct / total * 100            # calculate percentage
        print(f"Epoch {epoch + 1}, Accuracy: {accuracy:.1f}%")

print(f"Final accuracy: {accuracy:.1f}%")


Epoch 1, Accuracy: 0.0%
Epoch 2, Accuracy: 0.0%
Epoch 3, Accuracy: 0.0%
Epoch 4, Accuracy: 0.0%
Epoch 5, Accuracy: 0.0%
Epoch 6, Accuracy: 0.0%
Epoch 7, Accuracy: 0.0%
Epoch 8, Accuracy: 0.0%
Epoch 9, Accuracy: 0.0%
Epoch 10, Accuracy: 0.0%
Epoch 11, Accuracy: 0.0%
Epoch 12, Accuracy: 0.0%
Epoch 13, Accuracy: 0.0%
Epoch 14, Accuracy: 0.0%
Epoch 15, Accuracy: 0.0%
Epoch 16, Accuracy: 0.0%
Epoch 17, Accuracy: 0.0%
Epoch 18, Accuracy: 0.0%
Epoch 19, Accuracy: 0.0%
Epoch 20, Accuracy: 0.0%
Final accuracy: 0.0%
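
One detail of the training loop deserves a closer look: the `.squeeze` call on the model output. The model returns shape `[batch, 1]` while the labels have shape `[batch]`. With `nn.MSELoss`, a mismatch like this does not raise an error. It emits a warning and broadcasts, so every prediction is compared against every label. The sketch below uses made-up predictions and labels to show how the two results differ.

```python
import warnings

import torch
import torch.nn as nn

y_pred = torch.tensor([[0.2], [0.9], [0.6]])   # model-style output, shape [3, 1]
y_true = torch.tensor([0.0, 1.0, 1.0])         # labels, shape [3]

criterion = nn.MSELoss()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")            # silence the size-mismatch warning for this demo
    loss_broadcast = criterion(y_pred, y_true)         # [3, 1] vs [3] broadcasts to a 3x3 comparison

loss_aligned = criterion(y_pred.squeeze(1), y_true)    # [3] vs [3]: one prediction per label

print(loss_broadcast.item())   # about 0.314, not the intended loss
print(loss_aligned.item())     # about 0.07
```

Squeezing only dimension 1 also keeps a batch of size one as shape `[1]` instead of collapsing it to a scalar.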


### "If this was time series how would you take that into account and train it too"

- If data were a time series, we would need to account for the **sequential order**
    - capture **temporal dependencies** (relationships between past and future events)

<br>

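- set `shuffle=False` for train_loader and split the data chronologically rather than randomly: shuffling randomizes the order and breaks temporal dependencies, while sequential batches let the model see trends/seasonality in order
- consider a smaller batch size for train_loader: with unshuffled data each batch is a contiguous stretch of time, so smaller batches give updates that reflect more local patterns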
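
One way this could look in code, sketched under assumptions: a hypothetical univariate series of ten values, a window length of 3, and a chronological 80/20 split. Each input is a window of consecutive values and the target is the value that follows, so the order within a window carries the temporal information and the test set lies strictly after the training set.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

series = torch.arange(10, dtype=torch.float32)   # hypothetical series: 0, 1, ..., 9
window = 3                                       # assumed number of past steps per input

# Sliding windows: inputs[i] = series[i : i + window], targets[i] = series[i + window]
inputs = series.unfold(0, window, 1)[:-1]        # shape [7, 3]
targets = series[window:]                        # shape [7]

# Chronological split: earliest 80% for training, latest 20% for testing
split = int(0.8 * len(inputs))
x_train, x_test = inputs[:split], inputs[split:]
y_train, y_test = targets[:split], targets[split:]

# shuffle=False keeps the batches in time order
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=2, shuffle=False)

print(x_train.shape, x_test.shape)   # torch.Size([5, 3]) torch.Size([2, 3])
```

A sequence model such as an RNN would be a natural next step, but the windowed inputs above also work with the MLP from this notebook.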