# PyTorch Ultimate 2024 - Bert Gollnick

### 1. Course Setup

### 2. Machine Learning

### 3. Deep Learning Introduction

Layer Types:
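- Dense Layer - every perceptron is connected to every perceptron in the previous layer
- Convolutional Layer - layers consist of "filters" that slide over the input, so each perceptron connects only to a local region
- Recurrent Neural Networks - feed their own output from the previous time step back in as an input, which carries context along a sequence
- Long Short-Term Memory - a recurrent layer that uses a 'memory cell' to keep information across long temporal sequences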

Activation Functions:
- ReLU: x for x >= 0, 0 for x < 0
    - LeakyReLU: x for x >= 0, x * a for x < 0, a is usually .01
        - this ensures the gradient is never 0
- tanh - nonlinear, but has a small range (*normalize*), activation between -1 and 1
- sigmoid - nonlinear, activation between 0 and 1 -> better for probability
- softmax - probability among n classes, used for multi-class classification
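
A quick sketch of the activations above using PyTorch's built-in functions. The input values are made up for illustration, and `negative_slope=0.01` is the `a` from the LeakyReLU note:

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0])  # made-up inputs

relu = torch.relu(x)                                             # 0 for x < 0
leaky = torch.nn.functional.leaky_relu(x, negative_slope=0.01)   # x * 0.01 for x < 0
tanh = torch.tanh(x)                                             # values in (-1, 1)
sigmoid = torch.sigmoid(x)                                       # values in (0, 1)
softmax = torch.softmax(x, dim=0)                                # sums to 1 over the n classes

print(relu)
print(leaky)
print(softmax.sum())
```

Softmax is the only one here that looks at all values together, which is why it needs a `dim` argument.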

Loss Functions:
- Regression
    - Mean Squared Error
    - Mean Absolute Error - MSE w/ abs instead of square
    - Mean Bias Error - MAE without the abs, so positive and negative errors can cancel out
    - Output layer has 1 node for a single target value, typically used with a linear activation function
- Binary Classification
    - Binary Cross Entropy
    - Hinge (SVM) Loss
    - Output layer must have 1 node, typically used with sigmoid activation
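- Multi-class Classification
    - Categorical Cross Entropy
    - Output layer has n nodes (one per class), typical activation function is softmax
    - (true multi-label problems, where several classes can be active at once, use a sigmoid per node instead)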
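
A minimal sketch of these losses with `torch.nn.functional`. The predictions and targets are made-up numbers; Mean Bias Error has no built-in function that I know of, so it is computed by hand here, and the sign convention (prediction minus target) is an assumption:

```python
import torch
import torch.nn.functional as F

# Regression: one output node, linear activation
pred = torch.tensor([2.0, 4.0, 7.0])
target = torch.tensor([1.0, 5.0, 5.0])
mse = F.mse_loss(pred, target)   # mean of squared errors
mae = F.l1_loss(pred, target)    # mean of absolute errors
mbe = (pred - target).mean()     # signed errors, so they can cancel out

# Binary classification: one output node with sigmoid
prob = torch.sigmoid(torch.tensor([2.0, -1.0]))
bce = F.binary_cross_entropy(prob, torch.tensor([1.0, 0.0]))

# n classes: n output nodes; cross_entropy applies log-softmax internally
logits = torch.tensor([[2.0, 0.5, -1.0]])
ce = F.cross_entropy(logits, torch.tensor([0]))

print(mse, mae, mbe)
print(bce, ce)
```

Note that `F.cross_entropy` expects raw logits, so the model should not apply softmax itself when this loss is used.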

Optimizers:
- Gradient Descent
    - Learning rate: too large overshoots the minimum, too small takes too long to converge
- Adagrad - adapts learning rate to features, works well for sparse data sets
- Adam - ADAptive Moment estimation, includes previous gradients in calculation, popular
- Stochastic Gradient Descent (update per sample or mini-batch), Batch Gradient Descent (update on the whole data set)
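
The learning-rate trade-off can be seen on a toy problem. This is a plain-Python sketch, assuming the function f(x) = x^2 (not from the course) with gradient 2x; `gradient_descent` is a hypothetical helper name:

```python
def gradient_descent(lr, steps=20, x0=5.0):
    """Minimize f(x) = x^2 with plain gradient descent; f'(x) = 2x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # step against the gradient
    return x


too_small = gradient_descent(lr=0.001)  # barely moves away from 5.0
reasonable = gradient_descent(lr=0.1)   # gets close to the minimum at 0
too_large = gradient_descent(lr=1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```

With `lr=1.1` each step flips the sign of x and increases its magnitude, which is the "misses the minimum" case.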

Frameworks:
- TensorFlow - most popular, made by Google
    - he's making it seem like we're using tensorflow -_-

### 4. Model Evaluation
- High Bias = Low Accuracy, High Variance = Low Precision
    - High Bias means the R^2 values are low on both training and validation (underfitting)
    - High Variance means the difference between the R^2 values of training and validation is high (overfitting)
- General rule: More **complex models** -> Lower Bias and More Variance
- Low variance algorithms: Linear Regression, LDA, Logistic Regression
- High variance algorithms: Decision Trees, kNN, SVM
- <img src="tttgraph.png" width="300" height="260" alt="train-test trend graph"> <img src="bvtgraph.png" width="300" height="260" alt="bias-variance graph">
- Resampling: e.g. 5-fold cross-validation, train 5 models using 80/20 train/test splits, each holding out a different 20%, so that all data is used for validation once
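
A sketch of the resampling idea with scikit-learn's `KFold`, also printing training vs. validation R^2 as in the bias/variance notes. The synthetic data and the choice of `LinearRegression` are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# synthetic regression data (made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
train_scores, val_scores = [], []
for train_idx, val_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    train_scores.append(model.score(X[train_idx], y[train_idx]))  # R^2 on the 80%
    val_scores.append(model.score(X[val_idx], y[val_idx]))        # R^2 on the held-out 20%

print(np.mean(train_scores), np.mean(val_scores))
```

Low scores in both lists would point to high bias; a large gap between them would point to high variance.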

### 5. Neural Network from Scratch
- working on files 015_NeuralNetworkFromScratch/*
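- `StandardScaler` from `sklearn.preprocessing` to standardize features (zero mean, unit variance)
    - `X_train_scale = scaler.fit_transform(X_train)` (fit on the training data only)
    - `X_test_scale = scaler.transform(X_test)` (reuse the training mean and std)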
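
A small runnable version of the scaling step, assuming made-up arrays for `X_train` and `X_test`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])  # made-up data
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train)  # learns mean/std from the training data
X_test_scale = scaler.transform(X_test)        # applies the training mean/std, no refit

print(X_train_scale.mean(axis=0), X_train_scale.std(axis=0))
print(X_test_scale)
```

Calling `fit_transform` on the test set as well would leak test statistics into preprocessing, which is why only `transform` is used there.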

### 6. Tensors
- gradients are calculated automatically
- working on file 020_TensorIntro/Tensors.py

```python
import torch

# create tensor with gradients enabled
x = torch.tensor(1.0, requires_grad=True)
# create second tensor depending on first tensor
y = (x - 3) * (x - 6) * (x - 4)
# calculate gradients
y.backward()  # this populates the grad of the x tensor
# show gradient of first tensor
print(x.grad)  # tensor(31.)
```
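
The gradient of 31 can be checked by hand with the product rule, evaluated at x = 1:

```latex
\begin{aligned}
y &= (x-3)(x-6)(x-4) \\
\frac{dy}{dx} &= (x-6)(x-4) + (x-3)(x-4) + (x-3)(x-6) \\
\left.\frac{dy}{dx}\right|_{x=1} &= (-5)(-3) + (-2)(-3) + (-2)(-5) = 15 + 6 + 10 = 31
\end{aligned}
```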


### 7. PyTorch Modeling Introduction
working on files 030_ModelingIntroduction/*

- 00 - linear regression from scratch
- 10 - linear regression with model class
     - more epochs: takes longer to train, usually a better fit, but higher chance of overfitting or instability
- 20 - passing data as batch is literally just a slice from the tensor
     - small batch size:
        - less GPU memory usage, more iterations per epoch, noisier updates (less training stability)
     - bigger batch sizes are the opposite
- 30 - `from torch.utils.data import Dataset, DataLoader`
- 40 - model saving/loading with `torch.save()` and `torch.load()`
     - save the state dictionary (`model.state_dict()`) to a `.pth` file
- 50 - hyperparameter tuning
     - packages: Ray Tune, Optuna, skorch
     - hyperparams:
        - topology: number of nodes, layer types, activation functions
            - more hidden layers/nodes per layer: can learn more complex patterns
            - fewer hidden layers/nodes per layer: less training time, less inference time, less risk of overfitting
        - objects: loss function, optimizer
        - training: learning rate, batch size, number of epochs
     - types of hyperparam tuning: grid (test all combinations of guesses), random
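
A compact sketch of items 30 and 40: wrapping tensors in a `Dataset`, batching with a `DataLoader`, then saving and reloading a state dictionary. The class name `LinearRegressionDataset`, the toy data, and the temp-file path are assumptions, not the course's code:

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader


class LinearRegressionDataset(Dataset):  # hypothetical name
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# toy data: y = 2x + 1
X = torch.arange(10, dtype=torch.float32).reshape(-1, 1)
y = 2 * X + 1

loader = DataLoader(LinearRegressionDataset(X, y), batch_size=4, shuffle=False)
batch_shapes = [tuple(xb.shape) for xb, yb in loader]  # 10 samples in batches of 4
print(batch_shapes)

model = torch.nn.Linear(1, 1)

# save only the learned parameters, then load them into a fresh model
path = os.path.join(tempfile.mkdtemp(), "model_state_dict.pth")
torch.save(model.state_dict(), path)
restored = torch.nn.Linear(1, 1)
restored.load_state_dict(torch.load(path))
```

Saving the state dictionary rather than the whole model object means the model class has to be constructed again before loading, as done with `restored` here.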


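```python
import torch

# The model, loss function, and optimizer are usually separate objects.
# nn.Sequential only accepts modules as layers, so the loss and optimizer
# cannot be passed into it; a model.step() helper would have to be
# implemented manually. The explicit training loop is shown instead.
epochs = 1000
# placeholder data: 2 batches of 4 samples, 128 features, 10 classes
data_loader = [(torch.randn(4, 128), torch.randint(0, 10, (4,))) for _ in range(2)]

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
    torch.nn.LogSoftmax(dim=1),  # dim should be given explicitly
)
loss_fn = torch.nn.NLLLoss()  # pairs with LogSoftmax output
optimizer = torch.optim.Adam(model.parameters(), lr=.01)

for epoch in range(epochs):
    for feature, label in data_loader:
        optimizer.zero_grad()                  # reset gradients from the last step
        loss = loss_fn(model(feature), label)  # forward pass and loss
        loss.backward()                        # backpropagation
        optimizer.step()                       # update the weights
```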


### 8. Classification
- confusion matrix
- ROC Curve - TPR (y-axis) plotted against FPR (x-axis) across classification thresholds
- all work done in folder 045
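
A minimal sketch of both evaluation tools with `sklearn.metrics`, assuming made-up labels and predicted probabilities and a 0.5 decision threshold:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1 (made up)
y_pred = (y_score >= 0.5).astype(int)      # hard predictions at threshold 0.5

cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)       # area under the ROC curve

print(cm)
print(auc)
```

The confusion matrix depends on the chosen threshold, while the ROC curve sweeps over all thresholds.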