Notes:
- All titles marked with `**` and all orange comments are to be discussed AFTER the first build, under "how to make the model better"
- Here, imports are added where they are first needed. During the interview, add them as they come up too, but place them at the top of the file

### Load and inspect data

In [27]:
import pandas as pd

data_path = 'data.csv'
data = pd.read_csv(data_path)       # dataframe

print(data)

   x_0  x_1  x_2  y
0  1.0    0    0  0
1  0.0    0    5  0
2  1.0    1    3  1
3  0.0    1    1  0
4  0.0    1    1  1
5  0.0    1    2  0
6  1.0    0    1  1
7  1.1    0    1  0
8  1.0    0    0  1


#### Features, data points

In [28]:
# FEATURES: measurable properties/attributes we can use to predict
# check number of data points (rows) and number of features (columns except target 'y' in this case)

print(data.shape)       # tuple (rows, columns)

(9, 4)


- 9 rows (data points)
- 4 columns
    - 3 features (the 4th column, 'y', is the target)
    - 3 x 9 = 27 feature values in total

#### Range of features

By knowing the range of each feature, we can apply proper normalization (transforming feature values to a standard scale) so that all features contribute proportionally during training.
- For example, if the range of one feature is 1000x larger than another's, then during loss minimization the gradients associated with the larger-scaled feature will likely be larger too. This imbalance can cause the optimizer to overemphasize that feature even if it is not very influential for the prediction itself, skewing the weight updates and hurting the overall training process

In [29]:
# determine range of each feature:      max - min

features_columns = [col for col in data if col != 'y']
features_ranges = {}

for feature in features_columns:
    min_val = data[feature].min()
    max_val = data[feature].max()
    features_ranges[feature] = float(max_val - min_val)

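print("Range of features: ")
for feature_range in features_ranges.items():   # not named `range`: that would shadow the built-in used later by the training loop
    print(feature_range)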

Range of features: 
('x_0', 1.1)
('x_1', 1.0)
('x_2', 5.0)
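
To make the scale argument concrete, here is a minimal sketch (not part of the original notebook) that computes the MSE gradient of a two-feature linear model by hand. The toy values, the starting weights of zero, and the helper name `mse_gradient` are illustrative assumptions; the second feature is simply the first one multiplied by 1000.

```python
def mse_gradient(rows, targets, weights):
    """Gradient of the mean squared error for y_hat = sum(w_j * x_j), one entry per weight."""
    n = len(rows)
    grads = [0.0] * len(weights)
    for x, y in zip(rows, targets):
        y_hat = sum(w * x_j for w, x_j in zip(weights, x))
        error = y_hat - y
        for j, x_j in enumerate(x):
            grads[j] += 2 * error * x_j / n
    return grads


# hypothetical data: feature 1 is feature 0 on a 1000x larger scale
small = [0.0, 1.0, 1.0, 0.0]
rows = [(s, s * 1000) for s in small]
targets = [0.0, 1.0, 1.0, 0.0]

grads = mse_gradient(rows, targets, weights=[0.0, 0.0])
ratio = abs(grads[1]) / abs(grads[0])
print(grads, ratio)
```

Both features carry exactly the same information here, yet the gradient for the larger-scaled one is larger by the same factor as its scale, which is the imbalance normalization is meant to remove.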


### Model and package selection

- Because the target column consists of 0s and 1s, this is a binary classification problem (predicting y from x features)
    - Use a multi-layer perceptron (MLP)

- Use pytorch to define, train and evaluate the model
- Use scikit-learn to split the data

In [30]:
import torch

# separate features and target
features_values = data[features_columns].values
target_values = data['y'].values

# convert to tensors    (tensors: multidimensional homogeneous data structures, good for parallelism and have many operation optimizations in packages like pytorch)
features_values = torch.tensor(features_values, dtype=torch.float32)
target_values = torch.tensor(target_values, dtype=torch.float32)
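
One detail that tends to matter later: a model ending in `nn.Linear(..., 1)` returns predictions of shape `(N, 1)`, while a target column converted as above has shape `(N,)`. The sketch below uses the `y` column from the printed data and hypothetical variable names to show one way of aligning the two with `unsqueeze`. Whether to reshape the target or squeeze the prediction instead is a design choice.

```python
import torch

# the 'y' column as printed in the data above
y_column = [0, 0, 1, 0, 1, 0, 1, 0, 1]

target = torch.tensor(y_column, dtype=torch.float32)   # shape (9,)
target_2d = target.unsqueeze(1)                        # shape (9, 1), matches an (N, 1) model output

print(target.shape, target_2d.shape)
```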

### `**` Normalization / scale data `**`

StandardScaler standardizes features by rescaling them to have a mean of 0 and a standard deviation of 1.
- Standardization does NOT change the shape of the distribution: it does NOT transform the data into a Gaussian distribution, it only shifts and rescales the values
    - i.e. the DISTRIBUTION of the original data remains the same, but the values are shifted so that the mean is 0 and rescaled so that the standard deviation is 1

<br><br>
$x' = \frac{x - \mu}{\sigma}$

**Where:**

- $x$ = original data point  
- $\mu$ = mean of the feature (before standardization)  
- $\sigma$ = standard deviation of the feature (before standardization)  
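
As a worked instance of the formula, the sketch below standardizes the `x_2` column from the printed data using only the standard library. It assumes the population standard deviation (dividing by n), which is what `StandardScaler` uses; the variable names are illustrative.

```python
import statistics

# the 'x_2' column as printed in the data above
x_2 = [0, 5, 3, 1, 1, 2, 1, 1, 0]

mu = statistics.fmean(x_2)       # mean before standardization
sigma = statistics.pstdev(x_2)   # population standard deviation before standardization

standardized = [(x - mu) / sigma for x in x_2]

new_mean = statistics.fmean(standardized)
new_std = statistics.pstdev(standardized)
print(round(mu, 4), round(sigma, 4))
```

After the transform the mean is 0 and the standard deviation is 1 (up to floating-point error), while the ordering and relative spacing of the values are unchanged.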


In [31]:
"""
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_values = scaler.fit_transform(features_values)
"""

'\nfrom sklearn.preprocessing import StandardScaler\n\nscaler = StandardScaler()\nfeatures_values = scaler.fit_transform(features_values)\n'

### Split the data (80% train, 20% test)

In [32]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features_values, target_values, test_size = 0.2)
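
If scaling is added later, one common precaution is to compute the mean and standard deviation from the training split only and reuse them for the test split, so that no test-set statistics leak into training. The sketch below does this directly with tensor operations on made-up values; `x_train_demo`, `x_test_demo` and the `clamp` guard against zero variance are assumptions for illustration, not part of the notebook.

```python
import torch

# hypothetical splits (values chosen for illustration only)
x_train_demo = torch.tensor([[1.0, 0.0, 0.0],
                             [0.0, 0.0, 5.0],
                             [1.0, 1.0, 3.0],
                             [0.0, 1.0, 1.0]])
x_test_demo = torch.tensor([[0.0, 1.0, 2.0]])

# statistics come from the training split only
mean = x_train_demo.mean(dim=0)
std = x_train_demo.std(dim=0, unbiased=False).clamp(min=1e-8)   # avoid division by zero

# the same mean/std are applied to both splits
x_train_scaled = (x_train_demo - mean) / std
x_test_scaled = (x_test_demo - mean) / std
```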

### Create pytorch Dataset and Dataloader

- Dataset: stores the samples and their labels
- DataLoader: wraps an iterable around the `Dataset` for easy access to the samples. It feeds the data to the model in batches, and can optionally shuffle it and load it in parallel with worker processes

<br>

- Batch: a subset of the training data processed together in one forward/backward pass
    - batch size value depends on memory constraints, model size, dataset size, etc

In [39]:
from torch.utils.data import TensorDataset, DataLoader

# create datasets
train_dataset = TensorDataset(x_train, y_train)
test_dataset = TensorDataset(x_test, y_test)

# create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

"""
**ADD SHUFFLE**
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
"""

'\n**ADD SHUFFLE**\ntrain_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)\ntest_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)\n'
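
To see what batching does in practice, the sketch below loads 9 dummy samples with a smaller batch size than the one used above. The tensors and the batch size of 4 are illustrative assumptions. With `batch_size=32` and fewer than 32 training rows, each epoch would consist of a single batch.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 9 dummy samples with 3 features each, plus one label per sample
features = torch.arange(27, dtype=torch.float32).reshape(9, 3)
labels = torch.tensor([0., 0., 1., 0., 1., 0., 1., 0., 1.])

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False)

# each iteration yields one (features, labels) batch
batch_sizes = [batch_x.shape[0] for batch_x, batch_y in loader]
print(batch_sizes)
```

The last batch is smaller because 9 is not a multiple of 4; `DataLoader` keeps it unless `drop_last=True` is passed.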

### Neural Network

<i>Note: Layer sizes decrease gradually to funnel information</i>

**Architecture**

- Input layer: (# features) neurons
<br><br>
- Hidden layer 1: 64 neurons (a common default width; the usual 2-3x-input-features rule of thumb would suggest far fewer for the 3 features here)
<br><br>
- Hidden layer 2: 32 neurons (half previous layer for gradual dimension reduction)
<br><br>
- Output layer: 1 neuron (for binary classification)
<br><br>
- Activation function: **ReLU**. max(0, x) - returns x if positive, 0 otherwise
    - helps mitigate the vanishing gradient problem (when the gradients used to update the network become very small, so the network learns too slowly or not at all)
<br><br>
- Sigmoid: squashes the output to between 0 and 1, so it can be read as the probability of class 1 (for a binary classification problem)
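
The two functions named in the list are simple enough to write out by hand. The sketch below is a plain-Python illustration with hypothetical helper names, not how PyTorch implements them.

```python
import math


def relu(x):
    """max(0, x): passes positive values through, clips the rest to 0."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


relu_values = [relu(v) for v in (-2.0, 0.0, 3.0)]
sigmoid_values = [sigmoid(v) for v in (-4.0, 0.0, 4.0)]
print(relu_values, sigmoid_values)
```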

In [48]:
import torch.nn as nn

class NeuralNetwork(nn.Module):         # nn.Module is the base class for all neural networks. Our model will be a subclass that inherits this superclass
    def __init__(self, input_size):     # input_size: number of the features, `len(features_columns)`
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(input_size, 64),          # hidden layer 1: input -> 64
            nn.ReLU(),
            # **  IMPROVEMENT: nn.Dropout(0.2),
            nn.Linear(64, 32),                  # hidden layer 2: 64 -> 32
            nn.ReLU(),
            nn.Linear(32, 1),                   # output layer: 32 -> 1
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)


# initialize
model = NeuralNetwork(len(features_columns))
model

NeuralNetwork(
  (model): Sequential(
    (0): Linear(in_features=3, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=1, bias=True)
    (5): Sigmoid()
  )
)
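
A quick sanity check after defining a network is to push a dummy batch through it and confirm the output shape. The sketch below rebuilds the 3 -> 64 -> 32 -> 1 layout from the architecture list as a standalone `nn.Sequential` (the name `sketch` and the all-zeros batch are illustrative) and also counts the trainable parameters.

```python
import torch
import torch.nn as nn

# standalone copy of the layout described in the architecture list
sketch = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

dummy_batch = torch.zeros(4, 3)          # 4 samples, 3 features
with torch.no_grad():                    # no gradients needed for a shape check
    out = sketch(dummy_batch)

# weights + biases: (3*64 + 64) + (64*32 + 32) + (32*1 + 1)
n_params = sum(p.numel() for p in sketch.parameters())
print(out.shape, n_params)
```

If consecutive `nn.Linear` sizes do not line up, the forward pass fails with a shape error, which this check surfaces before any training starts.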

### Loss function

The loss function measures how inaccurate the model's predictions are and provides the gradient direction for optimization. The model **learns by MINIMIZING the loss function** (i.e. minimizing its errors).

<br><br>
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

**Where:**

- $n$ = number of samples  
- $y_i$ = actual (true) value  
- $\hat{y}_i$ = predicted value  
  
*averaged with* $\frac{1}{n}$ *so the loss (and its gradient) does not grow with the number of samples*
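
As a small worked example, the sketch below evaluates the MSE formula by hand on four made-up predictions and, for comparison, the binary cross-entropy that these notes list as the alternative for binary classification, $BCE = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$. The labels and predictions are illustrative assumptions.

```python
import math

y_true = [0, 0, 1, 0]            # actual labels (made up)
y_pred = [0.1, 0.2, 0.9, 0.4]    # predicted probabilities (made up)

n = len(y_true)

# mean squared error, term by term as in the formula above
mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

# binary cross-entropy on the same predictions
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(y_true, y_pred)) / n

print(round(mse, 4), round(bce, 4))
```

Cross-entropy penalizes a confident wrong probability much more steeply than MSE does, which is one reason it is usually preferred with a sigmoid output.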


In [50]:
loss = nn.MSELoss()     # instantiate the loss module; call it later as loss(predictions, targets)

""" 
# For binary classification, pytorch's BCE:
loss = nn.BCELoss()
"""

" \n# For binary classification, pytorch's BCE:\nloss = nn.BCELoss()\n"

### Gradient Descent

Gradient descent is an optimization algorithm used to iteratively adjust parameters in order to minimize the loss function. 
- Computes the gradient (partial derivatives) of the loss function w.r.t. the parameters
- Updates parameters in the direction of steepest descent (negative gradient)

<br><br>
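Learning rate (lr): scaling factor that controls how much the model updates the weights at each step. The cell below uses Adam, a gradient-descent variant that adapts the step size for each parameter.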

In [52]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=0.0001)
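
The update rule itself can be illustrated without any library. The sketch below is plain gradient descent on a one-parameter toy loss, $(w - 3)^2$, not Adam; the loss, the starting point and the learning rate are chosen purely for illustration.

```python
def loss_fn(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad_fn(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0     # starting parameter value
lr = 0.1    # learning rate

for step in range(100):
    w = w - lr * grad_fn(w)   # step in the direction of the negative gradient

print(w, loss_fn(w))
```

Each step shrinks the distance to the minimum by a constant factor here (0.8 with this learning rate), so `w` converges towards 3. A much larger learning rate would overshoot, and a much smaller one would need many more steps.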

### Train model

- 1 Epoch: 1 complete pass through the entire training dataset

In [None]:
for epoch in range(20):
    model.train()
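
The loop above is cut off in these notes. For reference, a typical PyTorch training epoch follows the pattern below: forward pass, loss, `zero_grad`, `backward`, `step`. This is a self-contained sketch under assumptions, not the notebook's original code: it reuses the 9 rows printed earlier, trains on all of them without a split, uses `nn.BCELoss` and a larger learning rate than the cell above, and introduces hypothetical names such as `demo_model` and `epoch_losses`.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)   # reproducible initialization and shuffling

# the 9 rows printed at the top of the notes; targets reshaped to (N, 1)
x = torch.tensor([[1.0, 0, 0], [0.0, 0, 5], [1.0, 1, 3], [0.0, 1, 1], [0.0, 1, 1],
                  [0.0, 1, 2], [1.0, 0, 1], [1.1, 0, 1], [1.0, 0, 0]], dtype=torch.float32)
y = torch.tensor([0., 0., 1., 0., 1., 0., 1., 0., 1.]).unsqueeze(1)

loader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=True)

demo_model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = optim.Adam(demo_model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(20):
    demo_model.train()                      # training mode (matters for dropout/batchnorm)
    running_loss = 0.0
    for batch_x, batch_y in loader:
        optimizer.zero_grad()               # clear gradients from the previous step
        predictions = demo_model(batch_x)   # forward pass
        batch_loss = loss_fn(predictions, batch_y)
        batch_loss.backward()               # backpropagation
        optimizer.step()                    # parameter update
        running_loss += batch_loss.item() * batch_x.shape[0]
    epoch_losses.append(running_loss / len(loader.dataset))   # mean loss per sample

demo_model.eval()                           # evaluation mode
with torch.no_grad():
    probabilities = demo_model(x)           # predicted probability of class 1 per row
```

With only 9 rows, and some rows sharing identical features but different labels, the loss cannot reach zero. The sketch is meant to show the order of operations, not a meaningful result.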

    