## Q1

### (ii) Consider a 1-layer NN obtained by removing one of the hidden layers from the 2-layer NN above. Suppose the true data generating process is y = σ(x), where x ∼ N(0, 1). Generate n = 1,000,000 data points and fit both NNs by minimizing the average squared loss (you need not use backpropagation here; use scipy.optimize.minimize). Report training errors and optimized weights. Explain why in this case adding another layer increases the training error.

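#### The 1-layer NN, w2·σ(w1·x), contains the true function exactly: setting w1 = w2 = 1 gives y = σ(x), and the optimizer recovers these weights with a training error of about 5.9e-10. The 2-layer NN, w3·σ(w2·σ(w1·x)), has no bias terms and its second sigmoid cannot act as the identity, so the 1-layer NN is not nested inside it. As x → ±∞ its output tends to w3/2 at one end and w3·σ(w2) at the other, and neither limit can equal 0 unless w3 = 0, whereas σ(x) → 0 as x → −∞. No finite weights therefore reproduce σ(x), and the best achievable training error is strictly positive (about 0.0101 here). The loss is also non-convex, so BFGS may stop at a local minimum. The higher training error thus reflects misspecification of the deeper model rather than overfitting, which would lower the training error, not raise it.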
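The limiting behaviour of the 2-layer NN can be checked numerically. The following is a minimal sketch using only the standard library; the helper `two_layer_limits` is a hypothetical name introduced for illustration, and the weights are the fitted values reported in the output further down. It evaluates w3·σ(w2·σ(w1·x)) in the limits x → −∞ and x → +∞.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def two_layer_limits(w1, w2, w3):
    """Limits of w3 * sigmoid(w2 * sigmoid(w1 * x)) as x -> -inf and x -> +inf.

    Hypothetical helper for illustration only.
    """
    # The inner sigmoid saturates at 0 or 1 depending on the sign of w1.
    if w1 > 0:
        h1_neg, h1_pos = 0.0, 1.0
    elif w1 < 0:
        h1_neg, h1_pos = 1.0, 0.0
    else:
        h1_neg, h1_pos = 0.5, 0.5
    return w3 * sigmoid(w2 * h1_neg), w3 * sigmoid(w2 * h1_pos)


# Fitted 2-layer weights as printed in the notebook output.
fitted = (5.59337314, 3.46053203, 0.67314477)
low, high = two_layer_limits(*fitted)
print(f"output as x -> -inf: {low:.4f}")
print(f"output as x -> +inf: {high:.4f}")
```

At the fitted weights the output stays inside a band strictly within (0, 1), while the target σ(x) covers the whole interval, which is consistent with the non-zero training error.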

In [1]:
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

In [2]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

In [3]:
np.random.seed(0)  # For reproducibility
n = 1000000
x_data = norm.rvs(size=n)
y_data = sigmoid(x_data)  # True data generating process

In [4]:
# Define the 1-layer neural network function 
def nn_1_layer(weights, inputs):
    h1 = sigmoid(weights[0] * inputs)
    return weights[1] * h1

# Define the 2-layer neural network function
def nn_2_layer(weights, inputs):
    h1 = sigmoid(weights[0] * inputs)
    h2 = sigmoid(weights[1] * h1)
    return weights[2] * h2

# Define the loss function(MSE)
def loss(weights, inputs, true_outputs, nn_function):
    predictions = nn_function(weights, inputs)
    return np.mean((predictions - true_outputs) ** 2)

# Initial guess for weights
initial_weights = np.array([0.1, 0.1, 0.1])

# Train the 1-layer NN
res_1_layer = minimize(fun=loss, 
                       x0=initial_weights[:2], 
                       args=(x_data, y_data, nn_1_layer), method='BFGS')
trained_weights_1_layer = res_1_layer.x
training_error_1_layer = res_1_layer.fun

# Train the 2-layer NN
res_2_layer = minimize(fun=loss, 
                       x0=initial_weights, 
                       args=(x_data, y_data, nn_2_layer), method='BFGS')
trained_weights_2_layer = res_2_layer.x
training_error_2_layer = res_2_layer.fun




In [5]:
res_1_layer

  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 5.895695734947004e-10
        x: [ 9.998e-01  1.000e+00]
      nit: 9
      jac: [-7.026e-06 -9.299e-07]
 hess_inv: [[ 2.344e+01 -2.465e+00]
            [-2.465e+00  1.859e+00]]
     nfev: 30
     njev: 10

In [6]:
res_2_layer

  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 0.010075646133647132
        x: [ 5.593e+00  3.461e+00  6.731e-01]
      nit: 28
      jac: [-1.930e-07  4.368e-07  1.510e-06]
 hess_inv: [[ 6.460e+03  9.728e+02 -6.490e+00]
            [ 9.728e+02  1.332e+03 -3.151e+01]
            [-6.490e+00 -3.151e+01  1.590e+00]]
     nfev: 136
     njev: 34

In [7]:
print('Training error for 1-layer NN:', training_error_1_layer)
print('Training error for 2-layer NN:', training_error_2_layer)
print('Difference in training error:', training_error_1_layer - training_error_2_layer)
print('Optimized weights for 1-layer NN:', trained_weights_1_layer)
print('Optimized weights for 2-layer NN:', trained_weights_2_layer)


Training error for 1-layer NN: 5.895695734947004e-10
Training error for 2-layer NN: 0.010075646133647132
Difference in training error: -0.010075645544077558
Optimized weights for 1-layer NN: [0.99982998 1.00001655]
Optimized weights for 2-layer NN: [5.59337314 3.46053203 0.67314477]


## Q2

### Use the dataset card_transdata.csv from the previous homework and maintain the same train-test split.
### Fit a feedforward neural network with two ReLU layers using stochastic gradient descent (SGD). Follow this tutorial. Experiment with the number of neurons per layer, the number of epochs, the learning rate for SGD, and the batch size for backpropagation. Report accuracy and F1 score on the test sample. Does your model perform better than a simple decision tree from the last homework?

#### After experimenting with the number of neurons per layer, the number of epochs, the SGD learning rate, and the batch size, the neural network still scores lower on the test sample than the simple decision tree: F1 of 0.944 and accuracy of 0.9904 for the network, against F1 of 0.9998 and accuracy of 0.99997 for the tree.
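Accuracy and F1 can diverge when one class is rare, which is why both are reported. The sketch below uses only the standard library and made-up confusion-matrix counts (they are not taken from this dataset); `accuracy_and_f1` is a hypothetical helper written for illustration.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 for the positive class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts: 1,000 transactions, 80 of them fraudulent.
acc, f1 = accuracy_and_f1(tp=70, fp=5, fn=10, tn=915)
print(f"accuracy = {acc:.3f}, F1 = {f1:.3f}")
```

Because the many true negatives count toward accuracy but not toward F1, a handful of missed or falsely flagged fraud cases lowers F1 much more than accuracy.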

In [8]:
import pandas as pd

df = pd.read_csv('W5_card_transdata-1.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  int64  
 4   used_chip                       1000000 non-null  int64  
 5   used_pin_number                 1000000 non-null  int64  
 6   online_order                    1000000 non-null  int64  
 7   fraud                           1000000 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 61.0 MB


|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1 | 1 | 0 | 0 | 0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1 | 0 | 0 | 0 | 0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1 | 0 | 0 | 1 | 0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1 | 1 | 0 | 1 | 0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1 | 1 | 0 | 1 | 0 |


In [14]:
from sklearn.preprocessing import StandardScaler

train_size = 500000

scaler = StandardScaler()
X = df.drop('fraud', axis=1)
X = scaler.fit_transform(X)
y = df['fraud'].values

X_train, X_test, y_train, y_test = X[:train_size], X[train_size:], y[:train_size], y[train_size:]
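The cell above fits the scaler on all rows before splitting, so test-set statistics enter the scaling. A common alternative is to estimate the mean and standard deviation on the training rows only and reuse them for the test rows. The following is a minimal standard-library sketch of that idea on made-up data; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, not part of scikit-learn.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Per-column mean and population std, estimated from the given rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against zero variance by falling back to a scale of 1.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Standardize rows with previously fitted means and stds."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up data: the first three rows play the role of the training set.
rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [100.0, 0.0]]
train_rows, test_rows = rows[:3], rows[3:]

means, stds = fit_standardizer(train_rows)       # statistics from train only
train_scaled = apply_standardizer(train_rows, means, stds)
test_scaled = apply_standardizer(test_rows, means, stds)
print(means, stds)
print(test_scaled)
```

With scikit-learn the same pattern is usually written as `fit_transform` on the training rows followed by `transform` on the test rows. With a split this large the difference in results is likely small, but it keeps the test sample out of preprocessing.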

In [15]:
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

#--- Define the hyperparameters
learning_rate = 1e-3
batch_size = 64
epochs = 10
#---------------------------


class CustomDataset(Dataset):
    def __init__(self, X, Y):
        self.X = torch.tensor(X, dtype=torch.float32)  # Convert X to a PyTorch tensor
        self.Y = torch.tensor(Y, dtype=torch.long)  # Convert Y to a PyTorch tensor
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]
    

train_data = CustomDataset(X_train, y_train)
test_data = CustomDataset(X_test, y_test)
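# Use the batch_size hyperparameter so the progress counter in train_loop stays consistent
train_dataloader = DataLoader(train_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)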


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(7, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 2),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()


In [16]:
# Initialize the loss function
loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

In [17]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * batch_size + len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

In [18]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")


Epoch 1
-------------------------------
loss: 0.674639  [   64/500000]
loss: 0.428592  [ 6464/500000]
loss: 0.358829  [12864/500000]
loss: 0.299000  [19264/500000]
loss: 0.231646  [25664/500000]
loss: 0.289466  [32064/500000]
loss: 0.254577  [38464/500000]
loss: 0.339768  [44864/500000]
loss: 0.212799  [51264/500000]
loss: 0.242656  [57664/500000]
loss: 0.173129  [64064/500000]
loss: 0.180806  [70464/500000]
loss: 0.228233  [76864/500000]
loss: 0.183010  [83264/500000]
loss: 0.264446  [89664/500000]
loss: 0.217681  [96064/500000]
loss: 0.098201  [102464/500000]
loss: 0.169662  [108864/500000]
loss: 0.153682  [115264/500000]
loss: 0.159917  [121664/500000]
loss: 0.228816  [128064/500000]
loss: 0.134855  [134464/500000]
loss: 0.210606  [140864/500000]
loss: 0.195117  [147264/500000]
loss: 0.119279  [153664/500000]
loss: 0.127207  [160064/500000]
loss: 0.113220  [166464/500000]
loss: 0.103192  [172864/500000]
loss: 0.120505  [179264/500000]
loss: 0.104364  [185664/500000]
loss: 0.104943  

In [23]:
from sklearn.metrics import f1_score, accuracy_score

# Calculate the F1 score and accuracy
y_pred = model(torch.tensor(X_test, dtype=torch.float32)).argmax(1).numpy()
f1 = f1_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Print the test F1 score and accuracy of the final model
print('NN Model')
print("Testing F1 Score:", f1)
print("Testing Accuracy:", accuracy)


NN Model
Testing F1 Score: 0.9440298073004599
Testing Accuracy: 0.990386


In [24]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

clf_original = DecisionTreeClassifier(criterion='entropy', random_state=123)

# Fit the classifier to the original training data
clf_original.fit(X_train, y_train)

# Make predictions on the original training and testing data
y_train_pred_original = clf_original.predict(X_train)
y_test_pred_original = clf_original.predict(X_test)

# Calculate evaluation metrics for the original training data
train_accuracy_original = accuracy_score(y_train, y_train_pred_original)
train_f1_original = f1_score(y_train, y_train_pred_original)

# Calculate evaluation metrics for the testing data
test_accuracy_original = accuracy_score(y_test, y_test_pred_original)
test_f1_original = f1_score(y_test, y_test_pred_original)

print('Original Decision Tree')
print('Testing F1:', test_f1_original)
print('Testing Accuracy:', test_accuracy_original)


Original Decision Tree
Testing F1: 0.9998284949863367
Testing Accuracy: 0.99997
