# Assignment 2: Convolutional Neural Networks
Instructions: In Assignment 2, you will learn about convolutional neural networks. In particular, you will gain first-hand experience of the training process, understand the architectural details, and familiarize yourself with transfer learning with deep networks.

## Part 1: Convolutional Neural Networks
In this part, you will experiment with a convolutional neural network implementation to perform image classification. The dataset we will use for this assignment was created by Zoya Bylinskii, and contains 451 works of art from 11 different artists all downsampled and padded to the same size. The task is to identify which artist produced each image. The original images can be found in the `art_data/artists` directory included with the data zip file. The composition of the dataset and a sample painting from each artist are shown in Table 1.

Figure 1 shows an example of the type of convolutional architecture typically employed for similar image recognition problems. Convolutional layers apply filters to the image, and produce layers of
feature maps. Often, the convolutional layers are interspersed with pooling layers. The final layers of the network are fully connected, and lead to an output layer with one node for each of the K classes
the network is trying to detect. We will use a similar architecture for our network.

![](figures/figure1.jpg)

The code for performing the data processing and training the network is provided in the starter
pack. You will use PyTorch to implement convolutional neural networks. We create a dataset from the artists’ images by downsampling them to 50x50 pixels, and transforming the RGB values to lie within the range $[-0.5, 0.5]$. We provide a lot of starter code below, but you will need to modify the hyperparameters and network structure.

### Part 1.1: Convolutional Filter Receptive Field

First, it is important to develop an intuition for how a convolutional layer affects the feature representations that the network learns. Assume that you have a network in which the first convolutional layer
applies a 5x5 patch to the image, producing a feature map $Z_{1}$. The next layer of the network is also convolutional; in this case, a 3x3 patch is applied to the feature map $Z_{1}$ to produce a new feature
map, $Z_{2}$. Assume the stride used in both cases is 1. Let the receptive field of a node in this network be the portion of the original image that contributes information to the node (that it can, through the filters of the network, “see”). What are the dimensions of the receptive field for a node in $Z_{2}$? Note that you can ignore padding, and just consider patches in the middle of the image and $Z_{1}$. Thinking about your answer, why is it effective to build convolutional networks deeper, i.e. with more layers?

**ANSWER**<br>
For a node in $Z_{2}$ the receptive field is 7x7: each node in $Z_{2}$ sees a 3x3 patch of $Z_{1}$, and those nine nodes together cover a (3 + 5 - 1) x (3 + 5 - 1) patch of the image. A 7x7 receptive field can also be obtained directly with a single 7x7 filter. But if we build it up as described above, or even better with three stacked 3x3 layers, we apply more non-linearities and the extracted features become more expressive.<br> Stacking is also cheaper: a single 7x7 layer needs K x (7 x 7 x C) = 49 x K x C parameters, where K is the number of filters (output channels) and C is the number of input channels. Three 3x3 layers need 3 x (K x (3 x 3 x C)) = 27 x K x C parameters for the same output volume (assuming every layer has C input channels and K filters), fewer than a single convolutional layer with a 7x7 filter.
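
The receptive-field arithmetic can be checked with a few lines of Python. This is a minimal sketch that ignores padding, as the question allows; the helper name `receptive_field` is hypothetical and not part of the starter code.

```python
def receptive_field(layers):
    """Return the receptive-field width of one output node.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    layer closest to the image to the deepest layer.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between adjacent nodes of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(5, 1), (3, 1)]))          # Z2 in the question: 7
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # three stacked 3x3 layers: 7
print(receptive_field([(5, 2), (5, 2)]))          # two 5x5 layers with stride 2: 13
```

The last line uses the kernel sizes and strides of the `SimpleCNN` convolutional layers (without pooling) and shows how striding makes the receptive field grow faster with depth.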

### Part 1.2: Run the PyTorch ConvNet

Study the provided SimpleCNN class below, and take a look at the hyperparameters. Answer the following questions about the initial implementation:

1) How many layers are there? Are they all convolutional? If not, what structure do they have?
2) Which activation function is used on the hidden nodes?
3) What loss function is being used to train the network?
4) How is the loss being minimized?

**ANSWER**<br>
The architecture:<br>
CONV -> ReLU -> POOL (?) -> CONV -> ReLU -> POOL (?) -> FC -> ReLU -> FC<br>
There are 2 convolutional layers, each followed by a ReLU activation function and, if the pooling flag is set, by a max pooling layer. At the end, there are 2 fully connected layers with a ReLU activation function in between.<br>
Cross entropy loss is being used in this multiclass classification problem.<br>
The loss is being minimized with the Adam optimizer, using gradients computed by backpropagation.


Now that you are familiar with the code, try training the network. It should take between 60-120 seconds to train for 50 epochs. What is the training accuracy for your network after training? What is the validation accuracy? What do these two numbers tell you about what your network is doing?

In [1]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from PIL import Image, ImageFile
import tqdm
from torch.nn import CrossEntropyLoss
import time
import random
from torchvision import transforms, utils
import numpy as np
import os
from torch import optim
import math

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

In [None]:
class FlexibleCNN(torch.nn.Module):
    def __init__(self, device, num_conv_layers=2, in_channels=3, conv_out_channels=[16, 16],
                 kernel_sizes=[5, 5], strides=[2, 2], pooling=False, dropout_flag=False):
        super(FlexibleCNN, self).__init__()
        self.device = device
        self.pooling = pooling
        self.dropout_flag = dropout_flag
        
        # Define convolutional layers based on specified parameters
        self.conv_layers = nn.ModuleList()
        current_in_channels = in_channels
        for i in range(num_conv_layers):
            conv_layer = torch.nn.Conv2d(in_channels=current_in_channels,
                                         out_channels=conv_out_channels[i],
                                         kernel_size=kernel_sizes[i],
                                         stride=strides[i],
                                         device=device)
            self.conv_layers.append(conv_layer)
            current_in_channels = conv_out_channels[i]
        
        # Define pooling layers if pooling is enabled
        if pooling:
            self.pool_layers = nn.ModuleList([torch.nn.MaxPool2d(kernel_size=2, stride=2) 
                                              for _ in range(num_conv_layers)])
        
        # Define fully connected layers based on the final convolutional output size
        self.fc_input_size = (self.compute_fc_input_size(kernel_sizes, strides)**2) * conv_out_channels[-1]
        #print(f'fc_input_size: {self.fc_input_size}')
        self.fully_connected_layer = nn.Linear(self.fc_input_size, 64, device=device)
        self.final_layer = nn.Linear(64, 11, device=device)
        
        # Optional dropout layer
        if dropout_flag:
            self.dropout = nn.Dropout(p=0.5)
    
    def compute_fc_input_size(self, kernel_sizes, strides):
        w_in = 50
        for f, s in zip(kernel_sizes, strides):
            w_in = math.floor((w_in - f)/s) +1
            if self.pooling:
                # filter size and stride both equal 2 by default for pooling
                w_in = math.floor((w_in - 2) / 2) + 1
        #print(f'calculated input to fc: {w_in}x{w_in}x16')
        return w_in

            
    def forward(self, inp):
        x = inp
        for i, conv_layer in enumerate(self.conv_layers):
            x = torch.nn.functional.relu(conv_layer(x))
            if self.pooling:
                x = self.pool_layers[i](x)  # Apply corresponding pooling layer if enabled
        
        x = x.reshape(x.size(0), -1)  # Flatten the output for fully connected layer
        x = torch.nn.functional.relu(self.fully_connected_layer(x))
        if self.dropout_flag:
            x = self.dropout(x)
        x = self.final_layer(x)
        return x


In [8]:
class SimpleCNN(torch.nn.Module):
    def __init__(self,device,pooling= False, dropout_flag=False):
        super(SimpleCNN, self).__init__()
        self.device = device
        self.pooling = pooling
        self.dropout_flag = dropout_flag
        self.conv_layer1 =  torch.nn.Conv2d(in_channels=3,out_channels=16,kernel_size=5,stride=2, device=device)
        self.pool_layer1 = torch.nn.MaxPool2d(kernel_size=2,stride=2)
        self.conv_layer2 = torch.nn.Conv2d(in_channels=16,out_channels=16,kernel_size=5,stride=2, device=device)
        self.pool_layer2 = torch.nn.MaxPool2d(kernel_size=2,stride=2)
        if pooling:
            self.fully_connected_layer = nn.Linear(64,64, device=device)
            self.final_layer = nn.Linear(64,11, device=device)
        else:
            self.fully_connected_layer = nn.Linear(1600, 64, device=device)
            self.final_layer = nn.Linear(64, 11, device=device)
        if dropout_flag:
            self.dropout = nn.Dropout(p=0.5)
    def forward(self,inp):
        x = torch.nn.functional.relu(self.conv_layer1(inp))
        if self.pooling:
            x = self.pool_layer1(x)
        x = torch.nn.functional.relu(self.conv_layer2(x))
        if self.pooling:
            x = self.pool_layer2(x)
        x = x.reshape(x.size(0),-1)
        x = torch.nn.functional.relu(self.fully_connected_layer(x))
        if self.dropout_flag:
            x = self.dropout(x)
        x = self.final_layer(x)
        return x

In [2]:
class LoaderClass(Dataset):
    def __init__(self,data,labels,phase,transforms):
        super(LoaderClass, self).__init__()
        self.transforms = transforms
        self.labels = labels[phase + "_labels"]
        self.data = data[phase + "_data"]
        self.phase = phase

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        label = self.labels[idx]
        img = self.data[idx]
        img = Image.fromarray(img)
        img = self.transforms(img)
        return img,torch.from_numpy(label)

In [3]:
class Trainer():
    def __init__(self,model,criterion,tr_loader,val_loader,optimizer,
                 num_epoch,patience,batch_size,lr_scheduler=None):
        self.model = model
        self.tr_loader = tr_loader
        self.val_loader = val_loader
        self.optimizer = optimizer
        self.num_epoch = num_epoch
        self.patience = patience
        self.lr_scheduler = lr_scheduler
        self.criterion = criterion
        self.softmax = nn.Softmax(dim=1)  # softmax over the class dimension
        self.no_inc = 0
        self.best_loss = 9999
        self.phases = ["train","val"]
        self.best_model = []
        self.best_val_acc = 0
        self.best_train_acc = 0
        self.best_val_loss = 0
        self.best_train_loss = 0
        self.batch_size = batch_size

        pass
    def train(self):
        pbar = tqdm.tqdm(desc= "Epoch 0, phase: Train",postfix="train_loss : ?, train_acc: ?")
        for i in range(self.num_epoch):
            last_train_acc = 0
            last_val_acc = 0
            last_val_loss = 0
            last_train_loss = 0
            pbar.update(1)

            for phase in self.phases:
                total_acc = 0
                total_loss = 0
                start = time.time()
                if phase == "train":
                    pbar.set_description_str("Epoch %d,"% i + "phase: Training")
                    loader = self.tr_loader
                    self.model.train()
                else:
                    pbar.set_description_str("Epoch %d,"% i + "phase: Validation")
                    loader = self.val_loader
                    self.model.eval()
                iter = 0
                for images,labels in loader:
                    iter += 1
                    images = images.to(self.model.device)
                    labels = labels.to(self.model.device)
                    self.optimizer.zero_grad()
                    logits = self.model(images)
                    softmaxed_scores = self.softmax(logits)
                    _, predictions = torch.max(softmaxed_scores,1)
                    _, labels = torch.max(labels,1)
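                    # NOTE: nn.CrossEntropyLoss applies log-softmax internally and expects raw
                    # logits. Passing softmaxed scores is kept here so the numbers reported
                    # below stay reproducible; pass `logits` instead for a standard setup.
                    loss = self.criterion(softmaxed_scores.float(),labels.long())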
                    total_loss += loss.item()
                    total_acc += torch.sum(predictions == labels).item()

                    if phase == "train":
                        pbar.set_postfix_str("train acc: %6.3f," %(total_acc/ (iter*self.batch_size)) + ("train loss: %6.3f" % (total_loss / iter)))
                        loss.backward()
                        self.optimizer.step()
                    else:
                        pass
                        pbar.set_postfix_str("val acc: %6.3f," %(total_acc/ (iter*self.batch_size)) + ("val loss: %6.3f" % (total_loss / iter)))


                if phase == "train":
                    if self.lr_scheduler:

                        self.lr_scheduler.step()
                end = time.time()
                if phase == "train":
                    loss_p = total_loss / iter
                    acc_p = total_acc / len(self.tr_loader.dataset)
                    last_train_acc = acc_p
                    last_train_loss = loss_p
                else:
                    loss_p = total_loss / iter
                    acc_p = total_acc / len(self.val_loader.dataset)
                    last_val_acc = acc_p
                    last_val_loss = loss_p

                    if loss_p < self.best_loss:
                        #print("New best loss, loss is: ",str(loss_p), "acc is: ",acc_p )
                        self.best_loss = loss_p
                        self.no_inc = 0
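                        # NOTE: this keeps a reference to the live model, not a snapshot;
                        # use copy.deepcopy(self.model) to freeze the best weights.
                        self.best_model = self.model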
                        self.best_train_acc = last_train_acc
                        self.best_train_loss = last_train_loss
                        self.best_val_loss = last_val_loss
                        self.best_val_acc = last_val_acc
                    else:
                        #print("Not a better score")


                        self.no_inc += 1
                        if self.no_inc == self.patience:
                            print("Out of patience returning the best model")
                            print(
                                "Best val acc: {}, Best val loss: {}, Best train acc: {}, Best train loss: {} ".format(
                                    self.best_val_acc, self.best_val_loss, self.best_train_acc, self.best_train_loss
                                ))  # Stats of the best model
                            return self.best_model
        print("Training ended returning the best model")
        print(
            "Best val acc: {}, Best val loss: {}, Best train acc: {}, Best train loss: {} ".format(
                self.best_val_acc, self.best_val_loss, self.best_train_acc, self.best_train_loss
            ))  # Stats of the best model
        return self.best_model

In [24]:
LR = 1e-4
Momentum = 0.9 # If you use SGD with momentum
BATCH_SIZE = 16
POOLING = False
NUM_EPOCHS = 200
PATIENCE = -1
TRAIN_PERCENT = 0.8
VAL_PERCENT = 0.2
NUM_ARTISTS = 11
DATA_PATH = "./art_data/artists"
ImageFile.LOAD_TRUNCATED_IMAGES = True # Do not change this

In [5]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark=True lets cuDNN auto-tune kernels, which can break determinism
    torch.backends.cudnn.benchmark = False

In [6]:
def load_artist_data():
    data = []
    labels = []
    # Sort for a reproducible class order and skip macOS metadata files.
    artists = sorted(x for x in os.listdir(DATA_PATH) if x != '.DS_Store')
    print(artists)
    for folder in artists:
        class_index = artists.index(folder)
        for image_name in os.listdir(DATA_PATH + "/" + folder):
            img = Image.open(DATA_PATH + "/" + folder + "/" + image_name)
            artist_label = (np.arange(NUM_ARTISTS) == class_index).astype(np.float32)
            data.append(np.array(img))
            labels.append(artist_label)
    shuffler = np.random.permutation(len(labels))
    data = np.array(data)[shuffler]
    labels = np.array(labels)[shuffler]

    length = len(data)
    val_size = int(length*0.2)
    val_data = data[0:val_size+1]
    train_data = data[val_size+1::]
    val_labels = labels[0:val_size+1]
    train_labels = labels[val_size+1::]
    print(val_labels)
    data_dict = {"train_data":train_data,"val_data":val_data}
    label_dict = {"train_labels":np.array(train_labels),"val_labels":np.array(val_labels)}

    return data_dict,label_dict

In [9]:
seed_everything(42)
data,labels = load_artist_data()
model = SimpleCNN(device=device,pooling=False)
optimizer = optim.Adam(model.parameters(), lr=LR)
transform = {
    'train': transforms.Compose([
        transforms.Resize(50),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(50),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    }

['canaletto', 'claude monet', 'george romney', 'j. m. w. turner', 'john robert cozens', 'paul cezanne', 'paul gauguin', 'paul sandby', 'peter paul rubens', 'rembrandt', 'richard wilson']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]]


In [10]:
train_dataset = LoaderClass(data,labels,"train",transform["train"])
valid_dataset = LoaderClass(data,labels,"val",transform["val"])
train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=BATCH_SIZE,
                                               shuffle=True, num_workers=0, pin_memory=True)
val_loader = torch.utils.data.DataLoader(valid_dataset,
                                             batch_size=BATCH_SIZE,
                                             shuffle=True, num_workers=0, pin_memory=True)


In [11]:
criterion = nn.CrossEntropyLoss()

In [12]:
# standard adam optimizer, no pooling, no weight decay, no dropout, no scheduler
trainer_m = Trainer(model, criterion, train_loader, val_loader, optimizer, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model = trainer_m.train()
# Best val acc: 0.4835164835164835, Best val loss: 2.047486503918966, Best train acc: 0.6722222222222223, Best train loss: 1.8831380450207253 

Epoch 199,phase: Validation: 200it [04:27,  1.34s/it, val acc:  0.406,val loss:  2.110]    

Training ended returning the best model
Best val acc: 0.4835164835164835, Best val loss: 2.047486503918966, Best train acc: 0.6722222222222223, Best train loss: 1.8831380450207253 





**ANSWER**<br>
The training accuracy is 0.67 and the validation accuracy is 0.48. The gap of almost 20 points shows that the network is overfitting to the training data.

### Part 1.3: Add Pooling Layers
We will now add max pooling layers after each of our convolutional layers. This code has already been provided for you; all you need to do is switch the pooling flag in the hyper-parameters to True,
and choose different values for the pooling filter size and stride. After you applied max pooling, what happened to your results? How did the training accuracy vs. validation accuracy change? What does
that tell you about the effect of max pooling on your network?

In [None]:
# create new model because the other model's params got updated.
model_pooling = SimpleCNN(device=device,pooling=True)
opt_pooling = optim.Adam(model_pooling.parameters(), lr=LR)
trainer_m_pooling = Trainer(model_pooling, criterion, train_loader, val_loader, opt_pooling, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_pooling = trainer_m_pooling.train()
# Best val acc: 0.46153846153846156, Best val loss: 2.062834401925405, Best train acc: 0.675, Best train loss: 1.9000464470490166 

**ANSWER**<br>
Applying pooling reduces the spatial dimensions, helping the model to focus on the most important features. This also helps with avoiding overfitting, as the number of parameters in the first fully connected layer is decreased.<br><br>
In this run, however, the model with max pooling has a slightly lower validation accuracy (0.46 versus 0.48) at about the same training accuracy (0.68 versus 0.67), so it does not generalize better to unseen data. This is unexpected and might be because max pooling discarded important information while downsampling.
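
To make the reduction concrete, the sketch below recomputes how many features reach the first fully connected layer of `SimpleCNN` with and without pooling. It assumes the 50x50 input, the two 5x5 stride-2 convolutions and the 2x2 stride-2 pooling from the class above; the helper names are hypothetical.

```python
import math


def conv_out(size, kernel, stride):
    """Output width of a convolution or pooling window with no padding."""
    return math.floor((size - kernel) / stride) + 1


def flattened_features(pooling, size=50, channels=16):
    """Number of features entering the first fully connected layer."""
    for _ in range(2):                  # two 5x5 conv layers with stride 2
        size = conv_out(size, kernel=5, stride=2)
        if pooling:                     # 2x2 max pooling with stride 2
            size = conv_out(size, kernel=2, stride=2)
    return size * size * channels


print(flattened_features(pooling=False))  # 1600
print(flattened_features(pooling=True))   # 64
```

These match the `nn.Linear(1600, 64)` and `nn.Linear(64, 64)` layers in `SimpleCNN`, so pooling shrinks that layer from 1600 x 64 to 64 x 64 weights.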

### Part 1.4: Regularize Your Network!
Because this is such a small dataset, your network is likely to overfit the data. Implement the following ways of regularizing your network. Test each one individually, and discuss how it affects your results.

- __Dropout__: In PyTorch, this is implemented using the `torch.nn.Dropout` class, which takes a value called `p`, representing the probability that an activation will be dropped out. This value should be between 0.1 and 0.5 during training; dropout should be disabled for evaluation and testing, which calling `model.eval()` does automatically. An example of how this works is available here. You should add this to your network and try different values to find one that works well.

- __Weight Regularization__: You should try different optimizers, and different weight decay values for optimizers.

- __Early Stopping__: Stop training your model after your validation accuracy starts to plateau or decrease (so you do not overtrain your model). The number of steps can be controlled through the `patience` hyperparameter in the code.

- __Learning Rate Scheduling__: Learning rate scheduling is an important part of training neural networks. There are a lot of techniques for learning rate scheduling. You should try different schedulers such as `StepLR`, `CosineAnnealingLR`, etc.

Give your results for each of these regularization techniques, and discuss which ones were the most effective.
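
As a reminder of what dropout does in the two modes, here is a minimal pure-Python sketch of inverted dropout. It is an illustration only, not the PyTorch implementation: `torch.nn.Dropout` likewise zeroes each element with probability `p` and scales the survivors by 1 / (1 - p) during training, and acts as the identity in evaluation mode. The function name and the list-based interface are assumptions made for this example.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of activations.

    In training mode each value is zeroed with probability `p` and the
    survivors are scaled by 1 / (1 - p); in evaluation mode the input is
    returned unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)
print(train_out)  # a mix of 0.0 and 2.0
print(eval_out)   # unchanged
```

The scaling keeps the expected activation the same in both modes, which is why no rescaling is needed at test time.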

**ANSWER**<br>
Testing each individually:<br><br>
Dropout (p=0.5) between the two fully connected layers:

In [None]:
model_dropout = SimpleCNN(device=device,pooling=True, dropout_flag=True)
opt_dropout = optim.Adam(model_dropout.parameters(), lr=LR)
trainer_m_dropout = Trainer(model_dropout, criterion, train_loader, val_loader, opt_dropout, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_dropout = trainer_m_dropout.train()
# Best val acc: 0.4945054945054945, Best val loss: 2.027242044607798, Best train acc: 0.6333333333333333, Best train loss: 1.9216425211533257 

Compared with the pooling model without dropout, the training accuracy drops slightly (0.63 versus 0.68) while the validation accuracy rises (0.49 versus 0.46), so the gap between the two narrows and the model generalizes somewhat better to new data.<br>
<br>
Weight regularization:<br>




In [None]:
model_sgd = SimpleCNN(device=device,pooling=True)
opt_sgd = torch.optim.SGD(model_sgd.parameters(), lr=LR, weight_decay=0.001, momentum=0.9)
trainer_m_sgd = Trainer(model_sgd, criterion, train_loader, val_loader, opt_sgd, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_sgd= trainer_m_sgd.train()
# Best val acc: 0.2967032967032967, Best val loss: 2.3858046531677246, Best train acc: 0.2722222222222222, Best train loss: 2.3842520713806152 

In [None]:
model_adam1 = SimpleCNN(device=device,pooling=True)
opt_adam1 = torch.optim.Adam(model_adam1.parameters(), lr=LR, weight_decay=0.001)
trainer_m_adam1 = Trainer(model_adam1, criterion, train_loader, val_loader, opt_adam1, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_adam1= trainer_m_adam1.train()
# Best val acc: 0.5054945054945055, Best val loss: 2.024585505326589, Best train acc: 0.7083333333333334, Best train loss: 1.8514255337093188

In [None]:
model_adam2 = SimpleCNN(device=device,pooling=True)
opt_adam2 = torch.optim.Adam(model_adam2.parameters(), lr=LR, weight_decay=0.01)
trainer_m_adam2 = Trainer(model_adam2, criterion, train_loader, val_loader, opt_adam2, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_adam2= trainer_m_adam2.train()
# Best val acc: 0.5164835164835165, Best val loss: 2.019010921319326, Best train acc: 0.7, Best train loss: 1.8626101846280305 

In [None]:
model_adam3 = SimpleCNN(device=device,pooling=True)
opt_adam3 = torch.optim.Adam(model_adam3.parameters(), lr=LR, weight_decay=0.0001)
trainer_m_adam3 = Trainer(model_adam3, criterion, train_loader, val_loader, opt_adam3, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_adam3= trainer_m_adam3.train()
# Best val acc: 0.46153846153846156, Best val loss: 2.057038187980652, Best train acc: 0.6861111111111111, Best train loss: 1.8569457012674082 

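Weight regularization penalizes large weights to avoid a strong dependency on certain features and thus helps prevent overfitting.<br>
With SGD with momentum as the optimizer, both training accuracy (0.27) and validation accuracy (0.30) drop sharply. The model is underfitting, most likely because the learning rate of 1e-4 that works for Adam is too small for SGD, rather than because of the weight decay itself.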
<br>

With Adam as optimizer and weight_decay=0.001, validation accuracy increases to 0.505. I believe that the improvement over SGD is mainly caused by the adaptive learning rate of Adam, yielding better convergence.<br>
<br>
With Adam as optimizer and weight_decay=0.01, validation accuracy increases further to 0.516, suggesting that in our case more aggressive regularization of large weights helps. When the weight_decay parameter is decreased to 0.0001, validation accuracy falls back to 0.462, the same as without weight decay.<br>
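
To show what the `weight_decay` argument does, here is a minimal sketch of one plain SGD update with an L2 penalty. In `torch.optim.SGD` and `torch.optim.Adam`, `weight_decay * w` is added to the gradient before the update; the function below reproduces only the plain-SGD case (no momentum, no adaptive scaling), and its name and values are chosen for illustration.

```python
def sgd_step(w, grad, lr, weight_decay=0.0):
    """One plain SGD update with an L2 penalty folded into the gradient."""
    return w - lr * (grad + weight_decay * w)


w = 2.0
# With a zero data gradient only the decay term acts, so the weight shrinks
# by a factor of (1 - lr * weight_decay) on every step.
for _ in range(3):
    w = sgd_step(w, grad=0.0, lr=0.1, weight_decay=0.01)
print(w)
```

A larger `weight_decay` pulls the weights toward zero faster, which is the "more aggressive regularization" discussed above.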
<br>
Early Stopping:<br>
(without weight decay and dropout)

In [None]:
# es -> early stopping
model_es = SimpleCNN(device=device,pooling=True)
opt_es = torch.optim.Adam(model_es.parameters(), lr=LR)
trainer_m_es = Trainer(model_es, criterion, train_loader, val_loader, opt_es, num_epoch=NUM_EPOCHS, patience=7,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_es= trainer_m_es.train()
# Best val acc: 0.37362637362637363, Best val loss: 2.1712493896484375, Best train acc: 0.3638888888888889, Best train loss: 2.1862813234329224 

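With patience=7, early stopping is not effective at all: training halts while the model is still underfitting (training accuracy 0.36, validation accuracy 0.37). The patience value I chose is probably too small for this learning rate. Note also that the trainer monitors validation loss, not validation accuracy.<br>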
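
The patience rule used by the `Trainer` can be isolated in a small sketch. The class below mirrors the trainer's logic (reset the counter on a new best validation loss, stop when the counter equals `patience`), but the class name and the loss values are made up for illustration.

```python
class EarlyStopper:
    """Signals a stop after `patience` consecutive epochs without a new best loss."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.no_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.no_improvement = 0
        else:
            self.no_improvement += 1
        return self.no_improvement == self.patience


# Made-up validation losses, for illustration only.
losses = [2.30, 2.25, 2.27, 2.26, 2.28, 2.20]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

In this toy sequence training stops at epoch 4 and never sees the better loss at epoch 5, which is the risk of a small patience when the validation loss is noisy. With `patience=-1`, as in the hyperparameter cell, the counter can never equal the patience, so early stopping is disabled.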
<br>
Learning Rate Scheduling:<br>

In [None]:
model_lr1 = SimpleCNN(device=device,pooling=True)
opt_lr1 = torch.optim.Adam(model_lr1.parameters(), lr=LR)
scheduler1 = torch.optim.lr_scheduler.StepLR(opt_lr1, step_size=5, gamma=0.1)
trainer_m_lr1 = Trainer(model_lr1, criterion, train_loader, val_loader, opt_lr1, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= scheduler1)
best_model_lr1= trainer_m_lr1.train()
# Best val acc: 0.2967032967032967, Best val loss: 2.2756608724594116, Best train acc: 0.2638888888888889, Best train loss: 2.305739112522291

In [None]:
model_lr2 = SimpleCNN(device=device,pooling=True)
opt_lr2 = torch.optim.Adam(model_lr2.parameters(), lr=LR)
scheduler2 =  torch.optim.lr_scheduler.ExponentialLR(opt_lr2, gamma=0.95)
trainer_m_lr2 = Trainer(model_lr2, criterion, train_loader, val_loader, opt_lr2, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= scheduler2)
best_model_lr2= trainer_m_lr2.train()
# Best val acc: 0.3516483516483517, Best val loss: 2.161716083685557, Best train acc: 0.46111111111111114, Best train loss: 2.089998447376749 
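
Both schedulers lowered the accuracy compared with a constant learning rate. One possible explanation is how quickly they shrink an already small learning rate. The sketch below evaluates the closed-form schedules that `StepLR` and `ExponentialLR` follow when stepped once per epoch, as the `Trainer` does; the helper names are hypothetical.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Closed form of StepLR: multiply by gamma every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


def exponential_lr(base_lr, epoch, gamma):
    """Closed form of ExponentialLR: multiply by gamma every epoch."""
    return base_lr * gamma ** epoch


BASE_LR = 1e-4  # the LR used in the cells above
for epoch in (0, 5, 10, 20, 50):
    print(epoch,
          f"{step_lr(BASE_LR, epoch, step_size=5, gamma=0.1):.1e}",
          f"{exponential_lr(BASE_LR, epoch, gamma=0.95):.1e}")
```

With `step_size=5` and `gamma=0.1` the learning rate is already 1e-6 after 10 epochs, so most of the 200 epochs make almost no progress. A larger `step_size`, a `gamma` closer to 1, or a higher starting learning rate would be natural things to try next.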

### Part 1.5: Experiment with Your Architecture

All those parameters at the top of `SimpleCNN` still need to be set. You cannot possibly explore all combinations; so try to change some of them individually to get some feeling for their effect (if any).
Optionally, you can explore adding more layers. Report which changes led to the biggest increases and decreases in performance. In particular, what is the effect of making the convolutional layers have (a) a larger filter size, (b) a larger stride and (c) greater depth? How does a pyramidal-shaped network in which the feature maps gradually decrease in height and width but increase in depth compare to a flat architecture, or one with the opposite shape?

Below is the same architecture using the new network class declaration:

In [None]:
model_flexible = FlexibleCNN(device=device, num_conv_layers=2, in_channels=3,
                              conv_out_channels=[16,16], kernel_sizes=[5,5], strides=[2,2], pooling=True, dropout_flag=False)
opt_flexible = optim.Adam(model_flexible.parameters(), lr=LR)
trainer_m_flexible = Trainer(model_flexible, criterion, train_loader, val_loader, opt_flexible, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_flexible = trainer_m_flexible.train()
# Best val acc: 0.4945054945054945, Best val loss: 2.011147916316986, Best train acc: 0.7, Best train loss: 1.8516293297643247 

Now, I will first examine the effect of depth on the accuracy and training time.<br>
Using a single layer of 5x5 filters with stride=2 (changing the depth; note that the cell below also enables dropout)<br>
The input to the fully connected layer is 11 x 11 x 16<br>
This architecture yields a higher validation accuracy (0.56 to 0.58) than using two 5x5 layers (0.49), suggesting that the second layer is not helping generalization on this dataset.<br>

In [None]:
model_flexible2 = FlexibleCNN(device=device, num_conv_layers=1, in_channels=3,
                              conv_out_channels=[16], kernel_sizes=[5], strides=[2], pooling=True, dropout_flag=True)
opt_flexible2 = optim.Adam(model_flexible2.parameters(), lr=LR)
trainer_m_flexible2 = Trainer(model_flexible2, criterion, train_loader, val_loader, opt_flexible2, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_flexible2 = trainer_m_flexible2.train()
# Best val acc: 0.5604395604395604, Best val loss: 1.9935139616330464, Best train acc: 0.7305555555555555, Best train loss: 1.8350506398988806
# Best val acc: 0.5824175824175825, Best val loss: 1.9595629771550496, Best train acc: 0.7861111111111111, Best train loss: 1.7817718360735022

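Using three layers of 5x5 filters with stride=2 (changing the depth; this run disables pooling and enables dropout)<br>
The input to the fully connected layer is 3 x 3 x 16, which is very small even though it is without pooling. This is an effect of striding with 3 conv layers.<br>
With this architecture, I obtain worse validation accuracy (0.40 versus 0.63 on the training set), so the network is still overfitting. The cause is probably not the parameter count, since the small 3 x 3 x 16 feature map makes the first fully connected layer much smaller than in the single-layer network; the aggressive downsampling may instead be discarding useful spatial information.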
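
The parameter counts of the three depth variants can be tallied by hand. The sketch below assumes the `FlexibleCNN` layout used in these cells (16 channels per conv layer, a 64-unit hidden layer, 11 outputs, 50x50 RGB input); the function names are hypothetical.

```python
import math


def conv_out(size, kernel, stride):
    """Output width of a convolution or pooling window with no padding."""
    return math.floor((size - kernel) / stride) + 1


def count_parameters(kernel_sizes, strides, pooling, channels=16,
                     in_channels=3, image_size=50, hidden=64, classes=11):
    """Weights plus biases of a FlexibleCNN-style network, counted by hand."""
    total, size, c_in = 0, image_size, in_channels
    for k, s in zip(kernel_sizes, strides):
        total += c_in * channels * k * k + channels  # conv weights + biases
        size = conv_out(size, k, s)
        if pooling:
            size = conv_out(size, 2, 2)              # pooling has no parameters
        c_in = channels
    fc_in = size * size * channels
    total += fc_in * hidden + hidden                 # first fully connected layer
    total += hidden * classes + classes              # output layer
    return total, fc_in


print(count_parameters([5], [2], pooling=True))               # (125899, 1936)
print(count_parameters([5, 5], [2, 2], pooling=True))         # (12507, 64)
print(count_parameters([5, 5, 5], [2, 2, 2], pooling=False))  # (24043, 144)
```

Most parameters sit in the first fully connected layer, so the single-layer network is by far the largest of the three even though it is the shallowest.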

In [None]:
model_flexible3 = FlexibleCNN(device=device, num_conv_layers=3, in_channels=3,
                              conv_out_channels=[16,16,16], kernel_sizes=[5,5,5], strides=[2,2,2], pooling=False, dropout_flag=True)
opt_flexible3 = optim.Adam(model_flexible3.parameters(), lr=LR)
trainer_m_flexible3 = Trainer(model_flexible3, criterion, train_loader, val_loader, opt_flexible3, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_flexible3 = trainer_m_flexible3.train()
# Best val acc: 0.3956043956043956, Best val loss: 2.115185578664144, Best train acc: 0.6277777777777778, Best train loss: 1.9305321081824924 


Using stride=1 instead of stride=2 will shrink the receptive field of every neuron, but the network might be able to capture high resolution features more efficiently.<br>


In [None]:
model_flexible4 = FlexibleCNN(device=device, num_conv_layers=2, in_channels=3,
                              conv_out_channels=[16,16], kernel_sizes=[5,5], strides=[1,1], pooling=True, dropout_flag=True)
opt_flexible4 = optim.Adam(model_flexible4.parameters(), lr=LR)
trainer_m_flexible4 = Trainer(model_flexible4, criterion, train_loader, val_loader, opt_flexible4, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_flexible4 = trainer_m_flexible4.train()
# Best val acc: 0.5274725274725275, Best val loss: 2.00963286558787, Best train acc: 0.7222222222222222, Best train loss: 1.8219199698904287 

Finally, the filter size can be changed as well.<br>
Using larger filters will enable us to capture relations between pixels that are further apart, but this might miss features in the fine details.<br>
With a smaller kernel, we will catch these details but this time miss out on broader context and relations.<br>
The two cells below experiment with 3 x 3 and 7 x 7 kernels without changing any other architecture parameters.

In [None]:
model_flexible5 = FlexibleCNN(device=device, num_conv_layers=2, in_channels=3,
                              conv_out_channels=[16,16], kernel_sizes=[3,3], strides=[2,2], pooling=True, dropout_flag=True)
opt_flexible5 = optim.Adam(model_flexible5.parameters(), lr=LR)
trainer_m_flexible5 = Trainer(model_flexible5, criterion, train_loader, val_loader, opt_flexible5, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_flexible5 = trainer_m_flexible5.train()
# Best val acc: 0.4065934065934066, Best val loss: 2.1159814993540444, Best train acc: 0.5416666666666666, Best train loss: 2.015954660332721 

In [14]:
model_flexible6 = FlexibleCNN(device=device, num_conv_layers=2, in_channels=3,
                              conv_out_channels=[16,16], kernel_sizes=[7,7], strides=[2,2], pooling=True, dropout_flag=True)
opt_flexible6 = optim.Adam(model_flexible6.parameters(), lr=LR)
trainer_m_flexible6 = Trainer(model_flexible6, criterion, train_loader, val_loader, opt_flexible6, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_flexible6 = trainer_m_flexible6.train()
# Best val acc: 0.6043956043956044, Best val loss: 1.9428099195162456, Best train acc: 0.7138888888888889, Best train loss: 1.851250321968742

Epoch 199,phase: Validation: 200it [04:23,  1.32s/it, val acc:  0.458,val loss:  2.043]    


Training ended returning the best model
Best val acc: 0.5274725274725275, Best val loss: 2.0259124835332236, Best train acc: 0.6555555555555556, Best train loss: 1.8992950294328772 


Larger filter sizes produce better results, which suggests that a larger receptive field is important for this problem.

### Part 1.6: Optimize Your Architecture
Based on your experience with these tests, try to achieve the best performance that you can on the validation set by varying the hyperparameters, architecture, and regularization methods. You can even (optionally) try to think of additional ways to augment the data, or experiment with techniques like local response normalization layers using `torch.nn.LocalResponseNorm` or weight normalization using the implementation [here](https://pytorch.org/docs/stable/_modules/torch/nn/utils/weight_norm.html#weight_norm). Report the best performance you are able to achieve, and the settings you used to obtain it.
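As a rough illustration of the two optional techniques mentioned above, the sketch below wraps a convolution in weight normalization and follows it with local response normalization. The layer sizes are arbitrary choices for a 50 x 50 RGB input, and recent PyTorch releases may emit a deprecation warning for `torch.nn.utils.weight_norm`.

```python
import torch
import torch.nn as nn

# One conv block: weight-normalized convolution, ReLU, local response norm.
block = nn.Sequential(
    nn.utils.weight_norm(nn.Conv2d(3, 16, kernel_size=7, stride=2)),
    nn.ReLU(),
    nn.LocalResponseNorm(size=5),  # normalizes across 5 neighbouring channels
)

x = torch.randn(4, 3, 50, 50)  # a random batch standing in for real images
out = block(x)
print(out.shape)  # torch.Size([4, 16, 22, 22])
```

Both layers leave the spatial size unchanged, so the input size of the fully connected layer is computed the same way as without them.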

In [57]:
import math

def compute_fc_input_size(pooling, kernel_sizes, strides):
    w_in = 50  # input images are 50x50
    for f, s in zip(kernel_sizes, strides):
        w_in = math.floor((w_in - f) / s) + 1
        if pooling:
            # filter size and stride both equal 2 by default for pooling
            w_in = math.floor((w_in - 2) / 2) + 1
    return w_in
kernel_sizes=[9,3,3]
strides=[2,2,1]
pooling=True
x = compute_fc_input_size(pooling=pooling, kernel_sizes=kernel_sizes, strides=strides)
print(f'with kernel sizes = {kernel_sizes}, and strides={strides} and pooling={pooling}: the input to fc is: {x}')

with kernel sizes = [9, 3, 3], and strides=[2, 2, 1] and pooling=True: the input to fc is: 0
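A result of 0 means the feature map has collapsed before reaching the fully connected layer. A small variant of the helper above can report the size after every layer and reject such configurations early. This is only a sketch: `trace_sizes` is a hypothetical name, and it assumes the same unpadded convolutions and 2 x 2 pooling as `compute_fc_input_size`.

```python
def trace_sizes(kernel_sizes, strides, pooling, w_in=50):
    """Return the feature-map width after each conv (and pooling) layer."""
    sizes = [w_in]
    for f, s in zip(kernel_sizes, strides):
        if f > sizes[-1]:
            raise ValueError(f"kernel {f} is larger than the {sizes[-1]}-pixel input")
        sizes.append((sizes[-1] - f) // s + 1)
        if pooling:
            if sizes[-1] < 2:
                raise ValueError("feature map is too small for 2x2 pooling")
            sizes.append((sizes[-1] - 2) // 2 + 1)
    return sizes


print(trace_sizes([7, 3, 3], [2, 2, 1], pooling=False))  # [50, 22, 10, 8]
try:
    trace_sizes([9, 3, 3], [2, 2, 1], pooling=True)
except ValueError as err:
    print(err)  # kernel 3 is larger than the 2-pixel input
```

The trace shows where the configuration from the cell above breaks: after the second pooling layer only 2 pixels remain, which is smaller than the last 3 x 3 kernel.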


In [60]:
#LR = 0.0005
model_optim = FlexibleCNN(device=device, num_conv_layers=3, in_channels=3,
                              conv_out_channels=[16,64,64], kernel_sizes=[7,3,3], strides=[2,2,1], pooling=False, dropout_flag=True)
opt_optim = optim.Adam(model_optim.parameters(), lr=LR, weight_decay=0.001)
trainer_m_optim = Trainer(model_optim, criterion, train_loader, val_loader, opt_optim, num_epoch=NUM_EPOCHS, patience=PATIENCE,batch_size=BATCH_SIZE,lr_scheduler= None)
best_model_optim = trainer_m_optim.train()
"""
9,9 - 2,2 - no pool 
Epoch 199,phase: Validation: 200it [01:38,  2.03it/s, val acc:  0.469,val loss:  2.049]
Training ended returning the best model
Best val acc: 0.5164835164835165, Best val loss: 2.015752991040548, Best train acc: 0.7472222222222222, Best train loss: 1.8079308478728584 

7,3,3 - 2,2,1 - no pool - w_d = 0.0005 - 16,64,64
Epoch 199,phase: Validation: 200it [02:34,  1.30it/s, val acc:  0.490,val loss:  2.034]    
Training ended returning the best model
Best val acc: 0.5494505494505495, Best val loss: 1.9998568296432495, Best train acc: 0.725, Best train loss: 1.8211293116859768 

7,3,3 - 2,2,1 - no pool - w_d = 0.0001 - 16,64,64
Epoch 199,phase: Validation: 200it [02:52,  1.16it/s, val acc:  0.458,val loss:  2.065]    
Training ended returning the best model
Best val acc: 0.5384615384615384, Best val loss: 1.9924977620442708, Best train acc: 0.7583333333333333, Best train loss: 1.788537139477937

9,3,3 - 2,2,1 - no pool - 16,64,64 - w_d = 0.0005
Epoch 199,phase: Validation: 200it [06:09,  1.85s/it, val acc:  0.438,val loss:  2.084]    
Training ended returning the best model
Best val acc: 0.5274725274725275, Best val loss: 2.0125731229782104, Best train acc: 0.7611111111111111, Best train loss: 1.7932016849517822
"""

Epoch 199,phase: Validation: 200it [05:28,  1.64s/it, val acc:  0.479,val loss:  2.062]    

Training ended returning the best model
Best val acc: 0.5054945054945055, Best val loss: 2.026154577732086, Best train acc: 0.7472222222222222, Best train loss: 1.7959446751553079 






### Part 1.7: Test Your Final Architecture on Variations of the Data
In PyTorch, data augmentation can be done dynamically while loading the data using what they call `transforms`. Note that some of the transforms are already implemented. You can
try other transformations, such as the ones shown in Figure 3, and also try different probabilities for these transformations. You may find [this link](https://pytorch.org/vision/stable/transforms.html) helpful. Note that the PyTorch data loader refreshes the
data in each epoch and applies different transformations to the different instances.

Now that you have optimized your architecture, you are ready to test it on augmented data!
Report your performance on each of the transformed datasets. Are you surprised by any of the results?
Which transformations is your network most invariant to, and which lead it to be unable to recognize the images? What does that tell you about what features your network has learned to use to recognize artists’ images?

In [None]:
transform = {
    'train': transforms.Compose([
        transforms.Resize(50),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(50),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    }
train_dataset = LoaderClass(data,labels,"train",transform["train"])
valid_dataset = LoaderClass(data,labels,"val",transform["val"])
train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=BATCH_SIZE,
                                               shuffle=True, num_workers=0, pin_memory=True)
val_loader = torch.utils.data.DataLoader(valid_dataset,
                                             batch_size=BATCH_SIZE,
                                             shuffle=True, num_workers=0, pin_memory=True)

## Part 2: Transfer Learning with Deep Network

In this part, you will fine-tune an AlexNet model pretrained on ImageNet to recognize faces. For the sake of simplicity you may use [the pretrained AlexNet model](https://pytorch.org/hub/pytorch_vision_alexnet/) provided in PyTorch Hub. You will
work with a subset of the FaceScrub dataset. The subset of male actors is [here](http://www.cs.toronto.edu/~guerzhoy/321/proj1/subset_actors.txt) and the subset of female actors is [here](http://www.cs.toronto.edu/~guerzhoy/321/proj1/subset_actresses.txt). The dataset consists of URLs of images with faces, as well as the bounding boxes of the faces. The format of the bounding box is as follows (from the FaceScrub `readme.txt` file):

` 
The format is x1,y1,x2,y2, where (x1,y1) is the coordinate of the top-left corner of the bounding box and (x2,y2) is that of the bottom-right corner, with (0,0) as the top-left corner of the image. Assuming the image is represented as a Python NumPy array I, a face
in I can be obtained as I[y1:y2, x1:x2].
`
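The bounding-box convention quoted above can be sketched with NumPy slicing. The helper name `crop_face`, the bounding-box string and the image size are made up for illustration.

```python
import numpy as np


def crop_face(image, bbox):
    """Crop a face from an H x W x C array given an 'x1,y1,x2,y2' string."""
    x1, y1, x2, y2 = (int(v) for v in bbox.split(","))
    return image[y1:y2, x1:x2]  # rows are indexed by y, columns by x


image = np.zeros((100, 120, 3), dtype=np.uint8)  # height 100, width 120
face = crop_face(image, "30,10,90,70")
print(face.shape)  # (60, 60, 3)
```

Note the order of the slices: the first axis of the array is the vertical one, so `y` comes before `x`.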

You may find it helpful to use and/or modify [this script](http://www.cs.toronto.edu/~guerzhoy/321/proj1/get_data.py) for downloading the image data. Note that you should crop out the images of the faces and resize them to an appropriate size before proceeding further. Make sure to check the SHA-256 hashes, and make sure to only keep faces for which the hashes match. You should set aside 70 images per face for the training set, and use the rest for the test and validation sets.
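The hash check can be done with the standard library. A minimal sketch, where `hash_matches` is a hypothetical helper and the empty byte string stands in for a downloaded file:

```python
from hashlib import sha256


def hash_matches(data, expected_hex):
    """Return True if the SHA-256 digest of `data` equals the expected hex string."""
    return sha256(data).hexdigest() == expected_hex.strip().lower()


# Well-known SHA-256 digest of the empty byte string, used here as a stand-in.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(hash_matches(b"", EMPTY_SHA256))  # True
print(hash_matches(b"corrupted", EMPTY_SHA256))  # False
```

In practice `data` would be the bytes read from the downloaded image, and `expected_hex` the hash listed for that image in the subset file.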

### Part 2.1: Train a Multilayer Perceptron
First resize the images to 28 × 28 pixels. Use a fully-connected neural network with a single hidden layer of size 300 units.
Below, include the learning curve for the test, training, and validation sets, and the final classification performance on the test set. Include a text description of your system. In particular, describe how you preprocessed the input and initialized the weights, what activation function you used, and what the exact architecture of the network that you selected was. You might get performances close to 80-85% accuracy rate.

In [None]:
"""
from pylab import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
import random
import time
from scipy.misc import imread
from scipy.misc import imresize
import matplotlib.image as mpimg
import os
from scipy.ndimage import filters
import urllib.request
from hashlib import sha256
#from rgb2gray import rgb2gray


# Instructions to run the code:
# the two paths I used below are my local paths, "uncropped/" and "cropped/" folders
# should be created at the same location where this python file exists. The code
# will download the images automatically and it's implemented the way such that
# the gray scale images are generated and cropped right after the image is
# downloaded. "faces.txt" file is a txt file which contains all info from
# "subset_actors.txt" and "subset_actresses.txt", so the code can handle all
# required actors/actresses at one time.

act = list(set([a.split("\t")[0] for a in open("subset_actors.txt").readlines()]))

def timeout(func, args=(), kwargs={}, timeout_duration=1, default=None):
    '''From:
    http://code.activestate.com/recipes/473878-timeout-function-using-threading/'''
    import threading
    class InterruptableThread(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)
            self.result = None

        def run(self):
            try:
                self.result = func(*args, **kwargs)
            except:
                self.result = default

    it = InterruptableThread()
    it.start()
    it.join(timeout_duration)
    if it.is_alive():
        return False
    else:
        return it.result

testfile = urllib.request.URLopener()            

path1 = './uncropped/'
path2 = './cropped/'
# First loop through the following actors' images in uncropped folder
#act = ['Gerard Butler', 'Daniel Radcliffe', 'Michael Vartan', 'Lorraine Bracco', 'Peri Gilpin', 'Angie Harmon'] 
gray()
for a in act:
    name = a.split()[1].lower()
    i = 0
    # This faces.txt contains all actors and actresses
    for line in open("subset_actors.txt"):
        if a in line:
            filename = name+str(i)+'.'+line.split()[4].split('.')[-1]
            x1 = int(line.split()[5].split(',')[0])
            y1 = int(line.split()[5].split(',')[1])  
            x2 = int(line.split()[5].split(',')[2]) 
            y2 = int(line.split()[5].split(',')[3])
            correctHash = line.split()[6]
            timeout(testfile.retrieve, (line.split()[4], "./uncropped/"+filename), {}, 30)
            if not os.path.isfile("./uncropped/"+filename):
                continue
            else:
                # Filter out the corrupted images
                file = open("./uncropped/"+filename, "rb").read()
                fileHash = sha256(file).hexdigest()
                if fileHash != correctHash:
                    continue
                try:
                    # Now crop the image at each loop
                    j = imread("./uncropped/"+filename)
                    # Crop the image at each loop and call rgb2gray function
                    out = j[y1:y2, x1:x2]
                    # Resize the image and save it
                    out = imresize(out, [28, 28])
                    imsave("./cropped/"+filename, out)
                except: # Handle the unexpected runtime errors
                    continue
            print(filename)
            
            i += 1
"""

    `imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
    Use ``imageio.imread`` instead.
    `imresize` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
    Use ``skimage.transform.resize`` instead.


noth0.jpg
noth1.jpg
noth2.jpg
noth3.jpg
noth4.jpg
noth5.jpg
noth6.jpg
noth7.jpg
noth8.jpg
noth9.jpg
noth10.jpg
noth11.jpg
noth12.jpg
noth13.jpg
noth14.jpg
noth15.jpg
noth16.jpg
noth17.jpg
noth18.jpg
noth19.png
noth20.jpg
noth21.jpg
noth22.jpg
noth23.jpg
klein0.jpg
klein1.jpg
klein2.jpg
klein3.jpg
klein4.jpg
klein5.jpg
klein6.jpg
klein7.jpg
klein8.jpg
klein9.jpg
klein10.jpg
statham0.jpg
statham1.jpg
statham2.jpg
statham3.jpg
statham4.jpg
statham5.jpg
statham6.jpg
statham7.jpg
statham8.jpg
statham9.jpg
statham10.jpg
statham11.jpg
statham12.jpg
butler0.jpg
butler1.jpg
butler2.jpg
butler3.jpg
butler4.jpg
butler5.jpg
butler6.jpg
butler7.jpg
butler8.jpg
butler9.jpg
butler10.jpg
butler11.jpg
butler12.jpg
butler13.jpg
butler14.jpg
butler15.jpg
butler16.jpg
butler17.jpg
butler18.jpg
butler19.jpg
butler20.jpg
butler21.jpg
butler22.jpg
butler23.jpg
butler24.jpg
butler25.jpg
butler26.jpg
butler27.jpg
butler28.jpg
butler29.jpg
butler30.jpg
butler31.jpg
butler32.png
butler33.jpg
butler34.jpg
butler3

KeyboardInterrupt: 


In [5]:
import os
import random
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from sklearn.model_selection import train_test_split


In [None]:
class ActorFacesDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        self.image_paths = []
        self.labels = []
        self.class_num = 0
        
       
        self.actors = sorted(os.listdir(root_dir))
        
        for label, actor in enumerate(self.actors):
            self.class_num += 1
            actor_folder = os.path.join(root_dir, actor)
            
            if os.path.isdir(actor_folder):
                for image_name in os.listdir(actor_folder):
                    if image_name.endswith(('.jpg', '.png', '.jpeg')): 
                        image_path = os.path.join(actor_folder, image_name)
                        self.image_paths.append(image_path)
                        self.labels.append(label)  # Use actor index as label
                
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        image_path = self.image_paths[idx]
        label = self.labels[idx]
        
        image = Image.open(image_path).convert("RGB")
        
        if self.transform:
            image = self.transform(image)
        
        return image, label


In [None]:
transform = transforms.Compose([
    transforms.Resize((28, 28)),  # Resize
    transforms.ToTensor(),  # Convert image to Tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize to ImageNet stats
])


In [None]:
root_dir = 'actor_faces'

dataset = ActorFacesDataset(root_dir=root_dir, transform=transform)

# Split shuffled indices rather than contiguous ranges: image_paths is ordered by actor,
# so contiguous ranges would place each actor in only one of the splits.
indices = list(range(len(dataset)))
train_indices, testval_indices = train_test_split(indices, test_size=0.3, random_state=42)
val_indices, test_indices = train_test_split(testval_indices, test_size=0.5, random_state=42)

train_labels = [dataset.labels[i] for i in train_indices]
val_labels = [dataset.labels[i] for i in val_indices]
test_labels = [dataset.labels[i] for i in test_indices]

train_dataset = torch.utils.data.Subset(dataset, train_indices)
val_dataset = torch.utils.data.Subset(dataset, val_indices)
test_dataset = torch.utils.data.Subset(dataset, test_indices)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
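The assignment text asks for 70 training images per face rather than a percentage split. One possible way to build such a split from the label list is sketched below. The function name `split_per_class`, the even validation/test division and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=70, seed=42):
    """Pick n_train indices per class for training; halve the rest into val/test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        train += class_indices[:n_train]
        rest = class_indices[n_train:]
        half = len(rest) // 2
        val += rest[:half]
        test += rest[half:]
    return train, val, test


toy_labels = [0] * 100 + [1] * 90  # two made-up classes
toy_train, toy_val, toy_test = split_per_class(toy_labels)
print(len(toy_train), len(toy_val), len(toy_test))  # 140 25 25
```

The returned index lists can be passed directly to `torch.utils.data.Subset`, so every actor appears in all three splits as long as the class has more than 70 images.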


In [None]:

for images, labels in train_loader:
    print(images[0].shape) 
    break


torch.Size([3, 28, 28])


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

class FCNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FCNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        nn.init.kaiming_normal_(self.fc1.weight, mode='fan_in', nonlinearity='relu')
        self.relu = nn.ReLU()                      
        self.fc2 = nn.Linear(hidden_dim, output_dim)  
        nn.init.kaiming_normal_(self.fc2.weight, mode='fan_in', nonlinearity='relu')
        self.dropout = nn.Dropout(p=0.5)
        self.softmax = nn.Softmax(dim=1)     

    def forward(self, x):
        x = x.reshape(x.shape[0], -1) # Flatten input
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

input_dim = 28 * 28 *3 
hidden_dim = 300    
output_dim = dataset.class_num
print(output_dim)

model = FCNN(input_dim, hidden_dim, output_dim)


265


In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
def train(model, train_loader, val_loader, epochs):
    train_losses = []
    val_losses = []
    for epoch in range(epochs):
        model.train()  # re-enable dropout; the validation pass below switches to eval mode
        train_loss = 0.0
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            #print(f'outputs: {outputs.shape}')
            #print(f'labels: {labels.shape}')
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation Loss
        val_loss = 0.0
        model.eval()
        with torch.no_grad():
            for images, labels in val_loader:
                outputs = model(images)

                #print(f'outputs: {outputs.shape}')
                #print(f'labels: {labels.shape}')
                loss = criterion(outputs, labels)
                val_loss += loss.item()

        train_losses.append(train_loss / len(train_loader))
        val_losses.append(val_loss / len(val_loader))
        print(f"Epoch {epoch + 1}/{epochs}, Train Loss: {train_loss / len(train_loader):.4f}, Validation Loss: {val_loss / len(val_loader):.4f}")

    return train_losses, val_losses

epochs = 5
train_losses, val_losses = train(model, train_loader, val_loader, epochs)


In [None]:
def evaluate(model, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images)
            #print(f'outputs: {outputs}')
            #print(f'labels: {labels}')
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            #print(f'predictions: {predicted}')
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")
    return accuracy

# Evaluate the model
test_accuracy = evaluate(model, test_loader)


labels: tensor([223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223,
        223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223,
        223, 223, 223, 223])
predictions: tensor([ 39, 127, 134, 134,  89,   1, 126, 157, 127, 157, 153,   4,  62,  88,
         49,   1, 144,  39,  77, 125,  62,  39,  39,  81,  21,  53,  61,  39,
         62, 127,  56,  77])
labels: tensor([223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223,
        223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223,
        223, 223, 223, 223])
predictions: tensor([ 39,  93, 173,  75,  39, 153,  30,  26,  39,  77, 127,  71,   1, 126,
        119,   1, 147, 157, 151,  53,  90, 127, 157,  39,   1,  23, 157,  48,
         39, 125,  37, 134])
labels: tensor([223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223,
        223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223, 223,
        223, 223, 223, 223])
predictions: tensor([ 56,

In [None]:
import matplotlib.pyplot as plt

plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Learning Curves')
plt.show()

### Part 2.2: AlexNet as a Fixed Feature Extractor
Extract the values of the activations of AlexNet on the face images. Use those as features in order to perform face classification: learn a fully-connected neural network that takes in the activations of the units in the AlexNet layer as inputs, and outputs the name of the person. Below, include a description of the system you built and its performance. It is recommended to start out with only using the `conv4` activations. Using `conv4` is sufficient here.
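Because the feature extractor is fixed, its activations can be computed once and reused for every epoch of classifier training. The sketch below shows that idea with a tiny stand-in network in place of the AlexNet layers, so that it runs without downloading weights. The function name `extract_features` and the toy batches are hypothetical.

```python
import torch
import torch.nn as nn


def extract_features(extractor, loader):
    """Run a frozen extractor over a loader and return flattened features and labels."""
    extractor.eval()
    feats, labels = [], []
    with torch.no_grad():  # no gradients are needed for a fixed extractor
        for images, targets in loader:
            feats.append(extractor(images).flatten(start_dim=1))
            labels.append(targets)
    return torch.cat(feats), torch.cat(labels)


# Stand-in for the truncated AlexNet feature stack (assumption for this sketch).
extractor = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
batches = [(torch.randn(4, 3, 16, 16), torch.randint(0, 5, (4,))) for _ in range(2)]
X, y = extract_features(extractor, batches)
print(X.shape, y.shape)  # torch.Size([8, 2048]) torch.Size([8])
```

The resulting tensors can be wrapped in a `TensorDataset`, and a small fully connected classifier can then be trained on them without running the convolutional layers again.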

In [39]:
alexnet = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)

Using cache found in C:\Users\Tolga/.cache\torch\hub\pytorch_vision_v0.10.0
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and may be removed in the future, "


In [40]:
for param in alexnet.parameters():
    param.requires_grad = False

In [54]:
print(alexnet.features[:10])

Sequential(
  (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (1): ReLU(inplace=True)
  (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (4): ReLU(inplace=True)
  (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU(inplace=True)
  (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): ReLU(inplace=True)
)


In [None]:
dummy_input = torch.randn(1, 3, 224, 224)
output = alexnet.features[:10](dummy_input)
print(output.shape)

torch.Size([1, 256, 13, 13])


In [56]:
class ModifiedAlexNet(nn.Module):
    def __init__(self, num_classes):
        super(ModifiedAlexNet, self).__init__()
        self.alexnet = torch.nn.Sequential(*list(alexnet.features)[:10])
        self.fc1 = nn.Linear(256 * 13 * 13, 300)  # First hidden layer (300 units)
        self.fc2 = nn.Linear(300, num_classes)

    def forward(self, x):
        x = self.alexnet(x)
        x = x.reshape(x.shape[0], -1)
        x = torch.relu(self.fc1(x))  # nonlinearity so the two linear layers do not collapse into one
        x = self.fc2(x)
        return x

In [51]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to the 224x224 input size used for AlexNet
    transforms.ToTensor(),  # Convert image to Tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize to ImageNet stats
])

In [None]:
root_dir = 'actor_faces'

dataset = ActorFacesDataset(root_dir=root_dir, transform=transform)

# Split shuffled indices rather than contiguous ranges: image_paths is ordered by actor,
# so contiguous ranges would place each actor in only one of the splits.
indices = list(range(len(dataset)))
train_indices, testval_indices = train_test_split(indices, test_size=0.3, random_state=42)
val_indices, test_indices = train_test_split(testval_indices, test_size=0.5, random_state=42)

train_labels = [dataset.labels[i] for i in train_indices]
val_labels = [dataset.labels[i] for i in val_indices]
test_labels = [dataset.labels[i] for i in test_indices]

train_dataset = torch.utils.data.Subset(dataset, train_indices)
val_dataset = torch.utils.data.Subset(dataset, val_indices)
test_dataset = torch.utils.data.Subset(dataset, test_indices)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


In [58]:
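criterion = nn.CrossEntropyLoss()  # Cross-entropy loss for classification
model = ModifiedAlexNet(dataset.class_num)
# Create the optimizer after the model so that it updates this model's trainable layers
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)  # Adam optimizer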
def train(model, train_loader, val_loader, epochs):
    train_losses = []
    val_losses = []
    for epoch in range(epochs):
        model.train()  # switch back to training mode after the previous validation pass
        train_loss = 0.0
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            #print(f'outputs: {outputs.shape}')
            #print(f'labels: {labels.shape}')
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation Loss
        val_loss = 0.0
        model.eval()
        with torch.no_grad():
            for images, labels in val_loader:
                outputs = model(images)

                #print(f'outputs: {outputs.shape}')
                #print(f'labels: {labels.shape}')
                loss = criterion(outputs, labels)
                val_loss += loss.item()

        train_losses.append(train_loss / len(train_loader))
        val_losses.append(val_loss / len(val_loader))
        print(f"Epoch {epoch + 1}/{epochs}, Train Loss: {train_loss / len(train_loader):.4f}, Validation Loss: {val_loss / len(val_loader):.4f}")

    return train_losses, val_losses

# Train the model
epochs = 1
train_losses, val_losses = train(model, train_loader, val_loader, epochs)


Epoch 1/1, Train Loss: 5.7942, Validation Loss: 5.7759


In [None]:
def evaluate_fc_model(model, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for features, labels in test_loader:
            outputs = model(features)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")
    return accuracy
"""
# Test set
X_test, y_test = ...  # Split features and labels for testing
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
"""

# Evaluate the model
evaluate_fc_model(model, test_loader)


### Part 2.3: Visualize Weights
Train two networks the way you did in Part 2.1. Use 300 and 800 hidden units in the hidden layer. Visualize 2 different hidden features (neurons) for each of the two settings, and briefly explain why they are interesting. A sample visualization of a hidden feature is shown below. Note that you probably need to use L2 regularization while training to obtain nice weight visualizations.

![](figures/figure2.jpg)
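A hidden unit's incoming weights can be turned into an image by undoing the flattening of the input. The sketch below assumes the input was a 3 x 28 x 28 tensor flattened in channel-first order, as in Part 2.1, and uses random weights in place of a trained layer. `unit_to_image` is a hypothetical helper name.

```python
import torch


def unit_to_image(weight_row, shape=(3, 28, 28)):
    """Reshape one hidden unit's weights to H x W x C and rescale them to [0, 1]."""
    img = weight_row.detach().reshape(shape)
    img = img - img.min()
    img = img / img.max().clamp(min=1e-8)  # avoid division by zero for flat weights
    return img.permute(1, 2, 0)  # channels last, as expected by plt.imshow


fc1_weight = torch.randn(300, 3 * 28 * 28)  # stand-in for a trained model's fc1.weight
image = unit_to_image(fc1_weight[0])
print(image.shape)  # torch.Size([28, 28, 3])
```

The result can be displayed with `plt.imshow(image)`. For the L2 regularization mentioned above, the `weight_decay` argument of the optimizer is one common option.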

### Part 2.4: Finetuning AlexNet
Train two networks the way you did in Part 2.1. Use 300 and 800 hidden units in the hidden layer. Visualize 2 different hidden features (neurons) for each of the two settings, and briefly explain why they are interesting. A sample visualization of a hidden feature is shown in Figure 4. Note that you probably need to use L2 regularization while training to obtain nice weight visualizations.

### Part 2.5: Bonus: Gradient Visualization
Here, you will use [Utku Ozbulak’s PyTorch CNN Visualizations Library](https://github.com/utkuozbulak/pytorch-cnn-visualizations/) to visualize the important parts of the input image for a particular output class. In particular, just select a specific picture of an actor, and then using your trained network in Part 2.4, perform Gradient visualization with guided backpropagation to understand the prediction for that actor with respect to the input image. Comment on your results.
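Before using the library, it can help to see the simplest form of gradient visualization. The sketch below computes a plain input-gradient saliency map, which is not guided backpropagation: it does not modify the ReLU backward pass. The toy linear model stands in for the trained network, and `input_gradient_saliency` is a hypothetical name.

```python
import torch
import torch.nn as nn


def input_gradient_saliency(model, image, target_class):
    """Absolute gradient of the target class score with respect to the input pixels."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()
    return image.grad.abs().max(dim=0).values  # strongest channel per pixel


toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 10))  # stand-in network
saliency = input_gradient_saliency(toy_model, torch.randn(3, 28, 28), target_class=3)
print(saliency.shape)  # torch.Size([28, 28])
```

Pixels with large values are those whose change would most affect the score of the chosen class; guided backpropagation usually gives sharper maps of the same kind.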

## What to Turn In
You have two options for submission:
1) Provide all the relevant answers to questions, images, figures, etc., in this Jupyter notebook, convert the Jupyter notebook into a PDF, and upload the PDF.
2) Write all the answers to the questions and any relevant figures in a LaTeX report, convert the report to a PDF, and upload a zip file containing both the Jupyter notebook and the report.

## Grading
The assignment will be graded out of `100` points: `0` (no submission), `20` (an attempt at a solution), `40` (a partially correct solution), `60` (a mostly correct solution), `80` (a correct solution), `100` (a particularly creative or insightful solution). The grading depends on both the content and clarity of your report.