# **Training and testing gender detector model**

Mount Google Drive with the dataset

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Unpacking the dataset archive

In [None]:
!unzip "/content/drive/My Drive/internship_data_cleaned.zip"

Analysis of the dataset revealed the following problems:
- duplicate images
- mislabeled images (class mismatch)
- images lacking informative features
- images in which the gender cannot be identified visually

Duplicates were removed with FSlint.
The remaining problems were partially fixed by hand.
Solving them completely would require machine learning methods.
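
Exact duplicates can also be found without an external tool. The following is a minimal sketch, not a description of how FSlint works: it groups files by the SHA-256 hash of their content. The helper names `file_digest` and `find_duplicates` are hypothetical, and only byte-identical files are detected, so resized or re-encoded copies would be missed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under `root` by content hash; return groups of 2+ files."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob('*')):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicates('internship_data/')` would return a list of groups; keeping the first path of each group and deleting the rest removes the exact duplicates.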

In [1]:
from glob import glob

data_path = 'internship_data/'
total_count = len(glob(data_path + '*/*'))
batch = 16 # Experiments have shown that a small batch size works better
workers = 4

# Divide the dataset in the ratio 0.9/0.05/0.05.
# We need a validation dataset to detect overfitting
train_count = int(0.9 * total_count)
valid_count = int(0.05 * total_count)
test_count = total_count - train_count - valid_count
print('''train_count: {} 
valid_count: {}
test_count:  {}'''.format(train_count, valid_count, test_count))

train_count: 89679 
valid_count: 4982
test_count:  4983


For validation and testing, we only resize and normalize the images. A large image size is not required, so we use 128×128 images. For training, we add random horizontal flips and random rotations to diversify the training data.
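
As a rough illustration of the normalization step, the sketch below reproduces in plain Python what `ToTensor` followed by `Normalize` does to a single 8-bit RGB pixel: scale to [0, 1], then subtract the channel mean and divide by the channel standard deviation. The mean and standard deviation are the values used in this notebook (the commonly used ImageNet statistics); `normalize_pixel` is a hypothetical helper, not part of torchvision.

```python
# Channel statistics used in the transforms of this notebook
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to the values the network receives."""
    scaled = [c / 255.0 for c in rgb]  # ToTensor scales uint8 values to [0, 1]
    # Normalize: (value - mean) / std, channel by channel
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

After normalization the inputs are roughly centered around zero, with values between about -2.1 and 2.6.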

In [2]:
from torch.utils.data import random_split, DataLoader, Dataset
from torchvision.datasets import ImageFolder
from torchvision import transforms
from torch import cuda

# Wrapper that applies a transform to the items of an existing dataset,
# so that each split can use its own transforms
class MapDataset(Dataset):
    def __init__(self, dataset, map_fn):
        self.dataset = dataset
        self.map = map_fn

    def __getitem__(self, index):
        # Fetch the item once: every access loads the image from disk
        x, y = self.dataset[index]  # image, label
        if self.map:
            x = self.map(x)
        return x, y

    def __len__(self):
        return len(self.dataset)


data_transforms = {'train': transforms.Compose([
                            transforms.Resize([128, 128]),
                            transforms.RandomHorizontalFlip(),
                            transforms.RandomRotation(degrees=15),
                            transforms.ToTensor(),
                            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                 std=[0.229, 0.224, 0.225])]),
                   'test':  transforms.Compose([
                            transforms.Resize([128, 128]),
                            transforms.ToTensor(),
                            transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                                                 std=[0.229, 0.224, 0.225])])}

# Create dataset without transforms
dataset = ImageFolder(data_path)

# Split to train, validation and test
train_dataset, valid_dataset, test_dataset = random_split(dataset, (train_count, 
                                                                    valid_count, 
                                                                    test_count))

# Mapping transforms to the corresponding datasets 
train_dataset = MapDataset(train_dataset, data_transforms['train'])
valid_dataset = MapDataset(valid_dataset, data_transforms['test'])
test_dataset = MapDataset(test_dataset, data_transforms['test'])

# Create three dataloaders. Only the training set needs to be shuffled
train_dataset_loader = DataLoader(train_dataset, batch_size=batch, shuffle=True, 
                                  num_workers=workers)  
valid_dataset_loader = DataLoader(valid_dataset, batch_size=batch, shuffle=False, 
                                  num_workers=workers) 
test_dataset_loader  = DataLoader(test_dataset, batch_size=batch, shuffle=False,
                                  num_workers=workers)
# Dict of dataloaders
dataloaders = {'train': train_dataset_loader, 
               'valid': valid_dataset_loader, 
               'test': test_dataset_loader}

# Use the GPU, if available, to speed up training
use_cuda = cuda.is_available()
print('\nuse_cuda: ', use_cuda)


use_cuda:  True


The task is binary image classification, so we use a CNN architecture. Because only two classes need to be distinguished, max pooling layers can shrink the feature maps, and with them the number of parameters in the fully connected layers, without losing information the task needs. Dropout layers reduce overfitting during long training. The output layer applies log_softmax, because NLLLoss expects log-probabilities as input.

In [3]:
from torch.nn import Module, Conv2d, MaxPool2d, Linear, Dropout
from torch.nn.functional import relu, log_softmax

class Net(Module):
    def __init__(self):
        super(Net, self).__init__()
        # 128x128x3
        self.conv1 = Conv2d(3, 32, kernel_size=3, stride=1, padding=0)
        # 126x126x32
        self.conv2 = Conv2d(32, 32, kernel_size=3, stride=1, padding=0)
        # 124x124x32
        self.max_pool1 = MaxPool2d(2, 2)
        
        # 62x62x32
        self.conv3 = Conv2d(32, 32, kernel_size=3, stride=1, padding=0)
        # 60x60x32
        self.conv4 = Conv2d(32, 32, kernel_size=3, stride=1, padding=0)
        # 58x58x32
        self.max_pool2 = MaxPool2d(2, 2)
         
        # 29x29x32
        self.conv5 = Conv2d(32, 64, kernel_size=3, stride=1, padding=0)
        # 27x27x64
        self.conv6 = Conv2d(64, 64, kernel_size=3, stride=1, padding=0)
        # 25x25x64
        self.max_pool3 = MaxPool2d(2, 2)

        # 12x12x64
        self.conv7 = Conv2d(64, 64, kernel_size=3, stride=1, padding=0)
        # 10x10x64
        self.conv8 = Conv2d(64, 64, kernel_size=3, stride=1, padding=0)
        # 8x8x64
        self.max_pool4 = MaxPool2d(2, 2)

        # 4x4x64
        self.fc1 = Linear(4*4*64, 512)        
        self.drop = Dropout(0.2)
        self.fc2 = Linear(512, 2)
        
    def forward(self, x):
        # Define forward behavior
        x = relu(self.conv1(x))
        x = relu(self.conv2(x))
        x = self.max_pool1(x)
            
        x = relu(self.conv3(x))
        x = relu(self.conv4(x))
        x = self.max_pool2(x)

        x = relu(self.conv5(x))
        x = relu(self.conv6(x))
        x = self.max_pool3(x)

        x = relu(self.conv7(x))
        x = relu(self.conv8(x))
        x = self.max_pool4(x)
             
        x = x.view(x.size(0), -1)
        x = self.drop(x)
        x = self.fc1(x)
        x = self.drop(x)        
        x = self.fc2(x)
        x = log_softmax(x, -1)
            
        return x
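
The size comments in `Net.__init__` can be checked with the standard output-size formula for unpadded convolutions and pooling. The sketch below is an illustration in plain Python; `conv_out`, `pool_out`, and `trace_sizes` are hypothetical helpers, and their defaults mirror the layer settings used above (3×3 kernels, stride 1, no padding, 2×2 pooling with stride 2).

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Side length after a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Side length after max pooling with no padding and ceil_mode=False."""
    return (size - kernel) // stride + 1


def trace_sizes(size=128, blocks=4):
    """Side lengths after each layer; every block is conv, conv, max pool."""
    sizes = [size]
    for _ in range(blocks):
        for layer in (conv_out, conv_out, pool_out):
            size = layer(size)
            sizes.append(size)
    return sizes


sizes = trace_sizes()
print(sizes)
# The last block has 64 channels, which gives the input size of fc1
print('fc1 input features:', sizes[-1] ** 2 * 64)
```

The final 4×4 map with 64 channels gives the 1024 input features of `fc1`. Note that the odd size 25 is floored to 12 by the third pooling layer, so the last row and column of that feature map are dropped.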

CNN model initialization

In [4]:
model = Net()
print(model)

if use_cuda:
    model.cuda()

Net(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  (max_pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  (conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  (max_pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv5): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (conv6): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
  (max_pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv7): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
  (conv8): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
  (max_pool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=1024, out_features=512, bias=True)
  (drop): Dropout(p=0.2, inplace=False)
  (fc2): Linear(in_features=512, out_features=2, bias=True)
)

In several experiments, NLLLoss applied to the log_softmax output gave higher accuracy than CrossEntropyLoss. Note that in PyTorch CrossEntropyLoss combines log_softmax and NLLLoss, so the two are mathematically equivalent and the difference may reflect run-to-run variation. Adam was chosen as the optimizer: it is stable and reliably finds a reasonably good solution. Experiments showed that a learning rate of 2e-4 works best for escaping local minima.
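
The relationship between the two loss functions can be checked numerically. The sketch below is plain Python, not PyTorch's implementation: for one hypothetical sample it computes the negative log-likelihood of the log-softmax output and the cross-entropy taken directly from the raw scores.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy computed directly from raw scores."""
    probs = [math.exp(z) for z in logits]
    return -math.log(probs[target] / sum(probs))


logits, target = [1.5, -0.3], 0  # hypothetical scores for one image
loss_nll = nll_loss(log_softmax(logits), target)
loss_ce = cross_entropy(logits, target)
print(loss_nll, loss_ce)
```

Both values agree up to floating-point error. Applying `log_softmax` to its own output changes nothing either, so feeding log-probabilities into a cross-entropy loss gives the same result as well.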

In [5]:
from torch.optim import Adam
from torch.nn import NLLLoss

criterion = NLLLoss()
optimizer = Adam(model.parameters(), lr=2e-4)

Write the training and validation function

In [6]:
import numpy as np
from torch import save, no_grad

def train(n_epochs, loaders, model, optimizer, criterion, use_cuda, save_path):
    """Train the model and save the weights with the lowest validation loss.

    Returns the model as it is after the last epoch.
    """
    # Initialize tracker for minimum validation loss
    valid_loss_min = np.inf
    
    for epoch in range(1, n_epochs+1):
        # Initialize variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0
        
        # Train the model
        model.train()
        for batch_idx, (data, target) in enumerate(loaders['train']):
            # Move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            
            # Zero the parameter gradients
            optimizer.zero_grad()
            # Calculate the loss
            loss = criterion(model(data), target)
            # Compute gradient of the loss
            loss.backward()
            # Parameters update
            optimizer.step()
            # Update the running mean of the per-batch training loss.
            # It is already an average, so no further division is needed
            train_loss += (loss.item() - train_loss) / (batch_idx + 1)
            # Print the running loss every 100 batches
            if batch_idx % 100 == 0:
                print('train_loss: {:.6f}'.format(train_loss))

        # Validate the model; gradients are not needed here
        model.eval()
        with no_grad():
            for batch_idx, (data, target) in enumerate(loaders['valid']):
                # Move to GPU
                if use_cuda:
                    data, target = data.cuda(), target.cuda()
                # Calculate the loss
                loss = criterion(model(data), target)
                # Update the running mean of the per-batch validation loss
                valid_loss += (loss.item() - valid_loss) / (batch_idx + 1)
        
        # Print training/validation statistics 
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'
              .format(epoch, train_loss, valid_loss))
        
        # Save the model if it's better
        if valid_loss_min > valid_loss:
            save(model.state_dict(), save_path)
            print('Model saved')
            valid_loss_min = valid_loss

    return model
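
The update of the form `mean += (x - mean) / (k + 1)` that `train` uses for the losses is an incremental mean: after each batch the variable already holds the average of all batch losses seen so far. A small sketch with hypothetical loss values (`running_mean` is an illustrative helper, not part of the notebook):

```python
def running_mean(values):
    """Incrementally update a mean, one value at a time."""
    mean = 0.0
    for k, x in enumerate(values):
        mean += (x - mean) / (k + 1)
    return mean


batch_losses = [0.9, 0.7, 0.4, 0.2]  # hypothetical per-batch losses
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
print(incremental, direct)
```

Both computations give the same average, so the tracked value needs no further normalization by the dataset size.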

Model training

In [None]:
epochs = 100 # We will track overfitting during training
model_path = 'model.pt'
model = train(epochs, dataloaders, model, optimizer, criterion, use_cuda, 
              model_path)
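
Tracking overfitting by eye over 100 epochs can be complemented by a stopping rule. The sketch below shows patience-based early stopping, which is an assumption and not part of the original training loop: training stops once the validation loss has not improved for `patience` consecutive epochs. The helpers `epochs_since_best` and `should_stop`, the default patience, and the loss values are all hypothetical.

```python
def epochs_since_best(valid_losses):
    """Number of epochs elapsed since the lowest validation loss."""
    best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    return len(valid_losses) - 1 - best_epoch


def should_stop(valid_losses, patience=5):
    """True if the loss has not improved during the last `patience` epochs."""
    return bool(valid_losses) and epochs_since_best(valid_losses) >= patience


history = [0.50, 0.40, 0.41, 0.42, 0.43]  # hypothetical validation losses
print(should_stop(history, patience=3))
```

In a training loop, the validation loss of each epoch would be appended to `history` and the loop left as soon as `should_stop(history)` returns `True`. The best weights are already saved separately, so nothing is lost by stopping.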

**Now let's test the model**

In [8]:
from torch import load

# Choose the device onto which the saved tensors are mapped
location = 'cuda' if use_cuda else 'cpu'

# Load the model parameters to the device used
model.load_state_dict(load(model_path, map_location=location))

<All keys matched successfully>

Test function

In [9]:
from torch import no_grad

def test(loaders, model, criterion, use_cuda):
    # Monitor test loss and accuracy
    test_loss = 0.
    correct = 0
    total = 0

    model.eval()
    # Gradients are not needed for evaluation
    with no_grad():
        for batch_idx, (data, target) in enumerate(loaders['test']):
            # Move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            # Forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # Calculate the loss
            loss = criterion(output, target)
            # Update the running mean of the per-batch test loss
            test_loss += (loss.item() - test_loss) / (batch_idx + 1)
            # Convert output log-probabilities to the predicted class index
            pred = output.argmax(dim=1)
            # Count predictions that match the true labels
            correct += pred.eq(target).sum().item()
            total += data.size(0)
            
    print('Test Loss: {:.6f}\n'.format(test_loss))

    print('\nTest Accuracy: %2d%% (%2d/%2d)' % (
        100. * correct / total, correct, total))

Run test

In [10]:
test(dataloaders, model, criterion, use_cuda)

Test Loss: 0.064187


Test Accuracy: 97% (4868/4983)
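
The `%2d` format in the test function truncates the percentage to a whole number. The exact figures can be recomputed from the reported counts, as in this short sketch:

```python
correct, total = 4868, 4983  # counts reported by the test run

accuracy = correct / total
accuracy_text = '{:.2%}'.format(accuracy)
error_text = '{:.2%}'.format(1 - accuracy)
print('Accuracy:   {}'.format(accuracy_text))
print('Error rate: {} ({} misclassified)'.format(error_text, total - correct))
```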


The model reached 97% accuracy on the test set (4868 of 4983 images).
However, the dataset needs deeper cleaning with clustering algorithms to eliminate the remaining problems.
Further optimization of the model could aim at reducing the depth of the network and at using an ensemble of networks.
In one experiment with grayscale images, the model also performed well,
which suggests that a less resource-intensive architecture is feasible.