# **Instructions**

We built a classifier for a multi-label image classification task, trained and validated on the 30,000-sample training set. To run the code, follow the table of contents and the comments in the .ipynb file and execute the cells from top to bottom. The results of one experiment are shown in the training section (3.1).


### **1.1 Package import**

This section imports the required packages and requests access to Google Drive so that the data can be downloaded in section 2.1 (Data loading).

In [None]:
import numpy as np
import re
import pandas as pd
import os

# torch package
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.optim import lr_scheduler
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import transforms, models

import io
from io import StringIO
from itertools import chain
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
import time
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt

# NLP packages
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.tag.perceptron import PerceptronTagger
import gensim.downloader as api

# request access to Google Drive
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


### **1.2 Hyperparameters setting**

This cell contains most of the parameters used by our model.
Our project team ran its comparison experiments mainly by changing the hyperparameter values in this cell.

In [None]:
# show the runtime device type
using_cuda = torch.cuda.is_available()
device = torch.device('cuda' if using_cuda else 'cpu')
if using_cuda:
    print('We are using cuda.')
    print(torch.cuda.get_device_properties(device))

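# hyperparameters
label_nums = 20                                   # number of label classes
dataPath = './data/'                              # directory containing the images
LEAST_FREQUENCY = 1                               # a word must occur more than this many times to be kept
SEQUENCE_LENGTH = 8                               # number of words per caption after padding/truncation
VALIDATION_RATIO = 0.2                            # fraction of the training data held out for validation
BATCH_SIZE = 32                                   # batch size
LEARNING_RATE = 8e-5                              # initial learning rate
DROPOUT = 0.3                                     # dropout rate in the transformer encoder
THRESHOLD = 0.3                                   # probability threshold for predicting a label
NUM_EPOCHS = 7                                    # number of training epochs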

kwargs = {'num_workers': 1, 'pin_memory': True} if using_cuda else {}

We are using cuda.
_CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16280MB, multi_processor_count=56)
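
Since the comparison experiments are run by editing the values in this cell, one option is to group them in a small immutable record, so that each run's settings can be derived from a baseline and logged alongside its results. The sketch below is an illustration only: the `HParams` class and the `variants` helper are hypothetical names that do not exist in the notebook, the default values are copied from the cell above, and the alternative learning rates are made up.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class HParams:
    """Hypothetical container; defaults mirror the values in the cell above."""
    label_nums: int = 20
    least_frequency: int = 1
    sequence_length: int = 8
    validation_ratio: float = 0.2
    batch_size: int = 32
    learning_rate: float = 8e-5
    dropout: float = 0.3
    threshold: float = 0.3
    num_epochs: int = 7


def variants(base, field, values):
    """Return one copy of `base` per value, changing only `field`."""
    return [replace(base, **{field: value}) for value in values]


baseline = HParams()
# made-up learning rates, used only to show how variants are derived
runs = variants(baseline, "learning_rate", [8e-5, 1e-4, 3e-4])
for hp in runs:
    print(asdict(hp))
```

Because the record is frozen, a run cannot silently change a value halfway through an experiment; every variant is a new object.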


### **1.3 Supporting Functions**

We created several support functions for the subsequent experiments, covering NLP preprocessing, image transforms and image loading. They are defined and commented in this section.
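
One of the steps in `nlp_process` (defined in the cell below) turns each space-separated label string into a fixed-length 0/1 vector with `MultiLabelBinarizer`. The dependency-free sketch below shows the same encoding in plain Python to make the target format concrete. `to_multi_hot` is a hypothetical helper and the sample label string is made up.

```python
def to_multi_hot(label_string, num_labels=20):
    """Encode a string such as "1 5 19" as a 0/1 list of length num_labels."""
    vector = [0] * num_labels
    for token in label_string.split():
        vector[int(token)] = 1
    return vector


sample = to_multi_hot("1 5 19")
print(sample)
```

Each position in the vector is an independent yes/no target, which is why the model later ends in a sigmoid and is trained with binary cross-entropy.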

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.tag.perceptron import PerceptronTagger
import gensim.downloader as api

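# read a CSV file whose fields may contain unescaped double quotes
def csv_read(file_load):
    with open(file_load) as CsvFile:
        # prefix every quote inside a field with '/' so pandas can treat '/' as the escape character
        lines = [re.sub(r'([^,])"(\s*[^\n])', r'\1/"\2', line) for line in CsvFile]
    return pd.read_csv(StringIO(''.join(lines)), escapechar="/")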

# preprocess the captions and labels of the training and test sets
def nlp_process(PathTa, PathTe,label_nums,emb_model):
    # read both CSV files
    ListTa = csv_read(PathTa)
    ListTe = csv_read(PathTe)
    # extract the columns of the train and test tables
    traImageId = ListTa['ImageID'].tolist()
    traLable = ListTa['Labels'].tolist()
    traCaption = ListTa['Caption'].tolist()
    testImageId = ListTe['ImageID'].tolist()
    testCaption = ListTe['Caption'].tolist()

    # convert the space-separated label strings into multi-hot vectors
    TransKit = MultiLabelBinarizer().fit([np.arange(label_nums)])
    traLabelsL = [[int(label) for label in labels.split()] for labels in traLable]
    traLable = TransKit.transform(traLabelsL)

    # NLP preprocessing
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # known misspellings in the captions, removed outright
    word_error = {'windsurfs','kiteboards','tball','skiboard','bluewhite',  'krispee', 'deckered',
                'firsbee', 'frizbee',  'snowcovered',  'rared','surfboarding',  'umbrells',
                'midswing',  'firehydrant',  'frisbe','skii', 'skiies', 'surfboarder', 
                'baeball','parasailers','basball','fourwheeler','checkerd','blackandwhite'}
    # build the POS tagger once instead of once per caption
    pos_tagger = PerceptronTagger()
    # keep nouns and gerunds only
    tag_set = {'NN', 'NNS', 'NNP', 'NNPS', 'VBG'}

    def clean_caption(sentence):
        # remove punctuation
        sentence = re.sub(r'[^\w\s]', '', sentence)
        # lowercase
        sentence = sentence.lower()
        # tokenize the sentence into a list of words
        sentence = word_tokenize(sentence)
        # remove stop words
        sentence = [word for word in sentence if word not in stop_words]
        # lemmatization (disabled)
        # sentence = [lemmatizer.lemmatize(word) for word in sentence]
        # POS filter: each word is tagged on its own, without sentence context
        sentence = [word for word in sentence if pos_tagger.tag([word])[0][1] in tag_set or word == 'puppy']
        # remove known misspellings
        return [word for word in sentence if word not in word_error]

    # apply the same cleaning to the training and test captions
    traCaption = [clean_caption(sentence) for sentence in traCaption]
    testCaption = [clean_caption(sentence) for sentence in testCaption]

    # count how often each word appears across the training and test captions
    freqWord = {}
    for caption in traCaption + testCaption:
        for word in caption:
            freqWord[word] = freqWord.get(word, 0) + 1
    freq_list = freqWord


    # drop rare words: keep a word only if it occurs more than LEAST_FREQUENCY times
    # (with LEAST_FREQUENCY = 1, a word must occur at least twice);
    # a caption made up entirely of rare words is kept unchanged
    traHighWords = []
    for wordOne in traCaption:
        temp = [word for word in wordOne if freq_list[word] > LEAST_FREQUENCY]
        traHighWords.append(temp if temp else wordOne)
    trainCaption = traHighWords

    testHigh = []
    for wordTwo in testCaption:
        temp = [word for word in wordTwo if freq_list[word] > LEAST_FREQUENCY]
        testHigh.append(temp if temp else wordTwo)
    testCaption = testHigh

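    # pad by repetition: repeat each caption until it has at least SEQUENCE_LENGTH words,
    # then truncate it to exactly SEQUENCE_LENGTH words
    # (assumes every caption still has at least one word; an empty list would never grow)
    output_train = []
    for wordlistaa in trainCaption:
        while len(wordlistaa) < SEQUENCE_LENGTH:
            wordlistaa = wordlistaa * 2
        output_train.append(wordlistaa[:SEQUENCE_LENGTH])
    trainCaption = output_train

    output_test = []
    for wordlistbb in testCaption:
        while len(wordlistbb) < SEQUENCE_LENGTH:
            wordlistbb = wordlistbb * 2
        output_test.append(wordlistbb[:SEQUENCE_LENGTH])
    testCaption = output_test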


    # map each word to its pretrained embedding; out-of-vocabulary words become zero vectors
    emb_dim = emb_model.vector_size
    output_train = []
    for dimWordaa in trainCaption:
        out_temp = []
        for word in dimWordaa:
            try:
                # index the KeyedVectors object directly; its `.wv` alias is not available in newer gensim releases
                out_temp.append(emb_model[word])
            except KeyError:
                out_temp.append(np.zeros(emb_dim, np.float32))
        output_train.append(out_temp)
    trainCaption = np.array(output_train, np.float32)

    # same for the test captions
    output_test = []
    for dimWordbb in testCaption:
        out_temp = []
        for word in dimWordbb:
            try:
                out_temp.append(emb_model[word])
            except KeyError:
                out_temp.append(np.zeros(emb_dim, np.float32))
        output_test.append(out_temp)
    testCaption = np.array(output_test, np.float32)

    return traImageId,trainCaption,traLable,testImageId,testCaption



# define the image preprocessing for the training and validation/test splits
# https://pytorch.org/docs/stable/torchvision/transforms.html#transforms-on-torch-tensor
image_transforms = {
    'train': transforms.Compose([
        # colour augmentation: randomly jitter brightness, contrast and hue
        transforms.ColorJitter(brightness=0.15, contrast=0.1, hue=0.2),                         
        transforms.Resize(256),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.463, 0.449, 0.421], [0.241, 0.234, 0.238])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.463, 0.449, 0.421], [0.241, 0.234, 0.238])
    ]),
}

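# load an image from the data directory
def load_image(dataId):
    # convert to RGB so that grayscale images also yield the three channels Normalize expects
    return Image.open(os.path.join(dataPath, dataId)).convert('RGB')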

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
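
`csv_read` exists because some captions contain unescaped double quotes, which would otherwise confuse the CSV parser. The sketch below runs the same regular expression on a made-up line and parses the result with the standard-library `csv` module instead of pandas. The column order (ImageID, Labels, Caption), the sample row and the helper name `escape_inner_quotes` are assumptions for illustration.

```python
import csv
import io
import re


def escape_inner_quotes(line):
    """Prefix quotes that sit inside a field with '/', as csv_read does."""
    return re.sub(r'([^,])"(\s*[^\n])', r'\1/"\2', line)


# made-up file content: the caption contains two unescaped inner quotes
raw = 'ImageID,Labels,Caption\n1.jpg,3,"A sign that says "stop" here"\n'
escaped = ''.join(escape_inner_quotes(line) for line in io.StringIO(raw))
rows = list(csv.reader(io.StringIO(escaped), escapechar='/'))
print(rows[1])
```

The opening quote (preceded by a comma) and the closing quote (followed only by the newline) are left alone, so the parser still sees one quoted field. One caveat of this approach: a literal `/` in a caption would also be read as an escape character.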


### **2.1 Data loading**

Load the training and test data sets and unzip them.

In [None]:
# download the zip file from Google Drive and unzip it
file_id = '1dtjKoWgZO_gYm7Kt94jVnZgbjlrG8DVi'
downloaded = drive.CreateFile({'id': file_id}) 
downloaded.GetContentFile('dataset')

!unzip dataset

### **2.2 Data Preprocessing**

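The captions are preprocessed with the following natural language processing steps: punctuation removal, case-folding, tokenization, stop-word removal, a part-of-speech filter that keeps nouns and gerunds, removal of known misspellings, rare-word filtering, padding to a fixed length, and word embedding with a pretrained model ("glove-twitter-100"). Lemmatization is present in `nlp_process` but commented out.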
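
To make the order of the first few steps concrete, here is a dependency-free sketch of punctuation removal, case-folding, tokenization and stop-word removal. It is a simplification: the notebook uses NLTK's `word_tokenize`, the full English stop-word list and a POS filter, while this version uses `str.split` and a tiny hypothetical stop-word set. The helper name and the sample caption are made up.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "is", "with", "and"}


def simple_clean(caption):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    no_punct = re.sub(r"[^\w\s]", "", caption)
    tokens = no_punct.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = simple_clean("A dog is catching a Frisbee, in the park!")
print(cleaned)
```

Punctuation is removed before tokenization, so a token such as `Frisbee,` never reaches the stop-word or vocabulary lookups with a trailing comma attached.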

In [None]:
# prepare the training and validation sets

train_path = './train.csv'
test_path = './test.csv'
emb_model = api.load("glove-twitter-100") 
input_dim = emb_model.vector_size

traImageId,train_captions_list,traLable,testImageId,testCaption = nlp_process(train_path, test_path,label_nums,emb_model)
train_id, val_id, train_caption, val_caption, train_label, val_label = train_test_split(np.array(traImageId), 
                                                                                        np.array(train_captions_list, np.float32), 
                                                                                        np.array(traLable), 
                                                                                        test_size=VALIDATION_RATIO, shuffle=True)
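
Two of the list-level steps inside `nlp_process`, dropping rare words and padding by repetition, are easy to check in isolation. The sketch below re-implements them in plain Python on made-up captions. The helper names are hypothetical, the word counts here come from a single list rather than from the training and test captions together, and the early return for an empty caption is an addition of this sketch.

```python
from collections import Counter


def drop_rare(captions, least_frequency=1):
    """Keep words seen more than `least_frequency` times.

    A caption that would become empty is kept whole instead.
    """
    counts = Counter(word for caption in captions for word in caption)
    result = []
    for caption in captions:
        kept = [word for word in caption if counts[word] > least_frequency]
        result.append(kept if kept else caption)
    return result


def pad_by_repetition(words, length=8):
    """Repeat the word list until it is long enough, then cut to `length`."""
    if not words:  # added guard: doubling an empty list never grows it
        return []
    while len(words) < length:
        words = words * 2
    return words[:length]


captions = [["dog", "frisbee", "park"], ["dog", "beach"], ["zebra"]]
filtered = drop_rare(captions)
padded = [pad_by_repetition(caption) for caption in filtered]
print(filtered)
print(padded)
```

Padding by repetition means a short caption contributes the same words several times instead of zero vectors, so every position fed to the Transformer carries real content.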



### **2.3 Input preparation**

This section prepares the model input.
The output is split into three parts: train_data, val_data and test_data.

In [None]:

# Map-style datasets
# https://pytorch.org/docs/stable/data.html?highlight=dataset#map-style-datasets
class my_Data_Set(torch.utils.data.Dataset):
    def __init__(self, flag, transform=None, loader=None):
        super(my_Data_Set, self).__init__()

        self.flag = flag
        if flag == 'train':
            self.images = train_id.tolist()
            self.caption = torch.from_numpy(train_caption)
            self.labels = train_label.tolist()
        elif flag == 'val':
            self.images = val_id.tolist()
            self.caption = torch.from_numpy(val_caption)
            self.labels = val_label.tolist()
        elif flag == 'test':
            self.images = testImageId
            self.caption = torch.from_numpy(testCaption)

        self.transform = transform
        self.loader = loader

    # return one sample by index
    def __getitem__(self, item):
        # obtain the image file name
        imageName = self.images[item]
        # load the image
        image = self.loader(imageName)
        # apply the transforms
        if self.transform is not None:
            image = self.transform(image)
        # get the caption embedding
        caption = self.caption[item]

        # training and validation samples carry labels
        if not self.flag == 'test':
            label = self.labels[item]
            # convert the label to float, as required by BCELoss
            label = torch.FloatTensor(label)
            return image, caption, label
        # test samples have no labels
        else:
            return image, caption
 
    # get the length
    def __len__(self):
        return len(self.images)


# build the three datasets and wrap them in dataloaders
train_data = my_Data_Set(flag='train', transform=image_transforms['train'], loader=load_image)
val_data = my_Data_Set(flag='val', transform=image_transforms['val'], loader=load_image)
test_data = my_Data_Set(flag='test', transform=image_transforms['val'], loader=load_image)

train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, **kwargs)
val_dataloader = DataLoader(val_data, batch_size=BATCH_SIZE, **kwargs)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, **kwargs)

dataloaders = {'train': train_dataloader, 'val': val_dataloader, 'test': test_dataloader}
# dataset sizes, used later to average the loss per sample
dataset_sizes = {'train': len(train_data), 'val': len(val_data), 'test': len(test_data)}
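
A map-style dataset only has to provide `__getitem__` and `__len__`; `DataLoader` then groups indices into batches. The plain-Python sketch below mimics that contract without PyTorch to show why the test split yields pairs while the other splits yield triples. `ToyDataset` and `iter_batches` are hypothetical stand-ins, not the notebook's classes, and the sample values are made up.

```python
class ToyDataset:
    """Minimal map-style dataset: indexable and sized."""

    def __init__(self, images, captions, labels=None):
        self.images = images
        self.captions = captions
        self.labels = labels

    def __getitem__(self, index):
        if self.labels is None:  # test split: no labels available
            return self.images[index], self.captions[index]
        return self.images[index], self.captions[index], self.labels[index]

    def __len__(self):
        return len(self.images)


def iter_batches(dataset, batch_size):
    """Yield lists of samples; the last batch may be shorter."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


train = ToyDataset(["0.jpg", "1.jpg", "2.jpg"], ["c0", "c1", "c2"], [[1, 0], [0, 1], [1, 1]])
test = ToyDataset(["3.jpg"], ["c3"])
batch_sizes = [len(batch) for batch in iter_batches(train, 2)]
print(batch_sizes, len(train[0]), len(test[0]))
```

The shorter final batch is the reason the training loop later weights each batch loss by its size before averaging.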

### **2.4 Training Functions**

The training procedure consists of the following functions:

train_model(): runs the full loop, alternating training and validation in each epoch

train_epoch(): runs one training epoch

val_epoch(): runs one validation epoch

train_iter(): performs a single optimization step on one batch

calculate_acuracy(): returns the precision and recall of the current batch

predict(): predicts the labels of the test data
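
`calculate_acuracy()` computes micro-averaged precision and recall over a whole batch: every (sample, label) cell is thresholded and counted together. A plain-Python sketch of the same arithmetic follows. The function name and the sample numbers are illustrative, and the zero-denominator guard on recall is an assumption added in this sketch.

```python
def micro_precision_recall(probs, labels, threshold=0.3):
    """probs and labels are equally shaped lists of rows; labels hold 0/1."""
    predicted = true_positive = actual = 0
    for prob_row, label_row in zip(probs, labels):
        for prob, label in zip(prob_row, label_row):
            hit = 1 if prob > threshold else 0
            predicted += hit               # labels predicted as positive
            actual += label                # labels that are truly positive
            true_positive += hit * label   # predicted and truly positive
    if predicted == 0:
        return 0.0, 0.0
    recall = true_positive / actual if actual else 0.0
    return true_positive / predicted, recall


probs = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.35]]
labels = [[1, 0, 0], [0, 1, 1]]
precision, recall = micro_precision_recall(probs, labels)
print(precision, recall)
```

With these numbers four cells exceed the threshold and three of them are correct, while all three true labels are found, so lowering the threshold trades precision for recall.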

In [None]:
# calculate the micro-averaged precision and recall of one batch
def calculate_acuracy(model_pred, labels):

    # binarize the predicted probabilities with the threshold
    pred_result = model_pred > THRESHOLD
    pred_result = pred_result.float()
    # number of labels predicted as positive
    pred_one_num = torch.sum(pred_result)
    # no positive prediction: precision is undefined, so report 0 for both
    if pred_one_num == 0:
        return 0, 0
    # number of positive labels in the ground truth
    target_one_num = torch.sum(labels)
    # number of correctly predicted positive labels (true positives)
    true_predict_num = torch.sum(pred_result * labels)
    # precision = true positives / predicted positives
    precision = true_predict_num / pred_one_num
    # recall = true positives / actual positives
    recall = true_predict_num / target_one_num
 
    return precision.item(), recall.item()


def train_iter(model, device, optimizer, criterion, images, captions, targets):

    # move the batch to the target device
    images, captions, targets = images.to(device), captions.to(device), targets.to(device)
    # reset the gradients
    optimizer.zero_grad()
    # forward pass
    outputs = model(images, captions)
    # binary cross-entropy loss
    loss = criterion(outputs, targets)
    # precision and recall of this batch
    precision, recall = calculate_acuracy(outputs, targets)
    # backward pass
    loss.backward()
    # update the parameters
    optimizer.step()
    return loss, precision, recall


def train_epoch(model, device, dataloader, optimizer, criterion, scheduler, running_loss, running_precision, running_recall, batch_num):

    # switch to training mode
    model.train()
    for data in dataloader:
        # unpack one batch
        images, captions, targets = data
        loss, precision, recall = train_iter(model, device, optimizer, criterion, images, captions, targets)
        # accumulate the size-weighted loss and the per-batch precision and recall
        running_loss += loss.item() * images.size(0)
        running_precision += precision
        running_recall += recall
        batch_num += 1
    # decay the learning rate once per epoch
    scheduler.step()
    return running_loss, running_precision, running_recall, batch_num


def val_epoch(model, device, dataloader, criterion, running_loss, running_precision, running_recall, batch_num):

    # switch to evaluation mode (used for the validation set)
    model.eval()
    # disable gradient tracking
    with torch.no_grad():
        for data in dataloader:
            # unpack one batch and move it to the GPU or CPU
            images, captions, targets = data
            images, captions, targets = images.to(device), captions.to(device), targets.to(device)
            # forward pass
            outputs = model(images, captions)
            # binary cross-entropy loss
            loss = criterion(outputs, targets)
            # accumulate the size-weighted loss and the per-batch precision and recall
            running_loss += loss.item() * images.size(0)
            precision, recall = calculate_acuracy(outputs, targets)
            running_precision += precision
            running_recall += recall
            batch_num += 1  
    return running_loss, running_precision, running_recall, batch_num


def train_model(model, device, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()
    # loop over the configured number of epochs
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch + 1, num_epochs))
        # each epoch has a training phase and a validation phase
        for phase in ['train', 'val']:
            # reset the running statistics before each phase
            running_loss = 0.0
            running_precision = 0.0
            running_recall = 0.0
            batch_num = 0

            if phase == 'train':
                running_loss, running_precision, running_recall, batch_num = train_epoch(model, device, dataloaders[phase], optimizer, criterion, scheduler, running_loss, running_precision, running_recall, batch_num)
            elif phase == 'val':
                running_loss, running_precision, running_recall, batch_num = val_epoch(model, device, dataloaders[phase], criterion, running_loss, running_precision, running_recall, batch_num)
        
            # average the loss over samples and the precision and recall over batches
            epoch_loss = running_loss / dataset_sizes[phase]
            print('{:5} Loss: {:.4f} '.format(phase, epoch_loss), end=' ')
            epoch_precision = running_precision / batch_num
            print('{:5} Precision: {:.4f} '.format(phase, epoch_precision), end=' ')
            epoch_recall = running_recall / batch_num
            print('{:5} Recall: {:.4f} '.format(phase, epoch_recall), end=' ')
            mean_f1_score = 2 * epoch_precision * epoch_recall / (epoch_precision + epoch_recall)
            print('{:5} F1_score: {:.4f}'.format(phase, mean_f1_score))
        # save a checkpoint once per epoch, after both phases have run
        torch.save(model, 'The_'+ str(epoch + 1) + '_epoch_model.pt')
    time_cost = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_cost // 60, time_cost % 60))


def predict(model, device, dataloader=dataloaders['test']):

    # switch to evaluation mode
    model.eval()
    predicted = torch.tensor([])
    # disable gradient tracking
    with torch.no_grad():
        for data in dataloader:
            # unpack one batch (test data has no labels)
            images, captions = data
            images, captions = images.to(device), captions.to(device)
            # forward pass
            output = model(images, captions)
            # collect the predicted probabilities on the CPU
            predicted = torch.cat((predicted, output.cpu()), 0)
    return predicted
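
`train_epoch()` and `val_epoch()` accumulate `loss.item() * images.size(0)`, and `train_model()` divides the total by the dataset size, which gives a per-sample mean even when the last batch is smaller. Precision and recall, by contrast, are simple means over batches. The sketch below uses made-up batch values to show the difference. `epoch_means` is a hypothetical helper, not a function from the notebook.

```python
def epoch_means(batches):
    """batches: list of (batch_size, mean_loss, precision) tuples."""
    total_samples = sum(size for size, _, _ in batches)
    # size-weighted: each batch's mean loss counts once per sample
    loss = sum(size * mean_loss for size, mean_loss, _ in batches) / total_samples
    # unweighted: every batch counts equally, whatever its size
    precision = sum(p for _, _, p in batches) / len(batches)
    return loss, precision


# made-up values: two full batches and one half-size final batch
batches = [(32, 0.10, 0.80), (32, 0.20, 0.70), (16, 0.40, 0.30)]
loss, precision = epoch_means(batches)
print(round(loss, 4), round(precision, 4))
```

A small final batch therefore has less influence on the reported loss than on the reported precision and recall.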

### **3.1 Defining and training Model**

This section implements the classification model, which combines a DenseNet and a Transformer encoder.
The image branch is a pretrained DenseNet-169 and the caption branch is a multi-head self-attention Transformer encoder; their outputs are concatenated and passed to a final linear layer. After training with the best hyperparameters, the highest accuracy of the final model was 0.8516 on the training set, and one epoch of training and evaluation took about six minutes.

In [None]:
class MixNet(nn.Module):
    def __init__(self):

        super(MixNet, self).__init__()
        
        # Image branch: pretrained DenseNet-169
        # Referred from https://pytorch.org/vision/stable/generated/torchvision.models.densenet169.html
        self.densenet = models.densenet169(pretrained = True, progress = False, memory_efficient = False)
        
        # Number of features entering the original classifier
        image_features_num = self.densenet.classifier.in_features

        # Replace the ImageNet classifier with a 64-dimensional projection
        self.densenet.classifier = nn.Linear(image_features_num, 64)
        
        # Caption branch: Transformer encoder with 5 layers and 5 attention heads
        encoder_layer = nn.TransformerEncoderLayer(d_model=input_dim, nhead=5, dropout = DROPOUT)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=5)

        # Project the flattened encoder output to 64 dimensions
        self.trans_linear = nn.Linear(SEQUENCE_LENGTH * input_dim, 64)
        
        # Fusion layer: concatenated features -> one score per label, squashed by a sigmoid
        self.linear = nn.Linear(64*2, label_nums)
        self.sigmoid = nn.Sigmoid()

    def forward(self, image, caption):
        """
        :param image: batch of image tensors produced by the image transforms
        :param caption: batch of caption embeddings, shape (batch, SEQUENCE_LENGTH, input_dim)
        :return: per-label probabilities, shape (batch, label_nums)
        """  
        # image features from DenseNet
        densenet_output = self.densenet(image)

        # the encoder expects (sequence, batch, feature), so transpose before and after
        caption = caption.transpose(1, 0)
        encoder_out = self.transformer_encoder(caption).transpose(1, 0)

        # flatten the encoder output and project it to 64 dimensions
        trans_out = self.trans_linear(torch.flatten(encoder_out, start_dim=1, end_dim=-1))
        
        # concatenate both branches and map them to label scores
        output_concate = self.linear(torch.cat((densenet_output, trans_out), 1))

        # squash the scores into the range 0-1
        output_final = self.sigmoid(output_concate)
        return output_final
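
The tensor shapes through `MixNet.forward` can be traced by hand. The sketch below does that bookkeeping with tuples only, without PyTorch. It assumes a batch of 32, `SEQUENCE_LENGTH = 8` and the 100-dimensional embeddings of "glove-twitter-100". `trace_shapes` is a hypothetical helper, not part of the model.

```python
def trace_shapes(batch=32, seq_len=8, emb_dim=100, hidden=64, num_labels=20):
    """Return the shape of each intermediate tensor in forward() as a tuple."""
    shapes = {}
    shapes["densenet_output"] = (batch, hidden)               # replaced classifier head
    shapes["caption_in"] = (batch, seq_len, emb_dim)          # as produced by the dataset
    shapes["caption_transposed"] = (seq_len, batch, emb_dim)  # sequence-first for the encoder
    shapes["encoder_out"] = (batch, seq_len, emb_dim)         # transposed back
    shapes["flattened"] = (batch, seq_len * emb_dim)          # input to trans_linear
    shapes["trans_out"] = (batch, hidden)
    shapes["concatenated"] = (batch, hidden * 2)              # input to the final linear layer
    shapes["output"] = (batch, num_labels)                    # sigmoid probabilities
    return shapes


shapes = trace_shapes()
for name, shape in shapes.items():
    print(f"{name:20s} {shape}")
```

The embedding size also constrains the attention configuration: the model dimension must be divisible by the number of heads, and 100 / 5 gives 20 dimensions per head.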


In this part we run the experiment: we train the model and look for the best configuration by tuning the hyperparameters.
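
The training cell that follows pairs Adam with `StepLR(step_size=1, gamma=0.9)`, and the scheduler is stepped once at the end of every training epoch, so the learning rate shrinks by 10% per epoch. For a zero-based epoch index e this is lr_e = lr_0 * gamma^floor(e / step_size). The sketch evaluates that formula in plain Python; `step_lr` is a hypothetical helper that mirrors the formula, not the PyTorch class.

```python
def step_lr(base_lr, epoch, step_size=1, gamma=0.9):
    """Learning rate used during zero-based `epoch` under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# seven epochs starting from LEARNING_RATE = 8e-5
schedule = [step_lr(8e-5, epoch) for epoch in range(7)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```

Over seven epochs the rate falls to roughly half of its starting value, so `LEARNING_RATE` and `NUM_EPOCHS` are best tuned together.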

In [None]:
model = MixNet().to(device)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

train_model(model, device, criterion=criterion, optimizer=optimizer, scheduler=scheduler, num_epochs=NUM_EPOCHS)

Downloading: "https://download.pytorch.org/models/densenet169-b2777c0a.pth" to /root/.cache/torch/hub/checkpoints/densenet169-b2777c0a.pth


Epoch 1/7
train Loss: 0.1263  train Precision: 0.7541  train Recall: 0.6881  train F1_score: 0.7196
val   Loss: 0.0828  val   Precision: 0.7923  val   Recall: 0.8080  val   F1_score: 0.8001
-----------------------------------------------------------------------------------------
Epoch 2/7
train Loss: 0.0819  train Precision: 0.8166  train Recall: 0.7989  train F1_score: 0.8076
val   Loss: 0.0742  val   Precision: 0.8238  val   Recall: 0.8205  val   F1_score: 0.8221
-----------------------------------------------------------------------------------------
Epoch 3/7
train Loss: 0.0748  train Precision: 0.8328  train Recall: 0.8147  train F1_score: 0.8237
val   Loss: 0.0724  val   Precision: 0.8147  val   Recall: 0.8312  val   F1_score: 0.8229
-----------------------------------------------------------------------------------------
Epoch 4/7
train Loss: 0.0704  train Precision: 0.8421  train Recall: 0.8252  train F1_score: 0.8336
val   Loss: 0.0705  val   Precision: 0.8309  val   Recall: 0
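
The F1 score printed in the log is derived from the epoch-level precision and recall rather than averaged per batch. Recomputing it from the logged values is a quick consistency check. The helper below is illustrative, and its zero guard is an addition of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the epoch 1 lines of the log above
logged = {"train": (0.7541, 0.6881), "val": (0.7923, 0.8080)}
recomputed = {phase: round(f1_score(p, r), 4) for phase, (p, r) in logged.items()}
print(recomputed)
```

Both values agree with the `F1_score` column of epoch 1. Because the logged inputs are already rounded to four decimals, a difference in the last digit would not by itself indicate an error.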

### **3.2 Predicting**

Make predictions on the test data and export them to Predicted_labels.txt.

In [None]:
# load the model saved after the last epoch
model = torch.load('The_7_epoch_model.pt')

# predict labels on the test data set
predicted_results = predict(model, device, dataloader=dataloaders['test'])

# save the prediction results in Colab
with open('Predicted_labels.txt', 'w') as f:
    f.write('ImageID,Labels\n')
    for x1, x2 in enumerate(predicted_results):
        # keep every label whose probability exceeds the threshold
        line = x2 > THRESHOLD
        pred = [i for i in range(len(line)) if line[i]]
        predicted_labels = ' '.join(str(e) for e in pred)
        f.write(str(testImageId[x1]) + ',' + predicted_labels + '\n')
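
Each row of the output file pairs an image ID with the indices of all labels whose probability exceeds the threshold, separated by spaces. A plain-Python sketch of that formatting on made-up probabilities follows. `format_predictions` is a hypothetical helper and the image IDs are invented; the notebook writes directly to `Predicted_labels.txt`.

```python
def format_predictions(image_ids, probabilities, threshold=0.3):
    """Return the lines of an 'ImageID,Labels' file as a list of strings."""
    lines = ["ImageID,Labels"]
    for image_id, row in zip(image_ids, probabilities):
        labels = [str(index) for index, prob in enumerate(row) if prob > threshold]
        lines.append(f"{image_id},{' '.join(labels)}")
    return lines


lines = format_predictions(
    ["30000.jpg", "30001.jpg"],
    [[0.9, 0.1, 0.5], [0.2, 0.1, 0.05]],
)
print("\n".join(lines))
```

An image with no probability above the threshold ends up with an empty label field, which is worth checking against the expected submission format.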
