In [35]:
%matplotlib inline 

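import nltk
import torch
import sklearn
import sklearn.feature_extraction.text  # needed for CountVectorizer below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
import itertools  # used by plot_confusion_matrix
import re
import time
import random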

from sklearn.metrics import confusion_matrix
from scipy.interpolate import make_interp_spline

# Gender classification assignment

You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is provided.

You will be working on classifying the genders of people from their blog posts using a data set called the [Blog Authorship Corpus](https://www.kaggle.com/rtatman/blog-authorship-corpus).
This has been pre-split and reduced for you to use in this assignment.

10% of the marks from this assignment are based on neatness.

This assignment will carry 40% of the final mark.

## Data processing (10%)

You have a train/dev/test split data set consisting of CSV files with two fields: gender and text.
The gender field contains either 'male' or 'female' whilst the text is a string containing text from blog posts.

Do the following tasks:

Load these three CSV files and tokenise each text.

In [37]:
def tolower(data):
    return data.lower()

#Import
dev = pd.read_csv("dev.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

#Split to X and Y 
dev_x = dev['text']
dev_y = dev['gender']

test_x = test['text']
test_y = test['gender']

train_x = train['text']
train_y = train['gender']

#Convert to Lower Case
dev_x = dev_x.apply(tolower)
test_x = test_x.apply(tolower)
train_x = train_x.apply(tolower)

#Tokenize Text 
tdev_x = dev_x.apply(nltk.tokenize.word_tokenize)
ttest_x = test_x.apply(nltk.tokenize.word_tokenize)
ttrain_x = train_x.apply(nltk.tokenize.word_tokenize)



Write code that counts the number of lines in each data set as well as the maximum number of tokens in each data set.

In [38]:
dev_len = len(dev.index)
test_len = len(test.index)
train_len = len(train.index)

print('Length of DEV: ', dev_len)
print('Length of TRAIN: ', train_len)
print('Length of TEST: ', test_len)

dev_lens = [len(x) for x in tdev_x]
dev_max = max(dev_lens)

train_lens = [len(x) for x in ttrain_x]
train_max = max(train_lens)

test_lens = [len(x) for x in ttest_x]
test_max = max(test_lens)

print('Max Tokens in DEV: ', str(dev_max))
print('Max Tokens in TRAIN: ',str(train_max))
print('Max Tokens in TEST: ', str(test_max))

Length of DEV:  4650
Length of TRAIN:  37208
Length of TEST:  4652
Max Tokens in DEV:  61
Max Tokens in TRAIN:  97
Max Tokens in TEST:  66


Convert each data set's labels (gender) into numeric form.

In [40]:
categories = sorted(set(train_y))
cat2index = {c:i for (i, c) in enumerate(categories)}

tensor_ind_dev_y = torch.tensor([cat2index[category] for category in dev_y], dtype=torch.int64)
tensor_ind_train_y = torch.tensor([cat2index[category] for category in train_y], dtype=torch.int64)
tensor_ind_test_y = torch.tensor([cat2index[category] for category in test_y], dtype=torch.int64)


Extract a vocabulary consisting of the tokens that occur at least 5 times in the train set and output the size of your vocabulary.
Include the unknown token and pad token in the vocabulary.

In [43]:
min_freq = 5

frequencies = collections.Counter(word for text in ttrain_x for word in text)
vocab = sorted(frequencies.keys(), key=frequencies.get, reverse=True)
while frequencies[vocab[-1]] < min_freq:
    vocab.pop()
vocab = ['<PAD>', '<UNK>'] + sorted(vocab)

print("Vocab", len(vocab))
    

Vocab 7113


Create binary bag of words feature vectors for all data set texts using the vocabulary created above (include stop words).

In [42]:
# The analyzer passes the already tokenised texts straight through, so fit on the token lists.
encoder = sklearn.feature_extraction.text.CountVectorizer(vocabulary=vocab, binary=True, analyzer=lambda text: text, dtype=np.float32)
encoder.fit(ttrain_x)

vdev_x = encoder.transform(tdev_x).toarray()
vtrain_x = encoder.transform(ttrain_x).toarray()
vtest_x = encoder.transform(ttest_x).toarray()

Create a data set of indexified token sequences for all texts using the vocabulary created above, making use of unknown tokens and pad tokens.

In [52]:
word2index = {w:i for (i,w) in enumerate(vocab)}

for i in range(len(tdev_x)):
    for j in range(len(tdev_x[i])):
        if tdev_x[i][j] not in word2index:
            tdev_x[i][j] = '<UNK>'
    tdev_x[i].extend(['<PAD>']*(dev_max - len(tdev_x[i])))
    
for i in range(len(ttrain_x)):
    for j in range(len(ttrain_x[i])):
        if ttrain_x[i][j] not in word2index:
            ttrain_x[i][j] = '<UNK>'
    ttrain_x[i].extend(['<PAD>']*(train_max - len(ttrain_x[i])))

for i in range(len(ttest_x)):
    for j in range(len(ttest_x[i])):
        if ttest_x[i][j] not in word2index:
            ttest_x[i][j] = '<UNK>'
    ttest_x[i].extend(['<PAD>']*(test_max - len(ttest_x[i])))

indexed_dev_x = torch.tensor([[word2index[word] for word in text] for text in tdev_x], dtype = torch.int64)
indexed_test_x = torch.tensor([[word2index[word] for word in text] for text in ttest_x], dtype = torch.int64)
indexed_train_x = torch.tensor([[word2index[word] for word in text] for text in ttrain_x], dtype = torch.int64)

tensor_dev_len = torch.tensor(dev_lens, dtype=torch.int64)
tensor_test_len = torch.tensor(test_lens, dtype=torch.int64)
tensor_train_len = torch.tensor(train_lens, dtype=torch.int64)
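
The three loops above repeat the same steps and overwrite the token lists in place, so the original tokens are no longer available afterwards. A minimal sketch of a helper that leaves its input untouched is given below; `indexify` and its parameter names are hypothetical, and the toy vocabulary is made up purely for illustration.

```python
def indexify(tokens, word2index, max_len, unk="<UNK>", pad="<PAD>"):
    """Map tokens to vocabulary indexes, using unk for unseen words and pad up to max_len.

    Returns (indexes, true_length); the input list is not modified.
    """
    unk_index = word2index[unk]
    indexes = [word2index.get(token, unk_index) for token in tokens]
    true_length = len(indexes)
    # Multiplying a list by a negative number gives an empty list, so no padding is added then.
    indexes.extend([word2index[pad]] * (max_len - true_length))
    return indexes, true_length


# Toy vocabulary and text, for illustration only.
toy_word2index = {"<PAD>": 0, "<UNK>": 1, "hello": 2, "world": 3}
toy_tokens = ["hello", "there", "world"]
toy_indexes, toy_length = indexify(toy_tokens, toy_word2index, max_len=5)
print(toy_indexes, toy_length)  # [2, 1, 3, 0, 0] 3
```

Each split could then be converted with a list comprehension over its token lists, with the returned true lengths feeding the length tensors.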

Write code that counts the percentage of tokens in each data set that are unknown tokens (not including pad tokens).

In [54]:
def unk(tokens):
    total_tokens = sum([len(x) for x in tokens])    
    unk_tokens = sum([1 if word == "<UNK>" else 0 for text in tokens for word in text])
    return (unk_tokens/total_tokens)

dev_unk = unk(tdev_x)
train_unk = unk(ttrain_x)
test_unk = unk(ttest_x)

print("UNK in DEV {:.2%}".format(dev_unk))
print("UNK in TRAIN {:.2%}".format(train_unk))
print("UNK in TEST {:.2%}".format(test_unk))

UNK in DEV 2.32%
UNK in TRAIN 1.31%
UNK in TEST 2.19%
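
Because the indexification cell pads the token lists in place, `len(x)` inside `unk` also counts pad tokens in the denominator. A minimal sketch of a count that skips them is shown below; `unk_fraction` is a hypothetical name and the toy texts are illustrative only.

```python
def unk_fraction(token_lists, unk="<UNK>", pad="<PAD>"):
    """Fraction of non-pad tokens that are the unknown token."""
    real_tokens = 0
    unk_tokens = 0
    for tokens in token_lists:
        for token in tokens:
            if token == pad:
                continue  # pad tokens count towards neither total
            real_tokens += 1
            if token == unk:
                unk_tokens += 1
    return unk_tokens / real_tokens if real_tokens else 0.0


# Toy data: 6 real tokens, 2 of them unknown, plus 2 pad tokens.
toy_texts = [["i", "<UNK>", "<PAD>", "<PAD>"], ["hello", "world", "again", "<UNK>"]]
print("{:.2%}".format(unk_fraction(toy_texts)))  # 33.33%
```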


## Logistic regression classification (20%)

Write a logistic regression classifier (single layer neural net) that is trained to classify the author gender from the bag of words vector of the text.
You do not need to perform any hyperparameter tuning.
Use L1 weight decay regularisation.

In [57]:
class Linear(torch.nn.Module):
    
    def __init__(self, vocab_size, num_categories):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros((vocab_size, num_categories), dtype=torch.float32, requires_grad=True))
        self.b = torch.nn.Parameter(torch.zeros((num_categories,), dtype=torch.float32, requires_grad=True))

    def forward(self, x):
        return x@self.w + self.b
    
linear = Linear(len(vocab), 2)
linear.to('cpu')

optimiser = torch.optim.Adam(linear.parameters())

tensor_trainx = torch.tensor(vtrain_x, dtype=torch.float32)

print('step', 'error')
for step in range(1, 200+1):
    optimiser.zero_grad()
    output = linear(tensor_trainx)
    error = torch.nn.functional.cross_entropy(output, tensor_ind_train_y) + linear.w.abs().mean()
    error.backward()
    optimiser.step()

    if step%100 == 0:
        print(step, error.detach().tolist())

step error
100 0.6532658338546753
200 0.6431275606155396


Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.

In [62]:
def accuracy(TP, TN, tot):
    return (TP + TN)/tot

def precision(TP, FP):
    return TP/(TP + FP)

def recall(TP, FN):
    return TP/(TP + FN)

def f1(precision, recall):
    num = precision * recall
    denom = precision + recall
    return 2 * (num/denom)

ttest_x_vec = torch.tensor(vtest_x, dtype = torch.float32)
targets = tensor_ind_test_y.numpy()

with torch.no_grad():
    # Softmax turns the two logits into class probabilities; argmax picks the predicted class.
    probability = torch.softmax(linear(ttest_x_vec), dim=1)
    output = probability.numpy().argmax(axis=1)
    
tp = 0 
tn = 0
fp = 0
fn = 0

for i in range(len(targets)):
    if(targets[i] == output[i]):
        if(targets[i] == 0):
            tn += 1
        else:
            tp += 1
    else:
        if(targets[i] == 0):
            fp += 1
        else:
            fn += 1
            
# Store the results under new names so the metric functions are not overwritten.
acc = accuracy(tp, tn, len(targets))
prec = precision(tp, fp)
rec = recall(tp, fn)
f1_score = f1(prec, rec)


print('Accuracy: {:.2%}'.format(acc))
print('Precision: {:.2%}'.format(prec))
print('Recall: {:.2%}'.format(rec))
print('F1-Score: {:.2%}'.format(f1_score))



Accuracy: 62.38%
Precision: 62.50%
Recall: 61.91%
F1-Score: 62.20%
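
The same counting loop and metric functions are needed again for the deep learning classifier. A minimal sketch of a single reusable helper is given below; `binary_metrics` is a hypothetical name. It treats label 1 as the positive class, as the loop above does (with the sorted categories, 'female' is 0 and 'male' is 1), and it returns 0.0 where a denominator would be zero, which is an assumption about how such cases should be reported.

```python
def binary_metrics(targets, predictions, positive=1):
    """Accuracy, precision, recall and F1 for binary labels, with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for target, predicted in zip(targets, predictions):
        if predicted == positive:
            if target == positive:
                tp += 1
            else:
                fp += 1
        else:
            if target == positive:
                fn += 1
            else:
                tn += 1

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, for illustration only: tp=2, fn=1, tn=1, fp=1.
toy_metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
for name, value in toy_metrics.items():
    print("{}: {:.2%}".format(name, value))
```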


Write code that shows the top 10 tokens that are the most important for determining the author gender according to the classifier.

In [59]:
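# Absolute weight of every token for each of the two classes.
temp = np.abs(linear.w.detach().numpy())

# Rank tokens by their per-class absolute weights, largest first (class 0 is compared first).
weighted = sorted(zip(temp.tolist(), vocab), reverse=True)
ten = []

print('Top 10')
for i, w in enumerate(weighted[:10]):
    # Report the mean absolute weight across the two classes.
    m = (w[0][0] + w[0][1]) / 2
    mean = "{:.2%}".format(m)
    print(i+1,") ",w[1]," (",mean,")",sep="")
    ten.append(w[1])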

Top 10
1) -arv (19.25%)
2) hakx (18.94%)
3) gio (18.93%)
4) jhayne (18.82%)
5) dan (18.73%)
6) venerable (18.60%)
7) 1. (18.58%)
8) killy (18.37%)
9) managed (18.35%)
10) -sane (18.35%)


Write code that, for each data split and gender, shows the percentage of rows that include at least one of these important words (so 6 percentages in all).

In [66]:
def percentage(text, data, gender):
    temp = 0
    total = len(data.index)
    
    for i, t in enumerate(data["text"]):
        t = t.lower()
        
        if(data["gender"][i] == gender):
            if(t.find(text) != -1):
                temp += 1
    
    return temp/total

#Chosen Random Word from Top 10
top = 1

dev_male = percentage(ten[top], dev, "male")
dev_female = percentage(ten[top], dev, "female")

train_male = percentage(ten[top], train, "male")
train_female = percentage(ten[top], train, "female")

test_male = percentage(ten[top], test, "male")
test_female = percentage(ten[top], test, "female")

print("Percentage Occurrence of 'HAKX'")

print("Dev Male: {:.2%}".format(dev_male))
print("Dev Female: {:.2%}".format(dev_female))

print("Train Male: {:.2%}".format(train_male))
print("Train Female: {:.2%}".format(train_female))

print("Test Male: {:.2%}".format(test_male))
print("Test Female: {:.2%}".format(test_female))


Percentage Occurrence of 'HAKX'
Dev Male: 0.06%
Dev Female: 0.00%
Train Male: 0.09%
Train Female: 0.00%
Test Male: 0.00%
Test Female: 0.00%
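
The cell above checks a single token with a substring search and divides by all rows in the split. A minimal sketch of the "at least one of the important words" reading of the task is given below. It assumes that the denominator is the number of rows of the given gender and that matching is done on whole tokens; `share_with_any_token` and the toy rows are hypothetical.

```python
def share_with_any_token(rows, important_tokens, gender):
    """Fraction of rows of `gender` whose tokens include at least one important token.

    rows: iterable of (gender, tokens) pairs.
    """
    important = set(important_tokens)
    matching = 0
    total = 0
    for row_gender, tokens in rows:
        if row_gender != gender:
            continue
        total += 1
        if important.intersection(tokens):
            matching += 1
    return matching / total if total else 0.0


# Toy rows, for illustration only.
toy_rows = [
    ("male", ["dan", "went", "home"]),
    ("male", ["nothing", "here"]),
    ("female", ["i", "managed", "it"]),
    ("female", ["hello"]),
    ("female", ["dan", "again"]),
]
toy_important = ["dan", "managed"]
for toy_gender in ("male", "female"):
    share = share_with_any_token(toy_rows, toy_important, toy_gender)
    print("{}: {:.2%}".format(toy_gender, share))
```

In the notebook the pairs could come from zipping each split's `gender` column with its token lists, and `ten` would play the role of `toy_important`.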


## Deep learning classifier (50%)

Perform hyperparameter tuning on a deep learning classifier (with a convolutional neural network or a recurrent neural network) that is trained to classify the author gender from the indexified sequences of the text.
Use the dev set for evaluation.
Output the best hyperparameters found and do not store the best trained model as you will be training it again in the next bit.

In [67]:
class Model(torch.nn.Module):
    
    def __init__(self, vocab_size, categ_size, is_lstm, embedding_size, hidden_size, init_dev):
        super().__init__()
        self.hidden_size = hidden_size
        self.is_lstm = is_lstm
        
        self.embedding_matrix = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, init_dev, (vocab_size, embedding_size)), dtype=torch.float32))
        self.s0 = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, init_dev, (hidden_size,)), dtype=torch.float32))
        
        if is_lstm:
            self.lstm = torch.nn.LSTMCell(embedding_size, hidden_size)
            self.c0 = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, init_dev, (hidden_size,)), dtype=torch.float32))

        else:
            self.gru = torch.nn.GRUCell(embedding_size, hidden_size)   

        self.w = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, init_dev, (hidden_size, categ_size)), dtype=torch.float32))
        self.b = torch.nn.Parameter(torch.zeros((categ_size,), dtype=torch.float32))

    def forward(self, x, text_lens):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        embedded = self.embedding_matrix[x]
        state = self.s0.unsqueeze(0).tile((batch_size, 1))
        if self.is_lstm:
            c = self.c0.unsqueeze(0).tile((batch_size, 1))
        for t in range(time_steps):
            mask = (t < text_lens).unsqueeze(1).tile((1, self.hidden_size))
            if self.is_lstm:
                (next_state, c) = self.lstm(embedded[:, t, :], (state, c))
            else:
                next_state = self.gru(embedded[:, t, :], state)
            state = torch.where(mask, next_state, state)
        return state@self.w + self.b
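
The cell above defines the model, and a tuning loop would sit on top of it. One possible shape for that loop is a random search over the constructor arguments of `Model`, sketched below. The candidate values in `SEARCH_SPACE` are illustrative assumptions, `random_search` is a hypothetical helper, and `train_and_evaluate` stands for a hypothetical function that builds a `Model` with the sampled values, trains it, and returns its dev set accuracy. A toy objective takes its place here so that the sketch runs on its own.

```python
import random

# Illustrative search space; the candidate values are assumptions, not tuned choices.
SEARCH_SPACE = {
    "is_lstm": [True, False],
    "embedding_size": [32, 64, 128],
    "hidden_size": [32, 64, 128],
    "init_dev": [0.01, 0.1],
}


def random_search(train_and_evaluate, search_space, num_trials, seed=0):
    """Sample hyperparameters at random and keep the set with the best score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = train_and_evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(is_lstm, embedding_size, hidden_size, init_dev):
    # Stand-in for real training: returns a made-up score, not a dev accuracy.
    return hidden_size / 1000 + (0.05 if is_lstm else 0.0)


best_params, best_score = random_search(toy_objective, SEARCH_SPACE, num_trials=20)
print("Best hyperparameters:", best_params)
print("Best score:", best_score)
```

Only the best hyperparameters are kept, not the trained model, which matches the instruction to retrain in the next step.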
    


Use the hyperparameters found in the previous bit to train the classifier, this time outputting a graph showing the dev set accuracy after every epoch.

In [68]:
def plot(x, y, epochs):
    
    plt.figure(figsize=(10,10))
    
    x = np.array(x)
    y = np.array(y)
    
    xy = make_interp_spline(x, y)
    X_ = np.linspace(x.min(), x.max(), 500)
    Y_ = xy(X_)
    
    plt.plot(X_, Y_, color='blue')
    
    plt.title('Dev Accuracy')
    
    plt.xlabel('Epochs')
    
    plt.ylabel('Accuracy (%)')
    
    plt.xlim([0,epochs])
    
    y = np.delete(y,0)
    
    min_y = min(y)
    max_y = max(y)
    min_y -= 5
    max_y += 5
    
    if min_y < 0:
        min_y = 0
    if max_y > 100:
        max_y = 100
        
    plt.ylim([min_y, max_y])
    
    plt.show()
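
`plot` drops the first accuracy value when choosing the y-axis limits, which suggests that the list it receives starts with the accuracy measured before any training (epoch 0). A minimal sketch of collecting such a list is given below; `collect_dev_accuracy` is a hypothetical helper, and `train_one_epoch` and `evaluate_dev` stand for hypothetical functions wrapping the model's training pass and its dev set accuracy in percent. Toy stand-ins are used so that the sketch runs on its own.

```python
def collect_dev_accuracy(num_epochs, train_one_epoch, evaluate_dev):
    """Record dev accuracy before training and after every epoch."""
    epochs = [0]
    accuracies = [evaluate_dev()]
    for epoch in range(1, num_epochs + 1):
        train_one_epoch()
        epochs.append(epoch)
        accuracies.append(evaluate_dev())
    return epochs, accuracies


# Toy stand-ins for real training and evaluation, for illustration only.
toy_state = {"accuracy": 50.0}


def toy_train_one_epoch():
    toy_state["accuracy"] = min(100.0, toy_state["accuracy"] + 2.5)


def toy_evaluate_dev():
    return toy_state["accuracy"]


toy_epochs, toy_accuracies = collect_dev_accuracy(4, toy_train_one_epoch, toy_evaluate_dev)
print(toy_epochs)      # [0, 1, 2, 3, 4]
print(toy_accuracies)  # [50.0, 52.5, 55.0, 57.5, 60.0]
```

The two lists could then be passed to `plot(toy_epochs, toy_accuracies, 4)`; the default cubic spline in `make_interp_spline` needs at least four points.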

Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.

In [32]:
def accuracy(TP, TN, tot):
    return (TP + TN)/tot

def precision(TP, FP):
    return TP/(TP + FP)

def recall(TP, FN):
    return TP/(TP + FN)

def f1(precision, recall):
    num = precision * recall
    denom = precision + recall
    return 2 * (num/denom)

targets = tensor_ind_test_y.numpy()

# `recurrent` is assumed to be the trained Model instance from the training step.
# The recurrent model takes indexified sequences and their true lengths, not bag of words vectors.
with torch.no_grad():
    logits = recurrent(indexed_test_x, tensor_test_len)
    output = logits.argmax(dim=1).numpy()
    
tp = 0 
tn = 0
fp = 0
fn = 0

for i in range(len(targets)):
    if(targets[i] == output[i]):
        if(targets[i] == 0):
            tn += 1
        else:
            tp += 1
    else:
        if(targets[i] == 0):
            fp += 1
        else:
            fn += 1
            
# Store the results under new names so the metric functions are not overwritten.
acc = accuracy(tp, tn, len(targets))
prec = precision(tp, fp)
rec = recall(tp, fn)
f1_score = f1(prec, rec)


print('Accuracy: {:.2%}'.format(acc))
print('Precision: {:.2%}'.format(prec))
print('Recall: {:.2%}'.format(rec))
print('F1-Score: {:.2%}'.format(f1_score))



NameError: name 'metrics' is not defined

Output a confusion matrix of the trained model on the test set.

In [33]:
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
        
    plt.tight_layout()
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
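
`plot_confusion_matrix` labels the rows as true labels and the columns as predicted labels, which is the layout that `sklearn.metrics.confusion_matrix` returns. A dependency-free sketch of building a matrix in that layout is given below; `confusion_counts` is a hypothetical helper and the toy labels are illustrative only.

```python
def confusion_counts(targets, predictions, num_classes):
    """Confusion matrix as nested lists: rows are true labels, columns are predicted labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for target, predicted in zip(targets, predictions):
        matrix[target][predicted] += 1
    return matrix


# Toy labels, for illustration only.
toy_targets = [0, 0, 1, 1, 1]
toy_predictions = [0, 1, 1, 1, 0]
toy_matrix = confusion_counts(toy_targets, toy_predictions, num_classes=2)
print(toy_matrix)  # [[1, 1], [1, 2]]
```

Since the plotting function calls `cm.max()` and `cm.shape`, the nested lists would need wrapping in `np.array(...)` before being passed in, with `categories` as the class names.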

Output 5 examples of correctly classified text for each gender and 5 examples of incorrectly classified text for each gender (so 20 text examples in total), all of which must be from the test set.
This is assuming that you have at least 5 instances of each group.
If you have fewer, then show whatever is available.
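
One way to gather these examples is to bucket the test texts by true label and by whether the prediction was correct, keeping at most five per bucket. The sketch below takes the first matches it finds rather than a random sample, which is an assumption; `group_examples` is a hypothetical helper and the toy data is illustrative only.

```python
def group_examples(texts, targets, predictions, per_group=5):
    """Collect up to per_group texts for every (true label, was correct) combination."""
    groups = {}
    for text, target, predicted in zip(texts, targets, predictions):
        key = (target, target == predicted)
        bucket = groups.setdefault(key, [])
        if len(bucket) < per_group:
            bucket.append(text)
    return groups


# Toy data, for illustration only.
toy_texts = ["a", "b", "c", "d", "e"]
toy_targets = [0, 0, 1, 1, 1]
toy_predictions = [0, 1, 1, 1, 0]
toy_groups = group_examples(toy_texts, toy_targets, toy_predictions, per_group=1)
for (label, correct), examples in sorted(toy_groups.items()):
    print(label, "correct" if correct else "incorrect", examples)
```

Groups with fewer than `per_group` members simply return what is available, and a group with no members is absent from the dictionary.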

Remember the list of important tokens determined previously (from the logistic regression classifier)?
Write code that takes all the texts in the test set that have at least one of the important tokens and shows the percentage of these texts that were correctly classified.
Similarly, take all the texts that don't have any of the important tokens and show the percentage of these texts that were correctly classified (so 2 percentages in total).
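
A minimal sketch of this comparison is given below. It assumes whole-token matching against the important tokens and returns `None` for a group with no texts; `accuracy_by_token_presence` is a hypothetical helper and the toy data is illustrative only.

```python
def accuracy_by_token_presence(token_lists, targets, predictions, important_tokens):
    """Accuracy on texts with at least one important token (key True) and on the rest (key False)."""
    important = set(important_tokens)
    correct = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for tokens, target, predicted in zip(token_lists, targets, predictions):
        has_important = bool(important.intersection(tokens))
        total[has_important] += 1
        correct[has_important] += int(target == predicted)
    return {
        group: (correct[group] / total[group] if total[group] else None)
        for group in (True, False)
    }


# Toy data, for illustration only.
toy_token_lists = [["dan", "said"], ["hello"], ["managed", "it"], ["bye"], ["nothing"]]
toy_targets = [1, 0, 0, 1, 0]
toy_predictions = [1, 0, 1, 0, 0]
toy_result = accuracy_by_token_presence(
    toy_token_lists, toy_targets, toy_predictions, ["dan", "managed"]
)
print("With important tokens: {:.2%}".format(toy_result[True]))      # 50.00%
print("Without important tokens: {:.2%}".format(toy_result[False]))  # 66.67%
```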

## Conclusion (10%)

Write, in fewer than 300 words, your interpretation of the results and how you think the model could perform better.
You should talk about things like overfitting/underfitting and whether the model is learning anything deep about how the different genders write or if it's just basing everything on the words used.