In [2]:
import itertools # used by plot_confusion_matrix

import nltk
import torch
import numpy as np
import matplotlib.pyplot as plt
import sklearn.feature_extraction
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Gender classification assignment

You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is provided.

You will be working on classifying the genders of people from their blog posts using a data set called the [Blog Authorship Corpus](https://www.kaggle.com/rtatman/blog-authorship-corpus).
This has been pre-split and reduced for you to use in this assignment.

10% of the marks from this assignment are based on neatness.

This assignment will carry 40% of the final mark.

## Data processing (10%)

You have a train/dev/test split data set consisting of CSV files with two fields: gender and text.
The gender field contains either 'male' or 'female' whilst the text is a string containing text from blog posts.

Do the following tasks:

Load these three CSV files and tokenise each text.

In [6]:
def load(file):
    # NOTE: header=None makes pandas read the CSV's own header line as a data row,
    # which is why row 0 of each data set holds the column names instead of a blog post
    data = pd.read_csv(file, header=None) # read the csv
    data.columns = ['text', 'gender'] # add column names
    return data

dev = load('dev.csv')
test = load('test.csv')
train = load('train.csv')

# Tokenise each text with NLTK's word tokeniser
dev['text'] = dev['text'].apply(nltk.word_tokenize)
test['text'] = test['text'].apply(nltk.word_tokenize)
train['text'] = train['text'].apply(nltk.word_tokenize)

# Example after tokenisation
dev.head()



|   | text | gender |
|---|------|--------|
| 0 | [text] | gender |
| 1 | ['People, who, feel, good, about, themselves, ... | male |
| 2 | [We, just, wan, na, say, that, tongue, rings, ... | male |
| 3 | [urlLink, Extreme, Round, of, the, heat, compe... | male |
| 4 | [IMPORTANT, UPDATE, It, is, VITAL, that, peopl... | male |


Write code that counts the number of lines in each data set as well as the maximum number of tokens in each data set.

In [7]:
print('Length of DEV: ', len(dev))
print('Length of TRAIN: ', len(train))
print('Length of TEST: ', len(test))

print('Max Tokens in DEV: ', max(dev['text'].str.len()))
print('Max Tokens in TRAIN: ', max(train['text'].str.len()))
print('Max Tokens in TEST: ', max(test['text'].str.len()))

Length of DEV:  4651
Length of TRAIN:  37209
Length of TEST:  4653
Max Tokens in DEV:  61
Max Tokens in TRAIN:  97
Max Tokens in TEST:  66


Convert each data set's labels (gender) into numeric form.

In [8]:
# Only the gender column is converted: male -> 1, female -> 0
label_map = {'male': 1, 'female': 0}
dev['gender'] = dev['gender'].replace(label_map)
train['gender'] = train['gender'].replace(label_map)
test['gender'] = test['gender'].replace(label_map)

# Example after label conversion
dev.head()

|   | text | gender |
|---|------|--------|
| 0 | [text] | gender |
| 1 | ['People, who, feel, good, about, themselves, ... | 1 |
| 2 | [We, just, wan, na, say, that, tongue, rings, ... | 1 |
| 3 | [urlLink, Extreme, Round, of, the, heat, compe... | 1 |
| 4 | [IMPORTANT, UPDATE, It, is, VITAL, that, peopl... | 1 |


Extract a vocabulary consisting of the tokens that occur at least 5 times in the train set and output the size of your vocabulary.
Include the unknown token and pad token in the vocabulary.

In [9]:
from collections import Counter

def multipleTokens(data, min_count=5):
    # Count how often each token occurs across all rows of the data set
    counts = Counter(token for tokens in data['text'] for token in tokens)
    # Keep the tokens that occur at least min_count times
    return sorted(token for (token, count) in counts.items() if count >= min_count)

# The vocabulary is built from the train set only;
# index 0 is the pad token and index 1 is the unknown token
vocab = ['<PAD>', '<UNK>'] + multipleTokens(train)
print('Vocabulary size: ', len(vocab))

Create binary bag of words feature vectors for all data set texts using the vocabulary created above (include stop words).

In [46]:
def bag_of_words(data, vocab):
    # The texts are already tokenised, so the analyzer passes each token list through unchanged.
    # binary=True records presence/absence instead of counts; tokens outside vocab are ignored.
    vectorizer = CountVectorizer(vocabulary=vocab, binary=True, analyzer=lambda tokens: tokens)
    return vectorizer.transform(data['text']) # sparse matrix of shape (rows, len(vocab))

dev_bow = bag_of_words(dev, vocab)
train_bow = bag_of_words(train, vocab)
test_bow = bag_of_words(test, vocab)

Create a data set of indexified token sequences for all texts using the vocabulary created above, making use of unknown tokens and pad tokens.
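
The indexification step can be sketched on toy data as follows. This is a minimal illustration, not the assignment's solution: the names `PAD`, `UNK`, `indexify` and the toy vocabulary are assumptions, and the pad and unknown tokens are assumed to sit at indexes 0 and 1.

```python
PAD, UNK = '<PAD>', '<UNK>'  # assumed spellings of the special tokens

def indexify(token_lists, vocab):
    """Map each token list to a fixed-length list of vocabulary indexes."""
    token2index = {token: i for (i, token) in enumerate(vocab)}
    max_len = max(len(tokens) for tokens in token_lists)
    indexed = []
    for tokens in token_lists:
        # Tokens missing from the vocabulary fall back to the unknown index
        row = [token2index.get(token, token2index[UNK]) for token in tokens]
        # Shorter texts are right-padded so that every row has max_len entries
        row += [token2index[PAD]] * (max_len - len(row))
        indexed.append(row)
    return indexed

toy_vocab = [PAD, UNK, 'i', 'like', 'cats']
toy_texts = [['i', 'like', 'cats'], ['i', 'like', 'dogs', 'too'], ['cats']]
indexed = indexify(toy_texts, toy_vocab)
print(indexed)  # [[2, 3, 4, 0], [2, 3, 1, 1], [4, 0, 0, 0]]
```

On the real data the original text lengths would also need to be kept, so that a sequence model can mask out the pad positions.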

Write code that counts the percentage of tokens in each data set that are unknown tokens (not including pad tokens).
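
One possible way to measure this, sketched on toy data: counting on the token lists themselves (before padding) leaves pad tokens out of the total automatically. The function name `unknown_percentage` and the toy vocabulary are assumptions for illustration.

```python
def unknown_percentage(token_lists, vocab):
    """Percentage of real (non-pad) tokens that are not in the vocabulary."""
    known = set(vocab)
    total = sum(len(tokens) for tokens in token_lists)
    unknown = sum(token not in known for tokens in token_lists for token in tokens)
    return 100.0 * unknown / total if total > 0 else 0.0

toy_vocab = ['<PAD>', '<UNK>', 'i', 'like', 'cats']
toy_texts = [['i', 'like', 'cats'], ['i', 'like', 'dogs', 'too'], ['cats']]
pct = unknown_percentage(toy_texts, toy_vocab)
print('{:.1f}% unknown'.format(pct))  # 25.0% unknown
```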

## Logistic regression classification (20%)

Write a logistic regression classifier (single layer neural net) that is trained to classify the author gender from the bag of words vector of the text.
You do not need to perform any hyperparameter tuning.
Use L1 weight decay regularisation.

In [10]:
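class Linear(torch.nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # Single layer: one weight per vocabulary token plus a bias
        self.w = torch.nn.Parameter(torch.zeros((vocab_size, 1), dtype=torch.float32))
        self.b = torch.nn.Parameter(torch.zeros((1,), dtype=torch.float32))

    def forward(self, x):
        return x@self.w + self.b # logits; a sigmoid turns them into P(male)

# Keep only the rows with a 0/1 label (this skips the header line that load() read as data)
train_keep = train['gender'].isin([0, 1]).to_numpy()
train_x = train_bow[train_keep]
train_y = train['gender'][train_keep].to_numpy(dtype=np.float32)

model = Linear(len(vocab))
optimiser = torch.optim.Adam(model.parameters())
l1_weight = 1e-5 # L1 regularisation strength (assumed value, not tuned)
batch_size = 256 # assumed value, not tuned

print('epoch', 'error')
for epoch in range(1, 5+1):
    order = np.random.permutation(train_x.shape[0])
    for i in range(0, len(order), batch_size):
        batch = order[i:i+batch_size]
        # Only the current minibatch is converted to a dense tensor to save memory
        x = torch.tensor(train_x[batch].toarray(), dtype=torch.float32)
        y = torch.tensor(train_y[batch]).unsqueeze(1)
        optimiser.zero_grad()
        error = torch.nn.functional.binary_cross_entropy_with_logits(model(x), y)
        error = error + l1_weight*model.w.abs().sum() # L1 weight decay
        error.backward()
        optimiser.step()
    print(epoch, error.detach().tolist())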

Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.
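
For reference, the four metrics can be computed directly from the counts of true positives, false positives and false negatives. The sketch below uses made-up labels and treats label 1 (male) as the positive class, which is an assumption; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for (t, p) in pairs)
    fp = sum(t != positive and p == positive for (t, p) in pairs)
    fn = sum(t == positive and p != positive for (t, p) in pairs)
    accuracy = sum(t == p for (t, p) in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)  # every metric is 0.75 for these toy labels
```

The `accuracy_score`, `precision_score`, `recall_score` and `f1_score` functions in `sklearn.metrics` compute the same quantities.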

In [None]:
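from sklearn import metrics

test_keep = test['gender'].isin([0, 1]).to_numpy() # skips the header line that load() read as data
y_test = test['gender'][test_keep].to_numpy(dtype=np.int64)
with torch.no_grad():
    logits = model(torch.tensor(test_bow[test_keep].toarray(), dtype=torch.float32))
y_pred = (logits[:, 0] > 0).long().numpy() # logit > 0 means P(male) > 0.5

results = {}
for name in ['accuracy', 'precision', 'recall', 'f1']:
    # Looks up metrics.accuracy_score, metrics.precision_score, ...
    results[name] = getattr(metrics, name + '_score')(y_true=y_test, y_pred=y_pred)
    print('{}: {:.3f}'.format(name, results[name]))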

Write code that shows the top 10 tokens that are the most important for determining the author gender according to the classifier.
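
One common reading of "most important" is the tokens whose weights have the largest absolute value, since those move the logit the most when the token is present. The sketch below assumes that reading and uses made-up weights; `top_tokens` and the toy vocabulary are hypothetical names.

```python
import numpy as np

def top_tokens(weights, vocab, k=10):
    """Return the k (token, weight) pairs with the largest absolute weight."""
    weights = np.asarray(weights, dtype=np.float64)
    order = np.argsort(-np.abs(weights))[:k]  # indexes sorted by descending |weight|
    return [(vocab[i], float(weights[i])) for i in order]

toy_vocab = ['<PAD>', '<UNK>', 'the', 'wife', 'husband', 'game']
toy_weights = [0.0, 0.1, -0.05, 2.0, -1.5, 0.7]
top = top_tokens(toy_weights, toy_vocab, k=3)
print(top)  # [('wife', 2.0), ('husband', -1.5), ('game', 0.7)]
```

With male encoded as 1, a positive weight pushes the prediction towards male and a negative weight towards female.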

Write code that, for each data split and gender, shows the percentage of rows that include at least one of these important words (so 6 percentages in all).
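
A possible shape for this computation, shown on toy rows of `(tokens, label)` pairs. The function name `coverage_percentage`, the toy splits and the toy list of important words are all assumptions for illustration.

```python
def coverage_percentage(rows, important):
    """Map each label to the percentage of its rows containing an important token."""
    important = set(important)
    totals, hits = {}, {}
    for (tokens, label) in rows:
        totals[label] = totals.get(label, 0) + 1
        if important.intersection(tokens):
            hits[label] = hits.get(label, 0) + 1
    return {label: 100.0 * hits.get(label, 0) / totals[label] for label in totals}

toy_splits = {
    'train': [(['my', 'wife', 'said'], 1), (['nice', 'day'], 1),
              (['my', 'husband'], 0), (['hello'], 0)],
    'dev': [(['the', 'game'], 1), (['husband', 'and', 'wife'], 0)],
}
toy_important = ['wife', 'husband', 'game']
table = {split: coverage_percentage(rows, toy_important)
         for (split, rows) in toy_splits.items()}
print(table)  # {'train': {1: 50.0, 0: 50.0}, 'dev': {1: 100.0, 0: 100.0}}
```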

## Deep learning classifier (50%)

Perform hyperparameter tuning on a deep learning classifier (with a convolutional neural network or a recurrent neural network) that is trained to classify the author gender from the indexified sequences of the text.
Use the dev set for evaluation.
Output the best hyperparameters found and do not store the best trained model as you will be training it again in the next bit.
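
A grid search is one simple way to organise the tuning. The sketch below only shows the search loop: `fake_dev_accuracy` is a hypothetical stand-in for "train on the train set, return dev set accuracy", and the grid values are assumptions rather than recommended settings.

```python
import itertools

def grid_search(grid, evaluate):
    """Try every combination in grid and return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float('-inf')
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return (best_params, best_score)

def fake_dev_accuracy(embedding_size, hidden_size, learning_rate):
    # Hypothetical stand-in: a real version would train the classifier with these
    # hyperparameters and return its accuracy on the dev set.
    return (1.0 - abs(embedding_size - 64) / 1000
                - abs(hidden_size - 128) / 1000
                - abs(learning_rate - 0.001))

grid = {'embedding_size': [32, 64], 'hidden_size': [64, 128], 'learning_rate': [0.01, 0.001]}
(best_params, best_score) = grid_search(grid, fake_dev_accuracy)
print(best_params)  # {'embedding_size': 64, 'hidden_size': 128, 'learning_rate': 0.001}
```

Only the best hyperparameters are returned, not a trained model, which matches the instruction to retrain in the next step.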

In [4]:
class Model(torch.nn.Module):

    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding_matrix = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, 1.0, (vocab_size, embedding_size)), dtype=torch.float32))
        self.rnn_s0 = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, 1.0, (hidden_size,)), dtype=torch.float32))
        self.rnn_w = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, 1.0, (hidden_size + embedding_size, hidden_size)), dtype=torch.float32))
        self.rnn_b = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.w = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, 1.0, (hidden_size, 1)), dtype=torch.float32))
        self.b = torch.nn.Parameter(torch.zeros((1,), dtype=torch.float32))

    def forward(self, x, text_lens):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        embedded = self.embedding_matrix[x]
        state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        for t in range(time_steps):
            mask = (t < text_lens).unsqueeze(1).tile((1, self.hidden_size))
            next_state = torch.nn.functional.leaky_relu(torch.cat((state, embedded[:, t, :]), dim=1)@self.rnn_w + self.rnn_b)
            state = torch.where(mask, next_state, state)
        return state@self.w + self.b

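# Indexify the train set with the vocabulary (pad token at index 0, unknown token at index 1)
train_rows = train[train['gender'].isin([0, 1])] # skips the header line that load() read as data
token2index = {token: i for (i, token) in enumerate(vocab)}
text_lens = torch.tensor([len(tokens) for tokens in train_rows['text']], dtype=torch.int64)
indexed_train_x = torch.zeros((len(train_rows), int(text_lens.max())), dtype=torch.int64)
for (i, tokens) in enumerate(train_rows['text']):
    indexed_train_x[i, :len(tokens)] = torch.tensor([token2index.get(token, 1) for token in tokens], dtype=torch.int64)
train_y = torch.tensor(train_rows['gender'].to_numpy(dtype=np.float32)).unsqueeze(1)

model = Model(len(vocab), embedding_size=2, hidden_size=3)
model.to('cpu')

optimiser = torch.optim.Adam(model.parameters())

print('step', 'error')
for step in range(1, 2000+1):
    optimiser.zero_grad()
    output = model(indexed_train_x, text_lens)
    error = torch.nn.functional.binary_cross_entropy_with_logits(output, train_y)
    error.backward()
    optimiser.step()

    if step%200 == 0:
        print(step, error.detach().tolist())
print()

with torch.no_grad():
    # Show the predicted P(male) for the first few train texts only
    print('sent', 'prediction')
    outputs = torch.sigmoid(model(indexed_train_x[:5], text_lens[:5]))
    for (sent, output) in zip(train_rows['text'][:5], outputs):
        print(sent, output)

# Toy RNN: gradient of the final state with respect to the input at each time step
input_seq = torch.tensor(np.random.normal(0, 1, (10, 1)), dtype=torch.float32, requires_grad=True)
w = torch.tensor(np.random.normal(0, 1.0, (2, 1)), dtype=torch.float32)
b = torch.zeros((1,), dtype=torch.float32)

state = torch.tensor([0], dtype=torch.float32)
for t in range(input_seq.shape[0]):
    state = torch.nn.functional.leaky_relu(torch.cat((state, input_seq[t, :]), dim=0)@w + b)

state[0].backward()
grads = np.abs(input_seq.grad.numpy()[:, 0])

(fig, ax) = plt.subplots(1, 1)
ax.bar(np.arange(input_seq.shape[0]), grads)
ax.set_xlabel('time step')
ax.set_ylabel('gradient')
ax.grid()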

Use the hyperparameters found in the previous bit to train the classifier, this time outputting a graph showing the dev set accuracy after every epoch.

Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.

In [None]:
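from sklearn import metrics

# y_test and y_pred are expected to hold the test labels and this classifier's 0/1 test predictions
results = {}
for name in ['accuracy', 'precision', 'recall', 'f1']:
    # Looks up metrics.accuracy_score, metrics.precision_score, ...
    results[name] = getattr(metrics, name + '_score')(y_true=y_test, y_pred=y_pred)
    print('{}: {:.3f}'.format(name, results[name]))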

Output a confusion matrix of the trained model on the test set.

In [3]:
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
        
    plt.tight_layout()
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')

Output 5 examples of correctly classified text for each gender and 5 examples of incorrectly classified text for each gender (so 20 text examples in total), all of which must be from the test set.
This is assuming that you have at least 5 instances of each group.
If you have fewer, then show whatever is available.
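
The grouping can be sketched as below: each text is keyed by its true label and by whether the prediction was correct, and each of the four groups keeps at most a fixed number of texts. `pick_examples` and the toy data are hypothetical; `per_group=1` is used only to keep the toy output short.

```python
def pick_examples(texts, y_true, y_pred, per_group=5):
    """Group texts by (true label, correctly classified?) and keep at most per_group each."""
    groups = {}
    for (text, t, p) in zip(texts, y_true, y_pred):
        bucket = groups.setdefault((t, t == p), [])
        if len(bucket) < per_group:
            bucket.append(text)
    return groups

toy_texts = ['a', 'b', 'c', 'd', 'e', 'f']
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
examples = pick_examples(toy_texts, toy_true, toy_pred, per_group=1)
for ((label, correct), group) in sorted(examples.items()):
    print(label, 'correct' if correct else 'incorrect', group)
```

A group with fewer than `per_group` matching texts simply ends up shorter, and a group with no matches is absent from the result.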

Remember the list of important tokens determined previously (from the logistic regression classifier)?
Write code that takes all the texts in the test set that have at least one of the important tokens and shows the percentage of these texts that were correctly classified.
Similarly, take all the texts that don't have any of the important tokens and show the percentage of these texts that were correctly classified (so 2 percentages in total).
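
The two percentages can be computed in a single pass by splitting the test texts on whether they contain any important token. This is a sketch on toy data; `accuracy_by_token_presence` and the toy inputs are assumptions for illustration.

```python
def accuracy_by_token_presence(token_lists, y_true, y_pred, important):
    """Percentage correct for texts with (True) and without (False) an important token."""
    important = set(important)
    counts = {True: [0, 0], False: [0, 0]}  # has important token -> [correct, total]
    for (tokens, t, p) in zip(token_lists, y_true, y_pred):
        has = bool(important.intersection(tokens))
        counts[has][0] += int(t == p)
        counts[has][1] += 1
    # None marks a group that has no texts at all
    return {has: (100.0 * c / n if n > 0 else None) for (has, (c, n)) in counts.items()}

toy_texts = [['my', 'wife'], ['the', 'game'], ['nice', 'day'], ['hello'], ['husband']]
toy_true = [1, 1, 1, 0, 0]
toy_pred = [1, 0, 1, 1, 0]
by_presence = accuracy_by_token_presence(toy_texts, toy_true, toy_pred, ['wife', 'husband', 'game'])
print('with: {:.1f}%, without: {:.1f}%'.format(by_presence[True], by_presence[False]))
```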

## Conclusion (10%)

Write, in fewer than 300 words, your interpretation of the results and how you think the model could perform better.
You should talk about things like overfitting/underfitting and whether the model is learning anything deep about how the different genders write or if it's just basing everything on the words used.