In [30]:
import nltk
import torch
import numpy as np
import matplotlib.pyplot as plt
import sklearn.feature_extraction
from keras.preprocessing.text import Tokenizer
import pandas as pd

# Gender classification assignment

You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is provided.

You will be working on classifying the genders of people from their blog posts using a data set called the [Blog Authorship Corpus](https://www.kaggle.com/rtatman/blog-authorship-corpus).
This has been pre-split and reduced for you to use in this assignment.

10% of the marks from this assignment are based on neatness.

This assignment will carry 40% of the final mark.

## Data processing (10%)

You have a train/dev/test split data set consisting of CSV files with two fields: gender and text.
The gender field contains either 'male' or 'female' whilst the text is a string containing text from blog posts.

Do the following tasks:

Load these three CSV files and tokenise each text.

In [35]:
def load(file):
    # The first row of each CSV holds the column names ('text', 'gender'),
    # so read it as the header rather than as a data row.
    return pd.read_csv(file, header=0)

dev = load('dev.csv')
test = load('test.csv')
train = load('train.csv')

# Tokenise each text with NLTK (requires the 'punkt' tokeniser data).
for data in (dev, test, train):
    data['text'] = data['text'].apply(nltk.word_tokenize)

                                                   text  gender
0     ['People, who, feel, good, about, themselves, ...    male
1     [We, just, wan, na, say, that, tongue, rings, ...    male
2     [urlLink, Extreme, Round, of, the, heat, compe...    male
3     [IMPORTANT, UPDATE, It, is, VITAL, that, peopl...    male
...                                                 ...     ...
4645  [urlLink, YES, !, The, house, takes, shape, ar...  female
4646  [Yay, !, !, !, It, 's, ALIVE, and, it, works, ...  female
4647  [urlLink, A, picture, every, five, minutes, du...  female
4648  [urlLink, Kaitlin, ,, Eri, ,, and, some, rando...  female
4649  [I, ca, n't, remember, the, name, of, this, .,...  female

[4650 rows x 2 columns]


Write code that counts the number of lines in each data set as well as the maximum number of tokens in each data set.
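One possible way to compute these two statistics is sketched below. It assumes each data set's texts are already lists of tokens; the helper name `describe` and the toy data are illustrative only.

```python
def describe(tokenised_texts):
    """Return (number of rows, length of the longest token list)."""
    num_rows = len(tokenised_texts)
    max_tokens = max((len(tokens) for tokens in tokenised_texts), default=0)
    return num_rows, max_tokens

# Toy stand-ins for the tokenised 'text' columns (illustrative data only).
toy_splits = {
    'train': [['I', 'like', 'tea', '.'], ['Hello', '!']],
    'dev': [['Good', 'morning']],
    'test': [['One', 'two', 'three'], ['Four'], ['Five', 'six']],
}

stats = {name: describe(texts) for name, texts in toy_splits.items()}
for name, (num_rows, max_tokens) in stats.items():
    print(f'{name}: {num_rows} rows, longest text has {max_tokens} tokens')
```

With the DataFrames loaded earlier, something like `describe(train['text'].tolist())` should give the same pair for the real data.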

Convert each data set's labels (gender) into numeric form.
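A minimal sketch of the label conversion follows. The mapping (`male` to 0, `female` to 1) is an assumption; the opposite encoding works equally well as long as it is used consistently across all three splits.

```python
# Assumed encoding; any fixed mapping works if applied to every split.
LABEL_TO_ID = {'male': 0, 'female': 1}

def encode_labels(genders):
    """Map each gender string to its integer class id."""
    return [LABEL_TO_ID[gender] for gender in genders]

toy_genders = ['male', 'female', 'female', 'male']  # illustrative data only
encoded = encode_labels(toy_genders)
print(encoded)
```

On a pandas column the same mapping can be applied with `Series.map`, for example `train['gender'].map(LABEL_TO_ID)`.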

Extract a vocabulary consisting of the tokens that occur at least 5 times in the train set and output the size of your vocabulary.
Include the unknown token and pad token in the vocabulary.
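The vocabulary extraction could be sketched as follows. The token strings `<PAD>` and `<UNK>`, their positions at indices 0 and 1, and the helper name `build_vocab` are assumptions made for this example.

```python
from collections import Counter

PAD, UNK = '<PAD>', '<UNK>'  # assumed special-token strings

def build_vocab(tokenised_texts, min_freq=5):
    """Keep tokens seen at least min_freq times; reserve index 0 for PAD and 1 for UNK."""
    counts = Counter(token for tokens in tokenised_texts for token in tokens)
    kept = sorted(token for token, freq in counts.items() if freq >= min_freq)
    vocab = [PAD, UNK] + kept
    token_to_index = {token: index for index, token in enumerate(vocab)}
    return vocab, token_to_index

# Toy train set: 'the' occurs 7 times, 'cat' and 'sat' 5 times, 'dog' twice.
toy_train = [['the', 'cat', 'sat']] * 5 + [['the', 'dog']] * 2
vocab, token_to_index = build_vocab(toy_train, min_freq=5)
print(len(vocab), vocab)
```

Only the train set is counted here, so tokens that appear solely in the dev or test sets end up as unknown tokens later.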

Create binary bag of words feature vectors for all data set texts using the vocabulary created above (include stop words).
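A binary bag of words vector has one slot per vocabulary entry, set to 1 if the token occurs in the text at all. The sketch below assumes a `token_to_index` dictionary with `<PAD>` and `<UNK>` entries (hypothetical names). Setting the `<UNK>` slot for out-of-vocabulary tokens is a design choice, not a requirement of the task.

```python
def binary_bow(tokens, token_to_index, unk='<UNK>'):
    """Return a 0/1 list with a 1 at the index of every token present in the text."""
    vector = [0] * len(token_to_index)
    unk_index = token_to_index[unk]
    for token in tokens:
        # Out-of-vocabulary tokens are folded into the UNK slot (a design choice).
        vector[token_to_index.get(token, unk_index)] = 1
    return vector

# Illustrative vocabulary and text only.
toy_index = {'<PAD>': 0, '<UNK>': 1, 'cat': 2, 'sat': 3, 'the': 4}
bow = binary_bow(['the', 'cat', 'the', 'dog'], toy_index)
print(bow)
```

Repeated tokens do not change the vector, which is what distinguishes binary features from count features. The list of vectors can then be turned into a tensor with `torch.tensor(..., dtype=torch.float32)`.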

Create a data set of indexified token sequences for all texts using the vocabulary created above, making use of unknown tokens and pad tokens.
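Indexifying could look like the sketch below, which replaces each token with its vocabulary index, maps out-of-vocabulary tokens to the unknown index, and right-pads every sequence to the longest length in the given set. The names and the choice to also return the unpadded lengths are assumptions.

```python
def indexify(tokenised_texts, token_to_index, pad='<PAD>', unk='<UNK>'):
    """Return (padded index sequences, original lengths)."""
    pad_id, unk_id = token_to_index[pad], token_to_index[unk]
    max_len = max((len(tokens) for tokens in tokenised_texts), default=0)
    sequences, lengths = [], []
    for tokens in tokenised_texts:
        ids = [token_to_index.get(token, unk_id) for token in tokens]
        lengths.append(len(ids))
        sequences.append(ids + [pad_id] * (max_len - len(ids)))
    return sequences, lengths

# Illustrative vocabulary and texts only.
toy_index = {'<PAD>': 0, '<UNK>': 1, 'cat': 2, 'sat': 3, 'the': 4}
sequences, lengths = indexify([['the', 'cat', 'sat'], ['dog']], toy_index)
print(sequences, lengths)
```

Keeping the original lengths is optional, but it is handy later for masking pad positions in a recurrent or convolutional model.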

Write code that counts the percentage of tokens in each data set that are unknown tokens (not including pad tokens).
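The unknown-token percentage can be computed directly from the indexified sequences, as in this sketch. It assumes pad tokens have index 0 and unknown tokens have index 1; adjust both if the vocabulary is laid out differently.

```python
def unknown_percentage(sequences, pad_id=0, unk_id=1):
    """Percentage of non-pad tokens that are unknown tokens."""
    real_tokens = [index for sequence in sequences for index in sequence
                   if index != pad_id]
    if not real_tokens:
        return 0.0
    num_unknown = sum(1 for index in real_tokens if index == unk_id)
    return 100.0 * num_unknown / len(real_tokens)

# Four real tokens, one of which is unknown; the two pads are ignored.
toy_sequences = [[4, 2, 3], [1, 0, 0]]
percentage = unknown_percentage(toy_sequences)
print(f'{percentage:.1f}% unknown')
```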

## Logistic regression classification (20%)

Write a logistic regression classifier (single layer neural net) that is trained to classify the author gender from the bag of words vector of the text.
You do not need to perform any hyperparameter tuning.
Use L1 weight decay regularisation.
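As a rough illustration of this kind of model, the sketch below trains a single `torch.nn.Linear` layer with a sigmoid cross-entropy loss and adds an L1 penalty on the weights to the loss. The toy data, the learning rate, the number of epochs, and the value of `l1_strength` are all assumptions for the example, not tuned settings.

```python
import torch

torch.manual_seed(0)

# Toy binary bag-of-words rows: feature 0 marks class 1, feature 1 marks class 0.
x = torch.tensor([[1., 0., 1.], [1., 0., 0.], [0., 1., 1.], [0., 1., 0.]])
y = torch.tensor([[1.], [1.], [0.], [0.]])

model = torch.nn.Linear(x.shape[1], 1)   # one weight per vocabulary entry plus a bias
loss_fn = torch.nn.BCEWithLogitsLoss()   # applies the sigmoid internally
optimiser = torch.optim.Adam(model.parameters(), lr=0.1)
l1_strength = 1e-3                       # assumed regularisation strength

for epoch in range(200):
    optimiser.zero_grad()
    logits = model(x)
    # Cross-entropy plus the L1 norm of the weights (the bias is not penalised).
    loss = loss_fn(logits, y) + l1_strength * model.weight.abs().sum()
    loss.backward()
    optimiser.step()

with torch.no_grad():
    predictions = (model(x) > 0).float()  # logit > 0 means probability > 0.5
accuracy = (predictions == y).float().mean().item()
print(accuracy)
```

The L1 term pushes the weights of uninformative tokens towards zero, which makes the remaining large weights easier to interpret.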

In [4]:
class Linear(torch.nn.Module):

    def __init__(self, w0, w1, b):
        super().__init__()
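        # Wrap the tensors in Parameter so that they are registered with the
        # module and can be updated by an optimiser.
        self.w0 = torch.nn.Parameter(torch.tensor(w0, dtype=torch.float32))
        self.w1 = torch.nn.Parameter(torch.tensor(w1, dtype=torch.float32))
        self.b = torch.nn.Parameter(torch.tensor(b, dtype=torch.float32))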

    def forward(self, x0, x1):
        return self.w0*x0 + self.w1*x1 + self.b

model = Linear(1, 1, -1)

train_x = []
train_y = []
test_x = []
test_y = []

Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.
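The four metrics can be derived from the counts of true positives, false positives, and false negatives, as in the sketch below. Treating class 1 as the positive class is an assumption; precision, recall, and F1 change if the other class is chosen.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Illustrative labels only: 2 true positives, 1 false positive, 1 false negative.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(metrics)
```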

Write code that shows the top 10 tokens that are the most important for determining the author gender according to the classifier.
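One common reading of "most important" is "largest absolute weight", which is the assumption behind this sketch. It takes one weight per vocabulary entry and skips the special tokens; the names and toy weights are illustrative.

```python
def top_tokens(weights, vocab, k=10, skip=('<PAD>', '<UNK>')):
    """Return the k (token, weight) pairs with the largest absolute weight."""
    ranked = sorted(
        ((abs(weight), token, weight)
         for weight, token in zip(weights, vocab) if token not in skip),
        reverse=True,
    )
    return [(token, weight) for _, token, weight in ranked[:k]]

# Illustrative vocabulary and weights only.
toy_vocab = ['<PAD>', '<UNK>', 'cat', 'sat', 'the']
toy_weights = [0.0, 0.3, -1.5, 0.2, 0.9]
important = top_tokens(toy_weights, toy_vocab, k=2)
print(important)
```

For a `torch.nn.Linear` layer with a single output, the weight list could be obtained with `model.weight.detach().squeeze(0).tolist()`. The sign of each weight indicates which class the token pushes the prediction towards.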

Write code that, for each data split and gender, shows the percentage of rows that include at least one of these important words (so 6 percentages in all).
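A sketch of the coverage computation follows, assuming each split is available as a list of `(tokens, gender)` pairs. The important tokens used here are made up for the example and are not results from any trained classifier.

```python
def coverage_by_gender(rows, important_tokens):
    """Percentage of rows per gender containing at least one important token."""
    important = set(important_tokens)
    totals, hits = {}, {}
    for tokens, gender in rows:
        totals[gender] = totals.get(gender, 0) + 1
        if important.intersection(tokens):
            hits[gender] = hits.get(gender, 0) + 1
    return {gender: 100.0 * hits.get(gender, 0) / totals[gender]
            for gender in totals}

# Illustrative split and token list only.
toy_splits = {
    'test': [
        (['my', 'husband', 'said'], 'female'),
        (['lovely', 'day'], 'female'),
        (['the', 'game', 'ended'], 'male'),
        (['nice', 'day'], 'male'),
        (['hello', 'there'], 'male'),
        (['good', 'night'], 'male'),
    ],
}
toy_important = ['husband', 'game']

coverage = {name: coverage_by_gender(rows, toy_important)
            for name, rows in toy_splits.items()}
for name, per_gender in coverage.items():
    for gender, percentage in per_gender.items():
        print(f'{name} / {gender}: {percentage:.1f}%')
```

Running the same function over the train, dev, and test splits yields the six percentages.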

## Deep learning classifier (50%)

Perform hyperparameter tuning on a deep learning classifier (with a convolutional neural network or a recurrent neural network) that is trained to classify the author gender from the indexified sequences of the text.
Use the dev set for evaluation.
Output the best hyperparameters found and do not store the best trained model as you will be training it again in the next bit.
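The tuning loop itself can be kept separate from the model, as in this grid search sketch. The `evaluate` callback stands in for "train a fresh model with these hyperparameters and return its dev set accuracy"; the `toy_evaluate` function and the hyperparameter names are placeholders, not a real training run.

```python
import itertools

def grid_search(search_space, evaluate):
    """Try every combination and return (best hyperparameters, best dev score)."""
    names = sorted(search_space)
    best_params, best_score = None, float('-inf')
    for values in itertools.product(*(search_space[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)  # only the score is kept, not the model
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(embedding_size, learning_rate):
    # Placeholder for training a classifier and measuring dev accuracy.
    return 1.0 - abs(embedding_size - 64) / 1000 - abs(learning_rate - 0.01)

toy_space = {
    'embedding_size': [32, 64, 128],
    'learning_rate': [0.1, 0.01, 0.001],
}
best_params, best_score = grid_search(toy_space, toy_evaluate)
print(best_params, best_score)
```

Because only the hyperparameters and score are returned, the best model has to be retrained afterwards, which matches the instruction not to store it. Random search over the same space is a reasonable alternative when the grid is large.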

Use the hyperparameters found in the previous bit to train the classifier, this time outputting a graph showing the dev set accuracy after every epoch.

Measure the accuracy, precision, recall, and F1-score of this classifier on the test set.

Output a confusion matrix of the trained model on the test set.
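A confusion matrix simply counts each (true label, predicted label) pair. In the sketch below rows are true classes and columns are predicted classes, which is a convention chosen for this example; labels are assumed to be integers starting at 0.

```python
def confusion_matrix(y_true, y_pred, num_classes=2):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Illustrative labels only.
matrix = confusion_matrix([0, 0, 0, 1, 1], [0, 1, 0, 1, 1])
for row in matrix:
    print(row)
```

The diagonal holds the correctly classified counts, so the accuracy is the diagonal sum divided by the total.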

Output 5 examples of correctly classified text for each gender and 5 examples of incorrectly classified text for each gender (so 20 text examples in total), all of which must be from the test set.
This is assuming that you have at least 5 instances of each group.
If you have fewer, then show whatever is available.
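Collecting the examples amounts to bucketing test texts by true label and by whether the prediction was correct, then stopping each bucket at the limit. The sketch below assumes integer labels and uses hypothetical names. Buckets with fewer available texts simply stay shorter, and empty buckets are absent from the result.

```python
def sample_examples(texts, y_true, y_pred, per_group=5):
    """Group texts by (true label, was correct), keeping at most per_group each."""
    groups = {}
    for text, true_label, predicted_label in zip(texts, y_true, y_pred):
        key = (true_label, true_label == predicted_label)
        bucket = groups.setdefault(key, [])
        if len(bucket) < per_group:
            bucket.append(text)
    return groups

# Illustrative texts and labels only.
examples = sample_examples(['a', 'b', 'c', 'd'], [0, 0, 1, 1], [0, 1, 1, 1],
                           per_group=1)
for (label, correct), bucket in sorted(examples.items()):
    print(label, 'correct' if correct else 'incorrect', bucket)
```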

Remember the list of important tokens determined previously (from the logistic regression classifier)?
Write code that takes all the texts in the test set that have at least one of the important tokens and shows the percentage of these texts that were correctly classified.
Similarly, take all the texts that don't have any of the important tokens and show the percentage of these texts that were correctly classified (so 2 percentages in total).
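The two percentages can be computed in one pass by splitting the test texts on whether they contain any important token. This is a sketch with made-up tokens and labels; a group with no texts returns `None` instead of a percentage.

```python
def accuracy_by_token_presence(tokenised_texts, y_true, y_pred, important_tokens):
    """Return (% correct among texts with an important token, % correct among the rest)."""
    important = set(important_tokens)
    counts = {True: [0, 0], False: [0, 0]}  # has_token -> [correct, total]
    for tokens, true_label, predicted_label in zip(tokenised_texts, y_true, y_pred):
        has_token = not important.isdisjoint(tokens)
        counts[has_token][0] += int(true_label == predicted_label)
        counts[has_token][1] += 1

    def percentage(correct, total):
        return 100.0 * correct / total if total else None

    return percentage(*counts[True]), percentage(*counts[False])

# Illustrative test texts, labels, predictions and tokens only.
toy_texts = [['my', 'husband'], ['the', 'game'], ['nice', 'day'], ['hello']]
with_tokens, without_tokens = accuracy_by_token_presence(
    toy_texts, [1, 0, 0, 1], [1, 0, 0, 0], ['husband', 'game'])
print(with_tokens, without_tokens)
```

A large gap between the two numbers would suggest that the deep model leans on the same surface vocabulary as the simpler classifier.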

## Conclusion (10%)

Write, in fewer than 300 words, your interpretation of the results and how you think the model could perform better.
You should talk about things like overfitting/underfitting and whether the model is learning anything deep about how the different genders write or if it's just basing everything on the words used.