<a href="https://colab.research.google.com/github/Dovermore/COMP5046-ass1/blob/master/zhua9812_COMP5046_Ass1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COMP5046 Assignment 1
*Make sure you rename the file to include your unikey.*

# Readme
*If there is something to be noted for the user, please mention here.* 

*If you are planning to implement a program in an Object Oriented Programming style, use the "Object Oriented Programming codes here" section at the end of this notebook.*

***Visualising the comparison of different results is a good way to justify your decision.***

# 1 - Data Preprocessing

## 1.1. Download Dataset

In [1]:
# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

id = '1vF3FqgBC1Y-RPefeVmY8zetdZG1jmHzT'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('imdb_train.csv')

id = '1XhaV8YMuQeSwozQww8PeyiWMJfia13G6'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('imdb_test.csv')

import pandas as pd
df_train = pd.read_csv("imdb_train.csv")
df_test = pd.read_csv("imdb_test.csv")

reviews_train = df_train['review'].tolist()
sentiments_train = df_train['sentiment'].tolist()
reviews_test = df_test['review'].tolist()
sentiments_test = df_test['sentiment'].tolist()

print("Training set number:",len(reviews_train))
print("Testing set number:",len(reviews_test))

Training set number: 25000
Testing set number: 25000


## 1.2. Preprocess data

*You are required to describe which data preprocessing techniques were conducted with justification of your decision.*

In [2]:
!pip install beautifulsoup4
!pip install contractions



In [3]:
# Please comment your code
from sklearn.preprocessing import LabelEncoder
from sklearn.base import TransformerMixin

import copy
import re
from bs4 import BeautifulSoup
import contractions

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# nltk.download('stopwords')
# from nltk.corpus import stopwords as sw

# eng_stopwords = sw.words("english")
lemmatizer = WordNetLemmatizer()


def remove_punctuation(x):
    """
    Removes punctuation from a string x
    """
    x = re.sub(r'[^\w\s]','',x)
    return x

def preprocess_texts(X):
    # Strip HTML tags, if any, with Beautiful Soup; the parser is named explicitly to avoid a warning
    X = [BeautifulSoup(s, "html.parser").get_text() for s in X]

    # Expand contractions (English only) to normalise the text. This runs before lower-casing because
    # the expansions can introduce upper-case letters.
    X = [contractions.fix(s) for s in X]

    # Case folding reduces the number of unique words and removes irregular capitalisation.
    # It loses some information (for instance, words in ALL CAPS often carry strong emotion),
    # but smoothing word occurrences is generally beneficial.
    X = [s.lower() for s in X]

    # Punctuation is removed for much the same reason as case folding. Each review is
    # self-contained, so no end-of-sentence token is added.
    X = [remove_punctuation(s) for s in X]

    # Tokenisation splits each review into individual words instead of feeding in raw sentences.
    X = [word_tokenize(sent) for sent in X]

    # Stop words are NOT removed (yet) because some of them strongly affect sentiment (e.g. "not", "wouldn't").
    # They could be removed later with a better-curated list and a closer look at the data.

    # Lemmatise tokens to shrink the vocabulary, which makes training easier by reducing the number of output classes
    X = [[lemmatizer.lemmatize(w) for w in tokens] for tokens in X]

    return X


class TextPreprocessTransformer(TransformerMixin):
    """Stateless scikit-learn style wrapper around preprocess_texts, so it can be used in a Pipeline."""
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return preprocess_texts(X)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
tpt = TextPreprocessTransformer()
texts_train = tpt.fit_transform(reviews_train)
texts_test = tpt.transform(reviews_test)  # never fit on test data (fit is a no-op here, but this keeps the habit)
print(texts_train[:2])
print(texts_test[:2])

In [0]:
label_encoder = LabelEncoder()
label_train = label_encoder.fit_transform(sentiments_train)
label_test = label_encoder.transform(sentiments_test)
print(label_train[:50])
print(label_test[:50])

# 2 - Model Implementation

## 2.1. Word Embeddings

*You are required to describe which model was implemented (i.e. Word2Vec with CBOW, FastText with SkipGram, etc.) with justification of your decision.*

### 2.1.1. Data Preprocessing for Word Embeddings

*You are required to describe which preprocessing techniques were used with justification of your decision.*
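
As a rough illustration of the preprocessing that Skip-gram needs, the sketch below turns one tokenised review into (target, context) pairs. The helper name `skip_gram_pairs` and the toy tokens are hypothetical and not part of the assignment code; the window here is taken to be symmetric, covering `window` tokens on each side of the target.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(i - window, 0)
        hi = min(i + window + 1, len(tokens))  # +1 so the right side is as wide as the left
        for k in range(lo, hi):
            if k != i:  # a word is never its own context
                pairs.append((target, tokens[k]))
    return pairs


# Toy example: one lemmatised review with a window of 1
pairs = skip_gram_pairs(["the", "movie", "wa", "great"], window=1)
print(pairs)
```

Each token appears as a target once per neighbour, so a longer window produces more training pairs per review at the cost of noisier contexts.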

**Important**: If you reuse the word2vec preprocessing code from lab3, note that `word_list = list(set(word_list))` does not produce a stable order across runs. To make sure `word_list` is the same every time you run it, put `word_list.sort()` after that line of code.
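
A minimal sketch of that advice, assuming the reviews are already tokenised. The function name `build_vocab` and the toy documents are illustrative only.

```python
def build_vocab(tokenised_docs):
    """Map each unique token to an index that is identical on every run."""
    word_list = [w for doc in tokenised_docs for w in doc]
    word_list = list(set(word_list))  # set order can differ between runs
    word_list.sort()                  # sorting pins the order down
    return {w: i for i, w in enumerate(word_list)}


docs = [["good", "movie"], ["bad", "movie"]]
vocab = build_vocab(docs)
print(vocab)  # {'bad': 0, 'good': 1, 'movie': 2}
```

A stable index matters because saved embedding weights are only meaningful together with the exact token-to-index mapping they were trained with.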

In [0]:
from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np
# The following code implements Word2Vec with Skip-gram, chosen because it is simple to implement and
# earlier labs already provide a code base to build on

# First define a dataset generator
class SkipGramTransformer(TransformerMixin):
    def __init__(self, window=10):
        self.window = window
        self.token_list = []
        self.token_dict = {}

    def fit(self, X, y=None, **fit_params):
        refit = fit_params.get("refit", False)
        token_set = set()
        for tokens in X:
            token_set |= set(tokens)
        # Sort the tokens so that indices are reproducible across runs (set order is not stable)
        if refit:
            self.token_list = sorted(token_set)
        else:
            token_set -= set(self.token_list)
            self.token_list += sorted(token_set)
        self.token_dict = {w: i for i, w in enumerate(self.token_list)}
        return self
    
    def transform(self, X, y=None):
        skip_grams = []
        for tokens in X:
            for i in range(len(tokens)):
                target = self.token_dict.get(tokens[i], None)
                if target is None: continue
                # +1 on the upper bound so the window is symmetric around the target
                for k in range(max(i - self.window, 0), min(i + self.window + 1, len(tokens))):
                    if k == i:
                        continue
                    context = self.token_dict.get(tokens[k], None)
                    if context is None: continue
                    skip_grams.append([target, context])
        return list(zip(*skip_grams))

    def generator(self, X, batch_size=1024, pool_size=5120):
        skip_gram_pool = np.zeros((0, 2), dtype=int)
        idx = 0
        while True:
            end_epoch = False
            while skip_gram_pool.shape[0] < pool_size:
                tokens = X[idx]
                skip_grams = []
                for i in range(len(tokens)):
                    target = self.token_dict.get(tokens[i], None)
                    if target is None: continue
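                    # +1 on the upper bound so the window is symmetric around the target
                    for k in range(max(i - self.window, 0), min(i + self.window + 1, len(tokens))):
                        if k == i:
                            continue
                        context = self.token_dict.get(tokens[k], None)
                        if context is None: continue
                        skip_grams.append([target, context])
                # Documents that yield no pairs are skipped: concatenating an empty list raises a shape error
                if skip_grams:
                    skip_gram_pool = np.concatenate([skip_gram_pool, skip_grams], axis=0)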
                idx += 1
                if idx >= len(X):
                    end_epoch = True
                idx %= len(X)
            batch_idx = np.random.choice(skip_gram_pool.shape[0], size=batch_size, replace=False)
            yield skip_gram_pool[batch_idx, 0], skip_gram_pool[batch_idx, 1].reshape(-1), end_epoch 
            skip_gram_pool = np.delete(skip_gram_pool, batch_idx, axis=0)


    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X, y)


from sklearn.pipeline import make_pipeline, Pipeline
skip_gram_pipeline = make_pipeline(TextPreprocessTransformer(), SkipGramTransformer())

In [0]:
# Test
sgt = SkipGramTransformer()
test_data = np.array(range(10)).reshape(5, 2).tolist()
sgt.fit(test_data)
i = 0
for i, (a, b, c) in enumerate(sgt.generator(test_data, 2, 10)):
    print(i, a, b, c)
    if i > 20:
        break

In [0]:
# Profile time
sgt = SkipGramTransformer()
sgt.fit(texts_train)
datagen = sgt.generator(texts_train)
%timeit next(datagen)

### 2.1.2. Build Word Embeddings Model

*You are required to describe how hyperparameters were decided with justification of your decision.*

In [0]:
import torch
from torch import nn
from torch.utils import data
from torch.optim import Adam
from datetime import datetime


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class BaseModel(nn.Module):
    def __init__(self):
        super().__init__()
        # store optimizer
        self.optimizer = None 

    def train_step(self, X_batch, y_batch):
        # zero the parameter gradients
        self.optimizer.zero_grad()
    
        # forward + backward + optimize
        outputs = self.forward(X_batch)
        loss = self.loss_fn(outputs, y_batch)
        loss.backward()
        self.optimizer.step()
        return outputs, loss

    # NOTE: this method shadows nn.Module.train(mode), so calling .eval() or .train(False)
    # on this model raises a TypeError. Rename it if those calls are needed.
    def train(self, data, epochs, batch_size=1024, batch_display_interval=10000, epoch_display_interval=100, data_gen_dict=None):
        # A non-positive interval disables the corresponding progress output
        if batch_display_interval <= 0:
            batch_display_interval = 1000000000
        if epoch_display_interval <= 0:
            epoch_display_interval = 1000000000
        # data_gen_dict defaults to None to avoid a shared mutable default argument
        data_gen = self.data_generator(data, batch_size=batch_size, **(data_gen_dict or {}))
        batch = 0
        for epoch in range(epochs):
            epoch_loss = 0.0
            epoch_size = 0
            for X_batch, y_batch, end_epoch in data_gen:
                X_batch = torch.from_numpy(X_batch).to(device)
                y_batch = torch.from_numpy(y_batch).to(device)
                # Train
                outputs, loss = self.train_step(X_batch, y_batch)
                # .item() gives a plain float, so the running total does not keep autograd graphs alive
                batch_loss = loss.item()
                epoch_loss += batch_loss * batch_size
                epoch_size += batch_size
                batch += 1
                if batch % batch_display_interval == 0:
                    print('    Batch: %d, loss: %.4f' %(batch, batch_loss))
                if end_epoch:
                    break
            epoch_loss /= epoch_size
            if epoch % epoch_display_interval == epoch_display_interval - 1:
                print('Epoch: %d, loss: %.4f' %(epoch + 1, epoch_loss))

    def data_generator(self, X, batch_size, **kwargs):
        pass

    def set_optimizer(self, optimizer):
        self.optimizer = optimizer


# Define model
class W2VSkipGram(BaseModel):
    def __init__(self, num_embeddings, embedding_dim, window=10, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # linear embedding
        self.embedding_layer = nn.Embedding(num_embeddings, embedding_dim)
        # linear mapping
        self.forward_layer = nn.Linear(embedding_dim, num_embeddings, bias=False)
        # text transformer
        self.skip_gram_transformer = SkipGramTransformer(window)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, X):
        # forward pass
        return self.forward_layer(self.embedding_layer(X))

    def data_generator(self, X, batch_size, **kwargs):
        # Fit vocab only if training
        if self.training:
            self.skip_gram_transformer.fit(X)
        return self.skip_gram_transformer.generator(X, batch_size=batch_size, **kwargs)

    def predict(self, X):
        output = self.forward(X)
        return output.argmax(dim=1)

In [0]:
test_data = np.array(range(4)).reshape(2, 2).tolist()
test_model = W2VSkipGram(4, 4).to(device)
optimizer = Adam(test_model.parameters())
test_model.set_optimizer(optimizer)
datagen = test_model.data_generator(test_data, batch_size=2, pool_size=4)
next(datagen)

In [0]:
_X = torch.from_numpy(np.array([0, 1, 2, 3])).to(device)
_y = torch.from_numpy(np.array([1, 0, 3, 2]))
# Before
print(test_model.forward(_X), test_model.predict(_X))
test_model.train(test_data, 1000, 2, batch_display_interval=0, epoch_display_interval=0, data_gen_dict={"pool_size":2})
# After
# Overfit the small dataset
print(test_model.forward(_X), test_model.predict(_X))

In [0]:
# profile step time
sgt = SkipGramTransformer().fit(texts_train)
num_embeddings = len(sgt.token_list)
test_model = W2VSkipGram(num_embeddings, 64).to(device)
optimizer = Adam(test_model.parameters())
test_model.set_optimizer(optimizer)
datagen = test_model.data_generator(texts_train, batch_size=1024, pool_size=5000)
X_batch, y_batch, _ = next(datagen)
X_batch = torch.from_numpy(X_batch).to(device)
y_batch = torch.from_numpy(y_batch).to(device)
%timeit -n 10 test_model.train_step(X_batch, y_batch)

### 2.1.3. Train Word Embeddings Model

In [0]:
sgt = SkipGramTransformer().fit(texts_train)
num_embeddings = len(sgt.token_list)
embedding_dim = 64
# Move the model to the same device that train() moves each batch to
w2v_model = W2VSkipGram(num_embeddings, embedding_dim).to(device)
optimizer = Adam(w2v_model.parameters())
w2v_model.set_optimizer(optimizer)
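# Report the loss after every epoch; the default interval of 100 would print nothing for 2 epochs
w2v_model.train(texts_train, 2, batch_display_interval=100, epoch_display_interval=1)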

### 2.1.4. Save Word Embeddings Model

In [0]:
# Please comment your code
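
One possible way to fill in this step (and the loading step in the next section) is to store the model's `state_dict` together with the vocabulary, since the weights are only usable with the token order they were trained on. This is a sketch under assumptions, not the notebook's actual implementation: the small `nn.Embedding` stands in for the trained model, and the file name `w2v_skipgram.pt` and the toy vocabulary are hypothetical.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in for the trained embedding layer (sizes are illustrative only)
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
token_list = ["bad", "good", "great", "movie", "wa"]  # toy vocabulary

# Hypothetical file name, written to a temporary directory for this example
path = os.path.join(tempfile.mkdtemp(), "w2v_skipgram.pt")

# Save the weights and the vocabulary in a single checkpoint
torch.save({"state_dict": embedding.state_dict(), "token_list": token_list}, path)

# Load: rebuild a layer of the same shape, then copy the saved weights in
checkpoint = torch.load(path)
restored = nn.Embedding(num_embeddings=5, embedding_dim=3)
restored.load_state_dict(checkpoint["state_dict"])

print(torch.equal(restored.weight, embedding.weight))
```

Saving the `state_dict` rather than the whole module keeps the checkpoint independent of the class definition, at the cost of having to reconstruct the model with matching sizes before loading.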

### 2.1.5. Load Word Embeddings Model

In [0]:
# Please comment your code

## 2.2. Character Embeddings

### 2.2.1. Data Preprocessing for Character Embeddings

*You are required to describe which preprocessing techniques were used with justification of your decision.*

In [0]:
# Please comment your code

### 2.2.2. Build Character Embeddings Model

*You are required to describe how hyperparameters were decided with justification of your decision.*

In [0]:
# Please comment your code

### 2.2.3. Train Character Embeddings Model

In [0]:
# Please comment your code

### 2.2.4. Save Character Embeddings Model

In [0]:
# Please comment your code

### 2.2.5. Load Character Embeddings Model

In [0]:
# Please comment your code

## 2.3. Sequence model

### 2.3.1. Apply/Import Word Embedding and Character Embedding Model

*You are required to describe how hyperparameters were decided with justification of your decision.*

In [0]:
# Please comment your code

### 2.3.2. Build Sequence Model

*You are required to describe how hyperparameters were decided with justification of your decision.*

In [0]:
# Please comment your code

### 2.3.3. Train Sequence Model

In [0]:
# Please comment your code

### 2.3.4. Save Sequence Model

In [0]:
# Please comment your code

### 2.3.5. Load Sequence Model

In [0]:
# Please comment your code

# 3 - Evaluation

(*Please show your empirical evidence*)

## 3.1. Performance Evaluation


You are required to provide the table with precision, recall, f1 of test set.
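
A minimal sketch of how such a table could be produced with scikit-learn, assuming integer-encoded labels like those returned by the `LabelEncoder` above. The `y_true` and `y_pred` values are toy data made up for illustration, not results from this notebook, and the row names are placeholders.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Toy labels and predictions, for illustration only (not real results)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# One value per class, in the order given by `labels`
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)

table = pd.DataFrame(
    {"precision": precision, "recall": recall, "f1": f1, "support": support},
    index=["class 0", "class 1"],  # placeholder row names
)
print(table.round(4))
```

For the real evaluation, `y_true` would be `label_test` and `y_pred` the sequence model's predictions on the test set.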

In [0]:
# Please comment your code

## 3.2. Hyperparameter Testing
*You are required to draw a graph (y-axis: f1, x-axis: epoch) for the test set and explain the optimal number of epochs based on the learning rate you have already chosen.*

In [0]:
# Please comment your code

## Object Oriented Programming codes here

*You can use multiple code snippets. Just add more if needed* 

In [0]:
# If you used OOP style, use this section