# [COM6513] Assignment 1: Sentiment Analysis with Logistic Regression

### Instructor: Nikos Aletras


The goal of this assignment is to develop and test a **text classification** system for **sentiment analysis**, in particular to predict the sentiment of movie reviews, i.e. positive or negative (binary classification).



For that purpose, you will implement:


- Text processing methods for extracting Bag-Of-Word features, using 
    - n-grams (BOW), i.e. unigrams, bigrams and trigrams to obtain vector representations of documents where n=1,2,3 respectively. Two vector weighting schemes should be tested: (1) raw frequencies (**1 mark**); (2) tf.idf (**1 mark**). 
    - character n-grams (BOCN). A character n-gram is a contiguous sequence of characters given a word, e.g. for n=2, 'coffee' is split into {'co', 'of', 'ff', 'fe', 'ee'}. Two vector weighting schemes should be tested: (1) raw frequencies (**1 mark**); (2) tf.idf (**1 mark**). **Tip: Note the large vocabulary size!** 
    - a combination of the two vector spaces (n-grams and character n-grams) choosing your best performing wighting respectively (i.e. raw or tfidf). (**1 mark**) **Tip: you should merge the two representations**



- Binary Logistic Regression (LR) classifiers that will be able to accurately classify movie reviews trained with: 
    - (1) BOW-count (raw frequencies) 
    - (2) BOW-tfidf (tf.idf weighted)
    - (3) BOCN-count
    - (4) BOCN-tfidf
    - (5) BOW+BOCN (best performing weighting; raw or tfidf)



- The Stochastic Gradient Descent (SGD) algorithm to estimate the parameters of your Logistic Regression models. Your SGD algorithm should:
    - Minimise the Binary Cross-entropy loss function (**1 mark**)
    - Use L2 regularisation (**1 mark**)
    - Perform multiple passes (epochs) over the training data (**1 mark**)
    - Randomise the order of training data after each pass (**1 mark**)
    - Stop training if the difference between the current and previous development loss is smaller than a threshold (**1 mark**)
    - After each epoch print the training and development loss (**1 mark**)



- Discuss how did you choose hyperparameters (e.g. learning rate and regularisation strength) for each LR model? You should use a table showing model performance using different set of hyperparameter values. (**2 marks). **Tip: Instead of using all possible combinations, you could perform a random sampling of combinations.**


- After training each LR model, plot the learning process (i.e. training and validation loss in each epoch) using a line plot. Does your model underfit, overfit or is it about right? Explain why. (**1 mark**). 


- Identify and show the most important features (model interpretability) for each class (i.e. top-10 most positive and top-10 negative weights). Give the top 10 for each class and comment on whether they make sense (if they don't you might have a bug!). If you were to apply the classifier into a different domain such laptop reviews or restaurant reviews, do you think these features would generalise well? Can you propose what features the classifier could pick up as important in the new domain? (**2 marks**)


- Provide well documented and commented code describing all of your choices. In general, you are free to make decisions about text processing (e.g. punctuation, numbers, vocabulary size) and hyperparameter values. We expect to see justifications and discussion for all of your choices (**2 marks**). 


- Provide efficient solutions by using Numpy arrays when possible (you can find tips in Lab 1 sheet). Executing the whole notebook with your code should not take more than 5 minutes on a any standard computer (e.g. Intel Core i5 CPU, 8 or 16GB RAM) excluding hyperparameter tuning runs (**2 marks**). 






### Data 

The data you will use are taken from here: [http://www.cs.cornell.edu/people/pabo/movie-review-data/](http://www.cs.cornell.edu/people/pabo/movie-review-data/) and you can find it in the `./data_sentiment` folder in CSV format:

- `data_sentiment/train.csv`: contains 1,400 reviews, 700 positive (label: 1) and 700 negative (label: 0) to be used for training.
- `data_sentiment/dev.csv`: contains 200 reviews, 100 positive and 100 negative to be used for hyperparameter selection and monitoring the training process.
- `data_sentiment/test.csv`: contains 400 reviews, 200 positive and 200 negative to be used for testing.




### Submission Instructions

You should submit a Jupyter Notebook file (assignment1.ipynb) and an exported PDF version (you can do it from Jupyter: `File->Download as->PDF via Latex` or you can print it as PDF using your browser).

You are advised to follow the code structure given in this notebook by completing all given funtions. You can also write any auxilliary/helper functions (and arguments for the functions) that you might need but note that you can provide a full solution without any such functions. Similarly, you can just use only the packages imported below but you are free to use any functionality from the [Python Standard Library](https://docs.python.org/2/library/index.html), NumPy, SciPy (excluding built-in softmax funtcions) and Pandas. You are not allowed to use any third-party library such as Scikit-learn (apart from metric functions already provided), NLTK, Spacy, Keras etc.. 

There is no single correct answer on what your accuracy should be, but correct implementations usually achieve F1-scores around 80\% or higher. The quality of the analysis of the results is as important as the accuracy itself. 

This assignment will be marked out of 20. It is worth 20\% of your final grade in the module.

The deadline for this assignment is **23:59 on Mon, 14 Mar 2022** and it needs to be submitted via Blackboard. Standard departmental penalties for lateness will be applied. We use a range of strategies to **detect [unfair means](https://www.sheffield.ac.uk/ssid/unfair-means/index)**, including Turnitin which helps detect plagiarism. Use of unfair means would result in getting a failing grade.



In [1]:
import pandas as pd
import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random

# fixing random seed for reproducibility
random.seed(123)
np.random.seed(123)


## Load Raw texts and labels into arrays

First, you need to load the training, development and test sets from their corresponding CSV files (tip: you can use Pandas dataframes).

In [113]:
train_data = pd.read_csv('data_sentiment/train.csv', encoding= 'unicode_escape',names=['text','lable'])
val_data = pd.read_csv('data_sentiment/dev.csv', encoding= 'unicode_escape',names=['text','lable'])
test_data = pd.read_csv('data_sentiment/test.csv', encoding= 'unicode_escape',names=['text','lable'])

train_data.describe()

Unnamed: 0,lable
count,1400.0
mean,0.5
std,0.500179
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


If you use Pandas you can see a sample of the data.

In [115]:
train_data.sample(5)

Unnamed: 0,text,lable
782,""" i would appreciate it if you didn't do that...",0
680,it might surprise some to know that joel and e...,1
212,"when bulworth ended , i allowed myself a sigh ...",1
157,"richard linklater's "" slacker , "" made in 1991...",1
1242,""" first rule of fight club is , don't talk ab...",0


The next step is to put the raw texts into Python lists and their corresponding labels into NumPy arrays:


In [116]:
train_data_text = train_data['text'].tolist()
train_data_label = train_data['lable'].to_numpy()

val_data_text = val_data['text'].tolist()
val_data_label = val_data['lable'].to_numpy()

test_data_text = test_data['text'].tolist()
test_data_label = test_data['lable'].to_numpy()

len(val_data_text)

200

In [117]:
val_data_label

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=int64)

# Vector Representations of Text 


To train and test Logisitc Regression models, you first need to obtain vector representations for all documents given a vocabulary of features (unigrams, bigrams, trigrams).


## Text Pre-Processing Pipeline

To obtain a vocabulary of features, you should: 
- tokenise all texts into a list of unigrams (tip: using a regular expression) 
- remove stop words (using the one provided or one of your preference) 
- compute bigrams, trigrams given the remaining unigrams (or character ngrams from the unigrams)
- remove ngrams appearing in less than K documents
- use the remaining to create a vocabulary of unigrams, bigrams and trigrams (or character n-grams). You can keep top N if you encounter memory issues.


In [118]:
stop_words = ['a','in','on','at','and','or', 
              'to', 'the', 'of', 'an', 'by', 
              'as', 'is', 'was', 'were', 'been', 'be', 
              'are','for', 'this', 'that', 'these', 'those', 'you', 'i',
             'it', 'he', 'she', 'we', 'they', 'will', 'have', 'has',
              'do', 'did', 'can', 'could', 'who', 'which', 'what', 
             'his', 'her', 'they', 'them', 'from', 'with', 'its']
# care about the punctuation like ',' '--'

### N-gram extraction from a document

You first need to implement the `extract_ngrams` function. It takes as input:
- `x_raw`: a string corresponding to the raw text of a document
- `ngram_range`: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
- `token_pattern`: a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you could opt for a simple white space tokenisation.
- `stop_words`: a list of stop words
- `vocab`: a given vocabulary. It should be used to extract specific features.
- `char_ngrams`: boolean. If true the function extracts character n-grams

and returns:

- `x': a list of all extracted features.

See the examples below to see how this function should work.

In [6]:
def extract_ngrams(x_raw, ngram_range=(1,3), token_pattern=r'', 
                   stop_words=[], vocab=set(), char_ngrams=False):
    
    x = []
    if(char_ngrams == False):        
        k=0
        for i,word in enumerate(split_text):
            #if word in vocab:
            temp = [word]
            for j in range(ngram_range[0],ngram_range[1]):
                temp.append(split_text[i:i+j+1])
            x.append(temp)
            k+=1
    
    else:
        split_text = re.split(r'\W*', x_raw)
        for char in split_text:
            if(char == ''):
                split_text.remove(char)
        
        for i,char in enumerate(split_text):
            for j in range(ngram_range[0],ngram_range[1]+1):
                if(i+j<len(split_text)):
                    x.append(''.join(split_text[i:i+j]))
                
        
    return x




In [110]:
x = dict()
ngram_range = (1,3)
split_text = re.split(r'\W+', train_data_text[0])
for word in split_text:
    if word in stop_words:
        split_text.remove(word)

for i in range(ngram_range[0],ngram_range[1]+1):
    x[i]=[]

text = ''.join(split_text)
split_text = re.split(r'\W*', text)
for char in split_text:
    if(char == ''):
        split_text.remove(char)
        

for i,char in enumerate(split_text):
    for j in range(ngram_range[0],ngram_range[1]+1):
        x[j].append(''.join(split_text[i:i+j]))

In [32]:
x = dict()
ngram_range = (1,3)
split_text = re.split(r'\W+', train_data_text[0])
for word in split_text:
    if word in stop_words:
        split_text.remove(word)
for i in range(ngram_range[0],ngram_range[1]+1):
    x[i]=[]
for i,word in enumerate(split_text):
    for j in range(ngram_range[0],ngram_range[1]+1):
        x[j].append(' '.join(split_text[i:i+j])) 


In [6]:
def extract_ngrams_list(x_raw, ngram_range=(1,3), token_pattern=r'', 
                   stop_words=[], vocab=set(), char_ngrams=False):
    
    x_dict = dict()
    x = []
    x_temp = []
    for i in range(ngram_range[0],ngram_range[1]+1):
        x_dict[i]=[]
    
    split_text = re.split(r'\W+',x_raw)
    for word in split_text:
        if word in stop_words:
            split_text.remove(word)

    if(char_ngrams == False):
        for i,word in enumerate(split_text):
            for j in range(ngram_range[0],ngram_range[1]+1):
                x_dict[j].append(' '.join(split_text[i:i+j]))
        for key,values in x_dict.items():
            x_temp.append(values)        

    else:
        text = ''.join(split_text)
        split_text = re.split(r'\W*', text)
        for char in split_text:
            if(char == ''):
                split_text.remove(char)

        for i,char in enumerate(split_text):
            for j in range(ngram_range[0],ngram_range[1]+1):
                x_dict[j].append(''.join(split_text[i:i+j]))
                
        for key,values in x_dict.items():
            x_temp.append(values)
        
    x_temp = [i for k in x_temp for i in k]
    if len(vocab)>0:
        for voc in x_temp:
            if voc in vocab:
                x.append(voc)
    else:
        x = x_temp 
                
        
    return x

In [17]:
ngram_range = (1,3)
split_text = re.split(r'\W*', train_data_text[0])
for char in split_text:
    if(char == ''):
        split_text.remove(char)

xx = dict()
for i in range(ngram_range[0],ngram_range[1]+1):
    xx[i]=[]
for i,word in enumerate(split_text):
    for j in range(ngram_range[0],ngram_range[1]+1):
            xx[j].append(''.join(split_text[i:i+j]))


Note that it is OK to represent n-grams using lists instead of tuples: e.g. `['great', ['great', 'movie']]`

For extracting character n-grams the function should work as follows:

In [None]:
extract_ngrams("movie", 
               ngram_range=(2,4), 
               stop_words=[],
               char_ngrams=True)

### Create a vocabulary 

The `get_vocab` function will be used to (1) create a vocabulary of ngrams; (2) count the document frequencies of ngrams; (3) their raw frequency. It takes as input:
- `X_raw`: a list of strings each corresponding to the raw text of a document
- `ngram_range`: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
- `token_pattern`: a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you could opt for a simple white space tokenisation.
- `stop_words`: a list of stop words
- `min_df`: keep ngrams with a minimum document frequency.
- `keep_topN`: keep top-N more frequent ngrams.

and returns:

- `vocab`: a set of the n-grams that will be used as features.
- `df`: a Counter (or dict) that contains ngrams as keys and their corresponding document frequency as values.
- `ngram_counts`: counts of each ngram in vocab

Hint: it should make use of the `extract_ngrams` function.

In [44]:
def get_vocab_dict(X_raw, ngram_range=(1,3), token_pattern=r'', 
              min_df=0, keep_topN=0, 
              stop_words=[],char_ngrams=False):
    
    ngram_counts = dict()
    ngram_counts_temp = dict()
    df = dict()
    df_temp = dict()
    ngrams = []
    vocab = dict()

    for i in range(ngram_range[0],ngram_range[1]+1):
        ngram_counts_temp[i] = []
        ngram_counts[i]=[]
        df_temp[i] =[]
        df[i] = []
        vocab[i]=[]
        
    for text in X_raw:
        split_text = re.split(r'\W+',text)
        for word in split_text:
            if word in stop_words:
                split_text.remove(word)
        
        docs = extract_ngrams_dict(split_text,ngram_range,token_pattern=r'\W*',char_ngrams=char_ngrams)
        for i in range(ngram_range[0],ngram_range[1]+1):
            ngram_counts_temp[i] += docs[i]
            df_temp[i] += set(docs[i])
        
    for i in range(ngram_range[0],ngram_range[1]+1):
        ngram_counts[i] = Counter(ngram_counts_temp[i])
        temp = Counter(df_temp[i])
        for ngram,num in dict(temp).items():
            if num < min_df:
                temp.pop(ngram)
                ngram_counts[i].pop(ngram)
        df[i]=temp
        vocab[i]=list(df[i].keys())
    
    
    return vocab, df, ngram_counts

In [7]:
def get_vocab_list(X_raw, ngram_range=(1,3), token_pattern=r'', 
              min_df=0, keep_topN=0, 
              stop_words=[],char_ngrams=False):
    
    df = []
    df_save =[]
    ngram_counts_temp = []
    vocab = []

    for text in X_raw:
        docs = extract_ngrams_list(text,ngram_range,stop_words =stop_words,char_ngrams=char_ngrams)
        ngram_counts_temp.append(docs)
        df_save.append(set(docs)) 

    df_save = [i for k in df_save for i in k]
    ngram_counts_temp = [i for k in ngram_counts_temp for i in k]
    ngram_counts = Counter(ngram_counts_temp)
    temp = Counter(df_save)
    for ngram,num in dict(temp).items():
        if num < min_df:
            temp.pop(ngram)
            ngram_counts.pop(ngram)
        if num == len(X_raw):
            temp.pop(ngram)
            ngram_counts.pop(ngram)
    df=temp
    vocab=list(df.keys())
    
    
    return vocab, df, ngram_counts

In [49]:
l = ['s',['e'],['c']]
b =[]
for _ in l:
    b += _
b

['s', 'e', 'c']

In [None]:
    if(char_ngrams == False):
        token_pattern = r'\W+'
    else:
        token_pattern = r'\W*'

In [97]:
ngram_counts_temp=[]
split_text = re.split(r'\W+',train_data_text[0])
for word in split_text:
    if word in stop_words:
        split_text.remove(word)

docs = extract_ngrams_dict(split_text,ngram_range,token_pattern=r'\W*',char_ngrams=False)
ngram_counts_temp.append(docs)
ngram_counts_temp[0][2003]

'schreck nosferatu using'

In [21]:

docs = extract_ngrams_list(train_data_text[0],ngram_range,char_ngrams=True)

docs[6000]

'ali'

In [73]:
# contain the number!!!!  keep_topN
ngram_range = (2,4)

df = []
df_save =[]
ngram_counts_temp = []
vocab = []
            
for text in train_data_text:
    split_text = re.split(r'\W+',text)
    for word in split_text:
        if word in stop_words:
            split_text.remove(word)
            
    docs = extract_ngrams_dict(split_text,ngram_range,token_pattern=r'\W*',char_ngrams=True)
    ngram_counts_temp.append(docs)
    df_save.append(set(docs)) 
        
df_save = [i for k in df_save for i in k]
ngram_counts_temp = [i for k in ngram_counts_temp for i in k]
ngram_counts = Counter(ngram_counts_temp)
temp = Counter(df_save)
for ngram,num in dict(temp).items():
    if num < 50:
        temp.pop(ngram)
        ngram_counts.pop(ngram)
df=temp
vocab=list(df.keys())


print(len(df))
print(len(ngram_counts))
print(len(vocab))

18060
18060
18060


Now you should use `get_vocab` to create your vocabulary and get document and raw frequencies of n-grams:

In [119]:
ngram_range=(1,3)
min_df = 50
char_ngrams=False
vocab,df,ngram_counts = get_vocab_list(train_data_text, ngram_range=ngram_range, min_df=min_df,keep_topN=0,stop_words=stop_words,char_ngrams=char_ngrams)
len(vocab)  
# (1,3) word 20 5421
# (1,3) word 50 2034
#(2,4) char 50 18060
#(2,4) char 100 11038

2036

In [10]:
df

Counter({'important': 129,
         'every': 483,
         'often': 215,
         'keep': 214,
         'offer': 71,
         'but s': 203,
         's all': 87,
         'appears': 140,
         'we': 292,
         'looking': 266,
         'many': 561,
         'age': 123,
         'bad': 549,
         'clever': 110,
         'one': 1246,
         'that': 801,
         'happens': 160,
         'already': 176,
         'scene': 555,
         'mr': 110,
         'system': 65,
         'likable': 52,
         'cinematography': 91,
         'front': 88,
         't': 1175,
         'ones': 106,
         'a': 1383,
         'his': 1045,
         'one s': 51,
         'couldn t': 118,
         'close': 162,
         'him': 789,
         'helps': 52,
         'the film': 818,
         'edge': 64,
         'into a': 138,
         'note': 155,
         'called': 237,
         'overall': 98,
         'while': 677,
         'same': 444,
         'virtually': 68,
         'computer': 126,
       

Then, you need to create 2 dictionaries: (1) vocabulary id -> word; and  (2) word -> vocabulary id so you can use them for reference:

In [120]:
vocID_word = dict()
word_vocID = dict()

for i,voc in enumerate(vocab):
    vocID_word[i] = voc
    word_vocID[voc] = i

Now you should be able to extract n-grams for each text in the training, development and test sets:

In [121]:
ngram_range=(1,3)
char_ngrams=False
ngram_train=[]
ngram_val=[]
ngram_test=[]
for text in train_data_text:
    docs = extract_ngrams_list(text,ngram_range,vocab = vocab,char_ngrams = char_ngrams)
    ngram_train.append(docs)
for text in val_data_text:
    docs = extract_ngrams_list(text,ngram_range,vocab = vocab,char_ngrams = char_ngrams)
    ngram_val.append(docs)
for text in test_data_text:
    docs = extract_ngrams_list(text,ngram_range,vocab = vocab,char_ngrams = char_ngrams)
    ngram_test.append(docs)


In [22]:
X_ngram_train = [i for k in ngram_train for i in k]
X_ngram_val = [i for k in ngram_val for i in k]
X_ngram_test = [i for k in ngram_test for i in k]
len(Counter(X_ngram_train))

2024

## Vectorise documents 

Next, write a function `vectoriser` to obtain Bag-of-ngram representations for a list of documents. The function should take as input:
- `X_ngram`: a list of texts (documents), where each text is represented as list of n-grams in the `vocab`
- `vocab`: a set of n-grams to be used for representing the documents

and return:
- `X_vec`: an array with dimensionality Nx|vocab| where N is the number of documents and |vocab| is the size of the vocabulary. Each element of the array should represent the frequency of a given n-gram in a document.


In [122]:
def vectorise(X_ngram, vocab):
    rows, cols = len(X_ngram),len(vocab)
    X_vec = [([]*cols) for i in range(rows)]
    for i,doc in enumerate(X_ngram):
        doc_fre = Counter(doc)
        for j,word in enumerate(vocab):
            if doc_fre[word]>0:
                X_vec[i].append(doc_fre[word])
            else:
                X_vec[i].append(0)
    X_vec = np.squeeze(np.asarray(X_vec))
    return X_vec

Finally, use `vectorise` to obtain document vectors for each document in the train, development and test set. You should extract both count and tf.idf vectors respectively:

In [56]:
X_vec = [([]*2034) for i in range(1399)]
for i,doc in enumerate(ngram_train):
    doc_fre = Counter(doc)
    for j,word in enumerate(vocab):
        if doc_fre[word]>0:
            X_vec[i].append(doc_fre[word])
        else:
            X_vec[i].append(0)
X_vec = np.squeeze(np.asarray(X_vec))
X_vec

array([[30,  1,  1, ...,  0,  0,  0],
       [31,  0,  0, ...,  0,  0,  0],
       [12,  0,  0, ...,  0,  0,  0],
       ...,
       [17,  0,  0, ...,  0,  0,  0],
       [28,  0,  1, ...,  0,  0,  0],
       [24,  0,  0, ...,  0,  0,  0]])

#### Count vectors

In [123]:
X_trian = vectorise(ngram_train,vocab)
X_dev = vectorise(ngram_val,vocab)
X_test = vectorise(ngram_test,vocab)

In [59]:
X_trian[2]

array([12,  0,  0, ...,  0,  0,  0])

#### TF.IDF vectors

First compute `idfs` an array containing inverted document frequencies (Note: its elements should correspond to your `vocab`)

In [14]:
idf_dict = dict()

for word in vocab:
    idf_dict[word] = np.log10(len(ngram_train)/df[word])


Then transform your count vectors to tf.idf vectors:

In [28]:
idf_list = list(idf_dict.values())
idf_arr = np.array(idf_list)


In [29]:
tf_idf = X_trian * idf_arr
tf_idf

array([[1.035228  , 0.46187058, 0.81337925, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.46187058, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [3.10568401, 0.46187058, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

# Binary Logistic Regression

After obtaining vector representations of the data, now you are ready to implement Binary Logistic Regression for classifying sentiment.

First, you need to implement the `sigmoid` function. It takes as input:

- `z`: a real number or an array of real numbers 

and returns:

- `sig`: the sigmoid of `z`

In [124]:
def sigmoid(z):
    
    sig = 1/(1+np.exp(-z))
    
    return sig

Then, implement the `predict_proba` function to obtain prediction probabilities. It takes as input:

- `X`: an array of inputs, i.e. documents represented by bag-of-ngram vectors ($N \times |vocab|$)
- `weights`: a 1-D array of the model's weights $(1, |vocab|)$

and returns:

- `preds_proba`: the prediction probabilities of X given the weights

In [187]:
def predict_proba(X, weights):
    
    
    preds_proba = sigmoid(weights@X.T) 
    print(np.shape(preds_proba))
    
    return preds_proba

Then, implement the `predict_class` function to obtain the most probable class for each vector in an array of input vectors. It takes as input:

- `X`: an array of documents represented by bag-of-ngram vectors ($N \times |vocab|$)
- `weights`: a 1-D array of the model's weights $(1, |vocab|)$

and returns:

- `preds_class`: the predicted class for each x in X given the weights

In [206]:
def predict_class(X, weights):
    y = np.dot(weights,X.T)
    y_class = []
    for p in y:
        if(p>0.5):
            y_class.append(1)
        else:
            y_class.append(0)
    return np.array(y_class)

To learn the weights from data, we need to minimise the binary cross-entropy loss. Implement `binary_loss` that takes as input:

- `X`: input vectors
- `Y`: labels
- `weights`: model weights
- `alpha`: regularisation strength

and return:

- `l`: the loss score

In [221]:
def binary_loss(X, Y, Y_pre, weights, alpha=0.00001):
    '''
    Binary Cross-entropy Loss

    X:(len(X),len(vocab))
    Y: array len(Y)
    weights: array len(X)
    '''

    l = (-Y*np.log(Y_pre)-(1-Y)*np.log(1-Y_pre)+alpha*np.dot(weights.T,weights)).mean()
    
    return l

# attention: derivation of W?

In [None]:
l = (-Y*np.log(Y_pre)-(1-Y)*np.log(1-Y_pre)+alpha*np.dot(weights.T,weights)).mean()

In [None]:
    l = 0
    for i,e in enumerate(Y):
        if Y[i]==1:
            l += (-np.log(Y_pre[i]))
        else:
            l += (-np.log(1-Y_pre[i]))
        

Now, you can implement Stochastic Gradient Descent to learn the weights of your sentiment classifier. The `SGD` function takes as input:

- `X_tr`: array of training data (vectors)
- `Y_tr`: labels of `X_tr`
- `X_dev`: array of development (i.e. validation) data (vectors)
- `Y_dev`: labels of `X_dev`
- `lr`: learning rate
- `alpha`: regularisation strength
- `epochs`: number of full passes over the training data
- `tolerance`: stop training if the difference between the current and previous validation loss is smaller than a threshold
- `print_progress`: flag for printing the training progress (train/validation loss)


and returns:

- `weights`: the weights learned
- `training_loss_history`: an array with the average losses of the whole training set after each epoch
- `validation_loss_history`: an array with the average losses of the whole development set after each epoch

In [189]:
def update_W(x,weights,yTrue,yPred,lr,alpha):
    new_w = weights - lr*((yPred - yTrue)@x + 2*alpha*weights)
    return new_w

In [237]:
def SGD(X_tr, Y_tr, X_dev=[], Y_dev=[], lr=0.00001, 
        alpha=0.00001, epochs=10, 
        tolerance=0.00001, print_progress=True):
    
    training_loss_history = []
    validation_loss_history = []
    weights = np.zeros(np.shape(X_tr)[1])
    for i in range(epochs):
        indices = np.array(range(len(X_tr)))
        np.random.shuffle(indices)
        x_tr = X_tr[indices]
        y_tr = Y_tr[indices]
        Y_tr_pre = predict_proba(x_tr,weights)
        weights = update_W(x_tr,weights,y_tr,Y_tr_pre,lr,alpha)
            
        training_loss_history.append(binary_loss(x_tr,y_tr,Y_tr_pre,weights,alpha))    
        Y_dev_pre = predict_proba(X_dev,weights)
        validation_loss_history.append(binary_loss(X_dev,Y_dev,Y_dev_pre,weights,alpha))
    
        #if i>0:
            #if(validation_loss_history[i]-validation_loss_history[i-1]<tolerance):
                #break
    
    
    return weights, training_loss_history, validation_loss_history

## Train and Evaluate Logistic Regression with Count vectors

First train the model using SGD:

In [238]:
weights, training_loss_history, validation_loss_history = SGD(X_trian,train_data_label,X_dev,val_data_label)
weights

(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)
(1400,)
(200,)


  l = (-Y*np.log(Y_pre)-(1-Y)*np.log(1-Y_pre)+alpha*np.dot(weights.T,weights)).mean()
  l = (-Y*np.log(Y_pre)-(1-Y)*np.log(1-Y_pre)+alpha*np.dot(weights.T,weights)).mean()


array([ 0.00262644,  0.00070779,  0.0001334 , ..., -0.00061601,
       -0.00284893, -0.00180517])

In [239]:
Y_pre_test = predict_class(X_test,weights)
accuracy_score(test_data_label,Y_pre_test)


[-0.97234605 -0.88773889 -2.58563636 -0.16669081 -1.86471486 -0.83427216
 -2.73451352 -1.29913201 -2.34306333 -0.71129975 -0.90864383 -5.56385478
 -3.42551417 -0.9321981  -1.47591311 -0.12236719 -2.05808496 -0.56626991
 -0.85781952 -2.17658473 -1.49582415 -0.74492034 -2.7205256  -1.68641246
 -0.62224065 -1.99582622 -2.06337398 -0.75927147 -2.08564165 -0.79939319
 -1.51276473 -0.47212473 -1.09407397 -1.80788751 -1.25649498 -4.75911082
 -2.4504026  -2.7773717  -0.50183546 -2.35966793 -0.89928116 -1.61165381
 -1.55982966 -1.41953274 -0.70833415 -1.63457286 -0.615413   -1.00372295
 -3.7928747  -3.01584197 -1.58986025 -0.73876929 -1.48073257 -1.21168033
 -1.33825693 -3.9340089  -1.11138335 -0.82221357 -2.54590612 -2.34785352
 -2.73565243 -1.64293966 -1.03356684 -2.05005918 -4.98847569 -0.17508039
 -1.49774917 -1.81249037 -1.93726834 -4.2269838  -0.72535674  0.29509536
 -1.13001567 -0.44674131 -4.13301981 -0.82883205 -3.88822982 -0.16446604
 -0.42880367 -4.62735451 -2.4156255  -0.21411576  1

0.5025

Now plot the training and validation history per epoch for the best hyperparameter combination. Does your model underfit, overfit or is it about right? Explain why.

Explain here...

#### Evaluation

Compute accuracy, precision, recall and F1-scores:

In [None]:
preds_te_count = predict_class(X_te_count, w_count)

print('Accuracy:', accuracy_score(Y_te,preds_te_count))
print('Precision:', precision_score(Y_te,preds_te_count))
print('Recall:', recall_score(Y_te,preds_te_count))
print('F1-Score:', f1_score(Y_te,preds_te_count))

Finally, print the top-10 words for the negative and positive class respectively.

In [None]:
top_neg = w_count.argsort()[:10]
for i in top_neg:
    print(id2word[i])

In [None]:
top_pos = w_count.argsort()[::-1][:10]
for i in top_pos:
    print(id2word[i])

If we were to apply the classifier we've learned into a different domain such laptop reviews or restaurant reviews, do you think these features would generalise well? Can you propose what features the classifier could pick up as important in the new domain?

Provide your answer here...

### Discuss how did you choose model hyperparameters (e.g. learning rate and regularisation strength)? What is the relation between training epochs and learning rate? How the regularisation strength affects performance?

Enter your answer here...

## Train and Evaluate Logistic Regression with TF.IDF vectors

Follow the same steps as above (i.e. evaluating count n-gram representations).


### Now repeat the training and evaluation process for BOW-tfidf, BOCN-count, BOCN-tfidf, BOW+BOCN including hyperparameter tuning for each model...



## Full Results

Add here your results:

| LR | Precision  | Recall  | F1-Score  |
|:-:|:-:|:-:|:-:|
| BOW-count  |   |   |   |
| BOW-tfidf  |   |   |   |
| BOCN-count  |   |   |   |
| BOCN-tfidf  |   |   |   |
| BOW+BOCN  |   |   |   |

Please discuss why your best performing model is better than the rest.