# Assignment 1

## Important notes
**Submission deadline:**
* **Thursday, 12.03.2020**

**Points: 13 + 2bp**

In [61]:
# Standard IPython notebook imports
%matplotlib inline

import os

from io import StringIO
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

import scipy.stats as sstats
import scipy.optimize as sopt
from scipy.linalg import solve_triangular

from sklearn.model_selection import train_test_split
from collections import defaultdict
import re
from tqdm import tqdm
import seaborn as sns
from nltk.stem.snowball import SnowballStemmer

sns.set_style('whitegrid')

This assignment is meant to test your skills in course pre-requisites:  Scientific Python programming and  Machine Learning. If it is hard, I strongly advise you to drop the course.

Please use GitHub’s [pull requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests) and issues to send corrections!

You can solve the assignment in any system you like, but we encourage you to try out [Google Colab](https://colab.research.google.com/).

## Assignment text
1. **[1p]** Download data competition from a Kaggle competition on sentiment prediction from [[https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)].  Keep only full sentences, i.e. for each `SenteceId` keep only the entry with the lowest `PhraseId`.  Use first 7000 sentences as a `train set` and the remaining 1529 sentences as the `test set`. 

In [2]:
DF = pd.read_table('train.tsv')

In [3]:
#DF2 = DF[[True] + [DF['SentenceId'][i] != DF['SentenceId'][i-1] for i in range(1,156060)]]
DF = DF.groupby('SentenceId',as_index = False).first()
DF = DF.drop(['SentenceId','PhraseId'],axis=1)

In [4]:
DF.head(10)

Unnamed: 0,Phrase,Sentiment
0,A series of escapades demonstrating the adage ...,1
1,"This quiet , introspective and entertaining in...",4
2,"Even fans of Ismail Merchant 's work , I suspe...",1
3,A positively thrilling combination of ethnogra...,3
4,Aggressive self-glorification and a manipulati...,1
5,A comedy-drama of nearly epic proportions root...,4
6,"Narratively , Trouble Every Day is a plodding ...",1
7,"The Importance of Being Earnest , so thick wit...",3
8,But it does n't leave you with much .,1
9,You could hate it for the same reason .,1


In [5]:
DF.shape

(8529, 2)

In [6]:
DF['Sentiment'] = DF['Sentiment']/4
DF['Phrase'] = list(map(lambda s: re.sub(r'[^\w\s]','',s.lower()),DF['Phrase']))

In [7]:
train_df,test_df = train_test_split(DF,train_size=7000,shuffle=False)

In [8]:
train_df

Unnamed: 0,Phrase,Sentiment
0,a series of escapades demonstrating the adage ...,0.25
1,this quiet introspective and entertaining ind...,1.00
2,even fans of ismail merchant s work i suspect...,0.25
3,a positively thrilling combination of ethnogra...,0.75
4,aggressive selfglorification and a manipulativ...,0.25
...,...,...
6995,snoots will no doubt rally to its cause trott...,0.25
6996,it s better suited for the history or biograph...,0.25
6997,buries an interesting storyline,0.50
6998,this one is a few bits funnier than malle s du...,0.75


2. **[1p]** Prepare the data for logistic regression:
	Map the sentiment scores $0,1,2,3,4$ to a probability of the sentence being by setting $p(\textrm{positive}) = \textrm{sentiment}/4$.
	Build a dictionary of at most 20000 most frequent words.

In [9]:
WORDS = defaultdict(int)
for i in train_df.index:
    sample = train_df.loc[i]
    for word in sample['Phrase'].split():
        WORDS[word] += 1

In [10]:
len(WORDS)

14743

In [11]:
K = 2000
Common_words_list = list(sorted(list(WORDS.keys()),key = WORDS.get,reverse=True))[:K]
Common_words = {word:i for i,word in enumerate(Common_words_list)}

3. **[3p]** Treat each document as a bag of words. e.g. if the vocabulary is 
	```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
	```
	Then the encodings can be:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,1,0,0,1,0,0] 
	the movie is not a funny movie: [1,0,2,1,1,1,1]
	```
    Train a logistic regression model to predict the sentiment. Compute the correlation between the predicted probabilities and the sentiment. Record the most positive and negative words.
    Please note that in this model each word gets its sentiment parameter $S_w$ and the score for a sentence is 

$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}S_w$$


In [12]:
def sent_to_vect(s):
    v = np.zeros(K,dtype=np.int32)
    for w in s.split():
        i = Common_words.get(w,-1)
        if i != -1:
            v[i] += 1
    return v

In [13]:
Encoded_DF = DF.copy()

In [14]:
Encoded_DF['Phrase'] = list(map(sent_to_vect,DF['Phrase']))

In [15]:
Encoded_DF.head(10)

Unnamed: 0,Phrase,Sentiment
0,"[3, 2, 0, 4, 1, 2, 0, 0, 1, 0, 0, 1, 0, 0, 2, ...",0.25
1,"[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.0
2,"[0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
3,"[1, 3, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.75
4,"[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
5,"[1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",1.0
6,"[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
7,"[1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, ...",0.75
8,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, ...",0.25
9,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, ...",0.25


In [16]:
c_train_df,c_test_df = train_test_split(Encoded_DF,train_size=7000,shuffle=False)

In [17]:
class Logistic_Regression:
    def __init__(self, max_iter=500, solver_calls=5,lambda_ = 0.1,Theta=None, solver=sopt.fmin_l_bfgs_b, debug=False):
        self.Theta = Theta
        self.solver_calls = solver_calls
        self.max_iter = max_iter
        self.solver = solver
        self.debug = debug
        self.lambda_ = lambda_
    
    def __sigmoid(self,x):
        return 1 / (1 + np.exp(-x))    
    
    def __logreg_loss(self, Theta, X, Y):
        Theta = Theta.astype(np.float64)
        X = X.astype(np.float64)
        Y = Y.astype(np.float64)
        
        if self.debug:
            print(f"Loss calculating... ",end="")
        Z = np.dot(Theta,X.T)
        if self.debug:
            print(f" Z done... ",end="")
        SZ = self.__sigmoid(Z)
        Y_ = Y[:,np.newaxis]
        nll = -np.sum((Y_*np.log2(SZ+1e-50) + (1-Y_)*np.log2(1-SZ+1e-50)))
        nll += (self.lambda_/2) * np.sum(Theta**2)
        if self.debug:
            print(f" nll done... ",end="")
        grad = np.dot(X.T, (SZ - Y).T )
        grad = grad.reshape(Theta.shape) + self.lambda_ * Theta
        if self.debug:
            print(f" grad done... done ")
        return nll, grad
    
    def fit(self,X,y):
        Theta = self.Theta
        if Theta is None:
            Theta = np.ones(X.shape[1]+1)
        
        X_with_ones = np.hstack((np.ones((X.shape[0],1)),X))
      
        for i in tqdm(range(self.solver_calls), desc='Calculating Theta', position=0):
            Theta = self.solver(lambda th: self.__logreg_loss(th, X_with_ones, y), 
                                Theta, maxiter=self.max_iter)[0]
        self.Theta = Theta
        
    
    def predict(self,X):
        X_with_ones = np.hstack((np.ones((X.shape[0],1)),X))
        preds = np.dot(self.Theta,X_with_ones.T)
        return np.round(self.__sigmoid(preds) * 4) / 4
    
    def predict_ppb(self,X):
        X_with_ones = np.hstack((np.ones((X.shape[0],1)),X))
        preds = np.dot(self.Theta,X_with_ones.T)
        return self.__sigmoid(preds)

In [18]:
LR = Logistic_Regression(lambda_=0.5)
X_train,y_train = np.stack(np.array(c_train_df['Phrase'])),np.array(c_train_df['Sentiment'])
X_test,y_test = np.stack(np.array(c_test_df['Phrase'])),np.array(c_test_df['Sentiment'])

In [19]:
LR.fit(X_train,y_train)

Calculating Theta: 100%|█████████████████████████████████████████████████████████████████| 5/5 [03:00<00:00, 36.01s/it]


In [20]:
preds = LR.predict(X_test)
preds_unrounded = LR.predict_ppb(X_test)

In [21]:
np.mean(preds == y_test)

0.3642903858731197

In [22]:
np.mean((preds - y_test)**2)

0.07987246566383258

In [None]:
np.mean((preds_unrounded - y_test)**2)

In [23]:
Th = LR.Theta

In [24]:
most_neg_arg = np.argsort(Th)[:20]
most_neg_words = [Common_words_list[i-1] for i in most_neg_arg]
most_neg_words

['stupid',
 'worst',
 'devoid',
 'unpleasant',
 'lacking',
 'poor',
 'horrible',
 'terrible',
 'incoherent',
 'mess',
 'depressing',
 'barely',
 'suffers',
 'unfunny',
 'lazy',
 'poorly',
 'inept',
 'flat',
 'mediocre',
 'wasted']

In [25]:
most_pos_arg = np.argsort(Th)[::-1][:20]
most_pos_words = [Common_words_list[i-1] for i in most_pos_arg]
most_pos_words

['feelgood',
 'dazzling',
 'masterpiece',
 'intoxicating',
 'wonderful',
 'remarkable',
 'refreshing',
 'joyous',
 'originality',
 'imax',
 'delightfully',
 'follow',
 'charmer',
 'ahead',
 'chilling',
 'assured',
 'amazing',
 'pulls',
 'mesmerizing',
 'hilarious']

4. **[3p]** Now prepare an encoding in which negation flips the sign of the following words. For instance for our vocabulary the encodings become:
    ```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
    ```
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,-1,0,0,1,0,0]
	not not good:                   [0,1,0,0,0,0,0]
	the movie is not a funny movie: [1,0,0,1,1,-1,-1]
	```
	For best results, you will probably need to construct a list of negative words.
	
	Again train a logistic regression classifier and compare the results to the Bag of Words approach.
	
	Please note that this model still maintains a single parameter for each word, but now the sentence score is

$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}-1^{\text{count of negations preceeding }w}S_w$$


In [26]:
negation_words = {'not','no','never','no','hardly','nobody','none','scarcely','nowhere','sparsely','scantly','seldom','sporadically','somewhat','infrequently','imperceptibly','rarely','comparatively','perceptibly','gradually','detectably'}

In [27]:
def sent_to_vect_with_neg(s):
    v = np.zeros(K,dtype=np.int32)
    neg = 1
    for w in s.split():
        i = Common_words.get(w,-1)
        if i != -1:
            v[i] += neg 
        if w in negation_words:
            neg *= -1
    return v

In [28]:
Encoded_neg_DF = DF.copy()
Encoded_neg_DF['Phrase'] = list(map(sent_to_vect_with_neg,DF['Phrase']))
Encoded_neg_DF.head(10)

Unnamed: 0,Phrase,Sentiment
0,"[3, 0, 0, 0, -1, 2, 0, 0, 1, 0, 0, 1, 0, 0, 2,...",0.25
1,"[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.0
2,"[0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
3,"[1, 3, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.75
4,"[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
5,"[1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",1.0
6,"[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
7,"[1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, ...",0.75
8,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, ...",0.25
9,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, ...",0.25


In [29]:
nc_train_df,nc_test_df = train_test_split(Encoded_neg_DF,train_size=7000,shuffle=False)
neg_X_train,y_train = np.stack(np.array(nc_train_df['Phrase'])),np.array(c_train_df['Sentiment'])
neg_X_test,y_test = np.stack(np.array(nc_test_df['Phrase'])),np.array(c_test_df['Sentiment'])

In [30]:
LR_neg = Logistic_Regression(lambda_=0.5)
LR_neg.fit(neg_X_train,y_train)

Calculating Theta: 100%|█████████████████████████████████████████████████████████████████| 5/5 [02:40<00:00, 32.09s/it]


In [31]:
neg_preds = LR_neg.predict(neg_X_test)
neg_preds_unrounded = LR_neg.predict_ppb(neg_X_test)

In [32]:
np.mean(neg_preds == y_test)

0.35251798561151076

In [33]:
np.mean((neg_preds - y_test)**2)

0.08657619359058208

In [34]:
np.mean((neg_preds_unrounded - y_test)**2)

0.08132675077638069

5. **[5p]** Now also consider emphasizing words such as `very`. They can boost (multiply by a constant >1) the following words.
	Implement learning the modifying multiplier for negation and for emphasis. One way to do this is to introduce a model which has:
	- two modifiers, $N$ for negation and $E$ for emphasis
	- a sentiment score $S_w$ for each word 
And score each sentence as:
$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}N^{\text{\#negs prec. }w}E^{\text{\#emphs prec. }w}S_w$$

You will need to implement a custom logistic regression model to support it.

In [35]:
emphance_words  = {'very', 'consistently', 'constantly', 'continually', 'inadvertently', 'mutually', 'simply', 'strongly', 'actively', 'energetically', 'firmly', 'fully', 'heartily', 'heavily', 'resolutely', 'robustly', 'solidly', 'staunchly', 'steadily', 'vigorously', 'completely', 'decidedly', 'forcibly', 'indomitably', 'invincibly', 'mightily', 'securely', 'stoutly', 'sturdily'}

In [41]:
def sent_to_vect_with_emph(s,N,E):
    v = np.zeros(K,dtype=np.int32)
    emph = 1
    for w in s.split():
        i = Common_words.get(w,-1)
        if i != -1:
            v[i] += emph
        if w in negation_words:
            emph *= -N
        if w in emphance_words:
            emph *= E
    return v

In [42]:
Encoded_emph_DF = DF.copy()
Encoded_emph_DF['Phrase'] = list(map(lambda s: sent_to_vect_with_emph(s,1,3),DF['Phrase']))
Encoded_emph_DF.head(10)

Unnamed: 0,Phrase,Sentiment
0,"[3, 0, 0, 0, -1, 2, 0, 0, 1, 0, 0, 1, 0, 0, 2,...",0.25
1,"[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.0
2,"[0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
3,"[1, 3, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.75
4,"[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
5,"[1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",1.0
6,"[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
7,"[1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, ...",0.75
8,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, ...",0.25
9,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, ...",0.25


In [44]:
ec_train_df,ec_test_df = train_test_split(Encoded_emph_DF,train_size=7000,shuffle=False)
emph_X_train,y_train = np.stack(np.array(ec_train_df['Phrase'])),np.array(c_train_df['Sentiment'])
emph_X_test,y_test = np.stack(np.array(ec_test_df['Phrase'])),np.array(c_test_df['Sentiment'])

In [45]:
LR_emph = Logistic_Regression(lambda_=0.5)
LR_emph.fit(emph_X_train,y_train)

Calculating Theta: 100%|█████████████████████████████████████████████████████████████████| 5/5 [02:54<00:00, 34.82s/it]


In [49]:
emph_preds = LR_emph.predict(emph_X_test)
emph_preds_unrounded = LR_emph.predict_ppb(emph_X_test)

In [50]:
np.mean(emph_preds == y_test)

0.33747547416612167

In [51]:
np.mean((emph_preds - y_test)**2)

0.08915140614780903

In [52]:
np.mean((emph_preds_unrounded - y_test)**2)

0.08301257948263047

6. **[2pb]** Propose, implement, and evaluate an extension to the above model.

In [53]:
def sent_to_vect_with_limited_emph(s,N,E):
    v = np.zeros(K,dtype=np.int32)
    emph = 1
    for w in s.split():
        i = Common_words.get(w,-1)
        if i != -1:
            v[i] += emph
        if emph > 1:
            emph -= 1
        elif emph < 1:
            emph += 1
        if w in negation_words:
            emph *= -N
        if w in emphance_words:
            emph *= E
    return v

In [54]:
N,E = 1,3
Encoded_lemph_DF = DF.copy()
Encoded_lemph_DF['Phrase'] = list(map(lambda s: sent_to_vect_with_limited_emph(s,1,3),DF['Phrase']))
Encoded_lemph_DF.head(10)

Unnamed: 0,Phrase,Sentiment
0,"[3, 2, 0, 2, 1, 2, 0, 0, 1, 0, 0, 1, 0, 0, 2, ...",0.25
1,"[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.0
2,"[0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
3,"[1, 3, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.75
4,"[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
5,"[1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",1.0
6,"[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.25
7,"[1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, ...",0.75
8,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, ...",0.25
9,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, ...",0.25


In [55]:
lec_train_df,lec_test_df = train_test_split(Encoded_lemph_DF,train_size=7000,shuffle=False)
lemph_X_train,y_train = np.stack(np.array(lec_train_df['Phrase'])),np.array(c_train_df['Sentiment'])
lemph_X_test,y_test = np.stack(np.array(lec_test_df['Phrase'])),np.array(c_test_df['Sentiment'])

In [56]:
LR_lemph = Logistic_Regression(lambda_=0.5)
LR_lemph.fit(lemph_X_train,y_train)

Calculating Theta: 100%|█████████████████████████████████████████████████████████████████| 5/5 [03:10<00:00, 38.15s/it]


In [57]:
lemph_preds = LR_emph.predict(lemph_X_test)
lemph_preds_unrounded = LR_emph.predict_ppb(lemph_X_test)

In [58]:
np.mean(lemph_preds == y_test)

0.35251798561151076

In [59]:
np.mean((lemph_preds - y_test)**2)

0.0822841726618705

In [60]:
np.mean((lemph_preds_unrounded - y_test)**2)

0.07722449889812918

## Stemming

In [64]:
stemmer = SnowballStemmer("english")
def stem_(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

In [66]:
DF['Phrase'] = list(map(lambda s: stem_(s),DF['Phrase']))

In [67]:
train_df,test_df = train_test_split(DF,train_size=7000,shuffle=False)

In [68]:
train_df

Unnamed: 0,Phrase,Sentiment
0,a seri of escapad demonstr the adag that what ...,0.25
1,this quiet introspect and entertain independ i...,1.00
2,even fan of ismail merchant s work i suspect w...,0.25
3,a posit thrill combin of ethnographi and all t...,0.75
4,aggress selfglorif and a manipul whitewash,0.25
...,...,...
6995,snoot will no doubt ralli to it caus trot out ...,0.25
6996,it s better suit for the histori or biographi ...,0.25
6997,buri an interest storylin,0.50
6998,this one is a few bit funnier than mall s dud ...,0.75


In [69]:
WORDS = defaultdict(int)
for i in train_df.index:
    sample = train_df.loc[i]
    for word in sample['Phrase'].split():
        WORDS[word] += 1

In [70]:
len(WORDS)

10653

In [71]:
K = 2000
Common_words_list = list(sorted(list(WORDS.keys()),key = WORDS.get,reverse=True))[:K]
Common_words = {word:i for i,word in enumerate(Common_words_list)}

In [72]:
stemmed_Encoded_DF = DF.copy()
stemmed_Encoded_DF['Phrase'] = list(map(sent_to_vect,DF['Phrase']))

In [73]:
sc_train_df,sc_test_df = train_test_split(stemmed_Encoded_DF,train_size=7000,shuffle=False)
s_X_train,y_train = np.stack(np.array(sc_train_df['Phrase'])),np.array(c_train_df['Sentiment'])
s_X_test,y_test = np.stack(np.array(sc_test_df['Phrase'])),np.array(c_test_df['Sentiment'])

In [75]:
LR_s = Logistic_Regression(lambda_=0.5)
LR_s.fit(s_X_train,y_train)

Calculating Theta: 100%|█████████████████████████████████████████████████████████████████| 5/5 [02:59<00:00, 35.81s/it]


In [76]:
s_preds = LR_emph.predict(s_X_test)
s_preds_unrounded = LR_emph.predict_ppb(s_X_test)

In [77]:
np.mean(s_preds == y_test)

0.22498364944408109

In [78]:
np.mean((s_preds - y_test)**2)

0.1510791366906475

In [79]:
np.mean((s_preds_unrounded - y_test)**2)

0.14379943024515346