# Assignment 1

## Assignment text
1. **[1p]** Download the data from the Kaggle competition on sentiment prediction: [https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data). Keep only full sentences, i.e. for each `SentenceId` keep only the entry with the lowest `PhraseId`. Use the first 7000 sentences as the `train set` and the remaining 1529 sentences as the `test set`.

2. **[1p]** Prepare the data for logistic regression:
	Map the sentiment scores $0,1,2,3,4$ to the probability of the sentence being positive by setting $p(\textrm{positive}) = \textrm{sentiment}/4$.
	Build a dictionary of at most 20000 most frequent words.

3. **[3p]** Treat each document as a bag of words. For example, if the vocabulary is
	```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
	```
	Then the encodings can be:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,1,0,0,1,0,0] 
	the movie is not a funny movie: [1,0,2,1,1,1,1]
	```
    Train a logistic regression model to predict the sentiment. Compute the correlation between the predicted probabilities and the sentiment. Record the most positive and negative words.
    Please note that in this model each word gets its own sentiment parameter $S_w$, and the score for a sentence is
    $$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}S_w$$

4. **[3p]** Now prepare an encoding in which negation flips the sign of the words that follow it. For instance, for our vocabulary the encodings become:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,-1,0,0,1,0,0]
	not not good:                   [0,1,0,0,0,0,0]
	the movie is not a funny movie: [1,0,0,1,1,-1,-1]
	```
	For best results, you will probably need to construct a list of negation words (such as `not`).
	
	Again train a logistic regression classifier and compare the results to the Bag of Words approach.
	
	Please note that this model still maintains a single parameter for each word, but now the sentence score is
	$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}(-1)^{\text{count of negations preceding }w}S_w$$

5. **[5p]** Now also consider emphasizing words such as `very`. They can boost (multiply by a constant >1) the following words.
	Implement learning the modifying multiplier for negation and for emphasis. One way to do this is to introduce a model which has:
	- two modifiers, $N$ for negation and $E$ for emphasis
	- a sentiment score $S_w$ for each word 
and scores each sentence as:
$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}N^{\text{\#negs prec. }w}E^{\text{\#emphs prec. }w}S_w$$

You will need to implement a custom logistic regression model to support it.

6. **[2pb]** Propose, implement, and evaluate an extension to the above model.
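
The encodings in items 3 to 5 differ only in the multiplier each word receives, so they can be sketched with one function. This is a minimal illustration, not the reference solution: the helper `encode`, the one-word negation list `("not",)` and the one-word emphasis list `("very",)` are assumptions made for the example, as is the choice that a modifier affects every later word in the sentence (which is how the score formulas above count preceding modifiers). The sentence score is then the dot product of the returned vector with the per-word sentiments $S_w$.

```python
def encode(sentence, vocab, negations=("not",), emphasizers=("very",), N=-1.0, E=1.0):
    """Encode a sentence as a vector of summed per-word multipliers.

    Each word adds N**(#negations before it) * E**(#emphasizers before it)
    at its vocabulary index. N=1, E=1 gives the plain bag of words;
    N=-1, E=1 gives the sign-flipping encoding of item 4.
    """
    vector = [0.0] * len(vocab)
    n_neg = n_emph = 0
    for word in sentence.split():
        if word in vocab:
            vector[vocab[word]] += (N ** n_neg) * (E ** n_emph)
        # Assumption: a modifier affects every later word in the sentence
        if word in negations:
            n_neg += 1
        elif word in emphasizers:
            n_emph += 1
    return vector


# Vocabulary from the assignment text: the=0, good=1, movie=2, is=3, not=4, a=5, funny=6
vocab = {w: i for i, w in enumerate("the good movie is not a funny".split())}

bow = encode("the movie is not a funny movie", vocab, N=1.0)   # item 3
flipped = encode("the movie is not a funny movie", vocab)      # item 4
double_negation = encode("not not good", vocab)                # item 4
boosted = encode("very good", vocab, E=2.0)                    # item 5, fixed E
```

In item 5 the values of `N` and `E` are learned rather than fixed, so this function only covers the forward computation of the features for given modifier values.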


In [1]:
# imports
import numpy as np
import pandas as pd
from collections import defaultdict 
import re

from sklearn.linear_model import LogisticRegression

## Task 1.

In [2]:
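def regex(text):
    # Lowercase the text and drop every character that is neither
    # a word character nor whitespace (i.e. punctuation)
    text = re.sub(r'[^\w\s]', '', text.lower())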
    return text
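
The effect of this cleaning step on tokens matters for the later negation task. The sketch below repeats the same substitution under the hypothetical name `strip_punctuation` and applies it to a made-up sentence, assuming the data splits contractions into tokens such as `n't` and `'s`.

```python
import re


def strip_punctuation(text):
    # Same transformation as `regex`: lowercase, then remove every
    # character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', text.lower())


# Hypothetical sentence written in the pre-tokenized style of the data
cleaned = strip_punctuation("It is n't funny , Merchant 's work !")
tokens = cleaned.split()
```

Here `tokens` is `['it', 'is', 'nt', 'funny', 'merchant', 's', 'work']`: the apostrophes disappear, so a negation list built on the cleaned text would have to contain `nt` rather than `n't`.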

In [3]:
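df = pd.read_csv('train.tsv', sep='\t')
# test_df = pd.read_csv('test.tsv', sep='\t')

# Keep one row per sentence: the full sentence is the entry with the lowest PhraseId.
# Sorting first makes 'first' pick that entry regardless of the row order in the file.
df = (df.sort_values('PhraseId')
        .groupby('SentenceId', as_index=False)
        .agg({'PhraseId': 'min', 'Phrase': 'first', 'Sentiment': 'first'}))

# Map the sentiment scores 0..4 to p(positive) = sentiment / 4
df['Sentiment'] = df['Sentiment'] / 4
df = df.drop(['PhraseId', 'SentenceId'], axis=1)
df.Phrase = df.Phrase.apply(regex)
df.shape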

(8529, 2)
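
Selecting the full sentence comes down to picking, for each `SentenceId`, the row with the lowest `PhraseId`. One order-independent way to express this is `groupby(...).idxmin()`, sketched here on hypothetical toy data whose rows are deliberately shuffled.

```python
import pandas as pd

# Toy phrases in shuffled order (hypothetical data, not from the competition)
toy = pd.DataFrame({
    'PhraseId':   [3, 1, 2, 5, 4],
    'SentenceId': [1, 1, 1, 2, 2],
    'Phrase':     ['good', 'a good movie', 'a good', 'bad', 'a bad movie'],
    'Sentiment':  [3, 4, 3, 1, 0],
})

# Index label of the row with the smallest PhraseId within each sentence
full_idx = toy.groupby('SentenceId')['PhraseId'].idxmin()
sentences = toy.loc[full_idx].reset_index(drop=True)
```

`sentences` then holds one row per sentence, here `a good movie` and `a bad movie`, together with their own sentiment labels.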

In [4]:
df.head()

| | Phrase | Sentiment |
|---|---|---|
| 0 | a series of escapades demonstrating the adage ... | 0.25 |
| 1 | this quiet introspective and entertaining ind... | 1.0 |
| 2 | even fans of ismail merchant s work i suspect... | 0.25 |
| 3 | a positively thrilling combination of ethnogra... | 0.75 |
| 4 | aggressive selfglorification and a manipulativ... | 0.25 |


In [5]:
# target value counts
df['Sentiment'].value_counts()

0.75    2321
0.25    2200
0.50    1655
1.00    1281
0.00    1072
Name: Sentiment, dtype: int64

In [6]:
# test and train split
train_df = df.iloc[: 7000]
test_df = df.iloc[7000: ]

print(train_df.shape, test_df.shape)

(7000, 2) (1529, 2)


## Task 2 & 3

In [7]:
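class MyCountVectorizer:
    """Bag-of-words vectorizer over the first column of a DataFrame.

    min_df / max_df bound the total number of occurrences of a word in the
    fitting data (not its document frequency, as in scikit-learn).
    """
    def __init__(self, min_df=-1, max_df=1e18, binary=False):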
        self.min_df = min_df
        self.max_df = max_df
        self.binary = binary
    
    def fit(self, df):
        words_cnt = defaultdict(int)
        col = df.columns[0]
        
        for i in range(len(df)):
            text = df.iloc[i][col]
            for word in text.split():
                words_cnt[word] += 1
                
        all_words = []
        for word, cnt in words_cnt.items():
            if self.min_df <= cnt <= self.max_df:
                all_words.append(word)
                
        self.all_words_ids = {w:i for i,w in enumerate(all_words)}
        self.width = len(all_words)
        
    
    def transform(self, df):
        col = df.columns[0]
        count_matrix = np.zeros([len(df), self.width], dtype=np.int32)
        
        for i in range(len(df)):
            text = df.iloc[i][col]
            words_cnt = defaultdict(int)
            
            for word in text.split():
                words_cnt[word] += 1
            
            for word, cnt in words_cnt.items():
                if word in self.all_words_ids:
                    pos = self.all_words_ids[word]
                    if self.binary:
                        count_matrix[i][pos] = 1
                    else:
                        count_matrix[i][pos] = cnt
                    
        return count_matrix
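
The class above keeps every word within the count bounds, whereas item 2 of the assignment asks for a dictionary of at most 20000 most frequent words. A minimal sketch of such a cap, using the hypothetical helper `build_vocabulary` and toy sentences, could rely on `collections.Counter`:

```python
from collections import Counter


def build_vocabulary(texts, max_words=20000):
    """Map the max_words most frequent words to consecutive indices."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common returns (word, count) pairs sorted by descending count
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}


# Hypothetical toy corpus with a tiny cap to make the cut-off visible
toy_texts = ["the movie is good", "the movie is not good", "the plot"]
vocab = build_vocabulary(toy_texts, max_words=3)
```

With the cap set to 3, the most frequent word `the` gets index 0 and the words seen only once (`not`, `plot`) are dropped. The resulting mapping has the same shape as `all_words_ids` in `MyCountVectorizer`.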

In [8]:
%%time

cv = MyCountVectorizer()
cv.fit(train_df)

X_train = cv.transform(train_df) 
X_test = cv.transform(test_df)

Wall time: 2.13 s


In [9]:
# Multinomial logistic regression over the five sentiment classes
# (Sentiment * 4 restores the original labels 0..4)

LR = LogisticRegression(multi_class='multinomial', solver='lbfgs')
LR.fit(X_train, train_df.Sentiment * 4)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
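
The model above treats the five sentiment values as separate classes. The assignment text instead describes a single score per sentence that is fitted to the soft target $p(\textrm{positive}) = \textrm{sentiment}/4$. A minimal NumPy sketch of that formulation follows. Plain batch gradient descent, the learning rate, the step count and the helper names are assumptions made for illustration, and the data is a hypothetical toy example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_soft_logistic(X, p, lr=0.5, n_steps=500, l2=0.0):
    """Fit weights w and bias b by minimizing cross-entropy against soft targets p."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        pred = sigmoid(X @ w + b)
        error = pred - p  # gradient of the cross-entropy w.r.t. the logit
        w -= lr * (X.T @ error / n_samples + l2 * w)
        b -= lr * error.mean()
    return w, b


# Toy bag-of-words rows over the vocabulary [good, bad] (hypothetical data)
X_toy = np.array([[1, 0], [0, 1], [2, 0], [0, 2]], dtype=float)
p_toy = np.array([0.75, 0.25, 1.0, 0.0])

w, b = fit_soft_logistic(X_toy, p_toy)
pred_toy = sigmoid(X_toy @ w + b)
```

In this formulation each entry of `w` plays the role of the word sentiment $S_w$, so the most positive and most negative words can be read off by sorting `w`.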

In [10]:
pd.Series(LR.predict(X_test) == test_df.Sentiment * 4).value_counts()

False    926
True     603
Name: Sentiment, dtype: int64
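
The count above measures exact class matches, while item 3 asks for the correlation between predicted probabilities and the sentiment. One possible way to get there from a multiclass model is to collapse the class probabilities into an expected sentiment and compute the Pearson correlation. In this sketch the helper names are hypothetical and the arrays are toy stand-ins for `LR.predict_proba(X_test)`, `LR.classes_` and `test_df.Sentiment`.

```python
import numpy as np


def expected_positive_probability(proba, classes):
    """Collapse per-class probabilities into a single p(positive) in [0, 1]."""
    # Expected sentiment class, rescaled by the maximum score (4)
    return proba @ np.asarray(classes, dtype=float) / 4.0


def pearson_correlation(predicted, target):
    return float(np.corrcoef(predicted, target)[0, 1])


# Toy stand-ins for the classifier output and the true soft labels
toy_proba = np.array([
    [0.7, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 0.4, 0.2, 0.1],
])
toy_classes = [0, 1, 2, 3, 4]
toy_target = np.array([0.0, 1.0, 0.5])

p_positive = expected_positive_probability(toy_proba, toy_classes)
corr = pearson_correlation(p_positive, toy_target)
```

With the notebook's objects the call would be `expected_positive_probability(LR.predict_proba(X_test), LR.classes_)`, compared against `test_df.Sentiment`.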

In [26]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(train_df.Phrase)

# Pass the text column, not the whole DataFrame: iterating over a DataFrame
# yields its column names, so transform(train_df) produced 2 rows instead of 7000
# ("ValueError: Found input variables with inconsistent numbers of samples: [2, 7000]").
X_train = cv.transform(train_df.Phrase)
X_test = cv.transform(test_df.Phrase)


param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, train_df.Sentiment * 4)

# print("Best cross-validation score: {:.2f}".format(grid.best_score_))
# print("Best parameters: ", grid.best_params_)
# print("Best estimator: ", grid.best_estimator_)

In [15]:
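import matplotlib.pyplot as plt
import mglearn

# `cv` must be the fitted scikit-learn CountVectorizer here. The custom
# MyCountVectorizer has no such method, which caused
# "AttributeError: 'MyCountVectorizer' object has no attribute 'get_feature_names'".
# Newer scikit-learn releases name this method get_feature_names_out().
feature_names = cv.get_feature_names()

# coef_ holds one row per sentiment class, while visualize_coefficients expects
# a single coefficient vector. Plotting the row of the most positive class is
# an assumption about what is wanted here.
mglearn.tools.visualize_coefficients(grid.best_estimator_.coef_[-1], feature_names, n_top_features=25)
plt.show()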