# Sentiment Analysis

Classifying extracts from newsgroups is all well and good, but it's not really mind-blowing. In addition, it requires a lot of labelled data, which is rarely available in real-life applications.

For the last part of this tutorial, we will try to do something fancier. We will use Word Embeddings to learn a very simple Sentiment Analysis classifier. The goal is to be able to score how positive or negative an extract is on a one-dimensional scale.

In order to do that, we will train a classifier on a list of more than 6000 words, split between 'positive' and 'negative' words. Of course, sentiment associated with a word is very context-dependent, but we will see what we can do in this simplified case.

In [None]:
import pandas as pd
import numpy as np

In [None]:
sentiment_lexicon = pd.read_csv('data/sentiment_lexicon.csv', index_col=0)
sentiment_lexicon

Labels are 1 for positive words and 0 for negative words. Columns 0 to 299 contain the word vector associated with each word.

## Train a sentiment analysis classifier

First, we will split our sentiment dataset into a training set and a validation set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(sentiment_lexicon.drop(['label'], axis=1).values,
                                                  sentiment_lexicon['label'],
                                                  test_size=.25,
                                                  stratify=sentiment_lexicon['label'])

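Then, we train our classifier on the training set. Since we have only two classes, this is a binary classification problem, which is much easier to learn. `LogisticRegressionCV` selects the regularization strength `C` by cross-validation.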

In [None]:
from sklearn.linear_model import LogisticRegressionCV

lr_classifier = LogisticRegressionCV()
lr_classifier.fit(X_train, y_train)
print('Optimal C value', lr_classifier.C_[0])
print('train accuracy', 
     lr_classifier.score(X_train, y_train),
     '\nvalidation accuracy',
     lr_classifier.score(X_val, y_val))

We have a pretty high accuracy. Let's have a look at how the classifier generalizes to unseen words.

In [None]:
def vec_to_sentiment(vec):
    # predict_log_proba gives the log probability for each class
    predictions = lr_classifier.predict_log_proba(vec)

    # To see an overall positive vs. negative classification in one number,
    # we take the log probability of positive sentiment minus the log
    # probability of negative sentiment.
    return predictions[:, 1] - predictions[:, 0]
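
For a two-class logistic regression, the difference computed by `vec_to_sentiment` is the log-odds of the positive class: a score of 0 means the classifier is undecided, and the sign gives the predicted class. The following is a minimal sketch of that identity in plain Python. The `sigmoid` and `log_odds` helpers and the logit values are illustrative assumptions, not part of the notebook.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p_positive):
    """log P(positive) - log P(negative) for a two-class model."""
    return math.log(p_positive) - math.log(1.0 - p_positive)

# Hypothetical logits, chosen only for illustration.
for z in (-2.0, 0.0, 1.5):
    p = sigmoid(z)
    # The log-odds recovers the logit: log(p / (1 - p)) == z.
    print(f"logit={z:+.1f}  P(positive)={p:.3f}  score={log_odds(p):+.3f}")
```

With scikit-learn's default settings for a binary problem, the score should therefore match what `lr_classifier.decision_function` returns, though that is worth checking on your installed version.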

In [None]:
# Put the true label and the predicted sentiment score side by side
validation_words = y_val.to_frame()
validation_words['sentiment'] = vec_to_sentiment(X_val)
validation_words.head(20)

The classifier seems to generalize all right.

Now we want to apply this classifier to whole sentences, by averaging the sentiment values of the word embeddings in the sentence.

In [None]:
from gensim.models import KeyedVectors
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords

wordVectors = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

stopwords_en = set(stopwords.words('english'))

def clean_text(string):
    tokens = wordpunct_tokenize(string)
    return [token.lower() for token in tokens if (token.isalpha() and token.lower() not in stopwords_en)]

In [None]:
# %load solutions/words_to_sentiment.py

def words_to_sentiment(sentence):
    # Given a string representing a sentence, we want to split it into clean tokens with the clean_text function
    # Then we apply the vec_to_sentiment function to each token (if present in the vocabulary) to get a sentiment value
    # We return the average sentiment value
    ...
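
One possible shape for this function is sketched below. It is not the contents of `solutions/words_to_sentiment.py`. To stay self-contained, it takes the vocabulary, the scoring function and the tokenizer as parameters, and the `toy_vectors`, `toy_score` and `toy_tokenize` stand-ins are hypothetical values made up for illustration.

```python
import re

def words_to_sentiment_sketch(sentence, vectors, score_fn, tokenize):
    """Average the per-word sentiment scores of the in-vocabulary tokens."""
    # Keep only the tokens that have a word vector
    tokens = [t for t in tokenize(sentence) if t in vectors]
    if not tokens:
        # No known word, so no score (returning None is a design choice)
        return None
    # score_fn receives one vector per token and returns one score per token
    scores = score_fn([vectors[t] for t in tokens])
    return sum(scores) / len(scores)

# Toy stand-ins (assumptions): 2-dimensional vectors instead of 300
toy_vectors = {"happy": [1.0, 0.0], "sad": [-1.0, 0.0], "day": [0.0, 0.2]}

def toy_score(vecs):
    # Hypothetical scorer that reads sentiment off the first coordinate
    return [2.0 * v[0] for v in vecs]

def toy_tokenize(text):
    # Crude tokenizer standing in for clean_text
    return re.findall(r"[a-z]+", text.lower())

print(words_to_sentiment_sketch("A happy day!", toy_vectors, toy_score, toy_tokenize))
```

In the notebook, the stand-ins would presumably be replaced by `wordVectors`, `vec_to_sentiment` and `clean_text`. Returning `None` for a sentence with no known word is only one option; a neutral score of 0 would be another.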


We can now check whether it works on whole sentences.

In [None]:
words_to_sentiment('I am happy and joyous!')

In [None]:
words_to_sentiment('I feel okay.')

In [None]:
words_to_sentiment('This is a sad day...')

Wow, that's great! (Sentiment score 4.78) 

Can we try on more sentences?

In [None]:
words_to_sentiment("I want to visit France")

In [None]:
words_to_sentiment("I want to visit Japan")

In [None]:
words_to_sentiment("I want to visit Congo")

In [None]:
words_to_sentiment("I want to visit Iraq")

What happened here?

In [None]:
words_to_sentiment("I like Italian food")

In [None]:
words_to_sentiment("I like Mexican food")

These sets of sentences should have similar sentiment scores, because they express the same objective idea. 

But since word embeddings are trained on real-life corpora that contain prejudice and bias, they learn these biases too.
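
Because the sentence score is an average, two sentences that differ by a single token can only differ through that token's own score. The sketch below illustrates this with made-up numbers: the `toy_scores` values and the `place_a` / `place_b` tokens are placeholders, not output from the classifier, and it assumes three tokens remain after stop-word removal.

```python
def mean_score(tokens, scores):
    """Average of per-token scores, as in the sentence-level scorer."""
    return sum(scores[t] for t in tokens) / len(tokens)

# Hypothetical per-token scores; placeholder tokens, not real model output
toy_scores = {"want": 0.5, "visit": 1.0, "place_a": 1.5, "place_b": -1.5}

sentence_a = ["want", "visit", "place_a"]
sentence_b = ["want", "visit", "place_b"]

gap = mean_score(sentence_a, toy_scores) - mean_score(sentence_b, toy_scores)
# The shared tokens cancel out, so the gap between the two sentences is
# the gap between the swapped tokens divided by the sentence length.
expected = (toy_scores["place_a"] - toy_scores["place_b"]) / len(sentence_a)
print(gap, expected)
```

Scoring the differing words on their own is therefore a quick way to see where a gap between two otherwise identical sentences comes from.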

The inspiration for this third part is a tutorial that delves more deeply into this subject, available [here](http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/). 

Some papers on the subject:
 - [*Semantics derived automatically from language corpora contain human-like biases*](https://researchportal.bath.ac.uk/en/publications/semantics-derived-automatically-from-language-corpora-necessarily),  Aylin Caliskan, Joanna J Bryson, Arvind Narayanan
 - [*Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings*](https://arxiv.org/abs/1607.06520), Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai



## Conclusion

Word Embeddings are a powerful way to encode text, even though they require quite a lot of memory to load.

But with great power comes great responsibility. Word Embeddings are often biased, and you need to consider the influence of these biases on your application.