# Zero Shot Sentiment Analysis using LASER

This notebook shows how NLP methods can use data in another language to perform a task in Danish without any Danish training data. Imagine a task for which no annotated Danish data exists, but an open-source dataset for a similar task is available in another language. In such a case it is sometimes possible to train a model on data from the so-called 'source' language and apply it directly to the 'target' language. When the classifier has not seen any data from the target language, this is called **zero-shot transfer**.

The overall idea is to create a multilingual embedding space that represents words or sentences from several languages in a similar manner. We take a sentence in one language and map it to a vector representation, which is simply an array of numbers. If we then map a sentence with the same meaning in another language into the same space, we would ideally get two arrays of numbers that are very similar. We can also think of this in a geometric sense: the two vectors lie close to each other in the multilingual embedding space.
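
"Close to each other" is commonly measured with cosine similarity. The sketch below uses made-up 4-dimensional toy vectors (real LASER embeddings have 1024 dimensions) purely to illustrate the computation; the numbers are invented and not produced by any model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumption: invented for illustration, not real embeddings)
english_sentence = [0.9, 0.1, 0.3, 0.0]    # imagine "The movie was great"
danish_sentence = [0.8, 0.2, 0.3, 0.1]     # imagine the Danish translation
unrelated_sentence = [0.0, 0.9, 0.0, 0.8]  # imagine a sentence with another meaning

sim_translation = cosine_similarity(english_sentence, danish_sentence)
sim_unrelated = cosine_similarity(english_sentence, unrelated_sentence)

# The translation pair should score higher than the unrelated pair
print(sim_translation > sim_unrelated)
```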

The idea is now to map the annotated data in the source language into such representations. We can then use these representations as features and train a (simple) classification model. To apply the model to the target language, we simply map the input sentences into the same representation and apply the trained classifier.
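
The train-on-source, apply-on-target idea can be sketched with a toy example. Everything here is an assumption made for illustration: the 2-dimensional vectors are invented stand-ins for embeddings, and a nearest-centroid rule stands in for the classifier (the notebook itself uses logistic regression further down).

```python
# Toy "embeddings" of source-language reviews (assumption: invented numbers)
source_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
source_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

def fit_centroids(features, labels):
    """Stand-in for training: average the vectors of each class."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, y in zip(features, labels) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, vector):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

centroids = fit_centroids(source_features, source_labels)

# A target-language sentence mapped into the same space can be classified
# although no target-language example was used for training
target_vector = [0.85, 0.15]
print(predict(centroids, target_vector))
```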

In this example we will be working with **LASER sentence embeddings** from Facebook Research. Have a look at their [GitHub repository](https://github.com/facebookresearch/LASER) or read the paper for further understanding: Holger Schwenk and Matthijs Douze, [Learning Joint Multilingual Sentence Representations with Neural Machine Translation](https://aclweb.org/anthology/papers/W/W17/W17-2619/), ACL workshop on Representation Learning for NLP, 2017.
The LASER embeddings are trained using machine translation on 93 languages with a shared encoder, and the training data consists of different parallel corpora with translations into English and Spanish.

The task we will look at in this example is **Sentiment Analysis** performed on movie reviews. We will use the IMDB dataset for training and then test on a Norwegian dataset.


*NOTE:
The code in this notebook is not integrated in DaNLP, but we are working on a Danish sentiment dataset and, with it, a properly benchmarked model for sentiment analysis.*



### The steps 
**pre-steps**
1. Download the data and extract it to the right format
2. Install the libraries needed

**steps inside the Jupyter Notebook**
3. Prepare and clean the data, and embed it using LASER
4. Train a classifier
5. Try it on Danish text
6. Evaluate on target language data - in this case a Norwegian corpus


## Get the data

**The IMDB dataset**
Download the data at http://ai.stanford.edu/~amaas/data/sentiment/
and cite the paper:
Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher, [Learning Word Vectors for Sentiment Analysis](http://www.aclweb.org/anthology/P11-1015), ACL 2011

The data consists of 50K reviews from IMDB, and the labels are derived from the ratings, turned into a binary classification task. The data is split equally into a training set and a test set, and it is balanced between the two classes.

Once the data is downloaded, we combine the training part and the test part into one text file each. This can be done as follows (taken from this [blogpost](https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184)): open a terminal and navigate to the folder containing the downloaded file, named aclImdb_v1.tar.gz.

Unzip:
`gunzip -c aclImdb_v1.tar.gz | tar xopf -`

Navigate and make new folder: `cd aclImdb && mkdir movie_data`

Concatenate into a txt file: `for split in train test; do for sentiment in pos neg; do for file in $split/$sentiment/*; do cat "$file" >> movie_data/full_${split}.txt; echo >> movie_data/full_${split}.txt; done; done; done;`

Now there will be a file named full_train.txt and a file named full_test.txt in the folder aclImdb/movie_data.
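
If a Unix shell is not available, the same concatenation could be done in Python. This is a sketch rather than the procedure used for the results in this notebook: the function name `concatenate_reviews` is hypothetical, the files are assumed to be UTF-8, and, unlike `cat`, it collapses any line breaks inside a review so that each review is guaranteed to occupy exactly one line. Positive reviews are written before negative ones, as in the shell loop.

```python
from pathlib import Path

def concatenate_reviews(data_dir, split, out_file):
    """Write every review of one split to out_file, one review per line.

    Positive reviews come first and negative reviews second,
    mirroring the order of the shell loop.
    """
    data_dir = Path(data_dir)
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for sentiment in ("pos", "neg"):
            for review_file in sorted((data_dir / split / sentiment).glob("*")):
                text = review_file.read_text(encoding="utf-8")
                # collapse whitespace so that each review stays on one line
                out.write(" ".join(text.split()) + "\n")
                count += 1
    return count
```

With the folder layout from above, a call such as `concatenate_reviews('aclImdb', 'train', 'aclImdb/movie_data/full_train.txt')` would be expected to write 25,000 lines.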


**NoReC: The Norwegian Review Corpus**
The data contains reviews from different domains, including movies, originating from Norwegian news sources. Information about the origin of each review is stored as metadata along with its rating. The ratings are made comparable across domains and range from 1 to 6. The dataset is split into train, validation and test sets. Read more about the data in the paper [NoReC: The Norwegian Review Corpus](http://www.lrec-conf.org/proceedings/lrec2018/pdf/851.pdf), Erik Velldal, Lilja Øvrelid, Eivind Alexander Bergem, Cathrine Stadsnes, Samia Touileb, Fredrik Jørgensen, 2018.

Clone the GitHub repository to get the data:

`git clone https://github.com/ltgoslo/norec`

`cd norec`

`./download.sh`




## Set up the installation
You need the following Python packages to run the code in this notebook. It is recommended to install them in a virtual environment, for example using pip ([read more here](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)).

The NoReC GitHub repository includes a package for extracting the text. Go to the folder 'src' and follow the instructions to install the package norec ([see here](https://github.com/ltgoslo/norec/tree/master/src)).

The following packages can be installed through pip:

- [laserembeddings](https://pypi.org/project/laserembeddings/) - This package is a wrapper around the LASER embeddings that works directly in a Python script, but feel free to use the original source code from LASER.

- [scikit-learn](https://pypi.org/project/scikit-learn/) - This is used to fit a classification model
- NumPy
- Pandas


In [6]:
# import libraries
import re
import numpy as np
import pandas as pd
import pickle
import norec

from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from itertools import product
from sklearn.metrics import confusion_matrix

## Step 1:  Prepare the data 

First, prepare the training data from the IMDB dataset so that it is ready for the LASER embeddings. We will only work with the training part.

In [2]:
# load the review text into a list
path = 'aclImdb/movie_data/full_train.txt'  # set the path to the full_train.txt file here

# each line of the file holds one review
with open(path, 'r', encoding='utf-8') as f:
    reviews_train = [line.strip() for line in f]

# let's have a look at one arbitrary review
reviews_train[13]

"I enjoyed The Night Listener very much. It's one of the better movies of the summer.<br /><br />Robin Williams gives one of his best performances. In fact, the entire cast was very good. All played just the right notes for their characters - not too much and not too little. Sandra Oh adds a wonderful comic touch. Toni Collette is great as the Mom, and never goes over the top. Everyone is very believable.<br /><br />It's a short movie, just under an hour and a half. I noticed the general release version is nine minutes shorter than the Sundance version. I wonder if some of the more disturbing images were cut from the movie.<br /><br />The director told a story and did it in straightforward fashion, which is a refreshing change from many directors these days who seem to think their job is to impress the audience rather than tell a story and tell it well.<br /><br />Do not be sucker punched by the previews and ads. It is not a Hitchcockian thriller. See The Night Listener because you wan

In [44]:
# The reviews need to be cleaned of HTML line-break tags, hyphens and slashes
REMOVE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    # replace the matched patterns with a space
    reviews = [REMOVE.sub(" ", line) for line in reviews]
    # remove apostrophes
    reviews = [line.replace('\'', '') for line in reviews]
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)

# and let's have a look again
reviews_train_clean[13]

'I enjoyed The Night Listener very much. Its one of the better movies of the summer. Robin Williams gives one of his best performances. In fact, the entire cast was very good. All played just the right notes for their characters   not too much and not too little. Sandra Oh adds a wonderful comic touch. Toni Collette is great as the Mom, and never goes over the top. Everyone is very believable. Its a short movie, just under an hour and a half. I noticed the general release version is nine minutes shorter than the Sundance version. I wonder if some of the more disturbing images were cut from the movie. The director told a story and did it in straightforward fashion, which is a refreshing change from many directors these days who seem to think their job is to impress the audience rather than tell a story and tell it well. Do not be sucker punched by the previews and ads. It is not a Hitchcockian thriller. See The Night Listener because you want to see a good story told well. If you go exp
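
The pattern above targets the doubled `<br /><br />` breaks specifically; a lone `<br />` would only lose its slash. If that turns out to matter for your data, a more general variant could strip any tag first. The following is a sketch of that alternative, not the preprocessing used for the results in this notebook, and the names `TAG`, `PUNCT`, `SPACES` and `clean_review` are made up for this example.

```python
import re

TAG = re.compile(r"<[^>]+>")   # any HTML-like tag, e.g. <br /> or <i>
PUNCT = re.compile(r"[-/]")    # hyphens and slashes, as in the pattern above
SPACES = re.compile(r"\s+")    # runs of whitespace

def clean_review(text):
    text = TAG.sub(" ", text)        # drop tags
    text = PUNCT.sub(" ", text)      # replace hyphens and slashes with spaces
    text = text.replace("'", "")     # remove apostrophes
    return SPACES.sub(" ", text).strip()  # collapse repeated spaces

print(clean_review("It's great.<br />Well-made<br /><br />and fun."))
# prints: Its great. Well made and fun.
```

Note that, unlike the cell above, this variant also collapses repeated spaces.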

#### Embed with Laser
Now the reviews are ready to be embedded using LASER. In this first example we embed each review into a single vector. The embedding of each input has 1024 dimensions, and LASER needs the input text together with its language, which is used for tokenization. Let us start with a small example of using LASER.

In [45]:
# run an example to see it is working
laser = Laser()
examples = ['Det kunne være fedt med en sentiment klassifier på dansk!', 'Lad os prøve med en zero shot tilgang.' ]
embeddings = laser.embed_sentences(examples, lang='da')
embeddings.shape

(2, 1024)

In [25]:
# Now embed all the reviews in the training data
# note this might take a really long time
embeddings = laser.embed_sentences(reviews_train_clean, lang='en')

# check you got the expected output
embeddings.shape  # this should be (25000, 1024)

In [None]:
# Save the embeddings, so there is no need to run Step 1 again if you return to the notebook later
path_imdb_embeddings = 'aclImdb/movie_data/imdb_clean_train_laser'  # choose the path to store the embedded imdb reviews
np.save(path_imdb_embeddings, embeddings)  # np.save takes the file name first and appends '.npy'
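
Since embedding all 25,000 reviews can take a long time, it may help to process them in chunks and print progress along the way. The helper below is a sketch: `embed_in_batches` and its `embed_fn` argument are hypothetical names, where `embed_fn` is any callable that takes a list of sentences and returns one vector per sentence.

```python
def embed_in_batches(sentences, embed_fn, batch_size=1000):
    """Embed sentences chunk by chunk and report progress.

    embed_fn is a hypothetical callable taking a list of sentences and
    returning one vector per sentence.
    """
    vectors = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        vectors.extend(embed_fn(batch))
        done = min(start + batch_size, len(sentences))
        print(f"embedded {done} of {len(sentences)}")
    return vectors
```

In this notebook it could be called as `embed_in_batches(reviews_train_clean, lambda batch: laser.embed_sentences(batch, lang='en'))`, and the resulting list turned back into an array with `np.array(...)` before saving.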

## Step 2: Train a classifier

In [2]:
# load the embeddings of the reviews, which is the feature vector to train the classifier 
path_imdb_embeddings = 'aclImdb/movie_data/imdb_clean_train_laser' # choose the path where the embedded imdb reviews are stored

# load the training features
features = np.load(path_imdb_embeddings + '.npy')

# the target vector - from the way the reviews were concatenated into one file, we have that the
# first 12,500 reviews are positive (1) and the last 12,500 are negative (0)
target = [1 if i < 12500 else 0 for i in range(25000)]


In [3]:
# Randomize the order, and split into train and validation set
X_train, X_val, y_train, y_val = train_test_split(features, target, test_size = 0.10, random_state=42)

In [4]:
# train a model using logistic regression

# logistic regression - Try different values
solvers = ['lbfgs'] #['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
regulation = ['l2']
c_values = [0.1,1,10,100]
for c,l,s in product(c_values, regulation,solvers): # with product we can iterate through all possible combinations 
    lr = LogisticRegression(C=c, solver=s, penalty=l, random_state=42, max_iter=10000000)
    lr.fit(X_train, y_train)
    print ("Accuracy on val: %s with c value %s, penalty %s and solver %s " 
           % ( accuracy_score(y_val, lr.predict(X_val)),c,l,s))

Accuracy on val: 0.796 with c value 0.1, penalty l2 and solver lbfgs 
Accuracy on val: 0.8288 with c value 1, penalty l2 and solver lbfgs 
Accuracy on val: 0.8348 with c value 10, penalty l2 and solver lbfgs 
Accuracy on val: 0.8364 with c value 100, penalty l2 and solver lbfgs 
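
Note that after the loop, `lr` holds the model from the last iteration (C=100), which here also happens to have the highest validation accuracy. In general it is safer to keep track of the best candidate explicitly. A minimal sketch of that pattern, where `select_best` and `fit_and_score` are hypothetical names and the stand-in scoring function simply looks up the accuracies printed above instead of training anything:

```python
def select_best(c_values, fit_and_score):
    """Return (c, model, score) for the candidate with the highest score.

    fit_and_score is a hypothetical callable: it takes a C value and
    returns a (model, validation_score) pair.
    """
    best = None
    for c in c_values:
        model, score = fit_and_score(c)
        if best is None or score > best[2]:
            best = (c, model, score)
    return best

# Stand-in for real training: look up the validation accuracies printed above
printed_accuracy = {0.1: 0.796, 1: 0.8288, 10: 0.8348, 100: 0.8364}

def lookup(c):
    return (f"model with C={c}", printed_accuracy[c])

best_c, best_model, best_score = select_best([0.1, 1, 10, 100], lookup)
print(best_c, best_score)
```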


In [8]:
# let's examine whether the classifier is better at predicting one class than the other
# (lr is the last model fitted in the loop above, i.e. the one with C=100)
y_pred = lr.predict(X_val)
confusion_matrix(y_val, y_pred, labels=[1,0]) # 1 is positive and 0 is negative


array([[1074,  224],
       [ 185, 1017]])
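
To read the matrix: rows are the true classes and columns the predicted classes, both ordered as positive then negative because of `labels=[1,0]`. The per-class recall can be computed directly from the four numbers above, as in this small sketch:

```python
# Confusion matrix from above; rows are true classes, columns are predictions,
# both ordered as [positive, negative]
cm = [[1074, 224],
      [185, 1017]]

recall_positive = cm[0][0] / sum(cm[0])  # share of positive reviews found
recall_negative = cm[1][1] / sum(cm[1])  # share of negative reviews found
accuracy = (cm[0][0] + cm[1][1]) / (sum(cm[0]) + sum(cm[1]))

print(f"recall positive: {recall_positive:.3f}")  # 0.827
print(f"recall negative: {recall_negative:.3f}")  # 0.846
print(f"accuracy:        {accuracy:.4f}")         # 0.8364
```

On the validation set the two classes are thus predicted about equally well.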

In [7]:
# save the chosen classifier model to disk
path_model = 'aclImdb/movie_data/Lr_model.sav'
with open(path_model, 'wb') as f:
    pickle.dump(lr, f)



## Step 3:  Apply on Danish Examples

In [8]:
# if necessary, load the model that was built above from disk
path_model = 'aclImdb/movie_data/Lr_model.sav'
with open(path_model, 'rb') as f:
    lr = pickle.load(f)

In [9]:
# create a function to judge the sentiment of Danish sentences
def get_sentiment(sentence, classifier, token_lang='da'):
    
    # embed the sentence
    laser = Laser()
    input_features = laser.embed_sentences([sentence], lang=token_lang)
    
    # apply the classifier
    pred = classifier.predict(input_features)
    
    class_names = {0: 'negative', 1: 'positive'}
    
    # pred is an array with a single element
    return class_names[int(pred[0])]
    

In [10]:
# let's try with some examples
# English: "It was not good"
get_sentiment('Det var ikke godt', lr)


'negative'

In [11]:
# let's try with some examples
# English: "It was good, wasn't it?"
get_sentiment('Det var godt, ikke?', lr)

'positive'

In [17]:
# let's try with some movie examples
# English: "The movie is like staring into a white wall for two hours"
get_sentiment('Filmen svarer til at kigge ind i en hvid væg i to timer', lr)

'negative'

In [13]:
# let's try with some movie examples with a mix of languages
# English: "The musical film Les Mesirable is my favourite because it is just fabulous"
get_sentiment('Musikalfilmen Les Mesirable er min ynglings fordi den er just fabulous', lr)

'positive'

In [14]:
# let's try to give it a harder example, and see it fail
# English: "I watched the movie with my lovely friends, but that was also the only good thing to say about that movie"
get_sentiment('Jeg så filmen sammen med mine dejlige veninder, men det var også det eneste gode at sige om den film', lr)

'positive'

### Comments
Note that the transfer performance depends not only on the languages we transfer from and to, but also on the domain used to train the classifier. A shift in domain - in this case to anything other than movie reviews from IMDB - would also affect the classifier's performance. Likewise, applying the classifier to data with clear polarity would give better results than applying it to data with more nuanced sentiment.

The next step is to evaluate the model on a Norwegian corpus.

## Step 4:  Evaluate on Norwegian Data

Pre-steps: follow the instructions at the top of the notebook to clone the NoReC repository, download the data, and install the norec package.

Now we will first prepare the data. We will use the subset defined as 'train' to test our model, since this is the largest subset. To resemble the IMDB dataset, we make the task binary by dropping reviews with ratings in the middle (3 and 4), combining ratings 5 and 6 into the positive class and ratings 1 and 2 into the negative class. Furthermore, we keep only the 'movie' reviews to stay in a domain similar to the IMDB training data. Lastly, we balance the data to include an equal number of positive and negative reviews.

Then we embed these reviews with LASER.

And then we test our trained classifier.
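
The label mapping and the balancing described above can be sketched in plain Python before looking at the pandas version. The ratings below are toy values, not taken from NoReC, and the `seed` parameter is an assumption added to make the sampling repeatable; the function names are made up for this sketch.

```python
import random

def rating_to_label(rating):
    """Map a 1-6 rating to 1 (positive), 0 (negative) or None (dropped)."""
    if rating >= 5:
        return 1
    if rating <= 2:
        return 0
    return None  # ratings 3 and 4 are left out

def balance(examples, seed=0):
    """Randomly downsample the larger class so both classes have equal size.

    examples is a list of (text, label) pairs.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)

ratings = [6, 5, 5, 4, 3, 2, 1, 6]  # toy ratings, not taken from NoReC
labelled = [(f"review {i}", rating_to_label(r)) for i, r in enumerate(ratings)]
labelled = [e for e in labelled if e[1] is not None]  # drop the middle ratings
balanced = balance(labelled)

print(len(labelled), len(balanced))  # 6 reviews remain, 4 after balancing
```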

In [11]:
# function to prepare the data using the NOREC package


def prepare_data(subset, path):
    # subset is either: 'train', 'dev' or 'test'
    # path to the html file "norec/data/html.tar.gz"
    # load the data
    subset_data = norec.load(path, subset=subset)
    
    # Get the reviews, ratings and the subcategory
    subset_list = [(norec.html_to_text(html), metadata['rating'], metadata['source-category'])
                 for html, metadata in subset_data]

    df  = pd.DataFrame(subset_list, columns=['reviews', 'score', 'genre'])
    
    # Keep only the movie reviews (copy so that later assignments do not act on a view)
    df = df[df['genre'] == 'film'].copy()

    # drop reviews with score 3 or 4
    df = df[df.score != 3]
    df = df[df.score != 4]

    # map scores 1-2 to negative (0) and scores 5-6 to positive (1)
    df.loc[df['score'] < 3, 'score'] = 0
    df.loc[df['score'] > 3, 'score'] = 1

    # clean reviews: replace line breaks with spaces
    df['reviews']=df['reviews'].apply(lambda x: x.replace("\n", ' '))

    # make a balanced set by randomly dropping reviews from the larger class
    count_labels = df['score'].value_counts()

    if count_labels[1] > count_labels[0]:
        large_class = 'score == 1' 
        drop_fraction = 1-count_labels[0]/count_labels[1]
    else: 
        drop_fraction = 1-count_labels[1]/count_labels[0]
        large_class = 'score == 0' 
    df=df.drop(df.query(large_class).sample(frac=drop_fraction).index)
    
    return df['reviews'], df['score']

In [12]:
# prepare the data
path_norec_data = "norec/data/html.tar.gz" # remember to set the right path
reviews_norwe, y_norwe = prepare_data('train',path_norec_data)

# print an example of a review
print(reviews_norwe.iloc[1]) # note how the text describes the plot of the movie more than it gives an opinion of it

# print the number of reviews
print(reviews_norwe.shape)

Thumbsucker  Han er flink men har ikke venner. Han er med i et team som heter delta team. Moren (Tilda Swinton) hans er sykepleier hun får jobben på et nytt sykehus der hun blir forelsket i en pasient som er  Kjendis. Faren(Vincent d’Onofrio) jobbet i en sportsbutikk han hadde  alltid tapt i løp mot tannlegen til (Keanu Reeves) Justin prøver å slut-te å suge på tommeltotten hans. Han klarer det til slutt.  anmeldelse av patrick rambjør
(2224,)


In [None]:
# embed with Laser
# this can take a long time
laser = Laser()
X_norwe = laser.embed_sentences(reviews_norwe.tolist(), lang='no')  # pass a plain list of strings

In [15]:
# Now use the model 'lr' to predict and calculate the accuracy
y_pred = lr.predict(X_norwe)
print ("Accuracy on Norwegian data: %s " 
       % ( accuracy_score(y_norwe, y_pred)))

# Let's look at the confusion matrix
confusion_matrix(y_norwe, y_pred, labels=[1,0])   # 1 is positive and 0 is negative
# Note that it looks like the classifier has a harder time predicting the negative class

Accuracy on Norwegian data: 0.7630395683453237 


array([[926, 186],
       [341, 771]])
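
The observation about the negative class can be made concrete from the matrix above. The Norwegian set is balanced (1112 reviews per class), so a classifier without a preference would predict 'positive' about half of the time. A small sketch using only the printed numbers:

```python
# Confusion matrix on the Norwegian reviews
# (rows: true class, columns: predicted class; order [positive, negative])
cm = [[926, 186],
      [341, 771]]

total = sum(cm[0]) + sum(cm[1])
predicted_positive = cm[0][0] + cm[1][0]  # first column: everything predicted positive

share_predicted_positive = predicted_positive / total
recall_positive = cm[0][0] / sum(cm[0])
recall_negative = cm[1][1] / sum(cm[1])

print(f"share predicted positive: {share_predicted_positive:.3f}")  # 0.570
print(f"recall positive:          {recall_positive:.3f}")           # 0.833
print(f"recall negative:          {recall_negative:.3f}")           # 0.693
```

The classifier leans towards the positive class on the Norwegian data: roughly 57% of the predictions are positive, and about 31% of the negative reviews are misclassified.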