In [1]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
import os
import sys

sys.path.append(os.path.abspath(os.pardir))

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Models
from tdparse.models.target import TargetInd
from tdparse.models.target import TargetDepC
from tdparse.models.target import TargetDep
from tdparse.models.target import TargetDepSent
# Word Vector methods
from tdparse.word_vectors import GensimVectors
from tdparse.word_vectors import PreTrained
from tdparse.helper import read_config, full_path
# Sentiment lexicons
from tdparse import lexicons
# Get the data
from tdparse.parsers import dong, semeval_14

from tdparse.tokenisers import stanford

# Target dependent models
This notebook shows how to use the target dependent models and compares the results of our implementation with those in the original [paper](https://www.ijcai.org/Proceedings/15/Papers/194.pdf). We also show the problems we encountered when attempting to reproduce the methods from the description in the paper, and the effects of the paper not stating certain processing steps.

The paper had four different models:
1. **Target-Ind** -- Uses only the full Tweet as context.
2. **Target-Dep-** -- Uses the left and right context of the target word as well as the target word as context.
3. **Target-Dep** -- Uses all of the above contexts.
4. **Target-Dep+** -- Uses all of the above as well as two additional left and right contexts, which ignore all words in the contexts unless they are part of the given sentiment lexicon (or any other lexicon supplied).
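
To make the contexts concrete, here is a minimal sketch of how a tokenised Tweet could be split around a target and how the lexicon-filtered contexts of **Target-Dep+** could be derived. The helpers `split_contexts` and `filter_by_lexicon`, the decision to exclude the target from the left and right contexts, and the toy lexicon are all assumptions made for illustration; this is not the tdparse implementation.

```python
def split_contexts(tokens, target_start, target_end):
    """Split tokens into left, target and right contexts.

    target_start is inclusive and target_end is exclusive. Assumption:
    the left and right contexts do not include the target itself.
    """
    left = tokens[:target_start]
    target = tokens[target_start:target_end]
    right = tokens[target_end:]
    return left, target, right


def filter_by_lexicon(context, lexicon_words):
    """Keep only the context words that appear in the sentiment lexicon."""
    return [word for word in context if word.lower() in lexicon_words]


tokens = 'i love the new iphone but the battery is awful'.split()
left, target, right = split_contexts(tokens, 4, 5)

# Toy lexicon, purely illustrative.
toy_lexicon = {'love', 'awful', 'great'}
left_sentiment = filter_by_lexicon(left, toy_lexicon)
right_sentiment = filter_by_lexicon(right, toy_lexicon)

print(left, target, right)
print(left_sentiment, right_sentiment)
```

Under this reading, **Target-Ind** would use only `tokens`, **Target-Dep-** the three split contexts, **Target-Dep** all four, and **Target-Dep+** additionally the two filtered contexts.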

The above models correspond to the following classes in our implementation:
1. [TargetInd](../tdparse/models/target.py)
2. [TargetDepC](../tdparse/models/target.py)
3. [TargetDep](../tdparse/models/target.py)
4. [TargetDepSent](../tdparse/models/target.py)

All of the results shown below are from 5-fold cross validation over the training data of [Dong et al.](https://aclanthology.coli.uni-saarland.de/papers/P14-2009/p14-2009) or, where appropriate, on the test data as reported in the paper.

In [2]:
# Load the training data
dong_train = dong(full_path(read_config('dong_twit_train_data')))
train_data = dong_train.data()
train_y = dong_train.sentiment_data()

# Get word vectors
w2v_path = full_path(read_config('word2vec_files')['vo_zhang'])
w2v = GensimVectors(w2v_path, None, model='word2vec', name='w2v')
sswe_path = full_path(read_config('sswe_files')['vo_zhang'])
sswe = PreTrained(sswe_path, name='sswe')

# Comparing the three base models

In the paper the base models (target-ind, target-dep- and target-dep) using the word2vec word vectors were compared after the best C-values had been found. We therefore use the C-values stated in the paper to compare our results to theirs. **random_state** is used here to ensure that the results are reproducible: it fixes the random seed, so any shuffling of the data is identical on every run.
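
The effect of fixing a seed can be sketched in plain Python. The helper `k_fold_indices` below is hypothetical and independent of both tdparse and scikit-learn; it only illustrates that a seeded shuffle yields identical 5-fold splits on every run, and makes no claim about where the seed is applied inside the models.

```python
import random


def k_fold_indices(n_samples, n_folds, seed):
    """Return a list of (train_indices, test_indices) pairs.

    The indices are shuffled once with a seeded generator, so the same
    seed always produces the same folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds folds, round-robin.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test_fold))
    return splits


first_run = k_fold_indices(n_samples=10, n_folds=5, seed=42)
second_run = k_fold_indices(n_samples=10, n_folds=5, seed=42)
print(first_run == second_run)
```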

In [3]:
# Instances of the models
target_ind = TargetInd()
target_depc = TargetDepC()
target_dep = TargetDep()
# Getting the grid parameters for each model
grid_params_ind = target_ind.get_cv_params(word_vectors=[[w2v]], random_state=42)
grid_params_depc = target_depc.get_cv_params(word_vectors=[[w2v]], random_state=42)
grid_params_dep = target_dep.get_cv_params(word_vectors=[[w2v]], random_state=42)
# Running the grid search over 5 folds.
results_ind = target_ind.grid_search(train_data, train_y, params=grid_params_ind, cv=5, n_jobs=5)
results_depc = target_depc.grid_search(train_data, train_y, params=grid_params_depc, cv=5, n_jobs=5)
results_dep = target_dep.grid_search(train_data, train_y, params=grid_params_dep, cv=5, n_jobs=5)

In [4]:
results = [results_ind['mean_test_score'], results_depc['mean_test_score'], results_dep['mean_test_score']]
all_results = {'Our results' : [result.round(4)[0] * 100 for result in results]}
all_results['Paper results'] = [59.22, 65.38, 65.72]
index = ['Target-Ind', 'Target-Dep-', 'Target-Dep']
base_model_df = pd.DataFrame(all_results, index=index)
base_model_df

Model,Our results,Paper results
Target-Ind,60.98,59.22
Target-Dep-,65.69,65.38
Target-Dep,66.79,65.72


As the results above show, we get similar results to the paper and the order of the models stays the same.

# Target-Dep+ and sentiment lexicons
The **Target-Dep+** model uses sentiment lexicons to filter out words, so in this section we compare:
1. The statistics on the sentiment lexicons
2. The results of the model using different lexicons

All the experiments below again use the Word2Vec word embeddings.
## Sentiment lexicon statistics

Below we present the size of the sentiment lexicon once it has been processed and the size of that lexicon stated in the paper.

In [5]:
# Load the sentiment lexicons and remove all words that are not associated
# to the Positive or Negative class.
subset_cats = {'positive', 'negative'}
mpqa = lexicons.Mpqa(subset_cats=subset_cats)
nrc = lexicons.NRC(subset_cats=subset_cats)
hu_liu = lexicons.HuLiu(subset_cats=subset_cats)
# Combine sentiment lexicons - Removes words that contradict each other.
mpqa_huliu = lexicons.Lexicon.combine_lexicons(mpqa, hu_liu)
all_three = lexicons.Lexicon.combine_lexicons(mpqa_huliu, nrc)

# Load the sentiment lexicons but lower case all the words
mpqa_low = lexicons.Mpqa(subset_cats=subset_cats, lower=True)
nrc_low = lexicons.NRC(subset_cats=subset_cats, lower=True)
hu_liu_low = lexicons.HuLiu(subset_cats=subset_cats, lower=True)
mpqa_huliu_low = lexicons.Lexicon.combine_lexicons(mpqa_low, hu_liu_low)
all_three_low = lexicons.Lexicon.combine_lexicons(mpqa_huliu_low, nrc_low)

In [6]:
def filter_cat(lexicon, filter_cat):
    return [word for word, cat in lexicon.lexicon if cat == filter_cat]

all_lexicons = [mpqa, hu_liu, nrc, mpqa_huliu, all_three]
num_positive = [len(filter_cat(lexicon, 'positive')) for lexicon in all_lexicons]
num_negative = [len(filter_cat(lexicon, 'negative')) for lexicon in all_lexicons]

all_lexicons_low = [mpqa_low, hu_liu_low, nrc_low, mpqa_huliu_low, all_three_low]
num_positive_low = [len(filter_cat(lexicon, 'positive')) for lexicon in all_lexicons_low]
num_negative_low = [len(filter_cat(lexicon, 'negative')) for lexicon in all_lexicons_low]

columns = ['Paper No. Positive', 'Ours No. Positive', 'Ours low No. Positive', 
           'Paper No. Negative', 'Ours No. Negative', 'Ours low No. Negative']
index = ['MPQA', 'Hu Liu', 'NRC', 'MPQA & Hu Liu', 'All Three']
data = [[2289, 2003, 2231, 2706, 3940], num_positive, num_positive_low, 
        [4114, 4780, 3243, 5069, 6490], num_negative, num_negative_low]
senti_info = dict(list(zip(columns, data)))
pd.DataFrame(senti_info, columns=columns, index=index)

Lexicon,Paper No. Positive,Ours No. Positive,Ours low No. Positive,Paper No. Negative,Ours No. Negative,Ours low No. Negative
MPQA,2289,2304,2304,4114,4154,4154
Hu Liu,2003,2006,2006,4780,4783,4783
NRC,2231,2312,2312,3243,3324,3324
MPQA & Hu Liu,2706,2726,2726,5069,5079,5075
All Three,3940,4036,4036,6490,6551,6547


In [7]:
# Negative words in the cased MPQA & Hu Liu combination that are missing from the lower cased combination
[word for word, cat in list(set(mpqa_huliu.lexicon).difference(set(mpqa_huliu_low.lexicon))) if cat == 'negative']

['anti-American', 'anti-Semites', 'anti-Israeli', 'anti-US']

As you can see, our lexicon sizes never exactly match those stated in the paper, even though we obtained the lexicons from the sources described in the paper. Interestingly, if we do not lower case the words in the lexicons, the MPQA and Hu Liu lexicons do not overlap as much as they should: both contain the words above, but the Hu Liu lexicon has them lower cased already whereas MPQA does not.
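
The interaction between casing and combining lexicons can be reproduced on toy data. The sketch below is an assumption about what "removes words that contradict each other" means (a word listed under more than one category is dropped); the real `Lexicon.combine_lexicons` may behave differently, and the entries are invented for illustration.

```python
def combine_lexicons(lexicon_a, lexicon_b):
    """Union of two lexicons of (word, category) pairs.

    Words that appear with more than one category after the union are
    treated as contradictory and dropped.
    """
    combined = set(lexicon_a) | set(lexicon_b)
    categories = {}
    for word, cat in combined:
        categories.setdefault(word, set()).add(cat)
    return {(word, cat) for word, cat in combined if len(categories[word]) == 1}


def lower_case(lexicon):
    """Lower case every word in a lexicon of (word, category) pairs."""
    return {(word.lower(), cat) for word, cat in lexicon}


# Toy entries: the first lexicon keeps capitalisation, the second does not.
cased_lexicon = {('anti-American', 'negative'), ('good', 'positive'), ('fine', 'positive')}
lowered_lexicon = {('anti-american', 'negative'), ('bad', 'negative'), ('fine', 'negative')}

combined_cased = combine_lexicons(cased_lexicon, lowered_lexicon)
combined_lower = combine_lexicons(lower_case(cased_lexicon), lower_case(lowered_lexicon))

print(len(combined_cased), len(combined_lower))
print(combined_cased - combined_lower)
```

Without lower casing, `anti-American` and `anti-american` are counted as two separate entries, which is the same kind of discrepancy as the four negative words listed above.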

## Showing the effect of using different sentiment lexicons in the Target-Dep+ model

In [8]:
# Instances of the model
target_dep_plus = TargetDepSent()
# Getting the grid parameters for each model
grid_params_sent = target_dep_plus.get_cv_params(word_vectors=[[w2v]], senti_lexicons=all_lexicons_low,
                                                 random_state=42)
# Running the grid search over 5 folds.
results_sent = target_dep_plus.grid_search(train_data, train_y, params=grid_params_sent, cv=5, n_jobs=5)

In [9]:
all_sent_results = {'Paper results' : [65.72, 66.05, 67.24, 65.56, 67.40, 67.30],
                    'Our results' : np.zeros(6)}
index = ['Target-Dep', 'Target-Dep+: NRC', 'Target-Dep+: Hu Liu', 'Target-Dep+: MPQA',
         'Target-Dep+: MPQA + Hu Liu', 'Target-Dep+: All Three']
sent_results_df = pd.DataFrame(all_sent_results, index=index)
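# Use .loc rather than chained indexing so the assignment always writes to the DataFrame itself
sent_results_df.loc['Target-Dep', 'Our results'] = base_model_df.loc['Target-Dep', 'Our results']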

In [10]:
name_map = {'Mpqa' : 'Target-Dep+: MPQA', 'HuLiu' : 'Target-Dep+: Hu Liu', 'NRC' : 'Target-Dep+: NRC',
            'Mpqa HuLiu' : 'Target-Dep+: MPQA + Hu Liu', 'Mpqa HuLiu NRC' : 'Target-Dep+: All Three'}
results_sent['lexicon'] = results_sent['param_union__left_s__filter__lexicon'].apply(lambda lex: lex.name)
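for lex_name, model_name in name_map.items():
    score = results_sent.loc[results_sent['lexicon'] == lex_name, 'mean_test_score']
    # Assumes one grid search row per lexicon; store its score as a percentage.
    # .loc avoids chained indexing, which may silently write to a copy.
    sent_results_df.loc[model_name, 'Our results'] = round(score.iloc[0] * 100, 2)
sent_results_df.loc['Target-Dep', 'Our results'] = base_model_df.loc['Target-Dep', 'Our results']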
sent_results_df

Model,Our results,Paper results
Target-Dep,66.79,65.72
Target-Dep+: NRC,67.21,66.05
Target-Dep+: Hu Liu,68.63,67.24
Target-Dep+: MPQA,66.92,65.56
Target-Dep+: MPQA + Hu Liu,68.34,67.4
Target-Dep+: All Three,68.25,67.3


The results above differ from the paper's, and the ranking of the lexicons differs too: our best lexicon was **Hu & Liu**, whereas the paper's original results show the combination of **MPQA and Hu & Liu** as the best. However, in general we can see that it is better to use a sentiment lexicon than not. Both our implementation and the original paper also show that the best single sentiment lexicon is **Hu & Liu** and that using **all three** sentiment lexicons is worse than using **MPQA and Hu & Liu**.
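
The difference in ranking can be read directly from the numbers in the table. The short sketch below sorts both columns; the `ranking` helper is only an illustrative name, and the scores are copied from the table above.

```python
models = ['Target-Dep', 'Target-Dep+: NRC', 'Target-Dep+: Hu Liu', 'Target-Dep+: MPQA',
          'Target-Dep+: MPQA + Hu Liu', 'Target-Dep+: All Three']
ours = [66.79, 67.21, 68.63, 66.92, 68.34, 68.25]
paper = [65.72, 66.05, 67.24, 65.56, 67.40, 67.30]


def ranking(names, scores):
    """Return the names ordered from highest to lowest score."""
    return [name for _, name in sorted(zip(scores, names), reverse=True)]


our_ranking = ranking(models, ours)
paper_ranking = ranking(models, paper)
for position, (our_model, paper_model) in enumerate(zip(our_ranking, paper_ranking), start=1):
    print(position, our_model, '|', paper_model)
```

In the paper's numbers MPQA on its own ranks below **Target-Dep**, while in our results every lexicon ranks above it.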

# Showing the effect of the different word vectors
The paper shows the effect of using different word vectors across the four models, using the best sentiment lexicon for the sentiment dependent model. As our results for the sentiment lexicons differ from the original paper's, we show the results of using the **Hu & Liu** lexicon and of using the combination of **Hu & Liu and MPQA**. The word vectors used are the following:
1. Word2Vec - Which has been used throughout the previous experiments (100 dimensions)
2. SSWE - Sentiment Specific Word Embeddings (50 dimensions)
3. Concatenation of Word2vec and SSWE (150 dimensions)
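
As a rough illustration of the third option, the sketch below concatenates per-word vectors from two lookups. The helper `concatenate_vectors`, the toy embeddings and the use of a zero vector for unknown words are assumptions made for this example; how tdparse handles unknown words is not shown here.

```python
def concatenate_vectors(word, *lookups):
    """Concatenate the vectors of `word` from each lookup.

    Each lookup is a (mapping, dimension) pair. Assumption: a word
    missing from a mapping is represented by a zero vector.
    """
    combined = []
    for vectors, dimension in lookups:
        combined.extend(vectors.get(word, [0.0] * dimension))
    return combined


# Toy embeddings with the dimensions quoted in the list above.
toy_w2v = ({'great': [0.1] * 100}, 100)
toy_sswe = ({'great': [0.2] * 50}, 50)

great_vector = concatenate_vectors('great', toy_w2v, toy_sswe)
unknown_vector = concatenate_vectors('unseen', toy_w2v, toy_sswe)
print(len(great_vector), len(unknown_vector))
```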

In [11]:
# Build the parameter grids and run the grid searches
grid_params_ind = target_ind.get_cv_params(word_vectors=[[w2v], [sswe], [w2v, sswe]], random_state=42)
grid_params_depc = target_depc.get_cv_params(word_vectors=[[w2v], [sswe], [w2v, sswe]], random_state=42)
grid_params_dep = target_dep.get_cv_params(word_vectors=[[w2v], [sswe], [w2v, sswe]], random_state=42)
grid_params_dep_sent = target_dep_plus.get_cv_params(word_vectors=[[w2v], [sswe], [w2v, sswe]], 
                                                     senti_lexicons=[hu_liu_low, mpqa_huliu_low], random_state=42)

results_ind = target_ind.grid_search(train_data, train_y, params=grid_params_ind, cv=5, n_jobs=5)
results_depc = target_depc.grid_search(train_data, train_y, params=grid_params_depc, cv=5, n_jobs=5)
results_dep = target_dep.grid_search(train_data, train_y, params=grid_params_dep, cv=5, n_jobs=5)
results_dep_sent = target_dep_plus.grid_search(train_data, train_y, params=grid_params_dep_sent, cv=5, n_jobs=5)

In [12]:
# Wrangling the results
results_dep_sent['lexicon'] = results_dep_sent['param_union__left_s__filter__lexicon'].apply(lambda lex: lex.name)
results_dep_sent_hu = results_dep_sent[results_dep_sent['lexicon'] == 'HuLiu']
results_dep_sent_hu_mpqa = results_dep_sent[results_dep_sent['lexicon'] == 'Mpqa HuLiu']
grid_results = {'Target-Ind' : results_ind, 'Target-Dep-' : results_depc, 'Target-Dep' : results_dep, 
                'Target-Dep+: Hu Liu' : results_dep_sent_hu, 
                'Target-Dep+: MPQA + Hu Liu' : results_dep_sent_hu_mpqa}
index = ['word2vec', 'sswe', 'word2vec + sswe']
columns = list(grid_results.keys())
name_map = {'w2v' : 'word2vec', 'sswe' : 'sswe', 'w2vsswe' : 'word2vec + sswe'}
vector_results_df = pd.DataFrame(np.zeros((len(index), len(columns))), columns=columns, index=index)
for model_name, result in grid_results.items():
    # Work on a copy so that adding a column never writes to a slice of another DataFrame
    result = result.copy()
    vec_col = [col for col in result.columns if 'vector' in col][0]
    get_vec_name = lambda vec_list: ''.join(vec.name for vec in vec_list)
    result['vector'] = result[vec_col].apply(get_vec_name)
    for vec_name, index_name in name_map.items():
        score = result.loc[result['vector'] == vec_name, 'mean_test_score']
        # Assumes one grid search row per word vector setting.
        vector_results_df.loc[index_name, model_name] = round(score.iloc[0] * 100, 2)
vector_results_df


Word vectors,Target-Ind,Target-Dep-,Target-Dep,Target-Dep+: Hu Liu,Target-Dep+: MPQA + Hu Liu
word2vec,60.98,65.69,66.79,68.63,68.34
sswe,60.18,66.71,66.36,67.96,67.67
word2vec + sswe,63.2,67.46,68.07,69.46,69.11


As the results above show, using the combination of the two word vectors is best across all models, which matches the finding in the original paper. We also find that **Target-Dep+** > **Target-Dep** > **Target-Dep-** > **Target-Ind**, which is what the original paper found; this holds for Word2Vec and for the combined vectors, although with SSWE alone **Target-Dep-** is slightly ahead of **Target-Dep**. However, unlike the original paper, we found the *SSWE* word vectors to be generally worse than the *Word2Vec* vectors, suggesting that using just semantic information is more important than using a vector model that was trained to reduce both the semantic and the sentiment loss. We also found the **Hu & Liu** lexicon to be better than any other lexicon or combination of lexicons, whereas the original paper found the combination of **MPQA and Hu & Liu** to be the best. Finally, we can see that our results are similar to the original.

# Results of the final models on the test data
Here we show the performance of the **Target-Ind**, **Target-Dep**, and **Target-Dep+** models on the test data as reported in the paper, where each model uses the best parameters found in the previous tests.

For **Target-Dep+** we show results using the **Hu & Liu** lexicon and using the combination of **MPQA and Hu & Liu**, for direct comparison with the original paper: they found **MPQA and Hu & Liu** to be better than **Hu & Liu**, whereas we did not.

In [13]:
target_dep_plus_mpqa = TargetDepSent()

dong_test = dong(full_path(read_config('dong_twit_test_data')))
test_data = dong_test.data()
test_y = dong_test.sentiment_data()

best_params_ind = target_ind.get_params(word_vector=[w2v, sswe], random_state=42)
best_params_dep = target_dep.get_params(word_vector=[w2v, sswe], random_state=42)
best_params_dep_sent_hu = target_dep_plus.get_params(word_vector=[w2v, sswe], senti_lexicon=hu_liu_low,
                                                     random_state=42)
best_params_dep_sent_mpqa = target_dep_plus_mpqa.get_params(word_vector=[w2v, sswe], senti_lexicon=mpqa_huliu_low,
                                                       random_state=42)

target_ind.fit(train_data, train_y, params=best_params_ind)
target_dep.fit(train_data, train_y, params=best_params_dep)
target_dep_plus.fit(train_data, train_y, params=best_params_dep_sent_hu)
target_dep_plus_mpqa.fit(train_data, train_y, params=best_params_dep_sent_mpqa)

In [14]:
target_ind_res = target_ind.predict(test_data)
target_dep_res = target_dep.predict(test_data)
target_dep_plus_res_hu = target_dep_plus.predict(test_data)
target_dep_plus_res_mpqa = target_dep_plus_mpqa.predict(test_data)

results = [target_ind_res, target_dep_res, target_dep_plus_res_mpqa, target_dep_plus_res_hu]
scorers = {'acc' : accuracy_score, 'F1' : f1_score}
final_results_dict = {'Our results (Acc)' : [], 'Our results (Macro F1)' : []}
for result in results:
    for scorer_name, scorer in scorers.items():
        if scorer_name == 'F1':
            score = round(scorer(test_y, result, average='macro') * 100, 1)
            final_results_dict['Our results (Macro F1)'].append(score)
        else:
            score = round(scorer(test_y, result) * 100, 1)
            final_results_dict['Our results (Acc)'].append(score)

In [15]:
index = ['Target-Ind', 'Target-Dep', 'Target-Dep+ (MPQA & Hu Liu)', 'Target-Dep+ (Hu Liu)']
# 0. is a placeholder: no paper result is listed here for Target-Dep+ with Hu Liu alone
final_results_dict['Paper results (Acc)'] = [67.3, 69.7, 71.1, 0.]
final_results_dict['Paper results (Macro F1)'] = [66.4, 68.0, 69.9, 0.]
final_results_df = pd.DataFrame(final_results_dict, index=index)
final_results_df[['Our results (Acc)', 'Paper results (Acc)', 
                  'Our results (Macro F1)', 'Paper results (Macro F1)']]

Model,Our results (Acc),Paper results (Acc),Our results (Macro F1),Paper results (Macro F1)
Target-Ind,66.0,67.3,61.9,66.4
Target-Dep,69.7,69.7,66.7,68.0
Target-Dep+ (MPQA & Hu Liu),69.9,71.1,67.6,69.9
Target-Dep+ (Hu Liu),70.8,0.0,68.7,0.0


As you can see from above, our accuracy results are close to those reported in the paper (within 1.3 points) and identical for the **Target-Dep** model, while our macro F1 scores are lower by between 1.3 and 4.5 points. Our results using the **Hu Liu** lexicon are better than those using **MPQA & Hu Liu** and are closer to the results of **Target-Dep+** in the original paper.
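
The table above reports both accuracy and macro F1. Macro F1 is the unweighted mean of the per-class F1 scores, so it can diverge from accuracy when the classes are imbalanced. The pure-Python sketch below illustrates this on invented three-class labels; the notebook itself uses `accuracy_score` and `f1_score` from scikit-learn, and the helper names here are only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy three-class labels in which one class (0) dominates.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, -1, -1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, -1]
print(round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

Here the two errors both fall on the minority classes, so the macro F1 (about 0.73) is lower than the accuracy (0.8).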

# Fine Tuning
Instead of using the C-Value reported in the paper, we fine tune the C-Value ourselves using the combination of **Word2Vec** and **SSWE** embeddings (which was never done/shown in the paper), to see if we get values closer to those reported in the paper. We only do this for the best model (**Target-Dep+ (Hu Liu)**).


In [16]:
c_grid_params = {'word_vectors' : [[w2v, sswe]], 'random_state' : 42, 'senti_lexicons' : [hu_liu_low]}
best_c, _ = target_dep_plus.find_best_c(train_data, train_y, grid_params=c_grid_params, cv=5, n_jobs=5)
best_params = target_dep_plus.get_params(word_vector=[w2v, sswe], senti_lexicon=hu_liu_low,
                                         random_state=42, C=best_c)
target_dep_plus.fit(train_data, train_y, params=best_params)

In [17]:
target_dep_plus_res = target_dep_plus.predict(test_data)
norm_score = lambda score: round(score * 100, 1)
score_acc = norm_score(accuracy_score(test_y, target_dep_plus_res))
score_f1 = norm_score(f1_score(test_y, target_dep_plus_res, average='macro') )

index = ['Our results (Acc)', 'Paper results (Acc)', 'Our results (Macro F1)', 'Paper results (Macro F1)']
fine_tune_results_df = pd.DataFrame({'Target-Dep+' : [score_acc, 71.1, score_f1, 69.9]}, index=index)
fine_tune_results_df

Metric,Target-Dep+
Our results (Acc),70.7
Paper results (Acc),71.1
Our results (Macro F1),68.2
Paper results (Macro F1),69.9


As the results above show, fine tuning gives no improvement: we actually get slightly worse results (70.7 vs 70.8 accuracy and 68.2 vs 68.7 macro F1). This could be due to the training data not perfectly representing the test data.

# Problems encountered when reproducing results
When reproducing these methods, the main problems we came across were the following:
1. Not explicitly stating if the data has to be **scaled/normalised**
2. Not stating that all text should be **lower cased**

Neither of these was stated in the paper. Below we show the effects of not doing each of them, using what we found to be the best performing model (**Target-Dep+ (Hu Liu)**).

In [18]:
best_params = target_dep_plus.get_params(word_vector=[w2v, sswe], senti_lexicon=hu_liu_low,
                                         random_state=42, scale=False)
target_dep_plus.fit(train_data, train_y, params=best_params)

In [19]:
target_dep_plus_res = target_dep_plus.predict(test_data)
norm_score = lambda score: round(score * 100, 1)
score_not_scale_acc = norm_score(accuracy_score(test_y, target_dep_plus_res))
score_not_scale_f1 = norm_score(f1_score(test_y, target_dep_plus_res, average='macro') )
score_scaled_acc = round(final_results_df['Our results (Acc)']['Target-Dep+ (Hu Liu)'], 1)
score_scaled_f1 = round(final_results_df['Our results (Macro F1)']['Target-Dep+ (Hu Liu)'], 1)

index = ['Not scaled (Acc)', 'Scaled (Acc)', 'Paper results (Acc)', 
         'Not scaled (Macro F1)', 'Scaled (Macro F1)', 'Paper results (Macro F1)']
scaled_results_df = pd.DataFrame({'Target-Dep+' : [score_not_scale_acc, score_scaled_acc, 71.1, 
                                                   score_not_scale_f1, score_scaled_f1, 69.9]}, index=index)
scaled_results_df

Metric,Target-Dep+
Not scaled (Acc),44.4
Scaled (Acc),70.8
Paper results (Acc),71.1
Not scaled (Macro F1),40.6
Scaled (Macro F1),68.7
Paper results (Macro F1),69.9


As you can see, not scaling the data (in this case we used [MinMax](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) scaling) reduces accuracy by around 26.4 percentage points (70.8 vs 44.4). This was not stated in the paper. The paper used the Support Vector Machine library [LibLinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/); we use the same library, just through the [Scikit-learn interface](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). The [Practical guide to LibLinear](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) states that you should scale the data and shows its importance, as we have done above, but this is not stated or reiterated in the paper.
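
For readers unfamiliar with min-max scaling, the sketch below shows the idea in plain Python: the per-feature ranges are learnt on the training rows and then applied to both the training and the test rows. The helpers `fit_min_max` and `apply_min_max` and the toy feature values are assumptions for illustration; the notebook itself relies on the scikit-learn scaler linked above.

```python
def fit_min_max(rows):
    """Return per-feature (minimum, maximum) pairs from the training rows."""
    columns = list(zip(*rows))
    return [(min(column), max(column)) for column in columns]


def apply_min_max(rows, ranges):
    """Scale each feature using the ranges learnt on the training rows."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (low, high) in zip(row, ranges):
            span = high - low
            # A constant feature carries no information; map it to 0.
            scaled_row.append((value - low) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Toy features on very different scales.
train_rows = [[0.5, 100.0], [1.5, 300.0], [1.0, 200.0]]
test_rows = [[1.0, 400.0]]

ranges = fit_min_max(train_rows)
print(apply_min_max(train_rows, ranges))
print(apply_min_max(test_rows, ranges))
```

After scaling, both training features lie in the same [0, 1] range, so neither dominates purely because of its units. Test values outside the training range can still fall outside [0, 1], as the second feature of the test row does here.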

In [20]:
best_params = target_dep_plus.get_params(word_vector=[w2v, sswe], senti_lexicon=hu_liu,
                                         random_state=42, lower=False)
target_dep_plus.fit(train_data, train_y, params=best_params)

In [21]:
target_dep_plus_res = target_dep_plus.predict(test_data)
norm_score = lambda score: round(score * 100, 1)
score_not_lower_acc = norm_score(accuracy_score(test_y, target_dep_plus_res))
score_not_lower_f1 = norm_score(f1_score(test_y, target_dep_plus_res, average='macro'))
score_lower_acc = score_scaled_acc
score_lower_f1 = score_scaled_f1

index = ['Not lower cased (Acc)', 'Lower cased (Acc)', 'Paper results (Acc)', 
         'Not lower cased (Macro F1)', 'Lower cased (Macro F1)', 'Paper results (Macro F1)']
pd.DataFrame({'Target-Dep+' : [score_not_lower_acc, score_lower_acc, 71.1, 
                               score_not_lower_f1, score_lower_f1, 69.9]}, index=index)

Metric,Target-Dep+
Not lower cased (Acc),68.6
Lower cased (Acc),70.8
Paper results (Acc),71.1
Not lower cased (Macro F1),65.2
Lower cased (Macro F1),68.7
Paper results (Macro F1),69.9


As you can see above, the results change slightly (by around 2.2 accuracy points, 70.8 vs 68.6). This is because the word embeddings used have all been pre-processed and lower cased, so words that keep their capitalisation may not match an embedding. Lower casing words in sentiment analysis loses some information: you would expect `GREAT` to be more positive than `great`, and when all words are lower cased this distinction is lost. Lower casing is normally done to reduce sparsity.
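
The mismatch between cased text and lower-cased embeddings can be sketched with a toy vocabulary. The helper `count_missing` and the vocabulary below are hypothetical and only illustrate the lookup problem; they say nothing about how tdparse treats words it cannot find.

```python
def count_missing(tokens, vocabulary, lower):
    """Count tokens that have no entry in the embedding vocabulary."""
    if lower:
        tokens = [token.lower() for token in tokens]
    return sum(1 for token in tokens if token not in vocabulary)


# Toy vocabulary standing in for embeddings trained on lower-cased text.
toy_vocabulary = {'the', 'new', 'phone', 'is', 'great'}
tokens = ['The', 'new', 'phone', 'is', 'GREAT']

missing_cased = count_missing(tokens, toy_vocabulary, lower=False)
missing_lowered = count_missing(tokens, toy_vocabulary, lower=True)
print(missing_cased, missing_lowered)
```

In this toy case two of the five tokens are not found unless the text is lower cased first, including the sentiment-bearing word `GREAT`.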