# Local Feature Based Stance Detection

In this lab session we will implement a very simple linear classifier for Stance Detection using https://scikit-learn.org

Stance detection consists of classifying a given document as expressing an AGAINST, FAVOR or NEUTRAL attitude/stance with respect to a given topic. In this particular lab, we use the Task A data from the Semeval 2016 Twitter dataset for Stance detection: https://alt.qcri.org/semeval2016/task6/ 

Scikit-learn allows you to quickly experiment with a large number of machine learning algorithms in low resource environments (in comparison to neural network approaches). Scikit-learn also provides a large number of functionalities to process data and evaluate and visualize the obtained results.

Unlike other toolkits we will see during the course, scikit-learn is a library with an easy to use API ideal for quick experimentation with a large variety of models and algorithms. Usually, it is a good starting point for classification tasks.

REMEMBER to check the tutorial:
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html# 

## Load data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import nltk
import numpy as np
from nltk.corpus import stopwords
import string
from sklearn import preprocessing

# download English stopwords
nltk.download('stopwords')
# download nltk pos tagger for English
nltk.download('averaged_perceptron_tagger')

# load data
def load_data(fnames):
    data = []
    for fname in fnames:
        data.append(pd.read_csv(fname, sep='\t', encoding='utf-8'))
    data = pd.concat(data)
    targets = set(data['Target'])
    return data, list(targets)

def tokenized_tweets(df):
    tknzr = nltk.TweetTokenizer()
    df['Tokenized_tweet'] = df['Tweet'].apply(tknzr.tokenize)
    return df

def read_glove(path):
    '''
    read the glove vectors from path with dimension dim
    '''
    df = pd.read_csv(path, sep=" ", quoting=3, header=None, index_col=0)
    glove = {key: val.values for key, val in df.T.items()}
    return glove

def preprocess(data, tokenize=True, remove_stopwords=True, remove_none=True):
    if tokenize:
        data = tokenized_tweets(data)
        data['Clean_tweet'] = data['Tokenized_tweet']
    if remove_stopwords:
        stop = stopwords.words('english')
        data['Clean_tweet'] = data['Clean_tweet'].apply(lambda sentence: [word for word in sentence if word not in stop])
        data['Clean_tweet'] = data['Clean_tweet'].apply(lambda sentence: [word for word in sentence if not all([c in string.punctuation for c in word])])
    if remove_none:
        data = data[data['Stance'] != 'NONE']
    return data[['Target','Clean_tweet', 'Stance']]   
    
def gloveVectorize(glove, text):
    '''
    Find the pretrained glove vectors of the words in the input text.
    The final vector is the average of the vectors
    '''
    dim = len(glove["the"])
    X = np.zeros( (len(text), dim) )
    for text_id, t in enumerate(text):
        tmp = np.zeros((1, dim))        
        # remove oov words
        words = [w for w in t if w in glove.keys()]
        for word in words:
            tmp[:] += glove[word]

        if len(words) == 0:
            X[text_id, :] = np.zeros((1, dim)) 
        else:
            X[text_id, :] = tmp/len(words)
    return X

def encode_labels(labels):
    enc = preprocessing.LabelEncoder()
    encoded = enc.fit_transform(labels)
    decoded = enc.inverse_transform(encoded)
    return encoded, decoded

def data_as_numpy(data):
    return np.asarray(data['Tweet']), np.asarray(data['Stance'])

def get_table(results):
    # print best models results
    targts, models, accs = [],[],[]
    precs,recs, f1s = [],[], []
    par_names, par_values = [], []

    for target in results.keys():
        for model in results[target]:
            targts.append(target)
            models.append(model)
            accs.append(np.mean(results[target][model]['scores']['test_accuracy']))
            precs.append(np.mean(results[target][model]['scores']['test_precision']))
            recs.append(np.mean(results[target][model]['scores']['test_recall']))
            f1s.append(np.mean(results[target][model]['scores']['test_f1']))
            if model == 'rf':
                par_names.append('D')
                par_values.append(results[target][model]['D'])
            else:
                par_names.append('C')
                par_values.append(results[target][model]['C'])
    res_table = pd.DataFrame({'target':targts, 'model':models, 
                              'accuracy':accs, 'precision':precs, 
                              'recall':recs, 'fscore':f1s,
                              'par_name':par_names, 'par_value':par_values}, columns=['target', 'model', 
                                                                     'accuracy', 'precision',
                                                                     'recall', 'fscore', 'par_name', 'par_value'])
    return res_table

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
# data path. trial data used as training too.
trial_file = "/content/drive/MyDrive/2022-ILTAPP/datasets/stance-semeval2016/semeval2016-task6-trialdata.utf-8.txt"
train_file = "/content/drive/MyDrive/2022-ILTAPP/datasets/stance-semeval2016/semeval2016-task6-trainingdata.utf-8.txt"
test_file = "/content/drive/MyDrive/2022-ILTAPP/datasets/stance-semeval2016/SemEval2016-Task6-subtaskA-testdata-gold.txt"

training_data, targets = load_data([trial_file, train_file])
training_data = tokenized_tweets(training_data)

training_data.head()

Unnamed: 0,ID,Target,Tweet,Stance,Tokenized_tweet
0,1,Hillary Clinton,"@tedcruz And, #HandOverTheServer she wiped cle...",AGAINST,"[@tedcruz, And, ,, #HandOverTheServer, she, wi..."
1,2,Hillary Clinton,Hillary is our best choice if we truly want to...,FAVOR,"[Hillary, is, our, best, choice, if, we, truly..."
2,3,Hillary Clinton,@TheView I think our country is ready for a fe...,AGAINST,"[@TheView, I, think, our, country, is, ready, ..."
3,4,Hillary Clinton,I just gave an unhealthy amount of my hard-ear...,AGAINST,"[I, just, gave, an, unhealthy, amount, of, my,..."
4,5,Hillary Clinton,@PortiaABoulger Thank you for adding me to you...,NONE,"[@PortiaABoulger, Thank, you, for, adding, me,..."


## Feature Extraction

From [Mohammad et al. 2016](http://saifmohammad.com/WebDocs/1605.01655v1.pdf):

The features used in our text classification system are shown below:

- __n-grams__: presence or absence of contiguous sequences of 1, 2 and 3 tokens (word n-grams); presence or absence of contiguous sequences of 2, 3, 4, and 5 characters (character n-grams);
- __sentiment (sent.)__: The sentiment lexicon features are derived from three manually created lexicons: NRC Emotion Lexicon [Mohammad and Turney 2010], Hu and Liu Lexicon [Hu and Liu 2004], and MPQA Subjectivity Lexicon [Wilson et al. 2005], and two automatically created, tweet-specific, lexicons: NRC Hashtag Sentiment and
NRC Emoticon (a.k.a. Sentiment140) [Kiritchenko et al. 2014a]; 
  + NRC Emotion Lexicon: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
  + Hu and Liu Lexicon: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
  + MPQA Subjectivity Lexicon: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
  + NRC Hashtag Sentiment and NRC Emoticon (a.k.a. Sentiment140): http://saifmohammad.com/WebPages/lexicons.html
 

- __target__: presence/absence of the target of interest in the tweet;
- __POS__: the number of occurrences of each part-of-speech tag (POS);
- __encodings (enc.)__: presence/absence of positive and negative emoticons, hashtags, characters in upper case, elongated words (e.g., sweeettt), and punctuations such as exclamation and question marks.

----

# ASSIGNMENT 1

We define some helper functions for the feature extraction.

+ TODO: define the function to perform POS tagging using NLTK
+ TODO: add code to read the NRC-Hashtag-Sentiment-Lexicon-v1.0

In [5]:
# utils
import re
import nltk
tknzr = nltk.TweetTokenizer() # keep global for simplicity

def tokenize(sentence):
    return tknzr.tokenize(sentence)

def pos_tag(sentence):
  # TODO complete this function to perform pos tagging using NLTK
  return nltk.pos_tag(sentence)


# helper functions for feature extraction
def ngrams(tokens, n):
    return list(zip(*[tokens[i:] for i in range(n)]))

def load_nrc_emotions(fname):
    emotions = {}
    f = open(fname, 'r')
    _ = f.readline()
    for line in f:
        word, emotion, affect = line.rstrip().split('\t')
        if affect == '1':
            if word not in emotions:
                emotions[word] = []
            emotions[word].append(emotion)
    f.close()
    return emotions

def load_nrc_hashtags(fname_list):
    sentiment = {}
    # TODO add code to read the NRC hashtags file
    # HINT: you may inspect the file in the drive
    return sentiment

def load_nrc_emoticons(fname_list):
    emoticons = {}
    for fname in fname_list:
        f = open(fname, 'r')
        for line in f:
            emoti, score, _, _ = line.rstrip().split('\t')
            emoticons[emoti] = score
        f.close()
    return emoticons

def load_hu_liu(neg_fname, pos_fname):
    sentiments = {}
    f = open(neg_fname, 'r')
    for line in f:
        if re.search('^;', line.rstrip()):
            continue
        if re.search('^$', line.rstrip()):
            continue
        sentiments[line.rstrip()] = 'negative'
    f.close()
    f = open(pos_fname, 'r')
    for line in f:
        if re.search('^;', line.rstrip()):
            continue
        if re.search('^$', line.rstrip()):
            continue
        sentiments[line.rstrip()] = 'positive'
    f.close()
    return sentiments

def load_mpqa_polarities(fname):
    polarities = {}
    f = open(fname, 'r')
    for line in f:
        sp  = line.rstrip().split(' ')
        if sp[5] == 'm':
            word = sp[2].split('=')[1]
            polarity = sp[6].split('=')[1]
        else:
            word = sp[2].split('=')[1]
            polarity = sp[5].split('=')[1]
        polarities[word] = polarity
    f. close()
    return polarities


def generate_target_dict(targets):
    target_dict = {}
    stop = stopwords.words('english')
    for target in targets:
        # original
        target_dict[target] = 1

        # lower case
        target_dict[target.lower()] = 1

        # join Hillary Clinton => HillaryClinton
        target_dict[target.replace(' ', '')] = 1
        target_dict['#'+target.replace(' ', '')] = 1
        target_dict['@'+target.replace(' ', '')] = 1
        target_dict[target.lower().replace(' ', '')] = 1
        target_dict['#'+target.lower().replace(' ', '')] = 1
        target_dict['@'+target.lower().replace(' ', '')] = 1

        # process parts of target name:
        for part in target.split(' '):
            if part not in stop:
                target_dict[part] = 1
                target_dict['#'+part] = 1
                target_dict['@'+part] = 1
                target_dict['#'+part.lower()] = 1
                target_dict['@'+part.lower()] = 1
    return target_dict


class Match(object):

    def __init__(self):
        self.trie = {}

    def match(self, words):
        i = 0
        while (i < len(words)):
            j = self.match2(words, i, self.trie)
            if j >= 0:
                return True
            i += 1
        return False

    def match2(self, words, i, trie):
        if words[i] not in trie:
            return -1
        for length in sorted(trie[words[i]].keys(), reverse=True):
            context = ' '.join(words[i+1:i+length+1])
            for entry in trie[words[i]][length]:
                if context == entry:
                    return length
        return -1

    def matchinit(self, dictionary):
        for entry in dictionary.keys():
            firstword = re.split(' +', entry)[0]
            rwords = re.split(' +', entry)[1:]

            length = len(rwords)
            if firstword not in self.trie:
                self.trie[firstword] = {}
            if length not in self.trie[firstword]:
                self.trie[firstword][length] = []
            self.trie[firstword][length].append(' '.join(rwords))

----

We load emotion and sentiment lexicons used in the feature extraction 

In [10]:
# Load lexicon and use as global variables.

# NRC emotion lexicon
nrc_emotion_file = '/content/drive/MyDrive/2022-ILTAPP/datasets/NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt'
nrc_emotions = load_nrc_emotions(nrc_emotion_file)

# NRC hashtag sentiment
nrc_hashtag_path = '/content/drive/MyDrive/2022-ILTAPP/datasets/NRC-Hashtag-Sentiment-Lexicon-v1.0'
nrc_ht_sentiments = load_nrc_hashtags([nrc_hashtag_path+'/HS-unigrams.txt', nrc_hashtag_path+'/HS-bigrams.txt' ])

# NRC emoticons sentiment
nrc_emoticons_path = '/content/drive/MyDrive/2022-ILTAPP/datasets/NRC-Emoticon-Lexicon-v1.0'
nrc_em_sentiments = load_nrc_emoticons([nrc_emoticons_path+'/Emoticon-unigrams.txt', nrc_emoticons_path+'/Emoticon-bigrams.txt'])

# Hu and Liu sentiment lexicon
hu_liu_path = '/content/drive/MyDrive/2022-ILTAPP/datasets/Hu-Liu_sentiment_lexicon'
hu_liu_sentiments = load_hu_liu(hu_liu_path+'/negative-words.utf8.txt', hu_liu_path+'/positive-words.utf8.txt')

# MPQA polarity lexicon
mpqa_file = '/content/drive/MyDrive/2022-ILTAPP/datasets/mpqa_polarities/subjclueslen1-HLTEMNLP05.tff'
mpqa_polarities = load_mpqa_polarities(mpqa_file)

----

Define feature functions. Function will apply for each instance in the dataset. Output of the functions is a custom python dictionary with the activated/extracted features. Note that after the extraction we'll need to vectorize whole dataset of feature.

In [11]:
# features
def word_ngrams(tokens, n):
    features = {}
    name = str(n)+'wgram:'
    for ngram in ngrams(tokens, n):
        features[name+'_'.join(ngram)] = 1
    return features

def char_ngrams(sentence, n):
    features = {}
    name = str(n)+'cgram:'
    for ngram in ngrams(sentence, n):
        features[name+'_'.join(ngram)] = 1
    return features

def pos_nb(pos_tags):
    features = {}
    name='pos_nb:'
    for tag in pos_tags:
        feat = name+tag[1]
        if feat not in features:
            features[feat] = 1
        else:
            features[feat] += 1
    return features

def target_occurs(sentence, matcher):
    features = {}
    name = 'target:'
    if  matcher.match(sentence):
        features[name+'true'] = 1
    else:
        features[name+'false'] = 1
    return features

def nrc_emotions_features(tokens, lexicon):
    features = {}
    name='nrc_emo:'
    for token in tokens:
        if token in lexicon:
            for emotion in lexicon[token]:
                features[name+token+':'+emotion] = 1
    return features

def nrc_hashtag_features(tokens, lexicon):
    features = {}
    name = 'nrc_ht:'
    for token in tokens:
        if token in lexicon:
            features[name+token] = lexicon[token]
    return features

def nrc_emoticons_features(tokens, lexicon):
    features = {}
    name = 'nrc_emc:'
    for token in tokens:
        if token in lexicon:
            features[name+token] = lexicon[token]
    return features

def hu_liu_sentiment_features(tokens, lexicon):
    features = {}
    name = 'hu_liu:'
    for token in tokens:
        if token in lexicon:
            features[name+token+':'+lexicon[token]] = 1
    return features

def mpqa_polarity_features(tokens, lexicon):
    features = {}
    name = 'mpqa:'
    for token in tokens:
        if token in lexicon:
            features[name+token+':'+lexicon[token]] = 1
    return features


def encoding(tokens):
    pass


# ASSIGNMENT 2

+ TODO: extract bigram and trigram character ngram features using the "char_ngram" function.
+ TODO: extract postag features by using the pos_nb function.

In [13]:
from sklearn.feature_extraction import DictVectorizer

def extract_instance_features(instance, target_matcher=None,
                              feature_types=None):
    if feature_types is None:
        feature_types = {'ngrams' : True,
                        'cgrams' : True,
                        'sentiment' : True,
                        'nb_pos' : True,
                        'target' : True}
    # tokenize
    tokenized = tokenize(instance)
    
    # part-of-speech
    pos = nltk.pos_tag(tokenized)
    
    # extract features
    features = {}
    
    # word n-grams
    if feature_types['ngrams']:
        features.update(word_ngrams(tokenized, 1))
        features.update(word_ngrams(tokenized, 2))
        features.update(word_ngrams(tokenized, 3))
    
    # TODO implement bigram and trigram character n-grams features
    if feature_types['cgrams']:
       features.update(char_ngrams(instance, 2))
       features.update(char_ngrams(instance, 3))
       features.update(char_ngrams(instance, 4))
       features.update(char_ngrams(instance, 5))
    
    # sentiment
    if feature_types['sentiment']:
        features.update(nrc_emotions_features(tokenized, nrc_emotions))
        features.update(nrc_hashtag_features(tokenized, nrc_ht_sentiments))
        features.update(nrc_emoticons_features(tokenized, nrc_em_sentiments))
        features.update(hu_liu_sentiment_features(tokenized, hu_liu_sentiments))
        features.update(mpqa_polarity_features(tokenized, mpqa_polarities))
    
    # TODO extract postag features 
    if feature_types['nb_pos']:
        features.update(pos_nb(pos))
    
    # target
    if feature_types['target'] and target_matcher is not None:
        features.update(target_occurs(tokenized, target_matcher))
    
    return features

def extract_features(instances, target_names, 
                     feature_types=None):
    target_dict = generate_target_dict(target_names)
    matcher = Match()
    matcher.matchinit(target_dict)
    features = [extract_instance_features(inst, matcher, feature_types) for inst in instances]
    return features

---

Example of feature extraction of a single instance.

In [14]:
target = 'Hillary Clinton'
target_dict = generate_target_dict([target])
matcher = Match()
matcher.matchinit(target_dict)

print(target_dict)
print(training_data['Tweet'].iloc[0])
feats = extract_instance_features(training_data['Tweet'].iloc[0], matcher)
feats

{'Hillary Clinton': 1, 'hillary clinton': 1, 'HillaryClinton': 1, '#HillaryClinton': 1, '@HillaryClinton': 1, 'hillaryclinton': 1, '#hillaryclinton': 1, '@hillaryclinton': 1, 'Hillary': 1, '#Hillary': 1, '@Hillary': 1, '#hillary': 1, '@hillary': 1, 'Clinton': 1, '#Clinton': 1, '@Clinton': 1, '#clinton': 1, '@clinton': 1}
@tedcruz And, #HandOverTheServer she wiped clean + 30k deleted emails, explains dereliction of duty/lies re #Benghazi,etc #tcot #SemST


{'1wgram:#Benghazi': 1,
 '1wgram:#HandOverTheServer': 1,
 '1wgram:#SemST': 1,
 '1wgram:#tcot': 1,
 '1wgram:+': 1,
 '1wgram:,': 1,
 '1wgram:/': 1,
 '1wgram:30k': 1,
 '1wgram:@tedcruz': 1,
 '1wgram:And': 1,
 '1wgram:clean': 1,
 '1wgram:deleted': 1,
 '1wgram:dereliction': 1,
 '1wgram:duty': 1,
 '1wgram:emails': 1,
 '1wgram:etc': 1,
 '1wgram:explains': 1,
 '1wgram:lies': 1,
 '1wgram:of': 1,
 '1wgram:re': 1,
 '1wgram:she': 1,
 '1wgram:wiped': 1,
 '2cgram: _#': 1,
 '2cgram: _+': 1,
 '2cgram: _3': 1,
 '2cgram: _A': 1,
 '2cgram: _c': 1,
 '2cgram: _d': 1,
 '2cgram: _e': 1,
 '2cgram: _o': 1,
 '2cgram: _r': 1,
 '2cgram: _s': 1,
 '2cgram: _w': 1,
 '2cgram:#_B': 1,
 '2cgram:#_H': 1,
 '2cgram:#_S': 1,
 '2cgram:#_t': 1,
 '2cgram:+_ ': 1,
 '2cgram:,_ ': 1,
 '2cgram:,_e': 1,
 '2cgram:/_l': 1,
 '2cgram:0_k': 1,
 '2cgram:3_0': 1,
 '2cgram:@_t': 1,
 '2cgram:A_n': 1,
 '2cgram:B_e': 1,
 '2cgram:H_a': 1,
 '2cgram:O_v': 1,
 '2cgram:S_T': 1,
 '2cgram:S_e': 1,
 '2cgram:T_h': 1,
 '2cgram:a_i': 1,
 '2cgram:a_n': 

## Experiments

- **Classifier**: Class weigthed Linear SVM. No previous feature scaling is done (previous experiments showed that takes longer learning and performance is lower).
- **Experiments**: We'll run experiments in a in-target scenario running independent experiments in each of 5 targets.
    - We'll run ablation test on feature types.




<a id='intarget'></a>
### In-target scenario

This section is organized in two parts. First, we'll run a linear SVM with all the features and explore the effect of C when using discrete binary features. Second, we'll run an ablation test to measure the impact of each feature type over the whole system.

Finally, we'll compare all the results and draw some conclusions.

**Results on ablation test**

- overall, when we remove nb_pos feature the system improve about 1 point.
- character grams are the ones that contribute more in the system. This might because they are the predominant type of features by far.
- The rest contributes little or nothing regarding the whole system.

**Experiments with whole set of features**

In [15]:
%%capture --no-stdout
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction import DictVectorizer


# Run experiments: preprocess data and glove
results = {}
for target in targets:
    results[target] = {'svm': {'str':'', 'f1':0.0}}

scoring = {'accuracy' : 'accuracy',
           'f1': 'f1_macro',
           'precision':'precision_macro',
           'recall' : 'recall_macro'}

xval = 5
# feature type: ALL
feature_types = {'ngrams' : True,
                 'cgrams' : True,
                 'sentiment' : True,
                 'nb_pos' : True,
                 'target' : True}

for target in targets:
    print('Running experiments for "{}"'.format(target))
    
    training_texts, training_labels = data_as_numpy(training_data[training_data['Target'] == target])
    training_features = extract_features(training_texts, [target], feature_types)
 
    #vectorize
    vec = DictVectorizer()
    train_x = vec.fit_transform(training_features).toarray()
    train_y, labels = encode_labels(training_labels)
    
    print('Training shape: {}'.format(train_x.shape))
    
    C = [0.1, 0.5, 1, 10]
    for c in C:
        linear_svm = Pipeline([
            #("scaler", StandardScaler()),
            ("lsvc", LinearSVC(C=c, class_weight='balanced'))
        ])
        scores = cross_validate(linear_svm, train_x, train_y, cv=xval, scoring=scoring, return_train_score=False)
        f1 = np.mean(scores['test_f1'])
        print("[LinearSVM] C=%f | f1=%f" %(c,f1))
        if f1 > results[target]['svm']['f1']:
            results[target]['svm']['f1'] = f1
            results[target]['svm']['str'] = "[LinearSVM] C=%f | f1=%f" %(c,f1)
            results[target]['svm']['C'] = c
            results[target]['svm']['scores'] = scores
    print()

Running experiments for "Legalization of Abortion"
Training shape: (653, 88813)
[LinearSVM] C=0.100000 | f1=0.620551
[LinearSVM] C=0.500000 | f1=0.618854
[LinearSVM] C=1.000000 | f1=0.618854
[LinearSVM] C=10.000000 | f1=0.618854

Running experiments for "Hillary Clinton"
Training shape: (689, 95197)
[LinearSVM] C=0.100000 | f1=0.609049
[LinearSVM] C=0.500000 | f1=0.608345
[LinearSVM] C=1.000000 | f1=0.612214
[LinearSVM] C=10.000000 | f1=0.611179

Running experiments for "Climate Change is a Real Concern"
Training shape: (395, 66529)
[LinearSVM] C=0.100000 | f1=0.562176
[LinearSVM] C=0.500000 | f1=0.562176
[LinearSVM] C=1.000000 | f1=0.562176
[LinearSVM] C=10.000000 | f1=0.562176

Running experiments for "Atheism"
Training shape: (513, 76919)
[LinearSVM] C=0.100000 | f1=0.611200
[LinearSVM] C=0.500000 | f1=0.614918
[LinearSVM] C=1.000000 | f1=0.614918
[LinearSVM] C=10.000000 | f1=0.614918

Running experiments for "Feminist Movement"
Training shape: (664, 93884)
[LinearSVM] C=0.100000 | 

**Ablation tests**

In [16]:
%%capture --no-stdout
# run ablation test over feature types
feature_type_names = feature_types.keys()
for target in targets:
    for feature_type in feature_type_names:
        results[target]['svm-'+feature_type] = {'str':'', 'f1':0.0}

def init_feature_type():
    return {'ngrams' : True,'cgrams' : True,'sentiment' : True,'nb_pos' : True, 'target' : True}

scoring = {'accuracy' : 'accuracy',
           'f1': 'f1_macro',
           'precision':'precision_macro',
           'recall' : 'recall_macro'}

xval = 5
for target in targets:
    print('Running experiments for "{}"'.format(target))
    training_texts, training_labels = data_as_numpy(training_data[training_data['Target'] == target])
    
    # run ablation in target
    for feature_type in feature_type_names:
        feature_types = init_feature_type()
        # deactivate feature type
        model_name = 'svm-'+feature_type
        feature_types[feature_type] = False
        training_features = extract_features(training_texts, [target], feature_types)
 
        #vectorize
        vec = DictVectorizer()
        train_x = vec.fit_transform(training_features).toarray()
        train_y, labels = encode_labels(training_labels)
        
        print('Removed feature type: {}'.format(feature_type))
        print('Training shape: {}'.format(train_x.shape))

        C = [0.1, 0.5, 1, 10]
        for c in C:
            linear_svm = Pipeline([
                #("scaler", StandardScaler()),
                ("lsvc", LinearSVC(C=c, class_weight='balanced'))
            ])
            scores = cross_validate(linear_svm, train_x, train_y, cv=xval, scoring=scoring, return_train_score=False)
            f1 = np.mean(scores['test_f1'])
            print("[LinearSVM] C=%f | f1=%f" %(c,f1))
            if f1 > results[target][model_name]['f1']:
                results[target][model_name]['f1'] = f1
                results[target][model_name]['str'] = "[LinearSVM] C=%f | f1=%f" %(c,f1)
                results[target][model_name]['C'] = c
                results[target][model_name]['scores'] = scores
        print()

Running experiments for "Legalization of Abortion"
Removed feature type: ngrams
Training shape: (653, 64872)
[LinearSVM] C=0.100000 | f1=0.625912
[LinearSVM] C=0.500000 | f1=0.622269
[LinearSVM] C=1.000000 | f1=0.622269
[LinearSVM] C=10.000000 | f1=0.622269

Removed feature type: cgrams
Training shape: (653, 27480)
[LinearSVM] C=0.100000 | f1=0.532339
[LinearSVM] C=0.500000 | f1=0.525584
[LinearSVM] C=1.000000 | f1=0.523851
[LinearSVM] C=10.000000 | f1=0.523851

Removed feature type: sentiment
Training shape: (653, 85319)
[LinearSVM] C=0.100000 | f1=0.620422
[LinearSVM] C=0.500000 | f1=0.622008
[LinearSVM] C=1.000000 | f1=0.622008
[LinearSVM] C=10.000000 | f1=0.622008

Removed feature type: nb_pos
Training shape: (653, 88770)
[LinearSVM] C=0.100000 | f1=0.630182
[LinearSVM] C=0.500000 | f1=0.625796
[LinearSVM] C=1.000000 | f1=0.625547
[LinearSVM] C=10.000000 | f1=0.625547

Removed feature type: target
Training shape: (653, 88811)
[LinearSVM] C=0.100000 | f1=0.618854
[LinearSVM] C=0.500

# ASSIGNMENT 3

Create a table to show the ablation results in a table such as the one shown below.

+ TODO Create table
+ TODO Sort table to show the ablation results ordered by target, fscore and model

In [17]:
#TODO sort table to show the ablation results ordered by target, fscore and model.


df = pd.DataFrame(get_table(results))
df.sort_values(by=["model"])
df.sort_values(by=["fscore"])
df.sort_values(by=["target"])

Unnamed: 0,target,model,accuracy,precision,recall,fscore,par_name,par_value
23,Atheism,svm-target,0.713459,0.717908,0.593951,0.614918,C,0.5
22,Atheism,svm-nb_pos,0.713402,0.732654,0.592387,0.619521,C,0.1
21,Atheism,svm-sentiment,0.711498,0.723717,0.591839,0.61206,C,0.1
20,Atheism,svm-cgrams,0.676547,0.648164,0.537987,0.555853,C,0.1
19,Atheism,svm-ngrams,0.709537,0.70765,0.593432,0.612611,C,0.1
18,Atheism,svm,0.713459,0.717908,0.593951,0.614918,C,0.5
14,Climate Change is a Real Concern,svm-cgrams,0.653165,0.565568,0.5108,0.523349,C,1.0
17,Climate Change is a Real Concern,svm-target,0.708861,0.604721,0.549911,0.562176,C,0.1
12,Climate Change is a Real Concern,svm,0.708861,0.604721,0.549911,0.562176,C,0.1
13,Climate Change is a Real Concern,svm-ngrams,0.716456,0.609441,0.55542,0.567548,C,0.5


# (BONUS) ASSIGNMENT 4: Evaluation on Test Dataset

Evaluate the models in Task A: **in-target scenario**. Train model using whole training set and best C value of each target. Evaluate on the test set.
  + **HINT**: Look at the CV code to iterate over the targets to train and evaluate.

**Use sklearn API for evaluation**.
  + Hint: Look at the evaluation in the Text Classifier training with Spacy.

### Task A: in-target scenario

- Question: Do we obtain comparable results with respect to Mohammad et al. 2016?
- Write a table to summarize the results obtained.

