## Preparing and Pre-processing the dataset

- convert to lower case
- tokenize (punctuation is preserved as separate tokens)

In [2]:
# constants used throughout the notebook
EMB_SIZE = 300  # dimensionality of the word embeddings

In [3]:
categories = {'anger':0, 'sadness':1, 'joy':2, 'fear':3, 'disgust':4, 'guilt':5, 'shame':6}

In [4]:
import numpy as np

# Store the embeddings in a dictionary: {word: 300-dimensional vector}
word2emb = {}
emb_file_path = './Datasets/Embeddings/ewe_uni_embeddings.txt' # 300 dimensional

with open(emb_file_path, 'r') as emb_file:
    # iterate over the file directly instead of loading every line into memory
    for line in emb_file:
        dims = line.strip().split(" ")
        word = dims[0]
        # the vector gets its own name so it no longer shadows the file handle
        vector = np.array([float(i) for i in dims[1:]])
        word2emb[word] = vector
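
The loader above assumes one entry per line: the word, followed by its values, separated by single spaces. A minimal sketch of parsing such a line with a dimension check follows. `parse_embedding_line` is a hypothetical helper, and skipping malformed lines (for example a possible count header) is an assumption, not something the embedding file is known to require.

```python
def parse_embedding_line(line, expected_dim):
    """Return (word, vector) for a well-formed line, or None otherwise."""
    parts = line.strip().split(" ")
    word, values = parts[0], parts[1:]
    if len(values) != expected_dim:
        return None  # e.g. a header line or a truncated entry
    try:
        return word, [float(v) for v in values]
    except ValueError:
        return None  # one of the values is not a number


# Toy 3-dimensional lines, made up for illustration.
sample_lines = ["joy 0.1 -0.2 0.3", "2 3", "fear 0.5 0.5 x"]
parsed = [parse_embedding_line(l, expected_dim=3) for l in sample_lines]
toy_word2emb = dict(p for p in parsed if p is not None)
print(toy_word2emb)
```

Only the first toy line survives the checks; the other two are dropped instead of raising an error halfway through a large file.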

In [30]:
def avg_embedding(word_list):
    '''
    Compute a sentence representation: the average of the word embeddings
    of all the words in the sentence. Words that are not in word2emb get a
    random vector drawn from N(mu, sigma^2) each time they occur.

    params : list of words of the sentence
    '''
    if not word_list:  # an empty sentence would otherwise divide by zero
        return np.zeros(EMB_SIZE)
    emb_sum = np.zeros(EMB_SIZE)
    mu, sigma = 0, 1
    for word in word_list:
        try:
            emb_sum = np.add(word2emb[word], emb_sum)
        except KeyError:
            # out-of-vocabulary word: fall back to a random vector
            emb_sum = np.add(np.random.normal(mu, sigma, EMB_SIZE), emb_sum)
    return np.true_divide(emb_sum, len(word_list))
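
Drawing a fresh random vector on every miss means the same unknown word contributes a different vector each time it appears. One possible alternative, sketched here on toy 3-dimensional vectors, caches a single random vector per unknown word. The names `average_embedding`, `toy_table` and `oov_cache` are illustrative and not part of the notebook.

```python
import numpy as np


def average_embedding(words, table, dim, oov_cache, rng):
    """Average the vectors of `words`; unknown words get one cached random vector."""
    if not words:
        return np.zeros(dim)
    total = np.zeros(dim)
    for word in words:
        if word in table:
            total += table[word]
        else:
            if word not in oov_cache:
                oov_cache[word] = rng.normal(0.0, 1.0, dim)
            total += oov_cache[word]
    return total / len(words)


# Toy embeddings, made up for illustration.
toy_table = {
    "happy": np.array([1.0, 0.0, 2.0]),
    "sad": np.array([3.0, 2.0, 0.0]),
}
rng = np.random.default_rng(0)
oov_cache = {}

sent_vec = average_embedding(["happy", "sad"], toy_table, 3, oov_cache, rng)
first = average_embedding(["unseen"], toy_table, 3, oov_cache, rng)
second = average_embedding(["unseen"], toy_table, 3, oov_cache, rng)
```

With the cache, `first` and `second` are identical, so a sentence representation no longer changes between runs over the same text.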

In [9]:
from nltk.tokenize import RegexpTokenizer

def pre_process(sentence):
    pattern = r'''(?x)          # set flag to allow verbose regexps
            (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
          | \w+(?:['-]\w+)*     # words with optional internal hyphens and apostrophes
          | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
          | \.\.\.              # ellipsis
          | [.,;"?():_`-]       # these are separate tokens
        '''
    # lowercase after tokenizing: lowercasing first would stop the
    # upper-case abbreviation pattern from ever matching
    return [token.lower() for token in RegexpTokenizer(pattern).tokenize(sentence)]
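
To see what this pattern yields on a concrete sentence, here is a standard-library sketch that applies the same regular expression with `re.findall`. It assumes that this mirrors how the tokenizer collects matches, and it lowercases after matching so that the upper-case abbreviation branch can fire. `tokenize_sketch` and the example sentence are made up for illustration.

```python
import re

TOKEN_PATTERN = r'''(?x)        # verbose mode: whitespace and comments are ignored
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \w+(?:['-]\w+)*           # words with optional internal hyphens and apostrophes
    | \$?\d+(?:\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                    # ellipsis
    | [.,;"?():_`-]             # punctuation kept as separate tokens
'''


def tokenize_sketch(sentence):
    # Match on the original casing, then lowercase each token.
    return [tok.lower() for tok in re.findall(TOKEN_PATTERN, sentence)]


tokens = tokenize_sketch("The U.S.A. trip cost $12.40, didn't it?")
print(tokens)
# ['the', 'u.s.a.', 'trip', 'cost', '$12.40', ',', "didn't", 'it', '?']
```

The abbreviation and the currency amount each stay in one piece, while the comma and question mark become tokens of their own.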

In [31]:
with open('./dataset2.csv', 'r') as dataset:
    X,y = [],[]
    for line in dataset:
        sentence, cat = line.strip().rsplit(',',1) # split at the last comma; the sentence itself may contain commas
        words = pre_process(sentence)
        cat = categories[cat]
        X.append(avg_embedding(words))
        y.append(cat)
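
Because the label is the last comma-separated field, splitting from the right keeps any commas inside the sentence intact. A short illustration on a made-up row (not taken from the dataset):

```python
row = "when i lost my keys, i panicked,fear\n"

# rsplit with maxsplit=1 splits only at the last comma.
sentence, label = row.strip().rsplit(",", 1)
print(sentence)  # when i lost my keys, i panicked
print(label)     # fear

# A plain split with maxsplit=1 would instead cut at the first comma.
wrong_sentence, wrong_label = row.strip().split(",", 1)
print(wrong_label)  #  i panicked,fear
```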

In [32]:
X = np.array(X)
y = np.array(y)
print(X.shape)
print(y.shape)

(7503, 300)
(7503,)


## Model for learning embeddings

Uni-labeled data: ISEAR

Architecture: input -> embedding layer -> hidden layer -> softmax

Training objective: multinomial cross-entropy loss

Initialisation of word embeddings:
- GloVe
- random initialisation from N(0, sigma^2) for words not in GloVe

Optimiser and hyperparameters:
- Adam
- learning rate: 0.001
- mini-batch size: 1024
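
The forward pass and loss of such a network can be sketched in NumPy. This is only an illustration under stated assumptions: the sizes are toy values, the mean pooling of word vectors and the tanh activation are guesses (the notes do not say how the embedding layer feeds the hidden layer), and the Adam update step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, N_CLASSES = 50, 8, 16, 7  # toy sizes, assumed

# Parameters; in the setup described above the embedding rows would start from GloVe.
embeddings = rng.normal(0.0, 0.1, (VOCAB_SIZE, EMB_DIM))
W_hidden = rng.normal(0.0, 0.1, (EMB_DIM, HIDDEN_DIM))
b_hidden = np.zeros(HIDDEN_DIM)
W_out = rng.normal(0.0, 0.1, (HIDDEN_DIM, N_CLASSES))
b_out = np.zeros(N_CLASSES)


def forward(token_ids):
    """Return class probabilities for one sentence given as word indices."""
    pooled = embeddings[token_ids].mean(axis=0)     # assumption: mean pooling
    hidden = np.tanh(pooled @ W_hidden + b_hidden)  # assumption: tanh activation
    logits = hidden @ W_out + b_out
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    return exp / exp.sum()


def cross_entropy(probs, label):
    """Multinomial cross-entropy for a single example."""
    return -np.log(probs[label])


probs = forward(np.array([3, 17, 42]))
loss = cross_entropy(probs, label=2)
```

During training, the gradient of this loss with respect to `embeddings` is what would move the word vectors towards emotion-aware representations.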

## Classification model for emotion prediction

Two feature sets are compared: features taken directly from an emotion lexicon, and the learnt embeddings.

For the word-embedding model:
    input to model -> sentence representation -> average of the word vectors of all words in the sentence (size=300)
    
For the model with lexicon features:
    input to model -> number of words in the sentence belonging to each category in the lexicon
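
The lexicon-based input can be sketched as a count vector with one entry per emotion category. The lexicon below is a made-up toy mapping, since the notes do not name the lexicon used. `lexicon_counts` and `CATEGORY_ORDER` are hypothetical names.

```python
# Hypothetical toy lexicon: word -> emotion categories it belongs to.
toy_lexicon = {
    "furious": ["anger"],
    "cry": ["sadness"],
    "ashamed": ["shame", "guilt"],
}
CATEGORY_ORDER = ["anger", "sadness", "joy", "fear", "disgust", "guilt", "shame"]


def lexicon_counts(words, lexicon, category_order):
    """Count, per category, how many words of the sentence the lexicon assigns to it."""
    counts = dict.fromkeys(category_order, 0)
    for word in words:
        for category in lexicon.get(word, []):
            counts[category] += 1
    return [counts[c] for c in category_order]


features = lexicon_counts(
    ["i", "was", "ashamed", "and", "furious"], toy_lexicon, CATEGORY_ORDER
)
print(features)  # [1, 0, 0, 0, 0, 1, 1]
```

Unlike the 300-dimensional averaged embedding, this representation has only as many dimensions as there are categories.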
    
Two classification models:
- L2-regularised multi-class Logistic Regression
- SVM

Evaluation:
- 10-fold cross-validation
- macro F1-score: the average of the F1 scores over all emotion labels
- F1: the harmonic mean of precision and recall
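
As a worked illustration of the metric, the following plain-Python sketch computes per-class F1 and averages it. Scoring a class as 0 when its precision or recall is undefined is a convention chosen here, and the labels are toy values.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Per-class F1 here is 0.5, 0.8 and 2/3, so the macro average is about 0.656.
score = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(round(score, 3))
```

Because every class counts equally, a rare emotion that is classified poorly pulls the macro score down as much as a frequent one.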

### Using learnt emotion-enriched word embeddings

In [41]:
len(word2emb)

183712

Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegressionCV
lr_classifier = LogisticRegressionCV(cv=10, random_state=0, multi_class='multinomial').fit(X,y)



In [34]:
# mean accuracy on the same data the model was fitted on (not cross-validated, not macro F1)
lr_classifier.score(X, y)

0.5175263228042116
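
The value above is accuracy on the data used for fitting, so it is not directly comparable with a cross-validated macro F1. One way to put logistic regression on the same footing is sketched below on synthetic data from `make_classification`, which stands in for `X` and `y` and is not the ISEAR features. The fixed `C=1.0` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 7-class problem, used only so the sketch runs on its own.
X_toy, y_toy = make_classification(
    n_samples=210, n_features=20, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

clf = LogisticRegression(C=1.0, max_iter=1000)  # L2 penalty is the default

# Accuracy on the training data: typically optimistic.
train_acc = clf.fit(X_toy, y_toy).score(X_toy, y_toy)

# 10-fold cross-validated macro F1: the same protocol as for the SVM.
cv_f1 = cross_val_score(clf, X_toy, y_toy, cv=10, scoring='f1_macro')
print(train_acc, cv_f1.mean())
```

Reporting `cv_f1.mean()` for both classifiers would make the logistic regression and SVM numbers comparable.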

Support Vector Machine

In [20]:
from sklearn import svm
from sklearn.model_selection import cross_val_score

In [35]:
svm_clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(svm_clf, X, y, cv=10, scoring='f1_macro')
scores.mean()

0.41640952996205804