# Lab3 Assignment: Training and Testing BoW and Averaged Embedding classifiers on a data set with tweets

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this assignment, you will build and apply two classifiers using a data set with emotion labels from the 2017 *WASSA* workshop:

http://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html

### Reference
   Saif M. Mohammad and Felipe Bravo-Marquez. WASSA-2017 Shared Task on Emotion Intensity. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), September 2017, Copenhagen, Denmark.

The texts are tweets and therefore belong to a different genre than the spoken utterances in the conversations of the MELD data set. The data set is included in the distribution of this lab; we aggregated the training data and the test data into a single file each. The notebook already includes the code for loading these tab-separated (TSV) files into Pandas DataFrames.

We also included the functions for computing averaged word embeddings in a separate Python file: **lab3_util.py**. These can be used to create embedding-based representations of the texts.

In [12]:
import pandas as pd
from collections import Counter
import numpy as np
import nltk
from nltk.corpus import stopwords
import gensim
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
import pickle
import sklearn
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import seaborn as sns

## 1. Loading the tweet data set

In [41]:
filepath = './data/wassa/training/all.train.tsv'
dftweets_train = pd.read_csv(filepath, sep='\t')
dftweets_train.head()

| Unnamed: 0 | ID | Tweet | Label | Score |
|---|---|---|---|---|
| 0 | 10000 | How the fu*k! Who the heck! moved my fridge!..... | anger | 0.938 |
| 1 | 10001 | So my Indian Uber driver just called someone t... | anger | 0.896 |
| 2 | 10002 | @DPD_UK I asked for my parcel to be delivered ... | anger | 0.896 |
| 3 | 10003 | so ef whichever butt wipe pulled the fire alar... | anger | 0.896 |
| 4 | 10004 | Don't join @BTCare they put the phone down on ... | anger | 0.896 |


In [6]:
dftweets_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3613 entries, 0 to 3612
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      3613 non-null   int64  
 1   Tweet   3613 non-null   object 
 2   Label   3613 non-null   object 
 3   Score   3613 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 113.0+ KB


In [10]:
##### Test data
filepath = './data/wassa/testing/all.test.tsv'
dftweets_test = pd.read_csv(filepath, sep='\t')

#### 1.1 Extracting the texts and labels for training and testing [1 point]

In [8]:
# HERE COMES THE CODE TO EXTRACT THE TRAINING TEXTS AND LABELS

In [9]:
# HERE COMES THE CODE TO EXTRACT THE TEST TEXTS AND LABELS

#### 1.2 Analysing the data [1 point]

In [24]:
# HERE COMES THE CODE TO GENERATE A BAR CHART FOR THE TRAIN DATA

In [25]:
# HERE COMES THE CODE TO GENERATE A BAR CHART FOR THE TEST DATA 
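
Before plotting, it can help to look at the raw label counts and their relative shares, since the shares are what make the train and test sets comparable. The following is a minimal sketch under stated assumptions: the helper `label_distribution` and the toy label lists are invented for illustration and do not reflect the actual WASSA counts.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, relative frequency)}, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Toy labels, invented for illustration only (NOT the actual WASSA counts)
toy_train_labels = ["anger", "fear", "fear", "joy", "sadness", "fear"]
toy_test_labels = ["anger", "fear", "joy", "joy"]

for name, labels in [("train", toy_train_labels), ("test", toy_test_labels)]:
    print(name)
    for label, (n, share) in label_distribution(labels).items():
        print(f"  {label:<8} {n:>3}  {share:.1%}")
```

With the real data, the `Label` column of each dataframe would take the place of the toy lists, and the same counts could feed a bar chart, for example via seaborn's `countplot`.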

HERE COME YOUR COMMENTS ON THE DISTRIBUTION OF THE TRAIN AND TEST DATA AND A COMPARISON WITH THE DISTRIBUTION IN MELD.

## 2. Training and testing a classifier with averaged word embeddings

#### 2.1 Representing the tweet training and test data using the same word embedding model [1 point]

We can use the same functions that we used in the notebook *Lab3.6.ml.emotion-detection.embeddings.ipynb* to get averaged embeddings for the tweets. To use exactly the same functions and to be able to reuse them, we copied them to a separate Python file, *lab3_util.py*. You can open the file in Jupyter to inspect its content.

We can now import this file just as we import other packages and use it in this notebook as well as in other code. This keeps the notebook readable and compact, and ensures that we always use the same functions and do not accidentally change them across notebooks.

For future coding, it is wise to also apply this to your own code. Put reusable code as functions in a separate Python file with an appropriate name and import this in different notebooks or other Python files. In this way, you develop your own tools over time and reuse them when needed.

We first import the Python file *lab3_util.py* in this notebook so that the functions are loaded in working memory. The file should be located in the same folder as this notebook.

In [28]:
import lab3_util as util

From this point onwards, you can call the functions from this file as follows:

* util.getAvgFeatureVecs(.....)
* util.getMostFrequentWords(.....)

Before calling a function, check which parameters it requires and what it returns.
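
One way to inspect a function without opening its source file is Python's standard `inspect` module. The sketch below is only an illustration: `describe` is a hypothetical helper, and `average_lengths` is a stand-in function invented for this example; it is not part of *lab3_util.py*.

```python
import inspect


def describe(func):
    """Return the call signature and the first docstring line of a function."""
    doc = inspect.getdoc(func) or "(no docstring)"
    return f"{func.__name__}{inspect.signature(func)}: {doc.splitlines()[0]}"


# Stand-in function, invented for this example; it is NOT the code in lab3_util.py
def average_lengths(texts, min_length=1):
    """Return the mean length of the texts with at least min_length characters."""
    kept = [len(t) for t in texts if len(t) >= min_length]
    return sum(kept) / len(kept) if kept else 0.0


print(describe(average_lengths))
```

In the notebook, the same helper could be applied to the imported functions, for example `describe(util.getAvgFeatureVecs)`. The built-in `help(util.getAvgFeatureVecs)` shows similar information.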

In [29]:
# HERE COMES THE CODE TO LOAD THE WORD EMBEDDINGS

In [30]:
# HERE COMES THE CODE TO DERIVE AVERAGED WORD EMBEDDING REPRESENTATIONS FOR THE TRAINING & TEST DATA
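
To make the idea of an averaged embedding concrete, the sketch below averages the vectors of the words in a text and skips words that have no vector. It is an illustration under stated assumptions, not the implementation in *lab3_util.py*: the function `average_embedding`, its parameters, and the three-dimensional toy vectors are all invented for this example.

```python
import numpy as np


def average_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy 3-dimensional vectors, invented for illustration (real models use e.g. 300 dimensions)
toy_vectors = {
    "happy": np.array([1.0, 0.0, 0.0]),
    "sad": np.array([0.0, 1.0, 0.0]),
    "day": np.array([0.0, 0.0, 1.0]),
}

tweet = "happy happy day #blessed".split()  # "#blessed" has no vector and is skipped
print(average_embedding(tweet, toy_vectors, dim=3))
```

Every text ends up with a vector of the same length regardless of how many words it contains, which is what allows the classifier to be trained on it. A text without any known word is mapped to the zero vector in this sketch; other choices are possible.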

As before, we check which words are not in the embedding model's vocabulary.

#### 2.2 Analyse the unknown words in the train and test data [1 point]

In [31]:
# HERE COMES THE CODE TO GET THE LIST OF WORDS UNKNOWN TO THE EMBEDDING MODEL  
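
The check for unknown words boils down to a membership test of every token against the model's vocabulary. The following sketch shows this on toy data; `unknown_words`, the toy vocabulary, and the toy tweets are assumptions made for illustration, and a plain Python set stands in for the embedding model.

```python
from collections import Counter


def unknown_words(tokenized_texts, vocabulary):
    """Count the tokens that do not occur in the vocabulary."""
    return Counter(
        token
        for tokens in tokenized_texts
        for token in tokens
        if token not in vocabulary
    )


# Toy vocabulary and tweets, invented for illustration
toy_vocabulary = {"i", "am", "so", "angry", "happy"}
toy_tweets = [
    "i am so angry @dpd_uk".split(),
    "so happy #blessed #blessed".split(),
]

unknown = unknown_words(toy_tweets, toy_vocabulary)
print(unknown.most_common())
```

Counting the unknown tokens, rather than only listing them, shows whether the gaps are dominated by a few frequent items such as hashtags and user mentions. With a gensim `KeyedVectors` model, the expression `token in model` should work as the membership test in the same way.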

HERE COMES YOUR ANALYSIS OF THE UNKNOWN WORDS AND YOUR EXPECTATION ON THE PERFORMANCE

#### 2.3 Training and testing the classifier [1 point]

In [32]:
# HERE COMES THE CODE TO TRAIN AND TEST AN SVM WITH THE EMBEDDING REPRESENTATION

In [33]:
# HERE COMES THE CODE TO GENERATE A CLASSIFICATION REPORT AND A CONFUSION MATRIX
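
The overall flow of fitting, predicting, and evaluating can be sketched on toy data as follows. The two-dimensional vectors and labels are invented for illustration, and the choice of `LinearSVC` is an assumption; the notebook only imports `svm` and leaves the classifier type open.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import classification_report, confusion_matrix

# Toy 2-dimensional "embeddings" and labels, invented for illustration
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = ["joy", "joy", "anger", "anger"]
X_test = np.array([[0.7, 0.3], [0.3, 0.7]])
y_test = ["joy", "anger"]

# LinearSVC is an assumed choice; other SVM variants follow the same fit/predict pattern
classifier = svm.LinearSVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Fixing the label order keeps the report and the matrix aligned
labels = ["anger", "joy"]
print(classification_report(y_test, y_pred, labels=labels, zero_division=0))
print(confusion_matrix(y_test, y_pred, labels=labels))
```

In the confusion matrix, rows correspond to the true labels and columns to the predicted labels, both in the order given by `labels`.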

#### 2.4 Analysis of the test results [1 point]

HERE COMES YOUR ANALYSIS OF THE RESULT

## 3. Creating and testing a bag-of-words SVM classifier for the tweet data

#### 3.1 Representing the tweets as BoW vectors [1 point]

In [34]:
# HERE COMES THE CODE TO CREATE A BOW REPRESENTATION WITH TF-IDF WEIGHTS FOR THE TRAINING

In [35]:
# HERE COMES THE CODE TO REPRESENT THE TEST DATA ACCORDING TO THE BOW VECTORIZER FITTED ON THE TRAINING DATA
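
The essential point in this step is that the vocabulary and the IDF weights are learned from the training texts only and then reused for the test texts, so that both sets share the same feature space. The sketch below illustrates this with the `CountVectorizer` and `TfidfTransformer` classes imported at the top of the notebook; the toy sentences and variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy texts, invented for illustration
toy_train = ["so angry about the delay", "what a happy day", "angry and sad"]
toy_test = ["happy but furious"]

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()

# fit_transform learns the vocabulary and the IDF weights from the training texts
X_train = tfidf.fit_transform(vectorizer.fit_transform(toy_train))

# transform reuses what was learned; words unseen during training are ignored
X_test = tfidf.transform(vectorizer.transform(toy_test))

print(X_train.shape, X_test.shape)
print(sorted(vectorizer.vocabulary_))
```

Both matrices have the same number of columns, one per word in the training vocabulary. In the toy test text only "happy" was seen during training, so its vector has a single non-zero entry.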

#### 3.2 Training and testing the BOW SVM Classifier [1 point]

In [36]:
# HERE COMES THE CODE TO TRAIN AN SVM CLASSIFIER WITH THE TRAINING DATA

In [37]:
# HERE COMES THE CODE TO APPLY THE CLASSIFIER TO THE TWEET TEST SET

In [38]:
# HERE COMES THE CODE TO GENERATE A CLASSIFICATION REPORT AND CONFUSION MATRIX

#### 3.3 Analyse the test results [1 point]

HERE COMES YOUR ANALYSIS OF THE RESULT.

## 4. Comparison of the results [2 points]

COMPARE THE RESULTS ACROSS THE TWO MODELS IN THIS NOTEBOOK AND ALSO WITH THE TWO MODELS BUILT AND TESTED ON THE MELD DATA

# End of the assignment