# ANLP 2020 - Assignment 1


*Milena Voskanyan, 812148*

<div class="alert alert-block alert-danger">Due: Monday, November 29</div>

<div class="alert alert-block alert-info">

**NOTE**

Please first fill in your name and id number at the top of the assignment, and **rename** the assignment file to **yourlastname-anlp-1.ipynb**<br><br>
Problems and questions are given in blue boxes like this one. All grey and white boxes must be filled by you (they either require code or a (brief!) discussion). <br><br>
Please hand in your assignment by the deadline via Moodle upload. In case of questions, you can reach us via the piazza forum, or by email.<br><br>
<b>For this assignment, do NOT use any external packages (NLTK or any others) EXCEPT where specified.</b>
</div>

<div class="alert alert-block alert-info">
In this assignment, you will implement and work with a Naive Bayes classifier. (Note that for this exercise, you don't need to represent the input as a vector necessarily. You can directly look at the presence of words, and look up the class conditional likelihood.)
<br>
<br>
We will use a Twitter dataset classified into "hate speech" and "non hate speech" (in our data, we have called these classes "offensive" and "nonoffensive" to avoid the charged and inaccurate term "hate speech"). First, load the data (we have provided the function for this):
</div>

In [1]:
import csv
import json
from nltk.tokenize import TweetTokenizer

def read_hate_tweets (annofile, jsonfile):
    """Reads in hate speech data."""
    all_data = {}
    annos = {}
    with open(annofile) as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        for row in csvreader:
            if row[0] in annos:
                # if duplicate with different rating, remove!
                if row[1] != annos[row[0]]:
                    del(annos[row[0]])
            else:
                annos[row[0]] = row[1]

    tknzr = TweetTokenizer()
                
    with open(jsonfile) as jsonfile:
        for line in jsonfile:
            twtjson = json.loads(line)
            twt_id = twtjson['id_str']
            if twt_id in annos:
                all_data[twt_id] = {}
                all_data[twt_id]['offensive'] = "nonoffensive" if annos[twt_id] == 'none' else "offensive"
                all_data[twt_id]['text_tok'] = tknzr.tokenize(twtjson['text'])

    # split training and test data:
    all_data_sorted = sorted(all_data.items())
    items = [(i[1]['text_tok'],i[1]['offensive']) for i in all_data_sorted]
    splititem = len(all_data)-3250
    train_dt = items[:splititem]
    test_dt = items[splititem:]
    print('Training data:',len(train_dt))
    print('Test data:',len(test_dt))

    return(train_dt,test_dt)

In [2]:
TWEETS_ANNO = 'NAACL_SRW_2016.csv'
TWEETS_TEXT = 'NAACL_SRW_2016_tweets.json'

(train_data,test_data) = read_hate_tweets(TWEETS_ANNO,TWEETS_TEXT)

Training data: 12896
Test data: 3250


<div class="alert alert-block alert-info">
Each item in our data consists of a tuple of the tweet text and its label (represented as a string). The tweet text has been tokenized and is represented as a list of words. We can look at an example item:
</div>

In [3]:
print(test_data)



## Problem 1: Evaluation [15 pts]

<div class="alert alert-block alert-info">

The first thing you're being asked to do is to provide evaluation functions for a classifier and a given labelled test set. Assume that the classifier has a `predict()` function that takes an item in the form of a list as above and predicts a class for that item. Write evaluation functions to compute the `accuracy` and `f_1` score for such a classifier. (To test your functions without having access to a real `predict()` function, you could simulate one that makes random predictions.)

</div>

In [4]:
def accuracy(classifier, data):
    """Computes the accuracy of a classifier on reference data.

    Args:
        classifier: A classifier.
        data: Reference data.

    Returns:
        The accuracy of the classifier on the test data, a float.
    """
    
    for i in data:
        # splitting data into two variables, one containing only tweets and the other - labels
        sents = i[0]
        true_label = i[1]
        
        # getting the predicted class
        prediction = classifier.predict(sents)
        
        # considering 'offensive' as positive values bc it is the target to be found
        # if predicted label is the same as true label
        if prediction == true_label:
            # and the predicted one is 'offensive'
            if prediction == 'offensive':
                # then true positive has to gain + 1
                classifier.true_pos += 1
            else:
                # if not, then true negative gets + 1
                classifier.true_neg += 1
        # else if predicted label is not the same as true label
        else:
            # and if the predicted label is 'offensive'
            if prediction == 'offensive':
                # then false positve gets + 1
                classifier.false_pos += 1
            else:
                # if it's not 'offensive' - false negative gains + 1
                classifier.false_neg += 1
                
    # calculating accuracy
    numerator = classifier.true_pos+classifier.true_neg   
    denominator = classifier.true_pos+classifier.true_neg+classifier.false_pos+classifier.false_neg
    accur = numerator / denominator
    
    return accur

def f_1(classifier, data):
    """Computes the F_1-score of a classifier on reference data.

    Args:
        classifier: A classifier.
        data: Reference data.

    Returns:
        The F_1-score of the classifier on the test data, a float.
    """
    # calculate precision, recall for f1 score with the tp, fp, tn, fn values gained via accuracy function, stored in NB class
    precision = classifier.true_pos / (classifier.true_pos + classifier.false_pos)
    recall = classifier.true_pos / (classifier.true_pos + classifier.false_neg)
    
    # calculating f1 score
    f1 = (2 * precision * recall) / (precision + recall)
    
    return f1

## Problem 2: Naive Bayes Classifier [35 pts]

<div class="alert alert-block alert-info">
Next, implement the Naive Bayes classifier from scratch using the code skeleton below and the definitions from class.<br><br>

Some requirements and notes for implementation:

<ul>
<li> You should allow for an arbitrary number of classes (in particular, you should not hard code the two classes needed for the given dataset). 
<li> The vocabulary of your classifier should be created dynamically from the training data. (The vocabulary is the set of all words that occur in the training data.).
<li> Use additive smoothing with a provided parameter k. 
<li> You may encounter unknown words at test time. Since we're not allowed to "peek" into the test set, we will implement the following simple treatment: We will assume that we don't know anything about unknown words and that in particular, their presence does not tell us anything about which class a document should be assigned to. Therefore, we will not include them in the calculation of the (log) probabilities during prediction, under the assumption that their probability does not differ hugely between the different classes (probably not a correct assumption, but the best we can do at this point). Since we don't need correct probabilities but only most likely classes, just ignore unknown words during prediction.
<li> Use log probabilities in order to avoid underflow.
</ul>

</div>

In [5]:
import math
from collections import Counter
class NaiveBayes(object):
    
    def __init__(self):
        """Initialises a new classifier."""
        # for accuracy and f1 score calculation
        self.true_pos = 0 
        self.true_neg = 0
        self.false_pos = 0
        self.false_neg = 0

    def predict(self, x):
        """Predicts the class for a document.

        Args:
            x: A document, represented as a list of words.

        Returns:
            The predicted class, represented as a string.
        """
        final = {}
        
        # getting classes
        for c in self.classes:
            likelihood = self.logprior[c]
            
            # getting words
            for w in x:
                
                # checking if the word is in the vocabulary
                if w in self.vocabulary:
                    likelihood = likelihood + self.loglikelihood[c][w]
                    
            # appending class and it's likelihood to a dict
            final[c] = likelihood
            
            # setting to 0, so that it gets the prior value of the next class
            likelihood = 0
            
        # getting the class with the highest number   
        prediction = max(final, key = final.get) 
        
        return prediction
    
    @classmethod
    def train(cls, data, k=1):
        """Train a new classifier on training data using maximum
        likelihood estimation and additive smoothing.

        Args:
            cls: The Python class representing the classifier.
            data: Training data.
            k: The smoothing constant.

        Returns:
            A trained classifier, an instance of `cls`.
        """  
        dictionary = {}
        results = {}
        likelihood = {}
        n_classes = {}
        loglikelihood = {}
        logprior = {}
        vocabulary = []
        classes = []
        
        # SEPARATING BY CLASS
        # creating a dictionary containing classes as keys and words as values
        for v, ke in data:
            # counting each class occurences and making a dict
            if ke in n_classes.keys():
                n_classes[ke] +=1
            else:
                n_classes[ke] = 1
            for w in v:
                if ke in dictionary.keys():
                    dictionary[ke].append(w)
                else:
                    dictionary[ke] = [w]
                    
        # iterating through the dict values and getting the N of times each word appears in a class
        # creating a nested dictionary 
        for c, voc in dictionary.items():
            values = Counter(voc)
            results[c] = values
            
        # for smoothing
        for key, value in results.items():
            vocabulary += list(value.keys())
            
        #LOGLIKELIHOOD
        for clss, wordlist in results.items():
            classes.append(clss)
            loglikelihood[clss] = {}
            for word in vocabulary:
                loglikelihood[clss][word] = math.log((wordlist[word] + k)/ (len(wordlist.keys()) + len(word)))    
  
        #LOGPRIOR
        for c, v in n_classes.items():
            logprior[c] = math.log(v / len(data))
        
        classifier = cls()
        classifier.classes = classes 
        classifier.vocabulary = vocabulary
        classifier.loglikelihood = loglikelihood
        classifier.logprior = logprior
        
        return classifier

<div class="alert alert-block alert-info">
Evaluate your classifier by training and testing it on the given data. Vary the smoothing parameter k. What happens when you decrease k? Plot a graph of the accuracy and/or f-score given different values of k. Discuss your findings.</div>

In [6]:
nb = NaiveBayes.train(train_data)
print("Accuracy: ", accuracy(nb, test_data))
print("F_1: ", f_1(nb, test_data))


Accuracy:  0.8566153846153847
F_1:  0.5458089668615985


In [None]:
accur = []
f1 = [] 
# creating a list of several k values
k = [0.2, 0.9, 1, 4]
# looping through the list to train NB classifier on different k values
for n in k:
    nb = NaiveBayes.train(train_data, n)
    # appending the results
    accur.append(accuracy(nb, test_data))
    f1.append(f_1(nb, test_data))
print('ac: ', accur)
print('f1: ', f1)

In [None]:
import matplotlib.pyplot as plt
# plotting the results of f1
plt.plot(k, f1)
plt.ylabel('F_1 score')
plt.xlabel('Smoothing with different k values')
plt.show()
# plotting the results of accuracy
plt.plot(k, accur)
plt.ylabel('Accuracy')
plt.xlabel('Smoothing with different k values')
plt.show()
''' In case of accuracy, k = .9 gives the highest result, compared to the rest k values.
    However, when looking at the f1 scores, k = 1 has the best result '''

## Problem 3: Feature Engineering [20 pts]

<div class="alert alert-block alert-info">

We mentioned that the Naive Bayes classifier can be used with many different feature types. Try to improve on the basic bag of words model by changing the feature list of your model. Implement at least two variants. For each, explain your motivation for this feature set, and test the classifier with the given data. Briefly discuss your results!<br><br> 
Ideas for feature sets that were mentioned in class include:

<ul>
<li> removing stop words or frequent words
<li> stemming or lemmatizing (you can use NLTK or spacy.io for basic NLP operations on the texts)
<li> introducing part of speech tags as features (how?)
<li> bigrams
</ul>

</div>

In [None]:
'''Assuming that combination of some words would appear more often then others in offensive speech,
    I chose to apply bigrams as features'''
import nltk
def bigram(data):
    # looping through data to get the sentences
    result = []
    for parts in data:
        # creating bigrams
        bi_gram = nltk.bigrams(parts[0])
        # appending them and the labels to a new list
        result.append(([*map(' '.join, bi_gram)], parts[1]))
    return result
#bigram(train_data)

In [None]:
new_train = bigram(train_data)
new_test = bigram(test_data)
nb = NaiveBayes.train(new_train)
print('Accuracy: ', accuracy(nb, new_test))
'''In case of bigrams, the performace of the classifier is obviously worse.
    I guess, for this type of a problem, unigram might work more efficiently.'''

In [None]:
'''Since stopwords are not provoding any important information for the case of 
    detecting offensive speech but are taking a lot of space in the dataset,
    that is why I decidded to choose to remove them'''
from nltk.corpus import stopwords
# storing stopwords in a variable to make the code run quicker
stopword = set(stopwords.words())
def stopwords_remove(data):
    stopwords_removed = []
    for parts in data:
        filtered = [word for word in parts[0] if word not in stopword]
        stopwords_removed.append((filtered, parts[1]))
    return stopwords_removed
#stopwords_remove(train_data)

In [None]:
train_ = stopwords_remove(train_data)
test_ = stopwords_remove(test_data)
nb = NaiveBayes.train(train_)
print('Accuracy: ', accuracy(nb, test_))
'''However, to my surprise, the accuracy is slightly worse than in case of the original dataset'''