# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the Lab 3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (scikit-learn / Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's `classification_report`. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to give insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Extra stuff that we need for the assignment

### Importing extra libraries

In [None]:
import spacy
import pathlib
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report
from sklearn.datasets import load_files
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer


### Objects for the extra libraries

In [None]:
nlp = spacy.load('en_core_web_sm')
vader_model = SentimentIntensityAnalyzer()

### Function to run VADER with SpaCy

In [None]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a sentence from spacy
    
    :param str textual_unit: a textual unit, e.g., a sentence or several sentences
    (one string; it is processed sentence by sentence by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

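                # older spaCy versions lemmatize pronouns to '-PRON-';
                # fall back to the original word in that case
                if to_add == '-PRON-': 
                    to_add = token.text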

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        # print the full input; `sent` would only hold the last sentence of the loop
        print('INPUT SENTENCE', textual_unit)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores
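
The effect of the `lemmatize` and `parts_of_speech_to_consider` settings can be seen without loading spaCy. The following is a minimal sketch of the selection logic only, assuming hand-written `(text, lemma, pos)` tuples in place of spaCy tokens; the helper name `build_vader_input` and the tags are illustrative, and the `-PRON-` special case is left out.

```python
def build_vader_input(tokens, lemmatize=False, parts_of_speech=None):
    """Mimic the token selection in run_vader on (text, lemma, pos) tuples."""
    selected = []
    for text, lemma, pos in tokens:
        word = lemma if lemmatize else text
        # None or an empty set means: keep every part of speech
        if not parts_of_speech or pos in parts_of_speech:
            selected.append(word)
    return ' '.join(selected)


# hand-annotated example (assumed tags, not produced by spaCy)
tokens = [("He", "he", "PRON"), ("lies", "lie", "VERB"), ("in", "in", "ADP"),
          ("the", "the", "DET"), ("garden", "garden", "NOUN")]

print(build_vader_input(tokens))                            # all words, as written
print(build_vader_input(tokens, lemmatize=True))            # lemmas instead of words
print(build_vader_input(tokens, parts_of_speech={"VERB"}))  # verbs only
```

Note that a part-of-speech filter can leave very little text for VADER to score, which matters for the experiments in Question 4.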

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?). What do all scores mean?


### [3.5 points] Question 1:

Consider the following sentences and their output as given by VADER. Explain the outcome **for each sentence** (1 to 7), taking into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each result, explain why VADER gets it right or what goes wrong.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

**Explanation:**

*Sentence 1:* The sentence is classified as quite strongly positive, as can be observed from the compound score. There are no negative words, some neutral words, and one positive word. In this case, the positive word 'love' raises the positivity score of the sentence to 0.808. The remaining 0.192 is the neutral part, 'I' and 'apples'.

*Sentence 2:* In contrast to the previous sentence, this one leans towards the negative side (compound -0.5216), which is due to the negation 'don't' in front of the word 'love'. As a result, the positivity score is 0, the negativity score is 0.627 ('don't love'), and the neutral score is 1 - 0.627 = 0.373.

*Sentence 3:* This is the same sentence as sentence 1, but the addition of an emoticon (which is considered positive according to VADER's lexicon) causes a small increase in the positivity score, meaning the sentence becomes less neutral in exchange.

*Sentence 4:* No positive words are detected. However, the word 'ruins' is considered negative by the VADER lexicon, so it brings the compound score down to -0.4404 and yields a negativity score of 0.492.

*Sentence 5:* The same happens here as between sentences 1 and 2: the polarity of sentence 4 is reversed by the negation 'certainly not', which applies to the negative word 'ruins'. Additionally, the positivity score increases even more because of 'certainly' in front of 'not'.

*Sentence 6:* This sentence has an unreasonably negative compound score of -0.4215. The problem here is that the word 'lies' is interpreted by the VADER lexicon as the act of lying (not telling the truth), whilst it has a different meaning in this context (lying in the chair). A reasonable outcome would be a neutral sentence (compound score 0.0, neutral 1.0).

*Sentence 7:* The same happens here as in sentence 6, but this time with a higher positivity score than one would expect. Here, the word 'like' is interpreted in the sense of liking someone or something, whilst 'like' is actually used as a term of comparison. Again, a more reasonable outcome would be a fully neutral sentence.
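
The compound scores for sentences 1 and 2 can be reproduced by hand. The sketch below assumes three values that should be checked against the VADER lexicon and source code before relying on them: a lexicon valence of 3.2 for 'love', a negation multiplier of -0.74, and a normalisation constant of 15. It only covers the case of a single sentiment-bearing word.

```python
import math

ALPHA = 15               # normalisation constant (assumption, see lead-in)
NEGATION_SCALAR = -0.74  # multiplier applied to a negated word (assumption)
LOVE_VALENCE = 3.2       # lexicon valence of 'love' (assumption)


def normalize(valence_sum, alpha=ALPHA):
    """Squash a summed valence into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


compound_1 = normalize(LOVE_VALENCE)                    # "I love apples"
compound_2 = normalize(LOVE_VALENCE * NEGATION_SCALAR)  # "I don't love apples"
print(round(compound_1, 4), round(compound_2, 4))
```

Under these assumptions the two values agree with the compound scores listed above for sentences 1 and 2 (0.6369 and -0.5216), which illustrates that the negation rule flips and dampens the valence of 'love' rather than looking up a separate negative word.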

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [None]:
import json

In [None]:
# open the file in a with-block so that it is closed after loading
with open('my_tweets.json', encoding='utf8') as infile:
    my_tweets = json.load(infile)

In [None]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'neutral', 'text_of_tweet': "Alright I think we've reached the finish line for fast charging phones. Can we do laptops next?", 'tweet_url': 'https://twitter.com/mkbhd/status/1630590355610411010?s=46&t=G9r_bOBlQ1uUeP9ZHluc5g'}
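
Because the labels are typed by hand, a quick sanity check on the annotation file can catch typos before they distort the evaluation. This is a minimal sketch, assuming the three keys and three labels described above; the helper name `find_annotation_problems` and the two sample entries are made up for illustration.

```python
import json

ALLOWED_LABELS = {"negative", "neutral", "positive"}
REQUIRED_KEYS = {"sentiment_label", "text_of_tweet", "tweet_url"}


def find_annotation_problems(tweets):
    """Return (tweet id, description) pairs for entries that look wrong."""
    problems = []
    for id_, info in tweets.items():
        missing = REQUIRED_KEYS - set(info)
        if missing:
            problems.append((id_, f"missing keys: {sorted(missing)}"))
        elif info["sentiment_label"] not in ALLOWED_LABELS:
            problems.append((id_, f"unexpected label: {info['sentiment_label']!r}"))
    return problems


# hypothetical sample in the same shape as my_tweets.json
sample = json.loads("""
{
    "1": {"sentiment_label": "neutral", "text_of_tweet": "a", "tweet_url": "u1"},
    "2": {"sentiment_label": "postive", "text_of_tweet": "b", "tweet_url": "u2"}
}
""")

problems = find_annotation_problems(sample)
print(problems)
```

The same function could be called on `my_tweets` after loading; an empty list means every entry has the expected keys and one of the three labels.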


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [None]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
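
The mapping above treats only a compound score of exactly 0.0 as neutral. A commonly cited alternative, described in the VADER documentation, uses a small band around zero (0.05 on either side). The sketch below is a hypothetical variant, not part of the assignment code, and whether it would improve the scores on these tweets is not tested here.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a compound score to a label, with a neutral band around zero."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# compound scores taken from the example sentences in Question 1, plus a near-zero value
for score in (0.3612, -0.4215, 0.01):
    print(score, compound_to_label(score))
```

With `threshold=0.0` the positive and negative branches would swallow the neutral case, so a strictly zero band needs the original function.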

In [None]:
tweets = []
all_vader_output = []
gold = []

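# settings (to change for different experiments)
to_lemmatize = False  # set to True to give lemmas to VADER
pos = set()           # empty set: all parts of speech are used

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    # pass the settings on to run_vader so that changing them has an effect
    vader_output = run_vader(the_tweet,
                             lemmatize=to_lemmatize,
                             parts_of_speech_to_consider=pos)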
    vader_label = vader_output_to_label(vader_output) # convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
# use scikit-learn's classification report
print(classification_report(gold, all_vader_output))

              precision    recall  f1-score   support

    negative       0.89      0.47      0.62        17
     neutral       0.60      0.56      0.58        16
    positive       0.62      0.94      0.74        17

    accuracy                           0.66        50
   macro avg       0.70      0.66      0.65        50
weighted avg       0.70      0.66      0.65        50



The classification report above lists precision, recall, and F<sub>1</sub> for each of the three classes (negative, neutral, and positive), followed by the overall accuracy and the macro and weighted averages of the three scores. The macro average is the unweighted mean over the classes, so every class counts equally, while the weighted average weights each class by its support (the number of tweets with that gold label). As shown above, the two are the same here, as expected, because we created a balanced dataset in which the three classes have more or less the same number of tweets (17, 16, and 17). As a result, these two lines are not that relevant here. Precision shows, for each class, how many tweets were classified correctly relative to the total number of tweets that were **classified** as that class, whilst recall shows how many tweets were classified correctly relative to the number of tweets that **actually** belong to that class. Our main focus is the F<sub>1</sub> score, which combines precision and recall into a single number per class (their harmonic mean). It should not be confused with accuracy, which is the share of all tweets that received the correct label (0.66 here).
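
As a worked check of these definitions, the row for the negative class can be recomputed by hand. In the label comparison printed further down in this notebook, 8 negative tweets are labelled negative by VADER, 1 neutral tweet is labelled negative, and 9 negative tweets receive another label. The function name below is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp)  # correct among everything predicted as this class
    recall = tp / (tp + fn)     # correct among everything that truly is this class
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# negative class: 8 true positives, 1 false positive, 9 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=1, fn=9)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

This prints `precision=0.89 recall=0.47 f1=0.62`, matching the negative row of the report: VADER is rarely wrong when it says negative, but it misses more than half of the negative tweets.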

The code cell after this error analysis prints, for each tweet in the my_tweets file, the gold sentiment label next to the label assigned by VADER. Mismatches are marked with `!!!`.

*Misclassified negatives:*
* tweet 5: "Olatubosun Kuku, a top INEC consultant, said Bola Tinubu was offering up to $170 million in bribes to electoral officers." 

-> Misclassified as positive. Though the sentence is about a negative topic, the individual words and parts of the sentence are for the most part considered positive, with higher scores according to the VADER lexicon. E.g., 'bribes' has a score of -0.8 while 'top' has a score of 0.8. Another positive part is the fact that someone (Bola Tinubu) was offering money, which again makes the sentence positive.

* tweet 6: "BREAKING: Binance used customer deposits for its own undisclosed purposes, according to a report from Forbes." 

-> Misclassified as neutral, probably because none of the words are part of the VADER lexicon.

* tweet 8: "Nigerian stock market has crashed again for the third time in a row 2015, 2019 and 2023 !!!" 

-> Misclassified as neutral. The sentence is negative because of 'crashed again', which is emphasized by the exclamation marks. It must have been misclassified as neutral because neither '!' nor the word 'crashed' exists in vader_lexicon.txt, so VADER cannot score them as positive, neutral, or negative. The word 'crash' is listed with a score of -1.7, however, so reducing words to their base form (stemming or lemmatisation) is important in cases like this. Moreover, many other words in the tweet are not in the lexicon, which adds to the reason it was misclassified.

* tweet 26: "Democrats scrounging up votes from mystical places again...." 

-> Misclassified as neutral. Same issue as for tweet 8. Important words: scrounging, mystical

* tweet 28: "Having a bit of AI existential angst today" 

-> Misclassified as neutral. Same issue as tweet 8. Important words: existential, angst

* tweet 34: "Despite this scam being widely reported on social media for days - it still continues brazenly. Clearly the govt & HDFC care a rats ass about cybersecurity of its citizens/customers. No wonder then that hackers can come in & pulverise AIIMS for days and the Govt can’t do a thing." 

-> Misclassified as positive. The words 'scam' and 'no' have a negative score. However, the rest of the words do not really have a negative rating, so we can assume that the share of negative scores in the whole text was not high enough for VADER to classify it as a negative or neutral tweet.

* tweet 41: "Me after hearing hurtful words and trying my best not to cry" 

-> Misclassified as positive. Same reason as tweet 8, but the other way around. Although the words 'hurtful' and 'cry' are in the VADER lexicon, VADER was unable to classify the tweet as negative. This could be because the word 'best' has a very high score compared to the words 'cry' and 'hurtful'.

* tweet 42: "just wanna sleep and feel nothing" 

-> Misclassified as neutral. This text must have been misclassified as neutral because the individual words are neutral and emotionless, whereas a human reader can sense that the person who wrote this tweet is clearly tired and pretty done with life.

* tweet 43: "i have nothing poetically sad to say i simply want to kiII myself" 

-> Misclassified as positive. The word 'sad' has a negative score. The word 'kiII' is spelled with two capital I's instead of a double l, so VADER could not match it, even though 'kill' is in the VADER lexicon with an extremely low score. However, it is not clear why this tweet was classified as positive, because the remaining words are not in the lexicon except for 'want', which has a mildly positive value of 0.3.

*Misclassified neutrals:*
* tweet 1: "Alright I think we've reached the finish line for fast charging phones. Can we do laptops next?" 

-> Misclassified as positive. 'Alright' and the rest of the words are either neutral or have a high positive value, which is why the tweet was classified as positive. Looking at the overall meaning, however, the tweet acknowledges an achievement but mainly asks whether the next step can be taken, which makes it neutral.

* tweet 4: "Should Man United have achieved more in recent years with the money they've spent? 🤔" 

-> Misclassified as positive. 'Achieve' has a positive score that affects the score of the sentence.

* tweet 30: "The sausage-fingered prop hands sold \\$55,000, while a pair of knitted gloves to fit them made \\$4,000" 

-> Misclassified as positive. The word fit has a positive score that affects the general score of the tweet.

* tweet 32: "London calling. Anya Taylor-Joy, Cynthia Erivo, Florence Pugh and Yusra Mardini step out on the red carpet at the 76th British Academy Film Awards wearing Tiffany designs. #BAFTAS #TiffanyAndCo" 

-> Misclassified as positive. The tweet mentions awards, which can push the score towards positive.

* tweet 36: "we’ll begin replacing that “official” label with a gold checkmark for businesses, and later in the week a grey checkmark for government and multilateral accounts" 

-> Misclassified as positive. The word 'gold' can cause this sentence to be interpreted as positive, although it is a neutral update.

* tweet 40: "🚨 Manchester City are willing to let Bernardo Silva leave in the summer if a bid of €80M comes in." 

-> Misclassified as negative. The sentence refers to a player leaving, which can be interpreted as negative rather than as neutral news.

* tweet 49: "Will Harry Kane win a trophy before he retires? 🏆" 

-> Misclassified as positive. The sentence includes words such as 'win' and 'trophy', which have positive scores according to the VADER lexicon.

*Misclassified positives:*
* tweet 9: "Wahoo! The #SuperMarioMovie is moving from April 7 to April 5 in the US and in more than 60 markets around the world. The movie hits theaters in additional markets in April and May, with Japan opening April 28." 

-> Misclassified as neutral. Although the sentence opens with a positive word, the rest of the tweet is of a neutral nature according to the VADER lexicon.

In [None]:
for i in range(len(gold)):
    if gold[i] != all_vader_output[i]:
        print('!!!', gold[i], all_vader_output[i])
    else:
        print(gold[i], all_vader_output[i])

!!! neutral positive
neutral neutral
neutral neutral
!!! neutral positive
!!! negative positive
!!! negative neutral
negative negative
!!! negative neutral
!!! positive neutral
positive positive
positive positive
positive positive
negative negative
negative negative
negative negative
positive positive
neutral neutral
neutral neutral
neutral neutral
positive positive
positive positive
negative negative
positive positive
neutral neutral
positive positive
!!! negative neutral
positive positive
!!! negative neutral
neutral neutral
!!! neutral positive
positive positive
!!! neutral positive
positive positive
!!! negative positive
negative negative
!!! neutral positive
positive positive
negative negative
neutral neutral
!!! neutral negative
!!! negative positive
!!! negative neutral
!!! negative positive
positive positive
positive positive
positive positive
negative negative
positive positive
!!! neutral positive
neutral neutral
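
To see which kinds of confusion dominate, the mismatches can be tallied per (gold, predicted) pair. A minimal sketch, using only the first ten rows of the printed comparison above as sample data; the helper name `count_confusions` is illustrative, and in the notebook the full `gold` and `all_vader_output` lists could be passed in instead.

```python
from collections import Counter


def count_confusions(gold_labels, predicted_labels):
    """Count how often each (gold, predicted) mismatch occurs."""
    return Counter(
        (g, p) for g, p in zip(gold_labels, predicted_labels) if g != p
    )


# first ten rows of the comparison printed above
sample_gold = ['neutral', 'neutral', 'neutral', 'neutral', 'negative',
               'negative', 'negative', 'negative', 'positive', 'positive']
sample_pred = ['positive', 'neutral', 'neutral', 'positive', 'positive',
               'neutral', 'negative', 'neutral', 'neutral', 'positive']

confusions = count_confusions(sample_gold, sample_pred)
for (gold_label, predicted_label), count in confusions.most_common():
    print(f"{gold_label} -> {predicted_label}: {count}")
```

Sorting the error analysis by the most frequent pair makes it easier to look for shared causes, such as negative tweets whose key words are missing from the lexicon.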


### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

If we look at the classification reports obtained from running VADER as is and after lemmatisation, there are barely any differences. A change in scores of around 0.01 is not large enough to conclude that lemmatisation has any meaningful effect.

Adjectives seem to be the most informative part of speech: with adjectives only, accuracy is 0.50, against 0.47 with verbs only and 0.42 with nouns only. The differences are not drastic, but they are noticeable. Nonetheless, every part-of-speech filter scores clearly below running VADER as is or with just lemmatisation (accuracy 0.63 and 0.62). With a filter, recall for the neutral class goes up (0.78 to 0.89) while its precision drops (0.36 to 0.40), and recall for the negative and positive classes falls sharply. This suggests that many filtered tweets no longer contain any lexicon word and therefore end up as neutral.

#### Loading the files

In [None]:
# build the path from separate parts so that it also works outside Windows
airline_tweets_folder = str(pathlib.Path.cwd().joinpath('airlinetweets', 'airlinetweets'))
print(airline_tweets_folder)
airline_tweets = load_files(airline_tweets_folder)

airline_tweets_str = []
airline_tweets_labels = []

# load_files returns each tweet as bytes: decode it to a string (UTF-8 is assumed)
# and map the numeric target to its label name
for tweet_bytes, target in zip(airline_tweets.data, airline_tweets.target):
    airline_tweets_str.append(tweet_bytes.decode('utf-8', errors='replace'))
    airline_tweets_labels.append(airline_tweets.target_names[target])

C:\Users\loafo\Documenten\University\Courses, Lectures\Year 3\Text Mining\airlinetweets\airlinetweets
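
The tweets returned by `load_files` are bytes, so they have to be turned into strings before they go to spaCy and VADER. The small illustration below, on a made-up tweet, shows why decoding is preferable to slicing the printed form of a bytes object: slicing keeps escape sequences as literal text.

```python
# a made-up tweet with a curly apostrophe and a trailing newline, as UTF-8 bytes
raw = "I can\u2019t believe it\n".encode("utf-8")

sliced = str(raw)[2:-1]        # strips the b'...' wrapper from the printed form
decoded = raw.decode("utf-8")  # converts the bytes back to real text

print(repr(sliced))
print(repr(decoded))
```

In the sliced version the apostrophe and the newline survive only as backslash escapes, which the tokenizer and the VADER lexicon lookup would see as extra characters.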


#### Running VADER as it is

In [None]:
vader_normal = [vader_output_to_label(run_vader(i, verbose=0)) for i in airline_tweets_str]

In [None]:
print(classification_report(airline_tweets_labels, vader_normal))

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.66      0.63      0.62      4755



#### Running VADER after lemmatisation

In [None]:
vader_lemmatise = [vader_output_to_label(run_vader(i, verbose=0, lemmatize=True)) for i in airline_tweets_str]
    

In [None]:
print(classification_report(airline_tweets_labels, vader_lemmatise))

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.61      4755
weighted avg       0.65      0.62      0.62      4755



#### Running VADER on only adjectives

In [None]:
vader_adjectives = [vader_output_to_label(run_vader(i, verbose=0, parts_of_speech_to_consider={'ADJ'})) for i in airline_tweets_str]

In [None]:
print(classification_report(airline_tweets_labels, vader_adjectives))

              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.67      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755



#### Running VADER on adjectives + lemmatisation

In [None]:
vader_adjectives_lemmatise = [vader_output_to_label(run_vader(i, verbose=0, lemmatize=True, parts_of_speech_to_consider={'ADJ'})) for i in airline_tweets_str]    

In [None]:
print(classification_report(airline_tweets_labels, vader_adjectives_lemmatise))

              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.66      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755



#### Running VADER on only nouns

In [None]:
vader_nouns = [vader_output_to_label(run_vader(i, verbose=0, parts_of_speech_to_consider={'NOUN'})) for i in airline_tweets_str]

In [None]:
print(classification_report(airline_tweets_labels, vader_nouns))

              precision    recall  f1-score   support

    negative       0.73      0.14      0.24      1750
     neutral       0.36      0.81      0.50      1515
    positive       0.53      0.34      0.41      1490

    accuracy                           0.42      4755
   macro avg       0.54      0.43      0.38      4755
weighted avg       0.55      0.42      0.38      4755



#### Running VADER on nouns + lemmatisation

In [None]:
vader_nouns_lemmatise = [vader_output_to_label(run_vader(i, verbose=0, lemmatize=True, parts_of_speech_to_consider={'NOUN'})) for i in airline_tweets_str]    

In [None]:
print(classification_report(airline_tweets_labels, vader_nouns_lemmatise))

              precision    recall  f1-score   support

    negative       0.71      0.16      0.26      1750
     neutral       0.36      0.81      0.50      1515
    positive       0.52      0.33      0.40      1490

    accuracy                           0.42      4755
   macro avg       0.53      0.43      0.39      4755
weighted avg       0.54      0.42      0.38      4755



#### Running VADER on only verbs

In [None]:
# use a separate name so that the noun results are not overwritten
vader_verbs = [vader_output_to_label(run_vader(i, verbose=0, parts_of_speech_to_consider={'VERB'})) for i in airline_tweets_str]

In [None]:
print(classification_report(airline_tweets_labels, vader_verbs))

              precision    recall  f1-score   support

    negative       0.78      0.29      0.42      1750
     neutral       0.38      0.81      0.52      1515
    positive       0.57      0.34      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.58      0.48      0.46      4755
weighted avg       0.59      0.47      0.45      4755



#### Running VADER on verbs + lemmatisation

In [None]:
# use a separate name so that the lemmatised noun results are not overwritten
vader_verbs_lemmatise = [vader_output_to_label(run_vader(i, verbose=0, lemmatize=True, parts_of_speech_to_consider={'VERB'})) for i in airline_tweets_str]

In [None]:
print(classification_report(airline_tweets_labels, vader_verbs_lemmatise))

              precision    recall  f1-score   support

    negative       0.74      0.30      0.42      1750
     neutral       0.38      0.78      0.51      1515
    positive       0.57      0.35      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.56      0.48      0.46      4755
weighted avg       0.57      0.47      0.45      4755
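
The eight experiments above all follow the same pattern and differ only in two settings. As a possible way to keep them in sync, the settings can be generated as a grid; this is a sketch, and the helper name `experiment_grid` and the `name` strings are illustrative. The assignment still asks for a separate code cell per experiment, so this is meant as a cross-check rather than a replacement.

```python
from itertools import product


def experiment_grid():
    """Return the eight VADER settings as keyword arguments for run_vader."""
    pos_options = [None, {'ADJ'}, {'NOUN'}, {'VERB'}]
    settings = []
    for pos_set, lemmatize in product(pos_options, [False, True]):
        name = 'all' if pos_set is None else '+'.join(sorted(pos_set))
        name += ' / lemmatized' if lemmatize else ' / raw'
        settings.append({'name': name,
                         'lemmatize': lemmatize,
                         'parts_of_speech_to_consider': pos_set})
    return settings


for setting in experiment_grid():
    print(setting['name'])

# In the notebook, each setting could then be evaluated like this:
# predictions = [vader_output_to_label(run_vader(
#                    t, lemmatize=setting['lemmatize'],
#                    parts_of_speech_to_consider=setting['parts_of_speech_to_consider']))
#                for t in airline_tweets_str]
# print(classification_report(airline_tweets_labels, predictions))
```

Storing each list of predictions under its own name or dictionary key avoids accidentally overwriting the results of an earlier experiment.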



## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

When comparing TF-IDF to Bag of words, the difference in scores is minimal, but Bag of words seems to be the better performing method in this case: its accuracy scores are higher than those of TF-IDF.

When considering Bag of words, the frequency threshold affects the scores negatively, as in most cases they decrease slightly when the threshold increases. However, the effect is so small (a change of approximately 0.01 to 0.05) that it could also be seen as negligible. For TF-IDF there is no sign of a consistent effect: the scores increase or decrease a little between the different thresholds, without any clear increasing or decreasing pattern.
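
What the threshold does can be illustrated without scikit-learn: with an integer `min_df`, a term is kept only if it occurs in at least that many documents. The sketch below uses plain whitespace tokenisation and four made-up tweets, so it is a simplification of what `CountVectorizer` does; the function name is illustrative.

```python
from collections import Counter


def vocabulary_with_min_df(documents, min_df):
    """Keep the terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in documents:
        # a set, so that each term is counted once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = ["flight delayed again", "flight cancelled",
        "great flight crew", "great crew"]

for threshold in (1, 2, 3):
    print(threshold, vocabulary_with_min_df(docs, threshold))
```

Raising the threshold shrinks the vocabulary: rare words, which include typos but also rare sentiment-bearing words, are dropped, which is one possible reason why the scores change only slightly between the settings.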

#### 1. Training with TF-IDF and min_df = 2

In [None]:
airline_vector_1 = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
airline_counts_1 = airline_vector_1.fit_transform(airline_tweets.data)
tfidf_transformer_1 = TfidfTransformer()
airline_tfidf_1 = tfidf_transformer_1.fit_transform(airline_counts_1)

X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(airline_tfidf_1, airline_tweets.target, test_size = 0.20) 

clf_1 = MultinomialNB().fit(X_train_1, y_train_1)
y_pred_1 = clf_1.predict(X_test_1)



In [None]:
print(classification_report(y_test_1, y_pred_1))

              precision    recall  f1-score   support

           0       0.82      0.91      0.86       343
           1       0.82      0.68      0.74       290
           2       0.81      0.84      0.83       318

    accuracy                           0.82       951
   macro avg       0.82      0.81      0.81       951
weighted avg       0.82      0.82      0.81       951
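
The macro and weighted averages in the report above can be reproduced, up to rounding, from the per-class rows. A minimal sketch for the recall column follows. It uses the rounded values printed above, so the result is approximate.

```python
# Per-class recall and support copied from the report above (rounded values).
recall = {0: 0.91, 1: 0.68, 2: 0.84}
support = {0: 343, 1: 290, 2: 318}

# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(round(macro_recall, 2), round(weighted_recall, 2))
```

The weighted recall coincides with the accuracy, since both measure the fraction of test tweets that were classified correctly.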



#### 2. Training with TF-IDF and min_df = 5

In [None]:
airline_vector_2 = CountVectorizer(min_df=5, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
airline_counts_2 = airline_vector_2.fit_transform(airline_tweets.data)
tfidf_transformer_2 = TfidfTransformer()
airline_tfidf_2 = tfidf_transformer_2.fit_transform(airline_counts_2)

X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(airline_tfidf_2, airline_tweets.target, test_size = 0.20) 

clf_2 = MultinomialNB().fit(X_train_2, y_train_2)
y_pred_2 = clf_2.predict(X_test_2)



In [None]:
print(classification_report(y_test_2, y_pred_2))

              precision    recall  f1-score   support

           0       0.81      0.92      0.86       330
           1       0.85      0.74      0.79       309
           2       0.86      0.85      0.85       312

    accuracy                           0.84       951
   macro avg       0.84      0.84      0.84       951
weighted avg       0.84      0.84      0.84       951



#### 3. Training with TF-IDF and min_df = 10

In [None]:
airline_vector_3 = CountVectorizer(min_df=10, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
airline_counts_3 = airline_vector_3.fit_transform(airline_tweets.data)
tfidf_transformer_3 = TfidfTransformer()
airline_tfidf_3 = tfidf_transformer_3.fit_transform(airline_counts_3)

X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(airline_tfidf_3, airline_tweets.target, test_size = 0.20)

clf_3 = MultinomialNB().fit(X_train_3, y_train_3)
y_pred_3 = clf_3.predict(X_test_3)



In [None]:
print(classification_report(y_test_3, y_pred_3))

              precision    recall  f1-score   support

           0       0.81      0.85      0.83       337
           1       0.77      0.72      0.74       301
           2       0.83      0.82      0.83       313

    accuracy                           0.80       951
   macro avg       0.80      0.80      0.80       951
weighted avg       0.80      0.80      0.80       951



#### 4. Training with 'Bag of Words' (airline_count) and min_df = 2

In [None]:
airline_vector_4 = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
airline_counts_4 = airline_vector_4.fit_transform(airline_tweets.data)

X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(airline_counts_4, airline_tweets.target, test_size = 0.20) 

clf_4 = MultinomialNB().fit(X_train_4, y_train_4)
y_pred_4 = clf_4.predict(X_test_4)



In [None]:
print(classification_report(y_test_4, y_pred_4))

              precision    recall  f1-score   support

           0       0.85      0.93      0.89       361
           1       0.88      0.72      0.79       293
           2       0.83      0.89      0.86       297

    accuracy                           0.85       951
   macro avg       0.85      0.84      0.85       951
weighted avg       0.85      0.85      0.85       951



#### 5. Training with 'Bag of Words' (airline_count) and min_df = 5

In [None]:
airline_vector_5 = CountVectorizer(min_df=5, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
airline_counts_5 = airline_vector_5.fit_transform(airline_tweets.data)

X_train_5, X_test_5, y_train_5, y_test_5 = train_test_split(airline_counts_5, airline_tweets.target, test_size = 0.20) 

clf_5 = MultinomialNB().fit(X_train_5, y_train_5)
y_pred_5 = clf_5.predict(X_test_5)



In [None]:
print(classification_report(y_test_5, y_pred_5))

              precision    recall  f1-score   support

           0       0.86      0.91      0.88       344
           1       0.79      0.76      0.78       285
           2       0.86      0.83      0.84       322

    accuracy                           0.84       951
   macro avg       0.84      0.83      0.83       951
weighted avg       0.84      0.84      0.84       951



#### 6. Training with 'Bag of Words' (airline_count) and min_df = 10

In [None]:
airline_vector_6 = CountVectorizer(min_df=10, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
airline_counts_6 = airline_vector_6.fit_transform(airline_tweets.data)

X_train_6, X_test_6, y_train_6, y_test_6 = train_test_split(airline_counts_6, airline_tweets.target, test_size = 0.20) 

clf_6 = MultinomialNB().fit(X_train_6, y_train_6)
y_pred_6 = clf_6.predict(X_test_6)



In [None]:
print(classification_report(y_test_6, y_pred_6))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86       346
           1       0.78      0.76      0.77       292
           2       0.86      0.79      0.82       313

    accuracy                           0.82       951
   macro avg       0.82      0.81      0.82       951
weighted avg       0.82      0.82      0.82       951



### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

For the neutral class we expected names, basic punctuation that is not associated with any emotion (periods, semicolons, question marks), and nouns or verbs without emotional meaning ('day', 'to see', 'to go'). Such tokens are common in sentences that are neither negative nor positive and do not express an opinion; they mostly occur when general information is given. For the positive class we expected words such as 'good' and 'thank you' and punctuation such as exclamation marks, because they generally convey happiness, gratitude and excitement. These types of words are often used in airline promotions and information messages. Finally, for the negative class we expected words such as 'delay', 'cancellation', 'bad (weather)' and 'problem'. They relate to common problems that are perceived as inconvenient (thus negative) and occur frequently in negative announcements regarding flights.

Most of the results are not surprising. The main exceptions are nouns in the negative class such as 'customer', 'crew', 'airport', 'virginamerica' and 'flight', which we expected to lean towards the neutral class because they merely refer to people, places or airlines. Furthermore, both the negative and positive classes contain the special symbols '@' and '#' and punctuation (e.g. '.', '', etc.) among their important features, which we did not expect. Context plays a role for all of these examples, however, so although we would not have assigned them to these classes ourselves, it is understandable that they ended up there. In general, these words and symbols are not strongly related to negative or positive documents: they may occur in a negative or positive context, but taken by themselves they give no good (or at least no significant) indication of whether a document belongs to that class.
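
One reason why tokens such as '@' rank high in several classes is that a ranking by raw counts rewards tokens that are frequent everywhere. As a hedged alternative (not the method used in this assignment), tokens could be ranked by a smoothed log-ratio of their relative frequency in one class versus the other classes. The counts and the function name `log_ratio_ranking` below are hypothetical. With a fitted scikit-learn `MultinomialNB`, a comparable quantity could presumably be derived from its `feature_log_prob_` attribute.

```python
import math

# Hypothetical per-class token counts (not taken from the airline data).
counts = {
    "negative": {"@": 1500, "flight": 380, "delayed": 90, "thanks": 30},
    "positive": {"@": 1400, "flight": 300, "delayed": 5, "thanks": 400},
}

def log_ratio_ranking(counts, target, alpha=1.0):
    """Rank tokens by log P(token | target) - log P(token | other classes)."""
    vocab = sorted({t for c in counts.values() for t in c})
    target_counts = counts[target]
    other_counts = {
        t: sum(c.get(t, 0) for label, c in counts.items() if label != target)
        for t in vocab
    }
    # Add-alpha smoothing avoids log(0) for tokens unseen in one class.
    target_total = sum(target_counts.values()) + alpha * len(vocab)
    other_total = sum(other_counts.values()) + alpha * len(vocab)
    scores = {
        t: math.log((target_counts.get(t, 0) + alpha) / target_total)
        - math.log((other_counts[t] + alpha) / other_total)
        for t in vocab
    }
    return sorted(scores, key=scores.get, reverse=True)

print(log_ratio_ranking(counts, "negative"))
print(log_ratio_ranking(counts, "positive"))
```

In this toy example '@' is by far the most frequent token in both classes, yet it no longer tops either ranking, because it is almost equally likely in both.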

A first step would be to remove the words that appear in the lists of two or all three classes and to keep the rest. This leaves three lists of features that are each important for only one class (a unique feature for the label negative, neutral or positive). This could improve the model, because the important features of each class become more distinctive and no longer overlap.
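
The filtering step described above can be sketched as follows. The lists are hypothetical stand-ins for the notebook output, and `unique_features` is a name made up for this illustration.

```python
# Hypothetical top-feature lists per class (not the actual notebook output).
top_features = {
    "negative": ["@", "united", "flight", "delayed", "worst"],
    "neutral": ["@", "united", "flight", "?", "please"],
    "positive": ["@", "united", "thanks", "great", "!"],
}

def unique_features(top_features):
    """Keep, per class, only the features absent from every other class list."""
    unique = {}
    for label, features in top_features.items():
        others = set()
        for other_label, other_features in top_features.items():
            if other_label != label:
                others.update(other_features)
        # Preserve the original ranking order of the remaining features.
        unique[label] = [f for f in features if f not in others]
    return unique

for label, features in unique_features(top_features).items():
    print(label, features)
```

This only filters the printed lists. To change the model itself, the shared tokens would also have to be excluded during vectorization, for instance by adding them to the stop word list.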

In [None]:
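def important_features_per_class(vectorizer,classifier,n=80):
    # Print the n features with the highest raw count per class (feature_count_).
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()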
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:
important_features_per_class(airline_vector_4, clf_4)

Important words in negative documents
0 1495.0 @
0 1364.0 united
0 1219.0 .
0 421.0 ``
0 387.0 ?
0 381.0 flight
0 371.0 !
0 311.0 #
0 211.0 n't
0 167.0 ''
0 122.0 's
0 107.0 virginamerica
0 107.0 service
0 105.0 :
0 93.0 delayed
0 93.0 cancelled
0 91.0 get
0 88.0 time
0 88.0 customer
0 84.0 bag
0 79.0 ...
0 78.0 plane
0 74.0 -
0 72.0 ;
0 66.0 'm
0 65.0 gate
0 65.0 &
0 64.0 hour
0 62.0 still
0 62.0 late
0 62.0 http
0 62.0 2
0 61.0 hours
0 59.0 would
0 59.0 airline
0 57.0 flights
0 56.0 amp
0 55.0 help
0 53.0 delay
0 53.0 ca
0 50.0 like
0 48.0 worst
0 48.0 one
0 46.0 waiting
0 46.0 (
0 45.0 never
0 44.0 )
0 44.0 $
0 43.0 back
0 43.0 've
0 41.0 us
0 41.0 ever
0 41.0 3
0 40.0 lost
0 39.0 wait
0 39.0 flightled
0 38.0 due
0 36.0 fly
0 35.0 seat
0 35.0 people
0 35.0 check
0 34.0 really
0 34.0 day
0 33.0 thanks
0 33.0 luggage
0 33.0 bags
0 32.0 days
0 32.0 crew
0 32.0 another
0 31.0 need
0 31.0 hold
0 31.0 baggage
0 31.0 airport
0 30.0 u
0 30.0 problems
0 30.0 going
0 30.0 4
0 29.0 ticket
0 29



### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.
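
As one possible starting point for the preprocessing mentioned in Question 8, the sketch below shows a tokenizer that drops punctuation tokens. It is a plain regular-expression tokenizer, not `nltk.word_tokenize`, and the name `tokenize_without_punctuation` is hypothetical. Such a function could be passed as the `tokenizer` argument of `CountVectorizer`.

```python
import re

def tokenize_without_punctuation(text):
    """Lowercase the text and keep only runs of word characters."""
    # \w+(?:'\w+)? keeps contractions such as "can't" together and
    # skips symbols like '@', '#', '.' and '!'.
    return re.findall(r"\w+(?:'\w+)?", text.lower())

example = "@united my flight was delayed AGAIN... can't believe it!!!"
print(tokenize_without_punctuation(example))
```

Whether this helps is an empirical question: punctuation such as '!' or '?' may itself carry sentiment, so the scores should be compared with and without the filter.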

## End of this notebook