# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis concerns the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? wordform? part-of-speech? polarity value? word meaning?). What do all scores mean?


### [3.5 points] Question1:

Consider the following sentences and their output as given by VADER. Explain the outcome **for each sentence** (1 to 7). Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each result, explain why it is correct or what is going wrong.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

### Question1 answers:
```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369} 
```
Sentence 1 is scored as 0% negative, 19.2% neutral and 80.8% positive, and the compound score of 0.6369 labels the whole sentence as positive. This is very reasonable because 'love' has a sentiment rating of 3.2 in the lexicon.


```
INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}
```
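The sentence is 62.7% negative, 37.3% neutral and 0% positive; overall it is negative (compound -0.5216).
VADER's negation rule is triggered by "don't", which flips the valence of 'love' and changes the overall rating from positive to negative.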


```
INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}
```
The sentence is 13.3% neutral and 86.7% positive. It is more positive than sentence 1 because the emoticon also has a positive sentiment rating in the lexicon.


```
INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}
```
The sentence is 50.8% neutral and 49.2% negative. Neutral has the highest score, but it is almost 50/50 with negative, which is understandable as the sentence is a statement about a 'negative' thing.
'Ruins' is listed as a negative word with a sentiment rating of -1.9, which pushes VADER's output towards negative; hence the compound score of -0.4404.


```
INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}
```
The sentence is 51% neutral and 49% positive. With the addition of 'certainly not', the sentiment has shifted from roughly 50/50 neutral and negative to roughly 50/50 neutral and positive: 'certainly not' negates the negative sentiment of the word 'ruins', changing the overall sentiment of the sentence from negative to positive.


```
INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}
```
The sentence is 28.6% negative and 71.4% neutral. However, the compound score is clearly negative (-0.4215), so the sentence will be labelled as negative although it actually is neutral. This is because the word 'lies' can also be read as 'telling lies', which implies something negative; it has a sentiment score of -1.8 in the lexicon, pushing VADER's decision towards a negative sentiment.


```
INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```
The sentence is 66.7% neutral and 33.3% positive. This neutral sentence is labelled as positive, since the only word carrying sentiment is 'like', which has a positive rating in the lexicon. The context in which 'like' is used here makes it neutral, but VADER does not take that into account.
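
The compound scores discussed above can be checked by hand. The sketch below assumes that VADER normalises the summed word valences with `x / sqrt(x^2 + 15)` and multiplies a negated word's valence by -0.74; both constants are recalled from the VADER reference implementation, not stated in this notebook, so treat them as assumptions. The valences for 'love', 'ruins' and 'lies' are the ones quoted in the answers.

```python
import math

ALPHA = 15        # assumed normalisation constant
NEGATION = -0.74  # assumed scalar applied to a negated word's valence


def compound(valence_sum, alpha=ALPHA):
    """Map an unbounded valence sum to the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


# sentence -> summed valence of its sentiment-bearing words
examples = {
    "I love apples": 3.2,
    "I don't love apples": 3.2 * NEGATION,
    "These houses are ruins": -1.9,
    "He lies in the chair in the garden": -1.8,
}

for sentence, valence in examples.items():
    print(f"{sentence!r}: {compound(valence):.4f}")
```

Under these assumptions the printed values can be compared with the `compound` entries for sentences 1, 2, 4 and 6. The normalisation also shows why a single strong word already yields a compound score well away from zero.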

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. 

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [46]:
import json
import sklearn
import pathlib
import spacy
from sklearn import metrics
from sklearn.datasets import load_files
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut only works in spaCy 2.x
vader_model = SentimentIntensityAnalyzer()

In [47]:
# open the file in a with-block so that it is closed again after loading
with open('my_tweets.json') as infile:
    my_tweets = json.load(infile)

In [48]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'positive', 'text_of_tweet': 'Happy Birthday Hoseok!  I Hope you spend your birthday with everyone you love ', 'tweet_url': ''}
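
Before evaluating, it can help to check that every entry carries one of the three allowed labels and to look at the label distribution. The following is a minimal sketch; the helper name `check_annotations` and the two sample entries are hypothetical and only mimic the structure of **my_tweets.json**.

```python
from collections import Counter

ALLOWED_LABELS = {"negative", "neutral", "positive"}


def check_annotations(tweets):
    """Return the label distribution; raise ValueError for a bad entry."""
    for id_, info in tweets.items():
        label = info.get("sentiment_label")
        if label not in ALLOWED_LABELS:
            raise ValueError(f"tweet {id_}: unexpected label {label!r}")
        if not info.get("text_of_tweet", "").strip():
            raise ValueError(f"tweet {id_}: empty text")
    return Counter(info["sentiment_label"] for info in tweets.values())


# Hypothetical sample in the same shape as my_tweets.json
sample = {
    "1": {"sentiment_label": "positive", "text_of_tweet": "What a lovely day", "tweet_url": ""},
    "2": {"sentiment_label": "negative", "text_of_tweet": "My train is late again", "tweet_url": ""},
}

print(check_annotations(sample))
```

A strongly skewed distribution is worth knowing about before interpreting the per-category scores, because the support per category is then very small for some labels.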


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab2-Sentiment-analysis-using-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.

![Screenshot%20from%202020-02-29%2019-45-21.png](attachment:Screenshot%20from%202020-02-29%2019-45-21.png)

In the above table you can see the output of the quantitative evaluation: the precision, recall and F1 score per category. Precision is the number of true positives divided by the sum of true positives and false positives. For example, 53.6 percent of the tweets that VADER labels as positive are indeed positive.
Recall is the number of true positives divided by the sum of true positives and false negatives. For positive tweets, 78.9 percent of the positive tweets were labelled as positive by VADER. Both scores are needed to judge how well VADER performs, because on its own neither says much: a model could, for instance, reach perfect recall for positive tweets simply by labelling everything as positive.

The F1 score is the harmonic mean of precision and recall. Because it combines the two, it is the most important metric for our examination. Based on F1, VADER is best at predicting positive tweets, followed by negative and lastly neutral tweets.
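
To make the definitions above concrete, here is a minimal sketch that computes precision, recall and F1 per category directly from two label lists. The function name `per_class_scores` and the toy labels are made up for illustration; they are not the tweets or the scores from this assignment.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for two equally long label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but the gold label is something else
            fn[g] += 1  # a gold g was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data: the classifier over-predicts 'positive'
gold      = ["positive", "positive", "negative", "neutral", "neutral", "negative"]
predicted = ["positive", "positive", "positive", "positive", "neutral", "negative"]

scores = per_class_scores(gold, predicted)
for label, (p, r, f) in scores.items():
    print(f"{label:9s} P={p:.3f} R={r:.3f} F1={f:.3f}")

macro_f1 = sum(f for _, _, f in scores.values()) / len(scores)
print(f"macro F1 = {macro_f1:.3f}")
```

In the toy data the positive category reaches a recall of 1.0 but a precision of only 0.5, which is the over-prediction effect described above on a small scale.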

* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are less than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

### We selected the following positive tweets that were not correctly classified.

"He is happpy:)" was classified as neutral. We do not quite understand why: 'happy' is directed at 'he', but a smiley ends the sentence. Does VADER only assign positive sentiment when it is directed at the self? The VADER input printed further down shows the tokens 'happpy', ':' and ')', so the misspelling and the split emoticon may also play a role.

"Incredible!" was classified as neutral, which I now see was misclassified by me.

"Hillaty Clinton is my fovorite person in the world." was classified as neutral. 'Favorite' was misspelled, which could be the reason for this.

"its the only way i can enjoy fifa, love ur vids." was classified as neutral. The bad spelling may be the reason.

"Everything is possible If your thought is positive." was classified as neutral. It probably is closer to neutral than to positive.
 
### We selected the following neutral tweets that were not correctly classified.

"Vatican confirms Pope Francis and two aides test positive for Coronavirus - MCM." was classified as positive, which is a shame: 'test positive' and the word 'virus' could have pushed the scale towards negative.

"Gene Sarazen was the first golfer of the modern era to complete the career Grand Slam Trophy Read more about the 1932 Open Champion here." was classified as positive.

"but i'll probably be awake until 3 am for no reason." was classified as negative, which in hindsight is the correct classification.

"To understand a paper better, I will write down some notes and will share them here." was classified as positive.

### We selected the following negative tweets that were not correctly classified.

"i’m so sad hope he gets well soon omg." has been classified as positive. This was also incorrectly classified: the sentiment is positive, the emotion is negative.

"If Trump wasn’t so busy firing his scientists and calling the coronavirus a hoax, we might be better prepared for it." has been classified as positive, although this is clearly an attack on Trump.

"Some real talk from @drdrew about how the media is reckless in its coronavirus coverage-" has been classified as positive.

"I’m sad to learn the founder of #TraderJoes has died, but I’m very excited to meet the cauliflower version of him coming soon." has been classified as positive.

def run_vader(textual_unit, 
              lemmatize=False,
              parts_of_speech_to_consider=set(),
              verbose=1):
    """
    Run VADER on a sentence from spacy
    
    :param str textual_unit: a textual unit, e.g., sentence, sentences (one string),
    which is processed sentence by sentence (by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -empty set -> all parts of speech are provided
    -non-empty set: only these parts of speech are considered
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-':  # spaCy 2.x lemma for pronouns: keep the original token
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
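        # `sent` is the last sentence from the loop above, so for a
        # multi-sentence input only the final sentence is printed here
        print('INPUT SENTENCE', sent)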
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [54]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0.0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
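
The mapping above only returns 'neutral' when the compound score is exactly 0.0. The VADER documentation, as far as we recall, suggests a band of ±0.05 around zero instead. The variant below is a sketch of that idea; the function name and the `threshold` parameter are hypothetical and not part of the lab code.

```python
def vader_output_to_label_with_threshold(vader_output, threshold=0.05):
    """
    Map a VADER output dict to 'negative' | 'neutral' | 'positive',
    treating compound scores strictly between -threshold and +threshold as neutral.
    """
    compound = vader_output['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


print(vader_output_to_label_with_threshold({'compound': 0.01}))
print(vader_output_to_label_with_threshold({'compound': 0.3612}))
print(vader_output_to_label_with_threshold({'compound': -0.4215}))
```

Whether such a band helps on a given data set is an empirical question; it moves weakly scored tweets into the neutral category, which changes both neutral recall and the precision of the other two categories.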

In [55]:

tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet,
                             lemmatize=to_lemmatize,
                             parts_of_speech_to_consider=pos)  # run vader with the settings above
    vader_label = vader_output_to_label(vader_output)  # convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
# use scikit-learn's classification report
report = classification_report(gold,all_vader_output,digits = 3)
print(report)



INPUT SENTENCE I Hope you spend your birthday with everyone you love
INPUT TO VADER ['happy', 'birthday', 'Hoseok', '!', ' ', 'I', 'hope', 'you', 'spend', 'your', 'birthday', 'with', 'everyone', 'you', 'love']
VADER OUTPUT {'neg': 0.0, 'neu': 0.448, 'pos': 0.552, 'compound': 0.902}

INPUT SENTENCE I am so in love with her I can’t even explain it
INPUT TO VADER ['I', 'be', 'so', 'in', 'love', 'with', 'her', 'I', 'can', 'not', 'even', 'explain', 'it']
VADER OUTPUT {'neg': 0.0, 'neu': 0.691, 'pos': 0.309, 'compound': 0.6682}

INPUT SENTENCE He is happpy:)
INPUT TO VADER ['He', 'be', 'happpy', ':', ')']
VADER OUTPUT {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

INPUT SENTENCE He has more Yaya requests coming up so expect more panda goodness
INPUT TO VADER ['I', 'think', 'my', 'friend', 'rele', 'like', 'Yaya', ',', 'another', 'commission', 'for', 'the', 'same', 'friend', '<3', 'He', 'have', 'more', 'Yaya', 'request', 'come', 'up', 'so', 'expect', 'more', 'panda', 'goodness']
VADER


INPUT SENTENCE i'm sad now
INPUT TO VADER ['when', 'fluke', 'thank', "'", "'", 'pharm', "'", "'", 'I', 'feel', 'like', 'cry', ',', 'it', 'really', 'hit', 'me', 'hard', 'that', 'uwma', 'be', 'really', 'end', 'tomorrow', 'i', 'be', 'sad', 'now']
VADER OUTPUT {'neg': 0.272, 'neu': 0.554, 'pos': 0.173, 'compound': -0.4336}

INPUT SENTENCE It’s kind of sad.
INPUT TO VADER ['He', '’', 'so', 'proud', 'of', 'be', 'a', 'dick', '.', 'It', '’', 'kind', 'of', 'sad', '.']
VADER OUTPUT {'neg': 0.392, 'neu': 0.41, 'pos': 0.199, 'compound': -0.5106}

INPUT SENTENCE If Trump wasn’t so busy firing his scientists and calling the coronavirus a hoax, we might be better prepared for it.
INPUT TO VADER ['if', 'Trump', 'be', 'not', 'so', 'busy', 'fire', 'his', 'scientist', 'and', 'call', 'the', 'coronavirus', 'a', 'hoax', ',', 'we', 'may', 'be', 'better', 'prepared', 'for', 'it', '.']
VADER OUTPUT {'neg': 0.08, 'neu': 0.65, 'pos': 0.269, 'compound': 0.6049}

INPUT SENTENCE Some real talk from @drdrew about h

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* I - Run VADER (as it is) on the set of airline tweets 
* II -  Run VADER on the set of airline tweets after having lemmatized the text
* III - Run VADER on the set of airline tweets with only adjectives
* IV - Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* V - Run VADER on the set of airline tweets with only nouns
* VI - Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* VII - Run VADER on the set of airline tweets with only verbs
* VIII - Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

### Question 4 answers:

In [8]:
#loading the airline tweet files:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
airline_tweets= load_files(str(airline_tweets_folder))

# And making a function to (easily) run VADER on tweets using different settings:
def run_vader_on_tweets(tweets,
                        lemmatize_value=False,
                        pos_value=set()):
    vader_label = []
    gold = []

    for tweet, label_int in zip(tweets.data, tweets.target):
        tweet_ = tweet.decode("utf-8")
        # verbose=0 keeps run_vader from printing its input and output for every tweet
        vader_output = run_vader(tweet_,
                                 lemmatize=lemmatize_value,
                                 parts_of_speech_to_consider=pos_value,
                                 verbose=0)
        vader_label.append(vader_output_to_label(vader_output))
        # look up the gold label in the dataset that was passed in
        gold.append(tweets.target_names[label_int])
    report = classification_report(gold, vader_label, digits=3)
    print(report)

#### **4a.** Generate for all separate experiments the classification report, i.e., Precision, Recall, and F1 scores per category as well as micro and macro averages. Use a different code cell (or multiple code cells) for each experiment.

In [9]:
#I Run VADER (as it is) on the set of airline tweets 
run_vader_on_tweets(airline_tweets)

              precision    recall  f1-score   support

    negative      0.797     0.515     0.625      1750
     neutral      0.605     0.506     0.551      1515
    positive      0.559     0.884     0.685      1490

    accuracy                          0.628      4755
   macro avg      0.654     0.635     0.621      4755
weighted avg      0.661     0.628     0.620      4755



In [10]:
#II - Run VADER on the set of airline tweets after having lemmatized the text
run_vader_on_tweets(airline_tweets, lemmatize_value = True)

              precision    recall  f1-score   support

    negative      0.787     0.523     0.628      1750
     neutral      0.599     0.490     0.539      1515
    positive      0.556     0.879     0.682      1490

    accuracy                          0.624      4755
   macro avg      0.648     0.631     0.616      4755
weighted avg      0.655     0.624     0.617      4755



In [11]:
#III - Run VADER on the set of airline tweets with only adjectives
run_vader_on_tweets(airline_tweets, pos_value = {'ADJ'})

              precision    recall  f1-score   support

    negative      0.863     0.213     0.342      1750
     neutral      0.407     0.891     0.559      1515
    positive      0.675     0.455     0.544      1490

    accuracy                          0.505      4755
   macro avg      0.648     0.520     0.481      4755
weighted avg      0.659     0.505     0.474      4755



In [12]:
#IV - Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
run_vader_on_tweets(airline_tweets, lemmatize_value = True,pos_value = {'ADJ'})

              precision    recall  f1-score   support

    negative      0.861     0.213     0.342      1750
     neutral      0.407     0.891     0.559      1515
    positive      0.675     0.455     0.543      1490

    accuracy                          0.505      4755
   macro avg      0.648     0.520     0.481      4755
weighted avg      0.658     0.505     0.474      4755



In [13]:
#V - Run VADER on the set of airline tweets with only nouns
run_vader_on_tweets(airline_tweets, pos_value = {'NOUN'})

              precision    recall  f1-score   support

    negative      0.721     0.137     0.230      1750
     neutral      0.359     0.825     0.500      1515
    positive      0.546     0.344     0.422      1490

    accuracy                          0.421      4755
   macro avg      0.542     0.435     0.384      4755
weighted avg      0.551     0.421     0.376      4755



In [14]:
#VI - Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
run_vader_on_tweets(airline_tweets, lemmatize_value = True,pos_value = {'NOUN'})

              precision    recall  f1-score   support

    negative      0.709     0.150     0.248      1750
     neutral      0.359     0.816     0.498      1515
    positive      0.534     0.336     0.413      1490

    accuracy                          0.421      4755
   macro avg      0.534     0.434     0.386      4755
weighted avg      0.543     0.421     0.379      4755



In [15]:
#VII - Run VADER on the set of airline tweets with only verbs
run_vader_on_tweets(airline_tweets, pos_value = {'VERB'})

              precision    recall  f1-score   support

    negative      0.775     0.284     0.416      1750
     neutral      0.383     0.809     0.520      1515
    positive      0.568     0.349     0.432      1490

    accuracy                          0.472      4755
   macro avg      0.576     0.481     0.456      4755
weighted avg      0.585     0.472     0.454      4755



In [16]:
#VIII - Run VADER on the set of airline tweets with only verbs and after having lemmatized the text
run_vader_on_tweets(airline_tweets,lemmatize_value = True, pos_value = {'VERB'})

              precision    recall  f1-score   support

    negative      0.742     0.292     0.419      1750
     neutral      0.379     0.781     0.510      1515
    positive      0.566     0.358     0.439      1490

    accuracy                          0.469      4755
   macro avg      0.562     0.477     0.456      4755
weighted avg      0.571     0.469     0.454      4755



#### **4b.** Compare the scores and explain what they tell you.
+ Does lemmatisation help?  

The classification reports show that lemmatisation, overall, makes little difference. On the full text, accuracy drops from 0.628 to 0.624 and the macro-averaged F1 from 0.621 to 0.616 when the text is lemmatized (the report prints accuracy, which for this single-label task equals the micro average). A few recall values increase slightly, for example the negative recall for nouns (0.137 to 0.150) and the positive recall for verbs (0.349 to 0.358), but the corresponding precision decreases (0.721 to 0.709 and 0.568 to 0.566). Lemmatisation reduces each word to its lemma, and some meaning can get changed or lost in the process; this could explain the classification report results.

+ Are all parts of speech equally important for sentiment analysis?

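Not all parts of speech are equally important: adjectives retain most of the performance, nouns the least. Per category, the scores do not move in one direction when only a specific part of speech is kept. With adjectives only, for example, negative precision rises (0.797 to 0.863) while negative recall drops sharply (0.515 to 0.213), and neutral recall jumps (0.506 to 0.891). The accuracy, macro and weighted averages, however, are consistently lower than for the complete tweets: the macro F1 is 0.621 for the full text against 0.481 for adjectives, 0.456 for verbs and 0.384 for nouns. This can be explained by the loss of information when only looking at a specific part of speech. When the sentiment-bearing words of a tweet are filtered out, the compound score is 0.0 and the tweet is labelled neutral, which is consistent with the high neutral recall and low neutral precision. In addition, words can have a different meaning in a certain context; this context disappears with part-of-speech filtering, and therefore the meaning can change.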
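
To compare the eight settings at a glance, the accuracy and macro F1 values can be put side by side and ranked. The sketch below only copies numbers from the classification reports above; the dictionary keys are our own shorthand for the settings.

```python
# (accuracy, macro F1) copied from the eight classification reports above
results = {
    "I    full text":              (0.628, 0.621),
    "II   full text, lemmatized":  (0.624, 0.616),
    "III  adjectives":             (0.505, 0.481),
    "IV   adjectives, lemmatized": (0.505, 0.481),
    "V    nouns":                  (0.421, 0.384),
    "VI   nouns, lemmatized":      (0.421, 0.386),
    "VII  verbs":                  (0.472, 0.456),
    "VIII verbs, lemmatized":      (0.469, 0.456),
}

# rank the settings by macro F1, best first
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for setting, (accuracy, macro_f1) in ranked:
    print(f"{setting:28s} accuracy={accuracy:.3f}  macro F1={macro_f1:.3f}")
```

The ranking makes the two effects easy to separate: the choice of part of speech moves the macro F1 by more than ten points, while lemmatisation moves it by less than one point within each pair.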

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
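
The difference between the two representations can be illustrated without scikit-learn. The sketch below builds raw counts and TF-IDF weights for three made-up documents. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is our understanding of `TfidfTransformer`'s default behaviour; the documents and helper names are hypothetical.

```python
import math
from collections import Counter

docs = [
    "flight delayed again",
    "great flight great crew",
    "flight cancelled",
]
tokenised = [doc.split() for doc in docs]
vocabulary = sorted({token for doc in tokenised for token in doc})

# Bag of words: raw term counts per document
counts = [Counter(doc) for doc in tokenised]

# Document frequency: in how many documents does each term occur?
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in counts) for term in vocabulary}


def tfidf_vector(doc_counts):
    """Weight counts by a smoothed idf and scale the vector to unit length."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in doc_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


for doc, doc_counts in zip(docs, counts):
    print(doc)
    print("  counts:", dict(doc_counts))
    print("  tf-idf:", {t: round(w, 3) for t, w in tfidf_vector(doc_counts).items()})
```

'flight' occurs in every document, so its idf is the lowest possible and rarer terms such as 'cancelled' get a higher weight in the same document, whereas in the bag-of-words representation both simply count as 1.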

In [1]:
import pathlib
import sklearn
import numpy
import nltk

from collections import Counter
from nltk.corpus import stopwords

from sklearn.datasets import load_files
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [2]:
#import the airline tweets
cwd = pathlib.Path.cwd()
airline_tweets_path = cwd.joinpath('airlinetweets')
airline_tweets = load_files(str(airline_tweets_path))

In [3]:
def vectorize_and_train_tweets(df, vectorizer):
    """
    Vectorize the airline tweets, train Naive Bayes on 80% and test on 20%.

    :param int df: minimum document frequency (passed on as min_df)
    :param str vectorizer: 'tfidf' or 'count'

    :rtype: str
    :return: classification report (or a message for an unknown vectorizer)
    """
    # vectorize all data
    airline_vec = CountVectorizer(min_df=df, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
    airline_counts = airline_vec.fit_transform(airline_tweets.data)

    tfidf_transformer = TfidfTransformer()
    airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

    # check which vectorizer is used, split the data, train and test the classifier on the data;
    # note: train_test_split is called without random_state, so every call draws a different split
    if vectorizer == 'tfidf':
        tweets_train, tweets_test, y_train, y_test = train_test_split(airline_tfidf, airline_tweets.target, test_size=0.20)
        tfidf_clf = MultinomialNB().fit(tweets_train, y_train)
        tfidf_pred = tfidf_clf.predict(tweets_test)
        report = sklearn.metrics.classification_report(y_true=y_test, y_pred=tfidf_pred, digits=3)

    elif vectorizer == 'count':
        tweets_train, tweets_test, y_train, y_test = train_test_split(airline_counts, airline_tweets.target, test_size=0.20)
        count_clf = MultinomialNB().fit(tweets_train, y_train)
        count_pred = count_clf.predict(tweets_test)
        report = sklearn.metrics.classification_report(y_true=y_test, y_pred=count_pred, digits=2)
    else:
        # unknown vectorizer name: return a message instead of a report
        report = ('%s vectorizer not defined' % vectorizer)

    return report

* [1 point] a. Generate a classification_report for all experiments

Naive bayes airline - default - TF-IDF representation, min_df=2

In [1]:
print(vectorize_and_train_tweets(2,'tfidf'))

NameError: name 'vectorize_and_train_tweets' is not defined

Naive bayes airline - default - BoW representation, min_df=2

In [5]:
print(vectorize_and_train_tweets(2,'count'))



              precision    recall  f1-score   support

           0       0.83      0.91      0.87       355
           1       0.86      0.70      0.78       315
           2       0.81      0.88      0.84       281

    accuracy                           0.83       951
   macro avg       0.83      0.83      0.83       951
weighted avg       0.84      0.83      0.83       951



Naive bayes airline - default - TF-IDF representation, min_df=5

In [6]:
print(vectorize_and_train_tweets(5,'tfidf'))



              precision    recall  f1-score   support

           0      0.806     0.901     0.851       333
           1      0.827     0.736     0.779       311
           2      0.844     0.831     0.837       307

    accuracy                          0.824       951
   macro avg      0.826     0.823     0.822       951
weighted avg      0.825     0.824     0.823       951



Naive bayes airline - default - TF-IDF representation, min_df=10

In [7]:
print(vectorize_and_train_tweets(10,'tfidf'))



              precision    recall  f1-score   support

           0      0.786     0.909     0.843       340
           1      0.830     0.727     0.775       308
           2      0.872     0.828     0.849       303

    accuracy                          0.824       951
   macro avg      0.829     0.821     0.823       951
weighted avg      0.827     0.824     0.823       951



* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?

For all settings tested the highest overall scoring is category 0, the negative category.

In [11]:
airline_tweets.target_names[0]

'negative'

For both the count and the TF-IDF vectorizer, and for min_df = 2, 5 and 10, the values for precision, recall and F1 follow a similar pattern. Category 0 scores highest on recall and F1, but never on precision, where another category always scores slightly higher. Category 2, positive, is a clear second in F1 for every setting. Category 1, neutral, has the lowest recall and F1 everywhere, although in the reports shown above its precision is not the lowest (0.86 in the bag-of-words run). This could be because neutral tweets are harder to recognise: the classification is based on the presence of certain positive or negative words, and neutral tweets lack such distinctive words.

The general shape the classification report follows for all settings
```
        precision   recall   f1-score  
   0      0.821     0.869     0.844    <- highest overall scores  
   1      0.779     0.719     0.748    <-lowest scores present   
   2      0.834     0.840     0.837    <- second highest everywhere, except for precision where it scores highest  
```

 + does the frequency threshold affect the scores? Why or why not according to you?
 
The accuracy, macro and weighted averages change only marginally when the frequency threshold is increased: in the TF-IDF runs shown above, accuracy is 0.824 for both min_df = 5 and min_df = 10, and the macro F1 goes from 0.822 to 0.823. The precision, recall and F1 per category vary a little, but not in a noticeable pattern. Every run also draws a new random train/test split (the support per category differs between the reports), so differences of this size cannot be attributed to the threshold alone.
One possible explanation for a small increase when the frequency threshold is raised is that the remaining terms have a higher chance of actually carrying sentiment, as they are commonly used across the documents.
Such a trend will probably not hold for a much higher frequency threshold, as too many important words would be filtered out.
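
What the frequency threshold does to the vocabulary can be shown on a few toy documents. This is a minimal sketch of document-frequency filtering; the helper `vocabulary_for` and the four documents are made up, and the tokenisation is a plain whitespace split, unlike the NLTK tokenizer used above.

```python
from collections import Counter


def vocabulary_for(docs, min_df):
    """Keep the terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, n in doc_freq.items() if n >= min_df)


docs = [
    "flight delayed again",
    "flight delayed thanks",
    "great flight",
    "thanks great crew",
]

for min_df in (1, 2, 3):
    print(min_df, vocabulary_for(docs, min_df))
```

Raising `min_df` first removes rare terms, which are often noise, but at higher values it also removes sentiment words such as 'delayed' and 'great' and leaves only terms that occur almost everywhere.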

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below) [1 point]
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why ? 
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [None]:
def important_features_per_class(vectorizer,classifier,n=80):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()  # on scikit-learn < 1.0 use get_feature_names()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:
#important_features_per_class(airline_vec, clf)
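
The function above ranks features by their raw count per class (`feature_count_`), so words that are frequent in every class, such as airline names, end up high in all three lists. One possible alternative, sketched below on made-up counts, is to rank terms by a smoothed log-ratio of their per-class probabilities. The function `distinctive_features`, its parameters and the toy counts are hypothetical and not part of the lab code.

```python
import math


def distinctive_features(counts_per_class, n=2, alpha=1.0):
    """
    counts_per_class: {class_label: {term: count}}
    Returns {class_label: [(score, term), ...]} with the n highest scores, where
    score = log( P(term | class) / mean P(term | other classes) )
    and the probabilities use add-alpha smoothing.
    """
    vocabulary = sorted({t for counts in counts_per_class.values() for t in counts})

    def prob(label, term):
        counts = counts_per_class[label]
        total = sum(counts.values()) + alpha * len(vocabulary)
        return (counts.get(term, 0) + alpha) / total

    ranked = {}
    for label in counts_per_class:
        others = [other for other in counts_per_class if other != label]
        scored = []
        for term in vocabulary:
            rest = sum(prob(other, term) for other in others) / len(others)
            scored.append((math.log(prob(label, term) / rest), term))
        ranked[label] = sorted(scored, reverse=True)[:n]
    return ranked


# Made-up counts for illustration only
toy_counts = {
    "negative": {"@united": 50, "flight": 40, "delayed": 30, "thanks": 2},
    "positive": {"@united": 45, "flight": 35, "thanks": 30, "great": 25},
}

for label, features in distinctive_features(toy_counts).items():
    print(label, [(term, round(score, 2)) for score, term in features])
```

With this ranking, terms that are about equally frequent in both classes get a score close to zero, which gives a more direct view of which words actually separate the classes.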

### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook