# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-2 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It servers to get insight in what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications?  Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon. Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?) What do all scores mean? https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt) 


### [3.5 points] Question1:

Regard the following sentences and their output as given by VADER. Regard sentences 1 to 7, and explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain what is going wrong or not when correct and incorrect results are produced. 

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```


### Exercise 1 Answer

Sentence 1: The presence of the word "love" correctly influences a positive sententiment ('pos: 0.808', compound: '0.6369'). The word love has a high positive valence in the lexicon.

Sentence 2: VADER implements the negatiion rule in this sentence, flipping the sentiment valence of 'love' to negative  (neg: 0.627, compound: -0.5216) . 

Sentence 3: The positive sentiment is even stronger (pos: 0.867, compound: 0.7579) than in sentence 1, correctly influenced by the emoticon ":-)" which adds to the positive sentiment. Emoticons are included in VADER's lexicon. 

Sentence 4: The negative sentiment (neg: 0.492, compound: -0.4404) comes from the word 'ruins' which can have negative connotations. This assignment might not always fit the context for example historical ruins.

Sentence 5: The negation "not" correctly influences the sentiment to be more positive (pos: 0.49, compound: 0.5867), suggesting VADER's effective handling of negation in conjunction with an intensifier "certainly" leading to a high positive score. However with different context it could be the negative sentiment influence of 'ruins' could again make this sentence sentiment be inaccurate.

Sentence 6: The negative sentiment (neg: 0.286, compound: -0.4215) is incorrectly applied. This is likely because "lies" is interpreted as dishonesty, whereas in this context, it simply means reclining.

Sentence 7: The slightly positive sentiment (pos: 0.333, compound: 0.3612) might be unexpected as the sentence appears neutral. This could be due to the lack of contextually negative or positive cues, leading to a default slight positive interpretation. Without clear sentiment-laden words, the analysis leans slightly positive, which might reflect an inherent bias or the treatment of neutral statements in VADER's algorithm.

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
"1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url" : "https://twitter.com/BarackObama/status/946775615893655552",
    },
```

You can load your tweets with human annotation in the following way.

In [1]:
import json

In [2]:
my_tweets = json.load(open('my_tweets.json'))

In [3]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'neutral', 'text_of_tweet': '@BritIndianVoice @RishiSunak #RishiSunak is #BritishHindu of paternal #Pakistani Punjabi origin. He’s legally POC (#Pakistan\xa0Origin Card) holder issued by @NadraPak (probably also concurrently holding India’s OCI through his Indian NRI wife/mother). Also there may be issue with his wife’s Indian citizenship.', 'tweet_url': 'https://twitter.com/IsmailYSyed/status/1584696134693711873'}


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab2-Sentiment-analysis-using-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are less than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [4]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'

In [6]:
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import spacy 

nlp = spacy.load('en_core_web_sm') 

vader_model = SentimentIntensityAnalyzer()

#copied from lab3.2

def run_vader(textual_unit, 
              lemmatize=True, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a sentence from spacy
    
    :param str textual unit: a textual unit, e.g., sentence, sentences (one string)
    (by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', sent)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [11]:
from sklearn.metrics import classification_report
import pandas as pd
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet)
    vader_label = vader_output_to_label(vader_output)
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])

    
# use scikit-learn's classification report

report = classification_report(gold,all_vader_output,digits = 3)
print(report)

#error analysis

df = pd.DataFrame({
    'Gold': gold,
    'VADER': all_vader_output,
    'Tweet': tweets
})

df['Error'] = df['Gold'] != df['VADER']

errors = df[df['Error']]

for sentiment in ['positive', 'neutral', 'negative']:
    sentiment_errors = errors[errors['Gold'] == sentiment].head(10)
    print(f"\nErrors for {sentiment}:")
    for _, row in sentiment_errors.iterrows():
        print(f"Tweet: {row['Tweet']}, Gold: {row['Gold']}, VADER: {row['VADER']}")

              precision    recall  f1-score   support

    negative      0.500     0.350     0.412        20
     neutral      0.700     0.389     0.500        18
    positive      0.308     0.667     0.421        12

    accuracy                          0.440        50
   macro avg      0.503     0.469     0.444        50
weighted avg      0.526     0.440     0.446        50


Errors for positive:
Tweet: Time Wheel - Now It's an Indian to Look after British. #RishiSunakPM #RishiSunak #election #India #UKPrimeMinister #PrimeMinister, Gold: positive, VADER: neutral
Tweet: Labour lawmaker Nadia Whittome:“He’s a multi-millionaire who, as chancellor, cut taxes on bank profits while overseeing the biggest drop in living standards since 1956. Black, white or Asian: if you work for a living, he is not on your side.”.#Britain..#Tories..#RishiSunak.., Gold: positive, VADER: negative
Tweet: @JhaSanjay India is inspiring millions and several nations, except disgustingly fake narrative pushers. #

### Question 3 - part a (answer)
Positive tweets have the lowest precision (0.308) but the highest recall (0.667), indicating the model often misclassifies tweets as positive but is good at catching most of the genuinely positive tweets. Neutral tweets exhibit higher precision (0.700) but lower recall (0.389), meaning while the model is relatively accurate when it predicts a tweet is neutral, it still misses many neutral tweets. For negative tweets, the precision is relatively low (0.500), indicating that only half of the tweets predicted as negative were correctly classified. The recall is even lower (0.350), suggesting the model misses a significant portion of truly negative tweets.

Macro averages in this case are particularly relevant. They show VADERS performance across all sentiments, highlighting its strengths and weaknesses without bias toward more frequent categories. This balanced evaluation is essential for developing an unbiased understanding of public sentiment from social media data.

Overall, macro precision and recall are about 50% showing it both misclassifies about hafl of the tweets and only catches about half of the true positive tweets.

### Question 3 - part b (answer)

Errors for positive:
For these tweets even though the gold label is positive, VADER calculated them all as negative sentiment. This is likely because with perspective matters in these tweets. While they do convey some negative sentiment the message is positive towards Rishi Sunak. VADER only looks for general sentiment and is not aware that the outcome of these tweets is one that is positive towards Sunak and just registers the negative sentiment towards something else. 

Errors for neutral: 
This is admittedly a difficult label to know manually as these neutral tweets generally give both positive and negatives to outcome neutral. Also once again VADER is unaware that while a tweet can contain one sentiment it is overall a different sentiment towards the subject (Rishi Sunak). For example: Tweet: I see a lot of white ass is getting burnt... #RishiSunak (Gold: neutral, VADER: negative). While this tweet without context is of negative sentiment since it also contains #RishiSunak its sentiment is more complex and is neutral since it is the meaning of the sentence does not refer to Sunak but a statement of the settings around him. 

Errors for negative:
VADER incorrectly labels most of these tweets as postitive sentiment when in fact they are negative. This happens multiple times with the mention of wealth. While VADER may measure 'wealth' and 'full of money' with positive sentiment in Sunak's case as a politician it is meant with negative sentiment due to the idea that this means he is out of touch with the British Population. Additionally, in other tweets, the user addresses the audience ('Good morning') which has strong positive sentiment but then goes goes on to express negative sentiment about Sunak which doesnt have as strong effect on VADER's overall sentiment calculation. 

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
* - Does lemmatisation help? Explain why or why not.
* - Are all parts of speech equally important for sentiment analysis? Explain why or why not.

In [35]:
#import packages and load airline tweets
import pathlib
import sklearn
import numpy
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report

cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
airline_tweets_data = load_files(str(airline_tweets_folder))

def index_to_label(list_index):
    labels=[]
    for i in list_index:
        label=airline_tweets_data.target_names[i]
        labels.append(label)
    return labels


VADER_as_is=[]
VADER_lemmatized=[]
VADER_only_adjective=[]
VADER_only_adjective_lemmatized=[]
VADER_only_noun=[]
VADER_only_noun_lemmatized=[]
VADER_only_verb=[]
VADER_only_verb_lemmatized=[]

for tweet in airline_tweets_data.data:
    tweet=str(tweet)[2:-1]
    vader1=run_vader(tweet, lemmatize=False)
    VADER_as_is.append(vader_output_to_label(vader1))

    vader2=run_vader(tweet, lemmatize=True)
    VADER_lemmatized.append(vader_output_to_label(vader2))

    vader3=run_vader(tweet, lemmatize=False, parts_of_speech_to_consider='ADJ')
    VADER_only_adjective.append(vader_output_to_label(vader3))

    vader4=run_vader(tweet, lemmatize=True, parts_of_speech_to_consider='ADJ')
    VADER_only_adjective_lemmatized.append(vader_output_to_label(vader4))

    vader5=run_vader(tweet, lemmatize=False, parts_of_speech_to_consider='NOUN')
    VADER_only_noun.append(vader_output_to_label(vader5))

    vader6=run_vader(tweet, lemmatize=True, parts_of_speech_to_consider='NOUN')
    VADER_only_noun_lemmatized.append(vader_output_to_label(vader6))

    vader7=run_vader(tweet, lemmatize=False, parts_of_speech_to_consider='VERB')
    VADER_only_verb.append(vader_output_to_label(vader7))

    vader8=run_vader(tweet, lemmatize=True, parts_of_speech_to_consider='VERB')
    VADER_only_verb_lemmatized.append(vader_output_to_label(vader8))
   

print('---VADER AS IS---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_as_is,digits = 3)
print(report)

print('---VADER lemmatized---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_lemmatized,digits = 3)
print(report)

print('---VADER ADJ---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_only_adjective,digits = 3)
print(report)

print('---VADER ADJ lemmatized---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_only_adjective_lemmatized,digits = 3)
print(report)

print('---VADER NOUN---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_only_noun,digits = 3)
print(report)

print('---VADER NOUN lemmatized---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_only_noun_lemmatized,digits = 3)
print(report)

print('---VADER VERB---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_only_verb,digits = 3)
print(report)

print('---VADER VERB lemmatized---\n')
report = classification_report(index_to_label(list(airline_tweets_data.target)),VADER_only_verb_lemmatized,digits = 3)
print(report)


---VADER AS IS---

              precision    recall  f1-score   support

    negative      0.796     0.515     0.625      1750
     neutral      0.604     0.507     0.551      1515
    positive      0.558     0.881     0.684      1490

    accuracy                          0.627      4755
   macro avg      0.653     0.634     0.620      4755
weighted avg      0.660     0.627     0.620      4755

---VADER lemmatized---

              precision    recall  f1-score   support

    negative      0.786     0.521     0.627      1750
     neutral      0.598     0.489     0.538      1515
    positive      0.556     0.879     0.681      1490

    accuracy                          0.623      4755
   macro avg      0.647     0.630     0.615      4755
weighted avg      0.654     0.623     0.615      4755

---VADER ADJ---

              precision    recall  f1-score   support

    negative      0.870     0.210     0.339      1750
     neutral      0.403     0.893     0.555      1515
    positive   

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

In [37]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

dfs=[2,4,6,8,10]

for i in dfs:

    airline_vec = CountVectorizer(min_df=i, # If a token appears fewer times than this, across all documents, it will be ignored
                                tokenizer=nltk.word_tokenize) 
    # stopwords are removedstop_words=stopwords.words('english')

    airline_counts = airline_vec.fit_transform(airline_tweets_data.data)

    tfidf_transformer = TfidfTransformer()
    airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

    docs_tfid_train, docs_tfid_test, y_tfid_train, y_tfid_test = train_test_split(
        airline_tfidf, # the tf-idf model
        airline_tweets_data.target, # the category values for each tweet 
        test_size = 0.20 # we use 80% for training and 20% for development
        ) 

    docs_counts_train, docs_counts_test, y_counts_train, y_counts_test = train_test_split(
        airline_tfidf, # the tf-idf model
        airline_tweets_data.target, # the category values for each tweet 
        test_size = 0.20 # we use 80% for training and 20% for development
        )

    tfid_clf = MultinomialNB().fit(docs_tfid_train, y_tfid_train)
    counts_clf = MultinomialNB().fit(docs_counts_train, y_counts_train)

    tfid_y_pred = tfid_clf.predict(docs_tfid_test)
    counts_y_pred = tfid_clf.predict(docs_counts_test)

    print(f'---tfid min_df={i}---\n')
    report = classification_report(y_tfid_test,tfid_y_pred,digits = 3)
    print(report)

    print(f'---counts min_df={i}---\n')
    report = classification_report(y_counts_test,counts_y_pred,digits = 3)
    print(report)




---tfid min_df=2---

              precision    recall  f1-score   support

           0      0.784     0.947     0.858       337
           1      0.880     0.720     0.792       307
           2      0.891     0.850     0.870       307

    accuracy                          0.842       951
   macro avg      0.852     0.839     0.840       951
weighted avg      0.850     0.842     0.840       951

---counts min_df=2---

              precision    recall  f1-score   support

           0      0.838     0.974     0.901       351
           1      0.953     0.782     0.859       312
           2      0.909     0.906     0.908       288

    accuracy                          0.891       951
   macro avg      0.900     0.888     0.889       951
weighted avg      0.897     0.891     0.889       951





---tfid min_df=4---

              precision    recall  f1-score   support

           0      0.856     0.905     0.879       367
           1      0.824     0.728     0.773       302
           2      0.818     0.858     0.837       282

    accuracy                          0.835       951
   macro avg      0.832     0.830     0.830       951
weighted avg      0.834     0.835     0.833       951

---counts min_df=4---

              precision    recall  f1-score   support

           0      0.880     0.939     0.908       359
           1      0.868     0.800     0.833       280
           2      0.890     0.885     0.887       312

    accuracy                          0.880       951
   macro avg      0.879     0.874     0.876       951
weighted avg      0.880     0.880     0.879       951





---tfid min_df=6---

              precision    recall  f1-score   support

           0      0.849     0.911     0.879       359
           1      0.833     0.738     0.782       290
           2      0.835     0.854     0.845       302

    accuracy                          0.840       951
   macro avg      0.839     0.834     0.835       951
weighted avg      0.840     0.840     0.839       951

---counts min_df=6---

              precision    recall  f1-score   support

           0      0.829     0.949     0.885       352
           1      0.926     0.778     0.845       306
           2      0.873     0.867     0.870       293

    accuracy                          0.869       951
   macro avg      0.876     0.865     0.867       951
weighted avg      0.874     0.869     0.868       951





---tfid min_df=8---

              precision    recall  f1-score   support

           0      0.824     0.902     0.862       338
           1      0.847     0.728     0.783       320
           2      0.820     0.857     0.838       293

    accuracy                          0.830       951
   macro avg      0.831     0.829     0.828       951
weighted avg      0.831     0.830     0.828       951

---counts min_df=8---

              precision    recall  f1-score   support

           0      0.834     0.931     0.880       362
           1      0.888     0.762     0.820       302
           2      0.854     0.857     0.856       287

    accuracy                          0.855       951
   macro avg      0.859     0.850     0.852       951
weighted avg      0.857     0.855     0.854       951





---tfid min_df=10---

              precision    recall  f1-score   support

           0      0.826     0.905     0.864       357
           1      0.846     0.747     0.793       293
           2      0.837     0.837     0.837       301

    accuracy                          0.835       951
   macro avg      0.836     0.830     0.831       951
weighted avg      0.836     0.835     0.834       951

---counts min_df=10---

              precision    recall  f1-score   support

           0      0.847     0.951     0.896       349
           1      0.869     0.784     0.824       296
           2      0.894     0.853     0.873       306

    accuracy                          0.868       951
   macro avg      0.870     0.863     0.864       951
weighted avg      0.869     0.868     0.866       951



### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below) [1 point]
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why ? 
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [39]:
airline_vec = CountVectorizer(min_df=i, # If a token appears fewer times than this, across all documents, it will be ignored
                            tokenizer=nltk.word_tokenize) 
# stopwords are removedstop_words=stopwords.words('english')

airline_counts = airline_vec.fit_transform(airline_tweets_data.data)

docs_counts_train, docs_counts_test, y_counts_train, y_counts_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_data.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for development
    )

counts_clf = MultinomialNB().fit(docs_counts_train, y_counts_train)


def important_features_per_class(vectorizer,classifier,n=80):
    class_labels = classifier.classes_
    feature_names =vectorizer.get_feature_names_out()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:
important_features_per_class(airline_vec, counts_clf)



Important words in negative documents
0 162.07421750009578 united
0 115.60386701619673 .
0 99.99382694053722 @
0 99.52395184333372 ``
0 79.03888229479556 to
0 68.93972736442352 i
0 66.4869048130357 the
0 59.762813193889905 a
0 55.02131012131392 flight
0 51.695025162695686 ?
0 51.25141648392533 and
0 49.5782399747483 is
0 49.218693806843206 you
0 48.0772867031744 my
0 47.71408134110614 #
0 45.62010143410105 on
0 45.08385282660199 in
0 43.22929023395862 !
0 41.147325453645 for
0 39.294184470831105 n't
0 36.87981608737397 your
0 36.14752801663445 no
0 35.86999262639082 of
0 35.5698861216441 it
0 33.41387279718946 not
0 33.05361901417877 that
0 31.992341457899226 was
0 30.282987482661365 have
0 29.173781472250337 at
0 28.418963180654718 with
0 27.161488300587006 me
0 26.128819653872206 this
0 24.983083916676623 virginamerica
0 24.96214101272257 ''
0 24.13833871497802 service
0 23.73377671639408 's
0 22.3556782624379 delayed
0 22.33461157883637 do
0 22.15577418213356 be
0 20.93067580317864 

### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook