# Lab3 - Assignment Sentiment

In [1]:
import json
import pathlib
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

import spacy

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-2 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It servers to get insight in what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications?  Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon. Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?) What do all scores mean? https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt) 


### [3.5 points] Question 1:

Regard the following sentences and their output as given by VADER. Regard sentences 1 to 7, and explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain what is going wrong or not when correct and incorrect results are produced. 

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

###  Answer 1:

The first three scores of the VADER output are ratios for the proportion of text that falls into the categories of 'negative', 'neutral' or 'positive'. Together they should add up to one. To rate the sentiment of these individual parts of text, VADER utilizes a valence based approach which means that not only the positive or negative sentiment of a word is assessed but also the intensity of the sentiment is taken into account. VADER retrieves these sentiment scores from a lexicon containing a vast amount of words including slang terms and punctuation. Additionally, VADER captures contextual nuances such as capitalization, where capital letters intensify sentiment, and everything that comes after the word 'but' influences the sentiment of the text that came before it. Finally a 'compound' score is calculated by adding up all the lexicon ratings and standardizing them to a rating between -1 and 1 which stand for extremely negative and extremely positive respectively. With that in mind we will analyze the scores of the sentences given above:

    -SENTENCE 1: The positive sentiment for this sentence seems to be identified due to the word 'love,' which has a relatively high positive mean sentiment rating in the lexicon (3.2). No word in this sentence necessitates the use of negation or intensification rules.
    
    -SENTENCE 2: The negative sentiment for this sentence is due to the use of "don't" before "love" which triggers negation handling, switching the sentiment to negative compared to the first sentence, resulting in a negative compound score (-0.5216). No word in this sentence necessitates the intensification rule.

    -SENTENCE 3: The addition of ':-)', which has a positive mean sentiment rating in the lexicon of 1.3, increases the positive sentiment compared to the first sentence. This is reflected in a higher positive score (0.867) and compound score (0.7579). The increase in positive scores is attributable to VADER's capacity to recognize and assign sentiment values to emoticons, in addition to words. No word in this sentence necessitates the use of the negation or intensification rules.

    -SENTENCE 4: The word "ruins" in this sentence is interpreted negatively in VADER's lexicon with a negative mean sentiment rating of -1.9, leading to a negative sentiment. While "ruins" can have a negative connotation, in this context, it might not necessarily be negative (e.g the houses reffered to in this sentence could be historic buildings). This highlights a limitation in VADER's ability to understand context. No word in this sentence necessitates the use of the negation or intensification rules.

    -SENTENCE 5: The negation "not" applied to the concept of 'ruins' and the intensifier "certainly" are handled here, making the sentiment positive. Upon inspection, "considered" does not appear to add any sentiment value, likely because it is not associated with any specific sentiment score in VADER's lexicon. The result seems somewhat misleading when considering the neutral nature of the statement. VADER correctly inverts the sentiment due to negation but might be overestimating the positive score (0.49) leading to a more positive compound score(0.5867). This again seems to highlight a limitation in VADER's ability to understand context, similar to sentence 4.

    -SENTENCE 6: This sentence seems to be incorrectly identified as negative with a compound score of (-0.4215), likely because of the word "lies", which has a negative mean sentiment rating of -1.8. This seems to underscore a limitation in VADER's ability to understand context, in particularly with polysemous words. 'Lies' is interpreted as dishonesty, whereas, in this context, it simply means reclining. This highlights a flaw in VADER's classification ability for sentences containing words with multiple meanings. No word in this sentence necessitates the use of the negation or intensification rules.
    
    -SENTENCE 7: This sentence is identified with a positive sentiment, which may not entirely reflect the neutral nature of the comparison being made. The positive score (0.333) and compound score (0.3612) are unexpected, as the sentence appears neutral. This outcome may be attributed to the fact that 'like' is the only word from this sentence included in VADER's lexicon, carrying a positive mean sentiment rating of 1.5. This situation seems to highlight a limitation in VADER's ability to understand context, due to its limited lexicon. No word in this sentence necessitates the use of negation or intensification rules.

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
"1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url" : "https://twitter.com/BarackObama/status/946775615893655552",
    },
```

You can load your tweets with human annotation in the following way.

In [2]:
with open('my_tweets.json', encoding='utf-8') as f:
    my_tweets = json.load(f)

In [3]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'neutral', 'text_of_tweet': 'Poop Money android app is now available at tntdevelopment store. https://goo.gl/ddAn1E', 'tweet_url': 'https://twitter.com/RandomTweetsApp/status/856165981072306176'}


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab2-Sentiment-analysis-using-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are less than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [4]:
nlp = spacy.load('en_core_web_sm')
vader_model = SentimentIntensityAnalyzer()

In [5]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a sentence from spacy
    
    :param str textual unit: a textual unit, e.g., sentence, sentences (one string)
    (by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', sent)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [6]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'

In [7]:
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
# use scikit-learn's classification report

In [8]:
for i in range(len(tweets)):
    print(f"The tweet: {tweets[i]}")
    print(f"The gold label for this tweet: {gold[i]}")
    print(f"The VADER output for this tweet: {all_vader_output[i]}\n")

The tweet: Poop Money android app is now available at tntdevelopment store. https://goo.gl/ddAn1E
The gold label for this tweet: neutral
The VADER output for this tweet: neutral

The tweet: Haaland just surpassed Jesus season tally in one game
The gold label for this tweet: positive
The VADER output for this tweet: neutral

The tweet: i want to corroborate the stories about the poor working conditions with wilbur and lovejoy. i have volunteered and worked for him and the band, and though i was spared any truly negative experiences, there was an immense lack of professionalism and care from bosses. that’s all.
The gold label for this tweet: negative
The VADER output for this tweet: negative

The tweet: ERLING HAALAND BRACE WITH ANOTHER DE BRUYNE ASSIST THE BEST DUO IN THE WORLD
The gold label for this tweet: positive
The VADER output for this tweet: positive

The tweet: Despite all negative news, this year will turn out , just as you prayed for 🤩🤩✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️
The gold label 

### Answer 3a Perform a quantitative evaluation

In [9]:
# Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
print(classification_report(gold, all_vader_output, target_names=['negative', 'neutral', 'positive']))

              precision    recall  f1-score   support

    negative       0.50      0.82      0.62        17
     neutral       0.44      0.27      0.33        15
    positive       0.85      0.61      0.71        18

    accuracy                           0.58        50
   macro avg       0.60      0.57      0.56        50
weighted avg       0.61      0.58      0.57        50



#### precision is the proportion of correctly classified tweets for each sentiment category (negative, neutral, positive) compared to all tweets classified in that category:
- precision(negative): 0.50 indicates that when the VADER classifies a tweet as negative, it is correct 50% of the time 
- precision(neutral): 0.44 suggests that when the VADER classifies a tweet as neutral, it is correct 44% of the time 
- precision(positive): 0.85 shows a high precision for positive classification. VADER is correct 85% of the time when it classifies a tweet is positive.

#### recall is the proportion of actual tweets for each sentiment category (negative, neutral, positive) that were correctly identiified as such:
- recall(negative): 0.82 shows a high recall for negative classification. This suggests that VADER is able to correctly identify 82% of all actual negative tweets.
- recall(neutral): 0.27 indicates that VADER is able to correctly identify 27% of all actual neutral tweets.
- recall(positive): 0.61 shows that the VADER correctly identifies 61% of all actual positive tweets

#### f1-Score is the harmonic mean between precision and recall for each sentiment (where an improvement in precision can come at the cost of a decrease in recall, or vice versa):
- f1-Score(negative): 0.62 suggest a relatively  balanced precision and recall for negative tweets.
- f1-Score(neutral): 0.33 indicates a poor balance between precision and recall.
- f1-Score(positive): 0.71 indicates a relatively better balance between precision and recall for positive tweets compared to neutral ones, but a relatively worse balance between precision and recall for the negative ones.

#### support is the count of labelled tweets for each sentiment  (negative, neutral, positive), and can be used as an indication to assess class imbalance: 
- support(negative): there are 17 gold labeled negative tweets.
- support(neutral): there are 15 gold labeled neutral tweets.
- support(positive): there are 18 gold labeled positive tweets.

#### accuracy is the proportion of correctly classified tweets  accross all classifications made, across all sentiments:
- 0.58 indicates that VADER correctly identifies the sentiment of a tweet 58% of the time across all categories.

#### macro avg: Averages the precision, recall, and f1-score per sentiment: negative, neutral, and positive, without considering the number of instances of each sentiment i.e. each sentiment is given equal weight:
- The macro average precision is 0.60, recall is 0.57, and f1-score is 0.56.

#### weighted avg: Averages the precision, recall, and F1-score per sentiment: negative, neutral, and positive, taking into account the number of true instances of each sentiment. This approach gives more weight to sentiments with more instances, making it a useful metric in imbalanced datasets:
- The weighted average precision is 0.61, recall is 0.58, and f1-score is 0.57

___
- VADER seems to perform best at classifying positive tweets, as indicated by the high precision and reasonably good recall and F1-score for positive sentiment.
- VADER seems to struggles with neutral tweets the most, showing the lowest recall and F1-score, indicating difficulty in correctly identifying neutral sentiment accurately
- Negative sentiment has a relatively high recall but lower precision, suggesting the model frequently predicts tweets as negative, even when they are not, but is quite good at capturing most negative instances.
- An accuracy of 0.58 suggests moderate performance, with room for improvement, especially in correctly identifying neutral tweets and improving the precision for negative tweets.

#### Most relevant scores:
- The most relevant scores depend on the use case of a model. Here since neither the cost of false positive is high (precision) nor is it of utmost importance to capture as many true instances as possible (recall), the f1-score emerges as a particularly valuable metric. This is due to its balanced consideration of both precision and recall, offering a inclusive view of VADER's performance. Looking at the f1-scores across different sentiments can help identify which specific areas (positive, negative, neutral) might require improvements, such as adjusting the parameters of the 'run_vader' function to better capture neutral sentiments, in this case. The support metric indicates that hte sentiments seem to be fairly balanced, and since each sentiment category can be considered equally important, the macro avg for the f1-score is likely the single best metric to track in order to asses the performance of VADER in this experiment.

### Answer 3b Perform an error analysis

In [10]:
#error analysis
df = pd.DataFrame({
'Gold': gold,
'VADER': all_vader_output,
'Tweet': tweets
})
df['Error'] = df['Gold'] != df['VADER']
errors = df[df['Error']]
for sentiment in ['positive', 'neutral', 'negative']:
    sentiment_errors = errors[errors['Gold'] == sentiment].head(10)
    print(f"\nErrors for {sentiment}:")
    for _, row in sentiment_errors.iterrows():
        print(f"Tweet {_}: {row['Tweet']}, Gold: {row['Gold']}, VADER: {row['VADER']}")


Errors for positive:
Tweet 1: Haaland just surpassed Jesus season tally in one game, Gold: positive, VADER: neutral
Tweet 5: Improve your mindset, no one likes to be around negative people.  #redo96, Gold: positive, VADER: negative
Tweet 6: Pochettino: “No one says anything negative [about Liverpool or City]. It’s like if you win, you win, if you lose, you lose, it’s OK. Nothing happens. But in Chelsea it is completely different because of the pressure of that [£1billion spent on the squad]. For me, it is unfair, but in saying that, I accept the opinions.”, Gold: positive, VADER: negative
Tweet 20: This could make a grown ass man cry, Gold: positive, VADER: negative
Tweet 27: We're on the road once again, as we head down to Gloucestershire to take on Forest Green Rovers! 🔴⚪️ #WxmAFC, Gold: positive, VADER: neutral
Tweet 31: When stick is life., Gold: positive, VADER: neutral
Tweet 37: texting my boys kicking my feet giggling n shit & a lil fart slipped out, Gold: positive, VADER: nega

### Error analysis for positive:
- Tweet 1: This tweet is likely classified as neutral because VADER doesn't recognize any of the words as positive sentiment-bearing words
- Tweet 5: VADER fails to recognize the positive intention of the tweet. The word "negative" with a high mean sentiment rating of -2.7 and the phrase "no one likes to be around" likely overshadows the first part of the tweet and leads VADER to classify this tweet as negative. 
- Tweet 6: The relatively complex structure of this tweet contains both positive and negative aspects. The repeated use of words like "negative" and "lose" and discussion about unfair treatment likely leads VADER identifying the tweet as negative, as it makes up most of the tweet. VADER does not recognize the state of accaptance Pochettino declares ("I accept the opinions") as sufficient to identify the tweet as positive, due to their being less overall  positive sentiment-bearing words in the tweet.
- Tweet 20: Is likely identified as negative due to the words"ass"(-2.5) and "cry"(-2.1). VADER does not seem capture the context in which crying is a positive response to a moving or heartwarming scenario. 
- Tweet 27: The excitement of the team's supporters are not captured by VADER due to a lack of positive sentiment-bearing words in the tweet
- Tweet 31: The tweet is likely too vague, lacking sentiment-bearing words that VADER can recognize.
- Tweet 37: The use of "shit" (-2.6) accompanied by a lack of positive sentiment-bearing words recognized by VADER, likely leads VADER to classify this tweet as negative. The overall humorous nature of the tweet is missed by VADER.

### Error analysis for neutral:
- Tweet 7: Contains phrases that likely led to it being classified as negative by VADER, such as "not implemented" and "I don’t mean to be Mr. Negative", Despite the overall message being more of a critique than a purely negative sentiment
- Tweet 9: The tweet is likely interpreted as negative due to the presence of "curse" which has a high negative mean sentiment rating in the lexicon of -2.5, despite the actual meaning being closer to neutral or quite possibly even positive (based on the context)
- Tweet 10: Similar to the previous tweet, the word "Curse" in the title seemingly leads to a negative sentiment, despite the context being neutral (simply the title of an episode).
- Tweet 16: The sarcastic nature of the tweet is not recognized by VADER. The use of "fuck" by the author directed towards the government and themselves is easily picked up as negative.
- Tweet 18: Similar to the previous tweet, The sarcastic nature of this tweet is not recognized by VADER. This tweet's positive classification likely comes from words like "allowed", "apologize" that are included in the lexicon with a positive mean sentiment rating
- Tweet 21:The negative sentiment detected by VADER is easily attributed to the explicit mention of violent and disturbing actions through words with high negative mean sentiment ratings e.g "killing", "jailed". Despite the news context, which may be considered neutral
- Tweet 32: The use of words like "no" and "stereotypical" likely leads to the negative classification of the tweet, VADER again fails to recognize and handle the sarcastic nature of the tweet 
- Tweet 35: The negative sentiment detected by VADER can easily be attributed to the explicit mention of violent actions through words with negative mean sentiment ratings e.g "cut", "attacking". Despite the context(a description of a scene from an anime) which may be considered neutral
- Tweet 38: VADER likely classifies this tweet as negative because of words with negative mean sentiment ratings like "problem" and dampening intensifiers like "almost". 
- Tweet 42: The presence of the negation "wouldn't" does not seem to be directly applied to "stupid"(-2.4), leading to VADER identifying the tweet as negative, without the ability to contextualize the humor or sarcasm.

### Error analysis for negative:
- All three tweets seem to lack the use of common negation words that VADER could use to identify negative sentiment.
- Tweet 24: The use of "NOOOOOOOOO" strongly indicates negative sentiment, which is not accounted for in VADER's lexicon or its rules
- Tweet 36: Highlights VADER's inability to account for sarcasm, as indicated by the use of "of course" in the sentence. Additionally, VADER fails to capture the nuance of the 🤦🏾‍♂️ emoticon, which is commonly used to indicate embarrassment
- Tweet 49: Indicates an undesired situation the author of the tweet finds themselves in. "Without having to" should serve as a negation to the idea of performing on social media, but VADER might not fully capture the negative sentiment because the complex phrasing. The use of "really" and "less and less excited" indicate intensification but in a negative context. VADER might misinterpret "really" as positive reinforcement to "great work" leading to a positive sentiment that overshadows negative indicators

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
* - Does lemmatisation help? Explain why or why not.
* - Are all parts of speech equally important for sentiment analysis? Explain why or why not.

In [11]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
      airline_tweets_folder.exists())

path: C:\Users\remia\Desktop\period 4\ba-text-mining\lab_sessions\lab3\airlinetweets
this will print True if the folder exists: True


In [12]:
str(airline_tweets_folder)

'C:\\Users\\remia\\Desktop\\period 4\\ba-text-mining\\lab_sessions\\lab3\\airlinetweets'

In [13]:
airline_tweets_all = load_files(str(airline_tweets_folder))
#print(airline_tweets_all)

#### Gold labels for the airline tweets

In [14]:
airline_gold = []

for i in range(len(airline_tweets_all.data)):
    sentiment_index = airline_tweets_all.target[i]
    airline_gold.append(airline_tweets_all.target_names[sentiment_index])

#### Run VADER (as it is) on the set of airline tweets

In [15]:
airline_tweets = []
airline_vader_output = []

for i in range(len(airline_tweets_all.data)):
    # decoding bytes to string before run_vader (to deal with "ExtraData: unpack(b) received extra data" error)
    if isinstance(airline_tweets_all.data[i], bytes):
        textual_unit = airline_tweets_all.data[i].decode('utf-8')
    else:
        textual_unit = airline_tweets_all.data[i]
    vader_output = run_vader(textual_unit) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_tweets.append(textual_unit)
    airline_vader_output.append(vader_label)

#### Run VADER on the set of airline tweets after having lemmatized the text

In [16]:
airline_vader_output_lem = []

for i in range(len(airline_tweets)):
    textual_unit = airline_tweets[i]
    vader_output = run_vader(textual_unit, lemmatize=True) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_vader_output_lem.append(vader_label)

#### Run VADER on the set of airline tweets with only adjectives

In [17]:
airline_vader_output_adj = []

for i in range(len(airline_tweets_all.data)):
    textual_unit = airline_tweets[i]
    vader_output = run_vader(textual_unit, parts_of_speech_to_consider={'ADJ'}) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_vader_output_adj.append(vader_label)

#### Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text

In [18]:
airline_vader_output_lem_adj = []

for i in range(len(airline_tweets_all.data)):
    textual_unit = airline_tweets[i]
    vader_output = run_vader(textual_unit, lemmatize=True, parts_of_speech_to_consider={'ADJ'}) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_vader_output_lem_adj.append(vader_label)

#### Run VADER on the set of airline tweets with only nouns

In [19]:
airline_vader_output_noun = []

for i in range(len(airline_tweets_all.data)):
    textual_unit = airline_tweets[i]
    vader_output = run_vader(textual_unit, parts_of_speech_to_consider={'NOUN'}) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_vader_output_noun.append(vader_label)

#### Run VADER on the set of airline tweets with only nouns and after having lemmatized the text

In [20]:
airline_vader_output_lem_noun = []

for i in range(len(airline_tweets_all.data)):
    textual_unit = airline_tweets[i]
    vader_output = run_vader(textual_unit, lemmatize=True, parts_of_speech_to_consider={'NOUN'}) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_vader_output_lem_noun.append(vader_label)

#### Run VADER on the set of airline tweets with only verbs

In [21]:
airline_vader_output_verb = []

for i in range(len(airline_tweets_all.data)):
    textual_unit = airline_tweets[i]
    vader_output = run_vader(textual_unit, parts_of_speech_to_consider={'VERB'}) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_vader_output_verb.append(vader_label)

#### Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

In [22]:
airline_vader_output_lem_verb = []

for i in range(len(airline_tweets_all.data)):
    textual_unit = airline_tweets[i]
    vader_output = run_vader(textual_unit, lemmatize=True, parts_of_speech_to_consider={'VERB'}) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    airline_vader_output_lem_verb.append(vader_label)

### 4a Answer. Generate for all separate experiments the classification report

In [23]:
print("Run VADER (as it is) on the set of airline tweets\n")
print(classification_report(airline_gold, airline_vader_output, target_names=['negative', 'neutral', 'positive']))

Run VADER (as it is) on the set of airline tweets

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.64      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [24]:
print("Run VADER on the set of airline tweets after having lemmatized the text\n")
print(classification_report(airline_gold, airline_vader_output_lem, target_names=['negative', 'neutral', 'positive']))

Run VADER on the set of airline tweets after having lemmatized the text

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



In [25]:
print("Run VADER on the set of airline tweets with only adjectives\n")
print(classification_report(airline_gold, airline_vader_output_adj, target_names=['negative', 'neutral', 'positive']))

Run VADER on the set of airline tweets with only adjectives

              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.66      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755



In [26]:
print("Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text\n")
print(classification_report(airline_gold, airline_vader_output_lem_adj, target_names=['negative', 'neutral', 'positive']))

Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text

              precision    recall  f1-score   support

    negative       0.87      0.21      0.34      1750
     neutral       0.40      0.89      0.56      1515
    positive       0.66      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.51      0.47      4755
weighted avg       0.66      0.50      0.47      4755



In [27]:
print("Run VADER on the set of airline tweets with only nouns\n")
print(classification_report(airline_gold, airline_vader_output_noun, target_names=['negative', 'neutral', 'positive']))

Run VADER on the set of airline tweets with only nouns

              precision    recall  f1-score   support

    negative       0.73      0.14      0.24      1750
     neutral       0.36      0.82      0.50      1515
    positive       0.53      0.34      0.41      1490

    accuracy                           0.42      4755
   macro avg       0.54      0.43      0.38      4755
weighted avg       0.55      0.42      0.38      4755



In [28]:
print("Run VADER on the set of airline tweets with only nouns and after having lemmatized the text\n")
print(classification_report(airline_gold, airline_vader_output_lem_noun, target_names=['negative', 'neutral', 'positive']))

Run VADER on the set of airline tweets with only nouns and after having lemmatized the text

              precision    recall  f1-score   support

    negative       0.72      0.16      0.26      1750
     neutral       0.36      0.81      0.50      1515
    positive       0.52      0.33      0.40      1490

    accuracy                           0.42      4755
   macro avg       0.53      0.43      0.39      4755
weighted avg       0.54      0.42      0.38      4755



In [29]:
print("Run VADER on the set of airline tweets with only verbs\n")
print(classification_report(airline_gold, airline_vader_output_verb, target_names=['negative', 'neutral', 'positive']))

Run VADER on the set of airline tweets with only verbs

              precision    recall  f1-score   support

    negative       0.77      0.29      0.42      1750
     neutral       0.38      0.81      0.52      1515
    positive       0.57      0.34      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.58      0.48      0.46      4755
weighted avg       0.59      0.47      0.45      4755



In [30]:
print("Run VADER on the set of airline tweets with only verbs and after having lemmatized the text\n")
print(classification_report(airline_gold, airline_vader_output_lem_verb, target_names=['negative', 'neutral', 'positive']))

Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

              precision    recall  f1-score   support

    negative       0.74      0.30      0.42      1750
     neutral       0.38      0.78      0.51      1515
    positive       0.57      0.35      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.56      0.48      0.46      4755
weighted avg       0.57      0.47      0.45      4755



### 4b Answer. Compare the scores and explain what they tell you

$$$$$$$$$   Orlando $$

Does lemmatisation help?


## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

In [31]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
      airline_tweets_folder.exists())

path: C:\Users\remia\Desktop\period 4\ba-text-mining\lab_sessions\lab3\airlinetweets
this will print True if the folder exists: True


In [32]:
str(airline_tweets_folder)

'C:\\Users\\remia\\Desktop\\period 4\\ba-text-mining\\lab_sessions\\lab3\\airlinetweets'

In [33]:
airline_tweets_data = load_files(str(airline_tweets_folder))

In [34]:
dfs=[2,5,10]

for i in dfs:

    airline_vec = CountVectorizer(min_df=i, # If a token appears fewer times than this, across all documents, it will be ignored
                                tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))

    airline_counts = airline_vec.fit_transform(airline_tweets_data.data)

    tfidf_transformer = TfidfTransformer()
    airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

    docs_tfid_train, docs_tfid_test, y_tfid_train, y_tfid_test = train_test_split(
        airline_tfidf, # the tf-idf model
        airline_tweets_data.target, # the category values for each tweet 
        test_size = 0.20, # we use 80% for training and 20% for development
        random_state=1
        ) 

    docs_counts_train, docs_counts_test, y_counts_train, y_counts_test = train_test_split(
        airline_counts, # the bag of words model
        airline_tweets_data.target, # the category values for each tweet 
        test_size = 0.20,
        random_state=1 # we use 80% for training and 20% for development
        )

    tfid_clf = MultinomialNB().fit(docs_tfid_train, y_tfid_train)
    counts_clf = MultinomialNB().fit(docs_counts_train, y_counts_train)

    tfid_y_pred = tfid_clf.predict(docs_tfid_test)
    counts_y_pred = tfid_clf.predict(docs_counts_test)

    print(f'---tfid min_df={i}---\n')
    report = classification_report(y_tfid_test,tfid_y_pred,digits = 3)
    print(report)

    print(f'---counts min_df={i}---\n')
    report = classification_report(y_counts_test,counts_y_pred,digits = 3)
    print(report)




---tfid min_df=2---

              precision    recall  f1-score   support

           0      0.814     0.904     0.856       343
           1      0.868     0.689     0.768       296
           2      0.836     0.897     0.866       312

    accuracy                          0.835       951
   macro avg      0.839     0.830     0.830       951
weighted avg      0.838     0.835     0.832       951

---counts min_df=2---

              precision    recall  f1-score   support

           0      0.833     0.933     0.880       343
           1      0.872     0.693     0.772       296
           2      0.849     0.904     0.876       312

    accuracy                          0.849       951
   macro avg      0.852     0.843     0.843       951
weighted avg      0.851     0.849     0.845       951





---tfid min_df=5---

              precision    recall  f1-score   support

           0      0.827     0.892     0.858       343
           1      0.828     0.730     0.776       296
           2      0.847     0.869     0.858       312

    accuracy                          0.834       951
   macro avg      0.834     0.830     0.831       951
weighted avg      0.834     0.834     0.832       951

---counts min_df=5---

              precision    recall  f1-score   support

           0      0.847     0.901     0.873       343
           1      0.844     0.747     0.792       296
           2      0.861     0.894     0.877       312

    accuracy                          0.851       951
   macro avg      0.850     0.847     0.847       951
weighted avg      0.850     0.851     0.849       951





---tfid min_df=10---

              precision    recall  f1-score   support

           0      0.818     0.854     0.836       343
           1      0.770     0.736     0.753       296
           2      0.835     0.830     0.833       312

    accuracy                          0.810       951
   macro avg      0.808     0.807     0.807       951
weighted avg      0.809     0.810     0.809       951

---counts min_df=10---

              precision    recall  f1-score   support

           0      0.840     0.872     0.856       343
           1      0.794     0.753     0.773       296
           2      0.860     0.865     0.863       312

    accuracy                          0.833       951
   macro avg      0.831     0.830     0.830       951
weighted avg      0.832     0.833     0.832       951



### 5b. Answer $$$ Orlando
####### Look at the results of the experiments with the different settings and try to explain why they differ:

###### which category performs best, is this the case for any setting? 
- best for TF-IDF representation: min_df=2
- best for Bag of words representation: min_df=5 
###### does the frequency threshold affect the scores? Why or why not according to you
- yes, but not by much it would seem. The removal of stopwords with 'stop_words=stopwords.words('english')', and very common tokens (min_df), which are often not useful for training a classification model, means that the vocabulary is already focused on more meaningful words. Adjusting min_df primarily excludes less frequent terms, which may not add much predictive power beyond what the more frequent, informative tokens already provide. This suggests that key determinants of class membership are captured by the more frequent, informative tokens that remain after the threshold is applied.

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below) [1 point]
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why ? 
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [35]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
      airline_tweets_folder.exists())

path: C:\Users\remia\Desktop\period 4\ba-text-mining\lab_sessions\lab3\airlinetweets
this will print True if the folder exists: True


In [36]:
str(airline_tweets_folder)

'C:\\Users\\remia\\Desktop\\period 4\\ba-text-mining\\lab_sessions\\lab3\\airlinetweets'

In [37]:
airline_tweets_data = load_files(str(airline_tweets_folder))

In [38]:
airline_vec = CountVectorizer(min_df=2, # If a token appears fewer times than this, across all documents, it will be ignored
                                tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))

airline_counts = airline_vec.fit_transform(airline_tweets_data.data)




In [39]:
#Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test; Bag of words representation ('airline_count'), min_df=2)
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the Bag of words model with min_df=2. airline_counts_two is defined at question 5
    airline_tweets_data.target, # the category values for each tweet 
    test_size = 0.20, # we use 80% for training and 20% for development
    random_state=2) 

In [40]:
# Train a Multimoda Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

In [41]:
# Predicting the Test set results, find macro recall
y_pred = clf.predict(docs_test)

In [42]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.91      0.88       362
           1       0.86      0.73      0.79       317
           2       0.79      0.85      0.82       272

    accuracy                           0.83       951
   macro avg       0.83      0.83      0.83       951
weighted avg       0.84      0.83      0.83       951



In [43]:
def important_features_per_class(vectorizer,classifier,n=80):
    class_labels = classifier.classes_
    feature_names =vectorizer.get_feature_names_out()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 



#### 6a.Answer

In [44]:
# example of how to call from notebook:
important_features_per_class(airline_vec, clf) # airline_vec_two defined at question 5

Important words in negative documents
0 1499.0 @
0 1367.0 united
0 1212.0 .
0 412.0 ``
0 406.0 ?
0 404.0 flight
0 386.0 !
0 309.0 #
0 219.0 n't
0 154.0 ''
0 129.0 's
0 113.0 service
0 107.0 :
0 104.0 virginamerica
0 99.0 get
0 91.0 cancelled
0 90.0 time
0 89.0 customer
0 87.0 delayed
0 86.0 plane
0 85.0 bag
0 73.0 hours
0 72.0 ...
0 71.0 'm
0 66.0 http
0 66.0 ;
0 66.0 -
0 64.0 gate
0 63.0 airline
0 62.0 late
0 62.0 help
0 60.0 &
0 59.0 still
0 59.0 hour
0 58.0 would
0 56.0 ca
0 54.0 amp
0 53.0 2
0 51.0 've
0 50.0 worst
0 50.0 waiting
0 50.0 one
0 50.0 never
0 50.0 $
0 48.0 flights
0 48.0 (
0 47.0 like
0 47.0 delay
0 45.0 )
0 40.0 wait
0 40.0 flightled
0 40.0 due
0 40.0 back
0 39.0 luggage
0 39.0 lost
0 39.0 fly
0 39.0 check
0 38.0 seat
0 38.0 really
0 37.0 people
0 36.0 us
0 36.0 day
0 35.0 ever
0 35.0 another
0 35.0 3
0 34.0 crew
0 33.0 trying
0 33.0 hold
0 33.0 bags
0 33.0 baggage
0 32.0 ticket
0 32.0 got
0 32.0 airport
0 31.0 u
0 31.0 thanks
0 30.0 terrible
0 30.0 problems
0 29.0 se

#### 6b1.Answer

### $$$ michael

### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook