# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-2 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It servers to get insight in what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications?  Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon. Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?) What do all scores mean? https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt) 


### [3.5 points] Question1:

Regard the following sentences and their output as given by VADER. Regard sentences 1 to 7, and explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain what is going wrong or not when correct and incorrect results are produced. 

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

&nbsp;&nbsp;&nbsp;&nbsp;*The first two sentences are marked very intuitively. VADER considers 'love' a positive word and its negation then as negative. The third sentence is considered more positive than the first because VADER takes the smiley face into account.\
&nbsp;&nbsp;&nbsp;&nbsp;Fourth sentece is similar to the first, 'ruins' is a negative word and thusly the sentece is negative. The fifth sentence is again a negation of the fourth one and so is positive.\
&nbsp;&nbsp;&nbsp;&nbsp;The last two senteces contain the words 'lies' and 'like' which have a negative and positive sentiment respectively in certain contexts. Therefore, even though they are neutral in these sentences, VADER labels them with their respective sentiments.*

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
"1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url" : "https://twitter.com/BarackObama/status/946775615893655552",
    },
```

You can load your tweets with human annotation in the following way.

In [1]:
import json

In [2]:
my_tweets = json.load(open('my_tweets.json'))

In [3]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

0 {'sentiment_label': 'negative', 'text_of_tweet': "@IamNotThatAlex The infamous doxxing website named after a bird. If that's not enough, count yourself lucky and stay away ", 'tweet_url': 'https://twitter.com/im_just_laur/status/1553870749290713089'}


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab2-Sentiment-analysis-using-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are less than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

### Task 3a

In [4]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader_model = SentimentIntensityAnalyzer()

In [5]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'

In [6]:
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = vader_model.polarity_scores(the_tweet)
    vader_label = vader_output_to_label(vader_output)
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])

In [7]:
from sklearn.metrics import classification_report

print(classification_report(gold, all_vader_output))

              precision    recall  f1-score   support

    negative       0.64      0.30      0.41        23
     neutral       0.42      0.56      0.48         9
    positive       0.48      0.72      0.58        18

    accuracy                           0.50        50
   macro avg       0.51      0.53      0.49        50
weighted avg       0.54      0.50      0.48        50



*From the classification report we can observer that the tweets with the negative label have a high precision and lower recall, and the tweets with the neutral and positive label have a lower precision, and higher recall. This indicates that VADER outputs negatively labeled tweets less compared to the (23) gold negative tweet labels. Similarly, VADER labeled more tweets as neutral and positive than there actually were in the gold set (9 and 18 respectivley). The f1-scores indicate that VADER was most succesfull for labeling positive tweets.*

### Task 3b

In [8]:
#print 10 tweets that were misclassified
for label in set(gold):
    print('LABEL:', label.upper())
    c = 0
    for i in range(len(tweets)):
        if gold[i] == label and all_vader_output[i] != label:
            print(f"Tweet {i}: {tweets[i]}")
            print('gold:', gold[i], '|| vader:', all_vader_output[i])
            print()
            c += 1
        if c == 10:
            break
    print()

LABEL: POSITIVE
Tweet 2: @A2Lintra @PEMatson @NewfieldsToday it closed real early, so we didn't even quite finish the museum much less explore any of the grounds. So, there is a lot more to see on a future visit.
gold: positive || vader: neutral

Tweet 6: @KMNetter @cathleendecker @pkcapitol She was on MSNBC while it was going on. She was hardly cowering in fear and afraid to let anyone in. The lady sounded both outraged and ready to kick ass to me. Rightfully so.
gold: positive || vader: negative

Tweet 17: Just as you feel when you look on the river and sky, so I felt; Just as any of you is one of a living crowd, I was one of a crowd  --Walt Whitman, whose 200th birthday is Friday #Whitman200 https://t.co/yFdTwklH9h
gold: positive || vader: neutral

Tweet 35: When Covid struck Michigan hard @GovWhitmer listened to science &amp; saved thousands of lives with her swift action.  But make no mistake, the pathogens of hate and division spread, incited by the president and his complicit su

#### NEUTRAL LABELS:
*The neutral tweets that were labeled incorrectly by VADER, generally seem to do so as a result of having positive or negative words in the sentence, that do not necessarily contribute to the sentiment. Examples of the positive words can be seen in tweets 4, 10 and 26, with words such as 'favorate' and 'justice'. The only tweet labeled as negative (tweet 26) contains negative words such as 'forced' and 'cut'.*

#### POSITIVE LABELS:
*All misclasifications for the tweets that should have been marked as positive, are because VADER marked them as neutral instead. Most of these tweets state something positive, despite something negative (2, 6, 35, 37). These generally contain a mix of positive and negative words, which makes it understandable that it was classified as neutral. More complex reasoning is required to understand the true meaning/sentiment of such statements. Additionally, one of the tweets contained a metaphorical statement (17), which is also inherently difficult to classify correctly.*

#### NEGATIVE LABELS:
*Many of the negative tweets, VADER has labeled as positive instead. This is partially due to people using positive (or strengthening) words to describe negative things, to emphasise how bad it is (0, 3, 20, 21, 22, 30, 31). In some cases, it is even done sarcastically (9). Some tweets that VADER marked as neutral contain a more cryptic negative sentiment, and does not contain (many strong) negative or positive words (5, 22).*

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
* - Does lemmatisation help? Explain why or why not.
* - Are all parts of speech equally important for sentiment analysis? Explain why or why not.

### Task 4a

In [9]:
def run_vader(textual_unit, lemmatize=False, parts_of_speech_to_consider=None, verbose=0):
    doc = nlp(textual_unit)
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:
            to_add = token.text
            if lemmatize:
                to_add = token.lemma_
                if to_add == '-PRON-': to_add = token.text
            if parts_of_speech_to_consider and token.pos_ in parts_of_speech_to_consider:
                input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', sent)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [10]:
import spacy
from sklearn.datasets import load_files

nlp = spacy.load('en_core_web_sm')
airline_tweets_train = load_files('airlinetweets')

In [11]:
#run vader on tweets
all_vader_output = list()
for sent in airline_tweets_train.data:
    scores = vader_model.polarity_scores(str(sent))
    #print()
    #print('VADER OUTPUT', scores)
    vader_label = vader_output_to_label(scores)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.80      0.49      0.60      1750
     neutral       0.57      0.56      0.56      1515
    positive       0.56      0.83      0.67      1490

    accuracy                           0.62      4755
   macro avg       0.64      0.63      0.61      4755
weighted avg       0.65      0.62      0.61      4755



In [12]:
#lemmatized text
all_vader_output = list()
for i in airline_tweets_train.data:
    lemtext = run_vader(str(i), lemmatize=True)
    vader_label = vader_output_to_label(lemtext)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



In [13]:
#only adjectives
all_vader_output = list()
for i in airline_tweets_train.data:
    lemtext = run_vader(str(i), lemmatize=False, parts_of_speech_to_consider={'ADJ'}, verbose=0)
    vader_label = vader_output_to_label(lemtext)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [14]:
#only adjectives and after having lemmatized the text
all_vader_output = list()
for i in airline_tweets_train.data:
    lemtext = run_vader(str(i), lemmatize=True, parts_of_speech_to_consider={'ADJ'}, verbose=0)
    vader_label = vader_output_to_label(lemtext)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



In [15]:
#only nouns
all_vader_output = list()
for i in airline_tweets_train.data:
    lemtext = run_vader(str(i), lemmatize=False, parts_of_speech_to_consider={'NOUN'}, verbose=0)
    vader_label = vader_output_to_label(lemtext)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [16]:
#only nouns and after having lemmatized the text
all_vader_output = list()
for i in airline_tweets_train.data:
    lemtext = run_vader(str(i), lemmatize=True, parts_of_speech_to_consider={'NOUN'}, verbose=0)
    vader_label = vader_output_to_label(lemtext)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



In [17]:
#only verbs
all_vader_output = list()
for i in airline_tweets_train.data:
    lemtext = run_vader(str(i), lemmatize=False, parts_of_speech_to_consider={'VERB'}, verbose=0)
    vader_label = vader_output_to_label(lemtext)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [18]:
#only verbs and after having lemmatized the text
all_vader_output = list()
for i in airline_tweets_train.data:
    lemtext = run_vader(str(i), lemmatize=True, parts_of_speech_to_consider={'VERB'}, verbose=0)
    vader_label = vader_output_to_label(lemtext)
    all_vader_output.append(vader_label)

print(classification_report([airline_tweets_train.target_names[i] for i in airline_tweets_train.target], all_vader_output))

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



### Task 4b

*Comparing the classification reports of lemmatized and unlemmatized tweets we see that lemmatization does not really result in a significant difference. For every report the scores differ between 0.01 and 0.05. Lemmatization converts a word to its base form which does not really help VADER with semtiment classification in this case.*

*The best performance that we get using only one POS catagory is from the only adjectives report. The f1 score is: neg 0.34, neutr 0.56 and pos 0.53 with an accuracy of 0.50. Specifically the recall is really high (0.89) with a low precision (0.41) for the neutral tweets meaning it returns many results, but most of its predicted labels are incorrect when compared to the training labels. The opposite is true for the negative and positive tweets. There the recall is lower than the precision. VADER predicts the sentiment correctly of a low number of sentences.*

*The next best POS category is only using verbs and then nouns as tags. VADER however performs best when all the POS labels are used. The respective performance of f1-scores when all labels are used is: neg 0.60, neutr 0.56 and pos 0.67 with an accuracy of 0.62 which is the highest out of all.*

*All POS are helpfull since VADER can use all words with sentiment to figure out the overall sentiment of a sentance. If you only want to use one tag in this model then you are best off by using the adjective tag. However it will always be more accurate when all the tags are used.*

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

### Task 5a

In [19]:
import nltk
import warnings
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

warnings.filterwarnings('ignore')

In [20]:
for min_df in [2, 5, 10]:
    # Create feature extractors
    airline_vec = CountVectorizer(min_df=min_df, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
    tfidf_transformer = TfidfTransformer()

    # Extract features
    airline_counts = airline_vec.fit_transform(airline_tweets_train.data)
    airline_tfidf = tfidf_transformer.fit_transform(airline_counts)
    
    # Split train and test data
    tfidf_train, tfidf_test, tfidf_y_train, tfidf_y_test = train_test_split(airline_counts, airline_tweets_train.target, test_size=0.2)
    count_train, count_test, count_y_train, count_y_test = train_test_split(airline_counts, airline_tweets_train.target, test_size=0.2)
    
    # Train and predict
    print(f"-------------------------MIN_DF={min_df}-------------------------")
    print("Trained with tfidf:")
    print(classification_report(tfidf_y_test, MultinomialNB().fit(tfidf_train, tfidf_y_train).predict(tfidf_test)), '\n')
    print("Trained with bow:")
    print(classification_report(count_y_test, MultinomialNB().fit(count_train, count_y_train).predict(count_test)))
    print()

-------------------------MIN_DF=2-------------------------
Trained with tfidf:
              precision    recall  f1-score   support

           0       0.84      0.88      0.86       354
           1       0.81      0.69      0.75       298
           2       0.81      0.87      0.84       299

    accuracy                           0.82       951
   macro avg       0.82      0.81      0.81       951
weighted avg       0.82      0.82      0.82       951
 

Trained with bow:
              precision    recall  f1-score   support

           0       0.84      0.93      0.88       339
           1       0.84      0.72      0.78       313
           2       0.82      0.85      0.84       299

    accuracy                           0.84       951
   macro avg       0.84      0.83      0.83       951
weighted avg       0.84      0.84      0.83       951


-------------------------MIN_DF=5-------------------------
Trained with tfidf:
              precision    recall  f1-score   support

    

### Task 5b

*Comparing the accuracy of both settings, they seem to be very similar. For every frequency threshold the results are almost identical.*

*For min_df = 2 the accuracy is 0.85 for TF-IDF compared to 0.84 for Bag of words.*

*For min_df = 5 the accuracy is 0.84 for TF-IDF compared to 0.85 for Bag of words.*

*For min_df = 10 the accuracy is 0.84 for both settings.*

*Looking at these results we can also see that the frequency threshold does not really affect the scores. By increasing the min_df, more terms should be ignored that appear too infrequent. But by removing these terms, the accuracy does not increase. But a higher frequency threshold may be beneficial still by reducing the dimensionality of the imput vector without negatively impacting performance.*

*The reason for why a higher frequency threshold does not reduce accuracy might be because the words are still considered relatively infrequent and so the model is not able to learn an association between them and the sentiment of the tweet.*

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below) [1 point]
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why ? 
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

### Task 6a

In [21]:
airline_vec = CountVectorizer(min_df=min_df, tokenizer=nltk.word_tokenize, stop_words=stopwords.words('english'))
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)
data, _, labels, _ = train_test_split(airline_counts, airline_tweets_train.target, test_size=0.2)
clf = MultinomialNB().fit(data, labels)

In [22]:
def important_features_per_class(vectorizer, classifier, n=80):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

important_features_per_class(airline_vec, clf)

Important words in negative documents
0 1529.0 @
0 1389.0 united
0 1249.0 .
0 416.0 ``
0 409.0 flight
0 401.0 ?
0 371.0 !
0 325.0 #
0 222.0 n't
0 160.0 ''
0 139.0 's
0 114.0 :
0 111.0 service
0 108.0 virginamerica
0 99.0 get
0 99.0 cancelled
0 92.0 delayed
0 91.0 bag
0 89.0 customer
0 86.0 time
0 84.0 plane
0 74.0 ...
0 74.0 'm
0 73.0 http
0 73.0 -
0 69.0 hours
0 68.0 gate
0 65.0 ;
0 64.0 hour
0 61.0 still
0 59.0 airline
0 58.0 late
0 58.0 help
0 57.0 would
0 57.0 &
0 56.0 ca
0 56.0 2
0 55.0 flights
0 52.0 amp
0 51.0 worst
0 51.0 like
0 50.0 one
0 50.0 $
0 48.0 flightled
0 47.0 delay
0 46.0 've
0 45.0 waiting
0 45.0 never
0 43.0 us
0 43.0 3
0 43.0 (
0 40.0 really
0 40.0 lost
0 40.0 ever
0 40.0 )
0 39.0 back
0 38.0 wait
0 37.0 u
0 37.0 last
0 37.0 check
0 36.0 seat
0 36.0 due
0 36.0 another
0 35.0 fly
0 35.0 day
0 33.0 seats
0 33.0 people
0 33.0 luggage
0 33.0 bags
0 32.0 ticket
0 32.0 thanks
0 32.0 could
0 32.0 airport
0 31.0 hold
0 31.0 guys
0 31.0 even
0 30.0 problems
0 30.0 need
0 2

### Task 6b

*Expected features for the negative class are cancelled, delayed, worst, last etc since they are clearly negative. There are a lot of neutral words listed which are a bit unexpected such as united, flight, service.*

*Expected features for the neutral class are words like names such as jetblue, southwestair, americair and tomorrow, know. We did not expect 'help' and 'please' to be that highly ranked in the neutral list.*

*For the positive list we see positive words like thanks, great, love etc in the upper part of the list. It is unexpected however that names such as the airlines are ranked really high (southwestair and jetblue higher than thanks).*

*The tweets are about airlines so a high occurence of names of the airline companies are expected. Maybe deleting these would improve the ranking of the model. On the other hand, certain airlines may generally have worse or better client satisfaction and may be correlated with the sentiment.*

*To further improve the model we would probably delete the standard make-up of the tweets such as: @, # and http. We would also remove stop-words. We would keep negations since negation handling plays an important role in classification. An example would be n't which is vital in sentences.*