# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the Lab 3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis concerns the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in the misclassifications? Discuss why these errors are made and propose ways to solve them.
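
As a reminder of what the numbers in the report mean, the per-class scores can also be computed by hand. The sketch below is a plain-Python illustration only: the helper names `per_class_scores` and `macro_f1` and the toy labels are made up for this example, and in the assignment itself you would simply call scikit-learn's `classification_report`.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # p was predicted, but wrongly
            fn[g] += 1      # a gold item of class g was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_predicted = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_predicted if n_predicted else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# toy labels, invented for illustration
gold = ['positive', 'positive', 'negative', 'neutral', 'negative']
pred = ['positive', 'neutral', 'negative', 'neutral', 'positive']
scores = per_class_scores(gold, pred)
for label, (p, r, f) in scores.items():
    print(f'{label:>8}  P={p:.2f}  R={r:.2f}  F1={f:.2f}')
print('macro F1:', round(macro_f1(scores), 2))
```

The macro average treats every class equally, whereas the weighted average in the report weights each class by its support.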

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at [the lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? word form? part of speech? polarity value? word meaning?). What do all the scores mean?
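
To make the compound score less of a black box: as far as we can tell from the VADER source code, the summed word valences are squashed into the range (-1, 1) with x / sqrt(x² + 15), and a valence that directly follows a negation is multiplied by -0.74. The sketch below reimplements only these two steps in plain Python. The two constants and the valence of 3.2 for 'love' are assumptions based on reading the code and the lexicon, so check them against the version you have installed. Punctuation emphasis, capitalisation, intensifiers and all other rules are left out.

```python
import math

ALPHA = 15                # normalisation constant (assumed from the VADER source)
NEGATION_SCALAR = -0.74   # multiplier for a negated valence (assumed from the VADER source)


def compound(valences):
    """Sum the word valences and normalise the sum to the range (-1, 1)."""
    total = sum(valences)
    return total / math.sqrt(total * total + ALPHA)


love = 3.2     # assumed lexicon value for 'love'
smiley = 1.3   # lexicon value for ':-)' as quoted in the answers of this notebook

plain = compound([love])                       # "I love apples"
negated = compound([love * NEGATION_SCALAR])   # "I don't love apples"
with_smiley = compound([love, smiley])         # "I love apples :-)"

print(round(plain, 3), round(negated, 3), round(with_smiley, 3))
```

The three values come out close to the compound scores listed for sentences 1 to 3 in Question 1, which suggests that the assumed constants are at least consistent with the output shown there.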


### [3.5 points] Question 1:

Consider the following sentences and their output as given by VADER. Explain the outcome **for each sentence** (1 to 7), taking into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each sentence, explain why the result is correct or what goes wrong.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

**answers:**
* sentence 1: There are no negative words, so neg is 0.0. The other words are neutral or positive, which results in a positive compound score based on how positive the words are.
* sentence 2: A large portion of the sentence is seen as negative because of 'don't', which triggers VADER's negation rule. It is strange that adding 'don't' makes 63% of the sentence negative. The compound score could have been higher, since 'don't love' is not very negative.
* sentence 3: The score is higher than for sentence 1, because ':-)', with a lexicon value of 1.3, adds a lot of positivity to the sentence.
* sentence 4: 'ruins' (lexicon value -1.9) is not necessarily negative; it can refer to something monumental as well. The sentence should be more neutral than negative. The score should be based on the context and not just on the word itself.
* sentence 5: This is a clear example of the negation rule used by VADER: 'ruins' in the scope of 'not' is seen as positive.
* sentence 6: 'lies' is seen as negative, probably because people annotated the word as negative through its association with someone telling lies rather than someone lying down. VADER does not take into account what a word means in different contexts.
* sentence 7: The sentence is scored as somewhat positive, probably because of the word 'like'. It would have been more logical if it were 100% neutral, since it is a plain statement that does not say anything about the quality of the house.



### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. 

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [1]:
import json

In [2]:
# open the file in a with-block so that it is closed again after loading
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)

In [3]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'negative', 'text_of_tweet': 'They spied on my campaign!', 'tweet_url': 'https://twitter.com/realDonaldTrump/status/1232908958421131264'}
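
Because the labels are typed by hand, it can help to check the file before running any evaluation. The following is a small sketch of such a check. `validate_tweets` is a hypothetical helper, not part of the provided notebooks, and it assumes the dictionary layout shown above; the second entry in the example is made up to show what a typo in a label looks like.

```python
ALLOWED_LABELS = {'negative', 'neutral', 'positive'}
REQUIRED_KEYS = {'sentiment_label', 'text_of_tweet', 'tweet_url'}


def validate_tweets(tweets, expected_count=50):
    """Return a list of human-readable problems found in the annotations."""
    problems = []
    if len(tweets) != expected_count:
        problems.append(f'expected {expected_count} tweets, found {len(tweets)}')
    for id_, info in tweets.items():
        missing = REQUIRED_KEYS - info.keys()
        if missing:
            problems.append(f'tweet {id_}: missing keys {sorted(missing)}')
            continue
        if info['sentiment_label'] not in ALLOWED_LABELS:
            problems.append(
                f"tweet {id_}: unknown label {info['sentiment_label']!r}")
        if not info['text_of_tweet'].strip():
            problems.append(f'tweet {id_}: empty text')
    return problems


example = {
    '1': {'sentiment_label': 'negative',
          'text_of_tweet': 'They spied on my campaign!',
          'tweet_url': 'https://twitter.com/realDonaldTrump/status/1232908958421131264'},
    # made-up entry with a typo in the label
    '2': {'sentiment_label': 'postive',
          'text_of_tweet': 'Great day',
          'tweet_url': ''},
}
problems = validate_tweets(example, expected_count=2)
print(problems)
```

In the notebook you would call `validate_tweets(my_tweets)` right after loading the file and fix any reported entries in the JSON file.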


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point.
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER rules and the VADER lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [4]:
import pathlib
import nltk
import spacy
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.datasets import load_files
from sklearn.metrics import classification_report

vader_model = SentimentIntensityAnalyzer()
nlp = spacy.load('en_core_web_sm') # en_core_web_sm

cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
airline_tweets_train = load_files(str(airline_tweets_folder))

data = airline_tweets_train.data
# 0:negative, 1:neutral, 2:positive
labels = airline_tweets_train.target

In [5]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=set(),
              verbose=0):
    """
    Run VADER on a textual unit, after tokenizing it with spaCy
    
    :param str textual_unit: a textual unit, e.g., one or more sentences (one string);
    its sentences are processed by looping over doc.sents
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -empty set -> all parts of speech are provided
    -non-empty set: only these parts of speech are considered
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

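                # spaCy v2 returns the placeholder '-PRON-' as the lemma of
                # pronouns; fall back to the original word in that case
                if to_add == '-PRON-': 
                    to_add = token.text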

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        # print the whole input; `sent` would only hold the last sentence
        print('INPUT SENTENCE', textual_unit)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [6]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
def vader_output_to_label_num(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following integer values
    (the same encoding as the airline tweets targets):
    a) positive float -> 2
    b) 0.0 -> 1
    c) negative float -> 0
    
    :param dict vader_output: output dict from vader
    
    :rtype: int
    :return: 0 (negative) | 1 (neutral) | 2 (positive)
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 0
    elif compound == 0.0:
        return 1
    elif compound > 0.0:
        return 2
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
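
The mapping above treats only a compound score of exactly 0.0 as neutral. The VADER documentation suggests a small band around zero instead (compound >= 0.05 as positive, compound <= -0.05 as negative). A possible variant is sketched below. The function name and the default threshold are our own choices for illustration, and the results reported in this notebook were not produced with it.

```python
def vader_output_to_label_thresholded(vader_output, threshold=0.05):
    """
    Map a VADER output dict to 'negative' | 'neutral' | 'positive',
    treating compound scores within (-threshold, threshold) as neutral.
    """
    compound = vader_output['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# a weakly positive compound score now counts as neutral
print(vader_output_to_label_thresholded({'compound': 0.01}))
print(vader_output_to_label_thresholded({'compound': -0.4404}))
```

Whether such a band helps on your own tweets is an empirical question: it can raise recall for the neutral class at the cost of recall for the other two.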

In [7]:
tweets = []
all_vader_output = []
gold = []

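# settings (to change for different experiments)
to_lemmatize = False  # the report below was produced with run_vader's defaults
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    # pass the settings on, otherwise run_vader silently uses its defaults
    vader_output = run_vader(the_tweet,
                             lemmatize=to_lemmatize,
                             parts_of_speech_to_consider=pos)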
    vader_label = vader_output_to_label(vader_output)
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
# use scikit-learn's classification report
print(classification_report(gold, all_vader_output))

              precision    recall  f1-score   support

    negative       0.94      0.76      0.84        21
     neutral       0.60      0.55      0.57        11
    positive       0.65      0.83      0.73        18

    accuracy                           0.74        50
   macro avg       0.73      0.71      0.72        50
weighted avg       0.76      0.74      0.74        50



In [8]:
# print every tweet for which the VADER label differs from the gold label
for tweet, predicted, true in zip(tweets, all_vader_output, gold):
    if predicted != true:
        print('True label:', true, ', Predicted label:', predicted)
        print(tweet)
        print('--------------------------------------------------------')

True label: negative , Predicted label: neutral
They spied on my campaign!
--------------------------------------------------------
True label: negative , Predicted label: positive
There has rarely been a juror so tainted as the forewoman in the Roger Stone case. Look at her background. She never revealed her hatred of “Trump” and Stone. She was totally biased, as is the judge. Roger wasn’t even working on my campaign. Miscarriage of justice. Sad to watch!
--------------------------------------------------------
True label: negative , Predicted label: neutral
Car drives into crowd during carnival procession in German town of Volkmarsen, injuring several people, police say
--------------------------------------------------------
True label: negative , Predicted label: positive
Supporter of Islamic State group admits plotting to bomb London's St Paul's Cathedral and a hotel
--------------------------------------------------------
True label: neutral , Predicted label: positive
SparkLabs 

**answers:**
- Overall, the classification report shows that the negative class has the best F1-score (0.84) and precision (0.94), while the positive class has the highest recall (0.83). This means that when VADER predicts 'negative' it is almost always right, and that most of the truly positive tweets are found. VADER performs poorly on the neutral class (F1-score 0.57) compared to the other two classes.
- Most of the errors are neutral tweets that VADER labels as positive. This may be explained by the fact that VADER scores words independently of their context, so that words with a positive connotation in a neutral tweet inflate the compound score. The confusion between positive and negative tweets may be due to the fact that those texts use words that are not in the VADER lexicon, such as 'spied' and 'miscarriage', which have a negative sentiment but were not considered by the algorithm.
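
To back up statements about which confusions dominate, the error pairs can be counted instead of read off the printed list. A minimal sketch follows; `confusion_pairs` is a hypothetical helper and the labels are toy values, not the real `gold` and `all_vader_output` lists.

```python
from collections import Counter


def confusion_pairs(gold, predicted):
    """Count (gold label, predicted label) pairs for misclassified items only."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


# toy labels, invented for illustration
gold_toy = ['neutral', 'neutral', 'negative', 'positive', 'negative']
pred_toy = ['positive', 'positive', 'neutral', 'positive', 'negative']

pairs = confusion_pairs(gold_toy, pred_toy)
for (true_label, predicted_label), count in pairs.most_common():
    print(f'{true_label} -> {predicted_label}: {count}')
```

In the notebook, `confusion_pairs(gold, all_vader_output).most_common()` would list the most frequent type of error first.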

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate the classification report for each experiment, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

In [9]:
# Run VADER (as it is) on the set of airline tweets
eval_v = [vader_output_to_label_num(run_vader(str(tweet))) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.80      0.51      0.63      1750
           1       0.60      0.51      0.55      1515
           2       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.66      0.63      0.62      4755



In [10]:
# Run VADER on the set of airline tweets after having lemmatized the text
eval_v = [vader_output_to_label_num(run_vader(str(tweet), lemmatize=True)) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.79      0.52      0.63      1750
           1       0.60      0.50      0.54      1515
           2       0.56      0.87      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755



In [11]:
#Run VADER on the set of airline tweets with only adjectives
eval_v = [vader_output_to_label_num(run_vader(str(tweet), parts_of_speech_to_consider={'ADJ'})) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.86      0.21      0.34      1750
           1       0.41      0.89      0.56      1515
           2       0.67      0.45      0.54      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.52      0.48      4755
weighted avg       0.66      0.50      0.47      4755



In [12]:
#Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
eval_v = [vader_output_to_label_num(run_vader(str(tweet), 
                                              parts_of_speech_to_consider={'ADJ'},
                                              lemmatize=True)) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.86      0.21      0.34      1750
           1       0.41      0.89      0.56      1515
           2       0.67      0.45      0.54      1490

    accuracy                           0.50      4755
   macro avg       0.65      0.52      0.48      4755
weighted avg       0.66      0.50      0.47      4755



In [13]:
#Run VADER on the set of airline tweets with only nouns
eval_v = [vader_output_to_label_num(run_vader(str(tweet), parts_of_speech_to_consider={'NOUN'})) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.72      0.13      0.23      1750
           1       0.36      0.82      0.50      1515
           2       0.54      0.33      0.41      1490

    accuracy                           0.42      4755
   macro avg       0.54      0.43      0.38      4755
weighted avg       0.55      0.42      0.37      4755



In [14]:
#Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
eval_v = [vader_output_to_label_num(run_vader(str(tweet), 
                                              parts_of_speech_to_consider={'NOUN'},
                                              lemmatize=True)) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.71      0.15      0.24      1750
           1       0.36      0.81      0.50      1515
           2       0.53      0.33      0.40      1490

    accuracy                           0.42      4755
   macro avg       0.53      0.43      0.38      4755
weighted avg       0.54      0.42      0.37      4755



In [15]:
#Run VADER on the set of airline tweets with only verbs
eval_v = [vader_output_to_label_num(run_vader(str(tweet), parts_of_speech_to_consider={'VERB'})) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.77      0.29      0.42      1750
           1       0.38      0.81      0.52      1515
           2       0.57      0.34      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.58      0.48      0.46      4755
weighted avg       0.59      0.47      0.45      4755



In [16]:
#Run VADER on the set of airline tweets with only verbs and after having lemmatized the text
eval_v = [vader_output_to_label_num(run_vader(str(tweet), 
                                              parts_of_speech_to_consider={'VERB'},
                                              lemmatize=True)) for tweet in data]
print(classification_report(labels, eval_v))

              precision    recall  f1-score   support

           0       0.74      0.29      0.42      1750
           1       0.38      0.78      0.51      1515
           2       0.57      0.35      0.44      1490

    accuracy                           0.47      4755
   macro avg       0.56      0.48      0.45      4755
weighted avg       0.57      0.47      0.45      4755



***answer:***
- Lemmatization has only a negligible effect on the performance: accuracy goes from 0.63 to 0.62 on the full text, and the scores in the part-of-speech experiments change by at most three points. This may be explained by the fact that the airline tweets do not contain much word-form variation that can be reduced to lemmas.
- Restricting the input to one part of speech lowers the weighted averages in all cases (the weighted F1-score drops from 0.62 to 0.47 for adjectives, 0.45 for verbs and 0.37 for nouns). With adjectives only, the precision for negative (0.86 vs. 0.80) and positive (0.67 vs. 0.56) tweets is higher than the baseline, but the recall for both classes drops sharply. With nouns or verbs only, the precision for these two classes is about equal to or below the baseline. The precision for the neutral class is lower in all part-of-speech experiments (0.36 to 0.41 vs. 0.60). So the parts of speech are not equally important: adjectives, verbs and nouns can all convey sentiment, adjectives alone score best of the three, and VADER needs them together to perform best.
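
In the reports above, recall for class 1 (neutral) rises from 0.51 to around 0.8 when only one part of speech is kept. One mechanism that could explain this is shown in the toy example below: if the filter removes every word that is in the lexicon, nothing is left to score, the compound score is 0.0, and the label mapping turns that into 'neutral'. The two-word lexicon and the hand-assigned tags are stand-ins made up for this sketch; they are not the real VADER lexicon or spaCy output.

```python
# stand-in valences, for illustration only
TOY_LEXICON = {'love': 3.2, 'worst': -3.1}


def toy_label(tagged_tokens, keep_pos=None):
    """Sum toy valences over the kept tokens and map the sign to a label."""
    kept = [tok for tok, pos in tagged_tokens
            if not keep_pos or pos in keep_pos]
    total = sum(TOY_LEXICON.get(tok.lower(), 0.0) for tok in kept)
    if total > 0:
        return 'positive'
    if total < 0:
        return 'negative'
    return 'neutral'


# hand-tagged toy tweet
tweet = [('I', 'PRON'), ('love', 'VERB'), ('this', 'DET'), ('airline', 'NOUN')]

all_words = toy_label(tweet)
only_nouns = toy_label(tweet, keep_pos={'NOUN'})
only_verbs = toy_label(tweet, keep_pos={'VERB'})
print(all_words, only_nouns, only_verbs)
```

With all words or only verbs the toy tweet is labelled positive, but with only nouns the sentiment-bearing word is filtered out and the tweet ends up neutral.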

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df = 10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?
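
The two settings can be illustrated without scikit-learn. The sketch below is a simplified stand-in for what `CountVectorizer(min_df=...)` and `TfidfTransformer` do: it keeps only the terms that occur in at least `min_df` documents and computes a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which we assume corresponds to scikit-learn's default (`smooth_idf=True`). Tokenisation is plain whitespace splitting, the helper names are hypothetical, and the three documents are invented.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each term occurs."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per document
    return df


def build_vocabulary(docs, min_df=2):
    """Keep only the terms that occur in at least min_df documents."""
    df = document_frequencies(docs)
    return sorted(term for term, count in df.items() if count >= min_df)


def smoothed_idf(docs, vocabulary):
    """Smoothed inverse document frequency per vocabulary term."""
    n = len(docs)
    df = document_frequencies(docs)
    return {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocabulary}


# invented mini-corpus
docs = ['flight delayed again', 'flight cancelled', 'great flight great crew']

vocab_1 = build_vocabulary(docs, min_df=1)
vocab_2 = build_vocabulary(docs, min_df=2)
idf = smoothed_idf(docs, vocab_1)
print(vocab_1)
print(vocab_2)
print({term: round(value, 2) for term, value in idf.items()})
```

A higher `min_df` shrinks the vocabulary by dropping rare terms, while TF-IDF keeps all terms but gives a lower weight to terms that occur in many documents (here 'flight').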

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

def create_count_vec(tweets, min_df=2):
    vec = CountVectorizer(min_df=min_df,
                                 tokenizer=nltk.word_tokenize, 
                                 stop_words=stopwords.words('english'))
    count_vec = vec.fit_transform(tweets.data)
    return count_vec

def create_tfidf_vec(tweets, min_df=2):
    count_vec = create_count_vec(tweets, min_df)

    tfidf_transformer = TfidfTransformer()
    tfidf_vec = tfidf_transformer.fit_transform(count_vec)
    return tfidf_vec

#create vectors
vec_2 = create_count_vec(airline_tweets_train)
vec_5 = create_count_vec(airline_tweets_train, min_df=5)
vec_10 = create_count_vec(airline_tweets_train, min_df=10)

tfidf_vec_2 = create_tfidf_vec(airline_tweets_train)
tfidf_vec_5 = create_tfidf_vec(airline_tweets_train, min_df=5)
tfidf_vec_10 = create_tfidf_vec(airline_tweets_train, min_df=10)

def train_and_test_nb(vec, tweets):
    #split data
    docs_train, docs_test, y_train, y_test = train_test_split(
        vec, 
        tweets.target, 
        test_size = 0.2
        ) 
    #train
    clf = MultinomialNB().fit(docs_train, y_train)
    y_pred = clf.predict(docs_test)
    
    return classification_report(y_test, y_pred)



In [18]:
print('Bag of Words, min_df=2:\n', train_and_test_nb(vec_2, airline_tweets_train), '\n')
print('Bag of Words, min_df=5:\n', train_and_test_nb(vec_5, airline_tweets_train), '\n')
print('Bag of Words, min_df=10:\n', train_and_test_nb(vec_10, airline_tweets_train), '\n')
print('---------------------------------------------------------------------\n')
print('TF-IDF, min_df=2:\n', train_and_test_nb(tfidf_vec_2, airline_tweets_train), '\n')
print('TF-IDF, min_df=5:\n', train_and_test_nb(tfidf_vec_5, airline_tweets_train), '\n')
print('TF-IDF, min_df=10:\n', train_and_test_nb(tfidf_vec_10, airline_tweets_train), '\n')

Bag of Words, min_df=2:
               precision    recall  f1-score   support

           0       0.82      0.92      0.87       344
           1       0.87      0.73      0.80       305
           2       0.82      0.84      0.83       302

    accuracy                           0.83       951
   macro avg       0.84      0.83      0.83       951
weighted avg       0.84      0.83      0.83       951
 

Bag of Words, min_df=5:
               precision    recall  f1-score   support

           0       0.86      0.91      0.88       349
           1       0.82      0.77      0.80       304
           2       0.83      0.83      0.83       298

    accuracy                           0.84       951
   macro avg       0.84      0.83      0.84       951
weighted avg       0.84      0.84      0.84       951
 

Bag of Words, min_df=10:
               precision    recall  f1-score   support

           0       0.86      0.93      0.89       335
           1       0.83      0.81      0.82      

***answer***:
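- Overall, the negative class has the highest F1-score, independently of the setting.
- The min_df threshold has hardly any effect on the results. This may be explained by the fact that the dataset is large enough for most of the words that matter for the classification to pass the threshold. Note also that every call to `train_test_split` draws a new random split (the support per class differs between the reports), so differences of one or two points between the settings may simply be noise.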

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why ? 
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [19]:
def important_features_per_class(vectorizer, classifier, n=80):
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:

#vec
airline_vec = CountVectorizer(min_df=2,
                             tokenizer=nltk.word_tokenize, 
                             stop_words=stopwords.words('english'))
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

#split data
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, 
    airline_tweets_train.target, 
    test_size = 0.2
    ) 

#train
clf = MultinomialNB().fit(docs_train, y_train)
important_features_per_class(airline_vec, clf)



Important words in negative documents
0 1504.0 @
0 1376.0 united
0 1249.0 .
0 431.0 ``
0 392.0 flight
0 370.0 ?
0 357.0 !
0 317.0 #
0 215.0 n't
0 160.0 ''
0 128.0 's
0 114.0 service
0 101.0 virginamerica
0 101.0 :
0 97.0 cancelled
0 96.0 get
0 91.0 customer
0 90.0 ...
0 88.0 bag
0 82.0 plane
0 78.0 time
0 77.0 delayed
0 75.0 -
0 73.0 ;
0 73.0 'm
0 67.0 hours
0 65.0 late
0 65.0 &
0 64.0 gate
0 63.0 http
0 61.0 still
0 60.0 amp
0 59.0 2
0 58.0 hour
0 58.0 airline
0 57.0 would
0 56.0 help
0 55.0 ca
0 54.0 worst
0 50.0 one
0 49.0 flights
0 49.0 $
0 47.0 like
0 46.0 waiting
0 44.0 delay
0 43.0 've
0 42.0 never
0 42.0 flightled
0 42.0 3
0 42.0 (
0 40.0 people
0 40.0 back
0 39.0 luggage
0 39.0 ever
0 39.0 )
0 38.0 us
0 38.0 check
0 37.0 lost
0 37.0 due
0 36.0 trying
0 35.0 u
0 35.0 really
0 35.0 fly
0 35.0 day
0 35.0 bags
0 33.0 wait
0 33.0 seat
0 32.0 guys
0 31.0 thanks
0 31.0 seats
0 31.0 days
0 31.0 crew
0 30.0 ticket
0 30.0 need
0 30.0 going
0 30.0 4
0 29.0 got
0 29.0 even
0 29.0 airport


***answer:***
- We would expect the following words for each class:
    - Negative: cancelled, bad, late, lost, ...
    - Neutral: on time, arrived, departure, platform, ...
    - Positive: good, recommend, like, ...

  These words are often seen in the context of bad flight experiences, regular notifications and recommendations, which relate to negative, neutral and positive airline experiences respectively.
- 'delayed' and 'cancelled' are in the neutral category. We would have expected them in the negative category, since a flight being cancelled or delayed has a negative connotation.
- We would remove the airline names, as the model could learn to correlate the airline companies with the sentiment. We would also remove the special characters that are not used to express intensity (such as #, &, :), as they are ubiquitous in every class.
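
The filtering proposed in the last point could be done in a custom tokenizer function that is passed to `CountVectorizer`. The sketch below shows only the filtering step, applied to a tweet that is already tokenised. `AIRLINE_NAMES` and `KEEP_PUNCTUATION` are illustrative choices on our part, not lists from the course material, and the example tokens are invented.

```python
import string

# the two airline names visible in the feature list above; extend as needed
AIRLINE_NAMES = {'united', 'virginamerica'}
# punctuation that may signal intensity and is therefore kept (our assumption)
KEEP_PUNCTUATION = {'!', '?'}


def filter_tokens(tokens):
    """Lowercase the tokens and drop airline names and most punctuation."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in AIRLINE_NAMES:
            continue
        is_punct = all(ch in string.punctuation for ch in low)
        if is_punct and low not in KEEP_PUNCTUATION:
            continue
        kept.append(low)
    return kept


# invented example tokens
tokens = ['@', 'united', 'worst', 'flight', 'ever', '!', '#', 'fail']
filtered = filter_tokens(tokens)
print(filtered)
```

Tokens such as `n't` contain letters and are therefore kept, which matters because negation is likely to be informative for the negative class.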

### [Optional! (will not be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook