In the last DataLab you implemented a Naive Bayes classifier and created a word frequency table. In this DataLab you will use that table to create features for a logistic regression algorithm, and use scikit-learn to build the model.

**Chapter 5 of the book "Speech and Language Processing" is referenced in this notebook.**

First, repeat the steps from the last DataLab up to and including Task 3. In other words, create this table again:

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import random
import re                                  
import string  

import nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords          
from nltk.stem import PorterStemmer        
from nltk.tokenize import TweetTokenizer

In [None]:
nltk.download('stopwords')
stopwords_english = stopwords.words('english')

In [None]:
nltk.download('twitter_samples')

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))

In [None]:
# print a random positive tweet in green
# (random.choice avoids the out-of-range index that randint(0, 5000) can produce)
print('\033[92m' + random.choice(all_positive_tweets))

# print a random negative tweet in red
print('\033[91m' + random.choice(all_negative_tweets))

## 1) Tweet preprocessing

Again, use the function `tweet_processor()` you created previously.

In [None]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stems tokens
        
    """
    # YOUR CODE HERE #

    return processed_tweet

Then sanity-check that it works.
    
Example tweet:
    
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:

`['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

In [None]:
example_tweet = ('My beautiful sunflowers on a sunny Friday morning off :)'
                 ' #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i')
print(example_tweet)

In [None]:
tweet_processor(example_tweet)

Before going any further, let's split the dataset into training and test sets.

In [None]:
# 80% training 20% testing
positive_tweets_tr = all_positive_tweets[:4000]
positive_tweets_te = all_positive_tweets[4000:]

negative_tweets_tr = all_negative_tweets[:4000]
negative_tweets_te = all_negative_tweets[4000:]

**Task 1 (From the last DataLab)**

The function `tweet_processor()` expects a single tweet, but you have lists of tweets to process. Write a function called `tweet_processor_list()` that accepts a list of strings (tweets) and returns a list of processed tweets. A processed tweet is a list of tokens, so `tweet_processor_list()` should return a list of lists.

The first two items in `positive_tweets_tr` are:

```
['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!']
 ```
 
The expected output of `tweet_processor_list()` is:
 
 ```
 [['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)'],
 ['hey',
  'jame',
  'odd',
  ':/',
  'pleas',
  'call',
  'contact',
  'centr',
  '02392441234',
  'abl',
  'assist',
  ':)',
  'mani',
  'thank']]
 
 ```

In [None]:
def tweet_processor_list(tweet_list):
    # YOUR CODE HERE #
    return processed_tweet_list

In [None]:
positive_tweets_tr = tweet_processor_list(positive_tweets_tr)
positive_tweets_te = tweet_processor_list(positive_tweets_te)

negative_tweets_tr = tweet_processor_list(negative_tweets_tr)
negative_tweets_te = tweet_processor_list(negative_tweets_te)

**Task 2  (From the last DataLab)**

Now it is time to create the _vocabulary_ as defined in Chapter 4, Section 4.2:

> vocabulary V consists of the union of all the word types in all classes

Combine all the tokens in `positive_tweets_tr` and `negative_tweets_tr` into one big list and get the unique tokens from this list.

The expected length of the vocabulary is `9085` unique tokens. Note that a different train/test split or different preprocessing will change this number.
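
As a rough illustration of this step, the sketch below builds a vocabulary from two tiny hand-made token lists. The names `positive_toy` and `negative_toy` are hypothetical stand-ins for `positive_tweets_tr` and `negative_tweets_tr`, and sorting the result is an assumption made only because the listing of the first 50 tokens appears to be in sorted order.

```python
# Toy stand-ins for positive_tweets_tr / negative_tweets_tr (lists of token lists).
positive_toy = [['followfriday', 'top', 'engag', ':)'], ['hey', 'jame', ':)']]
negative_toy = [['hey', 'sad', ':('], ['miss', 'top', ':(']]

# Flatten every tweet of both classes into one long list of tokens.
all_tokens = [token for tweet in positive_toy + negative_toy for token in tweet]

# The vocabulary is the set of unique tokens; sorting gives a stable order.
vocab = sorted(set(all_tokens))
print(len(vocab), vocab)
```

On the real training data the same idea should give the vocabulary size quoted above, provided the split and preprocessing match.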

First 50 tokens in the vocabulary:

```
['(-:',
 '(:',
 '):',
 '--->',
 '-->',
 '->',
 '.\n.',
 '.\n.\n.',
 '. .',
 '. . .',
 '. ..',
 '. ...',
 '..',
 '...',
 '0',
 '0-100',
 '0-2',
 '0.001',
 '0.7',
 '00',
 '00128835',
 '009',
 '00962778381',
 '01282',
 '01482',
 '01:15',
 '01:16',
 '02079',
 '02392441234',
 '0272 3306',
 '0330 333 7234',
 '0345',
 '05.15',
 '07:02',
 '07:17',
 '07:24',
 '07:25',
 '07:32',
 '07:34',
 '08',
 '0878 0388',
 '08962464174',
 '0ne',
 '1',
 '1,300',
 '1,500',
 '1-0',
 '1.300',
 '1.8',
 '1/2']
```

In [None]:
# YOUR CODE HERE #

**Task 3  (From the last DataLab)**

To calculate equation 4.12

$$P(w_i \mid c) = \frac{\text{count}(w_i, c)}{\sum_{w \in V} \text{count}(w, c)}$$

we first need $\text{count}(w_i, c)$, the number of times each token in the vocabulary occurs in class $c$. This is also called the word frequency table.

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|
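
One possible way to build such a table is sketched below on toy data. The names `positive_toy`, `negative_toy` and `freqs_toy` are hypothetical, and the `token -> [positive count, negative count]` layout is an assumption chosen because it fits the `pd.DataFrame.from_dict(..., orient='index')` call used later in this notebook. The last lines evaluate equation 4.12 for a single token.

```python
from collections import Counter

# Toy stand-ins for the processed training tweets (lists of token lists).
positive_toy = [['happi', ':)'], ['happi', 'friday', ':)']]
negative_toy = [['sad', ':('], ['happi', 'sad']]

# Count how often each token occurs in each class.
count_pos = Counter(token for tweet in positive_toy for token in tweet)
count_neg = Counter(token for tweet in negative_toy for token in tweet)

# Vocabulary: union of the tokens seen in either class.
vocab = sorted(set(count_pos) | set(count_neg))

# Map token -> [count(w_i, +), count(w_i, -)]; a Counter returns 0 for unseen tokens.
freqs_toy = {token: [count_pos[token], count_neg[token]] for token in vocab}

# Equation 4.12 for one token: count(w_i, c) divided by the sum of count(w, c) over V.
total_pos = sum(pos for pos, _ in freqs_toy.values())
p_happi_pos = freqs_toy['happi'][0] / total_pos
print(freqs_toy['happi'], p_happi_pos)
```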



In [None]:
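from collections import Counter

# vocab_pos / vocab_neg: flat lists of every token in the positive / negative
# training tweets (assumed to have been built in Task 2)
word_count_pos = Counter(vocab_pos)
word_count_neg = Counter(vocab_neg)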

In [None]:
# YOUR CODE HERE #
# Replace the empty dict: freqs should map each token to [count(w_i, +), count(w_i, -)]
freqs = {}

In [None]:
df = pd.DataFrame.from_dict(freqs, orient='index', columns=['count(w_i, +)', 'count(w_i, -)'])
df.head(10)

## Logistic regression

Now it is time to create features for a logistic regression model. How can we create features from the following table?

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|

Let's create two features, one for the positive counts and one for the negative counts.

- For each token in the tweet get count($w_i$, +) from the table
- Calculate the sum
- This will be your first feature ($x_1$)

Similarly

- For each token in the tweet get count($w_i$, -) from the table
- Calculate the sum
- This will be your second feature ($x_2$)

Finally, repeat this for every tweet in the training and test sets. Let's get back to our example tweet to better understand what you need to do.

Example tweet (raw):
    
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Example tweet (processed):

`['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

|tokens|count(w_i, +)|count(w_i, -)|
|--|--|--|
|beauti|45|10|
|sunflow|2|0|
|sunni|5|1|
|friday|91|9|
|morn|68|23|
|:)|2847|2|
|sunflow|2|0|
|favourit|9|8|
|happi|161|18|
|friday|91|9|
|…|31|14|
|Total|$x_1$ = 3352|$x_2$ = 94|
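
The arithmetic in this table can be reproduced with a short sketch. The helper `tweet_to_features` and the dictionary `freqs_example` are hypothetical names, the counts are copied from the table above, and giving tokens that are missing from the table a count of zero is an assumption (it matters mainly for test tweets).

```python
import numpy as np

# Counts copied from the example table: token -> [count(w_i, +), count(w_i, -)].
freqs_example = {
    'beauti': [45, 10], 'sunflow': [2, 0], 'sunni': [5, 1], 'friday': [91, 9],
    'morn': [68, 23], ':)': [2847, 2], 'favourit': [9, 8], 'happi': [161, 18],
    '…': [31, 14],
}

def tweet_to_features(tokens, freqs):
    """Return (x1, x2): summed positive and negative counts over the tweet's tokens."""
    x1 = sum(freqs.get(token, [0, 0])[0] for token in tokens)
    x2 = sum(freqs.get(token, [0, 0])[1] for token in tokens)
    return x1, x2

example_tokens = ['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)',
                  'sunflow', 'favourit', 'happi', 'friday', '…']
print(tweet_to_features(example_tokens, freqs_example))  # (3352, 94)

# Stacking one (x1, x2) row per tweet gives a feature matrix with two columns.
X_toy = np.array([tweet_to_features(tokens, freqs_example)
                  for tokens in [example_tokens, ['sunni', 'morn']]])
print(X_toy.shape)  # (2, 2)
```

Repeated tokens such as `sunflow` and `friday` are counted once per occurrence, which is what the table does as well.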

Sanity check your array shapes. Expected outputs:

|Array|Shape|
|--|--|
|X_train|(8000, 2)|
|y_train|(8000,)|
|X_test|(2000, 2)|
|y_test|(2000,)|


X_train output:

```
array([[2847,    2],
       [ 504,   94],
       [   2,    1],
       ...,
       [   0,  378],
       [   1, 3663],
       [   1, 3663]])
```

**Task 1 (First task of this DataLab)**

Create $x_1$ and $x_2$ as described above, for each tweet in the training and test sets.

In [None]:
# YOUR CODE HERE #

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

**Task 2**

Now you are ready to build a logistic regression model using scikit-learn.

Try with and without normalization (as described in section 5.2 page 83 of the book Speech and Language Processing).

In [None]:
# YOUR CODE HERE #

Compare the two models you have developed (Naive Bayes and Logistic Regression), considering _5.2.4 Choosing a classifier_ on page 85 of the book Speech and Language Processing.

**Task 3**

Designing new features (discussed on page 83 of the book Speech and Language Processing) is an important part of building models. You created two features in Task 1. Now, design your own features and try to improve the model performance. Check the table on page 82 for inspiration.

In [None]:
# YOUR CODE HERE #

**Task 4**

Use the `SentimentIntensityAnalyzer` from `nltk` to predict the sentiment of the tweets.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

It tells you if a tweet (or a text in general) is positive or negative.

In [None]:
sia.polarity_scores("The acting was good.")

Notice that a tweet can contain both sentiments at the same time.

In [None]:
sia.polarity_scores("The acting was good, but the story was bad.")

In [None]:
sia.polarity_scores("The acting was bad, but the story was good.")

Use the `compound` score to decide whether a tweet is positive or negative. If the compound is a positive number, the prediction is positive. If it is a negative number the prediction is negative.

If the `compound` is zero it is neither positive, nor negative.

In [None]:
sia.polarity_scores("I feel neutral")

Now calculate the accuracy of `SentimentIntensityAnalyzer` on the **raw** test tweets. Decide how you would like to handle neutral tweets. 
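
The decision rule and the accuracy calculation could look roughly like the sketch below. The helpers `label_from_compound` and `accuracy` are hypothetical, the hand-written dictionaries stand in for real `sia.polarity_scores(tweet)` results, and sending neutral tweets to a fixed fallback label is only one of several possible ways to handle them.

```python
def label_from_compound(scores, neutral_label=0):
    """Map a polarity_scores()-style dict to 1 (positive) or 0 (negative)."""
    compound = scores['compound']
    if compound > 0:
        return 1
    if compound < 0:
        return 0
    # compound == 0: neither positive nor negative, so fall back to a chosen label.
    return neutral_label

def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Hand-written score dicts standing in for sia.polarity_scores(tweet) results.
toy_scores = [{'compound': 0.44}, {'compound': -0.54}, {'compound': 0.0}]
toy_labels = [1, 0, 1]
toy_predictions = [label_from_compound(scores) for scores in toy_scores]
print(toy_predictions, accuracy(toy_predictions, toy_labels))
```

Alternatives worth considering include dropping neutral tweets from the evaluation or counting them as errors; each choice changes the reported accuracy.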

In [None]:
# 80% training 20% testing
positive_tweets_tr_raw = all_positive_tweets[:4000]
positive_tweets_te_raw = all_positive_tweets[4000:]

negative_tweets_tr_raw = all_negative_tweets[:4000]
negative_tweets_te_raw = all_negative_tweets[4000:]

In [None]:
# YOUR CODE HERE #

**Task 5**

Use the results of `SentimentIntensityAnalyzer` as new features for your Logistic Regression model. Try with and without normalization.
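
One way to attach the new features is to append them as extra columns, as in the sketch below. The arrays `X_counts` and `X_sia` and the score values are made up for illustration; the keys `neg`, `neu`, `pos` and `compound` are the ones returned by `polarity_scores()`, and using all four of them as features is an assumption.

```python
import numpy as np

# Toy stand-ins: count features (x1, x2) and VADER-style scores per tweet.
X_counts = np.array([[2847, 2], [0, 378]], dtype=float)
sia_scores = [{'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.44},
              {'neg': 0.6, 'neu': 0.4, 'pos': 0.0, 'compound': -0.54}]

# Turn each score dict into a fixed-order row of new features.
sia_keys = ['neg', 'neu', 'pos', 'compound']
X_sia = np.array([[scores[key] for key in sia_keys] for scores in sia_scores])

# Append the new columns to the existing feature matrix.
X_combined = np.hstack([X_counts, X_sia])
print(X_combined.shape)  # (2, 6)
```

Because the count features are in the thousands while the sentiment scores lie between -1 and 1, this is a case where comparing the model with and without normalization is likely to be informative.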

In [None]:
# YOUR CODE HERE #