# Challenge 2: Sentiment Analysis

In this challenge we will learn about sentiment analysis and practice it on tweets.

## Introduction

Sentiment analysis aims to *systematically identify, extract, quantify, and study affective states and subjective information* in text ([reference](https://en.wikipedia.org/wiki/Sentiment_analysis)). In simple terms, it tries to determine whether the person who wrote a piece of text was happy or unhappy. Why do we (or rather, companies) care about sentiment in text? Because understanding it tells us whether our customers are happy or unhappy with our products and services. If they are unhappy, the next step is to figure out what has caused the unhappiness and make improvements.

Basic sentiment analysis only detects the polarity of the sentiment: *positive* or *negative* (sometimes *neutral* too). More advanced sentiment analysis also considers dimensions such as agreement, subjectivity, confidence, irony, and so on. In this challenge we will conduct basic positive vs negative sentiment analysis on real tweets.

NLTK comes with a [sentiment analysis package](https://www.nltk.org/api/nltk.sentiment.html). This package is great for dummies to perform sentiment analysis because it requires only the textual data to make predictions. For example:

```python
>>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
>>> txt = "Ironhack is a Global Tech School ranked num 2 worldwide.   Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do."
>>> analyzer = SentimentIntensityAnalyzer()
>>> analyzer.polarity_scores(txt)
{'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.8442}
```

In this challenge, however, you will not use NLTK's sentiment analysis package, because in the past 2 weeks of Machine Learning training you have learned how to make more accurate predictions than that. The [tweets data](https://www.kaggle.com/kazanova/sentiment140) we will be using today is already labeled with positive/negative sentiment. You will be able to use the Naïve Bayes classifier you learned in the lesson to predict the sentiment of tweets based on those labels.

## Conducting Sentiment Analysis

### Loading and Exploring Data

The dataset we'll be using today is located on Kaggle (https://www.kaggle.com/kazanova/sentiment140). Once you have downloaded and imported the dataset, you will need to define the column names: `df.columns = ['target','id','date','flag','user','text']`

Notes:

The dataset is huge (1.6 million tweets). While you develop your analysis code, you can work on a sample of the data (e.g. 20k records) to save a lot of time when testing.
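
A minimal sketch of loading and sampling, assuming the Sentiment140 CSV has no header row (so the column names are passed to `read_csv` directly). It uses a small in-memory stand-in with made-up rows instead of the real file, and the `random_state` value is an arbitrary choice that only makes the sample reproducible.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical rows, same six columns).
# The real file would additionally need encoding='latin1', as in the notebook.
fake_csv = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so sad today"\n'
    '"0","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","need a hug"\n'
    '"4","3","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","user_c","great morning"\n'
    '"4","4","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","user_d","love this song"\n'
)
columns = ['target', 'id', 'date', 'flag', 'user', 'text']

# header=None keeps the first row as data; names= assigns the column names.
df = pd.read_csv(fake_csv, header=None, names=columns)

# Draw a reproducible random sample (3 rows here, 20,000 in the challenge).
sample = df.sample(n=3, random_state=42)
print(df.shape, sample.shape)
```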

In [1]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re

In [2]:
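# Note: read_csv treats the first row as a header by default, so if the file has
# no header row, its first tweet is consumed as column names (header=None avoids this).
original_data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin1')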

In [3]:
# Rename the columns.
original_data.columns = ['target','id','date','flag','user','text']
original_data.columns
# The full DataFrame is called `original_data` because the sample was only created afterwards;
# using a separate name avoids having to rename the variable in many places.

Index(['target', 'id', 'date', 'flag', 'user', 'text'], dtype='object')

In [4]:
original_data.head(10)

| | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | 1467811372 | Mon Apr 06 22:20:00 PDT 2009 | NO_QUERY | joy_wolf | @Kwesidei not the whole crew |
| 5 | 0 | 1467811592 | Mon Apr 06 22:20:03 PDT 2009 | NO_QUERY | mybirch | Need a hug |
| 6 | 0 | 1467811594 | Mon Apr 06 22:20:03 PDT 2009 | NO_QUERY | coZZ | @LOLTrish hey long time no see! Yes.. Rains a... |
| 7 | 0 | 1467811795 | Mon Apr 06 22:20:05 PDT 2009 | NO_QUERY | 2Hood4Hollywood | @Tatiana_K nope they didn't have it |
| 8 | 0 | 1467812025 | Mon Apr 06 22:20:09 PDT 2009 | NO_QUERY | mimismo | @twittera que me muera ? |
| 9 | 0 | 1467812416 | Mon Apr 06 22:20:16 PDT 2009 | NO_QUERY | erinx3leannexo | spring break in plain city... it's snowing |


In [5]:
data = original_data.sample(n=20000)
data

| | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 781935 | 0 | 2323528022 | Thu Jun 25 00:51:24 PDT 2009 | NO_QUERY | KaeliTheKool | Just spent 20 minutes watching a commercial on... |
| 1344802 | 4 | 2039933465 | Thu Jun 04 23:10:57 PDT 2009 | NO_QUERY | LaurHogan | Robert Pattinson Topless Pics!! Oh Lordi ! New... |
| 1031235 | 4 | 1933033820 | Tue May 26 22:49:32 PDT 2009 | NO_QUERY | kristinfinley | @shutupman Awesome Cemeteries, some fancy ice... |
| 392428 | 0 | 2055146991 | Sat Jun 06 08:47:10 PDT 2009 | NO_QUERY | miravalonia | blue jeans, over-played for tonight.. |
| 1502684 | 4 | 2071752051 | Sun Jun 07 19:52:25 PDT 2009 | NO_QUERY | snickel727 | Waiting for Night at the Museum 2 to start wit... |
| ... | ... | ... | ... | ... | ... | ... |
| 1544210 | 4 | 2181675504 | Mon Jun 15 12:08:50 PDT 2009 | NO_QUERY | revjesse | @tim_shelbourne hahahaa |
| 432191 | 0 | 2064695890 | Sun Jun 07 06:56:54 PDT 2009 | NO_QUERY | tzejing | needs someone to fetch me home after band tmr |
| 1303071 | 4 | 2009019458 | Tue Jun 02 15:01:22 PDT 2009 | NO_QUERY | tomatoeMD66 | http://twitpic.com/6hpqt - humm, a message in ... |
| 759619 | 0 | 2296303738 | Tue Jun 23 09:01:07 PDT 2009 | NO_QUERY | tatarina | Its so pretty outside and its ruined cause I d... |


In [6]:
data.shape

(20000, 6)

In [7]:
duplicates = data[data.duplicated()]
duplicate_count = len(duplicates)
print(f"Number of duplicates: {duplicate_count}")
print("Duplicate Rows:")
print(duplicates)

Number of duplicates: 0
Duplicate Rows:
Empty DataFrame
Columns: [target, id, date, flag, user, text]
Index: []


In [8]:
data.isna().sum().sum()

0

In [10]:
# Checking for blank values (cells that contain exactly one space).
blank_values = data[data == ' '].count()
blank_values

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64
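
The comparison `data == ' '` only matches cells that are exactly one space. A hedged sketch of a broader check on a made-up frame, treating empty strings and any whitespace-only strings as blank (that definition of "blank" is an assumption, and `toy` is a hypothetical stand-in for `data`):

```python
import pandas as pd

# Made-up frame: one normal tweet, one with three spaces, one empty string.
toy = pd.DataFrame({'user': ['a', 'b', 'c'], 'text': ['hello', '   ', '']})

# The notebook-style check: matches only cells equal to a single space.
single_space = (toy['text'] == ' ').sum()

# Broader check: strip whitespace first, then compare with the empty string.
blank_text = toy['text'].str.strip().eq('').sum()

print(single_space, blank_text)
```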

### Prepare Textual Data for Sentiment Analysis

Now, apply the functions you have written in Challenge 1 to your whole data set. These functions include:

* `clean_up()`

* `tokenize()`

* `stem_and_lemmatize()`

* `remove_stopwords()`

Create a new column called `text_processed` in the dataframe to contain the processed data. At the end, your `text_processed` column should contain lists of word tokens that are cleaned up. Your data should look like below:

![Processed Data](data-cleaning-results.png)

In [11]:
%%time
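def clean_up(s):
    # Remove URLs, replace every non-letter character with a space, then lowercase.
    s = re.sub(r"http\S+|www\S+|https\S+", "", s)
    s = re.sub(r"[^a-zA-Z]", " ", s)
    s = s.lower()
    return s

def my_tokenize(s):
    # Split the cleaned string into a list of word tokens.
    return word_tokenize(s)

def lemmatize(l):
    # Reduce each token to its WordNet lemma (treated as a noun by default).
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in l]

def remove_stopwords(l):
    # Drop common English stop words such as "the" and "is".
    stop_words = set(stopwords.words('english'))
    return [word for word in l if word not in stop_words]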

CPU times: total: 0 ns
Wall time: 0 ns
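
To see what the cleaning step does to a single tweet, here is a small standalone sketch. It repeats the `clean_up` logic from the cell above and, as an assumption to stay free of NLTK downloads, uses `str.split()` as a stand-in for `word_tokenize`; the sample tweet is made up.

```python
import re

def clean_up(s):
    # Same three steps as the notebook's clean_up.
    s = re.sub(r"http\S+|www\S+|https\S+", "", s)  # drop URLs
    s = re.sub(r"[^a-zA-Z]", " ", s)               # non-letters become spaces
    return s.lower()

raw = "@user &quot;so fun&quot; &amp; free!! http://t.co/abc123"
cleaned = clean_up(raw)

# Whitespace split used here only as a simple stand-in for word_tokenize.
tokens = cleaned.split()
print(tokens)
```

HTML entities such as `&quot;` and `&amp;` lose their punctuation and survive as the tokens `quot` and `amp`, which is a likely reason those two tokens rank high in frequency counts on this kind of data.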


In [12]:
%%time
data['text_processed'] = data['text'].apply(clean_up)
data['text_processed'] = data['text_processed'].apply(my_tokenize)
data['text_processed'] = data['text_processed'].apply(lemmatize)
data['text_processed'] = data['text_processed'].apply(remove_stopwords)

CPU times: total: 8.8 s
Wall time: 18.1 s


In [13]:
data

| | target | id | date | flag | user | text | text_processed |
|---|---|---|---|---|---|---|---|
| 781935 | 0 | 2323528022 | Thu Jun 25 00:51:24 PDT 2009 | NO_QUERY | KaeliTheKool | Just spent 20 minutes watching a commercial on... | [spent, minute, watching, commercial, slim, n,... |
| 1344802 | 4 | 2039933465 | Thu Jun 04 23:10:57 PDT 2009 | NO_QUERY | LaurHogan | Robert Pattinson Topless Pics!! Oh Lordi ! New... | [robert, pattinson, topless, pic, oh, lordi, n... |
| 1031235 | 4 | 1933033820 | Tue May 26 22:49:32 PDT 2009 | NO_QUERY | kristinfinley | @shutupman Awesome Cemeteries, some fancy ice... | [shutupman, awesome, cemetery, fancy, ice, cre... |
| 392428 | 0 | 2055146991 | Sat Jun 06 08:47:10 PDT 2009 | NO_QUERY | miravalonia | blue jeans, over-played for tonight.. | [blue, jean, played, tonight] |
| 1502684 | 4 | 2071752051 | Sun Jun 07 19:52:25 PDT 2009 | NO_QUERY | snickel727 | Waiting for Night at the Museum 2 to start wit... | [waiting, night, museum, start, uncbear, yay] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1544210 | 4 | 2181675504 | Mon Jun 15 12:08:50 PDT 2009 | NO_QUERY | revjesse | @tim_shelbourne hahahaa | [tim, shelbourne, hahahaa] |
| 432191 | 0 | 2064695890 | Sun Jun 07 06:56:54 PDT 2009 | NO_QUERY | tzejing | needs someone to fetch me home after band tmr | [need, someone, fetch, home, band, tmr] |
| 1303071 | 4 | 2009019458 | Tue Jun 02 15:01:22 PDT 2009 | NO_QUERY | tomatoeMD66 | http://twitpic.com/6hpqt - humm, a message in ... | [humm, message, bottle, fan, mail, flounder, b... |
| 759619 | 0 | 2296303738 | Tue Jun 23 09:01:07 PDT 2009 | NO_QUERY | tatarina | Its so pretty outside and its ruined cause I d... | [pretty, outside, ruined, cause, feel, well] |


### Creating Bag of Words

The purpose of this step is to create a [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) from the processed data. The bag of words contains all the unique words in your whole text body (a.k.a. *corpus*) together with the number of occurrences of each word. It will allow you to understand which words are the most important features across the whole corpus.

As you can imagine, the resulting set of words will be massive. The less important words (i.e. those with very few occurrences) do not contribute much to the sentiment. Therefore, you only need to use the most important words to build your feature set in the next step. In our case, we will use the 5,000 most frequent words to build the features.

In the cell below, combine all the words in `text_processed` and calculate the frequency distribution of all words. A convenient library to calculate the term frequency distribution is NLTK's `FreqDist` class ([documentation](https://www.nltk.org/api/nltk.html#module-nltk.probability)). Then select the top 5,000 words from the frequency distribution.
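
A small sketch of the frequency-distribution step on a made-up corpus. It uses `collections.Counter` so it runs without NLTK; `nltk.FreqDist` is a `Counter` subclass, so `most_common` is expected to behave the same way there. The names `processed`, `top_pairs`, and `top_words` are introduced here for illustration.

```python
from collections import Counter

# Made-up stand-in for the text_processed column: one token list per tweet.
processed = [['good', 'day', 'work'], ['good', 'night'], ['bad', 'day', 'good']]

# Flatten all token lists into a single list of words.
all_words = [word for tokens in processed for word in tokens]

# Count occurrences; most_common(n) returns (word, count) pairs.
freq = Counter(all_words)
top_pairs = freq.most_common(2)

# Keep only the words if a plain vocabulary list is needed later.
top_words = [word for word, _count in top_pairs]
print(top_pairs, top_words)
```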

In [14]:
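from nltk import FreqDist
# Flatten the per-tweet token lists into one list of words.
words_bag = [word for sublist in data['text_processed'] for word in sublist]

freq_dist = FreqDist(words_bag)

# most_common returns a list of (word, count) tuples, not bare words.
common_words = freq_dist.most_common(5000)
common_words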

[('day', 1434),
 ('wa', 1269),
 ('good', 1189),
 ('go', 1071),
 ('get', 1063),
 ('like', 984),
 ('quot', 982),
 ('got', 935),
 ('today', 883),
 ('u', 850),
 ('love', 847),
 ('going', 837),
 ('time', 837),
 ('work', 832),
 ('back', 743),
 ('know', 696),
 ('lol', 695),
 ('one', 675),
 ('really', 642),
 ('im', 634),
 ('night', 630),
 ('want', 623),
 ('amp', 618),
 ('still', 572),
 ('see', 566),
 ('na', 545),
 ('home', 540),
 ('new', 529),
 ('well', 513),
 ('think', 510),
 ('miss', 507),
 ('thanks', 503),
 ('need', 499),
 ('last', 493),
 ('ha', 484),
 ('oh', 451),
 ('morning', 450),
 ('hope', 445),
 ('feel', 434),
 ('much', 430),
 ('make', 424),
 ('tomorrow', 424),
 ('twitter', 410),
 ('great', 405),
 ('haha', 392),
 ('wish', 381),
 ('friend', 379),
 ('come', 360),
 ('happy', 357),
 ('fun', 356),
 ('sad', 355),
 ('thing', 344),
 ('right', 344),
 ('would', 338),
 ('week', 327),
 ('bad', 326),
 ('sorry', 320),
 ('sleep', 318),
 ('getting', 318),
 ('tonight', 316),
 ('gon', 313),
 ('though', 

In [15]:
len(common_words)

5000

In [16]:
data['target'].value_counts()

4    10103
0     9897
Name: target, dtype: int64

In [17]:
data.columns

Index(['target', 'id', 'date', 'flag', 'user', 'text', 'text_processed'], dtype='object')

### Building Features

Now let's build the features. Using the top 5,000 words, create a 2-dimensional matrix to record whether each of those words is contained in each document (tweet). Then you also have an output column to indicate whether the sentiment in each tweet is positive. For example, assuming your bag of words has 5 items (`['one', 'two', 'three', 'four', 'five']`) out of 4 documents (`['A', 'B', 'C', 'D']`), your feature set is essentially:

| Doc | one | two | three | four | five | is_positive |
|---|---|---|---|---|---|---|
| A | True | False | False | True | False | True |
| B | False | False | False | True | True | False |
| C | False | True | False | False | False | True |
| D | True | False | False | False | True | False |

However, because the `nltk.NaiveBayesClassifier.train` method we will use in the next step does not work with a pandas DataFrame, your feature set should be converted to a Python list that looks like this:

```python
[
	({
		'one': True,
		'two': False,
		'three': False,
		'four': True,
		'five': False
	}, True),
	({
		'one': False,
		'two': False,
		'three': False,
		'four': True,
		'five': True
	}, False),
	({
		'one': False,
		'two': True,
		'three': False,
		'four': False,
		'five': False
	}, True),
	({
		'one': True,
		'two': False,
		'three': False,
		'four': False,
		'five': True
	}, False)
]
```

To help you in this step, watch the [following video](https://www.youtube.com/watch?v=-vVskDsHcVc) to learn how to build the feature set with Python and NLTK. The source code in this video can be found [here](https://pythonprogramming.net/words-as-features-nltk-tutorial/).
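
As a rough sketch, the four-document example from the table can be built like this. The token lists for documents A to D are read off the table, and `document_features` and `vocabulary` are hypothetical names used only for this illustration.

```python
# Vocabulary and documents taken from the example table above.
vocabulary = ['one', 'two', 'three', 'four', 'five']
documents = {
    'A': (['one', 'four'], True),
    'B': (['four', 'five'], False),
    'C': (['two'], True),
    'D': (['one', 'five'], False),
}

def document_features(tokens, vocabulary):
    # Map every vocabulary word to True/False depending on whether it occurs.
    token_set = set(tokens)
    return {word: (word in token_set) for word in vocabulary}

# List of (feature dict, label) tuples, the shape NLTK's classifier expects.
feature_set = [
    (document_features(tokens, vocabulary), is_positive)
    for tokens, is_positive in documents.values()
]
print(feature_set[0])
```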

In [18]:
#Following the video. 
#documents = [(list(movie_reviews.words(fileid)), category)
       #for category in movie_reviews.categories()
       #for fileid in movie_reviews.fileids(category)]
        
# random.shuffle(documents)

In [19]:
def find_features(document):
    words = set(document)
    features = {}
    for w in common_words:
        features[w] = (w in words)
    
    return features

 `find_features` turns the tweet's tokens into a set so that membership checks are fast.
 It then creates an empty dictionary called `features` and loops over every entry in `common_words`.
 For each entry it stores `True` if that entry is in the tweet and `False` otherwise.
 The function returns this dictionary, which has one key per entry in `common_words`.
 Note that `common_words` comes straight from `most_common(5000)`, so each entry is a `(word, count)` tuple rather than a plain word.
 A tuple never matches a token, so every value ends up `False` and the keys are tuples, as the `feature_set[:8]` output further down shows.
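
One thing to watch for in this step: `FreqDist.most_common` returns `(word, count)` pairs, and a pair never equals a plain token. The sketch below uses three pairs copied from the frequency output above and a made-up tweet to contrast tuple keys with word keys; `vocabulary`, `tuple_keyed`, and `word_keyed` are names introduced here for illustration.

```python
# Three (word, count) pairs as returned by most_common, plus a made-up tweet.
common_words = [('day', 1434), ('good', 1189), ('love', 847)]
tweet_tokens = ['good', 'day', 'today']
words = set(tweet_tokens)

# Looping over the pairs directly: keys are tuples and nothing ever matches.
tuple_keyed = {w: (w in words) for w in common_words}

# Unpacking the word out of each pair first gives usable features.
vocabulary = [word for word, _count in common_words]
word_keyed = {w: (w in words) for w in vocabulary}

print(tuple_keyed)
print(word_keyed)
```

With word keys, the classifier can see which words are present in a tweet; with tuple keys, every tweet gets an identical all-`False` feature dictionary.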

In [20]:
feature_set = [(find_features(tokens), target) for tokens, target in zip(data['text_processed'], data['target'])]
print(len(feature_set))

20000


 The list comprehension uses `zip` to pair each tweet's token list in `text_processed` with its `target` label.
 For every pair it calls `find_features` on the tokens and keeps the result together with the label as a `(features, target)` tuple.
 The result is a list with one entry per tweet, 20,000 in total.

In [21]:
feature_set[:8]

[({('day', 1434): False,
   ('wa', 1269): False,
   ('good', 1189): False,
   ('go', 1071): False,
   ('get', 1063): False,
   ('like', 984): False,
   ('quot', 982): False,
   ('got', 935): False,
   ('today', 883): False,
   ('u', 850): False,
   ('love', 847): False,
   ('going', 837): False,
   ('time', 837): False,
   ('work', 832): False,
   ('back', 743): False,
   ('know', 696): False,
   ('lol', 695): False,
   ('one', 675): False,
   ('really', 642): False,
   ('im', 634): False,
   ('night', 630): False,
   ('want', 623): False,
   ('amp', 618): False,
   ('still', 572): False,
   ('see', 566): False,
   ('na', 545): False,
   ('home', 540): False,
   ('new', 529): False,
   ('well', 513): False,
   ('think', 510): False,
   ('miss', 507): False,
   ('thanks', 503): False,
   ('need', 499): False,
   ('last', 493): False,
   ('ha', 484): False,
   ('oh', 451): False,
   ('morning', 450): False,
   ('hope', 445): False,
   ('feel', 434): False,
   ('much', 430): False,
   ('m

### Building and Training Naive Bayes Model

In this step you will split your feature set into a training and a test set. Then you will create a Naive Bayes classifier instance using `nltk.NaiveBayesClassifier.train` ([example](https://www.nltk.org/book/ch06.html)) and train it with the training dataset.

After training the model, call `classifier.show_most_informative_features()` to inspect the most important features. The output will look like:

```
Most Informative Features
	    snow = True            False : True   =     34.3 : 1.0
	  easter = True            False : True   =     26.2 : 1.0
	 headach = True            False : True   =     20.9 : 1.0
	    argh = True            False : True   =     17.6 : 1.0
	unfortun = True            False : True   =     16.9 : 1.0
	    jona = True             True : False  =     16.2 : 1.0
	     ach = True            False : True   =     14.9 : 1.0
	     sad = True            False : True   =     13.0 : 1.0
	  parent = True            False : True   =     12.9 : 1.0
	  spring = True            False : True   =     12.7 : 1.0
```

The [following video](https://www.youtube.com/watch?v=rISOsUaTrO4) will help you complete this step. The source code in this video can be found [here](https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/).

In [22]:
# Use the first 10,000 feature sets for training and the remaining 10,000 for testing.
train_set, test_set = feature_set[:10000], feature_set[10000:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
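
The cell above takes the first 10,000 entries for training, which works because `data` was already drawn at random. A minimal sketch of the same train/test flow on made-up feature sets, with an explicit seeded shuffle and an 80/20 split (both are assumptions for illustration, not the notebook's settings):

```python
import random
import nltk

# Made-up feature sets: label 4 = positive, 0 = negative, as in the dataset.
positive = [({'good': True, 'bad': False}, 4)] * 5
negative = [({'good': False, 'bad': True}, 0)] * 5
toy_feature_set = positive + negative

# Shuffle with a fixed seed so the split is reproducible, then split 80/20.
random.Random(0).shuffle(toy_feature_set)
split = int(0.8 * len(toy_feature_set))
train_toy, test_toy = toy_feature_set[:split], toy_feature_set[split:]

# Train on the training part, then classify one example and score the rest.
toy_classifier = nltk.NaiveBayesClassifier.train(train_toy)
prediction = toy_classifier.classify({'good': True, 'bad': False})
toy_accuracy = nltk.classify.accuracy(toy_classifier, test_toy)
print(prediction, toy_accuracy)
```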

### Testing Naive Bayes Model

Now we'll test our classifier with the test dataset. This is done by calling `nltk.classify.accuracy(classifier, test)`.

As mentioned in one of the tutorial videos, a Naive Bayes model is considered OK if your accuracy score is over 0.6. If your accuracy score is over 0.7, you've done a great job!

In [23]:
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

0.5026
Most Informative Features
               ('aa', 3) = False               0 : 4      =      1.0 : 1.0
           ('aaaaah', 4) = False               0 : 4      =      1.0 : 1.0
             ('aaah', 6) = False               0 : 4      =      1.0 : 1.0
             ('aah', 10) = False               0 : 4      =      1.0 : 1.0
            ('aaron', 7) = False               0 : 4      =      1.0 : 1.0
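
For context, an accuracy near 0.5 on a roughly balanced two-class sample is about what always predicting one class would give. A quick sketch of that majority-class baseline, using the class counts printed by `value_counts()` earlier for the full 20,000-tweet sample (the test half may differ slightly):

```python
# Class counts from the value_counts() output above (4 = positive, 0 = negative).
class_counts = {4: 10103, 0: 9897}

total = sum(class_counts.values())

# The majority-class baseline always predicts the most frequent label.
majority_label = max(class_counts, key=class_counts.get)
baseline = class_counts[majority_label] / total

print(majority_label, baseline)
```

A score this close to the baseline suggests the features carry no usable signal, which is consistent with every feature value being `False` in the `feature_set[:8]` output and with the 1.0 : 1.0 ratios in the most informative features.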
