# Challenge 2: Sentiment Analysis

In this challenge we will learn sentiment analysis and practice performing sentiment analysis on Twitter tweets. 

## Introduction

Sentiment analysis is to *systematically identify, extract, quantify, and study affective states and subjective information* based on texts ([reference](https://en.wikipedia.org/wiki/Sentiment_analysis)). In simple words, it's to understand whether a person is happy or unhappy in producing the piece of text. Why we (or rather, companies) care about sentiment in texts? It's because by understanding the sentiments in texts, we will be able to know if our customers are happy or unhappy about our products and services. If they are unhappy, the subsequent action is to figure out what have caused the unhappiness and make improvements.

Basic sentiment analysis only understands the *positive* or *negative* (sometimes *neutral* too) polarities of the sentiment. More advanced sentiment analysis will also consider dimensions such as agreement, subjectivity, confidence, irony, and so on. In this challenge we will conduct the basic positive vs negative sentiment analysis based on real Twitter tweets.

NLTK comes with a [sentiment analysis package](https://www.nltk.org/api/nltk.sentiment.html). This package is great for dummies to perform sentiment analysis because it requires only the textual data to make predictions. For example:

```python
>>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
>>> txt = "Ironhack is a Global Tech School ranked num 2 worldwide.   Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do."
>>> analyzer = SentimentIntensityAnalyzer()
>>> analyzer.polarity_scores(txt)
{'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.8442}
```

In this challenge, however, you will not use NLTK's sentiment analysis package because in your Machine Learning training in the past 2 weeks you have learned how to make predictions more accurate than that. The [tweets data](https://www.kaggle.com/kazanova/sentiment140) we will be using today are already coded for the positive/negative sentiment. You will be able to use the Naïve Bayes classifier you learned in the lesson to predict the sentiment of tweets based on the labels.

## Conducting Sentiment Analysis

### Loading and Exploring Data

The dataset we'll be using today is located on Kaggle (https://www.kaggle.com/kazanova/sentiment140). Once you have downloaded and imported the dataset, it you will need to define the columns names: df.columns = ['target','id','date','flag','user','text']

*Notes:* 

* The dataset is huuuuge (1.6m tweets). When you develop your data analysis codes, you can sample a subset of the data (e.g. 20k records) so that you will save a lot of time when you test your codes.

In [1]:
# Loading and Exploring Data:

import pandas as pd

df = pd.read_csv('training.1600000.processed.noemoticon.csv', header=None, encoding='latin-1')

In [2]:
df.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [3]:
df.shape

(1600000, 6)

In [4]:
df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']

In [5]:
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   id      1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [7]:
df['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [8]:
# Selecting a subset of 20000 rows:

df_subset = df.sample(n=20000, random_state=42)

In [9]:
df_subset.head()

Unnamed: 0,target,id,date,flag,user,text
541200,0,2200003196,Tue Jun 16 18:18:12 PDT 2009,NO_QUERY,LaLaLindsey0609,@chrishasboobs AHHH I HOPE YOUR OK!!!
750,0,1467998485,Mon Apr 06 23:11:14 PDT 2009,NO_QUERY,sexygrneyes,"@misstoriblack cool , i have no tweet apps fo..."
766711,0,2300048954,Tue Jun 23 13:40:11 PDT 2009,NO_QUERY,sammydearr,@TiannaChaos i know just family drama. its la...
285055,0,1993474027,Mon Jun 01 10:26:07 PDT 2009,NO_QUERY,Lamb_Leanne,School email won't open and I have geography ...
705995,0,2256550904,Sat Jun 20 12:56:51 PDT 2009,NO_QUERY,yogicerdito,upper airways problem


In [10]:
df_subset.shape

(20000, 6)

### Prepare Textual Data for Sentiment Analysis

Now, apply the functions you have written in Challenge 1 to your whole data set. These functions include:

* `clean_up()`

* `tokenize()`

* `stem_and_lemmatize()`

* `remove_stopwords()`

Create a new column called `text_processed` in the dataframe to contain the processed data. At the end, your `text_processed` column should contain lists of word tokens that are cleaned up. Your data should look like below:

![Processed Data](data-cleaning-results.png)

In [11]:
# Clean up function:

def clean_up(s):
    
    import re
    
    s = re.sub('http\S+|[\W\d]+', ' ',  s)
    s = re.sub('\s+', ' ', s)
    
    return s.lower()

In [12]:
# Tokenize function:

def tokenize(s):
    
    from nltk.tokenize import word_tokenize
    
    tokens = word_tokenize(s)
    
    return tokens

In [13]:
# Stem_and_lemmatize function:

def stem_and_lemmatize(l):
    
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.corpus import wordnet

    
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer() 
    
    stem_and_lemmatized = []
    
    for word in l:
        stemmed = stemmer.stem(word)
        lemmatized = lemmatizer.lemmatize(word)
        stem_and_lemmatized.append((stemmed,lemmatized))
        
    return stem_and_lemmatized

In [14]:
# Stopwords function:

def remove_stopwords(l):
    
    from nltk.corpus import stopwords
    
    without_sw = [word for word in l if not word in stopwords.words('english')]
    
    return without_sw

In [15]:
# Applying it to the text all together:

df_subset['text_processed'] = df_subset['text'].apply(clean_up)
df_subset['text_processed'] = df_subset['text_processed'].apply(tokenize)
df_subset['text_processed'] = df_subset['text_processed'].apply(stem_and_lemmatize)
df_subset['text_processed'] = df_subset['text_processed'].apply(remove_stopwords)

In [16]:
df_subset.head(15)

Unnamed: 0,target,id,date,flag,user,text,text_processed
541200,0,2200003196,Tue Jun 16 18:18:12 PDT 2009,NO_QUERY,LaLaLindsey0609,@chrishasboobs AHHH I HOPE YOUR OK!!!,"[(chrishasboob, chrishasboobs), (ahhh, ahhh), ..."
750,0,1467998485,Mon Apr 06 23:11:14 PDT 2009,NO_QUERY,sexygrneyes,"@misstoriblack cool , i have no tweet apps fo...","[(misstoriblack, misstoriblack), (cool, cool),..."
766711,0,2300048954,Tue Jun 23 13:40:11 PDT 2009,NO_QUERY,sammydearr,@TiannaChaos i know just family drama. its la...,"[(tiannachao, tiannachaos), (i, i), (know, kno..."
285055,0,1993474027,Mon Jun 01 10:26:07 PDT 2009,NO_QUERY,Lamb_Leanne,School email won't open and I have geography ...,"[(school, school), (email, email), (won, won),..."
705995,0,2256550904,Sat Jun 20 12:56:51 PDT 2009,NO_QUERY,yogicerdito,upper airways problem,"[(upper, upper), (airway, airway), (problem, p..."
379611,0,2052380495,Sat Jun 06 00:32:16 PDT 2009,NO_QUERY,Yengching,Going to miss Pastor's sermon on Faith...,"[(go, going), (to, to), (miss, miss), (pastor,..."
1189018,4,1983449090,Sun May 31 13:10:36 PDT 2009,NO_QUERY,jessig06,on lunch....dj should come eat with me,"[(on, on), (lunch, lunch), (dj, dj), (should, ..."
667030,0,2245479748,Fri Jun 19 16:11:29 PDT 2009,NO_QUERY,felicityfuller,@piginthepoke oh why are you feeling like that?,"[(piginthepok, piginthepoke), (oh, oh), (whi, ..."
93541,0,1770705699,Mon May 11 22:01:32 PDT 2009,NO_QUERY,stephiiheyy,gahh noo!peyton needs to live!this is horrible,"[(gahh, gahh), (noo, noo), (peyton, peyton), (..."
1097326,4,1970386589,Sat May 30 03:39:34 PDT 2009,NO_QUERY,wyndwitch,@mrstessyman thank you glad you like it! There...,"[(mrstessyman, mrstessyman), (thank, thank), (..."


### Creating Bag of Words

The purpose of this step is to create a [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) from the processed data. The bag of words contains all the unique words in your whole text body (a.k.a. *corpus*) with the number of occurrence of each word. It will allow you to understand which words are the most important features across the whole corpus.

Also, you can imagine you will have a massive set of words. The less important words (i.e. those of very low number of occurrence) do not contribute much to the sentiment. Therefore, you only need to use the most important words to build your feature set in the next step. In our case, we will use the top 5,000 words with the highest frequency to build the features.

In the cell below, combine all the words in `text_processed` and calculate the frequency distribution of all words. A convenient library to calculate the term frequency distribution is NLTK's `FreqDist` class ([documentation](https://www.nltk.org/api/nltk.html#module-nltk.probability)). Then select the top 5,000 words from the frequency distribution.

In [17]:
# Selecting the 5000 words with the most frequency:

from nltk.probability import FreqDist

all_words = [word for sublist in df_subset['text_processed'] for word in sublist]

freq_dist = FreqDist(all_words)

top_words = freq_dist.most_common(5000)

top_words = [word[0] for word in top_words]

In [18]:
print(top_words)



In [19]:
len(top_words)

5000

### Building Features

Now let's build the features. Using the top 5,000 words, create a 2-dimensional matrix to record whether each of those words is contained in each document (tweet). Then you also have an output column to indicate whether the sentiment in each tweet is positive. For example, assuming your bag of words has 5 items (`['one', 'two', 'three', 'four', 'five']`) out of 4 documents (`['A', 'B', 'C', 'D']`), your feature set is essentially:

| Doc | one | two | three | four | five | is_positive |
|---|---|---|---|---|---|---|
| A | True | False | False | True | False | True |
| B | False | False | False | True | True | False |
| C | False | True | False | False | False | True |
| D | True | False | False | False | True | False|

However, because the `nltk.NaiveBayesClassifier.train` class we will use in the next step does not work with Pandas dataframe, the structure of your feature set should be converted to the Python list looking like below:

```python
[
	({
		'one': True,
		'two': False,
		'three': False,
		'four': True,
		'five': False
	}, True),
	({
		'one': False,
		'two': False,
		'three': False,
		'four': True,
		'five': True
	}, False),
	({
		'one': False,
		'two': True,
		'three': False,
		'four': False,
		'five': False
	}, True),
	({
		'one': True,
		'two': False,
		'three': False,
		'four': False,
		'five': True
	}, False)
]
```

To help you in this step, watch the [following video](https://www.youtube.com/watch?v=-vVskDsHcVc) to learn how to build the feature set with Python and NLTK. The source code in this video can be found [here](https://pythonprogramming.net/words-as-features-nltk-tutorial/).

[![Building Features](building-features.jpg)](https://www.youtube.com/watch?v=-vVskDsHcVc)

In [20]:
# Creating the 2D matrix:

features = []


for index, row in df_subset.iterrows():
   
    tweet_text = row['text']
    
   
    word_presence = {word: (word in tweet_text.split()) for word, _ in top_words}
    
    
    features.append((word_presence, row['target'] == 4))


In [21]:
features[:1]

[({'i': False,
   'to': False,
   'the': False,
   'a': False,
   'it': False,
   'my': False,
   'you': False,
   'and': False,
   'is': False,
   'for': False,
   'in': False,
   's': False,
   't': False,
   'of': False,
   'on': False,
   'me': False,
   'that': False,
   'so': False,
   'have': False,
   'm': False,
   'but': False,
   'just': False,
   'with': False,
   'not': False,
   'be': False,
   'at': False,
   'wa': False,
   'day': False,
   'can': False,
   'thi': False,
   'good': False,
   'now': False,
   'up': False,
   'get': False,
   'out': False,
   'all': False,
   'are': False,
   'like': False,
   'quot': False,
   'no': False,
   'go': False,
   'today': False,
   'got': False,
   'work': False,
   'do': False,
   'love': False,
   'your': False,
   'time': False,
   'too': False,
   'we': False,
   'lol': False,
   'what': False,
   'one': False,
   'from': False,
   'know': False,
   'back': False,
   'am': False,
   'will': False,
   'don': False,
   'abo

### Building and Traininng Naive Bayes Model

In this step you will split your feature set into a training and a test set. Then you will create a Bayes classifier instance using `nltk.NaiveBayesClassifier.train` ([example](https://www.nltk.org/book/ch06.html)) to train with the training dataset.

After training the model, call `classifier.show_most_informative_features()` to inspect the most important features. The output will look like:

```
Most Informative Features
	    snow = True            False : True   =     34.3 : 1.0
	  easter = True            False : True   =     26.2 : 1.0
	 headach = True            False : True   =     20.9 : 1.0
	    argh = True            False : True   =     17.6 : 1.0
	unfortun = True            False : True   =     16.9 : 1.0
	    jona = True             True : False  =     16.2 : 1.0
	     ach = True            False : True   =     14.9 : 1.0
	     sad = True            False : True   =     13.0 : 1.0
	  parent = True            False : True   =     12.9 : 1.0
	  spring = True            False : True   =     12.7 : 1.0
```

The [following video](https://www.youtube.com/watch?v=rISOsUaTrO4) will help you complete this step. The source code in this video can be found [here](https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/).

[![Building and Training NB](nb-model-building.jpg)](https://www.youtube.com/watch?v=rISOsUaTrO4)

In [22]:
# Selecting 90% for the train set and 10% for the test set:

train_size = int(0.9 * len(features))
train_set, test_set = features[:train_size], features[train_size:]

In [23]:
# Training the model:

import nltk

classifier = nltk.NaiveBayesClassifier.train(train_set)

In [24]:
classifier.show_most_informative_features(15)

Most Informative Features
                     sad = True            False : True   =     17.0 : 1.0
                    poor = True            False : True   =     16.9 : 1.0
                    pain = True            False : True   =     14.3 : 1.0
                     ugh = True            False : True   =     13.1 : 1.0
                   shame = True            False : True   =     11.8 : 1.0
                    sick = True            False : True   =     11.6 : 1.0
                  anyway = True             True : False  =     11.6 : 1.0
                    sore = True            False : True   =     10.7 : 1.0
                   burnt = True            False : True   =      9.7 : 1.0
                 stomach = True            False : True   =      9.5 : 1.0
                    sing = True             True : False  =      8.3 : 1.0
                   blood = True            False : True   =      7.7 : 1.0
                    hate = True            False : True   =      7.5 : 1.0

### Testing Naive Bayes Model

Now we'll test our classifier with the test dataset. This is done by calling `nltk.classify.accuracy(classifier, test)`.

As mentioned in one of the tutorial videos, a Naive Bayes model is considered OK if your accuracy score is over 0.6. If your accuracy score is over 0.7, you've done a great job!

In [25]:
# Testing NBModel:

print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, test_set))*100)

Classifier accuracy percent: 66.55
