# Sentiment Analysis

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. (Source: Wikipedia)

## 1. Loading the Dataset

Here we use the <a href="http://ai.stanford.edu/~amaas/data/sentiment/">Large Movie Review Dataset</a> from Stanford, a dataset for binary sentiment classification that contains substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing. The loader in this section expects the reviews under `train/` and `test/` folders, each with `positive` and `negative` subfolders (the original archive names these subfolders `pos` and `neg`).

In [27]:
# Importing the libraries
import os
import glob
from sklearn.utils import shuffle

In [28]:
# Function for loading the dataset
def load_dataset(data_dir = "./data/imdb-reviews/"):

    # Initializing a dictionary for X_data and y_data
    X_data = {}
    y_data = {}
    
    # Iterating through the "train" and "test"
    for train_or_test in ['train', 'test']:
        
        # Initialize an empty dictionary for the train and test
        X_data[train_or_test] = {}
        y_data[train_or_test] = {}
        
        # Iterate through "positive", "negative"
        for positive_or_negative in ['positive', 'negative']:
            
            # Initialize an empty list for each sentiment
            X_data[train_or_test][positive_or_negative] = []
            y_data[train_or_test][positive_or_negative] = []
            
            # Get the name of all texts in our folder
            file_names = glob.glob(os.path.join(data_dir, train_or_test, positive_or_negative, "*.txt")) 
                
            # Iterate through file names
            for file_name in file_names:
                
                # Open the (text) file; the reviews are stored as UTF-8
                with open(file_name, encoding="utf-8") as review_file:
                
                    # Append the review text and its label to our dictionaries
                    X_data[train_or_test][positive_or_negative].append(review_file.read())
                    y_data[train_or_test][positive_or_negative].append(positive_or_negative)
                
    return X_data, y_data

In [29]:
# Loading the dataset
X_data, y_data = load_dataset()

In [30]:
# Print the number of reviews per split and sentiment
print("Training set:\n {} Positive / {} Negative\n".format(len(X_data["train"]["positive"]), 
                                                           len(X_data["train"]["negative"])))

print("Test set:\n {} Positive / {} Negative".format(len(X_data["test"]["positive"]), 
                                                   len(X_data["test"]["negative"])))

Training set:
 12500 Positive / 12500 Negative

Test set:
 12500 Positive / 12500 Negative


In [31]:
# Combining the positive and negative reviews of the training and test sets
X_train = X_data["train"]["positive"] + X_data["train"]["negative"]
y_train = y_data["train"]["positive"] + y_data["train"]["negative"]

X_test = X_data["test"]["positive"] + X_data["test"]["negative"]
y_test = y_data["test"]["positive"] + y_data["test"]["negative"]

In [32]:
# Shuffling the training set and test set
X_train, y_train = shuffle(X_train, y_train)
X_test, y_test = shuffle(X_test, y_test)
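
Shuffling must keep every review paired with its label, which is what `sklearn.utils.shuffle` does when it receives several sequences. The idea can be sketched with the standard library alone; the helper name `shuffle_in_unison` and the toy data are illustrative and not part of the notebook:

```python
import random

def shuffle_in_unison(texts, labels, seed=0):
    """Shuffle two equally long lists with the same permutation."""
    # Hypothetical helper for illustration; the notebook itself uses sklearn.utils.shuffle
    paired = list(zip(texts, labels))
    random.Random(seed).shuffle(paired)
    shuffled_texts, shuffled_labels = zip(*paired)
    return list(shuffled_texts), list(shuffled_labels)

toy_texts = ["great film", "awful plot", "loved it", "boring"]
toy_labels = ["positive", "negative", "positive", "negative"]

new_texts, new_labels = shuffle_in_unison(toy_texts, toy_labels)

# Every text still carries its original label after shuffling
original_pairs = dict(zip(toy_texts, toy_labels))
print(all(original_pairs[t] == l for t, l in zip(new_texts, new_labels)))
```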

In [33]:
print("Training set = {} \nTest set = {}".format(len(X_train), len(X_test)))

Training set = 25000 
Test set = 25000


In [34]:
# Get a small subset of dataset (for speed purposes)
X_train, y_train = X_train[:4000], y_train[:4000]
X_test, y_test = X_test[:1000], y_test[:1000]

## 2. Preprocessing

In the second step, we preprocess the dataset, which is an essential part of building any text model. Specifically, we apply the following steps:
1. Removing HTML tags
2. Lowercasing the text
3. Removing the punctuation
4. Converting to tokens
5. Removing the stopwords
6. Applying a stemmer
7. Applying a lemmatizer
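
To make the steps concrete, here is a minimal sketch that uses only the standard library. It is a simplified stand-in, not the notebook's pipeline: the stopword set is a tiny made-up sample (the notebook uses NLTK's English list), tokenization is a plain whitespace split, and stemming and lemmatization are left out because they need NLTK.

```python
import re

# A tiny illustrative stopword set; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"the", "is", "a", "it", "and", "was"}

def simple_preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    text = text.lower()                      # lowercase
    text = re.sub(r"[^a-z0-9]", " ", text)   # replace punctuation with spaces
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in TOY_STOPWORDS]  # remove stopwords

print(simple_preprocess("The movie was GREAT, and the ending is a surprise!"))
# ['movie', 'great', 'ending', 'surprise']
```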

In [35]:
# Importing the libraries
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from keras.preprocessing import sequence
import bs4
import numpy as np

Using TensorFlow backend.


In [36]:
# Preprocessing the text
def preprocess_text(text):

    # Removing all HTML tags
    text = bs4.BeautifulSoup(text, "html5lib").get_text().strip()
    
    # Lowercasing the text
    text = text.lower()

    # Removing the punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # Converting to tokens
    tokens = word_tokenize(text)

    # Removing the stopwords (build the set once per call instead of once per token)
    stop_words = set(stopwords.words("english"))
    tokens = [i_token for i_token in tokens if i_token not in stop_words]

    # Apply stemmer
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(i_token) for i_token in tokens]

    # Apply lemmatizer (first treating tokens as nouns, then as verbs)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(i_token, pos="n") for i_token in stemmed]
    lemmatized = [lemmatizer.lemmatize(i_token, pos="v") for i_token in lemmatized]

    return lemmatized

In [37]:
# Preprocess the training set and test set
X_train = [preprocess_text(i) for i in X_train]
X_test = [preprocess_text(i) for i in X_test]

In [38]:
# Combine the training and test sets so the vocabulary covers both
total_dataset = X_train + X_test

In [39]:
# Create word2id and id2word (ids start at 0, which is also the default padding value of pad_sequences)
all_unique_words = np.unique([item for sub_list in total_dataset for item in sub_list])
word2id = {i_token: index for index, i_token in enumerate(all_unique_words)}
id2word = {index: i_token for index, i_token in enumerate(all_unique_words)}

In [40]:
# Mapping words in the training set to their corresponding ids
X_train = [[word2id[i_token] for i_token in sub_list] for sub_list in X_train]

# Mapping words in the test set to their corresponding ids
X_test = [[word2id[i_token] for i_token in sub_list] for sub_list in X_test]

In [41]:
# Pad (or truncate) every sequence to exactly max_words ids
max_words = 500

X_train = sequence.pad_sequences(X_train, maxlen = max_words)
X_test = sequence.pad_sequences(X_test, maxlen = max_words)
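
With its default arguments, `pad_sequences` pads on the left with zeros and, for sequences that are too long, keeps the last `maxlen` ids. The following pure-Python sketch mimics that behaviour for a single sequence; `pad_left` is a hypothetical helper written for illustration, not a Keras function:

```python
def pad_left(ids, maxlen, value=0):
    """Mimic the default behaviour of pad_sequences for a single sequence."""
    ids = ids[-maxlen:]                          # keep only the last maxlen ids
    return [value] * (maxlen - len(ids)) + ids   # fill the front with the padding value

print(pad_left([7, 3, 9], maxlen=5))            # [0, 0, 7, 3, 9]
print(pad_left([1, 2, 3, 4, 5, 6], maxlen=5))   # [2, 3, 4, 5, 6]
```

Because the padding value is 0, a vocabulary whose ids also start at 0 makes one real word indistinguishable from padding. A common remedy is to number the words from 1 and increase the vocabulary size by one.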

In [42]:
# Convert labels into 0 and 1
def str_to_int(label):
    return 1 if label == "positive" else 0

# Wrap the labels in NumPy arrays, the format Keras expects for targets
y_train = np.array(list(map(str_to_int, y_train)))
y_test = np.array(list(map(str_to_int, y_test)))

## 3. Model

Now we are ready to feed the data into our model for training. As you will see, even a simple architecture reaches a reasonably high accuracy.

In [43]:
# Import the libraries
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import ModelCheckpoint

In [44]:
# Some hyperparameters
embedding_size = 32
lstm_units = 100
batch_size = 128
num_epochs = 10

vocabulary_size = len(all_unique_words)

In [45]:
# Create the model
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length = max_words))
model.add(LSTM(units = lstm_units))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summary of model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           831360    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 884,661
Trainable params: 884,661
Non-trainable params: 0
_________________________________________________________________
None
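
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the hyperparameters; the vocabulary size of 25,980 is not printed anywhere in the notebook and is inferred here from the embedding row (831,360 / 32):

```python
vocabulary_size = 25980   # inferred from the summary: 831,360 / 32
embedding_size = 32
lstm_units = 100

# Embedding: one vector of length embedding_size per vocabulary entry
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)     # 831360 53200 101
print(embedding_params + lstm_params + dense_params)   # 884661
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of this model.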


In [46]:
# Checkpoint for saving the model
checkpointer = ModelCheckpoint(filepath='./saved model/weights.best.sentiment_analysis.hdf5', 
                               verbose = 1, 
                               save_best_only = True)

# Train the model
model.fit(X_train, 
          y_train,
          validation_data = (X_test, y_test),
          batch_size = batch_size,
          epochs = num_epochs,
          callbacks = [checkpointer], 
          verbose = 1)

Train on 4000 samples, validate on 1000 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 0.67256, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 2/10

Epoch 00002: val_loss improved from 0.67256 to 0.58436, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 3/10

Epoch 00003: val_loss improved from 0.58436 to 0.39578, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 4/10

Epoch 00004: val_loss improved from 0.39578 to 0.36961, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 5/10

Epoch 00005: val_loss did not improve from 0.36961
Epoch 6/10

Epoch 00006: val_loss did not improve from 0.36961
Epoch 7/10

Epoch 00007: val_loss did not improve from 0.36961
Epoch 8/10

Epoch 00008: val_loss did not improve from 0.36961
Epoch 9/10

Epoch 00009: val_loss did not improve from 0.36961
Epoch 10/10

Epoch 00010: val_loss did not improve from 0.36961


<keras.callbacks.History at 0x1a5ca71dd8>
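
The log shows the effect of `save_best_only = True`: the weights are written only when the monitored quantity (the validation loss, by default) improves on the best value seen so far. A minimal sketch of that logic follows; the first four losses are copied from the log above, while the last two are hypothetical because the log does not print them:

```python
# First four values are taken from the training log; the last two are hypothetical
val_losses = [0.67256, 0.58436, 0.39578, 0.36961, 0.41, 0.45]

best = float("inf")
saved_epochs = []
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:
        best = val_loss
        saved_epochs.append(epoch)   # the real callback would write the weights to disk here

print(saved_epochs, best)   # [1, 2, 3, 4] 0.36961
```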

## 4. Evaluation

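Once you have trained your model, it's time to see how well it performs on the test data. Keep in mind that the same test set also served as validation data for the checkpoint during training, so this figure is not a fully independent estimate.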
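
For a binary classifier with a sigmoid output, accuracy is the fraction of examples whose thresholded probability matches the label. The sketch below shows the computation on made-up numbers; the `accuracy` helper and the toy values are illustrative, not output of the model:

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_probabilities = [0.91, 0.12, 0.55, 0.40]   # made-up sigmoid outputs
toy_labels = [1, 0, 0, 1]

print(accuracy(toy_probabilities, toy_labels))  # 0.5
```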

In [47]:
# Evaluate your model on the test set
scores = model.evaluate(X_test, y_test, verbose=0)  # returns loss and other metrics specified in model.compile()
print("Test Set Accuracy: {}%".format(scores[1]*100))  # scores[1] should correspond to accuracy if you passed in metrics=['accuracy']

Test Set Accuracy: 83.3%


## 5. Prediction

Now you are ready for prediction. You can check the sentiment of any sentence you input.

In [48]:
# Function for prediction
def text_to_predict(unseen_text):
    
    # Preprocess the text
    unseen_text = preprocess_text(unseen_text)
    
    # Convert the words to ids, skipping words that are not in the vocabulary
    unseen_text = [word2id[x] for x in unseen_text if x in word2id]
    
    # Pad sequences
    unseen_text = sequence.pad_sequences([unseen_text], maxlen = max_words)
    
    # Get the predicted probability of the positive class (a value between 0 and 1)
    prediction = model.predict(unseen_text)[0][0]

    # Print the result
    if prediction < 0.5:
        print("The given sentence is negative.")
    else:
        print("The given sentence is positive.")

In [50]:
# Predict an unseen text
unseen_text = "The movie is absolutely terrible. It's not something i would suggest"
text_to_predict(unseen_text)

The given sentence is positive.


In [51]:
# Predict an unseen text
unseen_text = "The movie is absolutely great. Can't wait to watch it again."
text_to_predict(unseen_text)

The given sentence is positive.


**RESOURCES:**
1. <a href="https://monkeylearn.com/sentiment-analysis/">Sentiment Analysis - nearly everything you need to know</a>
2. <a href="https://medium.freecodecamp.org/how-to-make-your-own-sentiment-analyzer-using-python-and-googles-natural-language-api-9e91e1c493e">How to make your own sentiment analyzer using Python and Google’s Natural Language API</a>

<hr>

# Topic Modeling

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model. (Wikipedia)
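
The generative story can be sketched in a few lines: draw a topic mixture for the document, then for every word position pick a topic from that mixture and a word from that topic. The two topics, their word probabilities, and the Dirichlet parameters below are invented for illustration; a real model learns them from data.

```python
import random

rng = random.Random(0)

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Two hypothetical topics, each a distribution over a toy vocabulary
topics = {
    "sport":    {"win": 0.5, "cup": 0.3, "team": 0.2},
    "politics": {"govt": 0.5, "elect": 0.3, "law": 0.2},
}
topic_names = list(topics)

# Step 1: draw the document's topic mixture
mixture = sample_dirichlet([0.5, 0.5], rng)

# Step 2: for each word position, pick a topic, then pick a word from that topic
document = []
for _ in range(8):
    topic = rng.choices(topic_names, weights=mixture)[0]
    words = list(topics[topic])
    word = rng.choices(words, weights=[topics[topic][w] for w in words])[0]
    document.append(word)

print(mixture)
print(document)
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the topic mixtures and the word distribution of each topic.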

## 1. Loading the Dataset

First, we load the dataset and take a look at it. It contains news headlines published over a period of 15 years, sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation).

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd

np.random.seed(10101)

In [2]:
# Loading the dataset
data = pd.read_csv(filepath_or_buffer = "./dataset/abcnews-date-text.csv")
data.head()

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


In [3]:
# Getting the second column
documents = data[["headline_text"]]

# Get the first 90K rows only (copy so that adding a column afterwards does not trigger pandas' SettingWithCopyWarning)
documents = documents[:90000].copy()

# Adding the index column
documents['index'] = documents.index

documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


In [4]:
# Total number of documents
print("Total number of documents: ", len(documents))

Total number of documents:  90000


## 2. Preprocessing the Dataset

In the second step, we preprocess the dataset, which is an essential part of building any text model. Specifically, we apply the following steps:
1. Lowercasing the text
2. Removing the punctuation
3. Converting to tokens
4. Removing the stopwords
5. Applying a stemmer
6. Applying a lemmatizer

In [5]:
# Importing the libraries
import nltk
import re
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [6]:
# Downloading wordnet 
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/soheilmohammadpour/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# Preprocessing the text
def preprocess_text(text):

    # Lowercasing the text
    text = text.lower()

    # Removing the punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # Converting to tokens
    tokens = word_tokenize(text)

    # Removing the stopwords (build the set once per call instead of once per token)
    stop_words = set(stopwords.words("english"))
    tokens = [i_token for i_token in tokens if i_token not in stop_words]

    # Apply stemmer
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(i_token) for i_token in tokens]

    # Apply lemmatizer (first treating tokens as nouns, then as verbs)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(i_token, pos="n") for i_token in stemmed]
    lemmatized = [lemmatizer.lemmatize(i_token, pos="v") for i_token in lemmatized]

    return lemmatized

In [8]:
# Preprocessing every headline in the dataset
processed_documents = documents['headline_text'].map(preprocess_text)

In [9]:
processed_documents.head(10)

0              [aba, decid, commun, broadcast, licenc]
1                  [act, fire, wit, must, awar, defam]
2            [g, call, infrastructur, protect, summit]
3            [air, nz, staff, aust, strike, pay, rise]
4        [air, nz, strike, affect, australian, travel]
5                   [ambiti, olsson, win, tripl, jump]
6               [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, four, memphi, m...
8            [aust, address, un, secur, council, iraq]
9                   [australia, lock, war, timet, opp]
Name: headline_text, dtype: object

## 3. Feature Extraction

### 3.1. Bag-of-Words

The bag of words (BoW) model is a simplifying representation used in natural language processing and information retrieval. In this model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
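
A bag of words is simply a count per token. The sketch below builds one with `collections.Counter` for a made-up token list; gensim's `doc2bow` stores the same information as `(word id, count)` pairs.

```python
from collections import Counter

tokens = ["air", "nz", "strike", "air", "staff", "strike", "air"]

# Word order is discarded; only how often each token occurs is kept
bag = Counter(tokens)
print(bag)  # Counter({'air': 3, 'strike': 2, 'nz': 1, 'staff': 1})
```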

In [10]:
# Importing the libraries
import gensim

In [11]:
# Making a dictionary containing words and their integer ids
dictionary = gensim.corpora.Dictionary(processed_documents)

In [12]:
# Printing first 10 items in dictionary
count = 0
for key, value in dictionary.iteritems():
    print(key, value)
    count += 1
    if count == 10:
        break

0 aba
1 broadcast
2 commun
3 decid
4 licenc
5 act
6 awar
7 defam
8 fire
9 must


In [13]:
# Deleting very rare and very common words
dictionary.filter_extremes(no_below = 15, # Drop tokens that appear in fewer than 15 documents (absolute number)
                           no_above = 0.1, # Drop tokens that appear in more than 10% of the documents (fraction, not absolute number)
                           keep_n = 100000) # After the two filters above, keep only the 100,000 most frequent tokens
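
The two thresholds act on document frequency, that is, the number of documents a token appears in, not its total number of occurrences. A pure-Python sketch of the same filtering on toy data follows; `filter_by_document_frequency` is a hypothetical helper, and the `keep_n` cap is left out for brevity:

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below, no_above):
    """Keep tokens found in at least no_below documents and in at most a no_above fraction of them."""
    # Count in how many documents each token occurs (not how many times in total)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    max_docs = no_above * len(documents)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

toy_documents = [
    ["polic", "charg", "man"],
    ["polic", "court"],
    ["polic", "man", "water"],
    ["polic", "water", "plan"],
]

print(sorted(filter_by_document_frequency(toy_documents, no_below=2, no_above=0.75)))
# ['man', 'water']
```

Here "polic" is dropped because it appears in every document, while "charg", "court", and "plan" are dropped because each appears in only one.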

In [14]:
# Applying Bag-of-Words for each document (a list of words)
bow_corpus = [dictionary.doc2bow(document = doc) for doc in processed_documents]

print("Bag-of-Words of our sample document: ", bow_corpus[10])

Bag-of-Words of our sample document:  [(40, 1), (43, 1), (47, 1), (48, 1), (49, 1), (50, 1)]


In [15]:
# Printing out the BoW for our sample document
bow_10 = bow_corpus[10]

for word_id, word_count in bow_10:
    print('The word "{}" with word id of {} repeated {} times'.format(dictionary[word_id],
                                                                      word_id,
                                                                      word_count))

The word "iraq" with word id of 40 repeated 1 times
The word "australia" with word id of 43 repeated 1 times
The word "10" with word id of 47 repeated 1 times
The word "aid" with word id of 48 repeated 1 times
The word "contribut" with word id of 49 repeated 1 times
The word "million" with word id of 50 repeated 1 times


### 3.2. TF-IDF

One limitation of BoW is that it treats every word as equally important, even though some words occur frequently across the whole corpus and therefore say little about any single document. The solution is to first count the number of documents in which each word occurs (this is called the document frequency) and then divide the term frequency by the document frequency of that term. The resulting metric is proportional to how often a term occurs in a document and inversely proportional to the number of documents it appears in.
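
In practice the division is usually softened with a logarithm. The sketch below weights each count by log2(N / df) and then scales the vector to unit length, which, to the best of our understanding, corresponds to the default settings of gensim's `TfidfModel`; treat that correspondence as an assumption. The helper name and the document frequencies are made up for illustration.

```python
import math

def tfidf_vector(doc_counts, doc_freq, num_docs):
    """Weight raw counts by log2(N / df) and scale the result to unit length."""
    weights = {
        term: count * math.log2(num_docs / doc_freq[term])
        for term, count in doc_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else weights

# Made-up document frequencies in a corpus of 1,000 documents
doc_freq = {"polic": 500, "paroo": 2}
vector = tfidf_vector({"polic": 1, "paroo": 1}, doc_freq, num_docs=1000)
print(vector)
```

Both words occur once in the document, yet the rare word "paroo" receives a far larger weight than the common word "polic".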

In [16]:
# Importing the libraries
from gensim import corpora, models
from pprint import pprint

In [17]:
# Creating a TF-IDF model
tdidf = models.TfidfModel(corpus = bow_corpus)

In [18]:
# Applying TF-IDF to the entire corpus
tdidf_corpus = tdidf[bow_corpus]

print("1st TD-IDF: ", tdidf_corpus[0])

1st TD-IDF:  [(0, 0.5215718389702217), (1, 0.4870042588063656), (2, 0.3430208344700119), (3, 0.41760004211431156), (4, 0.44579881184600006)]


In [19]:
# Preview TF-IDF for first document
for doc in tdidf_corpus:
    pprint(doc)
    break

[(0, 0.5215718389702217),
 (1, 0.4870042588063656),
 (2, 0.3430208344700119),
 (3, 0.41760004211431156),
 (4, 0.44579881184600006)]


## 4. LDA Model

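We now train an LDA model with 10 topics. Although the model variable is named `lda_model_tfidf`, the code in this section passes the bag-of-words corpus (`bow_corpus`) to the model; to train on the TF-IDF features instead, pass `tdidf_corpus` as the `corpus` argument.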

In [20]:
# Defining the LDA model
lda_model_tfidf = gensim.models.LdaMulticore(corpus = bow_corpus, 
                                             num_topics = 10, 
                                             id2word = dictionary, 
                                             passes = 2,
                                             workers = 2)

In [21]:
# Exploring the words in each topic and their weights
for index, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} \nWords: {}".format(index, topic), "\n")

Topic: 0 
Words: 0.023*"hospit" + 0.015*"worker" + 0.013*"work" + 0.013*"union" + 0.012*"find" + 0.011*"strike" + 0.010*"protest" + 0.009*"olymp" + 0.008*"bodi" + 0.008*"rail" 

Topic: 1 
Words: 0.021*"lead" + 0.019*"win" + 0.016*"elect" + 0.013*"take" + 0.012*"high" + 0.011*"coast" + 0.011*"aussi" + 0.010*"india" + 0.010*"south" + 0.010*"shoot" 

Topic: 2 
Words: 0.015*"reject" + 0.015*"new" + 0.015*"top" + 0.014*"law" + 0.013*"open" + 0.012*"claim" + 0.010*"leav" + 0.009*"feder" + 0.009*"final" + 0.009*"port" 

Topic: 3 
Words: 0.021*"plan" + 0.020*"cup" + 0.018*"world" + 0.013*"support" + 0.010*"battl" + 0.010*"prepar" + 0.010*"park" + 0.009*"go" + 0.008*"final" + 0.008*"ahead" 

Topic: 4 
Words: 0.027*"govt" + 0.014*"public" + 0.013*"help" + 0.012*"say" + 0.011*"time" + 0.010*"urg" + 0.009*"opposit" + 0.009*"develop" + 0.009*"leader" + 0.009*"report" 

Topic: 5 
Words: 0.056*"polic" + 0.042*"man" + 0.033*"charg" + 0.029*"court" + 0.028*"face" + 0.016*"death" + 0.015*"murder" + 0.01

## 5. Prediction

Now that our model has been trained, let's predict the topics of a single sentence.

In [22]:
unseen_doc = "big plan to boost paroo water supplies"

In [23]:
# Preprocessing the new document
processed_text = preprocess_text(text = unseen_doc)
print("Processed text: ", processed_text)

Processed text:  ['aviat', 'world', 'face', 'moment', 'reckon', '737', 'max', 'crash']


In [24]:
# Creating a Bag-of-Words vector
bow_vector = dictionary.doc2bow(document = processed_text)
print("Bag-of-Words for given text: ", bow_vector)

Bag-of-Words for given text:  [(206, 1), (427, 1), (429, 1), (1805, 1)]


In [25]:
# Checking which topics our document belongs to, from most to least likely
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {:.2f}% | Topic: {}".format(score*100, lda_model_tfidf.print_topic(topicno = index, topn = 3)))


Score: 41.97% | Topic: 0.056*"polic" + 0.042*"man" + 0.033*"charg"

Score: 22.39% | Topic: 0.021*"plan" + 0.020*"cup" + 0.018*"world"

Score: 21.63% | Topic: 0.019*"group" + 0.018*"water" + 0.017*"wa"

Score: 2.00% | Topic: 0.033*"call" + 0.014*"school" + 0.014*"road"

Score: 2.00% | Topic: 0.023*"hospit" + 0.015*"worker" + 0.013*"work"

Score: 2.00% | Topic: 0.021*"lead" + 0.019*"win" + 0.016*"elect"

Score: 2.00% | Topic: 0.028*"boost" + 0.015*"new" + 0.015*"sydney"

Score: 2.00% | Topic: 0.056*"u" + 0.029*"iraq" + 0.029*"kill"

Score: 2.00% | Topic: 0.027*"govt" + 0.014*"public" + 0.013*"help"

Score: 2.00% | Topic: 0.015*"reject" + 0.015*"new" + 0.015*"top"


**RESOURCES:**
1. <a href="https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21">Topic Modelling in Python with NLTK and Gensim</a>
2. <a href="https://towardsdatascience.com/the-complete-guide-for-topics-extraction-in-python-a6aaa6cedbbc">An overview of topics extraction in Python with LDA</a>
3. <a href="https://nlpforhackers.io/topic-modeling/">Complete Guide to Topic Modeling</a>