Sentiment analysis, sometimes called opinion mining or polarity detection, refers to the set of algorithms and techniques used to extract the polarity of a given document; that is, to determine whether the sentiment of a document is positive, negative or neutral. Sentiment analysis is gaining popularity in industry because it allows organizations to mine the opinions of a large group of users or potential customers in a cost-efficient way. It is now used extensively in advertising campaigns, political campaigns, stock analysis and more.

In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [7]:
data = pd.read_csv("amazon_cells_labelled.txt", sep='\t', header=None)
data.head()

   0                                                   1
0  So there is no way for me to plug it in here i...  0
1  Good case, Excellent value.                        1
2  Great for the jawbone.                             1
3  Tied to charger for conversations lasting more...  0
4  The mic is great.                                  1


The above result shows the first 5 lines of the dataframe.

We need to separate the column that contains the text reviews from the column that contains the sentiment labels.

In [8]:
X = data.iloc[:,0] # extract column with reviews
y = data.iloc[:,-1] # extract column with sentiments

We need to do this because the text data must be preprocessed before it is passed to the ML model. We import the CountVectorizer class, which performs key preprocessing steps on the text data such as lowercasing, tokenization, stop word removal and conversion of each review into a vector of word counts (a bag-of-words representation).

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
X_vec = vectorizer.fit_transform(X)
X_vec.todense() # convert sparse matrix into dense matrix

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In the above matrix, each row represents a review and each column represents a unique word in the corpus. Each row vector holds the number of times each unique word occurs in that review.

Next we import the TfidfTransformer class to transform the word counts into their respective tf-idf values (https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/). It is time to transform the word count matrix into a matrix with the corresponding tf-idf values.

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_vec)
X_tfidf = X_tfidf.toarray() # dense ndarray; recent scikit-learn versions reject the np.matrix returned by todense()
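As a rough check of what the transformer computes, the sketch below reproduces its default behaviour by hand on a small, made-up count matrix: raw counts as term frequency, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The matrix and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical word-count matrix: 2 reviews (rows) x 4 words (columns)
counts = np.array([[0, 1, 1, 0],
                   [1, 2, 0, 1]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of reviews containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf (default smooth_idf=True)
manual = counts * idf                            # tf * idf, with raw counts as tf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

# Compare against the transformer with its default settings
from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
matches = np.allclose(manual, from_sklearn)
print(matches)
```

A word that appears in every review receives the smallest idf, so it contributes less to the row vector than a word of equal count that appears in only one review.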

With this, we have completed the preprocessing and are ready to train the model on the processed data. Before we do that, however, we need to split the data into training and testing sets so that we can evaluate the trained model on reviews it has not seen. This is called hold-out validation (cross-validation repeats the same idea over several different splits), and it is an extremely important part of ML model training. We could easily split the data manually, but for the sake of consistency we will use the train_test_split function of sklearn's model_selection module. We pass it the processed reviews (X_tfidf), the sentiment labels (y) and the desired size of the test set as a fraction of the data (test_size).

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size = 0.25, random_state = 0)

The above code will split both independent variables (the tfidf matrix) and the dependent variable (sentiment) into training and testing data.
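The effect of test_size = 0.25 can be sketched on dummy arrays. The arrays below are hypothetical stand-ins with 1000 rows, an assumption consistent with the 750 training and 250 test reviews reported later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins with 1000 rows; row i holds [2i, 2i + 1] and has label i % 2
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)   # sizes of the training and test feature matrices
print(len(y_tr), len(y_te))     # the labels are split in the same way

# Rows are shuffled, but each row stays paired with its own label
still_paired = np.array_equal(y_tr, (X_tr[:, 0] // 2) % 2)
print(still_paired)
```

Fixing random_state makes the shuffle reproducible, which is why rerunning the notebook yields the same test set.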

We now have everything we need to train our model. For this we import the MultinomialNB (multinomial Naive Bayes) class from sklearn's naive_bayes module and fit the model to the training data.

In [12]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)



MultinomialNB()

Fitting the training data means that our Naive Bayes classifier has now learned from the training data and can calculate the relevant probabilities. Therefore, if an out-of-sample review is passed to the classifier, it will calculate the probability of the sentiment being positive or negative given the words in the review, for example this, disappointed and product.
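The probabilities mentioned above can be inspected with predict_proba. The following is a minimal sketch on a made-up four-review training set, using raw word counts rather than tf-idf to keep it short; the texts, labels and variable names are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (1 = positive, 0 = negative)
train_texts = [
    "great product",
    "excellent value great case",
    "disappointed with this product",
    "poor quality disappointed",
]
train_labels = [1, 1, 0, 0]

# Chain vectorization and classification so raw text can be passed directly
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(train_texts, train_labels)

new_review = ["really disappointed"]
proba = toy_model.predict_proba(new_review)[0]   # ordered as in toy_model.classes_
label = toy_model.predict(new_review)[0]         # class with the highest probability

print(dict(zip(toy_model.classes_, proba.round(3))))
print(label)
```

Words that never occurred in the training data, such as "really" here, are simply ignored by the vectorizer, so the prediction rests on "disappointed" alone.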

In [13]:
y_pred = clf.predict(X_test)



In [14]:
print(y_pred)

[0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 1
 0 1 1 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1
 1 1 1 0 0 0 1 1 0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 1 1
 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 0 0
 1 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 0 0 1 0 0 1 0 0 1 1
 1 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1
 1 0 0 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1 1 0]


The above code obtains the classifier's predicted sentiment values for the test reviews and stores them in the y_pred array.


To determine the performance of our model we will create a confusion matrix, which counts the correct and incorrect predictions broken down by class.

In [15]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[ 87,  33],
       [ 20, 110]])

The vertical axis of sklearn's confusion matrix should be interpreted as the actual values, while the horizontal axis should be interpreted as the predicted values. Therefore, our model predicted 107 (87+20) reviews as having a sentiment score of 0, out of which 87 were correctly predicted and 20 were incorrectly predicted. Likewise, the model predicted 143 (110+33) reviews as having a sentiment score of 1, out of which 110 were correctly predicted and 33 were incorrectly predicted.

Therefore, the total number of correct predictions is obtained by summing the main diagonal (87+110 = 197). The accuracy is the number of correct predictions divided by the size of the test set (obtained by summing all the numbers in the confusion matrix). The accuracy in this case is therefore 197/250 = 78.8%. This is a decent accuracy score given the simple model and limited training data: we only have 750 abridged reviews for training. Tuning model parameters and adding further preprocessing steps such as lemmatization or stemming may improve the performance of the model.
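The arithmetic above can be reproduced directly from the confusion matrix. The sketch below also derives precision and recall for the positive class, which are not discussed in the text and are included only as an optional per-class view.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[87, 33],
               [20, 110]])

correct = np.trace(cm)     # main diagonal: 87 + 110
total = cm.sum()           # size of the test set
accuracy = correct / total

# Per-class view for the positive class (label 1)
precision_pos = cm[1, 1] / cm[:, 1].sum()   # correct among reviews predicted as 1
recall_pos = cm[1, 1] / cm[1, :].sum()      # correct among reviews that are actually 1

print(f"accuracy={accuracy:.3f} precision={precision_pos:.3f} recall={recall_pos:.3f}")
```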