The following are some important things to consider while creating and deploying the sentiment analyzer:
The training data should be consistent with the objective of the sentiment analyzer. Don't train the model using movie reviews if the objective is to predict the sentiment of financial news articles.


Accurately labeling the training data is critical for the model to perform well. We have used pre-labeled data in this chapter. However, if you are creating a real-world application, you will have to spend time labeling the training documents. Typically, labeling should be done by someone with a good understanding of industry jargon.


Sourcing training data is a difficult task. You can use techniques such as web scraping or social media scraping, subject to permissions. Source data from multiple platforms rather than relying too heavily on any single source.


Evaluate the performance of your model regularly and retrain the model if required.
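
The last point can be made concrete with a small periodic check. The following is only a sketch: the function name `needs_retraining`, the `min_accuracy` threshold of 0.75, and the sample labels are all hypothetical and would need to be chosen for the application at hand.

```python
from sklearn.metrics import accuracy_score

def needs_retraining(y_true, y_pred, min_accuracy=0.75):
    """Return True when accuracy on freshly labeled data falls below min_accuracy.

    min_accuracy is an illustrative threshold, not a recommended value.
    """
    return accuracy_score(y_true, y_pred) < min_accuracy

# Hypothetical labels for five recently reviewed documents (1 = positive, 0 = negative)
recent_labels = [1, 0, 1, 1, 0]
recent_predictions = [1, 0, 0, 0, 0]

# Three of five predictions match, so accuracy is 0.6 and the check flags retraining
print(needs_retraining(recent_labels, recent_predictions))
```

In practice the fresh labels would come from a manually reviewed sample of recent production inputs.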

In [2]:
# Sentiment analysis (also called opinion mining or polarity detection)
# extracts the polarity of a given document: positive, negative, or neutral.
# It is a cost-efficient way to gauge the opinions of a large group of users or potential customers.
#
# Typical applications: advertisement campaigns, political campaigns, stock analysis, and more.

[Sentiment Labelled Sentences dataset (UCI Machine Learning Repository)](http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)
#### The file contains a list of customer reviews, and each review is assigned a sentiment label,
#### with 0 representing negative sentiment and 1 representing positive sentiment.

In [4]:
import pandas as pd

In [5]:
data = pd.read_csv('amazon_cells_labelled.txt',sep='\t',header=None)

In [7]:
data.head()

Unnamed: 0,0,1
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


In [9]:
X = data.iloc[:,0] # first column: review text
y = data.iloc[:,-1] # last column: sentiment label (0 = negative, 1 = positive)

In [17]:
X

0      So there is no way for me to plug it in here i...
1                            Good case, Excellent value.
2                                 Great for the jawbone.
3      Tied to charger for conversations lasting more...
4                                      The mic is great.
                             ...                        
995    The screen does get smudged easily because it ...
996    What a piece of junk.. I lose more calls on th...
997                         Item Does Not Match Picture.
998    The only thing that disappoint me is the infra...
999    You can not answer calls with the unit, never ...
Name: 0, Length: 1000, dtype: object

In [10]:
# The CountVectorizer class performs key preprocessing steps on the text data, such as tokenization,
# lowercasing, and stop word removal, and encodes each document as a vector of word counts (bag of words).

#### Data Preprocessing

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
vectorizer = CountVectorizer(stop_words='english')
X_vec = vectorizer.fit_transform(X)
X_vec

<1000x1642 sparse matrix of type '<class 'numpy.int64'>'
	with 4702 stored elements in Compressed Sparse Row format>

In [18]:
print(vectorizer.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2



In [19]:
len(vectorizer.get_feature_names_out()) # vocabulary size

1642

In [13]:
X_vec.todense() # convert sparse matrix to dense matrix

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [14]:
# Each row represents one review, and each column holds the count of one unique vocabulary word in that review.
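
The row and column layout is easier to see on a tiny corpus. The two reviews below are made up for illustration and are not part of the dataset; stop word removal is left off so that every word shows up in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, used only to illustrate the count matrix layout
toy_reviews = ["great phone great battery", "bad battery"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# Columns follow the alphabetically sorted vocabulary: bad, battery, great, phone
print(sorted(toy_vectorizer.vocabulary_))

# One row per review, one column per vocabulary word
print(toy_counts.toarray())
```

The first review contains "great" twice, so its row has a 2 in the "great" column; the second review has zeros for the words it does not contain.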

In [20]:
# converting the word count matrix into a matrix with corresponding tf-idf values:

In [49]:
# Transform the data by applying term frequency-inverse document frequency (TF-IDF) weighting

In [21]:
from sklearn.feature_extraction.text import TfidfTransformer

In [22]:
tfidf = TfidfTransformer()

In [23]:
X_tfidf = tfidf.fit_transform(X_vec)
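X_tfidf = X_tfidf.todense() # dense numpy.matrix; recent scikit-learn releases reject numpy.matrix input, so use .toarray() there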

In [24]:
X_tfidf

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [25]:
# Because each review in the corpus is quite brief, most of the values in each row of the matrix are 0.
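
To show what the TF-IDF transformation computes, the sketch below reproduces it by hand on a small made-up count matrix and compares the result with `TfidfTransformer`. It assumes the transformer's default settings (smoothed IDF followed by L2 row normalization); the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up word counts: 2 documents x 4 vocabulary words
counts = np.array([[0, 1, 2, 1],
                   [1, 1, 0, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Words that appear in every document receive the lowest IDF weight, which is why common terms contribute less to the final feature vector.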

## Naive Bayes 

### Train the Model

#### train/test split (hold-out validation)

In [28]:
from sklearn.model_selection import train_test_split

In [29]:
X_train,X_test,y_train,y_test = train_test_split(X_tfidf,y, test_size=0.25, random_state=0)

In [30]:
from sklearn.naive_bayes import MultinomialNB

In [31]:
clf = MultinomialNB()

In [32]:
clf.fit(X_train,y_train)

MultinomialNB()

In [33]:
# Fitting means that the Naive Bayes classifier has estimated class and word probabilities from the training data and can now use them to classify new documents.

In [34]:
y_pred = clf.predict(X_test)

#### model evaluation

In [35]:
from sklearn.metrics import confusion_matrix

In [36]:
confusion_matrix(y_test,y_pred)

array([[ 87,  33],
       [ 20, 110]])

In [37]:
# In sklearn's confusion matrix, the rows (vertical axis) are the actual classes,
# while the columns (horizontal axis) are the predicted classes.

In [38]:
# The correct predictions lie on the diagonal, so the accuracy is (87 + 110)/250 = 197/250 = 78.8%.
# This is a decent accuracy score given the simple model and limited training data we had (only 750 abridged reviews).
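
The arithmetic above, along with precision and recall for the positive class, can be derived directly from the confusion matrix. This sketch hard-codes the matrix shown in the output above rather than recomputing it.

```python
import numpy as np

# Confusion matrix from the Naive Bayes run: rows = actual, columns = predicted
cm = np.array([[87, 33],
               [20, 110]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all reviews classified correctly
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```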

In [40]:
# Tuning model parameters and performing further preprocessing steps such as lemmatization, stemming, and so on can 
# improve the accuracy further.


## SVM

In [41]:
# SVM looks for the optimal hyperplane that best segregates the classes.

In [42]:
# SVM identifies the frontier data points (the points closest to the opposing class), also known as support vectors,
# and then attempts to find the boundary (also known as the hyperplane in the N-dimensional space)
# that is the farthest from the support vectors of each class.
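
The idea of support vectors can be seen on a toy two-dimensional problem. The four points below are invented for illustration; only the points nearest the opposing class should end up defining the boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Four made-up 2-D points: two per class, clearly separable
points = np.array([[0, 0], [0, 1], [3, 3], [3, 4]])
labels = np.array([0, 0, 1, 1])

toy_svm = SVC(kernel='linear')
toy_svm.fit(points, labels)

# The frontier points that define the separating hyperplane
print(toy_svm.support_vectors_)

# Points on either side of the boundary are assigned to the nearer class
print(toy_svm.predict([[0.5, 0.5], [3.0, 3.5]]))
```

Moving a point that is not a support vector (for example `[0, 0]`) further from the boundary would leave the fitted hyperplane unchanged.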

In [45]:
# Here, the goal is to identify a hyperplane that separates vectors in a 1,642-dimensional space into
# a positive sentiment class and a negative sentiment class.

In [48]:
X_tfidf.shape # 1642 is the number of features and we have 1000 examples

(1000, 1642)

In [50]:
from sklearn.svm import SVC

In [51]:
classifier = SVC(kernel='linear')
classifier.fit(X_train,y_train)

SVC(kernel='linear')

In [53]:
# The classifier has now identified the optimum hyperplane after identifying the
# frontier points and calculating the relevant distances based on the training data.

In [54]:
y_pred = classifier.predict(X_test)

In [55]:
# measure the performance

In [56]:
confusion_matrix(y_test,y_pred)

array([[102,  18],
       [ 33,  97]])

In [57]:
# The accuracy in this case is (102 + 97)/250 = 199/250 = 79.6%, which is marginally better than the Naive Bayes model's accuracy.

In [58]:
# The model's performance can be further improved by improving input data preprocessing (via lemmatization, stemming, and so on)
# and optimizing various SVM hyperparameters.
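
One common way to optimize SVM hyperparameters is a cross-validated grid search. The sketch below is a minimal illustration on a made-up eight-review corpus, not the notebook's dataset; the parameter grid, the two-fold setting, and the use of `TfidfVectorizer` (which combines counting and TF-IDF in one step) are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up reviews and labels (1 = positive, 0 = negative), for illustration only
toy_reviews = [
    "great phone", "excellent battery", "love this case", "works great",
    "terrible phone", "poor battery", "hate this case", "stopped working",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorization sits inside the pipeline so each CV fold fits it on its own training part
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC()),
])

# Illustrative grid: regularization strength C and kernel type
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(toy_reviews, toy_labels)

print(search.best_params_)
```

On the real dataset, the raw review column `X` and labels `y` would replace the toy lists, and a larger `cv` value would give more stable estimates.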

In [59]:
classifier_rbf = SVC(kernel='rbf')
classifier_rbf.fit(X_train,y_train)

SVC()

In [64]:
y_pred = classifier_rbf.predict(X_test)

In [65]:
confusion_matrix(y_test,y_pred)

array([[106,  14],
       [ 45,  85]])

In [67]:
# accuracy = (106 + 85)/250 = 76.4%, worse than the linear kernel

### Productionizing a trained sentiment analyzer

In [69]:
# We need a way to reuse this model to predict the sentiment of new product reviews.
# Pickling refers to serializing and deserializing Python object structures.

In [70]:
import pickle

In [71]:
with open('vectorizer','wb') as f:
    pickle.dump(vectorizer,f) # save the vectorizer for reuse

In [72]:
with open('svm_clf_sa','wb') as f:
    pickle.dump(classifier,f) # save the SVM classifier for reuse

In [96]:
with open('nb_clf','wb') as f:
    pickle.dump(clf,f) # save the Naive Bayes classifier for reuse

In [73]:
# We can now load these pickled objects whenever we need them.
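
An alternative to pickling each object separately is to wrap the vectorizer, the TF-IDF transformer, and the classifier in a single scikit-learn `Pipeline` and pickle that one object, so the preprocessing fitted during training always travels with the model. The sketch below uses a made-up four-review corpus and serializes in memory with `pickle.dumps`; it is an illustration of the idea, not the approach taken in this notebook.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up reviews and labels (1 = positive, 0 = negative), for illustration only
toy_reviews = ["great phone", "works great", "terrible phone", "poor battery"]
toy_labels = [1, 1, 0, 0]

# All three fitted steps live in one object
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
model.fit(toy_reviews, toy_labels)

# Round trip through pickle in memory; use pickle.dump/load with a file for persistence
restored = pickle.loads(pickle.dumps(model))

sample = ["great battery"]
print(restored.predict(sample) == model.predict(sample))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.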

In [74]:
# Create a function that classifies the sentiment of a new review using the trained model

In [106]:
def sentiment_analysis(classifier,training_matrix,doc,tfidf=None):
    '''function to predict the sentiment of a product review
    classifier: pre-trained model
    training_matrix: fitted CountVectorizer associated with the trained model
    doc: product review whose sentiment needs to be classified
    tfidf: optional TfidfTransformer fitted on the training data; when omitted,
           a new one is fitted on doc alone, so its IDF weights differ from training'''
    from sklearn.feature_extraction.text import TfidfTransformer
    # use transform, not fit_transform: the vectorizer is already fitted
    X_new = training_matrix.transform(pd.Series(doc))
    if tfidf is None:
        tfidf = TfidfTransformer().fit(X_new)
    X_tfidf_new = tfidf.transform(X_new).toarray() # dense array, matching the training input
    y_new = classifier.predict(X_tfidf_new)[0] # single document, so take the first prediction
    if y_new==0:
        return 'negative sentiment'
    elif y_new==1:
        return 'positive sentiment'

In [107]:
# Unpickle the pickled model and vectorizer and pass them to the function

In [123]:
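svm_clf = pickle.load(open('nb_clf','rb')) # note: 'nb_clf' holds the Naive Bayes model despite the variable name; load 'svm_clf_sa' for the SVM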

In [124]:
vectorizer = pickle.load(open('vectorizer','rb'))

In [131]:
new_doc = "The battery performance is not as expected ."

In [132]:
sentiment_analysis(svm_clf,vectorizer,new_doc)

'negative sentiment'

In [129]:
new_doc = "I dont think i like this product"

In [130]:
sentiment_analysis(svm_clf,vectorizer,new_doc)

'negative sentiment'

In [133]:
# Evaluate the performance of your model regularly and retrain the model if required.