A sentiment analyser will predict whether an input sentence has positive or negative sentiment.

Sentiment analysis is a classification task in machine learning. The classifier needs to classify a sentence as having positive sentiment (e.g. "I love coffee") or negative sentiment (e.g. "I hate golddiggers").

Let's start coding...

PREREQUISITES: PYTHON BASICS

In [1]:
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import CountVectorizer

The first line imports a classifier from the sklearn package, and the second line imports CountVectorizer.

It will all make sense in a minute.

In [2]:
train = [
     ('I love this sandwich.', 'pos'), ('this is an amazing place!', 'pos'),
     ('I feel very good about these beers.', 'pos'),
     ('this is my best work.', 'pos'),
     ("what an awesome view", 'pos'),
     ('I do not like this restaurant', 'neg'),
     ('I am tired of this stuff.', 'neg'),
     ("I can't deal with this", 'neg'),
     ('he is my sworn enemy!', 'neg'),
     ('my boss is horrible.', 'neg'),
     ("i hate you.",'neg'),
     ("i love mangoes.",'pos'),
     ("i am sad.", 'neg')
]

In [3]:
test = [
     ('the cake was good.', 'pos'),
     ('I do not enjoy my job', 'neg'),
     ("I am not feeling good today.", 'neg'),
     ("I feel amazing!", 'pos'),
     ('Ebey is an amazing person.', 'pos'),
     ("I can't believe I'm doing this.", 'neg')

]

We need a dataset to build such a classifier.

Here we initialized a training dataset and a testing dataset, each a list of tuples pairing an input sentence with its sentiment ('pos' or 'neg').

We will use the train dataset to train the model, and the test dataset to check whether the classifier predicts correctly.

The test dataset contains sentences the classifier did not see during training.

So let's take each dataset and separate it into features and labels.

In [4]:
train_features = []
train_labels = []
for data in train:
    train_features.append(data[0])
    train_labels.append(data[1])

We stored the features and labels for training separately in two lists, train_features and train_labels.
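The same split can also be written without an explicit loop. The following is only a sketch of one alternative, using a shortened copy of the training list so that it runs on its own; the loop above works just as well.

```python
# A shortened copy of the training data, just for illustration.
train = [
    ('I love this sandwich.', 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('i am sad.', 'neg'),
]

# zip(*train) regroups the (sentence, label) tuples into one tuple of
# sentences and one tuple of labels; list() turns each into a list.
train_features, train_labels = (list(column) for column in zip(*train))

print("training features: ", train_features)
print("training labels: ", train_labels)
```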

In [5]:
print("training features: ",train_features)


training features:  ['I love this sandwich.', 'this is an amazing place!', 'I feel very good about these beers.', 'this is my best work.', 'what an awesome view', 'I do not like this restaurant', 'I am tired of this stuff.', "I can't deal with this", 'he is my sworn enemy!', 'my boss is horrible.', 'i hate you.', 'i love mangoes.', 'i am sad.']


In [6]:
print("training labels: ",train_labels)

training labels:  ['pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg']


We do the same thing for the test dataset!

In [7]:
test_features = []
test_labels = []
for data in test:
    test_features.append(data[0])
    test_labels.append(data[1])

In [8]:
print("testing features: ",test_features)

testing features:  ['the cake was good.', 'I do not enjoy my job', 'I am not feeling good today.', 'I feel amazing!', 'Ebey is an amazing person.', "I can't believe I'm doing this."]


In [9]:
print("testing labels: ",test_labels)

testing labels:  ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']


The dataset we have is in textual form. Machine learning algorithms take only numerical data.

So we convert this textual data into vectors.

For that we will use CountVectorizer, which represents a sentence as a vector based on how often each word of the dataset's vocabulary occurs in it.

Let's see a small example first.

Consider sentences like these:

In [10]:
["balu love mougli", "balu love bageera" ]

['balu love mougli', 'balu love bageera']

Using CountVectorizer() we can represent these sentences as vectors.

First, initialize the vectorizer.

In [11]:
count_vect = CountVectorizer()

The vectorizer builds a vocabulary from the dataset and creates vectors based on how often each vocabulary word occurs in each sentence. The function fit_transform() does both steps: it builds the vocabulary and vectorizes the input using it.

In [12]:
x = count_vect.fit_transform(["balu love mougli","balu love bageera"])

Let's see the output.

In [13]:
print(x.toarray())

[[0 1 1 1]
 [1 1 1 0]]


The vectorizer created a vocabulary with one entry for each distinct word in the above dataset, in alphabetical order.

That is 'bageera', 'balu', 'love' and 'mougli'.

So, as you can see, the vector for a sentence contains four elements, each representing one word in the vocabulary. A sentence is represented by the number of times each vocabulary word occurs in it.

For example,

"balu love mougli" --> [0 1 1 1]

(The first word in the vocabulary, 'bageera', is not present in the sentence, so the first element is zero. All other words in the vocabulary are present once, so the elements in their positions are one. If a word occurs twice in a sentence, the corresponding element is two, and so on.)
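To make the counting concrete, here is a plain-Python sketch of the same idea. It is not how sklearn implements CountVectorizer: the helper names build_vocabulary and count_vector are made up for this illustration, and it simply splits on spaces, whereas the real vectorizer tokenizes with a regular expression and by default drops single-character tokens.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Map each distinct lowercased word to an index, in alphabetical order."""
    words = sorted({word for s in sentences for word in s.lower().split()})
    return {word: index for index, word in enumerate(words)}


def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]  # dict keeps insertion order


corpus = ["balu love mougli", "balu love bageera"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)

for sentence in corpus:
    print(sentence, "-->", count_vector(sentence, vocabulary))

# A word that occurs twice gets a count of two in its position.
repeated = count_vector("balu love love mougli", vocabulary)
print("balu love love mougli", "-->", repeated)
```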

Now say we have a new sentence. We have to convert it to a vector based on the vocabulary we created above.

In [14]:
["balu love pikachu"]

['balu love pikachu']

We use the transform() function of CountVectorizer to convert the new input to a vector based on the vocabulary we created.

In [15]:
y = count_vect.transform(["balu love pikachu"])

Note that we use the same object, count_vect, which already holds the vocabulary. So calling transform() vectorizes the new input using that vocabulary.

Let's see the output!

In [16]:
print(y.toarray())

[[0 1 1 0]]


Verify the output yourself. Note that 'pikachu' is not in the vocabulary, so it does not appear in the vector at all.
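One way to check by hand is the small sketch below. It hard-codes the four-word vocabulary from the example and splits on spaces, which is a simplification of what CountVectorizer really does; the function name transform_one is hypothetical.

```python
# The vocabulary learned from the two example sentences, word -> position.
vocabulary = {'bageera': 0, 'balu': 1, 'love': 2, 'mougli': 3}


def transform_one(sentence, vocabulary):
    """Vectorize a sentence with a fixed vocabulary, skipping unknown words."""
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in vocabulary:          # 'pikachu' fails this check
            vector[vocabulary[word]] += 1
    return vector


y = transform_one("balu love pikachu", vocabulary)
print(y)
```

Only 'balu' and 'love' are counted, which matches the vector printed above.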

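Now let's vectorize our training dataset. Calling fit_transform() again replaces the small example vocabulary with one built from the training sentences.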

In [17]:
x_train = count_vect.fit_transform(train_features)

In [18]:
print(x_train.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
  0 0]
 [0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
  0 0]
 [1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
  0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  1 0]
 [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
  0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0
  0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0
  0 0]
 [0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
  0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]
 [0 

Let's convert x_train to a dense array so that we can feed it to the machine learning model.

In [19]:
x_train = x_train.toarray()

So we have the features in numerical format; now we need the labels too. One conventional practice is to number the classes 0, 1, 2, ..., N-1 for N labels.

Here we have 2 labels, so let's represent 'pos' as 0 and 'neg' as 1.

In [20]:
y_train = []
for i in train_labels:
    if i == 'pos':
        y_train.append(0)
    elif i == 'neg':
        y_train.append(1)
print(y_train)

[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
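The same encoding can be written with a lookup dictionary, which scales more easily if more labels are added later. This is just an alternative sketch; the name label_to_id is made up here, and train_labels is copied from the output printed earlier so the snippet runs on its own.

```python
# Mapping from text label to integer id, as chosen above.
label_to_id = {'pos': 0, 'neg': 1}

# Training labels, copied from the earlier output.
train_labels = ['pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg',
                'neg', 'neg', 'neg', 'neg', 'pos', 'neg']

# Look up each label; an unknown label would raise a KeyError
# instead of being silently skipped.
y_train = [label_to_id[label] for label in train_labels]
print(y_train)
```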


Done. Let's build the classifier now!

In [21]:
classifier = GaussianNB()

In sklearn, a classifier has two main functions, fit() and predict().

fit() trains the model on the dataset. After training, we can use predict() to predict the label for a new input feature vector.

In [22]:
classifier.fit(x_train,y_train)

GaussianNB(priors=None)

OK... Let's test the classifier using the test dataset.

In [23]:
x_test = count_vect.transform(test_features)

The test features are transformed to vectors using the training dataset vocabulary.

In [24]:
x_test = x_test.toarray()

In [25]:
k = classifier.predict(x_test)

Now we have predicted the sentiments of the test dataset! If the predicted label is 0, the corresponding sentence has positive sentiment; if it is 1, the sentiment is negative.

In [26]:
for i,data in enumerate(test_features):
    print("input: ",data)
    if k[i] == 0:
        print("sentiment: ","positive")
    elif k[i] == 1:
        print("sentiment: ","negative")

input:  the cake was good.
sentiment:  positive
input:  I do not enjoy my job
sentiment:  negative
input:  I am not feeling good today.
sentiment:  negative
input:  I feel amazing!
sentiment:  positive
input:  Ebey is an amazing person.
sentiment:  positive
input:  I can't believe I'm doing this.
sentiment:  negative
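
Reading the predictions one by one works for six sentences, but a single accuracy number is easier to compare. The sketch below assumes the predictions are exactly the ones printed above (copied in by hand as 0 for positive and 1 for negative) and compares them with the test labels; the names predicted and label_to_id are introduced only for this illustration.

```python
# True labels of the test dataset, from the earlier output.
test_labels = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# Predictions copied by hand from the printed results above
# (positive -> 0, negative -> 1); in the notebook this would be k.
predicted = [0, 1, 1, 0, 0, 1]

# Encode the true labels the same way as the training labels.
label_to_id = {'pos': 0, 'neg': 1}
y_test = [label_to_id[label] for label in test_labels]

# Accuracy = fraction of predictions that match the true label.
correct = sum(p == t for p, t in zip(predicted, y_test))
accuracy = correct / len(y_test)
print("correct: ", correct, "of", len(y_test))
print("accuracy: ", accuracy)
```

With only six test sentences this number says little about how the classifier would behave on other text.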
