## Sentiment Classifier And Analysis

We will use the spaCy library to train a sentiment classifier. We will then use the trained model to predict the sentiment of a few custom reviews and check whether its predictions are sensible.

In [1]:
# Importing libraries
import pandas as pd
from datetime import datetime
import spacy
import spacy_transformers

# Storing docs in binary format
from spacy.tokens import DocBin

In [2]:
df = pd.read_csv('../data/final_dataset.csv')
df.shape

(2896, 6)

We split the dataset into train and test sets using an 80-20 split.

In [3]:
train = df.sample(frac = 0.8, random_state = 25)
test = df.drop(train.index)
print(train.shape, test.shape)

(2317, 6) (579, 6)
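To make the split logic explicit, here is a minimal pure-Python sketch of the same idea: shuffle the row positions, keep 80% for training, and use the remainder for testing. The helper name split_indices is hypothetical, and because it uses Python's random module rather than pandas, it will not pick the same rows as the cell above; it only illustrates why the sizes come out as 2317 and 579.

```python
import random

def split_indices(n_rows, frac=0.8, seed=25):
    """Shuffle row positions and cut them into train and test parts."""
    rng = random.Random(seed)
    positions = list(range(n_rows))
    rng.shuffle(positions)
    n_train = round(frac * n_rows)          # 0.8 * 2896 = 2316.8, rounded to 2317
    train_idx = sorted(positions[:n_train])
    test_idx = sorted(positions[n_train:])  # everything not sampled for training
    return train_idx, test_idx

train_idx, test_idx = split_indices(2896)
print(len(train_idx), len(test_idx))
```

The two index lists are disjoint and together cover every row, which is the property that df.drop(train.index) provides in the notebook.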


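Loading the required spaCy pipeline, en_core_web_sm, an English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer. The senter component is disabled by default, so it does not appear in the list of pipe names below.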

In [5]:
import spacy
nlp=spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

The first step is to create tuples that pair each review text with its rating. Creating tuples for the train dataset.

In [6]:
train['tuples'] = train.apply(lambda row: (row['Customer Review'],row['Customer Rating']), axis=1)
train = train['tuples'].tolist()

Creating tuples for test dataset.

In [7]:
test['tuples'] = test.apply(lambda row: (row['Customer Review'],row['Customer Rating']), axis=1)
test = test['tuples'].tolist()

Checking one of the tuples created above. As you can see, it pairs a review with its rating label.

In [11]:
train[0]

('great property warm welcome staff food delicious great stay definitely recommend family friend staff go extra mile stay comfortable',
 'Excellent')
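Before training, it can help to check how the rating labels are distributed, since a heavily skewed set tends to bias the classifier toward the majority class. The sketch below counts labels in a list of (text, label) tuples. The list sample_pairs is a hypothetical stand-in for the train list, and the 'Average' label for a 3 star review is an assumption; the actual label names in the dataset may differ.

```python
from collections import Counter

# Hypothetical stand-in for the `train` list built above: (review, rating) pairs.
sample_pairs = [
    ("great stay friendly staff", "Excellent"),
    ("clean room good food", "Very Good"),
    ("room dirty staff rude", "Poor"),
    ("worst stay ever", "Terrible"),
    ("stay fine nothing special", "Average"),  # assumed label for a 3 star review
    ("lovely property", "Excellent"),
]

def label_counts(pairs):
    """Count how often each rating label occurs in a list of (text, label) pairs."""
    return Counter(label for _, label in pairs)

counts = label_counts(sample_pairs)
for label, count in counts.most_common():
    print(f"{label}: {count} ({count / len(sample_pairs):.0%})")
```

Calling label_counts(train) in the notebook would give the same summary for the real training tuples.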

The second step is to create a spaCy document for each tuple in the train and test datasets using the spaCy pipeline nlp loaded above (en_core_web_sm, which is a small CPU pipeline rather than a transformer model). Each tuple is simply a review text and its rating. Note that we map 4 and 5 star ratings (Excellent and Very Good) to positive sentiment, 1 and 2 star ratings (Poor and Terrible) to negative, and everything else, i.e. 3 star ratings, to neutral.

In [12]:
# Rating labels that map to a non-neutral sentiment; any other label is treated as neutral
SENTIMENT_BY_RATING = {
  'Excellent': 'positive',
  'Very Good': 'positive',
  'Poor': 'negative',
  'Terrible': 'negative',
}

def document(data):
  docs = []
  # nlp.pipe with as_tuples=True yields (doc, label) pairs for (text, label) input
  for doc, label in nlp.pipe(data, as_tuples = True):
    sentiment = SENTIMENT_BY_RATING.get(label, 'neutral')
    # Set a one-hot score over the three categories on the doc
    for cat in ('positive', 'negative', 'neutral'):
      doc.cats[cat] = 1 if cat == sentiment else 0
    docs.append(doc)

  return docs

In [25]:
# Calculate the time for converting into binary document for train dataset

start_time = datetime.now()

#passing the train dataset into function 'document'
train_docs = document(train)

#Creating binary document using DocBin function in spaCy
doc_bin = DocBin(docs = train_docs)

#Saving the binary document as train.spacy
doc_bin.to_disk("train.spacy")
end_time = datetime.now()

#Printing the time duration for train dataset
print('Duration: {}'.format(end_time - start_time))

Duration: 0:00:02.901909


In [26]:
# Calculate the time for converting into binary document for test dataset

start_time = datetime.now()

#passing the test dataset into function 'document'
test_docs = document(test)
doc_bin = DocBin(docs = test_docs)
doc_bin.to_disk("test.spacy")
end_time = datetime.now()

#Printing the time duration for test dataset
print('Duration: {}'.format(end_time - start_time))


Duration: 0:00:00.758820


We use the base spaCy config file to define our own configuration for the model parameters. The default configuration is sufficient for our purposes; all we need to do is define the paths to the train and dev files.

In [27]:
#Converting base configuration into full config file
!python -m spacy init fill-config ./base_config.cfg ./config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Finally, we run the training command to train our spaCy model for sentiment analysis. The trained model is saved to the output_updated folder, from which we can load it to predict review sentiments.

In [28]:
start_time = datetime.now()

!python -m spacy train ./config.cfg --verbose  --output ./output_updated

end_time = datetime.now()

print('Duration: {}'.format(end_time - start_time))

ℹ Saving to output directory: output_updated
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
[2022-12-15 23:12:14,530] [INFO] Set up nlp object from config
[2022-12-15 23:12:14,535] [DEBUG] Loading corpus from path: test.spacy
[2022-12-15 23:12:14,535] [DEBUG] Loading corpus from path: train.spacy
[2022-12-15 23:12:14,535] [INFO] Pipeline: ['textcat']
[2022-12-15 23:12:14,537] [INFO] Created vocabulary
[2022-12-15 23:12:14,537] [INFO] Finished initializing nlp object
[2022-12-15 23:12:15,183] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline
[2022-12-15 23:12:15,189] [DEBUG] Loading corpus from path: test.spacy
[2022-12-15 23:12:15,189] [DEBUG] Loading corpus from path: train.spacy
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.22        0.00  

In [43]:
text = "The stay was bad. The serivce was poor."

In [44]:
#Test the data from the best model
nlp = spacy.load("output_updated/model-best")
demo = nlp(text)
print(demo.cats)

{'positive': 0.25720450282096863, 'negative': 0.4230425953865051, 'neutral': 0.31975290179252625}


In [45]:
text1 = "We had an amazing time. The rooms were very clean. The food tasted amazing and staff was very courteous"

In [46]:
#Test the data from the best model
nlp = spacy.load("output_updated/model-best")
demo = nlp(text1)
print(demo.cats)

{'positive': 0.9619991779327393, 'negative': 0.013230466283857822, 'neutral': 0.024770323187112808}


In [2]:
text2 = "Stay was as expected. The service was on par with our expecations"

In [3]:
#Test the data from the best model
nlp = spacy.load("output_updated/model-best")
demo = nlp(text2)
print(demo.cats)

{'positive': 0.7474048137664795, 'negative': 0.1129467636346817, 'neutral': 0.13964837789535522}
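
To turn these score dictionaries into a single predicted label, one option is to take the category with the highest score. The helper predicted_label below is a hypothetical name, not part of spaCy; the scores are copied from the three outputs above.

```python
def predicted_label(cats):
    """Return the (label, score) pair with the highest score in a doc.cats-style dict."""
    label = max(cats, key=cats.get)
    return label, cats[label]

# Scores copied from the three model outputs above.
outputs = {
    "text": {'positive': 0.25720450282096863, 'negative': 0.4230425953865051, 'neutral': 0.31975290179252625},
    "text1": {'positive': 0.9619991779327393, 'negative': 0.013230466283857822, 'neutral': 0.024770323187112808},
    "text2": {'positive': 0.7474048137664795, 'negative': 0.1129467636346817, 'neutral': 0.13964837789535522},
}

for name, cats in outputs.items():
    label, score = predicted_label(cats)
    print(f"{name}: {label} ({score:.2f})")
```

The first two reviews come out as negative and positive, as expected. The third review, which is worded neutrally, is scored as positive (0.75) with neutral at only 0.14, so the model appears weaker at recognising neutral reviews than clearly positive or negative ones.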
