# LAB 5. LSTM FOR TEXT CLASSIFICATION & SENTIMENT ANALYSIS

In [1]:
"""1. Load the pipeline and the en_core_web_md modules
2. Show the components considered in the pipeline
3. Load the SA dataset from Campus Virtual
4. Explore the dataset to describe it
5. Add the text categorizer component (using a multilabel model) to the pipeline
6. Add two labels: positive and negative sentiments
7. Create the comments’ samples
8. Initialize the pipeline
9. Enable the text categorizer component to be trained
10. Create an optimizer object (resume_training) to keep weights of existing statistical models
11. Set 5 training epochs, and loss values
12. Test new data"""


### 1. Load the pipeline and the en_core_web_md modules

In [2]:
# Load the pipeline. The lab statement asks for en_core_web_md,
# but this run downloads and loads the smaller en_core_web_sm model.
import spacy

spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")



✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


### 2. Show the components considered in the pipeline

In [3]:
#Show the components considered in the pipeline
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### 3. Load the SA dataset from Campus Virtual

In [4]:
#Load the SA dataset from Campus Virtual
import pandas as pd

sadataset = pd.read_csv("./contents/SA_dataset.csv")

In [5]:
sadataset.head()

Unnamed: 0,Review,Rating,Sentiment
0,**Possible Spoilers**,1,0
1,"Read the book, forget the movie!",2,0
2,**Possible Spoilers Ahead**,2,0
3,"What a script, what a story, what a mess!",2,0
4,I hope this group of film-makers never re-unites.,1,0


### 4. Explore the dataset to describe it

In [6]:
#Explore the dataset to describe it
print(sadataset.describe())


            Rating    Sentiment
count  5000.000000  5000.000000
mean      5.902200     0.550000
std       3.653944     0.497543
min       1.000000     0.000000
25%       2.000000     0.000000
50%       7.000000     1.000000
75%      10.000000     1.000000
max      10.000000     1.000000


In [7]:
#Get rating distribution
rating_distribution = sadataset['Rating'].value_counts()
print(rating_distribution)
#Now print it in percentages 
rating_distribution = sadataset['Rating'].value_counts(normalize=True)
print(rating_distribution)

Rating
10    1385
1     1061
8      520
9      472
3      401
4      401
2      387
7      373
Name: count, dtype: int64
Rating
10    0.2770
1     0.2122
8     0.1040
9     0.0944
3     0.0802
4     0.0802
2     0.0774
7     0.0746
Name: proportion, dtype: float64
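
The two summaries above fit one reading of the `Sentiment` column: ratings 7 to 10 add up to 55% of the reviews, which matches the mean of 0.55 reported by `describe()`. The sketch below redoes that arithmetic from the printed counts. The threshold of 7 is an inference from these numbers, not something the dataset documents.

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {10: 1385, 1: 1061, 8: 520, 9: 472, 3: 401, 4: 401, 2: 387, 7: 373}

total = sum(rating_counts.values())

# Assumption: a review counts as positive when its rating is 7 or higher.
positive = sum(n for rating, n in rating_counts.items() if rating >= 7)
negative = total - positive

positive_share = positive / total
print(total, positive, negative, positive_share)
```

Note also that no review carries a rating of 5 or 6, so the two classes are separated by a gap in the rating scale.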


### 5. Add the text categorizer component (using a multilabel model) to the pipeline

In [8]:
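# Add the text categorizer component to the pipeline.
# Note: "textcat" is spaCy's single-label categorizer (mutually exclusive classes);
# the multilabel variant named in the task is registered as "textcat_multilabel".
nlp.add_pipe("textcat")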

print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat']


### 6. Add two labels: positive and negative sentiments

In [9]:
#Add two labels: positive and negative sentiments
nlp.get_pipe("textcat").add_label("positive")
nlp.get_pipe("textcat").add_label("negative")


1

### 7. Create the comments’ samples

In [10]:
sadataset.tail()

Unnamed: 0,Review,Rating,Sentiment
4995,"I have only seen this once--in 1986, at an ""ar...",10,1
4996,"This being my first John Carpenter film, I mus...",9,1
4997,"This is kind of a weird movie, given that Sant...",1,0
4998,"Vic (Richard Dreyfuss) is a mob boss, leaving ...",4,0
4999,"Yup, that's right folks, this is undoubtedly t...",1,0


In [21]:
#Create the comments’ samples
train_texts = sadataset["Review"].values
train_sentiments = sadataset["Sentiment"].values
# Each annotation maps both labels to a plain Python bool
# (Sentiment is 0 for negative reviews and 1 for positive ones)
train_labels = [{"cats": {"negative": bool(label == 0),
                          "positive": bool(label == 1)}}
                for label in train_sentiments]
train_data = list(zip(train_texts, train_labels))

train_data[-10:]

[('This movie is funny in more ways than one. It\'s got action. It\'s got humour. It\'s got attitude. It\'s got Dolemite\'s all girl army of kung-fu hos! And that\'s just what the movie offers as a film. It\'s also badly acted by some, the mic makes more than one cameo appearance, and some "punches" miss by feet. But when you make a movie this cool, who\'s got time to pay attention to those "details"? This movie rocks. Rent it tonight, if you can find it... I had to buy it to see it, but I don\'t regret it!',
  {'cats': {'negative': False, 'positive': True}}),
 ('I am sick of series with young and clueless people, talking about their "problems" all the time, self centered, boring and absolutely annoying (Popular; Dawson\'s Creek; Beverly Hills; etc). "Hack" is a breath of fresh air, with a great actor (David Morse), a completely different plot, credible people with REAL problems (thank God !!) and very, very good histories. I just love it!! I hope "Hack" will go on for a long time, bec
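
The sample format built in this step can be wrapped in a small helper so it can be reused for other text and label lists. This is a minimal sketch: `make_samples` and the two toy reviews are hypothetical names and data introduced for illustration, not part of the lab or the dataset.

```python
def make_samples(texts, sentiments):
    """Pair each text with a {"cats": {...}} annotation dict (0 = negative, 1 = positive)."""
    samples = []
    for text, sentiment in zip(texts, sentiments):
        cats = {"negative": bool(sentiment == 0), "positive": bool(sentiment == 1)}
        samples.append((text, {"cats": cats}))
    return samples

# Hypothetical toy reviews, not taken from SA_dataset.csv
toy_texts = ["Loved every minute.", "A complete waste of time."]
toy_sentiments = [1, 0]

toy_samples = make_samples(toy_texts, toy_sentiments)
for sample in toy_samples:
    print(sample)
```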

### 8. Initialize the pipeline

In [34]:
nlp.initialize()

<thinc.optimizers.Optimizer at 0x21a95fa0c20>

In [35]:
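#Initialize the pipeline
# nlp.begin_training() is the deprecated spaCy v2 name for nlp.initialize(),
# so this call repeats the initialization done in the previous cell
nlp.begin_training()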

<thinc.optimizers.Optimizer at 0x21a95f9cb80>

### 9. Enable the text categorizer component to be trained

In [36]:
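#enable the text categorizer to be trained
# In spaCy v3 components have no begin_training() method; calling textcat.begin_training()
# raises AttributeError: 'TextCategorizer' object has no attribute 'begin_training'.
# Disabling every other component leaves textcat as the only one updated by nlp.update().
textcat = nlp.get_pipe("textcat")
disabled = nlp.select_pipes(enable="textcat")  # disabled.restore() re-enables the rest
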

### 10. Create an optimizer object (resume_training) to keep weights of existing statistical models

In [37]:

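#Create an optimizer object (resume_training) to keep weights of existing statistical models
from spacy.util import minibatch
import random

random.seed(1)
spacy.util.fix_random_seed(1)
# resume_training() creates the optimizer for an already initialized pipeline,
# which is what this step asks for (begin_training() would initialize it again)
optimizer = nlp.resume_training()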
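
With a fixed integer `size`, the `minibatch` helper imported above yields the items in consecutive chunks of that length, with a shorter final chunk when the data does not divide evenly. The following pure-Python sketch mimics that behaviour for illustration only; `chunked` is a hypothetical name and this is not spaCy's own implementation.

```python
from itertools import islice
import math

def chunked(items, size):
    """Yield consecutive lists of up to `size` items (hypothetical stand-in for minibatch)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# 20 items with a batch size of 8 give two full batches and one partial batch.
batches = list(chunked(range(20), size=8))
print([len(b) for b in batches])

# Number of updates per epoch for the 5000 reviews in this lab with size=8.
batches_per_epoch = math.ceil(5000 / 8)
print(batches_per_epoch)
```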

### 11. Set 5 training epochs, and loss values

In [38]:
#Spacy's Example class is used to create the training data
from spacy.training.example import Example

#Set 5 training epochs, and loss values
# Note: losses is created once, so nlp.update() keeps adding to it and the
# value printed after each epoch is a running total, not a per-epoch loss.
losses = {}

for epoch in range(5):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, annotations) pairs; wrap every pair
        # in an Example, which is the input type nlp.update() expects
        examples = [Example.from_dict(nlp.make_doc(text), annotations)
                    for text, annotations in batch]
        nlp.update(examples, sgd=optimizer, drop=0.3, losses=losses)
    print(losses)

{'textcat': 0.25}
{'textcat': 0.4930790662765503}
{'textcat': 0.7365425527095795}
{'textcat': 0.9751671850681305}
{'textcat': 1.2012136280536652}
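
Because the `losses` dictionary is created once before the loop and `nlp.update` adds to it, the values printed above are running totals. The sketch below recovers the loss added in each epoch by differencing the printed numbers. It only reuses the figures shown in the output and makes no claim about what a corrected run would print.

```python
# Running totals copied from the training output above.
cumulative = [0.25, 0.4930790662765503, 0.7365425527095795,
              0.9751671850681305, 1.2012136280536652]

# Subtract the previous total from each value to get the loss added per epoch.
per_epoch = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for epoch, loss in enumerate(per_epoch, start=1):
    print(f"epoch {epoch}: {loss:.4f}")
```

Read this way, the loss per epoch falls only slightly over the five epochs. To print per-epoch values directly, `losses` could be reset at the start of each epoch.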


### 12. Test new data

In [41]:
#Test new data
test_text = "This movie sucked"

doc = nlp(test_text)
doc.cats



{'POSITIVE': 0.5026506781578064, 'NEGATIVE': 0.4973493814468384}

In [42]:
test_text = "This movie was the best one I have ever seen"

doc = nlp(test_text)
doc.cats

{'POSITIVE': 0.502112090587616, 'NEGATIVE': 0.4978879392147064}
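
A `doc.cats` dictionary can be reduced to a single prediction by taking the label with the highest score. The sketch below does this for the two outputs above; `predicted_label` is a hypothetical helper name introduced here for illustration.

```python
def predicted_label(cats):
    """Return the (label, score) pair with the highest score in a doc.cats-style dict."""
    label = max(cats, key=cats.get)
    return label, cats[label]

# Scores copied from the two outputs above.
first = {"POSITIVE": 0.5026506781578064, "NEGATIVE": 0.4973493814468384}   # "This movie sucked"
second = {"POSITIVE": 0.502112090587616, "NEGATIVE": 0.4978879392147064}   # "...the best one..."

for cats in (first, second):
    label, score = predicted_label(cats)
    margin = abs(cats["POSITIVE"] - cats["NEGATIVE"])
    print(label, round(score, 4), round(margin, 4))
```

Both sentences come out as `POSITIVE` with a margin below 0.01, including the clearly negative one, so at this point the scores are close to a coin flip and the model does not yet separate the two classes.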