# Multiclass classification for movie reviews

* Multiclass/multilabel classification in the Semeval 2015 ABSA corpus
* Subjectivity classification for movie reviews 

## Multiclass/multilabel classification in the Semeval 2015 ABSA corpus

Let's load up the ABSA corpus with BeautifulSoup.

In [2]:
import numpy as np
from bs4 import BeautifulSoup
filename = "ABSA-15_Restaurants_Train_Final.xml"
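# The "xml" parser requires the lxml package; the with-block closes the file after parsing
with open(filename, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "xml")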

In [3]:
soup

<?xml version="1.0" encoding="utf-8"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions>
<Opinion category="RESTAURANT#GENERAL" from="51" polarity="negative" target="place" to="56"/>
</Opinions>
</sentence>
<sentence id="1004293:1">
<text>We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.</text>
<Opinions>
<Opinion category="SERVICE#GENERAL" from="75" polarity="negative" target="staff" to="80"/>
</Opinions>
</sentence>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion category="SERVICE#GENERAL" from="0" polarity="negative" target="NULL" to="0"/>
</Opinions>
</sentence>
<sentence id="1004293:3">
<text>The food was lousy - too sweet or too salty and the 

Let's build a multiclass logistic regression classifier for aspects, using only sentences that have exactly one opinion.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = []
aspects = []
polarities = []
for sentence in soup.find_all("sentence"):
    opinions = sentence.find_all("Opinion")
    if len(opinions) == 1:
        opinion = opinions[0]
        aspects.append(opinion["category"])
        polarities.append(opinion["polarity"])
        texts.append(sentence.find("text").string)
        
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

clf = LogisticRegression()
clf.fit(X,aspects)

LogisticRegression()

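The fitted model has one weight vector per class, so `clf.coef_` has one row per aspect and one column per vocabulary word. We are going to pull out the highest-weighted word for a class.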

In [5]:
clf.coef_.shape

(12, 1672)

In [6]:
clf.classes_

array(['AMBIENCE#GENERAL', 'DRINKS#PRICES', 'DRINKS#QUALITY',
       'DRINKS#STYLE_OPTIONS', 'FOOD#PRICES', 'FOOD#QUALITY',
       'FOOD#STYLE_OPTIONS', 'LOCATION#GENERAL', 'RESTAURANT#GENERAL',
       'RESTAURANT#MISCELLANEOUS', 'RESTAURANT#PRICES', 'SERVICE#GENERAL'],
      dtype='<U24')

In [6]:
import numpy as np
class_index = 0
print(clf.classes_[class_index])
max_ind = np.argmax(clf.coef_[class_index])
print(vectorizer.get_feature_names_out()[max_ind])

AMBIENCE#GENERAL
atmosphere
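
To look beyond the single best word, one option is to rank all the weights in a class's row and keep the top k. The sketch below uses invented toy weights in place of `clf.coef_` and `vectorizer.get_feature_names_out()`; `top_features` is a hypothetical helper, not part of scikit-learn.

```python
def top_features(coef_row, feature_names, k=5):
    """Return the k (feature, weight) pairs with the largest weights in one class's row."""
    # Rank column indices by weight, largest first, and keep the first k.
    ranked = sorted(range(len(coef_row)), key=lambda i: coef_row[i], reverse=True)
    return [(feature_names[i], coef_row[i]) for i in ranked[:k]]


# Hypothetical toy data standing in for clf.classes_, clf.coef_ and
# vectorizer.get_feature_names_out(); the numbers are invented for illustration.
toy_classes = ["AMBIENCE#GENERAL", "SERVICE#GENERAL"]
toy_features = ["atmosphere", "decor", "food", "rude", "waiter"]
toy_coef = [
    [1.9, 1.2, -0.4, -0.1, -0.6],   # weights for AMBIENCE#GENERAL
    [-0.5, -0.3, -0.8, 1.4, 1.7],   # weights for SERVICE#GENERAL
]

for label, row in zip(toy_classes, toy_coef):
    print(label, [name for name, _ in top_features(row, toy_features, k=2)])
```

With the fitted objects from this notebook, the equivalent call would be `top_features(clf.coef_[class_index], vectorizer.get_feature_names_out(), k=5)`.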


We can do the same for polarities

In [7]:
clf = LogisticRegression()
clf.fit(X,polarities)

LogisticRegression()

In [8]:
clf.coef_.shape

(3, 1672)

In [12]:
clf.classes_

array(['negative', 'neutral', 'positive'], dtype='<U8')

In [10]:
class_index = 0  # this is for the `negative` class
print(clf.classes_[class_index])
max_ind = np.argmax(clf.coef_[class_index])  # the index of the word that has the maximum coef
print(vectorizer.get_feature_names_out()[max_ind])

negative
don


Next, we evaluate with cross-validation, using the macro-averaged F1 score as the metric.

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score,make_scorer

cross_val_score(LogisticRegression(), X, aspects, scoring=make_scorer(f1_score,average="macro"))



array([0.17901824, 0.25822287, 0.28381585, 0.26912907, 0.24725083])

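With 12 aspect classes, uniform random guessing would be correct about 8% of the time (1/12). The scores above are macro-F1 rather than accuracy, but at roughly 0.18 to 0.28 they are still well above that chance level.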

In [14]:
cross_val_score(LogisticRegression(), X, polarities, scoring=make_scorer(f1_score,average="macro"))

array([0.42735761, 0.43469354, 0.42680015, 0.41176994, 0.41268181])

With 3 polarity classes, uniform random guessing would be correct about 33% of the time. Again, the macro-F1 scores (roughly 0.41 to 0.43) are above that chance level.

--------
Note:
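- macro average: compute the metric (here F1) separately for each class, then take the unweighted mean, so every class counts equally no matter how many examples it has.
- micro average: pool the true positives, false positives, and false negatives across all classes, then compute the metric once, so frequent classes dominate the score. For single-label classification, micro-averaged F1 equals accuracy.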
--------
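
The difference between the two averages is easiest to see on an imbalanced toy example. The sketch below computes both by hand in plain Python; the labels are invented, and `f1_scores` is a hypothetical helper written for illustration (in practice `sklearn.metrics.f1_score` with `average="macro"` or `average="micro"` does this).

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[gold] += 1   # gold class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: F1 per class, then an unweighted mean over classes.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute F1 once.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Invented toy data: 8 positive and 2 negative examples,
# scored against a classifier that always predicts "positive".
y_true = ["positive"] * 8 + ["negative"] * 2
y_pred = ["positive"] * 10

macro, micro = f1_scores(y_true, y_pred)
print(round(macro, 3), round(micro, 3))
```

Here the micro average (0.8) matches the accuracy, while the macro average (about 0.44) is pulled down by the class that is never predicted. The same effect is one plausible reason the accuracy and macro-F1 numbers in this notebook differ so much.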

In [15]:
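# With no `scoring` argument, cross_val_score uses the classifier's default score, which is accuracy
cross_val_score(LogisticRegression(), X, aspects)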



array([0.56578947, 0.65789474, 0.64473684, 0.59868421, 0.60264901])

In [16]:
cross_val_score(LogisticRegression(), X, polarities)

array([0.75      , 0.73026316, 0.76973684, 0.72368421, 0.74172185])

## Subjectivity classification for lexicon-based polarity classification in movie reviews

Let's look at the subjectivity corpus, which contains subjective and objective sentences from movie reviews.

In [19]:
import nltk
nltk.download('subjectivity')

[nltk_data] Downloading package subjectivity to
[nltk_data]     /Users/lxy/nltk_data...
[nltk_data]   Unzipping corpora/subjectivity.zip.


True

In [20]:
from nltk.corpus import subjectivity

subjectivity.categories()

['obj', 'subj']

In [24]:
for sents in subjectivity.sents(categories="obj"):
    print(sents)
    break

['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.']


In [25]:
for sents in subjectivity.sents(categories="subj"):
    print(sents)
    break

['smart', 'and', 'alert', ',', 'thirteen', 'conversations', 'about', 'one', 'thing', 'is', 'a', 'small', 'gem', '.']


Let's build a basic classifier over part-of-speech tag unigrams and bigrams and see how well it distinguishes these two kinds of text.

In [26]:
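from nltk import everygrams  # yields every n-gram with length between min_len and max_len
from nltk import pos_tag  # needs the NLTK POS tagger model to be downloaded

def get_feature_dict(sentence):
    """Map a tokenized sentence to binary POS-tag unigram and bigram features."""
    tagged_sent = pos_tag(sentence)  # list of (word, tag) pairs
    tags = [tag for word, tag in tagged_sent]
    feature_dict = {}
    for gram in everygrams(tags, min_len=1, max_len=2):  # every tag unigram and bigram
        feature_dict[gram] = 1  # 1 marks presence; repeated n-grams are not counted
    return feature_dict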
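
To make the feature extraction concrete without needing the tagger, here is a minimal pure-Python sketch of the same idea applied to a tag sequence that is already given. `tag_ngrams` and `binary_features` are hypothetical helpers that mimic what `everygrams(tags, min_len=1, max_len=2)` is used for here; the toy tags are invented.

```python
def tag_ngrams(tags, max_len=2):
    """Yield every n-gram of length 1..max_len as a tuple, position by position."""
    for start in range(len(tags)):
        for n in range(1, max_len + 1):
            if start + n <= len(tags):
                yield tuple(tags[start:start + n])


def binary_features(tags):
    """Map each distinct tag n-gram to 1 (presence), ignoring how often it occurs."""
    return {gram: 1 for gram in tag_ngrams(tags)}


# Invented tag sequence, e.g. for "the movie follows the boy".
toy_tags = ["DT", "NN", "VBZ", "DT", "NN"]
features = binary_features(toy_tags)
print(features)
```

Because the features are binary, the second `('DT', 'NN')` bigram adds nothing new: the dictionary ends up with six distinct keys even though the sequence contains nine n-grams.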

In [27]:
from sklearn.feature_extraction import DictVectorizer

feature_dicts = []
y = []

for sent in subjectivity.sents(categories="obj"):
    feature_dicts.append(get_feature_dict(sent))
    y.append("obj")
    
for sent in subjectivity.sents(categories="subj"):
    feature_dicts.append(get_feature_dict(sent))
    y.append("subj")

vectorizer = DictVectorizer()

X = vectorizer.fit_transform(feature_dicts)


Each entry in feature_dicts maps the POS-tag unigrams and bigrams of one sentence to 1:

* The value is binary: it records whether the n-gram occurs in the sentence, not how many times.

In [28]:
feature_dicts

[{('DT',): 1,
  ('DT', 'NN'): 1,
  ('NN',): 1,
  ('NN', 'VBZ'): 1,
  ('VBZ',): 1,
  ('VBZ', 'IN'): 1,
  ('IN',): 1,
  ('IN', 'DT'): 1,
  ('NN', 'WRB'): 1,
  ('WRB',): 1,
  ('WRB', 'DT'): 1,
  ('DT', 'JJ'): 1,
  ('JJ',): 1,
  ('JJ', 'NN'): 1,
  ('NN', 'VBN'): 1,
  ('VBN',): 1,
  ('VBN', 'JJ'): 1,
  ('JJ', 'NNS'): 1,
  ('NNS',): 1,
  ('NNS', 'TO'): 1,
  ('TO',): 1,
  ('TO', 'VB'): 1,
  ('VB',): 1,
  ('VB', 'NN'): 1,
  ('NN', 'IN'): 1,
  ('NN', '.'): 1,
  ('.',): 1},
 {('VBG',): 1,
  ('VBG', 'IN'): 1,
  ('IN',): 1,
  ('IN', 'DT'): 1,
  ('DT',): 1,
  ('DT', 'JJ'): 1,
  ('JJ',): 1,
  ('JJ', 'NN'): 1,
  ('NN',): 1,
  ('NN', 'CC'): 1,
  ('CC',): 1,
  ('CC', 'VBG'): 1,
  ('VBG', 'NNS'): 1,
  ('NNS',): 1,
  ('NNS', 'IN'): 1,
  ('IN', 'JJ'): 1,
  ('NN', ','): 1,
  (',',): 1,
  (',', 'NN'): 1,
  ('CC', 'JJ'): 1,
  ('NN', 'NN'): 1,
  ('NN', 'VBZ'): 1,
  ('VBZ',): 1,
  ('VBZ', 'VBN'): 1,
  ('VBN',): 1,
  ('VBN', 'PRP$'): 1,
  ('PRP$',): 1,
  ('PRP$', 'NN'): 1,
  ('NN', 'IN'): 1,
  ('DT', 'NN'): 1,


In [30]:
from sklearn.linear_model import LogisticRegression

cross_val_score(LogisticRegression(solver="liblinear"), X, y)


array([0.7685, 0.7705, 0.7695, 0.787 , 0.7755])

Now we will do lexicon-based sentiment analysis, but weight the semantic orientation (SO) value of each sentence by the probability that the sentence is subjective.
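
In other words, a review's score becomes the sum over its sentences of P(subjective | sentence) times that sentence's lexicon score. A minimal sketch on toy data, assuming the subjectivity probabilities are already known: the word lists, sentences, and probabilities below are invented, and `weighted_review_score` is a hypothetical helper.

```python
def weighted_review_score(sentences, subj_probs, positive, negative):
    """Sum each sentence's lexicon score, scaled by its subjectivity probability."""
    total = 0.0
    for words, p_subj in zip(sentences, subj_probs):
        sent_score = 0
        for word in words:
            if word in positive:
                sent_score += 1
            elif word in negative:
                sent_score -= 1
        total += sent_score * p_subj
    return total


# Invented toy lexicon and review.
toy_positive = {"good", "great"}
toy_negative = {"bad", "dull"}
toy_sentences = [
    ["the", "good", "cop", "meets", "a", "great", "detective"],  # plot summary
    ["sadly", "the", "film", "is", "dull"],                      # actual opinion
]
toy_probs = [0.1, 0.9]  # assumed P(subjective) for each sentence

unweighted = weighted_review_score(toy_sentences, [1.0, 1.0], toy_positive, toy_negative)
weighted = weighted_review_score(toy_sentences, toy_probs, toy_positive, toy_negative)
print(unweighted, weighted)
```

In this toy case the unweighted score is positive because of sentiment words in the plot summary, while the weighted score is negative because the opinion sentence dominates. Whether the weighting helps on real reviews is an empirical question.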

In [31]:
clf = LogisticRegression(solver="liblinear")
clf.fit(X,y)

LogisticRegression(solver='liblinear')

In [32]:
from nltk.corpus import opinion_lexicon

positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

In [33]:
from nltk.corpus import movie_reviews

total = 0
correct = 0

for fileid in movie_reviews.fileids():
    total_score = 0
    for para in movie_reviews.paras(fileid):
        para_score = 0
        for sent in para:
            for word in sent:
                word = word.lower()
                if word in positive_words:
                    para_score += 1
                elif word in negative_words:
                    para_score -= 1
        total_score += para_score
    total += 1
    if total_score > 0 and "pos" in movie_reviews.categories(fileid):
        correct += 1
    elif total_score <= 0 and "neg" in movie_reviews.categories(fileid):
        correct += 1
        
print(correct/total)
        

0.7


In [41]:
total = 0
correct = 0
count = 0

for fileid in movie_reviews.fileids():
    total_score = 0
    for sent in movie_reviews.sents(fileid):
        count += 1
        sent_score = 0
        lowered = []
        for word in sent:
            word = word.lower()
            lowered.append(word)
            if word in positive_words:
                sent_score += 1
            elif word in negative_words:
                sent_score -= 1
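        X_MR = vectorizer.transform([get_feature_dict(lowered)])
        # clf.classes_ is sorted as ['obj', 'subj'], so the last column is P(subjective)
        prediction = clf.predict_proba(X_MR)[0][-1]
        if count < 100:  # show the first 99 sentences with their subjectivity probability
            print(sent)
            print(prediction)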
        total_score += sent_score*prediction
    total += 1
    if total_score > 0 and "pos" in movie_reviews.categories(fileid):
        correct += 1
    elif total_score <= 0 and "neg" in movie_reviews.categories(fileid):
        correct += 1
        
print(correct/total)

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.']
0.07544926861105555
['they', 'get', 'into', 'an', 'accident', '.']
0.2245764008004889
['one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.']
0.09091341598570471
['what', "'", 's', 'the', 'deal', '?']
0.7527499642933279
['watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.']
0.42223013246202795
['.']
0.2166650659716331
['.']
0.2166650659716331
['critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.']
0.7528129948469408
['which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'bre

['even', 'the', 'magic', 'kingdom', 'at', 'its', 'most', 'mediocre', '--', 'that', "'", 'd', 'be', '"', 'pocahontas', '"', 'for', 'those', 'of', 'you', 'keeping', 'score', '--', 'isn', "'", 't', 'nearly', 'as', 'dull', 'as', 'this', '.']
0.9801957753746948
['the', 'story', 'revolves', 'around', 'the', 'adventures', 'of', 'free', '-', 'spirited', 'kayley', '(', 'voiced', 'by', 'jessalyn', 'gilsig', ')', ',', 'the', 'early', '-', 'teen', 'daughter', 'of', 'a', 'belated', 'knight', 'from', 'king', 'arthur', "'", 's', 'round', 'table', '.']
0.05089959291231018
['kayley', "'", 's', 'only', 'dream', 'is', 'to', 'follow', 'in', 'her', 'father', "'", 's', 'footsteps', ',', 'and', 'she', 'gets', 'her', 'chance', 'when', 'evil', 'warlord', 'ruber', '(', 'gary', 'oldman', ')', ',', 'an', 'ex', '-', 'round', 'table', 'member', '-', 'gone', '-', 'bad', ',', 'steals', 'arthur', "'", 's', 'magical', 'sword', 'excalibur', 'and', 'accidentally', 'loses', 'it', 'in', 'a', 'dangerous', ',', 'booby', '-',