# Applying ML to sentiment analysis

## Preparing the IMDb movie reviews dataset for sentiment analysis

Dataset source: https://ai.stanford.edu/~amaas/data/sentiment/

In [2]:
import pandas as pd
import numpy as np 
df = pd.read_csv('movie_data.csv')
df

,review,sentiment
0,This film is an attempt to present Jared Diamo...,0
1,Whereas the movie was beautifully shot and rea...,0
2,"The comparison is perhaps unfair, but inevitab...",0
3,It's hard and I didn't expect it... But it's r...,0
4,"Elvira Mistress of the Dark is just that, a ca...",1
...,...,...
49995,I own Ralph Bakshis forgotten masterpiece Fire...,1
49996,Jason Bourne sits in a dusty room in with bloo...,1
49997,"Two college buddies - one an uptight nerd, the...",1
49998,My school's drama club will be putting this sh...,1


## Bag of words model

In [3]:
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [4]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
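
To make the link between the vocabulary and the count matrix concrete, here is a minimal pure-Python sketch of the same computation. It assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, columns in alphabetical order of the terms). The helper names `tokenize` and `bag_of_words` are made up for this illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(doc):
    # Approximates CountVectorizer's default: lowercase, then keep
    # runs of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())

def bag_of_words(docs):
    tokenized = [tokenize(d) for d in docs]
    # Alphabetically sorted vocabulary, each term mapped to a column index
    terms = sorted({t for tokens in tokenized for t in tokens})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary term, in column order
        rows.append([counts.get(term, 0) for term in terms])
    return vocabulary, rows

vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is the raw term-frequency vector of one document; column `j` counts how often the term with index `j` occurs in it.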


## Assessing word relevancy via term frequency-inverse document frequency

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,
                          norm='l2',
                          smooth_idf=True)

np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
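
The values above can be reproduced by hand. The sketch below assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The counts are copied from the bag-of-words output, and the helper name `tfidf_row` is hypothetical.

```python
import math

# Raw term counts copied from the bag-of-words output above
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed to match smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    raw = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

tfidf_manual = [tfidf_row(row) for row in counts]
for row in tfidf_manual:
    print([round(v, 2) for v in row])
```

Terms that occur in all three documents ("is", "the") get the smallest idf of 1, which is why "is" ends up with a lower weight than "and" in the third document even though it occurs more often there.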


## Cleaning text data

In [6]:
df.loc[10, 'review']

"Saw it at the Philadelphia Gay and Lesbian Film Fest.<br /><br />What can I say? Against my better judgment, I liked it, but it seemed to me that that acting was a little...weak (mostly I noticed this from the family of the teen boy). I mean, the script wasn't stellar to begin with, but the actors didn't make me believe the relationships.<br /><br />The plot is also predictable.<br /><br />Nonethelss, I liked it. The characters are likable, and the plot is not challenging or upsetting. It's sweet, the characters care about each other, and I don't count it as fifty minutes ill-spent. <br /><br />But I don't recommend it."

In [7]:
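import re
def preprocessor(text):
    # Strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces, then append
    # the emoticons with the "nose" character (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text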

In [8]:
preprocessor(df.loc[10, 'review'])

'saw it at the philadelphia gay and lesbian film fest what can i say against my better judgment i liked it but it seemed to me that that acting was a little weak mostly i noticed this from the family of the teen boy i mean the script wasn t stellar to begin with but the actors didn t make me believe the relationships the plot is also predictable nonethelss i liked it the characters are likable and the plot is not challenging or upsetting it s sweet the characters care about each other and i don t count it as fifty minutes ill spent but i don t recommend it '

In [9]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
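
The emoticon pattern is easier to read when tried in isolation. The following sketch applies the same regular expression as `preprocessor` to a made-up sentence; the name `EMOTICON_RE` and the sample text are assumptions for illustration only.

```python
import re

# Same pattern as in preprocessor(): eyes (: ; =), optional nose (-),
# then a mouth character ( ) ( D P )
EMOTICON_RE = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

sample = 'Loved it :-) hated the ending ;( overall =D :P'
found = EMOTICON_RE.findall(sample)
print(found)

# Dropping the nose makes ":-)" and ":)" count as the same token
normalized = [e.replace('-', '') for e in found]
print(normalized)
```

Because the emoticons are collected before the `[\W]+` substitution runs, they survive the punctuation removal and are appended at the end of the cleaned text.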

In [10]:
df['review'] = df['review'].apply(preprocessor)

## Processing documents into tokens

In [11]:
def tokenizer(text):
    return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [12]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [13]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aksha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['runner', 'like', 'run', 'run', 'lot']

## Training a logistic regression model for document classification

In [14]:
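# .loc slices by label and includes the end label, so stop at 24999
# to keep row 25000 out of the training set (it belongs to the test set)
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values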

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
small_param_grid = [
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [None],
        'vect__tokenizer': [tokenizer, tokenizer_porter],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    },
    {
        'vect__ngram_range': [(1, 1)],
        # NOTE: stop is a set; recent scikit-learn releases accept only
        # 'english', a list, or None for stop_words, which likely explains
        # the 10 failed fits reported below. Passing list(stop) instead
        # should let these candidates be evaluated.
        'vect__stop_words': [stop, None],
        'vect__tokenizer': [tokenizer],
        'vect__use_idf': [False],
        'vect__norm': [None],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    },
]
lr_tfidf = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(solver='liblinear'))
])
gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy', cv=5,
                           verbose=2, n_jobs=-3)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


10 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Aksha\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Aksha\anaconda3\Lib\site-packages\sklearn\base.py", line 1363, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Aksha\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 653, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\A

GridSearchCV
parameter,value
estimator,Pipeline(step...liblinear'))])
param_grid,"[{'clf__C': [1.0, 10.0], 'clf__penalty': ['l2'], 'vect__ngram_range': [(1, ...)], 'vect__stop_words': [None], ...}, {'clf__C': [1.0, 10.0], 'clf__penalty': ['l2'], 'vect__ngram_range': [(1, ...)], 'vect__norm': [None], ...}]"
scoring,'accuracy'
n_jobs,-3
refit,True
cv,5
verbose,2
pre_dispatch,'2*n_jobs'
error_score,
return_train_score,False

vect: TfidfVectorizer
parameter,value
input,'content'
encoding,'utf-8'
decode_error,'strict'
strip_accents,
lowercase,False
preprocessor,
tokenizer,<function tok...00267223D09A0>
analyzer,'word'
stop_words,
token_pattern,'(?u)\\b\\w\\w+\\b'

clf: LogisticRegression
parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,10.0
fit_intercept,True
intercept_scaling,1
class_weight,
random_state,
solver,'liblinear'
max_iter,100


In [16]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x00000267223D09A0>}


In [17]:
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')


CV Accuracy: 0.895
Test Accuracy: 0.900


## Working with bigger data – online algorithms and out-of-core learning

In [18]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')
def tokenizer(text):
    # Strip HTML, collect emoticons, reduce the text to lowercase words,
    # then drop stop words
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
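
The slicing in `stream_docs` relies on every line ending in a comma, a single-digit label, and a newline. The sketch below walks through that on one made-up line (not taken from `movie_data.csv`) and, as an alternative, shows how the standard `csv` module would parse the same line.

```python
import csv
import io

# One hypothetical line in the same layout as movie_data.csv:
# a quoted review, a comma, a single-digit label, and a newline
line = '"A ""great"" film, truly.",1\n'

# Slicing as in stream_docs: drop the trailing ",1\n" for the text
# and read the label from the second-to-last character
text, label = line[:-3], int(line[-2])
print(text)   # still carries the CSV quoting
print(label)

# Alternative: let the csv module undo the quoting
parsed_text, parsed_label = next(csv.reader(io.StringIO(line)))
print(parsed_text)
print(int(parsed_label))
```

The slice keeps the surrounding quotes and doubled `""` in the text, which matches the raw output shown for `stream_docs` here. It would misread a final line that lacks a trailing newline, so the `csv` variant may be the safer choice if the file layout is not guaranteed.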

In [19]:
next(stream_docs(path='movie_data.csv'))

('"This film is an attempt to present Jared Diamonds theory of ""Guns, Germs and Steel"", explaining how Europeans have dominated much of the globe.<br /><br />The version I saw of this documentary came on 2 discs covering 3 hours. I think the information could have been presented in 20 minutes. There are completely useless scenes of: Professor Jared Diamond watching birds through binoculars, Professor Jared Diamond failing to use a bow and arrow properly, Professor Jared Diamond firing a muzzle-loader badly. Was this documentary supposed to make a hero out of ""Professor Jared diamond?"". This part of the documentary was so bad, it could have been a spoof. The worst was when Diamond is shown breaking down and weeping when touring the malaria ward in an African hospital. None of this helps me understand his theory of ""Guns, Germs and Steel."" BTW, ""Guns, Germs and Steel"" is said about 100 times. ""Can the Europeans guns, germs and steel get them out of this dire situation? Stay tune

In [20]:
doc_stream = stream_docs(path='movie_data.csv')
for _ in range(5):
    text, label = next(doc_stream)
    print(text)
    print("-----------")


"This film is an attempt to present Jared Diamonds theory of ""Guns, Germs and Steel"", explaining how Europeans have dominated much of the globe.<br /><br />The version I saw of this documentary came on 2 discs covering 3 hours. I think the information could have been presented in 20 minutes. There are completely useless scenes of: Professor Jared Diamond watching birds through binoculars, Professor Jared Diamond failing to use a bow and arrow properly, Professor Jared Diamond firing a muzzle-loader badly. Was this documentary supposed to make a hero out of ""Professor Jared diamond?"". This part of the documentary was so bad, it could have been a spoof. The worst was when Diamond is shown breaking down and weeping when touring the malaria ward in an African hospital. None of this helps me understand his theory of ""Guns, Germs and Steel."" BTW, ""Guns, Germs and Steel"" is said about 100 times. ""Can the Europeans guns, germs and steel get them out of this dire situation? Stay tuned 

In [21]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
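
Note that `get_minibatch` returns `(None, None)` as soon as the stream runs dry, so a final batch with fewer than `size` documents is discarded. If that matters, one possible variant (a sketch only; the name `get_minibatch_partial` is hypothetical) returns the leftover documents instead:

```python
from itertools import islice

def get_minibatch_partial(doc_stream, size):
    # Take up to `size` (text, label) pairs; fewer if the stream ends early
    batch = list(islice(doc_stream, size))
    if not batch:
        return None, None
    docs, y = zip(*batch)
    return list(docs), list(y)

# Toy stream with three documents and a batch size of two
toy_stream = iter([('good film', 1), ('bad film', 0), ('fine film', 1)])
first = get_minibatch_partial(toy_stream, size=2)
second = get_minibatch_partial(toy_stream, size=2)
third = get_minibatch_partial(toy_stream, size=2)
print(first)
print(second)
print(third)
```

The second call yields the single remaining document, and only the third call signals exhaustion with `(None, None)`.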

In [22]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path='movie_data.csv')
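
`HashingVectorizer` needs no fitted vocabulary because each token is mapped straight to a column index by a hash function. The sketch below illustrates that idea only: the helper `hashed_counts` is hypothetical, it uses MD5 as a stand-in for a stable hash (scikit-learn uses MurmurHash3), and it uses 16 buckets rather than `2**21`.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    # Map each token to one of n_features buckets and count occurrences.
    # MD5 is only a stand-in here; it is not what scikit-learn uses.
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        idx = int.from_bytes(digest[:4], 'little') % n_features
        vec[idx] += 1
    return vec

vec_a = hashed_counts(['great', 'movie', 'great'])
vec_b = hashed_counts(['great', 'movie', 'great'])
print(vec_a)
print(vec_a == vec_b)
```

Since the mapping is stateless, any mini-batch can be transformed independently of the others, which is what makes the approach suitable for out-of-core learning. The price is that different tokens can collide in the same bucket; a large `n_features` keeps such collisions rare.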

In [23]:
from tqdm import tqdm
import numpy as np

classes = np.array([0, 1])

for _ in tqdm(range(45)):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)


100%|██████████| 45/45 [00:36<00:00,  1.24it/s]
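
What `partial_fit` does conceptually is update the existing weights with each new mini-batch instead of refitting from scratch. The following is a rough sketch of that idea for logistic regression with plain stochastic gradient descent and a constant learning rate. It is not `SGDClassifier`'s actual implementation (which also applies regularization and a learning-rate schedule), and the class `OnlineLogReg` and the toy data are made up for illustration.

```python
import math

class OnlineLogReg:
    """Toy logistic regression trained one example at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One gradient step per example; weights persist between calls
        for x, target in zip(X, y):
            err = self.predict_proba(x) - target  # d(log loss)/dz
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err
        return self

# Toy stream: each feature vector is [count of "good", count of "bad"]
model = OnlineLogReg(n_features=2)
for _ in range(20):  # 20 mini-batches of two documents each
    model.partial_fit([[1, 0], [0, 1]], [1, 0])

print(model.w)
print(model.predict_proba([2, 0]) > 0.5)
print(model.predict_proba([0, 2]) > 0.5)
```

Only the current mini-batch has to be in memory at any time, which is why the loop above can process 45,000 reviews in chunks of 1,000.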


In [24]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.877


In [25]:
clf = clf.partial_fit(X_test, y_test)

## Latent Dirichlet allocation

In [26]:
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding='utf-8')
# the following is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english',
                         max_df=.1,
                         max_features=5000)
X = count.fit_transform(df['review'].values)

In [28]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

In [29]:
lda.components_.shape

(10, 5000)

In [30]:
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1:
worst minutes script awful stupid
Topic 2:
family mother father girl children
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house blood gore sex
Topic 7:
role performance comedy actor plays
Topic 8:
series episode episodes war tv
Topic 9:
book version original read effects
Topic 10:
action fight guy guys fun
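
The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread. Here is a small pure-Python sketch of what it selects, using made-up weights and a plain list in place of the NumPy array:

```python
topic = [0.1, 0.7, 0.3, 0.9, 0.5]  # made-up word weights for one topic
n_top_words = 2

# Indices that would sort the weights in ascending order
# (the ordering argsort produces, here as a plain list)
order = sorted(range(len(topic)), key=lambda i: topic[i])

# Walk backwards from the end and stop after n_top_words entries
top = order[:-n_top_words - 1:-1]
print(order)
print(top)
```

The result lists the indices of the `n_top_words` largest weights, largest first, which is why each topic's words are printed in descending order of importance.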


In [31]:
# Column index 5 is Topic 6 above (horror house blood gore sex)
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
Emilio Miraglia's first Giallo feature, The Night Evelyn Came Out of the Grave, was a great combination of Giallo and Gothic horror - and this second film is even better! We've got more of the Giallo side of the equation this time around, although Miraglia doesn't lose the Gothic horror stylings tha ...

Horror movie #2:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #3:
Before I talk about the ending of this film I will talk about the plot. Some dude named Gerald breaks his engagement to Kitty and runs off to Craven Castle in Scotland. After several months Kitty and her aunt venture off to Scottland. Arriving at Craven Castle Kitty finds that Gerald has aged and he ...
