<a href="https://colab.research.google.com/github/Maternowsky/Maternowsky/blob/main/Sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment analysis, a subfield of natural language processing (NLP), applied to a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb)**

In [1]:
!pip install pyprind

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3


In [2]:
import os
import sys
import tarfile
import time
import urllib.request

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

if os.path.exists(target):
    os.remove(target)

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write(f'\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB '
                     f'| {speed:.2f} MB/s | {duration:.2f} sec elapsed')
    sys.stdout.flush()


if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
    urllib.request.urlretrieve(source, target, reporthook)

100% | 80.23 MB | 2.60 MB/s | 30.85 sec elapsed

In [3]:


if not os.path.isdir('aclImdb'):
    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()



In [4]:
import pyprind
import pandas as pd
import os
import sys

basepath = 'aclImdb'
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
rows = []
for s in ('test', 'train'):
  for l in ('pos', 'neg'):
    path = os.path.join(basepath, s, l)
    for file in sorted(os.listdir(path)):
      with open(os.path.join(path, file),
                'r', encoding='utf-8') as infile:
        txt = infile.read()
      # collect one [review, label] row per file
      rows.append([txt, labels[l]])
      pbar.update()

# build the DataFrame once so every row gets a unique index (0..49999),
# which the shuffle in the next cell relies on
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

## **Shuffle the dataset with np.random and store it as a CSV file**

In [5]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index = False, encoding = 'utf-8')

In [6]:
df = pd.read_csv('movie_data.csv', encoding = 'utf-8')
df = df.rename(columns={"0": "review", "1": "sentiment"})
df.head(3)

                                              review  sentiment
0  I went and saw this movie last night after bei...          1
1  Actor turned director Bill Paxton follows up h...          1
2  As a recreational golfer with some knowledge o...          1


In [7]:
df.shape
#check that all 50,000 reviews are there

(50000, 2)

# **Bag-of-words model: construct the model with the CountVectorizer class from scikit-learn**

In [8]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, '
                 'and one and one is two'])
bag = count.fit_transform(docs)

In [9]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [10]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


## **In the output above, index 0 holds the count of 'and' and index 1 holds the count of 'is'. These values are the raw term frequencies.**

## **An n-gram is a contiguous sequence of n words that is counted as a single item. For example, in the sentence "hi how are you", the 1-grams are ["hi", "how", "are", "you"] and the 2-grams are ["hi how", "how are", "are you"]**
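
The definition above can be sketched in a few lines. The helper `ngrams` is a hypothetical name used only for illustration and is not part of scikit-learn; `CountVectorizer` produces n-grams itself through its `ngram_range` parameter.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences in a list of tokens."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'hi how are you'.split()
one_grams = ngrams(tokens, 1)
two_grams = ngrams(tokens, 2)
print(one_grams)  # ['hi', 'how', 'are', 'you']
print(two_grams)  # ['hi how', 'how are', 'are you']
```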

### **Term frequency-inverse document frequency (tf-idf): downweights words that occur frequently across the documents**

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf = True,
                         norm = 'l2',
                         smooth_idf = True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
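
The matrix above can be reproduced by hand, which shows where the downweighting comes from. The sketch below assumes the settings used in the cell (`smooth_idf=True`, `norm='l2'`), for which scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw count, and then scales each row to unit length. The variable names are illustrative only.

```python
import numpy as np

# Raw term counts copied from the bag-of-words output (rows = documents)
counts = np.array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 1, 1, 0, 1],
                   [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

raw_tfidf = counts * idf                       # term frequency times idf
tfidf_manual = raw_tfidf / np.linalg.norm(raw_tfidf, axis=1, keepdims=True)

np.set_printoptions(precision=2)
print(tfidf_manual)
```

The word 'is' appears in all three documents, so its idf is ln(4/4) + 1 = 1 and it receives no boost, whereas 'and' appears in only one document and gets idf = ln(4/2) + 1, roughly 1.69. That is why 'is' ends up with a smaller weight than 'and' in the third row even though its raw count is higher.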


## **Display the last 50 characters of the first review in the shuffled dataset**

In [12]:
df.loc[0, 'review'][-50:]

'and I suggest that you go see it before you judge.'

In [13]:
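import re
def preprocessor(text):
  # remove HTML markup
  text = re.sub(r'<[^>]*>', '', text)
  # collect emoticons such as :) ;-( =D before punctuation is stripped
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
  # lowercase, replace non-word characters with spaces, then append the
  # emoticons with the "nose" character (-) removed
  text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
  return text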

In [14]:

preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [15]:
preprocessor(df.loc[0, 'review'][-50:])

'and i suggest that you go see it before you judge '

### **Apply the preprocessor function to all movie reviews in the DataFrame**

In [16]:
df['review']= df['review'].apply(preprocessor)

## **Processing documents into tokens: splitting the text into individual words**

In [17]:
def tokenizer(text):
  return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

### **Word stemming: another useful tokenization technique that reduces words to their root form so that related words map to the same stem**

In [18]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

## **Stop-word removal**

In [19]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [20]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]


['runner', 'like', 'run', 'run', 'lot']

# **Training a logistic regression model**

In [21]:
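# .loc slicing is label-based and includes both endpoints, so the training
# slice stops at 24999 to keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values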

## **GridSearchCV to find optimal parameters using 5-fold stratified cross-validation**

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase = False,
                        preprocessor = None)
small_param_grid = [{'vect__ngram_range': [(1,1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1,1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf': [False],
                     'vect__norm': [None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]}]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring = 'accuracy',
                           cv = 5,
                           verbose = 2,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
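
The "8 candidates, 40 fits" in the log line above follows directly from the parameter grid: each dictionary contributes the product of its list lengths, and every candidate is fitted once per fold. The sketch below recounts this with `itertools.product`; the strings `'tokenizer'`, `'tokenizer_porter'`, and `'stop'` are placeholders standing in for the actual functions and the stop-word list.

```python
from itertools import product

grid_1 = {'vect__ngram_range': [(1, 1)],
          'vect__stop_words': [None],
          'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
          'clf__penalty': ['l2'],
          'clf__C': [1.0, 10.0]}
grid_2 = {'vect__ngram_range': [(1, 1)],
          'vect__stop_words': ['stop', None],
          'vect__tokenizer': ['tokenizer'],
          'vect__use_idf': [False],
          'vect__norm': [None],
          'clf__penalty': ['l2'],
          'clf__C': [1.0, 10.0]}

# every combination of values within one dictionary is one candidate
n_candidates = sum(len(list(product(*g.values()))) for g in (grid_1, grid_2))
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 8 40
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set after the search.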




### **Print best parameter set**

In [23]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')

Best parameter set: {'clf__C': 1.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f59e5c7a7a0>}


In [24]:
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

CV Accuracy: 0.873
Test Accuracy: 0.881


# **Out-of-core learning: working with large datasets by fitting the classifier incrementally on smaller batches of the dataset**

In [72]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [73]:
next(stream_docs(path = 'movie_data.csv'))

('"I went and saw this movie last night after being coaxed to by a few friends of mine. I\'ll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."',
 1)
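
The slicing in `stream_docs` relies on every line of `movie_data.csv` ending with a comma, a one-digit label, and a newline character. A small sketch with a made-up line (not taken from the dataset) shows which characters each slice picks:

```python
# made-up CSV line: quoted review text, comma, label, newline
line = '"A made-up review, with a comma inside.",1\n'

# line[-1] is the newline, line[-2] the label digit, line[-3] the separating comma
text, label = line[:-3], int(line[-2])
print(repr(text))
print(label)
```

Because the split is done by position rather than by parsing the CSV, commas inside the review are harmless, and the surrounding double quotes stay part of the text, as in the output above.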

In [74]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

In [75]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error= 'ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)


# loss='log_loss' selects logistic regression; scikit-learn versions
# before 1.1 name this option loss='log'
clf = SGDClassifier(loss='log_loss',
                    random_state=1)
doc_stream = stream_docs(path='movie_data.csv')
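
The idea behind `HashingVectorizer` is the hashing trick: each token is mapped straight to a column index by a hash function, so no vocabulary has to be built or kept in memory. The sketch below illustrates the principle only. The helper `hashed_counts` is hypothetical, and MD5 is used as a stand-in hash; `HashingVectorizer` itself uses a different hash function (MurmurHash3) and far more columns (2**21 in the cell above).

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Count tokens in a fixed-length vector indexed by a hash of each token."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec

v1 = hashed_counts('the sun is shining'.split())
v2 = hashed_counts('the weather is sweet'.split())
print(v1)
print(v2)
```

Both vectors have the same length regardless of which words appear, and a given word always lands in the same column, which is what lets each mini-batch be transformed independently. The price is that different words can collide in one column; a large `n_features` keeps such collisions rare.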

In [76]:
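# CountVectorizer and TfidfVectorizer cannot be used for out-of-core learning:
# the first keeps the complete vocabulary in memory and the second needs all
# feature vectors to compute the inverse document frequencies.
# HashingVectorizer is data-independent, so it is used instead.

import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
  X_train, y_train = get_minibatch(doc_stream, size=1000)
  if not X_train:
    break
  X_train = vect.transform(X_train)
  clf.partial_fit(X_train, y_train, classes=classes)
  pbar.update()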




## **Evaluate the model on the last 5,000 documents**

In [77]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 1.000


### **Use the last 5,000 documents to update the model**

In [78]:
clf = clf.partial_fit(X_test, y_test)

## **Topic modeling: the broad task of assigning topics to unlabeled text documents**

## **Latent Dirichlet Allocation (LDA): a popular topic modeling technique**

In [79]:
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding='utf-8')
# the following is necessary on some computers

df = df.rename(columns={"0": "review", "1": "sentiment"})

In [81]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features = 5000)
X = count.fit_transform(df['review'].values)

In [82]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

In [83]:
lda.components_.shape

(10, 5000)

In [85]:
# print the 5 most important words for each topic
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
  print(f'Topic {(topic_idx + 1)}:')
  # argsort is ascending, so the reversed slice picks the largest weights first
  print(' '.join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))


Topic 1:
worst minutes script awful stupid
Topic 2:
family mother father girl children
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art feel
Topic 5:
police guy car murder dead
Topic 6:
horror house gore blood sex
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read effects
Topic 10:
action fight guy guys fun
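
The top words per topic are selected with `argsort` followed by a reversed slice. A tiny sketch with made-up weights (not taken from the fitted model) shows how that slice returns the indices of the largest values in descending order:

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])  # made-up topic-word weights
n_top = 3

ascending = weights.argsort()        # indices ordered from smallest to largest weight
top = ascending[:-n_top - 1:-1]      # walk backwards from the end, taking n_top indices
print(ascending)  # [0 2 4 1 3]
print(top)        # [3 1 4]
```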


### **Print the first 300 characters of the top 3 reviews in the horror topic (Topic 6)**

In [86]:
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
  print(f'\nHorror movie #{(iter_idx + 1)}:')
  print(df['review'][movie_idx][:300],'...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
"House of the Damned" (also known as "Spectre") is one of your low budget haunted house horror flicks, filled with mediocre performances and cheap effects. It is about a family that inherits an old Irish mansion, and after moving in begin to experience strange phenomenon and ghostly apparitions, inc ...

Horror movie #3:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...


### **The first 300 characters of the top 3 reviews assigned to the horror topic confirm that they are reviews of horror movies**