# NLP on IMDB Dataset

Focus:
- Data Loading
- Pre-processing using NLTK.
- BoW using SkLearn + Sparse Matrices
- TF-IDF using SkLearn
- Word2VEc using Gensim
- BERT using bert-serving library + APIs.


### Data loading

In [2]:
# The following commands can be used if we are working in Google Colab

# Source: http://ai.stanford.edu/~amaas/data/sentiment/
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    
# '!' means the command will run in linux terminal. the above command will download .gz file to the local cmputer where the colab is running

# uncompress and see the data
! ls
! pwd
! unzip aclImdb_v1.tar.gz

#Google: "Unzip tar gz file colab" ----> https://stackoverflow.com/questions/49685924/extract-google-drive-zip-from-google-colab-notebook
import shutil
shutil.unpack_archive("/content/aclImdb_v1.tar.gz", "/content/")
! ls /content/

! ls -l /content/aclImdb

! head -100 /content/aclImdb/imdb.vocab

! ls /content/aclImdb/train

! ls /content/aclImdb/train/pos

! cat /content/aclImdb/train/pos/6250_10.txt

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [20]:
# load data from k-reviews from imdb/train/pos reviews

k=100
raw_data = [] # empty list

# Question: Why list of strings?
# Ans : After text processing we need to process the data using the fucntions in NLTK,Scikit,Spacey like libraries.
# Eventhough these library uses different datatypes, widely used one is list of strings.
# Moreover list of strings are mutable, indexable, allow duplicates, ordered, easy to manipulate.

index_file = dict(); # store mapping from index to filename

import os
directory = r'./aclImdb/train/pos/'

i=0

for f in os.listdir(directory): # for each file in the subfolder
      
  if f.endswith(".txt"): # check for text file
    fname = directory + "/" + f
    
    tmp = open(fname, "r") # read file 

    raw_data.append(tmp.read())
    index_file[i] = fname
        
    i += 1

    if i==k: # read k files
      break

#print(i)
print(index_file)
print("*"*100)
print(raw_data[99])
print("*"*100)
print (len(raw_data))


{0: './aclImdb/train/pos//0_9.txt', 1: './aclImdb/train/pos//10000_8.txt', 2: './aclImdb/train/pos//10001_10.txt', 3: './aclImdb/train/pos//10002_7.txt', 4: './aclImdb/train/pos//10003_8.txt', 5: './aclImdb/train/pos//10004_8.txt', 6: './aclImdb/train/pos//10005_7.txt', 7: './aclImdb/train/pos//10006_7.txt', 8: './aclImdb/train/pos//10007_7.txt', 9: './aclImdb/train/pos//10008_7.txt', 10: './aclImdb/train/pos//10009_9.txt', 11: './aclImdb/train/pos//1000_8.txt', 12: './aclImdb/train/pos//10010_7.txt', 13: './aclImdb/train/pos//10011_9.txt', 14: './aclImdb/train/pos//10012_8.txt', 15: './aclImdb/train/pos//10013_7.txt', 16: './aclImdb/train/pos//10014_8.txt', 17: './aclImdb/train/pos//10015_8.txt', 18: './aclImdb/train/pos//10016_8.txt', 19: './aclImdb/train/pos//10017_9.txt', 20: './aclImdb/train/pos//10018_8.txt', 21: './aclImdb/train/pos//10019_8.txt', 22: './aclImdb/train/pos//1001_8.txt', 23: './aclImdb/train/pos//10020_8.txt', 24: './aclImdb/train/pos//10021_8.txt', 25: './aclImdb

## Data preprocessing

In [29]:
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [30]:
# Lemmatize

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\saras\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [31]:
# Tokenize, Stop words removal and lemmatize

import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."
#stop_words = set(stopwords.words('english')) # NLTK 

lemmatizer = WordNetLemmatizer()

# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

def stopWord_Lemmatize(sent, stop_words, lemmatizer ):
  word_tokens = word_tokenize(sent) # tokenize
  return_sent = "";
  
  for w in word_tokens:
      if w not in stop_words:
          return_sent += " " + lemmatizer.lemmatize(w) # lemmatize w beofre adding it to the return_sent
  return return_sent

print(raw_data[0])
print("*"*100+"\n")
print(stopWord_Lemmatize(raw_data[0], stop_words, lemmatizer) ) # call the function


Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
****************************************************************************************************

 Bromwell High cartoon comedy . It ran time program school life , `` Teachers '' . My 35 ye

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saras\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saras\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:
from tqdm import tqdm
from bs4 import BeautifulSoup

preprocessed_data = []
# tqdm is for printing the status bar
for sentance in tqdm(raw_data):
    sentance = re.sub(r"http\S+", "", sentance) # remove URLs
    sentance = BeautifulSoup(sentance, 'lxml').get_text()  # remove all tags
    sentance = decontracted(sentance) # expanding english language contractions
    sentance = re.sub("\S*\d\S*", "", sentance).strip()  # remove words with numbers
    sentance = re.sub('[^A-Za-z]+', ' ', sentance) # Remove all special characters, punctuation and spaces from string
    sentance = re.sub('\s+',' ', sentance) # replace multiple spaces with single space
    sentance = sentance.lower()
    sentance = ' '.join(e.lower() for e in sentance.split())
    #preprocessed_reviews.append(sentance.strip())
    
    processed_data.append(stopWord_Lemmatize(sentance, stop_words, lemmatizer)) #pre-process the k sentences and store the result
    
print(raw_data[0])
print("*"*100+"\n")
print(processed_data[0])

100%|███████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 525.55it/s]

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
****************************************************************************************************

 Bromwell High cartoon comedy . It ran time program school life , `` Teachers '' . My 35 ye




### Bag-of-words (BoW)

In [37]:
# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_BoW = count_vect.fit_transform(processed_data)

print(count_vect.get_feature_names()) 
print("the type of count vectorizer ",type(X_BoW))
print("the shape of out text BOW vectorizer ",X_BoW.get_shape())
print("the number of unique words ", X_BoW.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (400, 3853)
the number of unique words  3853


In [None]:
print(type(X_BoW))
#csr - compressed sparse row. BoW will store only non-zero data as a key value pair. Check sticky notes for imp interview que

<class 'scipy.sparse.csr.csr_matrix'>


In [None]:
# Here we are checking how much memory we have saved
# Sparse representations vs dense Matrices

#Refer: https://docs.scipy.org/doc/scipy/reference/sparse.html


print(X_BoW.data.nbytes) 
# Refer: https://stackoverflow.com/questions/43681279/why-is-scipy-sparse-matrix-memory-usage-indifferent-of-the-number-of-elements-in
# the above code checks how much space its taking for storing just data (not indeces)
# Convert X_BoW to dense. dense will store zero elements as well.
# Ref: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.todense.html#scipy.sparse.csr_matrix.todense
X_BoW_Dense = X_BoW.todense();
print(X_BoW_Dense.data.nbytes)

print(X_BoW.shape)
print(X_BoW_Dense.shape)
# Always use sparse-matrix in these situations. And remember here we have used 1-gram. If 2-grams memory will shoot up.

81912
3499200
(100, 4374)
(100, 4374)


In [None]:
print(X_BoW[0,:])

  (0, 4264)	2
  (0, 2271)	3
  (0, 2539)	4
  (0, 1341)	1
  (0, 4357)	1
  (0, 2165)	1
  (0, 3093)	1
  (0, 1698)	2
  (0, 1670)	3
  (0, 2717)	1
  (0, 4047)	1
  (0, 4368)	1
  (0, 3937)	1
  (0, 716)	3
  (0, 3942)	1
  (0, 3875)	1
  (0, 3916)	1
  (0, 1025)	1
  (0, 1607)	1
  (0, 2613)	1
  (0, 1923)	1
  (0, 3039)	1
  (0, 2978)	1
  (0, 4257)	1
  (0, 1630)	2
  :	:
  (0, 3910)	1
  (0, 2054)	2
  (0, 4296)	2
  (0, 3536)	1
  (0, 3152)	1
  (0, 3537)	1
  (0, 1497)	1
  (0, 1378)	1
  (0, 1936)	1
  (0, 2720)	1
  (0, 1929)	1
  (0, 517)	2
  (0, 1187)	1
  (0, 2205)	1
  (0, 426)	1
  (0, 135)	1
  (0, 1964)	1
  (0, 1090)	1
  (0, 1002)	1
  (0, 3254)	1
  (0, 3389)	1
  (0, 3170)	1
  (0, 818)	1
  (0, 68)	1
  (0, 1034)	1


In [None]:
print(X_BoW[0,818])
print(X_BoW[0,817])

1
0


In [None]:
print(X_BoW_Dense[0,:])
print(X_BoW_Dense[0,818])
print(X_BoW_Dense[0,817])

[[0 0 0 ... 0 0 0]]
1
0


### TD-IDF

In [None]:
# Google: "TF-IDF SkLearn" ---> https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1,2))
X_tfidf = vectorizer.fit_transform(raw_data)

print(vectorizer.get_feature_names())



In [None]:
print(type(X_tfidf))
print(X_tfidf.shape)
print(X_tfidf.data.nbytes) # Refer: https://stackoverflow.com/questions/43681279/why-is-scipy-sparse-matrix-memory-usage-indifferent-of-the-number-of-elements-in

X_tfidf_dense = X_tfidf.todense()
print(type(X_tfidf_dense))
print(X_tfidf_dense.shape)
print(X_tfidf_dense.data.nbytes)

<class 'scipy.sparse.csr.csr_matrix'>
(100, 14609)
154288
<class 'numpy.matrix'>
(100, 14609)
11687200


### Word2Vec

In [None]:
# Here the code is using Gensim library. Spacy is also great library for w2v which gives very good documentation.
# Refer: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

# Download pretrained  vectors
# https://www.quora.com/How-can-I-download-the-Google-news-word2vec-pretrained-model-from-a-Ubuntu-terminal
! wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
! gunzip GoogleNews-vectors-negative300.bin.gz

--2020-04-19 10:57:53--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.162.189
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.162.189|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-04-19 10:59:58 (12.6 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [None]:
! ls

GoogleNews-vectors-negative300.bin  sample_data


In [None]:
# takes time as the model is large to laod into RAM
# RAM consumption also shoots up. Kernel could restart and give you a larger RAM isnatnce like 25GB RAM.
# google colab itself provide good RAm. So it is advicable to run in colab.

from gensim.models import KeyedVectors
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [None]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

  if np.issubdtype(vec.dtype, np.int):


[('queen', 0.7118192911148071)]


In [None]:
v = model['queen'];
print(type(v))
print(v.shape)

<class 'numpy.ndarray'>
(300,)


### BERT

- BERT vs Word2Vec
- BERT: Contextual encodings
- BERT - Bert as a service works as follows. In our computer BERT runs a process called server and another process called client. Server might run on a different computer also. ie BERT as a service is built on a technology called called ZeroMQ which is similar to the concept of APIs. ZeroMQ is a popular type of communication framework to talk between computers. BAAS has 2 components, Bert as a server and as a client.Both can be on same computer or different. Client can sent a sentence to server, where it takes the sentence , run some code and returns the vector (API concept). The main difference between APIs and ZeroMQ is latency feature,ie ZeroMQ has very low latency. ZeroMQ is used by Financial companies for low latency requirement wereas APIs by web based companies. Both API and ZeroMQ is designed over HTTP protocol which is based on TCP. ZeroMQ has its own disadvantages like if it fails it wont tell unlike APIs.

In [None]:
# W2V consumes more RAM. In BERT we need to run a Deep learning Model on a GPU and returns a vector.
# BERT works in such a way that, say a sentence w1w2w3.. Bert can create individual vector for each word or a vector for whole sentence.
# There are many ways to obatin BERT encodings. We are using one of the most simple+popular approaches

# we can do it in Keras as well.

# Change to GPU feature in colab notebook.GPU based instance: Runtime---> Change Runtime --> GPU

# Change to TensorFlow Version 1.x. Current Tensorflow ver is 2. Biggest mess up done by Google is code works in TFv1 willn't run in V2.
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [None]:
#https://github.com/hanxiao/bert-as-service
# The steps mentioned in the above documentation works on typical compter doesnt work on colab.
# https://github.com/hanxiao/bert-as-service/issues/380
# tHe above mentioned issue has resolved in the above discussion.
# Install BERT-SERVING client and Server
!pip install bert-serving-client
!pip install -U bert-serving-server[http]

Requirement already up-to-date: bert-serving-server[http] in /usr/local/lib/python3.6/dist-packages (1.10.0)


In [None]:
# dowload pretrained models
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip

--2020-04-19 10:35:00--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.133.128, 2a00:1450:400c:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.133.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘uncased_L-12_H-768_A-12.zip.2’


2020-04-19 10:35:02 (187 MB/s) - ‘uncased_L-12_H-768_A-12.zip.2’ saved [407727028/407727028]

Archive:  uncased_L-12_H-768_A-12.zip
replace uncased_L-12_H-768_A-12/bert_model.ckpt.meta? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-12_H-768_A-12/vocab.txt  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: uncased_L-12_H-768_A-12/bert_config.json  


In [None]:
# Start BERT_SERVER on the current computer
!nohup bert-serving-start -model_dir=./uncased_L-12_H-768_A-12 > out.file 2>&1 &
#nohup-to execute a command such that it ignores the HUP (hangup) signal and therefore does not stop when the user logs out.
# above command -whatever the output it writes on out.file and runs in the background.

In [None]:
# Use bert-client from python
# Takes time to execute as it uses GPU
from bert_serving.client import BertClient
bc = BertClient()
print (bc.encode(['First do it', 'then do it right', 'then do it better'])) # list of setences

[[ 0.13186494  0.32404163 -0.82704437 ... -0.37119538 -0.3925019
  -0.317218  ]
 [ 0.24873495 -0.12334374 -0.38933873 ... -0.4475625  -0.55913556
  -0.11345225]
 [ 0.28627333 -0.18580079 -0.30906785 ... -0.29593712 -0.3931053
   0.07640254]]


In [None]:
v = bc.encode([raw_data[0]])

here is what you can do:
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)


In [None]:
print(type(v))
print(v.shape)

<class 'numpy.ndarray'>
(1, 768)
