<a href="https://colab.research.google.com/github/TheoLiapik/Sidiras_Liapikos2/blob/master/NLP_Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://www.auth.gr/sites/default/files/banner-horizontal-282x100.png)
# Text Mining and Natural Language Processing - Assignment 2


** Text Classification using Word Embeddings**
<br>
**Potentially useful library documentation, references, and resources**:

* IPython notebooks: <https://ipython.org/ipython-doc/3/notebook/notebook.html#introduction>
* Numpy numerical array library: <https://docs.scipy.org/doc/>
* Gensim's word2vec: <https://radimrehurek.com/gensim/models/word2vec.html>
* Keras Deep-Learning library: <https://keras.io/layers/embeddings/>
* Bokeh interactive plots: <http://bokeh.pydata.org/en/latest/> (we provide plotting code here, but click the thumbnails for more examples to copy-paste)
* scikit-learn ML library (aka `sklearn`): <http://scikit-learn.org/stable/documentation.html>
* nltk NLP toolkit: <http://www.nltk.org/>
* tutorial for processing xml in python using `lxml`: <http://lxml.de/tutorial.html> (we did this for you below, but in case you need it in the future)



In [0]:
import bokeh
import gensim
import numpy as np
import re
import urllib.request
import zipfile

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#Part 1

## 1.1. Train a Word2Vec model on the WikiText dataset

### 1.1.1 Download Dataset

One could skip the next time consuming stages and upload the already pre-processed clean data directly from my personal Google Drive at section 1.1.3.3

In [0]:
# Import necessary Libraries
import urllib.request
import zipfile

# Download the dataset ~190MB
urllib.request.urlretrieve("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip", filename="wikitext-103-v1.zip")

('wikitext-103-v1.zip', <http.client.HTTPMessage at 0x7efe78788470>)

In [0]:
# Extract only the data of interest
# From the .zip file open and read only the training tokens
with zipfile.ZipFile('wikitext-103-v1.zip', 'r') as z:
  doc = z.open('wikitext-103/wiki.train.tokens', 'r').read()

# The first 500 bytes of data
print(doc[:500])

b' \n = Valkyria Chronicles III = \n \n Senj\xc5\x8d no Valkyria 3 : <unk> Chronicles ( Japanese : \xe6\x88\xa6\xe5\xa0\xb4\xe3\x81\xae\xe3\x83\xb4\xe3\x82\xa1\xe3\x83\xab\xe3\x82\xad\xe3\x83\xa5\xe3\x83\xaa\xe3\x82\xa23 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors'


### 1.1.2 Data Pre-processing

In [0]:
# Convert bytes to string and then split to paragraphs
doc_str = doc.decode("utf-8")
doc_para  = doc_str.split('\n')

# the first 5 paragraphs
print(doc_para[:5])

[' ', ' = Valkyria Chronicles III = ', ' ', ' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . ', " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for s

#### 1.1.2.1 Basic pre-processing procedure
For each paragraph of the data:
- Remove multiple space characters
- Remove empty tokens
- Lower the characters
- Remove non-Alpharithmetic characters
- Remove arithmetic words
- Remove created multiple space characters
- Tokenize (split on space characters)
- Remove stop-words
- Remove words with less than 3 characters
- Remove empty paragraphs
- Save the remaining tokens to a list


In [0]:
doc_para_noEmpties = []
for para in doc_para:
    para = re.sub(r'\s+', ' ',para)
    if para != ' ':
        para = para.lower()
        para = re.sub(r'[^a-z0-9]+', ' ',para)
        para = re.sub(r' [0-9]+ ', ' ',para)
        para = re.sub(r'\s+', ' ',para)
        para = para.split(' ')
        para = [word for word in para if word not in stopwords.words('english')]
        para = [word for word in para if len(word)>2]
        if len(para) == 0:
          continue
        doc_para_noEmpties.append(para)


In [0]:
# The first 6 paragraphs of clean data
print(doc_para_noEmpties[:6])

[['valkyria', 'chronicles', 'iii'], ['senj', 'valkyria', 'unk', 'chronicles', 'japanese', 'lit', 'valkyria', 'battlefield', 'commonly', 'referred', 'valkyria', 'chronicles', 'iii', 'outside', 'japan', 'tactical', 'role', 'playing', 'video', 'game', 'developed', 'sega', 'media', 'vision', 'playstation', 'portable', 'released', 'january', 'japan', 'third', 'game', 'valkyria', 'series', 'employing', 'fusion', 'tactical', 'real', 'time', 'gameplay', 'predecessors', 'story', 'runs', 'parallel', 'first', 'game', 'follows', 'nameless', 'penal', 'military', 'unit', 'serving', 'nation', 'gallia', 'second', 'europan', 'war', 'perform', 'secret', 'black', 'operations', 'pitted', 'imperial', 'unit', 'unk', 'raven'], ['game', 'began', 'development', 'carrying', 'large', 'portion', 'work', 'done', 'valkyria', 'chronicles', 'retained', 'standard', 'features', 'series', 'also', 'underwent', 'multiple', 'adjustments', 'making', 'game', 'forgiving', 'series', 'newcomers', 'character', 'designer', 'unk',

### 1.1.3 Store and Load Data Locally

The duration of data pre-processing procedure is extremly long (> 4 h).  After completion I saved (takes a couple of minutes) the clean data to a .csv file, to use it locally from Drive.

You can download the .csv file from this link:
[preproc_clean_data.csv](https://drive.google.com/open?id=1vgObuKTZ0iL69AUWt_uZ6Orx49Lu1KfS)

#### 1.1.3.1 Connect to personal Google Drive

In [0]:
# Connect to personal Google Drive
from google.colab import drive
drive.mount('/content/drive/',force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive/


#### 1.1.3.2 Store Clean Data to Drive

In [0]:
# Save data to local .csv file
import csv

with open("/content/drive/My Drive/NLP Assignment 2/preproc_clean_data.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerows(doc_para_noEmpties)

#### 1.1.3.3 Load Clean Data from Drive

In [0]:
# Recover data from local .csv file
import csv

doc_para_noEmpties = []
with open("/content/drive/My Drive/NLP Assignment 2/preproc_clean_data.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        doc_para_noEmpties.append(row)

In [0]:
# The first 6 paragraphs from restored data
print(doc_para_noEmpties[:6])

[['valkyria', 'chronicles', 'iii'], ['senj', 'valkyria', 'unk', 'chronicles', 'japanese', 'lit', 'valkyria', 'battlefield', 'commonly', 'referred', 'valkyria', 'chronicles', 'iii', 'outside', 'japan', 'tactical', 'role', 'playing', 'video', 'game', 'developed', 'sega', 'media', 'vision', 'playstation', 'portable', 'released', 'january', 'japan', 'third', 'game', 'valkyria', 'series', 'employing', 'fusion', 'tactical', 'real', 'time', 'gameplay', 'predecessors', 'story', 'runs', 'parallel', 'first', 'game', 'follows', 'nameless', 'penal', 'military', 'unit', 'serving', 'nation', 'gallia', 'second', 'europan', 'war', 'perform', 'secret', 'black', 'operations', 'pitted', 'imperial', 'unit', 'unk', 'raven'], ['game', 'began', 'development', 'carrying', 'large', 'portion', 'work', 'done', 'valkyria', 'chronicles', 'retained', 'standard', 'features', 'series', 'also', 'underwent', 'multiple', 'adjustments', 'making', 'game', 'forgiving', 'series', 'newcomers', 'character', 'designer', 'unk',

### 1.1.4 Training the Word2Vec model on Dataset

#### 1.1.4.1 Setting model's parameters:
- **window** (int, optional) – Maximum distance between the current and predicted word within a sentence
- **size** (int, optional) – Dimensionality of the word vectors
- **sg** ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW
- **min_count** (int, optional) – Ignores all words with total frequency lower than this
- **workers** (int, optional) – Use these many worker threads to train the model

In [0]:
from gensim.models import Word2Vec

model = Word2Vec(window=4,size=100,sg=1,min_count=10,workers = -1)
model.build_vocab(doc_para_noEmpties)  # Building the model vocabulary
model.train(doc_para_noEmpties,total_examples=model.corpus_count,epochs=model.iter)


  """


(0, 0)

In [0]:
# Model's Vocabulary
vocab = model.wv.vocab
print('Vocabulary size:')
print(len(vocab))

Vocabulary size:
111839


In [0]:
# 10 most frequent words
mfw = model.wv.index2word[:10]
print(mfw)

['unk', 'first', 'one', 'also', 'two', 'new', 'time', 'would', 'game', 'later']


### 1.1.5 Find the 5 most similar word pairs from the 10 most frequent words

In [0]:
import operator

# Use dictionary structure to store word pairs and their similarity
similar_pairs = {}

# Compare each of the 10 most frequent words against the others and find similarity
for p1 in mfw:
  # restrict_vocab parameter restricts comparison to the 10 most frequent words
    pairs = model.wv.most_similar(p1, restrict_vocab=10)
    for p2,sim in pairs:
        similar_pairs[(p1, p2)] = sim

# Sort dictionary's data by values
sorted_similar_pairs = sorted(similar_pairs.items(), key=operator.itemgetter(1), reverse=True)

# Since each pair appears twice, I keep every other element of the ordered list
sorted_similar_pairs = sorted_similar_pairs[0:10:2]

# Print pairs
for pair in sorted_similar_pairs:
  print(pair)


(('unk', 'first'), 0.19137097895145416)
(('one', 'time'), 0.18669351935386658)
(('unk', 'later'), 0.1399707943201065)
(('also', 'would'), 0.11235056817531586)
(('unk', 'new'), 0.10211379081010818)


  if np.issubdtype(vec.dtype, np.int):


## 1.2. Implement a function that retrieves two word vectors and computes their cosine distance

In [0]:
# A function taking as input a trained word2vec model and two words (strings) and
# then manually computes and returns their cosine distance
def cosVecDist(w2vModel, str1, str2):
    # Use model to find vector representation of strings
    vstr1 = w2vModel.wv[str1]
    vstr2 = w2vModel.wv[str2]
    # Cosine distance of two vectors equals to the dot product of vectors
    # divided by the product of vectrors' legths
    # The dot product of two 1-D vectors can be computed using numpy.dot() function
    dotProd = np.dot(vstr1,vstr2)
    # The length of a 1-D vector can be computed using numpy.linalg.norm() function
    # The default value of ord parameter (ord=None), returns the 2-norm of vectors
    length1 = np.linalg.norm(vstr1, ord=None)
    length2 = np.linalg.norm(vstr2, ord=None)
    # Cosine distance of the vectors
    cosDist = dotProd/(length1*length2)
    return(cosDist)

### 1.2.1 Comparing the two approaches

Compute Cosine Distance of word pairs using 2 methods:

- Custom cosVecDist() function defined above
- The in-built wv.similarity() function of trained model

In [0]:
# I use the trained model from previous step
import random
import pandas as pd

# Create 5 random pairs of words from model's vocabulary
vocab = model.wv.index2word
list = []
for i in range(5):
  list.append(random.choices(vocab, k=2))

# DataFrame to store the results
headers = ['Word 1', 'Word 2', 'Custom Function', 'Model\'s Function']
cosDist = pd.DataFrame(columns = headers)

for pair in list:
  results = []
  results.append(pair[0])
  results.append(pair[1])
  results.append(cosVecDist(model, pair[0], pair[1]))
  results.append(model.wv.similarity(pair[0], pair[1]))
  
  new_row = pd.Series(results, index = headers)
  cosDist = cosDist.append(new_row, ignore_index=1)


  if np.issubdtype(vec.dtype, np.int):


In [0]:
# Print the results
print('Computing Cosine Distance with 2 different methods:')
cosDist

Computing Cosine Distance with 2 different methods:


Unnamed: 0,Word 1,Word 2,Custom Function,Model's Function
0,flameout,fertilise,-0.171696,-0.171696
1,belmonte,mocking,0.049636,0.049636
2,leg,platelets,0.099155,0.099155
3,supple,romani,-0.040979,-0.040979
4,pensford,bandshell,-0.030766,-0.030766


### 1.2.2 Conclusions

Both methods delivers exactly the same result, so it's obvious that in-built wv.similarity() function uses cosine distance to express the similarity of two words

##1.3.     Visit the NLPL word embedding repository and download the models with the following identifiers: 

- 40. It was trained on the English CoNLL17 corpus, using Continuous Skip-gram algorithm with vector size 100, and window size 10.
- 75. It was trained on the English Oil and Gas corpus, using Continuous Bag-of-Words algorithm with vector size 400, and window size 5.
- 82. It was trained on the English Common Crawl Corpus, using GloVe algorithm with vector size 300, and window size 10.

In [0]:
import urllib.request

### 1.3.1 CoNLL17 corpus

In [0]:
# Downloading CoNLL17 corpus ~1.5GB
urllib.request.urlretrieve("http://vectors.nlpl.eu/repository/11/40.zip", filename="40.zip")


('40.zip', <http.client.HTTPMessage at 0x7eff2e30e198>)

### 1.3.2 Oil and Gas corpus

In [0]:
# Downloading Oil and Gas corpus ~0.4GB
urllib.request.urlretrieve("http://vectors.nlpl.eu/repository/11/75.zip", filename="75.zip")


('75.zip', <http.client.HTTPMessage at 0x7efe7872f128>)

### 1.3.3 Common Crawl corpus

In [0]:
# Downloading Oil and Gas corpus ~2.3GB
urllib.request.urlretrieve("http://vectors.nlpl.eu/repository/11/82.zip", filename="82.zip")


('82.zip', <http.client.HTTPMessage at 0x7efe7872fdd8>)

##1.4. Create the lists of top 20 most frequent words in WikiText, CoNLL17, Oil and Gas, and Common Crawl corpora

In [0]:
import zipfile

In [0]:
# Dictionary structure to store 20 most frequent words of each corpus
twentyFreqWords = {}

### 1.4.1 Wiki Text

Just use the wv.index2word function on the model trained at stage 1.4

In [0]:
# 20 most frequent words
wiki_mfw = model.wv.index2word[:20]
twentyFreqWords['WikiText'] = wiki_mfw
print(twentyFreqWords['WikiText'])

['unk', 'first', 'one', 'also', 'two', 'new', 'time', 'would', 'game', 'later', 'three', 'film', 'may', 'year', 'made', 'second', 'season', 'years', 'world', 'war']


For the downloaded pre-trained embeddings I follow a differnt approach. Since each model is saved as a text file in the word2vec format, its lines are as a rule sorted by frequency. Each word and its vector corresponds to a paragraph of the .txt file. Furthermore, the first paragraph of the .txt file holds informations about the length of the vocabulary and the dimensionality of the vectors. So I just have to read the first few ten thousands of bytes, accordind to their vector's dimensionality and make sure to include at least the first 22 paragraphs.

For each model I repeat the following steps:
- Extract and read part of the model.txt file stored in the downloaded .zip file
- Convert the text bytes to string and split to paragraphs 
- Tokenize each paragraph to words
- Ignore the first paragraph and keep the first word of the following 20 paragraphs to list


### 1.4.2 CoNLL17 corpus

In [0]:
# Use the model.txt file inside the .zip file
# Read only the first 20000 bytes
with zipfile.ZipFile('40.zip', 'r') as z:
  doc40 = z.open('model.txt', 'r').read(20000)


In [0]:
# Convert bytes to string and then split to paragraphs
doc_str = doc40.decode("utf-8")
doc_para  = doc_str.split('\n')

In [0]:
# Check if sample includes at least the first 22 paragraphs
len(doc_para)

23

In [0]:
para_list = []
for i in range(20):
  # Ignore the first paragraph
  para = doc_para[i+1].split(' ')[0]
  para_list.append(para)

# Save list to dictionary
twentyFreqWords['40:CoNLL15'] = para_list

In [0]:
print(twentyFreqWords['40:CoNLL15'])

['</s>', ',', 'the', '.', 'of', 'and', 'to', 'a', 'in', '-', ')', '(', ':', 'for', 'is', '"', 'on', 'i', 'that', 'with']


### 1.4.3 Oil and Gas corpus

Same procedure as above

In [0]:
with zipfile.ZipFile('75.zip', 'r') as z:
  doc75 = z.open('model.txt', 'r').read(80000)

doc_str = doc75.decode("utf-8")
doc_para  = doc_str.split('\n')

len(doc_para)

23

In [0]:
para_list = []
for i in range(20):
  para = doc_para[i+1].split(' ')[0]
  para_list.append(para)

twentyFreqWords['75:OilAndGas'] = para_list

print(twentyFreqWords['75:OilAndGas'])

['lrb', 'rrb', 'sediment', 'fault', 'datum', 'basin', 'sample', 'area', 'study', 'model', 'result', 'zone', 'water', 'rock', 'time', 'formation', 'high', 'surface', 'increase', 'change']


### 1.4.4 Common Crawl corpus

Same procedure as above

In [0]:
with zipfile.ZipFile('82.zip', 'r') as z:
  doc82 = z.open('model.txt', 'r').read(60000)
  
doc_str = doc82.decode("utf-8")
doc_para  = doc_str.split('\n')

len(doc_para)

23

In [0]:
para_list = []
for i in range(20):
  para = doc_para[i+1].split(' ')[0]
  para_list.append(para)

twentyFreqWords['82:CommonCrawl'] = para_list

print(twentyFreqWords['82:CommonCrawl'])

['the', ',', '.', 'and', 'to', 'of', 'a', 'in', 'is', 'that', 'i', 'for', 'it', 'you', 'on', "'s", 'with', '-rrb-', '-lrb-', 'as']


## 1.5. Comparison of the 4 word lists

In [0]:
import pandas as pd
print(pd.DataFrame.from_dict(twentyFreqWords))

   40:CoNLL15 75:OilAndGas 82:CommonCrawl
0        </s>          lrb            the
1           ,          rrb              ,
2         the     sediment              .
3           .        fault            and
4          of        datum             to
5         and        basin             of
6          to       sample              a
7           a         area             in
8          in        study             is
9           -        model           that
10          )       result              i
11          (         zone            for
12          :        water             it
13        for         rock            you
14         is         time             on
15          "    formation             's
16         on         high           with
17          i      surface          -rrb-
18       that     increase          -lrb-
19       with       change             as


###1.5.1 Conclusions

Comparing the 4 lists I conclude the following:

-  CoNLL15 DataSet is composed exclusively by stop words.
-  OilAndGas DataSet contains clean pre-processed data. All stop words are removed.
-  CommonCrawl DataSet is composed exclusively by stop words.
-  WikiText DataSet contains clean pre-processed data. All stop words are removed.

## 1.6. Project top 1000 words from the WikiText corpus in 2d space using t-SNE plot

I use the model trained at stage 1.4

In [0]:
# Get the 1000 most frequent words from the model
mfw1000 = model.wv.index2word[:1000]
print(mfw1000[:20])

['unk', 'first', 'one', 'also', 'two', 'new', 'time', 'would', 'game', 'later', 'three', 'film', 'may', 'year', 'made', 'second', 'season', 'years', 'world', 'war']


In [0]:
# Get the vector representation of each word in the list
mfw1000_vecs = model[mfw1000]
print(mfw1000_vecs[0])

[-2.8655671e-03  4.5412979e-03 -3.7377772e-03  8.4503921e-04
 -4.6012774e-03  4.7684074e-03 -5.4040161e-04  2.5904330e-03
  1.4480082e-03 -4.4996226e-03 -2.7799828e-03 -2.7477788e-03
  1.6728049e-03 -4.9593090e-03  2.3156838e-04  4.6505779e-03
  2.6402522e-05 -2.0656511e-03 -1.7831454e-03 -2.6781848e-03
 -2.8692437e-03 -9.9430233e-04  1.1304878e-03 -1.2564651e-03
  4.1854358e-03 -3.8195304e-03 -2.4816170e-03  4.6105557e-03
 -7.4935844e-04 -1.8244198e-03 -1.1363826e-03  4.4023115e-03
 -3.2536786e-03 -2.3927144e-04  2.2385111e-03 -7.0144032e-04
 -5.1844853e-04 -1.7208519e-03 -8.5241324e-04  2.4389627e-03
  4.2723813e-03  1.4387226e-03  3.4389631e-03 -4.2178594e-03
 -1.7981980e-03  2.6706930e-03  5.0389330e-04  3.2826872e-03
  3.0838910e-03 -1.9076788e-04  3.3433323e-03 -1.8412486e-03
  2.4158449e-03  1.1767549e-03  6.1219494e-04 -1.4081600e-03
 -8.5972471e-04 -4.6416526e-03 -4.3288963e-03  2.1201104e-03
  3.8737925e-03 -5.9332506e-04  3.8541672e-03  3.1657084e-03
  1.3826588e-03 -3.21834

  """Entry point for launching an IPython kernel.


In [0]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from sklearn.manifold import TSNE
from bokeh.io import output_notebook
output_notebook()

tsne = TSNE(n_components=2, random_state=42)
mfw1000_vecs_tsne = tsne.fit_transform(mfw1000_vecs)

p = figure(tools="pan,wheel_zoom,box_zoom,reset,save",
           toolbar_location="above",
           title="Word2Vec T-SNE for the 1000 most common words",
           plot_width=800)

source = ColumnDataSource(data=dict(x1=mfw1000_vecs_tsne[:,0],
                                    x2=mfw1000_vecs_tsne[:,1],
                                    names=mfw1000))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="7pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

#Part 2

## 2.1. Train a word embedding model on the sentence classification corpus from the UCI Machine Learning repository

### 2.1.1 Download Dataset

In [0]:
import urllib.request
import zipfile
import re
from os import listdir
from gensim.models import Word2Vec

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip", filename="SentenceCorpus.zip")

('SentenceCorpus.zip', <http.client.HTTPMessage at 0x7eff301602b0>)

### 2.1.2 Data Pre-Processing

In [0]:
# After execution each single line of the training documents is an element of corpus list 
corpus = []
of_stopwords = []
extra_words_to_remove = ['citation','number','symbol','misc','aimx','ownx','cont','base','of','','abstract','introduction']

with zipfile.ZipFile('SentenceCorpus.zip', 'r') as z:
  z.extractall()
  # Handling the training files
  file_list = sorted(listdir('SentenceCorpus/labeled_articles/'))
  for file_name in file_list:
    if file_name.endswith('1.txt'):
      file = z.read('SentenceCorpus/labeled_articles/' + file_name).decode('utf-8')
      corpus.extend(file.split('\n'))
  # The official stop words
  file = z.read('SentenceCorpus/word_lists/stopwords.txt').decode('utf-8')
  of_stopwords.extend(file.split('\n')[1:-1])
  

In [0]:
file_list

['.DS_Store',
 'arxiv_annotate10_7_1.txt',
 'arxiv_annotate10_7_2.txt',
 'arxiv_annotate10_7_3.txt',
 'arxiv_annotate1_13_1.txt',
 'arxiv_annotate1_13_2.txt',
 'arxiv_annotate1_13_3.txt',
 'arxiv_annotate2_66_1.txt',
 'arxiv_annotate2_66_2.txt',
 'arxiv_annotate2_66_3.txt',
 'arxiv_annotate3_80_1.txt',
 'arxiv_annotate3_80_2.txt',
 'arxiv_annotate3_80_3.txt',
 'arxiv_annotate4_168_1.txt',
 'arxiv_annotate4_168_2.txt',
 'arxiv_annotate4_168_3.txt',
 'arxiv_annotate5_240_1.txt',
 'arxiv_annotate5_240_2.txt',
 'arxiv_annotate5_240_3.txt',
 'arxiv_annotate6_52_1.txt',
 'arxiv_annotate6_52_2.txt',
 'arxiv_annotate6_52_3.txt',
 'arxiv_annotate7_268_1.txt',
 'arxiv_annotate7_268_2.txt',
 'arxiv_annotate7_268_3.txt',
 'arxiv_annotate8_81_1.txt',
 'arxiv_annotate8_81_2.txt',
 'arxiv_annotate8_81_3.txt',
 'arxiv_annotate9_279_1.txt',
 'arxiv_annotate9_279_2.txt',
 'arxiv_annotate9_279_3.txt',
 'jdm_annotate10_210_1.txt',
 'jdm_annotate10_210_2.txt',
 'jdm_annotate10_210_3.txt',
 'jdm_annotate1_1

In [0]:
corpus[1]

'MISC\tThe Minimum Description Length principle for online sequence estimation/prediction in a proper learning setup is studied\r'

In [0]:
# After execution each element of corpus list (line of the training documents) is
# transformed to a list of clean tokens as element in preproc_corpus list
preproc_corpus = []
for doc in corpus:
  doc = doc.lower()
  doc = re.sub(r'[^a-z0-9]+', ' ',doc)
  doc = re.sub(r'\s+', ' ',doc)
  doc = doc.split(' ')
  doc = [word for word in doc if word not in of_stopwords+extra_words_to_remove+stopwords.words('english')]
  doc = [word for word in doc if len(word)>1]
  if len(doc) > 0:
    preproc_corpus.append(doc)

In [0]:
preproc_corpus[0]

['minimum',
 'description',
 'length',
 'principle',
 'online',
 'sequence',
 'estimation',
 'prediction',
 'proper',
 'learning',
 'setup',
 'studied']

In [0]:
len(corpus), len(preproc_corpus)

(1120, 1039)

### 2.1.3 Train a basic word embedding model

In [0]:
def customW2Vmodel(window=4,size=100,sg=1,min_count=1,workers=-1,negative=5):
  model = Word2Vec(window=window,size=size,sg=sg,min_count=min_count,workers=workers,negative=negative)
  model.build_vocab(preproc_corpus)  # Building the model vocabulary
  model.train(preproc_corpus,total_examples=model.corpus_count,epochs=model.iter)
  return(model)

In [0]:
%%time

model0 = customW2Vmodel(window=4,size=100,sg=1,min_count=1,workers=-1,negative=5)
print('Vocabulary size: ',len(model0.wv.vocab))

Vocabulary size:  3887
CPU times: user 99.3 ms, sys: 1.87 ms, total: 101 ms
Wall time: 107 ms


  after removing the cwd from sys.path.


##2.2 Train other models on the same data, but with one hyperparameter different (for example, window size or vector size)

### 2.2.1 Adjusting vector dimensionality (size)

In [0]:
%%time

model1 = customW2Vmodel(size=250)
print('Vocabulary size: ',len(model1.wv.vocab))

Vocabulary size:  3887
CPU times: user 97.8 ms, sys: 1.92 ms, total: 99.7 ms
Wall time: 106 ms


  after removing the cwd from sys.path.


In [0]:
%%time

model11 = customW2Vmodel(size=500)
print('Vocabulary size: ',len(model1.wv.vocab))

Vocabulary size:  3887
CPU times: user 114 ms, sys: 2.94 ms, total: 117 ms
Wall time: 123 ms


  after removing the cwd from sys.path.


### 2.2.2 Adjusting window size

In [0]:
%%time

model2 = customW2Vmodel(window=8)
print('Vocabulary size: ',len(model2.wv.vocab))

Vocabulary size:  3887
CPU times: user 91.5 ms, sys: 5.53 ms, total: 97 ms
Wall time: 100 ms


  after removing the cwd from sys.path.


### 2.2.3 Adjusting minimum token frequency (min_count)

In [0]:
%%time

model3 = customW2Vmodel(min_count=5)
print('Vocabulary size: ',len(model3.wv.vocab))

Vocabulary size:  870
CPU times: user 39.4 ms, sys: 3.04 ms, total: 42.4 ms
Wall time: 46.9 ms


  after removing the cwd from sys.path.


### 2.2.4 Adjusting negative sampling

In [0]:
%%time

model4 = customW2Vmodel(negative=10)
print('Vocabulary size: ',len(model4.wv.vocab))

Vocabulary size:  3887
CPU times: user 94.2 ms, sys: 3.78 ms, total: 98 ms
Wall time: 102 ms


  after removing the cwd from sys.path.


### 2.2.5 Adjusting training algorithm (sg=0 for CBOW; sg=1 for skip-gram )

In [0]:
%%time

model5 = customW2Vmodel(sg=0)
print('Vocabulary size: ',len(model5.wv.vocab))

Vocabulary size:  3887
CPU times: user 104 ms, sys: 1.77 ms, total: 106 ms
Wall time: 110 ms


  after removing the cwd from sys.path.


## 2.3. Compare models with respect to Performance

I choose a set of metrics to compare models' performance, including training time and results delivering from various in-built functions

In [0]:
# Model to compare
models = [('Basic Model',model0),('Vector Dimensionality 250',model1),
          ('Vector Dimensionality 500',model11),('Window Size',model2),
          ('Min Token Frequency',model3),('Negative Sampling',model4),
          ('Train Algorithm',model5)]

### 2.3.1 Training time

### 2.3.2 10 best pairs of words

In [0]:
import pandas as pd

In [0]:
# Function that uses the model to determine the N best pairs of tokens, based on
# similarity, among all possible pairs in the vocabolary
def bestSimPairs(model,N,vocab=None):
  import operator
  
  # Use dictionary structure to store word pairs and their similarities
  similar_pairs = {}
  if(vocab == None):
    vocab = model.wv.index2word
    
  # Compare each word of the vocabulary against all others and find similarities
  for p1 in vocab:
    pairs = model.wv.most_similar(p1,topn=1)
    p2,sim = pairs[0]
    similar_pairs[(p1, p2)] = sim

  # Sort dictionary's data by values
  sorted_similar_pairs = sorted(similar_pairs.items(),key=operator.itemgetter(1),reverse=True)

  # Since every pair appears twice, I keep every other element of the ordered list
  sorted_similar_pairs = sorted_similar_pairs[0:2*N:2]
  
  return(sorted_similar_pairs) 

In [0]:
vocab = None

best_pairs_by_method = {}

for (name,model) in models:
  best_pairs_by_method[name] = bestSimPairs(model, 10, vocab)

print(pd.DataFrame.from_dict(best_pairs_by_method))

  if np.issubdtype(vec.dtype, np.int):


                                      Basic Model  \
0          ((noisy, driving), 0.4894424080848694)   
1      ((works, assemblies), 0.48557597398757935)   
2         ((simpler, burst), 0.48217469453811646)   
3  ((inserts, contribution), 0.47997167706489563)   
4              ((deal, care), 0.4799291491508484)   
5   ((employ, concurrently), 0.46593475341796875)   
6              ((go, halali), 0.4637524485588074)   
7         ((fit, employing), 0.45622801780700684)   
8  ((specifically, maximize), 0.4561006724834442)   
9             ((isps, odour), 0.4546169638633728)   

                                 Min Token Frequency  \
0            ((variants, times), 0.4278267025947571)   
1  ((explicitly, psychological), 0.4277902841567993)   
2             ((new, involving), 0.4151037931442261)   
3              ((human, hetero), 0.4103476405143738)   
4           ((believed, times), 0.39652907848358154)   
5          ((processing, show), 0.39632487297058105)   
6    ((generally, constr

In [0]:
vocab = model3.wv.index2word

best_pairs_by_method = {}

for (name,model) in models:
  best_pairs_by_method[name] = bestSimPairs(model, 10, vocab)

print(pd.DataFrame.from_dict(best_pairs_by_method))

  if np.issubdtype(vec.dtype, np.int):


                                         Basic Model  \
0         ((works, assemblies), 0.48557597398757935)   
1     ((specifically, maximize), 0.4561006724834442)   
2    ((alternative, identifies), 0.4453243613243103)   
3     ((direction, predicting), 0.44045907258987427)   
4              ((probably, 300), 0.4356062412261963)   
5             ((measure, nervous), 0.43490070104599)   
6               ((three, lbfr), 0.43085259199142456)   
7            ((times, variants), 0.4278267025947571)   
8  ((psychological, explicitly), 0.4277902841567993)   
9        ((required, regimens), 0.42751410603523254)   

                                 Min Token Frequency  \
0            ((variants, times), 0.4278267025947571)   
1  ((explicitly, psychological), 0.4277902841567993)   
2             ((new, involving), 0.4151037931442261)   
3              ((human, hetero), 0.4103476405143738)   
4           ((believed, times), 0.39652907848358154)   
5          ((processing, show), 0.3963248729705

### 2.3.3 Compute words' cosine distance

Find the cosine distance of pairs of words

In [0]:
# Function that accepts a trained word2vec model and list of pairs of words and
# computes the cosine distance of each pair
def cosDist(model, pairs):
  cos_dist = []
  
  for pair in pairs:
    w1,w2 = pair
    cos_dist.append(model.wv.similarity(w1,w2))
    
  return(cos_dist)

In [0]:
import random

# Create 5 random pairs of words from the smallest vocabulary
vocab = model3.wv.index2word
list = []
for i in range(5):
  list.append(random.choices(vocab, k=2))

cosDist_by_method = {}
cosDist_by_method['List of pairs'] = list

# Use the methods to determine the cosine distance of words in each pair
for (name,model) in models:
  cosDist_by_method[name] = cosDist(model, list)

print(pd.DataFrame.from_dict(cosDist_by_method).set_index('List of pairs'))

                       Basic Model  Min Token Frequency  Negative Sampling  \
List of pairs                                                                
[exchange, easy]         -0.004057            -0.004057          -0.004057   
[stimuli, complexity]    -0.035405            -0.035405          -0.035405   
[responsible, easy]      -0.038228            -0.038228          -0.038228   
[implies, could]          0.135873             0.135873           0.135873   
[recent, correct]         0.156587             0.156587           0.156587   

                       Train Algorithm  Vector Dimensionality 250  \
List of pairs                                                       
[exchange, easy]             -0.004057                   0.079440   
[stimuli, complexity]        -0.035405                  -0.130871   
[responsible, easy]          -0.038228                   0.022298   
[implies, could]              0.135873                   0.104804   
[recent, correct]             0.156587 

  if np.issubdtype(vec.dtype, np.int):


### 2.3.4 Compute phrases' cosine distance

Find the cosine distance of lists of words or strings

In [0]:
# Function that accepts a trained model and list of pairs of lists of words or
# strings and computes the cosine distance of each pair. Strings in a pair should
# have same length
def cosDistPhrase(model, lists):
  cos_phrase_dist = []
  
  for pair in lists:
    p1,p2 = pair
    cos_phrase_dist.append(model.wv.n_similarity(p1,p2))
    
  return(cos_phrase_dist)

In [0]:
import random

# Create 5 pairs of random 4-element lists of words from the smallest vocabulary
vocab = model3.wv.index2word
lists = []
for i in range(5):
  lists.append([random.choices(vocab, k=4), random.choices(vocab, k=4)])

cosDistPhrase_by_method = {}
cosDistPhrase_by_method['Lists of pairs'] = lists

# Use the methods to determine the cosine distance of words in each list
for (name,model) in models:
  cosDistPhrase_by_method[name] = cosDistPhrase(model, lists)

print(pd.DataFrame.from_dict(cosDistPhrase_by_method).set_index('Lists of pairs'))

                                                    Basic Model  \
Lists of pairs                                                    
[[decisions, instances, cellular, testing], [un...     0.044305   
[[output, involving, questions, disutility], [p...    -0.087051   
[[hard, place, age, majority], [network, belief...     0.117287   
[[preferences, probability, value, six], [proce...    -0.095530   
[[would, complexity, context, solution], [viewe...     0.147153   

                                                    Min Token Frequency  \
Lists of pairs                                                            
[[decisions, instances, cellular, testing], [un...             0.044305   
[[output, involving, questions, disutility], [p...            -0.087051   
[[hard, place, age, majority], [network, belief...             0.117287   
[[preferences, probability, value, six], [proce...            -0.095530   
[[would, complexity, context, solution], [viewe...             0.147153   

    

  if np.issubdtype(vec.dtype, np.int):


### 2.3.5 Find which word doesn't match with the others

Which word from the given list doesn’t go with the others?

In [0]:
# Function that accepts lists of words and uses the model to determine which word
# doesn't match with the others
def doesntMatch(model, lists):
  dont_match = []
  
  for list in lists:
    dont_match.append(model.doesnt_match(list))
    
  return(dont_match)

In [0]:
import random

# Create 5 random 5-element lists from the smallest vocabulary
vocab = model3.wv.index2word
lists = []
for i in range(5):
  lists.append(random.choices(vocab, k=5))

doesnt_match_by_method = {}
doesnt_match_by_method['Lists of words'] = lists

# Use the methods to determine which word doesn't match in each list
for (name,model) in models:
  doesnt_match_by_method[name] = doesntMatch(model, lists)

print(pd.DataFrame.from_dict(doesnt_match_by_method).set_index('Lists of words'))

                                                Basic Model  \
Lists of words                                                
[polynomial, firing, length, three, fixed]            three   
[real, consequence, involving, splicing, task]  consequence   
[framing, dual, couple, bernoulli, option]        bernoulli   
[speed, analyzed, variants, predict, feedback]     analyzed   
[able, instances, couple, procedure, whether]        couple   

                                               Min Token Frequency  \
Lists of words                                                       
[polynomial, firing, length, three, fixed]                   three   
[real, consequence, involving, splicing, task]         consequence   
[framing, dual, couple, bernoulli, option]               bernoulli   
[speed, analyzed, variants, predict, feedback]            analyzed   
[able, instances, couple, procedure, whether]               couple   

                                               Negative Sampling  \

  """
  if np.issubdtype(vec.dtype, np.int):


### 2.3.6 Predict output word

Get the probability distribution of the center word given context words

In [0]:
# Function that accepts lists of words as context and uses the model to determine
# the most probable center word
def findCenter(model, lists):
  center = []
  
  for list in lists:
    center.append(model.predict_output_word(list, topn=1))
    
  return(center)

In [0]:
# List of 5 tokenized random phrases from corpus
lists = [['pay','their','taxes','despite','low','likelihood'],
        ['incorporate','potentially','given','loss','function'],
        ['challenges','accurate','realistic','modeling'],
        ['derive','macroscopic','statistics','different','types'],
        ['critical','rapid','encoding','novel','information']]

center_by_method = {}

# Use the methods to determine the center word for each list of context words
for (name,model) in models:
  center_by_method[name] = findCenter(model, lists)

print(pd.DataFrame.from_dict(center_by_method))

                Basic Model      Min Token Frequency  \
0  [(model, 0.00025726782)]  [(model, 0.0011494253)]   
1  [(model, 0.00025726782)]  [(model, 0.0011494253)]   
2  [(model, 0.00025726782)]  [(model, 0.0011494253)]   
3  [(model, 0.00025726782)]  [(model, 0.0011494253)]   
4  [(model, 0.00025726782)]  [(model, 0.0011494253)]   

          Negative Sampling           Train Algorithm  \
0  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
1  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
2  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
3  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
4  [(model, 0.00025726782)]  [(model, 0.00025726782)]   

  Vector Dimensionality 250 Vector Dimensionality 500  \
0  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
1  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
2  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
3  [(model, 0.00025726782)]  [(model, 0.00025726782)]   
4  [(model, 0.00025726782)]  [(mode

#Part 3

## 3.1. Train a word embedding model on the sentence classification corpus from the UCI Machine Learning repository

In [1]:
# Connect to personal Google Drive
from google.colab import drive
drive.mount('/content/drive/',force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive/


In [0]:
import urllib.request
import zipfile

Download and extract data to Drive. Once done, you can handle the data locally.

In [13]:
urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip", filename="SentenceCorpus.zip")

('SentenceCorpus.zip', <http.client.HTTPMessage at 0x7f565c93e3c8>)

In [0]:
# urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip", filename="SentenceCorpus.zip")
with zipfile.ZipFile('SentenceCorpus.zip', 'r') as z:
  z.extractall('/content/drive/My Drive/NLP Assignment 2/Part3/')


In [0]:
# Data paths
training_docs_directory = '/content/drive/My Drive/NLP Assignment 2/Part3/SentenceCorpus/labeled_articles/'
official_stopwords = '/content/drive/My Drive/NLP Assignment 2/Part3/SentenceCorpus/word_lists/stopwords.txt'
vocab_filename = '/content/drive/My Drive/NLP Assignment 2/Part3/vocab.txt'
embedding_word2vec_filename = '/content/drive/My Drive/NLP Assignment 2/Part3/embedding_word2vec.txt'


## 3.2 Train an Embedding Layer on the DataSet


###3.2.1 Extract the DataSet's vocabulary and save it in a file to Drive for local use

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
from string import punctuation
from os import listdir
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
  # open the file as read only
  file = open(filename, 'r')
  # read all text
  text = file.read()
  # close the file
  file.close()
  return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
  doc = doc.lower()
  tokens = doc.split()
  # remove punctuation from each token
  table = str.maketrans('', '', punctuation)
  tokens = [w.translate(table) for w in tokens]
  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]
  # filter out stop words
  stop_words = set(stopwords.words('english'))
  tokens = [w for w in tokens if not w in stop_words]
  tokens = [w for w in tokens if not w in extra_words_to_remove]
  # filter out short tokens
  tokens = [word for word in tokens if len(word) > 1]
  return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
  # load doc
  doc = load_doc(filename)
  # clean doc
  tokens = clean_doc(doc)
  # update counts
  vocab.update(tokens)

# load all docs in a directory  
def process_docs(directory, vocab):
  # walk through all files in the folder
  for filename in listdir(directory):
    # keep docs only drom the first reviewer
    if filename.endswith('1.txt'):
      # create the full path of the file to open
      path = directory+filename
      # add doc to vocab
      add_doc_to_vocab(path,vocab)

# Save list to file
def save_list(lines, filename):
  # convert lines to a single blob of text
  data = '\n'.join(lines)
  # open file
  file = open(filename, 'w')
  # write text
  file.write(data)
  # close file
  file.close()

# Add all docs to vocab
# Define vocabulary
vocab = Counter()

extra_words_to_remove = ['citation','number','symbol','misc',
                         'aimx','ownx','cont','base','of','',
                         'abstract','introduction']

# Decompress and manage DataSet
# Get training documents from respective directories
process_docs(training_docs_directory, vocab)

# The size of the vocab
print("\nVocabulary size:", len(vocab))
# Top words in the vocab
print("\nMost common words: \n", vocab.most_common(50))

# Keep tokens with a min occurence
min_occurence = 1
tokens = [k for k,c in vocab.items() if c >= min_occurence]
print("\nUpdated Vocabulary size:", len(tokens))

# Save tokens to a vocabulary file
save_list(tokens, vocab_filename)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Vocabulary size: 3968

Most common words: 
 [('model', 96), ('models', 58), ('one', 56), ('may', 50), ('many', 48), ('also', 48), ('based', 47), ('however', 46), ('proteins', 46), ('two', 45), ('data', 45), ('results', 43), ('learning', 41), ('loss', 38), ('selfcontrol', 38), ('participants', 38), ('used', 37), ('using', 36), ('behavior', 36), ('new', 35), ('different', 35), ('well', 35), ('stochastic', 34), ('thus', 34), ('section', 33), ('information', 32), ('studies', 32), ('ion', 32), ('general', 31), ('et', 31), ('al', 31), ('conflict', 31), ('formula', 31), ('neurons', 31), ('splicing', 31), ('expected', 30), ('use', 30), ('first', 30), ('study', 30), ('choice', 30), ('face', 30), ('example', 29), ('task', 29), ('network', 29), ('large', 28), ('analysis', 28), ('problem', 28), ('paper', 27), ('function', 27), ('individual', 27)]

Updated Vocabulary size: 3968


### 3.2.2 Training the Embedding Layer

In [0]:
# Optional. Recover vocabulary from local file
with open(vocab_filename, "r") as f:
    vocab = f.read().split('\n')

In [20]:
len(vocab)

3968

#### 3.2.2.1 Building the data

**Libraries and Functions to use**

For the data pre-processing I will follow different approach than the Part 2 of the Assignment, with various predefined functions and hopefully the same results.

In [21]:
from string import punctuation
from os import listdir
import numpy as np
from numpy import array
import operator


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding


classes = ['aimx','base','cont','misc','ownx']
extra_words_to_remove = ['citation','number','symbol','of','','abstract','introduction']

# load doc into memory
def load_doc(filename):
  # open the file as read only
  file = open(filename, 'r')
  # read all text
  text = file.read()
  # close the file
  file.close()
  return text

# turn a doc into clean tokens
def doc_to_clean_lines(doc, vocab):
  clean_lines = []
  lines = doc.splitlines()
  for line in lines:
    line = line.lower()
    # split into tokens by white space
    tokens = line.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens = [w for w in tokens if not w in extra_words_to_remove]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab+classes]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    clean_lines.append(tokens)
  return clean_lines

# load all docs in a directory
def process_docs(directory, vocab, reviewer):
  lines = []
  # walk through all files in the folder
  filelist = sorted(listdir(directory))
  for filename in filelist:
    # keep docs only from the first reviewer
    if filename.endswith(reviewer):
      # create the full path of the file to open
      path = directory + filename
      # load and clean the doc
      doc = load_doc(path)
      doc_lines = doc_to_clean_lines(doc, vocab)
      doc_lines = [l for l in doc_lines if len(l) > 0]
      doc_lines = [l for l in doc_lines if l[0] in classes]
      # add lines to list
      lines += doc_lines
  return lines


Using TensorFlow backend.


**Get the training data**

In [0]:
# Corpus doc (X) and labels (y) from DataSet
X = []
y = []

# Get training documents from respective directories
# Each element is a single line (list of clean tokens) of the training documents
# Getting corpus docs (X) and labels from first reviewer's files
training_docs = process_docs(training_docs_directory, vocab, '1.txt')
y1 = []   # Labels from first reviewer
for line in training_docs:
  y1.append(line[0])
  X.append(line[1:])

# Labels from second and third reviewers
y2 = []
training_docs2 = process_docs(training_docs_directory, vocab, '2.txt')
for line in training_docs2:
  y2.append(line[0])
y3 = []
training_docs3 = process_docs(training_docs_directory, vocab, '3.txt')
for line in training_docs3:
  y3.append(line[0])

# get the predominant label for each doc
for i in range(len(y1)):
    dic = {'aimx': 0, 'base': 0, 'cont': 0, 'misc': 0, 'ownx': 0, }
    dic[y1[i]] += 1
    dic[y2[i]] += 1
    dic[y3[i]] += 1
    sorted_dic = sorted(dic.items(), key=operator.itemgetter(1), reverse=True)
    y.append(sorted_dic[0][0])


In [23]:
y[:10]

['misc',
 'misc',
 'misc',
 'aimx',
 'ownx',
 'ownx',
 'ownx',
 'misc',
 'cont',
 'misc']

**Handling categorical y values**

In [24]:
# First transform categorical values to integers
le = LabelEncoder()
y = le.fit_transform(y)
print('Original y labels: ', classes)
print('Integer encoded y labels: ', le.transform(classes))

# Then use One Hot encoding
ohe = OneHotEncoder(sparse=False)
y = y.reshape(len(y),1)
y = ohe.fit_transform(y)
# One could use only the initial Integer encoder and not the following One Hot.
# Then during Network compilation should use loss='sparse_categorical_crossentropy'


Original y labels:  ['aimx', 'base', 'cont', 'misc', 'ownx']
Integer encoded y labels:  [0 1 2 3 4]


In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [0]:
# Split training-testing data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,
                                               stratify=y, random_state=0)

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(X)

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# encode sequences
encoded_X_train = tokenizer.texts_to_sequences(X_train)
encoded_X_test = tokenizer.texts_to_sequences(X_test)

# pad sequences
max_length = max([len(s) for s in X])
X_train_pad = pad_sequences(encoded_X_train, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(encoded_X_test, maxlen=max_length, padding='post')


#### 3.2.2.2 Define and Train the model

I define the model inside a function in order to be able to call it on the next part of the Assignment

In [29]:

# Model's parameters
output_dim = 100
activation = 'softmax'
losses = 'categorical_crossentropy'
optimizer = 'adam'

parameters = [output_dim, activation, losses, optimizer]

def annDef(param):
  # Define Model
  model = Sequential()
  model.add(Embedding(vocab_size, output_dim=output_dim, input_length=max_length))
  model.add(Flatten())
  model.add(Dense(5, activation=activation))
  print(model.summary())

  # Compile Network
  model.compile(loss=losses, optimizer=optimizer, metrics=['accuracy'])

  # Fit Network to training data
  model.fit(X_train_pad, y_train, epochs=10, verbose=2)

  # Evaluate Network during Training
  loss, acc = model.evaluate(X_test_pad, y_test, verbose=2)
  print('Test Accuracy: %f' % (acc*100))
  
  return model
  

model = annDef(parameters)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 54, 100)           392100    
_________________________________________________________________
flatten_2 (Flatten)          (None, 5400)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 27005     
Total params: 419,105
Trainable params: 419,105
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/9
 - 0s - loss: 1.1738 - acc: 0.5683
Epoch 2/9
 - 0s - loss: 0.9924 - acc: 0.6024
Epoch 3/9
 - 0s - loss: 0.8682 - acc: 0.6171
Epoch 4/9
 - 0s - loss: 0.6933 - acc: 0.8085
Epoch 5/9
 - 0s - loss: 0.4873 - acc: 0.8902
Epoch 6/9
 - 0s - loss: 0.3218 - acc: 0.9293
Epoch 7/9
 - 0s - loss: 0.2072 - acc: 0.9646
Epoch 8/9
 - 0s - loss: 0.1354 - acc: 0.9915
Epoch 9/9
 - 0s - loss: 0.0885 - ac

### 3.2.3 Model evaluation

**Importing the metrics**

I will use the well known metrics from sklearn library

In [0]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

**Prediction on the test set**

Use trained model to extract the predictions for the test set of data

In [31]:
y_pred = model.predict(X_test_pad)
y_pred[:10]

array([[0.01635011, 0.00768984, 0.01620445, 0.9035947 , 0.05616083],
       [0.06417371, 0.01832901, 0.04484073, 0.3990794 , 0.4735771 ],
       [0.07827186, 0.03355028, 0.05857874, 0.49991858, 0.3296805 ],
       [0.06035602, 0.04286615, 0.13120122, 0.6543308 , 0.11124577],
       [0.01870227, 0.00959134, 0.03747429, 0.9101905 , 0.02404161],
       [0.05441982, 0.01605988, 0.02669431, 0.5027342 , 0.4000917 ],
       [0.10939447, 0.03267734, 0.07525317, 0.4094746 , 0.37320042],
       [0.05899882, 0.07121229, 0.28040886, 0.5207333 , 0.06864672],
       [0.01490743, 0.00607549, 0.01438066, 0.9210328 , 0.04360374],
       [0.02601239, 0.00933846, 0.02204727, 0.850629  , 0.09197292]],
      dtype=float32)

It's obvious that the prediction for each sample is in the SoftMax format. The output is composed by 5 float numbers denoting the probability the sample to belong to each one of the 5 categories.

**Turning probabilities into Labels**

My task is to restore the original labels for testing data. I will follow the opposite procedure than the one used to encode the categorical y values:

- Turn SoftMax format into Integer encoding
- Turn Integer encoding into original labels, using the original fitted LabelEncoder() instance (inverse transformation).

In [33]:
# Integer encoding
# argmax() function will replace the set of probabilities with the index of
# the higher probability 
y_p = np.argmax(y_pred, axis=1)
y_p_t = np.argmax(y_test, axis=1)
y_p[:10]

array([3, 4, 3, 3, 3, 3, 3, 3, 3, 3])

In [34]:
# Inverse transformation of Integer encoding to Labels
y_predictions = le.inverse_transform(y_p)
y_true = le.inverse_transform(y_p_t)
y_predictions[:10]

array(['misc', 'ownx', 'misc', 'misc', 'misc', 'misc', 'misc', 'misc',
       'misc', 'misc'], dtype='<U4')

**Calculate the metrics**

In [35]:
accuracy = accuracy_score(y_true,y_predictions)
recall = recall_score(y_true,y_predictions,average='macro')
precision = precision_score(y_true,y_predictions,average='macro')
f1 = f1_score(y_true,y_predictions,average='macro')

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [36]:
print('Accuracy:\t{:.3f}'.format(accuracy))
print('Recall:\t\t{:.3f}'.format(recall))
print('Precision:\t{:.3f}'.format(precision))
print('F1 score:\t{:.3f}'.format(f1))

Accuracy:	0.668
Recall:		0.276
Precision:	0.460
F1 score:	0.276


## 3.3 Experimenting with various parameters of the Classifier

**Model definition function from previous part**

In [0]:
def annDef(od, act, los, opt):
  # Define Model
  model = Sequential()
  model.add(Embedding(vocab_size, output_dim=od, input_length=max_length))
  model.add(Flatten())
  model.add(Dense(5, activation=act))
  print(model.summary())

  # Compile Network
  model.compile(loss=los, optimizer=opt, metrics=['accuracy'])

  # Fit Network to training data
  model.fit(X_train_pad, y_train, epochs=10, verbose=2)

  # Evaluate Network during Training
  loss, acc = model.evaluate(X_test_pad, y_test, verbose=2)
  print('Test Accuracy: %f' % (acc*100))
  
  return(model)
  

**Model evaluation function**

In [0]:
def evalModel(model):
  y_pred = model.predict(X_test_pad)
  y_p = np.argmax(y_pred, axis=1)
  y_p_t = np.argmax(y_test, axis=1)
  y_predictions = le.inverse_transform(y_p)
  y_true = le.inverse_transform(y_p_t)
  accuracy = accuracy_score(y_true,y_predictions)
  recall = recall_score(y_true,y_predictions,average='macro')
  precision = precision_score(y_true,y_predictions,average='macro')
  f1 = f1_score(y_true,y_predictions,average='macro')
  return((accuracy,recall,precision,f1))

**Define model's parameters and execute 3 times with the same set**

In [40]:
import pandas as pd
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score

# Dataframe to store total results
headers=['Dimensions','Activation','Losses','Optimizer','Accurasy(+/-sd)',
         'Recall(+/-sd)','Precision(+/-sd)','F1_score(+/-sd)']
headers2 = ['Accurasy','Recall','Precision','F1_score']

total = pd.DataFrame(columns=headers)

# Model's parameters' range
output_dim = [100,250,500]
activation = ['softmax','sigmoid']
losses = ['categorical_crossentropy','binary_crossentropy']
optimizer = ['adam','sgd']

# Testing set
# output_dim = [100]
# activation = ['softmax',]
# losses = ['categorical_crossentropy',]
# optimizer = ['adam',]

for od in output_dim:
  for act in activation:
    for los in losses:
      for opt in optimizer:
        
        parameters = [od, act, los, opt]
        # List to store the results of each set
        results = []
        # Dataframe to store the results of each run
        temp = pd.DataFrame(columns=headers2)
        
        # Execute model 3 times for each set
        for i in range(3):
          model = annDef(od, act, los, opt)
          (accuracy,recall,precision,f1) = evalModel(model)
          # Store run results as a new row to dataframe
          new_row = pd.Series([accuracy,recall,precision,f1], index=headers2)
          temp = temp.append(new_row, ignore_index=1)
        
        # Get the statistics of the 3 executions
        acc = temp.Accurasy.mean()
        accSd = temp.Accurasy.std()
        rec = temp.Recall.mean()
        recSd = temp.Recall.std()
        pre = temp.Precision.mean()
        preSd = temp.Precision.std()
        f1 = temp.F1_score.mean()
        f1Sd = temp.F1_score.std()
        
        # Statistics as stings
        accStr = '%.3f+/-%.3f' % (acc, accSd)
        recStr = '%.3f+/-%.3f' % (rec, recSd)
        preStr = '%.3f+/-%.3f' % (pre, preSd)
        f1Str = '%.3f+/-%.3f' % (f1, f1Sd)
        
        stats = [accStr,recStr,preStr,f1Str]
        
        # Store the final results of set
        results.extend(parameters)
        results.extend(stats)
        
        new_row = pd.Series(results, index=headers)
        total = total.append(new_row, ignore_index=1)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 54, 100)           392100    
_________________________________________________________________
flatten_3 (Flatten)          (None, 5400)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 27005     
Total params: 419,105
Trainable params: 419,105
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
 - 0s - loss: 1.1467 - acc: 0.5768
Epoch 2/10
 - 0s - loss: 0.9801 - acc: 0.6000
Epoch 3/10
 - 0s - loss: 0.8639 - acc: 0.6012
Epoch 4/10
 - 0s - loss: 0.6917 - acc: 0.8134
Epoch 5/10
 - 0s - loss: 0.4825 - acc: 0.8866
Epoch 6/10
 - 0s - loss: 0.3150 - acc: 0.9439
Epoch 7/10
 - 0s - loss: 0.2011 - acc: 0.9732
Epoch 8/10
 - 0s - loss: 0.1282 - acc: 0.9915
Epoch 9/10
 - 0s - loss: 0.

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Epoch 1/10
 - 0s - loss: 1.1405 - acc: 0.5646
Epoch 2/10
 - 0s - loss: 0.9861 - acc: 0.6000
Epoch 3/10
 - 0s - loss: 0.8745 - acc: 0.6463
Epoch 4/10
 - 0s - loss: 0.6995 - acc: 0.8000
Epoch 5/10
 - 0s - loss: 0.4937 - acc: 0.8890
Epoch 6/10
 - 0s - loss: 0.3228 - acc: 0.9280
Epoch 7/10
 - 0s - loss: 0.2068 - acc: 0.9646
Epoch 8/10
 - 0s - loss: 0.1328 - acc: 0.9902
Epoch 9/10
 - 0s - loss: 0.0888 - acc: 0.9976
Epoch 10/10
 - 0s - loss: 0.0606 - acc: 0.9988
Test Accuracy: 67.804878
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 54, 100)           392100    
_________________________________________________________________
flatten_5 (Flatten)          (None, 5400)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 5)                 27005     
Total params: 419,105
Trainable params: 419,105
Non-

In [41]:
total

Unnamed: 0,Dimensions,Activation,Losses,Optimizer,Accurasy(+/-sd),Recall(+/-sd),Precision(+/-sd),F1_score(+/-sd)
0,100,softmax,categorical_crossentropy,adam,0.676+/-0.017,0.279+/-0.008,0.468+/-0.009,0.280+/-0.008
1,100,softmax,categorical_crossentropy,sgd,0.600+/-0.000,0.200+/-0.000,0.120+/-0.000,0.150+/-0.000
2,100,softmax,binary_crossentropy,adam,0.688+/-0.013,0.301+/-0.011,0.470+/-0.010,0.314+/-0.015
3,100,softmax,binary_crossentropy,sgd,0.600+/-0.000,0.200+/-0.000,0.120+/-0.000,0.150+/-0.000
4,100,sigmoid,categorical_crossentropy,adam,0.665+/-0.011,0.273+/-0.010,0.447+/-0.069,0.277+/-0.014
5,100,sigmoid,categorical_crossentropy,sgd,0.600+/-0.000,0.200+/-0.000,0.120+/-0.000,0.150+/-0.000
6,100,sigmoid,binary_crossentropy,adam,0.659+/-0.013,0.263+/-0.006,0.468+/-0.010,0.262+/-0.006
7,100,sigmoid,binary_crossentropy,sgd,0.598+/-0.003,0.199+/-0.001,0.120+/-0.000,0.150+/-0.000
8,250,softmax,categorical_crossentropy,adam,0.683+/-0.005,0.288+/-0.009,0.472+/-0.005,0.293+/-0.016
9,250,softmax,categorical_crossentropy,sgd,0.598+/-0.003,0.199+/-0.001,0.120+/-0.000,0.150+/-0.000


In [42]:
total.sort_values('Accurasy(+/-sd)',ascending=False)

Unnamed: 0,Dimensions,Activation,Losses,Optimizer,Accurasy(+/-sd),Recall(+/-sd),Precision(+/-sd),F1_score(+/-sd)
2,100,softmax,binary_crossentropy,adam,0.688+/-0.013,0.301+/-0.011,0.470+/-0.010,0.314+/-0.015
10,250,softmax,binary_crossentropy,adam,0.686+/-0.015,0.294+/-0.012,0.474+/-0.008,0.302+/-0.018
18,500,softmax,binary_crossentropy,adam,0.683+/-0.013,0.289+/-0.005,0.472+/-0.015,0.294+/-0.008
8,250,softmax,categorical_crossentropy,adam,0.683+/-0.005,0.288+/-0.009,0.472+/-0.005,0.293+/-0.016
16,500,softmax,categorical_crossentropy,adam,0.678+/-0.005,0.282+/-0.004,0.469+/-0.003,0.284+/-0.003
0,100,softmax,categorical_crossentropy,adam,0.676+/-0.017,0.279+/-0.008,0.468+/-0.009,0.280+/-0.008
22,500,sigmoid,binary_crossentropy,adam,0.673+/-0.013,0.277+/-0.008,0.468+/-0.007,0.278+/-0.009
14,250,sigmoid,binary_crossentropy,adam,0.673+/-0.010,0.277+/-0.004,0.470+/-0.005,0.278+/-0.005
20,500,sigmoid,categorical_crossentropy,adam,0.672+/-0.015,0.283+/-0.005,0.431+/-0.048,0.287+/-0.003
12,250,sigmoid,categorical_crossentropy,adam,0.670+/-0.010,0.277+/-0.008,0.445+/-0.071,0.282+/-0.014


# Vector Semantics - Learn word embeddings
** Get word embeddings with Word2Vec.**
<br>
Begin by reading the gensim documentation for [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html), to figure out how to use the Word2Vec class. 






<img height="20px" src="http://mlbernauer.github.io/assets/python.png" align="left" hspace="5px" vspace="3px">
<h3> Exercise </h3>

1. Read the Word2Vec documentation, and
2. train a Word2Vec model that learns embeddings in $\mathbb R^{100}$  from the TED dataset using SkipGram model. Other options should be default except min_count=10 so that infrequent words are ignored.  
3. train a Word2Vec model that learns embeddings in $\mathbb R^{10}$  from the TED dataset using SkipGram model. 
4. Get the most similar word to "computer" for both models. Do you notice any difference? What exactly?




---



# Neural Network Classification models
Neural Network Models address the n-gram data sparsity issue through parameterization of words as vectors (word embeddings) and using them as inputs to a neural network.


## 2.1 Movie Review Polarity Dataset
The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research.
The dataset is comprised of **1,000 positive** and **1,000 negative movie reviews** drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at imdb.com. The authors refer to this dataset as the “polarity dataset.”

**You can download the dataset from here**:
[Movie Review Polarity Dataset (review_polarity.tar.gz, 3MB)]()

After unzipping the file, you will have a directory called “txt_sentoken” with two sub-directories containing the text “neg” and “pos” for negative and positive reviews. Reviews are stored one per file with a naming convention cv000 to cv999 for each neg and pos.

**The data has been cleaned up:**
* The dataset is comprised of only English reviews.
* All text has been converted to lowercase.
* There is white space around punctuation like periods, commas, and brackets.
* Text has been split into one sentence per line.

## 2.2 Data Preparation
 Prepare movie review text data for classification with neural network methods
 
 

*   Split into Train / Validation sets
*   Loading & Cleaning Reviews
*   Define a vocabulary of preferred words



In [0]:
from google.colab import drive
drive.mount('/content/drive/',force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive/


In [0]:
# Data paths
negativeReviewsDirectory = '/content/drive/My Drive/Colab Notebooks/Datasets/review_polarity/txt_sentoken/neg'
positiveReviewsDirectory = '/content/drive/My Drive/Colab Notebooks/Datasets/review_polarity/txt_sentoken/pos'

vocab_filename = '/content/drive/My Drive/Colab Notebooks/Datasets/vocab.txt'
embedding_word2vec_filename = '/content/drive/My Drive/Colab Notebooks/Datasets/embedding_word2vec.txt'
glove_embedding = '/content/drive/My Drive/glove.6B.100d.txt'


In [0]:
from string import punctuation
from os import listdir
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)


# Add all docs to vocab
# Define vocabulary
vocab = Counter()
# Get positive and negative documents from respective directories
process_docs(negativeReviewsDirectory, vocab, True)
process_docs(positiveReviewsDirectory, vocab, True)

# print the size of the vocab
print("\nVocabulary size:", len(vocab))
# print the top words in the vocab
print("\nMost common words: \n", vocab.most_common(50))

# Keep tokens with a min occurence
min_occurence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurence]
print("\nUpdated Vocabulary size:", len(tokens))

# save list to file
def save_list(lines, filename):
	# convert lines to a single blob of text
	data = '\n'.join(lines)
	# open file
	file = open(filename, 'w')
	# write text
	file.write(data)
	# close file
	file.close()

# save tokens to a vocabulary file
save_list(tokens, vocab_filename)

## 2.3 Train an Embedding Layer
Word embeddings as part of fitting a neural network model.


*   [Embedding layer](https://keras.io/layers/embeddings/) in Keras deep learning library



In [0]:
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# load all training reviews
positive_docs = process_docs(positiveReviewsDirectory, vocab, True)
negative_docs = process_docs(negativeReviewsDirectory, vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs(positiveReviewsDirectory, vocab, False)
negative_docs = process_docs(negativeReviewsDirectory, vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# Define Model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Using TensorFlow backend.


FileNotFoundError: ignored

## 2.4 Train a Word2Vec Embedding

In [0]:
from string import punctuation
from os import listdir
from gensim.models import Word2Vec


# turn a doc into clean tokens
def doc_to_clean_lines(doc, vocab):
	clean_lines = list()
	lines = doc.splitlines()
	for line in lines:
		# split into tokens by white space
		tokens = line.split()
		# remove punctuation from each token
		table = str.maketrans('', '', punctuation)
		tokens = [w.translate(table) for w in tokens]
		# filter out tokens not in vocab
		tokens = [w for w in tokens if w in vocab]
		clean_lines.append(tokens)
	return clean_lines

# load all docs in a directory
def process_docs(directory, vocab, is_trian):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_trian and filename.startswith('cv9'):
			continue
		if not is_trian and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		doc = load_doc(path)
		doc_lines = doc_to_clean_lines(doc, vocab)
		# add lines to list
		lines += doc_lines
	return lines


# Get all training reviews
positive_lines = process_docs(positiveReviewsDirectory, vocab, True)
negative_lines = process_docs(negativeReviewsDirectory, vocab, True)
sentences = positive_lines + negative_lines
print('Total training sentences: %d' % len(sentences))

# Train a Word2Vec model
model = Word2Vec(sentences, size=100, window=5, workers=-1, min_count=10)
# summarize vocabulary size in model
words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(words))

# save model in ASCII (word2vec) format
model.wv.save_word2vec_format(embedding_word2vec_filename, binary=False)

Total training sentences: 58109
Vocabulary size: 8465


## 2.5 Use pre-trained Embeddings

### 2.5.1 Use pre-trained in-domain Embeddings

In [0]:
from string import punctuation
from os import listdir
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D


# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# load embedding as a dict
def load_embedding(filename):
	# load embedding into memory, skip first line
	file = open(filename,'r')
	lines = file.readlines()[1:]
	file.close()
	# create a map of words to vectors
	embedding = dict()
	for line in lines:
		parts = line.split()
		# key is string word, value is numpy array for vector
		embedding[parts[0]] = asarray(parts[1:], dtype='float32')
	return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
	# total vocabulary size plus 0 for unknown words
	vocab_size = len(vocab) + 1
	# define weight matrix dimensions with all 0
	weight_matrix = zeros((vocab_size, 100))
	# step vocab, store vectors using the Tokenizer's integer mapping
	for word, i in vocab.items():
		weight_matrix[i] = embedding.get(word)
	return weight_matrix


# load all training reviews
positive_docs = process_docs(positiveReviewsDirectory, vocab, True)
negative_docs = process_docs(negativeReviewsDirectory, vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs(positiveReviewsDirectory, vocab, False)
negative_docs = process_docs(negativeReviewsDirectory, vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# load embedding from file
raw_embedding = load_embedding(embedding_word2vec_filename)
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=True)

# define model
model = Sequential()
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print("\nModel Summary:\n", model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('\nTest Accuracy: %f' % (acc*100))

Using TensorFlow backend.


NameError: ignored

### 2.5.2 Use pre-trained word Embeddings on external datasets
Google and Stanford provide pre-trained word vectors that you can download, trained with the efficient word2vec and GloVe methods respectively.
We will use the [pre-trained GloVe vectors](https://nlp.stanford.edu/projects/glove/) from the Stanford webpage, which are trained on Wikipedia data. 

<img height="20px" src="http://mlbernauer.github.io/assets/python.png" align="left" hspace="5px" vspace="3px">
<h3> Exercise </h3>

Train a neural network classifier on Movie Review Polarity Dataset using the pre-trained GloVe word embeddings. You can use the code from 'building a text classifier using in-domain Embeddings' above as basis. What modifications are essential when using embeddings from an external dataset? Think and modify the code accordingly.

In [0]:
from string import punctuation
from os import listdir
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D


# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# load embedding as a dict
def load_embedding(filename):
	# load embedding into memory, skip first line
	file = open(filename,'r')
	lines = file.readlines()[1:]
	file.close()
	# create a map of words to vectors
	embedding = dict()
	for line in lines:
		parts = line.split()
		# key is string word, value is numpy array for vector
		embedding[parts[0]] = asarray(parts[1:], dtype='float32')
	return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
	# total vocabulary size plus 0 for unknown words
	vocab_size = len(vocab) + 1
	# define weight matrix dimensions with all 0
	weight_matrix = zeros((vocab_size, 100))
	# step vocab, store vectors using the Tokenizer's integer mapping
	for word, i in vocab.items():
		weight_matrix[i] = embedding.get(word)
	return weight_matrix


# load all training reviews
positive_docs = process_docs(positiveReviewsDirectory, vocab, True)
negative_docs = process_docs(negativeReviewsDirectory, vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs(positiveReviewsDirectory, vocab, False)
negative_docs = process_docs(negativeReviewsDirectory, vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# load embedding from file
raw_embedding = load_embedding(embedding_word2vec_filename)
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=True)

# define model
model = Sequential()
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print("\nModel Summary:\n", model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('\nTest Accuracy: %f' % (acc*100))

Using TensorFlow backend.


NameError: ignored



---



Thank You! 