# Machine Learning Engineer Nanodegree

## Capstone Project

### Project: Semantic similarity extraction using word vectors in Mahabharata dataset

Welcome to the capstone project of the Machine Learning Engineer Nanodegree! In this notebook, a corpus of words from the Mahabharata is used as input to create word vectors with word2vec. We then reduce the dimensionality of the word vectors with t-SNE and finally use cosine similarity to analyze semantic similarities, i.e. to answer relationship questions based on what the model has learned. The end goal of this project is to analyze relationships and logic in the dataset.
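
As a reminder of the last step, the cosine similarity between two word vectors, written here as $\mathbf{u}$ and $\mathbf{v}$ (symbols chosen for illustration, they do not appear elsewhere in this notebook), follows the standard definition:

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
$$

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.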

The dataset for this project can be found on the [GitHub Mahabharata Machine Learning Repository](https://github.com/TilakD/Mahabharata_extract-semantic-similarities_Natural-languageprocessing/tree/master/Dataset)

The model is assessed against known facts about the dataset. To benchmark it, I have compiled 23 relationship facts and will add a few more as I build the model. For example, below are a few of the facts used to benchmark the model.

    Dhritarastra is related to Pandu, as Sahadeva is related to Nakula
    Bhima is related to Arjuna, as Ambalika is related to Ambika
    Pandu is related to Kunti, as Dhritarashtra is related to Gandhari
    Bhima is related to Draupadi, as Arjuna is related to Chitrangada
    Karna is related to Kunti, as Duryodhana is related to Gandhari
    .
    .
    .
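
Facts written in this form can be turned into a simple accuracy score. The sketch below is one possible way to do that, not the scoring code used in this project: `parse_fact` and `benchmark_accuracy` are hypothetical helper names, and `predict` stands for any function that takes `(start1, end1, end2)` and returns a guess for `start2`.

```python
import re

# Pattern for facts of the form "A is related to B, as C is related to D".
FACT_PATTERN = re.compile(
    r"^\s*(\w+) is related to (\w+), as (\w+) is related to (\w+)\s*$"
)

def parse_fact(line):
    """Split one benchmark fact into (start1, end1, start2, end2)."""
    match = FACT_PATTERN.match(line)
    if match is None:
        raise ValueError("Unrecognized fact: {0!r}".format(line))
    return match.groups()

def benchmark_accuracy(facts, predict):
    """Fraction of facts for which predict(start1, end1, end2) == start2."""
    if not facts:
        return 0.0
    hits = 0
    for line in facts:
        start1, end1, start2, end2 = parse_fact(line)
        if predict(start1, end1, end2) == start2:
            hits += 1
    return hits / float(len(facts))
```

With the trained model, `predict` could be the `nearest_similarity_cosmul` function defined later in this notebook, provided every name in the facts is present in the vocabulary.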


>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

## Exploring the Data
Run the code cell below to load necessary Python libraries.

In [21]:
#future is the missing compatibility layer between Python 2 and Python 3. 
#It allows you to use a single, clean Python 3.x-compatible codebase to 
#support both Python 2 and Python 3 with minimal overhead.
from __future__ import absolute_import, division, print_function

In [22]:
#read files with an explicit text encoding
import codecs
#finds all pathnames matching a pattern, like regex
import glob
#log events for libraries
import logging
#concurrency
import multiprocessing
#operating system utilities, e.g. file paths and directories
import os
#pretty print, human readable
import pprint
#regular expressions
import re
#natural language toolkit
import nltk
#word 2 vec
import gensim.models.word2vec as w2v
#dimensionality reduction
import sklearn.manifold
#math
import numpy as np
#plotting
import matplotlib.pyplot as plt
#parse dataset
import pandas as pd
#visualization
import seaborn as sns

In [8]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


Set up logging

In [9]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Download NLTK tokenizer models (only the first time)

In [34]:
##stopwords: common words like "the", "at", "a", "an" that carry little meaning
##punkt: pre-trained model for tokenization into sentences
##http://www.nltk.org/

nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DTILAK\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DTILAK\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Prepare Corpus

Load books from files

In [43]:
#get the book names, matching txt files
#raw string so the backslashes in the Windows path are not read as escape sequences
book_filenames = sorted(glob.glob(r"..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\*.txt"))
print("Found books:")
book_filenames

Found books:


['..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\1.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\10.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\11.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\12.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\13.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\14.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\15.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\16.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\17.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input\\18.txt',
 '..\\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\\input
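
The listing above is in lexicographic order (1, 10, 11, ...) because `sorted` compares the file names as strings. If the books should be read in numeric order instead, a sort key along the following lines could be used. This is a sketch that assumes file names of the form `<number>.txt`; `book_number` is a hypothetical helper, and the example paths are made up.

```python
import os
import re

def book_number(path):
    """Return the leading integer of a file name such as '12.txt'."""
    name = os.path.basename(path)
    match = re.match(r"(\d+)", name)
    # Files without a leading number sort last.
    return int(match.group(1)) if match else float("inf")

example = ["input/10.txt", "input/2.txt", "input/1.txt"]
print(sorted(example, key=book_number))
# ['input/1.txt', 'input/2.txt', 'input/10.txt']
```

In the notebook this would correspond to `sorted(glob.glob(...), key=book_number)`.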

Combine the books into one string

In [44]:
#step 1 process data

#initialize a raw unicode string; we'll add all text to it in memory
corpus_raw = u""

#for each book, open it in UTF-8 format, read it,
#and add it to the raw corpus
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print ("Corpus is now {0} characters long".format(len(corpus_raw)))
    print ()

Reading '..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\1.txt'...
Corpus is now 295412 characters long

Reading '..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\10.txt'...
Corpus is now 325640 characters long

Reading '..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\11.txt'...
Corpus is now 337972 characters long

Reading '..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\12.txt'...
Corpus is now 364937 characters long

Reading '..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\13.txt'...
Corpus is now 390015 characters long

Reading '..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\14.txt'...
Corpus is now 404662 characters long

Reading '..\Mahabharata_extract-semantic-similarities_Natural-languageprocessing\input\15.txt'...
Corpus is now 417222 characters long

Reading '..\Mahabharata_extract-semantic-similari

Split the corpus into sentences

In [45]:
#tokenization: load the pre-trained punkt sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [48]:
#tokenize into sentences
raw_sentences = tokenizer.tokenize(corpus_raw)

In [49]:
#convert a sentence into a list of words:
#replace every non-letter character (punctuation, digits, hyphens) with a space,
#then split on whitespace
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [50]:
#sentence where each word is tokenized
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [51]:
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

Above all these qualities, he was a devoted servant of Lord Vishnu, and therefore he was given the title, "King of kings".
[u'Above', u'all', u'these', u'qualities', u'he', u'was', u'a', u'devoted', u'servant', u'of', u'Lord', u'Vishnu', u'and', u'therefore', u'he', u'was', u'given', u'the', u'title', u'King', u'of', u'kings']
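
Because the pattern `[^a-zA-Z]` replaces every character that is not an ASCII letter, digits disappear, hyphenated words are split, and possessives leave a stray `s` token. The sketch below repeats the function from the cell above so that it runs on its own; the sample sentence is made up for illustration.

```python
import re

def sentence_to_wordlist(raw):
    # Same logic as the notebook cell: non-letters become spaces, then split.
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.split()

sample = "Nakula's 2 horses fell; the well-known hero fought on."
print(sentence_to_wordlist(sample))
# ['Nakula', 's', 'horses', 'fell', 'the', 'well', 'known', 'hero', 'fought', 'on']
```

Whether stray tokens such as `s` matter depends on the analysis; they are frequent enough to survive the `min_count` threshold used later.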


In [52]:
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))

The book corpus contains 293,755 tokens


Train Word2Vec

In [53]:
#step 3 - build model
#once we have vectors, they help with 3 main tasks:
#DISTANCE, SIMILARITY, RANKING

# Dimensionality of the resulting word vectors.
#more dimensions are more computationally expensive to train,
#but can capture more detail when the corpus is large enough
num_features = 300
# Minimum word count threshold.
min_word_count = 3

# Number of threads to run in parallel.
#more workers, faster we train
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsample setting for frequent words.
#gensim's documentation suggests a useful range of (0, 1e-5)
downsampling = 1e-3

# Seed for the RNG, to make the results reproducible.
#random number generator
#deterministic, good for debugging
seed = 1
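
To make the `downsampling` setting more concrete: the original word2vec paper describes discarding each occurrence of a word with probability 1 - sqrt(t / f), where t is the threshold and f is the word's relative frequency in the corpus. The sketch below implements that rule as stated in the paper; gensim's internal formula may differ slightly, and the frequencies are made-up examples rather than values from this corpus.

```python
import math

def keep_probability(word_frequency, threshold=1e-3):
    """Probability of keeping one occurrence of a word under the
    paper's subsampling rule (discard with probability 1 - sqrt(t / f))."""
    if word_frequency <= 0:
        raise ValueError("word_frequency must be positive")
    return min(1.0, math.sqrt(threshold / word_frequency))

# Rare words are always kept; very frequent words are mostly dropped.
for freq in (0.0005, 0.004, 0.05):
    print(freq, round(keep_probability(freq), 3))
```

Under this rule, only words whose relative frequency exceeds the threshold are affected at all, which is why a larger threshold touches fewer words.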

In [54]:
mahabharata2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,  #this parameter is named vector_size in gensim 4.0 and later
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

In [55]:
mahabharata2vec.build_vocab(sentences)

2017-03-20 13:07:41,523 : INFO : collecting all words and their counts
2017-03-20 13:07:41,523 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-03-20 13:07:41,569 : INFO : PROGRESS: at sentence #10000, processed 165878 words, keeping 9238 word types
2017-03-20 13:07:41,605 : INFO : collected 11439 word types from a corpus of 293755 raw words and 17725 sentences
2017-03-20 13:07:41,605 : INFO : Loading a fresh vocabulary
2017-03-20 13:07:41,628 : INFO : min_count=3 retains 5703 unique words (49% of original 11439, drops 5736)
2017-03-20 13:07:41,631 : INFO : min_count=3 leaves 286413 word corpus (97% of original 293755, drops 7342)
2017-03-20 13:07:41,650 : INFO : deleting the raw counts dictionary of 11439 items
2017-03-20 13:07:41,651 : INFO : sample=0.001 downsamples 52 most-common words
2017-03-20 13:07:41,651 : INFO : downsampling leaves estimated 214323 word corpus (74.8% of prior 286413)
2017-03-20 13:07:41,654 : INFO : estimated required memory for

In [56]:
print("Word2Vec vocabulary length:", len(mahabharata2vec.wv.vocab))

Word2Vec vocabulary length: 5703
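
The vocabulary is smaller than the number of distinct word types because of the `min_count` threshold. The following sketch shows the idea with a plain `Counter`; it is an illustration on toy sentences, not gensim's actual implementation, and `filter_vocabulary` is a hypothetical helper name.

```python
from collections import Counter

def filter_vocabulary(tokenized_sentences, min_count):
    """Count tokens and keep only those seen at least min_count times."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ["Arjuna", "took", "his", "bow"],
    ["Arjuna", "saw", "Krishna"],
    ["Krishna", "spoke", "to", "Arjuna"],
]
print(filter_vocabulary(toy_sentences, min_count=2))
# {'Arjuna': 3, 'Krishna': 2}
```

Applied to the `sentences` list with `min_count=3`, a count like this should be comparable to the figure reported in the log, although that has not been checked here.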


Start training, this might take a minute or two...

In [57]:
#note: newer gensim releases require explicit total_examples and epochs arguments here
mahabharata2vec.train(sentences)

2017-03-20 13:07:44,782 : INFO : training model with 4 workers on 5703 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=7
2017-03-20 13:07:44,786 : INFO : expecting 17725 sentences, matching count from corpus used for vocabulary survey
2017-03-20 13:07:45,813 : INFO : PROGRESS: at 14.64% examples, 156199 words/s, in_qsize 7, out_qsize 0
2017-03-20 13:07:46,835 : INFO : PROGRESS: at 31.27% examples, 163907 words/s, in_qsize 8, out_qsize 0
2017-03-20 13:07:47,842 : INFO : PROGRESS: at 47.59% examples, 167107 words/s, in_qsize 8, out_qsize 0
2017-03-20 13:07:48,855 : INFO : PROGRESS: at 61.88% examples, 163409 words/s, in_qsize 7, out_qsize 0
2017-03-20 13:07:49,871 : INFO : PROGRESS: at 78.11% examples, 164859 words/s, in_qsize 8, out_qsize 0
2017-03-20 13:07:50,948 : INFO : PROGRESS: at 94.27% examples, 164610 words/s, in_qsize 8, out_qsize 0
2017-03-20 13:07:51,160 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-03-20 13:07:51,191 : IN

1071197

Save to file, can be useful later

In [58]:
if not os.path.exists("trained"):
    os.makedirs("trained")

In [59]:
mahabharata2vec.save(os.path.join("trained", "mahabharata2vec.w2v"))

2017-03-20 13:07:54,262 : INFO : saving Word2Vec object under trained\mahabharata2vec.w2v, separately None
2017-03-20 13:07:54,263 : INFO : not storing attribute syn0norm
2017-03-20 13:07:54,265 : INFO : not storing attribute cum_table
2017-03-20 13:07:54,351 : INFO : saved trained\mahabharata2vec.w2v


Explore the trained model.

In [60]:
mahabharata2vec = w2v.Word2Vec.load(os.path.join("trained", "mahabharata2vec.w2v"))

2017-03-20 13:07:57,621 : INFO : loading Word2Vec object from trained\mahabharata2vec.w2v
2017-03-20 13:07:57,657 : INFO : loading wv recursively from trained\mahabharata2vec.w2v.wv.* with mmap=None
2017-03-20 13:07:57,660 : INFO : setting ignored attribute syn0norm to None
2017-03-20 13:07:57,661 : INFO : setting ignored attribute cum_table to None
2017-03-20 13:07:57,665 : INFO : loaded trained\mahabharata2vec.w2v


Compress the word vectors into 3D space and plot them

In [61]:
#project the 300-dimensional word vectors onto 3 t-SNE components
tsne = sklearn.manifold.TSNE(n_components=3, random_state=0)

In [62]:
all_word_vectors_matrix = mahabharata2vec.wv.syn0

Train t-SNE, this could take a minute or two...

In [64]:
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

MemoryError: 
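
The `MemoryError` means t-SNE did not finish on this machine, so the cells that follow have no output. One possible workaround, which is an assumption and has not been tested on this setup, is to embed only the most frequent words. The helper below (`most_frequent_rows`, a hypothetical name) works on plain parallel lists so it can be run on its own; in this notebook the counts could be taken from `mahabharata2vec.wv.vocab[word].count`, assuming the gensim version used here.

```python
def most_frequent_rows(words, counts, vectors, limit):
    """Keep the `limit` most frequent words and their vectors.

    words, counts and vectors are parallel sequences (row i belongs to words[i]).
    """
    order = sorted(range(len(words)), key=lambda i: counts[i], reverse=True)[:limit]
    return [words[i] for i in order], [vectors[i] for i in order]

# Toy data standing in for the real vocabulary and its 300-dimensional vectors.
words = ["the", "Arjuna", "chariot", "discus"]
counts = [900, 350, 40, 3]
vectors = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.9], [0.7, 0.4]]

kept_words, kept_vectors = most_frequent_rows(words, counts, vectors, limit=2)
print(kept_words)
# ['the', 'Arjuna']
```

The reduced matrix could then be passed to `tsne.fit_transform`. Whether it fits in memory depends on the machine and the scikit-learn version, and the plotting code would need to look words up by their position in `kept_words` rather than by the original vocabulary index.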

Plot the big picture

In [None]:
points = pd.DataFrame(
    [
        (word, coords[0], coords[1], coords[2])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[mahabharata2vec.wv.vocab[word].index])
            for word in mahabharata2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y", "z"]
)

In [None]:
points.head(10)

In [None]:
sns.set_context("poster")

In [None]:
points.plot.scatter("x", "y", c = "z",s=10, figsize=(12, 12))

In [None]:
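def plot_region(x_bounds, y_bounds):
    #select the points that fall inside the given bounding box
    #(named region to avoid shadowing the built-in slice)
    region = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) &
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1])
    ]

    #scatter plot of the region, with each point labeled by its word
    ax = region.plot.scatter("x", "y", s=35, figsize=(10, 8))
    for i, point in region.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)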

Zoom in on a region of the plot to inspect which words ended up close together

In [None]:
plot_region(x_bounds=(4.0, 4.2), y_bounds=(-0.5, -0.1))

Zoom in on a second region in the same way

In [None]:
plot_region(x_bounds=(0, 1), y_bounds=(4, 4.5))

Explore semantic similarities between book characters. Words closest to the given word

In [None]:
mahabharata2vec.most_similar("Krishna")

In [None]:
mahabharata2vec.most_similar("Arjuna")

In [None]:
mahabharata2vec.most_similar("Karna")

In [None]:
mahabharata2vec.most_similar("Vrishasena")
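
Conceptually, `most_similar` ranks the other vocabulary words by their cosine similarity to the query word's vector. The sketch below reproduces that idea in plain Python on hand-made three-dimensional vectors. The numbers are invented for illustration and say nothing about what the trained model returns.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Rank all other words by cosine similarity to `query`."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, not output of the trained model.
toy_vectors = {
    "Krishna": [0.9, 0.1, 0.3],
    "Arjuna": [0.8, 0.2, 0.4],
    "chariot": [0.1, 0.9, 0.2],
    "river": [-0.5, 0.3, -0.7],
}
print(most_similar("Krishna", toy_vectors, topn=2))
```

Here `Arjuna` ranks first because its toy vector points in almost the same direction as the one for `Krishna`.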

Linear relationships between word pairs

In [None]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = mahabharata2vec.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2

In [None]:
nearest_similarity_cosmul("Dhritarastra", "Pandu", "Nakula")
nearest_similarity_cosmul("Bhima", "Arjuna", "Ambika")
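
The multiplicative objective behind `most_similar_cosmul` can be sketched as follows: each candidate is scored by the product of its similarities to the positive words, divided by its similarity to the negative word, with cosines shifted from [-1, 1] to [0, 1]. The code is a plain-Python approximation under that assumption, not gensim's implementation; `analogy_cosmul` is a hypothetical name and the toy vectors are hand-made so that the analogy works.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shifted(u, v):
    """Map cosine similarity from [-1, 1] to [0, 1] so products stay non-negative."""
    return (1.0 + cosine(u, v)) / 2.0

def analogy_cosmul(start1, end1, end2, vectors, eps=1e-6):
    """Return the word scoring highest with positive=[end2, start1], negative=[end1]."""
    best_word, best_score = None, -1.0
    for word, vec in vectors.items():
        if word in (start1, end1, end2):
            continue  # the query words themselves are excluded
        score = (shifted(vec, vectors[end2]) * shifted(vec, vectors[start1])
                 / (shifted(vec, vectors[end1]) + eps))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Invented vectors: axis 0 and 1 separate two households, axis 2 marks the king.
toy_vectors = {
    "Pandu": [1.0, 0.0, 1.0],
    "Kunti": [1.0, 0.0, 0.0],
    "Dhritarashtra": [0.0, 1.0, 1.0],
    "Gandhari": [0.0, 1.0, 0.0],
    "Bhima": [1.0, 0.0, 0.5],
    "chariot": [0.3, 0.3, 0.1],
}
print(analogy_cosmul("Pandu", "Kunti", "Gandhari", toy_vectors))
```

The argument order mirrors `nearest_similarity_cosmul` above, so the call reads "Pandu is related to Kunti, as ? is related to Gandhari".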

In [None]:
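#pos_tag relies on an NLTK tagger model that must be downloaded separately
#(it is not covered by the punkt and stopwords downloads above)
from nltk.tag import pos_tag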

sentence = "Vrishasena Ambalika at the death of Duhshasana and Chitrasena rushed against Nakula desiring to fight with his father's enemy. A fierce battle then ensued between those two heroes. Vrishasena managed to kill Nakula's horses and pierce him with many arrows. Descending from his chariot, Nakula took up his sword and shield, and making his way toward Vrishasena, he severed the heads of two thousand horsemen. Vrishasena, seeing Nakula coming towards him whirling that sword like a discus, shattered the sword and shield with four crescent shaped arrows. Nakula then quickly ascended Bhima's chariot. As Arjuna came near, Nakula requested him Please slay this sinful person Arjuna then ordered Lord Krishna Proceed toward the son of Karna."
tagged_sent = pos_tag(sentence.split())
print (tagged_sent)

propernouns = [word for word,pos in tagged_sent if pos == 'NNP']
print (propernouns)