<h2>Word2Vec (English Version)</h2>
<p>In this notebook, we train a word2vec model on the Quora Question Pairs dataset, available here: https://www.kaggle.com/c/quora-question-pairs</p>

In [21]:
# -*- coding: utf-8 -*-
#Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import timeit

In [11]:
dataset = pd.read_csv("dataset/quora_train.csv")
dataset.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


<h2>Cleaning the data</h2>
<p>Let's clean the questions and split them into sentences, where each sentence is a list of lowercase words.</p>

In [14]:
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
import nltk.data
# Raw string: in a normal string literal, "\n" would be read as a newline
nltk.data.path.append(r"D:\nltk_data")

In [13]:
def review_to_wordlist( question, remove_stopwords=False ):
    # Function to convert a question to a list of words.
    #
    # 1. Replace every non-letter character with a space
    question_text = re.sub("[^a-zA-Z]"," ", question)
    #
    # 2. Convert words to lower case and split them
    words = question_text.lower().split()
    #
    # 3. Optionally remove stop words (False by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    #
    # 4. Return a list of words
    return words
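
<p>To make the cleaning step concrete, here is a minimal sketch of what the regular expression does to a question. The helper name <code>clean_and_split</code> is hypothetical and only mirrors the two steps above with stop-word removal turned off. The second sample sentence is made up for illustration.</p>

```python
import re

def clean_and_split(text):
    # Hypothetical helper mirroring review_to_wordlist without stop-word removal:
    # replace every non-letter character with a space, then lowercase and split.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower().split()

# Punctuation and hyphens become word boundaries.
print(clean_and_split("What is the story of Kohinoor (Koh-i-Noor) Diamond?"))

# Digits and symbols are discarded, so only the letters of the markup survive.
print(clean_and_split("Is [math]23^{24}[/math] even?"))
```

<p>Note that numbers are lost entirely with this pattern, which may or may not be what you want for questions about math or dates.</p>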

In [15]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [16]:
def review_to_sentences( question, tokenizer, remove_stopwords=False ):
    # Function to split a question into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    #
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(question.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append( review_to_wordlist( raw_sentence,
              remove_stopwords ))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
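
<p>The list-of-lists shape produced by this function can be illustrated without NLTK. The sketch below swaps the punkt tokenizer for a naive regular-expression splitter, so it is only a stand-in: punkt handles abbreviations and other edge cases that this version does not. The function names and the sample text are assumptions made for the example.</p>

```python
import re

def split_sentences(text):
    # Naive stand-in for the punkt tokenizer (an assumption, not NLTK's logic):
    # break after ".", "!" or "?" when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def to_word_lists(text):
    # One list of lowercase words per sentence, like review_to_sentences.
    result = []
    for sentence in split_sentences(text):
        words = re.sub("[^a-zA-Z]", " ", sentence).lower().split()
        result.append(words)
    return result

# A made-up question containing two sentences.
print(to_word_lists("Why is the sky blue? How can I check it?"))
```

<p>Word2Vec expects exactly this structure as input: an iterable of sentences, each one a list of tokens.</p>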

In [23]:
sentences = []  # Initialize an empty list of sentences
print("Parsing sentences from dataset q1")
start = timeit.default_timer()
for question in dataset["question1"]:
    sentences += review_to_sentences(str(question), tokenizer)
stop = timeit.default_timer()
print("Finished parsing q1 : "+str(round(stop - start,2))+"s.")
print("Parsing sentences from dataset q2")
for question in dataset["question2"]:
    sentences += review_to_sentences(str(question), tokenizer)
stop = timeit.default_timer()
# start is not reset, so this is the cumulative time for q1 and q2
print("Finished parsing q2 : "+str(round(stop - start,2))+"s.")

Parsing sentences from dataset q1
Finished parsing q1 : 9.87s.
Parsing sentences from dataset q2
Finished parsing q2 : 20.46s.


In [24]:
sentences[0:2]

[['what',
  'is',
  'the',
  'step',
  'by',
  'step',
  'guide',
  'to',
  'invest',
  'in',
  'share',
  'market',
  'in',
  'india'],
 ['what',
  'is',
  'the',
  'story',
  'of',
  'kohinoor',
  'koh',
  'i',
  'noor',
  'diamond']]

In [25]:
# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Set values for various parameters
num_features = 500    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
# Note: gensim 4.0 renamed the `size` argument to `vector_size`
from gensim.models import word2vec
print("Training model...")
model = word2vec.Word2Vec(sentences, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)

2017-04-28 07:31:04,221 : INFO : 'pattern' package not found; tag filters are not available for English
2017-04-28 07:31:04,282 : INFO : collecting all words and their counts
2017-04-28 07:31:04,284 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-28 07:31:04,314 : INFO : PROGRESS: at sentence #10000, processed 99111 words, keeping 10771 word types
2017-04-28 07:31:04,349 : INFO : PROGRESS: at sentence #20000, processed 198901 words, keeping 15372 word types
2017-04-28 07:31:04,372 : INFO : PROGRESS: at sentence #30000, processed 297847 words, keeping 18786 word types
2017-04-28 07:31:04,397 : INFO : PROGRESS: at sentence #40000, processed 396378 words, keeping 21570 word types
2017-04-28 07:31:04,434 : INFO : PROGRESS: at sentence #50000, processed 495897 words, keeping 23996 word types
2017-04-28 07:31:04,464 : INFO : PROGRESS: at sentence #60000, processed 595251 words, keeping 26209 word types


Training model...


2017-04-28 07:31:04,487 : INFO : PROGRESS: at sentence #70000, processed 693633 words, keeping 28128 word types
2017-04-28 07:31:04,518 : INFO : PROGRESS: at sentence #80000, processed 793440 words, keeping 29888 word types
2017-04-28 07:31:04,553 : INFO : PROGRESS: at sentence #90000, processed 892903 words, keeping 31430 word types
2017-04-28 07:31:04,576 : INFO : PROGRESS: at sentence #100000, processed 992033 words, keeping 32934 word types
2017-04-28 07:31:04,608 : INFO : PROGRESS: at sentence #110000, processed 1091774 words, keeping 34379 word types
2017-04-28 07:31:04,637 : INFO : PROGRESS: at sentence #120000, processed 1190774 words, keeping 35739 word types
2017-04-28 07:31:04,665 : INFO : PROGRESS: at sentence #130000, processed 1289539 words, keeping 36976 word types
2017-04-28 07:31:04,695 : INFO : PROGRESS: at sentence #140000, processed 1389467 words, keeping 38253 word types
2017-04-28 07:31:04,729 : INFO : PROGRESS: at sentence #150000, processed 1488913 words, keepin
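
<p>The "keeping N word types" messages count distinct words before any pruning. With <code>min_word_count = 40</code>, words seen fewer than 40 times are then ignored. The sketch below is a simplified illustration of that filtering on toy data, not gensim's actual code. The function name <code>prune_vocabulary</code> and the toy sentences are assumptions.</p>

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    # Count every token across all sentences.
    counts = Counter(word for sentence in sentences for word in sentence)
    # Keep only the words seen at least min_count times.
    vocab = {word for word, n in counts.items() if n >= min_count}
    # Drop the pruned words from each sentence.
    pruned = [[w for w in sentence if w in vocab] for sentence in sentences]
    return vocab, pruned

# Toy corpus (made up): "java" and "fast" appear only once.
toy = [["what", "is", "python"],
       ["what", "is", "java"],
       ["is", "python", "fast"]]

vocab, pruned = prune_vocabulary(toy, min_count=2)
print(sorted(vocab))
print(pruned)
```

<p>A higher threshold gives a smaller vocabulary with better-estimated vectors, at the cost of having no vector at all for rare words.</p>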

In [26]:
# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
# Note: init_sims is deprecated in gensim 4.0, where it is no longer needed.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "quora_context"
model.save(model_name)

2017-04-28 07:32:17,882 : INFO : precomputing L2-norms of word weight vectors
2017-04-28 07:32:17,947 : INFO : saving Word2Vec object under quora_context, separately None
2017-04-28 07:32:17,948 : INFO : not storing attribute syn0norm
2017-04-28 07:32:17,950 : INFO : not storing attribute cum_table
2017-04-28 07:32:18,440 : INFO : saved quora_context


<h2>Word2Vec obtained</h2>
<p>Now we can use the trained model to explore relations between words.</p>

In [62]:
model.wv.doesnt_match(["java","python","c","javascript"])

'c'

In [63]:
model.wv.doesnt_match(["dog","elephant","shark","horse"])
# Possibly 'dog' because it is a pet rather than a wild animal?

'dog'
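
<p>To give some intuition for these answers: as far as I understand gensim's implementation, <code>doesnt_match</code> averages the unit-length vectors of the given words and returns the word whose vector is least similar to that average. The sketch below reproduces that idea in plain Python. The function names and the 2-dimensional toy vectors are made up and do not come from the trained model.</p>

```python
import math

def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    # vectors: dict mapping word -> list of floats (toy embedding).
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    # Mean of the unit vectors, itself rescaled to length 1.
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    # The odd one out has the lowest cosine similarity to the mean.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: three point in a similar direction, one does not.
toy = {
    "java": [0.9, 0.1],
    "python": [0.8, 0.2],
    "javascript": [0.85, 0.15],
    "banana": [0.1, 0.9],
}
print(odd_one_out(toy))
```

<p>This also suggests why results such as 'c' can look surprising: the answer depends only on how the words were used in this corpus, not on what they mean.</p>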

In [64]:
model.wv.most_similar('programming')

[('coding', 0.7410617470741272),
 ('python', 0.6689383387565613),
 ('java', 0.6520576477050781),
 ('javascript', 0.5605449080467224),
 ('c', 0.5593013167381287),
 ('programmer', 0.5227774381637573),
 ('php', 0.5206589698791504),
 ('algorithms', 0.5053951740264893),
 ('linux', 0.501064121723175),
 ('framework', 0.4715242385864258)]

In [65]:
model.wv.most_similar('danger')

[('witch', 0.5302742719650269),
 ('justified', 0.49773040413856506),
 ('louisiana', 0.49603551626205444),
 ('scientology', 0.4932976961135864),
 ('missouri', 0.4668501615524292),
 ('majority', 0.4666014015674591),
 ('racial', 0.464800626039505),
 ('detroit', 0.4604220688343048),
 ('wisconsin', 0.459244042634964),
 ('somalia', 0.4563840627670288)]

In [66]:
model.wv.most_similar('strict')

[('unfair', 0.48129281401634216),
 ('violent', 0.47839125990867615),
 ('serious', 0.47685545682907104),
 ('diplomatic', 0.4541323482990265),
 ('prevalent', 0.4492725431919098),
 ('liberal', 0.4460740089416504),
 ('controversial', 0.44428908824920654),
 ('bullies', 0.44424617290496826),
 ('conservative', 0.44246068596839905),
 ('ignorant', 0.4410874545574188)]

In [67]:
model.wv.most_similar('okay')

[('ok', 0.9252203106880188),
 ('acceptable', 0.6984410881996155),
 ('advisable', 0.6913833022117615),
 ('normal', 0.6496853828430176),
 ('safe', 0.6454579830169678),
 ('necessary', 0.6417340040206909),
 ('possible', 0.6039286851882935),
 ('fine', 0.5829481482505798),
 ('weird', 0.5782525539398193),
 ('appropriate', 0.5637184381484985)]
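
<p>The scores in the lists above are cosine similarities between word vectors. As a rough sketch of the ranking, the code below scores every other word against the query and keeps the top results. The toy vectors and the standalone <code>most_similar</code> function are assumptions for illustration and are not gensim's implementation.</p>

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    # Score every other word against the query, highest similarity first.
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors, not taken from the trained model.
toy = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "dog": [0.6, 0.8],
    "car": [0.0, 1.0],
}
for neighbour, score in most_similar("cat", toy):
    print(neighbour, round(score, 3))
```

<p>A score close to 1 means the two words appear in very similar contexts, which is why 'ok' ranks so high for 'okay' while the neighbours of 'danger' are much weaker.</p>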

<h3>Read more: https://www.kaggle.com/c/word2vec-nlp-tutorial</h3>