## This notebook shows how to create word vectors using the word2vec pipeline

The notebook shows how to generate and save word embeddings from tweets. The tweets are preprocessed to handle URLs, email addresses, etc., after which they are passed to the word2vec vectorizer.

After the embeddings are generated, they are saved to disk. They are then reloaded and used for several lookups, such as retrieving embeddings for words and sentences and finding the words closest to a given word in the embedding space.

### Set directories, import libraries

In [1]:
## Set training file name
import os
working_dir = "C:\\Users\\remoteuser\\Desktop\\Data\\Sentiment140_Classification"

training_filename = os.path.join(working_dir, "training_text.csv")
embeddings_file_path = os.path.join(working_dir, "w2vec.txt")

In [2]:
# Show azureml-tatk version
!pip show azureml-tatk

Name: azureml-tatk
Version: 0.1.18123.2b3
Summary: Microsoft Azure Machine Learning Package for Text Analytics
Home-page: https://microsoft.sharepoint.com/teams/TextAnalyticsPackagePreview
Author: Microsoft Corporation
Author-email: amltap@microsoft.com
License: UNKNOWN
Location: c:\users\remoteuser\appdata\local\amlworkbench\python\lib\site-packages
Requires: azure-ml-api-sdk, pytest, pandas, nltk, gensim, scikit-learn, scipy, pdfminer.six, ipython, requests, ipywidgets, numpy, nose, ruamel.yaml, keras, matplotlib, jsonpickle, azure-storage, h5py, bqplot, validators, sklearn-crfsuite, qgrid, docker, dill, unidecode, lxml, pyspark
Required-by: 


In [4]:
from __future__ import absolute_import
from __future__ import division
import collections
import math
import os
import sys
import random
import numpy as np
from six.moves import urllib
from six.moves import xrange  
import tensorflow as tf
from timeit import default_timer as timer
import pandas as pd
import re
import io
from nltk.tokenize import TweetTokenizer
import num2words

from tatk.feature_extraction.word2vec_vectorizer import Word2VecVectorizer
from tatk.feature_extraction.callable_vectorizer import CallableVectorizer


### Create functions for loading and preprocessing tweets

In [9]:
# Data processing
# In the following code, we replace emoticons with special labels.
# Emails, URLs, etc. are handled by the pre-processing functions in a later cell.
pos_emoticons=["(^.^)","(^-^)","(^_^)","(^_~)","(^3^)","(^o^)","(~_^)","*)",":)",":*",":-*",":]",":^)",":}",
               ":>",":3",":b",":-b",":c)",":D",":-D",":O",":-O",":o)",":p",":-p",":P",":-P",":Þ",":-Þ",":X",
               ":-X",";)",";-)",";]",";D","^)","^.~","_)m"," ~.^","<=8","<3","<333","=)","=///=","=]","=^_^=",
               "=<_<=","=>.<="," =>.>="," =3","=D","=p","0-0","0w0","8D","8O","B)","C:","d'-'","d(>w<)b",":-)",
               "d^_^b","qB-)","X3","xD","XD","XP","ʘ‿ʘ","❤","💜","💚","💕","💙","💛","💓","💝","💖","💞",
               "💘","💗","😗","😘","😙","😚","😻","😀","😁","😃","☺","😄","😆","😇","😉","😊","😋","😍",
               "😎","😏","😛","😜","😝","😮","😸","😹","😺","😻","😼","👍"]

neg_emoticons=["--!--","(,_,)","(-.-)","(._.)","(;.;)9","(>.<)","(>_<)","(>_>)","(¬_¬)","(X_X)",":&",":(",":'(",
               ":-(",":-/",":-@[1]",":[",":\\",":{",":<",":-9",":c",":S",";(",";*(",";_;","^>_>^","^o)","_|_",
               "`_´","</3","<=3","=/","=\\",">:(",">:-(","💔","☹️","😌","😒","😓","😔","😕","😖","😞","😟",
               "😠","😡","😢","😣","😤","😥","😦","😧","😨","😩","😪","😫","😬","😭","😯","😰","😱","😲",
               "😳","😴","😷","😾","😿","🙀","💀","👎"]

emoticonsDict = {}
for i,each in enumerate(pos_emoticons):
    emoticonsDict[each]=' POS_EMOTICON_'+num2words.num2words(i).upper()+' '
    
for i,each in enumerate(neg_emoticons):
    emoticonsDict[each]=' NEG_EMOTICON_'+num2words.num2words(i).upper()+' '
    
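# Build the lookup of escaped emoticons and compile a single alternation pattern.
# Longer emoticons are tried first so that "<333" is not matched as "<3" plus "33".
rep = dict((re.escape(k), v) for k, v in emoticonsDict.items())
emoticonsPattern = re.compile("|".join(sorted(rep.keys(), key=len, reverse=True)))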

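# Read in files
def read_tweets(filename):
    """Read raw tweets from a file, one tweet per line.

    Emoticons are replaced with special tokens, characters repeated three or
    more times are collapsed to two, and each tweet is re-tokenized.
    """
    tokenizer = TweetTokenizer()
    padded_lines = []
    with open(filename, 'r') as f:
        for line in f:
            # Note: the line is lower-cased before matching, so emoticons that
            # contain upper-case letters (e.g. ":D") do not match here.
            line = emoticonsPattern.sub(lambda m: rep[re.escape(m.group(0))], line.lower().strip())
            # Collapse runs of 3+ identical characters, e.g. "sooo" -> "soo"
            line = re.sub(r'(.)\1{2,}', r'\1\1', line)
            # Tokenize with NLTK's tweet-aware tokenizer and re-join with spaces
            line = ' '.join(tokenizer.tokenize(line))
            padded_lines.append(line)
    return padded_lines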
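
The emoticon replacement above relies on one compiled alternation pattern plus a dictionary lookup. The sketch below reproduces that idea on a tiny, made-up emoticon table (the names `emoticon_labels`, `replace_emoticons` and `squeeze_repeats` are illustrative and not part of the notebook). It assumes that longer emoticons should win over their prefixes, which is why the alternatives are sorted by length.

```python
import re

# Hypothetical, much smaller emoticon table than the notebook's lists.
emoticon_labels = {
    "<3": " POS_EMOTICON_ZERO ",
    "<333": " POS_EMOTICON_ONE ",
    ":)": " POS_EMOTICON_TWO ",
    ":(": " NEG_EMOTICON_ZERO ",
}

# Try longer emoticons first so "<333" is not consumed as "<3" followed by "33".
ordered = sorted(emoticon_labels, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(e) for e in ordered))

def replace_emoticons(text):
    """Replace every known emoticon in text with its label."""
    return pattern.sub(lambda m: emoticon_labels[m.group(0)], text)

def squeeze_repeats(text):
    """Collapse runs of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

example = squeeze_repeats(replace_emoticons("soooo happy :) <333"))
print(example.split())
```

Looking the match up directly in the dictionary (rather than re-escaping it) works here because the dictionary keys are the unescaped emoticons.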

### Read in and preprocess tweets

In [10]:
tweets = read_tweets(training_filename)
df = pd.DataFrame({'raw_tweets':tweets})
display(df[:5])

|  | raw_tweets |
|---|---|
| 0 | going to fort smith today . stoked for the res... |
| 1 | argh .. omniture discover 2 is annoying me .. |
| 2 | @colinloretz hah ! thanks colin . |
| 3 | back from ambleside but am happy in my tummy .. |
| 4 | @__parasite__ it can just eat the people we do... |


In [11]:
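# Define functions which are used to pre-process tweets.
# Patterns are applied with re.sub(): str.replace() treats its first argument
# as a literal string, so with it only the literal '@' in atReplace would match
# (the sample output shown below is consistent with that literal behavior).
def to_lower_case(x):
    return x.lower()

def emailsReplace(x):
    return re.sub(r'[\w\.-]+@[\w\.-]+', ' EMAIL ', x)

def numsReplace(x):
    # Assumption: numbers are integers or simple decimals
    return re.sub(r'\b\d+(?:\.\d+)?\b', ' NUM ', x)

def userMentionsReplace(x):
    # '@name' that is not preceded by a word character, '.', or '-'
    return re.sub(r'(?<![A-Za-z0-9_.-])@([A-Za-z]+[A-Za-z0-9]+)', ' USER ', x)

def urlReplace(x):
    # Assumption: a URL runs from its http/https/ftp scheme to the next whitespace
    return re.sub(r'(?:f|ht)tps?://\S+', ' URL ', x)

def punctuationReplace(x):
    # A punctuation character that follows a word character and is not
    # itself followed by another punctuation character
    return re.sub(r'(?<=\w)[^\s\w](?![^\s\w])', ' PUN ', x)

def atReplace(x):
    # Literal replacement of any remaining '@'
    return x.replace('@', ' AT ')

# Chain functions into a list; they are applied in this order
featFuncs = [to_lower_case, emailsReplace, numsReplace, userMentionsReplace, urlReplace, punctuationReplace, atReplace]

# Create a transformer that applies the functions to the input column
# (named callable_vectorizer to avoid shadowing the built-in callable())
callable_vectorizer = CallableVectorizer(input_col="raw_tweets", output_col="tweets", feat_list=featFuncs, preprocessor = True)
processed_df = callable_vectorizer.tatk_fit_transform(df)

processed_df.head(3)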


CallableVectorizer::tatk_fit_transform ==> start
CallableVectorizer::tatk_fit_transform ==> end 	 Time taken: 0.1 mins


|  | raw_tweets | tweets |
|---|---|---|
| 0 | going to fort smith today . stoked for the res... | going to fort smith today . stoked for the res... |
| 1 | argh .. omniture discover 2 is annoying me .. | argh .. omniture discover 2 is annoying me .. |
| 2 | @colinloretz hah ! thanks colin . | AT colinloretz hah ! thanks colin . |
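
The replacement functions above are written around regular-expression patterns. In Python, `str.replace` treats its first argument as a literal string, while `re.sub` interprets it as a pattern. A minimal sketch of the difference, using a made-up sentence (the variable names are illustrative only):

```python
import re

text = "mail me at jane.doe@example.com or ping @jane"

# str.replace() looks for the literal characters of the pattern, finds none,
# and returns the text unchanged.
literal = text.replace(r'[\w\.-]+@[\w\.-]+', ' EMAIL ')

# re.sub() interprets the same string as a regular expression.
regex = re.sub(r'[\w\.-]+@[\w\.-]+', ' EMAIL ', text)

print(literal == text)
print(regex)
```

The bare mention `@jane` is left alone by the email pattern because it requires at least one character before the `@`.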


### Define & train word2vec vectorizer to generate word embeddings

In [12]:
word2vec = Word2VecVectorizer(input_col = 'tweets', output_col = '', embedding_size = 50, return_type = 'word_vector', context_window_size=5, min_df=5, num_workers=4)
# Fit on the preprocessed frame, which is the one shown above with the 'tweets' column
word2vec_model = word2vec.tatk_fit(processed_df)

Word2VecVectorizer::tatk_fit ==> start
vocabulary size =69951
Word2VecVectorizer::tatk_fit ==> end 	 Time taken: 0.89 mins
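
The notebook does not document what `context_window_size` and `min_df` do inside `Word2VecVectorizer`. Assuming they follow the usual word2vec conventions (a window of neighbouring positions around each word, and a minimum frequency below which words are dropped), the following pure-Python sketch shows how training pairs could be derived. `build_vocab` and `skipgram_pairs` are hypothetical helpers, not tatk functions.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep tokens that occur at least min_count times across all sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs at most `window` positions apart."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentences = [["i", "have", "a", "fever"], ["i", "have", "a", "cough"]]
vocab = build_vocab(sentences, min_count=2)

# Drop rare words before generating pairs (assumed order of operations).
filtered = [[t for t in s if t in vocab] for s in sentences]
pairs = list(skipgram_pairs(filtered[0], window=1))
print(pairs)
```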


### Save word2vec embeddings in a txt file

In [13]:
word2vec_model.save_embeddings(embeddings_file_path)

Word2VecVectorizer::save_embeddings ==> start
Time taken: 0.04 mins
Word2VecVectorizer::save_embeddings ==> end


### Load the embeddings into memory with include_unk set to True to add out-of-vocabulary (OOV) handling

In [14]:
vectorizer = Word2VecVectorizer.load_embeddings(embeddings_file_path, include_unk = True, unk_method = 'rnd', unk_vector = None, unk_word = '<UNK>')

Word2VecVectorizer::load_embeddings ==> start
Time taken: 0.06 mins
Word2VecVectorizer::load_embeddings ==> end
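
The file format written by `save_embeddings` is not shown in this notebook. As an illustration only, the sketch below assumes the common word2vec text layout (a header line with vocabulary size and dimension, then one word and its values per line) and assumes that `unk_method = 'rnd'` means a randomly initialised vector for the unknown-word token. The helper names are hypothetical.

```python
import os
import random
import tempfile

def save_text_embeddings(path, table):
    """Write a header "<vocab_size> <dim>" followed by one word per line."""
    dim = len(next(iter(table.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(table)} {dim}\n")
        for word, vec in table.items():
            f.write(word + " " + " ".join(repr(x) for x in vec) + "\n")

def load_text_embeddings(path, include_unk=True, unk_word="<UNK>", seed=0):
    """Read the file back; optionally add a random vector for unknown words."""
    table = {}
    with open(path, encoding="utf-8") as f:
        _, dim = map(int, f.readline().split())
        for line in f:
            word, *values = line.rstrip("\n").split(" ")
            table[word] = [float(v) for v in values]
    if include_unk and unk_word not in table:
        rng = random.Random(seed)
        table[unk_word] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return table

toy = {"fever": [0.1, 0.2, 0.3], "flu": [0.4, 0.5, 0.6]}
path = os.path.join(tempfile.mkdtemp(), "toy_w2v.txt")
save_text_embeddings(path, toy)
loaded = load_text_embeddings(path)
print(sorted(loaded))
```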


### Embedding Lookup: Get word indices.

In [15]:
df_predict = pd.DataFrame({'text' : ["I have fever", "this is a good book"]})
vectorizer.input_col = 'text'
vectorizer.output_col = 'indices'
vectorizer.return_type = 'word_index'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


|  | text | indices |
|---|---|---|
| 0 | I have fever | [5, 23, 1210] |
| 1 | this is a good book | [32, 15, 9, 34, 486] |
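
Conceptually, an index lookup maps each token to its row in the embedding table and falls back to the unknown-word row for anything outside the vocabulary. The sketch below is a plain-Python illustration with a made-up vocabulary. Whether the vectorizer lower-cases and splits on whitespace in the same way is an assumption, and `words_to_indices` is a hypothetical helper.

```python
def words_to_indices(text, word_to_index, unk_word="<UNK>"):
    """Map whitespace-separated tokens to ids, using the unk id for OOV words."""
    unk_id = word_to_index[unk_word]
    return [word_to_index.get(tok, unk_id) for tok in text.lower().split()]

# Toy vocabulary; the real indices come from the trained model.
toy_index = {"<UNK>": 0, "i": 1, "have": 2, "fever": 3}

ids = words_to_indices("I have fever", toy_index)
oov = words_to_indices("I have zzzz", toy_index)
print(ids, oov)
```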


### Embedding Lookup: Get word embeddings.

In [16]:
vectorizer.output_col = 'word_vector'
vectorizer.return_type = 'word_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


|  | text | indices | word_vector |
|---|---|---|---|
| 0 | I have fever | [5, 23, 1210] | [[0.07417800277471542, 1.1873890161514282, 0.0... |
| 1 | this is a good book | [32, 15, 9, 34, 486] | [[1.233780026435852, 0.49888598918914795, 0.98... |


### Embedding Lookup: Get sentence embedding from word embeddings
The sentence embedding is computed by averaging the word vectors.

In [17]:
vectorizer.output_col = 'sentence_vector'
vectorizer.return_type = 'sentence_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


|  | text | indices | word_vector | sentence_vector |
|---|---|---|---|---|
| 0 | I have fever | [5, 23, 1210] | [[0.07417800277471542, 1.1873890161514282, 0.0... | [-0.4754590118924777, 0.5704799890518188, 0.06... |
| 1 | this is a good book | [32, 15, 9, 34, 486] | [[1.233780026435852, 0.49888598918914795, 0.98... | [0.9528391897678375, 1.263062632083893, 0.8234... |
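
Averaging word vectors means taking the element-wise mean over all words in the sentence, so the sentence vector has the same dimension as each word vector. A minimal sketch with made-up two-dimensional vectors (`average_vectors` is an illustrative helper, not part of tatk):

```python
def average_vectors(word_vectors):
    """Element-wise mean of equally sized word vectors."""
    n = len(word_vectors)
    return [sum(component) / n for component in zip(*word_vectors)]

# Three toy word vectors standing in for the words of one sentence.
toy_word_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_vector = average_vectors(toy_word_vectors)
print(sentence_vector)
```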


### Embedding Lookup: Get the words most similar to a given word

In [18]:
vectorizer.embedding_table.most_similar('fever')

[('flu', 0.7631323337554932),
 ('bronchitis', 0.753889262676239),
 ('asthma', 0.7483243942260742),
 ('cough', 0.7394519448280334),
 ('migraine', 0.7222485542297363),
 ('toothache', 0.7108262181282043),
 ('headache', 0.7103149890899658),
 ('tonsillitis', 0.7091843485832214),
 ('glandular', 0.6958644986152649),
 ('infection', 0.6921930909156799)]
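
The scores above look like cosine similarities, which is what a gensim-style `most_similar` returns. Assuming `embedding_table` behaves that way, the ranking can be sketched in plain Python as follows. The toy vectors and the `cosine` and `most_similar` helpers are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, table, topn=10):
    """Rank all other words by cosine similarity to `word`, best first."""
    query = table[word]
    scored = [(other, cosine(query, vec)) for other, vec in table.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy embedding table; real vectors come from the trained model.
toy_table = {
    "fever": [1.0, 0.0],
    "flu": [0.9, 0.1],
    "book": [0.0, 1.0],
}
neighbours = most_similar("fever", toy_table, topn=2)
print(neighbours)
```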