## This notebook shows how to create word vectors using word2vec pipeline

The notebook shows how to generate and save word embeddings from tweets. The tweets are preprocessed to handle URLs, Email ids ets., after which they are input into the word2vec vectorizer.

After features are generated, they are saved. They are re-loaded and used for several lookups, such as checking embeddings for words and sentences, examining words which are close to a given word in feature space.

### Set directories, import libraries

In [2]:
## Set training file name
working_dir = "D:\Sentiment140_Classification"

training_filename = os.path.join(working_dir, "training_text.csv")
embeddings_file_path = os.path.join(working_dir, "w2vec.txt")

In [3]:
# Show azureml-tatk version
!pip show azureml-tatk

Name: azureml-tatk
Version: 0.1.18108.8a1
Summary: Microsoft Azure Machine Learning Package for Text Analytics
Home-page: https://microsoft.sharepoint.com/teams/TextAnalyticsPackagePreview
Author: Microsoft Corporation
Author-email: amltap@microsoft.com
License: UNKNOWN
Location: c:\users\remoteuser\appdata\local\amlworkbench\python\lib\site-packages
Requires: ipython, pdfminer.six, scikit-learn, h5py, pandas, ruamel.yaml, docker, pyspark, validators, sklearn-crfsuite, ipywidgets, azure-ml-api-sdk, unidecode, scipy, nltk, gensim, nose, azure-storage, qgrid, matplotlib, requests, numpy, pytest, keras, lxml, bqplot
Required-by: 


In [5]:
from __future__ import absolute_import
from __future__ import division
import collections
import math
import os
import sys
import random
import numpy as np
from six.moves import urllib
from six.moves import xrange  
import tensorflow as tf
from timeit import default_timer as timer
import pandas as pd
import re
import io
from nltk.tokenize import TweetTokenizer
import num2words

from tatk.feature_extraction.word2vec_vectorizer import Word2VecVectorizer
from tatk.feature_extraction.callable_vectorizer import CallableVectorizer

!pip show azureml-tatk

Name: azureml-tatk
Version: 0.1.18108.8a1
Summary: Microsoft Azure Machine Learning Package for Text Analytics
Home-page: https://microsoft.sharepoint.com/teams/TextAnalyticsPackagePreview
Author: Microsoft Corporation
Author-email: amltap@microsoft.com
License: UNKNOWN
Location: c:\users\remoteuser\appdata\local\amlworkbench\python\lib\site-packages
Requires: qgrid, sklearn-crfsuite, h5py, gensim, azure-ml-api-sdk, nltk, ruamel.yaml, ipython, ipywidgets, pyspark, pdfminer.six, lxml, bqplot, pandas, docker, azure-storage, scipy, pytest, validators, unidecode, numpy, keras, nose, scikit-learn, requests, matplotlib
Required-by: 


### Create functions for loading and preprocessing tweets

In [6]:
# Data processing
# In the following code, we replace Emails, URLS, emoticons etc with special labels
pos_emoticons=["(^.^)","(^-^)","(^_^)","(^_~)","(^3^)","(^o^)","(~_^)","*)",":)",":*",":-*",":]",":^)",":}",
               ":>",":3",":b",":-b",":c)",":D",":-D",":O",":-O",":o)",":p",":-p",":P",":-P",":Þ",":-Þ",":X",
               ":-X",";)",";-)",";]",";D","^)","^.~","_)m"," ~.^","<=8","<3","<333","=)","=///=","=]","=^_^=",
               "=<_<=","=>.<="," =>.>="," =3","=D","=p","0-0","0w0","8D","8O","B)","C:","d'-'","d(>w<)b",":-)",
               "d^_^b","qB-)","X3","xD","XD","XP","ʘ‿ʘ","❤","💜","💚","💕","💙","💛","💓","💝","💖","💞",
               "💘","💗","😗","😘","😙","😚","😻","😀","😁","😃","☺","😄","😆","😇","😉","😊","😋","😍",
               "😎","😏","😛","😜","😝","😮","😸","😹","😺","😻","😼","👍"]

neg_emoticons=["--!--","(,_,)","(-.-)","(._.)","(;.;)9","(>.<)","(>_<)","(>_>)","(¬_¬)","(X_X)",":&",":(",":'(",
               ":-(",":-/",":-@[1]",":[",":\\",":{",":<",":-9",":c",":S",";(",";*(",";_;","^>_>^","^o)","_|_",
               "`_´","</3","<=3","=/","=\\",">:(",">:-(","💔","☹️","😌","😒","😓","😔","😕","😖","😞","😟",
               "😠","😡","😢","😣","😤","😥","😦","😧","😨","😩","😪","😫","😬","😭","😯","😰","😱","😲",
               "😳","😴","😷","😾","😿","🙀","💀","👎"]

emoticonsDict = {}
for i,each in enumerate(pos_emoticons):
    emoticonsDict[each]=' POS_EMOTICON_'+num2words.num2words(i).upper()+' '
    
for i,each in enumerate(neg_emoticons):
    emoticonsDict[each]=' NEG_EMOTICON_'+num2words.num2words(i).upper()+' '
    
# use these three lines to do the replacement
rep = dict((re.escape(k), v) for k, v in emoticonsDict.items())
emoticonsPattern = re.compile("|".join(rep.keys()))

# Read in files
def read_tweets(filename):
    """Read the raw tweet data from a file. Replace Emails etc with special tokens """
    with open(filename, 'r') as f:
        all_lines=f.readlines()
        padded_lines=[]
        for line in all_lines:
            line = emoticonsPattern.sub(lambda m: rep[re.escape(m.group(0))], line.lower().strip())
            line=re.sub(r'(.)\1{2,}', r'\1\1',line)
            words_tokens=[token for token in TweetTokenizer().tokenize(line)]                    
            line= ' '.join(token for token in words_tokens )         
            padded_lines.append(line)
    return padded_lines

### Read in and preprocess tweets

In [9]:
tweets = read_tweets(training_filename)
df = pd.DataFrame({'raw_tweets':tweets})
display(df[:5])

Unnamed: 0,raw_tweets
0,damn fixtated on @kokupuff lovely thighs / hip...
1,god bless firefox's ' restore previous session...
2,@sherrieshepherd http://twitpic.com/6vn4a - da...
3,@thechadhuck hey ! sorry u had to come back to...
4,bye mommy . we'll miss you .


In [10]:
# Define functions wihch are to be used to pre-process tweets
def to_lower_case(x):
    return x.lower()

def emailsReplace(x):
    return x.replace(r'[\w\.-]+@[\w\.-]+', ' EMAIL ')

def numsReplace(x):
    return x.replace(r'[\w\.-]+@[\w\.-]+', ' NUM ')

def userMentionsReplace(x):
    return x.replace(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)', ' USER ')

def urlReplace(x):
    return x.replace(r'r(f|ht)(tp)(s?)(://)(.*)[.|/][^ ]+', ' URL ')

def punctuationReplace(x):
    return x.replace(r'(?<=\w)[^\s\w](?![^\s\w])', ' PUN ')

def atReplace(x):
    return x.replace(r'@', ' AT ')

# Chain functions into a list
featFuncs=[to_lower_case, emailsReplace, numsReplace, userMentionsReplace, urlReplace, punctuationReplace, atReplace]

# Create a transformer specifying functions, 
callable = CallableVectorizer(input_col="raw_tweets", output_col="tweets", feat_list=featFuncs, preprocessor = True)
processed_df = callable.tatk_fit_transform(df)

processed_df.head(3)


CallableVectorizer::tatk_fit_transform ==> start
CallableVectorizer::tatk_fit_transform ==> end 	 Time taken: 0.1 mins


Unnamed: 0,raw_tweets,tweets
0,damn fixtated on @kokupuff lovely thighs / hip...,damn fixtated on AT kokupuff lovely thighs / ...
1,god bless firefox's ' restore previous session...,god bless firefox's ' restore previous session...
2,@sherrieshepherd http://twitpic.com/6vn4a - da...,AT sherrieshepherd http://twitpic.com/6vn4a -...


### Define & train word2vec vectorizer to generate word embeddings

In [11]:
word2vec = Word2VecVectorizer(input_col = 'tweets', output_col = '', embedding_size = 50, return_type = 'word_vector', context_window_size=5, min_df=5, num_workers=4)
word2vec_model = word2vec.tatk_fit(df)

Word2VecVectorizer::tatk_fit ==> start
vocabulary size =69899
Word2VecVectorizer::tatk_fit ==> end 	 Time taken: 1.05 mins


### Save word2vec embeddings in a txt file

In [12]:
word2vec_model.save_embeddings(embeddings_file_path)

Word2VecVectorizer::save_embeddings ==> start
Time taken: 0.03 mins
Word2VecVectorizer::save_embeddings ==> end


### Load the embeddings to memory with include_unk set to True to add OOV treatment

In [13]:
vectorizer = Word2VecVectorizer.load_embeddings(embeddings_file_path, include_unk = True, unk_method = 'rnd', unk_vector = None, unk_word = '<UNK>')

Word2VecVectorizer::load_embeddings ==> start
Time taken: 0.07 mins
Word2VecVectorizer::load_embeddings ==> end


### Embedding Lookup: Get word and subword indices.

In [14]:
df_predict = pd.DataFrame({'text' : ["I have fever", "this is a good book"]})
vectorizer.input_col = 'text'
vectorizer.output_col = 'indices'
vectorizer.return_type = 'word_index'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


Unnamed: 0,text,indices
0,I have fever,"[5, 23, 1208]"
1,this is a good book,"[32, 15, 9, 34, 487]"


### Embedding Lookup: Get word embeddings.

In [15]:
vectorizer.output_col = 'word_vector'
vectorizer.return_type = 'word_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


Unnamed: 0,text,indices,word_vector
0,I have fever,"[5, 23, 1208]","[[-0.014937000349164009, 0.004478000104427338,..."
1,this is a good book,"[32, 15, 9, 34, 487]","[[-0.8164309859275818, -0.9811710119247437, -0..."


### Embedding Lookup: Get sentence embedding from word embeddings
This uses averaging of word vectors

In [16]:
vectorizer.output_col = 'sentence_vector'
vectorizer.return_type = 'sentence_vector'
result = vectorizer.tatk_transform(df_predict)
display(result)

Word2VecVectorizer::tatk_transform ==> start
Word2VecVectorizer::tatk_transform ==> end 	 Time taken: 0.0 mins


Unnamed: 0,text,indices,word_vector,sentence_vector
0,I have fever,"[5, 23, 1208]","[[-0.014937000349164009, 0.004478000104427338,...","[-0.842507665976882, 0.8129770023127397, 0.927..."
1,this is a good book,"[32, 15, 9, 34, 487]","[[-0.8164309859275818, -0.9811710119247437, -0...","[-0.3175797998905182, 0.10082738101482391, -1...."


### Embedding Lookup: Get most similar word to a given word

In [17]:
vectorizer.embedding_table.most_similar('fever')

[('bronchitis', 0.7530185580253601),
 ('tonsillitis', 0.7527911067008972),
 ('flu', 0.7445870637893677),
 ('tonsilitis', 0.7410739660263062),
 ('cough', 0.7255285978317261),
 ('antibiotics', 0.7244501709938049),
 ('sickness', 0.7191905975341797),
 ('sniffles', 0.7161286473274231),
 ('migraine', 0.7115853428840637),
 ('chills', 0.6977540850639343)]