# Finding topn similar words using word2vec. However, if the word is not part of vocab, it will still fail 


NLTK includes a pre-trained model which is part of a model that is trained on 100 billion words from the Google News Dataset. The full model is from https://code.google.com/p/word2vec/ (about 3 GB). You need to download this file 

In [8]:
import gensim 
import nltk

In [None]:
nltk.download('word2vec_sample')

In [9]:
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

In [10]:
from gensim.models import KeyedVectors

In [11]:
import spacy
from pprint import pprint 
nlp = spacy.load("en")

In [12]:
import en_core_web_sm  # This is the default model ( vocabulary, syntax and entity)
nlp = en_core_web_sm.load()

In [13]:
import codecs
from nltk.corpus import wordnet 
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer


In [186]:
# This is the current pipeline that is used 
pprint(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7ff6626faf10>),
 ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7ff665b83520>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7ff665b834b0>)]


In [14]:
#1. dataset - first three are similar and second three are dissimilar 

text1=["Trying to test how the method is for identical text.",
       "I like that food.",
       "Ram is very nice..",
      "Pure malt whiskey.",
      "It is a ferocious dog and it barks whenever anybody uknown comes near the house.",
      "It is the family cow.",
      "My driving license is my identity in USA as everything is linked to it."]
text2=["Trying to test how the method is for identical text.",
       "That dish is exciting",
       "Is Ram very nice?",
       "Fresh orange juice.",
       "The painting in the art gallery is so fantastic !",
       "I have been driving this sports car for last ten years and I am so satisfied! ",
       "The president led us to war and we lost that war! "]



In [15]:
# in case we are using synonym based on wordnet, there are many compound lemmas and we have to delete
# them based on the following combinations
chars = set('_')

## This is a pruned model from NLTK 

But it is better to work with the full model for any serious problem

In [None]:
from nltk.data import find
word2vec_sample = str(find('/Users/vcroopana/nltk_data/models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

In [None]:
# this is a pruned vocabulary of the most frequent 44 K words
len(model.vocab)

## This is the full unpruned embedding ( not pruned). Better to use this 

In [None]:
#Load model from local
filename = '/Users/vcroopana/gensim-data/GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

In [None]:
# Load model as an object

# from gensim.models.word2vec import Word2Vec
# from gensim.models import KeyedVectors
# import gensim.downloader as api

# model = api.load("word2vec-google-news-300")  # download the model and return as object ready for use


In [None]:
# this is a larger vocabulary
len(model.vocab)

In [None]:
model.most_similar(positive=['university'], topn = 6)

##  Doing a similarity math with this embedding

king - man + woman = queen  etc.

In [None]:
#  king - man + woman = queen 
model.most_similar(positive=['woman'.lower(),'king'.lower()], negative=['man'.lower()], topn = 1)

In [None]:
# Germany-Berlin+paris = france 
model.most_similar(positive=['Paris'.lower(),'Germany'.lower()], negative=['Berlin'.lower()], topn = 1)

In [None]:
model.most_similar(positive=['Seoul'.lower(),'Germany'.lower()], negative=['Berlin'.lower()], topn = 10)

## Comparing the meaning of the two sentences by comparing the mean word2vec vectors

This routine may still fail if the word is not part of vocab 

Before applying this routine, ensure that every word is part of vocab that you are using

This function works only for a sentence but if you spply a paragraph, you need to preprocess as simple split function will not work. So to handle this, we will do preprocessing using spacy  


Stemming and lemmatization attempts to get root word (example - rain) for different word inflections (raining, rained etc). Lemma algos gives you real dictionary words, whereas stemming simply cuts off last parts of the word so its faster but less accurate. 

Stemming returns words which are not really dictionary words and hence you will not be able to find pretrained vectors for it in Glove, Word2Vec etc and this is a major disadvantage depending on application. You should stick to Lemmatization in this case. 

We will also use Spacy stopwords list and create a lemma_list and token_list per sentence.

We can clean out all HTML tags by using the regex ‘<[^>]*>’; All the non word characters can be removed by ‘[\W]+’. You should be careful though about not stripping punctuations before word contractions are handled by the lemmatizer. Note that we are doing it after lemmatization and not before. 



In [None]:
# this is a useful function 
# It needs to be further modified so that emoticons identified are replaced by text 
# find emoticons function
import re
def find_emo(text):
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',text)
    return emoticons
sample_text = " I loved this movie :) but it was rather sad :( :]"
find_emo(sample_text)
# output
[':)', ':(']

In [22]:
# find a pattern text defined by regex inside a string or a list of strings
import re
def preprocessor(text):
    if isinstance((text), (str)):
        text = re.sub('<[^>]*>', '', text)
        text = re.sub('[\W]+', '', text.lower())
        text = re.sub('[^\x00-\x7F]+', ' ', text) # removes non ascii chars
        text = re.sub('https?://[A-Za-z0-9./]+', ' ', text) # Remove URLs
        # remove punctuations except '_'
        punctuation = ['(', ')', '[',']','?', ':', ':', ',', '.', '!', '/', '"', "'", '@', '#', '&']
#     text = re.sub('[^a-zA-Z]', ' ', text) # remove all other than alphabet chars 
        text = "".join((char for char in text if char not in punctuation))
        text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text) # remove all single characters   
        
        return text
    if isinstance((text), (list)):
        return_list = []
        for i in range(len(text)):
            temp_text = re.sub('<[^>]*>', '', text[i])
            temp_text = re.sub('[\W]+', '', temp_text.lower())
            return_list.append(temp_text)
        return(return_list)
    else:
        pass

In [172]:
# nlp = English()  # It is already done above 
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
# using spacy tokenizer 
tokenizer = nlp.Defaults.create_tokenizer(nlp)

def lemma_token(text):
    text = text.lower()  # to take care of capital case word in glove
    tokens = tokenizer(text)
    token_list = []
    lemma_list = []
    for token in tokens:
        if token.is_stop is False:
            token_preprocessed = preprocessor(token.lemma_)
#             print(token_preprocessed)
            if token_preprocessed != '':
                 lemma_list.append(token_preprocessed)
                 token_list.append(token.text)   
    #return (token_list, lemma_list)
    return lemma_list


In [171]:
#outputs the average word2vec for words in this sentence
def average_vec(words, model):
#     words = lemma_token(text)
    #use unk word when word is not present in vocab to find a predesigned vector which is often the best vector
    word_vecs = [model.word_vec(w) if w in model.vocab else model.word_vec('unk') for w in words ]
    
    op =  (np.array(word_vecs).sum(axis=0)/len(word_vecs)).reshape(1,-1)
    return op

compare = lambda a,b: cosine_similarity(average_vec(a),average_vec(b)).sum()


In [None]:
compare('Pure malt whiskey'.lower(),'Fresh orange juice'.lower())


In [18]:
annotations_data = pd.read_csv('/Users/vcroopana/Downloads/summer2020/superbowl/ip/SB_ad_annotations.csv', index_col=0) 
annotations_data['Keywords'] = annotations_data['Brand Name']\
                                .str.cat(annotations_data['Ad Name'], sep=" ")\
                                .str.cat(annotations_data['KeyTerms_Edited'], sep=" ")
df = annotations_data.drop_duplicates()
print(df.shape)
annotations_data.head(2)


(75, 10)


Unnamed: 0_level_0,Brand Name,Ad Name,Product,Key Terms Round 1,KeyTerms_Edited,Excitatory Potential,Emotional vs. Rational,Semantic Affinity,Valence,Keywords
Ad Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,Trailer,Fast & Furious 9 Trailer,Movie Trailer,"fast and the furious, fast & the furious, fast...","fast_and_the_furious, fast_&_the_furious, fast...",1,1,2,1,Trailer Fast & Furious 9 Trailer fast_and_the...
2,Quibi,Quibi Bank Heist,Video Platform,"quibi, bank heist, robbery, less than ten minu...","quibi, bank_heist, robbery, less_than_ten_minu...",2,1,2,2,"Quibi Quibi Bank Heist quibi, bank_heist, rob..."


In [19]:
man_ann_data = pd.read_csv(r'/Users/vcroopana/Downloads/summer2020/superbowl/mann_ann_sb.csv')    
man_ann_data.head(1)

Unnamed: 0,Ò,user_id,tweet_id,time_of_tweet,user_location,team followed,affective_state,tweet_text,ad_keywords,ad_mentioned,ad_manual
0,64628,2836102000.0,1.22e+18,Mon Feb 03 03:53:56 +0000 2020,,both,1,"Man I wanna look at the brightside ""we thought...","John legend, Chrissy Teigen, genesis, hyundai,...",genesis going away party,none


In [33]:
from nltk.corpus import stopwords
import re

stop = stopwords.words('english')
print(len(stop))

def removeMentions(text):

    textBeforeMention = text.partition("@")[0]
    textAfterMention = text.partition("@")[2]
    textAfterMention =  re.sub(r':', '', textAfterMention) #cadillac join the 31k
    tHandle = textAfterMention.partition(" ")[0].lower() #cadillac    
    text = textBeforeMention+ " " + textAfterMention  
    return text

def cleanTweet(strinp):
    strinp = re.sub(r'RT', "", strinp) # Remove RT
    strinp = strinp.lower()
    
    stop_removed_list = [word for word in strinp.split() if word not in (stop)]
    stop_removed = ' '.join([str(elem) for elem in stop_removed_list])    
    text = re.sub('https?://[A-Za-z0-9./]+', ' ', stop_removed) # Remove URLs
    text = removeMentions(text)
    text = re.sub('[^\x00-\x7F]+', ' ', text) # Remove non-ASCII chars.
    
    # remove punctuations except '_'
    punctuation = ['(', ')', '[',']','?', ':', ':', ',', '.', '!', '/', '"', "'", '@', '#', '&']
#     text = re.sub('[^a-zA-Z]', ' ', text) # remove all other than alphabet chars 
    text = "".join((char for char in text if char not in punctuation))
    
#     text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text) # remove all single characters     
    stop_removed_l = [word for word in text.split() if word not in (stop)]
    stop_removed = ' '.join([str(elem) for elem in stop_removed_l]) 
    return stop_removed

print(cleanTweet("RT @cadillacabc: Joinrt the 31K james_bond") )

179
cadillacabc joinrt 31k james_bond


In [None]:

ad_keywords_clean = annotations_data['Keywords'].apply(lambda ad: cleanTweet(ad))
man_ann_data['tweet_clean'] = man_ann_data['tweet_text'].apply(lambda twt: cleanTweet(twt))

In [None]:
glove_avg_sim_df = pd.DataFrame(columns = annotations_data['Ad Name'])

def compareTweetAndAds(tweet):
    ad_keywords = annotations_data['Keywords']
    ad_sim = []
    for ad in ad_keywords:
        compare_res = compare(tweet.lower(), ad.lower())
        ad_sim.append(compare_res)
    return ad_sim
#     max_sim_index = ad_sim.index(max(ad_sim))
#     print(max_sim_index)
#     if(max_sim_index>0.9):
#         return annotations_data.iloc[max_sim_index]['Ad Name']
#     else:
#         return None

man_ann_data['glove_avg'] = man_ann_data['tweet_text'].apply(lambda twt: compareTweetAndAds(twt))
man_ann_data.to_csv("/Users/vcroopana/Downloads/summer2020/superbowl/mann_ann_sb_temp.csv")

In [173]:

def compute_glove_similarity(annotations_data, man_ann_data, model):
    ##### Data Init

    ad_keywords_clean = annotations_data['Keywords'].apply(lambda ad: cleanTweet(ad))
    man_ann_data['tweet_clean'] = man_ann_data['tweet_text'].apply(lambda twt: cleanTweet(twt))

    ###### Preprocessing
    tweets_lemma_tokens = man_ann_data['tweet_clean'].apply(lambda x: lemma_token(x))
    print("Calculated pos tokens of tweets")
    ads_lemma_tokens = ad_keywords_clean.apply(lambda x: lemma_token(x))
    print("Calculated pos tokens of ads")

    ###### Sim calculation
    ad_id = 1
    glove_sim_df = pd.DataFrame(columns = annotations_data['Ad Name'])

    for ad in ads_lemma_tokens:
        glove_sim_col =[]
        avg_vec_ad = average_vec(ad, model)
        print("computing sim for ad:" + str(ad_id))
        for i in range(0, len(tweets_lemma_tokens)): 
            avg_vec_twt = average_vec(tweets_lemma_tokens[i], model)     
            if  avg_vec_twt.shape == avg_vec_ad.shape:
                glove_sim = cosine_similarity(avg_vec_twt,avg_vec_ad).sum()
            else:
                glove_sim = 0
            glove_sim_col.append(np.round(glove_sim,3))

        glove_sim_df[annotations_data['Ad Name'][ad_id]] = glove_sim_col
        ad_id = ad_id + 1
    print("glove POS Similarities Calculated")
    return glove_sim_df

In [None]:
#Load model from local
filename = '/Users/vcroopana/gensim-data/GoogleNews-vectors-negative300.bin'
glove_model = KeyedVectors.load_word2vec_format(filename, binary=True)
print('loaded model')


In [None]:
glove_sim_df = compute_glove_similarity(annotations_data, man_ann_data, glove_model)


In [175]:
nlargest = 5
data = glove_sim_df
result_con = getTopNSimAds(nlargest, data)
## merge mann ann data and sim result
glove_sim_df_merged = pd.concat([man_ann_data, result_con], axis =1)

glove_sim_df_merged.head()

print("Top n similar ads computed")


Top n similar ads computed


  


In [185]:

glove_sim_df_merged['conf_matrix'] = glove_sim_df_merged.apply(lambda x: get_conf_matrix(x['ad_manual'], x['top1_ad'], 
                                                          x['top1'], 0.7), axis =1)
# man_ann_data_glove_pos['conf_matrix'] = man_ann_data_glove_pos.apply(lambda x: get_conf_matrix_2(x['ad_manual'], x['top1_ad'], 
#                             x['top2_ad'], x['top3_ad'], x['top4_ad'], x['top5_ad']), axis =1)
computeAccuracy(glove_sim_df_merged)


n_tp:91 n_fp:364 n_fn:759 n_tn:1286


(0.2, 0.10705882352941176, 0.13946360153256704)

In [186]:
glove_sim_df_merged.to_csv('/Users/vcroopana/Downloads/summer2020/superbowl/sim_glove_0.7.csv')

#  Finding most similar words using Glove 

We will use the pre-trained word vectors from Glove. We are especially interested in Twitter dataset with 2B tweets, 27B tokens, and 200d vectors.

We use Gensim to convert Glove vectors into the word2vec, then use KeyedVectors to load vectors in word2vec format.

The source is this article : https://towardsdatascience.com/how-to-solve-analogies-with-word2vec-6ebaf2354009

Same issue as in Word2Vec - words must exist in the vocabulary 
Also it makes sense to lower case everything otherwise may get word does NOT exist error

Unlike word2Vec, while using Glove, remember to lowercase the string.

In [2]:
import gensim
from gensim.models import KeyedVectors
import os

from gensim.models import Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath, get_tmpfile


path_glove = os.path.abspath('/Users/vcroopana/gensim-data/glove/glove.twitter.27B.200d.txt')
path_w2v = os.path.abspath('/Users/vcroopana/gensim-data/glove/glove.twitter.27B.200d_w2v.txt')

glove_file = datapath(path_glove)
tmp_file = get_tmpfile(path_w2v)

_ = glove2word2vec(glove_file, tmp_file)


In [3]:
model = KeyedVectors.load_word2vec_format(tmp_file)

In [None]:
model.most_similar(positive=['boy'], topn = 6)

In [None]:
model.most_similar(positive=['man'], topn = 6)

In [None]:
compare('Quick fox jumps over dog'.lower(),'Fast fox jumps over puppy'.lower())


# Calculate similarity between text pairs ( after pre-procssing)

It uses the lemma_token() function using spacy 

(a) tokenizes

(b) picks up the lemma 

(c) eliminates the stopwords 

In [None]:
l1= lemma_token(text1[0])
l2= lemma_token(text1[1])
l3= lemma_token(text1[2])
l4= lemma_token(text1[3])
l5= lemma_token(text1[4])
l6= lemma_token(text1[5])

# compared with 
l1_c= lemma_token(text2[0])
l2_c= lemma_token(text2[1])
l3_c= lemma_token(text2[2])
l4_c= lemma_token(text2[3])
l5_c= lemma_token(text2[4])
l6_c= lemma_token(text2[5])


In [None]:
print(text1[0])
print ("\n")
print(text2[0])
print ("\n")
print(l1)
print ("\n")
print(l1_c)
print("\n Similarity after pre_processing: {}".format(compare(text1[0],text2[0])))
print ("=====================================\n")

print(text1[1])
print ("\n")
print(text2[1])
print ("\n")
print(l2)
print ("\n")
print(l2_c)

print("\n Similarity: {}".format(compare(text1[1],text2[1])))
print ("=====================================\n")

print(text1[2])
print ("\n")
print(text2[2])
print ("\n")
print(l3)
print ("\n")
print(l3_c)

print("\n Similarity: {}".format(compare(text1[2],text2[2])))
print ("=====================================\n")

print(text1[3])
print ("\n")
print(text2[3])
print ("\n")
print(l4)
print ("\n")
print(l4_c)

print("\n Similarity: {}".format(compare(text1[3],text2[3])))
print ("=====================================\n")

print(text1[4])
print ("\n")
print(text2[4])
print ("\n")
print(l5)
print ("\n")
print(l5_c)

print("\n Similarity: {}".format(compare(text1[4],text2[4])))
print ("=====================================\n")

print(text1[5])
print ("\n")
print(text2[5])
print ("\n")
print(l6)
print ("\n")
print(l6_c)

print("\n Similarity: {}".format(compare(text1[5],text2[5])))

#  Modifying similarity logic further by considering POS tag types

Clearly it did not work as the text has lot of implied meanings. So, just looking at average of all the words even after pre-processing  is not working. 

One option is to identify the nouns and develop a similarity around that.  

Currently the code takes noun, adj, adv, verb

In [138]:
# we will try to return lemma list of only the nouns in the text
def lemma_token_pos(text):
    text = text.lower()  # to take care of capital case word in glove
    doc=nlp(text)
    lemma_list = []
    for token in doc:
        if token.is_stop is False:
            if (token.pos_ == 'NOUN' or token.pos_ == 'VERB' or token.pos_ == 'ADJ' or token.pos_ == 'adv'):
                token_preprocessed = preprocessor(token.lemma_)
                if token_preprocessed != '':
                     lemma_list.append(token_preprocessed)
                     #return (token_list, lemma_list)
#     print(lemma_list)
#     print("\n")
    return lemma_list


In [None]:
#testing 
lemma_token_pos(text1[2])

In [140]:
#outputs the average word2vec for words in this sentence
def average_vec_pos(words):
#     words = lemma_token_pos(text)
    #use unk word when word is not present in vocab to find a predesigned vector which is often the best vector
    word_vecs = [model.word_vec(w) if w in model.vocab else model.word_vec('unk') for w in words ]
    if(len(word_vecs) == 0):
#         print('len of word vec=0')
        return (np.array(word_vecs).sum(axis=0)).reshape(1,-1)
    else:                
        return (np.array(word_vecs).sum(axis=0)/len(word_vecs)).reshape(1,-1)

compare_pos = lambda a,b: cosine_similarity(average_vec_pos(a),average_vec_pos(b)).sum()


In [50]:
import numpy as np

def getTopNSimAds(nlargest, data):
    
    order = np.argsort(-data.values, axis=1)[:, :nlargest]
    result = pd.DataFrame(data.columns[order], 
                          columns=['top{}_ad'.format(i) for i in range(1, nlargest+1)],
                          index= data.index)

    order_vals = np.sort(-data.values, axis=1)[:, :nlargest]
    result_vals = pd.DataFrame(-order_vals,
                              columns = ['top{}'.format(i) for i in range(1, nlargest+1)],
                              index= data.index)
    
    result_con = pd.concat([result_vals, result], axis =1)
    return result_con


In [51]:
def get_conf_matrix(ad_manual, ad_algo, ad_prob, thresh_prob):
    res = ""
    ad_manual = ad_manual.lower()
    ad_algo = ad_algo.lower()

    if(ad_manual == ad_algo and ad_prob >= thresh_prob):
        res = 'TP'
    elif(ad_manual == ad_algo and ad_prob < thresh_prob):
        res = 'FN'
    elif(ad_manual=='none' and ad_manual != ad_algo and ad_prob > thresh_prob):
        res = 'FP'
    elif(ad_manual=='none' and ad_manual != ad_algo and ad_prob < thresh_prob):
        res = 'TN'
    elif(ad_manual!='none' and ad_manual!= ad_algo):
        res = 'FN'
    return res
def get_conf_matrix_2(ad_manual, ad_algo, ad_algo_2, ad_algo_3, ad_algo_4, ad_algo_5):
    res = ""
    ad_manual = ad_manual.lower()
    ad_algo = ad_algo.lower()
    ad_algo_2 = ad_algo_2.lower()
    ad_algo_3 = ad_algo_3.lower()
    ad_algo_4 = ad_algo_4.lower()
    ad_algo_5 = ad_algo_5.lower()
    
    if(ad_manual == ad_algo or ad_manual == ad_algo_2 or ad_manual == ad_algo_3 or ad_manual == ad_algo_4
      or ad_manual == ad_algo_5):
        res = 'TP'
    elif(ad_manual=='none'):
        res = 'FP'
    elif(ad_manual!='none' and ad_manual!= ad_algo and ad_manual!= ad_algo_2 and ad_manual!= ad_algo_3
        and ad_manual!= ad_algo_4 and ad_manual!= ad_algo_5):
        res = 'FN'
    elif(ad_manual!='none' and ad_manual!= ad_algo and ad_manual!= ad_algo_2 and ad_manual!= ad_algo_3
        and ad_manual!= ad_algo_4 and ad_manual!= ad_algo_5):
        res = 'TN'
    return res

def computeAccuracy(result):    
    n_tp = result[result['conf_matrix'] == 'TP'].shape[0]
    n_fp = result[result['conf_matrix'] == 'FP'].shape[0]
    n_fn = result[result['conf_matrix'] == 'FN'].shape[0]
    n_tn = result[result['conf_matrix'] == 'TN'].shape[0]    
    print("n_tp:" + str(n_tp)+ " n_fp:" + str(n_fp)+ " n_fn:" + str(n_fn) + " n_tn:" + str(n_tn))
    precision = n_tp/(n_tp+ n_fp)
    recall = n_tp/(n_tp+ n_fn)
    f_measure = (2*precision*recall)/ (precision+recall)

    return precision, recall, f_measure


In [None]:
##### Data Init
ad_keywords_clean = annotations_data['Keywords'].apply(lambda ad: cleanTweet(ad))
man_ann_data['tweet_clean'] = man_ann_data['tweet_text'].apply(lambda twt: cleanTweet(twt))

###### Preprocessing
tweets_lemma_tokens = man_ann_data['tweet_clean'].apply(lambda x: lemma_token_pos(x))
print("Calculated pos tokens of tweets")
ads_lemma_tokens = ad_keywords_clean.apply(lambda x: lemma_token_pos(x))
print("Calculated pos tokens of ads")

###### Sim calculation
ad_id = 1
glove_pos_sim_df = pd.DataFrame(columns = annotations_data['Ad Name'])

for ad in ads_lemma_tokens:
    glove_pos_sim_col =[]
    avg_vec_pos_ad = average_vec_pos(ad)
    print("computing sim for ad:" + str(ad_id))
    for i in range(0, len(tweets_lemma_tokens)): 
        avg_vec_pos_twt = average_vec_pos(tweets_lemma_tokens[i])     
        if  avg_vec_pos_twt.shape == avg_vec_pos_ad.shape:
            glove_pos_sim = cosine_similarity(avg_vec_pos_twt,avg_vec_pos_ad).sum()
        else:
            glove_pos_sim = 0
        glove_pos_sim_col.append(np.round(glove_pos_sim,3))
    
    glove_pos_sim_df[annotations_data['Ad Name'][ad_id]] = glove_pos_sim_col
    ad_id = ad_id + 1
glove_pos_sim_df.head(3)
print("glove POS Similarities Calculated")


In [143]:
nlargest = 5
data = glove_pos_sim_df
result_con = getTopNSimAds(nlargest, data)
## merge mann ann data and sim result
man_ann_data_glove_pos = pd.concat([man_ann_data, result_con], axis =1)

man_ann_data_glove_pos.head()

print("Top n similar ads computed")



Top n similar ads computed


  


In [146]:

man_ann_data_glove_pos['conf_matrix'] = man_ann_data_glove_pos.apply(lambda x: get_conf_matrix(x['ad_manual'], x['top1_ad'], 
                                                          x['top1'], 0.8), axis =1)
# man_ann_data_glove_pos['conf_matrix'] = man_ann_data_glove_pos.apply(lambda x: get_conf_matrix_2(x['ad_manual'], x['top1_ad'], 
#                             x['top2_ad'], x['top3_ad'], x['top4_ad'], x['top5_ad']), axis =1)
computeAccuracy(man_ann_data_glove_pos)
#man_ann_data_glove_pos.to_csv("/Users/vcroopana/Downloads/summer2020/superbowl/sim_glove_pos_0.8.csv")

n_tp:22 n_fp:432 n_fn:828 n_tn:1218


(0.048458149779735685, 0.02588235294117647, 0.033742331288343565)

In [147]:
man_ann_data_glove_pos.to_csv("/Users/vcroopana/Downloads/summer2020/superbowl/sim_glove_pos_0.8.csv")

In [None]:
print(text1[0])
print ("\n")
print(text2[0])
print ("\n")
print("\n Similarity after pre_processing for noun: {}".format(compare_pos(text1[0],text2[0])))
print ("=====================================\n")


print(text1[1])
print ("\n")
print(text2[1])
print ("\n")
print("\n Similarity after pre_processing for noun: {}".format(compare_pos(text1[1],text2[1])))
print ("=====================================\n")



print(text1[2])
print ("\n")
print(text2[2])
print ("\n")
print("\n Similarity after pre_processing for noun: {}".format(compare_pos(text1[2],text2[2])))
print ("=====================================\n")



print(text1[3])
print ("\n")
print(text2[3])
print ("\n")
print("\n Similarity after pre_processing for noun: {}".format(compare_pos(text1[3],text2[3])))
print ("=====================================\n")


print(text1[4])
print ("\n")
print(text2[4])
print ("\n")
print("\n Similarity after pre_processing for noun: {}".format(compare_pos(text1[4],text2[4])))
print ("=====================================\n")

print(text1[5])
print ("\n")
print(text2[5])
print ("\n")
print("\n Similarity after pre_processing for noun: {}".format(compare_pos(text1[5],text2[5])))
print ("=====================================\n")


Things have now clearly changed.  Working only with the nouns declutters things a lot and similarity value has come down. 

#  We will now improve similarity by adding  top 5 similar words to each side for each word.. It is like pronouncing the difference or similarity 

We are still using the "model" initialized already by loading Glove and converting to Word2Vec 

You can see below that this does not work as distribution semantics is picking up all "lemme" kind of verbs that are mostly occuring together.


In [None]:
#testing out ...
synonym_list=[]
tuple_list=model.most_similar(positive=['woman'], topn = 4)
for a_tuple in tuple_list:
    synonym_list.append(a_tuple[0])
print(synonym_list)    

### So we refine our previous function with synonyms

In [152]:
# we will try to return lemma list of only the POS in the text
def lemma_token_pos_synonyms(text, allowed_pos, model):
    text = text.lower()  # to take care of capital case word in glove
    doc=nlp(text)
    lemma_list = []
    synonym_list=[]
    for token in doc:
        if token.is_stop is False:
            if (token.pos_ in allowed_pos):
                token_preprocessed = preprocessor(token.lemma_)
                if token_preprocessed != '':
                     lemma_list.append(token_preprocessed)
    #print(lemma_list)
    for lemma in lemma_list:
        tuple_list=model.most_similar(positive=['lemma'], topn = 5)
        for a_tuple in tuple_list:
            synonym_list.append(a_tuple[0])
    lemma_list=lemma_list+synonym_list        
#     print("\n Extended lemma list : ")
#     print(lemma_list)
#     print("\n")
    return lemma_list

In [153]:
#outputs the average word2vec for words in this sentence
def average_vec_pos_syn(words, model):
#     words = lemma_token_pos_synonyms(text)
#     word_vecs = [model.word_vec(w) for w in words]
#     return (np.array(word_vecs).sum(axis=0)/len(word_vecs)).reshape(1,-1)
    word_vecs = [model.word_vec(w) if w in model.vocab else model.word_vec('unk') for w in words ]
    if(len(word_vecs) == 0):
        return (np.array(word_vecs).sum(axis=0)).reshape(1,-1)
    else:                
        return (np.array(word_vecs).sum(axis=0)/len(word_vecs)).reshape(1,-1)

compare_pos_syn = lambda a,b: cosine_similarity(average_vec_pos_syn(a),average_vec_pos_syn(b)).sum()


In [157]:
def getSimilarityDf(annotations_data, ads_lemma_tokens, tweets_lemma_tokens, model):
    ad_id = 1
    glove_pos_sim_df = pd.DataFrame(columns = annotations_data['Ad Name'])
        
    for ad in ads_lemma_tokens:
        glove_pos_sim_col =[]
        avg_vec_pos_ad = average_vec_pos_wordnet_syn(ad, model)
        print("computing sim for ad:" + str(ad_id))
        for i in range(0, len(tweets_lemma_tokens)): 
            avg_vec_pos_twt = average_vec_pos_wordnet_syn(tweets_lemma_tokens[i], model)     
            if  avg_vec_pos_twt.shape == avg_vec_pos_ad.shape:
                glove_pos_sim = cosine_similarity(avg_vec_pos_twt,avg_vec_pos_ad).sum()
            else:
                glove_pos_sim = 0
            glove_pos_sim_col.append(np.round(glove_pos_sim,3))
            
        glove_pos_sim_df[annotations_data['Ad Name'][ad_id]] = glove_pos_sim_col
        ad_id = ad_id + 1
            
    return glove_pos_sim_df

def computeGlovePOSSynSimilarity(annotations_data, man_ann_data, model, allowed_pos, tweets_lemma_tokens, ads_lemma_tokens):
    ##### Data Init
    ad_keywords_clean = annotations_data['keywords_clean']
        
    ###### Preprocessing - tokenize and filter POS
    
#     tweets_lemma_tokens = man_ann_data['tweet_clean'].apply(
#         lambda x: lemma_token_pos_synonyms(x, allowed_pos, model))
#     print("Calculated pos tokens of tweets")
#     ads_lemma_tokens = ad_keywords_clean.apply(
#         lambda x: lemma_token_pos_synonyms(x, allowed_pos, model))
#     print("Calculated pos tokens of ads")
        
    ###### Similarity calculation
    glove_pos_syn_df = getSimilarityDf(annotations_data, ads_lemma_tokens, tweets_lemma_tokens, model)

    print("Glove POS Synonymn Similarities Calculated")
    return glove_pos_syn_df

In [None]:
allowed_pos = ('NOUN', 'VERB', 'ADJ', 'adv')
tweets_lemma_syn_tokens = man_ann_data['tweet_clean'].apply(lambda x: lemma_token_pos_synonyms(x, allowed_pos, model))
print("Calculated pos tokens of tweets")
ads_lemma_syn_tokens = ad_keywords_clean.apply(lambda x: lemma_token_pos_synonyms(x, allowed_pos, model))
print("Calculated pos tokens of ads")

glove_pos_syn_sim = computeGlovePOSSynSimilarity(annotations_data, man_ann_data, model, allowed_pos,tweets_lemma_syn_tokens, ads_lemma_syn_tokens )


In [167]:
nlargest = 5
data = glove_pos_syn_sim
result_con = getTopNSimAds(nlargest, data)
## merge mann ann data and sim result
glove_pos_syn_sim_merged = pd.concat([man_ann_data, result_con], axis =1)

glove_pos_syn_sim_merged['conf_matrix'] = glove_pos_syn_sim_merged.apply(lambda x: get_conf_matrix(x['ad_manual'], x['top1_ad'], 
                                                          x['top1'], 0.9), axis =1)
# man_ann_data_glove_pos['conf_matrix'] = man_ann_data_glove_pos.apply(lambda x: get_conf_matrix_2(x['ad_manual'], x['top1_ad'], 
#                             x['top2_ad'], x['top3_ad'], x['top4_ad'], x['top5_ad']), axis =1)
computeAccuracy(glove_pos_syn_sim_merged)

glove_pos_syn_sim_merged.to_csv("/Users/vcroopana/Downloads/summer2020/superbowl/sim_glove_pos_syn_0.9.csv")


  


n_tp:49 n_fp:1570 n_fn:801 n_tn:80


In [None]:
print(text1[0])
print ("\n")
print(text2[0])
print ("\n")
print("\n Similarity after pre_processing for noun: {}".format(compare_pos_syn(text1[0],text2[0])))
print ("=====================================\n")



# We will try to define some kind of cross similarity now based on Glove Embedding. Hopefully it will give us better result

Given two lemma list, we do pairwise similarity and then normalize the same

Hypothesis : a word in a source sentence if compared with all the words in target sentence may capure more as average all words in source or target is making us lose some context 




In [20]:
def calculate_glove_pos_cross_similary(text1,text2,model,flag):
    if flag==True:
        words1 = lemma_token_pos(text1)
        words2 = lemma_token_pos(text2)
    else:
        words1 = lemma_token(text1)
        words2 = lemma_token(text2)        
    print(words1)
    print ("\n")
    print(words2)
    print ("\n")
    if flag==True:
        print("\n POS tag is used as filter")
    else:
        print("\n POS tag is NOT used as filter")
    cross_cos_similarity_12 = 0
    word_vecs2 = [model.word_vec(w) for w in words2]
    pairwise_cos_similarity = 0
    for word in words1:
        word_vecs1=model.word_vec(word).reshape(1, -1)
        pairwise_cos_similarity = cosine_similarity(word_vecs1,word_vecs2).sum()
        cross_cos_similarity_12 =cross_cos_similarity_12 + pairwise_cos_similarity 
    norm_cross_cos_similarity = (cross_cos_similarity_12)/ (len(words1)*len(words2)) 
    print("\n cross_cos_similarity_12 : {}".format(cross_cos_similarity_12))
    print("\n norm_cross_cos_similarity : {}".format(norm_cross_cos_similarity))
    return norm_cross_cos_similarity

In [63]:
def compute_glove_pos_cross_similary(annotations_data,man_ann_data,model,flag):
    ad_keywords_clean = annotations_data['keywords_clean']
    tweets_lemma_tokens =[]
    ads_lemma_tokens = []
    if flag==True:
        allowed_pos = ('NOUN', 'VERB', 'ADJ', 'adv')
#         tweets_lemma_tokens = man_ann_data['tweet_clean'].apply(lambda x: self.lemma_token_pos(x, allowed_pos))
        print("Calculated pos tokens of tweets")
#         ads_lemma_tokens = ad_keywords_clean.apply(lambda x: self.lemma_token_pos(x, allowed_pos))
        print("Calculated pos tokens of ads")
    else:
        tweets_lemma_tokens = man_ann_data['tweet_clean'].apply(lambda x: lemma_token(x))
        print("Calculated pos tokens of tweets")
        ads_lemma_tokens = ad_keywords_clean.apply(lambda x: lemma_token(x))
        print("Calculated pos tokens of ads")

    ad_id = 1
    glove_pos_sim_df = pd.DataFrame(columns = annotations_data['Ad Name'])
        
    for ad in ads_lemma_tokens:
        ad_token_filtd = [w for w in ad if w in model.vocab]
        word_vecs_ad = [model.word_vec(w) for w in ad_token_filtd]
        glove_pos_sim_col =[]
        if(len(ad_token_filtd) == 0):
            glove_pos_sim_col = [0] * len(tweets_lemma_tokens) 
        else:
            print("computing sim for ad:" + str(ad_id))
            for i in range(0, len(tweets_lemma_tokens)): 
                tweets_lemma_filtd = [ w for w in tweets_lemma_tokens[i] if w in model.vocab ]
                if(len(tweets_lemma_filtd)!=0):
                    pairwise_cos_similarity = 0
                    cross_cos_similarity_12 = 0
                    for word in tweets_lemma_filtd:
                        word_vecs_tweet = model.word_vec(word).reshape(1, -1)
                        pairwise_cos_similarity = cosine_similarity(word_vecs_tweet ,word_vecs_ad).sum()
                        cross_cos_similarity_12 = cross_cos_similarity_12 + pairwise_cos_similarity 
                    norm_cross_cos_similarity = (cross_cos_similarity_12)/ (len(ad_token_filtd)*len(tweets_lemma_filtd)) 
        #             print("\n cross_cos_similarity_12 : {}".format(cross_cos_similarity_12))
#                     print("norm_cross_cos_similarity : {}".format(norm_cross_cos_similarity))
                else:
                    norm_cross_cos_similarity =0
                    
                glove_pos_sim_col.append(np.round(norm_cross_cos_similarity,3))
            
        glove_pos_sim_df[annotations_data['Ad Name'][ad_id]] = glove_pos_sim_col
        ad_id = ad_id + 1
        
    return glove_pos_sim_df

In [None]:
annotations_data['keywords_clean'] = annotations_data['Keywords'].apply(lambda ad: cleanTweet(ad))
man_ann_data['tweet_clean'] = man_ann_data['tweet_text'].apply(lambda twt: cleanTweet(twt))
glove_pos_cross_similary = compute_glove_pos_cross_similary(annotations_data,man_ann_data,model,False)

In [65]:
nlargest = 5
data = glove_pos_cross_similary

result_con = getTopNSimAds(nlargest, data)
## merge mann ann data and sim result
man_ann_data_glove_pos_cross_similary = pd.concat([man_ann_data, result_con], axis =1)

man_ann_data_glove_pos_cross_similary.head()

print("Top n similar ads computed")


Top n similar ads computed


  


In [56]:
man_ann_data_glove_pos_cross_similary.to_csv('/Users/vcroopana/Downloads/summer2020/superbowl/sim_glove_pos_cross_similary.csv')


In [134]:
man_ann_data_glove_pos_cross_similary['conf_matrix'] = man_ann_data_glove_pos_cross_similary.apply(lambda x: get_conf_matrix(x['ad_manual'], x['top1_ad'], 
                                                          x['top1'], 0.4), axis =1)
# man_ann_data_glove_pos['conf_matrix'] = man_ann_data_glove_pos.apply(lambda x: get_conf_matrix_2(x['ad_manual'], x['top1_ad'], 
#                             x['top2_ad'], x['top3_ad'], x['top4_ad'], x['top5_ad']), axis =1)
computeAccuracy(man_ann_data_glove_pos_cross_similary)


n_tp:24 n_fp:610 n_fn:826 n_tn:1031


(0.03785488958990536, 0.02823529411764706, 0.032345013477088944)

In [23]:
# try it out 
# cos_similarity = calculate_glove_pos_cross_similary(text1[0],text2[0],model,False)
cos_similarity = calculate_glove_pos_cross_similary("this is a test","loikujgh",model,False)

['test']


['test']



 POS tag is NOT used as filter

 cross_cos_similarity_12 : 1.0

 norm_cross_cos_similarity : 1.0


In [None]:
def text_pair_cross_similarity(text1,text2,model,flag):
    print("\n Cross Glove Similarity  : {}".format(calculate_glove_pos_cross_similary(text1,text2,model,False)))
    print ("=====================================\n")

for i in range(6):
    text_pair_cross_similarity(text1[i],text2[i],model,False)    


# Using wordnet to add synonyms to each of the text in pair of texts

In [70]:
## try out code 
from nltk.corpus import wordnet 
#Creating a list 
synonyms = []
for syn in wordnet.synsets("travel"):
    for lm in syn.lemmas():
             synonyms.append(lm.name())#adding into synonyms
print (set(synonyms))        

{'traveling', 'go', 'locomotion', 'trip', 'journey', 'move_around', 'move', 'jaunt', 'travelling', 'locomote', 'change_of_location', 'travel'}


Clearly getting the top n nearest words using distributional semantics DOES NOT work as all these "letme" etc are the most frequent occuring patterns !!! 

So, we will try out enhancing with wordnet !! 

In [90]:
# we also need to take care of composite lemma in wordnet. We just delete them now

def clean_lemma(lemma_list,chars):
    new_lemma_list=[]
    delete=0
    for s in lemma_list:
        if any((c in chars) for c in s):
            delete+=1
            #print("\n one composite lemma from wordnet found .. deleting..")
        else:
            new_lemma_list.append(s.lower())
    #print(new_lemma_list)
#     print("Total composite lemma deleted: {}".format(delete))
    return(new_lemma_list)

In [91]:
# we will try to return lemma list of only the nouns in the text
# apparently it works best for noun and verb for some scenarios

def lemma_token_pos_wordnet_syn(text):
    text = text.lower()  # to take care of capital case word in glove
    doc = nlp(text)
    lemma_list = []
    synonym_list=[]
    for token in doc:
        if token.is_stop is False:
            #if (token.pos_ == 'NOUN' or token.pos_ == 'VERB'):
            if (token.pos_ == 'NOUN' or token.pos_ == 'VERB' or token.pos_ == 'ADJ' or token.pos_ == 'adv'):
                token_preprocessed = preprocessor(token.lemma_)
                if token_preprocessed != '':
                     lemma_list.append(token_preprocessed)
#     print("\n lemma list after 1st preprocessing: ")
#     print(lemma_list) # test out 
    # There can be too many lemmas. So, we limit to lem_c=3
    for lemma in lemma_list:
        for syn in wordnet.synsets(lemma):
            for lm in syn.lemmas():
                synonym_list.append(lm.name())#adding into synonyms
    # In wordnet, there are synsets where same word is used in slightly different meanings
    # so we have to make a list of set 
    lemma_list=list(set(lemma_list+synonym_list))        
    lemma_list=clean_lemma(lemma_list,chars)
#     print("\n lemma list after wordnet & cleanup :")
#     print(lemma_list)
#     print("\n")
    return lemma_list

In [None]:
lemma_list= lemma_token_pos_wordnet_syn("My driving license is my identity in USA as everything is linked to it.")

In [None]:
#testing the routine. There will not be any more lemma to delete really 
print(clean_lemma(lemma_list,chars))

In [102]:
# we implement two checks 
# 1. For all composite words that wordnet gives as x_y, we delete 
# 2. Check if the words are existing in vocab. If not, skip 

def average_vec_pos_wordnet_syn(words,model):
#     words = lemma_token_pos_wordnet_syn(text)
    word_list=[]
    omit=0
    for word in words:
        if word in model.vocab:
            word_list.append(word)
        else:
            omit +=1
    word_vecs = [model.word_vec(w) for w in word_list]
#     print("Total no of words not found in vocab: {}".format(omit))
    if(len(word_vecs) ==0):
        return (np.array(word_vecs).sum(axis=0)).reshape(1,-1)
    
    else:
        return (np.array(word_vecs).sum(axis=0)/len(word_vecs)).reshape(1,-1)

def compare_pos_wordnet_syn(a,b):
    cos=cosine_similarity(average_vec_pos_wordnet_syn(a,model),average_vec_pos_wordnet_syn(b,model)).sum()
    return cos

In [103]:


def compute_pos_wordnet_cross_similarity(annotations_data, man_ann_data, model):
    ##### Data Init
    ad_keywords_clean = annotations_data['keywords_clean']
        
    ###### Preprocessing - tokenize
    tweets_lemma_tokens = man_ann_data['tweet_clean'].apply(lambda x: lemma_token_pos_wordnet_syn(x))
    print("Calculated pos tokens of tweets")
    ads_lemma_tokens = ad_keywords_clean.apply(lambda x: lemma_token_pos_wordnet_syn(x))
    print("Calculated pos tokens of ads")
        
    ###### Similarity calculation
    glove_sim_df = getSimilarityDf(annotations_data, ads_lemma_tokens, tweets_lemma_tokens, model)
    print("Glove Similarities Calculated")
    return glove_sim_df

In [None]:
pos_wordnet_sim = compute_pos_wordnet_cross_similarity(annotations_data, man_ann_data, model)


In [111]:
nlargest = 5
data = pos_wordnet_sim

result_con = getTopNSimAds(nlargest, data)
## merge mann ann data and sim result
pos_wordnet_sim_merged = pd.concat([man_ann_data, result_con], axis =1)

print("Top n similar ads computed")
pos_wordnet_sim_merged.to_csv('/Users/vcroopana/Downloads/summer2020/superbowl/sim_glove_pos_wordnet_similary.csv')

pos_wordnet_sim_merged['conf_matrix'] = pos_wordnet_sim_merged.apply(lambda x: get_conf_matrix(x['ad_manual'], x['top1_ad'], 
                                                          x['top1'], 0.8), axis =1)
# man_ann_data_glove_pos['conf_matrix'] = man_ann_data_glove_pos.apply(lambda x: get_conf_matrix_2(x['ad_manual'], x['top1_ad'], 
#                             x['top2_ad'], x['top3_ad'], x['top4_ad'], x['top5_ad']), axis =1)
computeAccuracy(pos_wordnet_sim_merged)
#Results: 0.8
# Top n similar ads computed
# n_tp:25 n_fp:1318 n_fn:825 n_tn:332
# Out[111]:
# (0.018615040953090096, 0.029411764705882353, 0.022799817601459188)

# Results : 0.9
# Top n similar ads computed
# n_tp:12 n_fp:742 n_fn:838 n_tn:908
# Out[110]:
# (0.015915119363395226, 0.01411764705882353, 0.014962593516209476)

  


Top n similar ads computed
n_tp:25 n_fp:1318 n_fn:825 n_tn:332


(0.018615040953090096, 0.029411764705882353, 0.022799817601459188)

In [None]:
cos=compare_pos_wordnet_syn(text1[5],text2[5])
print(cos)

In [None]:
def text_pair_similarity(text1,text2):
    print(text1)
    print ("\n")
    print(text2)
    print ("\n")
    print("\n Similarity after pre_processing for POS and adding synomyms: {}".format(compare_pos_wordnet_syn(text1,text2)))
    print ("=====================================\n")

for i in range(7):
    text_pair_similarity(text1[i],text2[i])    


In [77]:
doc=nlp("Ram sings beautifully")

In [None]:
lemma_list=[]
for token in doc:
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_,"---->",token.lemma_)
    token_preprocessed = preprocessor(token.lemma_)
    if token_preprocessed != '':
        lemma_list.append(token_preprocessed)
    for lemma in lemma_list:
        for syn in wordnet.synsets(lemma):
            for lm in syn.lemmas():
                synonym_list.append(lm.name())#adding into synonyms
    # In wordnet, there are synsets where same word is used in slightly different meanings
    # so we have to make a list of set 
    lemma_list=list(set(lemma_list+synonym_list))        
    lemma_list=clean_lemma(lemma_list,chars)
print(lemma_list)

In [None]:
word_list=[]
omit=0
for word in lemma_list:
    if word in model.vocab:
        word_list.append(word)
    else:
        omit +=1
word_vecs = [model.word_vec(w) for w in word_list]
print("Total no of words not found in vocab: {}".format(omit))
a= (np.array(word_vecs).sum(axis=0)/len(word_vecs)).reshape(1,-1)


## Findings 

Essentially the idea of getting all lemma for all entries in synset may not work as it can lead to hundreds and it will completely destroy the calculations. 

We have also seen that for very short sentence it may not work ( will give error) as you will be using POS tagging that may not identify any word ( due to shortness of sentence or error in tagger). In that case, you have no lemma identified and the code will give error !! 

The other thing observed is : POS tagger is also statistical and it is never 100% accurate. So, garbage in garbage out may happen 

Best findings so far : stick to Glove but pick up only the correct POS Tagged words so that your similarity calculation is more accurate 


# wordnet based implementation of semantic cross similarity  

We will do cross similarity so that we get similarity with the full target context.

There are quite a few WordNet Similarities. Few have values greater than 1. Wu Palmer is between 0 and 1

Also word.n.01 is the deepest level in the wordnet hierarchy 


In [None]:
filtered_sent1 = []
filtered_sent2 = []
counter1 = 0
counter2 = 0
sent21_similarity = 0
sent12_similarity = 0


In [112]:
# Add synonyms to match list


def synonymsCreator(word):
    synonyms = []

    for syn in wordnet.synsets(word):
        for i in syn.lemmas():
            synonyms.append(i.name())

    return synonyms


In [113]:
# Cehck and return similarity


def simlilarityCheck(word1, word2):

    word1 = word1 + ".n.01"
    word2 = word2 + ".n.01"
    try:
        w1 = wordnet.synset(word1)
        w2 = wordnet.synset(word2)

        return w1.wup_similarity(w2)

    except:
        return 0

# -----------------------------------------------------------------------------------------


In [None]:
# try out the code 
sent1 = text1[0]
sent2 = text2[0]

In [114]:
def simpleFilter(sentence,syn_flag):
    # Does not do wordnet synonym
    filtered_sent = []
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(sentence)

    for w in words:
        if w not in stop_words:
            filtered_sent.append(lemmatizer.lemmatize(w))
            if syn_flag==True:
                for i in synonymsCreator(w):
                    filtered_sent.append(i)
    return filtered_sent



In [None]:
filtered_sent1 = simpleFilter(sent1,False)
filtered_sent2 = simpleFilter(sent2, False)


In [None]:
filtered_sent1

In [None]:
filtered_sent2

In [115]:
#let us calculate all cross word similarity check between the two sentences
# for i words and j words, we have ixj combinations of similarity 
#Taking the ixj kind of captures the similarity of one word of one sentence 
# with the context consttuted by all words in the other sentence 
# note that, jxi will mean the same thing from the opposite direction 
# we can probably normalize this with (i*j) assuming that is the max value 
def calculate_wordnet_similary(text1,text2,syn_flag):
    filtered_sent1 = simpleFilter(text1,syn_flag)
    filtered_sent2 = simpleFilter(text2,syn_flag)
    sent1_count = len(filtered_sent1 )
    sent2_count = len (filtered_sent2 )
    if syn_flag==True: # in case we are taking synonyms
        filtered_sent1=list(set(filtered_sent1))
        filtered_sent2=list(set(filtered_sent2))
        filtered_sent1=clean_lemma(filtered_sent1,chars)
        filtered_sent2=clean_lemma(filtered_sent2,chars)
    sent12_similarity=0
    sent21_similarity=0
    
    for i in filtered_sent1:
        for j in filtered_sent2:
            sent12_similarity = sent12_similarity + simlilarityCheck(i, j)
    normalised_similarity= (sent12_similarity+sent21_similarity ) /(sent1_count*sent2_count)       
    print(sent12_similarity)
    #print (sent21_similarity)
    print(normalised_similarity )
    return(normalised_similarity)

In [None]:

def compute_word_net_semantic_similarity(annotations_data, man_ann_data, model, syn_flag):
    ##### Data Init
    ad_keywords_clean = annotations_data['keywords_clean']
    tweets_clean = man_ann_data['tweet_clean']
    
    ad_id = 1
    wordnet_sim_df = pd.DataFrame(columns = annotations_data['Ad Name'])
    
    for ad in ad_keywords_clean:
        
        wordnet_sim_col =[]
        filtered_ad = simpleFilter(ad,syn_flag)      
        sent_ad_count = len(filtered_ad )
        
        if syn_flag==True: # in case we are taking synonyms
            filtered_ad=list(set(filtered_ad))           
            filtered_ad=clean_lemma(filtered_ad,chars)            
        
        print("computing sim for ad:" + str(ad_id))
        
        for i in range(0, len(tweets_clean)): 
            filtered_sent2 = simpleFilter(tweets_clean[i],syn_flag)
            sent2_count = len(filtered_sent2)
            if syn_flag==True:
                filtered_sent2=list(set(filtered_sent2))
                filtered_sent2=clean_lemma(filtered_sent2,chars)
            sent12_similarity=0
            sent21_similarity=0 # Where is this used?
            
            for p in filtered_ad:
                for q in filtered_sent2:
                    sent12_similarity = sent12_similarity + simlilarityCheck(p, q)
            normalised_similarity= (sent12_similarity + sent21_similarity ) /(sent_ad_count*sent2_count)    
            
            wordnet_sim_col.append(normalised_similarity)
            
        wordnet_sim_df[annotations_data['Ad Name'][ad_id]] = wordnet_sim_col
        ad_id = ad_id + 1
    print("Wordnet Similarities Calculated")
    return wordnet_sim_df


In [None]:
wordnet_sim_df = compute_word_net_semantic_similarity(annotations_data, man_ann_data, model, False)

In [128]:
nlargest = 5
data = wordnet_sim_df

result_con = getTopNSimAds(nlargest, data)
## merge mann ann data and sim result
wordnet_sim_df_merged = pd.concat([man_ann_data, result_con], axis =1)

print("Top n similar ads computed")

wordnet_sim_df_merged['conf_matrix'] = wordnet_sim_df_merged.apply(lambda x: get_conf_matrix(x['ad_manual'], x['top1_ad'], 
                                                          x['top1'], 0.19), axis =1)
# man_ann_data_glove_pos['conf_matrix'] = man_ann_data_glove_pos.apply(lambda x: get_conf_matrix_2(x['ad_manual'], x['top1_ad'], 
#                             x['top2_ad'], x['top3_ad'], x['top4_ad'], x['top5_ad']), axis =1)
wordnet_sim_df_merged.to_csv('/Users/vcroopana/Downloads/summer2020/superbowl/sim_wordnet_cross_no_syn.csv')


computeAccuracy(wordnet_sim_df_merged)


  


Top n similar ads computed
n_tp:1 n_fp:488 n_fn:849 n_tn:1162


(0.002044989775051125, 0.001176470588235294, 0.0014936519790888722)

In [None]:
def text_pair_similarity_run(text1,text2):
    print(text1)
    print ("\n")
    print(text2)
    print ("\n")
    print("\n Wordnet Similarity after pre_processing and adding synomyms: {}".format(calculate_wordnet_similary(text1,text2,False)))
    print ("=====================================\n")

for i in range(7):
    text_pair_similarity_run(text1[i],text2[i])    


# Final Findings including BERT similarity from another notebook on Google Colab 


<table style="width:100%">
  <tr>
    <th>Sentence pair</th>
    <th>Comments</th>  
    <th>Glove average</th>
    <th>Glove average+POS </th>
    <th>Glove average+top_n </th>
    <th>Glove cross similarity </th>
    <th>Wordnet -no Synonyms (wu palmer) </th>
    <th>Wordnet + cross similarity </th>
    <th>BERT based similarity </th>  
      
  </tr>
  <tr>
    <td>[Trying to test how the method is for identical text.],[Trying to test how the method is for identical text.] </td>
    <td>Similarity should be 1.00 </td>  
    <td>1.0 </td>
    <td> 1.0 </td>
    <td> 1.0 </td>
    <td> 1.0 </td>
    <td> 1.0 </td>
    <td> 1.0  </td>
      <td> 1.0 </td>
  </tr>
    <tr>
    <td>[I like that food],[That dish is exciting] </td>
    <td>Quite close as both about good dish </td>  
    <td>0.46 </td>
    <td> 0.59</td>
    <td>0.97 </td>
    <td>0.33 </td>
    <td>0.64 </td>
    <td>0.70 </td>
     <td> 0.85 </td>
  </tr>
  <tr>
    <td>[Ram is very nice..],[Is Ram very nice?] </td>
    <td> Essentially same statement but put differently</td>
    <td>0.99</td>
    <td>0.81 </td>
    <td>0.97 </td>
    <td>0.64 </td>
    <td>0.78 </td>
    <td>0.31 </td>
      <td>0.76  </td>
  </tr>
  <tr>
    <td>[Pure malt whiskey.],[Fresh orange juice.] </td>
    <td>Both about drinks but different drinks</td>
    <td>0.50 </td>
    <td>0.45 </td>
    <td>0.97 </td>
    <td>0.32 </td>
    <td>0.65 </td> 
    <td>0.07 </td>
      <td>0.77 </td>
  </tr>
  <tr>
    <td>[It is a ferocious dog and it barks whenever anybody uknown comes near the house.],[The painting in the art gallery is so fantastic !] </td>
    <td>Quite different </td>
    <td>0.55 </td>
    <td>0.47 </td>
    <td>0.97 </td>
    <td>0.24 </td>
    <td>0.70 </td>
    <td>0.07 </td>
      <td>0.66 </td>
  </tr>
  <tr>
    <td>[It is the family cow.],[I have been driving this sports car for last ten years and I am so satisfied!]  </td>
   <td>Quite different </td>
    <td>0.63 </td>
    <td>0.64 </td>
    <td>0.97 </td>
    <td>0.35 </td>
    <td>0.78 </td>
    <td>0.13 </td>
      <td>0.58 </td>
  </tr>
  <tr>
    <td>[My driving license is my identity in USA as everything is linked to it.],[The president led us to war and we lost that war!]  </td>
    <td>Quite different </td>
    <td>0.61 </td>
    <td>0.55 </td>
    <td>0.97 </td>
    <td>0.31 </td>
    <td>0.88</td>
    <td>0.08 </td>
    <td>0.69 </td>  
  </tr>
    
</table>

# Analysis 

1. Glove with top n similar words cannot be used . Similarly the routine for wordnet with synonyms has a flag that is set false as Wordnet can come up with a very large list of synonyms and can result in error in this function or produce very  bad similarity. What we have is pure WUP similarity 

2. First sentence pair is identical - so yielding 1

3. Second sentence pair is quite close ( about good dish) - so BERT, Wordnet Cross similarity, Pure Wordnet  working fine 

4. Third sentence pair is same but one is affirmative and the other is question. So, predictably Glove avearge, Glove Average(POS) works fine as both sides are same. But BERT again is working good and Pure Wordnet WUP similarity is whereas Wordnet Cross Similarity is bad 

5. Fourth, fifth and sixth sentence pairs are kind of different and here it shows that because of enormity of possible synonyms, wordnet still finds some similarity. Average approach simply averages and it is not really good. But Glove cross similarity picks up the difference in meaning quite well 

How do we handle then similar and disparate pairs ? Looks like WUP similarity ( pure Wordnet approach) + BERT are good for similar sentences and Glove Cross similarity good for different sentences. So what may work out is 0.5(first method) + 0.5 (2nd method ) to balance it out ... 
