# Word2Vec + Cosine Similarity

Our choosed method to rank is the combination of word2vec and cosine similarity. We have followed the implementation suggested in this [paper](
https://arxiv.org/pdf/1507.07998.pdf), *Document Embedding with Paragraph Vectors*.

The ranking works as follows. First we create a vocabulary from our collection using Word2Vec class in gensim.models. For each term in each document we transform it using Word2Vec embedding. Then for each document we average the resulting vector in to one single vector representative of the model. 

After doing this for the entire collection we obtain the vector representation of teach document. In query time we repeat the exact same process. Transform each term in query into its corresponding word2vec embedding and average all vector to obtain a final one representative of the query. 

For ranking we use cosine similarity between each query and doc. Top N documents are returned.

https://www.analyticsvidhya.com/blog/2020/08/information-retrieval-using-word2vec-based-vector-space-model/

https://arxiv.org/pdf/1507.07998.pdf

In [None]:
import pandas as pd
import re
import spacy
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pickle

Mounted at /content/drive/


In [None]:
data_Final = pd.read_csv(f"{PATH}final_Tweets.csv")
data = data_Final.copy()
data_Final.head()

Unnamed: 0.1,Unnamed: 0,created_at,favorite_count,full_text,id,retweet_count,user.id,user.name
0,0,2020-11-11,4,International friendly roundup: Finland stun F...,1326667371730378753,1,16042794,Guardian US
1,1,2020-11-11,11,When Joe Biden formally takes over the preside...,1326666012142526466,5,16042794,Guardian US
2,2,2020-11-11,4,New Yorker fires Jeffrey Toobin after he repor...,1326663505454510081,1,16042794,Guardian US
3,3,2020-11-11,8,One week on: how has Donald Trump handled losi...,1326661105498796032,1,16042794,Guardian US
4,4,2020-11-11,13,France pays tribute to six-year-old resistance...,1326659924278046728,6,16042794,Guardian US


## Text Prepocessing

* Lowercase the text
* Expand Contractions
* Clean the text
* Remove Stopwords
* Lemmatize words

In this method we have decided to implement a stronger text processing to retain as much as possible relations and semantics between words.

In [None]:
def expand_contractions(text, contractions_dict, contractions_re):
    """
    Given contraction find match and substitude
    """
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace,text)

def clean_text(text):
    """
    * Remove words with digits
    * Replace newline characters with space
    * Remove URLS
    * Replace non english chars with space
    """
    # Remove digits
    text=re.sub('\w*\d\w*','', text)

    # Remove new Line chars
    text=re.sub('\n',' ',text)

    #Remove links
    text=re.sub(r"http\S+", "", text)

    #Replace non-english chars
    text=re.sub('[^a-z]',' ',text)
    
    return text

In [None]:
def preprocessing(text):
    """
    Given a pandas dataframe apply preprocessing techinques
        * Lowercase the text
        * Expand Contractions
        * Clean the text
        * Remove Stopwords
        * Lemmatize words
    """
    # Lower case
    text = text.lower()

    # Regular expression for finding contractions
    contractions_re=re.compile('(%s)' % '|'.join(contractions_dict.keys()))

    #Expand contractions
    text = expand_contractions(text,contractions_dict,contractions_re)
    text = clean_text(text)

    #Remove added spaces
    text = re.sub(" +"," ",text)
    text = text.strip()

    #Stop words and Lemmatizing
    text = ' '.join([token.lemma_ for token in list(nlp(text)) if (token.is_stop==False)])

    return text

In [None]:
# Dictionary of english Contractions
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not","can't": "can not","can't've": "cannot have",
"'cause": "because","could've": "could have","couldn't": "could not","couldn't've": "could not have",
"didn't": "did not","doesn't": "does not","don't": "do not","hadn't": "had not","hadn't've": "had not have",
"hasn't": "has not","haven't": "have not","he'd": "he would","he'd've": "he would have","he'll": "he will",
"he'll've": "he will have","how'd": "how did","how'd'y": "how do you","how'll": "how will","i'd": "i would",
"i'd've": "i would have","i'll": "i will","i'll've": "i will have","i'm": "i am","i've": "i have",
"isn't": "is not","it'd": "it would","it'd've": "it would have","it'll": "it will","it'll've": "it will have",
"let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not",
"mightn't've": "might not have","must've": "must have","mustn't": "must not","mustn't've": "must not have",
"needn't": "need not","needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
"oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have","she'll": "she will",
"she'll've": "she will have","should've": "should have","shouldn't": "should not",
"shouldn't've": "should not have","so've": "so have","that'd": "that would","that'd've": "that would have",
"there'd": "there would","there'd've": "there would have",
"they'd": "they would","they'd've": "they would have","they'll": "they will","they'll've": "they will have",
"they're": "they are","they've": "they have","to've": "to have","wasn't": "was not","we'd": "we would",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
"weren't": "were not","what'll": "what will","what'll've": "what will have","what're": "what are",
"what've": "what have","when've": "when have","where'd": "where did",
"where've": "where have","who'll": "who will","who'll've": "who will have","who've": "who have",
"why've": "why have","will've": "will have","won't": "will not","won't've": "will not have",
"would've": "would have","wouldn't": "would not","wouldn't've": "would not have","y'all": "you all",
"y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
"you'd": "you would","you'd've": "you would have","you'll": "you will","you'll've": "you will have",
"you're": "you are","you've": "you have"}

# Save dict
pickle.dump(contractions_dict, open(f"{PATH}utils/contractions_dict.p", "wb"))

In [None]:
# Init NLP
nlp = spacy.load("en_core_web_sm",disable=["ner","parser"])
nlp.max_length = 5000000

# Preprocess tweets
data["full_text"]  = data["full_text"].apply(lambda x: preprocessing(x))

In [None]:
# Save preprocessed data
data.to_csv(f"{PATH}w2d_processed.csv", index= False)

## Word2Vec

This part of the notebook is focused on word2vector embedding. First we construct the model vocabulary. Then we represent each document in embedding vector form. 

In query time we represent the query following the same logic applied to the document and we compute the cosine similarity between each document query term. Finally we return top N most similar documents with respect the query.

In [None]:
# Create vocabulary
vocabulary = []
for tweet in data["full_text"]:
    terms = tweet.split()
    vocabulary.append(terms)
    
w2v_model = Word2Vec(vocabulary,size=100, min_count=1,window=2, sg=1,workers=4)

In [None]:
# Save the model
w2v_model.save(f"{PATH}w2v_model.kvmodel")

In [None]:
def embedding_w2v(doc_tokens):
    """
    Returns vector representation of a string
    """
    embeddings = []
    if len(doc_tokens)<1:
        return np.zeros(100)
    else:
        for t in doc_tokens:
            if t in w2v_model.wv.vocab:
                embeddings.append(w2v_model.wv.word_vec(t))
            else:
                embeddings.append(np.random.rand(100))
    
    return np.mean(embeddings, axis = 0)

In [None]:
def w2v_collection(data):
    """
    Given a collection of documents returns the pair id:vector where the vector is
    the embedding representation of the doc.
    """
    id_doc2v = {}
    for id, text in zip(data["id"].values, data["full_text"]):
        id_doc2v[id] = embedding_w2v(text)

    return id_doc2v

In [None]:
def rank(query, id_doc2vec):
    """
    Given a query preprocesses it, embeds it and return ordered dictionary of id:similarity_score
    pair.
    """
    # Pre-process query
    query = preprocessing(query)

    # Query vector
    q_vector = embedding_w2v(query.split())

    #Doc query similarity
    doc_query_sim = {k: cosine_similarity(np.array(v).reshape(1,-1),np.array(q_vector).reshape(1,-1)) for k,v in id_doc2vec.items()}

    # Sort
    doc_query_sim = {k: v for k, v in sorted(doc_query_sim.items(), key=lambda item: item[1], reverse = True)}
    
    return doc_query_sim

In [1]:
def parser_tweet_results(doc):
  """
Given a Pandas dataframe row formates the information por display
Arguments:
  docs -- pandas dataframe with unique row with tweet info.
Returns:
  tweet -- text tweet - str
  authors -- user name of tweet - str
  date -- of publication -- str
  retweets -- count of retweets - str
  favorites -- count of favourites - str
  """
  # Tweet
  tweet = str(doc["full_text"].values)
  tweet = tweet.replace("'","")
  tweet = tweet.replace("[","")
  tweet = tweet.replace("]","")

  # Author
  author = str(doc["user.name"].values)
  author = author.replace("[","")
  author = author.replace("]","")

  # Date
  date = str(doc["created_at"].values)
  date = date.replace("[","")
  date = date.replace("]","")
  date = date.replace("'","")

  # Retweets
  retweets = str(doc["retweet_count"].values)
  retweets = retweets.replace("[","")
  retweets = retweets.replace("]","")

  # Favorites
  favorites = str(doc["favorite_count"].values)
  favorites = favorites.replace("[","")
  favorites = favorites.replace("]","")

  # URL
  id = str(doc["id"].values)
  url = f"https://twitter.com/twitter/statuses/{id}"

  #Hashtags
  hashtags = str(doc["entities.hashtags	"].values)
  
  return tweet, date, author, retweets, favorites, url, hashtags

In [None]:
def search(id_doc2vector, topn= 20):
    """
    Search for tweets inputing a query and see displayed results.
    Arguments:
        id_doc2vector -- dic containing id:vec2doc pair - dic
        topn -- default: 20 - Top N result to display - int.

    """
    print("######################################################")
    print("Insert query:")
    query = input()
    print("######################################################\n")

    # Get ranked docs
    doc_query_sim = rank(query, id_doc2vector)
    ids = list(doc_query_sim.keys())[:topn]

    for index, id in enumerate(ids):
        doc = data_Final[data_Final["id"] == id]
        tweet, date, author, retweets, favorites, url, hashtags = parser_tweet_results(doc)
    
        print("______________________________________________________")
        print(f"Tweet {index}")
        print(f"\t·Author: {author}")
        print(f"\t·Date: {date}")
        print(f"\t·Tweet: {tweet}")
        print(f"\t·Retweets: {retweets}")
        print(f"\t·Favorites: {favorites}")
        print(f"\t·Hashtags: {hashtags}")
        print(f"\t·URL: {url}")
        print("______________________________________________________\n")

In [None]:
# Vector representation id:vector
id_word2vector = w2v_collection(data)

# Save preprocessed data
pickle.dump(id_word2vector, open(f"{PATH}utils/id_word2vector.p", "wb"))

In [None]:
search(id_doc2vector)

######################################################
Insert query:
joe Binde
######################################################

______________________________________________________
Tweet 0
	·Author: 'Bloomberg Politics'
	·Date: 2020-10-27
	·Tweet: U.S. backs Taiwan missile sale with China tensions soaring https://t.co/up2ZouDsj4
	·Retweets: 3
	·Favorites: 9
______________________________________________________

______________________________________________________
Tweet 1
	·Author: 'HuffPost Politics'
	·Date: 2020-11-04
	·Tweet: “He hasn’t won these states. Nobody is saying he won the states. The states haven’t said he’s won," Wallace said. https://t.co/hAuus2BqMK
	·Retweets: 42
	·Favorites: 201
______________________________________________________

______________________________________________________
Tweet 2
	·Author: 'Bloomberg Politics'
	·Date: 2020-11-10
	·Tweet: "For now its still masks and social distancing in the fight against the virus https://t.co/35YRo00obf"
	·R