# Twitter search with word embeddings

Your task is to create a program that searches through tweets using word embeddings. Given a search query, your program should return the top 5 tweets most relevant to this query for each distance algorithm used (you will use 2 distance algorithms, which means you'll return 10 tweets as a result; more information below). You can achieve this by performing the following steps:



In [244]:
#pip install tabulate



In [170]:
#pip install gensim

In [171]:
#pip install scikit-learn

In [172]:
#pip install contractions

In [246]:
import pandas as pd
import contractions
import os
import re
import nltk
from nltk.tokenize import TweetTokenizer
import string

import numpy as np
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

from tabulate import tabulate


In [175]:
data = pd.read_csv('tweets.csv', encoding="latin-1")

In [176]:
data.head()

Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ..."


1- Perform the necessary pre-processing on the tweets. Keep in mind that tweets contain lots of typos and non-conventional characters (such as emoticons).



In [177]:
def clean(data):
    # work on a copy so the caller's DataFrame is not modified
    data_clean = data.copy()
    data_clean['text_clean'] = data_clean['text']
    
    #remove first two characters
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: x[2:])

    #text to lowercase
    data_clean['text_clean'] = data_clean['text_clean'].str.lower()

    #remove URL links
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

    #remove placeholders
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: re.sub(r'{link}', '', x))
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: re.sub(r"\[video\]", '', x))

    #remove HTML reference characters
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: re.sub(r'&[a-z]+;', '', x))

    #remove handles
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: re.sub(r"@([a-zA-Z0-9_]{1,50})","", x))

    #remove non-letter characters
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", '', x))

    # Replace contractions with their longer forms 
    data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x:  contractions.fix(x))

 

    return data_clean


In [178]:
data_clean = clean(data)

In [179]:
#use the tweet tokenizer function from nltk (keep emojis)

tknzr = TweetTokenizer()

data_clean['tokens'] = data_clean['text_clean'].apply(tknzr.tokenize)

data_clean.head()

Unnamed: 0,id,created_at,text,text_clean,tokens
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...,and so the robots spared humanity,"[and, so, the, robots, spared, humanity]"
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa...",exactly tesla is absurdly overvalued if ba...,"[exactly, tesla, is, absurdly, overvalued, if,..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",et tu walt',"[et, tu, walt, ']"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...',stormy weather in shortville ',"[stormy, weather, in, shortville, ']"
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ...",coal is dying due to nat gas fracking it is ...,"[coal, is, dying, due, to, nat, gas, fracking,..."


In [180]:
#remove tokens that consist of a single punctuation character

PUNCTUATION_SET = set(string.punctuation)

def remove_punctuation(word_list):
    return [w for w in word_list if w not in PUNCTUATION_SET]
    
data_clean['tokens'] = data_clean['tokens'].apply(remove_punctuation)

data_clean = data_clean.dropna()
data_clean.head()

Unnamed: 0,id,created_at,text,text_clean,tokens
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...,and so the robots spared humanity,"[and, so, the, robots, spared, humanity]"
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa...",exactly tesla is absurdly overvalued if ba...,"[exactly, tesla, is, absurdly, overvalued, if,..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",et tu walt',"[et, tu, walt]"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...',stormy weather in shortville ',"[stormy, weather, in, shortville]"
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ...",coal is dying due to nat gas fracking it is ...,"[coal, is, dying, due, to, nat, gas, fracking,..."
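
The `text` column appears to hold the printed form of Python bytes objects (`b'...'` or `b"..."`), which is why slicing off the first two characters leaves a trailing quote in `text_clean`. As a possible alternative, the sketch below parses the literal with `ast.literal_eval`, assuming each cell is a complete bytes literal and falling back to the raw string otherwise. The helper name `decode_bytes_literal` is hypothetical and not part of the notebook.

```python
import ast

def decode_bytes_literal(text):
    """Turn a string such as "b'Hello'" into "Hello"; return other input unchanged."""
    if not isinstance(text, str) or not text.startswith(("b'", 'b"')):
        return text
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Malformed or truncated literal: keep the raw string.
        return text
    if isinstance(value, bytes):
        # Assumption: the tweets were encoded as UTF-8 before being printed.
        return value.decode("utf-8", errors="replace")
    return text

print(decode_bytes_literal("b'Stormy weather in Shortville ...'"))
```

It could be applied with `data['text'].apply(decode_bytes_literal)` before the other cleaning steps, in place of the `x[2:]` slice.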


2- Apply word embedding to the pre-processed tweets, using the GloVe model (choose the appropriate pre-trained model from https://nlp.stanford.edu/projects/glove/ that your computer can handle; a bigger model generally means more accuracy but also higher memory requirements). The embedding representation of a tweet is the mean of the word embeddings of all the words in this tweet.
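
To make the "mean of the word embeddings" idea concrete, here is a minimal sketch using made-up 3-dimensional vectors instead of GloVe. The dictionary `toy_vectors` and the function `mean_embedding` are hypothetical names used only for illustration; words missing from the vocabulary are simply skipped.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "stormy": np.array([1.0, 0.0, 2.0]),
    "weather": np.array([3.0, 2.0, 0.0]),
}

def mean_embedding(tokens, vectors):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

# "shortville" is not in the toy vocabulary, so only two vectors are averaged.
tweet_vector = mean_embedding(["stormy", "weather", "shortville"], toy_vectors)
print(tweet_vector)
```

One consequence of this design is that a tweet made up entirely of out-of-vocabulary words has no embedding and has to be dropped or handled separately.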


In [181]:
# Global parameters
# root folder containing the GloVe directory
root_folder='.'
data_folder_name='glove.twitter.27B'

# GloVe file to use (here the 200-dimensional Twitter vectors)
glove_filename='glove.twitter.27B.200d.txt'


# Variable for data directory
DATA_PATH = os.path.abspath(os.path.join(root_folder, data_folder_name))
glove_path = os.path.abspath(os.path.join(DATA_PATH, glove_filename))

# Both train and test set are in the root data directory
train_path = DATA_PATH
test_path = DATA_PATH

#Relevant columns
TEXT_COLUMN = 'text'
TARGET_COLUMN = 'target'

In [182]:
# We just need to run this code once, the function glove2word2vec saves the Glove embeddings in the word2vec format 
# that will be loaded in the next section

glove_input_file = glove_filename
word2vec_output_file = glove_filename+'.word2vec'

glove2word2vec(glove_path, word2vec_output_file)



(1193514, 200)

In [183]:
# load the Stanford GloVe model
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

In [184]:


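#vectorize tweets with our model
def vectorize(tokenized_sentence):
    # collect the embedding of every token that exists in the GloVe vocabulary
    result = [model[token] for token in tokenized_sentence if token in model.key_to_index]
    if not result:
        # no known token: return NaN so that dropna() removes the row later
        return np.nan
    # the tweet embedding is the mean of its word embeddings
    return np.mean(result, axis=0)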


data_clean['vectorized'] = data_clean['tokens'].apply(vectorize)



In [186]:
data_clean = data_clean.dropna()
data_clean.head()

Unnamed: 0,id,created_at,text,text_clean,tokens,vectorized
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...,and so the robots spared humanity,"[and, so, the, robots, spared, humanity]","[0.16549633, 0.070804335, 0.17029466, 0.200102..."
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa...",exactly tesla is absurdly overvalued if ba...,"[exactly, tesla, is, absurdly, overvalued, if,...","[0.17040388, 0.24164951, 0.22459331, 0.1952677..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",et tu walt',"[et, tu, walt]","[-0.021688962, -0.23816268, -0.389869, 0.24019..."
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...',stormy weather in shortville ',"[stormy, weather, in, shortville]","[-0.33656335, -0.16922998, -0.48255336, -0.098..."
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ...",coal is dying due to nat gas fracking it is ...,"[coal, is, dying, due, to, nat, gas, fracking,...","[0.05625992, 0.012350748, -0.008011498, 0.0590..."



3- Apply word embeddings to the search query.
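
Since the tweets are lowercased and stripped of URLs and handles before embedding, it may help to pass the query through comparable cleaning so that both sides share a vocabulary. The function below is a simplified, hypothetical sketch (`clean_text` is not part of the notebook). Unlike the regex used for the tweets, it drops every non-letter character, including emoticon characters.

```python
import re

def clean_text(text):
    """Simplified cleaning for a single string (an assumption, not the notebook's exact rules)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)      # remove URLs
    text = re.sub(r"@[a-z0-9_]{1,50}", "", text)  # remove handles
    text = re.sub(r"&[a-z]+;", "", text)          # remove HTML entities such as &amp;
    text = re.sub(r"[^a-z\s]", "", text)          # keep only letters and whitespace
    return " ".join(text.split())                 # collapse repeated whitespace

print(clean_text("I want to go to the Moon! https://example.com @nasa"))
```

The cleaned string can then be tokenized and averaged in the same way as a tweet.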



In [220]:
#enter the sentence or word to search for relevant tweets
sentence = input()
sentence

'i want to go to the moon'

In [221]:
#put the sentence into a DataFrame
d = {'text': [sentence]}
phrase = pd.DataFrame(data=d)
phrase['text']

0    i want to go to the moon
Name: text, dtype: object

In [222]:
#just tokenize the sentence
tknzr = TweetTokenizer()
phrase['tokens'] = phrase['text'].apply(tknzr.tokenize)
phrase['tokens']

0    [i, want, to, go, to, the, moon]
Name: tokens, dtype: object

In [223]:
#and vectorize it
phrase['vectorized'] = phrase['tokens'].apply(vectorize)
phrase['vectorized']

0    [0.32468012, 0.21205099, 0.22474143, 0.0002394...
Name: vectorized, dtype: object

In [224]:
data_clean.head() 

Unnamed: 0,id,created_at,text,text_clean,tokens,vectorized,cosine,euclidian
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...,and so the robots spared humanity,"[and, so, the, robots, spared, humanity]","[0.16549633, 0.070804335, 0.17029466, 0.200102...",0.808744,3.747881
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa...",exactly tesla is absurdly overvalued if ba...,"[exactly, tesla, is, absurdly, overvalued, if,...","[0.17040388, 0.24164951, 0.22459331, 0.1952677...",0.83885,3.62642
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",et tu walt',"[et, tu, walt]","[-0.021688962, -0.23816268, -0.389869, 0.24019...",0.434603,6.263538
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...',stormy weather in shortville ',"[stormy, weather, in, shortville]","[-0.33656335, -0.16922998, -0.48255336, -0.098...",0.635994,5.232181
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ...",coal is dying due to nat gas fracking it is ...,"[coal, is, dying, due, to, nat, gas, fracking,...","[0.05625992, 0.012350748, -0.008011498, 0.0590...",0.803826,3.785735


4- Calculate the distance between the embedding of the search query and the embeddings of all the tweets, then rank the tweets by relevance. You will use 2 measures: cosine similarity and Euclidean distance. For Euclidean distance, sort in increasing order (smaller distance = more relevant); cosine similarity is a similarity rather than a distance, so sort in decreasing order (larger value = more relevant). Both are implemented in scikit-learn:
Euclidean distance: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html
Cosine similarity: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
After that, return the top 5 tweets for each of the two measures.

The tweets dataset to be used is provided with this assignment.

Your deliverable is a Python notebook with all the code and necessary explanations.
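
The ranking in step 4 can be illustrated on toy data. The sketch below uses plain NumPy rather than the scikit-learn functions, under the assumption that no vector has zero length. The names `cosine_sim`, `euclidean_dist` and `top_k` are hypothetical. It also shows that the two measures can rank the same tweets differently, because cosine similarity ignores vector length while Euclidean distance does not.

```python
import numpy as np

def cosine_sim(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

def euclidean_dist(query, matrix):
    """Euclidean distance between one query vector and each row of matrix."""
    return np.linalg.norm(matrix - query, axis=1)

def top_k(query, matrix, k=5):
    """Row indices of the k best matches under each measure."""
    by_cosine = np.argsort(-cosine_sim(query, matrix))[:k]         # largest similarity first
    by_euclidean = np.argsort(euclidean_dist(query, matrix))[:k]   # smallest distance first
    return by_cosine, by_euclidean

# Toy 2-dimensional "tweet embeddings", invented for illustration.
query = np.array([1.0, 0.0])
tweets = np.array([[10.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
by_cos, by_euc = top_k(query, tweets, k=2)
print(by_cos, by_euc)
```

Here row 0 points in exactly the same direction as the query, so it ranks first by cosine similarity, yet it is the farthest row by Euclidean distance.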

In [227]:
# Cosine similarity between the vectorized sentence and the vectorized tweets

# stack the tweet embeddings into one (n_tweets, n_dims) matrix and compare in a single call
query_vector = phrase['vectorized'][0].reshape(1, -1)
tweet_matrix = np.vstack(data_clean['vectorized'].tolist())
data_clean['cosine'] = cosine_similarity(query_vector, tweet_matrix)[0]


In [249]:
#print the top 5 most relevant tweets by cosine similarity
print(sentence)
print(tabulate(data_clean[['text','cosine']].nlargest(5, ['cosine']), headers='keys', tablefmt='psql'))


i want to go to the moon
+------+-----------------------------------------------------------------------------------------------------------------------------------------------+----------+
|      | text                                                                                                                                          |   cosine |
|------+-----------------------------------------------------------------------------------------------------------------------------------------------+----------|
| 2816 | b'I made the volume on the Model S http://t.co/wMCnT53M go to 11.  Now I just need to work in a miniature Stonehenge...'                      | 0.965275 |
| 1103 | b'@jpfrappier yes, will go all the way to Alaska'                                                                                             | 0.964836 |
| 2454 | b"To be super clear, I don't wish to (nor could I) mandate anything about a Mars Colony. Am just working on the tech to get people there."    | 0.

In [229]:
# Euclidean distance between the vectorized sentence and the vectorized tweets

# stack the tweet embeddings into one (n_tweets, n_dims) matrix and compare in a single call
query_vector = phrase['vectorized'][0].reshape(1, -1)
tweet_matrix = np.vstack(data_clean['vectorized'].tolist())
data_clean['euclidian'] = euclidean_distances(query_vector, tweet_matrix)[0]



In [250]:
#print the top 5 most relevant tweets by Euclidean distance
print(sentence)
print(tabulate(data_clean[['text','euclidian']].nsmallest(5, ['euclidian']), headers='keys', tablefmt='psql'))

i want to go to the moon
+------+-----------------------------------------------------------------------------------------------------------------------------------------------+-------------+
|      | text                                                                                                                                          |   euclidian |
|------+-----------------------------------------------------------------------------------------------------------------------------------------------+-------------|
| 1103 | b'@jpfrappier yes, will go all the way to Alaska'                                                                                             |     1.78927 |
| 2816 | b'I made the volume on the Model S http://t.co/wMCnT53M go to 11.  Now I just need to work in a miniature Stonehenge...'                      |     1.87591 |
| 2454 | b"To be super clear, I don't wish to (nor could I) mandate anything about a Mars Colony. Am just working on the tech to get people 

In [None]:
#example of result with the glove.twitter.27B.200d.txt file

"""
i want to go to the moon
+------+-----------------------------------------------------------------------------------------------------------------------------------------------+-------------+
|      | text                                                                                                                                          |   euclidian |
|------+-----------------------------------------------------------------------------------------------------------------------------------------------+-------------|
| 1103 | b'@jpfrappier yes, will go all the way to Alaska'                                                                                             |     1.78927 |
| 2816 | b'I made the volume on the Model S http://t.co/wMCnT53M go to 11.  Now I just need to work in a miniature Stonehenge...'                      |     1.87591 |
| 2454 | b"To be super clear, I don't wish to (nor could I) mandate anything about a Mars Colony. Am just working on the tech to get people there."    |     1.91583 |
| 1874 | b'@QuantumG When we launch I want to know that SpaceX has done everything possible to keep the astronauts safe. Only a few more years to go.' |     1.96451 |
|  565 | b'RT @jeffmason1: "You almost want to get in and take off, don\'t you?" @POTUS says. https://t.co/DfAJOGyBWR'                                 |     1.99987 |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------+-------------+"""