## Twitter search with word embeddings
#### Your task is to create a program that searches through twitter tweets using word embeddings. Given a search query, your program should return the top 5 tweets relating to this query for each distance algorithm used (you will use 2 distance algorithms, which means you'll return 10 tweets as a result. lore information below). You can achieve this by performing the following steps:

### 1- Perform the necessary pre-processing on the tweets. Meep in mind that tweets contain lots of typos and non-conventional characters (like emoticons and the like).

In [2]:
import pandas as pd
import numpy as np

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation # to remove punctuation from corpus
from nltk import pos_tag

import re
from sklearn.metrics import accuracy_score
from nltk.corpus import wordnet as wn

In [3]:
# Import dataset 
data = pd.read_csv('tweets.csv', encoding='utf-8')
# We use the Utf-8 encoding to include emoticons

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 3)
display(data)

Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ..."
...,...,...,...
2814,142881284019060736,2011-12-03 08:22:07,b'That was a total non sequitur btw'
2815,142880871391838208,2011-12-03 08:20:28,"b'Great Voltaire quote, arguably better than T..."
2816,142188458125963264,2011-12-01 10:29:04,b'I made the volume on the Model S http://t.co...
2817,142179928203460608,2011-12-01 09:55:11,"b""Went to Iceland on Sat to ride bumper cars o..."


In [4]:
# Pre processing (remove stop word and tokenization) of the dataset


# Function to remove usernames and links
def remove_usernames_links(tweet):
    tweet = re.sub('@[^\s]+','',tweet)
    tweet = re.sub('http[^\s]+','',tweet)
    tweet =re.sub("b'[^\s]+",'', tweet)
    tweet =re.sub('b"[^\s]+','', tweet)
    return tweet
data['text'] = data['text'].apply(remove_usernames_links)

#Function to remove emoticon
def remove_emoticon(tweet):
    emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
    tweet = re.sub(emoji_pattern,'',tweet)
    return tweet
    
data['text'] = data['text'].apply(remove_emoticon)


# Remove stopword and basic preprocessing
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
# Tokenize the text column to get the new column 'tokenized_text'
data['stop'] = data['text'].apply(lambda x : remove_stopwords(x.lower())) 
data['tokenized'] = data['stop'].apply(lambda x : preprocess_string(x.lower())) 


pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 3)
display(data)


Unnamed: 0,id,created_at,text,stop,tokenized
0,849636868052275200,2017-04-05 14:56:29,so the robots spared humanity ...,robots spared humanity ...,"[robot, spare, human]"
1,848988730585096192,2017-04-03 20:01:01,"b"" Exactly. Tesla is absurdly overvalued if...","b"" exactly. tesla absurdly overvalued based pa...","[exactli, tesla, absurdli, overvalu, base, pas..."
2,848943072423497728,2017-04-03 16:59:35,"b' Et tu, Walt?'","b' et tu, walt?'",[walt]
3,848935705057280001,2017-04-03 16:30:19,weather in Shortville ...',weather shortville ...',"[weather, shortvil]"
4,848416049573658624,2017-04-02 06:05:23,"b"" Coal is dying due to nat gas fracking. It'...","b"" coal dying nat gas fracking. it's basically...","[coal, dy, nat, ga, frack, basic, dead]"
...,...,...,...,...,...
2814,142881284019060736,2011-12-03 08:22:07,was a total non sequitur btw',total non sequitur btw',"[total, non, sequitur, btw]"
2815,142880871391838208,2011-12-03 08:20:28,"Voltaire quote, arguably better than Twain. H...","voltaire quote, arguably better twain. hearing...","[voltair, quot, arguabl, better, twain, hear, ..."
2816,142188458125963264,2011-12-01 10:29:04,made the volume on the Model S go to 11. No...,volume model s 11. need work miniature stonehe...,"[volum, model, need, work, miniatur, stoneheng]"
2817,142179928203460608,2011-12-01 09:55:11,to Iceland on Sat to ride bumper cars on ice!...,"iceland sat ride bumper cars ice! no, country,...","[iceland, sat, ride, bumper, car, ic, countri,..."


In [5]:
# Import glove pre trained glove model with twitter
import gensim.downloader as api
wv = api.load('glove-twitter-100')
# Here we take one of the least performant GloVe model for twitter since my computer processor is pretty old (2015)

In [6]:
# Vectorization of the tokenized sentences of the dataset with the GloVe model
def vectorize(tokenized_sentence):
    result = []
    for token in tokenized_sentence:
        if(token in wv):
            result.append(wv[token])
    return np.mean(result, axis=0)


data['vectorized'] = data['tokenized'].apply(vectorize)

  return _methods._mean(a, axis=axis, dtype=dtype,


In [7]:
# Remove null values of the dataset with vectorized
data2 =data.dropna()
data2.isnull().sum()


id            0
created_at    0
text          0
stop          0
tokenized     0
vectorized    0
dtype: int64

### 2- Apply word embedding to the pre-processed tweets, using the GloVe model (choose the appropriate pre-trained model from here:  https://nlp.stanford.edu/projects/glove/  that conforms with your computer's processor capabilities;

### 3- Apply word embeddings to the search query.



In [8]:
# Search query example
search= "Hoping for summer this year"

def vectorize(tokenized_sentence):
    result = []
    for token in tokenized_sentence:
        if(token in wv):
            result.append(wv[token])
    return np.mean(result, axis=0)


vectorized = vectorize(search)


In [9]:
# The value of vectorized query
print(vectorized)


[ 0.20545775 -0.00383064 -0.01100486 -0.1789822  -0.07985076  0.23289976
 -0.33620736  0.06706404 -0.28886515  0.2869019  -0.03734813 -0.11326599
 -3.1475725   0.07909384 -0.14708735 -0.51196355 -0.45266637 -0.12161336
 -0.41320288  0.66862273 -0.09175472 -0.00653237  0.29945067  0.20366755
  0.36769807 -2.4995997  -0.20349513 -0.06186694  0.52953076  0.49043703
 -0.20725296 -0.26491967 -0.04582856 -0.01996399  0.85011727 -0.07256527
 -0.19717656 -0.16653176 -0.02182863  0.14568529 -1.7785684   0.21367615
  0.18615288 -0.58276385 -0.07617687 -0.10680218  0.08381936 -0.6144123
 -0.18013899 -0.70793456  0.00954687 -0.11328564  0.02801714  0.12264876
  0.23909865  0.09296556  0.03435346 -0.27475414  0.41803887 -0.3336523
 -0.24002095 -0.18870139  0.13602425  0.15005605  0.02807633  0.06457825
 -0.14292896  0.201437   -0.01321632 -0.035417    0.26779726 -0.7296068
  0.1128322   0.08366135 -0.35698634  0.24036841 -0.24166612  0.5388846
 -0.24115953 -0.48012257  1.0122213   0.22310232 -0.060

### 4- Calculate the distance between the embeddings of the search query and that of all the tweets, sort them in increasing order (smaller distance = more relevant to the search query). You will use 2 distance algorithms: cosine similarity and Euclidean distance.
### After that, you return the top 5 tweets using each of the aforementioned search algorithms.

In [10]:
# Euclidian Distance  

from scipy.spatial import distance
# Random example
print('Random Euclidian Distance :', distance.euclidean(vectorized,data['vectorized'][2]))

def distance_euclidian(datatest): 
    lowest_euclid = []
    for i in datatest['vectorized'] :
        if (np.isnan(distance.euclidean(vectorized,i)) == False & np.isinf(distance.euclidean(vectorized,i)) == False):
            lowest_euclid.append(distance.euclidean(vectorized,i))
    lowest_euclid.sort()
    print("Smallest value 1: " , lowest_euclid[0])
    print("Smallest value 2: " , lowest_euclid[1])
    print("Smallest value 3: " , lowest_euclid[2])
    print("Smallest value 4: " , lowest_euclid[3])
    print("Smallest value 5: " , lowest_euclid[4])
    return lowest_euclid
lowest_euclid =distance_euclidian(data2)



Random Euclidian Distance : 7.089881896972656
Smallest value 1:  3.852443218231201
Smallest value 2:  3.860917091369629
Smallest value 3:  3.894739866256714
Smallest value 4:  3.9158194065093994
Smallest value 5:  3.916928768157959


In [25]:
# Cosie similarity 
import numpy as np
from numpy.linalg import norm
#Random example 
cosine = np.dot(vectorized,data['vectorized'][0])/(norm(vectorized)*norm(data['vectorized'][0]))
print("Cosine Similarity:", cosine)
def distance_cosine(datatest):
    biggest_cosine = []
    for i in datatest['vectorized'] :
        if (np.isnan(np.dot(vectorized,i)/(norm(vectorized)*norm(i))) == False & np.isinf(np.dot(vectorized,i)/(norm(vectorized)*norm(i)))):
            biggest_cosine.append(np.dot(vectorized,i)/(norm(vectorized)*norm(i)))
    biggest_cosine.sort()

    # We remove negative values
    def removeNegative(arr):
        newArr = []
    
        for x in range(0, len(arr)):
            if (arr[x] >= 0):
                newArr.append(arr[x])
        return newArr

# We take the closest value of one (which mean its similar to our query search)
    biggest_cosine = removeNegative(biggest_cosine)
    print("Biggest value 1: " , biggest_cosine[-1])
    print("Biggest value 2: " , biggest_cosine[-2])
    print("Biggest value 3: " , biggest_cosine[-3])
    print("Biggest value 4: " , biggest_cosine[-4])
    print("Biggest value 5: " , biggest_cosine[-5])
    return biggest_cosine
biggest_cosine= distance_cosine(data2)


Cosine Similarity: 0.39842054
Smallest value 1:  0.7182594
Smallest value 2:  0.71154684
Smallest value 3:  0.70909643
Smallest value 4:  0.70170027
Smallest value 5:  0.693017


In [19]:
liste_euc = [lowest_euclid[0],lowest_euclid[1],lowest_euclid[2],lowest_euclid[3],lowest_euclid[4]]

def top_5_tweets_eucli(liste_distance_euclid):
    d =0
    for distance_euclidian in liste_distance_euclid:
        for i in data2['vectorized'] :
            d+= 1
            if distance.euclidean(vectorized,i) == distance_euclidian :
                print("Top tweet ",data['text'][d])
        d= 0
top_5_tweets_eucli(liste_euc)

Top tweet    "Model X has proved both extraordinary and oddly unremarkable. Rating: \xe2\x98\x85\xe2\x98\x85\xe2\x98\x85\xe2\x98\x85\xe2\x98\x85"   
Top tweet  b'  Yes, if the trend continues, before'
Top tweet  b' Nice ride! Looking forward to seeing you tomorrow.'
Top tweet   3 design sketches 
Top tweet  b' Will do. Probably at least a few days. Depends on whether we need to pull the turbopumps.'


In [26]:
liste_cos = [biggest_cosine[0],biggest_cosine[1],biggest_cosine[2],biggest_cosine[3],biggest_cosine[4]]

def top_5_tweets_cosi(liste_distance_cosine):
    d =0
    for distance_euclidian in liste_distance_cosine:
        for i in data2['vectorized'] :
            d+= 1
            if np.dot(vectorized,i)/(norm(vectorized)*norm(i)) == distance_euclidian :
                print("Top tweet ",data['text'][d])
        d= 0
top_5_tweets_cosi(liste_cos)

Top tweet   to help Boeing is real &amp; am corresponding w 787 chief engineer. Junod's Esquire article had high fiction content."
Top tweet    Dragon on Mars. 
Top tweet   2008 meltdown, vacation for me just meant email with a view, but SpaceX &amp; Tesla are now strong enough that I can make it real (yay!!)'
Top tweet   near term plans to IPO  Only possible in very long term when Mars Colonial Transporter is flying regularly.'
Top tweet   when self-driving cars become safer than human-driven cars, the public may outlaw the latter. Hopefully not.'
