# Big Data Applications for Financial Markets


## 0. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

## 1. Web-Scraping

### 1.A. A First Example

In [2]:
import requests
url = "https://en.wikipedia.org/wiki/Yohan_Blake"
r = requests.get(url)
print(r.url)

https://en.wikipedia.org/wiki/Yohan_Blake


In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Yohan Blake - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Yohan_Blake","wgTitle":"Yohan Blake","wgCurRevisionId":840706701,"wgRevisionId":840706701,"wgArticleId":16692019,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","All articles with dead external links","Articles with dead external links from July 2016","Articles with permanently dead external links","Use Jamaican English from March 2015","All Wikipedia articles written in Jamaican English","Use dmy dates from November 2012","198

Now we want to extract only the text of the Wikipedia page. To do this, note that the body text is enclosed in paragraph tags, `<p>` and `</p>`.

However, there is a problem: the extracted text contains citation markers such as [1], [2], and so on. To get rid of them, we remove them from the parsed page with the `extract()` method. The markers sit between `<sup>` and `</sup>` tags, so we extract every `<sup>` element before reading the text.
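
The idea can be tried offline on a tiny, made-up HTML snippet. The sketch below is only an illustration: it uses the standard-library `html.parser` as a stand-in for BeautifulSoup, and the class name `ParagraphText` and the sample markup are assumptions, not part of the notebook.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text inside <p> tags, skipping anything inside <sup>."""

    def __init__(self):
        super().__init__()
        self.in_p = 0
        self.in_sup = 0
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p += 1
            self.paragraphs.append("")  # start a new paragraph buffer
        elif tag == "sup":
            self.in_sup += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p -= 1
        elif tag == "sup" and self.in_sup:
            self.in_sup -= 1

    def handle_data(self, data):
        # Keep text only when inside a paragraph and outside a citation marker
        if self.in_p and not self.in_sup:
            self.paragraphs[-1] += data

# Hypothetical markup imitating a page with citation markers
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First fact.<sup>[1]</sup> More text.</p>"
    "<div>menu</div>"
    "<p>Second paragraph<sup>[2]</sup>.</p>"
    "</body></html>"
)

parser = ParagraphText()
parser.feed(sample_html)
parser.close()
paragraphs = parser.paragraphs
print(paragraphs)
```

The heading, the `<div>` and both `[n]` markers are dropped, and only the paragraph text is kept.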

In [4]:
#We get rid of the annotations (every <sup> element is removed from the parsed page)
for s in soup('sup'):
    s.extract()

#We display only pure text
for n in soup.find_all('p'):
    print(n.text)

Yohan Blake (born 26 December 1989), is a Jamaican sprinter of the 100-metre and 200-metre sprint races. He won gold at the 100 m at the 2011 World Championships as the youngest 100 m world champion ever, and a silver medal in the 2012 Olympic Games in London in the 100 m and 200 m races for the Jamaican team.
Blake is the second fastest man ever in both 100 m and 200 m. Together with Tyson Gay, he is the joint second fastest man ever over 100 m with a personal best of 9.69 seconds, behind only Usain Bolt. His personal best for the 200 m (19.26 seconds) is the second fastest time ever after Bolt. He holds the Jamaican national junior record for the 100 metres, and was formerly the youngest sprinter to have broken the 10-second barrier (at 19 years, 196 days).
He is coached by Glen Mills and his training partners are Usain Bolt and Daniel Bailey.


Blake attended St. Jago High School in Spanish Town where his first sporting love was cricket. Blake was a fast bowler, and it was only afte

It works! Now, let's wrap all these steps into a single function, after defining our own corpus.

### 1.B. Define our Articles using lists of URLs

In [5]:
URLs_business=[]
URLs_business.append('https://en.wikipedia.org/wiki/Lloyd_Blankfein')
URLs_business.append('https://en.wikipedia.org/wiki/Tim_Cook')
URLs_business.append('https://en.wikipedia.org/wiki/Richard_Branson')

URLs_artists=[]
URLs_artists.append('https://en.wikipedia.org/wiki/Snoop_Dogg')
URLs_artists.append('https://en.wikipedia.org/wiki/Dr._Dre')
URLs_artists.append('https://en.wikipedia.org/wiki/Rihanna')

URLs_athletes=[]
URLs_athletes.append('https://en.wikipedia.org/wiki/Shaun_White')
URLs_athletes.append('https://en.wikipedia.org/wiki/Yohan_Blake')
URLs_athletes.append('https://en.wikipedia.org/wiki/Tom_Brady')

URLs=URLs_business+URLs_artists+URLs_athletes
Corpus_Names=['Lloyd Blankfein','Tim Cook','Richard Branson','Snoop Dogg','Dr. Dre','Rihanna','Shaun White','Yohan Blake','Tom Brady']

## 2. Cleaning up and Vectorization

### 2.A. Create our main function (web-scraping, cleaning and vectorization)

In [6]:
import requests
import re
from bs4 import BeautifulSoup

import nltk
from sklearn.feature_extraction.text import CountVectorizer

stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    # CountVectorizer whose analyzer also stems each token
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (stemmer.stem(w) for w in analyzer(doc))


def getData(url):
    # Download the page and keep only the paragraph text
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    main = soup.find_all("p")
    main = [x.text for x in main]
    main = " ".join(main)
    # Remove citation markers such as [1], [2]
    pattern = r"\[[0-9]*\]"
    main = re.sub(pattern, "", main)
    # Count the stemmed tokens, ignoring English stop words
    vectorizer = StemmedCountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform([main])
    # On scikit-learn < 1.0, use get_feature_names() instead
    counts = pd.Series(counts.toarray()[0],index=vectorizer.get_feature_names_out())
    return counts
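
To see what the cleaning and counting steps do without downloading anything, here is a minimal sketch on a made-up sentence. It is a simplification under stated assumptions: `collections.Counter` stands in for `CountVectorizer`, no stemming is applied, and `STOP_WORDS` is a tiny hypothetical list rather than scikit-learn's English stop-word list. The token pattern (words of two or more characters) mirrors `CountVectorizer`'s default.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (not scikit-learn's)
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "at"}

def count_terms(text):
    """Strip [n] citation markers, tokenize, drop stop words and count."""
    text = re.sub(r"\[[0-9]*\]", "", text)
    # Words of 2+ characters, as in CountVectorizer's default token pattern
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

sample = "Blake won gold[1] at the 100 m[2] and the 200 m."
term_counts = count_terms(sample)
print(term_counts)
```

The single-letter token "m" disappears because of the two-character minimum, which is also why such tokens never show up in the vocabulary built by `getData`.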

### 2.B. Build our Corpus and Store it in a dataframe

In [7]:
data = []
for x in URLs:
    data.append(getData(x))
df = pd.concat(data,axis=1)
df.head()
#A better way would have been to build the corpus first and fit the vectorizer on it once, but I did this to show an important functionality

Unnamed: 0,0,1,2,3,4,5,6,7,8
000,6.0,1.0,10.0,13.0,3.0,10.0,3.0,,3.0
0020,,,,1.0,,,,,
00s,,,,,,1.0,,,
02,,,,,,,,,1.0
04,,,,,,,,1.0,


In [9]:
# Implement the column names
df.columns = Corpus_Names
df.fillna(0,inplace=True)

In [10]:
# Counting the number of words present in each article
lengths = df.sum()

In [11]:
# Converting the counts to ratios (share of each word within its article)
df = df/lengths
# Transposing the results
df = df.transpose()

In [12]:
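# After the transpose, df.sum() runs over the articles, so this second division
# rescales each word column to sum to 1 across the corpus
df.fillna(0,inplace=True)
lengths = df.sum()
df = df/lengths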
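
The two normalization steps can be followed by hand on a toy example. The sketch below uses plain dictionaries instead of a DataFrame; the three documents and their counts are made up for illustration and do not come from the corpus.

```python
# Toy term counts for three hypothetical documents
counts = {
    "doc_a": {"bank": 2, "ceo": 1, "new": 1},
    "doc_b": {"ceo": 3, "appl": 1},
    "doc_c": {"rap": 4, "new": 4},
}

# Shared vocabulary: every word seen in at least one document
vocab = sorted({w for doc in counts.values() for w in doc})

# Step 1: missing words count as 0, then divide by the document's total
tf = {}
for name, doc in counts.items():
    total = sum(doc.values())
    tf[name] = {w: doc.get(w, 0) / total for w in vocab}

# Step 2: rescale each word so that its values sum to 1 across documents
word_totals = {w: sum(tf[name][w] for name in tf) for w in vocab}
scaled = {
    name: {w: tf[name][w] / word_totals[w] for w in vocab}
    for name in tf
}

for name in scaled:
    print(name, {w: round(x, 3) for w, x in scaled[name].items()})
```

After step 2, a word that occurs in a single document gets the value 1.0 for that document, which is why so many entries of the matrix equal exactly 1.0.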

In [13]:
df.index

Index(['Lloyd Blankfein', 'Tim Cook', 'Richard Branson', 'Snoop Dogg',
       'Dr. Dre', 'Rihanna', 'Shaun White', 'Yohan Blake', 'Tom Brady'],
      dtype='object')

## 3. Euclidean Distance

In [14]:
# Show the vectors with words representation in the DataFrame
df

Unnamed: 0,000,0020,00s,02,04,08,084,10,100,100th,...,zawaideh,zealand,zenith,zero,zimbabw,zoe,zone,zoom,zurich,æon
Lloyd Blankfein,0.272954,0.0,0.0,0.0,0.0,0.843476,0.0,0.109814,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Tim Cook,0.046144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054265,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Richard Branson,0.152443,0.0,0.0,0.0,0.0,0.0,0.0,0.055197,0.035854,0.0,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
Snoop Dogg,0.203849,1.0,0.0,0.0,0.0,0.0,0.0,0.075703,0.01844,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.516874,0.0,0.0,0.0
Dr. Dre,0.033002,0.0,0.0,0.0,0.0,0.0,0.0,0.039832,0.030186,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Rihanna,0.095556,0.0,1.0,0.0,0.0,0.088586,0.0,0.103798,0.101136,0.0,...,0.0,0.141402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Shaun White,0.174066,0.0,0.0,0.0,0.0,0.0,0.0,0.070029,0.045489,0.887859,...,1.0,0.858598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Yohan Blake,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.457177,0.683029,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Tom Brady,0.021985,0.0,0.0,1.0,0.0,0.067939,1.0,0.08845,0.0316,0.112141,...,0.0,0.0,0.0,0.0,0.0,0.0,0.483126,0.0,0.0,0.0


### 3.A. Prepare our Dataframe

In [15]:
import itertools
combos = list(itertools.combinations(df.index, 2))
#itertools.combinations takes an iterable and an argument for how big the groups will be and returns all combinations
print(combos)

[('Lloyd Blankfein', 'Tim Cook'), ('Lloyd Blankfein', 'Richard Branson'), ('Lloyd Blankfein', 'Snoop Dogg'), ('Lloyd Blankfein', 'Dr. Dre'), ('Lloyd Blankfein', 'Rihanna'), ('Lloyd Blankfein', 'Shaun White'), ('Lloyd Blankfein', 'Yohan Blake'), ('Lloyd Blankfein', 'Tom Brady'), ('Tim Cook', 'Richard Branson'), ('Tim Cook', 'Snoop Dogg'), ('Tim Cook', 'Dr. Dre'), ('Tim Cook', 'Rihanna'), ('Tim Cook', 'Shaun White'), ('Tim Cook', 'Yohan Blake'), ('Tim Cook', 'Tom Brady'), ('Richard Branson', 'Snoop Dogg'), ('Richard Branson', 'Dr. Dre'), ('Richard Branson', 'Rihanna'), ('Richard Branson', 'Shaun White'), ('Richard Branson', 'Yohan Blake'), ('Richard Branson', 'Tom Brady'), ('Snoop Dogg', 'Dr. Dre'), ('Snoop Dogg', 'Rihanna'), ('Snoop Dogg', 'Shaun White'), ('Snoop Dogg', 'Yohan Blake'), ('Snoop Dogg', 'Tom Brady'), ('Dr. Dre', 'Rihanna'), ('Dr. Dre', 'Shaun White'), ('Dr. Dre', 'Yohan Blake'), ('Dr. Dre', 'Tom Brady'), ('Rihanna', 'Shaun White'), ('Rihanna', 'Yohan Blake'), ('Rihanna', '

In [16]:
dist = pd.DataFrame(combos,columns = ["Text 1","Text 2"])
print(dist)
#Let's show it as a dataframe

             Text 1           Text 2
0   Lloyd Blankfein         Tim Cook
1   Lloyd Blankfein  Richard Branson
2   Lloyd Blankfein       Snoop Dogg
3   Lloyd Blankfein          Dr. Dre
4   Lloyd Blankfein          Rihanna
5   Lloyd Blankfein      Shaun White
6   Lloyd Blankfein      Yohan Blake
7   Lloyd Blankfein        Tom Brady
8          Tim Cook  Richard Branson
9          Tim Cook       Snoop Dogg
10         Tim Cook          Dr. Dre
11         Tim Cook          Rihanna
12         Tim Cook      Shaun White
13         Tim Cook      Yohan Blake
14         Tim Cook        Tom Brady
15  Richard Branson       Snoop Dogg
16  Richard Branson          Dr. Dre
17  Richard Branson          Rihanna
18  Richard Branson      Shaun White
19  Richard Branson      Yohan Blake
20  Richard Branson        Tom Brady
21       Snoop Dogg          Dr. Dre
22       Snoop Dogg          Rihanna
23       Snoop Dogg      Shaun White
24       Snoop Dogg      Yohan Blake
25       Snoop Dogg        Tom Brady
2

### 3.B. Euclidean Distance function between 2 documents

In [17]:
def findEucDist(t1,t2):
    # Euclidean distance between the rows of df labelled t1 and t2
    return sum((df.loc[t1]-df.loc[t2])**2)**.5
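
As a quick sanity check of the formula (square root of the sum of squared differences), here is a minimal standalone sketch. The function name `euclidean_distance` and the two toy vectors are assumptions for illustration; the result is compared with the standard library's `math.dist` (Python 3.8+).

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared differences of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Two made-up term-frequency vectors over the same four words
doc_1 = [0.5, 0.25, 0.25, 0.0]
doc_2 = [0.0, 0.75, 0.0, 0.25]

d = euclidean_distance(doc_1, doc_2)
print(d, math.dist(doc_1, doc_2))
```

Both values agree, and the distance between a vector and itself is 0.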

### 3.C. Calculate the Euclidean distance between each pair of documents

In [18]:
dist["Euclidean_Distance"] = dist.apply(lambda row: findEucDist(row['Text 1'], row['Text 2']), axis=1)
#If we apply this formula across the rows, we get the distance between each pair of documents
print(dist)

             Text 1           Text 2  Euclidean_Distance
0   Lloyd Blankfein         Tim Cook           21.830150
1   Lloyd Blankfein  Richard Branson           30.953717
2   Lloyd Blankfein       Snoop Dogg           28.212663
3   Lloyd Blankfein          Dr. Dre           30.630031
4   Lloyd Blankfein          Rihanna           31.160930
5   Lloyd Blankfein      Shaun White           21.125676
6   Lloyd Blankfein      Yohan Blake           19.867778
7   Lloyd Blankfein        Tom Brady           30.598625
8          Tim Cook  Richard Branson           30.848578
9          Tim Cook       Snoop Dogg           28.275120
10         Tim Cook          Dr. Dre           30.650189
11         Tim Cook          Rihanna           31.145201
12         Tim Cook      Shaun White           21.184303
13         Tim Cook      Yohan Blake           20.012699
14         Tim Cook        Tom Brady           30.692426
15  Richard Branson       Snoop Dogg           35.198920
16  Richard Branson          Dr

In [19]:
#Since these numbers are close, we can also rescale them to the 0-1 range to make the differences clearer
dist["Euclidean_Distance_Normalized"] = (dist["Euclidean_Distance"]-dist["Euclidean_Distance"].min())/(dist["Euclidean_Distance"].max()-dist["Euclidean_Distance"].min())
print(dist.sort_values("Euclidean_Distance_Normalized"))

             Text 1           Text 2  Euclidean_Distance  \
33      Shaun White      Yohan Blake           18.357103   
6   Lloyd Blankfein      Yohan Blake           19.867778   
13         Tim Cook      Yohan Blake           20.012699   
5   Lloyd Blankfein      Shaun White           21.125676   
12         Tim Cook      Shaun White           21.184303   
0   Lloyd Blankfein         Tim Cook           21.830150   
24       Snoop Dogg      Yohan Blake           26.485889   
23       Snoop Dogg      Shaun White           27.366567   
2   Lloyd Blankfein       Snoop Dogg           28.212663   
9          Tim Cook       Snoop Dogg           28.275120   
35      Yohan Blake        Tom Brady           28.916319   
28          Dr. Dre      Yohan Blake           29.114710   
19  Richard Branson      Yohan Blake           29.447569   
31          Rihanna      Yohan Blake           29.603844   
34      Shaun White        Tom Brady           29.683508   
27          Dr. Dre      Shaun White    

### 3.D. Comparison between Expectations and Actual distances

We expect distances between two documents of the same group (entrepreneurs, artists or athletes) to be smaller than distances between two documents of different groups.
In practice, it is not that simple. The smallest distance is indeed found between two documents of the same group (athletes: Shaun White – Yohan Blake), but the next smallest distances are found between documents of different groups (entrepreneurs and athletes).
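
This check can be made explicit by labelling each pair as same-group or cross-group. The sketch below is only an illustration: the `groups` mapping follows the three URL lists from section 1.B, and the six distances are copied from the top of the sorted output above.

```python
# Group of each document, following the URL lists of section 1.B
groups = {
    "Lloyd Blankfein": "business", "Tim Cook": "business", "Richard Branson": "business",
    "Snoop Dogg": "artists", "Dr. Dre": "artists", "Rihanna": "artists",
    "Shaun White": "athletes", "Yohan Blake": "athletes", "Tom Brady": "athletes",
}

# Six smallest Euclidean distances, copied from the sorted output
closest_pairs = [
    ("Shaun White", "Yohan Blake", 18.357103),
    ("Lloyd Blankfein", "Yohan Blake", 19.867778),
    ("Tim Cook", "Yohan Blake", 20.012699),
    ("Lloyd Blankfein", "Shaun White", 21.125676),
    ("Tim Cook", "Shaun White", 21.184303),
    ("Lloyd Blankfein", "Tim Cook", 21.830150),
]

same_group = [(a, b) for a, b, _ in closest_pairs if groups[a] == groups[b]]

for a, b, d in closest_pairs:
    label = "same group" if groups[a] == groups[b] else "different groups"
    print(f"{a} / {b}: {d:.2f} ({label})")
```

Only two of the six closest pairs belong to the same group.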

## 4. Search Retrieval

### 4.A. Define 5 search engine queries

In [20]:
q1="Pop and hip hop artist or born in Barbuda"
q2="Influencial banker who attended Harvard and live in New York"
q3="Professional snowboarder and skater who won gold medals"
q4="CEO of a Fortune 500 company that created the iPhone"
q5="Rapper and television personality arrested for illegal possession"

### 4.B. Calculate the Euclidean Distance with each of documents in the corpus

In [21]:
# Create a vector with the engine queries
queries=[q1,q2,q3,q4,q5]
queries

['Pop and hip hop artist or born in Barbuda',
 'Influencial banker who attended Harvard and live in New York',
 'Professional snowboarder and skater who won gold medals',
 'CEO of a Fortune 500 company that created the iPhone',
 'Rapper and television personality arrested for illegal possession']

In [22]:
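def corpus_add(text):
    # Vectorize a single query the same way as the articles (stemming, stop words removed)
    vectorizer = StemmedCountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform([text])
    # On scikit-learn < 1.0, use get_feature_names() instead
    counts = pd.Series(counts.toarray()[0],index=vectorizer.get_feature_names_out())
    return counts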

In [23]:
# Check that it works properly with q1
corpus_add(q1)

artist     1
barbuda    1
born       1
hip        1
hop        1
pop        1
dtype: int64

In [24]:
new_data = []
for query in queries:
    new_data.append(corpus_add(query))
for x in URLs:
    new_data.append(getData(x))
df_query = pd.concat(new_data,axis=1)
df_query

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
000,,,,,,6.0,1.0,10.0,13.0,3.0,10.0,3.0,,3.0
0020,,,,,,,,,1.0,,,,,
00s,,,,,,,,,,,1.0,,,
02,,,,,,,,,,,,,,1.0
04,,,,,,,,,,,,,1.0,
08,,,,,,2.0,,,,,1.0,,,1.0
084,,,,,,,,,,,,,,1.0
10,,,,,,2.0,,3.0,4.0,3.0,9.0,1.0,5.0,10.0
100,,,,,,,3.0,6.0,3.0,7.0,27.0,2.0,23.0,11.0
100th,,,,,,,,,,,,1.0,,1.0


In [25]:
queries_name = ["Query " + str(i) for i in range(1,len(queries)+1)] 
df_query.columns = queries_name + Corpus_Names
df_query.fillna(0,inplace=True)
lengths = df_query.sum()
df_query = df_query/lengths
# Transposing the results
df_query = df_query.transpose()
df_query.head(14)

Unnamed: 0,000,0020,00s,02,04,08,084,10,100,100th,...,zawaideh,zealand,zenith,zero,zimbabw,zoe,zone,zoom,zurich,æon
Query 1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Query 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Query 3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Query 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Query 5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Lloyd Blankfein,0.004983,0.0,0.0,0.0,0.0,0.001661,0.0,0.001661,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Tim Cook,0.000842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002527,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000842,0.0,0.0,0.0,0.0
Richard Branson,0.002783,0.0,0.0,0.0,0.0,0.0,0.0,0.000835,0.00167,0.0,...,0.0,0.0,0.000278,0.000835,0.000278,0.0,0.0,0.0,0.0,0.0
Snoop Dogg,0.003722,0.000286,0.0,0.0,0.0,0.0,0.0,0.001145,0.000859,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000286,0.0,0.0,0.0
Dr. Dre,0.000603,0.0,0.0,0.0,0.0,0.0,0.0,0.000603,0.001406,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000201,0.0,0.0


In [27]:
combos_queries = list(itertools.combinations(queries_name, 2))
combos_all = list(itertools.combinations(df_query.index, 2))
#We drop the document-document pairs, then the query-query pairs, to keep only the (query, document) pairs
combos_final = [x for x in combos_all if x not in combos]
combos_final = [x for x in combos_final if x not in combos_queries]
print(combos_final)

[('Query 1', 'Lloyd Blankfein'), ('Query 1', 'Tim Cook'), ('Query 1', 'Richard Branson'), ('Query 1', 'Snoop Dogg'), ('Query 1', 'Dr. Dre'), ('Query 1', 'Rihanna'), ('Query 1', 'Shaun White'), ('Query 1', 'Yohan Blake'), ('Query 1', 'Tom Brady'), ('Query 2', 'Lloyd Blankfein'), ('Query 2', 'Tim Cook'), ('Query 2', 'Richard Branson'), ('Query 2', 'Snoop Dogg'), ('Query 2', 'Dr. Dre'), ('Query 2', 'Rihanna'), ('Query 2', 'Shaun White'), ('Query 2', 'Yohan Blake'), ('Query 2', 'Tom Brady'), ('Query 3', 'Lloyd Blankfein'), ('Query 3', 'Tim Cook'), ('Query 3', 'Richard Branson'), ('Query 3', 'Snoop Dogg'), ('Query 3', 'Dr. Dre'), ('Query 3', 'Rihanna'), ('Query 3', 'Shaun White'), ('Query 3', 'Yohan Blake'), ('Query 3', 'Tom Brady'), ('Query 4', 'Lloyd Blankfein'), ('Query 4', 'Tim Cook'), ('Query 4', 'Richard Branson'), ('Query 4', 'Snoop Dogg'), ('Query 4', 'Dr. Dre'), ('Query 4', 'Rihanna'), ('Query 4', 'Shaun White'), ('Query 4', 'Yohan Blake'), ('Query 4', 'Tom Brady'), ('Query 5', 'Ll
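
The same (query, document) pairs can presumably be produced more directly with `itertools.product`. The sketch below rebuilds both lists from the names used in this notebook and checks that they coincide; it is offered as an alternative, not as the method used above.

```python
import itertools

queries_name = ["Query " + str(i) for i in range(1, 6)]
Corpus_Names = ['Lloyd Blankfein', 'Tim Cook', 'Richard Branson', 'Snoop Dogg',
                'Dr. Dre', 'Rihanna', 'Shaun White', 'Yohan Blake', 'Tom Brady']

# Approach used above: all pairs, minus document-document and query-query pairs
index = queries_name + Corpus_Names
combos = list(itertools.combinations(Corpus_Names, 2))
combos_queries = list(itertools.combinations(queries_name, 2))
combos_all = list(itertools.combinations(index, 2))
combos_final = [x for x in combos_all
                if x not in combos and x not in combos_queries]

# Direct alternative: Cartesian product of queries and documents
combos_product = list(itertools.product(queries_name, Corpus_Names))

print(len(combos_product), combos_product == combos_final)
```

With 5 queries and 9 documents, both approaches yield the same 45 pairs in the same order.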

In [28]:
def findEucDist_Query(t1,t2):
    return sum((df_query.loc[t1]-df_query.loc[t2])**2)**.5

In [29]:
dist_queries = pd.DataFrame(combos_final,columns = ["Text 1","Text 2"])
dist_queries["Euclidean_Distance"] = dist_queries.apply(lambda row: findEucDist_Query(row['Text 1'], row['Text 2']), axis=1)
print(dist_queries)

     Text 1           Text 2  Euclidean_Distance
0   Query 1  Lloyd Blankfein            0.412070
1   Query 1         Tim Cook            0.413576
2   Query 1  Richard Branson            0.412278
3   Query 1       Snoop Dogg            0.409174
4   Query 1          Dr. Dre            0.408309
5   Query 1          Rihanna            0.408699
6   Query 1      Shaun White            0.417145
7   Query 1      Yohan Blake            0.419430
8   Query 1        Tom Brady            0.416536
9   Query 2  Lloyd Blankfein            0.373505
10  Query 2         Tim Cook            0.383190
11  Query 2  Richard Branson            0.380524
12  Query 2       Snoop Dogg            0.382331
13  Query 2          Dr. Dre            0.380828
14  Query 2          Rihanna            0.381152
15  Query 2      Shaun White            0.387752
16  Query 2      Yohan Blake            0.389086
17  Query 2        Tom Brady            0.384017
18  Query 3  Lloyd Blankfein            0.412741
19  Query 3         

### 4.C. Sort the pairs by distance (from shortest to longest)

In [30]:
dist_queries["Euclidean_Distance_Normalized"] = (dist_queries["Euclidean_Distance"]-dist_queries["Euclidean_Distance"].min())/(dist_queries["Euclidean_Distance"].max()-dist_queries["Euclidean_Distance"].min())
print(dist_queries.sort_values("Euclidean_Distance_Normalized"))

     Text 1           Text 2  Euclidean_Distance  \
9   Query 2  Lloyd Blankfein            0.373505   
11  Query 2  Richard Branson            0.380524   
13  Query 2          Dr. Dre            0.380828   
14  Query 2          Rihanna            0.381152   
12  Query 2       Snoop Dogg            0.382331   
10  Query 2         Tim Cook            0.383190   
17  Query 2        Tom Brady            0.384017   
15  Query 2      Shaun White            0.387752   
16  Query 2      Yohan Blake            0.389086   
24  Query 3      Shaun White            0.393178   
28  Query 4         Tim Cook            0.405000   
4   Query 1          Dr. Dre            0.408309   
5   Query 1          Rihanna            0.408699   
39  Query 5       Snoop Dogg            0.408940   
3   Query 1       Snoop Dogg            0.409174   
29  Query 4  Richard Branson            0.409909   
25  Query 3      Yohan Blake            0.409981   
27  Query 4  Lloyd Blankfein            0.410050   
40  Query 5 

Now, let's find the closest document for each query and check whether it matches:

In [31]:
#We create a sorted copy
ordered_dist_queries = dist_queries.sort_values("Euclidean_Distance_Normalized")

#We reset the index
ordered_dist_queries = ordered_dist_queries.reset_index(drop=True)

#For each query, the first matching row of the sorted copy is its closest document
list_of_best_matches = []

for i in range(1, 6):
    counter = 0
    for k in ordered_dist_queries['Text 1']:
        if k == ("Query " + str(i)):
            list_of_best_matches.append(ordered_dist_queries['Text 2'][counter])
            break
        counter += 1

### 4.D. Comments on results

In [32]:
print(queries) 
print()
print("The celebrities corresponding to each query are respectively: ")
print()
print(list_of_best_matches)

['Pop and hip hop artist or born in Barbuda', 'Influencial banker who attended Harvard and live in New York', 'Professional snowboarder and skater who won gold medals', 'CEO of a Fortune 500 company that created the iPhone', 'Rapper and television personality arrested for illegal possession']

The celebrities corresponding to each query are respectively: 

['Dr. Dre', 'Lloyd Blankfein', 'Shaun White', 'Tim Cook', 'Snoop Dogg']


#### It worked! For each query, the smallest distance is found for the right person (for instance, the closest document to the query "Professional snowboarder and skater who won gold medals" is indeed Shaun White).
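
The best match per query can also be picked without an explicit counter, by keeping the smallest distance seen for each query. The sketch below is only an illustration on the first ten rows of the sorted output in 4.C, copied by hand.

```python
# (query, document, Euclidean distance) rows copied from the sorted output in 4.C
rows = [
    ("Query 2", "Lloyd Blankfein", 0.373505),
    ("Query 2", "Richard Branson", 0.380524),
    ("Query 3", "Shaun White", 0.393178),
    ("Query 4", "Tim Cook", 0.405000),
    ("Query 1", "Dr. Dre", 0.408309),
    ("Query 1", "Rihanna", 0.408699),
    ("Query 5", "Snoop Dogg", 0.408940),
    ("Query 1", "Snoop Dogg", 0.409174),
    ("Query 4", "Richard Branson", 0.409909),
    ("Query 3", "Yohan Blake", 0.409981),
]

# Keep, for each query, the document with the smallest distance seen so far
best = {}
for query, doc, d in rows:
    if query not in best or d < best[query][1]:
        best[query] = (doc, d)

best_matches = [best[q][0] for q in sorted(best)]
print(best_matches)
```

On the full DataFrame, a similar result could be obtained with `dist_queries.loc[dist_queries.groupby("Text 1")["Euclidean_Distance"].idxmin()]`.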

## 5. Alternative Distance Function -- “Cosine Similarity”

### 5.A. Cosine Similarity function between 2 documents

In [33]:
def findCosSim(t1,t2):
    # Cosine similarity: dot product divided by the product of the two vector norms
    num = sum(df.loc[t1]*df.loc[t2])
    den1 = sum((df.loc[t1])**2)**.5
    den2 = sum((df.loc[t2])**2)**.5
    return num/(den1*den2)
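
To make the formula concrete, here is a minimal standalone sketch on toy vectors. The function name `cosine_similarity` and the vectors are assumptions for illustration; unlike the notebook function, this version raises an error on a zero vector instead of dividing by zero.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Same direction (one vector is twice the other): similarity close to 1
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# No word in common: similarity of exactly 0
no_shared_words = cosine_similarity([1, 0, 0], [0, 3, 5])

print(same_direction, no_shared_words)
```

For count or frequency vectors, which have no negative entries, the value always lies between 0 and 1, and a larger value means more similar documents.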

### 5.B. Calculate & Comment the Cosine Similarity between each pair of documents (as in Part 3.)

In [34]:
dist["Cosine_Similarity"] = dist.apply(lambda row: findCosSim(row['Text 1'], row['Text 2']), axis=1)
#If we apply this formula across the rows, we get the similarity between each pair of documents
print(dist)

             Text 1           Text 2  Euclidean_Distance  \
0   Lloyd Blankfein         Tim Cook           21.830150   
1   Lloyd Blankfein  Richard Branson           30.953717   
2   Lloyd Blankfein       Snoop Dogg           28.212663   
3   Lloyd Blankfein          Dr. Dre           30.630031   
4   Lloyd Blankfein          Rihanna           31.160930   
5   Lloyd Blankfein      Shaun White           21.125676   
6   Lloyd Blankfein      Yohan Blake           19.867778   
7   Lloyd Blankfein        Tom Brady           30.598625   
8          Tim Cook  Richard Branson           30.848578   
9          Tim Cook       Snoop Dogg           28.275120   
10         Tim Cook          Dr. Dre           30.650189   
11         Tim Cook          Rihanna           31.145201   
12         Tim Cook      Shaun White           21.184303   
13         Tim Cook      Yohan Blake           20.012699   
14         Tim Cook        Tom Brady           30.692426   
15  Richard Branson       Snoop Dogg    

In [35]:
#Since these numbers are close, we can also rescale them to the 0-1 range to make the differences clearer
dist["Cosine_Similarity_Normalized"] = (dist["Cosine_Similarity"]-dist["Cosine_Similarity"].min())/(dist["Cosine_Similarity"].max()-dist["Cosine_Similarity"].min())
print(dist.sort_values("Cosine_Similarity_Normalized",ascending=False))

             Text 1           Text 2  Euclidean_Distance  \
0   Lloyd Blankfein         Tim Cook           21.830150   
21       Snoop Dogg          Dr. Dre           34.394929   
26          Dr. Dre          Rihanna           37.058466   
22       Snoop Dogg          Rihanna           35.180012   
8          Tim Cook  Richard Branson           30.848578   
15  Richard Branson       Snoop Dogg           35.198920   
17  Richard Branson          Rihanna           37.528701   
16  Richard Branson          Dr. Dre           37.162945   
25       Snoop Dogg        Tom Brady           34.868543   
33      Shaun White      Yohan Blake           18.357103   
29          Dr. Dre        Tom Brady           36.921559   
1   Lloyd Blankfein  Richard Branson           30.953717   
20  Richard Branson        Tom Brady           37.273863   
34      Shaun White        Tom Brady           29.683508   
11         Tim Cook          Rihanna           31.145201   
32          Rihanna        Tom Brady    

We expect Cosine Similarities between two documents of the same group (entrepreneurs, artists or athletes) to be larger than Cosine Similarities between two documents of different groups. We also expect better results than those given by Euclidean Distance, since Cosine Similarity is generally preferred over Euclidean Distance in textual analysis.
The largest Cosine Similarity is given by two documents of the same group (entrepreneurs: Lloyd Blankfein - Tim Cook), and the next largest Cosine Similarities are also found between documents of the same group (artists). Even if those results are probably not perfect, they seem better than those given by the Euclidean Distance.

### 5.C. Search Retrieval - Calculate & Comment the Cosine Similarity with each of documents in the corpus (as in Part 4.)

In [36]:
def findCosSim_Query(t1,t2):
    # Same cosine similarity formula, applied to the rows of df_query
    num = sum(df_query.loc[t1]*df_query.loc[t2])
    den1 = sum((df_query.loc[t1])**2)**.5
    den2 = sum((df_query.loc[t2])**2)**.5
    return num/(den1*den2)

In [37]:
dist_queries["Cosine_Similarity"] = dist_queries.apply(lambda row: findCosSim_Query(row['Text 1'], row['Text 2']), axis=1)
print(dist_queries)

     Text 1           Text 2  Euclidean_Distance  \
0   Query 1  Lloyd Blankfein            0.412070   
1   Query 1         Tim Cook            0.413576   
2   Query 1  Richard Branson            0.412278   
3   Query 1       Snoop Dogg            0.409174   
4   Query 1          Dr. Dre            0.408309   
5   Query 1          Rihanna            0.408699   
6   Query 1      Shaun White            0.417145   
7   Query 1      Yohan Blake            0.419430   
8   Query 1        Tom Brady            0.416536   
9   Query 2  Lloyd Blankfein            0.373505   
10  Query 2         Tim Cook            0.383190   
11  Query 2  Richard Branson            0.380524   
12  Query 2       Snoop Dogg            0.382331   
13  Query 2          Dr. Dre            0.380828   
14  Query 2          Rihanna            0.381152   
15  Query 2      Shaun White            0.387752   
16  Query 2      Yohan Blake            0.389086   
17  Query 2        Tom Brady            0.384017   
18  Query 3 

In [38]:
dist_queries["Cosine_Similarity_Normalized"] = (dist_queries["Cosine_Similarity"]-dist_queries["Cosine_Similarity"].min())/(dist_queries["Cosine_Similarity"].max()-dist_queries["Cosine_Similarity"].min())
print(dist_queries.sort_values("Cosine_Similarity_Normalized",ascending=False))

     Text 1           Text 2  Euclidean_Distance  \
24  Query 3      Shaun White            0.393178   
9   Query 2  Lloyd Blankfein            0.373505   
28  Query 4         Tim Cook            0.405000   
25  Query 3      Yohan Blake            0.409981   
4   Query 1          Dr. Dre            0.408309   
5   Query 1          Rihanna            0.408699   
39  Query 5       Snoop Dogg            0.408940   
3   Query 1       Snoop Dogg            0.409174   
27  Query 4  Lloyd Blankfein            0.410050   
29  Query 4  Richard Branson            0.409909   
13  Query 2          Dr. Dre            0.380828   
14  Query 2          Rihanna            0.381152   
11  Query 2  Richard Branson            0.380524   
17  Query 2        Tom Brady            0.384017   
40  Query 5          Dr. Dre            0.411250   
37  Query 5         Tim Cook            0.412216   
43  Query 5      Yohan Blake            0.417226   
41  Query 5          Rihanna            0.412311   
12  Query 2 

Now, let's find the most similar document for each query and check whether it matches:

In [39]:
#We create a sorted copy (largest similarity first)
ordered_dist_queries = dist_queries.sort_values("Cosine_Similarity_Normalized",ascending=False)

#We reset the index
ordered_dist_queries = ordered_dist_queries.reset_index(drop=True)

#For each query, the first matching row of the sorted copy is its most similar document
list_of_best_matches = []

for i in range(1, 6):
    counter = 0
    for k in ordered_dist_queries['Text 1']:
        if k == ("Query " + str(i)):
            list_of_best_matches.append(ordered_dist_queries['Text 2'][counter])
            break
        counter += 1

In [40]:
print(queries) 
print()
print("The celebrities corresponding to each query are respectively: ")
print()
print(list_of_best_matches)

['Pop and hip hop artist or born in Barbuda', 'Influencial banker who attended Harvard and live in New York', 'Professional snowboarder and skater who won gold medals', 'CEO of a Fortune 500 company that created the iPhone', 'Rapper and television personality arrested for illegal possession']

The celebrities corresponding to each query are respectively: 

['Dr. Dre', 'Lloyd Blankfein', 'Shaun White', 'Tim Cook', 'Snoop Dogg']


### 5.D. Comparison Between Euclidean Distance and Cosine Similarity

In theory, Cosine Similarity is preferable to Euclidean Distance in textual analysis because it depends only on the direction of the vectors and not on their length, so texts of very different sizes remain comparable. Our results are consistent with this: when we compare the Euclidean Distances and the Cosine Similarities computed for each pair of documents, the Cosine Similarity ranking groups documents of the same category more reliably.
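
The role of vector length can be illustrated with raw counts. In the sketch below, all vectors are made up: `long_doc` is `short_doc` repeated ten times (same word proportions), while `other_doc` has a similar size but different content. The helper names `euclid` and `cosine` are assumptions for this illustration.

```python
import math

short_doc = [1, 2, 0, 1]                    # raw word counts of a short text
long_doc = [10 * x for x in short_doc]      # same proportions, ten times longer
other_doc = [0, 1, 2, 1]                    # similar size, different content

def euclid(u, v):
    return math.dist(u, v)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print("Euclidean:", euclid(short_doc, long_doc), euclid(short_doc, other_doc))
print("Cosine:   ", cosine(short_doc, long_doc), cosine(short_doc, other_doc))
```

Euclidean Distance puts `short_doc` closer to `other_doc` than to its own tenfold repetition, whereas Cosine Similarity rates the repetition as the better match. Dividing the counts by the document length, as done in section 2.B, reduces this effect but does not make the two measures equivalent.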

Both methods gave perfect results for our particular queries. Given the above, however, we may expect Cosine Similarity to give better results than Euclidean Distance for harder queries.

Nevertheless, we should keep in mind that other measures exist and may be preferable depending on the particular problem at hand.