<a href="https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/practices/P3/Practice_3_IR_and_Recommendation_systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 3:** Information Retrieval & Elastic Search

### Download and set up ElasticSearch on Google Colab

In [None]:
# Download and extract elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.10.1


--2021-11-01 19:42:04--  https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
Resolving artifacts.elastic.co (artifacts.elastic.co)... 34.120.127.130, 2600:1901:0:1d7::
Connecting to artifacts.elastic.co (artifacts.elastic.co)|34.120.127.130|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 318801277 (304M) [application/x-gzip]
Saving to: ‘elasticsearch-7.10.1-linux-x86_64.tar.gz’


2021-11-01 19:42:52 (6.36 MB/s) - ‘elasticsearch-7.10.1-linux-x86_64.tar.gz’ saved [318801277/318801277]



In [None]:
import os
from subprocess import Popen, PIPE, STDOUT

# If issues are encountered with this section, ES can be manually started as follows:
# ./elasticsearch-7.10.1/bin/elasticsearch

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30

In [None]:
# wait a bit then test
!curl -X GET "localhost:9200/"

{
  "name" : "92400915e7f0",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "WCPJqPpsRUuCXSoUS7oArQ",
  "version" : {
    "number" : "7.10.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "1c34507e66d7db1211f66f3513706fdf548736aa",
    "build_date" : "2020-12-05T01:00:33.671820Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


## Information Retrieval

Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of **texts**, images or sounds. (source: Wikipedia).

The goal of this practice is to build a Wikipedia-based search engine. Only a subset of the Wikipedia pages is used.

Data Source: https://snap.stanford.edu/data/wikispeedia.html 

### **Question 1: Pagerank scores**
Using the Wikipedia link network, compute the [PageRank](http://ilpubs.stanford.edu:8090/422/) score of each page.

Which page has the highest PageRank score?


In [None]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/articles.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/categories.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/links.tsv


In [None]:
%%capture
! pip install elasticsearch==7.10.1
! pip install networkx

In [None]:
from urllib.parse import unquote

# Read the article list, skipping blank lines and "#" comment lines
with open("articles.tsv") as f:
    list_articles = [l for l in f.read().split("\n") if l != "" and l[0] != "#"]
# IDs are URL-encoded in the file; decode them to use as dictionary keys
unquoted_list_articles = [unquote(l) for l in list_articles]
dict_articles = {}
for i, l in enumerate(unquoted_list_articles):
    dict_articles[l] = {}
    dict_articles[l]["ID"] = l
    dict_articles[l]["quoted_ID"] = list_articles[i]

In [None]:
from urllib.parse import unquote

# Read the "article<TAB>category" pairs, skipping blank lines and "#" comment lines
with open("categories.tsv") as f:
    list_categories = [l for l in f.read().split("\n") if l != "" and l[0] != "#"]

for l in list_categories:
    k, v = l.split("\t")
    k = unquote(k)
    v = unquote(v)
    if "categories" in dict_articles[k].keys():
        dict_articles[k]["categories"].append(v)
    else:
        dict_articles[k]["categories"] = [v]
    
print (dict_articles)

{'Áedán_mac_Gabráin': {'ID': 'Áedán_mac_Gabráin', 'quoted_ID': '%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'categories': ['subject.History.British_History.British_History_1500_and_before_including_Roman_Britain', 'subject.People.Historical_figures']}, 'Åland': {'ID': 'Åland', 'quoted_ID': '%C3%85land', 'categories': ['subject.Countries', 'subject.Geography.European_Geography.European_Countries']}, 'Édouard_Manet': {'ID': 'Édouard_Manet', 'quoted_ID': '%C3%89douard_Manet', 'categories': ['subject.People.Artists']}, 'Éire': {'ID': 'Éire', 'quoted_ID': '%C3%89ire', 'categories': ['subject.Countries', 'subject.Geography.European_Geography.European_Countries']}, 'Óengus_I_of_the_Picts': {'ID': 'Óengus_I_of_the_Picts', 'quoted_ID': '%C3%93engus_I_of_the_Picts', 'categories': ['subject.History.British_History.British_History_1500_and_before_including_Roman_Britain', 'subject.People.Historical_figures']}, '€2_commemorative_coins': {'ID': '€2_commemorative_coins', 'quoted_ID': '%E2%82%AC2_commemorative

In [None]:
from urllib.parse import unquote

# Read the "source<TAB>target" link pairs, skipping blank lines and "#" comment lines
with open("links.tsv") as f:
    list_links = [l for l in f.read().split("\n") if l != "" and l[0] != "#"]

for l in list_links:
    s, t = l.split("\t")
    s = unquote(s)
    t = unquote(t)
    if "out_links" in dict_articles[s].keys():
        dict_articles[s]["out_links"].append(t)
    else:
        dict_articles[s]["out_links"] = [t]

In [None]:
print (dict_articles["Áedán_mac_Gabráin"])

{'ID': 'Áedán_mac_Gabráin', 'quoted_ID': '%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'categories': ['subject.History.British_History.British_History_1500_and_before_including_Roman_Britain', 'subject.People.Historical_figures'], 'out_links': ['Bede', 'Columba', 'Dál_Riata', 'Great_Britain', 'Ireland', 'Isle_of_Man', 'Monarchy', 'Orkney', 'Picts', 'Scotland', 'Wales']}
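
To make the idea behind Question 1 concrete, here is a minimal pure-Python sketch of PageRank by power iteration. It is an illustration, not the expected solution: the function name `pagerank`, the damping factor of 0.85, and the toy graph are assumptions, and in the notebook `networkx.pagerank` on a directed graph built from the same links would be the more usual route. The input has the same shape as the `out_links` lists stored in `dict_articles`.

```python
def pagerank(out_links, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank over a {node: [targets]} adjacency dict."""
    # Collect every node, including pages that only appear as link targets
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Rank held by pages without out-links is spread evenly over all pages
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each page passes an equal share of its rank to every page it links to
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        converged = sum(abs(new_rank[node] - rank[node]) for node in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy graph (hypothetical), shaped like the "out_links" lists in dict_articles
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_links)
best_page = max(scores, key=scores.get)
print(best_page, round(scores[best_page], 3))
```

The scores sum to 1, so they can be read as the probability that a random surfer is on each page. A page with no incoming links, such as `D` here, keeps only the baseline share `(1 - damping) / n`.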


In [None]:
# your code here

### **Question 2: Wikipedia pages indexing**

Create a new index in ElasticSearch and index the Wikipedia pages along with their content. The content of each page can be found at `plaintext_articles/QUOTED_ID_OF_THE_DOC.txt`.

NB: the PageRank score must be a field of each indexed document.
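
One possible way to organize the indexing step is to build one bulk action per article first, and only then send the actions to the server. The sketch below shows that idea on toy data. The helper `build_actions`, the index name `wiki-search`, the UTF-8 file encoding, and the toy score are assumptions made for illustration. The commented lines at the end use the `elasticsearch` Python client (`Elasticsearch`, `helpers.bulk`) and require the server started in the setup cells.

```python
import os
import tempfile


def build_actions(dict_articles, pagerank_scores, text_dir, index_name):
    """Yield one bulk-indexing action per article."""
    for article_id, info in dict_articles.items():
        # The plain text of each page is stored under its quoted (URL-encoded) ID
        path = os.path.join(text_dir, info["quoted_ID"] + ".txt")
        with open(path, encoding="utf-8") as f:  # encoding is an assumption
            full_text = f.read()
        yield {
            "_index": index_name,
            "_source": {
                "ID": article_id,
                "quoted_ID": info["quoted_ID"],
                "full_text": full_text,
                "pagerank_score": pagerank_scores[article_id],
            },
        }


# Toy data standing in for dict_articles, the PageRank scores and plaintext_articles/
with tempfile.TemporaryDirectory() as text_dir:
    with open(os.path.join(text_dir, "%C3%85land.txt"), "w", encoding="utf-8") as f:
        f.write("Åland is an autonomous region of Finland.")
    toy_articles = {"Åland": {"ID": "Åland", "quoted_ID": "%C3%85land"}}
    toy_scores = {"Åland": 0.0002}  # made-up value, for illustration only
    actions = list(build_actions(toy_articles, toy_scores, text_dir, "wiki-search"))

print(actions[0]["_source"]["ID"], actions[0]["_source"]["pagerank_score"])

# With the ElasticSearch server running, indexing could then look like:
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# es.indices.create(index="wiki-search")
# helpers.bulk(es, actions)
```

Keeping the action builder separate from the client calls makes it easy to inspect a few documents before anything is sent to the index.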


In [None]:
%%capture
! wget https://github.com/MorenoLaQuatra/DeepNLP/raw/main/practices/P3/plaintext_articles.zip
! unzip plaintext_articles.zip

In [None]:
# your code here

### **Question 3: Querying ElasticSearch**

Query ElasticSearch on the full text of the articles. Choose three queries on topics you like and report their results.

E.g.:
- query 1 : "The capital of Italy" (surprised by the result?)
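
As a starting point, a full-text search can be expressed as a `match` query on the `full_text` field. The sketch below only builds the request body; the helper name `build_match_query`, the result size, and the index name in the commented lines are assumptions. A `match` query ranks documents by term overlap (BM25 is the default similarity), which may help explain unexpected top results.

```python
def build_match_query(text, field="full_text", size=5):
    """Request body for a full-text `match` query on one field."""
    return {"size": size, "query": {"match": {field: text}}}


body = build_match_query("The capital of Italy")
print(body)

# With a running server and a populated index (the name "wiki-search" is hypothetical):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# response = es.search(index="wiki-search", body=body)
# for hit in response["hits"]["hits"]:
#     print(hit["_score"], hit["_source"]["ID"])
```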

In [None]:
# Your code here

### **Question 4: Integrating PageRank scores**

Create a query template that includes the PageRank score when computing the relevance score (`_score`).

Use the [Script score](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score) function to generate a hybrid score (`_score + pagerank_score * 250`).

Run the same set of queries with this modification. Does it change the results?
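
A hedged sketch of such a template is shown below. It wraps a `match` query in a `function_score` query whose `script_score` function computes the hybrid score. The helper name `build_hybrid_query` and its defaults are assumptions, and the script expects `pagerank_score` to be indexed as a numeric field. Setting `boost_mode` to `replace` matters here: by default the function output is multiplied by the query score, which would not give `_score + pagerank_score * 250`.

```python
def build_hybrid_query(text, weight=250, field="full_text", size=5):
    """Wrap a `match` query so the final score is _score + pagerank_score * weight."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                "script_score": {
                    "script": {
                        "source": "_score + doc['pagerank_score'].value * params.weight",
                        "params": {"weight": weight},
                    }
                },
                # Use the script output as the final score instead of
                # multiplying it by the query score (the default boost_mode)
                "boost_mode": "replace",
            }
        },
    }


hybrid_body = build_hybrid_query("The capital of Italy")
print(hybrid_body["query"]["function_score"]["script_score"]["script"]["source"])
```

Passing the weight through `params` rather than hard-coding it in the script source makes it easy to try other values and see how strongly PageRank should influence the ranking.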



In [None]:
# Your code here

### **Question 5: Integrating semantic dense vectors**

Generate a new index ("wiki-semantic-search") including all the information of the previous one plus an additional field that contains a BERT-based embedding vector of the `full_text` of the article. Once indexing is completed, repeat the same queries for a qualitative evaluation of the IR system. 

**Some hints below:**
- Use Sentence-BERT pretrained encoders (www.sbert.net). Choose the most suitable pretrained model (a trade-off between speed and accuracy), e.g., `multi-qa-MiniLM-L6-cos-v1`.
- Use cosine similarity to score each query against the full text of the articles.
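
For the cosine-similarity hint, one option in ElasticSearch 7.x is a `script_score` query that calls `cosineSimilarity` on a `dense_vector` field. The sketch below builds such a request body. The helper name `build_semantic_query` and the placeholder vector are assumptions, and `embedding_bert` is the field name used in the mapping cell of this section. The query vector must have the same dimension as the indexed vectors.

```python
def build_semantic_query(query_vector, vector_field="embedding_bert", size=5):
    """Request body that ranks all documents by cosine similarity to a query vector."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 shifts the cosine from [-1, 1] to [0, 2], because
                    # ElasticSearch does not accept negative scores
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": [float(x) for x in query_vector]},
                },
            }
        },
    }


# Placeholder vector; in the notebook it would come from the encoder, e.g.
# SentenceTransformer("multi-qa-MiniLM-L6-cos-v1").encode("The capital of Italy")
semantic_body = build_semantic_query([0.1, 0.2, 0.3])
print(semantic_body["query"]["script_score"]["script"]["source"])
```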

In [None]:
%%capture
!pip install sentence-transformers

In [None]:
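# create mapping
# NB: `sentence_encodings` is not defined in this cell. It is expected to hold
# the article embeddings computed beforehand (one vector per article).
dense_dim = len(sentence_encodings[0])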

index_properties = {}
index_properties['settings']={ "number_of_shards": 2, "number_of_replicas": 1}
index_properties['mappings']={ "dynamic": "true", "_source": { "enabled": "true" }, "properties": {}}
for t in ['ID', 'quoted_ID', 'full_text']: 
    index_properties['mappings']['properties'][t]={ "type": "text" }
for t in ['pagerank_score']: 
    index_properties['mappings']['properties'][t]={ "type": "float" }
for d in ["embedding_bert"]: 
    index_properties['mappings']['properties'][d]={ "type": "dense_vector", "dims": dense_dim }

In [None]:
# Your code here

## Content-based Recommender Systems

A recommender system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. (source: [Wikipedia](https://en.wikipedia.org/wiki/Recommender_system))

In this part of the practice you will build an unsupervised, text-based recommendation system (**content**-based only). The goal is similar to that of an IR search engine; the main difference lies in **how you define the "queries".**

The tools at your disposal are:
1. `Sentence-BERT model`: should be used to obtain a vector representation of the input data.
2. `ElasticSearch`: can be used for indexing movie information and to perform **fast** similarity search.

For the recommendation system you need the following information:
- Movie's title
- Movie's plot
- Plot's embedding vector

The dataset used for this goal is: [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots). For this practice you will use a truncated version of the data collection to reduce runtime.

In [None]:
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wiki_plots_2005onward.csv
import pandas as pd
df_movies = pd.read_csv("wiki_plots_2005onward.csv")

--2021-11-01 19:59:28--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wiki_plots_2005onward.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45936814 (44M) [text/plain]
Saving to: ‘wiki_plots_2005onward.csv’


2021-11-01 19:59:31 (132 MB/s) - ‘wiki_plots_2005onward.csv’ saved [45936814/45936814]



### **Question 6: movie encodings**

Use a Sentence-BERT model to encode movie plots into fixed-size vectors.

NB: the vector dimension depends on the chosen pretrained model.

In [None]:
! pip install sentence-transformers

In [None]:
# Your code here

### **Question 7: ElasticSearch indexing**

Create a new ElasticSearch index (`recsys-movies`) and index all movies with their embedding vectors.



In [None]:
# Your code here

### **Question 8: Query generation**

Create a function that accepts the following arguments:
1. `embedding_model`: Sentence-BERT model used to generate embeddings
2. `df_movies`: the dataframe containing all the movies' information
3. `movie_title`: a string containing the title of the movie the user is currently watching.

The function should look up the plot of `movie_title` in `df_movies`, encode it with `embedding_model`, and return the resulting embedding vector, which is then used as the query.




In [None]:
# Your code here

### **Question 9: Qualitative evaluation (your personal movie recommendation system)**

Evaluate your personal recommendation system by querying for some movies in the data collection. You need to create an ElasticSearch query to use the recommendation system (see Q. 5 of this practice).

Just some examples:
1. title: Harry Potter and the Goblet of Fire
2. title: Avengers: Age of Ultron
3. title: Star Wars: The Last Jedi


In [None]:
# Your code here

### **Question 10 (Bonus)**

Rewrite the query-generation function from Question 8 so that it takes multiple movie titles (a list of strings). Compute the average of their embedding vectors and use it to obtain recommendations. Perform a qualitative evaluation for this case (you may choose movie titles from the previous list).
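
The averaging step itself is small. Here is a minimal pure-Python sketch, assuming the embeddings are plain sequences of floats of equal length. The helper name `mean_vector` and the toy vectors are illustrative only. With NumPy arrays, `np.mean(embeddings, axis=0)` would do the same.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    vectors = [list(v) for v in vectors]
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    # zip(*vectors) groups the i-th components of all vectors together
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Toy 3-dimensional "embeddings" standing in for two encoded movie plots
toy_embeddings = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
query_vector = mean_vector(toy_embeddings)
print(query_vector)  # [2.0, 1.0, 1.0]
```

The averaged vector has the same dimension as each plot embedding, so it can be passed to the same similarity query used for a single title.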

In [None]:
# Your code here