<a href="https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/2022_2023/Practice_3_IR_and_Recommendation_systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 3:** Information Retrieval & ElasticSearch

### Download and setup ElasticSearch on Google Colab

In [None]:
# Download and extract elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.10.1


In [None]:
import os
from subprocess import Popen, PIPE, STDOUT

# If issues are encountered with this section, ES can be manually started as follows:
# ./elasticsearch-7.10.1/bin/elasticsearch

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30

In [None]:
# wait a bit then test
!curl -X GET "localhost:9200/"

## Information Retrieval

Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of **texts**, images or sounds. (source: Wikipedia).

In this practice you will build a Wikipedia-based search engine. Only a subset of the Wikipedia pages will be used.

Data Source: https://snap.stanford.edu/data/wikispeedia.html 

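Besides matching text, a search engine can rank documents by their importance. A classic example is the **PageRank** algorithm, originally developed to rank web pages for Google Search. It is a graph-based algorithm that assigns a score to each node of a graph based on the scores of the nodes linking to it, each divided by that node's number of outgoing links. The main steps of the algorithm are: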

1.   Assign a score of 1/N to each node, where N is the number of nodes in the graph.
2.   For each node $i$, apply the following update: $PR(i) = \dfrac{(1-d)}{N} + d \sum_{j \in In(i)} \dfrac{PR(j)}{Out(j)}$, where $d$ is the damping factor (usually 0.85), $In(i)$ is the set of nodes that link to node $i$, and $Out(j)$ is the number of outgoing links from node $j$.
3.   Repeat step 2 until convergence.
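
The three steps above can be sketched in plain Python on a toy graph. This is a minimal illustration rather than a reference implementation: the function name `pagerank_power_iteration`, the tolerance, the iteration cap, and the toy graph are all assumptions made here. The formula does not say what to do with nodes that have no outgoing links, so the sketch assumes their score is spread uniformly over all nodes.

```python
def pagerank_power_iteration(out_links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate the PageRank update on a dict {node: [linked nodes]}."""
    # Collect every node, including those that only appear as link targets.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Step 1: uniform initial score of 1/N.
    pr = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Assumption: nodes without outgoing links share their score with all nodes.
        dangling = sum(pr[node] for node in nodes if not out_links.get(node))
        new_pr = {node: (1.0 - d) / n + d * dangling / n for node in nodes}

        # Step 2: each node j passes PR(j) / Out(j) to every node it links to.
        for j in nodes:
            targets = out_links.get(j, [])
            for i in targets:
                new_pr[i] += d * pr[j] / len(targets)

        # Step 3: stop once the scores no longer change noticeably.
        delta = sum(abs(new_pr[node] - pr[node]) for node in nodes)
        pr = new_pr
        if delta < tol:
            break
    return pr


# Hypothetical four-page graph: C is linked by A, B and D.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank_power_iteration(toy_graph)
best_page = max(scores, key=scores.get)
print(best_page, scores[best_page])
```

On the real data, the same dictionary shape can be obtained from the `out_links` lists, keeping in mind that pages without outgoing links have no `out_links` key.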


For the practice, you can use the `networkx` library to compute the PageRank scores of the nodes of the graph.


---

The following cells download the data and parse them to create a dictionary of pages. Each element of the dictionary contains the following information:

*   `ID`: the ID of the page
*   `quoted_ID`: the ID of the page escaped using URL (percent) encoding
*   `categories`: the list of categories of the page (the key is absent if the page has no category)
*   `out_links`: the list of pages linked from the current page (the key is absent if the page has no outgoing links)

An example entry of the dictionary is shown below.

```python

{
    'ID': 'Áedán_mac_Gabráin', 
    'quoted_ID': '%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 
    'categories': ['subject.History.British_History.British_History_1500_and_before_including_Roman_Britain', 'subject.People.Historical_figures'], 
    'out_links': ['Bede', 'Columba', 'Dál_Riata', 'Great_Britain', 'Ireland', 'Isle_of_Man', 'Monarchy', 'Orkney', 'Picts', 'Scotland', 'Wales']
}
```
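
The relation between `ID` and `quoted_ID` in the entry above can be checked with the standard library. The short sketch below uses `urllib.parse.quote` and `unquote` (the same `unquote` used by the parsing cells); the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

page_id = "Áedán_mac_Gabráin"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
quoted_id = quote(page_id)

# unquote() reverses the operation and restores the readable ID.
restored_id = unquote(quoted_id)

print(quoted_id)
print(restored_id == page_id)
```

The quoted form is the one to use when building file names, while the unquoted form is the key of `dict_articles`.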

---
### **Question 1: PageRank scores**

Using the Wikipedia link network, compute the [PageRank](http://ilpubs.stanford.edu:8090/422/) score of each page.

Which page has the highest PageRank score?


In [None]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/articles.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/categories.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/links.tsv


In [None]:
%%capture
! pip install elasticsearch==7.10.1
! pip install networkx

In [None]:
from urllib.parse import unquote

# Read the percent-encoded article IDs, skipping blank lines and "#" comment lines
with open("articles.tsv") as f:
    list_articles = [l for l in f.read().split("\n") if l != "" and l[0] != "#"]

# Decode each ID and store both forms in the page dictionary
unquoted_list_articles = [unquote(l) for l in list_articles]
dict_articles = {}
for quoted, l in zip(list_articles, unquoted_list_articles):
    dict_articles[l] = {"ID": l, "quoted_ID": quoted}

In [None]:
from urllib.parse import unquote

list_categories = open("categories.tsv").read()
list_categories = list_categories.split("\n")
list_categories = [l for l in list_categories if l!= ""]
list_categories = [l for l in list_categories if l[0] != "#"]

for l in list_categories:
    k, v = l.split("\t")
    k = unquote(k)
    v = unquote(v)
    if "categories" in dict_articles[k].keys():
        dict_articles[k]["categories"].append(v)
    else:
        dict_articles[k]["categories"] = [v]
    
print (dict_articles)

In [None]:
from urllib.parse import unquote

list_links = open("links.tsv").read()
list_links = list_links.split("\n")
list_links = [l for l in list_links if l!= ""]
list_links = [l for l in list_links if l[0] != "#"]

for l in list_links:
    s, t = l.split("\t")
    s = unquote(s)
    t = unquote(t)
    if "out_links" in dict_articles[s].keys():
        dict_articles[s]["out_links"].append(t)
    else:
        dict_articles[s]["out_links"] = [t]

In [None]:
print (dict_articles["Áedán_mac_Gabráin"])

In [None]:
# your code here

### **Question 2: Wikipedia pages indexing**

In this question, you will create an index of the Wikipedia pages, which will be used to search them.

Create a new index in ElasticSearch and index all the pages together with their content. The content of each page can be found at `plaintext_articles/QUOTED_ID_OF_THE_DOC.txt`. For example, the content of the page with ID `Áedán_mac_Gabráin` can be found at `plaintext_articles/%C3%81ed%C3%A1n_mac_Gabr%C3%A1in.txt`. The following cell will download the necessary files.

NB: the PageRank score of each page should be stored in the index as a field named `pagerank_score`.
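
One possible way to organize the indexing step is to generate one action per page and pass the stream to the bulk helper of the Python client. The sketch below is only a starting point under stated assumptions: the helper name `build_index_actions`, the index name `wiki-search`, the `full_text` field name, the UTF-8 file encoding, and the `pagerank_scores` dictionary (page ID to score) are all hypothetical choices, not part of the practice material.

```python
import os


def build_index_actions(dict_articles, pagerank_scores,
                        index_name="wiki-search",
                        articles_dir="plaintext_articles"):
    """Yield one bulk-indexing action per page."""
    for page_id, page in dict_articles.items():
        # The text file is named after the percent-encoded ID.
        path = os.path.join(articles_dir, page["quoted_ID"] + ".txt")
        with open(path, encoding="utf-8") as handle:  # assumed encoding
            full_text = handle.read()
        yield {
            "_index": index_name,
            "_id": page_id,
            "_source": {
                "ID": page_id,
                "quoted_ID": page["quoted_ID"],
                "full_text": full_text,
                # Pages missing from the score dictionary default to 0.0.
                "pagerank_score": pagerank_scores.get(page_id, 0.0),
            },
        }


# With a running server, the actions could be sent like this:
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# helpers.bulk(es, build_index_actions(dict_articles, pagerank_scores))
```

Using a generator avoids holding the text of every article in memory at once.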


In [None]:
%%capture
! wget https://github.com/MorenoLaQuatra/DeepNLP/raw/main/practices/P3/plaintext_articles.zip
! unzip plaintext_articles.zip

In [None]:
# your code here

### **Question 3: Querying ElasticSearch**

Once the Wikipedia pages are indexed, you can query the search engine using the `elasticsearch` library. Search the full text of the articles for content of your choice, and choose and report 3 queries.

E.g.:
- query 1 : "The capital of Italy" (surprised by the result?)
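
A full-text query such as the example above can be expressed as a `match` query. The sketch below only builds the request body; the helper name `build_full_text_query`, the `full_text` field name, the index name `wiki-search`, and the `size` value are assumptions for illustration. The commented lines show how the body might be sent with the 7.x Python client.

```python
def build_full_text_query(text, size=5):
    """Return a search body that matches `text` against the article content."""
    return {
        "size": size,
        # A match query analyzes the text and scores documents on the query terms.
        "query": {"match": {"full_text": text}},
        # Only return the fields needed to inspect the ranking.
        "_source": ["ID", "pagerank_score"],
    }


body = build_full_text_query("The capital of Italy")
print(body)

# With a running server and an existing index:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# for hit in es.search(index="wiki-search", body=body)["hits"]["hits"]:
#     print(hit["_score"], hit["_source"]["ID"])
```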

In [None]:
# Your code here

### **Question 4: Integrating PageRank scores**

The standard full-text search engine does not take the importance of the pages into account. In this question, you will modify the query so that the PageRank score of each page contributes to the ranking. Create a template query that includes PageRank when computing the score (`_score`).

Use the [Script score](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score) function to generate a hybrid score (`_score + pagerank_score * 250`). The value 250 is a scaling factor that you can tune to obtain better results.

Run the same set of queries with this modification. Does it change the results?
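
As a starting point, the hybrid score can be written as a `function_score` query whose `script_score` function combines `_score` with the stored field. The sketch below is one possible template, not the only valid one: the helper name `build_pagerank_query` and the `full_text` field name are assumptions, and `boost_mode` is set to `replace` so that the script output is used directly as the final score instead of being multiplied by the text score again.

```python
def build_pagerank_query(text, factor=250, size=5):
    """Return a search body that adds a scaled PageRank score to the text score."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # The inner query produces the usual full-text _score.
                "query": {"match": {"full_text": text}},
                "script_score": {
                    "script": {
                        "source": "_score + doc['pagerank_score'].value * params.factor",
                        "params": {"factor": factor},
                    }
                },
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        },
    }


hybrid_body = build_pagerank_query("The capital of Italy")
print(hybrid_body["query"]["function_score"]["script_score"]["script"]["source"])
```

Passing the factor through `params` lets you tune it without changing the script source.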



In [None]:
# Your code here

### **Question 5: Integrating semantic dense vectors**

Standard full-text search engines match query terms against document terms and are not able to capture the semantic meaning of the words. By default, the search engine will return the pages that contain the query terms. In this question, you will use dense semantic vectors to improve the search engine.

**Assignment:** Generate a new index ("wiki-semantic-search") including all the information of the previous one plus an additional field that contains a BERT-based embedding vector of the `full_text` of the article. Once indexing is completed, repeat the same queries for a qualitative evaluation of the IR system. 

**How to create vector embeddings? Some hints:**
- Use Sentence-BERT pretrained encoders (www.sbert.net). Choose the most suitable pretrained model (trade off between speed and accuracy). E.g., `multi-qa-MiniLM-L6-cos-v1`
- Use cosine similarity to compute the similarity between queries and full text of the article.
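
To make the second hint concrete, the sketch below ranks a few toy documents by cosine similarity with a query vector. The 3-dimensional vectors and the document names are made up for illustration; real Sentence-BERT embeddings have a few hundred dimensions, and in the assignment the comparison is meant to run inside ElasticSearch rather than in Python.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of a query and three documents.
query_vector = [1.0, 0.0, 1.0]
documents = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned
}

# Sort document names from most to least similar to the query.
ranking = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, documents[name]),
    reverse=True,
)
print(ranking)
```

Cosine similarity ignores vector length, which is why `doc_a` is a perfect match even though it is twice as long as the query vector.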

In [None]:
%%capture
!pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
model_name="multi-qa-MiniLM-L6-cos-v1"

# Your code here

In [None]:
# create mapping
# the following code can be used to create the mapping for the index and instantiate all the properties and the types of the fields

dense_dim = len(sentence_encodings[0]) # sentence_encodings is the list of the sentence embeddings

index_properties = {}
index_properties['settings']={ "number_of_shards": 2, "number_of_replicas": 1}
index_properties['mappings']={ "dynamic": "true", "_source": { "enabled": "true" }, "properties": {}}
for t in ['ID', 'quoted_ID', 'full_text']: 
    index_properties['mappings']['properties'][t]={ "type": "text" }
for t in ['pagerank_score']: 
    index_properties['mappings']['properties'][t]={ "type": "float" }
for d in ["embedding_bert"]: 
    index_properties['mappings']['properties'][d]={ "type": "dense_vector", "dims": dense_dim }

In [None]:
# Your code here

## Content-based Recommender Systems

A recommender system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. (source: [Wikipedia](https://en.wikipedia.org/wiki/Recommender_system))

In this part of the practice you will build an unsupervised, text-based recommendation system (purely **content**-based). The final goal is similar to that of an IR search engine; the main difference lies in **how you define the "queries".**

The tools at your disposal are:
1. `Sentence-BERT model`: should be used to obtain a vector representation of the input data.
2. `ElasticSearch`: can be used for indexing movie information and to perform **fast** similarity search.

For the recommendation system you need the following information:
- Movie's title
- Movie's plot
- Plot's embedding vector

The dataset used for this goal is: [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots). For this practice you will use a truncated version of the data collection to reduce runtime.

In [None]:
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wiki_plots_2005onward.csv
import pandas as pd
df_movies = pd.read_csv("wiki_plots_2005onward.csv")

### **Question 6: Movie encodings**

Using the `Sentence-BERT` model, generate a vector representation of the movie plots. Each vector should be stored in the ElasticSearch index to perform similarity search.

NB: the vector dimension is dependent on the choice of the pretrained model.

In [None]:
! pip install sentence-transformers

In [None]:
# Your code here

### **Question 7: ElasticSearch indexing**

Create a new ElasticSearch index (`recsys-movies`) and index all movies with their embedding vectors. In this case, the index should contain the following fields:
- `title`: the title of the movie
- `plot`: the plot of the movie
- `embedding_bert`: the embedding vector of the plot



In [None]:
# Your code here

### **Question 8: Query generation**

Create a function that takes as input the following parameters:

1. `embedding_model`: Sentence-BERT model used to generate embeddings
2. `df_movies`: the dataframe containing all the movies' information
3. `movie_title`: a string containing the title of the movie the user is currently watching.

It should look up the movie in the dataframe by `movie_title`, retrieve its plot, and return the embedding vector of the plot computed with `embedding_model`.

NB: The title entered by the user must exactly match the title of a movie in the dataframe. If the title is not found in the dataframe, the function should raise an error.




In [None]:
# Your code here

### **Question 9: Qualitative evaluation (your personal movie recommendation system)**

Evaluate your personal recommendation system by querying for some movies in the data collection. You need to create an elasticsearch query to use the recommendation system (see Q. 5 of this practice).

Just some examples:
1. title: Harry Potter and the Goblet of Fire
2. title: Avengers: Age of Ultron
3. title: Star Wars: The Last Jedi
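
A recommendation query can reuse the dense-vector approach: the plot embedding of the watched movie becomes the query vector, and stored vectors are ranked by cosine similarity. The sketch below only builds the request body and rests on several assumptions: the helper name `build_similarity_query` is hypothetical, the `cosineSimilarity(params.query_vector, 'embedding_bert')` script syntax is the one expected for ElasticSearch 7.10 and may differ in other versions, and `1.0` is added because script scores must not be negative.

```python
def build_similarity_query(query_vector, size=5):
    """Return a search body ranking movies by cosine similarity to a vector."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Score every indexed movie.
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding_bert') + 1.0",
                    # Plain floats keep the body JSON-serializable.
                    "params": {"query_vector": [float(x) for x in query_vector]},
                },
            }
        },
        "_source": ["title"],
    }


# Toy vector for illustration; a real one comes from the Sentence-BERT model.
rec_body = build_similarity_query([0.1, 0.2, 0.3])
print(rec_body["query"]["script_score"]["script"]["source"])

# With a running server and the `recsys-movies` index:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# hits = es.search(index="recsys-movies", body=rec_body)["hits"]["hits"]
```

Expect the watched movie itself to appear at the top of the results, since its own plot vector is the closest match.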


In [None]:
# Your code here

### **Question 10 (Bonus)**

Rewrite the function from **Q.8** to take multiple movie titles (a list of strings). Compute the average embedding vector of the movies and return it as the query vector. This can be used to build a recommendation system that takes into account multiple movies watched by the user. Perform a qualitative evaluation in this setting (you can choose movie titles from the previous list).
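
The averaging step on its own can be sketched as follows. The helper name `average_embedding` and the 2-dimensional toy vectors are assumptions for illustration; with real embeddings the same element-wise mean can also be computed with NumPy.

```python
def average_embedding(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    vectors = [list(v) for v in vectors]
    if not vectors:
        raise ValueError("at least one embedding vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embedding vectors must have the same dimension")
    # Average each coordinate across the vectors.
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy embeddings standing in for the plots of two watched movies.
query_vector = average_embedding([[1.0, 2.0], [3.0, 4.0]])
print(query_vector)
```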

In [None]:
# Your code here