## Search Engine with HuggingFace

<a href="https://colab.research.google.com/github/EffiSciencesResearch/ML4G/blob/main/days/w1d4/Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this tutorial is to learn the basics of the most widely used library in modern NLP: Hugging Face.

We will walk through the first steps of building a semantic search engine.

## Download the list of papers

In [5]:
!wget https://github.com/EffiSciencesResearch/ML4G/raw/4010bb6ccd63dee5896b26ee3c045898e0cf9ed6/days/w1d4/keynesian_eco_ML4G.xlsx -q

In [6]:
!wget https://raw.githubusercontent.com/EffiSciencesResearch/ML4G/38f80110be0802837254c1cd888f387475c9b5fe/days/w1d4/tldr_dataset.csv -q

In [7]:
pip install transformers -q


In [8]:
pip install -U sentence-transformers


In [9]:
import pandas as pd
import numpy as np
dfimp = pd.read_excel("keynesian_eco_ML4G.xlsx")
tldrs = pd.read_csv('tldr_dataset.csv')

In [10]:
df = pd.concat([dfimp['title'], dfimp['paperAbstract']], axis=1)

In [11]:
df.iloc[327]

title            The problem of international economic equilibrium
paperAbstract                                                  NaN
Name: 327, dtype: object

## Semantic search engine

Create a search engine by embedding each paper with https://huggingface.co/sentence-transformers/all-mpnet-base-v2 and comparing the embeddings with the embedded query.

You can use the sentence-transformers library: https://www.sbert.net/

Run a few queries to check that it works.

In [12]:
from sentence_transformers import SentenceTransformer
# sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# embeddings = model.encode(sentences)
# print(embeddings)



In [13]:
articles = []
for _, data in df.iterrows():
  # Each document is the title, followed by the abstract when one exists.
  sentence = data['title']
  if pd.notna(data['paperAbstract']):
    sentence += "\n" + data['paperAbstract']
  articles.append(sentence)

# One embedding vector per article (this can be slow on CPU).
embedings = model.encode(articles)


KeyboardInterrupt: ignored

In [None]:
query = "Poverty in the US"
embeded_query = model.encode(query)
# Score every article by its dot product with the query and keep the best one.
result = int(np.argmax(embedings @ embeded_query))
print(df.iloc[result]['title'])
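
The cell above returns only the single best match, ranked by raw dot product. A minimal sketch of a top-k variant based on cosine similarity follows. The function name `top_k` and the toy 2-D vectors are illustrative assumptions; in the notebook, `doc_vecs` would be the array returned by `model.encode(articles)`.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k documents closest to the query."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Normalise both sides so the dot product equals the cosine similarity.
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()

# Toy "embeddings" standing in for real sentence embeddings (assumption).
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.1])
indices, scores = top_k(toy_query, toy_docs, k=2)
print(indices)
```

Returning several candidates makes it easier to judge by eye whether the ranking is sensible for a given query.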

## Few Shot Learning: Tldr

Use GPT-J 6B with some prompt engineering to create a tldr for each abstract.

You can begin with the smaller GPT-2: https://huggingface.co/gpt2


The first step is to create a single tldr. For that, you will use few-shot learning. You can use the examples in this dataset to create your prompt: https://github.com/EffiSciencesResearch/ML4G/blob/38f80110be0802837254c1cd888f387475c9b5fe/days/w1d4/tldr_dataset.csv


After creating a single tldr, the aim is to add a new column to the pandas dataframe containing the tldr of each paper:
- Automate the process and run inference in batches. Store the results in a new column of the dataframe.
- Use the tqdm library to create a progress bar.
- Notice that it is too slow and switch to GPU. To do this, use ".to(device)" on the tensors returned by the tokenizer and on the model.
- Use the command 'nvidia-smi' in the terminal to monitor the GPU usage. Aim for a GPU utilization of about 70%.
- How does the speed of inference vary with the batch size?
- How does the inference speed vary with the padding policy in the tokenizer?
- How does the quality/speed of inference vary with the num_beams (beam search) parameter?
- Bonus: read https://huggingface.co/blog/how-to-generate
- Bonus: Use a bigger model https://huggingface.co/EleutherAI/gpt-j-6B
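
The batch-size and padding questions in the list above both come down to how many padding positions each batch carries. The sketch below uses plain word counts as a stand-in for real token counts; the helper names `batches` and `padding_waste` and the example lengths are assumptions for illustration. It compares batching in the original order with batching after sorting by length.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def padding_waste(lengths, batch_size):
    """Count padding positions when each batch is padded to its longest member."""
    waste = 0
    for batch in batches(lengths, batch_size):
        longest = max(batch)
        waste += sum(longest - n for n in batch)
    return waste

# Hypothetical abstract lengths in words; real code would use tokenizer output.
lengths = [120, 15, 200, 30, 180, 25]
unsorted_waste = padding_waste(lengths, batch_size=2)
sorted_waste = padding_waste(sorted(lengths), batch_size=2)
print(unsorted_waste, sorted_waste)
```

In this toy case, grouping inputs of similar length cuts the padding by more than two thirds. The same `batches` generator can be wrapped in tqdm to get the progress bar.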


In [31]:
import transformers
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model = GPT2Model.from_pretrained('gpt2')
# text = "Replace me by any text you'd like."
# encoded_input = tokenizer(text, return_tensors='pt')
# output = model(**encoded_input)
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)

# generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


In [26]:
def key(s):
  # Sort key: number of whitespace-separated words in each string.
  return s.str.split().map(len)

# Sort the examples from shortest to longest (tldr + abstract, in words).
tldrs['sumlen'] = tldrs['tldr'] + tldrs['abstract']
tldrs = tldrs.sort_values(by=['sumlen'], key=key)

def nb_token(x):
  # Rough proxy: counts words, not actual GPT-2 tokens.
  return len(x.split())

def max_sentence_len(x):
  return max(map(nb_token, x.split('.')))

EXPECTED_LENGTH = 50  # words reserved for the generated tldr
biggest = max(df['paperAbstract'],
              key=lambda x: 0 if str(x) == 'nan' or '.' not in x else max_sentence_len(x))
max_input = max_sentence_len(biggest)
max_input = 200  # manual override of the value computed above
token_needed = max_input + 5 + EXPECTED_LENGTH

# Budget left for the few-shot examples.
prompt_length = 512 - token_needed
print(prompt_length)
prompt = ""

for i, res in tldrs.iterrows():
  if i < 10:  # skip the rows whose index label is below 10
    continue
  new_example = "Title: {}\nAbstract: {}\ntldr: {}\n".format(res['title'], res['abstract'], res['tldr'])
  if nb_token(new_example) > prompt_length:
    # This example no longer fits in the budget: show it and stop.
    print(new_example)
    break
  prompt += new_example
  prompt_length -= nb_token(new_example)
print(prompt)

257
Title: Trade unions as retaining walls against political change: A Gramscian approach to remunicipalisation policies in a Spanish City
Abstract: The 2008 economic and political crisis produced a favourable opportunity structure for the emergence of new and innovative left-wing political projects and trade union strategies in Spain, especial...
tldr: The 2008 economic crisis in Spain produced a favourable opportunity for left-wing projects and trade union strategies.

Title: The House of Rothschild: The World's Banker [Book Review]
Abstract: Review(s) of: The House of Rothschild: The World's Banker, by Niall Ferguson, Two vols, Penguin, 2000.
tldr: The House of Rothschild: The World's Banker.
Title: Where Dividend Hunters Should Take a Peek (and a Pass)
Abstract: Some historically high-yielding sectors aren't necessarily your best stomping grounds, says Morningstar's Josh Peters.
tldr: Some historically high-yielding sectors aren't always your best stomping grounds, says Morningstar
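
The prompt-building loop above can be packaged as a reusable function. This is a sketch under the same simplifying assumption as the cell (whitespace-separated words approximate tokens). The names `build_prompt`, `make_query`, and the toy examples are illustrative and not part of the original notebook.

```python
TEMPLATE = "Title: {title}\nAbstract: {abstract}\ntldr: {tldr}\n"

def build_prompt(examples, budget):
    """Concatenate few-shot examples until the word budget is exhausted."""
    prompt = ""
    for example in examples:
        block = TEMPLATE.format(**example)
        cost = len(block.split())  # rough proxy for the token count
        if cost > budget:
            break  # the next example would not fit
        prompt += block
        budget -= cost
    return prompt

def make_query(prompt, title, abstract):
    """Append the unanswered example that the model should complete."""
    return prompt + "Title: {}\nAbstract: {}\ntldr:".format(title, abstract)

# Toy examples standing in for rows of the tldr dataset (assumption).
toy_examples = [
    {"title": "A", "abstract": "one two three", "tldr": "one"},
    {"title": "B", "abstract": "four five six seven", "tldr": "four"},
]
few_shot = build_prompt(toy_examples, budget=8)
print(make_query(few_shot, "C", "eight nine"))
```

Ending the query on `tldr:` with no trailing text is what invites the model to continue with a summary in the same format as the examples.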

In [42]:
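def get_answer(data):
  # Append the unanswered example to the few-shot prompt and let the model continue.
  query = "Title: {}\nAbstract: {}\ntldr:".format(data['title'], data['paperAbstract'])
  r = generator(prompt + query, max_length=1000)
  # Keep only the first line generated after the query's "tldr:" marker.
  extract = lambda s: s['generated_text'].split(data['paperAbstract'] + "\ntldr:")[1].split("\n")[0]
  return list(map(extract, r))

# Only papers that actually have an abstract are summarized.
for _, i in df.iloc[3:20].iterrows():
  if str(i["paperAbstract"]) != 'nan':
    print(get_answer(i))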

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 The control of the economy is examined in terms of the relationships between instruments and targets. The endogeneity of the budget deficit is analysed, exploring'reflation' as a possible solution, and it is shown that monetarism has changed people's response patterns. The mechanical approach to economic management, and particularly inflation, is rejected in favour of political solutions and long-term planning.



 New Found Foundations of Economic Growth



 Lancaster University.


KeyboardInterrupt: ignored

In [35]:
df.iloc[3:20]

|    | title | paperAbstract |
|----|-------|---------------|
| 3  | The Keynesian Revolution and its Critics | NaN |
| 4  | The control of the economy | The control of the economy is examined in term... |
| 5  | The Sraffian supermultiplier as an alternative... | This paper aims to show that the Sraffian supe... |
| 6  | The Local Income and Employment Impact of Lanc... | The paper presents the results of an analysis,... |
| 7  | Capitalist profit calculation and inflation ac... | NaN |
| 8  | Potential output and inflation dynamics after ... | Ever since the end of the Great Recession, the... |
| 9  | Theories of Currency Crisis | NaN |
| 10 | Stockbroking in the Nineties | In this concluding chapter I would like first ... |
| 11 | Moeda endógena e progresso tecnológico induzid... | This article intend to analyze the process of ... |
| 12 | A Brief Note on "Fundamental Disequilibrium" | NaN |


In [None]:
# `generator.model` is the language model wrapped by the pipeline created above.
help(generator.model.generate)
# You can read the doc for the following parameters:
# inputs, max_length, num_beams
# Read the Greedy Decoding example at the end of the documentation.
# Then you can delete this cell

## Quality filtering

(Bonus) Implement a strategy to keep only the high-quality tldrs.
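
One possible strategy, offered as a sketch and not as the intended solution, is a set of simple heuristics: reject a tldr that is too short to be informative, barely shorter than the abstract, or a verbatim copy of a passage from the abstract. The function name `is_good_tldr`, both thresholds, and the abridged toy abstract are assumptions chosen for illustration.

```python
def is_good_tldr(tldr, abstract, max_ratio=0.5, min_words=4):
    """Heuristic filter: keep short, non-trivial summaries that are not plain copies."""
    tldr_words = tldr.split()
    abstract_words = abstract.split()
    if len(tldr_words) < min_words:
        return False  # too short to be informative
    if len(tldr_words) > max_ratio * len(abstract_words):
        return False  # barely shorter than the abstract
    if " ".join(tldr_words) in " ".join(abstract_words):
        return False  # verbatim copy of a passage of the abstract
    return True

# Abridged toy abstract (assumption), used only to exercise the filter.
abstract = ("The control of the economy is examined in terms of the relationships "
            "between instruments and targets. The mechanical approach to economic "
            "management is rejected in favour of political solutions.")
print(is_good_tldr("Economic control needs political solutions, not mechanics.", abstract))
print(is_good_tldr("Lancaster University.", abstract))
```

Applied to the generations shown earlier, a filter of this kind would discard outputs such as "Lancaster University." and summaries that merely repeat the abstract.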

## Fine-Tuning

Bonus: Fine-tune T5-base on the corpus generated by GPT-J to speed up inference, and fine-tune your first LLM in the process.

Use: https://huggingface.co/docs/transformers/training#training-hyperparameters
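
Before any training, the generated corpus has to be turned into input/target pairs. The sketch below assumes the `summarize: ` task prefix commonly used with T5, rows given as dictionaries with hypothetical keys `abstract` and `tldr`, and illustrative helper names `make_t5_pairs` and `train_val_split`. It covers only data preparation, not the training loop itself.

```python
import random

def make_t5_pairs(rows, prefix="summarize: "):
    """Turn (abstract, tldr) rows into text-to-text pairs, skipping empty fields."""
    return [
        {"input_text": prefix + row["abstract"], "target_text": row["tldr"]}
        for row in rows
        if row["abstract"] and row["tldr"]
    ]

def train_val_split(pairs, val_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a validation subset."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Toy rows standing in for the GPT-J generated corpus (assumption).
rows = [
    {"abstract": "abstract one", "tldr": "tldr one"},
    {"abstract": "abstract two", "tldr": "tldr two"},
    {"abstract": "abstract three", "tldr": ""},  # dropped: empty tldr
    {"abstract": "abstract four", "tldr": "tldr four"},
    {"abstract": "abstract five", "tldr": "tldr five"},
]
pairs = make_t5_pairs(rows)
train_pairs, val_pairs = train_val_split(pairs, val_fraction=0.25)
print(len(train_pairs), len(val_pairs))
```

Holding out a validation subset makes it possible to check that the fine-tuned model generalizes and has not just memorized the generated tldrs.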