<a href="https://colab.research.google.com/github/RitaRez/POC/blob/main/bm25_and_PL2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

For indexing and ranking we will use the same library used in ... ```pyTerrier```

In [1]:
!pip install python-terrier

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Libraries and Data

```pandas``` to manipulate DataFrames, ```pyterrier``` for indexing and retrieval, ```json``` and ```re``` to manipulate queries and documents.

In [2]:
import json, re

import pyterrier as pt
import pandas as pd

from google.colab import drive

Mount Google Drive to access the data.

In [3]:
drive.mount('/content/drive')

if not pt.started():
  pt.init()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [4]:
path = "/content/drive/My Drive/ToT/"
!ls "/content/drive/My Drive/ToT"

Index  Movies


# Indexing

The corpus in documents.json and negative_documents.json is indexed and stored on Drive under ```/ToT/Index/```

In [9]:
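def parse_documents(documents_file):
  """
  Reads a JSON-lines file and renames each document's 'id' field to 'docno'
  """

  documents = []
  with open(documents_file, 'rt') as file:
    for line in file:
      new_doc = json.loads(line)

      # PyTerrier identifies documents by the 'docno' field
      new_doc['docno'] = new_doc.pop('id')

      documents.append(new_doc)

  return documents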

def index_corpus(documents_file, negative_documents_file, index_path):
  """
  Indexes the corpus
  """
  
  documents = parse_documents(documents_file) + parse_documents(negative_documents_file)
  
  # build the index
  indexer = pt.IterDictIndexer(index_path, verbose=True, meta={'docno': 20, 'text': 4096, 'title': 4096}, meta_reverse = ["docno"])
  return indexer.index(documents, meta=["docno"])


indexref = index_corpus(path + 'Movies/documents.json', path + 'Movies/negative_documents.json', path + 'Index/index')

# load the index, print the statistics
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())


00:56:34.270 [ForkJoinPool-3-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 29 empty documents
Number of documents: 13642
Number of terms: 57368
Number of postings: 1679218
Number of fields: 1
Number of tokens: 2609710
Field names: [text]
Positions:   false

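`parse_documents` loads every document into a list before indexing. Since `pt.IterDictIndexer` consumes an iterable of dicts, a generator could stream the same records instead. The following is only a sketch under that assumption; `iter_documents` is a hypothetical helper that is not part of the notebook.

```python
import json

def iter_documents(documents_file):
    """Yield one dict per JSON line, renaming 'id' to 'docno' (hypothetical helper)."""
    with open(documents_file, "rt", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing in json.loads
            doc = json.loads(line)
            doc["docno"] = doc.pop("id")
            yield doc
```

Two such generators could be combined with `itertools.chain` and passed to `indexer.index(...)` in place of the concatenated lists.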


# Creating the Retrieval Models

We will test the TF_IDF, BM25, and PL2 weighting models.

In [10]:
index = pt.IndexFactory.of(path + 'Index/index')
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF", properties={"termpipelines" : "Stopwords,PorterStemmer"})
bm25 = pt.BatchRetrieve(index, wmodel="BM25", properties={"termpipelines" : "Stopwords,PorterStemmer"})
pl2 = pt.BatchRetrieve(index, wmodel="PL2", properties={"termpipelines" : "Stopwords,PorterStemmer"})

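To give some intuition for what a weighting model computes, here is a minimal sketch of the textbook BM25 score of a single query term in a single document. The function name and the defaults `k1=1.2`, `b=0.75` are assumptions made for illustration; Terrier's own BM25 may differ in details such as the idf variant and query-term weighting. The average document length is derived from the collection statistics printed above.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document (illustrative)."""
    # rarer terms (small doc_freq) receive a larger idf
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # length normalization: long documents need more occurrences for the same score
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# average document length from the statistics above: tokens / documents
avg_doc_len = 2609710 / 13642

# the score grows with term frequency, but each extra occurrence adds less
for tf in (1, 2, 10):
    print(tf, bm25_term_score(tf, avg_doc_len, avg_doc_len, n_docs=13642, doc_freq=100))
```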
# Experiments

### Reading and preprocessing the queries

* We strip punctuation here; *stopword* removal and *stemming* are handled by the `Stopwords,PorterStemmer` term pipeline set on the retrieval models.

In [11]:
topics_path = path + 'Movies/queries.json'

qids = []; queries = []
with open(topics_path, 'rt') as file:
  for line in file:
    topic = json.loads(line)
    # replace every character that is not a word character or whitespace with a space
    title = re.sub(r'[^\w\s]', ' ', topic['title'])
    description = re.sub(r'[^\w\s]', ' ', topic['description'])

    qids.append(topic['id'])
    queries.append(title + " " + description)

topics = pd.DataFrame({"qid": qids, "query": queries})

topics.head(4)

| | qid | query |
|---|---|---|
| 0 | cggmzb | TOMT ANIMATION Little girl turns out to be ... |
| 1 | 9g2x0f | TOMT Movie Late 90s Kids messed up turned... |
| 2 | eitiw9 | TOMT MOVIE 2000s Teens investigate remote ... |
| 3 | km8f9u | tomt movie 1990 s 2000 s witch movie mo... |

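The effect of the punctuation filter is easiest to see on a small input. The sketch below applies the same regular expression to a made-up query; `strip_punctuation` is a hypothetical helper, and the whitespace collapsing at the end is an extra step that the notebook does not perform. Punctuation is presumably removed because Terrier's query parser can misread it as query-language operators.

```python
import re

def strip_punctuation(text):
    """Replace non-word, non-space characters with spaces, then collapse whitespace."""
    no_punct = re.sub(r'[^\w\s]', ' ', text)
    # extra step (not in the notebook): squeeze the runs of spaces left behind
    return " ".join(no_punct.split())

print(strip_punctuation("TOMT [MOVIE] (2000s)!"))
print(strip_punctuation("1990's/2000's witch movie"))
```

Apostrophes become spaces, which explains fragments such as `1990 s 2000 s` in the preview above. Note that `\w` also keeps underscores and non-ASCII letters.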

### Reading the *qrels* file, which maps each query to documents and their relevance labels

In [13]:
qrels_path = path + 'Movies/qrels.txt'

qrels = pt.io.read_qrels(qrels_path)
qrels.head(4)

| | qid | docno | label |
|---|---|---|---|
| 0 | 5lj3jl | tt0478303 | 1 |
| 1 | 5lldhi | tt2618986 | 1 |
| 2 | 5lmsd4 | tt0240772 | 1 |
| 3 | 5m467z | tt1133691 | 1 |

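`pt.io.read_qrels` expects the TREC qrels layout, with four whitespace-separated columns: query id, iteration, document id, and label. As a rough illustration, the same structure can be parsed with plain pandas. The sample lines below reuse ids from the preview above, and the iteration value `0` is an assumption about the file's contents.

```python
import io
import pandas as pd

# hypothetical excerpt in TREC qrels layout: qid, iteration, docno, label
sample = io.StringIO(
    "5lj3jl 0 tt0478303 1\n"
    "5lldhi 0 tt2618986 1\n"
)

sample_qrels = pd.read_csv(
    sample,
    sep=r"\s+",
    names=["qid", "iter", "docno", "label"],
    dtype={"qid": str, "docno": str, "label": int},
).drop(columns="iter")  # the iteration column is not needed for evaluation

print(sample_qrels)
```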

### Reading the negative *qrels* file, which maps each query to documents labeled as non-relevant

Also taken from Reddit; in this case they are comments with wrong answers.

In [14]:
hard_negatives_path = path + "Movies/sub_id_to_neg_doc_ids.json"

query_ids = []; doc_ids = []; labels = []
with open(hard_negatives_path, 'r') as file:
  negatives = json.load(file)

  # one label-0 row per (query, wrong document) pair
  for query_id, neg_doc_ids in negatives.items():
    for doc_id in neg_doc_ids:
      query_ids.append(query_id)
      doc_ids.append(doc_id)
      labels.append(0)

negatives_qrels = pd.DataFrame({"qid": query_ids, "docno": doc_ids, "label": labels})
negatives_qrels.head()

| | qid | docno | label |
|---|---|---|---|
| 0 | arngub | tt1038988 | 0 |
| 1 | eg0ze2 | tt0162661 | 0 |
| 2 | e67k4z | tt0376541 | 0 |
| 3 | 87izqh | tt0161743 | 0 |
| 4 | 87izqh | tt0092117 | 0 |

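The flattening step turns a mapping from query id to a list of wrong document ids into one row per pair. A more compact way to express the same idea is sketched below, using a small in-memory dictionary built from ids in the preview above; the dictionary is an illustrative stand-in for the JSON file.

```python
import pandas as pd

# illustrative stand-in for sub_id_to_neg_doc_ids.json
negatives = {
    "87izqh": ["tt0161743", "tt0092117"],
    "arngub": ["tt1038988"],
}

# one (qid, docno, label=0) row per wrong answer
rows = [(qid, docno, 0) for qid, docnos in negatives.items() for docno in docnos]
sample_negatives = pd.DataFrame(rows, columns=["qid", "docno", "label"])

print(sample_negatives)
```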

### Experiment with negatives

In [16]:
all_qrels = pd.concat([qrels, negatives_qrels], ignore_index=True, sort=False)

pt.Experiment(
    [tf_idf, bm25, pl2],
    topics,
    all_qrels,
    eval_metrics=["recall_1", "recall_10", "recip_rank"]
)

| | name | recall_1 | recall_10 | recip_rank |
|---|---|---|---|---|
| 0 | BR(TF_IDF) | 0.099027 | 0.227252 | 0.1429 |
| 1 | BR(BM25) | 0.106751 | 0.238375 | 0.152264 |
| 2 | BR(PL2) | 0.09416 | 0.222385 | 0.138613 |

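For reference, the metrics reported here can be computed by hand for a single query. The sketch below uses made-up document ids and hypothetical helper names; `pt.Experiment` averages such per-query values over all topics. Under the usual trec_eval convention, unjudged documents count as non-relevant, so the label-0 rows would not be expected to change these three metrics.

```python
def recall_at_k(ranked_docnos, relevant_docnos, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_docnos:
        return 0.0
    hits = sum(1 for docno in ranked_docnos[:k] if docno in relevant_docnos)
    return hits / len(relevant_docnos)

def reciprocal_rank(ranked_docnos, relevant_docnos):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant_docnos:
            return 1.0 / rank
    return 0.0

# made-up ranking for one query whose single correct answer is ranked second
ranking = ["tt0000002", "tt0000001", "tt0000003"]
relevant = {"tt0000001"}

print(recall_at_k(ranking, relevant, 1))
print(recall_at_k(ranking, relevant, 10))
print(reciprocal_rank(ranking, relevant))
```

With one relevant document per query, as in the qrels preview, recall at 1 is simply the share of queries whose correct movie is ranked first.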