# Report and Project - Document Indexing and Search System

## 1 - Team information
---
IN1152 - Intelligent Information Retrieval - 2022.2

**Team**: Matheus Rodrigues de Souza Félix (matheusrdgsf@gmail.com) and Rodrigo Souza de Melo (rsm5@cin.ufpe.br)

**Project "Document Indexing and Search System"**

Professor Flavia de Almeida Barros (fab@cin.ufpe.br)

- This report presents the sections requested in Assignment 1 - Document Indexing and Search System, followed by the project's source code.

---

## 2 - Description of the documents (corpus) to be indexed by the system
---

 - 2 themes/topics covered by the documents in your collection
 - Show 2 or 3 example documents from the corpus in the report
 
 - BEIR [1] is a heterogeneous benchmark covering several IR tasks. It also provides a common, easy-to-use framework for evaluating NLP-based retrieval models. Among the datasets available in BEIR, SciFact was selected.
 - SciFact [4]: With the rapid growth of the scientific literature, there is a need for automated systems that help researchers and the public assess the veracity of scientific claims. To support the development of systems for this task, SciFact provides a dataset of 1.4K expert-written claims, paired with evidence-containing abstracts annotated with veracity labels and rationales.
 - Leaderboard [4]: Evaluates model submissions on the SciFact dataset, with the goal of developing automated systems for scientific claim verification.
 - Dataset [2]: Made up of the elements below.
   - corpus: the title and text of each document
   - query: the query
   - qrels: the relevance judgments for each query
   
 - Example of the data structure with 2 queries, 2 documents, and the relevance judgments for each query [2].

~~~python
corpus = {
    "doc1" : {
        "title": "Albert Einstein", 
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, \
                 one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
                 its influence on the philosophy of science. He is best known to the general public for his mass–energy \
                 equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
                 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
                 of the photoelectric effect', a pivotal step in the development of quantum theory."
        },
    "doc2" : {
        "title": "", # Keep title an empty string if not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
                 malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\
                 with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1" : "Who developed the mass-energy equivalence formula?",
    "q2" : "Which beer is brewed with a large proportion of wheat?"
}

qrels = {
    "q1" : {"doc1": 1},
    "q2" : {"doc2": 1},
}
~~~
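
The three structures are linked by their keys: `qrels` maps each query ID to the IDs of the documents judged relevant for it. The lookup can be sketched as follows, using shortened stand-ins for the example data above. The helper name `relevant_docs` is hypothetical and is not part of BEIR.

```python
# Shortened stand-ins for the example structures above.
corpus = {
    "doc1": {"title": "Albert Einstein",
             "text": "Albert Einstein was a German-born theoretical physicist."},
    "doc2": {"title": "",
             "text": "Wheat beer is a top-fermented beer."},
}
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}


def relevant_docs(query_id, qrels):
    """Return the sorted IDs of documents judged relevant (score > 0) for a query."""
    judged = qrels.get(query_id, {})
    return sorted(doc_id for doc_id, score in judged.items() if score > 0)


for query_id, query_text in queries.items():
    for doc_id in relevant_docs(query_id, qrels):
        title = corpus[doc_id]["title"] or "untitled"
        print(f"{query_id}: {query_text!r} -> {doc_id} ({title})")
```

A query with no entry in `qrels` simply yields an empty list, which keeps the evaluation loop free of special cases.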
 
 References: 
 * [1] https://github.com/beir-cellar/beir
 * [2] https://huggingface.co/datasets/BeIR/scifact-generated-queries
 * [3] https://github.com/allenai/scifact
 * [4] https://leaderboard.allenai.org/scifact/submissions/about
              

---

## 3 - System architecture 
---

- Provide a brief description of the processing stages of the system built on top of the chosen tool. State which IR model your system implements, which formula is used to compute the weights, and which ranking function is used (see the lectures on IR models). 

---

## 4 - Creation of the indexed document bases 
---

- Document preparation & indexing: The system must automatically create five indexed BASES from the original document collection. Each BASE (in Solr, a CORE) must use a different data preparation (pre-processing) pipeline. The goal is to find out which pre-processing configuration works best for your case.

   - BASE 1: original documents, with no extra pre-processing (tokenization only);
   - BASE 2: stopword removal only (use a stopword filter);
   - BASE 3: stemming only, without removing stopwords (do not use a stopword filter);
   - BASE 4: stopword removal (use a stopword filter) plus stemming;
   - BASE 5: stopword removal (use a stopword filter), no stemming, plus a synonym dictionary.


---


## 5 - Creation of the queries and of the relevance matrix  
---

- This step was described in more detail in the corresponding activity, posted on Classroom. 
  ===> QUESTION: I COULD NOT FIND THE DETAILS ON CLASSROOM.
  
- Show part of the matrix here - it is enough to show the 5 columns exemplified below (including the final column with the number of relevant documents). 

Example of a "Queries x Documents" relevance matrix.


| Query                                                              | Doc 1 | Doc 2 | ... Doc 20 | No. of relevant docs |
|--------------------------------------------------------------------|-------|-------|------------|----------------------|
| Query 1, e.g.: how do I schedule a COVID vaccination appointment? | 1     | 0     | 1          | 10                   |
| Query 2, e.g.: what is the price of a used FIAT UNO?              | 1     | 1     | 0          | 15                   |
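
A matrix of this shape can be sketched in a few lines. The example below assumes that each document record carries one 0/1 relevance flag per query, under the hypothetical column names `q1` and `q2`; the toy records and the helper `relevance_matrix` are illustrative only.

```python
# Toy records standing in for dataset rows; "q1"/"q2" are assumed to be
# manual 0/1 relevance judgments of each document for queries 1 and 2.
records = [
    {"_id": 0, "q1": 0, "q2": 0},
    {"_id": 1, "q1": 1, "q2": 0},
    {"_id": 2, "q1": 0, "q2": 0},
    {"_id": 3, "q1": 0, "q2": 1},
]


def relevance_matrix(records, query_columns):
    """Build one row per query: a 0/1 flag per document plus the relevant count."""
    matrix = {}
    for query in query_columns:
        flags = {f"Doc {row['_id']}": int(row[query]) for row in records}
        # The count is computed before the extra key is added to the row.
        flags["relevant_count"] = sum(flags.values())
        matrix[query] = flags
    return matrix


matrix = relevance_matrix(records, ["q1", "q2"])
for query, row in matrix.items():
    print(query, row)
```

The `relevant_count` column is the denominator later needed for recall, so it is convenient to compute it once here.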




---

## 6 - Tests/Evaluation 
---

- Submit the 2 queries to each BASE created and evaluate each result separately, i.e., compute precision and recall separately for each query against each BASE. 
- Use the formulas seen in class: precision, recall, and F-measure. 
- Include in the report one results matrix for EACH query, so that the influence of document pre-processing on the system's final result can be seen.
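
The three measures can be sketched as below, assuming the standard set-based definitions: precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant, and the balanced F-measure as their harmonic mean. The function name and the convention of returning 0.0 for an empty denominator are assumptions, not something prescribed by the assignment.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F-measure for one query.

    retrieved: IDs returned by the system; relevant: IDs judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: 4 documents returned, 2 of them among the 5 relevant ones.
p, r, f = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Working on sets of document IDs avoids the zero-division warnings that label-vector metrics can raise when a query returns nothing.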





Results matrix for Query 1
- Query: insert here the text of the query evaluated in this matrix
- Number of relevant documents: see the relevance matrix (manual assessment)

|  | **Precision** | **Recall** | **F-measure** | **No. of relevant docs returned for query 1** | **Total no. of documents returned for query 1** |
|:--------:|:------------:|:-------------:|:-------------:|:------------------------------------------------------:|:------------------------------------------------------:|
| BASE 1   |              |               |               |                                                        |                                                        |
| BASE 2   |              |               |               |                                                        |                                                        |
| BASE 3   |              |               |               |                                                        |                                                        |
| BASE 4   |              |               |               |                                                        |                                                        |
| BASE 5   |              |               |               |                                                        |                                                        |





Results matrix for Query 2
- Query: insert here the text of the query evaluated in this matrix
- Number of relevant documents: see the relevance matrix (manual assessment)

|  | **Precision** | **Recall** | **F-measure** | **No. of relevant docs returned for query 2** | **Total no. of documents returned for query 2** |
|:--------:|:------------:|:-------------:|:-------------:|:------------------------------------------------------:|:------------------------------------------------------:|
| BASE 1   |              |               |               |                                                        |                                                        |
| BASE 2   |              |               |               |                                                        |                                                        |
| BASE 3   |              |               |               |                                                        |                                                        |
| BASE 4   |              |               |               |                                                        |                                                        |
| BASE 5   |              |               |               |                                                        |                                                        |




Results matrix for the System
- The system's precision, recall, and F-measure are obtained by averaging the results of all queries for each BASE created.


|        | Mean precision | Mean recall | Mean F-measure |
|--------|----------------|-----------------|-----------------|
| BASE 1 |                |                 |                 |
| BASE 2 |                |                 |                 |
| BASE 3 |                |                 |                 |
| BASE 4 |                |                 |                 |
| BASE 5 |                |                 |                 |
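
The system-level row for one BASE is a plain average of the per-query rows. A minimal sketch, with hypothetical per-query values and a hypothetical helper name `system_scores`:

```python
from statistics import mean

# Hypothetical per-query results for one BASE: (precision, recall, f_measure).
per_query = {
    "query 1": (0.50, 0.40, 0.44),
    "query 2": (0.25, 1.00, 0.40),
}


def system_scores(per_query):
    """Average each metric over all queries (macro-average)."""
    precisions, recalls, f_measures = zip(*per_query.values())
    return mean(precisions), mean(recalls), mean(f_measures)


avg_p, avg_r, avg_f = system_scores(per_query)
print(f"precision={avg_p:.3f} recall={avg_r:.3f} f-measure={avg_f:.3f}")
```

Note that the mean F-measure here is the average of the per-query F-measures, which generally differs from the F-measure computed from the mean precision and mean recall.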


---

 

## 7 - Conclusion 
---

- Should contain a short text explaining what you concluded from the test results (tables above).

---

## Project source code

### Setup

In [1]:
#!pip install GitPython

### Libs

In [98]:
import os
import pandas as pd
from git import Repo
import gzip
from io import BytesIO

from sklearn.metrics import confusion_matrix, classification_report  # Used to compute the confusion matrix and the classification report. TODO: find another way to remove the warning without passing zero_division=1 to recall_score

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
import unidecode
import string
import re

nltk.download('punkt')
nltk.download('stopwords')

from sklearn.feature_extraction.text import TfidfVectorizer

from scipy import spatial
from sklearn.neighbors import NearestNeighbors

from sklearn.metrics import precision_score, f1_score, recall_score

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Matheus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Matheus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Collection: SciFact

In [55]:
# DATA_PATH = "dataset/scifact-generated-queries"

In [56]:
# if not os.path.isdir(DATA_PATH):
#     Repo.clone_from("https://huggingface.co/datasets/BeIR/scifact-generated-queries", "dataset/scifact-generated-queries")

In [57]:
# with gzip.open(DATA_PATH+"/train.jsonl.gz", 'rb') as f:
#     file_content = f.read()

In [58]:
# dataset = pd.read_json(BytesIO(file_content), lines=True)

In [99]:
# dataset.groupby(["title"]).size().reset_index(name='queries_count')

### Data Collection: 20News Labeled

In [100]:
DATA_PATH = "20news_labeled.csv"

In [136]:
dataset = pd.read_csv(DATA_PATH)
dataset.dropna(inplace=True)

In [137]:
dataset.head()

Unnamed: 0,_id,text,target,q1,q2
0,0,"Library of Congress to Host Dead Sea Scroll Symposium April 21-22 To: National and Assignment desks, Daybook Editor Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191 both of the Library of Congress WASHINGTON, April 19 -- A symposium on the Dead Sea Scrolls will be held at the Library of Congress on Wednesday,April 21, and Thursday, April 22. The two-day program, cosponsoredby the library and Baltimore Hebrew University, with additionalsupport from the Project Judaica Foundation, will be held in thelibrary's Mumford Room, sixth floor, Madison Building. Seating is limited, and admission to any session of the symposiummust be requested in writing (see Note A). The symposium will be held one week before the public opening of amajor exhibition, ""Scrolls from the Dead Sea: The Ancient Library ofQumran and Modern Scholarship,"" that opens at the Library of Congresson April 29. On view will be fragmentary scrolls and archaeologicalartifacts excavated at Qumran, on loan from the Israel AntiquitiesAuthority. Approximately 50 items from Library of Congress specialcollections will augment these materials. The exhibition, on view inthe Madison Gallery, through Aug. 1, is made possible by a generousgift from the Project Judaica Foundation of Washington, D.C. The Dead Sea Scrolls have been the focus of public and scholarlyinterest since 1947, when they were discovered in the desert 13 mileseast of Jerusalem. The symposium will explore the origin and meaningof the scrolls and current scholarship. Scholars from diverseacademic backgrounds and religious affiliations, will offer theirdisparate views, ensuring a lively discussion. The symposium schedule includes opening remarks on April 21, at2 p.m., by Librarian of Congress James H. Billington, and byDr. Norma Furst, president, Baltimore Hebrew University. 
Co-chairingthe symposium are Joseph Baumgarten, professor of Rabbinic Literatureand Institutions, Baltimore Hebrew University and Michael Grunberger,head, Hebraic Section, Library of Congress. Geza Vermes, professor emeritus of Jewish studies, OxfordUniversity, will give the keynote address on the current state ofscroll research, focusing on where we stand today. On the secondday, the closing address will be given by Shmaryahu Talmon, who willpropose a research agenda, picking up the theme of how the Qumranstudies might proceed. On Wednesday, April 21, other speakers will include: -- Eugene Ulrich, professor of Hebrew Scriptures, University ofNotre Dame and chief editor, Biblical Scrolls from Qumran, on ""TheBible at Qumran;"" -- Michael Stone, National Endowment for the Humanitiesdistinguished visiting professor of religious studies, University ofRichmond, on ""The Dead Sea Scrolls and the Pseudepigrapha."" -- From 5 p.m. to 6:30 p.m. a special preview of the exhibitionwill be given to symposium participants and guests. On Thursday, April 22, beginning at 9 a.m., speakers will include: -- Magen Broshi, curator, shrine of the Book, Israel Museum,Jerusalem, on ""Qumran: The Archaeological Evidence;"" -- P. Kyle McCarter, Albright professor of Biblical and ancientnear Eastern studies, The Johns Hopkins University, on ""The CopperScroll;"" -- Lawrence H. 
Schiffman, professor of Hebrew and Judaic studies,New York University, on ""The Dead Sea Scrolls and the History ofJudaism;"" and -- James VanderKam, professor of theology, University of NotreDame, on ""Messianism in the Scrolls and in Early Christianity."" The Thursday afternoon sessions, at 1:30 p.m., include: -- Devorah Dimant, associate professor of Bible and Ancient JewishThought, University of Haifa, on ""Qumran Manuscripts: Library of aJewish Community;"" -- Norman Golb, Rosenberger professor of Jewish history andcivilization, Oriental Institute, University of Chicago, on ""TheCurrent Status of the Jerusalem Origin of the Scrolls;"" -- Shmaryahu Talmon, J.L. Magnas professor emeritus of Biblicalstudies, Hebrew University, Jerusalem, on ""The Essential 'Commune ofthe Renewed Covenant': How Should Qumran Studies Proceed?"" will closethe symposium. There will be ample time for question and answer periods at theend of each session. Also on Wednesday, April 21, at 11 a.m.: The Library of Congress and The Israel Antiquities Authoritywill hold a lecture by Esther Boyd-Alkalay, consulting conservator,Israel Antiquities Authority, on ""Preserving the Dead Sea Scrolls""in the Mumford Room, LM-649, James Madison Memorial Building, TheLibrary of Congress, 101 Independence Ave., S.E., Washington, D.C. ------ NOTE A: For more information about admission to the symposium,please contact, in writing, Dr. Michael Grunberger, head, HebraicSection, African and Middle Eastern Division, Library of Congress,Washington, D.C. 20540. -30-",christian,0,0
1,1,"Anyone who dies for a ""cause"" runs the risk of dying for a lie. As forpeople being able to tell if he was a liar, well, we've had grifters andcharlatans since the beginning of civilization. If David Copperfield hadbeen the Messiah, I bet he could have found plenty of believers. Jesus was hardly the first to claim to be a faith healer, and he wasn't thefirst to be ""witnessed."" What sets him apart?Rubbish. Nations have followed crazies, liars, psychopaths, and megalomaniacs throughout history. Hitler, Tojo, Mussolini, Khomeini,Qadaffi, Stalin, Papa Doc, and Nixon come to mind...all from this century.Koresh is a non-issue.Take a discrete mathematics or formal logic course. There are flaws in yourlogic everywhere. And as I'm sure others will tell you, read the FAQ!Of course, you have to believe the Bible first. Just because something iswritten in the Bible does not mean it is true, and the age of that tome plusthe lack of external supporting evidence makes it less credible. So if youdo quote from the Bible in the future, try to back up that quote with supporting evidence. Otherwise, you will get flamed mercilessly.Just like weight lifting or guitar playing, eh? I don't know how you define the world ""total,"" but I would imagine a ""total sacrafice [sp]of everything for God's sake"" would involve more than a time commitment.You are correct about our tendency to ""box everything into time units.""Would you explain HOW one should involove God in sports and (hehehe)television?",altheism,1,0
2,2,"Woah...The context is about God's calling out a special people (the Jews) tocarry the ""promise."" To read the meaning as literal people is to miss Paul'sentire point. I'd be glad to send anyone more detailed explanations of thispassage if interested.",christian,0,0
3,3,"See, there you go again, saying that a moral act is only significantif it is ""voluntary."" Why do you think this?And anyway, humans have the ability to disregard some of their instincts.You are attaching too many things to the term ""moral,"" I think.Let's try this: is it ""good"" that animals of the same speciesdon't kill each other. Or, do you think this is right? Or do you think that animals are machines, and that nothing they dois either right nor wrong?Those weren't arbitrary killings. They were slayings related to some sortof mating ritual or whatnot.Yes it was, but I still don't understand your distinctions. Whatdo you mean by ""consider?"" Can a small child be moral? How abouta gorilla? A dolphin? A platypus? Where is the line drawn? Doesthe being need to be self aware?What *do* you call the mechanism which seems to prevent animals ofthe same species from (arbitrarily) killing each other? Don'tyou find the fact that they don't at all significant?",altheism,0,1
4,4,"The last sentence is ironic, since so many readers ofsoc.religion.christian seem to not be embarrassed by apologists such asJosh McDowell and C.S. Lewis. The above also expresses a rather odd senseof history. What makes you think the masses in Aquinas' day, who weremostly illiterate, knew any more about rhetoric and logic than most peopletoday? If writings from the period seem elevated consider that only thecream of the crop, so to speak, could read and write. If everyone inthe medieval period ""knew the rules"" it was a matter of uncriticallyaccepting what they were told.Bill Mayne",christian,0,0


### Pre-processing Data

In this step we create 5 pre-processed versions of the data:

v1 - Tokenization only; \
v2 - Stopword removal only; \
v3 - Stemming only; \
v4 - Stopword removal and stemming; \
v5 - Stopword removal and word expansion with synonyms.


In [184]:
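def remove_accent(text):
    # Transliterate accented characters to their closest ASCII equivalent.
    return unidecode.unidecode(text)

def tokenize(text):
    return word_tokenize(text, language="english")

def pre_process(text, rmv_sw, stem):
    # Normalize: lowercase, strip accents, then split into tokens.
    text_lower = text.lower()
    text_rmv_accent = remove_accent(text_lower)
    text_tokenized = tokenize(text_rmv_accent)
    # Drop every token that contains a digit.
    text_rmv_alphanum = list(filter(lambda token: not re.search(r'\d', token), text_tokenized))
    # Drop tokens that are substrings of string.punctuation (e.g. "," or "(");
    # most multi-character tokens such as "--" or "``" are kept.
    text_final = list(filter(lambda token: token not in string.punctuation, text_rmv_alphanum))
    
    if rmv_sw:
        text_final = list(filter(lambda token: token not in STOPWORDS, text_final))
    
    if stem:
        stemmer = SnowballStemmer("english")
        text_final = list(map(lambda token: stemmer.stem(token), text_final))
        
    return text_final

# Stopword list normalized the same way as the text (accents removed).
STOPWORDS = set(map(lambda token: remove_accent(token), stopwords.words("english")))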

In [185]:
pre_process_v1 = lambda text: pre_process(text, False, False)
pre_process_v2 = lambda text: pre_process(text, True, False)
pre_process_v3 = lambda text: pre_process(text, False, True)
pre_process_v4 = lambda text: pre_process(text, True, True)
#pre_process_v5 = lambda text: pre_process(text, True, False) # TODO: for this case I checked this reference -> https://www.holisticseo.digital/python-seo/nltk/wordnet.

In [186]:
dataset_pcrs = dataset.copy().drop_duplicates().head(50)

In [187]:
dataset_pcrs["v1"] = dataset_pcrs["text"].apply(lambda text: pre_process_v1(text))

In [188]:
dataset_pcrs["v2"] = dataset_pcrs["text"].apply(lambda text: pre_process_v2(text))

In [189]:
dataset_pcrs["v3"] = dataset_pcrs["text"].apply(lambda text: pre_process_v3(text))

In [190]:
dataset_pcrs["v4"] = dataset_pcrs["text"].apply(lambda text: pre_process_v4(text))

In [191]:
# @TODO: Create the vocabulary expansion pipeline and connect it to pre_process
# dataset_pcrs["v5"] = dataset_pcrs["text"].apply(lambda text: pre_process_v5(text))
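
The pending v5 step could be sketched as a token-level expansion applied after stopword removal. The example below is only an illustration under stated assumptions: `SYNONYMS` is a hypothetical hand-written dictionary (a real run could derive it from a lexical resource such as WordNet), and `expand_with_synonyms` is not part of the notebook.

```python
# Hypothetical, hand-written synonym dictionary; a real run could derive it
# from a lexical resource such as WordNet instead.
SYNONYMS = {
    "symposium": ["conference"],
    "scroll": ["manuscript"],
}


def expand_with_synonyms(tokens, synonyms):
    """Return the tokens with the synonyms of each known token inserted after it."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(synonyms.get(token, []))
    return expanded


tokens = ["library", "host", "scroll", "symposium"]
print(expand_with_synonyms(tokens, SYNONYMS))
```

Appending synonyms to the document tokens increases the chance that a query using a different word still matches, at the cost of a larger vocabulary.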

In [192]:
# Be careful with dataset_pcrs Len
#HTML(dataset_pcrs.head(3).to_html())

# @TODO: add "()" to STOPWORDS. A better strategy is to create a dict of symbols to filter out and add it to this pipeline.
dataset_pcrs.head(2)

Unnamed: 0,_id,text,target,q1,q2,v1,v2,v3,v4
0,0,"Library of Congress to Host Dead Sea Scroll Symposium April 21-22 To: National and Assignment desks, Daybook Editor Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191 both of the Library of Congress WASHINGTON, April 19 -- A symposium on the Dead Sea Scrolls will be held at the Library of Congress on Wednesday,April 21, and Thursday, April 22. The two-day program, cosponsoredby the library and Baltimore Hebrew University, with additionalsupport from the Project Judaica Foundation, will be held in thelibrary's Mumford Room, sixth floor, Madison Building. Seating is limited, and admission to any session of the symposiummust be requested in writing (see Note A). The symposium will be held one week before the public opening of amajor exhibition, ""Scrolls from the Dead Sea: The Ancient Library ofQumran and Modern Scholarship,"" that opens at the Library of Congresson April 29. On view will be fragmentary scrolls and archaeologicalartifacts excavated at Qumran, on loan from the Israel AntiquitiesAuthority. Approximately 50 items from Library of Congress specialcollections will augment these materials. The exhibition, on view inthe Madison Gallery, through Aug. 1, is made possible by a generousgift from the Project Judaica Foundation of Washington, D.C. The Dead Sea Scrolls have been the focus of public and scholarlyinterest since 1947, when they were discovered in the desert 13 mileseast of Jerusalem. The symposium will explore the origin and meaningof the scrolls and current scholarship. Scholars from diverseacademic backgrounds and religious affiliations, will offer theirdisparate views, ensuring a lively discussion. The symposium schedule includes opening remarks on April 21, at2 p.m., by Librarian of Congress James H. Billington, and byDr. Norma Furst, president, Baltimore Hebrew University. 
Co-chairingthe symposium are Joseph Baumgarten, professor of Rabbinic Literatureand Institutions, Baltimore Hebrew University and Michael Grunberger,head, Hebraic Section, Library of Congress. Geza Vermes, professor emeritus of Jewish studies, OxfordUniversity, will give the keynote address on the current state ofscroll research, focusing on where we stand today. On the secondday, the closing address will be given by Shmaryahu Talmon, who willpropose a research agenda, picking up the theme of how the Qumranstudies might proceed. On Wednesday, April 21, other speakers will include: -- Eugene Ulrich, professor of Hebrew Scriptures, University ofNotre Dame and chief editor, Biblical Scrolls from Qumran, on ""TheBible at Qumran;"" -- Michael Stone, National Endowment for the Humanitiesdistinguished visiting professor of religious studies, University ofRichmond, on ""The Dead Sea Scrolls and the Pseudepigrapha."" -- From 5 p.m. to 6:30 p.m. a special preview of the exhibitionwill be given to symposium participants and guests. On Thursday, April 22, beginning at 9 a.m., speakers will include: -- Magen Broshi, curator, shrine of the Book, Israel Museum,Jerusalem, on ""Qumran: The Archaeological Evidence;"" -- P. Kyle McCarter, Albright professor of Biblical and ancientnear Eastern studies, The Johns Hopkins University, on ""The CopperScroll;"" -- Lawrence H. 
Schiffman, professor of Hebrew and Judaic studies,New York University, on ""The Dead Sea Scrolls and the History ofJudaism;"" and -- James VanderKam, professor of theology, University of NotreDame, on ""Messianism in the Scrolls and in Early Christianity."" The Thursday afternoon sessions, at 1:30 p.m., include: -- Devorah Dimant, associate professor of Bible and Ancient JewishThought, University of Haifa, on ""Qumran Manuscripts: Library of aJewish Community;"" -- Norman Golb, Rosenberger professor of Jewish history andcivilization, Oriental Institute, University of Chicago, on ""TheCurrent Status of the Jerusalem Origin of the Scrolls;"" -- Shmaryahu Talmon, J.L. Magnas professor emeritus of Biblicalstudies, Hebrew University, Jerusalem, on ""The Essential 'Commune ofthe Renewed Covenant': How Should Qumran Studies Proceed?"" will closethe symposium. There will be ample time for question and answer periods at theend of each session. Also on Wednesday, April 21, at 11 a.m.: The Library of Congress and The Israel Antiquities Authoritywill hold a lecture by Esther Boyd-Alkalay, consulting conservator,Israel Antiquities Authority, on ""Preserving the Dead Sea Scrolls""in the Mumford Room, LM-649, James Madison Memorial Building, TheLibrary of Congress, 101 Independence Ave., S.E., Washington, D.C. ------ NOTE A: For more information about admission to the symposium,please contact, in writing, Dr. Michael Grunberger, head, HebraicSection, African and Middle Eastern Division, Library of Congress,Washington, D.C. 20540. 
-30-",christian,0,0,"[library, of, congress, to, host, dead, sea, scroll, symposium, april, to, national, and, assignment, desks, daybook, editor, contact, john, sullivan, or, lucy, suddreth, both, of, the, library, of, congress, washington, april, --, a, symposium, on, the, dead, sea, scrolls, will, be, held, at, the, library, of, congress, on, wednesday, april, and, thursday, april, the, two-day, program, cosponsoredby, the, library, and, baltimore, hebrew, university, with, additionalsupport, from, the, project, judaica, foundation, will, be, held, in, thelibrary, 's, mumford, room, sixth, floor, madison, building, seating, is, limited, and, admission, to, any, session, of, the, symposiummust, be, requested, in, writing, see, note, a, ...]","[library, congress, host, dead, sea, scroll, symposium, april, national, assignment, desks, daybook, editor, contact, john, sullivan, lucy, suddreth, library, congress, washington, april, --, symposium, dead, sea, scrolls, held, library, congress, wednesday, april, thursday, april, two-day, program, cosponsoredby, library, baltimore, hebrew, university, additionalsupport, project, judaica, foundation, held, thelibrary, 's, mumford, room, sixth, floor, madison, building, seating, limited, admission, session, symposiummust, requested, writing, see, note, symposium, held, one, week, public, opening, amajor, exhibition, ``, scrolls, dead, sea, ancient, library, ofqumran, modern, scholarship, '', opens, library, congresson, april, view, fragmentary, scrolls, archaeologicalartifacts, excavated, qumran, loan, israel, antiquitiesauthority, approximately, items, library, congress, specialcollections, augment, ...]","[librari, of, congress, to, host, dead, sea, scroll, symposium, april, to, nation, and, assign, desk, daybook, editor, contact, john, sullivan, or, luci, suddreth, both, of, the, librari, of, congress, washington, april, --, a, symposium, on, the, dead, sea, scroll, will, be, held, at, the, librari, of, congress, on, 
wednesday, april, and, thursday, april, the, two-day, program, cosponsoredbi, the, librari, and, baltimor, hebrew, univers, with, additionalsupport, from, the, project, judaica, foundat, will, be, held, in, thelibrari, 's, mumford, room, sixth, floor, madison, build, seat, is, limit, and, admiss, to, ani, session, of, the, symposiummust, be, request, in, write, see, note, a, ...]","[librari, congress, host, dead, sea, scroll, symposium, april, nation, assign, desk, daybook, editor, contact, john, sullivan, luci, suddreth, librari, congress, washington, april, --, symposium, dead, sea, scroll, held, librari, congress, wednesday, april, thursday, april, two-day, program, cosponsoredbi, librari, baltimor, hebrew, univers, additionalsupport, project, judaica, foundat, held, thelibrari, 's, mumford, room, sixth, floor, madison, build, seat, limit, admiss, session, symposiummust, request, write, see, note, symposium, held, one, week, public, open, amajor, exhibit, ``, scroll, dead, sea, ancient, librari, ofqumran, modern, scholarship, '', open, librari, congresson, april, view, fragmentari, scroll, archaeologicalartifact, excav, qumran, loan, israel, antiquitiesauthor, approxim, item, librari, congress, specialcollect, augment, ...]"
1,1,"Anyone who dies for a ""cause"" runs the risk of dying for a lie. As forpeople being able to tell if he was a liar, well, we've had grifters andcharlatans since the beginning of civilization. If David Copperfield hadbeen the Messiah, I bet he could have found plenty of believers. Jesus was hardly the first to claim to be a faith healer, and he wasn't thefirst to be ""witnessed."" What sets him apart?Rubbish. Nations have followed crazies, liars, psychopaths, and megalomaniacs throughout history. Hitler, Tojo, Mussolini, Khomeini,Qadaffi, Stalin, Papa Doc, and Nixon come to mind...all from this century.Koresh is a non-issue.Take a discrete mathematics or formal logic course. There are flaws in yourlogic everywhere. And as I'm sure others will tell you, read the FAQ!Of course, you have to believe the Bible first. Just because something iswritten in the Bible does not mean it is true, and the age of that tome plusthe lack of external supporting evidence makes it less credible. So if youdo quote from the Bible in the future, try to back up that quote with supporting evidence. Otherwise, you will get flamed mercilessly.Just like weight lifting or guitar playing, eh? 
I don't know how you define the world ""total,"" but I would imagine a ""total sacrafice [sp]of everything for God's sake"" would involve more than a time commitment.You are correct about our tendency to ""box everything into time units.""Would you explain HOW one should involove God in sports and (hehehe)television?",altheism,1,0,"[anyone, who, dies, for, a, ``, cause, '', runs, the, risk, of, dying, for, a, lie, as, forpeople, being, able, to, tell, if, he, was, a, liar, well, we, 've, had, grifters, andcharlatans, since, the, beginning, of, civilization, if, david, copperfield, hadbeen, the, messiah, i, bet, he, could, have, found, plenty, of, believers, jesus, was, hardly, the, first, to, claim, to, be, a, faith, healer, and, he, was, n't, thefirst, to, be, ``, witnessed, '', what, sets, him, apart, rubbish, nations, have, followed, crazies, liars, psychopaths, and, megalomaniacs, throughout, history, hitler, tojo, mussolini, khomeini, qadaffi, stalin, papa, doc, and, nixon, ...]","[anyone, dies, ``, cause, '', runs, risk, dying, lie, forpeople, able, tell, liar, well, 've, grifters, andcharlatans, since, beginning, civilization, david, copperfield, hadbeen, messiah, bet, could, found, plenty, believers, jesus, hardly, first, claim, faith, healer, n't, thefirst, ``, witnessed, '', sets, apart, rubbish, nations, followed, crazies, liars, psychopaths, megalomaniacs, throughout, history, hitler, tojo, mussolini, khomeini, qadaffi, stalin, papa, doc, nixon, come, mind, ..., century.koresh, non-issue.take, discrete, mathematics, formal, logic, course, flaws, yourlogic, everywhere, 'm, sure, others, tell, read, faq, course, believe, bible, first, something, iswritten, bible, mean, true, age, tome, plusthe, lack, external, supporting, evidence, makes, less, credible, youdo, quote, ...]","[anyon, who, die, for, a, ``, caus, '', run, the, risk, of, die, for, a, lie, as, forpeopl, be, abl, to, tell, if, he, was, a, liar, well, we, ve, had, grifter, andcharlatan, sinc, 
the, begin, of, civil, if, david, copperfield, hadbeen, the, messiah, i, bet, he, could, have, found, plenti, of, believ, jesus, was, hard, the, first, to, claim, to, be, a, faith, healer, and, he, was, n't, thefirst, to, be, ``, wit, '', what, set, him, apart, rubbish, nation, have, follow, crazi, liar, psychopath, and, megalomaniac, throughout, histori, hitler, tojo, mussolini, khomeini, qadaffi, stalin, papa, doc, and, nixon, ...]","[anyon, die, ``, caus, '', run, risk, die, lie, forpeopl, abl, tell, liar, well, ve, grifter, andcharlatan, sinc, begin, civil, david, copperfield, hadbeen, messiah, bet, could, found, plenti, believ, jesus, hard, first, claim, faith, healer, n't, thefirst, ``, wit, '', set, apart, rubbish, nation, follow, crazi, liar, psychopath, megalomaniac, throughout, histori, hitler, tojo, mussolini, khomeini, qadaffi, stalin, papa, doc, nixon, come, mind, ..., century.koresh, non-issue.tak, discret, mathemat, formal, logic, cours, flaw, yourlog, everywher, 'm, sure, other, tell, read, faq, cours, believ, bibl, first, someth, iswritten, bibl, mean, true, age, tome, plusth, lack, extern, support, evid, make, less, credibl, youdo, quot, ...]"


### Vectorizing Data

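This step defines up to five (vectorizer, corpus_vectorized) pairs, one per preprocessing variant. Only v1 to v4 are built here; the v5 pair is commented out.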

In [195]:
def get_documments(df_column):
    return list(map(lambda tokenized_text: " ".join(tokenized_text), df_column))

In [196]:
corpus_v1 = get_documments(dataset_pcrs["v1"])
corpus_v2 = get_documments(dataset_pcrs["v2"])
corpus_v3 = get_documments(dataset_pcrs["v3"])
corpus_v4 = get_documments(dataset_pcrs["v4"])
# corpus_v5 = get_documments(dataset_pcrs["v5"])

In [197]:
vectorizer_v1 = TfidfVectorizer()
corpus_v1_vct = vectorizer_v1.fit_transform(corpus_v1)

vectorizer_v2 = TfidfVectorizer()
corpus_v2_vct = vectorizer_v2.fit_transform(corpus_v2)

vectorizer_v3 = TfidfVectorizer()
corpus_v3_vct = vectorizer_v3.fit_transform(corpus_v3)

vectorizer_v4 = TfidfVectorizer()
corpus_v4_vct = vectorizer_v4.fit_transform(corpus_v4)

# vectorizer_v5 = TfidfVectorizer()
# corpus_v5_vct = vectorizer_v5.fit_transform(corpus_v5)
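
The vectorizing step can be tried on a tiny example. The sketch below is only an illustration: the token lists and the `toy_*` names are hypothetical and not part of the dataset. It assumes the same join-then-`fit_transform` flow used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-tokenized documents (hypothetical data, not taken from the dataset).
toy_tokens = [
    ["moral", "act", "voluntary"],
    ["animals", "instinct", "moral"],
    ["bible", "evidence", "logic"],
]

# Same joining step as get_documments: one space-separated string per document.
toy_corpus = [" ".join(tokens) for tokens in toy_tokens]

# Fit the vocabulary and IDF weights, then encode each document as a TF-IDF row.
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# One row per document, one column per distinct term.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

Each preprocessing variant gets its own vectorizer because the vocabulary (and therefore the column layout) depends on the tokens that variant produces.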

### Information Retrieval Function

In [198]:
# matcher: kd -> KDTree, nn -> Nearest Neighbor (ball tree), bf -> Brute Force (cosine)
# n -> number of documents to return; use -1 to return all documents
def info_retrieval(pre_process, corpus, vectorizer, query, n=2, matcher="kd"):
    
    query = " ".join(pre_process(query))
    query_vct = vectorizer.transform([query])
    
    if n == -1:
        
        n = corpus.shape[0]
    
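    if matcher == "kd":
        
        # Build the tree from the corpus passed in (not the hard-coded corpus_v1_vct),
        # so this matcher works for every preprocessing variant.
        kdtree = scipy.spatial.KDTree(corpus.toarray())
        
        # p is the Minkowski p-norm.
        # p = 1, Manhattan Distance
        # p = 2, Euclidean Distance
        # p = +inf, Chebyshev Distance
        distance, index = kdtree.query(query_vct.toarray(), n, p=1)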
        
    elif matcher == "nn":
        
        nbrs = NearestNeighbors(n_neighbors=n, algorithm="ball_tree").fit(corpus)
        distance, index = nbrs.kneighbors(query_vct)
        
    elif matcher == "bf":

        nbrs = NearestNeighbors(n_neighbors=n, algorithm="brute", metric="cosine").fit(corpus)
        distance, index = nbrs.kneighbors(query_vct)
        
    else:
        
        return "Matcher strategy not available. Set kd to KDTree, nn to Nearest Neighbor and bf to Brute Force"
    
    return list(zip(distance.tolist()[0], index.tolist()[0]))
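
The brute-force branch (`matcher="bf"`) can be exercised in isolation on a toy corpus. This is a minimal sketch, assuming scikit-learn's `NearestNeighbors` with `metric="cosine"` as in the function above; the documents, the query, and the `toy_*` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical, already-preprocessed documents.
toy_corpus = [
    "moral act voluntary",
    "animals instinct moral",
    "bible evidence logic",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# The query must be encoded with the vectorizer that was fitted on the corpus.
query_vct = toy_vectorizer.transform(["moral instinct"])

# Brute-force search over all documents, ranked by cosine distance.
nbrs = NearestNeighbors(
    n_neighbors=toy_matrix.shape[0], algorithm="brute", metric="cosine"
).fit(toy_matrix)
distance, index = nbrs.kneighbors(query_vct)

# Same output shape as info_retrieval: a list of (distance, document index) pairs.
ranking = list(zip(distance.tolist()[0], index.tolist()[0]))
print(ranking)
```

The document sharing both query terms comes first, and the document with no query term in common ends up at cosine distance 1.0.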

In [199]:
ir_v1 = lambda query: info_retrieval(pre_process_v1, corpus_v1_vct, vectorizer_v1, query, n=-1, matcher = "bf")
ir_v2 = lambda query: info_retrieval(pre_process_v2, corpus_v2_vct, vectorizer_v2, query, n=-1, matcher = "bf")
ir_v3 = lambda query: info_retrieval(pre_process_v3, corpus_v3_vct, vectorizer_v3, query, n=-1, matcher = "bf")
ir_v4 = lambda query: info_retrieval(pre_process_v4, corpus_v4_vct, vectorizer_v4, query, n=-1, matcher = "bf")
# ir_v5 = lambda query: info_retrieval(pre_process_v5, corpus_v5_vct, vectorizer_v5, query, n=-1, matcher = "bf")

### Snippet Test

In [203]:
query_1 = "Moral and ethics in christianism"
query_2 = 

In [204]:
result = sorted(ir_v1(query_1), key=lambda i:i[0])

In [205]:
result

[(0.8390608686972042, 3),
 (0.8991347213458132, 22),
 (0.9260732872630356, 5),
 (0.9310174352872853, 17),
 (0.931373084223062, 14),
 (0.9369893113510906, 24),
 (0.9393816132858184, 10),
 (0.9398635625822763, 26),
 (0.9434762214855852, 1),
 (0.9465797970445144, 0),
 (0.9479106325345472, 16),
 (0.9576016806628308, 18),
 (0.9576714605795643, 6),
 (0.9595431028822449, 25),
 (0.9596341593129576, 13),
 (0.9615343319101327, 4),
 (0.9635281658333272, 21),
 (0.9638897268442101, 23),
 (0.9650987409714858, 20),
 (0.974025461722243, 15),
 (0.9755987102116255, 27),
 (0.9761505116142708, 9),
 (0.9762709742374461, 11),
 (0.9812412103759419, 19),
 (0.9888929260057168, 7),
 (1.0, 8),
 (1.0, 2),
 (1.0, 12)]
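
Each pair above is (cosine distance, document index). With scikit-learn's `metric="cosine"`, the distance is 1 minus the cosine similarity, so a distance of 1.0 (documents 8, 2 and 12) means the document shares no vocabulary term with the query. The sketch below converts a few of the pairs into similarities; `top_k_by_similarity` is a hypothetical helper, not part of the notebook.

```python
# First three and last entries copied from the output above:
# (cosine distance, document index).
result_sample = [
    (0.8390608686972042, 3),
    (0.8991347213458132, 22),
    (0.9260732872630356, 5),
    (1.0, 12),
]


def top_k_by_similarity(result, k=3):
    """Hypothetical helper: keep the k closest documents as (index, similarity)."""
    ranked = sorted(result, key=lambda pair: pair[0])
    return [(doc_index, 1.0 - dist) for dist, doc_index in ranked[:k]]


for doc_index, similarity in top_k_by_similarity(result_sample):
    print(doc_index, round(similarity, 4))
```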

In [208]:
# Nearest document to the query (snippet)
# pd.set_option('display.max_colwidth', None)

result_nn = result[0][1]
dataset_pcrs[["_id", "text"]].iloc[[result_nn]]

Unnamed: 0,_id,text
3,3,"See, there you go again, saying that a moral act is only significantif it is ""voluntary."" Why do you think this?And anyway, humans have the ability to disregard some of their instincts.You are attaching too many things to the term ""moral,"" I think.Let's try this: is it ""good"" that animals of the same speciesdon't kill each other. Or, do you think this is right? Or do you think that animals are machines, and that nothing they dois either right nor wrong?Those weren't arbitrary killings. They were slayings related to some sortof mating ritual or whatnot.Yes it was, but I still don't understand your distinctions. Whatdo you mean by ""consider?"" Can a small child be moral? How abouta gorilla? A dolphin? A platypus? Where is the line drawn? Doesthe being need to be self aware?What *do* you call the mechanism which seems to prevent animals ofthe same species from (arbitrarily) killing each other? Don'tyou find the fact that they don't at all significant?"


### Evaluation

In [236]:
# Build the query x document relevance matrix (rows: queries, columns: documents)
docs = ["doc_%d"%i for i in range(len(dataset))]
_data = list(zip(dataset.q1.tolist(), dataset.q2.tolist()))
_dict = {docs[i]: _data[i] for i in range(len(dataset))}
query_relevance = pd.DataFrame(data=_dict, index = ["Q%d"%(i+1) for i in range(len(_data[0]))])