# Report and Project - Document Indexing and Search System

## 1 - Team information
---
IN1152 - Intelligent Information Retrieval - 2022.2

**Team**: Matheus Rodrigues de Souza Félix (matheusrdgsf@gmail.com) and Rodrigo Souza de Melo (rsm5@cin.ufpe.br)

**Project "Document Indexing and Search System"**

Professor Flavia de Almeida Barros (fab@cin.ufpe.br)

- This report presents the sections requested in Assignment 1 - Document Indexing and Search System, followed by the project's source code.

---

## 2 - Description of the documents (corpus) to be indexed by the system
---

 - 2 themes/topics of the documents in your collection
 - Show 2 or 3 example documents from the corpus in the report
 
 - BEIR [1] is a heterogeneous benchmark containing diverse IR tasks. It also provides a common, easy-to-use framework for evaluating NLP-based retrieval models. Among the BEIR options, SciFact was selected.
 - SciFact [4]: Due to the rapid growth of the scientific literature, there is a need for automated systems that help researchers and the public assess the veracity of scientific claims. To facilitate the development of systems for this task, SciFact is used: a dataset of 1.4K expert-written claims, paired with evidence-containing abstracts annotated with veracity labels and rationales.
 - Leaderboard [4]: Evaluates model submissions on the SciFact dataset, with the goal of developing automated systems for scientific claim verification.
 - Dataset [2]: Composed of the elements below.
   - corpus: the title and the text of each document
   - query: the query
   - qrels: the relevance judgments for the queries
   
 - Example of the data structure with 2 queries, 2 documents and the relevance judgments for each query [2].

~~~python
corpus = {
    "doc1" : {
        "title": "Albert Einstein", 
        "text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, \
                 one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
                 its influence on the philosophy of science. He is best known to the general public for his mass–energy \
                 equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
                 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
                 of the photoelectric effect', a pivotal step in the development of quantum theory."
        },
    "doc2" : {
        "title": "", # Keep title an empty string if not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
                 malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\
                 with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1" : "Who developed the mass-energy equivalence formula?",
    "q2" : "Which beer is brewed with a large proportion of wheat?"
}

qrels = {
    "q1" : {"doc1": 1},
    "q2" : {"doc2": 1},
}
~~~
 
 References: 
 * [1] https://github.com/beir-cellar/beir
 * [2] https://huggingface.co/datasets/BeIR/scifact-generated-queries
 * [3] https://github.com/allenai/scifact
 * [4] https://leaderboard.allenai.org/scifact/submissions/about
              

---

## 3 - System architecture
---

- Provide a brief description of the processing stages of the system built with the chosen tool. State which IR model your system implements, which formula is used to compute the weights, and which ranking function is used (see the lectures on IR models).
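
The report does not state the model explicitly, but the source code further down imports `TfidfVectorizer` and `scipy.spatial`, which suggests a vector space model with TF-IDF weights and cosine-similarity ranking. Under that assumption, a minimal pure-Python sketch could look like the following. The function names and the smoothed IDF variant are illustrative choices, not the project's actual implementation.

```python
import math
from collections import Counter


def build_idf(tokenized_docs):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Term weight = raw term frequency * IDF; terms unseen in the corpus are ignored."""
    counts = Counter(tokens)
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, tokenized_docs):
    """Return (doc_index, score) pairs sorted by decreasing cosine similarity."""
    idf = build_idf(tokenized_docs)
    query_vec = tfidf_vector(query_tokens, idf)
    scores = [
        (i, cosine(query_vec, tfidf_vector(doc, idf)))
        for i, doc in enumerate(tokenized_docs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    ["wheat", "beer", "is", "brewed", "with", "wheat"],
    ["einstein", "developed", "the", "theory", "of", "relativity"],
]
ranking = rank(["wheat", "beer"], docs)
```

Here `ranking` lists the wheat-beer document first, and the unrelated document gets a score of zero because it shares no term with the query.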

---

## 4 - Creation of the indexed document bases
---

- Document preparation & indexing: The system must automatically create five indexed BASES from the original document collection. Each BASE (in Solr, these are COREs) must use a different data preparation (pre-processing) pipeline. The goal is to find out which pre-processing configuration works best for your case.

   - BASE 1: original documents, with no extra pre-processing (tokenization only);
   - BASE 2: stopword removal only (use a stopword filter);
   - BASE 3: stemming only, without removing stopwords (do not use a stopword filter);
   - BASE 4: stopword removal (use a stopword filter) plus stemming;
   - BASE 5: stopword removal (use a stopword filter), no stemming, plus a synonym dictionary.
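
The five configurations can be expressed as flags over one shared pipeline. The sketch below is illustrative only: `BASE_CONFIGS`, `TOY_STOPWORDS`, `TOY_SYNONYMS`, `toy_stem` and `prepare` are hypothetical names, and the tiny stopword set, synonym dictionary and suffix-stripping stemmer are stand-ins for the real resources a system would use.

```python
# Hypothetical stand-ins; a real system would use a full stopword list,
# a proper stemmer (e.g. Snowball) and a curated synonym dictionary.
TOY_STOPWORDS = {"the", "of", "is", "a", "with"}
TOY_SYNONYMS = {"cars": ["automobiles"]}

BASE_CONFIGS = {
    "BASE 1": {"stopwords": False, "stemming": False, "synonyms": False},
    "BASE 2": {"stopwords": True, "stemming": False, "synonyms": False},
    "BASE 3": {"stopwords": False, "stemming": True, "synonyms": False},
    "BASE 4": {"stopwords": True, "stemming": True, "synonyms": False},
    "BASE 5": {"stopwords": True, "stemming": False, "synonyms": True},
}


def toy_stem(token):
    """Crude suffix stripping, used only to keep the sketch dependency-free."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def prepare(text, config):
    """Tokenize on whitespace, then apply the steps enabled in `config`."""
    tokens = text.lower().split()
    if config["stopwords"]:
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    if config["stemming"]:
        tokens = [toy_stem(t) for t in tokens]
    if config["synonyms"]:
        expanded = []
        for t in tokens:
            expanded.append(t)
            expanded.extend(TOY_SYNONYMS.get(t, []))
        tokens = expanded
    return tokens


indexed = {
    name: prepare("The prices of used cars", cfg)
    for name, cfg in BASE_CONFIGS.items()
}
```

Keeping the configuration in a single dictionary makes it easy to build all five bases in one loop and to compare them later with the same evaluation code.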


---


## 5 - Creation of the queries and of the relevance matrix
---

- This step was described in more detail in the corresponding assignment posted on Classroom. 
  ===> QUESTION: I COULD NOT FIND THE DETAILS ON CLASSROOM.
  
- Show part of the matrix here; the 5 columns exemplified below are enough (including the final column with the number of relevant documents). 

Example of a “Queries x Documents” relevance matrix.


|                                                           | Doc 1 | Doc 2 | ... Doc 20 | No. of relevant docs |
|-----------------------------------------------------------|-------|-------|------------|----------------------|
| Query 1, e.g.: How do I schedule a COVID vaccination?     | 1     | 0     | 1          | 10                   |
| Query 2, e.g.: What is the price of a used FIAT UNO car?  | 1     | 1     | 0          | 15                   |
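
A matrix of this shape can be derived from relevance judgments stored in the `qrels` format shown in Section 2. The sketch below is one possible way to do it; `relevance_matrix` is a hypothetical helper, and it assumes that any judgment greater than zero counts as relevant.

```python
def relevance_matrix(qrels, doc_ids):
    """Build one row per query: a 0/1 flag per document plus the relevant count."""
    rows = {}
    for query_id, judged in qrels.items():
        # Assumption: any judgment above zero means "relevant".
        flags = [1 if judged.get(doc_id, 0) > 0 else 0 for doc_id in doc_ids]
        rows[query_id] = flags + [sum(flags)]
    return rows


qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}
matrix = relevance_matrix(qrels, ["doc1", "doc2"])
```

Each row holds the per-document flags in the order of `doc_ids`, followed by the total number of relevant documents for that query.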




---

## 6 - Tests/Evaluation
---

- Submit the 2 queries to each BASE created and evaluate each result separately, i.e., compute the precision and the recall of each query against each BASE separately. 
- Use the formulas seen in class: precision, recall and F-measure. 
- Include in the report one results matrix for EACH query, so that the influence of document pre-processing on the final result of the system can be seen.
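
For reference, the three measures can be computed per query from the set of retrieved documents and the set of relevant documents. This is a minimal sketch: `evaluate` is a hypothetical helper, and returning 0.0 when a denominator is empty is an assumed convention (the source code mentions `zero_division=1` for scikit-learn, which is a different choice).

```python
def evaluate(retrieved, relevant):
    """Precision, recall and F-measure (harmonic mean) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# 4 documents returned, 5 relevant overall, 2 of the returned ones are relevant.
precision, recall, f_measure = evaluate(
    retrieved=[1, 2, 3, 4],
    relevant=[2, 4, 6, 8, 10],
)
```

In this example the precision is 2/4, the recall is 2/5 and the F-measure is their harmonic mean, 4/9.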





Results matrix for Query 1
- Query: include here the text of the query evaluated in this matrix
- No. of relevant documents: see the relevance matrix (manual assessment)

|  | **Precision** | **Recall** | **F-measure** | **No. of relevant docs returned by query 1** | **Total no. of documents returned by query 1** |
|:--------:|:------------:|:-------------:|:-------------:|:------------------------------------------------------:|:------------------------------------------------------:|
| BASE 1   |              |               |               |                                                        |                                                        |
| BASE 2   |              |               |               |                                                        |                                                        |
| BASE 3   |              |               |               |                                                        |                                                        |
| BASE 4   |              |               |               |                                                        |                                                        |
| BASE 5   |              |               |               |                                                        |                                                        |





Results matrix for Query 2
- Query: include here the text of the query evaluated in this matrix
- No. of relevant documents: see the relevance matrix (manual assessment)

|  | **Precision** | **Recall** | **F-measure** | **No. of relevant docs returned by query 2** | **Total no. of documents returned by query 2** |
|:--------:|:------------:|:-------------:|:-------------:|:------------------------------------------------------:|:------------------------------------------------------:|
| BASE 1   |              |               |               |                                                        |                                                        |
| BASE 2   |              |               |               |                                                        |                                                        |
| BASE 3   |              |               |               |                                                        |                                                        |
| BASE 4   |              |               |               |                                                        |                                                        |
| BASE 5   |              |               |               |                                                        |                                                        |




Results matrix for the System
- The system's precision, recall and F-measure are obtained by averaging the results of the individual queries for each BASE created.


|        | Mean precision | Mean recall | Mean F-measure |
|--------|----------------|-----------------|-----------------|
| BASE 1 |                |                 |                 |
| BASE 2 |                |                 |                 |
| BASE 3 |                |                 |                 |
| BASE 4 |                |                 |                 |
| BASE 5 |                |                 |                 |
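
The averaging step for the system table can be sketched as follows. `system_averages` is a hypothetical helper, and the numbers in `per_query` are made up solely to show the shape of the input and output; they are not results of this project.

```python
def system_averages(per_query):
    """Average (precision, recall, F-measure) over the queries of each base."""
    averages = {}
    for base, results in per_query.items():
        n_queries = len(results)
        # zip(*results) groups all precisions, all recalls and all F-measures.
        averages[base] = tuple(sum(values) / n_queries for values in zip(*results))
    return averages


# Illustrative values only: one (precision, recall, F-measure) tuple per query.
per_query = {
    "BASE 1": [(0.5, 0.4, 0.44), (0.7, 0.6, 0.65)],
}
averages = system_averages(per_query)
```

Note that this averages the per-query F-measures directly, as the text above describes, rather than recomputing F from the mean precision and mean recall.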


---

 

## 7 - Conclusion 
---

- Must contain a short text explaining what you concluded from the test results (tables above).

---

## Project source code

### Settings

In [1]:
#!pip install GitPython

### Libs

In [98]:
import os
import pandas as pd
from git import Repo
import gzip
from io import BytesIO

from sklearn.metrics import confusion_matrix, classification_report  # Used to compute the confusion matrix and the classification report. TODO: find another way to silence the warning without passing zero_division=1 to recall_score

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
import unidecode

nltk.download('punkt')
nltk.download('stopwords')

from sklearn.feature_extraction.text import TfidfVectorizer

from scipy import spatial
from sklearn.neighbors import NearestNeighbors

from sklearn.metrics import precision_score, f1_score, recall_score

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Matheus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Matheus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Collection: SciFact

In [55]:
# DATA_PATH = "dataset/scifact-generated-queries"

In [56]:
# if not os.path.isdir(DATA_PATH):
#     Repo.clone_from("https://huggingface.co/datasets/BeIR/scifact-generated-queries", "dataset/scifact-generated-queries")

In [57]:
# with gzip.open(DATA_PATH+"/train.jsonl.gz", 'rb') as f:
#     file_content = f.read()

In [58]:
# dataset = pd.read_json(BytesIO(file_content), lines=True)

In [99]:
# dataset.groupby(["title"]).size().reset_index(name='queries_count')

### Data Collection: 20News Labeled

In [100]:
DATA_PATH = "20news_labeled.csv"

In [103]:
df = pd.read_csv(DATA_PATH)

In [112]:
df.insert(0, '_id', range(len(df)))

In [113]:
df.to

Unnamed: 0,_id,text,target,q1,q2
0,0,"Library of Congress to Host Dead Sea Scroll Symposium April 21-22 To: National and Assignment desks, Daybook Editor Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191 both of the Library of Congress WASHINGTON, April 19 -- A symposium on the Dead Sea Scrolls will be held at the Library of Congress on Wednesday,April 21, and Thursday, April 22. The two-day program, cosponsoredby the library and Baltimore Hebrew University, with additionalsupport from the Project Judaica Foundation, will be held in thelibrary's Mumford Room, sixth floor, Madison Building. Seating is limited, and admission to any session of the symposiummust be requested in writing (see Note A). The symposium will be held one week before the public opening of amajor exhibition, ""Scrolls from the Dead Sea: The Ancient Library ofQumran and Modern Scholarship,"" that opens at the Library of Congresson April 29. On view will be fragmentary scrolls and archaeologicalartifacts excavated at Qumran, on loan from the Israel AntiquitiesAuthority. Approximately 50 items from Library of Congress specialcollections will augment these materials. The exhibition, on view inthe Madison Gallery, through Aug. 1, is made possible by a generousgift from the Project Judaica Foundation of Washington, D.C. The Dead Sea Scrolls have been the focus of public and scholarlyinterest since 1947, when they were discovered in the desert 13 mileseast of Jerusalem. The symposium will explore the origin and meaningof the scrolls and current scholarship. Scholars from diverseacademic backgrounds and religious affiliations, will offer theirdisparate views, ensuring a lively discussion. The symposium schedule includes opening remarks on April 21, at2 p.m., by Librarian of Congress James H. Billington, and byDr. Norma Furst, president, Baltimore Hebrew University. 
Co-chairingthe symposium are Joseph Baumgarten, professor of Rabbinic Literatureand Institutions, Baltimore Hebrew University and Michael Grunberger,head, Hebraic Section, Library of Congress. Geza Vermes, professor emeritus of Jewish studies, OxfordUniversity, will give the keynote address on the current state ofscroll research, focusing on where we stand today. On the secondday, the closing address will be given by Shmaryahu Talmon, who willpropose a research agenda, picking up the theme of how the Qumranstudies might proceed. On Wednesday, April 21, other speakers will include: -- Eugene Ulrich, professor of Hebrew Scriptures, University ofNotre Dame and chief editor, Biblical Scrolls from Qumran, on ""TheBible at Qumran;"" -- Michael Stone, National Endowment for the Humanitiesdistinguished visiting professor of religious studies, University ofRichmond, on ""The Dead Sea Scrolls and the Pseudepigrapha."" -- From 5 p.m. to 6:30 p.m. a special preview of the exhibitionwill be given to symposium participants and guests. On Thursday, April 22, beginning at 9 a.m., speakers will include: -- Magen Broshi, curator, shrine of the Book, Israel Museum,Jerusalem, on ""Qumran: The Archaeological Evidence;"" -- P. Kyle McCarter, Albright professor of Biblical and ancientnear Eastern studies, The Johns Hopkins University, on ""The CopperScroll;"" -- Lawrence H. 
Schiffman, professor of Hebrew and Judaic studies,New York University, on ""The Dead Sea Scrolls and the History ofJudaism;"" and -- James VanderKam, professor of theology, University of NotreDame, on ""Messianism in the Scrolls and in Early Christianity."" The Thursday afternoon sessions, at 1:30 p.m., include: -- Devorah Dimant, associate professor of Bible and Ancient JewishThought, University of Haifa, on ""Qumran Manuscripts: Library of aJewish Community;"" -- Norman Golb, Rosenberger professor of Jewish history andcivilization, Oriental Institute, University of Chicago, on ""TheCurrent Status of the Jerusalem Origin of the Scrolls;"" -- Shmaryahu Talmon, J.L. Magnas professor emeritus of Biblicalstudies, Hebrew University, Jerusalem, on ""The Essential 'Commune ofthe Renewed Covenant': How Should Qumran Studies Proceed?"" will closethe symposium. There will be ample time for question and answer periods at theend of each session. Also on Wednesday, April 21, at 11 a.m.: The Library of Congress and The Israel Antiquities Authoritywill hold a lecture by Esther Boyd-Alkalay, consulting conservator,Israel Antiquities Authority, on ""Preserving the Dead Sea Scrolls""in the Mumford Room, LM-649, James Madison Memorial Building, TheLibrary of Congress, 101 Independence Ave., S.E., Washington, D.C. ------ NOTE A: For more information about admission to the symposium,please contact, in writing, Dr. Michael Grunberger, head, HebraicSection, African and Middle Eastern Division, Library of Congress,Washington, D.C. 20540. -30-",christian,0,0
1,1,"Anyone who dies for a ""cause"" runs the risk of dying for a lie. As forpeople being able to tell if he was a liar, well, we've had grifters andcharlatans since the beginning of civilization. If David Copperfield hadbeen the Messiah, I bet he could have found plenty of believers. Jesus was hardly the first to claim to be a faith healer, and he wasn't thefirst to be ""witnessed."" What sets him apart?Rubbish. Nations have followed crazies, liars, psychopaths, and megalomaniacs throughout history. Hitler, Tojo, Mussolini, Khomeini,Qadaffi, Stalin, Papa Doc, and Nixon come to mind...all from this century.Koresh is a non-issue.Take a discrete mathematics or formal logic course. There are flaws in yourlogic everywhere. And as I'm sure others will tell you, read the FAQ!Of course, you have to believe the Bible first. Just because something iswritten in the Bible does not mean it is true, and the age of that tome plusthe lack of external supporting evidence makes it less credible. So if youdo quote from the Bible in the future, try to back up that quote with supporting evidence. Otherwise, you will get flamed mercilessly.Just like weight lifting or guitar playing, eh? I don't know how you define the world ""total,"" but I would imagine a ""total sacrafice [sp]of everything for God's sake"" would involve more than a time commitment.You are correct about our tendency to ""box everything into time units.""Would you explain HOW one should involove God in sports and (hehehe)television?",altheism,1,0
2,2,"Woah...The context is about God's calling out a special people (the Jews) tocarry the ""promise."" To read the meaning as literal people is to miss Paul'sentire point. I'd be glad to send anyone more detailed explanations of thispassage if interested.",christian,0,0
3,3,"See, there you go again, saying that a moral act is only significantif it is ""voluntary."" Why do you think this?And anyway, humans have the ability to disregard some of their instincts.You are attaching too many things to the term ""moral,"" I think.Let's try this: is it ""good"" that animals of the same speciesdon't kill each other. Or, do you think this is right? Or do you think that animals are machines, and that nothing they dois either right nor wrong?Those weren't arbitrary killings. They were slayings related to some sortof mating ritual or whatnot.Yes it was, but I still don't understand your distinctions. Whatdo you mean by ""consider?"" Can a small child be moral? How abouta gorilla? A dolphin? A platypus? Where is the line drawn? Doesthe being need to be self aware?What *do* you call the mechanism which seems to prevent animals ofthe same species from (arbitrarily) killing each other? Don'tyou find the fact that they don't at all significant?",altheism,0,1
4,4,"The last sentence is ironic, since so many readers ofsoc.religion.christian seem to not be embarrassed by apologists such asJosh McDowell and C.S. Lewis. The above also expresses a rather odd senseof history. What makes you think the masses in Aquinas' day, who weremostly illiterate, knew any more about rhetoric and logic than most peopletoday? If writings from the period seem elevated consider that only thecream of the crop, so to speak, could read and write. If everyone inthe medieval period ""knew the rules"" it was a matter of uncriticallyaccepting what they were told.Bill Mayne",christian,0,0
5,5,"Perhaps we have different definitions of absolute then. To me,an absolute is something that is constant across time, culture,situations, etc. True in every instance possible. Do you agreewith this definition? I think you do:A simple example:In the New Testament (sorry I don't have a Bible at work, and can'tprovide a reference), women are instructed to be silent and covertheir heads in church. Now, this is scripture. By your definition, this is truth and therefore absolute. Do women in your church speak? Do they cover their heads? If all scripture is absolute truth, it seems to me that women speaking in and coming to church with bare heads should be intolerable to evangelicals. Yet, clearly, women do speak in evangelical churches and come with bare heads. (At least this was the case in the evangelical churches I grew up in.)Evangelicals are clearly not taking this particular part of scripture to be absolute truth. (And there are plenty of other examples.)Can you reconcile this?I don't claim that there are *no* absolutes. I think there are veryfew, though, and determining absolutes is difficult.But you are claiming that all of Scripture is absolute. How can youdetermine absolutes derived from Scripture when you can't agree howto interpret the Scripture? It's very difficult to see how you can claim something which is based on your own *interpretation* is absolute. Do you deny that your ownbackground, education, prejudices, etc. come into play when you read the Bible, and determine how to interpret a passsage? Do you deny that you in fact interpret?",christian,0,1
6,6,"I apologize if this article is slightly confusing, and late. The origonaldraft didn't make it through the moderators quote-screens. So I didviolence to it, but if you remember the article I am respondingto it should still make sence.What, no hello for heathan netters?I feel all left out now. :([deletia- table of content, intro, homosexuality][deletia- incorrect attributions]Uh, you have your attributions wrong, you were respondingto my article, so Dan Johnson should be the 1st one.[deletia- no free gifts speil nuked by moderator fiat.]Ah, in the _cosmic_ sence.. but who lives in the cosmic sence?Not me! Cosmicly, we don't even exist for all practical purposes.I can hardly use the Cosmic Sence Of Stuff as a guide to life.It would just say: ""don't bother.""Luckily for mortals, there are many sences of scale you can talkabout. In a human sence, you can have big purposes.But the influence of Aristotle, Confucious, Alexander, Ceasar andcountless others is still with us, although their works have perished.But they have changed to course of history, and while humanity exists,their deeds cannot be said to have come to nothing, even if theyare utterly forgotten.One day, surely. (well, unless you believe in the Second Coming, whichI do not)But in that time we can make a difference.In the end. But it must be the end; until then, there is all thepoint you can muster. And when that end comes, there will be nobodyto ask, ""Gee, I don't think James Sledd's deeds are gonna makemuch of a difference, ulitmately, ya know?"".But they will have already have made a difference, great or small,before the end.Why must your ends be eternal to be worthwhile?Little is in the eye of the beholder, of course.I don't doubt it. But I have thought about the cosmic scale. Andit does not seem to mean much to us, here, today.I would not find this comforting. But perhaps it is merely mydefinitions. 
Here's what I think the relevant terms are:""Reality""\tThat which is real.""Illusion""\tThat which is not real, but seems to be.""Real""\t\tObjectively ExistingFor ""reality"" to be an ""illusion"" would mean, then:That which is real is not real, but seems to be.Or:That which objectively exists, does not objectively exist, butdoes seem to objectively exist.From which we can conclude, that unless you want to get acontradiction, that no things objectively exist.But I have a problem with this because I would like to saythat *I* objectively exist, if nothing else. Cogito Ergo Sumand all that.Perhaps you do not mean all that, but rather mean:""Objective Reality is Unreachable by humans.""Which is not so bad, and so far as I know is true.Have on. If reality is an illusion, isn't True Reality an illusiontoo? And if True Reality is spirit, doens't that make Spirit an Illusionas well?If I am not distinctly confused, this is getting positively Buddhist.That is one hell of a statement, although perhaps true.Do you mean to imply that it was *intended* to be so? If so,please show that this is true. If not, please explain how thiscan give a purpose to anything.How does it do that?Wouldn't the world=school w/ intent idea make the world a preparation forsome *greater* purpose, rather than a purpose in itself.What pressure?It is not necessary to be a success in human terms, unless yourgoals either include doing so or require doing so before theythemselves can be achived.Indeed, many people have set goals for themselves thatdo not include success in human terms as _I_ understand it. Checkout yer Buddhist monk type guy. Out for nirvana, which is notat all the same thing.Why is learning to love a goal? What happens if you fail in thisgoal? To you? To God? 
To the mysterious Purpose?[deletia- question about immortailty and my answer deleted because it was mostly quote.]I'll have a crack at that.(1) The nature of eternal life is neatly described by its name: It isthe concept of life without death, life without end.(2) No. We can put together word to describe it, but we cannot imagine it.(2a) No metaphor is adequate next to eternity; if it were we could notunderstand it either. (or so I suspect)---\t\t\t- Dan JohnsonAnd God said ""Jeeze, this is dull""... and it *WAS* dull. Genesis 0:0",christian,0,0
7,7,"...But what was wrong with it? It won't tempt anyone to any kind of sin, asfar as I can tell. It doesn't belittle anyone. It does not substituteoffensiveness for humor (it's genuinely funny).We shouldn't assume that _all_ jokes that mention sexuality are ""dirty""merely because so many are.And we should never mistake prudery for spirituality. It can be the directopposite -- a symptom of the _lack_ of a healthy perspective on God'screation.",christian,0,1
8,8,"You don't think these are little things because with twenty-twentyhindsight, you know what they led to.",altheism,0,0
9,9,"Of course, I'd still recommend that Michael read _True and Reasonable_by Douglas Jacoby.Joe Fisher",christian,0,0


## Pre-processing Data

In this step we will create 5 data models:

v1 - Only Tokenization; \
v2 - Only Stopword Filter; \
v3 - Only Stemming; \
v4 - Remove Stopwords and Stemming; \
v5 - Remove Stopwords and expand words with Synonyms.


In [60]:
def remove_accent(text):
    # Transliterate accented characters to their closest ASCII equivalent.
    return unidecode.unidecode(text)

def tokenize(text):
    # Split the text into word and punctuation tokens (NLTK punkt, English).
    return word_tokenize(text, language="english")

# English stopwords, normalized the same way as the documents.
STOPWORDS = set(remove_accent(token) for token in stopwords.words("english"))

# Build the stemmer once instead of on every call.
STEMMER = SnowballStemmer("english")

def pre_process(text, rmv_sw, stem):
    # Lowercase, strip accents and tokenize; optionally drop stopwords and stem.
    tokens = tokenize(remove_accent(text.lower()))

    if rmv_sw:
        tokens = [token for token in tokens if token not in STOPWORDS]

    if stem:
        tokens = [STEMMER.stem(token) for token in tokens]

    return tokens

In [61]:
pre_process_v1 = lambda text: pre_process(text, False, False)
pre_process_v2 = lambda text: pre_process(text, True, False)
pre_process_v3 = lambda text: pre_process(text, False, True)
pre_process_v4 = lambda text: pre_process(text, True, True)
#pre_process_v5 = lambda text: pre_process(text, True, False) # TODO: for this case I checked this reference -> https://www.holisticseo.digital/python-seo/nltk/wordnet.

In [62]:
dataset_pcrs = dataset.copy().drop("query", axis=1).drop_duplicates().head(50)

In [63]:
dataset_pcrs["v1"] = dataset_pcrs["text"].apply(lambda text: pre_process_v1(text))

In [64]:
dataset_pcrs["v2"] = dataset_pcrs["text"].apply(lambda text: pre_process_v2(text))

In [65]:
dataset_pcrs["v3"] = dataset_pcrs["text"].apply(lambda text: pre_process_v3(text))

In [66]:
dataset_pcrs["v4"] = dataset_pcrs["text"].apply(lambda text: pre_process_v4(text))

In [67]:
# @TODO: Create the vocabulary expansion pipeline and connect it to pre_process
# dataset_pcrs["v5"] = dataset_pcrs["text"].apply(lambda text: pre_process_v5(text))
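
The tokenizer emits punctuation such as `(`, `)` and `,` as separate tokens, and these survive all four pipelines. One possible way to drop them, instead of adding each symbol to `STOPWORDS`, is sketched below. `drop_punctuation` is a hypothetical helper that is not part of the notebook, and it assumes that only tokens made up entirely of ASCII punctuation should be removed.

```python
import string

# Assumption: a token is dropped only when every character is ASCII punctuation.
PUNCTUATION = set(string.punctuation)


def drop_punctuation(tokens):
    """Remove tokens made up solely of punctuation, e.g. '(', ')', ',' or '...'."""
    return [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]


sample = ["preterm", "(", "n", "=", "17", ")", "and", "full-term", "infants", "."]
cleaned = drop_punctuation(sample)
```

Hyphenated words such as `full-term` are kept because they contain letters; only pure-punctuation tokens are filtered out.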

In [68]:
# Be careful with the length of dataset_pcrs
#HTML(dataset_pcrs.head(3).to_html())

# @TODO: add "(" and ")" to STOPWORDS. A better strategy is to create a dict of symbols to filter and add it to this pipeline.
dataset_pcrs.head(2)

Unnamed: 0,_id,title,text,v1,v2,v3,v4
0,4983,Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.,"Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 versus 1.1 microm2/ms). Relative anisotropy was higher the closer birth was to term with greater absolute values in the internal capsule than in the central white matter. Preterm infants at term showed higher mean diffusion coefficients in the central white matter (1.4 +/- 0.24 versus 1.15 +/- 0.09 microm2/ms, p = 0.016) and lower relative anisotropy in both areas compared with full-term infants (white matter, 10.9 +/- 0.6 versus 22.9 +/- 3.0%, p = 0.001; internal capsule, 24.0 +/- 4.44 versus 33.1 +/- 0.6% p = 0.006). Nonmyelinated fibers in the corpus callosum were visible by diffusion tensor MRI as early as 28 wk; full-term and preterm infants at term showed marked differences in white matter fiber organization. 
The data indicate that quantitative assessment of water diffusion by diffusion tensor MRI provides insight into microstructural development in cerebral white matter in living infants.","[alterations, of, the, architecture, of, cerebral, white, matter, in, the, developing, human, brain, can, affect, cortical, development, and, result, in, functional, disabilities, ., a, line, scan, diffusion-weighted, magnetic, resonance, imaging, (, mri, ), sequence, with, diffusion, tensor, analysis, was, applied, to, measure, the, apparent, diffusion, coefficient, ,, to, calculate, relative, anisotropy, ,, and, to, delineate, three-dimensional, fiber, architecture, in, cerebral, white, matter, in, preterm, (, n, =, 17, ), and, full-term, infants, (, n, =, 7, ), ., to, assess, effects, of, prematurity, on, cerebral, white, matter, development, ,, early, gestation, preterm, infants, (, n, =, 10, ), were, studied, ...]","[alterations, architecture, cerebral, white, matter, developing, human, brain, affect, cortical, development, result, functional, disabilities, ., line, scan, diffusion-weighted, magnetic, resonance, imaging, (, mri, ), sequence, diffusion, tensor, analysis, applied, measure, apparent, diffusion, coefficient, ,, calculate, relative, anisotropy, ,, delineate, three-dimensional, fiber, architecture, cerebral, white, matter, preterm, (, n, =, 17, ), full-term, infants, (, n, =, 7, ), ., assess, effects, prematurity, cerebral, white, matter, development, ,, early, gestation, preterm, infants, (, n, =, 10, ), studied, second, time, term, ., central, white, matter, mean, apparent, diffusion, coefficient, 28, wk, high, ,, 1.8, microm2/ms, ,, decreased, toward, term, 1.2, microm2/ms, ...]","[alter, of, the, architectur, of, cerebr, white, matter, in, the, develop, human, brain, can, affect, cortic, develop, and, result, in, function, disabl, ., a, line, scan, diffusion-weight, magnet, reson, imag, (, mri, ), sequenc, with, diffus, tensor, analysi, was, appli, to, measur, 
the, appar, diffus, coeffici, ,, to, calcul, relat, anisotropi, ,, and, to, delin, three-dimension, fiber, architectur, in, cerebr, white, matter, in, preterm, (, n, =, 17, ), and, full-term, infant, (, n, =, 7, ), ., to, assess, effect, of, prematur, on, cerebr, white, matter, develop, ,, earli, gestat, preterm, infant, (, n, =, 10, ), were, studi, ...]","[alter, architectur, cerebr, white, matter, develop, human, brain, affect, cortic, develop, result, function, disabl, ., line, scan, diffusion-weight, magnet, reson, imag, (, mri, ), sequenc, diffus, tensor, analysi, appli, measur, appar, diffus, coeffici, ,, calcul, relat, anisotropi, ,, delin, three-dimension, fiber, architectur, cerebr, white, matter, preterm, (, n, =, 17, ), full-term, infant, (, n, =, 7, ), ., assess, effect, prematur, cerebr, white, matter, develop, ,, earli, gestat, preterm, infant, (, n, =, 10, ), studi, second, time, term, ., central, white, matter, mean, appar, diffus, coeffici, 28, wk, high, ,, 1.8, microm2/m, ,, decreas, toward, term, 1.2, microm2/m, ...]"
3,5836,Induction of myelodysplasia by myeloid-derived suppressor cells.,"Myelodysplastic syndromes (MDS) are age-dependent stem cell malignancies that share biological features of activated adaptive immune response and ineffective hematopoiesis. Here we report that myeloid-derived suppressor cells (MDSC), which are classically linked to immunosuppression, inflammation, and cancer, were markedly expanded in the bone marrow of MDS patients and played a pathogenetic role in the development of ineffective hematopoiesis. These clonally distinct MDSC overproduce hematopoietic suppressive cytokines and function as potent apoptotic effectors targeting autologous hematopoietic progenitors. Using multiple transfected cell models, we found that MDSC expansion is driven by the interaction of the proinflammatory molecule S100A9 with CD33. These 2 proteins formed a functional ligand/receptor pair that recruited components to CD33’s immunoreceptor tyrosine-based inhibition motif (ITIM), inducing secretion of the suppressive cytokines IL-10 and TGF-β by immature myeloid cells. S100A9 transgenic mice displayed bone marrow accumulation of MDSC accompanied by development of progressive multilineage cytopenias and cytological dysplasia. Importantly, early forced maturation of MDSC by either all-trans-retinoic acid treatment or active immunoreceptor tyrosine-based activation motif–bearing (ITAM-bearing) adapter protein (DAP12) interruption of CD33 signaling rescued the hematologic phenotype. 
These findings indicate that primary bone marrow expansion of MDSC driven by the S100A9/CD33 pathway perturbs hematopoiesis and contributes to the development of MDS.","[myelodysplastic, syndromes, (, mds, ), are, age-dependent, stem, cell, malignancies, that, share, biological, features, of, activated, adaptive, immune, response, and, ineffective, hematopoiesis, ., here, we, report, that, myeloid-derived, suppressor, cells, (, mdsc, ), ,, which, are, classically, linked, to, immunosuppression, ,, inflammation, ,, and, cancer, ,, were, markedly, expanded, in, the, bone, marrow, of, mds, patients, and, played, a, pathogenetic, role, in, the, development, of, ineffective, hematopoiesis, ., these, clonally, distinct, mdsc, overproduce, hematopoietic, suppressive, cytokines, and, function, as, potent, apoptotic, effectors, targeting, autologous, hematopoietic, progenitors, ., using, multiple, transfected, cell, models, ,, we, found, that, mdsc, expansion, is, driven, ...]","[myelodysplastic, syndromes, (, mds, ), age-dependent, stem, cell, malignancies, share, biological, features, activated, adaptive, immune, response, ineffective, hematopoiesis, ., report, myeloid-derived, suppressor, cells, (, mdsc, ), ,, classically, linked, immunosuppression, ,, inflammation, ,, cancer, ,, markedly, expanded, bone, marrow, mds, patients, played, pathogenetic, role, development, ineffective, hematopoiesis, ., clonally, distinct, mdsc, overproduce, hematopoietic, suppressive, cytokines, function, potent, apoptotic, effectors, targeting, autologous, hematopoietic, progenitors, ., using, multiple, transfected, cell, models, ,, found, mdsc, expansion, driven, interaction, proinflammatory, molecule, s100a9, cd33, ., 2, proteins, formed, functional, ligand/receptor, pair, recruited, components, cd33, 's, immunoreceptor, tyrosine-based, inhibition, motif, (, itim, ), ,, inducing, secretion, ...]","[myelodysplast, syndrom, (, mds, ), are, age-depend, stem, cell, malign, that, share, 
biolog, featur, of, activ, adapt, immun, respons, and, ineffect, hematopoiesi, ., here, we, report, that, myeloid-deriv, suppressor, cell, (, mdsc, ), ,, which, are, classic, link, to, immunosuppress, ,, inflamm, ,, and, cancer, ,, were, mark, expand, in, the, bone, marrow, of, mds, patient, and, play, a, pathogenet, role, in, the, develop, of, ineffect, hematopoiesi, ., these, clonal, distinct, mdsc, overproduc, hematopoiet, suppress, cytokin, and, function, as, potent, apoptot, effector, target, autolog, hematopoiet, progenitor, ., use, multipl, transfect, cell, model, ,, we, found, that, mdsc, expans, is, driven, ...]","[myelodysplast, syndrom, (, mds, ), age-depend, stem, cell, malign, share, biolog, featur, activ, adapt, immun, respons, ineffect, hematopoiesi, ., report, myeloid-deriv, suppressor, cell, (, mdsc, ), ,, classic, link, immunosuppress, ,, inflamm, ,, cancer, ,, mark, expand, bone, marrow, mds, patient, play, pathogenet, role, develop, ineffect, hematopoiesi, ., clonal, distinct, mdsc, overproduc, hematopoiet, suppress, cytokin, function, potent, apoptot, effector, target, autolog, hematopoiet, progenitor, ., use, multipl, transfect, cell, model, ,, found, mdsc, expans, driven, interact, proinflammatori, molecul, s100a9, cd33, ., 2, protein, form, function, ligand/receptor, pair, recruit, compon, cd33, 's, immunoreceptor, tyrosine-bas, inhibit, motif, (, itim, ), ,, induc, secret, ...]"


### Vectorizing Data

This step builds four (vectorizer, vectorized corpus) pairs, one per preprocessing variant (v1 to v4). The fifth variant (v5) is left commented out.

In [69]:
def get_documments(df_column):
    return list(map(lambda tokenized_text: " ".join(tokenized_text), df_column))

In [70]:
corpus_v1 = get_documments(dataset_pcrs["v1"])
corpus_v2 = get_documments(dataset_pcrs["v2"])
corpus_v3 = get_documments(dataset_pcrs["v3"])
corpus_v4 = get_documments(dataset_pcrs["v4"])
# corpus_v5 = get_documments(dataset_pcrs["v5"])

In [71]:
vectorizer_v1 = TfidfVectorizer()
corpus_v1_vct = vectorizer_v1.fit_transform(corpus_v1)

vectorizer_v2 = TfidfVectorizer()
corpus_v2_vct = vectorizer_v2.fit_transform(corpus_v2)

vectorizer_v3 = TfidfVectorizer()
corpus_v3_vct = vectorizer_v3.fit_transform(corpus_v3)

vectorizer_v4 = TfidfVectorizer()
corpus_v4_vct = vectorizer_v4.fit_transform(corpus_v4)

# vectorizer_v5 = TfidfVectorizer()
# corpus_v5_vct = vectorizer_v5.fit_transform(corpus_v5)
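
To make the weighting behind the vectorizers above concrete, here is a minimal plain-Python sketch of TF-IDF. It assumes whitespace tokenization, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 normalization; `tfidf_vectors` and `toy_corpus` are hypothetical names used only for illustration, and `TfidfVectorizer` applies its own tokenization rules, so its numbers can differ from this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised documents."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors


# Hypothetical toy corpus, loosely inspired by the documents shown earlier.
toy_corpus = ["white matter diffusion", "bone marrow expansion", "white matter development"]
vocabulary, vectors = tfidf_vectors(toy_corpus)
```

In the first toy document, "diffusion" receives a larger weight than "white" because it occurs in fewer documents, which is the effect the TF-IDF representation is meant to capture.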

### Information Retrieval Function

In [72]:
# matcher: kd -> KDTree, nn -> Nearest Neighbor, bf -> Brute Force
# n -> number of documents to return; use -1 to return all documents
def info_retrieval(pre_process, corpus, vectorizer, query, n=2, matcher="kd"):
    
    query = " ".join(pre_process(query))
    query_vct = vectorizer.transform([query])
    
    if n == -1:
        
        n = corpus.shape[0]
    
    if matcher == "kd":
        
        # Build the tree from the corpus passed in, not from a hard-coded variant.
        kdtree = scipy.spatial.KDTree(corpus.toarray())
        
        # p is the Minkowski p-norm.
        # p = 1, Manhattan distance
        # p = 2, Euclidean distance
        # p = +inf, Chebyshev distance
        distance, index = kdtree.query(query_vct.toarray(), n, p=1)
        
    elif matcher == "nn":
        
        nbrs = NearestNeighbors(n_neighbors=n, algorithm="ball_tree").fit(corpus)
        distance, index = nbrs.kneighbors(query_vct)
        
    elif matcher == "bf":

        nbrs = NearestNeighbors(n_neighbors=n, algorithm="brute", metric="cosine").fit(corpus)
        distance, index = nbrs.kneighbors(query_vct)
        
    else:
        
        raise ValueError("Matcher strategy not available. Use 'kd' for KDTree, 'nn' for Nearest Neighbor, or 'bf' for Brute Force.")
    
    return list(zip(distance.tolist()[0], index.tolist()[0]))
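
The `p` argument in the KDTree branch selects which Minkowski norm is used to compare vectors. The sketch below spells out the three cases listed in the comments with plain Python; `minkowski_distance` is a hypothetical helper written for illustration and is not part of SciPy.

```python
import math


def minkowski_distance(u, v, p):
    """Minkowski distance between two equal-length vectors; p may be math.inf."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if math.isinf(p):
        # The limit p -> +inf keeps only the largest coordinate difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


u, v = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski_distance(u, v, 1)         # |3| + |4|
euclidean = minkowski_distance(u, v, 2)         # sqrt(3^2 + 4^2)
chebyshev = minkowski_distance(u, v, math.inf)  # max(|3|, |4|)
```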

In [73]:
ir_v1 = lambda query: info_retrieval(pre_process_v1, corpus_v1_vct, vectorizer_v1, query, n=-1, matcher = "bf")
ir_v2 = lambda query: info_retrieval(pre_process_v2, corpus_v2_vct, vectorizer_v2, query, n=-1, matcher = "bf")
ir_v3 = lambda query: info_retrieval(pre_process_v3, corpus_v3_vct, vectorizer_v3, query, n=-1, matcher = "bf")
ir_v4 = lambda query: info_retrieval(pre_process_v4, corpus_v4_vct, vectorizer_v4, query, n=-1, matcher = "bf")
# ir_v5 = lambda query: info_retrieval(pre_process_v5, corpus_v5_vct, vectorizer_v5, query, n=-1, matcher = "bf")
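
All four variants above use `matcher="bf"`, that is, an exhaustive comparison under cosine distance. As a rough illustration of what that ranking computes, the following plain-Python sketch scores every document against the query and sorts by distance. The names `cosine_distance`, `rank_by_cosine`, and the toy vectors are assumptions made for this example, and returning a distance of 1.0 for zero vectors is a choice of this sketch, not documented library behaviour.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors get distance 1.0 (a choice of this sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_cosine(query_vector, document_vectors, n=-1):
    """Return (distance, document position) pairs, nearest first; n=-1 returns all."""
    pairs = [(cosine_distance(query_vector, doc), position)
             for position, doc in enumerate(document_vectors)]
    pairs.sort(key=lambda pair: pair[0])
    return pairs if n == -1 else pairs[:n]


# Hypothetical 3-dimensional "TF-IDF" vectors for three documents and one query.
toy_documents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]
ranking = rank_by_cosine(toy_query, toy_documents)
```

The output has the same shape as the list returned by `info_retrieval`: pairs of (distance, row position), where the position still has to be mapped back to a document `_id`.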

### Snippet Test

In [74]:
query = "Alterations of the architecture of cerebral white matter in the developing human brain"

In [75]:
result = sorted(ir_v1(query), key=lambda i:i[0])

In [76]:
result

[(0.5512980389649619, 0),
 (0.9018937692186384, 4),
 (0.9030226639226278, 8),
 (0.904491849656119, 22),
 (0.9045364304922894, 3),
 (0.9055550155037727, 15),
 (0.9059548306703978, 40),
 (0.9107837002564602, 23),
 (0.9180953622326151, 7),
 (0.9188460324359904, 10),
 (0.9197344396945608, 37),
 (0.9204179861579849, 26),
 (0.9209788244798877, 47),
 (0.924605936837051, 34),
 (0.9257283763325889, 14),
 (0.9301273361039335, 46),
 (0.930343144950218, 20),
 (0.9307127649225015, 25),
 (0.9309219855750911, 36),
 (0.9310791524131075, 17),
 (0.9323432398604214, 21),
 (0.9333678935374439, 42),
 (0.9347401841822083, 30),
 (0.9375193526919786, 6),
 (0.9380836793410617, 24),
 (0.9438240485634083, 29),
 (0.9438526522441384, 33),
 (0.9448974986965711, 1),
 (0.9451534537521332, 31),
 (0.945777795071706, 27),
 (0.9465581953663532, 13),
 (0.9472385587498705, 45),
 (0.9495743997281749, 2),
 (0.9497327886616467, 5),
 (0.9499669994122191, 44),
 (0.950751768678862, 28),
 (0.9521823112364727, 32),
 (0.95235179118

In [77]:
# Nearest document to the query snippet
# pd.set_option('display.max_colwidth', None)

result_nn = result[0][1]
# result_nn is a row position, so select by position with iloc
dataset_pcrs["_id"].iloc[result_nn]

4983

In [78]:
# pd.set_option('display.max_colwidth', 50)

### Evaluation

In [79]:
# Auxiliary function: returns the _id of the document nearest to the given query.
def get_nearest_id(query, model, data):
    result = sorted(model(query), key=lambda i:i[0])
    result_nn = result[0][1]
    # result_nn is a row position, so select by position with iloc
    return int(data["_id"].iloc[result_nn])

In [80]:
dataset_pcrs[["_id"]]

Unnamed: 0,_id
0,4983
3,5836
6,7912
9,18670
12,19238
15,33370
18,36474
21,54440
24,70115
27,70490


In [81]:
# used_ids -> ids of the documents in the dataset_pcrs batch used during development
used_ids = dataset_pcrs["_id"].tolist()
ds_eval = dataset[dataset["_id"].isin(used_ids)].drop(["text"], axis=1).head(50)

In [82]:
ds_eval["ir_v1_nn"] = ds_eval["query"].apply(lambda query: get_nearest_id(query, ir_v1, dataset_pcrs))
ds_eval["ir_v2_nn"] = ds_eval["query"].apply(lambda query: get_nearest_id(query, ir_v2, dataset_pcrs))
ds_eval["ir_v3_nn"] = ds_eval["query"].apply(lambda query: get_nearest_id(query, ir_v3, dataset_pcrs))
ds_eval["ir_v4_nn"] = ds_eval["query"].apply(lambda query: get_nearest_id(query, ir_v4, dataset_pcrs))
#ds_eval["ir_v5_nn"] = ds_eval["query"].apply(lambda query: get_nearest_id(query, ir_v5, dataset_pcrs))

In [83]:
ds_eval.head(5)

Unnamed: 0,_id,title,query,ir_v1_nn,ir_v2_nn,ir_v3_nn,ir_v4_nn
0,4983,Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.,what is the diffusion coefficient of cerebral white matter?,4983,4983,4983,4983
1,4983,Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.,what is diffusion tensor,4983,4983,4983,4983
2,4983,Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.,what is the diffusion coefficient of the cerebral cortex,4983,4983,4983,4983
3,5836,Induction of myelodysplasia by myeloid-derived suppressor cells.,which type of hematopoiesis is characterized by an increased proliferation of myeloid-derived suppressor cells?,5836,5836,5836,5836
4,5836,Induction of myelodysplasia by myeloid-derived suppressor cells.,which cell types have hematopoiesis,5836,5836,5836,5836


In [84]:
# @TODO: add ir_v5_nn
eval_models = {"model": ["ir_v1_nn","ir_v2_nn","ir_v3_nn","ir_v4_nn"],
               "precision": [], "recall": [], "f-measure": []}

for model in eval_models["model"]:
    
    y_true = list(map(lambda i: int(i), ds_eval["_id"].tolist()))
    y_pred = list(map(lambda i: int(i), ds_eval[model].tolist()))
    
    
    eval_models["precision"].append(precision_score(y_true, y_pred, average='weighted'))
    
    # Old
    # eval_models["recall"].append(recall_score(y_true, y_pred, average='weighted'))
    # Test: suppressing the warning did not change the printed metric values.
    
    eval_models["recall"].append(recall_score(y_true, y_pred, average='weighted', zero_division=1))
    
    eval_models["f-measure"].append(f1_score(y_true, y_pred, average='weighted'))
    
    # Suggestion: inspect the confusion matrix and the classification report to decide how to adjust the model.
    # print(confusion_matrix(y_true,y_pred))
    # print(classification_report(y_true,y_pred))
    
    # Code change: recall_score is now called as
    #     eval_models["recall"].append(recall_score(y_true, y_pred, average='weighted', zero_division=1))
    # Only the "zero_division=1" parameter was added.
    #
    # Sources: https://stackoverflow.com/questions/68534836/warning-precision-and-f-score-are-ill-defined-and-being-set-to-0-0-in-labels-wi
    #          https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [85]:
results = pd.DataFrame(eval_models)

In [86]:
results

Unnamed: 0,model,precision,recall,f-measure
0,ir_v1_nn,1.0,1.0,1.0
1,ir_v2_nn,1.0,1.0,1.0
2,ir_v3_nn,1.0,0.96,0.976
3,ir_v4_nn,1.0,0.96,0.976
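
With `average='weighted'`, each document `_id` is treated as a class and the per-class scores are averaged by the number of queries whose expected document is that class. Under that definition, weighted recall reduces to the fraction of queries whose top-1 document is the expected one. The sketch below reproduces the arithmetic in plain Python on a hypothetical four-query example; the helper names and the toy predictions are assumptions for illustration, not results from the notebook. It also shows one way precision can stay at 1.0 while recall drops: the miss goes to an `_id` that never appears as ground truth, so it carries zero weight in the precision average.

```python
from collections import Counter


def top1_accuracy(y_true, y_pred):
    """Fraction of queries whose predicted document id equals the expected one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum(support[c] / total * (hits[c] / support[c]) for c in support)


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights proportional to class support."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for c, count in support.items():
        # A class that is never predicted contributes precision 0 in this sketch.
        precision = hits[c] / predicted[c] if predicted[c] else 0.0
        score += count / total * precision
    return score


# Hypothetical example: the last query retrieves a document that is
# never the expected answer for any query in the evaluation set.
y_true = [4983, 4983, 5836, 5836]
y_pred = [4983, 4983, 5836, 7912]

accuracy = top1_accuracy(y_true, y_pred)
recall = weighted_recall(y_true, y_pred)
precision = weighted_precision(y_true, y_pred)
```

Read this way, a weighted recall of 0.96 over 50 queries would correspond to 48 queries whose nearest document is the expected one.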


## 20 Newsgroups Dataset Study

In [87]:
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(data_home="dataset/20_newgroup", subset='train', remove=('headers', 'footers', 'quotes'), download_if_missing=True, categories=cats)

In [88]:
# Build a two-column DataFrame: raw post text and its numeric class label.
data = pd.DataFrame({"text": newsgroups_train.data, "target": newsgroups_train.target})

In [89]:
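pd.set_option('display.max_colwidth', None)
# Target 0 is alt.atheism and target 1 is soc.religion.christian.
data["target"] = data["target"].apply(lambda target: "altheism" if target == 0 else "christian")
# Note: replacing "\n" with "" fuses words across line breaks (e.g. "cosponsoredby" in the output);
# replacing it with " " instead would preserve word boundaries.
data["text"] = data["text"].apply(lambda text: text.replace("\n", ""))
data.head()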

Unnamed: 0,text,target
0,"Library of Congress to Host Dead Sea Scroll Symposium April 21-22 To: National and Assignment desks, Daybook Editor Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191 both of the Library of Congress WASHINGTON, April 19 -- A symposium on the Dead Sea Scrolls will be held at the Library of Congress on Wednesday,April 21, and Thursday, April 22. The two-day program, cosponsoredby the library and Baltimore Hebrew University, with additionalsupport from the Project Judaica Foundation, will be held in thelibrary's Mumford Room, sixth floor, Madison Building. Seating is limited, and admission to any session of the symposiummust be requested in writing (see Note A). The symposium will be held one week before the public opening of amajor exhibition, ""Scrolls from the Dead Sea: The Ancient Library ofQumran and Modern Scholarship,"" that opens at the Library of Congresson April 29. On view will be fragmentary scrolls and archaeologicalartifacts excavated at Qumran, on loan from the Israel AntiquitiesAuthority. Approximately 50 items from Library of Congress specialcollections will augment these materials. The exhibition, on view inthe Madison Gallery, through Aug. 1, is made possible by a generousgift from the Project Judaica Foundation of Washington, D.C. The Dead Sea Scrolls have been the focus of public and scholarlyinterest since 1947, when they were discovered in the desert 13 mileseast of Jerusalem. The symposium will explore the origin and meaningof the scrolls and current scholarship. Scholars from diverseacademic backgrounds and religious affiliations, will offer theirdisparate views, ensuring a lively discussion. The symposium schedule includes opening remarks on April 21, at2 p.m., by Librarian of Congress James H. Billington, and byDr. Norma Furst, president, Baltimore Hebrew University. 
Co-chairingthe symposium are Joseph Baumgarten, professor of Rabbinic Literatureand Institutions, Baltimore Hebrew University and Michael Grunberger,head, Hebraic Section, Library of Congress. Geza Vermes, professor emeritus of Jewish studies, OxfordUniversity, will give the keynote address on the current state ofscroll research, focusing on where we stand today. On the secondday, the closing address will be given by Shmaryahu Talmon, who willpropose a research agenda, picking up the theme of how the Qumranstudies might proceed. On Wednesday, April 21, other speakers will include: -- Eugene Ulrich, professor of Hebrew Scriptures, University ofNotre Dame and chief editor, Biblical Scrolls from Qumran, on ""TheBible at Qumran;"" -- Michael Stone, National Endowment for the Humanitiesdistinguished visiting professor of religious studies, University ofRichmond, on ""The Dead Sea Scrolls and the Pseudepigrapha."" -- From 5 p.m. to 6:30 p.m. a special preview of the exhibitionwill be given to symposium participants and guests. On Thursday, April 22, beginning at 9 a.m., speakers will include: -- Magen Broshi, curator, shrine of the Book, Israel Museum,Jerusalem, on ""Qumran: The Archaeological Evidence;"" -- P. Kyle McCarter, Albright professor of Biblical and ancientnear Eastern studies, The Johns Hopkins University, on ""The CopperScroll;"" -- Lawrence H. 
Schiffman, professor of Hebrew and Judaic studies,New York University, on ""The Dead Sea Scrolls and the History ofJudaism;"" and -- James VanderKam, professor of theology, University of NotreDame, on ""Messianism in the Scrolls and in Early Christianity."" The Thursday afternoon sessions, at 1:30 p.m., include: -- Devorah Dimant, associate professor of Bible and Ancient JewishThought, University of Haifa, on ""Qumran Manuscripts: Library of aJewish Community;"" -- Norman Golb, Rosenberger professor of Jewish history andcivilization, Oriental Institute, University of Chicago, on ""TheCurrent Status of the Jerusalem Origin of the Scrolls;"" -- Shmaryahu Talmon, J.L. Magnas professor emeritus of Biblicalstudies, Hebrew University, Jerusalem, on ""The Essential 'Commune ofthe Renewed Covenant': How Should Qumran Studies Proceed?"" will closethe symposium. There will be ample time for question and answer periods at theend of each session. Also on Wednesday, April 21, at 11 a.m.: The Library of Congress and The Israel Antiquities Authoritywill hold a lecture by Esther Boyd-Alkalay, consulting conservator,Israel Antiquities Authority, on ""Preserving the Dead Sea Scrolls""in the Mumford Room, LM-649, James Madison Memorial Building, TheLibrary of Congress, 101 Independence Ave., S.E., Washington, D.C. ------ NOTE A: For more information about admission to the symposium,please contact, in writing, Dr. Michael Grunberger, head, HebraicSection, African and Middle Eastern Division, Library of Congress,Washington, D.C. 20540. -30-",christian
1,"Anyone who dies for a ""cause"" runs the risk of dying for a lie. As forpeople being able to tell if he was a liar, well, we've had grifters andcharlatans since the beginning of civilization. If David Copperfield hadbeen the Messiah, I bet he could have found plenty of believers. Jesus was hardly the first to claim to be a faith healer, and he wasn't thefirst to be ""witnessed."" What sets him apart?Rubbish. Nations have followed crazies, liars, psychopaths, and megalomaniacs throughout history. Hitler, Tojo, Mussolini, Khomeini,Qadaffi, Stalin, Papa Doc, and Nixon come to mind...all from this century.Koresh is a non-issue.Take a discrete mathematics or formal logic course. There are flaws in yourlogic everywhere. And as I'm sure others will tell you, read the FAQ!Of course, you have to believe the Bible first. Just because something iswritten in the Bible does not mean it is true, and the age of that tome plusthe lack of external supporting evidence makes it less credible. So if youdo quote from the Bible in the future, try to back up that quote with supporting evidence. Otherwise, you will get flamed mercilessly.Just like weight lifting or guitar playing, eh? I don't know how you define the world ""total,"" but I would imagine a ""total sacrafice [sp]of everything for God's sake"" would involve more than a time commitment.You are correct about our tendency to ""box everything into time units.""Would you explain HOW one should involove God in sports and (hehehe)television?",altheism
2,"Woah...The context is about God's calling out a special people (the Jews) tocarry the ""promise."" To read the meaning as literal people is to miss Paul'sentire point. I'd be glad to send anyone more detailed explanations of thispassage if interested.",christian
3,"See, there you go again, saying that a moral act is only significantif it is ""voluntary."" Why do you think this?And anyway, humans have the ability to disregard some of their instincts.You are attaching too many things to the term ""moral,"" I think.Let's try this: is it ""good"" that animals of the same speciesdon't kill each other. Or, do you think this is right? Or do you think that animals are machines, and that nothing they dois either right nor wrong?Those weren't arbitrary killings. They were slayings related to some sortof mating ritual or whatnot.Yes it was, but I still don't understand your distinctions. Whatdo you mean by ""consider?"" Can a small child be moral? How abouta gorilla? A dolphin? A platypus? Where is the line drawn? Doesthe being need to be self aware?What *do* you call the mechanism which seems to prevent animals ofthe same species from (arbitrarily) killing each other? Don'tyou find the fact that they don't at all significant?",altheism
4,"The last sentence is ironic, since so many readers ofsoc.religion.christian seem to not be embarrassed by apologists such asJosh McDowell and C.S. Lewis. The above also expresses a rather odd senseof history. What makes you think the masses in Aquinas' day, who weremostly illiterate, knew any more about rhetoric and logic than most peopletoday? If writings from the period seem elevated consider that only thecream of the crop, so to speak, could read and write. If everyone inthe medieval period ""knew the rules"" it was a matter of uncriticallyaccepting what they were told.Bill Mayne",christian


In [90]:
#df.insert(0, 'New_ID', range(880, 880 + len(df)))
data.insert(0, 'New_ID', range(880, 880 + len(data)))

In [91]:
data.head(230).to_csv('out4.csv', index=False)