# Simple Boolean Search Engine for the Brown Corpus


Create a simple boolean search engine for the Brown Corpus.



# Brown Corpus

The Brown Corpus is a collection of writing samples that was compiled in 1961 by professors at Brown University. The corpus contains approximately 500 writing samples totalling approximately 1,000,000 words. All of the writing samples are from 1961 and were written in American English. The samples cover a wide range of topics and come from a variety of sources, such as books, magazines, and newspapers. Although small by modern standards, the corpus was considered very large in its day. For many decades it was the primary dataset used in NLP research.



There is a copy of the Brown Corpus available online. This code downloads the corpus into the variable `brown_corpus`.

In [None]:
import urllib.request, json
with urllib.request.urlopen("https://storage.googleapis.com/wd13/brown_corpus.json") as url:
  data = json.load(url)
  brown_corpus = data['brown_corpus']

`brown_corpus` is a list. Each element of the list is a document. Documents are stored as strings.

In [None]:
# print the text of document 10
docid = 10
print(brown_corpus[docid])


_MIAMI, FLA&, MARCH 17_
   - The Orioles tonight retained the distinction of
being the only winless team among the eighteen Major-League
clubs as they dropped their sixth straight spring exhibition
decision, this one to the Kansas City Athletics by
a score of 5 to 3.
   Indications as late as the top of the sixth were
that the Birds were to end their victory draought as
they coasted along with a 3-to-o advantage.
#SIEBERN HITS HOMER#
Over the first five frames, Jack Fisher, the big righthandler
who figures to be in the middle of Oriole plans for
a drive on the 1961 American League pennant, held the
~A's scoreless while yielding three scattered hits.
   Then Dick Hyde, submarine-ball hurler, entered the
contest and only five batters needed to face him before
there existed a 3-to-3 deadlock.
   A two-run homer by Norm Siebern and a solo blast
by Bill Tuttle tied the game, and single runs in the
eighth and ninth gave the Athletics their fifth victory
in eight starts.


The full text of each document is hosted at the url `https://storage.googleapis.com/wd13/brown_corpus/<docid>.txt` where `<docid>` is the index of the document in `brown_corpus`.

In [None]:
# print the url for document 102
docid = 102
print("https://storage.googleapis.com/wd13/brown_corpus/"+str(docid)+'.txt')

https://storage.googleapis.com/wd13/brown_corpus/102.txt


# Create a tokenizer

Write a function `tokenize` that takes a string and returns a list of tokens.

In [None]:
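import re

# Build the token list with a regular expression.
def tokenize(doc):
  # Lowercase, then split on any whitespace (spaces, tabs, newlines)
  # so that words on adjacent lines are not fused into one token.
  words = doc.lower().split()
  tokens = []
  for word in words:
    # Strip every character that is not a letter or digit.
    token = re.sub('[^a-z0-9]+', '', word)
    # Skip words that consisted only of punctuation.
    if token:
      tokens.append(token)
  return tokens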
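
A quick way to sanity-check a tokenizer is to run it on a short string with mixed case, punctuation, and a line break. The sketch below uses a hypothetical helper, `tokenize_findall`, that is not part of the assignment. It assumes a token is any maximal run of ASCII letters or digits, so a hyphenated string such as `3-to-3` becomes three tokens rather than one.

```python
import re

def tokenize_findall(doc):
    # Hypothetical alternative tokenizer (an assumption, not the notebook's own):
    # every maximal run of lowercase letters or digits is one token.
    return re.findall(r"[a-z0-9]+", doc.lower())

# The line break between "of" and "being" must not fuse the two words.
sample = "The Orioles tonight retained the distinction of\nbeing the only winless team."
tokens = tokenize_findall(sample)
print(tokens)
```

If the printed list contains `ofbeing`, the tokenizer is splitting on spaces only and missing newlines.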



# Create a Search Index

Create a dictionary called `search_index`. The keys in `search_index` will be the tokens in `brown_corpus`. Each token will be linked to a set containing the indexes of the documents that contain that token.

For example, the token 'ski' occurs in documents 109, 120, 148, 174, 457, and 482. Therefore,

`search_index['ski']`

will return the following set of document indexes

`{109, 120, 148, 174, 457, 482}`

In [None]:
search_index = {}

def dataIndexing(brown_corpus):
  # Map each token to the set of indexes of the documents that contain it.
  for index, doc in enumerate(brown_corpus):
    for token in tokenize(doc):
      # setdefault creates an empty set the first time a token is seen;
      # adding to a set ignores repeated occurrences in the same document.
      search_index.setdefault(token, set()).add(index)


dataIndexing(brown_corpus)
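
The token-to-document-set structure described above can be checked on a toy corpus before running it on all of `brown_corpus`. This is a minimal sketch, not the notebook's own code: `build_index` and `toy_corpus` are hypothetical names, and the inline regular expression stands in for whatever `tokenize` does.

```python
import re

def build_index(docs):
    # Hypothetical helper: maps each token to the set of document indexes
    # in which it appears.
    index = {}
    for doc_id, text in enumerate(docs):
        # Stand-in tokenizer (assumption): runs of lowercase letters or digits.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(token, set()).add(doc_id)
    return index

toy_corpus = ["We ski in winter.", "Winter is cold.", "They ski and ski again."]
toy_index = build_index(toy_corpus)

print(sorted(toy_index["ski"]))     # [0, 2]
print(sorted(toy_index["winter"]))  # [0, 1]
```

Because the values are sets, the repeated `ski` in the third document is recorded only once.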

# Create a Function for "AND" Queries

Write a function `query_and` that
* takes a query string `q`
* tokenizes `q`
* finds all of the documents in `brown_corpus` that contain *all* of the tokens in `q`
* prints links to each of those documents

In [None]:
def query_and(q):
  matches = None
  for token in tokenize(q):
    # A token that is missing from the index rules out every document.
    if token not in search_index:
      matches = set()
      break
    # Copy the document indexes so the index itself is never modified.
    docs = set(search_index[token])
    # Keep only the documents that also contain this token.
    matches = docs if matches is None else matches & docs

  # An empty query (matches is None) prints nothing.
  for i in sorted(matches or []):
    print("https://storage.googleapis.com/wd13/brown_corpus/"+str(i)+'.txt')

query_and('Hello world')

https://storage.googleapis.com/wd13/brown_corpus/57.txt
https://storage.googleapis.com/wd13/brown_corpus/328.txt
https://storage.googleapis.com/wd13/brown_corpus/431.txt
https://storage.googleapis.com/wd13/brown_corpus/494.txt
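
An "AND" query amounts to a set intersection over the document sets of the query tokens. The sketch below shows that idea on a small hand-written index. `toy_index` and `and_docs` are hypothetical names, and the sketch assumes that a token missing from the index should make the whole query match nothing.

```python
# Hypothetical hand-written index: token -> set of document indexes.
toy_index = {
    "hello": {0, 2, 5},
    "world": {2, 3, 5},
    "ski": {5},
}

def and_docs(index, tokens):
    # An empty query matches nothing (assumption).
    if not tokens:
        return set()
    # A missing token contributes an empty set, which empties the intersection.
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)

print(sorted(and_docs(toy_index, ["hello", "world"])))         # [2, 5]
print(sorted(and_docs(toy_index, ["hello", "world", "ski"])))  # [5]
print(sorted(and_docs(toy_index, ["hello", "missing"])))       # []
```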


# Create a Function for "OR" Queries

Write a function `query_or` that
* takes a query string `q`
* tokenizes `q`
* finds all of the documents in `brown_corpus` that contain *any* of the tokens in `q`
* prints links to each of those documents

In [None]:
def query_or(q):
  matches = set()
  for token in tokenize(q):
    # Tokens that are missing from the index contribute no documents.
    matches.update(search_index.get(token, ()))

  # Print the links in ascending order of document index.
  for i in sorted(matches):
    print("https://storage.googleapis.com/wd13/brown_corpus/"+str(i)+'.txt')

query_or('hello world')

https://storage.googleapis.com/wd13/brown_corpus/3.txt
https://storage.googleapis.com/wd13/brown_corpus/5.txt
https://storage.googleapis.com/wd13/brown_corpus/7.txt
https://storage.googleapis.com/wd13/brown_corpus/9.txt
https://storage.googleapis.com/wd13/brown_corpus/10.txt
https://storage.googleapis.com/wd13/brown_corpus/12.txt
https://storage.googleapis.com/wd13/brown_corpus/13.txt
https://storage.googleapis.com/wd13/brown_corpus/15.txt
https://storage.googleapis.com/wd13/brown_corpus/16.txt
https://storage.googleapis.com/wd13/brown_corpus/20.txt
https://storage.googleapis.com/wd13/brown_corpus/21.txt
https://storage.googleapis.com/wd13/brown_corpus/24.txt
https://storage.googleapis.com/wd13/brown_corpus/25.txt
https://storage.googleapis.com/wd13/brown_corpus/30.txt
https://storage.googleapis.com/wd13/brown_corpus/34.txt
https://storage.googleapis.com/wd13/brown_corpus/35.txt
https://storage.googleapis.com/wd13/brown_corpus/36.txt
https://storage.googleapis.com/wd13/brown_corpus/37.
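
An "OR" query amounts to a set union over the document sets of the query tokens. The sketch below shows that idea on a small hand-written index. `toy_index` and `or_docs` are hypothetical names, and the sketch assumes that a token missing from the index is simply ignored.

```python
# Hypothetical hand-written index: token -> set of document indexes.
toy_index = {
    "hello": {0, 2, 5},
    "world": {2, 3, 5},
    "ski": {5},
}

def or_docs(index, tokens):
    # Union of every token's document set; unknown tokens add nothing.
    return set().union(*(index.get(token, set()) for token in tokens))

print(sorted(or_docs(toy_index, ["hello", "world"])))  # [0, 2, 3, 5]
print(sorted(or_docs(toy_index, ["ski", "missing"])))  # [5]
print(sorted(or_docs(toy_index, ["missing"])))         # []
```

A common word such as `world` appears in many documents, which is why an "OR" query usually returns far more links than the matching "AND" query.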

# Try Different Queries

Test out both your `query_and` and `query_or` functions.

In [None]:
query_and('hello world')

https://storage.googleapis.com/wd13/brown_corpus/57.txt
https://storage.googleapis.com/wd13/brown_corpus/328.txt
https://storage.googleapis.com/wd13/brown_corpus/431.txt
https://storage.googleapis.com/wd13/brown_corpus/494.txt


In [None]:
query_or('hello world')

https://storage.googleapis.com/wd13/brown_corpus/3.txt
https://storage.googleapis.com/wd13/brown_corpus/5.txt
https://storage.googleapis.com/wd13/brown_corpus/7.txt
https://storage.googleapis.com/wd13/brown_corpus/9.txt
https://storage.googleapis.com/wd13/brown_corpus/10.txt
https://storage.googleapis.com/wd13/brown_corpus/12.txt
https://storage.googleapis.com/wd13/brown_corpus/13.txt
https://storage.googleapis.com/wd13/brown_corpus/15.txt
https://storage.googleapis.com/wd13/brown_corpus/16.txt
https://storage.googleapis.com/wd13/brown_corpus/20.txt
https://storage.googleapis.com/wd13/brown_corpus/21.txt
https://storage.googleapis.com/wd13/brown_corpus/24.txt
https://storage.googleapis.com/wd13/brown_corpus/25.txt
https://storage.googleapis.com/wd13/brown_corpus/30.txt
https://storage.googleapis.com/wd13/brown_corpus/34.txt
https://storage.googleapis.com/wd13/brown_corpus/35.txt
https://storage.googleapis.com/wd13/brown_corpus/36.txt
https://storage.googleapis.com/wd13/brown_corpus/37.