<a href="https://colab.research.google.com/github/Anson3208/Sentiment-Textmining-Analysis-Learning/blob/main/02_Search_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2 Search Index


# Brown Corpus

The Brown Corpus is a collection of writing samples compiled in the 1960s by W. Nelson Francis and Henry Kučera at Brown University. The corpus contains approximately 500 writing samples totalling approximately 1,000,000 words. All of the samples were published in 1961 and written in American English. They cover a wide range of topics and come from a variety of sources, such as books, magazines, and newspapers. Although small by modern standards, the corpus was considered very large in its day, and for decades it was one of the most widely used datasets in NLP research.



A copy of the Brown Corpus is available online. The following code downloads it and stores it in the variable `brown_corpus`.

In [None]:
import urllib.request, json 
with urllib.request.urlopen("https://storage.googleapis.com/wd13/brown_corpus.json") as url:
  data = json.load(url)
  brown_corpus = data['brown_corpus']

`brown_corpus` is a list. Each element of the list is a document. Documents are stored as strings. 

In [None]:
# print the text of document 10
docid = 10
print(brown_corpus[docid])

_MIAMI, FLA&, MARCH 17_
   - The Orioles tonight retained the distinction of
being the only winless team among the eighteen Major-League
clubs as they dropped their sixth straight spring exhibition
decision, this one to the Kansas City Athletics by
a score of 5 to 3.
   Indications as late as the top of the sixth were
that the Birds were to end their victory draought as
they coasted along with a 3-to-o advantage.
#SIEBERN HITS HOMER#
Over the first five frames, Jack Fisher, the big righthandler
who figures to be in the middle of Oriole plans for
a drive on the 1961 American League pennant, held the
~A's scoreless while yielding three scattered hits.
   Then Dick Hyde, submarine-ball hurler, entered the
contest and only five batters needed to face him before
there existed a 3-to-3 deadlock.
   A two-run homer by Norm Siebern and a solo blast
by Bill Tuttle tied the game, and single runs in the
eighth and ninth gave the Athletics their fifth victory
in eight starts.


The full text of each document is hosted at the URL `https://storage.googleapis.com/wd13/brown_corpus/<docid>.txt`, where `<docid>` is the index of the document in `brown_corpus`.

In [None]:
# print the url for document 102
docid = 102
print("https://storage.googleapis.com/wd13/brown_corpus/"+str(docid)+'.txt')

https://storage.googleapis.com/wd13/brown_corpus/102.txt


# Create a tokenizer

Write a function `tokenize` that takes a string and returns a list of tokens. 

In [None]:
import urllib.request, json
with urllib.request.urlopen("https://storage.googleapis.com/wd13/stopwords%20and%20lemmas.json") as url:  # download the stopword and lemma data
  data = json.load(url)          # parse the JSON response into a Python dictionary
  stopwords = data['stopwords']  # the list of words that should be excluded from the tokens
  lemmas = data['lemmas']        # dictionary mapping a word to its lemma, e.g. 'abandoned' -> 'abandon'


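import re
def tokenize(doc):
  lowercase_doc = doc.lower()                         # normalize case so 'Ski' and 'ski' match
  re_pattern = r'[A-Za-z0-9]+'                        # raw string; + matches one or more consecutive letters or digits
  raw_tokens = re.findall(re_pattern, lowercase_doc)  # every alphanumeric run becomes a raw token
  clean_tokens = []
  for token in raw_tokens:
    if token not in stopwords:                        # skip stopwords entirely
      if token in lemmas:
        lemma = lemmas[token]                         # replace the word with its lemma when one is known
      else:
        lemma = token                                 # otherwise keep the word unchanged
      clean_tokens.append(lemma)
  return clean_tokens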
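
To see what each step of the tokenizer does, here is a minimal self-contained sketch run on one sentence. The names `toy_stopwords`, `toy_lemmas`, and `toy_tokenize` are hypothetical, and the small stopword set and lemma dictionary are made-up stand-ins for the downloaded data, not its real contents.

```python
import re

# Hypothetical stand-ins for the downloaded stopwords and lemmas (not the real data)
toy_stopwords = {"the", "a", "to", "by"}
toy_lemmas = {"dropped": "drop", "clubs": "club"}

def toy_tokenize(doc, stopwords, lemmas):
    """Lowercase, split into alphanumeric runs, drop stopwords, map words to lemmas."""
    raw_tokens = re.findall(r'[a-z0-9]+', doc.lower())
    return [lemmas.get(token, token) for token in raw_tokens if token not in stopwords]

tokens = toy_tokenize("The clubs dropped a 3-to-3 decision.", toy_stopwords, toy_lemmas)
print(tokens)  # ['club', 'drop', '3', '3', 'decision']
```

Note that the hyphenated score `3-to-3` is split into three raw tokens because the pattern only matches letters and digits, and the middle one is then removed as a stopword.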



# Create a Search Index

Create a dictionary called `search_index`. The keys in `search_index` will be the tokens in `brown_corpus`. Each token will be linked to a set containing the indexes of the documents that contain that token. 

For example, the token 'ski' occurs in documents 109, 120, 148, 174, 457, and 482. Therefore, 

`search_index['ski']`

will return the following set of document indexes

`{109, 120, 148, 174, 457, 482}`
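
The same structure can be sketched on a tiny scale before building it for the whole corpus. The example below assumes a hypothetical three-document corpus, `toy_docs`, that has already been tokenized; `toy_index` is likewise an illustrative name.

```python
# Hypothetical three-document corpus, already tokenized for brevity
toy_docs = [
    ["ski", "resort", "snow"],   # document 0
    ["snow", "storm"],           # document 1
    ["ski", "race", "ski"],      # document 2 ('ski' occurs twice)
]

toy_index = {}
for docid, tokens in enumerate(toy_docs):
    for token in tokens:
        # setdefault creates an empty set the first time a token is seen
        toy_index.setdefault(token, set()).add(docid)

print(sorted(toy_index["ski"]))   # [0, 2]
print(sorted(toy_index["snow"]))  # [0, 1]
```

Because each entry is a set, document 2 appears only once under `'ski'` even though the token occurs twice in that document.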

In [None]:
search_index = {}                           # maps each token to the set of document indexes containing it
for index, doc in enumerate(brown_corpus):  # visit every document together with its index
  for token in tokenize(doc):               # tokenize the document text
    if token in search_index:               # token seen before:
      search_index[token].add(index)        #   add this document index to the token's existing set
    else:                                   # first occurrence of the token:
      search_index[token] = {index}         #   start a new set, e.g. {'apple': {1, 2, 3}}

# {index} creates a set, which grows with .add(); [index] would create a list, which grows with .append().
# A set stores each document index only once, even when a token occurs many times in the same document.

In [None]:
search_index['ski']

{109, 120, 148, 174, 457, 482}

# Create a Function for "AND" Queries

Write a function `query_and` that
* takes a query string `q`
* tokenizes `q`
* finds all of the documents in `brown_corpus` that contain *all* of the tokens in `q`
* prints links to each of those documents
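
The core of an "AND" query is a set intersection over the document sets of the query tokens. Here is a minimal sketch, assuming a hypothetical `postings` dictionary with made-up document ids in place of the real `search_index`:

```python
# Hypothetical posting sets standing in for search_index (made-up document ids)
postings = {
    "hello": {57, 239, 328},
    "world": {3, 57, 328, 400},
}

def and_docids(query_tokens):
    """Return the ids of documents that contain every query token."""
    sets = [postings.get(token, set()) for token in query_tokens]
    if not sets:                      # no tokens left after tokenizing
        return set()
    return set.intersection(*sets)    # ids present in all of the sets

print(sorted(and_docids(["hello", "world"])))    # [57, 328]
print(sorted(and_docids(["hello", "missing"])))  # []
```

Looking tokens up with `.get(token, set())` means a token that never occurs in the corpus yields an empty result rather than a `KeyError`.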

In [None]:
# Objective: find all documents in brown_corpus that contain every token in the query ("AND")


def query_and(q):
  # Look up the document set for each query token; a token that is not in the index maps to an empty set
  list_of_sets = [search_index.get(token, set()) for token in tokenize(q)]
  if not list_of_sets:                                                 # query was empty or contained only stopwords
    print([])
    return
  list_of_docid = list(list_of_sets[0].intersection(*list_of_sets))    # keep only the ids present in every set
  print(list_of_docid)
  for docid in list_of_docid:                                          # print one link per matching document
    print("https://storage.googleapis.com/wd13/brown_corpus/" + str(docid) + '.txt')

# Create a Function for "OR" Queries

Write a function `query_or` that
* takes a query string `q`
* tokenizes `q`
* finds all of the documents in `brown_corpus` that contain *any* of the tokens in `q`
* prints links to each of those documents
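
An "OR" query swaps the intersection for a union. The sketch below again uses a hypothetical `postings` dictionary with made-up document ids rather than the real `search_index`:

```python
# Hypothetical posting sets standing in for search_index (made-up document ids)
postings = {
    "hello": {57, 239, 328},
    "world": {3, 57, 328, 400},
}

def or_docids(query_tokens):
    """Return the ids of documents that contain at least one query token."""
    sets = [postings.get(token, set()) for token in query_tokens]
    return set().union(*sets)         # also works when sets is empty

print(sorted(or_docids(["hello", "world"])))    # [3, 57, 239, 328, 400]
print(sorted(or_docids(["hello", "missing"])))  # [57, 239, 328]
```

Starting from an empty set means the call needs no special case for a query with no tokens, and an unknown token simply contributes nothing.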

In [None]:

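def query_or(q):
  # Look up the document set for each query token; a token that is not in the index maps to an empty set
  list_of_sets = [search_index.get(token, set()) for token in tokenize(q)]
  if not list_of_sets:                                          # query was empty or contained only stopwords
    print([])
    return
  list_of_docid = list(list_of_sets[0].union(*list_of_sets))    # collect the ids present in any set
  print(list_of_docid)
  for docid in list_of_docid:                                   # print one link per matching document
    print("https://storage.googleapis.com/wd13/brown_corpus/" + str(docid) + '.txt')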

# Try Different Queries

Test out both your `query_and` and `query_or` functions. 

In [None]:
query_and('hello world')

[328, 494, 239, 431, 435, 57]
https://storage.googleapis.com/wd13/brown_corpus/328.txt
https://storage.googleapis.com/wd13/brown_corpus/494.txt
https://storage.googleapis.com/wd13/brown_corpus/239.txt
https://storage.googleapis.com/wd13/brown_corpus/431.txt
https://storage.googleapis.com/wd13/brown_corpus/435.txt
https://storage.googleapis.com/wd13/brown_corpus/57.txt


In [None]:
query_or('hello world')

[3, 5, 7, 9, 10, 12, 13, 14, 15, 16, 19, 20, 21, 24, 25, 26, 28, 30, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44, 45, 47, 48, 49, 50, 52, 53, 54, 56, 57, 60, 61, 62, 63, 64, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 77, 78, 79, 80, 83, 84, 87, 88, 89, 90, 91, 92, 93, 95, 96, 97, 98, 99, 101, 102, 103, 104, 105, 106, 107, 108, 110, 111, 113, 114, 116, 120, 121, 122, 130, 131, 139, 140, 142, 145, 146, 147, 149, 150, 152, 153, 154, 158, 159, 163, 165, 166, 167, 168, 169, 171, 174, 176, 177, 178, 181, 182, 183, 186, 188, 190, 191, 192, 193, 194, 195, 197, 198, 199, 200, 201, 203, 204, 205, 206, 207, 208, 209, 210, 211, 213, 214, 215, 216, 219, 220, 221, 223, 224, 225, 226, 227, 228, 229, 230, 231, 234, 236, 238, 239, 240, 241, 243, 244, 245, 247, 252, 253, 254, 255, 257, 258, 261, 263, 264, 265, 266, 267, 269, 270, 273, 275, 276, 277, 280, 285, 286, 289, 292, 293, 294, 296, 297, 319, 320, 322, 324, 325, 328, 332, 333, 335, 337, 338, 339, 343, 346, 348, 349, 351, 354, 357, 359, 360, 361, 362, 