# Self study 2


In this self-study we build an index that supports Boolean search over the web pages that you crawl with the crawler from the 1st self study. You can continue to just extract the titles of the web-pages you crawl, or you can be more adventurous and look at the whole text that you get from the .get_text() method of a BeautifulSoup parser. In either case, the collection of texts from the crawled web-pages is you corpus. You should then:

- construct the vocabulary of terms for your corpus
- build an 'inverted' index for your vocabulary
- implement Boolean search for your index (

In [11]:
# Some things already used in self study 1:
import requests
from bs4 import BeautifulSoup
import json



A useful resource is the nltk natural language processing package:
https://www.nltk.org/
which provides methods for tokenization, stemming, and much more (the 'punkt' package is needed for tokenization):

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Bruger\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

Now let's use the title string of the AAU homepage as an example:

In [4]:
r=requests.get('https://www.aau.dk/')
r_parse = BeautifulSoup(r.text, 'html.parser')
string=r_parse.find('title').string
print(string)

AAU - Viden for verden


We can tokenize:

In [5]:
tokens=nltk.word_tokenize(string)
for t in tokens:
    print(t)

AAU
-
Viden
for
verden


And we can stem:

In [6]:
ps=nltk.PorterStemmer()
for t in tokens:
    print(ps.stem(t))



aau
-
viden
for
verden


For Danish language the Porter stemmer will not be terribly useful! There is also a Danish option:

In [7]:
from nltk.stem.snowball import SnowballStemmer

dstemmer=SnowballStemmer("danish")

In [8]:
for t in tokens:
    print(dstemmer.stem(t))


aau
-
vid
for
verd


What is most useful for you depends on which websites you crawl. It is not essential for the exercise that the stemming always is the best possible ...!

In [14]:
with open('crawled.json', "r") as jfile:
    results = json.load(jfile)
print(len(results)) 


500


In [24]:
ps = nltk.PorterStemmer()
id = 0
for e in results:
    try:
        title = e['title']
        tokens=nltk.word_tokenize(title)
        toks = []
        for token in tokens:
            toks.append(ps.stem(token))
        e['tokens'] = toks
        e['id'] = id
        id += 1
    except:
        print('woops big error WTF!!!')
        print(e)
        e['tokens'] = []
        e['id'] = -1
        print('\n\n\n')
    
    

woops big error WTF!!!
{'url': 'https://web.archive.org/web/20210128124326/https://www.loveandlemons.com/carrot-cake/', 'title': None, 'tokens': []}




woops big error WTF!!!
{'url': 'https://web.archive.org/web/20210624080114/https://www.loveandlemons.com/easy-vegetarian-chili/', 'title': None, 'tokens': []}




woops big error WTF!!!
{'url': 'https://web.archive.org/web/20200621055810/http://bit.ly/17IFI5O', 'title': None, 'tokens': []}




woops big error WTF!!!
{'url': 'https://web.archive.org/web/20210524041406/https://www.loveandlemons.com/avocado-quinoa-stuffed-acorn-squash/', 'title': None, 'tokens': []}




woops big error WTF!!!
{'url': 'https://web.archive.org/web/20210508150700/http://georgessf.com/', 'title': None, 'tokens': []}






In [66]:
invertedIndex ={}
for e in results:
    for token in e['tokens']:
        if token in invertedIndex.keys():
            invertedIndex[token].append(e['id'])
        else:
            invertedIndex[token] = [e['id']]
    

In [None]:
invertedIndex


In [27]:
print(len(invertedIndex))


737


In [81]:
def andMerge(l, m):
    result = []
    if len(l) == 0 or len(m) == 0:
        return result
    i = 0
    j = 0
    ie = l[i]
    je = m[j]
    cont = True;
    while cont:
        k = ie-je
        if k == 0:
            result.append(ie)
            if len(l) -1 > i and len(m) -1 > j:
                i += 1
                j += 1
                ie = l[i]
                je = m[j]
            else:
                cont = False;
        elif k < 0 and len(l)-1 > i:
            i += 1
            ie = l[i]
        elif k > 0 and len(m)-1 > j:
            j += 1
            je = m[j]
        else:
            cont = False
    return result
            

def search(searchstring):
    words = searchstring.split(" ")
    imm = []
    for w in words:
        imm.append(ps.stem(w))
    words = imm
    ids = []
    if len(words) == 1:
        w = words[0]
        if w in invertedIndex.keys():
            ids = invertedIndex[w]
    else:
        w = words.pop(0)
        lis = []
        if w in invertedIndex.keys():
            lis = invertedIndex[w]
        w = words.pop(0)
        lim = []
        if w in invertedIndex.keys():
            lim = invertedIndex[w]
        imm = andMerge(lis,lim)
        for word in words:
            if word in invertedIndex.keys():
                lim = invertedIndex[word]
            else:
                lim = []
            imm = andMerge(imm, lim)
        ids = imm
    res = []
    for i in ids:
        for e in results:
            if e['id'] == i:
                res.append(e['url'])
    return res
            
            

In [84]:
search("cookies")


['https://www.loveandlemons.com/easy-cookie-recipes/',
 'https://www.loveandlemons.com/peanut-butter-chocolate-chip-cookie-bars/',
 'https://www.loveandlemons.com/tahini-cookies/',
 'https://www.loveandlemons.com/peanut-butter-no-bake-cookies/',
 'https://www.loveandlemons.com/sugar-cookies/',
 'https://iamafoodblog.com/small-batch-bas-best-chocolate-chip-cookies/',
 'https://iamafoodblog.com/tag/cookies/',
 'https://www.loveandlemons.com/vegan-cherry-chocolate-oatmeal-cookies/']

In [64]:
invertedIndex['recip']


[]

In [65]:
invertedIndex


{'game': [0, 22, 35, 51],
 'grump': [0, 22, 35, 51],
 'love': [],
 'and': [1,
  2,
  3,
  5,
  6,
  8,
  9,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  23,
  25,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  36,
  37,
  39,
  40,
  41,
  41,
  42,
  43,
  45,
  46,
  48,
  49,
  50,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  61,
  62,
  63,
  64,
  65,
  67,
  68,
  69,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  78,
  79,
  80,
  81,
  81,
  82,
  83,
  86,
  86,
  87,
  89,
  90,
  91,
  92,
  93,
  95,
  96,
  98,
  100,
  101,
  102,
  104,
  105,
  106,
  107,
  109,
  110,
  112,
  114,
  115,
  116,
  117,
  119,
  120,
  121,
  123,
  124,
  127,
  128,
  129,
  130,
  131,
  133,
  134,
  137,
  138,
  140,
  143,
  144,
  146,
  148,
  149,
  150,
  152,
  154,
  155,
  156,
  158,
  160,
  161,
  162,
  164,
  166,
  169,
  171,
  172,
  173,
  174,
  176,
  177,
  178,
  179,
  180,
  182,
  183,
  187,
  190,
  191,
  193,
  194,
  