baby-search-engine

A baby search engine for CS600 course project using compressed trie and django.

Summary

We use a compressed trie to index the words appearing in the html files. We also store the occurrence list of each word as an independant file to the disk. In this list, we store the link's id and the frenquency of the word in this html. When searching using the engine, we first use the compressed trie to search the index of the given words and find the occurrence list on the disk. Finally we return the title and link that contains the given words.

Requirements

Python 3
bs4
goose3
nltk
django
requests

Instructions

Install all required softwares and packages

pip install -r requirements.txt

In the python console, download the stopwords and punkt corpus required by nltk

import nltk


nltk.download('stopwords')
nltk.download('punkt')

Run preprocess.py to download html files, extract information and construct trie:

python preprocess.py

Run the django server in your terminal, open your browser and type localhost:8000:

python ./main.py runserver 0.0.0.0:8000

Type some words to examine the search results, like 'election', 'debate' and 'minister' etc.

Notes

If you input empty text, the system will jump back to the index page.
We will remove the english stopwords in the query text input.
We limit the query text with 64 characters.
You can use space and punctuations to split the input words.
We will stem the input words to search the results.

Algoithms and Data Structures

Compressed Trie

We use a compressed trie as described in out textbook to store the words and their indexes.

Inserting We directly inserting words to the compressed trie. When inserting a word, we compare the current word with the key of the node and find the common prefix. Then we recursively do this operation to the end of this word. Finally we store the value of this word to the external node. A trick is we use a space as the key of external nodes so that we don't require a word not being the prefix of other words.
Searching Searching is similiar to inserting. But we need a perfect match of key and current word. We only claim the word exists in the trie until we find an external node with the key of space. Finally, we return the value of this word.

All time complexites are the same as those in our textbook.

Occurrence List

A problem is how to match the word with it's occurence. My solution in this baby search engine is we store the id of the word in the trie, and store the occurence list to the disk, of which the filename is the same as thw word id.

An occurence list consists of the link id which this word appears in, the term frequency of this word in each link and a score of each link corresponding to this word.

[
    [link_id, frequency, score],
    [link_id, frequency, score],
    ...
]

Ranking

In this baby engine, we use the combination of the frequency of a word and the page rank to evaluate each link to the given word. More specifically, the final score $s_{wi}$ of link $i$ to the word $w$ is $$ s_{wi} = f_{wi} \times r_i $$ where $f_{wi}$ is the term frequency of $w$ in link $i$, and $r_i$ is the page rank of link $i$. (See PageRank - Wikiedia)

Preprocess

The input of preprocess.py is a url file located in data/input/urls. Our program will automatically download the html file and extract the title, links contained in html and pure content of html. Then it will insert all words to the compressed trie and update each occurence list. Finally, it will calculate the score and save the trie to the disk.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
data/input		data/input
screenshots		screenshots
server		server
utils		utils
.gitignore		.gitignore
README.md		README.md
engine.py		engine.py
main.py		main.py
preprocess.py		preprocess.py
requirements.txt		requirements.txt
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

baby-search-engine

Summary

Requirements

Instructions

Notes

Algoithms and Data Structures

Compressed Trie

Occurrence List

Ranking

Preprocess

About

Releases

Packages

Languages

LuChang-CS/baby-search-engine

Folders and files

Latest commit

History

Repository files navigation

baby-search-engine

Summary

Requirements

Instructions

Notes

Algoithms and Data Structures

Compressed Trie

Occurrence List

Ranking

Preprocess

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages