Search-engine-algorithm-collection

A collection of algorithms used in search engines, implemented by me. The intention of this project is not to implement the most effective algorithms, but to implement a variety of techniques. The selection of algorithms is inspired by the Stanford course Introduction to Information Retrieval.



Structure

(project structure diagram)

How to run Google at home

Run google at home with the command:

python3 GoogleAtHome.py <query> <seed> <max_pages>
  • query: the query you want to search for
  • seed: the URL of the seed page
  • max_pages: the maximum number of pages you want to crawl

This will crawl the web starting at <seed>. The text content of the pages is saved to the temp directory. The search engine is then built on the crawled pages, and the results are printed to the console ranked according to tf-idf.


How to run the search engine

Place the files you want to index in the corpus/ directory. Currently, only `.txt` files are indexed. There are some demo files in the directory already.

Run the file:

python3 main.py --directory corpus --query <your query> --operator <your search operator> 

The current operators are:

  • AND
  • OR
  • N_OF_M
  • RANKING

The output of the query is a list of file names.

How to run the crawler

The crawler can be run with the command:

python3 crawler.py <seed> <max_pages>
  • seed: the URL of the seed page
  • max_pages: the maximum number of pages you want to crawl

This will crawl the web starting at <seed>. The text content of the pages is saved to the temp directory.

Inverted index

Inverted index functionality is implemented in the InvertedIndex.py file. A collection of documents is parsed and posting lists are created. Each list consists of a term and an ordered list of the document_ids containing the term. If a term occurs multiple times in a document, there is still only one entry for that document in the posting list.

"example" -> 8, 11, 22
"term" -> 11, 12

Two posting lists can be merged with the merge_and() and merge_or() functions, which apply the AND and OR operators respectively to decide what is included in the resulting posting list. This can be used to evaluate queries.
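
For illustration, here is a minimal sketch of how two sorted posting lists can be merged. The function names mirror merge_and() and merge_or() in InvertedIndex.py, but the bodies are an assumption, not the repository's exact code.

```python
# Merging sorted posting lists of document_ids (illustrative sketch).

def merge_and(a, b):
    """Intersection: document_ids present in both posting lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def merge_or(a, b):
    """Union: document_ids present in either posting list."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            result.append(a[i]); i += 1
        else:
            result.append(b[j]); j += 1
    return result + a[i:] + b[j:]

print(merge_and([8, 11, 22], [11, 12]))  # [11]
print(merge_or([8, 11, 22], [11, 12]))   # [8, 11, 12, 22]
```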

Normalizer

The normalizer is used to combine terms with the same meaning into the same group. This reduces the number of posting lists, saving space. In addition, query evaluation becomes more accurate, as words carrying the same meaning are made equivalent.

normalizing operations

  • lowercase: Example -> example
  • removing non-alphabetic characters: example[1*] -> example
  • stemming: ies -> i
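
As a rough sketch, the operations listed above could be combined like this. The stem argument is a hypothetical hook for the stemmer described in the next section, not the project's actual interface.

```python
import re

# Illustrative normalizer applying the operations listed above.
def normalize(term, stem=lambda t: t):
    term = term.lower()                 # Example -> example
    term = re.sub(r"[^a-z]", "", term)  # example[1*] -> example
    return stem(term)                   # apply stemming, e.g. ies -> i

print(normalize("Example[1*]"))  # "example"
```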

Stemming

Stemming is performed using a simplified Porter stemming algorithm. The algorithm truncates different endings of terms into one, so that different forms of the same word are transformed into the same form. NB: this algorithm is English-specific.
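
A heavily simplified, hypothetical suffix-stripping stemmer in the spirit of the above; the real Porter algorithm has many more rules and conditions.

```python
# Strip a few common English suffixes to collapse word forms (illustrative).
def simple_stem(term):
    for suffix, replacement in [("sses", "ss"), ("ies", "i"),
                                ("ing", ""), ("ed", ""), ("s", "")]:
        if term.endswith(suffix) and len(term) > len(suffix) + 1:
            return term[: -len(suffix)] + replacement
    return term

print(simple_stem("ponies"))   # poni
print(simple_stem("running"))  # runn
```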

Compressor

Compressing the inverted index has multiple benefits. A smaller index uses less memory, which allows bigger parts of large indexes to be held in memory at the same time. Another benefit is better query evaluation: normalization is a form of compression where multiple words with the same meaning, e.g. "do, does, doing", are interpreted as representing the same term.

The compressor removes stop words, e.g. "the, and, in, to, one", that are common but carry little meaning in isolation. One side effect of this is removing false positives in query evaluation, where the query "The quick fox" would otherwise match every document containing the term "the".
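
A minimal illustration of stop-word removal; the stop-word list here is a small sample, not the one used by the project's compressor.

```python
# Drop common terms that carry little meaning in isolation (illustrative).
STOP_WORDS = {"the", "and", "in", "to", "one", "a", "of"}

def remove_stop_words(terms):
    return [t for t in terms if t not in STOP_WORDS]

print(remove_stop_words(["the", "quick", "fox"]))  # ['quick', 'fox']
```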

Gap encoding

A compression technique used to reduce the size of the inverted index. The idea is to store the difference between two consecutive document_ids in the posting list instead of the actual document_id. Decoding a document_id is done by adding up the values of all elements up to and including its position in the gap-encoded list.

Before

"secondTerm" -> 11, 12
"firstTerm" -> 1, 2, 3, 8, 11

After

"secondTerm" -> 11,1
"firstTerm" -> 1,1,1,5,3
# decoding this -----^
# is done by 1+1+1+5 = 8
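
A small sketch of gap encoding and decoding over a sorted posting list, reproducing the example above (illustrative, not the project's implementation).

```python
# Store gaps between consecutive document_ids instead of the ids themselves.
def gap_encode(doc_ids):
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

# Recover the document_ids by summing the gaps from the start of the list.
def gap_decode(gaps):
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

print(gap_encode([1, 2, 3, 8, 11]))  # [1, 1, 1, 5, 3]
print(gap_decode([1, 1, 1, 5, 3]))   # [1, 2, 3, 8, 11]
```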

Variable byte encoding

A compression technique used to allow a buffer to contain numbers of different sizes. In our case, the smallest unit is 1 byte (8 bits). The first bit of each byte is used to signify whether the byte is the last byte in the sequence. The remaining 7 bits are used to store the value of the number.

encoding a number

1. transform the number into binary
  - 843 = 1101001011
2. split the binary into 7-bit chunks
  - 110 1001011
3. pad the first chunk with leading zeros to a full 7 bits
  - 0000110 1001011
4. prepend a 0 to every byte except the last, which gets a 1
  - 00000110 11001011

decoding a byte string

byteString = 00000110 11001011 10001010
1. read bytes until the first byte with a 1 in the first bit is found
  - 00000110 11001011
2. remove the first bit from all bytes
  - 0000110 1001011
3. remove the leading zeros from the first byte
  - 1101001011
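
A sketch of variable byte encoding and decoding following the scheme above, with the high bit set only on the last byte of each number (illustrative, not the project's implementation).

```python
# Variable byte encoding: 7 payload bits per byte; the high bit marks
# the last byte of a number.
def vb_encode(number):
    chunks = []
    while True:
        chunks.insert(0, number % 128)  # take 7 bits at a time
        if number < 128:
            break
        number //= 128
    chunks[-1] += 128                   # set the high bit on the last byte
    return bytes(chunks)

def vb_decode(byte_string):
    numbers, current = [], 0
    for byte in byte_string:
        if byte < 128:                  # continuation byte
            current = current * 128 + byte
        else:                           # last byte of this number
            numbers.append(current * 128 + (byte - 128))
            current = 0
    return numbers

print(vb_encode(843))                                          # b'\x06\xcb' = 00000110 11001011
print(vb_decode(bytes([0b00000110, 0b11001011, 0b10001010])))  # [843, 10]
```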

Gamma encoding

A compression technique used to allow a buffer to contain numbers of different sizes. A gamma-encoded number is represented as a combination of offset and length.

  • offset: the number in binary with the leading 1 cut off
  • length: the length of the offset in unary

encoding a number

1. create offset and length
  - offset: 13 -> 1101 -> 101
  - length: 3 -> 1110
2. concatenate length and offset to form the gamma-encoded number
  - 1110101
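
A sketch of gamma encoding and decoding for numbers >= 1, following the offset/length description above (illustrative).

```python
# Gamma code: unary length of the offset, followed by the offset itself.
def gamma_encode(number):
    offset = bin(number)[3:]          # binary with the leading '1' cut off
    length = "1" * len(offset) + "0"  # length of the offset in unary
    return length + offset

def gamma_decode(bits):
    length = bits.index("0")                     # count the leading 1s
    offset = bits[length + 1 : length + 1 + length]
    return int("1" + offset, 2)                  # put the leading 1 back

print(gamma_encode(13))         # 1110101
print(gamma_decode("1110101"))  # 13
```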

Search engine

Given a query, the search engine evaluates the string and finds matching documents. Which documents match depends on how the query is evaluated, and multiple search functions are provided.

Search(operator = "AND"):

After parsing the query and normalizing and stemming the terms, it looks for documents where all search terms are present.

Search(operator = "OR"):

After parsing the query and normalizing and stemming the terms, it looks for documents where at least one search term is present.

Search_n_of_m():

After parsing the query and normalizing and stemming the terms, it looks for documents where at least a specified percentage of the terms are present. A query consisting of 4 terms and a match_percentage of 50% means any document containing at least 2 of the search terms is a match.
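
A hypothetical sketch of n-of-m matching over the inverted index described earlier; the dictionary layout and names are assumptions, not the project's code.

```python
from collections import Counter

# Count in how many query posting lists each document appears, and keep
# the documents that reach the required match percentage.
def search_n_of_m(query_terms, index, match_percentage=0.5):
    counts = Counter()
    for term in query_terms:
        for doc_id in index.get(term, []):
            counts[doc_id] += 1
    needed = match_percentage * len(query_terms)
    return sorted(doc for doc, hits in counts.items() if hits >= needed)

index = {"quick": [1, 3], "brown": [1], "fox": [2, 3], "jumps": [3]}
print(search_n_of_m(["quick", "brown", "fox", "jumps"], index, 0.5))  # [1, 3]
```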

Ranker

Executing a query using the search engine returns a list of documents that match the query. The ranker takes this list and ranks the documents based on how relevant they are to the query. The ranker is implemented in the Ranker.py file. The relevance of document d for a given term t is determined by the log of the term frequency (tf).

  • $w_{t,d} = 1 + log_{10}(tf_{t,d})$ if $tf_{t,d} > 0$
  • $w_{t,d} = 0$ otherwise

The more occurrences of a term in a document, the more relevant the document is. Furthermore, we wish to give common terms a lower weight and uncommon terms a higher weight. This is done by weighting the term frequency with the inverse document frequency (idf).

  • $idf_{t} = log_{10}\frac{N}{df_{t}}$
  • document frequency (df) is the number of documents containing the term
  • N is the total number of documents

The weight of a term t for a document d is then

  • $w_{t,d} = log_{10}(1+ tf_{t,d}) * log_{10}\frac{N}{df_{t}}$

Scoring a document d for a query q is done by summing the weights of all terms in the query for the document.

  • $score_{d,q} = \sum_{t \in q} w_{t,d}$
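
A small sketch of this scoring scheme; the data layout (per-document term frequencies and a document frequency table) is an assumption for illustration.

```python
import math

# tf-idf weight of term t in document d, following the formulas above.
def weight(tf, df, N):
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df)

# Score of a document for a query: sum of the query terms' weights.
def score(query_terms, doc_term_frequencies, df, N):
    return sum(weight(doc_term_frequencies.get(t, 0), df.get(t, 0), N)
               for t in query_terms)

# One hypothetical document out of N=10, containing "fox" 3 times.
print(score(["quick", "fox"], {"fox": 3}, {"quick": 5, "fox": 2}, N=10))
```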

Query evaluation

One naive solution for evaluating a query is to calculate the score for every document. This is, however, not optimal, as it requires scoring the whole collection. A better solution is to only calculate the score for documents that match the query: the inverted index is used to find the matching documents, and only those are scored and ranked. Other techniques can be used to reduce the number of documents that need to be ranked, but they are not implemented in this project (yet).
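
As a sketch, candidate documents can be gathered from the posting lists first and only those scored; the names are assumptions.

```python
# Gather candidates from the posting lists of the query terms, then rank
# only those candidates with a scoring function such as the tf-idf score above.
def rank_matching(query_terms, index, score_fn):
    candidates = set()
    for term in query_terms:
        candidates.update(index.get(term, []))  # union of posting lists
    return sorted(candidates,
                  key=lambda doc_id: score_fn(doc_id, query_terms),
                  reverse=True)
```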

Web crawler

A web crawler, commonly referred to as a spider, is a program used to index the internet. It reads the content of a website and extracts links to other websites, then visits the linked websites and repeats the process. The crawler is implemented in the Crawler.py file, using the requests library to fetch the content of a website and the BeautifulSoup library to parse it.


The standard process

  1. Get a URL from the queue
  2. Register the URL as visited
  3. Check the robots.txt file to see if the URL is allowed to be crawled
    1. check if the robots.txt is cached
    2. if not, fetch <url>/robots.txt and parse it
  4. Fetch the content of the website
    1. Save the text content to a file
    2. Add all links to the queue if they are not already in the queue
  5. Repeat until n sites have been crawled
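
A condensed sketch of this loop using requests and BeautifulSoup; it omits the robots.txt caching and error handling that Crawler.py needs, and the names are illustrative.

```python
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages):
    queue, visited, pages = deque([seed]), set(), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()                       # 1. get a URL from the queue
        if url in visited:
            continue
        visited.add(url)                            # 2. register the URL as visited
        robots = urllib.robotparser.RobotFileParser(
            urljoin(url, "/robots.txt"))            # 3. check robots.txt (no caching here)
        robots.read()
        if not robots.can_fetch("*", url):
            continue
        response = requests.get(url, timeout=5)     # 4. fetch the content of the website
        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text()                #    save the text content (the project writes it to temp/)
        for link in soup.find_all("a", href=True):  #    add new links to the queue
            absolute = urljoin(url, link["href"])
            if absolute not in visited:
                queue.append(absolute)
    return pages                                    # 5. stop after max_pages sites
```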
