📚 Crawl Wikipedia webpages for Computer Science and use them to build a small web search engine

EthanWng97/Web-Search-Engine-for-Computer-Science

Web-Search-Engine-for-Computer-Science

Full document

Medium

Prerequisites

  • Python 3.7.6

Install

git clone https://github.com/NavePnow/Web-Search-Engine-for-Computer-Science.git

Usage

Crawling

python3 crawling.py

Parameter settings

url = "https://en.wikipedia.org/wiki/Outline_of_computer_science"    
crawl = Crawl(url, 'robots.txt')    
crawl.get_rules()    
crawl.crawl('./webpage', num=2000)
  1. url: the root link where crawling starts
  2. Crawl(url, 'robots.txt'): the second argument is Wikipedia's robots.txt
  3. crawl.crawl('./webpage', num=2000): the folder where crawled webpages are saved and the number of webpages to crawl
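The README does not show how `Crawl` applies the robots.txt rules. As a minimal sketch of the general idea, and not the project's actual implementation, the standard-library `urllib.robotparser` can decide whether a URL may be fetched. The `ROBOTS_TXT` content and the helper names `build_parser` and `allowed` are assumptions made for illustration, so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not Wikipedia's real file),
# inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, agent="*"):
    """Return True if the rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
article_ok = allowed(parser, "https://en.wikipedia.org/wiki/Outline_of_computer_science")
script_ok = allowed(parser, "https://en.wikipedia.org/w/index.php?title=Foo")
print(article_ok, script_ok)
```

A crawler would typically run such a check on every discovered link before adding it to the crawl queue.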

Index

python3 index.py -i directory-of-documents -d dictionary-file -p postings-file
  1. directory-of-documents: the directory containing the documents to be indexed.
  2. dictionary-file: stores array a, the average document length, information about each document, and the dictionary.
  3. postings-file: stores the tf (term frequency), doc_id, and positions of each term.
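The on-disk format written by index.py is not described here. The following is only a sketch of the kind of data listed above: a positional inverted index held in memory, from which tf, doc_id, positions, and the average document length can be read. The function names `tokenize` and `build_index` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a positional index from a mapping of doc_id -> text.

    Returns (postings, doc_lengths, avg_len), where postings maps
    term -> {doc_id: [positions]} and tf is the length of a position list.
    """
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lengths[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(pos)
    avg_len = sum(doc_lengths.values()) / len(doc_lengths) if doc_lengths else 0.0
    return dict(postings), doc_lengths, avg_len


# Hypothetical sample documents.
docs = {
    1: "Computer science is the study of computation",
    2: "Science of algorithms and computation",
}
postings, doc_lengths, avg_len = build_index(docs)
print(postings["science"], doc_lengths, avg_len)
```

Keeping positions rather than only counts makes phrase or proximity queries possible, at the cost of a larger postings file.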

Search

python3 search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results
  1. dictionary-file: generated by index.py; contains array a, the average document length, information about each document, and the dictionary.
  2. postings-file: generated by index.py; contains the tf, doc_id, and positions of each term.
  3. query-file: contains the queries to be tested.
  4. output-file-of-results: contains the results and corresponding URLs for the queries in query-file.
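The README does not say which ranking function search.py uses. Because the dictionary file stores an average document length, a length-normalised function such as BM25 is one plausible choice. The sketch below assumes BM25 with common default parameters (`k1=1.2`, `b=0.75`). The function name `bm25_scores` and the sample data are hypothetical and not taken from the project.

```python
import math


def bm25_scores(query_terms, postings, doc_lengths, avg_len, k1=1.2, b=0.75):
    """Rank documents with BM25 (an assumed ranking function).

    postings maps term -> {doc_id: tf}; doc_lengths maps doc_id -> length.
    Returns a list of (doc_id, score) pairs, best first.
    """
    n_docs = len(doc_lengths)
    scores = {}
    for term in query_terms:
        term_postings = postings.get(term, {})
        df = len(term_postings)
        if df == 0:
            continue  # term does not occur in the collection
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in term_postings.items():
            # Length normalisation: longer documents are penalised.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical index data: term -> {doc_id: tf}.
postings = {"science": {1: 1, 2: 1}, "algorithms": {2: 1}}
doc_lengths = {1: 7, 2: 5}
ranked = bm25_scores(["science", "algorithms"], postings, doc_lengths, avg_len=6.0)
print(ranked)
```

In this toy data, document 2 matches both query terms and is shorter than average, so it ranks first.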

Advanced options

-e  enable query expansion
-f  disable relevance feedback
-s  enable printing score
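How search.py parses these flags is not shown. One way to model them, purely as an illustration, is with `argparse`. The destination names (`expand`, `feedback`, `print_score`) are assumptions. Since `-f` disables relevance feedback, the sketch treats feedback as enabled by default.

```python
import argparse


def build_arg_parser():
    """Build a parser mirroring the documented flags (names are assumed)."""
    p = argparse.ArgumentParser(prog="search.py")
    p.add_argument("-d", dest="dictionary_file", required=True)
    p.add_argument("-p", dest="postings_file", required=True)
    p.add_argument("-q", dest="query_file", required=True)
    p.add_argument("-o", dest="output_file", required=True)
    # -e switches query expansion on; it is off unless requested.
    p.add_argument("-e", dest="expand", action="store_true")
    # -f switches relevance feedback off; store_false defaults to True.
    p.add_argument("-f", dest="feedback", action="store_false")
    # -s switches score printing on.
    p.add_argument("-s", dest="print_score", action="store_true")
    return p


args = build_arg_parser().parse_args(
    ["-d", "dictionary.txt", "-p", "postings.txt",
     "-q", "queries.txt", "-o", "output.txt", "-e"]
)
print(args.expand, args.feedback, args.print_score)
```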

Run tests

python3 search.py -d dictionary.txt -p postings.txt -q queries.txt -o output.txt

Authors

👤 Evan: Crawling, PageRank, Indexer, query expansion

👤 Rulin: Searcher, relevance feedback

🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

💰 Show your support

Give a ⭐️ if this project helped you!


