- Python 3.7.6
git clone https://github.com/NavePnow/Web-Search-Engine-for-Computer-Science.git
python3 crawling.py
Parameter settings
url = "https://en.wikipedia.org/wiki/Outline_of_computer_science"
crawl = Crawl(url, 'robots.txt')
crawl.get_rules()
crawl.crawl('./webpage', num=2000)
- url: root link for crawling
- Crawl(url, 'robots.txt'): Wikipedia's robots.txt file
- crawl.crawl('./webpage', num=2000): folder for the saved webpages and the number of webpages to crawl
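The crawler takes a robots.txt file so that it can skip disallowed pages. The snippet below is a minimal sketch of that idea using the standard library's urllib.robotparser; it is not the project's Crawl class. The rule text and the helper names load_rules and allowed are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the real crawler reads a robots.txt file.
SAMPLE_RULES = """\
User-agent: *
Disallow: /w/
"""


def load_rules(text):
    """Parse robots.txt content (a string) into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, url, agent="*"):
    """Return True if `agent` may fetch `url` under the parsed rules."""
    return parser.can_fetch(agent, url)


rules = load_rules(SAMPLE_RULES)
root_ok = allowed(rules, "https://en.wikipedia.org/wiki/Outline_of_computer_science")
script_ok = allowed(rules, "https://en.wikipedia.org/w/index.php")
print(root_ok, script_ok)
```

A crawler built this way would call allowed() on each extracted link before adding it to the download queue.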
python3 index.py -i directory-of-documents -d dictionary-file -p postings-file
- Documents to be indexed are stored in directory-of-documents.
- dictionary-file: This file stores array a, the average length of all docs, info of docs, and the dictionary.
- postings-file: This file stores the tf, doc_id, and position of each term.
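To make the two files more concrete, here is a rough sketch of a positional index that records tf, doc_id, and positions per term, plus the average document length. It assumes a naive lowercase whitespace tokenizer and in-memory dictionaries; the function names build_index and term_entries are hypothetical, and the actual on-disk layout written by index.py may differ.

```python
from collections import defaultdict


def build_index(docs):
    """Build a positional index from a mapping of doc_id -> text.

    Returns (postings, doc_lengths, avg_len), where postings maps
    term -> {doc_id: [positions]}.
    """
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive tokenizer (an assumption)
        doc_lengths[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(pos)
    avg_len = sum(doc_lengths.values()) / len(doc_lengths) if doc_lengths else 0.0
    return postings, doc_lengths, avg_len


def term_entries(postings, term):
    """Return (doc_id, tf, positions) tuples for a term, sorted by doc_id."""
    return [(d, len(p), p) for d, p in sorted(postings.get(term, {}).items())]


# Toy documents, made up for this example.
docs = {
    1: "search engine for computer science",
    2: "computer science search",
}
postings, doc_lengths, avg_len = build_index(docs)
print(term_entries(postings, "search"))
print(avg_len)
```

Here tf is simply the number of recorded positions, so it does not need to be stored separately in this sketch.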
python3 search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results
- dictionary-file: This file is generated by index.py and contains array a, the average length of all docs, info of docs, and the dictionary.
- postings-file: This file is generated by index.py and contains the tf, doc_id, and position of each term.
- query-file: This file contains several queries to be tested.
- output-file-of-results: This file contains the results and corresponding URLs for the queries in queries.txt.
Advanced options
-e enable query expansion
-f disable relevance feedback
-s enable printing score
python3 search.py -d dictionary.txt -p postings.txt -q queries.txt -o output.txt
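For readers who want to see how the flags listed above fit together, the following is a hedged sketch of an equivalent command-line parser written with argparse. The real search.py may parse its arguments differently (for example with getopt); the attribute names and the defaults (query expansion off, relevance feedback on, score printing off) are assumptions inferred from the wording "enable" and "disable".

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented search.py flags (a sketch)."""
    parser = argparse.ArgumentParser(prog="search.py")
    parser.add_argument("-d", dest="dictionary_file", required=True,
                        help="dictionary file generated by index.py")
    parser.add_argument("-p", dest="postings_file", required=True,
                        help="postings file generated by index.py")
    parser.add_argument("-q", dest="query_file", required=True,
                        help="file containing the queries to run")
    parser.add_argument("-o", dest="output_file", required=True,
                        help="file that receives the results")
    # Optional switches; defaults are assumed, not taken from the project.
    parser.add_argument("-e", dest="query_expansion", action="store_true",
                        help="enable query expansion")
    parser.add_argument("-f", dest="relevance_feedback", action="store_false",
                        help="disable relevance feedback")
    parser.add_argument("-s", dest="print_score", action="store_true",
                        help="enable printing score")
    return parser


args = build_parser().parse_args(
    ["-d", "dictionary.txt", "-p", "postings.txt",
     "-q", "queries.txt", "-o", "output.txt", "-e"]
)
print(args.query_expansion, args.relevance_feedback, args.print_score)
```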
👤 Evan - Crawling, PageRank, Indexer, query expansion
👤 Rulin - Searcher, relevance feedback
- Github: @XJDKC
Contributions, issues and feature requests are welcome! Feel free to check issues page.
Give a ⭐️ if this project helped you!