Web-Search-Engine-for-Computer-Science

Full document

Medium

Prerequisites

Python 3.7.6

Install

git clone https://github.com/NavePnow/Web-Search-Engine-for-Computer-Science.git

Usage

Crawling

python3 crawling.py

Parameter settings

url = "https://en.wikipedia.org/wiki/Outline_of_computer_science"
crawl = Crawl(url, 'robots.txt')
crawl.get_rules()
crawl.crawl('./webpage', num=2000)

url: the root link where crawling starts
Crawl(url, 'robots.txt'): creates the crawler from the root link and the path to Wikipedia's robots.txt file
crawl.crawl('./webpage', num=2000): the folder where crawled webpages are saved and the number of webpages to crawl
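
The way crawl.get_rules() applies robots.txt is not documented here. As a minimal sketch of the general idea, assuming the rules are parsed with the standard-library urllib.robotparser (the helper names load_rules and allowed_links and the sample rules are hypothetical, not part of crawling.py):

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_lines):
    """Parse robots.txt content (a list of lines) into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def allowed_links(parser, links, user_agent="*"):
    """Keep only the links that the parsed rules allow for user_agent."""
    return [link for link in links if parser.can_fetch(user_agent, link)]


# Hypothetical rules for illustration; the real file is the repository's robots.txt.
sample_rules = [
    "User-agent: *",
    "Disallow: /w/",
]
rules = load_rules(sample_rules)

candidates = [
    "https://en.wikipedia.org/wiki/Outline_of_computer_science",
    "https://en.wikipedia.org/w/index.php",
]
kept = allowed_links(rules, candidates)
print(kept)
```

A crawler would run a check like this on every extracted link before adding it to the queue, so disallowed paths are never fetched.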

Index

python3 index.py -i directory-of-documents -d dictionary-file -p postings-file

directory-of-documents: the directory that contains the documents to be indexed.
dictionary-file: output file that stores array a, the average document length, per-document information, and the dictionary.
postings-file: output file that stores the tf (term frequency), doc_id, and positions of each term.
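
To illustrate what a postings structure with tf, doc_id, and positions can look like, here is a minimal in-memory sketch. It is not the code in index.py: the tokenizer, the function names, and the sample documents are assumptions made for the example, and the on-disk file format is not shown.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a positional index from {doc_id: text}.

    Returns (postings, doc_lengths, avg_length), where postings maps
    term -> {doc_id: [positions]} and tf is the length of a position list.
    """
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(position)
    avg_length = sum(doc_lengths.values()) / len(doc_lengths) if doc_lengths else 0.0
    return postings, doc_lengths, avg_length


# Hypothetical documents for illustration.
docs = {1: "binary search tree", 2: "search engine search index"}
postings, doc_lengths, avg_length = build_index(docs)

for doc_id, positions in postings["search"].items():
    print(doc_id, len(positions), positions)  # doc_id, tf, positions
```

Keeping positions alongside tf is what makes phrase or proximity matching possible later, at the cost of a larger postings file.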

Search

python3 search.py -d dictionary-file -p postings-file -q query-file -o output-file-of-results

dictionary-file: the file generated by index.py, which contains array a, the average document length, per-document information, and the dictionary.
postings-file: the file generated by index.py, which contains the tf, doc_id, and positions of each term.
query-file: the file that contains the queries to be tested.
output-file-of-results: the file where the results and corresponding URLs for each query in query-file are written.
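
This README does not say which ranking function search.py uses. Because the dictionary file stores the average document length, a length-normalized scheme such as Okapi BM25 is one plausible fit. The sketch below shows that assumption only; the function name, the parameters k1 and b, and the sample data are hypothetical and not taken from the project.

```python
import math


def bm25_score(query_terms, doc_id, postings, doc_lengths, avg_length, k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    postings maps term -> {doc_id: tf}. This is an assumed ranking
    function, not necessarily the one search.py implements.
    """
    n_docs = len(doc_lengths)
    score = 0.0
    for term in query_terms:
        docs_with_term = postings.get(term, {})
        tf = docs_with_term.get(doc_id, 0)
        if tf == 0:
            continue
        df = len(docs_with_term)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Longer-than-average documents get a larger denominator.
        norm = tf + k1 * (1 - b + b * doc_lengths[doc_id] / avg_length)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical index data for illustration.
postings = {"search": {1: 1, 2: 2}, "tree": {1: 1}}
doc_lengths = {1: 3, 2: 4}
avg_length = 3.5

query = ["tree"]
ranked = sorted(
    doc_lengths,
    key=lambda d: bm25_score(query, d, postings, doc_lengths, avg_length),
    reverse=True,
)
print(ranked)
```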

Advanced options

-e  enable query expansion
-f  disable relevance feedback
-s  enable printing score
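
The flags above could be parsed as in the following sketch. It assumes argparse; search.py may parse its arguments differently, and the destination names (dictionary_file, expand, and so on) are hypothetical. Note that -f is a switch that turns off a feature which is on by default.

```python
import argparse


def build_parser():
    """Build a parser for the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="search.py")
    parser.add_argument("-d", dest="dictionary_file", required=True)
    parser.add_argument("-p", dest="postings_file", required=True)
    parser.add_argument("-q", dest="query_file", required=True)
    parser.add_argument("-o", dest="output_file", required=True)
    # -e enables query expansion (off by default).
    parser.add_argument("-e", dest="expand", action="store_true")
    # -f disables relevance feedback (on by default).
    parser.add_argument("-f", dest="feedback", action="store_false")
    # -s enables printing scores (off by default).
    parser.add_argument("-s", dest="print_score", action="store_true")
    return parser


args = build_parser().parse_args(
    ["-d", "dictionary.txt", "-p", "postings.txt",
     "-q", "queries.txt", "-o", "output.txt", "-e"]
)
print(args.expand, args.feedback, args.print_score)
```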

Run tests

python3 search.py -d dictionary.txt -p postings.txt -q queries.txt -o output.txt

Author

👤 Evan — Crawling, PageRank, Indexer, query expansion

Twitter: @NavePnow
Github: @NavePnow

👤 Rulin — Searcher, relevance feedback

Github: @XJDKC

🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

💰 Show your support

Give a ⭐️ if this project helped you!

PayPal	Patron
