How fast can you navigate from a start article to a goal article in Wikipedia?
Given a start article, you can only advance by clicking on the links to other Wikipedia articles, within the article itself.
I used to play this game with friends during my army service a few years ago, so creating this project was pure fun π
starting from the Python (programming language) Wikipedia article, navigate to the Carl Friedrich Gauss Wikipedia article.
Few possible answers:
- Python (programming language) -> Factorial -> Carl Friedrich Gauss
- Python (programming language) -> Complex number -> Carl Friedrich Gauss
- Python (programming language) -> Number Theory -> Mathematics -> Carl Friedrich Gauss
Given a start and goal articles provided by the user, the code in this project uses the BFS or the DFS algorithms in order to find a solution path from the start to the goal articles, following the instructions above.
The whole code is written in Python.
Notes:
- There may be more than one possible path.
- BFS will always find the shortest path (in regards to the number of 'jumps' from one article to another).
- Since Wikipedia is huge and the search can take a lot of time, two changeable restrictions are applied:
- The amount of articles that will be processed in total, see
MAX_ARTICLES_TO_SEARCH
inmain.py
- The maximum length of a possible path from the start to goal articles, see
MAX_WIKI_PATH_LENGTH
inmain.py
.
- The amount of articles that will be processed in total, see
- This project was created and testsed with Python 3.9.
- Create a Python Virtual Environment ('.venv'):
python -m venv .venv
- Install requirements:
pip install -r requirements.txt -y
- Run
main.py
file - in the command line executepy main.py
. - Choose the start and goal articles.
- Choose which search algorithm should be used.
- The output may be empty if:
- A possible solution path does not exist.
- All of the possible solutions are longer than
MAX_WIKI_PATH_LENGTH
. - The
MAX_ARTICLES_TO_SEARCH
bound was reached and the articles that were processed can not form a possible solution path.
-
Time:
The search alogorithms (BFS / DFS) fetch articles from Wikipedia and process them.
Given the start and goal articles, as more and more articles are being fetched and processed, the response time from Wikipedia gets longer: From 0.5 seconds per one fetch up to 2 minutes and more!
This affects significantly on the total run time. One search with a possible solution of two 'jumps' can take a lot of precious minutes.
As a solution, I added 'async' fetching versions to BFS and DFS. These versions perform better than the not-async versions, yet the total time can still take minutes. -
Semantics:
During the search, the different links within an article are not processed in a semantic way. We can reduce the search time if more semantically related links (articles) to the goal article were processed first.