Follow this link to beta test the search engine for yourself, making sure to submit queries that pertain to the topic of health.
DISCLAIMER: The information on this site is not intended or implied to be a substitute for professional medical advice, diagnosis or treatment. Always seek the advice of your physician or other qualified health care provider with any questions you may have regarding a medical condition or treatment and before undertaking a new health care regimen, and never disregard professional medical advice or delay in seeking it because of something you have read on this website.
This project offers a web-based search engine, built with Flask, for Wikipedia articles on the topic of health. The search engine queries our data set, an inverted index built from a corpus of ~500 Wikipedia articles. Documents are ranked by relevance to the query using the Okapi BM25 algorithm. Search results also include a text summary of each article and are paginated for user-friendly browsing.
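As a rough illustration of how BM25 ranking over an inverted index can work, here is a minimal sketch. It is not the code in search_engine.py: the function names `build_index` and `bm25_search`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}; also return document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.5, b=0.75):
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur anywhere in the corpus
        # A common smoothed IDF variant that never goes negative.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: long documents need more hits to score equally.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpus for illustration only.
docs = {
    "Health": "health is a state of physical mental and social well-being",
    "Nutrition": "nutrition supports health through a balanced diet",
    "Exercise": "regular exercise improves fitness",
}
index, lengths = build_index(docs)
results = bm25_search("balanced diet health", index, lengths)
```

Only documents that share at least one term with the query receive a score, so the inverted index keeps the search from touching every article in the corpus.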
We used several Jupyter Notebooks to carry out the steps the search engine depends on. First, focused_crawler.ipynb collects Wikipedia articles starting from a seed URL and a keyword. Our seed URL was the Wikipedia article for Health, and our keyword was "health." We set the depth parameter to 10, so the crawler went 10 levels deep into Wikipedia: in each of 10 iterations, it added the links found within the articles that contained the word "health."
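The crawl strategy can be sketched as a depth-limited breadth-first search. The example below is an illustration under stated assumptions, not the notebook's code: `focused_crawl` and its `fetch` callable are hypothetical names, the keyword test is applied to the page text, and a small in-memory link graph stands in for Wikipedia so the sketch runs without network access.

```python
from collections import deque


def focused_crawl(seed, keyword, max_depth, fetch):
    """Breadth-first crawl that only keeps and expands on-topic pages.

    `fetch(url)` is a hypothetical callable returning (text, outgoing_links).
    """
    visited = {seed}
    kept = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)
        if keyword.lower() not in text.lower():
            continue  # off-topic page: neither keep it nor follow its links
        kept.append(url)
        if depth == max_depth:
            continue  # depth limit reached: keep the page but stop expanding
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return kept


# Toy link graph standing in for Wikipedia (illustrative data only).
PAGES = {
    "Health": ("health overview", ["Nutrition", "Astronomy"]),
    "Nutrition": ("diet and health", ["Vitamin"]),
    "Astronomy": ("stars and planets", ["Vitamin"]),
    "Vitamin": ("vitamins support health", []),
}
shallow = focused_crawl("Health", "health", max_depth=1, fetch=PAGES.__getitem__)
deeper = focused_crawl("Health", "health", max_depth=2, fetch=PAGES.__getitem__)
```

Raising `max_depth` lets the crawl reach pages further from the seed, which is the effect the depth parameter has on the size of the corpus.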
The other key notebooks include:
duplicate_removal.ipynb: Removes articles that are identical, as well as different URLs that redirect to the same or a nearly identical article.
inv_index.ipynb: Builds the inverted index from the corpus generated by the focused crawler.
text_summarizer.ipynb: Provides the most salient sentences, in summary form, for each URL present in the corpus.
app.ipynb: Uses the Flask library to generate the search engine webpage, takes in the user query and runs it through the search engine (search_engine.py), and contains the logic for attaching the correct text summaries to each URL result, as well as the pagination information used in the index.html file. Each result page should contain at most 10 results, and the user should be able to step through the pages one at a time until they reach the end of the search results.
Ground Truth Preparation.ipynb: Prepares the ground truth table, which can be downloaded as an Excel sheet to be filled out by the user.
Evaluation.ipynb: Evaluates our search engine with a given ground truth sheet.
WordCloud.ipynb: Creates a word cloud from all documents in our corpus. Note that you must have downloaded the full corpus to create the word cloud. Refer to section Pre-Processing.
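To make the duplicate-removal step more concrete, here is one possible approach, sketched under assumptions: it compares the word sets of articles with Jaccard similarity and drops any article that overlaps heavily with one already kept. The technique, the `0.9` threshold, the function names, and the sample URLs are all illustrative and may differ from what duplicate_removal.ipynb actually does.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(docs, threshold=0.9):
    """Keep the first document of every group whose word sets overlap heavily."""
    kept = {}       # url -> text for the documents we keep
    kept_sets = []  # word sets of the kept documents, in the same order
    for url, text in docs.items():
        words = set(text.lower().split())
        if any(jaccard(words, seen) >= threshold for seen in kept_sets):
            continue  # near-identical to a document we already kept
        kept[url] = text
        kept_sets.append(words)
    return kept


# Hypothetical corpus: two URLs that resolve to the same text, plus one other.
docs = {
    "wiki/Health": "health is a state of complete well-being",
    "wiki/Health_(redirect)": "health is a state of complete well-being",
    "wiki/Nutrition": "nutrition is the intake of food",
}
unique = remove_near_duplicates(docs)
```

This pairwise comparison is quadratic in the number of articles, which is acceptable for a corpus of a few hundred documents.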
The index.html and base.html files provide the front-end structure of the search engine webpage. The index.html file uses Jinja2 templating to make the page's layout depend on certain conditions. For example, when a user reaches the last page of a query's results, the next-page button should not be clickable, since there is no next page. Jinja2 logic also controls what the home page shows compared to the post-search page: the user does not need to see an empty section labeled "No Search Results" before they have even executed a query. The base.html file contains HTML code for visual aspects such as font color.
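The pagination information that the template relies on could be computed along these lines. This is a sketch, not the code in app.py: `paginate`, the dictionary keys, and the placeholder results are hypothetical, while the page size of 10 comes from the description above.

```python
import math

RESULTS_PER_PAGE = 10  # page size described for the result pages


def paginate(results, page):
    """Return the slice for a 1-based `page` plus flags a template could use."""
    total_pages = max(1, math.ceil(len(results) / RESULTS_PER_PAGE))
    page = min(max(page, 1), total_pages)  # clamp out-of-range page numbers
    start = (page - 1) * RESULTS_PER_PAGE
    return {
        "items": results[start:start + RESULTS_PER_PAGE],
        "page": page,
        "total_pages": total_pages,
        "has_prev": page > 1,
        "has_next": page < total_pages,
    }


results = [f"article-{i}" for i in range(1, 26)]  # 25 placeholder results
first = paginate(results, 1)
last = paginate(results, 3)
```

A Jinja2 template could then wrap the next-page link in `{% if has_next %} ... {% endif %}` so that it only appears when another page exists.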
The first thing you need to do is to download a copy of this project and install the requirements:
git clone https://github.com/TrevBot17/Capstone.git
pip install -r requirements.txt
The project includes the following folders with corresponding Jupyter Notebooks and Python scripts: src/Pre-Processing and src/App. The fully functional webpage can be run using only app.py (or app.ipynb) in the App folder. However, if you want to run everything from scratch, you'll need to follow the steps below.
- Run the focused_crawler.ipynb notebook to generate the corpus. Within this file, you can play around with parameters like keyword and depth to modify the content of the corpus. This notebook will create the Raw_TXT_Downloads folder that will need to be referenced in future notebooks.
- Run the duplicate_removal.ipynb file to remove articles that are either identical or different URLs that redirect to the same or nearly identical articles.
- Run the inv_index.ipynb notebook to build the inverted index from your previously generated corpus, which should output the inverted index of the corpus as a pickle file to the App folder.
- Run the text_summarizer.ipynb notebook to create text summaries associated with each URL in the corpus, which should output the text summaries of the corpus as a pickle file to the App folder.
Run app.py (or app.ipynb) to launch the search engine webpage on your local machine. This script interacts with the search_engine.py script, so make sure both scripts are in the same directory.
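The hand-off between the pre-processing notebooks and the app happens through pickle files. A minimal sketch of that round trip is shown below; the helper names, the file name `inverted_index.pkl`, and the sample index are assumptions for illustration, and a temporary directory stands in for the App folder.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize `obj` to `path` in binary mode."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Load an object previously written by save_pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical index contents: term -> {url: term frequency}.
inverted_index = {"health": {"wiki/Health": 3, "wiki/Nutrition": 1}}

with tempfile.TemporaryDirectory() as app_dir:  # stand-in for the App folder
    index_path = Path(app_dir) / "inverted_index.pkl"  # hypothetical file name
    save_pickle(inverted_index, index_path)
    restored = load_pickle(index_path)
```

Pickle files should only be loaded when they come from a trusted source, such as your own run of the notebooks, because unpickling can execute arbitrary code.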
The steps above describe how to run the search engine locally. To create a publicly available version, we decided to host the web page on the Heroku cloud platform. For that, we set up a Heroku account, created the app via the Heroku CLI, and deployed it with Heroku Git. The content of the App folder (except the app.ipynb file) needs to be pushed to the Heroku platform, including the requirements.txt file. To run and publish this project, we recommend using an IDE comparable to VS Code.
To evaluate the performance of our search engine, run the Evaluation.ipynb notebook located in the main src folder. This notebook uses the existing ground truth Excel sheet, Capstone Ground Truth.xlsx, located in the same folder. If you want to create your own ground truth, you can use the notebook Ground Truth Preparation.ipynb. It creates a Pandas DataFrame based on a set of queries, which can then be downloaded as an Excel sheet and filled out by a user in the same way as Capstone Ground Truth.xlsx.
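As an illustration of what evaluating against a ground truth sheet can look like, the sketch below computes precision at k for each query. This is an assumed metric chosen for the example: the description above does not state which measures Evaluation.ipynb reports, and the function name, query, and URLs are hypothetical.

```python
def precision_at_k(ranked_urls, relevant_urls, k=10):
    """Fraction of the top-k results that the ground truth marks as relevant."""
    top_k = ranked_urls[:k]
    if not top_k:
        return 0.0
    return sum(url in relevant_urls for url in top_k) / len(top_k)


# Hypothetical ground truth: query -> set of URLs judged relevant.
ground_truth = {"heart disease": {"wiki/Cardiology", "wiki/Heart"}}

# Hypothetical ranked output of the search engine for the same query.
search_results = {
    "heart disease": ["wiki/Heart", "wiki/Health", "wiki/Cardiology", "wiki/Diet"],
}

scores = {
    query: precision_at_k(search_results[query], relevant, k=4)
    for query, relevant in ground_truth.items()
}
```

Averaging the per-query values gives a single number that can be compared across different versions of the index or ranking parameters.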
