A simple web app whose main purpose is scraping data from the ExamTopics website, then using the collected questions to test yourself.

The issue is that viewing all of the questions on ExamTopics requires a subscription, but any individual question can be viewed by searching for the exam name together with the question number. This project therefore provides a script that walks through the discussion pages and collects only the data necessary, as shown below in the Data section.
- Run `py app.py` to start the backend server
- Run `npm run dev` to start the website server; from there go to Web Scraper and navigate to the Scrape page
- Choose a Provider and an Exam (you cannot proceed with scraping without both)
- Click `Scrape!`
- Stop the program whenever you feel you have collected enough questions
- Step 1
  - Run `npm run dev` to start the website server; from there go to Web Scraper and navigate to the Scrape page
  - Choose a Provider and an Exam (you cannot proceed with scraping without both)
  - Click `Scrape!`
  - Open the browser's inspector; the latest console log will include all of the details the backend needs to proceed with scraping
- Step 2
  - In the code of `app.py`, above the `main()` function, there is a placeholder where you paste the details object, assigning it to `scrape_details`
  - Run `py main.py` to start the backend server
  - Watch the magic happen!
  - Stop the program whenever you feel you have collected enough questions
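For orientation, the Step 2 placeholder might look like the sketch below. The field names are assumptions based on what the frontend lets you select, not the project's actual schema; the real details object copied from the browser console goes here instead.

```python
# Hypothetical placeholder in app.py -- paste the details object logged by the
# frontend console here, assigning it to scrape_details (field names assumed).
scrape_details = {
    "provider": "amazon",
    "exam_code": "dva-c02",
    "exam_name": "aws-certified-developer-associate",
}

def main():
    # The scraper would read scrape_details here to know what to collect
    print(f"Scraping {scrape_details['exam_name']} ({scrape_details['provider']})")
```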
Example Exams
- /exams/amazon/aws-certified-developer-associate-dva-c02
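An exam path like the one above splits cleanly into a provider and an exam slug. A hypothetical helper (not part of the project) could parse it like this:

```python
def parse_exam_path(path):
    """Split an ExamTopics exam path into (provider, exam_slug)."""
    parts = path.strip("/").split("/")
    # Expected shape: exams/<provider>/<exam-slug>
    if len(parts) != 3 or parts[0] != "exams":
        raise ValueError(f"unexpected exam path: {path!r}")
    return parts[1], parts[2]
```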
Questions will be written in the format below:
[
{
"exam_name" : ""
"questionNo": "",
"topic": "",
"question": "",
"choices": [
"",
"",
"",
"",
""
],
"answer": "",
"comments": [
{
"user": "",
"selected": "",
"comment": "",
"comments": []
}
]
}
]
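A minimal sketch of how a dump of scraped questions in this format could be loaded and sanity-checked; the field names come from the schema above, while the helper itself and the file path are assumptions:

```python
import json

# Every field from the question schema above
REQUIRED_FIELDS = {"exam_name", "questionNo", "topic", "question",
                   "choices", "answer", "comments"}

def load_questions(path):
    """Load a dump of scraped questions and drop any incomplete entries."""
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)
    # Keep only entries that contain every field in the schema
    return [q for q in questions if REQUIRED_FIELDS <= q.keys()]
```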
After every round, the run is saved into /dumps/runs. This keeps track of details such as:
- How many scrapes were successful
- How many of those successful scrapes resulted in a question
- How many other questions/exams/etc. were encountered
- How many error pages were encountered
- How long the program ran for
in the form of
{
program_start_date: '',
program_end_date: '',
duration: '',
total_bans: '',
longest_delay: '',
scrape_data: {
total_successful_scrape: '',
total_questions: '',
total_unused_questions: '',
    total_error_pages: ''
}
}
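Persisting one round's statistics could be sketched as below; the stats keys mirror the format above, but the helper name and the timestamped filename are assumptions:

```python
import json
import os
from datetime import datetime

def save_run(stats, dumps_dir="dumps/runs"):
    """Write one round's statistics to a timestamped JSON file in dumps/runs."""
    os.makedirs(dumps_dir, exist_ok=True)
    filename = datetime.now().strftime("run_%Y%m%d_%H%M%S.json")
    path = os.path.join(dumps_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stats, f, indent=2)
    return path
```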
- The script goes to the page
https://www.examtopics.com/discussions/[provider]/[pageNum]/
- From the Scrape section within the web app, the user chooses a provider and an exam, which consists of the exam-code and exam-name
- The placeholders are then substituted, e.g.
https://www.examtopics.com/discussions/amazon/1/
- This page consists of a list of exam questions by the provider, and the most recent created discussion will be at the top
- The script scrapes this page and collects the title and link of each discussion, then uses a fuzzy-search algorithm to measure the similarity between the title and the exam-code and exam-name you selected
- Any similarity score over 80 is considered a hit; the script then fetches the link and scrapes that page for the question in the format shown in the Data section
- Afterwards the discussion pageNum is incremented and the process repeats from step 2
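The title-matching step can be sketched with the standard library's difflib; the actual project may use a different fuzzy-search library, and the helper names here are assumptions, but the 80 threshold comes from the description above:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def is_hit(discussion_title, exam_code, exam_name, threshold=80):
    """A discussion is a hit when its title matches the selected exam closely enough."""
    target = f"{exam_code} {exam_name}"
    return similarity(discussion_title, target) > threshold
```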
To run the Python server, first create the virtual environment with `python -m venv venv`, then install the dependencies with `pip install -r requirements.txt`, then run `venv\Scripts\python server.py`.

To run the Python scripts, run `py main.py`; make sure you are inside your virtual environment first (`venv\Scripts\activate` on Windows).

If you add more Python libraries, be sure they are added to requirements with `pip freeze > requirements.txt`; this command automatically captures everything installed in your venv.