A simple web app whose main purpose is scraping data from the ExamTopics website, then using the collected questions to test yourself.

The issue is that viewing all of the questions on ExamTopics requires a subscription, but any individual question can be viewed by searching for the exam name together with the question number. This project therefore provides a script that walks through the discussion pages and collects only the data necessary, as shown below in the Data section.
- Run `py app.py` to start the backend server
- Run `npm run dev` to start the website server; from there go to Web Scraper and navigate to the Scrape page
- Choose a Provider and an Exam (you cannot proceed with scraping without both)
- Click `Scrape!`
- Stop the program whenever you feel you have collected enough questions
- Step 1
  - Run `npm run dev` to start the website server; from there go to Web Scraper and navigate to the Scrape page
  - Choose a Provider and an Exam (you cannot proceed with scraping without both)
  - Click `Scrape!`
  - Open the browser's inspector; the latest console log will include all of the details the backend needs to proceed with scraping
- Step 2
  - In the code of `app.py`, above the `main()` function, there is a placeholder where you paste the details object, assigning it to `scrape_details`
  - Run `py main.py` to start the backend server
  - Watch the magic happen!
  - Stop the program whenever you feel you have collected enough questions
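For orientation, the Step 2 placeholder might look like the sketch below. The field names are assumptions based on what the frontend lets you select, not the project's actual schema; the real details object copied from the browser console goes here instead.

```python
# Hypothetical placeholder in app.py -- paste the details object logged by the
# frontend console here, assigning it to scrape_details (field names assumed).
scrape_details = {
    "provider": "amazon",
    "exam_code": "dva-c02",
    "exam_name": "aws-certified-developer-associate",
}

def main():
    # The scraper would read scrape_details here to know what to collect
    print(f"Scraping {scrape_details['exam_name']} ({scrape_details['provider']})")
```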
Example Exams
- /exams/amazon/aws-certified-developer-associate-dva-c02
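An exam path like the one above splits cleanly into a provider and an exam slug. A hypothetical helper (not part of the project) could parse it like this:

```python
def parse_exam_path(path):
    """Split an ExamTopics exam path into (provider, exam_slug)."""
    parts = path.strip("/").split("/")
    # Expected shape: exams/<provider>/<exam-slug>
    if len(parts) != 3 or parts[0] != "exams":
        raise ValueError(f"unexpected exam path: {path!r}")
    return parts[1], parts[2]
```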
Questions will be written in the format below:
[
{
"exam_name" : ""
"questionNo": "",
"topic": "",
"question": "",
"choices": [
"",
"",
"",
"",
""
],
"answer": "",
"comments": [
{
"user": "",
"selected": "",
"comment": "",
"comments": []
}
]
}
]
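A minimal sketch of how a dump of scraped questions in this format could be loaded and sanity-checked; the field names come from the schema above, while the helper itself and the file path are assumptions:

```python
import json

# Every field from the question schema above
REQUIRED_FIELDS = {"exam_name", "questionNo", "topic", "question",
                   "choices", "answer", "comments"}

def load_questions(path):
    """Load a dump of scraped questions and drop any incomplete entries."""
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)
    # Keep only entries that contain every field in the schema
    return [q for q in questions if REQUIRED_FIELDS <= q.keys()]
```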
After every round, the run is saved into /dumps/runs. This keeps track of details such as:
- How many scrapes were successful
- How many of those successful scrapes resulted in a question
- How many other questions/exams/etc. were encountered
- How many error pages were encountered
- How long the program ran for
in the form of
{
program_start_date: '',
program_end_date: '',
duration: '',
total_bans: '',
longest_delay: '',
scrape_data: {
total_successful_scrape: '',
total_questions: '',
total_unused_questions: '',
    total_error_pages: ''
}
}
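Persisting one round's statistics could be sketched as below; the stats keys mirror the format above, but the helper name and the timestamped filename are assumptions:

```python
import json
import os
from datetime import datetime

def save_run(stats, dumps_dir="dumps/runs"):
    """Write one round's statistics to a timestamped JSON file in dumps/runs."""
    os.makedirs(dumps_dir, exist_ok=True)
    filename = datetime.now().strftime("run_%Y%m%d_%H%M%S.json")
    path = os.path.join(dumps_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stats, f, indent=2)
    return path
```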
- The script goes to the page
https://www.examtopics.com/discussions/[provider]/[pageNum]/
- From the Scrape section within the web app, the user chooses a provider and an exam, which consists of the exam-code and exam-name
- The placeholders are then substituted, e.g.
https://www.examtopics.com/discussions/amazon/1/
- This page consists of a list of exam questions by the provider, and the most recent created discussion will be at the top
- The script scrapes this page and collects the title and link of each discussion, then uses a fuzzy-search algorithm to measure the similarity between the title and the exam-code and exam-name you selected
- Any similarity score over 80 is considered a hit; the script then fetches the link and scrapes that page for the question in the format shown in the Data section
- Afterwards the discussion pageNum is incremented and the process repeats from step 2
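The title-matching step can be sketched with the standard library's difflib; the actual project may use a different fuzzy-search library, and the helper names here are assumptions, but the 80 threshold comes from the description above:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def is_hit(discussion_title, exam_code, exam_name, threshold=80):
    """A discussion is a hit when its title matches the selected exam closely enough."""
    target = f"{exam_code} {exam_name}"
    return similarity(discussion_title, target) > threshold
```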
To run the Python server, first create the virtual environment with `python -m venv venv`, then install the dependencies with `pip install -r requirements.txt`, then run `venv\Scripts\python server.py`.

To run the Python scripts, run `py main.py`; make sure you are inside your virtual environment first (`venv\Scripts\activate` on Windows).

If you add more Python libraries, be sure they are added to requirements with `pip freeze > requirements.txt`; this command automatically captures everything installed in your venv.