
Data Collection Pipeline Project

This Python project implements an industry-grade data collection pipeline that automatically browses a website and collects information from it.

This project requires Selenium and Pandas.

Milestone 1 ~ 3: Prototyping the web scraper

The website I am scraping is VNDB, a visual novel wiki, because it does not change its layout very often. The script collects data on top-rated visual novels matching a given keyword or genre.

I created a general-purpose scraper class with Selenium, with basic functionality applicable to any website, such as clicking on elements by XPath and accepting cookies. I then created a child class of this basic scraper named VNDBScraper in a separate file, which is dedicated to the VNDB website.
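A minimal sketch of what such a base class could look like is shown below; the class name, constructor and locator arguments are illustrative assumptions rather than the exact code in the repository:

from selenium import webdriver
from selenium.webdriver.common.by import By

class Scraper:
    '''General-purpose scraper with functionality applicable to any website.'''

    def __init__(self, url: str):
        self.driver = webdriver.Chrome()
        self.driver.get(url)

    def click_element(self, xpath: str):
        '''Find an element by its XPath and click on it.'''
        self.driver.find_element(By.XPATH, xpath).click()

    def accept_cookies(self, xpath: str):
        '''Click the "accept cookies" button if a cookie banner is present.'''
        try:
            self.click_element(xpath)
        except Exception:
            pass  # no cookie banner on this page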

Milestone 4: Retrieving image & text data

In the VNDBScraper class, I created a method named download_img() which downloads the head image from the description page of a visual novel and saves it in raw_data/images. Note that this method is never invoked - I just want to practise retrieving image data.
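For illustration, the image retrieval can be done roughly as follows; the XPath and the file naming scheme here are assumptions, not the ones used by download_img():

import urllib.request
from selenium.webdriver.common.by import By

def download_img(self, filename: str):
    '''Save the head image of the current description page to raw_data/images.'''
    img = self.driver.find_element(By.XPATH, '//div[@class="vnimg"]//img')  # hypothetical XPath
    src = img.get_attribute('src')  # direct URL of the image file
    urllib.request.urlretrieve(src, f'raw_data/images/{filename}.jpg')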

Then I created the get_info() method, which scrapes the information of all the visual novels in the search result table on a single page and saves it as a list of dictionaries. The information collected includes the description page URL, title, platforms, languages available, release date, popularity and rating. I also created the next() method, which finds the "next page" button and clicks on it.
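Roughly, each dictionary is built by reading the cells of one table row; the sketch below is a simplified illustration with placeholder XPath selectors, not the exact implementation:

from selenium.webdriver.common.by import By

def get_info(self) -> list:
    '''Scrape every visual novel listed in the search result table on the current page.'''
    novels = []
    for row in self.driver.find_elements(By.XPATH, '//table//tr'):  # placeholder selector
        link = row.find_element(By.XPATH, './/a')
        novels.append({
            'url': link.get_attribute('href'),
            'title': link.text,
            # platforms, languages, release date, popularity and rating
            # are read from the remaining cells in the same way
        })
    return novels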

The scraping procedure is contained in the start_scrape() function, which calls get_info() to scrape the data from the first page, then moves on to the next page until the number of novels scraped reaches a certain limit (default 300) or the last page is reached. I made use of selenium.common.exceptions.NoSuchElementException to determine whether we are on the last page:

from selenium.common.exceptions import NoSuchElementException

while True:
    # Collect info from the current page with scraper.get_info()
    # and stop once the novel limit (default 300) is reached
    try:
        scraper.next()  # click the "next page" button
    except NoSuchElementException:
        # No "next page" button means we are on the last page
        print('Scraping finished - Last page scraped.')
        break

Finally, the data collected are stored in raw_data/data.json.
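This is a standard json.dump call; the snippet below is a minimal illustration with a placeholder list rather than the actual scraped data:

import json
import os

os.makedirs('raw_data', exist_ok=True)
novel_list = [{'title': 'Example', 'rating': 8.0}]  # stand-in for the list of dictionaries collected above
with open('raw_data/data.json', 'w') as f:
    json.dump(novel_list, f, indent=4)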

Milestone 5: Testing

I tested all the methods of the VNDBScraper class with the built-in unittest module and made sure that all the public methods run error-free and give the intended output. For example, I confirmed that the get_info() method returns a list of non-empty dictionaries, each encoding all the available information for a search result. Also, to test the download_img() method, I made use of the glob package and checked the path of the image with the following code:

import glob
from os import path

list_of_files = glob.glob('raw_data/images/*')
image_path = max(list_of_files, key=path.getctime)  # This returns the most recently created file
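For the get_info() test mentioned above, a unittest case could look roughly like the following; the module name and constructor call are assumptions, not a copy of the actual test file:

import unittest
from vndb_scraper import VNDBScraper  # hypothetical module name

class VNDBScraperTestCase(unittest.TestCase):
    def setUp(self):
        self.scraper = VNDBScraper()  # constructor arguments omitted

    def test_get_info(self):
        info = self.scraper.get_info()
        self.assertIsInstance(info, list)
        for entry in info:
            self.assertIsInstance(entry, dict)
            self.assertTrue(entry)  # each dictionary must be non-empty

    def tearDown(self):
        self.scraper.driver.quit()

if __name__ == '__main__':
    unittest.main()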

Milestone 6: Creating a Docker image to run the scraper

I created a Dockerfile according to the instructions given in docker_selenium.md and built a Docker image to run the scraper.

The commands used are as follows:

To build the image:

docker build -t vndbscraper .

To tag the image and push it to Docker Hub:

docker tag vndbscraper alanimp18/vndbscraper_img
docker push alanimp18/vndbscraper_img

Milestone 7: Setting up the GitHub secrets

I generated an access token for my Docker CLI. Then I set up the GitHub secrets containing the credentials required to log in to my Docker Hub repository. I also created a GitHub Action that builds the Docker image and pushes it to the Docker Hub repository.
