In this project, I will build a data collection pipeline in which tabular data and images are scraped from a website and stored in the cloud, in a relational database and a data lake respectively. After testing, I will use Docker to containerise the application and deploy it to an EC2 instance to make it scalable. Using GitHub Actions, I will set up a CI/CD pipeline that pushes a new Docker image whenever changes are made to the code. The container will be monitored with Prometheus, and I will create Grafana dashboards to visualise those metrics.
I have created a general webscraper module that uses Selenium and Requests to extract data from a website. I have also made two further modules containing scrapers for particular websites (see below for more detail). Currently the data is collected and stored locally in a JSON file; once unit and integration testing is complete, it will be uploaded to AWS.
The two websites I wanted to scrape were:
- A collocations dictionary. Collocations are combinations of words that sound natural to a native speaker: in English, for example, we say 'heavy rain', not 'strong rain'. As a former English teacher, I understand the value of learning collocations, and as a current Spanish student, I was interested in scraping this website to collect the verbs and adjectives that collocate with a given noun. I decided to focus my search on abstract nouns, such as 'libertad' (freedom) and 'curiosidad' (curiosity), as I felt these were a gap in my knowledge at my current level.
- The Lego Ideas website. Anyone can submit an idea for a Lego project, and if it receives enough support, it is made into a real Lego set. I love browsing this website and seeing all the amazing designs.
I created a general Scraper class using Selenium and Requests. In the constructor, the Selenium Chrome webdriver is initialised. WebDriverManager (here, ChromeDriverManager) automates the download and management of the driver binaries required by Selenium WebDriver.
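A minimal sketch of that constructor is below. The `url` parameter and the class layout are assumptions about my own code rather than a verbatim copy, but the ChromeDriverManager call is the standard webdriver-manager usage:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


class Scraper:
    def __init__(self, url: str):
        # webdriver-manager downloads and caches the correct ChromeDriver
        # binary, so no manual driver installation is needed
        service = Service(ChromeDriverManager().install())
        self.driver = webdriver.Chrome(service=service)
        self.url = url
```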
Using Selenium, I created methods to cover the main actions a user performs when browsing a website. The methods of the Scraper class are listed below (a short sketch of two of them follows the table):
| Methods that replicate user behaviour | Methods for 'behind-the-scenes' actions |
|---|---|
| open_url() | wait_for() |
| search() | get_img_links() |
| click_button() | get_list_links() |
| scroll_up_top() | download_raw_data() |
| scroll_down_bottom() | download_images() |
| accept_cookies() | create_uuid() |
| infinite_scroll() | switch_frame() |
| see_more() | get_html() (uses Requests and BeautifulSoup) |
| next_page() | find_in_html() (uses Requests and BeautifulSoup) |
| quit() | restart_search() |
| | container_to_list() |
| | try_append() |
| | get_html_and_java() |
| | find_all_in_html() |
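As an illustration of how two of these methods fit together, here is a sketch of wait_for() and accept_cookies(). The XPath, timeout, and method signatures are assumptions for illustration, not the exact values used in the project:

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class Scraper:
    # ... constructor as in the previous sketch ...

    def wait_for(self, xpath: str, timeout: int = 10):
        # block until the element located by `xpath` is present, then return it
        return WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )

    def accept_cookies(self, xpath: str = '//button[text()="Accept"]') -> None:
        # dismiss a cookie banner if one appears; the XPath is site-specific
        try:
            self.wait_for(xpath).click()
        except TimeoutException:
            pass  # no banner appeared within the timeout, carry on
```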
Making use of inheritance, I made a child class for each website I wanted to scrape. These child classes inherit all the methods above and add methods specific to each website (a sketch follows the table below).
| Methods specific to LegoScraper | Methods specific to CollocationsScraper |
|---|---|
| get_links() | create_dict() |
| create_id() | get_words() |
| get_supporters_days_remaining() | get_infinitives() |
| get_name_date_creator() | get_phrases() |
| collect_info() | get_frequency() |
| explore_product_ideas() | |
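The sketch below shows the inheritance pattern for one of the child classes. The URL is a placeholder and the XPaths are hypothetical; only the structure (subclassing Scraper and adding site-specific methods) reflects the design described above:

```python
class LegoScraper(Scraper):
    # inherits open_url(), click_button(), wait_for(), etc. from Scraper;
    # only Lego-Ideas-specific logic is defined here

    def __init__(self):
        # placeholder URL: the real Lego Ideas address is not reproduced here
        super().__init__(url="https://example.com")

    def get_supporters_days_remaining(self) -> tuple[str, str]:
        # hypothetical XPaths; the real ones depend on the page markup
        supporters = self.wait_for('//span[@class="supporters"]').text
        days_remaining = self.wait_for('//span[@class="days-remaining"]').text
        return supporters, days_remaining
```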
Data is collected from the website and initially stored in an instance of a dataclass; this instance is then written to a JSON file.
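The project's actual dataclass fields aren't shown here, so the fields below are illustrative, but the pattern of converting a dataclass to JSON via asdict() is the standard-library approach:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProductData:
    # illustrative fields; the project's actual DataClass differs
    uuid: str
    name: str
    supporters: str
    image_links: list[str] = field(default_factory=list)


def save_record(record: ProductData, path: str = "raw_data/data.json") -> None:
    # asdict() recursively converts the dataclass to plain dicts/lists,
    # which json.dump can serialise directly
    with open(path, "w") as f:
        json.dump(asdict(record), f, indent=4)
```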
