🎉 DS001 - Scraping to Analysis (Extra Store)

The present project is a basic process pipeline of extracting, transforming, loading, analyzing and presenting data. All of it was done using suitable web scraping, data analysis/presentation and database tools.

[cover image]


Objectives:

  • Create a crawler able to scrape offers and reviews from the Extra web store, more specifically offers and reviews about coolers, televisions and printers;
  • Save the data in a database in an automated way;
  • Analyze the product and review data;
  • Create a basic presentation using the Extra offers information.

💻 Step 1. Code code... and code

To gather website information programmatically, a crawler is needed (the crawler in the DS001 project was made with Python and Scrapy). Looking at the Extra web store's source code and the requests fired in the browser, we can find some API URLs being triggered. Using those API URLs makes the work much easier.
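As a rough idea of what such a spider can look like, here is a minimal sketch. The endpoint and the response field names are placeholders, not the actual Extra API; the real URL has to be copied from the browser's network tab.

```python
import scrapy


class CoolersSpider(scrapy.Spider):
    """Minimal sketch of a spider that queries a store search API instead of parsing HTML."""
    name = "coolers"
    # Hypothetical API endpoint; replace with the real URL seen in the browser.
    start_urls = ["https://api.example-store.com/v1/search?query=cooler&page=1"]

    def parse(self, response):
        payload = response.json()
        # Field names here are assumptions; adjust them to the actual API response.
        for product in payload.get("products", []):
            yield {
                "name": product.get("name"),
                "price": product.get("price"),
                "url": product.get("url"),
            }
```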

🛣️ Step 2. Choose a way to scrape and save the data

Since review data can be extracted while scraping offer data, a good approach is to split the work into three spiders (coolers, televisions and printers) without creating additional spiders for reviews only. Review objects are bigger than offer objects, so the impact of scraping both together in each spider isn't too severe. The crawler itself saves the data to a MongoDB database through the files "pipelines.py" and "items.py", as sketched below.
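Here is a minimal sketch of how an item pipeline can write scraped items straight to MongoDB. The connection string and the database/collection names are assumptions; a real project would typically read them from the Scrapy settings via from_crawler().

```python
import pymongo


class MongoPipeline:
    """Sketch of a Scrapy item pipeline that persists items to MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="extra_store"):
        # Connection details are assumptions; in practice they come from settings.
        self.client = pymongo.MongoClient(mongo_uri)
        self.db = self.client[db_name]

    def process_item(self, item, spider):
        # One collection per item type, e.g. "products" or "reviews" (names assumed).
        collection = "reviews" if "review" in type(item).__name__.lower() else "products"
        self.db[collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```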

🕷️ Step 3. Run the spiders

Running the spiders with the command "scrapy crawl <SPIDER_NAME>":

[screenshot: step_3.1]

So...

[screenshot: step_3.2]
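As an alternative to launching each spider by hand, Scrapy also lets you run all three from a single Python script. The spider names below are assumptions based on the three categories mentioned above.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Spider names are assumptions based on the coolers/televisions/printers split.
process = CrawlerProcess(get_project_settings())
for spider_name in ("coolers", "televisions", "printers"):
    process.crawl(spider_name)
process.start()  # blocks until all three spiders finish
```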

💾 Step 4. Wait...

Data being saved in the MongoDB database:

[screenshot: step_4]

  • Product data format in the database:

[screenshot: step_4.1]

  • Review data format in the database:

[screenshot: step_4.2]

I stopped the crawlers early due to the deadline for delivering the case 😳. So, the result... was about 31k documents saved in the MongoDB database.

[screenshot: step_4.3]

🕶️ Step 5. Looking for a first understanding of the data

MongoDB has its own tools for basic data analysis inside the database:

[screenshot: step_5]
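The same kind of quick checks can also be done from Python with PyMongo. This is only a sketch: the connection string, database name, collection names and fields ("rating", "product_id") are assumptions, not the project's actual schema.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connection string is an assumption
db = client["extra_store"]                         # database/collection names are placeholders

# Quick sanity check: how many documents landed in each collection.
print(db["products"].count_documents({}))
print(db["reviews"].count_documents({}))

# Small aggregation: top 5 products by average rating (assumes a "rating" field).
pipeline = [
    {"$group": {"_id": "$product_id", "avg_rating": {"$avg": "$rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5},
]
for doc in db["reviews"].aggregate(pipeline):
    print(doc)
```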

📈 Step 6. Making a deeper descriptive analysis

In a Jupyter Notebook some incredible things can be done. Python is a really flexible and versatile programming language. Using libraries/packages like Matplotlib, Pandas, NumPy and Seaborn, a complete descriptive analysis is within reach.

[screenshot: step_6]
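For a flavor of what the notebook analysis can look like, here is a small sketch using Pandas and Seaborn. The collection and column names ("price", "rating") are assumptions, not the project's actual schema.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pymongo import MongoClient

# Load the scraped collections into DataFrames (names are assumptions).
db = MongoClient("mongodb://localhost:27017")["extra_store"]
products = pd.DataFrame(list(db["products"].find({}, {"_id": 0})))
reviews = pd.DataFrame(list(db["reviews"].find({}, {"_id": 0})))

# Basic descriptive statistics, assuming numeric "price" and "rating" columns.
print(products["price"].describe())
print(reviews["rating"].value_counts(normalize=True))

# One quick visual: the price distribution.
sns.histplot(products["price"].dropna())
plt.show()
```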

🎨 Step 7. Exporting data and making a simple presentation

  • Exporting product data from MongoDB as CSV:

[screenshot: step_7.1]

  • Exporting review data from MongoDB as CSV:

[screenshot: step_7.2]
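The screenshots above use MongoDB's own export tooling; a Python alternative that produces the same CSVs could look like this sketch (connection and collection names are assumptions):

```python
import pandas as pd
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["extra_store"]  # names are assumptions

# Dump each collection to a CSV file that Power BI can import.
for name in ("products", "reviews"):
    df = pd.DataFrame(list(db[name].find({}, {"_id": 0})))
    df.to_csv(f"{name}.csv", index=False)
```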

The whole presentation was made in Power BI Desktop, an awesome tool for data visualization and presentation.

  • Interactive charts presentation on a computer:

[screenshot: step_7.3]

  • Interactive charts presentation on a smartphone:

[screenshot: step_7.3]

🚀 The end.
