
Ritvik-Gupta/scrapy_tutorial


Scrapy Web Mining

Scraping Mobile Phones from Flipkart

Details
Name: Ritvik Gupta
Registration Number: 19BCE0397
Assignment: 5th - Web Scraping

The scraped data includes the following attributes for both phone types:

  • Image URL - The main photo of the phone

  • Phone URL - Link to the phone's product page on Flipkart for the consumer

  • Name

  • Rating - Average rating of the phone by reviewers

  • Total Reviews

  • Price

  • Colors - Model colors available

  • Storages - Model storage space available (e.g. 64GB)

  • General Specs - Specifications such as In The Box, SIM Type, Hybrid Sim Slot, Touchscreen, OTG Compatible.
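As a rough illustration of how one scraped record could be shaped, the sketch below models the attributes above as a Python dataclass and serialises it to nested JSON. The class name, field names, and sample values are assumptions made for this example; they are not taken from the project's spiders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Phone:
    # Hypothetical field names mirroring the attribute list above.
    image_url: str
    phone_url: str
    name: str
    rating: float
    total_reviews: int
    price: int
    colors: List[str] = field(default_factory=list)
    storages: List[str] = field(default_factory=list)
    general_specs: Dict[str, str] = field(default_factory=dict)


# Made-up sample values, not real scraped data.
sample = Phone(
    image_url="https://example.com/img/phone.jpg",
    phone_url="https://example.com/phone",
    name="Example Phone",
    rating=4.5,
    total_reviews=120,
    price=49999,
    colors=["Black", "Blue"],
    storages=["64GB", "128GB"],
    general_specs={"SIM Type": "Dual Sim", "Touchscreen": "Yes"},
)

# asdict() converts the dataclass, including nested lists and dicts,
# into plain Python structures that json can serialise.
record = asdict(sample)
print(json.dumps(record, indent=2))
```

Keeping "General Specs" as a nested dictionary suits JSON output; tabular formats need it flattened first.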

Scrapy is a web scraping tool like BeautifulSoup (bs4), but unlike the latter it provides many more features, including scraping multiple webpages in parallel and recursively scraping paginated sites.

The project includes the following two spider scripts:

  • Scrape a limited number of Samsung Galaxy phones from the first page and store the scraped data in JSON format, with multiple fields in a nested structure.

  • Recursively scrape all iPhones across all 15 pages (starting from the first page) listed on Flipkart for different models. Each paginated page requests its "Next" page, following the links to the end. The scraped data is stored in CSV format, which cannot hold a nested structure, so "General Specs" is flattened out.
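The flattening step for CSV output can be sketched as follows. This is a minimal standalone example, not the project's actual code: the function name `flatten_item`, the key `general_specs`, and the sample item are all assumptions chosen for illustration.

```python
import csv
import io


def flatten_item(item, nested_key="general_specs"):
    """Return a flat dict suitable for one CSV row."""
    flat = {}
    for key, value in item.items():
        if key == nested_key and isinstance(value, dict):
            # Each nested spec becomes its own top-level column.
            flat.update(value)
        elif isinstance(value, (list, tuple)):
            # Lists such as colors or storages are joined into one cell.
            flat[key] = ", ".join(map(str, value))
        else:
            flat[key] = value
    return flat


# Made-up sample item, not real scraped data.
item = {
    "name": "Example Phone",
    "price": 49999,
    "storages": ["64GB", "128GB"],
    "general_specs": {"SIM Type": "Dual Sim", "OTG Compatible": "Yes"},
}
row = flatten_item(item)

# Write the flattened row to an in-memory CSV to show the resulting layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

One trade-off of this approach is that a spec name colliding with an existing top-level field would overwrite it; prefixing the spec columns would avoid that.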

Details on how each individual component is extracted during scraping can be found in the comments in the spider scripts.

Tools Used

The main and only tool used is Scrapy for Python (following the tutorial).

Generating and Running Spiders

To generate the two spiders, the command used is

scrapy genspider <spider-name> <main-url-used>

Note: Spider names need to be unique to identify the spiders. In our case they are flipkart_iphones and flipkart_galaxys.

To run a specific spider

scrapy crawl <spider-name-provided> -O <output-file>.<csv|json>

Note: The flag -O overwrites any previous content and -o appends.
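The same overwrite or append choice can also be expressed in a project's settings through Scrapy's `FEEDS` setting, where each output file takes a `format` and an `overwrite` option. The fragment below is a sketch only; the file names are hypothetical and this configuration is not taken from the project.

```python
# settings.py fragment (hypothetical file names)
FEEDS = {
    # Roughly equivalent to: -O iphones.csv (replace any existing file)
    "iphones.csv": {"format": "csv", "overwrite": True},
    # Roughly equivalent to: -o galaxys.json (append to any existing file)
    "galaxys.json": {"format": "json", "overwrite": False},
}
```

Appending to an existing JSON file generally leaves it as invalid JSON, so overwriting is usually the safer choice for that format.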
