| Details | |
|---|---|
| Name | Ritvik Gupta |
| Registration Number | 19BCE0397 |
| Assignment | 5th - Web Scraping |

The scraped data includes the following attributes for both phone types:

- Image URL - The main photo of the phone
- Phone URL - Link to the phone's product page on Flipkart
- Name
- Rating - Average rating of the phone by reviewers
- Total Reviews
- Price
- Colors - Model colors available
- Storages - Model storage options available (e.g. 64GB)
- General Specs - Specifications such as In The Box, SIM Type, Hybrid Sim Slot, Touchscreen, OTG Compatible.

Scrapy is a web scraping tool for Python, like BeautifulSoup (`bs4`), but unlike the latter it provides many more features, including scraping multiple webpages in parallel and recursively scraping paginated sites.

- Scrape a limited number of Samsung Galaxy phones from the first page and store the scraped data in JSON format, with multiple fields in a nested structure.
- Scrape recursively through all iPhones across all 15 pages (starting from the first page) listed on Flipkart for different models. Each page follows its "Next" link until the last page is reached. The scraped data is stored in CSV format, which cannot hold a nested structure, so "General Specs" is flattened out.
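
To make the flattening step concrete, here is a minimal sketch of one way to lift a nested specs dictionary into top-level CSV columns. The field names (`general_specs`, `sim_type`, and so on), the `spec_` prefix, and the sample values are assumptions for illustration; the actual spiders may flatten the data differently.

```python
import csv
import io


def flatten_item(item, nested_key="general_specs"):
    """Return a copy of `item` with the dict under `nested_key` lifted
    into top-level keys, so every value fits in a single CSV cell."""
    flat = {k: v for k, v in item.items() if k != nested_key}
    for spec_name, spec_value in item.get(nested_key, {}).items():
        # Prefix the column name to avoid clashing with top-level fields.
        flat[f"spec_{spec_name}"] = spec_value
    return flat


# Hypothetical record; the values are made up for illustration.
item = {
    "name": "Example Phone",
    "price": 10000,
    "general_specs": {"sim_type": "Dual Sim", "touchscreen": "Yes"},
}

row = flatten_item(item)

# Write the flattened row to an in-memory CSV to show the resulting columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

The nested form can be written to JSON unchanged, while the flattened form gives each spec its own column in the CSV.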

Details on how each individual field is extracted during scraping are documented in the accompanying comments.

The only tool used is Scrapy for Python (following the tutorial).

To generate the two spiders, the command used is

`scrapy genspider <spider-name> <main-url-used>`

Note: Spider names need to be unique to identify the spiders. In our case they are `flipkart_iphones` and `flipkart_galaxys`.

To run a specific spider

`scrapy crawl <spider-name-provided> -O <output-file>.<csv|json>`

Note: The flag `-O` overwrites any previous content and `-o` appends.
