Selenium is an extremely powerful tool for web scraping, but it has some limitations, which is fair given that it was built mainly for testing web applications. BeautifulSoup, on the other hand, was built specifically for data scraping, and it is extremely powerful at it. However, BeautifulSoup alone won't help if the required data sits behind a "wall", i.e. it needs the user to log in or to perform some action on the page. That's where Selenium comes in: we use it to automate user interactions on the website, and BeautifulSoup to scrape the data.
I will be using BeautifulSoup and Selenium to extract product information such as names, ratings, etc. from https://www.amazon.in/.
- Selenium
- BeautifulSoup
- time
- pandas
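If you want to follow along, these libraries can be imported roughly as below; the alias and import style are my assumption, not necessarily what the original script uses.

```python
# Assumed imports for this walkthrough (the original script may differ)
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
```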
- Opening the Chrome browser and accessing https://www.amazon.in/ via Selenium (a rough sketch of the full flow is shown after this list).
- Getting the "Search Term" thourgh user input.
- Formatting the URL into a dynamic URL where the "Search Term" can be changed.
- Scraping the Product Name, Price, Rating, No. of Items Sold and Link for each of the products.
- Navigating to the next page with Selenium by automatically clicking the "Next" button after the current page has been scraped.
- Handling errors where some elements are not present (for example, missing reviews) with a try/except block.
- Converting the scraped data to a .csv file using the pandas library.
Note: I set the page limit to the first 7 pages for scraping; you can increase it to scrape more pages.
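Here is a rough, hedged sketch of the flow described above: opening Chrome, building the dynamic search URL, scraping the product fields with BeautifulSoup inside try/except blocks, and saving the results with pandas. The selectors used (the `s-search-result` attribute and the `a-price-whole`, `a-icon-alt`, `a-size-base` classes) are my assumptions about Amazon's markup and may need adjusting; pagination and delays are sketched separately at the end of this write-up.

```python
# Hedged sketch of scraping one search-results page (not the original script).
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

search_term = input("Enter the search term: ")
# Dynamic URL: the search term is substituted into the search URL
url = "https://www.amazon.in/s?k={}".format(search_term.replace(" ", "+"))

driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)  # give the page time to load and avoid hammering the site

soup = BeautifulSoup(driver.page_source, "html.parser")
records = []

# Each search result card; the attribute below is an assumption about Amazon's markup
for item in soup.find_all("div", {"data-component-type": "s-search-result"}):
    try:
        name = item.h2.text.strip()
        link = "https://www.amazon.in" + item.h2.a.get("href")
    except AttributeError:
        continue  # skip result cards without a title/link

    # Price, rating and review count can be missing for some products,
    # hence the individual try/except blocks (this is what caused AttributeErrors).
    try:
        price = item.find("span", class_="a-price-whole").text
    except AttributeError:
        price = ""
    try:
        rating = item.find("span", class_="a-icon-alt").text
    except AttributeError:
        rating = ""
    try:
        reviews = item.find("span", class_="a-size-base").text
    except AttributeError:
        reviews = ""

    records.append([name, price, rating, reviews, link])

driver.quit()

# Convert the scraped data to a .csv file with pandas
df = pd.DataFrame(records, columns=["Name", "Price", "Rating", "Reviews", "Link"])
df.to_csv("amazon_products.csv", index=False)
```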
- Automatically scraping each page one by one.
- The output .csv file.
- There were some AttributeErrors, so I had to use a try/except block to overcome them.
- I used Selenium for paginating to the next page, so on the last page there would be an error because the "Next" button isn't there. I handled it with a try/except block and a recursive function that exits the scraping once it reaches the last page (see the pagination sketch below).
- Amazon doesn't allow scraping, so I had to use the time library and delay the scraping by a few seconds so that the site doesn't think a bot is scraping it.
- Scraping a huge amount of data from Amazon might lead to an IP ban, so I scraped only the first 7 pages.