**Amazon Product Data Scraper**

---
This project uses Python with the `requests` and Beautiful Soup libraries to scrape data from Amazon product pages. It extracts the product title, price, star rating, number of reviews, and availability, which can then be stored in a structured format (such as CSV, Excel, or a pandas DataFrame) for further analysis.

**1. Importing Libraries:**

- `requests`: Used to make HTTP requests to fetch the webpage content.
- `BeautifulSoup`: Used to parse the HTML content and extract data.
- `time`, `datetime`: Imported for time-related tasks; the code shown here does not use them.
- `pandas`: Imported for data manipulation and storage (for example, collecting the results in a DataFrame); the code shown here does not use it yet.

**2. Defining Extraction Functions:**

- `get_title`, `get_price`, `get_ratings`, `get_stars`, `get_availability`: These functions take a BeautifulSoup object (representing the HTML content) and extract specific product information. They use BeautifulSoup's `find` and `get_text` methods to locate and extract data based on HTML tags and attributes. If the information is not found, they return "N/A".
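
The find-then-fallback pattern described above can be tried offline on a small HTML fragment. The sketch below is illustrative only: `extract_text` and `SAMPLE_HTML` are hypothetical names that do not appear in the notebook, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a product page.
SAMPLE_HTML = """
<html><body>
  <span id="productTitle">  Example Tablet Case - Pink  </span>
</body></html>
"""


def extract_text(soup, tag, attrs, default="N/A"):
    """Return the stripped text of the first matching element, or `default`."""
    element = soup.find(tag, attrs=attrs)
    if element is None:  # nothing matched, so fall back to the default
        return default
    return element.get_text().strip()


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = extract_text(sample_soup, "span", {"id": "productTitle"})
missing = extract_text(sample_soup, "span", {"id": "availability"})
print(title)    # text of the matched element, whitespace stripped
print(missing)  # fallback value, because the fragment has no such element
```

Checking for `None` explicitly has the same effect as catching the `AttributeError` raised when `get_text` is called on a missing element, but it makes the "not found" case visible in the code.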

**3. Fetching Product Data:**

- The code sets up the URL of an Amazon product page and headers for the request.
- It uses `requests.get` to fetch the webpage content.
- It creates a BeautifulSoup object (`soup1`) to parse the content using the `lxml` parser.
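
The fetch step can be wrapped in a small helper that also checks the HTTP status, which helps distinguish a failed request from a page that simply lacks the expected elements. This is a sketch under stated assumptions: `fetch_soup` is a hypothetical helper, the `get` parameter exists only so the function can be exercised without network access, and `html.parser` is used in place of `lxml`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers, get=requests.get, timeout=10):
    """Fetch `url` and return the parsed page, raising on HTTP error statuses."""
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises for 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")


# Usage (performs a real request, so it is left commented out):
# soup1 = fetch_soup(URL, headers)
```

A `timeout` is passed because `requests` does not apply one by default, so a stalled connection would otherwise block indefinitely.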

**4. Extracting and Displaying Data:**

- The code calls the extraction functions (e.g., `get_title(soup1)`) to extract the desired information from the `soup1` object.
- The extracted data is then printed to the console.

**5. Scraping Multiple Products:**

- A `User-Agent` header is defined so that each request resembles one sent by a regular browser.
- The code fetches product links from an Amazon UK (`amazon.co.uk`) search results page for "pink ipad".
- It iterates through the links, making a request to each product page and extracting data using the defined functions.
- The extracted data for each product is printed to the console.
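
The `href` values on a search results page can be relative paths or full URLs, so prefixing a fixed domain by string concatenation can produce an invalid host such as `www.amazon.comhttps`. One way to handle both cases is `urllib.parse.urljoin`, sketched below. `resolve_links`, `BASE_URL`, and the example `href` values are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.co.uk"  # assumed to match the search page's domain


def resolve_links(hrefs, base_url=BASE_URL):
    """Turn raw href values into absolute URLs, dropping blanks and duplicates."""
    resolved = []
    seen = set()
    for href in hrefs:
        if not href:  # skip anchors without an href
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


# Hypothetical href values: relative, already absolute, missing, and a repeat.
example_hrefs = [
    "/Example-Product/dp/B000000000",
    "https://www.amazon.co.uk/Other-Product/dp/B000000001",
    None,
    "/Example-Product/dp/B000000000",
]
resolved_links = resolve_links(example_hrefs)
print(resolved_links)
```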

**In summary, the code acts as a web scraper to automatically extract product data from Amazon and display it.** This data can be further processed or stored for analysis.
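
The notebook only prints its results. As a rough sketch of the storage step mentioned above, the extracted values could be collected as one dictionary per product and serialised to CSV with the standard library. The field names, `rows_to_csv`, and the example records are hypothetical.

```python
import csv
import io

# Hypothetical column names, one per extraction function.
FIELDS = ["title", "price", "reviews", "stars", "availability"]


def rows_to_csv(rows, fields=FIELDS):
    """Serialise a list of product dicts to CSV text, filling gaps with "N/A"."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, restval="N/A", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records; keys missing from a record are written as "N/A".
example_rows = [
    {"title": "Example Case - Pink", "price": "9.99", "stars": "4.6 out of 5 stars"},
    {"title": "Another Example"},
]
csv_text = rows_to_csv(example_rows)
print(csv_text)
```

Since `pandas` is already imported, `pd.DataFrame(example_rows).to_csv("products.csv", index=False)` would be an alternative, with `products.csv` as a placeholder file name.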

In [None]:
from bs4 import BeautifulSoup
import requests
import time
import datetime
import pandas as pd

In [None]:
URL = "https://www.amazon.com/Sony-PlayStation-500GB-Premium-Bundle-4/dp/B0DHYN6C69/ref=sr_1_1_sspa?dib=eyJ2IjoiMSJ9.Qm_QLXKAY0urfoup8ZpaMGhvRftCjGy7a9Wlj8u5VZiGP6U6D1Jmjwl6NrzKClIHY2hdPbyXvBeaf1q4nwEMl7b-XOphB99lKkP5Eg7t0zzJ33x7wEdtAZxPrAMmOeDEXCVZ1dSAIA-cjELjcjiHw4wTA-mfiJ4pDpqEouX5ZvdPbvrmeJXcQKtiuyurBWwsupRy3JKv1XgroNaQjeC-nWo76g0xBNgyFE3xlzZJ9JU.JbgINLVN-MSedo_Ta-JnwRwo8aPUmOFLWJPpvHzfZso&dib_tag=se&keywords=playstation+4&qid=1730981073&sr=8-1-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY&psc=1"
# Browser-like request headers; https://httpbin.org/get echoes back the headers a client sends
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}

# Fetch the product page
page = requests.get(URL, headers=headers)

soup1 = BeautifulSoup(page.content, "lxml")




In [None]:
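def get_title(soup):
    """Return the product title, or "N/A" if it is not found."""
    try:
        title = soup.find("span", attrs={"id": "productTitle"})
        title_value = title.get_text().strip()
    except AttributeError:  # find() returned None
        title_value = "N/A"
    return title_value

get_title(soup1)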

'N/A'

In [None]:
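def get_price(soup):
    """Return the whole-number part of the price, or "N/A" if it is not found."""
    try:
        price = soup.find("span", attrs={"class": "a-price-whole"})
        price_value = price.get_text().strip()
    except AttributeError:  # find() returned None
        price_value = "N/A"
    return price_value

get_price(soup1)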


'N/A'

In [None]:
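def get_ratings(soup):
    """Return the review-count text, or "N/A" if it is not found."""
    try:
        # Search the soup passed in, not the global soup1
        rating = soup.find("span", attrs={"id": "acrCustomerReviewText"})
        rating_value = rating.get_text().strip()
    except AttributeError:  # find() returned None
        rating_value = "N/A"
    return rating_value

get_ratings(soup1)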

'N/A'

In [None]:
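def get_stars(soup):
    """Return the star-rating text, or "N/A" if it is not found."""
    try:
        stars = soup.find("span", attrs={"id": "acrPopover"})
        stars_value = stars.get_text().strip()
        # Drop the first token; the remainder reads like "4.6 out of 5 stars"
        stars_final = ' '.join(stars_value.split()[1:])
    except AttributeError:  # find() returned None
        stars_final = "N/A"
    return stars_final

get_stars(soup1)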

'N/A'

In [None]:
def get_availability(soup):
    """Return the availability text, or "N/A" if it is not found."""
    try:
        available = soup.find("span", attrs={'id': 'availability'})
        available = available.find("span").string.strip()
    except AttributeError:  # an element or its string was missing
        available = "N/A"
    return available

get_availability(soup1)

'N/A'

In [None]:
from urllib.parse import urljoin

if __name__ == '__main__':
    # Browser-like headers sent with every request
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 14541.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US',
    }
    BASE_URL = "https://www.amazon.co.uk"  # same domain as the search page
    URL = "https://www.amazon.co.uk/s?k=pink+ipad&crid=M4TWXFEQQUKW&sprefix=pink+ipad%2Caps%2C96&ref=nb_sb_noss_1"
    webpage = requests.get(URL, headers=header)
    soup = BeautifulSoup(webpage.content, "lxml")

    # Collect the href of every product link on the search results page
    links = soup.find_all("a", attrs={'class': 'a-link-normal s-no-outline'})
    links_list = [link.get('href') for link in links if link.get('href')]

    for link in links_list:
        # urljoin resolves relative hrefs against BASE_URL and leaves
        # absolute hrefs unchanged, so the host name is never corrupted
        new_webpage = requests.get(urljoin(BASE_URL, link), headers=header)
        new_soup = BeautifulSoup(new_webpage.content, "lxml")
        print("Product Title =", get_title(new_soup))
        print("Product Price =", get_price(new_soup))
        print("Product Rating =", get_stars(new_soup))
        print("Number of Product Reviews =", get_ratings(new_soup))
        print("Availability =", get_availability(new_soup))
        print()
        print()

Product Title = N/A
Product Price = N/A
Product Rating = N/A
Number of Product Reviews = N/A
Availability = N/A


Product Title = N/A
Product Price = N/A
Product Rating = N/A
Number of Product Reviews = N/A
Availability = N/A


Product Title = N/A
Product Price = N/A
Product Rating = N/A
Number of Product Reviews = N/A
Availability = N/A


Product Title = ProCase for iPad 10th Generation Case with Pencil Holder 2022 10.9 Inch, Clear Back iPad 10 Case, 10th Gen iPad Case for A2696 A2757 A2777 -Pink
Product Price = 9.
Product Rating = N/A
Number of Product Reviews = 4.6 out of 5 stars
Availability = N/A


Product Title = N/A
Product Price = N/A
Product Rating = N/A
Number of Product Reviews = N/A
Availability = N/A


Product Title = N/A
Product Price = N/A
Product Rating = N/A
Number of Product Reviews = N/A
Availability = N/A


Product Title = N/A
Product Price = N/A
Product Rating = N/A
Number of Product Reviews = N/A
Availability = N/A


Product Title = ProCase for iPad 10th Generatio

ConnectionError: HTTPSConnectionPool(host='www.amazon.comhttps', port=443): Max retries exceeded with url: /aax-eu.amazon.co.uk/x/c/JHwD1Z5CreGfxfQ8uxccKs0AAAGTCH54_wMAAAH2AQBvbm9fdHhuX2JpZDIgICBvbm9fdHhuX2ltcDEgICCQbrtK/https://www.amazon.co.uk/Adjustable-Bracket-Surface-Portable-Monitor-Black/dp/B08L4W9C9N/ref=sxbs_sbv_search_btf?content-id=amzn1.sym.74376d27-29ad-4668-acf9-017b903e74ef%3Aamzn1.sym.74376d27-29ad-4668-acf9-017b903e74ef&crid=M4TWXFEQQUKW&cv_ct_cx=pink+ipad&dib=eyJ2IjoiMSJ9.7waBeZPu-rUCBJtKKBe0MQ.sOyuFcmSBq1Ms__aWPz7DZPzjbOkPvOXe7wBYEL6nLU&dib_tag=se&keywords=pink+ipad&nsdOptOutParam=true&pd_rd_i=B08L4W9C9N&pd_rd_r=cc0be81d-54d1-4890-bd27-b3c0cb4d8161&pd_rd_w=xE9Hv&pd_rd_wg=WahfG&pf_rd_p=74376d27-29ad-4668-acf9-017b903e74ef&pf_rd_r=B7ASJJDMR0V1XZ9YFE1S&qid=1731014326&sbo=RZvfv%2F%2FHxDF%2BO5021pAnSA%3D%3D&sprefix=pink+ipad%2Caps%2C96&sr=1-1-f1821008-9dea-4812-b2b6-4a6e4a4f2d55 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7800c3986d70>: Failed to resolve 'www.amazon.comhttps' ([Errno -2] Name or service not known)"))