<a href="https://colab.research.google.com/github/Charishma-Bailapudi/amazon-web-scraping/blob/main/amazon_web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### **Web Scraping**
1. Web scraping is a technique for extracting large amounts of data from websites. The term "scraping" refers to obtaining information from another source (web pages) and saving it to a local file.

2. Web scraping collects data that websites present in an unstructured format and converts it into a structured form.





**IMPORT REQUIRED LIBRARIES**

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

### **How does Web Scraping work?**
**Step - 1: Find the URL that you want to scrape**

First, understand what data your project requires. A web page or website contains a large amount of information, so scrape only what is relevant. In other words, the developer should know the data requirements before writing any code.
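
As a small illustration of Step 1, the sketch below builds a search URL from a keyword using only the standard library. The helper name `build_search_url` is hypothetical, and the `k` query parameter is taken from the search URL used later in this notebook; treat it as an assumption about how the site's search endpoint behaves rather than a documented API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical helper (not part of the original notebook): build a search
# URL for a keyword. The base path and the "k" parameter are assumptions
# copied from the URL used later in this notebook.
def build_search_url(keyword, base="https://www.amazon.com/s"):
    # urlencode escapes spaces as '+' and percent-encodes special characters
    return f"{base}?{urlencode({'k': keyword})}"

url = build_search_url("playstation 4")
print(url)

# Round trip: the query string decodes back to the original keyword
print(parse_qs(urlsplit(url).query))
```

Building the URL this way avoids hand-escaping spaces and special characters when the keyword changes.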

**Step - 2: Inspecting the Page**

Inspect the page (for example, with the browser's developer tools) to find the tags, ids, and classes that hold the data you need. The data arrives as raw HTML, which must be parsed carefully to remove noise. The data can be as simple as a name and address or as complex as high-dimensional weather and stock market data.
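
To make the parsing step concrete, here is a minimal sketch that pulls the text of one element out of raw HTML. It uses the standard library's `html.parser` as a stand-in so that it runs without extra packages; the rest of this notebook uses BeautifulSoup for the same job. The class name `SpanTextById` and the sample HTML are made up for illustration; only the `productTitle` id comes from the notebook.

```python
from html.parser import HTMLParser

class SpanTextById(HTMLParser):
    """Collect the text inside the first <span> that has the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target span
        self.chunks = []    # text pieces found inside the target span
        self.done = False   # True once the target span has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested spans so we know which </span> closes the target
            if tag == "span":
                self.depth += 1
        elif tag == "span" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "span":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    @property
    def text(self):
        return "".join(self.chunks).strip()

# Made-up HTML fragment; a real product page is far noisier
sample_html = """
<div id="titleSection">
  <h1><span id="productTitle">   PlayStation 4 Slim 1TB Console (Renewed)   </span></h1>
  <span id="other">ignored</span>
</div>
"""

parser = SpanTextById("productTitle")
parser.feed(sample_html)
print(parser.text)
```

The surrounding whitespace and the unrelated `<span>` are the kind of noise the parsing step has to discard.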

**Step - 3: Write the code**

Write code that requests the page and extracts the relevant information, then run it.

**Step - 4: Store the data in a file**

Store the extracted information in the required file format, such as CSV, XML, or JSON.
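
A minimal sketch of the storage step, assuming the scraped records are held as a list of dictionaries. The rows below are made-up placeholders, and the data is written to in-memory text so the example has no side effects. To write real files, pass `open("data.csv", "w", newline="")` or `open("data.json", "w")` instead (file names hypothetical).

```python
import csv
import io
import json

# Made-up placeholder rows; real rows would come from the scraper
rows = [
    {"title": "Sample Console A", "price": "$100.00"},
    {"title": "Sample Console B", "price": ""},
]

# CSV: one header line, then one line per record
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSON: the same records as a list of objects
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(json_text)
```

CSV suits flat, tabular records like these, while JSON is often the easier choice once records contain nested fields.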




In [None]:
# Function to extract Product Title
def get_title(soup):

    try:
        # Outer Tag Object
        title = soup.find("span", attrs={"id":'productTitle'})
        
        # Text content of the tag (all nested strings joined)
        title_value = title.text

        # Title as a string value
        title_string = title_value.strip()

    except AttributeError:
        title_string = ""

    return title_string

# Function to extract Product Price
def get_price(soup):

    try:
        price = soup.find("span", attrs={'id':'priceblock_ourprice'}).string.strip()

    except AttributeError:

        try:
            # If there is some deal price
            price = soup.find("span", attrs={'id':'priceblock_dealprice'}).string.strip()

        except AttributeError:
            price = ""

    return price

# Function to extract Product Rating
def get_rating(soup):

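    try:
        # Matches only icons carrying this exact class string (the a-star-4-5 variant)
        rating = soup.find("i", attrs={'class':'a-icon a-icon-star a-star-4-5'}).string.strip()

    except AttributeError:
        try:
            # Fallback: alt text of the star icon
            rating = soup.find("span", attrs={'class':'a-icon-alt'}).string.strip()
        except AttributeError:
            rating = ""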

    return rating

# Function to extract Number of User Reviews
def get_review_count(soup):
    try:
        review_count = soup.find("span", attrs={'id':'acrCustomerReviewText'}).string.strip()

    except AttributeError:
        review_count = ""

    return review_count

# Function to extract Availability Status
def get_availability(soup):
    try:
        available = soup.find("div", attrs={'id':'availability'})
        available = available.find("span").string.strip()

    except AttributeError:
        available = "Not Available"

    return available
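

The helper functions above return raw strings such as `$347.93`, `3.9 out of 5 stars`, and `4,183 ratings`. For analysis it usually helps to convert these to numbers. The functions below are a hypothetical sketch (not part of the original notebook) and assume the string formats seen in this notebook's output; other locales or page layouts may need different patterns.

```python
import re

def parse_price(text):
    """'$347.93' -> 347.93; empty or unparseable text -> None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None

def parse_rating(text):
    """'3.9 out of 5 stars' -> 3.9; empty or unparseable text -> None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of", text)
    return float(match.group(1)) if match else None

def parse_review_count(text):
    """'4,183 ratings' -> 4183; empty or unparseable text -> None."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None

print(parse_price("$347.93"))
print(parse_rating("3.9 out of 5 stars"))
print(parse_review_count("4,183 ratings"))
```

Returning `None` for missing values keeps them distinguishable from a genuine zero when the data is loaded into a DataFrame.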


In [None]:
if __name__ == '__main__':

    # add your user agent 
    HEADERS = ({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'})

    # The webpage URL
    URL = "https://www.amazon.com/s?k=playstation+4&ref=nb_sb_noss_2"

    # HTTP Request
    webpage = requests.get(URL, headers=HEADERS)

    # Soup Object containing all data
    soup = BeautifulSoup(webpage.content, "html.parser")

    # Fetch links as List of Tag Objects
    links = soup.find_all("a", attrs={'class':'a-link-normal s-no-outline'})

    # Store the links
    links_list = []

    # Loop for extracting links from Tag Objects
    for link in links:
        href = link.get('href')
        if href:  # skip anchors that have no href attribute
            links_list.append(href)

    d = {"title":[], "price":[], "rating":[], "reviews":[],"availability":[]}
    
    # Loop for extracting product details from each link 
    for link in links_list:
        new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS)

        new_soup = BeautifulSoup(new_webpage.content, "html.parser")

        # Function calls to collect all necessary product information
        d['title'].append(get_title(new_soup))
        d['price'].append(get_price(new_soup))
        d['rating'].append(get_rating(new_soup))
        d['reviews'].append(get_review_count(new_soup))
        d['availability'].append(get_availability(new_soup))

    
    amazon_df = pd.DataFrame.from_dict(d)
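    # Assign the result back; in-place replace on a selected column may not update the DataFrame in recent pandas versions
    amazon_df['title'] = amazon_df['title'].replace('', np.nan)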
    amazon_df = amazon_df.dropna(subset=['title'])
    amazon_df.to_csv("amazon_data.csv", header=True, index=False)
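

A search results page can link to the same product more than once, which leads to duplicate rows and repeated requests. The sketch below shows one possible way to normalize and de-duplicate the collected `href` values before fetching each product page. The function name `unique_absolute_links` and the sample paths are hypothetical. Note that it compares full URLs, so two links to the same product that differ only in tracking parameters would still count as distinct.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"

def unique_absolute_links(hrefs, base=BASE_URL):
    """Return absolute URLs in first-seen order, skipping empty values and repeats."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href yield None
        absolute = urljoin(base, href)  # leaves already-absolute URLs unchanged
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

# Made-up sample paths for illustration only
sample = [
    "/dp/EXAMPLE1",
    None,
    "/dp/EXAMPLE2",
    "/dp/EXAMPLE1",
    "https://www.amazon.com/dp/EXAMPLE2",
]
print(unique_absolute_links(sample))
```

Using `urljoin` rather than string concatenation also handles `href` values that are already absolute.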


In [None]:
amazon_df

index,title,price,rating,reviews,availability
1,PlayStation 4 500GB Console (Renewed),$347.93,3.9 out of 5 stars,329 ratings,In Stock.
2,PlayStation 4 Slim 1TB Console (Renewed),$349.95,4.3 out of 5 stars,977 ratings,In Stock.
8,"Playstation SONY 4, 500GB Slim System [CUH-221...",,4.7 out of 5 stars,326 ratings,Only 7 left in stock - order soon.
10,Sony - PlayStation 4 Pro Console (3002470) Jet...,$413.99,4.3 out of 5 stars,204 ratings,Only 6 left in stock - order soon.
11,Sony PlayStation 4 Pro 1TB Console - Black (PS...,,4.5 out of 5 stars,"4,183 ratings",Only 10 left in stock - order soon.
12,"Playstation Sony 4, 500GB Slim System [CUH-221...",,4.5 out of 5 stars,298 ratings,In Stock.
19,OIVO PS4 Stand Cooling Fan Station for Playsta...,,4.5 out of 5 stars,"40,792 ratings",In Stock.
34,DualShock 4 Wireless Controller for PlayStatio...,$59.99,4.7 out of 5 stars,"142,604 ratings",
36,PlayStation 4 Slim 1TB Console (Renewed),$349.95,4.3 out of 5 stars,977 ratings,In Stock.
39,"Playstation Sony 4, 500GB Slim System [CUH-221...",,4.5 out of 5 stars,298 ratings,In Stock.
