# Web scraping
Web scraping, also called web harvesting or web data extraction, is the process of building an agent that automatically downloads, parses, extracts, and organizes useful information from the web. Instead of saving data from websites by hand, web scraping software loads the pages and extracts the required data automatically.

# Web Scraping Libraries and Process
urllib.request: A module in Python's standard library that fetches the contents of a web page from its URL.

BeautifulSoup: A class from the bs4 package that parses HTML into a navigable tree, so that elements can be searched by tag name and attributes. Its prettify() method also prints the HTML with indentation, which helps when inspecting the structure of a page.

The web scraping process can be divided into four major parts:

Reading: Requesting the page and downloading its HTML

Parsing: Turning the raw HTML into a structure that can be searched

Extraction: Pulling the required data out of the parsed page

Transformation: Converting the information into the required format, e.g., CSV
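
The four stages can be sketched end to end on a tiny inline HTML string, so nothing is fetched from the network. This is only an illustration under stated assumptions: the HTML snippet, the `phone`, `name` and `price` class names, and the `PhoneParser`, `scrape` and `to_csv` helpers are made up for the example, and the standard library's `html.parser` stands in for BeautifulSoup so the sketch runs without third-party packages.

```python
import csv
import io
from html.parser import HTMLParser

# Reading: in a real scraper this string would come from urlopen(url).read().
# The markup and class names below are hypothetical.
PAGE_HTML = """
<div class="phone"><span class="name">Phone A</span><span class="price">Rs. 10,499</span></div>
<div class="phone"><span class="name">Phone B</span><span class="price">Rs. 21,999</span></div>
"""

class PhoneParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one [name, price] list per phone
        self._field = None  # which field the current <span> holds, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "phone":
            self.rows.append(["", ""])
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "name":
            self.rows[-1][0] = data.strip()
        elif self._field == "price":
            self.rows[-1][1] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

def scrape(html):
    # Parsing and extraction: feed the HTML to the parser and collect the rows.
    parser = PhoneParser()
    parser.feed(html)
    # Transformation: keep only the digits of each price and convert to int.
    return [[name, int("".join(ch for ch in price if ch.isdigit()))]
            for name, price in parser.rows]

def to_csv(rows):
    # Transformation: serialize the rows as CSV text with a header line.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Mobile_Name", "Price"])
    writer.writerows(rows)
    return buffer.getvalue()

print(to_csv(scrape(PAGE_HTML)))
```

With BeautifulSoup the parser class would shrink to a couple of search calls, but the division into reading, parsing, extraction and transformation stays the same.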

### Topic Covered: Web Scraping & Beautiful Soup



Suppose you want to buy a phone and want to compare the features and prices of the Samsung 
phones available on Flipkart and Amazon.

#### Example 1: Scraping Data from Flipkart

To do that, let's apply web scraping to the Flipkart website and fetch the price and technical specification data from the search results page. This Flipkart link

https://www.flipkart.com/search?q=samsung+mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&as-pos=1&as-type=RECENT&suggestionId=samsung+mobiles%7CMobiles&requestId=18944876-a6ef-44c0-ac67-d31d7b11a548&as-backfill=on

contains information about Samsung phones, including technical specifications, prices, discounts, and ratings.

In [1]:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

In [2]:
url=  "https://www.flipkart.com/search?q=samsung+mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&as-pos=1&as-type=RECENT&suggestionId=samsung+mobiles%7CMobiles&requestId=18944876-a6ef-44c0-ac67-d31d7b11a548&as-backfill=on"
client= uReq(url)
page_html= client.read()
client.close()

In [3]:
client

<http.client.HTTPResponse at 0x26489fd3fa0>

In [4]:
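# Name the parser explicitly so the result does not depend on which parsers are installed.
page_soup = soup(page_html, "html.parser")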

In [14]:
# print(page_soup.prettify())

In [6]:
# Each product card is a <div> with this class; the class name is specific to Flipkart's markup.
containers = page_soup.find_all("div", {"class": "_13oc-S"})

In [7]:
print(len(containers))

24


In [8]:
import csv
import re

filename = "Samsung.csv"
headers = ["Mobile_Name", "Price", "Ratings", "Links", "Technical_Specifications"]

# newline="" lets the csv module control line endings; the with-block closes the file.
with open(filename, "w", encoding="UTF-8", newline="") as f:
    writer = csv.writer(f)  # quotes any field that contains a comma
    writer.writerow(headers)

    for container in containers:
        # Product name, with parentheses and commas stripped
        name = container.find_all("div", {"class": "_4rR01T"})[0].text
        name = re.sub(r"[(),]", "", name)
        # Relative link to the product page
        link = container.find_all("a")[0].attrs["href"]
        # Price without the currency symbol and thousands separators
        price = container.find_all("div", {"class": "_30jeq3 _1_WHN1"})[0].text
        price = re.sub(r"[^A-Za-z0-9]+", "", price)
        rating = container.find_all("div", {"class": "_3LWZlK"})[0].text
        specification = container.find_all("ul", {"class": "_1xgFaf"})[0].text
        writer.writerow([name, price, rating, link, specification])
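
Scraped fields such as links or specifications can themselves contain commas, which would split a hand-built CSV line into too many columns. A minimal sketch of how Python's `csv` module handles this by quoting such fields is shown here; the row values are hypothetical placeholders, not data scraped from Flipkart.

```python
import csv
import io

# Hypothetical row; the last field contains a comma on purpose.
row = ["SAMSUNG Galaxy M42", "21999", "4.3",
       "/samsung-galaxy-m42/p/example", "8 GB RAM, 128 GB ROM"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Mobile_Name", "Price", "Ratings", "Links", "Technical_Specifications"])
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back returns the original five fields, comma included.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1] == row)
```

Because the writer quotes the field, there is no need to delete commas from the scraped text before saving it.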

In [9]:
import pandas as pd
df = pd.read_csv("Samsung.csv", encoding="utf-8")
df

Get the details of the most expensive and the cheapest phone.

In [10]:
# Boolean indexing already returns a DataFrame; these names avoid shadowing the built-ins max and min.
df_max = df[df.Price == df.Price.max()]
df_min = df[df.Price == df.Price.min()]


In [11]:
df_max

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Technical_Specifications
20,SAMSUNG Galaxy M42 Prism Dot Gray 128 GB,21999,4.3,/samsung-galaxy-m42-prism-dot-gray-128-gb/p/it...,8 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...


In [12]:
df_min

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Technical_Specifications
13,SAMSUNG Guru Music 2 SM-B315E,1990,4.2,/samsung-guru-music-2-sm-b315e/p/itmf050063b51...,Expandable Upto 16 GB5.08 cm (2 inch) QVGA Dis...


#### Example 2: Scraping Data from Amazon
As with Flipkart, let's apply web scraping to the Amazon website and fetch the price and technical specification data from the search results page. This Amazon link

https://www.amazon.in/s?k=samsung&rh=n%3A1389401031&ref=nb_sb_noss

contains information about Samsung phones, including technical specifications, prices, discounts, and ratings.


In [13]:
url1=  "https://www.amazon.in/s?k=samsung&rh=n%3A1389401031&ref=nb_sb_noss"
client1= uReq(url1)
page_html1= client1.read()
client1.close()


HTTPError: HTTP Error 503: Service Unavailable
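
An HTTP 503 from a large retail site can mean that the request was treated as automated traffic rather than that the service is down; that reading is an assumption here, since the error itself does not say why the request was refused. One commonly tried adjustment is to send a browser-like `User-Agent` header instead of the default `Python-urllib` value. The sketch below only builds such a request. `build_request` and `fetch` are hypothetical helper names, `Mozilla/5.0` is a placeholder value, and there is no guarantee that a given site will accept it.

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent="Mozilla/5.0"):
    # Attach a browser-like User-Agent header to the request.
    return Request(url, headers={"User-Agent": user_agent})

def fetch(url, timeout=10):
    # The with-block closes the connection even if read() raises.
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

req = build_request("https://www.amazon.in/s?k=samsung&rh=n%3A1389401031&ref=nb_sb_noss")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Calling `fetch(url1)` would then return the page bytes in place of the `uReq(url1)`, `read()` and `close()` sequence, provided the server accepts the request.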

In [None]:
client1

<http.client.HTTPResponse at 0x2724f1d79d0>

In [None]:
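# Name the parser explicitly so the result does not depend on which parsers are installed.
page = soup(page_html1, "html.parser")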

In [None]:
box=page.find_all("div",{"class":"sg-col sg-col-4-of-12 sg-col-8-of-16 sg-col-12-of-20 s-list-col-right"})

In [None]:
len(box)

24

In [None]:
import csv
import re

filename = "Samsung2.csv"
headers = ["Mobile_Name", "Price", "Ratings", "Links", "Discounts"]

# newline="" lets the csv module control line endings; the with-block closes the file.
with open(filename, "w", encoding="UTF-8", newline="") as f:
    writer = csv.writer(f)  # quotes any field that contains a comma
    writer.writerow(headers)

    for phone in box:
        name = phone.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})[0].text
        name = name.replace(",", "")
        link = phone.find_all("a")[0].attrs["href"]
        # Price without the currency symbol and thousands separators
        price = phone.find_all("span", {"class": "a-offscreen"})[0].text
        price = re.sub(r"[^A-Za-z0-9]+", "", price)
        rating = phone.find_all("span", {"class": "a-icon-alt"})[0].text
        # The discount is the fifth <span> without a class; this position depends on the page layout.
        discount = phone.find_all("span", {"class": None})[4].text
        row = [name, price, rating, link, discount]
        print(",".join(row) + "\n")
        writer.writerow(row)

Samsung Galaxy M12 (Blue4GB RAM 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XGDN3TZ,(19% off)

Samsung Galaxy M12 (Black4GB RAM 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJCMGL7,(19% off)

Samsung Galaxy M32 (Light Blue 6GB RAM 128GB | FHD+ sAMOLED 90Hz Display | 6000mAh Battery | 64MP Quad Camera,13999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/B096VDR283,(26% off)

Samsung Galaxy M32 (Black 4GB RAM 64GB | FHD+ sAMOLED 90Hz Display | 6000mAh Battery | 64MP Quad Camera,11999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/B096VDG9QV,(29% off)

Samsung Galaxy S20 FE 5G (Cloud Navy 8GB RAM 128GB Storage) with No Cost EMI & Additional Exchange Offers,36990,4.4 out of 5 stars,/Samsung-Galaxy-Cloud-128GB-Storage/dp/B08VB57558,(51%
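
The rating and discount columns arrive as text such as `4.1 out of 5 stars` and `(19% off)`, which cannot be sorted or averaged directly. A small sketch of converting them to numbers with regular expressions follows; `parse_rating` and `parse_discount` are hypothetical helpers, and they assume the text keeps the shapes seen in the output above.

```python
import re

def parse_rating(text):
    # "4.1 out of 5 stars" -> 4.1; returns None when the text does not start with a number.
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

def parse_discount(text):
    # "(19% off)" -> 19; returns None when no percentage is present.
    match = re.search(r"(\d+)\s*%", text)
    return int(match.group(1)) if match else None

print(parse_rating("4.1 out of 5 stars"), parse_discount("(19% off)"))
```

Returning `None` for unexpected text keeps one malformed listing from stopping the whole loop.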

In [None]:
import pandas as pd
df2 = pd.read_csv("Samsung2.csv", encoding="utf-8")
df2

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Discounts
0,Samsung Galaxy M12 (Blue4GB RAM 64GB Storage) ...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XG...,(19% off)
1,Samsung Galaxy M12 (Black4GB RAM 64GB Storage)...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJ...,(19% off)
2,Samsung Galaxy M32 (Light Blue 6GB RAM 128GB |...,13999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/...,(26% off)
3,Samsung Galaxy M32 (Black 4GB RAM 64GB | FHD+ ...,11999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/...,(29% off)
4,Samsung Galaxy S20 FE 5G (Cloud Navy 8GB RAM 1...,36990,4.4 out of 5 stars,/Samsung-Galaxy-Cloud-128GB-Storage/dp/B08VB57558,(51% off)
5,Samsung Galaxy M32 5G (Sky Blue 8GB RAM 128GB ...,22999,4.1 out of 5 stars,/Samsung-Galaxy-Blue-128GB-Storage/dp/B09CGJFY5N,(12% off)
6,Samsung Galaxy M12 (Blue6GB RAM 128GB Storage)...,12499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Replacement/dp/B08...,(14% off)
7,Samsung Galaxy A13 Blue 4GB RAM 64GB Storage w...,13999,3.7 out of 5 stars,/Samsung-Storage-Additional-Exchange-SM-A135FL...,(24% off)
8,Samsung Galaxy M32 (Black 6GB RAM 128GB | FHD+...,13999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/...,(26% off)
9,Samsung Galaxy M33 5G (Deep Ocean Blue 8GB 128...,19499,3.9 out of 5 stars,/Samsung-Storage-Adapter-Purchased-Separately/...,(25% off)


In [None]:
# Select rows from the Amazon frame df2, using the mask computed on df2.
df2_max = df2[df2.Price == df2.Price.max()]
df2_min = df2[df2.Price == df2.Price.min()]

In [None]:
df2_max

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Discounts
4,Samsung Galaxy S20 FE 5G (Cloud Navy 8GB RAM 1...,36990,4.4 out of 5 stars,/Samsung-Galaxy-Cloud-128GB-Storage/dp/B08VB57558,(51% off)


In [None]:
df2_min

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Discounts
0,Samsung Galaxy M12 (Blue4GB RAM 64GB Storage) ...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XG...,(19% off)
1,Samsung Galaxy M12 (Black4GB RAM 64GB Storage)...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJ...,(19% off)


## Compare the price of the Samsung Galaxy M12 on Amazon and Flipkart

In [None]:
#flipkart
df3=pd.DataFrame(df[df.Mobile_Name == "SAMSUNG Galaxy M12 Black 64 GB"])


In [None]:
df3

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Technical_Specifications


In [None]:
#amazon
df4=pd.DataFrame(df2[df2.Mobile_Name == "Samsung Galaxy M12 (Black4GB RAM 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate"])

In [None]:
df4

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Discounts
1,Samsung Galaxy M12 (Black4GB RAM 64GB Storage)...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJ...,(19% off)


In [None]:
a=df3["Price"]

In [None]:
df4["Price"]


1    10499
Name: Price, dtype: int64
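
Exact string equality on `Mobile_Name` is brittle, because the two sites title the same phone differently; the Flipkart lookup above returns no rows for that reason. One possible approach, sketched here as an assumption rather than a complete solution, is to reduce each title to a coarse model key before comparing. `model_key` is a hypothetical helper, and it deliberately ignores color and storage variants.

```python
import re

def model_key(name):
    # Lowercase the title and keep "galaxy" plus the token that follows it,
    # e.g. "galaxy m12". Returns None if the pattern is absent.
    match = re.search(r"galaxy\s+(\w+)", name.lower())
    return f"galaxy {match.group(1)}" if match else None

flipkart_name = "SAMSUNG Galaxy M12 Black 64 GB"
amazon_name = ("Samsung Galaxy M12 (Black4GB RAM 64GB Storage) 6000 mAh with 8nm Processor "
               "| True 48 MP Quad Camera | 90Hz Refresh Rate")

print(model_key(flipkart_name), model_key(amazon_name))
print(model_key(flipkart_name) == model_key(amazon_name))
```

Applying such a key to both data frames would allow rows to be matched by model, after which prices for the same variant could be compared side by side.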

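Conclusion:
Flipkart is more expensive than Amazon. Note that the Flipkart lookup above returned no rows for the exact name "SAMSUNG Galaxy M12 Black 64 GB", so this comparison depends on finding the matching Flipkart listing.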