# Web scraping
Web scraping, also called web data mining or web harvesting, is the process of building a program that automatically downloads, parses, extracts, and organizes useful information from the web. In other words, instead of saving data from websites by hand, the scraping software loads the pages and extracts the data we need from one or more websites automatically.

# Web Scraping Libraries and Process
urllib.request: A module in the Python standard library that is used to fetch the contents of a web page from its URL.

BeautifulSoup: A class from the bs4 package that parses HTML into a tree of tags that can be searched and navigated. It can also pretty-print the markup so that its structure is easier to read.

The web scraping process can be divided into four major parts:

Reading: Download the HTML page from its URL

Parsing: Convert the raw HTML into a structured, searchable form

Extraction: Pull the required data out of the parsed page

Transformation: Convert the extracted data into the required format, e.g., CSV
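
The four stages can be sketched end to end on a small inline HTML snippet, which avoids depending on any live site. The markup, the class names (`product`, `name`, `price`) and the two phones are made up for illustration; real pages use different class names that change often.

```python
import csv
import io
import re
from bs4 import BeautifulSoup

# 1. Reading: a hard-coded page stands in for urlopen(url).read()
page_html = """
<div class="product"><span class="name">Phone A</span><span class="price">Rs. 11,999</span></div>
<div class="product"><span class="name">Phone B</span><span class="price">Rs. 13,999</span></div>
"""

# 2. Parsing: build a searchable tree from the raw HTML
page_soup = BeautifulSoup(page_html, "html.parser")

# 3. Extraction: pull the name and price out of every product block
rows = []
for product in page_soup.find_all("div", {"class": "product"}):
    name = product.find("span", {"class": "name"}).text
    price_text = product.find("span", {"class": "price"}).text
    price = re.sub(r"[^0-9]", "", price_text)  # keep digits only
    rows.append([name, price])

# 4. Transformation: serialise the rows as CSV (in memory here, a file in practice)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Mobile_Name", "Price"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The examples that follow apply the same stages to real search result pages.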

### Topic Covered: Web Scraping & Beautiful Soup



Suppose you want to buy a phone and want to compare the features and prices of the Samsung 
phones available on Flipkart and Amazon.

#### Example 1: Scraping Data from Flipkart

To do that, let's apply web scraping to the Flipkart website and fetch prices and technical specifications from a search results page. This Flipkart link

https://www.flipkart.com/search?q=samsung+mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&as-pos=1&as-type=RECENT&suggestionId=samsung+mobiles%7CMobiles&requestId=18944876-a6ef-44c0-ac67-d31d7b11a548&as-backfill=on

lists Samsung phones along with their technical specifications, prices, discounts, and ratings.

In [3]:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

In [4]:
url = "https://www.flipkart.com/search?q=samsung+mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_5_na_na_na&as-pos=1&as-type=RECENT&suggestionId=samsung+mobiles%7CMobiles&requestId=18944876-a6ef-44c0-ac67-d31d7b11a548&as-backfill=on"
client = uReq(url)         # open the connection to the page
page_html = client.read()  # raw HTML as bytes
client.close()
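
Some sites respond differently to, or reject, requests that carry the default `User-Agent` sent by urllib. A minimal sketch, assuming the target site accepts a browser-like header, builds a `Request` object explicitly. The header value and the `example.com` URL are placeholders, and nothing is sent over the network until `urlopen` is called.

```python
from urllib.request import Request

# Placeholder URL and header value, chosen only for illustration
url = "https://www.example.com/search?q=samsung+mobiles"
request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"})

# Request stores header names capitalised, so the key is "User-agent"
print(request.get_header("User-agent"))
print(request.full_url)

# To actually fetch the page (network access required):
# from urllib.request import urlopen
# with urlopen(request, timeout=10) as client:
#     page_html = client.read()
```

Whether a custom header is needed, and whether scraping is permitted at all, depends on the site and its terms of use.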

In [5]:
client

<http.client.HTTPResponse at 0x1e67c8b5c00>

In [6]:
page_soup = soup(page_html, "html.parser")  # name the parser explicitly to get consistent results

In [7]:
containers = page_soup.find_all("div", {"class": "_13oc-S"})

In [8]:
print(len(containers))

24
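
Indexing a search result with `[0]` assumes that every container holds every field. A listing without, say, a rating would raise `IndexError`. One defensive pattern, sketched here with a hypothetical helper `first_text` and made-up markup, returns a default value when the element is missing.

```python
from bs4 import BeautifulSoup

def first_text(container, tag, class_name, default=""):
    """Return the stripped text of the first matching element, or `default` if none exists."""
    element = container.find(tag, {"class": class_name})
    return element.get_text(strip=True) if element is not None else default

# Made-up markup: the second product has no rating element
html = """
<div class="item"><div class="title">Phone A</div><div class="rating">4.3</div></div>
<div class="item"><div class="title">Phone B</div></div>
"""
items = BeautifulSoup(html, "html.parser").find_all("div", {"class": "item"})
ratings = [first_text(item, "div", "rating", default="NA") for item in items]
print(ratings)
```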


In [9]:
import csv
import re

filename = "Samsung.csv"

# newline="" lets the csv module control the line endings
with open(filename, "w", encoding="UTF-8", newline="") as f:
    writer = csv.writer(f)  # quotes any field that contains a comma
    writer.writerow(["Mobile_Name", "Price", "Ratings", "Links", "Technical_Specifications"])

    for container in containers:
        # Product name, with parentheses and commas removed
        name = container.find_all("div", {"class": "_4rR01T"})[0].text
        name = re.sub(r"[()]", "", name).replace(",", "")
        # Relative link to the product page
        link = container.find_all("a")[0].attrs["href"]
        # Price, with the currency symbol and thousands separators removed
        price = container.find_all("div", {"class": "_30jeq3 _1_WHN1"})[0].text
        price = re.sub(r"[^A-Za-z0-9]+", "", price)
        rating = container.find_all("div", {"class": "_3LWZlK"})[0].text
        specification = container.find_all("ul", {"class": "_1xgFaf"})[0].text
        writer.writerow([name, price, rating, link, specification])
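
Text scraped from a page can itself contain commas, which would shift the columns of a comma-separated line. A minimal sketch of how Python's `csv` module deals with that by quoting such fields; the row values are invented for illustration.

```python
import csv
import io

# Made-up row: the specification text contains a comma
row = ["Phone A", "11999", "4.3", "/phone-a/p/123", "4 GB RAM, 64 GB ROM"]

# Joining by hand produces six comma-separated pieces instead of five
naive_pieces = ",".join(row).split(",")
print(len(naive_pieces))

# csv.writer wraps the field that contains a comma in quotes
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue().strip()
print(line)

# Reading the line back yields the original five fields
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)
```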

In [11]:
import pandas as pd
df = pd.read_csv("Samsung.csv", encoding="utf-8")
df

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Technical_Specifications
0,SAMSUNG Galaxy F22 Denim Blue 64 GB,11999,4.3,/samsung-galaxy-f22-denim-blue-64-gb/p/itmce0a...,4 GB RAM | 64 GB ROM | Expandable Upto 1 TB16....
1,SAMSUNG Galaxy F22 Denim Blue 128 GB,13999,4.3,/samsung-galaxy-f22-denim-blue-128-gb/p/itm30c...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
2,SAMSUNG Galaxy F22 Denim Black 128 GB,13999,4.3,/samsung-galaxy-f22-denim-black-128-gb/p/itm9f...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
3,SAMSUNG Galaxy M32 Light Blue 128 GB,14799,4.3,/samsung-galaxy-m32-light-blue-128-gb/p/itm5d8...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
4,SAMSUNG Galaxy F22 Denim Black 64 GB,11999,4.3,/samsung-galaxy-f22-denim-black-64-gb/p/itm6f4...,4 GB RAM | 64 GB ROM | Expandable Upto 1 TB16....
5,SAMSUNG Galaxy F23 5G Aqua Blue 128 GB,16999,4.1,/samsung-galaxy-f23-5g-aqua-blue-128-gb/p/itme...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
6,SAMSUNG Galaxy F23 5G Copper Blush 128 GB,16999,4.1,/samsung-galaxy-f23-5g-copper-blush-128-gb/p/i...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
7,SAMSUNG Galaxy M33 5G Deep Ocean Blue 128 GB,16875,4.0,/samsung-galaxy-m33-5g-deep-ocean-blue-128-gb/...,6 GB RAM | 128 GB ROM16.76 cm (6.6 inch) Displ...
8,SAMSUNG Galaxy F23 5G Aqua Blue 128 GB,15999,4.1,/samsung-galaxy-f23-5g-aqua-blue-128-gb/p/itm8...,4 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
9,SAMSUNG Galaxy F23 5G Forest Green 128 GB,16999,4.1,/samsung-galaxy-f23-5g-forest-green-128-gb/p/i...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
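
The Links column holds site-relative paths rather than full URLs. A sketch of turning one into an absolute URL with `urllib.parse.urljoin`, assuming `https://www.flipkart.com` is the right base. The path below is an invented stand-in, because the paths in the table are truncated.

```python
from urllib.parse import urljoin

base_url = "https://www.flipkart.com"      # assumed base for the relative paths
relative_link = "/example-phone/p/itm000"  # invented path, not a real product

absolute_link = urljoin(base_url, relative_link)
print(absolute_link)
```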


Find the phones with the lowest and the highest price.

In [32]:
# Named df_max and df_min so that the built-in max() and min() are not shadowed
df_max = df[df.Price == df.Price.max()]
df_min = df[df.Price == df.Price.min()]


In [33]:
df_max

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Technical_Specifications
5,SAMSUNG Galaxy F23 5G Aqua Blue 128 GB,16999,4.1,/samsung-galaxy-f23-5g-aqua-blue-128-gb/p/itme...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
6,SAMSUNG Galaxy F23 5G Copper Blush 128 GB,16999,4.1,/samsung-galaxy-f23-5g-copper-blush-128-gb/p/i...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...
9,SAMSUNG Galaxy F23 5G Forest Green 128 GB,16999,4.1,/samsung-galaxy-f23-5g-forest-green-128-gb/p/i...,6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...


In [34]:
df_min

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Technical_Specifications
20,SAMSUNG GURU GT,1497,4.1,/samsung-guru-gt/p/itm15efc9f269431?pid=MOBG9Z...,153 MB RAM | 153 MB ROM3.81 cm (1.5 inch) Disp...
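
Filtering with a boolean mask returns every row that ties for the extreme price, which is why three phones appear for the maximum. A small sketch on three rows copied from the Flipkart table above; `idxmax` is shown for contrast, since it returns only the first matching row.

```python
import pandas as pd

# Three rows taken from the Flipkart table above
df = pd.DataFrame({
    "Mobile_Name": [
        "SAMSUNG Galaxy F22 Denim Blue 64 GB",
        "SAMSUNG Galaxy F23 5G Aqua Blue 128 GB",
        "SAMSUNG Galaxy F23 5G Copper Blush 128 GB",
    ],
    "Price": [11999, 16999, 16999],
})

all_max = df[df.Price == df.Price.max()]  # every row tied for the highest price
first_max = df.loc[df.Price.idxmax()]     # only the first such row

print(len(all_max))
print(first_max.Mobile_Name)
```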


#### Example 2: Scraping Data from Amazon

Next, let's apply the same approach to the Amazon website and fetch prices, ratings, and discounts from a search results page. This Amazon link

https://www.amazon.in/s?k=samsung&rh=n%3A1389401031&ref=nb_sb_noss

lists Samsung phones along with their technical specifications, prices, discounts, and ratings.


In [15]:
url1 = "https://www.amazon.in/s?k=samsung&rh=n%3A1389401031&ref=nb_sb_noss"
client1 = uReq(url1)
page_html1 = client1.read()
client1.close()


In [16]:
client1

<http.client.HTTPResponse at 0x1e60c7a2b00>

In [17]:
page = soup(page_html1, "html.parser")  # name the parser explicitly to get consistent results

In [18]:
box=page.find_all("div",{"class":"sg-col sg-col-4-of-12 sg-col-8-of-16 sg-col-12-of-20 s-list-col-right"})

In [20]:
len(box)

24

In [21]:
import csv
import re

filename = "Samsung2.csv"

# newline="" lets the csv module control the line endings
with open(filename, "w", encoding="UTF-8", newline="") as f:
    writer = csv.writer(f)  # quotes any field that contains a comma
    writer.writerow(["Mobile_Name", "Price", "Ratings", "Links", "Discounts"])

    for phone in box:
        name = phone.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})[0].text
        name = name.replace(",", "")
        link = phone.find_all("a")[0].attrs["href"]
        # Price, with the currency symbol and thousands separators removed
        price = phone.find_all("span", {"class": "a-offscreen"})[0].text
        price = re.sub(r"[^A-Za-z0-9]+", "", price)
        rating = phone.find_all("span", {"class": "a-icon-alt"})[0].text
        # On this page layout the fifth <span> without a class holds the discount text
        discount = phone.find_all("span", {"class": None})[4].text
        row = [name, price, rating, link, discount]
        print(",".join(row) + "\n")
        writer.writerow(row)

Samsung Galaxy M12 (Blue4GB RAM 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XGDN3TZ,(19% off)

Samsung Galaxy M12 (Black4GB RAM 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJCMGL7,(19% off)

Samsung Galaxy M32 (Black 4GB RAM 64GB | FHD+ sAMOLED 90Hz Display | 6000mAh Battery | 64MP Quad Camera,14999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/B096VDG9QV,(12% off)

Samsung Galaxy M32 (Light Blue 6GB RAM 128GB | FHD+ sAMOLED 90Hz Display | 6000mAh Battery | 64MP Quad Camera,16999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/B096VDR283,(11% off)

Samsung Galaxy M32 5G (Sky Blue 8GB RAM 128GB Storage) | Dimensity 720 Processor | 5000mAh Battery| Knox Security,22999,4.1 out of 5 stars,/Samsung-Galaxy-Blue-128GB-Storage/dp/B09CGJFY

In [22]:
import pandas as pd
df2 = pd.read_csv("Samsung2.csv", encoding="utf-8")
df2

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Discounts
0,Samsung Galaxy M12 (Blue4GB RAM 64GB Storage) ...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XG...,(19% off)
1,Samsung Galaxy M12 (Black4GB RAM 64GB Storage)...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJ...,(19% off)
2,Samsung Galaxy M32 (Black 4GB RAM 64GB | FHD+ ...,14999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/...,(12% off)
3,Samsung Galaxy M32 (Light Blue 6GB RAM 128GB |...,16999,4.1 out of 5 stars,/Samsung-Galaxy-Storage-Months-Replacement/dp/...,(11% off)
4,Samsung Galaxy M32 5G (Sky Blue 8GB RAM 128GB ...,22999,4.1 out of 5 stars,/Samsung-Galaxy-Blue-128GB-Storage/dp/B09CGJFY5N,(12% off)
5,Samsung Galaxy M33 5G (Deep Ocean Blue 6GB 128...,17999,3.8 out of 5 stars,/Samsung-Storage-Adapter-Purchased-Separately/...,(28% off)
6,Samsung Galaxy S20 FE 5G (Cloud Navy 8GB RAM 1...,39990,4.4 out of 5 stars,/Samsung-Galaxy-Cloud-128GB-Storage/dp/B08VB57558,(47% off)
7,Samsung Galaxy A13 Blue 4GB RAM 64GB Storage w...,14999,3.5 out of 5 stars,/Samsung-Storage-Additional-Exchange-SM-A135FL...,(19% off)
8,Samsung Galaxy M21 2021 Edition (Arctic Blue 4...,12999,4.2 out of 5 stars,/Samsung-Storage-sAMOLED-Replacement-SM-M215GL...,(10% off)
9,Samsung Galaxy A52s 5G (Black 8GB RAM 128GB St...,29700,4.2 out of 5 stars,/Samsung-Galaxy-Storage-Without-Offers/dp/B09D...,(29% off)
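
The Ratings and Discounts columns are stored as text such as `4.1 out of 5 stars` and `(19% off)`, so they cannot be sorted or compared numerically as they are. A sketch of two hypothetical helpers, `parse_rating` and `parse_discount`, that pull the numbers out with regular expressions, assuming the text always follows the patterns seen in the table above.

```python
import re

def parse_rating(text):
    """Extract the leading number from strings like '4.1 out of 5 stars'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

def parse_discount(text):
    """Extract the percentage from strings like '(19% off)'."""
    match = re.search(r"(\d+)\s*%", text)
    return int(match.group(1)) if match else None

print(parse_rating("4.1 out of 5 stars"))
print(parse_discount("(19% off)"))
```

With pandas, such helpers could be applied column-wise, for example `df2["Ratings"].map(parse_rating)`.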


In [23]:
# Filter df2 itself; indexing df with this mask would return Flipkart rows instead
df2_max = df2[df2.Price == df2.Price.max()]
df2_min = df2[df2.Price == df2.Price.min()]

In [24]:
df2_max

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Discounts
6,Samsung Galaxy S20 FE 5G (Cloud Navy 8GB RAM 1...,39990,4.4 out of 5 stars,/Samsung-Galaxy-Cloud-128GB-Storage/dp/B08VB57558,(47% off)


In [25]:
df2_min

The cheapest Amazon listing is the row at index 19 of df2, which lies outside the ten-row preview of df2 shown above.


## Compare the price of the Samsung Galaxy M12 on Amazon and Flipkart

In [26]:
#flipkart
df3=pd.DataFrame(df[df.Mobile_Name == "SAMSUNG Galaxy M12 Black 64 GB"])


In [27]:
df3

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Technical_Specifications
14,SAMSUNG Galaxy M12 Black 64 GB,11192,4.2,/samsung-galaxy-m12-black-64-gb/p/itm425898eed...,4 GB RAM | 64 GB ROM16.51 cm (6.5 inch) Displa...


In [28]:
#amazon
df4=pd.DataFrame(df2[df2.Mobile_Name == "Samsung Galaxy M12 (Black4GB RAM 64GB Storage) 6000 mAh with 8nm Processor | True 48 MP Quad Camera | 90Hz Refresh Rate"])

In [29]:
df4

Unnamed: 0,Mobile_Name,Price,Ratings,Links,Discounts
1,Samsung Galaxy M12 (Black4GB RAM 64GB Storage)...,10499,4.1 out of 5 stars,/Samsung-Galaxy-M12-Storage-Processor/dp/B08XJ...,(19% off)
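
The two sites title the same phone differently, so the lookups above rely on exact, hand-copied strings. A rough sketch of a more forgiving match uses a hypothetical helper `model_key` that extracts the Galaxy model code with a regular expression. It ignores colour, storage, and suffixes such as 5G, so it is only a starting point.

```python
import re

def model_key(title):
    """Return a lowercase 'galaxy <model>' key, or '' if the title has no Galaxy model code."""
    match = re.search(r"galaxy\s+([a-z]\d+\w*)", title, flags=re.IGNORECASE)
    return f"galaxy {match.group(1).lower()}" if match else ""

flipkart_title = "SAMSUNG Galaxy M12 Black 64 GB"
amazon_title = ("Samsung Galaxy M12 (Black4GB RAM 64GB Storage) 6000 mAh with 8nm Processor "
                "| True 48 MP Quad Camera | 90Hz Refresh Rate")

print(model_key(flipkart_title))
print(model_key(flipkart_title) == model_key(amazon_title))
```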


In [30]:
a=df3["Price"]

In [31]:
df4["Price"]


1    10499
Name: Price, dtype: int64
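
With both prices in hand, the gap can be worked out directly. The two values are the prices shown for df3 (Flipkart) and df4 (Amazon) above.

```python
flipkart_price = 11192  # from df3 above
amazon_price = 10499    # from df4 above

difference = flipkart_price - amazon_price
percent_of_amazon = round(100 * difference / amazon_price, 1)
print(difference, percent_of_amazon)
```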

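Conclusion:
For the Samsung Galaxy M12 (Black, 64 GB), the Flipkart price (11192) is higher than the Amazon price (10499) in the data scraped here.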