# Web Scraping 

Web scraping, also called web harvesting or web data extraction, is a technique for extracting large amounts of data from websites. The most popular Python libraries for web scraping are Scrapy, Beautiful Soup, and Selenium. This walkthrough uses Beautiful Soup, a Python library for pulling data out of HTML and XML files.

To extract data from a website with Python, follow these basic steps:

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format

### Import Libraries.

In [2]:
import requests     # making HTTP requests in Python
import bs4          # beautifulsoup 4 library 
#import lxml         # processing XML and HTML in the Python language
import pandas as pd  

### Request the data.

Make a request to a web page and print the response text. One of the most common HTTP methods is GET, which indicates that you are trying to retrieve data from a specified resource. To make a GET request, call `requests.get()`. It returns a `Response` object, which lets you inspect the result of the request. The first piece of information to check on a `Response` is the status code, which tells you whether the request succeeded.

For example, a 200 OK status means that your request was successful, whereas a 404 NOT FOUND status means that the resource you were looking for was not found. There are many other possible status codes as well to give you specific insights into what happened with your request.
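
As a small aside on reading status codes, Python's standard library includes an `http.HTTPStatus` enum that maps numeric codes to their standard phrases. The sketch below uses it to label a few codes. `describe_status` is a hypothetical helper introduced here for illustration; it is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found' for a known HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{status.value} {status.phrase}"


for code in (200, 404, 500):
    print(describe_status(code))
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```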

By accessing `.status_code`, you can see the status code that the server returned:

In [3]:
response=requests.get("https://www.snapdeal.com/products/men-apparel-jeans?sort=plrty")
print(response.status_code)

200
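
Printing the status code works for a quick check, but a script usually wants to stop when the request fails. The sketch below is one possible way to do that with `raise_for_status()` and a timeout. `fetch_html` is a hypothetical helper name, the 10-second timeout is an arbitrary assumption, and the hand-built `Response` exists only so the error path can be shown without a network call.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its decoded text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)  # timeout avoids waiting forever
    response.raise_for_status()                    # raises requests.HTTPError on 4xx/5xx
    return response.text


# Offline illustration of raise_for_status() on a hand-built Response.
# Real Response objects come from requests.get(); this one is only a stand-in.
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    failed = False
except requests.HTTPError:
    failed = True

print(failed)  # True
```

In a real run, `fetch_html(url)` would replace the separate `requests.get()` and `.text` steps.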


The response to a GET request often carries valuable information, known as a payload, in the message body. Using the attributes and methods of `Response`, you can view the payload in several formats. `.content` gives you the raw bytes of the payload, but you will often want them as a string decoded with a character encoding such as UTF-8. `requests` does that decoding for you when you access `.text`.

In [4]:
print(response.content)

b'<!DOCTYPE html>\n\t<!--[if IE 8]><html lang="en" class="ie ie8 lt-ie9 lt-ie10"> <![endif]-->\n<!--[if IE 9]><html lang="en" class="ie ie9 lt-ie10"> <![endif]-->\n<!--[if IE]><html lang="en" class="ie"><![endif]-->\n<!--[if gt IE 9]><!--><html lang="en"><!--<![endif]-->\n\t<head prefix="og: https://ogp.me/ns# fb: https://ogp.me/ns/fb# snapdeallog: https://ogp.me/ns/fb/snapdeallog#">\n\t\t<link rel="dns-prefetch" href="https://i1.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://i2.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://i3.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://i4.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://n1.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://n2.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://n3.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://n4.sdlcdn.com" />\r\n\t<link rel="dns-prefetch" href="https://sa.snapdeal.com" />\r\n\t<link rel="dns-prefetch" href="https://search-su

In [6]:
print(response.text)

<!DOCTYPE html>
	<!--[if IE 8]><html lang="en" class="ie ie8 lt-ie9 lt-ie10"> <![endif]-->
<!--[if IE 9]><html lang="en" class="ie ie9 lt-ie10"> <![endif]-->
<!--[if IE]><html lang="en" class="ie"><![endif]-->
<!--[if gt IE 9]><!--><html lang="en"><!--<![endif]-->
	<head prefix="og: https://ogp.me/ns# fb: https://ogp.me/ns/fb# snapdeallog: https://ogp.me/ns/fb/snapdeallog#">
		<link rel="dns-prefetch" href="https://i1.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://i2.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://i3.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://i4.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://n1.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://n2.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://n3.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://n4.sdlcdn.com" />
	<link rel="dns-prefetch" href="https://sa.snapdeal.com" />
	<link rel="dns-prefetch" href="https://search-suggester.snapdeal.com" />
	<link rel="d
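
The difference between the two outputs above is the difference between `bytes` and `str`. The following sketch mimics it with a hard-coded byte string rather than a real response, so the sample markup is an assumption made for illustration. With a real `Response`, `requests` picks the encoding itself, and it can be overridden through `response.encoding`.

```python
# Stand-in for response.content: raw bytes as they arrive over the network.
raw = "<p>Café jeans, Rs. 739</p>".encode("utf-8")

# Stand-in for response.text: the same bytes decoded into a str.
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))    # bytes 27
print(type(text).__name__, len(text))  # str 26
```

The lengths differ because `é` takes two bytes in UTF-8 but counts as one character in the decoded string.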

## Basic Structure of an HTML Document

<img src = "https://stuyhsdesign.files.wordpress.com/2015/09/basic-structure.png">

###  Beautiful Soup.

By default, Beautiful Soup supports the HTML parser included in Python's standard library (`html.parser`). It also supports third-party Python parsers such as lxml and html5lib.

To parse a document, pass it into the BeautifulSoup constructor. 

In [8]:
# Name the parser explicitly; without it Beautiful Soup guesses one and warns.
# "lxml" needs the lxml package; "html.parser" is the built-in alternative.
data = bs4.BeautifulSoup(response.text, "lxml")
print(data)


<!DOCTYPE html>
<!--[if IE 8]><html lang="en" class="ie ie8 lt-ie9 lt-ie10"> <![endif]--><!--[if IE 9]><html lang="en" class="ie ie9 lt-ie10"> <![endif]--><!--[if IE]><html lang="en" class="ie"><![endif]--><!--[if gt IE 9]><!--><html lang="en"><!--<![endif]-->
<head prefix="og: https://ogp.me/ns# fb: https://ogp.me/ns/fb# snapdeallog: https://ogp.me/ns/fb/snapdeallog#">
<link href="https://i1.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://i2.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://i3.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://i4.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://n1.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://n2.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://n3.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://n4.sdlcdn.com" rel="dns-prefetch"/>
<link href="https://sa.snapdeal.com" rel="dns-prefetch"/>
<link href="https://search-suggester.snapdeal.com" rel="dns-prefetch"/>
<link href="https://mobileapi.snapdeal.com" rel="
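
To see the constructor at work on something smaller than a full product page, here is a minimal sketch that parses a hand-written HTML string. The markup is made up for illustration, and `"html.parser"` is used so that nothing beyond Beautiful Soup itself has to be installed.

```python
import bs4

html = """
<html>
  <head><title>Demo page</title></head>
  <body><p class="greeting">Hello, <b>soup</b>!</p></body>
</html>
"""

# "html.parser" is the parser bundled with Python.
soup = bs4.BeautifulSoup(html, "html.parser")

print(soup.title.string)          # Demo page
print(soup.find("p").get_text())  # Hello, soup!
```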

### Extract the Product information.

In [10]:
read=data.find_all('div', 'product-desc-rating')
print(read)

[<div class="product-desc-rating ">
<a class="dp-widget-link noUdLine" hidomntrack="" href="https://www.snapdeal.com/product/urbano-fashion-light-blue-slim/637244412538" pogid="637244412538" target="_blank">
<p class="product-title " title="Urbano Fashion Light Blue Slim Jeans">Urbano Fashion Light Blue Slim Jeans</p>
<div class="product-price-row clearfix">
<div class="lfloat marR10">
<span class="lfloat product-desc-price strike ">Rs. 1,599</span>
<span class="lfloat product-price" data-price="739" display-price="739" id="display-price-637244412538">Rs.  739</span>
</div>
<div class="product-discount">
<span>54% Off</span>
</div>
</div>
</a>
</div>, <div class="product-desc-rating ">
<a class="dp-widget-link noUdLine" hidomntrack="" href="https://www.snapdeal.com/product/ragzo-multicolored-slim-jeans/629142971521" pogid="629142971521" target="_blank">
<p class="product-title " title="RAGZO Multicolored Slim Jeans">RAGZO Multicolored Slim Jeans</p>
<div class="product-price-row clearf

In [12]:
len(read)

19
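
The call `data.find_all('div', 'product-desc-rating')` passes the class name as the second positional argument. As a sketch of what that means, the example below runs the same kind of filter on a few lines of made-up HTML and compares it with the explicit `class_` keyword.

```python
import bs4

html = """
<div class="product-desc-rating "><p class="product-title ">Jeans A</p></div>
<div class="product-desc-rating "><p class="product-title ">Jeans B</p></div>
<div class="footer">Not a product</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# A string in the second position is treated as a CSS class to match.
by_position = soup.find_all("div", "product-desc-rating")
# The same filter written with the explicit class_ keyword.
by_keyword = soup.find_all("div", class_="product-desc-rating")

print(len(by_position), len(by_keyword))  # 2 2
```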

In [13]:
read[0]

<div class="product-desc-rating ">
<a class="dp-widget-link noUdLine" hidomntrack="" href="https://www.snapdeal.com/product/urbano-fashion-light-blue-slim/637244412538" pogid="637244412538" target="_blank">
<p class="product-title " title="Urbano Fashion Light Blue Slim Jeans">Urbano Fashion Light Blue Slim Jeans</p>
<div class="product-price-row clearfix">
<div class="lfloat marR10">
<span class="lfloat product-desc-price strike ">Rs. 1,599</span>
<span class="lfloat product-price" data-price="739" display-price="739" id="display-price-637244412538">Rs.  739</span>
</div>
<div class="product-discount">
<span>54% Off</span>
</div>
</div>
</a>
</div>

In [14]:
prod_name=read[0].find_all('p', 'product-title')
prod_name

[<p class="product-title " title="Urbano Fashion Light Blue Slim Jeans">Urbano Fashion Light Blue Slim Jeans</p>]

In [15]:
prod_name[0].getText()

'Urbano Fashion Light Blue Slim Jeans'

In [16]:
# Match on the single class; the full class string with its trailing space is fragile.
original_price=read[0].find_all('span', 'product-desc-price')
original_price

[<span class="lfloat product-desc-price strike ">Rs. 1,599</span>]

In [17]:
original_price[0].getText()

'Rs. 1,599'

In [19]:
dis_price=read[0].find_all('span', 'lfloat product-price')
dis_price

[<span class="lfloat product-price" data-price="739" display-price="739" id="display-price-637244412538">Rs.  739</span>]

In [20]:
dis_price[0].getText()

'Rs.  739'
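
Filtering on a full class string such as `'lfloat product-price'` only matches when the attribute lists exactly those classes in that order. One alternative, sketched here on a trimmed copy of the product markup shown earlier, is a CSS selector with `select_one()`, which matches on a single class regardless of the others. The variable names are chosen for illustration.

```python
import bs4

html = """
<div class="product-desc-rating ">
  <p class="product-title ">Urbano Fashion Light Blue Slim Jeans</p>
  <span class="lfloat product-desc-price strike ">Rs. 1,599</span>
  <span class="lfloat product-price" data-price="739">Rs.  739</span>
</div>
"""
card = bs4.BeautifulSoup(html, "html.parser")

# "span.product-price" matches any span carrying the product-price class,
# whatever other classes or whitespace the attribute contains.
original = card.select_one("span.product-desc-price").get_text()
discounted = card.select_one("span.product-price").get_text()

# Attributes are read like dictionary keys.
numeric_price = card.select_one("span.product-price")["data-price"]

print(original)       # Rs. 1,599
print(numeric_price)  # 739
```

Reading `data-price` avoids having to strip the `Rs.` prefix from the visible text, as long as the site keeps providing that attribute.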

In [21]:
discount=read[0].find_all('div', "product-discount")
discount

[<div class="product-discount">
 <span>54% Off</span>
 </div>]

In [22]:
print(discount[0].getText())


54% Off
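
The blank lines around `54% Off` come from the newlines inside the `<div>`. A minimal sketch of trimming them with `get_text(strip=True)` follows. `getText` is an alias of `get_text`, and the HTML string is copied from the output above.

```python
import bs4

html = '<div class="product-discount">\n<span>54% Off</span>\n</div>'
discount = bs4.BeautifulSoup(html, "html.parser").find("div", "product-discount")

print(repr(discount.get_text()))            # '\n54% Off\n'
print(repr(discount.get_text(strip=True)))  # '54% Off'
```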



### Extract the Product Name.

In [14]:
for i in read:
    product_name=i.find_all('p', 'product-title')
    print(product_name)

[<p class="product-title " title="HALOGEN White Slim Jeans">HALOGEN White Slim Jeans</p>]
[<p class="product-title " title="Studio Nexx Gray Cotton Regular Fit Jeans">Studio Nexx Gray Cotton Regular Fit Jeans</p>]
[<p class="product-title " title="BUKKL Indigo Blue Slim Jeans">BUKKL Indigo Blue Slim Jeans</p>]
[<p class="product-title " title="Just Trousers Grey Slim Jeans">Just Trousers Grey Slim Jeans</p>]
[<p class="product-title " title="BUKKL Black Slim Jeans">BUKKL Black Slim Jeans</p>]
[<p class="product-title " title="Dom &amp; B Blue Slim Jeans">Dom &amp; B Blue Slim Jeans</p>]
[<p class="product-title " title="RAGZO Blue Slim Jeans">RAGZO Blue Slim Jeans</p>]
[<p class="product-title " title="Gericho London Navy Blue Slim Jeans">Gericho London Navy Blue Slim Jeans</p>]
[<p class="product-title " title="Urbano Fashion Light Blue Slim Jeans">Urbano Fashion Light Blue Slim Jeans</p>]
[<p class="product-title " title="denword White Slim Jeans">denword White Slim Jeans</p>]
[<p cl

In [15]:
for i in read:
    product_name=i.find_all('p', 'product-title')
    product_name=product_name[0].getText()
    print(product_name)

HALOGEN White Slim Jeans
Studio Nexx Gray Cotton Regular Fit Jeans
BUKKL Indigo Blue Slim Jeans
Just Trousers Grey Slim Jeans
BUKKL Black Slim Jeans
Dom & B Blue Slim Jeans
RAGZO Blue Slim Jeans
Gericho London Navy Blue Slim Jeans
Urbano Fashion Light Blue Slim Jeans
denword White Slim Jeans
BOLTS and BARRELS Blue Slim Jeans
Urbano Fashion Black Slim Jeans
Mufti Black Skinny Jeans
RAGZO Black Slim Jeans
Spykar Blue Skinny Jeans
Mufti Black Slim Jeans
Spykar Blue Slim Jeans
Lawson Black Skinny Jeans
Hopewell Dark Blue Slim Jeans
Crimsoune Club Blue Slim Jeans


### Extract the Original Price and Discounted Price.

In [16]:
for i in read:
    product_name=i.find_all('p','product-title')
    product_name=product_name[0].getText()
    # Match on the single class; the full class string with its trailing space is fragile.
    original_price = i.find_all('span', 'product-desc-price')
    original_price=original_price[0].getText()
    print(original_price)

Rs. 3,599
Rs. 1,299
Rs. 3,999
Rs. 1,650
Rs. 3,999
Rs. 1,299
Rs. 1,499
Rs. 1,999
Rs. 1,599
Rs. 3,999
Rs. 2,199
Rs. 1,799
Rs. 2,199
Rs. 1,899
Rs. 2,699
Rs. 3,899
Rs. 2,399
Rs. 3,599
Rs. 999
Rs. 2,399


In [17]:
for i in read:
    product_name=i.find_all('p','product-title')
    product_name=product_name[0].getText()
    # Match on the single class; the full class string with its trailing space is fragile.
    original_price = i.find_all('span', 'product-desc-price')
    original_price=original_price[0].getText()
    discounted_price=i.find_all('span', 'lfloat product-price')
    discounted_price=discounted_price[0].getText()
    print(discounted_price)

Rs.  1,099
Rs.  914
Rs.  895
Rs.  699
Rs.  1,018
Rs.  549
Rs.  945
Rs.  599
Rs.  739
Rs.  1,249
Rs.  999
Rs.  869
Rs.  1,539
Rs.  969
Rs.  1,349
Rs.  1,559
Rs.  1,199
Rs.  1,099
Rs.  598
Rs.  1,200
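
The prices above are still strings such as `'Rs.  1,099'`. If numbers are needed later, one way to convert them is sketched below with the standard `re` module. `parse_rupees` is a hypothetical helper, and it assumes whole-rupee prices with no decimal part.

```python
import re


def parse_rupees(price_text: str) -> int:
    """Convert text like 'Rs.  1,099' to the integer 1099 (whole rupees assumed)."""
    digits = re.sub(r"[^\d]", "", price_text)  # drop 'Rs.', spaces, and commas
    if not digits:
        raise ValueError(f"no digits found in {price_text!r}")
    return int(digits)


original = parse_rupees("Rs. 1,599")
discounted = parse_rupees("Rs.  739")
percent_off = round(100 * (original - discounted) / original)

print(original, discounted, percent_off)  # 1599 739 54
```

The computed 54 agrees with the `54% Off` label scraped for the same product.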


### Append the data extracted into a list.

In [23]:
name=[]
op=[]
dp=[]

for i in read:
    product_name=i.find_all('p','product-title')
    product_name=product_name[0].getText()
    # Match on the single class; the full class string with its trailing space is fragile.
    original_price = i.find_all('span', 'product-desc-price')
    original_price=original_price[0].getText()
    discounted_price=i.find_all('span', 'lfloat product-price')
    discounted_price=discounted_price[0].getText()
    name.append(product_name)
    op.append(original_price)
    dp.append(discounted_price)
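
The loop above indexes `[0]` on every result, which raises an `IndexError` if a product is missing one of the fields. A hedged sketch of a more forgiving variant is shown below. `text_or_none` is a hypothetical helper, and the HTML is made up so that the second product has no original price.

```python
import bs4


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(tag, css_class)
    return node.get_text(strip=True) if node is not None else None


html = """
<div class="product-desc-rating"><p class="product-title">Jeans A</p>
  <span class="product-desc-price">Rs. 1,599</span></div>
<div class="product-desc-rating"><p class="product-title">Jeans B</p></div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

rows = [
    {
        "product_name": text_or_none(card, "p", "product-title"),
        "original_price": text_or_none(card, "span", "product-desc-price"),
    }
    for card in soup.find_all("div", "product-desc-rating")
]

print(rows[1])  # {'product_name': 'Jeans B', 'original_price': None}
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`, which keeps the columns aligned even when a value is missing.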

### Make a data frame of the lists using pandas.

In [24]:
import pandas as pd
final_data = pd.DataFrame({
    "product_name": name,
    "original_price": op,
    "discounted_price": dp,
})

In [25]:
final_data.head()

|   | discounted_price | original_price | product_name |
|---|---|---|---|
| 0 | Rs. 739 | Rs. 1,599 | Urbano Fashion Light Blue Slim Jeans |
| 1 | Rs. 2,499 | Rs. 4,197 | RAGZO Multicolored Slim Jeans |
| 2 | Rs. 1,167 | Rs. 3,999 | BUKKL Blue Slim Jeans |
| 3 | Rs. 666 | Rs. 999 | Hopewell Black Slim Jeans |
| 4 | Rs. 1,018 | Rs. 3,999 | BUKKL Black Slim Jeans |
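
The price columns in the data frame are text, so they cannot be sorted or averaged numerically yet. The sketch below shows one possible cleanup with pandas string methods, using two of the rows above as sample data. The `discount_pct` column name is an assumption added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "product_name": ["Urbano Fashion Light Blue Slim Jeans", "Hopewell Black Slim Jeans"],
    "original_price": ["Rs. 1,599", "Rs. 999"],
    "discounted_price": ["Rs.  739", "Rs. 666"],
})

for column in ("original_price", "discounted_price"):
    # Drop every non-digit character, then convert the result to integers.
    df[column] = df[column].str.replace(r"\D", "", regex=True).astype(int)

# Percentage saved, rounded to the nearest whole number.
df["discount_pct"] = (
    (1 - df["discounted_price"] / df["original_price"]) * 100
).round().astype(int)

print(df[["original_price", "discounted_price", "discount_pct"]])
```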


### Convert the data frame into a csv file and save it.

In [29]:
final_data.to_csv("snapdeal_scrap.csv", index=False)

In [30]:
scrape_data = pd.read_csv("snapdeal_scrap.csv")

In [31]:
scrape_data.head()

|   | discounted_price | original_price | product_name |
|---|---|---|---|
| 0 | Rs. 739 | Rs. 1,599 | Urbano Fashion Light Blue Slim Jeans |
| 1 | Rs. 2,499 | Rs. 4,197 | RAGZO Multicolored Slim Jeans |
| 2 | Rs. 1,167 | Rs. 3,999 | BUKKL Blue Slim Jeans |
| 3 | Rs. 666 | Rs. 999 | Hopewell Black Slim Jeans |
| 4 | Rs. 1,018 | Rs. 3,999 | BUKKL Black Slim Jeans |
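
The `index=False` argument in `to_csv` matters for the round trip. As a sketch of why, the example below writes a one-row frame to in-memory buffers (instead of a file on disk, which is an assumption made to keep it self-contained) with and without the index, then reads each back.

```python
import io

import pandas as pd

df = pd.DataFrame({"product_name": ["Jeans A"], "original_price": ["Rs. 999"]})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'product_name', 'original_price']
print(cols_without)  # ['product_name', 'original_price']
```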
