# Introduction

The first step in any web scraping project is to fetch the content of a page from its URL. In this notebook, we will see how to use the Requests library to access content from Amazon and YouTube.

Once we have the content, we will use Beautiful Soup to parse it into a structured format.

### Imports


Uncomment and run the lines below to install the requests and bs4 libraries.

In [1]:
#!pip install requests
#!pip install bs4

In [2]:
import requests
from bs4 import BeautifulSoup

### Extracting data from Amazon

To get information about a product from Amazon, we need the ASIN (Amazon Standard Identification Number) of the product.

So let us first search for Nike women's shoes and get the list of associated ASINs.

## Let us first try extracting the data by simply passing the URL

In [3]:
search_query="nike+women+shoes"
base_url="https://www.amazon.in/s?k="

In [4]:
url= base_url+search_query
print(url)

https://www.amazon.in/s?k=nike+women+shoes
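
The query string above has its spaces replaced with `+` by hand. As a minimal sketch, the standard library's `urllib.parse.urlencode` can do that encoding for any search phrase. `build_search_url` is a hypothetical helper name, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.in/s"

def build_search_url(query):
    """Return the search URL for a free-text query (hypothetical helper)."""
    # urlencode turns spaces into '+' and escapes other reserved characters
    return BASE_URL + "?" + urlencode({"k": query})

print(build_search_url("nike women shoes"))
```

This gives the same URL as the manual concatenation, and also copes with queries that contain characters such as `&` or `/`.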


### Pass the URL to Amazon using requests.get()

In [5]:
search_response=requests.get(url)

In [6]:
search_response.status_code

200

In [7]:
search_response.text #search_response.content

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'><link rel="dns-prefetch" href="//images-eu.ssl-images-amazon.com"><link rel="dns-prefetch" href="//m.media-amazon.com"><link rel="dns-prefetch" href="//completion.amazon.com"><!-- sp:feature:aui-assets -->\n<link rel="stylesheet" href="https://images-eu.ssl-images-amazon.com/images/I/517rp2NH2UL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01cbS3UK11L.css,21mOLw+nYYL.css,01L8Y-JFEhL

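##### A 503 status means Service Unavailable (403 is the Forbidden code). Amazon may return it for requests without browser headers, although this run returned 200. Let us look at the response body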

In [8]:
search_response.content

b'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'><link rel="dns-prefetch" href="//images-eu.ssl-images-amazon.com"><link rel="dns-prefetch" href="//m.media-amazon.com"><link rel="dns-prefetch" href="//completion.amazon.com"><!-- sp:feature:aui-assets -->\n<link rel="stylesheet" href="https://images-eu.ssl-images-amazon.com/images/I/517rp2NH2UL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01cbS3UK11L.css,21mOLw+nYYL.css,01L8Y-JFEh

One reason a request can be rejected is that it carries no browser-like header information, so let us set a User-Agent header.

In [9]:
header={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}

In [10]:
search_response=requests.get(url,headers=header)

In [11]:
search_response.status_code

200

It is an OK response. Let us see what the response content is.

In [12]:
search_response.text

'<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\n<!--[if IE 7]>    <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\n<!--[if IE 8]>    <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="a-no-js" lang="en-us"><!--<![endif]--><head>\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<title dir="ltr">Robot Check</title>\n<meta name="viewport" content="width=device-width">\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\n<script>\n\nif (true === true) {\n    var ue_t0 = (+ new Date()),\n        ue_csm = window,\n        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },\n        ue_furl = "fls-eu.amazon.in",\n        ue_mid = "

Again, we do not get the actual page: the title of the returned HTML is "Robot Check". We may have to send cookies as well. To set the cookies, open an Amazon URL in a browser, inspect the page, and copy the cookies into a dictionary.
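
A 200 status code alone did not tell us that the search page came back; the title of the HTML did. A rough sketch of that check follows, assuming the block page can be recognised by the "Robot Check" title seen in the output. `page_title` and `looks_like_robot_check` are hypothetical helpers, and the regular expression is only a quick sniff, not a full HTML parser.

```python
import re

def page_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def looks_like_robot_check(html):
    # Assumption: the block page carries the "Robot Check" title shown in the
    # output above; Amazon may change this at any time.
    return page_title(html) == "Robot Check"

sample = '<html><head><title dir="ltr">Robot Check</title></head><body></body></html>'
print(looks_like_robot_check(sample))
```

In the notebook, the same check could be run on `search_response.text` before any parsing is attempted.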

In [13]:
#amazon_url="https://www.amazon.in/dp/"


In [14]:
## You need to set your own cookies
cookie={}
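
Browser developer tools usually show cookies as a single `name=value; name=value` string. The sketch below turns such a string into the dictionary that `requests` expects. `cookie_header_to_dict` is a hypothetical helper, and the cookie names and values in the sample are made up for illustration.

```python
def cookie_header_to_dict(header_value):
    """Split a 'name=value; name=value' string into a dict (hypothetical helper)."""
    cookies = {}
    for part in header_value.split(";"):
        # partition keeps any '=' that appears inside the value itself
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies

# Made-up cookie string; paste your own from the browser instead
cookie = cookie_header_to_dict("session-id=123-456; i18n-prefs=INR")
print(cookie)
```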


In [17]:
def getAmazonSearch(search_query):
    url="https://www.amazon.in/s?k="+search_query
    print(url)
    page=requests.get(url,cookies=cookie,headers=header)
    # Raise an HTTPError for 4xx/5xx responses instead of returning the string "Error",
    # which would break later calls such as page.content
    page.raise_for_status()
    return page

In [18]:
search_response=getAmazonSearch("nike+women+shoes")

https://www.amazon.in/s?k=nike+women+shoes


Let us write this response to a file and inspect it further. 

In [19]:
search_response.content
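
The cell above only displays the bytes. Writing them to a file, as suggested, could look like the sketch below. `save_response` is a hypothetical helper, the sample bytes stand in for `search_response.content`, and the temporary directory is an arbitrary choice of location.

```python
import tempfile
from pathlib import Path

def save_response(content, path):
    """Write raw response bytes to path and return the number of bytes written."""
    return Path(path).write_bytes(content)

# Stand-in bytes; in the notebook this would be search_response.content
sample = b"<html><title>Amazon.in: nike women shoes</title></html>"
target = Path(tempfile.gettempdir()) / "search_response.html"
written = save_response(sample, target)
print(written, "bytes written to", target)
```

The saved file can then be opened in a browser or editor to look for the tags and attributes that hold the data.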



We can see that the ASIN is stored in the data-asin attribute of div tags whose class is "sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32".

Let us extract this data using Beautiful Soup.

In [20]:
soup=BeautifulSoup(search_response.content,"html.parser") # name the parser explicitly to avoid the no-parser warning

In [21]:
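soup.prettify # without parentheses this displays the bound method; call soup.prettify() to get the formatted HTML string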

<bound method Tag.prettify of <!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-in"><!-- sp:feature:head-start -->
<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script><!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/><link href="https://images-eu.ssl-images-amazon.com" rel="dns-prefetch"/><link href="https://m.media-amazon.com" rel="dns-prefetch"/><link href="https://completion.amazon.com" rel="dns-prefetch"/><script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a

In [22]:
soup.title # Gives the title 

<title>Amazon.in: nike women shoes</title>

## Extract Product List

In [23]:
soup.find("span",attrs={'class':'a-size-base-plus a-color-base a-text-normal'})

<span class="a-size-base-plus a-color-base a-text-normal">Nike Women's Revolution 4 Obsidian/Mountain Blue Running Shoes (908999-403)</span>

In [24]:
product_list_tag=soup.findAll("span",attrs={'class':'a-size-base-plus a-color-base a-text-normal'})

In [25]:
type(product_list_tag)

bs4.element.ResultSet

In [26]:
product_list_tag

[<span class="a-size-base-plus a-color-base a-text-normal">Nike Women's Revolution 4 Obsidian/Mountain Blue Running Shoes (908999-403)</span>,
 <span class="a-size-base-plus a-color-base a-text-normal">Adidas Women's Adispree 2.0 W Running Shoes</span>,
 <span class="a-size-base-plus a-color-base a-text-normal">Puma Women's Agile t1 NM Wn s IDP Sneakers</span>,
 <span class="a-size-base-plus a-color-base a-text-normal">Reebok Men's Run Escape Lp Running Shoes</span>,
 <span class="a-size-base-plus a-color-base a-text-normal">Skechers Women's GO Walk LITE-ENAMOR Nordic Walking Shoes</span>,
 <span class="a-size-base-plus a-color-base a-text-normal">Women's WMNS RUNALLDAY Wolf White-Cool Grey Running Shoes-7UK 9.5US (898484-016</span>,
 <span class="a-size-base-plus a-color-base a-text-normal">Revolution 4 Sports Running Shoes for Women</span>,
 <span class="a-size-base-plus a-color-base a-text-normal">Women's Revolution 4 Obsidian/Mountain Blue Running Shoes (908999-403)</span>,
 <span 

In [27]:
product_tag=product_list_tag[0]

In [28]:
product_tag

<span class="a-size-base-plus a-color-base a-text-normal">Nike Women's Revolution 4 Obsidian/Mountain Blue Running Shoes (908999-403)</span>

In [29]:
len(product_list_tag)

61

In [30]:
product_list=[tag.text for tag in product_list_tag]

In [31]:
product_list

["Nike Women's Revolution 4 Obsidian/Mountain Blue Running Shoes (908999-403)",
 "Adidas Women's Adispree 2.0 W Running Shoes",
 "Puma Women's Agile t1 NM Wn s IDP Sneakers",
 "Reebok Men's Run Escape Lp Running Shoes",
 "Skechers Women's GO Walk LITE-ENAMOR Nordic Walking Shoes",
 "Women's WMNS RUNALLDAY Wolf White-Cool Grey Running Shoes-7UK 9.5US (898484-016",
 'Revolution 4 Sports Running Shoes for Women',
 "Women's Revolution 4 Obsidian/Mountain Blue Running Shoes (908999-403)",
 "Women's WMNS Downshifter 7 Running Shoes",
 "Women's WMNS Classic Cortez Nylon Running Shoes",
 "Women's WMNS RUNALLDAY Running Shoes",
 "Women's WMNS Court Royale PREM Tennis Shoes",
 "Women's Running Shoes",
 "Women's Running Shoes",
 "Men's WMNS RUNALLDAY Running Shoes",
 "Men's WMNS Revolution 4 Running Shoes",
 "Women's WMNS Epic React Flyknit Running Shoes",
 "Women's Revolution 4 Navy Blue Running Shoes(908999-406)",
 "Women's WMNS Flex Bijoux Multisport Training Shoes",
 "Women's WMNS Air Zoom Pe

In [58]:
### Extract Price of Products

In [32]:
price=soup.findAll('span',class_='a-price-whole')

In [33]:
p=[tag.text for tag in price]

In [34]:
s=soup.find('span',class_='a-price-symbol')

In [35]:
p=[s.text+value for value in p] # prefix every price with the currency symbol

In [36]:
p

['₹1,999',
 '₹1,899',
 '₹1,599',
 '₹1,164',
 '₹2,879',
 '₹2,475',
 '₹1,999',
 '₹1,999',
 '₹2,745',
 '₹3,572',
 '₹2,199',
 '₹2,747',
 '₹2,901',
 '₹2,663',
 '₹2,949',
 '₹1,949',
 '₹8,797',
 '₹2,032',
 '₹2,637',
 '₹5,937',
 '₹5,497',
 '₹1,529',
 '₹489',
 '₹1,049',
 '₹1,349',
 '₹3,965',
 '₹1,513',
 '₹3,672',
 '₹1,647',
 '₹4,722',
 '₹2,752',
 '₹2,912',
 '₹2,697',
 '₹3,140',
 '₹3,253',
 '₹2,272',
 '₹2,799',
 '₹3,627',
 '₹1,997',
 '₹4,562',
 '₹5,497',
 '₹2,877',
 '₹2,568',
 '₹2,074',
 '₹2,217',
 '₹3,960',
 '₹4,122',
 '₹6,047',
 '₹4,356',
 '₹2,197',
 '₹2,715',
 '₹5,393',
 '₹2,417',
 '₹9,997',
 '₹3,140',
 '₹3,140',
 '₹899',
 '₹1,099',
 '₹449',
 '₹1,648']
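
The name list above has 61 entries while the price list has 60, so the two cannot safely be paired by position. One way around this, sketched below, is to read the name and price from within each result card. The HTML is a hypothetical, heavily simplified stand-in for the real page, and `parse_price` is a helper introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; real result cards are far larger
html = """
<div class="s-result-item" data-asin="A1">
  <span class="a-size-base-plus a-color-base a-text-normal">Shoe one</span>
  <span class="a-price-symbol">₹</span><span class="a-price-whole">1,999</span>
</div>
<div class="s-result-item" data-asin="A2">
  <span class="a-size-base-plus a-color-base a-text-normal">Shoe two</span>
</div>
"""

def parse_price(text):
    """Convert a string such as '1,999' to the integer 1999."""
    return int(text.replace(",", "").strip())

soup = BeautifulSoup(html, "html.parser")
products = []
for card in soup.find_all("div", class_="s-result-item"):
    name = card.find("span", class_="a-text-normal")
    price = card.find("span", class_="a-price-whole")
    products.append({
        "asin": card.get("data-asin"),
        "name": name.get_text(strip=True) if name else None,
        # A card without a price keeps None instead of shifting later prices
        "price": parse_price(price.get_text()) if price else None,
    })

print(products)
```

Keeping one record per card means a product with no listed price stays aligned with its own name and ASIN.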

## Get all Span Tags

In [59]:
span_tags=soup.findAll('span') ## Get all span tags

In [38]:
span_tags

[<span cel_widget_id="TITLE_AND_META-TITLE_AND_META" class="celwidget slot=TITLE_AND_META template=TITLE_AND_META widgetId=title-and-meta index=0">
 <title>Amazon.in: nike women shoes</title>
 <meta content="Amazon.in: nike women shoes" name="description"/>
 <meta content="nike women shoes, Amazon.in" name="keywords"/>
 <link href="https://www.amazon.in/nike-women-shoes/s?k=nike+women+shoes" rel="canonical"/>
 </span>,
 <span class="nav-sprite nav-logo-base"></span>,
 <span class="nav-sprite nav-logo-ext"></span>,
 <span class="nav-sprite nav-logo-locale"></span>,
 <span class="icp-nav-link-inner">
 <span class="nav-line-1">
 <span class="icp-nav-globe-img-2"></span>
 <span class="icp-nav-language">EN</span>
 </span>
 <span class="nav-line-2"> 
         <span class="nav-icon nav-arrow"></span>
 </span>
 </span>,
 <span class="nav-line-1">
 <span class="icp-nav-globe-img-2"></span>
 <span class="icp-nav-language">EN</span>
 </span>,
 <span class="icp-nav-globe-img-2"></span>,
 <span cla

In [39]:
for spantag in span_tags:
    try:
        print(spantag['class'])
    except KeyError: # raised when the tag has no class attribute
        print(str(spantag)+"does not have class attribute")

['celwidget', 'slot=TITLE_AND_META', 'template=TITLE_AND_META', 'widgetId=title-and-meta', 'index=0']
['nav-sprite', 'nav-logo-base']
['nav-sprite', 'nav-logo-ext']
['nav-sprite', 'nav-logo-locale']
['icp-nav-link-inner']
['nav-line-1']
['icp-nav-globe-img-2']
['icp-nav-language']
['nav-line-2']
['nav-icon', 'nav-arrow']
['nav-line-1']
['nav-line-2', '']
['nav-icon', 'nav-arrow']
['nav-line-3']
['nav-line-4']
['nav-line-1']
['nav-line-2']
['nav-line-1']
['nav-line-2', '']
['nav-icon', 'nav-arrow']
['nav-line-1']
['nav-line-2']
['nav-icon', 'nav-arrow']
['nav-cart-icon', 'nav-sprite']
['nav-cart-count', 'nav-cart-0']
['nav-search-label']
<span id="searchDropdownDescription" style="display:none">Select the department you want to search in</span>does not have class attribute
['nav-search-submit-text', 'nav-sprite']
['a-declarative']
['nav-line-1']
['nav-line-2']
<span id="nav-your-amazon-text"><span class="nav-shortened-name">Aiswarya</span>'s Amazon.in</span>does not have class attribute

In [41]:
soup.find("div",{'class':'sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32'})

<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32" data-asin="B01HSFUOOQ" data-index="1"><div class="sg-col-inner">
<span cel_widget_id="SEARCH_RESULTS-SEARCH_RESULTS" class="celwidget slot=SEARCH_RESULTS template=SEARCH_RESULTS widgetId=search-results index=1">
<div class="s-expand-height s-include-content-margin s-border-bottom">
<div class="a-section a-spacing-medium">
<div class="sg-row">
<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner">
<div class="a-section a-spacing-micro s-min-height-extra-large">
</div>
</div></div>
</div>
<div class="sg-row">
<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner">
<div class="a-section a-spacing-none">
<span class="rush-component" data-component-type="s-product-

find returns only the first match. Here it has returned the first matching <div> along with all the tags nested inside it.
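
The difference between `find` and `find_all` is easier to see on a tiny document. The sketch below uses made-up HTML with two result divs; only the `s-result-item` class and the `data-asin` attribute are taken from the page above.

```python
from bs4 import BeautifulSoup

# Made-up markup with two result divs
html = """
<div class="s-result-item" data-asin="A1"><span>first</span></div>
<div class="s-result-item" data-asin="A2"><span>second</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="s-result-item")      # a single Tag, or None
every = soup.find_all("div", class_="s-result-item")  # a ResultSet of Tags

print(type(first).__name__, first["data-asin"])
print(len(every), [tag["data-asin"] for tag in every])
```

Note that `len(first)` counts the direct children of that one tag, whereas `len(every)` counts the matches.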

In [50]:
div_tags=soup.find("div",class_='sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32')

In [51]:
div_tags

<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32" data-asin="B01HSFUOOQ" data-index="1"><div class="sg-col-inner">
<span cel_widget_id="SEARCH_RESULTS-SEARCH_RESULTS" class="celwidget slot=SEARCH_RESULTS template=SEARCH_RESULTS widgetId=search-results index=1">
<div class="s-expand-height s-include-content-margin s-border-bottom">
<div class="a-section a-spacing-medium">
<div class="sg-row">
<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner">
<div class="a-section a-spacing-micro s-min-height-extra-large">
</div>
</div></div>
</div>
<div class="sg-row">
<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner">
<div class="a-section a-spacing-none">
<span class="rush-component" data-component-type="s-product-

In [52]:
type(div_tags)

bs4.element.Tag

In [53]:
len(div_tags) ## find returned a single Tag, so this counts its direct children, not the search results

1

In [56]:
print(div_tags['data-asin'])
div_tags['data-index']

B01HSFUOOQ


'1'

In [None]:
div_tags['data-asin']

Let us now get the value of the data-asin attribute for these tags. An ASIN is a product identifier on Amazon, similar to an ISBN for books.

In [None]:

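asin_list=[]
# find_all returns every result div; iterating over the single Tag from find would only walk its children
for tag in soup.findAll("div",class_="s-result-item"): # assumes each result div carries the s-result-item class
    asin=tag.get('data-asin') #Get the value associated with property data-asin
    if asin:
        print(asin)
        asin_list.append(asin)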
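
A search page may repeat a product or include placeholder divs with an empty data-asin. The sketch below collects each ASIN once, in page order, from made-up markup. Matching on `attrs={"data-asin": True}` selects any div that has the attribute at all.

```python
from bs4 import BeautifulSoup

# Made-up markup: one empty ASIN and one duplicate
html = """
<div class="s-result-item" data-asin="B01HSFUOOQ"></div>
<div class="s-result-item" data-asin=""></div>
<div class="s-result-item" data-asin="B078NHQ35C"></div>
<div class="s-result-item" data-asin="B01HSFUOOQ"></div>
"""
soup = BeautifulSoup(html, "html.parser")

asin_list = []
for tag in soup.find_all("div", attrs={"data-asin": True}):
    asin = tag["data-asin"]
    # Skip empty values and ASINs that were already collected
    if asin and asin not in asin_list:
        asin_list.append(asin)

print(asin_list)
```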

Now, for each ASIN, we must get the reviews associated with it. Let us extract the product name and the reviews, starting with a single product.

In [None]:
def getAmazonPage(asin,amazon_url):
    url=amazon_url+asin
    print(url)
    page=requests.get(url,cookies=cookie,headers=header) # reuse the User-Agent header defined earlier
    print(page.status_code)
    # Raise an HTTPError for 4xx/5xx responses rather than returning the string "Error"
    page.raise_for_status()
    return page


In [None]:
amazon_url="https://www.amazon.in/dp/"
product_response=getAmazonPage("B078NHQ35C",amazon_url)

In [None]:
product_response.content

In [None]:
soup=BeautifulSoup(product_response.content,"html.parser") # name the parser explicitly

In [None]:
soup

In [None]:
total_reviews=soup.find("span",{"data-hook":"top-customer-reviews-title"})
total_reviews.text

In [None]:
total_reviews=total_reviews.text.replace(" customer reviews","")
total_reviews=int(total_reviews)
total_reviews
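
The `replace` call above only works when the text ends in exactly " customer reviews" and the number has no thousands separator. A slightly more tolerant sketch pulls out the first number with a regular expression. `parse_review_count` is a hypothetical helper, and the second sample string is an assumed format.

```python
import re

def parse_review_count(text):
    """Return the first integer in text, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(",", ""))

print(parse_review_count("17 customer reviews"))
print(parse_review_count("1,234 customer reviews"))  # assumed format for larger counts
```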

##### The review details are inside div tags identified by their data-hook attribute (the cell below looks for review-collapsed). Let us extract these tags

In [None]:
review_data=soup.findAll("div",{"data-hook":'review-collapsed'})

In [None]:
len(review_data)

Also, while there are 17 customer reviews, only 8 appear on this page. To extract the rest, we need the URL in the <a> tag where data-hook is see-all-reviews-link-foot.

In [None]:
see_all_review_url=soup.find("a",{"data-hook":"see-all-reviews-link-foot"})['href']

In [None]:
see_all_review_url

In [None]:
from urllib.parse import urljoin
see_all_review_url=urljoin("https://www.amazon.in/",see_all_review_url) # avoids a double slash when the href starts with "/"

In [None]:
see_all_review_url
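
Links scraped from a page may be relative (with or without a leading slash) or already absolute. As a small sketch, `urllib.parse.urljoin` from the standard library resolves all three forms against the site root. `absolute_url` is a hypothetical helper and the href values are made up.

```python
from urllib.parse import urljoin

BASE = "https://www.amazon.in/"

def absolute_url(href):
    """Resolve an href found on the page against the site root (hypothetical helper)."""
    return urljoin(BASE, href)

# Made-up href values; real ones come from the page
print(absolute_url("/product-reviews/B078NHQ35C"))
print(absolute_url("product-reviews/B078NHQ35C"))
print(absolute_url("https://www.amazon.in/product-reviews/B078NHQ35C"))
```

All three calls print the same absolute URL, which plain string concatenation would not guarantee.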

In [None]:
all_review_data=requests.get(see_all_review_url,headers=header,cookies=cookie) # header and cookie are the names defined earlier

In [None]:
all_review_soup=BeautifulSoup(all_review_data.content,"html.parser") # name the parser explicitly

In [None]:
all_review_soup

In [None]:
ratings=[]
reviews=[]
review_title=[]

Ratings are present in <i> tags where data-hook is review-star-rating.

In [None]:
ratings_tag=all_review_soup.findAll("i",{"data-hook":"review-star-rating"})
for rating in ratings_tag:
    #print(rating)
    print(rating.text)
    ratings.append(rating.text)

In [None]:
print(len(ratings_tag)) #This page had 10 ratings
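
The ratings are collected as text. To analyse them numerically, the leading number can be pulled out, as in the sketch below. The sample string "4.0 out of 5 stars" is an assumed format, since the notebook does not show the actual rating text, and `parse_rating` is a hypothetical helper.

```python
import re

def parse_rating(text):
    """Return the first number in a rating string as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

# Assumed format; check the real rating.text values before relying on this
print(parse_rating("4.0 out of 5 stars"))
```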

In [None]:
summary_soup=all_review_soup.findAll("a",{"data-hook":"review-title"})
for summary in summary_soup:
    print(summary)
    print(summary.text)
    review_title.append(summary.text)

In [None]:
reviews_soup=all_review_soup.findAll("span",{'data-hook':'review-body'})
for review in reviews_soup:
    print(review)
    print(review.text)
    reviews.append(review.text)
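
With ratings, titles, and review bodies held in three separate lists, it helps to combine them into one record per review and to fail loudly if the lists differ in length. A minimal sketch follows, with `build_records` as a hypothetical helper and made-up sample values.

```python
def build_records(ratings, review_title, reviews):
    """Zip the three parallel lists into one dict per review."""
    if not (len(ratings) == len(review_title) == len(reviews)):
        raise ValueError("lists differ in length; the rows would be misaligned")
    return [
        {"rating": rating, "title": title.strip(), "review": body.strip()}
        for rating, title, body in zip(ratings, review_title, reviews)
    ]

# Made-up sample values standing in for the scraped lists
records = build_records(
    ["5.0 out of 5 stars", "3.0 out of 5 stars"],
    [" Great fit ", "Average"],
    ["Very comfortable.\n", "Runs a little small."],
)
print(records)
```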

In [None]:
### We have not got all the reviews yet; the rest are on the following pages

In [None]:
pagination=all_review_soup.findAll("ul",{"class":'a-pagination'})

In [None]:
pagination

In [None]:
anchor_tag=pagination[0].find("a")

In [None]:
anchor_tag

In [None]:
anchor_tag['href']

In [None]:
from urllib.parse import urljoin
next_page=urljoin("https://www.amazon.in/",anchor_tag['href']) # avoids a double slash when the href starts with "/"
next_page

In [None]:
next_page_response=requests.get(next_page,headers=header,cookies=cookie) # header and cookie are the names defined earlier

In [None]:
next_page_soup=BeautifulSoup(next_page_response.content,"html.parser") # name the parser explicitly

In [None]:
next_page_soup.findAll("i",{"data-hook":'review-star-rating'})

## As before, we can extract this data and add it to the lists.
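
Repeating the fetch, parse, and find-next-link steps by hand for every page can be folded into a loop. The sketch below shows only the control flow. `collect_all_pages` is a hypothetical helper that takes a `fetch_page` callable returning the items of a page and the URL of the next one, and the fake site is a stand-in for real requests and parsing.

```python
def collect_all_pages(start_url, fetch_page, max_pages=50):
    """Follow next-page links from start_url and return all collected items.

    fetch_page(url) must return (items, next_url), with next_url set to
    None on the last page.
    """
    items, url, seen = [], start_url, set()
    # Stop on the last page, on a repeated URL, or after max_pages pages
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items

# Fake pages standing in for real HTTP requests and Beautiful Soup parsing
FAKE_SITE = {
    "page1": (["review a", "review b"], "page2"),
    "page2": (["review c"], None),
}
print(collect_all_pages("page1", FAKE_SITE.get))
```

In the notebook, `fetch_page` would request the URL, extract the reviews with Beautiful Soup, and return the href of the next-page link. The `max_pages` cap and the `seen` set guard against loops in the pagination links.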

As an assignment, pick a brand and product of interest and scrape the reviews for all of its items.