# Web Scraping

Fetching web pages and parsing them into a readable format.

The content is usually HTML. We'll use
- **Requests** to get the web page
- **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)** to parse it. It parses HTML and XML with the help of an underlying parser (**html.parser or lxml**)


Let's start with an example from [geeksforgeeks](https://www.geeksforgeeks.org/python-convert-an-html-table-into-excel/)

In [13]:
# A quick example: reading an HTML table from a web page with pandas' built-in read_html
import pandas as pd 
  
# The URL where we want to extract the table
url = "https://www.geeksforgeeks.org/extended-operators-in-relational-algebra/"
  
# read_html returns a list of DataFrames, one per table on the page; take the first
table = pd.read_html(url)[0]
table

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | ROLL_NO | NAME | ADDRESS | PHONE | AGE |
| 1 | 1 | RAM | DELHI | 9455123451 | 18 |
| 2 | 2 | RAMESH | GURGAON | 9652431543 | 18 |
| 3 | 3 | SUJIT | ROHTAK | 9156253131 | 20 |
| 4 | 4 | SURESH | DELHI | 9156768971 | 18 |


# Beautiful Soup

From the Beautiful Soup website:

**"You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help."**

Install both packages in your virtual environment:
- **pip install beautifulsoup4**
- **pip install lxml**

Documentation is available [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

# Protocol to follow when scraping a web page
- Check the site's robots.txt to see what is allowed.
- Avoid many simultaneous requests, or your IP address may get blocked. Sleep between GET calls to avoid this.

- Use the Requests get method to fetch the page's HTML.
- Parse it with BeautifulSoup and lxml. This creates a hierarchical structure of HTML elements.
- In Chrome, Safari, Firefox, etc., right-click the page and choose Inspect to open the developer tools, then inspect the HTML elements for their attributes and hierarchical order.
- Use the Beautiful Soup object to navigate to the desired element.
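
The advice about sleeping between GET calls can be sketched as a small helper. This is only a minimal sketch: `fetch_politely`, its `fetch` parameter, and the `delay_seconds` default are names assumed for illustration and are not part of any library. In a real script `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`; a stand-in is used here so the example runs without network access.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 10.0,
) -> Dict[str, str]:
    """Fetch each URL once, sleeping between consecutive requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


# Hypothetical stand-in fetcher, used instead of a real HTTP call.
def fake_fetch(url: str) -> str:
    return f"<html><title>{url}</title></html>"


pages = fetch_politely(["page-1", "page-2"], fake_fetch, delay_seconds=0.01)
print(list(pages))
```

Passing the fetch function in as a parameter keeps the delay logic separate from the HTTP library, which also makes it easy to try out offline.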

# An example of parsing HTML

Visit [w3schools](https://www.w3schools.com/html/html_basic.asp) for an introduction to HTML.


In [16]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""



In [17]:
from bs4 import BeautifulSoup as bsoup

In [18]:
soup = bsoup(html_doc, 'lxml')
print(type(soup))
print(soup)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>



# Navigating this data structure

In [19]:
soup.title

<title>The Dormouse's story</title>

In [20]:
soup.title.text

"The Dormouse's story"

In [21]:
soup.title.string

"The Dormouse's story"

In [25]:
soup.title.parents

<generator object PageElement.parents at 0x11bcfcde0>

In [26]:
# Text of the parent tag (here, the head tag)
soup.title.parent.text

"The Dormouse's story"

In [28]:
head_tag = soup.head
head_tag

# .contents returns the children of head as a list
head_tag.contents

[<title>The Dormouse's story</title>]

In [31]:
title_tag = head_tag.contents[0]
title_tag

title_tag.contents

# .children iterates over the same children without building a list
for child in title_tag.children:
    print(child)

The Dormouse's story


In [32]:
#Gets the first occurrence 
soup.body.b

<b>The Dormouse's story three little sisters</b>

In [33]:
#Finds the first hyperlink
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# The p tag represents a paragraph of text

In [34]:
soup.p

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>

A p tag can have attributes too, such as class here. This is how to get the value of an attribute:

In [36]:
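# Value of one attribute; class can hold several values, so this is the list ['story_title']
soup.p['class']

# All attributes as a dict (the last expression is what the cell displays)
soup.p.attrs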

{'class': ['story_title']}

But there are more **p** tags in the document. This is how to get all of them from the soup data structure:

In [37]:
soup.find_all('p')

[<p class="story_title"><b>The Dormouse's story three little sisters</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

The **a** tag defines a hyperlink, which links to another web page.

In [38]:
# Print the URL (href) of every a tag
for atag in soup.find_all('a'):
    print(atag['href'])

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [39]:
# Third link
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
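
Besides `id`, `find` and `find_all` can filter on other attributes, and `select` accepts CSS selectors. The following is a minimal sketch on a small stand-in document (`snippet` and `soup_small` are names made up for this example). It uses the built-in `html.parser` so that lxml is not needed.

```python
from bs4 import BeautifulSoup

snippet = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
<p class="other"><a href="http://example.com/other" id="link9">Other</a></p>
"""
soup_small = BeautifulSoup(snippet, "html.parser")

# Filter by CSS class; the keyword is class_ because class is reserved in Python.
names = [a.get_text() for a in soup_small.find_all("a", class_="sister")]

# CSS selector: every <a> that sits inside a <p class="story">.
links = [a["href"] for a in soup_small.select("p.story a")]

print(names)
print(links)
```

Both calls skip the link inside the `other` paragraph, because it has neither the `sister` class nor a `story` parent.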

In [40]:
# complete text
soup.get_text()

"The Dormouse's story\n\nThe Dormouse's story three little sisters\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n\n\n"

# Let's create some Beautiful Soup

We will scrape Fry's Electronics for telescopes, following the protocol above, and store the result in a CSV file.



# Checking robots.txt

In [41]:
!curl  https://www.frys.com/robots.txt

User-agent: * 
Crawl-delay: 10 
Sitemap: https://www.frys.com/sitemap_index.xml 
Visit-time: 0030-0300 
Disallow: /ShopCartServlet 
Disallow: /wf 
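
The rules above can also be checked programmatically with the standard library's `urllib.robotparser`. This is a sketch under one assumption: the `robots_txt` string is pasted from the output above so the example runs offline. In a live script, `set_url` followed by `read` would download the file instead.

```python
from urllib.robotparser import RobotFileParser

# Copied from the curl output above (an offline stand-in for the live file).
robots_txt = """\
User-agent: *
Crawl-delay: 10
Sitemap: https://www.frys.com/sitemap_index.xml
Visit-time: 0030-0300
Disallow: /ShopCartServlet
Disallow: /wf
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Search pages are not listed under Disallow, the shopping cart servlet is.
search_allowed = parser.can_fetch("*", "https://www.frys.com/search")
cart_allowed = parser.can_fetch("*", "https://www.frys.com/ShopCartServlet")

# Seconds the site asks crawlers to wait between requests.
delay = parser.crawl_delay("*")

print(search_allowed, cart_allowed, delay)
```

The `Visit-time` line is a nonstandard directive that this parser does not interpret, so the requested visiting hours still have to be respected by hand.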



In [42]:
# import requests and bs4 
from bs4 import BeautifulSoup as bsoup
import requests

In [43]:
response = requests.get('https://www.frys.com/search?query_search=&cat=-68822&nearbyStoreName=false&pType=pDisplay&fq=a%20Regular%20Items&start=0&cat=-68822&from=0&to=99&isKeyword=true')

In [44]:
response.status_code

200

In [45]:
response.text

'\r\n\r\n<!-- Desktop page for search. -->\r\n\r\n<HTML lang="en">\r\n\t<HEAD>\r\n\t\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\r\n\t\t<base href="https://www.frys.com/">\r\n\t\t<TITLE>Fry\'s Electronics | </TITLE>\r\n\t\t<!-- Turn off telephone number detection. -->\r\n<meta name = "format-detection" content = "telephone=no">\r\n\r\n\r\n<LINK rel="stylesheet" href="/js/menus.css" type="text/css" media="screen"/>\r\n\r\n\r\n\r\n\r\n<!-- 629171 -->\r\n<script type="text/javascript">\r\nfunction fnvalidatePhoneNumber(phoneNumbers,id) {\r\n\t\t  var phonenumber=jQuery(phoneNumbers).val();\r\n\t\t  if(phonenumber!=null && phonenumber!=\'\' && phonenumber!=\' \') { \r\n\t\t  phonenumber = phonenumber.trim();\r\n\t\t  var phonenumber_d=phonenumber.replace(/-/gi,"").replace(/[(]/gi,"").replace(/ /gi,"").replace(/[)]/gi,"");\r\n\t\t\t  var regExp1 = "^[0-9]{11,}$";\r\n\t\t\t  var phone = (phonenumber_d).match(regExp1);\r\n\t\t\t  if (phone) {\r\n\t\t\t\t  jQuery(phoneNumbers).val(p

In [46]:
soup= bsoup(response.text, 'lxml')
soup

<!-- Desktop page for search. --><html lang="en">
<head>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<base href="https://www.frys.com/"/>
<title>Fry's Electronics | </title>
<!-- Turn off telephone number detection. -->
<meta content="telephone=no" name="format-detection"/>
<link href="/js/menus.css" media="screen" rel="stylesheet" type="text/css"/>
<!-- 629171 -->
<script type="text/javascript">
function fnvalidatePhoneNumber(phoneNumbers,id) {
		  var phonenumber=jQuery(phoneNumbers).val();
		  if(phonenumber!=null && phonenumber!='' && phonenumber!=' ') { 
		  phonenumber = phonenumber.trim();
		  var phonenumber_d=phonenumber.replace(/-/gi,"").replace(/[(]/gi,"").replace(/ /gi,"").replace(/[)]/gi,"");
			  var regExp1 = "^[0-9]{11,}$";
			  var phone = (phonenumber_d).match(regExp1);
			  if (phone) {
				  jQuery(phoneNumbers).val(phonenumber_d.substring(0,3)+"-"+phonenumber_d.substring(3,6)+"-"+phonenumber_d.substring(6,15));
				  document.getElementById(id).innerHTML

In [49]:
telescope_containers = soup.find_all('div', {"class":"col-xs-12 col-sm-12 pad_lr_tab5 product togrid"})
telescope_containers

[<div class="col-xs-12 col-sm-12 pad_lr_tab5 product togrid" id="prodCol" style="margin-top: 10px;position: relative;">
 <div class="col-xs-12 col-sm-4 col-md-4" id="prodImg" style="text-align: center">
 <!-- start scr #583483 -->
 <!-- end scr #583483 -->
 <a data-ajax="false" href="product/9742974;jsessionid=2FC7901AF82DFE55986E7F35ED4CCF5A.node1?site=sr:SEARCH:MAIN_RSLT_PG">
 <img alt="Frys#9742974" height="200px" onerror="this.onerror=null;setDefaultImage(this,'tn')" src="https://images.frys.com/art/product/300x300/9742974.01.prod.jpg" width="200px"/>
 </a>
 </div>
 <div class="col-xs-12 col-sm-4 col-md-5 pad_none_tab pad_lr_desk5 toGirdDesc" id="prodDesc" style="margin-top: 5px;">
 <p class="font_reg productDescp" id="prodDescp">
 <small><b><a data-ajax="false" href="product/9742974;jsessionid=2FC7901AF82DFE55986E7F35ED4CCF5A.node1?site=sr:SEARCH:MAIN_RSLT_PG">Meade StarPro AZ 70mm Refractor Telescope Meade StarPro AZ 70mm Meade StarPro AZ 70mm</a></b></small>
 </p>
 <div class="c

In [50]:
telescope_container = telescope_containers[0]
telescope_container

<div class="col-xs-12 col-sm-12 pad_lr_tab5 product togrid" id="prodCol" style="margin-top: 10px;position: relative;">
<div class="col-xs-12 col-sm-4 col-md-4" id="prodImg" style="text-align: center">
<!-- start scr #583483 -->
<!-- end scr #583483 -->
<a data-ajax="false" href="product/9742974;jsessionid=2FC7901AF82DFE55986E7F35ED4CCF5A.node1?site=sr:SEARCH:MAIN_RSLT_PG">
<img alt="Frys#9742974" height="200px" onerror="this.onerror=null;setDefaultImage(this,'tn')" src="https://images.frys.com/art/product/300x300/9742974.01.prod.jpg" width="200px"/>
</a>
</div>
<div class="col-xs-12 col-sm-4 col-md-5 pad_none_tab pad_lr_desk5 toGirdDesc" id="prodDesc" style="margin-top: 5px;">
<p class="font_reg productDescp" id="prodDescp">
<small><b><a data-ajax="false" href="product/9742974;jsessionid=2FC7901AF82DFE55986E7F35ED4CCF5A.node1?site=sr:SEARCH:MAIN_RSLT_PG">Meade StarPro AZ 70mm Refractor Telescope Meade StarPro AZ 70mm Meade StarPro AZ 70mm</a></b></small>
</p>
<div class="col-xs-12 pad_

In [51]:
def scrap_telescope(telescope_container):
    """Return a dict with the description and attributes of one product."""
    product_dict = {}
    # The description block holds the product title and its attributes
    telescope_info = telescope_container.find('div', {"class":"col-xs-12 col-sm-4 col-md-5 pad_none_tab pad_lr_desk5 toGirdDesc"})

    # The first <p> inside it contains the product title
    product_desc_container = telescope_info.find('p')
    product_desc = product_desc_container.text.strip()
    print(product_desc)

    product_dict['product_desc'] = product_desc
    product_info_container = telescope_info.find('div', {"class":"col-xs-12 pad_none_tab pad_none_desk prodModel"})

    # Each attribute <p> reads "key: value"; the last <p> is skipped
    product_attr_container = product_info_container.find_all('p')

    for product_attr in product_attr_container[:-1]:
        product_val = product_attr.text.strip().split(':')
        product_dict[product_val[0]] = product_val[1]
    return product_dict

In [52]:
for telescope_container in telescope_containers:
    print(scrap_telescope(telescope_container))


Meade StarPro AZ 70mm Refractor Telescope Meade StarPro AZ 70mm Meade StarPro AZ 70mm
{'product_desc': 'Meade StarPro AZ 70mm Refractor Telescope Meade StarPro AZ 70mm Meade StarPro AZ 70mm', 'Frys #': ' 9742974', 'Brand': ' MEADE', 'UPC ': ' 709942999839', 'Model': ' StarPro AZ 70mm Refr'}
Meade 50mm Infinity Altazimuth Refractor Telescope
{'product_desc': 'Meade 50mm Infinity\x99 Altazimuth Refractor Telescope', 'Frys #': ' 9672093', 'Brand': ' MEADE', 'UPC ': ' 709942997002', 'Model': ' 209001'}
Meade ETX80 Observer 80mm f/5 Achromat Refractor Telescope
{'product_desc': 'Meade ETX80 Observer 80mm f/5 Achromat Refractor Telescope', 'Frys #': ' 9672053', 'Brand': ' MEADE', 'UPC ': ' 709942997491', 'Model': ' 205002'}
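
The dictionaries above keep stray whitespace (`'UPC '`, `' MEADE'`), and indexing the result of `split(':')` would raise an error on a line without a colon. One possible alternative, sketched here with a hypothetical helper `parse_attributes` and sample strings reconstructed from the output above, is `str.partition` combined with `strip`:

```python
from typing import Dict, Iterable


def parse_attributes(lines: Iterable[str]) -> Dict[str, str]:
    """Turn 'key: value' strings into a dict with whitespace trimmed."""
    attributes: Dict[str, str] = {}
    for line in lines:
        # partition splits at the first colon only and never raises.
        key, separator, value = line.partition(":")
        if not separator:
            # No colon on this line, so there is nothing to store.
            continue
        attributes[key.strip()] = value.strip()
    return attributes


sample = ["Frys #: 9742974", "Brand: MEADE", "UPC : 709942999839", "no colon here"]
print(parse_attributes(sample))
# {'Frys #': '9742974', 'Brand': 'MEADE', 'UPC': '709942999839'}
```

Splitting at the first colon only also keeps values that contain a colon themselves in one piece.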


### Let's look at another example found in a pretty neat article [here](https://towardsdatascience.com/data-science-skills-web-scraping-using-python-d1a85ef607ed)

In [11]:
# import libraries
from bs4 import BeautifulSoup as bsoup
import urllib.request
import csv

In [12]:
# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [13]:
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)

# parse the html using beautiful soup and store in variable 'soup'
soup = bsoup(page, 'html.parser')

In [14]:
soup

<!-- Template Name: League Table page
-->
<!DOCTYPE html>

<!--[if lt IE 7 ]> <html class="ie ie6 no-js" lang="en-GB"> <![endif]-->
<!--[if IE 7 ]>    <html class="ie ie7 no-js" lang="en-GB"> <![endif]-->
<!--[if IE 8 ]>    <html class="ie ie8 no-js" lang="en-GB"> <![endif]-->
<!--[if IE 9 ]>    <html class="ie ie9 no-js" lang="en-GB"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" lang="en-GB">
<!--<![endif]-->
<!-- the "no-js" class is for Modernizr. -->
<head id="live2-fasttrack-com"><link data-minify="1" href="https://www.fasttrack.co.uk/wp-content/cache/min/1/491f36ab0b8dd6f7a583b637337b7273.css" rel="stylesheet"/>
<meta charset="utf-8"/>
<!-- Always force latest IE rendering engine (even in intranet) & Chrome Frame -->
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title>
        League table - Fast Track    </title>
<meta content="League table - Fast Track" name="title"/>
<meta content="" name="description"/>
<meta content="" name="keyword"/>
<meta con

In [15]:
# Let's find the data from the table :
table = soup.find('table', attrs={'class': 'tableSorter2'})
results = table.find_all('tr')
print('Number of results', len(results))

Number of results 101


In [18]:
#Let's look at the top 2 rows to check the structure of each row in our table :
results[:2]

[<tr>
 <th>Rank</th>
 <th>Company</th>
 <th class="">Location</th>
 <th class="no-word-wrap">Year end</th>
 <th class="" style="text-align:right;">Annual sales rise over 3 years</th>
 <th class="" style="text-align:right;">Latest sales £000s</th>
 <th class="" style="text-align:right;">Staff</th>
 <th class="">Comment</th>
 <!--				<th>FYE</th>-->
 </tr>, <tr>
 <td>1</td>
 <td><a href="https://www.fasttrack.co.uk/company_profile/revolut-2/"><span class="company-name">Revolut</span></a>Digital banking services provider</td>
 <td>East London</td>
 <td>Dec 18</td>
 <td style="text-align:right;">507.56%</td>
 <td style="text-align:right;">*58,300</td>
 <td style="text-align:right;">700</td>
 <td>Valued at $1.7bn in 2018 and reported to be raising an additional $500m this year that could value it at $5bn</td>
 <!--						<td>Dec 18</td>-->
 </tr>]

In [19]:
# create and write headers to a list 
rows = []
rows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales £000s', 'Staff', 'Comments'])
print(rows)

[['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales £000s', 'Staff', 'Comments']]


In [20]:
results

[<tr>
 <th>Rank</th>
 <th>Company</th>
 <th class="">Location</th>
 <th class="no-word-wrap">Year end</th>
 <th class="" style="text-align:right;">Annual sales rise over 3 years</th>
 <th class="" style="text-align:right;">Latest sales £000s</th>
 <th class="" style="text-align:right;">Staff</th>
 <th class="">Comment</th>
 <!--				<th>FYE</th>-->
 </tr>, <tr>
 <td>1</td>
 <td><a href="https://www.fasttrack.co.uk/company_profile/revolut-2/"><span class="company-name">Revolut</span></a>Digital banking services provider</td>
 <td>East London</td>
 <td>Dec 18</td>
 <td style="text-align:right;">507.56%</td>
 <td style="text-align:right;">*58,300</td>
 <td style="text-align:right;">700</td>
 <td>Valued at $1.7bn in 2018 and reported to be raising an additional $500m this year that could value it at $5bn</td>
 <!--						<td>Dec 18</td>-->
 </tr>, <tr>
 <td>2</td>
 <td><a href="https://www.fasttrack.co.uk/company_profile/bizuma-3/"><span class="company-name">Bizuma</span></a>B2B e-commerce p

In [9]:
# loop over results
for result in results:
    # find all columns per result
    data = result.find_all('td')
    # check that columns have data 
    if len(data) == 0: 
        continue
        
    # write columns to variables
    rank = data[0].getText()
    company = data[1].getText()
    location = data[2].getText()
    yearend = data[3].getText()
    salesrise = data[4].getText()
    sales = data[5].getText()
    staff = data[6].getText()
    comments = data[7].getText()
    
#     print('Company is', company) 
#     print('Sales', sales)
    
    
    # extract description from the name
    companyname = data[1].find('span', attrs={'class':'company-name'}).getText()    
    description = company.replace(companyname, '')
    
    # remove unwanted characters
    sales = sales.strip('*').strip('†').replace(',','')
    
    
    # go to link and extract company website
    url = data[1].find('a').get('href')
    page = urllib.request.urlopen(url)
    # parse the html (separate name so the league-table soup is not overwritten)
    company_soup = bsoup(page, 'html.parser')
    # find the last result in the table and get the link
    try:
        tableRow = company_soup.find('table').find_all('tr')[-1]
        webpage = tableRow.find('a').get('href')
    except (AttributeError, IndexError):
        # no table, no rows, or no link on the company page
        webpage = None
        
    # write each result to rows
    rows.append([rank, companyname, webpage, description, location, yearend, salesrise, sales, staff, comments])
## print(rows)

In [65]:
for row in rows:
    print(row)

['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales £000s', 'Staff', 'Comments']
['1', 'Revolut', 'http://www.revolut.com', 'Digital banking services provider', 'East London', 'Dec 18', '507.56%', '58300', '700', 'Valued at $1.7bn in 2018 and reported to be raising an additional $500m this year that could value it at $5bn']
['2', 'Bizuma', 'http://www.bizuma.com', 'B2B e-commerce platform', 'Central London', 'Mar 19', '315.18%', '26414', '114', 'Connects wholesale buyers and sellers from over 50 countries']
['3', 'Global-e', 'http://www.global-e.com', 'Cross-border ecommerce solutions', 'Central London', 'Dec 18', '303.09%', '29297', '28', 'Its technology helps ecommerce retailers localise their websites in more than 200 markets']
['4', 'Jungle Creations', 'http://www.junglecreations.com', 'Social media & ecommerce services', 'East London', 'Dec 18', '302.53%', '15972', '159', 'Launched the first-ever delivery-only restaur
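
Every value in the rows above is still a string. If numbers are needed later, the cleaning step from the loop can be extended into small converters. This is a sketch only: `clean_sales` and `clean_percent` are hypothetical helper names, and the inputs are cell values shown earlier in the table.

```python
def clean_sales(raw: str) -> int:
    """Convert a sales cell such as '*58,300' to an integer (in £000s)."""
    # Drop footnote markers at either end, then the thousands separators.
    return int(raw.strip().strip("*†").replace(",", ""))


def clean_percent(raw: str) -> float:
    """Convert a percentage cell such as '507.56%' to a float."""
    return float(raw.strip().rstrip("%"))


print(clean_sales("*58,300"))    # 58300
print(clean_percent("507.56%"))  # 507.56
```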

In [10]:
# Create csv and write rows to output file
with open('techtrack100.csv','w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(rows)
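
To check the result, the file can be read back with `csv.DictReader`. The sketch below is self-contained rather than tied to the scraped data: it writes three sample rows (values taken from the output above) to a temporary file, and the file name `techtrack100_sample.csv` is made up for the example.

```python
import csv
import os
import tempfile

rows = [
    ["Rank", "Company Name", "Sales £000s"],
    ["1", "Revolut", "58300"],
    ["2", "Bizuma", "26414"],
]

# Write to a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "techtrack100_sample.csv")

    # An explicit encoding writes the £ sign the same way on every platform.
    with open(path, "w", newline="", encoding="utf-8") as f_output:
        csv.writer(f_output).writerows(rows)

    # DictReader uses the first row as the field names.
    with open(path, newline="", encoding="utf-8") as f_input:
        records = list(csv.DictReader(f_input))

print(records[0]["Company Name"], records[0]["Sales £000s"])
```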