# Data Engineering - Web Scraping

## Exercise 1: To Scrape dot Com

For this exercise, we will use a site that was actually _made for scraping_: [Web Scraping Sandbox](https://toscrape.com/) 

In [2]:
# 1.1 imports (regex, beautifulsoup, requests, and pandas)
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import re 

In [3]:
# 1.2 scrape all urls from https://toscrape.com/
url = 'https://toscrape.com/'
page = requests.get(url)

page


<Response [200]>

In [4]:
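# name the parser explicitly so BeautifulSoup does not guess one (and warn about it)
soup = bs(page.content, "html.parser")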

print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Scraping Sandbox
  </title>
  <link href="./css/bootstrap.min.css" rel="stylesheet"/>
  <link href="./css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row">
    <div class="col-md-1">
    </div>
    <div class="col-md-10 well">
     <img class="logo" src="img/zyte.png" width="200px"/>
     <h1 class="text-right">
      Web Scraping Sandbox
     </h1>
    </div>
   </div>
   <div class="row">
    <div class="col-md-1">
    </div>
    <div class="col-md-10">
     <h2>
      Books
     </h2>
     <p>
      A
      <a href="http://books.toscrape.com">
       fictional bookstore
      </a>
      that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at:
      <a href="http://books.toscrape.com">
       books.toscra
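
The prompt for 1.2 asks for the URLs, while the cells above stop at fetching and pretty-printing the page. One possible way to pull the links out of the parsed HTML is sketched below. It runs on a small hand-written snippet (an assumption, modelled on the output above) so it works offline; `extract_links` and `SAMPLE_HTML` are hypothetical names, not part of the original notebook.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down HTML modelled on the toscrape.com landing page.
SAMPLE_HTML = """
<html><body>
  <p>A <a href="http://books.toscrape.com">fictional bookstore</a>.</p>
  <p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes.</p>
  <a href="img/zyte.png">relative link (made up for this example)</a>
  <a>anchor without an href</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return the absolute URL of every <a> tag that carries an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML, "https://toscrape.com/")
print(links)
```

With the real page, `page.content` from the cell above could be passed in place of `SAMPLE_HTML`.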

In [9]:
# 1.3 scrape all text ('p') from https://toscrape.com/
# find_all is the current name; findAll is a legacy alias
elements = soup.find_all("p")
for elt in elements:
    print(elt, '\n')

<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p> 

<p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.</p> 
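
The output above still contains the `<p>` and `<a>` tags. If only the visible text is wanted, `get_text()` can be called on each element. This is a minimal sketch on a shortened copy of the two paragraphs (the shortening and the name `SAMPLE_HTML` are assumptions made for the example):

```python
from bs4 import BeautifulSoup

# Shortened, hand-copied versions of the two paragraphs printed above.
SAMPLE_HTML = (
    '<p>A <a href="http://books.toscrape.com">fictional bookstore</a> '
    "that desperately wants to be scraped.</p>"
    '<p><a href="http://quotes.toscrape.com/">A website</a> '
    "that lists quotes from famous people.</p>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get_text() concatenates the text of the tag and all nested tags
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

for text in paragraphs:
    print(text)
```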



## Exercise 2: The Office (wikipedia)

For this exercise, scrape the sidebar infobox (text only) as a dictionary from [The Office Wikipedia Page](https://en.wikipedia.org/wiki/The_Office_(American_TV_series)).

Convert your dictionary into a DataFrame and print it as shown:

![](../assets/the_office_DF.png)

In [3]:
# exercise 2
url = 'https://en.wikipedia.org/wiki/The_Office_(American_TV_series)'
page = requests.get(url)
page

<Response [200]>

In [4]:
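# name the parser explicitly so BeautifulSoup does not guess one (and warn about it)
webpage = bs(page.content, "html.parser")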
print(webpage.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Office (American TV series) - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(

In [10]:
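# note: findChildren is referenced here without parentheses, so this prints
# the bound method (followed by the table's HTML), not a list of children
print(webpage.table.findChildren)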

<bound method Tag.find_all of <table class="infobox vevent"><tbody><tr><th class="infobox-above summary" colspan="2" style="background: #CCCCFF; padding: 0.25em 1em; font-size: 125%;"><i>The Office</i></th></tr><tr><td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:The_Office_US_logo.svg"><img alt="The Office US logo.svg" data-file-height="85" data-file-width="500" decoding="async" height="43" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/80/The_Office_US_logo.svg/250px-The_Office_US_logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/80/The_Office_US_logo.svg/375px-The_Office_US_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/80/The_Office_US_logo.svg/500px-The_Office_US_logo.svg.png 2x" width="250"/></a></td></tr><tr><th class="infobox-label" scope="row">Genre</th><td class="infobox-data category"><style data-mw-deduplicate="TemplateStyles:r1126788409">.mw-parser-output .plainlist ol,.mw-parser-output .plainlist ul{l
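
The output above begins with `<bound method ...`, which is what Python prints when a method is referenced but not called. The sketch below contrasts the two on a tiny, made-up table (`SAMPLE_HTML` is an assumption for illustration, not the real infobox):

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the Wikipedia infobox.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Genre</th><td>Sitcom</td></tr>
  <tr><th>Original language</th><td>English</td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser").table

method = table.find_all          # no parentheses: only a reference to the method
rows = table.find_all("tr")      # called: returns the matching tags
direct_children = table.find_all(recursive=False)  # direct child tags only

print(callable(method))
print(len(rows))
print([child.name for child in direct_children])
```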

In [26]:
# the first <table> on the page is the infobox; its <th> cells hold the field names
table = webpage.select("table")[0]
columns = table.select("tr th")
column_names = [c.get_text() for c in columns]
column_names


['The Office',
 'Genre',
 'Based on',
 'Developed by',
 'Starring',
 'Theme music composer',
 'Country of origin',
 'Original language',
 'No. of seasons',
 'No. of episodes',
 'Production',
 'Executive producers',
 'Producers',
 'Cinematography',
 'Editors',
 'Camera setup',
 'Running time',
 'Production companies',
 'Distributor',
 'Release',
 'Original network',
 'Picture format',
 'Audio format',
 'Original release',
 'Chronology',
 'Related']
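
The list above mixes field labels with section headers such as 'Production' and 'Release', which have no value cell of their own. One way to avoid lining labels and values up by position is to read each row's label and value together. The sketch below assumes the `infobox-label` and `infobox-data` classes visible in the HTML printed earlier; the sample markup, the `infobox-header` class, and the function name `infobox_to_dict` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened infobox markup.
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th class="infobox-above" colspan="2"><i>The Office</i></th></tr>
  <tr><th class="infobox-label">Developed by</th>
      <td class="infobox-data">Greg Daniels</td></tr>
  <tr><th class="infobox-header" colspan="2">Production</th></tr>
  <tr><th class="infobox-label">Original language</th>
      <td class="infobox-data">English</td></tr>
</table>
"""


def infobox_to_dict(html):
    """Map each row's label text to its data text, skipping rows without both."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.infobox")
    result = {}
    for row in table.find_all("tr"):
        label = row.find("th", class_="infobox-label")
        data = row.find("td", class_="infobox-data")
        # title, image, and section-header rows have no label/data pair
        if label is not None and data is not None:
            result[label.get_text(strip=True)] = data.get_text(" ", strip=True)
    return result


info = infobox_to_dict(SAMPLE_HTML)
print(info)
```

Because each key and value come from the same `<tr>`, a row without a value cannot shift the remaining pairs out of alignment.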

In [155]:
dict_tags = {}
rows = webpage.find_all('tr')  # every <tr> on the page

# map each row index to the <i>, <div>, and <td> tags found inside that row
for idx, row in enumerate(rows):
    dict_tags[idx] = row.find_all(["i", "div", "td"])

dict_tags

{0: [<i>The Office</i>],
 1: [<td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:The_Office_US_logo.svg"><img alt="The Office US logo.svg" data-file-height="85" data-file-width="500" decoding="async" height="43" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/80/The_Office_US_logo.svg/250px-The_Office_US_logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/80/The_Office_US_logo.svg/375px-The_Office_US_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/80/The_Office_US_logo.svg/500px-The_Office_US_logo.svg.png 2x" width="250"/></a></td>],
 2: [<td class="infobox-data category"><style data-mw-deduplicate="TemplateStyles:r1126788409">.mw-parser-output .plainlist ol,.mw-parser-output .plainlist ul{line-height:inherit;list-style:none;margin:0;padding:0}.mw-parser-output .plainlist ol li,.mw-parser-output .plainlist ul li{margin-bottom:0}</style><div class="plainlist">
  <ul><li><a href="/wiki/Mockumentary" title="Mockumentary">Mo

In [156]:

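# for each non-empty row, keep the text of the first tag's children,
# replacing newlines with " , " separators
for k, v in dict_tags.items():
    if v:
        dict_tags[k] = [a.get_text().replace("\n", " , ") for a in v[0]]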

dict_tags

{0: ['The Office'],
 1: [''],
 2: ['.mw-parser-output .plainlist ol,.mw-parser-output .plainlist ul{line-height:inherit;list-style:none;margin:0;padding:0}.mw-parser-output .plainlist ol li,.mw-parser-output .plainlist ul li{margin-bottom:0}',
  ' , Mockumentary , Workplace comedy , Cringe comedy , Sitcom , '],
 3: ['The Officeby Ricky GervaisStephen Merchant'],
 4: ['Greg Daniels'],
 5: ['',
  ' , Steve Carell , Rainn Wilson , John Krasinski , Jenna Fischer , B. J. Novak , Melora Hardin , David Denman , Leslie David Baker , Brian Baumgartner , Kate Flannery , Angela Kinsey , Oscar Nunez , Phyllis Smith , Ed Helms , Mindy Kaling , Paul Lieberstein , Creed Bratton , Craig Robinson , Ellie Kemper , Zach Woods , Amy Ryan , James Spader , Catherine Tate , Clark Duke , Jake Lacy , '],
 6: ['Jay Ferguson'],
 7: ['United States'],
 8: ['English'],
 9: ['9'],
 10: ['201 ', '(list of episodes)'],
 11: [],
 12: ['',
  ' , Ben Silverman , Greg Daniels , Ricky Gervais , Stephen Merchant , Howard K
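
In the output above, the 'Genre' row (key 2) carries CSS text from an inline `<style>` tag, and list rows come back as one long string with stray separators. One possible cleanup is to drop `<style>` tags before reading the text and to collect `<li>` items individually. This is a sketch on made-up markup; `cell_items` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the Genre cell.
SAMPLE_HTML = """
<td class="infobox-data category"><style>.plainlist ul{list-style:none}</style>
<div class="plainlist"><ul>
<li><a href="/wiki/Mockumentary">Mockumentary</a></li>
<li><a href="/wiki/Sitcom">Sitcom</a></li>
</ul></div></td>
"""


def cell_items(cell):
    """Return the cell's list items, or its whole text if it has no list."""
    # remove inline stylesheets so their CSS does not leak into the text
    for style in cell.find_all("style"):
        style.decompose()
    items = [li.get_text(strip=True) for li in cell.find_all("li")]
    return items or [cell.get_text(" ", strip=True)]


genre_cell = BeautifulSoup(SAMPLE_HTML, "html.parser").find("td")
genres = cell_items(genre_cell)
print(genres)
```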

In [157]:
# drop row 1 (the logo image row), which has no text
dict_tags.pop(1)

['']

In [158]:
# pair each column name with the row values in the same position
dict_tags2 = dict(zip(column_names, dict_tags.values()))
dict_tags2

{'The Office': ['The Office'],
 'Genre': ['.mw-parser-output .plainlist ol,.mw-parser-output .plainlist ul{line-height:inherit;list-style:none;margin:0;padding:0}.mw-parser-output .plainlist ol li,.mw-parser-output .plainlist ul li{margin-bottom:0}',
  ' , Mockumentary , Workplace comedy , Cringe comedy , Sitcom , '],
 'Based on': ['The Officeby Ricky GervaisStephen Merchant'],
 'Developed by': ['Greg Daniels'],
 'Starring': ['',
  ' , Steve Carell , Rainn Wilson , John Krasinski , Jenna Fischer , B. J. Novak , Melora Hardin , David Denman , Leslie David Baker , Brian Baumgartner , Kate Flannery , Angela Kinsey , Oscar Nunez , Phyllis Smith , Ed Helms , Mindy Kaling , Paul Lieberstein , Creed Bratton , Craig Robinson , Ellie Kemper , Zach Woods , Amy Ryan , James Spader , Catherine Tate , Clark Duke , Jake Lacy , '],
 'Theme music composer': ['Jay Ferguson'],
 'Country of origin': ['United States'],
 'Original language': ['English'],
 'No. of seasons': ['9'],
 'No. of episodes': ['201 

In [175]:
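# note: .items() yields (key, value) tuples, so each cell in 'values' holds a pair;
# dict_tags2.values() would give the values alone
pd.DataFrame({'':column_names, 'values': dict_tags2.items()}).reset_index(drop=True)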

| | | values |
|---|---|---|
| 0 | The Office | (The Office, [The Office]) |
| 1 | Genre | (Genre, [.mw-parser-output .plainlist ol,.mw-p... |
| 2 | Based on | (Based on, [The Officeby Ricky GervaisStephen ... |
| 3 | Developed by | (Developed by, [Greg Daniels]) |
| 4 | Starring | (Starring, [, , Steve Carell , Rainn Wilson ,... |
| 5 | Theme music composer | (Theme music composer, [Jay Ferguson]) |
| 6 | Country of origin | (Country of origin, [United States]) |
| 7 | Original language | (Original language, [English]) |
| 8 | No. of seasons | (No. of seasons, [9]) |
| 9 | No. of episodes | (No. of episodes, [201 , (list of episodes)]) |
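
In the table above, each entry in the `values` column is a `(key, value)` tuple rather than the value alone. A possible alternative is to build the frame from the dictionary's keys and values separately and join each list into one string. The dictionary below is a hypothetical, shortened stand-in for the scraped one, and the column name `field` is an assumption.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the scraped dictionary.
infobox = {
    "Developed by": ["Greg Daniels"],
    "Country of origin": ["United States"],
    "No. of episodes": ["201 ", "(list of episodes)"],
}

df = pd.DataFrame(
    {
        "field": list(infobox.keys()),
        # join each list of text fragments into a single readable string
        "values": [" ".join(s.strip() for s in v) for v in infobox.values()],
    }
)

print(df)
```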
