### Other Popular Web Scraping tools & libraries

This notebook mainly covers how to get data with the Python packages `requests` and `BeautifulSoup`. However, many other Python packages can be used for scraping.

Two popular and widely used alternatives are:

* **[Selenium:](http://selenium-python.readthedocs.io/)** Browser automation tool with Python bindings. It drives a real browser, so a script can act like a human visitor, almost like a macro. Because the browser executes JavaScript, it works with modern JavaScript-based websites built with React, Angular etc.
* **[Scrapy:](https://scrapy.org/)** Framework for automated, script-based web crawling and scraping. It has many built-in tools that facilitate the process (e.g. scheduling and throttling of requests, with IP rotation available through add-on middleware). Mainly used for larger projects.


### API: Application Programming Interfaces

Many services offer APIs to grab data (Twitter, Wikipedia, Reddit etc.). We have already used an API in the Pandas notebook when we grabbed stock data in CSV format to do analysis. If a good API exists, it is usually the preferred method of obtaining data.
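
As a rough illustration of why an API is preferred: the response is already structured (typically JSON), so no HTML parsing is needed. The payload below is made up for this sketch and does not come from any real service. With `requests`, calling `.json()` on a response performs the same decoding step.

```python
import json

# Hypothetical API response body (invented for illustration only)
api_response_text = (
    '{"symbol": "DEMO", "prices": ['
    '{"date": "2020-02-21", "close": 101.5}, '
    '{"date": "2020-02-24", "close": 98.2}]}'
)

# Decode the JSON text into ordinary Python dicts and lists
payload = json.loads(api_response_text)

# Pick out one field per record; no tags or classes to search for
closes = [day["close"] for day in payload["prices"]]
print(payload["symbol"], closes)
```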

# Helpful Web Scraping Cheat Sheet

If you want good documentation of the functions in `requests` and `BeautifulSoup` (as well as how to save scraped data to an SQLite database), this is a good resource:

- https://blog.hartleybrody.com/web-scraping-cheat-sheet/

# Table of Contents
(Clickable document links)
___

### [0: Pre-setup](#sec0)
Document setup and Python 2 and Python 3 compatibility

### [1: Simple webscraping intro](#sec1)

Simple example of webscraping on a premade HTML template

### [2: Scrape Data-X Schedule](#sec2)

Find and scrape the current Data-X schedule. 

### [3: IMDB top 250 movies w MetaScore](#sec3)

Scrape IMDB and compare MetaScore to user reviews.

### [4: Scrape Images and Files](#sec4)

Scrape a website for images, PDFs, CSV data or any other file type.

## [Breakout Problem: Scrape Weather Data](#secBK)

Scrape real time weather data in Berkeley.


### [Appendix](#sec5)

#### [Scrape Bloomberg sitemap for political news headlines](#sec6)

#### [Webcrawl Twitter, recursive URL link fetcher + depth](#sec7)

#### [SEO, visualize website categories as a tree](#sec8)

<a id='sec0'></a>
## Pre-Setup

In [1]:
# stretch Jupyter coding blocks to fit screen
# (display and HTML are imported from the public IPython.display module)
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>")) 
# if 100% it would fit the screen

In [2]:
# make it run on py2 and py3
from __future__ import division, print_function

<a id='sec1'></a>
# Webscraping intro

In order to scrape content from a website we first need to download the HTML contents of the website. This can be done with the Python library **requests** (with its `.get` method).

Then when we want to extract certain information from a website we use the scraping tool **BeautifulSoup4** (import bs4). In order to parse information with beautifulsoup we have to create a soup object from the HTML source code of a website.

In [3]:
import requests # The requests library is an 
# HTTP library for getting and posting content etc.

import bs4 as bs # BeautifulSoup4 is a Python library 
# for pulling data out of HTML and XML code.
# We can query markup languages for specific content

# San Francisco Hotel

In [4]:
source = requests.get("https://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html") 
# a GET request will download the HTML webpage.

**Different types of responses:**
Generally, a status code starting with 2 indicates success, while a status code starting with 4 or 5 indicates an error. Frequently receiving status codes such as 404 (Not Found), 403 (Forbidden), or 408 (Request Timeout) might indicate that you have been blocked.
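
A minimal sketch of the rule of thumb above. The helper name `classify_status` is made up for illustration. With `requests`, the code is available as `source.status_code`, and `source.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
def classify_status(code):
    """Map an HTTP status code to a coarse category by its first digit."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "informational or unknown"

# The codes mentioned in the text, plus a success and a server error
for code in (200, 403, 404, 408, 503):
    print(code, classify_status(code))
```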

In [5]:
# Convert source.content to a beautifulsoup object 
# beautifulsoup can parse (extract specific information) HTML code

soup = bs.BeautifulSoup(source.content, features='html.parser') 
# we pass in the source content
# features specifies what type of code we are parsing, 
# here 'html.parser' specifies that we want beautiful soup to parse HTML code
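
Because the live page can change or block the request, it can help to try the same calls on a small inline HTML string first. The markup and names below are invented for this sketch; only `find`, `find_all`, `class_` and `.get` are the BeautifulSoup calls used throughout this notebook.

```python
import bs4 as bs

# Tiny made-up page, so the example runs without a network connection
html_doc = """
<html><head><title>Demo page</title></head>
<body>
  <h1 id="top">Hotels</h1>
  <p class="hotel">Alpha Inn</p>
  <p class="hotel">Beta Lodge</p>
  <a href="/about">About</a>
</body></html>
"""

demo_soup = bs.BeautifulSoup(html_doc, features='html.parser')

# find returns the first matching tag, find_all returns every match
page_title = demo_soup.find('title').text
hotel_names = [p.text for p in demo_soup.find_all('p', class_='hotel')]

# attributes of a tag are read with .get
first_link = demo_soup.find('a').get('href')
print(page_title, hotel_names, first_link)
```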

## Hotel Name

In [6]:
# we can also search for a class within all tags, using class_
# note: the trailing _ is needed because class is a reserved keyword in Python

print(soup.find(class_='listing_title')) 

<div class="listing_title"><a class="property_title prominent" data-clicksource="HotelName" dir="ltr" href="/Hotel_Review-g60713-d268533-Reviews-Argonaut_Hotel_A_Noble_House_Hotel-San_Francisco_California.html" id="property_268533" onclick="return false;" target="_blank">Argonaut Hotel, A Noble House Hotel</a></div>


In [7]:
import pandas as pd
hotel_info = pd.DataFrame()
name = []
for p in soup.find_all(class_='listing_title'): # collect the name of every listed hotel
    name.append(p.text)
hotel_info['Name'] = name

## Hotel Price

In [8]:
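price = []
for p in soup.find_all(class_='price'): 
    price.append(p.text)
# the page shows several prices per hotel; keep every fourth entry,
# assuming four prices are listed for each hotel
hotel_info['Price'] = price[0::4]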


In [9]:
hotel_info.head()

Unnamed: 0,Name,Price
0,"Argonaut Hotel, A Noble House Hotel",$189
1,Fairmont San Francisco,$306
2,Hyatt Regency San Francisco,$219
3,Club Quarters Hotel in San Francisco,$132
4,Hotel Zoe Fisherman's Wharf,$122
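
The scraped prices are strings such as `$189`, so they cannot be sorted or averaged yet. A small sketch of a conversion step follows; the helper name `price_to_int` is an assumption for illustration, and it assumes whole-dollar prices as in the output above.

```python
def price_to_int(price_text):
    """Convert a price string such as '$1,678' to the integer 1678."""
    return int(price_text.replace('$', '').replace(',', ''))

# Sample values in the same format as the scraped prices
sample_prices = ['$189', '$306', '$1,678']
numeric_prices = [price_to_int(p) for p in sample_prices]
print(numeric_prices)
```

On the DataFrame, the same function could be applied with `hotel_info['Price'].map(price_to_int)`.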


## Hotel Rating

In [10]:
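bubble = [] 
for p in soup.find_all(class_='ui_bubble_rating'): 
    bubble.append(p.get('alt'))
# drop the first five matches (presumably ratings that do not belong to a hotel listing)
bubble = bubble[5:]
# the alt text starts with the score, so keep only the first word
bubble = [bb.split(' ')[0] for bb in bubble]
hotel_info['Rating'] = bubble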
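
The ratings above are still strings, and `p.get('alt')` returns `None` for a tag without an `alt` attribute. Below is a sketch of a more defensive parser; the alt text format `'4.5 of 5 bubbles'` is an assumption about the page, and `parse_rating` is a made-up helper name.

```python
def parse_rating(alt_text):
    """Return the leading number of an alt text as a float, or None."""
    if not alt_text:
        return None  # missing alt attribute or empty string
    try:
        return float(alt_text.split(' ')[0])
    except ValueError:
        return None  # first word is not a number

# Assumed alt text format, plus two problem cases
samples = ['4.5 of 5 bubbles', None, 'No rating']
print([parse_rating(s) for s in samples])
```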

In [11]:
hotel_info.head()

Unnamed: 0,Name,Price,Rating
0,"Argonaut Hotel, A Noble House Hotel",$189,4.5
1,Fairmont San Francisco,$306,4.5
2,Hyatt Regency San Francisco,$219,4.0
3,Club Quarters Hotel in San Francisco,$132,4.5
4,Hotel Zoe Fisherman's Wharf,$122,4.0


## Hotel Type

In [12]:
hotel_info['Type'] = 'Hotel'

In [13]:
hotel_info.head()

Unnamed: 0,Name,Price,Rating,Type
0,"Argonaut Hotel, A Noble House Hotel",$189,4.5,Hotel
1,Fairmont San Francisco,$306,4.5,Hotel
2,Hyatt Regency San Francisco,$219,4.0,Hotel
3,Club Quarters Hotel in San Francisco,$132,4.5,Hotel
4,Hotel Zoe Fisherman's Wharf,$122,4.0,Hotel


# Lake Tahoe

In [14]:
source2 = requests.get("https://www.tripadvisor.com/Hotels-g155987-Lake_Tahoe_California_California-Hotels.html") 
# a GET request will download the HTML webpage.

In [15]:
# Convert source.content to a beautifulsoup object 
# beautifulsoup can parse (extract specific information) HTML code

soup_LT = bs.BeautifulSoup(source2.content, features='html.parser') 
# we pass in the source content
# features specifies what type of code we are parsing, 
# here 'html.parser' specifies that we want beautiful soup to parse HTML code

## Hotel Name

In [16]:
# we can also search for a class within all tags, using class_
# note: the trailing _ is needed because class is a reserved keyword in Python

print(soup_LT.find(class_='listing_title')) 

<div class="listing_title ui_columns is-gapless is-mobile is-multiline"><div class="ui_column is-narrow"><span class="ui_merchandising_pill sponsored_v2">Sponsored</span></div><div class="ui_column is-narrow title_wrap"><a class="property_title prominent" data-clicksource="HotelName" dir="ltr" href="/Hotel_Review-g1798615-d248201-Reviews-Basecamp_South_Lake_Tahoe-South_Lake_Tahoe_Lake_Tahoe_California_California.html" id="property_248201" onclick="return false;" target="_blank">Basecamp South Lake Tahoe</a></div></div>


In [17]:
hotel_info_LT = pd.DataFrame()
name = []
for p in soup_LT.find_all(class_='listing_title'): # collect the name of every listed hotel
    name.append(p.text)
hotel_info_LT['Name'] = name

In [18]:
print(len(name))
hotel_info_LT

32


Unnamed: 0,Name
0,SponsoredBasecamp South Lake Tahoe
1,Beach Retreat & Lodge at Tahoe
2,Hotel Azure
3,Resort at Squaw Creek
4,Marriott's Timber Lodge
5,Forest Suites Resort at Heavenly Village
6,SponsoredOlympic Village Inn
7,7 Seas Inn at Tahoe
8,Postmarc Hotel and Spa Suites
9,"Grand Residences by Marriott, Lake Tahoe"
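
Some names above start with `Sponsored` because `.text` joins the text of every nested tag, including the sponsored label. One way around this, sketched on an abridged copy of the HTML printed earlier (the `href` is shortened to `#`), is to read the name from the inner `property_title` link only.

```python
import bs4 as bs

# Abridged version of the sponsored listing printed above
snippet = (
    '<div class="listing_title ui_columns">'
    '<div class="ui_column is-narrow">'
    '<span class="ui_merchandising_pill sponsored_v2">Sponsored</span></div>'
    '<div class="ui_column is-narrow title_wrap">'
    '<a class="property_title prominent" href="#">Basecamp South Lake Tahoe</a>'
    '</div></div>'
)

listing = bs.BeautifulSoup(snippet, features='html.parser').find(class_='listing_title')

# .text on the outer div also picks up the "Sponsored" label
whole_text = listing.text

# searching inside the listing for the link gives only the hotel name
name_only = listing.find(class_='property_title').text
print(whole_text, '|', name_only)
```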


## Hotel Price

In [19]:
price = []
for p in soup_LT.find_all(class_='price'): 
    price.append(p.text)
    print(p.text)
print(len(price))
# keep every fourth entry, assuming four prices are listed for each hotel
hotel_info_LT['Price'] = price[0::4]

$95
$95
$93
$95
$122
$105
$89
$129
$89
$109
$103
$136
$229
$229
$229
$1,678
$209
$209
$209
$209
$99
$84
$79
$177
$170
$189
$170
$93
$93
$91
$93
$76
$76
$71
$98
$204
$259
$204
$259
$103
$111
$103
$145
$179
$179
$179
$208
$95
$95
$93
$95
$185
$185
$195
$129
$129
$129
$129
$99
$99
$87
$89
$145
$109
$99
$109
$145
$155
$155
$155
$139
$139
$159
$139
$55
$55
$55
$61
$239
$239
$230
$239
$79
$79
$79
$79
$149
$149
$177
$149
$158
$158
$158
$158
$139
$139
$139
$150
$119
$119
$117
$119
$429
$439
$429
$429
$107
$107
$104
$107
$457
$436
$459
$476
$98
$178
$95
$98
$56
$53
$56
$56

$206
$198
$450
126
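
Note that 126 prices for 32 names is not an even four per hotel, so slicing with a fixed step may pair a hotel with the wrong price. A sketch of a sturdier pattern is to loop over one container per hotel and search inside it. The container class `listing` and the markup below are assumptions for illustration; the real page would need to be inspected for the actual wrapper element.

```python
import bs4 as bs

# Made-up markup: one wrapper per hotel, with a varying number of prices
html = """
<div class="listing">
  <div class="listing_title"><a class="property_title">Hotel A</a></div>
  <div class="price">$95</div><div class="price">$93</div>
</div>
<div class="listing">
  <div class="listing_title"><a class="property_title">Hotel B</a></div>
</div>
"""

soup_demo = bs.BeautifulSoup(html, features='html.parser')

rows = []
for listing in soup_demo.find_all(class_='listing'):
    title = listing.find(class_='property_title')
    first_price = listing.find(class_='price')  # None if no price is shown
    rows.append({
        'Name': title.text.strip(),
        'Price': first_price.text if first_price else None,
    })
print(rows)
```

Each row is built from a single listing, so a hotel without a price gets `None` instead of shifting every later price by one position.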


## Hotel Rating

In [20]:
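bubble = [] 
for p in soup_LT.find_all(class_='ui_bubble_rating'): 
    bubble.append(p.get('alt'))
# drop the first five matches (presumably ratings that do not belong to a hotel listing)
bubble = bubble[5:]
# the alt text starts with the score, so keep only the first word
bubble = [bb.split(' ')[0] for bb in bubble]
hotel_info_LT['Rating'] = bubble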

## Hotel Type

In [21]:
hotel_info_LT['Type'] = 'Hotel'

In [22]:
hotel_info_LT

Unnamed: 0,Name,Price,Rating,Type
0,SponsoredBasecamp South Lake Tahoe,$95,4.0,Hotel
1,Beach Retreat & Lodge at Tahoe,$122,4.0,Hotel
2,Hotel Azure,$89,4.5,Hotel
3,Resort at Squaw Creek,$229,4.0,Hotel
4,Marriott's Timber Lodge,$209,4.0,Hotel
5,Forest Suites Resort at Heavenly Village,$99,4.0,Hotel
6,SponsoredOlympic Village Inn,$170,4.0,Hotel
7,7 Seas Inn at Tahoe,$93,5.0,Hotel
8,Postmarc Hotel and Spa Suites,$76,4.5,Hotel
9,"Grand Residences by Marriott, Lake Tahoe",$259,4.5,Hotel


## Activities Lake Tahoe

In [23]:
source3 = requests.get("https://www.tripadvisor.com/Attractions-g155987-Activities-Lake_Tahoe_California_California.html") 
# a GET request will download the HTML webpage.

In [24]:
# Convert source.content to a beautifulsoup object 
# beautifulsoup can parse (extract specific information) HTML code

soup_ALT = bs.BeautifulSoup(source3.content, features='html.parser') 
# we pass in the source content
# features specifies what type of code we are parsing, 
# here 'html.parser' specifies that we want beautiful soup to parse HTML code

## Activity Name

In [25]:
# we can also search for tags by name
# here the activity names are stored in <h3> tags

print(soup_ALT.find('h3')) 

<h3>Nature &amp; Parks</h3>


In [26]:
act_info_LT = pd.DataFrame()
name = []
for p in soup_ALT.find_all('h3'): # collect the text of every <h3> heading
    name.append(p.text)
act_info_LT['Name'] = name

In [27]:
print(len(name))
act_info_LT

48


Unnamed: 0,Name
0,Nature & Parks
1,Ski & Snowboard Areas
2,Hiking Trails
3,Outdoor Activities
4,Shopping
5,Parks
6,Sights & Landmarks
7,Beaches
8,Museums
9,Water Sports


In [32]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url">
# where url is the link target

print(soup.a)
print(type(soup.a))


<a class="tabLink pid18957" data-title="Flights" href="/CheapFlightsHome" onclick="ta.setEvtCookie('TopNav', 'click', 'Flights', 0, this.href);setPID(1940)">
  
    Flights
      
          </a>
<class 'bs4.element.Tag'>


In [33]:
soup.a.get('href') 
# to get the link from href attribute

'/CheapFlightsHome'
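
The extracted `href` is relative to the site, so it cannot be passed to `requests.get` as is. A small sketch using the standard library's `urljoin` to build the absolute URL from the page address used above:

```python
from urllib.parse import urljoin

# Address of the page the link was found on
base_url = 'https://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html'

# A relative href is resolved against the base URL
absolute = urljoin(base_url, '/CheapFlightsHome')

# An href that is already absolute is returned unchanged
unchanged = urljoin(base_url, 'https://data-x.blog/')
print(absolute)
print(unchanged)
```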

In [34]:
links = soup.find_all('a')

In [35]:
links

[<a class="tabLink pid18957" data-title="Flights" href="/CheapFlightsHome" onclick="ta.setEvtCookie('TopNav', 'click', 'Flights', 0, this.href);setPID(1940)">
   
     Flights
       
           </a>,
 <a class="tabLink pid4968" data-title="Vacation Rentals" href="/Rentals" onclick="ta.setEvtCookie('TopNav', 'click', 'VacationRentals', 0, this.href)">
   
     Vacation Rentals
       
           </a>,
 <a class="tabLink pid2973" data-title="Restaurants" href="/Restaurants" onclick="ta.setEvtCookie('TopNav', 'click', 'Restaurants', 0, this.href)">
   
     Restaurants
       
           </a>,
 <a class="tabLink pid39877" data-title="Things to do" href="/Attractions" onclick="ta.setEvtCookie('TopNav', 'click', 'Attractions', 0, this.href)">
   
     Things to do
       
           </a>,
 <a class="subLink" href="/Tourism-g60713-San_Francisco_California-Vacations.html">San Francisco Tourism</a>,
 <a class="subLink" href="/Hotels-g60713-San_Francisco_California-Hotels.html">San Francisco H

In [36]:
# if we want to list links and their text info

links = soup.find_all('a')

for link in links:
    print("Info about {}: ".format(link.text), link.get('href'))
# then we have extracted the link

Info about 
  
    Flights
      
          :  /CheapFlightsHome
Info about 
  
    Vacation Rentals
      
          :  /Rentals
Info about 
  
    Restaurants
      
          :  /Restaurants
Info about 
  
    Things to do
      
          :  /Attractions
Info about San Francisco Tourism:  /Tourism-g60713-San_Francisco_California-Vacations.html
Info about San Francisco Hotels:  /Hotels-g60713-San_Francisco_California-Hotels.html
Info about San Francisco Bed and Breakfast:  /Hotels-g60713-c2-San_Francisco_California-Hotels.html
Info about San Francisco Vacation Rentals:  /VacationRentals-g60713-Reviews-San_Francisco_California-Vacation_Rentals.html
Info about San Francisco Vacation Packages:  /Vacation_Packages-g60713-San_Francisco_California-Vacations.html
Info about Flights to San Francisco:  /Flights-g60713-San_Francisco_California-Cheap_Discount_Airfares.html
Info about San Francisco Restaurants:  /Restaurants-g60713-San_Francisco_California.html
Info about Things to Do in San Fr

Info about Beck's Motor Lodge:  /Hotel_Review-g60713-d112289-Reviews-Beck_s_Motor_Lodge-San_Francisco_California.html
Info about  :  /Hotel_Review-g60713-d112289-Reviews-Beck_s_Motor_Lodge-San_Francisco_California.html#REVIEWS
Info about 734 reviews:  /Hotel_Review-g60713-d112289-Reviews-Beck_s_Motor_Lodge-San_Francisco_California.html#REVIEWS
Info about From the friendliest staff we encountered on our trip to the US, to the large room, comfy bed, to the location, we could not fault Beck’s.:  /ShowUserReviews-g60713-d112289-r670312650-Beck_s_Motor_Lodge-San_Francisco_California.html#review_670312650
Info about We compared the lowest prices from 10 websites:  #
Info about :  /Hotel_Review-g60713-d6383475-Reviews-Hotel_G_San_Francisco-San_Francisco_California.html
Info about Hotel G San Francisco:  /Hotel_Review-g60713-d6383475-Reviews-Hotel_G_San_Francisco-San_Francisco_California.html
Info about  :  /Hotel_Review-g60713-d6383475-Reviews-Hotel_G_San_Francisco-San_Francisco_California.ht

# Other useful scraping tips

### robots.txt

Always check whether a website has a `robots.txt` document specifying which parts of the site you are allowed to scrape. `robots.txt` is only a convention and does not by itself stop `requests` from getting the content, but I'd recommend you all to be nice. The file may also contain information about the allowed scraping frequency etc.

E.g. 
- http://www.imdb.com/robots.txt
- http://www.nytimes.com/robots.txt
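
The check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` inline so it runs offline; for a real site one would instead call `set_url(...)` with the `robots.txt` address followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site)
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a generic crawler ("*") may fetch a given URL
allowed_public = parser.can_fetch("*", "https://example.com/public/page.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/page.html")

# Requested pause between requests in seconds, or None if not specified
delay = parser.crawl_delay("*")
print(allowed_public, allowed_private, delay)
```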

### user-agent

When you send a request to a webpage (no matter if it comes from your computer, iPhone, or Python's `requests` package), you also include a user-agent. This lets the web server know how to render the contents for you. You can also set the user-agent information yourself in a request (to specify who you are, for example, or to disguise that you're an automated scraper).

Find your machine's / browser's true user agent here: https://www.whoishostingthis.com/tools/user-agent/

In [37]:
# user-agent example

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'From': 'data-x@gmail.com' 
}

response = requests.get('http://alex.fo/other/data-x', headers=headers)
print(response)
print(response.headers) # the response will also have some meta information about the content

<Response [200]>
{'Date': 'Mon, 24 Feb 2020 20:27:16 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Thu, 23 Jan 2020 15:54:08 GMT', 'Vary': 'Accept-Encoding', 'Access-Control-Allow-Origin': '*', 'Expires': 'Mon, 24 Feb 2020 20:37:16 GMT', 'Cache-Control': 'max-age=600', 'X-Proxy-Cache': 'MISS', 'X-GitHub-Request-Id': 'CC80:30A6:23F53:2E687:5E5431A4', 'CF-Cache-Status': 'DYNAMIC', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '56a42de3fae8d352-LAX', 'Content-Encoding': 'gzip'}


<a id='sec2'></a>

# Data-X website Scraping
### Now let us scrape the current Syllabus Schedule from the Data-X website


In [38]:
source = requests.get('https://data-x.blog/').content 
# get the source content

In [39]:
soup = bs.BeautifulSoup(source,'html.parser')

In [40]:
print(soup.prettify()) 
# .prettify() method makes the HTML code more readable

# as you can see this code is more difficult 
# to read than the simple example above
# mostly because this is a real WordPress website

<!DOCTYPE html>
<html class="no-js no-svg" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <script>
   (function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documentElement);
  </script>
  <title>
   Data-X at Berkeley
  </title>
  <link href="//fonts.googleapis.com" rel="dns-prefetch">
   <link href="//s.w.org" rel="dns-prefetch"/>
   <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
   <link href="https://data-x.blog/feed/" rel="alternate" title="Data-X at Berkeley » Feed" type="application/rss+xml"/>
   <link href="https://data-x.blog/comments/feed/" rel="alternate" title="Data-X at Berkeley » Comments Feed" type="application/rss+xml"/>
   <script type="text/javascript">
    window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/12.0.0-1\/72x72\/","ext":".png","svgUrl":"https:\/\/s

#### Print the Title of the website

In [41]:
print(soup.find('title').text) 
# check that we are at the correct website

Data-X at Berkeley


#### Extract all paragraphs of text

In [42]:
for p in soup.find_all('p'):
    print(p.text)

Ikhlaq Sidhu, UC Berkeley (contact)
Arash Nourian, UC Berkeley (contact)
Data-X is a framework designed at UC Berkeley for learning and applying AI, data science, and emerging technologies. Data-X fills a gap between theory and practice to empower data and AI projects.  Data-X projects create new ventures, new research, and corporate innovations all over the world.
Who is it for:
For students interested in careers, new ventures, and innovative projects in areas related to data science and information technology systems.
What is the problem:
Taking a purely theoretical course is not enough.  Students often take course after course in technical subject areas without being able to implement, apply, and/or make an innovative impact. 

The Solution:
Data-X places a real life innovative emerging technology project at the center of a learning experience that includes powerful tools, theory, and innovation behaviors and mindset.  Data-X also builds on Innovation Engineering, a powerful framewo

### Look at the navigation bar

In [43]:
navigation_bar = soup.find('nav')
print(navigation_bar)

<nav aria-label="Top Menu" class="main-navigation" id="site-navigation" role="navigation">
<button aria-controls="top-menu" aria-expanded="false" class="menu-toggle">
<svg aria-hidden="true" class="icon icon-bars" role="img"> <use href="#icon-bars" xlink:href="#icon-bars"></use> </svg><svg aria-hidden="true" class="icon icon-close" role="img"> <use href="#icon-close" xlink:href="#icon-close"></use> </svg>Menu	</button>
<div class="menu-primary-container"><ul class="menu" id="top-menu"><li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-item current_page_item menu-item-8" id="menu-item-8"><a aria-current="page" href="/">Home</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-726" id="menu-item-726"><a href="https://data-x.blog/about-data-x/">About Data-X</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-183" id="menu-item-183"><a href="https://data-x.blog/resources/">Resources</a></li>
<

In [44]:
# These are the linked subpages in the navigation bar
nav_bar = navigation_bar.text
print(nav_bar)



    Menu	
Home
About Data-X
Resources
Data-X Online
Berkeley Syllabus
Projects  

Project Guideline
Innovation Engineering Book
Past Projects
Project Ideas
Advisors


Posts
Labs
Contact

  Scroll down to content



### Scrape the Syllabus of its content
(maybe to use in an App)

In [45]:
# Now we want to find the Syllabus, 
# however we are at the root web page, not displaying the Syllabus

# Get all links from navigation bar at the data-x home webpage
for url in navigation_bar.find_all('a'): 
    link = url.get('href')
    if 'data-x.blog' in link: # check link to a subpage
        print(link) 
        if 'syllabus' in link:
            syllabus_url = link

https://data-x.blog/about-data-x/
https://data-x.blog/resources/
https://data-x.blog/dx-online/
https://data-x.blog/syllabus/
http://data-x.blog/projects
https://data-x.blog/project-guideline/
https://data-x.blog/innovation-engineering-book/
http://data-x.blog/projects
http://data-x.blog/project-ideas
https://data-x.blog/advisors/
http://data-x.blog/posts
https://data-x.blog/project/
https://data-x.blog/contact/
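
The substring test `'data-x.blog' in link` works for this navigation bar, but it would fail on a tag without an `href` (where `.get` returns `None`) and would also accept an external URL that merely mentions the domain. A sketch of a stricter check follows; the helper name `is_internal` and the sample list are made up for illustration.

```python
from urllib.parse import urlparse

def is_internal(link, host='data-x.blog'):
    """True if link is an absolute URL whose hostname equals host."""
    if not link:
        return False  # missing href
    return urlparse(link).hostname == host

# Sample hrefs: relative, internal (https and http), missing, external
hrefs = [
    '/',
    'https://data-x.blog/syllabus/',
    'http://data-x.blog/projects',
    None,
    'https://example.com/?ref=data-x.blog',
]
internal = [h for h in hrefs if is_internal(h)]
print(internal)
```

This exact comparison does not match subdomains such as `www.data-x.blog`; whether those should count as internal is a design choice.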


In [46]:
# syllabus is located at https://data-x.blog/syllabus/
print(syllabus_url)

https://data-x.blog/syllabus/


In [47]:
# Open new connection to the Syllabus url. Replace soup object.

source = requests.get(syllabus_url).content
soup = bs.BeautifulSoup(source, 'html.parser')

print(soup.body.prettify()) 
# we can see that the Syllabus is built up of <td>, <tr> and <table> tags

<body class="page-template-default page page-id-94 wp-embed-responsive has-header-image page-one-column colors-light cannot-edit">
 <div class="site" id="page">
  <a class="skip-link screen-reader-text" href="#content">
   Skip to content
  </a>
  <header class="site-header" id="masthead" role="banner">
   <div class="custom-header">
    <div class="custom-header-media">
     <div class="wp-custom-header" id="wp-custom-header">
      <img alt="Data-X at Berkeley" height="973" sizes="100vw" src="https://data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png" srcset="https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?w=2000&amp;ssl=1 2000w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=300%2C146&amp;ssl=1 300w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=768%2C374&amp;ssl=1 768w, https://i2.

### Find the course schedule table from the syllabus:  
Organized data on a website is usually stored in HTML tables, built from `<table>`, `<tr>`, and `<td>` tags. Here we want to extract the information in the Data-X syllabus.

**NOTE:** To identify the element, class, or id name of the object you are interested in, open the page in your browser and click __'Developer tools'__ under the 'More tools' option. This opens a view of the page's Document Object Model (DOM). Hover over the element of interest on the webpage to check its location. This helps you decide which parts of the soup content to parse. More info at: https://developer.chrome.com/devtools

In [48]:
# We can see that the course schedule is in <table></table> elements
# We can get all tables with find_all
full_table = soup.find_all('table')

In [49]:
full_table

[<table width="472">
 <tbody>
 <tr>
 <td width="132"><strong>Topic 1:</strong></td>
 <td width="341">Introduction<br/>
 <strong>Theory</strong>: Overview of Frameworks for obtaining insights from data (Slides).<br/>
 <strong>Tools</strong>: Python Review</td>
 </tr>
 <tr>
 <td width="132">Code</td>
 <td width="341">1. Introduction to GitHub<br/>
 2. Setting up Anaconda Environment<br/>
 3. Coding with Python Review</td>
 </tr>
 <tr>
 <td width="132">Homework</td>
 <td width="341"><a href="https://data-x.blog/wp-content/uploads/2020/01/HW1.pdf">HW1 </a></td></tr></tbody></table>,
 <table width="472">
 <tbody>
 <tr>
 <td width="132"><strong>Topic 2:</strong></td>
 <td width="341"><strong>Tools:</strong> Linear Regression, Data as a Signal with Correlation</td>
 </tr>
 <tr>
 <td width="132">Code</td>
 <td width="341">—</td>
 </tr>
 <tr>
 <td width="132">Reading</td>
 <td width="341"></td>
 </tr>
 <tr>
 <td width="132">Project</td>
 <td width="341">Module 2: Team Formation 1</td>
 </tr>
 <

In [50]:
# A new row in an HTML table starts with <tr> tag
# A new column entry is defined by <td> tag
table_result = list()
for table in full_table:
    for row in table.find_all('tr'):
        row_cells = row.find_all('td') # find all table data
        row_entries = [cell.text for cell in row_cells]
        print(row_entries) 
        table_result.append(row_entries)
        # get all the table data into a list

['Topic 1:', 'Introduction\nTheory: Overview of Frameworks for obtaining insights from data (Slides).\nTools: Python Review']
['Code', '1. Introduction to GitHub\n2. Setting up Anaconda Environment\n3. Coding with Python Review']
['Homework', 'HW1 ']
['Topic 2:', 'Tools:\xa0Linear Regression, Data as a Signal with Correlation']
['Code', '—']
['Reading', '']
['Project', 'Module 2: Team Formation 1']
['Topic 3:', 'Theory: Regression -ML']
['Code', '\xa0Coding with Numpy']
['Reading', 'DataCamp, tutorialpoint,']
['Project Module 3', 'Module 3: Team Formation 2']
['Topic 4:', 'Theory:\xa0Classification and Logistic Regression']
['Code', 'Coding with Pandas']
['Reading', '']
['\xa0Project', 'Develop insightful story and brainstorm solutions']
['Topic 5:', 'Theory:\xa0Correlation']
['Code', '—']
['Reading', 'Correlation Reading']
['\xa0Project', 'Team break out discussions']
['Topic 6:', 'Theory: Prediction & Intro to Skikit-Learn']
['Code', 'Coding with Skikit-Learn']
['Reading', 'Predictio
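
The cell texts above still contain non-breaking spaces (`\xa0`) and newline characters. A minimal cleanup sketch is shown below; the helper name `clean_cell` is an assumption, and it could be applied as `[clean_cell(cell.text) for cell in row_cells]`.

```python
def clean_cell(text):
    """Replace non-breaking spaces and collapse all whitespace runs."""
    return ' '.join(text.replace('\xa0', ' ').split())

# Cell texts copied from the output above
raw_cells = [
    'Tools:\xa0Linear Regression, Data as a Signal with Correlation',
    '\xa0Project',
    '1. Introduction to GitHub\n2. Setting up Anaconda Environment',
]
cleaned = [clean_cell(c) for c in raw_cells]
print(cleaned)
```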

In [51]:
# We can also read it into a Pandas DataFrame
import pandas as pd
pd.set_option('display.max_colwidth', 10000)

df = pd.DataFrame(table_result)
df

Unnamed: 0,0,1
0,Topic 1:,Introduction\nTheory: Overview of Frameworks for obtaining insights from data (Slides).\nTools: Python Review
1,Code,1. Introduction to GitHub\n2. Setting up Anaconda Environment\n3. Coding with Python Review
2,Homework,HW1
3,Topic 2:,"Tools: Linear Regression, Data as a Signal with Correlation"
4,Code,—
...,...,...
70,Project,Module 11-12
71,Topic 18:,Project Presentations – Demo Day(s)
72,Code,Presentation including running code and code samples
73,Due,Includes preparation time in last week


In [52]:
# Pandas can also grab tables from a website automatically

import pandas as pd

import html5lib
# pd.read_html needs an HTML parser library such as html5lib: 
#!conda install --yes html5lib
dfs = pd.read_html('https://data-x.blog/syllabus/') 
# returns a list of all tables at url



In [53]:
dfs

[          0  \
 0  Topic 1:   
 1      Code   
 2  Homework   
 3   Project   
 
                                                                                                              1  
 0  Introduction Theory: Overview of Frameworks for obtaining insights from data (Slides). Tools: Python Review  
 1                    1. Introduction to GitHub 2. Setting up Anaconda Environment 3. Coding with Python Review  
 2                                                                                                      HW1 HW2  
 3                                                                               Module 1: Project Introduction  ,
           0                                                            1
 0  Topic 2:  Tools: Linear Regression, Data as a Signal with Correlation
 1      Code                                                            —
 2   Reading                                                          NaN
 3   Project                                   Modu

In [54]:
print(type(dfs)) # list of tables
print(len(dfs)) # one DataFrame per <table> found on the page
print(type(dfs[0])) # each table is stored as a DataFrame
df = pd.concat(dfs,ignore_index=True)
df = df.dropna()

<class 'list'>
19
<class 'pandas.core.frame.DataFrame'>


In [55]:
# Looks so-so, but it is stripped of line break characters etc.
df.head()

Unnamed: 0,0,1
0,Topic 1:,Introduction Theory: Overview of Frameworks for obtaining insights from data (Slides). Tools: Python Review
1,Code,1. Introduction to GitHub 2. Setting up Anaconda Environment 3. Coding with Python Review
2,Homework,HW1 HW2
3,Project,Module 1: Project Introduction
4,Topic 2:,"Tools: Linear Regression, Data as a Signal with Correlation"


In [56]:
# Make it nicer

# Assign column names
df.columns = ['Part', 'Detailed Description']

# Assign lecture number: every row whose Part contains 'Topic' starts a new lecture
weeks = list()
i = 0
for k in range(df.shape[0]):
    if 'Topic' in df.iloc[k, 0]:
        i = i + 1
    weeks.append('Lecture{}'.format(i))
df['Week'] = weeks
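
The same labels can be computed without an explicit loop, as a sketch: mark the rows that start a lecture and take a running count of them. The small DataFrame below is a stand-in with shortened descriptions, not the full scraped table.

```python
import pandas as pd

# Stand-in for the first rows of the syllabus table
demo = pd.DataFrame({
    'Part': ['Topic 1:', 'Code', 'Homework', 'Topic 2:', 'Code'],
    'Detailed Description': ['Introduction', 'Python Review', 'HW1',
                             'Linear Regression', 'Coding with Numpy'],
})

# True on every row that starts a new lecture
starts_lecture = demo['Part'].str.contains('Topic')

# The running sum of the True values is the lecture number of each row
lecture_number = starts_lecture.cumsum()
demo['Week'] = 'Lecture' + lecture_number.astype(str)
print(demo[['Week', 'Part']])
```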

In [57]:
df.head()

Unnamed: 0,Part,Detailed Description,Week
0,Topic 1:,Introduction Theory: Overview of Frameworks for obtaining insights from data (Slides). Tools: Python Review,Lecture1
1,Code,1. Introduction to GitHub 2. Setting up Anaconda Environment 3. Coding with Python Review,Lecture1
2,Homework,HW1 HW2,Lecture1
3,Project,Module 1: Project Introduction,Lecture1
4,Topic 2:,"Tools: Linear Regression, Data as a Signal with Correlation",Lecture2


In [58]:
# Set Week and Part as Multiindex
df = df.set_index(['Week','Part'])

In [59]:
df.head(12).dropna()

Unnamed: 0_level_0,Unnamed: 1_level_0,Detailed Description
Week,Part,Unnamed: 2_level_1
Lecture1,Topic 1:,Introduction Theory: Overview of Frameworks for obtaining insights from data (Slides). Tools: Python Review
Lecture1,Code,1. Introduction to GitHub 2. Setting up Anaconda Environment 3. Coding with Python Review
Lecture1,Homework,HW1 HW2
Lecture1,Project,Module 1: Project Introduction
Lecture2,Topic 2:,"Tools: Linear Regression, Data as a Signal with Correlation"
Lecture2,Code,—
Lecture2,Project,Module 2: Team Formation 1
Lecture3,Topic 3:,Theory: Regression -ML
Lecture3,Code,Coding with Numpy
Lecture3,Reading,"DataCamp, tutorialpoint,"


<a id='sec3'></a>

# Keep a current list of the IMDB top 250 vs MetaScore

Let's say that we want to build an app that can display the most popular movies on the IMDB website.

We go to the URL that lists the top 250 movies according to the reviews: http://www.imdb.com/chart/top

We see that the entries are stored in a table format, so we try pandas.

In [60]:
df_imdb = pd.read_html('http://www.imdb.com/chart/top',attrs={'class':'chart full-width'})[0]

In [61]:
df_imdb.head()

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4
0,,1. The Shawshank Redemption (1994),9.2,12345678910 NOT YET RELEASED Seen,
1,,2. The Godfather (1972),9.1,12345678910 NOT YET RELEASED Seen,
2,,3. The Godfather: Part II (1974),9.0,12345678910 NOT YET RELEASED Seen,
3,,4. The Dark Knight (2008),9.0,12345678910 NOT YET RELEASED Seen,
4,,5. 12 Angry Men (1957),8.9,12345678910 NOT YET RELEASED Seen,


In [62]:
df_imdb.drop(df_imdb.columns[[0,3,4]],axis=1,inplace=True)

In [63]:
df_imdb.tail()

Unnamed: 0,Rank & Title,IMDb Rating
245,246. Aladdin (1992),8.0
246,247. Guardians of the Galaxy (2014),8.0
247,248. Neon Genesis Evangelion: The End of Evangelion (1997),8.0
248,249. Groundhog Day (1993),8.0
249,250. Le Samouraï (1967),8.0
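
The `Rank & Title` column packs three values into one string: rank, title and release year. Below is a minimal sketch of how such a string could be split with a regular expression. The helper name `split_rank_title` is hypothetical, and the pattern assumes every entry looks like the rows shown above.

```python
import re

# Rank, a dot, the title, then a four-digit year in parentheses at the end
RANK_TITLE = re.compile(r'^(\d+)\.\s+(.*)\s+\((\d{4})\)$')

def split_rank_title(text):
    """Return (rank, title, year), or None if the text has another shape."""
    match = RANK_TITLE.match(text.strip())
    if match is None:
        return None
    rank, title, year = match.groups()
    return int(rank), title, int(year)

print(split_rank_title('1. The Shawshank Redemption (1994)'))
print(split_rank_title('5. 12 Angry Men (1957)'))
```

Returning `None` for unexpected strings keeps a single odd row from stopping the whole run.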


In [64]:
# Extract all URLs to find meta score
imdb_html = requests.get('http://www.imdb.com/chart/top').content
soup = bs.BeautifulSoup(imdb_html, features='html.parser')

In [65]:
links = soup.find('table').find_all('a')
urls = ['http://www.imdb.com'+l.get('href') for l in links]
urls[0]

'http://www.imdb.com/title/tt0111161/'

In [66]:
urls[-1]

'http://www.imdb.com/title/tt0062229/'
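
If a table row links to its movie more than once (an assumption about the page markup, for example a poster link plus a title link), `urls` would contain duplicates and no longer line up with the 250 rows of `df_imdb`. The sketch below shows one way to build absolute URLs and drop repeats while keeping the original order. `absolute_unique_links` is a hypothetical helper, and the `hrefs` list is sample input.

```python
from urllib.parse import urljoin

def absolute_unique_links(base_url, hrefs):
    """Join hrefs to base_url and drop repeats, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs if href)
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(absolute))

# Sample hrefs: one movie linked twice and one missing href
hrefs = ['/title/tt0111161/', '/title/tt0111161/', '/title/tt0062229/', None]
print(absolute_unique_links('http://www.imdb.com', hrefs))
```

With one URL per movie, the position in the list can safely be used as the row index.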

In [67]:
import numpy as np
meta_scores = np.zeros(250, dtype=int)

In [68]:

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'From': 'data-x@gmail.com' 
}

for idx,url in enumerate(urls):
    print('Getting metscore for movie {}'.format(idx))
    film = requests.get(url, headers=headers, timeout=10)
    print(film)
    soup = bs.BeautifulSoup(film.content, features='html.parser')
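    info = soup.find(class_='metacriticScore score_favorable titleReviewBarSubItem')
    # find() returns None when a page has no matching Metascore block;
    # leave the score at 0 in that case instead of raising an error
    if info is not None:
        meta_scores[idx] = int(info.find('span').text)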
    if idx == 5:
        break

Getting metscore for movie 0
<Response [200]>
Getting metscore for movie 1
<Response [200]>
Getting metscore for movie 2
<Response [200]>
Getting metscore for movie 3
<Response [200]>
Getting metscore for movie 4
<Response [200]>
Getting metscore for movie 5
<Response [200]>


In [69]:
df_imdb['meta_scores'] = meta_scores

In [70]:
df_imdb.head()

Unnamed: 0,Rank & Title,IMDb Rating,meta_scores
0,1. The Shawshank Redemption (1994),9.2,80
1,2. The Godfather (1972),9.1,80
2,3. The Godfather: Part II (1974),9.0,100
3,4. The Dark Knight (2008),9.0,100
4,5. 12 Angry Men (1957),8.9,90


<a id='sec4'></a>
# Scrape images and other files

Let's see how we can automatically find and download files linked at any website.

In [71]:
# As we can see there are two images on the data-x.blog/resources
# say that we want to download them
# Images are displayed with the <img> tag in HTML

# open connection and create new soup

raw = requests.get('https://data-x.blog/resources/').content
soup = bs.BeautifulSoup(raw,features='html.parser')

print(soup.find('img')) 
# as we can see below the image urls 
# are stored in the src attribute inside the img tag

<img alt="Data-X at Berkeley" height="973" sizes="100vw" src="https://data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png" srcset="https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?w=2000&amp;ssl=1 2000w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=300%2C146&amp;ssl=1 300w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=768%2C374&amp;ssl=1 768w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?resize=1024%2C498&amp;ssl=1 1024w, https://i2.wp.com/data-x.blog/wp-content/uploads/2018/04/cropped-AdobeStock_120712749-Converted.png?w=1480&amp;ssl=1 1480w" width="2000"/>


In [72]:
# Collect the URLs of all JPEG images on the page
img_urls = list()
for img in soup.find_all('img'): 
    img_url = img.get('src') or ''  # some <img> tags have no src attribute
    if '.jpeg' in img_url or '.jpg' in img_url:
        print(img_url)
        img_urls.append(img_url)
    

https://i2.wp.com/data-x.blog/wp-content/uploads/2017/05/unnamed-2.jpg?resize=740%2C416&ssl=1


In [73]:
%ls

Henny-WebCrawler.ipynb
Untitled.ipynb
webscraping-requests-beautifulsoup.ipynb
webscraping_tripadvisor.ipynb


In [74]:
# To download and save files with Python we can use 
# the shutil library which is a file operations library
'''
The shutil module offers a number of high-level operations on files and 
collections of files. In particular, functions are provided which support 
file copying and removal.
'''

import shutil

for idx, img_url in enumerate(img_urls): 
    # enumerate to create an integer file name for every image
    
    # make a request to the image URL
    img_source = requests.get(img_url, stream=True) 
    # stream=True defers downloading the body, so that it can be
    # copied to disk as a raw stream
    
    with open('img'+str(idx)+'.jpg', 'wb') as file: 
        # open file connection, create file and write to it
        shutil.copyfileobj(img_source.raw, file) 
        # save the raw file object

    del img_source # to remove the file from memory

In [75]:
%ls

Henny-WebCrawler.ipynb
Untitled.ipynb
img0.jpg
webscraping-requests-beautifulsoup.ipynb
webscraping_tripadvisor.ipynb


## Scraping function to download files of any type from a website

Below is a function that takes a website URL and a file type, and downloads up to `max` matching files from that website.
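
One detail that any such function has to handle is turning a file URL into a local file name, even when the URL carries a query string such as `?resize=...`. The sketch below shows one way to do that with the standard library. `file_name_from_url` and `has_extension` are hypothetical helpers and not part of the scraper function in this notebook.

```python
import posixpath
from urllib.parse import urlparse, unquote

def file_name_from_url(file_url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlparse(file_url).path
    return posixpath.basename(unquote(path))

def has_extension(file_url, file_type):
    """True if the file name in the URL ends with file_type (e.g. '.jpg')."""
    return file_name_from_url(file_url).lower().endswith(file_type.lower())

url = ('https://i2.wp.com/data-x.blog/wp-content/uploads/2017/05/'
       'unnamed-2.jpg?resize=740%2C416&ssl=1')
print(file_name_from_url(url))
print(has_extension(url, '.jpg'))
```

Checking the end of the file name is stricter than a substring test: a URL such as `https://example.com/photo.jpg.html` contains `.jpg` but is not an image.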

In [76]:
# Extended scraping function of any file format
import os # To interact with operating system and format file name
import shutil # To copy file object from python to disk
import requests
import bs4 as bs

def py_file_scraper(url, html_tag='img', source_tag='src', file_type='.jpg',max=-1):
    
    '''
    Function that scrapes a website for certain file formats.
    The files will be placed in a folder called "files" 
    in the working directory.
    
    url = the url we want to scrape from
    html_tag = the file tag (usually img for images or 
    a for file links)
    
    source_tag = the source tag for the file url 
    (usually src for images or href for files)
    
    file_type = .png, .jpg, .pdf, .csv, .xls etc.
    
    max = integer (max number of files to scrape, 
    if = -1 it will scrape all files)
    '''
    
    # make a directory called 'files' 
    # for the files if it does not exist
    if not os.path.exists('files/'):
        os.makedirs('files/')
    print('Loading content from the url...')
    source = requests.get(url).content
    print('Creating content soup...')
    soup = bs.BeautifulSoup(source,'html.parser')
    
    i=0
    print('Finding tag:%s...'%html_tag)
    for n, link in enumerate(soup.find_all(html_tag)):
        file_url=link.get(source_tag)
        print ('\n',n+1,'. File url',file_url)
        
        
        if file_url and 'http' in file_url: # skip tags without a link and relative links
            print('It is a valid url..')
            
            
            if file_type in file_url: #only check for specific 
                # file type
                
                print('%s FILE TYPE FOUND IN THE URL...'%file_type)
                file_name = os.path.splitext(os.path.basename(file_url))[0] + file_type 
                #extract file name from url

                file_source = requests.get(file_url, stream = True)
             
                # open new stream connection

                with open('./files/'+file_name, 'wb') as file: 
                    # open file connection, create file and 
                    # write to it
                    
                    shutil.copyfileobj(file_source.raw, file) 
                    # save the raw file object
                    
                    print('DOWNLOADED:',file_name)
                    
                    i+=1
                    
                del file_source # delete from memory
            else:
                print('%s file type NOT found in url:'%file_type)
                print('EXCLUDED:',file_url) 
                # urls not downloaded from
                
        if i == max:
            print('Max reached')
            break
            

    print('Done!')

# Scrape funny cat pictures

In [77]:
py_file_scraper('https://funcatpictures.com/') 
# scrape cats

Loading content from the url...
Creating content soup...
Finding tag:img...

 1 . File url https://funcatpictures.com/wp-content/uploads/2018/03/fcp2018.png
It is a valid url..
.jpg file type NOT found in url:
EXCLUDED: https://funcatpictures.com/wp-content/uploads/2018/03/fcp2018.png

 2 . File url https://funcatpictures.com/wp-content/uploads/2020/02/funny-cats-at-home-vs-on-facebook.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: funny-cats-at-home-vs-on-facebook.jpg

 3 . File url https://funcatpictures.com/wp-content/uploads/2020/01/fun-cat-pictures-antidepressive-medicine-700x699.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: fun-cat-pictures-antidepressive-medicine-700x699.jpg

 4 . File url https://funcatpictures.com/wp-content/uploads/2019/09/snarkande-700x368.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: snarkande-700x368.jpg

 5 . File url https://funcatpictures.com/wp-content/uploads/2019/08/funny-cat-p

In [78]:
!ls ./files

fun-cat-monday-face-700x587.jpg
fun-cat-pictures-antidepressive-medicine-150x150.jpg
fun-cat-pictures-antidepressive-medicine-700x699.jpg
fun-cat-pictures-total-eclipse-150x150.jpg
fun-cat-pictures-total-eclipse-700x691.jpg
funny-cat-pictures-tired-cat-150x150.jpg
funny-cat-pictures-tired-cat-700x700.jpg
funny-cats-at-home-vs-on-facebook-150x150.jpg
funny-cats-at-home-vs-on-facebook-320x384.jpg
funny-cats-at-home-vs-on-facebook.jpg
snarkande-150x150.jpg
snarkande-700x368.jpg


# Scrape pdf's from Data-X site

In [79]:
py_file_scraper('https://data-x.blog/resources',
                html_tag='a',source_tag='href',file_type='.pdf', \
                max=5)

Loading content from the url...
Creating content soup...
Finding tag:a...

 1 . File url #content

 2 . File url https://data-x.blog/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/

 3 . File url /

 4 . File url https://data-x.blog/about-data-x/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/about-data-x/

 5 . File url https://data-x.blog/resources/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/resources/

 6 . File url https://data-x.blog/dx-online/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/dx-online/

 7 . File url https://data-x.blog/syllabus/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/syllabus/

 8 . File url http://data-x.blog/projects
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: http://data-x.blog/projects

 9 . File url https://data-x.blog/project-guideline/
It is a valid

# Scrape real data CSV files from websites

In [80]:
py_file_scraper('http://www-eio.upc.edu/~pau/cms/rdata/datasets.html',
                html_tag='a', # R data sets
                source_tag='href', file_type='.csv',max=5)

Loading content from the url...
Creating content soup...
Finding tag:a...

 1 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/AirPassengers.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: AirPassengers.csv

 2 . File url http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/AirPassengers.html
It is a valid url..
.csv file type NOT found in url:
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/AirPassengers.html

 3 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/BJsales.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: BJsales.csv

 4 . File url http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BJsales.html
It is a valid url..
.csv file type NOT found in url:
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BJsales.html

 5 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/BOD.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: BOD.csv

 6 . File url http://www-e

# Extended tip: IP rotation

The website might get suspicious if a lot of requests are coming from the same IP address. Using a shared proxy, a VPN or Tor can help you get around that problem.

For example:

```python
proxies = {'http' : 'http://10.10.0.0:0000',  
          'https': 'http://120.10.0.0:0000'}
response = requests.get('https://whateverwebsite.com', proxies=proxies, timeout=5)

```

Also note the `timeout` argument: it tells `requests` to stop waiting if the server has not responded within the given number of seconds, so that a slow or unresponsive server (or proxy) cannot make the script hang indefinitely.
 

By using a shared proxy, the website will see the IP address of the proxy server and not yours. A VPN connects you to another network and the IP address of the VPN provider will be sent to the website.
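
With several proxies available, the requests can be spread across them. The sketch below only builds the `proxies` dictionaries in round-robin order; the actual `requests.get` call is left as a comment so that the example runs without network access. The addresses in `PROXY_POOL` are placeholders and `proxy_rotation` is a hypothetical helper.

```python
from itertools import cycle

def proxy_rotation(proxy_urls):
    """Yield requests-style proxies dicts, cycling through proxy_urls forever."""
    for proxy_url in cycle(proxy_urls):
        yield {'http': proxy_url, 'https': proxy_url}

# Placeholder addresses (hypothetical); use proxies you are allowed to use
PROXY_POOL = ['http://10.10.0.1:8080', 'http://10.10.0.2:8080']
rotation = proxy_rotation(PROXY_POOL)

for page in range(3):
    proxies = next(rotation)
    print(page, proxies['https'])
    # response = requests.get(url, proxies=proxies, timeout=5)
```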

---
<a id='secBK'></a>
# Breakout problem


In this Breakout Problem you should extract live weather data in Berkeley from:

[http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971](http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971)

* Task: scrape
    * the period / day (e.g. Tonight, Friday, FridayNight etc.)
    * the temperature for the period (e.g. Low, High)
    * the long weather description (e.g. Partly cloudy, with a low around 49..)

Store the scraped data strings in a Pandas DataFrame.



**Hint:** The weather information is found in a div tag with `id='seven-day-forecast'`




# Appendix

<a id='sec6'></a>
# Scrape Bloomberg sitemap (XML) for current political news

In [81]:
# Sitemaps are XML documents that list the URLs of a site between tags.
# XML is both human and machine readable.
# To find the newest links of a site, look for its sitemap.
# News websites often have separate sitemaps per section (e.g. politics);
# bots that constantly track news follow these sitemaps.

# Before scraping a website look at robots.txt file
bs.BeautifulSoup(requests.get('https://www.bloomberg.com/robots.txt').content,'lxml')

<html><body><p># Bot rules:
# 1. A bot may not injure a human being or, through inaction, allow a human being to come to harm.
# 2. A bot must obey orders given it by human beings except where such orders would conflict with the First Law.
# 3. A bot must protect its own existence as long as such protection does not conflict with the First or Second Law.
# If you can read this then you should apply here https://www.bloomberg.com/careers/
User-agent: *
Disallow: /polska
Disallow: /account/*

User-agent: Mediapartners-Google
Disallow: /about/careers
Disallow: /about/careers/
Disallow: /offlinemessage/
Disallow: /apps/fbk
Disallow: /bb/newsarchive/
Disallow: /apps/news

User-agent: Spinn3r
Disallow: /podcasts/
Disallow: /feed/podcast/
Disallow: /bb/avfile/

User-agent: Googlebot-News
Disallow: /sponsor/
Disallow: /news/sponsors/*

Sitemap: https://www.bloomberg.com/sitemap.xml
Sitemap: https://www.bloomberg.com/feeds/bbiz/sitemap_index.xml
Sitemap: https://www.bloomberg.com/feeds/bpol/sit
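
Rules like the ones printed above can also be checked programmatically. As a minimal sketch, the standard-library `urllib.robotparser` module is fed a few rules copied from that output as a string, so no request is made. `allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt output above
ROBOTS_TXT = """\
User-agent: *
Disallow: /polska
Disallow: /account/*

User-agent: Spinn3r
Disallow: /podcasts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent='*'):
    """True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)

print(allowed('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml'))
print(allowed('https://www.bloomberg.com/polska'))
```

The standard-library parser matches path prefixes, so wildcard rules such as `/account/*` may not be interpreted the way a search engine crawler would interpret them.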

In [82]:
source = requests.get('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml').content
soup = bs.BeautifulSoup(source,'xml') # Note parser 'xml'

In [83]:
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns:="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>
   https://www.bloomberg.com/news/articles/2020-02-21/trump-would-out-debate-bloomberg-sanders-says-campaign-update
  </loc>
  <news:news>
   <news:publication>
    <news:name>
     Bloomberg
    </news:name>
    <news:language>
     en
    </news:language>
   </news:publication>
   <news:publication_date>
    2020-02-24T20:07:55.426Z
   </news:publication_date>
   <news:title>
    Klobuchar Says a Woman President Would Inspire: Campaign Update
   </news:title>
   <news:keywords>
    Education, Refugee Crisis, Billionaires, Social Media, Women, Equality, ESG Concerns, Inclusion, Megacity, ESG, Gender Equality, Vladimir Putin, Hyun Jong Kim, Donald John Trump, Marco Antonio Rubio, Hillary Rodham Clinton, Kamala D Harris, Kirsten Gillibrand, Michael R Bl

In [84]:
# Find political news headlines
for news in soup.find_all({'news'}):
    print(news.title.text)
    print(news.publication_date.text)
    #print(news.keywords.text)
    print('\n')

Klobuchar Says a Woman President Would Inspire: Campaign Update
2020-02-24T20:07:55.426Z


Cuomo Takes Steps to Prevent Immigration Use of N.Y.’s DMV Data
2020-02-24T19:56:24.041Z


Buttigieg Calls for SALT Cap Removal Ahead of California Primary
2020-02-24T19:53:12.801Z


Harvey Weinstein Is Convicted of Rape in Case That Sparked #MeToo
2020-02-24T19:41:01.844Z


Sanders, Bloomberg Escalate Tensions Ahead of Democratic Debate
2020-02-24T19:30:10.797Z


Supreme Court Seems Ready to Back Atlantic Coast Pipeline Permit
2020-02-24T19:16:53.558Z


Johnson Faces Complaints of Bad Faith Ahead of Trade Talks
2020-02-24T19:01:25.994Z


Apple Rebuffed by Supreme Court in $1 Billion VirnetX Dispute
2020-02-24T18:47:25.920Z


U.S. Considers Expelling Chinese Journalists After Americans Barred
2020-02-24T18:41:44.249Z


Trump’s Gifts From Other Nations Include a Vuitton Golf Bag and Portrait of Trump
2020-02-24T17:35:25.741Z


Virus Outbreak Drives Italians to Panic-Buying of Masks and Food
2020-0
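
The same headlines can be read without BeautifulSoup. The sketch below swaps in the standard-library `xml.etree.ElementTree` parser, which is a different technique from the cell above, and also converts the timestamps to `datetime` objects so the entries can be sorted. The `SAMPLE` document is a shortened stand-in with placeholder `<loc>` values, and `parse_news` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace URIs as declared in the sitemap printed earlier
NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

# Shortened sample document; the <loc> values are placeholders
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>https://example.com/article-1</loc>
  <news:news>
   <news:publication_date>2020-02-24T19:53:12.801Z</news:publication_date>
   <news:title>Buttigieg Calls for SALT Cap Removal Ahead of California Primary</news:title>
  </news:news>
 </url>
 <url>
  <loc>https://example.com/article-2</loc>
  <news:news>
   <news:publication_date>2020-02-24T20:07:55.426Z</news:publication_date>
   <news:title>Klobuchar Says a Woman President Would Inspire: Campaign Update</news:title>
  </news:news>
 </url>
</urlset>
"""

def parse_news(xml_text):
    """Return (published, title) tuples, newest first."""
    root = ET.fromstring(xml_text.strip())
    items = []
    for news in root.findall('sm:url/news:news', NS):
        title = news.findtext('news:title', default='', namespaces=NS).strip()
        stamp = news.findtext('news:publication_date', default='', namespaces=NS).strip()
        # Timestamps end in 'Z', i.e. they are given in UTC
        published = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%S.%fZ')
        items.append((published.replace(tzinfo=timezone.utc), title))
    return sorted(items, reverse=True)

for published, title in parse_news(SAMPLE):
    print(published.isoformat(), title)
```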

<a id='sec7'></a>
# Web crawl

Web crawling is closely related to web scraping: instead of extracting data from a single page, you follow the links of a specific website (and often its subsites) and extract meta information along the way. It can be seen as simple, recursive scraping. This can be used for web indexing (in order to build a web search engine).
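
The core of a crawler is a traversal of the link graph with a depth limit and a record of the pages already visited. Below is a minimal breadth-first sketch that runs on an in-memory toy site instead of the network. `crawl`, `fetch_links` and `TOY_SITE` are hypothetical names; in a real crawler `fetch_links` would download a page and return the links found in it.

```python
from collections import Counter, deque

def crawl(start_url, max_depth, fetch_links):
    """Breadth-first crawl that counts how often every link is seen.

    fetch_links(url) must return the list of links found on url.
    """
    seen = {start_url}
    link_counts = Counter()
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages beyond the depth limit
        for link in fetch_links(url):
            link_counts[link] += 1
            if link not in seen:  # visit every page only once
                seen.add(link)
                queue.append((link, depth + 1))
    return link_counts

# Toy link graph standing in for real pages
TOY_SITE = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/', 'https://example.com/b'],
    'https://example.com/b': [],
}

counts = crawl('https://example.com/', 2, lambda url: TOY_SITE.get(url, []))
for link, count in counts.items():
    print(link, count)
```

The `seen` set keeps pages that link to each other from being fetched again and again.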

## Web crawl Twitter account
**Authors:** Kunal Desai & Alexander Fred Ojala

In [85]:
import bs4
from bs4 import BeautifulSoup
import requests

In [86]:
# Helper function to maintain the urls and the number of times they appear

url_dict = dict()

def add_to_dict(url_d, key):
    if key in url_d:
        url_d[key] = url_d[key] + 1
    else:
        url_d[key] = 1

In [87]:
# Recursive function which extracts links from the given url up to a given 'depth'.

def get_urls(url, depth):
    if depth == 0:
        return
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        if link.has_attr('href') and "https://" in link['href']:
#             print(link['href'])
            add_to_dict(url_dict, link['href'])
            get_urls(link['href'], depth - 1)

In [88]:
# Iterative function which extracts links starting from the given url.
# Note: here 'depth' caps the number of collected urls, not the link depth.

def get_urls_iterative(url, depth):
    urls = [url]
    for url in urls:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        for link in soup.find_all('a'):
            if link.has_attr('href') and "https://" in link['href']:
                add_to_dict(url_dict, link['href'])
                urls.append(link['href'])
        if len(urls) > depth:
            break

In [89]:
get_urls("https://twitter.com/GolfWorld", 2)
for key in url_dict:
    print(str(key) + "  ----   " + str(url_dict[key]))

https://twitter.com/signup  ----   34
https://pbs.twimg.com/profile_images/544542867595599872/eHuovPqP_400x400.jpeg  ----   11
https://pbs.twimg.com/profile_images/544542867595599872/eHuovPqP.jpeg  ----   11
https://t.co/BJvDIMaMlV  ----   11
https://video.golfdigest.com/watch/every-hole-at-cypress-point-golf-club-in-pebble-beach-ca  ----   1
https://www.golfdigest.com/go/failsafe  ----   23
https://reader.golfdigest.com/  ----   22
https://w1.buysub.com/servlet/CSGateway?cds_mag_code=GLF  ----   22
https://www.golfdigest.com/go/giftfailsafe  ----   22
https://www.golfdigest.com/go/internationalgiftfailsafe  ----   22
https://www.golfdigest.com/story/visitor-agreement  ----   22
https://www.golfdigest.com/story/privacy-and-cookies-notice  ----   22
https://t.co/4ttZ8SLLYP  ----   2
https://www.golfdigest.com/schools  ----   1
https://www.golfdigest.com/story/golf-digest-live?incid=topnav  ----   1
https://www.golfdigest.com/story/purchase-pros-on-demand?incid=navigation  ----   1
https

<a id='sec8'></a>
# SEO: Visualize sitemap and categories in a website

**Source:** https://www.ayima.com/guides/how-to-visualize-an-xml-sitemap-using-python.html

In [90]:
# Visualize XML sitemap with categories!
import requests
from bs4 import BeautifulSoup

# url = 'https://www.sportchek.ca/sitemap.xml'  # alternative sitemap to try
url = 'https://www.bloomberg.com/feeds/bpol/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

Loaded page with: <Response [200]>
Created <class 'bs4.BeautifulSoup'> object


In [91]:
urls = [element.text for element in sitemap_index.findAll('loc')]
print(urls)

['https://www.bloomberg.com/feeds/bpol/sitemap_recent.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_news.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_video_recent.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_2.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2020_1.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_12.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_11.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_10.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_9.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_8.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_7.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_6.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_5.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_4.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_3.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_2019_2.xml', 'https://www.bloomberg.com/feeds/bpol/sitemap_20

In [92]:
def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in loc tags. '''

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.findAll('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

Found 36,254 URLs in the sitemap


In [93]:
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

In [94]:
'''
Categorize a list of URLs by site path.
The file containing the URLs should exist in the working directory and be
named sitemap_urls.dat. It should contain one URL per line.
Categorization depth can be specified by executing a call like this in the
terminal (where we set the granularity depth level to 5):
    python categorize_urls.py --depth 5
The same result can be achieved by setting the categorization_depth variable
manually at the head of this file and running the script with:
    python categorize_urls.py
'''
from __future__ import print_function


categorization_depth=3



# Main script functions


def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counts the number of sub-pages for each.
    Prints results to a CSV file.
    urls : list
        List of page URLs.
    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.
    '''

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
            try:
                page_layer.append(url.split(base)[-1].split('/')[layer])
            except IndexError:
                # There is nothing that deep!
                page_layer.append('')

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .rename('counts').reset_index()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
                     .reset_index(drop=True)

    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers




with open('sitemap_urls.dat', 'r') as f:
    sitemap_urls = f.read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))

print('Categorizing up to a depth of %d' % categorization_depth)
sitemap_layers = peel_layers(urls=sitemap_urls,
                             layers=categorization_depth)
print('Printed {:,} rows of data to sitemap_layers.csv'.format(len(sitemap_layers)))


Loaded 36,254 URLs
Categorizing up to a depth of 3
Printed 2,842 rows of data to sitemap_layers.csv
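
The categorization step boils down to splitting every URL path into segments and counting how many URLs share the same leading segments. The pandas-free sketch below illustrates that idea. `path_layers` is a hypothetical helper, the `sample_urls` are made up for illustration, and the result is not guaranteed to match `peel_layers` in every edge case.

```python
from collections import Counter
from urllib.parse import urlparse

def path_layers(url, layers=3):
    """Return (host, segment_1, ..., segment_n), padded with '' if too short."""
    parsed = urlparse(url)
    segments = [part for part in parsed.path.split('/') if part][:layers]
    segments += [''] * (layers - len(segments))
    return (parsed.netloc, *segments)

# Made-up sample URLs
sample_urls = [
    'https://www.bloomberg.com/news/articles/2020-02-21/a',
    'https://www.bloomberg.com/news/articles/2020-02-24/b',
    'https://www.bloomberg.com/news/videos/2020-02-24/c',
    'https://www.bloomberg.com/politics',
]

layer_counts = Counter(path_layers(url, layers=2) for url in sample_urls)
for key, count in layer_counts.most_common():
    print(key, count)
```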


In [95]:
'''
Visualize a list of URLs by site path.
This script reads in the sitemap_layers.csv file created by the
categorize_urls.py script and builds a graph visualization using Graphviz.
Graph depth can be specified by executing a call like this in the
terminal:
    python visualize_urls.py --depth 4 --limit 10 --title "My Sitemap" --style "dark" --size "40"
The same result can be achieved by setting the variables manually at the head
of this file and running the script with:
    python visualize_urls.py
'''
from __future__ import print_function


# Set global variables

graph_depth = 3  # Number of layers deep to plot categorization
limit = 3       # Maximum number of nodes for a branch
title = ''       # Graph title
style = 'light'  # Graph style, can be "light" or "dark"
size = '8,5'     # Size of rendered PDF graph


# Import external library dependencies

import pandas as pd
import graphviz



# Main script functions

def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    sitemap_layers : DataFrame
        The dataframe created by the peel_layers function
        containing sitemap information.
    layers : int
        Maximum depth to plot.
    limit : int
        The maximum number of node edge connections. Good to set this
        low for visualizing deep into site maps.
    '''


    # Check to make sure we are not trying to plot too many layers
    # (the columns are the base '0', the layers '1'..'N' and 'counts')
    if layers > len(df.columns) - 2:
        layers = len(df.columns) - 2
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))


    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.body.extend(['rankdir=LR', 'size="%s"' % size])


    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''

        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]

        # Only add a new branch if it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))


    f.attr('node', shape='rectangle') # Plot nodes as rectangles

    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))

    if layers == 0:
        return f

    f.attr('node', shape='oval') # Plot nodes as ovals
    f.graph_attr.update()

    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        nodes = df[cols].drop_duplicates().values
        for j, k in enumerate(nodes):

            # Compute the mask to select correct data
            mask = True
            for j_, ki in enumerate(k):
                mask &= df[str(j_)] == ki

            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)

            # Add to the graph
            add_branch(f,
                       names=data[str(i)].values,
                       vals=data['counts'].values,
                       limit=limit,
                       connect_to='-'.join(['%s']*i) % tuple(k))

            print(('Built graph up to node %d / %d in layer %d' % (j, len(nodes), i))\
                    .ljust(50), end='\r')

    return f


def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
    f : graphviz.dot.Digraph
        The graph object as created by graphviz.
    style : str
        Available styles: 'light', 'dark'
    title : str
        Optional title placed at the bottom of the graph.
    '''

    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        },
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        },
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',
        }
    }

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        },
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        },
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
        }
    }

    if style == 'light':
        apply_style = light_style

    elif style == 'dark':
        apply_style = dark_style

    f.graph_attr = apply_style['graph']
    f.node_attr = apply_style['nodes']
    f.edge_attr = apply_style['edges']

    return f




# Read in categorized data
sitemap_layers = pd.read_csv('sitemap_layers.csv', dtype=str)
# Convert numerical column to integer
sitemap_layers.counts = sitemap_layers.counts.apply(int)
print('Loaded {:,} rows of categorized data from sitemap_layers.csv'\
        .format(len(sitemap_layers)))

print('Building %d layer deep sitemap graph' % graph_depth)
f = make_sitemap_graph(sitemap_layers, layers=graph_depth,
                       limit=limit, size=size)
f = apply_style(f, style=style, title=title)

f.render(cleanup=True)
print('Exported graph to sitemap_graph_%d_layer.pdf' % graph_depth)




Loaded 2,842 rows of categorized data from sitemap_layers.csv
Building 3 layer deep sitemap graph
Exported graph to sitemap_graph_3_layer.pdf       
