# Web Scraping

Scrape stories from **Google News** by extracting all `<a>` **tags** from the **HTML** of **Google News**.

**Google News** uses `<a>` **tags** to create **links** to various websites.

pip install **beautifulsoup4**

In [1]:
import urllib.request
from bs4 import BeautifulSoup

The **\_\_init\_\_** method takes the website to scrape as a **parameter**.

The **urlopen** function sends a request to a website and returns a response object in which the **HTML** code is stored.

**read** returns the **HTML** from the response object.

**BeautifulSoup** parses the **HTML**.

In [2]:
class Scraper:
    def __init__(self, site):
        self.site = site
    
    def scrape(self):
        # Download the page and parse its HTML.
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)

        # Print the target of every <a> tag that links to an article.
        for tag in soup.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "articles" in url:
                print("\n" + url)

In [3]:
news = "https://news.google.com/"
# Scraper(news).scrape()  # prints the article links; returns None
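
Because `scrape` prints each link and returns nothing, its result cannot be indexed. A minimal sketch of a variant that returns the links as a list follows. The helper name `article_links` and the sample markup are illustrative assumptions, and the function parses an HTML string directly so it can be tried without a network request.

```python
from bs4 import BeautifulSoup


def article_links(html, keyword="articles"):
    """Return every href in `html` that contains `keyword`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a"):
        url = tag.get("href")  # None when the <a> tag has no href
        if url is not None and keyword in url:
            links.append(url)
    return links


# Hypothetical sample markup, not real Google News HTML.
sample = """
<a href="./articles/abc">Story</a>
<a href="/settings">Settings</a>
<a>No link</a>
"""
print(article_links(sample))
```

Returning a list leaves it to the caller to print, index, or store the links.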

### Why Web Scraping?

- Some websites offer **data sets** that are downloadable in **CSV** format.

- Others make their data accessible via an Application Programming Interface (**API**).

- But many websites with useful data don't offer these convenient options.

### How does Web Scraping Work?

- We write code that sends a **Request** to the **Server** that's hosting the page.

- The code downloads that page's source code.

- Instead of displaying the page visually, it **filters** through the page looking for the HTML elements we specify.

- Consider **Caching** the content you Scrape so that it's only Downloaded once.
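
The caching advice can be sketched with the standard library alone. This is one possible approach, not a prescribed one: the names `download` and `fetch_cached`, the `cache` folder, and the choice of hashing the URL into a file name are all assumptions made for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path


def download(url):
    """Fetch the raw bytes of a page over the network."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_cached(url, cache_dir="cache", fetch=download):
    """Return the page for `url`, downloading it only on the first call."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a file name that is safe on any file system.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = folder / name
    if path.exists():
        return path.read_bytes()  # cache hit: no request is sent
    content = fetch(url)
    path.write_bytes(content)
    return content
```

Passing the download function as the `fetch` parameter keeps the cache logic separate from the network call, so it can be exercised with a stand-in function while experimenting.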

### Components of Web Page

When we visit a **Web Page**, our **Web Browser** makes a request to a **Web Server**.

The request is called a **GET** request. The server responds with files that tell the browser how to render the page:

1. HTML : The main **content** of the page.
2. CSS : Adds **styling** to make the page look good.
3. JavaScript : Adds **interactivity** to web pages.


In [4]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

In [5]:
print(page.status_code)

200
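
A status code starting with 2 generally means the request succeeded, while codes starting with 4 or 5 signal an error. The sketch below labels a code using the standard library's `http.HTTPStatus`; `describe_status` is an illustrative helper name, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK (success)'."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    categories = {
        1: "informational",
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    return f"{code} {phrase} ({categories[code // 100]})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

With `requests`, calling `page.raise_for_status()` raises an exception for 4xx and 5xx responses, which helps avoid parsing an error page by mistake.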


In [6]:
print(page.content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'


### Parsing a page with BeautifulSoup

In [7]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [8]:
# Pretty-print the parsed HTML, one tag per line
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [9]:
print(list(soup.children))

['html', '\n', <html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>]


In [10]:
print([type(item) for item in list(soup.children)])

[<class 'bs4.element.Doctype'>, <class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>]
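
The `NavigableString` here is the newline between the doctype and the `<html>` tag. A small sketch, assuming only element nodes are of interest, filters the children with `isinstance`. The inline HTML mirrors the page fetched above so the example runs offline.

```python
from bs4 import BeautifulSoup, Tag

html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag objects, skipping the Doctype and whitespace strings.
top_level = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level])

# The same filter works one level down, inside <html>.
html_tag = top_level[0]
inner = [child.name for child in html_tag.children if isinstance(child, Tag)]
print(inner)
```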


In [11]:
html = list(soup.children)[2]
print(html)

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>


In [12]:
body = list(html.children)[3]
print(body)

<body>
<p>Here is some simple content for this page.</p>
</body>


In [13]:
p = list(body.children)[1]
print(p.get_text())

Here is some simple content for this page.
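
Walking the tree by index depends on the exact whitespace in the page. As a sketch of a shorter route, `find` returns the first matching tag (or `None`) and `find_all` returns every match, so the paragraph can be reached directly. The inline HTML is a copy of the simple page so the example runs without a request.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<!DOCTYPE html>\n<html>\n    <head>\n"
    "        <title>A simple example page</title>\n    </head>\n"
    "    <body>\n        <p>Here is some simple content for this page.</p>\n"
    "    </body>\n</html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")      # first <p> tag, or None if there is none
all_p = soup.find_all("p")    # every <p> tag in the document

print(first_p.get_text())
print(len(all_p))
print(soup.find("table"))     # this page has no <table>, so this is None
```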


In [14]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


In [15]:
print(soup.find_all('p', class_="outer-text"))

[<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>, <p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>]


In [16]:
print(soup.find_all(class_="outer-text"))

[<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>, <p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>]


In [17]:
print(soup.find_all(id = "first"))

[<p class="inner-text first-item" id="first">
                First paragraph.
            </p>]


### Using CSS Selectors

1. body p a : finds all a tags inside a p tag inside the body tag.
2. p.outer-text : finds all p tags with a **class** of outer-text.
3. p\#first : finds all p tags with an **id** of first.

In [18]:
print(soup.select("div p"))

[<p class="inner-text first-item" id="first">
                First paragraph.
            </p>, <p class="inner-text">
                Second paragraph.
            </p>]
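
The class and id selectors from the list can be tried the same way. The sketch below runs them against an inline copy of the page so no request is needed; `texts` is a hypothetical helper added only to keep the output short.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def texts(selector):
    """Return the stripped text of every element matching a CSS selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


print(texts("p.outer-text"))      # p tags with class outer-text
print(texts("p#first"))           # p tag with id first
print(texts("div p.first-item"))  # p tags with class first-item inside a div
```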


In [19]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Overnight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Overnight: A 50 percent chance of showers.  Mostly cloudy early, then becoming mostly clear, with a low around 48. West wind around 8 mph.  New precipitation amounts of less than a tenth of an inch possible. " class="forecast-icon" src="newimages/medium/nshra50.png" title="Overnight: A 50 percent chance of showers.  Mostly cloudy early, then becoming mostly clear, with a low around 48. West wind around 8 mph.  New precipitation amounts of less than a tenth of an inch possible. "/>
 </p>
 <p class="short-desc">
  Chance
  <br/>
  Showers
 </p>
 <p class="temp temp-low">
  Low: 48 °F
 </p>
</div>


In [20]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Overnight
ChanceShowers
Low: 48 °F
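
The short description prints as `ChanceShowers` because `get_text` joins the text on either side of the `<br/>` with nothing in between. A minimal sketch of one way around this passes a separator to `get_text`. The regular expression for pulling the number out of the temperature is an added illustration, not part of the original walkthrough.

```python
import re

from bs4 import BeautifulSoup

snippet = '<p class="short-desc">Chance<br/>Showers</p>'
tag = BeautifulSoup(snippet, "html.parser").find(class_="short-desc")

joined = tag.get_text()                           # text pieces run together
spaced = tag.get_text(separator=" ", strip=True)  # pieces joined by a space
print(joined)
print(spaced)

# Pull the numeric part out of the temperature text.
temp_text = "Low: 48 °F"
match = re.search(r"-?\d+", temp_text)
degrees = int(match.group()) if match else None
print(degrees)
```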


In [21]:
img = tonight.find("img")
desc = img['title']
print(desc)

Overnight: A 50 percent chance of showers.  Mostly cloudy early, then becoming mostly clear, with a low around 48. West wind around 8 mph.  New precipitation amounts of less than a tenth of an inch possible. 
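
The four fields extracted above can be gathered into a single record. The sketch below assumes hypothetical helpers `text_of` and `parse_item`, and works on a trimmed copy of the markup (with the `title` shortened) so it runs offline. It also guards against `find` returning `None` when an element is missing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup printed above; the title is shortened.
item_html = """
<div class="tombstone-container">
 <p class="period-name">Overnight<br/><br/></p>
 <p><img class="forecast-icon" title="Overnight: A 50 percent chance of showers."/></p>
 <p class="short-desc">Chance<br/>Showers</p>
 <p class="temp temp-low">Low: 48 °F</p>
</div>
"""


def text_of(parent, css_class):
    """Return the text of the first descendant with `css_class`, or None."""
    tag = parent.find(class_=css_class)
    if tag is None:
        return None
    return tag.get_text(separator=" ", strip=True)


def parse_item(item):
    """Collect one forecast entry into a dictionary."""
    img = item.find("img")
    return {
        "period": text_of(item, "period-name"),
        "short_desc": text_of(item, "short-desc"),
        "temp": text_of(item, "temp"),
        "desc": img.get("title") if img is not None else None,
    }


item = BeautifulSoup(item_html, "html.parser").find(class_="tombstone-container")
record = parse_item(item)
print(record)
```

Applying `parse_item` to each element of `forecast_items` would yield one dictionary per forecast period.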
