With reference to this tutorial: https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/
Tried on 9th Feb 2022

Trying the <code>requests</code> library

In [3]:
import requests 
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

In [4]:
page.status_code # indicates whether the page was downloaded properly

200

A <code>status_code</code> of <code>200</code> means that the page was downloaded successfully.
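As a minimal sketch of how a status check might be handled in code, the snippet below uses the `ok` property and the `raise_for_status()` method of a `requests.Response`. The `Response` objects are built by hand (an assumption made only so the sketch runs without network access), and `describe_status` is a hypothetical helper name.

```python
import requests


def describe_status(response):
    """Return a short label for a response's status code (hypothetical helper)."""
    # `ok` is True when the status code is below 400.
    if response.ok:
        return "success"
    return "client or server error"


# Hand-built responses stand in for real downloads (assumption for illustration).
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(describe_status(ok_response))
print(describe_status(missing_response))

# raise_for_status() raises an HTTPError for 4xx and 5xx status codes.
try:
    missing_response.raise_for_status()
except requests.HTTPError as err:
    print("raise_for_status raised:", type(err).__name__)
```

With a real download, the same calls would apply to the object returned by `requests.get`.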

In [5]:
# html content of the page
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

## Parsing a Page with BeautifulSoup

In [6]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


Selecting all the elements at the top level of the page using the `children` property of `soup`

In [8]:
list(soup.children) # `children` is an iterator, so we wrap it in list() to see the items


['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [9]:
# Seeing what the type of each element in the list is:
[type(item) for item in soup.children]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
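Since the top level mixes a `Doctype`, a `NavigableString` and a `Tag`, one possible way to keep only the tags is an `isinstance` check. This is a sketch on an inline copy of the simple page (an assumption, so that it runs without downloading anything).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page (assumption: avoids a network call).
html_doc = (
    "<!DOCTYPE html>\n"
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag objects, skipping the Doctype and the newline string.
tag_children = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in tag_children])
```

Filtering this way avoids relying on the position of the `html` tag in the list.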

In [10]:
# Selecting the html tag and its children by taking the third item in the list:
html = list(soup.children)[2]

In [11]:
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

In [12]:
body = list(html.children)[3]

In [13]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [14]:
# Isolating the p tag
p = list(body.children)[1]

In [15]:
p.get_text()

'Here is some simple content for this page.'
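The index-based navigation above depends on where the whitespace strings fall. As an alternative sketch, BeautifulSoup also lets a tag name be used as an attribute, which returns the first matching descendant. The inline HTML is an assumed copy of the simple page.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page (assumption: avoids a network call).
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each attribute access returns the first tag with that name below the current one.
p = soup.html.body.p
print(p.get_text())
print(soup.title.get_text())
```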

## Finding all instances of a tag at once

In [16]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [17]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
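When only the first match is needed, `find` may be more convenient than indexing into `find_all`. The sketch below (on an assumed inline fragment) also shows that `find` returns `None` when nothing matches, while `find_all` returns an empty list.

```python
from bs4 import BeautifulSoup

# Small inline fragment (assumption: stands in for the downloaded page).
soup = BeautifulSoup(
    "<body><p>Here is some simple content for this page.</p></body>",
    "html.parser",
)

first_p = soup.find("p")      # first matching tag
missing = soup.find("table")  # no <table> in this fragment, so this is None

# Guard against None before calling methods on the result.
text = first_p.get_text() if first_p is not None else ""
print(text)
print(missing)
print(soup.find_all("table"))
```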

## Searching for tags by class and id 

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. But when we’re scraping, we can also use them to specify the elements we want to scrape.

To illustrate this principle, we’ll work with the following page:

In [18]:
page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [19]:
# Searching for any p tag that has the class outer-text:
soup.find_all('p', class_= 'outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [20]:
# Looking for any tag that has the class outer-text
soup.find_all(class_ = 'outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [21]:
# Searching for elements by id
soup.find_all(id = 'first')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
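The tag name, class and id filters can also be combined in one `find_all` call. The following is a sketch on an assumed inline copy of the ids-and-classes page, with the whitespace inside the paragraphs trimmed.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the ids_and_classes page (assumption for illustration).
html_doc = """
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# All filters must match: a p tag with class outer-text and id second.
matches = soup.find_all("p", class_="outer-text", id="second")
print(matches[0]["class"])  # the class attribute is returned as a list

# class_ matches any one of a tag's classes.
first_item_ids = [p["id"] for p in soup.find_all("p", class_="first-item")]
print(first_item_ids)
```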

## Using CSS Selectors

We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

- `p a`: finds all `a` tags inside of a `p` tag.
- `body p a`: finds all `a` tags inside of a `p` tag inside of a `body` tag.
- `html body`: finds all `body` tags inside of an `html` tag.
- `p.outer-text`: finds all `p` tags with a class of `outer-text`.
- `p#first`: finds all `p` tags with an id of `first`.
- `body p.outer-text`: finds any `p` tags with a class of `outer-text` inside of a `body` tag.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:



In [22]:
soup.select('div p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]
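A few of the other selector forms can be tried the same way. This sketch runs them against an assumed inline copy of the page and also uses `select_one`, which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the ids_and_classes page (assumption for illustration).
html_doc = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

outer = soup.select("p.outer-text")            # p tags with class outer-text
first = soup.select_one("p#first")             # the p tag with id first
body_outer = soup.select("body p.outer-text")  # same p tags, scoped to body

print(len(outer))
print(first.get_text(strip=True))
print(len(body_outer))
```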

## Downloading Weather Data


Downloading Weather Data from: https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.YgV5vi-B0UE


Steps:

1. Download the web page containing the forecast.
2. Create a `BeautifulSoup` object to parse the page.
3. Find the `div` with id `seven-day-forecast`, and assign it to `seven_day`.
4. Inside `seven_day`, find each individual forecast item.
5. Extract and print the first forecast item.

In [23]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.YgV5vi-B0UE")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id ='seven-day-forecast')
forecast_items = seven_day.find_all(class_= 'tombstone-container')
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  This
  <br/>
  Afternoon
 </p>
 <p>
  <img alt="This Afternoon: Sunny, with a high near 76. North northeast wind 6 to 8 mph. " class="forecast-icon" src="newimages/medium/skc.png" title="This Afternoon: Sunny, with a high near 76. North northeast wind 6 to 8 mph. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 76 °F
 </p>
</div>
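If the page layout changes, `soup.find` returns `None` and the later `find_all` call fails. A defensive variant might look like the sketch below. `get_forecast_items` is a hypothetical helper, and `SAMPLE_HTML` is a hand-trimmed fragment modelled on the output above rather than the live page.

```python
from bs4 import BeautifulSoup

# Hand-trimmed stand-in for the forecast page (assumption for illustration).
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">This<br/>Afternoon</p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 76 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Clear</p>
  </div>
</div>
"""


def get_forecast_items(html):
    """Return the forecast item tags, or an empty list if the container is missing."""
    soup = BeautifulSoup(html, "html.parser")
    seven_day = soup.find(id="seven-day-forecast")
    if seven_day is None:
        return []
    return seven_day.find_all(class_="tombstone-container")


items = get_forecast_items(SAMPLE_HTML)
print(len(items))
print(get_forecast_items("<html><body></body></html>"))
```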


## Extracting Information from the Page

In [24]:
period = tonight.find(class_ = 'period-name').get_text()
short_desc = tonight.find(class_ = 'short-desc').get_text()
temp = tonight.find(class_= 'temp').get_text()
print("Period:", period)
print("Short Description:", short_desc)
print("Temperature:", temp) 

Period: ThisAfternoon
Short Description: Sunny
Temperature: High: 76 °F
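The period prints as `ThisAfternoon` because the two words are separated only by a `<br/>` tag, which contributes no text. One possible fix is to pass a separator to `get_text`, sketched here on an inline fragment copied from the output above.

```python
from bs4 import BeautifulSoup

# Fragment copied from the forecast item shown above.
fragment = '<p class="period-name">This<br/>Afternoon</p>'
tag = BeautifulSoup(fragment, "html.parser").find(class_="period-name")

joined = tag.get_text()                            # pieces are concatenated directly
spaced = tag.get_text(separator=" ", strip=True)   # pieces are stripped and joined with a space

print(joined)
print(spaced)
```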


In [25]:
img = tonight.find("img")
desc = img['title']
print("Description:", desc)

Description: This Afternoon: Sunny, with a high near 76. North northeast wind 6 to 8 mph. 
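Indexing with `img['title']` raises a `KeyError` if the attribute is absent. As a sketch, `Tag.get` can be used to fall back to a default instead. The second `img` tag and its `no-title.png` source are hypothetical, added only to show the missing-attribute case.

```python
from bs4 import BeautifulSoup

# The second <img> is a hypothetical tag without a title attribute.
html_doc = (
    '<p><img class="forecast-icon" src="newimages/medium/skc.png" '
    'title="This Afternoon: Sunny, with a high near 76."/>'
    '<img src="no-title.png"/></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# get() returns the default instead of raising KeyError when the attribute is missing.
titles = [img.get("title", "") for img in soup.find_all("img")]
print(titles)
```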


## Extracting all the information from the page

Now, we will:
- Select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.
- Use a list comprehension to call the `get_text` method on each matching tag.

In [27]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['ThisAfternoon',
 'Tonight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday']

In [29]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d['title'] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Sunny', 'Clear', 'Haze thenSunny', 'Mostly Clearthen Haze', 'Haze thenSunny', 'Mostly Clear', 'Mostly Sunny', 'Partly Cloudy', 'Mostly Sunny']
['High: 76 °F', 'Low: 48 °F', 'High: 72 °F', 'Low: 47 °F', 'High: 71 °F', 'Low: 50 °F', 'High: 70 °F', 'Low: 50 °F', 'High: 62 °F']
['This Afternoon: Sunny, with a high near 76. North northeast wind 6 to 8 mph. ', 'Tonight: Clear, with a low around 48. Northwest wind 3 to 6 mph. ', 'Friday: Widespread haze before 9am. Sunny, with a high near 72. North wind 6 to 9 mph. ', 'Friday Night: Widespread haze after 4am. Mostly clear, with a low around 47. West wind 5 to 8 mph becoming north after midnight. ', 'Saturday: Widespread haze before 7am. Sunny, with a high near 71. North northeast wind around 7 mph. ', 'Saturday Night: Mostly clear, with a low around 50.', 'Sunday: Mostly sunny, with a high near 70.', 'Sunday Night: Partly cloudy, with a low around 50.', 'Monday: Mostly sunny, with a high near 62.']
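The parallel lists can then be combined into one record per forecast period. The sketch below uses `zip` and a regular expression on the first three values from the output above. `parse_temp` and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# First three values copied from the output above.
periods = ["ThisAfternoon", "Tonight", "Friday"]
short_descs = ["Sunny", "Clear", "Haze thenSunny"]
temps = ["High: 76 °F", "Low: 48 °F", "High: 72 °F"]


def parse_temp(text):
    """Extract the first integer from a temperature string, or None if absent."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


# zip pairs up the items at the same position in each list.
forecast = [
    {
        "period": period,
        "short_desc": short_desc,
        "temp_f": parse_temp(temp),
        "is_high": temp.startswith("High"),
    }
    for period, short_desc, temp in zip(periods, short_descs, temps)
]

for row in forecast:
    print(row)
```

Keeping the temperature as an integer makes later filtering or averaging straightforward.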
