#### Learnt from Keith Galli lectures
https://www.youtube.com/watch?v=GjKQ6V_ViQE

In this notebook we walk through web scraping in Python using the Beautiful Soup library. We start with a brief introduction to HTML & CSS and discuss what web scraping is. Next we cover the basics of Beautiful Soup: how to load a webpage, the core commands such as find & find_all, how to grab strings from HTML elements, and so on.

In [3]:
!pip install beautifulsoup4

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.


In [5]:
!pip install requests



In [6]:
import requests
from bs4 import BeautifulSoup as bs

In [9]:
url = 'https://keithgalli.github.io/web-scraping/example.html'
r = requests.get(url)

# Create the soup object; naming the parser avoids bs4's "no parser specified" warning
soup = bs(r.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>
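
The loading step above can be wrapped in a small helper. This is only a sketch: `fetch_soup` and its `timeout` default are hypothetical names chosen for illustration, not part of the lecture. It adds a timeout and an HTTP status check, which are usually worth having when scraping pages that may be slow or missing.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not from the lecture)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Name the parser explicitly so the result does not depend on what is installed
    return BeautifulSoup(response.content, "html.parser")


# Parsing works the same way on any HTML string, so it can be tried offline:
offline_soup = BeautifulSoup(
    "<html><head><title>HTML Example</title></head></html>", "html.parser"
)
print(offline_soup.title.string)
```

Calling `fetch_soup(url)` with the `url` defined above should give a soup equivalent to the one built in the cell.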



## Getting the head tag

In [10]:
soup.head

<head>
<title>HTML Example</title>
</head>

In [11]:
soup.find('head')

<head>
<title>HTML Example</title>
</head>

## Getting the body tag

In [13]:
soup.body

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>

In [14]:
soup.find('body')

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>

## find & find_all methods

In [16]:
soup.find('p')   # returns only the first occurrence of the specified tag

<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>

find returns only the first occurrence of the tag; find_all, used next, returns every occurrence.

In [15]:
soup.find_all('p')    # returns a list with all the occurrences of the specified tag

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [17]:
soup.find('p', attrs={'id':'paragraph-id'})   # the attrs argument narrows the search to tags with a specific id or class

<p id="paragraph-id"><b>Some bold text</b></p>

In [20]:
soup.find_all("p", attrs={'id':'paragraph-id'})

[<p id="paragraph-id"><b>Some bold text</b></p>]

In [43]:
# All the h2 tags
soup.find_all('h2')

[<h2>A Header</h2>, <h2>Another header</h2>]
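
One practical difference between the two methods is what they return when nothing matches. The sketch below uses a small inline HTML string (an assumption made so it runs without a network connection) to show that `find` returns `None` while `find_all` returns an empty result.

```python
from bs4 import BeautifulSoup

# Inline HTML so the example runs offline (illustrative, trimmed from the page above)
html = "<body><h2>A Header</h2><p><i>Some italicized text</i></p><h2>Another header</h2></body>"
soup_demo = BeautifulSoup(html, "html.parser")

first_h2 = soup_demo.find("h2")            # a single Tag
all_h2 = soup_demo.find_all("h2")          # a list-like ResultSet of Tags
missing = soup_demo.find("table")          # None, there is no <table>
missing_all = soup_demo.find_all("table")  # empty, so looping over it is safe

# Guard against None before chaining calls on a find() result
table_text = missing.get_text() if missing is not None else ""

print(first_h2.string, len(all_h2), missing, len(missing_all))
# A Header 2 None 0
```

Because of this, chaining directly on `find(...)` can raise an `AttributeError` on pages where the tag is absent, whereas iterating over `find_all(...)` simply does nothing.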

## Extracting only the text

In [22]:
soup.find('p', attrs={'id':'paragraph-id'}).string    # using the .string attribute

'Some bold text'

In [24]:
soup.find('p', attrs={'id':'paragraph-id'}).get_text()  # using the get_text() method

'Some bold text'

In [31]:
# All the p tags; their text is extracted in the next cells
soup.find_all('p')

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [32]:
[txt.string for txt in soup.find_all('p')]

[None, 'Some italicized text', 'Some bold text']

In [35]:
[txt.get_text() for txt in soup.find_all('p')]

['Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html',
 'Some italicized text',
 'Some bold text']
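
The `None` in the `.string` result above comes from the first paragraph having more than one child (a piece of text plus an a-tag), so there is no single string to return. The sketch below reproduces this on a shortened inline snippet (the link text is abbreviated for illustration) and also shows the `separator` and `strip` arguments of `get_text()`.

```python
from bs4 import BeautifulSoup

# Shortened inline version of the first two paragraphs (illustrative)
html = (
    '<p>Link: <a href="https://keithgalli.github.io/web-scraping/webpage.html">webpage.html</a></p>'
    "<p><i>Some italicized text</i></p>"
)
demo = BeautifulSoup(html, "html.parser")
mixed, single = demo.find_all("p")

print(mixed.string)      # None: this <p> has two children, a text node and an <a>
print(single.string)     # Some italicized text: a single chain <p> -> <i> -> text
print(mixed.get_text())  # Link: webpage.html
print(mixed.get_text(separator=" | ", strip=True))  # Link: | webpage.html
```

As a rule of thumb, `.string` suits tags known to hold exactly one piece of text, while `get_text()` is the safer choice for tags with nested markup.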

In [37]:
# Find all links (a tags) and their href attributes

soup.find_all('a')

[<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>]

In [42]:
# Getting only the link without the tag
soup.find_all('a')[0]['href']

'https://keithgalli.github.io/web-scraping/webpage.html'
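
Indexing with `['href']` raises a `KeyError` when an a-tag has no `href`, and real pages often use relative links. The following sketch, built on a made-up inline snippet (the relative link and the `base_url` value are assumptions for illustration), uses `.get('href')` and the standard library's `urljoin` to handle both cases.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL and made-up markup, chosen only to illustrate the two cases
base_url = "https://keithgalli.github.io/web-scraping/example.html"
html = '<p><a href="webpage.html">relative link</a> <a>no href here</a></p>'
demo = BeautifulSoup(html, "html.parser")

links = []
for a in demo.find_all("a"):
    href = a.get("href")  # returns None instead of raising KeyError
    if href is not None:
        # Resolve relative links against the page they were found on
        links.append(urljoin(base_url, href))

print(links)
# ['https://keithgalli.github.io/web-scraping/webpage.html']
```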

## Code navigation (parents, siblings, and children)

In [46]:
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



In [53]:
# Path Syntax
soup.body.div.h1.string

'HTML Webpage'

In [60]:
# parent, siblings, child
soup.body.find('div').find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [68]:
[child for child in soup.body.find('div').children]

['\n',
 <h1>HTML Webpage</h1>,
 '\n',
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 '\n']

In [70]:
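# Keep only the tags, dropping the whitespace-only '\n' strings between them
[child for child in soup.body.find('div').children if str(child).strip()]

[<h1>HTML Webpage</h1>,
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]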
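
The cells above cover siblings and children; moving upward and backward works in a similar way. This is a minimal sketch on a trimmed inline copy of the page (an assumption so it runs offline), using `.parent`, `.parents`, `find_previous_sibling()` and `find_next_sibling()`.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the example page (illustrative)
html = (
    "<body><div><h1>HTML Webpage</h1></div>"
    "<h2>A Header</h2><p><i>Some italicized text</i></p>"
    "<h2>Another header</h2></body>"
)
demo = BeautifulSoup(html, "html.parser")

italic = demo.find("i")
paragraph = italic.parent                           # the enclosing <p>
print(paragraph.name)                               # p
print([tag.name for tag in italic.parents])         # ['p', 'body', '[document]']
print(paragraph.find_previous_sibling().string)     # A Header
print(paragraph.find_next_sibling("h2").string)     # Another header
print(demo.find("div").find_previous_sibling())     # None: <div> is the first child
```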

## CSS Selectors

The .select() method accepts CSS selectors and returns a list of all matching tags.

In [72]:
soup.select('title')

[<title>HTML Example</title>]

In [73]:
soup.select('p')

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [75]:
# Find tags anywhere beneath other tags (descendants):
soup.select('p a')

[<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>]

In [78]:
soup.select('body a')

[<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>]

In [77]:
# Find tags directly beneath other tags (direct children only):
soup.select('body > a')

[]

This returns an empty list because no a-tag is a direct child of body; the only a-tag is nested inside a p-tag, which itself sits inside the div.

In [79]:
soup.select('p > a')

[<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>]

### Selecting a specific id

In [83]:
soup.select('body > p#paragraph-id')

[<p id="paragraph-id"><b>Some bold text</b></p>]
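
A few more selector patterns follow the same idea. The sketch below, again on a trimmed inline copy of the page (an assumption so it runs offline), shows `select_one()` for a single match, an attribute selector, and how the `>` combinator excludes nested tags.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the example page (illustrative)
html = (
    '<body><div><p>Link: <a href="https://keithgalli.github.io/web-scraping/webpage.html">webpage.html</a></p></div>'
    '<p id="paragraph-id"><b>Some bold text</b></p></body>'
)
demo = BeautifulSoup(html, "html.parser")

bold = demo.select_one("p#paragraph-id > b")   # first match only, like find()
print(bold.string)                             # Some bold text

hrefs = [a["href"] for a in demo.select("a[href]")]  # only <a> tags that have an href
print(hrefs)

print(demo.select_one("table"))                # None when nothing matches
print(len(demo.select("body > p")))            # 1: the <p> inside <div> is not a direct child
print(len(demo.select("body p")))              # 2: descendants at any depth
```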