<h1 style="color:Orange;font-size:170%;"> Intro To Web Scraping With Python </h1>

Install Beautiful Soup and import the libraries we will use

In [1]:
!pip install beautifulsoup4



In [2]:
from bs4 import BeautifulSoup

Let us define a very simple webpage in HTML, with:
 * Head
 * Body tags
    * section 1
        * h3 title
        * image
        * paragraph
    * section 2
        * unordered list of links

In [3]:
html_doc = """
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <meta http-equiv="X-UA-Compatible" content="ie=edge" />
        <title>My Webpage</title>
    </head>
    <body>
        <div id="section-1">
            <h3 data-hello="hi">Hello</h3>
            <img src="https://source.unsplash.com/200x200/?nature,water" />
            <p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
        </div>
        <div id="section-2">
            <ul class="items">
                <li class="item"><a href="#">Item 1</a></li>
                <li class="item"><a href="#">Item 2</a></li>
                <li class="item"><a href="#">Item 3</a></li>
                <li class="item"><a href="#">Item 4</a></li>
                <li class="item"><a href="#">Item 5</a></li>
            </ul>
        </div>
    </body>
</html>
"""

Let us create a BeautifulSoup object and store it in the variable soup. In this case we parse the html_doc string defined above. If we want to scrape a live webpage, we can use the requests library to get the HTML of the page and pass that as the first parameter instead.

In [4]:
# soup = BeautifulSoup(markup to parse, name of the parser, e.g. 'html.parser')
soup = BeautifulSoup(html_doc, 'html.parser')
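When the markup comes from a download rather than a string literal, it often arrives as bytes. A minimal sketch, assuming the bytes below stand in for a downloaded response body (the sample markup is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical payload: UTF-8 encoded bytes standing in for a downloaded page.
raw_bytes = "<html><body><p>café</p></body></html>".encode("utf-8")

# from_encoding tells Beautiful Soup how to decode the bytes instead of guessing.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")

paragraph_text = soup.p.string
print(paragraph_text)
```

Passing the encoding explicitly is a design choice for this sketch; if it is omitted, Beautiful Soup tries to detect the encoding on its own.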

### Direct Approach

In [5]:
# direct print
print(soup.body)

<body>
<div id="section-1">
<h3 data-hello="hi">Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water"/>
<p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
</div>
<div id="section-2">
<ul class="items">
<li class="item"><a href="#">Item 1</a></li>
<li class="item"><a href="#">Item 2</a></li>
<li class="item"><a href="#">Item 3</a></li>
<li class="item"><a href="#">Item 4</a></li>
<li class="item"><a href="#">Item 5</a></li>
</ul>
</div>
</body>


In [6]:
print(soup.head)

<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="ie=edge" http-equiv="X-UA-Compatible"/>
<title>My Webpage</title>
</head>


In [7]:
print(soup.head.title)

<title>My Webpage</title>
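The direct approach has two properties worth knowing: it returns only the first matching tag, and it returns None for a tag that is not in the document. A small sketch, using a trimmed-down stand-in for html_doc so that it runs on its own:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc (same ids, fewer children).
html = """
<html>
    <head><title>My Webpage</title></head>
    <body>
        <div id="section-1"><h3 data-hello="hi">Hello</h3></div>
        <div id="section-2"><ul class="items"></ul></div>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns the first matching tag among all descendants.
first_div = soup.div

# .string gives the text inside a tag that has a single text child.
title_text = soup.head.title.string

# A tag that does not exist yields None rather than raising an error.
missing = soup.table

print(first_div["id"])
print(title_text)
print(missing)
```

Because a missing tag gives None, chaining such as soup.table.tr would fail with an AttributeError, so it is safer to check each step when the page structure is not guaranteed.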


### Find Approach
Usually we do not search this way; we use find() instead.

In [8]:
soup.find('div')

<div id="section-1">
<h3 data-hello="hi">Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water"/>
<p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
</div>

This method returns only the first matching element in the HTML. To return all matching elements we can use the find_all() method (findAll() is an older alias for it), which returns a list-like object.

In [9]:
soup.find_all('div')
# to get the second element of the list we can index it as follows:
soup.find_all('div')[1]

<div id="section-2">
<ul class="items">
<li class="item"><a href="#">Item 1</a></li>
<li class="item"><a href="#">Item 2</a></li>
<li class="item"><a href="#">Item 3</a></li>
<li class="item"><a href="#">Item 4</a></li>
<li class="item"><a href="#">Item 5</a></li>
</ul>
</div>
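The difference between the two methods shows up most clearly when nothing matches. A brief sketch on a shortened stand-in for html_doc:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc.
html = """
<div id="section-1"><h3>Hello</h3></div>
<div id="section-2"><ul class="items"><li class="item">Item 1</li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div")                  # a single tag
all_divs = soup.find_all("div")           # list-like collection of tags
only_one = soup.find_all("div", limit=1)  # stop after the first match

no_tag = soup.find("table")               # None when nothing matches
no_tags = soup.find_all("table")          # empty collection when nothing matches

div_ids = [div["id"] for div in all_divs]
print(div_ids)
```

Since find() can return None while find_all() always returns something iterable, a loop over find_all() needs no extra guard.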

In [10]:
# Find a specific section by its id
soup.find(id='section-1')

<div id="section-1">
<h3 data-hello="hi">Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water"/>
<p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
</div>

In [11]:
# Find specific class
soup.find(class_='items')

<ul class="items">
<li class="item"><a href="#">Item 1</a></li>
<li class="item"><a href="#">Item 2</a></li>
<li class="item"><a href="#">Item 3</a></li>
<li class="item"><a href="#">Item 4</a></li>
<li class="item"><a href="#">Item 5</a></li>
</ul>

Note that the keyword is class_, not class. Because class is a reserved word in Python, writing soup.find(class='items') raises a SyntaxError. Adding a trailing underscore solves the problem:

In [12]:
soup.find(class_='items')

<ul class="items">
<li class="item"><a href="#">Item 1</a></li>
<li class="item"><a href="#">Item 2</a></li>
<li class="item"><a href="#">Item 3</a></li>
<li class="item"><a href="#">Item 4</a></li>
<li class="item"><a href="#">Item 5</a></li>
</ul>
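The `class_` keyword has an equivalent spelling through `attrs`, and it matches individual class names rather than the whole attribute string. A sketch, assuming a second class name "highlighted" that is not in the original page and is added here only for illustration:

```python
from bs4 import BeautifulSoup

# "highlighted" is a hypothetical extra class used to show multi-class matching.
html = """
<ul class="items">
    <li class="item"><a href="#">Item 1</a></li>
    <li class="item highlighted"><a href="#">Item 2</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="item")          # trailing underscore
by_attrs = soup.find_all(attrs={"class": "item"})  # same search via attrs
by_tag_and_class = soup.find_all("li", class_="item")

# The class attribute is multi-valued, so it comes back as a list.
second_classes = by_keyword[1]["class"]
print(len(by_keyword), second_classes)
```

The `<ul class="items">` element is not matched, because "items" is a different class name from "item".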

In [13]:
# searching by attributes:
soup.find(attrs={"data-hello":"hi"})

<h3 data-hello="hi">Hello</h3>
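Once a tag has been found, its attributes can be read much like dictionary entries. A short sketch using two elements taken from html_doc:

```python
from bs4 import BeautifulSoup

html = (
    '<h3 data-hello="hi">Hello</h3>'
    '<img src="https://source.unsplash.com/200x200/?nature,water" />'
)
soup = BeautifulSoup(html, "html.parser")

h3 = soup.find(attrs={"data-hello": "hi"})

greeting = h3["data-hello"]            # raises KeyError if the attribute is absent
missing = h3.get("id")                 # None if the attribute is absent
with_default = h3.get("id", "no-id")   # fallback value instead of None
all_attrs = h3.attrs                   # dictionary of all attributes

img_src = soup.find("img")["src"]
print(greeting, missing, with_default, all_attrs, img_src)
```

Using get() rather than square brackets is the more forgiving choice when scraping pages where an attribute may be missing.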

### Select Approach
select() lets us search with CSS selectors and returns a list of all matching elements.

In [14]:
soup.select('#section-1')

[<div id="section-1">
 <h3 data-hello="hi">Hello</h3>
 <img src="https://source.unsplash.com/200x200/?nature,water"/>
 <p>
                 Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                 culpa cumque velit aperiam officia molestias maiores qui
                 officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                 officiis quod, eum natus voluptatem nihil fugit necessitatibus
                 dolorum quae accusamus aliquid enim fuga dicta beatae!
             </p>
 </div>]

In [15]:
soup.select('#section-1')[0]

<div id="section-1">
<h3 data-hello="hi">Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water"/>
<p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
</div>

In [16]:
soup.select('.item')[0]

<li class="item"><a href="#">Item 1</a></li>
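Indexing with `[0]` fails with an IndexError when nothing matches, so `select_one()` can be a gentler alternative, and selectors can be combined just as in a stylesheet. A sketch on a shortened stand-in for html_doc:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc.
html = """
<div id="section-1"><h3>Hello</h3></div>
<div id="section-2">
    <ul class="items">
        <li class="item"><a href="#">Item 1</a></li>
        <li class="item"><a href="#">Item 2</a></li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when there is none.
section = soup.select_one("#section-1")
nothing = soup.select_one("#section-3")

# Combined selector: links inside list items that are direct children of ul.items.
links = soup.select("#section-2 ul.items > li.item a")
link_texts = [a.get_text() for a in links]
print(link_texts)
```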

### Get text

In [17]:
soup.find(class_='item').get_text()

'Item 1'

In [18]:
for item in soup.select('.item'):
    print(item.get_text())

Item 1
Item 2
Item 3
Item 4
Item 5
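For longer elements such as the paragraph in section 1, get_text() keeps the indentation and line breaks of the source. A sketch of a few ways to tidy that up, using a shortened paragraph in place of the full Lorem ipsum text:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for section 1 of html_doc.
html = """
<div id="section-1">
    <h3>Hello</h3>
    <p>
        Lorem ipsum dolor sit amet
        consectetur adipisicing elit.
    </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

raw = p.get_text()                      # keeps leading and inner whitespace
stripped = p.get_text(strip=True)       # trims the ends, inner line break remains
one_line = " ".join(p.get_text().split())  # collapses all runs of whitespace

# A separator is placed between the text pieces of nested tags.
joined = soup.find("div").get_text(" | ", strip=True)

print(one_line)
```

Splitting and re-joining is plain Python rather than a Beautiful Soup feature; it is used here because strip=True alone leaves the line break inside the paragraph.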


### Navigation
With the contents attribute we can get the direct children of a tag as a list:

In [19]:
soup.body.contents

['\n',
 <div id="section-1">
 <h3 data-hello="hi">Hello</h3>
 <img src="https://source.unsplash.com/200x200/?nature,water"/>
 <p>
                 Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                 culpa cumque velit aperiam officia molestias maiores qui
                 officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                 officiis quod, eum natus voluptatem nihil fugit necessitatibus
                 dolorum quae accusamus aliquid enim fuga dicta beatae!
             </p>
 </div>,
 '\n',
 <div id="section-2">
 <ul class="items">
 <li class="item"><a href="#">Item 1</a></li>
 <li class="item"><a href="#">Item 2</a></li>
 <li class="item"><a href="#">Item 3</a></li>
 <li class="item"><a href="#">Item 4</a></li>
 <li class="item"><a href="#">Item 5</a></li>
 </ul>
 </div>,
 '\n']

Note that contents also includes the line breaks (\n) between tags as text nodes, so the first element of the list is a line break. To get the first tag we use index 1:

In [20]:
soup.body.contents[1]

<div id="section-1">
<h3 data-hello="hi">Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water"/>
<p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
</div>

If we want to get the h3 tag containing ***Hello***, we can access it as follows:

In [21]:
soup.body.contents[1].contents[1]

<h3 data-hello="hi">Hello</h3>

Be aware that contents also includes the line breaks (\n) between tags:

In [22]:
soup.body.contents[1].contents[2]

'\n'

In [23]:
soup.body.contents[1].contents[3]

<img src="https://source.unsplash.com/200x200/?nature,water"/>
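Counting indices around the line breaks is easy to get wrong. One possible way to avoid it is to keep only the children that are tags, sketched here on a shortened stand-in for section 1:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for section 1 of html_doc.
html = """<div id="section-1">
<h3 data-hello="hi">Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water" />
<p>Lorem ipsum</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find(id="section-1")

all_children = section.contents   # tags and line-break strings mixed together

# Option 1: filter the children by type.
tag_children = [child for child in section.children if isinstance(child, Tag)]

# Option 2: find_all limited to direct children only returns tags.
direct_tags = section.find_all(recursive=False)

names = [tag.name for tag in tag_children]
print(len(all_children), names)
```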

If we want to find the next sibling, we can do it as follows:
![title](img/html_tree.png)

In [24]:
soup.body.contents[1].contents[1].find_next_sibling()

<img src="https://source.unsplash.com/200x200/?nature,water"/>

In [25]:
soup.find(id='section-2').find_previous_sibling()

<div id="section-1">
<h3 data-hello="hi">Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water"/>
<p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
</div>

In [26]:
soup.find(class_='item').find_parent()

<ul class="items">
<li class="item"><a href="#">Item 1</a></li>
<li class="item"><a href="#">Item 2</a></li>
<li class="item"><a href="#">Item 3</a></li>
<li class="item"><a href="#">Item 4</a></li>
<li class="item"><a href="#">Item 5</a></li>
</ul>

In [27]:
soup.find('h3').find_next_sibling('p')

<p>
                Lorem ipsum dolor sit amet consectetur adipisicing elit. Iusto
                culpa cumque velit aperiam officia molestias maiores qui
                officiis incidunt. Omnis vitae eveniet reprehenderit excepturi
                officiis quod, eum natus voluptatem nihil fugit necessitatibus
                dolorum quae accusamus aliquid enim fuga dicta beatae!
            </p>
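The find_* navigation methods skip over the line-break strings, while the plain navigation attributes do not. A sketch of the difference, on a shortened stand-in for section 1:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for section 1 of html_doc.
html = """<div id="section-1">
<h3>Hello</h3>
<img src="https://source.unsplash.com/200x200/?nature,water" />
<p>Lorem ipsum</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

raw_next = h3.next_sibling              # the line break right after </h3>
next_tag = h3.find_next_sibling()       # the next sibling that is a tag
later_tags = h3.find_next_siblings()    # every following sibling tag

parent = h3.parent                      # direct parent
ancestor = h3.find_parent("div")        # nearest enclosing div

later_names = [tag.name for tag in later_tags]
print(repr(raw_next), next_tag.name, later_names, parent["id"])
```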

## Small script to save scraped data from a webpage to a CSV file

In [None]:
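import requests
from bs4 import BeautifulSoup
from csv import writer

url = 'some URL'  # placeholder: replace with the page to scrape
response = requests.get(url)
response.raise_for_status()  # stop early if the server returned an HTTP error

soup = BeautifulSoup(response.text, 'html.parser')
# each post is expected inside an element with class "post-preview"
posts = soup.find_all(class_='post-preview')

# newline='' lets the csv module control the line endings itself
with open('posts.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = writer(csv_file)
    headers = ['Title', 'Link', 'Date']
    csv_writer.writerow(headers)

    for post in posts:
        # remove line breaks so the title stays on one CSV line
        title = post.find(class_='post-title').get_text().replace('\n', '')
        link = post.find('a')['href']
        date = post.select('.post-date')[0].get_text()
        csv_writer.writerow([title, link, date])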
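The script above assumes every post contains a title, a link and a date, and it needs a network connection to run. The following sketch shows one possible offline variant that tolerates missing pieces. The sample markup, the extract_rows helper and the in-memory buffer are all assumptions made for illustration; only the class names are taken from the script.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup: a real page may be structured differently.
sample_html = """
<div class="post-preview">
    <a href="/posts/1"><h2 class="post-title">
        First post
    </h2></a>
    <span class="post-date">2020-01-01</span>
</div>
<div class="post-preview">
    <a href="/posts/2"><h2 class="post-title">Second post</h2></a>
</div>
"""


def extract_rows(html):
    """Return one [title, link, date] row per post, with '' for missing parts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for post in soup.find_all(class_="post-preview"):
        title_tag = post.find(class_="post-title")
        link_tag = post.find("a")
        date_tag = post.select_one(".post-date")
        rows.append([
            title_tag.get_text(strip=True) if title_tag else "",
            link_tag.get("href", "") if link_tag else "",
            date_tag.get_text(strip=True) if date_tag else "",
        ])
    return rows


rows = extract_rows(sample_html)

# Write to an in-memory buffer instead of a file so the sketch leaves no trace.
buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(["Title", "Link", "Date"])
csv_writer.writerows(rows)
print(buffer.getvalue())
```

Separating extraction from writing keeps the parsing logic testable without a live URL; swapping the buffer for an opened file gives the same CSV on disk.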