# NEWS SCRAPER TUTORIAL

___

In this tutorial we will see how to parse the content of a website in order to pull out some information. The website we will use is http://www.uraniabasket.it/news, the news page of Urania Basket, a basketball team from Milan.

The pieces of information that we want to pull out from the news page are:
1. Title
2. Date of publication
3. Short summary
4. URL of the thumbnail (if any)
5. URL of the complete news item

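Required libraries:
1. beautifulsoup4
2. requests
3. re (regular expressions, part of the Python standard library)
4. pandas
5. lxml (the parser passed to BeautifulSoup when the soup is first created)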

__beautifulsoup4__ is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

__requests__ allows us to send HTTP requests to a web server using Python. (HTTP messages consist of requests from client to server and responses from server to client.)

In [1]:
!pip install beautifulsoup4
!pip install requests
!pip install pandas
!pip install lxml



Some notes on HTML:
1. `<tag>...</tag>` represents a tag. Tags mark up the content: for example, a tag can specify a heading, bold text, italic text and so on.
2. An `<a>` tag is an anchor tag, which contains a link. The target of the link is given by its `href` attribute.
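As a small illustration of the notes above, the sketch below parses a made-up HTML fragment (it is not taken from the real site) and reads a tag name, its text and its `href` attribute. It uses Python's built-in `html.parser`, so no extra parser is assumed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment (not taken from the real site).
html = '<p>Read the <a href="http://example.com/story">full story</a> in <b>bold</b>.</p>'

# 'html.parser' ships with Python, so no extra parser is needed here.
soup = BeautifulSoup(html, 'html.parser')

anchor = soup.find('a')     # first <a> tag in the fragment
print(anchor.name)          # tag name
print(anchor.text)          # text between <a> and </a>
print(anchor['href'])       # attributes are read like dictionary keys
print(soup.find('b').text)  # text inside the <b> tag
```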

In [34]:
import bs4
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import sys

In [36]:
print('Python version: {}'.format(sys.version))
print('Beautiful Soup version : {}'.format(bs4.__version__))
print('Requests version: {}'.format(requests.__version__))
print('Pandas version : {}'.format(pd.__version__))

Python version: 3.7.7 (default, Mar 23 2020, 17:31:31) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
Beautiful Soup version : 4.9.1
Requests version: 2.24.0
Pandas version : 1.0.5


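BeautifulSoup accepts the HTML source as a string or as an open file. To parse a web page, we first download its source with requests and then pass the text to BeautifulSoup.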

In [3]:
# Get the source code. The get method will return a response object.
# With the '.text' we obtain the source code.
url = 'http://www.uraniabasket.it/news'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
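The download step above assumes that the request succeeds. A slightly more defensive variant could look like the sketch below; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML source as a string.

    `fetch_html` and the 10-second default are illustrative choices,
    not part of the original notebook.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

It could then be used as `source = fetch_html('http://www.uraniabasket.it/news')` before creating the soup.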

In [4]:
# We can format the source code with the prettify method.
print(soup.prettify())

<!-- w -->
<!DOCTYPE html>
<html>
 <!--
    _______________  ____    _________
    __  ___/__  /_ \/ /_ |  / /___  _/
    _____ \__  / __  /__ | / / __  /
    ____/ /_  /___  / __ |/ / __/ /
    /____/ /_____/_/  _____/  /___/
-->
 <head>
  <title>
  </title>
  <meta content="" name="description"/>
  <link href="" rel="icon" type="image/png"/>
  <meta content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
  <!-- Bootstrap -->
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" rel="stylesheet"/>
  <link href="https://cdnjs.cloudflare.com/ajax/libs/bxslider/4.2.5/jquery.bxslider.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/jquery.mcustomscrollbar/3.1.5/jquery.mCustomScrollbar.min.css" media="screen" rel="stylesheet"/>
  <link href="//maxcdn.bootstrapcdn.com/font-aw

Let's start by grabbing the first article. Visit the web page http://www.uraniabasket.it/news, move the cursor over the title of the first news item, right-click and select Inspect element. Now we look for the container of the first news item. In this case we can see that `<li class="row">` contains what we want.
    
<img src="fig/fig1.png" width="600">

We can also see that what we want is inside the `<ul class="news">`. We can use a filter to create a soup that contains just the news. This is important because, in our case, `<li class="row">` also appears in the header of the website.

In [5]:
all_news_filter = {'class': 'news'}
all_news = soup.find('ul', all_news_filter)
news = all_news.find('li')
print(news.prettify())

<li class="row">
 <div class="col-sm-4 col-md-4 nopadding news-thumb">
  <a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">
   <img class="img-responsive" src="https://slyvi-tstorage.fra1.digitaloceanspaces.com/m0_tml1961376069377_236153230549_1603705960744424.png" style="background-color:#eee;"/>
  </a>
 </div>
 <div class="col-sm-8 col-md-8 news-inner">
  <h3 style="margin-top:5px">
   <a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">
    Buon Compleanno Wayne
   </a>
  </h3>
  <p class="news-info">
   Pubblicata in
   <a class="category-link" href="http://www.uraniabasket.it/news/categories/346969318921/serie-a2">
    Serie A2
   </a>
   il 2020-10-26 10:51:00
  </p>
  <h4>
  </h4>
  <p>
   Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!

Happy Birthday Lefty...
  </p>
 </div>
</li>
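To see why restricting the search to `<ul class="news">` matters, here is a minimal sketch on a made-up page that has one `<li class="row">` in a menu and two in the news list. The markup is an assumption for illustration only. The last line shows the same restriction written as a CSS selector with `select`.

```python
from bs4 import BeautifulSoup

# Made-up page: one <li class="row"> in a menu, two inside <ul class="news">.
html = """
<ul class="menu"><li class="row">Home</li></ul>
<ul class="news">
  <li class="row">First news</li>
  <li class="row">Second news</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Searching the whole document returns the menu item first.
print(soup.find('li', {'class': 'row'}).text)

# Searching inside <ul class="news"> skips the menu.
news_list = soup.find('ul', {'class': 'news'})
print(news_list.find('li').text)

# The same restriction written as a CSS selector.
print([li.text for li in soup.select('ul.news > li.row')])
```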



Perfect, we get all the HTML for the first article. We can see from the output of the previous cell that all the information we need, except for the image, is inside the `<div class="col-sm-8 col-md-8 news-inner">`. So, we can filter for this specific `div`-`class` combination.

In [6]:
info_filter = {"class": "col-sm-8 col-md-8 news-inner"}
info_news = news.find('div', info_filter)
print(info_news)

<div class="col-sm-8 col-md-8 news-inner">
<h3 style="margin-top:5px"><a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">Buon Compleanno Wayne</a></h3>
<p class="news-info">Pubblicata in <a class="category-link" href="http://www.uraniabasket.it/news/categories/346969318921/serie-a2">Serie A2</a> il 2020-10-26 10:51:00</p>
<h4></h4>
<p>Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!

Happy Birthday Lefty...</p>
</div>


Now we can grab the title and the link:

In [7]:
title = info_news.find('h3').a.text
link = info_news.find('h3').a['href']
print(title)
print(link)

Buon Compleanno Wayne
http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne


Perfect! The date and the summary are inside two different `<p>` paragraphs. We can get both with the method find_all and then access each of them by position: the date is at position 0 and the summary at position 1.

In [8]:
paragraph = info_news.find_all('p')
date_ = paragraph[0].text
summary = paragraph[1].text
print(date_)
print(summary)

Pubblicata in Serie A2 il 2020-10-26 10:51:00
Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!

Happy Birthday Lefty...


OK, we are almost done, but the variable date_ also contains irrelevant text. We can use a simple regex to grab just the date.

In [9]:
date = re.findall(r'\d{4}-\d{1,2}-\d{1,2}', date_)[0]

The pattern we used is very simple: it searches for a digit __\d__ repeated exactly 4 times __{4}__ (the year), followed by a hyphen, followed by a digit repeated 1 or 2 times __{1,2}__ (the month), followed again by a hyphen and a digit repeated 1 or 2 times (the day).
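The pattern can be tried on its own, outside the scraper. The sketch below applies it to the text of the `news-info` paragraph shown earlier, guards against the case where nothing matches, and (as an optional extra step that the notebook does not perform) converts the result into a date object.

```python
import re
from datetime import datetime

# Text of the "news-info" paragraph shown in the earlier output.
raw = 'Pubblicata in Serie A2 il 2020-10-26 10:51:00'

# Raw string (r'...') so that \d reaches the regex engine unchanged.
pattern = r'\d{4}-\d{1,2}-\d{1,2}'

match = re.search(pattern, raw)
# re.search returns None when nothing matches, so guard before using it.
date_text = match.group(0) if match else None
print(date_text)

# Optional: turn the string into a real date object for sorting or filtering.
if date_text is not None:
    parsed = datetime.strptime(date_text, '%Y-%m-%d').date()
    print(parsed.year, parsed.month, parsed.day)
```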

Now we have exactly what we need, except for the URL of the thumbnail, if there is one. Let's print the news HTML to see where it is located.

In [10]:
news

<li class="row">
<div class="col-sm-4 col-md-4 nopadding news-thumb">
<a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne"><img class="img-responsive" src="https://slyvi-tstorage.fra1.digitaloceanspaces.com/m0_tml1961376069377_236153230549_1603705960744424.png" style="background-color:#eee;"/></a>
</div>
<div class="col-sm-8 col-md-8 news-inner">
<h3 style="margin-top:5px"><a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">Buon Compleanno Wayne</a></h3>
<p class="news-info">Pubblicata in <a class="category-link" href="http://www.uraniabasket.it/news/categories/346969318921/serie-a2">Serie A2</a> il 2020-10-26 10:51:00</p>
<h4></h4>
<p>Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!

Happy Birthday Lefty...</p>
</div>
</li>

We can access it with news.a.img and ask for the URL as we would with a dictionary. Remember that we are looking for a thumbnail IF ANY, so we must make sure that the code does not break when the current news item has no image. We can handle this case with __try__:

In [11]:
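try:
    url_thumbnail = news.a.img['src']
    print(url_thumbnail)
except (AttributeError, TypeError, KeyError):
    # No <a> tag, no <img> inside it, or no 'src' attribute
    url_thumbnail = ''
    print('Thumbnail not present')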

https://slyvi-tstorage.fra1.digitaloceanspaces.com/m0_tml1961376069377_236153230549_1603705960744424.png
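An alternative to the try block is to test explicitly whether an image was found. The sketch below uses two made-up news items, and the helper name `thumbnail_url` is hypothetical. Note that it looks for the first `<img>` anywhere in the item, which is slightly broader than `news.a.img`.

```python
from bs4 import BeautifulSoup

# Two made-up news items: one with a thumbnail, one without.
with_img = BeautifulSoup('<li><a href="#"><img src="thumb.png"/></a></li>', 'html.parser')
without_img = BeautifulSoup('<li><a href="#">No picture</a></li>', 'html.parser')


def thumbnail_url(item):
    """Return the src of the first <img> inside item, or '' if there is none.

    `thumbnail_url` is a hypothetical helper name used for illustration.
    """
    img = item.find('img')  # None when the item has no <img>
    return img.get('src', '') if img is not None else ''


print(thumbnail_url(with_img))
print(repr(thumbnail_url(without_img)))
```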


Now we have all that we need! Let's print all the info that we want to store:

In [12]:
print(title)
print(link)
print(date)
print(summary)
print(url_thumbnail)

Buon Compleanno Wayne
http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne
2020-10-26
Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!

Happy Birthday Lefty...
https://slyvi-tstorage.fra1.digitaloceanspaces.com/m0_tml1961376069377_236153230549_1603705960744424.png
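The individual steps so far could be gathered into one function that takes a news item and returns its five fields. This is only a sketch: the name `parse_news_item` is hypothetical, the sample HTML is a shortened copy of the item printed earlier, and `thumb.png` is a placeholder thumbnail path.

```python
import re
from bs4 import BeautifulSoup


def parse_news_item(item):
    """Extract the five fields from one <li class="row"> element.

    `parse_news_item` is a hypothetical helper; the selectors mirror
    the ones used in the cells above.
    """
    info = item.find('div', {'class': 'news-inner'})
    heading = info.find('h3').a
    paragraphs = info.find_all('p')
    date_match = re.search(r'\d{4}-\d{1,2}-\d{1,2}', paragraphs[0].text)
    img = item.find('img')
    return {
        'Title': heading.text.strip(),
        'Link': heading['href'],
        'Date': date_match.group(0) if date_match else '',
        'Summary': paragraphs[1].text.strip(),
        'Thumbnail': img.get('src', '') if img is not None else '',
    }


# Shortened copy of the first news item, with a placeholder thumbnail path.
sample = """
<li class="row">
 <div class="col-sm-4 col-md-4 nopadding news-thumb">
  <a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">
   <img class="img-responsive" src="thumb.png"/>
  </a>
 </div>
 <div class="col-sm-8 col-md-8 news-inner">
  <h3><a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">Buon Compleanno Wayne</a></h3>
  <p class="news-info">Pubblicata in Serie A2 il 2020-10-26 10:51:00</p>
  <p>Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!</p>
 </div>
</li>
"""
item = BeautifulSoup(sample, 'html.parser').find('li')
print(parse_news_item(item))
```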


Great! We have seen how to scrape the information for a single news item, but how can we get the same information for all the news? Instead of the method __find__ we need __find_all__.

In [13]:
all_news_filter = {'class': 'news'}
all_news = soup.find('ul', all_news_filter)
news = all_news.find_all('li')
print(news)

[<li class="row">
<div class="col-sm-4 col-md-4 nopadding news-thumb">
<a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne"><img class="img-responsive" src="https://slyvi-tstorage.fra1.digitaloceanspaces.com/m0_tml1961376069377_236153230549_1603705960744424.png" style="background-color:#eee;"/></a>
</div>
<div class="col-sm-8 col-md-8 news-inner">
<h3 style="margin-top:5px"><a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">Buon Compleanno Wayne</a></h3>
<p class="news-info">Pubblicata in <a class="category-link" href="http://www.uraniabasket.it/news/categories/346969318921/serie-a2">Serie A2</a> il 2020-10-26 10:51:00</p>
<h4></h4>
<p>Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!

Happy Birthday Lefty...</p>
</div>
</li>, <li class="row">
<div class="col-sm-4 col-md-4 nopadding news-thumb">
<a href="http://www.uraniabasket.it/news/441556831750/vittoria-per-la-storia-77-84-a-piacenza-e-final8-per-urania"><i

Now the variable news contains all the news on the current page, and we can access each item by its position in the list.

The number of news items on the current page is:

In [14]:
print(len(news))

10
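The difference between the two methods is easy to check on a made-up list (the HTML below is an assumption for illustration): `find` returns a single tag, or `None` when nothing matches, while `find_all` returns a list-like result that may be empty.

```python
from bs4 import BeautifulSoup

html = '<ul class="news"><li>one</li><li>two</li><li>three</li></ul>'  # made-up list
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li')      # a single Tag (or None if nothing matches)
every = soup.find_all('li')  # a list-like ResultSet, possibly empty

print(first.text)
print(len(every))
print([li.text for li in every])

# A search with no match shows the two different "empty" results.
print(soup.find('table'))
print(soup.find_all('table'))
```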


We can see that the news item we scraped before is in the first position.

In [15]:
print(news[0])

<li class="row">
<div class="col-sm-4 col-md-4 nopadding news-thumb">
<a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne"><img class="img-responsive" src="https://slyvi-tstorage.fra1.digitaloceanspaces.com/m0_tml1961376069377_236153230549_1603705960744424.png" style="background-color:#eee;"/></a>
</div>
<div class="col-sm-8 col-md-8 news-inner">
<h3 style="margin-top:5px"><a href="http://www.uraniabasket.it/news/166678924806/buon-compleanno-wayne">Buon Compleanno Wayne</a></h3>
<p class="news-info">Pubblicata in <a class="category-link" href="http://www.uraniabasket.it/news/categories/346969318921/serie-a2">Serie A2</a> il 2020-10-26 10:51:00</p>
<h4></h4>
<p>Urania Basket Milano augura un felice e sereno Compleanno a Wayne Langston!

Happy Birthday Lefty...</p>
</div>
</li>


To scrape the news from all the pages, we can simply modify the URL to change the page number iteratively. Before doing so, we need to create a DataFrame to store the information for all the news.

In [16]:
columns_name = ['Title', 'Link', 'Date', 'Summary', 'Thumbnail']
news_df = pd.DataFrame(columns=columns_name)
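Starting from an empty DataFrame and adding one row at a time works, but every addition builds a new table. A common alternative, sketched here with made-up rows and a hypothetical variable name `demo_df`, is to collect the rows in a plain list and build the DataFrame once at the end.

```python
import pandas as pd

columns_name = ['Title', 'Link', 'Date', 'Summary', 'Thumbnail']

# Made-up rows standing in for scraped news items.
rows = [
    {'Title': 'First', 'Link': 'http://example.com/1', 'Date': '2020-10-26',
     'Summary': 'Text one', 'Thumbnail': ''},
    {'Title': 'Second', 'Link': 'http://example.com/2', 'Date': '2020-10-25',
     'Summary': 'Text two', 'Thumbnail': 'thumb.png'},
]

# Build the table in one step; `columns` fixes the column order.
demo_df = pd.DataFrame(rows, columns=columns_name)
print(demo_df.shape)
print(list(demo_df.columns))
```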

Now we can loop over all the news in all the pages. To do so, we use a while loop that runs until the loop flag becomes `False`. By inspecting the website, we can see that all the pages have the same number of news items except for the last one. So, we can set the loop flag to `False` when the number of news items on the current page differs from that of the previous page.

In [17]:
# Specify the URL with page number as a variable
url = 'http://www.uraniabasket.it/news?page={number}'

number = 1  # initial page number

loop = True  # initialization for starting loop

# While loop to go through all the news pages
while loop:
    # Send a get request to the website and get the text from the response
    response = requests.get(url.format(number=number)).text

    # Parse HTML document
    soup = BeautifulSoup(response, "html.parser")

    # Search for the first <ul> tag with class 'news'
    all_news_filter = {'class': 'news'}
    all_news = soup.find('ul', all_news_filter)

    # Search for all the news to get the total number of the current page
    news = all_news.find_all('li')
    news_number = len(news)  # number of news in current page

    # Loop over all the news inside the current page:
    for n in range(news_number):
        curr_news = news[n]

        # Get information from tag <div> with attribute 'col-sm-8 col-md-8 news-inner'
        info_filter = {"class": "col-sm-8 col-md-8 news-inner"}
        info_news = curr_news.find('div', info_filter)

        # Grab title
        title = info_news.find('h3').a.text

        # Grab link
        link = info_news.find('h3').a['href']

        # Find all paragraphs
        paragraph = info_news.find_all('p')

        # Grab date. We use regex to clean for irrelevant text
        date_ = paragraph[0].text
        date = re.findall(r'\d{4}-\d{1,2}-\d{1,2}', date_)[0]

        # Grab summary
        summary = paragraph[1].text

        # Grab thumbnail if any
        try:
            url_thumbnail = curr_news.a.img['src']
        except (AttributeError, TypeError, KeyError):
            # Thumbnail not present
            url_thumbnail = ''

        # Update DataFrame with current news
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        new_row = pd.DataFrame([[title, link, date, summary, url_thumbnail]],
                               columns=columns_name)
        news_df = pd.concat([news_df, new_row], ignore_index=True)
    # Check if current page is last page
    if number > 1:
        if news_number != old_news_number:
            loop = False

    number += 1  # new page
    old_news_number = news_number
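The stopping rule used in the loop above can be tested in isolation, without touching the website. The sketch below is a simulation under stated assumptions: `count_pages` and `items_on_page` are hypothetical names, and the extra check for an empty page covers the case, not handled in the loop above, where the last page happens to be as full as the others. How the real site answers for a page number past the end has not been verified here.

```python
def count_pages(items_on_page, first_page=1):
    """Return the last page number that holds news under the stopping rule.

    `items_on_page` is a hypothetical callable: page number -> number of news
    items on that page. Scanning stops when a page is empty or when its item
    count differs from the previous page's count.
    """
    number = first_page
    previous = None
    while True:
        current = items_on_page(number)
        if current == 0:
            return number - 1  # ran past the end of the site
        if previous is not None and current != previous:
            return number      # page with a different count (the last one) reached
        previous = current
        number += 1


# Simulated site: 3 full pages of 10 items, then a last page with 4 items.
fake_site = {1: 10, 2: 10, 3: 10, 4: 4}
print(count_pages(lambda n: fake_site.get(n, 0)))

# Simulated site where the last page is also full: the empty-page check ends the scan.
full_site = {1: 10, 2: 10}
print(count_pages(lambda n: full_site.get(n, 0)))
```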

In [18]:
news_df

| | Title | Link | Date | Summary | Thumbnail |
|---|---|---|---|---|---|
| 0 | Buon Compleanno Wayne | http://www.uraniabasket.it/news/166678924806/b... | 2020-10-26 | Urania Basket Milano augura un felice e sereno... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| 1 | Vittoria per la storia, 77-84 a Piacenza e Fin... | http://www.uraniabasket.it/news/441556831750/v... | 2020-10-25 | Altra prova autoritaria di Urania che passa a ... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| 2 | Preview Piacenza - Milano | http://www.uraniabasket.it/news/29239971334/pr... | 2020-10-24 | Ultima giornata del girone di qualificazione d... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| 3 | Programma allenamenti 26 ottobre - 1 novembre | http://www.uraniabasket.it/news/304117878278/p... | 2020-10-23 | Dopo il nuovo DPCM del 25/10 rimaniamo in atte... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| 4 | La FIP di Roma posticipa l'inizio dei Campiona... | http://www.uraniabasket.it/news/243988336134/l... | 2020-10-23 | Si riceve dalla FIP di Roma:\n\nRinvio inizio ... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| ... | ... | ... | ... | ... | ... |
| 346 | Under 14 Elite: ABA Legnano - Urania Milano 74-68 | http://www.uraniabasket.it/news/160773344774/u... | 2019-10-27 | Campionato Under 14 Elite - 4° giornata di and... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| 347 | Under 16 Eccellenza: Urania Milano - Bernaregg... | http://www.uraniabasket.it/news/435651251718/u... | 2019-10-27 | Campionato Under 16 Eccellenza - 3° giornata d... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| 348 | Under 13 Regionale: Trezzano Basket - Urania E... | http://www.uraniabasket.it/news/23334391302/un... | 2019-10-27 | Campionato Under 13 Regionale - 3° giornata gi... | https://slyvi-tstorage.fra1.digitaloceanspaces... |
| 349 | Under 13 Top: Urania Milano - EA7 Olimpia 38-90 | http://www.uraniabasket.it/news/298212298246/u... | 2019-10-27 | Campionato Under 13 Top - 3° giornata di andat... | https://slyvi-tstorage.fra1.digitaloceanspaces... |


We did it! As we can see, the last news item in the DataFrame is _Allianz Cloud ancora tabù per i Wildcats, Ferrara beffa l’Urania 71-72_, which is exactly the last news item on the website, on page 36.

N.B. The news is continually updated, so by the time you read this notebook the last news item or the total number of pages may be different.