Skip to content
Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
756 lines (624 sloc) 22.1 KB

Web scrapping the footbal calendar to a custom Excel file



This blog is a spin-off my Master's Degree Thesis at Universidad Internacional de La Rioja



In this post I will make use of the Python library, Beautiful Soap 4, to scrap content on the web to build an excel with all the footbal matches of the Spanish Football League.


Because some of the function names have been implemented in spanish, I introduced comments for the translation.

This demo is quite simple, and is ment to just go over the functionalities of Beatiful Soap.
In my GitHub repository there are scripts for going a step further and create a complete statistics database:

  • Save the teams and players for every season in the history
  • Save the matches, results and statistics of every match in the history

Plan of attack

  1 - Source: Analyze the platform from where we will extract the data
  2 - Exploration: Analyze how the information is structured on that platforms - the html body
  3 - Extraction: Retrieve the data using Beatiful Soap 4
  4 - Export: save the resulting pandas Dataframe object into an Excel file



1 - Source


The first thing to look at, is the source where we will get the data from. In this case, we will make use of one of the most popular spanish football newspapers, Marca.


In this picture, we see how we can make use of the inspector to understand how the content of that particular website is organized. This is a mandatory step, since we are going to make use of that structure to navigate over it and extract the information in the order we want.

Another possible way to do what we are going to achieve is by using graphical interfaces, but I find them even less intuitive and less flexible than using beautifulsoap and define you behavior on your scripts. However, here is a screenshot of ParseHub, that I believe helps to understand what we want to achieve later on.



Intuitively, we can see how we are mapping the different html pieces in a hierarchical order. The headers of each column whithin the big table will represent the round on the calendar and every row within that column contains the information of the whole match. Obviously, it is required also a high text processing to define how exactly you want your final output to be.

2 - Explore data


We are now ready to explore the website according to the observed structure. The very first thing, like in any other data science project, is to import the necessary libraries. We will import all of them now and then explain when each one is used an how.

By now, the important thing to know is that we need to import bs4, urllib and utils.py.

  • bs4 is the Beautiful Soap 4 module itself
  • urrlib will help us to access website and parse its html content
  • utils.py contains functions for the mentions text processing
  • patterns is called from the utils.py. It contains information about the names of the teams and stopwords we want to get rid of. These are, for instance: FC, SD, CA, RC, Club...
import os
root = './'
os.chdir(root)

if os.path.exists(root):
    path_to_data = os.path.join(root, 'Datos/Scrapped')
    path_to_save = os.path.join(root, 'Datos/Created')


import pandas as pd
from datetime import datetime

from bs4 import BeautifulSoup as BS
from urllib.request import urlopen as uOpen

# IMPORT HELPER FUNCTIONS
from utils import limpiar_nombre, buscar_equivalencia

Define how we will store the data

We create an empty pandas dataframe to define how we want to store the information

# MATCH MODELS
partidos_df = pd.DataFrame(columns=['Round', 'Match #', 'Home team', 'Away team'])

3 - Retrieve the html content from the source

m_url = 'http://www.marca.com/futbol/primera-division/calendario.html'
page_soup = BS(uOpen(m_url).read(), 'html.parser')
page_soup = BS(uOpen(m_url).read(), 'html.parser')

We are telling BS to read the content open by urlope from urllib.request. This brings all the html content of the page.

Then, we are interested in retrieve each of the rounds. If we take a look at the html again to see what are we looking for:

We see how each of the rounds is defined inside a div element which classis: jornada calendarioInternacional

Therefore, we use the BS method find_all():

rounds = page_soup.find_all('div',{'class': 'jornada calendarioInternacional'})
print(len(rounds))
38

We check that there are indeed 38 rounds. And if we take a look at the content of any of them, for instance the first one, we could see the retreived html and check Girona - Valladolid is the first of the matches of the first round

rounds[0]
<div class="jornada calendarioInternacional">
<div class="cal-agendas calendario">
<div class="jornada datos-jornada">
<a class="ir-arriba" href="#top">Ir arriba</a>
<table cellpadding="0" cellspacing="0" class="jor agendas" id="jornada1" summary="Todos los resultados de la jornada">
<caption>Jornada 1</caption>
<thead>
<tr>
<th scope="col">Equipo local</th>
<th scope="col">Resultado</th>
<th scope="col">Equipo visitante</th>
</tr>
</thead>
<tbody>
<tr>
<td class="local">
<figure>
<img alt="Girona" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/2893.png">
</img></figure>
<span class="equipo_t2893">Girona</span>
</td>
<td class="resultado"><span class="resultado-partido">0-0</span></td>
<td class="visitante">
<span class="equipo_t192">Valladolid</span>
<figure>
<img alt="Valladolid" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/192.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Betis" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/185.png"/>
</figure>
<span class="equipo_t185">Betis</span>
</td>
<td class="resultado"><span class="resultado-partido">0-3</span></td>
<td class="visitante">
<span class="equipo_t855">Levante</span>
<figure>
<img alt="Levante" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/855.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Celta" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/176.png"/>
</figure>
<span class="equipo_t176">Celta</span>
</td>
<td class="resultado"><span class="resultado-partido">1-1</span></td>
<td class="visitante">
<span class="equipo_t177">Espanyol</span>
<figure>
<img alt="Espanyol" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/177.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Villarreal" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/449.png"/>
</figure>
<span class="equipo_t449">Villarreal</span>
</td>
<td class="resultado"><span class="resultado-partido">1-2</span></td>
<td class="visitante">
<span class="equipo_t188">R. Sociedad</span>
<figure>
<img alt="R. Sociedad" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/188.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Barcelona" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/178.png"/>
</figure>
<span class="equipo_t178">Barcelona</span>
</td>
<td class="resultado"><span class="resultado-partido">3-0</span></td>
<td class="visitante">
<span class="equipo_t173">Alavés</span>
<figure>
<img alt="Alavés" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/173.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Eibar" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/953.png"/>
</figure>
<span class="equipo_t953">Eibar</span>
</td>
<td class="resultado"><span class="resultado-partido">1-2</span></td>
<td class="visitante">
<span class="equipo_t2894">Huesca</span>
<figure>
<img alt="Huesca" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/2894.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Rayo" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/184.png"/>
</figure>
<span class="equipo_t184">Rayo</span>
</td>
<td class="resultado"><span class="resultado-partido">1-4</span></td>
<td class="visitante">
<span class="equipo_t179">Sevilla</span>
<figure>
<img alt="Sevilla" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/179.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Real Madrid" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/186.png"/>
</figure>
<span class="equipo_t186">Real Madrid</span>
</td>
<td class="resultado"><span class="resultado-partido">2-0</span></td>
<td class="visitante">
<span class="equipo_t1450">Getafe</span>
<figure>
<img alt="Getafe" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/1450.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Valencia" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/191.png"/>
</figure>
<span class="equipo_t191">Valencia</span>
</td>
<td class="resultado"><span class="resultado-partido">1-1</span></td>
<td class="visitante">
<span class="equipo_t175">Atlético</span>
<figure>
<img alt="Atlético" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/175.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Athletic" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/174.png"/>
</figure>
<span class="equipo_t174">Athletic</span>
</td>
<td class="resultado"><span class="resultado-partido">2-1</span></td>
<td class="visitante">
<span class="equipo_t957">Leganés</span>
<figure>
<img alt="Leganés" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/957.png"/>
</figure>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>

We see how the round number is stored within the <caption> element.
We can then store this first piece of information as:

rounds[0].caption.text
'Jornada 1'

And we can also check that the information for the particular round is contain in the round -> table -> tbody element:

rounds[0].table.tbody
<tbody>
<tr>
<td class="local">
<figure>
<img alt="Girona" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/2893.png">
</img></figure>
<span class="equipo_t2893">Girona</span>
</td>
<td class="resultado"><span class="resultado-partido">0-0</span></td>
<td class="visitante">
<span class="equipo_t192">Valladolid</span>
<figure>
<img alt="Valladolid" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/192.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Betis" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/185.png"/>
</figure>
<span class="equipo_t185">Betis</span>
</td>
<td class="resultado"><span class="resultado-partido">0-3</span></td>
<td class="visitante">
<span class="equipo_t855">Levante</span>
<figure>
<img alt="Levante" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/855.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Celta" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/176.png"/>
</figure>
<span class="equipo_t176">Celta</span>
</td>
<td class="resultado"><span class="resultado-partido">1-1</span></td>
<td class="visitante">
<span class="equipo_t177">Espanyol</span>
<figure>
<img alt="Espanyol" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/177.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Villarreal" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/449.png"/>
</figure>
<span class="equipo_t449">Villarreal</span>
</td>
<td class="resultado"><span class="resultado-partido">1-2</span></td>
<td class="visitante">
<span class="equipo_t188">R. Sociedad</span>
<figure>
<img alt="R. Sociedad" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/188.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Barcelona" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/178.png"/>
</figure>
<span class="equipo_t178">Barcelona</span>
</td>
<td class="resultado"><span class="resultado-partido">3-0</span></td>
<td class="visitante">
<span class="equipo_t173">Alavés</span>
<figure>
<img alt="Alavés" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/173.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Eibar" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/953.png"/>
</figure>
<span class="equipo_t953">Eibar</span>
</td>
<td class="resultado"><span class="resultado-partido">1-2</span></td>
<td class="visitante">
<span class="equipo_t2894">Huesca</span>
<figure>
<img alt="Huesca" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/2894.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Rayo" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/184.png"/>
</figure>
<span class="equipo_t184">Rayo</span>
</td>
<td class="resultado"><span class="resultado-partido">1-4</span></td>
<td class="visitante">
<span class="equipo_t179">Sevilla</span>
<figure>
<img alt="Sevilla" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/179.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Real Madrid" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/186.png"/>
</figure>
<span class="equipo_t186">Real Madrid</span>
</td>
<td class="resultado"><span class="resultado-partido">2-0</span></td>
<td class="visitante">
<span class="equipo_t1450">Getafe</span>
<figure>
<img alt="Getafe" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/1450.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Valencia" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/191.png"/>
</figure>
<span class="equipo_t191">Valencia</span>
</td>
<td class="resultado"><span class="resultado-partido">1-1</span></td>
<td class="visitante">
<span class="equipo_t175">Atlético</span>
<figure>
<img alt="Atlético" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/175.png"/>
</figure>
</td>
</tr>
<tr>
<td class="local">
<figure>
<img alt="Athletic" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/174.png"/>
</figure>
<span class="equipo_t174">Athletic</span>
</td>
<td class="resultado"><span class="resultado-partido">2-1</span></td>
<td class="visitante">
<span class="equipo_t957">Leganés</span>
<figure>
<img alt="Leganés" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/957.png"/>
</figure>
</td>
</tr>
</tbody>

Where the matches will be inside each td element of the body of the table

rounds[0].table.tbody.td
<td class="local">
<figure>
<img alt="Girona" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/2893.png">
</img></figure>
<span class="equipo_t2893">Girona</span>
</td>

Then, we can catch all the matches within a round by:

r = rounds[0]
matches = r.findAll('tr')
matches[1]
<tr>
<td class="local">
<figure>
<img alt="Girona" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/2893.png">
</img></figure>
<span class="equipo_t2893">Girona</span>
</td>
<td class="resultado"><span class="resultado-partido">0-0</span></td>
<td class="visitante">
<span class="equipo_t192">Valladolid</span>
<figure>
<img alt="Valladolid" src="https://e00-marca.uecdn.es/assets/sports/logos/football/png/72x72/192.png"/>
</figure>
</td>
</tr>

To capture now the names of the local team and the away team, as we did for the round number:

match = matches[1]
loc  = match.find('td', {'class': 'local'}).span.text
away = match.find('td', {'class': 'visitante'}).span.text
print(loc)
print(away)
Girona
Valladolid

If we define this in a loop of matches inside a loop of rounds, we can retreive the entire site!

def crear_calendario(temp, path):
    
    global partidos_df # Use the defined empty dataframe
    page_soup = BS(uOpen(path).read(), 'html.parser')
    rounds = page_soup.find_all('div',{'class': 'jornada calendarioInternacional'})
    
    for r in rounds:
          
        rnd = r.caption.text        # Get the name of the round i.e. Jornada 1
        matches = r.findAll('tr')   # Find all the matches in that round
        
        for j, match in enumerate(matches[1:]):
            
            loc  = match.find('td', {'class': 'local'}).span.text
            away = match.find('td', {'class': 'visitante'}).span.text
            loc, away = limpiar_nombre(loc), limpiar_nombre(away)        
            loc, away = buscar_equivalencia(loc), buscar_equivalencia(away)
            
            res = pd.DataFrame([[rnd, j+1, loc, away]], columns=list(partidos_df))
            partidos_df = partidos_df.append(res)
        
    partidos_df = partidos_df.reset_index()
    partidos_df.drop('index', axis=1, inplace=True)
    

crear_calendario('2018-2019', m_url)
partidos_df.head(5)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Round Match # Home team Away team
0 Jornada 1 1 Girona Valladolid
1 Jornada 1 2 Betis Levante
2 Jornada 1 3 Celta Espanyol
3 Jornada 1 4 Villarreal Real Sociedad
4 Jornada 1 5 Barcelona Alavés
partidos_df.tail()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Round Match # Home team Away team
375 Jornada 38 6 Huesca Leganés
376 Jornada 38 7 Levante Atlético
377 Jornada 38 8 Real Madrid Betis
378 Jornada 38 9 Valladolid Valencia
379 Jornada 38 10 Sevilla Athletic

4 - Export to Excel

# Export to Excel
# ---------------
partidos_writer = pd.ExcelWriter(path_to_save + '/matches_df.xlsx', engine='xlsxwriter')    
partidos_df.to_excel(partidos_writer, sheet_name='Matches_2018_2019')
partidos_writer.save()
You can’t perform that action at this time.