# Web scraping - Men's Six Nations Rugby results 2000 - today

This notebook uses web scraping to collect the results of every Men's Six Nations rugby fixture since the tournament's creation, and can be rerun each year to update the dataset. The website scraped is Wikipedia, specifically the results page for each edition from 2000 (when the modern Six Nations was created with the introduction of the Italian national team) to today.

The dataset is available on Kaggle [Rugby - 6 Nations Results 2000-2024](https://www.kaggle.com/datasets/simfour/rugby-6-nations-results-2000-2024)

## Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Web scraping

In [3]:
# Update end_year each season, then rerun the notebook to add the latest results
begin_year = 2000
end_year = 2025

In [4]:
# Creating blank dataframe
columns = ['Year','Date','HomeTeam','HomeBonus','AwayTeam','AwayBonus','HomeScore','AwayScore','Stadium']
df = pd.DataFrame(columns=columns)

# Iterating over all result pages from Wikipedia
for year in range(begin_year,end_year+1):

    url = f'https://en.wikipedia.org/wiki/{year}_Six_Nations_Championship'
    response = requests.get(url)
    html_content = response.text
    soup = BeautifulSoup(html_content,'html.parser')
    div_event = soup.find_all(class_="vevent summary")

    for k,div in enumerate(div_event):
        tables = div.find_all('table')
        new_match = [year]
        for i, table in enumerate(tables):
            # Date
            if i == 0:
                for texts in table:
                    date = pd.to_datetime(str(texts).split('>')[3].split('<')[0])
                    new_match.append(date)

            # Teams, bonus points and scores (NB: bonus points only appear from the 2017 edition onwards)
            if i == 1:
                # Teams and bonus points
                teams = table.find_all(class_='vcard')
                for j,team in enumerate(teams):
                    if j == 0:
                        new_match.append(team.find('a').text.strip())
                        home_team = team.text.strip()
                        if home_team.find('(') != -1:
                            new_match.append(home_team.split('(')[1][0])
                        else:
                            new_match.append(0)
                    else:
                        new_match.append(team.find('a').text.strip())
                        away_team = team.text.strip()
                        if away_team.find('(') != -1:
                            new_match.append(away_team.split('(')[1][0])
                        else:
                            new_match.append(0)

                # Scores
                score = table.find(attrs={'style':'width:22%'}).text.split('–')
                if len(score) > 1:
                    new_match.append(score[0])
                    new_match.append(score[1])
                else:
                    new_match.append('P') # Match postponed
                    new_match.append('P')

            # Stadium
            if i == 2:
                stadium = table.find(class_='location')
                links = stadium.find('a')
                #print(f"Stadium : {links.text}")
                new_match.append(links.text)

        match_df = pd.DataFrame(data=[new_match],columns=columns)
        df = pd.concat([df,match_df],ignore_index=True)

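
The bonus and score parsing inside the loop can be isolated into small helpers that are easier to test. This is only a sketch: `split_bonus` and `split_score` are hypothetical names, and the sample strings (`'France (1 BP)'`, `'34–20'`) are assumptions about how the Wikipedia cells read once their text is extracted. Unlike the loop above, the helpers return integers, and `None` instead of `'P'` when no score is available.

```python
import re

def split_bonus(cell_text):
    """Return (team, bonus) from the text of a team cell.

    The bonus is the first digit after an opening parenthesis,
    or 0 when the cell has no parenthesis (editions before 2017).
    """
    match = re.search(r'\((\d)', cell_text)
    bonus = int(match.group(1)) if match else 0
    # Remove the parenthesised part to keep only the team name
    team = re.sub(r'\(.*?\)', '', cell_text).strip()
    return team, bonus

def split_score(score_text):
    """Return (home, away) as integers, or None when there is no score."""
    parts = [part.strip() for part in score_text.split('–')]  # en dash
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        return int(parts[0]), int(parts[1])
    return None

home_team, home_bonus = split_bonus('France (1 BP)')
score = split_score('34–20')
```
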


In [5]:
# Displaying result
df

Unnamed: 0,Year,Date,HomeTeam,HomeBonus,AwayTeam,AwayBonus,HomeScore,AwayScore,Stadium
0,2000,2000-02-05,Italy,0,Scotland,0,34,20,Stadio Flaminio
1,2000,2000-02-05,England,0,Ireland,0,50,18,Twickenham Stadium
2,2000,2000-02-05,Wales,0,France,0,3,36,Millennium Stadium
3,2000,2000-02-19,Wales,0,Italy,0,47,16,Millennium Stadium
4,2000,2000-02-19,France,0,England,0,9,15,Stade de France
...,...,...,...,...,...,...,...,...,...
387,2025,2025-03-08,Scotland,1,Wales,2,35,29,Murrayfield Stadium
388,2025,2025-03-09,England,1,Italy,0,47,24,Twickenham Stadium
389,2025,2025-03-15,Italy,1,Ireland,1,17,22,Stadio Olimpico
390,2025,2025-03-15,Wales,0,England,1,14,68,Millennium Stadium


## Cleaning additional games

**Two games were postponed and are marked as "P" in the score columns: dropping those two rows.**

In [6]:
df[df['HomeScore']=='P']

Unnamed: 0,Year,Date,HomeTeam,HomeBonus,AwayTeam,AwayBonus,HomeScore,AwayScore,Stadium
184,2012,2012-02-11,France,0,Ireland,0,P,P,Stade de France
324,2021,2021-02-28,France,0,Scotland,0,P,P,Stade de France


In [7]:
drop_index = df[df['HomeScore']=='P'].index
df = df.drop(drop_index,axis=0)
df = df.reset_index(drop=True)
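
The same cleaning step can be written as a single boolean filter that keeps the played games instead of dropping the postponed ones. A minimal sketch on a toy dataframe (the three rows are illustrative, not the scraped data):

```python
import pandas as pd

# Toy data: one postponed game marked 'P' and two played games
toy = pd.DataFrame({
    'HomeTeam': ['France', 'England', 'France'],
    'HomeScore': ['P', '50', '9'],
    'AwayScore': ['P', '18', '15'],
})

# Keep only the rows whose score is not the 'P' marker, then renumber the index
played = toy[toy['HomeScore'] != 'P'].reset_index(drop=True)
```
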

## Some info about dataset

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Year       390 non-null    object        
 1   Date       390 non-null    datetime64[ns]
 2   HomeTeam   390 non-null    object        
 3   HomeBonus  390 non-null    object        
 4   AwayTeam   390 non-null    object        
 5   AwayBonus  390 non-null    object        
 6   HomeScore  390 non-null    object        
 7   AwayScore  390 non-null    object        
 8   Stadium    390 non-null    object        
dtypes: datetime64[ns](1), object(8)
memory usage: 27.6+ KB


Numeric columns have the `object` dtype because most scraped values are strings: converting them to integers

In [11]:
num_cols = ['Year', 'HomeBonus', 'AwayBonus', 'HomeScore', 'AwayScore']

for col in num_cols:
    df[col] = df[col].astype('int32')
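
The column-by-column loop can also be expressed as one `astype` call with a dictionary mapping each column to its target type. A sketch on a toy dataframe, assuming (as in the scraped data) that the columns mix integers and digit strings:

```python
import pandas as pd

# Toy data stored as objects, mixing integers and digit strings
toy = pd.DataFrame({
    'Year': [2000, 2025],
    'HomeBonus': [0, '1'],
    'HomeScore': ['34', '35'],
}, dtype=object)

num_cols = ['Year', 'HomeBonus', 'HomeScore']

# One call converts every listed column; other columns are left untouched
converted = toy.astype({col: 'int32' for col in num_cols})
```

Note that `astype` raises an error if a value cannot be converted, which is why the postponed games have to be removed first.
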

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Year       390 non-null    int32         
 1   Date       390 non-null    datetime64[ns]
 2   HomeTeam   390 non-null    object        
 3   HomeBonus  390 non-null    int32         
 4   AwayTeam   390 non-null    object        
 5   AwayBonus  390 non-null    int32         
 6   HomeScore  390 non-null    int32         
 7   AwayScore  390 non-null    int32         
 8   Stadium    390 non-null    object        
dtypes: datetime64[ns](1), int32(5), object(3)
memory usage: 19.9+ KB


## Exporting dataset to csv file

In [13]:
df.to_csv('rugby_six_nations.csv', index=False)
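
A CSV file does not store column types, so the `Date` column comes back as text unless `parse_dates` is passed when reading the file. A small sketch of the round trip, using an in-memory buffer instead of `rugby_six_nations.csv` and a one-row toy dataframe:

```python
import io
import pandas as pd

# Toy data with a real datetime column
toy = pd.DataFrame({
    'Year': [2000],
    'Date': pd.to_datetime(['2000-02-05']),
    'HomeTeam': ['Italy'],
    'HomeScore': [34],
})

# Write to an in-memory buffer, as to_csv would write to a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without parse_dates keeps the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime type
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['Date'])
```
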

## Displaying all Crunches (France vs. England games) from the beginning

What is your best memory from this list?
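
The France vs. England filter used in this section can also be written with `Series.isin`, which gives the same rows since a team never plays itself. A sketch on a toy dataframe (the four fixtures are illustrative):

```python
import pandas as pd

# Toy fixtures: only the first two are France vs. England games
toy = pd.DataFrame({
    'HomeTeam': ['France', 'England', 'Wales', 'England'],
    'AwayTeam': ['England', 'France', 'France', 'Ireland'],
})

rivals = ['England', 'France']

# Both the home and the away team must be one of the two rivals
crunches = toy[toy['HomeTeam'].isin(rivals) & toy['AwayTeam'].isin(rivals)]
```
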

In [14]:
df[((df['HomeTeam']=='England') | (df['HomeTeam']=='France')) & ((df['AwayTeam']=='England') | (df['AwayTeam']=='France'))]

Unnamed: 0,Year,Date,HomeTeam,HomeBonus,AwayTeam,AwayBonus,HomeScore,AwayScore,Stadium
4,2000,2000-02-19,France,0,England,0,9,15,Stade de France
25,2001,2001-04-07,England,0,France,0,48,19,Twickenham Stadium
36,2002,2002-03-02,France,0,England,0,20,15,Stade de France
46,2003,2003-02-15,England,0,France,0,25,17,Twickenham Stadium
74,2004,2004-03-27,France,0,England,0,24,21,Stade de France
80,2005,2005-02-13,England,0,France,0,17,18,Twickenham Stadium
101,2006,2006-03-12,France,0,England,0,31,6,Stade de France
116,2007,2007-03-11,England,0,France,0,26,18,Twickenham Stadium
128,2008,2008-02-23,France,0,England,0,13,24,Stade de France
146,2009,2009-03-15,England,0,France,0,34,10,Twickenham Stadium
