# Extracting data for Volleyball Nations League statistics analysis

Hello!

In this Jupyter Notebook, I extract data on the women's players of the 2023 Volleyball Nations League from the [official website](https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/). The data will be used in later analyses, which will be available in the same GitHub repository. Feel free to explore the other parts of the project, and if you have any questions or suggestions, please [reach out](https://www.linkedin.com/in/jrocatelli/)!

In [1]:
# Importing libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

## 1. Statistics

I'll create 7 dataframes: `best_scorers`, `best_attackers`, `best_blockers`, `best_servers`, `best_setters`, `best_diggers`, and `best_receivers`. Each one holds the site's statistics for one skill category.

In [18]:
# Sending GET requests to the webpage.

url_requests = ["https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/women/best-scorers/",
                "https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/women/best-attackers/",
                "https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/women/best-blockers/",
                "https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/women/best-servers/",
                "https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/women/best-setters/",
                "https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/women/best-diggers/",
                "https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/finals-statistics/women/best-receivers/"]

req_scorers = requests.get(url_requests[0])
req_attackers = requests.get(url_requests[1])
req_blockers = requests.get(url_requests[2])
req_servers = requests.get(url_requests[3])
req_setters = requests.get(url_requests[4])
req_diggers = requests.get(url_requests[5])
req_receivers = requests.get(url_requests[6])

reqs = [req_scorers, req_attackers, req_blockers, req_servers, req_setters, req_diggers, req_receivers]

# A status code of 200 means the request succeeded

for element in reqs:
    print(element.status_code)

200
200
200
200
200
200
200
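
The seven URLs differ only in their last path segment, so both the list and the status check can be generated in a loop. The following is a minimal sketch rather than the notebook's own code: `BASE_URL`, `CATEGORIES`, `build_urls`, and `fetch_all` are hypothetical names, `get` stands for any function that behaves like `requests.get`, and it assumes every category lives under `finals-statistics/women/`.

```python
BASE_URL = ("https://en.volleyballworld.com/volleyball/competitions/"
            "volleyball-nations-league/2023/finals-statistics/women/")

# Category slugs as they appear at the end of each URL in the cell above
CATEGORIES = ["best-scorers", "best-attackers", "best-blockers", "best-servers",
              "best-setters", "best-diggers", "best-receivers"]


def build_urls(base_url=BASE_URL, categories=CATEGORIES):
    """Return a dict mapping each category slug to its full URL."""
    return {category: f"{base_url}{category}/" for category in categories}


def fetch_all(urls, get):
    """Call get(url) for every URL and stop at the first non-200 response.

    `get` is expected to behave like requests.get, i.e. to return an
    object with a `status_code` attribute.
    """
    responses = {}
    for name, url in urls.items():
        response = get(url)
        if response.status_code != 200:
            raise RuntimeError(f"{name}: unexpected status {response.status_code}")
        responses[name] = response
    return responses
```

In the notebook this would be called as `fetch_all(build_urls(), requests.get)`. Keying the responses by category name also avoids having to remember which list position belongs to which statistic.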


In [20]:
# HTML content with BeautifulSoup

Soup = []

for element in reqs:
    Soup.append(BeautifulSoup(element.text, 'lxml'))

In [21]:
# Finding the tables

table = []

for element in Soup:
    table.append(element.find('table'))

In [20]:
# Function to create the dataframes

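def create_table(table):
    """Build a dataframe from an HTML <table> element."""
    heading_table = []
    content = []

    # Column names come from the header cells
    for header in table.find_all('th'):
        heading_table.append(header.text)

    # Every row without header cells is a data row
    for row in table.find_all('tr'):
        if not row.find_all('th'):
            content.append([element.text for element in row.find_all('td')])

    return pd.DataFrame(content, columns=heading_table)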
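
To see what the function does with a header row and data rows, it can be run on a small inline table. This is only a sketch: the HTML below is invented for illustration and is not the site's actual markup. It assumes each header cell holds a long and a short label, which would explain doubled column names such as `PointsPts`. The function is repeated in condensed form so the example runs on its own, and `html.parser` is used instead of `lxml` to avoid the extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample markup (an assumption, not copied from the website)
SAMPLE_HTML = """
<table>
  <tr>
    <th><span>Points</span><span>Pts</span></th>
    <th><span>Team</span><span>Team</span></th>
  </tr>
  <tr><td>65</td><td>TUR</td></tr>
  <tr><td>56</td><td>POL</td></tr>
</table>
"""


def create_table(table):
    """Condensed version of the function above, same logic."""
    headings = [th.text for th in table.find_all('th')]
    rows = [[td.text for td in tr.find_all('td')]
            for tr in table.find_all('tr') if not tr.find_all('th')]
    return pd.DataFrame(rows, columns=headings)


sample_table = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('table')
sample_df = create_table(sample_table)

print(sample_df.columns.tolist())  # ['PointsPts', 'TeamTeam']
print(sample_df.shape)             # (2, 2)
```

Because `.text` concatenates all text inside a cell, the two labels in each header cell are joined into a single column name.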

In [32]:
# Creating dataframes

best_scorers = create_table(table[0])
best_attackers = create_table(table[1])
best_blockers = create_table(table[2])
best_servers = create_table(table[3])
best_setters = create_table(table[4])
best_diggers = create_table(table[5])
best_receivers = create_table(table[6])

In [38]:
# One of the dataframes' head

best_scorers.head()

| | Shirt NumberShirt | Player NamePlayer | TeamTeam | PointsPts | Attack PointsA Pts | Block PointsB Pts | Serve PointsS Pts |
|---|---|---|---|---|---|---|---|
| 0 | 4 | Vargas Melissa Teresa | TUR | 65 | 55 | 5 | 5 |
| 1 | 9 | Stysiak Magdalena | POL | 56 | 52 | 1 | 3 |
| 2 | 12 | Li Yingying | CHN | 51 | 48 | 2 | 1 |
| 3 | 11 | Lukasik Martyna | POL | 38 | 35 | 1 | 2 |
| 4 | 11 | Drews Andrea | USA | 37 | 32 | 3 | 2 |


In [39]:
# Dataframe summary

best_scorers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Shirt NumberShirt   112 non-null    object
 1   Player NamePlayer   112 non-null    object
 2   TeamTeam            112 non-null    object
 3   PointsPts           112 non-null    object
 4   Attack PointsA Pts  112 non-null    object
 5   Block PointsB Pts   112 non-null    object
 6   Serve PointsS Pts   112 non-null    object
dtypes: object(7)
memory usage: 6.2+ KB


Every column has the `object` dtype because the scraped values are strings. I'll convert them to the proper types later, when loading these dataframes into other tools such as Power BI and SQL.
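
If the conversion were done in pandas instead, it could look like the sketch below. This is not part of the original workflow: `NEW_NAMES` and `NUMERIC_COLUMNS` are hypothetical names, the shorter column labels are just one possible choice, and the sample reuses the first three rows of `best_scorers` shown above.

```python
import pandas as pd

# First rows of best_scorers, kept as strings like the scraped data
sample = pd.DataFrame(
    [["4", "Vargas Melissa Teresa", "TUR", "65", "55", "5", "5"],
     ["9", "Stysiak Magdalena", "POL", "56", "52", "1", "3"],
     ["12", "Li Yingying", "CHN", "51", "48", "2", "1"]],
    columns=["Shirt NumberShirt", "Player NamePlayer", "TeamTeam", "PointsPts",
             "Attack PointsA Pts", "Block PointsB Pts", "Serve PointsS Pts"],
)

# Hypothetical cleaner names for the doubled headers
NEW_NAMES = {
    "Shirt NumberShirt": "Shirt Number",
    "Player NamePlayer": "Player Name",
    "TeamTeam": "Team",
    "PointsPts": "Points",
    "Attack PointsA Pts": "Attack Points",
    "Block PointsB Pts": "Block Points",
    "Serve PointsS Pts": "Serve Points",
}
NUMERIC_COLUMNS = ["Shirt Number", "Points", "Attack Points",
                   "Block Points", "Serve Points"]

cleaned = sample.rename(columns=NEW_NAMES)

# Convert one column at a time; to_numeric raises if a value is not a number
for column in NUMERIC_COLUMNS:
    cleaned[column] = pd.to_numeric(cleaned[column])

print(cleaned["Points"].sum())  # 172
```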

In [45]:
# Exporting dataframes to csv files

best_scorers.to_csv('best_scorers.csv', index=False)
best_attackers.to_csv('best_attackers.csv', index=False)
best_blockers.to_csv('best_blockers.csv', index=False)
best_servers.to_csv('best_servers.csv', index=False)
best_setters.to_csv('best_setters.csv', index=False)
best_diggers.to_csv('best_diggers.csv', index=False)
best_receivers.to_csv('best_receivers.csv', index=False)

The dataframes have been exported successfully.

## 2. Players

The player lists are spread across separate pages, one per team. With 16 teams in the tournament, that means 16 links to extract from. I'll follow the same approach as before, this time using dictionaries and for loops to avoid repeating code.

In [4]:
# Links of players for each team

players = {'Brazil': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5834/players/',
           'Bulgaria': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5835/players/',
           'Canada': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5836/players/',
           'China': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5837/players/',
           'Croatia': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5838/players/',
           'Dominican Republic': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5839/players/',
           'Germany': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5840/players/',
           'Italy': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5841/players/',
           'Japan': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5842/players/',
           'Korea': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5843/players/',
           'Netherlands': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5844/players/',
           'Poland': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5845/players/',
           'Serbia': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5846/players/',
           'Thailand': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5847/players/',
           'Türkiye': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5848/players/',
           'USA': 'https://en.volleyballworld.com/volleyball/competitions/volleyball-nations-league/2023/teams/women/5849/players/',
          }

In [14]:
# Requests

players_requests = {}

for key in players.keys():
    players_requests[key] = requests.get(players[key])

In [13]:
# Checking if the request was successful (status code 200).

players_requests

{'Brazil': <Response [200]>,
 'Bulgaria': <Response [200]>,
 'Canada': <Response [200]>,
 'China': <Response [200]>,
 'Croatia': <Response [200]>,
 'Dominican Republic': <Response [200]>,
 'Germany': <Response [200]>,
 'Italy': <Response [200]>,
 'Japan': <Response [200]>,
 'Korea': <Response [200]>,
 'Netherlands': <Response [200]>,
 'Poland': <Response [200]>,
 'Serbia': <Response [200]>,
 'Thailand': <Response [200]>,
 'Türkiye': <Response [200]>,
 'USA': <Response [200]>}

In [18]:
# BeautifulSoup and HTML texts

soup_players = {}

for key in players_requests.keys():
    soup_players[key] = BeautifulSoup(players_requests[key].text, 'lxml')
    
table_players = {}

for key in soup_players.keys():
    table_players[key] = soup_players[key].find('table')
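
`find('table')` returns `None` when a page contains no `<table>` element, and that would only surface later as an error inside `create_table`. A small check like the one below could catch it early. It is a sketch with a hypothetical helper name, `teams_without_table`, and the example dictionary is a stand-in for `table_players`.

```python
def teams_without_table(tables):
    """Return the keys whose value is None, i.e. pages where no table was found."""
    return [team for team, table in tables.items() if table is None]


# Stand-in for table_players: any non-None value plays the role of a parsed table
example = {'Brazil': '<table>...</table>', 'Bulgaria': None}

print(teams_without_table(example))  # ['Bulgaria']
```

In the notebook, `teams_without_table(table_players)` returning an empty list would mean all 16 pages were parsed as expected.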

In [21]:
# Creating a dataframe for each team

dataframes_players = {}

for key in table_players.keys():
    dataframes_players[key] = create_table(table_players[key])

In [29]:
# Indicating the abbreviation for each team in a column.

dataframes_players['Brazil']['Team'] = 'BRA'
dataframes_players['Bulgaria']['Team'] = 'BUL'
dataframes_players['Canada']['Team'] = 'CAN'
dataframes_players['China']['Team'] = 'CHN'
dataframes_players['Croatia']['Team'] = 'CRO'
dataframes_players['Dominican Republic']['Team'] = 'DOM'
dataframes_players['Germany']['Team'] = 'GER'
dataframes_players['Italy']['Team'] = 'ITA'
dataframes_players['Japan']['Team'] = 'JPN'
dataframes_players['Korea']['Team'] = 'KOR'
dataframes_players['Netherlands']['Team'] = 'NED'
dataframes_players['Poland']['Team'] = 'POL'
dataframes_players['Serbia']['Team'] = 'SRB'
dataframes_players['Thailand']['Team'] = 'THA'
dataframes_players['Türkiye']['Team'] = 'TUR'
dataframes_players['USA']['Team'] = 'USA'
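
The same dictionary-and-loop idea used for the requests also works for the abbreviations. Below is a minimal sketch under that assumption: `TEAM_CODES` and `add_team_codes` are hypothetical names, the codes are the ones assigned in the cell above, and `demo` is a two-team stand-in for `dataframes_players`.

```python
import pandas as pd

# Team name -> abbreviation, as assigned one by one in the cell above
TEAM_CODES = {
    'Brazil': 'BRA', 'Bulgaria': 'BUL', 'Canada': 'CAN', 'China': 'CHN',
    'Croatia': 'CRO', 'Dominican Republic': 'DOM', 'Germany': 'GER',
    'Italy': 'ITA', 'Japan': 'JPN', 'Korea': 'KOR', 'Netherlands': 'NED',
    'Poland': 'POL', 'Serbia': 'SRB', 'Thailand': 'THA', 'Türkiye': 'TUR',
    'USA': 'USA',
}


def add_team_codes(frames, codes=TEAM_CODES):
    """Add a 'Team' column to each dataframe, looked up by its dictionary key."""
    for team, frame in frames.items():
        frame['Team'] = codes[team]
    return frames


# Two-team stand-in for dataframes_players
demo = {
    'Brazil': pd.DataFrame({'No.': ['2'], 'Player Name': ['Duarte Alecrim Diana'],
                            'Position': ['MB']}),
    'USA': pd.DataFrame({'No.': ['24'], 'Player Name': ['Ogbogu Chiaka'],
                         'Position': ['MB']}),
}
add_team_codes(demo)

print(demo['Brazil']['Team'].tolist())  # ['BRA']
```

A team missing from the mapping raises a `KeyError`, which makes a forgotten entry visible immediately.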

In [38]:
# Concatenating the team dataframes into a single dataframe

all_players = pd.concat(objs=dataframes_players, axis=0, ignore_index=True)

all_players

| | No. | Player Name | Position | Team |
|---|---|---|---|---|
| 0 | 2 | Duarte Alecrim Diana | MB | BRA |
| 1 | 3 | Alexandre Costa Nunes Nyeme Victoria | L | BRA |
| 2 | 5 | Zalewski Daroit Priscila | OH | BRA |
| 3 | 6 | Daher de Menezes Thaisa | MB | BRA |
| 4 | 7 | Montibeller Rosamaria | OH | BRA |
| ... | ... | ... | ... | ... |
| 234 | 24 | Ogbogu Chiaka | MB | USA |
| 235 | 26 | O'Neal Asjia | MB | USA |
| 236 | 27 | Skinner Avery | OH | USA |
| 237 | 29 | Lanier Khalia | OH | USA |


In [39]:
# Exporting dataframe to csv file

all_players.to_csv('players.csv', index=False)

The dataframe has been exported successfully.