# Euro Fantasy Player Data
Aim: Create a CSV file of player data from the HTML of [Euro 2024 Fantasy](https://gaming.uefa.com/en/eurofantasy/create-team). The relevant parts of the HTML were extracted into a separate file under the `data/raw` directory.

In [3]:
from bs4 import BeautifulSoup
import re
import pandas as pd
from pathlib import Path

## Parse HTML

References:
- https://stackoverflow.com/questions/42038130/beautifulsoup-nested-class-selector
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
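
As a minimal sketch of the nested class selector technique these references cover, the snippet below runs `select` on a small made-up HTML fragment and compares it with a chained `find`. The class names mirror the ones used in this notebook, but the fragment itself is an assumption and not the real UEFA markup.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment: class names follow this notebook, the content is made up.
sample_html = """
<div class="si-plist__row si-plist__row--title">
  <div class="si-plist__lft"><div class="si-plist__col">Players</div></div>
  <div class="si-plist__rgt">
    <div class="si-plist__col"><span>Price</span></div>
    <div class="si-plist__col"><span>Goals</span></div>
  </div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A space between class selectors means "descendant of", so each selector
# walks from the title row into one side and then into its columns.
left = [c.get_text(strip=True) for c in sample_soup.select(".si-plist__row--title .si-plist__lft .si-plist__col")]
right = [s.get_text(strip=True) for s in sample_soup.select(".si-plist__row--title .si-plist__rgt .si-plist__col span")]
via_select = left + right

# find() stops at the first element matching the regex, so a chained find()
# only ever sees the columns of one side.
first_side = sample_soup.find(class_="si-plist__row--title").find(class_=re.compile("si-plist__(lft|rgt)"))
via_find = [c.get_text(strip=True) for c in first_side.find_all(class_="si-plist__col")]

print(via_select)
print(via_find)
```

On this fragment the `select` version collects all three header names, while the chained `find` version collects only the one under the first matching container.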

In [4]:
# read the extracted html
with open("../data/raw/euro_fantasy_players_table.html", "r", encoding="utf-8") as f:
    raw_html = f.read()

In [5]:
# parse the html using BeautifulSoup
soup = BeautifulSoup(raw_html, "html.parser")

In [6]:
# parse header
header_lft = [x.text for x in soup.select(".si-plist__row--title .si-plist__lft .si-plist__col")]
header_rgt = [x.text for x in soup.select(".si-plist__row--title .si-plist__rgt .si-plist__col span")]
header_row = header_lft + header_rgt + ["img_url"]


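# NOTE: the chained find() version below does not work: find() returns only the first
# match, so only the columns of the first matching container (lft or rgt) are collected
# temp = soup.find(class_="si-plist__row--title").find(class_=re.compile("si-plist__(lft|rgt)")).find_all(class_="si-plist__col")
# [x.get_text().strip("\n") for x in temp]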

In [7]:
print(header_row)

['Players', 'Price', 'Total pts', 'Selected', 'MD pts', 'Pts per €', 'Pts per MD', 'PotM pts', 'Goals', 'Assists', 'Balls recovered', 'Clean sheets', 'Red cards', 'Yellow cards', 'Mins played', 'Trans in', 'Trans out', 'img_url']


In [8]:
def parse_raw_body_row(row):
    # simple version (text only): [x.get_text().strip("\n") for x in row.select(".si-plist__col")]
    res = []
    img_path = None  # stays None if the row has no thumbnail column

    for col in row.select(".si-plist__col"):
        curr_val = col.get_text().strip("\n")
        if curr_val != "":
            res.append(curr_val)

        # try to get the thumbnail image src
        if "si-list-img" in col.attrs.get("class", []):
            img = col.select_one("img")
            if img is not None:
                img_path = img.attrs.get("src")

    res.append(img_path)  # image path as last col
    return res


raw_body_rows = soup.select(".si-plist__body .si-plist__row") # parse body get all rows in body
body_rows = [parse_raw_body_row(x) for x in raw_body_rows] # get text value for each column

In [9]:
print(body_rows[0])

['K. Mbappé\nAUT v FRA', '€11m\nFWD', '0', '76 %', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '9715', '2632', 'https://img.uefa.com/imgml/TP/players/3/2024/324x324/250076574.jpg?v=0.05']
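
To make the row parsing easier to follow, here is a minimal sketch run on a single made-up body row. The row structure (an image column with the `si-list-img` class followed by text columns) is an assumption based on the class names used above, and the URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical body row: structure guessed from the class names used in this notebook.
row_html = """
<div class="si-plist__row">
  <div class="si-plist__col si-list-img"><img src="https://example.com/p1.jpg"></div>
  <div class="si-plist__col">A. Player\nAAA v BBB</div>
  <div class="si-plist__col">3</div>
</div>
"""
demo_row = BeautifulSoup(row_html, "html.parser").select_one(".si-plist__row")

values = []
for col in demo_row.select(".si-plist__col"):
    text = col.get_text().strip("\n")
    if text:  # the image column has no text, so it is skipped here
        values.append(text)

# select_one returns None when nothing matches, so a row without an image is safe.
img = demo_row.select_one(".si-list-img img")
values.append(img.get("src") if img is not None else None)

print(values)
```

The text columns keep their embedded newline, which is why the first two real columns still need splitting later.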


The first two columns of each body row hold an extra value after a newline: the next match in `Players` and the position in `Price`. We will split these out using `pandas`.
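
A minimal sketch of that split on made-up rows, assuming every value has exactly one newline separating the two parts. The player names and fixtures are placeholders.

```python
import pandas as pd

# Made-up rows in the same "value\nextra" shape as the scraped columns.
demo = pd.DataFrame({
    "Players": ["A. Player\nAAA v BBB", "B. Player\nCCC v DDD"],
    "Price": ["€11m\nFWD", "€9.5m\nMID"],
})

# expand=True returns one column per piece, so both target columns
# can be assigned in a single statement.
demo[["Players", "Next Match"]] = demo["Players"].str.split("\n", expand=True)
demo[["Price", "Position"]] = demo["Price"].str.split("\n", expand=True)

# Strip the leading "€" and trailing "m" before converting to a number.
demo["Price"] = pd.to_numeric(demo["Price"].str.strip("€m"))

print(demo)
```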

## Create dataframe

In [10]:
df = pd.DataFrame(body_rows, columns=header_row)
df.head()

Unnamed: 0,Players,Price,Total pts,Selected,MD pts,Pts per €,Pts per MD,PotM pts,Goals,Assists,Balls recovered,Clean sheets,Red cards,Yellow cards,Mins played,Trans in,Trans out,img_url
0,K. Mbappé\nAUT v FRA,€11m\nFWD,0,76 %,0,0,0,0,0,0,0,0,0,0,0,9715,2632,https://img.uefa.com/imgml/TP/players/3/2024/3...
1,H. Kane\nSRB v ENG,€11m\nFWD,0,49 %,0,0,0,0,0,0,0,0,0,0,0,10142,4501,https://img.uefa.com/imgml/TP/players/3/2024/3...
2,C. Ronaldo\nPOR v CZE,€10m\nFWD,0,15 %,0,0,0,0,0,0,0,0,0,0,0,3962,5013,https://img.uefa.com/imgml/TP/players/3/2024/3...
3,J. Bellingham\nSRB v ENG,€9.5m\nMID,0,50 %,0,0,0,0,0,0,0,0,0,0,0,7235,6766,https://gaming.uefa.com/en/eurofantasy/static-...
4,K. De Bruyne\nBEL v SVK,€9.5m\nMID,0,21 %,0,0,0,0,0,0,0,0,0,0,0,5474,6343,https://gaming.uefa.com/en/eurofantasy/static-...


In [11]:
df[['Players', 'Next Match']] = df['Players'].str.split('\n', expand=True)
df[['Price', 'Position']] = df['Price'].str.split('\n', expand=True)

df['Price'] = df['Price'].apply(lambda x: x[1:-1])  # drop the leading "€" and trailing "m" so that we can convert to numeric type
df['Selected'] = df['Selected'].apply(lambda x: x.replace("%", "").strip())  # remove % symbol so that we can convert to numeric type

In [12]:
df.head()

Unnamed: 0,Players,Price,Total pts,Selected,MD pts,Pts per €,Pts per MD,PotM pts,Goals,Assists,Balls recovered,Clean sheets,Red cards,Yellow cards,Mins played,Trans in,Trans out,img_url,Next Match,Position
0,K. Mbappé,11.0,0,76,0,0,0,0,0,0,0,0,0,0,0,9715,2632,https://img.uefa.com/imgml/TP/players/3/2024/3...,AUT v FRA,FWD
1,H. Kane,11.0,0,49,0,0,0,0,0,0,0,0,0,0,0,10142,4501,https://img.uefa.com/imgml/TP/players/3/2024/3...,SRB v ENG,FWD
2,C. Ronaldo,10.0,0,15,0,0,0,0,0,0,0,0,0,0,0,3962,5013,https://img.uefa.com/imgml/TP/players/3/2024/3...,POR v CZE,FWD
3,J. Bellingham,9.5,0,50,0,0,0,0,0,0,0,0,0,0,0,7235,6766,https://gaming.uefa.com/en/eurofantasy/static-...,SRB v ENG,MID
4,K. De Bruyne,9.5,0,21,0,0,0,0,0,0,0,0,0,0,0,5474,6343,https://gaming.uefa.com/en/eurofantasy/static-...,BEL v SVK,MID


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 933 entries, 0 to 932
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Players          933 non-null    object
 1   Price            933 non-null    object
 2   Total pts        933 non-null    object
 3   Selected         933 non-null    object
 4   MD pts           933 non-null    object
 5   Pts per €        933 non-null    object
 6   Pts per MD       933 non-null    object
 7   PotM pts         933 non-null    object
 8   Goals            933 non-null    object
 9   Assists          933 non-null    object
 10  Balls recovered  933 non-null    object
 11  Clean sheets     933 non-null    object
 12  Red cards        933 non-null    object
 13  Yellow cards     933 non-null    object
 14  Mins played      933 non-null    object
 15  Trans in         933 non-null    object
 16  Trans out        933 non-null    object
 17  img_url          933 non-null    ob

In [14]:
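# convert to numeric data type
# NOTE: axis=1 converts row by row, so every converted column ends up as float64;
# dropping axis=1 would convert column by column and keep whole-number columns as int64
numeric_colnames = ['Price', 'Total pts', 'Selected', 'MD pts', 'Pts per €', 'Pts per MD', 'PotM pts', 'Goals', 'Assists', 'Balls recovered', 'Clean sheets', 'Red cards', 'Yellow cards', 'Mins played', 'Trans in', 'Trans out']
df[numeric_colnames] = df[numeric_colnames].apply(pd.to_numeric, axis=1)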

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 933 entries, 0 to 932
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Players          933 non-null    object 
 1   Price            933 non-null    float64
 2   Total pts        933 non-null    float64
 3   Selected         933 non-null    float64
 4   MD pts           933 non-null    float64
 5   Pts per €        933 non-null    float64
 6   Pts per MD       933 non-null    float64
 7   PotM pts         933 non-null    float64
 8   Goals            933 non-null    float64
 9   Assists          933 non-null    float64
 10  Balls recovered  933 non-null    float64
 11  Clean sheets     933 non-null    float64
 12  Red cards        933 non-null    float64
 13  Yellow cards     933 non-null    float64
 14  Mins played      933 non-null    float64
 15  Trans in         933 non-null    float64
 16  Trans out        933 non-null    float64
 17  img_url         
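
As a rough illustration of the conversion step, the sketch below applies `pd.to_numeric` both row-wise and column-wise to a tiny made-up frame. The column names are borrowed from this notebook, but the values are assumptions.

```python
import pandas as pd

# Made-up string data standing in for the scraped columns.
demo = pd.DataFrame({"Price": ["11", "9.5"], "Goals": ["0", "2"]})

by_row = demo.apply(pd.to_numeric, axis=1)  # each row is converted as one Series
by_col = demo.apply(pd.to_numeric)          # each column is converted on its own

print(by_row.dtypes)
print(by_col.dtypes)
```

Column-wise conversion lets a column of whole numbers such as `Goals` stay an integer type, whereas the row-wise call has to fit a decimal price and an integer count into the same Series.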

In [16]:
output_path = Path("../data/clean")

In [17]:
# output_path.is_dir()
output_path.parts

('..', 'data', 'clean')

In [18]:
output_path.absolute()

WindowsPath('d:/Users/Timothy/Projects/football-analysis/notebooks/../data/clean')

In [19]:
# create output path if it does not exist
# (assumes the notebook is run from the notebooks/ directory, so the parent is the project root)
main_dir = Path.cwd().resolve().parents[0]
out_path = main_dir / "data" / "clean"
out_path.mkdir(parents=True, exist_ok=True)

# save output
target_path = out_path / "euro_fantasy_players.csv"
df.to_csv(target_path, index=False)
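
One way to sanity-check the export is to read the file back and compare it with the frame that was written. The sketch below does this in a temporary directory with a made-up frame, so the folder layout and the file name `players_demo.csv` are assumptions rather than the project's real paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({"Players": ["A. Player"], "Price": [9.5]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <tmp>/data/clean stands in for the project folders.
    out_dir = Path(tmp) / "data" / "clean"
    out_dir.mkdir(parents=True, exist_ok=True)  # creates missing parent folders
    out_dir.mkdir(parents=True, exist_ok=True)  # a second call is harmless

    target = out_dir / "players_demo.csv"
    demo.to_csv(target, index=False)  # index=False keeps the row index out of the file

    round_trip = pd.read_csv(target)

print(round_trip)
```

Writing with `index=False` means the file has no extra unnamed index column when it is read back.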