# Acquisition of UEFA Champions League Top Scorer Data from Transfermarkt

This notebook demonstrates how the all-time top goal scorer data of the UEFA Champions League is scraped from [Transfermarkt](https://www.transfermarkt.com). The source page is:

- [Best Goal Scorers of UEFA Champions League](https://www.transfermarkt.com/uefa-champions-league/ewigetorschuetzenliste/pokalwettbewerb/CL/land_id/0/saisonIdVon/1955/saisonIdBis/2024)

In [2]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import re
import os

In [3]:
# Headers sent with the HTTP request so that the web server treats it like a browser request.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

Each record extracted from the top scorer table has the following structure:

```
{
    name,
    seasons,
    appearances,
    goals
}
```
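
For reference, the four fields that the parsing code in this notebook fills in can be written down as a typed record. This is only a sketch: the names `TopScorer` and `is_valid_record` are hypothetical and are not used elsewhere in the notebook. The sample values come from the output shown further down.

```python
from typing import TypedDict


class TopScorer(TypedDict):
    """One entry of the top scorer list (hypothetical helper type)."""
    name: str
    seasons: int
    appearances: int
    goals: int


def is_valid_record(record: dict) -> bool:
    """Return True if the record has every field with the expected type."""
    expected = TopScorer.__annotations__  # maps field name to its type
    return all(
        key in record and isinstance(record[key], expected_type)
        for key, expected_type in expected.items()
    )


example: TopScorer = {"name": "Raúl", "seasons": 15, "appearances": 142, "goals": 71}
print(is_valid_record(example))           # complete record
print(is_valid_record({"name": "Raúl"}))  # fields are missing
```

A check like this could be run on each parsed row before the data is stored, so that incomplete rows are noticed early.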

In the first step, we need two functions: one that reads all rows of the UEFA Champions League top scorer table on transfermarkt.com, and one that parses the data from a single row so that the information can be stored in a dataframe.

In [54]:
def getRowData(row):
    """
    Parse the data from a single row of the UEFA CL top scorer table.

    Args:
        row: Table row (<tr> tag).

    Returns:
        player: Dict with the player's name, seasons, appearances and goals,
            or None if the row does not have the expected layout.
    """
    player = {}

    # The first link in the row is the entry point into the player cell
    init_tag = row.find_next("a")
    if not init_tag:
        return None

    # get name tag
    name_cell = init_tag.find_next("td")
    name_tag = name_cell.find_next("a") if name_cell else None
    if not name_tag:
        return None
    player["name"] = name_tag.get("title", "No title available")

    # Skip the four cells up to and including the age cell;
    # they are only used to reach the numeric columns
    tag = name_tag
    for _ in range(4):
        tag = tag.find_next("td")
        if not tag:
            return None

    # Seasons, appearances and goals are the next three cells, in that order
    for key in ("seasons", "appearances", "goals"):
        tag = tag.find_next("td")
        if not tag:
            return None
        player[key] = int(tag.text.strip())

    return player
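
`find_next` searches everything that follows a tag in document order, so a chain of `find_next("td")` calls is not confined to the current row and can run on into the next one. The sketch below shows that effect on a small hand-written table and contrasts it with indexing the direct `<td>` children of a row. The HTML, its column order, the placeholder values, and the helper name `parse_row` are assumptions made for illustration; the real Transfermarkt markup has more columns and nested elements.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: rank, player link, seasons, appearances, goals.
html = """
<table class="items"><tbody>
<tr><td>1</td><td><a title="Player A" href="#">Player A</a></td><td>3</td><td>20</td><td>10</td></tr>
<tr><td>2</td><td><a title="Player B" href="#">Player B</a></td><td>2</td><td>15</td><td>7</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="items").find("tbody").find_all("tr", recursive=False)

# Pitfall: find_next leaves the first row and lands in the second one.
last_cell = rows[0].find_all("td", recursive=False)[-1]
print(last_cell.find_next("td").get_text())  # rank cell of the next row


def parse_row(row):
    """Parse one row by cell position; return None if the layout is unexpected."""
    cells = row.find_all("td", recursive=False)
    if len(cells) < 5:
        return None
    link = cells[1].find("a")
    if link is None:
        return None
    try:
        seasons, appearances, goals = (int(c.get_text(strip=True)) for c in cells[2:5])
    except ValueError:  # a cell did not contain a plain integer
        return None
    return {
        "name": link.get("title", link.get_text(strip=True)),
        "seasons": seasons,
        "appearances": appearances,
        "goals": goals,
    }


players = [p for p in (parse_row(r) for r in rows) if p is not None]
print(players)
```

Indexing by position keeps each lookup inside its own row, at the cost of depending on the column order of the page.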

In [52]:
def getTopPlayers():
    """
    Read the UEFA CL top scorer table and parse each of its rows.

    Returns:
        top_players: List containing the all-time top scorers of the UEFA Champions League,
            or None if the table could not be found.
    """
    page = "https://www.transfermarkt.com/uefa-champions-league/ewigetorschuetzenliste/pokalwettbewerb/CL/land_id/0/saisonIdVon/1955/saisonIdBis/2024"
    pageTree = requests.get(page, headers=headers, timeout=30)
    pageTree.raise_for_status()  # fail early on HTTP errors
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    top_players = []
    table = pageSoup.find("table", class_="items")
    if not table:
        return None
    tbody = table.find_next("tbody")
    if not tbody:
        return None

    # Iterate over the direct <tr> children only; this skips the whitespace
    # text nodes that sit between the rows in tbody.contents
    for row in tbody.find_all("tr", recursive=False):
        player = getRowData(row)
        if player is not None:  # skip rows that could not be parsed
            top_players.append(player)

    return top_players


In [56]:
data = getTopPlayers()
data

[{'name': 'Cristiano Ronaldo',
  'seasons': 19,
  'appearances': 183,
  'goals': 140},
 {'name': 'Lionel Messi', 'seasons': 19, 'appearances': 163, 'goals': 129},
 {'name': 'Robert Lewandowski',
  'seasons': 14,
  'appearances': 126,
  'goals': 101},
 {'name': 'Karim Benzema', 'seasons': 19, 'appearances': 152, 'goals': 90},
 {'name': 'Raúl', 'seasons': 15, 'appearances': 142, 'goals': 71},
 {'name': 'Ruud van Nistelrooy',
  'seasons': 11,
  'appearances': 73,
  'goals': 56},
 {'name': 'Thomas Müller', 'seasons': 17, 'appearances': 156, 'goals': 55},
 {'name': 'Kylian Mbappé', 'seasons': 9, 'appearances': 79, 'goals': 50},
 {'name': 'Thierry Henry', 'seasons': 13, 'appearances': 112, 'goals': 50},
 {'name': 'Alfredo di Stéfano', 'seasons': 9, 'appearances': 58, 'goals': 49},
 {'name': 'Zlatan Ibrahimović',
  'seasons': 16,
  'appearances': 124,
  'goals': 48},
 {'name': 'Andriy Shevchenko', 'seasons': 12, 'appearances': 100, 'goals': 48},
 {'name': 'Mohamed Salah', 'seasons': 10, 'appe

Now we can create a dataframe for the retrieved data, which we can then store as a .csv file.

In [188]:
def createDataframe(players):
    """
    Creates a dataframe for the all-time UEFA CL top scorers list.

    Args:
        players: List of player dicts as returned by getTopPlayers().

    Returns:
        df: Converted dataframe.
    """
    rows = []
    for player in players:
        row = {
            'name': player['name'],
            'seasons': player['seasons'],
            'appearances': player['appearances'],
            'goals': player['goals']
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    print("Dataframes successfully created.")
    return df
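
Because every record is already a dict with the same keys, pandas can also build the frame directly from the list, which makes the per-field copy optional. The following is a minimal sketch, assuming the records look like the output shown earlier. The two sample records are taken from that output, the name `df_direct` is hypothetical, and the `columns` argument only pins the column order.

```python
import pandas as pd

records = [
    {"name": "Lionel Messi", "seasons": 19, "appearances": 163, "goals": 129},
    {"name": "Raúl", "seasons": 15, "appearances": 142, "goals": 71},
]

# Drop entries that failed to parse (None) before building the frame.
clean = [r for r in records if r is not None]
df_direct = pd.DataFrame(clean, columns=["name", "seasons", "appearances", "goals"])

# Derived column as a quick plausibility check of the numeric fields.
df_direct["goals_per_appearance"] = df_direct["goals"] / df_direct["appearances"]
print(df_direct)
```

A derived ratio like this only works if the numeric columns were parsed as integers, so it doubles as a check on the parsing step.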


In [191]:
# Create the dataframe to be saved
df = createDataframe(data)

Dataframes successfully created.


In [192]:
# store data
folder_name = "data"
try:
    os.makedirs(folder_name, exist_ok=False)
    print("Folder created for storing goal data")
except FileExistsError:
    # Only an already existing folder is expected here; other errors should surface
    print("Folder already exists")


df.to_csv(os.path.join(folder_name, "laliga_top_scorer.csv"), index=False, encoding="utf-8")

Folder already exists
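
As a pandas-free cross-check, the same write-and-read-back step can be sketched with the standard library only. This is not what the notebook does: it swaps in `pathlib` and `csv`, writes to a temporary directory instead of `./data`, and uses the placeholder file name `example.csv`. `mkdir(exist_ok=True)` creates the folder if needed and does nothing otherwise, so no try/except is required. The two sample records are taken from the output shown earlier.

```python
import csv
import tempfile
from pathlib import Path

records = [
    {"name": "Raúl", "seasons": 15, "appearances": 142, "goals": 71},
    {"name": "Thomas Müller", "seasons": 17, "appearances": 156, "goals": 55},
]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "data"
    folder.mkdir(parents=True, exist_ok=True)
    folder.mkdir(parents=True, exist_ok=True)  # second call is harmless

    path = folder / "example.csv"  # placeholder file name
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "seasons", "appearances", "goals"])
        writer.writeheader()
        writer.writerows(records)

    # Read the file back; the csv module returns every value as a string.
    with path.open(newline="", encoding="utf-8") as f:
        restored = list(csv.DictReader(f))

print(restored[0]["name"], restored[0]["goals"])
```

Reading the file back is a simple way to confirm that non-ASCII names survive the UTF-8 round trip.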
