# 📊 Data Scraping for "The Top 10 Individual Performances in FIFA World Cup History" 🏆

This notebook covers the web scraping step that gathers the data for our analysis of the top individual performances in FIFA World Cup history. The scraped data is the foundation for the subsequent analysis and ranking of player performances.

**Objective:**

The main objective of this notebook is to collect comprehensive data on player performances from various sources, enabling us to conduct a thorough analysis and identify the top-performing players in FIFA World Cup history.

**Approach:**

We use web scraping to extract data from websites such as Sofascore, a popular sports statistics platform. With ScraperFC, a Python scraping library, we automate the collection of player statistics for a given edition of the FIFA World Cup (FWC).

**Created by: Jose Ruben Garcia Garcia**

*Date: April 2024*


In [1]:
import ScraperFC as sfc  # Importing the ScraperFC library for web scraping
import pandas as pd  # Importing pandas library for data manipulation and analysis

In [2]:
sofascore = sfc.Sofascore()  # Creating an instance of the Sofascore class from the ScraperFC library

In [3]:
import requests  # HTTP client used to call the Sofascore API
# Assumption: ScraperFC 2.x, where this helper lives in ScraperFC.shared_functions
from ScraperFC.shared_functions import get_source_comp_info

def scrape_league_stats(
    self, year, league, accumulation='total',
    selected_positions=['Goalkeepers', 'Defenders', 'Midfielders', 'Forwards']
):
    """
    Scrapes player statistics for a given league and season from the Sofascore website.

    Args:
        year (str): The year of the season.
        league (str): The name of the league.
        accumulation (str, optional): The accumulation filter. Can be "per90", "perMatch", or "total". Defaults to 'total'.
        selected_positions (list, optional): The selected positions to filter. Defaults to ['Goalkeepers', 'Defenders', 'Midfielders', 'Forwards'].

    Returns:
        DataFrame: One row per player; the columns are the fields defined in get_league_stats_fields().
            Empty if the API returns no results for the requested season.
    """
    # Retrieving source competition information
    source_comp_info = get_source_comp_info(year, league, "Sofascore")

    # Getting positions
    positions = self.get_positions(selected_positions)

    # Retrieving league and season IDs
    league_id = source_comp_info['Sofascore'][league]['id']
    season_id = source_comp_info['Sofascore'][league]['seasons'][year]

    offset = 0
    frames = []

    # Looping through pages (100 rows each) to scrape data
    for _ in range(100):
        request_url = (
            'https://api.sofascore.com/api/v1'
            f'/unique-tournament/{league_id}/season/{season_id}/statistics'
            f'?limit=100&order=-rating&offset={offset}'
            f'&accumulation={accumulation}'
            f'&fields={self.concatenated_fields}'
            f'&filters=position.in.{positions}'
        )

        response = requests.get(request_url, headers=self.requests_headers)
        payload = response.json()  # Parse the response body once per page
        results = payload.get('results', [])

        # With no results there is no 'player' column, and accessing it
        # raises AttributeError, so stop before parsing
        if not results:
            print('No results returned')
            break

        new_df = pd.DataFrame(results)
        # 'player' and 'team' arrive as nested dicts; keep only their names
        new_df['player'] = new_df['player'].apply(lambda p: p['name'])
        new_df['team'] = new_df['team'].apply(lambda t: t['name'])
        frames.append(new_df)

        if payload.get('page') == payload.get('pages'):
            print('End of the pages')
            break

        offset += 100

    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Note: this function takes `self`, so it only replaces the library's own method
# if it is assigned to the class:
#     sfc.Sofascore.scrape_league_stats = scrape_league_stats
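
The paging logic in `scrape_league_stats` can be isolated and tried out without touching the network. The sketch below is a simplified illustration, not ScraperFC's code: `collect_pages` and `fake_fetch` are hypothetical names, `fetch_page` stands in for the `requests.get(...).json()` call, and the payload shape (`results`, `page`, `pages`) is assumed from the function above. The rows are made up.

```python
import pandas as pd

def collect_pages(fetch_page, page_size=100, max_pages=100):
    """Collect paginated results into one DataFrame.

    fetch_page(offset) is a hypothetical callable returning a dict shaped like
    the payload used above: {'results': [...], 'page': int, 'pages': int}.
    """
    frames = []
    offset = 0
    for _ in range(max_pages):
        payload = fetch_page(offset)
        results = payload.get('results', [])
        if not results:
            break  # nothing to parse, so stop before touching any columns
        frames.append(pd.DataFrame(results))
        if payload.get('page') == payload.get('pages'):
            break  # last page reached
        offset += page_size
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Fake two-page source used only to exercise the loop (made-up rows)
def fake_fetch(offset):
    rows = [{'player': f'P{i}', 'goals': i} for i in range(5)]
    chunk = rows[offset:offset + 3]
    return {'results': chunk, 'page': offset // 3 + 1, 'pages': 2}

paged_df = collect_pages(fake_fetch, page_size=3)            # two pages, five rows
empty_df = collect_pages(lambda offset: {'results': []})     # season with no data
```

Collecting the pages in a list and concatenating once at the end avoids growing a DataFrame inside the loop, and the empty-results check means a season without data yields an empty DataFrame instead of an exception.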

In [4]:
# This is an example for the 1958 WC (no data gathered)
def get_top_players(df, country, year):
    """
    Filters the data by the specified country and selects the top 3 players.

    Args:
        df (DataFrame): DataFrame containing player statistics.
        country (str): The country name to filter.
        year (str): The year of the World Cup.

    Returns:
        DataFrame: DataFrame containing the top 3 players from the specified country.
    """
    # Filter the data by the specified country; .copy() avoids pandas'
    # SettingWithCopyWarning when the 'year' column is added below
    country_df = df[df['team'] == country].copy()
    # Add a 'year' column with the specified year
    country_df['year'] = year
    # Get the top 3 players from that country (this assumes df keeps the
    # rating order requested from the API with order=-rating)
    top_players = country_df.head(3)
    return top_players

# Initialize an empty DataFrame to store all the results
all_results_df_1958 = pd.DataFrame()

# Get the World Cup data
df = sofascore.scrape_league_stats('1958', 'World Cup', accumulation='total', selected_positions=['Midfielders', 'Forwards'])

# Get the top 3 players from Brazil in the year 1958
top_players_df = get_top_players(df, 'Brazil', '1958')

# Check if there are already results in the final DataFrame
if not all_results_df_1958.empty:
    # If there are existing data, append the new results below the existing ones
    all_results_df_1958 = pd.concat([all_results_df_1958, top_players_df], ignore_index=True)
else:
    # If the DataFrame is empty, simply assign the new results
    all_results_df_1958 = top_players_df

# Display the final DataFrame
all_results_df_1958


AttributeError: 'DataFrame' object has no attribute 'player'
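
The `AttributeError` above is what pandas raises when attribute-style access is used for a column that does not exist. The most likely cause here is that the API returned an empty results list for 1958, although the notebook does not show the response itself. A minimal reproduction, which makes no request to Sofascore:

```python
import pandas as pd

# An empty results list becomes a DataFrame with no columns at all
empty = pd.DataFrame([])

try:
    empty.player                     # attribute-style column access
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Checking for the column first avoids the exception
has_player_column = 'player' in empty.columns
```

Testing `'player' in df.columns` (or `df.empty`) before parsing is a cheap way to tell "no data for this edition" apart from a genuine failure.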

In [None]:
who

all_results_df_1958	 all_results_df_1962	 all_results_df_1966	 all_results_df_1970	 all_results_df_1974	 all_results_df_1978	 all_results_df_1982	 all_results_df_1986	 all_results_df_1990	 
all_results_df_1994	 all_results_df_1998	 all_results_df_2002	 all_results_df_2006	 all_results_df_2010	 all_results_df_2014	 all_results_df_2018	 all_results_df_2022	 df	 
get_top_players	 pd	 sfc	 sofascore	 top_players_df	 


In [None]:
# List to store the DataFrames
dfs = []

# Iterate over a snapshot of the global variables: the loop itself creates new
# globals (var_name, var_value), which would change the dict during iteration
for var_name, var_value in list(globals().items()):
    # Keep only the DataFrames whose name matches the pattern "all_results_df_"
    if isinstance(var_value, pd.DataFrame) and var_name.startswith('all_results_df_'):
        dfs.append(var_value)

# Concatenate all DataFrames into one
merged_df = pd.concat(dfs, ignore_index=True)

# Save the merged DataFrame as a CSV file
merged_df.to_csv('merged_results.csv', index=False)

print("Merged DataFrames saved as merged_results.csv")

Merged DataFrames saved as merged_results.csv


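
An alternative to scanning `globals()` is to keep the per-edition DataFrames in a dictionary keyed by year. The sketch below uses made-up rows and hypothetical names (`results_by_year`, `merge_results`); it is one possible layout, not what was run for this notebook.

```python
import pandas as pd

# Made-up rows; in the notebook these would be the all_results_df_<year> frames
results_by_year = {
    '1966': pd.DataFrame({'player': ['X', 'Y'], 'team': ['England', 'England'], 'goals': [4, 3]}),
    '1970': pd.DataFrame({'player': ['Z'], 'team': ['Brazil'], 'goals': [7]}),
    '1958': pd.DataFrame(),  # edition with no data
}

def merge_results(frames_by_year):
    """Concatenate the non-empty DataFrames in chronological order."""
    non_empty = [
        frames_by_year[year]
        for year in sorted(frames_by_year)
        if not frames_by_year[year].empty
    ]
    if not non_empty:
        return pd.DataFrame()
    return pd.concat(non_empty, ignore_index=True)

merged = merge_results(results_by_year)
```

A dictionary makes the set of inputs explicit, gives a predictable row order, and skips empty editions before they reach `pd.concat`.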


In [None]:
df = pd.read_csv('merged_results.csv')
df.tail()

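| Unnamed: 0 | goals | yellowCards | redCards | groundDuelsWon | groundDuelsWonPercentage | aerialDuelsWon | aerialDuelsWonPercentage | successfulDribbles | successfulDribblesPercentage | tackles | ... | goalConversionPercentage | hitWoodwork | offsides | expectedGoals | errorLeadToGoal | errorLeadToShot | passToAssist | player | team | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40 | 3 | 1 | 0 | 41 | 57.75 | 0 | 0.0 | 19 | 67.86 | 12 | ... | 10.0 | 1 | 0 | NaN | 0 | 0 | 0 | Rivellino | Brazil | 1970 |
| 41 | 7 | 0 | 0 | 78 | 60.94 | 1 | 50.0 | 47 | 67.14 | 5 | ... | 58.33 | 0 | 3 | NaN | 0 | 0 | 0 | Jairzinho | Brazil | 1970 |
| 42 | 4 | 0 | 0 | 11 | 32.35 | 8 | 61.54 | 5 | 50.0 | 3 | ... | 23.53 | 0 | 4 | NaN | 0 | 0 | 0 | Geoff Hurst | England | 1966 |
| 43 | 3 | 0 | 0 | 36 | 52.17 | 3 | 60.0 | 12 | 48.0 | 18 | ... | 10.0 | 1 | 2 | NaN | 0 | 0 | 0 | Bobby Charlton | England | 1966 |
| 44 | 1 | 1 | 0 | 30 | 53.57 | 8 | 88.89 | 5 | 45.45 | 23 | ... | 3.03 | 0 | 2 | NaN | 0 | 0 | 0 | Martin Peters | England | 1966 |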
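
A column named `Unnamed: 0`, as in the output above, is what pandas produces when a CSV that was written with its index is read back without `index_col`. Whether that is what happened here cannot be confirmed from the notebook, since the merge cell writes with `index=False`. The sketch below only demonstrates the mechanism on made-up rows, using in-memory buffers instead of files.

```python
import io
import pandas as pd

toy = pd.DataFrame({'player': ['A', 'B'], 'goals': [3, 7]})

# Written with the index: the first CSV column has an empty header
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
read_with_index = pd.read_csv(with_index)

# Written without the index: only the data columns are stored
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
read_without_index = pd.read_csv(without_index)

# Reading the first file again, this time treating column 0 as the index
with_index.seek(0)
read_restored = pd.read_csv(with_index, index_col=0)
```

Either writing with `index=False` or reading with `index_col=0` keeps the stray column out of the analysis.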


# Note on Manual DataFrame Concatenation

It would have been possible to automate merging all the DataFrames into one, for example with a function that dynamically gathers every DataFrame matching a naming pattern and concatenates them. However, given the time needed to develop such a function, and the fact that it might have required significant changes to the existing functionality of the ScraperFC library, I decided to concatenate the DataFrames manually.

Automation might have saved time and effort, but I prioritized simplicity and wanted to avoid complications from modifying the library's codebase. Merging the DataFrames manually kept the consolidation of the results straightforward and reliable.
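
For reference, an automated variant could look roughly like the following. This is a sketch under assumptions, not code that was run for this project: `collect_top_players` is a hypothetical helper, `scrape` is a callable standing in for `sofascore.scrape_league_stats`, `targets` is an illustrative mapping of year to countries, and the rows are made up.

```python
import pandas as pd

def collect_top_players(scrape, targets, n=3):
    """Gather the first n rows per (year, country) pair into one DataFrame.

    scrape(year) is a hypothetical callable returning that edition's stats,
    already ordered from best to worst.
    """
    frames = []
    for year, countries in targets.items():
        stats = scrape(year)
        if stats.empty:
            continue  # no data for this edition
        for country in countries:
            top = stats[stats['team'] == country].head(n).assign(year=year)
            if not top.empty:
                frames.append(top)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Stand-in scraper with made-up rows; 1958 mimics an edition without data
def fake_scrape(year):
    if year == '1958':
        return pd.DataFrame()
    return pd.DataFrame({
        'player': ['P1', 'P2', 'P3', 'P4', 'P5'],
        'team': ['England', 'England', 'Portugal', 'England', 'England'],
    })

targets = {'1958': ['Brazil'], '1966': ['England', 'Portugal']}
all_top = collect_top_players(fake_scrape, targets)
```

Passing the scraper in as an argument keeps the helper independent of ScraperFC, so nothing in the library would need to be modified.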

I would like to thank the creators of the ScraperFC library for providing a powerful tool for scraping sports data. Their work greatly facilitated the data gathering for this project, and I am grateful for their contribution to the field of data science.

---

*Again, a special thanks to the creators of ScraperFC (https://github.com/oseymour/ScraperFC) for their valuable contribution to this project.*
