## Premier League Data Preparation

#### The data has already been scraped from the official Premier League website, but there are some issues that prevent us from storing it properly for future warehousing, analysis, and predictive modelling.

#### We need to do the following to make sure the data is ready for use:

- Convert all information stored as dictionaries into standard CSV rows and columns
- Merge batch data containing the same information


In [1]:
import pandas as pd
import ast

### Preparing player statistics data

#### We have two player stats files, so let's start by merging them together


In [2]:
# Getting the player stats files

player_stats_1 = pd.read_csv("../../web_scraping/new/datasets/player_stats.csv")
player_stats_2 = pd.read_csv("../../web_scraping/new/datasets/player_stats_2.csv")

In [3]:
player_stats = pd.concat([player_stats_1, player_stats_2])
player_stats

Unnamed: 0,player_name,preferred_foot,date_of_birth,appearances_sub,goals,assists,xa,xg,touches_in_opposition_box,crosses_completed,...,duels_won,aerial_duels_won,total_tackles,interceptions,blocks,red_cards,yellow_cards,fouls,offsides,all_stats_on_page
0,Max Aarons,Right,04/01/2000,3 (2),0,0,0,0.0,0,0,...,0,0,2,1,1,0,0,0,0,"{'Nationality': 'England', 'Preferred Foot': '..."
1,George Abbott,Right,17/08/2005,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,"{'Nationality': 'England', 'Preferred Foot': '..."
2,Zach Abbott,Right,13/05/2006,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,"{'Nationality': 'England', 'Preferred Foot': '..."
3,Josh Acheampong,Right,05/05/2006,4 (2),0,0,3,0.02,0,0,...,1,0,2,1,0,0,1,0,0,"{'Nationality': 'England', 'Preferred Foot': '..."
4,Ché Adams,13/07/1996,,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,"{'Nationality': 'England', 'Date of Birth': '1..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
724,Edson Álvarez,Both,24/10/1997,28 (8),1,0,7,0.89,13,6 (67%),...,112,30,50,18,8,1,7,42,4,"{'Nationality': 'Mexico', 'Preferred Foot': 'B..."
725,Julián Álvarez,Right,31/01/2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"{'Nationality': 'Argentina', 'Preferred Foot':..."
726,Odsonne Édouard,Right,16/01/1998,6 (5),0,0,17,0.02,0,0,...,5,2,2,0,0,0,0,6,2,"{'Nationality': 'France', 'Preferred Foot': 'R..."
727,Martin Ødegaard,Left,17/12/1998,30 (4),3,8,6.66,4.81,18,15,...,75,8,19,6,0,0,4,12,1,"{'Nationality': 'Norway', 'Preferred Foot': 'L..."
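
The concatenated frame above keeps each file's own row labels, which is why the last row is labelled 727 even though the frame holds 1116 rows. The sketch below shows one way to renumber the rows during the merge and to look for names that appear more than once. The toy frames and their values are hypothetical, and whether the real files overlap is not checked here.

```python
import pandas as pd

# Toy stand-ins for the two scraped batches (hypothetical values).
batch_1 = pd.DataFrame({"player_name": ["Player A", "Player B"], "goals": [1, 0]})
batch_2 = pd.DataFrame({"player_name": ["Player C", "Player B"], "goals": [3, 0]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each batch's labels.
combined = pd.concat([batch_1, batch_2], ignore_index=True)

# Names that occur in more than one row, e.g. a player present in both batches.
is_repeated = combined["player_name"].duplicated(keep=False)
duplicated_names = combined.loc[is_repeated, "player_name"].unique()

print(combined.index.tolist())
print(list(duplicated_names))
```

A unique row index matters mostly when rows are later selected by label; the loop-based reshaping further down rebuilds the frame from a list, so it is not affected either way.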


#### Let's look at the summary statistics of the combined player stats


In [4]:
player_stats.describe()

Unnamed: 0,goals,assists,duels_won,aerial_duels_won,total_tackles,interceptions,blocks,red_cards,yellow_cards,fouls,offsides
count,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0
mean,1.973118,0.729391,31.21147,7.827061,11.558244,5.222222,2.422043,0.043907,1.296595,7.234767,1.11828
std,6.444603,1.95654,53.190074,17.9371,20.827354,10.199397,5.830353,0.21771,2.349091,12.763193,3.086843
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,45.0,6.25,15.0,6.0,2.0,0.0,2.0,10.25,1.0
max,66.0,18.0,243.0,148.0,133.0,66.0,57.0,2.0,12.0,73.0,28.0


#### The scraped player stats columns are clearly flawed: in row 4 (Ché Adams), for example, the date of birth sits in the preferred_foot column. This is why we stored the all_stats_on_page column in the first place. Let's use it to populate the columns instead of what we already have


#### We have to be strategic about this, so let us start by creating a list that stores every possible key in the all_stats_on_page column.

- This gives every player the same set of columns, so a missing stat cannot shift a value into the wrong column.


In [5]:
# First, we have to convert the string representations to dictionaries, so we can work with them
player_stats["all_stats_on_page"] = player_stats["all_stats_on_page"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)

In [6]:
# Now let's check that the first value is a dictionary
print(type(player_stats["all_stats_on_page"].iloc[0]))

<class 'dict'>
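
The conversion above assumes every cell holds a well-formed dictionary string. As a hedged sketch, a small wrapper (the name `parse_stats` is hypothetical) could fall back to an empty dictionary when a cell is missing or truncated, so that one bad row does not stop the whole run.

```python
import ast


def parse_stats(value):
    """Return the dict encoded in `value`, or an empty dict if it cannot be parsed."""
    if isinstance(value, dict):
        return value
    if not isinstance(value, str):
        # Missing cells are read by pandas as NaN (a float), not as a string.
        return {}
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Raised for truncated or otherwise malformed literals.
        return {}
    return parsed if isinstance(parsed, dict) else {}


print(parse_stats("{'Nationality': 'England', 'Goals': '0'}"))
print(parse_stats("{'Nationality': 'Engl"))
print(parse_stats(float("nan")))
```

It would be applied the same way as the lambda, for example `player_stats["all_stats_on_page"].apply(parse_stats)`. Silently returning an empty dictionary hides bad rows, so counting how many rows came back empty is a sensible follow-up.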


In [7]:
# Now we store all possible stats in a list, in the order they are first seen
possible_stats_list = []
for stats in player_stats["all_stats_on_page"]:
    for k in stats:
        if k not in possible_stats_list:
            possible_stats_list.append(k)
possible_stats_list

['Nationality',
 'Preferred Foot',
 'Date of Birth',
 'Appearances (Sub)',
 'XA',
 'Passes (Completed %)',
 'Long Passes (Completed %)',
 'Minutes Played',
 'Duels Won',
 'Total Tackles',
 'Interceptions',
 'Blocks',
 'Red Cards',
 'Yellow Cards',
 'XG',
 'Touches in the Opposition Box',
 'Aerial Duels Won',
 'Assists',
 'Shots On Target Inside the Box',
 'Crosses (Completed %)',
 'Dribbles (Completed %)',
 'Fouls',
 'Goals',
 'Hit Woodwork',
 'Offsides',
 'Shots On Target Outside the Box',
 'Corners Taken',
 'Appearances',
 'Free Kicks Scored (Scored)',
 'Passes',
 'Own Goals',
 'Penalties Taken',
 'Goals Conceded',
 'Clean Sheets',
 'Saves Made',
 'Penalties Faced',
 'Penalties Taken (Scored)',
 'Penalties Saved (%)']
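
The same list of keys can be built without the repeated membership test on a list. This is a minimal sketch on toy dictionaries (the rows are hypothetical); it relies on the fact that Python dictionaries preserve insertion order.

```python
rows = [
    {"Nationality": "England", "Preferred Foot": "Right", "Goals": "0"},
    {"Nationality": "England", "Date of Birth": "13/07/1996"},
]

# Dict keys keep insertion order, so each key appears once, in first-seen order.
seen = {}
for stats in rows:
    seen.update(dict.fromkeys(stats))

possible_stats = list(seen)
print(possible_stats)
```

For roughly a thousand players the difference in speed is negligible; the main benefit is that the intent (an ordered set of keys) is explicit.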

#### Now let's drop all the other columns, since we don't need them


In [8]:
player_stats = player_stats[["player_name", "all_stats_on_page"]]
player_stats

Unnamed: 0,player_name,all_stats_on_page
0,Max Aarons,"{'Nationality': 'England', 'Preferred Foot': '..."
1,George Abbott,"{'Nationality': 'England', 'Preferred Foot': '..."
2,Zach Abbott,"{'Nationality': 'England', 'Preferred Foot': '..."
3,Josh Acheampong,"{'Nationality': 'England', 'Preferred Foot': '..."
4,Ché Adams,"{'Nationality': 'England', 'Date of Birth': '1..."
...,...,...
724,Edson Álvarez,"{'Nationality': 'Mexico', 'Preferred Foot': 'B..."
725,Julián Álvarez,"{'Nationality': 'Argentina', 'Preferred Foot':..."
726,Odsonne Édouard,"{'Nationality': 'France', 'Preferred Foot': 'R..."
727,Martin Ødegaard,"{'Nationality': 'Norway', 'Preferred Foot': 'L..."


#### Now let's define the logic to get the player name and all the player stats in row and column form

- Create a list that holds one dictionary of stats per player
- Iterate over each row in the player stats data frame
- For each row, store the player name and all their stats in a dictionary (if a stat isn't there, store 0 for numeric stats and "n/a" for text fields)
- For stats that pack two values into one string, such as "3 (2)", split them with the string split method
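
The splitting step in the last bullet can be sketched as a single helper, which would avoid repeating the same split and strip calls for every compound stat. The helper name `split_compound` and the regular expression are assumptions for illustration, based on the value shapes visible in the scraped data such as "3 (2)" and "6 (67%)".

```python
import re

# Matches values such as "3 (2)", "1,234 (85%)" or a bare "0".
COMPOUND = re.compile(r"^\s*([\d,]+)(?:\s*\(\s*([\d,]+)\s*%?\s*\))?\s*$")


def split_compound(value):
    """Return (first, second) as ints; second is 0 when there is no bracketed part."""
    match = COMPOUND.match(value)
    if match is None:
        raise ValueError(f"unexpected stat format: {value!r}")
    first = int(match.group(1).replace(",", ""))
    second = int(match.group(2).replace(",", "")) if match.group(2) else 0
    return first, second


print(split_compound("3 (2)"))
print(split_compound("1,234 (85%)"))
print(split_compound("0"))
```

Unlike unpacking the result of `split()`, this also accepts a bare "0" without a bracketed part, and it raises a descriptive error for any value that does not fit the expected pattern.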


In [9]:
new_player_stat_info_list = []

In [10]:
for index, row in player_stats.iterrows():
    new_player_stat_dict = dict()
    
    row_player_name = row["player_name"]
    row_player_stats = row["all_stats_on_page"]
    
    new_player_stat_dict["player_name"] = row_player_name
    
    
    for stat in possible_stats_list:
        if stat in row_player_stats.keys():
            # Separating appearances from substitute appearances
            if stat == "Appearances (Sub)":
                appearances_, sub_appearances = row_player_stats[stat].split()
                sub_appearances = sub_appearances.strip("()")

                new_player_stat_dict["appearances_"] = int(appearances_)
                new_player_stat_dict["sub_appearances"] = int(sub_appearances)

            # Separating the passes completed
            elif stat== "Passes (Completed %)":
                pass_attempts, pass_accuracy = row_player_stats[stat].split()

                pass_accuracy = pass_accuracy.strip("()")
                pass_accuracy = pass_accuracy.strip("%")

                new_player_stat_dict["pass_attempts"] = int(pass_attempts.replace(",", ""))
                new_player_stat_dict["pass_accuracy"] = int(pass_accuracy)

            # Separating long passes from long pass accuracy
            elif stat== "Long Passes (Completed %)":
                long_pass_attempts, long_pass_accuracy = row_player_stats[stat].split()

                long_pass_accuracy = long_pass_accuracy.strip("()")
                long_pass_accuracy = long_pass_accuracy.strip("%")

                new_player_stat_dict["long_pass_attempts"] = int(long_pass_attempts.replace(",", ""))
                new_player_stat_dict["long_pass_accuracy"] = int(long_pass_accuracy)

            # separating crosses from cross accuracy
            elif stat == "Crosses (Completed %)":
                cross_attempts, cross_accuracy = row_player_stats[stat].split()

                cross_accuracy = cross_accuracy.strip("()")
                cross_accuracy = cross_accuracy.strip("%")

                new_player_stat_dict["cross_attempts"] = int(cross_attempts.replace(",", ""))
                new_player_stat_dict["cross_accuracy"] = int(cross_accuracy)
            
            # Separating dribble attempts from dribble accuracy
            elif stat == "Dribbles (Completed %)":
                dribble_attempts, dribble_accuracy = row_player_stats[stat].split()

                dribble_accuracy = dribble_accuracy.strip("()")
                dribble_accuracy = dribble_accuracy.strip("%")

                new_player_stat_dict["dribble_attempts"] = int(dribble_attempts)
                new_player_stat_dict["dribble_accuracy"] = int(dribble_accuracy)

            # Separating freekicks taken from freekicks scored
            elif stat == "Free Kicks Scored (Scored)":
                free_kick_attempts, free_kicks_scored = row_player_stats[stat].split()

                free_kicks_scored = free_kicks_scored.strip("()")

                new_player_stat_dict["free_kick_attempts"] = int(free_kick_attempts)
                new_player_stat_dict["free_kicks_scored"] = int(free_kicks_scored)

            # Separating penalties taken from penalties scored   
            elif stat == "Penalties Taken (Scored)":
                penalty_attempts, penalties_scored = row_player_stats[stat].split()

                penalties_scored = penalties_scored.strip("()")

                new_player_stat_dict["penalty_attempts"] = int(penalty_attempts)
                new_player_stat_dict["penalties_scored"] = int(penalties_scored)

            elif stat == "XG":
                new_player_stat_dict[stat] = float(row_player_stats[stat])
            elif stat == "XA":
                new_player_stat_dict[stat] = float(row_player_stats[stat])
            
            elif stat == "Nationality" or stat == "Preferred Foot" or stat == "Date of Birth":
                new_player_stat_dict[stat] = row_player_stats[stat]
            
            elif stat == "Penalties Saved (%)":
                penalties_saved, penalty_save_precentage = row_player_stats[stat].split()

                penalty_save_precentage = penalty_save_precentage.strip("()")
                penalty_save_precentage = penalty_save_precentage.strip("%")

                new_player_stat_dict["penalties_saved"] = int(penalties_saved)
                new_player_stat_dict["penalty_save_precentage"] = int(penalty_save_precentage)

            
            else:
                new_player_stat_dict[stat] = int(row_player_stats[stat].replace(",", ""))
        else:
            if stat == "Appearances (Sub)":
                new_player_stat_dict["appearances_"] = 0
                new_player_stat_dict["sub_appearances"] = 0

            elif stat == "Passes (Completed %)":
                new_player_stat_dict["pass_attempts"] = 0
                new_player_stat_dict["pass_accuracy"] = 0

            elif  stat== "Long Passes (Completed %)":
                new_player_stat_dict["long_pass_attempts"] = 0
                new_player_stat_dict["long_pass_accuracy"] = 0

            elif stat == "Crosses (Completed %)":
                new_player_stat_dict["cross_attempts"] = 0
                new_player_stat_dict["cross_accuracy"] = 0

            elif stat == "Dribbles (Completed %)":
                new_player_stat_dict["dribble_attempts"] = 0
                new_player_stat_dict["dribble_accuracy"] = 0

            elif stat == "Free Kicks Scored (Scored)":
                new_player_stat_dict["free_kick_attempts"] = 0
                new_player_stat_dict["free_kicks_scored"] = 0

            elif stat == "Penalties Taken (Scored)":
                new_player_stat_dict["penalty_attempts"] = 0
                new_player_stat_dict["penalties_scored"] = 0
            
            elif stat == "XG":
                new_player_stat_dict[stat] = float(0)
            elif stat == "XA":
                new_player_stat_dict[stat] = float(0)
            elif stat == "Nationality" or stat == "Preferred Foot" or stat == "Date of Birth":
                new_player_stat_dict[stat] = "n/a"
            elif stat == "Penalties Saved (%)":
                new_player_stat_dict["penalties_saved"] = 0
                new_player_stat_dict["penalty_save_precentage"] = 0
            else:
                new_player_stat_dict[stat] = 0
    
    new_player_stat_info_list.append(new_player_stat_dict)
    
new_player_stat_info_list
    
#print(new_player_stat_info_list)

[{'player_name': 'Max Aarons',
  'Nationality': 'England',
  'Preferred Foot': 'Right',
  'Date of Birth': '04/01/2000',
  'appearances_': 3,
  'sub_appearances': 2,
  'XA': 0.02,
  'pass_attempts': 51,
  'pass_accuracy': 80,
  'long_pass_attempts': 5,
  'long_pass_accuracy': 60,
  'Minutes Played': 85,
  'Duels Won': 4,
  'Total Tackles': 2,
  'Interceptions': 1,
  'Blocks': 1,
  'Red Cards': 0,
  'Yellow Cards': 0,
  'XG': 0.0,
  'Touches in the Opposition Box': 0,
  'Aerial Duels Won': 0,
  'Assists': 0,
  'Shots On Target Inside the Box': 0,
  'cross_attempts': 0,
  'cross_accuracy': 0,
  'dribble_attempts': 0,
  'dribble_accuracy': 0,
  'Fouls': 0,
  'Goals': 0,
  'Hit Woodwork': 0,
  'Offsides': 0,
  'Shots On Target Outside the Box': 0,
  'Corners Taken': 0,
  'Appearances': 0,
  'free_kick_attempts': 0,
  'free_kicks_scored': 0,
  'Passes': 0,
  'Own Goals': 0,
  'Penalties Taken': 0,
  'Goals Conceded': 0,
  'Clean Sheets': 0,
  'Saves Made': 0,
  'Penalties Faced': 0,
  'pena

In [11]:
new_player_stats = pd.DataFrame(new_player_stat_info_list)
new_player_stats.head(5)

Unnamed: 0,player_name,Nationality,Preferred Foot,Date of Birth,appearances_,sub_appearances,XA,pass_attempts,pass_accuracy,long_pass_attempts,...,Own Goals,Penalties Taken,Goals Conceded,Clean Sheets,Saves Made,Penalties Faced,penalty_attempts,penalties_scored,penalties_saved,penalty_save_precentage
0,Max Aarons,England,Right,04/01/2000,3,2,0.02,51,80,5,...,0,0,0,0,0,0,0,0,0,0
1,George Abbott,England,Right,17/08/2005,0,0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Zach Abbott,England,Right,13/05/2006,0,0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Josh Acheampong,England,Right,05/05/2006,4,2,0.02,123,84,14,...,0,0,0,0,0,0,0,0,0,0
4,Ché Adams,England,,13/07/1996,0,0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Now that we have successfully reshaped the data, let's see what it looks like


In [12]:
new_player_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1116 entries, 0 to 1115
Data columns (total 47 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   player_name                      1116 non-null   object 
 1   Nationality                      1116 non-null   object 
 2   Preferred Foot                   1116 non-null   object 
 3   Date of Birth                    1116 non-null   object 
 4   appearances_                     1116 non-null   int64  
 5   sub_appearances                  1116 non-null   int64  
 6   XA                               1116 non-null   float64
 7   pass_attempts                    1116 non-null   int64  
 8   pass_accuracy                    1116 non-null   int64  
 9   long_pass_attempts               1116 non-null   int64  
 10  long_pass_accuracy               1116 non-null   int64  
 11  Minutes Played                   1116 non-null   int64  
 12  Duels Won           

In [13]:
new_player_stats.describe()

Unnamed: 0,appearances_,sub_appearances,XA,pass_attempts,pass_accuracy,long_pass_attempts,long_pass_accuracy,Minutes Played,Duels Won,Total Tackles,...,Own Goals,Penalties Taken,Goals Conceded,Clean Sheets,Saves Made,Penalties Faced,penalty_attempts,penalties_scored,penalties_saved,penalty_save_precentage
count,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,...,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0,1116.0
mean,8.25448,2.792115,0.613154,296.3181,38.168459,31.282258,20.120072,593.068996,32.183692,11.602151,...,0.027778,0.006272,0.999104,0.15681,2.095878,0.074373,0.066308,0.060036,0.012545,0.450717
std,12.328745,4.781482,1.338665,493.536369,40.909309,80.926973,25.526771,954.107362,53.348094,20.853135,...,0.194404,0.089623,6.229756,1.125154,13.368081,0.556113,0.517924,0.483798,0.133341,5.292375
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,16.0,4.0,0.5425,426.25,82.0,25.0,44.0,961.75,48.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,38.0,28.0,9.26,2922.0,95.0,886.0,88.0,3420.0,243.0,133.0,...,3.0,2.0,66.0,13.0,153.0,9.0,9.0,9.0,2.0,100.0


#### There are still other things to clean, such as data types and empty values, but the purpose of this notebook is to make the data more usable, and that has been done, so we move on to the next dataset
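
For reference, the remaining cleaning could look roughly like the sketch below. It runs on a toy frame with hypothetical values, and it assumes that dates are written day/month/year and that "n/a" and empty strings both mean "missing", as the outputs above suggest.

```python
import pandas as pd

sample = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "Preferred Foot": ["Right", "n/a", ""],
    "Date of Birth": ["04/01/2000", "n/a", "13/07/1996"],
})

# Turn the "n/a" placeholder and empty strings into proper missing values.
text_columns = ["Preferred Foot", "Date of Birth"]
placeholders = sample[text_columns].isin(["n/a", ""])
sample[text_columns] = sample[text_columns].mask(placeholders)

# Parse day/month/year strings; anything unparseable becomes NaT.
sample["Date of Birth"] = pd.to_datetime(
    sample["Date of Birth"], format="%d/%m/%Y", errors="coerce"
)

print(sample.dtypes)
print(sample)
```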


### Looking at the player information data


In [14]:
player_info = pd.read_csv("../../web_scraping/new/datasets/premier_player_info.csv")
player_info.head(10)

Unnamed: 0,player_image_url,player_name,player_country,player_club,player_position,player_stats_url
0,https://resources.premierleague.com/premierlea...,Max Aarons,England,Bournemouth,Defender,https://www.premierleague.com/en/players/23298...
1,https://resources.premierleague.com/premierlea...,George Abbott,England,Tottenham Hotspur,Midfielder,https://www.premierleague.com/en/players/51932...
2,https://resources.premierleague.com/premierlea...,Zach Abbott,England,Nottingham Forest,Defender,https://www.premierleague.com/en/players/54906...
3,https://resources.premierleague.com/premierlea...,Josh Acheampong,England,Chelsea,Defender,https://www.premierleague.com/en/players/57701...
4,https://resources.premierleague.com/premierlea...,Ché Adams,Scotland,Southampton,Forward,https://www.premierleague.com/en/players/20043...
5,https://resources.premierleague.com/premierlea...,Tyler Adams,United States,Bournemouth,Midfielder,https://www.premierleague.com/en/players/20078...
6,https://resources.premierleague.com/premierlea...,Tosin Adarabioyo,England,Chelsea,Defender,https://www.premierleague.com/en/players/10964...
7,https://resources.premierleague.com/premierlea...,Tayo Adaramola,Ireland,Crystal Palace,Defender,https://www.premierleague.com/en/players/50146...
8,https://resources.premierleague.com/premierlea...,Valintino Adedokun,Ireland,Brentford,Defender,https://www.premierleague.com/en/players/51643...
9,https://resources.premierleague.com/premierlea...,Simon Adingra,Cote d’Ivoire,Brighton and Hove Albion,Forward,https://www.premierleague.com/en/players/53581...


In [15]:
player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1116 entries, 0 to 1115
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   player_image_url  1116 non-null   object
 1   player_name       1116 non-null   object
 2   player_country    1116 non-null   object
 3   player_club       1116 non-null   object
 4   player_position   1116 non-null   object
 5   player_stats_url  1116 non-null   object
dtypes: object(6)
memory usage: 52.4+ KB


In [16]:
player_info.describe()

Unnamed: 0,player_image_url,player_name,player_country,player_club,player_position,player_stats_url
count,1116,1116,1116,1116,1116,1116
unique,786,1116,78,20,4,1116
top,https://resources.premierleague.com/premierlea...,Max Aarons,England,Chelsea,Defender,https://www.premierleague.com/en/players/23298...
freq,331,1,468,75,364,1


#### The player info data looks fine as far as being at least usable is concerned, so we move on to the next one


### Looking at the gameweek data for the different seasons


In [17]:
club_gameweek_table_example = pd.read_csv("../../web_scraping/new/datasets/league_table/home_and_away/gameweek_2024/2024_gameweek_38.csv")
club_gameweek_table_example.head(20)

Unnamed: 0,position,badge_url,name,games_played,games_won,games_drawn,games_lost,goals_for,goals_against,goal_difference,points
0,1,https://resources.premierleague.com/premierlea...,Liverpool,38,25,9,4,86,41,45,84
1,2,https://resources.premierleague.com/premierlea...,Arsenal,38,20,14,4,69,34,35,74
2,3,https://resources.premierleague.com/premierlea...,Manchester City,38,21,8,9,72,44,28,71
3,4,https://resources.premierleague.com/premierlea...,Chelsea,38,20,9,9,64,43,21,69
4,5,https://resources.premierleague.com/premierlea...,Newcastle United,38,20,6,12,68,47,21,66
5,6,https://resources.premierleague.com/premierlea...,Aston Villa,38,19,9,10,58,51,7,66
6,7,https://resources.premierleague.com/premierlea...,Nottingham Forest,38,19,8,11,58,46,12,65
7,8,https://resources.premierleague.com/premierlea...,Brighton and Hove Albion,38,16,13,9,66,59,7,61
8,9,https://resources.premierleague.com/premierlea...,Bournemouth,38,15,11,12,58,46,12,56
9,10,https://resources.premierleague.com/premierlea...,Brentford,38,16,8,14,66,57,9,56


In [18]:
club_gameweek_table_example.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   position         20 non-null     int64 
 1   badge_url        20 non-null     object
 2   name             20 non-null     object
 3   games_played     20 non-null     int64 
 4   games_won        20 non-null     int64 
 5   games_drawn      20 non-null     int64 
 6   games_lost       20 non-null     int64 
 7   goals_for        20 non-null     int64 
 8   goals_against    20 non-null     int64 
 9   goal_difference  20 non-null     int64 
 10  points           20 non-null     int64 
dtypes: int64(9), object(2)
memory usage: 1.8+ KB


In [19]:
club_gameweek_table_example.describe()

Unnamed: 0,position,games_played,games_won,games_drawn,games_lost,goals_for,goals_against,goal_difference,points
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,10.5,38.0,14.35,9.3,14.35,55.75,55.75,0.0,52.35
std,5.91608,0.0,6.002412,2.867238,6.960603,14.707231,14.421749,27.041878,18.576372
min,1.0,38.0,2.0,5.0,4.0,26.0,34.0,-60.0,12.0
25%,5.75,38.0,11.0,7.75,9.75,45.5,45.5,-11.25,42.0
50%,10.5,38.0,15.0,9.0,12.0,58.0,52.5,3.5,55.0
75%,15.25,38.0,19.25,10.25,18.5,66.0,62.75,14.25,66.0
max,20.0,38.0,25.0,15.0,30.0,86.0,86.0,45.0,84.0


#### The table looks fine, and all the data types look right. Let's move on to the next one
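
The visual check can be backed by a few arithmetic checks on the table. The sketch below copies the top two rows from the output above and assumes the usual rule of three points for a win and one for a draw. A points deduction in some season would legitimately fail the points check, so a failure is a prompt to look closer rather than proof of bad data.

```python
import pandas as pd

# Top two rows of the gameweek 38 table shown above.
table = pd.DataFrame({
    "name": ["Liverpool", "Arsenal"],
    "games_played": [38, 38],
    "games_won": [25, 20],
    "games_drawn": [9, 14],
    "games_lost": [4, 4],
    "goals_for": [86, 69],
    "goals_against": [41, 34],
    "goal_difference": [45, 35],
    "points": [84, 74],
})

games_total = table["games_won"] + table["games_drawn"] + table["games_lost"]
checks = pd.DataFrame({
    "games_add_up": games_total == table["games_played"],
    "goal_difference_ok": table["goals_for"] - table["goals_against"] == table["goal_difference"],
    "points_ok": 3 * table["games_won"] + table["games_drawn"] == table["points"],
})

# Any False cell points at a row that needs a closer look.
print(checks.all())
```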


### Looking at club statistics


In [20]:
club_stats_example = pd.read_csv("../../web_scraping/new/datasets/club_stats/2024_club_stats.csv")
club_stats_example.head(5)

Unnamed: 0,name,url,season,stats
0,Arsenal,https://www.premierleague.com/en/clubs/3/arsen...,2024/2025,"[{'Games Played': '38', 'Goals': '69', 'Goals ..."
1,Aston Villa,https://www.premierleague.com/en/clubs/7/aston...,2024/2025,"[{'Games Played': '38', 'Goals': '58', 'Goals ..."
2,Bournemouth,https://www.premierleague.com/en/clubs/91/bour...,2024/2025,"[{'Games Played': '38', 'Goals': '58', 'Goals ..."
3,Brentford,https://www.premierleague.com/en/clubs/94/bren...,2024/2025,"[{'Games Played': '38', 'Goals': '66', 'Goals ..."
4,Brighton and Hove Albion,https://www.premierleague.com/en/clubs/36/brig...,2024/2025,"[{'Games Played': '38', 'Goals': '66', 'Goals ..."


#### Without even describing the data, we can see that the stats column holds a list containing a dictionary, stored as a string, so let us deal with that.


##### We have to be strategic about this because we have only looked at one example, and the same steps have to be carried out across the stats for all available seasons

- Using a loop, we iterate through all the files
- Create a dataframe for each of them
- Apply the transformations
- Write the clean versions to a specified location
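
Before the full transformation, it may help to see the core flattening step in isolation. This is a minimal sketch on a toy frame with hypothetical club names and values; it assumes each stats cell is a string encoding a list with exactly one dictionary, as in the preview above.

```python
import ast

import pandas as pd

raw = pd.DataFrame({
    "name": ["Club A", "Club B"],
    "season": ["2024/2025", "2024/2025"],
    "stats": [
        "[{'Games Played': '38', 'Passes': '1,234'}]",
        "[{'Games Played': '38', 'Own Goals': '2'}]",
    ],
})

# Parse each string into a list and keep its first (only) dictionary.
stat_dicts = raw["stats"].apply(lambda cell: ast.literal_eval(cell)[0])

# One column per key; a key that is missing for a club becomes NaN.
expanded = pd.DataFrame(stat_dicts.tolist(), index=raw.index)

flat = pd.concat([raw[["name", "season"]], expanded], axis=1)
print(flat)
```

The values are still strings at this point ("1,234", for instance), and compound stats such as "Crosses (Completed %)" are not split, which is what the longer function that follows takes care of.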


In [21]:
def transform_club_stats_for_file(club_stats_example):
    # First, convert the string representations into Python objects so we can work with them
    club_stats_example["stats"] = club_stats_example["stats"].apply(
        lambda x: ast.literal_eval(x) if isinstance(x, str) else x
    )

    # Each parsed value is a list holding a single dictionary, so keep its first element
    club_stats_example["stats"] = club_stats_example["stats"].apply(
        lambda x: x[0] if isinstance(x, list) else x
    )
    # Now let's check that it is now a dictionary
    print(type(club_stats_example["stats"].iloc[0]))

    # Now let's create a list to hold all the possible stats we can have
    all_possible_club_stats_list = []
    for stat_row in club_stats_example["stats"]:
        for k,v in stat_row.items():
            if k in all_possible_club_stats_list:
                continue
            else:
                all_possible_club_stats_list.append(k)

    print(all_possible_club_stats_list)


    # Now let's create a list to hold one dictionary of club statistics per row
    new_club_stats_dict_list = []

    # Now let's iterate over the rows of the original club statistics dataframe
    for index, row in club_stats_example.iterrows():

        # Create a dictionary to hold the information for this row
        current_row_dict = dict()

        # now lets get the club name, club url, and season for this row
        current_row_dict["club_name"] = row["name"]
        current_row_dict["club_url"] = row["url"]
        current_row_dict["season"] = row["season"]
        row_stats = row["stats"]

        # Now we go into the club stats, and we do it neatly using the possible stat list we created
        for stat in all_possible_club_stats_list:
            print(stat)
            if stat in row_stats.keys():
                if stat == "Penalties Taken (Scored)":
                    if row_stats[stat] == "0":
                        current_row_dict["penalties"] = 0
                        current_row_dict["penalties_scored"] = 0
                    else:
                        penalties, penalties_scored = row_stats[stat].split()
                        penalties_scored = penalties_scored.strip("()")

                        current_row_dict["penalties"] = int(penalties)
                        current_row_dict["penalties_scored"] = int(penalties_scored)
                
                elif stat == "Crosses (Completed %)":
                    if row_stats[stat] == "0":
                        current_row_dict["crosses"] = 0
                        current_row_dict["cross_accuracy"] = 0
                    else:
                        crosses, cross_accuracy = row_stats[stat].split()
                        cross_accuracy = cross_accuracy.strip("()")
                        cross_accuracy = cross_accuracy.strip("%")

                        current_row_dict["crosses"] = int(crosses)
                        current_row_dict["cross_accuracy"] = int(cross_accuracy)

                elif stat == "Long Passes (Completed %)":
                    if row_stats[stat] == "0":
                        current_row_dict["long_passes"] = 0
                        current_row_dict["long_pass_accuracy"] = 0
                    else:
                        long_passes, long_pass_accuracy = row_stats[stat].split()
                        long_pass_accuracy = long_pass_accuracy.strip("()")
                        long_pass_accuracy = long_pass_accuracy.strip("%")

                        current_row_dict["long_passes"] = int(long_passes.replace(",", ""))
                        current_row_dict["long_pass_accuracy"] = int(long_pass_accuracy)


                elif stat == "Dribbles (Completed %)":
                    if row_stats[stat] == "0":
                        current_row_dict["dribble_attempts"] = 0
                        current_row_dict["dribble_accuracy"] = 0
                    else:
                        dribble_attempts, dribble_accuracy = row_stats[stat].split()
                        dribble_accuracy = dribble_accuracy.strip("()")
                        dribble_accuracy = dribble_accuracy.strip("%")

                        current_row_dict["dribble_attempts"] = int(dribble_attempts)
                        current_row_dict["dribble_accuracy"] = int(dribble_accuracy)
                        
                elif stat == "Free Kicks Scored (Scored)":
                    if row_stats[stat] == "0":
                        current_row_dict["free_kicks_taken"] = 0
                        current_row_dict["free_kicks_scored"] = 0
                    else:
                        free_kicks_taken, free_kicks_scored = row_stats[stat].split()
                        free_kicks_scored = free_kicks_scored.strip("()")

                        current_row_dict["free_kicks_taken"] = int(free_kicks_taken)
                        current_row_dict["free_kicks_scored"] = int(free_kicks_scored)

                elif stat == "Penalties Saved (%)":
                    if row_stats[stat] == "0":
                        current_row_dict["penalties_saved"] = 0
                        current_row_dict["penalty_save_precentage"] = 0
                    else:
                        penalties_saved, penalty_save_precentage = row_stats[stat].split()
                        penalty_save_precentage = penalty_save_precentage.strip("()")
                        penalty_save_precentage = penalty_save_precentage.strip("%")

                        current_row_dict["penalties_saved"] = int(penalties_saved)
                        current_row_dict["penalty_save_precentage"] = int(penalty_save_precentage)
                elif stat == "XG":
                    current_row_dict[stat] = float(row_stats[stat].replace(",", ""))
                else:
                    current_row_dict[stat] = int(row_stats[stat].replace(",", ""))
            else:
                if stat == "Penalties Taken (Scored)":
                    current_row_dict["penalties"] = 0
                    current_row_dict["penalties_scored"] = 0

                elif stat == "Crosses (Completed %)":
                    current_row_dict["crosses"] = 0
                    current_row_dict["cross_accuracy"] = 0
                
                elif stat == "Long Passes (Completed %)":
                    current_row_dict["long_passes"] = 0
                    current_row_dict["long_pass_accuracy"] = 0

                elif stat == "Dribbles (Completed %)":
                    current_row_dict["dribble_attempts"] = 0
                    current_row_dict["dribble_accuracy"] = 0

                elif stat == "Free Kicks Scored (Scored)":
                    current_row_dict["free_kicks_taken"] = 0
                    current_row_dict["free_kicks_scored"] = 0
                
                elif stat == "Penalties Saved (%)":
                    current_row_dict["penalties_saved"] = 0
                    current_row_dict["penalty_save_precentage"] = 0

                elif stat == "XG":
                    current_row_dict[stat] = 0
                else:
                    current_row_dict[stat] = 0
        
        new_club_stats_dict_list.append(current_row_dict)

    return new_club_stats_dict_list

#### Now that we have written the function to transform the club statistics, we have to write the logic to dynamically load, transform, and store the data

- First we iterate over all the file names (this is why file naming has to be consistent)
- Then we use the transform function to transform the data in the file
- Then we convert the transformed data from a list of dictionaries to a pandas data frame
- Finally we write the transformed data to the target directory
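
The file handling in these steps can be sketched separately from the transformation. The generator below (its name `season_paths` is hypothetical) builds the input and output paths from the naming pattern used in this notebook, creates the target directory if needed, and skips seasons whose file is missing. The demo runs in a temporary directory, so it does not touch the real datasets.

```python
import tempfile
from pathlib import Path


def season_paths(source_dir, target_dir, first_season=2016, last_season=2024):
    """Yield (season, input_path, output_path) for every season file that exists."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    # Create the output directory up front so that writing does not fail later.
    target_dir.mkdir(parents=True, exist_ok=True)
    for season in range(first_season, last_season + 1):
        input_path = source_dir / f"{season}_club_stats.csv"
        if not input_path.exists():
            print(f"skipping missing file: {input_path.name}")
            continue
        yield season, input_path, target_dir / f"{season}_season_club_stats.csv"


# Demo with two empty placeholder files; the 2017 file is deliberately absent.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "club_stats"
    source.mkdir()
    (source / "2016_club_stats.csv").write_text("name,stats\n")
    (source / "2018_club_stats.csv").write_text("name,stats\n")
    found = [
        (season, out.name)
        for season, _, out in season_paths(source, Path(tmp) / "clean", 2016, 2018)
    ]

print(found)
```

Inside the real loop, each yielded input path would be read with `pd.read_csv`, passed to the transform function, and the result written to the yielded output path.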


In [22]:
# Dynamically iterating through each file name

for season in range(2016, 2025):
    # Reading the raw club stats file for this season
    club_stats_example = pd.read_csv(f"../../web_scraping/new/datasets/club_stats/{season}_club_stats.csv")

    # using the transform function to transform the data 
    new_club_stats_list = transform_club_stats_for_file(club_stats_example)

    #Creating a dataframe
    new_club_stats = pd.DataFrame(new_club_stats_list)

    # storing the transformed dataset
    new_club_stats.to_csv(f"datasets/club_stats/{season}_season_club_stats.csv", index=False)

<class 'dict'>
['Games Played', 'Goals', 'Goals Conceded', 'Shots', 'Shots On Target', 'Penalties Taken (Scored)', 'Free Kicks Scored (Scored)', 'Hit Woodwork', 'Crosses (Completed %)', 'Interceptions', 'Blocks', 'Clearances', 'Passes', 'Long Passes (Completed %)', 'Corners Taken', 'Dribbles (Completed %)', 'Duels Won', 'Aerial Duels Won', 'Red Cards', 'Yellow Cards', 'Fouls', 'Offsides', 'Own Goals', 'Free Kicks Scored', 'Penalties Saved (%)']
Games Played
Goals
Goals Conceded
Shots
Shots On Target
Penalties Taken (Scored)
Free Kicks Scored (Scored)
Hit Woodwork
Crosses (Completed %)
Interceptions
Blocks
Clearances
Passes
Long Passes (Completed %)
Corners Taken
Dribbles (Completed %)
Duels Won
Aerial Duels Won
Red Cards
Yellow Cards
Fouls
Offsides
Own Goals
Free Kicks Scored
Penalties Saved (%)
Games Played
Goals
Goals Conceded
Shots
Shots On Target
Penalties Taken (Scored)
Free Kicks Scored (Scored)
Hit Woodwork
Crosses (Completed %)
Interceptions
Blocks
Clearances
Passes
Long Passes

#### Let us also write the player stats we transformed earlier to this new directory


In [23]:
new_player_stats.to_csv("datasets/player_stats_2024_2025_season.csv", index=False)

# The End
