## Exploratory Data Analysis (EDA)

## Introduction

In this lecture, we will explore the importance of Exploratory Data Analysis (EDA) in football analysis and provide an overview of the key concepts and techniques covered.

EDA is a crucial step in the data analysis process: it lets us summarize the data, spot patterns, and check assumptions before drawing conclusions. In football analysis, it helps us uncover information about player performance, team strategies, and game dynamics.

Throughout this lecture, we will cover the following key concepts and techniques:
 - Descriptive Statistics
 - Distributions & Dispersion Metrics
 - Relational Analysis
 - Correlation & Covariance
 - Data Distribution and Patterns
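
As a quick illustration of two of the concepts listed above, dispersion metrics and covariance versus correlation, here is a minimal sketch on a small, invented table of shots. The `toy_shots` data and all variable names are hypothetical and are not taken from the StatsBomb data used later.

```python
import pandas as pd

# Hypothetical shot data, invented purely for illustration
toy_shots = pd.DataFrame({
    "distance": [6.0, 11.0, 16.0, 22.0, 30.0],
    "xg": [0.45, 0.30, 0.12, 0.06, 0.02],
})

# Dispersion: sample standard deviation and interquartile range (IQR)
distance_std = toy_shots["distance"].std()
q1, q3 = toy_shots["distance"].quantile([0.25, 0.75])
distance_iqr = q3 - q1

# Covariance carries the units of both variables;
# correlation rescales it to the range [-1, 1]
toy_cov = toy_shots["distance"].cov(toy_shots["xg"])
toy_corr = toy_shots["distance"].corr(toy_shots["xg"])

print(f"Std of distance: {distance_std:.2f}")
print(f"IQR of distance: {distance_iqr:.2f}")
print(f"Covariance(distance, xg): {toy_cov:.3f}")
print(f"Correlation(distance, xg): {toy_corr:.3f}")
```

Both the covariance and the correlation are negative here because xG falls as distance grows, but only the correlation can be compared across pairs of variables measured in different units.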


In [None]:
pip install statsbombpy

In [None]:
pip install mplsoccer

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsbombpy import sb
from mplsoccer import Pitch, VerticalPitch
from math import pi
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 150)
pd.options.mode.copy_on_write = True
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load competition data
competitions = sb.competitions()
print("Available competitions:")

display(competitions.head())

In [None]:
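# Get information about a specific competition and season
competition_id = 43  # 43 is the FIFA World Cup in StatsBomb's open data
season_id = 3  # 3 is the 2018 season
# competition_id and season_id are columns, not the index, so filter on them
# rather than calling competitions.loc[competition_id], which selects by row label
competition_info = competitions[
    (competitions['competition_id'] == competition_id)
    & (competitions['season_id'] == season_id)
].iloc[0]
print(f"\nInformation about competition {competition_info['competition_name']}:")
print(f"Country: {competition_info['country_name']}")
print(f"Gender: {competition_info['competition_gender']}")
print(f"Season: {competition_info['season_name']}")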

In [None]:
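# Load match data for the selected competition and season
season_id = 3  # 3 represents the 2018 season
matches = sb.matches(competition_id, season_id)
# Look up the names by column value; the competitions index is positional
selected = competitions[
    (competitions['competition_id'] == competition_id)
    & (competitions['season_id'] == season_id)
].iloc[0]
print(f"\nNumber of matches in the {selected['competition_name']} {selected['season_name']}: {len(matches)}")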


In [None]:
# Get information about a specific match
match_id = matches.loc[0, 'match_id']  # Select the first match
match_info = matches.loc[matches['match_id'] == match_id].iloc[0]
print(f"\nInformation about match {match_id}:")
print(f"Home team: {match_info['home_team']}")
print(f"Away team: {match_info['away_team']}")
print(f"Match date: {match_info['match_date']}")
print(f"Stadium: {match_info['stadium']}")

In [None]:
# Load event data for the selected match
events = sb.events(match_id)
print(f"\nNumber of events in match {match_id}: {len(events)}")

In [None]:
# Assuming `events` contains the loaded event data from a specific match
# Filter to include only certain types of events or actions for detailed analysis
shots = events[events['type'] == 'Shot']
passes = events[events['type'] == 'Pass']



In [None]:
# Descriptive statistics for shots
print("Shots Descriptive Statistics:")
display(shots.describe())

# Descriptive statistics for passes
print("\nPasses Descriptive Statistics:")
display(passes.describe())


In [None]:
# Shot distances
shots['distance'] = shots.apply(lambda x: np.linalg.norm([x['location'][0]-120, x['location'][1]-40]), axis=1)
plt.figure(figsize=(10, 6))
sns.histplot(shots['distance'], kde=True)
plt.title('Distribution of Shot Distances')
plt.xlabel('Distance to Goal')
plt.ylabel('Frequency')
plt.show()

# Pass lengths
passes['length'] = passes['pass_length']
plt.figure(figsize=(10, 6))
sns.histplot(passes['length'], kde=True, bins=30)
plt.title('Distribution of Pass Lengths')
plt.xlabel('Pass Length')
plt.ylabel('Frequency')
plt.show()


In [None]:

# Compare the distributions of shot distances between successful and unsuccessful shots
successful_shots = shots[shots['shot_outcome'] == 'Goal']
unsuccessful_shots = shots[shots['shot_outcome'] != 'Goal']

plt.figure(figsize=(10, 6))
plt.hist([successful_shots['distance'], unsuccessful_shots['distance']], bins=20, label=['Successful', 'Unsuccessful'])
plt.xlabel('Shot Distance')
plt.ylabel('Frequency')
plt.title('Distribution of Shot Distances by Outcome')
plt.legend()
plt.show()

In [None]:
# Analyzing shot outcomes
print("\nAnalyzing shot outcomes")
shot_outcome_counts = shots['shot_outcome'].value_counts()

plt.figure(figsize=(10, 6))
plt.bar(shot_outcome_counts.index, shot_outcome_counts.values)
plt.xlabel('Shot Outcome')
plt.ylabel('Count')
plt.title('Shot Outcome Distribution')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Analyzing pass outcomes
print("\nAnalyzing pass outcomes")
pass_outcome_counts = passes['pass_outcome'].value_counts()

plt.figure(figsize=(10, 6))
plt.bar(pass_outcome_counts.index, pass_outcome_counts.values)
plt.xlabel('Pass Outcome')
plt.ylabel('Count')
plt.title('Pass Outcome Distribution')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Create an instance of the pitch
pitch = Pitch(pitch_type='statsbomb',
              pitch_color='grass', line_color='white')

# Plotting heatmaps for shots and passes
fig, axs = pitch.draw(nrows=1, ncols=2, figsize=(16, 7))

shot_locations = shots["location"].tolist()
recovery_passes = passes[passes["pass_type"] == "Recovery"]["location"].tolist()
x,y = zip(*shot_locations)
i,j = zip(*recovery_passes)


# Shots density contours (`shade` is deprecated in seaborn; `fill` replaces it)
sns.kdeplot(x=x, y=y, fill=False, levels=50, color='red', thresh=0.01, ax=axs[0])
axs[0].set_title('Shot Heatmap')

# Recovery passes density contours
sns.kdeplot(x=i, y=j, fill=False, levels=50, color='blue', thresh=0.01, ax=axs[1])
axs[1].set_title('Pass Recovery Heatmap')

plt.show()


## Correlation

In [None]:
# Investigate the correlation between shot distance and xG
plt.figure(figsize=(10, 6))
plt.scatter(shots['distance'], shots['shot_statsbomb_xg'])
plt.xlabel('Shot Distance')
plt.ylabel('xG')
plt.title('Correlation between Shot Distance and xG')
plt.show()

corr_distance_xg = shots['distance'].corr(shots['shot_statsbomb_xg'])
print(f"Correlation between Shot Distance and xG: {corr_distance_xg:.2f}")

In [None]:
# Create binary indicators for specific pass types and success
passes['is_cross'] = passes['pass_cross'].notnull().astype(int)
passes['is_switch'] = passes['pass_switch'].notnull().astype(int)
passes['pass_success'] = pd.isna(passes['pass_outcome']).astype(int)

# Prepare a DataFrame for correlation analysis including the relevant columns
correlation_df = passes[['pass_length', 'pass_angle', 'is_cross', 'is_switch', 'pass_success']]

# Calculating the correlation matrix
correlation_matrix = correlation_df.corr()

# Visualizing the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Heatmap of Pass Metrics')
plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Convert 'shot_outcome' into a binary indicator (1 for Goal, 0 for others)
shots['is_goal'] = (shots['shot_outcome'] == 'Goal').astype(int)

# Convert 'shot_technique' and 'shot_body_part' into binary indicators for simplicity
shots['technique_normal'] = (shots['shot_technique'] == 'Normal').astype(int)
shots['body_part_right_foot'] = (shots['shot_body_part'] == 'Right Foot').astype(int)

# Select columns to include in the correlation analysis
columns_to_include = ['distance', 'shot_statsbomb_xg', 'is_goal', 'technique_normal', 'body_part_right_foot']

# Drop any rows with NaN values in these columns to ensure clean correlation analysis
clean_shots = shots[columns_to_include].dropna()


# Calculate the correlation matrix
correlation_matrix = clean_shots.corr()

# Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Heatmap of Shot Metrics')
plt.show()


In [None]:
# Analyzing player performance
print("\nAnalyzing player performance")
player_shots = shots.groupby('player').size().reset_index(name='total_shots')
player_goals = shots[shots['shot_outcome'] == 'Goal'].groupby('player').size().reset_index(name='total_goals')
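# A left join keeps players who took shots but did not score; their goal count becomes 0
player_performance = pd.merge(player_shots, player_goals, on='player', how='left')
player_performance['total_goals'] = player_performance['total_goals'].fillna(0).astype(int)
player_performance['goal_conversion_rate'] = player_performance['total_goals'] / player_performance['total_shots']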

print("Player performance:")
display(player_performance.sort_values('total_shots', ascending=False).head(10))

# Plotting top players by shots and goals
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.bar(player_performance['player'], player_performance['total_shots'])
plt.xlabel('Player')
plt.ylabel('Total Shots')
plt.title('Top Players by Total Shots')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
plt.bar(player_performance['player'], player_performance['total_goals'])
plt.xlabel('Player')
plt.ylabel('Total Goals')
plt.title('Top Players by Total Goals')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Radar Plots

In [None]:

def plot_radar_chart(data, attributes, title="Player Attributes"):
    N = len(attributes)
    # Copy the input so the caller's list is not modified,
    # then repeat the first value to close the polygon
    values = list(data) + list(data[:1])
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]

    ax = plt.subplot(111, polar=True)
    plt.xticks(angles[:-1], attributes, color='grey', size=8)
    ax.plot(angles, values, linewidth=1, linestyle='solid', label="Attributes")
    ax.fill(angles, values, 'b', alpha=0.1)
    plt.title(title)
    plt.show()


player_attributes = [60, 40, 70, 80, 50]  # example metrics like speed, accuracy, stamina, etc.
attributes = ['Speed', 'Accuracy', 'Stamina', 'Agility', 'Strength']
plot_radar_chart(player_attributes, attributes)


In [None]:
# Filter events for Harry Maguire
player_events = events[events['player'] == 'Harry Maguire']

# Calculate Pass Accuracy
total_passes = player_events[player_events['type'] == 'Pass']
completed_passes = total_passes[total_passes['pass_outcome'].isna()]
pass_accuracy = len(completed_passes) / len(total_passes) if len(total_passes) > 0 else 0

# Shots
shots_taken = player_events[player_events['type'] == 'Shot']
# 'Off T' means off target, so it does not count; on-target outcomes are goals and saves
shots_on_target = shots_taken[shots_taken['shot_outcome'].isin(['Goal', 'Saved', 'Saved to Post'])]
shot_accuracy = len(shots_on_target) / len(shots_taken) if len(shots_taken) > 0 else 0

# Print results
print(f"Pass Accuracy: {pass_accuracy*100:.2f}%")
print(f"Shots Taken: {len(shots_taken)}")
print(f"Shot Accuracy: {shot_accuracy*100:.2f}%")


In [None]:
# Define attributes and data for radar chart
attributes = ['Pass Accuracy', 'Shots Taken', 'Shot Accuracy']
data = [pass_accuracy*100, len(shots_taken), shot_accuracy*100]

# Plot the radar chart
plot_radar_chart(data, attributes)
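
The radar chart above mixes two percentages with a raw shot count, so its axes are not on a common scale. One possible way to handle this is to rescale every metric to 0-100 against a chosen ceiling before plotting. The sketch below uses a hypothetical helper, `scale_to_percent`, and assumed ceilings; neither is part of the original notebook, and the input values are invented.

```python
def scale_to_percent(values, maxima):
    """Scale each raw value to the range 0-100 relative to its ceiling."""
    scaled = []
    for value, ceiling in zip(values, maxima):
        if ceiling <= 0:
            scaled.append(0.0)  # avoid division by zero for degenerate ceilings
        else:
            scaled.append(min(100.0, 100.0 * value / ceiling))
    return scaled


# Hypothetical values: pass accuracy (%), shots taken, shot accuracy (%)
raw_values = [85.0, 3, 66.7]
# Assumed ceilings; 6 shots per match is an arbitrary choice for illustration
maxima = [100.0, 6, 100.0]

scaled_values = scale_to_percent(raw_values, maxima)
print(scaled_values)
```

The scaled list can then be passed to `plot_radar_chart` in place of the raw values. The choice of ceiling matters: a league-wide maximum or a percentile would usually be more defensible than a hand-picked number.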


## xG Flow charts

In [None]:
events = sb.events(match_id=18236)
df = events[events['type'] == 'Shot']
df = df[['period', 'minute', 'shot_statsbomb_xg', 'team', 'player', 'shot_outcome']]
df = df.rename(columns={'shot_statsbomb_xg': 'xG', 'shot_outcome': 'result'})
df = df.sort_values(by='team')
df.head(10)

In [None]:
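# Note: df is sorted alphabetically by team name, so "home" and "away" here are
# only labels for the two teams, not necessarily the real home and away sides
home_team = df["team"].iloc[0]
away_team = df["team"].iloc[-1]
print('Home Team : ' + home_team)
print('Away Team : ' + away_team)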

In [None]:
df_home = df.loc[df["team"] == home_team].sort_values(by="minute")
df_away = df.loc[df["team"] == away_team].sort_values(by="minute")
df_home["h_cum"] = df_home["xG"].cumsum()
df_away["h_cum"] = df_away["xG"].cumsum()

In [None]:
# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the xG flow for each team
ax.plot(df_home['minute'], df_home['h_cum'], marker='o', linestyle='-', linewidth=2, markersize=8, label=home_team, color='blue')
ax.plot(df_away['minute'], df_away['h_cum'], marker='o', linestyle='-', linewidth=2, markersize=8, label=away_team, color='red')

# Set the x-axis and y-axis labels and title
ax.set_xlabel('Minute')
ax.set_ylabel('Cumulative xG')
ax.set_title(f'xG Flow Chart - {home_team} vs {away_team}')

# Set the y-axis range
ax.set_ylim(0, max(df_home['h_cum'].max(), df_away['h_cum'].max()) + 0.1)

# Add a grid
ax.grid(True, linestyle='--', alpha=0.7)

# Add goal annotations
for idx, row in df_home[df_home['result'] == 'Goal'].iterrows():
    ax.annotate(f"{row['player']} ({row['minute']}')", xy=(row['minute'], row['h_cum']), xytext=(0, 10), textcoords='offset points', ha='center', va='bottom')

for idx, row in df_away[df_away['result'] == 'Goal'].iterrows():
    ax.annotate(f"{row['player']} ({row['minute']}')", xy=(row['minute'], row['h_cum']), xytext=(0, -10), textcoords='offset points', ha='center', va='top')

# Add a legend in the upper right corner
ax.legend(loc='upper right')

# Display the total xG for each team in the upper left corner
home_total_xg = round(df_home['xG'].sum(), 2)
away_total_xg = round(df_away['xG'].sum(), 2)
ax.text(0.05, 0.95, f"{home_team} Total xG: {home_total_xg}", transform=ax.transAxes, fontsize=12, verticalalignment='top')
ax.text(0.05, 0.90, f"{away_team} Total xG: {away_total_xg}", transform=ax.transAxes, fontsize=12, verticalalignment='top')

# Adjust the plot layout and display
plt.tight_layout()
plt.show()

## Homework Assignment: xG Flow Chart Function

Instructions:

1. Take the code provided for creating the xG Flow Chart and convert it into a reusable function called `plot_xg_flow_chart`.
   The function should accept the following parameters:
   - `df_home`: DataFrame containing the home team's shot data
   - `df_away`: DataFrame containing the away team's shot data
   - `home_team`: Name of the home team (string)
   - `away_team`: Name of the away team (string)

2. The function should create and display the xG Flow Chart based on the provided data and team names.

3. Test your function by applying it to different matches. You can use the `sb.events` function from the `statsbombpy` library
   to retrieve event data for various matches, filter it down to shots, and pass the relevant data to your `plot_xg_flow_chart` function.

4. Experiment with different customization options to further enhance the appearance and readability of the chart.
   Consider adding team logos, adjusting colors based on team preferences, or highlighting specific events or periods of interest.

5. Document your function by adding a docstring that explains its purpose, parameters, and any additional information
   that would be helpful for users.


Note: Make sure to handle any necessary data preprocessing steps within your function to ensure it works smoothly with the
provided data. The function parameters listed above are only a suggestion, so feel free to improvise!
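
As a possible starting point for the preprocessing step only, the sketch below splits a shot table by team and adds a running xG total. The helper name `cumulative_xg_by_team`, the `cum_xg` column, and the toy data are all hypothetical; the column names `period`, `minute`, `team`, and `xG` are assumed to match the renamed shot DataFrame from the xG flow chart cells. The plotting itself is left for the assignment.

```python
import pandas as pd


def cumulative_xg_by_team(shots_df, team_col="team", xg_col="xG"):
    """Return {team: DataFrame} with shots in time order and a running xG total."""
    result = {}
    for team, team_shots in shots_df.groupby(team_col):
        # Sort by period first so that first-half stoppage time stays before the second half
        ordered = team_shots.sort_values(["period", "minute"]).copy()
        ordered["cum_xg"] = ordered[xg_col].cumsum()
        result[team] = ordered
    return result


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "period": [1, 1, 2, 2],
    "minute": [12, 30, 50, 75],
    "team": ["A", "B", "A", "B"],
    "xG": [0.10, 0.25, 0.40, 0.05],
})

flows = cumulative_xg_by_team(toy)
for team, team_df in flows.items():
    print(team, round(team_df["cum_xg"].iloc[-1], 2))
```
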

Happy coding, and let me know if you have any questions!
# Your code here