# Statistical analysis of the 2024 Copa America soccer tournament

## Brief overview

I will write this after the project is complete...

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
match_data = pd.read_csv("international-copa-america-matches-2024-to-2024-stats.csv")
team_data = pd.read_csv("international-copa-america-teams-2024-to-2024-stats.csv")

## Step 1: Data Cleaning and Formatting

To create meaningful visualizations and models, we need to clean and format the raw CSV data to suit the later analysis. First, I filtered the data to include only attributes that are complete, consistent, and relevant to the group stage matches. Then, I reformatted the data to improve its readability and usability.
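As a minimal sketch of how "complete" attributes might be identified, the snippet below measures the share of missing values per column and keeps only the columns with none. The toy frame and its `attendance` column are hypothetical stand-ins, not taken from the real match file, and this is one possible check rather than the exact procedure used here.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw match data;
# the values (and the 'attendance' column) are made up for illustration.
toy_matches = pd.DataFrame({
    "home_team_name": ["A", "B", "C"],
    "home_team_shots": [10, 7, 12],
    "attendance": [None, 50000, None],  # deliberately incomplete column
})

# Share of missing values per column; 0.0 means the column is complete.
missing_share = toy_matches.isna().mean()

# Keep only the names of columns with no missing values.
complete_columns = missing_share[missing_share == 0].index.tolist()
print(complete_columns)
```

Running the same check on `match_data` would list which raw attributes are safe to carry forward without imputation.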

For this study I will use two tables. The first holds two rows for every match played: one row with the home team's statistics for that match and one with the away team's. The second is the per-team mean of the first table, so each row represents a single team's average statistics across its three group stage matches.
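The two-table layout can be illustrated on a single made-up match. This is only a sketch: the teams "A" and "B" and their scores are invented, and only a few of the real columns are included.

```python
import pandas as pd

# One made-up match in the raw "wide" layout (one row per match).
toy = pd.DataFrame({
    "home_team_name": ["A"],
    "away_team_name": ["B"],
    "home_team_goal_count": [2],
    "away_team_goal_count": [1],
})

names = ["Team", "Opponent", "Goals", "Goals_Against"]
home_cols = ["home_team_name", "away_team_name",
             "home_team_goal_count", "away_team_goal_count"]
# Swap the prefixes to view the same match from the away side.
away_cols = [c.replace("home", "away") if "home" in c else c.replace("away", "home")
             for c in home_cols]

# Table 1: one row per team per match.
home_view = toy[home_cols].set_axis(names, axis=1)
away_view = toy[away_cols].set_axis(names, axis=1)
per_team_match = pd.concat([home_view, away_view], ignore_index=True)

# Table 2: one row per team, averaged over that team's matches.
per_team_avg = (per_team_match.drop(columns=["Opponent"])
                .groupby("Team", as_index=False).mean())
print(per_team_match)
print(per_team_avg)
```

With only one match, each team's average equals its single-match values; with the real data, each average covers three matches.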

In [18]:
# Keep only group stage matches (rows that have a 'Game Week' value) and rename the expected goals columns for consistency
group_stage_matches = match_data.dropna(subset=['Game Week']).rename(columns={'team_a_xg':'home_team_xg', 'team_b_xg':'away_team_xg'})

# Filter only the relevant attributes for the home and away teams
home_relevant_columns = ['home_team_name', 'away_team_name', 'Game Week', 'home_team_goal_count', 'away_team_goal_count', 'home_team_corner_count', 'home_team_yellow_cards', 'home_team_red_cards', 'home_team_shots', 'home_team_shots_on_target', 'home_team_fouls', 'home_team_possession', 'home_team_xg']
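# Copy the selections so that renaming their columns later never touches group_stage_matches
home_team_group_stage_data = group_stage_matches[home_relevant_columns].copy()
# Swap the 'home' and 'away' prefixes to build the same column list from the away team's perspective
away_relevant_columns = [col.replace('home', 'away') if 'home' in col else col.replace('away', 'home') for col in home_relevant_columns]
away_team_group_stage_data = group_stage_matches[away_relevant_columns].copy()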

# Combine the home team dataframe and away team dataframe to represent each team's stats in every group stage match
new_column_names = ['Team', 'Opponent', 'Game Week', 'Goals', 'Goals_Against', 'Corners', 'Yellow_Cards', 'Red_Cards', 'Shots', 'Shots_On_Target', 'Fouls', 'Possession', 'Expected_Goals']
home_team_group_stage_data.columns = new_column_names
away_team_group_stage_data.columns = new_column_names
team_group_stage_match_data = pd.concat([home_team_group_stage_data, away_team_group_stage_data], ignore_index=True)
team_group_stage_match_data = team_group_stage_match_data.sort_values(by=['Team'], ignore_index=True)

# Group each team's stats to calculate their averages for each relevant attribute
team_group_stage_data = team_group_stage_match_data.drop(columns=['Opponent', 'Game Week']).groupby('Team', as_index=False).mean()
team_group_stage_data


,Team,Goals,Goals_Against,Corners,Yellow_Cards,Red_Cards,Shots,Shots_On_Target,Fouls,Possession,Expected_Goals
0,Argentina,1.666667,0.0,8.0,1.0,0.0,15.0,9.0,12.0,66.333333,1.906667
1,Bolivia,0.333333,3.333333,2.0,2.333333,0.0,5.666667,3.666667,16.0,43.666667,0.8
2,Brazil,1.666667,0.666667,7.0,2.333333,0.0,12.333333,5.0,13.333333,60.0,1.54
3,Canada,0.333333,0.666667,4.333333,2.333333,0.0,7.0,3.333333,13.0,48.333333,1.026667
4,Chile,0.0,0.333333,2.333333,2.333333,0.333333,6.333333,3.666667,13.0,49.0,0.956667
5,Colombia,2.0,0.666667,4.0,1.666667,0.0,11.666667,5.666667,16.0,60.333333,1.446667
6,Costa Rica,0.666667,1.333333,1.333333,2.0,0.0,2.666667,1.0,11.666667,31.666667,0.366667
7,Ecuador,1.333333,1.0,4.0,1.333333,0.333333,9.333333,4.333333,13.666667,39.333333,1.146667
8,Jamaica,0.333333,2.333333,6.666667,1.666667,0.0,8.0,4.0,9.666667,47.0,1.033333
9,Mexico,0.333333,0.333333,6.333333,2.333333,0.0,13.0,6.666667,12.0,61.0,1.71
