# 1. Data Loading and Preliminary Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = '../data/game_events.csv'
game_events_df = pd.read_csv(file_path)

# Display the first few rows of the dataset
game_events_df.head()


## Initial Observations
Based on the initial few rows of the `game_events.csv` dataset, here are some observations:

### Structure and Key Columns:
- `game_event_id`: A unique identifier for each game event.
- `date`: The date when the event occurred.
- `game_id`: Identifier for the game to which the event belongs.
- `minute`: The minute in the game when the event happened.
- `type`: The type of event (e.g., Cards, Goals, Substitutions).
- `club_id`: Identifier for the club involved in the event.
- `player_id`: Identifier for the player involved in the event.
- `description`: A textual description of the event.
- `player_in_id`: Identifier for a player involved in a substitution.
- `player_assist_id`: Identifier for the player who assisted in the event (if applicable).

### Data Quality and Peculiarities:
- The dataset appears to be well-structured with clear column names.
- There are missing values in columns like `player_in_id`, `player_assist_id`, and potentially others.
- The `description` column contains textual data that may need further parsing or analysis.

### Data Types:
- Most of the fields appear to be numeric (IDs, minute), except for `date`, `type`, and `description`, which are textual.


# 2. Data Cleaning and Preprocessing

## Handling Missing Data
Assess the extent of missing data in each column.
Determine appropriate strategies for each column with missing data (e.g., dropping, imputation, or using placeholders).

## Data Type Conversions
Convert the date column to a DateTime format for easier manipulation.
Assess if any other data type conversions are required.

## New Features
Consider deriving new features from existing data if it adds value to our analysis (e.g., extracting year or month from the date).
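
As a minimal sketch of the feature derivation mentioned above, the snippet below parses a date column and extracts the year and month with the `.dt` accessor. The tiny `sample_df` frame and the `month` column name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample standing in for game_events_df (values are made up)
sample_df = pd.DataFrame({
    "date": ["2012-08-05", "2019-12-14", "2023-11-30"],
    "type": ["Goals", "Cards", "Substitutions"],
})

# Parse the text column into datetime64 so the .dt accessor becomes available
sample_df["date"] = pd.to_datetime(sample_df["date"])

# Derive calendar features from the parsed dates
sample_df["year"] = sample_df["date"].dt.year
sample_df["month"] = sample_df["date"].dt.month

print(sample_df)
```

Derived columns like these make it straightforward to group or plot events by season or calendar period.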

In [None]:
# Assessing the extent of missing data in each column
missing_data = game_events_df.isnull().sum()

# Display the count of missing values for each column
missing_data

The `game_events_df` dataset has missing values in the following columns:

- `description`: 336,325 missing values.
- `player_in_id`: 245,309 missing values.
- `player_assist_id`: 635,701 missing values.

Proposed Actions

**description Column:**
Since it's a textual column, we could replace missing values with a placeholder like 'Not reported' if the description is crucial for the analysis. If not, we might consider leaving it as is.

**player_in_id and player_assist_id Columns:**
These columns likely relate to specific types of events (e.g., substitutions, assists) and might be expected to have missing values where they are not applicable. We can replace missing values with a placeholder like -1 to indicate 'Not Applicable' or leave them as NaN if that's more suitable for analysis.
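
One way to check the assumption that these gaps are structural (tied to the event type) rather than random is to compute the share of missing values per event type. The sketch below uses a made-up `sample_df`; the same pattern should apply to `game_events_df` before any placeholders are filled in.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for game_events_df (values are made up)
sample_df = pd.DataFrame({
    "type": ["Substitutions", "Goals", "Goals", "Cards", "Substitutions"],
    "player_in_id": [101.0, np.nan, np.nan, np.nan, 102.0],
    "player_assist_id": [np.nan, 201.0, np.nan, np.nan, np.nan],
})

# Share of missing values per event type:
# 1.0 means the column is never filled for that type, 0.0 means it is always filled
missing_share_by_type = (
    sample_df[["player_in_id", "player_assist_id"]]
    .isna()
    .groupby(sample_df["type"])
    .mean()
)

print(missing_share_by_type)
```

If `player_in_id` turns out to be missing almost exclusively for non-substitution events, a 'Not Applicable' placeholder is easier to justify than imputation.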

In [None]:
# Checking the data types of each column
data_types = game_events_df.dtypes

data_types

Current data types are as follows:

- `game_event_id`, `date`, `type`, and `description` are of type object (usually indicating strings).
- `game_id`, `minute`, `club_id`, and `player_id` are integers (`int64`).
- `player_in_id` and `player_assist_id` are floating point numbers (`float64`).

Proposed Actions:

- Convert `date` Column: The `date` column should be converted to a `DateTime` format for easier date-related manipulations and analysis.
- Other Columns: The current types for other columns seem appropriate. However, we can consider converting `player_in_id` and `player_assist_id` to integers if we decide to replace NaNs with a placeholder like -1.


For the missing data handling:

We will replace missing values in the `description` column with 'Not reported'.
For `player_in_id` and `player_assist_id`, we will first check whether these columns contain any fractional (non-integer) values. They are stored as `float64` only because NaN forces a float dtype. If every non-missing value is a whole number, we'll replace missing values with -1 and then convert the columns to integer type. If fractional values are present, we'll retain the float type and handle missing values accordingly.

Let's start by checking for fractional values in `player_in_id` and `player_assist_id`. After that, we'll proceed with the data cleaning steps.

In [None]:
# Check whether player_in_id and player_assist_id contain any fractional (non-integer) values.
# NaN entries are dropped first so that missing values do not affect the check.
float_values_in_id = (game_events_df['player_in_id'].dropna() % 1 != 0).any()
float_values_assist_id = (game_events_df['player_assist_id'].dropna() % 1 != 0).any()

float_values_in_id, float_values_assist_id

The columns `player_in_id` and `player_assist_id` do not contain any fractional values; all non-missing values are whole numbers. Therefore, we can safely replace missing values with -1 and convert these columns to integers.
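
If keeping missing values distinguishable from real IDs matters later on, pandas' nullable `Int64` extension dtype is an alternative to the -1 placeholder. The sketch below, on a made-up `sample_ids` series, first confirms that all non-missing values are whole numbers and then casts without filling anything in. It is shown for comparison only; the rest of this notebook uses -1.

```python
import numpy as np
import pandas as pd

# Hypothetical ID column: float64 only because NaN forces a float dtype
sample_ids = pd.Series([101.0, np.nan, 205.0, np.nan], name="player_in_id")

# Check that every non-missing value is a whole number before casting
has_fractional = (sample_ids.dropna() % 1 != 0).any()

if not has_fractional:
    # Nullable integer dtype keeps missing entries as <NA> instead of a sentinel value
    nullable_ids = sample_ids.astype("Int64")
    print(nullable_ids)
```

The trade-off is that some downstream libraries handle a plain `int64` column with a sentinel more predictably than an extension dtype.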

We will now proceed with the following data cleaning steps:

1. Replace missing values in the `description` column with 'Not reported'.
2. Replace missing values in `player_in_id` and `player_assist_id` with -1 and convert both columns to integer type.
3. Convert the `date` column to a `DateTime` format.

In [None]:
# Data Cleaning Steps

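# 1. Replace missing values in the description column
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
game_events_df['description'] = game_events_df['description'].fillna('Not reported')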

# 2. Convert player_in_id and player_assist_id to integers and replace missing values with -1
game_events_df['player_in_id'] = game_events_df['player_in_id'].fillna(-1).astype(int)
game_events_df['player_assist_id'] = game_events_df['player_assist_id'].fillna(-1).astype(int)

# 3. Convert the date column to DateTime format
game_events_df['date'] = pd.to_datetime(game_events_df['date'])

# Display the first few rows of the updated dataset to confirm changes
game_events_df.head()

The data cleaning steps have been successfully executed:

- Missing values in the `description` column are replaced with 'Not reported'.
- The `player_in_id` and `player_assist_id` columns are converted to integers, and missing values are replaced with -1.
- The `date` column is converted to `DateTime` format.

The `description` column appears to be awkwardly formatted. Let's clean it up with the following steps:


- Trimming Leading and Trailing Spaces: Remove any spaces at the beginning and end of the strings.
- Removing Unnecessary Punctuation: Clean up commas and other punctuation marks that are not needed.
- Standardizing Text: Ensure consistency in the use of spaces after commas and other punctuation if needed.

In [None]:
import re

# Function to clean text in the description column
def clean_text(text):
    # Remove leading and trailing spaces
    text = text.strip()
    # Remove multiple spaces
    text = re.sub(' +', ' ', text)
    # Remove leading commas and spaces
    text = re.sub(r'^,+', '', text)
    text = text.strip()
    # Remove trailing commas and spaces
    text = re.sub(r',+$', '', text)
    return text.strip()

# Apply the cleaning function to the description column
game_events_df['description'] = game_events_df['description'].apply(clean_text)

# Display the first few rows of the updated dataset to confirm changes
game_events_df.head()
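
As an alternative to the row-wise `apply` above, roughly the same cleanup can be expressed with pandas' vectorized string methods. This is a sketch on made-up strings rather than the real column, and it is slightly more aggressive than `clean_text`: it strips any mix of commas and whitespace from both ends in one pass.

```python
import pandas as pd

# Hypothetical descriptions showing the kinds of artifacts described above
sample_desc = pd.Series([
    ", Header,  2. Tournament Goal ",
    "Yellow card , Foul,",
    "Not reported",
])

cleaned_desc = (
    sample_desc
    # Collapse runs of spaces into a single space
    .str.replace(r" +", " ", regex=True)
    # Trim any commas and whitespace at the start or end of the string
    .str.replace(r"^[,\s]+|[,\s]+$", "", regex=True)
)

print(cleaned_desc.tolist())
```

On a column with several hundred thousand rows, the vectorized form is usually faster than calling a Python function per row.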


# 3. Exploratory Analysis and Visualization

## Descriptive Statistics
We'll focus on the following aspects:

- Summary statistics for numeric columns like `minute`, `club_id`, `player_id`, `player_in_id`, and `player_assist_id`.
- Distribution of event types to see the frequency of different types of events.
- Date range analysis to understand the time span covered by the dataset.

Let's begin with the summary statistics for the numeric columns.

In [None]:
# Summary statistics for numeric columns
numeric_summary = game_events_df[['minute', 'club_id', 'player_id', 'player_in_id', 'player_assist_id']].describe()

# Frequency distribution of event types
event_type_distribution = game_events_df['type'].value_counts()

# Date range analysis
date_range = game_events_df['date'].agg(['min', 'max'])

numeric_summary, event_type_distribution, date_range

### Numeric Columns Summary:

- Minute:
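    - Range: -1 to 120 minutes. The upper end reflects extra time; the -1 values may be pre-match events or, perhaps more likely, a placeholder for an unknown minute, which should be verified against the data source.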
    - Mean: Approximately 63.5 minutes.
    - Standard Deviation: About 21.9 minutes, indicating variability in event timing.

- Club ID:
    - Wide range of values, potentially indicating many clubs or a varied numbering system.

- Player ID:
    - Similar to Club ID, a wide range of values.

- Player In ID and Player Assist ID:
    - Both have a large number of -1 values, as expected (to indicate 'Not Applicable').
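
Because -1 is a placeholder rather than a real minute or ID, it pulls summary statistics downward. Below is a minimal sketch, on made-up values, of masking the sentinel before computing statistics. Treating `minute == -1` as 'unknown' is an assumption here, not something the dataset documents.

```python
import pandas as pd

# Hypothetical sample: -1 marks "unknown / not applicable"
sample_df = pd.DataFrame({
    "minute": [-1, 12, 45, 90],
    "player_assist_id": [-1, -1, 300, 400],
})

# Replace the sentinel with a missing value so aggregations skip it
masked_df = sample_df.mask(sample_df == -1)

print(sample_df["minute"].mean())   # mean including the sentinel
print(masked_df["minute"].mean())   # mean over real minutes only
print(masked_df.describe())
```

For the ID columns, the mean and standard deviation carry no real meaning either way; the count of non-missing values is the informative part.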

### Event Type Distribution:

- Substitutions: 421,908 events.
- Goals: 180,901 events.
- Cards: 62,473 events.
- Shootout: 1,276 events.

This distribution provides an insight into the frequency of different types of events within the dataset.

### Date Range:

The dataset covers events from July 3, 2012, to November 30, 2023.

This timeframe allows us to understand the temporal scope of the data.

## Visualization Plan

To gain a clearer understanding of the dataset's characteristics and trends, we will create the following visualizations:

1. Event Type Distribution: A bar chart to visualize the frequency of different event types.

2. Events Over Time: A bar chart of yearly event counts to show how events have been distributed over the years.

3. Distribution of Events by Minute: A histogram to see when most events occur during a game.

These visualizations will provide valuable insights into the dataset and help in analyzing the data effectively.


In [None]:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Visualization 1: Event Type Distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=game_events_df, x='type', order=game_events_df['type'].value_counts().index)
plt.title('Distribution of Event Types')
plt.xlabel('Event Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

# Visualization 2: Events Over Time
plt.figure(figsize=(12, 6))
game_events_df['year'] = game_events_df['date'].dt.year
sns.countplot(data=game_events_df, x='year')
plt.title('Number of Events Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Events')
plt.xticks(rotation=45)
plt.show()

# Visualization 3: Distribution of Events by Minute
plt.figure(figsize=(12, 6))
sns.histplot(game_events_df, x='minute', bins=60, kde=True)
plt.title('Distribution of Events by Minute')
plt.xlabel('Minute')
plt.ylabel('Frequency')
plt.xlim(-1, 120)  # Limiting to the duration of a standard match including extra time and events after the final whistle
plt.show()


1. Event Type Distribution

- The bar chart reveals that Substitutions are the most frequent event type, followed by Goals, Cards, and Shootouts.

- This distribution offers insights into the commonality of different events within the dataset.

2. Events Over Time

- The bar chart shows the number of events recorded each year.

- There appears to be variation in event frequency over the years, which could be influenced by various factors like the number of games played, data collection methods, or external events.

3. Distribution of Events by Minute

- The histogram displays the frequency of events throughout the 120-minute span of a game (including extra time).
- Most events seem to cluster around the latter half of the match, with a noticeable increase towards the end. This could be due to the nature of game dynamics, where more substitutions and strategic plays happen as the game progresses.

These visualizations provide a deeper understanding of the dataset's characteristics. They reveal patterns in event types, temporal trends, and the distribution of events within the timeframe of a game.
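
Since yearly variation may simply reflect how many games were recorded, one way to separate the two is to normalize event counts by the number of distinct games per year. The sketch below uses a made-up `sample_df` with the `game_id` and `year` columns; the `events_per_game` name is an illustrative choice.

```python
import pandas as pd

# Hypothetical sample standing in for game_events_df (values are made up)
sample_df = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 3, 3, 3, 4, 4, 4],
    "year": [2012] * 4 + [2013] * 6,
})

# Number of events and number of distinct games per year
per_year = sample_df.groupby("year").agg(
    events=("game_id", "size"),
    games=("game_id", "nunique"),
)

# Average number of recorded events per game in each year
per_year["events_per_game"] = per_year["events"] / per_year["games"]

print(per_year)
```

A roughly flat `events_per_game` would suggest that changes in raw counts come from coverage (more games) rather than from changes in how matches unfold.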


## In-depth Analysis - Aspect 1: Club-Specific Event Patterns

In [None]:
# Top clubs in terms of event frequency
top_clubs = game_events_df['club_id'].value_counts().head(10)

# Create a DataFrame for visualization
top_clubs_df = game_events_df[game_events_df['club_id'].isin(top_clubs.index)]
top_clubs_events = top_clubs_df.groupby(['club_id', 'type']).size().unstack().fillna(0)

# Visualization: Types of Events per Top Club
plt.figure(figsize=(14, 8))
sns.heatmap(top_clubs_events, annot=True, fmt=".0f", cmap="YlGnBu", linewidths=.5)
plt.title('Types of Events per Top Club')
plt.xlabel('Event Type')
plt.ylabel('Club ID')
plt.show()

## Visualization: Types of Events per Top Club

The heatmap displays the frequency of different event types for the top clubs in terms of overall event count. Each row represents a club (identified by Club ID), and each column represents an event type. The intensity of the color and the numbers indicate the frequency of each event type for the respective clubs.

**Insights:**

- There is noticeable variability in the distribution of event types across different clubs.
- Some clubs show a higher frequency of certain event types, like Goals or Cards, which might reflect their playing style or other factors.


This analysis provides a glimpse into how different clubs' activities vary in terms of event types. It can be insightful for understanding club-specific dynamics or for more focused analyses on specific clubs or event types.
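
Raw counts favor clubs with more recorded games, so comparing the share of each event type within a club can make differences in profile easier to see. A minimal sketch with made-up data, using a row-normalized cross-tabulation:

```python
import pandas as pd

# Hypothetical sample standing in for the top-clubs subset (values are made up)
sample_df = pd.DataFrame({
    "club_id": [10, 10, 10, 10, 20, 20],
    "type": ["Goals", "Cards", "Substitutions", "Substitutions", "Goals", "Goals"],
})

# Row-normalized cross-tabulation: each club's row sums to 1
type_share_by_club = pd.crosstab(
    sample_df["club_id"], sample_df["type"], normalize="index"
)

print(type_share_by_club)
```

The resulting table can be passed to `sns.heatmap` in the same way as the count table, with `fmt=".2f"` for the annotations.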


## In-depth Analysis - Aspect 2: Relationship Between Event Types and Game Progression

The aim here is to understand how different types of events are distributed throughout the duration of a game. Specifically, we want to see if certain events (like Goals, Cards) are more likely to occur at specific times during a match.

We will analyze the distribution of different event types over the minute range of the game. This can reveal patterns like whether more goals are scored towards the end of the match or if cards are more frequently given in certain periods.

In [None]:
# Distribution of different event types over the minute range of the game
plt.figure(figsize=(14, 8))

# Filtering out only relevant event types for clarity
relevant_types = ['Goals', 'Cards']
filtered_df = game_events_df[game_events_df['type'].isin(relevant_types)]

# Creating the plot
sns.histplot(data=filtered_df, x='minute', hue='type', element='step', bins=24, kde=False)
plt.title('Distribution of Goals and Cards Over Match Minutes')
plt.xlabel('Minute of the Game')
plt.ylabel('Frequency')
plt.xlim(0, 120)  # Limiting to the duration of a standard match including extra time
plt.show()

The histogram shows the frequency of Goals and Cards throughout the duration of a match (up to 120 minutes, including extra time). Each step of the outline represents a time segment, with different colors indicating the type of event (Goals or Cards).

**Insights**:

- Goals: The distribution of goals appears to have peaks towards the middle and end of the match. This might suggest increased scoring opportunities during these periods.

- Cards: The issuance of cards seems to be more evenly distributed throughout the match, with a slight increase towards the end.

This analysis provides insights into how the dynamics of a game might influence the occurrence of certain types of events. Understanding these patterns can be crucial for game strategies, player management, and predicting game outcomes.
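
Because goals are far more numerous than cards in this dataset, overlaid counts can hide differences in timing. The sketch below, on made-up data, bins minutes into 15-minute intervals with `pd.cut` and normalizes within each event type. The interval labels and the restriction to regular time (minutes 1 to 90) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for filtered_df (values are made up)
sample_df = pd.DataFrame({
    "type": ["Goals", "Goals", "Goals", "Goals", "Cards", "Cards"],
    "minute": [10, 44, 80, 89, 30, 85],
})

# 15-minute intervals over regular time; minutes outside 1-90 become NaN and are ignored
bins = list(range(0, 91, 15))
labels = [f"{low + 1}-{low + 15}" for low in bins[:-1]]
sample_df["interval"] = pd.cut(sample_df["minute"], bins=bins, labels=labels)

# Column-normalized table: within each event type, the shares sum to 1
share_by_interval = pd.crosstab(
    sample_df["interval"], sample_df["type"], normalize="columns"
)

print(share_by_interval)
```

Comparing shares rather than counts shows whether cards and goals follow a similar timing profile regardless of how many of each were recorded.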


## In-depth Analysis - Aspect 3: Player-Specific Analysis

In this part, we aim to explore the involvement of players in different events. Specifically, we'll look at:

- Top Players in Terms of Event Participation: Identifying players who are most frequently involved in events.
- Event Type Breakdown for Top Players: Analyzing the types of events these top players are mostly involved in.

This analysis can reveal which players are most active or influential in games, based on their event participation. Let's proceed with identifying the top players in terms of event involvement.

In [None]:
# Top players in terms of event participation
top_players = game_events_df['player_id'].value_counts().head(10)

# Create a DataFrame for visualization
top_players_df = game_events_df[game_events_df['player_id'].isin(top_players.index)]
top_players_events = top_players_df.groupby(['player_id', 'type']).size().unstack().fillna(0)

# Visualization: Event Type Breakdown for Top Players
plt.figure(figsize=(14, 8))
sns.heatmap(top_players_events, annot=True, fmt=".0f", cmap="YlOrRd", linewidths=.5)
plt.title('Event Type Breakdown for Top Players')
plt.xlabel('Event Type')
plt.ylabel('Player ID')
plt.show()

- The heatmap represents the frequency of different event types for the top players, identified by their Player ID.
- The color intensity and annotations show the count of each event type for these players.

**Insights:**

- This visualization highlights which players are most frequently involved in certain types of events, like Goals, Cards, or Substitutions.
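- It provides a clear picture of player-specific activity within games, revealing patterns about their involvement, such as whether they are more likely to score, receive cards, or be involved in substitutions. Assists are recorded in the separate `player_assist_id` column and are not counted here.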
- Understanding player-specific patterns can be essential for team management, player performance analysis, and scouting. It offers valuable insights into the roles and impact of players in games.


## Delving Deeper 

Now, the IDs might not mean all that much to us. Let's try to find out the names of these players.

In the interest of time, we'll only be doing the cross-referencing in this notebook. In a production environment, all of these visualizations will be accessible in a readable format through a separate Jupyter Notebook dashboard.

In [None]:
# Try to load a cleaned version of the players dataset first. If it doesn't exist, load the original dataset
try:
    file_path = '../data/cleaned/players_cleaned.csv'
    players_df = pd.read_csv(file_path)
except FileNotFoundError:
    file_path = '../data/players.csv'
    players_df = pd.read_csv(file_path)

# Make a copy of our previous DataFrame, which we will save as a CSV file later on
game_events_df_copy = game_events_df.copy()

players_df.head()

The `players_cleaned.csv` dataset provides detailed information about players, including:

- `player_id`: Unique identifier for each player, which we can use to link to our `game_events_df` dataset.
- `first_name` and `last_name`: Names of the players.
- `name`: Full name of the player.
- Other attributes such as `last_season`, `current_club_id`, `country_of_birth`, `city_of_birth`, `height_in_cm`, `market_value_in_eur`, etc.

For our current analysis, the most relevant columns are `player_id` and `name`. We can use these to map player IDs in our `game_events_df` to their names.

Next steps:

1. Merge `game_events_df` with `players_df` based on `player_id`.
2. Update the analysis to include player names instead of IDs.

Let's proceed with merging the datasets and then re-run the player-specific event analysis with player names.
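
One caveat when switching from IDs to names is that two different players can share a name, so counting by `name` alone may pool their events. The sketch below uses made-up `events` and `players` frames to show a left merge guarded by `validate="many_to_one"`, followed by counting on `player_id` together with `name`.

```python
import pandas as pd

# Hypothetical events table (values are made up)
events = pd.DataFrame({
    "player_id": [1, 1, 2, 3, 3, 3],
    "type": ["Goals", "Cards", "Goals", "Goals", "Goals", "Cards"],
})

# Hypothetical players table in which two different players share a name
players = pd.DataFrame({
    "player_id": [1, 2, 3],
    "name": ["Alex Silva", "Alex Silva", "Jo Berg"],
})

# validate="many_to_one" raises a MergeError if player_id is duplicated in players
merged = events.merge(
    players[["player_id", "name"]],
    on="player_id",
    how="left",
    validate="many_to_one",
)

# Counting by ID and name keeps namesakes apart; counting by name alone pools them
counts_by_id = merged.groupby(["player_id", "name"]).size()
counts_by_name = merged["name"].value_counts()

print(counts_by_id)
print(counts_by_name)
```

In this toy example, counting by name reports three events for "Alex Silva", while counting by ID shows two players with two events and one event respectively.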

In [None]:
# Merging the game_events_df with players_df to include player names
game_events_with_player_names_df = pd.merge(game_events_df, players_df[['player_id', 'name']], 
                                            on='player_id', 
                                            how='left')

# Top players in terms of event participation with names
top_players_with_names = game_events_with_player_names_df['name'].value_counts().head(10)

# Create a DataFrame for visualization including player names
top_players_names_df = game_events_with_player_names_df[game_events_with_player_names_df['name'].isin(top_players_with_names.index)]
top_players_names_events = top_players_names_df.groupby(['name', 'type']).size().unstack().fillna(0)

# Visualization: Event Type Breakdown for Top Players (with names)
plt.figure(figsize=(14, 8))
sns.heatmap(top_players_names_events, annot=True, fmt=".0f", cmap="Blues", linewidths=.5)
plt.title('Event Type Breakdown for Top Players (With Names)')
plt.xlabel('Event Type')
plt.ylabel('Player Name')
plt.show()

## Visualization: Event Type Breakdown for Top Players (With Names)

The heatmap now includes player names, providing a more intuitive view of the event participation of individual players. Different event types (like Goals, Cards, Substitutions) are shown for each player.

**Insights:**

- This visualization makes it easier to identify which players are most active in specific types of events.
- It helps in understanding the roles and contributions of these top players in games, such as who scores the most goals or who is more likely to be involved in substitutions.
- This analysis, enhanced with player names, offers a clearer and more meaningful understanding of player activities and their impact on games.



Now that that's clearer, let's wrap up this analysis by summarizing the key findings and takeaways.

# 4. Insights and Conclusions

## Key Findings

- Event Type Distribution: Substitutions are the most common event, followed by goals and cards.
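- Temporal Trends: The number of events varies across years. Within matches, events cluster in the latter half; goals peak toward the middle and end, while cards are spread more evenly with a slight increase toward the end.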
- Club-Specific Patterns: Different clubs show varying frequencies of event types, which might reflect distinct playing styles or strategies.
- Player Involvement: Certain players are more frequently involved in specific types of events, which can indicate their roles and impact in games.

## Limitations

- Data Completeness: The dataset may not cover all relevant aspects of game events (e.g., detailed player actions, precise event timings).
- External Factors: Factors not captured in the dataset (such as game conditions, player injuries, or tactical decisions) can significantly influence event occurrences.

## Recommendations/Conclusions

- Strategic Decisions: Clubs and coaches can use these insights for strategic planning and player management.
- Further Research: Additional data (like player performance metrics or team strategies) could enrich the analysis for more comprehensive insights.

# 5. Saving the Cleaned Data

In [None]:
# Save the cleaned DataFrame to a new CSV file
cleaned_data_path = '../data/cleaned/game_events_cleaned.csv'
game_events_df_copy.to_csv(cleaned_data_path, index=False)
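
The save above assumes the `../data/cleaned/` directory already exists. A minimal sketch of creating the target directory first and confirming that dates survive the round trip is shown below. It writes a made-up `sample_df` to a temporary directory so that no real files are touched.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical sample standing in for game_events_df_copy (values are made up)
sample_df = pd.DataFrame({
    "game_id": [1, 2],
    "date": pd.to_datetime(["2012-07-03", "2023-11-30"]),
})

# A temporary directory is used here only so the sketch does not touch real data
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned" / "game_events_cleaned.csv"

    # Create the parent directory if it is missing, then write the CSV
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sample_df.to_csv(out_path, index=False)

    # CSV stores dates as text, so parse_dates restores the datetime dtype on reload
    reloaded = pd.read_csv(out_path, parse_dates=["date"])

print(reloaded.dtypes)
```

Without `parse_dates`, the `date` column would come back as plain strings, and the conversion from the cleaning step would need to be repeated.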