# 1. Introduction

## Objective
The objective of this analysis is to explore the `game_lineups.csv` dataset for patterns in game lineups in a sports context. We aim to answer questions about player participation, team composition, and any notable trends that emerge from the data.

## Dataset Overview
The dataset contains 119,133 entries and 9 columns, covering lineup entries, player IDs, player names, team captains, and positions. This makes it a useful source for analyzing player utilization and team composition.

# 2. Data Loading and Preliminary Analysis
Let's start by loading the dataset and performing some preliminary analysis.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
game_lineups_df = pd.read_csv('../data/game_lineups.csv')

# Displaying the first few rows of the dataset
game_lineups_df.head()

Here are some initial observations based on the first few rows:
- Columns and Data Types: The dataset contains columns for `game_lineups_id`, `game_id`, `club_id`, `type`, `number`, `player_id`, `player_name`, `team_captain`, and `position`. The data types will need to be verified to ensure they are appropriate for analysis.

- Data Structure: Each row seems to represent a player's participation in a specific game, along with their position and role (e.g., team captain).

- Identifiers: The `game_lineups_id`, `game_id`, and `player_id` columns identify lineups, games, and players respectively. Because each row is one player in one game, `game_id` and `player_id` repeat across rows.

- Categorical and Numerical Data: Columns such as `type`, `position`, and `team_captain` are categorical, whereas `number`, `game_id`, `club_id`, and `player_id` appear numerical (the ID columns are labels rather than quantities).

- Possible Areas of Focus: We can explore various aspects like the frequency of player participation, distribution of positions, team compositions, and the prevalence of team captains.


Next, we're going to conduct a more detailed assessment to check data types and look for any peculiarities or inconsistencies in the dataset.

# 3. Data Cleaning and Preprocessing



## Data Types and Missing Values
- Data Types: The dataset primarily consists of integer and object (string) types. `game_id`, `club_id`, `player_id`, and `team_captain` are integers, while `game_lineups_id`, `type`, `number`, `player_name`, and `position` are objects. The types seem appropriate for their respective columns.

- Missing Values: There are no missing values in any of the columns, as indicated by the non-null counts matching the total entry count.
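Observations like these typically come from inspecting dtypes and null counts. The following is a minimal sketch of such a check, run on a small stand-in frame: `sample_df` and its rows are hypothetical and only mirror the column names listed earlier. In the notebook, the same calls would be applied to `game_lineups_df`.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be game_lineups_df
sample_df = pd.DataFrame({
    "game_lineups_id": ["a1", "b2", "c3"],
    "game_id": [101, 101, 102],
    "club_id": [10, 10, 20],
    "type": ["starting_lineup", "substitutes", "starting_lineup"],
    "number": ["1", "-", "9"],
    "player_id": [1001, 1002, 1003],
    "player_name": ["Player A", "Player B", "Player C"],
    "team_captain": [1, 0, 0],
    "position": ["Goalkeeper", "Centre-Back", "Centre-Forward"],
})

# One dtype per column: integer for the IDs and captain flag, string-like for the rest
dtype_summary = sample_df.dtypes

# Missing values per column; all zeros means every column is fully populated
missing_counts = sample_df.isna().sum()

print(dtype_summary)
print(missing_counts)
```

`game_lineups_df.info()` reports the same information (dtypes and non-null counts) in a single call.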

## Data Type Conversions
- Player Number: The `number` column is an object type, likely due to some non-numeric entries. It would be useful to investigate this further.

- Categorical Columns: Columns like `type`, `position`, and `team_captain` could be treated as categorical for more efficient storage and analysis.

## Data Type Conversion and Inspection of 'number' Column
Let's inspect the `number` column to understand why it's classified as an object and then convert appropriate columns to categorical types.

In [None]:
# Checking unique values in the 'number' column to understand why it's an object type
unique_numbers = game_lineups_df['number'].unique()

# Converting 'type', 'position', and 'team_captain' to categorical types
game_lineups_df['type'] = game_lineups_df['type'].astype('category')
game_lineups_df['position'] = game_lineups_df['position'].astype('category')
game_lineups_df['team_captain'] = game_lineups_df['team_captain'].astype('category')

# Verifying the changes
updated_data_types = game_lineups_df.dtypes

unique_numbers, updated_data_types


- 'number' Column: The `number` column contains player numbers as strings, including numerical values ('1', '2', '3', etc.) and a special character ('-'). The presence of non-numeric characters explains why this column is an object type. For most analyses, this column can be left as is, unless numeric operations are required.

- Data Type Conversion: The `type`, `position`, and `team_captain` columns have been successfully converted to categorical types, which is more efficient for analysis and storage.
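If numeric operations on `number` are ever needed, one possible approach is to coerce the column and treat the '-' placeholder as missing. This is a sketch on hypothetical sample values, not a step the analysis performs. The `numbers` series below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of shirt numbers stored as strings, with '-' as a placeholder
numbers = pd.Series(["1", "2", "-", "10"], name="number")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric_numbers = pd.to_numeric(numbers, errors="coerce")

# The nullable Int64 dtype keeps parsed values as integers and gaps as <NA>
numeric_numbers = numeric_numbers.astype("Int64")

print(numeric_numbers.tolist())
```

Note that after such a conversion the column would contain missing values, so the "no missing values" observation would no longer hold for `number`.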

# 4. Exploratory Data Analysis
Now that we've cleaned and preprocessed the data, we can move on to exploratory data analysis. We'll start with descriptive statistics and then move on to visualizations and more focused analyses.

## Descriptive Statistics


In [None]:
# Re-applying the categorical conversion so this cell can run on its own (no effect if already converted)
game_lineups_df['type'] = game_lineups_df['type'].astype('category')
game_lineups_df['position'] = game_lineups_df['position'].astype('category')
game_lineups_df['team_captain'] = game_lineups_df['team_captain'].astype('category')

# Descriptive statistics for categorical columns: 'type', 'position', 'team_captain'
categorical_stats = game_lineups_df[['type', 'position', 'team_captain']].describe()

# Descriptive statistics for numerical columns: 'game_id', 'club_id', 'player_id'
numerical_stats = game_lineups_df[['game_id', 'club_id', 'player_id']].describe()

categorical_stats, numerical_stats


### Categorical Columns:
- Type: There are two unique types, with 'starting_lineup' being the most frequent, so most rows describe players who started the game.

- Position: 17 unique positions are represented, with 'Centre-Back' being the most common. This indicates a variety of player roles and positions in the dataset.

- Team Captain: The majority of entries are non-captains (0), highlighting the relatively few players who serve as team captains.


### Numerical Columns:
- Game ID: The game IDs span a wide range, suggesting a comprehensive coverage of games.

- Club ID: There is a wide range of club IDs, indicating the inclusion of many different clubs.

- Player ID: Player IDs also cover a broad spectrum, reflecting a diverse set of players in the dataset.
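Since ID columns are labels, summary statistics such as the mean or quartiles say little about them. Counting distinct values is a more direct way to gauge how many games, clubs, and players are covered. A minimal sketch, using a hypothetical `sample_df` in place of `game_lineups_df`:

```python
import pandas as pd

# Hypothetical rows standing in for game_lineups_df
sample_df = pd.DataFrame({
    "game_id": [101, 101, 102, 103],
    "club_id": [10, 20, 10, 30],
    "player_id": [1001, 1002, 1001, 1003],
})

# Number of distinct games, clubs, and players in the data
id_cardinality = sample_df[["game_id", "club_id", "player_id"]].nunique()

print(id_cardinality)
```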

## Visualization
To gain deeper insights, we'll create visualizations for:

1. Distribution of player positions.
2. Frequency of players being a team captain.
3. Distribution of game and club IDs (to understand the diversity of games and clubs represented).

In [None]:
# Setting the aesthetic style of the plots
sns.set(style="whitegrid")

# Visualization 1: Distribution of Player Positions
plt.figure(figsize=(12, 8))
sns.countplot(y='position', data=game_lineups_df, order=game_lineups_df['position'].value_counts().index)
plt.title('Distribution of Player Positions')
plt.xlabel('Count')
plt.ylabel('Position')
plt.show()

The visualization provides a clear view of the distribution of player positions in the dataset. From this, we can observe:

- Certain positions, like 'Centre-Back', appear to be more common than others, indicating their frequent presence in game lineups.
- The variety of positions underscores the diversity of player roles within the dataset.

Let's look at the frequency of players being team captains. This will help us understand the proportion of players who are designated as team captains in the games. We'll create a bar chart to visualize this frequency.

In [None]:
# Visualization 2: Frequency of Players Being Team Captains
plt.figure(figsize=(8, 6))
sns.countplot(x='team_captain', data=game_lineups_df)
plt.title('Frequency of Players Being Team Captains')
plt.xlabel('Team Captain')
plt.ylabel('Count')
plt.xticks([0, 1], ['No', 'Yes'])  # Making the x-axis labels more descriptive
plt.show()


The bar chart clearly illustrates the frequency of players being team captains:

- A vast majority of players are not team captains, as indicated by the high count for 'No'.

- Only a small fraction of players have the role of a team captain in the games, as seen in the 'Yes' category.

This reflects the unique responsibility and rarity of the team captain role in the lineup.
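The bar chart shows counts. The same comparison can be expressed as a proportion, which is easier to quote. A small sketch with made-up flags (the `captain_flags` values are hypothetical; in the notebook the input would be `game_lineups_df['team_captain']`):

```python
import pandas as pd

# Hypothetical captain flags: 1 = captain, 0 = not captain
captain_flags = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 1], name="team_captain")

# normalize=True returns shares of rows instead of raw counts
captain_share = captain_flags.value_counts(normalize=True)

print(captain_share)
```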

For the distribution of game and club IDs: We'll use histograms to understand the diversity of games and clubs represented in the dataset. This will help us see if there's a wide range of games and clubs or if the dataset is concentrated around certain ones. Let's start with the distribution of game IDs.

In [None]:
# Visualization 3: Distribution of Game and Club IDs

# Distribution of Game IDs
plt.figure(figsize=(12, 6))
sns.histplot(game_lineups_df['game_id'], bins=30, kde=False)
plt.title('Distribution of Game IDs')
plt.xlabel('Game ID')
plt.ylabel('Frequency')
plt.show()

# Distribution of Club IDs
plt.figure(figsize=(12, 6))
sns.histplot(game_lineups_df['club_id'], bins=30, kde=False)
plt.title('Distribution of Club IDs')
plt.xlabel('Club ID')
plt.ylabel('Frequency')
plt.show()


The histograms provide insights into the distribution of Game and Club IDs in the dataset:

1. Distribution of Game IDs:
   - The distribution appears relatively uniform, indicating a broad coverage of different games in the dataset.
   - There are no significant spikes, which suggests that the dataset doesn't disproportionately represent certain games.

2. Distribution of Club IDs:
   - Similar to the Game IDs, the distribution of Club IDs is also relatively uniform.
   - This uniformity suggests a diverse representation of clubs, with no single club dominating the dataset.

These visualizations help us understand the breadth and diversity of the games and clubs covered in the dataset, which is crucial for ensuring that our analysis isn't biased towards specific games or clubs.
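A histogram of ID values shows how the labels are spread across the numeric range, which is only an indirect view of representation. A more direct check is to count rows per game and per club. The sketch below assumes a hypothetical `sample_df`; in the notebook the same calls would run on `game_lineups_df`.

```python
import pandas as pd

# Hypothetical rows standing in for game_lineups_df
sample_df = pd.DataFrame({
    "game_id": [101, 101, 101, 102, 102, 103],
    "club_id": [10, 10, 20, 10, 30, 30],
})

# Rows contributed by each game and each club
rows_per_game = sample_df["game_id"].value_counts()
rows_per_club = sample_df["club_id"].value_counts()

# describe() on the counts summarises how evenly the rows are spread
print(rows_per_game.describe())
print(rows_per_club.describe())
```

A large gap between the maximum and the median of these counts would indicate that a few games or clubs contribute a disproportionate share of rows.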

## In-Depth Analysis
Based on the initial exploratory data analysis, there are several interesting avenues we can explore in more depth. We'll focus on the following:

1. Position and Team Captaincy: Analyzing whether certain positions are more likely to be team captains. This can reveal if leadership roles tend to be associated with specific positions on the field.

2. Player Participation Across Different Clubs: Investigating how players are distributed across clubs, to see if certain clubs have more diverse lineups or rely more heavily on a core group of players.

3. Player Participation in Games: Understanding the frequency of player appearances in games. This can show us if there are players who are consistently part of game lineups, indicating their pivotal role in their teams.

Let's start with the first analysis: Position and Team Captaincy. We'll examine if certain positions are more likely to have players serving as team captains.

### Analysis 1: Position and Team Captaincy

In [None]:
# Analysis 1: Position and Team Captaincy

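# Calculate captaincy rates (observed=False keeps every position/captain combination)
captain_by_position = game_lineups_df.groupby(['position', 'team_captain'], observed=False).size().unstack(fill_value=0)
# Use plain column labels so a new column can be added next to the 0/1 categories
captain_by_position.columns = captain_by_position.columns.tolist()
captain_by_position['Captaincy Rate'] = captain_by_position[1] / (captain_by_position[0] + captain_by_position[1])

# Sort by Captaincy Rate
captain_by_position_sorted = captain_by_position.sort_values('Captaincy Rate', ascending=False)

# Create a bar chart; string labels keep the sorted order instead of the category order
plt.figure(figsize=(12, 8))
sns.barplot(x=captain_by_position_sorted['Captaincy Rate'].to_numpy(), y=captain_by_position_sorted.index.astype(str))
plt.title('Rate of Captaincy Across Different Positions')
plt.xlabel('Captaincy Rate')
plt.ylabel('Position')
plt.show()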


The analysis of team captaincy across different positions reveals some interesting insights:

- Sweeper: Although not a common position in modern football, it has the highest captaincy rate. This could be due to the strategic nature of the role.

- Centre-Back: This position shows a significantly high rate of captaincy. It's often considered a leadership position on the field, which is reflected in the data.

- Defensive and Central Midfield: These positions also have a relatively high rate of captaincy, suggesting midfielders often take on leadership roles.

- Goalkeeper: Despite being a specialized position, goalkeepers also frequently serve as team captains.

This analysis suggests that leadership roles (captaincy) tend to be associated with positions that require a good overview of the field (like Centre-Backs and Midfielders) and strategic thinking (like Sweepers).
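A rate computed from very few rows can rank highly by chance, which matters for rare positions such as Sweeper. One way to check is to report the number of appearances next to each rate. This is a sketch on hypothetical rows, and it assumes `team_captain` is an integer 0/1 column. In the notebook, where the column was converted to a categorical, it would first need `.astype(int)`.

```python
import pandas as pd

# Hypothetical rows standing in for game_lineups_df (team_captain as integer 0/1)
sample_df = pd.DataFrame({
    "position": ["Centre-Back", "Centre-Back", "Centre-Back", "Centre-Back",
                 "Goalkeeper", "Goalkeeper", "Sweeper"],
    "team_captain": [1, 0, 0, 0, 1, 0, 1],
})

# Appearances, captain appearances, and captaincy rate per position
captaincy_summary = (
    sample_df.groupby("position")["team_captain"]
    .agg(appearances="count", captain_appearances="sum", captaincy_rate="mean")
    .sort_values("captaincy_rate", ascending=False)
)

print(captaincy_summary)
```

In this made-up sample, Sweeper tops the ranking on the strength of a single row, which illustrates why the appearance count is worth reading alongside the rate.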

### Analysis 2: Player Participation Across Different Clubs
We will investigate how players are distributed across different clubs, which can indicate if certain clubs rely on a more diverse set of players or have a consistent core team. 

In [None]:
# Analysis 2: Player Participation Across Different Clubs

# Count unique players per club and sort
players_per_club = game_lineups_df.groupby('club_id')['player_id'].nunique().sort_values()

# Plot for top 10 clubs
plt.figure(figsize=(12, 6))
players_per_club.tail(10).plot(kind='bar')
plt.title('Top 10 Clubs with Most Diverse Lineups')
plt.xlabel('Club ID')
plt.ylabel('Number of Unique Players')
plt.show()

# Plot for bottom 10 clubs
plt.figure(figsize=(12, 6))
players_per_club.head(10).plot(kind='bar')
plt.title('Bottom 10 Clubs with Least Diverse Lineups')
plt.xlabel('Club ID')
plt.ylabel('Number of Unique Players')
plt.show()

The analysis of player participation across different clubs provides insights into the diversity of lineups used by various clubs:

- Clubs with the Most Diverse Lineups: The top 10 clubs (identified by their `club_id`) have a notably high number of unique players used in their lineups. For example, the club with ID 660 has used 52 different players, indicating a high level of player rotation or a large squad.

- Clubs with the Least Diverse Lineups: On the other end, the bottom 10 clubs have used significantly fewer players. For instance, the club with ID 109967 has only used 1 unique player, and several clubs have used only 2 or 3. This could indicate smaller teams, less rotation, or limited data available for these clubs.

This analysis suggests a wide variance in how clubs manage their player rosters, with some preferring a large pool of players and others relying on a more consistent core team.
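Because a club with more recorded games has more chances to field different players, the raw count of unique players mixes rotation with data coverage. One way to separate the two is to put the number of games next to the player count. This is a sketch on a hypothetical `sample_df`, and the `players_per_game` ratio is an illustrative metric rather than one used in the analysis above.

```python
import pandas as pd

# Hypothetical rows standing in for game_lineups_df
sample_df = pd.DataFrame({
    "club_id": [10, 10, 10, 10, 20, 20],
    "game_id": [101, 101, 102, 102, 101, 101],
    "player_id": [1, 2, 1, 3, 4, 5],
})

# Unique players and unique games per club
club_summary = sample_df.groupby("club_id").agg(
    unique_players=("player_id", "nunique"),
    games=("game_id", "nunique"),
)

# Illustrative ratio: unique players relative to the number of games recorded
club_summary["players_per_game"] = club_summary["unique_players"] / club_summary["games"]

print(club_summary)
```

Clubs with only one or two recorded games, like the single-player club mentioned above, would stand out in the `games` column as cases of limited data.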

### Analysis 3: Player Participation in Games
This will help us understand which players are consistently part of game lineups, possibly indicating their importance to their teams. We'll identify players with the highest number of game appearances.

In [None]:
# Analysis 3: Player Participation in Games

# Count games per player and sort
games_per_player = game_lineups_df.groupby('player_id')['game_id'].nunique().sort_values(ascending=False)

# Plot for top 10 players
plt.figure(figsize=(12, 6))
games_per_player.head(10).plot(kind='bar')
plt.title('Top 10 Players with Most Game Appearances')
plt.xlabel('Player ID')
plt.ylabel('Number of Games')
plt.show()


The analysis of player participation in games highlights players with the highest number of game appearances:

- The top players, identified by their `player_id`, have appeared in a significant number of games. For instance, several players (IDs 660438, 792331, 104918, etc.) have participated in 29 games each.

- This high frequency of game appearances suggests that these players are key figures in their teams, consistently selected for game lineups.

These players could be pivotal to their team's performance, either due to their skill, experience, or role within the team structure.
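The count above treats every lineup entry alike. Splitting appearances by the `type` column would show whether a player's entries are mostly starts. The sketch below uses hypothetical rows, and the label `'substitutes'` for the second `type` value is an assumption, since only `'starting_lineup'` is named in this analysis.

```python
import pandas as pd

# Hypothetical rows; 'substitutes' is an assumed label for the second type value
sample_df = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 3],
    "game_id": [101, 102, 103, 101, 102, 101],
    "type": ["starting_lineup", "starting_lineup", "substitutes",
             "substitutes", "substitutes", "starting_lineup"],
})

# Lineup entries per player, split by lineup type (assumes one row per player per game)
appearances_by_type = pd.crosstab(sample_df["player_id"], sample_df["type"])
appearances_by_type["total"] = appearances_by_type.sum(axis=1)

print(appearances_by_type.sort_values("total", ascending=False))
```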

### Summary of In-Depth Analysis
1. Position and Team Captaincy: Leadership roles are more associated with strategic and overview positions like Centre-Backs and Midfielders.

2. Player Participation Across Different Clubs: There's a wide range in how clubs manage their player rosters, from clubs using a large pool of players to those with a consistent core team.

3. Player Participation in Games: Certain players show a high frequency of game appearances, indicating their importance to their teams.

# 5. Insights and Conclusions

## Key Insights
Based on our analysis, we can draw the following conclusions:

### Player Positions and Captaincy
- Leadership roles such as team captaincy are more often associated with positions that have a strategic overview of the game, like Centre-Backs and Midfielders.

- Certain positions, like the Sweeper, though not as common, show a higher rate of captaincy.

### Diversity in Player Participation Across Clubs
- There is a significant variance in how clubs manage their rosters. Some clubs utilize a large and diverse pool of players, while others rely on a more consistent core team.

- A higher number of unique players at a club points to more rotation or a larger squad.

### Consistency in Player Participation in Games
- Certain players show high frequencies of game appearances, suggesting their critical roles within their teams.

- This can indicate players' reliability, importance, or possibly their fitness and injury-free status.

## Limitations
The dataset provides comprehensive coverage of game lineups, but there are some limitations to our analysis:
- Identifier Variables: The dataset primarily consists of identifiers (game_id, player_id, club_id), limiting the depth of quantitative analysis.

- Lack of Contextual Data: Additional contextual data, such as player performance metrics, game outcomes, or player demographics, could have provided richer insights. This data is available across multiple datasets but, in the interest of time, was not included here. It is, however, tackled in part and individually in the other notebooks.

- Categorical Nature of Data: The predominance of categorical data limited the scope for correlation analysis or other statistical tests typically applied to numerical data.

**The dataset as a whole seems to be in good shape, with no missing values and no inconsistencies beyond the '-' placeholder in the `number` column, so it does not present major limitations for future data visualizations and analyses requested by the client.**

## Recommendations

- Further Data Collection: To deepen the analysis, collecting more detailed data on player performance, match outcomes, and demographic information could be beneficial.

- Longitudinal Analysis: Tracking these metrics over time could reveal trends and changes in team strategies, player utilization, and performance.

## Conclusions
This analysis provides valuable insights into the dynamics of game lineups, player roles, and team strategies. It highlights the importance of strategic positions in leadership roles and underscores the diversity in team management across different clubs. These findings can be particularly useful for sports analysts, team managers, and enthusiasts looking to understand the nuances of team compositions and player utilization.



# Saving the cleaned dataset

In [None]:
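# Save the cleaned DataFrame to a new CSV file, creating the target folder if needed
import os

cleaned_data_path = '../data/cleaned/game_lineups_cleaned.csv'
os.makedirs(os.path.dirname(cleaned_data_path), exist_ok=True)
game_lineups_df.to_csv(cleaned_data_path, index=False)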
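CSV files store only text, so the categorical dtypes set during preprocessing are not preserved in the saved file. A later notebook can restore them on load with the `dtype` argument of `pd.read_csv`. The sketch below demonstrates this with a hypothetical two-row frame and in-memory CSV text rather than the real file.

```python
import io

import pandas as pd

# Hypothetical frame with categorical columns, standing in for the cleaned data
sample_df = pd.DataFrame({
    "type": pd.Categorical(["starting_lineup", "substitutes"]),
    "position": pd.Categorical(["Goalkeeper", "Centre-Back"]),
})

# to_csv returns the CSV text when no path is given
csv_text = sample_df.to_csv(index=False)

# Without a dtype hint the columns come back as plain strings
plain_df = pd.read_csv(io.StringIO(csv_text))

# Passing dtype restores the categorical columns on load
restored_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"type": "category", "position": "category"},
)

print(plain_df.dtypes)
print(restored_df.dtypes)
```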