# Premier League Data Analysis

## Introduction

This project aims to analyze Premier League data to understand team and player performance through various statistics. The dataset includes player names, teams, nationalities, ages, playing minutes, goals, assists, shots, cards, dribbles, and many other indicators that help assess individual and collective performance.

Through this analysis, insights will be extracted to answer questions such as:
- What are the most influential factors in goal scoring?
- How do passes and dribbles impact player performance?
- Which teams excel in defense or attack based on available data?
- Can player performance in upcoming matches be predicted using historical data?

Using data analysis tools and statistical visualizations, the analysis aims to produce results that football fans, sports analysts, and coaches can use to inform playing strategies.


In [None]:
# Import necessary libraries for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")


In [None]:
# Load the dataset and take an initial look
df = pd.read_csv('player_stats.csv') 

# Display the first few rows to understand the structure
df.head()


In [None]:
# Check the shape of the dataset (rows, columns)
df.shape

In [None]:
# Check for missing values
df.isna().sum()

In [None]:
# View basic info about columns and data types
df.info()

In [None]:
# View basic statistics for numerical columns
df.describe()

In [None]:
# Create a backup copy of the original dataset to preserve the raw data
df_original = df.copy()

## Data Cleaning

In [None]:
# Check for missing values in each column
df.isna().sum()

In [None]:
# Clean 'Pass Completion %' column:
# - Replace commas with dots
# - Convert to float
# - Fill missing values with column mean
df['Pass Completion %'] = df['Pass Completion %'].str.replace(',', '.')
df['Pass Completion %'] = df['Pass Completion %'].astype(float)
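# Assign the result back instead of using inplace=True on a column selection,
# which may not update the DataFrame in recent pandas versions
df['Pass Completion %'] = df['Pass Completion %'].fillna(df['Pass Completion %'].mean())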
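
The cleaning steps for this column can be tried on a small sample first. The sketch below uses invented values, not rows from `player_stats.csv`, and swaps `astype(float)` for `pd.to_numeric(..., errors='coerce')`, which is one way to tolerate unexpected strings by turning them into missing values.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from player_stats.csv
raw = pd.Series(['85,7', '90,0', None, '72,3'], name='Pass Completion %')

# Replace decimal commas with dots, then convert; unparseable entries become NaN
cleaned = pd.to_numeric(raw.str.replace(',', '.', regex=False), errors='coerce')

# Fill missing values with the column mean by assignment rather than inplace=True
cleaned = cleaned.fillna(cleaned.mean())
print(cleaned.tolist())
```

Whether the mean is an appropriate fill value depends on why values are missing; that choice is kept here only because the notebook already makes it.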

In [None]:
# Rename '#' column to 'NUM' for clarity
df.rename(columns={'#': 'NUM'}, inplace=True)

In [None]:
# Split the 'Age' column on '-' and keep only the first part (the actual age)
# Removing extra text after '-' (e.g., '29-343' -> '29')
df['Age'] = df['Age'].str.split('-').str[0]

# Convert the cleaned 'Age' column to integer type
df['Age'] = df['Age'].astype(int)
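
`astype(int)` raises an error if any age is missing or malformed. As a hedged alternative, assuming such rows could exist in the file, the sketch below parses invented sample values into pandas' nullable `Int64` type so that missing ages stay missing instead of stopping the cell.

```python
import pandas as pd

# Hypothetical sample in the 'years-days' style seen in the Age column
ages = pd.Series(['29-343', '21-010', None, '33'])

# Keep the part before '-' and convert; missing or malformed values become <NA>
years = pd.to_numeric(ages.str.split('-').str[0], errors='coerce').astype('Int64')
print(years.tolist())
```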

In [None]:
# Extract the primary position from the list by keeping only the first value
df['Position'] = df['Position'].str.strip().str.split(',').str[0]

In [None]:
# Convert the 'Date' column to datetime (the input is expected in year-month-day format)
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
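
`pd.to_datetime` parses the values but does not change how they are displayed. If a month-day-year display is wanted, one option is to derive a separate string column with `dt.strftime`, as in this sketch on invented dates. Keeping the datetime column itself is usually preferable for sorting and filtering.

```python
import pandas as pd

# Hypothetical sample dates in the same year-month-day input format
dates = pd.to_datetime(pd.Series(['2024-08-17', '2025-01-04']), format='%Y-%m-%d')

# strftime returns strings; the original datetime values are left unchanged
as_mdy = dates.dt.strftime('%m-%d-%Y')
print(as_mdy.tolist())  # ['08-17-2024', '01-04-2025']
```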



## Exploratory Data Analysis (EDA)



In this section, we explore the dataset to uncover patterns, trends, and relationships. This includes analyzing distributions, top performers, correlations, and team-level statistics. Visualizations support the interpretation of player and team performance.


In [None]:
# Age distribution 

plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=15, kde=True, color='skyblue')
plt.title('Player Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


In [None]:
# Top 20 Midfielders with the Most Assists
# Filter midfielders only
midfield_roles = ['DM', 'CM', 'LM', 'RM']
midfielders = df[df['Position'].isin(midfield_roles)]

# Sort by Assists
top_mf_assists = midfielders.sort_values(by='Assists', ascending=False).head(20)
# Plot
plt.figure(figsize=(10, 6))
sns.barplot(data=top_mf_assists, x='Assists', y='Player', palette='Purples_d', errorbar=None)
plt.title('Top 20 Midfielders by Assists')
plt.xlabel('Assists')
plt.ylabel('Player')
plt.tight_layout()
plt.show()


In [None]:
# Top 20 Defenders by Tackles

# Define defender positions
defensive_roles = ['RB', 'LB', 'CB', 'WB']

# Filter for defenders
defenders = df[df['Position'].isin(defensive_roles)]

# Sort by Tackles
top_defenders = defenders.sort_values(by='Tackles', ascending=False).head(20)

# Plot
plt.figure(figsize=(12, 8))
sns.barplot(
    data=top_defenders,
    x='Tackles',
    y='Player',
    hue='Position',
    palette='Greens',
    errorbar=None
)
plt.title('Top 20 Defenders by Tackles')
plt.xlabel('Tackles')
plt.ylabel('Player')
plt.legend(title='Position')
plt.tight_layout()
plt.show()

In [None]:
# Filter only Forwards

forwards = df[df['Position'] == 'FW']

# Sort by Goals
top_scorers = forwards.sort_values(by='Goals', ascending=False).head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(
    data=top_scorers,
    x='Goals',
    y='Player',
    palette='Reds',
    errorbar=None
)
plt.title('Top 10 Forwards by Goals Scored')
plt.xlabel('Goals')
plt.ylabel('Player')
plt.tight_layout()
plt.show()


In [None]:
# The League's Top Scorer
# Sort the entire DataFrame by 'Goals' in descending order to get the top scorers
top_10_scorers = df.sort_values(by='Goals', ascending=False).head(10)

# Create a horizontal bar plot to visualize the top 10 players by goals scored
plt.figure(figsize=(10, 6))  # Set the figure size for better readability

# Draw the barplot with player names on the y-axis and their goal count on the x-axis
sns.barplot(
    data=top_10_scorers,
    x='Goals',
    y='Player',
    palette='Oranges',     # Set a warm orange color palette
    errorbar=None          # Disable confidence intervals to keep it clean
)

# Add a descriptive title and axis labels
plt.title('Top 10 Goal Scorers (All Positions)')
plt.xlabel('Goals')
plt.ylabel('Player')

# Adjust layout to prevent overlapping or cutting off labels
plt.tight_layout()

# Show the final plot
plt.show()
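
The presence of a `Date` column suggests the file may hold one row per player per match. This is an assumption, not something the notebook confirms. If it holds, sorting rows by `Goals` ranks single-match performances, and a season ranking needs per-player totals first. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical match-level rows: each player appears once per match played
toy = pd.DataFrame({
    'Player': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Team':   ['X', 'Y', 'X', 'Z', 'Y', 'X'],
    'Goals':  [2, 1, 1, 0, 2, 1],
})

# Sum goals per player (and team) before ranking
season_goals = (
    toy.groupby(['Player', 'Team'], as_index=False)['Goals']
       .sum()
       .sort_values('Goals', ascending=False)
       .head(10)
)
print(season_goals)
```

The resulting frame can be passed to `sns.barplot` in the same way as `top_10_scorers`. If the file already contains one row per player, this step changes nothing.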

In [None]:
# Strongest Attacking Teams
# The team with the highest number of goals scored is considered to have the strongest attacking line.

top_attacking_teams = df.groupby('Team')['Goals'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_attacking_teams.values, y=top_attacking_teams.index, palette='Reds')
plt.title('Top 10 Teams with Strongest Attacking Line (Total Goals Scored)')
plt.xlabel('Total Goals Scored')
plt.ylabel('Team')
plt.tight_layout()
plt.show()


In [None]:
# Get correlation matrix
corr = df.corr(numeric_only=True)

# Create mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Plot heatmap with mask
plt.figure(figsize=(14, 10))
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap (Cleaned)')
plt.tight_layout()
plt.show()


In [None]:
# Get full correlation
corr_matrix = df.corr(numeric_only=True)

# Unstack and filter top absolute correlations
top_corrs = corr_matrix.unstack().reset_index()
top_corrs.columns = ['Feature1', 'Feature2', 'Correlation']

# Remove self-correlations and mirrored duplicates by keeping each unordered pair once
# (assumes all column names are strings, so they can be compared alphabetically)
top_corrs = top_corrs[top_corrs['Feature1'] < top_corrs['Feature2']].copy()
top_corrs['AbsCorr'] = top_corrs['Correlation'].abs()
top_corrs = top_corrs.sort_values('AbsCorr', ascending=False)

# Show top 30
top_corrs.head(30)
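
One way to list each feature pair exactly once, independent of the coefficient values, is to mask the correlation matrix above its diagonal before stacking. The sketch below uses an invented three-column frame; the column names `a`, `b`, and `c` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data standing in for the player statistics
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
})
corr = toy.corr(numeric_only=True)

# Keep only entries strictly above the diagonal: one per unordered pair
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Stack into long form; dropna() removes the masked cells on any pandas version
pairs = upper.stack().dropna().reset_index()
pairs.columns = ['Feature1', 'Feature2', 'Correlation']

# Rank by absolute correlation so strong negative pairs are not buried
pairs = pairs.sort_values('Correlation', key=lambda s: s.abs(), ascending=False)
print(pairs)
```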


In [None]:
#  Tactical Analysis: Evaluating Passing Accuracy and Ball Progression by Position


#  Scatter plot: Relationship between pass accuracy and progressive passes
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data= df, 
    x='Pass Completion %', 
    y='Progressive Passes', 
    hue='Position', 
    palette='coolwarm', 
    alpha=0.7
)
plt.title('Impact of Pass Accuracy on Progressive Passing by Position')
plt.xlabel('Pass Completion Percentage')
plt.ylabel('Number of Progressive Passes')
plt.legend(title='Position')
plt.show()

In [None]:
# Scatter plot: Influence of successful dribbles on progressive carries
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df, 
    x='Successful Dribbles', 
    y='Progressive Carries', 
    hue='Position', 
    palette='viridis', 
    alpha=0.7
)
plt.title('Effect of Dribbling Success on Progressive Ball Carrying')
plt.xlabel('Number of Successful Dribbles')
plt.ylabel('Number of Progressive Carries')
plt.legend(title='Position')
plt.show()
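
A scatter plot over all positions can hide differences between groups. To put a number on the relationship, one option is a Pearson correlation computed separately for each position. The sketch below runs on invented values that reuse the notebook's column names.

```python
import pandas as pd

# Hypothetical rows; real values would come from df
toy = pd.DataFrame({
    'Position': ['CM', 'CM', 'CM', 'CM', 'FW', 'FW', 'FW', 'FW'],
    'Successful Dribbles': [0, 1, 2, 3, 1, 2, 3, 4],
    'Progressive Carries': [1, 2, 2, 4, 0, 3, 3, 5],
})

# Pearson correlation between the two metrics within each position group
per_position = pd.Series({
    pos: group['Successful Dribbles'].corr(group['Progressive Carries'])
    for pos, group in toy.groupby('Position')
})
print(per_position.round(2))
```

Positions with only a handful of players give unstable coefficients, so group sizes are worth checking alongside the result.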

## Summary of Key Insights & Recommendations

This analysis of Premier League player data provided several valuable insights into individual and team performances across various positions:

- **Top Performers by Role:**  
  - The forwards at the top of the goals ranking stood out as the league's most prolific scorers.
  - The midfielders at the top of the assists ranking showed the strongest creative influence.
  - Defenders with the highest number of tackles demonstrated solid defensive contribution, especially in the full-back and center-back roles.

- **Team Performance Patterns:**  
  - Chelsea emerged as the team with the strongest attacking line based on total goals scored.
  - Defensive metrics suggested key areas where teams vary widely in strength, which may be useful for tactical planning and scouting.

- **Correlation Insights:**  
  - Strong positive correlations were found between:
    - Passes Attempted and Passes Completed
    - Touches and Passes Attempted
    - xG and Non-Penalty xG
    - Progressive actions and passing accuracy  
  These suggest that ball progression and possession are tightly linked to passing efficiency and dribbling success.

- **Tactical Observations:**  
  - Players with higher pass completion percentages tended to contribute more progressive passes, especially in midfield and defensive roles.
  - Successful dribbles were also strongly associated with progressive carries, particularly among wide players and attacking midfielders.

### Recommendations for Further Analysis:
- Incorporate expected goals (xG) and expected assists (xAG) into player performance models.
- Extend analysis to a team-level aggregation for defensive and attacking metrics.
- Use clustering or PCA for dimensionality reduction and player role segmentation.
- Combine with external match results or standings to align performance with team success.
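
As a rough illustration of the PCA recommendation, the sketch below standardizes a numeric matrix and extracts principal components with NumPy's SVD. The matrix is random stand-in data, not the notebook's dataset; in practice `X` would hold selected numeric player metrics with missing values handled beforehand.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical matrix: 50 players x 4 numeric metrics (random stand-in data)
X = rng.normal(size=(50, 4))

# Standardize each column so metrics on different scales are comparable
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized matrix
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of total variance captured by each component
explained = S**2 / np.sum(S**2)

# Player coordinates on the first two components, usable for plotting or clustering
scores = Z @ Vt[:2].T
print(explained.round(3))
```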

This notebook provides a solid foundation for further tactical and predictive modeling and is suitable for portfolio demonstration, scouting support, or strategic analysis use cases.
