# Unsupervised Learning: Clustering National Teams

In this notebook, we apply **KMeans Clustering** to group national teams based on their performance-related features.
This method helps us explore structural patterns among teams without using the target variable (`stage_score`).

We aim to see whether certain clusters align with team success or confederation membership.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

sns.set(style='whitegrid')


## Load Dataset

In [None]:
# Simulated placeholder data that mirrors the project's column structure
np.random.seed(42)  # fix the seed so the simulated values are reproducible
df = pd.DataFrame({
    'fifa_rank': np.random.randint(1, 200, 100),
    'win_rate': np.random.rand(100),
    'goal_difference_per_game': np.random.normal(0, 2, 100),
    'confederation': np.random.choice(['UEFA', 'CONMEBOL', 'CAF', 'AFC', 'CONCACAF'], 100),
    'stage_score': np.random.randint(0, 11, 100)
})
df.head()

## Preprocessing for Clustering

In [None]:
features = ['fifa_rank', 'win_rate', 'goal_difference_per_game']
X = df[features]

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
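
Standardization matters here because KMeans relies on Euclidean distance, so an unscaled `fifa_rank` (range in the hundreds) would dominate `win_rate` (range 0 to 1). The following is a minimal sketch on a hypothetical toy matrix (`X_demo` is an assumption, not the project data) that checks each scaled column ends up with mean 0 and unit standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix with columns on very different scales
# (rank-like integers, a rate in [0, 1], a goal-difference-like value)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(1, 200, 50),
    rng.random(50),
    rng.normal(0, 2, 50),
]).astype(float)

X_demo_scaled = StandardScaler().fit_transform(X_demo)

col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)  # population std (ddof=0), as used by StandardScaler

print("column means:", np.round(col_means, 6))
print("column stds: ", np.round(col_stds, 6))
```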


## Finding the Optimal Number of Clusters (Elbow Method)

In [None]:
inertias = []
K_range = range(1, 10)
for k in K_range:
    # Set n_init explicitly; its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(6, 4))
plt.plot(K_range, inertias, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.tight_layout()
plt.show()
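
The elbow in an inertia curve can be hard to read, so a second criterion is often useful. As one possible complement (not part of the original analysis), the sketch below computes the silhouette score for each `k` on hypothetical random data; `X_demo` and `silhouette_by_k` are illustrative names. The silhouette score is only defined for `k >= 2`, and higher values indicate better-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_scaled: 100 samples, 3 standardized features
rng = np.random.default_rng(42)
X_demo = StandardScaler().fit_transform(rng.normal(size=(100, 3)))

silhouette_by_k = {}
for k in range(2, 10):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

# Candidate k with the highest average silhouette
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

On data without real group structure, such as the random values used here, the scores tend to be low for every `k`, which is itself a useful signal.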


## Apply KMeans Clustering (k=3 as example)

In [None]:
# Set n_init explicitly; its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
df['cluster'] = clusters
df.head()
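
Cluster centers are learned in standardized units, which are hard to read directly. A minimal sketch of one way to inspect them, assuming a hypothetical `demo` frame with the same three feature columns: map the centers back to the original units with `inverse_transform` and count the members of each cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['fifa_rank', 'win_rate', 'goal_difference_per_game']

# Hypothetical data with the same columns as the notebook's feature set
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'fifa_rank': rng.integers(1, 200, 100),
    'win_rate': rng.random(100),
    'goal_difference_per_game': rng.normal(0, 2, 100),
})

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(demo[features])
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo_scaled)

# Centers expressed in the original feature units
centroids = pd.DataFrame(
    scaler_demo.inverse_transform(kmeans_demo.cluster_centers_),
    columns=features,
)
# Number of teams assigned to each cluster
sizes = pd.Series(kmeans_demo.labels_).value_counts().sort_index()

print(centroids.round(2))
print(sizes)
```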

## Visualize Clusters in 2D using PCA

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df['PCA1'] = X_pca[:, 0]
df['PCA2'] = X_pca[:, 1]

plt.figure(figsize=(7, 5))
sns.scatterplot(data=df, x='PCA1', y='PCA2', hue='cluster', palette='Set2')
plt.title('Clusters of Teams (PCA Projection)')
plt.grid(True)
plt.tight_layout()
plt.show()


## Analyze Cluster Characteristics

In [None]:
# Check mean values of each cluster
cluster_summary = df.groupby('cluster')[['fifa_rank', 'win_rate', 'goal_difference_per_game', 'stage_score']].mean()
cluster_summary
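
The means above cover the numeric columns only. To check the stated goal of seeing whether clusters align with confederation membership, one option is a row-normalized cross-tabulation. This is a sketch on a hypothetical `demo` frame with randomly assigned labels, so no alignment is expected in its output:

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels and confederations for 100 teams
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'cluster': rng.integers(0, 3, 100),
    'confederation': rng.choice(['UEFA', 'CONMEBOL', 'CAF', 'AFC', 'CONCACAF'], 100),
})

# Share of each confederation within every cluster (each row sums to 1)
confed_share = pd.crosstab(demo['cluster'], demo['confederation'], normalize='index')
print(confed_share.round(2))
```

If one confederation dominated a row, that cluster would largely coincide with a region; roughly equal shares across a row suggest the cluster is unrelated to confederation.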


## PCA Explanation and Variance Analysis

Principal Component Analysis (PCA) is used to reduce the dimensionality of our dataset while preserving as much variance as possible.
We use PCA here to visualize team clusters in 2D.

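**PC1 and PC2** are orthogonal linear combinations of the standardized features: PC1 points in the direction of greatest variance, and PC2 captures the most of the variance that remains.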

Let’s see how much variance is explained by these two components.

In [None]:
# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(f"PC1 explains {explained_variance[0]:.2%} of variance")
print(f"PC2 explains {explained_variance[1]:.2%} of variance")
print(f"Together, they explain {(explained_variance[0] + explained_variance[1]):.2%} of total variance.")

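### Interpretation
- With only three input features, PC1 and PC2 always retain at least two-thirds of the variance, so a total in the 60–80% range is not informative by itself. The closer the total is to 100%, the more faithfully the 2D plot preserves distances from the original feature space.
- When the retained variance is high, compare the cluster labels against `stage_score` and `confederation` to check whether the groups track tournament success or regional membership.
- The clusters show which teams are similar on the chosen performance metrics (`fifa_rank`, `win_rate`, `goal_difference_per_game`).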
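
To see what the two plot axes actually represent, it can help to look at the component loadings, that is, the weight each original feature receives in PC1 and PC2. A minimal sketch on hypothetical data (`X_demo` and `loadings` are illustrative names, not part of the analysis above):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ['fifa_rank', 'win_rate', 'goal_difference_per_game']

# Hypothetical feature matrix with the same three columns
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(1, 200, 100),
    rng.random(100),
    rng.normal(0, 2, 100),
]).astype(float)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Rows are features, columns are components; each column is a unit-length vector
loadings = pd.DataFrame(pca_demo.components_.T, index=features, columns=['PC1', 'PC2'])
print(loadings.round(3))
print("retained variance:", pca_demo.explained_variance_ratio_.sum().round(3))
```

A feature with a large absolute loading on PC1 drives most of the horizontal spread in the scatter plot; the sign only indicates direction along the axis.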