# Introduction
In this project, we classify football players into performance categories based on their overall ratings. We use Random Forest and K-Nearest Neighbors (KNN) classifiers to predict whether a central defender (CB) is "World Class," "Good," or "Mediocre" from the player's other numeric attributes. The dataset records multiple attributes for each player, which lets us examine which of them are most informative about performance.

## Data Loading and Initial Exploration
We start by loading the dataset and inspecting its structure to understand the available features and their types.

In [None]:
import pandas as pd

# Load the dataset
football_players = pd.read_csv('football_players.csv', encoding='ISO-8859-1')

# Inspect the dataset to understand its structure
print(football_players.head())
print(football_players.columns)

# Find the most common Overall score
most_common_overall = football_players['Overall'].mode()[0]
print(f"The most common Overall score for players in the database is: {most_common_overall}")


## Classifying Central Defenders
We focus on central defenders, selected as players whose `Preferred Positions` include CB, and classify them into performance categories based on their overall scores. "World Class" players have an overall score of 80 or above, "Good" players score from 70 to 79, and "Mediocre" players score below 70.

In [None]:
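# Create a subset of central defenders.
# .copy() makes this an independent DataFrame, so adding the 'Class' column
# later does not trigger pandas' SettingWithCopyWarning.
cb_players = football_players[football_players['Preferred Positions'].str.contains('CB', na=False)].copy()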

# Classify players based on overall score
def classify_player(overall):
    if overall >= 80:
        return 'World Class'
    elif 70 <= overall < 80:
        return 'Good'
    else:
        return 'Mediocre'

cb_players['Class'] = cb_players['Overall'].apply(classify_player)
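
The same thresholds can also be expressed as a vectorized binning step. The sketch below is one possible alternative, not part of the original notebook. It assumes `Overall` is numeric and uses `pd.cut` with left-closed bins on a small made-up series (the `overall` values are illustrative only) to check the boundaries at 70 and 80.

```python
import pandas as pd

# Illustrative scores only; not taken from the dataset.
overall = pd.Series([65, 69, 70, 79, 80, 91])

# Left-closed bins: [-inf, 70), [70, 80), [80, inf)
classes = pd.cut(
    overall,
    bins=[float("-inf"), 70, 80, float("inf")],
    labels=["Mediocre", "Good", "World Class"],
    right=False,
)

print(classes.tolist())
```

With `right=False`, a score of exactly 70 falls in "Good" and a score of exactly 80 in "World Class", matching `classify_player`.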


## Random Forest Classifier
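We fit a Random Forest classifier on all central defenders and use its impurity-based feature importances to identify the attributes that matter most for predicting the performance class. `Overall` is excluded from the features because the class labels are derived directly from it.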


In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Prepare features and target
features = cb_players.select_dtypes(include=[np.number]).columns.drop('Overall')
X = cb_players[features]
y = cb_players['Class']

# Encode the target variable
le = LabelEncoder()
y = le.fit_transform(y)

# Create and fit the random forest model
rf_model = RandomForestClassifier(n_estimators=500, random_state=1971)
rf_model.fit(X, y)

# Get feature importances
importances = rf_model.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)

# Print the top 5 most important features
print("Top 5 most important features:")
print(feature_importances.head())
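
The target encoding above depends on how `LabelEncoder` assigns integer codes: it sorts the distinct labels, so the codes follow alphabetical order rather than the ranking of the classes. A minimal sketch on a made-up label list (the `labels` values and the `le_demo` name are illustrative) makes the mapping explicit:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; not taken from the dataset.
labels = ["World Class", "Good", "Mediocre", "Good"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# Map each class name to the integer code the encoder assigned to it.
mapping = {
    str(name): int(code)
    for name, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))
}

print(mapping)
print(codes.tolist())
```

Any later step that refers to a class by its integer code needs to use this mapping, in which "Good" is 0, "Mediocre" is 1, and "World Class" is 2.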


## K-Nearest Neighbors Classifier
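Next, we use the K-Nearest Neighbors classifier to predict player classes. We hold out a third of the data as a test set, standardize the features, and compare k from 1 to 5 by the F1 score on the "World Class" and "Good" classes.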

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=911)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

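# Function to calculate F1 scores for the "World Class" and "Good" classes.
# LabelEncoder assigns codes in sorted order (Good=0, Mediocre=1, World Class=2),
# so look the codes up from the encoder instead of hard-coding them.
def calculate_f1_scores(y_true, y_pred):
    world_class_code, good_code = le.transform(['World Class', 'Good'])
    return {
        'World Class': f1_score(y_true, y_pred, labels=[world_class_code], average=None)[0],
        'Good': f1_score(y_true, y_pred, labels=[good_code], average=None)[0]
    }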

# Train and evaluate KNN models for k from 1 to 5
results = {}
for k in range(1, 6):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    results[k] = calculate_f1_scores(y_test, y_pred)

# Find the best k for each class
best_k_world_class = max(results, key=lambda k: results[k]['World Class'])
best_k_good = max(results, key=lambda k: results[k]['Good'])

print(f"World Class: k={best_k_world_class}, Good: k={best_k_good}")
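
For reference, the per-class F1 score that `f1_score(..., average=None)` reports is the harmonic mean of precision and recall for that class, which reduces to 2*TP / (2*TP + FP + FN). The following is a small pure-Python sketch of that formula on made-up labels. `per_class_f1` is a hypothetical helper, not part of scikit-learn or the original notebook.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if the denominator is 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Illustrative encoded labels (0 = Good, 1 = Mediocre, 2 = World Class).
y_true_demo = [0, 0, 1, 2, 2, 2]
y_pred_demo = [0, 1, 1, 2, 2, 0]

scores = {label: per_class_f1(y_true_demo, y_pred_demo, label) for label in (0, 1, 2)}
print(scores)
```

Because each class has its own TP, FP, and FN counts, the k that maximizes F1 for one class need not be the k that maximizes it for another, which is why the best k is reported separately per class.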


## Conclusion
This project applies two machine learning classifiers to football players' attribute data: a Random Forest to rank the attributes that best separate the performance classes of central defenders, and KNN to compare values of k by per-class F1 score. Identifying these key features and parameters helps us understand which attributes contribute to a player's success as a central defender. Future work could explore additional features, tune hyperparameters further, and test other classification algorithms to improve the models' accuracy and reliability.