## Machine Learning Project - Determining Football Player Positions

Authors:
- Grégoire ALPEROVITCH
- Nicolas FLANDIN
- Maxime Chamont

# Introduction

All three of us are passionate about football, so we decided to combine that passion with our IT skills in this project. From a computing point of view, a football player can be seen as an object with precise statistics that reflect his abilities. This led us to the hypothesis that these statistics could be used to determine a player's role on the pitch. For example, a player with good shooting ability is likely to be an attacker, just as a player who intercepts the ball easily is likely to be a defender.

Based on these observations, we can imagine a machine learning model that guesses whether a player is a defender, a striker or a midfielder. These three classes are our labels, which we'll abbreviate as ATT, DEF and MID. For this project we use a database, available at this <a href="https://www.kaggle.com/datasets/nyagami/ea-sports-fc-25-database-ratings-and-stats?select=male_players.csv">link</a>, that contains professional football players with their statistics and their positions.

To help us develop this model, we will use the pandas, numpy, sklearn and matplotlib packages.

In [None]:
# pandas lets us load and manipulate the data as a DataFrame, a tabular Python object
import pandas as pd

file_path = "./male_players.csv"
data = pd.read_csv(file_path)

data.head()

However, as we can see from the male_players.csv file, this dataset is large and complex, so we have to process the data before we can start tuning the model. The first step is to remove the players labelled 'GK'. Goalkeepers have special, unique features, so our model would not be able to guess their position.

In [None]:
# Data cleansing: remove spaces around positions
data['Position'] = data['Position'].str.strip()

# Exclude goalkeepers (GK) from data 
data = data[data['Position'] != 'GK']

As explained earlier, each player is assigned a position on the pitch, which is our label. However, many of these positions are virtually identical. Keeping them all would create redundancy among the classes, which would distort the model and reduce accuracy. We therefore simplify them: specific positions ("ST", "CM"...) are grouped into general labels ("ATT", "MID", "DEF").

In [None]:
# Position dictionary with main categories
positions = {
    "ATT": ["LW", "RW", "ST"],
    "MID": ["CM", "CDM", "CAM", "LM", "RM"],
    "DEF": ["LB", "RB", "CB"]
}

# Position mapping function
def map_position(position):
    for category, values in positions.items():
        if position in values:
            return category
    return position  # If the position is not in the dictionary, we keep it unchanged

# Applying the `map_position` function to the DataFrame's 'Position' column
data['Position'] = data['Position'].apply(map_position)
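
The grouping above can also be written as a flat lookup table. The following is a minimal sketch, assuming the same `positions` dictionary; the name `position_to_group` and the toy `sample` Series are illustrative and do not come from the dataset.

```python
import pandas as pd

positions = {
    "ATT": ["LW", "RW", "ST"],
    "MID": ["CM", "CDM", "CAM", "LM", "RM"],
    "DEF": ["LB", "RB", "CB"],
}

# Invert the dictionary: specific position -> general group
position_to_group = {
    specific: group
    for group, specifics in positions.items()
    for specific in specifics
}

# Toy data (hypothetical), including one position that is not in the dictionary
sample = pd.Series(["ST", "CDM", "CB", "LWB"])

# map() yields NaN for unknown keys, so fall back to the original value
grouped = sample.map(position_to_group).fillna(sample)
print(grouped.tolist())  # ['ATT', 'MID', 'DEF', 'LWB']
```

Any position that is absent from the dictionary survives as its own class, so inspecting the unique values of the mapped column is a quick way to spot labels that still need a group.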

The dataset contains many features, but many of them are redundant. For example, a player's top speed and his normal speed are two strongly correlated features, so only one of them needs to be kept. Redundancy tends to disrupt the model. At the same time, we remove rows with missing values in the selected features.

In [None]:
# Define features with more detailed stats than the base ones
features = [
    'PAC', 'SHO', 'PAS', 'DRI', 'DEF', 'PHY',               # Base stats on cards
    'Finishing', 'Heading Accuracy', 'Positioning',        # Attacking statistics 
    'Short Passing', 'Long Passing', 'Vision',             # Midfield statistics
    'Ball Control', 'Standing Tackle', 'Sliding Tackle',   # Defensive statistics
    'Interceptions', 'Acceleration', 'Sprint Speed',       # Additional recommended stats
    'Agility', 'Balance', 'Stamina', 'Strength'            # Physical and agility stats 
]

# Drop rows with missing values for any of the selected features
data = data.dropna(subset=features)

# Define features and label
X = data[features] 
y = data['Position']
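
One way to look for redundant features is to inspect the pairwise correlations. The sketch below uses synthetic data in which one column is deliberately built from another; the `toy` DataFrame and the 0.9 threshold are assumptions chosen for illustration, not values from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stats (hypothetical): 'Sprint Speed' is built to track 'Acceleration'
rng = np.random.default_rng(0)
acceleration = rng.uniform(40, 95, size=200)
toy = pd.DataFrame({
    "Acceleration": acceleration,
    "Sprint Speed": acceleration + rng.normal(0, 2, size=200),
    "Strength": rng.uniform(40, 95, size=200),
})

# Absolute pairwise Pearson correlations
corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # illustrative cut-off
pairs = upper.stack()
redundant = pairs[pairs > threshold]
print(redundant)
```

Applied to the real data, the same lines would run on `data[features]` instead of `toy`; which feature of a correlated pair to drop remains a judgment call.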

Here, the data is split into training and test sets and standardized so that all features are on a comparable scale.

In [None]:
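from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features; the scaler is fitted on the training set only
# so that no information from the test set leaks into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)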
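
A common way to make sure the scaler only learns from the training data is to wrap it in a scikit-learn pipeline. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo` and the choice of logistic regression are assumptions made for illustration, not the project's setup. It also passes `stratify` to the split so that the three classes keep similar proportions in both subsets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player statistics (3 classes, like ATT/MID/DEF)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=3, random_state=42,
)

# stratify keeps the class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics to transform the test data
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipeline.fit(X_tr, y_tr)
print(f"Test accuracy: {pipeline.score(X_te, y_te):.2f}")
```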

In this section, we define the machine learning models that will be used to predict player positions from their statistics. The chosen models cover a variety of popular supervised learning approaches (KNN, Random Forest, SVM, Softmax).

In [None]:
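from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Define multiple models and store them for future use
models = {
    'KNN': KNeighborsClassifier(n_neighbors=51),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='linear', probability=True),
    # With the lbfgs solver, multiclass targets are fitted with a multinomial (softmax) loss
    'Logistic Regression (Softmax)': LogisticRegression(max_iter=200, solver='lbfgs')
}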

Each model is trained on the training set and evaluated on the test set. For each one we display the accuracy, a classification report and a confusion matrix.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Function to train and evaluate each model
def evaluate_models(models, X_train, X_test, y_train, y_test):
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy of {model_name}: {accuracy * 100:.2f}%")
        print("Classification report:\n", classification_report(y_test, y_pred))
        print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
        print("\n" + "-"*50 + "\n")
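
To make the confusion matrix easier to read, here is a small worked sketch with six hypothetical players. The label lists are invented for illustration; only the scikit-learn functions are the same as in the evaluation function.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six players
y_true = ["ATT", "ATT", "MID", "MID", "DEF", "DEF"]
y_hat = ["ATT", "MID", "MID", "MID", "DEF", "MID"]

# Fix the label order so that rows and columns are easy to read
labels = ["ATT", "MID", "DEF"]
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Rows are true classes, columns are predicted classes
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [0 1 1]]

# 4 of the 6 predictions are correct
print(accuracy_score(y_true, y_hat))
```

Without the `labels` argument, scikit-learn orders the classes alphabetically (ATT, DEF, MID), which is worth keeping in mind when reading the matrices printed by the evaluation function.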

We create a function to test different numbers of neighbors for KNN and to display their impact on accuracy.


In [None]:
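import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# Function to determine the optimal K for KNN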
def find_best_k(X_train, y_train, X_test, y_test):
    Ks = 100
    mean_acc = np.zeros((Ks-1))

    for n in range(1, Ks):
        neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
        yhat = neigh.predict(X_test)
        mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    # Plot accuracy vs K
    plt.plot(range(1, Ks), mean_acc, 'g')
    plt.ylabel('Accuracy')
    plt.xlabel('Number of Neighbors (K)')
    plt.title('Accuracy vs. Number of Neighbors (K)')
    plt.show()

    # Display the best accuracy obtained and the corresponding K value
    print(f"The best accuracy was {mean_acc.max():.4f} with k={mean_acc.argmax() + 1}")
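
The function above picks K by looking at the test set, which can make the reported accuracy slightly optimistic. A possible alternative, sketched here on synthetic data, is to choose K by cross-validation on the training set and keep the test set for the final check. The data, the range of K values and the number of folds are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the player statistics (3 classes)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Odd values of K from 1 to 29 (illustrative range)
param_grid = {"n_neighbors": list(range(1, 30, 2))}

# 5-fold cross-validation on the training set only
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best K:", search.best_params_["n_neighbors"])
print(f"Cross-validated accuracy: {search.best_score_:.2f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.2f}")
```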

Finally, we define a function that predicts the position of a specific player from his statistics, using the selected model.

In [None]:
# Function to predict player's position based on their name and chosen model
def predict_player_position(player_name, model, data, features, scaler):
    # Case-insensitive, literal (non-regex) search for the player's name
    player_data = data[data['Name'].str.contains(player_name, case=False, na=False, regex=False)]
    
    if player_data.empty:
        print("Player not found!")
        return
    
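    # Extract the features of the first matching player, keeping the column names
    # so that the scaler sees the same feature names as during fitting
    player_features = player_data[features].iloc[[0]]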
    
    # Normalize the player's features using the same scaler as the training data
    player_features_scaled = scaler.transform(player_features)
    
    # Predict the player's position using the selected model
    predicted_position = model.predict(player_features_scaled)
    
    # Keep the prediction for the first matching player
    predicted_position = predicted_position[0]
    
    # Simplify predicted positions to broader categories
    if predicted_position in ["ST", "LW", "RW"]:
        predicted_position = "ATT"
    elif predicted_position in ["CM", "CDM", "CAM", "LM", "RM"]:
        predicted_position = "MID"
    elif predicted_position in ["LB", "RB", "CB"]:
        predicted_position = "DEF"

    print(f"The predicted position for {player_name} is: {predicted_position}")
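
Because the lookup matches any name that contains the query, a short query can match several players, and only the first match is used for the prediction. The sketch below illustrates this on a toy roster; the names and the `find_players` helper are hypothetical and only serve to show the behaviour of the search.

```python
import pandas as pd

# Toy roster (hypothetical names)
roster = pd.DataFrame({"Name": ["Alex Silva", "Bruno Silva", "Carl Jones"]})

def find_players(name, frame):
    """Return all rows whose 'Name' contains the query, ignoring case."""
    mask = frame["Name"].str.contains(name, case=False, na=False, regex=False)
    return frame[mask]

# A partial name can match several players
matches = find_players("silva", roster)
print(len(matches))  # 2

# An exact, case-insensitive comparison narrows the search to one player
exact = roster[roster["Name"].str.lower() == "carl jones"]
print(exact["Name"].tolist())  # ['Carl Jones']
```

Passing the full name of the player is therefore the safest way to call the prediction function.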