# Football Player Stats

The dataset describes each player with columns such as the following:



1. `Player`: Name of the player.
2. `Nation`: Nationality.
3. `Pos`: Position.
4. `Squad`: Team.
5. `Comp`: Competition/League.
6. `Age`: Age of the player.
7. `Born`: Year of birth.
8. `MP`: Matches played.
9. `Starts`: Number of starts.
10. `Min`: Minutes played.



In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import time
import seaborn as sns
from sklearn.model_selection import KFold

In [2]:
# Reading the football player stats dataset with the correct delimiter
football_df = pd.read_csv("2021-2022 Football Player Stats.csv", encoding='ISO-8859-1', delimiter=';')
football_df.head()


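| | Rk | Player | Nation | Pos | Squad | Comp | Age | Born | MP | Starts | ... | Off | Crs | TklW | PKwon | PKcon | OG | Recov | AerWon | AerLost | AerWon% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Max Aarons | ENG | DF | Norwich City | Premier League | 22.0 | 2000 | 34 | 32 | ... | 0.03 | 1.41 | 1.16 | 0.0 | 0.06 | 0.03 | 5.53 | 0.47 | 1.59 | 22.7 |
| 1 | 2 | Yunis Abdelhamid | MAR | DF | Reims | Ligue 1 | 34.0 | 1987 | 34 | 34 | ... | 0.0 | 0.06 | 1.39 | 0.0 | 0.03 | 0.0 | 6.77 | 2.02 | 1.36 | 59.8 |
| 2 | 3 | Salis Abdul Samed | GHA | MF | Clermont Foot | Ligue 1 | 22.0 | 2000 | 31 | 29 | ... | 0.0 | 0.36 | 1.24 | 0.0 | 0.0 | 0.0 | 8.76 | 0.88 | 0.88 | 50.0 |
| 3 | 4 | Laurent Abergel | FRA | MF | Lorient | Ligue 1 | 29.0 | 1993 | 34 | 34 | ... | 0.03 | 0.79 | 2.23 | 0.0 | 0.0 | 0.0 | 8.87 | 0.43 | 0.43 | 50.0 |
| 4 | 5 | Charles Abi | FRA | FW | Saint-Étienne | Ligue 1 | 22.0 | 2000 | 1 | 1 | ... | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2.0 | 0.0 | 100.0 |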




1. **Target Variable**: A suitable target variable could be the player's position (`Pos`), as it is categorical and might be interesting to predict based on other performance metrics.

2. **Missing Values**:
   - I will check for missing values and handle them appropriately, either by filling with a placeholder value or median/mean, depending on the nature of the data.

3. **Feature Transformation and Selection**:
   - Convert certain numerical columns into categorical ones if needed (e.g., age groups from `Age`).
   - Label encoding for categorical variables like `Player`, `Nation`, `Squad`, and `Comp`.
   - Select a subset of relevant features to avoid overfitting and reduce computational complexity.

4. **Handling Numerical Data**:
   - Many columns are already numerical (like `MP`, `Goals`, `Shots`), which can be directly used for analysis.
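
The preprocessing steps in this plan can be sketched roughly as follows. This is only an illustration on a made-up toy frame: the `toy` data, the age bin edges, and the names `AgeGroup` and `Squad_encoded` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for football_df; all values are made up.
toy = pd.DataFrame({
    "Pos": ["DF", "MF", "FW", "DF"],
    "Squad": ["A", "B", "A", "C"],
    "Age": [22.0, np.nan, 29.0, 34.0],
})

# Missing values: count them per column, then fill the numeric gap with the median.
missing_counts = toy.isna().sum()
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Feature transformation: derive age groups (the bin edges are an arbitrary choice).
toy["AgeGroup"] = pd.cut(
    toy["Age"], bins=[0, 23, 30, 100], labels=["young", "prime", "veteran"]
)

# Label encoding: classes are numbered in sorted order of the original labels.
squad_encoder = LabelEncoder()
toy["Squad_encoded"] = squad_encoder.fit_transform(toy["Squad"])
```

Median imputation is one option among several; dropping the affected rows is equally reasonable when only a handful of values are missing.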


After preprocessing, the setup is as follows:

1. **Selected Features**: A subset of 35 features has been chosen, including `Age`, `Goals`, `Shots`, `Assists`, `Touches`, and other metrics related to on-field performance.

2. **Target Variable**: The target variable is the player's position (`Pos`), which has been encoded into numerical format (`Pos_encoded`) using label encoding.

3. **Missing Values**: There is only one missing value in the 'Age' column. Given the minimal impact, this can be ignored for the initial analysis, or the row with the missing value can be dropped.

Now, we can apply the LazyFCA method to this dataset. The LazyFCA function requires a specification of the coding type ('i' for interval and 'c' for categorical) for each feature. Given that most features are numeric, I will treat them as interval types ('i').
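
To make the interval coding concrete, here is a minimal sketch of how an interval hypothesis can be built from a test object and a training object, and how another object is checked against it. The helper names `interval_hypothesis` and `falls_inside` and the sample values are hypothetical and only mirror the idea used by the LazyFCA function.

```python
import numpy as np

def interval_hypothesis(test_obj, train_obj):
    """Smallest per-feature interval [low, high] covering both objects (hypothetical helper)."""
    test_obj = np.asarray(test_obj, dtype=float)
    train_obj = np.asarray(train_obj, dtype=float)
    return np.minimum(test_obj, train_obj), np.maximum(test_obj, train_obj)

def falls_inside(obj, low, high):
    """True if every feature of obj lies inside its interval (hypothetical helper)."""
    obj = np.asarray(obj, dtype=float)
    return bool(np.all((low <= obj) & (obj <= high)))

# Made-up objects with two interval features, e.g. (Age, MP).
test_obj = [22, 34]
train_obj = [29, 31]
low, high = interval_hypothesis(test_obj, train_obj)

inside = falls_inside([25, 32], low, high)   # both features within their intervals
outside = falls_inside([25, 10], low, high)  # MP = 10 lies below the interval [31, 34]
```

With many interval features, few other training objects fall inside every interval at once, so a hypothesis is often supported only by the training object that produced it.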



In [3]:
# Checking for missing values in the football dataset
football_missing_values = football_df.isna().sum()

# Selecting a subset of columns for analysis
# Excluding some columns that are not directly related to on-field performance
football_selected_cols = ['Age', 'MP', 'Starts', 'Min', 'Goals', 'Shots', 'SoT', 'Assists', 'PasTotCmp', 'PasTotAtt', 'PasTotCmp%', 'PasTotDist', 'PasTotPrgDist', 'PasShoCmp', 'PasShoAtt', 'PasMedCmp', 'PasMedAtt', 'PasLonCmp', 'PasLonAtt', 'Tkl', 'TklWon', 'Press', 'PresSucc', 'Int', 'Clr', 'Touches', 'DriSucc', 'DriAtt', 'Carries', 'CarTotDist', 'CarPrgDist', 'RecTarg', 'Rec', 'AerWon', 'AerLost']

# Encoding the target variable 'Pos' (Player Position)
label_encoder_pos = LabelEncoder()
football_df['Pos_encoded'] = label_encoder_pos.fit_transform(football_df['Pos'])

# Final dataset for LazyFCA analysis
football_fca_df = football_df[football_selected_cols + ['Pos_encoded']]
football_fca_df.head(), football_missing_values[football_selected_cols]


(    Age  MP  Starts   Min  Goals  Shots   SoT  Assists  PasTotCmp  PasTotAtt  \
 0  22.0  34      32  2881   0.00   0.41  0.06     0.06       34.0       45.0   
 1  34.0  34      34  2983   0.06   0.54  0.18     0.00       38.7       47.0   
 2  22.0  31      29  2462   0.04   0.66  0.18     0.00       55.9       61.0   
 3  29.0  34      34  2956   0.00   0.91  0.21     0.06       40.7       49.8   
 4  22.0   1       1    45   0.00   0.00  0.00     0.00        4.0       12.0   
 
    ...  DriSucc  DriAtt  Carries  CarTotDist  CarPrgDist  RecTarg   Rec  \
 0  ...     1.03    2.44     33.9       199.4       121.7     36.0  32.4   
 1  ...     0.48    0.66     35.7       204.7       115.5     37.5  36.3   
 2  ...     0.99    1.53     53.5       246.5       106.3     58.6  54.2   
 3  ...     1.28    1.98     45.7       171.9        86.4     46.3  43.0   
 4  ...     0.00    0.00     18.0       118.0        18.0     24.0  16.0   
 
    AerWon  AerLost  Pos_encoded  
 0    0.47     1.59

In [4]:
def LazyFCA(X, y, cod, cv=5, min_supp=1, ranged=False, gap=None):
  """Performs a lazy classification and computes CV score

  Parameters
  ----------
  X : pandas.DataFrame
      Data features (rows are accessed positionally via .iloc)
  y : List
      Target feature
  cod : List
      Type of coding of features
      -c categorical
      -i interval intersection
  cv : int
      Number of folds in k-fold CV
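  min_supp : int
      Minimal support of hypothesis
  ranged : bool
      If classes are ordered
  gap : int
      Maximum length of interval of classification
      (must be set when ranged=True)

  Returns
  -------
  prediction : List
      Class predictions for objects (None if unclassified)
  acc : List
      Accuracy on each CV fold
  mean_acc : float
      Mean accuracy over the folds
  unclass : float
      Fraction of objects left unclassified
  """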
  y = np.array(y)
  kf = KFold(n_splits=cv, random_state=None, shuffle=False)
  prediction = [None] * len(y)
  if ranged:
    acc=[]
    for train_index, test_index in kf.split(X):
      for test in test_index:#outer loop through test examples
        for tr in train_index:#first inner loop through hypotheses
          hyp = [None] * len(cod)
          for i in range(len(cod)):#creating hypothesis
            if cod[i] == 'i':
              hyp[i] = [min(X.iloc[test, i], X.iloc[tr, i]), max(X.iloc[test, i], X.iloc[tr, i])]
            elif cod[i] == 'c':
              hyp[i] = X.iloc[test, i] == X.iloc[tr, i]
          pred_int = [y[tr]]
          for htr in train_index:#second inner loop to check hypothesis
            for i in range(len(cod)):#checking on a single example
              if (cod[i] == 'i' and not(hyp[i][0] <= X.iloc[htr, i] <= hyp[i][1])) or (cod[i] == 'c' and hyp[i] == True and X.iloc[htr, i] != X.iloc[test, i]):
                break
              elif i == len(cod)-1 and htr != tr:
                pred_int.append(y[htr])
            if (max(pred_int)-min(pred_int)+1) > gap:
              break
            elif htr == train_index[-1] and len(pred_int) >= min_supp:
              prediction[test] = pred_int
          if prediction[test] != None:
            break
      right = 0
      for p in test_index:
        if prediction[p]!= None and min(prediction[p]) <= y[p] <= max(prediction[p]):
          right += 1
      acc.append(right/len(test_index))

  else:
    acc=[]
    for train_index, test_index in kf.split(X):
      for test in test_index:#outer loop through test examples
        for tr in train_index:#first inner loop through hypotheses
          hyp = [None] * len(cod)
          for i in range(len(cod)):#creating hypothesis
            if cod[i] == 'i':
              hyp[i] = [min(X.iloc[test, i], X.iloc[tr, i]), max(X.iloc[test, i], X.iloc[tr, i])]
            elif cod[i] == 'c':
              hyp[i] = X.iloc[test, i] == X.iloc[tr, i]
          pred_int = [y[tr]]
          for htr in train_index:#second inner loop to check hypothesis
            for i in range(len(cod)):#checking on a single example
              if (cod[i] == 'i' and not(hyp[i][0] <= X.iloc[htr, i] <= hyp[i][1])) or (cod[i] == 'c' and hyp[i] == True and X.iloc[htr, i] != X.iloc[test, i]):
                break
              elif i == len(cod)-1 and htr != tr:
                pred_int.append(y[htr])
            if pred_int[-1] != pred_int[0]:
              break
            elif htr == train_index[-1] and len(pred_int) >= min_supp:
              prediction[test] = pred_int
          if prediction[test] != None:
            break
      right = 0
      for p in test_index:
        if prediction[p]!= None and y[p] == prediction[p][0]:
          right += 1
      acc.append(right/len(test_index))

  unclass = 0
  for p in range(len(y)):
    if prediction[p] == None:
      unclass += 1
  unclass /= len(y)


  return prediction, acc, np.mean(acc), unclass
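
The function splits the data with `KFold(n_splits=cv, shuffle=False)`. As a small sketch of what that means for a 50-row input (the `rows` array is just a stand-in for the dataset), the folds are contiguous blocks of rows, and the first training index changes only for the first fold:

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(50)  # stand-in for a 50-row dataset
kf = KFold(n_splits=5, shuffle=False)

# Collect the (train, test) index arrays of every fold.
folds = [(train_idx, test_idx) for train_idx, test_idx in kf.split(rows)]

first_train, first_test = folds[0]    # test rows 0..9, training starts at row 10
second_train, second_test = folds[1]  # test rows 10..19, training starts at row 0
```

Because the function tries training objects in index order, the first hypothesis tested for a given fold always comes from the same training row.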

In [5]:
# Coding type for each feature ('i' for interval, 'c' for categorical)
football_feature_coding = ['i'] * len(football_selected_cols)  # Assuming all features as interval type for simplicity

# Applying LazyFCA
# Splitting features (X) and target (y)
X_football = football_fca_df[football_selected_cols]
y_football = football_fca_df['Pos_encoded']

# Applying the LazyFCA function to a subset of the football dataset
# We'll start with a small subset due to computational intensity
toy_x_football = X_football.iloc[:50]
toy_y_football = y_football.iloc[:50]

# Applying the LazyFCA function to the toy dataset
pred_football, acc_football, mean_acc_football, uc_football = LazyFCA(toy_x_football, toy_y_football, football_feature_coding, cv=5)
pred_football, acc_football, mean_acc_football, uc_football


([[10],
  [10],
  [10],
  [10],
  [10],
  [10],
  [10],
  [10],
  [10],
  [10],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0],
  [0]],
 [0.0, 0.1, 0.3, 0.7, 0.3],
 0.28,
 0.0)

The output of the LazyFCA run on the 50-row toy subset can be interpreted as follows:

1. **Predictions (`pred_football`)**: Only two encoded positions are ever predicted: `10` for the first 10 players and `0` for the remaining 40. Because the 5 folds are unshuffled blocks of 10 rows, every test object within a fold receives the same label. Each prediction list also contains a single element, meaning the accepted hypothesis was supported only by the training example that generated it (`min_supp=1`). This pattern suggests the classifier is mostly returning the label of the first training example it tries rather than capturing position-specific patterns.

2. **Accuracy (`acc_football`)**: The accuracy scores across the 5 folds range from 0.0 (0%) to 0.7 (70%). With only 10 test objects per fold, each correct prediction moves a fold's score by 0.1, so the scores are noisy and performance is not consistent across subsets of the data.

3. **Mean Accuracy (`mean_acc_football`)**: The mean accuracy across all folds is 0.28 (28%), which is below a satisfactory level for predictive modeling. This suggests that the selected features and the LazyFCA method might not be fully capturing the complexities of player positions based on the given data.

4. **Unclassified (`uc_football`)**: The value is 0.0, indicating that all instances were classified and none were left unclassified.
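
To read the encoded predictions as position names, the fitted encoder can map them back. The sketch below uses a hypothetical encoder fitted on made-up labels and a made-up `predictions` list; in the notebook the same idea would apply to `label_encoder_pos` and `pred_football`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical position labels; the notebook would reuse label_encoder_pos instead.
positions = ["DF", "MF", "FW", "GK", "MF", "DF"]
encoder = LabelEncoder().fit(positions)

# LazyFCA-style output: one list of supporting labels per object, or None.
predictions = [[0], [3], None, [1]]

# Take the first label of each prediction and skip unclassified objects.
encoded = np.array([p[0] for p in predictions if p is not None])
decoded = encoder.inverse_transform(encoded)
```

`LabelEncoder` numbers classes in sorted order, so the mapping can also be read directly from `encoder.classes_`.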

