 # Player Similarity Score 

## Getting Started

Import the necessary libraries and dependencies.

In [None]:
import pandas as pd  # Library that allows data manipulation and analysis
import numpy as np  # Library for high-level mathematical functions and support for multi-dimensional arrays
import matplotlib.pyplot as plt  # Library for plotting
import seaborn as sns  # Additional Library for plotting

# Import the libraries and dependencies required for the machine learning models and their training and evaluation.

from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder 
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    ExtraTreesClassifier,
)
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from xgboost import XGBClassifier

# To import metrics to measure performance
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)
from sklearn.metrics.pairwise import cosine_similarity

# Pandas display options
pd.set_option("display.max_rows", None)  # To better display the rows when there are too many
pd.set_option("display.max_columns", None)  # To better display the columns when there are too many

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

Reading the file

In [None]:
df = pd.read_csv("C:/Users/aleja/OneDrive/Desktop/Data Analytics Msc/Thesis/male_players.csv") #Loading the CSV file 

## Exploratory data analysis

### Checking general information about the dataset

In [None]:
df.info(verbose=True, show_counts=True)  # Summary of the available columns, the non-null count per column, and the data types

- This initial glimpse allows us to make quick inferences about the dataset being handled. We are dealing with 111 initial columns, which already suggests that we will need to perform data cleaning to get rid of unnecessary columns.

- The Non-Null Count also signals that we must perform data cleaning, as the count values are not the same for all columns.

- Lastly, the data type of each column tells us how we will have to handle that column when performing operations and when deciding how to treat the different types of variables.

In [None]:
# We run the following method to get an initial look at the data.
df.head()

This overview of the dataset allows us to understand our next steps for data preprocessing: 
1. Player positions contain multiple positions separated by commas. We reduce them to a single position so that comparisons can be made properly and combinations of positions are not treated as categories of their own. (This step was completed in Excel using the LEFT and FIND functions, for the sake of including various data analysis tools; hence the player_positions_2 column.)
2. Some of the scores assigned to each player for each position on the field (ls, st, rs, lw, etc.) include an operation, e.g. 90+3, which encodes potential under certain conditions. This causes the data type of these columns to be object, whereas they need to be integers.
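The first step was done in Excel, but the same reduction can be sketched in pandas. This is only an illustration on made-up values; the column names follow the dataset, and keeping the first listed position is an assumption that mirrors the LEFT and FIND approach.

```python
import pandas as pd

# Toy frame: the column name matches the dataset, the values are invented.
sample = pd.DataFrame({"player_positions": ["ST, CF", "CM", "LB, LWB, LM"]})

# Keep everything before the first comma (the pandas counterpart of LEFT + FIND).
sample["player_positions_2"] = (
    sample["player_positions"].str.split(",").str[0].str.strip()
)

print(sample)
```

Rows with a single position are left unchanged, because splitting a string without a comma returns that string as the only element.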

### Data Preprocessing

#### General Data Cleaning

We define a function that evaluates the operations found in some of the player position rating columns.

In [None]:
#To calculate the effective overall rating for the player position rankings
# Define a function to handle both addition and subtraction
def calculate_operation(x):
    x = str(x)  # Convert to string to avoid the error
    if '+' in x:
        return sum(map(int, x.split('+')))
    elif '-' in x:
        parts = x.split('-')
        return int(parts[0]) - int(parts[1])
    else:
        return int(x)

# Apply the function to multiple columns
columns_to_process = ['ls', 'st', 'rs','lw','lf','cf','rf','rw','lam','cam','ram','lm','lcm','cm','rcm','rm','lwb','ldm','cdm','rdm','rwb','lb','lcb','cb','rcb','rb','gk']          

# Apply the function to the specified columns and update the DataFrame
for col in columns_to_process:
    if col in df.columns:  # Check if the column exists in the DataFrame
        df[col] = df[col].apply(calculate_operation)
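If any of these columns contains missing values or has been read as floats, `int()` would raise on strings such as `'nan'` or `'70.0'`. The variant below is a minimal sketch of a more tolerant parser; `parse_rating` is a hypothetical name, and whether the dataset actually contains such values is an assumption.

```python
import math

def parse_rating(value):
    """Hypothetical NaN-tolerant variant of calculate_operation."""
    # Pass missing values through instead of failing on int('nan').
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return math.nan
    text = str(value).strip()
    if "+" in text:
        base, modifier = text.split("+", 1)
        return int(base) + int(modifier)
    if "-" in text:
        base, modifier = text.split("-", 1)
        return int(base) - int(modifier)
    # int(float(...)) also accepts values that were read as floats, e.g. '70.0'.
    return int(float(text))

print(parse_rating("90+3"), parse_rating("65-2"), parse_rating(70.0))
```

It can be applied in the same way as the original function, for example with `df[col].apply(parse_rating)`.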

In [None]:
#To drop the rows of players missing their market value 
df = df.dropna(subset=['value_eur'])

#To drop the rows of players missing information
df = df.dropna(subset=['club_position'])

We invert the values of ordinal variables such as league level so that the logic stays consistent across all numeric variables: higher is better.

In [None]:
df['league_level'] = 6 - df['league_level']

#### Handling Duplicates

By checking the dataset we see that there is a 'player_id' column that holds a unique identifier for each player. We use the following lines of code to check on duplicates. 

In [None]:
duplicates = df[df.duplicated(subset=['player_id'], keep=False)]  # Check whether any player_id appears more than once
duplicates = duplicates.sort_values(by=['player_id'])  # Sort so that duplicated rows, if any, are shown together
duplicates


#### Handling Missing Values

It is crucial to handle missing values, as they can skew our results and many ML algorithms do not work properly with them. We first analyze the missing values in each column.

In [None]:
null_counts = df.isnull().sum() #To count the sum of the null values in our dataset
null_counts

As we can see, some columns have a high percentage of missing values, upwards of 90% in some cases. We decided the best course of action is to drop those columns altogether, along with some columns that do not provide valuable information at first glance. The dropped columns include some of the financial variables, as we know that these do not influence player similarity.

In [None]:
# We drop columns that have a high null count or that do not provide much information
df = df.drop(columns=['nation_jersey_number','player_traits' , 'goalkeeping_speed',
                       'club_loaned_from', 'nation_team_id', 'nation_position',
                       'player_tags', 'fifa_update', 'nationality_id', 'real_face',
                       'league_id','club_team_id','club_joined_date','update_as_of',
                       'dob', 'player_positions', 'body_type','value_eur',
                       'wage_eur','release_clause_eur'
                       ]) 
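The columns with many missing values could also be identified programmatically rather than by reading the null counts. The sketch below uses a toy frame, and the 50% threshold is an assumption chosen for illustration, not the cut-off used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names echo the dataset.
toy = pd.DataFrame({
    "pace": [80.0, np.nan, 75.0, 90.0],
    "player_tags": [np.nan, np.nan, np.nan, "#Speedster"],
    "overall": [85, 70, 78, 91],
})

# Share of missing values per column (the mean of a boolean mask).
null_share = toy.isnull().mean()

threshold = 0.5  # assumed cut-off for this example
high_null_columns = null_share[null_share > threshold].index.tolist()

print(null_share)
print(high_null_columns)
```

The resulting list can be passed to `df.drop(columns=...)` together with any columns removed for domain reasons.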


Now that we have dealt with the null values that could be disposed of, we must decide how to handle the null values in the columns that will prove useful to us. We use histograms to check the distribution of each variable and choose the best imputation method for each column.

In [None]:
h = df.hist(figsize = (25,25))

Plotting the histograms gave us interesting insights into our data.
- The distributions for 'value_eur' and 'wage_eur' are strongly right-skewed (most values sit at the low end, with a long tail of very high values), which might affect our data. We will need to apply a transformation to correct this.
- The columns 'pace', 'shooting', 'passing', 'dribbling', 'mentality_composure' all present an approximately normal distribution, so their null values can be imputed using the mean.
- The columns 'defending', 'physic', 'release_clause_eur' present skewness in their distributions, so their null values are better imputed using the median.
- Columns with categorical variables that present null values, like 'league_level', can be imputed using the mode.

In [None]:
# Impute the columns that have an approximately normal distribution with the mean.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
df['pace'] = df['pace'].fillna(df['pace'].mean())
df['shooting'] = df['shooting'].fillna(df['shooting'].mean())
df['passing'] = df['passing'].fillna(df['passing'].mean())
df['dribbling'] = df['dribbling'].fillna(df['dribbling'].mean())
df['mentality_composure'] = df['mentality_composure'].fillna(df['mentality_composure'].mean())

# Impute the columns that present skewness with the median
df['defending'] = df['defending'].fillna(df['defending'].median())
df['physic'] = df['physic'].fillna(df['physic'].median())

# Impute the values of an ordinal variable with the mode
df['league_level'] = df['league_level'].fillna(df['league_level'].mode()[0])
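The choice between mean and median was made here by reading the histograms. As a rough complement, the decision could be driven by the sample skewness. The helper name `impute_by_skew` and the limit of 0.5 are assumptions for this sketch (a common rule of thumb, not a value taken from this analysis).

```python
import numpy as np
import pandas as pd

def impute_by_skew(series, skew_limit=0.5):
    """Fill NaNs with the mean if the column is roughly symmetric, else with the median."""
    use_mean = abs(series.skew()) <= skew_limit  # skew() ignores NaNs by default
    fill_value = series.mean() if use_mean else series.median()
    return series.fillna(fill_value), ("mean" if use_mean else "median")

# Invented values: one symmetric column and one with a long right tail.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 50.0, np.nan])

filled_sym, method_sym = impute_by_skew(symmetric)
filled_skw, method_skw = impute_by_skew(skewed)
print(method_sym, filled_sym.iloc[-1])
print(method_skw, filled_skw.iloc[-1])
```

The median is preferred for skewed columns because a few extreme values pull the mean away from the typical player.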


We display the null counts again to verify that none are left. ('player_face_url' still has null values, but that column is not used in the analysis and is only needed for the construction of the Tableau dashboard.)

In [None]:
null_counts = df.isnull().sum() #To count the sum of the null values in our dataset
null_counts

#### Handling Outliers

Looking at the histograms for each variable, we realized that we would need to apply data transformation techniques to reduce the imbalance in our dataset. The histograms already hint at potential outliers; however, we use box plots to confirm this and take a much more precise look.

In [None]:
b = df.plot(kind='box', figsize=(25, 25), subplots=True, layout=(10, 9))

There are multiple conclusions that we can draw from examining the box plots:
- Some of the variables present many outliers, but in some cases this makes sense. For example, height can present outliers on both ends of the box, as goalkeepers are usually tall and some regions of the world have shorter players, without this necessarily determining player value. This teaches us to focus our outlier removal efforts on variables where outliers do not make sense or could affect our results.
- Other variables, such as the goalkeeping scores, present many outliers because outfield players logically have low scores in this regard.
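The points drawn outside the whiskers of a box plot follow the usual 1.5 × IQR rule, so the number of flagged values per column can also be counted directly. The sketch below uses invented values, and `count_iqr_outliers` is a hypothetical helper.

```python
import pandas as pd

def count_iqr_outliers(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Invented values: one very tall player and one goalkeeper among outfield players.
toy = pd.DataFrame({
    "height_cm": [170, 175, 180, 182, 185, 210],
    "goalkeeping_diving": [10, 11, 12, 12, 13, 85],
})

counts = count_iqr_outliers(toy)
print(counts)
```

A high count on a goalkeeping column mostly reflects the split between goalkeepers and outfield players, which is why such values are not treated as errors here.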

#### Feature Scaling and Encoding

It is advisable to put most of the numeric variables on similar scales so that they are more easily comparable. To achieve this we utilize a Robust Scaler. 

In [None]:
#We apply Robust Scaler to all of the numerical variables in our dataset

columns_to_scale = ['overall','potential','age',
                    'height_cm','weight_kg','pace',
                    'shooting','passing','dribbling','defending','physic',
                    'attacking_crossing','attacking_finishing','attacking_heading_accuracy',
                    'attacking_short_passing','attacking_volleys','skill_dribbling','skill_curve',
                    'skill_fk_accuracy','skill_long_passing','skill_ball_control','movement_acceleration',
                    'movement_sprint_speed','movement_agility','movement_reactions','movement_balance',
                    'power_shot_power','power_jumping','power_stamina','power_strength','power_long_shots',
                    'mentality_aggression','mentality_interceptions','mentality_positioning',
                    'mentality_vision','mentality_penalties','mentality_composure','defending_marking_awareness',
                    'defending_standing_tackle','defending_sliding_tackle','goalkeeping_diving',
                    'goalkeeping_handling','goalkeeping_kicking','goalkeeping_positioning',
                    'goalkeeping_reflexes','ls', 'st', 'rs','lw','lf','cf','rf','rw','lam',
                    'cam','ram','lm','lcm','cm','rcm','rm','lwb','ldm','cdm','rdm','rwb','lb',
                    'lcb','cb','rcb','rb','gk']

scaler = RobustScaler()
scaled_data = scaler.fit_transform(df[columns_to_scale])
df[columns_to_scale] = pd.DataFrame(scaled_data, index=df.index, columns=columns_to_scale)
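With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range (25th to 75th percentile), which makes it less sensitive to outliers than mean and standard deviation scaling. The same arithmetic can be reproduced by hand on a toy column; the values below are invented.

```python
import numpy as np

# One invented column with a single extreme value.
values = np.array([60.0, 65.0, 70.0, 75.0, 99.0])

median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])

# Robust scaling: center on the median, scale by the interquartile range.
scaled = (values - median) / (q3 - q1)

print(median, q1, q3)
print(scaled)
```

The extreme value stays extreme after scaling, but it does not distort the scale applied to the rest of the column.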


Categorical variables have to be encoded so that our ML models can process them.

In [None]:
# We apply a one-hot encoder to the categorical variables that are not ordinal
columns_to_encode = ['player_positions_2','preferred_foot','work_rate','body_type_2']
encoder = OneHotEncoder(sparse_output=False, drop=None) 
encoded_data = encoder.fit_transform(df[columns_to_encode])
encoded_columns = encoder.get_feature_names_out(columns_to_encode)
encoded_df = pd.DataFrame(encoded_data, columns=encoded_columns, index=df.index)
df = pd.concat([df.drop(columns=columns_to_encode), encoded_df], axis=1)

#### Dimensionality Reduction

In [None]:
# We define a list of numerical features so that the ML algorithms do not have to process non-numeric columns.
features = ['overall', 'potential','age', 'height_cm', 'weight_kg', 'club_jersey_number',
    'pace', 'shooting', 'passing', 'dribbling', 'defending', 
    'physic', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 
    'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve', 
    'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 
    'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance', 
    'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots', 
    'mentality_aggression', 'mentality_interceptions', 'mentality_positioning', 'mentality_vision', 
    'mentality_penalties', 'mentality_composure', 'defending_marking_awareness', 
    'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving', 
    'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 
    'goalkeeping_reflexes', 'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 
    'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 
    'cb', 'rcb', 'rb', 'gk','league_level','international_reputation'
]

We can try different approaches. The first uses recursive feature elimination (RFE) with logistic regression.

In [None]:
# Perform RFE with Logistic Regression
X = df[features]
y = df['club_position']
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=5)
rfe.fit(X, y)

# Output selected features
selected_features = [features[i] for i in range(len(features)) if rfe.support_[i]]
print("Top features selected:", selected_features)

The second approach uses an Extra Trees classifier to rank and plot the most important features.

In [None]:
X = df[features]
y = df['club_position']

# Train an Extra Trees Classifier
et_model = ExtraTreesClassifier(random_state=42)
et_model.fit(X, y)

# Get feature importances
feature_importances = et_model.feature_importances_

# Create a DataFrame to store feature importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
})

# Sort the features by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Display the sorted feature importances
print(importance_df)

# Plot the feature importances (matplotlib.pyplot is already imported as plt at the top of the notebook)

plt.figure(figsize=(15, 15))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.gca().invert_yaxis()  # Invert y-axis to display the highest importance on top
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.title('Feature Importances - Extra Trees Classifier')
plt.show()

We define the final variables for our model based on the results of these models, in conjunction with the recommendations from professional football coaches who took part in complementary research.

In [None]:
final_features = [
    'overall', 'potential','club_jersey_number','pace','shooting','passing','dribbling',
    'defending','physic','attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 
    'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve', 
    'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 
    'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance', 
    'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots', 
    'mentality_aggression', 'mentality_interceptions', 'mentality_positioning', 'mentality_vision', 
    'mentality_penalties', 'mentality_composure', 'defending_marking_awareness', 
    'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving', 
    'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 
    'goalkeeping_reflexes', 'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 
    'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 
    'cb', 'rcb', 'rb', 'gk','preferred_foot_Right', 'preferred_foot_Left',
]


## ML Algorithms

We begin by restricting the dataset to the 2024 edition (fifa_version 24), since we want to compare players with each other rather than with versions of themselves from previous years.

In [None]:
df = df[df['fifa_version'] == 24]

We divide the train and test sets.

In [None]:
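# Extract numeric features and player IDs
player_features = df[final_features]
player_ids = df['player_id']

# Note: df was filtered to fifa_version == 24 in the previous cell, so the
# version-23 selections below are empty. The similarity models that follow
# are fitted on df[final_features] directly and do not use these variables.
train_data = df[df['fifa_version'] == 23][final_features]  # Data from 2023
test_data = df[df['fifa_version'] == 24][final_features]   # Data from 2024
train_player_ids = df[df['fifa_version'] == 23]['player_id']
test_player_ids = df[df['fifa_version'] == 24]['player_id']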


### K Nearest Neighbors

In [None]:
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

# Create a sparse matrix from df values
matrix = csr_matrix(df[final_features].values)

# Initialize Nearest Neighbors with the cosine distance metric and the brute-force algorithm
knn_search = NearestNeighbors(metric='cosine', algorithm='brute')
knn_search.fit(matrix)

# Create a dictionary to store recommendations for each player
rec_dict = {}

for player_idx, player_row in enumerate(matrix):
    # Request 4 neighbors because the nearest one is the player itself
    distances, indices = knn_search.kneighbors(player_row.reshape(1, -1), n_neighbors=4)
    
    player_id = df.iloc[player_idx]['player_id']  # Get the player_id for the current player
    recommendations = []  # Store recommendations for the current player

    for elem in range(1, len(distances.flatten())):  # Skip the player itself
        rec_player_idx = indices.flatten()[elem]
        rec_player_id = df.iloc[rec_player_idx]['player_id']  # Get the recommended player's player_id
        rec_player_name = df.iloc[rec_player_idx]['long_name']  # Get the recommended player's name
        similarity_score = 1 - distances.flatten()[elem]  # Cosine similarity = 1 - cosine distance
        recommendations.append((rec_player_id, rec_player_name, similarity_score))
    
    rec_dict[player_id] = recommendations

# Create lists for new columns
rec_players_1 = []  # First recommended player_id
rec_players_2 = []  # Second recommended player_id
rec_players_3 = []  # Third recommended player_id
similarity_scores_1 = []  # Similarity score for first recommended player
similarity_scores_2 = []  # Similarity score for second recommended player
similarity_scores_3 = []  # Similarity score for third recommended player
rec_player_names_1 = []  # First recommended player name
rec_player_names_2 = []  # Second recommended player name
rec_player_names_3 = []  # Third recommended player name

# Populate new columns with recommended players, their names, and similarity scores
for player_id in df['player_id']:
    recommendations = rec_dict.get(player_id, [])
    recommendations.sort(key=lambda x: x[2], reverse=True)  # Sort by similarity score

    if recommendations:
        rec_players_1.append(recommendations[0][0])  # Get the first recommended player_id
        rec_player_names_1.append(recommendations[0][1])  # Get the first recommended player's name
        similarity_scores_1.append(recommendations[0][2])  # Similarity score for the first player

        if len(recommendations) > 1:
            rec_players_2.append(recommendations[1][0])  # Second recommended player_id
            rec_player_names_2.append(recommendations[1][1])  # Second recommended player's name
            similarity_scores_2.append(recommendations[1][2])  # Similarity score for the second player
        else:
            # Keep the lists aligned with df when there is no second recommendation
            rec_players_2.append(None)
            rec_player_names_2.append(None)
            similarity_scores_2.append(None)

        if len(recommendations) > 2:
            rec_players_3.append(recommendations[2][0])  # Third recommended player_id
            rec_player_names_3.append(recommendations[2][1])  # Third recommended player's name
            similarity_scores_3.append(recommendations[2][2])  # Similarity score for the third player
        else:
            rec_players_3.append(None)
            rec_player_names_3.append(None)
            similarity_scores_3.append(None)
    else:
        rec_players_1.append(None)
        rec_player_names_1.append(None)
        similarity_scores_1.append(None)
        rec_players_2.append(None)
        rec_player_names_2.append(None)
        similarity_scores_2.append(None)
        rec_players_3.append(None)
        rec_player_names_3.append(None)
        similarity_scores_3.append(None)
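The nine parallel lists above can also be produced in a single pass that pads missing recommendations, which keeps every column the same length by construction. This is a sketch on invented data; `recommendations_to_frame` is a hypothetical helper, and the output column names follow the ones used later in the notebook.

```python
import pandas as pd

def recommendations_to_frame(player_ids, rec_dict, top_n=3):
    """Build one row per player with top_n (id, name, score) recommendation columns."""
    rows = []
    for pid in player_ids:
        # Sort by similarity score, keep the best top_n, and pad with empty slots.
        recs = sorted(rec_dict.get(pid, []), key=lambda r: r[2], reverse=True)[:top_n]
        recs = recs + [(None, None, None)] * (top_n - len(recs))
        row = {"player_id": pid}
        for rank, (rec_id, rec_name, score) in enumerate(recs, start=1):
            row[f"RecPlayer_{rank}"] = rec_id
            row[f"RecPlayerName_{rank}"] = rec_name
            row[f"SimilarityScore_{rank}"] = score
        rows.append(row)
    return pd.DataFrame(rows)

# Invented recommendations in the same (id, name, score) format as rec_dict.
toy_recs = {1: [(2, "B", 0.8), (3, "C", 0.9)], 2: [(1, "A", 0.8)]}
frame = recommendations_to_frame([1, 2, 3], toy_recs)
print(frame)
```

Players without any entry in the dictionary simply receive empty values in all nine columns.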


### SVD

In [None]:
from sklearn.decomposition import TruncatedSVD
from scipy.spatial.distance import cosine

# Apply SVD to reduce dimensions while maintaining structure
svd = TruncatedSVD(n_components=20, random_state=42)  # Reduce to 20 latent features
reduced_matrix = svd.fit_transform(df[final_features].values)

# Create a dictionary to store recommendations for each player
rec_dict = {}

for player_idx, player_vec in enumerate(reduced_matrix):
    # Get the current player's ID
    current_player_id = df.iloc[player_idx]['player_id']
    similarities = []
    
    for other_idx, other_vec in enumerate(reduced_matrix):
        if player_idx != other_idx:
            # Cosine similarity = 1 - cosine distance
            similarity_score = (1 - cosine(player_vec, other_vec))
            
            # Get recommended player's details
            rec_player_id = df.iloc[other_idx]['player_id']
            rec_player_name = df.iloc[other_idx]['long_name']
            
            similarities.append((rec_player_id, rec_player_name, similarity_score))
    
    # Sort by similarity score descending and keep top 3
    similarities.sort(key=lambda x: x[2], reverse=True)
    rec_dict[current_player_id] = similarities[:3]

# Create lists for new columns (same structure as KNN output)
rec_players_1 = []
rec_players_2 = []
rec_players_3 = []
similarity_scores_1 = []
similarity_scores_2 = []
similarity_scores_3 = []
rec_player_names_1 = []
rec_player_names_2 = []
rec_player_names_3 = []

# Populate new columns with recommendations
for player_id in df['player_id']:
    recommendations = rec_dict.get(player_id, [])
    
    # Initialize all values as None first
    r1, r2, r3 = (None, None, None), (None, None, None), (None, None, None)
    
    if len(recommendations) >= 1:
        r1 = (recommendations[0][0], recommendations[0][1], recommendations[0][2])
    if len(recommendations) >= 2:
        r2 = (recommendations[1][0], recommendations[1][1], recommendations[1][2])
    if len(recommendations) >= 3:
        r3 = (recommendations[2][0], recommendations[2][1], recommendations[2][2])
    
    # Append to lists
    rec_players_1.append(r1[0])
    rec_player_names_1.append(r1[1])
    similarity_scores_1.append(r1[2])
    
    rec_players_2.append(r2[0])
    rec_player_names_2.append(r2[1])
    similarity_scores_2.append(r2[2])
    
    rec_players_3.append(r3[0])
    rec_player_names_3.append(r3[1])
    similarity_scores_3.append(r3[2])
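The nested Python loop above evaluates one cosine distance per pair of players, which grows quadratically with the number of players and can be slow. A vectorized alternative is sketched below with NumPy only: rows are normalized to unit length so that a matrix product yields cosine similarities, and the work is done in chunks to limit memory. The function name `top_k_cosine` and the chunk size are assumptions for this sketch.

```python
import numpy as np

def top_k_cosine(vectors, k=3, chunk_size=1024):
    """Return (indices, scores) of the k most cosine-similar other rows for every row."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
    n = unit.shape[0]
    k = min(k, n - 1)
    top_idx = np.empty((n, k), dtype=int)
    top_sim = np.empty((n, k), dtype=float)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = unit[start:stop] @ unit.T  # (chunk, n) cosine similarities
        # Exclude each player from its own recommendations.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        order = np.argsort(-sims, axis=1)[:, :k]
        top_idx[start:stop] = order
        top_sim[start:stop] = np.take_along_axis(sims, order, axis=1)
    return top_idx, top_sim

# Invented 2-D vectors: rows 0 and 1 point the same way, as do rows 2 and 3.
toy = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]])
idx, sim = top_k_cosine(toy, k=1)
print(idx.ravel(), sim.ravel())
```

Applied to `reduced_matrix`, the returned row indices can be mapped back to `player_id` and `long_name` with `df.iloc`, in the same way as in the loop above.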

### Measuring Model performance

Measuring the performance of the models in this instance is more complicated than it was for the market value predictor. There are various ways to assess the quality of the recommendations. Since both models already compute a cosine similarity for each recommendation, we use the average of those cosine similarities across all players as the performance measure.
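For reference, the quantities involved can be written out as follows. The notation is introduced here for illustration: $a$ and $b$ are the feature vectors of two players, $s_{i,k}$ is the similarity score of the $k$-th recommendation for player $i$, and $N$ is the number of players, assuming each player has all three recommendations.

$$
\operatorname{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}, \qquad d_{\cos}(a, b) = 1 - \operatorname{sim}(a, b)
$$

$$
\bar{s} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{3} \sum_{k=1}^{3} s_{i,k}
$$

A value of $\bar{s}$ close to 1 means that, on average, the recommended players point in almost the same direction in feature space as the player they are recommended for.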

In [None]:
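def calculate_average_similarity(scores_df):
    similarity_columns = ['SimilarityScore_1', 'SimilarityScore_2', 'SimilarityScore_3']

    # Coerce the scores to numeric (None becomes NaN) and average per player, ignoring missing scores
    numeric_scores = scores_df[similarity_columns].apply(pd.to_numeric, errors='coerce')
    scores_df['AvgSimilarity'] = numeric_scores.mean(axis=1)

    return scores_df[['player_id', 'AvgSimilarity']]

# Collect the scores of the model executed last; df itself has no SimilarityScore_* columns at this point
scores_df = pd.DataFrame({
    'player_id': df['player_id'].values,
    'SimilarityScore_1': similarity_scores_1,
    'SimilarityScore_2': similarity_scores_2,
    'SimilarityScore_3': similarity_scores_3,
})

# Apply function to compute average similarity
df_similarity = calculate_average_similarity(scores_df)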

# Compute the overall average similarity across all players
overall_avg_similarity = df_similarity['AvgSimilarity'].mean()

print(f"Overall Average Similarity: {overall_avg_similarity:.4f}")

## Output Generation

Now that we know which model fits our data better, we can add the recommended players, their similarity scores, and their IDs to the dataframe so that it can be used in Tableau. To keep the process simple, we reload the original CSV file and append the new columns to it.

In [None]:
df2 = pd.read_csv("C:/Users/aleja/OneDrive/Desktop/Data Analytics Msc/Thesis/male_players.csv") #Loading the CSV file 
df2 = df2[df2['fifa_version'] == 24]

We repeat the same data preprocessing steps as before, since the dataframe must contain the same rows in the same order as the one used for the models; the recommendation lists are appended by position.

In [None]:
#To calculate the effective overall rating for the player position rankings
# Define a function to handle both addition and subtraction
def calculate_operation(x):
    x = str(x)  # Convert to string to avoid the error
    if '+' in x:
        return sum(map(int, x.split('+')))
    elif '-' in x:
        parts = x.split('-')
        return int(parts[0]) - int(parts[1])
    else:
        return int(x)

# Apply the function to multiple columns
columns_to_process = ['ls', 'st', 'rs','lw','lf','cf','rf','rw','lam','cam','ram','lm','lcm','cm','rcm','rm','lwb','ldm','cdm','rdm','rwb','lb','lcb','cb','rcb','rb','gk']          

# Apply the function to the specified columns and update the DataFrame
for col in columns_to_process:
    if col in df2.columns:  # Check if the column exists in the DataFrame
        df2[col] = df2[col].apply(calculate_operation)

In [None]:
#To drop the rows of players missing their market value 
df2 = df2.dropna(subset=['value_eur'])

#To drop the rows of players missing information
df2 = df2.dropna(subset=['club_position'])

In [None]:
# Impute the columns that have an approximately normal distribution with the mean.
# Assigning the result back avoids chained in-place fillna on a filtered dataframe.
df2['pace'] = df2['pace'].fillna(df2['pace'].mean())
df2['shooting'] = df2['shooting'].fillna(df2['shooting'].mean())
df2['passing'] = df2['passing'].fillna(df2['passing'].mean())
df2['dribbling'] = df2['dribbling'].fillna(df2['dribbling'].mean())
df2['mentality_composure'] = df2['mentality_composure'].fillna(df2['mentality_composure'].mean())

# Impute the columns that present skewness with the median
df2['defending'] = df2['defending'].fillna(df2['defending'].median())
df2['physic'] = df2['physic'].fillna(df2['physic'].median())
df2['release_clause_eur'] = df2['release_clause_eur'].fillna(df2['release_clause_eur'].median())

# Impute the values of an ordinal variable with the mode
df2['league_level'] = df2['league_level'].fillna(df2['league_level'].mode()[0])

We now append the results of the model executed last to the dataframe

In [None]:

# Add the new columns to df2 (the lists must have one entry per row, in the same row order)
df2['RecPlayer_1'] = rec_players_1
df2['RecPlayerName_1'] = rec_player_names_1
df2['SimilarityScore_1'] = similarity_scores_1

df2['RecPlayer_2'] = rec_players_2
df2['RecPlayerName_2'] = rec_player_names_2
df2['SimilarityScore_2'] = similarity_scores_2

df2['RecPlayer_3'] = rec_players_3
df2['RecPlayerName_3'] = rec_player_names_3
df2['SimilarityScore_3'] = similarity_scores_3
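Assigning lists by position only works while both dataframes keep the same rows in the same order. A more defensive option is to join on `player_id`, which is a sketch of an alternative and not what the cell above does. The frames below are invented, and only the first recommendation columns are shown.

```python
import pandas as pd

# Invented player table and recommendation table, deliberately in different row orders.
players = pd.DataFrame({"player_id": [10, 20, 30], "long_name": ["A", "B", "C"]})
recs = pd.DataFrame({
    "player_id": [30, 10, 20],
    "RecPlayer_1": [10, 20, 10],
    "SimilarityScore_1": [0.91, 0.97, 0.97],
})

# A left join keeps every player; validate raises if a player_id is duplicated on either side.
merged = players.merge(recs, on="player_id", how="left", validate="one_to_one")
print(merged)
```

With this approach, a change in the filtering or sorting of either dataframe cannot silently attach recommendations to the wrong player.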


In [None]:
# Export to CSV
df2.to_csv('player_recommendations_with_ids.csv', index=False)