# FM23 Similarity Score Testing
## **Goal**
> This analysis uses cosine similarity scoring based on Football Manager 23 attributes to 'predict' who a young player may emulate at full potential. Football Manager data seems to pass the smell test fot the most part in terms of it's accruacy relative to what's readily and publicly available. ([See this in action on my blog!](https://futek.io/))


## Data
>  I previously scraped player data from Football Manager 23-- **this includes over ~100 attributes for ~450 thousand players from around world:**
>>  Key attributes include:
>>>  Current & Potential ability

>>> Height, Preferred Foot, Position



>>> Attacking Attributes: 
>>>>>`['flair','off_the_ball', 'vision', 'crossing', 'dribbling', 'finishing', 'free_kicks', 'frist_touch', 'long_shots', 'passing']`

>>>  Defensive Attributes: 

>>>>>`['anticipation','positioning', 'acceleration', 'jumping_reach', 'pace', 'stamina', 'strength', 'heading', 'marking', 'tackling']`

## Methodology for Player Potential
> Leverage cosine similarities of the aforementioned attributes in addition to weighting methods for other attributes such a height, preffered foot, potential ability to 'predict' who a young player may emulate at full 
potential.

### Example

1) Take a given young player:

> Player A: 
>> Potential: `180`

>> Position: `AML`

>> Height: `5'10"`

>> Preferred Foot: `Right`

>> Sorted Attributes: `{"pace": 12, "dribbling": 11, "acceleration": "10", "passing": 10}`



2) Scan the database for players whof fit the following criteria in relation to Player A:
> Current Ability == += X (10,20, etc.) from Player A's Potential ability

> Same/similar Position


3) For all potential matches in terms of position and current ability, calculate a similarity score:
> Stack rank their attributes in relation to Player A's to calculate their cosine similarity:
>> Ex:

>>> Player A: `{"pace": 12, "dribbling": 11, "acceleration": "10", "passing": 10 etc...}`

>>> Player B: `{"acceleration": 12, "pace": 11, "dribbling": "10", "flair": 10 etc...}` 

>> Apply weights to other factors: `weights = {'c_weight': 1, 'c_height': 7, 'c_pref_foot': 7, 'c_ability': 35, 'c_comb': 50}`
>>> `c_ability`: `35%`
>>>> Similarity in Player A's Potential Ability to Player B's Current Ability

>>> `c_comb`: `35%`
>>>> Cosine similarity based on the aforementioned factors

>>> `c_height`: `7%`
>>>> Height similarity

>>> `c_pref_foot`: `7%`
>>>> Preferred foot similarity

>>> `c_weight`: `1%`
>>>> Weight similarity

4) Calculate the 'potential fit' by league
> Take all players 




4) Combine to calculate an overall similarity score for all potential matches:
> Calculate the Top 20 leagues based on Current Ability for players between the ages of 25-29 (in their "prime").
>> Calculate the percentile for which Player A's potential ability falls in per league

5) Output for a given player:
<img src="https://i0.wp.com/futek.io/wp-content/uploads/2022/12/Nicolas-Jackson__cover-1.png?w=964&ssl=1" style="height: 1000px;"/>
<img src="https://i0.wp.com/futek.io/wp-content/uploads/2022/12/Nicolas-Jason__similarity.png?w=1494&ssl=1" style="height: 1000px;"/>
<img src="https://i0.wp.com/futek.io/wp-content/uploads/2022/12/Nicolas-Jackson__strweak.png?w=1502&ssl=1" style="height: 1000px;"/>

#### Load Imports

In [1]:
import os

from fuzzywuzzy import fuzz

from scipy import spatial
from scipy.stats import percentileofscore

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

from datetime import datetime as dt

from bs4 import BeautifulSoup
import requests

import pyautogui as pygui

import time
from datetime import datetime as dt

from ipypb import ipb

tqdm_notebook = ipb
# or if you run it in interactive shell
tqdm = ipb

## 1) Read Data

In [2]:
players_df = pd.read_csv('players_data_scrubbed.csv')
mapped_cols = pd.read_csv('../FMScraper/FUTEK Unmapped Cols - unmapped_cols.csv')

In [3]:
print('\n', "Num. Players:", len(players_df), "Num. Attributes:", len(players_df.columns), '\n')
players_df.head()


 Num. Players: 458142 Num. Attributes: 101 



Unnamed: 0,anticipation,positioning,acceleration,jumping_reach,pace,stamina,strength,heading,marking,tackling,...,versatility,agility,balance,natural_fitness,corners,long_throws,penalty_taking,technique,height_string,preferred_foot_rating
0,11.0,8.0,14.0,4.0,14.0,6.0,7.0,7.0,5.0,10.0,...,17.0,14.0,8.0,13.0,5.0,5.0,5.0,10.0,"5'8""",2
1,10.0,9.0,11.0,7.0,12.0,11.0,8.0,9.0,7.0,9.0,...,15.0,14.0,10.0,14.0,9.0,9.0,1.0,15.0,"5'8""",-2
2,10.0,10.0,5.0,8.0,5.0,8.0,8.0,7.0,10.0,11.0,...,9.0,8.0,7.0,12.0,10.0,7.0,9.0,11.0,"5'11""",2
3,11.0,8.0,9.0,10.0,6.0,10.0,12.0,9.0,3.0,2.0,...,8.0,10.0,6.0,10.0,4.0,2.0,3.0,9.0,"6'2""",3
4,7.0,13.0,13.0,14.0,13.0,6.0,11.0,11.0,6.0,12.0,...,9.0,13.0,6.0,14.0,2.0,1.0,1.0,3.0,"6'3""",3


## 2) Select Candidates

In [4]:
# Cols
general_info_cols = ['uid','player_name','age','nationality','position','club_name','division','current_ability','potential_ability','world_reputation']

# Filters
p_age_max = 20
p_potential_min = 130
max_distance = 20
p_nation = 'NGA'
include_gk = False

p_age_filter = (players_df.age <= p_age_max)
p_potential_filter = (players_df.potential_ability >= p_potential_min)
p_nation_filter = (players_df.nationality == p_nation)
p_gk_filter = (players_df.position.str.contains('GK') == include_gk)

players_df[general_info_cols][p_age_filter&p_potential_filter&p_nation_filter&p_gk_filter].sort_values('potential_ability', ascending= False).head(20)

Unnamed: 0,uid,player_name,age,nationality,position,club_name,division,current_ability,potential_ability,world_reputation
302404,2000136000.0,Victor Eletu,16.0,NGA,"DM, M (C)",Milan,Italian Serie A,80.0,168.0,761.0
211818,13224510.0,Olakunle Olusegun,19.0,NGA,"AM (RL), ST (C)",Krasnodar,Russian Premier League,123.0,155.0,2775.0
232878,27162490.0,Raphael Onyedika,20.0,NGA,"DM, M (C)",Club Brugge,Jupiler Pro League,126.0,150.0,2634.0
279681,2000144000.0,Nduka Junior,20.0,NGA,D (C),Remo Stars,Nigerian Premier League,111.0,147.0,2750.0
218943,2000150000.0,Ebenezer Akinsanmiro,17.0,NGA,"AM (C), ST (C)",Remo Stars,Nigerian Premier League,87.0,141.0,1700.0
252529,2000156000.0,Adams Olamilekan,17.0,NGA,AM (RLC),Remo Stars,Nigerian Premier League,88.0,137.0,2200.0
62044,13223860.0,Kunle Adeleke,19.0,NGA,D (C),Enyimba Aba,Nigerian Premier League,99.0,135.0,3300.0
161651,2000221000.0,Victor Orekoya,19.0,NGA,M (C),Merani,Georgian Regional League Group A,46.0,135.0,51.0
202446,2000149000.0,Jonathan Okoronkwo,18.0,NGA,"AM (RL), ST (C)",Krasnodar,Russian Premier League,92.0,131.0,1188.0
211815,13224510.0,Akinkunmi Amoo,19.0,NGA,M/AM (R),FC København,3F Superliga,109.0,130.0,3250.0


### 2.1) Select Relevant Player IDs

In [5]:
players_to_analyze = [302404, 211818, 212476, 232878, 279681]
p_id = 269859
p_df = players_df.loc[p_id]

### 2.2) Set Filters and Generate Comp DF

In [6]:
# Filters
rel_pa = p_df['potential_ability'] - 5

rel_pa_filter = (rel_pa - players_df.current_ability <= max_distance)&(rel_pa - players_df.current_ability >= 0)


attr_cols = mapped_cols[mapped_cols['class'].isin(['gk_attr', 'mental_attr', 'personal_attr','phys_attr', 'techincal_attr'])].mapped.to_list()
comp_df = players_df[rel_pa_filter][['uid','player_name','age','club_name','potential_ability','current_ability','height','weight','position','preferred_foot_rating']+attr_cols]
print("Potential Matches:", len(comp_df))


Potential Matches: 8731


## 3) Calculate Comparisons

### 3.1) Height, Weight, Pref. Foot, Position

In [7]:
p_weight = p_df['weight']
p_height = p_df['height']
p_pref_foot = p_df['preferred_foot_rating']
p_position = p_df['position']
p_potential = p_df['potential_ability'] - 5

In [8]:
comp_weight__l = []
comp_height__l = []
comp_pref_foot__l = []
comp_ability__l = []

for ix, c_df in comp_df.iterrows():
    
    # Height, Weight, Pref. Foot, Position
    
    c_weight = c_df['weight']
    c_height = c_df['height']
    c_pref_foot = c_df['preferred_foot_rating']
    c_position = c_df['position']        
    c_ability = c_df['current_ability']        
    
    comp_weight__l.append(1-((30 if abs(p_weight - c_weight) > 30 else abs(p_weight - c_weight))/30))
    comp_height__l.append(1-((12 if abs(p_height - c_height) > 12 else abs(p_height - c_height))/12))
    comp_pref_foot__l.append((6-abs(p_pref_foot - c_pref_foot))/6)
    comp_ability__l.append(1 - (abs(p_potential - c_ability) / 20))
    
comp_df['c_weight'] = comp_weight__l        
comp_df['c_height'] =  comp_height__l       
comp_df['c_pref_foot'] = comp_pref_foot__l      
comp_df['c_ability'] = comp_ability__l       


# Position similarity filter
comp_df_2 = comp_df[comp_df.position == p_position]


# Custom position filters
filter1 = (comp_df.position.str.contains("AM (C)", regex=False))
filter2 = (comp_df.position.str.contains("AM (RL)", regex=False))
filter3 = (comp_df.position.str.contains("AM (LC)", regex=False))
filter4 = (comp_df.position.str.contains("AM (RLC)", regex=False))
filter5 = (comp_df.position.str.contains("AM (L)", regex=False))
filter6 = (comp_df.position.str.contains("AM (R)", regex=False))
filter7 = (comp_df.position.str.contains("ST (C)", regex=False))
filter8 = (comp_df.position.str.contains("M (C)", regex=False))
filter9 = (comp_df.position == 'M (C)')
filter10 = (comp_df.position == 'ST (C)')

filter_str = 'position.str.contains("AM (C)", regex=False) or position.str.contains("AM (RL)", regex=False)'
df_filter = comp_df.query(filter_str)

comp_df_2 = comp_df[filter1|filter2]
len(comp_df_2), len(df_filter)

(1509, 1509)

### 3.2) Attribute Comparisons

In [9]:
attack_cols = mapped_cols[mapped_cols.comp_class == 'attack_comp'].mapped.to_list()
defense_cols = mapped_cols[mapped_cols.comp_class == 'defense_comp'].mapped.to_list()

In [10]:
# Defense Attributes
p_df_defense = pd.DataFrame(p_df[defense_cols].sort_values(ascending=False)).reset_index()
p_df_defense['rank'] = p_df_defense.index
p_df_defense.columns = ['attr','val','rank']

# Attack Attributes
p_df_attack = pd.DataFrame(p_df[attack_cols].sort_values(ascending=False)).reset_index()
p_df_attack['rank'] = p_df_attack.index
p_df_attack.columns = ['attr','val','rank']

# Combined Attributes
p_df_combined = pd.DataFrame(p_df[attack_cols+defense_cols].sort_values(ascending=False)).reset_index()
p_df_combined['rank'] = p_df_combined.index
p_df_combined.columns = ['attr','val','rank']


In [11]:
comp_defense__l = []
comp_attack__l = []
comp_combined__l = []

for ix, c_df in comp_df_2.iterrows():
    # Defense Attributes
    c_df_defense = pd.DataFrame(c_df[defense_cols].sort_values(ascending=False)).reset_index()
    c_df_defense['rank'] = c_df_defense.index
    c_df_defense.columns = ['attr','c_val','c_rank']

    # Attack Attributes
    c_df_attack = pd.DataFrame(c_df[attack_cols].sort_values(ascending=False)).reset_index()
    c_df_attack['rank'] = c_df_attack.index
    c_df_attack.columns = ['attr','c_val','c_rank']
    
    # Combined Attributes
    c_df_combined = pd.DataFrame(c_df[attack_cols+defense_cols].sort_values(ascending=False)).reset_index()
    c_df_combined['rank'] = c_df_combined.index
    c_df_combined.columns = ['attr','c_val','c_rank']
    
    # Comp dfs
    def_comp_df = pd.merge(p_df_defense, c_df_defense, how='left', on='attr')[['attr','val','c_val','rank','c_rank']]
    att_comp_df = pd.merge(p_df_attack, c_df_attack, how='left', on='attr')[['attr','val','c_val','rank','c_rank']]
    com_comp_df = pd.merge(p_df_combined, c_df_combined, how='left', on='attr')[['attr','val','c_val','rank','c_rank']]
    
    # Comp funcs
    minimum_sim = 0.42105263157894735
    com_minimum_sim = 0.46153846153846156
    
    X_att = att_comp_df['rank'].to_list()
    Y_att = att_comp_df.c_rank.to_list()
    
    X_def = def_comp_df['rank'].to_list()
    Y_def = def_comp_df.c_rank.to_list()
    
    X_com = com_comp_df['rank'].to_list()
    Y_com = com_comp_df.c_rank.to_list()


    cos_sim_def = 1 - spatial.distance.cosine(X_def, Y_def)
    cos_sim_att = 1 - spatial.distance.cosine(X_att, Y_att)
    cos_sim_comb = 1 - spatial.distance.cosine(X_com, Y_com)
    
    final_cos_comp_attack = (cos_sim_att- minimum_sim)/(1-minimum_sim)
    final_cos_comp_defense = (cos_sim_def- minimum_sim)/(1-minimum_sim)
    final_cos_comp_comb = (cos_sim_def- com_minimum_sim)/(1-com_minimum_sim)
    
    comp_attack__l.append(final_cos_comp_attack)
    comp_defense__l.append(final_cos_comp_defense)    
    comp_combined__l.append(final_cos_comp_comb)
    
    #print(final_cos_comp_attack, final_cos_comp_defense)
    
    
comp_df_2['c_defense'] = comp_defense__l
comp_df_2['c_attack'] = comp_attack__l
comp_df_2['c_comb'] = comp_combined__l

### 3.3) Weight Tuning

In [12]:
weights = {
    'c_weight': 1,
    'c_height': 7,
    'c_pref_foot': 7,
    'c_ability': 35,
    'c_comb': 50
    
}

sum_weights_l = []

for k,v in weights.items():
    sum_weights_l.append(v)
    
sum(sum_weights_l)

100

In [13]:
eval_cols = []

for k,v in weights.items():
    
    eval_col = 'sim_'+k.split('c_')[1]
    
    comp_df_2[eval_col] = comp_df_2[k] * v
    
    eval_cols.append(eval_col)

comp_df_2['total_similarity'] = comp_df_2[eval_cols].sum(axis=1)

In [14]:
comp_df_2[['uid','player_name','age','club_name','current_ability',
                                       'height','weight','position','preferred_foot_rating']+eval_cols+['total_similarity']] \
        .sort_values('total_similarity', ascending=False).head(20)

Unnamed: 0,uid,player_name,age,club_name,current_ability,height,weight,position,preferred_foot_rating,sim_weight,sim_height,sim_pref_foot,sim_ability,sim_comb,total_similarity
193538,98041215.0,Rubén Vargas,23.0,FC Augsburg,134.0,70,149,M/AM (RL),2,0.433333,6.416667,7.0,35.0,38.922306,87.772306
223187,51041098.0,Paul Arriola,26.0,FC Dallas,132.0,66,147,M/AM (RL),2,0.5,5.25,7.0,31.5,41.854637,86.104637
323050,19326424.0,Yuri Alberto,20.0,COR,134.0,72,169,"AM (RL), ST (C)",2,0.0,5.25,7.0,35.0,38.596491,85.846491
223202,51052599.0,Víctor Guzmán,26.0,Pachuca,132.0,69,165,M/AM (C),2,0.0,7.0,7.0,31.5,39.899749,85.399749
291840,76043626.0,Duván Vergara,25.0,Monterrey,133.0,68,160,AM (RL),2,0.066667,6.416667,7.0,33.25,38.596491,85.329825
226132,78088553.0,Brian Ocampo,22.0,Cádiz,130.0,68,147,"AM (RL), ST (C)",2,0.5,6.416667,7.0,28.0,43.157895,85.074561
290932,91100663.0,Amin Younes,28.0,FC Utrecht,133.0,66,147,M/AM (RL),2,0.5,5.25,7.0,33.25,38.922306,84.922306
319821,22051266.0,Kiril Despodov,25.0,Ludogorets,133.0,70,165,"AM (RL), ST (C)",2,0.0,6.416667,7.0,33.25,37.619048,84.285714
252426,19158809.0,Carlos Jr.,26.0,Al-Shabab,132.0,68,149,"AM (RL), ST (C)",2,0.433333,6.416667,7.0,31.5,37.944862,83.294862
329899,67272136.0,Ferran Jutglà,22.0,Club Brugge,132.0,69,165,"AM (RL), ST (C)",2,0.0,7.0,7.0,31.5,37.619048,83.119048


## 4) Let's Automate This!

In [18]:
class PlayerGenie:
    def __init__(self, nation, max_age, min_potential, include_gk):
        
        self.players_df = pd.read_csv('players_data_scrubbed.csv')
        self.mapped_cols = pd.read_csv('../FMScraper/FUTEK Unmapped Cols - unmapped_cols.csv')
        self.mapped_pos_df = pd.read_excel('FUTEK Mapped Positional Columns.xlsx')
        self.mapped_pos_df.fillna('', inplace= True)
        
        self.nation = nation
        self.max_age = max_age
        self.min_potential = min_potential
        self.include_gk = include_gk        
        
        self.l__final_sim__str_weak_dfs = []
        self.l__final_sim__comp_players_dfs = []        
        self.l__final_sim__div_compatibility_dfs = []        
        self.l__final_sim__player_perf_stats_dfs = []        
        
        self.general_info_cols = ['uid','player_name','age','height_string','nationality','birth_city','birth_nation','position','club_name','division','current_ability','potential_ability','world_reputation']
        
        self.top_20_divisions_by_ability = pd.DataFrame(self.players_df[self.players_df.age.between(25,29,inclusive='both')].groupby('division', as_index= False).current_ability.median()).sort_values('current_ability', ascending= False).head(20).reset_index()
        self.top_20_divisions_l = self.top_20_divisions_by_ability.division.to_list()

        p_age_filter = (self.players_df.age <= self.max_age)
        p_potential_filter = (self.players_df.potential_ability >= self.min_potential)
        p_nation_filter = (self.players_df.nationality == self.nation)
        p_gk_filter = (self.players_df.position.str.contains('GK') == self.include_gk)

        self.candidates = self.players_df[self.general_info_cols][p_age_filter&p_potential_filter&p_nation_filter&p_gk_filter].sort_values('potential_ability', ascending= False).head(20)
        self.pindexes_to_analyse = self.candidates.index.to_list()
    
    def rubBelly(self):
        print('Players to compare:')
        for p_id in tqdm(self.candidates.index.to_list()):
            p_df = self.players_df.loc[p_id]
            print(p_df['player_name'])
            print('Age:', p_df['age'],'Club:', p_df['club_name'], 'Potential:', p_df['potential_ability'])
            # Filters
            rel_pa = p_df['potential_ability'] - 5

            rel_pa_filter = (rel_pa - self.players_df.current_ability <= max_distance)&(rel_pa - players_df.current_ability >= 0)


            attr_cols = mapped_cols[mapped_cols['class'].isin(['gk_attr', 'mental_attr', 'personal_attr','phys_attr', 'techincal_attr'])].mapped.to_list()
            comp_df = players_df[rel_pa_filter][['uid','player_name','age','club_name','potential_ability','current_ability','height','height_string',
                                                 'nationality','division','birth_city','birth_nation','weight','position','preferred_foot_rating']+attr_cols]
            
            p_weight = p_df['weight']
            p_height = p_df['height']
            p_pref_foot = p_df['preferred_foot_rating']
            p_position = p_df['position']
            p_potential = p_df['potential_ability'] - 5
            
            
            comp_weight__l = []
            comp_height__l = []
            comp_pref_foot__l = []
            comp_ability__l = []

            for ix, c_df in comp_df.iterrows():

                # Height, Weight, Pref. Foot, Position

                c_weight = c_df['weight']
                c_height = c_df['height']
                c_pref_foot = c_df['preferred_foot_rating']
                c_position = c_df['position']        
                c_ability = c_df['current_ability']        

                comp_weight__l.append(1-((30 if abs(p_weight - c_weight) > 30 else abs(p_weight - c_weight))/30))
                comp_height__l.append(1-((12 if abs(p_height - c_height) > 12 else abs(p_height - c_height))/12))
                comp_pref_foot__l.append((6-abs(p_pref_foot - c_pref_foot))/6)
                comp_ability__l.append(1 - (abs(p_potential - c_ability) / 20))

            comp_df['c_weight'] = comp_weight__l        
            comp_df['c_height'] =  comp_height__l       
            comp_df['c_pref_foot'] = comp_pref_foot__l      
            comp_df['c_ability'] = comp_ability__l       


            # Custom position filters
            print(f"Positions: [{p_df['position']}]")
            
            
            filter_str = f'position == "{p_position}"'            
            
            for i in p_df['position'].split(','):
                mapped_pos = self.mapped_pos_df[self.mapped_pos_df.pos == i.strip()].mapped.values[0]
                filter_str += f' or position.str.contains("{i.strip()}", regex= False)'
                
                for p in mapped_pos.split(','):
                    if p != '':
                        filter_str += f' or position.str.contains("{p.strip()}", regex= False)'
                    


            #print(filter_str)
                
            comp_df_2 = comp_df.query(filter_str)

            self.comp_df_2 = comp_df_2
            
            ## Attribute Comparisons
            
            attack_cols = self.mapped_cols[self.mapped_cols.comp_class == 'attack_comp'].mapped.to_list()
            defense_cols = self.mapped_cols[self.mapped_cols.comp_class == 'defense_comp'].mapped.to_list()
        

            # Combined Attributes
            p_df_combined = pd.DataFrame(p_df[attack_cols+defense_cols].sort_values(ascending=False)).reset_index()
            p_df_combined['rank'] = p_df_combined.index
            p_df_combined.columns = ['attr','val','rank']
            
            comp_combined__l = []
            #print(f"[{p_df['player_name']}] Comparing players:")
            for ix, c_df in tqdm(self.comp_df_2.iterrows(), total=len(self.comp_df_2)):

                # Combined Attributes
                c_df_combined = pd.DataFrame(c_df[attack_cols+defense_cols].sort_values(ascending=False)).reset_index()
                c_df_combined['rank'] = c_df_combined.index
                c_df_combined.columns = ['attr','c_val','c_rank']

                # Comp dfs
                com_comp_df = pd.merge(p_df_combined, c_df_combined, how='left', on='attr')[['attr','val','c_val','rank','c_rank']]

                # Comp funcs
                com_minimum_sim = 0.46153846153846156

                X_com = com_comp_df['rank'].to_list()
                Y_com = com_comp_df.c_rank.to_list()

                cos_sim_comb = 1 - spatial.distance.cosine(X_com, Y_com)

                final_cos_comp_comb = (cos_sim_comb- com_minimum_sim)/(1-com_minimum_sim)

                comp_combined__l.append(final_cos_comp_comb)

            comp_df_2['c_comb'] = comp_combined__l
            
            ## Weight Tuning
            
            weights = {
                'c_weight': 1,
                'c_height': 7,
                'c_pref_foot': 7,
                'c_ability': 35,
                'c_comb': 50

            }

            sum_weights_l = []

            for k,v in weights.items():
                sum_weights_l.append(v)

            sum(sum_weights_l)
            
            eval_cols = []

            for k,v in weights.items():

                eval_col = 'sim_'+k.split('c_')[1]

                comp_df_2[eval_col] = comp_df_2[k] * v

                eval_cols.append(eval_col)

            comp_df_2['total_similarity'] = comp_df_2[eval_cols].sum(axis=1)
            comp_df_2['rel_uid'] = p_df['uid']
            comp_df_2['rel_name'] = p_df['player_name']
            
            self.sim_players = comp_df_2[['rel_uid','rel_name','uid','player_name','position','club_name','division','nationality','height_string']+eval_cols+['total_similarity']] \
                                    .sort_values('total_similarity', ascending=False).head(20)
            
            
            ## Strenths & Weaknesses
            
            str_weak_df = pd.DataFrame(p_df[mapped_cols[mapped_cols['comp_class'].notnull()].mapped]).reset_index()
            str_weak_df['rel_uid'] = p_df['uid']
            str_weak_df['rel_name'] = p_df['player_name']
            str_weak_df.columns = ['attr','value','rel_uid','rel_name']
            str_weak_df = str_weak_df[['rel_uid','rel_name','attr','value']]
            str_weak_df.sort_values('value', ascending=False, inplace=True)
            self.str_weak_df = str_weak_df
            
            ## Append            
            self.l__final_sim__str_weak_dfs.append(str_weak_df)
            self.l__final_sim__comp_players_dfs.append(self.sim_players.head(20))
            
            
            ## Percentile match by division
            
            
            rel_uid_l = []
            rel_name_l = []
            div_l = []
            perc_l = []
            
            for div in self.top_20_divisions_l:
                percentile = percentileofscore(self.players_df[(self.players_df.age.between(25,29,inclusive='both'))&(self.players_df.division == div)].current_ability.to_list(), p_df['potential_ability'] - 10)                
                rel_uid_l.append(p_df['uid'])
                rel_name_l.append(p_df['player_name'])
                div_l.append(div)
                perc_l.append(round(percentile,2))
            
            
            div_comp_df = pd.DataFrame()
            div_comp_df['rel_uid'] = rel_uid_l
            div_comp_df['rel_name'] = rel_name_l
            div_comp_df['division'] = div_l
            div_comp_df['perc_fit'] = perc_l     
            #div_comp_df.sort_values('perc_fit', ascending= False, inplace= True)
            
            self.l__final_sim__div_compatibility_dfs.append(div_comp_df)
            
            
            ##. Player Stats
            
            try:
                search = f"{p_df['player_name']} {p_df['club_name']} fbref"
                url = 'https://www.google.com/search'

                headers = {
                    'Accept' : '*/*',
                    'Accept-Language': 'en-US,en;q=0.5',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82',
                }
                parameters = {'q': search}

                content = requests.get(url, headers = headers, params = parameters).text
                soup = BeautifulSoup(content, 'html.parser')

                search = soup.find(id = 'search')
                first_link = search.find('a')

                

                content = requests.get(first_link['href'], headers = headers, params = parameters).text
                soup = BeautifulSoup(content, 'html.parser')

                soup_bottom = soup.find(attrs={'class':'stats_pullout'})

                first_col = [soup_bottom.find('span').text] + [i.text for i in soup_bottom.findAll('p') if 'strong' in str(i)]

                col_headers = [i.text for i in soup_bottom.findAll('span', attrs={'class':'poptip'})]
                col_headers.insert(1,'Min')
                col_values = [pd.to_numeric(i.text) for i in soup_bottom.findAll('p') if 'strong' not in str(i) and 'caption' not in str(i)]
                col_values = [i.tolist() for i in np.array_split(col_values, len(col_headers))]


                total_cols = [first_col]
                for ix, i in enumerate(col_headers):
                    total_cols.append([i]+col_values[ix])

                p_stats_df = pd.DataFrame(total_cols).transpose()
                new_header = p_stats_df.iloc[0] #grab the first row for the header
                p_stats_df = p_stats_df[1:] #take the data less the header row
                p_stats_df.columns = new_header #set the header row as the df header   

                p_stats_df['rel_uid'] = p_df['uid']
                p_stats_df['rel_name'] = p_df['player_name']
                p_stats_df['rel_age'] = p_df['age']
                p_stats_df['nationality'] = p_df['nationality']
                p_stats_df['rel_potential'] = p_df['potential_ability']                                
                p_stats_df['rel_club'] = p_df['club_name']                
                p_stats_df['rel_position'] = p_df['position']                                                

                p_stats_df = p_stats_df[['rel_uid','rel_name','rel_age','nationality','rel_potential','rel_club','rel_position']+p_stats_df.columns[:-7].to_list()]
                p_stats_df['as_of'] = str(dt.now())

                self.l__final_sim__player_perf_stats_dfs.append(p_stats_df)
                print('Found Stats:', True, 'Link:', first_link['href'])
                print()
                print()
            except Exception as e:
                print('Found Stats:', False, 'Error:', e)                
                print()
                print()
                pass
                
            
        self.final_sim_player_info_df = self.candidates.loc[self.pindexes_to_analyse]
        self.final_sim_comp_players_df = pd.concat(self.l__final_sim__comp_players_dfs)
        self.final_sim_str_weak_df = pd.concat(self.l__final_sim__str_weak_dfs)
        self.final_sim_div_comp_df = pd.concat(self.l__final_sim__div_compatibility_dfs)        
        
        if not os.path.exists(f"AnalysisExports/{p_df['nationality']}"):

            # if the demo_folder directory is not present 
            # then create it.
            os.makedirs(f"AnalysisExports/{p_df['nationality']}")
        print(f"Process Complete, data exported to: AnalysisExports/{p_df['nationality']}")
        self.final_sim_player_info_df.to_csv(f"AnalysisExports/{p_df['nationality']}/player_info_df.csv", index=False)
        self.final_sim_comp_players_df.to_csv(f"AnalysisExports/{p_df['nationality']}/comp_players_df.csv", index=False)
        self.final_sim_str_weak_df.to_csv(f"AnalysisExports/{p_df['nationality']}/str_weak_df.csv", index=False)
        self.final_sim_div_comp_df.to_csv(f"AnalysisExports/{p_df['nationality']}/div_comp_df.csv", index=False)
        try:
            self.final_sim_perf_stats_df = pd.concat(self.l__final_sim__player_perf_stats_dfs)        
            self.final_sim_perf_stats_df.fillna(0, inplace= True)
            self.final_sim_perf_stats_df.to_csv(f"AnalysisExports/{p_df['nationality']}/perf_stats.csv", index=False)
        except:
            pass


            
            
        

In [19]:
players_df.sort_values('nationality').nationality.unique()

array(['AFG', 'AIA', 'ALB', 'ALG', 'AND', 'ANG', 'ARG', 'ARM', 'ARU',
       'ASA', 'ATG', 'AUS', 'AUT', 'AZE', 'BAH', 'BAN', 'BDI', 'BEL',
       'BEN', 'BER', 'BFA', 'BHR', 'BHU', 'BIH', 'BLM', 'BLR', 'BLZ',
       'BOE', 'BOL', 'BOT', 'BRA', 'BRB', 'BRU', 'BUL', 'CAM', 'CAN',
       'CAY', 'CGO', 'CHA', 'CHI', 'CHN', 'CIV', 'CMR', 'COD', 'COK',
       'COL', 'COM', 'CPV', 'CRC', 'CRO', 'CTA', 'CUB', 'CUW', 'CYP',
       'CZE', 'DEN', 'DJI', 'DMA', 'DOM', 'ECU', 'EGY', 'ENG', 'EQG',
       'ERI', 'ESP', 'EST', 'ETH', 'FIJ', 'FIN', 'FRA', 'FRO', 'FSM',
       'GAB', 'GAM', 'GEO', 'GER', 'GHA', 'GIB', 'GLP', 'GNB', 'GRE',
       'GRN', 'GUA', 'GUF', 'GUI', 'GUM', 'GUY', 'HAI', 'HKG', 'HON',
       'HUN', 'IDN', 'IND', 'IRL', 'IRN', 'IRQ', 'ISL', 'ISR', 'ITA',
       'JAM', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ', 'KIR', 'KOR', 'KOS',
       'KSA', 'KUW', 'LAO', 'LBR', 'LBY', 'LCA', 'LES', 'LIB', 'LIE',
       'LTU', 'LUX', 'LVA', 'MAC', 'MAD', 'MAR', 'MAS', 'MAY', 'MDA',
       'MDV', 'MEX',

### 4.1) Create Genies

In [20]:
#nigeria_genie = PlayerGenie('NGA', max_age=21, min_potential=130, include_gk=False)
# portugal_genie = PlayerGenie('POR', max_age=21, min_potential=130, include_gk=False)
# cotedivoire_genie = PlayerGenie('CIV', max_age=21, min_potential=130, include_gk=False)
# algeria_genie = PlayerGenie('ALG', max_age=21, min_potential=130, include_gk=False)
# ecuador_genie = PlayerGenie('ECU', max_age=21, min_potential=130, include_gk=False)
# egypt_genie = PlayerGenie('EGY', max_age=21, min_potential=130, include_gk=False)
# sweden_genie = PlayerGenie('SWE', max_age=21, min_potential=130, include_gk=False)
# zimbabwe_genie = PlayerGenie('ZIM', max_age=21, min_potential=120, include_gk=False)
# denmark_genie = PlayerGenie('DEN', max_age=21, min_potential=130, include_gk=False)
# turkey_genie = PlayerGenie('TUR', max_age=21, min_potential=130, include_gk=False)
# morocco_genie = PlayerGenie('MAR', max_age=21, min_potential=120, include_gk=False)
#england_genie = PlayerGenie('ENG', max_age=17, min_potential=130, include_gk=False)
#spain_genie = PlayerGenie('ESP', max_age=17, min_potential=130, include_gk=False)
# serbia_genie = PlayerGenie('SRB', max_age=20, min_potential=130, include_gk=False)
# usa_genie = PlayerGenie('USA', max_age=19, min_potential=130, include_gk=False)
# chile_genie = PlayerGenie('CHI', max_age=20, min_potential=130, include_gk=False)
# japan_genie = PlayerGenie('JPN', max_age=21, min_potential=130, include_gk=False)
# czech_genie = PlayerGenie('CZE', max_age=21, min_potential=130, include_gk=False)
# saudi_genie = PlayerGenie('KSA', max_age=21, min_potential=130, include_gk=False)
# finland_genie = PlayerGenie('FIN', max_age=21, min_potential=130, include_gk=False)
# qatar_genie = PlayerGenie('QAT', max_age=21, min_potential=130, include_gk=False)
# israel_genie = PlayerGenie('ISR', max_age=21, min_potential=130, include_gk=False)
# azerbaijan_genie = PlayerGenie('AZE', max_age=21, min_potential=130, include_gk=False)
#kuwait_genie = PlayerGenie('KUW', max_age=21, min_potential=130, include_gk=False)
senegal_genie = PlayerGenie('SEN', max_age=21, min_potential=130, include_gk=False)

### 4.2) Rub Bellies 

In [21]:
# portugal_genie.rubBelly()
#nigeria_genie.rubBelly()
#spain_genie.rubBelly()
# cotedivoire_genie.rubBelly()
# algeria_genie.rubBelly()
# ecuador_genie.rubBelly()
# egypt_genie.rubBelly()
# sweden_genie.rubBelly()
# zimbabwe_genie.rubBelly()
# denmark_genie.rubBelly()
# turkey_genie.rubBelly()
#england_genie.rubBelly()
# morocco_genie.rubBelly()
# serbia_genie.rubBelly()
# usa_genie.rubBelly()
# chile_genie.rubBelly()
# japan_genie.rubBelly()
# czech_genie.rubBelly()
# saudi_genie.rubBelly()
#finland_genie.rubBelly()
#qatar_genie.rubBelly()
#israel_genie.rubBelly()
#azerbaijan_genie.rubBelly()
senegal_genie.rubBelly()

Players to compare:


Nicolas Jackson
Age: 20.0 Club: Villarreal Potential: 158.0
Positions: [AM (RLC), ST (C)]


Found Stats: True Link: https://fbref.com/en/players/9c36ed83/Nicolas-Jackson


Samba Diallo
Age: 19.0 Club: Dynamo Kyiv Potential: 154.0
Positions: [AM (RL)]


Found Stats: True Link: https://fbref.com/en/players/d4822868/Samba-Diallo


Demba Diop
Age: 18.0 Club: Zulte Waregem Potential: 150.0
Positions: [M (C)]


Found Stats: False Error: 'NoneType' object has no attribute 'find'


Baba
Age: 16.0 Club: R. Madrid Potential: 149.0
Positions: [M (R), AM (RL)]


Found Stats: False Error: 'NoneType' object has no attribute 'find'


Libasse Ngom
Age: 18.0 Club: Guédiawaye Potential: 149.0
Positions: [AM (C), ST (C)]


Found Stats: False Error: 'NoneType' object has no attribute 'find'


Dauda Diong
Age: 15.0 Club: Académie Darou Salam Potential: 148.0
Positions: [M/AM (RL)]


Found Stats: False Error: 'NoneType' object has no attribute 'find'


Iliman N'Diaye
Age: 21.0 Club: Sheff Utd Potential: 148.0
Positions: [AM (RLC), ST (C)]


Found Stats: True Link: https://fbref.com/en/players/5ed97752/Iliman-Ndiaye


Cheikh Diop
Age: 15.0 Club: Wallydane Thiès Potential: 147.0
Positions: [D (C)]


Found Stats: True Link: https://fbref.com/en/players/873ce3a2/Pape-Cheikh-Diop


Bamba Dieng
Age: 21.0 Club: OM Potential: 145.0
Positions: [ST (C)]


Found Stats: True Link: https://fbref.com/en/players/40774c6b/Bamba-Dieng


Demba Seck
Age: 20.0 Club: Torino Potential: 145.0
Positions: [ST (C)]


Found Stats: True Link: https://fbref.com/en/players/16463a79/Demba-Seck


Abdallah Sima
Age: 20.0 Club: Angers SCO Potential: 144.0
Positions: [AM (R), ST (C)]


Found Stats: True Link: https://fbref.com/en/players/fdbde523/Abdallah-Sima


Arouna Sanganté
Age: 19.0 Club: Havre AC Potential: 143.0
Positions: [D (C)]


Found Stats: True Link: https://fbref.com/en/players/17bb562c/Arouna-Sangante


Issa Soumaré
Age: 21.0 Club: QRM Potential: 143.0
Positions: [AM (RL), ST (C)]


Found Stats: True Link: https://fbref.com/en/players/1d8aa290/Issa-Soumare


Formose Mendy
Age: 20.0 Club: Amiens SC Potential: 143.0
Positions: [D (RC)]


Found Stats: True Link: https://fbref.com/en/players/c21983bf/Formose-Mendy


Dion Lopy
Age: 19.0 Club: Reims Potential: 140.0
Positions: [DM, M (C)]


Found Stats: True Link: https://fbref.com/en/players/64c14d49/Dion-Lopy


Amidou Diop
Age: 17.0 Club: Génération Foot Potential: 139.0
Positions: [D/WB (R)]


Found Stats: False Error: 'NoneType' object has no attribute 'find'


Aliou Baldé
Age: 19.0 Club: FC Dordrecht Potential: 139.0
Positions: [M/AM (RL)]


Found Stats: True Link: https://fbref.com/en/players/7362a6f2/Aliou-Badara-Balde


Lamine Camara
Age: 17.0 Club: Génération Foot Potential: 138.0
Positions: [M (C)]


Found Stats: True Link: https://fbref.com/en/players/01870104/Mohamed-Lamine-Bayo


Mamadou Sarr
Age: 17.0 Club: Stade de Mbour Potential: 138.0
Positions: [ST (C)]


Found Stats: True Link: https://fbref.com/en/players/b959a0d1/Mouhamadou-Sarr


Ibou Sané
Age: 16.0 Club: Génération Foot Potential: 137.0
Positions: [ST (C)]


Found Stats: False Error: 'href'


Process Complete, data exported to: AnalysisExports/SEN


### 4.3) Create Master Data files

In [3605]:
path ="/Users/rustambensalem/Desktop/Code/FUTEK/RPs/Wunderkids/Analysis/AnalysisExports"

comp_players_l = []
str_weak_l = []
player_info_l = []
div_comp_l = []
perf_stats_l = []

for root, dirs, files in os.walk(path):
    for file in files:        
        if 'csv' in str(file) and 'checkpoint' not in str(file):
            if file == 'comp_players_df.csv':
                df = pd.read_csv(os.path.join(root,file))
                comp_players_l.append(df)
            if file == 'str_weak_df.csv':
                df = pd.read_csv(os.path.join(root,file))   
                str_weak_l.append(df)
            if file == 'player_info_df.csv':
                df = pd.read_csv(os.path.join(root,file))
                player_info_l.append(df)
            if file == 'div_comp_df.csv':
                df = pd.read_csv(os.path.join(root,file))    
                div_comp_l.append(df)
            if file == 'perf_stats.csv':
                df = pd.read_csv(os.path.join(root,file))  
                perf_stats_l.append(df)          
                        
                
master_comp_players_df = pd.concat(comp_players_l)        
master_str_weak_df = pd.concat(str_weak_l)
master_player_info_df = pd.concat(player_info_l)
master_div_comp_df = pd.concat(div_comp_l)
master_perf_stats_df = pd.concat(perf_stats_l)

master_str_weak_df['attr'] = master_str_weak_df.attr.replace({
                                'agility':'Agility',
                                'dribbling':'Dribbling',
                                'acceleration':'Acceleration',
                                'natural_fitness':'Natural Fitness',
                                'flair':'Flair',
                                'pace':'Pace',
                                'technique':'Technique',
                                'versatility':'Versatility',
                                'frist_touch':'First Touch',
                                'off_the_ball':'Off the Ball',
                                'balance':'Balance',
                                'long_shots':'Long_Shots',
                                'finishing':'Finishing',
                                'anticipation':'Anticipation',
                                'crossing':'Crossing',
                                'stamina':'Stamina',
                                'passing':'Passing',
                                'vision':'Vision',
                                'free_kicks':'Free Kicks',
                                'strength':'Strength',
                                'heading':'Heading',
                                'jumping_reach':'Jumping Reach',
                                'positioning':'Positioning',
                                'tackling':'Tackling',
                                'marking':'Marking'
                                })

master_comp_players_df.to_csv('AnalysisExports/master_comp_players.csv', index= False)
master_str_weak_df.to_csv('AnalysisExports/master_str_weak.csv', index= False)
master_player_info_df.to_csv('AnalysisExports/master_player_info.csv', index= False)
master_div_comp_df.to_csv('AnalysisExports/master_div_comp.csv', index= False)
master_perf_stats_df.to_csv('AnalysisExports/master_perf_stats.csv', index= False)

### 4.4) Example Output

In [27]:
senegal_genie.final_sim_comp_players_df[senegal_genie.final_sim_comp_players_df.rel_name == 'Nicolas Jackson']

Unnamed: 0,rel_uid,rel_name,uid,player_name,position,club_name,division,nationality,height_string,sim_weight,sim_height,sim_pref_foot,sim_ability,sim_comb,total_similarity
205361,12087972.0,Nicolas Jackson,67184349.0,Iñaki Williams,"M (RL), AM (RLC), ST (C)",A. Bilbao,Spanish First Division,GHA,"6'1""",0.833333,7.0,5.833333,35.0,42.704082,91.370748
215128,12087972.0,Nicolas Jackson,59130638.0,Khvicha Kvaratskhelia,AM (RLC),Parthenope,Italian Serie A,GEO,"6'0""",0.433333,6.416667,4.666667,35.0,41.131195,87.647862
269709,12087972.0,Nicolas Jackson,90008752.0,Jamie Vardy,ST (C),Leicester,English Premier Division,ENG,"5'10""",0.566667,5.25,5.833333,35.0,38.516035,85.166035
189957,12087972.0,Nicolas Jackson,19306929.0,Rodrygo,"AM (RL), ST (C)",R. Madrid,Spanish First Division,BRA,"5'9""",0.0,4.666667,5.833333,33.25,40.126822,83.876822
245305,12087972.0,Nicolas Jackson,67117360.0,Álvaro Morata,ST (C),A. Madrid,Spanish First Division,ESP,"6'2""",0.533333,6.416667,5.833333,35.0,36.090379,83.873712
242739,12087972.0,Nicolas Jackson,19265858.0,Richarlison,"AM (RL), ST (C)",Tottenham,English Premier Division,BRA,"5'11""",0.5,5.833333,5.833333,26.25,45.167638,83.584305
194059,12087972.0,Nicolas Jackson,28051378.0,Wilfried Zaha,"M (L), AM (RL), ST (C)",Crystal Palace,English Premier Division,CIV,"6'0""",0.733333,6.416667,5.833333,29.75,40.524781,83.258115
229702,12087972.0,Nicolas Jackson,37050140.0,Steven Bergwijn,"M (RL), AM (RLC), ST (C)",Ajax,Eredivisie,NED,"5'10""",1.0,5.25,5.833333,31.5,39.444606,83.02794
202212,12087972.0,Nicolas Jackson,28100266.0,Marcus Rashford,"AM (L), ST (C)",Man UFC,English Premier Division,ENG,"6'1""",0.433333,7.0,5.833333,31.5,38.004373,82.77104
299442,12087972.0,Nicolas Jackson,29123128.0,Dominic Calvert-Lewin,ST (C),Everton,English Premier Division,ENG,"6'2""",0.5,6.416667,5.833333,33.25,36.715743,82.715743


In [28]:
senegal_genie.final_sim_div_comp_df[senegal_genie.final_sim_div_comp_df.rel_name == 'Nicolas Jackson']

Unnamed: 0,rel_uid,rel_name,division,perc_fit
0,12087972.0,Nicolas Jackson,English Premier Division,76.02
1,12087972.0,Nicolas Jackson,Spanish First Division,81.45
2,12087972.0,Nicolas Jackson,Italian Serie A,86.67
3,12087972.0,Nicolas Jackson,Bundesliga,91.5
4,12087972.0,Nicolas Jackson,Ligue 1 Uber Eats,96.34
5,12087972.0,Nicolas Jackson,Brazilian National First Division,100.0
6,12087972.0,Nicolas Jackson,Argentine Premier Division,100.0
7,12087972.0,Nicolas Jackson,Mexican First Division,100.0
8,12087972.0,Nicolas Jackson,Spanish Second Division,100.0
9,12087972.0,Nicolas Jackson,Portuguese Premier League,96.79


In [29]:
senegal_genie.final_sim_str_weak_df[senegal_genie.final_sim_str_weak_df.rel_name == 'Nicolas Jackson']

Unnamed: 0,rel_uid,rel_name,attr,value
2,12087972.0,Nicolas Jackson,acceleration,16.0
4,12087972.0,Nicolas Jackson,pace,16.0
24,12087972.0,Nicolas Jackson,technique,15.0
14,12087972.0,Nicolas Jackson,dribbling,15.0
23,12087972.0,Nicolas Jackson,natural_fitness,14.0
5,12087972.0,Nicolas Jackson,stamina,14.0
15,12087972.0,Nicolas Jackson,finishing,14.0
21,12087972.0,Nicolas Jackson,agility,13.0
7,12087972.0,Nicolas Jackson,heading,13.0
20,12087972.0,Nicolas Jackson,versatility,12.0


## BONUS: Exported Positions for Mapping

In [3212]:
d = [i.split(',') for i in list(players_df.position.unique())]
flat_list = [item.strip() for sublist in d for item in sublist]
df = pd.DataFrame(flat_list)
df.columns = ['pos']
df.drop_duplicates('pos',inplace= True)
df.to_csv('position_mapper.csv')