## Imports

Import the necessary library.

In [1]:
# ! pip install eli5

In [2]:
%matplotlib inline
import math
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans
import bs4 as bs
import urllib.request
import warnings
import eli5


Matplotlib is building the font cache; this may take a moment.


## Read Data

Reading the data for all the players. The data is read in Dataframes. 

In [3]:
player15_df = pd.read_csv('./data/players_15.csv')
player16_df = pd.read_csv('./data/players_16.csv')
player17_df = pd.read_csv('./data/players_17.csv')
player18_df = pd.read_csv('./data/players_18.csv')
player19_df = pd.read_csv('./data/players_19.csv')
player20_df = pd.read_csv('./data/players_20.csv')

## Data Cleanup

The purpose of the cells below is to make sure that we drop the columns that have been identified as not to be used.

In addition to the columns indicated in the statement, the team also feels that the clubs staff capabilities do not depend on other attributes like:
1. Player Height
1. Player Weight
1. Player Nationality

So these columns have been also identified as to be dropped.

In [4]:
def clean_player_df(player_df):
    '''
    Function below takes the Dataframe and drops the columns which are specified in the list above.
    '''
    return player_df.drop(columns_to_drop, axis=1)

Cleaning up all the dataframes to remove the columns identified.

## Data Analysis

For the purpose of Data Analysis, identifying columns which are numerical would simplify the quantitative analysis of the stafs capabilities. In order to identify those columns and run some preliminary analysis like min values max values, identifying the columns which are numeric.

### Step 1: Identify Numerical Columns

#### Inference from the step:
The list above indicates that there are about 24 columns which are numerical and can be leveraged for the sake of quantitative analysis. 
However further investigation of these numerical columns indicate that some of these columns could be added to the cleanup of columns as they would not really reflect the staffs capabilities to promote talent.

#### Inference Action:
Add the columns to the list of columns to be cleaned up and remove the columns.

Columns identified:
1. value_eur
1. release_clause_eur
1. team_jersey_number
1. contract_valid_until
1. nation_jersey_number

In [5]:
columns_to_drop = ["sofifa_id", "player_url", "long_name", "wage_eur", "real_face", "height_cm", "weight_kg", "nationality", 
                    "value_eur", "release_clause_eur", "team_jersey_number", "contract_valid_until","nation_jersey_number"]

In [6]:
# Further cleanup
player15_cleaned_df = clean_player_df(player15_df)
player16_cleaned_df = clean_player_df(player16_df)
player17_cleaned_df = clean_player_df(player17_df)
player18_cleaned_df = clean_player_df(player18_df)
player19_cleaned_df = clean_player_df(player19_df)
player20_cleaned_df = clean_player_df(player20_df)

In [7]:
player15_cleaned_df.head()

Unnamed: 0,short_name,age,dob,club,overall,potential,player_positions,preferred_foot,international_reputation,weak_foot,...,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb
0,L. Messi,27,1987-06-24,FC Barcelona,93,95,CF,Left,5,3,...,62+3,62+3,62+3,62+3,62+3,54+3,45+3,45+3,45+3,54+3
1,Cristiano Ronaldo,29,1985-02-05,Real Madrid,92,92,"LW, LM",Right,5,4,...,63+3,63+3,63+3,63+3,63+3,57+3,52+3,52+3,52+3,57+3
2,A. Robben,30,1984-01-23,FC Bayern München,90,90,"RM, LM, RW",Left,5,2,...,64+3,64+3,64+3,64+3,64+3,55+3,46+3,46+3,46+3,55+3
3,Z. Ibrahimović,32,1981-10-03,Paris Saint-Germain,90,90,ST,Right,5,4,...,61+3,65+3,65+3,65+3,61+3,56+3,55+3,55+3,55+3,56+3
4,M. Neuer,28,1986-03-27,FC Bayern München,90,90,GK,Right,5,4,...,,,,,,,,,,


### Step 2: Analyze the Numerical Columns

In order to check the quality of data available in those numerical columns, analyzing a couple of years to see the type and quality of data. 

#### Inferences:
1. The Describe on the multiple years does indicate that the mental_composure is not a column that we can rely on as the values are not captured in the earlier years and further investigation indicated that the data has a format which is a split value (e.g. and hence can be excluded 90+3)
1. International reputation is also a Categorical variable with values 1 to 5.
1. Weak Foot is a categorical value too with values 1 to 5.
1. Numerical columns 'gk_speed', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_diving' have very few values. On inspecting the players associated with those values, it was identified that these are goal keepers and attributes for goal keepers. As the question is about effectiveness of the entire staff, the approach was to drop these columns as well both for lack of data and to avoid focus on skill development for goal keepers.

#### Inference Action
1. Remove the mentality_composure value from the columns. 
2. From domain knowledge perspective, weak foot can be eliminated as other factors would reflect if the staff has improved the weak foot score of the individual.
3. International reputation is also being dropped because we are not measuring PR teams capabilities but the teams staff.
~~
4. Remove the gk_* values from the columns list.  
5. Also as the Goal Keepers are missing the information of other attributes, we might not be able to get metrics on the staffs work on the goal keepers. 
~~

In [8]:
columns_to_drop = ["player_url", "long_name", "wage_eur", "real_face", "height_cm", "weight_kg", "nationality", 
                    "value_eur", "release_clause_eur", "team_jersey_number", "contract_valid_until","nation_jersey_number",
                      "mentality_composure", "weak_foot", "international_reputation"]

In [9]:
# Reloading the data so that we can reclean
player15_df = pd.read_csv('./data/players_15.csv')
player16_df = pd.read_csv('./data/players_16.csv')
player17_df = pd.read_csv('./data/players_17.csv')
player18_df = pd.read_csv('./data/players_18.csv')
player19_df = pd.read_csv('./data/players_19.csv')
player20_df = pd.read_csv('./data/players_20.csv')

In [10]:
# Further cleanup
player15_cleaned_df = clean_player_df(player15_df)
player16_cleaned_df = clean_player_df(player16_df)
player17_cleaned_df = clean_player_df(player17_df)
player18_cleaned_df = clean_player_df(player18_df)
player19_cleaned_df = clean_player_df(player19_df)
player20_cleaned_df = clean_player_df(player20_df)

### Skills Columns

The skills columns in the dataframe indicate the different skills with a numerical value for the Skills. The team thinks that the skills are an important part of the development of every player. The player skills once improved will contribute to the overall improvement of the player and hence its overall rating. 

#### Observations:
1. Non Numeric columns: 
The columns for skills are multiple and are non numeric. Infact the columns have values with + and - signs indicating that the columns have data which indicates positive and negative traits.

#### Observation Action:

1. The decision is to exclude the columns for the first set of determination and keep only the numeric columns in the list. 
2. Build a data frame for every year with only these numeric values in addition to columns which identify the player like name, club.


In [11]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_column_list = list(player15_cleaned_df.select_dtypes(include=numerics).columns)
print (pd.DataFrame(numeric_column_list))

                 0
0        sofifa_id
1              age
2          overall
3        potential
4      skill_moves
5             pace
6         shooting
7          passing
8        dribbling
9        defending
10          physic
11       gk_diving
12     gk_handling
13      gk_kicking
14     gk_reflexes
15        gk_speed
16  gk_positioning


#### Numerical Only Dataset

Creating a new Dataset with only numerical columns:
0          age
1      overall
2    potential
3  skill_moves
4         pace
5     shooting
6      passing
7    dribbling
8    defending
9       physic
10 short_name
11 club


In [12]:
# player_trait_columns = ["overall", "pace" "shooting","passing","dribbling" "defending"]

player_numeric_identification_columns = ['short_name', "club"] +  numeric_column_list
print (player_numeric_identification_columns)


['short_name', 'club', 'sofifa_id', 'age', 'overall', 'potential', 'skill_moves', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning']


In [13]:
player15_numeric_cleaned_df =  player15_cleaned_df[player_numeric_identification_columns]
player16_numeric_cleaned_df =  player16_cleaned_df[player_numeric_identification_columns]
player17_numeric_cleaned_df =  player17_cleaned_df[player_numeric_identification_columns]
player18_numeric_cleaned_df =  player18_cleaned_df[player_numeric_identification_columns]
player19_numeric_cleaned_df =  player19_cleaned_df[player_numeric_identification_columns]
player20_numeric_cleaned_df =  player20_cleaned_df[player_numeric_identification_columns]
numeric_column_list.remove("sofifa_id")


### Year Over Year Comparison

#### Approach
Since the purpose of the exercise is to identify the effectiveness of the staff, the approach is to identify the score differences of the players year over year. 

We are going to join data year over year and see the score differences. The approach is to create a dataframe that would have the name of the player the club and the difference in the rating of the numeric column for the player.

#### Decisions:
1. Initially the team considered only players who stayed at the club to get a guage of the players skill. However given that not many players were staying at the club, the decision was to not to enforce the stay at club metric. The credit for the increase is being assigned to the club in the earlier year.
1. The score difference will be calculated for every numeric parameter including the values which are categorical.

In [14]:
# Define a year over year dataframe in which the values will be stored
year_over_year_df = pd.DataFrame()
all_years_df = pd.DataFrame()
# year_over_year_df = player15_numeric_cleaned_df["short_name"]
# display(year_over_year_df)

In [15]:
def create_year_over_year_df(year1_df, year2_df, year):
    year1_year2_joined_df = year1_df.merge(year2_df, on="sofifa_id", suffixes=('_1', '_2'))
    year_over_year_df["short_name"] = year1_year2_joined_df["short_name_1"]
    year_over_year_df["club"] = year1_year2_joined_df["club_1"]
    year_over_year_df["age"] = year1_year2_joined_df["age_1"]    
    year_over_year_df["year_over_year"] = year
    for column in numeric_column_list:
        year_over_year_df[f"diff_{column}"] = year1_year2_joined_df[f"{column}_2"] - year1_year2_joined_df[f"{column}_1"]
    return year_over_year_df
        
all_years_df = all_years_df.append(create_year_over_year_df(player15_numeric_cleaned_df, player16_numeric_cleaned_df, 2016))
all_years_df = all_years_df.append(create_year_over_year_df(player16_numeric_cleaned_df, player17_numeric_cleaned_df, 2017)) 
all_years_df = all_years_df.append(create_year_over_year_df(player17_numeric_cleaned_df, player18_numeric_cleaned_df, 2018))
all_years_df = all_years_df.append(create_year_over_year_df(player18_numeric_cleaned_df, player19_numeric_cleaned_df, 2019)) 
all_years_df = all_years_df.append(create_year_over_year_df(player19_numeric_cleaned_df, player20_numeric_cleaned_df, 2020)) 



In [16]:
all_years_df.shape

(53440, 20)

In [17]:
all_years_df.head()
numeric_column_list = list(player15_cleaned_df.select_dtypes(include=numerics).columns)
for column in numeric_column_list: 
    if (column != 'sofifa_id'):
        all_years_df[f"diff_{column}"].fillna(0) 

### Score Difference Matrix

The year over year dataframe has all the score changes for the Club for every individual player that has played for the club.

#### Approach 1
Find the number of players who have a positive change for every metric. This would mean that we create a histogram and find how many players have a positive change. We are going to only count players who have atleast one standard deviation change of positive change in rating for every metric. That will give us indication of how important is this metric in staffs contribution.
##### Decision"
Instead of only using the Players with positive score we decided to use both negative and positive scores so the clubs which are performing bad could be penalized.

#### Approach 2
For every club find the average change in ratings for each of the metrics and then order the clubs to identify which clubs have maximum change and order those clubs. This will ensure that we not only count players which are improving but also players which are losing points.
Plot the top 10 clubs that have shown the most change.

#### Approach 3
Define a Linear Regression model to determine what should be the change in the overall rating for a player at an age.

Find the change in the Players metrics. Based upon the metrics find the overall score change that is desired based upon the individual factors using a Linear Regression model. 



#### Approach 2

In [18]:
all_years_df.head()

Unnamed: 0,short_name,club,age,year_over_year,diff_age,diff_overall,diff_potential,diff_skill_moves,diff_pace,diff_shooting,diff_passing,diff_dribbling,diff_defending,diff_physic,diff_gk_diving,diff_gk_handling,diff_gk_kicking,diff_gk_reflexes,diff_gk_speed,diff_gk_positioning
0,L. Messi,FC Barcelona,27,2016,1,1,0,0,-1.0,-1.0,0.0,-1.0,-3.0,-1.0,,,,,,
1,Cristiano Ronaldo,Real Madrid,29,2016,1,1,1,0,-1.0,0.0,-1.0,0.0,1.0,-1.0,,,,,,
2,A. Robben,FC Bayern München,30,2016,1,0,0,0,-1.0,0.0,-1.0,0.0,0.0,0.0,,,,,,
3,Z. Ibrahimović,Paris Saint-Germain,32,2016,1,-1,-1,0,-3.0,-1.0,0.0,-1.0,-3.0,0.0,,,,,,
4,M. Neuer,FC Bayern München,28,2016,1,0,0,0,,,,,,,-2.0,2.0,-1.0,0.0,0.0,0.0


#### Observation:

From the graphs above we observe:
1. Players of higher ages tend to show less changes in all the metrics. Infact playersa above the age between 30 and 35 tend to show lesser changes in the skill area improvements. 
2. Younger players do tend to show higher changes in the skill profile values.
3. Certain higher values for differences might skew the data and might have to be removed.

#### Actions:
1. Age definitely plays a role in the calculation of the score changes and hence we are going to calculate the average on every age.
1. Average the score changes for a club at every age and identify the clubs which are performing better. 

In [19]:
all_years_df = all_years_df.fillna(0)

In [20]:
all_years_df.head()

Unnamed: 0,short_name,club,age,year_over_year,diff_age,diff_overall,diff_potential,diff_skill_moves,diff_pace,diff_shooting,diff_passing,diff_dribbling,diff_defending,diff_physic,diff_gk_diving,diff_gk_handling,diff_gk_kicking,diff_gk_reflexes,diff_gk_speed,diff_gk_positioning
0,L. Messi,FC Barcelona,27,2016,1,1,0,0,-1.0,-1.0,0.0,-1.0,-3.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Cristiano Ronaldo,Real Madrid,29,2016,1,1,1,0,-1.0,0.0,-1.0,0.0,1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,A. Robben,FC Bayern München,30,2016,1,0,0,0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Z. Ibrahimović,Paris Saint-Germain,32,2016,1,-1,-1,0,-3.0,-1.0,0.0,-1.0,-3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M. Neuer,FC Bayern München,28,2016,1,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,-2.0,2.0,-1.0,0.0,0.0,0.0


In [21]:
year_over_year_metrics_averaged = all_years_df
year_over_year_metrics_averaged = year_over_year_metrics_averaged.drop(["short_name"], axis=1)
year_over_year_metrics_averaged = year_over_year_metrics_averaged.groupby(["club", "year_over_year", "age"]).mean().reset_index()
year_over_year_metrics_averaged = year_over_year_metrics_averaged.dropna()

display (year_over_year_metrics_averaged.head())

Unnamed: 0,club,year_over_year,age,diff_age,diff_overall,diff_potential,diff_skill_moves,diff_pace,diff_shooting,diff_passing,diff_dribbling,diff_defending,diff_physic,diff_gk_diving,diff_gk_handling,diff_gk_kicking,diff_gk_reflexes,diff_gk_speed,diff_gk_positioning
0,SSV Jahn Regensburg,2019,20,1.0,1.0,-1.0,0.0,3.0,11.0,12.0,3.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
1,SSV Jahn Regensburg,2019,21,1.0,4.0,3.0,0.0,-1.0,2.0,4.0,3.0,2.5,3.0,0.0,0.0,0.0,0.0,0.0,0.0
2,SSV Jahn Regensburg,2019,22,1.0,2.5,1.5,0.0,-2.0,2.0,3.5,4.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,SSV Jahn Regensburg,2019,23,1.0,2.0,1.0,-1.0,1.0,2.0,0.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,SSV Jahn Regensburg,2019,24,1.0,2.25,0.75,0.25,3.25,5.25,3.75,2.0,5.75,2.5,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
# Finding Clubs which have done better in all age group
sortable_columns = []
for column in all_years_df.columns:
    if ("diff_" in column and column != "diff_age"):
        sortable_columns.append(column)
best_clubs_any_year = year_over_year_metrics_averaged.sort_values(by=sortable_columns, ascending = False).head(10)
display (best_clubs_any_year)

Unnamed: 0,club,year_over_year,age,diff_age,diff_overall,diff_potential,diff_skill_moves,diff_pace,diff_shooting,diff_passing,diff_dribbling,diff_defending,diff_physic,diff_gk_diving,diff_gk_handling,diff_gk_kicking,diff_gk_reflexes,diff_gk_speed,diff_gk_positioning
25469,Sivasspor,2016,24,1.0,29.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,31.0,27.0,17.0,33.0,27.0,23.0
22979,Rosario Central,2017,19,1.0,21.0,22.0,0.0,12.0,13.0,25.0,19.0,18.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0
22374,Real Sociedad,2017,18,1.0,21.0,19.0,0.0,8.0,26.0,15.0,24.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
14787,Jagiellonia Białystok,2016,16,1.0,21.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,24.0,25.0,22.0,1.0,25.0
3820,Bologna,2017,17,1.0,17.0,12.0,0.0,15.0,2.0,15.0,20.0,11.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0
4011,Borussia Mönchengladbach,2017,19,1.0,17.0,11.5,0.0,6.5,10.0,12.0,17.5,11.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0
19605,Once Caldas,2017,18,1.0,17.0,10.0,0.0,1.0,16.0,7.0,10.0,3.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0
21077,Portland Timbers,2016,22,1.0,16.0,16.0,0.0,1.0,1.0,3.0,18.0,22.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0
23284,SC Braga,2016,18,1.0,16.0,9.0,0.0,2.0,8.0,13.0,19.0,17.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
13423,Hamburger SV,2016,20,1.0,15.0,16.0,0.0,12.0,4.0,9.0,9.0,17.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
x_train = year_over_year_metrics_averaged.drop(["club","year_over_year","age","diff_age","diff_overall"], axis=1)
y_train = year_over_year_metrics_averaged["diff_overall"]

linreg = LinearRegression()
linreg.fit(x_train, y_train)

y_train_pred = linreg.predict(x_train)
y_train_pred

predicted_score_improvement = year_over_year_metrics_averaged[["club","diff_overall"]]
predicted_score_improvement["predicted"] = y_train_pred

predicted_score_improvement["difference"] = predicted_score_improvement["diff_overall"] - predicted_score_improvement["predicted"] 
predicted_score_improvement.describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_score_improvement["predicted"] = y_train_pred
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predicted_score_improvement["difference"] = predicted_score_improvement["diff_overall"] - predicted_score_improvement["predicted"]


Unnamed: 0,diff_overall,predicted,difference
count,30230.0,30230.0,30230.0
mean,0.987038,0.987038,9.166777000000001e-17
std,2.299698,2.08824,0.9632589
min,-10.0,-10.583788,-8.833086
25%,-0.2,-0.222364,-0.5300505
50%,1.0,0.688935,-0.09280438
75%,2.0,1.885995,0.4903193
max,29.0,28.426504,9.878519


In [None]:
fail here 

#### Age Based Score Calculation

We are now considering the values for the score changes for every age. This means grouping based upon the age. This is to make sure that we are calculating what the expected scores are going to be for every player. 


In [None]:
# Find the mean score change at each age.
player_score_changes_by_age = all_years_df.drop(["club","year_over_year", "diff_age"], axis=1)
player_score_changes_by_age = player_score_changes_by_age.groupby(["age"]).mean().reset_index()
player_score_changes_by_age.head(10)

##### Age Based Score Calculation for all Players

Instead of grouping by only the age, we are also going to count the players which belong to a group. This will also help us identify the players in the specified train set that belong to teams belonging to a club.

In [None]:
player_score_changes_by_club_age = all_years_df.drop(["year_over_year", "diff_age"], axis=1)
player_score_changes_by_club_age = player_score_changes_by_club_age.groupby(["club","age"]).mean().reset_index()
player_score_changes_by_club_age.head(10)

#### Club Score and Expected Score

This is a dataframe to calculate what was the expected score of a player at that age vs what was the actual different that the club data reflects.

In [None]:
# A placeholder Dataframe to store the values associated with the clubs scores and the average expected
club_score_comparison = player_score_changes_by_club_age.merge(player_score_changes_by_age, on="age", suffixes=('_1', '_2'))
club_score_comparison.head()

In [None]:
# Now calculate the difference between the individual scores to come up with a Overall Staff performance score
score_comparison_club_assign_scored = club_score_comparison[["club", "age"]]
for column in club_score_comparison.columns:
    if ("diff_" in column and "_1" in column): 
        diff_column_name = column.replace("diff_", "")
        diff_column_name = diff_column_name.replace("_1", "")
        score_comparison_club_assign_scored[diff_column_name] = club_score_comparison[f"diff_{diff_column_name}_2"] - club_score_comparison[f"diff_{diff_column_name}_1"]
score_comparison_club_assign_scored.head()

#### Interpretation:

The graph above depicts the number of clubs (y axis) which have scores 10, 20, 30 ...100. 
We observe that many clubs have scores of 100 for ages 19 to 31 which is definitely the expected output. 

These are also prime years of the players.

### Model 

#### Test and Train Set

The problem states that we are using the data from the Division 1 European League. In order to make that happen for the data that we have cleaned up, we are going to separate the data for the clubs into Train set belonging to the players from the league.

#### Steps:
1. Investigation of the dataset has indicated that the League information is not in the data file and needs to be fetched from the web using the scraping approach.
2. Once the scraping pulls data the data is going to be appended to the club score changes dataframe. 

#### Model Options
1. Linear Model using fixed values based upon the score changes.
2. Linear Model but instead of using fixed values use the coefficients of the linear equation that generates the overall score and then assign the score based upon the coefficient values.


#### Approach:
1. Read the Club Data and the Leagues data files
2. Match the club and the leagues in which the clubs play
3. Create a test data set and a train dataset by excluding the the clubs that are associated with the Leagues marked as Test data.


In [None]:
leagues_df = pd.read_csv('./data/teams_and_leagues.csv')
display(leagues_df)

### Web Scraping

The leagues and teams dataset does not have any correlation with the clubs that play in a league and hence we need to scrape that data from the sofifa site.
The url field in the dataset can be used to make a web request and then download the data.

In [None]:
import requests
def parseTeamNameFromUrl(team_url_id):

    source = requests.get(f"https://sofifa.com/team/{team_url_id}")
    soup = bs.BeautifulSoup(source.content)
    team_info_divs = soup.findAll("div", {"class": "info"})
    team_name = 'not found'
    for div in team_info_divs:
        team_name = div.find("h1").text 
    print (team_name)
    return team_name

In [None]:
parseTeamNameFromUrl(9)

#### Web Scraping End

Once the web scraping is done we do not intend to save the file and no need to keep running the scraping every execution. This is the lambda based approach to make a call to every row and pull the data. this is not done everytime and I have pulled that data down to my machine.

In [None]:
# leagues_df["club"] = leagues_df.apply(lambda x: parseTeamNameFromUrl(x['url']),axis=1)


In [None]:
# display(leagues_df)
# leagues_df.to_csv('./data/teams_leagues_clubs.csv', index=False)

In [None]:
leagues_with_club_df = pd.read_csv('./data/teams_leagues_clubs.csv')
display(leagues_with_club_df)

In [None]:
score_comparison_club_assign_scored_with_leagues = score_comparison_club_assign_scored.merge(leagues_with_club_df, on="club", suffixes=('_1', '_2'))
display(score_comparison_club_assign_scored_with_leagues)

#### Test and Train Set

In [None]:
club_test_set = score_comparison_club_assign_scored_with_leagues[score_comparison_club_assign_scored_with_leagues['league_name'].isin(["English Premier League ", "German 1. Bundesliga ", "French Ligue 1 ", "Spain Primera Division ", "Italian Serie A "])]
club_test_set.shape

In [None]:
club_train_set = score_comparison_club_assign_scored_with_leagues[~score_comparison_club_assign_scored_with_leagues['league_name'].isin(["Division 1 European League "])]
club_train_set.shape

## Model Design

One of the challenges of calculating the Scores for the Best Staff is that the scores for the players have to be compared to generate the scores for the Support staff. 

To calculate the scores for the Support Staff of a club, the approach is to calculate the difference in the score for every year and attribute that improvement to the club that the player was the year earlier. In our EDA we identified that the player score improvements depend upon the players age. In order to incorporate the age factor, the approach essentially only includes using the entire dataset to calculate the score differences for every player year over year. Once the difference is calculated we determine the mean score changes for every age (for all players of an age what was the change that happened for the attributes).

Using the score changes calculated earlier, we now group all the players in a club at every age and calculate the mean score changes for that club. The score changes for all the players of the club are then compared to the mean for all players. An indication if the club did better is if the mean score change for the club for players of an age is more than the average score change for the players of the age, the club staff is definitely helping the players of that age. 



### Model 1 - Simple Scoring Approach
#### Scoring Approach:

The teams that are doing better are the ones which are doing better than the average assigned to each team. 
So in the score comparison dataframe, the difference in the scores for all the scores is calculated and then overall rating associated. 
Here is the weight associated with each of the parameters:
A positive value gets the following values and a negative value is deducted by the same amount. 
1. Overall - 20
2. potential - 10
3. skill_moves - 10
4. pace - 10
5. shooting - 10
6. passing - 10
7. dribbling - 10
8. defending - 10
9. physic - 10


In [None]:
# This is not relevant any more.
score_comparison_club_assign_scored["score"] = 0
for column in score_comparison_club_assign_scored.columns:
    if (column != "club" and column != "age"):
        if ("overall" in column):
            score_comparison_club_assign_scored.loc[score_comparison_club_assign_scored[column] > 0, 'score'] += 20
            score_comparison_club_assign_scored.loc[score_comparison_club_assign_scored[column] == 0, 'score'] += 0            
            score_comparison_club_assign_scored.loc[score_comparison_club_assign_scored[column] < 0, 'score'] -= 20            
        else:
            score_comparison_club_assign_scored.loc[score_comparison_club_assign_scored[column] > 0, 'score'] += 10
            score_comparison_club_assign_scored.loc[score_comparison_club_assign_scored[column] == 0, 'score'] += 0            
            score_comparison_club_assign_scored.loc[score_comparison_club_assign_scored[column] < 0, 'score'] -= 10            
score_comparison_club_assign_scored.head()
score_comparison_club_assign_scored.shape

In [None]:
score_comparison_club_assign_scored.hist(by="age", column="score", figsize=(16,16))
plt.suptitle('Histogram depicting clubs buckets for Scores')

In [None]:
score_comparison_club_assign_scored_with_leagues["score"] = 0
for column in score_comparison_club_assign_scored_with_leagues.columns:
    if (column != "club" and column != "age" and column != "league_name" ):
        if ("overall" in column):
            score_comparison_club_assign_scored_with_leagues.loc[score_comparison_club_assign_scored_with_leagues[column] > 0, 'score'] += 20
            score_comparison_club_assign_scored_with_leagues.loc[score_comparison_club_assign_scored_with_leagues[column] == 0, 'score'] += 0            
            score_comparison_club_assign_scored_with_leagues.loc[score_comparison_club_assign_scored_with_leagues[column] < 0, 'score'] -= 20            
        else:
            score_comparison_club_assign_scored_with_leagues.loc[score_comparison_club_assign_scored_with_leagues[column] > 0, 'score'] += 10
            score_comparison_club_assign_scored_with_leagues.loc[score_comparison_club_assign_scored_with_leagues[column] == 0, 'score'] += 0            
            score_comparison_club_assign_scored_with_leagues.loc[score_comparison_club_assign_scored_with_leagues[column] < 0, 'score'] -= 10            
score_comparison_club_assign_scored_with_leagues.head()
score_comparison_club_assign_scored_with_leagues.shape

In [None]:
club_test_set_with_basic_score = score_comparison_club_assign_scored_with_leagues[score_comparison_club_assign_scored_with_leagues['league_name'].isin(["English Premier League ", "German 1. Bundesliga ", "French Ligue 1 ", "Spain Primera Division ", "Italian Serie A "])]
club_test_set_with_basic_score.shape

In [None]:
club_train_set_with_basic_score = score_comparison_club_assign_scored_with_leagues[~score_comparison_club_assign_scored_with_leagues['league_name'].isin(["Division 1 European League "])]
club_train_set_with_basic_score.shape

In [None]:
club_train_set_with_basic_score.columns

In [None]:
# club_train_set_with_basic_score = club_train_set_with_basic_score.dropna()
x_train_with_basic_score = club_train_set_with_basic_score.drop(["score", "club", "league_name", "url"], axis=1)
cols_to_norm = ['score']
club_train_set_with_basic_score[cols_to_norm] = club_train_set_with_basic_score[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
club_train_set_with_basic_score[cols_to_norm] = club_train_set_with_basic_score[cols_to_norm] * 100
y_train_with_rf_col_score = club_train_set_with_basic_score["score"]
y_train_with_basic_score = club_train_set_with_basic_score["score"]
print(x_train_with_basic_score.shape)
print(y_train_with_basic_score.shape)

In [None]:
linreg = LinearRegression()
linreg.fit(x_train_with_basic_score, y_train_with_basic_score)

In [None]:
y_train_pred_with_basic_score = linreg.predict(x_train_with_basic_score)

mse_train = mean_squared_error(y_train_with_basic_score, y_train_pred_with_basic_score)
print (mse_train)

In [None]:
# club_test_set_with_basic_score = club_test_set_with_basic_score.dropna()
x_test_with_basic_score = club_test_set_with_basic_score.drop(["score", "club", "league_name", "url"], axis=1)

cols_to_norm = ['score']
club_test_set_with_basic_score[cols_to_norm] = club_test_set_with_basic_score[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
club_test_set_with_basic_score[cols_to_norm] = club_test_set_with_basic_score[cols_to_norm] * 100
y_test_with_basic_score = club_test_set_with_basic_score["score"]

In [None]:
y_test_pred_with_basic_score = linreg.predict(x_test_with_basic_score)

mse_test = mean_squared_error(y_test_with_basic_score, y_test_pred_with_basic_score)
print (f"Train MSE {mse_train:.4f}")
print (f"Test MSE {mse_test:.4f}")

r2 = r2_score(y_test_with_basic_score, y_test_pred_with_basic_score)
print (r2)

#### End of Basic Model
***

#### Limitations:

Limitations of the Basic Model is that the staff score calculation is based upon the determination made without any data. Simply considering that overall factor is an important indicator and so we assign a +20 for a staff who has a positive though might sound like a good idea the limitations are clear that it would indicate that we weigh somebody who increased the score by 0.001 and somebody who improved the score by 3 as the same and give them both 20 points. However this might be incorrect as the support staff who improved the score by 3 is definitely helping the players of that age group develop significantly more.

The basic model is therefore comes up with a very high MSE values.

### Scoring Approach

The player overall score is an indicator of the players overall improvement of his abilities aggregated over. The approach for a Data Driven scoring model is to consider using the individual parameters contributing to the overall score values. 

Approach here is to run a learning model to determine what is the impact of the individual parameters on the overall score. The idea is to then determine how important are these parameters on the overall score. Once the impact on the overall score is identified, the next step is to then evaluate the staff on the basis of the importance for each of those factor improvements for the groups. 

An e.g. of this approach would be if dribbling and skill moves are more important then we might be able to give the support staff more credit for improving the dribbling and skill moves values over other values and then generate a score.

This approach also requires eliminating the overall score from the dataset for the score calculation. The approach is to use only the individual parameters for the determination of the effectiveness of the support staff in improving it. 

We will continue with the approach that if the support staff has improved the parameter score they would get a positive score but if they failed to improve the score then they get dinged.

### Model 2

In this Model we are going to use the generate coefficients for individual components that contribute to the  overall rating value. We will set the overall rating as the value that is the output and then use the weights to generate a score. 

In [None]:
club_train_set = score_comparison_club_assign_scored_with_leagues[~score_comparison_club_assign_scored_with_leagues['league_name'].isin(["Division 1 European League "])]
club_train_set.shape

In [None]:
club_train_set_for_coeffs = club_train_set.fillna(0)
x_train_set_for_coeffs = club_train_set_for_coeffs.drop(["overall", "club", "league_name", "url"], axis=1)
y_train_set_for_coeffs = club_train_set_for_coeffs["overall"]
print(x_train_set_for_coeffs.shape)
print(y_train_set_for_coeffs.shape)

In [None]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

maxdeg = 1
x_poly = PolynomialFeatures(maxdeg).fit_transform(x_train_set_for_coeffs)
alpha_list = np.linspace(10,120,100)
# Create two lists for training and validation error
training_error, validation_error = [],[]
for i in alpha_list:

    ridge_reg = Ridge(alpha=i,normalize=True)

    #Fit on the entire data because we just want to see the trend of the coefficients

    ridge_reg.fit(x_poly, y_train_set_for_coeffs)
    
    # Perform cross validation on the modified data with neg_mean_squared_error as the scoring parameter and cv=5
    # Remember to get the train_score
    ridge_cv = cross_validate(ridge_reg, x_poly, y_train_set_for_coeffs, cv=5,scoring='neg_mean_squared_error',return_train_score=True)

    # Compute the training and validation errors got after cross validation
    mse_train = np.mean(np.abs(ridge_cv["train_score"]))
    mse_val = np.mean(np.abs(ridge_cv["test_score"]))

    # Append the MSEs to their respective lists 
    training_error.append(mse_train)
    validation_error.append(mse_val)
    
# Get the best mse from the validation_error list
best_mse  =  min(validation_error)

# Get the best alpha value based on the best mse
best_parameter = alpha_list[validation_error.index(best_mse)]
print (best_parameter)
ridge_reg = Ridge(alpha=best_parameter,normalize=True)

#Fit on the entire data because we just want to see the trend of the coefficients

ridge_reg.fit(x_poly, y_train_set_for_coeffs)

print (ridge_reg.coef_)

We decided to abandon the approach here because we realized that the coefficients for the linear model though gives us the details, the coefficients are very pretty similar and are not able to provide the details.

### Model 3

The Random Forest approach uses Decision trees with different parameters in every iteration and helps identify importance of the parameters over multiple iterations. The approach here was to use the Random Forest to identify the parameters that are useful to determine the parameters that contribute the most to the determination of the important parameters. We used the RandomForestRegressor as overall is not a categorical but a continous value. 

The Random Forest Regressor was run for over 250 iterations and max features of 4 to ensure that the features are dropped periodically. Then using the predictors determined or picked the important_features list is prepared. 

##### Normalization of Parameters:

We observe that the important columns are being identified below as ['potential', 'dribbling', 'passing', 'shooting', 'defending', 'physic'] but the number of times those columns are identified as the primary column are different. In the next step we normalize them on a score of 0 to 1 to identify how important they are and then use the important columns to create a score for the club. 

##### Normalization of Generated Scores

The problem states that the score for the Staff should not be more than 100 so we are normalizing the scores to a max value of 100 and between 0 to 100. 

In [None]:
from sklearn.ensemble import RandomForestRegressor
club_train_set_for_RF = club_train_set
x_train_set_for_RF = club_train_set_for_RF.drop(["overall", "club", "league_name", "url", "score"], axis=1)
y_train_set_for_RF = club_train_set_for_RF["overall"]
print(x_train_set_for_RF.shape)
print(y_train_set_for_RF.shape)

max_depth = 15
random_state = 144
random_forest = RandomForestRegressor(max_depth=max_depth, random_state=random_state, n_estimators=250, max_features = 8)

# Fit the model on the training set
random_forest.fit(x_train_set_for_RF, y_train_set_for_RF)


In [None]:
top_predictors_rf = {}
for rfestimator_tree in random_forest.estimators_:
    # print (rfestimator_tree.tree_.feature)
    try:
        top_predictors_rf[rfestimator_tree.tree_.feature[0]] += 1
    except:
        top_predictors_rf[rfestimator_tree.tree_.feature[0]] = 1

print(x_train_set_for_RF.columns)
print(top_predictors_rf)

In [None]:
important_columns = []
for k in top_predictors_rf.keys():
    important_columns.append(x_train_set_for_RF.columns[k])
print (important_columns)

In [None]:
important_columns_imp = [] 
for i in top_predictors_rf.values():
    important_columns_imp.append(i)
print (important_columns_imp)

#### Normalization:

We observe that the important columns are being identified below as ['potential', 'dribbling', 'passing', 'shooting', 'defending', 'physic'] but the number of times those columns are identified as the primary column are different. In the next step we normalize them on a score of 0 to 1 to identify how important they are and then use the important columns to create a score for the club. 

In [None]:
important_columns_norm = [float(i)/sum(important_columns_imp) for i in important_columns_imp] 
important_columns_norm

In [None]:
club_test_set_with_rf_col_score = score_comparison_club_assign_scored_with_leagues[score_comparison_club_assign_scored_with_leagues['league_name'].isin(["English Premier League ", "German 1. Bundesliga ", "French Ligue 1 ", "Spain Primera Division ", "Italian Serie A "])]
club_test_set_with_rf_col_score.shape

In [None]:
club_train_set_with_rf_col_score = score_comparison_club_assign_scored_with_leagues[~score_comparison_club_assign_scored_with_leagues['league_name'].isin(["Division 1 European League "])]
club_train_set_with_rf_col_score.shape

##### Scoring Approach

Instead of scoring each parameter which is positive as a fixed value, we are going to only score the important columns and also we are going to give the score a value which is product of the score change and importance normalized. 
Also we are now going to create a Train Set and Test set with the scores for the values.

In [None]:
club_train_set_with_rf_col_score["score"] = 0
scale = len(important_columns)
i = 0
for column in important_columns:
        club_train_set_with_rf_col_score.loc[club_train_set_with_rf_col_score[column] > 0, 'score'] += club_train_set_with_rf_col_score[column] * important_columns_norm[i]
        club_train_set_with_rf_col_score.loc[club_train_set_with_rf_col_score[column] == 0, 'score'] += 0            
        club_train_set_with_rf_col_score.loc[club_train_set_with_rf_col_score[column] < 0, 'score'] -= club_train_set_with_rf_col_score[column] * important_columns_norm[i]            
        i += 1
club_train_set_with_rf_col_score.head()
club_train_set_with_rf_col_score.shape

In [None]:
club_test_set_with_rf_col_score["score"] = 0
scale = len(important_columns)
for column in important_columns:
        club_test_set_with_rf_col_score.loc[club_test_set_with_rf_col_score[column] > 0, 'score'] += club_test_set_with_rf_col_score[column] * scale
        club_test_set_with_rf_col_score.loc[club_test_set_with_rf_col_score[column] == 0, 'score'] += 0            
        club_test_set_with_rf_col_score.loc[club_test_set_with_rf_col_score[column] < 0, 'score'] -= club_test_set_with_rf_col_score[column] * scale 
        scale -= 1
club_test_set_with_rf_col_score.head()
club_test_set_with_rf_col_score.shape

##### Normalization of Generated Scores
The problem states that the score for the Staff should not be more than 100 so we are normalizing the scores to a max value of 100 and between 0 to 100. The scores are normalized for both test and train sets.


In [None]:
# club_train_set_with_rf_col_score = club_train_set_with_rf_col_score.fillna()
x_train_with_rf_col_score = club_train_set_with_rf_col_score.drop(["score", "club", "league_name", "url"], axis=1)
cols_to_norm = ['score']
club_train_set_with_rf_col_score[cols_to_norm] = club_train_set_with_rf_col_score[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
club_train_set_with_rf_col_score[cols_to_norm] = club_train_set_with_rf_col_score[cols_to_norm] * 100
y_train_with_rf_col_score = club_train_set_with_rf_col_score["score"]
print(x_train_with_rf_col_score.shape)
print(y_train_with_rf_col_score.shape)

In [None]:
club_train_set_with_rf_col_score.columns

In [None]:
display(max(y_train_with_rf_col_score))

##### Linear Regression Model

Using the Linear Regression Model we are going to check the effectiveness of the scoring mechanism that was generated. 

The approach is to use the train set and fit the model on the training set and then compare if the test set gets values similar to our calculated values.

In [None]:
linreg = LinearRegression()
# Fitting the model on the train data and the train output.
linreg.fit(x_train_with_rf_col_score, y_train_with_rf_col_score)

# Checking the MSE values on the train data
y_train_pred_with_rf_col_score = linreg.predict(x_train_with_rf_col_score)

mse_train = mean_squared_error(y_train_with_rf_col_score, y_train_pred_with_rf_col_score)

# Now preparing the test values including calculating the actual score and normalizing
club_test_set_with_rf_col_score = club_test_set_with_rf_col_score.dropna()
x_test_with_rf_col_score = club_test_set_with_rf_col_score.drop(["score", "club", "league_name", "url"], axis=1)
cols_to_norm = ['score']
club_test_set_with_rf_col_score[cols_to_norm] = club_test_set_with_rf_col_score[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
club_test_set_with_rf_col_score[cols_to_norm] = club_test_set_with_rf_col_score[cols_to_norm] * 100
y_test_with_rf_col_score = club_test_set_with_rf_col_score["score"]

# predicting the score
y_test_pred_with_rf_col_score = linreg.predict(x_test_with_rf_col_score)

# Determine the error.
mse_test = mean_squared_error(y_test_with_rf_col_score, y_test_pred_with_rf_col_score)
print (f"Train MSE for the Scores calculated via Random Forest Significant Params: {mse_train:.4f}")
print (f"Test MSE for the Scores calculated via Random Forest Significant Params: {mse_test:.4f}")

r2 = r2_score(y_test_with_rf_col_score, y_test_pred_with_rf_col_score)
print(r2)

### Model Results - Ridge with Permutation Importance Factors
In this approach, we decided to use the eli5 library to identify the important parameters identified by the Ridge Model using the Permutation Importance and use the weights to determine the importance. 

In [None]:
x_poly.shape
#x_train_set_for_coeffs

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

seed = 42

# perm = PermutationImportance(random_forest,random_state=seed,n_iter=10).fit(x_train_set_for_RF, y_train_set_for_RF)
# eli5.show_weights(perm,feature_names=x_train_set_for_RF.columns.tolist())

ridge_reg = Ridge(alpha=10,normalize=True)

ridge_reg.fit(x_train_set_for_RF, y_train_set_for_RF)

print (ridge_reg.coef_)

perm = PermutationImportance(ridge_reg,random_state=seed,n_iter=10).fit(x_train_set_for_RF, y_train_set_for_RF)
eli5.show_weights(perm, feature_names=x_train_set_for_RF.columns.tolist())

# print (perm.feature_importances)

In [None]:
# print (perm.feature_importances_)

In [None]:
# x_train_set_for_RF.columns

In [None]:
perm_feature_importances_norm = [float(i)/sum(perm.feature_importances_) for i in perm.feature_importances_] 

In [None]:
club_test_set_with_eli5_col_score = score_comparison_club_assign_scored_with_leagues[score_comparison_club_assign_scored_with_leagues['league_name'].isin(["English Premier League ", "German 1. Bundesliga ", "French Ligue 1 ", "Spain Primera Division ", "Italian Serie A "])]
club_test_set_with_eli5_col_score.shape

In [None]:
club_train_set_with_eli5_col_score = score_comparison_club_assign_scored_with_leagues[~score_comparison_club_assign_scored_with_leagues['league_name'].isin(["Division 1 European League "])]
club_train_set_with_eli5_col_score.shape

In [None]:
club_train_set_with_eli5_col_score["score"] = 0
i = 0 
for column in x_train_set_for_RF.columns:
        club_train_set_with_eli5_col_score.loc[club_train_set_with_eli5_col_score[column] > 0, 'score'] += club_train_set_with_eli5_col_score[column] * perm_feature_importances_norm[i]
        club_train_set_with_eli5_col_score.loc[club_train_set_with_eli5_col_score[column] == 0, 'score'] += 0             
        club_train_set_with_eli5_col_score.loc[club_train_set_with_eli5_col_score[column] < 0, 'score'] -= club_train_set_with_eli5_col_score[column] * perm_feature_importances_norm[i]  
        i += 1
club_train_set_with_eli5_col_score.head()
club_train_set_with_eli5_col_score.shape

In [None]:
club_test_set_with_eli5_col_score["score"] = 0
i = 0 
for column in x_train_set_for_RF.columns:
        club_test_set_with_eli5_col_score.loc[club_test_set_with_eli5_col_score[column] > 0, 'score'] += club_test_set_with_eli5_col_score[column] * perm_feature_importances_norm[i]
        club_test_set_with_eli5_col_score.loc[club_test_set_with_eli5_col_score[column] == 0, 'score'] += 0            
        club_test_set_with_eli5_col_score.loc[club_test_set_with_eli5_col_score[column] < 0, 'score'] -= club_test_set_with_eli5_col_score[column] * perm_feature_importances_norm[i]            
        i += 1
club_test_set_with_eli5_col_score.head()
club_test_set_with_eli5_col_score.shape

In [None]:
# club_train_set_with_eli5_col_score = club_train_set_with_eli5_col_score.fillna(0)
x_train_with_eli5_col_score = club_train_set_with_eli5_col_score.drop(["score", "club", "league_name", "url"], axis=1)
cols_to_norm = ['score']
club_train_set_with_eli5_col_score[cols_to_norm] = club_train_set_with_eli5_col_score[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
club_train_set_with_eli5_col_score[cols_to_norm] = club_train_set_with_eli5_col_score[cols_to_norm] * 100
y_train_with_eli5_col_score = club_train_set_with_eli5_col_score["score"]
print(x_train_with_eli5_col_score.shape)
print(y_train_with_eli5_col_score.shape)

In [None]:
y_train_with_eli5_col_score.describe()

In [None]:
# max(y_train_with_eli5_col_score)

In [None]:
linreg = LinearRegression()
linreg.fit(x_train_with_eli5_col_score, y_train_with_eli5_col_score)

y_train_pred_with_eli5_col_score = linreg.predict(x_train_with_eli5_col_score)

mse_train = mean_squared_error(y_train_with_eli5_col_score, y_train_pred_with_eli5_col_score)
# print (mse_train)

club_test_set_with_eli5_col_score = club_test_set_with_eli5_col_score.dropna()
# Normalize the Score.
x_test_with_eli5_col_score = club_test_set_with_eli5_col_score.drop(["score", "club", "league_name", "url"], axis=1)
cols_to_norm = ['score']
club_test_set_with_eli5_col_score[cols_to_norm] = club_test_set_with_eli5_col_score[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
club_test_set_with_eli5_col_score[cols_to_norm] = club_test_set_with_eli5_col_score[cols_to_norm] * 100
y_test_with_eli5_col_score = club_test_set_with_eli5_col_score["score"]

y_test_pred_with_eli5_col_score = linreg.predict(x_test_with_eli5_col_score)

mse_test = mean_squared_error(y_test_with_eli5_col_score, y_test_pred_with_eli5_col_score)
print (f"Train MSE for the Scores calculated via ELI5 Permutation Combination: {mse_train:.4f}")
print (f"Test MSE for the Scores calculated via ELI5 Permutation Combination: {mse_test:.4f}")

r2 = r2_score(y_test_with_eli5_col_score, y_test_pred_with_eli5_col_score)
print(r2)

In [None]:
display(club_test_set_with_eli5_col_score)

#### Model Results

We will now find the top 10 clubs performing at each age group to determine the best performance for every age.

We are going to use the scores to determine the clubs which have done.
The graphs generated are:
1. Scores vs Age which identifies what is the club distribution and the scores for clubs for Age.
2. Histogram of Clubs and in what score buckets do clubs tend to lie
3. A Table of Top 10 Clubs for each player age available in the Dataset
4. Clubs which have done better over all ages.

In [None]:
def displayModelTestScoreAgeScatter(club_test_set_model):
    plt.scatter(club_test_set_model["age"], club_test_set_model["score"])
    plt.xlabel("Age")
    plt.ylabel("Score")
    plt.title("Club Scores at different Ages")
    plt.show()

def displayModelTestScoreClubHistogram(club_test_set_model):
    plt.hist(club_test_set_model[club_test_set_model["score"] > 20]["score"], bins=20)
    plt.xlabel("Score")
    plt.ylabel("Number of Clubs")
    plt.title("Histogram showing number of clubs with Score greater than 20")
    plt.show()
    
def displayModelTestScoreClubHistogramByLeague(club_test_set_model):
    league_names = club_test_set_model.league_name.unique()
    league_scores = []
    for league_name in league_names:
        league_scores.append(club_test_set_model[club_test_set_model["league_name"] == league_name]["score"].values)

    plt.figure(figsize=(20,10))
    plt.hist(league_scores, bins = 10, histtype='bar', label=league_names)
    plt.xlabel("Score")
    plt.ylabel("Number of Clubs")

    plt.legend()
    plt.show()

def displayTop10ClubsForEachAge(club_test_set_model):

    fig, axs = plt.subplots(5,3,figsize=(32,16))
    collabel=("Club", "Score")
    i = 0
    j = 0
    for i in [0,1,2,3,4]: 
        for j in [0,1,2]:
            axs[i][j].axis('tight')
            axs[i][j].axis('off')
            age = i*3+j+16
            club_test_data_for_age = club_test_set_model[club_test_set_model["age"] == age].sort_values(by="score", ascending = False).head(10)
            outof = len(club_test_set_model[club_test_set_model["age"] == age])
            result = club_test_data_for_age[["club","score"]]
            result['score'] = result['score'].map('{:.4f}'.format)
            axs[i][j].table(cellText=result.values, colLabels=collabel, loc='center',colWidths=[0.3 for x in collabel])
            axs[i][j].set_title(f"Clubs identified as top {len(club_test_data_for_age)} out of {outof} for Age - {age} ")

    plt.show()
    
def displayBestClubsOverAllAges(club_test_set_model):
    # Finding the Clubs that performed best overall at all ages
    club_test_set_model = club_test_set_model.drop_duplicates()
    all_ages_club_df = pd.DataFrame()
    for age in range(16, 41):
        club_test_data_for_age = club_test_set_model[club_test_set_model["age"] == age].sort_values(by="score", ascending = False).head(10)
        all_ages_club_df=all_ages_club_df.append(club_test_set_model, ignore_index=True)

    all_ages_club_mean_scores_df = all_ages_club_df.groupby(by=["club"]).mean()[["score"]]
    result_df = all_ages_club_mean_scores_df.sort_values(by="score", ascending=False)
    display(result_df.head(10))
    

#### Random Forest Model

In [None]:
displayModelTestScoreAgeScatter(club_test_set_with_rf_col_score)
displayModelTestScoreClubHistogram(club_test_set_with_rf_col_score)
displayModelTestScoreClubHistogramByLeague(club_test_set_with_rf_col_score)
displayTop10ClubsForEachAge(club_test_set_with_rf_col_score)

#### Parameters Selected by ELI5 Permutation


In [None]:
displayModelTestScoreAgeScatter(club_test_set_with_eli5_col_score)
displayModelTestScoreClubHistogram(club_test_set_with_eli5_col_score)
displayModelTestScoreClubHistogramByLeague(club_test_set_with_eli5_col_score)
displayTop10ClubsForEachAge(club_test_set_with_eli5_col_score)


In [None]:
displayBestClubsOverAllAges(club_test_set_with_rf_col_score)
displayBestClubsOverAllAges(club_test_set_with_eli5_col_score)