## Imports

Import the necessary libraries.

In [None]:
# ! pip install eli5

In [None]:
%matplotlib inline
import math
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import bs4 as bs
import urllib.request
import warnings
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

## Read Data

Read the data for all the players. Each season's file is read into its own DataFrame.

In [None]:
player15_df = pd.read_csv('./data/players_15.csv')
player16_df = pd.read_csv('./data/players_16.csv')
player17_df = pd.read_csv('./data/players_17.csv')
player18_df = pd.read_csv('./data/players_18.csv')
player19_df = pd.read_csv('./data/players_19.csv')
player20_df = pd.read_csv('./data/players_20.csv')

## Data Cleanup

The purpose of the cells below is to drop the columns that have been identified as not to be used.

In addition to the columns indicated in the statement, the team also feels that the club staff's capabilities do not depend on other attributes such as:
1. Player Height
1. Player Weight
1. Player Nationality

So these columns have also been identified to be dropped.

In [None]:
def clean_player_df(player_df):
    '''
    Takes a player DataFrame and drops the columns named in the global columns_to_drop list, which is defined in a later cell.
    '''
    return player_df.drop(columns_to_drop, axis=1)

Cleaning up all the dataframes to remove the columns identified.
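
As a minimal sketch of the cleanup step, the snippet below drops a list of columns from a toy dataframe. The data and the name `toy_columns_to_drop` are invented for illustration. Passing `errors="ignore"` is an optional variation that the notebook's `clean_player_df` does not use; it skips columns that are missing from a given file instead of raising an error.

```python
import pandas as pd

# Toy data standing in for one season's player file (values are invented).
toy_df = pd.DataFrame({
    "sofifa_id": [1, 2],
    "short_name": ["A. Player", "B. Player"],
    "height_cm": [180, 175],
    "overall": [70, 82],
})

# Hypothetical drop list; "weight_kg" is deliberately absent from toy_df.
toy_columns_to_drop = ["height_cm", "weight_kg"]

# errors="ignore" skips names that are not present instead of raising KeyError.
toy_cleaned_df = toy_df.drop(columns=toy_columns_to_drop, errors="ignore")
print(list(toy_cleaned_df.columns))
```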

## Data Analysis

For the purpose of data analysis, identifying the numerical columns simplifies the quantitative analysis of the staff's capabilities. The steps below identify those columns and run some preliminary analysis on them, such as minimum and maximum values.

### Step 1: Identify Numerical Columns

#### Inference from the step:
The list above indicates that about 24 columns are numerical and can be leveraged for quantitative analysis. 
However, further investigation of these numerical columns indicates that some of them could be added to the cleanup list, as they do not really reflect the staff's capability to promote talent.

#### Inference Action:
Add the columns to the list of columns to be cleaned up and remove the columns.

Columns identified:
1. value_eur
1. release_clause_eur
1. team_jersey_number
1. contract_valid_until
1. nation_jersey_number

In [None]:
columns_to_drop = ["sofifa_id", "player_url", "long_name", "wage_eur", "real_face", "height_cm", "weight_kg", "nationality", 
                    "value_eur", "release_clause_eur", "team_jersey_number", "contract_valid_until","nation_jersey_number"]

In [None]:
# Further cleanup
player15_cleaned_df = clean_player_df(player15_df)
player16_cleaned_df = clean_player_df(player16_df)
player17_cleaned_df = clean_player_df(player17_df)
player18_cleaned_df = clean_player_df(player18_df)
player19_cleaned_df = clean_player_df(player19_df)
player20_cleaned_df = clean_player_df(player20_df)

In [None]:
player15_cleaned_df.head()

### Step 2: Analyze the Numerical Columns

To check the quality of the data available in those numerical columns, we analyze a couple of years to see the type and quality of the data. 

#### Inferences:
1. Running `describe` on multiple years indicates that mentality_composure is not a column we can rely on: the values are not captured in the earlier years, and further investigation showed that the data is stored in a split-value format (e.g. 90+3). It can therefore be excluded.
1. International reputation is a categorical variable with values 1 to 5.
1. Weak foot is also a categorical variable with values 1 to 5.
1. The numerical columns 'gk_speed', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_diving' have very few values. On inspecting the players associated with those values, it was identified that these players are goalkeepers and that the columns are goalkeeper attributes. As the question is about the effectiveness of the entire staff, the approach was to drop these columns as well, both for lack of data and to avoid a focus on skill development for goalkeepers.
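
The sparse-column inference above can be checked with the `count` row of `describe()`. The following is a sketch on invented data, not the notebook's actual output; the column values and the `coverage` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: only one of four players (a goalkeeper) has a gk_diving value.
toy_df = pd.DataFrame({
    "overall": [70, 82, 65, 77],
    "gk_diving": [np.nan, np.nan, 71.0, np.nan],
})

summary = toy_df.describe()

# The "count" row reports the number of non-null values per column,
# so dividing by the row count gives the share of players with data.
coverage = summary.loc["count"] / len(toy_df)
print(coverage)
```

A column whose coverage is far below 1.0 is a candidate for removal on the grounds of missing data.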

#### Inference Action
1. Remove the mentality_composure column. 
2. From a domain knowledge perspective, weak foot can be eliminated, as other factors would reflect whether the staff has improved a player's weak foot score.
3. International reputation is also dropped because we are measuring the capabilities of the team's staff, not of its PR team.
~~
4. Remove the gk_* values from the columns list.  
5. Also as the Goal Keepers are missing the information of other attributes, we might not be able to get metrics on the staffs work on the goal keepers. 
~~

In [None]:
columns_to_drop = ["player_url", "long_name", "wage_eur", "real_face", "height_cm", "weight_kg", "nationality", 
                    "value_eur", "release_clause_eur", "team_jersey_number", "contract_valid_until","nation_jersey_number",
                      "mentality_composure", "weak_foot", "international_reputation"]

In [None]:
# Reloading the data so that we can reclean
player15_df = pd.read_csv('./data/players_15.csv')
player16_df = pd.read_csv('./data/players_16.csv')
player17_df = pd.read_csv('./data/players_17.csv')
player18_df = pd.read_csv('./data/players_18.csv')
player19_df = pd.read_csv('./data/players_19.csv')
player20_df = pd.read_csv('./data/players_20.csv')

In [None]:
# Further cleanup
player15_cleaned_df = clean_player_df(player15_df)
player16_cleaned_df = clean_player_df(player16_df)
player17_cleaned_df = clean_player_df(player17_df)
player18_cleaned_df = clean_player_df(player18_df)
player19_cleaned_df = clean_player_df(player19_df)
player20_cleaned_df = clean_player_df(player20_df)

### Skills Columns

The skills columns in the dataframe indicate the different skills with a numerical value for the Skills. The team thinks that the skills are an important part of the development of every player. The player skills once improved will contribute to the overall improvement of the player and hence its overall rating. 

#### Observations:
1. Non Numeric columns: 
There are multiple skills columns and they are non-numeric. In fact, the columns have values with + and - signs, indicating that the data represents positive and negative traits.

#### Observation Action:

1. The decision is to exclude the columns for the first set of determination and keep only the numeric columns in the list. 
2. Build a data frame for every year with only these numeric values in addition to columns which identify the player like name, club.
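
A small sketch of how split-value columns fall out of a numeric selection. The data is invented and the column name `position_rating` is hypothetical; `select_dtypes(include="number")` is used here as a shorthand for listing the individual integer and float dtypes.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "short_name": ["A. Player", "B. Player"],
    "overall": [70, 82],
    "pace": [68.0, 90.0],
    # Values such as "65+2" are stored as strings, not numbers.
    "position_rating": ["65+2", "80+3"],
})

# Keep only integer and float columns.
numeric_columns = list(toy_df.select_dtypes(include="number").columns)
non_numeric_columns = [c for c in toy_df.columns if c not in numeric_columns]

print(numeric_columns)
print(non_numeric_columns)
```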


In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_column_list = list(player15_cleaned_df.select_dtypes(include=numerics).columns)
print (pd.DataFrame(numeric_column_list))

#### Numerical Only Dataset

Creating a new Dataset with only the numerical columns, plus the identifying columns short_name and club:

1. age
1. overall
1. potential
1. skill_moves
1. pace
1. shooting
1. passing
1. dribbling
1. defending
1. physic
1. short_name
1. club


In [None]:
# player_trait_columns = ["overall", "pace" "shooting","passing","dribbling" "defending"]

player_numeric_identification_columns = ['short_name', "club"] +  numeric_column_list
print (player_numeric_identification_columns)


In [None]:
player15_numeric_cleaned_df =  player15_cleaned_df[player_numeric_identification_columns]
player16_numeric_cleaned_df =  player16_cleaned_df[player_numeric_identification_columns]
player17_numeric_cleaned_df =  player17_cleaned_df[player_numeric_identification_columns]
player18_numeric_cleaned_df =  player18_cleaned_df[player_numeric_identification_columns]
player19_numeric_cleaned_df =  player19_cleaned_df[player_numeric_identification_columns]
player20_numeric_cleaned_df =  player20_cleaned_df[player_numeric_identification_columns]
numeric_column_list.remove("sofifa_id")


### Year Over Year Comparison

#### Approach
Since the purpose of the exercise is to identify the effectiveness of the staff, the approach is to identify the score differences of the players year over year. 

We are going to join data year over year and see the score differences. The approach is to create a dataframe that would have the name of the player the club and the difference in the rating of the numeric column for the player.

#### Decisions:
1. Initially the team considered only players who stayed at the club to get a gauge of the players' skill. However, given that not many players stayed at the same club, the decision was not to enforce the stay-at-club criterion. The credit for the increase is assigned to the club in the earlier year.
1. The score difference will be calculated for every numeric parameter including the values which are categorical.
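
The year-over-year join described above can be sketched on two tiny invented seasons. Only players present in both seasons survive the default inner merge, and the change is credited to the club from the earlier season, as decided above. The frame names `year1` and `year2` are made up for this example.

```python
import pandas as pd

# Two invented seasons; player 3 leaves the data and player 4 is new.
year1 = pd.DataFrame({"sofifa_id": [1, 2, 3], "club": ["X", "X", "Y"], "overall": [70, 80, 65]})
year2 = pd.DataFrame({"sofifa_id": [1, 2, 4], "club": ["X", "Z", "Y"], "overall": [73, 79, 60]})

# Columns shared by both seasons get the suffixes _1 (earlier) and _2 (later).
joined = year1.merge(year2, on="sofifa_id", suffixes=("_1", "_2"))

diff = pd.DataFrame({
    "sofifa_id": joined["sofifa_id"],
    "club": joined["club_1"],  # credit goes to the earlier club
    "diff_overall": joined["overall_2"] - joined["overall_1"],
})
print(diff)
```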

In [None]:
# Define a year over year dataframe in which the values will be stored
year_over_year_df = pd.DataFrame()
all_years_df = pd.DataFrame()
# year_over_year_df = player15_numeric_cleaned_df["short_name"]
# display(year_over_year_df)

In [None]:
def create_year_over_year_df(year1_df, year2_df, year):
    '''
    Joins two consecutive seasons on sofifa_id and returns the per-player change in every numeric column.
    '''
    year1_year2_joined_df = year1_df.merge(year2_df, on="sofifa_id", suffixes=('_1', '_2'))
    # Build a fresh dataframe on every call. Writing into the shared global dataframe
    # aligns every later season to the first season's row index and corrupts the rows.
    result_df = pd.DataFrame(index=year1_year2_joined_df.index)
    result_df["short_name"] = year1_year2_joined_df["short_name_1"]
    result_df["club"] = year1_year2_joined_df["club_1"]
    result_df["age"] = year1_year2_joined_df["age_1"]
    result_df["year_over_year"] = year
    for column in numeric_column_list:
        result_df[f"diff_{column}"] = year1_year2_joined_df[f"{column}_2"] - year1_year2_joined_df[f"{column}_1"]
        result_df[column] = year1_year2_joined_df[f"{column}_1"]
    return result_df

# DataFrame.append was removed in pandas 2.0, so the seasons are stacked with pd.concat instead
season_pairs = [
    (player15_numeric_cleaned_df, player16_numeric_cleaned_df, 2016),
    (player16_numeric_cleaned_df, player17_numeric_cleaned_df, 2017),
    (player17_numeric_cleaned_df, player18_numeric_cleaned_df, 2018),
    (player18_numeric_cleaned_df, player19_numeric_cleaned_df, 2019),
    (player19_numeric_cleaned_df, player20_numeric_cleaned_df, 2020),
]
all_years_df = pd.concat([create_year_over_year_df(y1, y2, year) for y1, y2, year in season_pairs], ignore_index=True)



In [None]:
all_years_df.shape

In [None]:
all_years_df.head()
numeric_column_list = list(player15_cleaned_df.select_dtypes(include=numerics).columns)
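for column in numeric_column_list: 
    if (column != 'sofifa_id'):
        # fillna returns a new Series, so assign it back for the change to take effect
        all_years_df[f"diff_{column}"] = all_years_df[f"diff_{column}"].fillna(0)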

### Score Difference Matrix

The year over year dataframe has all the score changes for the Club for every individual player that has played for the club.

#### Approach 1
Find the number of players who have a positive change for every metric. This means creating a histogram and counting how many players have a positive change. We only count players whose rating for a metric changed by at least one standard deviation in the positive direction. That gives an indication of how important the metric is to the staff's contribution.
##### Decision
Instead of only using the players with a positive score, we decided to use both negative and positive scores so that clubs which are performing badly are penalized.
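
A rough sketch of the counting idea in Approach 1, on invented rating changes. It assumes the threshold is one sample standard deviation of the changes, measured from zero; the notebook does not spell out the exact definition, so treat that as an assumption.

```python
import pandas as pd

# Invented year-over-year changes in overall rating for eight players.
changes = pd.Series([0, 1, -1, 5, 0, -4, 1, 0], name="diff_overall")

# pandas uses the sample standard deviation (ddof=1) by default.
threshold = changes.std()

# Count large improvements and, following the decision above, large declines too.
improved = int((changes >= threshold).sum())
declined = int((changes <= -threshold).sum())

print(f"threshold={threshold:.3f} improved={improved} declined={declined}")
```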

#### Approach 2
For every club find the average change in ratings for each of the metrics and then order the clubs to identify which clubs have maximum change and order those clubs. This will ensure that we not only count players which are improving but also players which are losing points.
Plot the top 10 clubs that have shown the most change.
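
Approach 2 can be sketched as a group-by followed by a sort. The clubs and values below are invented, and the example keeps the top 2 clubs rather than the top 10 to stay small.

```python
import pandas as pd

toy = pd.DataFrame({
    "club": ["X", "X", "Y", "Y", "Z"],
    "diff_overall": [3, -1, 2, 4, -2],
})

# Average change per club; declining players pull the club's mean down.
club_mean_change = (
    toy.groupby("club")["diff_overall"]
    .mean()
    .sort_values(ascending=False)
)

top_clubs = club_mean_change.head(2)
print(top_clubs)
```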

#### Approach 3
Define a Linear Regression model to determine what should be the change in the overall rating for a player at an age.

Find the change in the Players metrics. Based upon the metrics find the overall score change that is desired based upon the individual factors using a Linear Regression model. 



#### Approach 2

In [None]:
all_years_df.head()

#### Observation:

From the graphs above we observe:
1. Older players tend to show smaller changes in all the metrics. In fact, players between the ages of 30 and 35 tend to show smaller changes in the skill area improvements. 
2. Younger players do tend to show higher changes in the skill profile values.
3. Certain higher values for differences might skew the data and might have to be removed.

#### Actions:
1. Age definitely plays a role in the calculation of the score changes and hence we are going to calculate the average on every age.
1. Average the score changes for a club at every age and identify the clubs which are performing better. 

In [None]:
all_years_df = all_years_df.fillna(0)

In [None]:
all_years_df.head()

In [None]:
year_over_year_metrics_averaged = all_years_df
year_over_year_metrics_averaged = year_over_year_metrics_averaged.drop(["short_name"], axis=1)
# Average (rather than sum) per club and year, so that clubs with larger squads are not favoured
year_over_year_metrics_averaged = year_over_year_metrics_averaged.groupby(["club", "year_over_year"]).mean().reset_index()
year_over_year_metrics_averaged = year_over_year_metrics_averaged.dropna()

display (year_over_year_metrics_averaged.head())

In [None]:
# Finding Clubs which have done better in all age group
sortable_columns = []
for column in all_years_df.columns:
    if ("diff_" in column and column != "diff_age"):
        sortable_columns.append(column)
best_clubs_any_year = year_over_year_metrics_averaged.sort_values(by=sortable_columns, ascending = False).head(10)
display (best_clubs_any_year)

## Test and Train Set

The problem states that we are using the data from the Division 1 European leagues. To make that happen with the cleaned data, we separate the club data into a test set, containing the clubs from those leagues, and a train set, containing the clubs from all other leagues.

#### Steps:
1. Investigation of the dataset has indicated that the League information is not in the data file and needs to be fetched from the web using the scraping approach.
2. Once the scraping pulls data the data is going to be appended to the club score changes dataframe. 

#### Model Options
1. Linear Model using fixed values based upon the score changes.
2. Linear Model but instead of using fixed values use the coefficients of the linear equation that generates the overall score and then assign the score based upon the coefficient values.


#### Approach:
1. Read the Club Data and the Leagues data files
2. Match the club and the leagues in which the clubs play
3. Create a test dataset and a train dataset: clubs associated with the leagues marked as test data go into the test dataset and are excluded from the train dataset.
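
The split by league can be sketched as follows. The clubs and the two "other" league names are invented, and stripping whitespace from the league names is an assumption added for robustness, since scraped names may carry trailing spaces.

```python
import pandas as pd

toy = pd.DataFrame({
    "club": ["A", "B", "C", "D"],
    "league_name": ["English Premier League ", "Some Other League", "Italian Serie A ", "Another League"],
    "diff_overall": [1.0, 0.5, -0.5, 2.0],
})

# Leagues whose clubs are held out for testing (shortened list for the example).
test_leagues = ["English Premier League", "Italian Serie A"]

# Strip stray whitespace before matching so that "Italian Serie A " still matches.
is_test = toy["league_name"].str.strip().isin(test_leagues)

# Copy the slices so that new columns can be added later without warnings.
test_set = toy[is_test].copy()
train_set = toy[~is_test].copy()

print(test_set["club"].tolist(), train_set["club"].tolist())
```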


### Web Scraping

The leagues and teams dataset does not have any correlation with the clubs that play in a league and hence we need to scrape that data from the sofifa site.
The url field in the dataset can be used to make a web request and then download the data.

In [None]:
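import requests
def parseTeamNameFromUrl(team_url_id):
    '''
    Fetches the sofifa team page for the given team id and returns the team name from its header.
    '''
    source = requests.get(f"https://sofifa.com/team/{team_url_id}", timeout=30)
    # Name the parser explicitly so that the result does not depend on which parsers are installed
    soup = bs.BeautifulSoup(source.content, "html.parser")
    team_info_divs = soup.find_all("div", {"class": "info"})
    team_name = 'not found'
    for div in team_info_divs:
        heading = div.find("h1")
        # Skip info blocks without a heading instead of failing on None
        if heading is not None:
            team_name = heading.text
    print (team_name)
    return team_name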

In [None]:
parseTeamNameFromUrl(10030)

#### Web Scraping End

Once the web scraping is done, the result is saved to a file so that the scraping does not have to run on every execution. The commented-out cells below show the lambda-based approach that calls the scraper for every row and writes the output. That step is not run every time, because the data has already been pulled down to the local machine and is simply read back here.
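
The scrape-once-then-reload pattern can be sketched with the standard library alone. Everything here is hypothetical: `load_or_build`, `fake_scrape` and the cache file name are illustrative stand-ins, and no network request is made.

```python
import csv
import os
import tempfile

def load_or_build(cache_path, build_rows):
    """Return cached rows if the file exists, otherwise build them once and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, newline="") as f:
            return list(csv.DictReader(f))
    rows = build_rows()
    with open(cache_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []

def fake_scrape():
    # Stand-in for the expensive scraping step; records how often it runs.
    calls.append(1)
    return [{"url": "1", "club": "Club One"}, {"url": "2", "club": "Club Two"}]

cache_file = os.path.join(tempfile.mkdtemp(), "teams_cache.csv")
first = load_or_build(cache_file, fake_scrape)   # builds and saves
second = load_or_build(cache_file, fake_scrape)  # reads the saved file
print(len(calls), second)
```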

In [None]:
leagues_with_club_df = pd.read_csv('./data/teams_leagues_clubs.csv')
display(leagues_with_club_df)

In [None]:
# leagues_df["club"] = leagues_df.apply(lambda x: parseTeamNameFromUrl(x['url']),axis=1)


In [None]:
# leagues_df.to_csv('./data/teams_leagues_clubs.csv', index=False)
# display(leagues_df)


In [None]:
teams_and_leagues = pd.read_csv('./data/teams_and_leagues.csv')
display(teams_and_leagues)

Add the league information into the dataframe to be able to sort the data.
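
One caveat worth checking: `merge` performs an inner join by default, so clubs without a matching league row disappear silently. The sketch below, on invented data, uses `how="left"` with `indicator=True` to list such clubs. It is a diagnostic suggestion, not part of the notebook's pipeline.

```python
import pandas as pd

club_changes = pd.DataFrame({"club": ["X", "Y", "Z"], "diff_overall": [1.0, 3.0, -2.0]})
club_leagues = pd.DataFrame({"club": ["X", "Y"], "league_name": ["League A", "League B"]})

# indicator=True adds a "_merge" column that records where each row came from.
merged = club_changes.merge(club_leagues, on="club", how="left", indicator=True)

# Clubs found only on the left side have no league information.
unmatched = merged.loc[merged["_merge"] == "left_only", "club"].tolist()
print(unmatched)
```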

In [None]:
year_over_year_metrics_averaged_with_leagues = year_over_year_metrics_averaged.merge(leagues_with_club_df, on="club", suffixes=('_1', '_2'))
display(year_over_year_metrics_averaged_with_leagues[["club", "year_over_year", "overall", "diff_overall", "skill_moves", "diff_skill_moves"]])

In [None]:
# Simple Scoring
year_over_year_metrics_averaged_with_leagues_aggregated_over_all_years = year_over_year_metrics_averaged_with_leagues.groupby(["club"]).sum().reset_index()
year_over_year_metrics_averaged_with_leagues_aggregated_over_all_years

## Helper Code

In [None]:
# Since we are splitting based upon data points and not a percentage, this method would get the x and y train sets
def train_test_split_custom():
    # Clubs from the five top-division European leagues form the test set; every other club is used for training
    test_leagues = ["English Premier League ", "German 1. Bundesliga ", "French Ligue 1 ", "Spain Primera Division ", "Italian Serie A "]
    in_test_league = year_over_year_metrics_averaged_with_leagues['league_name'].isin(test_leagues)
    # Copy the slices so that adding the "predicted" column later does not trigger SettingWithCopyWarning
    club_test_set = year_over_year_metrics_averaged_with_leagues[in_test_league].copy()
    club_train_set = year_over_year_metrics_averaged_with_leagues[~in_test_league].copy()

#     x_train = club_train_set.drop(["club","year_over_year","age","diff_age","diff_overall","league_name"], axis=1)
#     y_train = club_train_set["diff_overall"]

#     x_test = club_test_set.drop(["club","year_over_year","age","diff_age","diff_overall","league_name"], axis=1)
#     y_test = club_test_set["diff_overall"]
    
    x_train = club_train_set.drop(["club","diff_age","diff_overall","league_name", "url"], axis=1)
    y_train = club_train_set["diff_overall"]

    x_test = club_test_set.drop(["club","diff_age","diff_overall","league_name", "url"], axis=1)
    y_test = club_test_set["diff_overall"]    
    
    return club_train_set, club_test_set, x_train, y_train, x_test, y_test

In [None]:
def generate_scored_df( train_df, test_df):
    
    club_test_set_results = test_df.copy()
    # Difference between the model times the diff overall
    # club_test_set_results["score"] = club_test_set_results["diff_overall"] *  (club_test_set_results["diff_overall"] - club_test_set_results["predicted"]) 
    # Product of predicted versus overall
    # club_test_set_results["score"] = club_test_set_results["diff_overall"] * club_test_set_results["predicted"] 
    # Difference between predicted versus overall
    # club_test_set_results["score"] = (club_test_set_results["diff_overall"] - club_test_set_results["predicted"]) 
    # Difference between the model and sum of diff overall
    club_test_set_results["score"] = club_test_set_results["diff_overall"] + (club_test_set_results["diff_overall"] - club_test_set_results["predicted"]) 
    club_test_set_results["score_raw"] = club_test_set_results["score"]
    club_test_set_results = club_test_set_results.drop(["year_over_year", "diff_age"], axis=1)    
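    # numeric_only=True skips text columns such as league_name, which cannot be averaged
    club_test_set_results = club_test_set_results.groupby(["club"]).mean(numeric_only=True).reset_index()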
#     cols_to_norm = ['score']
#     club_test_set_results[cols_to_norm] = club_test_set_results[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
#     club_test_set_results[cols_to_norm] = club_test_set_results[cols_to_norm] * 100
    
    club_test_set_results[['score']] = MinMaxScaler().fit_transform(club_test_set_results[['score']])
    club_test_set_results['score'] = club_test_set_results['score'] * 100
    
    club_train_set_results = train_df.copy()
    # Difference between the model times the diff overall
    # club_train_set_results["score"] = club_train_set_results["diff_overall"] *  (club_train_set_results["diff_overall"] - club_train_set_results["predicted"])
    # Difference between the model times the diff overall
    club_train_set_results["score"] = club_train_set_results["diff_overall"] +  (club_train_set_results["diff_overall"] - club_train_set_results["predicted"])    
    # Product of predicted versus overall
    # club_train_set_results["score"] = club_train_set_results["diff_overall"] * club_train_set_results["predicted"] 
    # Difference between predicted versus overall
    # club_train_set_results["score"] = (club_train_set_results["diff_overall"] - club_train_set_results["predicted"])    
    club_train_set_results["score_raw"] = club_train_set_results["score"]

    club_train_set_results = club_train_set_results.drop(["year_over_year", "diff_age"], axis=1)
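    # numeric_only=True skips text columns such as league_name, which cannot be averaged
    club_train_set_results = club_train_set_results.groupby(["club"]).mean(numeric_only=True).reset_index()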
#     cols_to_norm = ['score']
#     club_train_set_results[cols_to_norm] = club_train_set_results[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
#     club_train_set_results[cols_to_norm] = club_train_set_results[cols_to_norm] * 100
    club_train_set_results[['score']] = MinMaxScaler().fit_transform(club_train_set_results[['score']])
    club_train_set_results['score'] = club_train_set_results['score'] * 100

    
    return club_train_set_results, club_test_set_results

### Scoring Approach

The function here enumerates the scoring approaches and provides a method to select the approach. Default method is to use the difference overall + (difference overall - predicted difference overall)
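
As a worked example of the default approach, the plain-Python sketch below computes the raw score and rescales it to 0..100. The helper names `raw_score` and `scale_to_100` and the numbers are made up; the notebook itself uses `MinMaxScaler` for the rescaling.

```python
def raw_score(diff_overall, predicted):
    """Default approach: actual change plus the amount by which it beats the prediction."""
    return diff_overall + (diff_overall - predicted)

def scale_to_100(values):
    """Min-max scale a list of numbers to the range 0..100."""
    low, high = min(values), max(values)
    if high == low:
        # All values equal: there is no spread to scale.
        return [0.0 for _ in values]
    return [100.0 * (v - low) / (high - low) for v in values]

# Invented club-level numbers: actual and predicted change in overall rating.
actual = [2.0, 1.0, -1.0]
predicted = [1.0, 1.5, 0.0]

raw = [raw_score(a, p) for a, p in zip(actual, predicted)]
scaled = scale_to_100(raw)
print(raw, scaled)
```

The first club improved and beat its prediction, so it ends at the top of the scale; the third club declined and fell short, so it ends at the bottom.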

In [None]:
from enum import Enum
class ScoreApproach(Enum):
    DIFF_OVERALL_TIMES_DIFFERENCE = 1
    DIFF_OVERALL_TIMES_PREDICTED = 2
    DIFF_OVERALL_PLUS_DIFFERENCE = 3
    OVERALL_PLUS_DIFF_OVERALL_PLUS_DIFFERENCE = 4
    DIFF_OVERALL_PREDICTED = 5

In [None]:
def generate_scored_with_approach_selection_df( train_df, test_df, score_approach=ScoreApproach.DIFF_OVERALL_PLUS_DIFFERENCE):
    
    club_test_set_results = test_df.copy()
    if (score_approach == ScoreApproach.DIFF_OVERALL_TIMES_DIFFERENCE):
        # Difference between the model times the diff overall
        club_test_set_results["score"] = club_test_set_results["diff_overall"] *  (club_test_set_results["diff_overall"] - club_test_set_results["predicted"]) 
    elif (score_approach == ScoreApproach.DIFF_OVERALL_TIMES_PREDICTED):
        # Product of predicted versus overall
        club_test_set_results["score"] = club_test_set_results["diff_overall"] * club_test_set_results["predicted"] 
    elif (score_approach == ScoreApproach.DIFF_OVERALL_PLUS_DIFFERENCE):
        # Difference between the model and sum of diff overall
        club_test_set_results["score"] = club_test_set_results["diff_overall"] +  (club_test_set_results["diff_overall"] - club_test_set_results["predicted"]) 
    elif (score_approach == ScoreApproach.OVERALL_PLUS_DIFF_OVERALL_PLUS_DIFFERENCE):
        # Current overall rating plus diff overall plus the amount by which diff overall exceeds the prediction
        club_test_set_results["score"] = club_test_set_results["overall"] + club_test_set_results["diff_overall"] +  (club_test_set_results["diff_overall"] - club_test_set_results["predicted"])         
    elif (score_approach == ScoreApproach.DIFF_OVERALL_PREDICTED):
        # Difference between diff overall and the predicted value
        club_test_set_results["score"] = (club_test_set_results["diff_overall"] - club_test_set_results["predicted"])         

    club_test_set_results["score_raw"] = club_test_set_results["score"]
    club_test_set_results = club_test_set_results.drop(["year_over_year", "diff_age"], axis=1)    
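    # numeric_only=True skips text columns such as league_name, which cannot be averaged
    club_test_set_results = club_test_set_results.groupby(["club"]).mean(numeric_only=True).reset_index()
    # Normalize between 0 and 100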
    club_test_set_results[['score']] = MinMaxScaler().fit_transform(club_test_set_results[['score']])
    club_test_set_results['score'] = club_test_set_results['score'] * 100
    
    club_train_set_results = train_df.copy()
    # The train scores must use the train predictions; the test dataframe has already been aggregated per club
    if (score_approach == ScoreApproach.DIFF_OVERALL_TIMES_DIFFERENCE):
        # Diff overall times the amount by which it exceeds the prediction
        club_train_set_results["score"] = club_train_set_results["diff_overall"] *  (club_train_set_results["diff_overall"] - club_train_set_results["predicted"]) 
    elif (score_approach == ScoreApproach.DIFF_OVERALL_TIMES_PREDICTED):
        # Product of predicted versus overall
        club_train_set_results["score"] = club_train_set_results["diff_overall"] * club_train_set_results["predicted"] 
    elif (score_approach == ScoreApproach.DIFF_OVERALL_PLUS_DIFFERENCE):
        # Diff overall plus the amount by which it exceeds the prediction
        club_train_set_results["score"] = club_train_set_results["diff_overall"] +  (club_train_set_results["diff_overall"] - club_train_set_results["predicted"]) 
    elif (score_approach == ScoreApproach.OVERALL_PLUS_DIFF_OVERALL_PLUS_DIFFERENCE):
        # Current overall rating plus diff overall plus the amount by which diff overall exceeds the prediction
        club_train_set_results["score"] = club_train_set_results["overall"] + club_train_set_results["diff_overall"] +  (club_train_set_results["diff_overall"] - club_train_set_results["predicted"])         
    elif (score_approach == ScoreApproach.DIFF_OVERALL_PREDICTED):
        # Difference between diff overall and the predicted value
        club_train_set_results["score"] = (club_train_set_results["diff_overall"] - club_train_set_results["predicted"])  
        
    club_train_set_results["score_raw"] = club_train_set_results["score"]

    club_train_set_results = club_train_set_results.drop(["year_over_year", "diff_age"], axis=1)
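    # numeric_only=True skips text columns such as league_name, which cannot be averaged
    club_train_set_results = club_train_set_results.groupby(["club"]).mean(numeric_only=True).reset_index()
    # Normalize between 0 and 100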
    club_train_set_results[['score']] = MinMaxScaler().fit_transform(club_train_set_results[['score']])
    club_train_set_results['score'] = club_train_set_results['score'] * 100
    print(club_train_set_results.shape)
    print(club_test_set_results.shape)
    return club_train_set_results, club_test_set_results

### Model Analysis and Visualization Helpers

In [None]:
def displayModelTestScoreAgeScatter(club_test_set_model):
    plt.scatter(club_test_set_model["age"], club_test_set_model["score"])
    plt.xlabel("Age")
    plt.ylabel("Score")
    plt.title("Club Scores at different Ages")
    plt.show()

def displayModelTestScoreClubHistogram(club_test_set_model):
    plt.hist(club_test_set_model["score"], bins=20)
    plt.xlabel("Score")
    plt.ylabel("Number of Clubs")
    plt.title("Histogram showing number of clubs ")
    plt.show()
    
def displayModelTestScoreClubHistogramByLeague(club_test_set_model):
    league_names = club_test_set_model.league_name.unique()
    league_scores = []
    for league_name in league_names:
        league_scores.append(club_test_set_model[club_test_set_model["league_name"] == league_name]["score"].values)

    plt.figure(figsize=(20,10))
    plt.hist(league_scores, bins = 10, histtype='bar', label=league_names)
    plt.xlabel("Score")
    plt.ylabel("Number of Clubs")

    plt.legend()
    plt.show()

def displayTop10ClubsForEachAge(club_test_set_model):

#     fig, axs = plt.subplots(5,3,figsize=(32,16))
#     collabel=("Club", "Score")
#     i = 0
#     j = 0
#     for i in [0,1,2,3,4]: 
#         for j in [0,1,2]:
#             axs[i][j].axis('tight')
#             axs[i][j].axis('off')
#             age = i*3+j+16
#             club_test_data_for_age = club_test_set_model[club_test_set_model["age"] == age].sort_values(by="score", ascending = False).head(10)
#             outof = len(club_test_set_model[club_test_set_model["age"] == age])
#             result = club_test_data_for_age[["club","score"]]
#             result['score'] = result['score'].map('{:.4f}'.format)
#             axs[i][j].table(cellText=result.values, colLabels=collabel, loc='center',colWidths=[0.3 for x in collabel])
#             axs[i][j].set_title(f"Clubs identified as top {len(club_test_data_for_age)} out of {outof} for Age - {age} ")

    plt.show()
    
def displayBestClubsOverAllAges(club_test_set_model, top_n = 10):
    # Finding the Clubs that performed best overall at all ages
    club_test_set_model = club_test_set_model.drop_duplicates()
#     all_ages_club_df = pd.DataFrame()
#     for age in range(16, 41):
#         club_test_data_for_age = club_test_set_model[club_test_set_model["age"] == age].sort_values(by="score", ascending = False).head(10)
#         all_ages_club_df=all_ages_club_df.append(club_test_set_model, ignore_index=True)

#     all_ages_club_mean_scores_df = club_test_set_model.groupby(by=["club"]).mean()[["score"]]
    all_ages_club_mean_scores_df = club_test_set_model[["club", "diff_overall", "predicted",  "score_raw",  "score", "url"]]

    result_df = all_ages_club_mean_scores_df.sort_values(by="score", ascending=False)
    print (f"Top {top_n} Clubs")
    display(result_df.head(top_n))

## Prediction Models

The scoring approach is to train a model that predicts the change in a club's overall score from all the other parameters. Once the model is trained, it predicts the overall score improvement expected from those factors for each club. A club whose actual improvement exceeds the predicted improvement is rewarded with a higher score, and a club that falls short is penalized. The scores are then normalized to a scale of 0 to 100. Three models are used to predict the scores: Linear Regression, Linear Regression with Ridge regularization, and Random Forests.
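
The idea of scoring a club by how far it beats its predicted improvement can be sketched on a tiny invented dataset. The two feature columns (mean change in two unnamed skills) and all the numbers are assumptions chosen so that the fitted line is exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented club-level features: mean change in two skills.
x_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
# Invented target: mean change in overall, exactly 0.5 * skill_a + 0.5 * skill_b.
y_train = np.array([0.5, 0.5, 1.0, 1.5])

model = LinearRegression().fit(x_train, y_train)

# A held-out club whose skills each improved by 2 but whose overall improved by 3.
x_test = np.array([[2.0, 2.0]])
y_test = np.array([3.0])

predicted = model.predict(x_test)  # improvement expected from the skill changes
residual = y_test - predicted      # positive when the club beat the expectation
print(predicted, residual)
```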

### Linear Regression Model

#### Model Definition

In [None]:
club_train_set, club_test_set, x_train, y_train, x_test, y_test = train_test_split_custom()

linreg = LinearRegression()
linreg.fit(x_train, y_train)

y_train_pred = linreg.predict(x_train)
y_test_pred = linreg.predict(x_test)

# predicted_test_score_improvement = year_over_year_metrics_averaged[["club","diff_overall"]]
club_train_set["predicted"] = y_train_pred
club_test_set["predicted"] = y_test_pred

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)


print ("Basic Linear Regression\n")
print (f"Train MSE basic Linear Regression: {mse_train:.4f}")
print (f"Test MSE basic Linear Regression: {mse_test:.4f}")

# accuracy_train = accuracy_score(y_train.values, y_train_pred)
# accuracy_test = accuracy_score(y_test.values, y_test_pred)
# print (f"Train Accuracy basic Linear Regression: {accuracy_train:.4f}")
# print (f"Test Accuracy basic Linear Regression: {accuracy_train:.4f}")

r2 = r2_score(y_test, y_test_pred)
print (f"R2 values of Basic Linear Regression: {r2:.4f}")

#### Linear Regression Basic Model - Score Calculation and Normalization

In [None]:
# club_test_set["score"] =  club_test_set["diff_overall"] + (club_test_set["diff_overall"] - club_test_set["predicted"])

# club_test_set[['score']] = MinMaxScaler().fit_transform(club_test_set[['score']])
# club_test_set['score'] = club_test_set['score'] * 100
# club_test_set
club_train_set_lin_reg_results, club_test_set_lin_reg_results = generate_scored_with_approach_selection_df(club_train_set, club_test_set)
# club_test_set_lin_reg_results

#### Model Output

In [None]:
displayModelTestScoreAgeScatter(club_test_set_lin_reg_results)
displayModelTestScoreClubHistogram(club_test_set_lin_reg_results)
displayBestClubsOverAllAges(club_test_set_lin_reg_results)

### Linear Model Ridge Regularization 

In [None]:
club_train_set, club_test_set, x_train, y_train, x_test, y_test = train_test_split_custom()

maxdeg = 2
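# Fit the polynomial expansion on the training features and reuse it for the test features
poly_features = PolynomialFeatures(maxdeg)
x_poly_train = poly_features.fit_transform(x_train)
x_poly_test = poly_features.transform(x_test)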
alpha_list = [0.001, 0.01, 0.1]
best_parameter = 0.01
# # Create two lists for training and validation error
# training_error, validation_error = [],[]
# for i in alpha_list:

#     print (i)
#     ridge_reg = Ridge(alpha=i,normalize=True)

#     #Fit on the entire data because we just want to see the trend of the coefficients

#     ridge_reg.fit(x_poly_train, y_train)
    
#     # Perform cross validation on the modified data with neg_mean_squared_error as the scoring parameter and cv=5
#     # Remember to get the train_score
#     ridge_cv = cross_validate(ridge_reg, x_poly_train, y_train, cv=5,scoring='neg_mean_squared_error',return_train_score=True)

#     # Compute the training and validation errors got after cross validation
#     mse_train = np.mean(np.abs(ridge_cv["train_score"]))
#     mse_val = np.mean(np.abs(ridge_cv["test_score"]))

#     # Append the MSEs to their respective lists 
#     training_error.append(mse_train)
#     validation_error.append(mse_val)
    
#     print(f"done {i}")
    
# # Get the best mse from the validation_error list
# best_mse  =  min(validation_error)

# # Get the best alpha value based on the best mse
# best_parameter = alpha_list[validation_error.index(best_mse)]
# print (best_parameter)


In [None]:
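# Note: `normalize=True` is not accepted by Ridge in newer scikit-learn releases;
# on those versions, scale the features beforehand (e.g. with StandardScaler) instead.
ridge_reg = Ridge(alpha=best_parameter,normalize=True)

# Fit on the training set using the selected alpha
ridge_reg.fit(x_poly_train, y_train)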

y_train_pred = ridge_reg.predict(x_poly_train)
y_test_pred = ridge_reg.predict(x_poly_test)

In [None]:
# predicted_test_score_improvement = year_over_year_metrics_averaged[["club","diff_overall"]]

club_train_set["predicted"] = y_train_pred
club_test_set["predicted"] = y_test_pred

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

print (f"Train MSE for Ridge Model {mse_train:.4f}")
print (f"Test MSE for Ridge Model {mse_test:.4f}")


# accuracy_train = accuracy_score(y_train, y_train_pred)
# accuracy_test = accuracy_score(y_test, y_test_pred)
# print (f"Train Accuracy Ridge Model: {accuracy_train:.4f}")
# print (f"Test Accuracy Ridge Model: {accuracy_train:.4f}")

r2 = r2_score(y_test, y_test_pred)
print (f"R2 for Ridge Model: {r2:.4f}")

#### Linear Ridge Model - Score Calculation and Normalization

In [None]:
club_train_set_ridge_reg_results, club_test_set_ridge_reg_results = generate_scored_with_approach_selection_df(club_train_set, club_test_set)
club_train_set_ridge_reg_results.shape

#### Model Output

In [None]:
displayModelTestScoreAgeScatter(club_test_set_ridge_reg_results)
displayModelTestScoreClubHistogram(club_test_set_ridge_reg_results)
displayBestClubsOverAllAges(club_test_set_ridge_reg_results)

### Random Forest Model 

In [None]:
club_train_set, club_test_set, x_train, y_train, x_test, y_test = train_test_split_custom()

max_depth = 10
random_state = 144
random_forest = RandomForestRegressor(max_depth=max_depth, random_state=random_state, n_estimators=250, max_features = 0.5)

# Fit the model on the training set
random_forest.fit(x_train, y_train)

In [None]:
y_train_pred = random_forest.predict(x_train)
y_test_pred = random_forest.predict(x_test)

In [None]:
# predicted_test_score_improvement = year_over_year_metrics_averaged[["club","diff_overall"]]
club_train_set["predicted"] = y_train_pred
club_test_set["predicted"] = y_test_pred

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

print (f"Train MSE for Random Forest {mse_train:.4f}")
print (f"Test MSE for Random Forest {mse_test:.4f}")

# accuracy_train = accuracy_score(y_train, y_train_pred)
# accuracy_test = accuracy_score(y_test, y_test_pred)
# print (f"Train Accuracy Random Forest Model: {accuracy_train:.4f}")
# print (f"Test Accuracy Random Forest Model: {accuracy_train:.4f}")

r2 = r2_score(y_test, y_test_pred)
print (f"R2 value for Random Forest: {r2:.4f}")

## Model and Scoring Approach Analysis

#### Scoring Analysis

Having selected the Random Forest as the model, we now evaluate the scoring approaches to determine which one works best.

We use the scores to identify the clubs that have done the most to improve their players.
For each approach, the following outputs are generated:
1. A scatter plot of scores vs. age, showing how clubs and their scores are distributed across player ages.
2. A histogram of the scores.
3. A table of the top 10 clubs for each player age available in the dataset.

#### Random Forest - Difference Overall Plus Difference in Prediction Scoring Approach

In [None]:
club_train_set_rf_results, club_test_set_rf_results = generate_scored_with_approach_selection_df(club_train_set, club_test_set)
club_train_set_rf_results.shape

In [None]:
displayModelTestScoreAgeScatter(club_test_set_rf_results)
displayModelTestScoreClubHistogram(club_test_set_rf_results)
displayBestClubsOverAllAges(club_test_set_rf_results)

<div class="alert alert-block alert-success">
This is the selected scoring approach. It is set as the default for the score generator and was validated through manual checks and subject-matter expert (SME) review.
</div>
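As a rough sketch of this scoring approach, the commented-out code in the Linear Regression section suggests the raw score is `diff_overall + (diff_overall - predicted)`, min-max scaled to a 0-100 range. The helper name `score_diff_overall_plus_difference` and the toy data below are hypothetical; the actual logic lives in `generate_scored_with_approach_selection_df` and may differ.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def score_diff_overall_plus_difference(df):
    """Hypothetical helper: actual improvement plus improvement beyond the prediction."""
    scored = df.copy()
    # Reward the observed improvement and any improvement the model did not expect
    scored["score_raw"] = scored["diff_overall"] + (scored["diff_overall"] - scored["predicted"])
    # Rescale to 0-100 so scores are comparable across clubs
    scored["score"] = MinMaxScaler().fit_transform(scored[["score_raw"]]).ravel() * 100
    return scored

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "club": ["A", "B", "C"],
    "diff_overall": [2.0, 1.0, 0.0],
    "predicted": [1.0, 1.0, 1.0],
})
scored = score_diff_overall_plus_difference(toy)
print(scored[["club", "score_raw", "score"]])
# score_raw: A = 3.0, B = 1.0, C = -1.0; score: A = 100.0, B = 50.0, C = 0.0
```

Club A improved more than predicted, so it earns credit twice; club C fell short of its prediction and is penalized.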

#### Random Forest - Overall Plus Difference in Prediction Scoring Approach

In [None]:
club_train_set_rf_results, club_test_set_rf_results = generate_scored_with_approach_selection_df(club_train_set, club_test_set, ScoreApproach.OVERALL_PLUS_DIFF_OVERALL_PLUS_DIFFERENCE)


In [None]:
displayModelTestScoreAgeScatter(club_test_set_rf_results)
displayModelTestScoreClubHistogram(club_test_set_rf_results)
displayBestClubsOverAllAges(club_test_set_rf_results)

<div class="alert alert-block alert-danger">
This approach clearly favored the most popular clubs with great players. However, because it uses the absolute value of the overall score, it does not reflect whether a club helped improve its players. We therefore felt that the difference in overall score was the more important quantity.
</div>

#### Random Forest - Diff Overall Times Difference in Prediction Scoring Approach

In [None]:
club_train_set_rf_results, club_test_set_rf_results = generate_scored_with_approach_selection_df(club_train_set, club_test_set, ScoreApproach.DIFF_OVERALL_TIMES_DIFFERENCE)

In [None]:
displayModelTestScoreAgeScatter(club_test_set_rf_results)
displayModelTestScoreClubHistogram(club_test_set_rf_results)
displayBestClubsOverAllAges(club_test_set_rf_results)

<div class="alert alert-block alert-warning">
The multiplication overemphasized the Diff Overall term, so this approach was rejected.
</div>
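To illustrate the effect, the toy comparison below contrasts an additive raw score with a multiplicative one. The multiplicative formula `diff_overall * (diff_overall - predicted)` is an assumption inferred from the approach's name, not taken from the scoring code, and the numbers are made up.

```python
import pandas as pd

# Toy data (hypothetical clubs and values)
toy = pd.DataFrame({
    "club": ["A", "B", "C"],
    "diff_overall": [4.0, 2.0, 1.0],
    "predicted": [2.0, 1.0, 0.5],
})

# How far each club's actual improvement exceeded the prediction
residual = toy["diff_overall"] - toy["predicted"]

toy["raw_plus"] = toy["diff_overall"] + residual    # additive combination
toy["raw_times"] = toy["diff_overall"] * residual   # assumed multiplicative combination

# Ratio between the best and worst club under each combination
spread_plus = toy["raw_plus"].max() / toy["raw_plus"].min()
spread_times = toy["raw_times"].max() / toy["raw_times"].min()
print(spread_plus, spread_times)  # 4.0 16.0
```

Under these assumptions the product widens the gap between clubs from 4x to 16x, because `diff_overall` contributes to both factors.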

#### Random Forest - Difference in Prediction Scoring Approach

In [None]:
club_train_set_rf_results, club_test_set_rf_results = generate_scored_with_approach_selection_df(club_train_set, club_test_set, ScoreApproach.DIFF_OVERALL_PREDICTED)
club_train_set_rf_results.shape

In [None]:
displayModelTestScoreAgeScatter(club_test_set_rf_results)
displayModelTestScoreClubHistogram(club_test_set_rf_results)
displayBestClubsOverAllAges(club_test_set_rf_results)

<div class="alert alert-block alert-danger">
This approach relies heavily on the model's prediction: a club whose players were predicted incorrectly could gain a significant, undeserved advantage.
</div>

### Best Clubs Overall

In [None]:
club_train_set_rf_results, club_test_set_rf_results = generate_scored_with_approach_selection_df(club_train_set, club_test_set)
displayBestClubsOverAllAges(club_test_set_rf_results)

In [None]:
club_test_set_model = club_test_set_rf_results.drop_duplicates()
all_ages_club_mean_scores_df = club_test_set_model[["club", "diff_overall", "predicted",  "score_raw",  "score"]]
result_df = all_ages_club_mean_scores_df.sort_values(by="score", ascending=False)
result_df.to_csv('./data/ordered_clubs.csv', index=False)

## END OF SUBMISSION
---