# Predicting the value of players in FIFA 19 using KNN 
---
Written by Neil Mackenzie

# Introduction

Following a [project](https://github.com/NeilMackenzie39/Predicting-Car-Prices-using-KNN) on the use of KNN to predict car prices, I decided to try combining the use of machine learning with one of my passions: soccer. In this project I will use the skills I have learnt thus far in the Dataquest.io 'Data Scientist in Python' [course](https://www.dataquest.io/path/data-scientist/) to build a model that will predict the value of a player in FIFA 19 based on 10 of the player's attributes. The players will be categorized as forwards, midfielders, defenders and goalkeepers and the 10 best features to predict a player's value for each category will be identified and used to make predictions for players belonging to each category.

The steps used are summarised below:
1. Preview the dataset
2. Perform data cleaning to reduce the size of the dataset and remove null values
3. Convert columns containing numerical information in string format to integer/float format
4. Assign numerical information to categorical data
5. Normalize the data
6. Categorize players as forwards, midfielders, defenders or goalkeepers
7. Identify 10 most accurate attributes to predict player value for each category
8. Predict the value of players in each category using only the top 10 features identified in (7)
9. Plot and analyze the results

The primary purpose of this project is to practice the implementation of KNN. The price of each player in the game is already known so there is no need to predict their value. Nonetheless, running through this code will be useful to practice the implementation of a KNN model.

Once this project is completed, I intend to use the skills I have learned to build a model that will predict the price of custom stainless steel components that are designed at my current company. My intention is to use KNN to predict a component's price based on its complexity and identify how we can reduce the cost of components by reducing the presence/number of features that add to the cost significantly.

I also intend to expand on this project by using linear regression to predict the value of a player and compare the accuracy achieved between KNN and linear regression. This will be done in a future project.

The complete dataset for players in FIFA 2019 is available on [this](https://www.kaggle.com/karangadiya/fifa19) Kaggle page.

Let's get started with step 1 above

# 1. Import and preview dataset
The dataset is read and previewed in the code blocks below

In [None]:
#Import libraries and FIFA 19 player stats csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#Import display to improve display of dataframes in code blocks
from IPython.display import display, HTML

data = pd.read_csv('data.csv')

In [None]:
#Check number of columns in dataframe
num_col = len(data.columns)
print("Number of columns in dataframe:",num_col)

In [None]:
#Unhide all columns and then preview first 5 rows of dataframe to identify columns
pd.options.display.max_columns = num_col
data.head()

In [None]:
#Assign all columns to variable and then preview columns
stats_columns = data.columns
stats_columns

# 2. Data Cleaning
## Reduce size of dataframe by removing unused columns
After reviewing the columns in the dataframe above, it appears that some columns will not be very useful when predicting the price of a player. All of these columns are dropped in the cell below to reduce the size of the dataframe.

In [None]:
#Drop columns that are not useful for price prediction
dropped_cols = ['Unnamed: 0', 'ID', 'Photo', 'Nationality', 'Flag','Club Logo','Real Face','Body Type']
data_updated = data.drop(dropped_cols, axis = 1)

#Print preview of dataframe to ensure above code has executed correctly
data_updated.head()

## Handle missing information
The KNN algorithm will not be able to work with the data if there are missing values in the columns fed to the algorithm, because it will attempt to evaluate a Euclidean distance from data that does not exist. To remove the potential for this error, let's handle entries with missing information.
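
As a small illustration of the problem, the sketch below computes the Euclidean distance between two made-up players, one of whom has a missing attribute. The arrays are hypothetical values chosen for the example and are not taken from the dataset:

```python
import numpy as np

# Two hypothetical players described by three normalized attributes.
player_a = np.array([0.8, 0.6, 0.9])
player_b = np.array([0.7, np.nan, 0.5])  # one attribute is missing

# Euclidean distance: square root of the summed squared differences.
distance = np.sqrt(np.sum((player_a - player_b) ** 2))
print(distance)  # nan
```

A single NaN propagates through the sum, so the distance itself is undefined and the neighbours cannot be ranked.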

### Remove rows with missing position information
This project requires every player to be categorized as either a forward, midfielder, defender or goalkeeper. The easiest way to do this is to categorize a player according to his position.

Unfortunately, the position of some players is listed as 'nan', i.e. not available. Filling in this information is possible but would require an in-depth knowledge of every player in the game. Manually entering missing positions would also be a time-consuming process, so I have decided to remove these players from the dataframe.

In [None]:
#Drop all rows of players with missing position information:
data_na_pos_removed = data_updated.dropna(subset = ['Position'])
# data_na_pos_removed
print("Rows in dataframe before removing missing position information:",len(data_updated))
print("Rows in dataframe after dropping rows with missing position information:",len(data_na_pos_removed))

### Check for other missing information
The rows with missing player position information have been removed. Let's check for any other missing information in the dataframe:

In [None]:
#Function to check for missing information in any column:
def check_na_vals(dataframe):
    return dataframe[dataframe.isna().any(axis = 1)]

rows_with_na = check_na_vals(data_na_pos_removed)

In [None]:
#Function to use display options to show only 10 rows but also indicate number of rows below dataframe
def display_10_rows(dataframe):
    with pd.option_context('display.max_rows', 10):
        display(dataframe)
            
display_10_rows(rows_with_na)

### Remove columns containing null values (if appropriate)
The code above returned all 18147 rows of the current dataframe, indicating that at least 1 NaN value was contained in every row. From inspection, the above dataframe regularly contains NaN values in the 'Joined' and 'Loaned From' columns.
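
Rather than relying on inspection alone, the number of missing values in each column can be counted directly. The sketch below does this on a toy dataframe; the column names mirror the dataset but the values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the FIFA dataframe; values are made up for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Joined': [np.nan, 'Jul 1, 2018', np.nan, np.nan],
    'Loaned From': [np.nan, np.nan, np.nan, 'Some Club'],
    'Overall': [90, 85, 80, 75],
})

# Count missing values per column, largest first.
na_counts = toy.isna().sum().sort_values(ascending=False)
print(na_counts)
```

Columns that appear at the top of this list are the first candidates to drop or fill.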

I don't see any reason why those columns would have any major impact on a player's price, so let's drop those columns and then check for null values again:

In [None]:
#Remove Joined and Loaned columns
data_cols_removed = data_na_pos_removed.drop(['Joined','Loaned From'], axis = 1)

#Check for missing information in any other columns:
rows_with_na = check_na_vals(data_cols_removed)

#See first 10 rows of dataframe containing na values:
display_10_rows(rows_with_na)

### Remove rating by position columns
The dataframe above contains just 3404 rows with NaN values, whereas the dataframe contained 18147 rows after removing missing position information. This means that approximately 19% of the rows in the dataframe contain missing information. This is a large portion of the dataframe so, if possible, it would be better to avoid simply dropping all of these rows.

Upon reviewing the rows printed above, it seems that most (if not all) of the missing information is in the columns that indicate a player's rating for specific positions. Since each player has a primary position indicated in the 'Position' column as well as an overall rating and a detailed breakdown of their stats, it seems feasible to drop the columns containing a player's rating for each position they can/do play in. In this project, players are only going to be categorized as forwards, midfielders, defenders and goalkeepers, so the difference between a player's rating as a striker or centre forward (both forward positions) does not change the fact that he will be categorized as a forward.

Let's remove all rating-by-position columns and see if there is still any missing information remaining.

In [None]:
#Remove rating-by-position columns
remove_cols = ['LS','ST','RS','LW','LF','CF','RF','RW','LAM','CAM','RAM','LM','LCM','CM',
                    'RCM','RM','LWB','LDM','CDM','RDM','RWB','LB','LCB','CB','RCB','RB']
data_positions_removed = data_cols_removed.drop(columns = remove_cols)

#Check for missing information in any other columns:
rows_with_na = check_na_vals(data_positions_removed)

#See first 10 rows of dataframe containing na values:
display_10_rows(rows_with_na)

### Fill NaN values in "Release Clause" Column
With the ratings-by-position columns removed, the dataframe still contains 18147 rows. There are still 1504 rows with missing data which need to be inspected and modified or removed.

Upon inspection of the above dataframe, there seem to be a large number of rows with NaN values in the "Release Clause" column. This column could contain important contractual information if it isn't empty, so it is important not to just remove it. I assume NaN values in this column usually mean that a player does not have a release clause, so let's fill NaN values with the integer value 0 and then check for null values again.

In [None]:
#Fill null values in Release Clause column with 0
data_release_filled = data_positions_removed.fillna({"Release Clause" : 0})

#Check for missing information in any other columns:
rows_with_na = check_na_vals(data_release_filled)

#See first 10 rows of dataframe containing na values:
display_10_rows(rows_with_na)

### Remove players with no price information
Great progress! There are now just 229 rows with missing information in the entire dataframe!

It is immediately clear that many of the players with missing information in the dataframe above have no information in the "Club" column. It is also clear from the first few rows that many of these players are also missing a price in the "Value" column. 

Since the KNN algorithm will be trained to predict the "Value" column, we cannot have 0 values in that column. It is therefore necessary to remove all rows with 0 values in the "Value" column before training the KNN model. Removing these rows could also remove rows with missing information in other columns, such as missing details in the "Club" column.

We'll then convert all valid 'Value' entries to integers. To do this, we first need to remove the currency character and any other text characters to effectively isolate values not equal to 0. The values can then be multiplied to have the correct order of magnitude before being converted to the integer datatype.
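
To make the conversion concrete, here is a minimal plain-Python sketch applied to a few sample strings. The helper name `euro_to_int` and the sample values are hypothetical and only serve as an illustration; they are not part of the notebook's pipeline:

```python
def euro_to_int(text):
    """Convert strings such as '€110.5M' or '€525K' to an integer number of euros."""
    multipliers = {'K': 1_000, 'M': 1_000_000}
    number = text.replace('€', '')
    # Use the trailing letter (if any) as the scale factor.
    factor = multipliers.get(number[-1], 1)
    if number[-1] in multipliers:
        number = number[:-1]
    return int(round(float(number) * factor))

for sample in ['€110.5M', '€77M', '€525K', '€0']:
    print(sample, '->', euro_to_int(sample))
```

Treating the suffix as a multiplier, rather than as text to be swapped for zeros, keeps decimal values such as '€1.2M' on the right order of magnitude.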

In [None]:
#Remove € character from "Value" column
value_str = data_release_filled['Value'].str.replace('€', '', regex = False)

#Scale by the K (thousand) or M (million) suffix. The suffix is mapped to a
#multiplier so that decimals such as '1.2M' and whole numbers such as '25M'
#are both converted correctly (a regex pattern like '.5M' would also match '25M')
multiplier = value_str.str[-1].map({'K': 10**3, 'M': 10**6}).fillna(1)
data_release_filled['Value'] = (value_str.str.rstrip('KM').astype(float)
                                * multiplier).round().astype(int)

# Remove rows with 0 in "Value" column
players_with_value = data_release_filled[data_release_filled['Value'] != 0]

In [None]:
#Check for null values in updated dataframe:
print("Number of rows containing null values:",players_with_value.isna().any(axis = 1).sum())

In [None]:
#Print number of rows in dataframe with data in 'Value' column and preview this dataframe
print("Number of rows in dataframe with no missing values",len(players_with_value))

players_with_value.head()

There are now no rows with missing information, so we can continue to select the appropriate data for the KNN algorithm

### Assess range of player values and remove players with very low value
Let's review some statistics on player values in the dataframe. Extreme outliers could cause major prediction errors from the KNN model, so these will be removed before we continue.

In [None]:
#Print statistics overview
players_with_value['Value'].describe().apply(lambda x: format(x, 'f'))

In [None]:
#Plot frequency of player value ranges in €5 million intervals
fig = plt.figure(figsize = (10,7))
#Use 20 bins of player values. Value divided by 10^6 so x-axis units are in millions
(players_with_value['Value']/10**6).plot.hist(bins = 20)
plt.ticklabel_format(style = 'plain', axis = 'x')
plt.xticks(np.arange(0,125,10))
plt.xlabel('Player values (€M)')
plt.show()

The stats and graph above clearly show that a huge number of players have a value below €5M. These players are unlikely to be considered by top football clubs and will also heavily skew the training data in the KNN algorithm. Let's plot the chart above again for players valued at €5M or more.

In [None]:
#Assign players with value of at least €5 million to new dataframe
above_5m = players_with_value[players_with_value['Value'] >= 5000000]

#Check no. players in new dataframe
print("Players with value of at least €5 000 000:" , len(above_5m))

In [None]:
#Plot frequency of player value ranges above €5 million
fig = plt.figure(figsize = (10,7))
#Use 20 bins of player values. Value divided by 10^6 so x-axis units are in millions
(above_5m['Value']/10**6).plot.hist(bins = 20)
plt.ticklabel_format(style = 'plain', axis = 'x')
plt.xticks(np.arange(0,125,10))
plt.xlabel('Player values (€M)')
plt.show()

The chart above still shows that the majority of the players have values below €15M, but the range looks more realistic for the training of a KNN model. The cleaned dataset of players with values has been reduced from 17907 players to 2302 players with a value of at least €5M.

# 3. Convert numerical information stored as strings to correct data format
KNN can only measure how 'close' two numerical values are in the feature space. Since there are a number of columns with text data, these features will need to be converted to numerical data to be used with the KNN algorithm. 
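
To show what 'close' means here, the sketch below implements KNN regression by hand with NumPy: the prediction for a new player is the mean value of the k training players with the smallest Euclidean distance. The function name `knn_predict_one`, the attribute values and the player values are all made up for illustration:

```python
import numpy as np

def knn_predict_one(train_X, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training rows."""
    # Euclidean distance from the query to every training row.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    return train_y[nearest].mean()

# Made-up, already-normalized attributes (rows = players) and values in €M.
train_X = np.array([[0.9, 0.8], [0.85, 0.9], [0.2, 0.3], [0.1, 0.2], [0.8, 0.85]])
train_y = np.array([60.0, 70.0, 6.0, 5.0, 50.0])

prediction = knn_predict_one(train_X, train_y, np.array([0.88, 0.82]), k=3)
print(prediction)  # 60.0
```

Text such as 'High/ Medium' has no place in this distance calculation, which is why every feature must be numeric first.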

In [None]:
#Define list of attributes that will be used for training.
attributes = list(above_5m.columns)
remove_attributes = ['Name','Age','Club','Special','Preferred Foot']
attributes = [attr for attr in attributes if attr not in remove_attributes]

#Check datatype of every column to identify which contain string values
above_5m[attributes].dtypes

The 'Wage', 'Work Rate', 'Contract Valid Until', 'Height', 'Weight' and 'Release Clause' columns need to be either removed, cleaned and converted to numbers, or assigned numerical values based on the text data contained in the column. Let's have a look at the types of values in each of those columns to decide on how to proceed. Note the 'Position' column also contains text data, but this column will only be used to categorize players and not to predict their value.

In [None]:
#Print some of the data contained in columns containing text 
#to investigate how to convert these columns to integers
print("Unique values in text columns:")
print("Wage:", above_5m['Wage'][0:6])
print("Wage:", above_5m['Wage'].unique())
print("\nContract Valid Until:", above_5m['Contract Valid Until'].unique())
print("\nHeight:", above_5m['Height'].unique())
print("\nWeight:", above_5m['Weight'].unique())
print("\nRelease Clause:\n", above_5m['Release Clause'][0:5])

The data in the columns listed above will be handled as described below:

#### Wage
Similar to what was done with the 'Value' column earlier on, the Euro and 'K' characters will be removed. The scale of the number must first be corrected (multiplied by 1000 because of the 'K' character) and it can then be converted to an integer.

#### Work Rate
This column contains categorical data that can be converted to numerical values. This is handled in section 4 below.

#### Contract Valid Until
This column could be important since the length of a player's contract will greatly affect his value to the club. This column will therefore need to be converted into an integer. The date could be converted into a number of days from the current date but, for simplicity, only the year will be used from this column.

#### Height
Remove the feet and inches characters and convert the height to cm.

#### Weight
Remove the 'lbs' text, convert to kg and store as a float.

#### Release clause
Remove text characters and convert to float. Null values also exist in this column so they will need to be replaced. Because of a lack of further information, null values will be replaced by 0.
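
The unit conversions described above can be sketched on single strings as follows. The helper names `height_to_cm` and `weight_to_kg` and the sample inputs are hypothetical; the conversion factors are 30.48 cm per foot, 2.54 cm per inch and 0.453592 kg per pound:

```python
def height_to_cm(height):
    """Convert a feet'inches string such as "5'7" to centimetres."""
    feet, inches = height.split("'")
    return round(int(feet) * 30.48 + int(inches) * 2.54, 2)

def weight_to_kg(weight):
    """Convert a string such as '159lbs' to kilograms."""
    return round(float(weight.replace('lbs', '')) * 0.453592, 2)

print(height_to_cm("5'7"))     # 170.18
print(weight_to_kg('159lbs'))  # 72.12
```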

In [None]:
import re
#Turn off SettingWithCopyWarning
pd.options.mode.chained_assignment = None  # default='warn'


#Remove € character from "Wage" column
above_5m['Wage'] = above_5m['Wage'].str.replace('€','')

#Remove K character from "Wage" column after multiplying to get correct order
above_5m['Wage'] = above_5m['Wage'].replace({'K':'000'}, 
                                            regex = True).map(pd.eval).astype(int)


#Extract only year in 'Contract Valid Until' column
pattern = r'(\d{4})'
years = above_5m['Contract Valid Until'].str.extract(pattern).copy()
above_5m['Contract Valid Until'] = years
above_5m['Contract Valid Until'] = above_5m['Contract Valid Until'].fillna(0).astype(int)

#Remove extra characters in "Weight" column
mass_lbs = above_5m['Weight'].str.replace('lbs','').astype(float)
above_5m['Weight'] = round(mass_lbs*0.453592,2)


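#Remove € character in "Release Clause" column, scale by the K/M suffix and convert to float.
#Null values were filled with 0 earlier, so the column is cast to string first
clause_str = above_5m['Release Clause'].astype(str).str.replace('€', '', regex = False)
clause_mult = clause_str.str[-1].map({'K': 10**3, 'M': 10**6}).fillna(1)
above_5m['Release Clause'] = clause_str.str.rstrip('KM').astype(float).fillna(0) * clause_mult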

#Convert height to cm
feet_inches = above_5m['Height'].str.split("'",expand = True).astype(int)
cm = feet_inches[0]*30.48 + feet_inches[1]/12*30.48
above_5m['Height'] = cm

above_5m.head(10)

# 4. Assign numerical information to categorical data
As mentioned in section (3), the work rate column contains categorical information that can be converted to numerical data. Each value in this column is built from the levels 'Low', 'Medium' and 'High'. To allow easy use of the KNN algorithm, these levels will be converted into integers (1 for Low, 2 for Medium, 3 for High).

According to [this](https://www.reddit.com/r/FIFA/comments/3znwub/beginners_guide_work_rates/) webpage, each player has an attacking and defensive work rate separated by a slash. The format of the work rate is given as attacking/defensive. Since there are two values given in this column, the final work rate will then be taken as the average of a player's attacking and defensive work rates.
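
The averaging can be sketched on a single string as shown below. The helper name `work_rate_score` and the `LEVELS` dictionary are hypothetical names introduced for this example, and the 'attacking/ defensive' string format is assumed to match the one in the dataset:

```python
LEVELS = {'Low': 1, 'Medium': 2, 'High': 3}

def work_rate_score(work_rate):
    """Average the attacking and defensive work rates of a string like 'High/ Medium'."""
    attacking, defensive = [part.strip() for part in work_rate.split('/')]
    return (LEVELS[attacking] + LEVELS[defensive]) / 2

print(work_rate_score('High/ Medium'))  # 2.5
print(work_rate_score('Low/ Low'))      # 1.0
```

Splitting the string avoids having to list all nine combinations by hand, although an explicit lookup table works just as well for so few values.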

In [None]:
#Assign numerical values to 'Work Rate' column
work_rate_cleanup = {'Low/ Low': (1+1)/2, 'Low/ Medium': (1+2)/2,'Medium/ Low':(2+1)/2,
                     'Low/ High':(1+3)/2,'High/ Low':(3+1)/2, 'Medium/ Medium': (2+2)/2, 
                     'High/ Medium': (3+2)/2, 'Medium/ High': (2+3)/2, 'High/ High': (3+3)/2}

#Assign the result back to the column instead of replacing in place on a slice
above_5m['Work Rate'] = above_5m['Work Rate'].replace(work_rate_cleanup)


## Data cleaning complete!
There are now no rows with missing information. Let's confirm the length of the updated dataframe and preview it:
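
A minimal sketch of such a check is shown below on a toy stand-in for the cleaned dataframe. The name `cleaned` and its values are made up for illustration; in the notebook the same calls would be applied to `above_5m`:

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe; the values are made up.
cleaned = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Value': [5_000_000, 12_000_000, 77_000_000],
    'Work Rate': [2.0, 2.5, 3.0],
})

# Number of rows that still contain at least one missing value.
rows_with_na = cleaned.isna().any(axis=1).sum()
print("Rows with missing values:", rows_with_na)
print("Rows in dataframe:", len(cleaned))
print(cleaned.head())
```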

# 5. Normalization
If all values in the dataframe had the same scale, normalization would not be necessary. In this case, most columns are on a 0-100 scale but some are not. The converted 'Work Rate' and 'Contract Valid Until' columns from the cells above are two examples of columns that are not on the 0-100 scale.
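
The idea is that columns with a wide numeric range would otherwise dominate the Euclidean distance. A minimal sketch of min-max normalization on a toy dataframe follows; the values are made up and only the column names are borrowed from the dataset:

```python
import pandas as pd

# Made-up values on very different scales.
toy = pd.DataFrame({
    'Finishing': [90, 60, 75],                  # 0-100 scale
    'Contract Valid Until': [2021, 2019, 2023],  # calendar years
    'Work Rate': [3.0, 1.5, 2.0],                # 1-3 scale
})

# Min-max normalization: (x - min) / (max - min), applied column by column.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After this step every column runs from 0 to 1, so each attribute contributes to the distance on equal terms.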

In order to use KNN accurately, the data fed into the algorithm must be on the same scale, so let's normalize the dataframe:

In [None]:
#List of columns that won't be normalized
non_data_cols = ['Name','Club','Position','Wage', 'Preferred Foot', 'Value']
#Create normalized dataframe with columns to be normalized for players valued above €5M
normalized_attributes = [attr for attr in above_5m.columns if attr not in non_data_cols]
normalized = above_5m[normalized_attributes]

#Do the normalization
normalized = (normalized - normalized.min()) / (
    normalized.max() - normalized.min())

#Add columns that were removed before normalization back to the dataframe
normalized = pd.concat([above_5m[non_data_cols],normalized], axis = 1)

#Preview normalized dataframe
normalized.head()

# 6. Categorize players into new dataframes according to position
The players in the dataset can now be assigned to separate 'forwards', 'midfielders', 'defenders' and 'goalkeepers' dataframes which will be used independently to train models according to each of the four position categories.

In [None]:
#Preview unique positions in dataframe in order to isolate according to each dataframe
positions = normalized['Position'].unique()
positions

The categorization of players according to their position is guided by [this](https://fifafootballvideogames.fandom.com/wiki/Soccer_positions) webpage. The categorization approach is summarized in the table below:

|Position (Abbreviated)|Position|Category
|---|---|---|
LS|Left Striker|Forward
ST|Striker|Forward
RS|Right Striker|Forward
LF|Left Forward|Forward
CF|Centre Forward|Forward
RF|Right Forward|Forward
LAM|Left Attacking Midfielder|Midfielder
CAM|Central Attacking Midfielder|Midfielder
RAM|Right Attacking Midfielder|Midfielder
LW|Left Wing|Midfielder
RW|Right Wing|Midfielder
LM|Left Midfielder|Midfielder
LCM|Left Centre Midfielder|Midfielder
CM|Centre Midfielder|Midfielder
RCM|Right Centre Midfielder|Midfielder
RM|Right Midfielder|Midfielder
LDM|Left Defensive Midfielder|Midfielder
CDM|Centre Defensive Midfielder|Midfielder
RDM|Right Defensive Midfielder|Midfielder
LWB|Left Wingback|Defender
RWB|Right Wingback|Defender
LB|Left Back|Defender
LCB|Left Centre Back|Defender
CB|Centre Back|Defender
RCB|Right Centre Back|Defender
RB|Right Back|Defender
GK|Goalkeeper|Goalkeeper


In [None]:
#Lists of positions in each category
list_forwards = ['LS','ST','RS','LF','CF','RF']
list_midfielders = ['LAM','CAM','RAM','LW','RW','LM','LCM','CM','RCM','RM','LDM','CDM','RDM']
list_defenders = ['LWB','RWB','LB','LCB','CB','RCB','RB']
list_goalkeepers = ['GK']

#Create new dataframes of players categorised according to position
forwards = normalized[normalized['Position'].isin(list_forwards)]
midfielders = normalized[normalized['Position'].isin(list_midfielders)]
defenders = normalized[normalized['Position'].isin(list_defenders)]
goalkeepers = normalized[normalized['Position'].isin(list_goalkeepers)] 

The players have now been isolated into separate dataframes according to their position on the field. Out of interest, let's see how many players are in each of these dataframes.

In [None]:
print('Number of forwards:',len(forwards))
print('Number of midfielders:',len(midfielders))
print('Number of defenders:',len(defenders))
print('Number of goalkeepers:',len(goalkeepers))

# 7. Identify 10 most relevant attributes for predicting player value
The 10 attributes that most accurately predict a player's value will be determined for each category. 

The number of features to use could be optimised by checking the RMSE value associated with each number of attributes. Using this method, the attributes should be added to the test in order of their individual accuracy. This method would add complexity and is not explored in this code.
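
For reference, a minimal sketch of that idea is shown below. It uses synthetic data rather than the FIFA dataset, and the feature names, the ranking in `ranked_features`, the 75/25 split and k = 5 are all assumptions made for the example:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: 'Value' depends mainly on f1 and f2; f3 and f4 are noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((200, 4)), columns=['f1', 'f2', 'f3', 'f4'])
toy['Value'] = 10 * toy['f1'] + 5 * toy['f2'] + rng.normal(0, 0.1, 200)

# Hypothetical ranking of features, best first (e.g. from a univariate test).
ranked_features = ['f1', 'f2', 'f3', 'f4']

train, test = train_test_split(toy, test_size=0.25, random_state=1)

rmse_by_count = {}
for n in range(1, len(ranked_features) + 1):
    # Add features one at a time in order of their individual accuracy.
    features = ranked_features[:n]
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(train[features], train['Value'])
    mse = mean_squared_error(test['Value'], knn.predict(test[features]))
    rmse_by_count[n] = mse ** 0.5

best_n = min(rmse_by_count, key=rmse_by_count.get)
print(rmse_by_count)
print("Lowest RMSE with", best_n, "features")
```

The number of features with the lowest hold-out RMSE would then replace the fixed choice of 10.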

## Identify correlation coefficients between features and 'Value' column
A function that ranks every attribute by the absolute value of its correlation with the 'Value' column is defined below. The 10 attributes with the strongest correlation in each category are then used to fit a linear regression model, which gives a quick first estimate of player value before the KNN approach is applied.

In [None]:
from sklearn.linear_model import LinearRegression

#Assign all column names to attributes variable
attributes = list(normalized.columns)

#Remove unwanted but potentially useful data columns from 
#attributes to create list of columns for training
remove_attributes = ['Name','Age','Club','Special','Wage','Preferred Foot','Position']
train_attributes = [attr for attr in attributes if attr not in remove_attributes]

#Function to sort correlation between each attribute and value
def get_corr(dataset):
    train_subset = dataset[train_attributes]
    corr_matrix = train_subset.corr()
    sorted_corrs = corr_matrix['Value'].abs().sort_values(ascending = False)
    
    #Return attributes sorted by strength of correlation
    return sorted_corrs

fwd_corrs = get_corr(forwards)
mid_corrs = get_corr(midfielders)
def_corrs = get_corr(defenders)
gk_corrs = get_corr(goalkeepers)

In [None]:
#Isolate 10 features with highest correlation 
#Slice from 1 because value contained at 0 index 
#represents correlation between 'Value' and 'Value' (always 1)
fwd_strong_corrs = fwd_corrs[1:11]
mid_strong_corrs = mid_corrs[1:11]
def_strong_corrs = def_corrs[1:11]
gk_strong_corrs = gk_corrs[1:11]

In [None]:
lr = LinearRegression()

def predict(dataset,attributes):
    lr.fit(dataset[attributes],dataset['Value'])
    dataset['Predicted Value']  = (lr.predict(dataset[attributes])).round(1)
#     dataset = pd.concat([dataset.iloc[:,0:6],dataset['Predicted Value'],dataset.iloc[:,6:-1]], axis = 1)
    dataset['% Difference'] = round((dataset['Value']-dataset['Predicted Value'])/dataset['Value']*100,2)
    predictions = dataset[['Name','Value','Predicted Value','% Difference']]
    return predictions
    
fwd_predictions = predict(forwards,fwd_strong_corrs.index.tolist())
mid_predictions = predict(midfielders,mid_strong_corrs.index.tolist())
def_predictions = predict(defenders,def_strong_corrs.index.tolist())
gk_predictions = predict(goalkeepers,gk_strong_corrs.index.tolist())

gk_predictions

## Run univariate KNN for each attribute and each category of player
The code below runs a univariate KNN function, `KNN_uni`, once for every attribute in the dataframe and records the RMSE obtained when that attribute alone is used to predict a player's price.

The code could be greatly improved by logically selecting which features to use for predictions based on each category of player (Goalkeeping Reflexes would not be a useful attribute for predicting the value of a striker). In the interest of making progress in the code, I will assume the user has no knowledge of which attributes are useful for each position and simply use them all.
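
The `KNN_uni` helper called in the next cell does not appear in this notebook. The sketch below is one plausible version and not necessarily the original: the 75/25 split, the fixed random seed, the default k = 5 and the `target` and `k` parameters are all assumptions. It returns the hold-out RMSE of a KNN model trained on a single attribute:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def KNN_uni(dataframe, attribute, target='Value', k=5):
    """Return the hold-out RMSE of a KNN model trained on a single attribute."""
    # Assumed 75/25 split with a fixed seed so results are repeatable.
    train, test = train_test_split(dataframe, test_size=0.25, random_state=1)
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(train[[attribute]], train[target])
    predictions = knn.predict(test[[attribute]])
    return mean_squared_error(test[target], predictions) ** 0.5

# Quick check on synthetic data (not the FIFA dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({'Finishing': rng.random(100), 'Noise': rng.random(100)})
toy['Value'] = 50 * toy['Finishing'] + rng.normal(0, 1, 100)

print(KNN_uni(toy, 'Finishing'))
print(KNN_uni(toy, 'Noise'))
```

An attribute that actually drives the value gives a much lower RMSE than an unrelated one, which is what the ranking relies on.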

In [None]:
#Assign all column names to attributes variable
attributes = list(normalized.columns)

#Remove unwanted but potentially useful data columns from 
#attributes to create list of columns for training
remove_attributes = ['Name','Age','Value','Club','Special','Wage','Preferred Foot','Position']
train_attributes = [attr for attr in attributes if attr not in remove_attributes]

#Function to determine RMSE for each attribute
def get_RMSE(dataframe):
    #Create dictionary of RMSE value associated with each attribute
    RMSEs = {}
    for attribute in train_attributes:
        rmse = KNN_uni(dataframe,attribute)
        RMSEs[attribute] = rmse
    
    #Create tuple of dictionary so that it can be sorted according to RMSE
    rmse_tuple = RMSEs.items()
    rmse_tuple = sorted(rmse_tuple,key=lambda x: x[1])
    
    #Create and return dataframe of top 10 predictors 
    top_10 = pd.DataFrame(rmse_tuple[0:10], columns = ['Attribute','RMSE'])
    return top_10

#Run get_RMSE function to get RMSE value for each category of player
top_10_forwards = get_RMSE(forwards)
top_10_midfielders = get_RMSE(midfielders)
top_10_defenders = get_RMSE(defenders)
top_10_goalkeepers = get_RMSE(goalkeepers)       

In [None]:
#Check number of attributes that were used to train the model
print(len(train_attributes))

In [None]:
#Use display() function to display dataframes as tables
print("Top 10 predictors for forwards:")
display(top_10_forwards)
print("===============================")

print("\nTop 10 predictors for midfielders:")
display(top_10_midfielders)
print("===============================")

print("\nTop 10 predictors for defenders:")
display(top_10_defenders)
print("===============================")

print("\nTop 10 predictors for goalkeepers:")
display(top_10_goalkeepers)
print("===============================")

In [None]:
#Create list of top 10 attributes for each category from the relevant dataframes

fwd_predictors = top_10_forwards['Attribute'].tolist()
mid_predictors = top_10_midfielders['Attribute'].tolist()
def_predictors = top_10_defenders['Attribute'].tolist()
gk_predictors = top_10_goalkeepers['Attribute'].tolist()

# 8. Predict player value using KNN
The top 10 attributes to predict the value of players in each category have now been determined. To test the accuracy of using just these 10 attributes to predict the price of each player, let's assume the price of each player was not known and predict their value based on the KNN model.

In [None]:
#Define KNN function to predict player value based on top 10 relevant attributes
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(dataframe,predictors):
    knn = KNeighborsRegressor()
 
    #Train using set of attributes sent to function
    knn.fit(dataframe[predictors],dataframe['Value'])
    
#     Run prediction of player value using set of attributes sent to function.
#     Add these predicted values as well as the error compared to the 
#     true FIFA 19 value to a new dataframe
    
    predicted = dataframe[['Name','Value']].copy()
    predicted['Predicted'] = knn.predict(dataframe[predictors])
    predicted['% Difference'] = round(((predicted['Value']-predicted['Predicted'])/predicted['Predicted'])*100,2)
    
    #Return predicted values sorted by % difference to FIFA 19 value
    return(predicted.sort_values('% Difference', ascending = False))

fwd_prices = knn_predict(forwards,fwd_predictors)
mid_prices = knn_predict(midfielders,mid_predictors)
def_prices = knn_predict(defenders,def_predictors)
gk_prices = knn_predict(goalkeepers,gk_predictors)                      

# 9. Data Visualization
Predictions have been made using the 10 best predictive features of player value for the dataset. Let's now plot the % difference between the predicted value and the actual FIFA 19 value to visualize the accuracy of using 10 features to predict a player's value.

In [None]:
fig = plt.figure(figsize = (7,20))

#Function to plot % Difference column for dataframe sent to function
def plot(dataframe,category,plot_pos):
    ax = fig.add_subplot(4,1,plot_pos)
    ax.hist(dataframe['% Difference'])
    ax.set_xlabel('Difference between predicted price & FIFA 19 price (%)')
    ax.set_ylabel('Frequency')
    ax.set_title('Percentage difference in predicted price for ' + category)
    fig.tight_layout()

#Create plots for each dataframe
plot(fwd_prices,'forwards',1)
plot(mid_prices,'midfielders',2)
plot(def_prices,'defenders',3)
plot(gk_prices,'goalkeepers',4)

# Statistics 
The charts above show that surprisingly good accuracy is achieved when a player's value is predicted from the top 10 attributes for the relevant category. The differences for all categories appear to follow a roughly normal distribution centred near a 0% difference between predicted value and FIFA 19 value.

Let's have a look at the summary statistics for the % Difference between predicted value and FIFA 19 value for each category:

In [None]:
print("""Summary stats for % Difference between value predicted using top 10 features
and FIFA 19 value:""")

for i,j in (fwd_prices,'Forwards'), (mid_prices,'Midfielders'), (
    def_prices,'Defenders'), (gk_prices,'Goalkeepers'):
    print("\n=========================================================================")
    print(j)
    print("=========================================================================")
    print(i['% Difference'].describe())

From the stats above we can see that, in general, a player's value can be predicted fairly accurately using just 10 of the 45 attributes that were tested in the univariate KNN model earlier on.
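
One thing to keep in mind is that `knn_predict` fits the model and makes predictions on the same rows, so each player counts among his own nearest neighbours. A sketch of how the error could be compared against a held-out set is shown below. It uses synthetic data rather than the FIFA dataset, and the helper `mean_abs_pct_diff`, the 75/25 split and k = 5 are assumptions made for the example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for one player category (not the FIFA data).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((300, 3)), columns=['a', 'b', 'c'])
toy['Value'] = 20 + 40 * toy['a'] + 20 * toy['b'] + rng.normal(0, 5, 300)
features = ['a', 'b', 'c']

def mean_abs_pct_diff(actual, predicted):
    """Mean absolute percentage difference relative to the actual value."""
    return float(np.mean(np.abs(actual - predicted) / actual) * 100)

train, test = train_test_split(toy, test_size=0.25, random_state=1)
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train[features], train['Value'])

# Error on the rows used for fitting versus rows the model has not seen.
in_sample = mean_abs_pct_diff(train['Value'], knn.predict(train[features]))
held_out = mean_abs_pct_diff(test['Value'], knn.predict(test[features]))
print("Error on training rows:", round(in_sample, 2))
print("Error on held-out rows:", round(held_out, 2))
```

The held-out figure is usually the more honest estimate of how well the model would price a player it has never seen.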

# Conclusion
In this project, the value of players categorized as forwards, midfielders, defenders and goalkeepers in FIFA 19 was predicted using just 10 out of 45 player attributes. 

It was found that just 10 relevant features are enough to accurately predict the value of a player in the game. The average difference between the predicted value and actual value in FIFA 19 is summarized for each category below:

|Category|% Difference (Predicted vs FIFA 19)|
|---|---|
|Forwards|0.33|
|Midfielders|0.09|
|Defenders|0.13|
|Goalkeepers|0.78|

This project has been incredibly valuable practice in data cleaning, pandas, KNN and matplotlib. As mentioned in the introduction, this project will be revisited in future to compare the accuracy of predictions made using KNN vs linear regression.

With better knowledge about the implementation of KNN, I now intend to build a model to analyze which attributes of stainless steel components designed at my current job contribute most significantly to their cost. This information can be used to a) reduce the complexity of our designs to reduce cost and b) predict the cost of a component based on its design features.