---


<h1 style="text-align: center;">
    
**EXPLORING THE RELATIONSHIP BETWEEN FOOTBALL PLAYERS' CURRENT PERFORMANCE AND THEIR FUTURE MARKET VALUE**

</h1>

---

![Stats](https://e00-marca.uecdn.es/assets/multimedia/imagenes/2020/04/13/15867919512613.jpg)

### TABLE OF CONTENTS

1. IMPORTING LIBRARIES

2. IMPORTING DATASET

3. DATA PROCESSING

4. EXPLORATORY DATA ANALYSIS (EDA)

5. SAMPLING

6. FEATURE SELECTION

7. MODEL BUILDING
    
    MODEL SELECTION:

        I.      Decision Tree
        II.     Random Forest
        III.    Linear Regression

8. MODEL EVALUATION

9. MODEL VALIDATION
    
    HYPER PARAMETER TUNING (GRID SEARCH)

    FEATURE IMPORTANCE:

        I.      Grid Search Feature Importances
        II.     Decision Tree Feature Importances
        III.    Random Forest Feature Importances
    
    ORDINARY LEAST SQUARES (OLS) LINEAR REGRESSION

10. REFERENCES:

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        1. IMPORTING LIBRARIES
    </center>
</h1>

---

In [None]:
#Importing all libraries...
import pandas as pd
import numpy as np
import seaborn as sns

#Visualization libraries
import plotly.offline as py
%matplotlib inline
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
import matplotlib as mpl
cmap = mpl.colormaps['viridis']

#Importing statistics for confidence interval
import statsmodels.api as sm
from scipy.stats import t
from sklearn import metrics
from math import log

# Performance metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Machine Learning Model and Evaluation
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import plot_tree
from sklearn import tree
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

#To ignore all DeprecationWarning warnings in your code.
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None  # default='warn'

#Setting scatter size for our numerical variables plots
SCATTER_SIZE = 800

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        2. IMPORTING DATASET
    </center>
</h1>

---

In [None]:
# Reading the FIFA player data from a CSV file
df = pd.read_csv('fifa.csv')

#Maximum rows and columns to be displayed from our tables
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

In [None]:
#Checking top row(s) of our df data
df.head()

#### `VARIABLES`

    * Potential independent variables
    - Performance metrics, e.g. age, position, overall_rating, potential, etc.

    * Dependent variable
    - Future market value

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        3. DATA PROCESSING
    </center>
</h1>

---

In [None]:
# Checking the shape of our first csv file (df)
df.shape

In [None]:
#Checking duplicated data as a whole
df.duplicated()

In [None]:
#Checking sum of duplicates in our data.
df.duplicated().sum()

In [None]:
#Checking all columns in our df dataframe
for col in df.columns:
    print(col)

In [None]:
#Checking data types in our dataframe
df.dtypes

In [None]:
#Checking statistical details of our data
df.describe()

In [None]:
# Count of NaN values in each column
print(df.isnull().sum())

In [None]:
#Dropping unimportant columns in our data
df.drop(['birth_date',
         'id',
         'full_name',
         'weak_foot(1-5)',
         'body_type',
         'national_team_position',
         'national_jersey_number',
         'skill_moves(1-5)',
         'international_reputation(1-5)',
         'GK_reflexes',
         'GK_positioning',
         'GK_kicking',
         'GK_handling',
         'GK_diving',
         'national_team',
         'contract_end_year',
         'club_join_date',
         'tags',
         'traits',
         'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB'
        ],
        axis=1, inplace=True)

#Displaying the top of our data after dropping unimportant features
df.head()

#### MISSING VALUES

In [None]:
#Dropping all observations with missing data/values
df1=df.dropna().reset_index(drop=True)

# Count of NaN values in each column
print(df1.isnull().sum())

In [None]:
#Checking the shape of our data after dropping missing data 
df1.shape

In [None]:
#Checking minimum age in our dataset
minvalue = df1['age'].min() 
  
minvalue

In [None]:
#Checking maximum age in our dataset
maxvalue = df1['age'].max() 
  
maxvalue

In [None]:
#Selecting only few columns from our dataframe
df1[['name', 'age', 'nationality', 'overall_rating', 'potential', 'club_rating', 'value_euro', 'positions']]

In [None]:
#Replacing commas with spaces to separate player positions in the positions column
df1['positions'] = df1['positions'].str.replace(',', ' ')

#Keeping only the first listed position as the preferred position
df1['preferred_position'] = df1['positions'].str.split().str[0]

#Displaying the data after adding the preferred_position column
df1

In [None]:
#Checking preferred position column to see what we have
df1.preferred_position.unique()

#### Playing Positions

In football, a player can play in many different positions. In this project, I grouped the unique playing positions into the following categories.

`1. Strikers` <br>
-Forward: ST, CF <br>
-Winger: LW, RW <br>

`2. Midfielders` <br>
-Attacking Midfielder: CAM <br>
-Central Midfielder: CM <br>
-Side Midfielder: LM, RM <br>
-Defensive Midfielder: CDM <br>

`3. Defenders` <br>
-Centre Back: CB <br>
-Full Back: LWB, RWB, LB, RB <br>

`4. Goalkeepers` <br>
-Goalkeeper: GK <br>
   

In this dataset, some players are listed with multiple playing positions; for those players, only the first listed position is used.
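
The grouping described above can be sketched as a simple lookup. This is only an illustration: `POSITION_GROUPS` and `position_group` are hypothetical names that do not appear elsewhere in this notebook, and any position not listed above (for example LAM or RCB) is assumed to fall back to `'Other'`.

```python
# Hypothetical lookup built from the position list above
POSITION_GROUPS = {
    'ST': 'Striker', 'CF': 'Striker', 'LW': 'Striker', 'RW': 'Striker',
    'CAM': 'Midfielder', 'CM': 'Midfielder', 'LM': 'Midfielder',
    'RM': 'Midfielder', 'CDM': 'Midfielder',
    'CB': 'Defender', 'LWB': 'Defender', 'RWB': 'Defender',
    'LB': 'Defender', 'RB': 'Defender',
    'GK': 'Goalkeeper',
}

def position_group(positions):
    """Return the group of the first position in a string such as 'CF,RW,ST'."""
    # Commas are treated as separators, then only the first position is kept
    first = positions.replace(',', ' ').split()[0]
    return POSITION_GROUPS.get(first, 'Other')

print(position_group('CF,RW,ST'))  # Striker
print(position_group('GK'))        # Goalkeeper
```

On the dataframe, a function like this could be applied with `df1['positions'].map(position_group)`, assuming the column contains no missing values.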

In [None]:
# count of players in the dataset
print(f'Count of players in the dataset is: {df1.shape[0]} players')

# count of features in the dataset
print(f'Count of features in the dataset is: {df1.shape[1]} features')

# count of nationalities in the dataset
print(f'Count of nationalities in the dataset is: {df1.nationality.nunique()} nationalities')

# count of clubs in the dataset
print(f'Count of clubs in the dataset is: {df1.club_team.nunique()} clubs')

In [None]:
#Adding column that shows total count of players by position
df1['position_count'] = df1.groupby('preferred_position')['preferred_position'].transform('count')

In [None]:
#Checking data types in our dataframe
df1.dtypes

In [None]:
#Checking the shape of our data
df1.shape

In [None]:
#Checking top row(s) of cols data
df1.head()

#### OUTLIERS

In [None]:
#Checking mean and max of age column to detect outliers
df1.describe()['age']

In [None]:
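#Create a function to find outliers using the IQR rule
def find_outliers_IQR(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    IQR = q3 - q1

    #Values more than 1.5 * IQR beyond the quartiles are flagged as outliers
    outliers = series[(series < (q1 - 1.5 * IQR)) | (series > (q3 + 1.5 * IQR))]

    return outliers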

In [None]:
#First run age through the function to return a series of the outliers.
outliers = find_outliers_IQR(df1['age'])

print("Number of outliers: "+ str(len(outliers)))
print("Max outlier value: "+ str(outliers.max()))
print("Min outlier value: "+ str(outliers.min()))

#outliers
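
To make the IQR rule used above concrete, here is a small worked example in plain Python. The ages are made up for illustration and are not taken from the dataset. `method='inclusive'` is chosen because it interpolates linearly between data points, which is intended to mirror the default behaviour of pandas' `quantile`.

```python
from statistics import quantiles

ages = [17, 20, 22, 24, 25, 26, 28, 30, 45]  # made-up ages for illustration

# Quartiles with linear interpolation between data points
q1, _, q3 = quantiles(ages, n=4, method='inclusive')
iqr = q3 - q1

# Fences: anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] counts as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outlier_ages = [a for a in ages if a < lower or a > upper]
print(q1, q3, lower, upper)  # 22.0 28.0 13.0 37.0
print(outlier_ages)          # [45]
```

Only the 45-year-old falls outside the fences here, so only that row would be dropped.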

In [None]:
#Dropping rows containing outliers
df2 = df1.drop(outliers.index)

In [None]:
#Checking the shape of our new data after dropping outliers
df2.shape

In [None]:
# select all numeric columns
df2.select_dtypes(include='number') 

In [None]:
# Create the correlation matrix
# numeric_only=True skips text columns such as name and nationality
corr = df2.corr(numeric_only=True)

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 9))

sns.heatmap(
    corr, 
    cmap=cmap,
    annot=True,
    vmax=.3,
    vmin=-.3,
    center=0, 
    square=True, 
    linewidths=.5, 
    cbar_kws={"shrink": .5})

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        4. EXPLORATORY DATA ANALYSIS (EDA)
    </center>
</h1>

---

In [None]:
#Graph of value against release clause
fig = px.line(df2,
              x='release_clause_euro',
              y='value_euro',
              color='name', 
              title='Graph of value against release clause',
              markers=True)
fig.show()

In [None]:
#Graph of value against player wages
fig = px.line(df2,
              x='wage_euro',
              y='value_euro',
              color='name',
              title='Graph of value against player wages €',
              markers=True)
fig.show()

In [None]:
#Graph of value against height in cms
fig = px.line(df2, x='height_cm',
              y='value_euro', 
              color='name', 
              title='Graph of value against height in (cm)', 
              markers=True)
fig.show()

In [None]:
#Graph of value against weight in kgs
fig = px.line(df2, 
              x='weight_kgs', 
              y='value_euro', 
              color='name', 
              title='Graph of value against weight in kgs',
              markers=True)
fig.show()

In [None]:
#Graph of value against nationality
fig = px.line(df2, 
              x='nationality',
              y='value_euro',
              color='name',
              title='Graph of value against nationality',
              markers=True)
fig.show()

In [None]:
#Graph of value against club team
fig = px.line(df2, 
              x='club_team',
              y='value_euro', 
              color='name',
              title='Graph of value against club team',
              markers=True)
fig.show()

In [None]:
#Graph of number of players, preferred foot and positions
fig = px.bar(df2,
             x="preferred_position",
             y='position_count',
             color='preferred_foot',
             title="Number of players according to preferred position and preferred foot")
fig.show()

In [None]:
#Graph of value, preferred position and work rate of players
fig = px.bar(df2, 
             x="preferred_position", 
             y="value_euro",
             color="work_rate", 
             title='Graph of value, preferred position and work rate of players',
             text="name")
fig.show()

In [None]:
#Area chart of football players value against their age in their respective positions on the field
fig = px.area(df2,
              x="age",
              y="value_euro",
              color="preferred_position",
              line_group="name",
              title='Area chart of football players value against their age')
fig.show()

In [None]:
#Group the data by preferred position:
general = df2.groupby('preferred_position')['preferred_position'].count().reset_index(name = "count")

#Apply px.pie:
fig = px.pie(general,
             values ='count',
             names ='preferred_position',
             title='Pie-Chart of Number of players in each Preferred Position', color = 'count')

#Add text and define text information:
fig.update_traces(textposition='inside', textinfo='percent+value')
fig.show()

In [None]:
#Group the data by preferred foot:
general = df2.groupby('preferred_foot')['preferred_foot'].count().reset_index(name = "count")

#Apply px.pie:
fig = px.pie(general,
             values ='count',
             names ='preferred_foot',
             title='Pie-Chart of Number (and Percentage) of players according to Preferred Foot', color = 'count')

#Add text and define text information:
fig.update_traces(textposition='inside', textinfo='percent+value')
fig.show()

From the preferred-position pie chart above, we can see that most players play at CB, ST and CM.

In [None]:
#Plotting a graph of football players nationality distribution
fig = px.histogram(
    df2, 
    "nationality", 
    nbins=50, 
    title='Count distribution by Nationality'
)

fig.show()

In [None]:
#plotting Scatter plot for Value (€) against Overall Rating
fig = px.scatter(
    df2,
    x='overall_rating', 
    y='value_euro', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Value (€) against Overall Rating')

fig.show()

In [None]:
#Plotting a scatter plot for Overall Rating against Age
fig = px.scatter(
    df2,
    x='age', 
    y='overall_rating', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Overall Rating against players Age' 
)

fig.show()

#Top 10 players by potential.
df2.sort_values("potential", ascending=False)[['name', "value_euro",  'potential', "overall_rating", "age",]].head(10)

In [None]:
#Plotting a scatter plot for Potential against Age
fig = px.scatter(
    df2, 
    x='age', 
    y='potential', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Potential against Age' 
)

fig.show()

In [None]:
#Plotting a scatter plot for Value (€) against Potential
fig = px.scatter(
    df2, 
    x='potential', 
    y='value_euro', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Value (€) against Potential' 
)

fig.show()

In [None]:
#Plotting a scatter plot for Overall Rating against Potential
fig = px.scatter(
    df2, 
    x='potential', 
    y='overall_rating', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Overall Rating against Potential' 
)

fig.show()


#Top 10 players by potential
df2.sort_values("potential", ascending=False)[['name', "age", "value_euro", "overall_rating", 'potential']].head(10)

In [None]:
#Plotting a histogram of player value distribution
fig = px.histogram(
    df2, 
    "value_euro", 
    nbins=100, 
    title='Histogram showing value distribution',
    width=800,
    height=600
)

fig.show()

Let's check the distribution of the `value_euro` column. As the histogram shows, the majority of players are valued at less than `15M euro (€)`.

From the figure above, we also looked at the average value for each `age` in our dataset. As the plot shows, the age associated with the highest transfer value is `27 years`.

In [None]:
#Checking top of our dataset
df2.head(3)

In [None]:
#Finding correlation using pearson in our df2 data
pearsoncorr=df2.corr(method='pearson', numeric_only=True)
pearsoncorr
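
As a reminder of what the Pearson coefficient measures, the sketch below computes it by hand for two short lists. The numbers are made up for illustration, and `pearson_r` is a hypothetical helper rather than part of pandas.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

ratings = [60, 65, 70, 75, 80]  # made-up ratings
values = [2, 4, 5, 4, 5]        # made-up values (in millions)

print(round(pearson_r(ratings, values), 4))  # 0.7746
```

A result close to 1 or -1 indicates a strong linear relationship, while a result near 0 indicates little linear relationship.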

In [None]:
matrix = df2.corr(numeric_only=True).round(2)
sns.heatmap(matrix, annot=True, vmax=1, vmin=-1, center=0, cmap='vlag')
plt.show()

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        5. SAMPLING
    </center>
</h1>

---

In [None]:
#Returning a single random row, since no sample size is specified yet
df2.sample()

In [None]:
#Drawing a random sample of 500 players (random_state makes the sample reproducible)
subset = df2.sample(n=500, random_state=42)

In [None]:
#Displaying our subset
subset

In [None]:
#Checking shape of our subset data
subset.shape

In [None]:
# count of players in the dataset
print(f'Count of players in the dataset is: {subset.shape[0]} players')

In [None]:
#Checking nationality in our subset data
subset['nationality'].value_counts()

In [None]:
#Count in our preferred position column of our subset data
subset['preferred_position'].value_counts()

In [None]:
#Exploring value-count statistics for columns 2 to 10 of our subset
for i in subset.columns[2:11]:
    print(f'Variable: {i}')
    print(subset[i].value_counts().describe())
    print('.'*50)

In [None]:
#Finding correlation using pearson in our subset data
subset.corr(method='pearson', numeric_only=True)

In [None]:
#Top of our subset head
subset.head()

In [None]:
#Shape of our subset
subset.shape

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        6. FEATURE SELECTION
    </center>
</h1>

---

In [None]:
#Filtering important features in our dataset
features = [
    'age', 'composure', 'ball_control', 'reactions', 'potential', 'crossing', 'aggression',
    'standing_tackle','interceptions', 'jumping', 'short_passing', 'long_passing',
    'heading_accuracy', 'balance', 'strength', 'agility', 'height_cm', 'weight_kgs', 'freekick_accuracy', 'penalties', 'value_euro']

#Displaying features selected (copy() avoids modifying a view of subset later).
df3 = subset[features].copy()
df3.head()

In [None]:
#Value varies greatly, so use log transformation for modelling
#Value distribution after log transformation
df3["value_euro"] = np.log(df3["value_euro"])
df3["value_euro"].hist();
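
The effect of the log transformation can be illustrated with a few made-up values. This is a minimal sketch in plain Python, not the notebook's data. It shows that values spread over several orders of magnitude become evenly spaced on the log scale, and that `exp` converts them back to euros.

```python
import math

values_euro = [50_000, 500_000, 5_000_000, 50_000_000]  # made-up values

log_values = [math.log(v) for v in values_euro]   # natural log, as np.log does
restored = [math.exp(lv) for lv in log_values]    # back to euros

# Each tenfold jump in euros becomes the same additive step on the log scale
steps = [b - a for a, b in zip(log_values, log_values[1:])]
print([round(s, 4) for s in steps])  # [2.3026, 2.3026, 2.3026]
```

Because the models below are trained on the log of `value_euro`, their predictions and error metrics are also on the log scale; applying `np.exp` to a prediction would return it to euros.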

In [None]:
#Finding correlation using pearson in our df3 data after feature selection
pearsoncorr=df3.corr(method='pearson')
pearsoncorr

In [None]:
corr = df3.corr()[['value_euro']].sort_values(by='value_euro', ascending=False)
sns.heatmap(corr, annot=True)

In [None]:
#Heatmap of new selected features
matrix = df3.corr().round(2)
sns.heatmap(matrix, annot=True, vmax=1, vmin=-1, center=0, cmap='vlag')
plt.show()

In [None]:
#Plotting scatter plot for log Value (€) against Potential
fig = px.scatter(
    df3, 
    x='potential', 
    y='value_euro', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Value (€) against potential', trendline="ols")

fig.show()

In [None]:
#Plotting scatter plot for log Value (€) against Reactions
fig = px.scatter(
    df3, 
    x='reactions', 
    y='value_euro', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Value (€) against Reactions', trendline="ols")

fig.show()

In [None]:
#Plotting scatter plot for log Value (€) against Composure
fig = px.scatter(
    df3, 
    x='composure', 
    y='value_euro', 
    height=SCATTER_SIZE,
    width=SCATTER_SIZE,
    title='Scatter plot for Value (€) against Composure', trendline="ols")

fig.show()

In [None]:
#Checking shape of our data
df3.shape

In [None]:
# count of players in the dataset after sampling
print(f'Count of players in the dataset is: {df3.shape[0]} players')

In [None]:
#Checking description of our data
df3.describe()

In [None]:
# Checking null values.
df3.isnull().any()

In [None]:
#Checking data types
df3.dtypes

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        7. MODEL BUILDING
    </center>
</h1>

---

In [None]:
#CREATING TEST SET
#Assigning values to X and y.
y = df3.value_euro
X = df3.drop(['value_euro'], axis=1)

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
#Division of train and test data
print('Train',' ','Test')
print(len(X_train),'+',len(X_test),'=',len(y_train)+len(y_test))

## MODEL SELECTION

Fitting different machine learning (ML) algorithms and choosing the one that best fits our data:

1. DECISION TREE
2. RANDOM FOREST
3. LINEAR REGRESSION

### I.  DECISION TREE

In [None]:
# Create Decision Tree regressor object
tree_reg = DecisionTreeRegressor(random_state=42,
                                 max_depth=3)

#Feeding the train data to our model, so it can figure out how it should make its predictions in the future on new data.
tree_reg.fit(X_train, y_train)

In [None]:
#Instructing our model to predict future player values. 
y_pred = tree_reg.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('RSquared:', metrics.r2_score(y_test, y_pred))
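
For reference, the four metrics printed above can be computed by hand. The sketch below uses made-up actual and predicted values, not results from this notebook, to show what each metric measures.

```python
from math import sqrt

y_true = [3.0, 5.0, 2.0, 7.0]  # made-up actual values
y_hat = [2.5, 5.0, 3.0, 8.0]   # made-up predictions

n = len(y_true)
errors = [a - p for a, p in zip(y_true, y_hat)]

mae = sum(abs(e) for e in errors) / n   # average absolute error
mse = sum(e ** 2 for e in errors) / n   # average squared error
rmse = sqrt(mse)                        # same units as the target

# R squared: share of the target's variance explained by the predictions
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_true) ** 2 for a in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, round(r2, 4))  # 0.625 0.5625 0.75 0.8475
```

Lower MAE, MSE and RMSE indicate better predictions, while R squared is better the closer it is to 1.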

In [None]:
# plot the decision tree
plt.figure(figsize=(20,15))
plot_tree(tree_reg, filled=True, feature_names=X.columns)
plt.show()

### II.  RANDOM FOREST

In [None]:
# Initializing the Random Forest Regression model with 5 decision trees
forest_reg = RandomForestRegressor(n_estimators = 5, random_state = 42, max_depth=2)

# Fitting the Random Forest Regression model to the data
forest_reg.fit(X_train, y_train)

In [None]:
y_pred = forest_reg.predict(X_test)

print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('RSquared:', r2_score(y_test, y_pred))

In [None]:
features = X.columns

# Obtain just the first tree
first_tree = forest_reg.estimators_[0]

plt.figure(figsize=(20,15))
tree.plot_tree(first_tree,
               feature_names=features,
               fontsize=8, 
               filled=True, 
               rounded=True
               );

### III. LINEAR REGRESSION

In [None]:
lin_reg = LinearRegression() #Instantiate linear regression object
lin_reg.fit(X_train, y_train) #Fit the model

In [None]:
#Getting the coefficients
lin_reg.coef_

In [None]:
#Intercept
lin_reg.intercept_

In [None]:
#Let's find the intercept and co-efficient for each column in our training dataset.
#Slope and intercept values
pd.DataFrame(data = np.append(lin_reg.intercept_ , lin_reg.coef_),
index = ['Intercept']+[col+" Coef." for col in X.columns],
columns=['value_euro']).sort_values('value_euro', ascending=False)

`Interpreting Linear Regression Coefficients`


A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.
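
Because the target in this notebook is the natural log of `value_euro`, a coefficient can also be read as a percentage effect: a one-unit increase in a feature multiplies the predicted value by `exp(coefficient)`, holding the other features fixed. The sketch below shows the conversion. The coefficients are made-up examples and `pct_change_per_unit` is a hypothetical helper, not part of scikit-learn.

```python
import math

def pct_change_per_unit(coef):
    """Percentage change in predicted value for a one-unit feature increase,
    given a coefficient from a model fitted on a log-transformed target."""
    return (math.exp(coef) - 1) * 100

print(round(pct_change_per_unit(0.05), 2))   # 5.13
print(round(pct_change_per_unit(-0.02), 2))  # -1.98
```

For small coefficients the percentage is close to the coefficient times 100, which gives a quick way to read the table above.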

In [None]:
# Regression line formula: y = mx + b
# Where y:
#       Is the predicted target label.
#       m:
#       Is the slope of the line.
#       b:
#       Is the y intercept.

In [None]:
#Predicting the output of new observations with the trained model.
y_pred = lin_reg.predict(X_test)

#Evaluating the predictions against the test set.
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('RSquared:', r2_score(y_test, y_pred))

In [None]:
y_pred

In [None]:
#Predicted vs actual values, with the identity line (perfect prediction) in black
plt.scatter(y_test, y_pred, label="Test data", color='red', alpha=0.5)
plt.plot([min(y), max(y)], [min(y), max(y)], color='black')
plt.xlabel("value_euro")
plt.ylabel("value_prediction")
plt.title("Real Values vs predictions (Linear Regression)")
plt.legend()
plt.grid()
plt.show()

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        8. MODEL EVALUATION
    </center>
</h1>

---

In [None]:
#Creating a statistical dataframe for all models (Model, R2 Score, RMSE, MAE, MSE)
models = [lin_reg, tree_reg, forest_reg]
rows = []

for model in models:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    r2=r2_score(y_test, y_pred)
    rmse=np.sqrt(mean_squared_error(y_test, y_pred))
    mae=mean_absolute_error(y_test, y_pred)
    mse=mean_squared_error(y_test, y_pred)

    #DataFrame.append was removed in pandas 2.0, so rows are collected in a list
    rows.append({"Model Name":model.__class__.__name__, "RSquared":r2, "RMSE":rmse, "MAE":mae, "MSE":mse})

overral = pd.DataFrame(rows, columns=["Model Name","RSquared", "RMSE", "MAE", "MSE"])

#Sorting by RSquared in descending order
overral = overral.sort_values(by="RSquared", ascending=False)
overral

In [None]:
#Assuming we have trained our model and obtained predictions on the testing dataset
predictions_reg = lin_reg.predict(X_test)
predictions_tree = tree_reg.predict(X_test)
predictions_forest = forest_reg.predict(X_test)

# Create a new DataFrame for predictions and actual values
results_df = pd.DataFrame({'predicted market value (Linear Regression)': predictions_reg,
                           'predicted market value (Decision Tree)':predictions_tree,
                           'predicted market value (Random Forest)': predictions_forest,
                           'value_euro': y_test})

#New DataFrame comparing the actual values with the predictions of the 3 models.
results_df.head(5)

### MEASURE OF ERRORS VISUALIZATION

In [None]:
#Graph of Coefficient of Determination (RSquared) for all Models
fig = px.bar(overral,
             x="RSquared",
             y='Model Name',
             #color='RMSE',
             title="Coefficient of Determination (RSquared) for all Models")
fig.show()

In [None]:
#Graph of Root Mean Squared Error across all Models
fig = px.bar(overral,
             x="RMSE",
             y='Model Name',
             title="Root Mean Squared Error across all Models")
fig.show()

In [None]:
#Graph of Mean Absolute Error across all Model
fig = px.bar(overral,
             x="MAE",
             y='Model Name',
             title="Mean Absolute Error across all Models")
fig.show()

In [None]:
#Graph of Mean Squared Error across all Models
fig = px.bar(overral,
             x="MSE",
             y='Model Name',
             title="Mean Squared Error across all Models")
fig.show()

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
        9. MODEL VALIDATION
    </center>
</h1>

---

### HYPERPARAMETER TUNING USING GRID SEARCH

#### GRADIENT BOOSTING REGRESSOR

In [None]:
#Creating a GBR object
GBR = GradientBoostingRegressor()

In [None]:
#Making a dictionary called parameters that holds four hyperparameters to search over
parameters = {'learning_rate': [0.01,0.02,0.03,0.04],
                  'subsample'    : [0.9, 0.5, 0.2, 0.1],
                  'n_estimators' : [100,500,1000, 1500],
                  'max_depth'    : [4,6,8,10]
                 }

In [None]:
#Making an object grid_GBR for GridSearchCV and fitting the dataset
grid_GBR = GridSearchCV(estimator=GBR, param_grid = parameters, cv = 2, n_jobs=-1)
grid_GBR.fit(X_train, y_train)

In [None]:
#Printing all results
print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n",grid_GBR.best_estimator_)
print("\n The best parameters across ALL searched params:\n",grid_GBR.best_params_)
print("\n The best GridSearchCV score across ALL searched params:\n",grid_GBR.best_score_)
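
To get a feel for the cost of this grid search, the sketch below counts the candidate combinations in the `parameters` grid defined above, using only the standard library. The grid is copied from the cell above; `cv_folds` mirrors `cv = 2`.

```python
from itertools import product

parameters = {'learning_rate': [0.01, 0.02, 0.03, 0.04],
              'subsample': [0.9, 0.5, 0.2, 0.1],
              'n_estimators': [100, 500, 1000, 1500],
              'max_depth': [4, 6, 8, 10]}

# Every combination of one value per hyperparameter is a candidate model
combinations = list(product(*parameters.values()))
cv_folds = 2

print(len(combinations))             # 256
print(len(combinations) * cv_folds)  # 512
```

Each of the 256 candidates is fitted once per fold, giving 512 fits, and with the default `refit=True` the best candidate is then fitted once more on the full training set. Adding values to any list therefore multiplies the run time.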

In [None]:
#Creating a statistical dataframe for all models including GBR (Model, R2 Score, RMSE, MAE, MSE)
#grid_GBR.best_estimator_ is the tuned GBR; the plain GBR object would use default settings
models = [lin_reg, tree_reg, forest_reg, grid_GBR.best_estimator_]
rows = []

for model in models:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    r2=r2_score(y_test, y_pred)
    rmse=np.sqrt(mean_squared_error(y_test, y_pred))
    mae=mean_absolute_error(y_test, y_pred)
    mse=mean_squared_error(y_test, y_pred)

    #DataFrame.append was removed in pandas 2.0, so rows are collected in a list
    rows.append({"Model Name":model.__class__.__name__, "RSquared":r2, "RMSE":rmse, "MAE":mae, "MSE":mse})

overral = pd.DataFrame(rows, columns=["Model Name","RSquared", "RMSE", "MAE", "MSE"])

#Sorting by RSquared in descending order
overral = overral.sort_values(by="RSquared", ascending=False)
overral

### FEATURE IMPORTANCE

### 1. MEAN DECREASE IN IMPURITY (MDI)

Feature importance helps us understand which features were the most influential in making predictions.

#### a). DECISION TREE FEATURE IMPORTANCE

In [None]:
# Get feature importances for Tree Regression
importances = tree_reg.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.title("Tree Regression Feature Importances (MDI)")
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation=90)
plt.tight_layout()
plt.show()

#Feature importance in percentages
for column, importance in zip(X.columns, tree_reg.feature_importances_):
    print(f" Feature Importance: '{column}' = {round(importance * 100, 2)}%.")

#### b). RANDOM FOREST FEATURE IMPORTANCE

In [None]:
# Get feature importances for RandomForest
importances = forest_reg.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.title("Random Forest Feature Importances (MDI)")
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation=90)
plt.tight_layout()
plt.show()

#Feature importance in percentages
for column, importance in zip(X.columns, forest_reg.feature_importances_):
    print(f" Feature Importance: '{column}' = {round(importance * 100, 2)}%.")

### 2. ORDINARY LEAST SQUARES (OLS) MODEL FOR LINEAR REGRESSION

In [None]:
#Adding a constant (intercept) column
X = sm.add_constant(X)

#Statistical summary of our fitted OLS model
lin_reg_model = sm.OLS(y, X).fit()
lin_reg_model.summary()

In [None]:
#Printing RSquare and Adjusted RSquare
print("Adjusted RSquare: ", lin_reg_model.rsquared_adj)
print("RSquare: ", lin_reg_model.rsquared)
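
The difference between the two figures printed above comes from a penalty for the number of predictors. The sketch below shows the standard adjusted R squared formula. The R squared of 0.90 is a made-up example, the 500 samples and 20 features mirror the sample size and feature count used in this notebook, and `adjusted_r2` is a hypothetical helper rather than a statsmodels function.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R squared: penalises R squared for each extra predictor."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# 500 sampled players and 20 predictors, with a made-up R squared of 0.90
print(round(adjusted_r2(0.90, 500, 20), 4))  # 0.8958
```

The adjusted figure is always at or below the plain R squared, and the gap widens as more features are added relative to the number of samples.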

---
<h1 style='background:cornflowerblue;
           border:0;
           color:black'>
    <center>
       10. REFERENCES:
    </center>
</h1>

---

1. https://towardsdatascience.com/simple-football-data-set-exploration-with-pandas-60a2bc56bd5a


2. https://www.geeksforgeeks.org/python-pandas-dataframe-corr/


3. https://www.statology.org/pandas-add-count-column/


4. https://plotly.com/python/plotly-express/


5. https://www.datatechnotes.com/2019/10/accuracy-check-in-python-mae-mse-rmse-r.html


6. https://sp-selvan.medium.com/split-data-set-to-train-the-model-decision-tree-regressor-18469de05466


7. https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/


8. https://data36.com/regression-tree-python-scikit-learn/


9. https://saturncloud.io/blog/how-to-detect-and-exclude-outliers-in-a-pandas-dataframe/#:~:text=To%20exclude%20outliers%20from%20our,the%20rows%20containing%20the%20outliers.


10. https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/