# Predicting player ratings
 
In this project you will predict the overall rating of soccer players from attributes such as 'crossing' and 'finishing'.
 
The dataset comes from the **European Soccer Database** (https://www.kaggle.com/hugomathien/soccer), which covers more than 25,000 matches and more than 10,000 players from European professional soccer seasons between 2008 and 2016.

# About the Dataset 
 
The ultimate Soccer database for **data analysis and machine learning.** 
The dataset comes in the form of an SQL database and contains statistics for about 25,000 football matches from the top football leagues of 11 European countries. It covers seasons from 2008 to 2016 and contains match statistics (e.g. scores, corners, fouls) as well as the team formations, with player names and a pair of coordinates to indicate their position on the pitch.
 
* +25,000 matches  
* +10,000 players  
* 11 European Countries with their lead championship  
* Seasons 2008 to 2016  
* Players' and teams' attributes sourced from EA Sports' FIFA video game series, including the weekly updates  
* Team line up with squad formation (X, Y coordinates)  
* Betting odds from up to 10 providers  
* Detailed match events (goal types, possession, corner, cross, fouls, cards etc...) for +10,000 matches 
 
 
The dataset also has a set of about 35 statistics for each player, derived from EA Sports' FIFA video games. It is not just the stats that come with a new version of the game but also the weekly updates. So for instance if a player has performed poorly over a period of time and his stats get impacted in FIFA, you would normally see the same in the dataset. 

## Machine Learning skills required to complete the project 
 
**Supervised learning** 
 
Supervised learning deals with learning a function from available training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. 

**Regression** 
 
Regression predicts a continuous (dependent) variable from a set of independent variables. Linear regression, the simplest case, is a parametric technique: it makes certain assumptions about the data, such as a linear relationship between the features and the target. When the data set follows those assumptions, regression can give very good results.
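
As a minimal sketch of what fitting a linear regression means, the example below solves an ordinary least squares problem with NumPy. The data are hypothetical toy values (not taken from the soccer database), constructed so that the rating is an exact linear function of two attributes.

```python
import numpy as np

# Hypothetical toy data: two attribute columns per player.
X = np.array([[60.0, 50.0],
              [70.0, 55.0],
              [80.0, 65.0],
              [90.0, 85.0]])

# Rating built as an exact linear function: 10 + 0.5 * a + 0.3 * b.
y = 10.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1]

# Prepend a column of ones so the model can learn an intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: coefficients minimising ||X_design @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predictions are a weighted sum of the inputs plus the intercept.
predictions = X_design @ coef
print(coef)
```

Because the toy target is exactly linear, the recovered coefficients match the ones used to build it. Real ratings are noisier, which is why the models in this notebook are compared on held-out data.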
 
**Model evaluation**
 
You should know how to judge a model on unseen data and which metric to choose to measure its performance.
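
Two common regression metrics are the root mean squared error (RMSE) and the coefficient of determination (R²). The sketch below computes both by hand on hypothetical ratings and predictions; the helper names `rmse` and `r_squared` are illustrative and not part of any library.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical ratings and predictions, for illustration only.
y_true = [60, 70, 80, 90]
y_pred = [62, 68, 81, 89]

print(rmse(y_true, y_pred))       # error in rating points
print(r_squared(y_true, y_pred))  # share of variance explained
```

RMSE is expressed in the same units as the rating, while R² is unitless, with 1.0 meaning a perfect fit. The `score` method of scikit-learn regressors returns R².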

# Let's get started
 


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
pip install "scikit_learn==0.21.3"

In [None]:
import sqlite3
import numpy as np
import pandas as pd
%matplotlib notebook
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
from xgboost import plot_importance

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel  # Imputer is not part of feature_selection; SimpleImputer is imported below
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, ShuffleSplit, RandomizedSearchCV
from sklearn.pipeline import make_pipeline

import pickle

In [None]:
cnx = sqlite3.connect('/kaggle/input/soccer/database.sqlite')

In [None]:
dd = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", cnx)

In [None]:
print(dd)
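
The query above reads `sqlite_master`, SQLite's built-in catalogue of schema objects. As a minimal sketch, the same lookup is shown below against an in-memory database with two hypothetical tables standing in for `database.sqlite`.

```python
import sqlite3

# In-memory database; the two tables are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Player (id INTEGER PRIMARY KEY, player_name TEXT)")
conn.execute("CREATE TABLE Player_Attributes (id INTEGER PRIMARY KEY, overall_rating INTEGER)")

# Every table, index, and view is listed in sqlite_master.
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()

# Each row is a one-element tuple; unpack it into a plain list of names.
table_names = [name for (name,) in rows]
conn.close()
print(table_names)
```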

In [None]:
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)

In [None]:
df.head()

In [None]:
target = df.pop('overall_rating')

In [None]:
df.shape

In [None]:
target.head()

## Imputing the target variable :

In [None]:
target.isnull().values.sum()

There are 836 missing values in the target variable.

In [None]:
target.describe()

In [None]:
plt.hist(target.dropna(), bins=30, range=(33, 94))  # drop missing ratings before plotting

The distribution is close to normal, so we impute the missing target values with the mean.

In [None]:
y = target.fillna(target.mean())

In [None]:
y.isnull().values.any()
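
Filling a missing target with the mean is a judgment call, since the filled rows carry a label that was never observed. As a hedged sketch on hypothetical toy data, the example below contrasts mean filling with a common alternative, dropping the rows whose target is missing.

```python
import pandas as pd

# Hypothetical toy frame: one feature and a target with one missing value.
toy = pd.DataFrame({
    "crossing": [49.0, 80.0, 65.0, 70.0],
    "overall_rating": [67.0, None, 74.0, 71.0],
})

# Approach used in this notebook: replace the gap with the target mean.
filled = toy["overall_rating"].fillna(toy["overall_rating"].mean())

# Alternative: keep only the rows where the target was actually observed.
observed = toy.dropna(subset=["overall_rating"])

print(filled.tolist())
print(len(observed))
```

Mean filling keeps every row, while dropping keeps every label genuine. With 836 missing values in a table of this size, either choice affects only a small share of the data.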

## Data Exploration :

In [None]:
df.columns

In [None]:
for col in df.columns:
    unique_cat = len(df[col].unique())
    print("{col}--> {unique_cat}..{typ}".format(col=col, unique_cat=unique_cat, typ=df[col].dtype))

Only four features have the type 'object'. The 'date' feature has no significance for this problem, so we can ignore it and one-hot encode the remaining three.

In [None]:
dummy_df = pd.get_dummies(df, columns=['preferred_foot', 'attacking_work_rate', 'defensive_work_rate'])
dummy_df.head()
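
To make the effect of one-hot encoding concrete, here is a minimal sketch on a hypothetical three-row frame. `pd.get_dummies` replaces the categorical column with one indicator column per category and leaves numeric columns untouched.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "finishing": [44, 71, 60],
    "preferred_foot": ["right", "left", "right"],
})

# One indicator column is created for each distinct category value.
encoded = pd.get_dummies(toy, columns=["preferred_foot"])

print(list(encoded.columns))
```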

In [None]:
X = dummy_df.drop(['id', 'date'], axis=1)

***
## Feature selection :

* Some tree-based models, such as XGBoost, can handle missing values in the data set, but feature selection with `SelectFromModel` cannot be performed on data that contains null values. Therefore, we also impute the missing values in the features.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
# Impute null values in each column with that column's mean.
# The means are learned from the training set only and reused on the test set.
imput = SimpleImputer()
X_train = imput.fit_transform(X_train)
X_test = imput.transform(X_test)
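
Imputation statistics are best learned from the training split only and then reused on the test split; calling `fit_transform` on the test set would recompute the means from test data. A minimal sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column splits, each with one missing value.
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column mean from the training data (here 2.0).
train_filled = imputer.fit_transform(train)

# transform reuses the training mean, so the test value 100.0 has no influence.
test_filled = imputer.transform(test)

print(imputer.statistics_)
print(test_filled.ravel())
```

The missing test value is filled with the training mean rather than with a value derived from the test set, which keeps the test data unseen during preprocessing.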

In [None]:
# Fit a model to obtain feature importances; they help decide the selection threshold
model = XGBRegressor()
model.fit(X_train, y_train)
print(model.feature_importances_)

In [None]:
selection = SelectFromModel(model, threshold=0.01, prefit=True)

select_X_train = selection.transform(X_train)
select_X_test = selection.transform(X_test)

In [None]:
select_X_train.shape
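
With a numeric threshold, `SelectFromModel` keeps the features whose importance is greater than or equal to that threshold. The sketch below reproduces this rule with plain NumPy; the importance values are hypothetical and not the ones produced by the model above.

```python
import numpy as np

# Hypothetical importances for five features.
importances = np.array([0.40, 0.005, 0.25, 0.009, 0.336])
threshold = 0.01

# Boolean mask: True for features that meet the threshold.
mask = importances >= threshold

# Toy feature matrix with 3 rows and 5 columns.
X_toy = np.arange(15, dtype=float).reshape(3, 5)

# Keep only the selected columns, as selection.transform would.
X_selected = X_toy[:, mask]

print(mask)
print(X_selected.shape)
```

Raising the threshold keeps fewer features, so printing the importances first, as done above, helps pick a sensible cut-off.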

## Training different models :

### 1. Linear Regression :

In [None]:
pipe = make_pipeline(StandardScaler(),             # preprocessing (standard scaling)
                     LinearRegression())           # estimator (linear regression)

cv = ShuffleSplit(random_state=0)   # cross-validation strategy (shuffle splitting)

param_grid = {'linearregression__n_jobs': [-1]}     # parameters for model tuning

grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv)
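
`ShuffleSplit` draws each train/validation split independently at random. By default it produces 10 splits and holds out 10% of the rows in each. The sketch below uses hypothetical toy data and smaller, explicit settings to show the sizes of the resulting index sets.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical toy data: 10 samples, 2 features.
X_toy = np.arange(20).reshape(10, 2)

# Three random splits, each holding out 20% of the rows for validation.
cv_demo = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

split_sizes = []
for train_idx, val_idx in cv_demo.split(X_toy):
    # Within one split the two index sets never overlap.
    assert set(train_idx).isdisjoint(val_idx)
    split_sizes.append((len(train_idx), len(val_idx)))

print(split_sizes)
```

Unlike k-fold cross-validation, validation sets from different splits may overlap, because every split is sampled on its own.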

In [None]:
grid.fit(select_X_train, y_train)          #training 

In [None]:
grid.best_params_

In [None]:
lin_reg = pickle.dumps(grid)
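
`pickle.dumps` serialises the fitted search object to bytes, which takes a snapshot of it before the name `grid` is reused for the next model; `pickle.loads` later restores an independent copy. A minimal sketch, using a plain dictionary as a hypothetical stand-in for the fitted object:

```python
import pickle

# Hypothetical stand-in for a fitted search object.
fitted = {"best_params": {"max_depth": 7}, "cv_scores": [0.91, 0.93]}

# Serialise to bytes: this freezes the current state.
snapshot = pickle.dumps(fitted)

# Later changes to the original do not affect the snapshot.
fitted["best_params"]["max_depth"] = 3

# Restore an independent copy from the bytes.
restored = pickle.loads(snapshot)

print(restored["best_params"])
```

Assigning each fitted search to its own variable name would work as well; the pickled form is mainly useful when the model should also be written to disk.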

### 2. Decision Tree :

In [None]:
pipe = make_pipeline(StandardScaler(),                  #preprocessing
                     DecisionTreeRegressor(criterion='mse', random_state=0))          #estimator

cv = ShuffleSplit(n_splits=10, random_state=42)        #cross validation

param_grid = {'decisiontreeregressor__max_depth': [3, 5, 7, 9, 13]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv)

In [None]:
grid.fit(select_X_train, y_train)          #training 

In [None]:
grid.best_params_

In [None]:
Dectree_reg = pickle.dumps(grid)

### 3. Random Forest :

In [None]:
pipe = make_pipeline(StandardScaler(),
                     RandomForestRegressor(n_estimators=500, random_state=123))

cv = ShuffleSplit(test_size=0.2, random_state=0)

param_grid = {'randomforestregressor__max_features':['sqrt', 'log2', 10],
              'randomforestregressor__max_depth':[9, 11, 13]}                 

grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv)

In [None]:
grid.fit(select_X_train, y_train)          #training 

In [None]:
grid.best_params_

In [None]:
Randfor_reg = pickle.dumps(grid)

### 4. XGBoost regressor :

In [None]:
pipe = make_pipeline(StandardScaler(),
                     XGBRegressor(n_estimators= 500, random_state=42))

cv = ShuffleSplit(n_splits=10, random_state=0)

param_grid = {'xgbregressor__max_depth': [5, 7],
              'xgbregressor__learning_rate': [0.1, 0.3]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv, n_jobs= -1)

In [None]:
grid.fit(select_X_train, y_train)

In [None]:
grid.best_params_

In [None]:
xgbreg = pickle.dumps(grid)

## <u>Comparison between different models</u> :

In [None]:
lin_reg = pickle.loads(lin_reg)
Dectree_reg = pickle.loads(Dectree_reg)
Randfor_reg = pickle.loads(Randfor_reg)
xgbreg = pickle.loads(xgbreg)

In [None]:
# score() on a regressor returns the coefficient of determination (R^2), not accuracy
print("""Linear Regressor R^2 score is {lin}
DecisionTree Regressor R^2 score is {Dec}
RandomForest regressor R^2 score is {ran}
XGBoost regressor R^2 score is {xgb}""".format(lin=lin_reg.score(select_X_test, y_test),
                                               Dec=Dectree_reg.score(select_X_test, y_test),
                                               ran=Randfor_reg.score(select_X_test, y_test),
                                               xgb=xgbreg.score(select_X_test, y_test)))

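From the comparison above, the XGBoost regressor gives a better result than the other models: it predicts the target with a score of approximately 0.98 on the test set. Note that the `score` method of a regressor returns the coefficient of determination (R²), not a classification accuracy.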
