<div align="center">
  <img width="600px" src="https://www.collinsdictionary.com/images/full/baseball_557405302_1000.jpg">
</div>

# Baseball Salary Prediction

**Authors**:
- Cristhian Castillo
- Kevin Zarama

In this notebook, we build and compare regression models that predict the salaries of baseball players.

**Notebook Objective**: Obtain a reliable model for predicting baseball players' salaries.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Packages-and-Functions" data-toc-modified-id="Packages-and-Functions-1">Packages and Functions</a></span><ul class="toc-item"><li><span><a href="#Packages" data-toc-modified-id="Packages-1.1">Packages</a></span></li><li><span><a href="#Custom-Functions" data-toc-modified-id="Custom-Functions-1.2">Custom Functions</a></span></li></ul></li><li><span><a href="#trabajo" data-toc-modified-id="trabajo-2">trabajo</a></span></li><li><span><a href="#Load-datasets" data-toc-modified-id="Load-datasets-3">Load datasets</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-4">Data Preparation</a></span><ul class="toc-item"><li><span><a href="#Remove-Variables" data-toc-modified-id="Remove-Variables-4.1">Remove Variables</a></span></li><li><span><a href="#Separation-of-variables" data-toc-modified-id="Separation-of-variables-4.2">Separation of variables</a></span></li><li><span><a href="#One-Hot-Encoder" data-toc-modified-id="One-Hot-Encoder-4.3">One Hot Encoder</a></span></li></ul></li><li><span><a href="#Baseline" data-toc-modified-id="Baseline-5">Baseline</a></span></li><li><span><a href="#Pipelines" data-toc-modified-id="Pipelines-6">Pipelines</a></span><ul class="toc-item"><li><span><a href="#PCA" data-toc-modified-id="PCA-6.1">PCA</a></span></li><li><span><a href="#Data-Scalling" data-toc-modified-id="Data-Scalling-6.2">Data Scalling</a></span></li></ul></li><li><span><a href="#Base-Model" data-toc-modified-id="Base-Model-7">Base Model</a></span></li></ul></div>

## Packages and Functions

### Packages

In [32]:
import pandas as pd
import numpy as np
import math


from sklearn.preprocessing import StandardScaler

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

# Pipelines
from sklearn.pipeline import Pipeline

# Metrics
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.decomposition import PCA

### Custom Functions 

In [2]:
import sys
sys.path.append("utils/")

In [12]:
from custom_data_preprocessing import get_cols_by_type, one_hot_encoder
from custom_metrics import get_linear_metrics

## trabajo
a.	Train with all variables – Kevin and Jesus

b.	Train with only the numeric variables – Aura and Cristhian

c.	Without outliers – Kevin and Jesus

d.	VIF analysis -> Remove variables with high collinearity - Aura and Cristhian

e.	PCA -> On the full set of variables – Kevin and Jesus

f.	PCA -> On the correlated independent variables - Aura and Cristhian

g.	Feature selection
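
For the VIF step in the plan above, a minimal sketch of how variance inflation factors can be computed is shown below. It uses only NumPy and synthetic data; the helper name `vif_scores` and the toy columns are illustrative assumptions, not part of this notebook. A common rule of thumb treats values above roughly 5 to 10 as a sign of strong collinearity.

```python
import numpy as np

def vif_scores(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column so the residuals have zero mean
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic example: `c` is nearly a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + rng.normal(scale=0.05, size=200)
toy_vif = vif_scores(np.column_stack([a, b, c]))
```

In this toy data the first and third columns should receive large VIF values, while the independent column stays close to 1.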

## Load datasets

In [4]:
df = pd.read_csv("datasets/Baseball_Clean.csv")

In [5]:
df_predict = pd.read_csv("datasets/Baseball_Predict.csv")

## Data Preparation

### Remove Variables

In [6]:
df = df.drop(["Player", "Unnamed: 0"], axis=1)

### Separation of variables

In [7]:
y = df["Salary"]
X = df.drop(["Salary"], axis=1)

### One Hot Encoder

In [10]:
cat_cols, num_cols = get_cols_by_type(X)

In [11]:
X = one_hot_encoder(X, cat_cols, drop_first=True)
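
The helpers `get_cols_by_type` and `one_hot_encoder` are imported from `utils/` and their source is not shown in this notebook. As a rough sketch of what such helpers might do, assuming they split columns by dtype and wrap `pd.get_dummies`, the snippet below defines hypothetical stand-ins (`split_cols_by_type`, `encode_one_hot`) and applies them to a toy frame with made-up values:

```python
import pandas as pd

def split_cols_by_type(frame):
    """Hypothetical stand-in: return (categorical, numeric) column names."""
    num_cols = frame.select_dtypes(include="number").columns.tolist()
    cat_cols = frame.select_dtypes(exclude="number").columns.tolist()
    return cat_cols, num_cols

def encode_one_hot(frame, cat_cols, drop_first=True):
    """Hypothetical stand-in: one-hot encode the given columns with pandas."""
    return pd.get_dummies(frame, columns=cat_cols, drop_first=drop_first)

# Toy data, not taken from the baseball dataset
toy = pd.DataFrame({"League": ["A", "N", "A"], "Hits": [81, 130, 141]})
toy_cat, toy_num = split_cols_by_type(toy)
toy_encoded = encode_one_hot(toy, toy_cat)
```

With `drop_first=True`, one level per categorical column is dropped, which avoids perfectly collinear dummy columns in the linear models.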

## Baseline

In [13]:
baseline_predict = np.full(len(y), y.mean())
baseline_metrics = get_linear_metrics(y, baseline_predict)
baseline_metrics

{'rmse': 450.26022382434286,
 'r2': 0.0,
 'r2_adjusted': 0.5,
 'mae': 343.6183374633141}
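
The baseline predicts the mean salary for every player, which is why its R² is exactly 0. Since `get_linear_metrics` comes from the `custom_metrics` module and is not reproduced here, the following is only an illustrative sketch of how RMSE, R² and MAE can be computed; `sketch_metrics` and the toy salaries are assumptions, and the adjusted R² is left out.

```python
import numpy as np

def sketch_metrics(y_true, y_pred):
    """Illustrative metrics only; not the notebook's get_linear_metrics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
        "mae": float(np.mean(np.abs(residuals))),
    }

toy_y = np.array([100.0, 300.0, 500.0, 700.0])     # made-up salaries
toy_baseline = np.full(len(toy_y), toy_y.mean())   # predict the mean everywhere
toy_metrics = sketch_metrics(toy_y, toy_baseline)
```

For a mean predictor the residual and total sums of squares coincide, so R² is 0 and the RMSE equals the standard deviation of the target.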

## Pipelines

In [28]:
base_models = {
    "LinearRegression": {
        "model": LinearRegression(),
        "steps": [
            ("LinearRegression", LinearRegression())
        ]
    },
    "Ridge": {
        "model": Ridge(),
        "steps": [
            ("Ridge", Ridge())
        ]
    },
    "Lasso": {
        "model": Lasso(),
        "steps": [
            ("Lasso", Lasso())
        ]
    },
    "KNN": {
        "model": KNeighborsRegressor(),
        "steps": [
            ("KNN", KNeighborsRegressor())
        ]
    },
    "CART": {
        "model": DecisionTreeRegressor(),
        "steps": [
            ("CART", DecisionTreeRegressor())
        ]
    },
    "SVR": {
        "model": SVR(),
        "steps": [
            ("SVR", SVR())
        ]
    },
    "XGBoost": {
        "model": XGBRegressor(objective='reg:squarederror'),
        "steps": [
            ("XGBoost", XGBRegressor(objective='reg:squarederror'))
        ]
    }
}

In [26]:
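def train_models(X, y, pipelines, cv=5):
    """Cross-validate every pipeline and return one RMSE row per model."""
    rows = []

    for name, spec in pipelines.items():
        regressor = Pipeline(spec["steps"])

        # The scorer returns negated MSE per fold, so flip the sign before taking the square root
        fold_mse = -cross_val_score(regressor, X, y, cv=cv, scoring="neg_mean_squared_error")
        rows.append({"model": name, "rmse": np.mean(np.sqrt(fold_mse))})

    # DataFrame.append was removed in pandas 2.0; build the frame from a list of rows instead
    return pd.DataFrame(rows, columns=["model", "rmse"])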
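
`train_models` negates the `neg_mean_squared_error` scores and takes the square root of each fold before averaging. As a small sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration), the same per-fold RMSE values can also be obtained from scikit-learn's `neg_root_mean_squared_error` scorer:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, not the baseball dataset
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

# Route 1: negated MSE per fold, then sign flip and square root
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                          scoring="neg_mean_squared_error")
rmse_from_mse = np.sqrt(-neg_mse)

# Route 2: the scorer that already returns negated RMSE per fold
neg_rmse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_direct = -neg_rmse
```

Both routes use the same deterministic folds, so the two arrays should match. Scores are negated because scikit-learn always treats higher scores as better.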

In [29]:
train_models(X, y, base_models, cv=10)

|   | model | rmse |
|---|---|---|
| 0 | LinearRegression | 331.004257 |
| 1 | Ridge | 330.725725 |
| 2 | Lasso | 330.004596 |
| 3 | KNN | 320.038077 |
| 4 | CART | 364.90745 |
| 5 | SVR | 436.323572 |
| 6 | XGBoost | 307.377538 |


In [31]:
train_models(X[num_cols], y, base_models, cv=10)

|   | model | rmse |
|---|---|---|
| 0 | LinearRegression | 332.657182 |
| 1 | Ridge | 332.65183 |
| 2 | Lasso | 332.343035 |
| 3 | KNN | 320.038077 |
| 4 | CART | 361.628178 |
| 5 | SVR | 436.210544 |
| 6 | XGBoost | 296.673911 |


### PCA

In [33]:
models_pca = {
    "LinearRegression": {
        "model": LinearRegression(),
        "steps": [
            ("PCA", PCA()),
            ("LinearRegression", LinearRegression())
        ]
    },
    "Ridge": {
        "model": Ridge(),
        "steps": [
            ("PCA", PCA()),
            ("Ridge", Ridge())
        ]
    },
    "Lasso": {
        "model": Lasso(),
        "steps": [
            ("PCA", PCA()),
            ("Lasso", Lasso())
        ]
    },
    "KNN": {
        "model": KNeighborsRegressor(),
        "steps": [
            ("PCA", PCA()),
            ("KNN", KNeighborsRegressor())
        ]
    },
    "CART": {
        "model": DecisionTreeRegressor(),
        "steps": [
            ("PCA", PCA()),
            ("CART", DecisionTreeRegressor())
        ]
    },
    "SVR": {
        "model": SVR(),
        "steps": [
            ("PCA", PCA()),
            ("SVR", SVR())
        ]
    },
    "XGBoost": {
        "model": XGBRegressor(objective='reg:squarederror'),
        "steps": [
            ("PCA", PCA()),
            ("XGBoost", XGBRegressor(objective='reg:squarederror'))
        ]
    }
}
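
`PCA()` with default arguments keeps every component, and PCA is sensitive to the scale of each feature. As a hedged sketch of one possible variation, assuming standardization before PCA and a 95% explained-variance threshold (both are choices made here for illustration, not settings from this notebook), a pipeline could look like this on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: four features, two of which are near-copies on very different scales
rng = np.random.default_rng(0)
base = rng.normal(size=(80, 2))
X_toy = np.column_stack([
    base[:, 0],
    base[:, 0] * 100 + rng.normal(scale=0.1, size=80),
    base[:, 1],
    base[:, 1] * 0.01 + rng.normal(scale=1e-4, size=80),
])
y_toy = base[:, 0] - base[:, 1] + rng.normal(scale=0.1, size=80)

pca_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("PCA", PCA(n_components=0.95)),  # keep enough components for 95% of the variance
    ("LinearRegression", LinearRegression()),
])
pca_pipeline.fit(X_toy, y_toy)

# Number of components PCA actually retained
n_kept = pca_pipeline.named_steps["PCA"].n_components_
```

Passing a float between 0 and 1 as `n_components` lets PCA choose the number of components from the explained variance, so redundant features collapse into fewer components.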

In [None]:
# Note: this function reuses the name `base_models`, so defining it shadows the dictionary above
def base_models(X, y, cv=5):
    models = [
        ("LinearRegression", LinearRegression()),
        ("Ridge", Ridge()),
        ("Lasso", Lasso()),
        ("ElasticNet", ElasticNet()),
        ("KNN", KNeighborsRegressor()),
        ("CART", DecisionTreeRegressor()),
        ("SVR", SVR()),
        ("XGBoost", XGBRegressor(objective='reg:squarederror'))
    ]

    rows = []

    for name, regressor in models:
        # Negate the scores: "neg_mean_squared_error" yields negative MSE values
        fold_mse = -cross_val_score(regressor, X, y, cv=cv, scoring="neg_mean_squared_error")
        rows.append({"model": name, "rmse": np.mean(np.sqrt(fold_mse))})

    # Build the frame in one step (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns=["model", "rmse"]), models

In [None]:
linear_regression = Pipeline(base_models["LinearRegression"]["steps"])

### Data Scalling

In [153]:
X_scaled = StandardScaler().fit_transform(X[num_cols])

In [156]:
X[num_cols] = pd.DataFrame(X_scaled, columns=X[num_cols].columns)
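
Scaling the full dataset before cross-validation lets each validation fold influence the scaler's mean and variance. A minimal sketch of an alternative, assuming the scaler is placed inside the pipeline so it is refit on each training fold, is shown below with synthetic data (`X_toy`, `y_toy` and `cv_rmse` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature and one large-scale noise feature
rng = np.random.default_rng(1)
X_toy = np.column_stack([rng.normal(size=100), rng.normal(scale=1000.0, size=100)])
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=100)

scaled_knn = Pipeline([("scaler", StandardScaler()), ("KNN", KNeighborsRegressor())])
raw_knn = KNeighborsRegressor()

def cv_rmse(model):
    """Mean per-fold RMSE from 5-fold cross-validation."""
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="neg_mean_squared_error")
    return float(np.mean(np.sqrt(-scores)))

rmse_scaled = cv_rmse(scaled_knn)
rmse_raw = cv_rmse(raw_knn)
```

Distance-based models such as KNN and SVR are the ones most affected by feature scale, so in this toy setup the scaled pipeline should reach a lower RMSE than the unscaled model.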

## Base Model

In [163]:
y = df["Salary"]


In [165]:
baseline_metrics["model"] = "Baseline"

In [166]:
baseline_metrics

{'rmse': 450.26022382434286,
 'r2': 0.0,
 'r2_adjusted': 0.5,
 'mae': 343.6183374633141,
 'model': 'Baseline'}

In [101]:
results, models = base_models(X, y, cv=10)

In [102]:
results

|   | model | rmse |
|---|---|---|
| 0 | LinearRegression | 331.004257 |
| 1 | Ridge | 328.112709 |
| 2 | Lasso | 328.137471 |
| 3 | ElasticNet | 331.871482 |
| 4 | KNN | 300.680935 |
| 5 | CART | 371.588276 |
| 6 | SVR | 441.551393 |
| 7 | XGBoost | 308.386742 |


In [134]:
models_params = {
    "Ridge": {
        'alpha': np.linspace(0, 0.2, 21)
    },
    "Lasso": {
        'alpha': np.linspace(0, 0.2, 21)
    },
    "ElasticNet": {
        "l1_ratio": np.arange(0, 1, 0.01)
    },
    "KNN": {
        "n_neighbors": np.arange(1, 30),
        'weights': ['uniform', 'distance'],
        'leaf_size': np.arange(1, 50),
        'p': [1,2]
    },
    "CART": {
        'max_depth': range(1, 20),
        "min_samples_split": range(2, 30)        
    },
    "SVR": {
        'kernel': ['rbf', 'linear'],
        'gamma': np.logspace(-4, 0, 8),
        'C':  np.logspace(-0, 4, 8),
        'degree' : [3,8]
    },
    "XGBoost": {
        "max_depth": [5, 8, 12, 20],
        "colsample_bytree": [0.5, 0.8, 1]
    }
    
}

In [135]:
def hyperparameter_optimization(X, y, models, models_params, cv=5, gs_cv=3, scoring="neg_mean_squared_error"):
    best_models = {}
    for name, model in models:
        print(f'{name} Hyperparameter Tuning...')
        # The grid search ranks candidates by the estimator's default score (R^2 for regressors);
        # `scoring` is applied only to the final cross-validation below
        gs_best = GridSearchCV(model, models_params[name], cv=gs_cv, n_jobs=-1, verbose=False).fit(X, y)
        # set_params updates `model` in place and returns the same object
        final_model = model.set_params(**gs_best.best_params_)
        # The square root assumes `scoring` is a negated MSE
        rmse = np.mean(np.sqrt(-cross_val_score(final_model, X, y, cv=cv, scoring=scoring)))
        print(f'RMSE: {rmse}')
        print(f'{name} best params: {gs_best.best_params_}')

        best_models[name] = {
            "model": final_model,
            "rmse": rmse,
            "params": gs_best.best_params_
        }
    return best_models
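
`hyperparameter_optimization` calls `GridSearchCV` without a `scoring` argument, so the search itself selects parameters by R² while the reported number is an RMSE. As a sketch of one way to align the two, assuming the `neg_root_mean_squared_error` scorer and a toy Ridge grid (neither is the configuration used in this notebook), the search can optimize RMSE directly:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, not the baseball dataset
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(90, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=90)

search = GridSearchCV(
    Ridge(),
    {"alpha": np.linspace(0.01, 0.2, 20)},
    cv=3,
    scoring="neg_root_mean_squared_error",  # select parameters by RMSE
)
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_          # flip the sign back to a positive RMSE
tuned_ridge = search.best_estimator_     # a refit copy; the original Ridge() is left untouched
```

Using `best_estimator_` also avoids mutating the original model object, which `set_params` does in the function above.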

In [136]:
best_models = hyperparameter_optimization(X, y, models[1:-1], models_params)

Ridge Hyperparameter Tuning...
RMSE: 340.6216251974944
Ridge best params: {'alpha': 0.2}
Lasso Hyperparameter Tuning...


RMSE: 341.16706372769755
Lasso best params: {'alpha': 0.2}
ElasticNet Hyperparameter Tuning...
RMSE: 338.7637987596942
ElasticNet best params: {'l1_ratio': 0.9}
KNN Hyperparameter Tuning...
RMSE: 302.71186279586163
KNN best params: {'leaf_size': 1, 'n_neighbors': 13, 'p': 1, 'weights': 'distance'}
CART Hyperparameter Tuning...
RMSE: 326.57697101862254
CART best params: {'max_depth': 3, 'min_samples_split': 25}
SVR Hyperparameter Tuning...
RMSE: 292.7264338944361
SVR best params: {'C': 2682.6957952797247, 'degree': 3, 'gamma': 0.019306977288832496, 'kernel': 'rbf'}


In [144]:
hyperparameter_optimization(X, y, [models[-1]], models_params)

XGBoost Hyperparameter Tuning...
RMSE: 284.7094100772666
XGBoost best params: {'colsample_bytree': 1, 'max_depth': 5}


{'XGBoost': {'model': XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
               colsample_bynode=None, colsample_bytree=1, gamma=None, gpu_id=None,
               importance_type='gain', interaction_constraints=None,
               learning_rate=None, max_delta_step=None, max_depth=5,
               min_child_weight=None, missing=nan, monotone_constraints=None,
               n_estimators=100, n_jobs=None, num_parallel_tree=None,
               random_state=None, reg_alpha=None, reg_lambda=None,
               scale_pos_weight=None, subsample=None, tree_method=None,
               validate_parameters=None, verbosity=None),
  'rmse': 284.7094100772666,
  'params': {'colsample_bytree': 1, 'max_depth': 5}}}

In [142]:
models[-1]

('XGBoost',
 XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              random_state=None, reg_alpha=None, reg_lambda=None,
              scale_pos_weight=None, subsample=None, tree_method=None,
              validate_parameters=None, verbosity=None))