
# Salary Prediction on Hitters Data Set:


## AIM
My aim in this study is to build machine learning models on the Hitters data set and minimize their prediction error. The data set and the steps taken are described below.

## Hitters: Baseball Data

### Description
Major League Baseball Data from the 1986 and 1987 seasons.
    
### Format
A data frame with 322 observations of major league players on the following 20 variables.

### Variables

* AtBat  : Number of times at bat in 1986
* Hits    : Number of hits in 1986
* HmRun   : Number of home runs in 1986
* Runs    : Number of runs in 1986
* RBI     : Number of runs batted in in 1986
* Walks   : Number of walks in 1986
* Years   : Number of years in the major leagues
* CAtBat  : Number of times at bat during his career
* CHits   : Number of hits during his career
* CHmRun  : Number of home runs during his career
* CRuns   : Number of runs during his career
* CRBI    : Number of runs batted in during his career
* CWalks  : Number of walks during his career
* League  : A factor with levels A and N indicating player's league at the end of 1986
* Division: A factor with levels E and W indicating player's division at the end of 1986
* PutOuts : Number of put outs in 1986
* Assists : Number of assists in 1986
* Errors  : Number of errors in 1986
* Salary  : 1987 annual salary on opening day in thousands of dollars
* NewLeague: A factor with levels A and N indicating player's league at the beginning of 1987.

### Source
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.
       
## 1. Library Import Operations:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import scale 
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import neighbors
from sklearn.svm import SVR
import xgboost
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

## 2. Reading Data:

In [None]:
hitters = pd.read_csv('../input/hitters/Hitters.csv')
hitters.head()

In [None]:
#Exploratory Data Analysis
#Structural information of the data set
hitters.info()

In [None]:
hitters.isnull().sum()

The data set contains three variables of type 'object', and the 'Salary' variable has 59 missing values.

First, we convert the 'object' variables to numeric indicator columns with 'pd.get_dummies'.

In [None]:
dummies = pd.get_dummies(hitters[['League', 'Division', 'NewLeague']]) 
dummies.head()

In [None]:
X_ = hitters.drop(['League', 'Division', 'NewLeague'], axis=1).astype('float64') 

hitters = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1) 

hitters.info()
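
The encoding above builds all dummy columns and then keeps one per factor. As a hedged alternative, the same three indicator columns can be produced in one call with `drop_first=True`. The toy frame below is illustrative only and is not taken from the Hitters file.

```python
import pandas as pd

# Toy frame with the same three categorical columns as Hitters (values are made up).
toy = pd.DataFrame({
    "League": ["A", "N", "N", "A"],
    "Division": ["E", "W", "E", "W"],
    "NewLeague": ["A", "N", "A", "A"],
    "Hits": [81, 130, 141, 87],
})

# drop_first=True drops the first level of each factor, leaving
# League_N, Division_W and NewLeague_N; dtype=int gives 0/1 columns.
encoded = pd.get_dummies(toy, columns=["League", "Division", "NewLeague"],
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Passing `dtype=int` keeps the indicators numeric regardless of the pandas version's default dummy dtype.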

In [None]:
hitters.describe().T

## 3. We create different data sets for the different scenarios that we will apply to salary estimation.

### 3.1. The data set 'df_1' is created by simply deleting the missing data, without changing the variables:

In [None]:
df_1 = hitters.dropna()
df_1.head()

In [None]:
df_1.info()

### 3.2. The data set 'df_2' is created by replacing the missing values of 'Salary' with the mean of that variable:

In [None]:
df = hitters.copy()
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())  # assign back instead of chained inplace fill
df_2 = df.copy()
df_2.info()

### 3.3. The data set 'df_3' is created by filling the missing data with Gradient Boosting Regression predictions:

In [None]:
hitters.head()

In [None]:
null = hitters[hitters['Salary'].isnull()]
# Selection of observations with missing data
null.head()

In [None]:
df = hitters.dropna() #Delete observations with missing data
X_train = df.drop('Salary', axis = 1) #Train set definition
X_train.head()

In [None]:
y_train = df['Salary'] #Dependent variable of the train set (a 1-D Series, as scikit-learn expects)
y_train.head()

In [None]:
X_test = null.drop('Salary', axis = 1) #Defining observations with missing data in the data set as a test set
X_test.head()

In [None]:
gbm_model = GradientBoostingRegressor().fit(X_train, y_train)
gbm_model_pred_test = gbm_model.predict(X_test)
gbm_model_pred_test

In [None]:
X_test['Salary'] = gbm_model_pred_test

In [None]:
df_3 = pd.concat([df, X_test], ignore_index = True)
df_3.head()

In [None]:
df_3.info()

In [None]:
df_3.describe().T
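
The imputation steps in this section can be wrapped in one function. The sketch below is an assumption-laden illustration rather than the notebook's exact procedure: `impute_with_model` is a hypothetical helper, the data are synthetic, and, unlike the `pd.concat` approach above, it fills values in place so the original row order is kept.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def impute_with_model(frame, target, random_state=0):
    """Return a copy of `frame` with missing `target` values replaced by model predictions."""
    filled = frame.copy()
    missing = filled[target].isnull()
    if not missing.any():
        return filled
    features = filled.columns.drop(target)
    # Train only on rows where the target is observed.
    model = GradientBoostingRegressor(random_state=random_state)
    model.fit(filled.loc[~missing, features], filled.loc[~missing, target])
    # Predict the target for the rows where it is missing.
    filled.loc[missing, target] = model.predict(filled.loc[missing, features])
    return filled

# Synthetic stand-in for the real data: salary depends linearly on hits.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Hits": rng.integers(20, 200, size=60).astype(float)})
demo["Salary"] = 3.0 * demo["Hits"] + rng.normal(0, 10, size=60)
demo.loc[::10, "Salary"] = np.nan  # remove six salaries
demo_filled = impute_with_model(demo, "Salary")
print(demo_filled["Salary"].isnull().sum())
```

Fixing `random_state` makes the imputed values reproducible between runs.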

### 3.4. 'df_4' is Created by Filling the Missing Data with Predicted Values and Suppressing the Outliers:

In [None]:
df_3.info()

### Outliers are identified with Local Outlier Factor.

In [None]:
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(n_neighbors = 20, contamination = 0.1)
clf.fit_predict(df_3)
df_scores = clf.negative_outlier_factor_
df_scores[0:20]

In [None]:
np.sort(df_scores)

In [None]:
np.sort(df_scores)[16]

In [None]:
threshold_value = np.sort(df_scores)[16]
threshold_value

In [None]:
outlier_df = df_scores > threshold_value

In [None]:
df_3[df_scores == threshold_value]


In [None]:
pressure_value = df_3[df_scores == threshold_value]

In [None]:
outlier = df_3[~outlier_df] 

In [None]:
outlier.to_records(index=False)

In [None]:
res = outlier.to_records(index=False)

In [None]:
res[:] = pressure_value.to_records(index = False)

In [None]:
outlier = pd.DataFrame(res, index = df_3[~outlier_df].index)
outlier.describe().T

In [None]:
n_outlier = df_3[outlier_df]
n_outlier.describe().T

In [None]:
df_4 = pd.concat([n_outlier, outlier], ignore_index = True)
df_4.describe().T
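
The suppression procedure above can be summarised as: score every row with Local Outlier Factor, pick the row at a chosen rank as the threshold, and overwrite all more anomalous rows with it. Below is a minimal sketch on synthetic points; `suppress_outliers` and its `n_outliers` argument are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

def suppress_outliers(frame, n_outliers, n_neighbors=20):
    """Overwrite the `n_outliers` most anomalous rows with the row at the threshold score."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit_predict(frame)
    scores = lof.negative_outlier_factor_
    order = np.argsort(scores)  # most anomalous rows come first
    threshold_row = frame.iloc[order[n_outliers]].to_numpy()
    result = frame.copy()
    # Repeat the threshold row once per outlier and write it over those rows.
    result.iloc[order[:n_outliers], :] = np.tile(threshold_row, (n_outliers, 1))
    return result, scores

# 100 points, three of which are moved far away from the cluster.
rng = np.random.default_rng(0)
points = rng.normal(0, 1, size=(100, 2))
points[:3] = [[15, 15], [-12, 14], [13, -16]]
demo = pd.DataFrame(points, columns=["x", "y"])

suppressed, scores = suppress_outliers(demo, n_outliers=3)
print(suppressed.abs().max())
```

Working with positions from `np.argsort` avoids the round trip through `to_records` and keeps the original row order.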

### 3.5. 'df_5' is Created by Filling the Missing Data with Predictions and Deleting the Outliers:

In [None]:
df_3.info()

In [None]:
df_5 = df_3[df_scores > threshold_value]
df_5.info()

### 3.6. 'df_6' is Created by Deleting the Missing Data and Suppressing the Outliers:

In [None]:
df.info()

In [None]:
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(n_neighbors = 20, contamination = 0.1)
clf.fit_predict(df)
df6_scores = clf.negative_outlier_factor_
df6_scores[0:20]

In [None]:
np.sort(df6_scores)


In [None]:
np.sort(df6_scores)[8]

In [None]:
threshold_value6 = np.sort(df6_scores)[8]
threshold_value6

In [None]:
outlier_df6 = df6_scores > threshold_value6
outlier_df6

In [None]:
df[df6_scores == threshold_value6]

In [None]:
pressure_value6 = df[df6_scores == threshold_value6]

In [None]:
outlier6 = df[~outlier_df6] 


In [None]:
outlier6.to_records(index=False)

In [None]:
res6 = outlier6.to_records(index=False)


In [None]:
res6[:] = pressure_value6.to_records(index = False)


In [None]:
n_outlier6 = df[outlier_df6]
n_outlier6.describe().T

In [None]:
outlier6 = pd.DataFrame(res6, index = df[~outlier_df6].index)
outlier6.describe().T

In [None]:
df_6 = pd.concat([n_outlier6, outlier6], ignore_index = True)
df_6.describe().T

### 3.7. 'df_7' is Created by Deleting Both the Missing Data and the Outliers:

In [None]:
df_7 = n_outlier6
df_7.info()

## 4. Feature Engineering:

In [None]:
df_8 = hitters.copy()
df_8.info()

### 4.1.  Categorical variables "League_N, Division_W, NewLeague_N" 

In [None]:
cat_df = df_8.select_dtypes(include=["uint8", "bool"])  # recent pandas versions return bool dummy columns
cat_df.head()

In [None]:
print(cat_df.League_N.unique())
print(cat_df["League_N"].value_counts().count())
print(cat_df["League_N"].value_counts())
print(df_8["League_N"].value_counts().plot.barh())
df_8.groupby('League_N')['Salary'].mean()

In [None]:
print(cat_df.Division_W.unique())
print(cat_df["Division_W"].value_counts().count())
print(cat_df["Division_W"].value_counts())
print(df_8["Division_W"].value_counts().plot.barh())
df_8.groupby('Division_W')['Salary'].mean()

In [None]:
print(cat_df.NewLeague_N.unique())
print(cat_df["NewLeague_N"].value_counts().count())
print(cat_df["NewLeague_N"].value_counts())
print(df_8["NewLeague_N"].value_counts().plot.barh())
df_8.groupby('NewLeague_N')['Salary'].mean()

In [None]:
Experience = []
for ex in df_8['Years']:
    if ex < 5:
        Experience.append(1)
    elif (ex >= 5) & (ex < 10):
        Experience.append(2)
    elif (ex >= 10) & (ex < 15):
        Experience.append(3)
    elif (ex >= 15) & (ex < 20):
        Experience.append(4)
    else:
        Experience.append(5)
df_8['Experience'] = Experience

### The 'Years' variable takes values between 1 and 24. We encode it as 'Experience' levels 1 to 5, corresponding to 0-4, 5-9, 10-14, 15-19 and 20 or more years.
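
The same binning can be written without a loop using `pd.cut`. This is a sketch on a handful of illustrative values, not the notebook's data.

```python
import pandas as pd

years = pd.Series([1, 4, 5, 9, 10, 14, 15, 19, 20, 24], name="Years")

# right=False makes each bin closed on the left:
# [0, 5), [5, 10), [10, 15), [15, 20), [20, inf)
experience = pd.cut(years,
                    bins=[0, 5, 10, 15, 20, float("inf")],
                    labels=[1, 2, 3, 4, 5],
                    right=False).astype(int)
print(experience.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```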

In [None]:
df_8.groupby(['League_N', 'Division_W', 'NewLeague_N'])['Salary'].mean()

In [None]:
df_8.groupby(['League_N', 'Division_W', 'NewLeague_N', 'Experience'])['Salary'].mean()

### The data are grouped by 'League_N', 'Division_W', 'NewLeague_N' and 'Experience', the mean of 'Salary' is computed for each group, and missing 'Salary' values are replaced by the mean of their group.

In [None]:
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 1), "Salary"] = 145.961538
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 2), "Salary"] = 774.434536
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 3), "Salary"] = 918.073533
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 4), "Salary"] = 614.375000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 2), "Salary"] = 850.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 3), "Salary"] = 833.333333
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 1), "Salary"] = 203.821429
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 2), "Salary"] = 528.108696
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 3), "Salary"] = 786.916700
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 4), "Salary"] = 479.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 1), "Salary"] = 96.666667
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 0) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 3), "Salary"] = 825.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 1), "Salary"] = 70.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 2), "Salary"] = 525.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 3), "Salary"] = 500.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 4), "Salary"] = 1050.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 1), "Salary"] = 313.753320
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 2), "Salary"] = 776.095190
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 3), "Salary"] = 949.010143
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 0) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 4), "Salary"] = 486.111000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 1), "Salary"] = 565.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 2), "Salary"] = 405.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 0) & (df_8['Experience'] == 3), "Salary"] = 250.000000
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 1), "Salary"] = 188.138889
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 2), "Salary"] = 538.114053
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 3), "Salary"] = 723.452429
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 4), "Salary"] = 763.666600
df_8.loc[(df_8["Salary"].isnull()) & (df_8["League_N"] == 1) & (df_8['Division_W'] == 1) & (df_8["NewLeague_N"] == 1) & (df_8['Experience'] == 5), "Salary"] = 475.000000
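
Typing each group mean by hand is error-prone. A hedged alternative is to let pandas compute and broadcast the group means with `groupby(...).transform("mean")`. The toy frame below uses two grouping columns and made-up salaries for brevity.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "League_N":   [0, 0, 0, 1, 1, 1],
    "Experience": [1, 1, 1, 2, 2, 2],
    "Salary":     [100.0, 200.0, np.nan, 500.0, np.nan, 700.0],
})

# transform("mean") returns one value per row: the mean salary of that row's group
# (missing salaries are ignored when the mean is computed).
group_mean = toy.groupby(["League_N", "Experience"])["Salary"].transform("mean")
toy["Salary"] = toy["Salary"].fillna(group_mean)
print(toy["Salary"].tolist())  # [100.0, 200.0, 150.0, 500.0, 600.0, 700.0]
```

A group in which every salary is missing has no mean, so its rows stay missing and still need separate handling.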


In [None]:
df_8.info()

### 4.2. Adding Variables:

The dataset contains each player's 1986 statistics, career statistics, and years of experience. We add the per-year career averages and the ratio of each 1986 statistic to its career total.

In [None]:
df_8['AtBat_rate'] = df_8["CAtBat"] / df_8["Years"]
df_8['Hits_rate'] = df_8["CHits"] / df_8["Years"]
df_8['HmRun_rate'] = df_8["CHmRun"] / df_8["Years"]
df_8['Runs_rate'] = df_8["CRuns"] / df_8["Years"]
df_8['RBI_rate'] = df_8["CRBI"] / df_8["Years"]
df_8['Walks_rate'] = df_8["CWalks"] / df_8["Years"]

df_8['1986_AtBat_rate'] = df_8["AtBat"] / df_8["CAtBat"]
df_8['1986_Hits_rate'] = df_8["Hits"] / df_8["CHits"]
df_8['1986_HmRun_rate'] = df_8["HmRun"] / df_8["CHmRun"]
df_8['1986_Runs_rate'] = df_8["Runs"] / df_8["CRuns"]
df_8['1986_RBI_rate'] = df_8["RBI"] / df_8["CRBI"]
df_8['1986_Walks_rate'] = df_8["Walks"] / df_8["CWalks"]
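
Ratios such as `HmRun / CHmRun` become NaN when both values are zero (0/0). The sketch below shows one way to guard against that. `safe_ratio` is a hypothetical helper, and filling with 0 is an assumption that may not suit every variable.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator, fill=0.0):
    """Divide two Series, returning `fill` wherever the denominator is zero."""
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.fillna(fill)

# Made-up values; the second player has no career home runs.
toy = pd.DataFrame({"HmRun": [5, 0, 12], "CHmRun": [50, 0, 12]})
toy["1986_HmRun_rate"] = safe_ratio(toy["HmRun"], toy["CHmRun"])
print(toy["1986_HmRun_rate"].tolist())  # [0.1, 0.0, 1.0]
```

With a guard like this, rows with a zero denominator would not be removed by a later `dropna()`.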

In [None]:
df_8.info()

In [None]:
df_8 = df_8.dropna()

In [None]:
df_8.info()

## 5. Predict:

### We have 8 data sets. Each of the following models will be fitted to every one of them.

### Models:

          Linear Regression
          Ridge Regression
          Lasso Regression
          ElasticNet Regression
          LightGBM Regression
          XGBoost Regression
          GradientBoosting Regression 
          RandomForest Regression 
          DecisionTree Regression
          MLP Regression
          KNeighbors Regression
          SupportVector Regression
          
### First, predictions are made without hyperparameter optimization.
### Each data set is split into an 80% train set and a 20% test set with 'random_state = 46'.

In [None]:
def compML(df, y, alg):
    # train-test split
    y = df[y]
    X = df.drop('Salary', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20,random_state=46)
    # modelling
    model = alg().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
    model_name = alg.__name__
    print("  for data set  ", model_name, " Model Test Error: ",RMSE)
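
Printing the scores makes it hard to compare data sets side by side. A possible variation is to collect the RMSE values in a table with one row per model and one column per data set. The function `rmse_table` and the synthetic frames below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def rmse_table(datasets, target, algorithms, random_state=46):
    """Fit every algorithm on every data set and return test RMSE as a model x dataset table."""
    rows = []
    for data_name, frame in datasets.items():
        X = frame.drop(columns=target)
        y = frame[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.20, random_state=random_state)
        for alg in algorithms:
            model = alg().fit(X_train, y_train)
            rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
            rows.append({"dataset": data_name, "model": alg.__name__, "RMSE": rmse})
    return pd.DataFrame(rows).pivot(index="model", columns="dataset", values="RMSE")

# Synthetic data with a linear relationship, standing in for df_1 ... df_8.
rng = np.random.default_rng(46)
base = pd.DataFrame(rng.normal(size=(120, 3)), columns=["x1", "x2", "x3"])
base["Salary"] = 100 + 50 * base["x1"] - 20 * base["x2"] + rng.normal(0, 5, size=120)
datasets = {"full": base, "first_100": base.iloc[:100]}

table = rmse_table(datasets, "Salary", [LinearRegression, Ridge, DecisionTreeRegressor])
print(table.round(2))
```

In the notebook, `datasets` would map names such as `"df_1"` to the corresponding frames.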

In [None]:
models = [LinearRegression,
          Ridge,
          Lasso,
          ElasticNet,
          LGBMRegressor, 
          XGBRegressor, 
          GradientBoostingRegressor, 
          RandomForestRegressor, 
          DecisionTreeRegressor,
          MLPRegressor,
          KNeighborsRegressor, 
          SVR]

In [None]:
for i in models:
    compML(df_1, "Salary", i)

In [None]:
for i in models:
    compML(df_2, "Salary", i)

In [None]:
for i in models:
    compML(df_3, "Salary", i)

In [None]:
for i in models:
    compML(df_4, "Salary", i)

In [None]:
for i in models:
    compML(df_5, "Salary", i)

In [None]:
for i in models:
    compML(df_6, "Salary", i)

In [None]:
for i in models:
    compML(df_7, "Salary", i)

In [None]:
for i in models:
    compML(df_8, "Salary", i)

### The results above were obtained without hyperparameter optimization.

# 6. Hyperparameter Optimization

### In the first round of estimates, the most successful data set was df_4. Hyperparameter optimization is therefore carried out on this data set.

## 6.1. KNN

In [None]:
df_4.head()

In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
knn_model = KNeighborsRegressor().fit(X_train, y_train)

In [None]:
knn = KNeighborsRegressor()
knn_params = {"n_neighbors": np.arange(1,30,1)}

In [None]:
knn_cv_model = GridSearchCV(knn, knn_params, cv = 10).fit(X_train, y_train)

In [None]:
knn_cv_model.best_params_

In [None]:
knn_tuned = KNeighborsRegressor(n_neighbors = knn_cv_model.best_params_["n_neighbors"]).fit(X_train, y_train)

In [None]:
knn_tuned_y_pred = knn_tuned.predict(X_test)

In [None]:
knn_tuned_RMSE = np.sqrt(mean_squared_error(y_test, knn_tuned_y_pred))
knn_tuned_RMSE
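
By default `GridSearchCV` ranks regressors by their `score` method, which is R². Since this study reports RMSE, the search can be asked to optimise the same metric, and the refitted `best_estimator_` can be used directly. The sketch below uses synthetic data from `make_regression` as a stand-in for df_4.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem (an assumption; not the Hitters data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=46)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

search = GridSearchCV(KNeighborsRegressor(),
                      {"n_neighbors": np.arange(1, 30)},
                      cv=10,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

best_knn = search.best_estimator_   # already refitted on the whole training set
cv_rmse = -search.best_score_       # the scorer is negated, so flip the sign
test_rmse = np.sqrt(mean_squared_error(y_test, best_knn.predict(X_test)))
print(search.best_params_, cv_rmse, test_rmse)
```

Using `best_estimator_` also removes the need to copy the best parameters into a new model by hand.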

## 6.2. Support Vector Regression

In [None]:
df_4.head()

In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
svr_model = SVR(kernel = "linear")

In [None]:
svr_params = {"C": [0.1,0.5,1,3]}

In [None]:
svr_cv_model = GridSearchCV(svr_model, svr_params, cv = 5, verbose = 2, n_jobs = -1).fit(X_train, y_train)

In [None]:
svr_cv_model.best_params_

In [None]:
svr_tuned = SVR(kernel = "linear", C = 3).fit(X_train, y_train)

In [None]:
svr_model_y_pred = svr_tuned.predict(X_test)

In [None]:
svr_model_tuned_RMSE = np.sqrt(mean_squared_error(y_test, svr_model_y_pred))
svr_model_tuned_RMSE

## 6.3. Artificial Neural Networks

In [None]:
df_4.head()

In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

In [None]:
# Reuse the scaler fitted on the train set; refitting on the test set would leak test statistics
X_test_scaled = scaler.transform(X_test)

In [None]:
mlp_model = MLPRegressor().fit(X_train_scaled, y_train)

In [None]:
mlp_params = {"alpha": [0.1, 0.01, 0.02, 0.001, 0.0001], 
             "hidden_layer_sizes": [(10,20), (5,5), (100,100)]}

In [None]:
mlp_cv_model = GridSearchCV(mlp_model, mlp_params, cv = 10, verbose = 2, n_jobs = -1).fit(X_train_scaled, y_train)

In [None]:
mlp_cv_model.best_params_

In [None]:
mlp_tuned = MLPRegressor(alpha = 0.001, hidden_layer_sizes = (100,100)).fit(X_train_scaled, y_train)

In [None]:
mlp_y_pred = mlp_tuned.predict(X_test_scaled)

In [None]:
mlp_tuned_RMSE = np.sqrt(mean_squared_error(y_test, mlp_y_pred))
mlp_tuned_RMSE
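
Scaling statistics should come only from the training data, including inside each cross-validation fold. One way to get this automatically is to place the scaler and the network in a single pipeline. The following is a small sketch on synthetic data with a deliberately tiny grid, not a reproduction of the search above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (an assumption; not the Hitters data).
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=46)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(StandardScaler(),
                     MLPRegressor(hidden_layer_sizes=(10,), max_iter=300, random_state=46))
params = {"mlpregressor__alpha": [0.01, 0.001]}

# The scaler is refitted on the training part of every fold.
search = GridSearchCV(pipe, params, cv=3).fit(X_train, y_train)

fitted_scaler = search.best_estimator_.named_steps["standardscaler"]
y_pred = search.predict(X_test)  # test data are scaled with training statistics
print(search.best_params_)
```

The network may stop before converging at this iteration budget; the point of the sketch is the data flow, not the score.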

## 6.4. CART (Classification and Regression Tree)




In [None]:
df_4.head()

In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
cart_model = DecisionTreeRegressor()

In [None]:
cart_model.fit(X_train, y_train)

In [None]:
cart_params = {"max_depth": [2,3,4,5,10,20],
              "min_samples_split": [2,5,10,30,50]}

In [None]:
cart_cv_model = GridSearchCV(cart_model, cart_params, cv = 10, verbose = 2, n_jobs = -1).fit(X_train, y_train)

In [None]:
cart_cv_model.best_params_

In [None]:
cart_tuned = DecisionTreeRegressor(max_depth = 4, min_samples_split = 2).fit(X_train, y_train)

In [None]:
cart_model_y_pred = cart_tuned.predict(X_test)
cart_tuned_RMSE = np.sqrt(mean_squared_error(y_test, cart_model_y_pred))
cart_tuned_RMSE

## 6.5. Random Forests




In [None]:
df_4.head()

In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
rf_model = RandomForestRegressor(random_state = 46).fit(X_train, y_train)
rf_model

In [None]:
rf_params = {"max_depth": [5,8,10],
            "max_features": [2,5,10],
            "n_estimators": [200, 500, 1000, 2000],
            "min_samples_split": [2,10,80,100]}

In [None]:
rf_cv_model = GridSearchCV(rf_model, rf_params, cv = 10, n_jobs = -1, verbose = 2).fit(X_train, y_train)

In [None]:
rf_cv_model.best_params_

In [None]:
rf_model = RandomForestRegressor(random_state = 46, 
                                 max_depth = 8,
                                max_features = 5,
                                min_samples_split = 2,
                                 n_estimators = 500)
rf_tuned = rf_model.fit(X_train, y_train)

In [None]:
rf_y_pred = rf_tuned.predict(X_test)
rf_tuned_RMSE = np.sqrt(mean_squared_error(y_test, rf_y_pred))
rf_tuned_RMSE

## Variable Importance

In [None]:
rf_tuned.feature_importances_*100

In [None]:
Importance = pd.DataFrame({'Importance':rf_tuned.feature_importances_*100}, 
                          index = X_train.columns)


Importance.sort_values(by = 'Importance', 
                       axis = 0, 
                       ascending = True).plot(kind = 'barh', 
                                              color = 'r', )

plt.xlabel('Variable Importance')
plt.gca().legend_ = None
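
The importance table is built the same way for every tree-based model, so it can be factored into a small helper. `importance_frame` is a hypothetical name, and the data below are synthetic, with only `x0` driving the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def importance_frame(model, columns):
    """Return feature importances in percent, sorted from most to least important."""
    return (pd.DataFrame({"Importance": model.feature_importances_ * 100}, index=columns)
              .sort_values(by="Importance", ascending=False))

rng = np.random.default_rng(46)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x0", "x1", "x2"])
y = 10 * X["x0"] + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=46).fit(X, y)
ranking = importance_frame(forest, X.columns)
print(ranking)
```

The returned frame can be passed to `.plot(kind="barh")` as in the cell above.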

##  6.6. Gradient Boosting Machines





In [None]:
df_4.head()

In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
gbm_model = GradientBoostingRegressor().fit(X_train, y_train)
gbm_model

In [None]:
gbm_params = {"learning_rate": [0.001,0.1,0.01],
             "max_depth": [3,5,8],
             "n_estimators": [100,200,500],
             "subsample": [1,0.5,0.8],
             "loss": ["squared_error","absolute_error","quantile"]}  # named "ls" and "lad" before scikit-learn 1.0

In [None]:
gbm_cv_model = GridSearchCV(gbm_model, 
                            gbm_params, 
                            cv = 10, 
                            n_jobs=-1, 
                            verbose = 2).fit(X_train, y_train)

In [None]:
gbm_cv_model.best_params_

In [None]:
gbm_tuned = GradientBoostingRegressor(learning_rate = 0.1,
                                     loss = "absolute_error",  # named "lad" before scikit-learn 1.0
                                     max_depth = 3,
                                     n_estimators = 100,
                                     subsample = 1).fit(X_train, y_train)

In [None]:
gbm_tuned_y_pred = gbm_tuned.predict(X_test)
gbm_tuned_RMSE = np.sqrt(mean_squared_error(y_test, gbm_tuned_y_pred))
gbm_tuned_RMSE

## Variable Importance:

In [None]:
Importance = pd.DataFrame({'Importance':gbm_tuned.feature_importances_*100}, 
                          index = X_train.columns)


Importance.sort_values(by = 'Importance', 
                       axis = 0, 
                       ascending = True).plot(kind = 'barh', 
                                              color = 'r', )

plt.xlabel('Variable Importance')
plt.gca().legend_ = None

## 6.7. XGBoost




In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
xgb = XGBRegressor()
xgb

In [None]:
xgb_params = {"learning_rate": [0.1,0.01,0.5],
             "max_depth": [2,3,4,5,8],
             "n_estimators": [100,200,500,1000],
             "colsample_bytree": [0.4,0.7,1]}

In [None]:
xgb_cv_model  = GridSearchCV(xgb,xgb_params, cv = 10, n_jobs = -1, verbose = 2).fit(X_train, y_train)

In [None]:
xgb_cv_model.best_params_

In [None]:
xgb_tuned = XGBRegressor(colsample_bytree = 0.4, 
                         learning_rate = 0.1, 
                         max_depth = 4, 
                         n_estimators = 100).fit(X_train, y_train)

In [None]:
xgb_tuned_y_pred = xgb_tuned.predict(X_test)
xgb_tuned_RMSE = np.sqrt(mean_squared_error(y_test, xgb_tuned_y_pred))
xgb_tuned_RMSE

## 6.8.  LightGBM

In [None]:
df_4.head()

In [None]:
y = df_4['Salary']
X = df_4.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
lgb_model = LGBMRegressor()
lgb_model

In [None]:
lgbm_params = {"learning_rate": [0.01, 0.1, 0.5, 1],
              "n_estimators": [20,40,100,200,500,1000],
              "max_depth": [1,2,3,4,5,6,7,8,9,10]}

In [None]:
lgbm_cv_model = GridSearchCV(lgb_model, 
                             lgbm_params, 
                             cv = 10, 
                             n_jobs = -1, 
                             verbose =2).fit(X_train, y_train)

In [None]:
lgbm_cv_model.best_params_

In [None]:
lgbm_tuned = LGBMRegressor(learning_rate = 0.1, 
                          max_depth = 2, 
                          n_estimators = 200).fit(X_train, y_train)

In [None]:
lgbm_tuned_y_pred = lgbm_tuned.predict(X_test)
lgbm_tuned_RMSE = np.sqrt(mean_squared_error(y_test, lgbm_tuned_y_pred))
lgbm_tuned_RMSE

## 6.9.  CatBoost

In [None]:
cat_df = df_4

In [None]:
y = cat_df['Salary']
X = cat_df.drop('Salary', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

In [None]:
catb_model = CatBoostRegressor()

In [None]:
catb_params = {"iterations": [200,500,100],
              "learning_rate": [0.01,0.1],
              "depth": [3,6,8]}

In [None]:
catb_cv_model = GridSearchCV(catb_model, 
                           catb_params, 
                           cv = 5, 
                           n_jobs = -1, 
                           verbose = 2).fit(X_train, y_train)

In [None]:
catb_cv_model.best_params_

In [None]:
catb_tuned = CatBoostRegressor(depth = 6, iterations = 500, learning_rate = 0.01).fit(X_train, y_train)

In [None]:
catb_tuned_y_pred = catb_tuned.predict(X_test)

In [None]:
catb_tuned_RMSE = np.sqrt(mean_squared_error(y_test, catb_tuned_y_pred))
catb_tuned_RMSE

# CONCLUSION:

#### In the 'Salary Estimation' study on the 'Hitters' data set, a total of 8 data sets were created:

#### df_1: Created by deleting the observations with missing data from the Hitters data set.

#### df_2: Created by filling the missing values of the 'Salary' variable with its mean.

#### df_3: Created by estimating the missing 'Salary' values with a Gradient Boosting Machine model.

#### df_4: Created by suppressing the outliers that Local Outlier Factor identified in df_3.

#### df_5: Created by deleting the outliers that Local Outlier Factor identified in df_3.

#### df_6: Created by deleting the observations with missing data and suppressing the outliers.

#### df_7: Created by deleting both the missing data and the outliers from the Hitters data set.

#### df_8: Created by adding new variables. The 'Years' variable was binned into an 'Experience' variable, missing salaries were filled with group means, and the per-year career averages and the ratios of 1986 performance to career totals were added.



## Then a function was written that fits every regression model, and predictions were made on all data sets with each model.

## Finally, the models were tuned with hyperparameter optimization and the final models were established.