We now use boosting to predict `Salary` in the `Hitters` data set.

In [0]:
%pip install --quiet mlxtend

### Preprocessing

In [0]:
# import relevant statistical packages
import numpy as np
import pandas as pd

In [0]:
# import relevant data visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [0]:
# import modelling packages
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score as r2, mean_squared_error
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from mlxtend.plotting import plot_linear_regression as PLS
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

In [0]:
# import and preprocess data
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Hitters.csv"
Hitters = spark.read.option("header", "true").csv(url).toPandas()

str_cols = ["Names", "NewLeague", "League", "Division"]
num_cols = list(set(Hitters.columns) - set(str_cols))
Hitters["Salary"] = np.where(Hitters["Salary"] == "NA", np.nan, Hitters["Salary"])
Hitters[str_cols] = Hitters[str_cols].astype(str)
Hitters[num_cols] = Hitters[num_cols].astype(float)

In [0]:
Hitters.head()

**a. Remove the observations for whom the salary information is
unknown, and then log-transform the salaries.**

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
sns.heatmap(Hitters.isna(), cmap='viridis', yticklabels=False, cbar=False)
plt.title('heatmap to visualise missing data', fontsize=30, color='m')
plt.xlabel('features', fontsize=20, color='c')

In [0]:
Hitters.dropna(axis=0, inplace=True)

In [0]:
Hitters.head()

In [0]:
plt.xkcd()
plt.figure(figsize=(25, 10))
sns.heatmap(Hitters.isna(), cmap='viridis', yticklabels=False, cbar=False)
plt.title('heatmap to visualise missing data', fontsize=30, color='m')
plt.xlabel('features', fontsize=20, color='c')

So, I have removed all observations where Salary information is unknown.

In [0]:
Hitters.Salary = np.log(Hitters.Salary)

In [0]:
Hitters.head()

Therefore, I have log-transformed the salaries.

In [0]:
Hitters.League.value_counts()

In [0]:
Hitters.Division.value_counts()

In [0]:
Hitters.NewLeague.value_counts()

In [0]:
Hitters.League = Hitters.League.map({'N': 0, 'A': 1})
Hitters.Division = Hitters.Division.map({'W': 0, 'E': 1})
Hitters.NewLeague = Hitters.NewLeague.map({'N': 0, 'A': 1})

In [0]:
Hitters.head()

**b. Create a training set consisting of the first 200 observations, and
a test set consisting of the remaining observations.**

In [0]:
X = Hitters.drop(columns=['Salary', 'Names'])
y = Hitters.Salary

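# NOTE: this draws a random split (test_size is 63/263) rather than taking the
# first 200 observations as the exercise asks; all results below refer to this split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.23954372623, random_state=42)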

In [0]:
X_train.info()
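
To follow the exercise literally, the split can keep the original row order instead of sampling at random. The sketch below is one way to do that with `train_test_split`. The `demo` frame and its column names are hypothetical stand-ins for the cleaned `Hitters` data, assumed here to have 263 rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned Hitters frame (263 rows assumed).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(263, 3)), columns=["x1", "x2", "Salary"])

X_demo = demo.drop(columns=["Salary"])
y_demo = demo["Salary"]

# shuffle=False preserves row order, so the first 200 rows become the
# training set and the remaining rows become the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=200, shuffle=False
)

print(X_tr.shape, X_te.shape)
```

Slicing with `X_demo.iloc[:200]` and `X_demo.iloc[200:]` would give the same partition. Switching the notebook to this split would change every MSE reported afterwards.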

**c. Perform boosting on the training set with 1,000 trees for a range
of values of the shrinkage parameter λ. Produce a plot with
different shrinkage values on the x-axis and the corresponding
training set MSE on the y-axis.**

In [0]:
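SP = np.linspace(start=0.001, stop=0.9, num=100)
train_mse, test_mse = [], []

for k in SP:
    boost = GradientBoostingRegressor(n_estimators=1000, max_depth=4, learning_rate=k).fit(X_train, y_train)
    # training MSE answers part (c); test MSE is reused in part (d)
    train_mse.append(mean_squared_error(y_train, boost.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, boost.predict(X_test)))

# DataFrame.append was removed in pandas 2.0, so build the frames from lists
MSE_train = pd.DataFrame({'MSE': train_mse})
MSE = pd.DataFrame({'MSE': test_mse})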

In [0]:
MSE.head()
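
Part (c) asks for the training-set MSE as a function of the shrinkage value. The sketch below shows how those values could be computed. It assumes synthetic data from `make_regression` in place of `X_train` and `y_train`, and it uses only 100 trees and four shrinkage values so that it runs quickly. The helper name `training_mse_by_shrinkage` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for (X_train, y_train); this is not the Hitters data.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)


def training_mse_by_shrinkage(X, y, shrinkages, n_estimators=100):
    """Fit one boosted model per shrinkage value and return the training MSEs."""
    errors = []
    for lam in shrinkages:
        model = GradientBoostingRegressor(
            n_estimators=n_estimators, max_depth=4, learning_rate=lam, random_state=0
        )
        model.fit(X, y)
        # Evaluate on the same data the model was fitted to.
        errors.append(mean_squared_error(y, model.predict(X)))
    return np.array(errors)


lams = np.array([0.001, 0.01, 0.1, 0.5])
train_errors = training_mse_by_shrinkage(X_syn, y_syn, lams)

for lam, err in zip(lams, train_errors):
    print(f"shrinkage={lam}: training MSE={err:.3f}")
```

Passing `lams` and `train_errors` to `plt.plot` would give the requested plot, with shrinkage on the x-axis and training MSE on the y-axis.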

**d. Produce a plot with different shrinkage values on the x-axis and
the corresponding test set MSE on the y-axis.**

In [0]:
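plt.xkcd()
plt.figure(figsize=(25, 10))
mse_values = MSE['MSE'].to_numpy()  # test MSE for each shrinkage value in SP
# shrinkage goes on the x-axis and test MSE on the y-axis, as the exercise asks
plt.scatter(SP, mse_values, alpha=1)
sns.regplot(x=SP, y=mse_values, ci=95, line_kws={'color': 'g', 'ls': '-.'})
plt.title('test MSE vs shrinkage values', fontsize=30, color='m')
plt.xlabel('shrinkage values', fontsize=20, color='c')
plt.ylabel('test MSE', fontsize=20, color='c')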

**e. Compare the test MSE of boosting to the test MSE that results
from applying two of the regression approaches seen in
Chapters 3 and 6.**

In [0]:
from sklearn.linear_model import LinearRegression

In [0]:
lmreg = LinearRegression().fit(X_train, y_train)
lmpred = lmreg.predict(X_test)
print("MSE from linear regression: ", mean_squared_error(y_test, lmpred))

In [0]:
# average of the test MSEs over all shrinkage values tried above
print("Mean test MSE from boosting: ", MSE['MSE'].mean())

Therefore, averaged over the shrinkage values tried, boosting gives a lower test MSE than linear regression.
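
The exercise asks for two regression approaches, and only ordinary least squares is fitted above. The sketch below shows how a second, Chapter 6 style model (ridge regression with a cross-validated penalty) could be scored in the same way. It assumes synthetic data in place of the Hitters split. The helper `holdout_mse` and the `alphas` grid are illustrative choices, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Hitters design matrix (assumed 263 x 19).
X_syn, y_syn = make_regression(n_samples=263, n_features=19, noise=15.0, random_state=0)
X_tr, X_te = X_syn[:200], X_syn[200:]
y_tr, y_te = y_syn[:200], y_syn[200:]


def holdout_mse(model, X_train, y_train, X_test, y_test):
    """Fit on the training data and return the MSE on the held-out data."""
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


candidates = {
    "linear regression": LinearRegression(),
    # Ridge is scale-sensitive, so predictors are standardised first;
    # RidgeCV picks the penalty from the grid by cross-validation.
    "ridge regression": make_pipeline(
        StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 25))
    ),
}

results = {
    name: holdout_mse(model, X_tr, y_tr, X_te, y_te)
    for name, model in candidates.items()
}
for name, value in results.items():
    print(f"{name}: test MSE = {value:.3f}")
```

Calling `holdout_mse` with the notebook's `X_train`, `y_train`, `X_test` and `y_test` would give figures directly comparable to the boosting result.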

**f. Which variables appear to be the most important predictors in
the boosted model?**

In [0]:
# `boost` is the last model fitted in the loop above, i.e. the one with the largest shrinkage value
feature_importance = boost.feature_importances_ * 100
rel_imp = pd.Series(feature_importance, index=X.columns).sort_values()

plt.xkcd()
rel_imp.plot(kind='barh', color='y', figsize=(25, 10), grid=True)
plt.xlabel('variable importance', fontsize=20, color='c')
plt.ylabel('variables', fontsize=20, color='c')
plt.title('importance of each variable', fontsize=30, color='m')

Therefore, 'CRuns', 'CRBI' and 'AtBat' are the most important variables.
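
The importances above come from `boost`, which is whichever model the loop fitted last. A possibly more defensible ranking refits a single model at a deliberately chosen shrinkage value and reads the importances from that. The sketch below assumes synthetic data in which only the first two columns carry signal. The value `chosen_shrinkage = 0.1` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in: with shuffle=False, only the first two columns are informative.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=6, n_informative=2,
    noise=1.0, shuffle=False, random_state=0,
)
cols = [f"f{i}" for i in range(X_syn.shape[1])]

chosen_shrinkage = 0.1  # hypothetical; in practice chosen by validation
model = GradientBoostingRegressor(
    n_estimators=100, max_depth=4, learning_rate=chosen_shrinkage, random_state=0
).fit(X_syn, y_syn)

# Impurity-based importances, rescaled to percentages and sorted from largest to smallest.
importances = pd.Series(model.feature_importances_ * 100, index=cols)
importances = importances.sort_values(ascending=False)
print(importances.head(3))
```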

**g. Now apply bagging to the training set. What is the test set MSE
for this approach?**

In [0]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

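# max_features=19 is meant to consider every predictor at each split, which makes this forest equivalent to bagging
bag = RandomForestRegressor(max_features=19).fit(X_train, y_train)
bag_pred = bag.predict(X_test)

plt.xkcd()
plt.figure(figsize=(25, 10))
plt.scatter(bag_pred, y_test, label='log(Salary)', color='g')
# reference line y = x: points on it are predicted exactly
lims = [min(bag_pred.min(), y_test.min()), max(bag_pred.max(), y_test.max())]
plt.plot(lims, lims, 'r')
plt.xlabel('pred')
plt.ylabel('y_test')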

print("Mean Squared Error: ", mean_squared_error(y_test, bag_pred))

The test MSE for bagging is $\approx 0.25$, which is lower than the mean test MSE reported for boosting above.