# <center> Kobe Bryant Shot Selection Analysis </center>

### <center> Exploratory Analysis and Ensemble Prediction of Shots made and missed</center>

![cover](https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/f/1a5b131c-c4f2-4f4b-8587-945e38919401/d2omfj6-684f32d6-3706-4426-b693-a407dbfc93b3.jpg?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJ1cm46YXBwOjdlMGQxODg5ODIyNjQzNzNhNWYwZDQxNWVhMGQyNmUwIiwic3ViIjoidXJuOmFwcDo3ZTBkMTg4OTgyMjY0MzczYTVmMGQ0MTVlYTBkMjZlMCIsImF1ZCI6WyJ1cm46c2VydmljZTpmaWxlLmRvd25sb2FkIl0sIm9iaiI6W1t7InBhdGgiOiIvZi8xYTViMTMxYy1jNGYyLTRmNGItODU4Ny05NDVlMzg5MTk0MDEvZDJvbWZqNi02ODRmMzJkNi0zNzA2LTQ0MjYtYjY5My1hNDA3ZGJmYzkzYjMuanBnIn1dXX0.ngz3uDrOtN-k3aoSIGlCYWj3i_DDleRiIIdZ8JcO8F0)

### Introduction

Welcome Kagglers,

All basketball fans, and sports lovers in general, were devastated by the loss, in the tragic year 2020, of one of the greatest players ever. Kobe Bryant was a shooting guard who spent his entire career with the Los Angeles Lakers, winning five championships with them.

The Black Mamba was also an 18-time All-Star and won two Olympic gold medals, in Beijing and London, representing his country, the United States, and beating my country both times... hehe

With all these achievements and his unforgettable moves, he is considered one of the best to ever play the game. After the tragic event, Kobe received recognition, affection, and the warmest possible farewell from fans all over the world.

In this notebook, I will carry out an exploratory analysis of the shots taken by this player throughout his entire career, including some interesting visualizations and the insights they provide. Furthermore, I will build a model that predicts whether a shot was successful or not given some features of that shot. To do so, an ensemble of different models will be implemented. So if you are not familiar with this kind of procedure, keep reading, and I will explain everything you need.

I would also like to thank and acknowledge other Kagglers, whose work was a great inspiration for this notebook:
- kevins's Kobe Shots - Show Me Your Best Model: https://www.kaggle.com/kevins/kobe-shots-show-me-your-best-model
- Xavier's Kobe Bryant Shot Selection: https://www.kaggle.com/xvivancos/kobe-bryant-shot-selection

Finally, Kaggle's data description recommends avoiding leakage by training only on events that occurred prior to the shot being predicted. It is up to each participant to abide by this rule and, having looked at other notebooks, the general practice is to disregard it. So, for the sake of simplicity, we will train on all the labelled observations, whether they occurred before or after the shot being predicted.
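
For reference, here is a rough sketch of what respecting that rule could look like. It uses a tiny made-up frame (the dates and outcomes are invented), and `training_rows_for` is a hypothetical helper, not something used later in this notebook.

```python
import pandas as pd

# Toy data: dates and outcomes are invented for illustration only
shots = pd.DataFrame({
    "game_date": pd.to_datetime(
        ["2000-10-31", "2000-11-01", "2000-11-04", "2000-11-05"]),
    "shot_made_flag": [1.0, 0.0, None, 1.0],
})


def training_rows_for(shots, target_idx):
    """Return the labelled shots that happened strictly before the target shot."""
    cutoff = shots.loc[target_idx, "game_date"]
    earlier = shots["game_date"] < cutoff
    labelled = shots["shot_made_flag"].notna()
    return shots[earlier & labelled]


# Shot 2 has no label, so it plays the role of a test shot
train_for_unknown = training_rows_for(shots, 2)
print(train_for_unknown)
```

The shot from the later game (row 3) is left out even though it is labelled, which is exactly the information the rule asks us not to use.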

That said, I hope you enjoy the notebook. Don't forget to upvote if you like it, and remember that any advice or guidance is welcome and appreciated.

### Index

[The data](#section0)

1. [Loading the necessary libraries](#section1)
2. [Loading the dataset itself](#section2)
3. [Correct variable types](#section3)
4. [Data fast summary](#section4)
5. [Some exploration](#section5)
6. [Preprocess the data](#section6)
7. [Separate train and test sets](#section7)
8. [Feature Selection](#section8)
9. [Prepare dataset for further analysis](#section9)
10. [Evaluate Algorithms](#section10)
11. [Hyperparameter tuning](#section11)
12. [Final model: Voting Ensemble](#section12)
13. [Final predictions and submission](#section13)

### <a id='section0'>The data</a>

The data comes from a Kaggle Playground Prediction Competition and can be found [here](https://www.kaggle.com/c/kobe-bryant-shot-selection/data). As its data description states:

This data contains the location and circumstances of every field goal attempted by Kobe Bryant during his 20-year career. The task is to predict whether the basket went in (shot_made_flag).

5000 of the shot_made_flags have been removed and represented as missing values in the csv file. These are the test set shots for which we must submit a prediction.

The field names are self-explanatory and contain the following attributes:

    action_type
    combined_shot_type
    game_event_id
    game_id
    lat
    loc_x
    loc_y
    lon
    minutes_remaining
    period
    playoffs
    season 
    seconds_remaining
    shot_distance
    shot_made_flag (this is what you are predicting)
    shot_type
    shot_zone_area
    shot_zone_basic
    shot_zone_range
    team_id
    team_name
    game_date
    matchup
    opponent
    shot_id


### <a id='section1'>1. Loading the necessary libraries</a>

In [None]:
# For processing the data
import numpy as np
import pandas as pd

# Visualization tools
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.lines import Line2D
%matplotlib inline
sns.set_style("white") # set style for seaborn plots

# Machine learning
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.feature_selection import VarianceThreshold, RFE, SelectKBest, chi2
from sklearn.metrics import make_scorer, log_loss
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier, 
                              GradientBoostingClassifier, VotingClassifier, 
                              RandomForestClassifier, AdaBoostClassifier)

# Ignore warnings
import warnings 
warnings.filterwarnings('ignore')

### <a id='section2'>2. Loading the dataset itself</a>

As stated in the data section, the dataset consists of a single csv file. In 5000 of its rows, `shot_made_flag` is missing. These rows are our test set, and our goal is to predict their values.

In [None]:
df = pd.read_csv("../input/kobe-bryant-shot-selection/data.csv.zip")

We set the index using the existing column `shot_id`:

In [None]:
df.set_index('shot_id', inplace=True)
df.head()

### <a id='section3'>3. Correct variable types</a>

In [None]:
df.dtypes

First of all, we will transform the `period` column into an `object`. We won't be doing mathematical operations with it so it is not necessary to maintain it as an integer.

In addition, there are several variables that can be encoded as `category`. This gives us a more efficient DataFrame in terms of running speed and memory usage.

In [None]:
df["period"] = df["period"].astype('object')

vars_to_category = ["combined_shot_type", "game_event_id", "game_id", "playoffs", 
                    "season", "shot_made_flag", "shot_type", "team_id"]
for col in vars_to_category:
    df[col] = df[col].astype('category')

# Let us check the final types
df.dtypes
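
To give an idea of the memory savings mentioned above, here is a small standalone sketch. The column is invented (two labels repeated many times), so the exact byte counts are only illustrative.

```python
import pandas as pd

# Invented example column: a few distinct labels repeated many times
labels = pd.Series(["2PT Field Goal", "3PT Field Goal"] * 5000)

as_object = labels.astype("object")
as_category = labels.astype("category")

# deep=True also counts the memory taken by the Python strings themselves
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print("object:  ", bytes_object, "bytes")
print("category:", bytes_category, "bytes")
```

A `category` column stores each distinct label once plus a small integer code per row, which is why it pays off for columns with few unique values.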

### <a id='section4'>4. Data fast summary</a>

In [None]:
print("Dimensions of our DataFrame:", df.shape)

In [None]:
df.info()

As we knew, `shot_made_flag` has null values corresponding to the test observations. But surprisingly, this is the only variable with missing data, so no imputation will be needed.

In [None]:
df.describe(include=['number'])

In [None]:
df.describe(include=['object', 'category'])

### <a id="section5">5. Some exploration</a>

In this section, we will further explore our dataset. Primarily, we will display visualizations, which are a very effective way to get insights into our data.

Similarly, we will try to identify variables that can have a significant impact on our dependent variable, `shot_made_flag`. We start with our target class distribution:

In [None]:
ax = plt.axes()
sns.countplot(x="shot_made_flag", data=df, ax=ax, palette=("#552583", "#FDB927"))
ax.set_title("Distribution of the dependent variable")
plt.show()

At first glance, the target variable is fairly balanced, so we won't take any action to deal with class imbalance.

Now we continue with the shots made or missed in connection with the position they were taken. The next graph will display exactly this:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
scatter = sns.scatterplot(x=df["lon"], y=df["lat"], hue=df['shot_made_flag'],
                                    alpha=0.55, ax=ax, palette=("#552583", "#FDB927"))
scatter.set_xlim(left=-118.54, right=-118)
scatter.set_ylim(bottom=33.6, top=34.1)
ax.set_title("Shots made and missed based on court position")
ax.set_xlabel("")
ax.set_ylabel("")
legend_elements = [Line2D([0], [0], marker="o", color='w', label="Made",
                          markerfacecolor="#FDB927", markersize=10),
                   Line2D([0], [0], marker="o", color='w', label="Missed",
                          markerfacecolor="#552583", markersize=10)]
plt.legend(handles=legend_elements, title="Shot missed/made", 
           ncol=2, fontsize='small', fancybox=True);

Note that there is a clear cluster of made shots next to the basket. There also seems to be a trend in the central area, where Kobe looks more accurate. I think the right side of the court has slightly more yellow (made shots) in it, although the difference is barely perceptible. Let us check these last two assumptions:

In [None]:
# We don't want to modify the original DataFrame
subset = df.copy()
subset["x_zones"] = pd.cut(df["loc_x"], bins=25)
df_grouped1 = subset.groupby("x_zones").agg({"shot_made_flag": "count"}).reset_index()
df_shots_made = subset[subset["shot_made_flag"]==1]
df_grouped2 = df_shots_made.groupby("x_zones").agg({"shot_made_flag": "count"}).reset_index()
proportions = round(df_grouped2["shot_made_flag"] / df_grouped1["shot_made_flag"], 2)

f, ax = plt.subplots(figsize=(12, 6))
# Plot total shots
g1 = sns.barplot(x="x_zones", y="shot_made_flag", data=df_grouped1,
                 label="Total", color="#552583")

# Plot shots made
g2 = sns.barplot(x="x_zones", y="shot_made_flag", data=df_grouped2,
                 label="Made", color="#FDB927")

idx = 0
for p in g1.patches:
    g1.annotate(proportions[idx],
               (p.get_x() + p.get_width() / 2., p.get_height()-80), 
                ha="center", va="center", 
                xytext=(0, 9), fontsize=9,
                textcoords="offset points")
    if idx < 24: idx += 1
    else: break
    
plt.yticks(ticks=[0, 2000, 4000, 6000])
plt.xticks(fontsize=8, rotation=90)
ax.set_title("Proportion of shots made by total considering x court strips")
ax.set_xlabel("x zones of the court")
ax.set_ylabel("Number of shots")
ax.legend(ncol=2, loc="upper right", frameon=True);

It is now clear that central shots are more accurate than lateral ones. Specifically, 60% of the shots taken in the central strip are successful, while 40% are missed. Shots from the corners are the ones where Kobe was least accurate, which is common among the great majority of players.

Interestingly, some lateral zones show better performance than others closer to the center. It also seems that shots from the right side of the court had better results by a narrow margin.
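
As a side note, the same proportions can be obtained more directly, because the mean of a 0/1 flag is the share of shots made. The sketch below uses a small invented frame and only 3 strips instead of 25, so the numbers have nothing to do with Kobe's real stats.

```python
import pandas as pd

# Invented shots: x position on the court and whether the shot went in
toy = pd.DataFrame({
    "loc_x": [-200, -150, -10, 0, 5, 10, 150, 220],
    "shot_made_flag": [0, 1, 1, 1, 0, 1, 0, 0],
})

# Cut the x axis into equal-width strips, as done above with bins=25
toy["x_zones"] = pd.cut(toy["loc_x"], bins=3)

# The mean of a 0/1 column is the proportion of ones, i.e. the accuracy
accuracy_by_zone = (toy.groupby("x_zones", observed=False)["shot_made_flag"]
                       .mean()
                       .round(2))
print(accuracy_by_zone)
```

A single `groupby(...).mean()` avoids building two separate count tables and dividing them by hand.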

In [None]:
def make_zone_scatter(var, ax):
    sns.scatterplot(x=df["lon"], y=df["lat"], 
                    hue=df[var], ax=ax,
                    palette="Dark2")
    ax.legend(ncol=len(df[var].unique())//3, fontsize='small', fancybox=True)

    
def make_zone_countplot(var, ax):
    # Use one ordering for both the bars and their tick labels
    order = df[var].value_counts().index
    sns.countplot(x=var, data=df, 
              order=order, 
              ax=ax, palette="Dark2")
    ax.set_xlabel("")
    ax.set_xticklabels(order, fontsize=8, rotation=90)
    
    
def make_acc_lollipop(var, ax):
    subset = df[[var, "shot_made_flag"]].dropna()
    subset["shot_made_flag"] = pd.to_numeric(subset["shot_made_flag"])
    df_grouped = subset.groupby(var).agg({"shot_made_flag": "mean"}).reset_index()
    df_grouped = df_grouped.sort_values(by="shot_made_flag")
    ax.hlines(y=df_grouped[var], xmin=0,
               xmax=df_grouped["shot_made_flag"], color="#552583", linewidth=3)
    ax.plot(df_grouped["shot_made_flag"], range(0,len(df_grouped.index)), "o", color="#FDB927")
    ax.set_xlim([0, .7])
    ax.set_xlabel("Accuracy")

    
f, ((ax0, ax1, ax2), (ax3, ax4, ax5), (ax6, ax7, ax8)) = plt.subplots(3, 3, figsize=(16, 16))
make_zone_scatter("shot_zone_area", ax0)
make_zone_scatter("shot_zone_basic", ax1)
make_zone_scatter("shot_zone_range", ax2)

make_zone_countplot("shot_zone_area", ax3)
make_zone_countplot("shot_zone_basic", ax4)
make_zone_countplot("shot_zone_range", ax5)

make_acc_lollipop("shot_zone_area", ax6)
make_acc_lollipop("shot_zone_basic", ax7)
make_acc_lollipop("shot_zone_range", ax8)

f.tight_layout()
f.suptitle("Distribution of shots by zone-related variable", fontsize=16, y=1.03);

With this combined figure we can understand how the zone-related variables are laid out on the basketball court, how shots are distributed among them (i.e. how many shots took place in each area), and how these areas affect our binary dependent variable `shot_made_flag`.

Besides that, the different types of shots have been categorized in the variables `combined_shot_type` and `action_type`. Here we examine these features and their impact on shot accuracy.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
jump_shot_df = df[df["combined_shot_type"] == "Jump Shot"]
scatter_jumpshots = sns.scatterplot(x=jump_shot_df["lon"], y=jump_shot_df["lat"], 
                                    alpha=0.1, ax=ax, color="grey")

not_jump_shot_df = df[df["combined_shot_type"] != "Jump Shot"]
scatter = sns.scatterplot(x=not_jump_shot_df["lon"], y=not_jump_shot_df["lat"], 
                          hue=not_jump_shot_df["combined_shot_type"], 
                          palette=["C8", "#552583", "C3", "#000000", "#FDB927"],
                          ax=ax)

scatter.set_xlim(left=-118.54, right=-118)
scatter.set_ylim(bottom=33.65, top=34.1);
ax.set_title("Shots made by type/kind")
ax.set_xlabel("")
ax.set_ylabel("")
plt.legend(ncol=len(df["combined_shot_type"].unique())+1, fontsize='small', fancybox=True);

In [None]:
f, ax = plt.subplots(figsize=(10,8))
make_acc_lollipop("action_type", ax)
ax.set_xlim([0, 1.05])
ax.tick_params(axis="y", labelsize=8)

We continue with the accuracy exploration, now considering the seconds remaining in the fourth quarter or the overtime periods. These should be high-pressure shots, which could lead to worse performance, but Kobe has good stats overall. There is, however, a decrease in accuracy in the last 5 seconds.

In [None]:
# Keep the fourth quarter and the overtime periods (period >= 4)
subset = df[df["period"] >= 4][["seconds_remaining", "shot_made_flag"]].dropna()
subset = subset.astype(int)
df_grouped3 = subset.groupby("seconds_remaining").agg({"shot_made_flag": "mean"}).reset_index()

fig, ax = plt.subplots(1, 1, figsize=(10, 4))
sns.barplot(x="seconds_remaining", y="shot_made_flag", data=df_grouped3, 
            palette=("#552583", "#FDB927", "#000000"))
ax.set_title("Proportion of shots converted by seconds remaining of fourth or extra quarter")
ax.set_xlabel("Seconds remaining")
ax.set_ylabel("Precision percentage")
plt.xticks(fontsize=8, rotation=90);

The performance of Kobe dropped in his last three years in the league. At least in terms of shot precision, this graph shows it:

In [None]:
subset4 = df[["season", "shot_made_flag"]].dropna()
subset4["shot_made_flag"] = pd.to_numeric(subset4["shot_made_flag"])
df_grouped4 = subset4.groupby("season").agg({"shot_made_flag": "mean"}).reset_index()

f, ax = plt.subplots(1, 1, figsize=(8,8))
sns.lineplot(x="season", y="shot_made_flag", data=df_grouped4, color="#552583", ax=ax);
sns.scatterplot(x="season", y="shot_made_flag", data=df_grouped4, s=100, color="#FDB927", ax=ax)
ax.set_title("Accuracy per season")
ax.set_xlabel("Season")
ax.set_ylabel("Accuracy")
plt.xticks(fontsize=8, rotation=90);

Finally, we will visualize three extra features: `period`, `playoffs`, and `shot_type`. We can extract some surprising and valuable insights from them: Kobe had remarkable accuracy in the playoffs and in overtime periods, the mark of a big-moment player.

In [None]:
f, ax = plt.subplots(3, figsize=(12, 10))

for var, i in zip(["period", "playoffs", "shot_type"], range(0,3)):
    sns.countplot(x=var, hue="shot_made_flag", data=df, ax=ax[i], palette=("#552583", "#FDB927"))
    ax[i].set_title(var)

plt.tight_layout()
plt.show()

### <a id="section6">6. Preprocess the data</a>

We will now make some modifications to the data. To preserve the original DataFrame, we will copy it into a new one called `copy_df`. This is considered good practice and helps prevent unwanted side effects.

In [None]:
copy_df = df.copy()
target = copy_df['shot_made_flag'].copy()

#### 6.1 Remove useless columns

Let us start removing some columns that do not provide any informative benefit. 

- `team_id` and `team_name` are useless features, since Kobe only played for one team, the L.A. Lakers: each column has a single unique value.

- `game_id` and `game_event_id` are identifiers that have no relation with whether a shot is made or missed. They would add noise to our model.

- `lat` and `lon` are highly correlated with `loc_x` and `loc_y`, so keeping them could introduce multicollinearity problems.


- Ultimately, `shot_made_flag` is our dependent variable, and we have already stored it in the `target` series.

In [None]:
vars_to_remove = ["team_id", "team_name", "game_id", "game_event_id", 
                  "lat", "lon", "shot_made_flag"]

for var in vars_to_remove:
    copy_df = copy_df.drop(var, axis=1)

#### 6.2 Variable transformations

##### 6.2.1 Action types

There are way too many action types. We need to group the values with the fewest occurrences into a new category: "Other" or "Rare actions". Otherwise, when we one-hot-encode, the number of columns will grow considerably.

In [None]:
pd.DataFrame({"counts": copy_df["action_type"].value_counts().sort_values()[:25]})

In [None]:
rare_action_types = copy_df["action_type"].value_counts().sort_values().index.values[:20]
copy_df.loc[copy_df["action_type"].isin(rare_action_types), "action_type"] = "Other"
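
Here is the same idea on a toy series, as a quick sanity check. Note that this sketch is a variant: it groups categories below a minimum count (a hypothetical cutoff of 2) instead of taking a fixed number of the least frequent ones as done above, and the shot counts are invented.

```python
import pandas as pd

# Invented action types with very different frequencies
actions = pd.Series(["Jump Shot"] * 6 + ["Layup Shot"] * 3
                    + ["Hook Shot", "Tip Shot"])

counts = actions.value_counts()

# Hypothetical cutoff: anything seen fewer than 2 times is considered rare
rare = counts[counts < 2].index

# Keep frequent labels as they are and replace the rare ones with "Other"
grouped = actions.where(~actions.isin(rare), "Other")
print(grouped.value_counts())
```

Either way, the rare labels collapse into a single column once the variable is one-hot-encoded.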

##### 6.2.2 Game date
We will separate the month and year from the date. As we will see later on, this will contribute to the explainability of the target.

In [None]:
copy_df["game_date"] = pd.to_datetime(copy_df["game_date"])
copy_df["game_year"] = copy_df["game_date"].dt.year
copy_df["game_month"] = copy_df["game_date"].dt.month
copy_df = copy_df.drop("game_date", axis=1)

##### 6.2.3 Last seconds
As we observed in the exploratory analysis section, there was a significant decrease in the accuracy of shots taken with less than 5 seconds remaining, while accuracy with more time left was quite uniform. We will add a feature that captures this phenomenon and reduces the number of future columns.

In [None]:
copy_df["seconds_from_period_end"] = 60 * copy_df["minutes_remaining"] + copy_df["seconds_remaining"]
copy_df["last_5_sec_in_period"] = copy_df["seconds_from_period_end"] < 5

# We can drop the rest of time related fields
copy_df = copy_df.drop("minutes_remaining", axis=1)
copy_df = copy_df.drop("seconds_remaining", axis=1)
copy_df = copy_df.drop("seconds_from_period_end", axis=1)
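
To see why both time columns are needed before applying the 5-second rule, here is the same transformation on four invented shots:

```python
import pandas as pd

# Invented clock readings at the moment of four shots
toy = pd.DataFrame({
    "minutes_remaining": [0, 0, 1, 11],
    "seconds_remaining": [3, 5, 2, 59],
})

# Total seconds left in the period
seconds_left = 60 * toy["minutes_remaining"] + toy["seconds_remaining"]

# Only shots with strictly fewer than 5 seconds left are flagged
toy["last_5_sec_in_period"] = seconds_left < 5
print(toy)
```

The third shot has only 2 in `seconds_remaining`, but with a full minute still on the clock it is correctly left unflagged.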

##### 6.2.4 x and y zones

We already did something similar in the data visualization section. Now we will include these strips in our training set for the x-axis and y-axis. But we won't drop `loc_x` and `loc_y`.

In [None]:
copy_df["x_zones"] = pd.cut(copy_df["loc_x"], bins=25)
copy_df["y_zones"] = pd.cut(copy_df["loc_y"], bins=25)

##### 6.2.5 Home games
It will be clearer if we set a binary variable that will determine if a game was played at home or away with the classic 1 or 0 values.

In [None]:
copy_df["home_play"] = copy_df["matchup"].str.contains("vs").astype("int")
copy_df = copy_df.drop("matchup", axis=1)
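
A tiny check of that rule on three sample strings. I am assuming here that home games are written like `LAL vs. POR` and away games like `LAL @ UTA`; the strings below are typed by hand, not read from the dataset.

```python
import pandas as pd

# Hand-written matchup strings in the assumed format
matchups = pd.Series(["LAL vs. POR", "LAL @ UTA", "LAL vs. SAS"])

# regex=False treats "vs" as a plain substring rather than a pattern
home_play = matchups.str.contains("vs", regex=False).astype(int)
print(home_play.tolist())
```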

#### 6.3 Encode the categorical variables

We are finally in a position to one-hot-encode our categorical variables.

In [None]:
categorical_vars = [
    'action_type', 'combined_shot_type', 'period', 'season', 'shot_type',
    'shot_zone_area', 'shot_zone_basic', 'shot_zone_range', 'game_year',
    'game_month', 'opponent', 'loc_x', 'loc_y', 'x_zones', 'y_zones']

for var in categorical_vars:
    # One 0/1 column per category, named "<variable>#<category>"
    dummies = pd.get_dummies(copy_df[var])
    dummies = dummies.add_prefix("{}#".format(var))
    copy_df.drop(var, axis=1, inplace=True)
    copy_df = copy_df.join(dummies)
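
If you have never seen one-hot encoding in action, this toy example shows what happens to a single column. It uses the `columns` and `prefix_sep` arguments of `pd.get_dummies`, which is a more compact alternative to the loop above; the three rows are invented.

```python
import pandas as pd

# Invented rows: one categorical column and one column that is already numeric
toy = pd.DataFrame({
    "shot_type": ["2PT", "3PT", "2PT"],
    "playoffs": [0, 1, 0],
})

# Encode only shot_type; new columns are named "<variable>#<category>"
encoded = pd.get_dummies(toy, columns=["shot_type"], prefix_sep="#")
print(encoded)
```

Each category becomes its own indicator column, which is why variables with many distinct values make the DataFrame so much wider.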

### <a id="section7">7. Separate train and test sets</a>


In [None]:
missing = target.isnull()

data_submit = copy_df[missing]
X = copy_df[~missing]
Y = target[~missing]

In [None]:
print(X.shape, Y.shape)

In [None]:
copy_df.shape

### <a id="section8">8. Feature Selection</a>

When we one-hot-encoded, we sharply increased the number of columns of our set: we started from the 25 fields listed in the data section and now have many times that number (the exact count is the second value of `copy_df.shape` above). This happened even after all the work of discarding and transforming variables we did before.

Well, this is fairly normal when the number of categories in the variables is high. Fortunately, we have enough observations to deal with all these columns and, more importantly, we have techniques to reduce them. We will do so in this section by selecting the most informative variables.

Let us start with this reduction of features. We will implement different techniques and combine them in a final selection stage.

#### 8.1 Variance Threshold
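We will keep the features whose training-set variance exceeds `threshold * (1 - threshold)` = 0.9 × 0.1 = 0.09. For a binary (one-hot) column, this is the variance of a feature that takes the same value in 90% of the observations, so columns that are more lopsided than that are discarded.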

In [None]:
threshold = 0.9
vt = VarianceThreshold().fit(X)

# Find feature names
feat_var_threshold = X.columns[vt.variances_ > threshold * (1-threshold)]
feat_var_threshold
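
To make the `threshold * (1 - threshold)` rule more concrete, here is a sketch with two invented binary columns. A 0/1 column with a share `p` of ones has variance `p * (1 - p)`, so the cutoff of 0.09 removes columns that are the same value in more than 90% of the rows.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

threshold = 0.9
cutoff = threshold * (1 - threshold)  # about 0.09

# Invented binary features over 20 observations
X_toy = np.column_stack([
    np.array([1] * 10 + [0] * 10),  # 50% ones -> variance 0.25
    np.array([1] * 1 + [0] * 19),   # 5% ones  -> variance 0.0475
])

selector = VarianceThreshold(threshold=cutoff).fit(X_toy)
kept = selector.get_support()
print(selector.variances_, kept)
```

The balanced column survives, while the column that is almost always zero is dropped.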

#### 8.2 Most important features 

`RandomForestClassifier` exposes the importance of each feature. According to these importances, we will select the top 30.

In [None]:
model = RandomForestClassifier()
model.fit(X, Y)

feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["importance"])
feat_imp_30 = feature_imp.sort_values("importance", ascending=False).head(30).index
feat_imp_30

#### 8.3 Univariate feature selection

With this procedure, we will also select the top 30 features, but using a chi2 test. The features must be non-negative before applying this test, which is why we scale them to the [0, 1] range first.

In [None]:
X_minmax = MinMaxScaler(feature_range=(0,1)).fit_transform(X)
X_scored = SelectKBest(score_func=chi2, k="all").fit(X_minmax, Y)
feature_scoring = pd.DataFrame({
    "feature": X.columns,
    "score": X_scored.scores_
})

feat_scored_30 = feature_scoring.sort_values("score", ascending=False).head(30)["feature"].values
feat_scored_30
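
The small sketch below illustrates why the scaling step matters: `chi2` rejects negative inputs, and after `MinMaxScaler` it returns one score per feature. The data is invented and far too small for the scores to mean anything.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Invented features; the first column contains a negative value
X_toy = np.array([[-2.0, 1.0], [0.0, 0.0], [2.0, 1.0], [1.0, 0.0]])
y_toy = np.array([0, 0, 1, 1])

# chi2 refuses negative inputs
try:
    chi2(X_toy, y_toy)
    raised = False
except ValueError:
    raised = True

# After scaling to [0, 1] the test can be computed
X_scaled = MinMaxScaler().fit_transform(X_toy)
scores, p_values = chi2(X_scaled, y_toy)
print(raised, scores)
```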

#### 8.4 Recursive Feature Elimination

We now select the best 30 features by using recursive feature elimination (RFE) with a logistic regression model.

In [None]:
# Running time can take several minutes
# You can skip this method and leave it out of the final feature selection
rfe = RFE(LogisticRegression(), n_features_to_select=30)
rfe.fit(X, Y)

feature_rfe_scoring = pd.DataFrame({
    "feature": X.columns, 
    "score": rfe.ranking_
})

feat_rfe_30 = feature_rfe_scoring[feature_rfe_scoring["score"] == 1]["feature"].values
feat_rfe_30

#### 8.5 Final feature selection 

Finally, we will get our selection of features by merging the results of all the methods above. In a nutshell, we will keep every variable that was selected by at least one of the techniques.

In [None]:
features = np.hstack([
    feat_var_threshold,
    feat_imp_30,
    feat_scored_30,
    feat_rfe_30
])

features = np.unique(features)
print("Final features set:\n")
for f in features:
    print("\t-{}".format(f))

### <a id="section9">9. Prepare dataset for further analysis</a>

In [None]:
copy_df = copy_df.loc[:, features]
data_submit = data_submit.loc[:, features]
X = X.loc[:, features]

print("Clean dataset shape: {}".format(copy_df.shape))
print("Submittable dataset shape: {}".format(data_submit.shape))
print("Train features shape: {}".format(X.shape))
print("Target label shape: {}".format(Y.shape))

### <a id="section10">10. Evaluate Algorithms</a>

We first print the scikit-learn version in use, which helps when solving compatibility problems. After that, we set some variables that we will use throughout the model construction. The first one is the random seed: reproducible results are a must.

The number of processors is set to -1, which means that your computer will use all its cores to run the code in parallel. `num_folds` is the number of partitions used for cross-validation. Log loss is the metric chosen to score the models; scikit-learn exposes it as `neg_log_loss`, so the reported values are negative and values closer to zero are better.

In [None]:
import sklearn
print("scikit-learn version:", sklearn.__version__)

seed = 2666
processors = -1
num_folds = 3
scoring = "neg_log_loss"

# random_state only has an effect when shuffle=True (recent versions raise an error otherwise)
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
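
If log loss is new to you, here is the computation done by hand for four invented predictions. It also shows why the `neg_log_loss` scores printed in the next cells are negative: scikit-learn flips the sign so that higher is always better.

```python
import numpy as np

# Invented outcomes (1 = made) and predicted probabilities of a made shot
y_true = np.array([1, 0, 1, 1])
p_made = np.array([0.9, 0.2, 0.6, 0.5])

# Binary log loss: average negative log-probability given to the true class
log_loss_value = -np.mean(
    y_true * np.log(p_made) + (1 - y_true) * np.log(1 - p_made))

# Sign-flipped version, as reported by the "neg_log_loss" scorer
neg_log_loss = -log_loss_value

# Reference point: always answering 0.5 gives ln(2), about 0.693
baseline = -np.log(0.5)
print(round(log_loss_value, 4), round(neg_log_loss, 4), round(baseline, 4))
```

A model is only useful here if its log loss is below that 0.693 baseline, i.e. its `neg_log_loss` is above -0.693.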

#### 10.1 Algorithms spot-check

Now we will fast-prepare some basic models and see how they behave in our particular dataset.

In [None]:
models = []
models.append(("LR", LogisticRegression()))
models.append(("LDA", LinearDiscriminantAnalysis()))
models.append(("K-NN", KNeighborsClassifier(n_neighbors=5)))
models.append(("CART", DecisionTreeClassifier()))
models.append(("NB", GaussianNB()))


results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    results.append(cv_results)
    names.append(name)
    print("{0}:({1:.3f}) +/- ({2:.3f})".format(name, cv_results.mean(), cv_results.std()))

Looking at these results, Logistic Regression and Linear Discriminant Analysis provide decent results and are worth further examination.

But apart from these simple algorithms, let's look at some ensemble models to see if we can find something more interesting:

#### 10.2 Ensembles

##### 10.2.1 Bagging (Bootstrap Aggregation)

Bagging involves taking multiple samples with replacement from the training dataset and training a model on each of them.
The final prediction is the average of the predictions of all the sample-based models.
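
To make the idea tangible, this is a hand-rolled sketch of bagging on synthetic data. It is only meant to show the two steps described above (resample, then average); the real `BaggingClassifier` used below takes care of all this for us, and the dataset, the number of trees and the tree depth are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

probas = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_toy[idx], y_toy[idx])
    # Keep each tree's probability of the positive class
    probas.append(tree.predict_proba(X_toy)[:, 1])

# The bagged prediction is the average over all trees
bagged_proba = np.mean(probas, axis=0)
print(bagged_proba[:5])
```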

###### Bagged Decision Trees

In [None]:
cart = DecisionTreeClassifier()
num_trees = 100

# The base estimator is passed positionally because its keyword name differs across scikit-learn versions
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)

results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
print("({0:.3f}) +/- ({1:.3f})".format(np.mean(results), np.std(results)))

###### Random Forest 

In [None]:
num_features = 10

model = RandomForestClassifier(n_estimators=num_trees, max_features=num_features)

results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
print("({0:.3f}) +/- ({1:.3f})".format(np.mean(results), np.std(results)))

###### Extra Trees

In [None]:
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=num_features)
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
print("({0:.3f}) +/- ({1:.3f})".format(np.mean(results), np.std(results)))

##### 10.2.2 Boosting

Boosting algorithms seek to improve the prediction power by training a sequence of weak models, each compensating the weaknesses of its predecessors. To understand Boosting, it is crucial to recognize that boosting is a generic algorithm rather than a specific model. Boosting needs you to specify a weak model (e.g. regression, shallow decision trees, etc) and then improves it.
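
The sketch below illustrates the generic idea on a toy regression problem: each new weak model (a one-split tree) is fitted to the errors left by the previous ones. This is a simplified residual-fitting loop written for illustration, not the exact algorithm behind `AdaBoostClassifier` or `GradientBoostingClassifier`, and the data and learning rate are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

prediction = np.zeros_like(y)
learning_rate = 0.5
errors = []
for _ in range(20):
    # What the current ensemble still gets wrong
    residual = y - prediction
    # Weak model: a decision stump fitted to those errors
    stump = DecisionTreeRegressor(max_depth=1).fit(x, residual)
    # Add a damped version of the correction to the ensemble
    prediction += learning_rate * stump.predict(x)
    errors.append(np.mean((y - prediction) ** 2))

print(round(errors[0], 4), round(errors[-1], 4))
```

Each stump alone is a poor model, but the training error keeps shrinking as corrections accumulate.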

###### AdaBoost

In [None]:
model = AdaBoostClassifier(n_estimators=100, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
print("({0:.3f}) +/- ({1:.3f})".format(np.mean(results), np.std(results)))

###### Stochastic Gradient Boosting 

In [None]:
model = GradientBoostingClassifier(n_estimators=100, random_state=seed)

results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
print("({0:.3f}) +/- ({1:.3f})".format(np.mean(results), np.std(results)))

### <a id="section11">11. Hyperparameter tuning</a>

We are left with the models that got the best results. They could perform even better with well-chosen hyperparameters. This is what we will do here: for each model, select from a specific list of hyperparameter values the ones that work best for our data.

This selection procedure is known as hyperparameter tuning, and `GridSearchCV()` will be our best friend.

#### 11.1 Logistic Regression

In [None]:
lr_grid = GridSearchCV(
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    estimator = LogisticRegression(solver='liblinear', random_state=seed),
    param_grid = {
        'penalty': ['l1', 'l2'],
        'C': [0.001, 0.01, 1, 10, 100, 1000]
    },
    cv = kfold,
    scoring=scoring,
    n_jobs=processors)

lr_grid.fit(X, Y)

print(lr_grid.best_score_)
print(lr_grid.best_params_)

#### 11.2 Linear Discriminant Analysis

In [None]:
lda_grid = GridSearchCV(
    estimator = LinearDiscriminantAnalysis(),
    param_grid = {
        'solver': ['lsqr'],
        'shrinkage':[0, 0.25, 0.5, 0.75, 1],
        # With two classes LDA allows at most one component, and the
        # setting does not affect predictions, so only None is searched
        'n_components':[None]
    },
    cv=kfold,
    scoring=scoring,
    n_jobs=processors)

lda_grid.fit(X, Y)

print(lda_grid.best_score_)
print(lda_grid.best_params_)

#### 11.3 K-NN


In [None]:
knn_grid = GridSearchCV(
    estimator = Pipeline([
        ('min_max_scaler', MinMaxScaler()),
        ('knn', KNeighborsClassifier())
    ]),
    param_grid = {
        'knn__n_neighbors': [25],
        'knn__algorithm': ['ball_tree'],
        'knn__leaf_size': [2, 3, 4],
        'knn__p': [1]
    },
    cv = kfold,
    scoring = scoring,
    n_jobs=processors
    )

knn_grid.fit(X, Y)

print(knn_grid.best_score_)
print(knn_grid.best_params_)

#### 11.4 Random Forest

In [None]:
rf_grid = GridSearchCV(
    estimator = RandomForestClassifier(warm_start=True, random_state=seed),
    param_grid = {
        'n_estimators': [100, 200],
        'criterion': ['gini', 'entropy'],
        'max_features': [18, 20],
        'max_depth': [8, 10],
        'bootstrap': [True]
    }, 
    cv = kfold, 
    scoring = scoring, 
    n_jobs = processors)

rf_grid.fit(X, Y)

print(rf_grid.best_score_)
print(rf_grid.best_params_)

#### 11.5 AdaBoost 

In [None]:
ada_grid = GridSearchCV(
    estimator = AdaBoostClassifier(random_state=seed),
    param_grid = {
        'algorithm': ['SAMME', 'SAMME.R'],
        'n_estimators': [10, 25, 50],
        'learning_rate': [1e-3, 1e-2, 1e-1]
    }, 
    cv = kfold, 
    scoring = scoring, 
    n_jobs = processors)

ada_grid.fit(X, Y)

print(ada_grid.best_score_)
print(ada_grid.best_params_)

#### 11.6 Gradient Boosting

In [None]:
gbm_grid = GridSearchCV(
    estimator = GradientBoostingClassifier(warm_start=True, random_state=seed),
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [2, 3, 4],
        'max_features': [10, 15, 20],
        'learning_rate': [1e-1, 1]
    }, 
    cv = kfold, 
    scoring = scoring, 
    n_jobs = processors)

gbm_grid.fit(X, Y)

print(gbm_grid.best_score_)
print(gbm_grid.best_params_)

### <a id="section12">12. Final Model: Voting Ensemble</a>

We are at the last step of model development. We select our four best models based on log loss, with their best hyperparameters, and combine them in an ensemble called a voting classifier. 

Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. The voting classifier isn’t an actual classifier but a wrapper for a set of different algorithms that are trained and evaluated in parallel, in order to exploit the peculiarities of each of them.

In soft voting (the mode we have chosen), the predicted class probabilities of all classifiers are averaged, and the winning class is the one with the highest average. We also set different weights depending on the results of the models: for example, gradient boosting and random forest, the two models that achieved the best log loss, each have a weight of 3.

Proceeding this way gives us a more robust model.
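
Here is the soft-voting arithmetic for a single shot, using the weights `[2, 3, 3, 1]` from the cell below. The four probability vectors are invented; they stand for what logistic regression, gradient boosting, random forest and AdaBoost might output.

```python
import numpy as np

# Invented [P(missed), P(made)] from the four models for one shot
probas = np.array([
    [0.60, 0.40],  # lr
    [0.45, 0.55],  # gbm
    [0.40, 0.60],  # rf
    [0.70, 0.30],  # ada
])
weights = np.array([2, 3, 3, 1])

# Weighted average of the probability vectors
averaged = np.average(probas, axis=0, weights=weights)
winner = int(np.argmax(averaged))

# Same vote with equal weights, for comparison
unweighted_winner = int(np.argmax(probas.mean(axis=0)))
print(averaged.round(4), winner, unweighted_winner)
```

With equal weights the shot would be predicted as missed, but giving more say to the two strongest models tips the vote towards made.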

In [None]:
estimators = []
estimators.append(('lr', LogisticRegression(penalty='l2', C=1)))
estimators.append(('gbm', GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, max_features=15, warm_start=True, random_state=seed)))
estimators.append(('rf', RandomForestClassifier(bootstrap=True, max_depth=8, n_estimators=200, max_features=20, criterion='entropy', random_state=seed)))
estimators.append(('ada', AdaBoostClassifier(algorithm='SAMME.R', learning_rate=1e-2, n_estimators=10, random_state=seed)))


# Create the ensemble model
ensemble = VotingClassifier(estimators, voting='soft', weights=[2,3,3,1])

results = cross_val_score(ensemble, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
print("({0:.3f}) +/- ({1:.3f})".format(np.mean(results), np.std(results)))

### <a id="section13">13. Final predictions and submission</a>

In [None]:
model = ensemble
model.fit(X, Y)
preds = model.predict_proba(data_submit)
preds

In [None]:
submission = pd.DataFrame()
submission["shot_id"] = data_submit.index
# Column 1 of predict_proba is the probability of class 1 (shot made)
submission["shot_made_flag"] = preds[:, 1]

submission.to_csv("sub.csv", index=False)
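
One detail worth double-checking before submitting: `predict_proba` returns one column per class, in the order given by the fitted model's `classes_` attribute, and the competition expects the probability that the shot went in. The sketch below shows a safe way to pick that column on a toy model; the data is invented and `made_col` is just an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented one-feature dataset: larger values are more often "made" (1)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Look up which column corresponds to class 1 instead of hard-coding it
made_col = list(clf.classes_).index(1)
p_made = clf.predict_proba(X_toy)[:, made_col]
print(clf.classes_, made_col, p_made.round(3))
```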