<h1> Game of Thrones - Death Prediction</h1>

In this notebook, I try to predict which characters are going to die (isAlive = 0) and which stay alive (isAlive = 1). Please run all cells to see the final output. The chosen model and its score are given in the last cell, so stay tuned!

<h2> Scoring </h2>
As this is a classification problem, R-squared is not a suitable metric. To evaluate the models, we will use the AUC score. Additionally, we will look at the confusion matrix to see which kinds of errors occurred. The errors we are interested in are false negatives and false positives, explained further in the grid below.

~~~
                                                 |
  True Negative                                  |  False Positive
  PREDICTED: Character Died   (isAlive = 0)      |  PREDICTED: Character Alive   (isAlive = 1)
  ACTUAL:    Character Died   (isAlive = 0)      |  ACTUAL:    Character Died    (isAlive = 0)
                                                 |
-------------------------------------------------|---------------------------------------------------
                                                 |
  False Negative                                 |  True Positive
  PREDICTED: Character Died   (isAlive = 0)      |  PREDICTED: Character Alive   (isAlive = 1)
  ACTUAL:    Character Alive  (isAlive = 1)      |  ACTUAL:    Character Alive   (isAlive = 1)
                                                 |  
~~~
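
The grid maps directly onto scikit-learn's output. The following is a minimal sketch on made-up labels (not taken from the dataset), showing how the four cells and the AUC score can be read off:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# made-up labels for eight characters (1 = alive, 0 = died)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# for labels 0 and 1, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# AUC computed from the hard class predictions of this toy example
auc = roc_auc_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp} AUC={auc:.3f}")
```

The row order of the matrix matches the grid: actual deaths in the first row, actual survivors in the second.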

If two models have the same AUC score, I prefer the model with fewer false positives. As a fan of the show, I find it worse to believe a character is alive and then learn that he died in the end. The other way around is much easier to take: believing a character died and then finding out he is alive gives you a good feeling.
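
This tie-breaking rule can be written as a sort key. The sketch below uses hypothetical model names and scores, purely for illustration:

```python
# hypothetical results: (model name, test AUC, number of false positives)
results = [
    ("model_a", 0.81, 12),
    ("model_b", 0.84, 15),
    ("model_c", 0.84, 9),
]

# highest AUC wins; among models with equal AUC, the fewest false positives wins
best = min(results, key=lambda r: (-r[1], r[2]))

print(best[0])
```

In practice the AUC values would need to be rounded before comparing them, since two models rarely tie exactly.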


<h2> Preparing the data </h2>

As usual, the first step in building a model is preparing the data.

<h3> Gender guesser </h3>

The first column I engineered is a guess of each character's gender. For that, I split the name into first and last name at the first space. As guessing the gender from the first names ran for more than 5 minutes, I decided to copy the outcome and keep only the final list. The gender guesser gave me the following outcome:

~~~
unknown          1385
male              381
female            125
mostly_male        24
mostly_female      21
andy               10
~~~

Seeing that most names come back as unknown, I assume the guesser does not recognise many of these names, so the value of the gender column for the model will not be very high.
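
The split at the first space can be sketched as follows. The three names are illustrative and the small frame `names` is a stand-in for the real data:

```python
import pandas as pd

# illustrative names, not read from the dataset
names = pd.DataFrame({"name": ["Jon Snow", "Aegon the Conqueror", "Hodor"]})

# n=1 splits only at the first whitespace; expand=True returns two columns
names[["first_name", "last_name"]] = names["name"].str.split(n=1, expand=True)

print(names)
```

Names without a space end up with a missing last name, and everything after the first space stays together in `last_name`.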

<h3> Imputing missing values </h3>

Imputing missing values was an important part of the data engineering.

<h4> Categorical columns </h4>
First, I looked at the categorical columns, namely title, culture, mother, father, heir, house and spouse. For each of these columns I flagged the missing values in a new column and replaced the NaNs with <em>unknown</em>. Imputing an actual value would not make sense here, because in most cases a missing house or culture means the character does not belong to one.
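
A minimal sketch of the flag-then-fill pattern on a toy frame (the values are made up; the `m_` prefix follows the naming used in the code cell of this notebook):

```python
import numpy as np
import pandas as pd

# toy data with missing values in two categorical columns
toy = pd.DataFrame({
    "house":   ["House Stark", np.nan, "House Frey"],
    "culture": [np.nan, np.nan, "Rivermen"],
})

for col in ["house", "culture"]:
    if toy[col].isnull().any():
        toy["m_" + col] = toy[col].isnull().astype(int)  # 1 = value was missing
        toy[col] = toy[col].fillna("unknown")

print(toy)
```

The flag has to be created before the fill, otherwise the information about which rows were missing is lost.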

<h4> Numerical columns </h4>
For the numerical columns, namely age and date of birth, I decided to impute with the median. Both columns contain two very large outliers (positive for date of birth, negative for age), so imputing with the mean was not an option: the mean age would have come out negative.
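
The effect of a single extreme value on the mean is easy to reproduce. The ages below are made up for illustration and are not the values from the dataset:

```python
import numpy as np
import pandas as pd

# made-up ages with one large negative outlier and one missing value
age = pd.Series([27, 35, 44, 61, -5000, np.nan])

mean_age = age.mean()      # pulled far below zero by the outlier
median_age = age.median()  # unaffected by the size of the outlier

age_imputed = age.fillna(median_age)

print(mean_age, median_age)
```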

<h4> Boolean columns </h4>
For the boolean columns, namely the checks whether mother, father, spouse or heir are alive, I decided to impute with 0. When one of these is NaN, the person is unknown, and therefore we cannot assume that the person is alive.

<h3> Engineering new features </h3>
Looking at the data, a couple of new columns came to mind. First, because of the outliers in age and year of birth, I added these two values together into a single column. Next, I created a column with the size of the house a character belongs to. Additionally, I counted the number of books a character appears in. When that number is zero, I assumed the character is not named in any of the books and therefore appears in the TV show only.
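
The four engineered columns can be sketched on a toy frame. All values are made up, and the shortened column names `book1` and `book2` are placeholders for the real book columns:

```python
import pandas as pd

# made-up rows; ages were chosen so that dateOfBirth + age is 300 everywhere
toy = pd.DataFrame({
    "house":       ["House Stark", "House Stark", "House Frey", "unknown"],
    "dateOfBirth": [283, 290, 208, 268],
    "age":         [17, 10, 92, 32],
    "book1":       [1, 1, 0, 0],
    "book2":       [1, 0, 0, 0],
})

# sum of year of birth and age
toy["age_birth"] = toy["dateOfBirth"] + toy["age"]

# number of characters that share the same house
toy["houseSize"] = toy["house"].map(toy["house"].value_counts())

# number of books a character appears in, and a flag for "appears in none"
toy["total_books_in"] = toy[["book1", "book2"]].sum(axis=1)
toy["show_only"] = (toy["total_books_in"] == 0).astype(int)

print(toy)
```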

<h3> Creating Dummy-variables </h3>
For the categorical columns gender, culture and house I created dummy variables. I decided not to do so for mother, father, spouse and heir, as these are too specific. For the culture column, I first grouped together cultures that are the same and differ only in spelling; for example, Summer Islands, Summer Islander and Summer Isles all become Summer Islands.
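
One way to express the grouping is to invert the group dictionary and map each spelling to its group before creating the dummies. This is a sketch of an alternative to the loop used in the code cell, not the notebook's own implementation; `variant_to_group` is a hypothetical helper name and only two groups are shown:

```python
import pandas as pd

# two of the culture groups, each with its spelling variants
groups = {
    "Summer Islands": ["Summer Islands", "Summer Islander", "Summer Isles"],
    "Braavosi":       ["Braavosi", "Braavos"],
}

# invert to {variant: group} so a single map() call does the grouping
variant_to_group = {v: g for g, variants in groups.items() for v in variants}

culture = pd.Series(["Summer Isles", "Braavos", "Summer Islander", "Braavosi"])
grouped = culture.map(variant_to_group)

# one dummy column per group, with spaces replaced as in the notebook
dummies = pd.get_dummies(grouped, prefix="culture")
dummies.columns = dummies.columns.str.replace(" ", "_")

print(dummies)
```

Spellings that are not listed in any group come out as NaN after `map()`, so they would need an explicit fallback category.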


<h3> Creating the final DataFrame </h3>
Last, I separated the target variable <em>isAlive</em> from the rest. I dropped the categorical columns and the columns I do not use later on, for example the index column. Additionally, I dropped one column per dummy variable (the <em>unknown</em> category).

In [None]:
# importing libraries
import numpy as np                                      # mathematical essentials
import pandas as pd                                     # data science essentials
import matplotlib.pyplot as plt                         # data visualization
import gender_guesser.detector as gender                # gender guessing

# importing libraries for modeling
import statsmodels.formula.api as smf                   # linear modeling
from sklearn.model_selection import train_test_split    # training and testing
from sklearn.metrics import roc_auc_score               # scoring metric
from sklearn.metrics import make_scorer                 # customizable scorer
from sklearn.metrics import confusion_matrix            # confusion matrix
from sklearn.preprocessing import StandardScaler        # standard scaler
from sklearn.model_selection import RandomizedSearchCV  # hyperparameter tuning
from sklearn.linear_model import LogisticRegression     # logistic regression
from sklearn.neighbors import KNeighborsClassifier      # KNN for Classification
from sklearn.tree import DecisionTreeClassifier         # Decision tree for Classification
from sklearn.ensemble import RandomForestClassifier     # random forest
from sklearn.ensemble import GradientBoostingClassifier # gbm


# loading data
got = pd.read_excel(io = 'GOT_character_predictions.xlsx')


# getting first and last name --> https://stackoverflow.com/questions/51290134/using-pandas-how-do-i-split-based-on-the-first-space
got[['first_name', 'last_name']] = got['name'].str.split(n=1, expand=True)

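# original guessing loop, kept for reference (slow, so its output is hard-coded below)
#detector = gender.Detector()          # create the detector once, not once per name
#placeholder_lst = []
#for name in got['first_name']:
#    placeholder_lst.append(detector.get_gender(name))

# guessed genders --> gender_list is the saved output of the loop above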
gender_list = ['unknown', 'unknown', 'andy', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'male', 'mostly_male', 'mostly_male', 'mostly_male', 'mostly_male', 'mostly_male', 'mostly_male', 'unknown', 'male', 'unknown', 'unknown', 'male', 'male', 'female', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'male', 'andy', 'andy', 'unknown', 'andy', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'male', 'male', 'unknown', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'mostly_male', 'male', 'mostly_male', 'mostly_male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'mostly_male', 'unknown', 'unknown', 'male', 'female', 'andy', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'mostly_female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'andy', 'female', 'unknown', 'unknown', 
'unknown', 'unknown', 'male', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'male', 'female', 'female', 'female', 'female', 'female', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'mostly_female', 'female', 'unknown', 'mostly_female', 'unknown', 'female', 'unknown', 'female', 'unknown', 'male', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'andy', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'male', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'female', 'female', 'female', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 
'male', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_female', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'mostly_male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_male', 'female', 'male', 'male', 'male', 'female', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'male', 'male', 'male', 'female', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'andy', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'female', 'male', 'unknown', 'unknown', 'female', 'male', 'unknown', 'male', 'unknown', 'unknown', 'male', 'female', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 
'unknown', 'unknown', 'unknown', 'female', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'male', 'unknown', 'male', 'male', 'unknown', 'unknown', 'male', 'male', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'female', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'female', 'female', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_female', 'mostly_female', 'mostly_female', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'female', 'male', 'male', 'male', 'male', 'unknown', 'female', 'female', 'female', 'unknown', 'mostly_male', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'female', 'male', 'female', 'unknown', 'unknown', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'male', 
'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'female', 'unknown', 'male', 'unknown', 'unknown', 'mostly_female', 'male', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'female', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'male', 'male', 'female', 'mostly_female', 'female', 'mostly_female', 'mostly_female', 'mostly_female', 'mostly_female', 'mostly_female', 'mostly_female', 'unknown', 'unknown', 'female', 'female', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'female', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'unknown', 'female', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 
'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'female', 'mostly_male', 'unknown', 'female', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'male', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 
'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'female', 'unknown', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'mostly_male', 'male', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'female', 'female', 'male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'male', 'unknown', 'female', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'male', 'male', 'andy', 'male', 'male', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'male', 'male', 'male', 'male', 'male', 'male', 'mostly_male', 'mostly_male', 'mostly_male', 'mostly_male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'male', 'unknown', 'male', 'male', 'unknown', 'male', 'male', 'unknown', 'unknown', 
'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'female', 'female', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'male', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'mostly_female', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'male', 
'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'mostly_female', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'male', 'mostly_female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'female', 'unknown', 'female', 'unknown', 'female', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'andy', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'female', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'mostly_female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 
'unknown', 'unknown', 'unknown', 'unknown', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'male', 'unknown', 'male', 'male', 'unknown', 'male', 'unknown', 'male', 'male', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'male', 'male', 'male', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'mostly_male', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'mostly_female', 'unknown', 'unknown', 'unknown', 'female', 'male', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 
'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'female', 'male', 'mostly_male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown']


# converting list into a series
got['gender_guess'] = pd.Series(gender_list)

# categorical columns
cat_cols =['title', 'culture','mother','father', 'heir', 'house', 'spouse']

# flagging missing values and imputing with 'unknown' for categorical columns
for col in cat_cols:
    if got[col].isnull().any():
        got['m_'+col] = got[col].isnull().astype(int)
        got[col] = got[col].fillna('unknown')
        
# imputing missing values for age and birth with median
got['dateOfBirth'] = got['dateOfBirth'].fillna(got['dateOfBirth'].median()).round(3)
got['age'] = got['age'].fillna(got['age'].median()).round(3)

# imputing missing values for boolean columns
got['isAliveMother'] = got['isAliveMother'].fillna(0)               
got['isAliveFather'] = got['isAliveFather'].fillna(0)                
got['isAliveHeir'] = got['isAliveHeir'].fillna(0)              
got['isAliveSpouse'] = got['isAliveSpouse'].fillna(0)  

# create column depending on age and birth to get rid of outlier
got['age_birth'] = got['dateOfBirth']+got['age']
        
# house size: number of characters that share the same house
got['houseSize'] = got['house'].map(got['house'].value_counts()) 

# total books in
book_cols = ['book1_A_Game_Of_Thrones', 'book2_A_Clash_Of_Kings', 'book3_A_Storm_Of_Swords',
             'book4_A_Feast_For_Crows', 'book5_A_Dance_with_Dragons']
got['total_books_in'] = got[book_cols].sum(axis=1)

# total books in is zero --> did not appear in books --> only appeared in TV-Show
got['show_only'] = (got['total_books_in'] == 0).astype(int)


# mapping same cultures into one
cult = {
    'Summer Islands': ['Summer Islands', 'Summer Islander', 'Summer Isles'],
    'Ghiscari': ['Ghiscari', 'Ghiscaricari',  'Ghis'],
    'Asshai': ["Asshai'i", 'Asshai'],
    'Crannogs': ['Crannogmen'],
    'Astapori': ['Astapori'],
    'Lysene': ['Lysene', 'Lyseni'],
    'Braavosi': ['Braavosi', 'Braavos'],
    'Dornish': ['Dornishmen', 'Dorne', 'Dornish'],
    'Dothraki': ['Dothraki'],
    'Myrish': ['Myr', 'Myrish', 'Myrmen'],
    'Westermen': ['Westermen', 'Westerman', 'Westerlands', 'westermen'],
    'Westerosi': ['Westeros', 'Westerosi'],
    'Stormlander': ['Stormlands', 'Stormlander'],
    'Northmen': ['The north', 'Northmen', 'northmen', 'Northern mountain clans'],
    'Free Folk': ['Wildling', 'First Men', 'free folk', 'Free Folk', 'Free folk' ],
    'Qartheen': ['Qartheen', 'Qarth'],
    'Reach': ['The Reach', 'Reach', 'Reachmen'],
    'Ironborn': ['Ironborn', 'Ironmen', 'ironborn'],
    'Mereen': ['Meereen', 'Meereenese'],
    'RiverLands': ['Riverlands', 'Rivermen'],
    'Vale': ['Vale', 'Valemen', 'Vale mountain clans'],
    'Valyrian' : ['Valyrian'],
    'Pentoshi' : ['Pentoshi'],
    'Tyrosh' : ['Tyroshi'],
    'Unknown' : ['unknown'],
    'Other' : ['Ibbenese', 'Astapor', 'Lhazarene', 'Rhoynar', 'Naathi', 'Norvoshi', 'Norvos', 'Wildlings',
              'Sistermen', 'Lhazareen', 'Andal', 'Andals', 'Qohor']
}

got['Cult_new'] = ""

for culture in cult:
    got.loc[got.culture.isin(values=cult[culture]), 'Cult_new'] = culture

# creating dummy variables for gender, culture and house
one_hot_gender  = pd.get_dummies(got['gender_guess'], prefix='gender')
one_hot_culture = pd.get_dummies(got['Cult_new'], prefix='culture')
one_hot_house   = pd.get_dummies(got['house'], prefix='house')
got = got.join(other = [one_hot_gender, one_hot_culture, one_hot_house])

# replacing spaces by underscores in all column names (affects the house_ and culture_ dummies)
got.columns = got.columns.str.replace(' ', '_')

# getting target variable
got_target = got.loc[:, 'isAlive']
    
    
# dropping not needed columns
got = got.drop(['S.No','name', 'last_name', 'first_name', 'title', 'culture', 'Cult_new',
                'mother', 'father', 'heir', 'spouse', 'house', 
                'gender_guess', 'isAlive', 'gender_unknown', 'culture_Unknown', 'house_unknown'], axis = 1)

<h2> Model 1 - Logistic Regression </h2>

My first approach for solving the problem is a logistic regression model. 

<h3> Train-Test-Split function </h3>
For later use in all models, I created a function that performs the train-test split and optionally standardizes the explanatory variables.
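
The function in the next cell standardizes the full data set before splitting. A common variant, sketched here on random toy data and not part of this notebook's pipeline, splits first and fits the scaler on the training rows only, so that the test set does not influence the scaling. All names and values below are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# random toy features and an imbalanced toy target (25 dead, 75 alive)
rng = np.random.default_rng(219)
x = pd.DataFrame({"f1": rng.normal(size=100),
                  "f2": rng.normal(loc=5, scale=3, size=100)})
y = pd.Series([0] * 25 + [1] * 75)

# stratify keeps the share of each class roughly equal in train and test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=219, stratify=y)

# fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(x_train)
x_train_s = pd.DataFrame(scaler.transform(x_train),
                         columns=x.columns, index=x_train.index)
x_test_s = pd.DataFrame(scaler.transform(x_test),
                        columns=x.columns, index=x_test.index)

print(x_train_s.shape, x_test_s.shape)
```

Passing `columns` and `index` keeps the feature names and row labels, which a bare `pd.DataFrame(x_scaled)` would replace with integers.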

In [None]:
# function for train and test split and standardizing
def test_train_standardize( x_data,
                            y_data,
                            standardize = False,
                            pct_test    = 0.1,
                            seed        = 219):
    """
    Creates a train set and a test set for the given x and y data.
    Optionally standardizes the x data before splitting.
    Returns x_train, x_test, y_train, y_test.

    Parameters:
    x_data        : explanatory variable data
    y_data        : response variable
    standardize   : whether or not to standardize the x data, default False
    pct_test      : share of the data held out for testing, in (0, 1), default 0.1
    seed          : random seed to be used in algorithm, default 219
    """
    if standardize:
        # optionally standardizing x_data (the scaler is fit on all rows, before the split)
        scaler             = StandardScaler()
        scaler.fit(x_data)
        x_scaled           = scaler.transform(x_data)
        x_scaled_df        = pd.DataFrame(x_scaled)
        x_data             = x_scaled_df

    # train-test split
    x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                        y_data,
                                                        test_size    = pct_test,
                                                        random_state = seed, 
                                                        stratify     = y_data)
    return x_train, x_test, y_train, y_test
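
The function above fits the scaler on the full dataset before splitting, so the test rows contribute to the scaling statistics. The following is a sketch of an alternative, not the approach used for the results in this notebook: the hypothetical helper `split_then_standardize` splits first and fits `StandardScaler` on the training rows only. The demo data is randomly generated and only borrows two column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_then_standardize(x_data, y_data, pct_test=0.1, seed=219):
    """Hypothetical variant: split first, then scale with training statistics only."""
    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data,
        test_size=pct_test, random_state=seed, stratify=y_data)

    # the scaler only ever sees the training rows
    scaler = StandardScaler().fit(x_train)
    x_train = pd.DataFrame(scaler.transform(x_train),
                           columns=x_data.columns, index=x_train.index)
    x_test = pd.DataFrame(scaler.transform(x_test),
                          columns=x_data.columns, index=x_test.index)
    return x_train, x_test, y_train, y_test

# randomly generated demo data (assumption, not the GoT dataset)
rng = np.random.default_rng(219)
x_demo = pd.DataFrame({'popularity': rng.random(200),
                       'age_birth': rng.integers(0, 100, 200)})
y_demo = pd.Series(rng.integers(0, 2, 200), name='isAlive')

x_tr, x_te, y_tr, y_te = split_then_standardize(x_demo, y_demo)
print(x_tr.shape, x_te.shape)
```

This variant also keeps the column names, which the original function replaces with integer labels when standardizing.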

<h3> Creating combinations with low p-values </h3>
In the logistic regression model, p-values are important. Therefore, I started by looking at the correlation of each column with the target variable and built a candidate dictionary with the most promising columns. After that, I checked the p-values using the statsmodels formula API (smf) and only kept the column combinations in which all variables had low p-values.

In [None]:
# # creating a dataframe with all columns and target variable to look for correlation and p-values
# got_full_logreg = got.join(got_target)

# print(abs(got_full_logreg.corr()['isAlive']).sort_values(ascending=False).head(n=10))

candidate_dic = {
    'top_columns_1' : ['popularity', 'age_birth', 'book4_A_Feast_For_Crows'],
    'top_columns_2' :['popularity', 'age_birth', 'book4_A_Feast_For_Crows', 'show_only'],
    'top_columns_3' :['popularity', 'age_birth', 'book4_A_Feast_For_Crows', 'book1_A_Game_Of_Thrones'],
    'top_columns_4' : ['age_birth', 'book4_A_Feast_For_Crows', 'culture_Valyrian', 'book1_A_Game_Of_Thrones'],
    'top_columns_5' : ['numDeadRelations', 'age_birth', 'book4_A_Feast_For_Crows'],
    'top_columns_6' : ['show_only', 'age_birth', 'book1_A_Game_Of_Thrones', 'm_mother']
}

# # instantiating a logistic regression model object
# logit_full = smf.logit(formula = """ isAlive ~
#                                     book4_A_Feast_For_Crows +
#                                     book1_A_Game_Of_Thrones +
#                                     show_only +
#                                     popularity
#                                     """,
#                                      data    = got_full_logreg)

# # fitting the model object
# logit_full = logit_full.fit()

# # checking the results SUMMARY
# logit_full.summary2()

<h3> Running Logistic Regression </h3>
For later use, I wrote a function that runs logistic regression automatically.

In [None]:
def optimal_logreg( x, 
                    y,
                    columns        = candidate_dic['top_columns_3'],
                    ran_gridsearch = False,
                    solver         = 'lbfgs',
                    C              = 1,
                    warm_start     = False):
    """
Runs a logistic regression on a dataset and target variable. Slices the 
explanatory variables down to the predefined columns first and then fits the
logistic regression with the given parameters.
x              : Whole explanatory dataset
y              : Whole target variable 
columns        : Columns used in explanatory variable
ran_gridsearch : Whether a grid search was run beforehand. If True, the model is fit on the
                whole dataset, else only on the train dataset. 
solver         : Solver used in logistic regression
C              : C used in logistic regression
warm_start     : warm start for logistic regression, either True or False
    
    """
    
    # slice the columns
    x = x[columns]   
    
    # initiate logistic regression
    lr = LogisticRegression(solver       = solver,
                            C            = C,
                            warm_start   = warm_start,
                            random_state = 219,
                            max_iter     = 1000
    )
    
    
    # creating train and test sets
    x_train, x_test, y_train, y_test = test_train_standardize(x, y)
    
    # FITTING the training data
    if ran_gridsearch:
        # ran cv gridsearch before, therefore fitting on whole dataset 
        lr_fit = lr.fit(x, y)
    else:
        # no cv before --> fitting on train data
        lr_fit = lr.fit(x_train, y_train)
    
    # PREDICTING based on the testing set
    lr_pred = lr_fit.predict(x_test)

    # saving scoring data
    lr_train_score = lr_fit.score(x_train, y_train).round(4) # accuracy
    lr_test_score  = lr_fit.score(x_test, y_test).round(4)   # accuracy
    
    # saving auc score
    lr_auc_score = roc_auc_score(y_true  = y_test,
                                 y_score = lr_pred).round(decimals = 4)
    
    # creating confusion matrix
    lr_tn, lr_fp, lr_fn, lr_tp = confusion_matrix(y_true = y_test, y_pred = lr_pred).ravel()
    lr_conf_matrix = [lr_tn, lr_fp, lr_fn, lr_tp]
      
    return [lr_test_score, lr_train_score, lr_auc_score, lr_conf_matrix]
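
`optimal_logreg` passes the hard class predictions from `predict` to `roc_auc_score`. With 0/1 labels as scores, the result equals the balanced accuracy, whereas passing the predicted probabilities evaluates the full ranking. The sketch below illustrates the difference on synthetic data from `make_classification` (an assumption for illustration; the numbers are unrelated to the tables in this notebook).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic binary classification data (assumption, not the GoT dataset)
x_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=219)
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.1, random_state=219, stratify=y_demo)

lr = LogisticRegression(max_iter=1000, random_state=219).fit(x_train, y_train)

labels = lr.predict(x_test)              # hard 0/1 predictions
probs = lr.predict_proba(x_test)[:, 1]   # probability of class 1

auc_from_labels = roc_auc_score(y_test, labels)
auc_from_probs = roc_auc_score(y_test, probs)
bal_acc = balanced_accuracy_score(y_test, labels)

print(auc_from_labels, bal_acc, auc_from_probs)
```

The first two printed values coincide, which is why the AUC scores reported here can be read as balanced accuracy at the default 0.5 threshold.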

<h4> Testing column combinations </h4>

The next step is to run this function on our pre-defined column sets and see which set performs best.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_1      0.8103       0.8024     0.6366  [14, 36, 1, 144]
top_columns_2      0.8103       0.8007     0.6366  [14, 36, 1, 144]
top_columns_3      0.8564       0.8127     0.7397  [25, 25, 3, 142]
top_columns_4      0.8462       0.7887     0.7262  [24, 26, 4, 141]
top_columns_5      0.8103       0.7990     0.6300  [13, 37, 0, 145]
top_columns_6      0.8103       0.7978     0.6300  [13, 37, 0, 145]
~~~

Looking at the outcome, we see that combinations 3 and 4 seem to work best, shifting a couple of False Positives to True Negatives. So for further testing in logistic regression, I will continue with these two options. Interestingly, these are the two options that contain both book 1 and book 4.
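
Because the AUC score is computed from hard predictions, it can be rechecked directly from the confusion matrix as the mean of the true-negative rate and the true-positive rate. A small sketch using three rows of the table above (`label_auc` is a hypothetical helper name):

```python
def label_auc(tn, fp, fn, tp):
    """AUC of a hard 0/1 prediction: mean of true-negative and true-positive rate."""
    tnr = tn / (tn + fp)   # share of dead characters predicted dead
    tpr = tp / (tp + fn)   # share of living characters predicted alive
    return (tnr + tpr) / 2

# confusion matrices [tn, fp, fn, tp] copied from the table above
confusion = {
    'top_columns_1': [14, 36, 1, 144],
    'top_columns_3': [25, 25, 3, 142],
    'top_columns_4': [24, 26, 4, 141],
}

recomputed = {name: round(label_auc(*cm), 4) for name, cm in confusion.items()}
print(recomputed)
```

The recomputed values match the AUC Score column, which also shows why moving False Positives to True Negatives is what lifts sets 3 and 4.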

In [None]:
# # running logistic regression for all set of columns in candidate dic
# scores = []
# for x in candidate_dic:
#     score = optimal_logreg(x = got, y = got_target, columns = candidate_dic[x])
#     scores.append(score)

# lr_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= candidate_dic.keys())
# lr_score_df

<h4> Running Hyperparameter Tuning </h4>
For the best models, in this case column sets 3 and 4, I ran hyperparameter tuning to see whether I could raise the AUC score a little. For that, I wrote a function. According to the search, the parameters for the two column sets should be:

~~~
               warm_start     solver    C
top_columns_3       False  newton-cg  2.5
top_columns_4        True  newton-cg  3.0
~~~

I then ran logistic regression on these two sets of columns with their specified optimal hyperparameters and got the following output:

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_3      0.8513       0.8007     0.7428  [26, 24, 5, 140]
top_columns_4      0.8462       0.7893     0.7262  [24, 26, 4, 141]
~~~

For column set 3, the AUC score rose slightly, while for set 4 it stayed unchanged. The best model built with logistic regression is therefore the one on column set 3 with tuned hyperparameters, with an AUC score of 0.7428.

In [None]:
# Logistic Regression hyperparameter tuning
def run_cv_logreg(x, 
                  y, 
                  columns = candidate_dic['top_columns_3']):
    """
Runs a cross validation on logistic regression. Outputs the best parameters
found in the search.
Parameters:
x       : Whole explanatory dataset
y       : Whole target variable 
columns : columns used in explanatory variable
    """
    
    # slice the columns
    x = x[columns]  
    
    # declaring a hyperparameter space
    C_range          = np.arange(0.1, 5.0, 0.1)
    warm_start_range = [True, False]
    solver_range     = ['newton-cg', 'sag', 'lbfgs']

    # creating a hyperparameter grid
    param_grid = {'C'          : C_range,
                  'warm_start' : warm_start_range,
                  'solver'     : solver_range}

    # INSTANTIATING the model object without hyperparameters
    lr_tuned = LogisticRegression(random_state = 219,
                                  max_iter     = 1000) # increased for convergence

    # RandomizedSearchCV object
    lr_tuned_cv = RandomizedSearchCV(estimator           = lr_tuned,   # the model object
                                     param_distributions = param_grid, # parameters to tune
                                     cv                  = 3,          # how many folds in cross-validation
                                     n_iter              = 250,        # number of combinations of hyperparameters to try
                                     random_state        = 219,        # starting point for random sequence
                                     scoring = make_scorer(
                                               roc_auc_score,
                                               needs_threshold = False)) # scoring criteria (AUC)

    # FITTING to the FULL DATASET (due to cross-validation)
    lr_tuned_cv.fit(x, y)
    
    return lr_tuned_cv.best_params_

# # find optimal hyperparameters for sets 3 and 4
# opt_1 = run_cv_logreg(got, got_target, columns = candidate_dic['top_columns_3'])
# opt_2 = run_cv_logreg(got, got_target, columns = candidate_dic['top_columns_4'])

# df = pd.DataFrame([opt_1, opt_2], index = ['top_columns_3', 'top_columns_4'])
# print(df)

# # run Logistic Regression with optimized hyperparameters
# scores = []
# score_3 = optimal_logreg(x = got, y = got_target, columns = candidate_dic['top_columns_3'], ran_gridsearch = True, solver = 'newton-cg', C = 2.5, warm_start = False)
# scores.append(score_3)
# score_4 = optimal_logreg(x = got, y = got_target, columns = candidate_dic['top_columns_4'], ran_gridsearch = True, solver = 'newton-cg', C = 3.0, warm_start = True)
# scores.append(score_4)

# lr_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= ['top_columns_3', 'top_columns_4'])
# lr_score_df
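
The scorer built with `make_scorer(roc_auc_score, needs_threshold = False)` ranks candidates by AUC on hard predictions, and depending on the installed scikit-learn version the `needs_threshold` argument may be deprecated. As a hedged alternative, scikit-learn also accepts the built-in scorer name `scoring = 'roc_auc'`, which uses the model's predicted scores. The sketch below uses synthetic data and a deliberately small, hypothetical grid; it is not the search that produced the parameters above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# synthetic data (assumption, not the GoT dataset)
x_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=219)

# small hypothetical grid: 3 x 2 = 6 combinations
param_grid = {'C': [0.1, 1.0, 2.5],
              'solver': ['newton-cg', 'lbfgs']}

search = RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000),
                            param_distributions=param_grid,
                            cv=3,
                            n_iter=6,              # equals the grid size here
                            random_state=219,
                            scoring='roc_auc')     # built-in AUC scorer
search.fit(x_demo, y_demo)

best_params = search.best_params_
best_cv_auc = search.best_score_
print(best_params, round(best_cv_auc, 4))
```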


<h2> Model 2 - KNN Classification </h2>

My second approach for solving the problem is a KNN Classification model. 

<h3> Running KNN Classification</h3>
For later use, I wrote a function that runs KNN classification automatically. 

In [None]:
# KNN Classification
def optimal_knn(x,
                y,
                columns       = candidate_dic['top_columns_3'],
                max_neighbors = 20,
                show_viz      = False,
                standardize   = False):
    """
Exhaustively compute training and testing results for KNN across
[1, max_neighbors]. Outputs the training, testing and final score as well
as the confusion matrix.
PARAMETERS
----------
x             : Whole explanatory dataset
y             : Whole target variable 
columns       : Columns used in explanatory variable
max_neighbors : maximum number of neighbors in exhaustive search, default 20
show_viz      : display or suppress k-neighbors visualization, default False
standardize   : whether or not to standardize the x data, default False
"""
    # slice the columns
    x = x[columns]

    # creating train and test sets
    if standardize:
        x_train, x_test, y_train, y_test = test_train_standardize(x, y, standardize = True)
    else:
        x_train, x_test, y_train, y_test = test_train_standardize(x, y)

    # creating lists for training set accuracy and test set accuracy
    training_accuracy = []
    test_accuracy = []
    auc_score_list = []
    conf_matrix_list = []
    
    # setting neighbor range
    neighbors_settings = range(1, max_neighbors + 1)

    for n_neighbors in neighbors_settings:
        # creating KNN Classifier and fit train data
        knn = KNeighborsClassifier(n_neighbors = n_neighbors)
        knn_fit = knn.fit(x_train, y_train) 
        
        # predict 
        knn_pred = knn_fit.predict(x_test)
           
        # recording the training set accuracy
        train_score = knn_fit.score(x_train, y_train)
        training_accuracy.append(train_score)
    
        # recording the generalization accuracy
        test_score = knn_fit.score(x_test, y_test)
        test_accuracy.append(test_score)
         
        # recording the auc score
        auc_score = roc_auc_score(y_true  = y_test,
                                  y_score = knn_pred).round(decimals = 4)
        auc_score_list.append(auc_score)
        
        # creating confusion matrix
        knn_tn, knn_fp, knn_fn, knn_tp = confusion_matrix(y_true = y_test, y_pred = knn_pred).ravel()
        knn_conf_matrix = [knn_tn, knn_fp, knn_fn, knn_tp]
        conf_matrix_list.append(knn_conf_matrix)


    # optionally displaying visualization
    if show_viz == True:
        # plotting the visualization
        fig, ax = plt.subplots(figsize=(12,8))
        plt.plot(neighbors_settings, training_accuracy, label = "training accuracy")
        plt.plot(neighbors_settings, test_accuracy, label = "test accuracy")
        plt.ylabel("Accuracy")
        plt.xlabel("n_neighbors")
        plt.legend()
        plt.show()
    
    # index of neighbors with best auc score
    index_best = auc_score_list.index(max(auc_score_list))

    return [test_accuracy[index_best], training_accuracy[index_best], auc_score_list[index_best], conf_matrix_list[index_best]]

<h4> Testing column combinations </h4>
First, I ran KNN classification on all columns (minus the dropped ones) without standardization. This produced a higher AUC score than logistic regression, but with a big gap between training and test accuracy.

~~~
             Test Score  Train Score  AUC Score   Confusion Matrix
All Columns    0.861538     0.990862     0.8086  [35, 15, 12, 133]
~~~

So the model seems to overfit on the chosen columns. Running the KNN-Classifier on the pre-defined columns, I got the following result:

~~~
               Test Score  Train Score  AUC Score   Confusion Matrix
top_columns_1    0.789744     0.808110     0.8193   [44, 6, 35, 110]
top_columns_2    0.856410     0.840662     0.8117  [36, 14, 14, 131]
top_columns_3    0.912821     0.853798     0.8628   [38, 12, 5, 140]
top_columns_4    0.897436     0.825814     0.8262   [34, 16, 4, 141]
top_columns_5    0.769231     0.700742     0.7990   [43, 7, 38, 107]
top_columns_6    0.851282     0.795545     0.8148  [37, 13, 16, 129]
~~~

Again, column set 3 does the best job, even though the gap is still a little too big. Knowing that, I ran a couple of classifications with a new set of column combinations named <em>knn_dic</em> to check whether I could get the score up and the gap down. It did not make much of a difference, though knn_1 and knn_4 raised the AUC score slightly.

~~~
               Test Score  Train Score  AUC Score   Confusion Matrix
top_columns_3    0.912821     0.853798     0.8628   [38, 12, 5, 140]
knn_1            0.912821     0.852656     0.8693   [39, 11, 6, 139]
knn_2            0.912821     0.849800     0.8628   [38, 12, 5, 140]
knn_3            0.912821     0.854940     0.8628   [38, 12, 5, 140]
knn_4            0.923077     0.849229     0.8762   [39, 11, 4, 141]
knn_5            0.897436     0.825243     0.8262   [34, 16, 4, 141]
~~~

Last, I ran the classification on my original column dictionary, this time standardizing the columns. As we deal with a lot of boolean columns, this did not improve the score.

~~~
               Test Score  Train Score  AUC Score   Confusion Matrix
top_columns_1    0.774359     0.804112     0.7959   [42, 8, 36, 109]
top_columns_2    0.769231     0.797259     0.7924   [42, 8, 37, 108]
top_columns_3    0.841026     0.841805     0.8014  [36, 14, 17, 128]
top_columns_4    0.897436     0.825814     0.8262   [34, 16, 4, 141]
top_columns_5    0.753846     0.695603     0.7886   [43, 7, 41, 104]
top_columns_6    0.851282     0.794974     0.8148  [37, 13, 16, 129]
~~~
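
Standardization changes the KNN results because the classifier relies on distances, and rescaling the columns changes which characters count as nearest neighbours. The sketch below uses three hypothetical characters with an age and a boolean book flag, plus assumed standard deviations of 25 and 0.5 (illustrative values, not taken from the data). Centering is left out because it does not affect distances.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# hypothetical characters: (age_birth, book1 flag)
a = (30.0, 1.0)
b = (32.0, 0.0)   # similar age, different flag
c = (45.0, 1.0)   # different age, same flag

# raw distances: age dominates, so b is the nearer neighbour of a
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# assumed standard deviations of the two columns (illustrative)
stds = (25.0, 0.5)

def scale(p):
    return tuple(v / s for v, s in zip(p, stds))

# scaled distances: the flag now dominates, so c is the nearer neighbour of a
scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(raw_ab, raw_ac, scaled_ab, scaled_ac)
```

After scaling, a difference in a single boolean column outweighs a sizeable age difference, which is one plausible reason the standardized runs do not score better here.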

In [None]:
# # run knn with all columns
# knn_all = optimal_knn(got, got_target, got.columns)

# # run knn with columns in candidate dic
# scores = [knn_all]
# for x in candidate_dic:
#     score = optimal_knn(x = got, y = got_target, columns = candidate_dic[x])
#     scores.append(score)

# idx = ['All Columns']
# for key in candidate_dic.keys():
#     idx.append(key)

# knn_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= idx)

# # new dictionary changing up column combination 3
knn_dic = {
        'knn_1' :['popularity', 'age_birth', 'book4_A_Feast_For_Crows', 'book1_A_Game_Of_Thrones', 'culture_Valyrian'],
        'knn_2' :['popularity', 'age_birth', 'book4_A_Feast_For_Crows', 'book1_A_Game_Of_Thrones', 'numDeadRelations'],
        'knn_3' :['popularity', 'age_birth', 'book4_A_Feast_For_Crows', 'book1_A_Game_Of_Thrones', 'numDeadRelations', 'culture_Valyrian'],
        'knn_4' :['popularity', 'age_birth', 'book4_A_Feast_For_Crows', 'book1_A_Game_Of_Thrones', 'numDeadRelations', 'show_only'],
        'knn_5' :['age_birth', 'book4_A_Feast_For_Crows', 'book1_A_Game_Of_Thrones']
    }

# # run knn on that new dictionary
# scores = []
# for x in knn_dic:
#     score = optimal_knn(x = got, y = got_target, columns = knn_dic[x])
#     scores.append(score)

# knn_score_df_new = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= knn_dic.keys())

# # run knn with standardization
# scores = []
# for x in candidate_dic:
#     score = optimal_knn(x = got, y = got_target, columns = candidate_dic[x], standardize=True)
#     scores.append(score)
    
# knn_score_df_stand = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= candidate_dic.keys())


# knn_score_df = pd.concat([knn_score_df, knn_score_df_new, knn_score_df_stand])

# knn_score_df


<h2> Model 3 - Decision Tree </h2>

My third approach for solving the problem is a Decision Tree. 

<h3> Running Decision Tree</h3>
For later use, I wrote a function that runs a Decision Tree automatically. 

In [None]:
def optimal_dectree(x, 
                    y, 
                    columns          = candidate_dic['top_columns_3'],
                    ran_gridsearch   = False,
                    criterion        = 'gini', 
                    splitter         = 'best', 
                    max_depth        = 8, 
                    min_samples_leaf = 25):
    """
Runs a decision tree on a dataset and target variable. Slices the 
explanatory variables down to the predefined columns first and then fits the
decision tree with the given parameters.
x                : Whole explanatory dataset
y                : Whole target variable 
columns          : Columns used in explanatory variable
ran_gridsearch   : Whether a grid search was run beforehand. If True, the model is fit on the
                  whole dataset, else only on the train dataset. 
criterion        : criterion used in decision tree
splitter         : splitter used in decision tree
max_depth        : max depth used in decision tree
min_samples_leaf : minimum samples leaf used in decision tree
    """
        
    # slice the columns
    x = x[columns]
    
    # INSTANTIATING a classification tree object
    full_tree = DecisionTreeClassifier(splitter          = splitter,
                                        criterion        = criterion,
                                        max_depth        = max_depth,
                                        min_samples_leaf = min_samples_leaf,
                                        random_state     = 219)
    
    # creating train and test sets
    x_train, x_test, y_train, y_test = test_train_standardize(x, y)
    
    # FITTING the training data
    if ran_gridsearch:
        # ran cv gridsearch before, therefore fitting on whole dataset 
        full_tree_fit = full_tree.fit(x, y)
    else:
        # no cv before --> fitting on train data
        full_tree_fit = full_tree.fit(x_train, y_train)


    # PREDICTING on new data
    full_tree_pred = full_tree_fit.predict(x_test)


    # saving scoring data for future use
    full_tree_train_score = full_tree_fit.score(x_train, y_train).round(4) # accuracy
    full_tree_test_score  = full_tree_fit.score(x_test, y_test).round(4)   # accuracy

    # saving AUC
    full_tree_auc_score   = roc_auc_score(y_true  = y_test,
                                          y_score = full_tree_pred).round(4) # auc
    
    # creating confusion matrix
    tree_tn, tree_fp, tree_fn, tree_tp = confusion_matrix(y_true = y_test, y_pred = full_tree_pred).ravel()
    tree_conf_matrix = [tree_tn, tree_fp, tree_fn, tree_tp]
    
#     # look for most important columns
#     df = pd.DataFrame(full_tree_fit.feature_importances_, index = x.columns, columns = ['Importance']).sort_values(by = 'Importance', ascending=False)
#     print(df.head(n=10))
    
    return [full_tree_test_score, full_tree_train_score, full_tree_auc_score, tree_conf_matrix]

<h4> Testing column combinations </h4>
First, I ran the decision tree on all columns minus the dropped ones. That led to a good AUC score of 0.83.

~~~

             Test Score  Train Score  AUC Score  Confusion Matrix
All Columns      0.8974       0.8498     0.8328  [35, 15, 5, 140]
~~~

I then verified which are the most important columns with the following code:

~~~
df = pd.DataFrame(
        full_tree_fit.feature_importances_, 
        index = x.columns, 
        columns = ['Importance']).sort_values(by ='Importance', ascending=False)
print(df.head(n=10))
~~~

For the decision tree, these were the most important columns:

~~~
                            Importance
age_birth                     0.528619
popularity                    0.164484
book4_A_Feast_For_Crows       0.097439
total_books_in                0.063694
book1_A_Game_Of_Thrones       0.037923
book5_A_Dance_with_Dragons    0.036501
m_house                       0.016557
m_spouse                      0.015028
m_title                       0.012088
houseSize                     0.011523
~~~
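
The importance table can be turned into a column set programmatically. This small sketch copies the ten values printed above into a dictionary (the variable names are hypothetical), ranks them, and reports how much of the listed importance the top 5 columns carry.

```python
# importances copied from the table above
importances = {
    'age_birth':                  0.528619,
    'popularity':                 0.164484,
    'book4_A_Feast_For_Crows':    0.097439,
    'total_books_in':             0.063694,
    'book1_A_Game_Of_Thrones':    0.037923,
    'book5_A_Dance_with_Dragons': 0.036501,
    'm_house':                    0.016557,
    'm_spouse':                   0.015028,
    'm_title':                    0.012088,
    'houseSize':                  0.011523,
}

# rank columns from most to least important
ranked = sorted(importances, key=importances.get, reverse=True)

top5 = ranked[:5]
top5_share = sum(importances[col] for col in top5)

print(top5)
print(round(top5_share, 3))
```

The five highest-ranked columns account for roughly 89% of the total importance, so little information is lost by restricting the tree to them.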

Running the tree with only the top 5 columns gave a better AUC score than running it on all columns.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
tree_top5          0.9077       0.8447     0.8593  [38, 12, 6, 139]
tree_top8          0.8974       0.8498     0.8328  [35, 15, 5, 140]
tree_top10         0.8974       0.8498     0.8328  [35, 15, 5, 140]
~~~

Last, I once again ran the decision tree on the pre-defined sets of columns:

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_1      0.8974       0.8384     0.8197  [33, 17, 3, 142]
top_columns_2      0.8923       0.8367     0.8359  [36, 14, 7, 138]
top_columns_3      0.9077       0.8429     0.8528  [37, 13, 5, 140]
top_columns_4      0.8923       0.8155     0.8162  [33, 17, 4, 141]
top_columns_5      0.8564       0.8241     0.7200  [22, 28, 0, 145]
top_columns_6      0.8564       0.8258     0.7200  [22, 28, 0, 145]
~~~

Again, column set 3 seems to be a good fit!

In [None]:
# # run decision tree with all columns
# tree_all = optimal_dectree(got, got_target, got.columns)

# # run decision tree with columns in candidate dic
# scores = [tree_all]
# for x in candidate_dic:
#     score = optimal_dectree(x = got, y = got_target, columns = candidate_dic[x])
#     scores.append(score)

# idx = ['All Columns']
# for key in candidate_dic.keys():
#     idx.append(key)

# tree_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= idx)

# # new dictionary with most important features
tree_dic = {
        'tree_top5' : ['age_birth', 'popularity', 'book4_A_Feast_For_Crows', 'total_books_in', 'book1_A_Game_Of_Thrones'],
        'tree_top8' : ['age_birth', 'popularity', 'book4_A_Feast_For_Crows', 'total_books_in', 'book1_A_Game_Of_Thrones', 'book5_A_Dance_with_Dragons', 'm_house', 'm_spouse'],
        'tree_top10': ['age_birth', 'popularity', 'book4_A_Feast_For_Crows', 'total_books_in', 'book1_A_Game_Of_Thrones', 'book5_A_Dance_with_Dragons', 'm_house', 'm_spouse', 'm_title', 'houseSize']
    }

# # run decision tree on that new dictionary
# scores = []
# for x in tree_dic:
#     score = optimal_dectree(x = got, y = got_target, columns = tree_dic[x])
#     scores.append(score)

# tree_score_df_new = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= tree_dic.keys())
# tree_score_df = pd.concat([tree_score_df, tree_score_df_new])

# tree_score_df

<h4> Running Hyperparameter Tuning </h4>
For the best models found before, I now run hyperparameter tuning. The candidates tested are <em> tree_top5 </em> and <em> top_columns_3 </em>. The search returned the following parameters:

~~~
              splitter  min_samples_leaf  max_depth criterion
top_columns_3     best                 4          7   entropy
tree_top5         best                 4          7   entropy
~~~

Running the decision tree with these parameters lowered the AUC score.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_3      0.9128       0.8555     0.8431  [35, 15, 2, 143]
tree_top5          0.9179       0.8521     0.8466  [35, 15, 1, 144]
~~~

However, changing the max depth to 8 instead of 7 and leaving the rest as found in the random search raised the AUC score to over 0.86 for both sets. Looking at the confusion matrix, <em> top_columns_3 </em> did a slightly better job.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_3      0.9231       0.8549     0.8631  [37, 13, 2, 143]
tree_top5          0.9128       0.8578     0.8628  [38, 12, 5, 140]
~~~
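
One likely reason the search settled on a depth of 7 is the search space itself: `np.arange(1, 8, 1)` in `run_cv_dectree` is a half-open interval, so a depth of 8 was never a candidate. The sketch below rebuilds the same space with plain Python ranges (which follow the same half-open convention) and counts the combinations.

```python
from itertools import product

# same values as the hyperparameter space in run_cv_dectree
criterion_range = ['gini', 'entropy']
splitter_range = ['best', 'random']
depth_range = list(range(1, 8))    # like np.arange(1, 8, 1): 1 to 7
leaf_range = list(range(1, 100))   # like np.arange(1, 100, 1): 1 to 99

grid = list(product(criterion_range, splitter_range, depth_range, leaf_range))

n_combinations = len(grid)
max_depth_searched = max(depth_range)
share_sampled = 1000 / n_combinations   # n_iter = 1000 random draws

print(n_combinations, max_depth_searched, round(share_sampled, 2))
```

Widening the range to `np.arange(1, 9, 1)` would let the search consider depth 8 directly instead of adjusting it by hand afterwards.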

In [None]:
def run_cv_dectree(x, 
                   y,
                   columns = candidate_dic['top_columns_3']):
    """
Runs a cross validation on a decision tree. Outputs the best parameters
found in the search.
Parameters:
x       : Whole explanatory dataset
y       : Whole target variable
columns : columns used in explanatory variable

    """
    
    # slice the columns
    x = x[columns] 
    
    # declaring a hyperparameter space
    criterion_range = ['gini', 'entropy']
    splitter_range  = ['best', 'random']
    depth_range     = np.arange(1, 8, 1)
    leaf_range      = np.arange(1, 100, 1)

    # creating a hyperparameter grid
    param_grid = {'criterion'        : criterion_range,
                  'splitter'         : splitter_range,
                  'max_depth'        : depth_range,
                  'min_samples_leaf' : leaf_range}
 
    # INSTANTIATING the model object without hyperparameters
    tuned_tree = DecisionTreeClassifier(random_state = 219)

    # RandomizedSearchCV object
    tuned_tree_cv = RandomizedSearchCV(estimator             = tuned_tree,
                                       param_distributions   = param_grid,
                                       cv                    = 3,
                                       n_iter                = 1000,
                                       random_state          = 219,
                                       scoring = make_scorer(roc_auc_score,
                                                 needs_threshold = False))

    # FITTING to the FULL DATASET (due to cross-validation)
    tuned_tree_cv.fit(x, y)
    
    return tuned_tree_cv.best_params_
    
# # find optimal hyperparameters for set 3 and tree_top5
# opt_1 = run_cv_dectree(got, got_target, columns = candidate_dic['top_columns_3'])
# opt_2 = run_cv_dectree(got, got_target, columns = tree_dic['tree_top5'])

# df = pd.DataFrame([opt_1, opt_2], index = ['top_columns_3', 'tree_top5'])
# print(df)

# # run decision tree with optimized hyperparameters
# scores = []
# score_3 = optimal_dectree(x = got, y = got_target, columns = candidate_dic['top_columns_3'], ran_gridsearch = True, criterion = 'entropy', splitter = 'best', max_depth = 8, min_samples_leaf = 4)
# scores.append(score_3)
# score_top5 = optimal_dectree(x = got, y = got_target, columns = tree_dic['tree_top5'], ran_gridsearch = True, criterion = 'entropy', splitter = 'best', max_depth = 8, min_samples_leaf = 4)
# scores.append(score_top5)

# tree_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= ['top_columns_3', 'tree_top5'])
# print(tree_score_df)

<h2> Model 4 - Random Forest </h2>

My fourth approach for solving the problem is a Random Forest. 

<h3> Running Random Forest</h3>
For later use, I wrote a function that runs a Random Forest automatically. 

In [None]:
def optimal_ranfor( x, 
                    y, 
                    columns          = candidate_dic['top_columns_3'],
                    ran_gridsearch   = False,
                    n_estimators     = 100,
                    criterion        = 'gini',
                    bootstrap        = True, 
                    max_depth        = 8, 
                    min_samples_leaf = 25,
                    warm_start       = False):
    """
Runs a random forest on a dataset and target variable. Slices the 
explanatory variables down to the predefined columns first and then fits the
random forest with the given parameters.
x                : Whole explanatory dataset
y                : Whole target variable 
columns          : Columns used in explanatory variable
ran_gridsearch   : Whether a grid search was run beforehand. If True, the model is fit on the
                  whole dataset, else only on the train dataset. 
n_estimators     : number of trees in the random forest
criterion        : split criterion used in the random forest
bootstrap        : whether bootstrap samples are used when building the trees
max_depth        : max depth of each tree
min_samples_leaf : minimum samples per leaf
warm_start       : True or False, whether to reuse the previous fit (warm start)
    """
        
    # slice the columns
    x = x[columns]
    
    # INSTANTIATING a random forest classifier object
    rf  = RandomForestClassifier(n_estimators    = n_estimators,
                                criterion        = criterion,
                                max_depth        = max_depth,
                                min_samples_leaf = min_samples_leaf,
                                bootstrap        = bootstrap,
                                warm_start       = warm_start,
                                random_state     = 219)
    
    # creating train and test sets
    x_train, x_test, y_train, y_test = test_train_standardize(x, y)
    
    # FITTING the training data
    if ran_gridsearch:
        # ran cv gridsearch before, therefore fitting on whole dataset 
        rf_fit = rf.fit(x, y)
    else:
        # no cv before --> fitting on train data
        rf_fit = rf.fit(x_train, y_train)


    # PREDICTING on new data
    rf_pred = rf_fit.predict(x_test)


    # saving scoring data for future use
    rf_train_score = rf_fit.score(x_train, y_train).round(4) # accuracy
    rf_test_score  = rf_fit.score(x_test, y_test).round(4)   # accuracy

    # saving AUC
    rf_auc_score   = roc_auc_score(y_true  = y_test,
                                   y_score = rf_pred).round(4) # auc
    
    # creating confusion matrix
    rf_tn, rf_fp, rf_fn, rf_tp = confusion_matrix(y_true = y_test, y_pred = rf_pred).ravel()
    rf_conf_matrix = [rf_tn, rf_fp, rf_fn, rf_tp]
    
#     # look for most important columns
#     df = pd.DataFrame(rf_fit.feature_importances_, index = x.columns, columns = ['Importance']).sort_values(by = 'Importance', ascending=False)
#     print(df.head(n=15))
    
    return [rf_test_score, rf_train_score, rf_auc_score, rf_conf_matrix]

<h4> Testing column combinations </h4>
First, I ran the random forest on all columns minus the dropped ones. That led to a bad AUC score of 0.50.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
All Columns        0.7436       0.7470     0.5000   [0, 50, 0, 145]
~~~
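
The confusion matrix `[0, 50, 0, 145]` means the forest predicted "alive" for every character in the test set. The short calculation below, using only the counts from that row, shows that such a majority-class prediction reproduces both the test accuracy and the AUC score of 0.50.

```python
# counts [tn, fp, fn, tp] from the 'All Columns' row above
tn, fp, fn, tp = 0, 50, 0, 145

# accuracy: share of correct predictions
accuracy = (tn + tp) / (tn + fp + fn + tp)

# AUC on hard predictions: mean of true-negative and true-positive rate
tnr = tn / (tn + fp)   # no dead character was recognized
tpr = tp / (tp + fn)   # every living character was recognized
label_auc = (tnr + tpr) / 2

print(round(accuracy, 4), label_auc)
```

An accuracy of 0.7436 therefore only reflects the share of living characters in the test set, which is why the AUC score is the more informative number here.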

As for the decision tree, I verified the most important columns:

~~~
                         Importance
popularity                 0.107835
age_birth                  0.106666
dateOfBirth                0.101263
total_books_in             0.081695
book4_A_Feast_For_Crows    0.076046
show_only                  0.056806
numDeadRelations           0.047509
house_House_Frey           0.046862
houseSize                  0.044695
house_House_Targaryen      0.040999
book1_A_Game_Of_Thrones    0.040636
house_Night's_Watch        0.034534
age                        0.028214
culture_Ironborn           0.025178
book2_A_Clash_Of_Kings     0.022260
~~~
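The top-k column lists used in the next step can also be derived from the ranking instead of being typed by hand. The following is a minimal sketch, assuming the importances are available as a plain dictionary (values copied from the table above, shortened to the first eight rows). The helper `top_k_columns` and the dictionary `top_dic` are hypothetical names and not part of the notebook.

```python
# importance values copied from the table above (first eight rows only)
importances = {
    'popularity'              : 0.107835,
    'age_birth'               : 0.106666,
    'dateOfBirth'             : 0.101263,
    'total_books_in'          : 0.081695,
    'book4_A_Feast_For_Crows' : 0.076046,
    'show_only'               : 0.056806,
    'numDeadRelations'        : 0.047509,
    'house_House_Frey'        : 0.046862,
}


def top_k_columns(importances, k):
    """Returns the k column names with the highest importance (hypothetical helper)."""
    ranked = sorted(importances, key = importances.get, reverse = True)
    return ranked[:k]


# hypothetical dictionary mirroring the 'rf_top5' / 'rf_top8' keys
top_dic = {f'rf_top{k}': top_k_columns(importances, k) for k in (5, 8)}

print(top_dic['rf_top5'])
```

Building the lists this way keeps them in sync with the ranking if the importances change after a re-run.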

Running the random forest with only the top 5 columns gave a much better AUC score of 0.8331, the best of the four subsets tested.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
rf_top5            0.9077       0.8458     0.8331  [34, 16, 2, 143]
rf_top8            0.8410       0.8104     0.6900  [19, 31, 0, 145]
rf_top10           0.8769       0.8252     0.7666  [27, 23, 1, 144]
rf_top15           0.8615       0.8150     0.7366  [24, 26, 1, 144]
~~~
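Because the AUC in `optimal_ranfor` is calculated from the hard class predictions (`rf_pred`), the ROC curve has only one threshold point, and the AUC reduces to the mean of the true positive rate and the true negative rate. The sketch below reproduces the AUC column from the confusion matrices shown in the tables. The function name `auc_from_confusion` is an assumption made for illustration.

```python
def auc_from_confusion(conf_matrix):
    """
    AUC for hard 0/1 predictions, derived from [tn, fp, fn, tp].
    With a single threshold, AUC = (TPR + TNR) / 2.
    """
    tn, fp, fn, tp = conf_matrix
    tpr = tp / (tp + fn)   # share of living characters predicted alive
    tnr = tn / (tn + fp)   # share of dead characters predicted dead
    return round((tpr + tnr) / 2, 4)


print(auc_from_confusion([34, 16, 2, 143]))   # rf_top5
print(auc_from_confusion([0, 50, 0, 145]))    # All Columns
```

This also explains the 0.50 for all columns: predicting everyone alive gives a true positive rate of 1 and a true negative rate of 0.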


Lastly, I ran the forest on each of the pre-defined column sets:

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_1      0.8667       0.8281     0.7662  [28, 22, 4, 141]
top_columns_2      0.8718       0.8441     0.7631  [27, 23, 2, 143]
top_columns_3      0.9026       0.8412     0.8297  [34, 16, 3, 142]
top_columns_4      0.8462       0.8104     0.7000  [20, 30, 0, 145]
top_columns_5      0.8256       0.8081     0.6600  [16, 34, 0, 145]
top_columns_6      0.8462       0.8104     0.7000  [20, 30, 0, 145]
~~~

Our favourite, column set 3, does a good job again with an AUC score of 0.8297.

In [None]:
# # run random forest with all columns
# rf_all = optimal_ranfor(got, got_target, got.columns)

# # run random forest with columns in candidate dic
# scores = [rf_all]
# for x in candidate_dic:
#     score = optimal_ranfor(x = got, y = got_target, columns = candidate_dic[x])
#     scores.append(score)

# idx = ['All Columns']
# for key in candidate_dic.keys():
#     idx.append(key)

# rf_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= idx)

# # new dictionary with most important features
rf_dic = {
        'rf_top5' : ['popularity', 'age_birth', 'dateOfBirth', 'total_books_in', 'book4_A_Feast_For_Crows'],
        'rf_top8' : ['popularity', 'age_birth', 'dateOfBirth', 'total_books_in', 'book4_A_Feast_For_Crows', 'show_only', 'numDeadRelations', 'house_House_Frey'],
        'rf_top10': ['popularity', 'age_birth', 'dateOfBirth', 'total_books_in', 'book4_A_Feast_For_Crows', 'show_only', 'numDeadRelations', 'house_House_Frey', 'houseSize', 'house_House_Targaryen'],
        'rf_top15': ['popularity', 'age_birth', 'dateOfBirth', 'total_books_in', 'book4_A_Feast_For_Crows', 'show_only', 'numDeadRelations', 'house_House_Frey', 'houseSize', 'house_House_Targaryen', 'book1_A_Game_Of_Thrones', "house_Night's_Watch", 'age', 'culture_Ironborn', 'book2_A_Clash_Of_Kings'],
    
    }

# # run random forest on that new dictionary
# scores = []
# for x in rf_dic:
#     score = optimal_ranfor(x = got, y = got_target, columns = rf_dic[x])
#     scores.append(score)

# rf_score_df_new = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= rf_dic.keys())
# rf_score_df = pd.concat([rf_score_df, rf_score_df_new])

# rf_score_df

<h4> Running Hyperparameter Tuning </h4>
For the two best models found above, I now run hyperparameter tuning. The candidates tested are <em> tree_top5 </em> and <em> top_columns_3 </em>. The search returned the following parameters:

~~~
               warm_start  n_estimators  min_samples_leaf  max_depth  criterion  bootstrap 
top_columns_3        True           600                 1          6    entropy      False  
tree_top5            True           100                 1          7       gini      False         
~~~
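To get a feeling for the size of the search, the number of parameter combinations can be counted directly. This sketch uses plain `range` objects with the same values as the `np.arange` ranges in `run_cv_ranfor`. To my understanding, scikit-learn's randomized search samples without replacement when every parameter is given as a list of values, so a search with `n_iter = 1000` on this grid would simply try every combination.

```python
from itertools import product

# same values as the np.arange(...) ranges in run_cv_ranfor
param_grid = {
    'n_estimators'     : list(range(100, 1100, 250)),
    'min_samples_leaf' : list(range(1, 31, 10)),
    'max_depth'        : list(range(1, 8, 1)),
    'criterion'        : ['gini', 'entropy'],
    'bootstrap'        : [True, False],
    'warm_start'       : [True, False],
}

# number of distinct parameter combinations in the grid
n_combinations = len(list(product(*param_grid.values())))
n_iter = 1000

print(n_combinations)             # size of the search space
print(n_combinations <= n_iter)   # True --> the search covers the whole grid
print(max(param_grid['max_depth']))
```

The largest depth in this grid is 7, so a max depth of 8 lies outside the searched range and has to be tested by hand.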

Running the random forest with the tuned hyperparameters leads to a higher AUC score, especially for the <em> tree_top5 </em> columns.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_3      0.9179       0.8521     0.8531  [36, 14, 2, 143]
tree_top5          0.9231       0.8624     0.8697  [38, 12, 3, 142]
~~~

Therefore, running the random forest with the <em> tree_top5 </em> columns is the best option here. However, raising the max depth from 7 to 8 pushed the score up even more.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
tree_top5          0.9333       0.8664     0.8897  [40, 10, 3, 142]
~~~

In [None]:
def run_cv_ranfor( x, 
                   y,
                   columns = candidate_dic['top_columns_3']):
    """
Runs a cross validation on a random forest. Outputs the best parameters
found in the search.
Parameters:
x       : Whole explanatory dataset
y       : Whole target variable
columns : columns used in explanatory variable
    """
    # slicing the columns
    x = x[columns]
    
    # declaring a hyperparameter space
    estimator_range  = np.arange(100, 1100, 250)
    leaf_range       = np.arange(1, 31, 10)
    depth_range     = np.arange(1, 8, 1)
    criterion_range  = ['gini', 'entropy']
    bootstrap_range  = [True, False]
    warm_start_range = [True, False]


    # creating a hyperparameter grid
    param_grid = {'n_estimators'     : estimator_range,
                  'min_samples_leaf' : leaf_range,
                  'max_depth'        : depth_range,
                  'criterion'        : criterion_range,
                  'bootstrap'        : bootstrap_range,
                  'warm_start'       : warm_start_range}
 
    # INSTANTIATING the model object without hyperparameters
    forest_grid = RandomForestClassifier(random_state = 219)

    # RandomizedSearchCV object
    forest_cv = RandomizedSearchCV(estimator           = forest_grid,
                                   param_distributions = param_grid,
                                   cv                  = 3,
                                   n_iter              = 1000,
                                   scoring             = make_scorer(roc_auc_score,
                                                needs_threshold = False))

    # FITTING to the FULL DATASET (due to cross-validation)
    forest_cv.fit(x, y)
    
    return forest_cv.best_params_
    
# # find optimal hyperparameters for set 3 and rf_top5
# opt_1 = run_cv_ranfor(got, got_target, columns = candidate_dic['top_columns_3'])
# opt_2 = run_cv_ranfor(got, got_target, columns = rf_dic['rf_top5'])

# df = pd.DataFrame([opt_1, opt_2], index = ['top_columns_3', 'tree_top5'])
# print(df)

# # run random forest with optimized hyperparameters
# scores = []
# score_3 = optimal_ranfor(x = got, y = got_target, columns = candidate_dic['top_columns_3'], ran_gridsearch = True,
#                         warm_start = True, n_estimators = 600, min_samples_leaf = 1, max_depth = 6, criterion = 'entropy', bootstrap = False)
# scores.append(score_3)
# score_top5 = optimal_ranfor(x = got, y = got_target, columns = tree_dic['tree_top5'], ran_gridsearch = True,
#                             warm_start = True, n_estimators = 100, min_samples_leaf = 1, max_depth = 8, criterion = 'gini', bootstrap = False)
# scores.append(score_top5)

# rf_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= ['top_columns_3', 'tree_top5'])
# print(rf_score_df)

<h2> Model 5 - Gradient Boosted Machine </h2>

My last approach for solving the problem is a GBM. 

<h3> Running GBM </h3>
Once again, I put running the GBM model in a function. 

In [None]:
def optimal_gbm(x, 
                y, 
                columns          = candidate_dic['top_columns_3'],
                ran_gridsearch   = False,
                learning_rate    = 0.1,
                n_estimators     = 100,
                max_depth        = 3,
                warm_start       = False):
    """
Runs a gbm on a dataset and target variable. Slices down the 
explanatory variable by predefined columns first and fits the gbm
with the predefined parameters after.
x                : Whole explanatory dataset
y                : Whole target variable 
columns          : Columns used in explanatory variable
ran_gridsearch   : Variable that clarifies, if grid search was run. In that case, the model is fit on the
                  whole dataset, else only on the train-dataset. 
learning_rate    : learning rate used in gbm
n_estimators     : number of estimators used in gbm
max_depth        : max depth used in gbm
warm_start       : whether true or false, warm start used in gbm
    """
        
    # slice the columns
    x = x[columns]
    
    # INSTANTIATING the model object with the given hyperparameters
    # (note: loss = 'deviance' was renamed to 'log_loss' in scikit-learn 1.1)
    gbm = GradientBoostingClassifier(loss          = 'deviance',
                                     learning_rate = learning_rate,
                                     n_estimators  = n_estimators,
                                     criterion     = 'friedman_mse',
                                     max_depth     = max_depth,
                                     warm_start    = warm_start,
                                     random_state  = 219)
    
    # creating train and test sets
    x_train, x_test, y_train, y_test = test_train_standardize(x, y)
    
    # FITTING the training data
    if ran_gridsearch:
        # ran cv gridsearch before, therefore fitting on whole dataset
        gbm_fit = gbm.fit(x, y)
    else:
        # no cv before --> fitting on train data
        gbm_fit = gbm.fit(x_train, y_train)


    # PREDICTING on new data
    gbm_pred = gbm_fit.predict(x_test)


    # saving scoring data for future use
    gbm_train_score = gbm_fit.score(x_train, y_train).round(4) # accuracy
    gbm_test_score  = gbm_fit.score(x_test, y_test).round(4)   # accuracy

    # saving AUC (computed from the hard class predictions, not from probabilities)
    gbm_auc_score   = roc_auc_score(y_true  = y_test,
                                    y_score = gbm_pred).round(4) # auc
    
    # creating confusion matrix
    gbm_tn, gbm_fp, gbm_fn, gbm_tp = confusion_matrix(y_true = y_test, y_pred = gbm_pred).ravel()
    gbm_conf_matrix = [gbm_tn, gbm_fp, gbm_fn, gbm_tp]
    
    return [gbm_test_score, gbm_train_score, gbm_auc_score, gbm_conf_matrix]

<h4> Testing column combinations </h4>
First, I run the GBM on all columns minus the dropped ones, and then on each of the pre-defined column sets:

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
All Columns        0.8974       0.8749     0.8131  [32, 18, 2, 143]
top_columns_1      0.8769       0.8498     0.7797  [29, 21, 3, 142]
top_columns_2      0.8718       0.8561     0.7631  [27, 23, 2, 143]
top_columns_3      0.9026       0.8584     0.8231  [33, 17, 2, 143]
top_columns_4      0.8564       0.8258     0.7200  [22, 28, 0, 145]
top_columns_5      0.8718       0.8350     0.7500  [25, 25, 0, 145]
top_columns_6      0.8615       0.8338     0.7300  [23, 27, 0, 145]
~~~

And once again, set 3 is our preferred column set here.

In [None]:
# # run gbm with all columns
# gbm_all = optimal_gbm(got, got_target, got.columns)

# # run gbm with columns in candidate dic
# scores = [gbm_all]
# for x in candidate_dic:
#     score = optimal_gbm(x = got, y = got_target, columns = candidate_dic[x])
#     scores.append(score)

# idx = ['All Columns']
# for key in candidate_dic.keys():
#     idx.append(key)

# gbm_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= idx)

# gbm_score_df


<h4> Running Hyperparameter Tuning </h4>
The last exercise of this assignment is running hyperparameter tuning on the GBM. Here, I only optimize column set 3.

~~~
               warm_start  n_estimators  max_depth  learning_rate
top_columns_3       False           100          4            0.6       
~~~

After tuning the parameters, the AUC score went up significantly, from 0.8231 to 0.8831.

~~~
               Test Score  Train Score  AUC Score  Confusion Matrix
top_columns_3      0.9333       0.8766     0.8831  [39, 11, 2, 143]
~~~
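Note that the AUC values reported here come from hard 0/1 predictions (`gbm_fit.predict`). An AUC based on predicted probabilities (in scikit-learn typically `predict_proba(x_test)[:, 1]` passed to `roc_auc_score`) ranks the characters instead and will usually give a different number. The sketch below illustrates the difference with a small pairwise AUC written in plain Python. The toy data is made up and not taken from the notebook, and `auc_from_scores` is a hypothetical helper.

```python
def auc_from_scores(y_true, y_score):
    """
    Pairwise (rank-based) AUC: share of (alive, dead) pairs in which the
    living character receives the higher score. Ties count as half.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# made-up toy data, NOT taken from the notebook
y_true  = [0, 0, 1, 1, 1]
y_proba = [0.2, 0.6, 0.55, 0.7, 0.9]

# hard labels with the usual 0.5 cut-off
y_label = [1 if p >= 0.5 else 0 for p in y_proba]

auc_labels = auc_from_scores(y_true, y_label)   # AUC on hard predictions
auc_proba  = auc_from_scores(y_true, y_proba)   # AUC on probabilities

print(round(auc_labels, 4), round(auc_proba, 4))
```

As all models in this notebook are scored the same way, the comparison between them stays consistent. Only the absolute values would change when switching to probabilities.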

In [None]:
def run_cv_gbm( x, 
                y,
                columns = candidate_dic['top_columns_3']):
    """
Runs a cross validation on GBM. Outputs the best parameters
found in the search.
Parameters:
x       : Whole explanatory dataset
y       : Whole target variable
columns : columns used in explanatory variable
    """
    # slicing the columns
    x = x[columns]
    
    # declaring a hyperparameter space
    learn_range        = np.arange(0.1, 2.2, 0.5)
    estimator_range    = np.arange(100, 501, 25)
    depth_range        = np.arange(2, 8, 2)
    warm_start_range   = [True, False]


    # creating a hyperparameter grid
    param_grid = {'learning_rate' : learn_range,
                  'max_depth'     : depth_range,
                  'n_estimators'  : estimator_range,
                  'warm_start'    : warm_start_range}

    # INSTANTIATING the model object without hyperparameters
    gbm_grid = GradientBoostingClassifier(random_state = 219)

    # RandomizedSearchCV object
    gbm_cv = RandomizedSearchCV(estimator          = gbm_grid,
                               param_distributions = param_grid,
                               cv                  = 3,
                               n_iter              = 500,
                               random_state        = 219,
                               scoring             = make_scorer(roc_auc_score,
                                                     needs_threshold = False))

    # FITTING to the FULL DATASET (due to cross-validation)
    gbm_cv.fit(x, y)
    
    return gbm_cv.best_params_
    
# # find optimal hyperparameters for set 3
# opt_1 = run_cv_gbm(got, got_target, columns = candidate_dic['top_columns_3'])

# df = pd.DataFrame([opt_1], index = ['top_columns_3'])
# print(df)

# # run gbm with optimized hyperparameters
# scores = []
# score_3 = optimal_gbm(x = got, y = got_target, columns = candidate_dic['top_columns_3'], ran_gridsearch = True,
#                      warm_start = False, n_estimators = 100, max_depth = 4, learning_rate = 0.6)
# scores.append(score_3)


# gbm_score_df = pd.DataFrame(scores, columns = ['Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix'], 
#                            index= ['top_columns_3'])
# print(gbm_score_df)

<h2> Bringing it all together </h2>

Here comes the interesting part: comparing all the models. For every model type, I collected the variant with the highest score. In the final comparison, the random forest reached the highest AUC score (0.8897), just ahead of the GBM (0.8831). However, the GBM has the smaller gap between test and train score (0.0567 versus 0.0669). In false positives and false negatives, the two models are almost the same (10 and 3 for the random forest, 11 and 2 for the GBM), so I choose the GBM as the final model.


~~~
            Model Name  Test Score  Train Score  AUC Score  Confusion Matrix
3        Random Forest    0.933300     0.866400     0.8897  [40, 10, 3, 142]
4                  GBM    0.933300     0.876600     0.8831  [39, 11, 2, 143]
1                  KNN    0.912821     0.852656     0.8693  [39, 11, 6, 139]
2        Decision Tree    0.923100     0.854900     0.8631  [37, 13, 2, 143]
0  Logistic Regression    0.851300     0.800700     0.7428  [26, 24, 5, 140]
~~~
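The comparison between the two leading models can be checked with a few lines of Python. This is only a sketch with the values copied from the table above. The dictionary layout and the names `results` and `summary` are assumptions for illustration.

```python
# values copied from the comparison table above
results = {
    'Random Forest' : {'test': 0.9333, 'train': 0.8664, 'conf': [40, 10, 3, 142]},
    'GBM'           : {'test': 0.9333, 'train': 0.8766, 'conf': [39, 11, 2, 143]},
}

summary = {}
for name, res in results.items():
    tn, fp, fn, tp = res['conf']
    summary[name] = {
        'gap' : round(res['test'] - res['train'], 4),  # test-train gap
        'fp'  : fp,                                    # predicted alive, actually dead
        'fn'  : fn,                                    # predicted dead, actually alive
    }
    print(name, summary[name])
```

The GBM has the smaller gap, while the random forest has one false positive fewer and one false negative more.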

In [None]:
# best logistic regression model
logreg = optimal_logreg(x = got, y = got_target, columns = candidate_dic['top_columns_3'], ran_gridsearch = True, solver = 'newton-cg', C = 2.5, warm_start = False)

# best KNN model
knn = optimal_knn(x = got, y = got_target, columns = knn_dic['knn_1'])

# best tree model
tree = optimal_dectree(x = got, y = got_target, columns = candidate_dic['top_columns_3'], ran_gridsearch = True, criterion = 'entropy', splitter = 'best', max_depth = 8, min_samples_leaf = 4)

# best forest model
rf = optimal_ranfor(x = got, y = got_target, columns = tree_dic['tree_top5'], ran_gridsearch = True,
                    warm_start = True, n_estimators = 100, min_samples_leaf = 1, max_depth = 8, criterion = 'gini', bootstrap = False)

# best gbm model
gbm = optimal_gbm(x = got, y = got_target, columns = candidate_dic['top_columns_3'], ran_gridsearch = True,
                  warm_start = False, n_estimators = 100, max_depth = 4, learning_rate = 0.6)

# declaring the model performance object with one row per model
# (built in one step, since DataFrame.append was removed in pandas 2.0)
best_models = {'Logistic Regression' : logreg,
               'KNN'                 : knn,
               'Decision Tree'       : tree,
               'Random Forest'       : rf,
               'GBM'                 : gbm}

perf_columns = ['Model Name', 'Test Score', 'Train Score', 'AUC Score', 'Confusion Matrix']

model_performance = pd.DataFrame([dict(zip(perf_columns, [name, *scores]))
                                  for name, scores in best_models.items()])

# checking the results
# print(model_performance.sort_values(by = 'AUC Score', ascending = False))

print(f"""
My final model chosen is the GBM-model with the following scores:


Model    Test Score      Train Score      Train-Test-Gap      AUC Score       Confusion Matrix
-----    -----------     -----------      --------------      ---------      -----------------
*GBM*    {gbm[0]}           {gbm[1]}          {(gbm[0]-gbm[1]).round(decimals = 4)}              {gbm[2]}         {gbm[3]}
 RF      {rf[0]}           {rf[1]}          {(rf[0]-rf[1]).round(decimals = 4)}              {rf[2]}         {rf[3]}
 KNN     {knn[0].round(decimals=4)}           {knn[1].round(decimals=4)}          {(knn[0]-knn[1]).round(decimals = 4)}              {knn[2]}         {knn[3]}
 Tree    {tree[0]}           {tree[1]}          {(tree[0]-tree[1]).round(decimals = 4)}              {tree[2]}         {tree[3]}
 LR      {logreg[0]}           {logreg[1]}          {(logreg[0]-logreg[1]).round(decimals = 4)}              {logreg[2]}         {logreg[3]}
""")