### Spaceship. Part 02.
## Model Development

Let's load the training data prepared in ['01_preparation.ipynb'](01_preparation.ipynb):

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('train_prepared.csv', index_col=0)

X_train = data.drop('Transported', axis=1)
y_train = data['Transported']

# Random seed for reproducibility
SEED = 123


Let's start with a Logistic Regression model. We'll calculate its cross-validated ROC AUC score:

In [2]:
%%time

# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Create the model
reg = LogisticRegression(max_iter=150)

# Import the modules for cross-validation
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Create a StratifiedKFold object (keeps the class balance in every fold)
kf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)

# Perform cross-validation
scores = cross_val_score(reg, X_train, y_train, cv=kf, scoring="roc_auc")

# Calculate average ROC AUC score
print("Average ROC AUC score for Logistic Regression: {}".format(np.mean(scores)))

roc_auc_scores = pd.DataFrame({'Logistic Regression': [np.mean(scores)]})
print(roc_auc_scores)


Average ROC AUC score for Logistic Regression: 0.8125688600125863
   Logistic Regression
0             0.812569
CPU times: total: 1.14 s
Wall time: 3.45 s
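
A single mean hides how much the score varies between folds. Here is a minimal sketch that reports the per-fold spread. It uses synthetic data from `make_classification` as a stand-in for `train_prepared.csv`, and the names `X_demo`, `y_demo` and `kf_demo` are assumptions made for illustration, so the printed numbers will not match the score above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared training data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=123)

# Same cross-validation setup as in the cell above
kf_demo = StratifiedKFold(n_splits=6, shuffle=True, random_state=123)
fold_scores = cross_val_score(LogisticRegression(max_iter=150),
                              X_demo, y_demo, cv=kf_demo, scoring="roc_auc")

# One ROC AUC value per fold, then the mean and the standard deviation
print("ROC AUC per fold:", np.round(fold_scores, 3))
print("Mean: {:.3f}, std: {:.3f}".format(fold_scores.mean(), fold_scores.std()))
```

A large standard deviation relative to the gap between two models would suggest that the difference between them may not be meaningful.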


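The Feature Effect table below shows the coefficient the model assigns to each of our 6 features. None of them is close to zero, so all of them contribute to the result; feature 5 has the largest effect (2.82) and feature 4 the smallest (-0.26):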

In [3]:
reg.fit(X_train, y_train)
feature_effect = pd.Series(data=reg.coef_[0], index=X_train.columns)
print(feature_effect)

0   -1.494964
1   -0.731635
2    0.557557
3   -1.346397
4   -0.256378
5    2.815987
dtype: float64
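
Coefficient magnitudes are only comparable when the features share a common scale. Whether the prepared data is already scaled is not shown in this notebook, so the sketch below scales explicitly before ranking the coefficients by absolute value. The synthetic data and the `feature_0` ... `feature_5` names are assumptions made for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=123)

# Put every feature on the same scale so that coefficients are comparable
X_scaled = StandardScaler().fit_transform(X_demo)

model = LogisticRegression(max_iter=150).fit(X_scaled, y_demo)

# Hypothetical feature names, used only to label the output
effect = pd.Series(model.coef_[0], index=[f"feature_{i}" for i in range(6)])

# Order the coefficients from the strongest to the weakest effect
ranked = effect.reindex(effect.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign of each coefficient gives the direction of the effect, and the absolute value gives its strength on the scaled feature.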


Now, let's try k-Neighbors with a grid search over its parameters:

In [4]:
%%time

# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with points weighted by distance
knn = KNeighborsClassifier(weights = 'distance')

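# Parameters for grid search
# Note: 'cosine' and 'nan_euclidean' are not supported by the 'ball_tree' and
# 'kd_tree' algorithms, so fits with those combinations fail and are scored as NaN
params_knn = {'algorithm': ['ball_tree', 'kd_tree'],
              'leaf_size': [20, 25, 30, 35],
              'metric': ['l1', 'l2', 'cosine', 'nan_euclidean'],
              'n_neighbors': [10, 15, 20]}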

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate the grid search
grid = GridSearchCV(estimator=knn,
                       param_grid=params_knn,
                       scoring='roc_auc',
                       cv=kf,
                       verbose=0,
                       n_jobs=-1)

# Train the models
grid.fit(X_train, y_train)


print("Best parameters for k-Neighbors: {}".format(grid.best_params_))
print("Average ROC AUC score for k-Neighbors: {}".format(grid.best_score_))
roc_auc_scores['KNeighborsClassifier'] = [grid.best_score_]
print(roc_auc_scores)


Best parameters for k-Neighbors: {'algorithm': 'kd_tree', 'leaf_size': 20, 'metric': 'l2', 'n_neighbors': 20}
Average ROC AUC score for k-Neighbors: 0.845907597817887
   Logistic Regression  KNeighborsClassifier
0             0.812569              0.845908
CPU times: total: 1.16 s
Wall time: 32 s


288 fits failed out of a total of 576.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
72 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\mikej\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mikej\anaconda3\lib\site-packages\sklearn\neighbors\_classification.py", line 215, in fit
    return self._fit(X, y)
  File "C:\Users\mikej\anaconda3\lib\site-packages\sklearn\neighbors\_base.py", line 493, in _fit
    self._check_algorithm_metric()
  File "C:\Users\mikej\anaconda3\lib\site-packages\sklearn\neighbors\_base.py", line 434, in _check_algorithm_metric
    raise ValueError(
ValueErr

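We see some improvement: the ROC AUC score rises from 0.813 to 0.846. The 288 failed fits (half of the 576) come from the 'cosine' and 'nan_euclidean' metrics, which the 'ball_tree' and 'kd_tree' algorithms don't support. Their scores are set to NaN, so they don't affect the best result.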

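The best `n_neighbors` value (20) is on the upper edge of our parameter grid, so extending the grid to larger values may be worth trying. The best `leaf_size` (20) is also on an edge, but this parameter affects only the speed and memory use of the tree, not the predictions, so it most likely doesn't need to be extended.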
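
One way to avoid failed fits for unsupported algorithm/metric pairs is to pass `param_grid` as a list of dictionaries, so that each algorithm is combined only with metrics it accepts. This is a minimal sketch on synthetic data, not the grid used above: pairing 'cosine' and 'nan_euclidean' with the 'brute' algorithm is a choice made for illustration, and `X_demo`, `y_demo` and `grid_demo` are hypothetical names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=123)

# Each dictionary is expanded separately, so no invalid pair is ever tried
param_grid = [
    # Tree-based algorithms with metrics they support
    {'algorithm': ['ball_tree', 'kd_tree'],
     'metric': ['l1', 'l2'],
     'n_neighbors': [10, 15, 20]},
    # Brute-force search for the metrics the trees reject
    {'algorithm': ['brute'],
     'metric': ['cosine', 'nan_euclidean'],
     'n_neighbors': [10, 15, 20]},
]

grid_demo = GridSearchCV(
    estimator=KNeighborsClassifier(weights='distance'),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=123),
    error_score='raise',  # stop on the first failed fit instead of scoring NaN
)
grid_demo.fit(X_demo, y_demo)

print("Best parameters:", grid_demo.best_params_)
print("Combinations with NaN score:",
      int(np.isnan(grid_demo.cv_results_['mean_test_score']).sum()))
```

Setting `error_score='raise'` makes any remaining invalid combination surface immediately as an exception, which is the debugging step the scikit-learn warning itself suggests.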

How about Random Forests?

A Random Forests grid search takes too much time in this environment, so I found the optimal parameters in a separate environment. The code is in this file: ['02_RF.py'](02_RF.py).
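
When a full grid is too slow, a randomized search over the same kind of parameter space is a common, cheaper alternative. The sketch below is not the code from `02_RF.py`: the parameter values, the synthetic data and the names `param_dist` and `search` are all assumptions chosen to keep the example small:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=123)

# Hypothetical search space; the real one may differ
param_dist = {'n_estimators': [50, 100],
              'max_depth': [4, 8, None],
              'min_samples_leaf': [1, 5, 10]}

# Sample only n_iter combinations instead of trying all 18
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=123),
    param_distributions=param_dist,
    n_iter=4,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=123),
    random_state=123,
)
search.fit(X_demo, y_demo)

print("Best parameters for Random Forest:", search.best_params_)
print("Best ROC AUC score: {:.3f}".format(search.best_score_))
```

The run time scales with `n_iter` rather than with the size of the full grid, at the cost of possibly missing the best combination.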

We'll continue in ['03_submission.ipynb'](03_submission.ipynb).