# Spaceship. Part 02.
## Submission

Here are the best parameters for Random Forests so far: {'n_estimators': 1868, 'min_samples_leaf': 13, 'max_features': 'sqrt'}.

The best cross-validation ROC AUC score is 0.8727799993801867.

In [4]:
%%time

import pandas as pd
import numpy as np
data = pd.read_csv('train_prepared.csv', index_col=0)

X_train = data.drop('Transported', axis =1)
y_train = data['Transported']

# Random seed for reproducibility
SEED = 123

# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

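# Instantiate rf with the best parameters found so far
rf = RandomForestClassifier(random_state=SEED, n_estimators=1868, min_samples_leaf=13,
                            max_features='sqrt')
# Fit on the full training set
rf.fit(X_train, y_train)

# Predict the probability of the positive class (second column of predict_proba)
y_pred_train_proba = rf.predict_proba(X_train)[:, 1]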

from sklearn.metrics import roc_auc_score


print("ROC AUC score on a train set: {}".format(roc_auc_score(y_train, y_pred_train_proba)))
print("Accuracy on a train set: {}".format(rf.score(X_train, y_train)))

ROC AUC score on a train set: 0.9301234657433379
Accuracy on a train set: 0.8419417922466352
CPU times: total: 31.4 s
Wall time: 32 s


Now, let's create predictions on the test set and submit them to the Spaceship competition to receive a test accuracy score:

In [5]:
X_test = pd.read_csv('X_test_prepared.csv', index_col=0)
test_Ids = pd.read_csv('test_Ids.csv', index_col=0).reset_index(drop=True)

y_pred_test = rf.predict(X_test)

# Convert to 'True/False'

y_pred_test = ["True" if i == 1 else "False" for i in y_pred_test]

y_pred_test = pd.DataFrame(y_pred_test, columns=['Transported'])

submission = pd.concat([test_Ids, y_pred_test], axis=1)

submission.to_csv('03_submission.csv', index=False)
                           


The accuracy of our first submission is 0.79003. Let's examine how far we are from the best results:

In [6]:
print('We labeled correctly {} passengers (one submission).'.format(round(len(X_test) * 0.79003, 1)))
print('AmberLi456 (4th place) labeled correctly {} passengers (30 submissions).'.format(round(len(X_test) * 0.82557, 1)))

We labeled correctly 3379.0 passengers (one submission).
AmberLi456 (4th place) labeled correctly 3531.0 passengers (30 submissions).


There is room for improvement. We can pursue it in several ways:

-) Try more parameters for Random Forests

-) Try different classifiers

-) Try a greater number of PCA components

-) Explore the passengers that we predict wrongly on the training set and try to find patterns.

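Our training accuracy (0.842) is noticeably higher than the test accuracy (0.790), and the training ROC AUC (0.930) exceeds the cross-validation score (0.873). High bias is therefore unlikely; most likely, we are dealing with overfitting. This makes the first two ideas preferable: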

-) Try more parameters for Random Forests

-) Try different classifiers
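
The overfitting check above compares a score on the training rows with a score on held-out folds. A minimal sketch of that comparison follows; it runs on synthetic data from `make_classification` as a stand-in for the prepared training set, and the names `X_demo`, `y_demo`, and `rf_demo` are illustrative assumptions rather than part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=5,
                                     flip_y=0.1, random_state=123)

rf_demo = RandomForestClassifier(n_estimators=50, min_samples_leaf=13,
                                 max_features='sqrt', random_state=123)

# ROC AUC estimated on held-out folds (5-fold cross-validation)
cv_auc = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring='roc_auc').mean()

# ROC AUC measured on the same rows the model was fitted on
rf_demo.fit(X_demo, y_demo)
train_auc = roc_auc_score(y_demo, rf_demo.predict_proba(X_demo)[:, 1])

# A large positive gap suggests the model fits the training rows better than unseen rows
gap = train_auc - cv_auc
print("Train ROC AUC: {:.4f}".format(train_auc))
print("CV ROC AUC:    {:.4f}".format(cv_auc))
print("Gap:           {:.4f}".format(gap))
```

How large a gap counts as overfitting is a judgment call; the sketch only makes the two numbers directly comparable.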

## Try more parameters for Random Forests

I'll expand the parameter grid in ['03_RF.py'](03_RF.py).

Some findings for different parameters of Random Forests (parameter set and cross-validation ROC AUC score):

{'n_estimators': 1868, 'min_samples_leaf': 13, 'max_features': 'sqrt'} 0.8727799993801867

{'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 12, 'min_samples_leaf': 8} 0.8740706564503291

{'criterion': 'log_loss', 'max_depth': 15, 'min_samples_leaf': 5, 'min_impurity_decrease': 0.0006830666561997347}
0.8755529319815356

{'max_depth': 16, 'max_features': 0.9, 'max_leaf_nodes': 141, 'min_impurity_decrease': 9.376419510025081e-06, 'min_samples_leaf': 2, 'warm_start': False, 'max_samples': 0.8945061534250961}
0.8764434269931722

{'max_depth': 17, 'max_features': 0.7, 'max_leaf_nodes': 123, 'min_impurity_decrease': 0.00020380822483963789, 'min_samples_leaf': 2, 'max_samples': 0.9999360987512214}
0.8770831641318569
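
The search script itself is not reproduced in this notebook. As a rough sketch of how parameter sets like the ones above could be sampled and scored, the block below uses scikit-learn's `RandomizedSearchCV` with `scoring='roc_auc'`. This is an assumption about the approach, not the content of 03_RF.py; the sampling ranges, the synthetic data, and the names `X_demo`, `y_demo`, and `search` are hypothetical.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared training data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=5,
                                     random_state=123)

# Hypothetical sampling ranges, loosely covering the values reported above
param_distributions = {
    'max_depth': randint(5, 21),                      # integers from 5 to 20
    'max_features': uniform(0.3, 0.7),                # floats between 0.3 and 1.0
    'max_leaf_nodes': randint(50, 201),               # integers from 50 to 200
    'min_impurity_decrease': loguniform(1e-6, 1e-3),  # sampled on a log scale
    'min_samples_leaf': randint(1, 15),               # integers from 1 to 14
    'max_samples': uniform(0.5, 0.5),                 # floats between 0.5 and 1.0
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=123),
    param_distributions=param_distributions,
    n_iter=8,            # number of sampled parameter sets
    scoring='roc_auc',   # same metric as the findings above
    cv=3,
    random_state=123,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

A small `n_estimators` keeps each candidate cheap during the search; the number of trees can then be tuned separately for the best candidate.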

Increasing the number of estimators increases the cross-validation ROC AUC score, but a high number is computationally expensive. So, in most iterations I used n_estimators = 100 and avoided the computationally expensive 'entropy' criterion. Let's try some of our best parameters with different criteria and different numbers of estimators.

The 'entropy' criterion didn't show any improvement. However, we've found a better number of estimators:

{'n_estimators': 516}
0.8773454605099553

Now, let's train our model with the best parameters:

In [11]:
%%time

rf = RandomForestClassifier(random_state=SEED,
                            n_estimators=516,
                            criterion='log_loss',
                            max_depth=17,
                            max_features=0.7,
                            max_leaf_nodes=123,
                            min_impurity_decrease=0.00020380822483963789,
                            min_samples_leaf=2,
                            max_samples=0.9999360987512214,
                            n_jobs=-1)


# Fit 
rf.fit(X_train, y_train)

#Predict 
y_pred_train_proba = rf.predict_proba(X_train)[:, 1]

from sklearn.metrics import roc_auc_score


print("ROC AUC score on a train set: {}".format(roc_auc_score(y_train, y_pred_train_proba)))
print("Accuracy on a train set: {}".format(rf.score(X_train, y_train)))

ROC AUC score on a train set: 0.9240332336918978
Accuracy on a train set: 0.8380305993327966
CPU times: total: 30.1 s
Wall time: 3.36 s


Let's submit:

In [12]:
y_pred_test_2 = rf.predict(X_test)

# Convert to 'True/False'
y_pred_test_2 = ["True" if i == 1 else "False" for i in y_pred_test_2]

y_pred_test_2 = pd.DataFrame(y_pred_test_2, columns=['Transported'])


print('Number of passengers predicted differently in the second submission:')
print(len(y_pred_test[y_pred_test['Transported'] != y_pred_test_2['Transported']]))


submission_2 = pd.concat([test_Ids, y_pred_test_2], axis=1)

submission_2.to_csv('03_submission_2.csv', index=False)

Number of passengers predicted differently in the second submission:
128


The accuracy of our second submission is higher: 0.79611. Let's examine how far we are now from the best results:

In [9]:
print('In the first submission, we labeled correctly {} passengers.'.format(round(len(X_test) * 0.79003, 1)))
print('In the second submission, we labeled correctly {} passengers.'.format(round(len(X_test) * 0.79611, 1)))
print('AmberLi456 (4th place) labeled correctly {} passengers (30 submissions).'.format(round(len(X_test) * 0.82557, 1)))

In the first submission, we labeled correctly 3379.0 passengers.
In the second submission, we labeled correctly 3405.0 passengers.
AmberLi456 (4th place) labeled correctly 3531.0 passengers (30 submissions).
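
The two submissions differ on 128 passengers, yet the estimated number of correct labels rose by only 26 (3405 - 3379). Because the label is binary, every changed prediction is either a fix or a new mistake, so the two counts can be recovered by solving fixed + broken = 128 and fixed - broken = 26. The sketch below does that arithmetic; it assumes the leaderboard accuracy applies to every row of the test set and that the rounded counts are exact, and the helper name `split_changed_predictions` is hypothetical.

```python
def split_changed_predictions(n_changed, net_gain):
    """Solve fixed + broken = n_changed and fixed - broken = net_gain."""
    if abs(net_gain) > n_changed or (n_changed + net_gain) % 2 != 0:
        raise ValueError("n_changed and net_gain are inconsistent")
    fixed = (n_changed + net_gain) // 2
    broken = n_changed - fixed
    return fixed, broken


# Estimated correct labels in the second and first submissions (from the output above)
net_gain = 3405 - 3379

# 128 predictions changed between the two submissions
fixed, broken = split_changed_predictions(128, net_gain)

print("Changed predictions that became correct: {}".format(fixed))
print("Changed predictions that became wrong:   {}".format(broken))
```

Under these assumptions, roughly 77 of the changed predictions became correct and 51 became wrong, which shows how much of the difference between the two models cancels out.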


The next most promising idea is to try different models and perhaps combine those with high ROC AUC scores into an ensemble.

I tried different models (see files 03_[model_name].py), and Random Forests still show the best cross-validation score.

I could try model ensembling, but instead I will start over, tweaking some steps and testing every step with our best model.
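
For reference, the ensembling idea that is set aside here could look roughly like the sketch below: a soft-voting ensemble that averages the predicted probabilities of two models and is scored with cross-validation ROC AUC. This is not something the notebook runs; the choice of `LogisticRegression` as the second model, the synthetic data, and the names `X_demo`, `y_demo`, and `ensemble` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=5,
                                     random_state=123)

# Soft voting averages predict_proba outputs, so ROC AUC can be computed on the result
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=123)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',
)

ensemble_auc = cross_val_score(ensemble, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
print("Ensemble CV ROC AUC: {:.4f}".format(ensemble_auc))
```

An ensemble like this tends to help only when its members make different mistakes, so comparing its cross-validation score against each member on its own would be the natural first check.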

Let's continue in ['04_startover.ipynb'](04_startover.ipynb).