# Description

This notebook generates the test-set predictions for the Kaggle competition. It does the following:
1. Load and decode the final model from its JSON format. The model was trained in another notebook.
2. Compute the training score to double-check that the model loaded correctly.
3. Compute the predictions for the test data.

# Our Imports

In [1]:
from my_src import (my_model,
                    my_json)

In [2]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

In [3]:
import pandas as pd
import json

# Decode the JSON Model

In [4]:
# Decode the saved model; the object hook rebuilds the model objects from their JSON form.
with open('final_model.json', 'r') as file:
    final_model = json.load(file, object_hook = my_json.as_full_model(LinearDiscriminantAnalysis))
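
The decoding step relies on the `object_hook` argument of the standard `json` decoder, which is called on every decoded JSON object and may return a custom Python object in its place. The internals of `my_json.as_full_model` are not shown in this notebook, so the following is only a minimal sketch of the mechanism. The `Scaler` class, the `as_scaler` hook, and the `__type__` key are hypothetical names chosen for illustration.

```python
import json


class Scaler:
    """Hypothetical stand-in for a fitted model component."""

    def __init__(self, mean, scale):
        self.mean = mean
        self.scale = scale

    def transform(self, values):
        return [(v - self.mean) / self.scale for v in values]


def as_scaler(obj):
    """Hook called on every decoded JSON object, innermost objects first."""
    if obj.get('__type__') == 'Scaler':
        return Scaler(obj['mean'], obj['scale'])
    return obj  # ordinary dictionaries pass through unchanged


text = '{"preprocess": {"__type__": "Scaler", "mean": 2.0, "scale": 4.0}}'
model = json.loads(text, object_hook=as_scaler)

# The nested JSON object has been replaced by a Scaler instance.
scaled = model['preprocess'].transform([2.0, 6.0])
```

Because the hook sees inner objects before outer ones, a dictionary of components such as `final_model` can be rebuilt in a single decoding pass.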

# Check the Training Score of the Decoded Model

In [5]:
chunk_size = 2 * 10**4

In [6]:
# Apply the dimension reduction to the training data.

X_train = []
y_train = []
reader = pd.read_csv('data/train.csv', index_col = 'ID_code', chunksize = chunk_size)
for i, df in enumerate(reader):
    print('Chunk ', i, end = ', ')
    X_new = df.drop('target', axis = 1)
    y_new = df['target']
    X_new = final_model['preprocess'].transform(X_new)
    X_train.append(X_new)
    y_train.append(y_new)

Chunk  0, Chunk  1, Chunk  2, Chunk  3, Chunk  4, Chunk  5, Chunk  6, Chunk  7, Chunk  8, Chunk  9, 

In [7]:
# Form pandas.DataFrames.

X_train = pd.concat(X_train, axis = 0)
y_train = pd.concat(y_train, axis = 0)
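
The chunked read-then-concatenate pattern used above can be tried on a small in-memory file. This is a sketch under assumptions: the CSV contents below are made up, the column names `var_0` and `var_1` are placeholders, and the per-chunk preprocessing step is left out.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for data/train.csv (made-up values).
csv_text = """ID_code,target,var_0,var_1
train_0,0,1.0,2.0
train_1,1,3.0,4.0
train_2,0,5.0,6.0
train_3,1,7.0,8.0
train_4,0,9.0,10.0
"""

features, labels = [], []
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
reader = pd.read_csv(io.StringIO(csv_text), index_col='ID_code', chunksize=2)
for chunk in reader:
    features.append(chunk.drop('target', axis=1))
    labels.append(chunk['target'])

# Stitch the pieces back together; the ID_code index is preserved.
X = pd.concat(features, axis=0)
y = pd.concat(labels, axis=0)
```

Five rows with a chunk size of two yield three chunks, the last one holding a single row.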

In [8]:
# Get the training score.

y_predict = final_model['predictor'].predict_proba(X_train)[:, 1]
y_predict = pd.Series(y_predict, index = y_train.index)
roc_auc_score(y_train, y_predict)

0.8953335023516895

The training score looks good, so the model loaded correctly. Note that the training score is optimistic: the model slightly overfits the training data, so the test score should be somewhat lower.
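
To help read the score reported above: for binary labels, the ROC AUC can be interpreted as the fraction of (positive, negative) pairs in which the positive example receives the higher predicted score, with ties counted as one half. The function below is an illustrative pure-Python sketch of that interpretation, not the implementation used by `roc_auc_score`. The name `pairwise_auc` and the toy data are assumptions.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative label')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: three of the four (positive, negative) pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
```

The double loop is quadratic in the number of examples, so it is only suitable for small illustrations like this one.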

# Make Test Predictions

In [9]:
# Reduce the dimensions of the test data.

X_test = []
reader = pd.read_csv('data/test.csv', index_col = 'ID_code', chunksize = chunk_size)
for i, df in enumerate(reader):
    print('Chunk', i, end = ', ')
    X_new = final_model['preprocess'].transform(df)
    X_test.append(X_new)
X_test = pd.concat(X_test, axis = 0)

Chunk 0, Chunk 1, Chunk 2, Chunk 3, Chunk 4, Chunk 5, Chunk 6, Chunk 7, Chunk 8, Chunk 9, 

In [10]:
# Make predictions.

y_test = final_model['predictor'].predict_proba(X_test)[:, 1]
y_test = pd.DataFrame(y_test, index = X_test.index, columns = ['target'])
y_test.head()

ID_code,target
test_0,0.377315
test_1,0.098759
test_2,0.195158
test_3,0.082016
test_4,0.019838


In [11]:
# Save the predictions to a comma separated file.

y_test.to_csv('predictions/final_model.csv')
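
Writing to `predictions/final_model.csv` assumes that the `predictions` directory already exists. The sketch below shows one way to create the parent directory before writing. It uses the standard `csv` module in place of `DataFrame.to_csv` so that it runs without pandas, the helper name `save_predictions` is hypothetical, and it writes to a temporary directory rather than the project folder.

```python
import csv
import tempfile
from pathlib import Path


def save_predictions(rows, path):
    """Write (ID_code, target) rows to path, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    with path.open('w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['ID_code', 'target'])
        writer.writerows(rows)
    return path


with tempfile.TemporaryDirectory() as tmp:
    rows = [('test_0', 0.377315), ('test_1', 0.098759)]
    out = save_predictions(rows, Path(tmp) / 'predictions' / 'final_model.csv')
    written = out.read_text().splitlines()
```

With pandas, the same effect is obtained by calling `Path('predictions').mkdir(parents=True, exist_ok=True)` before `y_test.to_csv(...)`.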

The predictions are now ready to be submitted to Kaggle for final scoring.