# Overview
In this notebook I grab the data for the example Kaggle competition and develop a working machine learning model, which the competition needs as its "benchmark". As always, one of the most difficult parts of this process is finding a good data set, but here I focus on the other components: getting a working Kaggle competition that can be hosted in a classroom.

In [23]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from os import path
import pandas as pd

# Get the IRIS Data
Since we're just showing how to get a Kaggle competition up and running, the classic "Iris" data set popularized by Ronald Fisher in the 1930s will suffice. To make it a workable data set for Kaggle, though, two conditions must hold:
1. We need to be able to upload the data to Kaggle's server (hence we have to save the data to disk)
2. There must be a training and a testing set.

Regarding the second point, there is one additional requirement: the target for the test set, referred to as $\mathbf{y}_{\text{Te}}$, must be kept separate from the data you upload to the server.

We're going to execute those steps and save the data to disk so we can start setting up the Kaggle competition.

In [2]:
X, y = load_iris(return_X_y=True)

In [4]:
# Now we have to create the train-test partition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=17
)
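
Before saving anything, it can help to check what the partition looks like. The snippet below is a minimal sketch that repeats the split above and inspects the sizes and per-class counts; the names `train_counts` and `test_counts` are my own and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Repeat the split from the cells above so this block runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=17
)

# 150 samples with test_size=0.25 gives 112 training and 38 test rows
print(X_train.shape, X_test.shape)

# Count how many samples of each of the 3 classes landed in each partition
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)
print(train_counts, test_counts)
```

If the class counts look too uneven for a classroom exercise, passing `stratify=y` to `train_test_split` is one way to keep the class proportions the same in both partitions.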

Finally, we have to save the data to disk (the files will be placed in the "data" folder in this repository). Many file formats are valid. For example, in my competition I used an .h5 file, a format that stores a large quantity of data compactly, but for now we'll just use .csv. In general the format doesn't matter, but I recommend choosing one that is easy to download and work with during a session: the .h5 experiment required a bit of extra work from me so that the students understood how to work with the data.

In [37]:
# Convert the data matrices into pandas DataFrames
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

# Save the data matrices to disk
fnames = ['X_train.csv', 'X_test.csv']
arrs = [X_train, X_test]
for arr, fname in zip(arrs, fnames):
    arr.to_csv(path.join('data', fname), index=False)

# Convert the training vector into a DataFrame
y_train = pd.DataFrame(y_train)
y_train.to_csv('data/y_train.csv', index=False)
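
Since the point of choosing .csv is that students can load it easily, a quick round-trip check is worthwhile. This is only a sketch: it writes a small stand-in matrix (`X_demo`, hypothetical values) to a temporary directory rather than touching the real "data" folder, then reads it back the way a student would.

```python
import tempfile
from os import path

import numpy as np
import pandas as pd

# Small stand-in for X_train: 5 samples, 4 features (hypothetical values)
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(5, 4))

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    fname = path.join(tmp_dir, 'X_demo.csv')
    X_demo.to_csv(fname, index=False)
    X_loaded = pd.read_csv(fname)

# The shape and the values should survive the round trip
same_shape = X_loaded.shape == X_demo.shape
same_values = np.allclose(X_loaded.to_numpy(), X_demo.to_numpy())
print(same_shape, same_values)

# The header row holds the default integer column labels, read back as strings
print(list(X_loaded.columns))
```

Note that a DataFrame built from a bare array has the column labels 0 through 3, so the saved header is not descriptive. Passing explicit column names when constructing the DataFrame would make the files friendlier, but that is a design choice rather than something Kaggle requires.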

Now we have saved all of the data to disk. Next we shift gears to getting the Kaggle competition up and running. We'll have to return to this notebook to build our benchmark for the competition, but that isn't until much later in the process.

# Making a Valid Solution and Submission File
Kaggle requires that we have a valid solution file and a valid submission file. For the "CategorizationAccuracy" metric, the expected format looks like this (for example):

|Id | Category|
| --| --------|
| 1 | 0       |
| 2 | 1       |

and so forth. We are going to convert `y_test` to meet this standard and also generate a valid submission file.


In [30]:
# Make a valid solution file
nsamples = len(y_test)
id_col = np.arange(nsamples)

solution_file = pd.DataFrame({"Id": id_col,
                              "Category": y_test})

# Make a valid submission file
nlabels = len(np.unique(y_test))
rng = np.random.RandomState(17)
rand_pred = rng.randint(low=0, high=nlabels, size=nsamples)

submission_file = pd.DataFrame({"Id": id_col,
                                "Category": rand_pred})

In [33]:
# Save the files to disk
solution_file.to_csv('data/solution_file.csv', index=False)
submission_file.to_csv('data/submission_file.csv', index=False)
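
To get a feel for how a submission compares against the solution file, the score can be approximated locally. The function below is a sketch under my own assumptions, not Kaggle's actual scoring code: I assume the metric is the fraction of Ids whose predicted category matches the solution, and `categorization_accuracy` is a hypothetical helper name. The two small DataFrames are made-up examples in the format shown above.

```python
import pandas as pd


def categorization_accuracy(solution, submission):
    """Fraction of Ids whose predicted Category matches the solution."""
    # Align rows on Id so the row order of the submission does not matter
    merged = solution.merge(
        submission,
        on="Id",
        suffixes=("_true", "_pred"),
        validate="one_to_one",
    )
    # Every Id in the solution must appear in the submission
    if len(merged) != len(solution):
        raise ValueError("submission is missing some Ids")
    return float((merged["Category_true"] == merged["Category_pred"]).mean())


# Made-up example files: 4 samples, 1 wrong prediction
solution_demo = pd.DataFrame({"Id": [0, 1, 2, 3], "Category": [0, 1, 2, 1]})
submission_demo = pd.DataFrame({"Id": [0, 1, 2, 3], "Category": [0, 2, 2, 1]})

score = categorization_accuracy(solution_demo, submission_demo)
print(score)  # 3 of 4 predictions match
```

Merging on `Id` rather than comparing the columns positionally means a shuffled submission still scores correctly, and a submission with missing or duplicated Ids fails loudly instead of being scored silently.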

# Running the Benchmark
For the benchmark, I am just going to use the randomly generated submission file. Of course, in a real competition you should probably provide a better benchmark, but this is fine for this example.
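
For reference, here is a sketch of what a slightly stronger benchmark could look like. It is not the benchmark used in this example; the choice of a k-nearest-neighbors classifier with `n_neighbors=5` is my own assumption, and the names `benchmark_file` and `benchmark_accuracy` are hypothetical. The block repeats the split from earlier so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as earlier in the notebook
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=17
)

# Fit a simple model on the training set only
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
knn_pred = model.predict(X_test)

# Put the predictions in the same Id/Category layout as the submission file
benchmark_file = pd.DataFrame({"Id": np.arange(len(knn_pred)),
                               "Category": knn_pred})

# Fraction of test labels predicted correctly
benchmark_accuracy = float((knn_pred == y_test).mean())
print(benchmark_accuracy)
```

Saving `benchmark_file` with `to_csv(..., index=False)` would give a file in the same format as the random submission above.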