# Train/Validate Overview

The orders table contains three types of orders, distinguished by the eval_set column:

| eval_set | Records | Purpose |
|:---------|:--------|:--------|
| prior | 3,214,874 | Prior orders that provide the history for the other two sets |
| train | 131,209 | The last order, used for training |
| test | 75,000 | Orders whose products are to be predicted and submitted on Kaggle.com |

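Thus, test is off limits: its answers are not available locally, so it cannot be used to check a model. But from train, I can carve out a validation set. Let's split on the last digit of order_id, so that train orders ending in 7, 8, or 9 (about 30% of them, if last digits are evenly spread) become validation orders.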

**If order_id % 10 >= 7 and eval_set = 'train', then 'validation'**
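
As a minimal sketch of that rule in plain Python, the split could look like the following. The helper name `assign_split` and the sample rows are hypothetical and only illustrate the rule; they are not taken from the database.

```python
def assign_split(order_id: int, eval_set: str) -> str:
    """Relabel train orders whose last digit is 7, 8, or 9 as validation."""
    if eval_set == "train" and order_id % 10 >= 7:
        return "validation"
    # prior and test orders, and the remaining train orders, keep their label
    return eval_set


# Hypothetical (order_id, eval_set) rows, for illustration only
sample_orders = [(17, "train"), (34, "train"), (137, "test"), (2539329, "prior")]
labels = {order_id: assign_split(order_id, eval_set)
          for order_id, eval_set in sample_orders}
print(labels)
```

Only rows already marked train can move to validation, so a test or prior order ending in 7, 8, or 9 keeps its original label.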

# Create dataframe with actual results

Pull the actual answers for the validation orders and make a dataframe that matches the sample submission. From the Kaggle evaluation page:

> For each order_id in the test set, you should predict a space-delimited list of product_ids for that order. If you wish to predict an empty order, you should submit an explicit 'None' value. You may combine 'None' with product_ids. The spelling of 'None' is case sensitive in the scoring metric. The file should have a header and look like the following:

> ```
> order_id,products
> 17,1 2
> 34,None
> 137,1 2 3
> etc.
> ```

It is easier to compare results when they are held in a dict object, with the order id as the key and the set of products as the value.
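
To illustrate the target shape, here is a hedged sketch that parses text in the submission format into such a dict. The function name `parse_submission` is hypothetical, and the sample text is just the rows quoted from the evaluation page. Product ids are kept as strings so that the literal 'None' can sit in the same set.

```python
import csv
import io


def parse_submission(text: str) -> dict:
    """Turn 'order_id,products' rows into {order_id: set of product id strings}."""
    reader = csv.DictReader(io.StringIO(text))
    # split() with no argument also absorbs any trailing whitespace
    return {int(row["order_id"]): set(row["products"].split()) for row in reader}


# The example rows from the Kaggle evaluation page
sample = "order_id,products\n17,1 2\n34,None\n137,1 2 3\n"
predicted = parse_submission(sample)
print(predicted[137])
```

With both the predictions and the actual results stored this way, comparing an order reduces to ordinary set operations such as intersection.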

In [1]:
import sqlite3
import pandas as pd

# Open the SQLite database holding the Instacart tables
conn = sqlite3.connect("instacart.db")

# Query sqlite3 for the validation orders and their products
actual_results_raw = pd.read_sql_query("SELECT A.order_id as order_id, "
                                   "  COALESCE(CAST(B.product_id as text), 'None') as product_id "
                                   "FROM orders A LEFT JOIN ( "
                                   "  SELECT order_id, product_id "
                                   "  FROM products_train "
                                   "  WHERE order_id % 10 >= 7 "
                                   "    AND reordered = 1 ) B "
                                   "  ON A.order_id = B.order_id "
                                   "WHERE A.eval_set = 'train' "
                                   "  AND A.order_id % 10 >= 7;" , conn)

conn.close()

In [2]:
actual_results = {}

# Convert to the evaluation format: actual_results[order_id] = {product1, product2, ...}
# Product ids are strings; an order with no reordered products holds {'None'}
for row in actual_results_raw.itertuples(index=False):
    actual_results.setdefault(row.order_id, set()).add(row.product_id)
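
With the answers in this shape, a set of predictions can be scored order by order. The sketch below assumes the competition's mean F1 metric; the helper names, the toy data, and the choice to treat a missing prediction as {'None'} are all assumptions made for illustration.

```python
def f1_for_order(actual: set, predicted: set) -> float:
    """F1 score between the actual and predicted product sets of one order."""
    hits = len(actual & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(actual_results: dict, predictions: dict) -> float:
    """Average the per-order F1 over every order in actual_results."""
    scores = [
        # Assumption: an order with no prediction counts as predicting 'None'
        f1_for_order(products, predictions.get(order_id, {"None"}))
        for order_id, products in actual_results.items()
    ]
    return sum(scores) / len(scores)


# Toy data, for illustration only
toy_actual = {17: {"1", "2"}, 34: {"None"}}
toy_predicted = {17: {"1", "3"}, 34: {"None"}}
score = mean_f1(toy_actual, toy_predicted)
print(score)
```

Because 'None' is stored as an ordinary set member, an empty order that is predicted correctly scores 1.0 without any special casing.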
