# User/Item Logistic Regression

How about a ton of features for a given user/item combo? I bet we can get better than 0.0176.

I'll use sklearn.linear_model.LogisticRegression - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

For starters, I'll use the options from Assignment 3: 

* `multi_class="multinomial"`: we want to build a softmax classifier (there are other ways of handling the multiclass setting for logistic regression)
* $C = 10^6$: for now we don't want to use regularization; $C$ is the inverse regularization constant, $C = \frac{1}{\lambda}$, so we should make $C$ big to turn off regularization
* `solver="sag"`: the optimization algorithm to use, Stochastic Average Gradient. Plain Stochastic Gradient Descent jitters massively because it approximates the gradient poorly (from only one example). To reduce this error one can simply average the gradient across the last few steps; that is exactly what `sag` does
* `n_iter=15`: the number of passes over the training data (aka epochs)
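
To make the `sag` bullet concrete, here is a minimal pure-Python sketch of the stochastic-average-gradient idea on a toy one-feature logistic regression. It is not scikit-learn's implementation: `sag_fit`, the learning rate, and the toy data are all assumptions made up for illustration. The key point is that each step moves along the average of the most recently stored per-example gradients rather than along a single example's gradient.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Mean logistic loss of the model p = sigmoid(w * x + b)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def sag_fit(xs, ys, lr=0.1, epochs=15, seed=0):
    """Toy SAG: keep the last gradient seen for every example and
    step along the average of those stored gradients."""
    rng = random.Random(seed)
    n = len(xs)
    w, b = 0.0, 0.0
    grad_w = [0.0] * n          # last stored gradient w.r.t. w, per example
    grad_b = [0.0] * n          # last stored gradient w.r.t. b, per example
    sum_w, sum_b = 0.0, 0.0     # running sums of the stored gradients
    for _ in range(epochs * n):
        i = rng.randrange(n)
        err = sigmoid(w * xs[i] + b) - ys[i]
        new_w, new_b = err * xs[i], err
        # Swap example i's old gradient for its fresh one in the sums.
        sum_w += new_w - grad_w[i]
        sum_b += new_b - grad_b[i]
        grad_w[i], grad_b[i] = new_w, new_b
        # Step along the averaged gradient.
        w -= lr * sum_w / n
        b -= lr * sum_b / n
    return w, b


# Hypothetical toy data: negative x -> class 0, positive x -> class 1.
toy_x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
toy_y = [0, 0, 0, 1, 1, 1]
w, b = sag_fit(toy_x, toy_y)
print("loss before:", log_loss(0.0, 0.0, toy_x, toy_y))
print("loss after: ", log_loss(w, b, toy_x, toy_y))
```

There is no penalty term in this sketch, which corresponds to the "make $C$ big" setting above.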

**Pseudocode**

time_of_day
day_of_week
total_previous_orders

-- Item specific features -- 
total_previous_buys_item
time_since_last_buy_item
last_buy_time_of_day_item
last_buy_day_of_week_item
average_duration_between_buys_user_item
average_duration_between_buys_population_item 
rebuy_percentage_item - Amongst all people purchasing this item before, what percentage rebuy it? 


-- Aisle specific features --
total_previous_buys_aisle
time_since_last_buy_aisle
last_buy_time_of_day_aisle
last_buy_day_of_week_aisle
average_duration_between_buys_user_aisle
average_duration_between_buys_population_aisle
rebuy_percentage_aisle

-- Department specific features -- 
total_previous_buys_dept
time_since_last_buy_dept
last_buy_time_of_day_dept
last_buy_day_of_week_dept
average_duration_between_buys_user_dept
average_duration_between_buys_population_dept
rebuy_percentage_dept

1. Start with order_num = 1, collect every item, grab the day of week, order time, etc., and tie each item to its aisle and department.






In [2]:
import pandas as pd
import numpy as np
import sqlite3 
from sklearn.linear_model import LogisticRegression

# User Rebuys

For each transaction in a user's purchase history, let's get what products were purchased and track them. As a result, we should get the following features:

* total_previous_buys
* time_since_last_buy
* last_buy_day_of_week
* last_buy_hour_of_day
* last_order_number
* history_size

Start with the first transaction (which every guest has) to initialize, then loop through each subsequent purchase and update. 

For fun, I'll keep a copy of the status as we pass through, since it seems like it will generate great training data for later. 

## Initialize Buys using first transaction

In [7]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS user_item_history_order;")

# Initialize first table using first orders. 
cur.execute("CREATE TABLE user_item_history_order AS "
    "SELECT A.user_id as user_id "
    ", B.product_id as product_id "
    ", 1 as total_previous_buys_item "
    ", 0 as time_since_last_buy_item "
    ", A.order_dow as last_buy_day_of_week_item "
    ", A.order_hour_of_day as last_buy_time_of_day_item "
    ", 1 as last_order_number "
    ", 1 as orders_in_history "
    "FROM orders A "
    " INNER JOIN products_prior B "
    "  ON A.order_id = B.order_id "
    "WHERE A.order_number = 1 "
    "AND A.eval_set = 'prior';")


conn.commit()
conn.close()

DatabaseError: database disk image is malformed

## Iterate through order_number for items

For each subsequent order k = 2 through 100, there are four cases:

1. **The user did not make a subsequent order** - there may be an order in train or test, but no prior orders remain for this user_id. Carve them out to save processing time
2. **The user did not repurchase a given item in this new order** - This means the kth order did not rebuy the item. Most of the stats do not change, but time since last order certainly increases 
3. **Previously purchased item was rebought** - Increase the order count and reset many of the columns
4. **The user purchased a new item not previously bought** - Do actions from the initialize step

After that, union together options 2 + 3 + 4 to finish
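
The per-order update for a single user can be sketched in plain Python before it is expressed as SQL. This is only an illustration of cases 2 through 4 under the column names used in this notebook; `update_history` and the sample product ids are hypothetical, and case 1 (no further order) simply means the function is never called again for that user.

```python
def update_history(history, order):
    """Apply one order to a single user's item history.

    history: dict mapping product_id -> feature dict
    order:   dict with order_number, order_dow, order_hour_of_day,
             days_since_prior_order and a list of products
    Returns a new history dict; the input is left untouched.
    """
    bought = set(order["products"])
    updated = {}

    for product_id, row in history.items():
        new_row = dict(row)
        new_row["orders_in_history"] = row["orders_in_history"] + 1
        if product_id in bought:
            # Case 3: rebought -> bump the count and reset the clock.
            new_row["total_previous_buys_item"] += 1
            new_row["time_since_last_buy_item"] = 0
            new_row["last_buy_day_of_week_item"] = order["order_dow"]
            new_row["last_buy_time_of_day_item"] = order["order_hour_of_day"]
            new_row["last_order_number"] = order["order_number"]
        else:
            # Case 2: not rebought -> only the elapsed time grows.
            new_row["time_since_last_buy_item"] += order["days_since_prior_order"]
        updated[product_id] = new_row

    for product_id in bought - set(history):
        # Case 4: first purchase of this item -> same as the initialize step.
        updated[product_id] = {
            "total_previous_buys_item": 1,
            "time_since_last_buy_item": 0,
            "last_buy_day_of_week_item": order["order_dow"],
            "last_buy_time_of_day_item": order["order_hour_of_day"],
            "last_order_number": order["order_number"],
            "orders_in_history": order["order_number"],
        }
    return updated


# Hypothetical example: two items after order 1, then a second order.
first = {
    "total_previous_buys_item": 1, "time_since_last_buy_item": 0,
    "last_buy_day_of_week_item": 2, "last_buy_time_of_day_item": 8,
    "last_order_number": 1, "orders_in_history": 1,
}
history = {196: dict(first), 14084: dict(first)}
order_2 = {"order_number": 2, "order_dow": 3, "order_hour_of_day": 7,
           "days_since_prior_order": 15, "products": [196, 10258]}
history = update_history(history, order_2)
```

The SQL version does the same thing set-wise: the two loops become joins against `products_for_this_order`, and the three resulting tables are unioned back into the history table.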


In [None]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()
    
    
# Create holding table
cur.execute("DROP TABLE IF EXISTS user_item_history_order_done")
cur.execute("CREATE TABLE user_item_history_order_done ("
            "user_id integer, "
            "product_id integer, "
            "total_previous_buys_item integer, "
            "time_since_last_buy_item integer, "
            "last_buy_day_of_week_item integer, "
            "last_buy_time_of_day_item integer, "
            "last_order_number integer, "
            "orders_in_history integer);")

conn.commit()
conn.close()

# For each order_num:
for order_num in range(2, 101):
    conn = sqlite3.connect("instacart.db")
    cur = conn.cursor()
    
    # Print progress
    print("Starting on order " + str(order_num))
    
    prev_order_num = order_num - 1
    
    ### 
    # 0.a) Grab the orders/products for this order_num
    ### 
    
    cur.execute("DROP TABLE IF EXISTS products_for_this_order;")
    cur.execute("CREATE TABLE products_for_this_order AS "
                "SELECT A.order_id as order_id "
                " , A.user_id as user_id "
                " , A.order_dow as order_dow "
                " , A.order_hour_of_day as order_hour_of_day "
                " , A.days_since_prior_order as days_since_prior_order "
                " , A.order_number as order_number "
                " , B.product_id "
                "FROM orders A "
                " INNER JOIN products_prior B "
                " ON A.order_id = B.order_id "
                "WHERE A.order_number = ? "
                " AND A.eval_set = 'prior';", (order_num,))    
        
    print("Grab products done")
       
    ###
    # 0.b) Grab the orders/products from last round
    ###
    cur.execute("DROP TABLE IF EXISTS history_for_this_order;")
    cur.execute("CREATE TABLE history_for_this_order AS "
                "SELECT * "
                "FROM user_item_history_order "
                "WHERE orders_in_history = ?;", (prev_order_num,))
                
    print("Grab history done")
                
    ###
    # 1) No purchase for this user
    ###
    
    ## Copy results from iteration table to final done table
    cur.execute("INSERT INTO user_item_history_order_done "
                "SELECT A.* "
                "FROM history_for_this_order A "
                " LEFT OUTER JOIN orders B "
                "  ON A.user_id = B.user_id "
                "  AND B.order_number = ? "
                "WHERE B.order_id IS NULL;", (order_num,))
 
    print("No more purchases done")
    
    ###
    # 2) Items not repurchased this order
    ###
    
    ## History (Z) left joined to products A, looking for product on A to be empty
    ##  -> This says the previously purchased product on table Z was not purchased this order 
    
    cur.execute("DROP TABLE IF EXISTS user_item_history_order_2a;")
    cur.execute("CREATE TABLE user_item_history_order_2a AS " 
                "SELECT Z.user_id as user_id "
                " ,Z.product_id as product_id "
                " ,Z.total_previous_buys_item as total_previous_buys_item "
                " ,Z.time_since_last_buy_item as time_since_last_buy_item "
                " ,Z.last_buy_day_of_week_item as last_buy_day_of_week_item "
                " ,Z.last_buy_time_of_day_item as last_buy_time_of_day_item "
                " ,Z.last_order_number as last_order_number "
                " ,Z.orders_in_history + 1 as orders_in_history "
                "FROM history_for_this_order Z "
                " LEFT OUTER JOIN products_for_this_order A "
                "  ON Z.user_id = A.user_id "
                "  AND Z.product_id = A.product_id "
                "WHERE A.product_id IS NULL;")
    
    ## None of the items in _2a were purchased, but an order happened. Join to
    # the order and get the days since last order. 
    
    cur.execute("DROP TABLE IF EXISTS user_item_history_order_2b;")
    cur.execute("CREATE TABLE user_item_history_order_2b AS " 
                "SELECT Z.user_id as user_id "
                " ,Z.product_id as product_id "
                " ,Z.total_previous_buys_item as total_previous_buys_item "
                " ,(Z.time_since_last_buy_item + A.days_since_prior_order) as time_since_last_buy_item "
                " ,Z.last_buy_day_of_week_item as last_buy_day_of_week_item "
                " ,Z.last_buy_time_of_day_item as last_buy_time_of_day_item "
                " ,Z.last_order_number as last_order_number "
                " ,Z.orders_in_history as orders_in_history "
                "FROM user_item_history_order_2a Z "
                " INNER JOIN orders A "
                "  ON Z.user_id = A.user_id "
                "  AND A.order_number = ?;", (order_num,))
    
    print("No repurchase done")

    ###
    # 3) Items that were repurchased this order
    ###
    
    ## History inner joined to products, only report on matches.

    
    cur.execute("DROP TABLE IF EXISTS user_item_history_order_3;")
    cur.execute("CREATE TABLE user_item_history_order_3 AS " 
                "SELECT Z.user_id as user_id "
                " ,Z.product_id as product_id "
                " ,Z.total_previous_buys_item + 1 as total_previous_buys_item "
                " ,0 as time_since_last_buy_item "
                " ,A.order_dow as last_buy_day_of_week_item "
                " ,A.order_hour_of_day as last_buy_time_of_day_item "
                " ,A.order_number as last_order_number "
                " ,Z.orders_in_history + 1 as orders_in_history "
                "FROM history_for_this_order Z "
                " INNER JOIN products_for_this_order A "
                "  ON Z.user_id = A.user_id "
                "  AND Z.product_id = A.product_id;")
    
    print("Repurchase Done")
    
    ###
    # 4) New items not previously purchased
    ###
    
    ## History is in right position now, looking for products in history to be null

    
    cur.execute("DROP TABLE IF EXISTS user_item_history_order_4;")
    cur.execute("CREATE TABLE user_item_history_order_4 AS " 
                "SELECT A.user_id as user_id "
                " ,A.product_id as product_id "
                " ,1 as total_previous_buys_item "
                " ,0 as time_since_last_buy_item "
                " ,A.order_dow as last_buy_day_of_week_item "
                " ,A.order_hour_of_day as last_buy_time_of_day_item "
                " ,A.order_number as last_order_number "
                " ,A.order_number as orders_in_history "
                "FROM products_for_this_order A "
                " LEFT OUTER JOIN history_for_this_order Z "
                "  ON A.user_id = Z.user_id "
                "  AND A.product_id = Z.product_id "
                "WHERE Z.product_id IS NULL;")
    
    
    print("New purchases done")
    
    ###
    # Union together to finish order_num
    ###
    
    cur.execute("INSERT INTO user_item_history_order "
                "SELECT * FROM user_item_history_order_2b;")
    cur.execute("INSERT INTO user_item_history_order "
                "SELECT * FROM user_item_history_order_3;")
    cur.execute("INSERT INTO user_item_history_order "
                "SELECT * FROM user_item_history_order_4;")
    
    print("Union Done")
    conn.commit()
    conn.close()
    


In [None]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

###
# One last insert for users with 100 orders:
###

## Copy results from iteration table to final done table
cur.execute("INSERT INTO user_item_history_order_done "
            "SELECT A.* "
            "FROM history_for_this_order A "
            " LEFT OUTER JOIN orders B "
            "  ON A.user_id = B.user_id "
            "  AND B.order_number = 101 "
            "WHERE B.order_id IS NULL;")

conn.commit()
conn.close()


We've now got a table of every product purchased by every user, how long it's been, etc. Now I want to know purchase cycle info. 

Let's talk this out. Say user_id = 1, product_id = 196. 

History looks like this: 

|user_id|product_id|total_previous_buys_item|time_since_last_buy_item|last_buy_day_of_week_item|last_buy_time_of_day_item|last_order_number|orders_in_history|
|---|---|---|---|---|---|---|---|
|1|196|1|0|2|8|1|1
|1|196|2|0|3|7|2|2
|1|196|3|0|3|12|3|3
|1|196|4|0|4|7|4|4
|1|196|5|0|4|15|5|5
|1|196|6|0|2|7|6|6
|1|196|7|0|1|9|7|7
|1|196|8|0|1|14|8|8
|1|196|9|0|1|16|9|9
|1|196|10|0|4|8|10|10
|1|196|10|14|4|8|10|11

With this, we can calculate purchase cycle and then aggregate for each user/item, then across all users for each item.
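
As a sanity check on what "purchase cycle" means for one user/item, here is a small sketch that works directly from an ordered list of orders. It assumes the cycle is the number of days between consecutive purchases of the item, summing `days_since_prior_order` across any orders where the item was skipped. The function name `rebuy_durations` and the sample orders are hypothetical, and this definition is an assumption that is not guaranteed to match the SQL in the next cell row for row.

```python
def rebuy_durations(orders, product_id):
    """Days between consecutive purchases of product_id.

    orders: list of dicts sorted by order number, each with
            'days_since_prior_order' (None for the first order)
            and 'products' (a collection of product ids).
    """
    durations = []
    days_since_last_buy = None  # stays None until the first purchase
    for order in orders:
        gap = order["days_since_prior_order"] or 0
        if days_since_last_buy is not None:
            days_since_last_buy += gap
        if product_id in order["products"]:
            if days_since_last_buy is not None:
                durations.append(days_since_last_buy)
            days_since_last_buy = 0
    return durations


# Hypothetical history: bought in orders 1, 2 and 4, skipped in order 3.
sample_orders = [
    {"days_since_prior_order": None, "products": {196}},
    {"days_since_prior_order": 7, "products": {196}},
    {"days_since_prior_order": 10, "products": set()},
    {"days_since_prior_order": 4, "products": {196}},
]
gaps = rebuy_durations(sample_orders, 196)
average_gap = sum(gaps) / len(gaps) if gaps else None
```

Averaging `gaps` per user/item, and then per item across users, gives the two aggregates described above.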



In [369]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

###
# Calculate rebuy for each item/user
###

# Filter order history on only purchases: 
cur.execute("CREATE TABLE user_item_purchases AS "
            "SELECT * "
            "FROM user_item_history_order "
            "WHERE last_order_number = orders_in_history;")

print("user_item_purchases Done")

# Calculate first order for each item to remove
cur.execute("CREATE TABLE user_item_first_purchase AS "
            "SELECT user_id, product_id, MIN(last_order_number) as first_order_number "
            "FROM user_item_purchases "
            "GROUP BY user_id, product_id;")

print("user_item_first_purchase Done")

# Remove first order to only reflect rebuys
cur.execute("CREATE TABLE user_item_rebuy AS "
            "SELECT A.* "
            "FROM user_item_purchases A "
            "  LEFT OUTER JOIN user_item_first_purchase B "
            "  ON A.user_id = B.user_id "
            "  AND A.product_id = B.product_id "
            "  AND A.last_order_number = B.first_order_number "
            "WHERE B.first_order_number is NULL;")

print("user_item_rebuy Done")

# Combine with order stats to get full time since last purchase
# *** You may be wondering why this wasn't calculated before!
# *** user_item_history_order was designed so that it takes in
# *** the next order as part of the time-since-last-buy, 
# *** since that is what the model will take in as input
cur.execute("CREATE TABLE user_item_rebuy_duration AS "
            "SELECT A.user_id, A.product_id, B.order_number, "
            " A.total_previous_buys_item, "
            " A.time_since_last_buy_item + B.days_since_prior_order "
            "  as rebuy_duration, "
            " B.order_dow, "
            " B.order_hour_of_day "
            "FROM user_item_rebuy A "
            " INNER JOIN orders B "
            " ON A.user_id = B.user_id "
            " AND A.last_order_number = B.order_number;")
            

    
print("user_item_rebuy_duration Done")

conn.commit()
conn.close()


user_item_purchases Done
user_item_first_purchase Done
user_item_rebuy Done
user_item_rebuy_duration Done


Now, with user_item_rebuy_duration, I can get a sense of average time to rebuy for each item/user. 

In [370]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

# Calculate the average rebuy per item, etc.
cur.execute("CREATE TABLE user_item_averages AS "
            "SELECT user_id, product_id, "
            " AVG(rebuy_duration) as average_rebuy_duration, "
            " AVG(order_dow) as average_rebuy_dow, "
            " AVG(order_hour_of_day) as average_rebuy_hod "
            "FROM user_item_rebuy_duration "
            "GROUP BY user_id, product_id;")
            

conn.commit()
conn.close()

# Train Logistic Model With Just User Features

Okay, that's lots of features for each user/product. Let's see how well it predicts! 

## Assemble feature space

In [None]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

# Get one row per user and product
cur.execute("DROP TABLE IF EXISTS features_1;")
cur.execute("CREATE TABLE features_1 AS "
            "SELECT A.*, "
            "COALESCE(B.average_rebuy_duration, time_since_last_buy_item) as average_rebuy_duration, "
            "COALESCE(B.average_rebuy_dow, last_buy_day_of_week_item) as average_rebuy_dow, "
            "COALESCE(B.average_rebuy_hod, last_buy_time_of_day_item) as average_rebuy_hod "
            "FROM user_item_history_order_done A "
            " LEFT OUTER JOIN user_item_averages B "
            " ON A.user_id = B.user_id "
            "  AND A.product_id = B.product_id;")


# Combine with features about current order
cur.execute("DROP TABLE IF EXISTS features_1_b;")
cur.execute("CREATE TABLE features_1_b AS "
            "SELECT A.user_id, "
            " A.product_id, "
            " A.total_previous_buys_item, "
            " A.time_since_last_buy_item + B.days_since_prior_order as time_since_last_buy_item, "
            " A.last_buy_day_of_week_item, "
            " A.last_buy_Time_of_day_item, "
            " A.last_order_number, "
            " A.orders_in_history, "
            " A.average_rebuy_duration, "
            " A.average_rebuy_dow, "
            " A.average_rebuy_hod, "
            " B.order_id, "
            " B.order_number, "
            " B.order_dow, "
            " B.order_hour_of_day, "
            " B.days_since_prior_order "
            "FROM features_1 A "
            " INNER JOIN orders B "
            " ON A.user_id = B.user_id "
            "WHERE B.eval_set = 'train';")

# Append training and validation results 
cur.execute("DROP TABLE IF EXISTS features_1_c;")
cur.execute("CREATE TABLE features_1_c AS "
            "SELECT A.*, "
            " CASE WHEN B.product_id is NULL THEN 0 ELSE 1 END AS reorder "
            "FROM features_1_b A "
            " LEFT OUTER JOIN products_train B "
            " ON A.order_id = B.order_id "
            "  AND A.product_id = B.product_id;")


conn.commit()
conn.close()

In [6]:
# It's a lot of data and it kills the kernel, so let's operate on 1/10th at a time

model_lr_sklearn = LogisticRegression(multi_class="ovr", C=1e6, solver="sag", max_iter=100, warm_start=True)

for i in range(1):
    conn = sqlite3.connect("instacart.db")
    cur = conn.cursor()
    cur.execute("SELECT * FROM features_1_c WHERE order_id % 10 = " + str(i) + ";")

    thefeatures = np.array(cur.fetchall())

    # Split feature data into features to train on and the evaluation
    x_train = thefeatures[:,[2,3,4,5,6,7,8,9,10,12,13,14,15]]
    y_train = thefeatures[:,[16]]
    y_train_flat = y_train.flatten()

    model_lr_sklearn.fit(x_train, y_train_flat)
    
    print(str(i) + "iteration, score: " + str(model_lr_sklearn.score(x_train, y_train_flat)))


0iteration, score: 0.904487822741




# Score on validation set



In [5]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

user_item_logistic = dict()

# validation set is orders ending with 7/8/9.
for i in [7, 8, 9]:
    print(i)
    # Get data for predictions
    conn = sqlite3.connect("instacart.db")
    cur = conn.cursor()
    cur.execute("SELECT * FROM features_1_c WHERE order_id % 10 = " + str(i) + ";")
    validation = cur.fetchall()
    validation_array = np.array(validation)
    conn.close()
    
    # Put into array: 
    x_pred = validation_array[:,[2,3,4,5,6,7,8,9,10,12,13,14,15]]
    predictions = model_lr_sklearn.predict(x_pred)
    
    # Get order and products
    x_lookup = validation_array[:, [1, 11]]
    
    x_assemble = np.column_stack((x_lookup, predictions))
    
    # Assert: x_assemble column 0 == product id; 1 == order id; 2 == reorder prediction
    
    # For each row (use a separate name so the outer loop's i is not shadowed):
    for row in range(x_assemble.shape[0]):
        order_id = int(x_assemble[row, 1])
        # if it is a new order number, create a set
        if order_id not in user_item_logistic:
            user_item_logistic[order_id] = set()
        if x_assemble[row, 2] == 1:
            user_item_logistic[order_id].add(str(int(x_assemble[row, 0])))
    
# Add "None" to orders with no items
for y in [x for x in user_item_logistic.keys() if len(user_item_logistic[x]) == 0]:
    user_item_logistic[y].add("None")
        

7
8
9


In [6]:
# Grab f1 score:
%run F1_Score.ipynb   

# grab actual results: 
%run Load_actual_results.ipynb

# Score: 
user_item_logistic_results = f1(user_item_logistic, actual_results)



True Positives:  16982
False Positives: 39880
False Negatives: 237140
Precision:       0.2986528788997925
Recall:          0.06682617010727131
----------------------------
F1: 0.10921462197412084
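
The `f1` helper is loaded from `F1_Score.ipynb` and is not reproduced here. The printed numbers are consistent with pooling true positives, false positives, and false negatives across all orders before computing precision and recall, and the later insert into `model_results` suggests it returns a tuple of (F1, TP, FP, FN). Below is a minimal sketch with that shape. The name `pooled_f1` is hypothetical, and treating an order that is missing from the predictions as a `"None"` prediction is an assumption.

```python
def pooled_f1(predicted, actual):
    """Pooled F1 over orders.

    predicted, actual: dicts mapping order_id -> set of product id strings
    (or {"None"} for an order with no reordered items).
    Returns (f1, true_positives, false_positives, false_negatives).
    """
    tp = fp = fn = 0
    for order_id, actual_items in actual.items():
        # Assumption: an order with no prediction counts as predicting "None".
        predicted_items = predicted.get(order_id, {"None"})
        tp += len(predicted_items & actual_items)
        fp += len(predicted_items - actual_items)
        fn += len(actual_items - predicted_items)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0, tp, fp, fn
    return 2 * precision * recall / (precision + recall), tp, fp, fn


# Hypothetical two-order example.
example_predicted = {1: {"196", "100"}, 2: {"None"}}
example_actual = {1: {"196", "300"}, 2: {"None"}}
example_result = pooled_f1(example_predicted, example_actual)
```

With pooled counts, F1 reduces to 2·TP / (2·TP + FP + FN), which is a quick way to re-check any of the printed blocks by hand.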


In [96]:
import pandas as pd

# Write to results table
con = sqlite3.connect("instacart.db")
cur = con.cursor()

# Insert into model results
cur.execute("INSERT INTO model_results (Model, F1, True_Positives, "
            "False_Positives, False_Negatives) VALUES (?, ?, ?, ?, ?);",
            list(("User_Item_Logistic",) + user_item_logistic_results ) )

# Print contents of model_results
print(pd.read_sql_query("SELECT * FROM model_Results;", con))

con.commit()
con.close()

                Model        F1  True_Positives  False_Positives  \
0         Dummy Model  0.017609            2586            37002   
1       Naive_Reorder  0.004044             887           183617   
2  User_Item_Logistic  0.111259           17314            39802   

   False_Negatives  
0           251536  
1           253235  
2           236808  


# Retry prediction with lower probability threshold 

`predict` simply guesses 0 or 1 using a 0.5 probability cutoff, but given the number of false negatives, perhaps I should be more liberal and use `predict_proba` with a lower cutoff.

* user_item_logistic_lower: rebuy probability cutoff at > .25
* user_item_logistic_even_lower: rebuy probability cutoff at > .15
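
The effect of moving the cutoff can be seen on a handful of made-up probabilities. This sketch scores each candidate threshold with a row-level F1 and picks the best one. `f1_at_threshold`, `best_threshold`, and the sample numbers are hypothetical, and the scoring here is per row rather than the per-order sets used in the cells below.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 when every row with probability > threshold is predicted as a rebuy."""
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels, thresholds):
    """Return the candidate threshold with the highest F1."""
    return max(thresholds, key=lambda t: f1_at_threshold(probs, labels, t))


# Made-up rebuy probabilities and true labels, for illustration only.
sample_probs = [0.05, 0.12, 0.18, 0.22, 0.30, 0.45, 0.60, 0.08]
sample_labels = [0, 0, 1, 1, 0, 1, 1, 0]
candidates = [0.1, 0.15, 0.2, 0.5]

for t in candidates:
    print(t, round(f1_at_threshold(sample_probs, sample_labels, t), 3))
print("best:", best_threshold(sample_probs, sample_labels, candidates))
```

On this toy data the default 0.5 cutoff misses most true rebuys, which is the same pattern as the high false-negative count above: lowering the cutoff trades precision for recall, and F1 peaks somewhere in between.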

In [16]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

x_assemble = dict()

# validation set is orders ending with 7/8/9.
for i in [7, 8, 9]:
    print(i)
    # Get data for predictions
    conn = sqlite3.connect("instacart.db")
    cur = conn.cursor()
    cur.execute("SELECT * FROM features_1_c WHERE order_id % 10 = " + str(i) + ";")
    validation = cur.fetchall()
    validation_array = np.array(validation)
    conn.close()

    # Put into array: 
    x_pred = validation_array[:,[2,3,4,5,6,7,8,9,10,12,13,14,15]]
    predictions = model_lr_sklearn.predict_proba(x_pred)

    # Get order and products
    x_lookup = validation_array[:, [1, 11]]

    x_assemble[i] = np.column_stack((x_lookup, predictions))

# Assert: each x_assemble[i] has column 0 == product id; 1 == order id; 2 == P(no reorder); 3 == P(reorder)

x_assemble_all = np.vstack((x_assemble[7], x_assemble[8], x_assemble[9]))

for j in [.1, .11, .12, .13, .14, .15, .16, .17, .18, .19, .2]:
    user_item_logistic_lower = dict()
    
    # For each row: 
    for i in range(x_assemble_all.shape[0]):
       # if it is a new order number, create a set
        if int(x_assemble_all[i,1]) not in user_item_logistic_lower:
            user_item_logistic_lower[int(x_assemble_all[i,1])] = set()
        if x_assemble_all[i,3] > j:
            user_item_logistic_lower[int(x_assemble_all[i,1])].add(str(int(x_assemble_all[i,0])))

    # Add "None" to orders with no items
    for y in [x for x in user_item_logistic_lower.keys() if len(user_item_logistic_lower[x]) == 0]:
        user_item_logistic_lower[y].add("None")

    print("===================")
    print("= Cuttoff at " + str(j))
    # Score: 
    user_item_logistic_results = f1(user_item_logistic_lower, actual_results)

7
8
9
= Cuttoff at 0.1
True Positives:  203585
False Positives: 838505
False Negatives: 50537
Precision:       0.1953622047999693
Recall:          0.8011309528494188
----------------------------
F1: 0.3141229983984102
= Cuttoff at 0.11
True Positives:  197179
False Positives: 761963
False Negatives: 56943
Precision:       0.205578527475598
Recall:          0.7759225883630697
----------------------------
F1: 0.32503890332194807
= Cuttoff at 0.12
True Positives:  190169
False Positives: 687948
False Negatives: 63953
Precision:       0.2165645352498585
Recall:          0.7483374127387633
----------------------------
F1: 0.33591671016455005
= Cuttoff at 0.13
True Positives:  182207
False Positives: 610907
False Negatives: 71915
Precision:       0.2297362043791939
Recall:          0.7170060049897293
----------------------------
F1: 0.3479769603031217
= Cuttoff at 0.14
True Positives:  173447
False Positives: 533977
False Negatives: 80675
Precision:       0.24518110779391142
Recall:         

In [18]:
for j in [.191, .192, .193, .194, .195, .196, .197, .198, .199, .2, .201, .202, .203, .204, .205, .206, .207]:
    user_item_logistic_lower = dict()
    
    # For each row: 
    for i in range(x_assemble_all.shape[0]):
       # if it is a new order number, create a set
        if int(x_assemble_all[i,1]) not in user_item_logistic_lower:
            user_item_logistic_lower[int(x_assemble_all[i,1])] = set()
        if x_assemble_all[i,3] > j:
            user_item_logistic_lower[int(x_assemble_all[i,1])].add(str(int(x_assemble_all[i,0])))

    # Add "None" to orders with no items
    for y in [x for x in user_item_logistic_lower.keys() if len(user_item_logistic_lower[x]) == 0]:
        user_item_logistic_lower[y].add("None")

    print("===================")
    print("= Cuttoff at " + str(j))
    # Score: 
    user_item_logistic_results = f1(user_item_logistic_lower, actual_results)

= Cuttoff at 0.191
True Positives:  122220
False Positives: 234369
False Negatives: 131902
Precision:       0.34274753287398096
Recall:          0.48095009483633844
----------------------------
F1: 0.4002547849965041
= Cuttoff at 0.192
True Positives:  120895
False Positives: 228392
False Negatives: 133227
Precision:       0.34611938033765927
Recall:          0.47573606378038896
----------------------------
F1: 0.4007066517072168
= Cuttoff at 0.193
True Positives:  119481
False Positives: 222245
False Negatives: 134641
Precision:       0.34963976987410966
Recall:          0.47017180724219076
----------------------------
F1: 0.4010452330124462
= Cuttoff at 0.194
True Positives:  118087
False Positives: 215933
False Negatives: 136035
Precision:       0.35353272259146157
Recall:          0.4646862530595541
----------------------------
F1: 0.40155948733469127
= Cuttoff at 0.195
True Positives:  116714
False Positives: 210029
False Negatives: 137408
Precision:       0.35720428593726566
Reca

In [10]:
# Grab f1 score:
%run F1_Score.ipynb   

# grab actual results: 
%run Load_actual_results.ipynb

# Score: 
user_item_logistic_results = f1(user_item_logistic_lower, actual_results)



True Positives:  203585
False Positives: 838505
False Negatives: 50537
Precision:       0.1953622047999693
Recall:          0.8011309528494188
----------------------------
F1: 0.3141229983984102


In [29]:
import pandas as pd

# Write to results table
con = sqlite3.connect("instacart.db")
cur = con.cursor()

# Insert into model results
cur.execute("INSERT INTO model_results (Model, F1, True_Positives, "
            "False_Positives, False_Negatives) VALUES (?, ?, ?, ?, ?);",
            list(("User_Item_Logistic_Even_Lower",) + user_item_logistic_results ) )

# Print contents of model_results
print(pd.read_sql_query("SELECT * FROM model_Results;", con))

con.commit()
con.close()

                           Model        F1  True_Positives  False_Positives  \
0                    Dummy Model  0.017609            2586            37002   
1                  Naive_Reorder  0.004044             887           183617   
2             User_Item_Logistic  0.111259           17314            39802   
3       User_Item_Logistic_Lower  0.332571           66918            81388   
4  User_Item_Logistic_Even_Lower  0.370396          165545           474213   

   False_Negatives  
0           251536  
1           253235  
2           236808  
3           187204  
4            88577  


# Score on Test Data

Even Lower Logistic seems to be pretty darn good, so let's put in a submission.

## Create features for test submissions

In [31]:
conn = sqlite3.connect("instacart.db")
cur = conn.cursor()

# Combine with features about current order
cur.execute("DROP TABLE IF EXISTS features_1_test;")
cur.execute("CREATE TABLE features_1_test AS "
            "SELECT A.user_id, "
            " A.product_id, "
            " A.total_previous_buys_item, "
            " A.time_since_last_buy_item + B.days_since_prior_order as time_since_last_buy_item, "
            " A.last_buy_day_of_week_item, "
            " A.last_buy_Time_of_day_item, "
            " A.last_order_number, "
            " A.orders_in_history, "
            " A.average_rebuy_duration, "
            " A.average_rebuy_dow, "
            " A.average_rebuy_hod, "
            " B.order_id, "
            " B.order_number, "
            " B.order_dow, "
            " B.order_hour_of_day, "
            " B.days_since_prior_order "
            "FROM features_1 A "
            " INNER JOIN orders B "
            " ON A.user_id = B.user_id "
            "WHERE B.eval_set = 'test';")

conn.commit()
conn.close()

## Create Scores

In [19]:
# Create dictionary to house predictions
user_item_logistic_lower_test = dict()

# Total is 4.8M rows, let's break up into 10ths again. 
for i in range(10):
    # Get data for predictions
    conn = sqlite3.connect("instacart.db")
    cur = conn.cursor()
    cur.execute("SELECT * FROM features_1_test WHERE order_id % 10 =" + str(i) + " ;")
    test = cur.fetchall()
    test_array = np.array(test)
    conn.close()

    # Put into array: 
    test_pred = test_array[:,[2,3,4,5,6,7,8,9,10,12,13,14,15]]
    test_predictions = model_lr_sklearn.predict_proba(test_pred)

    # Get order and products
    test_lookup = test_array[:, [1, 11]]

    test_assemble = np.column_stack((test_lookup, test_predictions))

    # Assert: test_assemble column 0 == product id; 1 == order id; 2 == P(no reorder); 3 == P(reorder)

    # For each row (use a separate name so the outer loop's i is not shadowed):
    for row in range(test_assemble.shape[0]):
        order_id = int(test_assemble[row, 1])
        # if it is a new order number, create a set
        if order_id not in user_item_logistic_lower_test:
            user_item_logistic_lower_test[order_id] = set()
        if test_assemble[row, 3] > 0.2:
            user_item_logistic_lower_test[order_id].add(str(int(test_assemble[row, 0])))
    
# Add "None" to orders with no items
for y in [x for x in user_item_logistic_lower_test.keys() if len(user_item_logistic_lower_test[x]) == 0]:
    user_item_logistic_lower_test[y].add("None")

In [20]:
len(user_item_logistic_lower_test.keys())

75000

In [21]:
test_array.shape

(474973, 16)

In [22]:
# Write estimate: one line per order, products separated by spaces
with open('copeland1.csv', 'w') as f:
    f.write('order_id,products\n')
    for order in user_item_logistic_lower_test:
        products = " ".join(user_item_logistic_lower_test[order])
        f.write(str(order) + "," + products + '\n')
    
