# Creating Synth Data
In this script we create a synthetic data set. It was created for unit testing, but it also serves as an example of how to format your own data set if you want to use this code.

The basic structure of the data is a table with the following columns:

1. user id
2. item id
3. Features (as many columns as you want)
4. Target rate
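
As a rough illustration of this layout, here is a minimal sketch with one row per (user, item) pair. The values and the names `example_rows`, `user_ids`, `item_ids`, `features`, and `target` are hypothetical and are not used elsewhere in this script.

```python
import numpy as np

# Hypothetical rows for a single time window.
# Columns: user id, item id, feature_1, target rate.
example_rows = np.array([
    [0, 0, 0.00, 0.0],
    [0, 1, 3.95, 52.0],
    [1, 0, 0.00, 0.0],
    [1, 1, 0.00, 53.0],
])

user_ids = example_rows[:, 0]
item_ids = example_rows[:, 1]
features = example_rows[:, 2:-1]  # every column between the ids and the target
target = example_rows[:, -1]
```

Slicing the features as `[:, 2:-1]` keeps the sketch valid if more feature columns are added between the ids and the target.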

In this data set we are going to have 2 users, 10 items, and 6 time windows (5 for training and one for test). We will use a single feature: log(rate + 1) of the consumption rate in the previous time window.
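
To make the feature concrete, here is a small sketch of how log(rate + 1) behaves. The rates below are made up for illustration, and `np.log1p` is used as an equivalent of `np.log(rate + 1)`.

```python
import numpy as np

# Hypothetical rates from a previous time window (0 means "not consumed").
prev_rates = np.array([0.0, 50.0, 54.0])

# log(rate + 1): an item that was not consumed gets a feature value of exactly 0.
feature = np.log1p(prev_rates)
```

The `+ 1` is what keeps the feature defined (and equal to zero) for items with a zero rate.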

The first user consumes the same 3 items (out of the 10) in every time window at a roughly similar rate (~50). For the second user we choose 3 items at random in each time window.

This way we can validate a few things in our model:

1. For user 1, in the rate process, the coefficients for the user budget and for the feature should both be relatively high and roughly equal. This is because the exposure process cancels out the rest of the items, so they won't pull the user_budget down.
2. In the exposure process for user 1, the user_budget coefficient should be very low and the feature coefficient should be very high.
3. For user 2 the choice of items is essentially random, so in the exposure process we expect the user_budget coefficient to be higher than the one for user 1 and the feature coefficient to be lower.

### This is not supposed to be super efficient code. It is just supposed to be readable and to help you produce the synthetic data.

In [1]:
from __future__ import division
import numpy as np

In [2]:
N, M, T = 2, 10, 5

user_1_items = np.array([1, 3, 5], dtype=np.intc)

The first timestamp is special because we need a fake previously consumed rate. The columns of the data are:

1. user id
2. item id
3. feature_1
4. target rate

In [4]:
prev_t_1 = np.zeros([10, 4])
prev_t_1[:, 1] = np.arange(10)
prev_t_1[user_1_items, 2] = np.log(50 + np.random.randint(0, 5, 3) + 1)  # Previous "fake" rate
prev_t_1[user_1_items, 3] = 50 + np.random.randint(0, 5, 3)

In [5]:
prev_t_2 = np.zeros([10, 4])
prev_t_2[:, 0] = 1
prev_t_2[:, 1] = np.arange(10)

# Randomly choose 3 distinct items for the "fake" previous window
user_2_items = np.random.choice(10, 3, replace=False)
prev_t_2[user_2_items, 2] = np.log(50 + np.random.randint(0, 5, 3) + 1)  # Previous "fake" rate

# Randomly choose 3 distinct items that user 2 consumes in the first window
user_2_items = np.random.choice(10, 3, replace=False)
prev_t_2[user_2_items, 3] = 50 + np.random.randint(0, 5, 3)

Creating the data for the rest of the time windows.

In [6]:
def data_for_user(prev_t, next_items):
    """Creates synthetic data for the next timestamp for a user.

    Uses the previous rates as the past-rate feature and randomly assigns rates to the next items.
    """
    t_user = np.zeros([10, 4])
    t_user[:, :2] = prev_t[:, :2]  # Nothing changed here
    
    t_user[:, 2] = np.log(prev_t[:, -1] + 1) # Previous real rate
    
    t_user[next_items, 3] = 50 + np.random.randint(0, 5, 3)
    
    return t_user

In [7]:
train_data = np.vstack([prev_t_1, prev_t_2])

In [8]:
for _ in range(T - 1):
    # The first training window is already in train_data, so add T - 1 = 4 more
    prev_t_1 = data_for_user(prev_t_1, user_1_items)
    
    user_2_items = np.random.choice(10, 3, replace=False)
    prev_t_2 = data_for_user(prev_t_2, user_2_items)
    
    train_data = np.vstack([train_data, prev_t_1, prev_t_2])

In [10]:
# Now the test data
prev_t_1 = data_for_user(prev_t_1, user_1_items)
user_2_items = np.random.choice(10, 3, replace=False)
prev_t_2 = data_for_user(prev_t_2, user_2_items)

In [11]:
test_data = np.vstack([prev_t_1, prev_t_2])
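
One property worth checking before saving: in every window after the first, feature_1 should equal log(rate + 1) of the same user's target rate in the previous window. Below is a minimal sketch of such a check. The helper `feature_matches_previous_rate` and the two small arrays are hypothetical names introduced only for illustration.

```python
import numpy as np


def feature_matches_previous_rate(prev_window, next_window):
    """Return True if next_window's feature column is log(previous rate + 1)."""
    expected = np.log1p(prev_window[:, 3])
    return bool(np.allclose(next_window[:, 2], expected))


# Two hypothetical consecutive windows for one user and three items.
window_a = np.array([[0, 0, 0.0, 52.0],
                     [0, 1, 0.0, 0.0],
                     [0, 2, 0.0, 50.0]])
window_b = window_a.copy()
window_b[:, 2] = np.log1p(window_a[:, 3])  # feature built from window_a's rates
window_b[:, 3] = [51.0, 0.0, 53.0]         # new rates for the next window

consistent = feature_matches_previous_rate(window_a, window_b)
inconsistent = feature_matches_previous_rate(window_b, window_a)
```

Because each window is stacked as user 1 followed by user 2, the same helper could be applied to, for example, rows 0-9 and rows 20-29 of `train_data` to compare the first two windows of the first user.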

In [14]:
np.savetxt('./data/train', train_data, fmt='%.5f', delimiter='\t')
np.savetxt('./data/test', test_data, fmt='%.5f', delimiter='\t')
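
The saved files are plain tab-separated text, so they can be read back with `np.loadtxt`. The sketch below round-trips a small hypothetical array through a temporary directory instead of `./data`, so it runs on its own. The names `demo`, `loaded`, and `ids` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical rows in the same four-column layout as the real data.
demo = np.array([[0, 1, 3.93183, 52.0],
                 [1, 4, 0.0, 0.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train")
    np.savetxt(path, demo, fmt="%.5f", delimiter="\t")
    loaded = np.loadtxt(path, delimiter="\t")

# Ids come back as floats; cast them if integer ids are needed.
ids = loaded[:, :2].astype(int)
```

Note that `np.savetxt` does not create missing directories, so `./data` has to exist before the cell above is run.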