Building a recommender system for the Kaggle competition https://www.kaggle.com/c/sbermarket-internship-competition/overview

The training data is the order history of 20,000 users up to a cutoff date; the cutoff separates the training and test data in time.

train.csv:
- user_id - unique user id
- order_completed_at - order date
- cart - list of unique categories (category_id) that the order consisted of

The task: for each user-category pair in the sample submission, predict 1 if the category will be present in the user's next order and 0 otherwise. The category list for each user in the sample submission is every category they have ever ordered.

# Primary analysis
The first idea that comes to mind is to score each category simply by the share of the customer's orders that contain it. This gives an initial estimate of the probability that the customer buys the category, regardless of when the orders were made. Before calculating this share, we do a little exploratory analysis.

In [9]:
import pandas as pd
import numpy as np

In [10]:
df = pd.read_csv('train.csv.zip', compression='zip')
sample = pd.read_csv('sample_submission.csv.zip', compression='zip')

In [11]:
df['order_completed_at'] = pd.to_datetime(df.order_completed_at)

In [12]:
df.groupby(['user_id', 'order_completed_at']).count()

# orders and the number of categories in them for each id

user_id,order_completed_at,cart
0,2020-07-19 09:59:17,8
0,2020-08-24 08:55:32,25
0,2020-09-02 07:38:25,11
1,2019-05-08 16:09:41,1
1,2020-01-17 14:44:23,6
...,...,...
19998,2020-09-01 08:12:32,7
19998,2020-09-02 15:03:23,4
19999,2020-08-31 18:54:24,1
19999,2020-08-31 19:32:08,1


In [13]:
sorted(df.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart'].unique())

# List of the number of orders in the selection

[3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 78,
 79,
 80,
 81,
 83,
 85,
 86,
 87,
 88,
 90,
 91,
 92,
 93,
 94,
 95,
 97,
 98,
 99,
 101,
 104,
 107,
 108,
 111,
 113,
 114,
 115,
 116,
 119,
 120,
 124,
 125,
 126,
 127,
 133,
 134,
 137,
 145,
 154,
 155,
 163,
 165,
 187,
 213]

We restrict further analysis to the users who appear in the sample submission. At this stage each client's rating is computed from that client's own order history only, so ids that are not in the sample are not needed.

In [14]:
id_ind = list(set(sample['id'].apply(lambda x: int(x.split(';')[0])).values))

# id that is in the sample

In [15]:
cat_ind = list(set(sample['id'].apply(lambda x: int(x.split(';')[1])).values))

# indices of categories that are in the sample

In [16]:
df_ind = df.query('user_id in @id_ind')

# the indexes of observations we need

# Simple share
Now let's calculate the share of orders in which the category was present

In [17]:
order_num = df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count().values.reshape(-1)
order_num

# number of orders for id

array([3, 9, 7, ..., 3, 3, 3], dtype=int64)

In [18]:
cat_num = df_ind.groupby(['user_id', 'cart']).count()
cat_num

# how many orders had the category

user_id,cart,order_completed_at
0,5,1
0,10,1
0,14,2
0,20,1
0,22,1
...,...,...
19998,398,2
19998,409,1
19998,415,2
19998,420,2


In [19]:
# Repeat each user's order count once per category that user has ordered
cats_per_user = cat_num.groupby('user_id').count()['order_completed_at'].values
div = np.repeat(order_num, cats_per_user)
share = cat_num['order_completed_at'].values / div
share_df = df_ind.groupby(['user_id', 'cart']).count().reset_index().drop('order_completed_at', axis=1)
share_df['share'] = share

# Calculate the share of orders
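
The same share can be sketched on a toy table with `groupby` and `nunique`. This is only an illustration under assumed data: the `toy_orders` frame and every name in it are made up for the example and are not part of the competition files.

```python
import pandas as pd

# Hypothetical toy history in the same long format as train.csv:
# one row per (user, order, category).
toy_orders = pd.DataFrame({
    'user_id': [0, 0, 0, 0, 1, 1],
    'order_completed_at': pd.to_datetime([
        '2020-01-01', '2020-01-01', '2020-02-01', '2020-03-01',
        '2020-01-05', '2020-02-05',
    ]),
    'cart': [5, 14, 14, 20, 7, 7],
})

# Number of distinct orders per user.
orders_per_user = toy_orders.groupby('user_id')['order_completed_at'].nunique()

# Number of distinct orders that contain each category, per user.
toy_share = (
    toy_orders.groupby(['user_id', 'cart'])['order_completed_at']
    .nunique()
    .rename('n_orders_with_cat')
    .reset_index()
)

# Share = orders with the category / all orders of that user.
toy_share['share'] = (
    toy_share['n_orders_with_cat'] / toy_share['user_id'].map(orders_per_user)
)
print(toy_share[['user_id', 'cart', 'share']])
```

User 0 has three orders and category 14 is in two of them, so its share is 2/3; user 1 bought category 7 in both orders, so its share is 1.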

In [20]:
share_df

,user_id,cart,share
0,0,5,0.333333
1,0,10,0.333333
2,0,14,0.666667
3,0,20,0.333333
4,0,22,0.333333
...,...,...,...
790444,19998,398,0.666667
790445,19998,409,0.333333
790446,19998,415,0.666667
790447,19998,420,0.666667


In [21]:
sample['merge'] = list(zip(sample['id'].apply(lambda x: int(x.split(';')[0])).values, sample['id'].apply(lambda x: int(x.split(';')[1])).values))
share_df['merge'] = list(zip(share_df['user_id'].values, share_df['cart'].values))

# Create columns to join tables

In [22]:
result = sample.merge(share_df, on='merge')[['id', 'share']]

In [23]:
result

,id,share
0,0;133,0.333333
1,0;5,0.333333
2,0;10,0.333333
3,0;396,0.333333
4,0;14,0.666667
...,...,...
790444,19998;26,0.333333
790445,19998;31,0.333333
790446,19998;29,0.333333
790447,19998;798,0.333333


In [24]:
result['target'] = 0
result.loc[result[result['share'] >= 0.5].index , 'target'] = 1

# Predict a purchase for all categories whose share is at least 0.5

In [25]:
result[['id', 'target']].to_csv('submission1.csv', index=False)

# Time structure analysis
The next thought is to analyze how the probability of a purchase depends on later versus earlier orders. It is logical to assume that recent orders influence a new order more than older ones, since a person's preferences and favorite categories can change over time. To test this hypothesis, we calculate the average similarity between the first and last orders and between the penultimate and last orders, and then compare the two. Similarity between two orders is defined as the number of categories they share divided by the number of categories that appear in either order (the Jaccard index).
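
A minimal sketch of that similarity measure on two made-up orders. The helper name `order_similarity` and the category ids are hypothetical, and returning 0.0 for two empty orders is an assumption of this example.

```python
def order_similarity(order_a, order_b):
    """Jaccard index: shared categories / categories present in either order."""
    a, b = set(order_a), set(order_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty orders are treated as dissimilar
    return len(a & b) / len(union)


first_order = [5, 10, 14, 22]
last_order = [14, 20, 22, 30, 41]

# Two shared categories (14, 22) out of seven distinct ones.
similarity = order_similarity(first_order, last_order)
print(round(similarity, 3))
```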

In [26]:
orders = df_ind.groupby(['user_id', 'order_completed_at']).count()
orders

user_id,order_completed_at,cart
0,2020-07-19 09:59:17,8
0,2020-08-24 08:55:32,25
0,2020-09-02 07:38:25,11
1,2019-05-08 16:09:41,1
1,2020-01-17 14:44:23,6
...,...,...
19997,2020-08-31 11:04:05,1
19997,2020-08-31 11:48:23,17
19998,2020-08-30 12:15:55,8
19998,2020-09-01 08:12:32,7


In [27]:
sim_first = np.array([])
random_ids = np.random.choice(id_ind, size=100, replace=False)

for uid in random_ids:
    first = orders.loc[uid].index[0]
    last = orders.loc[uid].index[-1]
    user_rows = df_ind[df_ind['user_id'] == uid]
    first_set = set(user_rows[user_rows['order_completed_at'] == first]['cart'])
    last_set = set(user_rows[user_rows['order_completed_at'] == last]['cart'])
    # shared categories divided by all categories present in either order
    sim_first = np.append(sim_first, len(first_set & last_set) / len(first_set | last_set))
    
# Similarity between first and last order for 100 random id

In [28]:
sim_second = np.array([])

for uid in random_ids:
    penult = orders.loc[uid].index[-2]
    last = orders.loc[uid].index[-1]
    user_rows = df_ind[df_ind['user_id'] == uid]
    penult_set = set(user_rows[user_rows['order_completed_at'] == penult]['cart'])
    last_set = set(user_rows[user_rows['order_completed_at'] == last]['cart'])
    # shared categories divided by all categories present in either order
    sim_second = np.append(sim_second, len(penult_set & last_set) / len(penult_set | last_set))
    
# Similarity between penultimate and last orders for 100 random id

In [29]:
print('Similarity between orders:\nFirst and last = {:.3f}\nBefore last and last = {:.3f}'.format(np.mean(sim_first), np.mean(sim_second)))

Similarity between orders:
First and last = 0.181
Before last and last = 0.232


# Modify share
Indeed, the similarity between the last and penultimate orders (0.232) is higher than between the last and first (0.181). This is worth taking into account when calculating the share of a category in orders: the latest orders should contribute more than earlier ones.
To start, we count in how many of the last three orders each category appears and predict 1 for categories that were bought in 2 or 3 of them.
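
The rule can be sketched on a toy history as follows. This is an illustration only: the frame `toy_hist` and the column `from_end` are hypothetical names, and the approach (numbering each user's orders from the end with `cumcount`) is one possible way to do it.

```python
import pandas as pd

# Hypothetical single-user history: Jan {5}, Feb {5, 14}, Mar {14, 20}, Apr {14, 20, 22}.
toy_hist = pd.DataFrame({
    'user_id': [0] * 8,
    'order_completed_at': pd.to_datetime([
        '2020-01-01',
        '2020-02-01', '2020-02-01',
        '2020-03-01', '2020-03-01',
        '2020-04-01', '2020-04-01', '2020-04-01',
    ]),
    'cart': [5, 5, 14, 14, 20, 14, 20, 22],
})

# One row per order, numbered from the end: 0 = last, 1 = penultimate, ...
toy_order_list = (
    toy_hist[['user_id', 'order_completed_at']]
    .drop_duplicates()
    .sort_values(['user_id', 'order_completed_at'])
)
toy_order_list['from_end'] = toy_order_list.groupby('user_id').cumcount(ascending=False)

# Keep only the rows that belong to the last three orders of each user.
recent = toy_hist.merge(
    toy_order_list[toy_order_list['from_end'] < 3],
    on=['user_id', 'order_completed_at'],
)

# Count in how many of those orders each category appears (0 if in none).
all_pairs = pd.MultiIndex.from_frame(toy_hist[['user_id', 'cart']].drop_duplicates())
hits = recent.groupby(['user_id', 'cart']).size().reindex(all_pairs, fill_value=0)

# Predict 1 for categories bought in 2 or 3 of the last three orders.
toy_target = (hits >= 2).astype(int)
print(toy_target)
```

Category 5 appears in two of the four orders overall, but in only one of the last three, so it is predicted as 0 here.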

In [30]:
a = df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']
for i in sorted(df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart'].unique()):
    print('Number of orders: {}, id with this number of orders: {} pieces'.format(i, sum(a == i)))

Number of orders: 3, id with this number of orders: 1888 pieces
Number of orders: 4, id with this number of orders: 1465 pieces
Number of orders: 5, id with this number of orders: 1195 pieces
Number of orders: 6, id with this number of orders: 973 pieces
Number of orders: 7, id with this number of orders: 817 pieces
Number of orders: 8, id with this number of orders: 742 pieces
Number of orders: 9, id with this number of orders: 611 pieces
Number of orders: 10, id with this number of orders: 490 pieces
Number of orders: 11, id with this number of orders: 412 pieces
Number of orders: 12, id with this number of orders: 383 pieces
Number of orders: 13, id with this number of orders: 344 pieces
Number of orders: 14, id with this number of orders: 305 pieces
Number of orders: 15, id with this number of orders: 306 pieces
Number of orders: 16, id with this number of orders: 279 pieces
Number of orders: 17, id with this number of orders: 218 pieces
Number of orders: 18, id with this number of

In [23]:
df_ind['last_1'] = 0   # last order
df_ind['last_2'] = 0   # penultimate order
df_ind['last_3'] = 0   # third-to-last order

df_ind['user_id_time'] = list(zip(df_ind['user_id'], df_ind['order_completed_at']))

ind1 = np.array(df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 1
val_id1 = list(orders.iloc[ind1].reset_index()['user_id'])
val_time1 = list(orders.iloc[ind1].reset_index()['order_completed_at'])
val_id_time1 = list(zip(val_id1, val_time1))
df_ind.loc[df_ind.query('user_id_time in @val_id_time1').index, 'last_1'] = 1
last_1 = df_ind.groupby(['user_id', 'cart']).sum()['last_1'].values
share_df['last_1'] = last_1

ind2 = np.array(df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 2
val_id2 = list(orders.iloc[ind2].reset_index()['user_id'])
val_time2 = list(orders.iloc[ind2].reset_index()['order_completed_at'])
val_id_time2 = list(zip(val_id2, val_time2))
df_ind.loc[df_ind.query('user_id_time in @val_id_time2').index, 'last_2'] = 1
last_2 = df_ind.groupby(['user_id', 'cart']).sum()['last_2'].values
share_df['last_2'] = last_2

ind3 = np.array(df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 3
val_id3 = list(orders.iloc[ind3].reset_index()['user_id'])
val_time3 = list(orders.iloc[ind3].reset_index()['order_completed_at'])
val_id_time3 = list(zip(val_id3, val_time3))
df_ind.loc[df_ind.query('user_id_time in @val_id_time3').index, 'last_3'] = 1
last_3 = df_ind.groupby(['user_id', 'cart']).sum()['last_3'].values
share_df['last_3'] = last_3

# Calculate binary features, whether the category was in the last three orders

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [24]:
share_df['target'] = 0
share_df.loc[share_df.query('last_1 + last_2 + last_3 in [2, 3]').index, 'target'] = 1
share_df

# Calculate new target

,user_id,cart,share,merge,last_1,last_2,last_3,target
0,0,5,0.333333,"(0, 5)",0,1,0,0
1,0,10,0.333333,"(0, 10)",0,1,0,0
2,0,14,0.666667,"(0, 14)",0,1,1,1
3,0,20,0.333333,"(0, 20)",0,0,1,0
4,0,22,0.333333,"(0, 22)",0,1,0,0
...,...,...,...,...,...,...,...,...
790444,19998,398,0.666667,"(19998, 398)",0,1,1,1
790445,19998,409,0.333333,"(19998, 409)",1,0,0,0
790446,19998,415,0.666667,"(19998, 415)",0,1,1,1
790447,19998,420,0.666667,"(19998, 420)",0,1,1,1


In [25]:
result2 = sample.merge(share_df, on='merge').rename(columns={'target_y': 'target'})[['id', 'target']]

In [26]:
result2[['id', 'target']].to_csv('submission2.csv', index=False)

There is an idea in finance that the best predictor of an asset's price is its current price. Let's take advantage of this and simply predict that the next order will contain the categories of the most recent order.

In [27]:
result3 = sample.merge(share_df, on='merge').rename(columns={'last_1': 'target'})[['id', 'target']]

In [28]:
result3[['id', 'target']].to_csv('submission3.csv', index=False)

As already established, the similarity of orders decreases over time, so we take a weighted average over the last three orders. Let's take a weight of 0.5 for the last order, 0.3 for the penultimate order, and 0.2 for the third-to-last order, and predict 1 for a weighted average >= 0.5. The prediction is then similar to the plain share over the last three orders, except that 1 is also predicted for every category that was in the last order. The result is a combination of the previous two solutions.
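
To see which combinations of the three binary flags pass the threshold, the rule can be enumerated directly. The names `WEIGHTS`, `THRESHOLD` and `weighted_vote` are introduced only for this sketch.

```python
from itertools import product

WEIGHTS = (0.5, 0.3, 0.2)   # last, penultimate, third-to-last order
THRESHOLD = 0.5


def weighted_vote(last_1, last_2, last_3):
    """Return 1 if the weighted average of the three flags reaches the threshold."""
    score = WEIGHTS[0] * last_1 + WEIGHTS[1] * last_2 + WEIGHTS[2] * last_3
    return int(score >= THRESHOLD)


# Enumerate all eight combinations of (last_1, last_2, last_3).
vote_table = {flags: weighted_vote(*flags) for flags in product((0, 1), repeat=3)}
for flags, vote in vote_table.items():
    print(flags, '->', vote)
```

A category is predicted whenever it was in the last order, or in both the penultimate and the third-to-last orders; a single appearance in one of the two older orders is not enough.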

In [29]:
share_df['target_w'] = 0
share_df.loc[share_df.query('last_1*0.5 + last_2*0.3 + last_3*0.2 >= 0.5').index, 'target_w'] = 1
share_df

,user_id,cart,share,merge,last_1,last_2,last_3,target,target_w
0,0,5,0.333333,"(0, 5)",0,1,0,0,0
1,0,10,0.333333,"(0, 10)",0,1,0,0,0
2,0,14,0.666667,"(0, 14)",0,1,1,1,1
3,0,20,0.333333,"(0, 20)",0,0,1,0,0
4,0,22,0.333333,"(0, 22)",0,1,0,0,0
...,...,...,...,...,...,...,...,...,...
790444,19998,398,0.666667,"(19998, 398)",0,1,1,1,1
790445,19998,409,0.333333,"(19998, 409)",1,0,0,0,1
790446,19998,415,0.666667,"(19998, 415)",0,1,1,1,1
790447,19998,420,0.666667,"(19998, 420)",0,1,1,1,1


In [30]:
result4 = sample.merge(share_df, on='merge').rename(columns={'target_w': 'target'})[['id', 'target']]

In [31]:
result4[['id', 'target']].to_csv('submission4.csv', index=False)

Now predict 1 for all categories whose share is at least 0.5 or which were present in all of the last 3 orders.

In [32]:
share_df['target_1'] = 0
share_df.loc[share_df.query('share >= 0.5').index, 'target_1'] = 1
share_df.loc[share_df.query('last_1 + last_2 + last_3 == 3').index, 'target_1'] = 1
share_df

,user_id,cart,share,merge,last_1,last_2,last_3,target,target_w,target_1
0,0,5,0.333333,"(0, 5)",0,1,0,0,0,0
1,0,10,0.333333,"(0, 10)",0,1,0,0,0,0
2,0,14,0.666667,"(0, 14)",0,1,1,1,1,1
3,0,20,0.333333,"(0, 20)",0,0,1,0,0,0
4,0,22,0.333333,"(0, 22)",0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
790444,19998,398,0.666667,"(19998, 398)",0,1,1,1,1,1
790445,19998,409,0.333333,"(19998, 409)",1,0,0,0,1,0
790446,19998,415,0.666667,"(19998, 415)",0,1,1,1,1,1
790447,19998,420,0.666667,"(19998, 420)",0,1,1,1,1,1


In [33]:
result5 = sample.merge(share_df, on='merge').rename(columns={'target_1': 'target'})[['id', 'target']]

In [34]:
result5[['id', 'target']].to_csv('submission5.csv', index=False)

# Building the model

We will use gradient boosting. First, we need as many features as possible for each user-category pair. Let's take the following features:

1) The share of the customer's orders in which the category appears

2) Binary feature: takes the value 1 if the category was in the last order

3) Same as feature 2, but for the penultimate order

4) Popularity of the category: the share of all customers' orders that contain the category

We also have the user identifier and the category number, which also carry valuable information. One-hot encoding is undesirable here, as it would inflate the dimensionality of the feature space too much and the resulting features would be extremely sparse, so mean target encoding is applied to the user_id and cart variables.

5) user_id

6) cart
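
As an illustration of mean target encoding, here is a minimal sketch with simple additive smoothing. The function `mean_target_encode`, its parameter `m` and the toy data are assumptions of this example; the smoothing formula used by `category_encoders.TargetEncoder` may differ from this one.

```python
import pandas as pd


def mean_target_encode(categories, target, m=1.0):
    """Replace each category by a smoothed mean of the target.

    The per-category mean is pulled towards the global mean with weight m
    (additive smoothing; an assumption of this sketch).
    """
    prior = target.mean()
    stats = target.groupby(categories).agg(['sum', 'count'])
    encoding = (stats['sum'] + m * prior) / (stats['count'] + m)
    return categories.map(encoding), encoding


toy_te = pd.DataFrame({
    'cart': ['a', 'a', 'a', 'b'],
    'y':    [1,   1,   0,   0],
})
toy_te['cart_te'], cart_encoding = mean_target_encode(toy_te['cart'], toy_te['y'], m=1.0)
print(toy_te)
```

With a global mean of 0.5, category `a` (two positives out of three) is encoded as (2 + 0.5) / (3 + 1) = 0.625 and the rare category `b` as (0 + 0.5) / (1 + 1) = 0.25. To avoid leaking the target, the encoding should be fitted on training rows only and then applied to held-out rows.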

For training, we take the entire dataset except each user's last order; whether a category was bought in that last order is the dependent variable.

In [67]:
df['y'] = 0
df['user_id_time'] = list(zip(df['user_id'], df['order_completed_at']))
last_ind = np.array(df.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 1
orders2 = df.groupby(['user_id', 'order_completed_at']).count()
last_val_time = list(orders2.iloc[last_ind].reset_index()['order_completed_at'])
last_val_id = list(orders2.iloc[last_ind].reset_index()['user_id'])
last_val_id_time = list(zip(last_val_id, last_val_time))
df.loc[df.query('user_id_time in @last_val_id_time').index, 'y'] = 1
df_model = df.drop(df.query('user_id_time in @last_val_id_time').index)    # drop the last order of each id from the sample
data = df_model.groupby(['user_id', 'cart']).count().reset_index().drop('order_completed_at', axis=1)    # table for the training data
y = df.groupby(['user_id', 'cart']).sum().loc[df_model.groupby(['user_id', 'cart']).count().index]['y'].values
df_model.drop('y', axis=1, inplace=True)
data.drop('user_id_time', axis=1, inplace=True)

data['y'] = y

# Calculate the dependent variable
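
The construction of the dependent variable can be sketched on a toy history. This is an illustration with hypothetical names (`toy_df`, `history`, `holdout`, `pairs`) and uses `transform('max')` rather than the index arithmetic of the cell above.

```python
import pandas as pd

# Hypothetical single-user history: Jan {5, 14}, Feb {14}, Mar {14, 20}.
toy_df = pd.DataFrame({
    'user_id': [0, 0, 0, 0, 0],
    'order_completed_at': pd.to_datetime([
        '2020-01-01', '2020-01-01', '2020-02-01', '2020-03-01', '2020-03-01',
    ]),
    'cart': [5, 14, 14, 14, 20],
})

# Timestamp of each user's last order, broadcast to every row of that user.
last_time = toy_df.groupby('user_id')['order_completed_at'].transform('max')
is_last = toy_df['order_completed_at'] == last_time

history = toy_df[~is_last]   # rows used to compute features
holdout = toy_df[is_last]    # the held-out last order

# One training row per (user, category) seen in the history;
# y = 1 if the category also appears in the held-out last order.
pairs = history[['user_id', 'cart']].drop_duplicates()
pairs = pairs.merge(
    holdout[['user_id', 'cart']].assign(y=1),
    on=['user_id', 'cart'],
    how='left',
)
pairs['y'] = pairs['y'].fillna(0).astype(int)
print(pairs)
```

Category 20 appears only in the held-out order, so it has no history row and does not enter the training table, which matches how the table above is indexed by the pairs of `df_model`.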

In [68]:
order_num2 = df_model.drop('user_id_time', axis=1).groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart'].values
cat_num2 = df_model.drop('user_id_time', axis=1).groupby(['user_id', 'cart']).count()

# Repeat each user's order count once per category that user has ordered
cats_per_user2 = cat_num2.groupby('user_id').count()['order_completed_at'].values
div = np.repeat(order_num2, cats_per_user2)
share = cat_num2['order_completed_at'].values / div
data['share'] = share

# Calculate the share of orders

In [69]:
df_model['last'] = 0
df_model['penult'] = 0

orders3 = df_model.groupby(['user_id', 'order_completed_at']).count()

last_ind2 = np.array(df_model.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 1
last_val_id2 = list(orders3.iloc[last_ind2].reset_index()['user_id'])
last_val_time2 = list(orders3.iloc[last_ind2].reset_index()['order_completed_at'])
last_val_id_time2 = list(zip(last_val_id2, last_val_time2))
df_model.loc[df_model.query('user_id_time in @last_val_id_time2').index, 'last'] = 1
last = df_model.groupby(['user_id', 'cart']).sum()['last'].values
data['last'] = last

penult_ind2 = np.array(df_model.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 2
penult_val_id2 = list(orders3.iloc[penult_ind2].reset_index()['user_id'])
penult_val_time2 = list(orders3.iloc[penult_ind2].reset_index()['order_completed_at'])
penult_val_id_time2 = list(zip(penult_val_id2, penult_val_time2))
df_model.loc[df_model.query('user_id_time in @penult_val_id_time2').index, 'penult'] = 1
penult = df_model.groupby(['user_id', 'cart']).sum()['penult'].values
data['penult'] = penult

# Calculate binary features, whether the category was in the last and penultimate orders

In [70]:
pop_cat = df_model.groupby('cart').count()['user_id'] / orders2.groupby('user_id').count()['cart'].sum()
# Look up each row's category popularity directly by category id
data['popularity'] = data['cart'].map(pop_cat)

# Calculate the popularity of categories

In [71]:
from category_encoders.target_encoder import TargetEncoder
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['user_id', 'cart', 'share', 'last', 'penult', 'popularity']],data['y'],
                                                    test_size=0.2, random_state=1, stratify = data['y'])

te1 = TargetEncoder(smoothing=1)
X_train['user_id_te'] = te1.fit_transform(X_train['user_id'].astype(str), y_train)
X_test['user_id_te'] = te1.transform(X_test['user_id'].astype(str))

te2 = TargetEncoder(smoothing=1)
X_train['cart_te'] = te2.fit_transform(X_train['cart'].astype(str), y_train)
X_test['cart_te'] = te2.transform(X_test['cart'].astype(str))

# Convert user_id and cart

Next, compute the same features for the users in the sample submission.

In [72]:
data_sample = df_ind.groupby(['user_id', 'cart']).count().reset_index().drop('order_completed_at', axis=1)
# prediction table

data_sample['share'] = share_df['share']
# previously calculated category share in orders

In [73]:
df_ind['last'] = 0
df_ind['penult'] = 0
df_ind['user_id_time'] = list(zip(df_ind['user_id'], df_ind['order_completed_at']))
orders = df_ind.groupby(['user_id', 'order_completed_at']).count()

last_ind3 = np.array(df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 1
last_val_id3 = list(orders.iloc[last_ind3].reset_index()['user_id'])
last_val_time3 = list(orders.iloc[last_ind3].reset_index()['order_completed_at'])
last_val_id_time3 = list(zip(last_val_id3, last_val_time3))
df_ind.loc[df_ind.query('user_id_time in @last_val_id_time3').index, 'last'] = 1
last = df_ind.groupby(['user_id', 'cart']).sum()['last'].values
data_sample['last'] = last

penult_ind3 = np.array(df_ind.groupby(['user_id', 'order_completed_at']).count().groupby('user_id').count()['cart']).cumsum() - 2
penult_val_id3 = list(orders.iloc[penult_ind3].reset_index()['user_id'])
penult_val_time3 = list(orders.iloc[penult_ind3].reset_index()['order_completed_at'])
penult_val_id_time3 = list(zip(penult_val_id3, penult_val_time3))
df_ind.loc[df_ind.query('user_id_time in @penult_val_id_time3').index, 'penult'] = 1
penult = df_ind.groupby(['user_id', 'cart']).sum()['penult'].values
data_sample['penult'] = penult

# Calculate binary features, whether the category was in the last and penultimate orders

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [74]:
pop_cat = df_ind.groupby('cart').count()['user_id'] / orders2.groupby('user_id').count()['cart'].sum()
# Look up each row's category popularity directly by category id
data_sample['popularity'] = data_sample['cart'].map(pop_cat)

# Calculate the popularity of categories

In [75]:
te1 = TargetEncoder(smoothing=1)
data['user_id_te'] = te1.fit_transform(data['user_id'].astype(str), data['y'])
data_sample['user_id_te'] = te1.transform(data_sample['user_id'].astype(str))

te2 = TargetEncoder(smoothing=1)
data['cart_te'] = te2.fit_transform(data['cart'].astype(str), data['y'])
data_sample['cart_te'] = te2.transform(data_sample['cart'].astype(str))

# Convert user_id and cart

The features are ready; now let's train the model.

In [77]:
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb
from sklearn.metrics import f1_score

n_estimators = np.append(np.array(list(range(40, 121, 40))), np.array([500, 1000]))
num_leaves = np.array(list(range(10, 71, 10)))

searcher = GridSearchCV(lgb.LGBMClassifier(objective="binary"),
                        [{'n_estimators' : n_estimators, 'num_leaves' : num_leaves}],
                            scoring = 'f1', cv = 5)
searcher.fit(X_train, y_train)
y_pred = searcher.predict(X_test)
    
print("n_estimators: {}, num_leaves: {}, valid F1 score: {:.4f}".format(searcher.best_params_['n_estimators'], searcher.best_params_['num_leaves'], f1_score(y_test, y_pred)))

reg_alpha: 8.858667904100823, valid F1 score: 0.4040


In [80]:
from sklearn.model_selection import cross_val_score
cross_val_score(lgb.LGBMClassifier(objective="binary", n_estimators=searcher.best_params_['n_estimators'], num_leaves = searcher.best_params_['num_leaves']), data[['user_id_te', 'cart_te', 'share', 'last', 'penult', 'popularity']], data['y'], cv=5, scoring='f1')

array([0.37488643, 0.39263397, 0.40924531, 0.45170095, 0.48964472])

In [81]:
data_sample['merge'] = list(zip(data_sample['user_id'], data_sample['cart']))
data_sample = sample.merge(data_sample, on='merge')

In [83]:
booster = lgb.LGBMClassifier(objective="binary", n_estimators=searcher.best_params_['n_estimators'], num_leaves = searcher.best_params_['num_leaves'])
booster.fit(data[['user_id_te', 'cart_te', 'share', 'last', 'penult', 'popularity']], data['y'])
target = booster.predict(data_sample[['user_id_te', 'cart_te', 'share', 'last', 'penult', 'popularity']])

In [84]:
sample['target'] = target

In [85]:
sample[['id', 'target']].to_csv('submission9.csv', index=False)