### Two-level recommender system
#### Рожков Василий

##### Input data:
 - data - sales data
 - item_features - item (product) data
 - user_features - customer data
 - test_data - held-out sales data for the final evaluation of the model
 
##### Task: build a recommender system for products.  
##### Target metric: money precision @ 5. Target value: money precision @ 5 > 20%  
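
The notebook does not spell out how money precision @ k is computed. Below is a minimal sketch, assuming the common definition: the total price of the top-k recommendations that were actually bought, divided by the total price of all top-k recommendations. The `prices` dictionary and the toy ids are illustrative assumptions, not values from the dataset.

```python
def money_precision_at_k(recommended, bought, prices, k=5):
    """Share of the top-k recommendations' total price that falls on items actually bought."""
    top_k = list(recommended)[:k]
    bought = set(bought)
    total = sum(prices[item] for item in top_k)
    if total == 0:
        return 0.0
    hit_value = sum(prices[item] for item in top_k if item in bought)
    return hit_value / total


# Toy data (assumed): item id -> price
prices = {1: 2.0, 2: 8.0, 3: 1.5, 4: 3.0, 5: 5.5}
score = money_precision_at_k([1, 2, 3, 4, 5], bought=[2, 5, 9], prices=prices)
```

With this definition a hit on an expensive item counts for more than a hit on a cheap one, which is presumably why the business constraints ask for at least one expensive item.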

##### Business constraints on the top-5 items:  
- 5 recommendations for each user
- 2 new items (ones the user has never bought)
- 1 expensive item, priced > 7 dollars
- All items come from different categories (category = sub_commodity_desc)
- Every recommended item costs > 1 dollar
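
One way to apply these constraints to a ranked candidate list is a greedy pass. This is only a sketch under assumptions: "2 new items" and "1 expensive item" are read as "at least", the function name `select_top5` and its arguments are hypothetical, and a greedy pass can return fewer than 5 items when the candidates cannot satisfy every constraint.

```python
def select_top5(ranked, price, category, purchased, n=5, n_new=2,
                min_price=1.0, expensive=7.0):
    """Greedy pick of n items from a ranked candidate list (best first)."""
    # Hard filter: every recommended item must cost more than min_price
    pool = [item for item in ranked if price[item] > min_price]
    chosen, used_categories = [], set()

    def take(predicate, how_many):
        """Add up to how_many best-ranked items that satisfy predicate."""
        added = 0
        for item in pool:
            if added >= how_many or len(chosen) >= n:
                break
            if item in chosen or category[item] in used_categories:
                continue
            if predicate(item):
                chosen.append(item)
                used_categories.add(category[item])
                added += 1

    take(lambda item: price[item] > expensive, 1)                   # expensive item first
    already_new = sum(item not in purchased for item in chosen)
    take(lambda item: item not in purchased, n_new - already_new)   # then new items
    take(lambda item: True, n - len(chosen))                        # fill the rest by rank
    position = {item: pos for pos, item in enumerate(ranked)}
    return sorted(chosen, key=position.get)


# Toy candidates (assumed data), already sorted by model score
ranked = [1, 2, 3, 4, 5, 6, 7, 8]
price = {1: 0.5, 2: 2.0, 3: 3.0, 4: 8.0, 5: 2.0, 6: 4.0, 7: 3.0, 8: 9.0}
category = {1: 'a', 2: 'a', 3: 'b', 4: 'c', 5: 'a', 6: 'd', 7: 'e', 8: 'c'}
top5 = select_top5(ranked, price, category, purchased={2, 3, 4})
```

Filling the scarce slots (expensive, new) before the free ones keeps the greedy pass from using up categories on items that do not help with any constraint.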

##### Output format: a .csv file with the recommendations. The file has 2 columns: user_id - (item_id1, item_id2, ..., item_id5)  

We implement a two-level recommender system following the Implicit.ALS + LightGBM scheme.


----

##### Implementation (pipeline)  
- load the data
- split it into train/test sets according to the 2 levels
- pre-filter the items
- train the first-level recommender. Training uses TF-IDF weighting; we take own_rec, the other options were rejected empirically

- prepare item features: 
 * embeddings
 * price
 * average quantity of the item per basket
 * cumulative revenue of the item
 * number of items in the same category
 * days since the last sale. If the item had no sales in the period, we take the number of days in the period times 2 (as a kind of weight)
 * the remaining features are converted to categorical  

- prepare user features:
 * embeddings
 * average basket value
 * days since the last purchase. If the user made no purchases in the period, we take the number of days in the period times 2 (as a kind of weight)
 * age, average income, household size and number of children are converted to numeric format
 * the remaining features are converted to categorical

- train the second-level model. Its prediction score is used as the result.
- use the scores to select recommended items for each user (100)
- from those, select 5 items per user according to the business constraints
- compute the metric

- use the trained model to predict on the test data and compute the metric.
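
The "days since the last sale / purchase" feature with its doubled-period fallback can be sketched as follows. The function name, its arguments and the inclusive way the period length is counted are assumptions made for illustration; the notebook's own helper is not shown in this section.

```python
def days_since_last(last_day_by_key, all_keys, period_start, period_end):
    """Days between the end of the period and the last event for each key.

    Keys with no events in the period get twice the period length,
    so they look "older" than any key that was seen at least once.
    """
    period_len = period_end - period_start + 1   # assumption: both ends inclusive
    fallback = 2 * period_len
    return {
        key: (period_end - last_day_by_key[key]) if key in last_day_by_key else fallback
        for key in all_keys
    }


# Toy data (assumed): item id -> day of its last sale within days 1..100
last_sale_day = {10: 95, 11: 60}
feature = days_since_last(last_sale_day, all_keys=[10, 11, 12],
                          period_start=1, period_end=100)
```

The same function works for users if the dictionary maps user ids to the day of their last purchase.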

#### What else could be done:  
 - try a hybrid model on the first level instead of the baseline. The first attempts gave no tangible improvement, so this was postponed.
 - try additional features for items and users, as well as for user-item pairs. 
 - try grid search
 - try other loss functions
 - pre-filter the data for the second-level model
 - try XGBoost or a neural network on the second level.  
 - refactor more thoroughly: add functional-style data processing, or move the second level into a separate class with the corresponding methods
   
All of this was postponed: the result obtained was acceptable for now, and time was desperately short.

In [1]:
!pip install implicit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting implicit
  Downloading implicit-0.6.1-cp37-cp37m-manylinux2014_x86_64.whl (18.6 MB)
Installing collected packages: implicit
Successfully installed implicit-0.6.1


In [2]:
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix, coo_matrix

# Matrix factorization
from implicit import als

# Second-level model
from lightgbm import LGBMClassifier

# import os, sys
# module_path = os.path.abspath(os.path.join(os.pardir))
# if module_path not in sys.path:
#     sys.path.append(module_path)

#from src.metrics import precision_at_k, recall_at_k, money_precision_at_k
#from src.utils import pre_filter_items, get_users_features, get_items_features, get_recommendation_5
#from src.recommenders import MainRecommender

# Deterministic algorithms
from implicit.nearest_neighbours import ItemItemRecommender, CosineRecommender, TFIDFRecommender, BM25Recommender

# Metrics
from implicit.evaluation import train_test_split
from implicit.evaluation import precision_at_k, mean_average_precision_at_k, AUC_at_k, ndcg_at_k



import warnings
warnings.filterwarnings("ignore")

from tqdm import tqdm
tqdm.pandas()

  f"CUDA extension is built, but disabling GPU support because of '{e}'",


In [None]:
#from google.colab import drive
#drive.mount('/content/drive')
#from google.colab import files
#files.upload()

Load the data

In [3]:
path_data = 'retail_train.csv'  # further down I load the data with the price already computed
path_features = 'product.csv'
path_user = 'hh_demographic.csv'

data = pd.read_csv(path_data)
item_features = pd.read_csv(path_features)
user_features = pd.read_csv(path_user)

# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)

In [None]:
#test_path = 'retail_test1.csv'
#test_data = pd.read_csv(test_path)

In [63]:
data.shape, 
data.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


In [64]:
data.nlargest(3, 'sales_value')

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
1085442,1609,32006114302,339,12484608,3,840.0,412,0.0,2038,49,0.0,0.0
2030766,346,40387571385,574,948670,5,631.8,415,0.0,1312,83,0.0,0.0
655985,125,30515165970,230,1089093,2,505.0,323,0.0,1231,34,0.0,0.0


In [4]:
# compute prices right away: we will need them later
prices = data.groupby(['item_id'])['sales_value'].mean().reset_index()
sales_qty = data.groupby(['item_id'])['quantity'].mean().reset_index()
prices = prices.merge(sales_qty, on='item_id', how='left')
prices['price'] = [prices.iloc[i]['sales_value'] / prices.iloc[i]['quantity']\
                   if prices.iloc[i]['quantity'] > 0 else 0 for i in prices['item_id'].index]
prices.drop(columns=['sales_value', 'quantity'], axis=1, inplace=True)

In [5]:
data = data.merge(prices, on='item_id', how='left')
data.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc,price
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0,2.385178
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0,0.945892


The price calculation is not particularly fast, so I kept the option of loading precomputed data for future runs.
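
Most of the time in the price cell goes into the row-by-row `iloc` lookups. A vectorized variant of the same formula (mean `sales_value` divided by mean `quantity`, with 0 where the mean quantity is 0) might look like the sketch below. The helper name `mean_unit_price` and the toy frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def mean_unit_price(df: pd.DataFrame) -> pd.DataFrame:
    """Per-item price = mean sales_value / mean quantity (0 when mean quantity is 0)."""
    agg = df.groupby('item_id', as_index=False)[['sales_value', 'quantity']].mean()
    # Replace zero quantities with NaN so the division yields NaN instead of inf,
    # then map those NaNs to a price of 0
    safe_quantity = agg['quantity'].replace(0, np.nan)
    agg['price'] = (agg['sales_value'] / safe_quantity).fillna(0.0)
    return agg[['item_id', 'price']]


# Toy data (assumed)
toy = pd.DataFrame({
    'item_id':     [1, 1, 2, 3],
    'sales_value': [2.0, 4.0, 3.0, 0.0],
    'quantity':    [1, 3, 1, 0],
})
toy_prices = mean_unit_price(toy)
```

The result can be merged into `data` on `item_id` in the same way as the `prices` frame above.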

In [6]:
data['week_no'].nunique()

95

In [7]:
popularity = data.groupby('item_id')['sales_value'].sum().reset_index()
popularity.describe()

Unnamed: 0,item_id,sales_value
count,89051.0,89051.0
mean,5115772.0,83.458481
std,5178973.0,1628.715079
min,25671.0,0.0
25%,966583.0,3.5
50%,1448516.0,10.78
75%,9553042.0,46.105
max,18024560.0,467993.62


In [8]:
np.sort(popularity.sales_value)[::-1]

array([467993.62,  42645.75,  37981.91, ...,      0.  ,      0.  ,
            0.  ])

In [9]:
popularity.sort_values('sales_value', ascending=False, inplace=True)

# the top 10% of items (by revenue) bring in about 75% of the revenue
popularity.head(8900)['sales_value'].sum() / popularity['sales_value'].sum() *100 

74.73621863438936

In [10]:
popularity = data.groupby('item_id')['user_id'].nunique().reset_index()
popularity.describe()

Unnamed: 0,item_id,user_id
count,89051.0,89051.0
mean,5115772.0,14.759767
std,5178973.0,45.904111
min,25671.0,1.0
25%,966583.0,1.0
50%,1448516.0,2.0
75%,9553042.0,10.0
max,18024560.0,2039.0


In [11]:
item_features = pd.read_csv('product.csv')
item_features.head(2)

Unnamed: 0,PRODUCT_ID,MANUFACTURER,DEPARTMENT,BRAND,COMMODITY_DESC,SUB_COMMODITY_DESC,CURR_SIZE_OF_PRODUCT
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [12]:
user_features = pd.read_csv('hh_demographic.csv')
user_features.head(2)

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [None]:
# Train-test split
# In recommender systems it is more correct to split train/test by time rather than randomly
# I take the last 3 weeks as the test set

In [13]:
test_size_weeks = 3

data_train = data[data['week_no'] < data['week_no'].max() - test_size_weeks]
data_test = data[data['week_no'] >= data['week_no'].max() - test_size_weeks]

In [14]:
data_train.shape[0], data_test.shape[0]

(2278490, 118314)
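
A small check of the split boundary on toy data. If `week_no` values are consecutive integers (an assumption, not verified for this dataset), the condition `week_no >= max - test_size_weeks` puts `test_size_weeks + 1` distinct weeks into the test part, while the strict `>` keeps exactly `test_size_weeks`.

```python
import pandas as pd

toy = pd.DataFrame({'week_no': list(range(1, 11))})   # weeks 1..10 (assumed toy data)
test_size_weeks = 3
last_week = toy['week_no'].max()

# Same boundary as in the cell above: the cut-off week itself goes to the test part
test_inclusive = toy[toy['week_no'] >= last_week - test_size_weeks]
# Strict variant: only weeks after the cut-off week
test_strict = toy[toy['week_no'] > last_week - test_size_weeks]

n_inclusive = test_inclusive['week_no'].nunique()
n_strict = test_strict['week_no'].nunique()
```

Either choice is workable as long as the train condition is the exact complement of the test condition, which is the case in the cell above.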

Build a dataframe with the users' purchases in the test set (the last 3 weeks)

In [15]:
result = data_test.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']
result_rezerv = result.copy()  # backup copy; plain assignment would only alias the same dataframe
result.head(10)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."
3,7,"[840386, 889774, 898068, 909714, 929067, 95347..."
4,8,"[835098, 872137, 910439, 924610, 992977, 10412..."
5,9,"[864335, 990865, 1029743, 9297474, 10457112, 8..."
6,13,"[6534178, 1104146, 829197, 840361, 862070, 884..."
7,14,"[840601, 867293, 933067, 951590, 952408, 96569..."
8,15,"[910439, 1082185, 959076, 1023958, 1082310, 13..."
9,16,"[1062973, 1082185, 13007710]"


In [77]:
result.iloc[0:3].actual

0    [821867, 834484, 856942, 865456, 889248, 90795...
1    [835476, 851057, 872021, 878302, 879948, 90963...
2    [920308, 926804, 946489, 1006718, 1017061, 107...
Name: actual, dtype: object

In [78]:
result.iloc[9].actual

array([ 1062973,  1082185, 13007710])

In [79]:
test_users = result.shape[0]
new_test_users = len(set(data_test['user_id']) - set(data_train['user_id']))

print('The test set contains {} users'.format(test_users))
print('The test set contains {} new users'.format(new_test_users))

The test set contains 2042 users
The test set contains 0 new users


1.1 Random recommendation

In [80]:
def random_recommendation(items, n=5):
    """Random recommendations"""
    
    items = np.array(items)
    recs = np.random.choice(items, size=n, replace=False)
    
    return recs.tolist()

In [81]:
%%time

items = data_train.item_id.unique()

result['random_recommendation'] = result['user_id'].apply(lambda x: random_recommendation(items, n=5))
result.head(10)

CPU times: user 4 s, sys: 55.6 ms, total: 4.05 s
Wall time: 4.05 s


Unnamed: 0,user_id,actual,random_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[9655444, 9884159, 16223315, 13154969, 2635976]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[922486, 2171174, 1212361, 15717202, 893367]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[852540, 1801210, 9655242, 1830708, 1069082]"
3,7,"[840386, 889774, 898068, 909714, 929067, 95347...","[5709129, 845683, 6919107, 9420013, 1098198]"
4,8,"[835098, 872137, 910439, 924610, 992977, 10412...","[9527010, 13098159, 1121512, 342276, 7441927]"
5,9,"[864335, 990865, 1029743, 9297474, 10457112, 8...","[1077546, 1032551, 9861182, 12301396, 6606120]"
6,13,"[6534178, 1104146, 829197, 840361, 862070, 884...","[1341423, 12301117, 996290, 889648, 407798]"
7,14,"[840601, 867293, 933067, 951590, 952408, 96569...","[1378599, 78215, 1016573, 8203677, 1850595]"
8,15,"[910439, 1082185, 959076, 1023958, 1082310, 13...","[1988937, 1056651, 10180671, 834430, 1010955]"
9,16,"[1062973, 1082185, 13007710]","[6904466, 12582269, 831223, 8091414, 1137505]"


1.2 Popularity-based recommendation

In [82]:
def popularity_recommendation(data, n=5):
    """Top-n most popular items"""
    
    popular = data.groupby('item_id')['sales_value'].sum().reset_index()
    popular.sort_values('sales_value', ascending=False, inplace=True)
    
    recs = popular.head(n).item_id
    
    return recs.tolist()

In [83]:
%%time

# This is fine because the recommendation does not depend on the user
popular_recs = popularity_recommendation(data_train, n=5)

result['popular_recommendation'] = result['user_id'].apply(lambda x: popular_recs)
result.head(2)

CPU times: user 174 ms, sys: 6.7 ms, total: 180 ms
Wall time: 176 ms


Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[9655444, 9884159, 16223315, 13154969, 2635976]","[6534178, 6533889, 1029743, 6534166, 1082185]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[922486, 2171174, 1212361, 15717202, 893367]","[6534178, 6533889, 1029743, 6534166, 1082185]"


The advantages are obvious. But what could the downside be?
Since these are popular items, the user probably already knows about them without our recommendations;
It is odd to recommend 0.01% of the catalog when the whole catalog is available. We should aim to cover 3-5% of the catalog items.
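
Catalog coverage can be measured directly. A minimal sketch, assuming coverage is defined as the share of catalog items that appear in at least one user's recommendations; the function name is hypothetical, the five ids are the popular items from the output above, and 89051 is the item count reported by `describe()` earlier.

```python
def catalog_coverage(recommendations, catalog_size):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = set()
    for recs in recommendations:
        recommended.update(recs)
    return len(recommended) / catalog_size


popular = [6534178, 6533889, 1029743, 6534166, 1082185]
# Every one of the 2042 test users gets the same five items
coverage = catalog_coverage([popular] * 2042, catalog_size=89051)
coverage_percent = coverage * 100
```

For the popularity baseline this is five items out of the whole catalog no matter how many users there are, which is the order of magnitude the text refers to.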


 1.3 Weighted random recommender 
 
 Items can be sampled randomly but in proportion to some weight, for example in direct proportion to popularity. Weight = log(sales_sum of the item), where sales_sum is the revenue from all sales of the item. 
 
 Example (each weight is divided by the sum of all weights):
 item_1 - 5, item_2 - 7, item_3 - 4  ->  item_1 - 5 / 16, item_2 - 7 / 16, item_3 - 4 / 16
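
The same example in code, as a sketch: weights are normalized into probabilities and passed to NumPy's sampler through its `p` argument. The item ids are made up, and `np.log1p` is used for the log variant because it stays finite at zero sales.

```python
import numpy as np

items = np.array([101, 102, 103])        # hypothetical item ids
sales_sum = np.array([5.0, 7.0, 4.0])    # weights from the example above

# Linear weights: 5/16, 7/16, 4/16
linear_p = sales_sum / sales_sum.sum()

# Log weights: log(1 + x) flattens the gap between popular and rare items
log_w = np.log1p(sales_sum)
log_p = log_w / log_w.sum()

# Sample 2 distinct items, more popular ones are more likely to be drawn
rng = np.random.default_rng(42)
sample = rng.choice(items, size=2, replace=False, p=log_p)
```

The probabilities must sum to 1 and, with `replace=False`, there must be at least `size` items with non-zero probability.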

In [84]:
data = data_train
popular = data.groupby('item_id')['sales_value'].sum().reset_index()
popular[popular['sales_value']==0]

Unnamed: 0,item_id,sales_value
138,30937,0.0
2475,142713,0.0
3514,410388,0.0
5163,744587,0.0
5815,821773,0.0
...,...,...
86577,17104189,0.0
86708,17179257,0.0
86839,17284401,0.0
86844,17291554,0.0


In [85]:
# There are zero values, which give -inf after taking the logarithm
# there is also some noise: a very low sales_value,
# which again yields a large negative value after taking the logarithm
popular[popular['item_id']==1093910]

Unnamed: 0,item_id,sales_value
35956,1093910,8.881784e-16


In [86]:
# take the points above into account
items_probabilities = popular.query('sales_value!=0 & item_id != 1093910').copy()  # .copy() avoids SettingWithCopyWarning on later column assignments
items_probabilities

Unnamed: 0,item_id,sales_value
0,25671,20.94
1,26081,0.99
2,26093,1.59
3,26190,1.54
4,26355,1.98
...,...,...
86859,17330511,9.98
86861,17382205,7.99
86862,17383227,4.49
86863,17827644,2.50


In [87]:
m=500 #1
items_probabilities['log'] = np.log(items_probabilities['sales_value']*m)
items_probabilities

Unnamed: 0,item_id,sales_value,log
0,25671,20.94,9.256269
1,26081,0.99,6.204558
2,26093,1.59,6.678342
3,26190,1.54,6.646391
4,26355,1.98,6.897705
...,...,...,...
86859,17330511,9.98,8.515191
86861,17382205,7.99,8.292799
86862,17383227,4.49,7.716461
86863,17827644,2.50,7.130899


In [88]:
m=500 #1
#items_probabilities['log'] = np.log(items_probabilities['sales_value']*m)

# instructor's comment:
# this problem is usually solved with log(1 + x); NumPy has a dedicated function for it, np.log1p()
# it also handles items with zero sales

items_probabilities['log'] = np.log1p(items_probabilities['sales_value'])
sum_ = items_probabilities.log.sum()
sum_

241205.85742736512


In [89]:
# probability is the notional probability of sampling the item
items_probabilities['probability'] = items_probabilities['log'] / sum_

# the probabilities over all items must sum to one; let's check
items_probabilities.probability.sum()

  

0.9999999999999999


In [90]:
items_probabilities[items_probabilities['probability']<0]
#print(np.log(0.5))

Unnamed: 0,item_id,sales_value,log,probability


In [91]:
def weighted_random_recommendation(items_probabilities, n):

    # Hint: this is a modification of random_recommendation()
    # see the np.random.choice documentation for the parameter that sets the sampling probability (p)

    items = np.array(items_probabilities['item_id'])
    probability = np.array(items_probabilities['probability'])
    
    recs = np.random.choice(items, size=n, replace=False, p=probability)
    
    return recs.tolist()

In [92]:
%%time
items_probabilities_ = items_probabilities[['item_id', 'probability']]
#items_probabilities_.columns = ['item_id','probability']
# your_code
weighted_random_recommendation(items_probabilities_,5)

CPU times: user 7.45 ms, sys: 14 µs, total: 7.46 ms
Wall time: 7.23 ms


[847207, 13039092, 822190, 1078460, 929925]

Conclusions on the baselines
They establish a baseline level of quality;
Baselines can serve as filters;
Sometimes a baseline beats an ML model

**2. Deterministic item-item algorithms**

2.1 Item-Item Recommender / ItemKNN

[Figure: user_item_matrix.png]

What exactly goes into the user-item matrix has to be decided from the business logic.

Options for our dataset (not an exhaustive list):
- Fact of purchase (0 / 1)
- Number of purchases (count)
- Purchase amount, rubles
- ...

Deterministic algorithms:
- Predict the very numbers that are stored in the matrix

ML algorithms (most of them):
- Use 0 and 1 as targets "under the hood" (non-zero cell -> target 1)
- Treat the absolute values as error weights

P.S. There are actually many tricks for filling the user-item matrix. We will discuss them in later webinars.
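
The three fill options listed above can be produced with the same `pd.pivot_table` call by changing `values` and `aggfunc`. This is a sketch on made-up toy transactions; the helper name `build_matrix` is hypothetical.

```python
import pandas as pd

# Toy transactions (assumed data)
toy = pd.DataFrame({
    'user_id':     [1, 1, 1, 2, 2],
    'item_id':     [10, 10, 20, 20, 30],
    'quantity':    [1, 2, 1, 3, 1],
    'sales_value': [2.0, 4.0, 1.5, 4.5, 9.0],
})


def build_matrix(values, aggfunc):
    """User-item matrix with the chosen cell content; missing pairs become 0."""
    return pd.pivot_table(toy, index='user_id', columns='item_id',
                          values=values, aggfunc=aggfunc, fill_value=0)


n_purchases = build_matrix('quantity', 'count')     # number of purchase rows
bought_flag = (n_purchases > 0).astype(float)       # fact of purchase (0 / 1)
revenue = build_matrix('sales_value', 'sum')        # purchase amount
```

Whichever variant is chosen, the row and column order of the matrix defines the internal indices that a library such as implicit works with.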

How the Item-Item Recommender works

[Figure: item_item_recommender.png]

Step 1: Find the K nearest users to the target user
Step 2: The predicted "score" of an item = the average "score" of that item among the neighbours
Step 3: Sort the items by predicted score in descending order and take the top-k

Note: KNN does not work if the user has not given a single rating
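
A toy NumPy sketch of neighbour-based scoring, written in its item-based form: similarities are computed between item columns, each item keeps only its `k` most similar items, and a user's scores are accumulated from the items they bought. This is a simplified illustration under assumptions (dense matrices, cosine similarity, a hypothetical function name) and not the actual implementation inside implicit.

```python
import numpy as np


def item_knn_scores(user_item, k):
    """Score items for every user with a toy item-based KNN."""
    norms = np.linalg.norm(user_item, axis=0)
    norms[norms == 0] = 1.0                      # avoid division by zero for empty columns
    normalized = user_item / norms
    similarity = normalized.T @ normalized       # item x item cosine similarity
    # For every item keep its k most similar items and zero out the rest
    drop = np.argsort(-similarity, axis=1)[:, k:]
    pruned = similarity.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    # Each bought item i adds its similarity to item j, if j is among i's k neighbours
    return user_item @ pruned


# Toy binary user-item matrix (assumed data): 3 users x 4 items
user_item = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

scores_k2 = item_knn_scores(user_item, k=2)
scores_k1 = item_knn_scores(user_item, k=1)
```

With `k=1` the only neighbour each item keeps is itself, so the scores collapse to the user's own purchases. The "trick" in section 4.4 of this notebook relies on the same effect.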

(!) Important

Item-item algorithms have a high predict complexity ($O(I^2 \log I)$ or $O(I^3)$, depending on the implementation)
If the dataset has many item_id values, item-item models take a VERY long time to predict. With all items, predict on the test set takes ~2 hours
Let's keep only the 5k most popular items out of ~90k
P.S. Taking the top-X popular items and recommending only from them is a very common strategy.
P.P.S. Recommender systems have many tricks like this. You will see similar ones more than once in this course

In [93]:
popularity = data_train.groupby('item_id')['quantity'].sum().reset_index()
popularity.rename(columns={'quantity': 'n_sold'}, inplace=True)

popularity.head()

Unnamed: 0,item_id,n_sold
0,25671,6
1,26081,1
2,26093,1
3,26190,1
4,26355,2


In [94]:
top_5000 = popularity.sort_values('n_sold', ascending=False).head(5000).item_id.tolist()

In [None]:
top_5000

In [96]:
# Introduce a dummy item_id: every purchase outside the top-5000 is counted as a purchase of this item
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

user_item_matrix = pd.pivot_table(data_train, 
                                  index='user_id', columns='item_id', 
                                  values='quantity',
                                  aggfunc='count', 
                                  fill_value=0
                                 )

user_item_matrix[user_item_matrix > 0] = 1 # since in the end we want to predict a 0/1 value
user_item_matrix = user_item_matrix.astype(float) # matrix dtype required by implicit

# convert to sparse matrix format
sparse_user_item = csr_matrix(user_item_matrix).tocsr()

user_item_matrix.head()

item_id,202291,397896,420647,480014,545926,707683,731106,818980,819063,819227,...,15778533,15831255,15926712,15926775,15926844,15926886,15927403,15927661,15927850,16809471
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [97]:
user_item_matrix.iloc[5]

item_id
202291      0.0
397896      0.0
420647      0.0
480014      0.0
545926      0.0
           ... 
15926886    0.0
15927403    0.0
15927661    0.0
15927850    0.0
16809471    0.0
Name: 6, Length: 5001, dtype: float64

In [98]:
# check how dense the matrix is (share of non-zero cells, %)
user_item_matrix.sum().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]) * 100

# 5.3% is fairly high, because we removed the unpopular items

5.33770796861036

In [99]:
# IMPORTANT: we have to keep the mapping between real item/user IDs and matrix indices,
# because implicit does not store it

userids = user_item_matrix.index.values
itemids = user_item_matrix.columns.values

matrix_userids = np.arange(len(userids))
matrix_itemids = np.arange(len(itemids))

id_to_itemid = dict(zip(matrix_itemids, itemids))
id_to_userid = dict(zip(matrix_userids, userids))

itemid_to_id = dict(zip(itemids, matrix_itemids))
userid_to_id = dict(zip(userids, matrix_userids))

In [100]:
# all users
u = np.array(list(userid_to_id.keys()))

In [101]:
# Interaction matrix for specific users
# r = user_item_matrix.loc[user_item_matrix.index.isin([2,3])]
r = user_item_matrix.loc[user_item_matrix.index == 1]
r = csr_matrix(user_item_matrix.loc[user_item_matrix.index == 1]).tocsr()


In [102]:
%%time

model = ItemItemRecommender(K=3, num_threads=4) # K - number of nearest neighbours

# model.fit(csr_matrix(user_item_matrix).T.tocsr(),  # input: item-user matrix
model.fit(csr_matrix(user_item_matrix).tocsr(),  # input: the user-item matrix, not transposed
          show_progress=True)

  0%|          | 0/5001 [00:00<?, ?it/s]

CPU times: user 2.21 s, sys: 29.2 ms, total: 2.24 s
Wall time: 1.86 s


In [103]:
# # calculate the top recommendations for a single user
# ids, scores = model.recommend(0, user_items[0])

# # calculate the top recommendations for a batch of users
# userids = np.arange(10)
# ids, scores = model.recommend(userids, user_items[userids])
recs = model.recommend(userid=1, N=5, # userid=userid_to_id[2],  # userid - id from 0 to N
                        #recalculate_user=True)
                        user_items=r,
                        filter_already_liked_items=False, 
                        filter_items=[itemid_to_id[999999]], 
                        recalculate_user=False
                       )
recs

(array([ 300, 2757, 3408, 2148, 2307], dtype=int32),
 array([ 1284.,  1317., 56256.,  2953.,  1402.]))

In [104]:
# Recommendations for all users
recs = model.recommend(userid=np.array(list(userid_to_id.keys())), N=5, # userid - id from 0 to N
                        user_items=csr_matrix(user_item_matrix).tocsr(),
                        filter_already_liked_items=False, 
                        filter_items=[itemid_to_id[999999]], 
                        recalculate_user=False
                       )
len(recs[0])

2499

In [105]:
# the second returned value is a relevance score, not a probability
# the higher it is, the better
# print([i for rec in recs[0] for i in rec])
for rec in recs[0]:
  print([id_to_itemid[i] for i in rec])
  break

[840361, 1029743, 1082185, 981760, 995242]


In [106]:
%%time

result['itemitem'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=csr_matrix(user_item_matrix.loc[user_item_matrix.index == x]).tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=[itemid_to_id[999999]], 
                                    recalculate_user=True)[0]])

CPU times: user 1.64 s, sys: 11.8 ms, total: 1.65 s
Wall time: 1.65 s


In [107]:
result.head(2)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[9655444, 9884159, 16223315, 13154969, 2635976]","[6534178, 6533889, 1029743, 6534166, 1082185]","[840361, 1029743, 1082185, 981760, 995242]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[922486, 2171174, 1212361, 15717202, 893367]","[6534178, 6533889, 1029743, 6534166, 1082185]","[826249, 6534178, 1082185, 1098066, 981760]"


**4.2 Cosine similarity and CosineRecommender**

In [108]:
%%time

model = CosineRecommender(K=5, num_threads=4) # K - number of nearest neighbours

model.fit(csr_matrix(user_item_matrix).tocsr(), 
          show_progress=True)

  0%|          | 0/5001 [00:00<?, ?it/s]

CPU times: user 2.22 s, sys: 29.8 ms, total: 2.25 s
Wall time: 1.73 s


In [109]:
recs = model.recommend(userid=userid_to_id[1], 
                        user_items=csr_matrix(user_item_matrix.loc[user_item_matrix.index == 1]).tocsr(),   # input: user-item matrix
                        N=5, 
                        filter_already_liked_items=False, 
                        filter_items=[itemid_to_id[999999]], 
                        recalculate_user=False)

In [110]:
%%time

result['cosine'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=csr_matrix(user_item_matrix.loc[user_item_matrix.index == x]).tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=[itemid_to_id[999999]], 
                                    recalculate_user=False)[0]])

CPU times: user 1.6 s, sys: 10.9 ms, total: 1.61 s
Wall time: 1.62 s


In [176]:
user_item_matrix.head()

item_id,202291,397896,420647,480014,545926,707683,731106,818980,819063,819227,...,15778533,15831255,15926712,15926775,15926844,15926886,15927403,15927661,15927850,16809471
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [177]:
? TFIDFRecommender

In [111]:
%%time

model = TFIDFRecommender(K=3, num_threads=4) # K - number of nearest neighbours

model.fit(csr_matrix(user_item_matrix).tocsr(), 
          show_progress=True)

  0%|          | 0/5001 [00:00<?, ?it/s]

CPU times: user 2.28 s, sys: 36.6 ms, total: 2.32 s
Wall time: 1.93 s


In [None]:
? TFIDFRecommender

In [112]:
recs = model.recommend(userid=userid_to_id[1], 
                        user_items=csr_matrix(user_item_matrix.loc[user_item_matrix.index == 1]).tocsr(),   # input: user-item matrix
                        N=5, 
                        filter_already_liked_items=False, 
                        filter_items=[itemid_to_id[999999]], 
                        recalculate_user=False)

In [113]:
%%time

result['tfidf'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=csr_matrix(user_item_matrix.loc[user_item_matrix.index == x]).tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=[itemid_to_id[999999]], 
                                    recalculate_user=True)[0]])

CPU times: user 1.62 s, sys: 11.7 ms, total: 1.63 s
Wall time: 1.64 s


In [114]:
result.head(2)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[9655444, 9884159, 16223315, 13154969, 2635976]","[6534178, 6533889, 1029743, 6534166, 1082185]","[840361, 1029743, 1082185, 981760, 995242]","[961554, 1098066, 1127831, 981760, 1082185]","[840361, 961554, 1082185, 981760, 1127831]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[922486, 2171174, 1212361, 15717202, 893367]","[6534178, 6533889, 1029743, 6534166, 1082185]","[826249, 6534178, 1082185, 1098066, 981760]","[883404, 981760, 826249, 1098066, 1082185]","[981760, 883404, 1082185, 1098066, 826249]"


### 4.4 A trick

In [115]:
%%time

model = ItemItemRecommender(K=1, num_threads=4) # K - number of nearest neighbours
# the nearest neighbour of a user is the user themselves: the user's own purchases match their purchases 100%
# so we recommend from the user's own purchases. Sometimes this is quite useful. Result: 21.9%, a good result.

model.fit(csr_matrix(user_item_matrix).tocsr(), 
          show_progress=True)

  0%|          | 0/5001 [00:00<?, ?it/s]

CPU times: user 2.18 s, sys: 43 ms, total: 2.22 s
Wall time: 1.81 s


In [116]:
recs = model.recommend(userid=userid_to_id[1], 
                        user_items=csr_matrix(user_item_matrix).tocsr(),   # input: user-item matrix
                        N=5, 
                        filter_already_liked_items=False, 
                        filter_items=[itemid_to_id[999999]], 
                        recalculate_user=False)

In [117]:
[id_to_itemid[rec] for rec in recs[0]]

[1081177, 995785, 1004906, 1082185, 1029743]

In [118]:
%%time

result['own_purchases'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=csr_matrix(user_item_matrix.loc[user_item_matrix.index == x]).tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=[itemid_to_id[999999]], 
                                    recalculate_user=True)[0]])

CPU times: user 1.62 s, sys: 12 ms, total: 1.64 s
Wall time: 1.64 s


### 4.5 Measure quality with precision@5


In [119]:
result.head(2)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[9655444, 9884159, 16223315, 13154969, 2635976]","[6534178, 6533889, 1029743, 6534166, 1082185]","[840361, 1029743, 1082185, 981760, 995242]","[961554, 1098066, 1127831, 981760, 1082185]","[840361, 961554, 1082185, 981760, 1127831]","[1081177, 995785, 1004906, 1082185, 1029743]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[922486, 2171174, 1212361, 15717202, 893367]","[6534178, 6533889, 1029743, 6534166, 1082185]","[826249, 6534178, 1082185, 1098066, 981760]","[883404, 981760, 826249, 1098066, 1082185]","[981760, 883404, 1082185, 1098066, 826249]","[1068719, 1127831, 1098066, 1082185, 6534178]"


In [120]:
def typess(df):
    # print the Python type of the first-row value in every column
    for j in range(df.shape[1]):
        print(type(df.iloc[0, j]))

In [121]:
typess(result)

<class 'numpy.int64'>
<class 'numpy.ndarray'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>


In [None]:
# force-convert numpy.ndarray to list (column-wise, avoiding chained assignment)
result['actual'] = result['actual'].apply(lambda items: np.asarray(items).tolist())

result['actual'][1], type(result['actual'][1])

In [None]:
result['actual'][1], type(result['actual'][1])

In [124]:
result['random_recommendation'][1], type(result['random_recommendation'][1])

([922486, 2171174, 1212361, 15717202, 893367], list)

Can the baselines be improved if they are computed on the top-5000 items only?
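
One way to restrict the data to the most popular items, sketched under assumptions: purchases live in a DataFrame with `item_id` and `quantity` columns (hypothetical names here), and every item outside the top N is collapsed into a placeholder id. The placeholder `999999` mirrors the id excluded through `filter_items` in the `model.recommend` calls above; `keep_top_items` is an illustrative helper, not a function from this notebook.

```python
import pandas as pd

def keep_top_items(purchases, n_top, dummy_id=999999):
    """Return a copy where items outside the n_top best sellers get dummy_id."""
    # total quantity sold per item, most popular first
    popularity = purchases.groupby('item_id')['quantity'].sum()
    top_items = popularity.sort_values(ascending=False).head(n_top).index
    filtered = purchases.copy()
    # everything that is not a top item is replaced by the placeholder id
    filtered.loc[~filtered['item_id'].isin(top_items), 'item_id'] = dummy_id
    return filtered

# toy data, made up for illustration
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'item_id':  [10, 20, 10, 30, 40],
    'quantity': [5, 1, 4, 2, 1],
})
top2 = keep_top_items(toy, n_top=2)
print(top2['item_id'].tolist())
```

With the real data, `n_top=5000` would correspond to the top-5000 setting asked about here.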

In [125]:
def precision_at_k(recommended_list, bought_list, k=5):
    """Share of the top-k recommended items that the user actually bought."""
    bought_list = np.array(bought_list)  # no [:k] here: compare against all purchases
    recommended_list = np.array(recommended_list)[:k]

    # flag each recommended item that appears among the purchases
    flags = np.isin(recommended_list, bought_list)

    precision = flags.sum() / len(recommended_list)

    return precision
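
As a quick sanity check of the metric, here is a toy call. The recommended list reuses the ids shown for user 1 above; the purchase list is made up for illustration, and the function is a compact restatement so that the snippet runs on its own.

```python
import numpy as np

def precision_at_k(recommended_list, bought_list, k=5):
    # keep only the first k recommendations, compare against all purchases
    recommended = np.array(recommended_list)[:k]
    hits = np.isin(recommended, np.array(bought_list))
    return hits.sum() / len(recommended)

recommended = [1081177, 995785, 1004906, 1082185, 1029743]
bought = [1082185, 1029743, 821867]  # made-up purchases for the example

p = precision_at_k(recommended, bought)
print(p)  # 2 of the 5 recommended items were bought -> 0.4
```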

In [126]:
result.apply(lambda row: precision_at_k(row['random_recommendation'], row['actual']), axis=1).mean()

0.0007835455435847209

In [127]:
result.apply(lambda row: precision_at_k(row['popular_recommendation'], row['actual']), axis=1).mean()

0.15523996082272282

In [128]:
result.apply(lambda row: precision_at_k(row['itemitem'], row['actual']), axis=1).mean()

0.21897649363369248

In [129]:
result.apply(lambda row: precision_at_k(row['cosine'], row['actual']), axis=1).mean()

0.1551420176297747

In [130]:
result.apply(lambda row: precision_at_k(row['tfidf'], row['actual']), axis=1).mean()

0.16865817825661114

In [131]:
result.apply(lambda row: precision_at_k(row['own_purchases'], row['actual']), axis=1).mean()

0.20191740412979353

In [133]:
print(type(result.iloc[0, 0]))


<class 'numpy.int64'>


In [134]:
typess(result)

<class 'numpy.int64'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>


**Neural network**

In [16]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers

In [17]:
user_features.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,household_key
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13
4,45-54,B,50-74K,Homeowner,Single Female,1,None/Unknown,16


I decided to prepare a matrix to feed into the neural network. I took two categorical columns, age (AGE_DESC) and household composition (HH_COMP_DESC), and expanded each of them into one column per category holding 1/0 values.
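
A minimal sketch of this one-hot step on a toy frame, assuming the same column names. It uses the `columns=` argument of `pd.get_dummies`, which prefixes the new columns with the source column name; the notebook cells below instead concatenate unprefixed dummies column by column.

```python
import pandas as pd

# three toy rows taken from the user_features preview
toy = pd.DataFrame({
    'user_id': [1, 7, 8],
    'AGE_DESC': ['65+', '45-54', '25-34'],
    'HH_COMP_DESC': ['2 Adults No Kids', '2 Adults No Kids', '2 Adults Kids'],
})

# each listed column is replaced by one 0/1 column per category
encoded = pd.get_dummies(toy, columns=['AGE_DESC', 'HH_COMP_DESC'], dtype=int)
print(encoded.columns.tolist())
```

The prefixes keep the column names unambiguous if two source columns ever share a category label such as `Unknown`.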

In [18]:
user_features['AGE_DESC'].value_counts()

45-54    288
35-44    194
25-34    142
65+       72
55-64     59
19-24     46
Name: AGE_DESC, dtype: int64

In [19]:
siv_user_feat=pd.concat([user_features, pd.get_dummies(user_features['AGE_DESC'])], axis=1)
siv_user_feat.rename(columns={'household_key': 'user_id'}, inplace=True)

In [20]:
siv_user_feat.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,user_id,19-24,25-34,35-44,45-54,55-64,65+
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1,0,0,0,0,0,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7,0,0,0,1,0,0
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8,0,1,0,0,0,0
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13,0,1,0,0,0,0
4,45-54,B,50-74K,Homeowner,Single Female,1,None/Unknown,16,0,0,0,1,0,0


In [21]:
siv_user_feat['HH_COMP_DESC'].value_counts()

2 Adults No Kids    255
2 Adults Kids       187
Single Female       144
Single Male          95
Unknown              73
1 Adult Kids         47
Name: HH_COMP_DESC, dtype: int64

In [22]:
popular_z=siv_user_feat['HH_COMP_DESC'].mode()[0]

In [23]:
siv_user_feat.replace({'HH_COMP_DESC':{'Unknown':popular_z}}, inplace=True) # replace Unknown with the most frequent value

In [24]:
siv_user_feat['HH_COMP_DESC'].value_counts()

2 Adults No Kids    328
2 Adults Kids       187
Single Female       144
Single Male          95
1 Adult Kids         47
Name: HH_COMP_DESC, dtype: int64

In [25]:
siv_user_feat=pd.concat([siv_user_feat, pd.get_dummies(siv_user_feat['HH_COMP_DESC'])], axis=1)
siv_user_feat.head()

Unnamed: 0,AGE_DESC,MARITAL_STATUS_CODE,INCOME_DESC,HOMEOWNER_DESC,HH_COMP_DESC,HOUSEHOLD_SIZE_DESC,KID_CATEGORY_DESC,user_id,19-24,25-34,35-44,45-54,55-64,65+,1 Adult Kids,2 Adults Kids,2 Adults No Kids,Single Female,Single Male
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1,0,0,0,0,0,1,0,0,1,0,0
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7,0,0,0,1,0,0,0,0,1,0,0
2,25-34,U,25-34K,Unknown,2 Adults Kids,3,1,8,0,1,0,0,0,0,0,1,0,0,0
3,25-34,U,75-99K,Homeowner,2 Adults Kids,4,2,13,0,1,0,0,0,0,0,1,0,0,0
4,45-54,B,50-74K,Homeowner,Single Female,1,None/Unknown,16,0,0,0,1,0,0,0,0,0,1,0


In [26]:
# drop the original descriptive columns; only user_id and the 0/1 columns remain
siv_user_feat = siv_user_feat.drop(columns=['AGE_DESC', 'MARITAL_STATUS_CODE', 'INCOME_DESC',
                                            'HOMEOWNER_DESC', 'HH_COMP_DESC',
                                            'HOUSEHOLD_SIZE_DESC', 'KID_CATEGORY_DESC'])
siv_user_feat.head()

Unnamed: 0,user_id,19-24,25-34,35-44,45-54,55-64,65+,1 Adult Kids,2 Adults Kids,2 Adults No Kids,Single Female,Single Male
0,1,0,0,0,0,0,1,0,0,1,0,0
1,7,0,0,0,1,0,0,0,0,1,0,0
2,8,0,1,0,0,0,0,0,1,0,0,0
3,13,0,1,0,0,0,0,0,1,0,0,0
4,16,0,0,0,1,0,0,0,0,0,1,0


Next we need to add the item data. The plan is to add 5000 columns, one per item_id, each holding 1/0. But first, for each user, we attach the list of actual purchases over the last 3 weeks.
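
The planned 1/0 item columns could be built roughly as follows. This is a sketch, assuming the purchases come as a Series of item-id lists indexed by user and that the item vocabulary (for example the top-5000 ids) is given as a list; `multi_hot` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def multi_hot(purchases, item_ids):
    """Build a users x items 0/1 DataFrame from a Series of purchase lists."""
    col_of = {item: j for j, item in enumerate(item_ids)}
    matrix = np.zeros((len(purchases), len(item_ids)), dtype=np.uint8)
    for row, items in enumerate(purchases):
        for item in items:
            j = col_of.get(item)
            if j is not None:  # skip items outside the chosen vocabulary
                matrix[row, j] = 1
    return pd.DataFrame(matrix, index=purchases.index, columns=item_ids)

# toy purchase lists for three users and a three-item vocabulary
actual = pd.Series([[821867, 834484], [835476, 821867], [1082185]], index=[1, 3, 16])
vocab = [821867, 834484, 835476]

targets = multi_hot(actual, vocab)
print(targets)
```

User 16 ends up with an all-zero row because item 1082185 is not in the toy vocabulary.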

In [27]:
result_rezerv.head()

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."
3,7,"[840386, 889774, 898068, 909714, 929067, 95347..."
4,8,"[835098, 872137, 910439, 924610, 992977, 10412..."


In [28]:
df=pd.merge(siv_user_feat, result_rezerv, on='user_id', how='left')
df.head()

Unnamed: 0,user_id,19-24,25-34,35-44,45-54,55-64,65+,1 Adult Kids,2 Adults Kids,2 Adults No Kids,Single Female,Single Male,actual
0,1,0,0,0,0,0,1,0,0,1,0,0,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,7,0,0,0,1,0,0,0,0,1,0,0,"[840386, 889774, 898068, 909714, 929067, 95347..."
2,8,0,1,0,0,0,0,0,1,0,0,0,"[835098, 872137, 910439, 924610, 992977, 10412..."
3,13,0,1,0,0,0,0,0,1,0,0,0,"[6534178, 1104146, 829197, 840361, 862070, 884..."
4,16,0,0,0,1,0,0,0,0,0,1,0,"[1062973, 1082185, 13007710]"


In [None]:
df.info()

In [29]:
from sklearn.model_selection import train_test_split

# split original DataFrame into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=0)

print(train.shape, test.shape)

(640, 13) (161, 13)


In [30]:
y_train=train['actual']
y_train.head()

364    [970119, 1023958, 844165, 853317, 900370, 9169...
458    [823758, 826790, 844179, 849505, 856252, 95537...
76     [852856, 953837, 1080941, 1082185, 1127831, 71...
64     [846864, 964151, 1047944, 1048727, 1052912, 69...
638    [868645, 996425, 6534178, 848015, 873622, 8823...
Name: actual, dtype: object

In [63]:
x_train=train.drop('user_id', axis=1)
x_train.head()

Unnamed: 0,19-24,25-34,35-44,45-54,55-64,65+,1 Adult Kids,2 Adults Kids,2 Adults No Kids,Single Female,Single Male,actual
364,0,0,0,1,0,0,0,0,0,0,1,"[970119, 1023958, 844165, 853317, 900370, 9169..."
458,0,0,1,0,0,0,0,0,1,0,0,"[823758, 826790, 844179, 849505, 856252, 95537..."
76,0,0,0,0,1,0,0,0,1,0,0,"[852856, 953837, 1080941, 1082185, 1127831, 71..."
64,0,0,0,1,0,0,0,0,0,0,1,"[846864, 964151, 1047944, 1048727, 1052912, 69..."
638,0,0,0,1,0,0,0,0,0,0,1,"[868645, 996425, 6534178, 848015, 873622, 8823..."


In [64]:
x_train=x_train.drop('actual', axis=1)
x_train.head()

Unnamed: 0,19-24,25-34,35-44,45-54,55-64,65+,1 Adult Kids,2 Adults Kids,2 Adults No Kids,Single Female,Single Male
364,0,0,0,1,0,0,0,0,0,0,1
458,0,0,1,0,0,0,0,0,1,0,0
76,0,0,0,0,1,0,0,0,1,0,0
64,0,0,0,1,0,0,0,0,0,0,1
638,0,0,0,1,0,0,0,0,0,0,1


In [65]:
x_train.shape

(640, 11)

In [72]:
tf.convert_to_tensor(x_train)

<tf.Tensor: shape=(640, 11), dtype=uint8, numpy=
array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 1, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]], dtype=uint8)>

In [None]:
tf.convert_to_tensor(y_train)

In [None]:
trainset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)  # features and targets are passed as one tuple
#validationset = tf.data.Dataset.from_tensor_slices((
#    dict(x_val), dict(y_val))).batch(32)

In [55]:
x_train_nmp=x_train.values
y_train_nmp=y_train.values

In [56]:
print(x_train_nmp[0].shape,x_train_nmp[0].dtype)

(12,) int64


In [33]:
y_test=test['actual']
x_test=test.drop(['user_id', 'actual'], axis=1)  # same 11 feature columns as x_train
x_test_nmp=x_test.values
y_test_nmp=y_test.values

In [169]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, Layer

In [None]:
#x_train = x_train.reshape(-1, 12)
#x_test = x_test.reshape(-1, 12)

#x_train_nmp.shape
#x_train_nmp.dtype

In [None]:
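# NOTE: to_categorical expects integer class labels in [0, num_classes);
# 'actual' holds variable-length lists of item ids, so these calls are unlikely to work as written
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
x_train.shape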

In [67]:
model = keras.Sequential([
    layers.Input(shape=(11,)),
    layers.Dense(256, activation='relu'),
    #layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='relu'),
    layers.Dense(5, activation='softmax'),
])

In [74]:
model.compile(optimizer='adam',
             loss='categorical_crossentropy',
             metrics=['accuracy'])

In [75]:
model.fit(x_train, y_train, batch_size=64, epochs=10)

ValueError: ignored
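
The `ValueError` above is not shown in full, so its cause is an assumption here: `y_train` is probably still a Series of variable-length item-id lists, which Keras cannot convert to a tensor, and the 5-unit softmax output would not match such targets either. Below is a minimal sketch of one setup that does train, assuming the targets are first turned into a fixed-width 0/1 matrix (one column per item) and the output layer has one sigmoid unit per item with binary cross-entropy. The data is random and the sizes (`n_items = 20`, 64 users) are placeholders, not the notebook's values.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_users, n_features, n_items = 64, 11, 20  # placeholder sizes

# random stand-ins for the one-hot user features and the 0/1 purchase targets
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(n_users, n_features)).astype('float32')
y = rng.integers(0, 2, size=(n_users, n_items)).astype('float32')

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(n_items, activation='sigmoid'),  # independent score per item
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, y, batch_size=16, epochs=2, verbose=0)

# per-item scores for the first user, then the 5 highest-scoring item columns
scores = model.predict(x[:1], verbose=0)[0]
top5 = np.argsort(-scores)[:5]
print(top5)
```

The sigmoid output treats every item as its own yes/no question, which fits a target where a user can buy many items at once; a softmax would force the scores to compete for a single class.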

In [None]:
model.summary()