This notebook was used as a reference in my project.

This notebook was taken from [Kaggle](https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch/notebook)

# Wide and Deep Learning for RecSys with PyTorch

This notebook was inspired by the "Wide & Deep Learning for Recommender Systems" [paper](https://arxiv.org/pdf/1606.07792.pdf) by Google. In this paper, the authors propose an interesting neural network architecture for recommender systems  
![](https://miro.medium.com/max/875/1*1jA7Qt71aMK_qG89tfUOoA.png)  
I was struggling to find an implementation of this architecture, so I decided to write my own using PyTorch

# Data loading

The data I chose for implementing this architecture is the MovieLens 100K dataset. It has some key advantages:
- Popular. I bet you already know it or have at least heard of it
- Simple. Just user ratings for a number of movies and a bit of meta information
- Varied. It allows constructing binary features, like previously watched films, as well as some continuous features important for the deep part of the network
- Small. It has only 100K ratings and a limited number of users and features, so training won't take too long

And also one major drawback:
- The dataset has no information for generating the cross-product of user-installed apps and impression apps as in the original paper
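
For context, a cross-product feature in the paper is simply an AND over a chosen set of binary features. Below is a minimal sketch of that idea; the feature layout (two installed-app flags and one impression-app flag) and the helper name `cross_product` are made up for illustration and do not exist in MovieLens.

```python
import numpy as np

def cross_product(x, feature_idx):
    """Return 1 only if every selected binary feature in x equals 1."""
    return int(np.all(x[feature_idx] == 1))

# Hypothetical binary features: [installed_app=A, installed_app=B, impression_app=C]
x = np.array([1, 0, 1])

a_and_c = cross_product(x, [0, 2])  # A is installed AND C is shown
b_and_c = cross_product(x, [1, 2])  # B is installed AND C is shown
print(a_and_c, b_and_c)
```

Without impression data there is nothing to cross the user history with, which is why this kind of feature cannot be built here.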

**The data consists of:**  
Information about when and how a user rated a movie

In [None]:
import pandas as pd

#Load the Ratings data
data = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.data', sep="\t", header=None)
data.columns = ['user id', 'movie id', 'rating', 'timestamp']
data.head()

Additional information about each user such as age, gender, occupation and zip code

In [None]:
#Load the User data
users = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.user', 
                    sep="|", encoding='latin-1', header=None)
users.columns = ['user id', 'age', 'gender', 'occupation', 'zip code']
users.head()

Additional information about each movie such as title, release date and genre

In [None]:
#Load movie data
items = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.item', 
                    sep="|", encoding='latin-1', header=None)
items.columns = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 
                 'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 
                 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 
                 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items.head()

The list of all genres represented in the dataset

In [None]:
GENRES = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.genre', 
                     sep="|", header=None, usecols=[0])[0].tolist()
GENRES

# EDA

Let's take a closer look at the data

There are 943 users and 1682 movies in total

In [None]:
print(
    (f"Number of users: {users['user id'].nunique()}\n" 
    f"Number of movies: {items['movie id'].nunique()}")
)

Movies are most often rated 3 or 4 stars out of five

In [None]:
data['rating'].value_counts().sort_index().plot.bar()

Most users are between 20 and 30 years old

In [None]:
users['age'].value_counts().sort_index().plot.bar(figsize=(12, 8))

There are more than two male users for each female user in this dataset

In [None]:
users['gender'].value_counts().plot.bar()

Not surprisingly, the most popular occupation for such young users is student

In [None]:
users['occupation'].value_counts().plot.bar()

# Features and target

It's time to define a target for this data. I chose to predict the next watched movie. I also use the user's running mean rating as a feature

In [None]:
dataset = data.sort_values(['user id', 'timestamp']).reset_index(drop=True)
dataset['one'] = 1
dataset['sample_num'] = dataset.groupby('user id')['one'].cumsum()

dataset['target'] = dataset.groupby('user id')['movie id'].shift(-1)
dataset['mean_rate'] = dataset.groupby('user id')['rating'].cumsum() / dataset['sample_num']

dataset.head()
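
To see what these two columns contain, here is a minimal sketch on a toy frame. The ids and ratings are made up, and `cumcount() + 1` stands in for the `sample_num` column; it is only meant to show how `shift(-1)` and the running mean behave inside each user group.

```python
import pandas as pd

# Toy ratings, already sorted by user and time (made-up values)
toy = pd.DataFrame({
    'user id':  [1, 1, 1, 2, 2],
    'movie id': [10, 20, 30, 10, 40],
    'rating':   [4, 2, 3, 5, 1],
})

# Target: the next movie of the same user (NaN for the user's last row)
toy['target'] = toy.groupby('user id')['movie id'].shift(-1)

# Running mean rating: cumulative sum divided by the number of ratings so far
seen_so_far = toy.groupby('user id').cumcount() + 1
toy['mean_rate'] = toy.groupby('user id')['rating'].cumsum() / seen_so_far
print(toy)
```

The last row of each user has no target, which is why such rows are filtered out before training.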

The next kind of feature I need for the wide and deep architecture is "user history" features, so I keep the list of previously watched films for every new film that a user rated

In [None]:
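dataset['prev movies'] = dataset['movie id'].astype(str)
# group_keys=False keeps the original row index, so the result aligns with `dataset`
dataset['prev movies'] = dataset.groupby('user id', group_keys=False)['prev movies'].apply(
    lambda x: (x + ' ').cumsum().str.strip())
# Split the accumulated string into a list of integer movie ids
dataset['prev movies'] = dataset['prev movies'].apply(lambda x: [int(m) for m in x.split()])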
dataset.head()
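
The cumulative history can also be pictured with plain Python lists. This is a minimal sketch on made-up movie ids for a single user, not the notebook's data: each row's history holds every movie the user has rated so far, including the current one.

```python
from itertools import accumulate

# Movies one hypothetical user rated, in time order
movie_ids = [10, 20, 30]

# Grow the history one movie at a time: [10], [10, 20], [10, 20, 30]
history = list(accumulate(([m] for m in movie_ids), lambda acc, m: acc + m))
print(history)
```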

I also need continuous features. First, I use movie meta information to generate features like the user's mean rating by genre and the share of the user's watched movies by genre

In [None]:
dataset = dataset.merge(items[['movie id'] + GENRES], on='movie id', how='left')
for genre in GENRES:
    dataset[f'{genre}_rate'] = dataset[genre]*dataset['rating']
    dataset[genre] = dataset.groupby('user id')[genre].cumsum()
    dataset[f'{genre}_rate'] = dataset.groupby('user id')[f'{genre}_rate'].cumsum() / dataset[genre]

dataset[GENRES] = dataset[GENRES].apply(lambda x: x / dataset['sample_num'])
dataset.head()

Second, I use user meta information to generate a gender feature and one-hot encoded occupation features

In [None]:
dataset = dataset.merge(users, on='user id', how='left')
dataset['gender'] = (dataset['gender'] == 'M').astype(int)
dataset = pd.concat([dataset.drop('occupation', axis=1), pd.get_dummies(dataset['occupation'])], axis=1)
dataset.drop('other', axis=1, inplace=True)
dataset.drop('zip code', axis=1, inplace=True)
dataset.head()

Finally, I transform the list of previously watched films to a sparse format. For that I use a SciPy COO matrix

In [None]:
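def get_coo_indexes(lil):
    """Flatten a list of lists into parallel (row, col) index lists for a COO matrix."""
    rows = []
    cols = []
    for i, el in enumerate(lil):
        # wrap a single value so it is handled like a one-element list
        if not isinstance(el, list):
            el = [el]
        for j in el:
            rows.append(i)
            cols.append(j)
    return rows, cols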

In [None]:
from scipy.sparse import coo_matrix
import numpy as np

def get_sparse_features(series, shape):
    coo_indexes = get_coo_indexes(series.tolist())
    sparse_df = coo_matrix((np.ones(len(coo_indexes[0])), (coo_indexes[0], coo_indexes[1])), shape=shape)
    return sparse_df

In [None]:
get_sparse_features(dataset['prev movies'], (len(dataset), dataset['movie id'].max()+1))
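
A minimal sketch of what this conversion produces, using three made-up history lists: each row gets a 1 in the column of every movie in its history. The number of columns is the largest movie id plus one, mirroring the `max()+1` used above.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy "prev movies" lists, one per row (made-up ids)
histories = [[1], [1, 3], [2]]

# One (row, col) pair per watched movie
rows = [i for i, h in enumerate(histories) for _ in h]
cols = [m for h in histories for m in h]

# 3 rows, and movie ids 0..3 as columns (largest id + 1)
toy_sparse = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(3, 4))
dense = toy_sparse.toarray()
print(dense)
```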

# Data split

The dataset authors provide a train/test split, but I won't use it because it ignores timestamps. Instead, I split the data by time

In [None]:
COLD_START_TRESH = 5
TEST_SIZE = 0.2

In [None]:
filtred_data = dataset[(dataset['sample_num'] >= COLD_START_TRESH) &
                       ~(dataset['target'].isna())].sort_values('timestamp')
train_data = filtred_data[:int(len(filtred_data)*(1-TEST_SIZE))]
test_data = filtred_data[int(len(filtred_data)*(1-TEST_SIZE)):]
train_data.shape, test_data.shape
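
The idea behind the split, as a minimal sketch in plain Python: sort by time, then cut once so that every test record is at least as recent as every train record. The helper name `time_split` and the toy records are made up for illustration.

```python
def time_split(records, test_size=0.2, key='timestamp'):
    """Sort records by time and put the most recent `test_size` share into test."""
    ordered = sorted(records, key=lambda r: r[key])
    cut = int(len(ordered) * (1 - test_size))
    return ordered[:cut], ordered[cut:]

# Toy records with made-up timestamps, deliberately out of order
records = [{'timestamp': t} for t in [5, 1, 4, 2, 3]]
train, test = time_split(records)
print(len(train), len(test))
```

A random split would let the model train on ratings made after the ones it is tested on, which is exactly what the time-based cut avoids.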

Let's look at how the data was split between train and test

In [None]:
pd.concat([data['user id'].value_counts().describe(),
           train_data['user id'].value_counts().describe(),
           test_data['user id'].value_counts().describe()],
         axis=1,
         keys=['total', 'train', 'test'])

We have at least 5 films for each user in train. The movie count distribution in train reflects the movie count distribution in the total dataset

In [None]:
for df in [data, train_data, test_data]:
    df.groupby('user id')['movie id'].count().plot.hist(bins=20)

All but 71 movies are present in the train data, so it won't be possible to recommend those 71

In [None]:
print((
    f"Total movies: {data['movie id'].nunique()}\n"
    f"Movies in train: {train_data['movie id'].nunique()}\n"
    f"Movies in test: {test_data['movie id'].nunique()}\n"
))  

In [None]:
X_train = train_data.drop(['user id', 'movie id', 'rating', 'timestamp', 'one', 'sample_num', 'target', 'prev movies'],
                          axis=1)
prev_movies_train = get_sparse_features(train_data['prev movies'], (len(train_data), dataset['movie id'].max()+1))
y_train = train_data['target']

X_test = test_data.drop(['user id', 'movie id', 'rating', 'timestamp', 'one', 'sample_num', 'target', 'prev movies'],
                        axis=1)
prev_movies_test = get_sparse_features(test_data['prev movies'], (len(test_data), dataset['movie id'].max()+1))
y_test = test_data['target']

## Simple baseline

I use a multiclass LightGBM model with no parameter tuning as a baseline

In [None]:
import lightgbm as lgb

params = {
    'objective': 'softmax',
    'num_class': items['movie id'].nunique() + 1,
    'num_iterations': 10,
    'verbose': -1
}
train_data = lgb.Dataset(X_train.reset_index(drop=True), label=y_train, free_raw_data=False)
movies_data_train = lgb.Dataset(prev_movies_train, free_raw_data=False)
train_data = train_data.construct()
movies_data_train = movies_data_train.construct()
train_data = train_data.add_features_from(movies_data_train)
model = lgb.train(params, train_data)

In [None]:
test_data = lgb.Dataset(X_test.reset_index(drop=True), free_raw_data=False)
movies_data_test = lgb.Dataset(prev_movies_test, free_raw_data=False)
test_data = test_data.construct()
movies_data_test = movies_data_test.construct()
test_data = test_data.add_features_from(movies_data_test)
preds_baseline = model.predict(test_data.get_data())
preds_baseline.shape

## The Wide and Deep architecture

Finally, let's get to the wide and deep network architecture

In the original paper, the wide component uses the cross-product of user-installed apps and impression apps. Since the MovieLens data has no impressions, I use only information about previously watched films as features for the wide component

I need to define two functions:
- First, sparse_to_idx converts the sparse matrix of previously watched movies into fixed-length sequences of film indexes. I pad the sequences with zeros so I can later feed them into an embedding layer
- Second, idx_to_sparse does the reverse: it converts a target movie index into a vector of zeros with a one at that index. I will use it later

In [None]:
def sparse_to_idx(data, pad_idx=-1):
    indexes = data.nonzero()
    indexes_df = pd.DataFrame()
    indexes_df['rows'] = indexes[0]
    indexes_df['cols'] = indexes[1]
    mdf = indexes_df.groupby('rows').apply(lambda x: x['cols'].tolist())
    max_len = mdf.apply(lambda x: len(x)).max()
    return mdf.apply(lambda x: pd.Series(x + [pad_idx] * (max_len - len(x)))).values

In [None]:
def idx_to_sparse(idx, sparse_dim):
    sparse = np.zeros(sparse_dim)
    sparse[int(idx)] = 1
    return pd.Series(sparse, dtype=int)
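
The two conversions can be illustrated on toy data with NumPy alone. This is a simplified sketch, not the functions above: `pad_histories` and `one_hot` are hypothetical names, and the inputs are plain lists rather than a sparse matrix.

```python
import numpy as np

def pad_histories(histories, pad_idx=0):
    """Right-pad variable-length index lists to the length of the longest one."""
    max_len = max(len(h) for h in histories)
    return np.array([h + [pad_idx] * (max_len - len(h)) for h in histories])

def one_hot(idx, dim):
    """Vector of zeros with a single one at position idx."""
    vec = np.zeros(dim, dtype=int)
    vec[idx] = 1
    return vec

padded = pad_histories([[1], [1, 3], [2]])
target_vec = one_hot(2, 4)
print(padded)
print(target_vec)
```

Padding with 0 is safe here because MovieLens movie ids start at 1, so index 0 never refers to a real movie.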

In [None]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

Now I construct the input tensors for the network. I need three tensors:
- The tensor with continuous features
- The tensor with previously watched films as a sequence of indexes to feed into the embedding layer
- The tensor with previously watched films as binary features
In [None]:
# Train part
PAD_IDX = 0
# tensor with continuous features
X_train_tensor = torch.Tensor(X_train.fillna(0).values).to(device)
# tensor with binary (multi-hot) features
movies_train_tensor = torch.sparse_coo_tensor(
    indices=prev_movies_train.nonzero(), 
    values=[1]*len(prev_movies_train.nonzero()[0]),
    size=prev_movies_train.shape
).to_dense().to(device)
# tensor with padded sequence of indexes for the embedding layer
movies_train_idx = torch.Tensor(
    sparse_to_idx(prev_movies_train, pad_idx=PAD_IDX),
).long().to(device)
# target
target_train = torch.Tensor(y_train.values).long().to(device)

In [None]:
# Test part
# tensor with continuous features
X_test_tensor = torch.Tensor(X_test.fillna(0).values).to(device)
# tensor with binary (multi-hot) features
movies_test_tensor = torch.sparse_coo_tensor(
    indices=prev_movies_test.nonzero(), 
    values=[1]*len(prev_movies_test.nonzero()[0]),
    size=prev_movies_test.shape
).to_dense().to(device)
# tensor with padded sequence of indexes for the embedding layer
movies_test_idx = torch.Tensor(
    sparse_to_idx(prev_movies_test, pad_idx=PAD_IDX),
).long().to(device)
# target
target_test = torch.Tensor(y_test.values).long().to(device)

And now I define the Wide and Deep architecture as a PyTorch class

In [None]:
from torch import nn, cat, mean

class WideAndDeep(nn.Module):
    def __init__(
        self, 
        continious_feature_shape, # number of continuous features
        embed_size, # size of embedding for binary features
        embed_dict_len, # number of unique binary features
        pad_idx # padding index
    ):
        super(WideAndDeep, self).__init__()
        self.embed = nn.Embedding(embed_dict_len, embed_size, padding_idx=pad_idx)
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(embed_size + continious_feature_shape, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU()
        )
        self.head = nn.Sequential(
            nn.Linear(embed_dict_len + 256, embed_dict_len),
        )

    def forward(self, continious, binary, binary_idx):
        # get embeddings for the padded sequence of indexes: (batch, seq_len, embed_size)
        binary_embed = self.embed(binary_idx)
        # average the embeddings over the sequence dimension: (batch, embed_size)
        binary_embed_mean = mean(binary_embed, dim=1)
        # "deep" part: continuous features + averaged embeddings go through the MLP
        deep_logits = self.linear_relu_stack(cat((continious, binary_embed_mean), dim=1))
        # head: "deep" output + raw binary ("wide") features give one logit per movie
        total_logits = self.head(cat((deep_logits, binary), dim=1))
        return total_logits
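
One detail of the pooling step is worth a look: a plain mean over the sequence dimension divides by the padded length, so rows with short histories get embeddings scaled towards zero. The NumPy sketch below contrasts that with a masked mean that divides by the number of real items. The masked variant is an alternative shown for illustration, not what the model above does, and the embedding table is random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
PAD = 0

# Toy embedding table: 5 ids, embedding size 3; the padding row is all zeros,
# as it is for nn.Embedding with padding_idx set
embed_table = rng.normal(size=(5, 3))
embed_table[PAD] = 0.0

# Two padded histories: [1, 2] and [3]
idx = np.array([[1, 2, 0], [3, 0, 0]])

embedded = embed_table[idx]            # (batch, seq_len, embed_size)
plain_mean = embedded.mean(axis=1)     # divides by seq_len, padding included

mask = (idx != PAD)[..., None]         # (batch, seq_len, 1)
counts = np.maximum(mask.sum(axis=1), 1)
masked_mean = (embedded * mask).sum(axis=1) / counts
print(plain_mean[1], masked_mean[1])
```

For the second row, the masked mean returns the embedding of movie 3 unchanged, while the plain mean returns one third of it.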

In [None]:
model = WideAndDeep(
    X_train.shape[1], 
    16, 
    items['movie id'].nunique() + 1, 
    PAD_IDX
).to(device)
print(model)

Let's train the network

In [None]:
EPOCHS = 10
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for t in range(EPOCHS):
    model.train()
    pred_train = model(X_train_tensor, movies_train_tensor, movies_train_idx)
    loss_train = loss_fn(pred_train, target_train)

    # Backpropagation
    optimizer.zero_grad()
    loss_train.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        pred_test = model(X_test_tensor, movies_test_tensor, movies_test_idx)
        loss_test = loss_fn(pred_test, target_test)
    
    print(f"Epoch {t}")
    print(f"Train loss: {loss_train:>7f}")
    print(f"Test loss: {loss_test:>7f}")

## Compare metrics

To check that the Wide and Deep network is capable of solving the recommendation task, I compare it with the baseline

The first metric I look at is Mean Squared Error

As you can see, my implementation does about twice as well as the baseline

In [None]:
# mse
from sklearn.metrics import mean_squared_error

y_test_sparse = y_test.apply(lambda x: idx_to_sparse(x, items['movie id'].nunique() + 1))
mse_baseline = mean_squared_error(y_test_sparse, preds_baseline)
print(f'Mean squared error for baseline: {mse_baseline:.4f}')

In [None]:
loss = nn.MSELoss()
# pred_test has shape (n_samples, n_movies), so normalize over the movie dimension
softmax = nn.Softmax(dim=1)
mse_wnd = loss(softmax(pred_test), torch.Tensor(y_test_sparse.values).to(device)).cpu().detach().numpy()
print(f'Mean squared error for Wide and Deep: {mse_wnd:.4f}')
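
The metric itself can be sketched on toy numbers with NumPy: turn logits into per-row probabilities with a softmax over the class axis, then compare them with one-hot targets. The logits and targets below are made up, and `softmax_rows` is a hypothetical helper.

```python
import numpy as np

def softmax_rows(logits):
    """Softmax over the class axis, so every row sums to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two samples, three classes (made-up logits)
logits = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]])
targets = np.array([0, 2])

probs = softmax_rows(logits)
one_hot = np.eye(logits.shape[1])[targets]
mse = ((probs - one_hot) ** 2).mean()
print(f'{mse:.4f}')
```

With more than a thousand classes, almost every entry of the one-hot target is zero, so MSE values on the real data are tiny and are best read relative to the baseline rather than in absolute terms.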

The second metric I look at is the mean rank of the next movie in the recommendations

The baseline puts it at a modest 841st place, while Wide and Deep does about 200 places better!

In [None]:
# mean rank
from scipy.stats import rankdata

ranks = pd.DataFrame(preds_baseline).apply(lambda x: pd.Series(rankdata(-x)), axis=1)
ranks_target = (ranks.values * y_test_sparse).sum(axis=1)
mean_rank_baseline = ranks_target.mean()
print(f'Mean rank for baseline: {mean_rank_baseline:.0f}')

In [None]:
preds_wnd = softmax(pred_test).cpu().detach().numpy()
ranks_wnd = pd.DataFrame(preds_wnd).apply(lambda x: pd.Series(rankdata(-x)), axis=1)
ranks_target_wnd = (ranks_wnd.values * y_test_sparse).sum(axis=1)
mean_rank_wnd = ranks_target_wnd.mean()
print(f'Mean rank for Wide and Deep: {mean_rank_wnd:.0f}')
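
The rank of the target can also be computed directly, without building a full rank table: it is one plus the number of movies scored strictly higher than the target. A minimal NumPy sketch on made-up scores follows; `target_ranks` is a hypothetical helper. Note that `rankdata` gives tied scores their average rank by default, so the two approaches agree only when there are no ties.

```python
import numpy as np

def target_ranks(scores, targets):
    """Rank of each row's target: 1 + number of items with a strictly higher score."""
    target_scores = scores[np.arange(len(targets)), targets]
    return 1 + (scores > target_scores[:, None]).sum(axis=1)

# Two samples, three movies (made-up scores)
scores = np.array([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]])
targets = np.array([1, 2])

ranks = target_ranks(scores, targets)
print(ranks, ranks.mean())
```

A lower mean rank is better: a rank of 1 means the true next movie was the top recommendation.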

So that is my implementation, and it seems to work despite all the limitations. Thumbs up!