<a href="https://colab.research.google.com/github/Hitoshi-Nakanishi/Recommendation/blob/master/001_grid_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from pathlib import Path
import logging
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, KFold

from MatrixFactorization.PMF import PMF

root_dir = Path.cwd() / Path('../data/ml-latest-small')
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('my_logger')



In [None]:
movies = pd.read_csv(root_dir / 'movies.csv')
ratings = pd.read_csv(root_dir / 'ratings.csv')
movies.movieId = movies.movieId - 1
ratings.userId = ratings.userId - 1
ratings.movieId = ratings.movieId - 1

In [None]:
display(movies.head(2))
display(ratings.head(2))

Unnamed: 0,movieId,title,genres
0,0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,Jumanji (1995),Adventure|Children|Fantasy


Unnamed: 0,userId,movieId,rating,timestamp
0,0,0,4.0,964982703
1,0,2,4.0,964981247


In [None]:
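# np.int was removed in NumPy 1.24; the builtin int is the drop-in replacement.
# Note: the integer cast truncates half-star ratings (e.g. 4.5 becomes 4).
ratings2 = ratings.query('movieId < 400').iloc[:,:3].values.astype(int)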
print(ratings2.shape)
dims = {}
dims['N'] = ratings2[:,0].max() + 1
dims['M'] = ratings2[:,1].max() + 1

(11572, 3)


# Maximum a posteriori

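To train the Probabilistic Matrix Factorization model, we maximize the posterior distribution with a coordinate ascent algorithm. Each step updates one user vector $u_i$ or one item vector $v_j$ in closed form while the others are held fixed. $\Omega_{u_i}$ is the set of items rated by user $i$, and $\Omega_{v_j}$ is the set of users who rated item $j$.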

\begin{align}
u_i &= \left(\lambda \sigma^2 I + \sum_{j \in \Omega_{u_i}} v_j v_j^T \right)^{-1} \left( \sum_{j \in \Omega_{u_i}} M_{ij} v_j\right) \\
v_j &= \left(\lambda \sigma^2 I + \sum_{i \in \Omega_{v_j}} u_i u_i^T \right)^{-1} \left(\sum_{i \in \Omega_{v_j}} M_{ij} u_i \right) \\
\end{align}
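
The user update can be sketched in NumPy as below. This is a minimal illustration, not the code inside `MatrixFactorization.PMF` (which is not shown here). The function name `update_user_vectors` is hypothetical, and the dense rating matrix `M` with a boolean `mask` marking observed entries is an assumed data layout.

```python
import numpy as np

def update_user_vectors(M, mask, V, lam, sigma2):
    """One coordinate-ascent sweep over all user vectors (hypothetical helper).

    M      : (N, M_items) rating matrix (unobserved entries are ignored)
    mask   : (N, M_items) boolean array, True where a rating is observed
    V      : (M_items, D) current item vectors
    lam, sigma2 : regularization hyperparameters; only their product is used
    """
    N, D = M.shape[0], V.shape[1]
    U = np.zeros((N, D))
    for i in range(N):
        observed = mask[i]                     # Omega_{u_i}: items rated by user i
        V_obs = V[observed]                    # (n_i, D) vectors of those items
        A = lam * sigma2 * np.eye(D) + V_obs.T @ V_obs   # lambda*sigma^2*I + sum v_j v_j^T
        b = V_obs.T @ M[i, observed]                     # sum M_ij v_j
        U[i] = np.linalg.solve(A, b)           # solve A u_i = b instead of inverting A
    return U

# Toy check: one user, two items, D = 1.
M_toy = np.array([[2.0, 4.0]])
mask_toy = np.array([[True, True]])
V_toy = np.array([[1.0], [1.0]])
print(update_user_vectors(M_toy, mask_toy, V_toy, lam=1.0, sigma2=2.0))  # [[1.5]]
```

By symmetry, the item update can reuse the same function on the transposed problem: `update_user_vectors(M.T, mask.T, U, lam, sigma2)`. Alternating the two sweeps for a fixed number of epochs gives one plausible form of the coordinate ascent loop.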

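Here, there are two hyperparameters: the regularization factor $\lambda \sigma^2$ and the latent feature dimension $D$ (the number of columns of $U$). Since $\lambda$ and $\sigma^2$ enter the updates only through their product, the grid below fixes `lambda_` at 1.0 and varies `sigma2`.
The best hyperparameters are searched over the grid below using 3-fold cross-validation.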

In [None]:
%%time
params = {'dims': dims, 'dim_D': 10, 'lambda_': 1, 'sigma2': 2, 'epoch_num': 10, 'logger': None}
pmf = PMF(**params)
tuned_parameters = {'dim_D': [2, 5, 10, 15], 'lambda_': [1.0], 'sigma2': [0.01, 0.1, 0.5, 1, 2, 4]}
gs = GridSearchCV(pmf, tuned_parameters, cv=KFold(n_splits=3, shuffle=True), n_jobs=4)
gs.fit(ratings2)

with open('gs.pickle', 'wb') as f:
    pickle.dump(gs, f)

CPU times: user 2min 15s, sys: 18.5 s, total: 2min 34s
Wall time: 17min 12s
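
As a rough sanity check on the run time, the number of model fits in this search can be counted by enumerating the grid. The sketch below uses only the standard library and copies `tuned_parameters` and the fold count from the cell above. It does not call scikit-learn.

```python
from itertools import product

# Same grid as in the search cell above.
tuned_parameters = {'dim_D': [2, 5, 10, 15], 'lambda_': [1.0], 'sigma2': [0.01, 0.1, 0.5, 1, 2, 4]}
n_splits = 3

# Cartesian product of all parameter values, one dict per combination.
names = list(tuned_parameters)
combos = [dict(zip(names, values)) for values in product(*tuned_parameters.values())]

n_fits = len(combos) * n_splits
print(len(combos), n_fits)  # 24 72
```

So the search trains 24 parameter combinations on 3 folds each, 72 fits in total, plus one final refit on the full data because `GridSearchCV` refits the best estimator by default.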


# How to load the pickled grid search

```python
# Unpickling needs the PMF class to be importable (from MatrixFactorization.PMF).
with open('gs.pickle', 'rb') as f:
    gs2 = pickle.load(f)
```
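
Once loaded, the search object can be inspected through the standard `GridSearchCV` attributes `best_params_`, `best_score_`, and `cv_results_`. The helper below is a small sketch with a hypothetical name, `summarize_search`. It only assumes that the object passed in exposes a `cv_results_` dictionary with the usual `rank_test_score`, `params`, and `mean_test_score` keys. The meaning of the score depends on `PMF.score`, which is not shown here.

```python
def summarize_search(search, top=5):
    """Return the `top` best-ranked (rank, params, mean_test_score) rows.

    `search` is assumed to be a fitted GridSearchCV-like object.
    """
    results = search.cv_results_
    rows = zip(results['rank_test_score'],
               results['params'],
               results['mean_test_score'])
    # Rank 1 is the best setting according to the estimator's score.
    return sorted(rows, key=lambda row: row[0])[:top]

# Intended usage with the object loaded above (assumes gs2 exists):
# for rank, params, score in summarize_search(gs2):
#     print(rank, params, score)
# print(gs2.best_params_, gs2.best_score_)
```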