# Exercise (Matrix Factorization)

In this lesson, we'll reuse the model we trained in [the tutorial](#$TUTORIAL_URL(2)$). To get started, run the setup cell below to import the libraries we'll be using, load our data into DataFrames, and load a serialized version of the model we trained earlier.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow import keras
import os
import random

input_dir = '../input/movielens_preprocessed'
df = pd.read_csv(os.path.join(input_dir, 'rating.csv'), usecols=['userId', 'movieId', 'rating', 'y'])
#movies_df = pd.read_csv(os.path.join(input_dir, 'movie.csv'), usecols=['movieId', 'title', 'year']).set_index('movieId', drop=False)
movies = movies_df = pd.read_csv(os.path.join(input_dir, 'movie.csv'), index_col=0)

# TODO: suppress warning
model = keras.models.load_model('factorization_model.h5')

print("Setup complete!")

Setup complete!


## Part 1: Generating Recommendations

At the end of [the first lesson where we built an embedding model](#$TUTORIAL_URL(1)$), I showed how we could use our model to predict the ratings a particular user would give to some set of movies.

For reference, here's a (slightly modified) copy of that code where we calculated predicted ratings for 5 specific movies:

In [2]:
uid = 26556
candidate_movies = movies[
    movies.title.str.contains('Naked Gun')
    | (movies.title == 'The Sisterhood of the Traveling Pants')
    | (movies.title == 'Lilo & Stitch')
].copy()

preds = model.predict([
    [uid] * len(candidate_movies), # User ids 
    candidate_movies.index, # Movie ids
])
# Because our model was trained on a 'centered' version of rating (subtracting the mean, so that
# the target variable had mean 0), to get the predicted star rating on the original scale, we need
# to add the mean back in.
row0 = df.iloc[0]
offset = row0.rating - row0.y
candidate_movies['predicted_rating'] = preds + offset
candidate_movies.head()[ ['movieId', 'title', 'predicted_rating'] ]

Unnamed: 0,movieId,title,predicted_rating
366,366,Naked Gun 33 1/3: The Final Insult,3.76164
3775,3775,The Naked Gun: From the Files of Police Squad!,4.945276
3776,3776,The Naked Gun 2 1/2: The Smell of Fear,4.457918
5347,5347,Lilo & Stitch,3.753898
10138,10138,The Sisterhood of the Traveling Pants,1.969995


Suppose we're interested in the somewhat more open-ended problem of **generating recommendations**: given a user ID and a number `k`, we need to generate a list of `k` movies we think the user will enjoy.

The most straightforward way to do this would be to calculate the predicted rating this user would assign for *every movie in the dataset*, then take the movies with the `k` highest predictions.
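
As a rough sketch of the "score everything, keep the top `k`" idea, here is a version that works on plain arrays rather than on our Keras model. The `top_k` helper and the toy ids and scores are hypothetical, made up purely for illustration.

```python
import numpy as np

def top_k(item_ids, scores, k):
    """Return the k item ids with the highest scores, best first."""
    item_ids = np.asarray(item_ids)
    # ravel() also copes with the (n, 1)-shaped arrays that predict() returns.
    scores = np.asarray(scores, dtype=float).ravel()
    k = min(k, len(scores))
    # argsort sorts ascending, so negate the scores to put the highest first.
    best = np.argsort(-scores, kind='stable')[:k]
    return item_ids[best]

# Toy stand-ins for movie ids and predicted ratings (not real model output).
toy_ids = [10, 20, 30, 40]
toy_scores = [3.1, 4.7, 2.0, 4.2]
print(top_k(toy_ids, toy_scores, k=2))
```

Sorting every score is more work than strictly necessary when `k` is small, but for a catalogue of this size it keeps the code simple.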

In the code cell below, fill in the body of the `recommend` function so that it returns the `n` movies with the highest predicted rating for the given user (by default, the top 5). It should return a DataFrame with a `movieId` column (it can also have any other columns you'd like). Writing this as a function makes it easy to reuse later on, both for other users and for the other model we load in Part 5.

In [3]:
def recommend(model, user_id, n=5):
    """Return a DataFrame with the n most highly recommended movies for the user with the
    given id. (Where most highly recommended means having the highest predicted ratings 
    according to the given model).
    """
    # 1. Create a list/array containing all movie ids (from the movies dataframe)
    # 2. Call model.predict() on a list/array with repeated copies of user_id, and the sequence you created in step 1
    # 3. Return the movieIds associated with the n highest values returned in step 2. The easiest thing may be to
    #    add a 'predicted_rating' column to the movies dataframe (or a copy of it), then sort on that column.
    pass

#part1.check()

In [5]:
#q1.hint()

In [6]:
#q1.solution()
def recommend(model, user_id, n=5):
    """Return a DataFrame with the n most highly recommended movies for the user with the
    given id. (Where most highly recommended means having the highest predicted ratings 
    according to the given model).
    """
    all_movie_ids = df.movieId.unique()
    preds = model.predict([
        np.repeat(user_id, len(all_movie_ids)),
        all_movie_ids,
    ])
    # Add back the offset calculated earlier, to 'uncenter' the ratings, and get back to a [0.5, 5] scale.
    movies_df.loc[all_movie_ids, 'predicted_rating'] = preds + offset
    reccs = movies_df.sort_values(by='predicted_rating', ascending=False).head(n)
    return reccs

uid = 26556
recommend(model, uid)

Unnamed: 0,movieId,title,genres,key,year,n_ratings,mean_rating,predicted_rating
21770,21770,McKenna Shoots for the Stars,Children|Drama,McKenna Shoots for the Stars (2012),2012,2,4.25,8.314224
25793,25793,Babette Goes to War,Comedy|War,Babette Goes to War (1959),1959,1,1.5,7.968139
25794,25794,Belle comme la femme d'un autre,Comedy,Belle comme la femme d'un autre (2014),2014,1,1.0,7.963304
26329,26329,Sense & Sensibility,Drama|Romance,Sense & Sensibility (2008),2008,4,4.125,7.767611
25823,25823,The Sex and Violence Family Hour,(no genres listed),The Sex and Violence Family Hour (1983),1983,1,1.5,7.720166


## Part 2: Sanity check

Do these recommendations seem sensible? If you'd like a reminder of user 26556's tastes, run the cell below to see all their ratings (in descending order).

In [8]:
user_ratings = df[df.userId==uid]
movie_cols = ['movieId', 'title', 'genres', 'year', 'n_ratings', 'mean_rating']
user_ratings.sort_values(by='rating', ascending=False).merge(movies[movie_cols], on='movieId')

Unnamed: 0,userId,movieId,rating,y,title,genres,year,n_ratings,mean_rating
0,26556,2706,5.0,1.474498,Airplane II: The Sequel,Comedy,1982,4284,3.046287
1,26556,2705,5.0,1.474498,Airplane!,Comedy,1980,18866,3.795963
2,26556,2863,4.5,0.974498,Dr. No,Action|Adventure|Thriller,1962,7183,3.6876
3,26556,2102,4.5,0.974498,Strangers on a Train,Crime|Drama|Film-Noir|Thriller,1951,5154,4.160627
4,26556,2216,4.5,0.974498,History of the World: Part I,Comedy|Musical,1981,4313,3.59576
5,26556,534,4.5,0.974498,Six Degrees of Separation,Drama,1993,5101,3.707337
6,26556,937,4.5,0.974498,Mr. Smith Goes to Washington,Drama,1939,5712,4.037443
7,26556,2286,4.5,0.974498,Fletch,Comedy|Crime|Mystery,1985,6298,3.453211
8,26556,913,4.0,0.474498,Notorious,Film-Noir|Romance|Thriller,1946,4932,4.196818
9,26556,730,4.0,0.474498,Spy Hard,Comedy,1996,6112,2.735314


Review our top-recommended movies. Are they good or bad? If they're bad, in what way are they bad? You may also find it interesting to look at:
- The metadata associated with the top-recommended movies
- The 'least-recommended' movies (the ones with the lowest predicted scores)
- The actual predicted rating values.

Once you have an opinion, uncomment the cell below to see if we're in agreement.

`#q2.solution()`

I'm going to claim that these recommended movies are **bad**. In terms of genre and themes, our top picks seem like poor fits. User 26556 has pretty mature tastes - they like Hitchcock, classic James Bond, and Leslie Nielsen comedies. But our top pick for them, *McKenna Shoots for the Stars*, seems squarely aimed at pre-teen girls.

I had to google the title to discover that, though. In fact, I didn't recognize any of the films in our top-5 recommendations. And that speaks to the biggest problem with our recommendations: they're **super obscure**. Our top 5 recommendations have a total of only 9 ratings between them in the whole dataset. We barely know anything about these movies - how can we be so confident that user 26556 is going to love them?

This is similar to the problem we encountered in the previous exercise, where our model confidently assigned extreme bias values to movies with only a tiny number of reviews.

> **Aside:** You may have noticed another problem, which becomes very obvious when we look at the movies with the highest (or lowest) predicted scores: sometimes our model predicts values outside the allowable range of 0.5-5 stars. For the purposes of recommendation, this is actually no problem: we only care about ranking movies, not about the absolute values of their predicted scores. But this is still an interesting problem to consider. How could we prevent our model from incurring needless errors by making predictions outside the allowable range? Should we? If you have ideas, head over to [this forum thread](TODO) to discuss.
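
To make the "needless errors" point concrete, here is a minimal sketch of one possible post-hoc fix: clamping predictions into the valid range with `np.clip`. This is just one option among several, not something the lesson prescribes, and the predictions and ratings below are made-up numbers.

```python
import numpy as np

# Made-up predictions and true ratings, purely for illustration.
raw_preds = np.array([8.3, 5.6, 3.9, 0.1])
true_ratings = np.array([5.0, 4.5, 4.0, 0.5])

# Clamp every prediction into the valid star range.
clipped_preds = np.clip(raw_preds, 0.5, 5.0)

def mse(pred, actual):
    """Mean squared error between two equal-length arrays."""
    return float(np.mean((pred - actual) ** 2))

print("raw MSE:    ", mse(raw_preds, true_ratings))
print("clipped MSE:", mse(clipped_preds, true_ratings))
```

Since true ratings always lie in the valid range, clipping can never make an individual error larger. It does collapse every out-of-range prediction onto the same value, though, so for ranking purposes it is probably better to keep sorting on the raw scores.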

## Part 3: How are we going to fix this mess?

How can we fix the problem with our recommendations that we identified in Part 2? This could involve changing our model's structure, our training procedure, or our procedure for generating recommendations given a model.

Give it some thought, then uncomment the cell below to compare notes with me. (If you have no idea, that's totally fine!)

`#q3.solution()`

One simple solution would be limiting our recommendations to movies with at least `n` ratings. This feels inelegant, in that we have to choose some arbitrary cut-off, and any reasonable choice will probably exclude some good recommendations. It would be nice if we could take into account popularity in a 'smoother' way. On the other hand, this is very simple to implement, and we don't even need to re-train our model, so it's worth a shot.

If we're willing to train a new model, there's another less hacky approach we can take which might fix our obscure recommendation problem *and* improve our overall accuracy at the same time: regularization. Specifically, putting an L2 weight penalty on our embeddings. I'll talk more about this in part 5 (and show how we would implement it).

## Part 4: Fixing our obscure recommendation problem (thresholding)

Fill in the code cell below to implement the `recommend_nonobscure` function, which recommends the best movies among those that have at least some minimum number of ratings. (You may wish to modify the code you wrote in `recommend`, or even call `recommend` as a subroutine.)

In [9]:
def recommend_nonobscure(model, user_id, n=5, min_ratings=1000):
    """Return a DataFrame with the n movies which the given model assigns the highest 
    predicted ratings for the given user, *limited to movies with at least the given
    threshold of ratings*.
    """
    pass

#q4.check()

In [17]:
#q4.solution()
def recommend_nonobscure(model, user_id, n=5, min_ratings=1000):
    """Return a DataFrame with the n movies which the given model assigns the highest 
    predicted ratings for the given user, *limited to movies with at least the given
    threshold of ratings*.
    """
    # Always recompute predictions: a cached 'predicted_rating' column may belong to a
    # different model or a different user.
    all_movie_ids = df.movieId.unique()
    preds = model.predict([
        np.repeat(user_id, len(all_movie_ids)),
        all_movie_ids,
    ])
    # Add back the offset calculated earlier, to 'uncenter' the ratings, and get back to a [0.5, 5] scale.
    scored = movies_df.loc[all_movie_ids].copy()
    scored['predicted_rating'] = preds.flatten() + offset
    # Keep only movies that meet the ratings threshold, then take the top n.
    return scored[scored.n_ratings >= min_ratings].sort_values(by='predicted_rating', ascending=False).head(n)
    

Run the cell below to take a look at our new recommended movies. Did this fix our problem? Do we get better results with a different threshold?

In [18]:
recommend_nonobscure(model, uid)

Unnamed: 0,movieId,title,genres,key,year,n_ratings,mean_rating,predicted_rating
18811,18811,The Cabin in the Woods,Comedy|Horror|Sci-Fi|Thriller,"Cabin in the Woods, The (2012)",2012,1757,3.677391,5.557927
1233,1233,Evil Dead II (Dead by Dawn),Action|Comedy|Fantasy|Horror,Evil Dead II (Dead by Dawn) (1987),1987,7788,3.772888,5.304126
2614,2614,"South Park: Bigger, Longer and Uncut",Animation|Comedy|Musical,"South Park: Bigger, Longer and Uncut (1999)",1999,17371,3.626832,5.202612
1189,1189,Army of Darkness,Action|Adventure|Comedy|Fantasy|Horror,Army of Darkness (1993),1993,12469,3.723536,5.187207
1213,1213,Dead Alive (Braindead),Comedy|Fantasy|Horror,Dead Alive (Braindead) (1992),1992,2576,3.72705,5.079445


## Part 5: Fixing our obscure recommendation problem (regularization)

The code below is identical to the code used to create the model we've been using in this exercise, except we've added L2 regularization to our embeddings (by specifying a value for the keyword argument `embeddings_regularizer` when creating our Embedding layers).
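
To make the mechanics a little more concrete, here is a minimal NumPy sketch of what an L2 weight penalty adds to the loss: the regularization strength times the sum of squared weights, which is how Keras's `l2` regularizer is defined. The `l2_penalty` helper, the toy matrices and the `data_loss` value are all made up for illustration; only the two strengths (1e-7 for users, 1e-6 for movies) come from the model code in this section.

```python
import numpy as np

def l2_penalty(weights, strength):
    """L2 weight penalty: strength times the sum of squared weights."""
    return strength * float(np.sum(np.square(weights)))

rng = np.random.default_rng(0)
# Toy embedding matrices (random values, shapes chosen arbitrarily).
toy_user_emb = rng.normal(size=(100, 8))
toy_movie_emb = rng.normal(size=(50, 8))

data_loss = 0.75  # stand-in for the mean squared error on one batch
total_loss = (
    data_loss
    + l2_penalty(toy_user_emb, 1e-7)   # same strength as user_r12n
    + l2_penalty(toy_movie_emb, 1e-6)  # same strength as movie_r12n
)
print(total_loss)
```

Because the penalty grows with the square of each weight, keeping an embedding far from zero only pays off when it reduces the prediction error by more than it costs in penalty.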

> **TODO: Aside here giving an introduction to regularization via weight penalty. Intuitive interpretation (imposing a prior), and mechanics of how it works (adding a term to the loss calculated as blah).**

In [12]:
movie_embedding_size = user_embedding_size = 8
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')

movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

l2_model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)

Training this model for a decent number of iterations takes around 15 minutes, so to save some time, I have an already trained model you can load from disk by running the cell below.

In [13]:
l2_model = keras.models.load_model('movie_svd_model_8_r12n.h5')



(If you're curious, you can check out the kernel where I trained this model [here](TODO). You may notice that, whether or not regularization improves the subjective quality of our recommendations, it already has the benefit of improving our validation error by reducing overfitting.)

Try using the code you wrote in part 1 to generate recommendations using this model. How do they compare?

In [14]:
# TODO: Use the recommend() function you wrote earlier to get the 5 best recommended movies
# for user 26556, and assign them to the variable l2_reccs.
l2_reccs = []
#part5.check()

In [19]:
#part5.solution()
l2_reccs = recommend(l2_model, uid)
l2_reccs

Unnamed: 0,movieId,title,genres,key,year,n_ratings,mean_rating,predicted_rating
228,228,Dumb & Dumber (Dumb and Dumber),Adventure|Comedy,Dumb & Dumber (Dumb and Dumber) (1994),1994,32085,2.950768,5.550111
1471,1471,Austin Powers: International Man of Mystery,Action|Adventure|Comedy,Austin Powers: International Man of Mystery (1...,1997,22074,3.442133,5.316734
340,340,Ace Ventura: Pet Detective,Comedy,Ace Ventura: Pet Detective (1994),1994,38226,2.982501,5.267947
1372,1372,Beavis and Butt-Head Do America,Adventure|Animation|Comedy|Crime,Beavis and Butt-Head Do America (1996),1996,8752,2.964548,5.233418
2597,2597,Austin Powers: The Spy Who Shagged Me,Action|Adventure|Comedy,Austin Powers: The Spy Who Shagged Me (1999),1999,24651,3.213309,5.177351


What do you think this model's predicted scores will look like for the 'obscure' movies that our earlier model highly recommended? Run the cell below to find out.

In [15]:
uid = 26556
obscure_reccs = recommend(model, uid)
obscure_mids = obscure_reccs.index
preds = l2_model.predict([
    np.repeat(uid, len(obscure_mids)),
    obscure_mids,
])
recc_df = movies_df.loc[obscure_mids].copy()
recc_df['l2_predicted_rating'] = preds + offset
recc_df

Unnamed: 0,movieId,title,genres,key,year,n_ratings,mean_rating,predicted_rating,l2_predicted_rating
21770,21770,McKenna Shoots for the Stars,Children|Drama,McKenna Shoots for the Stars (2012),2012,2,4.25,8.314224,3.516188
25793,25793,Babette Goes to War,Comedy|War,Babette Goes to War (1959),1959,1,1.5,7.968139,3.527213
25794,25794,Belle comme la femme d'un autre,Comedy,Belle comme la femme d'un autre (2014),2014,1,1.0,7.963304,3.523651
26329,26329,Sense & Sensibility,Drama|Romance,Sense & Sensibility (2008),2008,4,4.125,7.767611,3.521178
25823,25823,The Sex and Violence Family Hour,(no genres listed),The Sex and Violence Family Hour (1983),1983,1,1.5,7.720166,3.517346


`</exercise>`

# Scratch space below - please ignore

```
TODO
- (maybe) look at dist. of weights of r12n model vs. prev model
- (maybe) load another pretrained model with even stronger r12n. Compare reccs. What's the behaviour in the limit as we increase r12n strength?
- Look at reccs for a few other users with fairly clear, distinctive tastes. e.g...
    112287 (American Beauty, The Notebook, Mean Girls, The Devil Wears Prada)
    69106 (Terminator 2, Toy Story, WALL-E, Days of Thunder, Star Wars, The Matrix)
    83421 (The Godfather, The Shawshank Redemption, Casino, Casablanca)
```

In [16]:
assert False

AssertionError: 

In [None]:
thresh = 500
movies_df[movies_df.n_ratings > thresh]\
    .sort_values(by='predicted_rating', ascending=False).head(5)

In [None]:
w, = model.get_layer('movie_embedding').get_weights()

w[:10]

In [None]:
w, = model.get_layer('user_embedding').get_weights()

w[1:10]

In [None]:
model.summary()

In [None]:
wm, = model.get_layer('movie_embedding').get_weights()
wu, = model.get_layer('user_embedding').get_weights()

print(
    np.linalg.norm(wm[0]),
    np.linalg.norm(wm[1]),
    np.linalg.norm(wm[2]),
)

In [None]:
norms = np.linalg.norm(wm, axis=1)
ns = pd.Series(norms)


unorms = np.linalg.norm(wu, axis=1)
nus = pd.Series(unorms)
display(
    "Dist of movie norms:",
    ns.describe(),
    "Dist of user norms:",
    nus.describe(),
)

In [None]:
from IPython.display import display
wm, = model.get_layer('movie_embedding').get_weights()
wu, = model.get_layer('user_embedding').get_weights()

quantiles = [.1, .25, .4, .5, .6, .75, .9]

display(
    "Movie embedding weights:",
    pd.Series(wm.flatten()).describe(quantiles),
    "User embedding weights:",
    pd.Series(wu.flatten()).describe(quantiles),
)

In [None]:
from IPython.display import display
wm, = model.get_layer('movie_embedding').get_weights()
wu, = model.get_layer('user_embedding').get_weights()
bm, = model.get_layer('movie_bias').get_weights()
bu, = model.get_layer('user_bias').get_weights()

quantiles = [.1, .25, .4, .5, .6, .75, .9]

display(
    pd.Series(wm.flatten()).describe(quantiles),
    pd.Series(wu.flatten()).describe(quantiles),
    pd.Series(bm.flatten()).describe(quantiles),
    pd.Series(bu.flatten()).describe(quantiles),
)

In [None]:
# Uggggh. Remember to do this before any training experiments.
df = df.sample(frac=1, random_state=1)

In [None]:
# XXX: New experiment - dropout?
movie_embedding_size = user_embedding_size = 32
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')
movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
dropout = .2
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
user_embedded = keras.layers.Dropout(dropout)(user_embedded)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)
movie_embedded = keras.layers.Dropout(dropout)(movie_embedded)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)
model.compile(
    tf.train.AdamOptimizer(0.001),
    loss='MSE',
    metrics=['MAE'],
)

tf.set_random_seed(1); np.random.seed(1); random.seed(1)
history = model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=10,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=15,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=10,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.save('movie_svd_model_32_dropout.h5')

In [None]:
movie_embedding_size = user_embedding_size = 32
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')
movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
#movie_r12n = user_r12n = None # zzz
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

biases = 0
if biases:
    bias_r12n = None
    bias_r12n = keras.regularizers.l1_l2(l1=1e-4, l2=1e-7) # XXX 1e-6 -> 1e-4
    bias_init = 'zeros'
    movie_b = keras.layers.Embedding(df.movieId.max()+1, 1, 
                                             name='movie_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(movie_id_input)
    movie_b = keras.layers.Flatten()(movie_b)
    #out = keras.layers.Add()([movie_b, out])

    user_b = keras.layers.Embedding(df.userId.max()+1, 1, 
                                             name='user_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(user_id_input)
    user_b = keras.layers.Flatten()(user_b)
    out = keras.layers.Add()([user_b, movie_b, out])

model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)
model.compile(
    tf.train.AdamOptimizer(0.001), # XXX: Try lower?
    loss='MSE',
    metrics=['MAE'],
)
#model.summary()

tf.set_random_seed(1); np.random.seed(1); random.seed(1)
history = model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=10,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.save('movie_svd_model_8_r12n.h5')

In [None]:
model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=8,
    verbose=2,
    validation_split=.05,
);

In [None]:
# Okay, this is excellent. Biases seem shockingly low. 
# I wonder if accuracy is much affected by just taking them away?
# XXX: lower lr experiment
movie_embedding_size = user_embedding_size = 16 # XXX: Tested with 8 (and worked well there). idk if 
# r12n might need to go up when increasing the number of parameters like this?

# Each instance will consist of two inputs: a single user id, and a single movie id
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')
movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
#movie_r12n = user_r12n = None # zzz
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

biases = 1
if biases:
    bias_r12n = None
    bias_r12n = keras.regularizers.l1_l2(l1=1e-4, l2=1e-7) # XXX 1e-6 -> 1e-4
    bias_init = 'zeros'
    movie_b = keras.layers.Embedding(df.movieId.max()+1, 1, 
                                             name='movie_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(movie_id_input)
    movie_b = keras.layers.Flatten()(movie_b)
    #out = keras.layers.Add()([movie_b, out])

    user_b = keras.layers.Embedding(df.userId.max()+1, 1, 
                                             name='user_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(user_id_input)
    user_b = keras.layers.Flatten()(user_b)
    out = keras.layers.Add()([user_b, movie_b, out])

model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)
model.compile(
    tf.train.AdamOptimizer(0.001), # XXX: Try lower?
    loss='MSE',
    metrics=['MAE'],
)
#model.summary()

tf.set_random_seed(1); np.random.seed(1); random.seed(1)
history = model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=60,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.save('movie_svd_model_16.h5')

In [None]:
pd.Series(preds.flatten()).describe([.05, .1, .25, .5, .75, .9, .95])

These reccos suck. OTOH, this is a great motivator/lead-in to r12n.

# Scratch space / brainstorm

Idea: focus on r12n? (In the same way that I was thinking of having ex 1 be walking through adding biases step by step.)

Are there other datasets I could have them do factorization on? (Could then be fun to use those learned embeddings for next two exercises)

- million song dataset https://www.kaggle.com/c/msdchallenge
- goodreads https://www.kaggle.com/zygmunt/goodbooks-10k/home <--- this one looks promising

## Recommendations

This technique makes it easier to generate recommendations for a particular user. Let's try it. (I guess this is a point where having some understanding of the matrix factorization aspect would be useful.)

To make recommendations for user u...
- could individually calculate predicted scores for every movie in the dataset, but that'd be pretty tedious.
- multiply user vector by weight matrix to get a column vector with predicted scores per movie. Sort.
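
The second option can be sketched roughly as follows, assuming a model with no bias terms, so that a prediction is just the dot product of a user vector and a movie vector. The arrays here are random stand-ins; with a real model the weights could come from `model.get_layer('movie_embedding').get_weights()`, as in the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_movies, emb_size = 6, 8
# Random stand-ins for learned embeddings (not real model weights).
toy_movie_emb = rng.normal(size=(n_movies, emb_size))
toy_user_vec = rng.normal(size=emb_size)

# One matrix-vector product gives a score for every movie at once.
scores = toy_movie_emb @ toy_user_vec

# Same result as scoring each movie individually with a dot product.
one_by_one = np.array([np.dot(row, toy_user_vec) for row in toy_movie_emb])
assert np.allclose(scores, one_by_one)

# Row indices of the movie embedding matrix, best first.
ranked_movie_rows = np.argsort(-scores)
print(ranked_movie_rows)
```

If the model also had movie and user biases, they would need to be added to `scores` before sorting (the user bias is constant across movies, so it would not change the ranking).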

## convergence properties (thought experiment)

Compare the loss over time of some DNN models like the ones we trained yesterday vs. some factorization models. Do you notice a difference? Can you think of why this would be?

## factorization vs. dnn (thought experiment)

We seem to be getting better results with our factorization model. But can you think of situations where you would want to use the dnn model instead?

## r12n

Look at train vs. val loss for different embedding sizes. How could we prevent overfitting while keeping a larger embedding size? 