# Exercise (Matrix Factorization)

In this lesson, we'll reuse the model we trained in [the tutorial](#$TUTORIAL_URL(2)$). To get started, run the setup cell below to import the libraries we'll be using, load our data into DataFrames, and load a serialized version of the model we trained earlier.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow import keras
import os
import random

input_dir = '../input/movielens_preprocessed'
# XXX: Remember to shuffle if this ends up being used to train anything
df = pd.read_csv(os.path.join(input_dir, 'rating.csv'), usecols=['userId', 'movieId', 'rating', 'y'])
#movies_df = pd.read_csv(os.path.join(input_dir, 'movie.csv'), usecols=['movieId', 'title', 'year']).set_index('movieId', drop=False)
movies = movies_df = pd.read_csv(os.path.join(input_dir, 'movie.csv'), index_col=0)

# TODO: suppress warning
model = keras.models.load_model('factorization_model.h5')

print("Setup complete!")

Setup complete!


## Part 1: Generating Recommendations


#### DB: I'm concerned that the user is learning more about how to use recommender systems in parts 1-2 below, rather than about factorization (which is nominally the topic of this lesson).


At the end of [the first lesson where we built an embedding model](#$TUTORIAL_URL(1)$), I showed how we could use our model to predict the ratings a particular user would give to some set of movies.

For reference, here's a (slightly modified) copy of that code where we calculated predicted ratings for 7 specific movies:

In [2]:
uid = 26556
candidate_movies = movies[
    movies.title.str.contains('Naked Gun')
    | ((movies.title == 'Planet of the Apes') & (movies.year==1968))
    | ((movies.title == 'The Blob') & (movies.year==1958))
    | (movies.title == 'The Sisterhood of the Traveling Pants')
    | (movies.title == 'Lilo & Stitch')
].copy()

preds = model.predict([
    [uid] * len(candidate_movies), # User ids 
    candidate_movies.index, # Movie ids
])
# Because our model was trained on a 'centered' version of rating (subtracting the mean, so that
# the target variable had mean 0), to get the predicted star rating on the original scale, we need
# to add the mean back in.
row0 = df.iloc[0]
offset = row0.rating - row0.y
candidate_movies['predicted_rating'] = preds + offset
candidate_movies.head()[ ['movieId', 'title', 'predicted_rating'] ]

Unnamed: 0,movieId,title,predicted_rating
366,366,Naked Gun 33 1/3: The Final Insult,3.761633
1305,1305,The Blob,3.392891
2444,2444,Planet of the Apes,3.958446
3775,3775,The Naked Gun: From the Files of Police Squad!,4.94527
3776,3776,The Naked Gun 2 1/2: The Smell of Fear,4.457911
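
To make the "add the mean back in" step concrete, here is a minimal sketch using three of user 26556's `(rating, y)` pairs as they appear in this exercise's data. The array names are just for illustration, and the sketch assumes `y` was produced by subtracting a single global mean from `rating`.

```python
import numpy as np

# (rating, y) pairs copied from user 26556's rows in this exercise's data.
rating = np.array([5.0, 4.5, 4.0])
y = np.array([1.474504, 0.974504, 0.474504])

# If y is rating minus one global mean, rating - y is the same for every row,
# which is why the cell above can read the offset from any single row.
offsets = rating - y
offset = offsets[0]
assert np.allclose(offsets, offset)

# A centered prediction of 0 corresponds to the mean star rating.
centered_preds = np.array([0.0, 1.0, -1.0])
star_preds = centered_preds + offset
print(round(float(offset), 6), star_preds)
```

So a model output of 0 means "about average", and positive or negative outputs are stars above or below that average.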


Suppose we're interested in the somewhat more open-ended problem of **generating recommendations**: given some user ID and some number `k`, we need to generate a list of `k` movies we think the user will enjoy.

The most straightforward way to do this would be to calculate the predicted rating this user would assign for *every movie in the dataset*, then take the movies with the `k` highest predictions.

In the code cell below, write code to create a variable `reccs` containing the 5 movies with the highest predicted rating for the user with userId 26556. `reccs` should be a DataFrame with a `movieId` column (it can also have any other additional columns you'd like).

**TODO: Maybe have them fill in the body of a function for generating recommendations instead? That way, it would be easy for them to reuse it later on to get reccs for other models that we load (in part 5). Or try it on other users. fn would take model, k, and uid as inputs.**

In [3]:
uid = 26556

# 1. Create a list/array containing all movie ids (from the movies dataframe)

# 2. Call model.predict() on a list/array with repeated copies of uid, and the sequence you created in step 1

# 3. Get the movieIds associated with the 5 highest values returned in step 2. The easiest thing may be to
#    add a 'predicted_rating' column to the movies dataframe (or a copy of it), then sort on that column.

# 4. Assign the result of step 3 to reccs!
reccs = []
#q1.check()

In [4]:
#q1.hint()

In [5]:
#q1.solution()
uid = 26556
all_movie_ids = df.movieId.unique()
preds = model.predict([
    np.repeat(uid, len(all_movie_ids)),
    all_movie_ids,
])
# Add back the offset calculated earlier, to 'uncenter' the ratings, and get back to a [0.5, 5] scale.
movies_df.loc[all_movie_ids, 'predicted_rating'] = preds + offset
n = 5
reccs = movies_df.sort_values(by='predicted_rating', ascending=False).head(n)
reccs

Unnamed: 0,movieId,title,genres,movieId_orig,key,year,n_ratings,mean_rating,predicted_rating
21770,21770,McKenna Shoots for the Stars,Children|Drama,105563,McKenna Shoots for the Stars (2012),2012,2,4.25,8.314218
25793,25793,Babette Goes to War,Comedy|War,126126,Babette Goes to War (1959),1959,1,1.5,7.968132
25794,25794,Belle comme la femme d'un autre,Comedy,126128,Belle comme la femme d'un autre (2014),2014,1,1.0,7.963298
26329,26329,Sense & Sensibility,Drama|Romance,129032,Sense & Sensibility (2008),2008,4,4.125,7.767604
25823,25823,The Sex and Violence Family Hour,(no genres listed),126186,The Sex and Violence Family Hour (1983),1983,1,1.5,7.72016
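
The solution above can also be wrapped in a function, which makes it easy to try other users or other models. The following is a sketch rather than the official solution: `recommend` and `ToyModel` are names I made up, and `ToyModel` is a stand-in with a Keras-style `predict` so that the example runs without the trained model.

```python
import numpy as np


def recommend(model, uid, movie_ids, k=5, offset=0.0):
    """Return the k movie ids with the highest predicted rating for uid.

    `model` is anything with a Keras-style predict([user_ids, movie_ids]).
    """
    movie_ids = np.asarray(movie_ids)
    user_ids = np.repeat(uid, len(movie_ids))
    # Keras returns shape (n, 1); flatten to a 1-D vector of scores.
    scores = np.asarray(model.predict([user_ids, movie_ids])).ravel() + offset
    # Sort descending and keep the first k positions.
    top = np.argsort(-scores)[:k]
    return movie_ids[top], scores[top]


class ToyModel:
    """Hypothetical stand-in for the trained Keras model (illustration only)."""

    def __init__(self, user_vecs, movie_vecs):
        self.user_vecs, self.movie_vecs = user_vecs, movie_vecs

    def predict(self, inputs):
        users, movies = inputs
        # Dot product of each user embedding with the paired movie embedding.
        dots = np.sum(self.user_vecs[users] * self.movie_vecs[movies], axis=1)
        return dots.reshape(-1, 1)


rng = np.random.default_rng(0)
toy = ToyModel(rng.normal(size=(3, 4)), rng.normal(size=(10, 4)))
top_ids, top_scores = recommend(toy, uid=1, movie_ids=np.arange(10), k=5, offset=3.5)
print(top_ids, top_scores)
```

With the real model, the call would look something like `recommend(model, uid, movies_df.index, k=5, offset=offset)`.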


## Part 2: Sanity check

Do these recommendations seem sensible? If you'd like a reminder of user 26556's tastes, run the cell below to see all their ratings (in descending order).

In [6]:
user_ratings = df[df.userId==uid]
movie_cols = ['movieId', 'title', 'genres', 'year', 'n_ratings', 'mean_rating']
user_ratings.sort_values(by='rating', ascending=False).merge(movies[movie_cols], on='movieId')

Unnamed: 0,userId,movieId,rating,y,title,genres,year,n_ratings,mean_rating
0,26556,2706,5.0,1.474504,Airplane II: The Sequel,Comedy,1982,4284,3.041351
1,26556,2705,5.0,1.474504,Airplane!,Comedy,1980,18866,3.80044
2,26556,534,4.5,0.974504,Six Degrees of Separation,Drama,1993,5101,3.708884
3,26556,2102,4.5,0.974504,Strangers on a Train,Crime|Drama|Film-Noir|Thriller,1951,5154,4.162029
4,26556,2863,4.5,0.974504,Dr. No,Action|Adventure|Thriller,1962,7183,3.683762
5,26556,2286,4.5,0.974504,Fletch,Comedy|Crime|Mystery,1985,6298,3.446007
6,26556,2216,4.5,0.974504,History of the World: Part I,Comedy|Musical,1981,4313,3.597246
7,26556,937,4.5,0.974504,Mr. Smith Goes to Washington,Drama,1939,5712,4.031262
8,26556,730,4.0,0.474504,Spy Hard,Comedy,1996,6112,2.741261
9,26556,916,4.0,0.474504,To Catch a Thief,Crime|Mystery|Romance|Thriller,1955,4699,4.030083


Review our top-recommended movies. Are they good or bad? If they're bad, in what way are they bad? You may also find it interesting to look at:
- The metadata associated with the top-recommended movies
- The 'least-recommended' movies (the ones with the lowest predicted scores)
- The actual predicted rating values.

Once you have an opinion, uncomment the cell below to see if we're in agreement.

`#q2.solution()`


#### DB: I think we should aim for as little text as possible in the exercises.  

#### DB: I've watched people using other MOOCs before, and they tend to look for the parts where they do something... their eyes skip over any text that isn't either specific instructions or code (and in some cases, they seem to skip over that as well). In that sense, the words are in the way.


I'm going to claim that these recommended movies are **bad**. In terms of genre and themes, our top picks seem like poor fits. User 26556 has pretty mature tastes - they like Hitchcock, classic James Bond, and Leslie Nielsen comedies. But our top pick for them, *McKenna Shoots for the Stars*, seems squarely aimed at pre-teen girls.

I had to google the title to find that out, though. In fact, I didn't recognize any of the films in our top-5 recommendations. And that speaks to the biggest problem with our recommendations: they're **super obscure**. Our top 5 recommendations have a total of only 9 ratings between them in the whole dataset. We barely know anything about these movies - how can we be so confident that user 26556 is going to love them?
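
The figures in the paragraph above can be checked directly from the Part 1 output. Here is a small sketch that rebuilds those five rows by hand (the DataFrame name `top5` and the `gap` column are my own additions) and measures how thin the evidence is.

```python
import pandas as pd

# Rows copied from the Part 1 output in this exercise (top 5 for user 26556).
top5 = pd.DataFrame({
    'title': [
        'McKenna Shoots for the Stars',
        'Babette Goes to War',
        "Belle comme la femme d'un autre",
        'Sense & Sensibility',
        'The Sex and Violence Family Hour',
    ],
    'n_ratings': [2, 1, 1, 4, 1],
    'mean_rating': [4.25, 1.5, 1.0, 4.125, 1.5],
    'predicted_rating': [8.314218, 7.968132, 7.963298, 7.767604, 7.72016],
})

# How much evidence is behind these picks?
total_ratings = int(top5.n_ratings.sum())

# How many predictions fall outside the valid 0.5-5 star range?
out_of_range = int((~top5.predicted_rating.between(0.5, 5)).sum())

# Gap between what the model predicts and what the few raters actually said.
top5['gap'] = top5.predicted_rating - top5.mean_rating
print(total_ratings, out_of_range)
print(top5.sort_values(by='gap', ascending=False)[['title', 'gap']])
```

Three of the five movies have a mean rating of 1.5 stars or lower among the people who actually rated them, which makes the predictions of nearly 8 stars look even less trustworthy.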

(You may have noticed another problem, which becomes very obvious when we look at the movies with 
the highest (or lowest) predicted scores: sometimes our model predicts values outside the allowable
range of 0.5-5 stars. For the purposes of recommendation, this is actually no problem: we only care about ranking
movies, not about the absolute values of their predicted scores. But this is still an interesting problem
to consider. How could we prevent our model from incurring needless errors by making predictions outside
the allowable range? Should we? If you have ideas, head over to [this forum thread](TODO) to discuss.)
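
One possible answer, offered as a sketch and not as the approach this course takes, is to clip predictions to the valid range after the fact. The ratings below are made up for illustration.

```python
import numpy as np

# Hypothetical true ratings and raw model outputs, in stars.
actual = np.array([4.5, 5.0, 1.0, 3.0])
raw = np.array([8.3, 5.6, -0.4, 3.2])

# Every true rating lies in [0.5, 5], so moving an out-of-range prediction
# to the nearest bound can never move it further from the truth.
clipped = np.clip(raw, 0.5, 5.0)

mae_raw = np.mean(np.abs(raw - actual))
mae_clipped = np.mean(np.abs(clipped - actual))
print(mae_raw, mae_clipped)
```

Note that clipping creates ties at the boundaries (8.3 and 5.6 both become 5.0), so for ranking purposes it's still better to sort on the raw scores.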

**TODO: Mention how this is a common/important problem when working with sparse, high-cardinality categorical data with heavy tails. We'll encounter similar problems in the next two lessons.**

## Part 3: How are we going to fix this mess?
#### DB: I really like this problem.
How can we fix the problem with our recommendations that we identified in Part 2? This could involve changing our model's structure, our training procedure, or our procedure for generating recommendations.

Give it some thought, then uncomment the cell below to compare notes with me. (If you have no idea, that's totally fine!)

`#q3.solution()`

One simple solution would be limiting our recommendations to movies with at least `n` ratings. This feels somewhat inelegant, in that we have to choose some arbitrary cut-off. And any reasonable choice will probably exclude some good recommendations. It would be nice if we could take into account popularity in a 'smoother' way. On the other hand, this is very simple to implement, and we don't even need to re-train our model, so it's worth a shot.

If we're willing to train a new model, there's another less hacky approach we can take which might fix our obscure recommendation problem *and* improve our overall accuracy at the same time: regularization. Specifically, putting an L2 weight penalty on our embeddings. I'll talk more about this in part 5 (and show how we would implement it).

**TODO: Maybe also mention violation of missing at random assumption. Idea of augmenting training data with missing values from the user-item matrix drawn from a distribution with a lower mean than the avg. rating in the dataset. https://arxiv.org/pdf/1206.5267.pdf**

## Part 4: Fixing our obscure recommendation problem (thresholding)

Fill in the code cell below to create a variable `reccs` containing our 5 best recommendations for user #26556, limited to movies that have at least 1,000 ratings in the dataset. Take a look at the recommended movies. Did this fix our problem? Do we get better results with a different threshold?

In [7]:
uid = 26556
threshold = 1000

# TODO fill in code here to generate our top 5 predictions having a minimum number of ratings. Assign
# a dataframe with these movies to reccs.
reccs = []
#q4.check()

In [8]:
#q4.solution()
reccs = movies_df[movies_df.n_ratings >= threshold]\
    .sort_values(by='predicted_rating', ascending=False).head(5)
reccs

Unnamed: 0,movieId,title,genres,movieId_orig,key,year,n_ratings,mean_rating,predicted_rating
18811,18811,The Cabin in the Woods,Comedy|Horror|Sci-Fi|Thriller,93840,"Cabin in the Woods, The (2012)",2012,1757,3.677778,5.557921
1233,1233,Evil Dead II (Dead by Dawn),Action|Comedy|Fantasy|Horror,1261,Evil Dead II (Dead by Dawn) (1987),1987,7788,3.772697,5.30412
2614,2614,"South Park: Bigger, Longer and Uncut",Animation|Comedy|Musical,2700,"South Park: Bigger, Longer and Uncut (1999)",1999,17371,3.626352,5.202606
1189,1189,Army of Darkness,Action|Adventure|Comedy|Fantasy|Horror,1215,Army of Darkness (1993),1993,12469,3.723848,5.187201
1213,1213,Dead Alive (Braindead),Comedy|Fantasy|Horror,1241,Dead Alive (Braindead) (1992),1992,2576,3.722017,5.079439
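
To compare thresholds quickly, the filter-then-sort step can be put in a small helper. This is only a sketch: `top_k_with_min_ratings` and `movies_sample` are hypothetical names, and the ten rows are just the movies shown in this exercise's output tables, so the lists printed here are not what the full dataset would return.

```python
import pandas as pd

# Rows copied from the Part 1 and Part 4 outputs in this exercise.
movies_sample = pd.DataFrame({
    'title': [
        'McKenna Shoots for the Stars',
        'Babette Goes to War',
        "Belle comme la femme d'un autre",
        'Sense & Sensibility',
        'The Sex and Violence Family Hour',
        'The Cabin in the Woods',
        'Evil Dead II (Dead by Dawn)',
        'South Park: Bigger, Longer and Uncut',
        'Army of Darkness',
        'Dead Alive (Braindead)',
    ],
    'n_ratings': [2, 1, 1, 4, 1, 1757, 7788, 17371, 12469, 2576],
    'predicted_rating': [8.314218, 7.968132, 7.963298, 7.767604, 7.72016,
                         5.557921, 5.30412, 5.202606, 5.187201, 5.079439],
})


def top_k_with_min_ratings(movies, threshold, k=5):
    """Top-k rows by predicted_rating among movies with >= threshold ratings."""
    popular = movies[movies.n_ratings >= threshold]
    return popular.sort_values(by='predicted_rating', ascending=False).head(k)


for threshold in [0, 1000, 5000, 10000]:
    picks = top_k_with_min_ratings(movies_sample, threshold, k=3)
    print(threshold, list(picks.title))
```

On the real data you would pass `movies_df` in place of `movies_sample`.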


## Part 5: Fixing our obscure recommendation problem (regularization)

The code below is identical to the code used to create the model we've been using in this exercise, except we've added L2 regularization to our embeddings (by specifying a value for the keyword argument `embeddings_regularizer` when creating our Embedding layers).

> **TODO: Aside here giving an introduction to regularization via weight penalty. Intuitive interpretation (imposing a prior), and mechanics of how it works (adding a term to the loss calculated as blah).**
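
As a rough picture of the mechanics: an L2 weight penalty adds `strength * sum(weights ** 2)` to the loss for each regularized weight matrix, so large embedding values cost something unless they reduce the prediction error by more. The NumPy sketch below mirrors that calculation. The function names and toy arrays are my own, and the default strengths are the ones used in the model code in this part.

```python
import numpy as np


def l2_penalty(weights, strength):
    """Keras-style L2 penalty: strength * sum of squared weights."""
    return strength * np.sum(np.square(weights))


def regularized_loss(y_true, y_pred, user_emb, movie_emb,
                     user_l2=1e-7, movie_l2=1e-6):
    """Mean squared error plus an L2 penalty on each embedding matrix."""
    mse = np.mean(np.square(y_true - y_pred))
    return mse + l2_penalty(user_emb, user_l2) + l2_penalty(movie_emb, movie_l2)


# Toy embedding matrices (illustrative sizes, not the real dataset's).
rng = np.random.default_rng(0)
user_emb = rng.normal(size=(100, 8))
movie_emb = rng.normal(size=(50, 8))
y_true = np.array([1.0, -0.5])
y_pred = np.array([0.8, 0.1])

small = regularized_loss(y_true, y_pred, user_emb, movie_emb)
# Doubling every movie weight quadruples the movie penalty.
large = regularized_loss(y_true, y_pred, user_emb, 2 * movie_emb)
print(small, large)
```

A movie with only one or two ratings contributes very little to the error term, so the penalty dominates and its embedding is pulled toward zero.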

In [9]:
movie_embedding_size = user_embedding_size = 8
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')

movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

l2_model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)

Training this model for a decent number of iterations takes around 15 minutes, so to save some time, I have an already trained model you can load from disk by running the cell below.

In [10]:
l2_model = keras.models.load_model('movie_svd_model_8_r12n.h5')



**(TODO: Note on the fact that this does indeed end up improving validation MAE. And maybe note effect on overfitting? Show plot of train and val loss, with and w/o r12n)**

Try using the code you wrote in part 1 to generate recommendations using this model. How do they compare?

In [11]:
# TODO: Call function defined in part 1 using l2_model

What do you think this model's predicted scores will look like for the 'obscure' movies that our earlier model highly recommended? Run the cell below to find out.

In [14]:
uid = 26556
reccs = movies_df.sort_values(by='predicted_rating', ascending=False).head(n)
obscure_mids = reccs.index
preds = l2_model.predict([
    np.repeat(uid, len(obscure_mids)),
    obscure_mids,
])
recc_df = movies_df.loc[obscure_mids].copy()
recc_df['l2_predicted_rating'] = preds + offset
recc_df

Unnamed: 0,movieId,title,genres,movieId_orig,key,year,n_ratings,mean_rating,predicted_rating,l2_predicted_rating
21770,21770,McKenna Shoots for the Stars,Children|Drama,105563,McKenna Shoots for the Stars (2012),2012,2,4.25,8.314218,3.516182
25793,25793,Babette Goes to War,Comedy|War,126126,Babette Goes to War (1959),1959,1,1.5,7.968132,3.527207
25794,25794,Belle comme la femme d'un autre,Comedy,126128,Belle comme la femme d'un autre (2014),2014,1,1.0,7.963298,3.523645
26329,26329,Sense & Sensibility,Drama|Romance,129032,Sense & Sensibility (2008),2008,4,4.125,7.767604,3.521172
25823,25823,The Sex and Violence Family Hour,(no genres listed),126186,The Sex and Violence Family Hour (1983),1983,1,1.5,7.72016,3.517339
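
Notice that the `l2_predicted_rating` values are all about 3.52, which is essentially the offset we add back (5.0 - 1.474504 = 3.525496, from user 26556's rows). In other words, the regularized model's dot product for these movies is close to 0. Here's a deliberately simplified sketch of why that happens. It assumes every rater of a movie shares one user vector `u` and gave the same centered rating `y`, and it uses an arbitrary penalty strength `lam=1.0` chosen only to make the effect visible.

```python
def shrunk_prediction(n_ratings, y, user_norm_sq=1.0, lam=1.0):
    """Centered prediction u.m after ridge-fitting a movie vector m.

    Simplified setting: all n_ratings raters share the same user vector u and
    gave the same centered rating y. Minimizing
        n * (y - u.m)**2 + lam * ||m||**2
    gives u.m = y * n*||u||^2 / (n*||u||^2 + lam).
    """
    signal = n_ratings * user_norm_sq
    return y * signal / (signal + lam)


# The fewer ratings a movie has, the closer its prediction stays to 0,
# i.e. to the dataset mean once the offset is added back.
for n in [1, 2, 4, 100, 10000]:
    print(n, round(shrunk_prediction(n, y=1.0), 4))
```

In this toy setting, a movie with thousands of ratings is barely affected by the penalty, while a movie with a single rating only gets halfway to the rating it was given.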


`</exercise>`

# Scratch space below - please ignore

```
TODO
- (maybe) look at dist. of weights of r12n model vs. prev model
- (maybe) load another pretrained model with even stronger r12n. Compare reccs. What's the behaviour in the limit as we increase r12n strength?
- Look at reccs for a few other users with fairly clear, distinctive tastes. e.g...
    112287 (American Beauty, The Notebook, Mean Girls, The Devil Wears Prada)
    69106 (Terminator 2, Toy Story, WALL-E, Days of Thunder, Star Wars, The Matrix)
    83421 (The Godfather, The Shawshank Redemption, Casino, Casablanca)
```

In [None]:
assert False

In [None]:
thresh = 500
movies_df[movies_df.n_ratings > thresh]\
    .sort_values(by='predicted_rating', ascending=False).head(5)

In [None]:
w, = model.get_layer('movie_embedding').get_weights()

w[:10]

In [None]:
w, = model.get_layer('user_embedding').get_weights()

w[1:10]

In [None]:
model.summary()

In [None]:
wm, = model.get_layer('movie_embedding').get_weights()
wu, = model.get_layer('user_embedding').get_weights()

print(
    np.linalg.norm(wm[0]),
    np.linalg.norm(wm[1]),
    np.linalg.norm(wm[2]),
)

In [None]:
norms = np.linalg.norm(wm, axis=1)
ns = pd.Series(norms)


unorms = np.linalg.norm(wu, axis=1)
nus = pd.Series(unorms)
display(
    "Dist of movie norms:",
    ns.describe(),
    "Dist of user norms:",
    nus.describe(),
)

In [None]:
from IPython.display import display
wm, = model.get_layer('movie_embedding').get_weights()
wu, = model.get_layer('user_embedding').get_weights()

quantiles = [.1, .25, .4, .5, .6, .75, .9]

display(
    "Movie embedding weights:",
    pd.Series(wm.flatten()).describe(quantiles),
    "User embedding weights:",
    pd.Series(wu.flatten()).describe(quantiles),
)

In [None]:
from IPython.display import display
wm, = model.get_layer('movie_embedding').get_weights()
wu, = model.get_layer('user_embedding').get_weights()
bm, = model.get_layer('movie_bias').get_weights()
bu, = model.get_layer('user_bias').get_weights()

quantiles = [.1, .25, .4, .5, .6, .75, .9]

display(
    pd.Series(wm.flatten()).describe(quantiles),
    pd.Series(wu.flatten()).describe(quantiles),
    pd.Series(bm.flatten()).describe(quantiles),
    pd.Series(bu.flatten()).describe(quantiles),
)

In [None]:
# Uggggh. Remember to do this before any training experiments.
df = df.sample(frac=1, random_state=1)

In [None]:
# XXX: New experiment - dropout?
movie_embedding_size = user_embedding_size = 32
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')
movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
dropout = .2
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
user_embedded = keras.layers.Dropout(dropout)(user_embedded)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)
movie_embedded = keras.layers.Dropout(dropout)(movie_embedded)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)
model.compile(
    tf.train.AdamOptimizer(0.001),
    loss='MSE',
    metrics=['MAE'],
)

tf.set_random_seed(1); np.random.seed(1); random.seed(1)
history = model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=10,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=15,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=10,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.save('movie_svd_model_32_dropout.h5')

In [None]:
movie_embedding_size = user_embedding_size = 32
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')
movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
#movie_r12n = user_r12n = None # zzz
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

biases = 0
if biases:
    bias_r12n = None
    bias_r12n = keras.regularizers.l1_l2(l1=1e-4, l2=1e-7) # XXX 1e-6 -> 1e-4
    bias_init = 'zeros'
    movie_b = keras.layers.Embedding(df.movieId.max()+1, 1, 
                                             name='movie_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(movie_id_input)
    movie_b = keras.layers.Flatten()(movie_b)
    #out = keras.layers.Add()([movie_b, out])

    user_b = keras.layers.Embedding(df.userId.max()+1, 1, 
                                             name='user_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(user_id_input)
    user_b = keras.layers.Flatten()(user_b)
    out = keras.layers.Add()([user_b, movie_b, out])

model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)
model.compile(
    tf.train.AdamOptimizer(0.001), # XXX: Try lower?
    loss='MSE',
    metrics=['MAE'],
)
#model.summary()

tf.set_random_seed(1); np.random.seed(1); random.seed(1)
history = model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=10,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.save('movie_svd_model_8_r12n.h5')

In [None]:
model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=8,
    verbose=2,
    validation_split=.05,
);

In [None]:
# Okay, this is excellent. Biases seem shockingly low. 
# I wonder if accuracy is much affected by just taking them away?
# XXX: lower lr experiment
movie_embedding_size = user_embedding_size = 16 # XXX: Tested with 8 (and worked well there). idk if 
# r12n might need to go up when increasing the number of parameters like this?

# Each instance will consist of two inputs: a single user id, and a single movie id
user_id_input = keras.Input(shape=(1,), name='user_id')
movie_id_input = keras.Input(shape=(1,), name='movie_id')
movie_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-6)
user_r12n = keras.regularizers.l1_l2(l1=0, l2=1e-7)
#movie_r12n = user_r12n = None # zzz
user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                       embeddings_initializer='glorot_uniform',
                                       embeddings_regularizer=user_r12n,
                                       input_length=1, name='user_embedding')(user_id_input)
movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size, 
                                        embeddings_initializer='glorot_uniform',
                                        embeddings_regularizer=movie_r12n,
                                        input_length=1, name='movie_embedding')(movie_id_input)

dotted = keras.layers.Dot(2)([user_embedded, movie_embedded])
out = keras.layers.Flatten()(dotted)

biases = 1
if biases:
    bias_r12n = None
    bias_r12n = keras.regularizers.l1_l2(l1=1e-4, l2=1e-7) # XXX 1e-6 -> 1e-4
    bias_init = 'zeros'
    movie_b = keras.layers.Embedding(df.movieId.max()+1, 1, 
                                             name='movie_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(movie_id_input)
    movie_b = keras.layers.Flatten()(movie_b)
    #out = keras.layers.Add()([movie_b, out])

    user_b = keras.layers.Embedding(df.userId.max()+1, 1, 
                                             name='user_bias',
                                             embeddings_initializer=bias_init,
                                             embeddings_regularizer=bias_r12n,
                                            )(user_id_input)
    user_b = keras.layers.Flatten()(user_b)
    out = keras.layers.Add()([user_b, movie_b, out])

model = keras.Model(
    inputs = [user_id_input, movie_id_input],
    outputs = out,
)
model.compile(
    tf.train.AdamOptimizer(0.001), # XXX: Try lower?
    loss='MSE',
    metrics=['MAE'],
)
#model.summary()

tf.set_random_seed(1); np.random.seed(1); random.seed(1)
history = model.fit(
    [df.userId, df.movieId],
    df.y,
    batch_size=10**4,
    epochs=60,
    verbose=2,
    validation_split=.05,
);

In [None]:
model.save('movie_svd_model_16.h5')

In [None]:
pd.Series(preds.flatten()).describe([.05, .1, .25, .5, .75, .9, .95])

These reccos suck. OTOH, this is a great motivator/lead-in to r12n.

# Scratch space / brainstorm

Idea: focus on r12n? (In the same way that I was thinking of having ex 1 be walking through adding biases step by step.)

Are there other datasets I could have them do factorization on? (Could then be fun to use those learned embeddings for next two exercises)

- million song dataset https://www.kaggle.com/c/msdchallenge
- goodreads https://www.kaggle.com/zygmunt/goodbooks-10k/home <--- this one looks promising

## Recommendations

This technique makes it easier to generate recommendations for a particular user. Let's try it. (I guess this is a point where having some understanding of the matrix factorization aspect would be useful.)

To make recommendations for user u...
- could individually calculate predicted scores for every movie in the dataset, but that'd be pretty tedious.
- multiply user vector by weight matrix to get a column vector with predicted scores per movie. Sort.
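
The second option in the list above can be sketched in a few lines of NumPy. The random matrices are toy stand-ins for the learned weights (in this notebook, what `model.get_layer('movie_embedding').get_weights()` returns), and the sketch assumes a pure dot-product model with no bias terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, dim = 4, 6, 8
# Toy embedding matrices standing in for the learned weights.
user_emb = rng.normal(size=(n_users, dim))
movie_emb = rng.normal(size=(n_movies, dim))

uid, k = 2, 3
# One matrix-vector product scores every movie for this user at once.
scores = movie_emb @ user_emb[uid]

# Same numbers as computing each dot product individually.
one_by_one = np.array([user_emb[uid] @ movie_emb[m] for m in range(n_movies)])
assert np.allclose(scores, one_by_one)

# Indices of the k highest-scoring movies, best first.
top_k = np.argsort(-scores)[:k]

# The full predicted ratings matrix is the product of the two factors,
# which is where the name "matrix factorization" comes from.
full = user_emb @ movie_emb.T
print(top_k, full.shape)
```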

## convergence properties (thought experiment)

Compare the loss over time of some DNN models like the ones we trained yesterday vs. some factorization models. Do you notice a difference? Can you think of why this would be?

## factorization vs. dnn (thought experiment)

We seem to be getting better results with our factorization model. But can you think of situations where you would want to use the dnn model instead?

## r12n

Look at train vs. val loss for different embedding sizes. How could we prevent overfitting while keeping a larger embedding size? 