# Purpose:
This notebook contains the code that is used to generate the weights for the three "Score prediction" regression neural nets.

Each net is trained on a different type of score (the raw score, the user-scaled score, and the anime-scaled score). The final output of this notebook is the saved Keras weights, which are used in the final predictor.



In [9]:
!nvidia-smi

Thu Sep  7 19:52:42 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 20%   43C    P0    62W / 250W |     10MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage    

In [10]:
import numpy as np
from keras import backend as K
from keras.layers import Input, Embedding, merge
import keras.layers
from keras.regularizers import l2, l1
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.models import Sequential, Model
from keras.optimizers import SGD, RMSprop, Adam
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
#import pydotplus as pydot 
#import graphviz
from keras.utils import plot_model
from sklearn.metrics import mean_absolute_error

In [11]:
import pandas as pd

In [12]:
# Load the training and test ratings.
ratings_score_train = pd.read_csv('mal_scores_train_nonzero_v2.csv')
# In the v2 version of this data, every "test" user is also present in the training data.
ratings_score_test = pd.read_csv('mal_scores_test_nonzero_v2.csv')
# In prior iterations of this model, the unscored ratings were used to impute scores in a stacked neural net. This did not improve the final recommendations.
#ratings_no_score_train = pd.read_csv('mal_scores_train_zero.csv')

At this iteration, we can load dictionaries that map each user and each anime to its embedding index. The code needed to regenerate those dictionaries is commented out below.


In [13]:
# The user and anime embedding indices from prior parts of the model are stored in these dictionaries,
# so that each anime keeps the same embedding index when it is loaded into the recommendation engine.
# allow_pickle=True is needed on newer NumPy versions to load a pickled dict.
userid2idx = np.load("user.npy", allow_pickle=True).item()
animeid2idx = np.load("anime.npy", allow_pickle=True).item()
# Code to rebuild the dictionaries:
#userid2idx = {o:i for i,o in enumerate(users)}
#animeid2idx = {o:i for i,o in enumerate(animes)} # remove missing anime numbers and re-order
#np.save("user.npy", userid2idx)
#np.save("anime.npy", animeid2idx)
#userid mapping
#users1 = ratings_score_train.userid.unique()
#users2 = ratings_score_test.userid.unique()
#users3 = ratings_no_score_train.userid.unique()
#animeid mapping
#anime1= ratings_score_train.animeid.unique()
#anime2=ratings_score_test.animeid.unique()
#anime3=ratings_no_score_train.animeid.unique()
#animes = set(anime1).union(set(anime2)).union(set(anime3))
#users = set(users1).union(set(users2)).union(set(users3))
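
The commented-out code above can be read as the following sketch of how such a dictionary is built: collect the unique raw ids, then give each one a contiguous integer index that an embedding layer can use as a row number. The ids below are made up for illustration, and sorting the ids first is an assumption added here to make the mapping reproducible; the original code enumerates the set directly.

```python
# Toy raw ids; the real ones come from the userid/animeid columns.
train_anime_ids = [5114, 20, 9253, 20]
test_anime_ids = [9253, 1535]

# Union of every id seen in any split, as in the commented-out code.
animes = set(train_anime_ids).union(set(test_anime_ids))

# Sorting is an assumption made here so the mapping is deterministic.
animeid2idx = {o: i for i, o in enumerate(sorted(animes))}

# Translate raw ids into embedding row indices.
train_anime_idx = [animeid2idx[a] for a in train_anime_ids]

print(animeid2idx)
print(train_anime_idx)
```

Saving the dictionary once and reloading it everywhere, as the notebook does, keeps the index of each anime stable between training and the recommendation engine.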


In [14]:
n_users = len(userid2idx)
n_animes = len(animeid2idx)

In [16]:
ratings_score_test['anime_id_emb'] = ratings_score_test.animeid.apply(lambda x: animeid2idx[x])
ratings_score_train['anime_id_emb'] = ratings_score_train.animeid.apply(lambda x: animeid2idx[x])
#ratings_no_score_train['anime_id_emb'] = ratings_no_score_train.animeid.apply(lambda x: animeid2idx[x])

In [17]:
n_users

230962

In [18]:
n_animes

12873

In [19]:
ratings_score_test['user_id_emb'] = ratings_score_test.userid.apply(lambda x: userid2idx[x])
ratings_score_train['user_id_emb'] = ratings_score_train.userid.apply(lambda x: userid2idx[x])
#ratings_no_score_train['user_id_emb'] = ratings_no_score_train.userid.apply(lambda x: userid2idx[x])

In [20]:
n_factors = 36 #changing this number changes how many hidden factors each user and each anime is transformed into. 
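
To make the role of `n_factors` concrete, here is a minimal NumPy sketch of what an embedding lookup does: each user index and each anime index selects one row of `n_factors` numbers from a table. The table sizes and random values are placeholders, not the trained weights.

```python
import numpy as np

n_factors = 36
n_users_toy, n_animes_toy = 10, 7  # toy sizes, not the real counts

rng = np.random.RandomState(0)
# One row of n_factors values per user / per anime (random stand-ins
# for weights that training would learn).
user_table = rng.normal(scale=0.05, size=(n_users_toy, n_factors))
anime_table = rng.normal(scale=0.05, size=(n_animes_toy, n_factors))

user_idx = np.array([3, 3, 8])   # a batch of three (user, anime) pairs
anime_idx = np.array([0, 5, 5])

u = user_table[user_idx]         # shape (3, n_factors)
a = anime_table[anime_idx]       # shape (3, n_factors)

# Concatenating the two vectors gives 2 * n_factors features per pair.
features = np.concatenate([u, a], axis=1)
print(features.shape)
```

The same index always returns the same row, which is why the id-to-index dictionaries have to stay fixed once weights have been trained.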

## Data Cleaning:
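After some iteration, the final recommendations improved when certain outliers were removed from the training data: users with 25 or fewer reviews or 1500 or more reviews, rows whose user-scaled score is exactly 0, and anime with 50,000 or more reviews.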

In [21]:
ratings_score_train=ratings_score_train[(ratings_score_train['user_rev_count']<1500) & (ratings_score_train['user_rev_count']>25)]

In [22]:
ratings_score_train=ratings_score_train[ratings_score_train['score_usr_scaled']!=0]

In [23]:
ratings_score_train = ratings_score_train[ratings_score_train['anime_rev_count']<50000]# let's remove the absolutely most common.
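
The three filters above can be tried on a small frame to check which rows survive. This is an illustrative sketch: the rows are invented, and only the column names and thresholds are taken from the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_rev_count':   [10, 300, 300, 2000, 300],
    'anime_rev_count':  [1000, 1000, 60000, 1000, 1000],
    'score_usr_scaled': [0.5, -1.2, 0.3, 0.8, 0.0],
})

keep = (
    # drop very light and very heavy raters
    (toy['user_rev_count'] > 25) & (toy['user_rev_count'] < 1500)
    # drop rows whose scaled score is exactly 0
    & (toy['score_usr_scaled'] != 0)
    # drop the most-reviewed anime
    & (toy['anime_rev_count'] < 50000)
)
cleaned = toy[keep]
print(cleaned)
```

Only the second row passes all three conditions; each of the other rows fails exactly one of them.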

In [24]:
def embedding_input_anime1(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg),name='Embed_Anime_Hidden_Factors')(inp)
def embedding_input_user1(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg),name='Embed_User_Hidden_Factors')(inp)

user_in1, u1 = embedding_input_user1('user_id_in', n_users+15, n_factors, 1e-5)
anime_in1, a1 = embedding_input_anime1('anime_id_in', n_animes, n_factors, 0)

  


In [25]:
# nn1: trained on shows of every status (no status filter). It predicts the raw score.
x = merge([u1, a1], mode='concat', name='All_Factors_on_one_layer')
x = Flatten()(x)
#x = Dropout(0.55, name='Prevent_overfit')(x)
x = Dense(70, activation='relu',name='Random_HF_Interactions')(x)
x = Dropout(0.55,name='Prevent_overfit2')(x)
x = Dense(16, activation='relu',name='Random_HF_Interactions2')(x)
x = Dropout(0.1, name='Prevent_overfit')(x)
x = Dense(1,name='Final_Interactions')(x)
nn1 = Model([user_in1, anime_in1], x)
nn1.compile(Adam(0.001), loss='mse')
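
As a rough check on the size of this network, the trainable parameter count can be worked out by hand from the layer sizes above. The sketch assumes the standard counts for embedding layers (rows times columns) and dense layers (inputs times units, plus one bias per unit); `nn1.summary()` would report the authoritative numbers.

```python
# Values printed earlier in the notebook.
n_users, n_animes, n_factors = 230962, 12873, 36

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

user_emb = (n_users + 15) * n_factors   # the notebook pads the user table by 15 rows
anime_emb = n_animes * n_factors
dense = (dense_params(2 * n_factors, 70)
         + dense_params(70, 16)
         + dense_params(16, 1))

total = user_emb + anime_emb + dense
print(user_emb, anime_emb, dense, total)
```

Under these assumptions nearly all of the parameters sit in the user embedding table; the dense layers contribute only a few thousand.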

  


In [26]:
nn1.fit([ratings_score_train.user_id_emb, ratings_score_train.anime_id_emb], ratings_score_train.score, batch_size=5120, epochs=8, 
          validation_data=([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb], ratings_score_test.score))

Train on 9771714 samples, validate on 1210873 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f144940b978>

In [27]:
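# Lower the learning rate. Assigning nn1.lr only sets an attribute on the model and does not reach the optimizer.
K.set_value(nn1.optimizer.lr, 0.0005)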

In [28]:
nn1.fit([ratings_score_train.user_id_emb, ratings_score_train.anime_id_emb], ratings_score_train.score, batch_size=10120, epochs=8, 
          validation_data=([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb], ratings_score_test.score))

Train on 9771714 samples, validate on 1210873 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f13de3a4f98>

In [29]:
nn1pred= nn1.predict([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb])
nn1targ = ratings_score_test.score.values
mean_absolute_error(nn1targ, nn1pred) 

0.98166690871267037
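
`predict` returns an array of shape `(n, 1)` here, while the targets have shape `(n,)`. `mean_absolute_error` handles that mismatch, but a hand-rolled version needs the predictions flattened first; otherwise NumPy broadcasting produces an `(n, n)` matrix of differences. A small sketch with made-up numbers:

```python
import numpy as np

targ = np.array([7.0, 9.0, 4.0])        # shape (3,), like ratings_score_test.score.values
pred = np.array([[6.5], [8.0], [6.0]])  # shape (3, 1), like the output of predict

# Flatten the predictions so the subtraction is element-wise.
mae = np.mean(np.abs(targ - pred.ravel()))

# Without ravel(), broadcasting compares every target with every prediction.
wrong_shape = np.abs(targ - pred).shape

print(mae, wrong_shape)
```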

In [30]:
nn1.save_weights('nn_score_weights.h5')

In [30]:
#plot_model(nn1, to_file='score.png')
#plot_model(nn22, to_file='user_score.png')
#plot_model(nn23, to_file='anime_score.png')

## The User_Scaled_Score branch:
Data prep: keep only the scores for shows whose status is COMPLETED or DROPPED.


In [21]:
ratings_score_train_complete= ratings_score_train[(ratings_score_train['status']=='COMPLETED') | (ratings_score_train['status']=='DROPPED')]

In [22]:
#ratings_score_train_complete= ratings_score_train # The alternative to the above cell to train the model on shows of all status. 

In [33]:
def embedding_input_anime22(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg),name='Embed_Anime_Hidden_Factors')(inp)
def embedding_input_user22(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg),name='Embed_User_Hidden_Factors')(inp)

user_in22, u22 = embedding_input_user22('user_id_in', n_users+15, n_factors, 1e-5)
anime_in22, a22 = embedding_input_anime22('anime_id_in', n_animes, n_factors, 0)
# nn22: trained only on COMPLETED or DROPPED shows. It predicts the user-scaled score.
x = merge([u22, a22], mode='concat', name='All_Factors_on_one_layer')
x = Flatten()(x)
#x = Dropout(0.55, name='Prevent_overfit')(x)
x = Dense(70, activation='relu',name='Random_HF_Interactions')(x)
x = Dropout(0.55,name='Prevent_overfit2')(x)
x = Dense(16, activation='relu',name='Random_HF_Interactions2')(x)
x = Dropout(0.1, name='Prevent_overfit')(x)
x = Dense(1,name='Final_Interactions')(x)
nn22 = Model([user_in22, anime_in22], x)
nn22.compile(Adam(0.001), loss='mse')

  


In [34]:
nn22.fit([ratings_score_train_complete.user_id_emb, ratings_score_train_complete.anime_id_emb], ratings_score_train_complete.score_usr_scaled, batch_size=5120, epochs=8, 
          validation_data=([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb], ratings_score_test.score_usr_scaled))

Train on 9240364 samples, validate on 1210873 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f9984cc7780>

In [35]:
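# Lower the learning rate. Assigning nn22.lr only sets an attribute on the model and does not reach the optimizer.
K.set_value(nn22.optimizer.lr, 0.0005)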

In [36]:
nn22.fit([ratings_score_train_complete.user_id_emb, ratings_score_train_complete.anime_id_emb], ratings_score_train_complete.score_usr_scaled, batch_size=20120, epochs=4, 
          validation_data=([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb], ratings_score_test.score_usr_scaled))

Train on 9240364 samples, validate on 1210873 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9986309e80>

In [37]:
nn22.save_weights('nn_score_usr_weights.h5')

In [38]:
nn22pred= nn22.predict([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb])
nn22targ = ratings_score_test.score_usr_scaled.values
mean_absolute_error(nn22targ, nn22pred)

0.93386003456007916

In [39]:
def embedding_input_anime23(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg),name='Embed_Anime_Hidden_Factors')(inp)
def embedding_input_user23(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg),name='Embed_User_Hidden_Factors')(inp)

user_in23, u23 = embedding_input_user23('user_id_in', n_users+15, n_factors, 0)
anime_in23, a23 = embedding_input_anime23('anime_id_in', n_animes, n_factors, 0)
# nn23: trained only on COMPLETED or DROPPED shows. It predicts the anime-scaled score.
x = merge([u23, a23], mode='concat', name='All_Factors_on_one_layer')
x = Flatten()(x)
#x = Dropout(0.55, name='Prevent_overfit')(x)
x = Dense(70, activation='relu',name='Random_HF_Interactions')(x)
x = Dropout(0.55,name='Prevent_overfit2')(x)
x = Dense(16, activation='relu',name='Random_HF_Interactions2')(x)
x = Dropout(0.1, name='Prevent_overfit')(x)
x = Dense(1,name='Final_Interactions')(x)
nn23 = Model([user_in23, anime_in23], x)
nn23.compile(Adam(0.001), loss='mse')

  


In [40]:
nn23.fit([ratings_score_train_complete.user_id_emb, ratings_score_train_complete.anime_id_emb], ratings_score_train_complete.score_anime_scaled, batch_size=5120, epochs=8, 
          validation_data=([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb], ratings_score_test.score_anime_scaled))

Train on 9240364 samples, validate on 1210873 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f998a39d780>

In [41]:
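# Lower the learning rate. Assigning nn23.lr only sets an attribute on the model and does not reach the optimizer.
K.set_value(nn23.optimizer.lr, 0.0005)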

In [42]:
nn23.fit([ratings_score_train_complete.user_id_emb, ratings_score_train_complete.anime_id_emb], ratings_score_train_complete.score_anime_scaled, batch_size=20120, epochs=4, 
          validation_data=([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb], ratings_score_test.score_anime_scaled))

Train on 9240364 samples, validate on 1210873 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f99995a6b38>

In [43]:
nn23.save_weights('nn_score_anime_weights.h5')

In [44]:
nn23pred= nn23.predict([ratings_score_test.user_id_emb, ratings_score_test.anime_id_emb])
nn23targ = ratings_score_test.score_anime_scaled.values
mean_absolute_error(nn23targ, nn23pred)

0.96388360887655522