<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Load-and-look-into-the-data" data-toc-modified-id="1.-Load-and-look-into-the-data-1">1. Load and look into the data</a></span></li><li><span><a href="#2.-Encode-the-lable" data-toc-modified-id="2.-Encode-the-lable-2">2. Encode the label</a></span></li><li><span><a href="#3.-Generate-feature-columns" data-toc-modified-id="3.-Generate-feature-columns-3">3. Generate feature columns</a></span></li><li><span><a href="#4.--Generate-the-training-samples" data-toc-modified-id="4.--Generate-the-training-samples-4">4.  Generate the training samples</a></span></li><li><span><a href="#5.-Train-the-model" data-toc-modified-id="5.-Train-the-model-5">5. Train the model</a></span><ul class="toc-item"><li><span><a href="#5.1-Train" data-toc-modified-id="5.1-Train-5.1">5.1 Train</a></span></li><li><span><a href="#5.2-Predict" data-toc-modified-id="5.2-Predict-5.2">5.2 Predict</a></span></li></ul></li></ul></div>

In [1]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat,get_feature_names

## 1. Load and look into the data

In [2]:
data = pd.read_csv("movielens_sample.txt")
sparse_features = ["movie_id", "user_id", "gender", "age", "occupation", "zip"]
target = ['rating'] 

- Get the dimensions of the data set
- Count how often each rating value occurs
- Show the first 5 rows of the data set

In [3]:
print("dimension of movielens:",data.shape)
print("value counts of rating:\n",data['rating'].value_counts())
data.head(5)

dimension of movielens: (200, 10)
value counts of rating:
 4    65
3    54
5    48
2    23
1    10
Name: rating, dtype: int64


Unnamed: 0,user_id,movie_id,rating,timestamp,title,genres,gender,age,occupation,zip
0,3299,235,4,968035345,Ed Wood (1994),Comedy|Drama,F,25,4,19119
1,3630,3256,3,966536874,Patriot Games (1992),Action|Thriller,M,18,4,77005
2,517,105,4,976203603,"Bridges of Madison County, The (1995)",Drama|Romance,F,25,14,55408
3,785,2115,3,975430389,Indiana Jones and the Temple of Doom (1984),Action|Adventure,M,18,19,29307
4,5848,909,5,957782527,"Apartment, The (1960)",Comedy|Drama,M,50,20,20009


## 2. Encode the label

There are two common ways to encode sparse categorical features for embedding:
- Label Encoding: map each distinct feature value to an integer from 0 to n_unique - 1
- Hash Encoding: map the feature values into a fixed range, such as 0 to 9999. There are two ways to do that:
    - Do feature hashing before training
    - Do feature hashing on the fly during training

Here we use the first method, label encoding.
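
To make the difference between the two encodings concrete, here is a minimal standard-library sketch on a toy column. The sample values, the `NUM_BUCKETS` constant, and the choice of `zlib.crc32` as the hash function are assumptions made for illustration; they are not taken from this notebook or from DeepCTR.

```python
import zlib

# Toy categorical column (hypothetical values, with one repeat).
zips = ["19119", "77005", "55408", "29307", "20009", "77005"]

# Label encoding: each distinct value gets an index in 0..n_unique-1.
# Sorting the unique values mirrors how scikit-learn's LabelEncoder orders classes.
classes = sorted(set(zips))
label_index = {value: i for i, value in enumerate(classes)}
label_encoded = [label_index[v] for v in zips]

# Hash encoding: map each value into a fixed number of buckets.
# crc32 is stable across runs, unlike Python's built-in hash() for strings.
NUM_BUCKETS = 10000  # assumed bucket count, matching the "0 to 9999" range
hash_encoded = [zlib.crc32(v.encode("utf-8")) % NUM_BUCKETS for v in zips]

print("label encoded:", label_encoded)
print("hash encoded: ", hash_encoded)
```

Label encoding needs the full set of values up front and never produces collisions. Hash encoding needs no vocabulary, but two different values can land in the same bucket.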

In [4]:
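for feature in sparse_features:
    # Fit a fresh LabelEncoder per column: values become integers 0..n_unique-1.
    # (For hash encoding, a hashing encoder would be substituted here.)
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature])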

## 3. Generate feature columns

In [5]:
fixlen_feature_columns = [SparseFeat(feature, data[feature].nunique()) 
                          for feature in sparse_features]

for feature_column in fixlen_feature_columns:
    print(feature_column,'\n')

SparseFeat(name='movie_id', vocabulary_size=187, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.initializers_v1.RandomNormal object at 0x000001E900BDABC8>, embedding_name='movie_id', group_name='default_group', trainable=True) 

SparseFeat(name='user_id', vocabulary_size=193, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.initializers_v1.RandomNormal object at 0x000001E900BDA508>, embedding_name='user_id', group_name='default_group', trainable=True) 

SparseFeat(name='gender', vocabulary_size=2, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.initializers_v1.RandomNormal object at 0x000001E900BDA148>, embedding_name='gender', group_name='default_group', trainable=True) 

SparseFeat(name='age', vocabulary_size=7, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.ker

In [6]:
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
print('feature names:',feature_names)

feature names: ['movie_id', 'user_id', 'gender', 'age', 'occupation', 'zip']
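
Passing `data[feature].nunique()` as the vocabulary size relies on the label encoding done earlier: an embedding table with `vocabulary_size` rows can only be indexed by ids from 0 to `vocabulary_size - 1`. The sketch below illustrates that condition on toy lists. The helper `ids_fit_vocabulary` and the sample values are hypothetical and not part of DeepCTR.

```python
def ids_fit_vocabulary(encoded_ids, vocabulary_size):
    """Return True if every id is a valid row index for a table of that size."""
    return min(encoded_ids) >= 0 and max(encoded_ids) < vocabulary_size

# A label-encoded column: ids are 0..n_unique-1, so the unique count is a safe size.
encoded_gender = [0, 1, 1, 0, 1]
gender_ok = ids_fit_vocabulary(encoded_gender, len(set(encoded_gender)))

# A raw (not label-encoded) column: 3 distinct values, but the largest is 50,
# so the unique count alone would be too small as a vocabulary size.
raw_age = [25, 18, 25, 18, 50]
age_ok = ids_fit_vocabulary(raw_age, len(set(raw_age)))

print("encoded column fits:", gender_ok)
print("raw column fits:    ", age_ok)
```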


## 4.  Generate the training samples

In [7]:
train, test = train_test_split(data, test_size=0.2)
train_model_input = {name:train[name].values for name in feature_names}
test_model_input = {name:test[name].values for name in feature_names}

type(train_model_input) # train_model_input is a dictionary

dict

## 5. Train the model
### 5.1 Train

In [8]:
model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
model.compile("adam", "mse", metrics=['mse'])
history = model.fit(train_model_input, 
                    train[target].values, 
                    batch_size=256, 
                    epochs=1, 
                    verbose=True, 
                    validation_split=0.2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "




### 5.2 Predict

In [9]:
pred_ans = model.predict(test_model_input, batch_size=256)

mse = round(mean_squared_error(test[target].values, pred_ans), 4)
rmse = round(mse ** 0.5, 4)
print("test RMSE", rmse)
print("test MSE", mse)

test RMSE 3.7893
test MSE 14.3591
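
To put these numbers in context, the sketch below computes two trivial reference points from the rating counts printed in section 1: the error of always predicting the mean rating, and the error of always predicting 0. This is a rough comparison under a stated assumption. The counts cover all 200 rows, not only the random test split, so the figures are only indicative.

```python
# Rating counts copied from the value_counts() output in section 1 (all 200 rows).
rating_counts = {4: 65, 3: 54, 5: 48, 2: 23, 1: 10}

n = sum(rating_counts.values())
mean_rating = sum(r * c for r, c in rating_counts.items()) / n

# MSE of always predicting the mean rating (this equals the variance of the ratings).
baseline_mse = sum(c * (r - mean_rating) ** 2 for r, c in rating_counts.items()) / n
baseline_rmse = baseline_mse ** 0.5

# MSE of always predicting 0, i.e. the mean squared rating.
zero_mse = sum(c * r ** 2 for r, c in rating_counts.items()) / n

print("mean rating:", round(mean_rating, 4))
print("constant-mean baseline MSE:", round(baseline_mse, 4))
print("constant-mean baseline RMSE:", round(baseline_rmse, 4))
print("predict-zero MSE:", round(zero_mse, 4))
```

The reported test MSE of 14.3591 is much closer to the predict-zero figure (about 14.14) than to the constant-mean baseline (about 1.25). That suggests a single epoch is not enough for the model to learn even the overall rating level, so training for more epochs would likely lower the error.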
