# Introduction to Recommender Systems

<p align="center">
    <img width="721" alt="cover-image" src="https://user-images.githubusercontent.com/49638680/204351915-373011d3-75ac-4e21-a6df-99cd1c552f2c.png">
</p>

---

# Final Project

My name is Alexis Petignat, and this is my final project for the Recommender System course.

## Exploratory Data Analysis

Exploratory data analysis (EDA) should come before building a recommender system. It is a good way to understand the data and to draw insights from it. The main goal is to answer several questions about the data; these answers should provide the numerical justification for the choices made in the following steps, in particular feature engineering and model selection.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# set plot size
plt.rcParams["figure.figsize"] = (20, 13)
%matplotlib inline
%config InlineBackend.figure_format = "retina"

## 1. Data Preprocessing

In this part, we start by loading the data and making sure it is ready to be used and analysed.

### 1.1 Load the data

Let us start by loading the data!

In [2]:
interactions = pd.read_csv("data_final_project/KuaiRec/data/small_matrix.csv")

interactions.head()

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,date,timestamp,watch_ratio
0,14,148,4381,6067,2020-07-05 05:27:48.378,20200705.0,1593898000.0,0.722103
1,14,183,11635,6100,2020-07-05 05:28:00.057,20200705.0,1593898000.0,1.907377
2,14,3649,22422,10867,2020-07-05 05:29:09.479,20200705.0,1593898000.0,2.063311
3,14,5262,4479,7908,2020-07-05 05:30:43.285,20200705.0,1593898000.0,0.566388
4,14,8234,4602,11000,2020-07-05 05:35:43.459,20200705.0,1593899000.0,0.418364


In [3]:
interactions_big = pd.read_csv("data_final_project/KuaiRec/data/big_matrix.csv")

interactions_big.head()

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,date,timestamp,watch_ratio
0,0,3649,13838,10867,2020-07-05 00:08:23.438,20200705,1593879000.0,1.273397
1,0,9598,13665,10984,2020-07-05 00:13:41.297,20200705,1593879000.0,1.244082
2,0,5262,851,7908,2020-07-05 00:16:06.687,20200705,1593879000.0,0.107613
3,0,1963,862,9590,2020-07-05 00:20:26.792,20200705,1593880000.0,0.089885
4,0,8234,858,11000,2020-07-05 00:43:05.128,20200705,1593881000.0,0.078


In [4]:
videos = pd.read_csv("data_final_project/KuaiRec/data/item_daily_features.csv")

videos.head()

Unnamed: 0,video_id,date,author_id,video_type,upload_dt,upload_type,visible_status,video_duration,video_width,video_height,...,download_cnt,download_user_num,report_cnt,report_user_num,reduce_similar_cnt,reduce_similar_user_num,collect_cnt,collect_user_num,cancel_collect_cnt,cancel_collect_user_num
0,0,20200705,3309,NORMAL,2020-03-30,ShortImport,public,5966.0,720,1280,...,8,8,0,0,3,3,,,,
1,0,20200706,3309,NORMAL,2020-03-30,ShortImport,public,5966.0,720,1280,...,2,2,0,0,5,5,,,,
2,0,20200707,3309,NORMAL,2020-03-30,ShortImport,public,5966.0,720,1280,...,2,2,0,0,0,0,,,,
3,0,20200708,3309,NORMAL,2020-03-30,ShortImport,public,5966.0,720,1280,...,3,3,0,0,3,3,,,,
4,0,20200709,3309,NORMAL,2020-03-30,ShortImport,public,5966.0,720,1280,...,2,2,2,1,1,1,,,,


In [5]:
users = pd.read_csv("data_final_project/KuaiRec/data/user_features.csv")

users.head()

Unnamed: 0,user_id,user_active_degree,is_lowactive_period,is_live_streamer,is_video_author,follow_user_num,follow_user_num_range,fans_user_num,fans_user_num_range,friend_user_num,...,onehot_feat8,onehot_feat9,onehot_feat10,onehot_feat11,onehot_feat12,onehot_feat13,onehot_feat14,onehot_feat15,onehot_feat16,onehot_feat17
0,0,high_active,0,0,0,5,"(0,10]",0,0,0,...,184,6,3,0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,full_active,0,0,0,386,"(250,500]",4,"[1,10)",2,...,186,6,2,0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,full_active,0,0,0,27,"(10,50]",0,0,0,...,51,2,3,0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,full_active,0,0,0,16,"(10,50]",0,0,0,...,251,3,2,0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,full_active,0,0,0,122,"(100,150]",4,"[1,10)",0,...,99,4,2,0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
tags = pd.read_csv("data_final_project/KuaiRec/data/item_categories.csv")

tags.head()

Unnamed: 0,video_id,feat
0,0,[8]
1,1,"[27, 9]"
2,2,[9]
3,3,[26]
4,4,[5]


In [7]:
friends = pd.read_csv("data_final_project/KuaiRec/data/social_network.csv")

friends.head()

Unnamed: 0,user_id,friend_list
0,3371,[2975]
1,24,[2665]
2,4402,[38]
3,4295,[4694]
4,7087,[7117]


### 1.2 Ensure Data Correctness

Try to check for missing values, duplicates, and other data quality issues, like impossible values, negative timestamps, etc.
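Before dropping anything, it can help to count how many rows each check would affect. The helper below is a minimal sketch of such a report; the name `quality_report` and the toy DataFrame are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, non_negative=()) -> dict:
    """Count common data-quality problems without modifying df."""
    report = {
        "rows": len(df),
        # Rows that contain at least one missing value
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }
    # Columns that should never be negative (timestamps, durations, ...)
    for column in non_negative:
        report[f"negative_{column}"] = int((df[column] < 0).sum())
    return report


# Hypothetical toy data: one duplicate, one missing timestamp, one negative timestamp
toy = pd.DataFrame({
    "user_id": [14, 14, 14, 19],
    "video_id": [148, 148, 183, 148],
    "timestamp": [1593898000.0, 1593898000.0, None, -1.0],
})
report = quality_report(toy, non_negative=["timestamp"])
print(report)
```

Running the same helper on `interactions`, `users` or `tags` before the cleaning cell would show how much data each `dropna` or `drop_duplicates` call removes.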

In [8]:
import ast

# I do the most simple thing, I remove the rows with missing values
interactions = interactions.dropna()
# I also remove the duplicates
interactions = interactions.drop_duplicates()
# I also remove the rows with negative timestamps
interactions = interactions[interactions["timestamp"] >= 0]


# I do the most simple thing, I remove the rows with missing values
interactions_big = interactions_big.dropna()
# I also remove the duplicates
interactions_big = interactions_big.drop_duplicates()
# I also remove the rows with negative timestamps
interactions_big = interactions_big[interactions_big["timestamp"] >= 0]

# I also remove the duplicates
videos = videos.drop_duplicates(subset="video_id")
# Replace all Nans by 0
videos = videos.fillna(0)

# I do the most simple thing, I remove the rows with missing values
users = users.dropna()
# I also remove the duplicates
users = users.drop_duplicates()

# I do the most simple thing, I remove the rows with missing values
tags = tags.dropna()
# I also remove the duplicates
tags = tags.drop_duplicates()

videos.head(100)


Unnamed: 0,video_id,date,author_id,video_type,upload_dt,upload_type,visible_status,video_duration,video_width,video_height,...,download_cnt,download_user_num,report_cnt,report_user_num,reduce_similar_cnt,reduce_similar_user_num,collect_cnt,collect_user_num,cancel_collect_cnt,cancel_collect_user_num
0,0,20200705,3309,NORMAL,2020-03-30,ShortImport,public,5966.0,720,1280,...,8,8,0,0,3,3,0.0,0.0,0.0,0.0
63,1,20200705,4978,NORMAL,2020-04-09,PictureSet,public,0.0,886,1015,...,17,11,0,0,13,12,0.0,0.0,0.0,0.0
126,2,20200705,939,NORMAL,2020-04-11,Kmovie,public,8000.0,720,1280,...,5,5,0,0,13,13,0.0,0.0,0.0,0.0
189,3,20200705,5889,NORMAL,2020-04-11,PictureSet,public,0.0,1080,1080,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0
252,4,20200705,4284,NORMAL,2020-04-12,ShortCamera,public,18000.0,720,1280,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5356,95,20200705,2383,NORMAL,2020-06-23,ShortImport,public,14014.0,720,1280,...,3,3,0,0,51,50,0.0,0.0,0.0,0.0
5419,96,20200705,7259,NORMAL,2020-06-23,ShortImport,public,6647.0,540,960,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0
5482,97,20200705,1011,NORMAL,2020-06-23,ShortImport,public,5733.0,540,960,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0
5503,98,20200705,2801,NORMAL,2020-06-23,ShortImport,public,7966.0,720,1280,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0


## 2. Data engineering

This part does not analyze the data; rather, it eases the later analysis by building meaningful features from the given raw data.

### 2.1 Meaningful features

In this part, we will build several features to ease the later analysis.


#### 2.1.1 Video features

The video features are a video's characteristics: its themes and categories. The names of these categories are not relevant here, so each category is represented by an integer. Since a video can have multiple tags, we represent its tags as a binary vector, which eases the later analysis and will prove useful for content-based filtering.

In [9]:
TRAIN_SIZE = 1000000

interactions_short = interactions[["user_id", "video_id", "watch_ratio"]]
interactions_big = interactions_big[["user_id", "video_id", "watch_ratio"]]
X_train = interactions_big.sample(n=TRAIN_SIZE, random_state=42)

X_test = interactions_short

In [10]:
from sklearn.preprocessing import MultiLabelBinarizer

# Binarize tags
mlb = MultiLabelBinarizer()
tags["feat"] = tags["feat"].apply(ast.literal_eval)
tags_matrix = mlb.fit_transform(tags["feat"])
df_tags = pd.DataFrame(tags_matrix, index=tags["video_id"], columns=mlb.classes_)
df_tags

video_id,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10724,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### 2.1.2 Like-view ratio

This one is self-explanatory: the ratio of likes to plays is a good indicator of how good a video is.

In [11]:
df_video_appreciation = videos[["video_id", "like_cnt", "play_cnt"]].copy()
like_ratio = df_video_appreciation["like_cnt"] / df_video_appreciation["play_cnt"]
df_video_appreciation["ratio"] = like_ratio
df_video_appreciation

Unnamed: 0,video_id,like_cnt,play_cnt,ratio
0,0,573,10141,0.056503
63,1,1748,19205,0.091018
126,2,244,45038,0.005418
189,3,132,1237,0.106710
252,4,1,95,0.010526
...,...,...,...,...
343336,10723,24,214,0.112150
343337,10724,264,965,0.273575
343338,10725,851,15487,0.054949
343339,10726,44,7859,0.005599
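Because `videos` was reduced to one row per `video_id` in cell 8 (`drop_duplicates(subset="video_id")`), the counts used above come from a single daily row per video. A possible variant, sketched below under that observation, sums the daily counts per video before dividing and guards against a `play_cnt` of zero. The first-day counts for videos 0 and 1 are taken from the table above; the other rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy daily features: two days for videos 0 and 1, one day without plays for video 2
daily = pd.DataFrame({
    "video_id": [0, 0, 1, 1, 2],
    "like_cnt": [573, 27, 1748, 52, 0],
    "play_cnt": [10141, 859, 19205, 795, 0],
})

# Sum the daily counts so that each video has one row covering all days
totals = daily.groupby("video_id")[["like_cnt", "play_cnt"]].sum()

# Replace zero plays by NaN before dividing, then report a ratio of 0 for those videos
safe_plays = totals["play_cnt"].replace(0, np.nan)
totals["ratio"] = (totals["like_cnt"] / safe_plays).fillna(0.0)
print(totals)
```

Whether a video without plays should get a ratio of 0 or be excluded is a design choice; 0 is used here only to keep the column numeric.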


### 2.2 Build the user-item interaction matrix

In this part, we will build the user-item interaction matrix. We will base our rating on the watch ratio of the video in the interaction data.


In [12]:
# Represent the actual grid
all_pairs = pd.MultiIndex.from_product(
    [users["user_id"], videos["video_id"]],
    names=["user_id", "video_id"]
).to_frame(index=False)

# Fill the values
df_complete = pd.merge(all_pairs, X_test, how="left", on=["user_id", "video_id"]) # We put X_test since X_train makes my machine crash

user_item_matrix = df_complete.pivot_table(
    index="user_id",     # rows
    columns="video_id",  # columns
    values="watch_ratio",     # ratio to fill
    fill_value=0         # If user did not see the video
)

user_item_matrix

user_id \ video_id,103,109,120,122,128,130,131,133,136,137,...,10430,10436,10457,10462,10500,10506,10519,10552,10589,10595
14,0.429126,1.482039,0.728738,0.477810,0.439333,1.150245,1.816317,0.781122,1.554396,2.307486,...,0.348932,0.965220,1.787169,1.816010,0.150323,1.535256,2.757278,0.143556,1.273362,1.719201
19,0.624466,1.070684,1.006064,0.759092,0.882691,0.639313,0.670019,1.407319,0.874814,0.722665,...,0.642896,0.633833,0.586222,1.178295,0.000000,0.977297,1.266322,0.265038,0.928168,1.107873
21,1.415049,0.000000,1.809125,0.000000,0.588365,0.619549,0.818749,1.944596,1.015039,0.575723,...,0.896847,0.918930,0.602573,0.995887,1.173871,0.957399,1.148837,0.216699,1.210398,1.713792
23,0.169223,2.549891,0.247487,0.438669,0.114338,0.828292,0.038440,2.455882,1.128438,1.021400,...,0.577134,0.000000,5.304503,0.610346,0.185161,4.725427,0.000000,0.430445,2.225363,0.000000
24,0.345049,0.449337,0.802936,0.797411,1.875599,0.783867,2.104939,6.418434,0.228018,3.892566,...,0.884743,0.578658,0.300125,2.151558,2.311935,1.848424,0.388630,0.103633,0.547944,0.093900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7142,0.575631,0.960989,0.950854,0.611157,0.569484,0.944994,0.481794,1.020178,0.805988,0.514642,...,0.727006,0.794809,0.843906,1.491536,1.525323,1.261752,1.630146,0.300578,1.047888,1.419171
7147,1.112427,0.380971,1.419818,0.617423,1.067951,1.020166,0.925878,1.656635,0.907465,0.722232,...,0.541901,0.813574,1.014475,1.575067,1.614839,1.191774,1.935745,0.261799,0.974836,1.364633
7153,1.338544,0.414703,0.664433,0.339429,1.038049,0.225616,0.000000,2.330027,2.045522,0.604141,...,0.568422,0.887281,0.287259,4.364816,0.898387,0.944044,1.031352,0.309217,0.723977,0.425931
7159,0.658155,0.231235,0.788894,0.870249,0.179318,1.141559,0.464358,0.173222,1.065303,0.718853,...,0.201295,0.444783,1.494014,3.151875,1.700323,1.479968,2.092679,0.393572,1.600266,1.596454


### Observation

We will not detail what a user-item matrix is here.
It is interesting to notice that this matrix, built from `X_test` (the small matrix), is mostly filled with actual values rather than zeros. This matters because a zero here means that the user did not see the video. Having proportionally very few zeros indicates that most users saw most videos, which gives us plenty of data to build a relevant recommender system.
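The density claim can be checked numerically, and the same representation also avoids materializing the full user × video grid, which the comment in cell 12 reports as too heavy for `X_train`. The sketch below assumes SciPy is available and uses four ratings copied from the matrix above as toy input; the variable names are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy interactions taken from the first rows of the user-item matrix above
interactions_toy = pd.DataFrame({
    "user_id": [14, 14, 19, 21],
    "video_id": [103, 109, 103, 120],
    "watch_ratio": [0.429126, 1.482039, 0.624466, 1.809125],
})

# Map raw ids to consecutive row / column positions
user_pos, user_ids = pd.factorize(interactions_toy["user_id"])
video_pos, video_ids = pd.factorize(interactions_toy["video_id"])

# Only observed (user, video) pairs are stored; duplicated pairs would be summed
matrix = csr_matrix(
    (interactions_toy["watch_ratio"].to_numpy(), (user_pos, video_pos)),
    shape=(len(user_ids), len(video_ids)),
)

# Density = stored ratings / total number of cells
density = matrix.nnz / (matrix.shape[0] * matrix.shape[1])
print(f"{matrix.shape[0]} users x {matrix.shape[1]} videos, density = {density:.1%}")
```

On the real data, a density close to 1 would confirm the observation above, while a low density would argue for keeping the matrix sparse.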

## 3. Model development

Build the models


### 3.1 ALS

In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [14]:
spark = SparkSession.builder.appName("ALSMatrixFactorisation").getOrCreate()

sdf = spark.createDataFrame(X_test.copy()).repartition(200)
sdf = sdf.sample(fraction=1.0).cache()

indexer = [
    StringIndexer(inputCol=column, outputCol=column + "_index")
    for column in list(set(sdf.columns) - set(["watch_ratio"]))
]

pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(sdf).transform(sdf)
transformed.show()

(training, test) = transformed.randomSplit([0.8, 0.2], seed=42)


+-------+--------+------------------+-------------+--------------+
|user_id|video_id|       watch_ratio|user_id_index|video_id_index|
+-------+--------+------------------+-------------+--------------+
|    127|    8514|0.2760039499670836|       1387.0|        1511.0|
|    131|    9984| 0.752586782376502|        540.0|        2987.0|
|    155|    3020|0.3067579685334637|       1210.0|        2922.0|
|    531|    9133|1.3205747126436782|        961.0|         836.0|
|    352|    5522|1.4036053392046235|        256.0|        2431.0|
|    127|    7046| 1.927276234567901|       1387.0|        2007.0|
|    261|    2121|0.4806658574648864|        106.0|        2482.0|
|    169|   10161|0.9316729116156052|        878.0|        3091.0|
|    262|    2400|1.3505952380952382|       1045.0|        1581.0|
|    534|    8692|0.7170757455793085|       1253.0|        1572.0|
|    261|    8822|1.1425933816219287|        106.0|        1953.0|
|    262|    3767|            1.1987|       1045.0|        290

In [15]:
als = ALS(
    maxIter=5,
    regParam=0.09,
    rank=25,
    userCol="user_id",
    itemCol="video_id",
    ratingCol="watch_ratio",
    coldStartStrategy="drop",
    nonnegative=True,
)

model = als.fit(training)

evaluator = RegressionEvaluator(
    metricName="rmse", labelCol="watch_ratio", predictionCol="prediction"
)

predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)

print("RMSE=" + str(rmse))
predictions.show()

                                                                                

RMSE=1.3615544632861154


+-------+--------+------------------+-------------+--------------+----------+
|user_id|video_id|       watch_ratio|user_id_index|video_id_index|prediction|
+-------+--------+------------------+-------------+--------------+----------+
|    833|    4238|1.1172151898734175|        244.0|        2730.0| 1.2507011|
|    833|    9793|1.5931754182540998|        244.0|         331.0| 1.4371209|
|   2366|     361| 0.759344262295082|       1365.0|        2070.0| 0.7641086|
|   2366|    1314|1.0809241562840208|       1365.0|        1453.0| 0.8904811|
|   2366|    1320|28.029094002552107|       1365.0|        2064.0| 0.5921173|
|   2366|    4044|0.9616504854368932|       1365.0|         617.0| 0.6841897|
|   2366|    7643|  1.22299974707023|       1365.0|         862.0|0.59396297|
|   2366|    9817|0.1258528911409076|       1365.0|        2821.0| 0.1260939|
|   2366|   10377|1.2126520681265207|       1365.0|         976.0|0.96740305|
|   3175|     493|0.4571428571428571|        366.0|        1687.

                                                                                

In [16]:
# Keep the recommendations DataFrame; .show() only prints and returns None
user_recs = model.recommendForAllUsers(20)
user_recs.show(10)



+-------+--------------------+
|user_id|     recommendations|
+-------+--------------------+
|    137|[{9178, 2.0058186...|
|    140|[{9178, 2.3449843...|
|    155|[{9178, 2.3093936...|
|    157|[{9178, 2.5392942...|
|    193|[{9178, 2.2090797...|
|    223|[{9178, 2.5672607...|
|    224|[{9178, 2.6533623...|
|    322|[{5365, 14.8271},...|
|    332|[{4238, 9.542087}...|
|    346|[{7559, 14.217927...|
+-------+--------------------+
only showing top 10 rows



                                                                                

### Observations 
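`X_train` is a subsample of 1,000,000 rows of big_matrix, drawn to avoid memory overflows. Note, however, that the cells above build the Spark DataFrame from `X_test` (the small matrix) and split it 80/20, so ALS is trained and evaluated on that split. We observe an RMSE of 1.36, which is the root mean squared error on watch_ratio. This is acceptable, even though there is room for improvement. Part of the error likely comes from extreme targets, such as the watch_ratio of 28.03 visible in the predictions above, which a squared error penalizes heavily.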
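To make the metric concrete, RMSE is the square root of the mean of the squared differences between watch_ratio and the prediction. The sketch below recomputes it on the nine complete rows of the predictions table above, with and without the row whose watch_ratio is about 28. The cut-off of 10 is a hypothetical value chosen only to isolate that row, and nine rows are far too few to say anything about the full test set.

```python
import numpy as np

# (watch_ratio, prediction) pairs copied from the predictions table above
watch_ratio = np.array([
    1.1172151898734175, 1.5931754182540998, 0.759344262295082,
    1.0809241562840208, 28.029094002552107, 0.9616504854368932,
    1.22299974707023, 0.1258528911409076, 1.2126520681265207,
])
prediction = np.array([
    1.2507011, 1.4371209, 0.7641086,
    0.8904811, 0.5921173, 0.6841897,
    0.59396297, 0.1260939, 0.96740305,
])


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rmse_all = rmse(watch_ratio, prediction)

# Hypothetical cut-off, used only to set the extreme row aside
keep = watch_ratio < 10
rmse_without_outlier = rmse(watch_ratio[keep], prediction[keep])

print(f"RMSE on all rows:        {rmse_all:.3f}")
print(f"RMSE without the outlier: {rmse_without_outlier:.3f}")
```

A single extreme target dominates the squared error on this sample, which is one reason to inspect the distribution of watch_ratio before judging the model by RMSE alone.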

### 3.2 Content Based

Here we will analyze the features of the videos. We have neither a title nor a description to work with, only the video length and its features. We could also consider the resolution of the video or its music, but these are not relevant for our recommendations.


#### 3.2.1 Similarity matrix for video metadata

First, let us build the similarity matrix for the metadata, including the author and the length.

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

df_videos = videos[["author_id", "video_duration"]]
df_videos

Unnamed: 0,author_id,video_duration
0,3309,5966.0
63,4978,0.0
126,939,8000.0
189,5889,0.0
252,4284,18000.0
...,...,...
343336,236,4833.0
343337,5271,54720.0
343338,1924,15800.0
343339,7604,5132.0


In [18]:
# Compute similarity matrix for video metadata
similarity_matrix = cosine_similarity(df_videos)
df_sim_meta = pd.DataFrame(similarity_matrix, index=videos["video_id"], columns=videos["video_id"])
df_sim_meta

video_id,0,1,2,3,4,5,6,7,8,9,...,10718,10719,10720,10721,10722,10723,10724,10725,10726,10727
0,1.000000,0.485033,0.925076,0.485033,0.963034,0.995417,0.926366,0.989644,0.908524,0.991273,...,0.932619,0.968727,0.938923,0.923063,0.983719,0.897112,0.916973,0.926714,0.891247,0.915080
1,0.485033,1.000000,0.116575,1.000000,0.231533,0.399183,0.119961,0.354479,0.075270,0.365517,...,0.136779,0.686852,0.756346,0.111342,0.319976,0.048773,0.095883,0.120879,0.828884,0.796503
2,0.925076,0.116575,1.000000,0.116575,0.993185,0.957155,0.999994,0.970012,0.999139,0.967068,...,0.999793,0.801912,0.737883,0.999986,0.978267,0.997686,0.999783,0.999991,0.652233,0.693364
3,0.485033,1.000000,0.116575,1.000000,0.231533,0.399183,0.119961,0.354479,0.075270,0.365517,...,0.136779,0.686852,0.756346,0.111342,0.319976,0.048773,0.095883,0.120879,0.828884,0.796503
4,0.963034,0.231533,0.993185,0.231533,1.000000,0.984381,0.993577,0.991729,0.987495,0.990141,...,0.995353,0.866077,0.811515,0.992558,0.995766,0.982962,0.990545,0.993681,0.736133,0.772621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10723,0.897112,0.048773,0.997686,0.048773,0.982962,0.935249,0.997448,0.951240,0.999648,0.947524,...,0.996094,0.759432,0.690283,0.998030,0.961904,1.000000,0.998884,0.997381,0.599182,0.642762
10724,0.916973,0.095883,0.999783,0.095883,0.990545,0.950922,0.999707,0.964744,0.999786,0.961563,...,0.999152,0.789306,0.723679,0.999879,0.973741,0.998884,1.000000,0.999684,0.636319,0.678219
10725,0.926714,0.120879,0.999991,0.120879,0.993681,0.958401,1.000000,0.971057,0.998950,0.968163,...,0.999871,0.804494,0.740802,0.999954,0.979157,0.997381,0.999684,1.000000,0.655513,0.696481
10726,0.891247,0.828884,0.652233,0.828884,0.736133,0.843793,0.654814,0.816916,0.620224,0.823683,...,0.667537,0.975906,0.992880,0.648232,0.795232,0.599182,0.636319,0.655513,1.000000,0.998454
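Some values in this matrix are easier to read with the definition in mind: cosine similarity is the dot product of two rows divided by the product of their norms, so it only measures the angle between the raw `(author_id, video_duration)` vectors. The minimal sketch below reproduces three entries from the rows of videos 0, 1 and 3 shown earlier. It is only an illustration of the computation, not a recommendation on which features to use.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# (author_id, video_duration) for videos 0, 1 and 3, copied from df_videos above
meta = np.array([
    [3309, 5966.0],  # video 0
    [4978, 0.0],     # video 1 (duration 0)
    [5889, 0.0],     # video 3 (duration 0)
])

sim = cosine_similarity(meta)
print(np.round(sim, 6))
```

Videos 1 and 3 get a similarity of 1 because both vectors point along the `author_id` axis, even though their authors differ. Since `author_id` is an identifier rather than a quantity, scaling the columns or encoding the author differently may be worth considering before relying on this matrix.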


#### 3.2.2 Similarity matrix for video tags

Let us now build the similarity matrix for the video tags.

In [19]:
# Compute similarity matrix for video tags
similarity_matrix = cosine_similarity(df_tags)
df_sim_tags = pd.DataFrame(similarity_matrix, index=tags["video_id"], columns=tags["video_id"])
df_sim_tags

video_id,0,1,2,3,4,5,6,7,8,9,...,10718,10719,10720,10721,10722,10723,10724,10725,10726,10727
0,1.0,0.000000,0.000000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.000000,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.707107,1.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.000000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.000000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10723,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.707107,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
10724,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
10725,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.000000,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
10726,0.0,0.000000,0.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


### 3.3 Two Towers

Let us implement a two-tower algorithm. This is my attempt, which I was not able to get working; I left it here so you can see the approach.


#### 3.3.1 Build the towers

Let us start by building both towers used by the model. There is one tower for the users and one tower for the videos.
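In a two-tower model, each tower maps its input to a vector of the same dimension, and the affinity between a user and a video is typically scored with a dot product. The NumPy sketch below illustrates only that scoring step, with random vectors standing in for the tower outputs; the shapes and the names `user_vectors` and `video_vectors` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 32

# Stand-ins for the outputs of the two towers (4 users, 10 videos)
user_vectors = rng.normal(size=(4, EMBED_DIM))
video_vectors = rng.normal(size=(10, EMBED_DIM))

# Affinity of every user with every video: one dot product per pair
scores = user_vectors @ video_vectors.T  # shape (4, 10)

# Positions of the 3 best-scoring videos for each user, best first
top3 = np.argsort(-scores, axis=1)[:, :3]
print(top3)
```

During training, the towers would be adjusted so that these scores are high for pairs the user actually engaged with.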

In [20]:
from sklearn.preprocessing import LabelEncoder

# Build tower for users (excluding some fields like actual follow, fan, friend counts)
user_tower = users[['user_id', 'user_active_degree', 'is_live_streamer', 
        'is_video_author', 'follow_user_num_range','fans_user_num_range', 'friend_user_num_range',  
        'onehot_feat0', 'onehot_feat1', 'onehot_feat2',
       'onehot_feat3', 'onehot_feat4', 'onehot_feat5', 'onehot_feat6',
       'onehot_feat7', 'onehot_feat8', 'onehot_feat9', 'onehot_feat10',
       'onehot_feat11', 'onehot_feat12', 'onehot_feat13', 'onehot_feat14',
       'onehot_feat15', 'onehot_feat16', 'onehot_feat17']].copy()

# Replace non integer fields by integers (labels)
user_tower["user_active_degree"] = LabelEncoder().fit_transform(user_tower["user_active_degree"])
user_tower["follow_user_num_range"] = LabelEncoder().fit_transform(user_tower["follow_user_num_range"])
user_tower["fans_user_num_range"] = LabelEncoder().fit_transform(user_tower["fans_user_num_range"])
user_tower["friend_user_num_range"] = LabelEncoder().fit_transform(user_tower["friend_user_num_range"])

user_tower

Unnamed: 0,user_id,user_active_degree,is_live_streamer,is_video_author,follow_user_num_range,fans_user_num_range,friend_user_num_range,onehot_feat0,onehot_feat1,onehot_feat2,...,onehot_feat8,onehot_feat9,onehot_feat10,onehot_feat11,onehot_feat12,onehot_feat13,onehot_feat14,onehot_feat15,onehot_feat16,onehot_feat17
0,0,2,0,0,0,0,0,0,1,17,...,184,6,3,0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1,0,0,4,1,2,0,3,25,...,186,6,2,0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,1,0,0,1,0,0,0,6,8,...,51,2,3,0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,1,0,0,1,0,0,0,1,8,...,251,3,2,0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,1,0,0,2,1,0,0,1,8,...,99,4,2,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7171,7171,1,0,1,5,1,0,0,3,8,...,259,1,4,0,1.0,0.0,0.0,0.0,0.0,0.0
7172,7172,1,0,0,1,1,2,0,3,25,...,11,2,0,0,1.0,0.0,0.0,0.0,0.0,0.0
7173,7173,1,0,0,7,1,2,0,6,8,...,51,2,2,0,1.0,0.0,0.0,0.0,0.0,0.0
7174,7174,1,0,0,7,0,0,1,6,25,...,107,3,2,0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
# Build tower for videos (using only a subset of the fields)
video_tower = videos[['video_type', 'upload_type', 'video_duration','video_width', 'video_height', 'music_id', 'video_tag_id','show_cnt', 'show_user_num', 'play_cnt', 'play_user_num',
                'play_duration', 'complete_play_cnt', 'complete_play_user_num', 'valid_play_cnt', 'valid_play_user_num', 'long_time_play_cnt',
                'long_time_play_user_num', 'short_time_play_cnt', 'short_time_play_user_num', 'play_progress']].copy()

# Replace non integer fields by integers (labels)
video_tower["video_type"] = LabelEncoder().fit_transform(video_tower["video_type"])
video_tower["upload_type"] = LabelEncoder().fit_transform(video_tower["upload_type"])

# Put fields with high value to log scale
video_tower["play_duration"] = np.log1p(video_tower["play_duration"])
video_tower

Unnamed: 0,video_type,upload_type,video_duration,video_width,video_height,music_id,video_tag_id,show_cnt,show_user_num,play_cnt,...,play_duration,complete_play_cnt,complete_play_user_num,valid_play_cnt,valid_play_user_num,long_time_play_cnt,long_time_play_user_num,short_time_play_cnt,short_time_play_user_num,play_progress
0,1,15,5966.0,720,1280,3350323409,841,14665,11372,10141,...,18.301103,5657,4834,5503,4775,5503,4775,1939,1481,0.799860
63,1,11,0.0,886,1015,1812462382,0,17829,17329,19205,...,19.909640,8539,8073,12917,12506,9498,9307,3290,2965,0.624037
126,1,3,8000.0,720,1280,0,2566,43615,33679,45038,...,20.169493,30139,23968,31635,25230,29413,23615,8132,5208,0.830122
189,1,11,0.0,1080,1080,0,773,1309,1072,1237,...,16.273728,109,84,577,502,109,96,389,316,0.314513
252,1,14,18000.0,720,1280,3442844592,2413,103,99,95,...,14.180637,39,39,48,48,39,39,35,31,0.574927
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
343336,1,15,4833.0,720,1280,4428603493,11,277,173,214,...,14.335440,117,106,114,104,114,104,83,66,0.596591
343337,1,7,54720.0,720,1280,1090207430,2,1100,1017,965,...,17.842481,535,523,754,721,657,637,113,99,0.574591
343338,1,15,15800.0,576,1024,4429406509,15,16996,16345,15487,...,19.595597,8149,8015,9317,9171,7949,7871,4342,4139,0.577613
343339,1,15,5132.0,528,960,68154,19,7644,7568,7859,...,18.674045,5480,5395,5382,5319,5382,5319,1648,1580,0.818123


In [22]:
from sklearn.preprocessing import normalize

user_embeds = normalize(user_tower)
video_embeds = normalize(video_tower)

video_embeds

array([[2.98478648e-10, 4.47717971e-09, 1.78072361e-06, ...,
        5.78750098e-07, 4.42046877e-07, 2.38741134e-10],
       [5.51735589e-10, 6.06909148e-09, 0.00000000e+00, ...,
        1.81521009e-06, 1.63589602e-06, 3.44303665e-10],
       [9.58401266e-06, 2.87520380e-05, 7.66721013e-02, ...,
        7.79371910e-02, 4.99135380e-02, 7.95589811e-06],
       ...,
       [2.25763880e-10, 3.38645820e-09, 3.56706931e-06, ...,
        9.80266767e-07, 9.34436700e-07, 1.30404212e-10],
       [1.40237280e-05, 2.10355920e-04, 7.19697723e-02, ...,
        2.31111038e-02, 2.21574903e-02, 1.14731317e-05],
       [9.16735413e-10, 1.28342958e-08, 5.19422285e-06, ...,
        1.22842545e-07, 1.14591927e-07, 1.92852585e-10]],
      shape=(10728, 21))
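The values of order 1e-10 in the array above are consistent with how `sklearn.preprocessing.normalize` works: by default it rescales each row to unit L2 norm, so a single large raw field (here `music_id`, about 3.35e9 for the first video) dominates the whole row. The sketch below contrasts this with per-column standardization on three columns copied from the first rows of `video_tower`. `StandardScaler` is a swapped-in alternative shown for comparison, not what the notebook uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Columns: video_type, video_duration, music_id (first three rows of video_tower)
rows = np.array([
    [1.0, 5966.0, 3350323409.0],
    [1.0, 0.0, 1812462382.0],
    [1.0, 8000.0, 0.0],
])

# Row-wise L2 normalization: each row is divided by its own norm
row_normalized = normalize(rows)

# Column-wise standardization: each column gets zero mean and unit variance
col_scaled = StandardScaler().fit_transform(rows)

print(row_normalized[0])
print(col_scaled[:, 1])
```

Note that `music_id` is an identifier, so even a well-scaled version of it carries little meaning as a number; treating it as a categorical feature may be a better fit.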

In [23]:
import tensorflow as tf
from tensorflow.keras import layers, Model

EMBED_DIM = 32

# Embedding sizes: ids are integers, so max id + 1 rows cover every id
num_users = int(users["user_id"].max()) + 1
num_items = int(videos["video_id"].max()) + 1

# User tower: integer id -> embedding -> dense projection
user_input = layers.Input(shape=(), dtype=tf.int32, name="user_id")
user_embedding = layers.Embedding(input_dim=num_users, output_dim=EMBED_DIM)(user_input)
user_vector = layers.Dense(EMBED_DIM, activation="relu")(user_embedding)

# Video tower: same structure for the video ids
video_input = layers.Input(shape=(), dtype=tf.int32, name="video_id")
video_embedding = layers.Embedding(input_dim=num_items, output_dim=EMBED_DIM)(video_input)
video_vector = layers.Dense(EMBED_DIM, activation="relu")(video_embedding)

ModuleNotFoundError: No module named 'tensorflow'

## 4. Recommender System

This part will focus on defining the recommender system to be used.


### 4.1 Extracting "True Data"

First and foremost, we will need a way to define a test set for each user, in order to determine whether our system is efficient or not.

In [24]:
def extract_N_best_test(user_id: int, amount: int) :
    watched_videos = X_test[X_test["user_id"] == user_id]
    sorted_videos = watched_videos.sort_values(by=['watch_ratio'], ascending = False)
    return sorted_videos.head(amount)["video_id"].tolist()

### 4.2 Recommending with the best tag

Three methods have been implemented for ranking the tags for a given user:
- Counting the total amount of views for each tag
- Counting the total watch_ratio for each tag
- Counting the average watch_ratio for each tag

The first method seems to work well, but the second method gave the best results.
The third method did not seem to work, because it loses the notion of quantity. A high watch_ratio on a tag that the user watched only once is enough to make this tag the most recommended one.
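The difference between the three rankings can be seen on a small example. The sketch below uses a hypothetical watch history (one row per watched video and tag) in which tag 5 was watched four times with average ratios and tag 9 only once with a high ratio; the data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical history: tag 5 watched four times, tag 9 watched once
history = pd.DataFrame({
    "tag": [5, 5, 5, 5, 9],
    "watch_ratio": [0.9, 1.1, 1.0, 0.8, 2.5],
})

# The three candidate scores, computed per tag
per_tag = history.groupby("tag")["watch_ratio"].agg(
    views="count",   # method 1: number of views
    total="sum",     # method 2: total watch_ratio
    average="mean",  # method 3: average watch_ratio
)

best_by_views = per_tag["views"].idxmax()
best_by_total = per_tag["total"].idxmax()
best_by_average = per_tag["average"].idxmax()
print(per_tag)
print(best_by_views, best_by_total, best_by_average)
```

Only the average puts tag 9 first, on the strength of a single view, which matches the weakness described above.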

In [25]:
from operator import add


def find_best_tags(user_id: int) :
    """
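    Return the tag positions sorted by ascending order of user preference (the last entry is the preferred tag).
    Position i refers to column i + 1 of df_tags, because only columns 1 to 30 are read below.
    Depending on the line left uncommented under "Choose result", the ranking uses the sum of all watch ratios, the total amount of videos watched, or the average watch_ratio per tag.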
    """
    user_interactions = X_train[X_train["user_id"] == user_id]

    # Calculate order of prefered tags
    total_watch_ratio = [0] * 30
    total_watch = [0] * 30
    for _, interaction_row in user_interactions.iterrows():
        video_id = interaction_row["video_id"]
        tag_row = df_tags.loc[video_id]
        tag_list = tag_row[[i for i in range(1, 31)]].tolist()
        total_watch_ratio = list( map(add, total_watch_ratio, [a * interaction_row["watch_ratio"] for a in tag_list]))
        total_watch = list( map(add, total_watch, tag_list))

    # Choose result
    
    # final = total_watch                                            # Total watch count per tag
    final = total_watch_ratio                                      # Total watch ratio per tag
    # final = [a/b for a,b in zip(total_watch_ratio,total_watch)]    # Average watch ratio per tag
    return np.argsort(final).tolist()

find_best_tags(14)  

[3,
 6,
 15,
 13,
 12,
 21,
 22,
 19,
 26,
 28,
 29,
 23,
 20,
 2,
 18,
 1,
 10,
 17,
 5,
 8,
 11,
 14,
 9,
 7,
 24,
 0,
 4,
 25,
 16,
 27]

### 4.3 Recommending based on user watch history

Now, let us find the N best videos a user watched in the train set.

In [26]:
def find_N_best_train(user_id: int, amount: int) :
    """
    This function returns the top N videos watched by the given user, based on the watch_ratio
    """
    watched_videos = X_train[X_train["user_id"] == user_id]
    sorted_videos = watched_videos.sort_values(by=['watch_ratio'], ascending = False)
    return sorted_videos.head(amount)["video_id"].tolist()

### 4.4 Defining metrics

In order to evaluate our recommender system, we need some metrics. We will define three of them: Precision@k, Recall@k and NDCG@k.

In [27]:
from sklearn.metrics import ndcg_score, precision_score, recall_score

def evaluate_output(relevant, retrieved, user_id, n=None):
    """
    Print Precision@n, Recall@n and NDCG@n for one user, computed over the
    videos this user has in the test set.
    """
    if n is None:
        n = len(relevant)

    test_sample_user = X_test[X_test["user_id"] == user_id].copy()

    # y_true / y_pred flag, for each test video, whether it is relevant / retrieved
    y_true = np.array([[1 if item in relevant else 0 for item in test_sample_user["video_id"]]])
    y_pred = np.array([[1 if item in retrieved else 0 for item in test_sample_user["video_id"]]])

    # y_score gives each retrieved video a score that decreases with its rank (1, 1/2, 1/3, ...)
    ranked_scores = {item: 1 / (rank + 1) for rank, item in enumerate(retrieved)}
    y_score = np.array([[ranked_scores.get(item, 0) for item in test_sample_user["video_id"]]])

    # Note: with 1-D binary labels, average="micro" pools both classes,
    # so this value equals the accuracy over the user's test videos
    print(f"Precision@{n}:", precision_score(y_true[0], y_pred[0], average="micro"))
    # The 2-D input is treated as multilabel, so this is TP / (TP + FN)
    print(f"Recall@{n}:", recall_score(y_true, y_pred, average="micro"))
    print(f"NDCG@{n}:", ndcg_score(y_true, y_score))
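
For reference, Precision@k and Recall@k can also be computed directly from the two lists, without going through scikit-learn. The helper below is a minimal sketch under that assumption; `precision_recall_at_k` is a hypothetical name that is not used elsewhere in this notebook.

```python
def precision_recall_at_k(relevant, retrieved, k):
    """Return (precision@k, recall@k) for two lists of item ids."""
    top_k = list(retrieved)[:k]          # only the first k retrieved items count
    relevant_set = set(relevant)         # set lookup keeps membership tests fast
    hits = sum(1 for item in top_k if item in relevant_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Toy example: 2 of the first 3 retrieved items are relevant, out of 4 relevant items
precision, recall = precision_recall_at_k([1, 2, 3, 4], [2, 9, 4, 7, 1], k=3)
print(precision, recall)
```

Here precision is the share of the first k retrieved items that are relevant, and recall is the share of relevant items found among them.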

### 4.5 Recommender System

This is where things get real. We will assemble everything we have done so far.

#### 4.5.1 ALS Recommendations

Let's see how ALS alone performs.

In [28]:
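from pyspark.sql import Row

def recommend_N_ALS(user_id: int, amount: int):
    """
    Return the top `amount` ALS recommendations for the given user,
    as a list of rows with `video_id` and `rating` fields.
    """
    # Single-row DataFrame holding the user to recommend for
    user_filter = spark.createDataFrame([Row(user_id=user_id)])
    rows = model.recommendForUserSubset(user_filter, numItems=amount).first()["recommendations"]
    return rows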



ALS_results = recommend_N_ALS(14, 1000)
retrieved = [a.video_id for a in ALS_results]
relevant = extract_N_best_test(14, 1000)
evaluate_output(relevant, retrieved, 14)

Precision@1000: 0.7365491651205937
Recall@1000: 0.565
NDCG@1000: 0.912022516167198


### Observation 
We have a precision of about 0.74, indicating that most of the predictions were correct. Recall is decent at 0.565, which means that roughly 43% of the relevant items were missed (which can be explained by the number of items predicted). Finally, NDCG is high at 0.91, indicating that the items were well ranked.

#### 4.5.2 Content Based Recommending

Let's see how Content Based Filtering performs!

In [29]:
def recommend_N_CBF(user_id: int, amount: int):
    """
    Rank the user's test videos by their best watch-ratio-weighted similarity
    to the user's top `amount` videos from the train set.
    """
    best_N_watched = find_N_best_train(user_id, amount)
    N_evaluate = X_test[X_test["user_id"] == user_id].copy()
    N_best_scores = []

    # Watch ratios of the user's best train videos (identical for every candidate)
    watch_ratios = X_train[
        (X_train["user_id"] == user_id) &
        (X_train["video_id"].isin(best_N_watched))
    ].set_index("video_id")["watch_ratio"]

    for video_id_1 in N_evaluate["video_id"]:
        # Similarity of the candidate to each of the N best watched videos
        sim_meta = df_sim_meta.loc[video_id_1, best_N_watched]
        sim_tags = df_sim_tags.loc[video_id_1, best_N_watched]
        similarities = sim_meta + sim_tags

        # Align on video_id and weight each similarity by its watch ratio
        aligned = similarities.multiply(watch_ratios, fill_value=0)

        # Keep the best weighted similarity
        N_best_scores.append(aligned.max())

    N_evaluate["similarity"] = N_best_scores
    sorted_videos = N_evaluate.sort_values(by=["similarity"], ascending=False)
    return sorted_videos.head(amount)


CBF_results = recommend_N_CBF(14, 1000)
retrieved = CBF_results["video_id"].tolist()
relevant = extract_N_best_test(14, 1000)
evaluate_output(relevant, retrieved, 14)

Precision@1000: 0.5374149659863946
Recall@1000: 0.252
NDCG@1000: 0.8195102333839447


### Observation 
We have a precision of about 0.54, indicating that roughly half of the predictions were correct. Recall is quite low at 0.25, which means that about three quarters of the relevant items were missed (which can be explained by the number of items predicted). Finally, NDCG is good at 0.82, indicating that the items were fairly well ranked.
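
The scoring step inside `recommend_N_CBF` can be illustrated on toy data. The sketch below uses hypothetical video ids, similarities and watch ratios; it only shows how one candidate video gets its score from the videos the user watched.

```python
import pandas as pd

# Hypothetical similarities between one candidate video and three watched videos
similarities = pd.Series({101: 0.2, 102: 0.9, 103: 0.5})
# Hypothetical watch ratios of those three watched videos
watch_ratios = pd.Series({101: 2.0, 102: 0.5, 103: 1.2})

# Align on video id and weight each similarity by how much the user liked that video
weighted = similarities.multiply(watch_ratios, fill_value=0)

score = weighted.max()          # score kept for the candidate video
best_match = weighted.idxmax()  # watched video responsible for that score
print(best_match, score)
```

Video 102 is the most similar one, but the user barely watched it, so the candidate ends up being scored through video 103 instead.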

#### 4.5.3 Friends!

Let's recommend based on friends: videos whose author is in the user's friend list are ranked first.

In [30]:
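def recommend_N_friends(user_id: int, amount: int):
    """
    Rank the user's test videos so that videos whose author is in the user's
    friend list come first (friend = 1), then return the top `amount` rows.
    """
    # Authors of the videos this user has in the test set
    df_friends = videos[["video_id", "author_id"]].copy()
    df_friends = df_friends[df_friends["video_id"].isin(X_test[X_test["user_id"] == user_id]["video_id"])]

    # Flag videos whose author appears in the user's friend list
    user_friends = friends[friends["user_id"] == user_id]["friend_list"].tolist()
    df_friends["friend"] = df_friends["author_id"].isin(user_friends).astype(int)

    sorted_videos = df_friends.sort_values(by=["friend"], ascending=False)
    return sorted_videos.head(amount)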
    



friend_results = recommend_N_friends(14, 1000)
retrieved = friend_results["friend"].tolist()
relevant = extract_N_best_test(14, 1000)
evaluate_output(relevant, retrieved, 14)

Precision@1000: 0.6907854050711194
Recall@1000: 0.0
NDCG@1000: 0.8214976731131111


### Observation 
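The reported precision is about 0.69, but it cannot mean that most retrieved items were relevant, since recall is 0 (no relevant item was retrieved). With `average="micro"` on binary labels, `precision_score` reduces to accuracy, so this value mostly counts non-relevant videos that were correctly left out. Recall is 0 because `retrieved` here contains the 0/1 friend flags rather than video ids, so the relevant test videos are not matched. Finally, NDCG looks acceptable at 0.82, but since no test video receives a rank score, it corresponds to an essentially random ordering.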

### 4.6 Salade Tomate Oignons

Now that's it! Let's combine all of them!

In [33]:
def recommend_N(user_id, amount):
    """
    Combine the ALS, content-based and friend recommendations into one ranking.
    """
    ALS = pd.DataFrame([row.asDict() for row in recommend_N_ALS(user_id, amount)])
    CBF = recommend_N_CBF(user_id, amount)
    friend_flags = recommend_N_friends(user_id, amount)

    # Merge all results on video_id
    merged = ALS.merge(CBF, on="video_id", how="outer").merge(friend_flags, on="video_id", how="outer")

    # Replace missing values with 0
    merged.fillna(0, inplace=True)

    # Weighted sum: ALS rating plus 3x similarity, the similarity term being doubled for friends
    merged["total_score"] = 1 * merged["rating"] + 3 * merged["similarity"] * (1 + merged["friend"])

    result = merged[["video_id", "total_score"]]
    sorted_videos = result.sort_values(by=["total_score"], ascending=False)
    # Note: this returns a pandas Series; `in` on a Series tests its index labels, not its values
    return sorted_videos.head(amount)["video_id"]

    
retrieved = recommend_N(14, 1000)
relevant = extract_N_best_test(14, 1000)
evaluate_output(relevant, retrieved, 14)

Precision@1000: 0.658008658008658
Recall@1000: 0.092
NDCG@1000: 0.8330125009909803
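
To see how the blending behaves, here is a minimal sketch of the same merge and scoring formula on hypothetical toy frames (the ids and scores are made up). Missing scores become 0 after the outer merge, so a video only benefits from the recommenders that actually returned it.

```python
import pandas as pd

# Hypothetical outputs of the three recommenders (illustrative values only)
als = pd.DataFrame({"video_id": [1, 2], "rating": [0.8, 0.4]})
cbf = pd.DataFrame({"video_id": [2, 3], "similarity": [0.5, 0.9]})
friend_flags = pd.DataFrame({"video_id": [3], "friend": [1]})

# Outer merges keep every video seen by at least one recommender
merged = als.merge(cbf, on="video_id", how="outer").merge(friend_flags, on="video_id", how="outer")
merged = merged.fillna(0)

# Same formula as in recommend_N
merged["total_score"] = 1 * merged["rating"] + 3 * merged["similarity"] * (1 + merged["friend"])

ranking = merged.sort_values("total_score", ascending=False)["video_id"].tolist()
print(ranking)
```

With these weights, video 3 comes first even though ALS did not return it, because its similarity term is tripled and then doubled by the friend flag.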


## 5. Tests!

We will now evaluate our model.

In [None]:
K = [10, 100, 1000, 3000]
user = 14

for k in K:
    print(f"\tEvaluating for user {user}")
    retrieved = recommend_N(user, k)
    relevant = extract_N_best_test(user, k)
    evaluate_output(relevant, retrieved, user)

	Evaluating for user 14
Precision@10: 0.9969078540507111
Recall@10: 0.0
NDCG@10: 0.22014406258092545
	Evaluating for user 14
