# Overview

#### In this notebook, I build a recommendation system for premium Vidio.com users. The dataset is the Vidio.com play log from 1 February to 16 February 2020. I use collaborative filtering, which bases recommendations on the activity and preferences of other users similar to us. The model is K-nearest neighbors with cosine similarity as the metric.

In [169]:
import pandas as pd
import numpy as np

# Read the File

In [151]:
df = pd.read_csv('20% Vidio stream cleaned.csv')

# Feature Engineering

### Keep only premium users and completed plays

In [152]:
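# Copy the filtered frame so the column assignment in the next cell
# does not trigger pandas' SettingWithCopyWarning.
df = df[df['is_premium'] == True].copy()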

In [153]:
df['completed'] = df['completed'].replace(True, 1).replace(False, 0)

In [157]:
df = df[df['completed'] == 1]
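
The three filtering cells above can also be written as a single boolean mask. The following is a minimal sketch on invented toy data: the column names follow the notebook, but the `plays` frame and its values are assumptions made for illustration, and it assumes both columns are plain booleans with no missing values.

```python
import pandas as pd

# Toy stand-in for the play log; the values are invented for illustration.
plays = pd.DataFrame({
    "hash_watcher_id": ["u1", "u2", "u3", "u4"],
    "film_title": ["HEART", "HEART", "Dr. Romantic", "HEART"],
    "is_premium": [True, False, True, True],
    "completed": [True, True, False, True],
})

# Keep rows where both conditions hold, then copy so that later column
# assignments act on an independent frame.
premium_completed = plays[plays["is_premium"] & plays["completed"]].copy()

# Convert the boolean flag to 1/0 (all remaining rows are 1 after the filter).
premium_completed["completed"] = premium_completed["completed"].astype(int)
print(premium_completed)
```

Only users u1 and u4 survive the filter here, since u2 is not premium and u3 did not complete the play.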

### Select the user id, film title (item), and completed flag for the collaborative filtering recommendation engine

In [158]:
columns = ['hash_watcher_id', 'film_title', 'completed']
user_and_film = df[columns]

### Add a total_plays column counting the plays of each unique film_title

In [159]:
total_plays = user_and_film.groupby(['film_title']).size().reset_index(name = 'total_plays').sort_values(by = 'total_plays', ascending=False)
total_plays

Unnamed: 0,film_title,total_plays
94,I Love You Baby,857
79,HEART,127
105,Legend of the Blue Sea,78
259,Weightlifting Fairy Kim Bok-joo,76
57,Dr. Romantic,66
...,...,...
191,Single Lady,1
190,Si Jago Merah,1
49,Da Vinci - Animalogic,1
186,Shaadi Mein Zaroor Aana,1


In [160]:
df = user_and_film.merge(total_plays, on='film_title').sort_values(by='total_plays', ascending=False)
df

Unnamed: 0,hash_watcher_id,film_title,completed,total_plays
782,4b500379c751b7c276c2b2c1db797d47f3a2c7a1b564e2...,I Love You Baby,1,857
645,de73c4a4aaea23a349b5ce112ea2583043f48ab8ee0edc...,I Love You Baby,1,857
647,81c8e36de488549443aaa24148aa41b8acde24064e88b5...,I Love You Baby,1,857
648,7e84bdb32d290a1f3776ac496b2df0cf233819cae4b7a7...,I Love You Baby,1,857
649,92941944553b3a2fb8422c6835965f75f739549b3abcd5...,I Love You Baby,1,857
...,...,...,...,...
2658,f92d17295d86b1672c796b91f2bbb1f350aea16f1a4024...,Opapatika,1,1
2659,c8d0e7451cdd40e861de1ea81c37d17e4a8e421990d1e2...,Akibat Hamil Muda,1,1
2016,3a1b787a8dd1f1978df7263065d61ef2efb61c001bcef4...,This Is Cinta,1,1
2017,78c9bec4bb38edb3c2e9a4b22b80c1767f92538d343ce9...,Foxcatcher,1,1
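
As an aside, the group-then-merge steps above can be collapsed into one call with a grouped transform, which broadcasts each film's row count back onto its rows. This is only a sketch on an invented toy frame (`toy` and its values are assumptions), not a change to the notebook's pipeline.

```python
import pandas as pd

# Toy frame with the same three columns the notebook keeps.
toy = pd.DataFrame({
    "hash_watcher_id": ["u1", "u2", "u3", "u1"],
    "film_title": ["HEART", "HEART", "HEART", "Dr. Romantic"],
    "completed": [1, 1, 1, 1],
})

# transform("size") returns one value per original row: the size of
# the group (film) that the row belongs to.
toy["total_plays"] = toy.groupby("film_title")["film_title"].transform("size")
toy = toy.sort_values("total_plays", ascending=False)
print(toy)
```

The separate total_plays table is still handy for inspecting the most and least played titles, so the two-step version in the notebook is a reasonable choice as well.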


#### Keep only popular film titles, so that each film's similarity scores rest on enough plays

In [12]:
df.describe()

Unnamed: 0,total_plays
count,2832.0
mean,283.641243
std,378.795255
min,1.0
25%,14.0
50%,53.0
75%,857.0
max,857.0


In [161]:
df = df[df['total_plays'] >= 10]

### Reshape the data into a film-by-user matrix with the pivot_table function

In [162]:
df_sparse = df.reset_index().pivot_table(index='film_title', columns='hash_watcher_id', values = 'completed').fillna(0)
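
To make the shape of the result concrete, here is a small sketch of the same pivot on invented data (the `toy` frame is an assumption for illustration). Each row is a film, each column is a user, and a cell is 1 if that user completed the film and 0 otherwise.

```python
import pandas as pd

# Invented play records: u2 completed both films.
toy = pd.DataFrame({
    "hash_watcher_id": ["u1", "u2", "u2", "u3"],
    "film_title": ["HEART", "HEART", "Dr. Romantic", "Dr. Romantic"],
    "completed": [1, 1, 1, 1],
})

# Films become the rows, users the columns; missing pairs become 0.
item_user = toy.pivot_table(
    index="film_title", columns="hash_watcher_id", values="completed"
).fillna(0)
print(item_user)
```

Putting films on the rows is what makes this an item-based approach: the neighbors found later are films with similar audiences, not users with similar histories.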

### Convert the sparse matrix to Compressed Sparse Row (CSR) format

For a reference on sparse matrices: https://machinelearningmastery.com/sparse-matrices-for-machine-learning/

In [163]:
from scipy.sparse import csr_matrix

In [164]:
df_CSR = csr_matrix(df_sparse)
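
The sketch below shows what the CSR format stores, using a tiny hand-written matrix (the `dense` array is an assumption for illustration). Only the non-zero values are kept, together with their column positions and a pointer to where each row starts.

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3 films x 4 users, mostly zeros.
dense = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
], dtype=float)

sparse = csr_matrix(dense)

# nnz is the number of stored (non-zero) entries.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print("stored entries:", sparse.nnz)
print("density:", density)

# The three arrays that make up the CSR representation.
print("data:   ", sparse.data)      # the non-zero values, row by row
print("indices:", sparse.indices)   # column index of each stored value
print("indptr: ", sparse.indptr)    # where each row starts in data/indices
```

On a real film-by-user matrix most users have completed only a few films, so the density is typically very low and the CSR form saves memory.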

# Machine Learning Model

https://scikit-learn.org/stable/modules/neighbors.html

###### We use cosine similarity to measure how similar two film titles are (1: similar, 0: not similar). Cosine similarity is useful because two similar vectors can be far apart in Euclidean distance (in text mining, because of document length; here, because one film has many more viewers) and still point in nearly the same direction. The smaller the angle between the vectors, the higher the cosine similarity. Each film_title is a vector over users, so the dot product in the numerator of the cosine equation counts the users who completed both films: the more viewers two films share, the larger the cosine value. If we watched a film, the other films watched by its viewers therefore become recommendations for us. Note that scikit-learn's metric='cosine' reports cosine distance, which is 1 minus cosine similarity, so smaller values mean more similar films.

Here is an explanation of cosine similarity in three dimensions (3 features): https://www.machinelearningplus.com/nlp/cosine-similarity/

This is an example of a content-based recommendation method (the features are item information instead of users) using cosine similarity: https://towardsdatascience.com/using-cosine-similarity-to-build-a-movie-recommendation-system-ae7f20842599
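
As a small worked example of the cosine calculation, the sketch below compares three invented film vectors over four users. The `cosine_similarity` helper and the vectors are assumptions written for illustration; scikit-learn ships its own implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each position is one user: 1 if the user completed the film, else 0.
film_a = np.array([1, 1, 0, 0])
film_b = np.array([1, 1, 1, 1])   # shares two viewers with film_a
film_c = np.array([0, 0, 1, 1])   # shares no viewers with film_a

sim_ab = cosine_similarity(film_a, film_b)
sim_ac = cosine_similarity(film_a, film_c)

print("similarity a-b:", sim_ab, "distance:", 1 - sim_ab)
print("similarity a-c:", sim_ac, "distance:", 1 - sim_ac)
```

With 0/1 vectors the dot product is simply the number of shared viewers, so films with no viewer in common always have similarity 0 and cosine distance 1.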

In [165]:
from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric='cosine', algorithm = 'brute')
model_knn.fit(df_CSR)

NearestNeighbors(algorithm='brute', metric='cosine')

## The Recommendation System

In [178]:
query_index = np.random.choice(df_sparse.shape[0])
distances, indices = model_knn.kneighbors(df_sparse.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

# Position 0 is normally the query film itself (distance 0), so print it
# as the header and list only the remaining neighbors.
print('Recommendations for {0}: '.format(df_sparse.index[query_index]))
for i in range(1, len(distances.flatten())):
    print('{0}: {1}, with distance of {2}:'.format(i, df_sparse.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for The Secret Hotel: 
1: W – Two Worlds, with distance of 0.9087129070824723:
2: Kill Me, Heal Me, with distance of 0.9087129070824723:
3: The Good Wife, with distance of 0.9578924039466741:
4: The Cursed, with distance of 1.0:
5: The Gang Doctor, with distance of 1.0:
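
A cosine distance of 1.0, as in the last two rows above, means the film shares no viewers with the query film, so those rows are filled in only because six neighbors were requested. One possible refinement, sketched below, wraps the query in a helper that looks up a film by title and drops neighbors with no overlap. The `recommend` function, its parameters, and the toy matrix are hypothetical and not part of the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend(title, item_user, model, n_recommendations=5):
    """Hypothetical helper: nearest films to `title`, excluding the film
    itself and any neighbor that shares no viewers (cosine distance 1)."""
    row = item_user.index.get_loc(title)
    # Ask for one extra neighbor because the query film is returned too.
    n_neighbors = min(n_recommendations + 1, item_user.shape[0])
    distances, indices = model.kneighbors(
        item_user.iloc[[row]].to_numpy(), n_neighbors=n_neighbors
    )
    results = []
    for dist, idx in zip(distances.ravel(), indices.ravel()):
        if idx == row or dist >= 1.0 - 1e-12:
            continue
        results.append((item_user.index[idx], float(dist)))
    return results[:n_recommendations]

# Toy film-by-user matrix (invented values): A and B share two viewers,
# C shares none with either.
item_user = pd.DataFrame(
    [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]],
    index=["A", "B", "C"],
    columns=["u1", "u2", "u3", "u4"],
)
model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(csr_matrix(item_user.to_numpy()))

recs = recommend("A", item_user, model)
print(recs)
```

Here only film B is returned for film A, because film C has no viewer in common with it.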
