### Content-Based Recommendations

In [1]:
import os
import zipfile

import pandas as pd
import numpy as np
import scipy.sparse as sps

import matplotlib.pyplot as plt
import seaborn as sns 

### 1. Downloading the Data

We keep working with the MovieLens dataset, but **this time we also load the additional dataset with user tags**.

In [2]:
# ratings data
file_path = 'datasets/Movielens10M/decompressed/ml-10M100K/ratings.dat'

df_ratings = pd.read_csv(
    file_path,
    sep='::',
    header=None,
    dtype={0:int, 1:int, 2:int, 3:int},
    engine='python'
)

df_ratings.columns = ['user_id', 'item_id', 'rating', 'timestamp']
df_ratings

Unnamed: 0,user_id,item_id,rating,timestamp
0,1,122,5,838985046
1,1,185,5,838983525
2,1,231,5,838983392
3,1,292,5,838983421
4,1,316,5,838983392
...,...,...,...,...
10000049,71567,2107,1,912580553
10000050,71567,2126,2,912649143
10000051,71567,2294,5,912577968
10000052,71567,2338,2,912578016


**Getting the tags data**

In [3]:
# first, extract tags data
zip_path = 'datasets/Movielens10M/movielens_10m.zip'
file_name = 'ml-10M100K/tags.dat'
target_dir = 'datasets/Movielens10M/decompressed'

zip_file = zipfile.ZipFile(zip_path)
file_path = zip_file.extract(file_name, path=target_dir)

In [4]:
# tags data
df_tags = pd.read_csv(
    file_path,
    sep='::',
    header=None,
    dtype={0:int, 1:int, 2:str, 3:int},
    engine='python'
)

df_tags.columns = ['user_id', 'item_id', 'tag', 'timestamp']
df_tags

Unnamed: 0,user_id,item_id,tag,timestamp
0,15,4973,excellent!,1215184630
1,20,1747,politics,1188263867
2,20,1747,satire,1188263867
3,20,2424,chick flick 212,1188263835
4,20,2424,hanks,1188263835
...,...,...,...,...
95575,71556,1377,Gothic,1188263571
95576,71556,2424,chick flick,1188263606
95577,71556,3033,comedy,1188263626
95578,71556,3081,Gothic,1188263565


In [5]:
unique_tags = df_tags['tag'].unique()
tags_unique_users = df_tags['user_id'].unique()
tags_unique_items = df_tags['item_id'].unique()

print('N Unique Tags: ', unique_tags.shape[0])
print('N Unique Users (Tags DF): ', tags_unique_users.shape[0])
print('N Unique Items (Tags DF): ', tags_unique_items.shape[0])
print(f'N Interactions (Tags DF): {df_tags.shape[0]}')

N Unique Tags:  16529
N Unique Users (Tags DF):  4009
N Unique Items (Tags DF):  7601
N Interactions (Tags DF): 95580


**Tags cover only part of the data: 4,009 users assigned at least one tag, and 7,601 items received one.**
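One way to quantify this coverage is to check which ids from the ratings table also appear in the tags table. The sketch below uses small toy frames; `toy_ratings`, `toy_tags`, and `tag_coverage` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-ins for df_ratings and df_tags (hypothetical values)
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4],
    "item_id": [10, 20, 10, 30, 40],
})
toy_tags = pd.DataFrame({
    "user_id": [1, 1, 5],
    "item_id": [10, 20, 20],
    "tag": ["funny", "dark", "dark"],
})

def tag_coverage(ratings, tags, column):
    """Share of the distinct ids in ratings[column] that also appear in tags[column]."""
    rated_ids = pd.Index(ratings[column].unique())
    return rated_ids.isin(tags[column]).mean()

user_coverage = tag_coverage(toy_ratings, toy_tags, "user_id")
item_coverage = tag_coverage(toy_ratings, toy_tags, "item_id")

print(f"Users with tags: {user_coverage:.0%}")
print(f"Items with tags: {item_coverage:.0%}")
```

Note that user 5 in the toy tags table never rated anything, so it does not count towards the coverage of rated users.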

### 2. Sparse Matrices Creation

Next we convert `df_ratings` and `df_tags` into sparse matrices. There is one small problem with `df_tags`: the tags are strings, so we first need to map them to numeric indices.

We must also ensure that the user and item indices used for `df_ratings` and `df_tags` are consistent. **To do so, we use the same mapper for both.** First we take the ids from `df_ratings`, then we add the new ids that appear only in `df_tags`.
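A minimal sketch of the shared-mapper idea on toy data, assuming the ids of both tables are concatenated before calling `pd.factorize`. The names `toy_ratings_users`, `toy_tags_users`, and `user_to_index` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy user ids (hypothetical): id 40 appears only in the tags table
toy_ratings_users = pd.Series([10, 20, 10, 30])
toy_tags_users = pd.Series([20, 40])

# Factorize the concatenation so both tables share one id -> index mapping
all_users = pd.concat([toy_ratings_users, toy_tags_users], ignore_index=True)
codes, uniques = pd.factorize(all_users)

# Slice the codes back into the part belonging to each table
n_ratings = len(toy_ratings_users)
ratings_codes = codes[:n_ratings]
tags_codes = codes[n_ratings:]

# Lookup from original id to mapped index
user_to_index = pd.Series(range(len(uniques)), index=uniques)

print(ratings_codes.tolist())
print(tags_codes.tolist())
print(user_to_index[20])
```

User 20 receives the same index in both slices, which is the consistency the shared mapper is meant to guarantee. Factorizing each table separately would give user 20 index 1 in the ratings and index 0 in the tags.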

In [6]:
mapped_user_id, original_user_id = pd.factorize(df_ratings['user_id']) 
print('N Unique Users in Ratings DF: ', len(original_user_id))

mapped_item_id, original_item_id = pd.factorize(df_ratings['item_id']) 
print('N Unique Items in Ratings DF: ', len(original_item_id))

N Unique Users in Ratings DF:  69878
N Unique Items in Ratings DF:  10677


In [7]:
user_indices_all = pd.concat(
    [df_ratings['user_id'], df_tags['user_id']],
    ignore_index=True, axis='rows'
)

mapped_user_id_all, original_user_id_all = pd.factorize(user_indices_all)
print('N Unique Users in Ratings DF and Tags DF: ', len(original_user_id_all))

item_indices_all = pd.concat(
    [df_ratings['item_id'], df_tags['item_id']],
    ignore_index=True, axis='rows'
)

mapped_item_id_all, original_item_id_all = pd.factorize(item_indices_all)
print('N Unique Items in Ratings DF and Tags DF: ', len(original_item_id_all))

N Unique Users in Ratings DF and Tags DF:  71567
N Unique Items in Ratings DF and Tags DF:  10681


After adding the ids from `df_tags`, the mappers cover more users (71,567 instead of 69,878) and more items (10,681 instead of 10,677).
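The combined codes can be sliced back into their ratings part and used as row and column indices of a sparse matrix whose shape covers every user and item from both tables. The sketch below does this on toy data. `toy_ratings`, `toy_tags`, and `urm` are hypothetical names, and this is only one possible way to build the matrix, not necessarily the one used later in this notebook.

```python
import pandas as pd
import scipy.sparse as sps

# Toy stand-ins for df_ratings and df_tags (hypothetical values)
toy_ratings = pd.DataFrame({
    "user_id": [10, 10, 20],
    "item_id": [100, 200, 100],
    "rating": [5, 3, 4],
})
toy_tags = pd.DataFrame({
    "user_id": [20, 30],
    "item_id": [200, 300],
    "tag": ["satire", "politics"],
})

n_ratings = len(toy_ratings)

# Shared mappers built from the ids of both tables
user_codes, user_ids = pd.factorize(
    pd.concat([toy_ratings["user_id"], toy_tags["user_id"]], ignore_index=True)
)
item_codes, item_ids = pd.factorize(
    pd.concat([toy_ratings["item_id"], toy_tags["item_id"]], ignore_index=True)
)

# The first n_ratings codes belong to the ratings rows
urm = sps.csr_matrix(
    (
        toy_ratings["rating"].to_numpy(),
        (user_codes[:n_ratings], item_codes[:n_ratings]),
    ),
    shape=(len(user_ids), len(item_ids)),
)

print(urm.shape)
print(urm.toarray())
```

User 30 and item 300 appear only in the toy tags table, so their row and column exist in `urm` but stay empty.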

In [24]:
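# `unique()` keeps NaN as a value, but `pd.factorize` codes NaN as -1 and omits it
# from the returned uniques. That presumably caused the original failure,
# "ValueError: Length of passed values is 16529, index implies 16528."
# Dropping missing tags first keeps codes and uniques the same length.
mapped_tag_id, original_tag_id = pd.factorize(df_tags['tag'].dropna().unique())
original_tag_id_indexed = pd.Series(mapped_tag_id, index=original_tag_id)
original_tag_id_indexed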
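`pd.factorize` treats missing values specially: by default it assigns them the code -1 and leaves them out of the returned uniques, so the two arrays can differ in length when the input contains a NaN. The sketch below reproduces this on toy data and shows one possible workaround, assuming missing tags can simply be dropped. `toy_tags` and `tag_to_index` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy tag column with one missing value (hypothetical data)
toy_tags = pd.Series(["comedy", np.nan, "satire", "comedy"], dtype=object)

# unique() keeps NaN as one of the distinct values
unique_with_nan = toy_tags.unique()

# factorize codes NaN as -1 and omits it from the uniques
codes, uniques = pd.factorize(unique_with_nan)
print(len(codes), len(uniques))

# Dropping NaN first keeps codes and uniques aligned
clean_codes, clean_uniques = pd.factorize(toy_tags.dropna().unique())
tag_to_index = pd.Series(clean_codes, index=clean_uniques)
print(tag_to_index)
```

Passing `codes` and `uniques` from the first call to `pd.Series(codes, index=uniques)` would raise a length-mismatch `ValueError`, because there is one more code than there are uniques.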

In [None]:
# Note: `ICM_dataframe` is not defined in this section and this cell has not been run.
mapped_id, original_id = pd.factorize(ICM_dataframe["FeatureID"].unique())
feature_original_ID_to_index = pd.Series(mapped_id, index=original_id)

print("Unique FeatureID in the ICM are {}".format(len(feature_original_ID_to_index)))