<h1>Collaborative Filtering Movie Recommendation - Item Based - Tags</h1>

In this notebook, we focus on determining the similarity between movies based on their tags. To analyze movie tags, we will utilize two data files: tag_count.json and survey_answers.json. 

tag_count.json contains 212,704 entries, detailing the number of times MovieLens users have attached a particular tag to a movie. 

survey_answers.json includes 58,903 entries, showing the ratings that MovieLens users have given to movie-tag pairs. Users were asked to indicate the degree to which a tag applies to a movie on a 5-point scale, from not applying at all (1 point) to applying very strongly (5 points). Users could also indicate that they were not sure (the -1 value).

After preprocessing these two datasets, we will construct a movie-tags matrix, indicating whether a movie contains a particular tag. In this matrix, "0" denotes the absence of a tag, while "1" signifies its presence. Finally, we will employ cosine similarity to assess the similarity of movies in terms of their tags.
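
As a small illustration of the final step, the sketch below computes cosine similarity for two hypothetical binary tag vectors. The vectors and the helper name `cosine_sim` are made up for this example and are not part of the notebook's pipeline. For 0/1 vectors, the dot product is simply the number of shared tags, and each norm is the square root of a movie's tag count.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical movies described by four tags (1 = tag present, 0 = absent)
movie_a = [1, 1, 0, 1]   # three tags
movie_b = [1, 0, 0, 1]   # two tags, both shared with movie_a

sim_ab = cosine_sim(movie_a, movie_b)
print(round(sim_ab, 4))  # equals 2 / sqrt(3 * 2)
```

A movie without any tags has a zero vector, so its similarity to every other movie is treated as 0 here rather than being undefined.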

<h3>import data</h3>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load tag_count.json
tag_count = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\tag_count.json', lines=True)
print(tag_count.shape)
tag_count.head()

(212704, 3)


Unnamed: 0,item_id,tag_id,num
0,1,86963,4
1,1,42940,1
2,1,37116,26
3,1,52206,1
4,1,34442,21


In [3]:
# check tag_count for duplicated rows
tag_count_duplicates = tag_count.duplicated()
print(tag_count[tag_count_duplicates])

Empty DataFrame
Columns: [item_id, tag_id, num]
Index: []
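
Note that `duplicated()` without arguments only flags rows that are identical in every column. If each (item_id, tag_id) pair is expected to appear only once, a stricter check on just those key columns may be worth running as well. The sketch below uses a small invented frame (`toy_tag_count`) rather than the real data.

```python
import pandas as pd

# Toy stand-in for tag_count; the values are invented for illustration
toy_tag_count = pd.DataFrame({
    'item_id': [1, 1, 2, 2],
    'tag_id':  [10, 11, 10, 10],
    'num':     [4, 1, 2, 3],
})

# No row is a full duplicate of another row ...
full_dupes = toy_tag_count.duplicated().sum()

# ... but the pair (2, 10) occurs twice with different counts
key_dupes = toy_tag_count.duplicated(subset=['item_id', 'tag_id'], keep=False)

print(full_dupes)
print(toy_tag_count[key_dupes])
```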


In [4]:
# survey_answers.json
survey_answers = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\survey_answers.json', lines=True)
print(survey_answers.shape)
survey_answers.head()

(58903, 4)


Unnamed: 0,user_id,item_id,tag_id,score
0,978707,3108,50126,3
1,978707,2858,50126,1
2,978707,1269,50126,1
3,978707,1136,50126,1
4,978707,1220,50126,1


In [5]:
# check survey_answers for duplicated rows
survey_answers_duplicates = survey_answers.duplicated()
print(survey_answers[survey_answers_duplicates])

Empty DataFrame
Columns: [user_id, item_id, tag_id, score]
Index: []


<h3>create a movie-tag matrix to calculate the cosine similarity</h3>

In [6]:
# merge tag_count and survey_answers
movie_tags = pd.merge(tag_count, survey_answers, on=['item_id', 'tag_id'], how='outer')
movie_tags.head()

Unnamed: 0,item_id,tag_id,num,user_id,score
0,1,86963,4.0,820082.0,1.0
1,1,86963,4.0,423144.0,2.0
2,1,86963,4.0,745878.0,1.0
3,1,86963,4.0,997388.0,1.0
4,1,86963,4.0,976135.0,5.0
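
To see where each merged row comes from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below demonstrates this on two invented frames (`toy_counts`, `toy_survey`); it is an illustration, not a step of the original notebook. It also shows why `num` and `score` become floats after an outer merge: unmatched rows are filled with NaN.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all values are invented
toy_counts = pd.DataFrame({'item_id': [1, 1], 'tag_id': [10, 11], 'num': [4, 1]})
toy_survey = pd.DataFrame({'user_id': [7, 8], 'item_id': [1, 2],
                           'tag_id': [10, 12], 'score': [5, 1]})

merged = pd.merge(toy_counts, toy_survey, on=['item_id', 'tag_id'],
                  how='outer', indicator=True)

# '_merge' is 'both', 'left_only' or 'right_only' for every row
origin_counts = merged['_merge'].value_counts()
print(merged)
print(origin_counts)
```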


In [7]:
# Build an empty movie_tag_matrix
unique_items = movie_tags['item_id'].unique()
unique_tags = movie_tags['tag_id'].unique()
zeros_matrix = np.zeros((len(unique_items), len(unique_tags)))
movie_tag_matrix = pd.DataFrame(zeros_matrix, index=unique_items, columns=unique_tags)
movie_tag_matrix.index.name = 'item_id'
movie_tag_matrix.columns.name = 'tag_id'
movie_tag_matrix.sort_index(axis=0, inplace=True)
movie_tag_matrix.sort_index(axis=1, inplace=True)

In [8]:
print(movie_tags.shape)

# show the rows of the merged dataframe that contain missing values
movie_tags[movie_tags.isna().any(axis=1)]

(259931, 5)


Unnamed: 0,item_id,tag_id,num,user_id,score
15,1,37116,26.0,,
16,1,52206,1.0,,
17,1,34442,21.0,,
25,1,104090,1.0,,
26,1,55743,10.0,,
...,...,...,...,...,...
259926,5010,60807,,705354.0,-1.0
259927,55245,91482,,705354.0,1.0
259928,56949,91482,,705354.0,1.0
259929,7255,91482,,705354.0,2.0


Many movie-tag pairs appear in only one of the two datasets, so their merged rows are missing either the tag count or the survey score. We will therefore analyze tag_count and survey_answers separately.

- Analysis of tag_count

In [9]:
# mean and median of 'num' over the merged rows (a pair with several survey answers appears once per answer)
mean_num = movie_tags['num'].mean()
print(mean_num)
median_num = movie_tags['num'].median()
print(median_num)

5.529023895655787
1.0


In [10]:
# distribution of how many times a tag was attached to a movie
movie_tags['num'].value_counts()

num
1.0      141183
2.0       26220
3.0       10782
4.0        6155
5.0        4118
          ...  
165.0         1
191.0         1
497.0         1
249.0         1
200.0         1
Name: count, Length: 258, dtype: int64

In the "tag_count.json" dataset, we observe that most tags have been attached to a movie only once, which may be an incidental choice rather than a widely shared view. To obtain more robust results, we set a threshold of 2: a movie is considered to have a particular tag only if that tag has been attached to it at least twice (num > 1).

In [11]:
# set matrix entries to 1 where a tag was attached to a movie more than once (num > 1)
for index, row in movie_tags[movie_tags['num'] > 1].iterrows():
    if row['item_id'] in movie_tag_matrix.index and row['tag_id'] in movie_tag_matrix.columns:
        movie_tag_matrix.at[row['item_id'], row['tag_id']] = 1
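
Row-by-row loops with `iterrows()` can be slow on a frame of this size. One possible alternative, sketched here on invented data, translates the IDs into row and column positions with `Index.get_indexer` and fills the matrix in a single NumPy assignment. The names `toy_movie_tags` and `toy_matrix` are placeholders for this example.

```python
import numpy as np
import pandas as pd

# Toy merged frame; values are invented
toy_movie_tags = pd.DataFrame({
    'item_id': [1, 1, 2, 3],
    'tag_id':  [10, 11, 10, 12],
    'num':     [4.0, 1.0, 2.0, np.nan],
})

items = np.sort(toy_movie_tags['item_id'].unique())
tags = np.sort(toy_movie_tags['tag_id'].unique())
toy_matrix = pd.DataFrame(0.0, index=items, columns=tags)

# Keep the pairs above the threshold, then translate IDs to positions
selected = toy_movie_tags[toy_movie_tags['num'] > 1]
row_pos = toy_matrix.index.get_indexer(selected['item_id'])
col_pos = toy_matrix.columns.get_indexer(selected['tag_id'])

# Fill a NumPy copy in one step and rebuild the frame
values = toy_matrix.to_numpy().copy()
values[row_pos, col_pos] = 1
toy_matrix = pd.DataFrame(values, index=items, columns=tags)
print(toy_matrix)
```

Rows whose `num` is NaN are dropped by the comparison `num > 1`, so they never reach the assignment.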

- Analysis of survey_answers

In [12]:
# count the survey answers per score (np.nan marks rows without a survey answer)
columns = [-1, 1, 2, 3, 4, 5, np.nan]
rows = [
    (movie_tags['score'] == -1).sum(),
    (movie_tags['score'] == 1).sum(),
    (movie_tags['score'] == 2).sum(),
    (movie_tags['score'] == 3).sum(),
    (movie_tags['score'] == 4).sum(),
    (movie_tags['score'] == 5).sum(),
    pd.isna(movie_tags['score']).sum()  
]
df_num_score = pd.DataFrame(data={'score_count': rows}, index=columns)
df_num_score.index.name = 'score'
df_num_score

score,score_count
-1.0,7740
1.0,17208
2.0,6296
3.0,7013
4.0,8215
5.0,12431
NaN,201028
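
The same kind of table can be produced more compactly with `value_counts(dropna=False)`, which keeps NaN as its own category. The sketch below runs on an invented score column (`toy_scores`), so its counts are unrelated to the real ones above.

```python
import numpy as np
import pandas as pd

# Toy score column; values are invented
toy_scores = pd.Series([5, 1, -1, 5, np.nan, 4, np.nan], name='score')

# dropna=False keeps NaN as its own category; sort_index orders by score
score_counts = toy_scores.value_counts(dropna=False).sort_index()
print(score_counts)
```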


For "survey_answers.json", we consider a movie to have a specific tag if at least one user rated that movie-tag pair 4 or 5.

In [13]:
# set matrix entries to 1 where a movie-tag pair received a score of 4 or 5
for index, row in movie_tags[(movie_tags['score'] == 4) | (movie_tags['score'] == 5)].iterrows():
    if row['item_id'] in movie_tag_matrix.index and row['tag_id'] in movie_tag_matrix.columns:
        movie_tag_matrix.at[row['item_id'], row['tag_id']] = 1

In [14]:
# count the movies that have no tag
num_zero_rows = ((movie_tag_matrix == 0).all(axis=1)).sum()
num_zero_rows

24770

In [15]:
# count the tags that are not attached to any movie
num_zero_columns = ((movie_tag_matrix == 0).all(axis=0)).sum()
num_zero_columns

0

Out of the 39,809 movies in our dataset, there are 24,770 movies for which we cannot confidently assign any tags. Conversely, for each of the 1,094 tags, we can confidently identify at least one movie associated with it.

- An example of movie-tags

As an example, we use Titanic (1997) (movie id: 1721) to check whether the movie-tag matrix makes sense.

In [16]:
tags = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\tags.json', lines=True)

In [17]:
# Extract tags from movie_tag_matrix for movie Titanic (1997) (movie id: 1721)
columns_with_value_1 = movie_tag_matrix.columns[movie_tag_matrix.loc[1721] == 1].tolist()
matching_tags = tags[tags['id'].isin(columns_with_value_1)]['tag'].tolist()
matching_tags

['historical',
 'nostalgic',
 'tear jerker',
 'girlie movie',
 'predictable',
 'music',
 'atmospheric',
 'history',
 'overrated',
 'oscar (best directing)',
 'realistic',
 'long',
 'nudity (topless)',
 'excellent',
 'catastrophe',
 'romance',
 'big budget',
 'ocean',
 'romantic',
 'scenic',
 'sentimental',
 'action',
 'disaster',
 'oscar winner',
 'based on true story',
 'epic',
 'love story',
 'oscar (best picture)',
 'boring',
 'love',
 'classic',
 'stylized',
 'pg-13',
 'based on a true story',
 'sex',
 'chick flick',
 'true story',
 'bittersweet',
 '70mm',
 'good acting',
 'drama',
 'time travel',
 'sacrifice',
 'period piece',
 'natural disaster',
 'nudity (topless - notable)',
 'too long',
 'survival',
 'oscar (best cinematography)']

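The result looks promising: most of the tags describe Titanic well, although a few (for example 'time travel') look like noise. Overall, the movie-tags matrix appears usable as a feature representation of the movies.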

<h3>calculate the cosine similarity in movie-tags matrix</h3>

In [20]:
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

To find the movies most similar to a given movie in terms of tags, we first select the other movies that share at least one tag with it, then calculate their cosine similarity to it. The results will be stored in a .csv file.

In [23]:
sparse_matrix = csr_matrix(movie_tag_matrix)

# Create a mapping between movie IDs and matrix row indices
movie_id_to_index = {movie_id: idx for idx, movie_id in enumerate(movie_tag_matrix.index)}

# Function to find and calculate similarity for movies with shared tags
def find_similar_movies(movie_id, top_n=100):
    if movie_id not in movie_id_to_index:
        raise ValueError(f"Movie ID {movie_id} not found in the dataset.")

    movie_index = movie_id_to_index[movie_id]
    shared_tags = movie_tag_matrix.columns[movie_tag_matrix.loc[movie_id] > 0]
    filtered_movie_ids = movie_tag_matrix.index[movie_tag_matrix[shared_tags].sum(axis=1) > 0].tolist()

    # Remove the current movie from the list (it is absent when the movie has no tags)
    if movie_id in filtered_movie_ids:
        filtered_movie_ids.remove(movie_id)

    # Calculate cosine similarity
    sim_scores = {}
    for other_movie_id in filtered_movie_ids:
        other_movie_index = movie_id_to_index[other_movie_id]
        sim_score = cosine_similarity(sparse_matrix[movie_index:movie_index+1], sparse_matrix[other_movie_index:other_movie_index+1])[0, 0]
        sim_scores[other_movie_id] = sim_score

    # Sort by similarity and select top N
    top_similar = sorted(sim_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]

    # Create a DataFrame for the result
    result_df = pd.DataFrame(top_similar, columns=['similar_movie', 'cos_sim'])
    result_df.insert(0, 'item_id', movie_id)

    return result_df
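
Calling the similarity routine once per candidate pair is correct but can be slow. As a hedged alternative, the sketch below scores one row against all rows in a single matrix product, using plain NumPy on an invented 4x4 matrix. The helper `top_similar_rows` is hypothetical and works on row positions rather than movie IDs.

```python
import numpy as np

def top_similar_rows(matrix, row, top_n=3):
    """Return (row_index, cosine_similarity) pairs for the rows most similar to `row`."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    dots = matrix @ matrix[row]
    # Rows with no tags have norm 0; leave their similarity at 0
    denom = norms * norms[row]
    sims = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    sims[row] = -1.0  # exclude the query row itself
    order = np.argsort(-sims, kind='stable')[:top_n]
    # Keep only rows that share at least one tag (similarity > 0)
    return [(int(i), float(sims[i])) for i in order if sims[i] > 0]

# Hypothetical 4 movies x 4 tags
toy = [[1, 1, 0, 1],
       [1, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0]]
result = top_similar_rows(toy, row=0)
print(result)
```

With the real data, scikit-learn's `cosine_similarity` also accepts sparse inputs, so a call such as `cosine_similarity(sparse_matrix[movie_index], sparse_matrix)` should return the scores against every movie at once; that variant is untested here.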

In [27]:
# using Titanic (1997) (movie id: 1721) as an example to test the result
metadata = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\metadata.json', lines=True)
similar_movie_example = find_similar_movies(1721, top_n=100)
metadata_renamed = metadata.rename(columns={'item_id': 'similar_movie'})
similar_movie_example = similar_movie_example.merge(metadata_renamed[['similar_movie', 'title']], on='similar_movie', how='left')
similar_movie_example

Unnamed: 0,item_id,similar_movie,cos_sim,title
0,1721,110,0.407543,Braveheart (1995)
1,1721,920,0.404962,Gone with the Wind (1939)
2,1721,4310,0.384900,Pearl Harbor (2001)
3,1721,8533,0.370389,"Notebook, The (2004)"
4,1721,6947,0.346479,Master and Commander: The Far Side of the Worl...
...,...,...,...,...
95,1721,260,0.224667,Star Wars: Episode IV - A New Hope (1977)
96,1721,88163,0.224133,"Crazy, Stupid, Love. (2011)"
97,1721,2932,0.223607,Days of Heaven (1978)
98,1721,64957,0.223105,"Curious Case of Benjamin Button, The (2008)"


Braveheart (1995), Gone with the Wind (1939) and Pearl Harbor (2001) are indeed similar to Titanic (1997) in terms of tags, so the result looks plausible.

In [28]:
# Collect the per-movie results in a list and concatenate once at the end
similar_movies_list = []

# Check each movie one by one; if the movie has tags, calculate its similar movies
for movie_id in movie_tag_matrix.index:
    # Check if the movie has any tags (non-zero row)
    if movie_tag_matrix.loc[movie_id].sum() > 0:
        similar_movies_list.append(find_similar_movies(movie_id))

all_similar_movies_df = pd.concat(similar_movies_list, ignore_index=True)

# Export to a CSV file
all_similar_movies_df.to_csv('similar_movies_tags.csv', index=False)