<h1>Collaborative Filtering Movie Recommendation - Item Based - Actors</h1>

In this notebook, we determine the similarity between movies based on their actors. To analyze movie actors, we will use one data file: metadata.json.

metadata.json contains 84,661 lines of movie information from MovieLens: movie title, directors, actors, the date the movie was added to MovieLens, average rating, IMDb id, and MovieLens movie id (item_id).

To determine similarity between two movies, we will use Jaccard Similarity:
Let $x$ and $y$ be a pair of binary 0/1 vectors.

|       | $y_i = 1$ | $y_i = 0$ |
|-------|-----------|-----------|
| $x_i = 1$ | $M_{11}$  | $M_{10}$  |
| $x_i = 0$ | $M_{01}$  | $M_{00}$  |

- $M_{ab}$: number of positions $i$ at which $x_i = a$ and $y_i = b$.
- $\text{Jaccard}(x, y) = \dfrac{M_{11}}{M_{01} + M_{10} + M_{11}}$
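
As a small worked example of the definition above, the following sketch computes the Jaccard similarity of two made-up binary vectors from the counts $M_{11}$, $M_{10}$ and $M_{01}$, and checks that it agrees with the set form (size of the intersection over size of the union). The vectors and the helper name `jaccard_from_vectors` are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def jaccard_from_vectors(x, y):
    """Jaccard similarity of two binary 0/1 vectors (hypothetical helper)."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    m11 = int(np.sum(x & y))    # positions where both vectors are 1
    m10 = int(np.sum(x & ~y))   # positions where only x is 1
    m01 = int(np.sum(~x & y))   # positions where only y is 1
    denom = m01 + m10 + m11     # M00 is deliberately left out
    return m11 / denom if denom else 0.0

# Made-up example: each position stands for one actor
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
vector_score = jaccard_from_vectors(x, y)

# The same vectors expressed as sets of "present" positions
set_x = {i for i, v in enumerate(x) if v}
set_y = {i for i, v in enumerate(y) if v}
set_score = len(set_x & set_y) / len(set_x | set_y)

print(vector_score, set_score)
```

Because $M_{00}$ never enters the formula, positions where both vectors are 0 do not inflate the score, which matters when almost every entry is 0.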

After preprocessing the dataset, we will generate a .csv file with three columns: movie_id, similar_movie, and jac_sim.

<h3>import data</h3>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Importing metadata.json
metadata = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\metadata.json', lines=True)
metadata.head()

| | title | directedBy | starring | dateAdded | avgRating | imdbId | item_id |
|---|---|---|---|---|---|---|---|
| 0 | Toy Story (1995) | John Lasseter | Tim Allen, Tom Hanks, Don Rickles, Jim Varney,... | | 3.89146 | 114709 | 1 |
| 1 | Jumanji (1995) | Joe Johnston | Jonathan Hyde, Bradley Pierce, Robin Williams,... | | 3.26605 | 113497 | 2 |
| 2 | Grumpier Old Men (1995) | Howard Deutch | Jack Lemmon, Walter Matthau, Ann-Margret , Sop... | | 3.17146 | 113228 | 3 |
| 3 | Waiting to Exhale (1995) | Forest Whitaker | Angela Bassett, Loretta Devine, Whitney Housto... | | 2.86824 | 114885 | 4 |
| 4 | Father of the Bride Part II (1995) | Charles Shyer | Steve Martin, Martin Short, Diane Keaton, Kimb... | | 3.0762 | 113041 | 5 |


In [3]:
# Examining duplicated rows in metadata
metadata_duplicates = metadata.duplicated()
print(metadata[metadata_duplicates])

Empty DataFrame
Columns: [title, directedBy, starring, dateAdded, avgRating, imdbId, item_id]
Index: []
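
`DataFrame.duplicated()` with no arguments only flags rows that match on every column. As a complementary check, one could also look for repeated `item_id` values, since that column is used as a dictionary key later in the notebook. A minimal sketch on made-up rows (the sample frame is hypothetical, not taken from metadata.json):

```python
import pandas as pd

# Made-up rows: item_id 2 appears twice with different titles
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie B (alt)'],
    'item_id': [1, 2, 2],
})

# Full-row check: no two rows are identical in every column
full_row_dupes = sample[sample.duplicated()]

# Key-level check: keep=False marks every row involved in a repeated key
key_dupes = sample[sample.duplicated(subset=['item_id'], keep=False)]

print(len(full_row_dupes), len(key_dupes))
```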


In [4]:
# Examining the distribution of the number of actors per movie in metadata
actor_count_distribution = {}

for index, row in metadata.iterrows():
    # Missing 'starring' values count as zero actors
    starring = row["starring"].split(', ') if pd.notna(row["starring"]) else []
    actor_count = len(starring)
    actor_count_distribution[actor_count] = actor_count_distribution.get(actor_count, 0) + 1

# Sort by actor count so the distribution reads in ascending order
actor_count_distribution = dict(sorted(actor_count_distribution.items()))
actor_count_distribution

{1: 61074,
 2: 1495,
 3: 4646,
 4: 7563,
 5: 3687,
 6: 1806,
 7: 1065,
 8: 744,
 9: 515,
 10: 394,
 11: 324,
 12: 258,
 13: 197,
 14: 163,
 15: 179,
 16: 90,
 17: 83,
 18: 84,
 19: 46,
 20: 41,
 21: 42,
 22: 24,
 23: 16,
 24: 19,
 25: 13,
 26: 12,
 27: 11,
 28: 13,
 29: 15,
 30: 9,
 31: 4,
 32: 6,
 33: 1,
 34: 2,
 35: 4,
 36: 3,
 37: 2,
 38: 1,
 40: 2,
 41: 3,
 51: 1,
 52: 1,
 57: 2,
 70: 1}

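When names are split on `', '`, most movies in the dataset (61,074) are listed with only one actor. Because the actor data is this sparse, Jaccard similarity, which ignores actors that appear in neither movie, is a reasonable choice for measuring the similarity between two movies.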
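
The counting loop above can also be written with `collections.Counter`. The sketch below does that on a tiny made-up frame and, as an assumption worth checking against the real file, splits on a bare comma and strips whitespace instead of splitting on the exact string `', '`. Some rows in the `movie_actors_df` preview later in this notebook appear to separate names with a comma and no space, which `split(', ')` would treat as a single actor. The `split_actors` helper and the sample rows are hypothetical.

```python
from collections import Counter

import pandas as pd

def split_actors(starring):
    """Split a 'starring' string on commas, tolerating missing or extra spaces."""
    if pd.isna(starring):
        return []
    names = [name.strip() for name in starring.split(',')]
    return [name for name in names if name]  # drop empty fragments

# Made-up rows that mimic both delimiter styles plus a missing value
sample = pd.DataFrame({
    'starring': [
        'Tim Allen, Tom Hanks, Don Rickles',
        'William Shatner,Lynn Carlin',
        None,
    ]
})

# Strict split, as in the cell above
strict_counts = Counter(
    len(s.split(', ')) if pd.notna(s) else 0 for s in sample['starring']
)
# Whitespace-tolerant split
tolerant_counts = Counter(len(split_actors(s)) for s in sample['starring'])

print(dict(sorted(strict_counts.items())))
print(dict(sorted(tolerant_counts.items())))
```

On the sample rows, the strict split counts the second movie as having one actor while the tolerant split counts two. If the same pattern holds in the real data, the share of single-actor movies reported above may be overstated.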

<h3>calculate the Jaccard similarity</h3>

In [5]:
# Extract a set of all unique actors
all_actors = set()
for actors in metadata['starring']:
    all_actors.update(actors.split(', '))
actor_list = list(all_actors)

In [6]:
len(actor_list)

94646

There are 94646 unique actors in total.

In [7]:
movie_actors_df = pd.DataFrame({
    'item_id': metadata['item_id'],
    'actors': [actors.split(', ') for actors in metadata['starring']]
})

In [8]:
movie_actors_df

| | item_id | actors |
|---|---|---|
| 0 | 1 | [Tim Allen, Tom Hanks, Don Rickles, Jim Varney... |
| 1 | 2 | [Jonathan Hyde, Bradley Pierce, Robin Williams... |
| 2 | 3 | [Jack Lemmon, Walter Matthau, Ann-Margret , So... |
| 3 | 4 | [Angela Bassett, Loretta Devine, Whitney Houst... |
| 4 | 5 | [Steve Martin, Martin Short, Diane Keaton, Kim... |
| ... | ... | ... |
| 84656 | 239306 | [William Shatner,Lynn Carlin,Ossie Davis,Vivec... |
| 84657 | 239308 | [Richard Crenna,Patty Duke,Vic Morrow,Arlene G... |
| 84658 | 239310 | [Chinawut Indracusin,Paisarnkulwong Vachiravit... |
| 84659 | 239312 | [วชิรวิชญ์ ไพศาลกุลวงศ์,ภูริพรรธน์ เวชวงศาเตชา... |


In [9]:
# Function to calculate the Jaccard similarity
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

In [10]:
movie_actors_dict = movie_actors_df.set_index('item_id')['actors'].apply(set).to_dict()

In [11]:
# Function to find similar movies based on shared actors
def find_similar_movies_by_actors(target_id, top_n=100):
    # Retrieve the set of actors for the target movie
    target_actors = movie_actors_dict[target_id]
    similarities = []

    # Iterate through each movie and calculate similarity
    for movie_id, actors in movie_actors_dict.items():
        if movie_id != target_id:
            if target_actors.intersection(actors):  # Check for shared actors
                jac_sim = jaccard_similarity(target_actors, actors)
                similarities.append((movie_id, jac_sim))

    # Sort by Jaccard similarity and limit to top_n results
    sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n]

    # Create a DataFrame for the result
    result_df = pd.DataFrame(sorted_similarities, columns=['similar_movie', 'jac_sim'])
    result_df.insert(0, 'movie_id', target_id)

    return result_df
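
The function above compares the target against every movie in `movie_actors_dict`, so running it for all movies grows quadratically with the size of the catalogue. One possible alternative, sketched here under the assumption that most pairs of movies share no actor, is to build an inverted index from actor to movie ids first and then score only the candidates that share at least one actor with the target. The names `build_actor_index` and `similar_by_index` and the toy dictionary are hypothetical, and this has not been run on the MovieLens data.

```python
from collections import defaultdict

def build_actor_index(movie_actors):
    """Map each actor to the set of movie ids they appear in."""
    index = defaultdict(set)
    for movie_id, actors in movie_actors.items():
        for actor in actors:
            index[actor].add(movie_id)
    return index

def similar_by_index(target_id, movie_actors, index, top_n=100):
    """Return (movie_id, jaccard) pairs for movies sharing an actor with the target."""
    target_actors = movie_actors[target_id]

    # Candidates are only the movies that share at least one actor
    candidates = set()
    for actor in target_actors:
        candidates |= index[actor]
    candidates.discard(target_id)

    scored = []
    for movie_id in candidates:
        actors = movie_actors[movie_id]
        shared = len(target_actors & actors)
        union = len(target_actors) + len(actors) - shared
        scored.append((movie_id, shared / union))

    # Highest similarity first; ties are broken by movie id for a stable order
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Toy data standing in for movie_actors_dict
toy = {
    1: {'A', 'B', 'C'},
    2: {'A', 'B', 'D'},
    3: {'C'},
    4: {'E'},
}
toy_index = build_actor_index(toy)
print(similar_by_index(1, toy, toy_index))
```

Note that the tie-breaking rule here (by movie id) is a choice made for the sketch; the original function keeps tied movies in dictionary order, so the two can return different movies at the `top_n` cutoff.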

In [12]:
# Use movie id 1 to test the find_similar_movies_by_actors function
find_similar_movies_by_actors(1, top_n=100)

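| | movie_id | similar_movie | jac_sim |
|---|---|---|---|
| 0 | 1 | 3114 | 0.476190 |
| 1 | 1 | 78499 | 0.216216 |
| 2 | 1 | 106022 | 0.200000 |
| 3 | 1 | 8961 | 0.105263 |
| 4 | 1 | 73469 | 0.100000 |
| ... | ... | ... | ... |
| 95 | 1 | 48518 | 0.062500 |
| 96 | 1 | 68554 | 0.062500 |
| 97 | 1 | 91325 | 0.062500 |
| 98 | 1 | 105504 | 0.062500 |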


In [13]:
movie_actors_df[movie_actors_df['item_id'].isin([1, 3114])]

| | item_id | actors |
|---|---|---|
| 0 | 1 | [Tim Allen, Tom Hanks, Don Rickles, Jim Varney... |
| 3028 | 3114 | [Tom Hanks, Tim Allen, Joan Cusack, Kelsey Gra... |


In [14]:
# Collect the per-movie results in a list and concatenate once at the end
# (calling pd.concat inside the loop copies the growing DataFrame every time)
similar_frames = []

# Iterate through each movie in the DataFrame
for movie_id in movie_actors_df['item_id']:
    # Find similar movies for the current movie
    similar_frames.append(find_similar_movies_by_actors(movie_id))

# Combine all results and reset the index of the final DataFrame
similar_movie_actors = pd.concat(similar_frames, ignore_index=True)

# Export to CSV
similar_movie_actors.to_csv('similar_movie_actors.csv', index=False)