<h1>Collaborative Filtering Movie Recommendation - Item Based - Directors</h1>

In this notebook, we determine the similarity between movies based on their directors. The analysis uses one data file: metadata.json.

metadata.json contains 84,661 lines of movie information from MovieLens: movie title, directors, actors, the date the movie was added to MovieLens, average rating, IMDb id, and MovieLens movie id (item_id).

After preprocessing the dataset, we will generate a .csv file with three columns: movie id, director, and the other movies that share the same director.

<h3>Import data</h3>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Import metadata.json (one JSON object per line)
metadata = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\metadata.json', lines=True)
metadata.head()

Unnamed: 0,title,directedBy,starring,dateAdded,avgRating,imdbId,item_id
0,Toy Story (1995),John Lasseter,"Tim Allen, Tom Hanks, Don Rickles, Jim Varney,...",,3.89146,114709,1
1,Jumanji (1995),Joe Johnston,"Jonathan Hyde, Bradley Pierce, Robin Williams,...",,3.26605,113497,2
2,Grumpier Old Men (1995),Howard Deutch,"Jack Lemmon, Walter Matthau, Ann-Margret , Sop...",,3.17146,113228,3
3,Waiting to Exhale (1995),Forest Whitaker,"Angela Bassett, Loretta Devine, Whitney Housto...",,2.86824,114885,4
4,Father of the Bride Part II (1995),Charles Shyer,"Steve Martin, Martin Short, Diane Keaton, Kimb...",,3.0762,113041,5


In [3]:
# Check metadata for fully duplicated rows
metadata_duplicates = metadata.duplicated()
print(metadata[metadata_duplicates])

Empty DataFrame
Columns: [title, directedBy, starring, dateAdded, avgRating, imdbId, item_id]
Index: []


In [4]:
# Create a dictionary to map directors to movies
director_to_movies = {}
for index, row in metadata.iterrows():
    director = row['directedBy']
    movie_id = row['item_id']
    if director not in director_to_movies:
        director_to_movies[director] = []
    director_to_movies[director].append(movie_id)
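
The director-to-movies mapping above can also be built with a pandas groupby, which avoids the row-by-row loop. The following is a minimal sketch on a toy frame, not the notebook's actual data. It assumes that some rows may have an empty or missing directedBy value (not verified here) and drops them so that unrelated movies are not grouped under one blank "director". The names `toy`, `known`, and `director_to_movies_toy` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the metadata frame; the real one is loaded from metadata.json.
toy = pd.DataFrame({
    "item_id": [1, 2, 3, 4],
    "directedBy": ["A", "B", "A", ""],
})

# Assumption: blank or missing directors should not form a group of their own.
known = toy[toy["directedBy"].fillna("").str.strip() != ""]

# Collect the movie ids of each director into a list, then convert to a dict.
director_to_movies_toy = (
    known.groupby("directedBy")["item_id"].apply(list).to_dict()
)
print(director_to_movies_toy)
```

Whether to drop blank directors is a design choice; the loop above keeps them, so any such movies would list each other as "similar".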

In [5]:
# Build the rows for the CSV file: one row per movie
csv_data = []
for index, row in metadata.iterrows():
    movie_id = row['item_id']
    director = row['directedBy']
    # Filter out the current movie to only list other movies
    other_movies = [m for m in director_to_movies[director] if m != movie_id]
    csv_data.append([movie_id, director, other_movies])
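
To sanity-check the "other movies" logic, it may help to run it on a tiny frame where the expected result is obvious. The sketch below uses hypothetical names (`toy`, `groups`, `toy_result`) and made-up ids; it mirrors the filtering step above but iterates with `zip` over the two columns instead of `iterrows`.

```python
import pandas as pd

# Three toy movies: 10 and 12 share director "A", 11 is the only film by "B".
toy = pd.DataFrame({"item_id": [10, 11, 12], "directedBy": ["A", "B", "A"]})

# Series indexed by director, holding the list of that director's movie ids.
groups = toy.groupby("directedBy")["item_id"].apply(list)

# For each movie, keep every movie by the same director except the movie itself.
rows = [
    (movie_id, director, [m for m in groups[director] if m != movie_id])
    for movie_id, director in zip(toy["item_id"], toy["directedBy"])
]
toy_result = pd.DataFrame(rows, columns=["movie_id", "director", "other_movies"])
print(toy_result)
```

A movie whose director has no other films ends up with an empty list, as movie 11 does here.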

In [6]:
csv_df = pd.DataFrame(csv_data, columns=['movie_id', 'director', 'other_movies'])

In [7]:
csv_df.head()

Unnamed: 0,movie_id,director,other_movies
0,1,John Lasseter,"[2355, 42191, 45517, 95446, 95628, 95856, 2139..."
1,2,Joe Johnston,"[2054, 2094, 2501, 4638, 7324, 74452, 88140, 1..."
2,3,Howard Deutch,"[460, 1290, 1837, 2145, 3861, 4509, 6841, 7381..."
3,4,Forest Whitaker,"[1888, 8869, 88761]"
4,5,Charles Shyer,"[360, 4080, 4959, 6944, 8948, 26514]"


In [8]:
# Export to CSV
csv_df.to_csv('similar_movies_director.csv', index=False)
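
One thing to keep in mind when reading this file back: `to_csv` writes each list in other_movies as its string form (for example "[3, 4]"), so a plain `read_csv` returns strings, not lists. A minimal sketch of one way to restore the lists, using `ast.literal_eval` as a converter, is shown below. It round-trips a toy frame through an in-memory buffer instead of the real file; `df`, `buffer`, and `restored` are hypothetical names.

```python
import ast
import io

import pandas as pd

# Toy frame with the same three columns as the exported CSV.
df = pd.DataFrame({
    "movie_id": [1, 2],
    "director": ["A", "B"],
    "other_movies": [[3, 4], []],
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the stringified lists back into Python lists while reading.
restored = pd.read_csv(buffer, converters={"other_movies": ast.literal_eval})
print(restored["other_movies"].tolist())
```

For the real file, the same `converters` argument can be passed when reading similar_movies_director.csv.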