Please place all data in the `backend/` folder. If you decide to move things around, note the change up here.

# Requirements + general info


## Install the following requirements

> pip install pandas numpy scikit-learn ipykernel tensorflow

## Suggestions

- I really recommend the VS Code extension "Data Wrangler". It helps visualize DataFrames. Otherwise, happy coding :)

---


## IMDb Dataset Details
Download from [imdb](https://datasets.imdbws.com/)

* Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set.
* The first line in each file contains headers that describe what is in each column.
* A '\N' is used to denote that a particular field is missing or null for that title/name.

The available datasets are as follows:

### title.akas.tsv.gz

* titleId (string) - a tconst, an alphanumeric unique identifier of the title
* ordering (integer) – a number to uniquely identify rows for a given titleId
* title (string) – the localized title
* region (string) - the region for this version of the title
* language (string) - the language of the title
* types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
* attributes (array) - Additional terms to describe this alternative title, not enumerated
* isOriginalTitle (boolean) – 0: not original title; 1: original title

### title.basics.tsv.gz

* tconst (string) - alphanumeric unique identifier of the title
* titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* originalTitle (string) - original title, in the original language
* isAdult (boolean) - 0: non-adult title; 1: adult title
* startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
* endYear (YYYY) – TV Series end year. '\N' for all other title types
* runtimeMinutes – primary runtime of the title, in minutes
* genres (string array) – includes up to three genres associated with the title

### title.crew.tsv.gz

* tconst (string) - alphanumeric unique identifier of the title
* directors (array of nconsts) - director(s) of the given title
* writers (array of nconsts) – writer(s) of the given title

### title.episode.tsv.gz

* tconst (string) - alphanumeric identifier of episode
* parentTconst (string) - alphanumeric identifier of the parent TV Series
* seasonNumber (integer) – season number the episode belongs to
* episodeNumber (integer) – episode number of the tconst in the TV series

### title.principals.tsv.gz

* tconst (string) - alphanumeric unique identifier of the title
* ordering (integer) – a number to uniquely identify rows for a given tconst
* nconst (string) - alphanumeric unique identifier of the name/person
* category (string) - the category of job that person was in
* job (string) - the specific job title if applicable, else '\N'
* characters (string) - the name of the character played if applicable, else '\N'

### title.ratings.tsv.gz

* tconst (string) - alphanumeric unique identifier of the title
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received

### name.basics.tsv.gz

* nconst (string) - alphanumeric unique identifier of the name/person
* primaryName (string)– name by which the person is most often credited
* birthYear – in YYYY format
* deathYear – in YYYY format if applicable, else '\N'
* primaryProfession (array of strings)– the top-3 professions of the person
* knownForTitles (array of tconsts) – titles the person is known for
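
To make the format notes above concrete (tab-separated, header row, `\N` for missing values, comma-separated "array" fields), here is a minimal sketch using a tiny made-up sample shaped like `name.basics.tsv`. The sample rows and the `split_array` helper are assumptions for illustration only, not part of the project code.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for name.basics.tsv (all values are made up).
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm0000001\tExample Person\t1950\t\\N\tactor,producer\ttt0000001,tt0000002\n"
    "nm0000002\tAnother Person\t\\N\t\\N\t\\N\t\\N\n"
)

names = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",  # treat IMDb's '\N' marker as a missing value
)


def split_array(value):
    """Hypothetical helper: comma-separated field -> list, missing -> []."""
    return value.split(",") if isinstance(value, str) else []


names["primaryProfession"] = names["primaryProfession"].apply(split_array)
names["knownForTitles"] = names["knownForTitles"].apply(split_array)

print(names[["nconst", "birthYear", "primaryProfession", "knownForTitles"]])
```

Note that a numeric column containing `\N` (such as `birthYear` here) is read as float, because NaN cannot be stored in a plain integer column.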

# Data preparation + saving

In [None]:
# To install libraries, run `pip install pandas` in a separate terminal, and make sure the right environment is active.
# I had to type in the interpreter path manually for the venv to work, so you might have to do the same.
# In VS Code, type ">" in the search bar, choose "Python: Select Interpreter", and select the path of the venv you made.

import pandas as pd

# Load all datasets
"""

name_df = pd.read_csv('data/name.basics.tsv', sep='\t')
title_akas_df = pd.read_csv('data/title.akas.tsv', sep='\t')
title_basics_df = pd.read_csv('data/title.basics.tsv', sep='\t')
title_crew_df = pd.read_csv('data/title.crew.tsv', sep='\t')
title_episode_df = pd.read_csv('data/title.episode.tsv', sep='\t')
title_principals_df = pd.read_csv('data/title.principals.tsv', sep='\t')
title_ratings_df = pd.read_csv('data/title.ratings.tsv', sep='\t')

"""
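# '\N' marks a missing value in the IMDb files, so map it to NaN while loading
title_basics_df = pd.read_csv('data/title.basics.tsv', sep='\t', na_values='\\N', low_memory=False)
title_ratings_df = pd.read_csv('data/title.ratings.tsv', sep='\t', na_values='\\N')
title_principals_df = pd.read_csv('data/title.principals.tsv', sep='\t', na_values='\\N')
name_df = pd.read_csv('data/name.basics.tsv', sep='\t', na_values='\\N')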
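
If you would rather not unzip the downloads first, pandas can usually read the `.tsv.gz` files directly, since it infers gzip compression from the file extension. The sketch below writes a tiny made-up gzipped file to a temp folder so it runs without the real data; the path and the `ratings_sample` name are hypothetical.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzipped TSV so the sketch runs without the real download.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "title.ratings.tsv.gz")  # hypothetical location
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write("tconst\taverageRating\tnumVotes\n")
    f.write("tt0000001\t5.7\t2000\n")
    f.write("tt0000002\t6.1\t150\n")

ratings_sample = pd.read_csv(
    gz_path,                # compression is inferred from the .gz extension
    sep="\t",
    na_values="\\N",
    usecols=["tconst", "averageRating"],  # skip columns you do not need
    dtype={"tconst": "string"},
)
print(ratings_sample)
```

Limiting `usecols` can help with memory on the larger files, such as `title.principals`.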



In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Filter data to include only movies
movies_df = title_basics_df[title_basics_df['titleType'] == 'movie'].copy()

# Merge movies data w/ ratings data
# use tconst as key (from markdown), merge left means keep all rows from movies_df
movies_with_ratings = pd.merge(movies_df, title_ratings_df, on='tconst', how='left')

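# Get actors for each movie (principals also covers TV episodes, shorts, etc., so keep only movie rows)
is_actor = title_principals_df['category'].isin(['actor', 'actress'])
is_movie = title_principals_df['tconst'].isin(movies_df['tconst'])
actors_df = title_principals_df[is_actor & is_movie]
actors_with_names = pd.merge(actors_df, name_df[['nconst', 'primaryName']], on='nconst', how='left')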
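
As a quick illustration of what the left merge above does, here is a toy sketch with made-up IDs. Rows on the left side that have no match are kept, and their right-side columns come back as NaN. The `indicator=True` flag is an optional extra that is not used in the cell above.

```python
import pandas as pd

# Toy stand-ins for movies_df and title_ratings_df (values are made up).
movies = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                       "primaryTitle": ["A", "B", "C"]})
ratings = pd.DataFrame({"tconst": ["tt1", "tt3"],
                        "averageRating": [7.5, 6.0]})

# how="left" keeps every movie; indicator=True adds a "_merge" column
# showing whether a row was matched ("both") or not ("left_only").
merged = pd.merge(movies, ratings, on="tconst", how="left", indicator=True)
print(merged)

# Movies without a rating row.
unrated = merged[merged["_merge"] == "left_only"]
print(unrated["tconst"].tolist())
```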


In [None]:
movies_with_ratings.head()

In [None]:
actors_with_names.head()

In [None]:

# Get the most popular actors (the ones with the most credits)
actor_counts = actors_with_names['nconst'].value_counts()
top_actors = actor_counts.head(500).index.tolist()  # Use the top 500 actors as features; there are too many actors to encode them all

# Create actor name lookup dictionary (will use in frontend)
actor_id_to_name = dict(zip(actors_with_names['nconst'], actors_with_names['primaryName']))

# Group actors by movie
movie_actors = actors_with_names[actors_with_names['nconst'].isin(top_actors)]
movie_actors = movie_actors.groupby('tconst')['nconst'].apply(list).reset_index()

# Merge actors with movies
final_movies_df = pd.merge(movies_with_ratings, movie_actors, on='tconst', how='left')
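
The count, filter, and group steps above can be sketched on a toy credits table. The IDs are made up and the cutoff is 1 instead of 500 so the effect is visible. Note that movies with no top actor drop out of the grouped table, which is why the later left merge leaves NaN in `nconst` for them.

```python
import pandas as pd

# Toy stand-in for actors_with_names (values are made up).
credits = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt3"],
    "nconst": ["nm1", "nm2", "nm1", "nm3"],
})

# Count credits per actor and keep the most frequent one(s).
counts = credits["nconst"].value_counts()
top = counts.head(1).index.tolist()

# Keep only credits by top actors, then collect actor IDs per movie.
kept = credits[credits["nconst"].isin(top)]
per_movie = kept.groupby("tconst")["nconst"].apply(list).reset_index()
print(per_movie)
```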

In [None]:
final_movies_df.head()

In [None]:

# Prepare final dataset
final_dataset = final_movies_df[[
    'tconst', # movie id 
    'primaryTitle', 
    'startYear', 
    'genres', 
    'nconst',  # List of actors
    'averageRating', 
    'numVotes'
]].copy()

# Handle missing values
final_dataset['averageRating'] = final_dataset['averageRating'].fillna(0)
final_dataset['numVotes'] = final_dataset['numVotes'].fillna(0)
final_dataset['nconst'] = final_dataset['nconst'].apply(lambda x: x if isinstance(x, list) else [])
final_dataset = final_dataset.rename(columns={'nconst': 'actor_ids'})


In [None]:
final_dataset.head()

In [None]:

# Parse and normalize features
final_dataset['startYear'] = pd.to_numeric(final_dataset['startYear'], errors='coerce')
final_dataset = final_dataset.dropna(subset=['startYear'])
final_dataset['startYear'] = final_dataset['startYear'].astype(int)

# Normalize year 
current_year = 2023
final_dataset['year_normalized'] = (final_dataset['startYear'] - 1900) / (current_year - 1900)

# Parse genres
final_dataset['genres'] = final_dataset['genres'].fillna('')
final_dataset['genres'] = final_dataset['genres'].apply(lambda x: x.split(',') if x else [])
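
Here is the same year and genre handling on a four-row toy frame (made-up values), to show what each step does to bad or missing input. The `reference_year` name is only used in this sketch; it plays the role of `current_year` above.

```python
import pandas as pd

# Toy frame with one unparseable year and two missing/empty genre fields.
df = pd.DataFrame({
    "startYear": ["1900", "1961", "\\N", "2023"],
    "genres": ["Drama,Romance", None, "Comedy", ""],
})

# Unparseable years become NaN and those rows are dropped.
df["startYear"] = pd.to_numeric(df["startYear"], errors="coerce")
df = df.dropna(subset=["startYear"]).copy()
df["startYear"] = df["startYear"].astype(int)

# Scale years so 1900 maps to 0.0 and the reference year maps to 1.0.
reference_year = 2023
df["year_normalized"] = (df["startYear"] - 1900) / (reference_year - 1900)

# Missing or empty genre strings become empty lists.
df["genres"] = df["genres"].fillna("").apply(lambda x: x.split(",") if x else [])
print(df)
```

Years after the reference year would map to values above 1.0, so the constant may need updating as new titles are added.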


In [None]:
final_dataset.head()

In [None]:

# Create feature matrices
# For genres
genre_mlb = MultiLabelBinarizer()
genre_features = genre_mlb.fit_transform(final_dataset['genres'])
genre_feature_names = genre_mlb.classes_

# For actors
actor_mlb = MultiLabelBinarizer(classes=top_actors)
actor_features = actor_mlb.fit_transform(final_dataset['actor_ids'])
actor_feature_names = actor_mlb.classes_
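
A small sketch of what the two `MultiLabelBinarizer` calls above produce, using made-up labels. Without `classes`, the columns follow the sorted label order; with `classes`, they follow the order you pass in, which is what keeps the actor columns aligned with `top_actors`.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# No classes given: columns are the sorted set of labels seen in the data.
genres = [["Drama", "Romance"], ["Comedy"], []]
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(genres)
print(mlb.classes_)  # column order
print(encoded)       # one row per movie, one 0/1 column per genre

# Classes given: columns follow the supplied order exactly.
fixed = MultiLabelBinarizer(classes=["nm2", "nm1"])
actor_encoded = fixed.fit_transform([["nm1"], [], ["nm1", "nm2"]])
print(fixed.classes_)
print(actor_encoded)
```

Movies with an empty list simply get an all-zero row.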


In [None]:
# Create metadata for the frontend whenever NN is made 
# [DEPRECATED, see `create_NN.ipynb`]
movie_id_to_index = {movie_id: i for i, movie_id in enumerate(final_dataset['tconst'])}
movie_index_to_id = {str(i): movie_id for i, movie_id in enumerate(final_dataset['tconst'])}
movie_id_to_title = {movie_id: title for movie_id, title in zip(final_dataset['tconst'], final_dataset['primaryTitle'])}

metadata = {
    'movie_id_to_index': movie_id_to_index,
    'movie_index_to_id': movie_index_to_id,
    'movie_id_to_title': movie_id_to_title,
    'genre_names': genre_feature_names.tolist(),
    'actor_ids': actor_feature_names.tolist(),
    'actor_names': {actor_id: actor_id_to_name.get(actor_id, '') for actor_id in actor_feature_names},
    #'feature_count': int(X.shape[1]),
    'year_index': 0,
    'genre_start_index': 1,
    'genre_count': len(genre_feature_names),
    'actor_start_index': 1 + len(genre_feature_names)
}

import json
with open('recommendation_metadata.json', 'w') as f:
    json.dump(metadata, f, default=str)

# Also save final dataset for reference
final_dataset[['tconst', 'primaryTitle', 'startYear', 'genres', 'actor_ids', 'averageRating', 'numVotes']].to_csv('processed_movies.csv', index=False)
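
The index fields in the metadata (`year_index`, `genre_start_index`, `actor_start_index`) suggest a feature matrix laid out as one year column, then the genre columns, then the actor columns. This notebook does not build that matrix (the `feature_count` line is commented out), so the sketch below is only an assumption about the layout, using tiny made-up blocks.

```python
import numpy as np

# Made-up feature blocks for two movies.
year = np.array([[0.0], [0.5]])                   # normalized year
genre_block = np.array([[1, 0, 1], [0, 1, 0]])    # 3 genre columns
actor_block = np.array([[0, 1], [1, 0]])          # 2 actor columns

# Assumed layout: [year | genres | actors], concatenated column-wise.
X = np.hstack([year, genre_block, actor_block])

# Same index bookkeeping as in the metadata dict above.
layout = {
    "year_index": 0,
    "genre_start_index": 1,
    "genre_count": genre_block.shape[1],
    "actor_start_index": 1 + genre_block.shape[1],
}

# Recover each block from X using only the layout.
g0 = layout["genre_start_index"]
genre_slice = X[:, g0:g0 + layout["genre_count"]]
actor_slice = X[:, layout["actor_start_index"]:]
print(X.shape, genre_slice.shape, actor_slice.shape)
```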