# **IMDB Recommender Chatbot Project**
## **IMDB Movie Recommendation System Using RAG**

### Project Overview
This project demonstrates my ability to build a custom recommendation system using Retrieval-Augmented Generation (RAG) with IMDB datasets. The system recommends movies based on user preferences or trending content.

The dataset includes:
- Movies and TV shows
- Produced in the US
- Released from 2014 onward
- 100 additional titles from [The 100 Best Movies Of All Time](https://www.empireonline.com/movies/features/best-movies-2/)

### Objectives
1. Showcase proficiency in data cleaning, merging, and preprocessing.
2. Implement a RAG pipeline for movie recommendations.
3. Present results interactively and narratively.

## **PART 2 - Preparing the Dataset**

This notebook focuses on cleaning and enriching the movie_df dataset by leveraging the IMDb non-commercial datasets. The goal is to ensure data completeness and consistency across all key attributes while preserving data integrity.

**Objectives**  

1. Handling missing values:
   - Identify and fill in missing values across all columns.
   - Use external IMDb datasets to replace missing values where applicable.

2. Standardizing movie titles:
   - If a movie’s title is not in English, replace it with its English version from IMDb’s alternative titles dataset.

By the end of this notebook, movie_df will be fully cleaned and standardized, making it ready for further analysis and modeling.
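
The title standardization objective can be sketched on toy data as follows. This is only an illustration under stated assumptions: the frames `toy_akas` and `toy_movies` are made up, the column names `titleId`, `title`, and `region` follow the layout of IMDb's `title.akas.tsv`, and treating a US-region alternative title as "the English version" is a simplifying assumption rather than the rule used in this notebook.

```python
import pandas as pd

# Toy stand-in for title.akas.tsv (rows are made up for illustration).
toy_akas = pd.DataFrame({
    "titleId": ["tt001", "tt001", "tt002"],
    "title": ["La vida", "The Life", "Der Weg"],
    "region": ["ES", "US", "DE"],
})
toy_movies = pd.DataFrame({
    "tconst": ["tt001", "tt002"],
    "title": ["La vida", "Der Weg"],
})

# Assumption: a US-region alternative title is treated as the English title.
english_titles = (
    toy_akas[toy_akas["region"] == "US"]
    .drop_duplicates(subset="titleId")
    .set_index("titleId")["title"]
)

# Use the English title where one exists, otherwise keep the current title.
mapped = toy_movies["tconst"].map(english_titles)
toy_movies["title"] = mapped.fillna(toy_movies["title"])
print(toy_movies)
```

Titles without a matching alternative title are left unchanged, so the step never introduces new missing values.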

### **1. Import libraries and load the movie dataset prepared in Step 1**

In [4]:
import pandas as pd
import numpy as np
import os
import re
import pickle
import openai
from sklearn.metrics.pairwise import cosine_similarity

# # Import the OpenAI API key
# from config_loader import load_config_value
# openai_api_key = load_config_value("OPENAI_API_KEY")
# omdb_key = load_config_value("OMDB_API_KEY")

In [5]:
# Local paths
current_path = os.getcwd()
parent_path = os.path.dirname(current_path)

In [6]:
# Load previously fetched movie data
pickle_path = parent_path + "/Data/movie_data.pkl"

with open(pickle_path, "rb") as f:
    movie_data_list = pickle.load(f)

# Convert to DataFrame
movie_df = pd.DataFrame(movie_data_list)

In [7]:
movie_df.head()

| | imdb_id | title | year | genre | director | actors | plot | country | awards | poster | rating | metascore | votes | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt4076360 | Third Person | 2015 | Documentary, Drama | Sharon Luzon | | Suzan, an Arab- Israeli woman, must reshape he... | Israel | | https://m.media-amazon.com/images/M/MV5BOTI5OT... | | | | movie |
| 1 | tt3608918 | Azzurrina | 2023 | Horror | Giacomo Franciosa | Tatiana Luter, Paolo Stella, Gianfranco Terrin | The tale of Guendalina was passed down verball... | Italy | | https://m.media-amazon.com/images/M/MV5BOWQ5MG... | | | | movie |
| 2 | tt15908496 | Dear Jackie | 2021 | Documentary | Henri Pardo | Ronald Jones, Ivan Livingstone, Majiza Philips | Dear Jackie paints a picture of the Black comm... | Canada | | https://m.media-amazon.com/images/M/MV5BNDIwYz... | | | | movie |
| 3 | tt28378602 | Zena s gumenim rukavicama | 2023 | Drama | Mario Sulina | Areta Curkovic, Sandra Loncaric, Miro Cabraja | The film was inspired by the great strike of e... | Croatia | | https://m.media-amazon.com/images/M/MV5BODE4NG... | | | | movie |
| 4 | tt6340460 | Los caminos de Cuba | 2019 | Documentary | Luciano Nacci | | | Argentina, Cuba, Colombia | | https://m.media-amazon.com/images/M/MV5BMzhjOT... | | | | movie |


### **2. Imputing missing values**

In [9]:
def check_missings(df):
    """Print, for each column, how many cells hold the 'N/A' placeholder."""
    print('Missing values for each field:\n')

    for column in df.columns:
        # Count on the DataFrame passed in, not on the global movie_df
        miss = int((df[column] == 'N/A').sum())
        print(" {} = {}".format(column, miss))

In [10]:
check_missings(movie_df)

Missing values for each field:

 imdb_id = 0
 title = 0
 year = 0
 genre = 1082
 director = 830
 actors = 5816
 plot = 4535
 country = 830
 awards = 24983
 poster = 2533
 rating = 17167
 metascore = 17167
 votes = 36738
 type = 0
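
The same per-column count can also be computed without an explicit loop. The snippet below is a minimal sketch on made-up data; `count_placeholder_missing` and `toy_df` are hypothetical names used only for illustration.

```python
import pandas as pd

def count_placeholder_missing(df, placeholder="N/A"):
    """Count the cells equal to the placeholder string in each column."""
    return df.eq(placeholder).sum()

# Toy data with the same 'N/A' placeholder convention (values are made up).
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Drama", "N/A", "N/A"],
    "plot": ["N/A", "text", "text"],
})

missing_counts = count_placeholder_missing(toy_df)
print(missing_counts)
```

The result is a Series indexed by column name, which is convenient for sorting or plotting the counts.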


#### **2.1 Approach**

- genre --> replace 'N/A' values using the `title.basics.tsv` IMDb dataset
- director --> replace 'N/A' values using the `title.crew.tsv` and `name.basics.tsv` IMDb datasets
- actors --> replace 'N/A' values using the `title.principals.tsv` and `name.basics.tsv` IMDb datasets
- plot --> delete rows where plot = 'N/A'
- country --> delete rows where country = 'N/A'
- awards --> replace 'N/A' values with the sentence '0 awards and 0 nominees'
- rating & votes --> replace 'N/A' values using the `title.ratings.tsv` IMDb dataset
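
The lookup-based steps in this list share one pattern: build a per-title lookup from the IMDb files, merge it in, and fill only the gaps. Below is a hedged sketch for the director case on toy data. The frames `toy_crew`, `toy_names`, and `toy_movies` and the helper column `director_lookup` are made up for illustration; the only thing taken from the IMDb format is that `directors` in `title.crew.tsv` is a comma-separated string of `nconst` ids.

```python
import pandas as pd

# Toy stand-ins for title.crew.tsv and name.basics.tsv (values are made up).
toy_crew = pd.DataFrame({
    "tconst": ["tt001", "tt002"],
    "directors": ["nm1,nm2", "nm3"],
})
toy_names = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["Ann Lee", "Bo Kim", "Cy Roe"],
})
toy_movies = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "director": [pd.NA, "Already Known", pd.NA],
})

# One row per (title, director id): split the comma-separated ids, then explode.
crew_long = toy_crew.assign(
    directors=toy_crew["directors"].str.split(",")
).explode("directors")

# Attach names and collapse back to one comma-separated string per title.
crew_long = crew_long.merge(
    toy_names, left_on="directors", right_on="nconst", how="left"
)
director_lookup = (
    crew_long.dropna(subset=["primaryName"])
    .groupby("tconst")["primaryName"]
    .agg(", ".join)
    .rename("director_lookup")
    .reset_index()
)

# Fill only the gaps; values that are already present are kept.
filled = toy_movies.merge(director_lookup, on="tconst", how="left")
filled["director"] = filled["director"].fillna(filled["director_lookup"])
filled = filled.drop(columns=["director_lookup"])
print(filled)
```

Titles with no match in the lookup (here `tt003`) stay missing, so they can still be handled by a later step.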


In [12]:
## Load the IMDB datasets
imdb_dataset_path = os.path.dirname(parent_path) + '/IMDB_Dataset/'

# Load IMDb datasets
title_basics = pd.read_csv(imdb_dataset_path + 'title.basics.tsv', sep='\t', dtype=str, na_values=['\\N'])
title_crew = pd.read_csv(imdb_dataset_path + 'title.crew.tsv', sep='\t', dtype=str, na_values=['\\N'])
name_basics = pd.read_csv(imdb_dataset_path + 'name.basics.tsv', sep='\t', dtype=str, na_values=['\\N'])
title_principals = pd.read_csv(imdb_dataset_path + 'title.principals.tsv', sep='\t', dtype=str, na_values=['\\N'])
title_ratings = pd.read_csv(imdb_dataset_path + 'title.ratings.tsv', sep='\t', dtype={'tconst': str, 'averageRating': float, 'numVotes': int}, na_values=['\\N'])


In [13]:
## Rename 'imdb_id' field to align with the IMDB datasets id field
movie_df = movie_df.rename(columns={'imdb_id': 'tconst'})

In [14]:
# Convert 'N/A' values to NaN for easier handling
movie_df.replace('N/A', pd.NA, inplace=True)

# 1. Impute 'genre' using title.basics.tsv
movie_df = movie_df.merge(title_basics[['tconst', 'genres']], on='tconst', how='left')
movie_df['genre'] = movie_df['genre'].fillna(movie_df['genres'])
movie_df.drop(columns=['genres'], inplace=True)

# 2. Impute 'director' using title.crew.tsv and name.basics.tsv
# 'directors' is a comma-separated string of nconst ids, so split it into lists before exploding
title_crew['directors'] = title_crew['directors'].str.split(',')
title_crew = title_crew.explode('directors')  # One row per (title, director)
title_crew = title_crew.merge(name_basics[['nconst', 'primaryName']], left_on='directors', right_on='nconst', how='left')
director_mapping = title_crew.groupby('tconst')['primaryName'].apply(lambda x: ', '.join(x.dropna())).reset_index()
movie_df = movie_df.merge(director_mapping, on='tconst', how='left')
movie_df['director'] = movie_df['director'].fillna(movie_df['primaryName'])
movie_df.drop(columns=['primaryName'], inplace=True)

# 3. Impute 'actors' using title.principals.tsv and name.basics.tsv
title_principals = title_principals[title_principals['category'].isin(['actor', 'actress'])]
title_principals = title_principals.merge(name_basics[['nconst', 'primaryName']], on='nconst', how='left')
actors_mapping = title_principals.groupby('tconst')['primaryName'].apply(lambda x: ', '.join(x.dropna())).reset_index()
movie_df = movie_df.merge(actors_mapping, on='tconst', how='left')
movie_df['actors'] = movie_df['actors'].fillna(movie_df['primaryName'])
movie_df.drop(columns=['primaryName'], inplace=True)

# 4. Remove rows where 'plot' is missing
movie_df.dropna(subset=['plot'], inplace=True)

# 5. Remove rows where 'country' is missing
movie_df.dropna(subset=['country'], inplace=True)

# 6. Replace missing 'awards' with default text
movie_df['awards'] = movie_df['awards'].fillna('0 awards and 0 nominees')

# 7. Impute 'rating' and 'votes' using title.ratings.tsv
movie_df = movie_df.merge(title_ratings[['tconst', 'averageRating', 'numVotes']], on='tconst', how='left')
movie_df['rating'] = movie_df['rating'].fillna(movie_df['averageRating'])
movie_df['votes'] = movie_df['votes'].fillna(movie_df['numVotes'])
movie_df.drop(columns=['averageRating', 'numVotes'], inplace=True)
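
Selecting a column and calling `fillna(..., inplace=True)` on it operates on an intermediate object, which is why pandas warns about that pattern. A minimal sketch of the assignment form on toy data (the values are made up; the column names mirror the rating step):

```python
import pandas as pd

# Toy frame: 'rating' has gaps, 'averageRating' is the fallback source.
toy_df = pd.DataFrame({
    "rating": [7.1, None, None],
    "averageRating": [6.0, 8.2, None],
})

# Assign the result back instead of calling fillna(..., inplace=True)
# on the selected column.
toy_df["rating"] = toy_df["rating"].fillna(toy_df["averageRating"])
print(toy_df)
```

Existing ratings are kept, gaps take the fallback value, and rows where both are missing stay missing.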

In [15]:
movie_df.isnull().sum()

tconst           0
title            0
year             0
genre          373
director         0
actors        4584
plot             0
country          0
awards           0
poster        1301
rating           0
metascore    13229
votes            0
type             0
dtype: int64

#### **2.2 Additional Steps for imputing missing values**

1. Drop rows where 'genre' or 'actors' is null.
2. Drop the 'metascore' column.


In [17]:
# 8. Remove rows where 'genre' or 'actors' are missing
movie_df.dropna(subset=['genre', 'actors'], inplace=True)

# 9. Drop the 'metascore' column
movie_df.drop(columns=['metascore'], inplace=True)

In [18]:
movie_df.to_csv(parent_path + "/Data/movies_dataset_final.csv")

In [19]:
movie_df.isnull().sum()

tconst        0
title         0
year          0
genre         0
director      0
actors        0
plot          0
country       0
awards        0
poster      942
rating        0
votes         0
type          0
dtype: int64