# Exploring the Effects of Including Movie Posters as a Feature on Content-based Filtering Movie Recommendation Systems

## Movie Genre from its Poster

Dataset from: https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster/data

The movie posters were obtained from the IMDB website. The dataset contains the IMDB ID, IMDB link, title, IMDB score, genre, and a link to download each movie poster. Each poster has at least one and at most three genre labels. Because the dataset also includes the IMDB score, it would be interesting to see whether a movie's poster is related to its rating.

In [28]:
!pip install pillow matplotlib



In [174]:
!pip install nltk

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting click (from nltk)
  Using cached click-8.1.7-py3-none-any.whl (97 kB)
Installing collected packages: click, nltk
Successfully installed click-8.1.7 nltk-3.8.1


In [175]:
# Import libraries
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cosine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

# Show floats with two decimals and print NumPy arrays in full
pd.options.display.float_format = '{:,.2f}'.format
np.set_printoptions(threshold=np.inf)

## Movie Genre Dataset

In [167]:
# Read the file manually, replacing any bytes that cannot be decoded as UTF-8
with open('data/MovieGenre.csv', 'r', encoding='utf-8', errors='replace') as f:
    content = f.read()

# Write the content to a temporary file and then read using pandas
with open('data/MovieGenre_temp.csv', 'w', encoding='utf-8') as f:
    f.write(content)

df = pd.read_csv('data/MovieGenre_temp.csv')
df

Unnamed: 0,imdbIdimdbId,Imdb Link,Title,IMDB Score,Genre,Poster
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.30,Animation|Adventure|Comedy,https://images-na.ssl-images-amazon.com/images...
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.90,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
2,113228,http://www.imdb.com/title/tt113228,Grumpier Old Men (1995),6.60,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,114885,http://www.imdb.com/title/tt114885,Waiting to Exhale (1995),5.70,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
4,113041,http://www.imdb.com/title/tt113041,Father of the Bride Part II (1995),5.90,Comedy|Family|Romance,https://images-na.ssl-images-amazon.com/images...
...,...,...,...,...,...,...
40103,83168,http://www.imdb.com/title/tt83168,Tanya's Island (1980),4.30,Drama,https://images-na.ssl-images-amazon.com/images...
40104,82875,http://www.imdb.com/title/tt82875,Pacific Banana (1981),4.70,Comedy,https://images-na.ssl-images-amazon.com/images...
40105,815258,http://www.imdb.com/title/tt815258,Werewolf in a Womens Prison (2006),4.50,Horror,https://images-na.ssl-images-amazon.com/images...
40106,79142,http://www.imdb.com/title/tt79142,Xiao zi ming da (1979),6.50,Action|Comedy,https://images-na.ssl-images-amazon.com/images...
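The temporary-file round trip above exists only to survive bytes that are not valid UTF-8. Below is a minimal standard-library sketch of what `errors='replace'` does: each undecodable byte becomes the replacement character U+FFFD instead of raising an error. The sample bytes and the file name `sample.csv` are made up for illustration and are not taken from the real dataset.

```python
import csv
import os
import tempfile

# Hypothetical sample: a title containing byte 0xF4, which is not valid UTF-8 here.
raw = b"imdbId,Title\n24676,T\xf4ky\xf4 no onna (1933)\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # errors='replace' swaps each undecodable byte for U+FFFD instead of raising.
    with open(path, "r", encoding="utf-8", errors="replace", newline="") as f:
        rows = list(csv.DictReader(f))

print(rows[0]["imdbId"], repr(rows[0]["Title"]))
```

Replaced characters are lost for good, so titles that contain them may later fail to match the same title from another source. Recent pandas versions (1.3 and later) also accept `encoding_errors='replace'` in `pd.read_csv`, which may make the temporary file unnecessary.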


## Poster Dataset

In [42]:
folder_path = 'data/SampleMoviePosters/SampleMoviePosters'
image_files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
df['imdbIdimdbId'] = df['imdbIdimdbId'].astype(str)

## Filter the dataset to the movies that have a poster file

In [43]:
poster_files = [f.split('.')[0] for f in os.listdir(folder_path) if f.endswith('.jpg')]

poster_files_set = set(poster_files)

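# Keep only the rows that have a poster file; .copy() creates an independent DataFrame so that later column assignments do not raise SettingWithCopyWarning
df_filtered = df[df['imdbIdimdbId'].isin(poster_files_set)].copy()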

In [44]:
# Generate image path
def get_poster_path(imdb_id):
    poster_path = os.path.join(folder_path, f"{imdb_id}.jpg")
    if os.path.exists(poster_path):
        return poster_path
    else:
        return None
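The filter above matches rows to posters by file stem. A small sketch of that step, using `os.path.splitext` rather than `split('.')[0]`, so that only the final extension is removed and the extension check is case-insensitive. The function name `poster_ids` and the directory listing are hypothetical.

```python
import os

def poster_ids(filenames):
    """Return the set of file stems for .jpg files (extension matched case-insensitively)."""
    ids = set()
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".jpg":
            ids.add(stem)
    return ids

# Hypothetical directory listing, not taken from the real poster folder.
listing = ["24252.jpg", "25316.JPG", "notes.txt", "17350.copy.jpg"]
ids = poster_ids(listing)
print(sorted(ids))
```

With `split('.')[0]`, the last entry would be reduced to `17350` and would silently match a movie ID. With `splitext` it keeps the stem `17350.copy` and matches nothing.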

In [45]:
# Add a new column to store the image path
df_filtered['Poster Path'] = df_filtered['imdbIdimdbId'].apply(get_poster_path)

df_filtered


Unnamed: 0,imdbIdimdbId,Imdb Link,Title,IMDB Score,Genre,Poster,Poster Path
877,24252,http://www.imdb.com/title/tt24252,Liebelei (1933),7.70,Drama|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\242...
888,25316,http://www.imdb.com/title/tt25316,It Happened One Night (1934),8.20,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\253...
890,25164,http://www.imdb.com/title/tt25164,The Gay Divorcee (1934),7.60,Comedy|Musical|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\251...
940,17350,http://www.imdb.com/title/tt17350,The Scarlet Letter (1926),7.80,Drama,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\173...
942,25586,http://www.imdb.com/title/tt25586,Of Human Bondage (1934),7.30,Drama|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\255...
...,...,...,...,...,...,...,...
39884,23797,http://www.imdb.com/title/tt23797,Betty Boop's Big Boss (1933),6.60,Animation|Short|Comedy,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\237...
39890,22670,http://www.imdb.com/title/tt22670,Betty Boop's Museum (1932),6.90,Animation|Short|Comedy,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\226...
40044,19640,http://www.imdb.com/title/tt19640,Angora Love (1929),7.00,Comedy|Short,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\196...
40057,24676,http://www.imdb.com/title/tt24676,T̫ky̫ no onna (1933),7.20,Drama,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\246...


## Numerical Features 

In [46]:
df_filtered['Movie Name'] = df_filtered['Title'].str.extract(r'^(.*)\s\((\d{4})\)$')[0]
df_filtered['Year'] = df_filtered['Title'].str.extract(r'^(.*)\s\((\d{4})\)$')[1]
df_filtered


Unnamed: 0,imdbIdimdbId,Imdb Link,Title,IMDB Score,Genre,Poster,Poster Path,Movie Name,Year
877,24252,http://www.imdb.com/title/tt24252,Liebelei (1933),7.70,Drama|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\242...,Liebelei,1933
888,25316,http://www.imdb.com/title/tt25316,It Happened One Night (1934),8.20,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\253...,It Happened One Night,1934
890,25164,http://www.imdb.com/title/tt25164,The Gay Divorcee (1934),7.60,Comedy|Musical|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\251...,The Gay Divorcee,1934
940,17350,http://www.imdb.com/title/tt17350,The Scarlet Letter (1926),7.80,Drama,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\173...,The Scarlet Letter,1926
942,25586,http://www.imdb.com/title/tt25586,Of Human Bondage (1934),7.30,Drama|Romance,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\255...,Of Human Bondage,1934
...,...,...,...,...,...,...,...,...,...
39884,23797,http://www.imdb.com/title/tt23797,Betty Boop's Big Boss (1933),6.60,Animation|Short|Comedy,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\237...,Betty Boop's Big Boss,1933
39890,22670,http://www.imdb.com/title/tt22670,Betty Boop's Museum (1932),6.90,Animation|Short|Comedy,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\226...,Betty Boop's Museum,1932
40044,19640,http://www.imdb.com/title/tt19640,Angora Love (1929),7.00,Comedy|Short,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\196...,Angora Love,1929
40057,24676,http://www.imdb.com/title/tt24676,T̫ky̫ no onna (1933),7.20,Drama,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\246...,T̫ky̫ no onna,1933
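The regular expression used above splits a title such as `It Happened One Night (1934)` into a name and a four-digit year. The sketch below applies the same pattern with the standard `re` module to show what it does with titles that do not end in a year. The helper name `split_title` and the non-matching example title are hypothetical. Unlike `str.extract`, which returns the year as a string, this sketch converts it to an integer.

```python
import re

# Same pattern as in the notebook: greedy name, whitespace, then a year in parentheses at the end.
TITLE_RE = re.compile(r"^(.*)\s\((\d{4})\)$")

def split_title(title):
    """Return (name, year) or (None, None) when the title has no trailing '(YYYY)'."""
    match = TITLE_RE.match(title)
    if match is None:
        return None, None
    return match.group(1), int(match.group(2))

print(split_title("It Happened One Night (1934)"))
print(split_title("Betty Boop's Big Boss (1933)"))
print(split_title("Untitled Project"))  # hypothetical title without a year
```

Titles that do not match produce missing values for both columns. This matters later because `dropna()` would remove those rows from the numerical features.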


## 'IMDB Score' and 'Year' as features

In [47]:
features = ["IMDB Score", "Year"]
subset_features = df_filtered[features].dropna()
subset_features

Unnamed: 0,IMDB Score,Year
877,7.70,1933
888,8.20,1934
890,7.60,1934
940,7.80,1926
942,7.30,1934
...,...,...
39884,6.60,1933
39890,6.90,1932
40044,7.00,1929
40057,7.20,1933


In [48]:
scaled_features = StandardScaler().fit_transform(subset_features)
numbers_df = pd.DataFrame(scaled_features, columns=features)
numbers_df

Unnamed: 0,IMDB Score,Year
0,1.04,0.94
1,1.74,1.12
2,0.90,1.12
3,1.18,-0.27
4,0.47,1.12
...,...,...
1013,-0.52,0.94
1014,-0.09,0.77
1015,0.05,0.25
1016,0.33,0.94
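`StandardScaler` replaces each value with its z-score: it subtracts the column mean and divides by the population standard deviation. Below is a dependency-free sketch of that calculation, applied to the first five IMDB scores shown above. Because it uses five rows rather than all 1,018, the results will not match the table. The function name `standardize` is made up for illustration.

```python
import statistics

def standardize(values):
    """Z-score each value using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

scores = [7.7, 8.2, 7.6, 7.8, 7.3]  # first five IMDB scores from the table above
z = standardize(scores)
print([round(v, 2) for v in z])
```

After scaling, the column has mean 0 and standard deviation 1, so the score and the year contribute on a comparable scale when similarities are computed.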


In [49]:
# Split the pipe-separated genre string into a list of genres
df_filtered['Genre'] = df_filtered['Genre'].str.split('|')
# Explode the DataFrame so that each row holds a single genre
df_exploded = df_filtered.explode('Genre')

# One-hot encode the genre column
df_encoded = pd.get_dummies(df_exploded, columns=['Genre'])

# Group by the original movie title and sum the genre indicators, keeping the IMDB Score
genre_cols = [col for col in df_encoded.columns if col.startswith('Genre_')]
one_hot_movies = df_encoded.groupby('Title').agg({**{col: 'sum' for col in genre_cols}, 'IMDB Score': 'first'}).reset_index()

# Merge the one-hot encoded DataFrame back onto the original rows (a right join keeps the original row order)
one_hot_movies = one_hot_movies.merge(df_filtered[['Title', 'Movie Name', 'Year', 'imdbIdimdbId', 'Imdb Link', 'Poster', 'Poster Path']], on='Title', how='right')

# Drop the raw Year and IMDB Score; their standardized versions are appended below
one_hot_movies = one_hot_movies.drop(columns=['Year', 'IMDB Score'])

# Append the standardized numerical features and index by title
merged = pd.concat([one_hot_movies, numbers_df], axis=1).set_index('Title')
merged


Title,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_Thriller,Genre_War,Genre_Western,Movie Name,imdbIdimdbId,Imdb Link,Poster,Poster Path,IMDB Score,Year
Liebelei (1933),0,0,0,0,0,0,0,1,0,0,...,0,0,0,Liebelei,24252,http://www.imdb.com/title/tt24252,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\242...,1.04,0.94
It Happened One Night (1934),0,0,0,0,1,0,0,0,0,0,...,0,0,0,It Happened One Night,25316,http://www.imdb.com/title/tt25316,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\253...,1.74,1.12
The Gay Divorcee (1934),0,0,0,0,1,0,0,0,0,0,...,0,0,0,The Gay Divorcee,25164,http://www.imdb.com/title/tt25164,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\251...,0.90,1.12
The Scarlet Letter (1926),0,0,0,0,0,0,0,1,0,0,...,0,0,0,The Scarlet Letter,17350,http://www.imdb.com/title/tt17350,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\173...,1.18,-0.27
Of Human Bondage (1934),0,0,0,0,0,0,0,1,0,0,...,0,0,0,Of Human Bondage,25586,http://www.imdb.com/title/tt25586,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\255...,0.47,1.12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss (1933),0,0,1,0,1,0,0,0,0,0,...,0,0,0,Betty Boop's Big Boss,23797,http://www.imdb.com/title/tt23797,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\237...,-0.52,0.94
Betty Boop's Museum (1932),0,0,1,0,1,0,0,0,0,0,...,0,0,0,Betty Boop's Museum,22670,http://www.imdb.com/title/tt22670,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\226...,-0.09,0.77
Angora Love (1929),0,0,0,0,1,0,0,0,0,0,...,0,0,0,Angora Love,19640,http://www.imdb.com/title/tt19640,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\196...,0.05,0.25
T̫ky̫ no onna (1933),0,0,0,0,0,0,0,1,0,0,...,0,0,0,T̫ky̫ no onna,24676,http://www.imdb.com/title/tt24676,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\246...,0.33,0.94


In [12]:
genres_features = [col for col in one_hot_movies.columns if "Genre" in col]
genres_features

['Genre_Action',
 'Genre_Adventure',
 'Genre_Animation',
 'Genre_Biography',
 'Genre_Comedy',
 'Genre_Crime',
 'Genre_Documentary',
 'Genre_Drama',
 'Genre_Family',
 'Genre_Fantasy',
 'Genre_Film-Noir',
 'Genre_History',
 'Genre_Horror',
 'Genre_Music',
 'Genre_Musical',
 'Genre_Mystery',
 'Genre_Romance',
 'Genre_Sci-Fi',
 'Genre_Short',
 'Genre_Sport',
 'Genre_Thriller',
 'Genre_War',
 'Genre_Western']
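The genre columns listed above form a multi-hot encoding: one column per genre, set to 1 when the movie carries that label. A dependency-free sketch of the same idea, using four genre strings from the table earlier in this notebook. The function name `multi_hot` is hypothetical.

```python
def multi_hot(genre_strings, sep="|"):
    """Turn strings such as 'Drama|Romance' into a sorted vocabulary and 0/1 rows."""
    split = [g.split(sep) for g in genre_strings]
    vocab = sorted({genre for genres in split for genre in genres})
    rows = [[1 if genre in genres else 0 for genre in vocab] for genres in split]
    return vocab, rows

vocab, rows = multi_hot([
    "Drama|Romance",           # Liebelei
    "Comedy|Romance",          # It Happened One Night
    "Comedy|Musical|Romance",  # The Gay Divorcee
    "Drama",                   # The Scarlet Letter
])
print(vocab)
for row in rows:
    print(row)
```

One difference from the notebook is worth noting: it sums the indicator columns per `Title`, so if two rows shared the same title, their genre counts would be added together rather than staying 0 or 1.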

## ViT-GPT2

### Using the ViT-GPT2 model to describe movie poster content

Code from: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

In [50]:
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")



In [59]:
# Generate a caption for a single image
def generate_caption(image_path):
    try:
        image = Image.open(image_path)
        # The model expects three-channel input, so convert grayscale or palette images to RGB
        if image.mode != "RGB":
            image = image.convert(mode="RGB")
        pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
        # Beam search with 4 beams; captions are limited to 16 tokens
        output_ids = model.generate(pixel_values, max_length=16, num_beams=4, return_dict_in_generate=True).sequences
        caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return caption
    except Exception as e:
        print(f"Error processing image {image_path}: {e}")
        return None

In [61]:
# Caption every poster listed in the 'Poster Path' column of 'merged'
merged['Poster Content'] = None

# Process images and show progress
total_images = len(merged)
processed_images = 0

for i, row in merged.iterrows():
    image_path = row['Poster Path']
    if image_path and os.path.exists(image_path):
        caption = generate_caption(image_path)
        merged.at[i, 'Poster Content'] = caption
    processed_images += 1
    print(f"Processed {processed_images}/{total_images} images")

Processed 1/1018 images
Processed 2/1018 images
Processed 3/1018 images
...
Processed
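Captioning 1,018 posters is slow, and the loop above starts from scratch on every run. One possible way to make it resumable is to cache captions in a JSON file keyed by image path. This is a sketch under stated assumptions, not part of the original notebook. The names `caption_with_cache`, `fake_caption`, and `captions.json` are hypothetical, and `fake_caption` stands in for `generate_caption` so that the example runs without the model.

```python
import json
import os
import tempfile

def caption_with_cache(image_paths, caption_fn, cache_path):
    """Caption each path once, saving results to a JSON file after every new caption."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)
    for path in image_paths:
        if path not in cache:
            cache[path] = caption_fn(path)
            # Save immediately so an interrupted run keeps its progress.
            with open(cache_path, "w", encoding="utf-8") as f:
                json.dump(cache, f)
    return cache

calls = []

def fake_caption(path):
    # Stand-in for generate_caption; records which paths were actually processed.
    calls.append(path)
    return f"caption for {os.path.basename(path)}"

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "captions.json")
    first = caption_with_cache(["a.jpg", "b.jpg"], fake_caption, cache_file)
    second = caption_with_cache(["a.jpg", "b.jpg", "c.jpg"], fake_caption, cache_file)

print(calls)
```

On the second call only `c.jpg` reaches the caption function, because the other two are read back from the cache file.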

In [81]:
# Index the DataFrame by movie name so that a film can be looked up by name, then display it
merged.index = merged['Movie Name'].values
merged


Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_War,Genre_Western,Movie Name,imdbIdimdbId,Imdb Link,Poster,Poster Path,IMDB Score,Year,Poster Content
Liebelei,0,0,0,0,0,0,0,1,0,0,...,0,0,Liebelei,24252,http://www.imdb.com/title/tt24252,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\242...,1.04,0.94,a drawing of a cat on a white sheet
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,0,0,It Happened One Night,25316,http://www.imdb.com/title/tt25316,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\253...,1.74,1.12,a painting of a man and a woman
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,0,0,The Gay Divorcee,25164,http://www.imdb.com/title/tt25164,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\251...,0.90,1.12,a collage of photos of a man on a surfboard
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0,0,The Scarlet Letter,17350,http://www.imdb.com/title/tt17350,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\173...,1.18,-0.27,a painting of a person holding a gun
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,0,0,Of Human Bondage,25586,http://www.imdb.com/title/tt25586,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\255...,0.47,1.12,a painting of a woman holding a book
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0,0,Betty Boop's Big Boss,23797,http://www.imdb.com/title/tt23797,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\237...,-0.52,0.94,a doll of a woman sitting on a chair
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0,0,Betty Boop's Museum,22670,http://www.imdb.com/title/tt22670,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\226...,-0.09,0.77,a painting of a woman holding a bottle of wine
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0,0,Angora Love,19640,http://www.imdb.com/title/tt19640,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\196...,0.05,0.25,a man in a suit and tie
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0,0,T̫ky̫ no onna,24676,http://www.imdb.com/title/tt24676,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\246...,0.33,0.94,a woman standing in a kitchen next to a table


In [168]:
merged.to_csv('data/final_movies_dataset.csv', index=False)

In [169]:
film = "It Happened One Night"

In [170]:
merged.loc[film]

Genre_Action                                                         0
Genre_Adventure                                                      0
Genre_Animation                                                      0
Genre_Biography                                                      0
Genre_Comedy                                                         1
Genre_Crime                                                          0
Genre_Documentary                                                    0
Genre_Drama                                                          0
Genre_Family                                                         0
Genre_Fantasy                                                        0
Genre_Film-Noir                                                      0
Genre_History                                                        0
Genre_Horror                                                         0
Genre_Music                                                          0
Genre_

## Recommendations Based on Genre, Year and IMDB Score

In [122]:
merged_copy = merged.copy()


In [123]:
merged_copy = merged_copy.drop(columns=['imdbIdimdbId', 'Imdb Link', 'Poster', 'Poster Path', 'Movie Name', 'Poster Content'])
merged_copy

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Short,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western,IMDB Score,Year
Liebelei,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,1.04,0.94
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1.74,1.12
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0.90,1.12
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1.18,-0.27
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0.47,1.12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,-0.52,0.94
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,-0.09,0.77
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0.05,0.25
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0.33,0.94


## Calculate cosine similarity

In [124]:
# Compute pairwise cosine similarities between all movies
similarities = cosine(merged_copy)
# Store the similarity matrix in a DataFrame whose rows and columns are both indexed by movie name; each element is the similarity between the corresponding pair of movies
similarities = pd.DataFrame(similarities, columns=merged_copy.index, index=merged_copy.index)

In [125]:
similarities

Unnamed: 0,Liebelei,It Happened One Night,The Gay Divorcee,The Scarlet Letter,Of Human Bondage,A Farewell to Arms,Duck Soup,M,Nosferatu,Wings,...,The Cuckoos,Here Is My Heart,The Three Must-Get-Theres,Polikushka,The Fall of the House of Usher,Betty Boop's Big Boss,Betty Boop's Museum,Angora Love,T̫ky̫ no onna,Scaramouche
Liebelei,1.00,0.77,0.67,0.63,0.95,0.56,0.49,0.67,0.14,0.75,...,-0.09,0.35,-0.14,0.10,0.04,0.09,0.17,0.10,0.79,0.49
It Happened One Night,0.77,1.00,0.85,0.45,0.66,0.19,0.75,0.61,0.26,0.56,...,0.08,0.57,0.10,-0.21,0.05,0.22,0.36,0.38,0.46,0.17
The Gay Divorcee,0.67,0.85,1.00,0.21,0.64,0.32,0.79,0.40,0.05,0.41,...,0.52,0.84,0.05,-0.29,0.04,0.35,0.42,0.41,0.42,0.11
The Scarlet Letter,0.63,0.45,0.21,1.00,0.43,0.06,0.38,0.75,0.56,0.73,...,-0.35,-0.03,0.21,0.67,0.02,-0.27,-0.11,-0.01,0.51,0.62
Of Human Bondage,0.95,0.66,0.64,0.43,1.00,0.71,0.38,0.52,-0.09,0.63,...,0.05,0.40,-0.25,0.01,0.04,0.21,0.23,0.11,0.84,0.41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0.09,0.22,0.35,-0.27,0.21,0.25,0.23,-0.09,-0.36,-0.16,...,0.53,0.53,-0.02,-0.35,0.36,1.00,0.98,0.75,0.25,-0.27
Betty Boop's Museum,0.17,0.36,0.42,-0.11,0.23,0.17,0.34,0.05,-0.21,-0.05,...,0.45,0.54,0.06,-0.29,0.39,0.98,1.00,0.80,0.26,-0.20
Angora Love,0.10,0.38,0.41,-0.01,0.11,0.06,0.37,0.06,-0.05,0.01,...,0.46,0.49,0.27,-0.11,0.50,0.75,0.80,1.00,0.12,-0.08
T̫ky̫ no onna,0.79,0.46,0.42,0.51,0.84,0.56,0.39,0.58,-0.13,0.44,...,0.08,0.43,-0.28,0.08,0.04,0.25,0.26,0.12,1.00,0.16
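Cosine similarity is the dot product of two feature vectors divided by the product of their lengths. The dependency-free sketch below recomputes one entry of the matrix by hand. The two vectors are built from the rounded values shown in the tables above and keep only the genre columns in which at least one of the two films is non-zero (Comedy, Drama, Romance), followed by the standardized score and year. All-zero columns do not change the result. The function name `cosine_similarity` here is a local helper, not the scikit-learn function.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# [Comedy, Drama, Romance, standardized IMDB Score, standardized Year]
liebelei = [0, 1, 1, 1.04, 0.94]
it_happened_one_night = [1, 0, 1, 1.74, 1.12]

sim = cosine_similarity(liebelei, it_happened_one_night)
print(round(sim, 2))
```

The result rounds to 0.77, which agrees with the corresponding entry in the matrix above. Negative similarities in the matrix are possible because the standardized score and year can be negative.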


## *It Happened One Night* as a Reference Movie
### Select the 10 most similar movies

In [128]:
film = "It Happened One Night"
n = 10


In [130]:
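# Sort all movies by their similarity to the reference film (named so as not to shadow the built-in sorted)
sorted_similarities = similarities.sort_values(by=film, ascending=False)
# Start at position 1 because the first entry is the film itself
top_n = list(zip(sorted_similarities[film][1:n+1].index, sorted_similarities[film][1:n+1].values))
top_n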

[('Maskerade', 0.9906348501956346),
 ('Just a Gigolo', 0.9786750794153523),
 ('Design for Living', 0.9577870225075642),
 ('The Girl Said No', 0.9463687489529412),
 ('Music in the Air', 0.9147579688489885),
 ('Trouble in Paradise', 0.9136694549170501),
 ('Show People', 0.9102769029354203),
 ('The Circus', 0.9102083857995251),
 ('Blondie of the Follies', 0.9087459882652774),
 ('Fast Life', 0.9084787850733306)]
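Skipping the first sorted entry assumes that the reference film always sorts first. If another film had an identical feature vector, it would tie at 1.0 and the reference film might not be in position 0. A small sketch that excludes the reference film by name instead; the helper name `top_n_similar` is hypothetical, and the similarity values are the five visible entries from the *It Happened One Night* row of the matrix above.

```python
def top_n_similar(similarity_row, film, n=10):
    """Return the n (title, score) pairs most similar to film, excluding film itself."""
    candidates = [(title, score) for title, score in similarity_row.items() if title != film]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]

# Visible entries from the 'It Happened One Night' row of the similarity matrix.
row = {
    "Liebelei": 0.77,
    "It Happened One Night": 1.00,
    "The Gay Divorcee": 0.85,
    "The Scarlet Letter": 0.45,
    "Of Human Bondage": 0.66,
}

top3 = top_n_similar(row, "It Happened One Night", n=3)
print(top3)
```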

## Add movie poster content features

### TF-IDF

In [195]:
merged_copy1 = merged.copy()


In [196]:
merged_copy1 = merged_copy1.drop(columns=['imdbIdimdbId', 'Imdb Link', 'Poster', 'Poster Path', 'Movie Name'])
merged_copy1

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_Romance,Genre_Sci-Fi,Genre_Short,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western,IMDB Score,Year,Poster Content
Liebelei,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,1.04,0.94,a drawing of a cat on a white sheet
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1.74,1.12,a painting of a man and a woman
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0.90,1.12,a collage of photos of a man on a surfboard
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1.18,-0.27,a painting of a person holding a gun
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0.47,1.12,a painting of a woman holding a book
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,-0.52,0.94,a doll of a woman sitting on a chair
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,-0.09,0.77,a painting of a woman holding a bottle of wine
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0.05,0.25,a man in a suit and tie
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0.33,0.94,a woman standing in a kitchen next to a table


In [197]:
# Get a list of English stop words
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text.lower())
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

In [198]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tobys\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tobys\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [199]:
# Remove stop words
merged_copy1['Poster Content'] = merged_copy1['Poster Content'].apply(remove_stopwords)
merged_copy1

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_Romance,Genre_Sci-Fi,Genre_Short,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western,IMDB Score,Year,Poster Content
Liebelei,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,1.04,0.94,drawing cat white sheet
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1.74,1.12,painting man woman
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0.90,1.12,collage photos man surfboard
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1.18,-0.27,painting person holding gun
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0.47,1.12,painting woman holding book
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,-0.52,0.94,doll woman sitting chair
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,-0.09,0.77,painting woman holding bottle wine
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0.05,0.25,man suit tie
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0.33,0.94,woman standing kitchen next table


In [200]:
# Creating a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Perform TF-IDF encoding on poster content
poster_content = merged_copy1['Poster Content']
tfidf_matrix = vectorizer.fit_transform(poster_content)

# Convert the encoded content into a DataFrame and add it to merged_copy1
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Reset the index to ensure there are no duplicate indexes
merged_copy1 = merged_copy1.reset_index(drop=True)
tfidf_df = tfidf_df.reset_index(drop=True)

# Merge the original DataFrame and the TF-IDF encoded DataFrame
merged_copy1 = pd.concat([merged_copy1, tfidf_df], axis=1)

In [201]:
# Drop the raw caption text and restore the movie-name index that reset_index removed
merged_copy1 = merged_copy1.drop(columns=['Poster Content']).set_index(merged.index)
merged_copy1

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,water,wearing,white,window,wine,woman,women,wooden,york,young
Liebelei,0,0,0,0,0,0,0,1,0,0,...,0.00,0.00,0.37,0.00,0.00,0.00,0.00,0.00,0.00,0.00
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.60,0.00,0.00,0.00,0.00
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,0.00,0.00,0.00,0.00,0.00,0.42,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.18,0.00,0.00,0.00,0.00
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.67,0.20,0.00,0.00,0.00,0.00
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0.00,0.00,0.00,0.00,0.00,0.18,0.00,0.00,0.00,0.00
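TF-IDF gives a word a high weight when it appears in a caption but is rare across all captions. Below is a minimal sketch of the calculation, assuming scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and plain whitespace tokenization, which is a simplification of the vectorizer's tokenizer. It uses three of the stop-word-filtered captions shown above, so the values will not match the table, which was fitted on all captions. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many captions each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        matrix.append([x / norm if norm else 0.0 for x in row])
    return vocab, matrix

docs = [
    "painting man woman",           # It Happened One Night
    "painting person holding gun",  # The Scarlet Letter
    "painting woman holding book",  # Of Human Bondage
]
vocab, matrix = tfidf(docs)
print(vocab)
print([round(x, 2) for x in matrix[0]])
```

In this small example "painting" occurs in every caption, so it receives the lowest weight in each row, while a word unique to one caption such as "man" receives the highest.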


## Calculate cosine similarity

In [202]:
# Compute pairwise cosine similarities between all movies
similarities1 = cosine(merged_copy1)
similarities1 = pd.DataFrame(similarities1, columns = merged_copy1.index, index = merged_copy1.index)

In [203]:
similarities1

Unnamed: 0,Liebelei,It Happened One Night,The Gay Divorcee,The Scarlet Letter,Of Human Bondage,A Farewell to Arms,Duck Soup,M,Nosferatu,Wings,...,The Cuckoos,Here Is My Heart,The Three Must-Get-Theres,Polikushka,The Fall of the House of Usher,Betty Boop's Big Boss,Betty Boop's Museum,Angora Love,T̫ky̫ no onna,Scaramouche
Liebelei,1.00,0.64,0.54,0.47,0.75,0.44,0.41,0.57,0.11,0.60,...,-0.07,0.27,-0.11,0.08,0.03,0.07,0.13,0.07,0.58,0.37
It Happened One Night,0.64,1.00,0.74,0.38,0.63,0.33,0.64,0.55,0.26,0.53,...,0.10,0.47,0.13,-0.06,0.08,0.20,0.34,0.32,0.37,0.17
The Gay Divorcee,0.54,0.74,1.00,0.16,0.51,0.27,0.67,0.35,0.04,0.35,...,0.40,0.68,0.05,-0.20,0.03,0.28,0.34,0.32,0.32,0.09
The Scarlet Letter,0.47,0.38,0.16,1.00,0.39,0.08,0.30,0.67,0.45,0.58,...,-0.25,-0.02,0.17,0.55,0.20,-0.21,-0.05,-0.00,0.35,0.49
Of Human Bondage,0.75,0.63,0.51,0.39,1.00,0.68,0.31,0.47,-0.05,0.58,...,0.07,0.30,-0.17,0.24,0.19,0.18,0.26,0.08,0.62,0.39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0.07,0.20,0.28,-0.21,0.18,0.22,0.19,-0.07,-0.30,-0.13,...,0.42,0.42,-0.02,-0.26,0.27,1.00,0.78,0.56,0.19,-0.21
Betty Boop's Museum,0.13,0.34,0.34,-0.05,0.26,0.19,0.28,0.06,-0.16,-0.02,...,0.36,0.41,0.06,-0.13,0.34,0.78,1.00,0.58,0.20,-0.12
Angora Love,0.07,0.32,0.32,-0.00,0.08,0.08,0.28,0.05,-0.03,0.03,...,0.32,0.43,0.21,-0.04,0.34,0.56,0.58,1.00,0.08,-0.05
T̫ky̫ no onna,0.58,0.37,0.32,0.35,0.62,0.44,0.30,0.44,-0.10,0.32,...,0.07,0.31,-0.21,0.05,0.03,0.19,0.20,0.08,1.00,0.11


## Top 10 most similar movies after adding the poster content parsed by the ViT-GPT2 model

In [204]:
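# Sort all movies by their similarity to the reference film (named so as not to shadow the built-in sorted)
sorted_similarities1 = similarities1.sort_values(by=film, ascending=False)
# Start at position 1 because the first entry is the film itself
top_n = list(zip(sorted_similarities1[film][1:n+1].index, sorted_similarities1[film][1:n+1].values))
top_n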

[('Show People', 0.9230902084952273),
 ('Maskerade', 0.8825092939149513),
 ('Trouble in Paradise', 0.8453558635646721),
 ('Design for Living', 0.8411719587530807),
 ('Just a Gigolo', 0.8361220741107812),
 ('One Way Passage', 0.8324917690394688),
 ('The Girl Said No', 0.8227934267481162),
 ('Music in the Air', 0.8175592104175639),
 ("L'Atalante", 0.8125466220189475),
 ('Blondie of the Follies', 0.812295145718485)]
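To see how much the poster captions changed the recommendations, the two top-10 lists for *It Happened One Night* can be compared directly. The sketch below copies the titles from the two outputs in this notebook and reports which films were kept, dropped, and added, along with the Jaccard overlap of the two sets. The variable names are chosen for illustration.

```python
# Top 10 using genre, year, and IMDB score only.
before = [
    "Maskerade", "Just a Gigolo", "Design for Living", "The Girl Said No",
    "Music in the Air", "Trouble in Paradise", "Show People", "The Circus",
    "Blondie of the Follies", "Fast Life",
]
# Top 10 after adding the TF-IDF poster caption features.
after = [
    "Show People", "Maskerade", "Trouble in Paradise", "Design for Living",
    "Just a Gigolo", "One Way Passage", "The Girl Said No", "Music in the Air",
    "L'Atalante", "Blondie of the Follies",
]

kept = [title for title in before if title in after]
dropped = [title for title in before if title not in after]
added = [title for title in after if title not in before]
jaccard = len(set(before) & set(after)) / len(set(before) | set(after))

print(f"kept {len(kept)}, dropped {dropped}, added {added}, Jaccard {jaccard:.2f}")
```

Eight of the ten recommendations are shared. The caption features mainly reorder the list (for example, *Show People* moves from seventh to first) and replace two titles.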

## Reference List

Hugging Face (n.d.). nlpconnect/vit-gpt2-image-captioning · Hugging Face. [online] huggingface.co. Available at: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning [Accessed 12 Jun. 2024].

McCallum, L., Tanska, M. and Beaty, M. (n.d.). Build Software better, Together. [online] GitHub. Available at: https://git.arts.ac.uk/lmccallum/personalisation-23-24/blob/main/week-3-movies.ipynb [Accessed 12 Jun. 2024].

NEHA (n.d.). Movie Genre from its Poster. [online] www.kaggle.com. Available at: https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster/data [Accessed 10 Jun. 2024].