# Exploring the Effects of Including Movie Posters as a Feature on Content-based Filtering Movie Recommendation Systems

## Movie Genre from its Poster

Dataset from: https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster/data

The movie posters were obtained from the IMDb website. The dataset contains the IMDb ID, IMDb link, title, IMDb score, genre, and a link to download each movie poster. Each movie poster belongs to at least one genre and can have at most three genre labels. Because the dataset also includes the IMDb score, it is possible to examine whether a movie's poster is related to its rating.
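The `Genre_*` indicator columns that appear later in this notebook can be produced from a raw genre string with multi-hot encoding. The sketch below is only an illustration: it assumes the raw `Genre` column separates labels with `|`, and the two rows are made up rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the "|" separator is an assumption about the raw Genre column.
raw = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Genre": ["Comedy|Romance", "Drama"],
})

# One 0/1 column per genre; a movie with several genres gets several 1s.
genre_flags = raw["Genre"].str.get_dummies(sep="|").add_prefix("Genre_")
encoded = pd.concat([raw[["Title"]], genre_flags], axis=1)
print(encoded)
```

With at most three genres per movie, each row of `genre_flags` sums to a value between 1 and 3.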

In [28]:
!pip install pillow matplotlib


In [53]:
# Import libraries
import os
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:,.2f}'.format
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity as cosine
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
from PIL import Image
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# On a first run, the NLTK resources may need downloading, e.g.
# nltk.download('stopwords') and nltk.download('punkt')
np.set_printoptions(threshold=np.inf)  # print full arrays without truncation

## Load the Movie Genre Dataset created by the movie_recommendation_Vit-GPT2 notebook

In [9]:
movie_df = pd.read_csv('data/final_movies_dataset.csv')
movie_df

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_War,Genre_Western,Movie Name,imdbIdimdbId,Imdb Link,Poster,Poster Path,IMDB Score,Year,Poster Content
0,0,0,0,0,0,0,0,1,0,0,...,0,0,Liebelei,24252,http://www.imdb.com/title/tt24252,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\242...,1.04,0.94,a drawing of a cat on a white sheet
1,0,0,0,0,1,0,0,0,0,0,...,0,0,It Happened One Night,25316,http://www.imdb.com/title/tt25316,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\253...,1.74,1.12,a painting of a man and a woman
2,0,0,0,0,1,0,0,0,0,0,...,0,0,The Gay Divorcee,25164,http://www.imdb.com/title/tt25164,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\251...,0.90,1.12,a collage of photos of a man on a surfboard
3,0,0,0,0,0,0,0,1,0,0,...,0,0,The Scarlet Letter,17350,http://www.imdb.com/title/tt17350,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\173...,1.18,-0.27,a painting of a person holding a gun
4,0,0,0,0,0,0,0,1,0,0,...,0,0,Of Human Bondage,25586,http://www.imdb.com/title/tt25586,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\255...,0.47,1.12,a painting of a woman holding a book
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1013,0,0,1,0,1,0,0,0,0,0,...,0,0,Betty Boop's Big Boss,23797,http://www.imdb.com/title/tt23797,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\237...,-0.52,0.94,a doll of a woman sitting on a chair
1014,0,0,1,0,1,0,0,0,0,0,...,0,0,Betty Boop's Museum,22670,http://www.imdb.com/title/tt22670,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\226...,-0.09,0.77,a painting of a woman holding a bottle of wine
1015,0,0,0,0,1,0,0,0,0,0,...,0,0,Angora Love,19640,http://www.imdb.com/title/tt19640,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\196...,0.05,0.25,a man in a suit and tie
1016,0,0,0,0,0,0,0,1,0,0,...,0,0,T̫ky̫ no onna,24676,http://www.imdb.com/title/tt24676,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\246...,0.33,0.94,a woman standing in a kitchen next to a table


## Add a New Model: BLIP to Convert Posters to Text

In [10]:
from transformers import BlipProcessor, BlipForConditionalGeneration

Code from: https://www.analyticsvidhya.com/blog/2024/03/salesforce-blip-revolutionizing-image-captioning/

In [11]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

In [12]:
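def generate_blip_caption(img_path, model, processor):
    """Return a BLIP caption for the image at img_path, or "" if captioning fails."""
    try:
        image = Image.open(img_path).convert("RGB")  # ensure a 3-channel RGB image
        inputs = processor(images=image, return_tensors="pt")  # preprocess into PyTorch tensors
        out = model.generate(**inputs)  # generate caption token IDs
        caption = processor.decode(out[0], skip_special_tokens=True)  # token IDs to text
        return caption
    except Exception as e:
        print(f"Error processing {img_path}: {e}")
        return ""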


In [13]:
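# Extract captions for all movie posters and show processing progress
def extract_all_captions(movie_df, model, processor):
    captions_list = []
    total_images = len(movie_df)
    # Count rows with enumerate so progress does not depend on the index being 0..N-1
    for count, (_, row) in enumerate(movie_df.iterrows(), start=1):
        img_path = row['Poster Path']  # the 'Poster Path' column holds the poster's file path
        if os.path.exists(img_path):
            caption = generate_blip_caption(img_path, model, processor)
        else:
            caption = ""  # missing file: keep an empty caption so rows stay aligned
        captions_list.append(caption)
        print(f"Processed {count}/{total_images} images", end="\r")  # print progress
    return captions_list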


In [14]:
captions = extract_all_captions(movie_df, model, processor)



Processed 1018/1018 images

In [16]:
movie_df.index = movie_df['Movie Name'].values
movie_df['Poster Content2'] = captions
movie_df

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_Western,Movie Name,imdbIdimdbId,Imdb Link,Poster,Poster Path,IMDB Score,Year,Poster Content,Poster Content2
Liebelei,0,0,0,0,0,0,0,1,0,0,...,0,Liebelei,24252,http://www.imdb.com/title/tt24252,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\242...,1.04,0.94,a drawing of a cat on a white sheet,a poster for the film'lebe '
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,0,It Happened One Night,25316,http://www.imdb.com/title/tt25316,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\253...,1.74,1.12,a painting of a man and a woman,"a poster for the film's release of the film, t..."
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,0,The Gay Divorcee,25164,http://www.imdb.com/title/tt25164,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\251...,0.90,1.12,a collage of photos of a man on a surfboard,the man in the iron age movie poster
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0,The Scarlet Letter,17350,http://www.imdb.com/title/tt17350,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\173...,1.18,-0.27,a painting of a person holding a gun,the scarlet letter
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,0,Of Human Bondage,25586,http://www.imdb.com/title/tt25586,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\255...,0.47,1.12,a painting of a woman holding a book,"a poster for the film's film, the woman in the..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0,Betty Boop's Big Boss,23797,http://www.imdb.com/title/tt23797,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\237...,-0.52,0.94,a doll of a woman sitting on a chair,betty bop the complete collection
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0,Betty Boop's Museum,22670,http://www.imdb.com/title/tt22670,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\226...,-0.09,0.77,a painting of a woman holding a bottle of wine,betty bop the definitive
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0,Angora Love,19640,http://www.imdb.com/title/tt19640,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\196...,0.05,0.25,a man in a suit and tie,the complete collection of the comedy comedy
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0,T̫ky̫ no onna,24676,http://www.imdb.com/title/tt24676,https://images-na.ssl-images-amazon.com/images...,data/SampleMoviePosters/SampleMoviePosters\246...,0.33,0.94,a woman standing in a kitchen next to a table,a group of people sitting around a table


In [76]:
movie_df.to_csv('data/final_movies_dataset_with_BLIP.csv', index=False)

In [77]:
new_movie_df = movie_df.copy()

In [78]:
new_movie_df = new_movie_df.drop(columns=['imdbIdimdbId', 'Imdb Link', 'Poster', 'Poster Path', 'Movie Name', 'Poster Content'])
new_movie_df 

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,Genre_Romance,Genre_Sci-Fi,Genre_Short,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western,IMDB Score,Year,Poster Content2
Liebelei,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,1.04,0.94,a poster for the film'lebe '
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1.74,1.12,"a poster for the film's release of the film, t..."
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0.90,1.12,the man in the iron age movie poster
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1.18,-0.27,the scarlet letter
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0.47,1.12,"a poster for the film's film, the woman in the..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,-0.52,0.94,betty bop the complete collection
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,-0.09,0.77,betty bop the definitive
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0.05,0.25,the complete collection of the comedy comedy
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0.33,0.94,a group of people sitting around a table


### TF-IDF

In [79]:
# Get a list of English stop words
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text.lower())
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)
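To make the effect of this step concrete, the sketch below applies the same idea to one of the captions shown earlier. It is a simplified stand-in, not the notebook's function: it uses a tiny hand-written stop-word set and whitespace splitting (both assumptions made so the example runs without downloading NLTK resources), whereas `remove_stopwords` above uses NLTK's English list and `word_tokenize`.

```python
# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
DEMO_STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "for"}

def remove_stopwords_demo(text: str) -> str:
    """Lowercase the text, split on whitespace, and drop the demo stop words."""
    words = text.lower().split()
    return " ".join(word for word in words if word not in DEMO_STOP_WORDS)

print(remove_stopwords_demo("a painting of a man and a woman"))
# prints "painting man woman"
```

Only the content words survive, so the later TF-IDF vocabulary is not dominated by articles and prepositions.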

In [80]:
# Remove stop words from the BLIP captions
new_movie_df['Poster Content2'] = new_movie_df['Poster Content2'].apply(remove_stopwords)
# Save the stop-word-filtered data
new_movie_df.to_csv('data/final_movies_dataset_with_BLIP111111.csv', index=False)

In [81]:
# Creating a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Perform TF-IDF encoding on poster content
poster_content = new_movie_df['Poster Content2']
tfidf_matrix = vectorizer.fit_transform(poster_content)

# Convert the encoded content into a DataFrame and add it to new_movie_df
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Reset the index to ensure there are no duplicate indexes
new_movie_df = new_movie_df.reset_index(drop=True)
tfidf_df = tfidf_df.reset_index(drop=True)

# Merge the original DataFrame and the TF-IDF encoded DataFrame
new_movie_df = pd.concat([new_movie_df, tfidf_df], axis=1)

In [82]:
new_movie_df = new_movie_df.drop(columns=['Poster Content2']).set_index(movie_df.index)
new_movie_df

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,...,yellow,york,yorks,young,zara,zero,zombie,zoo,zoon,zorro
Liebelei,0,0,0,0,0,0,0,1,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
It Happened One Night,0,0,0,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
The Gay Divorcee,0,0,0,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
The Scarlet Letter,0,0,0,0,0,0,0,1,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
Of Human Bondage,0,0,0,0,0,0,0,1,0,0,...,0.57,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0,0,1,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
Betty Boop's Museum,0,0,1,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
Angora Love,0,0,0,0,1,0,0,0,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
T̫ky̫ no onna,0,0,0,0,0,0,0,1,0,0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00


## Calculate cosine similarity

In [83]:
# Compute pairwise cosine similarities between all movies (similarities, not distances)
similarities = cosine(new_movie_df)
similarities = pd.DataFrame(similarities, columns = new_movie_df.index, index = new_movie_df.index)
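For reference, the cosine similarity between two feature rows is their dot product divided by the product of their lengths. The sketch below checks a manual NumPy computation against scikit-learn's `cosine_similarity` on three made-up rows. The matrix `X` is hypothetical. Its last column contains a negative value to mimic features such as `IMDB Score` and `Year`, which appear to be standardized in this dataset, and that is why some similarities can be negative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical feature rows (not taken from the dataset).
X = np.array([
    [1.0, 0.0, 1.2],
    [1.0, 0.0, 0.4],
    [0.0, 1.0, -0.9],
])

# cos(a, b) = (a . b) / (||a|| * ||b||), computed for every pair of rows.
norms = np.linalg.norm(X, axis=1)
manual = (X @ X.T) / np.outer(norms, norms)

# The manual result matches scikit-learn's implementation.
assert np.allclose(manual, cosine_similarity(X))
print(np.round(manual, 2))
```

The diagonal is always 1 because every row is identical in direction to itself, which is why the film itself has to be excluded when ranking the most similar movies.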

In [84]:
similarities

Unnamed: 0,Liebelei,It Happened One Night,The Gay Divorcee,The Scarlet Letter,Of Human Bondage,A Farewell to Arms,Duck Soup,M,Nosferatu,Wings,...,The Cuckoos,Here Is My Heart,The Three Must-Get-Theres,Polikushka,The Fall of the House of Usher,Betty Boop's Big Boss,Betty Boop's Museum,Angora Love,T̫ky̫ no onna,Scaramouche
Liebelei,1.00,0.67,0.55,0.47,0.80,0.44,0.41,0.57,0.11,0.64,...,-0.07,0.31,-0.10,0.08,0.03,0.07,0.13,0.07,0.58,0.39
It Happened One Night,0.67,1.00,0.73,0.35,0.58,0.16,0.65,0.54,0.22,0.50,...,0.06,0.49,0.09,-0.16,0.04,0.19,0.29,0.29,0.35,0.14
The Gay Divorcee,0.55,0.73,1.00,0.16,0.52,0.26,0.68,0.37,0.04,0.36,...,0.40,0.68,0.07,-0.19,0.03,0.28,0.34,0.31,0.32,0.11
The Scarlet Letter,0.47,0.35,0.16,1.00,0.32,0.04,0.30,0.60,0.43,0.56,...,-0.25,-0.02,0.16,0.46,0.01,-0.21,-0.08,-0.00,0.35,0.45
Of Human Bondage,0.80,0.58,0.52,0.32,1.00,0.56,0.32,0.43,-0.07,0.55,...,0.04,0.36,-0.18,0.00,0.03,0.17,0.18,0.08,0.60,0.32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Betty Boop's Big Boss,0.07,0.19,0.28,-0.21,0.17,0.20,0.19,-0.07,-0.30,-0.13,...,0.40,0.42,-0.02,-0.26,0.27,1.00,0.91,0.63,0.18,-0.21
Betty Boop's Museum,0.13,0.29,0.34,-0.08,0.18,0.14,0.28,0.04,-0.17,-0.04,...,0.34,0.41,0.05,-0.21,0.28,0.91,1.00,0.58,0.19,-0.15
Angora Love,0.07,0.29,0.31,-0.00,0.08,0.04,0.28,0.05,-0.04,0.01,...,0.32,0.36,0.20,-0.08,0.34,0.63,0.58,1.00,0.08,-0.05
T̫ky̫ no onna,0.58,0.35,0.32,0.35,0.60,0.41,0.30,0.44,-0.10,0.32,...,0.05,0.31,-0.21,0.05,0.03,0.18,0.19,0.08,1.00,0.11


In [85]:
film = "It Happened One Night"
n = 10

## Top 10 most similar movies after adding the poster content captioned by the BLIP model

In [86]:
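# Sort by similarity to the chosen film; avoid naming the result `sorted`, which shadows the built-in
sorted_similarities = similarities.sort_values(by=film, ascending=False)
top_n = list(zip(sorted_similarities[film][1:n+1].index, sorted_similarities[film][1:n+1].values))    # start at 1 because the first entry is the film itself
top_n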

[('Maskerade', 0.8397928142824813),
 ('Just a Gigolo', 0.8150259087957853),
 ('One Way Passage', 0.808743077291467),
 ('City Lights', 0.8005745951932457),
 ('Show People', 0.7980115446386775),
 ('Design for Living', 0.7923628187895732),
 ('Trouble in Paradise', 0.791141679501293),
 ('Music in the Air', 0.7900176373572237),
 ('Fast Life', 0.7834150142701359),
 ("L'Atalante", 0.7773696383280915)]
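The ranking step can also be wrapped in a small helper. The sketch below is one possible way to do it, not the notebook's own code: `top_n_similar` is a hypothetical name, the 3x3 matrix is made up, and the function assumes movie titles are unique in the index. It drops the film by label instead of skipping the first position, so it does not rely on the film sorting first.

```python
import pandas as pd

def top_n_similar(similarities: pd.DataFrame, film: str, n: int = 10):
    """Return the n (title, score) pairs most similar to `film`, excluding the film itself."""
    scores = similarities[film].drop(labels=film)  # remove the self-similarity entry
    return list(scores.nlargest(n).items())

# Hypothetical similarity matrix for three movies.
names = ["A", "B", "C"]
demo_similarities = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=names,
    columns=names,
)

print(top_n_similar(demo_similarities, "A", n=2))
```

In the notebook, the equivalent call would be `top_n_similar(similarities, film, n)`.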

# Reference List

David, M. (2024). Salesforce BLIP: Revolutionizing Image Captioning. [online] Analytics Vidhya. Available at: https://www.analyticsvidhya.com/blog/2024/03/salesforce-blip-revolutionizing-image-captioning/ [Accessed 12 Jun. 2024].

McCallum, L., Tanska, M. and Beaty, M. (n.d.). Build Software better, Together. [online] GitHub. Available at: https://git.arts.ac.uk/lmccallum/personalisation-23-24/blob/main/week-3-movies.ipynb [Accessed 12 Jun. 2024].

NEHA (n.d.). Movie Genre from its Poster. [online] www.kaggle.com. Available at: https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster/data [Accessed 10 Jun. 2024].