#**Thriller Movie Recommender System**
Press Shift+Return to run each cell of this movie recommender system in order and find your next favorite thriller movie!

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import files
uploaded = files.upload()

Saving imdb_movies.csv to imdb_movies.csv


In [None]:
df = pd.read_csv("imdb_movies.csv")
pd.set_option("display.max_columns", None)

df.head(2)

Unnamed: 0,id,movie_title,status,overview,tagline,genres,keywords,release_date,release_month,release_day,release_year,runtime,tmdb_rating,tmdb_vote_count,imdb_rating,imdb_numVotes,popularity,budget,revenue,production_companies,production_countries,directors,writers,cast
0,393897,EFC,Post Production,Riding off the heels of her sister Scarlett's ...,Its not just another fight. Its the fight for ...,Drama,"strong woman, female athlete, boxing, women fi...",3/6/25,3,6,2025,90,0.0,0.0,7.2,850,1.553,0,0.0,Universe Pictures Group,Canada,"Jaze Bordeaux, Wayne Wells","Ilham Aragrag, Jaze Bordeaux, Greg Jackson","Avaah Blackwell, Karlee Rose, Jaclyn Vogl, Sav..."
1,848538,Argylle,Post Production,When the plots of reclusive author Elly Conway...,"Once you know the secret, don't let the cat ou...","Action, Adventure, Comedy","cat, spy, secret agent, writer's block, author...",1/31/24,1,31,2024,135,0.0,0.0,5.6,87075,48.247,0,0.0,"Marv Films, Apple Studios, Cloudy Productions","United Kingdom, United States of America",Matthew Vaughn,Jason Fuchs,"Bryce Dallas Howard, Sam Rockwell, Bryan Crans..."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35867 entries, 0 to 35866
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    35867 non-null  int64  
 1   movie_title           35867 non-null  object 
 2   status                35867 non-null  object 
 3   overview              35672 non-null  object 
 4   tagline               17446 non-null  object 
 5   genres                35398 non-null  object 
 6   keywords              35866 non-null  object 
 7   release_date          35867 non-null  object 
 8   release_month         35867 non-null  int64  
 9   release_day           35867 non-null  int64  
 10  release_year          35867 non-null  int64  
 11  runtime               35867 non-null  int64  
 12  tmdb_rating           35867 non-null  float64
 13  tmdb_vote_count       35866 non-null  float64
 14  imdb_rating           35867 non-null  float64
 15  imdb_numVotes      

#**Clean-up Decisions**

While cleaning up my data, I tried to think about what kind of information would be relevant to people using this recommender system to find a movie to watch. Besides the genres and movie titles, the pieces of information that stood out to me were release year, runtime, IMDb rating, and number of IMDb votes. This way, the viewer can get a sense of how good and how popular the recommended movies are, as well as how long they are.

First, I searched for any NaNs in the columns I was focusing on and removed them. Then I dropped any movies with a runtime of 30 minutes or less, an IMDb rating of 5.0 or lower, or 5,000 or fewer IMDb votes. This way, the user is hopefully being recommended good movies that have fairly reliable ratings and are "actual movie" length.

After this, I exploded out the genres of all of the remaining movies, so there was only one genre per row in my new dataframe. Then I was able to drop the rows without the "Thriller" genre attached, so users of this particular recommender system would be recommended movies that they are actually interested in seeing.

I was then left with 54 movies, and I decided from here to move on to the next steps of building my movie recommender system.

In [None]:
new_df = df[["movie_title","genres","release_year","runtime","imdb_rating","imdb_numVotes"]]
new_df.head()

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
0,EFC,Drama,2025,90,7.2,850
1,Argylle,"Action, Adventure, Comedy",2024,135,5.6,87075
2,Mean Girls Musical,Comedy,2024,0,5.6,35073
3,The Beekeeper,"Action, Thriller",2024,0,6.3,141983
4,Distant,"Science Fiction, Comedy, Romance",2024,0,5.7,1011


In [None]:
new_df.isna().sum()

Unnamed: 0,0
movie_title,0
genres,469
release_year,0
runtime,0
imdb_rating,0
imdb_numVotes,0


In [None]:
new_df = new_df.dropna(subset = "genres")
new_df.head()

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
0,EFC,Drama,2025,90,7.2,850
1,Argylle,"Action, Adventure, Comedy",2024,135,5.6,87075
2,Mean Girls Musical,Comedy,2024,0,5.6,35073
3,The Beekeeper,"Action, Thriller",2024,0,6.3,141983
4,Distant,"Science Fiction, Comedy, Romance",2024,0,5.7,1011


In [None]:
new_df1 = new_df.drop(new_df[new_df.runtime <=30].index)
new_df1.head()
#I chose 30 minutes as the minimum to ensure that not only would users be recommended movies that were actual "movie length", but I was also keeping in mind that some thrillers,
#like Black Mirror, are only around 40 minutes (depending on the episode).
#Technically Black Mirror is considered a TV show, but I figured this was still relevant when building this system.

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
0,EFC,Drama,2025,90,7.2,850
1,Argylle,"Action, Adventure, Comedy",2024,135,5.6,87075
9,Femme,Thriller,2024,99,7.3,5951
15,The Unbreakable Tatiana Suarez,Documentary,2024,81,6.6,161
16,1928: The Year the Thames Flooded,"Documentary, History",2024,67,7.0,13


In [None]:
new_df2 = new_df1.drop(new_df1[new_df1.imdb_rating <= 5.0].index)
new_df2.head()
#The average rating was around 5.0, so I decided to get rid of any movies that were considered "below average".

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
0,EFC,Drama,2025,90,7.2,850
1,Argylle,"Action, Adventure, Comedy",2024,135,5.6,87075
9,Femme,Thriller,2024,99,7.3,5951
15,The Unbreakable Tatiana Suarez,Documentary,2024,81,6.6,161
16,1928: The Year the Thames Flooded,"Documentary, History",2024,67,7.0,13


In [None]:
new_df3 = new_df2.drop(new_df2[new_df2.imdb_numVotes <= 5000].index)
new_df3.head()
#After taking a look at the data, I decided to remove any movies that had a low number of imdb votes, because I figured that the ratings reflected wouldn't be as accurate
#compared to movies with a higher number of votes.

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
1,Argylle,"Action, Adventure, Comedy",2024,135,5.6,87075
9,Femme,Thriller,2024,99,7.3,5951
40,Ordinary Angels,"Drama, Family",2024,116,7.4,13484
133,Woman of the Hour,"Drama, Crime",2024,89,6.6,45437
143,Doctor Who: Space Babies,"Drama, Family, Science Fiction",2024,46,5.2,8338


In [None]:
print("Total number of rows in this dataframe: " + str(len(new_df3)))

Total number of rows in this dataframe: 6221
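
The three row-dropping steps above could also be written as a single boolean mask. The following is a minimal sketch on a small made-up dataframe; the `toy` data and the `keep` and `filtered` names are illustrative assumptions and not part of the notebook, while the thresholds mirror the ones chosen above.

```python
import pandas as pd

# Toy stand-in for new_df (values are made up for illustration only).
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "runtime": [95, 20, 110, 101],
    "imdb_rating": [6.4, 7.0, 4.8, 7.1],
    "imdb_numVotes": [12000, 9000, 30000, 4000],
})

# Keep only rows that pass all three thresholds at once.
keep = (
    (toy["runtime"] > 30)
    & (toy["imdb_rating"] > 5.0)
    & (toy["imdb_numVotes"] > 5000)
)
filtered = toy[keep]
print(filtered["movie_title"].tolist())
```

Here only movie "A" passes every threshold. Whether one mask or three separate steps reads better is a matter of taste; the separate steps make it easier to inspect the dataframe after each filter.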


In [None]:
genre_df = new_df3.assign(genres = new_df3.genres.str.split(",")).explode("genres")
genre_df["genres"] = genre_df["genres"].str.strip()
genre_df.head()

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
1,Argylle,Action,2024,135,5.6,87075
1,Argylle,Adventure,2024,135,5.6,87075
1,Argylle,Comedy,2024,135,5.6,87075
9,Femme,Thriller,2024,99,7.3,5951
40,Ordinary Angels,Drama,2024,116,7.4,13484


In [None]:
print("Total number of rows in the new dataframe: " + str(len(genre_df)))

Total number of rows in the new dataframe: 15728


In [None]:
display(genre_df[(genre_df["imdb_numVotes"] > 10000) & (genre_df["imdb_rating"] > 8) & (genre_df["genres"] == "Drama")])

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
1006,Oppenheimer,Drama,2023,181,8.3,825418
2134,Top Gun: Maverick,Drama,2022,131,8.2,745163
4729,The Rescue,Drama,2021,107,8.3,21583
6052,Hamilton,Drama,2020,160,8.3,122155
8586,Joker,Drama,2019,122,8.4,1567692
8821,Ford v Ferrari,Drama,2019,153,8.1,504461
11004,Green Book,Drama,2018,130,8.2,609284
11629,Logan,Drama,2017,137,8.1,871218
15230,Hacksaw Ridge,Drama,2016,139,8.1,619462
16674,Inside Out,Drama,2015,95,8.1,844255


In [None]:
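# Note: after explode(), every row from the same movie shares one index label, so this
# index-based drop removes all rows of any movie that has a non-Thriller genre.
# Only movies whose sole listed genre is "Thriller" remain.
genre_df = genre_df.drop(genre_df[genre_df.genres != "Thriller"].index)
genre_df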

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
9,Femme,Thriller,2024,99,7.3,5951
910,I.S.S.,Thriller,2023,95,5.3,10785
1114,The Dive,Thriller,2023,91,5.5,5311
1672,See for Me,Thriller,2022,92,5.8,7650
2553,Fall,Thriller,2022,107,6.4,115612
2905,On the Line,Thriller,2022,104,5.4,15798
3886,Every Breath You Take,Thriller,2021,105,5.4,7353
4629,The Dunes,Thriller,2021,84,5.3,5823
5701,Dangerous Lies,Thriller,2020,96,5.4,19183
6078,The Big Ugly,Thriller,2020,106,5.1,6975


In [None]:
print("Total number of rows in the new dataframe: " + str(len(genre_df)))

Total number of rows in the new dataframe: 54
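
One thing worth checking: after `explode`, rows that came from the same movie share an index label, so dropping by index can remove more than just the non-Thriller rows. The following minimal sketch on made-up data (the `toy`, `exploded`, `by_index`, and `by_mask` names are illustrative assumptions) compares the index-based drop with a boolean mask.

```python
import pandas as pd

# Toy frame: "Multi" lists two genres, "Solo" lists only Thriller (made-up data).
toy = pd.DataFrame({
    "movie_title": ["Multi", "Solo"],
    "genres": ["Action, Thriller", "Thriller"],
})

exploded = toy.assign(genres=toy["genres"].str.split(",")).explode("genres")
exploded["genres"] = exploded["genres"].str.strip()

# Index-based drop: label 0 sits on both of Multi's rows, so both are removed.
by_index = exploded.drop(exploded[exploded["genres"] != "Thriller"].index)

# Boolean mask: each row is judged on its own, so Multi's Thriller row survives.
by_mask = exploded[exploded["genres"] == "Thriller"]

print(by_index["movie_title"].tolist())
print(by_mask["movie_title"].tolist())
```

If the goal is to keep every movie tagged Thriller, including multi-genre ones, the mask version may be closer to it. The row counts reported in this notebook come from the index-based version.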


#**Creating the algorithm used**
Once I was left with a decent number of movies to include in my recommender system, I went on to create the algorithm I would use to ensure users would be recommended movies relevant to their interests. Big thank you to Wendy for sharing an algorithm that was easy to use and modify, and that I ended up using for my recommender system!

When cleaning up my data, I took a slightly different approach from the one outlined in the skeleton. Instead of using the algorithm to decide the minimum number of votes and the minimum rating for movies to be included in the recommender system, I set these thresholds manually. Because of this, I had to alter how I wrote the algorithm so that it still worked and still gave me the results I wanted.

**Just like in the model that Wendy provided, the algorithm uses five kinds of data:**

The minimum number of votes needed to make the cut and be included in the model (m), the number of reviewer votes each movie received (v), each movie's average IMDb rating (r, for individual rating), the average IMDb rating across all of the movies in the dataframe (R, for collective rating), and the movies in the dataframe themselves, which the code passes to the function one row at a time (x).
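
Written as a formula, the `weighted_rating` function in this notebook combines these quantities as follows (a restatement of the code using the symbols defined above; the name WR is only a label for the result):

$$
WR = \frac{v}{v+m}\, r + \frac{m}{v+m}\, R
$$

When v is much larger than m, the result stays close to the movie's own rating r. When v is small relative to m, the result is pulled toward the dataframe average R.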

In [None]:
m = genre_df["imdb_numVotes"].quantile(0.0)
m_round = round(m)

print("The minimum number of imdb votes needed to make the cut and be included in the model is: " + str(m_round))
#Even though I had manually set my minimum to 5000 votes, this shows that there weren't any movies that were right at 5000 votes;
#the movie with the lowest number of votes actually had 5199.

The minimum number of imdb votes needed to make the cut and be included in the model is: 5199


In [None]:
R = genre_df["imdb_rating"].mean()
R_avg = round(R)

print("The average imdb rating for all movies in the dataframe is: " + str(R_avg) + " out of 10")
#This shows that even though I had manually set a minimum rating of 5.0, the average rating for all of the movies included is still relatively low.

The average imdb rating for all movies in the dataframe is: 6 out of 10


In [None]:
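def weighted_rating(x, m = m, R = R):
    # x is one row of genre_df; v is its vote count and r is its own rating.
    v = x["imdb_numVotes"]
    r = x["imdb_rating"]
    # Blend the movie's rating with the overall average, then round to a whole number.
    return round((v/(v+m) * r) + (m/(m+v) * R))

# This overwrites the original imdb_rating column with the rounded weighted rating.
genre_df["imdb_rating"] = genre_df.apply(weighted_rating, axis = 1)

# sort_values returns a new dataframe; the result is not assigned here,
# so genre_df (and the head() output) keeps its original order.
genre_df.sort_values(by = ["imdb_rating"], ascending = False)
genre_df.head()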

Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
9,Femme,Thriller,2024,99,7,5951
910,I.S.S.,Thriller,2023,95,6,10785
1114,The Dive,Thriller,2023,91,6,5311
1672,See for Me,Thriller,2022,92,6,7650
2553,Fall,Thriller,2022,107,6,115612
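
A possible variation, sketched here on made-up data, stores the weighted score in its own column so the original `imdb_rating` values are preserved, skips the rounding so ties are less likely, and assigns the sorted result to a variable. The `toy` data and the `weighted_score` and `ranked` names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up ratings and vote counts standing in for genre_df.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "imdb_rating": [7.3, 5.3, 6.4],
    "imdb_numVotes": [6000, 10000, 115000],
})

m = toy["imdb_numVotes"].min()   # vote threshold (same value as quantile(0.0))
R = toy["imdb_rating"].mean()    # average rating across the frame

def weighted_rating(row, m=m, R=R):
    """Blend a movie's own rating with the overall average, weighted by votes."""
    v = row["imdb_numVotes"]
    r = row["imdb_rating"]
    return (v / (v + m)) * r + (m / (v + m)) * R

# Store the score in its own (hypothetical) column so imdb_rating is preserved.
toy["weighted_score"] = toy.apply(weighted_rating, axis=1)

# sort_values returns a new DataFrame, so the result has to be assigned.
ranked = toy.sort_values("weighted_score", ascending=False)
print(ranked[["movie_title", "weighted_score"]].round(2))
```

In this toy example the order comes out as A, C, B: movie A has the highest rating, and movie C's large vote count keeps its score close to its own 6.4.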


In [None]:
def build_chart(genre_df, percentile = 0.95):

    print("How short can the movie be?")
    low_time = int(input())

    print("How long can the movie be?")
    high_time = int(input())
    #These are the only two questions I included, because the system provides additional
    #information with the movies recommended to the user. It will only provide the "best"
    #movies to the user by default, but if the percentile is lowered, more movies
    #(with fewer votes and/or lower ratings) can be provided.

    movies = genre_df.copy()

    movies = movies[(movies["runtime"] >= low_time) &
                    (movies["runtime"] <= high_time)]

    R = movies["imdb_rating"].mean()
    m = movies["imdb_numVotes"].quantile(percentile)

    q_movies = movies.copy().loc[movies["imdb_numVotes"] >= m]

    q_movies["imdb_rating"] = q_movies.apply(lambda x: (x["imdb_numVotes"]
                                                               /(x["imdb_numVotes"]+ m)
                                                               * x["imdb_rating"])
                                                               + (m/(m+x["imdb_numVotes"]) * R),
                                                               axis = 1)

    # astype(int) drops the decimal part (6.9 becomes 6); it does not round.
    q_movies["imdb_rating"] = q_movies["imdb_rating"].astype(int)

    q_movies = q_movies.sort_values("imdb_rating", ascending = False)

    return q_movies
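
For testing without typing answers each time, the same logic can be sketched as a function that takes the runtime bounds as arguments. This is a hypothetical variant, not the notebook's function: `build_chart_from_args`, the `score` column, and the `toy` data are assumptions made for illustration.

```python
import pandas as pd

def build_chart_from_args(movies, low_time, high_time, percentile=0.95):
    """Hypothetical variant of build_chart that takes the runtime bounds as arguments."""
    # Keep only movies inside the requested runtime window.
    movies = movies[(movies["runtime"] >= low_time) & (movies["runtime"] <= high_time)]

    R = movies["imdb_rating"].mean()
    m = movies["imdb_numVotes"].quantile(percentile)

    # Keep movies at or above the vote threshold, then score them.
    q_movies = movies.loc[movies["imdb_numVotes"] >= m].copy()
    v = q_movies["imdb_numVotes"]
    q_movies["score"] = v / (v + m) * q_movies["imdb_rating"] + m / (v + m) * R

    return q_movies.sort_values("score", ascending=False)

# Made-up data for illustration only.
toy = pd.DataFrame({
    "movie_title": ["Short", "Mid", "Long"],
    "runtime": [81, 113, 127],
    "imdb_rating": [7.1, 7.2, 7.1],
    "imdb_numVotes": [291000, 224000, 162000],
})

result = build_chart_from_args(toy, 100, 130, percentile=0.5)
print(result["movie_title"].tolist())
```

With these inputs, "Short" falls outside the runtime window and "Long" falls below the vote threshold, so only "Mid" is returned.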

#**Time to use the recommender system!**
This recommender system asks you two questions about the length of the movie you're willing to watch: how short can the movie be, and how long? Then the system will give you a list of movies that fit within the limits you have set. For each movie, it will also tell you when it was released, how long it is, its rating, and how many IMDb votes it has received.

In [None]:
build_chart(genre_df).head()

How short can the movie be?
30
How long can the movie be?
200


Unnamed: 0,movie_title,genres,release_year,runtime,imdb_rating,imdb_numVotes
26469,State of Play,Thriller,2009,127,6,162800
28923,Fracture,Thriller,2007,113,6,224140
33075,Phone Booth,Thriller,2003,81,6,291249


#**Thank you for trying my recommender system!**
Credit: Thank you to Wendy for allowing me to remix and reuse the code provided in the skeleton! And thank you to my friends who were willing to test this recommender system for me!