# MRS

Here I'll work on a content-based movie recommender, building on the previous notebook

data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

In [1]:
import pandas as pd
from ast import literal_eval

In [2]:
df = pd.read_csv('data/movies_metadata.csv')
# Transpose for easier exploration of this dataset with many cols
df.head(3).transpose()


Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [4]:
df["adult"].value_counts()

adult
False                                                                                                                             45454
True                                                                                                                                  9
 - Written by Ørnås                                                                                                                   1
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
Name: count, dtype: int64

In [5]:
df["video"].value_counts()

video
False    45367
True        93
Name: count, dtype: int64

In [6]:
df["status"].value_counts()

status
Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: count, dtype: int64

The columns 'adult', 'status' and 'video' are dominated by a single value, so let's remove them. Let's also remove 'poster_path', 'homepage' (too many null values), 'imdb_id', 'spoken_languages', 'overview' and 'tagline'

Apart from this, let's drop the remaining columns that aren't very useful for a recommender: 'original_title', 'belongs_to_collection', 'original_language', 'production_companies' and 'production_countries'
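
The "predominantly one value" and "too many null values" checks above can also be computed for every column at once. The following is a minimal sketch on a made-up toy frame; `dominant_share` is a hypothetical helper name, not something from the dataset or from pandas.

```python
import pandas as pd

# Toy frame standing in for the real metadata (values are made up)
toy = pd.DataFrame(
    {
        "adult": ["False"] * 9 + ["True"],
        "status": ["Released"] * 8 + ["Rumored", None],
        "title": [f"Movie {i}" for i in range(10)],
    }
)


def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value in each column."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )


# Columns with a share close to 1.0 carry almost no information
shares = dominant_share(toy)
# Fraction of missing values per column
null_share = toy.isna().mean()

print(shares)
print(null_share)
```

Which threshold counts as "predominantly one value" is a judgment call; the sketch only produces the numbers to base that decision on.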

In [7]:
df = df.drop(
    [
        "adult",
        "status",
        "video",
        "poster_path",
        "original_title",
        "homepage",
        "imdb_id",
        "spoken_languages",
        "overview",
        "tagline",
        "belongs_to_collection",
        "original_language",
        "production_companies",
        "production_countries"
    ],
    axis=1,
)
df.head(3).transpose()

Unnamed: 0,0,1,2
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
id,862,8844,15602
popularity,21.946943,17.015539,11.7129
release_date,1995-10-30,1995-12-15,1995-12-22
revenue,373554033.0,262797249.0,0.0
runtime,81.0,104.0,101.0
title,Toy Story,Jumanji,Grumpier Old Men
vote_average,7.7,6.9,6.5
vote_count,5415.0,2413.0,92.0


Now let's have a look at dtypes

## Converting dtypes to more appropriate ones

In [8]:
df.dtypes

budget           object
genres           object
id               object
popularity       object
release_date     object
revenue         float64
runtime         float64
title            object
vote_average    float64
vote_count      float64
dtype: object

First of all, let's handle the 'release_date' column

In [9]:
# Convert 'release_date' column to datetime type
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
# Count the number of rows with bad date values
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 90


Since 90 rows out of roughly 45,000 is a negligible share (about 0.2%), we can safely remove them

In [10]:
# Remove rows with null or NaT values
df = df.dropna(subset=['release_date'])
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 0
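
Note that after `errors='coerce'` the count of "bad" dates mixes two cases: values that were already missing and values that failed to parse. `df.info()` above reported 45379 non-null release dates out of 45466 rows, so presumably 87 of the 90 were missing from the start and only about 3 were unparseable. A small sketch on made-up values showing how the two cases can be told apart:

```python
import pandas as pd

# Made-up sample: two valid dates, one garbage string, one missing value
raw_dates = pd.Series(["1995-10-30", "not a date", None, "1995-12-22"])

# Invalid strings become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors="coerce")

missing_before = raw_dates.isna().sum()   # missing in the raw data
bad_or_missing = parsed.isna().sum()      # missing + unparseable
# Values that were present but could not be parsed as dates
unparseable = raw_dates[parsed.isna() & raw_dates.notna()]

print(missing_before, bad_or_missing)
print(unparseable.tolist())
```

Looking at the unparseable values before dropping them is a cheap way to spot rows whose columns are shifted or otherwise corrupted.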


The 'budget' column contains non-numeric values like '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'. Let's strip all non-digit characters from it

In [11]:
# Clean 'budget' column to remove non-numeric characters
df["budget"] = df["budget"].str.replace(r"\D", "", regex=True)
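
One caveat with stripping non-digit characters: a value like the file name above still contains a few digits, so it turns into a small number instead of disappearing. Whether any such rows are still present at this point in the notebook is not verified here (they may have been dropped together with the bad dates). As a hedged alternative, `pd.to_numeric(..., errors="coerce")` marks such values as missing. A sketch on a made-up sample:

```python
import pandas as pd

# Made-up sample: two real budgets and the faulty file-name value
budget = pd.Series(["30000000", "0", "/ff9qCepilowshEtG2GYWwzt2bs4.jpg"])

# Approach used in the notebook: keep only the digits
stripped = budget.str.replace(r"\D", "", regex=True)

# Alternative: anything that is not a number becomes NaN
numeric = pd.to_numeric(budget, errors="coerce")

print(stripped.tolist())
print(numeric.tolist())
```

With the coerce approach the faulty rows can then be dropped or filled explicitly, rather than silently keeping a meaningless budget.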

I don't like that columns holding whole numbers, like 'runtime' or 'vote_count', have a float dtype. Let's change that

In [12]:
# Specify columns and their new data types
dict_columns_to_convert = {
    "budget": "int64",
    "revenue": "int64",
    "runtime": "int",
    "vote_count": "int"
}
# Fill NaN values with 0
cols_to_fill = list(dict_columns_to_convert.keys())
df[cols_to_fill] = df[cols_to_fill].fillna(0)
# Convert columns to integer type
df = df.astype(dict_columns_to_convert)
# Check the data types of the DataFrame
print(df.dtypes)

budget                   int64
genres                  object
id                      object
popularity              object
release_date    datetime64[ns]
revenue                  int64
runtime                  int32
title                   object
vote_average           float64
vote_count               int32
dtype: object
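
Filling NaN with 0 before the conversion makes a missing runtime indistinguishable from a genuine 0. This is not what the notebook does, but as a sketch of an alternative, pandas' nullable `"Int64"` extension dtype (capital "I") keeps whole numbers as integers while preserving missing values:

```python
import pandas as pd

# Made-up sample with one missing runtime
runtime = pd.Series([81.0, 104.0, None], name="runtime")

# Approach used in the notebook: missing values become 0
filled = runtime.fillna(0).astype("int64")

# Alternative: nullable integer dtype keeps the gap as <NA>
nullable = runtime.astype("Int64")

print(filled.tolist())
print(nullable)
```

The trade-off is that some downstream tools handle the nullable dtype less smoothly than plain NumPy integers, so the simpler fill-with-zero route can still be a reasonable choice.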


## Working with 'genres' column

In [14]:
# Parse the stringified list of dictionaries (a Python literal, not strict JSON)
df["genres"] = df["genres"].apply(
    lambda x: literal_eval(x) if isinstance(x, str) else []
)
# Extract the names of genres into a list and sort them alphabetically
df["genres"] = df["genres"].apply(
    lambda x: sorted([genre["name"] for genre in x]) if isinstance(x, list) else []
)
# Display the DataFrame with the extracted genre names
df[["title", "genres"]].head(3)

Unnamed: 0,title,genres
0,Toy Story,"[Animation, Comedy, Family]"
1,Jumanji,"[Adventure, Family, Fantasy]"
2,Grumpier Old Men,"[Comedy, Romance]"
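
To make the parsing step concrete, here is a small standalone sketch on a single made-up cell value in the same shape as the 'genres' column. The values use single quotes, so they are Python literals rather than strict JSON: `ast.literal_eval` can read them, while `json.loads` rejects them.

```python
import json
from ast import literal_eval

# One cell of the 'genres' column, shortened for illustration
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers)
parsed = literal_eval(raw)
names = sorted(item["name"] for item in parsed)

# Strict JSON requires double quotes, so json.loads fails on this string
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

print(names)
print(json_ok)
```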


In [15]:
# Flatten the list of genre names
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
# Get the unique genre names
unique_genre_names = set(flat_genre_names)
# Print the unique genre names
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 20 unique genres.
{'Adventure', 'War', 'Horror', 'Mystery', 'TV Movie', 'Science Fiction', 'Documentary', 'Western', 'Music', 'Animation', 'Action', 'Thriller', 'Romance', 'Fantasy', 'Family', 'Crime', 'Drama', 'Comedy', 'Foreign', 'History'}


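The 'genres' column in the raw file has faulty data like 'Carousel Productions' or 'Vision View Entertainment', which sound like production companies, not genres. None of them show up in the output above, but to be safe let's keep only an explicit list of valid genres. Note that this list also leaves out 'TV Movie', 'Music' and 'Foreign'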

In [16]:
# Define the list of valid genre names
valid_genres = {
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
    'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Mystery',
    'Romance', 'Science Fiction', 'Thriller', 'War', 'Western'
}
# Filter the genre_names column to include only the valid genres
df["genres"] = df["genres"].apply(lambda x: [genre for genre in x if genre in valid_genres])

Now let's check again

In [17]:
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
unique_genre_names = set(flat_genre_names)
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 17 unique genres.
{'Adventure', 'Family', 'War', 'Animation', 'Horror', 'Crime', 'Mystery', 'Science Fiction', 'Comedy', 'Action', 'Thriller', 'Documentary', 'History', 'Fantasy', 'Western', 'Romance', 'Drama'}


In [18]:
df["genres"].value_counts().head(7)

genres
[Drama]              5617
[Comedy]             3873
[Documentary]        3164
[]                   2522
[Drama, Romance]     1951
[Comedy, Drama]      1845
[Comedy, Romance]    1325
Name: count, dtype: int64
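
The counts above are per genre combination. To count how often each individual genre occurs, one option is `Series.explode`, which turns every list element into its own row. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up sample in the same shape as df[["title", "genres"]]
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "genres": [["Drama"], ["Comedy", "Drama"], [], ["Comedy", "Romance"]],
    }
)

# One row per (movie, genre); empty lists become NaN and are not counted
per_genre = toy["genres"].explode().value_counts()

# Number of movies without any genre
no_genre = (toy["genres"].str.len() == 0).sum()

print(per_genre)
print(no_genre)
```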

One movie can belong to many genres and one genre can be applied to many movies. It's a many-to-many relationship. Ideally, this kind of relationship is supposed to be broken into two one-to-many relationships connected through an intermediate (junction) table. However, because

- it's a project to show my knowledge mainly of writing SQL queries
- I'm applying to a junior data analyst position, and, in that role, you're not supposed to design databases
- the preparation part is already too long
- the maximum string length for genres is known (80 characters for the movie with the title 'Yu-Gi-Oh')

I'll keep things simple and join the genre names with commas.
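
For comparison, this is roughly what the junction-table design mentioned above could look like in pandas. It is only a sketch of the alternative, not what this notebook does; the table names `genres` and `movie_genres` and the `genre_id` / `movie_id` columns are made up for illustration.

```python
import pandas as pd

# Three movies with list-valued genres, as they look before joining
movies = pd.DataFrame(
    {
        "id": [862, 8844, 15602],
        "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
        "genres": [
            ["Animation", "Comedy", "Family"],
            ["Adventure", "Family", "Fantasy"],
            ["Comedy", "Romance"],
        ],
    }
)

# Lookup table: one row per genre with a surrogate key (hypothetical ids)
genre_names = sorted({g for gs in movies["genres"] for g in gs})
genres = pd.DataFrame(
    {"genre_id": range(1, len(genre_names) + 1), "name": genre_names}
)

# Junction table: one row per (movie, genre) pair
movie_genres = (
    movies[["id", "genres"]]
    .explode("genres")
    .dropna(subset=["genres"])
    .merge(genres, left_on="genres", right_on="name")[["id", "genre_id"]]
    .rename(columns={"id": "movie_id"})
    .reset_index(drop=True)
)

print(genres)
print(movie_genres)
```

With this layout, "all movies of genre X" becomes a join on `movie_genres` instead of a substring search in a comma-separated string.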

In [19]:
# Convert the list of genres into a string with comma as a delimiter
df["genres"] = df["genres"].apply(lambda x: ", ".join(x) if x else None)

In [20]:
df["genres"].value_counts().head(7)

genres
Drama                     5617
Comedy                    3873
Documentary               3164
Drama, Romance            1951
Comedy, Drama             1845
Comedy, Romance           1325
Comedy, Drama, Romance    1153
Name: count, dtype: int64

In [21]:
df.head(3).transpose()

Unnamed: 0,0,1,2
budget,30.0,65.0,0.0
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance"
id,862,8844,15602
popularity,21.946943,17.015539,11.7129
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00
revenue,373.554033,262.797249,0.0
runtime,81,104,101
title,Toy Story,Jumanji,Grumpier Old Men
vote_average,7.7,6.9,6.5
vote_count,5415,2413,92


## Final steps of data preparation with Pandas

Time to rearrange the columns a little bit, because I'm not happy with their order

In [22]:
new_cols_order = [
    "id",
    "title",
    "release_date",
    "runtime",
    "genres",
    "budget",
    "revenue",
    "popularity",
    "vote_average",
    "vote_count"
]
df = df[new_cols_order]
df.head(3).transpose()

Unnamed: 0,0,1,2
id,862,8844,15602
title,Toy Story,Jumanji,Grumpier Old Men
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00
runtime,81,104,101
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance"
budget,30.0,65.0,0.0
revenue,373.554033,262.797249,0.0
popularity,21.946943,17.015539,11.7129
vote_average,7.7,6.9,6.5
vote_count,5415,2413,92


Columns explanation:
- id - movie id (as given in the source dataset)
- title - official title of the movie
- release_date - theatrical release date of the movie
- runtime - movie duration/runtime in minutes
- genres - genres associated with the movie, separated by a comma
- budget - movie budget in dollars
- revenue - total movie revenue in dollars
- popularity - popularity score assigned by TMDB
- vote_average - average movie rating
- vote_count - number of votes by users, counted by TMDB

Let's save the cleaned up dataset, which we'll use in the next chapters

In [23]:
# df.to_csv("data/data.csv", index=False)
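
One thing to keep in mind when the saved file is loaded again: CSV stores no dtypes, so 'release_date' comes back as plain strings unless it is parsed on read, and movies without genres come back as NaN. A small round-trip sketch on made-up rows, using an in-memory buffer instead of the real "data/data.csv" file:

```python
import io

import pandas as pd

# Made-up sample in the shape of the cleaned dataset (subset of columns)
cleaned = pd.DataFrame(
    {
        "id": [862, 8844],
        "title": ["Toy Story", "Jumanji"],
        "release_date": pd.to_datetime(["1995-10-30", "1995-12-15"]),
        "genres": ["Animation, Comedy, Family", None],
    }
)

# Write to an in-memory buffer, then read it back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot store
reloaded = pd.read_csv(buffer, parse_dates=["release_date"])

print(reloaded.dtypes)
print(reloaded)
```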