# MovieLens Analytics

In this project, I analyze about 45,000 movies from the MovieLens dataset (releases up to July 2017) using PostgreSQL and Pandas.

data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv

To keep the later queries compact and readable, let's first explore the data with Pandas and decide which columns to keep.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/movies_metadata.csv')
# Transpose for easier exploration of this dataset with many cols
df.head(3).transpose()


| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| adult | False | False | False |
| belongs_to_collection | {'id': 10194, 'name': 'Toy Story Collection', ... | | {'id': 119050, 'name': 'Grumpy Old Men Collect... |
| budget | 30000000 | 65000000 | 0 |
| genres | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... |
| homepage | http://toystory.disney.com/toy-story | | |
| id | 862 | 8844 | 15602 |
| imdb_id | tt0114709 | tt0113497 | tt0113228 |
| original_language | en | en | en |
| original_title | Toy Story | Jumanji | Grumpier Old Men |
| overview | Led by Woody, Andy's toys live happily in his ... | When siblings Judy and Peter discover an encha... | A family wedding reignites the ancient feud be... |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [4]:
df["adult"].value_counts()

False                                                                                                                             45454
True                                                                                                                                  9
 - Written by Ørnås                                                                                                                   1
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
Name: adult, dtype: int64

In [5]:
df["video"].value_counts()

False    45367
True        93
Name: video, dtype: int64

In [6]:
df["status"].value_counts()

Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

The columns 'adult', 'status' and 'video' are dominated by a single value, so let's remove them. Let's also remove 'poster_path', 'homepage' (it has few non-null values), 'id' and 'imdb_id' (we'll stick to one table for now), 'original_title' (we keep 'title'), and 'spoken_languages', 'overview' and 'tagline' (we won't do text analysis here, and the large amounts of text in these columns can make rows inconsistent).
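One way to put a number on "dominated by a single value" is to compute the share of the most frequent value per column. The sketch below is only an illustration: the helper name `dominant_share` and the toy data are assumptions, not part of the original notebook. On the real data, the `value_counts()` output above puts `False` at 45454 of 45466 rows in 'adult', which is about 99.97%.

```python
import pandas as pd


def dominant_share(series: pd.Series) -> float:
    """Return the fraction of non-null rows taken by the most frequent value."""
    # value_counts sorts by frequency (descending) and skips NaN by default
    counts = series.value_counts(normalize=True)
    return float(counts.iloc[0]) if len(counts) else 0.0


# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "adult": ["False"] * 9 + ["True"],
        "original_language": ["en", "fr", "en", "de", "en", "it", "en", "ja", "fr", "en"],
    }
)

shares = {col: dominant_share(toy[col]) for col in toy.columns}
print(shares)
```

A column whose share is close to 1.0 carries almost no information for grouping or filtering, which is the reason given above for dropping it.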

In [7]:
df = df.drop(
    [
        "adult",
        "status",
        "video",
        "poster_path",
        "original_title",
        "homepage",
        "id",
        "imdb_id",
        "spoken_languages",
        "overview",
        "tagline",
    ],
    axis=1,
)
df.head(3).transpose()

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| belongs_to_collection | {'id': 10194, 'name': 'Toy Story Collection', ... | | {'id': 119050, 'name': 'Grumpy Old Men Collect... |
| budget | 30000000 | 65000000 | 0 |
| genres | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... |
| original_language | en | en | en |
| popularity | 21.946943 | 17.015539 | 11.7129 |
| production_companies | [{'name': 'Pixar Animation Studios', 'id': 3}] | [{'name': 'TriStar Pictures', 'id': 559}, {'na... | [{'name': 'Warner Bros.', 'id': 6194}, {'name'... |
| production_countries | [{'iso_3166_1': 'US', 'name': 'United States o... | [{'iso_3166_1': 'US', 'name': 'United States o... | [{'iso_3166_1': 'US', 'name': 'United States o... |
| release_date | 1995-10-30 | 1995-12-15 | 1995-12-22 |
| revenue | 373554033.0 | 262797249.0 | 0.0 |
| runtime | 81.0 | 104.0 | 101.0 |


##### IMPORTANT: suspicious values, such as the budget of 0 for 'Grumpier Old Men' (rightmost column), will be handled in later SQL queries
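The notebook defers these cases to SQL. For comparison only, here is a minimal Pandas-side sketch of the same idea, assuming a budget of 0 means "unknown" rather than a real value. The toy frame and the `budget_clean` column name are hypothetical and do not appear in the original.

```python
import pandas as pd

# Toy frame mirroring the three rows shown above
toy = pd.DataFrame(
    {
        "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
        "budget": ["30000000", "65000000", "0"],
    }
)

# 'budget' is read as object dtype, so coerce it to numbers first;
# values that cannot be parsed become NaN instead of raising
budget = pd.to_numeric(toy["budget"], errors="coerce")

# Assumption: 0 stands for a missing budget, so replace it with NaN
toy["budget_clean"] = budget.mask(budget == 0)
print(toy)
```

With missing values marked explicitly, aggregates such as averages skip them instead of being pulled toward zero.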

Time to extract data from the JSON-like columns

## Working with 'belongs_to_collection' column

In [8]:
from ast import literal_eval

def extract_franchise_name(x):
    try:
        # Use literal_eval to safely evaluate the string as a Python dictionary
        # Extract the 'name' value from the dictionary
        return literal_eval(x)["name"]
    except (ValueError, TypeError):
        return None
    
# Apply the extract_franchise_name function to each value in the 'belongs_to_collection' column
df["franchise"] = df["belongs_to_collection"].apply(extract_franchise_name).str.strip()
# Remove a trailing 'Collection' or 'collection' from each franchise name
df["franchise"] = df["franchise"].str.replace(r"[Cc]ollection$", "", regex=True)
# Strip leading and trailing whitespace left over after the removal
df["franchise"] = df["franchise"].str.strip()
df = df.drop(["belongs_to_collection"], axis=1)
df.head(3).transpose()

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| budget | 30000000 | 65000000 | 0 |
| genres | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... |
| original_language | en | en | en |
| popularity | 21.946943 | 17.015539 | 11.7129 |
| production_companies | [{'name': 'Pixar Animation Studios', 'id': 3}] | [{'name': 'TriStar Pictures', 'id': 559}, {'na... | [{'name': 'Warner Bros.', 'id': 6194}, {'name'... |
| production_countries | [{'iso_3166_1': 'US', 'name': 'United States o... | [{'iso_3166_1': 'US', 'name': 'United States o... | [{'iso_3166_1': 'US', 'name': 'United States o... |
| release_date | 1995-10-30 | 1995-12-15 | 1995-12-22 |
| revenue | 373554033.0 | 262797249.0 | 0.0 |
| runtime | 81.0 | 104.0 | 101.0 |
| title | Toy Story | Jumanji | Grumpier Old Men |


In [9]:
df["franchise"].value_counts()

The Bowery Boys                    29
Totò                               27
Zatôichi: The Blind Swordsman      26
James Bond                         26
The Carry On                       25
                                   ..
Superman (DC Universe Animated)     1
Kathleen Madigan                    1
The Big Bottom Box                  1
Joséphine - Saga                    1
Red Lotus                           1
Name: franchise, Length: 1693, dtype: int64

## Working with 'production_countries' column

In [10]:
def process_countries(countries):
    try:
        countries_list = literal_eval(countries)
        if len(countries_list) == 1:
            return countries_list[0]["name"]
        elif len(countries_list) > 1:
            return "Multiple"
        else:
            return "None"
    except (ValueError, TypeError):
        return "None"

# Apply the process_countries function to each value in the 'production_countries' column
df["production_country"] = df["production_countries"].apply(process_countries)
df = df.drop(["production_countries"], axis=1)
df.head(3).transpose()

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| budget | 30000000 | 65000000 | 0 |
| genres | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... |
| original_language | en | en | en |
| popularity | 21.946943 | 17.015539 | 11.7129 |
| production_companies | [{'name': 'Pixar Animation Studios', 'id': 3}] | [{'name': 'TriStar Pictures', 'id': 559}, {'na... | [{'name': 'Warner Bros.', 'id': 6194}, {'name'... |
| release_date | 1995-10-30 | 1995-12-15 | 1995-12-22 |
| revenue | 373554033.0 | 262797249.0 | 0.0 |
| runtime | 81.0 | 104.0 | 101.0 |
| title | Toy Story | Jumanji | Grumpier Old Men |
| vote_average | 7.7 | 6.9 | 6.5 |


In [11]:
df["production_country"].value_counts()

United States of America                17851
Multiple                                 7027
None                                     6288
United Kingdom                           2238
France                                   1654
                                        ...  
Botswana                                    1
Algeria                                     1
Luxembourg                                  1
United States Minor Outlying Islands        1
Cambodia                                    1
Name: production_country, Length: 108, dtype: int64

## Working with 'genres' column

In [12]:
# Convert the stringified JSON into a list of dictionaries
df["genres"] = df["genres"].apply(
    lambda x: literal_eval(x.replace("'", '"')) if isinstance(x, str) else []
)
# Extract the names of genres into a list and sort them alphabetically
df["genres"] = df["genres"].apply(
    lambda x: sorted([genre["name"] for genre in x]) if isinstance(x, list) else []
)
# Display the DataFrame with the extracted genre names
df[["title", "genres"]].head(3)

| | title | genres |
| --- | --- | --- |
| 0 | Toy Story | [Animation, Comedy, Family] |
| 1 | Jumanji | [Adventure, Family, Fantasy] |
| 2 | Grumpier Old Men | [Comedy, Romance] |


In [13]:
# Flatten the list of genre names
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
# Get the unique genre names
unique_genre_names = set(flat_genre_names)
# Print the unique genre names
print(unique_genre_names)

{'Crime', 'Rogue State', 'TV Movie', 'Comedy', 'GoHands', 'Foreign', 'War', 'Drama', 'The Cartel', 'Mardock Scramble Production Committee', 'Sentai Filmworks', 'BROSTA TV', 'Animation', 'Mystery', 'Aniplex', 'Family', 'Action', 'Odyssey Media', 'Pulser Productions', 'Vision View Entertainment', 'Documentary', 'Music', 'Thriller', 'Adventure', 'Horror', 'Romance', 'Carousel Productions', 'History', 'Telescene Film Group Productions', 'Science Fiction', 'Fantasy', 'Western'}


We can see that the 'genres' column contains faulty values such as 'Carousel Productions' or 'Vision View Entertainment', which sound like production companies, not genres. Let's remove such values from the column.

In [14]:
# Define the list of valid genre names
valid_genres = {
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
    'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
    'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western'
}
# Filter the 'genres' column to keep only the valid genres
df["genres"] = df["genres"].apply(lambda x: [genre for genre in x if genre in valid_genres])

Now let's check again

In [15]:
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
unique_genre_names = set(flat_genre_names)
print(unique_genre_names)

{'Crime', 'TV Movie', 'Comedy', 'War', 'Drama', 'Animation', 'Mystery', 'Family', 'Action', 'Documentary', 'Music', 'Thriller', 'Adventure', 'Horror', 'Romance', 'History', 'Science Fiction', 'Fantasy', 'Western'}


In [16]:
df["genres"].value_counts().head(25)

[Drama]                      5298
[Comedy]                     3698
[Documentary]                2849
[]                           2464
[Drama, Romance]             1793
[Comedy, Drama]              1748
[Comedy, Romance]            1165
[Comedy, Drama, Romance]     1096
[Horror]                     1013
[Drama, Thriller]             701
[Horror, Thriller]            691
[Crime, Drama]                614
[Thriller]                    494
[Crime, Drama, Thriller]      432
[Action, Thriller]            374
[Drama, History]              373
[Action]                      322
[Western]                     318
[Drama, War]                  316
[Action, Comedy]              307
[Action, Drama]               305
[Documentary, Music]          299
[Horror, Science Fiction]     273
[Comedy, Horror]              272
[Action, Crime, Thriller]     266
Name: genres, dtype: int64

In [17]:
# Keep only movies with at least one genre, then sort by the number of genres in descending order
df_sorted = df[df["genres"].apply(lambda x: len(x) > 0)].copy()  # Remove empty lists
df_sorted["genres_length"] = df_sorted["genres"].apply(len)
df_sorted = df_sorted.sort_values(by="genres_length", ascending=False)

# Display the DataFrame with the longest genres lists first
df_sorted[["title", "genres", "genres_length"]].head(15)

| | title | genres | genres_length |
| --- | --- | --- | --- |
| 8084 | Yu-Gi-Oh! The Movie | [Action, Adventure, Animation, Comedy, Family,... | 8 |
| 35682 | Cool Cat Saves the Kids | [Action, Comedy, Crime, Drama, Family, Fantasy... | 8 |
| 16387 | Malice in Wonderland | [Action, Crime, Drama, Fantasy, Romance, Scien... | 7 |
| 2301 | Young Sherlock Holmes | [Action, Adventure, Crime, Drama, Family, Myst... | 7 |
| 34383 | Princes and Princesses | [Animation, Comedy, Drama, Family, Fantasy, Ro... | 7 |
| 5015 | Vampire Hunter D: Bloodlust | [Action, Adventure, Animation, Fantasy, Horror... | 7 |
| 41345 | Black Butler | [Action, Adventure, Crime, Drama, Fantasy, Hor... | 7 |
| 11833 | The Librarian: Quest for the Spear | [Action, Adventure, Comedy, Drama, Fantasy, Ro... | 7 |
| 11743 | Origin: Spirits of the Past | [Action, Adventure, Animation, Drama, Fantasy,... | 7 |
| 2414 | Westworld | [Action, Adventure, Drama, Horror, Science Fic... | 7 |


Now let's store the genre names as a single comma-separated string, to keep things simple. In the real world, this is a questionable approach, to say the least: movies and genres form a many-to-many relationship, which should be split into two one-to-many relationships connected through an intermediate (junction) table. However, the maximum string length is known and we're not planning to add new data anytime soon. Besides, the goal of this project is to show that I understand Pandas and SQL, and the Pandas part is already long, so let's keep it short.
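Both options mentioned above can be sketched on a toy frame. This is a rough illustration, not the notebook's actual cell: the `", "` separator and the names `genres_str` and `movie_genre` are assumptions, and the fourth row is made up to show how an empty genre list behaves.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Untitled"],
        "genres": [
            ["Animation", "Comedy", "Family"],
            ["Adventure", "Family", "Fantasy"],
            ["Comedy", "Romance"],
            [],  # a movie with no genres
        ],
    }
)

# Simple option: one comma-separated string per movie
# (an empty list becomes an empty string)
toy["genres_str"] = toy["genres"].apply(", ".join)

# Normalized alternative: one (movie, genre) row per pair, which is the
# shape a junction table would have. explode() turns an empty list
# into NaN, so those rows are dropped afterwards.
movie_genre = (
    toy[["title", "genres"]]
    .explode("genres")
    .dropna(subset=["genres"])
    .rename(columns={"genres": "genre"})
    .reset_index(drop=True)
)

print(toy[["title", "genres_str"]])
print(movie_genre)
```

The string form is enough for pattern matching in SQL, while the exploded form is what you would load if you later decide to normalize the schema.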

## Working with 'production_companies' column

In [18]:
df["production_companies"].value_counts().head(10)

[]                                                                 11875
[{'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]                  742
[{'name': 'Warner Bros.', 'id': 6194}]                               540
[{'name': 'Paramount Pictures', 'id': 4}]                            505
[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]      439
[{'name': 'Universal Pictures', 'id': 33}]                           320
[{'name': 'RKO Radio Pictures', 'id': 6}]                            247
[{'name': 'Columbia Pictures Corporation', 'id': 441}]               207
[{'name': 'Columbia Pictures', 'id': 5}]                             146
[{'name': 'Mosfilm', 'id': 5120}]                                    145
Name: production_companies, dtype: int64