# MovieLens Analytics

In this project, I'll analyze roughly 45,000 movies from the MovieLens dataset, which covers releases up to July 2017, using PostgreSQL and pandas.

Data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv

To keep things compact and readable, let's first explore the data with pandas and decide which columns to use in the later queries.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/movies_metadata.csv')
# Transpose for easier exploration of this dataset with many cols
df.head(3).transpose()



Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [4]:
df["adult"].value_counts()

False                                                                                                                             45454
True                                                                                                                                  9
 - Written by Ørnås                                                                                                                   1
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
Name: adult, dtype: int64
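
The three free-text values in the counts above suggest a handful of rows whose fields were shifted during parsing; that reading is an assumption based on the output, not something checked against the raw file. A minimal sketch for listing such rows, using a hypothetical toy frame in place of `df`:

```python
import pandas as pd

# Toy stand-in for df (hypothetical values); 'adult' holds text labels
toy = pd.DataFrame(
    {
        "adult": ["False", "True", " - Written by Ørnås", "False"],
        "title": ["A", "B", None, "D"],
    }
)

# Keep only rows whose 'adult' value is neither of the two expected labels
valid_labels = {"True", "False"}
suspicious = toy[~toy["adult"].astype(str).isin(valid_labels)]
print(suspicious.index.tolist())
```

Since the whole column is dropped in a later cell, this check only matters if those rows need to be inspected or removed.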

In [5]:
df["video"].value_counts()

False    45367
True        93
Name: video, dtype: int64

In [6]:
df["status"].value_counts()

Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

The columns 'adult', 'status' and 'video' are dominated by a single value, so let's remove them. Let's also remove 'poster_path', 'homepage' (few non-null values), 'original_title' (the 'title' column stays), 'id' and 'imdb_id' (we'll stick to one table for now), and 'spoken_languages', 'overview' and 'tagline' (we won't be doing text analysis here, and the large amount of text in these columns can make rows inconsistent).

In [7]:
df = df.drop(
    [
        "adult",
        "status",
        "video",
        "poster_path",
        "original_title",
        "homepage",
        "id",
        "imdb_id",
        "spoken_languages",
        "overview",
        "tagline",
    ],
    axis=1,
)
df.head(3).transpose()

Unnamed: 0,0,1,2
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
original_language,en,en,en
popularity,21.946943,17.015539,11.7129
production_companies,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'..."
production_countries,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'iso_3166_1': 'US', 'name': 'United States o...","[{'iso_3166_1': 'US', 'name': 'United States o..."
release_date,1995-10-30,1995-12-15,1995-12-22
revenue,373554033.0,262797249.0,0.0
runtime,81.0,104.0,101.0
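
The two reasons given for dropping columns (one dominant value, few non-null values) can also be checked in one pass instead of column by column. The sketch below is only an illustration: `column_summary` and the toy frame are hypothetical names and values, not part of the notebook.

```python
import pandas as pd


def column_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per column: share of non-null cells and share held by the most frequent value."""
    rows = []
    for col in frame.columns:
        counts = frame[col].value_counts(dropna=True)
        # Share of all rows taken by the most frequent non-null value
        top_share = counts.iloc[0] / len(frame) if len(counts) else 0.0
        rows.append(
            {
                "column": col,
                "non_null_share": frame[col].notna().mean(),
                "top_value_share": top_share,
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Hypothetical toy data mimicking the three situations discussed above
toy = pd.DataFrame(
    {
        "adult": ["False"] * 9 + ["True"],
        "homepage": ["http://example.com"] + [None] * 9,
        "title": [f"Movie {i}" for i in range(10)],
    }
)
summary = column_summary(toy)
print(summary)
```

A column with a `top_value_share` close to 1 carries little information, and one with a low `non_null_share` is mostly empty; where to draw the line is a judgment call.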


##### IMPORTANT: odd values such as the budget of 0 for 'Grumpier Old Men' (rightmost column) will be handled in the later SQL queries
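
Because 'budget' is stored as text (dtype `object` in the `df.info()` output), counting the zero placeholders needs a numeric conversion first. A minimal sketch on hypothetical toy values, assuming that 0 stands for "unknown" rather than a real zero budget:

```python
import pandas as pd

# Hypothetical toy values; the last one imitates a malformed entry
toy_budget = pd.Series(["30000000", "65000000", "0", "not a number"])

# Unparsable entries become NaN instead of raising an error
budget_num = pd.to_numeric(toy_budget, errors="coerce")

zero_count = int((budget_num == 0).sum())
invalid_count = int(budget_num.isna().sum())
print(zero_count, invalid_count)
```

This only measures how many rows are affected; the actual handling stays in the SQL queries.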

Time to extract data from the JSON-like columns

First, handle 'belongs_to_collection'

In [8]:
from ast import literal_eval


def extract_franchise_name(x):
    """Return the collection name from a stringified dict, or None if unavailable."""
    try:
        # literal_eval safely parses the string into a Python dictionary
        return literal_eval(x)["name"]
    except (ValueError, TypeError, SyntaxError, KeyError):
        # NaN, malformed strings, or dictionaries without a 'name' key
        return None


# Extract the franchise name from each value in 'belongs_to_collection'
df["franchise"] = df["belongs_to_collection"].apply(extract_franchise_name).str.strip()
# Remove a trailing 'Collection' or 'collection' from each franchise name
df["franchise"] = df["franchise"].str.replace(r"[Cc]ollection$", "", regex=True)
# Strip the whitespace left over after the removal
df["franchise"] = df["franchise"].str.strip()
df = df.drop(["belongs_to_collection"], axis=1)
df.head(3).transpose()

Unnamed: 0,0,1,2
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
original_language,en,en,en
popularity,21.946943,17.015539,11.7129
production_companies,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'..."
production_countries,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'iso_3166_1': 'US', 'name': 'United States o...","[{'iso_3166_1': 'US', 'name': 'United States o..."
release_date,1995-10-30,1995-12-15,1995-12-22
revenue,373554033.0,262797249.0,0.0
runtime,81.0,104.0,101.0
title,Toy Story,Jumanji,Grumpier Old Men


In [9]:
df["franchise"].value_counts()

The Bowery Boys                    29
Totò                               27
Zatôichi: The Blind Swordsman      26
James Bond                         26
The Carry On                       25
                                   ..
Superman (DC Universe Animated)     1
Kathleen Madigan                    1
The Big Bottom Box                  1
Joséphine - Saga                    1
Red Lotus                           1
Name: franchise, Length: 1693, dtype: int64
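
To see how the franchise extraction behaves on awkward inputs, here is a small self-contained sketch. The helper `extract_name` and all toy values except the 'Toy Story Collection' entry are hypothetical.

```python
from ast import literal_eval

import pandas as pd


def extract_name(raw):
    """Return the 'name' field of a stringified dict, or None if it cannot be read."""
    try:
        return literal_eval(raw)["name"]
    except (ValueError, TypeError, SyntaxError, KeyError):
        return None


toy = pd.Series(
    [
        "{'id': 10194, 'name': 'Toy Story Collection'}",
        float("nan"),  # movie without a collection
        "{'id': 1, 'name': 'Some Saga collection '}",  # lowercase suffix, trailing space
        "{'id': 2",  # truncated, unparsable string
    ]
)

names = toy.apply(extract_name).str.strip()
# Drop a trailing 'Collection'/'collection', then the space it leaves behind
names = names.str.replace(r"[Cc]ollection$", "", regex=True).str.strip()
print(names.tolist())
```

Missing and unparsable inputs both end up as missing values, so they do not show up in `value_counts()` unless `dropna=False` is passed.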

In [10]:
df["production_countries"][88]

"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"

Now we'll preprocess the 'production_countries' column

In [11]:
def process_countries(countries):
    try:
        countries_list = literal_eval(countries)
        if len(countries_list) == 1:
            return countries_list[0]["name"]
        elif len(countries_list) > 1:
            return "Multiple"
        else:
            return "None"
    except (ValueError, TypeError):
        return "None"

# Apply the process_countries function to each value in the 'production_countries' column
df["production_country"] = df["production_countries"].apply(process_countries)
df = df.drop(["production_countries"], axis=1)
df.head(3).transpose()

Unnamed: 0,0,1,2
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
original_language,en,en,en
popularity,21.946943,17.015539,11.7129
production_companies,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'..."
release_date,1995-10-30,1995-12-15,1995-12-22
revenue,373554033.0,262797249.0,0.0
runtime,81.0,104.0,101.0
title,Toy Story,Jumanji,Grumpier Old Men
vote_average,7.7,6.9,6.5


In [12]:
df["production_country"].value_counts()

United States of America                17851
Multiple                                 7027
None                                     6288
United Kingdom                           2238
France                                   1654
                                        ...  
Botswana                                    1
Algeria                                     1
Luxembourg                                  1
United States Minor Outlying Islands        1
Cambodia                                    1
Name: production_country, Length: 108, dtype: int64
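
Note that the "None" label above is an ordinary string, so pandas does not treat those rows as missing. If real missing values are preferred, one possible variant (a sketch with the hypothetical name `classify_countries`, not the notebook's function) returns Python `None` instead:

```python
from ast import literal_eval

import pandas as pd


def classify_countries(raw):
    """Single country name, 'Multiple', or None when nothing usable is present."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, TypeError, SyntaxError):
        return None
    if not isinstance(parsed, list) or len(parsed) == 0:
        return None
    if len(parsed) == 1:
        return parsed[0]["name"]
    return "Multiple"


# Hypothetical toy inputs covering the four cases
toy = pd.Series(
    [
        "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
        "[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso_3166_1': 'DE', 'name': 'Germany'}]",
        "[]",
        float("nan"),
    ]
)
labels = toy.apply(classify_countries)
print(labels.value_counts(dropna=False))
```

With this variant, `isna()` and `dropna()` work on the column directly; the trade-off is that missing rows disappear from a default `value_counts()`.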

Now let's handle genres

In [14]:
# Parse the stringified list of dictionaries (a Python literal, so no quote replacement is needed)
df["genres"] = df["genres"].apply(
    lambda x: literal_eval(x) if isinstance(x, str) else []
)
# Extract the names of genres into a list and sort them alphabetically
df["genre_names"] = df["genres"].apply(
    lambda x: sorted([genre["name"] for genre in x]) if isinstance(x, list) else []
)
# Display the DataFrame with the extracted genre names
df[["genres", "genre_names"]]

Unnamed: 0,genres,genre_names
0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[Animation, Comedy, Family]"
1,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[Adventure, Family, Fantasy]"
2,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[Comedy, Romance]"
3,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[Comedy, Drama, Romance]"
4,"[{'id': 35, 'name': 'Comedy'}]",[Comedy]
...,...,...
45461,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...","[Drama, Family]"
45462,"[{'id': 18, 'name': 'Drama'}]",[Drama]
45463,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...","[Action, Drama, Thriller]"
45464,[],[]


In [22]:
df["genres"].value_counts().head(40)

[{'id': 18, 'name': 'Drama'}]                                                                    5000
[{'id': 35, 'name': 'Comedy'}]                                                                   3621
[{'id': 99, 'name': 'Documentary'}]                                                              2723
[]                                                                                               2442
[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]                                  1301
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]                                      1135
[{'id': 27, 'name': 'Horror'}]                                                                    974
[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]                                  930
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]     593
[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]                       

In [23]:
df["genre_names"].value_counts().head(40)

[Drama]                              5000
[Comedy]                             3621
[Documentary]                        2723
[]                                   2442
[Comedy, Drama]                      1667
[Drama, Romance]                     1644
[Comedy, Romance]                    1143
[Comedy, Drama, Romance]             1031
[Horror]                              974
[Horror, Thriller]                    680
[Drama, Thriller]                     677
[Crime, Drama]                        604
[Thriller]                            465
[Crime, Drama, Thriller]              428
[Drama, History]                      362
[Action, Thriller]                    337
[Western]                             318
[Drama, War]                          308
[Drama, Foreign]                      298
[Documentary, Music]                  292
[Action, Comedy]                      281
[Action]                              278
[Horror, Science Fiction]             267
[Comedy, Horror]                  
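
The counts above are per genre combination. To count individual genres instead, the list column can be exploded into one row per genre. The sketch below uses a hypothetical toy frame; it also joins each list into a string, since Python lists are not hashable and a string key is a safer input for `value_counts()`.

```python
import pandas as pd

# Hypothetical toy data shaped like the 'genre_names' column
toy = pd.DataFrame(
    {"genre_names": [["Comedy", "Drama"], ["Drama"], [], ["Comedy", "Drama"]]}
)

# One row per (movie, genre); empty lists become NaN and are dropped
per_genre = toy["genre_names"].explode().dropna().value_counts()

# Hashable string key for counting whole combinations
combo_counts = toy["genre_names"].apply(", ".join).value_counts()

print(per_genre)
print(combo_counts)
```

Because 'genre_names' is already sorted alphabetically, the joined string is the same for any ordering of the same genres.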

Time to work with 'production_companies'

In [17]:
df["production_companies"].value_counts().head(10)

[]                                                                 11875
[{'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]                  742
[{'name': 'Warner Bros.', 'id': 6194}]                               540
[{'name': 'Paramount Pictures', 'id': 4}]                            505
[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]      439
[{'name': 'Universal Pictures', 'id': 33}]                           320
[{'name': 'RKO Radio Pictures', 'id': 6}]                            247
[{'name': 'Columbia Pictures Corporation', 'id': 441}]               207
[{'name': 'Columbia Pictures', 'id': 5}]                             146
[{'name': 'Mosfilm', 'id': 5120}]                                    145
Name: production_companies, dtype: int64