# MovieLens Analytics

In this project, I'll analyze roughly 45,000 movies from the MovieLens dataset, which covers movies released up to July 2017, using PostgreSQL and Pandas.

Data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv

To keep things compact and readable, let's first explore the data with Pandas and decide which columns to use in later queries.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/movies_metadata.csv')
# Transpose for easier exploration of this dataset with many cols
df.head(3).transpose()


Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re
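
The `df.info()` output above lists `budget`, `id` and `popularity` as `object` rather than numeric dtypes, which usually means a few cells hold non-numeric text. Below is a minimal sketch of one way to find and coerce such values. The `sample` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: a numeric column polluted by one text value
sample = pd.DataFrame({"budget": ["30000000", "65000000", "/poster.jpg", "0"]})

# errors="coerce" turns anything that is not a number into NaN
budget_num = pd.to_numeric(sample["budget"], errors="coerce")

# Rows that failed to convert are the ones worth inspecting
bad_rows = sample[budget_num.isna()]
print(bad_rows)
print(budget_num.dtype)
```

Inspecting `bad_rows` before overwriting the column helps to tell real data problems apart from parsing problems.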

In [4]:
df["adult"].value_counts()

False                                                                                                                             45454
True                                                                                                                                  9
 - Written by Ørnås                                                                                                                   1
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
Name: adult, dtype: int64
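
The three stray text values in 'adult' suggest a few rows whose fields were shifted while the CSV was parsed. Here is a sketch of how such rows could be isolated, assuming the column holds strings. The `sample` frame is a hypothetical stand-in for `df`, and its titles are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for df with one malformed row
sample = pd.DataFrame({
    "adult": ["False", "True", " - Written by Ørnås", "False"],
    "title": ["Toy Story", "Hypothetical Title", None, "Jumanji"],
})

# A well-formed row has exactly "True" or "False" in the adult column
is_valid = sample["adult"].astype(str).isin(["True", "False"])

malformed = sample[~is_valid]  # rows to inspect by hand
clean = sample[is_valid]       # rows that look well-formed

print(len(malformed), "malformed row(s)")
```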

In [5]:
df["video"].value_counts()

False    45367
True        93
Name: video, dtype: int64

In [6]:
df["status"].value_counts()

Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

The columns 'adult', 'status' and 'video' are dominated by a single value, so let's remove them. Let's also remove 'poster_path' and 'homepage' (few non-null values), 'original_title', 'id' and 'imdb_id' (we'll stick to one table for now), and 'spoken_languages', 'overview' and 'tagline' (we won't be doing text analysis here, and the large amounts of text in these columns can make rows inconsistent).
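
The "dominated by a single value" check can also be done numerically instead of reading each `value_counts()` output by eye. Below is a minimal sketch on a hypothetical `sample` frame; the helper name `top_value_share` and the values are illustrative only.

```python
import pandas as pd

# Hypothetical sample standing in for df
sample = pd.DataFrame({
    "video": [False] * 9 + [True],
    "status": ["Released"] * 8 + ["Rumored", "Planned"],
    "original_language": ["en", "fr", "en", "ja", "de", "en", "it", "es", "fr", "en"],
})

def top_value_share(col):
    """Share of non-null rows taken by the most frequent value."""
    return col.value_counts(normalize=True).iloc[0]

# apply() runs the helper once per column and returns a Series of shares
shares = sample.apply(top_value_share)
print(shares.sort_values(ascending=False))
```

Columns whose share is close to 1 carry little information and are natural candidates for dropping.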

In [7]:
df = df.drop(
    [
        "adult",
        "status",
        "video",
        "poster_path",
        "original_title",
        "homepage",
        "id",
        "imdb_id",
        "spoken_languages",
        "overview",
        "tagline",
    ],
    axis=1,
)
df.head(3).transpose()

Unnamed: 0,0,1,2
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
original_language,en,en,en
popularity,21.946943,17.015539,11.7129
production_companies,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'..."
production_countries,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'iso_3166_1': 'US', 'name': 'United States o...","[{'iso_3166_1': 'US', 'name': 'United States o..."
release_date,1995-10-30,1995-12-15,1995-12-22
revenue,373554033.0,262797249.0,0.0
runtime,81.0,104.0,101.0


##### IMPORTANT: suspicious values, such as the budget of 0 for 'Grumpier Old Men' (rightmost column above), will be handled in later SQL queries

Time to extract data from the JSON-like columns (their values are stored as strings of Python-style dicts and lists)

In [15]:
df["belongs_to_collection"][0:7]

0    {'id': 10194, 'name': 'Toy Story Collection', ...
1                                                  NaN
2    {'id': 119050, 'name': 'Grumpy Old Men Collect...
3                                                  NaN
4    {'id': 96871, 'name': 'Father of the Bride Col...
5                                                  NaN
6                                                  NaN
Name: belongs_to_collection, dtype: object

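In [18]:
from ast import literal_eval

def extract_franchise_name(x):
    """Return the collection name from a dict-like string, or None if it can't be parsed."""
    try:
        return literal_eval(x)["name"]
    # NaN raises ValueError; values that parse to a non-dict (e.g. a float) raise TypeError
    except (ValueError, TypeError, SyntaxError, KeyError):
        return None

df["franchise"] = df["belongs_to_collection"].apply(extract_franchise_name)

# Display the original column next to the extracted franchise names
df[["belongs_to_collection", "franchise"]]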

Unnamed: 0,belongs_to_collection,franchise
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",Toy Story Collection
1,,
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",Grumpy Old Men Collection
3,,
4,"{'id': 96871, 'name': 'Father of the Bride Col...",Father of the Bride Collection
...,...,...
45461,,
45462,,
45463,,
45464,,


In [17]:
from ast import literal_eval

df["franchise"] = df["belongs_to_collection"].apply(
    lambda x: literal_eval(x)["name"] if isinstance(x, str) else None
)

# Display the DataFrame with extracted collection names
df[["belongs_to_collection", "franchise"]]


TypeError: 'float' object is not subscriptable
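
The `isinstance(x, str)` check guards against NaN, but not against a string that parses to something other than a dict. One plausible cause of the error above is a cell holding a numeric string such as the hypothetical `"0.065736"`. This is an assumption, since the offending rows are not shown here. `literal_eval` turns such a string into a float, and indexing a float with `["name"]` raises exactly this `TypeError`. Below is a small reproduction, with an `isinstance(parsed, dict)` check as one possible guard; `safe_name` is an illustrative helper name.

```python
from ast import literal_eval

def safe_name(x):
    """Return the 'name' entry if x is a dict-like string, else None."""
    if not isinstance(x, str):
        return None
    try:
        parsed = literal_eval(x)
    except (ValueError, SyntaxError):
        return None
    # Only look up 'name' when the parsed value really is a dict
    return parsed.get("name") if isinstance(parsed, dict) else None

# Hypothetical value that reproduces the failure
try:
    literal_eval("0.065736")["name"]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

print(safe_name("{'id': 10194, 'name': 'Toy Story Collection'}"))
print(safe_name("0.065736"))
print(safe_name(float("nan")))
```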

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval

# Sample DataFrame with a NaN value (named sample_df so the main df is not overwritten)
data = {'belongs_to_collection': ["{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}", np.nan]}
sample_df = pd.DataFrame(data)

# Replace NaN values with empty strings
sample_df['belongs_to_collection'] = sample_df['belongs_to_collection'].fillna('')

# Extract the name from the dict-like string; literal_eval only parses literals, so it is safer than eval on file data
sample_df['collection_name'] = sample_df['belongs_to_collection'].apply(lambda x: literal_eval(x)['name'] if x else None)

# Display the DataFrame with extracted collection names
sample_df[['belongs_to_collection', 'collection_name']]

Unnamed: 0,belongs_to_collection,collection_name
0,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",Grumpy Old Men Collection
1,,


In [None]:
df["production_countries"].value_counts()

[{'iso_3166_1': 'US', 'name': 'United States of America'}]                                                                                                          17851
[]                                                                                                                                                                   6282
[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]                                                                                                                     2238
[{'iso_3166_1': 'FR', 'name': 'France'}]                                                                                                                             1654
[{'iso_3166_1': 'JP', 'name': 'Japan'}]                                                                                                                              1356
                                                                                                                                                      

In [None]:
df["genres"][1100]

"[{'id': 12, 'name': 'Adventure'}, {'id': 35, 'name': 'Comedy'}, {'id': 14, 'name': 'Fantasy'}]"

In [None]:
import json

# Parse the JSON string in the 'genres' column and extract genre names
df["genre_names"] = df["genres"].apply(lambda x: [genre["name"] for genre in json.loads(x)])

# Display the DataFrame with extracted genre names
print(df[["genres", "genre_names"]].head())

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
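
`json.loads` fails here because the 'genres' strings use single quotes, which makes them Python literals rather than valid JSON. Below is a minimal sketch of one alternative that parses with `ast.literal_eval` instead. The helper name `extract_names` and the `sample` frame are illustrative assumptions; the first string is the 'genres' value shown above.

```python
from ast import literal_eval
import pandas as pd

def extract_names(x):
    """Parse a string holding a list of dicts and return the 'name' values."""
    if not isinstance(x, str):
        return []
    try:
        parsed = literal_eval(x)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]

# Small stand-in for df with a normal value, an empty list and a missing value
sample = pd.DataFrame({
    "genres": [
        "[{'id': 12, 'name': 'Adventure'}, {'id': 35, 'name': 'Comedy'}, {'id': 14, 'name': 'Fantasy'}]",
        "[]",
        None,
    ]
})

sample["genre_names"] = sample["genres"].apply(extract_names)
print(sample["genre_names"].tolist())
```

The same helper should also work for 'production_countries', which has the same list-of-dicts shape.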