### File: Exploratory_Data_Analysis.ipynb - CineMatch

- Contributor: Sudesh Kumar Santhosh Kumar
- Email: santhosh@usc.edu
- GitHub: [Sudesh Kumar](https://github.com/SudeshKumarSanthosh)
- Date: 4th December, 2023

### Description:

This Jupyter Notebook covers the initial data exploration stage of the CineMatch project. It analyzes the MovieLens dataset and examines several facets of it, including plot overviews, genres, production details, financial metrics, and popularity scores.

Key highlights of this exploratory analysis include:

* Detailed statistical summaries and visualizations that uncover underlying patterns and trends in movie data.

* A thorough examination of the dataset's structure and content, laying the groundwork for the application of advanced deep learning and transformer technologies in CineMatch.

* Insightful correlations and comparisons between different cinematic attributes, offering a multifaceted understanding of what shapes movie popularity and viewer preferences.

* Innovative approaches to data cleaning and preprocessing, ensuring the integrity and usability of the dataset for further modeling.

This exploratory data analysis is not just a preliminary step but a crucial foundation for the sophisticated recommendation algorithms that will drive CineMatch. It aims to provide a clear, data-informed picture of the diverse landscape of cinema, setting the stage for a personalized and enriched movie selection experience.

## Part 1 - Installation and Dataset Loading
-----

### S1. Setting up the notebook

### 1.1. Importing all necessary packages

In [1]:
import string
import warnings     
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from collections import Counter
from typing import Dict, Text
from ast import literal_eval
from datetime import datetime
from wordcloud import WordCloud


### 1.1.1 Definition of Helper Functions

In [2]:
def analyze_dataframe(df):
    # Identifying categorical and numerical features
    categorical_features = [col for col in df.columns if df[col].dtype == "object"]
    numerical_features = [col for col in df.columns if df[col].dtype == "float64"]

    # Printing general information about the DataFrame
    print(f"Shape of the Data: {df.shape}")
    print(f"Total number of Data-Points in the Data, N: {df.shape[0]}")
    print(f"Total number of dimensions in the Data, D: {df.shape[1]}")
    print()
    
    # Printing details about features
    print(f"Categorical Features: {categorical_features}")
    print(f"Numerical Features: {numerical_features}")
    print(f"Total number of Categorical Features: {len(categorical_features)}")
    print(f"Total number of Numerical Features: {len(numerical_features)}")
    
    
def get_text(text, obj='name'):
    """
    Extracts and concatenates values from a string representation of a list of dictionaries.

    This function is useful for processing stringified data structures, particularly when 
    working with datasets that store complex, structured data in string formats. It parses 
    the string with ast.literal_eval (Python literals only, no arbitrary code execution) to 
    obtain a list of dictionaries and then extracts the specified values.

    Parameters:
    - text (str): A string representation of a list of dictionaries.
    - obj (str, optional): The key for which the value is to be extracted from each dictionary.
                           Defaults to 'name'.

    Returns:
    - str: A single string if there's only one dictionary in the list, or a comma-separated 
           string concatenating the values extracted from each dictionary if there are multiple.

    Example:
    >>> get_text("[{'name': 'Drama'}, {'name': 'Comedy'}]")
    'Drama, Comedy'
    """
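    # literal_eval is already imported at the top of the notebook
    items = literal_eval(text)

    # A single-element list returns that element's value directly
    if len(items) == 1:
        return items[0][obj]

    # Otherwise join all extracted values into one comma-separated string
    return ', '.join(item[obj] for item in items)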
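
The helper above assumes every cell is a well-formed string that parses to a list of dictionaries. As a minimal sketch of how missing or malformed cells could be handled, the hypothetical `extract_values` function below (not part of the original notebook) returns an empty string instead of raising. The sample strings are made up for illustration.

```python
from ast import literal_eval


def extract_values(text, key="name"):
    """Hypothetical, more defensive variant of get_text (illustration only)."""
    # Non-string input (for example NaN from a missing cell) yields an empty string.
    if not isinstance(text, str):
        return ""
    try:
        items = literal_eval(text)
    except (ValueError, SyntaxError):
        # Malformed literal: return an empty string instead of raising.
        return ""
    if not isinstance(items, list):
        return ""
    # Keep only dictionaries that actually carry the requested key.
    return ", ".join(
        str(d[key]) for d in items if isinstance(d, dict) and key in d
    )


sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(extract_values(sample))
print(repr(extract_values("[]")))
print(repr(extract_values(float("nan"))))
```

Whether silently returning an empty string is appropriate depends on the analysis. Raising on malformed input, as `get_text` does, makes data problems visible earlier.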


### 1.2. Loading all the Datasets into the notebook

In [3]:
warnings.filterwarnings('ignore')

credits = pd.read_csv('../Data/credits.csv')
keywords = pd.read_csv('../Data/keywords.csv')
movies = pd.read_csv('../Data/movies_metadata.csv')

pd.options.display.max_columns = 30
pd.set_option('display.float_format', '{:,}'.format)  # display floats with thousands separators

## Part 2 - Analysis of the Movies Metadata Dataset
------

### 2.1 Quick Look into the Dataset

In [4]:
movies.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

### 2.1.1 Initial Observations from the data.

In [6]:
analyze_dataframe(movies)

Shape of the Data: (45466, 24)
Total number of Data-Points in the Data, N: 45466
Total number of dimensions in the Data, D: 24

Categorical Features: ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'spoken_languages', 'status', 'tagline', 'title', 'video']
Numerical Features: ['revenue', 'runtime', 'vote_average', 'vote_count']
Total number of Categorical Features: 20
Total number of Numerical Features: 4
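
The output above lists `budget`, `popularity`, and `id` as categorical because they are stored with the `object` dtype, even though their values look numeric. The sketch below shows one way such columns could be converted, using a small made-up frame rather than the real dataset. The toy values and the choice of `errors="coerce"` are assumptions for illustration.

```python
import pandas as pd

# Toy frame mimicking the dtype situation above: numeric values stored as strings.
toy = pd.DataFrame({
    "budget": ["30000000", "65000000", "not_a_number"],
    "popularity": ["21.946943", "17.015539", "11.7129"],
    "revenue": [373554033.0, 262797249.0, 0.0],
})

# errors="coerce" turns unparseable entries into NaN instead of raising.
for col in ["budget", "popularity"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# select_dtypes(include="number") picks up integer and float columns alike.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(toy.dtypes)
print(numeric_cols)
```

Using `select_dtypes(include="number")` rather than comparing against `"float64"` would also count integer columns as numerical, which may or may not be desired in `analyze_dataframe`.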


### 2.2 Dropping Irrelevant Columns.


In the preprocessing phase of our analysis, we have chosen to remove specific columns from the `movies` DataFrame. The columns dropped are: `'belongs_to_collection'`, `'homepage'`, `'imdb_id'`, `'poster_path'`, `'status'`, `'title'`, and `'video'`. The decision to exclude these columns is based on the following considerations:

- **Belongs to Collection**: Information about whether a movie is part of a collection may not significantly impact our movie recommendation algorithms. Our focus is more on individual movie characteristics rather than their association with collections.

- **Homepage**: The URL of a movie’s homepage is typically not a determinant of its quality, popularity, or relevance to a viewer's preferences. Thus, this information is not critical for our analysis.

- **IMDb ID**: While useful for uniquely identifying movies, the IMDb ID is a technical attribute that does not contribute to the predictive or analytical aspects of our model.

- **Poster Path**: The path to the movie’s poster image is primarily a visual element and doesn’t provide quantifiable data for our analysis.

- **Status**: The status (e.g., Released, Post Production) is not a key factor in recommendation, as we are primarily interested in already available movies.

- **Title**: The movie title, while important for identification, is not used as a feature in our recommendation model. We focus on deeper, content-based features rather than titles which are arbitrary and don't hold analytical value.

- **Video**: Information about whether the entry is a video does not contribute to understanding the movie’s content or its appeal to the audience, which are more crucial for recommendation.

In [7]:
movies.drop(['belongs_to_collection', 'homepage', 'imdb_id', 
             'poster_path', 'status', 'title', 'video'], 
            axis='columns', inplace=True)

# Removing corrupted Data from the Dataset
movies.drop(movies.index[[19730, 29503, 35587]], inplace=True)
movies = movies.reset_index(drop=True)
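
The cell above removes corrupted rows by hard-coded positional index. As a hedged alternative sketch, corrupted rows could be located programmatically, assuming (this is an assumption, not something the notebook states) that a corrupted row is one whose `id` does not parse as a number. The toy frame and its date-like bad value are made up.

```python
import pandas as pd

# Toy frame: the middle row carries a malformed id (made-up example value).
toy = pd.DataFrame({
    "id": ["862", "1997-08-20", "8844"],
    "original_title": ["Toy Story", "shifted row", "Jumanji"],
})

# Assumption: rows whose id cannot be parsed as a number are corrupted.
parsed_id = pd.to_numeric(toy["id"], errors="coerce")
corrupted = parsed_id.isna()
print(toy.index[corrupted].tolist())

# Drop the flagged rows, reset the index, and cast id to an integer dtype.
clean = toy.loc[~corrupted].reset_index(drop=True)
clean["id"] = clean["id"].astype("int64")
print(clean)
```

A rule-based filter like this keeps working if the source file is re-downloaded with rows in a different order, whereas fixed positions would then point at the wrong rows.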

In [8]:
movies.head(3)

Unnamed: 0,adult,budget,genres,id,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count
0,False,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",,7.7,5415.0
1,False,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Roll the dice and unleash the excitement!,6.9,2413.0
2,False,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Still Yelling. Still Fighting. Still Ready for...,6.5,92.0


In [9]:
categoricalFeatures = [col for col in movies.columns if movies[col].dtype == "object"]
numericalFeatures = [col for col in movies.columns if movies[col].dtype == "float64"]


print(f"Shape of the Movies Data: {movies.shape}")
print(f"Total number of Data-Points in the Movies Data, N: {movies.shape[0]}")
print(f"Total number of dimensions in the Movies Data, D: {movies.shape[1]}")

print()

print(f"Categorical Features: {categoricalFeatures}")
print(f"Numerical Features: {numericalFeatures}")
print(f"Total number of Categorical Features: {len(categoricalFeatures)}")
print(f"Total number of Numerical Features: {len(numericalFeatures)}")

Shape of the Movies Data: (45463, 17)
Total number of Data-Points in the Movies Data, N: 45463
Total number of dimensions in the Movies Data, D: 17

Categorical Features: ['adult', 'budget', 'genres', 'id', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'spoken_languages', 'tagline']
Numerical Features: ['revenue', 'runtime', 'vote_average', 'vote_count']
Total number of Categorical Features: 13
Total number of Numerical Features: 4


### 2.4 Merging the Keywords and Credits Datasets with the Movies Dataset

In [10]:
movies['id'] = movies['id'].astype('int64')

df = movies.merge(keywords, on='id').merge(credits, on='id')
analyze_dataframe(df)

Shape of the Data: (46628, 20)
Total number of Data-Points in the Data, N: 46628
Total number of dimensions in the Data, D: 20

Categorical Features: ['adult', 'budget', 'genres', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'spoken_languages', 'tagline', 'keywords', 'cast', 'crew']
Numerical Features: ['revenue', 'runtime', 'vote_average', 'vote_count']
Total number of Categorical Features: 15
Total number of Numerical Features: 4
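
The merged frame has 46,628 rows, more than the 45,463 rows in `movies` before the merge. One possible explanation, not verified here, is that some `id` values occur more than once in the inputs, since an inner merge emits one row per matching pair. The toy frames below are made up to illustrate the effect and one way to guard against it.

```python
import pandas as pd

# Toy inputs in which id 2 appears twice on both sides (made-up data).
movies_toy = pd.DataFrame({"id": [1, 2, 2], "original_title": ["A", "B", "B"]})
keywords_toy = pd.DataFrame({"id": [1, 2, 2], "keywords": ["k1", "k2", "k2"]})

# Count repeated keys before merging.
print(int(movies_toy["id"].duplicated().sum()))

# Inner merge: id 1 gives 1 row, id 2 gives 2 x 2 = 4 rows.
merged = movies_toy.merge(keywords_toy, on="id")
print(len(merged))

# Dropping duplicate keys first keeps one row per id;
# validate="one_to_one" raises MergeError if duplicates remain.
deduped = movies_toy.drop_duplicates("id").merge(
    keywords_toy.drop_duplicates("id"), on="id", validate="one_to_one"
)
print(len(deduped))
```

The notebook removes duplicate titles later in section 2.5.3, so the inflated row count does not necessarily carry through to the final frame.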


### 2.5. Data Cleaning and Preprocessing

In [11]:
df['original_language'] = df['original_language'].fillna('')
df['runtime'] = df['runtime'].fillna(0)
df['tagline'] = df['tagline'].fillna('')

# Dropping Missing Values
df.dropna(inplace=True)
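
The cell above fills three columns with placeholder values and then drops every row that still has a missing value in any column. A minimal sketch of how the effect of such a step could be inspected is shown below on a made-up frame. The column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values spread over three columns (made-up data).
toy = pd.DataFrame({
    "tagline": ["Roll the dice", np.nan, np.nan],
    "runtime": [104.0, np.nan, 81.0],
    "overview": ["text", "text", np.nan],
})

# Missing values per column before any cleaning.
print(toy.isna().sum())

# Fill selected columns with placeholders, then drop rows that are still incomplete.
filled = toy.fillna({"tagline": "", "runtime": 0})
remaining = filled.dropna()
print(len(toy), "->", len(remaining))
```

Note that a placeholder runtime of 0 is indistinguishable from a real value in later statistics, so summaries of `runtime` may need to exclude zeros.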

### 2.5.1 Formatting the JSON data to String values
The columns `genres`, `production_companies`, `production_countries`, `spoken_languages`, `keywords`, and `crew` hold lists of dictionaries serialized as strings. The following code cell parses each string with `literal_eval` (through the `get_text` helper) and keeps the `name` values as a single comma-separated string.

In [12]:
# Cleaning the columns
df['genres'] = df['genres'].apply(get_text)
df['production_companies'] = df['production_companies'].apply(get_text)
df['production_countries'] = df['production_countries'].apply(get_text)
df['crew'] = df['crew'].apply(get_text)
df['spoken_languages'] = df['spoken_languages'].apply(get_text)
df['keywords'] = df['keywords'].apply(get_text)

### 2.5.2 Creation of new Columns from Cast

In [13]:
df['characters'] = df['cast'].apply(get_text, obj='character')
df['actors'] = df['cast'].apply(get_text)

df.drop('cast', axis=1, inplace=True)

### 2.5.3 Remove Rows with Duplicate Titles from the DataFrame and Reset the Index

In [14]:
df = df[~df['original_title'].duplicated()]
df = df.reset_index(drop=True)
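
The cell above keeps the first row for each `original_title`. Distinct movies that share a title are therefore collapsed into one row. The sketch below contrasts this with deduplicating on `id`, using made-up rows. Whether title-level or id-level deduplication is preferable depends on how the recommender looks movies up, which this sketch does not decide.

```python
import pandas as pd

# Toy frame: ids 10 and 20 are different movies sharing a title;
# id 10 also appears twice as an exact duplicate (made-up data).
toy = pd.DataFrame({
    "id": [10, 10, 20, 30],
    "original_title": ["Same Title", "Same Title", "Same Title", "Other Title"],
})

# Title-based deduplication, as in the cell above.
by_title = toy[~toy["original_title"].duplicated()].reset_index(drop=True)

# Id-based deduplication keeps both movies that share a title.
by_id = toy.drop_duplicates(subset="id").reset_index(drop=True)

print(len(by_title))
print(len(by_id))
```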

### 2.5.4 Final Overview of the Transformed & Cleansed Dataset

In [15]:
df.head()

Unnamed: 0,adult,budget,genres,id,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,keywords,crew,characters,actors
0,False,30000000,"Animation, Comedy, Family",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,Pixar Animation Studios,United States of America,1995-10-30,373554033.0,81.0,English,,7.7,5415.0,"jealousy, toy, boy, friendship, friends, rival...","John Lasseter, Joss Whedon, Andrew Stanton, Jo...","Woody (voice), Buzz Lightyear (voice), Mr. Pot...","Tom Hanks, Tim Allen, Don Rickles, Jim Varney,..."
1,False,65000000,"Adventure, Fantasy, Family",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"TriStar Pictures, Teitler Film, Interscope Com...",United States of America,1995-12-15,262797249.0,104.0,"English, Français",Roll the dice and unleash the excitement!,6.9,2413.0,"board game, disappearance, based on children's...","Larry J. Franco, Jonathan Hensleigh, James Hor...","Alan Parrish, Samuel Alan Parrish / Van Pelt, ...","Robin Williams, Jonathan Hyde, Kirsten Dunst, ..."
2,False,0,"Romance, Comedy",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"Warner Bros., Lancaster Gate",United States of America,1995-12-22,0.0,101.0,English,Still Yelling. Still Fighting. Still Ready for...,6.5,92.0,"fishing, best friend, duringcreditsstinger, ol...","Howard Deutch, Mark Steven Johnson, Mark Steve...","Max Goldman, John Gustafson, Ariel Gustafson, ...","Walter Matthau, Jack Lemmon, Ann-Margret, Soph..."
3,False,16000000,"Comedy, Drama, Romance",31357,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81452156.0,127.0,English,Friends are the people who let you be yourself...,6.1,34.0,"based on novel, interracial relationship, sing...","Forest Whitaker, Ronald Bass, Ronald Bass, Ezr...","Savannah 'Vannah' Jackson, Bernadine 'Bernie' ...","Whitney Houston, Angela Bassett, Loretta Devin..."
4,False,0,Comedy,11862,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,"Sandollar Productions, Touchstone Pictures",United States of America,1995-02-10,76578911.0,106.0,English,Just When His World Is Back To Normal... He's ...,5.7,173.0,"baby, midlife crisis, confidence, aging, daugh...","Alan Silvestri, Elliot Davis, Nancy Meyers, Na...","George Banks, Nina Banks, Franck Eggelhoffer, ...","Steve Martin, Diane Keaton, Martin Short, Kimb..."


In [16]:
analyze_dataframe(df)

Shape of the Data: (42373, 21)
Total number of Data-Points in the Data, N: 42373
Total number of dimensions in the Data, D: 21

Categorical Features: ['adult', 'budget', 'genres', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'spoken_languages', 'tagline', 'keywords', 'crew', 'characters', 'actors']
Numerical Features: ['revenue', 'runtime', 'vote_average', 'vote_count']
Total number of Categorical Features: 16
Total number of Numerical Features: 4
