# Visualizations

This notebook provides exploratory and explanatory analysis of the dataset before any processing for machine learning.

## Library Importation, Folder Creation and Function Implementation

In [1]:
# Importing numpy and pandas for basic data manipulation
import numpy as np
import pandas as pd

# Importing os for operating system interfacing
import os

# Importing matplotlib and seaborn for basic data visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Creating a new folder for visuals
vis_folder = "Visualizations/"
os.makedirs(vis_folder, exist_ok=True)
os.listdir(vis_folder)

[]

In [3]:
# Loading the compressed TMDB dataset
df = pd.read_csv("Data/tmdb_data.csv.gz", lineterminator="\n")
df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,,,,,,,,,,,...,,,,,,,,,,
1,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,Two rural teens sing and dance their way throu...,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,Earth is in a state of constant war and two co...,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,"After falling prey to underworld, four friends...",...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,Two neighbors become intimate after discoverin...,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.119,2204.0,PG


### Functions
Various functions to assist in visualizations

## Preliminary Data Cleaning

This section determines which columns or rows to drop because too much data is missing. All duplicates were dropped in a prior notebook.

In [4]:
df.shape

(62809, 25)

In [5]:
# Checking for missing values
df.isna().sum()

adult                        1
backdrop_path            22291
belongs_to_collection    58600
budget                       1
genres                       1
homepage                 47721
id                           1
original_language            1
original_title               1
overview                  1337
popularity                   1
poster_path               4904
production_companies         1
production_countries         1
release_date               953
revenue                      1
runtime                      1
spoken_languages             1
status                       1
tagline                  38870
title                        1
video                        1
vote_average                 1
vote_count                   1
certification            47627
dtype: int64

In [6]:
# Checking for missing values by percentage
df.isna().sum() / len(df) * 100

adult                     0.001592
backdrop_path            35.490137
belongs_to_collection    93.298731
budget                    0.001592
genres                    0.001592
homepage                 75.977965
id                        0.001592
original_language         0.001592
original_title            0.001592
overview                  2.128676
popularity                0.001592
poster_path               7.807798
production_companies      0.001592
production_countries      0.001592
release_date              1.517298
revenue                   0.001592
runtime                   0.001592
spoken_languages          0.001592
status                    0.001592
tagline                  61.886035
title                     0.001592
video                     0.001592
vote_average              0.001592
vote_count                0.001592
certification            75.828305
dtype: float64
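The count and percentage views above can be combined into one table, which makes columns easier to compare side by side. The following is a minimal sketch: the helper name `missing_summary` and the small demonstration frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": counts / len(frame) * 100}
    )
    return summary.sort_values("missing", ascending=False)


# Hypothetical toy frame standing in for the TMDB data
demo = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "tagline": [np.nan, "x", np.nan, np.nan],
        "runtime": [90.0, np.nan, 100.0, 110.0],
    }
)
demo_summary = missing_summary(demo)
print(demo_summary)
```

In the notebook itself, the same helper could be called as `missing_summary(df)` in place of the two separate cells.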

In [7]:
# Dropping rows where "adult" is missing to see if that resolves the majority of issues
df = df.dropna(subset=["adult"])

df.isna().sum()/len(df)*100

adult                     0.000000
backdrop_path            35.489110
belongs_to_collection    93.298624
budget                    0.000000
genres                    0.000000
homepage                 75.977582
id                        0.000000
original_language         0.000000
original_title            0.000000
overview                  2.127118
popularity                0.000000
poster_path               7.806330
production_companies      0.000000
production_countries      0.000000
release_date              1.515730
revenue                   0.000000
runtime                   0.000000
spoken_languages          0.000000
status                    0.000000
tagline                  61.885429
title                     0.000000
video                     0.000000
vote_average              0.000000
vote_count                0.000000
certification            75.827920
dtype: float64
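Every column that reported exactly one missing value now reports zero, which suggests that a single row was empty across all fields (row 0 in the preview above). A check along the following lines could confirm this before dropping anything. It is only a sketch, and the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first row is empty in every column
demo = pd.DataFrame(
    {
        "adult": [np.nan, 0.0, 0.0],
        "title": [np.nan, "The Fantasticks", "Gang"],
        "tagline": [np.nan, "Try to remember", np.nan],
    }
)

# Rows where every column is missing
empty_rows = demo[demo.isna().all(axis=1)]
print(len(empty_rows))

# dropna(how="all") removes only the fully empty rows and keeps
# rows that are missing just some values (such as a tagline)
cleaned = demo.dropna(how="all")
```

If the check finds exactly one such row, dropping on a single column such as "adult" and dropping with `how="all"` should give the same result.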

The vast majority of movies do not belong to a collection (about 93% missing) or have a homepage (about 76% missing), so we will remove those columns. The "backdrop_path" column only holds the path to a movie's backdrop image and can also be removed safely.

Certification (G, PG, PG-13, etc.) will be kept for now, but all missing values will be filled in as "missing".
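The certification fill described above could be done with `fillna`. This is a sketch that assumes the literal string "missing" serves as the placeholder category. The toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two uncertified movies
demo = pd.DataFrame(
    {
        "title": ["In the Mood for Love", "Gang", "For the Cause"],
        "certification": ["PG", np.nan, np.nan],
    }
)

# Replace missing certifications with an explicit "missing" category
demo["certification"] = demo["certification"].fillna("missing")
print(demo["certification"].value_counts())
```

Keeping "missing" as its own category lets later plots show uncertified movies as a group instead of silently excluding them.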

In [10]:
df = df.drop(columns = ["backdrop_path", "belongs_to_collection", "homepage"])

df.isna().sum()/len(df)*100

adult                    0.000000
budget                   0.000000
genres                   0.000000
id                       0.000000
original_language        0.000000
original_title           0.000000
overview                 2.127118
popularity               0.000000
poster_path              7.806330
production_companies     0.000000
production_countries     0.000000
release_date             1.515730
revenue                  0.000000
runtime                  0.000000
spoken_languages         0.000000
status                   0.000000
tagline                 61.885429
title                    0.000000
video                    0.000000
vote_average             0.000000
vote_count               0.000000
certification           75.827920
dtype: float64

## Exploratory Data Analysis

Predominantly univariate analyses

## Explanatory Data Analysis

Predominantly multivariate analyses