# Business Analysis on Movie Production
> ## IMDB Files ETL (Extract, Transform and Load)

## Business Problem

A client has requested a MySQL database of movies, built from a subset of IMDB's publicly available datasets.
The client specified the following files for the database:
>- title.basics.tsv.gz
>- title.ratings.tsv.gz
>- title.akas.tsv.gz

An in-depth analysis of the dataset is to be performed, ending with recommendations on the factors that make a movie successful.


## Data Source

The datasets were obtained from the IMDB website.

> Data dictionary: https://www.imdb.com/interfaces/

> Data download link: https://datasets.imdbws.com/


## Database Specifications:
The following specifications were given for the production of the database.

>- Exclude any movie with missing values for genre or runtime
>- Include only full-length movies (titleType = "movie").
>- Include only fictional movies (not from documentary genre)
>- Include only movies that were released 2000 - 2021 (include 2000 and 2021)
>- Include only movies that were released in the United States

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os

## Files



In [2]:
# Saving urls into variables
basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"
ratings_url ="https://datasets.imdbws.com/title.ratings.tsv.gz"
akas_url = "https://datasets.imdbws.com/title.akas.tsv.gz"

### Title Basics

> **Filters:**
>>- Replace "\N" with np.nan
>>- Eliminate movies that have a null value for genres or runtimeMinutes
>>- Include only movies with startYear 2000-2021
>>- Include only full-length movies (titleType "movie")
>>- Eliminate movies that include "Documentary" in genres


In [3]:
#Loading the data into a dataframe saved as a variable.
basics= pd.read_csv(basics_url, sep ='\t', low_memory =False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


>- **Replace "\N" with np.nan**

In [4]:
#Converting the \N null value into a pandas recognizable format of np.nan
basics =basics.replace({"\\N":np.nan})
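
An alternative worth considering is to let the parser handle the placeholder at load time. The sketch below uses the `na_values` parameter of `pd.read_csv` on a two-row sample; the second row is made up for illustration, and the sample keeps only a few of the real columns.

```python
import io

import pandas as pd

# Toy tab-separated sample in the IMDB style; the second row is invented.
sample_tsv = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\t1894\t1\tDocumentary,Short\n"
    "tt0000002\tmovie\t\\N\t\\N\tDrama\n"
)

# na_values tells the parser to treat the literal string \N as missing.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(sample.isnull().sum())
```

With this approach the numeric columns are parsed as numbers straight away, so a separate `replace` pass over the whole dataframe may not be needed.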

>- **Eliminating movies with null values in genres or runtimeMinutes**



Checking the null values

In [5]:
#Null values check
basics.isnull().sum()

tconst                  0
titleType               0
primaryTitle           11
originalTitle          11
isAdult                 1
startYear         1197975
endYear           8905264
runtimeMinutes    6573562
genres             410067
dtype: int64

In [6]:
#Dropping rows with null genres or runtimeMinutes; startYear is included because the year filter needs it
basics = basics.dropna(subset=['genres', 'runtimeMinutes', 'startYear'])

In [7]:
#Confirming changes
basics.isnull().sum()

tconst                  0
titleType               0
primaryTitle            0
originalTitle           0
isAdult                 0
startYear               0
endYear           2277259
runtimeMinutes          0
genres                  0
dtype: int64

>- **Keeping only full-length movies (titleType = "movie")**

In [8]:
#selecting only full length movies with title type as movie
basics = basics[basics["titleType"] == 'movie']

In [9]:
#Confirming changes
basics.head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama
672,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120,"Adventure,Fantasy"
1172,tt0001184,movie,Don Juan de Serrallonga,Don Juan de Serrallonga,0,1910,,58,"Adventure,Drama"
1273,tt0001285,movie,The Life of Moses,The Life of Moses,0,1909,,50,"Biography,Drama,Family"


>- **Keeping only movies that were released between 2000 and 2021 (both years inclusive)**

In [10]:
#Converting the column startYear into an integer from string.
basics['startYear'] = basics['startYear'].astype(int)

In [11]:
#Selecting the start years from 2000 to 2021 (inclusive)
basics = basics[(basics['startYear'] >=2000 ) & (basics['startYear'] <=2021)]
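
For illustration, the same range check can be written with `pd.to_numeric` and `Series.between`. This is only a sketch on a made-up series of year strings, not the notebook's own code; it is meant to show how unparseable or missing years behave.

```python
import pandas as pd

# Hypothetical year values, stored as strings the way the raw file provides them.
years = pd.Series(["1999", "2000", "2021", "2022", None], name="startYear")

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric_years = pd.to_numeric(years, errors="coerce")

# between() is inclusive on both ends by default, matching "2000 - 2021 inclusive".
in_range = numeric_years.between(2000, 2021)
print(in_range.tolist())
```

This prints `[False, True, True, False, False]`: both boundary years pass, and the missing year is treated as out of range.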

In [12]:
#Confirming changes
basics.tail()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8996686,tt9916362,movie,Coven,Akelarre,0,2020,,92,"Drama,History"
8996770,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019,,123,Drama
8996811,tt9916622,movie,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,0,2015,,57,Documentary
8996838,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,0,2007,,100,Documentary
8996871,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,0,2013,,49,Documentary


>- **Keeping only fictional movies (not from the documentary genre)**

In [13]:
#Flagging documentaries, then keeping only the movies that are NOT documentaries
is_documentary = basics['genres'].str.contains('Documentary', case = False)
basics = basics[~is_documentary]
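
The mask above works here because rows with missing genres were already dropped. If that were not the case, `str.contains` would return NaN for those rows and the negation would fail. A minimal sketch, on made-up values, of how the `na` parameter avoids this:

```python
import numpy as np
import pandas as pd

# Hypothetical genre strings, including one missing value.
genres = pd.Series(["Drama", "Documentary,Short", np.nan, "Comedy,Romance"])

# na=False makes missing genres evaluate to False rather than NaN,
# so the mask is safe to negate and use for indexing.
is_documentary = genres.str.contains("Documentary", case=False, na=False)
fictional = genres[~is_documentary]
print(fictional.tolist())
```

Note that with `na=False` the row with a missing genre is kept rather than removed, so the null check still has to happen separately.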

In [14]:
#confirming changes
basics.tail()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8996593,tt9916170,movie,The Rehearsal,O Ensaio,0,2019,,51,Drama
8996602,tt9916190,movie,Safeguard,Safeguard,0,2020,,90,"Action,Adventure,Thriller"
8996641,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020,,84,Thriller
8996686,tt9916362,movie,Coven,Akelarre,0,2020,,92,"Drama,History"
8996770,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019,,123,Drama


>- **Saving the filtered basics dataset temporarily as a .csv.gz file**

In [15]:
#Creating the output directory (no error if it already exists)
os.makedirs('Movie_files', exist_ok=True)

In [16]:
#Confirming if the empty directory was created
os.listdir('Movie_files/')

[]

>- **Saving of dataframe**

In [17]:
#Saving the basics dataframe as a .csv.gz file in the Movie_files directory
basics.to_csv("Movie_files/Title_Basics.csv.gz",compression = "gzip", index =False)

In [18]:
#confirming saved file
basic= pd.read_csv("Movie_files/Title_Basics.csv.gz", low_memory = False)
basic.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


**The Title_Basics file was successfully saved**
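
Beyond inspecting `head()`, the save can be checked by comparing the reloaded data with the original. The sketch below does this on a two-row frame written to a temporary directory; the temporary path and the reduced column set are assumptions made so the example does not touch `Movie_files`.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the output above, with only two of the columns.
frame = pd.DataFrame(
    {"tconst": ["tt0035423", "tt0069049"], "startYear": [2001, 2018]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Title_Basics.csv.gz")
    # Same call pattern as the notebook: gzip compression, no index column.
    frame.to_csv(path, compression="gzip", index=False)
    reloaded = pd.read_csv(path)

# Compare the column contents of the original and reloaded frames.
round_trip_ok = frame.to_dict("list") == reloaded.to_dict("list")
print(round_trip_ok)
```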

In [19]:
#Deleting the basics dataframe to free memory before processing the other files
del basics

### Title Akas
>**Filters:**
>>- Keep only US entries
>>- Replace "\N" with np.nan
>>- Keep only the Title_Basics movies that have a US entry

In [20]:
# Loading the title.akas file. low_memory=True parses the file in chunks to reduce memory use while loading
akas= pd.read_csv(akas_url, sep ='\t', low_memory =True)
akas.head()


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


In [21]:
#Filtering only US entries
akas =akas.loc[akas["region"]=="US"]
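
Because the akas file is large, one option is to filter while reading so the full table never sits in memory. The sketch below uses the `chunksize` parameter of `pd.read_csv` on four sample rows taken from the outputs in this section, trimmed to four columns. The chunk size of 2 is only for demonstration; a real run would use a much larger value.

```python
import io

import pandas as pd

# Sample rows from the akas outputs shown in this section.
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt0000001\t2\tCarmencita\tDE\n"
    "tt0000001\t3\tCarmencita - spanyol tánc\tHU\n"
    "tt0000001\t6\tCarmencita\tUS\n"
    "tt0000002\t7\tThe Clown and His Dogs\tUS\n"
)

us_chunks = []
# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(sample_tsv), sep="\t", chunksize=2):
    # Keep only the US rows of each chunk before moving on.
    us_chunks.append(chunk[chunk["region"] == "US"])

akas_us = pd.concat(us_chunks, ignore_index=True)
print(akas_us["titleId"].tolist())
```

The first chunk contributes no rows and the second contributes both, so the result lists `tt0000001` and `tt0000002`.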

In [22]:
#Confirming changes
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,\N,imdbDisplay,\N,0
14,tt0000002,7,The Clown and His Dogs,US,\N,\N,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,\N,imdbDisplay,\N,0
36,tt0000005,1,Blacksmithing Scene,US,\N,alternative,\N,0
41,tt0000005,6,Blacksmith Scene #1,US,\N,alternative,\N,0


>- **Converting the \N null value format into np.nan**

In [23]:
#Converting the \N null value format into np.nan
akas =akas.replace({"\\N":np.nan})

In [24]:
#Checking the null values
akas.isnull().sum()

titleId                  0
ordering                 0
title                    0
region                   0
language           1324672
types               302122
attributes         1284080
isOriginalTitle       1375
dtype: int64

>- **Filtering the previously saved Title_Basics file to keep only the titles that have a US entry in the akas dataframe.**

In [25]:
# loading the Title_Basics file again
basics = pd.read_csv(r"Movie_files/Title_Basics.csv.gz", low_memory=False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


In [26]:
#Flagging the basics rows whose ID also appears in the US akas entries
US_filtered = basics['tconst'].isin(akas["titleId"])
US_filtered.head()

0     True
1     True
2     True
3    False
4     True
Name: tconst, dtype: bool
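
A small sketch of why a boolean `isin` mask suits this step: akas can list the same title several times, and a mask keeps each basics row at most once, whereas an inner merge would repeat it. The frames and IDs below are made up for illustration.

```python
import pandas as pd

# Hypothetical IDs; akas_toy repeats tt1 the way akas repeats alternative titles.
basics_toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"]})
akas_toy = pd.DataFrame({"titleId": ["tt1", "tt1", "tt3"], "region": ["US"] * 3})

# isin yields one True/False per basics row, regardless of repeats in akas.
mask = basics_toy["tconst"].isin(akas_toy["titleId"])
kept = basics_toy[mask]

# For comparison: an inner merge returns one row per matching akas row.
merged = basics_toy.merge(akas_toy, left_on="tconst", right_on="titleId")

print(kept["tconst"].tolist())
print(len(merged))
```

Here `kept` holds two rows (`tt1` and `tt3`) while `merged` holds three, because `tt1` matches twice.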

In [27]:
#Selecting only the entries with the previous filtered IDs
Basics= basics[US_filtered]
Basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
7,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,,74,"Horror,Music,Thriller"


In [28]:
#Saving the US-filtered basics dataframe as a .csv.gz file in the Movie_files directory
Basics.to_csv("Movie_files/Title_Basics.csv.gz",compression = "gzip", index =False)

### Title Ratings
**Filters:**
>- Replace "\N" with np.nan


In [29]:
#Loading file into a dataframe and saving in a variable
ratings =pd.read_csv(ratings_url, sep="\t", low_memory =False)

In [30]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1887
1,tt0000002,5.9,250
2,tt0000003,6.5,1678
3,tt0000004,5.8,163
4,tt0000005,6.2,2498


>- **Replace "\N" with np.nan**

In [31]:
#Replacing \N with np.nan
ratings = ratings.replace({"\\N": np.nan})

In [32]:
#Save the dataframe into a csv.gz file
ratings.to_csv("Movie_files/Title_Ratings.csv.gz", compression ="gzip", index=False)

In [33]:
#Confirming if saving of file was successful
ratings = pd.read_csv("Movie_files/Title_Ratings.csv.gz", low_memory=False)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1887
1,tt0000002,5.9,250
2,tt0000003,6.5,1678
3,tt0000004,5.8,163
4,tt0000005,6.2,2498


**The Title_Ratings file was successfully saved**