Name: Akinde Kadjo

**Project 3 Part 1 Goal: Produce a MySQL database of movies from a subset of IMDb's publicly available dataset. Ultimately, I will use this database to analyze what makes a movie successful and will give the stakeholder recommendations on how to make a successful movie.**

# Imports and Data Loading

In [1]:
#Importing all of the libraries that may be needed for the project
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:
#Loading the data from the url
basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
ratings_url ="https://datasets.imdbws.com/title.ratings.tsv.gz"
akas_url ="https://datasets.imdbws.com/title.akas.tsv.gz"
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
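The IMDb files mark missing values with the literal string `\N`, which the cleaning cells below replace with `np.nan` after loading. As a possible alternative, the placeholder can be declared at read time through the `na_values` argument of `pd.read_csv`. The sketch below assumes a tiny in-memory sample (`sample_tsv` is made up for illustration and is not IMDb data) standing in for the real download.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for title.basics.tsv.gz.
sample_tsv = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\n"
    "tt0000001\tshort\t1894\t1\n"
    "tt0000009\tmovie\t1894\t\\N\n"
    "tt0000574\tmovie\t\\N\t70\n"
)

# na_values tells read_csv to treat the literal string \N as missing,
# so no separate replace() step is needed afterwards.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count missing values per column.
print(sample.isna().sum())
```

Columns parsed this way come back as numeric types with `NaN` rather than as strings, which may also remove the need for a later `astype(float)`.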

# Data Cleaning

## Akas df Cleaning

In [3]:
#Replace "\N" with np.nan
akas.replace({'\\N':np.nan}, inplace = True)
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,,imdbDisplay,,0
1,tt0000001,2,Carmencita,DE,,,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,,imdbDisplay,,0
3,tt0000001,4,Καρμενσίτα,GR,,imdbDisplay,,0
4,tt0000001,5,Карменсита,RU,,imdbDisplay,,0


In [4]:
akas.shape

(33781109, 8)

In [5]:
#keep only the rows whose region is the US
in_us = akas['region'] == 'US'
akas_df = akas[in_us]
akas_df.shape

(1366189, 8)

In [6]:
#saving it to the Data folder
akas_df.to_csv('Data/akas.csv') 
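By default `to_csv` also writes the DataFrame index, and after filtering that index is just the leftover row positions from the full table. When the file is read back, the index shows up as an extra `Unnamed: 0` column. The sketch below uses a made-up two-row frame (`demo` is hypothetical) and in-memory buffers instead of the `Data` folder to show the difference that `index=False` makes.

```python
import io
import pandas as pd

# Hypothetical frame with a non-default index, like a filtered akas_df.
demo = pd.DataFrame(
    {"titleId": ["tt0000001", "tt0000009"], "region": ["US", "US"]},
    index=[5, 42],
)

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to drop the index depends on how the saved files are consumed later; if nothing relies on the old row positions, `index=False` keeps the CSV columns identical to the DataFrame columns.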

## Basics df Cleaning

In [7]:
#Replace "\N" with np.nan
basics.replace({'\\N':np.nan}, inplace = True)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


In [8]:
basics.shape

(9362100, 9)

In [9]:
#Eliminate movies that are null for runtimeMinutes
basics.dropna(subset=['runtimeMinutes'], inplace=True)
basics.shape

(2561800, 9)

In [10]:
#Eliminate movies that are null for genre
basics.dropna(subset=['genres'], inplace=True)
basics.shape

(2493633, 9)

In [11]:
#keep only rows where titleType == 'movie'
movie_filt = basics['titleType'] == 'movie'
basics_movie = basics[movie_filt]
basics_movie.shape

(371520, 9)

In [12]:
basics_movie.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45,Romance
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama
672,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120,"Adventure,Fantasy"
930,tt0000941,movie,Locura de amor,Locura de amor,0,1909,,45,Drama


In [13]:
#keep movies with startYear after 2000 (2000 itself is excluded and no upper bound is applied)
#assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
basics_movie = basics_movie.assign(startYear=basics_movie['startYear'].astype(float))
year_filt = basics_movie['startYear'] > 2000
basics_year = basics_movie[year_filt]
basics_year.shape


(215584, 9)
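The filter above keeps rows whose `startYear` is strictly greater than 2000. If an inclusive 2000 to 2022 window is what is wanted, `Series.between` is one way to express it; the row count would then differ from the shape reported above. A minimal sketch on made-up years (the `years` Series is hypothetical):

```python
import pandas as pd

# Hypothetical start years, including a missing value.
years = pd.Series([1999.0, 2000.0, 2001.0, 2022.0, 2023.0, float("nan")])

# between() is inclusive on both ends by default; NaN compares as False.
in_window = years.between(2000, 2022)
kept_years = years[in_window].tolist()

print(kept_years)
```

Applied to the notebook's data this would read `basics_movie[basics_movie['startYear'].between(2000, 2022)]`, assuming `startYear` has already been converted to a numeric type.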

In [14]:
#Eliminate movies that include "Documentary" in genre
is_documentary = basics_year['genres'].str.contains('documentary',case=False)
basic_genre = basics_year[~is_documentary]
basic_genre.shape

(142395, 9)
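The documentary filter relies on `str.contains` matching a substring anywhere in the comma-separated `genres` string, and on `~` inverting the resulting boolean mask. A small sketch on made-up values (the `genres` Series is hypothetical) shows the behavior, including the `na=False` argument, which matters only when nulls are still present; in this notebook they were already dropped in an earlier cell.

```python
import pandas as pd

# Hypothetical genre strings, with one missing value.
genres = pd.Series(["Documentary,Short", "Drama", "Comedy,documentary", None])

# case=False makes the match case-insensitive; na=False treats missing
# values as "not a documentary" instead of propagating a missing result.
is_doc = genres.str.contains("documentary", case=False, na=False)

# ~ flips the mask so that only non-documentary rows are kept.
kept = genres[~is_doc]

print(is_doc.tolist())
print(len(kept))
```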

In [15]:
#Keep only US movies
# Filter the basics table down to titles that appear in the US-filtered akas dataframe
keepers = basic_genre['tconst'].isin(akas_df['titleId'])
basics_df = basic_genre[keepers]
basics_df.shape

(82225, 9)
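Filtering with `isin` acts like a semi-join: a title is kept if its `tconst` appears at least once among the akas `titleId` values, and repeated matches do not duplicate rows the way a merge would. A minimal sketch with made-up identifiers (`titles` and `us_akas` are hypothetical stand-ins for `basic_genre` and `akas_df`):

```python
import pandas as pd

# Hypothetical stand-in for the basics table.
titles = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
)

# Hypothetical stand-in for the US-filtered akas table; tt1 appears twice.
us_akas = pd.DataFrame(
    {"titleId": ["tt1", "tt1", "tt3"], "region": ["US", "US", "US"]}
)

# True where the title has at least one US entry.
keep = titles["tconst"].isin(us_akas["titleId"])
us_titles = titles[keep]

print(us_titles["tconst"].tolist())
```

Because the mask has one entry per row of `titles`, the result can never contain more rows than the table being filtered.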

In [16]:
#saving it to the Data folder
basics_df.to_csv('Data/basics.csv') 

## Ratings df Cleaning

In [17]:
#Replace "\N" with np.nan
ratings.replace({'\\N':np.nan}, inplace = True)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1922
1,tt0000002,5.8,259
2,tt0000003,6.5,1734
3,tt0000004,5.6,174
4,tt0000005,6.2,2545


In [42]:
ratings.shape

(1246148, 3)

In [45]:
#keep only titles that have a US entry in akas
# Filter the ratings table down to titles that appear in the US-filtered akas dataframe
keepers = ratings['tconst'].isin(akas_df['titleId'])
ratings_df = ratings[keepers]
ratings_df.shape

(474250, 3)

In [None]:
#saving it to the Data folder
ratings_df.to_csv('Data/ratings.csv') 