# IMDB Movie Data

### Jordan Miranda

In this notebook we'll look at a small IMDB dataset of movies, with information about each movie such as genre, duration, and country.

The goal of this notebook is to clean the dataset and prepare it for possible feature engineering and modelling later on. To clean the data we'll need to address any null values, clean and standardize values across the dataset, and possibly rename columns so they accurately reflect the data they contain.

We'll get started by importing the standard tools for data manipulation in Python: `pandas` and `NumPy`.
In case any data visualization is necessary, we'll also import `Matplotlib` and `seaborn`.

In [1]:
# importing standard data tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

Next we'll need to import the data we'll be working with.

In [2]:
# Loading in our dataset, including the delimiter used to separate the values
movies_df = pd.read_csv("data/messy_IMDB_dataset.csv", sep=";")
movies_df.head()

Unnamed: 0,IMBD title ID,Original titl�,Release year,Genr�,Duration,Country,Content Rating,Director,Unnamed: 8,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152.0,US,PG-13,Christopher Nolan,,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,,$ 222831817,1.780.147,"8,9f"


## Introductory Information on the Dataset

Now that we've loaded the dataset, let's gather some basic info about the data such as its shape, datatypes, missing values (if any), and so on.

In [3]:
movies_df.shape

(101, 12)

This dataset contains 101 rows with 12 columns. Let's look at the names of the columns and their datatypes.

In [4]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   IMBD title ID   100 non-null    object 
 1   Original titl�  100 non-null    object 
 2   Release year    100 non-null    object 
 3   Genr�           100 non-null    object 
 4   Duration        99 non-null     object 
 5   Country         100 non-null    object 
 6   Content Rating  77 non-null     object 
 7   Director        100 non-null    object 
 8   Unnamed: 8      0 non-null      float64
 9   Income          100 non-null    object 
 10   Votes          100 non-null    object 
 11  Score           100 non-null    object 
dtypes: float64(1), object(11)
memory usage: 9.6+ KB


## Initial Problems

The output above lists the 12 columns (index excluded) in this dataset. Using `.info()` we can already spot a few problems with the data. Below we'll identify some of the initial concerns, in no particular order of priority for now.

### Datatypes

There are 11 columns with a datatype of `object` and 1 column (`Unnamed: 8`) with a datatype of `float64`. From the `.head()` output we can see that some of these columns shouldn't be `object`, but inconsistent data entry has caused them to default to `object`.

One of the major tasks with this dataset will be to reformat the data entries so that they are consistent across each column and once done, assign the correct datatype to each respective column.
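As a minimal sketch of that "clean, then cast" step, the example below uses a small hypothetical stand-in for the vote counts seen in `.head()`. It assumes the dots are thousands separators; the names `votes`, `votes_clean`, and `votes_numeric` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for a column of vote counts stored as text
votes = pd.Series(["2.278.845", "1.572.674", "896.551"])

# Assumption: the dots are thousands separators, so remove them literally
votes_clean = votes.str.replace(".", "", regex=False)

# Cast to a numeric dtype; errors="coerce" turns anything unparseable into NaN
votes_numeric = pd.to_numeric(votes_clean, errors="coerce")
print(votes_numeric.dtype)
```

Using `errors="coerce"` keeps the cast from failing on a stray bad entry, at the cost of silently producing `NaN`, so the result should be checked for new nulls afterwards.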

### Missing Values

Looking across the columns we can see there are some missing values. The column with the most missing values is `Unnamed: 8`, where every value is missing. Second is `Content Rating` with 24 missing values (77 non-null out of 101 rows). Some external research may be needed to source the Content Ratings, or we can impute the values based on EDA.
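Rather than reading the missing counts off `.info()` by subtraction, they can be computed directly. The sketch below uses a small hypothetical frame named `df` as a stand-in; in the notebook the same calls would be made on `movies_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame; in the notebook this would be movies_df
df = pd.DataFrame({
    "Country": ["USA", "US", None, "USA"],
    "Content Rating": ["R", None, None, "PG-13"],
    "Unnamed: 8": [np.nan, np.nan, np.nan, np.nan],
})

# Missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)

# Share of each column that is missing, as a fraction of all rows
print(df.isnull().mean())
```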

### Inconsistent Data Entry Formatting

Looking at `Release year` we can see that the entries vary in both the order of day, month, and year and the separators between them (for example `1995-02-10`, `09 21 1972`, and `23 -07-2008`). The `.head()` output at the top of the notebook also shows inconsistent formatting in `Country`, `Income`, `Votes`, and `Score`.
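One possible way to handle the mixed date layouts is sketched below. It is not the notebook's own method: the helper `parse_release_date` and the `DATE_FORMATS` list are hypothetical, and the sketch assumes that only the three layouts seen in `.head()` occur and that an ambiguous date should be read day-first.

```python
import re
from datetime import datetime

# Candidate layouts, tried in order. Assumption: when a date is ambiguous
# (e.g. 05-06-2001), day-first wins because it is listed before month-first.
DATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y", "%m-%d-%Y"]

def parse_release_date(raw):
    """Return a datetime for the raw string, or None if no layout matches."""
    # Collapse every run of non-digits (spaces, dashes, slashes) into one dash
    normalized = re.sub(r"\D+", "-", raw.strip()).strip("-")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(normalized, fmt)
        except ValueError:
            continue
    return None

for raw in ["1995-02-10", "09 21 1972", "23 -07-2008"]:
    print(raw, "->", parse_release_date(raw))
```

The helper expects a string, so null entries would need to be filtered out or handled before applying it to a whole column.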

### Inaccurate/Misspelled Column Titles

The columns `Original title` and `Genre` contain characters that aren't displayed properly, so we'll need to rename them to be readable. We'll also need to correct the spelling of IMDB in the first column, `IMBD title ID`.

We can also see the `Release year` column isn't accurately representing the data in the column. The data within the column contains not just the release year but also the day and month, therefore the column's appropriate title would be `Release Date`. This will be the first problem we'll address as it's a simple fix.

## Cleaning

Now that we've identified some of the initial problems, we can start cleaning the dataset. If we discover new issues while cleaning, we'll make a note of them and address them later.

### Columns

As mentioned earlier, fixing the column titles will be one of the simpler fixes so we'll take care of that now.


In [5]:
# Getting all of the columns
movies_df.columns

Index(['IMBD title ID', 'Original titl�', 'Release year', 'Genr�', 'Duration',
       'Country', 'Content Rating', 'Director', 'Unnamed: 8', 'Income',
       ' Votes ', 'Score'],
      dtype='object')
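The output above also shows that `' Votes '` carries leading and trailing spaces. When the only problem with a label is padding, stripping whitespace from every column name is a possible alternative to renaming by hand. A minimal sketch on a hypothetical one-row frame named `df`:

```python
import pandas as pd

# Hypothetical stand-in frame with a padded column name, as in ' Votes '
df = pd.DataFrame({"Score": [9.3], " Votes ": ["2.278.845"]})

# Strip leading/trailing whitespace from every column label
df.columns = df.columns.str.strip()
print(list(df.columns))
```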

In [6]:
# Renaming the columns mentioned earlier
# using .rename and providing a dictionary of the columns that need fixing
movies_df.rename(
    columns={
        "IMBD title ID": "IMDB Title ID",
        movies_df.columns[1]: "Movie Title",
        "Release year": "Release Date",
        movies_df.columns[3]: "Genre",
        # Renaming income & votes as well for added clarity
        "Income": "Revenue",
        " Votes ": "Number of Votes",
    },
    inplace=True,
)

In [7]:
# Checking to see the renaming was done properly
movies_df.sample(3)
# Looks good!

Unnamed: 0,IMDB Title ID,Movie Title,Release Date,Genre,Duration,Country,Content Rating,Director,Unnamed: 8,Revenue,Number of Votes,Score
75,tt0105236,Reservoir Dogs,1992-10-09,"Crime, Drama, Thriller",99.0,USA,R,Quentin Tarantino,,$ 2889963,896.551,7.9
8,tt1375666,Inception,2010-09-24,"Action, Adventure, Sci-Fi",148.0,USA,PG-13,Christopher Nolan,,$ 869784991,2.002.816,8..8
14,tt0133093,The Matrix,1999-05-07,"Action, Sci-Fi",,USA,R,"Lana Wachowski, Lilly Wachowski",,$ 465718588,1.632.315,++8.7


We've successfully renamed our problem columns. However, there's still one column left over, named `Unnamed: 8`. The entire column appears to consist of null values, and without external context it's hard to tell what it was meant to hold. We'll likely drop this column, since there's nothing to base imputed values on.

Before dropping we'll confirm the column is full of null values.

In [8]:
# Finding number of null values in the unnamed column
unnamed_nulls = movies_df["Unnamed: 8"].isnull().sum()

# Taking the number of rows in the dataset
num_of_rows = movies_df.shape[0]

# If this returns True, all values in the unnamed column are null
unnamed_nulls == num_of_rows

True

We've confirmed that the column consists entirely of null values. Let's drop the column.

In [9]:
# Dropping the column
movies_df.drop(columns="Unnamed: 8", inplace=True)

# Checking to see the drop was done successfully
movies_df.sample(3)
# Looks good!

Unnamed: 0,IMDB Title ID,Movie Title,Release Date,Genre,Duration,Country,Content Rating,Director,Revenue,Number of Votes,Score
38,tt0114814,The Usual Suspects,1995-11-30,"Crime, Mystery, Thriller",106,USA,R,Bryan Singer,$ 23341568,968.947,8.4
56,tt4154756,Avengers: Infinity War,2018-04-25,"Action, Adventure, Sci-Fi",149,USA,,"Anthony Russo, Joe Russo",$ 2048359754,796.486,8.2
89,tt0033467,Citizen Kane,1948-11-25,"Drama, Mystery",119,USA,,Orson Welles,$ 1594107,389.322,7.6


We've successfully dropped the unnamed column. Let's continue on.
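As an aside, the check-and-drop steps above can also be expressed more compactly. The sketch below uses a hypothetical two-row frame named `df` in place of `movies_df`; `dropna` with `how="all"` removes every column that is entirely null, so it would also catch any other empty columns without naming them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame; in the notebook this would be movies_df
df = pd.DataFrame({
    "Movie Title": ["Inception", "The Matrix"],
    "Unnamed: 8": [np.nan, np.nan],
})

# True only if every value in the column is null
print(df["Unnamed: 8"].isnull().all())

# Drop every column that is entirely null, without naming it
df = df.dropna(axis="columns", how="all")
print(list(df.columns))
```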

***Work in progress***