# IMDB Movie Data

### Jordan Miranda

In this notebook we'll look at a small IMDB dataset of movies, with details for each movie such as genre, duration, and country.

The goal of this notebook is to clean the dataset and prepare it for possible feature engineering and modelling later on. To clean this data we'll need to address any null values, standardize values across the dataset, and possibly rename columns so that they accurately reflect the data they contain.

We'll get started by importing the standard tools used for data manipulation with Python: `pandas` and `NumPy`.
We'll also import `Matplotlib` and `seaborn` in case any data visualization is needed.

In [2]:
# importing standard data tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

Next we'll need to import the data we'll be working with.

In [24]:
# Loading in our dataset, including the delimiter used to separate the values
movies_df = pd.read_csv("data/messy_IMDB_dataset.csv", sep=";")
movies_df.head()

| | IMBD title ID | Original titl� | Release year | Genr� | Duration | Country | Content Rating | Director | Unnamed: 8 | Income | Votes | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111161 | The Shawshank Redemption | 1995-02-10 | Drama | 142.0 | USA | R | Frank Darabont | NaN | $ 28815245 | 2.278.845 | 9.3 |
| 1 | tt0068646 | The Godfather | 09 21 1972 | Crime, Drama | 175.0 | USA | R | Francis Ford Coppola | NaN | $ 246120974 | 1.572.674 | 9.2 |
| 2 | tt0468569 | The Dark Knight | 23 -07-2008 | Action, Crime, Drama | 152.0 | US | PG-13 | Christopher Nolan | NaN | $ 1005455211 | 2.241.615 | 9. |
| 3 | tt0071562 | The Godfather: Part II | 1975-09-25 | Crime, Drama | 220.0 | USA | R | Francis Ford Coppola | NaN | $ 4o8,035,783 | 1.098.714 | 9.0 |
| 4 | tt0110912 | Pulp Fiction | 1994-10-28 | Crime, Drama | NaN | USA | R | Quentin Tarantino | NaN | $ 222831817 | 1.780.147 | 8,9f |


## Introductory Information on the Dataset

Now that we've loaded the dataset, let's gather some basic info about the data such as its shape, datatypes, missing values (if any), and so on.

In [12]:
movies_df.shape

(101, 12)

This dataset contains 101 rows with 12 columns. Let's look at the names of the columns and their datatypes.

In [9]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 101 entries, tt0111161 to tt0045152
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Original titl�  100 non-null    object 
 1   Release year    100 non-null    object 
 2   Genr�           100 non-null    object 
 3   Duration        99 non-null     object 
 4   Country         100 non-null    object 
 5   Content Rating  77 non-null     object 
 6   Director        100 non-null    object 
 7   Unnamed: 8      0 non-null      float64
 8   Income          100 non-null    object 
 9    Votes          100 non-null    object 
 10  Score           100 non-null    object 
dtypes: float64(1), object(10)
memory usage: 9.5+ KB


## Initial Problems

The output above lists the 11 columns (index excluded) in this dataset. Using `.info()` we can already spot a few problems with the data. Below we'll identify some of the initial concerns, in no particular order of priority for now.

### Datatypes

There are 10 columns with a datatype of `object` and 1 column with a datatype of `float64`. From the `.head()` output we can see that some of these columns shouldn't be `object`, but inconsistent formatting during data entry has caused them to default to `object`.

One of the major tasks with this dataset will be to reformat the data entries so that they are consistent across each column and once done, assign the correct datatype to each respective column.
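
As a minimal sketch of that reformat-then-cast pattern, the cell below cleans a few values copied from the `.head()` output for `Votes` and `Duration`. The frame `sample_df` is a hypothetical stand-in for `movies_df`, and the cleaning rules are assumptions based only on the rows shown above; the full columns may contain other malformed entries that need their own handling.

```python
import pandas as pd

# Hypothetical sample mimicking a few entries seen in .head()
sample_df = pd.DataFrame({
    "Votes": ["2.278.845", "1.572.674", "2.241.615"],
    "Duration": ["142.0", "175.0", None],
})

# Votes uses "." as a thousands separator: remove it, then cast to integer
sample_df["Votes"] = (
    sample_df["Votes"].str.replace(".", "", regex=False).astype("int64")
)

# Duration: convert to numeric, turning anything unparseable into NaN
sample_df["Duration"] = pd.to_numeric(sample_df["Duration"], errors="coerce")

print(sample_df.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on a bad entry, at the cost of silently turning that entry into a missing value, so it is worth counting nulls before and after the conversion.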

### Missing Values

Looking across the columns we can see there are some missing values. The column with the most missing values is `Unnamed: 8`, where every value is missing. Second is `Content Rating` with 24 missing values (77 non-null out of 101 rows). We may need some external research to source the missing Content Ratings, or we can impute the values based on EDA.
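
One way to check these counts directly, and to drop a column that is entirely empty, is sketched below. The frame `sample_df` is a small hypothetical stand-in for `movies_df`, and the names `missing_counts` and `trimmed_df` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies_df, including an all-null column
sample_df = pd.DataFrame({
    "Content Rating": ["R", None, "PG-13", None],
    "Unnamed: 8": [np.nan, np.nan, np.nan, np.nan],
    "Director": ["A", "B", "C", "D"],
})

# Missing values per column, largest first
missing_counts = sample_df.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Drop only the columns in which every value is missing
trimmed_df = sample_df.dropna(axis="columns", how="all")
print(trimmed_df.columns.tolist())
```

With `how="all"`, partially filled columns such as `Content Rating` are kept, so they can still be researched or imputed later.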

### Inaccurate Column Titles

The columns that should read `Original title` and `Genre` contain characters that aren't displayed properly, so we'll need to rename them.

In [22]:
movies_df.columns[1]

'Original titl�'
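
A possible way to rename these columns is sketched below. It matches on the readable prefix of each name rather than on the garbled character itself, since we don't know exactly which byte the file contains. The frame `sample_df`, the mapping `clean_names`, and the helper `fix_name` are all hypothetical, and the replacement character used to build the sample headers is an assumption.

```python
import pandas as pd

# Hypothetical empty frame reproducing the two garbled headers
# (U+FFFD is assumed here; the real file may contain a different character)
sample_df = pd.DataFrame(
    columns=["IMBD title ID", "Original titl\ufffd", "Genr\ufffd", "Duration"]
)

# Readable prefix of each damaged name mapped to the intended name
clean_names = {"Original titl": "Original title", "Genr": "Genre"}

def fix_name(name):
    """Return the intended name if `name` starts with a known prefix."""
    for prefix, clean in clean_names.items():
        if name.startswith(prefix):
            return clean
    return name

# rename accepts a function that is applied to every column label
sample_df = sample_df.rename(columns=fix_name)
print(sample_df.columns.tolist())
```

Matching on a prefix avoids having to type or copy the undisplayable character, but it assumes no other column starts with the same text.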