In [1]:
import pandas as pd
import numpy as np

IMDB = pd.read_csv('../Datasets/IMDB-Movie-Data.csv', index_col= 0)
IMDB.head()

| Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |


## Dtypes

The data type for a column in a DataFrame or a Series is known as the **dtype**.

You can use the `dtype` property to get the type of a specific column. For instance, we can get the dtype of the `Rating` column in the `IMDB` DataFrame:

In [2]:
IMDB.Rating.dtype

dtype('float64')

Alternatively, the `dtypes` property returns the `dtype` of _every_ column in the DataFrame:

In [3]:
IMDB.dtypes

Title                  object
Genre                  object
Description            object
Director               object
Actors                 object
Year                    int64
Runtime (Minutes)       int64
Rating                float64
Votes                   int64
Revenue (Millions)    float64
Metascore             float64
dtype: object

Data types tell us how pandas stores the data internally. `float64` means a 64-bit floating-point number, `int64` means a 64-bit integer, and so on.
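To make the "64-bit" part concrete, here is a minimal sketch on two small Series. The values are copied from the first rows of the table above, the variable names are hypothetical, and the dtypes are set explicitly so the result does not depend on platform defaults.

```python
import pandas as pd

# Small stand-ins for the Rating and Votes columns (names are illustrative).
ratings = pd.Series([8.1, 7.0, 7.3], dtype="float64")
votes = pd.Series([757074, 485820, 157606], dtype="int64")

# itemsize is the number of bytes used per element: 8 bytes = 64 bits.
ratings_bits = ratings.dtype.itemsize * 8
votes_bits = votes.dtype.itemsize * 8

print(ratings.dtype, ratings_bits)
print(votes.dtype, votes_bits)
```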

One peculiarity to keep in mind, and clearly on display here, is that columns consisting entirely of strings are not given a string-specific type in the output above; they are given the `object` dtype instead.

It's possible to convert a column from one type to another, wherever such a conversion makes sense, by using the `astype()` method. For example, we can convert the `Runtime (Minutes)` column from its existing `int64` dtype to `float64`:

In [6]:
IMDB['Runtime (Minutes)'].astype('float64')


Rank
1       121.0
2       124.0
3       117.0
4       108.0
5       123.0
        ...  
996     111.0
997      94.0
998      98.0
999      93.0
1000     87.0
Name: Runtime (Minutes), Length: 1000, dtype: float64
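The phrase "wherever such a conversion makes sense" deserves a quick illustration. The sketch below uses small, made-up Series rather than the real dataset (the variable names are hypothetical) and shows one conversion that works and one that fails.

```python
import pandas as pd

# Hypothetical stand-in for the Runtime (Minutes) column.
runtime = pd.Series([121, 124, 117], name="Runtime (Minutes)")

as_float = runtime.astype("float64")  # integers convert cleanly to floats
as_text = runtime.astype(str)         # every value becomes a string

# Movie titles cannot be interpreted as integers, so astype() raises.
titles = pd.Series(["Prometheus", "Split"])
try:
    titles.astype("int64")
    conversion_failed = False
except (ValueError, TypeError):
    conversion_failed = True

print(as_float.dtype, as_text.iloc[0], conversion_failed)
```

Note that `astype()` returns a new Series and leaves the original untouched; assign the result back if you want to keep the converted column.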

A DataFrame or Series index has its own `dtype`, too:

In [7]:
IMDB.index.dtype

dtype('int64')

Pandas also supports more specialized data types, such as categorical and time series data, but these are used less often.

## Missing Data

Entries with missing values are given the value `NaN`, short for "Not a Number". For technical reasons, `NaN` is a floating-point value, so these entries are always of the `float64` dtype.
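One visible consequence is that a column of whole numbers becomes `float64` as soon as it contains a `NaN`. A minimal sketch with made-up values (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

complete = pd.Series([76, 65, 62])        # no missing values
with_gap = pd.Series([76, np.nan, 62])    # one missing value

print(complete.dtype)   # integer dtype
print(with_gap.dtype)   # upcast to a float dtype to hold NaN
print(type(np.nan))     # NaN itself is a Python float
```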

Pandas provides some methods specific to missing data. To select `NaN` entries, you can use `pd.isnull()` (or its companion, `pd.notnull()`), like so:

In [10]:
IMDB[pd.isnull(IMDB.Metascore)]

| Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | Paris pieds nus | Comedy | Fiona visits Paris for the first time to assis... | Dominique Abel | Fiona Gordon, Dominique Abel,Emmanuelle Riva, ... | 2016 | 83 | 6.8 | 222 | NaN | NaN |
| 27 | Bahubali: The Beginning | Action,Adventure,Drama | In ancient India, an adventurous and daring ma... | S.S. Rajamouli | Prabhas, Rana Daggubati, Anushka Shetty,Tamann... | 2015 | 159 | 8.3 | 76193 | 6.50 | NaN |
| 28 | Dead Awake | Horror,Thriller | A young woman must save herself and her friend... | Phillip Guzman | Jocelin Donahue, Jesse Bradford, Jesse Borrego... | 2016 | 99 | 4.7 | 523 | 0.01 | NaN |
| 40 | 5- 25- 77 | Comedy,Drama | Alienated, hopeful-filmmaker Pat Johnson's epi... | Patrick Read Johnson | John Francis Daley, Austin Pendleton, Colleen ... | 2007 | 113 | 7.1 | 241 | NaN | NaN |
| 43 | Don't Fuck in the Woods | Horror | A group of friends are going on a camping trip... | Shawn Burkett | Brittany Blanton, Ayse Howard, Roman Jossart,N... | 2016 | 73 | 2.7 | 496 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 968 | The Walk | Adventure,Biography,Crime | In 1974, high-wire artist Philippe Petit recru... | Robert Zemeckis | Joseph Gordon-Levitt, Charlotte Le Bon,Guillau... | 2015 | 123 | 7.3 | 92378 | 10.14 | NaN |
| 970 | The Lone Ranger | Action,Adventure,Western | Native American warrior Tonto recounts the unt... | Gore Verbinski | Johnny Depp, Armie Hammer, William Fichtner,To... | 2013 | 150 | 6.5 | 190855 | 89.29 | NaN |
| 972 | Disturbia | Drama,Mystery,Thriller | A teen living under house arrest becomes convi... | D.J. Caruso | Shia LaBeouf, David Morse, Carrie-Anne Moss, S... | 2007 | 105 | 6.9 | 193491 | 80.05 | NaN |
| 990 | Selma | Biography,Drama,History | A chronicle of Martin Luther King's campaign t... | Ava DuVernay | David Oyelowo, Carmen Ejogo, Tim Roth, Lorrain... | 2014 | 128 | 7.5 | 67637 | 52.07 | NaN |
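The selection above works because `pd.isnull()` returns a boolean mask, which is then used to filter the rows. Here is a small sketch of that mechanism on a made-up three-row DataFrame (the `movies` frame and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing Metascore.
movies = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Metascore": [76.0, np.nan, 62.0]}
)

mask = pd.isnull(movies["Metascore"])    # True where the value is NaN
missing = movies[mask]                   # rows with no Metascore
present = movies[pd.notnull(movies["Metascore"])]  # rows that have one

# Summing a boolean mask counts the True entries.
n_missing = int(mask.sum())
print(n_missing)
```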


Replacing missing values is a common operation, and pandas provides a handy method for it: `fillna()`. It offers a few different strategies for handling missing data. For example, we can simply replace each `NaN` with `0.0`:

In [11]:
IMDB.Metascore.fillna(0.0)

Rank
1       76.0
2       65.0
3       62.0
4       59.0
5       40.0
        ... 
996     45.0
997     46.0
998     50.0
999     22.0
1000    11.0
Name: Metascore, Length: 1000, dtype: float64

Or we could fill each missing value with the first non-null value that appears after it in the DataFrame. This is known as the backfill strategy.
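A minimal sketch of backfilling, using the `bfill()` method on a made-up Series rather than the real `Metascore` column (the values and variable names are hypothetical). Older code often spells this `fillna(method="bfill")`, which recent pandas versions deprecate in favor of `bfill()`.

```python
import numpy as np
import pandas as pd

scores = pd.Series([76.0, np.nan, np.nan, 59.0, np.nan])

# Each NaN takes the next non-null value that follows it.
backfilled = scores.bfill()

print(backfilled.tolist())
```

Note that the last entry has no later value to borrow from, so it remains `NaN`; a second pass with `fillna()` may still be needed in that case.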