# 2.2 Cleaning Data with Pandas

Data cleaning is often most of the work in data analysis, so let's cover the important parts here. Let's load some data from the Internet Movie Database (IMDb) to clean up and explore.

In [None]:
import pandas as pd
import numpy as np

In [None]:
## CSV File
df = pd.read_csv("../data/imdb.csv")

print("First 30 rows...")
display(df.head(30))

print('info')
df.info()

## Handling Nulls

Looking at the data above, you will see that some columns use the string `\N` to denote a null. Let's replace these with an actual `np.nan` null value. We can use `df.replace()` to do this.


In [None]:
df = df.replace(r'\N', np.nan) # Note the use of a raw string with r'' to avoid having to escape the \ character.
df.head(30)
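
To see the replacement in isolation, here is a minimal sketch on a made-up three-row sample (the `raw` text and the `title` column are hypothetical, not taken from the IMDb file). It also shows an alternative that may be worth knowing: `pd.read_csv` accepts an `na_values` argument, so the marker can be treated as null while the file is being parsed.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample standing in for the IMDb file.
raw = "title,endYear\nA,\\N\nB,2001\nC,\\N\n"

# Option 1: load first, then replace the marker string (as in the cell above).
sample = pd.read_csv(io.StringIO(raw))
replaced = sample.replace(r'\N', np.nan)

# Option 2: tell read_csv to treat the marker as null while parsing.
parsed = pd.read_csv(io.StringIO(raw), na_values=[r'\N'])

# Both approaches end up with two nulls in endYear.
print(replaced['endYear'].isna().sum())
print(parsed['endYear'].isna().sum())

# With option 2 the column is parsed as numeric straight away.
print(parsed['endYear'].dtype)
```

With `na_values` the column comes out numeric at load time, whereas after `replace` it still holds the remaining values as strings and needs a separate conversion step.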

## Fixing Data Types

Let's look at the documentation for the columns provided by IMDb:
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) – original title, in the original language
- isAdult (boolean) – 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title
- averageRating – weighted average of all the individual user ratings
- numVotes – number of votes the title has received

If we call df.info() we can get an idea of our datatypes and other info:


In [None]:
df.info()

From this we can see that some columns aren't in the format we'd want: `runtimeMinutes`, `startYear` and `endYear` are currently `object` columns, and `isAdult` is a `float64` column when it could be a boolean.

Let's fix these:

In [None]:
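# Convert isAdult to a boolean column.
# Caution: astype(bool) would turn NaN into True, so we compare against 1 instead.
# Rows with a missing isAdult value then become False (treated as non-adult).
df['isAdult'] = df['isAdult'] == 1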
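
Missing values need some care when converting a float column to booleans. The following is a minimal sketch on a hypothetical `flags` Series (not the real column) comparing three options: a plain `astype(bool)`, a comparison against 1, and pandas' nullable `'boolean'` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical values: non-adult, adult, and a missing entry.
flags = pd.Series([0.0, 1.0, np.nan])

# NaN is "truthy", so a plain cast marks the missing row as True.
as_bool = flags.astype(bool)

# Comparing against 1 maps the missing row to False instead.
eq_one = flags == 1

# The nullable boolean dtype keeps the missing row as <NA>.
nullable = flags.astype('boolean')

print(as_bool.tolist())          # [False, True, True]
print(eq_one.tolist())           # [False, True, False]
print(nullable.isna().tolist())  # [False, False, True]
```

Which option fits depends on whether a missing flag should count as adult, as non-adult, or stay unknown.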

In [None]:
# We can convert string columns to numeric with pd.to_numeric()
df['startYear'] = pd.to_numeric(df['startYear'], errors='coerce') # turn non-numeric records into np.nan
df['endYear'] = pd.to_numeric(df['endYear'], errors='coerce') # turn non-numeric records into np.nan
df['runtimeMinutes'] = pd.to_numeric(df['runtimeMinutes'], errors='coerce') # turn non-numeric records into np.nan

# Check our info again
df.info()

df.head(30)
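
To make the effect of `errors='coerce'` concrete, here is a small sketch on a hypothetical `years` Series. It also shows one optional follow-up step, under the assumption that you would rather see whole-number years than floats: the nullable `Int64` dtype can hold integers alongside nulls.

```python
import pandas as pd

# Hypothetical values: two valid numbers, a null marker and a stray word.
years = pd.Series(['1994', r'\N', 'abc', '2001'])

# errors='coerce' turns anything that cannot be parsed into NaN.
as_float = pd.to_numeric(years, errors='coerce')

# Optional: the nullable Int64 dtype keeps whole numbers while allowing nulls.
as_int = as_float.astype('Int64')

print(as_float.tolist())
print(as_int.dtype)
```

Because NaN is a float, a coerced column with any missing values ends up as floats unless it is converted to a nullable dtype like this.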

## Adding Columns

We might want to add extra columns to our dataframe. We can do this by assigning a value to a new column name. This can either be a single value or an expression that uses the data in our other columns.

Let's add a decade column based on the startYear column.



In [None]:
# Divide by 10, round down, then multiply by 10 to get the decade (e.g. 1994 -> 1990).
# np.floor works on a whole Series at once, so we don't need .apply() here.
df['decade'] = np.floor(df['startYear'] / 10) * 10
df
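
Both kinds of assignment can be sketched on a tiny made-up dataframe. The `frame` variable and the `source` column are hypothetical, and floor division (`//`) is used here as a shorthand that should give the same result as dividing and rounding down.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing year.
frame = pd.DataFrame({'startYear': [1994.0, 2001.0, np.nan]})

# A single value is repeated for every row.
frame['source'] = 'imdb'

# An expression is evaluated row by row; missing years stay missing.
frame['decade'] = (frame['startYear'] // 10) * 10

print(frame)
```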

## Dropping Data

Perhaps we want a nice safe dataset for our instructional class. Let's drop the adult titles from our dataframe. We can do this with `df.drop()`. By default it returns a new dataframe with the specified data dropped; if we pass the argument `inplace=True`, it modifies the dataframe in place, so we don't have to re-assign `df` to the result.

`drop` takes a list of index labels or column names to drop from the dataframe.

In [None]:
# Get the index of all the titles we want to drop.
adult = df[df['isAdult']].index  # isAdult is already boolean, so it can be used directly as a mask

# Drop the data
df.drop(adult, inplace=True)

df

Now that we've dropped all records where isAdult is True, we don't really need the column any more.

In [None]:
df.drop(columns='isAdult', inplace=True)
df
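
As a recap, here is a minimal sketch of the same steps on a hypothetical `movies` dataframe, written without `inplace=True` so each step returns a new dataframe. It also shows a common alternative for removing rows: keeping only the rows where a boolean mask is True, which should give the same result as dropping by index.

```python
import pandas as pd

# Hypothetical three-title sample.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'isAdult': [False, True, False],
})

# Drop rows by index label, as in the cells above.
dropped = movies.drop(movies[movies['isAdult']].index)

# Alternative: keep only the rows where isAdult is False.
filtered = movies[~movies['isAdult']]

# Drop a column by name; the original dataframe is left untouched.
without_flag = filtered.drop(columns='isAdult')

print(dropped.equals(filtered))
print(without_flag.columns.tolist())
```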

That's a good start. We can start to explore the data in the next notebook. There's a version of this already saved as imdb_clean, so you can jump ahead to the next notebook and try that out now.