#### Handling missing and null values
We can check whether our DataFrame contains missing values with the method isnull(). It returns a DataFrame of the same shape filled with boolean values, where True means that the corresponding value in our DataFrame is missing.



In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("c2_tmdb_5000_movies.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [2]:
df.isnull().head()

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |


Note that notnull() is the opposite of isnull(): it returns a DataFrame in which the True and False values are swapped.
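To make the relationship between the two methods concrete, here is a minimal sketch on a small made-up DataFrame (the name `toy` and its contents are assumptions for illustration, not part of the movies dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": [120.0, np.nan, 95.0],
})

missing = toy.isnull()   # True where a value is missing
present = toy.notnull()  # True where a value is present

# Each cell is flagged by exactly one of the two masks,
# so negating one mask gives the other
print((missing == ~present).all().all())  # True
```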

Usually, we are not interested in the boolean DataFrame itself. Instead, we would like to know how many missing values we have and where they occur. We can find out by calling sum() on the boolean DataFrame. By default, sum() adds up the values in each column. When applied to boolean values, it treats True as 1 and False as 0.

In [3]:
df.isnull().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64
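The same idea can be taken a little further. The sketch below, on a made-up DataFrame (`toy` and its values are assumptions for illustration), counts missing values per column, counts them per row with `axis=1`, and uses mean() on the boolean DataFrame to get the fraction of missing values in each column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [np.nan, "http://example.com", np.nan, np.nan],
    "runtime": [120.0, np.nan, 95.0, np.nan],
})

per_column = toy.isnull().sum()      # missing values in each column
per_row = toy.isnull().sum(axis=1)   # missing values in each row
share = toy.isnull().mean()          # fraction of missing values per column

print(per_column)
print(per_row)
print(share)
```

Here homepage has 3 missing values out of 4 rows, so its share is 0.75, and the last row is the only one with two missing values.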

So now we can see that most of the missing values are in the column homepage (3091), followed by tagline (844). Once we have identified the missing values, we have several ways of dealing with them. Note that there is not always one correct answer to the question of what to do with missing values. Our goal here is to expose you to different options and give you the tools to implement them. The decision about which one to use is up to you and depends on your specific application.

#### Dropping missing data
We can either drop an observation (a row of our DataFrame) or drop a variable (a column of our DataFrame). Pandas has a dedicated method for this, called dropna(). By default, it removes all rows from the DataFrame that have at least one NaN value. Like most other pandas methods, it returns a copy, and the original DataFrame remains unchanged. If we want the changes to be applied to the original DataFrame, we can include the argument inplace=True.

In [5]:
df2 = df.dropna()
df2.isnull().sum().sum()

0

(Note that we use .sum() twice, to sum up all values across both dimensions of the DataFrame. Try to use only one sum() to see what happens.)
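As a small illustration of both points (dropna() returning a copy, and the difference between one and two calls to sum()), here is a sketch on a made-up DataFrame. The names `toy` and `cleaned` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

cleaned = toy.dropna()             # new DataFrame; toy itself is untouched

per_column = toy.isnull().sum()    # Series with one count per column
total = toy.isnull().sum().sum()   # single number for the whole DataFrame

print(toy.shape, cleaned.shape)    # (3, 2) (1, 2)
print(per_column)
print(total)                       # 3
```

A single sum() gives one count per column (1 for a, 2 for b), and the second sum() adds those counts together.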

Sometimes, this might not be the best method, since it could remove a large number of observations. Let’s check if this was the case here:

In [6]:
df.shape

(4803, 20)

In [7]:
df2.shape

(1493, 20)

Indeed, more than two thirds of the rows were removed (3310 of 4803), so this is not very useful. We could instead opt to drop a row only if all the values in the row are NaN. We can do this by including the argument how='all'. Try this out:

In [8]:
df3 = df.dropna(how="all")
df3.shape

(4803, 20)

For our case, this is not very useful either, because it makes no changes to the DataFrame: no row consists entirely of NaN values.
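Since how='all' has no visible effect on the movies data, the difference between the two settings is easier to see on a made-up DataFrame that does contain a completely empty row (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 has one NaN, row 2 is all NaN
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

any_dropped = toy.dropna(how="any")  # default: drop rows with at least one NaN
all_dropped = toy.dropna(how="all")  # drop only rows where every value is NaN

print(any_dropped.shape)  # (1, 2)
print(all_dropped.shape)  # (2, 2)
```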

In [10]:
# dropna() follows the same logic for columns as for rows:
# with axis=1 it drops every column that contains at least one NaN
df.dropna(axis=1).shape

# The parameter thresh lets us set the minimum number of non-null values
# that a row/column must have in order to be kept. Let’s try this out:

df.dropna(thresh=4000, axis=1, inplace=True)
# The result:
df.isnull().sum().sum()

6
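In the movies data, homepage (1712 non-null values) and tagline (3959 non-null values, that is 4803 minus 844 missing) fall below the threshold of 4000, which is why those two columns disappear and only 6 missing values remain. The sketch below shows the same mechanism on a made-up DataFrame. The name `toy`, the column names and the threshold of 3 are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],       # 3 non-null values
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],  # 1 non-null value
    "full": [1, 2, 3, 4],                          # 4 non-null values
})

# Keep only the columns that have at least 3 non-null values
kept = toy.dropna(thresh=3, axis=1)

print(list(kept.columns))  # ['mostly_full', 'full']
```

A column with exactly `thresh` non-null values is kept, because thresh is a minimum and not a strict lower bound.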

#### Replacing missing values

Pandas has a function fillna() that lets us replace missing values with a value of our choice. Let’s look once again at where our missing data is:

In [11]:
df.isnull().sum()

budget                  0
genres                  0
id                      0
keywords                0
original_language       0
original_title          0
overview                3
popularity              0
production_companies    0
production_countries    0
release_date            1
revenue                 0
runtime                 2
spoken_languages        0
status                  0
title                   0
vote_average            0
vote_count              0
dtype: int64

So our missing values are located in the columns overview, release_date, and runtime. Let’s take a peek at these columns so we can see the format of the data in them:

In [12]:
df["overview"].head(5)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

In [13]:
df["release_date"].head(5)

0    2009-12-10
1    2007-05-19
2    2015-10-26
3    2012-07-16
4    2012-03-07
Name: release_date, dtype: object

In [14]:
df["runtime"].head(5)

0    162.0
1    169.0
2    148.0
3    165.0
4    132.0
Name: runtime, dtype: float64

In [15]:
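# replace the missing values in the column 'overview' with a string
# (assign the result back: calling fillna(..., inplace=True) on a selected
# column triggers a warning in recent pandas versions)
df["overview"] = df["overview"].fillna(value="Overview not available")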
df["overview"].isnull().sum()

0
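As an alternative way of writing this step, fillna() can also be called on the whole DataFrame with a dictionary that maps column names to replacement values. The following is a minimal sketch on made-up data (the name `toy` and its contents are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
toy = pd.DataFrame({
    "title": ["A", "B"],
    "overview": ["A short plot summary.", np.nan],
})

# Map each column name to the value that should replace its NaNs;
# columns that are not listed are left as they are
filled = toy.fillna({"overview": "Overview not available"})

print(filled["overview"].tolist())
# ['A short plot summary.', 'Overview not available']
```

Because no inplace argument is used, `toy` still contains its missing value and `filled` is a separate copy.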

In [16]:
# for the column 'release_date', we will be replacing the missing data by propagating the non-missing values
# find the index of the missing value
df.loc[df["release_date"].isnull(), "release_date"]

4553    NaN
Name: release_date, dtype: object

In [18]:
# take a look at the values just before and after:
# Remember for df.loc[] both start and end are included
df.loc[4552:4554, "release_date"]

4552    2012-03-28
4553           NaN
4554    2015-03-10
Name: release_date, dtype: object

In [19]:
# what would happen if we used forward propagation
# (ffill() does the same as fillna(method="ffill"), which recent pandas versions deprecate)
df["release_date"].ffill()[4552:4555]

4552    2012-03-28
4553    2012-03-28
4554    2015-03-10
Name: release_date, dtype: object

In [20]:
# backward propagation, using bfill() in place of the deprecated fillna(method="bfill")
df["release_date"].bfill()[4552:4555]

4552    2012-03-28
4553    2015-03-10
4554    2015-03-10
Name: release_date, dtype: object

In [22]:
# Let's stick with the forward one for our case (this is enough, the previous commands were just for visualization)
# Assign the result back rather than using inplace=True on a selected column
df["release_date"] = df["release_date"].ffill()
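To see forward and backward propagation side by side, here is a minimal sketch on a made-up Series that mimics the three rows shown above (the variable names are assumptions for illustration). It also shows one limitation: forward propagation cannot fill a missing value at the very start of a Series, because there is no earlier value to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical Series that mimics the three rows around the gap
dates = pd.Series(["2012-03-28", np.nan, "2015-03-10"])

forward = dates.ffill()    # copy the last valid value forward
backward = dates.bfill()   # copy the next valid value backward

print(forward.tolist())    # ['2012-03-28', '2012-03-28', '2015-03-10']
print(backward.tolist())   # ['2012-03-28', '2015-03-10', '2015-03-10']

# A leading NaN has no previous value, so ffill() leaves it missing
leading = pd.Series([np.nan, "2015-03-10"]).ffill()
print(leading.isnull().sum())  # 1
```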

Finally, we arrive at the 'runtime' column. Here, we will replace the two missing values with the mean of this column, computed with mean(). By default, mean() excludes all NaN values. In the special case where all values are NaN, the mean is also NaN. We can then pass this mean as the replacement value to fillna():

In [23]:
df["runtime"].mean()

106.87585919600083

In [25]:
df["runtime"] = df["runtime"].fillna(value=df["runtime"].mean())
df.isnull().sum().sum()

0
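The behaviour of mean() with missing values can be checked on a small made-up Series (the variable names and numbers below are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical runtimes with one missing value
runtime = pd.Series([100.0, np.nan, 140.0])

mean_runtime = runtime.mean()          # NaN is skipped: (100 + 140) / 2
filled = runtime.fillna(mean_runtime)  # the gap is replaced by the mean

print(mean_runtime)     # 120.0
print(filled.tolist())  # [100.0, 120.0, 140.0]

# Special case: if every value is NaN, the mean is NaN as well
empty_mean = pd.Series([np.nan, np.nan]).mean()
print(np.isnan(empty_mean))  # True
```

Note that the mean is divided by the number of non-missing values (2 here), not by the length of the Series.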

A final note: as mentioned before, what to do with NaN values is up to you. In this unit, we simply wanted to show you some options. You could, of course, also have looked up the release date of the movie with the missing value and filled it in by hand.