# Handling missing values

In [2]:
import pandas as pd
import numpy as np

In [6]:
df=pd.read_csv('tmdb_5000_movies.csv')

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null 

In [12]:
df.describe()

Unnamed: 0,budget,id,popularity,revenue,runtime,vote_average,vote_count
count,4803.0,4803.0,4803.0,4803.0,4801.0,4803.0,4803.0
mean,29045040.0,57165.484281,21.492301,82260640.0,106.875859,6.092172,690.217989
std,40722390.0,88694.614033,31.81665,162857100.0,22.611935,1.194612,1234.585891
min,0.0,5.0,0.0,0.0,0.0,0.0,0.0
25%,790000.0,9014.5,4.66807,0.0,94.0,5.6,54.0
50%,15000000.0,14629.0,12.921594,19170000.0,103.0,6.2,235.0
75%,40000000.0,58610.5,28.313505,92917190.0,118.0,6.8,737.0
max,380000000.0,459488.0,875.581305,2787965000.0,338.0,10.0,13752.0


#### The NaN value
Pandas represents missing data with NaN, which stands for 'Not a Number'. This is a special floating-point value from NumPy. Arithmetic operations such as addition and multiplication accept NaN without raising an error, but their result is again NaN. Try out the following:

In [13]:
np.nan
#nan
np.nan+2
#nan
np.nan*0
#nan

nan
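Because NaN propagates through arithmetic, it also behaves unusually under comparison. The following is a minimal sketch (the variable names are illustrative, not part of the original notebook) showing why an equality test cannot find missing values and which helpers to use instead:

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float, so it fits in any float array or Series.
is_float = isinstance(np.nan, float)

# NaN never compares equal to anything, including itself,
# so `x == np.nan` cannot be used to locate missing values.
equal_to_itself = np.nan == np.nan

# np.isnan() and pd.isna() are the reliable ways to test for NaN.
detected_numpy = np.isnan(np.nan)
detected_pandas = pd.isna(np.nan)

print(is_float, equal_to_itself, detected_numpy, detected_pandas)
```

This is the reason Pandas offers dedicated detection functions rather than relying on `==`.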

Pandas has several built-in functions that help us detect, remove and replace NaN values, such as:

- isnull()
- notnull()
- dropna()
- fillna()

#### Detecting missing values


In [15]:
df.isnull()[0:10]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [17]:
df.isnull().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64
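Counts are easier to judge as proportions. As a hedged sketch on a small, made-up DataFrame (`toy` and its contents are assumptions, not the TMDB data), the mean of the boolean mask gives the share of missing values per column, and `any(axis=1)` selects the rows that contain at least one gap:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [120.0, np.nan, 95.0, np.nan],
    "tagline": ["x", None, None, "Hello"],
})

# isnull() yields booleans; their column-wise mean is the fraction missing.
missing_share = toy.isnull().mean()

# Boolean mask selecting rows with at least one missing value.
rows_with_gaps = toy[toy.isnull().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```

The same two expressions can be applied to `df` to see, for example, what fraction of `homepage` is missing.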

#### Dropping missing values

In [18]:
df2=df.dropna()

In [21]:
df2.isnull().sum().sum()

0

In [23]:
df.shape

(4803, 20)

In [24]:
df2.shape

(1493, 20)

**Indeed, a very big chunk of the data was removed:** only 1493 of the 4803 rows remain, so this is not very useful. We could instead opt to drop a row only if all the values in the row are NaN. We can do this by including the how='all' argument. Try this out:

In [25]:
df3 = df.dropna(how='all')
df3.shape
#(4803, 20)

(4803, 20)

In [27]:
df.dropna(axis=1);

Passing axis=1 makes dropna() remove columns instead of rows. With the call above, the two columns with lots of NaN values are dropped, but we also lose the columns with only one, two or three NaN values, which ideally we would like to keep. On the other hand, if we use

In [29]:
df.dropna(how='all', axis=1);

then, as with the rows, no columns are dropped and the DataFrame is unchanged. Pandas has a solution for this problem: the parameter thresh, which lets us specify the minimum number of non-null values a row or column must contain to be kept. Let's try this out:

In [30]:
df.dropna(thresh=4000, axis=1, inplace=True)

In [33]:
df.isnull().sum().sum()

6

If we take a look at the resulting DataFrame, we will notice that the columns homepage and tagline were dropped, since they both contained fewer than 4000 non-null values. This removes almost all of our NaN values (only 6 remain) without dropping an excessive number of rows or columns.
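To compare the three column-dropping variants side by side, here is a small sketch on a made-up DataFrame (the column names `mostly_full`, `mostly_empty` and `full` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 4 rows.
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, np.nan, 4.0],       # 3 non-null values
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # 1 non-null value
    "full": ["a", "b", "c", "d"],                  # 4 non-null values
})

# Default how='any': drop every column that contains at least one NaN.
any_dropped = toy.dropna(axis=1)

# how='all': drop only columns that are entirely NaN (none here).
all_dropped = toy.dropna(axis=1, how="all")

# thresh=3: keep columns with at least 3 non-null values.
thresh_dropped = toy.dropna(axis=1, thresh=3)

print(list(any_dropped.columns))     # ['full']
print(list(all_dropped.columns))     # ['mostly_full', 'mostly_empty', 'full']
print(list(thresh_dropped.columns))  # ['mostly_full', 'full']
```

Note that thresh counts non-null values, so a higher threshold is stricter.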

#### Replacing missing values
Dropping missing data decreases the number of samples in our data set and, as a result, the power of our analysis and the predictive strength of our machine learning models. Therefore, it is preferable to first try to retain data by suitably replacing missing values before using dropna() as a last resort.

Depending on the data, there are multiple strategies to replace missing values. If we actually know the correct values, or can argue for a reasonable value, we can correct them directly. Otherwise, we aim to infer or estimate the missing data points suitably. **For this, we might use methods like ffill, an estimate like the mean, or machine learning models like k-NN (see later).**

Pandas has a **function fillna()** that helps us replace missing values with a specific value of our choice. Let's look once again at where our missing data is:

In [35]:
df.isnull().sum()

budget                  0
genres                  0
id                      0
keywords                0
original_language       0
original_title          0
overview                3
popularity              0
production_companies    0
production_countries    0
release_date            1
revenue                 0
runtime                 2
spoken_languages        0
status                  0
title                   0
vote_average            0
vote_count              0
dtype: int64

In [36]:
df['overview'].head(5)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

In [37]:
df['release_date'].head(5)

0    2009-12-10
1    2007-05-19
2    2015-10-26
3    2012-07-16
4    2012-03-07
Name: release_date, dtype: object

In [38]:
df['runtime'].head(5)

0    162.0
1    169.0
2    148.0
3    165.0
4    132.0
Name: runtime, dtype: float64

In [39]:
df['runtime'][0:5]

0    162.0
1    169.0
2    148.0
3    165.0
4    132.0
Name: runtime, dtype: float64

In [40]:
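# Replace missing overviews with a placeholder string. Assigning the result back
# is more robust than calling fillna(inplace=True) on a selected column.
df['overview'] = df['overview'].fillna(value='Overview not available')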

In [41]:
df['overview'].isnull().sum()

0

Next, for the column 'release_date', we replace the missing data by propagating non-missing values: each missing entry takes the closest non-missing value (in the same column) in either the forward or the backward direction. The function fillna() supports this through its method parameter, which accepts two values:

- ffill: for forward propagation
- bfill: for backward propagation

In [42]:
df.loc[df['release_date'].isnull(), 'release_date']

4553    NaN
Name: release_date, dtype: object

In [43]:
df.loc[4552:4554, 'release_date']  

4552    2012-03-28
4553           NaN
4554    2015-03-10
Name: release_date, dtype: object

In [44]:
df['release_date'].fillna(method='ffill')[4552:4555]

4552    2012-03-28
4553    2012-03-28
4554    2015-03-10
Name: release_date, dtype: object

In [45]:
df['release_date'].fillna(method='bfill')[4552:4555]

4552    2012-03-28
4553    2015-03-10
4554    2015-03-10
Name: release_date, dtype: object
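Two points are worth noting about the cells above. First, they only display the filled Series; since the result is not assigned back, the NaN in df['release_date'] is still there. Second, recent Pandas releases deprecate the method argument of fillna() in favour of the dedicated ffill() and bfill() methods. A minimal sketch, using a hypothetical stand-in Series for the three rows shown above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rows 4552 to 4554 of the release_date column.
dates = pd.Series(["2012-03-28", np.nan, "2015-03-10"],
                  index=[4552, 4553, 4554])

# Forward fill: copy the previous non-missing value downwards.
forward = dates.ffill()

# Backward fill: copy the next non-missing value upwards.
backward = dates.bfill()

# Neither call modifies `dates`; assign the result to keep it.
still_missing = dates.isnull().sum()
dates = dates.ffill()

print(forward[4553], backward[4553], still_missing, dates.isnull().sum())
```

Whether forward or backward propagation is sensible depends on the ordering of the rows; for unordered data, the neighbouring value is essentially arbitrary.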

In [47]:
df['runtime'].mean()


106.87585919600083

In [48]:
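# Replace missing runtimes with the column mean. Assigning the result back
# is more robust than calling fillna(inplace=True) on a selected column.
df['runtime'] = df['runtime'].fillna(value=df['runtime'].mean())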

In [49]:
df.isnull().sum().sum()

1
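The per-column replacements above can also be written as a single call, because DataFrame.fillna() accepts a dictionary that maps column names to fill values. The sketch below uses a small made-up DataFrame (`toy` and `fill_values` are illustrative names, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring two of the columns treated above.
toy = pd.DataFrame({
    "overview": ["A heist goes wrong.", None, "Two friends travel."],
    "runtime": [100.0, np.nan, 120.0],
})

# One fill value per column: a placeholder string and the column mean.
fill_values = {
    "overview": "Overview not available",
    "runtime": toy["runtime"].mean(),
}

# fillna() returns a new DataFrame; columns not in the dict are left as-is.
filled = toy.fillna(value=fill_values)

print(filled)
print(filled.isnull().sum().sum())
```

Keeping the fill values in one dictionary makes the chosen imputation strategy for each column easy to review.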