# How To Randomly Add NaN to Pandas Dataframe?
https://cmdlinetips.com/2019/05/how-to-randomly-add-nan-to-pandas-dataframe/

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [5]:
data_url = "https://goo.gl/ioc2Td"      # Gapminder data
gapminder = pd.read_csv(data_url)
print(gapminder.iloc[0:5,0:4])

  continent       country  gdpPercap_1952  gdpPercap_1957
0    Africa       Algeria     2449.008185     3013.976023
1    Africa        Angola     3520.610273     3827.940465
2    Africa         Benin     1062.752200      959.601080
3    Africa      Botswana      851.241141      918.232535
4    Africa  Burkina Faso      543.255241      617.183465


In [6]:
gapminder.shape

(142, 38)

In [3]:
# Let us drop the two text columns from the dataframe using Pandas' drop function, so that the resulting dataframe contains only numerical data.
gapminder = gapminder.drop(['continent','country'], axis=1)

In [12]:
#gapminder.count()
gapminder.count().sum()   # Total number of non-null data points

5396
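
Note that `count()` tallies only the non-null entries in each column, so `count().sum()` matches `DataFrame.size` only while nothing is missing. A minimal sketch on a hypothetical toy dataframe (not the Gapminder data) illustrates the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with one missing value, for illustration only.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})

non_null = toy.count().sum()   # non-null entries, summed over columns
total = toy.size               # rows * columns, regardless of NaNs
print(non_null, total)
```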

Let us create a boolean NumPy array of the same shape as our Pandas dataframe. We build the boolean 2D array so that about 50% of its elements are True and the rest are False. We will use NumPy’s random module to draw uniform random numbers between 0 and 1 and compare them to 0.5 to obtain the boolean array.

In [21]:
nan_mat = np.random.random(gapminder.shape) < 0.5
print(type(nan_mat))
print(len(nan_mat))
print(len(nan_mat[0]))
nan_mat

<class 'numpy.ndarray'>
142
38


array([[ True,  True, False, ..., False, False,  True],
       [ True,  True,  True, ..., False,  True, False],
       [ True, False,  True, ..., False,  True, False],
       ...,
       [ True, False, False, ..., False, False,  True],
       [False, False, False, ...,  True, False, False],
       [False, False,  True, ..., False,  True, False]])
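
Because the array is random, its contents change on every run. If reproducible results are needed, one option is a seeded generator. The following is a minimal sketch, assuming NumPy 1.17 or later (for `np.random.default_rng`); the seed value 42 is arbitrary and the shape is hard-coded to mirror the dataframe above:

```python
import numpy as np

# Assumption: the seed 42 is arbitrary; any fixed integer gives repeatable output.
rng = np.random.default_rng(42)
shape = (142, 38)                   # same shape as the dataframe above

nan_mat = rng.random(shape) < 0.5   # each element is True with probability 0.5

print(nan_mat.shape)
print(nan_mat.mean())               # fraction of True entries, close to 0.5
```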

We can get the total number of True elements, i.e. the total number of NaNs we will be adding to the dataframe, using NumPy’s sum function.

In [13]:
nan_mat.sum()

2693

We can now apply Pandas’ mask function to the dataframe, using the boolean array as the condition. For each element, mask keeps the original value if the condition is False and replaces it with NaN if the condition is True.

In [14]:
gapminder_NaN = gapminder.mask(nan_mat)
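
To see the behaviour of `mask` in isolation, here is a minimal sketch on a hypothetical toy dataframe (the names `df`, `cond` and `masked` are illustrative and not part of the Gapminder example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe, not part of the Gapminder data.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Boolean condition with the same shape as df.
cond = np.array([[True, False],
                 [False, False],
                 [False, True]])

masked = df.mask(cond)   # entries where cond is True become NaN
print(masked)
```

`mask` returns a new dataframe by default, so the original `df` is left unchanged.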

In [15]:
# We can verify that the dataframe has NaNs introduced randomly as we intended.
gapminder_NaN.iloc[0:3,0:5]


  continent  country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962
0    Africa  Algeria             NaN     3013.976023     2550.816880
1       NaN      NaN     3520.610273             NaN     4269.276742
2       NaN      NaN     1062.752200      959.601080      949.499064


In [16]:
# We can count the total number of nulls or NaNs and see that it is approximately 50% of all entries.
gapminder_NaN.isnull().sum(axis = 0).sum()

2693
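
The steps above can be wrapped into a small reusable helper. This is only a sketch: the function name `add_random_nans` and its `frac` and `seed` parameters are hypothetical, and it is demonstrated on a synthetic dataframe instead of the Gapminder data.

```python
import numpy as np
import pandas as pd

def add_random_nans(df, frac=0.5, seed=None):
    """Return a copy of df with roughly `frac` of its entries set to NaN.

    Hypothetical helper: each entry is masked independently with
    probability `frac`, so the realised fraction varies slightly.
    """
    rng = np.random.default_rng(seed)
    nan_mat = rng.random(df.shape) < frac   # boolean array, same shape as df
    return df.mask(nan_mat)

# Synthetic 200 x 10 dataframe with no missing values to start with.
toy = pd.DataFrame(np.arange(2000, dtype=float).reshape(200, 10))
toy_nan = add_random_nans(toy, frac=0.2, seed=0)

nan_fraction = toy_nan.isnull().sum().sum() / toy.size
print(nan_fraction)   # close to 0.2
```

Because each entry is masked independently, the number of NaNs is only approximately `frac` times the total number of entries, just as 2693 is close to, but not exactly, half of 5396.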