https://medium.com/@akaivdo/how-to-process-null-values-in-pandas-b4364b439f53

### How to Process Null Values in Pandas

In this article, we will talk about how to process null (empty) data in a pandas DataFrame. Real-world data rarely has a value in every row and column, so when we encounter a data set containing null values, we have to decide how to handle them according to our needs, for example by deleting or replacing them. We will cover three operations:

* Delete rows or columns containing null values (dropna())

* Replace null values with specified values (fillna())

* Select rows or columns containing null values

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_excel('DATA.xlsx')

In [4]:
df

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
1,Noah,M,,B,Science,
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,
6,Benjamin,M,eee,A,Technology,
7,Lucas,M,ffff,B,Technology,85.0
8,Henry,M,,A,Engineering,
9,A lexander,M,hhh,B,Engineering,100.0
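
If you do not have DATA.xlsx at hand, a small stand-in frame is enough to follow along. The sketch below copies a few rows from the table above and puts NaN where the table shows an empty cell; the variable names `sample` and `null_counts` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of DATA.xlsx; values are copied
# from the table above, with NaN where the table shows an empty cell.
sample = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "Elijah", "James", "Henry"],
        "address": ["aaa", np.nan, "CCC", "ggg", np.nan],
        "class": ["A", "B", "B", "C", "A"],
        "score": [71.0, np.nan, 68.0, np.nan, np.nan],
    }
)

# Count the null values in each column before deciding how to handle them.
null_counts = sample.isna().sum()
print(null_counts)
```

Counting nulls per column first makes it easier to decide whether deleting or replacing is the better choice for a given column.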


### Delete data containing null values by dropna()

If we want to drop every row that contains a null value, we just call the dropna() method of the DataFrame.

In [5]:
df.dropna()

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
7,Lucas,M,ffff,B,Technology,85.0
9,A lexander,M,hhh,B,Engineering,100.0
11,Emma,F,iii,B,Engineering,89.0
13,Charlotte,F,kkk,B,Math,88.0
16,Isabella,F,mmm,A,Math,78.0
17,Mia,F,nnn,B,Math,86.0


In [6]:
# If you want to drop the columns containing NaN values instead of the rows, use the axis option and specify "columns".

In [7]:
df.dropna(axis="columns")

Unnamed: 0,name,sex,class,curriculum
0,Liam,M,A,Science
1,Noah,M,B,Science
2,\n Oliver,\nM,C,\nScience
3,Elijah,M,B,Technology
4,William,M,A,Technology
5,James,M,C,Technology
6,Benjamin,M,A,Technology
7,Lucas,M,B,Technology
8,Henry,M,A,Engineering
9,A lexander,M,B,Engineering
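
As a minimal sketch of the difference between the two axes, the example below uses a tiny illustrative frame (`df_demo` is not the article's data). It also shows that dropna() returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the article's DATA.xlsx.
df_demo = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "Elijah"],
        "address": ["aaa", np.nan, "CCC"],
        "score": [71.0, np.nan, 68.0],
    }
)

# Default (axis=0 / "index"): drop every row that has at least one null.
rows_kept = df_demo.dropna()

# axis="columns": drop every column that has at least one null.
cols_kept = df_demo.dropna(axis="columns")

print(rows_kept)
print(cols_kept)
# dropna() returned new objects; df_demo still has all 3 rows and 3 columns.
print(df_demo.shape)
```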


In [None]:
# To remove rows containing at least one null value, use the how="any" option. This is the default, so the result is the same as df.dropna().

In [8]:
df.dropna(how="any")

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
7,Lucas,M,ffff,B,Technology,85.0
9,A lexander,M,hhh,B,Engineering,100.0
11,Emma,F,iii,B,Engineering,89.0
13,Charlotte,F,kkk,B,Math,88.0
16,Isabella,F,mmm,A,Math,78.0
17,Mia,F,nnn,B,Math,86.0


In [None]:
# You can also remove columns in which all values are null by combining how="all" with axis=1. No column in our data is entirely null, so nothing is removed here.

In [38]:
df.dropna(how="all", axis=1)


Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
1,Noah,M,,B,Science,
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,
6,Benjamin,M,eee,A,Technology,
7,Lucas,M,ffff,B,Technology,85.0
8,Henry,M,,A,Engineering,
9,A lexander,M,hhh,B,Engineering,100.0


In [39]:

df.dropna(how="any", axis=1)

Unnamed: 0,name,sex,class,curriculum
0,Liam,M,A,Science
1,Noah,M,B,Science
2,\n Oliver,\nM,C,\nScience
3,Elijah,M,B,Technology
4,William,M,A,Technology
5,James,M,C,Technology
6,Benjamin,M,A,Technology
7,Lucas,M,B,Technology
8,Henry,M,A,Engineering
9,A lexander,M,B,Engineering
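
The effect of how="all" is easier to see when one column really is empty. The sketch below assumes a hypothetical `notes` column that contains only nulls; it is not part of the article's data.

```python
import numpy as np
import pandas as pd

# "notes" is a hypothetical column that is entirely null.
demo = pd.DataFrame(
    {
        "name": ["Liam", "Noah"],
        "address": ["aaa", np.nan],
        "notes": [np.nan, np.nan],
    }
)

# how="all": drop a column only if every value in it is null.
drop_all = demo.dropna(how="all", axis=1)

# how="any": drop a column if at least one value in it is null.
drop_any = demo.dropna(how="any", axis=1)

print(list(drop_all.columns))
print(list(drop_any.columns))
```

With how="all" only `notes` disappears, while how="any" also removes `address` because it has one missing value.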


If you only want to remove rows based on whether certain columns contain null values, you can use the subset option.
For example, to remove all rows whose address is empty, we can use the code below.

In [13]:
df.dropna(subset = ["address"])

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,
6,Benjamin,M,eee,A,Technology,
7,Lucas,M,ffff,B,Technology,85.0
9,A lexander,M,hhh,B,Engineering,100.0
11,Emma,F,iii,B,Engineering,89.0
12,Ava,F,jjj,A,Engineering,
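
The subset option also accepts several columns and can be combined with how. The following sketch uses a small illustrative frame (`demo` is an assumption, not the article's data) to compare the variants.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the article's DATA.xlsx.
demo = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "James", "Sophia"],
        "address": ["aaa", np.nan, "ggg", np.nan],
        "score": [71.0, np.nan, np.nan, 74.0],
    }
)

# Keep rows whose address is present, whatever the score is.
has_address = demo.dropna(subset=["address"])

# Keep rows where both address and score are present (how="any" is the default).
both_present = demo.dropna(subset=["address", "score"])

# Drop a row only when address and score are both null.
either_present = demo.dropna(subset=["address", "score"], how="all")

print(has_address.index.tolist())
print(both_present.index.tolist())
print(either_present.index.tolist())
```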


### For a Series, just use dropna()

In [14]:
df["score"].dropna()

0      71.0
2      80.0
3      68.0
4      77.0
7      85.0
9     100.0
10     68.0
11     89.0
13     88.0
14     74.0
16     78.0
17     86.0
18     99.0
Name: score, dtype: float64
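
Notice that the original index labels survive the drop, which is why the output above jumps from 0 to 2. If consecutive labels are needed afterwards, one option is reset_index(drop=True), sketched here on an illustrative Series.

```python
import numpy as np
import pandas as pd

# Illustrative scores, not the article's full column.
scores = pd.Series([71.0, np.nan, 80.0, np.nan, 85.0], name="score")

# dropna() keeps the original labels of the surviving values.
clean = scores.dropna()

# reset_index(drop=True) renumbers them from 0 and discards the old labels.
renumbered = clean.reset_index(drop=True)

print(clean.index.tolist())
print(renumbered.index.tolist())
```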

### Fill null values with a specified value

You can use the fillna() method of a DataFrame to fill null values with other values.

### Replace all null values with the same value

In [15]:
df.fillna("N/A")

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
1,Noah,M,N/A,B,Science,N/A
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,N/A
6,Benjamin,M,eee,A,Technology,N/A
7,Lucas,M,ffff,B,Technology,85.0
8,Henry,M,N/A,A,Engineering,N/A
9,A lexander,M,hhh,B,Engineering,100.0


### Replace null values in each column with a different value

In [16]:
df.fillna({"address":"zzz","score":0})

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
1,Noah,M,zzz,B,Science,0.0
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,0.0
6,Benjamin,M,eee,A,Technology,0.0
7,Lucas,M,ffff,B,Technology,85.0
8,Henry,M,zzz,A,Engineering,0.0
9,A lexander,M,hhh,B,Engineering,100.0
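
When a dictionary is passed, only the listed columns are filled; nulls in any other column stay as they are. A minimal sketch on an illustrative frame (here `class` is given a null purely for demonstration):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the null in "class" is added only for this demo.
demo = pd.DataFrame(
    {
        "address": ["aaa", np.nan],
        "class": ["A", np.nan],
        "score": [71.0, np.nan],
    }
)

# Fill address and score with their own defaults; "class" is not listed.
filled = demo.fillna({"address": "zzz", "score": 0})

print(filled)
# The original frame is unchanged because fillna() returns a new object.
print(demo["score"].isna().sum())
```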


### Fill numeric column with the mean or median value

The mean() method returns the mean of each numeric column, and the median() method returns the median. We can use either result to fill the null values in the corresponding column.

In [17]:
# In our data, only one numeric column (score) has null values, so we will work with that Series directly.

In [18]:
df["score"].mean()

81.76923076923077

In [19]:
df["score"].fillna(df["score"].mean())

0      71.000000
1      81.769231
2      80.000000
3      68.000000
4      77.000000
5      81.769231
6      81.769231
7      85.000000
8      81.769231
9     100.000000
10     68.000000
11     89.000000
12     81.769231
13     88.000000
14     74.000000
15     81.769231
16     78.000000
17     86.000000
18     99.000000
19     81.769231
Name: score, dtype: float64

In [20]:
df.score.median()

80.0

In [21]:
df.score.fillna(df.score.median())

0      71.0
1      80.0
2      80.0
3      68.0
4      77.0
5      80.0
6      80.0
7      85.0
8      80.0
9     100.0
10     68.0
11     89.0
12     80.0
13     88.0
14     74.0
15     80.0
16     78.0
17     86.0
18     99.0
19     80.0
Name: score, dtype: float64
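
If a frame has several numeric columns, the same idea can be applied to all of them at once. The sketch below is one possible approach, assuming a hypothetical second numeric column `absences`; numeric_only=True keeps text columns such as `name` out of the calculation.

```python
import numpy as np
import pandas as pd

# Illustrative frame; "absences" is a hypothetical second numeric column.
demo = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "Oliver", "Elijah"],
        "score": [70.0, np.nan, 80.0, 96.0],
        "absences": [1.0, 3.0, np.nan, 2.0],
    }
)

# One mean and one median per numeric column, as Series indexed by column name.
means = demo.mean(numeric_only=True)
medians = demo.median(numeric_only=True)

# Passing a Series to fillna() fills each column with its matching value.
filled_mean = demo.fillna(means)
filled_median = demo.fillna(medians)

print(filled_mean)
print(filled_median)
```

The mean is sensitive to outliers, so the median can be the safer default when a column is skewed.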

### Fill null values with previous value or next value

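We can specify method="ffill" to fill null values with the previous value, and method="bfill" to fill null values with the next value. This is very useful for time series data. Note that in pandas 2.1 and later, fillna(method=...) is deprecated; df.ffill() and df.bfill() are the equivalent calls.

In [25]:
df.fillna(method="ffill")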

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
1,Noah,M,aaa,B,Science,71.0
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,77.0
6,Benjamin,M,eee,A,Technology,77.0
7,Lucas,M,ffff,B,Technology,85.0
8,Henry,M,ffff,A,Engineering,85.0
9,A lexander,M,hhh,B,Engineering,100.0


In [None]:
# Fill null values with the next value. If the null value is in the last row, it stays NaN.

In [26]:
df.fillna(method="bfill")

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
1,Noah,M,bbb,B,Science,80.0
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,85.0
6,Benjamin,M,eee,A,Technology,85.0
7,Lucas,M,ffff,B,Technology,85.0
8,Henry,M,hhh,A,Engineering,100.0
9,A lexander,M,hhh,B,Engineering,100.0
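
The same fills can be written with the dedicated ffill() and bfill() methods, which avoid the deprecated method= argument on recent pandas versions. The sketch below uses an illustrative Series (`readings` is an assumed name) and also shows the limit option, which caps how many consecutive nulls are filled.

```python
import numpy as np
import pandas as pd

# Illustrative values with nulls at the start, in the middle and at the end.
readings = pd.Series([np.nan, 71.0, np.nan, np.nan, 85.0, np.nan], name="score")

# Forward fill: a leading null has no previous value, so it stays NaN.
forward = readings.ffill()

# Backward fill: a trailing null has no next value, so it stays NaN.
backward = readings.bfill()

# limit=1 fills at most one consecutive null after each valid value.
capped = readings.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(capped.tolist())
```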


### Select rows or columns with null values

In [27]:
df[df["address"].isnull()]

Unnamed: 0,name,sex,address,class,curriculum,score
1,Noah,M,,B,Science,
8,Henry,M,,A,Engineering,
10,Olivia,F,,C,Engineering,68.0
14,Sophia,F,,A,Math,74.0
19,Harper,F,,B,Math,


In [40]:
df[df["address"].notnull()]

Unnamed: 0,name,sex,address,class,curriculum,score
0,Liam,M,aaa,A,Science,71.0
2,\n Oliver,\nM,bbb,C,\nScience,80.0
3,Elijah,M,CCC,B,Technology,68.0
4,William,M,ddd,A,Technology,77.0
5,James,M,ggg,C,Technology,
6,Benjamin,M,eee,A,Technology,
7,Lucas,M,ffff,B,Technology,85.0
9,A lexander,M,hhh,B,Engineering,100.0
11,Emma,F,iii,B,Engineering,89.0
12,Ava,F,jjj,A,Engineering,
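
The boolean masks produced by isnull() and notnull() can be combined, and the same idea works for picking out columns. A minimal sketch on an illustrative frame, using isna() and notna(), which are aliases of isnull() and notnull():

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the article's DATA.xlsx.
demo = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "James", "Sophia"],
        "address": ["aaa", np.nan, "ggg", np.nan],
        "score": [71.0, np.nan, np.nan, 74.0],
    }
)

# Rows whose address is missing.
missing_address = demo[demo["address"].isna()]

# Rows that have an address but no score; combine masks with & and parentheses.
address_but_no_score = demo[(demo["address"].notna()) & (demo["score"].isna())]

# Names of the columns that contain at least one null value.
cols_with_nulls = demo.columns[demo.isna().any()].tolist()

print(missing_address.index.tolist())
print(address_but_no_score.index.tolist())
print(cols_with_nulls)
```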


In [None]:
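# To check whether a value in a pandas DataFrame is null, we must use isnull() as shown below rather than general Python syntax. (The result here is empty because the name in row 2 is stored with leading whitespace, "\n Oliver", so the filter on "Oliver" matches no rows.)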

In [41]:
df[df["name"] == "Oliver"]["address"].isnull()

Series([], Name: address, dtype: bool)
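
The empty Series above comes from the data itself: the name in row 2 is stored as "\n Oliver", so an exact comparison with "Oliver" matches nothing. One possible fix, sketched on a two-row illustrative frame, is to strip the whitespace before comparing.

```python
import pandas as pd

# Two illustrative rows; the second name has the same leading "\n " as in the table.
demo = pd.DataFrame(
    {
        "name": ["Liam", "\n Oliver"],
        "address": ["aaa", "bbb"],
    }
)

# Exact comparison fails because of the leading whitespace.
raw_match = demo[demo["name"] == "Oliver"]

# Stripping whitespace first makes the comparison match row 1.
clean_match = demo[demo["name"].str.strip() == "Oliver"]

# Now isnull() has a row to report on.
is_address_null = clean_match["address"].isnull()

print(raw_match.empty)
print(is_address_null)
```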

In [33]:
# If we use general Python syntax instead (for example == None or == np.nan), we may encounter errors or get unexpected results, because NaN never compares equal to anything, including itself.
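
To make the pitfall concrete, here is a short sketch comparing an equality test with isna() on an illustrative Series. The equality test silently finds nothing, while isna() finds the missing value.

```python
import numpy as np
import pandas as pd

missing = np.nan

# NaN is not equal to anything, not even to itself.
equals_nan = missing == np.nan

# Illustrative scores with one missing value.
scores = pd.Series([71.0, np.nan, 80.0], name="score")

# An equality test therefore never selects the missing row.
by_equality = scores[scores == np.nan]

# isna() is the reliable way to find it.
by_isna = scores[scores.isna()]

print(equals_nan)
print(len(by_equality))
print(by_isna.index.tolist())
```

For a single scalar value, pd.isna(value) performs the same check and also treats None as missing.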