# Handling rows with NaN in pandas DataFrames
Collected data is rarely perfect. A common problem is that not every record could be collected completely. In such cases, the missing fields in a pandas DataFrame are marked as 'not a number', or `NaN`. We cannot simply overwrite these missing values with `0`, for example, because that would distort the statistics of the dataset. A common approach is to exclude incomplete rows from the table, depending on what should be analysed.
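
To illustrate why filling with `0` is problematic, here is a minimal sketch using a small, made-up series of counts (the values are hypothetical and not taken from the BBBC001 table). It compares the mean that pandas computes when it skips `NaN` with the mean after replacing `NaN` by `0`.

```python
import numpy as np
import pandas as pd

# Hypothetical counts for illustration only; one measurement is missing
counts = pd.Series([370.0, 365.0, np.nan, 390.0])

# pandas skips NaN by default when computing the mean
mean_skipping_nan = counts.mean()

# After fillna(0), the zero is treated like a real measurement
mean_with_zeros = counts.fillna(0).mean()

print(mean_skipping_nan)  # 375.0
print(mean_with_zeros)    # 281.25
```

The single artificial zero pulls the mean down by almost 100, even though no annotator ever counted zero objects.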

See also
* [How to drop empty rows from a Pandas dataframe in Python](https://www.kite.com/python/answers/how-to-drop-empty-rows-from-a-pandas-dataframe-in-python#:~:text=Use%20df.,contain%20NaN%20under%20those%20columns.)

The dataset of counts for the BBBC001 image is a good example.

In [1]:
import pandas as pd
import numpy as np

In [2]:
dat = pd.read_csv('https://raw.githubusercontent.com/BiAPoL/Bio-image_Analysis_with_Python/main/biostatistics/data/BBBC001.csv', header=1, sep=';')

dat

Unnamed: 0,Annotator name (pseudonym is ok),BBBC001 manual count,BBBC001 CLIJ Voronoi Otsu Labeling,BBBC001 StarDist,BBBC001 Find Maxima
0,Robert,370,367.0,379.0,
1,Lenka B.,365,360.0,373.0,375.0
2,Jozef F.,390,367.0,379.0,426.0
3,Lukas M..,370,367.0,,
4,Luisa W.,383,,,
...,...,...,...,...,...
83,Lucas V.,356,367.0,,
84,Lara L.,368,367.0,,
85,Laura M.,367,367.0,,
86,Julia,367,,,


pandas' [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method removes all rows that contain at least one `NaN` value:

In [3]:
dat.dropna()

Unnamed: 0,Annotator name (pseudonym is ok),BBBC001 manual count,BBBC001 CLIJ Voronoi Otsu Labeling,BBBC001 StarDist,BBBC001 Find Maxima
1,Lenka B.,365,360.0,373.0,375.0
2,Jozef F.,390,367.0,379.0,426.0
10,Eric S,396,367.0,380.0,264.0
15,Lucie K.,392,409.0,379.0,426.0
25,Petra G.,370,367.0,387.0,426.0
38,Lauren S,376,370.0,376.0,426.0
48,Lukas C.,386,367.0,379.0,369.0
54,Aemilia,377,376.0,373.0,426.0
62,GMN,384,370.0,379.0,382.0
79,Lena T.,389,371.0,382.0,426.0


This reduces the number of rows in our dataset dramatically. It may be possible to ignore specific columns and thereby keep more of the remaining rows. To take a closer look, we can iterate over the columns and count the number of `NaN` elements in each.

In [4]:
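# Iterating over a DataFrame yields its column names
for column in dat:
    # np.isnan only works on numeric data, so skip text (object) columns
    if dat[column].dtype != object:
        print("Number of NaNs in " + column, len(dat[np.isnan(dat[column])]))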

Number of NaNs in BBBC001 manual count 0
Number of NaNs in BBBC001 CLIJ Voronoi Otsu Labeling 55
Number of NaNs in BBBC001 StarDist 71
Number of NaNs in BBBC001 Find Maxima 76
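
The same kind of per-column count can also be obtained without an explicit loop. The sketch below uses a small, hypothetical DataFrame (`toy` and its column names are made up for illustration) and combines `isna()` with `sum()`. Unlike `np.isnan`, `isna()` also works on text columns, so no dtype check is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; values are illustrative only
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "manual": [370, 365, 390],
    "automatic": [367.0, np.nan, np.nan],
})

# isna() marks missing cells as True; sum() counts the True values per column
nan_counts = toy.isna().sum()
print(nan_counts)
```

Applied to `dat`, this should report the same numbers as the loop above, plus a count for the text column.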


The counts for the BBBC001 columns show that we can analyse far more rows if we concentrate on the manual and Voronoi Otsu Labeling counts only.

The following command removes all rows that contain `NaN` in any of the listed columns:

In [5]:
dat.dropna(subset=['BBBC001 CLIJ Voronoi Otsu Labeling'])

Unnamed: 0,Annotator name (pseudonym is ok),BBBC001 manual count,BBBC001 CLIJ Voronoi Otsu Labeling,BBBC001 StarDist,BBBC001 Find Maxima
0,Robert,370,367.0,379.0,
1,Lenka B.,365,360.0,373.0,375.0
2,Jozef F.,390,367.0,379.0,426.0
3,Lukas M..,370,367.0,,
5,Niclas D.,382,367.0,,
9,G.J.P.,380,367.0,,
10,Eric S,396,367.0,380.0,264.0
12,Louis B.,377,367.0,,
13,Max N.,387,367.0,,
15,Lucie K.,392,409.0,379.0,426.0
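
Two details of `dropna` are easy to miss: it returns a new DataFrame instead of modifying the original, and `subset` accepts several columns. The following sketch demonstrates both on a hypothetical miniature table (`toy` and its column names are assumptions for illustration, not the real BBBC001 data). It also shows the `how="all"` option, which drops a row only if every listed column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; values are illustrative only
toy = pd.DataFrame({
    "manual": [370, 365, 390, 383],
    "voronoi_otsu": [367.0, 360.0, np.nan, np.nan],
    "stardist": [379.0, np.nan, 379.0, np.nan],
})

# dropna returns a new DataFrame; `toy` itself is left unchanged
both_present = toy.dropna(subset=["voronoi_otsu", "stardist"])

# how="all" drops a row only if every listed column is NaN
any_present = toy.dropna(subset=["voronoi_otsu", "stardist"], how="all")

print(len(toy), len(both_present), len(any_present))  # 4 1 3
```

To continue working with the reduced table, assign the result to a variable as done here; calling `dropna` without keeping its return value has no lasting effect.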
