# The missing values

<u>A row in a DataFrame represents an observation or a data point</u>. 
<u>A column is a feature or attribute of that observation</u>.

In some cases, we don’t have every feature value for every observation. Suppose a DataFrame holds information about bank customers, such as name, age, income, and address. If the age of one customer is unknown, that entry is a missing value.

Missing values are essentially data we don’t have. They can arise for many reasons, such as bad input or problems during a transformation step. Handling them is an important part of data cleaning and preprocessing, and the Pandas library provides flexible methods for doing so efficiently.

![image.png](attachment:901ecb78-e08e-42ac-a0bd-51fb882005e1.png)
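
As a minimal sketch of the bank customer example above, the snippet below builds a small DataFrame in which one customer's age is unknown and counts the missing values per column with `isna()`. The customer names and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; the age of the second customer is unknown.
customers = pd.DataFrame({
    "name": ["Jane", "John", "Emily"],
    "age": [34, np.nan, 41],
    "income": [52000, 61000, 48000],
})

# isna() marks each missing cell with True; sum() counts them per column.
missing_per_column = customers.isna().sum()
print(missing_per_column)
```

Only the `age` column should report a missing entry here, since it is the only column with an unknown value.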

# Missing value types

To handle missing values robustly, we need a standard way to represent them. The default missing value marker in a <font color='red'>DataFrame</font> is <font color='red'>NaN</font>. However, <font color='red'>NaN</font> is a floating-point value, so it can’t be stored in an ordinary integer column. Whenever an <font color='red'>integer</font> column contains a <font color='red'>NaN</font> value, the data type of the whole column is upcast to <font color='red'>float</font>.

To overcome this issue, Pandas 1.0 introduced a new missing value representation that works with integers, <font color='red'>&lt;NA&gt;</font>. To use it, we need to explicitly declare the data type as <font color='red'>pd.Int64Dtype()</font>. Let’s look at an example to demonstrate the difference clearly. We’ll first create a <font color='red'>DataFrame</font> that contains some missing values. The Pandas library accepts both Python’s None and NumPy’s np.nan as missing values, so we can use either one to mark them.

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, np.nan],
    "B": [2.4, 6.2, 5.1, np.nan],
    "C": ["foo","zoo","bar", np.nan]
})

print(df)

     A    B    C
0  1.0  2.4  foo
1  2.0  6.2  zoo
2  3.0  5.1  bar
3  NaN  NaN  NaN
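
The upcast can also be checked directly by inspecting the data types. The sketch below rebuilds the two numeric columns from the cell above; the variable names `no_missing` and `with_none` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, np.nan],
    "B": [2.4, 6.2, 5.1, np.nan],
})

# Column A holds whole numbers, but the NaN forces a float dtype.
print(df.dtypes)

# The same integers without a missing value keep an integer dtype.
no_missing = pd.Series([1, 2, 3])
print(no_missing.dtype)

# None is treated like np.nan in a numeric column.
with_none = pd.Series([1, 2, 3, None])
print(with_none.dtype)
print(with_none.isna().tolist())
```

Comparing the printed data types shows that it is the missing value, not the numbers themselves, that changes the column type.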


In [5]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 6.77],
    "B": [2.4, 6.2, 5.1, np.nan],
    "C": ["foo","zoo","bar", np.nan]
})

print(df)

      A    B    C
0  1.00  2.4  foo
1  2.00  6.2  zoo
2  3.00  5.1  bar
3  6.77  NaN  NaN


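As the output of the first cell shows, the values in column <font color='red'>A</font> are converted to <font color='red'>float</font> because of the missing value in the last row. (In the second cell, column <font color='red'>A</font> is <font color='red'>float</font> simply because it contains the non-integer value 6.77.) If we change the data type of the column with the missing value to <font color='red'>pd.Int64Dtype()</font>, the values will be integers.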

In [6]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, np.nan],
    "B": [2.4, 6.2, 5.1, np.nan],
    "C": ["foo","zoo","bar", np.nan]
})

# Cast column A to the nullable integer type so its values stay integers.
df["A"] = df["A"].astype(pd.Int64Dtype())

print(df)

      A    B    C
0     1  2.4  foo
1     2  6.2  zoo
2     3  5.1  bar
3  <NA>  NaN  NaN


We now have integer values in column <font color='red'>A</font>, and the missing entry is shown as <font color='red'>&lt;NA&gt;</font>.
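
As an alternative sketch, the nullable integer type can be declared when the data is created instead of being cast afterwards. The example assumes the string alias `"Int64"` (with a capital I), which Pandas accepts in place of `pd.Int64Dtype()`; the Series name `a` is used here only for illustration.

```python
import pandas as pd

# Declaring the nullable integer type up front avoids the float upcast.
# "Int64" (capital I) is the string alias for pd.Int64Dtype().
a = pd.Series([1, 2, 3, None], dtype="Int64", name="A")

print(a.dtype)
print(a.isna().tolist())

# The missing entry is stored as pd.NA rather than np.nan.
print(a.iloc[3] is pd.NA)
```

Declaring the type up front means the column never passes through an intermediate float representation.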