# **Handling Missing Data**

Real-world datasets are rarely complete. These notes cover how missing data is represented in NumPy and Pandas and the main strategies for detecting, dropping, and filling it.

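# **Trade-offs in Missing Data Conventions**

Missing data can be represented in different ways, most commonly with a sentinel value such as None or NaN, or with a separate Boolean mask that flags the missing entries. Each convention has trade-offs: a sentinel restricts the range of valid values or forces a particular dtype, while a mask costs extra storage and computation. The choice affects both performance and compatibility with other tools.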
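
To make the trade-off concrete, here is a minimal sketch that stores the same data under both conventions. The arrays `raw` and `missing` are made up for illustration, and NumPy's masked arrays stand in for the mask convention; this is one possible illustration, not the approach Pandas itself takes.

```python
import numpy as np

# Hypothetical example data: four integers, the second one is missing.
raw = np.array([1, 2, 3, 4])
missing = np.array([False, True, False, False])

# Convention 1: a separate Boolean mask.
# The integer dtype is preserved, but a second array must be stored.
masked = np.ma.masked_array(raw, mask=missing)
masked_sum = masked.sum()  # masked entries are skipped

# Convention 2: a sentinel value.
# NaN exists only for floats, so the integers are converted first.
sentinel = raw.astype(float)
sentinel[missing] = np.nan
sentinel_sum = np.nansum(sentinel)  # NaN-aware sum

print(masked.dtype, masked_sum)
print(sentinel.dtype, sentinel_sum)
```

Both versions sum the three valid entries. The difference lies in what each one costs: an extra mask array in the first case, a change of dtype in the second.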

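# **Missing Data in Pandas**

Pandas marks missing data with sentinel values: NaN (Not a Number) for numerical data and the Python None object for object-dtype data. Because None is a Python object, a NumPy array that contains it must use dtype=object, which makes operations much slower and causes aggregations such as sum() to fail.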


In [35]:
# Import necessary libraries
import numpy as np
import pandas as pd

In [36]:
# Create a NumPy array with a None value
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [37]:
# Time the sum operation on a large integer array
%timeit np.arange(1E6, dtype=int).sum()

1.14 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [38]:
# Time the same sum on an object-dtype array; each element is a Python object, so this is much slower
%timeit np.arange(1E6, dtype=object).sum()

66.7 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [39]:
# Attempt to sum the array containing None, which results in a TypeError
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

# **NaN: Missing Numerical Data**

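NaN (Not a Number) is a special floating-point value, defined by the IEEE 754 standard, that NumPy and Pandas use to mark missing numerical values. Because it is a float, an array containing NaN keeps a native floating-point dtype and stays fast. The catch is that any arithmetic involving NaN yields NaN.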

In [40]:
# Create a NumPy array with a NaN value
vals2 = np.array([1, np.nan, 3, 4])
vals2

array([ 1., nan,  3.,  4.])

In [41]:
# Demonstrate arithmetic operation with NaN
1 + np.nan

nan

In [42]:
# Even multiplying by zero yields NaN
0 * np.nan

nan

In [43]:
# Attempt to perform aggregation functions on an array with NaN, which results in NaN
vals2.sum(), vals2.min(), vals2.max()

(np.float64(nan), np.float64(nan), np.float64(nan))

In [44]:
# Use NumPy functions that ignore NaN values for aggregation
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(np.float64(8.0), np.float64(1.0), np.float64(4.0))
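
One further property of NaN is worth a quick sketch: it does not compare equal to anything, including itself, so an equality test cannot locate it. The example below reuses the values from the cells above under the hypothetical name `vals`.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN is not equal to anything, itself included, so == never matches it.
eq_mask = vals == np.nan

# np.isnan is the reliable element-wise test for float arrays.
nan_mask = np.isnan(vals)

print(np.nan == np.nan)  # False
print(eq_mask)
print(nan_mask)

# The mask can be inverted to select only the valid entries.
valid_sum = vals[~nan_mask].sum()
```

Selecting with `~nan_mask` and then summing gives the same result as `np.nansum(vals)`.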

# **NaN and None in Pandas**

Pandas treats NaN and None as essentially interchangeable markers for missing values and converts between them where appropriate. Because NaN is a floating-point value, introducing a missing value into an integer Series upcasts it to float.

In [45]:
# Create a Pandas Series with both NaN and None values
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64


In [46]:
# Create a Pandas Series of integers
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64


In [47]:
# Assign None to an integer Series, which upcasts the dtype to float
x[0] = None
x

0    NaN
1    1.0
dtype: float64


# **Pandas Nullable Dtypes**

Newer Pandas versions provide nullable dtypes, such as Int32 (note the capital I) and boolean. These represent missing values natively with pd.NA, so integer and Boolean data keep their type instead of being upcast to float or object.

In [48]:
# Create a Pandas Series using a nullable integer dtype
pd.Series([1, np.nan, 2, None, pd.NA], dtype='Int32')

0       1
1    <NA>
2       2
3    <NA>
4    <NA>
dtype: Int32
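
As a small sketch of how nullable dtypes behave in practice, the example below uses made-up values and the hypothetical names `ints`, `flags`, and `converted`. It assumes a Pandas version that ships the nullable `Int64` and `boolean` dtypes and `convert_dtypes()` (1.0 or later).

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, None, 3], dtype="Int64")
flags = pd.Series([True, None, False], dtype="boolean")

# Arithmetic propagates pd.NA but keeps the nullable integer dtype.
doubled = ints * 2

# Reductions skip missing values by default (skipna=True).
total = ints.sum()

# convert_dtypes() switches existing data to nullable dtypes where it can;
# whole-number floats with NaN become a nullable integer column.
converted = pd.Series([1, np.nan, 3]).convert_dtypes()

print(doubled)
print(total)
print(flags.dtype, converted.dtype)
```

`convert_dtypes()` can be a convenient way to adopt nullable dtypes for data that has already been upcast to float.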


# **Operating on Null Values**

Pandas provides several methods for identifying, dropping, and filling null values:

isnull(): Generates a Boolean mask indicating missing values

notnull(): Opposite of isnull()

dropna(): Returns a filtered version of the data

fillna(): Returns a copy of the data with missing values filled or imputed

# **Detecting Null Values**

The isnull() and notnull() methods return a Boolean mask over a Series or DataFrame that marks where values are missing (or present). The mask can be used directly as an index to filter the data.

In [49]:
# Create a Pandas Series with mixed data types and missing values
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object


In [50]:
# Filter the Series to keep only non-null values
data[data.notnull()]

0        1
2    hello
dtype: object
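
The same masks work on a DataFrame, where a common use is counting missing values. The sketch below uses a hypothetical `frame` with made-up column names and values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in every column.
frame = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [np.nan, np.nan, 6],
    "c": ["x", None, "z"],
})

# isnull() returns a Boolean DataFrame; summing it counts True per column.
missing_per_column = frame.isnull().sum()

# any(axis=1) flags the rows that contain at least one missing value.
rows_with_missing = frame.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```

Counting first can help when choosing between dropping and filling, since it shows how much data each option would affect.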


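# **Dropping Null Values**

The dropna() method removes rows or columns that contain missing values. Its axis, how, and thresh parameters control which axis is filtered and how many missing values are tolerated.

In [51]: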
# Create a Pandas DataFrame with missing values
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [52]:
# Drop rows with any missing values
df.dropna()

     0    1  2
1  2.0  3.0  5


In [53]:
# Drop columns with any missing values
df.dropna(axis='columns')

   2
0  2
1  5
2  6


In [54]:
# Add a column with all missing values
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [55]:
# Drop columns where all values are missing
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [56]:
# Keep only rows with at least 3 non-missing values (thresh=3)
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN
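
Besides `how` and `thresh`, `dropna()` accepts a `subset` argument that limits the check to particular columns. The sketch below rebuilds the three-column frame from above under the hypothetical name `frame`, so it runs on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2],
                      [2, 3, 5],
                      [np.nan, 4, 6]])

# Only column 0 is inspected; the NaN in column 1 of row 0 is ignored.
kept = frame.dropna(subset=[0])

# thresh is the minimum number of NON-missing values a row needs to survive.
at_least_two = frame.dropna(thresh=2)    # every row has two or more values
at_least_three = frame.dropna(thresh=3)  # only the complete row remains

print(kept)
print(at_least_two.shape, at_least_three.shape)
```

Restricting the check with `subset` is useful when only some columns are essential to the analysis.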


# **Filling Null Values**

Rather than dropping missing values, you can replace them. fillna() substitutes a specified value, while ffill() (forward fill) and bfill() (back fill) propagate the previous or next valid value into each gap.

In [57]:
# Create a Pandas Series with missing values and a specific index
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'), dtype='Int32')
data

a       1
b    <NA>
c       2
d    <NA>
e       3
dtype: Int32


In [58]:
# Fill missing values with a constant value (0)
data.fillna(0)

a    1
b    0
c    2
d    0
e    3
dtype: Int32


In [59]:
# Fill missing values using forward fill; fillna(method='ffill') is deprecated in recent Pandas releases
data.ffill()

a    1
b    1
c    2
d    2
e    3
dtype: Int32


In [60]:
# Fill missing values using back fill; fillna(method='bfill') is deprecated in recent Pandas releases
data.bfill()

a    1
b    2
c    2
d    3
e    3
dtype: Int32


In [61]:
# Display the DataFrame again
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [62]:
# Forward-fill along each row (axis=1); a leading NaN has no previous value and stays missing
df.ffill(axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0
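
Two further options are sketched below: per-column fill values passed as a dict, and the `limit` argument that caps how far a fill propagates. The column names `height` and `count` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one gap per column.
frame = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "count": [np.nan, 4.0, 6.0],
})

# A dict maps column labels to fill values, so each column can use its own
# strategy: the column mean for "height", a constant 0 for "count".
filled = frame.fillna({"height": frame["height"].mean(), "count": 0})

# limit caps the number of consecutive missing values that are filled.
gappy = pd.Series([1.0, np.nan, np.nan, 4.0])
limited = gappy.ffill(limit=1)  # the second NaN in the run stays missing

print(filled)
print(limited)
```

Whether a mean, a constant, or a neighboring value is an appropriate fill depends on the data. The sketch shows only the mechanics.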
