# Handling Missing Data in Pandas

## Identifying Missing Data

In Pandas, the `isna()` and `isnull()` methods are used interchangeably to check for missing values within a DataFrame or Series. These methods have no functional difference; they yield identical results. Both methods generate a Boolean mask, where `True` indicates a missing value, and `False` indicates a non-missing value {cite:p}.

The choice between `isna()` and `isnull()` comes down to personal preference or coding style. In pandas, `isnull()` is simply an alias for `isna()`, so both names call the same functionality. The `isna()` name lines up with the related `dropna()` and `fillna()` methods, which some users find more readable [Molin and Jee, 2021, Pandas Developers, 2023].

<font color='Blue'><b>Example</b></font>:

In [None]:
import pandas as pd

data = pd.Series([1, None, 3, None, 5])

# Original Series
print("Original Series:")
print(data)

# Using isnull() to identify missing values
missing_values = data.isnull()

# displaying missing_values
print("\nIdentifying Missing Values:")
print(missing_values)

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/pd_Missing_Data_Fig1.png" alt="picture" width="400">
</center>

In [None]:
import pandas as pd

data = pd.Series([1, None, 3, None, 5])

# Original Series
print("Original Series:")
print(data)

# Using isna() to identify missing values (same as isnull())
missing_values = data.isna()

# displaying missing_values
print("\nIdentifying Missing Values with isna():")
print(missing_values)

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/pd_Missing_Data_Fig2.png" alt="picture" width="400">
</center>

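In practice, the Boolean mask is usually aggregated rather than printed. The following is a minimal sketch, assuming a small hypothetical DataFrame named `df_check` (not part of the examples above), that counts missing values per column, computes the share of missing values, selects the affected rows, and confirms that `isna()` and `isnull()` return the same mask:

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for this illustration
df_check = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                         'B': ['x', None, 'z', 'w']})

# Number of missing values in each column
missing_counts = df_check.isna().sum()
print(missing_counts)

# Fraction of missing values in each column
missing_share = df_check.isna().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_missing = df_check[df_check.isna().any(axis=1)]
print(rows_with_missing)

# isna() and isnull() produce identical masks
same_mask = df_check.isna().equals(df_check.isnull())
print(same_mask)
```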
## Eliminating Missing Data

In Pandas, the `dropna()` method is employed to eliminate missing values from a DataFrame. This method allows the removal of rows or columns that contain one or more missing values based on the specified axis. The fundamental syntax for `dropna()` is [Molin and Jee, 2021, Pandas Developers, 2023]:

```python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```

For a comprehensive reference, you can access the complete syntax description of the `dropna()` method [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html).

<font color='Blue'><b>Example</b></font>:

In [None]:
import pandas as pd

# Create a DataFrame with missing values
data = pd.DataFrame({'A': [1, 2, None, 4],
                     'B': [None, 6, 7, 8],
                     'C': [10, 11, None, None]})

# Drop rows containing any missing values
print('Drop rows containing any missing values:')
cleaned_data = data.dropna()
display(cleaned_data)

# Drop columns in which all values are missing (none here, so nothing is dropped)
print('Drop columns in which all values are missing:')
cleaned_data_cols = data.dropna(axis=1, how='all')
display(cleaned_data_cols)

# Keep only rows with at least 2 non-missing values (thresh=2)
print('Keep rows with at least 2 non-missing values:')
cleaned_data_thresh = data.dropna(thresh=2)
display(cleaned_data_thresh)

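The `subset`, `how`, and `thresh` parameters from the syntax line control which rows are dropped. The sketch below rebuilds the same DataFrame so that it runs on its own; the result variable names are illustrative, and `print` is used instead of `display` so the block also works outside a notebook:

```python
import pandas as pd

# Same layout as the DataFrame in the example above, rebuilt so the block is self-contained
data = pd.DataFrame({'A': [1, 2, None, 4],
                     'B': [None, 6, 7, 8],
                     'C': [10, 11, None, None]})

# Only look at column 'A' when deciding which rows to drop
rows_with_a = data.dropna(subset=['A'])
print(rows_with_a)

# Drop a row only if every value in it is missing (no such row here)
rows_not_all_missing = data.dropna(how='all')
print(rows_not_all_missing)

# Keep rows that have at least 2 non-missing values
rows_thresh = data.dropna(thresh=2)
print(rows_thresh)
```

Here `thresh` counts non-missing values, so a larger `thresh` is a stricter filter.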
## Filling Missing Data

### Constant Fill, Forward Fill, and Backward Fill

In Pandas, the `fillna()` function is a versatile tool for replacing missing or NaN (Not a Number) values within a DataFrame or Series. This method is particularly useful during data preprocessing or cleaning tasks, enabling effective handling of missing data [Molin and Jee, 2021, Pandas Developers, 2023].

**Syntax:**
```python
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
```
Available methods are `'ffill'` (or `'pad'`) for forward filling (propagating the last valid value forward) and `'bfill'` (or `'backfill'`) for backward filling (propagating the next valid value backward). Recent pandas releases deprecate the `method` argument of `fillna()` in favor of the dedicated `ffill()` and `bfill()` methods.

You can see the full description of the function [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html).

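Constant filling is the simplest use of `fillna()`. The following is a minimal sketch, assuming a hypothetical DataFrame `df_const` with one numeric and one text column; the replacement values `0.0`, `-999.0`, and `'Unknown'` are placeholders chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for this illustration
df_const = pd.DataFrame({'Temperature': [25.0, np.nan, 23.0],
                         'Station': ['North', None, 'South']})

# Replace missing temperatures with a single constant
temperature_filled = df_const['Temperature'].fillna(0.0)
print(temperature_filled)

# Use a different replacement value for each column by passing a dict
filled_by_column = df_const.fillna({'Temperature': -999.0, 'Station': 'Unknown'})
print(filled_by_column)
```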
<font color='Blue'><b>Example</b></font>:

In [None]:
import numpy as np
import pandas as pd

# Create a simple time series DataFrame with missing values
date_rng = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
            '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
            '2023-01-09', '2023-01-10']
data = {'Temperature': [25.0, 24.5, np.nan, 23.0, np.nan, 22.0, 21.5, np.nan, 20.0, 19.5]}
df = pd.DataFrame(data, index=date_rng)
print("Original Data:")
display(df)

# Forward fill to propagate the last valid observation forward
# (df.ffill() is equivalent to the older df.fillna(method='ffill'))
df_filled_ffill = df.ffill()
print('Forward fill to propagate the last valid observation forward:')
display(df_filled_ffill)

# Backward fill to propagate the next valid observation backward
# (df.bfill() is equivalent to the older df.fillna(method='bfill'))
df_filled_bfill = df.bfill()
print('Backward fill to propagate the next valid observation backward:')
display(df_filled_bfill)

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/pd_fill_04.png" alt="picture" width="800">
</center>

Missing values can also be filled with summary statistics such as the mean or median, or with custom aggregations suited to the context of our data.

In [None]:
# Fill missing values with the column mean (rounded to two decimals)
print("Fill NaN values with the mean:")
_mean = df['Temperature'].mean().round(2)
df_filled_mean = df.fillna(_mean)
display(df_filled_mean)

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/pd_fill_06.png" alt="picture" width="650">
</center>

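One possible custom aggregation is a group-wise mean, where each gap is filled from its own group instead of from the whole column. The sketch below assumes a hypothetical `readings` DataFrame with a `City` column; the names and values are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two cities; names and values are illustrative only
readings = pd.DataFrame({
    'City': ['Calgary', 'Calgary', 'Calgary', 'Miami', 'Miami', 'Miami'],
    'Temperature': [2.0, np.nan, 4.0, 28.0, 30.0, np.nan],
})

# Mean temperature of each city, aligned row by row with the original data
city_means = readings.groupby('City')['Temperature'].transform('mean')

# Fill each gap with the mean of its own city rather than the overall mean
readings['Temperature_filled'] = readings['Temperature'].fillna(city_means)
print(readings)
```

Replacing `'mean'` with `'median'` in `transform` gives the median-based variant.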
### Interpolation

The Pandas library in Python offers the `interpolate` function as a versatile tool for filling missing or NaN (Not-a-Number) values within a DataFrame or Series. This function primarily employs interpolation techniques to estimate and insert values where data gaps exist [Pandas Developers, 2023].

* **Key Parameter:**
The `method` parameter specifies the interpolation method to be used. Common methods include `'linear'` (the default), `'polynomial'`, `'spline'`, and `'nearest'`, among others; `'polynomial'` and `'spline'` also require an `order` argument. The choice of method depends on the nature of your data and the desired interpolation behavior.

* **How It Works:**
The `interpolate` function uses the specified interpolation method to estimate the missing values based on the surrounding data points. For instance, with linear interpolation, missing values are estimated as points lying on a straight line between the known data points. Polynomial interpolation uses polynomial functions to approximate the missing values more flexibly. Spline interpolation constructs piecewise-defined polynomials to capture complex data patterns.

* **Use Cases:**
Pandas' `interpolate` function is particularly useful in data preprocessing, time series analysis, and data imputation tasks. It aids in maintaining the integrity of the data while ensuring a more complete dataset for downstream analysis.

You can see the function description [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).

<font color='Blue'><b>Example:</b></font>

In [None]:
import pandas as pd
import numpy as np

# Interpolate missing values linearly
df_interpolate = df.interpolate(method='linear')

# Display the DataFrame with missing values filled using interpolation
print("DataFrame with Missing Values Filled:")
display(df_interpolate)

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/pd_fill_05.png" alt="picture" width="600">
</center>

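The choice of `method` matters most when observations are unevenly spaced. The following is a minimal sketch, assuming a hypothetical Series `s` with a `DatetimeIndex`, that contrasts `'linear'` with `'time'` and shows the `limit` parameter; all values are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly spaced daily readings (illustrative values only)
idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05'])
s = pd.Series([10.0, np.nan, 50.0], index=idx)

# 'linear' ignores the index and treats the points as equally spaced
linear_fill = s.interpolate(method='linear')
print(linear_fill)   # the gap becomes 30.0

# 'time' weights the estimate by the actual time between observations
time_fill = s.interpolate(method='time')
print(time_fill)     # the gap becomes 20.0

# limit caps how many consecutive missing values are filled
gappy = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
limited_fill = gappy.interpolate(method='linear', limit=1)
print(limited_fill)  # only the first missing value of the gap is filled
```

With `'time'`, the missing reading sits one day into a four-day gap, so it receives a quarter of the total change rather than half.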
## Strategies for Handling Missing Categorical Data in Pandas

Managing missing categorical data in pandas is crucial for preserving data quality and conducting meaningful analyses. To address this issue, we can consider the following effective strategies:

### Leaving Missing Values as NaN

In some scenarios, it is advisable to maintain categorical values as NaNs (Not a Number) when they are missing. This approach can be appropriate when the absence of data conveys valuable information or carries significance.

By retaining missing categorical values as NaNs, we keep the missingness itself visible, allowing our analysis to capture its potential significance. This approach is particularly valuable when the missingness pattern has meaning in our dataset, such as in surveys where respondents skip a question they prefer not to answer {cite:p}`mccaffrey2020introduction`.

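When values are left as NaN, it helps to know that pandas leaves them out of counts and group keys by default. The sketch below assumes a hypothetical `answers` DataFrame and shows how `dropna=False` brings the missing entries back into the summary:

```python
import pandas as pd

# Hypothetical survey answers; None marks a skipped question
answers = pd.DataFrame({'Response': ['Yes', 'No', None, 'Yes', None],
                        'Score': [5, 3, 4, 4, 2]})

# By default, missing values are left out of the counts
print(answers['Response'].value_counts())

# dropna=False reports the missing values as their own row
counts_with_missing = answers['Response'].value_counts(dropna=False)
print(counts_with_missing)

# groupby also skips missing keys unless dropna=False is passed
mean_scores = answers.groupby('Response', dropna=False)['Score'].mean()
print(mean_scores)
```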
### Imputing Missing Categorical Values with the Most Frequent Category

To handle missing categorical values, a practical approach is to impute them with the most frequently occurring category (the mode) within the respective column. This strategy replaces missing data with a plausible existing category and works best when only a small share of values is missing; because it raises the share of the most frequent category, heavy use can bias our analysis {cite:p}`PandasDocumentation`.

<font color='Blue'><b>Example:</b></font>

In [None]:
import pandas as pd

# Create a DataFrame with missing categorical values
data = {'Category': ['A', 'B', 'C', None, 'A', None, 'B']}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
display(df)

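# Impute missing values with the most frequent category
# (on ties, mode() returns the tied values in sorted order, so [0] picks 'A' here)
most_frequent_category = df['Category'].mode()[0]
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Category'] = df['Category'].fillna(most_frequent_category)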

# Display the DataFrame with missing values filled using the most frequent category
print("DataFrame with Missing Values Filled:")
display(df)

In this example, the missing categorical values are replaced with the most frequent category, so every row now holds a valid category. Note that this raises the share of that category relative to the originally observed data.

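It can be worth checking how much mode imputation changes the category shares. The following is a rough sketch using the same values as the example above, rebuilt as a Series named `categories` (a name chosen here for illustration):

```python
import pandas as pd

# Same categories as in the example above, rebuilt so the block is self-contained
categories = pd.Series(['A', 'B', 'C', None, 'A', None, 'B'], name='Category')

# Category shares among the observed (non-missing) values
shares_before = categories.value_counts(normalize=True)

# Impute with the mode; 'A' and 'B' are tied, and mode() lists ties in sorted order
imputed = categories.fillna(categories.mode()[0])
shares_after = imputed.value_counts(normalize=True)

# Side-by-side comparison of the shares before and after imputation
comparison = pd.DataFrame({'before': shares_before, 'after': shares_after})
print(comparison)
```

The share of `'A'` grows while the shares of `'B'` and `'C'` shrink, which is the kind of shift to keep in mind when many values are missing.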
### Imputing Missing Categorical Values with a Specific Category

When dealing with missing categorical values, it is often beneficial to replace them with a designated category that signifies the absence of data or a special category chosen for this purpose. This approach ensures that the missingness is clearly indicated in our dataset and prevents any ambiguity when interpreting the results of our analysis {cite:p}`PandasDocumentation`.

<font color='Blue'><b>Example:</b></font>

In [None]:
import pandas as pd

# Create a DataFrame with missing categorical values
data = {'Category': ['A', 'B', None, 'C', 'B']}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
display(df)

# Fill missing values with a specific category, e.g., 'Unknown'
specific_category = 'Unknown'
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Category'] = df['Category'].fillna(specific_category)

# Display the DataFrame with missing values filled using the specific category
print("DataFrame with Missing Values Filled:")
display(df)

In this example, missing categorical values are replaced with the designated category `'Unknown'`, making it explicit that these values were missing originally.

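If the column is stored with the `category` dtype, pandas only accepts fill values that already belong to its categories, so the placeholder label has to be registered first. A minimal sketch, assuming a hypothetical Series `cat_series`:

```python
import pandas as pd

# Hypothetical column stored with the categorical dtype
cat_series = pd.Series(['A', 'B', None, 'C', 'B'], dtype='category')

# 'Unknown' is not an existing category, so register it first and then fill
cat_filled = cat_series.cat.add_categories('Unknown').fillna('Unknown')
print(cat_filled)
print(cat_filled.cat.categories)
```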
## Advantages and Disadvantages of Imputing Missing Data

In the context of data analysis, imputing missing data, whether in Pandas or any other data analysis tool, is a common practice. However, it comes with its set of advantages and disadvantages {cite:p}`data2016secondary,little2019statistical`. Here, we outline the key considerations:

**Advantages of Imputing Missing Data:**

1. **Preserving Data Integrity**: Imputing missing data allows us to retain more data points in our analysis, thereby preserving the overall integrity of our dataset.

2. **Avoiding Bias**: The removal of rows with missing data can introduce bias if the missingness is not entirely random. Imputation helps mitigate bias by replacing missing values with plausible estimates.

3. **Statistical Power**: Imputation enhances the statistical power of our analysis since we have more data to work with, potentially leading to more robust results.

4. **Compatibility with Algorithms**: Many machine learning and statistical algorithms require complete datasets. Imputing missing data renders our dataset compatible with a wider range of modeling techniques.

5. **Enhancing Visualizations**: Imputed data can improve the quality of visualizations and exploratory data analysis by providing a more comprehensive view of our dataset.

**Disadvantages of Imputing Missing Data:**

1. **Potential Bias**: Imputing missing data introduces the potential for bias if the chosen imputation method is not suitable for the underlying data distribution or the mechanism causing the missingness.

2. **Loss of Information**: Imputing missing data may result in the loss of information, especially if the imputed values do not accurately represent the missing data.

3. **Inaccurate Estimates**: Imputed values may not precisely reflect the true values of the missing data, particularly when the missingness is due to unique or unobservable factors.

4. **Distorted Variability**: Imputation can distort the variability in our dataset. For example, filling every gap with a single value such as the mean shrinks the variance, potentially affecting the stability of our analyses.

5. **Complexity**: The selection of the appropriate imputation method and the handling of missing data can be complex, especially in datasets with multiple variables and diverse missing data patterns.
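
Some of these trade-offs can be checked directly. The sketch below assumes a hypothetical `values` DataFrame and illustrates two points: recording which entries were missing before overwriting them, so that information is not lost, and comparing the sample variance before and after mean imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two gaps (illustrative values only)
values = pd.DataFrame({'x': [10.0, 20.0, np.nan, 40.0, np.nan]})

# Record which entries were missing before they are overwritten
values['x_was_missing'] = values['x'].isna()

# Mean imputation
values['x_filled'] = values['x'].fillna(values['x'].mean())

# Sample variance before (NaN ignored) and after imputation
var_before = values['x'].var()
var_after = values['x_filled'].var()
print(var_before, var_after)
```

The imputed column has a smaller variance than the observed values because every filled entry sits exactly at the mean.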