# Handling Missing Data


> Handling missing data is an important step in data cleaning, as it can significantly impact the results of your analysis. Missing data can lead to biased estimates, reduce the statistical power of your analysis, and ultimately lead to invalid conclusions.




## Representations of Missing Values

Missing values can be represented in several ways, depending on where the data came from.

- **`NaN` (Not a Number):** Specific to numerical arrays, `NaN` is a standard IEEE floating-point representation for undefined or unrepresentable numeric values. In Pandas, it's commonly used to represent missing data in numerical datasets.
- **`NaT` (Not a Timestamp):** The equivalent of `NaN` for datetime data, used by Pandas to mark missing values in time series.

- **`None`:** A Python-specific identifier, `None` represents the absence of a value. It's used in Python code to denote nothingness or that a variable has no value.

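- **`NULL`:** Often found in database systems, `NULL` denotes an absent or undefined value. It's used to signify missing or irrelevant data in SQL and similar environments. Pandas readers such as `read_csv` convert common markers like `NULL` to `NaN` by default, but if the literal string `'NULL'` ends up in a `DataFrame` (for example, because it was assigned directly or the default conversion was switched off), it is treated as an ordinary string rather than a missing value, and should be replaced with either `NaN` or `None`.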
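
As a quick check of how these representations behave, the sketch below builds a small `Series` for each case and asks Pandas which entries it considers missing. The values are made up for illustration and are not taken from the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the lesson's dataset
numbers = pd.Series([1.5, np.nan, 3.0])
dates = pd.Series(pd.to_datetime(["2024-01-01", None]))
labels = pd.Series(["apple", None, "NULL"], dtype=object)

print(numbers.isna().tolist())  # NaN is flagged as missing
print(dates.isna().tolist())    # None became NaT, which is flagged as missing
print(labels.isna().tolist())   # None is flagged; the string 'NULL' is not
```

Note that the string `'NULL'` in the last `Series` is reported as present, which is why such placeholders need to be converted before analysis.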





## Identifying Missing Data

>The Pandas `DataFrame` object has dedicated methods for identifying missing values, allowing you to quickly identify and index missing data in your dataset.

Run the following code snippet to load a simple `DataFrame` with some missing data:


In [1]:
import pandas as pd
df = pd.read_csv('https://cdn.theaicore.com/content/lessons/3949170b-c8b8-4353-9983-cdfb18b6efbe/example_data.csv')
df.at[1, 'Column4'] = 'NULL'
df.at[4, 'Column4'] = 'NULL'
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5
0,1,a,1.1,apple,
1,2,b,,NULL,
2,3,,3.3,carrot,
3,4,d,4.4,banana,
4,5,e,5.5,NULL,




### The `.info()` Method

We have encountered this method in an earlier lesson. It provides a count of the non-null values in each column.


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Column1  5 non-null      int64  
 1   Column2  5 non-null      object 
 2   Column3  4 non-null      float64
 3   Column4  5 non-null      object 
 4   Column5  0 non-null      float64
dtypes: float64(2), int64(1), object(2)
memory usage: 332.0+ bytes



### The `isna()` and `notna()` Methods

The `isna()` method is used to detect missing values in a `DataFrame` or `Series`. It returns a Boolean mask of the same size as the input, indicating `True` where elements are missing (`NaN`, `None`, or `NaT`) and `False` otherwise. There is also a `notna()` method which does the inverse: it returns `True` when a value is not missing.

- **Limitations:** It won't detect other forms of missing data representations, like placeholders (e.g., `-999`, `N/A` or empty strings) used to signify missing values.

- **Please Note:** There is another method called `.isnull()`, which is simply an alias for `.isna()`: the same method under a different name. Despite what the name suggests, it gives no special treatment to the string `'NULL'`, which it reports as not missing.



In [3]:
df.isna()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5
0,False,False,False,False,True
1,False,False,True,False,True
2,False,False,False,False,True
3,False,False,False,False,True
4,False,False,False,False,True


As the output consists of Boolean values, which can also be interpreted as `1`s and `0`s, we can use `isna()` to quickly calculate the percentage of missing values in each column, with the following snippet:

In [4]:
print("percentage of missing values in each column:")
df.isna().mean() * 100

percentage of missing values in each column:


Column1      0.0
Column2      0.0
Column3     20.0
Column4      0.0
Column5    100.0
dtype: float64
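
Because the mask is just a table of Booleans, it can also be summed to get counts, or used as a filter to pull out the affected rows. A minimal sketch on a small stand-in frame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; names and values are illustrative
demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": ["x", "y", None],
    "C": [np.nan, np.nan, np.nan],
})

# Number of missing values per column
missing_counts = demo.isna().sum()

# Rows where column A or column B is missing
rows_with_gaps = demo[demo[["A", "B"]].isna().any(axis=1)]

print(missing_counts)
print(rows_with_gaps)
```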

## Removing Rows and Columns
>If a significant number of data points are missing from a row or column, one strategy is to drop it from the `DataFrame`. The upside is that it avoids the need for a data imputation strategy; the downside is the risk of discarding potentially useful information.

### The `dropna()` Method

> The `dropna()` method in Pandas removes missing values from a `DataFrame` or `Series`. Its parameters let you customise which rows or columns are dropped to suit various data cleaning needs.

#### `dropna()` Optional Parameters

- **`axis`:** 
  - `axis=0` or `axis='index'` (default) to drop rows with missing values
  - `axis=1` or `axis='columns'` to drop columns with missing values
- **`how`:** 
  - `how='any'` (default) drops the row/column if **any** `NA` values are present
  - `how='all'` drops the row/column only if **all** values are `NA`
- **`thresh`:** 
  - Sets a threshold for the number of non-NA values. Rows/columns with *fewer* **non-NA** values than the threshold will be dropped
- **`subset`:** 
  - Defines a list of columns in which to look for missing values, useful when `axis=0`
- **`inplace`:** 
  - If `True`, the operation modifies the `DataFrame` in place. Default is `False`, which returns a new `DataFrame`


#### Example Usage: Dropping Rows
By default, `df.dropna()` or `df.dropna(axis=0)` removes rows that contain any missing value. 

For a more conservative approach, you can use `df.dropna(how='all')` to remove only those rows where all values are missing. 

Another useful feature is setting a threshold with `thresh=n`, which retains rows that have a minimum of `n` **non-missing** values.




In [5]:
drop_df = df.dropna(axis=0, thresh=4)  # drops rows with fewer than 4 non-missing values
drop_df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5
0,1,a,1.1,apple,
2,3,,3.3,carrot,
3,4,d,4.4,banana,
4,5,e,5.5,NULL,
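
To compare the three row-dropping options side by side, here is a minimal sketch on a stand-in frame (the data is made up for illustration) with one complete row, one partial row, and one entirely empty row:

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is partly missing, row 2 is entirely missing
demo = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": ["x", "y", None],
})

kept_any = demo.dropna()               # how='any': only the complete row survives
kept_all = demo.dropna(how="all")      # only the entirely empty row is removed
kept_thresh = demo.dropna(thresh=1)    # rows need at least one non-missing value

print(len(kept_any), len(kept_all), len(kept_thresh))
```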


#### Example Usage: Dropping Columns

By switching the `axis` parameter to `axis=1`, you can remove any column with missing data. Similarly, to drop columns only if all their values are missing, you use `df.dropna(axis=1, how='all')`. This is particularly useful for removing columns that carry no information at all, since their exclusion has no impact on the analysis. For columns that are mostly, but not entirely, empty, combine `axis=1` with `thresh` instead.

In [6]:
df_drop = df.dropna(axis=1, how='all') # drops columns consisting entirely of missing values
df_drop.head()

Unnamed: 0,Column1,Column2,Column3,Column4
0,1,a,1.1,apple
1,2,b,,NULL
2,3,,3.3,carrot
3,4,d,4.4,banana
4,5,e,5.5,NULL
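
The `thresh` parameter also works along `axis=1`, which gives one way to drop columns that are mostly, rather than entirely, empty. The sketch below keeps only columns that are at least 50% populated; the 50% cut-off and the stand-in data are assumptions chosen for illustration.

```python
import math

import numpy as np
import pandas as pd

# Stand-in data; column names and values are illustrative
demo = pd.DataFrame({
    "mostly_full": [1.0, 2.0, np.nan, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, "x"],
    "full": ["a", "b", "c", "d"],
})

min_fraction = 0.5  # assumed cut-off: keep columns at least 50% populated
min_non_missing = math.ceil(min_fraction * len(demo))  # thresh expects a count, not a fraction

trimmed = demo.dropna(axis=1, thresh=min_non_missing)
print(trimmed.columns.tolist())
```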


#### Example Usage: Dropping Rows with `NaN` in a Specific Subset of Columns

In certain scenarios, you might need to drop rows if a `NaN` appears in a specific subset of columns, for example if the column is especially crucial for the analysis you are performing. This can be achieved by passing a list of columns to the `subset` parameter of `.dropna()`.

In [7]:
df_drop = df.dropna(subset=['Column3','Column4']) # drops rows with missing values in Column3 or Column4
df_drop.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5
0,1,a,1.1,apple,
2,3,,3.3,carrot,
3,4,d,4.4,banana,
4,5,e,5.5,NULL,


## Handling Other Missing Data
As discussed earlier, some values in your dataset might represent missing data without being recognised as such by Pandas. Examples in our `DataFrame` are `Column4`, where the string `NULL` is used, and `Column2`, which contains a whitespace-only string `' '`.

In such cases, the easiest thing to do is to replace the relevant values with a `None` value, or an `np.nan` value from the `numpy` library. This can be achieved using the `.replace()` method, which replaces one value with another. We will learn about more advanced uses of this method in another lesson.

In [8]:
import numpy as np

df['Column4'] = df['Column4'].replace('NULL', np.nan)  # replaces the string 'NULL' with NaN in Column4
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5
0,1,a,1.1,apple,
1,2,b,,,
2,3,,3.3,carrot,
3,4,d,4.4,banana,
4,5,e,5.5,,
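
Several placeholders can be handled in one pass, and a regular expression can catch empty or whitespace-only strings. A minimal sketch, assuming a stand-in frame whose placeholder values (`'NULL'`, `'N/A'`, `' '`, `''`) are chosen for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in data; object dtype mirrors how the text columns appear in .info()
demo = pd.DataFrame({
    "fruit": ["apple", "NULL", "N/A", "banana"],
    "code": ["a", " ", "c", ""],
}, dtype=object)

# Exact-match placeholders, replaced across the whole frame
cleaned = demo.replace(["NULL", "N/A"], np.nan)

# Empty or whitespace-only strings, matched with a regular expression
cleaned = cleaned.replace(r"^\s*$", np.nan, regex=True)

print(cleaned.isna().sum())
```

When the placeholders are known before loading, `pd.read_csv` also accepts an `na_values` argument, which converts them to `NaN` at read time.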



## Imputing Missing Data


> Aside from deleting rows or columns with missing data, the other common strategy is to impute the values instead. This process involves substituting missing values with estimated ones, allowing for a more comprehensive analysis. Choosing the right imputation method is important for maintaining the integrity of the dataset and the validity of the analysis.

### Methods for Imputing Missing Data

1. **Fill with a Specific Value:**

The `fillna()` method is used to replace all missing (NA) values in a `DataFrame` or `Series`.
   


In [9]:
df_imputed = df.fillna(0)  # Replaces all NAs with 0
df_imputed.head()


Unnamed: 0,Column1,Column2,Column3,Column4,Column5
0,1,a,1.1,apple,0.0
1,2,b,0.0,0,0.0
2,3,,3.3,carrot,0.0
3,4,d,4.4,banana,0.0
4,5,e,5.5,0,0.0
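
Filling every column with `0` also puts a number into the text column `Column4`, which is rarely what you want. `fillna()` also accepts a dictionary mapping column names to fill values. The sketch below uses a stand-in frame, and the fill values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data; column names and values are illustrative
demo = pd.DataFrame({
    "price": [1.1, np.nan, 3.3],
    "fruit": ["apple", None, "carrot"],
})

# One fill value per column; columns not listed are left untouched
filled = demo.fillna({"price": 0.0, "fruit": "unknown"})
print(filled)
```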



2. **Forward Fill and Backward Fill:**

> These two approaches are available as the dedicated `.ffill()` and `.bfill()` methods. Older code often passes them as arguments to the `method` parameter of `.fillna()` instead, but that parameter is deprecated in recent versions of Pandas. Both approaches are useful for data where neighbouring rows are likely to be related, as they propagate the value from an adjacent row into the missing value. Only use them if you have sorted your data in such a way that you can be confident this makes sense.

   - Forward fill (`ffill`) propagates the last valid observation forward
   - Backward fill (`bfill`) fills the missing values with the next valid observation


In [10]:
df_imputed = df.copy()
df_imputed['Column3'] = df_imputed['Column3'].bfill()  # Backward fills the NaNs in Column3
df_imputed['Column4'] = df_imputed['Column4'].ffill()  # Forward fills the NaNs in Column4
df_imputed.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5
0,1,a,1.1,apple,
1,2,b,3.3,apple,
2,3,,3.3,carrot,
3,4,d,4.4,banana,
4,5,e,5.5,banana,
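
Forward and backward filling make most sense on ordered data such as a time series. A minimal sketch, assuming a made-up daily sensor reading; the `limit=1` argument caps how far a value is carried, so that longer gaps stay visibly missing.

```python
import numpy as np
import pandas as pd

# Made-up daily readings with a two-day gap and a trailing gap
readings = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

carried_forward = readings.ffill()    # every gap takes the previous reading
capped = readings.ffill(limit=1)      # each reading is carried forward at most one step
carried_back = readings.bfill()       # gaps take the next reading; the trailing gap has none

print(carried_forward.tolist())
print(capped.tolist())
print(carried_back.tolist())
```

Note that a backward fill cannot fill a gap at the end of the data, and a forward fill cannot fill one at the start, because there is no neighbouring value to propagate.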



3. **Using the Mean, Median, or Mode:**

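For numerical data, another common strategy is to replace missing values with the mean, median, or mode of the column. The median is less sensitive to outliers than the mean, and the mode can also be used for categorical data.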

   


In [12]:

imputed_column = df['Column3'].fillna(df['Column3'].median())  # fills the NaNs in Column3 with the column median
imputed_column.head()



0    1.10
1    3.85
2    3.30
3    4.40
4    5.50
Name: Column3, dtype: float64
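
The same pattern works for the mean and the mode. One detail worth noting: `.mode()` returns a `Series` (there can be ties), so the sketch below takes its first entry. The data is a made-up stand-in.

```python
import numpy as np
import pandas as pd

# Stand-in data; column names and values are illustrative
demo = pd.DataFrame({
    "score": [1.0, np.nan, 3.0, 5.0],
    "fruit": ["apple", "banana", None, "apple"],
})

# Mean for a numeric column (mean() skips missing values by default)
demo["score"] = demo["score"].fillna(demo["score"].mean())

# Mode for a categorical column; take the first value in case of ties
demo["fruit"] = demo["fruit"].fillna(demo["fruit"].mode().iloc[0])

print(demo)
```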

## Key Takeaways


- Handling missing data is vital for avoiding biased estimates and invalid conclusions
- Pandas recognises `NaN`, `NaT`, and `None` as missing values; a literal `'NULL'` string is not recognised and must be converted first
- Pandas `DataFrame` has methods to identify and index missing data in a dataset
- The `df.info()` method in Pandas provides a count of non-null values in each column
- In the example `DataFrame`, `.info()` reveals 5 columns with different data types and varying numbers of non-null values
- Use `isna()` or `isnull()` to detect missing values in Pandas, but they won't detect placeholders for missing data
- Dropping rows or columns with many missing values avoids imputation but risks losing useful data
- The `dropna()` method in Pandas removes missing values, with customisable parameters for different data cleaning needs
- Use `df.dropna(axis=1, how='all')` to remove columns with all missing values, and `df.dropna(subset=['Column'])` to drop rows with missing values in specific columns
- Use the `.replace()` method to convert unrecognised missing data into a format Pandas can identify, such as `None` or `np.nan`
- Imputing missing values, not just deleting them, allows for comprehensive analysis while maintaining data integrity
- Pandas provides methods to handle missing values: `fillna()` to replace with a specific value, forward fill (`ffill`) or backward fill (`bfill`), or replace with the mean, median, or mode