# Handling Missing Data in Pandas

## Introduction
In this notebook, we will dive deep into handling missing data using Pandas.
Missing data is common in real-world datasets and can arise due to various reasons like data entry errors,
incomplete data collection, or merging datasets from different sources.
Handling missing data appropriately is crucial for accurate analysis and modeling.

In [1]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame with missing data for demonstration
data = {
    'Country': ['USA', 'Canada', 'Germany', 'France', 'UK', 'Italy', 'Spain', np.nan],
    'GDP': [21.43, 1.73, np.nan, 2.78, 2.83, 2.00, np.nan, np.nan],
    'Population': [331, 37.59, 83.02, np.nan, 67.22, 60.36, 46.75, np.nan],
    'Life Expectancy': [78.9, 82.3, np.nan, 82.5, 81.2, 82.1, 83.5, np.nan],
    'Literacy Rate': [99, np.nan, np.nan, 99, 99, 98, np.nan, np.nan]
}

df = pd.DataFrame(data)

In [2]:
print("Original DataFrame:")
print(df)
print(df.describe())

Original DataFrame:
   Country    GDP  Population  Life Expectancy  Literacy Rate
0      USA  21.43      331.00             78.9           99.0
1   Canada   1.73       37.59             82.3            NaN
2  Germany    NaN       83.02              NaN            NaN
3   France   2.78         NaN             82.5           99.0
4       UK   2.83       67.22             81.2           99.0
5    Italy   2.00       60.36             82.1           98.0
6    Spain    NaN       46.75             83.5            NaN
7      NaN    NaN         NaN              NaN            NaN
             GDP  Population  Life Expectancy  Literacy Rate
count   5.000000    6.000000         6.000000           4.00
mean    6.154000  104.323333        81.750000          98.75
std     8.553019  112.172726         1.579557           0.50
min     1.730000   37.590000        78.900000          98.00
25%     2.000000   50.152500        81.425000          98.75
50%     2.780000   63.790000        82.200000          99.00

## Understanding Missing Data
Missing data can be problematic because many algorithms require complete datasets to function properly.
It can also introduce bias if the missing data is not handled correctly. The first step in dealing with missing data
is to understand the nature and extent of the missingness.

## Identifying Missing Data
To identify where missing data exists, we can use the `isnull()` method, which returns a DataFrame of the same shape as `df`,
with `True` where the values are `NaN` and `False` elsewhere.

In [4]:
print("\nIdentifying Missing Data with `isnull()`:")
print(df.isnull())


Identifying Missing Data with `isnull()`:
   Country    GDP  Population  Life Expectancy  Literacy Rate
0    False  False       False            False          False
1    False  False       False            False           True
2    False   True       False             True           True
3    False  False        True            False          False
4    False  False       False            False          False
5    False  False       False            False          False
6    False   True       False            False           True
7     True   True        True             True           True
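
As a small complement to the mask above, the sketch below uses a reduced, hypothetical two-column frame (`sample`, not the notebook's `df`). It relies on `isna()` being an alias of `isnull()`, and shows that the mean of the boolean mask gives the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, used only for illustration
sample = pd.DataFrame({
    "GDP": [21.43, np.nan, 2.78, np.nan],
    "Population": [331.0, 37.59, np.nan, 67.22],
})

# isna() is an alias of isnull(); both return the same boolean mask
mask = sample.isna()
same_as_isnull = mask.equals(sample.isnull())

# notna() is the inverse mask
present = sample.notna()

# The mean of a boolean mask is the fraction of missing values per column
missing_share = mask.mean()

# any(axis=1) flags rows that contain at least one missing value
rows_with_missing = sample[mask.any(axis=1)]

print(missing_share)
print(rows_with_missing)
```

Looking at the share rather than the raw count can make it easier to compare columns when datasets differ in length.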


### To count the total number of missing values in each column, we can use `isnull().sum()`.

In [7]:
print("\nCounting Missing Values in Each Column with `isnull().sum()`:")
print(df.isnull().sum())


Counting Missing Values in Each Column with `isnull().sum()`:
Country            1
GDP                3
Population         2
Life Expectancy    2
Literacy Rate      4
dtype: int64


### To count the number of missing values in each row, we can use `isnull().sum(axis=1)`.

In [8]:
# Counting missing values by row
df['Missing Count'] = df.isnull().sum(axis=1)
print("\nDataFrame with Missing Value Counts by Row:")
print(df)


DataFrame with Missing Value Counts by Row:
   Country    GDP  Population  Life Expectancy  Literacy Rate  Missing Count
0      USA  21.43      331.00             78.9           99.0              0
1   Canada   1.73       37.59             82.3            NaN              1
2  Germany    NaN       83.02              NaN            NaN              3
3   France   2.78         NaN             82.5           99.0              1
4       UK   2.83       67.22             81.2           99.0              0
5    Italy   2.00       60.36             82.1           98.0              0
6    Spain    NaN       46.75             83.5            NaN              2
7      NaN    NaN         NaN              NaN            NaN              5


### To count the total number of missing values in the entire DataFrame, we can use `isnull().sum().sum()`.

In [9]:
print("\nTotal Number of Missing Values in the DataFrame with `isnull().sum().sum()`:")
print(df.isnull().sum().sum())


Total Number of Missing Values in the DataFrame with `isnull().sum().sum()`:
12


## Filling Missing Data with `fillna()`
One way to handle missing data is to fill it with a specific value. The `fillna()` method allows us to replace `NaN` values with a specified value.

In [10]:
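# Remove the helper column added earlier so it is not treated as data
df = df.drop(columns=['Missing Count'])
# Filling missing values with 0 (this also replaces the missing country name with 0)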
df_fillna_zero = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_fillna_zero)


DataFrame after filling missing values with 0:
   Country    GDP  Population  Life Expectancy  Literacy Rate
0      USA  21.43      331.00             78.9           99.0
1   Canada   1.73       37.59             82.3            0.0
2  Germany   0.00       83.02              0.0            0.0
3   France   2.78        0.00             82.5           99.0
4       UK   2.83       67.22             81.2           99.0
5    Italy   2.00       60.36             82.1           98.0
6    Spain   0.00       46.75             83.5            0.0
7        0   0.00        0.00              0.0            0.0
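
A single fill value rarely suits every column, as the `0` in the `Country` column above suggests. A minimal sketch, assuming a hypothetical small frame `sample` and illustrative fill values, of passing a dictionary to `fillna()` so that each column gets its own replacement:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, used only for illustration
sample = pd.DataFrame({
    "Country": ["USA", np.nan, "Germany"],
    "GDP": [21.43, 1.73, np.nan],
    "Literacy Rate": [99.0, np.nan, np.nan],
})

# Per-column fill values (assumed for the example);
# columns not listed in the dictionary are left untouched
fill_values = {"Country": "Unknown", "GDP": 0.0}
filled = sample.fillna(value=fill_values)

print(filled)
```

Because `fillna()` returns a new DataFrame by default, `sample` itself is unchanged.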


In [11]:
# Filling missing values with the mean of numeric columns only
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
df_fillna_mean = df.copy()
df_fillna_mean[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

print("\nDataFrame after filling missing values with the mean of each numeric column:")
print(df_fillna_mean)


DataFrame after filling missing values with the mean of each numeric column:
   Country     GDP  Population  Life Expectancy  Literacy Rate
0      USA  21.430  331.000000            78.90          99.00
1   Canada   1.730   37.590000            82.30          98.75
2  Germany   6.154   83.020000            81.75          98.75
3   France   2.780  104.323333            82.50          99.00
4       UK   2.830   67.220000            81.20          99.00
5    Italy   2.000   60.360000            82.10          98.00
6    Spain   6.154   46.750000            83.50          98.75
7      NaN   6.154  104.323333            81.75          98.75
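
The mean is sensitive to extreme values: in the output above, the single large GDP value pulls the fill value up to 6.154, well above most of the observed values. One possible alternative, sketched here on the GDP values alone, is to fill with the median instead. Whether that is preferable depends on the data.

```python
import numpy as np
import pandas as pd

# GDP values from the sample DataFrame
gdp = pd.Series([21.43, 1.73, np.nan, 2.78, 2.83, 2.00, np.nan, np.nan], name="GDP")

# mean() and median() both skip NaN by default
mean_value = gdp.mean()
median_value = gdp.median()

mean_filled = gdp.fillna(mean_value)
median_filled = gdp.fillna(median_value)

print(f"mean fill value:   {mean_value:.3f}")
print(f"median fill value: {median_value:.3f}")
```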


In [12]:
# Filling missing values with forward fill (ffill)
df_fillna_ffill = df.ffill()
print("\nDataFrame after forward filling missing values (ffill):")
print(df_fillna_ffill)


DataFrame after forward filling missing values (ffill):
   Country    GDP  Population  Life Expectancy  Literacy Rate
0      USA  21.43      331.00             78.9           99.0
1   Canada   1.73       37.59             82.3           99.0
2  Germany   1.73       83.02             82.3           99.0
3   France   2.78       83.02             82.5           99.0
4       UK   2.83       67.22             81.2           99.0
5    Italy   2.00       60.36             82.1           98.0
6    Spain   2.00       46.75             83.5           98.0
7    Spain   2.00       46.75             83.5           98.0


In [13]:
# Filling missing values with backward fill (bfill)
df_fillna_bfill = df.bfill()
print("\nDataFrame after backward filling missing values (bfill):")
print(df_fillna_bfill)


DataFrame after backward filling missing values (bfill):
   Country    GDP  Population  Life Expectancy  Literacy Rate
0      USA  21.43      331.00             78.9           99.0
1   Canada   1.73       37.59             82.3           99.0
2  Germany   2.78       83.02             82.5           99.0
3   France   2.78       67.22             82.5           99.0
4       UK   2.83       67.22             81.2           99.0
5    Italy   2.00       60.36             82.1           98.0
6    Spain    NaN       46.75             83.5            NaN
7      NaN    NaN         NaN              NaN            NaN
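
The outputs above show that forward fill cannot fill a gap at the start of a column and backward fill cannot fill one at the end. The sketch below, on a hypothetical Series `s`, illustrates this along with two common variations: capping how far a value is propagated with `limit`, and chaining `ffill()` and `bfill()`.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an inner gap and a trailing gap
s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0, np.nan])

forward = s.ffill()           # the leading NaN remains
limited = s.ffill(limit=1)    # each value is carried forward at most one step
backward = s.bfill()          # the trailing NaN remains
both = s.ffill().bfill()      # forward fill first, then backward fill what is left

print(pd.DataFrame({"original": s, "ffill": forward, "ffill(limit=1)": limited,
                    "bfill": backward, "ffill+bfill": both}))
```

Directional fills assume that neighbouring rows are related, for example in a time series. For unordered rows such as the countries here, the filled values have no particular meaning.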


## Interpolating Missing Data
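Interpolation estimates missing values from the neighbouring values in the same column.
By default, the `interpolate()` method fills gaps linearly, treating the rows as equally spaced,
but it also supports other methods such as polynomial and spline interpolation.
With the default settings, gaps after the last valid value are filled by carrying that value forward rather than by extrapolating.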

In [14]:
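# Apply linear interpolation to the numeric columns only;
# recent pandas versions warn when interpolate() is called on object-dtype columns such as 'Country'
numeric_cols = df.select_dtypes(include='number').columns
df_interpolate_linear = df.copy()
df_interpolate_linear[numeric_cols] = df[numeric_cols].interpolate()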
print("\nDataFrame after Linear Interpolation of Missing Values:")
print(df_interpolate_linear)


DataFrame after Linear Interpolation of Missing Values:
   Country     GDP  Population  Life Expectancy  Literacy Rate
0      USA  21.430      331.00             78.9           99.0
1   Canada   1.730       37.59             82.3           99.0
2  Germany   2.255       83.02             82.4           99.0
3   France   2.780       75.12             82.5           99.0
4       UK   2.830       67.22             81.2           99.0
5    Italy   2.000       60.36             82.1           98.0
6    Spain   2.000       46.75             83.5           98.0
7      NaN   2.000       46.75             83.5           98.0
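
To make the arithmetic concrete: a single missing value between 1.73 and 2.78 is filled with their midpoint, (1.73 + 2.78) / 2 = 2.255. The sketch below reproduces this on a hypothetical Series and uses the `limit_area="inside"` option to leave trailing gaps unfilled instead of repeating the last valid value.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one inner gap and two trailing gaps
s = pd.Series([1.73, np.nan, 2.78, 2.83, 2.00, np.nan, np.nan])

# Default linear interpolation: the inner gap gets the midpoint,
# and the trailing gaps repeat the last valid value
default = s.interpolate()

# limit_area="inside" only fills gaps surrounded by valid values
inside_only = s.interpolate(limit_area="inside")

print(pd.DataFrame({"original": s, "default": default, "inside_only": inside_only}))
```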




## Dropping Missing Data with `dropna()`
Sometimes it is more appropriate to remove rows or columns with missing data. The `dropna()` method allows us to do this.

In [15]:
# Dropping rows with any missing values
df_dropna_rows = df.dropna()
print("\nDataFrame after Dropping Rows with Any Missing Values:")
print(df_dropna_rows)


DataFrame after Dropping Rows with Any Missing Values:
  Country    GDP  Population  Life Expectancy  Literacy Rate
0     USA  21.43      331.00             78.9           99.0
4      UK   2.83       67.22             81.2           99.0
5   Italy   2.00       60.36             82.1           98.0


In [16]:
# Dropping columns with any missing values
df_dropna_columns = df.dropna(axis=1)
print("\nDataFrame after Dropping Columns with Any Missing Values:")
print(df_dropna_columns)


DataFrame after Dropping Columns with Any Missing Values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7]
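
Dropping every row or column that contains any missing value can discard most of the data, as the empty result above shows. A minimal sketch of the less aggressive options that `dropna()` offers (`how`, `subset` and `thresh`), using a hypothetical small frame `sample`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, used only for illustration
sample = pd.DataFrame({
    "Country": ["USA", "Canada", "Germany", np.nan],
    "GDP": [21.43, 1.73, np.nan, np.nan],
    "Population": [331.0, 37.59, 83.02, np.nan],
})

# Drop only rows in which every value is missing
all_missing_dropped = sample.dropna(how="all")

# Drop rows that are missing a value in specific columns
gdp_required = sample.dropna(subset=["GDP"])

# Keep rows that have at least 2 non-missing values
at_least_two = sample.dropna(thresh=2)

# Keep columns that have at least 3 non-missing values
dense_columns = sample.dropna(axis=1, thresh=3)

print(all_missing_dropped)
print(gdp_required)
print(at_least_two)
print(dense_columns)
```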


# Z-Score

A Z-score measures how far a data point lies from the mean, in units of standard deviation. Data points with an absolute Z-score above 2 or 3 are commonly treated as outliers.

In [18]:
# Select only numeric columns for Z-score calculation
numeric_df = df.select_dtypes(include=[np.number])

# Calculate Z-score for numeric columns
df_zscore = (numeric_df - numeric_df.mean()) / numeric_df.std()

# Print Z-scores to analyze
print("Z-scores:")
print(df_zscore)

# Set a threshold for identifying outliers
threshold = 2

# Find outliers; values within the threshold appear as NaN in the result
outliers = df_zscore[df_zscore.abs() > threshold]
print("Outliers based on Z-score method:")
print(outliers)

Z-scores:
        GDP  Population  Life Expectancy  Literacy Rate
0  1.786036    2.020782        -1.804303            0.5
1 -0.517244   -0.594916         0.348199            NaN
2       NaN   -0.189915              NaN            NaN
3 -0.394481         NaN         0.474817            0.5
4 -0.388635   -0.330770        -0.348199            0.5
5 -0.485676   -0.391925         0.221581           -1.5
6       NaN   -0.513256         1.107906            NaN
7       NaN         NaN              NaN            NaN
Outliers based on Z-score method:
   GDP  Population  Life Expectancy  Literacy Rate
0  NaN    2.020782              NaN            NaN
1  NaN         NaN              NaN            NaN
2  NaN         NaN              NaN            NaN
3  NaN         NaN              NaN            NaN
4  NaN         NaN              NaN            NaN
5  NaN         NaN              NaN            NaN
6  NaN         NaN              NaN            NaN
7  NaN         NaN              NaN            NaN
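
The masked table above is mostly `NaN`, which makes the single flagged value hard to spot. A sketch, using only the Population values from the sample DataFrame, of selecting the outlying observations directly. Note that `std()` in pandas uses the sample standard deviation (`ddof=1`) by default, and that missing values never pass the comparison.

```python
import numpy as np
import pandas as pd

# Population values from the sample DataFrame
population = pd.Series(
    [331, 37.59, 83.02, np.nan, 67.22, 60.36, 46.75, np.nan], name="Population"
)

# Z-score: distance from the mean in units of (sample) standard deviation
z_scores = (population - population.mean()) / population.std()

# Keep the original values whose absolute Z-score exceeds the threshold;
# NaN compares as False, so missing entries are excluded automatically
threshold = 2
outlier_values = population[z_scores.abs() > threshold]

print(outlier_values)
```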

## Understanding the Impact of Handling Missing Data
It's important to understand how handling missing data can impact your analysis.
For instance, dropping rows with missing data might lead to a loss of important information,
while filling missing data might introduce bias if not done carefully.
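
One way to see this effect is to compare summary statistics before and after filling. The sketch below does so for the GDP values of the sample DataFrame. It is an illustration under simple assumptions, not a general rule for choosing a method.

```python
import numpy as np
import pandas as pd

# GDP values from the sample DataFrame
gdp = pd.Series([21.43, 1.73, np.nan, 2.78, 2.83, 2.00, np.nan, np.nan], name="GDP")

# describe() ignores NaN, so "original" summarises the 5 observed values
comparison = pd.DataFrame({
    "original": gdp.describe(),
    "mean_fill": gdp.fillna(gdp.mean()).describe(),
    "ffill": gdp.ffill().describe(),
})

print(comparison.round(3))
```

Mean filling keeps the mean unchanged but shrinks the standard deviation, because the added values all sit exactly at the centre of the distribution. Forward filling shifts both, depending on which values happen to precede the gaps.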

# Challenge: Handling Missing Data in the Cost of Living Dataset

Now that you've learned how to handle missing data using Pandas, try to complete the following tasks with the provided cost-of-living dataset (`cost_of_living_US.csv`):

1. **Load the Dataset** - Load the `cost_of_living_US.csv` file into a DataFrame.
2. **Explore the Data** - Explore the shape, columns, and general information of the DataFrame using `shape`, `columns`, `info()`, and `describe()`.
3. **Find any outliers** - Look for outliers using a Z-score threshold of 2 or 3. Decide what you feel is the best way to handle any outliers you find.
4. **Identify Missing Data** - Identify which columns have missing data using `isnull().sum()`.
5. **Identify Missing Data** - Count the total number of missing values in the entire dataset.
6. **Fill Missing Data** - Fill the missing values in each column with the mean of that column. (do not save to df)
7. **Fill Missing Data** - Create a new DataFrame with missing values forward-filled using the last valid observation. (do not save to df)
8. **Interpolate Missing Data** - Perform linear interpolation to fill in the missing values. (do not save to df)

---

## Bonus Challenges

### Bonus 1: Analyse the Effect of Different Methods
- Compare the summary statistics (`describe()`) of the dataset after filling missing values with the mean, forward-filling, and interpolating. Which method seems to retain the most natural distribution of data?

### Bonus 2: Dropping data
- Discuss whether it would be more appropriate to fill in the missing values or to drop the affected rows or columns entirely, considering the context of the data.

### Bonus 3: Create a Custom Function for Missing Data
- Write a custom function that takes a DataFrame and a column name, creates two copies of the DataFrame, fills the missing values in that column using the mean in one copy and linear interpolation in the other, and returns both DataFrames. Test this function on the `cost_of_living_US.csv` dataset.

In [None]:
#1

In [None]:
#2

In [None]:
#3

In [None]:
#4

In [None]:
#5

In [None]:
#6

In [19]:
#7

In [None]:
#8

In [20]:
# Bonus 1

In [21]:
# Bonus 2

In [22]:
# Bonus 3