# 3.1 - Missingness

## Types of missing values
The right technique for handling missing values (NaNs) depends on the missingness mechanism, i.e. how and why the values went missing:

a) *Missing Completely at Random* (MCAR): missingness is unrelated to any data, observed or not; the NaNs are just random holes in the dataset. Dropping the missing values does not bias the results, although it does reduce the sample size.

b) *Missing at Random* (MAR): missingness is related to the observed data, but not to the missing values themselves - e.g. on a questionnaire about work harassment, some gender/age groups may be less willing to answer than others, leading to a higher concentration of missing values for them. Simply dropping the missing values can bias results. Imputation techniques that use the related observed variables can help.

c) *Missing not at Random* (MNAR): missingness depends on the missing values themselves or on variables that have not been recorded - e.g., patients may drop out of a study because they experience a severe side effect that was not measured. Dropping the missing values biases the results. Imputing can help, but does not fully remove the bias.
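
The practical difference between the three mechanisms can be illustrated with a small simulation. This is a minimal sketch on synthetic data: the `age`/`income` survey, the missingness probabilities, and the seed are all assumptions made up for illustration. MNAR is simulated here as missingness that depends on the unobserved income value itself. The sketch drops the missing values under each mechanism and compares the mean of what remains against the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000

# Hypothetical survey: income rises with age, plus noise
age = rng.integers(20, 60, size=n)
income = 1000 + 50 * age + rng.normal(0, 200, size=n)
df = pd.DataFrame({'age': age, 'income': income})

# Boolean masks marking which income values go missing under each mechanism
masks = {
    # MCAR: every value has the same 30% chance of being missing
    'MCAR': rng.random(n) < 0.3,
    # MAR: chance depends on the observed 'age' column
    'MAR': rng.random(n) < np.where(age >= 40, 0.6, 0.1),
    # MNAR: chance depends on the (unobserved) income value itself
    'MNAR': rng.random(n) < np.where(income > np.median(income), 0.6, 0.1),
}

true_mean = df['income'].mean()

# Bias of the mean after simply dropping the missing values
bias = {name: df.loc[~mask, 'income'].mean() - true_mean
        for name, mask in masks.items()}

for name, b in bias.items():
    print(f'{name}: bias of the mean after dropping = {b:.1f}')
```

In this setup the MCAR bias should stay close to zero, while the MAR and MNAR means are pulled down because high-income rows are dropped more often.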

## Techniques
### A. Dropping values
Drop either the rows that contain missing values or even an entire column that has too many missing entries. But beware of biasing the analysis!
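
As a minimal sketch of both options in pandas: the toy DataFrame and the 50% threshold for discarding a column are assumptions chosen for illustration, not a recommended rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],  # mostly missing
    'c': ['x', 'y', 'z', 'w'],
})

# Option 1: drop every row that contains at least one NaN
rows_dropped = df.dropna()

# Option 2: first drop columns with more than 50% missing entries
# (threshold is an arbitrary choice), then drop the remaining incomplete rows
missing_fraction = df.isna().mean()
cols_dropped = df.loc[:, missing_fraction <= 0.5]
cleaned = cols_dropped.dropna()

print(f'Rows only: {rows_dropped.shape}, columns then rows: {cleaned.shape}')
```

Dropping the sparse column first keeps more rows here, which is the usual reason for considering column removal before row removal.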

### B. Imputation
Fill the missing data with representative, 'well-chosen', values.
#### B.1 With some centrality measure
- Quantitative data: NaN <- mean or median over column (i.e. over the other data points of the same feature)
- Qualitative data: NaN <- mode over column

#### B.2 Model-based imputation
Attempt to predict the missing values using complete data from other variables - the feature with missing values becomes the response of an auxiliary model, and the other features become its predictors.
- Quantitative data: regression
- Qualitative data: classification

Just never use the actual response variable 'y' of the final analysis as a predictor for imputation: that leaks target information into the features!
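
For quantitative data, the idea can be sketched with a plain linear regression. The toy columns `x1` and `x2` are assumptions for illustration, and `LinearRegression` is just one possible choice of model: it is fitted on the rows where `x2` is observed and then used to predict `x2` where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0],     # complete feature, used as predictor
    'x2': [2.1, 3.9, np.nan, 8.1, 9.9],  # feature to impute, used as response
})

observed = df['x2'].notna()

# Fit the auxiliary model only on rows where the response is known
model = LinearRegression()
model.fit(df.loc[observed, ['x1']], df.loc[observed, 'x2'])

# Predict the missing entries and write them into a copy
df_imp = df.copy()
df_imp.loc[~observed, 'x2'] = model.predict(df.loc[~observed, ['x1']])

print(df_imp)
```

For qualitative data the same pattern applies with a classifier in place of the regressor.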

#### B.3 Missing as an actual label
For qualitative data, replace NaN with a new class, e.g. "wished not to respond".
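
A minimal sketch of this approach in pandas; the `answer` column and the label text `'no response'` are made up for illustration.

```python
import numpy as np
import pandas as pd

answers = pd.Series(['yes', 'no', np.nan, 'yes', np.nan], name='answer')

# Treat missingness as its own category instead of guessing a value
labeled = answers.fillna('no response')

counts = labeled.value_counts()
print(counts)
```

The new class then behaves like any other category in later encoding or modelling steps.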

In [3]:
# Example of B.1
import pandas as pd
import numpy as np

data = {
    'col1': [1, 2, 5, 8],
    'col2': [4.5, 3, np.nan, 6.2],  # Introduce missing value (NaN)
    'col3': ['A', 'B', 'B', 'C'] 
}
df = pd.DataFrame(data)

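def mean_imputation(df, column):
    """Impute missing values in the specified column with the column's mean."""
    mean_value = df[column].mean()  # NaNs are skipped by default
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection: that form warns in pandas 2.x and may not update df under copy-on-write
    df[column] = df[column].fillna(mean_value)
    return df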

# Impute missing values in 'col2'
df_imp = mean_imputation(df.copy(), 'col2')  # Use a copy to avoid modifying original
print(f'Original dataset with missing values:\n {df}')
print(f'Imputed dataset:\n {df_imp}')

Original dataset with missing values:
    col1  col2 col3
0     1   4.5    A
1     2   3.0    B
2     5   NaN    B
3     8   6.2    C
Imputed dataset:
    col1      col2 col3
0     1  4.500000    A
1     2  3.000000    B
2     5  4.566667    B
3     8  6.200000    C


In [5]:
# Example of B.1 using sklearn
# Extends the previous example to try other imputation strategies
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create dataset with more diverse missing values
data = {
    'col1': [1, 2, 2, 10, np.nan],
    'col2': [4.5, 8, 6.2, 10.0, np.nan],  
    'col3': ['A', 'B', 'C', 'D', np.nan] 
}
df = pd.DataFrame(data)
print(f'Original dataset with missing values:\n {df}')

# Strategies available in SimpleImputer
strategies = ['mean', 'median', 'most_frequent', 'constant']

# Perform imputation with each strategy
for strategy in strategies:
    # fill_value is only used by the 'constant' strategy
    imputer = SimpleImputer(strategy=strategy, fill_value=2.32)
    df_copy = df.copy()  # Preserve original data

    # Impute on numerical columns only
    df_copy[['col1', 'col2']] = imputer.fit_transform(df_copy[['col1', 'col2']]) 

    print(f"\nImputation on 'col1' and 'col2' with strategy '{strategy}':\n", df_copy)

Original dataset with missing values:
    col1  col2 col3
0   1.0   4.5    A
1   2.0   8.0    B
2   2.0   6.2    C
3  10.0  10.0    D
4   NaN   NaN  NaN

Imputation on 'col1' and 'col2' with strategy 'mean':
     col1    col2 col3
0   1.00   4.500    A
1   2.00   8.000    B
2   2.00   6.200    C
3  10.00  10.000    D
4   3.75   7.175  NaN

Imputation on 'col1' and 'col2' with strategy 'median':
    col1  col2 col3
0   1.0   4.5    A
1   2.0   8.0    B
2   2.0   6.2    C
3  10.0  10.0    D
4   2.0   7.1  NaN

Imputation on 'col1' and 'col2' with strategy 'most_frequent':
    col1  col2 col3
0   1.0   4.5    A
1   2.0   8.0    B
2   2.0   6.2    C
3  10.0  10.0    D
4   2.0   4.5  NaN

Imputation on 'col1' and 'col2' with strategy 'constant':
     col1   col2 col3
0   1.00   4.50    A
1   2.00   8.00    B
2   2.00   6.20    C
3  10.00  10.00    D
4   2.32   2.32  NaN
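
The loop above leaves the qualitative column `col3` untouched. As a sketch of mode imputation for such a column, the snippet below applies `SimpleImputer` with `strategy='most_frequent'` to a text column. The data is a made-up variant in which `'B'` is the clear mode, and the column is stored with `dtype=object` on the assumption that this keeps the string handling the same across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy qualitative column with one missing entry; 'B' is the most frequent value
df_cat = pd.DataFrame({
    'col3': pd.Series(['A', 'B', 'B', 'C', np.nan], dtype=object)
})

imputer = SimpleImputer(strategy='most_frequent')

# fit_transform returns a 2D array, so flatten it back into a single column
df_cat_imp = df_cat.copy()
df_cat_imp['col3'] = imputer.fit_transform(df_cat[['col3']]).ravel()

print(df_cat_imp)
```

When several values are tied for most frequent, the choice between them is arbitrary from a modelling point of view, so a mode-imputed column with many ties deserves a second look.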


In [None]:
# Example of B.2 with KNN
import numpy as np
import pandas as pd

# Create a DataFrame with some missing values (NaN)
data = {
    'feature_1': [1, 5, np.nan, 4],  # row 2 is missing; its 2 nearest neighbors are rows 1 and 3
    'feature_2': [8, 5, 7, 10],
    'feature_3': [8000, 5, 7, 10],

#    'target': ['A', 'B', 'A', 'B']
}
df = pd.DataFrame(data)
print(df)


from sklearn.impute import KNNImputer

# Create a KNNImputer instance (let's use 2 neighbors)
imputer = KNNImputer(n_neighbors=2)

# Fit the imputer and fill each NaN with the mean of that feature over the nearest neighbors
df_filled = imputer.fit_transform(df)

# Convert back to a DataFrame for better viewing
df_filled = pd.DataFrame(df_filled, columns=df.columns)
print(df_filled)

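# Compare the imputed value with the mean of feature_1 over the neighbors (rows 1 and 3)
neighbor_mean = df.loc[[1, 3], 'feature_1'].mean()
print(f"original value: {df.loc[2, 'feature_1']}, "
      f"imputed value: {df_filled.loc[2, 'feature_1']}, "
      f"mean over neighbors: {neighbor_mean}")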
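
The claim that rows 1 and 3 are the nearest neighbors of row 2 can be checked directly. The sketch below recomputes the pairwise distances with scikit-learn's `nan_euclidean_distances`, which ignores coordinates that are missing in either row. It is assumed here to match the default metric of `KNNImputer`. The data is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

df = pd.DataFrame({
    'feature_1': [1, 5, np.nan, 4],
    'feature_2': [8, 5, 7, 10],
    'feature_3': [8000, 5, 7, 10],
})

# Pairwise distances between all rows, skipping missing coordinates
dist = nan_euclidean_distances(df.to_numpy(dtype=float))

# Distances from row 2 (the row with the NaN) to every other row
dist_from_row2 = pd.Series(dist[2], index=df.index).drop(2)
nearest = dist_from_row2.nsmallest(2).index.tolist()

print(f'Nearest neighbors of row 2: {nearest}')
print(f"Mean of feature_1 over them: {df.loc[nearest, 'feature_1'].mean()}")
```

Row 0 ends up far away only because of its large `feature_3` value, which shows how much the neighbor choice depends on feature scales.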