<a href="https://colab.research.google.com/github/Tarleton-Math/data-science-20-21/blob/master/data_science_20_21_notes_10_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Intro to Naive Bayes and Imputation of Missing Data
## Class Notes 2020-10-06
## Data Science (masters)
## Math 5364 & 5366, Fall 20 & Spring 21
## Tarleton State University
## Dr. Scott Cook

In [None]:
! pip install --upgrade numpy
! pip install --upgrade pandas

Requirement already up-to-date: numpy in /usr/local/lib/python3.6/dist-packages (1.19.2)
Requirement already up-to-date: pandas in /usr/local/lib/python3.6/dist-packages (1.1.2)


Today we meet a new dataset, preprocessing step, and classifier.
- Dataset called [HouseVotes84](https://archive.ics.uci.edu/ml/datasets/congressional+voting+records)
    - voting records of the 435 members of the US House of Representatives on 16 important bills in 1984
    - Goal - predict party affiliation from the voting record.
- Preprocessing - [Imputing missing data](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)
    - Datasets often have holes where data is missing.  These can cause significant problems for machine learners, so we need to impute (fill) these holes in a reasonable way we could defend to a skeptic.
    - [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) offers many useful simple imputation strategies.
    - [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer) imputes missing values based on the mean value of that feature for the $k$ nearest observations.  This is a more recent strategy that has become quite popular.
- Classifier - [Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html)
    - We'll focus on the [Bernoulli Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes), which is designed for boolean features.
    - Another variant is [Multinomial Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes), designed for discrete features that count the occurrences of certain events, patterns, samples, etc., such as word counts in text.
    - Another variant is [Gaussian Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes), which is designed for continuous features that have approximately normal distributions within each target class.  $P(x_i|y) \sim N(\mu_{iy}, \sigma_{iy})$
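
To make the Bernoulli variant concrete, here is a minimal hand-rolled sketch on made-up data. The names `X_toy`, `y_toy`, and `bernoulli_nb_predict` are hypothetical and only for illustration; it assumes 0/1 features with no missing values and is not scikit-learn's implementation.

```python
import numpy as np

# Toy data (hypothetical): 6 members, 3 bills; 1 = yes vote, 0 = no vote
X_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [1, 1, 1],
                  [0, 1, 0],
                  [0, 1, 1],
                  [0, 0, 0]])
y_toy = np.array(['democrat', 'democrat', 'democrat',
                  'republican', 'republican', 'republican'])

def bernoulli_nb_predict(X, y, x_new, alpha=1.0):
    """Return the class with the largest log posterior for one new row."""
    classes = np.unique(y)
    log_post = []
    for c in classes:
        Xc = X[y == c]
        prior = len(Xc) / len(X)
        # Smoothed estimate of P(x_i = 1 | y = c) for each feature i
        theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        # Bernoulli likelihood: theta where x_new is 1, (1 - theta) where it is 0
        log_lik = np.sum(x_new * np.log(theta) + (1 - x_new) * np.log(1 - theta))
        log_post.append(np.log(prior) + log_lik)
    return classes[int(np.argmax(log_post))]

pred = bernoulli_nb_predict(X_toy, y_toy, np.array([1, 0, 1]))
print(pred)
```

The "naive" part is the sum inside `log_lik`: each vote contributes independently once the class is fixed.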

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
from sklearn.naive_bayes import BernoulliNB

# Column names (bills in the 1984 congressional session)
columns = ['party', 'handicapped-infants', 'water-project-cost-sharing',
           'adoption-of-the-budget-resolution', 'physician-fee-freeze',
           'el-salvador-aid', 'religious-groups-in-schools',
           'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
           'immigration', 'synfuels-corporation-cutback', 'education-spending',
           'superfund-right-to-sue', 'crime', 'duty-free-exports',
           'export-administration-act-south-africa']

# Observe that we can pull this dataset directly from the web - no need for a local copy
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                 header=None, names=columns)

# Target = party (column 0)
# Features = 16 votes (columns >=1)
X, y = data.iloc[:,1:].copy(), data.iloc[:,0].copy()
n, d = X.shape

# Replace 'y'→True, 'n'→False, '?'→NaN
X[X=='y'] = True
X[X=='n'] = False
X[X=='?'] = np.nan

# holdout/model split
holdout_frac = 0.1
holdout_splitter = StratifiedShuffleSplit(n_splits=1, test_size=holdout_frac, random_state=42)
model_idx, holdout_idx = next(holdout_splitter.split(X, y))
X_m, y_m = X.iloc[model_idx]  , y.iloc[model_idx]
X_h, y_h = X.iloc[holdout_idx], y.iloc[holdout_idx]
X_m.shape, X_h.shape

def display_results(grid, cutoff=1.0):
    res = grid.cv_results_
    df = pd.DataFrame(res['params'])
    df['score'] = (res['mean_test_score'] * 100)
    mask = df['score'] >= df['score'].max() * min(1-cutoff, 1)
    with pd.option_context('display.max_rows', 100, 'display.precision', 2):
        display(df[mask].sort_values('score', ascending=False))
    return df

To start, let's just try to apply BernoulliNB directly.

In [2]:
classify = BernoulliNB()
classify.fit(X_m, y_m)

ValueError: ignored

The missing values break NB, which cannot handle NaN entries.  So, we need to impute values before using it.  First, create 2 convenience functions.

In [3]:
def display_df(df, highlight_cells=None):
    """Display df with formatting and highlight specified cells"""
    highlight = df.copy()
    highlight[:] = None
    if highlight_cells is not None:
        highlight[highlight_cells] = 'background-color: green'
    display(df.style
        .apply(lambda z: highlight, axis=None)
        .set_precision(2)
        .set_properties(**{'text-align':'center', 'border-width':'thin','border-style':'solid'})
        .set_table_attributes('style="border-collapse:collapse"')
        )

def impute(X, imputer):
    X_imp = X.copy()
    X_imp[:] = imputer.fit_transform(X)
    display_rows = 10
    display_df(X_imp.iloc[:display_rows], highlight_cells=X.iloc[:display_rows].isnull())  # display new df, highlight imputed cells
    return X_imp

In [4]:
impute(X_m, SimpleImputer(strategy='most_frequent'));

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
177,False,False,True,False,False,False,True,True,True,True,True,False,False,False,True,True
396,True,True,True,False,True,True,False,True,False,False,True,False,True,False,True,True
322,True,True,True,False,True,True,False,True,False,False,True,False,True,True,False,True
166,True,False,True,True,True,True,True,True,False,True,False,True,False,True,True,True
362,True,True,True,False,False,True,True,True,True,True,True,True,True,False,False,True
113,False,True,False,True,True,True,False,False,False,True,False,True,True,True,False,False
191,False,True,False,True,True,True,False,True,False,True,False,True,True,True,False,True
108,True,True,True,False,False,False,True,True,True,False,False,False,False,False,True,True
257,False,False,False,True,True,False,False,False,False,False,False,True,False,True,False,True
12,False,True,True,False,False,False,True,True,True,False,False,False,True,False,False,True


The highlighted cells, such as (322, 'aid-to-nicaraguan-contras'), were imputed using the most frequent value in the same column.  See [this](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for other available simple imputation strategies.

In [7]:
X_m['aid-to-nicaraguan-contras'].value_counts()

True     218
False    162
Name: aid-to-nicaraguan-contras, dtype: int64
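
The same behavior can be checked on a tiny made-up array, where the most frequent value in each column is easy to see by eye. This is only a sketch; `X_toy` is hypothetical data, not part of the voting records.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy votes: column 0 is mostly 1, column 1 is mostly 0
X_toy = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, np.nan],
                  [1.0, 1.0],
                  [np.nan, 0.0]])

imputer = SimpleImputer(strategy='most_frequent')
X_filled = imputer.fit_transform(X_toy)

print(imputer.statistics_)  # fill value learned for each column
print(X_filled)
```

Every hole in a column receives the same fill value, stored in `statistics_` after fitting.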

So, let's make a pipeline and see that the error is gone.

In [8]:
pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                 ('classify', BernoulliNB()),
                ])
pipe.fit(X_m, y_m)

Pipeline(memory=None,
         steps=[('impute',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='most_frequent',
                               verbose=0)),
                ('classify',
                 BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None,
                             fit_prior=True))],
         verbose=False)

Great!!  Now, back to supervised learning.

Though it is overkill, let's get in the habit of using the [hyper-parameter optimizers](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) to make tuning and cross-validation easy and automatic.

In [9]:
pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                 ('classify', BernoulliNB()),
                ])
hyperparams = {}
grid = GridSearchCV(pipe, hyperparams, cv=10, scoring='accuracy').fit(X_m, y_m)
display_results(grid);

Unnamed: 0,score
0,89.52


Great!  Now, this simple imputer has a conceptual flaw.  It assigns the same fill value to ALL missing values in a column, regardless of the target class (dem/rep) of that row.  Don't you think we should impute differently depending on which party that congressperson belongs to?

One idea is to impute based on the most frequent vote among all members of the same party (ignoring the other party).

That's a good idea, but the code to do it is a little involved (though certainly doable).  Also, that induces some circularity.  We're trying to predict party affiliation.  We probably shouldn't be USING the thing we're trying to predict within the predictor.  That's cheating (but it is still done occasionally because it's not especially egregious).

Instead, let's impute using the $k$ nearest neighbors based on only the features $X$ (not targets $y$)
- Suppose X[i,j] is missing
- Ignoring row $i$ and column $j$, let idx be the $k$ rows that are closest to row $i$ among the rows where column $j$ is not missing
- Impute X[i,j] = X.iloc[idx, j].mean()
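
The steps above can be sketched in plain NumPy for a single missing cell. This is a simplified illustration under stated assumptions, not scikit-learn's code: the helper `knn_impute_cell` and the array `X_toy` are hypothetical, and the distance is a Euclidean distance computed only on columns observed in both rows, then rescaled by the fraction of columns used.

```python
import numpy as np

def knn_impute_cell(X, i, j, k):
    """Fill X[i, j] with the mean of column j over the k rows nearest to row i."""
    d = X.shape[1]
    dists = []
    for r in range(X.shape[0]):
        if r == i or np.isnan(X[r, j]):
            continue  # a donor row must differ from i and have column j observed
        both = ~np.isnan(X[i]) & ~np.isnan(X[r])  # columns observed in both rows
        if not both.any():
            continue  # nothing to compare on
        sq = np.sum((X[i, both] - X[r, both]) ** 2)
        # Rescale so rows compared on fewer columns are not favored
        dists.append((np.sqrt(sq * d / both.sum()), r))
    idx = [r for _, r in sorted(dists)[:k]]
    return X[idx, j].mean()

# Hypothetical toy votes; the last vote of row 0 is missing
X_toy = np.array([[1, 0, 1, np.nan],
                  [1, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 1, 0, 0],
                  [0, 1, 1, 0]], dtype=float)

filled = knn_impute_cell(X_toy, i=0, j=3, k=2)
print(filled)
```

Because column $j$ is missing in row $i$, it never enters the distance, which is how "ignoring column $j$" happens automatically. A fractional result appears whenever the $k$ neighbors disagree.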


In [11]:
impute(X_m, KNNImputer(n_neighbors=5));

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
177,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0
396,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
322,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.2,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.8
166,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
362,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0
113,0.0,0.8,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
191,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.4
108,1.0,0.6,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
257,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
12,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0


That's great, except now the entries aren't boolean T/F.  No problem ... BernoulliNB will convert to boolean via the hyperparameter called "binarize".  Any value greater than "binarize" becomes True, while any value less than or equal to "binarize" becomes False.  So, let's use GridSearchCV to optimize this threshold along with the number of neighbors:
 - binarize
 - n_neighbors
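
As a quick illustration of what the threshold does, here is a sketch on a few made-up imputed values (the array `imputed` is hypothetical). It mimics the rule with a plain NumPy comparison rather than calling BernoulliNB.

```python
import numpy as np

imputed = np.array([0.0, 0.2, 0.5, 0.8, 1.0])  # hypothetical KNN-imputed votes

# Values strictly greater than the threshold become 1, everything else 0
for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(threshold, (imputed > threshold).astype(int))
```

Note that a threshold of 1.0 maps every vote to 0, so that grid point carries no information about the votes.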

In [12]:
%%time
pipe = Pipeline([('impute', KNNImputer()),
                 ('classify', BernoulliNB()),
                ])
hyperparams = {'impute__n_neighbors' : np.arange(1, 10),
               'classify__binarize': np.linspace(0, 1, 5),
               }
grid = GridSearchCV(pipe, hyperparams, cv=10, scoring='accuracy').fit(X_m, y_m)
display_results(grid, cutoff=0.1);

Unnamed: 0,classify__binarize,impute__n_neighbors,score
35,0.75,9,89.53
33,0.75,7,89.53
21,0.5,4,89.53
5,0.0,6,89.53
6,0.0,7,89.53
34,0.75,8,89.53
31,0.75,5,89.53
22,0.5,5,89.27
20,0.5,3,89.27
25,0.5,8,89.27


Boo! Shifting from the coarse SimpleImputer(strategy='most_frequent') to the more refined KNNImputer only increased accuracy by 0.01 percentage points (89.52% to 89.53%).

Let's go [here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) to look for other hyperparameters to tune.

Here's some background on [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), which BernoulliNB controls through its hyperparameter "alpha".
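
To see why the smoothing strength matters, consider a bill on which no member of a party voted yes. The sketch below uses a hypothetical helper `smoothed_yes_prob` and made-up counts (0 yes votes out of 20) to show how the estimated probability moves away from zero as alpha grows.

```python
def smoothed_yes_prob(n_yes, n_total, alpha):
    """Additive-smoothing estimate of P(vote = yes) within one class."""
    return (n_yes + alpha) / (n_total + 2 * alpha)

# Hypothetical counts: 0 yes votes among 20 members of one party
for alpha in [1e-5, 0.01, 1.0, 2.0]:
    print(alpha, smoothed_yes_prob(0, 20, alpha))
```

Without smoothing the estimate would be exactly 0, and a single yes vote on that bill would rule out the party entirely. Small alpha keeps the estimate close to the raw frequency; large alpha pulls it toward 1/2.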

In [15]:
%%time
pipe = Pipeline([('impute', KNNImputer()),
                 ('classify', BernoulliNB()),
                ])
hyperparams = {'impute__n_neighbors' : np.arange(1, 10),
               'impute__weights': ['uniform', 'distance'],
               'classify__binarize': np.linspace(0, 1, 5),
               'classify__alpha': np.linspace(0.01, 2, 5),
               }
grid = GridSearchCV(pipe, hyperparams, cv=10, scoring='accuracy').fit(X_m, y_m)
display_results(grid, cutoff=0.1);

Unnamed: 0,classify__alpha,classify__binarize,impute__n_neighbors,impute__weights,score
3,0.01,0.00,2,distance,90.55
5,0.01,0.00,3,distance,90.55
23,0.01,0.25,3,distance,90.55
21,0.01,0.25,2,distance,90.55
25,0.01,0.25,4,distance,90.29
...,...,...,...,...,...
392,2.00,0.25,8,uniform,88.76
212,1.00,0.25,8,uniform,88.76
393,2.00,0.25,8,distance,88.51
213,1.00,0.25,8,distance,88.51


CPU times: user 1min 56s, sys: 1min 29s, total: 3min 25s
Wall time: 1min 43s
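
The run time above grows with the number of hyperparameter combinations. A rough count for this grid, assuming an exhaustive search with 10-fold cross-validation as in the cell above (the variable names `n_candidates` and `n_fits` are just for illustration):

```python
import numpy as np
from math import prod

hyperparams = {'impute__n_neighbors': np.arange(1, 10),
               'impute__weights': ['uniform', 'distance'],
               'classify__binarize': np.linspace(0, 1, 5),
               'classify__alpha': np.linspace(0.01, 2, 5),
               }
cv = 10

# One candidate per combination of values; each candidate is fit once per fold
n_candidates = prod(len(v) for v in hyperparams.values())
n_fits = n_candidates * cv
print(n_candidates, n_fits)
```

Each extra hyperparameter multiplies the count, so it pays to keep the individual ranges short.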


Ah, that squeezed another 1% out.  Looks like small values for alpha are doing better, so let's focus more narrowly to see if there's a better model hiding.

In [17]:
%%time
pipe = Pipeline([('impute', KNNImputer()),
                 ('classify', BernoulliNB()),
                ])
hyperparams = {'impute__n_neighbors' : np.arange(1, 10),
               'impute__weights': ['uniform', 'distance'],
               'classify__binarize': np.linspace(0, 1, 5),
               'classify__alpha': np.linspace(1e-5, 0.3, 10),
               }
grid = GridSearchCV(pipe, hyperparams, cv=10, scoring='accuracy').fit(X_m, y_m)
display_results(grid, cutoff=0.1);

Unnamed: 0,classify__alpha,classify__binarize,impute__n_neighbors,impute__weights,score
185,6.67e-02,0.00,3,distance,90.55
113,3.33e-02,0.25,3,distance,90.55
3,1.00e-05,0.00,2,distance,90.55
293,1.00e-01,0.25,3,distance,90.55
5,1.00e-05,0.00,3,distance,90.55
...,...,...,...,...,...
687,2.33e-01,0.75,2,distance,88.76
235,6.67e-02,0.75,1,distance,88.76
159,3.33e-02,0.75,8,distance,88.76
236,6.67e-02,0.75,2,uniform,88.76


CPU times: user 3min 51s, sys: 2min 49s, total: 6min 40s
Wall time: 3min 21s


Nope.  Looks like 90.55% may be the best we can do.  Not bad.
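
The holdout set created at the start has not been touched during tuning, so it can give an honest final check. Here is a self-contained sketch of that last step on synthetic votes. Everything about the data is an assumption for illustration (200 members, 8 bills, about 5% of votes missing), and `train_test_split` stands in for the StratifiedShuffleSplit used above.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the voting data (hypothetical)
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
# Class 1 votes yes with probability 0.8, class 0 with probability 0.2
p_yes = np.where(y[:, None] == 1, 0.8, 0.2)
X = (rng.random((n, 8)) < p_yes).astype(float)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out about 5% of the votes

# Stratified holdout/model split
X_m, X_h, y_m, y_h = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

pipe = Pipeline([('impute', KNNImputer()),
                 ('classify', BernoulliNB()),
                ])
hyperparams = {'impute__n_neighbors': [3, 5],
               'classify__binarize': [0.25, 0.5],
               'classify__alpha': [0.01, 1.0],
               }
grid = GridSearchCV(pipe, hyperparams, cv=5, scoring='accuracy').fit(X_m, y_m)

# Score the refit best pipeline on data it never saw during tuning
holdout_acc = grid.score(X_h, y_h)
print(grid.best_params_, grid.best_score_, holdout_acc)
```

If the holdout accuracy falls well below the cross-validated score, that hints the tuning has overfit the model set.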