# Naive Bayes Classifiers

## Introduction

Naive Bayes is a class of simple classifiers based on Bayes' Rule and strong (or naive) independence assumptions between features. In this problem, you will implement a Naive Bayes Classifier for the Census Income Data Set from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/).

## Dataset Description

The dataset consists of 32561 instances, each representing an individual. The goal is to predict whether a person earns more than 50K a year based on 14 features. The features are:

| column | type | description |
| --- |:---:|:--- |
| age | continuous | trips around the sun to date |
| final_weight | continuous | census weight attribute; constructed from the original census data |
| education_num | continuous | numeric education scale -- their maximum educational level as a number |
| capital_gain | continuous | income from investment sources |
| capital_loss | continuous | losses from investment sources |
| hours_per_week | continuous | number of hours worked every week |
| work_class | categorical | `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, `Never-worked` |
| education | categorical | `Bachelors`, `Some-college`, `11th`, `HS-grad`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `9th`, `7th-8th`, `12th`, `Masters`, `1st-4th`, `10th`, `Doctorate`, `5th-6th`, `Preschool` |
| marital_status | categorical | `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, `Married-AF-spouse` |
| occupation | categorical | `Tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial`, `Prof-specialty`, `Handlers-cleaners`, `Machine-op-inspct`, `Adm-clerical`, `Farming-fishing`, `Transport-moving`, `Priv-house-serv`, `Protective-serv`, `Armed-Forces` |
| relationship | categorical | `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, `Unmarried` |
| race | categorical | `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Other`, `Black` |
| sex | categorical | `Female`, `Male` |
| native_country | categorical | (41 values not shown here) |

In [19]:
import collections
import pandas as pd
import numpy as np
import scipy
from scipy import stats

import gzip
from testing.testing import test

## Q1. Data Preparation

First, load the data above, which is provided as a CSV file. Because it comes from the UCI repository, it is already quite clean. However, some instances have a missing `occupation`, `native_country`, or `work_class` (represented as `?` in the CSV file), and these must be discarded from the training set. Also, replace the `income` column with `label`, which is 1 if `income` is `>50K` and 0 otherwise. Finally, reset the index so the row numbers are contiguous.
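
The cleaning steps above can be sketched on a small in-memory sample. This is only an illustration: the three rows and the reduced column set are made up, and the real file has all 14 feature columns.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample with a subset of the census columns.
raw = io.StringIO(
    "age,work_class,occupation,native_country,income\n"
    "39,State-gov,Adm-clerical,United-States,<=50K\n"
    "50,?,Exec-managerial,United-States,>50K\n"
    "38,Private,Handlers-cleaners,United-States,>50K\n"
)
df = pd.read_csv(raw)

# Treat "?" as missing and drop any row that contains it.
df = df.replace("?", np.nan).dropna(axis=0)

# Build the integer label and remove the original income column.
df["label"] = (df["income"] == ">50K").astype("int64")
df = df.drop(columns="income")

# drop=True discards the old index instead of keeping it as a new column.
df = df.reset_index(drop=True)
print(df)
```

Comparing against `">50K"` only on the `income` column, rather than replacing the string across the whole frame, keeps the other columns untouched and gives `label` an explicit `int64` dtype.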

In [20]:
def read_csv(fn):
    with gzip.open(fn, "rt", newline='', encoding="UTF-8") as file:
        return pd.read_csv(file)

def load_data_test(load_data):
    pass
    # df = load_data()
    # 
    # DF_TYPES = {
    #     "age"  : "int64",
    #     "work_class"  : "object",
    #     "final_weight"  : "int64",
    #     "education"  : "object",
    #     "education_num"  : "int64",
    #     "marital_status"  : "object",
    #     "occupation"  : "object",
    #     "relationship"  : "object",
    #     "race"  : "object",
    #     "sex"  : "object",
    #     "capital_gain"  : "int64",
    #     "capital_loss"  : "int64",
    #     "hours_per_week"  : "int64",
    #     "native_country"  : "object",
    #     "label"  : "int64"
    # }
    # 
    # test.equal(DF_TYPES, { k: str(df[k].dtypes) for k in DF_TYPES })
    # 
    # # Check for blank entries:
    # test.equal(any(df['occupation'].eq("?")), False)
    # test.equal(any(df['native_country'].eq("?")), False)
    # test.equal(any(df['work_class'].eq("?")), False)
    # 
    # # Make sure there's no income column:
    # test.true("income" not in df.columns)
    # 
    # # Index handling:
    # test.equal(repr(df.index), "RangeIndex(start=0, stop=30162, step=1)")
    
@test
def load_data(file_name="census.csv.gz"):
    """ loads and processes data in the manner specified above

    args:
        file_name : str -- path to csv file containing data

    returns: pd.DataFrame -- processed dataframe
    """
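    df = read_csv(file_name)
    # "?" marks a missing value; drop every row that contains one.
    df = df.replace("?", np.nan).dropna(axis=0)
    # Encode income as an integer label: 1 for ">50K", 0 otherwise.
    df["label"] = (df["income"] == ">50K").astype("int64")
    df = df.drop(columns="income")
    # drop=True keeps the old index from being added as an extra column.
    df = df.reset_index(drop=True)
    return df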

### TESTING load_data: PASSED 0/0
###



## Overview of Naive Bayes classifier

Let $X_1, X_2, \ldots, X_k$ be the $k$ features of a dataset, with class label given by the variable $y$. A probabilistic classifier assigns the most probable class to each instance $(x_1,\ldots,x_k)$, as expressed by
$$ \hat{y} = \arg\max_y P(y\ \mid\ x_1,\ldots,x_k) $$

Using Bayes' theorem, the above *posterior probability* can be rewritten as
$$ P(y\ \mid\ x_1,\ldots,x_k) = \frac{P(y) P(x_1,\ldots,x_k\ \mid\ y)}{P(x_1,\ldots,x_k)} $$
where
- $P(y)$ is the prior probability of the class
- $P(x_1,\ldots,x_k\ \mid\ y)$ is the likelihood of data under a class
- $P(x_1,\ldots,x_k)$ is the evidence for data

Naive Bayes classifiers assume that the feature values are conditionally independent given the class label, that is,
$ P(x_1,\ldots,x_k\ \mid\ y) = \prod_{i=1}^{k}P(x_i\ \mid\ y) $. This strong assumption simplifies the expression for the posterior probability to
$$ P(y\ \mid\ x_1,\ldots,x_k) = \frac{P(y) \prod_{i=1}^{k}P(x_i\ \mid\ y)}{P(x_1,\ldots,x_k)} $$

For a given input $(x_1,\ldots,x_k)$, $P(x_1,\ldots,x_k)$ is constant. Hence, we can say that:
$$ P(y\ \mid\ x_1,\ldots,x_k) \propto P(y) \prod_{i=1}^{k}P(x_i\ \mid\ y) $$

Thus, the class of a new instance can be predicted as:

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{k}P(x_i\ \mid\ y)$$

where $P(y)$ is commonly known as the **class prior** and $P(x_i\ \mid\ y)$ is the **feature predictor**.
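
As a small numeric illustration of why the evidence term can be dropped, the sketch below uses made-up probabilities for two classes and three features, and checks that normalizing does not change the predicted class.

```python
import numpy as np

# Hypothetical numbers for two classes and three features.
prior = np.array([0.75, 0.25])                 # P(y)
likelihood = np.array([[0.5, 0.2, 0.1],        # P(x_i | y=0)
                       [0.4, 0.5, 0.6]])       # P(x_i | y=1)

unnormalized = prior * likelihood.prod(axis=1)  # P(y) * prod_i P(x_i | y)
posterior = unnormalized / unnormalized.sum()   # divide by the evidence

# Dividing by a positive constant does not change which class is largest.
y_hat = int(np.argmax(unnormalized))
print(unnormalized, posterior, y_hat)
```

Here class 1 wins despite its smaller prior, because its feature likelihoods are much larger.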

Observe that this is the product of $k+1$ probability values, which can result in very small numbers. When working with real-world data, this often leads to an [arithmetic underflow](https://en.wikipedia.org/wiki/Arithmetic_underflow). We will instead be adding the logarithm of the probabilities:

$$\hat{y} = \arg\max_y \underbrace{\log P(y)}_\text{log-prior} + \underbrace{\sum_{i=1}^{k} \log P(x_i\ \mid\ y)}_\text{log-likelihood}$$
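
The underflow is easy to reproduce. The following sketch multiplies 1,000 hypothetical factors of $10^{-5}$ each (far more than the census data has, chosen only to force the effect) and compares the result with the sum of their logarithms.

```python
import numpy as np

# 1,000 hypothetical probability factors, each equal to 1e-5.
probs = np.full(1000, 1e-5)

product = np.prod(probs)          # true value 1e-5000 is below float64 range
log_sum = np.sum(np.log(probs))   # stays finite

print(product)   # 0.0
print(log_sum)   # approximately -11512.9
```

Because the logarithm is strictly increasing, taking the $\arg\max$ over log scores selects the same class as the $\arg\max$ over the original product.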

The rest of the assignment deals with how each of these probability distributions -- $P(y), P(x_1\ \mid\ y), \ldots, P(x_k\ \mid\ y)$ -- is estimated from data.


### Feature Predictor

Naive Bayes classifiers are popular because we can model each feature independently and mix and match model types based on prior knowledge. For example, we might know (or assume) that $(X_i|y)$ has some distribution, so we can directly use the probability density or mass function of that distribution to model $(X_i|y)$.

In this assignment, you will be using two classes of likelihood models:
- Gaussian models, for continuous real-valued features (parameterized by mean $\mu$ and standard deviation $\sigma$)
- Categorical models, for features in discrete categories (parameterized by $\mathbf{p} = (p_0, p_1, \ldots)$, one parameter per category)

You need to implement a generic predictor class for each type of model. Your class should have the following methods:

- `fit()`: Learn parameters for the likelihood model using an appropriate Maximum Likelihood Estimator.
- `partial_log_likelihood()`: Use the previously learnt parameters to compute the probability density or mass of a given feature value, and return the natural logarithm of this value.

## Q2. Gaussian Feature Predictor

The Gaussian distribution is characterized by two parameters - mean $\mu$ and standard deviation $\sigma$:
$$ f_Z(z) = \frac{1}{\sqrt{2\pi}\sigma} \exp{(-\frac{(z-\mu)^2}{2\sigma^2})} $$

Given $n$ samples $z_1, \ldots, z_n$ from the above distribution, the MLEs for the mean and standard deviation are:
$$ \hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} z_j $$

$$ \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (z_j-\hat{\mu})^2} $$

`scipy.stats.norm` may be helpful, as may `pandas.DataFrame.var`. If you use the latter, remember to correctly set the `ddof`!
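
To make the `ddof` remark concrete, here is a minimal sketch on a made-up sample. It contrasts the MLE (divide by $n$) with the pandas default (divide by $n-1$), and checks the density formula above against `scipy.stats.norm.logpdf`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample of a single continuous feature.
z = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu_hat = z.mean()
sigma_mle = np.sqrt(z.var(ddof=0))     # divides by n (the MLE above)
sigma_unbiased = np.sqrt(z.var())      # pandas default ddof=1 divides by n - 1

# Log-density from the formula, compared with scipy.stats.norm.logpdf.
x = 6.0
manual = (-np.log(np.sqrt(2 * np.pi) * sigma_mle)
          - (x - mu_hat) ** 2 / (2 * sigma_mle ** 2))
library = stats.norm(loc=mu_hat, scale=sigma_mle).logpdf(x)

print(mu_hat, sigma_mle, sigma_unbiased)
print(manual, library)
```

Note that `numpy.ndarray.var` defaults to `ddof=0`, while `pandas.Series.var` and `pandas.DataFrame.var` default to `ddof=1`, so the same call name gives different estimates depending on the container.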

In [21]:
def gaussian_pred_test(gaussian_predictor):
    pass
    # g = gaussian_predictor(2)
    # 
    # np.random.seed(0xDEADBEEF)
    # rnd = np.random.normal(loc=0.0, scale=1.0, size=(1000,))
    # 
    # data = pd.Series(np.concatenate([rnd, 100-rnd]))
    # labels = pd.Series(np.array([0]*1000 + [1]*1000))
    # 
    # g.fit(data, labels)
    # 
    # test.equal(tuple(g.partial_log_likelihood([0., 50., 100.]).shape), (2, 3))
    # # If the equality is not exact, you may need to change the test to ensure the absolute difference is no more than 1e-4
    # test.true(np.allclose(g.partial_log_likelihood([0., 50., 100.]), [[-0.9234573135702573, -1242.233086628376, -4963.217354198167], [-4963.217354198166, -1242.2330866283753, -0.9234573135702564]], rtol=0, atol=1e-4))

class GaussianPredictor:
    
    # use logpdf? one value per x
    # two for loops
    # 0, 50, 100
    # for i in range
    # X[ y == i for i in ...]
    # series
    """ Feature predictor for a normally distributed real-valued, continuous feature.

        attr:
            k : int -- number of classes
            mu : np.ndarray[k] -- vector containing per class mean of the feature
            sigma : np.ndarray[k] -- vector containing per class std. deviation of the feature
    """
    
    def __init__(self, k):
        """ constructor

        args : k -- number of classes
        """
        self.k = k
        self.mu = np.zeros(k)
        self.sigma = np.zeros(k)
        pass

    def fit(self, x, y):
        """update predictor statistics (mu, sigma) for Gaussian distribution

        args:
            x : pd.Series -- feature values
            y : pd.Series -- class labels
            
        return : GaussianPredictor -- return self for convenience
        
        """
        # df = pd.DataFrame({"values":x,"labels":y})
        # groups = df.groupby("labels")
        # self.mu = np.array((groups.mean()))
        # self.sigma = np.sqrt(np.array(groups.var(ddof=0)))
        y=np.array(y)
        x=np.array(x)
        
        
        for i in range(self.k):
            ybools = (y==i)
            # given_i = np.array([x[j] for j in range(len(y)) if y[j] == i])
            given_i = np.extract(ybools,x)
            self.mu[i]=given_i.mean()
            self.sigma[i]=np.sqrt(given_i.var())
        return self
            
    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class

        args:
            x : pd.Series -- feature values

        return: np.ndarray[self.k, len(x)] : log likelihood for this feature for each class
        """
        logpdfs = np.zeros((self.k,len(x)))
        # for i in range(self.k):
        #     for j in range(len(x)):
        #         logpdfs[i][j]=stats.norm(loc=self.mu[i],scale=self.sigma[i]).logpdf(x[j])
        for i in range(self.k):
            logpdfs[i]=stats.norm(loc=self.mu[i],scale=self.sigma[i]).logpdf(x)
        return logpdfs
@test
def gaussian_pred(k):
    return GaussianPredictor(k)

### TESTING gaussian_pred: PASSED 0/0
###



## Q3. Categorical Feature Predictor

The categorical distribution with $l$ categories $\{0,\ldots,l-1\}$ is characterized by parameters $\mathbf{p} = (p_0,\dots,p_{l-1})$ where $\sum_{t=0}^{l-1} p_t = 1$.

If $C$ is categorically distributed, the probability of observing $z$ is:

$$ \Pr(C=z; \mathbf{p}) = \begin{cases}
    p_0 & \text{ if } z=0
\\  p_1 & \text{ if } z=1
\\  \vdots
\\  p_{l-1} & \text{ if } z=(l-1)
\end{cases}$$

Given $n$ samples $z_1, \ldots, z_n$ from $C$, the smoothed Maximum Likelihood Estimator for $\mathbf p$ is:
$$ \hat{p_t} = \frac{n_t + \alpha}{n + l\alpha} $$

where $n_t = \sum_{j=1}^{n} [z_j=t]$ (i.e., the number of times the category $t$ occurred in the sample) and $n$ is the total number of samples. Smoothing avoids the zero-count problem: without it, a category never observed for a class would get probability 0 and a log-likelihood of $-\infty$ (similar in spirit to smoothing in $n$-gram models in NLP).

In this problem, you need to write a predictor that learns a different categorical distribution $C_i$ for each of $k$ possible classes. You should maintain a dictionary from each possible input token (i.e. each value) to an array of length $k$ that contains $(\Pr(C_0=z), \Pr(C_1=z), ..., \Pr(C_{k-1}=z))$.
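
The smoothed estimate and the dictionary layout can be sketched on toy data as follows. This is one possible approach, not the required one: it uses `pd.crosstab` to build the count table, and the data, `k`, and `alpha` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical feature and k = 2 classes.
x = pd.Series(["A", "A", "A", "B", "B", "C"])
y = pd.Series([0, 0, 0, 0, 1, 1])
k, alpha = 2, 1.0

# Count table: one row per feature value, one column per class.
counts = pd.crosstab(x, y).reindex(columns=range(k), fill_value=0)
l = counts.shape[0]                       # number of distinct categories

# (n_t + alpha) / (n + l * alpha), computed column by column.
smoothed = (counts + alpha) / (counts.sum(axis=0) + l * alpha)

# Dictionary from feature value to a length-k probability vector.
p = {value: row.to_numpy() for value, row in smoothed.iterrows()}
print(p)
```

Category `C` never occurs with class 0, yet its smoothed probability for that class is $1/7$ rather than 0, and each class's probabilities still sum to 1.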

In [22]:
#ln (100/102) 
#ln (ln 1/102)
# 3 is the number of classes
def categorical_pred_test(categorical_pred):
    # d = categorical_pred(3)
    # 
    # data = pd.Series(["A"]*99 + ["B"]*99 + ["C"]*99)
    # labels = pd.Series([0]*99 + [1]*99 + [2]*99)
    # d.fit(data, labels)
    # 
    # pll = d.partial_log_likelihood(["A", "B", "C", "A", "B", "C"])
    # test.equal(tuple(pll.shape), (3, 6))
    # n = -4.624972813284271
    # p = -0.019802627296179754
    # test.equal(pll.tolist(), [[p, n, n, p, n, n], [n, p, n, n, p, n], [n, n, p, n, n, p]])
    # 
    # p = categorical_pred(3)
    # 
    # data = pd.Series(["A"]*99 + ["B"]*99 + ["C"]*99)
    # labels = pd.Series([0]*99 + [1]*99 + [2]*99)
    # p.fit(data, labels)
    # 
    # test.true(np.allclose(p.p['A'], [0.98039216, 0.00980392, 0.00980392], atol=1e-6))
    # 
    # pll = p.partial_log_likelihood(["A", "B", "C", "A", "B", "C"])
    # test.equal(tuple(pll.shape), (3, 6))
    # n = np.log(1/102)
    # p = np.log(100/102)
    # 
    # test.true(np.allclose(pll, [[p, n, n, p, n, n], [n, p, n, n, p, n], [n, n, p, n, n, p]]))
    pass
    # p = categorical_pred(2)
    # 
    # 
    # data = pd.Series(["A"]*50 + ["B"]*50 + ["C"]*50)
    # labels = pd.Series([0]*75 + [1]*75)
    # p.fit(data, labels)
    # # print(p.p)
    # 
    # 
    # test.true(np.allclose(p.p['A'], [0.65384614, 0.01282051], atol=1e-6))
    # test.true(np.allclose(p.p['B'], [0.33333334, 0.33333334], atol=1e-6))
    # test.true(np.allclose(p.p['C'], [0.01282051, 0.65384614], atol=1e-6))
    # 
    # 
    # 
    # pll = p.partial_log_likelihood(["A", "B", "C"])
    # test.equal(tuple(pll.shape), (2, 3))
    # n = np.log(1/78)
    # m = np.log(51/78)
    # l = np.log(26/78)
    # 
    # 
    # 
    # test.true(np.allclose(pll, [[m, l, n], [n, l, m]], atol=1e-6))
    

class CategoricalPredictor:
    """ Feature predictor for a categorical feature.

        attr: 
            k : int -- number of classes
            p : Dict[feature_value, np.ndarray[k]] -- dictionary of vectors containing per-class probability of a feature value;
    """
    
    def __init__(self, k):
        """ constructor

        args : k -- number of classes
        """
        self.k=k
        pass

    def fit(self, x, y, alpha=1.):
        """ initializes the predictor statistics (p) for Categorical distribution
        
        args:
            x : pd.Series -- feature values
            y : pd.Series -- class labels
        
        kwargs:
            alpha : float -- smoothing factor

        return : CategoricalPredictor -- returns self for convenience:
        """
        y=np.array(y)
        x=np.array(x)
        self.p = {}
        for char in set(x):
            self.p[char]=np.zeros(self.k)
            
        
        for i in range(self.k):
            # n=sum([1 for j in range(len(y)) if y[j] == i])
            ybools= (y==i)
            
            n=np.sum( ybools )
            for char in set(x):
                xbools= (x == char)
                # nj=sum([1 for j in range(len(y)) if x[j] == char and y[j] == i ])
                nj = np.sum(np.logical_and(xbools,ybools))
                (self.p[char])[i] = (nj + alpha)/ (n+len(set(x))*alpha)
        return self

    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class

        args:
            x : pd.Series -- vector of feature values

        return : np.ndarray[self.k, len(x)] -- matrix of log likelihood for this feature
        """
        like = np.zeros((self.k,len(x)))
        
        for i in range(self.k):
            for j,char in enumerate(x):
                # try:
                like[i][j]=np.log(self.p[char][i])
                # except KeyError:
                #     print(j,x[:10],self.p)
        return like

@test
def categorical_pred(k):
    return CategoricalPredictor(k)



### TESTING categorical_pred: PASSED 0/0
###



In [25]:
import pandas as pd
import wget
from pathlib import Path
import numpy as np
import collections

#clean sch_bus_ind y/n to integer 0,1

def type_boolean(c):
    if c == "Y": return 1
    elif c == "N": return 0
    # elif c == "nan": return np.nan
    else:
        return np.nan
    # raise ValueError(c)

def ROAD_CONDITION(c): # 8 is other 9 is unknown, 1,7->2, 3->4, 4->3, 5,6->5, 2,8,9->nan
    if c == 1 or c == 7:
        return 2
    elif c == 3:
        return 4
    elif c == 4:
        return 3
    elif c == 5 or c == 6:
        return 5
    else:
        return np.nan

def INTERSECT_TYPE(c): # 10 is other, 99 is unknown
    if c <= 9:
        return c
    else:
        return np.nan

def ILLUMINATION(c):
    if c <= 6:
        return c
    else:
        return np.nan
    
def WEATHER(c):
    if c <= 7:
        return c
    else:
        return np.nan
    
def TIME(c): # extract only the hour
    if c <= 2500:
        return c // 100
    else:
        return np.nan

if not Path('all-crashes-2004-2018.csv.zip').exists():
    wget.download("https://data.wprdc.org/dataset/3130f583-9499-472b-bb5a-f63a6ff6059a/resource/ec578660-2d3f-489d-9ba1-af0ebfc3b140/download/all-crashes-2004-2018.csv.zip")
# zf = zipfile.ZipFile('all-crashes-2004-2018.csv.zip') 
df_io = pd.read_csv('all-crashes-2004-2018.csv.zip')
# print(df.head())
# print(list(df))
# static_columns = "ROAD_CONDITION,INTERSECT_TYPE,URBAN_RURAL,DISTRICT,STATE_ROAD,LOCAL_ROAD,SNOW_SLUSH_ROAD,LANE_CLOSED,TIME_OF_DAY,SPEED_LIMIT"
# dynamic_columns = "ILLUMINATION,MOTORCYCLE_COUNT,HEAVY_TRUCK_COUNT,WEATHER,HAZARDOUS_TRUCK,SCH_BUS_IND,AUTOMOBILE_COUNT"
# output_columns = "PERSON_COUNT,FATAL_COUNT,INJURY_COUNT,MAX_SEVERITY_LEVEL,MAJOR_INJURY"
# df_io = df[(static_columns+","+dynamic_columns+","+output_columns).split(',')]
# print(df_io.head())
# print(df_io.dtypes)
# print(df_io.info())

# df_io['SCH_BUS_IND'] = df_io['SCH_BUS_IND'].apply(type_boolean)



# df_io['ROAD_CONDITION'] = df_io['ROAD_CONDITION'].apply(ROAD_CONDITION)
# df_io['INTERSECT_TYPE'] = df_io['INTERSECT_TYPE'].apply(INTERSECT_TYPE)
# df_io['ILLUMINATION'] = df_io['ILLUMINATION'].apply(ILLUMINATION)
# df_io['WEATHER'] = df_io['WEATHER'].apply(WEATHER)
# df_io['TIME_OF_DAY'] = df_io['TIME_OF_DAY'].apply(TIME)

# df_io = df_io.astype("Int64")

# print(df_io.head())
# print(df_io.dtypes)
# print(df_io.info())

# drop col that will not be used
static = ['ROAD_CONDITION', 'INTERSECT_TYPE', 'LANE_CLOSED', 'TIME_OF_DAY', 'SPEED_LIMIT', 'ILLUMINATION']
dynamic = ['MOTORCYCLE_COUNT', 'HEAVY_TRUCK_COUNT', 'HAZARDOUS_TRUCK', 'AUTOMOBILE_COUNT', 'SCH_BUS_IND', 'WEATHER']
label = ['PERSON_COUNT', 'FATAL_COUNT', 'INJURY_COUNT', 'MAX_SEVERITY_LEVEL', 'MAJOR_INJURY']
categorical = ['ROAD_CONDITION', 'INTERSECT_TYPE', 'LANE_CLOSED', 'ILLUMINATION', 'HAZARDOUS_TRUCK', 'SCH_BUS_IND', 'WEATHER']
gussian = ['TIME_OF_DAY', 'SPEED_LIMIT', 'MOTORCYCLE_COUNT', 'HEAVY_TRUCK_COUNT', 'AUTOMOBILE_COUNT']
data = static + dynamic
for col in df_io.columns:
    if col not in static and col not in dynamic and col not in label:
        df_io.drop(col, axis = 1, inplace = True)  
# print(df_io[15:25])
# df_io['TIME_OF_DAY'] = df_io['TIME_OF_DAY'].astype("Int64")


# clean data
df_io['SCH_BUS_IND'] = df_io['SCH_BUS_IND'].apply(type_boolean)
df_io['ROAD_CONDITION'] = df_io['ROAD_CONDITION'].apply(ROAD_CONDITION)
df_io['INTERSECT_TYPE'] = df_io['INTERSECT_TYPE'].apply(INTERSECT_TYPE)
df_io['ILLUMINATION'] = df_io['ILLUMINATION'].apply(ILLUMINATION)
df_io['WEATHER'] = df_io['WEATHER'].apply(WEATHER)
df_io['TIME_OF_DAY'] = df_io['TIME_OF_DAY'].apply(TIME)

# drop rows contain nan
df_io = df_io.dropna()
df_io = df_io.astype("int64")
# df_io[categorical] = df_io[categorical].astype("object")
# df_io = df_io.astype("object")
df_io[static] = df_io[static].astype("object")
print(df_io[15:25])
# group data into dataset/label
df_data = df_io[data].copy()
df_label = df_io[label].copy()
# print(df_data.info())
# print(df_label.info())

static_df_data = df_io[static+['MAX_SEVERITY_LEVEL']].copy()
static_df_data = static_df_data.rename(columns={'MAX_SEVERITY_LEVEL':"label"})
print(static_df_data.info())


   TIME_OF_DAY ILLUMINATION  WEATHER ROAD_CONDITION INTERSECT_TYPE  \
29          11            1        1              5              0   
48          13            1        4              2              0   
52          10            1        2              2              0   
53          15            1        1              2              0   
54          16            1        1              3              0   
59          13            1        2              2              0   
61          11            1        4              5              0   
67          21            2        4              4              2   
69           7            1        4              5              0   
70          10            1        4              3              0   

    SCH_BUS_IND  PERSON_COUNT  AUTOMOBILE_COUNT  MOTORCYCLE_COUNT  \
29            0             2                 2                 0   
48            0             3                 2                 0   
52            0       

## Q4. Putting things together

It's time to put all the feature predictors together and do something useful! You will implement a class that puts these classifiers to good use:

- `__init__()`: Compute the log prior for each class and initialize the feature predictors (based on feature type). The smoothed prior for class $t$ is given by
$$ \text{prior}(t) = \frac{n_t + \alpha}{n + k\alpha} $$
where $n_t = \sum_{j=1}^{n} [y_j=t]$ (i.e., the number of times the label $t$ occurred in the sample), $n$ is the number of entries in the sample, and $k$ is the number of label values.
- `log_likelihood()`: Compute the sum of the log prior and partial log likelihoods for all features. Use it to predict the final class label.
- `predict()`: Use the output of log_likelihood to predict a class label; break ties by predicting the class with lower id.

**Note:** Your implementation should not assume the data will always be the same as the census data. We may pass any dataset to your class. You can assume that:

1. the input will contain a `label` column of type `int64` with values $0,\ldots,k-1$ for some $k$
2. all other columns will be either of type `object` (for categorical data) or `int64` (for integer data)
3. if you encounter a column of an invalid type, raise an exception
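
The smoothed log prior and the tie-breaking rule can be sketched as below. The labels and score matrix are hypothetical. `np.argmax` returns the first index among equal maxima, which matches the "lower id" rule when rows are ordered by class id.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, 2])     # hypothetical label column
k, alpha = 3, 1.0

counts = np.bincount(labels, minlength=k)              # n_t for each class
log_prior = np.log((counts + alpha) / (len(labels) + k * alpha))

# Hypothetical scores: rows are classes, columns are instances.
scores = np.array([[-1.0, -2.0],
                   [-1.0, -0.5],
                   [-3.0, -0.5]])
pred = np.argmax(scores, axis=0)   # ties go to the first (lowest) index

print(np.exp(log_prior))
print(pred)
```

In the first column classes 0 and 1 tie, and in the second classes 1 and 2 tie; in both cases the lower class id is returned.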

In [28]:
import collections
def naive_bayes_test(naive_bayes):
    df = static_df_data
    cl = naive_bayes(df)
    # cl = naive_bayes(static_df_data)
    
    # test.equal(cl.log_prior.tolist(), [-0.28626858222129903, -1.3905468592226538])
    # 
    # test.true(isinstance(cl.predictor['age'], GaussianPredictor) and
    #     isinstance(cl.predictor['work_class'], CategoricalPredictor) and
    #     isinstance(cl.predictor['final_weight'], GaussianPredictor) and
    #     isinstance(cl.predictor['education'], CategoricalPredictor) and
    #     isinstance(cl.predictor['education_num'], GaussianPredictor) and
    #     isinstance(cl.predictor['marital_status'], CategoricalPredictor) and
    #     isinstance(cl.predictor['occupation'], CategoricalPredictor) and
    #     isinstance(cl.predictor['relationship'], CategoricalPredictor) and
    #     isinstance(cl.predictor['race'], CategoricalPredictor) and
    #     isinstance(cl.predictor['sex'], CategoricalPredictor) and
    #     isinstance(cl.predictor['capital_gain'], GaussianPredictor) and
    #     isinstance(cl.predictor['capital_loss'], GaussianPredictor) and
    #     isinstance(cl.predictor['hours_per_week'], GaussianPredictor) and
    #     isinstance(cl.predictor['native_country'], CategoricalPredictor))    
    # 
    ll = cl.log_likelihood(df.drop("label", axis="columns"))
    # test.equal(tuple(ll.shape), (2, 30162))
    # test.equal(ll[:,:2].tolist(), [[-49.84977999441486, -50.38520793711001], [-53.407383777033196, -51.30832341372758]])
    # 
    lp = cl.predict(df.drop("label", axis="columns"))
    print(collections.Counter(static_df_data['label']))
    print(collections.Counter(lp))
    # test.equal(tuple(lp.shape), (30162,))
    # test.equal(sum(lp), 5407)
    # test.equal(lp[:10].tolist(), [0]*8 + [1]*2)

class NaiveBayesClassifier:
    """ Naive Bayes classifier for a mixture of continuous and categorical attributes.
        We use GaussianPredictor for continuous attributes and CategoricalPredictor for categorical ones.
        
        attr:
            predictor : Dict[column_name,model] -- model for each column
            log_prior : np.ndarray -- the (log) prior probability of each class
    """

    def __init__(self, df, alpha=1.):
        """initialize predictors for each feature and compute class prior
        
        args:
            df : pd.DataFrame -- processed dataframe, without any missing values.
        
        kwargs:
            alpha : float -- smoothing factor for prior probability
        """
        label = df["label"]
        k = max(label)+1
        self.log_prior = np.zeros(k)
        n = len(label)
        for i in range(k):
            ybools= (label==i)
            
            nt=np.sum( ybools )
            self.log_prior[i] = np.log( (nt+alpha)/(n+(k*alpha)))
        
        self.predictor = dict()
        types = dict(df.dtypes)
        for key in types:
            if key != "label" and key!= "index":
                if str(types[key])=="int64":
                    self.predictor[key]=GaussianPredictor(k).fit(df[key],label)
                elif str(types[key])=="object":
                    self.predictor[key]=CategoricalPredictor(k).fit(df[key],label)
                else:
                    raise TypeError
        self.k=k            
        pass

    def log_likelihood(self, x):
        """log_likelihood for input instances from log_prior and partial_log_likelihood of feature predictors

        args:
            x : pd.DataFrame -- processed dataframe (ignore label if present)

        returns : np.ndarray[num_classes, len(x)] -- array of log-likelihood
        """
        # try:
        #     x = x.drop("label")
        # except:
        #     pass
        
        # like = np.zeros((self.k, len(x)))
        like = np.array(([self.log_prior,]*len(x))).transpose()
        # print (self.k,len(x))
        # print(like.shape)
        # for i in range(self.k):
        #     for j in range(len(x)):
        #         
        #         # like[i][j] = self.log_prior[i]+np.sum( self.predictor[key].partial_log_likelihood(x[j])[i][j] for key in self.predictor )
        #         like[i][j]=self.log_prior[i]
        for key in self.predictor:
            model = self.predictor[key]
            z= model.partial_log_likelihood(x[key])
            like +=z
            print(key)
            for i in range(z.shape[0]):
                print(collections.Counter(z[i]))
            # like[i][j] += model.partial_log_likelihood(x[key])[i][j]
                             
        return like           

    def predict(self, x):
        """predicts label for input instances, breaks ties in favor of the class with lower id.

        args:
            x : pd.DataFrame -- processed dataframe (ignore label if present)

        returns : np.ndarray[len(x)] -- vector of class labels
        """
        pred = np.argmax(self.log_likelihood(x),axis = 0)
        return pred

@test
def naive_bayes(*args, **kwargs):
    return NaiveBayesClassifier(*args, **kwargs)

ROAD_CONDITION
Counter({-0.4625732344111501: 29495, -1.714689874031898: 6913, -1.9608807955276653: 5258, -3.0040717871794476: 1944})
Counter({-0.22716231090912753: 29495, -2.286669637688146: 6913, -2.928523523860541: 5258, -3.0338840395183673: 1944})
Counter({-0.2508343514593957: 29495, -2.1838687861546635: 6913, -2.6996819514316934: 5258, -3.169685580677429: 1944})
Counter({-0.289554824465291: 29495, -2.108515970850141: 6913, -2.384105997536893: 5258, -3.275078921426758: 1944})
Counter({-0.3172211412915822: 29495, -2.0214612325030807: 6913, -2.2937441709691435: 5258, -3.257603319762408: 1944})
Counter({-1.3862943611198906: 43610})
Counter({-1.3862943611198906: 43610})
Counter({-1.3862943611198906: 43610})
Counter({-0.3075069923984463: 29495, -1.980305073843198: 6913, -2.434022132975402: 5258, -3.2438935878434885: 1944})
Counter({-0.32119406432510683: 29495, -2.0406555166446796: 6913, -2.205735267004128: 5258, -3.3637709761430385: 1944})
INTERSECT_TYPE
Counter({-0.42606658920162344: 27