# CMPT 423/820 Assignment 2 Question 4
### Model Solution and Grading Scheme

Currently, Scikit-Learn has five implementations of Naive Bayes (Bernoulli, Categorical, Complement, Gaussian and Multinomial, as of Version 0.22).  Each implementation assumes that all the features have the same kind of distribution.  For example, the Scikit-Learn implementation of the Gaussian Naive Bayes Classifier assumes that all features are numeric, and that the histogram of each feature given the class label is more-or-less bell shaped around a mean value.  On the other hand, the Scikit-Learn implementation of the Categorical Naive Bayes Classifier assumes that all features are categorical.
**This is a limitation of the software, not a theoretical limitation of Naive Bayes.**

In this question, you are invited to explain how we could use these two classifiers together to handle mixed data, that is, data with both numeric and categorical features.

### Using Scikit Learn on mixed data
Suppose we have a categorical class label $Y$, some continuous features $F$ and some categorical features $C$.  For our purposes, it doesn't matter how many of each we have.  I will derive a formula based on one of each, and generalize after that.

The classification formula for Naive Bayes doesn't care about the distinction between continuous and categorical features, so we have the same formula as before:

$$ y(c,f) = \arg \max_{Y} P(Y |c,f)  $$

As usual, we will start with the posterior distribution of the class, and apply Bayes Rule and conditional independence:

$$ P(Y |c,f) = \frac{P(c|Y)P(f|Y)P(Y)}{P(c,f)}$$


Unfortunately, Scikit Learn cannot fit this model with mixed data directly.  Not yet, anyway.  But it can fit continuous features using Gaussian Naive Bayes, and categorical features using Categorical Naive Bayes (available since Version 0.22).

We can split the data into two parts, the categorical features, C (with Y), and the continuous features F (with Y).  Fitting them independently gives us 2 classifiers:

$$ y_1(c) = \arg \max_{Y} P(Y |c)  = \arg \max_{Y} \frac{P(c|Y)P(Y)}{P(c)} $$
$$ y_2(f) = \arg \max_{Y} P(Y |f)  = \arg \max_{Y} \frac{P(f|Y)P(Y)}{P(f)} $$


The question is now, how to derive a formula for $y(c,f)$ given the two fitted classifiers $y_1(c)$ and $y_2(f)$, or more precisely their posteriors $P(Y|c)$ and $P(Y|f)$?

Since $y(c,f)$ uses $P(c|Y)$ and $P(f|Y)$, I will find expressions for these two factors.  We start with 

$$ P(Y |c) = \frac{P(c|Y)P(Y)}{P(c)} $$

and rearrange to get

$$ P(c|Y) = \frac{P(Y|c)P(c)}{P(Y)} $$ 

(this is also just Bayes Rule).  Similarly:

$$ P(f|Y) = \frac{P(Y|f)P(f)}{P(Y)} $$ 

Now we substitute into the expression for $P(Y|c,f)$ and simplify as follows:

$$ P(Y |c,f) = \frac{\frac{P(Y|c)P(c)}{P(Y)}\frac{P(Y|f)P(f)}{P(Y)}P(Y)}{P(c,f)}
= \frac{P(Y|c)P(Y|f)}{P(Y)} \frac{P(c)P(f)}{P(c,f)} $$

I've gathered the factors that depend on $Y$ and those that do not depend on $Y$.  Note that $P(c)P(f) \neq P(c,f)$ in general, because $c$ and $f$ are only assumed to be independent given $Y$.

Finally:

$$ y(c,f) = \arg \max_{Y} P(Y |c,f)  
=  \arg \max_{Y}  \frac{P(Y|c)P(Y|f)}{P(Y)} \frac{P(c)P(f)}{P(c,f)} 
=  \arg \max_{Y}  \frac{P(Y|c)P(Y|f)}{P(Y)}  
$$

The last step is true because the factor involving $c,f$ does not depend on $Y$, so it's constant, and cannot affect the maximization.
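As a quick sanity check of the last step, here is a minimal sketch with made-up numbers. The priors and likelihoods below are hypothetical, chosen only for illustration; the names `prior`, `lik_c`, `lik_f` and `posterior` are not from any library. It compares the joint Naive Bayes score $P(c|Y)P(f|Y)P(Y)$ with the combined score $P(Y|c)P(Y|f)/P(Y)$ and shows that they differ only by a factor that is the same for every class.

```python
# Hypothetical toy values for two classes "a" and "b" and one observed sample (c, f).
prior = {"a": 0.6, "b": 0.4}   # P(Y)
lik_c = {"a": 0.2, "b": 0.5}   # P(c|Y) for the observed categorical value c
lik_f = {"a": 0.9, "b": 0.3}   # P(f|Y), density at the observed continuous value f

def posterior(lik):
    """Bayes Rule for a single-feature classifier: P(Y|x) = P(x|Y) P(Y) / P(x)."""
    evidence = sum(lik[y] * prior[y] for y in prior)
    return {y: lik[y] * prior[y] / evidence for y in prior}

post_c = posterior(lik_c)      # what the categorical classifier would report
post_f = posterior(lik_f)      # what the Gaussian classifier would report

# Score of the full mixed model, and score built from the two separate posteriors.
joint = {y: lik_c[y] * lik_f[y] * prior[y] for y in prior}
combined = {y: post_c[y] * post_f[y] / prior[y] for y in prior}

# The ratio is the class-independent constant, so both scores pick the same class.
ratios = {y: combined[y] / joint[y] for y in prior}
y_joint = max(joint, key=joint.get)
y_combined = max(combined, key=combined.get)
print(ratios, y_joint, y_combined)
```

With these particular numbers the two single-feature classifiers disagree on their own (the categorical one prefers `"b"`, the continuous one prefers `"a"`), which is exactly the situation where combining the posteriors is more informative than a vote.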

Scikit Learn allows us to access $P(Y|X)$ (where $X$ stands for the features the classifier was fitted on), $P(Y)$, and the fitted parameters behind $P(X|Y)$ for any Naive Bayes classifier it fits using:
* Any classifier
    * Method `predict_proba()`: this is $P(Y|X)$, the posterior over the classes given the features $X$.
    * Method `predict_log_proba()`: this is $\log P(Y|X)$.
* Categorical NB
    * Attribute `feature_log_prob_`: This is $\log P(X|Y)$ for each categorical feature $X$.
    * Attribute `class_log_prior_`: This is $\log P(Y)$. 
* Gaussian NB
    * Attribute `class_prior_`: This is $P(Y)$.
    * Attributes `theta_`, `sigma_`: These are $\mu_{FY}, \sigma_{FY}^2$ (the mean and the variance) for each feature $F$ and each class $Y$; `sigma_` is named `var_` in more recent releases.  With these values, we can calculate $P(F|Y)$ using NumPy or Python's `statistics` module, using the density of the Normal distribution ${\cal N}(\mu_{FY}, \sigma_{FY}^2)$.
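To illustrate the last bullet, here is a minimal sketch of evaluating $P(F|Y)$ from a per-class mean and variance using only the standard library. The parameter values and the helper name `gaussian_likelihood` are hypothetical; in practice the means and variances would be read from the fitted Gaussian NB classifier.

```python
import math
from statistics import NormalDist

def gaussian_likelihood(f, mean, variance):
    """Density of N(mean, variance) evaluated at f, i.e. P(F=f | Y) for one class."""
    # NormalDist takes a standard deviation, so convert from the variance.
    return NormalDist(mu=mean, sigma=math.sqrt(variance)).pdf(f)

# Hypothetical (mean, variance) of one continuous feature for each class.
params = {"a": (0.0, 1.0), "b": (2.0, 1.0)}

f = 0.5  # observed value of the continuous feature
lik = {y: gaussian_likelihood(f, mean, var) for y, (mean, var) in params.items()}
print(lik)
```

Note that the helper takes a variance rather than a standard deviation, on the assumption that this matches what the fitted classifier stores.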

Now we ask the two fitted classifiers for $P(Y|c)$ and $P(Y|f)$ using `predict_proba()`.  Then we can multiply these together (or better, add the log probabilities).   Then we get $P(Y)$ (from either classifier -- they should report the same values, since both were fitted with the same labels), and divide (or subtract the log prior).  Then it's a linear search through the resulting values, one per class, looking for the maximum.

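Here's a Python sketch, assuming `C`, `F`, `Y` are the training arrays and `c`, `f` are 2-D arrays holding the samples to classify (one row per sample):
```
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# To fit the mixed classifier: one model per feature type, same labels Y
clf_1 = CategoricalNB()
clf_2 = GaussianNB()
clf_1.fit(C, Y)   # C: categorical features, encoded as non-negative integers
clf_2.fit(F, Y)   # F: continuous features

# To classify given samples c, f:
lP_Y = clf_1.class_log_prior_         # log P(Y), one entry per class
lP_Y_C = clf_1.predict_log_proba(c)   # log P(Y|c), one row per sample
lP_Y_F = clf_2.predict_log_proba(f)   # log P(Y|f), one row per sample

# log P(Y|c,f), up to an additive constant that does not depend on Y
lP_Y_CF = lP_Y_C + lP_Y_F - lP_Y

# Select the largest value in each row and return the corresponding Y value
y = clf_1.classes_[np.argmax(lP_Y_CF, axis=1)]
```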
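For a version that can be run end to end, the following is a minimal sketch on synthetic data. The data set, the helper name `predict_mixed`, and all numeric settings are assumptions made up for illustration, not part of the assignment. It fits the two classifiers separately, combines their log posteriors as $\log P(Y|c) + \log P(Y|f) - \log P(Y)$, and compares test accuracy against each classifier on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic mixed data (hypothetical): binary label, one categorical and one continuous feature.
rng = np.random.default_rng(0)
n = 600
Y = rng.integers(0, 2, size=n)
# Categorical feature with levels {0, 1, 2}; it copies the label 70% of the time.
C = np.where(rng.random(n) < 0.7, Y, rng.integers(0, 3, size=n)).reshape(-1, 1)
# Continuous feature: Gaussian with a class-dependent mean.
F = rng.normal(loc=2.0 * Y, scale=1.0).reshape(-1, 1)

C_tr, C_te, F_tr, F_te, Y_tr, Y_te = train_test_split(
    C, F, Y, test_size=0.25, random_state=0
)

# Fit one classifier per feature type, both with the same labels.
clf_c = CategoricalNB().fit(C_tr, Y_tr)
clf_f = GaussianNB().fit(F_tr, Y_tr)

def predict_mixed(c, f):
    """Combine the two posteriors: argmax of log P(Y|c) + log P(Y|f) - log P(Y)."""
    log_score = (
        clf_c.predict_log_proba(c)
        + clf_f.predict_log_proba(f)
        - clf_c.class_log_prior_
    )
    return clf_c.classes_[np.argmax(log_score, axis=1)]

# Evaluate on the held-out test set.
acc_mixed = np.mean(predict_mixed(C_te, F_te) == Y_te)
acc_c = clf_c.score(C_te, Y_te)
acc_f = clf_f.score(F_te, Y_te)
print(f"categorical only: {acc_c:.3f}  gaussian only: {acc_f:.3f}  mixed: {acc_mixed:.3f}")
```

Working in log space avoids underflow when many features are involved. The synthetic categories all appear in the training split here; with real data, a category that only occurs at test time may need extra handling.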


### Grading Guideline
This question should be graded according to the level of accomplishment achieved.  It is an open-ended question.

The derivation above is an example of a formal derivation, based on the probabilistic foundations of Naive Bayes.  
* A good formal derivation (10/10) will:
    * Apply the two simpler formulae
    * Account for $P(Y)$ appearing in both of the smaller classifiers.
    * Rule out the constant factor as irrelevant.
    * Show that Scikit Learn can give us the information we need in the form of attributes or methods
    
* A weaker derivation (7/10) will 
    * Multiply the results of the method `predict_proba()`, which is $P(Y|X)$, together.
    * Ignore or omit $P(Y)$
    * Try to calculate the constant factor
    * Assume that the constant factor is equal to 1

* An adequate informal description (6/10) can:
    * suggest the use of `predict_proba()` for the question, without a proof or derivation
    * suggest asking both classifiers for a class, and use the answer as a vote; for this a tie breaking scheme needs to be included.

* A weak informal description (3/10) might:
    * suggest asking both classifiers for a class, and use the answer as a vote without a tie breaking scheme.
    * suggest some plausible approach not making use of the Naive Bayes classifiers provided by Scikit Learn.

* A totally inadequate description (0/10) will:
    * suggest an implausible approach
    

A demonstration requires:
* Creating a mixed data set, or finding a mixed data set online.
* Applying the technique above to the data set.
* Evaluating performance by:
    * Comparing accuracy (or error) of some examples
    * Evaluation of accuracy (or error) on a test set


### Grading Summary
* Approach (10 possible marks)
    * Nothing submitted: 0/10  
    * Totally inadequate: 0/10
    * Informal, but weak: 3/10
    * Adequate Informal: 6/10
    * Weak formal derivation: 7/10
    * Good formal derivation: 10/10
* Demonstration (5 possible marks)
    * Nothing submitted: 0/5
    * Showed a few examples: 2/5
    * Fitted with mixed training set and evaluated with mixed test set: 5/5
