# Lab 2: Naive Bayes

In this lab we'll train and evaluate a Naive Bayes classifier on our two triangles datasets to predict whether a participant is depressed/schizophrenic.

First off, let's start with some imports, as usual.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
#also random forest and some preprocessing/data splitting stuff

Now, load the datasets from Lab1 and save them as ```depression_data``` and ```scizophreania_data```.

In [None]:
# Code here.


## 1. sklearn and the object-oriented framework

Before we get started, now is a good time for a quick introduction to object-oriented programming and how it is applied in the ```sklearn``` library. ```scikit-learn```, or ```sklearn``` for short, is a machine learning toolkit containing many useful things, ranging from classifiers such as Naive Bayes or Logistic Regression, through data transformation techniques such as PCA/dimensionality reduction, to preprocessing and evaluation modules. Stop here and take a browse through its <a href="http://scikit-learn.org/stable/documentation.html">documentation</a> to get an overview. You do not need to understand everything, or study it front to back, but try to get a feel for what kinds of things it contains, and how the programs using it are typically structured.

Like many Python libraries (one you might be quite familiar with is ```psychopy```), sklearn is based on object-oriented programming. Let's start with a quick intro, since I think understanding objects, methods and attributes will help you understand what's going on with the code we'll be using below. Personally, it took me quite some time to figure out exactly what objects are, and why they are so useful. If you have done any programming course in your life, you will have come across them sooner or later, and you'll probably have seen some pretty stupid examples, like this one:

In [1]:
#let's make a bike:
class Bike:
    def __init__(self,height,wheel_diameter):
        #it has properties, such as a height and a wheel diameter. its position always starts at 0
        self.height = height
        self.wheel_diameter = wheel_diameter
        self.position = 0
        
    def cycle(self,speed):
        #when you cycle, the bike moves forward from its current position by the speed times the wheel diameter
        #(this is not real cycling, I have no idea about the physics of it)
        self.position += self.wheel_diameter * speed
    
    def get_position(self):
        #the user can call this function to locate the bike at its current position
        return self.position

#okay. let's ride the bike. first, we *instantiate* the object:
my_supercool_bike = Bike(120,70) #let's just take some numbers (that could be centimeters) for the height and wheel

#let's find out the bike's initial position:
initial_pos=my_supercool_bike.get_position()
print("Initial position of the bike: {}".format(initial_pos))

#let's ride the bike
my_supercool_bike.cycle(10) #let's say we cycle at 10km/h

#check the position again
final_pos=my_supercool_bike.get_position()
print("Final position of the bike: {}".format(final_pos))

Initial position of the bike: 0
Final position of the bike: 700


Easy. Everything clear?! 

Of course not. While this shows the basic structure of defining your own object, it doesn't really tell you why objects are useful, or what they are, really.

Let's start with what the example <b>does</b> show: 
<ul>
<li>Objects are defined with the ```class``` statement
<li>They have <b>methods</b> and <b>attributes</b>. <b>Methods</b> are like functions, and you can call them with ```object.method()```, e.g. ```my_supercool_bike.cycle(10)```. <b>Attributes</b> are internal variables that the object stores.
<li>Within the class definition, the word ```self``` stands for the object itself. Once the object is instantiated, the variable name you assigned it to takes the place of ```self``` on the outside. For example, if the method ```cycle``` were to be called inside ```get_position```, we'd use ```self.cycle()```, but once we have our own <b>instance</b> of the object (```my_supercool_bike```), we call the method by using the instance name. The same goes for attributes of the object.
<li>Every object has an ```__init__``` method that automatically gets called whenever a new object of this type is instantiated. Don't worry about the details of this now. In the case of our bike, this just means that we fill in the attribute values specified by the user, and start at a specific position.
</ul>

What the example fails to show is why on earth anyone would find this useful. It's just as easy to get the same result using regular variables and functions, like so:

In [2]:
#define some initial variables
my_wheel_diameter = 70
my_position = 0
my_speed=10
        
def cycle(speed,wheel_diameter,position):
    position = position + wheel_diameter * speed
    return position
    
#let's find out the bike's initial position:
initial_pos=my_position
print("Initial position of the bike: {}".format(initial_pos))

#cycle and check the position again
final_pos=cycle(my_speed,my_wheel_diameter,my_position)
print("Final position of the bike: {}".format(final_pos))

Initial position of the bike: 0
Final position of the bike: 700


So, why bother? The reason why classes are useful is exactly their ability to store attributes and methods <b>internally</b>. In the above code, we had to define some global variables at the beginning of our script, and pass them into the function(s) we want to use. Whatever value is returned has to be stored in another variable, or else it is lost forever. The function does not remember anything. Objects <b>remember</b> their attribute values. If you want to cycle again, the ```Bike``` object above will remember its current position and wheel diameter, and take it from there. The standalone ```cycle``` function won't do that: you have to pass its current position as an argument.
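To make the difference concrete, here is a minimal sketch that cycles twice with each approach. It assumes that cycling adds the distance covered to the current position; the names ```RememberingBike``` and ```cycle_once``` are made up for this illustration.

```python
class RememberingBike:
    def __init__(self, wheel_diameter):
        self.wheel_diameter = wheel_diameter
        self.position = 0

    def cycle(self, speed):
        # the object updates the position it stores internally
        self.position += self.wheel_diameter * speed


def cycle_once(speed, wheel_diameter, position):
    # the function only knows what it is handed and returns the new position
    return position + wheel_diameter * speed


# object version: no bookkeeping needed on the outside
bike = RememberingBike(70)
bike.cycle(10)
bike.cycle(10)
print(bike.position)  # 1400

# function version: we have to carry the position around ourselves
position = 0
position = cycle_once(10, 70, position)
position = cycle_once(10, 70, position)
print(position)  # 1400
```

Both versions end up in the same place, but only the object keeps track of its own state between calls.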

Imagine cooking an elaborate meal, but you only have one pan (a function) and you need to fill anything you make into different plates from your cupboard (variables). Now imagine instead you have this super fancy kitchenaid, which automatically stores all the intermediate results and procedures in some secret hidden place (nobody cares where that is), but you can ask it for a cake or the recipe, anytime you want.

For Machine Learning, this ability of objects to store and remember (so to say: "learn") information is extremely useful. For example, you can instantiate a specific type of model, train its parameters on some data, and then run it on that same data, additional data, etc. You might even throw away all the data altogether, and just keep the model parameters. All you need is the model object, and it remembers where its parameters (e.g., probabilities of observing certain words) are currently at. For some models, you can even update some parameters, using only new, freshly acquired data.
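As a rough sketch of what this looks like in sklearn, the block below trains a ```MultinomialNB``` model on a tiny made-up count matrix, deletes the data, and later updates the model with new data via ```partial_fit```. All numbers and variable names are invented for illustration and have nothing to do with the lab datasets.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# made-up word counts: 4 participants x 3 words, with made-up class labels
X_first = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [0, 3, 2]])
y_first = np.array([0, 0, 1, 1])

model = MultinomialNB()
# partial_fit needs the full list of possible classes on its first call
model.partial_fit(X_first, y_first, classes=[0, 1])

# throw the data away: the model object still holds its learned parameters
del X_first, y_first
print(model.feature_log_prob_.shape)  # (2, 3): one row per class, one column per word

prediction = model.predict(np.array([[4, 0, 0]]))
print(prediction)  # [0]

# later: update the same model using only freshly acquired (made-up) data
X_new = np.array([[5, 0, 0], [0, 5, 1]])
y_new = np.array([0, 1])
model.partial_fit(X_new, y_new)
print(model.class_count_)  # [3. 3.]
```

Note that ```feature_log_prob_``` and ```class_count_``` are attributes of the model object, while ```partial_fit``` and ```predict``` are its methods.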

Don't worry if the above introduction is not entirely clear to you. Hopefully it gets you thinking about Python and the world of objects, and helps you make sense of the way models and methods are used with sklearn. Btw, you probably already know quite a few objects: Python strings, lists and dictionaries are all their own type of object, and they have class-specific methods you can call, such as ```.lower()``` for a string, or ```.sort()``` for a list. PsychoPy modules also consist largely of objects.

## 2. Train-test split 

Let's start with the schizophrenia dataset. We'll build a model which predicts, for each instance/participant, whether they are schizophrenic or not.

First, we need to split our dataset into a training and test set. We don't have enough data for a proper validation set; we could use cross-validation if we were to tune some hyperparameters. For now, let's keep things simple and not apply any smoothing, etc.

Before you start coding, think about a sensible way to divide the data. How much can we afford to reserve for testing? How much test data do we need? Are the classes very imbalanced? Should we choose our test set randomly, or make sure the distribution of classes is the same for the trainset and testset?

*Take some notes here:*

sklearn conveniently splits our dataset for us... (set random seed)
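One way this could look, sketched on made-up data: ```train_test_split``` from ```sklearn.model_selection``` shuffles and splits features and labels together, ```random_state``` fixes the random seed so the split is reproducible, and ```stratify``` keeps the class proportions the same in both parts. The array shapes, the test size of 25% and the seed value are assumptions for illustration, not recommendations for the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up stand-in data: 20 participants, 5 word-count features
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(20, 5))
y = np.array([0] * 12 + [1] * 8)  # made-up, slightly imbalanced labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # reserve a quarter of the participants for testing
    stratify=y,        # keep the class distribution equal in both sets
    random_state=42,   # fixed seed, so everyone gets the same split
)

print(np.bincount(y_train))  # [9 6]
print(np.bincount(y_test))   # [3 2]
```

Leaving out ```stratify``` gives a purely random split, in which the class proportions of the two sets can differ by chance.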

In [None]:
#Code here.

Since we don't want to apply any smoothing (and we have way too many features for such a small dataset), we will remove all of the test words that are unobserved in our training set.

...(instructions)
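A possible sketch of this step, assuming the data is stored as a pandas DataFrame with one column per word and one row per participant (the words and counts below are made up): a word counts as observed if it occurs at least once anywhere in the training set, and only those columns are kept in both sets.

```python
import pandas as pd

# made-up word counts; the column names are hypothetical
train_counts = pd.DataFrame({"happy": [2, 0, 1], "tired": [0, 3, 1], "void": [0, 0, 0]})
test_counts = pd.DataFrame({"happy": [1, 0], "tired": [0, 2], "void": [4, 0]})

# words that occur at least once in the training set
observed = train_counts.columns[train_counts.sum(axis=0) > 0]

# keep only the observed words, in both training and test set
train_counts = train_counts[observed]
test_counts = test_counts[observed]

print(list(test_counts.columns))  # ['happy', 'tired']
```

The word "void" appears in the test set only, so it is dropped from both tables.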

In [None]:
#Code here.

## 3. A dummy baseline

To see how good our classifier can become, we first need to establish a baseline. In actual research, the baseline is usually the current state-of-the-art. Just to see if we can get anything done at all, we usually also use what is called a <b>dummy baseline</b>. This is the performance we can get without training any model at all, so it is the minimum our classifier should beat.

One way of establishing a dummy baseline is to choose a random class for each instance, and measure the accuracy. However, imagine in our dataset, 90% of all participants are actually not schizophrenic, and only 10% are. A completely random classifier would do much worse than one that predicts that every person is not schizophrenic. However, as a diagnosis tool, those two dummy classifiers are equally useless.

Our second dummy classifier therefore selects the most frequent class in the training set for every test person. If this were to give 90% accuracy (because 90% of all participants are in fact in the most frequent class), then we'd hope for our actual model (the Naive Bayes classifier) to get over 90% correct. Failing to beat the dummy baseline can hint at potential problems with our model: Maybe we have too little data, too many features, false assumptions, or just a buggy implementation.

Below, establish two dummy baselines:

1. A random one (use ```np.random.rand()``` to create an output array with one entry per participant in the dataset)
2. One that chooses the most frequent class in the training set for each test instance (you should be able to figure this one out yourself; hint: determine the most frequent class by looking at the training labels, then measure how many participants fall into that class in the test set)
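The two baselines could be sketched roughly as follows, on made-up binary labels (0/1). In the 90%/10% example above, a uniformly random guesser is expected to reach 0.5 · 0.9 + 0.5 · 0.1 = 0.5 accuracy, against 0.9 for the majority guesser. The label arrays and the seed are assumptions for illustration only.

```python
import numpy as np

# made-up labels: 0 = not schizophrenic, 1 = schizophrenic
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_test = np.array([0, 0, 0, 1])

# 1. random baseline: draw a number in [0, 1) per test participant
#    and predict class 1 whenever it is above 0.5
np.random.seed(0)
random_pred = (np.random.rand(len(y_test)) > 0.5).astype(int)
random_acc = np.mean(random_pred == y_test)
print(random_acc)

# 2. majority baseline: always predict the most frequent training class
majority_class = np.bincount(y_train).argmax()
majority_pred = np.full(len(y_test), majority_class)
majority_acc = np.mean(majority_pred == y_test)
print(majority_acc)  # 0.75
```

sklearn also ships a ```DummyClassifier``` in ```sklearn.dummy``` (with strategies such as ```"most_frequent"``` and ```"uniform"```) that covers the same ground, but writing the baselines by hand makes it clearer what they do.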

In [None]:
#Code here.


## 4. The Multinomial Naive Bayes Classifier
- build the basic model
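
A minimal sketch of what the basic model might look like, using tiny made-up count matrices in place of the real train/test split. Note one assumption: ```MultinomialNB()``` is used here with its default ```alpha=1.0```, which applies add-one smoothing; whether and how to change that for this lab is left open.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# made-up word counts (rows = participants, columns = words) and labels
X_train = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [0, 3, 2]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2, 0, 1], [0, 2, 2]])
y_test = np.array([0, 1])

model = MultinomialNB()          # instantiate the model object
model.fit(X_train, y_train)      # estimate class priors and word probabilities

predictions = model.predict(X_test)     # most probable class per test participant
accuracy = model.score(X_test, y_test)  # fraction of correct predictions

print(predictions)  # [0 1]
print(accuracy)     # 1.0
```

The accuracy from ```score``` is the number to compare against the dummy baselines from section 3.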

## 5. Feature representation and feature selection 
- observe features with high log probability in each class; select the highest(?)
- try with binary features
- remove stopwords
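
The three steps above might be sketched like this on made-up data. The vocabulary and counts are invented; ```ENGLISH_STOP_WORDS``` is sklearn's built-in English stopword list, which may or may not be the right list for the lab data; and the model again uses the default smoothing, which is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# made-up vocabulary and word counts (rows = participants, columns = words)
vocab = np.array(["the", "happy", "tired", "voices"])
X = np.array([[9, 3, 0, 1], [7, 2, 2, 0], [8, 0, 4, 1], [9, 0, 3, 2]])
y = np.array([0, 0, 1, 1])

# remove stopwords: keep only columns whose word is not in the stopword list
keep = np.array([word not in ENGLISH_STOP_WORDS for word in vocab])
X, vocab = X[:, keep], vocab[keep]

# inspect the words with the highest log probability in each class;
# feature_log_prob_[c, j] holds log P(word j | class c)
model = MultinomialNB().fit(X, y)
top_words = {}
for class_index, class_label in enumerate(model.classes_):
    order = np.argsort(model.feature_log_prob_[class_index])[::-1]
    top_words[int(class_label)] = vocab[order[:2]].tolist()
print(top_words)  # {0: ['happy', 'tired'], 1: ['tired', 'voices']}

# binary features: only record whether a word occurs at all
X_binary = (X > 0).astype(int)
binary_model = BernoulliNB().fit(X_binary, y)
print(binary_model.predict(X_binary))
```

Keep in mind that a high log probability within a class is not the same as being discriminative: a frequent word can rank highly in every class.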

## 6. Naive Bayes for Depression [A]

Now it's your turn: Can you apply the techniques you have learned to build a model for depression?

Think about the following before you start:
- Do you want to use 2 classes (depressed/not depressed) or 3 (first-time depressed/chronically depressed/not depressed)?
- How does your choice affect class imbalance? 
- How do you want to split your dataset for training and testing?
- Which dummy baseline will you use?
- Which features/feature representation seems most useful?

*Note: If you are short on time, skip to task 7 now and come back to this later, so we can wrap this up together.*

*Take some notes here:*


In [None]:
#Code here.

## 7. Findings

Think about your findings. What have they taught you?
- What are the problems with the Naive Bayes approach?
- Do you think it is likely that the Naive Bayes assumptions were met? Why/why not?
- Have you learned anything about depression/schizophrenia?
- What's missing?
- How could this approach be used to investigate a specific research question? What additional variables/data would you need for this?

*Take some notes here:*

If you have time, you can read more about the Naive Bayes algorithm and how to improve its performance <a href="https://machinelearningmastery.com/better-naive-bayes/">here</a>.

## 8. Going Further: Finding informative features with Decision Trees [A]