> **DO NOT EDIT IF INSIDE `computational_analysis_of_big_data_2017` folder** 

# Week 5

*Thursday, September 28, 2017*

## Outline

**Principles of machine learning**. Machine learning is a technique for learning patterns in data that enable computers to make decisions and predictions. It's probably one of the hottest skills to master as a scientist or engineer in research or industry today. This week, we'll get an overview of what machine learning is, what it can be used for and what its limits are. Without worrying too much about what goes on behind the scenes, we will play around with a few classifiers in Python and test model performance using cross validation. The exercises today cover:

* Feature representation
* Model fitting
* Model evaluation
* Prediction results

## Material

* *DSFS Chapter 11 - Machine Learning*: A very general introduction to machine learning and some related concepts.

* *Machine Learning is Fun! [Part 1](https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471)*: The first chapter in a blog post series about deep learning. It's really nice, so make sure to read it! If you have time for it, go ahead and read the following parts of the series too, since we won't cover deep learning in this course. It is a fascinating research field that leverages today's insanely fast computers to do unbelievable things with data.

## Exercises

We want to predict whether a character is a hero or a villain from information that we can extract from their markup. This is a large problem that includes some data wrangling, model fitting and a bit of evaluation. Therefore the problem is split into parts.

### Feature representation
The first part of this problem is feature representation. In its raw format, the data cannot be given to a machine learning algorithm. What we must do is extract features from the data and put them into matrix format. This is the same as what we did when we looked at a dog (the data) and extracted into a matrix whether it was fluffy, sad looking, etc. (the features).

While language modeling is very popular and we could have extracted word frequencies and all sorts of fancy stuff, we are going to stick to something simple: **Team affiliations**.

We can represent the team affiliations of each character as a row in a matrix where each column corresponds to a particular team. This should look something like this (numbers are made up):
<img src="http://ulfaslak.com/computational_analysis_of_big_data/exer_figures/example_boa.png" width="400"/>
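
As a rough illustration of the matrix described above, here is a minimal sketch on made-up data. The `characters` dictionary, the team names and the faction strings are all hypothetical placeholders; your own data structure from the earlier exercises will likely look different.

```python
import numpy as np

# Hypothetical toy data: character name -> (list of teams, faction label).
characters = {
    "Character 1": (["Team A", "Team B"], "hero"),
    "Character 2": (["Team A"], "hero"),
    "Character 3": (["Team C"], "villain"),
    "Character 4": (["Team C", "Team B"], "villain"),
}

# One column per distinct team, in a fixed (sorted) order.
teams = sorted({team for team_list, _ in characters.values() for team in team_list})
team_index = {team: j for j, team in enumerate(teams)}

# X_ta[i, j] is 1 if character i is affiliated with team j, else 0.
X_ta = np.zeros((len(characters), len(teams)), dtype=int)
y_ta = np.zeros(len(characters), dtype=int)
for i, (name, (team_list, faction)) in enumerate(characters.items()):
    for team in team_list:
        X_ta[i, team_index[team]] = 1
    y_ta[i] = 1 if faction == "villain" else 0  # heroes are 0, villains are 1

print(X_ta.shape, y_ta.shape)
```

Keeping the `team_index` mapping around makes it easier to build a matrix with the same columns for other characters later.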

> **Ex. 5.1.1**: Create the team affiliation matrix for the data. This is your feature matrix for the classification problem you will solve later in this exercise set. Therefore, you should also store, in a separate *target* array, whether characters are heroes (denoted by 0) or villains (denoted by 1). For now, skip ambiguous characters, but write your code in such a way that it won't be too hard to redo this for ambiguous characters. Print the shapes of your matrix and target array.

>*Hint: You may want to copy some of your code from Ex. 4.2.3.1.*

### Model fitting

> **Ex. 5.2.1**: Train a classifier on all of your data and test its accuracy.

>* If your affiliation matrix is `X_ta` and your target array is `y_ta` you can do this by instantiating a model like:
>
        from sklearn.naive_bayes import BernoulliNB
        model = BernoulliNB()
        model.fit(X_ta, y_ta)  # <--- This is the training/fitting/learning step
> The `BernoulliNB` is a version of the Naive Bayes classifier which associates certain features with labels and asks what the probability of a label for a data point is given its features. You are free to use any other classifier if you want. Popular ones are trees, random forests, support vector machines, feed forward neural networks, logistic regression, and the list goes on. With `sklearn`, they are just as easy to employ as the `BernoulliNB` classifier.
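
To illustrate the point about other classifiers being just as easy to employ, here is a small sketch that fits two `sklearn` models through the same `.fit`/`.predict` interface. The data (`X_demo`, `y_demo`) is made up for the example, and `LogisticRegression` is just one arbitrary alternative.

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features: all 16 combinations of 4 binary columns.
X_demo = np.array(list(product([0, 1], repeat=4)))
# Made-up labels: 1 if either of the first two features is set.
y_demo = X_demo[:, 0] | X_demo[:, 1]

predictions = {}
for clf in (BernoulliNB(), LogisticRegression()):
    clf.fit(X_demo, y_demo)                       # same training call for both
    predictions[type(clf).__name__] = clf.predict(X_demo)

for name, pred in predictions.items():
    print(name, pred.shape)
```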


>1. Test the accuracy of your model. You can use the `.predict` method on the `model` object to get predictions for a matrix of data points. Report the accuracy of your model on the same data that you trained the model on, alongside the baseline accuracy of a "dumb" model that always guesses the majority class.

>2. Report the precision, recall and F1 scores. `sklearn` has implementations that you can use if you are short on time. Extra credit for doing it using only basic linear algebra operations with `numpy`, though.
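
For reference, the quantities asked for above can be written with a few `numpy` operations. This is only a sketch on made-up arrays (`y_true` and `y_pred` are hypothetical), treating villain (1) as the positive class and assuming at least one positive prediction and one positive label, so that no division by zero occurs.

```python
import numpy as np

# Made-up labels and predictions (0 = hero, 1 = villain).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

accuracy = np.mean(y_true == y_pred)

# Baseline: always predict the most frequent class in the labels.
majority_class = np.bincount(y_true).argmax()
baseline_accuracy = np.mean(y_true == majority_class)

# Confusion counts for the positive class, computed as dot products.
tp = np.dot(y_true, y_pred)        # predicted 1 and truly 1
fp = np.dot(1 - y_true, y_pred)    # predicted 1 but truly 0
fn = np.dot(y_true, 1 - y_pred)    # predicted 0 but truly 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, baseline_accuracy, precision, recall, f1)
```

Comparing the accuracy to the baseline shows how much the model adds over simply guessing the majority class.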

### Model evaluation

> **Ex. 5.3.1**: Investigate how well your model generalizes. You may have noticed that the performance seemed a little too good to be true in Ex 5.2.1.
1. Why did you get such a high accuracy in the previous exercise?
2. Split your data into a test and training set of equal size. Train the model only on the training set and report its accuracy and F1 scores on the training and test sets.
3. Comment on the difference you observe.
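
One possible way to set up such a 50/50 split is with `train_test_split` from `sklearn`. The sketch below uses synthetic data (`X_demo`, `y_demo` are made up, with labels that are a noisy copy of the first feature), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary features and noisy labels, for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(200, 10))
flip = rng.rand(200) < 0.2
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

# Random split into two halves of equal size.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)

model = BernoulliNB()
model.fit(X_train, y_train)  # fit on the training half only

scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    scores[name] = (accuracy_score(y_part, pred), f1_score(y_part, pred))
    print(name, scores[name])
```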

> **Ex. 5.3.2**: Implement cross validation. The performance of a classifier is strongly dependent on the amount of data it is trained on. In Ex. 5.3.1 you trained it on only half of the data and tested it on the other half. If you rerun that code multiple times, with random 50/50 partitions, you are going to see a lot of uncertainty in performance. Cross validation solves this problem by training on a larger subset of the data and testing on a smaller one, and taking the average performance over K folds of this process.
1. Implement cross validation over 10 folds. For each fold you must record the training and test accuracies. In the end, visualize the distributions of test and training accuracy as histograms in the same plot.
2. Comment on the result.
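
The fold loop described above could look roughly like the following sketch, which uses `KFold` from `sklearn` on synthetic data (`X_demo` and `y_demo` are made up). Writing the index splitting by hand is equally valid; `KFold` is just one convenient option.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary features and noisy labels, for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(200, 10))
flip = rng.rand(200) < 0.2
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

kf = KFold(n_splits=10, shuffle=True, random_state=0)
train_acc, test_acc = [], []
for train_idx, test_idx in kf.split(X_demo):
    model = BernoulliNB()  # fresh model for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # .score returns the mean accuracy on the given data.
    train_acc.append(model.score(X_demo[train_idx], y_demo[train_idx]))
    test_acc.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(np.mean(train_acc), np.mean(test_acc))
```

The two lists can then be passed to a plotting call such as `matplotlib.pyplot.hist` (with some transparency) to compare the distributions in one figure.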

### Predicting good or evil

>**Ex. 5.4.1**: Let's put our classifier to use!
* Retrain your model on all of your data.
* Create a team affiliation representation of the ambiguous characters.
* Use the model to estimate the probability that each character is a villain (let's call this *villainness*). You can use the `.predict_proba` method on the model to get probability estimates rather than class assignments.
* **Visualize the villainness distribution for all ambiguous characters**. Comment on the result.
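
As a hedged sketch of the `.predict_proba` step: the method returns one column per class, ordered as in `model.classes_`, so it is worth looking up which column belongs to the villain label (1) rather than assuming it. The arrays `X_known`, `y_known` and `X_ambiguous` below are made-up stand-ins for your own matrices, which must share the same team columns.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up training data: 3 team columns, heroes = 0, villains = 1.
X_known = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y_known = np.array([0, 0, 1, 1])

# Made-up ambiguous characters, using the same team columns.
X_ambiguous = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])

model = BernoulliNB().fit(X_known, y_known)

# Find the probability column that corresponds to the villain label.
villain_col = list(model.classes_).index(1)
villainness = model.predict_proba(X_ambiguous)[:, villain_col]

print(villainness)
```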