# Machine Learning Primer

* Machine learning is a family of algorithms that adapt their behavior based on observations
    * They learn to perform a specific task without explicit instructions
* The objective is to perform better on the given task after observing examples

### Example task: Given a fruit, recognize whether it's an apple &#127822; or a banana &#127820;.

* **Rule-based approach:** Write instructions for how to solve the task, i.e. how to recognize apples &#127822; and bananas &#127820;.
    * If it's red, it must be an apple &#127822;
    * If it's thin and quite long, it's a banana &#127820;
    * If its shape is roundish, it's an apple &#127822; 
    * If it has brown texture on top of yellow background, it's a banana &#127820;
    * Otherwise I'm not sure, but let's guess it's an apple &#127822;
    
    
* **Machine learning approach:** Provide a bag of apples &#127822; and bananas &#127820;, each labeled with the correct class (&#127822;/&#127820;), and let the machine figure out the usual patterns in what apples &#127822; and bananas &#127820; look like and how to tell the two apart
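
To make the contrast concrete, the rule-based approach can be written down directly as code. This is only a minimal sketch: the function name `classify_fruit`, its parameters, and the string values such as `"thin and long"` are assumptions made up for illustration.

```python
def classify_fruit(color, shape, has_brown_spots=False):
    """Hand-written rules, checked in order; the first matching rule wins."""
    if color == "red":
        return "apple"
    if shape == "thin and long":
        return "banana"
    if shape == "round":
        return "apple"
    if color == "yellow" and has_brown_spots:
        return "banana"
    # None of the rules matched: fall back to a guess.
    return "apple"


print(classify_fruit("red", "round"))
print(classify_fruit("yellow", "thin and long"))
print(classify_fruit("yellow", "curved"))  # no rule fires, so this is only a guess
```

Every new kind of fruit, or every exception, needs another hand-written rule. In the machine learning approach the rules are not written by hand but estimated from the labeled examples.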

### Motivation

* Many real-life scenarios are too complicated to define as a set of rules
* Natural language has certain predefined structures and rules (grammar), but also exceptions, and exceptions to the exceptions
    * No formal requirements to follow the rules
    * Not all rules are known
* Almost all modern NLP methods are based on machine learning
    * Neural networks are currently the hot topic

### Machine Learning in a nutshell (classification)

* Dataset: Examples describing the task you are trying to solve
    * The dataset must reflect the task, and represent the real-life situation well
    * Can you predict tomorrow's weather, if you know today's menu?
    * If you want to model daily weather conditions, you need to include measurements throughout the year, not only summer time

* Find a hypothesis (mathematical function) that explains patterns in your data as well as possible
    * Start from random guessing
    * Let the machine see one example (without the ground truth label)
    * Ask it to predict (or guess) the correct label
    * Punish/reward (update the hypothesis)
    * Continue until satisfied
    
* Now we have the function (trained model), which is able to guess the ground truth label for a given example
    
    
### Supervised / Unsupervised Learning

* **Supervised:** Each example has a predefined label, which is known (the ground truth)
* The set of possible labels is fixed
* Classification, Regression


* **Unsupervised:** A bunch of examples representing the situation, but the ground truths are not known
* The set of groups/labels is not known
* Clustering
    * Divide the data into X groups, but we do not know beforehand what kind of groups there should be
    * Clustering can be 'controlled' by how the examples are described to the algorithm (feature representation)
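
As an illustration of clustering, here is a minimal sketch of k-means, one common clustering algorithm (the notes above do not prescribe a particular one). The fruit measurements, the choice of two groups, and the fixed starting centroids are all assumptions made for the example.

```python
def kmeans(points, centroids, n_iterations=10):
    """Tiny k-means: alternate between assigning points and moving centroids."""
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical features: (length in cm, roundness from 0 to 1). No labels are given.
fruits = [(7.0, 0.9), (8.0, 0.95), (7.5, 0.85), (18.0, 0.2), (20.0, 0.15), (19.0, 0.25)]
centroids, clusters = kmeans(fruits, centroids=[fruits[0], fruits[1]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

The algorithm never sees the words "apple" or "banana". It only finds two groups of similar points, and which groups it finds depends on the features chosen to describe the fruits.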

## Classification

* For a given example, predict one (or more) of the pre-defined classes
* E.g. *Divide a bag of fruits into apples &#127822; and bananas &#127820;*
    * For each fruit, guess whether it's an apple &#127822; or banana &#127820;
* Supervised learning: Each example in the training data includes a ground truth label

### Data representation

* Each example must be somehow described to the machine
    * Fruits: Color, size, shape, taste
    * Picture: Pixel values
    * Text: which words/characters the text includes
    
&#8594; Features
* Each example in our dataset can be represented with a set of these features (feature vector)
* If we have only two different features (e.g. size and color), we can easily visualize the data in a 2D plot
    * When we have hundreds of features, meaningful visualization is very difficult
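
A small sketch of what a feature vector can look like in practice. The feature names, the numeric values, and the five-word vocabulary are invented for illustration; real systems use far larger feature sets.

```python
# Fruits: hand-picked numeric features (all values are made up).
fruit = {"length_cm": 19.0, "roundness": 0.2, "is_yellow": 1.0}
feature_names = sorted(fruit)  # fix the order so every fruit uses the same layout
fruit_vector = [fruit[name] for name in feature_names]

# Text: count how many times each vocabulary word occurs (bag of words).
vocabulary = ["apple", "banana", "is", "sweet", "yellow"]


def text_to_vector(text):
    words = text.lower().split()
    return [words.count(word) for word in vocabulary]


print(feature_names, fruit_vector)
print(text_to_vector("Banana is yellow banana is sweet"))
```

Each position in the vector always means the same thing (for example, position 1 of the text vector is the count of "banana"), which is what lets the learning algorithm compare examples with each other.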
    
### Training

* Try to find a hypothesis that explains your data
* In classification: if you have two classes, and two distinct features (2D plot), try to find a line that separates these two classes
    * With more classes and more features, the function is more complicated...
    
* How?
* Start from a random hypothesis (line), say all points above the line are apples &#127822;, all below are bananas &#127820;
* Take the first example from your dataset and, based on its features, 'draw' the point on the plot
* Compare the hypothesis' prediction to the ground truth: if correct, reward or do nothing; if wrong, punish
    * Punish basically means a method to update our hypothesis
    * Cost function: How wrong our current hypothesis was, and how radically it should be updated
* Update the hypothesis (parameters in the line)
* Continue until the hypothesis is as good as possible
    * Not always perfect: sometimes a simple line cannot separate two groups
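
The loop described above can be sketched with the classic perceptron update rule, which is one simple way to "punish" a wrong line (the notes do not name a specific algorithm, so this choice is an assumption). The feature values, the learning rate, and the number of passes are made up, and the line starts from all zeros rather than a random guess so that the run is reproducible.

```python
def predict(weights, bias, features):
    """Points on the positive side of the line are apples (+1), the rest bananas (-1)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1


def train_perceptron(examples, n_epochs=20, learning_rate=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for features, label in examples:
            if predict(weights, bias, features) != label:
                # "Punish": shift the line towards classifying this point correctly.
                weights = [w + learning_rate * label * x for w, x in zip(weights, features)]
                bias += learning_rate * label
                mistakes += 1
        if mistakes == 0:  # every training example is on the correct side
            break
    return weights, bias


# Hypothetical features: (roundness, length in decimetres); +1 = apple, -1 = banana.
examples = [
    ((0.9, 0.7), 1), ((0.95, 0.8), 1), ((0.85, 0.75), 1),
    ((0.2, 1.8), -1), ((0.15, 2.0), -1), ((0.25, 1.9), -1),
]
weights, bias = train_perceptron(examples)
print(weights, bias)
print([predict(weights, bias, features) for features, _ in examples])
```

Here the update size is fixed by `learning_rate`. A cost function, as mentioned above, would additionally scale the update by how wrong the prediction was.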


### Overfitting

* It's possible to generate a hypothesis which explains any kind of data perfectly
* It can learn to remember the training data exactly
* Does not generalize at all!
    * Does not create reasonable predictions for new, previously unseen data points
* Complexity of the hypothesis must be controlled (regularization)
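
An extreme case of overfitting is a model that does nothing but memorize. The sketch below uses a hypothetical `MemorizingModel` class and made-up fruit measurements to show that perfect recall of the training data says nothing about unseen examples.

```python
class MemorizingModel:
    """Stores every training example and looks the answer up at prediction time."""

    def fit(self, examples):
        self.memory = {features: label for features, label in examples}
        return self

    def predict(self, features):
        # Unseen inputs were never stored, so the model falls back to a fixed guess.
        return self.memory.get(features, "apple")


# Hypothetical features: (length in cm, roundness from 0 to 1).
train = [((7.0, 0.9), "apple"), ((19.0, 0.2), "banana"), ((20.0, 0.15), "banana")]
unseen = [((19.5, 0.18), "banana"), ((18.0, 0.25), "banana")]

model = MemorizingModel().fit(train)
train_correct = sum(model.predict(x) == y for x, y in train)
unseen_correct = sum(model.predict(x) == y for x, y in unseen)
print(train_correct, "of", len(train), "training examples correct")
print(unseen_correct, "of", len(unseen), "unseen examples correct")
```

The unseen bananas are very close to the training bananas, yet the model gets them wrong because it learned no pattern, only a lookup table.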

![overfitting.png](figs/overfitting.png)


## Evaluation

* How good is the trained model? How many mistakes (wrong classifications) does it make on average?
* Ability to remember your training data does not tell how good your model is
    * Computers have almost unlimited memory
* You need to test how well your hypothesis (trained model) generalizes to new, unseen data examples
* Separate training, development and test data are needed!
* **Training data:** Used for training the model
* **Development data:** Used for testing different model parameters, for example level of regularization needed
* **Test data:** Never touched during training / model development, used for evaluating the final model
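
One simple way to obtain the three sets is to shuffle the data once and cut it into non-overlapping parts. The helper name `split_dataset`, the 80/10/10 proportions, and the fixed seed are assumptions for this sketch, not a recommendation from the notes.

```python
import random


def split_dataset(examples, train_fraction=0.8, dev_fraction=0.1, seed=0):
    """Shuffle once, then cut into three non-overlapping parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever is left over
    return train, dev, test


train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))
```

Because the three lists are slices of the same shuffled list, no example can end up in more than one set.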


Evaluation metric:

* Accuracy: Out of all test examples, what fraction was classified correctly?
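
Accuracy is simple enough to compute by hand. A minimal sketch, with made-up gold and predicted labels:

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of examples whose predicted label matches the ground truth."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have the same length")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


gold = ["apple", "banana", "banana", "apple"]
predicted = ["apple", "banana", "apple", "apple"]
print(accuracy(gold, predicted))  # 3 of 4 correct -> 0.75
```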
