# Notebook 3 - Machine Learning Basics & Classifiers


## 5. ML Basics

### 5.1. What is machine learning and how do machines learn?

According to Wikipedia, "Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data". Well, this definition seems quite broad and
it doesn't say much about this *mythical* learning process. However, since machines have been developed by humans, their way of learning may be similar to how we learn, right?

Yes, but not fully. Imagine being put in an unknown place on Earth. If you see rain at 10 am for 10 consecutive days, you will probably expect rain at 10 am on the next day too. In this case, your prediction is based on observing and identifying a regular *pattern*. However, people are rarely satisfied with predictions based on observation alone. Besides observing, we also try to *deduce* WHY something happens and what mechanism lies behind a phenomenon. In this case, by taking other factors into account, you could discover that the rain is caused by high humidity and perhaps a convection cycle.

What happens if a computer is given the task of predicting the next rainfall, given that it rained at 10 am for 10 consecutive days? The computer algorithm will try to detect and learn the *pattern* based on the given observation dataset (the 10-day rain history). After a successful learning process, it will be able to give predictions. So far, machine learning seems similar to human learning, but here is the difference - machines are able to recognize patterns only in the dataset they were given. The computer algorithm won't unexpectedly say: "Hello, the 10-day rain history is cool, but I think rain is related to something different, perhaps the geographical conditions". It will not say that because it doesn't even know that geographical conditions exist.

Now, imagine that the "raining hour" wasn't always 10 am - sometimes it started raining a little bit earlier, sometimes later. How does the computer learn that it should start raining *around* 10 am on the next day? Well, it depends on the learning algorithm implementation. For example, it may try to find the distribution of "raining hours" in the 24h period and determine what will be the boundary for determining when the rainfall takes place.

The expected outcome of the machine learning process is a function which maps an input to an output. In the case of rain prediction, it could be a function which, given a date and time, outputs whether it will be raining or not. Examples:
- f(Monday 10am) = raining
- f(Sunday 9pm) = not raining
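
To make the idea of "learning a function" more tangible, here is a minimal sketch of one possible approach. It is only an illustration: the observed hours are made up, the function takes just the hour of the day (not the weekday), and the "mean plus or minus two standard deviations" boundary is an assumed rule, not the way any particular library works.

```python
import statistics

# Hypothetical observations: the hour (24h clock) at which rain started on 10 consecutive days
observed_rain_hours = [10.0, 9.5, 10.5, 10.0, 9.75, 10.25, 10.0, 9.5, 10.5, 10.0]


def learn_rain_model(hours, tolerance=2.0):
    """'Training': summarise the observed hours and return a prediction function."""
    mean = statistics.mean(hours)
    spread = statistics.stdev(hours)
    # Assumed rule: rain is expected within `tolerance` standard deviations of the mean hour
    low, high = mean - tolerance * spread, mean + tolerance * spread

    def predict(hour):
        return "raining" if low <= hour <= high else "not raining"

    return predict


rain_f = learn_rain_model(observed_rain_hours)
print(rain_f(10))  # 10 am
print(rain_f(21))  # 9 pm
```

Note that the learned function knows nothing beyond the hours it was shown, which is exactly the limitation described above.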

Machine learning often has two separate stages: 
- training, which consists of developing the most accurate mapping function using most of the dataset
- testing, which is taking a small, independent part of the dataset to estimate the performance of the model in an unbiased way

### 5.2. Supervised vs unsupervised learning

There are many ways in which machine learning can be classified. One of them is the classification based on the supervision of the learning process.

Supervised learning is the type of learning that uses labeled datasets to develop a function that maps from the data to the related label. A good example of this would be a classification algorithm for detecting spam. We could have a big set of text messages manually marked as spam or normal. While learning, the classification algorithm may adjust itself based on these labels (e.g. from the labels it learns that the words “money” or “win” occur more often in spam messages, so the resulting function may classify messages containing both “money” and “win” as spam).

On the other hand, we have unsupervised learning, which doesn’t use labels to learn how to classify. These models try to learn the inherent structure of the dataset. Compared to the supervised spam detection model, an unsupervised model won’t have the power of labels and will not be able to match specific features with a given label. Instead, it will try to represent all messages in a way that makes it possible to identify some distinguishing properties and, in the end, group them.
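
The difference can be sketched on a toy example. Everything below is an assumption made for illustration: the messages are reduced to two hand-made features (how many times “money” and “win” occur), and `LogisticRegression` and `KMeans` are just two convenient `scikit-learn` models, not the only possible choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical features per message: [count of "money", count of "win"]
X = np.array([[3, 2], [2, 3], [4, 1], [0, 0], [0, 1], [1, 0]])
y = np.array([1, 1, 1, 0, 0, 0])  # labels: 1 = spam, 0 = normal

# Supervised: the model sees both the features and the labels
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(np.array([[3, 3], [0, 0]]))
print("supervised predictions:", supervised_pred)

# Unsupervised: the model sees only the features and groups similar messages
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_
print("cluster ids:", cluster_ids)
```

The cluster ids are arbitrary numbers: the unsupervised model can tell that there are two groups of messages, but it has no idea which of them is "spam".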

### 5.3. Datasets sources

There are three ways of getting a dataset. You can collect the data yourself, you can use datasets built into Python packages (`scikit-learn` comes with a variety of example datasets), or you can use datasets available online. The first solution isn't simple, since gathering data often requires access to non-public APIs, databases or other resources, and this takes a lot of time and effort. Built-in example datasets are great for learning and experimenting. For real problems, however, there are ready-made datasets available online which you can use free of charge. Of course, they may be messy and you will need to adjust them to your needs, but this is still a better choice since the data has already been gathered.

The most popular website with many publicly available datasets is [Kaggle](https://www.kaggle.com/).
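
As for the built-in datasets mentioned above, here is a short sketch of loading one of them. It uses the Iris dataset shipped with `scikit-learn`; the `as_frame=True` argument assumes scikit-learn 0.23 or newer.

```python
from sklearn.datasets import load_iris

# Load the built-in Iris dataset as a pandas DataFrame (features plus a "target" column)
iris = load_iris(as_frame=True)
iris_df = iris.frame

print(iris_df.shape)
print(iris_df.head())
```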

### 5.4. Datasets 

Datasets used for machine learning should follow a set of general and technical rules. The more general ones:
- the dataset should correspond to the type of ML task - datasets for classification will be different from those for regression
- when choosing a dataset, make sure it is a good representation of the environment (e.g. if you want to detect spam messages, your dataset shouldn't contain only spam)
- if you want to create a universal model, try to make the dataset diverse as well

Technical issues:
- many unnecessary columns in CSV files
- missing data
- duplicated data 
- invalid data


Let's see how to deal with these technical issues using a dataset containing information about medal-winning sportsmen and sportswomen from the Summer Olympic Games 2000-2016. (Modified dataset from Kaggle: [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results))
Firstly, please familiarize yourself with this dataset!

In [None]:
import pandas as pd

olympic_df = pd.read_csv('res/olympic.csv')
olympic_df.head()

#### Unnecessary columns

One of the most common technical issues with datasets is that CSV files come with a large number of columns while we want to use only a few of them. For example, we don't need the "Games" series containing the names of the games because we already have this information in the "Year" series. Also, since all games in this dataset were Summer Olympic Games, we can certainly remove the "Season" series as well. Let's remove them!

In [None]:
drop_list = ["Games", "Season"]
olympic_df.drop(drop_list, axis=1, inplace=True)  # inplace=True tells pandas to make changes directly on olympic_df
olympic_df.head()

#### Missing data

Now, datasets often come with missing data. This means that there are incomplete entries with some fields left empty. In pandas, a missing value is displayed as `NaN`. Look at the first row in `olympic_df` - the first entry does not have a value in the "Weight" series. There may be more incomplete entries like this one, also in other series. Let's see how we can use pandas to detect them.

In [None]:
olympic_df[olympic_df.isna().any(axis=1)]

If we detect a missing value in a field that interests us, we may want to remove the whole entry.

In [None]:
print("Row count before removal: {}".format(len(olympic_df)))
olympic_df.dropna(axis=0, how='any', inplace=True)
print("Row count after removal: {}".format(len(olympic_df)))

#### Duplicated data

Now, we can easily detect and remove duplicates in a similar way.

In [None]:
olympic_df[olympic_df.duplicated()]

In [None]:
olympic_df.drop_duplicates(keep="first", inplace=True)  # you can also specify a 'subset' parameter to specify in which column you want to search for duplicates
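
To illustrate the `subset` parameter mentioned in the comment above, here is a small sketch on a made-up DataFrame (the names and values are hypothetical, not taken from the Olympic dataset).

```python
import pandas as pd

# Hypothetical toy data: no two rows are identical in every column
toy_df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Bob"],
    "Year": [2000, 2000, 2004, 2008],
    "Medal": ["Gold", "Silver", "Bronze", "Bronze"],
})

# Without `subset`, all columns must match for a row to count as a duplicate
full_duplicates = toy_df.duplicated().sum()
print("full-row duplicates:", full_duplicates)

# With `subset`, only the listed columns are compared
deduplicated = toy_df.drop_duplicates(subset=["Name", "Year"], keep="first")
print(deduplicated)
```

Whether two entries sharing the same name and year really are duplicates depends on the dataset, so the choice of `subset` columns is a modelling decision rather than a purely technical one.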

#### Invalid data

The next important step before moving forward is data validation and cleaning. Accidental typos may happen at different stages of data collection, but we definitely don't want anyone with a height of 500 cm or a "Bronz" medal, as this may add harmful noise. Let's start by looking at the dataset summary for each series.

In [None]:
olympic_df.describe()

Now, it's clearly impossible to be 290 years old or to weigh 72000 kg. Let's set some natural limits for the Age, Height and Weight series and filter the dataset based on them.

In [None]:
invalid_olympic_df = olympic_df[~(olympic_df['Age'].between(10, 120)) | ~(olympic_df['Height'].between(50, 280)) | ~(olympic_df['Weight'].between(20, 500))]
invalid_olympic_df

In [None]:
olympic_df.drop(index=invalid_olympic_df.index, inplace=True)

Ok, how about medal names of the string type?

In [None]:
correct_medal_names = ["Gold", "Silver", "Bronze"]
weird_medal_names = olympic_df[~olympic_df['Medal'].isin(correct_medal_names)]
weird_medal_names

In [None]:
olympic_df.drop(index=weird_medal_names.index, inplace=True)
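
If you don't know the set of correct values in advance, counting the distinct values of a column is a quick way to spot typos. The sketch below uses a made-up Series rather than the Olympic dataset, so the values are purely illustrative.

```python
import pandas as pd

# Hypothetical medal column containing two typos
medals = pd.Series(["Gold", "Silver", "Bronze", "Bronz", "Gold", "gold", "Silver"])

# Rare values at the bottom of the count table are good candidates for typos
medal_counts = medals.value_counts()
print(medal_counts)

correct_names = ["Gold", "Silver", "Bronze"]
suspicious = sorted(set(medals) - set(correct_names))
print("suspicious values:", suspicious)
```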

### 5.5. Dataset training-test split

When we develop an ML model, there are 2 main data subsets we need, one for each process:
1. **training set** - training the model
2. **test set** - testing the model 

The fundamental principle is to test models on previously unseen data. If we evaluate a model on the same data that was used for training, an *overfitted* model (one that has memorized the training samples rather than learned a general pattern) will score artificially well compared to how it behaves on unseen data. This introduces bias into the evaluation and can cause huge inaccuracies in production. We want our evaluation procedure to be separated from the training set and the model development procedure. Moreover, we don't want to make any decisions about how the model should behave based on the results of the evaluation.

So, how do we split the dataset into the **training set** and the **test set**? Since we need much more data for learning and only a small (but representative) part for testing, we often divide it into:
- 80% training
- 20% testing

Although this proportion is used very often, it is not a rule and it may differ for different models. Let's see how we can use the `scikit-learn` package to split data.

Let's say we want to create a model which, based on age, height and weight, tells us which Olympic medal should be given to a person. To do this, we will need to create a separate DataFrame containing body properties (age, height, weight) and a Series containing the corresponding labels (type of the medal). Body properties will be the input to the model, and the expected outcome will be a medal name label. Since this model is going to be a function, we will refer to body properties as **X**s and medal name labels as **y**s.

In [None]:
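# Select the feature columns by name, which is more robust than relying on column positions
olympic_age_height_weight = olympic_df[["Age", "Height", "Weight"]]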
olympic_medal = olympic_df["Medal"]

Now, let's **randomly** split data into training set and testing set.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    olympic_age_height_weight, olympic_medal,
    test_size=0.2, random_state=0)  # random state is a seed for the shuffling function. If you don't want shuffling remove it and add "shuffle=False" parameter.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


As you can see, the `train_test_split()` function automatically generated the required subsets. However, let's take a closer look at how it works. By default, it generates the subsets at random, which is not always the best solution, since we want those subsets to be good representations of the whole dataset (and of the real environment as well).

The standard proportion for medals should be around 33:33:33 for bronze, silver and gold so if you pick one sample at random, there is about the same chance for each medal.

But what happens if the proportion is something like 45:45:10? Imagine a situation where the great majority of samples labeled as "Gold" (medal) have been put in the test set. Because of this, the model may give inaccurate results, since during training it has seen only a few gold medalists. In this case, we want both the training set and the test set to have the same proportions as the whole dataset (45:45:10).

Let's see how to achieve this using the `stratify` parameter on a **slightly modified olympic dataset**.

In [None]:
olympic_low_gold_df = pd.read_csv("res/olympic_low_gold.csv")
# print() is needed here because a notebook cell displays only its last expression
print(olympic_low_gold_df['Medal'].value_counts())

print(olympic_low_gold_df['Medal'].value_counts(normalize=True))  # fractions sum up to 1

As you can see, gold medals are about 1.6% of all medals. We want to have the same proportion in both the training set and the test set.

In [None]:
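# Assumption: olympic_low_gold.csv has the same "Age", "Height", "Weight" and "Medal" columns as olympic.csv
olympic_age_height_weight_gold = olympic_low_gold_df[["Age", "Height", "Weight"]]
olympic_medal_gold = olympic_low_gold_df["Medal"]

X_train, X_test, y_train, y_test = train_test_split(
    olympic_age_height_weight_gold, olympic_medal_gold,
    test_size=0.2, stratify=olympic_medal_gold, random_state=0)  # stratify takes the array of labels whose class proportions should be preserved in both subsets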

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
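
The effect of `stratify` can also be checked without any CSV file. The following sketch builds a synthetic label column with the 45:45:10 proportion discussed above; the feature column is a meaningless placeholder, used only because `train_test_split()` needs something to split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels in a 45:45:10 proportion (1,000 samples in total)
labels = pd.Series(["Bronze"] * 450 + ["Silver"] * 450 + ["Gold"] * 100)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# Plain random split: class proportions in the test set may drift
_, _, _, y_test_plain = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# Stratified split: class proportions are preserved in both subsets
_, _, y_train_strat, y_test_strat = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

print(y_test_plain.value_counts(normalize=True))
print(y_train_strat.value_counts(normalize=True))
print(y_test_strat.value_counts(normalize=True))
```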

### 5.6. Cross validation & validation set

During model development, we choose the machine learning algorithm that is most suitable for a given problem. Machine learning algorithms generally aren't plug and play - we must adjust algorithm parameters (called hyperparameters) or choose some specific algorithm behaviours. We do this because each problem is different and must be approached in a specific way.

When designing machine learning models, especially language processing ones, it is important to have knowledge about data it will work with. Based on it, it is possible to tune algorithm parameters, so the algorithm may work better. So basically, we need a representative set of samples, which we can use to tune model parameters. What are the possible options? 

 - Maybe we can use the test set, since it is a good representation of the whole dataset and it still won't be used for training. Nevertheless, by doing so, the algorithm will get some insight into the test set, which may artificially inflate the model performance - **bad idea**.
 - The other solution is to create a separate set, called a **validation set**, and use it to adjust model parameters. This, however, reduces the number of samples available for training and may result in decreased performance - **acceptable when data is plentiful, otherwise also a bad idea**.
 - So the final solution is to use **Cross-Validation** (**CV**), which is a smart way of using the same training set for training and validation.

First of all, we divide the dataset into an untouchable test set and the remaining data. We will use the remaining data for cross-validation.
CV works in the following way: we randomly choose a training and a validation set out of the remaining data. Then we train our model using the training subset and tune parameters based on the validation subset. Then we repeat the process n times, each time with a different, randomly selected training and validation set. The unbiased error rate (model performance) is calculated at the end using the held-out test set.
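
The procedure can be sketched as follows. This is an illustration under several assumptions: it uses the built-in Iris dataset instead of the Olympic data, a k-nearest-neighbours classifier as the model, and stratified k-fold splitting (a common variant in which every sample is used for validation exactly once) rather than fully random re-sampling.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)

# Step 1: set aside the untouchable test set
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0)

# Step 2: cross-validate each candidate hyperparameter value on the remaining data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidate_ks = [1, 3, 5, 7]
mean_scores = {}
for k in candidate_ks:
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_rest, y_rest, cv=cv)
    mean_scores[k] = fold_scores.mean()

# Step 3: retrain with the best value and evaluate once on the held-out test set
best_k = max(mean_scores, key=mean_scores.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
holdout_accuracy = final_model.score(X_holdout, y_holdout)

print("mean CV accuracy per k:", mean_scores)
print("best k:", best_k)
print("held-out test accuracy:", holdout_accuracy)
```

The test set is touched only in the last step, after the hyperparameter has already been chosen.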

**TODO**
K Fold
Stratified

### 5.7. Evaluation - accuracy, precision, recall

After training a model, we need a metric to understand how well (or not) it is doing. In the case of supervised learning we use labeled data, so we can compare the model outcome with the human-produced label, also called a **gold label**. Labels predicted by the model we will call **system labels**. 

Okay, let's say we create a spam detector which, given a text, classifies it as spam or not. To measure the performance of such a model, we could create a simple metric which says what percentage of the guesses were correct (where the system label is identical to the gold label). This metric is called **accuracy** and is expressed as the ratio of the number of correctly predicted system labels to the total number of guesses.

There are 4 possible situations:
- The message was indeed spam and the system classified it as spam (**true positive**)
- The message was indeed spam and the system classified it as not spam (**false negative**)
- The message was not spam and the system classified it as not spam (**true negative**)
- The message was not spam and the system classified it as spam (**false positive**)
<div style="text-align:center"><img src="res/pic1.png" width="500"/></div>

Using the terms above, **accuracy** is expressed as the sum of true positives and true negatives divided by the number of all samples.

So, for example, say we had a test set of 10 samples. Five of them have been correctly classified as spam (true positive), 2 were correctly classified as not spam (true negative) and 3 were incorrectly classified as spam (false positive). In this case the accuracy of the model would be (5+2)/10, which is 70%, cool!
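
The same arithmetic, written as a small helper function (the function name is our own, it is not part of any library):

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = correct guesses (TP + TN) divided by all guesses."""
    return (tp + tn) / (tp + tn + fp + fn)


# The spam example: 5 true positives, 2 true negatives, 3 false positives, 0 false negatives
spam_accuracy = accuracy_from_counts(tp=5, tn=2, fp=3, fn=0)
print(spam_accuracy)  # 0.7
```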

Now, this may seem unreasonable at first glance, but accuracy isn't always the best metric, and sometimes it is totally meaningless. Imagine a model which detects tweets about `coffee` (binary decision: "tweet is about coffee" or "tweet is not about coffee"). Our test set should reflect the real conditions, where the great majority of tweets are NOT about coffee. Let's say our test set contains 100,000 samples, of which only a small fraction are about coffee (1% or so). Now, what will be the accuracy of this model if it **doesn't work** and always says that a sample is "not about coffee"? Let's calculate it. We have 100,000 tweets in the test set:
- 1,000 tweets are about coffee
- 99,000 tweets are not about coffee

Since the model always says that a tweet is not about coffee, we have 1,000 false negatives and 99,000 true negatives. The accuracy is 99,000 / 100,000, which is 99%! Wow - what a perfect model!

To detect situations like this, there are two other metrics which, used together, form a useful set of evaluation metrics.
The first one is called **precision** - it expresses the ratio of true positives to everything the system labeled as positive (true positives and false positives).
The other one is called **recall** - it expresses the ratio of true positives to all samples which were in fact gold labeled as positive (true positives and false negatives).

Precision and recall are different from accuracy as they emphasize **true positives** - things we are supposed to be looking for.
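
Both definitions can be written as small helper functions. The function names are our own, and returning 0.0 when the denominator is zero is an assumed convention (strictly speaking, the value is undefined in that case).

```python
def precision_from_counts(tp, fp):
    """Precision = TP / (TP + FP); 0.0 by convention if nothing was labeled positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN); 0.0 by convention if there are no gold positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Spam example: 5 true positives, 3 false positives, 0 false negatives
spam_precision = precision_from_counts(tp=5, fp=3)
spam_recall = recall_from_counts(tp=5, fn=0)
print(spam_precision, spam_recall)  # 0.625 1.0

# Coffee example: the model never says "about coffee", so TP = 0, FP = 0, FN = 1,000
coffee_precision = precision_from_counts(tp=0, fp=0)
coffee_recall = recall_from_counts(tp=0, fn=1000)
print(coffee_precision, coffee_recall)  # 0.0 0.0
```

Unlike accuracy, both metrics immediately reveal that the coffee model finds nothing.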

<div style="text-align:center"><img src="res/pic2.png" width="700"/></div>

There are ways of defining a single metric taking into account both precision and recall. One of the simplest combinations is called the **F-measure**. The beta parameter is used to make one metric more important than the other. Although it may be useful in many cases, the most popular value for beta is 1, where precision and recall are equally important.

<div style="text-align:center"><img src="res/pic4.png" width="200"/></div>
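
As a sketch of how beta shifts the balance, the snippet below implements the commonly used definition F_beta = (1 + beta²) · P · R / (beta² · P + R) by hand and compares it with `fbeta_score` from `scikit-learn`. The helper name `f_beta` and the sample labels are our own illustration.

```python
import numpy as np
from sklearn.metrics import fbeta_score


def f_beta(precision, recall, beta=1.0):
    """F-measure; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Sample labels: TP = 1, FP = 0, FN = 2, so precision = 1.0 and recall = 1/3
demo_true = np.array([1, 0, 1, 1, 0])
demo_pred = np.array([0, 0, 0, 1, 0])

for beta in (0.5, 1.0, 2.0):
    manual = f_beta(1.0, 1 / 3, beta)
    library = fbeta_score(demo_true, demo_pred, beta=beta)
    print(f"beta={beta}: manual={manual:.4f}, scikit-learn={library:.4f}")
```

Because this model has high precision but low recall, the score drops as beta grows and recall gains importance.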


Example of some metrics for a binary decision problem:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# To give these numbers some meaning, let's say they are gold labels for your spam detector: 1 means spam, 0 otherwise
y_true = np.array([1, 0, 1, 1, 0])

# This array represents the system labels produced by the model
y_pred = np.array([0, 0, 0, 1, 0])

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1 metric:", f1_score(y_true, y_pred))

# print(classification_report(y_true, y_pred))
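
To see where these numbers come from, it can help to count the four situations (true/false positives/negatives) directly. A minimal sketch using `confusion_matrix` from `scikit-learn`, with the same labels as above stored under different variable names:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

gold_labels = np.array([1, 0, 1, 1, 0])
system_labels = np.array([0, 0, 0, 1, 0])

# For binary labels 0/1, ravel() returns the counts in the order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(gold_labels, system_labels).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```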


Do you remember the coffee tweet problem? Let's try to recreate it artificially and inspect the metrics of that imaginary model, which classifies every tweet as "not about coffee".

In [None]:
# 100,000 samples, 99,000 not about coffee labeled with 0, 1,000 about coffee labeled with 1
y_true = np.zeros((100000,))

# Generate 1,000 indices for the y_true to be set to 1
from random import sample
coffee_indices = sample(range(100000), 1000)

for coffee_index in coffee_indices:
    y_true[coffee_index] = 1

unique, counts = np.unique(y_true, return_counts=True)
print(dict(zip(unique, counts)))

Okay, now we have an array of 100,000 samples, of which only 1,000 are "about coffee". Our model always says that a sample is not about coffee, so it will label all of them with 0.

In [None]:
y_pred = np.zeros((100000,))

Let's see model stats!

In [None]:
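print("accuracy:", accuracy_score(y_true, y_pred))
# The model never predicts the positive class, so precision is 0/0; zero_division=0 reports it as 0 without a warning
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("recall:", recall_score(y_true, y_pred))
print("f1 metric:", f1_score(y_true, y_pred, zero_division=0))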

As we discussed earlier: although the accuracy is fantastic, the other metrics say that our model doesn't work at all (because it doesn't!).

## 6. Classifiers

### 6.1. Types of classifiers

- linear classifiers
- non-linear classifiers (kNN)

- generative approach
- discriminative approach

### 6.2. Naive-Bayes (NB)

### 6.3. Logistic Regression 


### 6.4. Decision Tree


### 6.5. Support Vector Machine (SVM)