# Notebook 3 - Machine Learning Basics & Classifiers


## 5. ML Basics

### 5.1. What is machine learning and how do machines learn?

According to Wikipedia, "Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data". This definition is quite broad and says little about the *mythical* learning process itself. However, since machines are built by humans, their way of learning may be similar to ours, right?

Yes, but not fully. Imagine being put in an unknown place on Earth. If you see rain at 10 am on 10 consecutive days, you will find it very likely that it will also rain at 10 am on the next day. In this case, your prediction is based on observing and identifying a regular *pattern*. For people, however, predicting from observation alone is rarely enough. Besides observing, we also try to *deduce* WHY something happens and what mechanism lies behind a phenomenon. Here, by taking other factors into account, you could discover that the rain is caused by high humidity and perhaps a convection cycle.

What happens if a computer is given the task of predicting the next rainfall, given that it rained at 10 am on 10 consecutive days? The computer algorithm will try to detect and learn the *pattern* in the given observation dataset (the 10-day rain history). After a successful learning process, it will be able to make predictions. So far, machine learning seems similar to human learning, but here is the difference: machines can recognize patterns only in the dataset they were given. The computer algorithm won't unexpectedly say: "Hello, the 10-day rain history is cool, but I think rain is related to something different, perhaps the geographical conditions". It will not say that because it doesn't even know that geographical conditions exist.

Now, imagine that the "raining hour" wasn't always 10 am - sometimes it started raining a little earlier, sometimes a little later. How does the computer learn that the rain should start *around* 10 am on the next day? Well, it depends on how the learning algorithm is implemented. For example, it may estimate the distribution of "raining hours" over the 24h period and use it to choose the boundaries within which rainfall is expected.

The expected outcome of the machine learning process is a function that maps an input to an output. In the case of rain prediction, it could be a function that, given a date and time, outputs whether it will be raining or not. Examples:
- f(Monday 10am) = raining
- f(Sunday 9pm) = not raining
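To make the idea of a learned mapping function more concrete, here is a minimal sketch. It assumes a very simple learning rule (mean start hour plus or minus two standard deviations), ignores the day of the week, and uses made-up observations; `learn_rain_window` and `make_predictor` are hypothetical helper names, not part of any library.

```python
from statistics import mean, stdev


def learn_rain_window(start_hours, width=2.0):
    """Learn an hour interval (low, high) that covers typical rain start times."""
    center = mean(start_hours)
    spread = stdev(start_hours)
    return center - width * spread, center + width * spread


def make_predictor(window):
    """Turn the learned interval into a mapping function f(hour) -> label."""
    low, high = window

    def f(hour):
        return "raining" if low <= hour <= high else "not raining"

    return f


# Made-up rain start hours for 10 days (10.25 means 10:15 am).
observed_start_hours = [9.5, 10.0, 10.25, 9.75, 10.5, 10.0, 9.75, 10.25, 10.0, 10.0]

f = make_predictor(learn_rain_window(observed_start_hours))
print(f(10))  # 10 am
print(f(21))  # 9 pm
```

Note that the function knows nothing beyond the hours it has seen, which is exactly the limitation described above.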

Machine learning often has two separate stages: 
- training, which consists of developing the most accurate mapping function using most of the dataset
- testing, which is taking a small, independent part of the dataset to estimate the performance of the model in an unbiased way
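The two stages need two disjoint parts of the dataset. The sketch below shows one simple way to obtain them, assuming an 80/20 split after random shuffling; `train_test_split_simple` is a hypothetical helper written for illustration, and the "dataset" is just a list of row numbers.

```python
import random


def train_test_split_simple(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut off a small part for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed makes the split repeatable
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


dataset = list(range(10))  # stand-in for 10 dataset rows
train_rows, test_rows = train_test_split_simple(dataset)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the rows are ordered (for example by date), because otherwise the test part may not resemble the training part.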

### 5.2. Supervised vs unsupervised learning

There are many ways to categorize machine learning methods. One of them is based on how the learning process is supervised.

Supervised learning is the type of learning that uses labeled datasets to develop a function that maps from the data to the related label. A good example of this would be a classification algorithm for detecting spam. We could have a large set of text messages manually marked as spam or normal. While learning, the classification algorithm may adjust itself based on these labels (e.g. from the labels it learns that the words “money” and “win” occur more often in spammy messages, so the resulting function may classify messages containing both “money” and “win” as spam).
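The spam example can be sketched in a few lines. This is only a toy word-counting classifier under simplifying assumptions (tiny made-up dataset, whitespace tokenization, no probabilities); `train` and `classify` are hypothetical names and do not represent any particular library.

```python
from collections import Counter


def train(messages):
    """Count, per label, in how many messages each word occurs.

    messages: list of (text, label) pairs, label is 'spam' or 'normal'.
    """
    counts = {"spam": Counter(), "normal": Counter()}
    for text, label in messages:
        counts[label].update(set(text.lower().split()))
    return counts


def classify(counts, text):
    """Pick the label whose training messages shared more words with the text."""
    words = set(text.lower().split())
    spam_score = sum(counts["spam"][w] for w in words)
    normal_score = sum(counts["normal"][w] for w in words)
    return "spam" if spam_score > normal_score else "normal"


labeled = [
    ("win money now", "spam"),
    ("you win free money", "spam"),
    ("lunch at noon", "normal"),
    ("see you at the meeting", "normal"),
]

model = train(labeled)
print(classify(model, "win big money"))
print(classify(model, "meeting at noon"))
```

The labels do all the work here: without them, the model would have no way to connect "win" and "money" with spam.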

On the other hand, we have unsupervised learning, which doesn’t use labels to learn how to classify. These models try to learn the inherent structure of the dataset. Compared to the supervised spam detection model, an unsupervised model won’t have the power of labels and will not be able to match specific features with a given label. Instead, it will try to represent all messages in a way that makes it possible to identify some distinguishing properties and, in the end, group them.
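For comparison, here is a minimal unsupervised sketch: a one-dimensional 2-means clustering that groups messages by a single numeric feature. The feature (number of exclamation marks per message) and its values are assumptions made up for illustration, and `two_means` is a hypothetical helper.

```python
def two_means(values, iterations=10):
    """Split numbers into two groups around two centers (1-D k-means, k=2)."""
    centers = [min(values), max(values)]  # simple deterministic start
    groups = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to the nearest center.
        groups = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of its group.
        for k in (0, 1):
            members = [v for v, g in zip(values, groups) if g == k]
            if members:
                centers[k] = sum(members) / len(members)
    return groups, centers


exclamation_counts = [0, 1, 0, 7, 9, 8, 1]  # one made-up number per message
groups, centers = two_means(exclamation_counts)
print(groups)
print(centers)
```

The algorithm only reports that there are two groups. Whether group 1 means "spam" is something it cannot know without labels.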

### 5.3. Continuous vs discrete outputs

### 5.4. Datasets 

TODO: datasets source

Datasets used for machine learning should follow a set of general and technical rules. The more general ones:
- the dataset should correspond to the type of ML task - datasets for classification will be different from those for regression
- when choosing a dataset, make sure it is a good representation of the environment (e.g. if you want to detect spam messages, your dataset shouldn't contain only spam)
- if you want to create a universal model, try to make the dataset diverse as well
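One quick way to check the representation rule for a classification dataset is to look at the share of each label. The sketch below assumes a plain list of labels with made-up values; `label_shares` is a hypothetical helper.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of the dataset taken by each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


labels = ["spam"] * 3 + ["normal"] * 7  # made-up labels
shares = label_shares(labels)
print(shares)
```

A dataset in which one label is missing or has a tiny share is a warning sign that the model will not see enough examples of it.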

Technical issues:
- many unnecessary columns in CSV files
- missing data
- duplicated data 
- invalid data

Let's see how to deal with these technical issues using a dataset with information about medal-winning sportsmen and sportswomen from the Summer Olympic Games 2000-2016 (a modified version of the Kaggle dataset [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)).
First, please familiarize yourself with this dataset!

In [None]:
import pandas as pd

olympic_df = pd.read_csv('res/olympic.csv')
olympic_df.head()

One of the most common technical issues with datasets is that CSV files come with many columns, but we want to use only a few of them. For example, we don't need the "Games" series containing the names of the games because we already have this info in the "Year" series. Also, since all games in this dataset were Summer Olympic Games, we can safely remove the "Season" series as well. Let's remove them!

In [None]:
drop_list = ["Games", "Season"]
olympic_df.drop(drop_list, axis=1, inplace=True)  # inplace=True tells pandas to make changes directly on olympic_df
olympic_df.head()
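As an alternative, unwanted columns can be skipped already while reading the file. The sketch below uses the `usecols` parameter of `pd.read_csv` with a small in-memory CSV; the sample rows are made up and only mimic a few columns of the real file.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = """Name,Year,Games,Season,Medal
A,2000,2000 Summer,Summer,Gold
B,2004,2004 Summer,Summer,Silver
"""

unwanted = {"Games", "Season"}

# usecols accepts a function that is called with every column name
# and keeps the columns for which it returns True.
small_df = pd.read_csv(io.StringIO(csv_text), usecols=lambda col: col not in unwanted)
print(list(small_df.columns))
```

This can save memory for large files, because the dropped columns are never loaded.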

Now, datasets often come with missing data. This means that there are incomplete entries with some fields left empty. In pandas, a missing value is displayed as `NaN`. Look at the first row of olympic_df - the first entry does not have a value in the "Weight" series. There may be more incomplete entries like this one, also in other series. Let's see how we can use pandas to detect them.

In [None]:
olympic_df[olympic_df.isna().any(axis=1)]
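Besides listing the incomplete rows, it is often useful to count the missing values per column. Here is a small sketch on a made-up DataFrame (`sample_df` is not the Olympic dataset):

```python
import pandas as pd

# Made-up data; None in a numeric column is stored as NaN.
sample_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Height": [180.0, None, 175.0],
    "Weight": [None, None, 70.0],
})

# isna() marks every missing cell, sum() counts the marks per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

The same call on `olympic_df` would show which series are affected the most.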


If we detect a missing value in a field that interests us, we may want to remove the whole entry.

In [None]:
print("Row count before removal: {}".format(len(olympic_df)))
olympic_df.dropna(axis=0, how='any', inplace=True)
print("Row count after removal: {}".format(len(olympic_df)))
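If only some fields matter for the task, the removal can be limited to them with the `subset` parameter of `dropna`. A minimal sketch on made-up data, assuming that only "Height" is of interest:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Height": [180.0, None, 175.0],
    "Weight": [None, 80.0, 70.0],
})

# Drop a row only when "Height" is missing; a missing "Weight" is tolerated.
kept_df = sample_df.dropna(subset=["Height"])
print(kept_df["Name"].tolist())
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves the original one untouched.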

Now, we can detect and remove duplicates in a similar way:

In [None]:
olympic_df[olympic_df.duplicated()]

In [None]:
olympic_df.drop_duplicates(keep="first", inplace=True)  # the optional 'subset' parameter limits the duplicate check to the given columns
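The effect of the `subset` parameter is easiest to see on a tiny made-up DataFrame. In this sketch no row is a full duplicate, but two rows share the same "Name" and "Year":

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Year": [2000, 2000, 2004],
    "Event": ["100m", "200m", "100m"],
})

# All columns compared: every row is unique.
full_dupes = sample_df.duplicated().sum()

# Only "Name" and "Year" compared: the second row repeats the first one.
name_year_dupes = sample_df.duplicated(subset=["Name", "Year"]).sum()

deduplicated = sample_df.drop_duplicates(subset=["Name", "Year"], keep="first")
print(full_dupes, name_year_dupes, len(deduplicated))
```

Be careful with `subset`: here it would throw away a legitimate second event of the same athlete, so the right choice depends on what counts as "the same entry" in your task.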

The next important step before moving forward is data validation and cleaning. Accidental typos may happen at different stages of data collection, but we definitely don't want anyone with a height of 500 cm or a "Bronz" medal, as this may add harmful noise. Let's start by setting some natural limits for the Age, Height and Weight series and checking whether all values in the DataFrame are within the appropriate range.

In [None]:
invalid_olympic_df = olympic_df[~(olympic_df['Age'].between(10, 120))
                                | ~(olympic_df['Height'].between(50, 280))
                                | ~(olympic_df['Weight'].between(20, 500))]
invalid_olympic_df

In [None]:
olympic_df.drop(index=invalid_olympic_df.index, inplace=True)
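When there are more numeric series to validate, the limits can be kept in a dictionary and checked in a loop. The sketch below reuses the same limits as above on made-up rows; `out_of_range_mask` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Same limits as above: column -> (lowest allowed, highest allowed).
limits = {"Age": (10, 120), "Height": (50, 280), "Weight": (20, 500)}


def out_of_range_mask(df, limits):
    """Return a boolean Series that is True for rows breaking any limit."""
    mask = pd.Series(False, index=df.index)
    for column, (low, high) in limits.items():
        mask |= ~df[column].between(low, high)
    return mask


sample_df = pd.DataFrame({
    "Age": [25, 30, 5],
    "Height": [180, 500, 170],
    "Weight": [75, 80, 60],
})

mask = out_of_range_mask(sample_df, limits)
clean_df = sample_df[~mask]
print(sample_df[mask])
```

Note that `between` returns False for missing values, so rows with `NaN` in a checked column are reported as invalid too.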

OK, how about the medal names, which are strings? We can check them against the list of allowed values.

In [None]:
weird_medal_names = olympic_df[~olympic_df['Medal'].isin(["Gold", "Silver", "Bronze"])]
weird_medal_names

In [None]:
olympic_df.drop(index=weird_medal_names.index, inplace=True)
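Dropping is the safest option, but obvious typos can sometimes be repaired instead. The sketch below assumes a hand-written typo map (`corrections` is hypothetical and would have to be prepared after inspecting the real data) and works on made-up values:

```python
import pandas as pd

corrections = {"Bronz": "Bronze", "gold": "Gold"}  # hypothetical typo map
allowed = ["Gold", "Silver", "Bronze"]

sample_df = pd.DataFrame({"Medal": ["Gold", "Bronz", " Silver", "gold", "Platinum"]})

# Remove surrounding spaces first, then replace the known typos.
cleaned = sample_df["Medal"].str.strip().replace(corrections)

# Whatever is still not an allowed name cannot be repaired automatically.
still_invalid = ~cleaned.isin(allowed)
print(cleaned[still_invalid].tolist())
```

Only values that have one unambiguous correct form should be repaired this way. Anything else is better removed, as shown above.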

### 5.5. Dataset training-validation-test split

### 5.6. Cross validation

### 5.7. Performance measures - accuracy, precision, recall

## 6. Classifiers

### 6.1. Types of classifiers

- linear classifiers
- non-linear classifiers (kNN)

- generative approach
- discriminative approach

### 6.2. Naive Bayes (NB)

### 6.3. Logistic Regression 


### 6.4. Decision Tree


### 6.5. Support Vector Machine (SVM)