# Lecture 6: Fundamentals of machine learning

Here are the packages that we will be using for the lecture today:

In [2]:
if (!require("pacman")) install.packages("pacman")

Loading required package: pacman



In [3]:
pacman::p_load(dplyr, ggplot2, rsample, caret, AmesHousing)


## References

The primary reference for the first four lectures is `Boehmke and Greenwell`, which can be found [here](https://bradleyboehmke.github.io/HOML/index.html). I will follow this book quite closely, and most of these notes are simply a summary of what can be found in it. I recommend that you read the referenced text if you want the full detail. There are also some sections in the book that I have skipped due to time constraints. If you want a textbook for the first part of the course, it can be found [here](https://www.statlearning.com/) (a second edition is coming soon!). We do not follow that textbook closely in these notes, but you can read it for a more theoretical discussion. We will also be looking at [this](https://mdsr-book.github.io/mdsr2e/ch-foundations.html) and [this](https://online.stat.psu.edu/stat508/lesson/2) set of notes from time to time.

# Introduction to machine learning

These notes are a concise summary of some of the material in the reference section. They are also provided as a code supplement to the more theoretical textbook that we cover in this course. The goal of the module is to be practical and focus more on the application than deeper theory. The theory is vitally important to understanding application to its fullest extent, but we see this course as an opportunity to introduce students to different topics and from there they can pursue the underlying theoretical structures. In essence, we don't want to make this just another course that is laden with theory. We want it to have practical value for you in the workplace. 

*Note: Our discussion below (and for the rest of my part of the module) focuses only on **supervised learning**. Nico will cover aspects of unsupervised learning.*

Machine learning is all about learning from data. The main idea is that we have some type of outcome measurement and we wish to use a set of features to train learners and make predictions. In order to train the learners we need a training set. In supervised learning, the training data includes the known value of the target for each observation. Using this training data we construct a statistical learner that enables us to predict outcomes for new, unseen observations. We can also say that in this process we are generating a prediction model. A good model / learner is one that accurately predicts the outcome.

**Side note:** To make things a bit easier, when we talk about a predictor, feature or attribute, we mean the same thing as an independent variable in econometrics. Likewise, target, outcome and response all correspond to the idea of a dependent variable.

Let us use a basic example to try and understand what this conceptual overview means. If you are starting out with machine learning, one of the things you can try is to participate in [Kaggle](https://www.kaggle.com/) competitions. This is basically a website where you can go and participate in fun machine learning competitions. One of the most famous competitions is based on data from the Titanic shipwreck. The goal of this prediction competition is to use machine learning to create a model that predicts which passengers survived this famous shipwreck. The outcome measurement in this example is passenger survival, while the features are all the attributes of the people on board the ship. These attributes include things like name, age, gender, socio-economic class, etc. In this example we have a well-defined learning task: we intend to use specific attributes to predict some outcome measurement.

## Supervised learning

Let us quickly mention what is meant by supervised learning. Supervision refers to the fact that the outcome provides a supervisory role. The learner in this type of setting attempts to optimise a function to find the combination of attributes that results in the predicted value being as close to the actual target output as possible. Supervised learning problems can be categorised as either regression or classification. We will discuss both regression and classification problems in greater detail in the lectures to come. 
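To make the distinction between regression and classification concrete, here is a minimal sketch. It is an illustration only: it uses the built-in `mtcars` dataset as a stand-in (not the data used in these notes), and the choice of `lm()` and `glm()` as learners, as well as the 0.5 cut-off, are assumptions made for the example.

```r
# Regression: the target (mpg) is numeric.
# lm() chooses coefficients that minimise the squared distance
# between the predicted and the actual target values.
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = mtcars)
rmse     <- sqrt(mean((mtcars$mpg - reg_pred)^2))  # typical prediction error

# Classification: the target (am: 0 = automatic, 1 = manual) is categorical.
# A logistic regression returns a probability, which we turn into a class
# with an (assumed) cut-off of 0.5.
clf_fit  <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob <- predict(clf_fit, newdata = mtcars, type = "response")
clf_pred <- ifelse(clf_prob > 0.5, 1, 0)
accuracy <- mean(clf_pred == mtcars$am)  # share of correct class predictions

c(rmse = rmse, accuracy = accuracy)
```

Note that both measures here are computed on the same data that was used to fit the models, so they say nothing yet about how well the learners would do on unseen data.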


# The modelling process

Since we often work with large datasets in machine learning, we will normally have a checklist of operations that we need to perform before we can truly decide which model will work best. Even in the case of small datasets, it is worthwhile working through the routine described below. Performing this process provides confidence that we have covered all our bases and that the outcomes of the model are as accurate as possible.

As succinctly stated in `Boehmke and Greenwell`, "[a]pproaching ML modeling correctly means approaching it strategically by spending our data wisely on learning and validation procedures, properly pre-processing the feature and target variables, minimizing data leakage, tuning hyperparameters, and assessing model performance."

The concepts mentioned in this quote may not mean much to you at this moment, but I can guarantee that each of these items is crucial to the construction and evaluation of a machine learning model. So before we start with the machine learning algorithms, let us find out what these things mean and why they are important. The general predictive machine learning process that we will follow is illustrated in the figure below, taken from `Boehmke and Greenwell`.

**Include the figure here**

## Data splitting

To illustrate the modelling process we are going to be using a housing dataset called `Ames Housing`.

In [5]:
ames <- AmesHousing::make_ames()  # Ames housing data

It is important that our algorithm not only fits the data we have well, but also performs well on new data that might become available. We want our algorithm to generalise to unseen data. In order to determine how well our model might do in a new environment, we can split the data that we have into training and test data sets (as depicted in the figure above). The training set is used to determine the attributes of our final model, and once we have this final model we can evaluate its performance against the test data set. This is akin to in-sample and out-of-sample forecasting in time series econometrics. There are two common ways to split the data, namely simple random sampling and stratified sampling. We will only look at simple random sampling for now, but you can refer to the textbook for notes on stratified sampling.


### Simple random sampling

The easiest way to split the data is using a simple random sample. There are multiple ways that you can do this. You can use base R, but there are also more specialised packages available. One way to do simple random sampling, with a 70-30 split in the data, would be the following.


In [6]:
# Using base R
set.seed(123)  # Set the seed for reproducibility
index_1 <- sample(1:nrow(ames), round(nrow(ames) * 0.7))  # Use the sample function
train_1 <- ames[index_1, ]  # The randomly sampled 70% of rows form the training set
test_1  <- ames[-index_1, ]  # The rest of the data is for the test set
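
It can be worth checking that a split behaves as intended. The sketch below wraps the same base R logic in a small helper and verifies the sizes of the two sets and that they share no rows. The helper `split_random()` is hypothetical (it is not part of any package), and the built-in `iris` data is used as a stand-in so that the example runs without `AmesHousing`.

```r
# Hypothetical helper: simple random split of a data frame
split_random <- function(data, prop = 0.7, seed = 123) {
  set.seed(seed)                                # reproducible sampling
  n     <- nrow(data)
  index <- sample(seq_len(n), round(n * prop))  # row numbers for the training set
  list(
    train = data[index, , drop = FALSE],
    test  = data[-index, , drop = FALSE]        # every row not drawn for training
  )
}

parts <- split_random(iris, prop = 0.7)

nrow(parts$train)  # 105 of the 150 rows
nrow(parts$test)   # the remaining 45 rows

# No row should appear in both sets
length(intersect(rownames(parts$train), rownames(parts$test)))  # 0
```

The same checks can be applied to `train_1` and `test_1` once the Ames data has been loaded.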

This seems somewhat complicated and requires you to think much too hard about the problem. Remember that we are trying to be as efficient as possible here! It turns out that there is already a package in `R` that allows us to construct a training and test set with simple random sampling, namely the `rsample` package. 

In [7]:
# Using rsample package
set.seed(123)  # Set the seed for reproducibility
split_1  <- initial_split(ames, prop = 0.7)  # Split the dataset 
train_2  <- training(split_1)  # Training set
test_2   <- testing(split_1)  # Test set
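
Once we have a training and a test set, the idea of evaluating a model on unseen data can be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in `iris` data and a plain linear regression rather than the Ames data and the models covered later, and `rmse()` is a helper defined here for the example.

```r
# Simple random 70-30 split (base R), on a stand-in dataset
set.seed(123)
n     <- nrow(iris)
index <- sample(seq_len(n), round(n * 0.7))
train <- iris[index, ]
test  <- iris[-index, ]

# Fit the model using the training set only
fit <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = train)

# Helper defined for this example: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# In-sample error (training set) versus out-of-sample error (test set)
rmse_train <- rmse(train$Sepal.Length, predict(fit, newdata = train))
rmse_test  <- rmse(test$Sepal.Length,  predict(fit, newdata = test))

c(train = rmse_train, test = rmse_test)
```

The test set error is the number we care about, since it is computed on observations the model never saw during fitting.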