An Introduction to Machine Learning in R

K-Nearest Neighbor

Predicting Heart Disease in Patients

K-nearest neighbors (KNN) is an extremely simple classification and regression algorithm. It classifies a data point by the majority vote of its K nearest neighbors or, for regression, predicts the average of their values. The two important choices in KNN are how many neighbors to consider (K) and the method used to calculate distance.
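
As a rough illustration of the idea, here is a minimal base R sketch of a KNN classifier. The toy data, the function name `knn_predict`, and the choice of Euclidean distance are assumptions made for this example and are not taken from the project code.

```r
# Toy training data: two numeric features and a class label (made-up values).
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("healthy", "healthy", "healthy", "disease", "disease", "disease")

# Hypothetical helper: classify one query point by majority vote of its k nearest rows.
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Labels of the k closest rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (ties go to whichever label table() lists first)
  names(which.max(table(nearest)))
}

knn_predict(train_x, train_y, c(2, 2))  # "healthy"
knn_predict(train_x, train_y, c(6, 5))  # "disease"
```

Changing `k` or swapping in a different distance measure changes which neighbors get a vote, which is why those two choices matter most.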

Decision Trees and Random Forests

Predicting the Quality of Wine

Decision trees are perhaps the most intuitive classification models for humans to interpret. Decision tree algorithms build a tree by recursively splitting the data according to some information measure until a stopping criterion is met. The many nuances of improving performance are omitted from this discussion but can easily be found online. The important characteristic is that each node contains a subset of the data, and a node is split so as to minimize the diversity (impurity) of the classes in its children.
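
To make "minimizing diversity" concrete, the sketch below scores candidate splits with Gini impurity. Gini is only one possible information measure and is an assumption here, as are the made-up `alcohol` and `quality` values and the helper names.

```r
# Gini impurity of a set of class labels: 0 means the node is pure.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two children produced by splitting x at a threshold.
split_impurity <- function(x, labels, threshold) {
  left  <- labels[x <= threshold]
  right <- labels[x > threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Made-up example: alcohol content and a quality label for six wines.
alcohol <- c(9.0, 9.5, 10.0, 11.5, 12.0, 12.5)
quality <- c("low", "low", "low", "high", "high", "high")

gini(quality)                           # 0.5  (parent node, evenly mixed)
split_impurity(alcohol, quality, 10.0)  # 0    (both children are pure)
split_impurity(alcohol, quality, 9.5)   # 0.25 (right child is still mixed)
```

A tree-building algorithm would try many thresholds on many features and keep the split with the lowest child impurity, then repeat on each child.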

Random forests are an ensemble method for decision trees. Many trees are built, each slightly different, usually by training on a different random subset of the training data. A new data point is assigned the majority vote of all the trees (classification) or the average of their predictions (regression). Random forests usually generalize better to unseen data than a single decision tree.
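
The aggregation step can be sketched in a few lines of base R. The per-tree predictions below are invented for illustration, and no actual trees are trained here.

```r
# Hypothetical class predictions from five trees (rows) for three new wines (columns).
votes <- rbind(
  tree1 = c("good", "bad",  "good"),
  tree2 = c("good", "bad",  "bad"),
  tree3 = c("bad",  "bad",  "good"),
  tree4 = c("good", "good", "good"),
  tree5 = c("good", "bad",  "bad")
)

# Classification: each wine gets the most common label across the trees.
majority <- function(v) names(which.max(table(v)))
forest_class <- apply(votes, 2, majority)
forest_class  # "good" "bad" "good"

# Regression: hypothetical numeric predictions from three trees for two wines.
scores <- rbind(c(5.0, 6.5),
                c(6.0, 6.0),
                c(7.0, 7.0))
forest_value <- colMeans(scores)
forest_value  # 6.0 6.5
```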

Naïve Bayes

Filtering SMS Spam

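Naïve Bayes classifiers use Bayes’ rule to classify data using probabilities estimated from existing data, and these estimates can be updated as new data becomes available. The method is “naïve” because it assumes the features are independent of one another given the class. Naïve Bayes is very good at text classification even with minimal data preprocessing or wrangling.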
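
The following base R sketch shows how Bayes’ rule can be applied to word counts. The four training messages, the add-one (Laplace) smoothing, and the function name `classify_sms` are assumptions for this example rather than details of the project.

```r
# Made-up training messages, already lower-cased and space-separated.
messages <- c("win cash now", "free prize win", "lunch at noon", "see you at lunch")
labels   <- c("spam", "spam", "ham", "ham")

tokens  <- strsplit(messages, " ")
vocab   <- unique(unlist(tokens))
classes <- c("spam", "ham")

# Prior probability of each class.
priors <- c(spam = mean(labels == "spam"), ham = mean(labels == "ham"))

# P(word | class) with add-one smoothing; rows are words, columns are classes.
word_probs <- sapply(classes, function(cls) {
  words  <- unlist(tokens[labels == cls])
  counts <- table(factor(words, levels = vocab))
  setNames((as.vector(counts) + 1) / (length(words) + length(vocab)), vocab)
})

# Hypothetical helper: pick the class with the highest log posterior score.
classify_sms <- function(message) {
  words <- strsplit(message, " ")[[1]]
  words <- words[words %in% vocab]  # words never seen in training are ignored
  scores <- sapply(classes, function(cls) {
    log(priors[[cls]]) + sum(log(word_probs[words, cls]))
  })
  names(which.max(scores))
}

classify_sms("win free cash")  # "spam"
classify_sms("lunch at noon")  # "ham"
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied together.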

Artificial Neural Networks

Predicting the Edibility of Mushrooms

Artificial neural networks borrow from biology: they are loosely modeled on the neural connections in our own brains. Nodes in a network are connected to inputs and outputs. Whether or not a node’s output is activated depends on its inputs, the weights on those inputs, and the activation function. Putting many nodes together results in a network that can approximate a very wide range of functions. What is learned during training is the weight on each input.
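
A single node can be sketched as a weighted sum passed through an activation function. The sigmoid activation, the bias term, and the weight values below are assumptions chosen for illustration.

```r
# Sigmoid activation: squashes any real number into the range (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# One node: weighted sum of the inputs plus a bias, then the activation.
neuron <- function(inputs, weights, bias) {
  sigmoid(sum(inputs * weights) + bias)
}

inputs <- c(1, 0)

# The same inputs give different outputs under different (hypothetical) weights,
# which is why the weights are the part that training adjusts.
neuron(inputs, weights = c(2, -1), bias = -1)   # sigmoid(1),  about 0.731
neuron(inputs, weights = c(-2, -1), bias = -1)  # sigmoid(-3), about 0.047
```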

Support Vector Machines

Predicting the Edibility of Mushrooms

Support Vector Machines are not really “machines” but more a clever algorithm with a simple concept that has a complicated implementation. The simple concept is just a hyperplane that separates classes with the widest possible margin between the hyperplane and the closest data points. New examples are classified by which side of the hyperplane (decision boundary) they fall on. The complicated, and clever, implementation of this algorithm involves finding this hyperplane using vectors and quadratic programming. Basically, only the points closest to the decision boundary are considered (the support vectors). For data sets that are not linearly separable, the kernel trick is used to implicitly map the data into a higher-dimensional space where a separating hyperplane can be found.
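
Once a hyperplane has been found, classifying a new example is simple. The sketch below assumes a linear SVM whose weight vector `w` and offset `b` have already been learned; the values are hypothetical, and the quadratic programming step that finds them is not shown.

```r
# Hypothetical learned hyperplane: all points x with sum(w * x) + b == 0.
w <- c(1, 1)
b <- -3

# Signed score: positive on one side of the hyperplane, negative on the other.
decision <- function(x) sum(w * x) + b

# Which class maps to which side is an arbitrary choice for this example.
svm_classify <- function(x) if (decision(x) >= 0) "edible" else "poisonous"

svm_classify(c(3, 2))  # "edible"    (score  2)
svm_classify(c(0, 1))  # "poisonous" (score -2)

# With w and b scaled so the support vectors have scores of exactly +1 or -1,
# the margin between the two classes is 2 / ||w||.
margin_width <- 2 / sqrt(sum(w^2))
margin_width  # about 1.414
```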

Evaluation Metrics

Classifying Abalone Age

Choosing the correct evaluation metric is important when evaluating a model. Accuracy is a simple metric that provides quick feedback on how good a model is, but for skewed data it can be misleading. For example, if the data contains 99,990 examples of class ‘0’ and 10 examples of class ‘1’, a model that always guesses ‘0’ will be 99.99% accurate. As another example, take a neural net with 88 outputs indicating keys on a piano. If the ANN always guesses that no note is played (all zeros), it is still around 98% accurate, because only a few keys are pressed at any one time.
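
The skewed-class example can be checked directly in base R. The vectors below simply recreate the 99,990 versus 10 split described above, with a stand-in "model" that always predicts ‘0’.

```r
# 99,990 examples of class 0 and 10 examples of class 1.
actual    <- c(rep(0, 99990), rep(1, 10))
# A "model" that always guesses class 0.
predicted <- rep(0, length(actual))

accuracy <- mean(predicted == actual)
accuracy  # 0.9999

# Recall for class 1: the share of true class-1 examples the model found.
recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
recall    # 0
```

The accuracy looks excellent, yet the model never detects a single example of the rare class.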

There are many different metrics to consider, and the best one depends on the goal of the model. In cancer diagnosis, it is much better to lean towards a positive diagnosis even when there is no cancer (false positive) than to give a negative diagnosis when cancer is present (false negative). For spam detection, it is better to let some spam through if it means never blocking ham (legitimate) messages. The model should be evaluated on the metric that matters most for its purpose.
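
Precision and recall are two common metrics that capture this trade-off. The sketch below computes them from made-up labels; the helper name `confusion_counts` is an assumption for this example.

```r
# Made-up labels: 1 = positive (e.g. cancer or spam), 0 = negative.
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

# Hypothetical helper: counts for the four cells of the confusion matrix.
confusion_counts <- function(actual, predicted) {
  c(tp = sum(predicted == 1 & actual == 1),
    fp = sum(predicted == 1 & actual == 0),
    fn = sum(predicted == 0 & actual == 1),
    tn = sum(predicted == 0 & actual == 0))
}

cc <- confusion_counts(actual, predicted)

# Recall: of the true positives, how many were caught? Penalizes false negatives.
recall <- unname(cc["tp"] / (cc["tp"] + cc["fn"]))
recall     # 0.75

# Precision: of the predicted positives, how many were right? Penalizes false positives.
precision <- unname(cc["tp"] / (cc["tp"] + cc["fp"]))
precision  # 0.6
```

A cancer screening model would typically favor high recall, while a spam filter that must never block ham would favor high precision.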

This project uses the caret package to tune and train a Multi-Layer Perceptron (ANN) and an SVM in order to investigate different evaluation metrics.

K-Means and Hierarchical Cluster Analysis

Exploring Wholesale Data

K-means and hierarchical cluster analysis (HCA) are both unsupervised learning techniques that describe data with clusters.
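
Both techniques are available in base R through `kmeans()` and `hclust()` from the stats package. The sketch below runs them on six made-up observations; the column names and values are invented and do not come from the wholesale data set.

```r
# Made-up spending data: two obvious groups of three customers each.
spend <- rbind(
  c(1.0, 1.2), c(0.8, 1.0), c(1.1, 0.9),
  c(8.0, 8.5), c(8.2, 7.9), c(7.8, 8.1)
)
colnames(spend) <- c("fresh", "grocery")

# K-means: the number of clusters is chosen up front.
set.seed(42)  # k-means starts from random centers
km <- kmeans(spend, centers = 2, nstart = 10)
km$cluster

# HCA: build the full tree from pairwise distances, then cut it into 2 groups.
hc <- hclust(dist(spend), method = "complete")
hc_groups <- cutree(hc, k = 2)
hc_groups
```

Cluster numbers are arbitrary labels, so the two methods may number the same groups differently even when they agree on the grouping.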

Reinforcement Learning

Solving the Tower of Hanoi and Playing Tic-Tac-Toe

Reinforcement learning is a more diverse area of machine learning than supervised and unsupervised learning. The simple idea behind reinforcement learning comes from behaviorist psychology. That is, in a given state, an agent takes the action that it has learned will maximize its immediate or future reward. There are many different types of reinforcement learning problems and many different types of algorithms to solve them. Using the ReinforcementLearning package in R, a policy can be learned from examples of state-action-reward transitions.
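
To show what learning a policy from state-action-reward transitions can look like, here is a minimal tabular Q-learning sketch in base R. It does not use the ReinforcementLearning package; the tiny two-state environment, the learning rate `alpha`, and the discount factor `gamma` are all assumptions for this example.

```r
# Made-up transitions: moving "right" from s2 reaches the goal and pays a reward of 1.
transitions <- data.frame(
  state      = c("s1", "s1", "s2", "s2"),
  action     = c("left", "right", "left", "right"),
  reward     = c(0, 0, 0, 1),
  next_state = c("s1", "s2", "s1", "goal"),
  stringsAsFactors = FALSE
)

states  <- c("s1", "s2", "goal")
actions <- c("left", "right")

# Q[s, a] estimates the long-run reward of taking action a in state s.
Q <- matrix(0, nrow = length(states), ncol = length(actions),
            dimnames = list(states, actions))

alpha <- 0.5  # learning rate (assumed value)
gamma <- 0.9  # discount applied to future reward (assumed value)

# Replay the recorded transitions repeatedly, nudging Q toward the observed targets.
for (pass in 1:50) {
  for (i in seq_len(nrow(transitions))) {
    s  <- transitions$state[i]
    a  <- transitions$action[i]
    r  <- transitions$reward[i]
    s2 <- transitions$next_state[i]
    target  <- r + gamma * max(Q[s2, ])
    Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
  }
}

# The learned policy picks the highest-valued action in each non-terminal state.
policy <- setNames(actions[apply(Q[c("s1", "s2"), ], 1, which.max)], c("s1", "s2"))
policy  # both states choose "right"
```

The value of "right" in s1 is positive even though that step itself earns no reward, which illustrates how future reward is propagated back to earlier states.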
