# Machine Learning

The concept of machine learning (ML) was developed in the 1950s, but it only recently became accessible to the general public as computer hardware advanced. ML is a subset of artificial intelligence and has been defined as “the science of getting computers to act without being explicitly programmed” (Stanford). There are three main types of machine learning algorithms: supervised learning (labeled data), unsupervised learning (unlabeled data), and reinforcement learning (motivated by rewards). We will focus on supervised learning with artificial neural networks, a specific type of ML algorithm.

# Python review & general concepts

NumPy and matplotlib are Python libraries used in scientific computing. NumPy allows us to easily perform mathematical operations on arrays and matrices, while matplotlib provides plotting capabilities similar to MATLAB. You can install these packages individually or download the Anaconda distribution, which includes most of the starting packages for data scientists. You will also need to install OpenCV, which is used for reading, saving, and displaying images. The os module, which allows us to interface with the computer's operating system, is part of Python's standard library and needs no separate installation. Let's begin by reviewing several commands and concepts that are applicable in building machine learning algorithms. Write your code in the blank coding cells below, and press Ctrl + Enter to run. If a concept is unclear, feel free to email me at hs764@cornell.edu!

In [1]:
import numpy as np
from matplotlib import pyplot as plt
import cv2
import os

### Inputs/features

Run the cell below to load the data. Inputs, also commonly called features, of machine learning algorithms can range from numerical values to images to strings, depending on the type of problem and the selected algorithm. For example, recurrent neural networks used in natural language processing typically take text as input, while convolutional neural networks typically take images. The data below will be a set of five images.

In [2]:
images = []
for filename in os.listdir('Assignment0images'):
    image = cv2.imread(os.path.join('Assignment0images', filename))
    if image is not None:
        images.append(image)
images = np.array(images)

Check the shape of your input and think about what it means:
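As a rough illustration of how to read a shape, here is a minimal sketch on a synthetic stand-in array. The name `fake_images` and the 48x64 image size are assumptions made for this example, not properties of the assignment data. For a stack of equally sized colour images, the axes are usually (number of images, height, width, channels).

```python
import numpy as np

# Stand-in for the loaded data: five 48x64 colour images.
# The sizes are assumed purely for illustration.
fake_images = np.zeros((5, 48, 64, 3), dtype=np.uint8)

# Unpack the shape tuple into named dimensions.
n_images, height, width, n_channels = fake_images.shape
print(fake_images.shape)
```

Note that stacking into a single array like this only works cleanly when every image has the same height and width.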

Visualizing the features you feed into your algorithm is extremely useful in debugging. The quality of your inputs can have a large impact on your outputs, so preprocessing and checking your inputs can be important for increasing accuracy. We will cover preprocessing later. Display the images using matplotlib:
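One possible way to show several images side by side is sketched below. It uses randomly generated pixels (`demo_images` is a made-up name, not the assignment data) so that it runs on its own; the same loop should work on any array shaped (number of images, height, width, 3).

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in: five random 32x32 colour images (not the assignment data).
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)

# One row of subplots, one panel per image.
fig, axes = plt.subplots(1, len(demo_images), figsize=(12, 3))
for ax, img in zip(axes, demo_images):
    ax.imshow(img)   # imshow interprets the last axis as RGB
    ax.axis("off")   # hide tick marks, which carry no meaning for images
# In a notebook the figure renders when the cell finishes;
# in a plain script, call plt.show() here.
```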

The colors seem off. It turns out that cv2 reads images in BGR channel order, while matplotlib expects RGB. Convert BGR to RGB and display the images again to check:
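OpenCV stores colour channels in blue-green-red order, so the first and third channels need to be swapped before plotting. A minimal sketch using plain NumPy slicing on a one-pixel toy image (the names `bgr` and `rgb` are illustrative):

```python
import numpy as np

# A batch holding one 1x1 "image" whose pixel is pure blue
# in BGR order: (B=255, G=0, R=0).
bgr = np.array([[[[255, 0, 0]]]], dtype=np.uint8)  # shape (1, 1, 1, 3)

# Reversing the last axis swaps the first and third channels
# for every pixel of every image in the batch.
rgb = bgr[..., ::-1]
print(rgb[0, 0, 0].tolist())  # [0, 0, 255]
```

For a single image, `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` performs the same conversion.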

Check the data type of your input:

Notice that cv2 loads images as uint8 (8-bit unsigned integer), which can only hold values from 0 to 255 and is therefore not well suited to mathematical operations. Convert the data type of your input to float32:
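To see why 8-bit unsigned integers can be awkward for arithmetic, consider this small sketch on two made-up pixel values. Adding uint8 arrays wraps around modulo 256, while the float32 copy gives the expected sums.

```python
import numpy as np

pixels = np.array([200, 100], dtype=np.uint8)
print(pixels.dtype)  # uint8

# uint8 arithmetic silently wraps around: 200 + 200 = 400 -> 400 % 256 = 144.
wrapped = pixels + pixels

# astype returns a converted copy; the original array is unchanged.
as_float = pixels.astype(np.float32)
summed = as_float + as_float

print(wrapped.tolist())  # [144, 200]
print(summed.tolist())   # [400.0, 200.0]
```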

### Outputs/labels

Outputs of machine learning algorithms can also have different forms: numerical values, images, categories, strings, etc., and they may differ from the form of your input. Outputs, also called labels in classification problems, for the input images are provided below.

In [None]:
labels = ['cat','cat','dog','horse','whale']

The labels above are categorical variables. They can be rewritten with one-hot encoding, which is easier and more efficient to feed into an algorithm. Rewrite the labels above with one-hot encoding:
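In one-hot encoding, each label becomes a vector with a single 1 in the position of its class and 0 everywhere else. The sketch below is one of several ways to build it. The alphabetical class ordering and the helper names `classes`, `class_to_index`, and `one_hot` are choices made for this example.

```python
import numpy as np

labels = ['cat', 'cat', 'dog', 'horse', 'whale']

# Assign each distinct class an index (alphabetical order is an arbitrary choice).
classes = sorted(set(labels))                       # ['cat', 'dog', 'horse', 'whale']
class_to_index = {name: i for i, name in enumerate(classes)}
indices = np.array([class_to_index[name] for name in labels])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the class indices yields one row per label.
one_hot = np.eye(len(classes), dtype=np.float32)[indices]
print(one_hot.shape)  # (5, 4)
```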

### Shuffling

Shuffling your dataset may be important for certain algorithms, particularly in classification problems. Shuffling prevents the algorithm from memorizing the order of your inputs and thus prevents predictions that depend on that particular sequencing. Make sure that each output still corresponds to its input after shuffling. Write code to shuffle your features and labels together.
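A common way to keep features and labels aligned is to draw one random permutation of the indices and apply it to both arrays. The sketch below uses toy stand-ins (`demo_features` and `demo_labels` are made-up names) and a fixed seed, chosen here only to make the run repeatable.

```python
import numpy as np

# Toy stand-ins: feature 0 goes with 'a', 10 with 'b', and so on.
demo_features = np.arange(5) * 10            # [0, 10, 20, 30, 40]
demo_labels = np.array(['a', 'b', 'c', 'd', 'e'])

# One permutation of the indices, applied to both arrays,
# so every feature keeps its original label.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(demo_features))
shuffled_features = demo_features[perm]
shuffled_labels = demo_labels[perm]

for feature, label in zip(shuffled_features, shuffled_labels):
    print(feature, label)
```

Shuffling the two arrays independently (for example, calling a shuffle function on each) would break the pairing, which is why a shared permutation is used here.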

Display your shuffled images with their labels:

Notice that this produces an error. Try to figure out what's wrong and display again:

# Installing TensorFlow & Resources

There are several machine learning frameworks available, with the most popular being Caffe, TensorFlow, and PyTorch. Each has its own pros and cons. The one that we will use is TensorFlow by Google. To get started, install TensorFlow by following the instructions here: https://www.tensorflow.org/install/

Watch the series of videos by 3Blue1Brown on neural networks to gain a more fundamental understanding of gradient descent and backpropagation: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi