<a href="https://colab.research.google.com/github/TatianaO8/AI/blob/master/HW5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 5

Summarize and describe the different concepts/methods/algorithms that you have learned in this course.

Use a Colab notebook. Make sure that you organize the material logically by using sections/subsections. Also, use code cell to include code snippets.

I suggest that you group everything into six categories:

1. General concepts (for instance, what is artificial intelligence, machine learning, deep learning)

2. Basic concepts (for instance, here you can talk about linear regression, logistic regression, gradients, gradient descent)

3. Building a model (for instance, here you can talk about the structure of a convnet, what its components are etc.)

4. Compiling a model (for instance, you can talk here about optimizers, learning rate etc.)

5. Training a model (for instance, you can talk about overfitting/underfitting)

6. Finetuning a pretrained model (describe how you proceed)



## 1. General Concepts:

Deep learning is a subset of machine learning, and machine learning is a subset of artificial intelligence (AI).

Artificial intelligence is the branch of computer science that aims to build systems able to perform tasks that normally require human intelligence.

Machine learning (ML) is the process by which a computer learns from the examples/data it is exposed to. Classical, rule-based AI follows rules that a human wrote by hand; ML relies less on hand-written rules because the model infers them from data, and it generally improves as more data is provided. For each input, the trained model produces an output.

In practice, ML means training a model on a data set so that it makes useful predictions, then using that model to predict outputs for data it has not seen.

Deep learning uses neural networks with many layers that learn increasingly abstract representations of the data. It can be applied to labeled data (supervised) as well as to unstructured, unlabeled data (unsupervised).

Examples of AI: speech recognition, decision-making, translation, etc.


### Machine Learning

Three types of ML:

*   Supervised learning - the model is provided with labeled training data
> *   Predict a label from features
> > *   Features are the input variables that describe each example; the label is the value you want the model to predict from them
> *   In supervised machine learning, you feed the features and their corresponding labels into an algorithm in a process called training
> > *   During training, the algorithm gradually determines the relationship between features and their corresponding labels. This relationship is called the model.
> *   To tie it all together, supervised machine learning finds patterns between data and labels that can be expressed mathematically as functions
> *   Given an input feature, you are telling the system what the expected output label is; in that sense you are supervising the training. The ML system learns patterns from this labeled data.
> *   After training, the model predicts labels for new, unseen features using the relationship it has learned


*   Unsupervised learning – the model has no hints about how to categorize each piece of data
> * Identify meaningful patterns in the data
> * The machine must learn from an unlabeled data set and infer its own rules for doing so

* Reinforcement learning – you don't collect examples with labels
> * You set up a model (often called an agent in RL) with an environment such as a game, and you reward it for avoiding the "game over" screen
> * The agent receives a reward r for each action it takes in each state, then moves on to the next state

### Deep Learning

A layer is a data-processing module that takes as input one or more tensors and that outputs one or more tensors. 
>* Densely connected layers or fully connected layers - used for simple vector data stored in 2D tensors of shape (samples, features)
>* Recurrent layers - used for sequence data stored in 3D tensors of shape (samples, timesteps, features)
>>* The most commonly used is the long short-term memory (LSTM) layer
>* Convolutional layers - used for image data stored in 4D tensors of shape (samples, height, width, channels)

Deep learning models in Keras are built by stacking and combining these layers to form useful data-processing pipelines.
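
To make the idea of a layer concrete, here is a minimal sketch of what a densely connected layer computes for one sample: each output unit is a weighted sum of the input features plus a bias, passed through an activation. The names `dense_forward`, `sample`, `W`, and `b`, and all the numbers, are illustrative assumptions and not Keras internals.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def dense_forward(inputs, weights, biases, activation=relu):
    """Forward pass of one dense layer for a single sample.

    inputs  : list of n_in feature values
    weights : n_in x n_out nested list (weights[j][k] links input j to unit k)
    biases  : list of n_out bias values
    """
    outputs = []
    for k in range(len(biases)):
        # weighted sum of all inputs for output unit k, plus its bias
        z = biases[k] + sum(inputs[j] * weights[j][k] for j in range(len(inputs)))
        outputs.append(activation(z))
    return outputs


# Illustrative values (assumed): 3 input features, 2 output units
sample = [1.0, 2.0, 3.0]
W = [[0.5, -1.0],
     [0.25, 0.0],
     [-0.5, 1.0]]
b = [1.0, -1.0]

layer_output = dense_forward(sample, W, b)
print(layer_output)
```

A Keras `Dense` layer performs the same kind of computation on a whole batch of samples at once using tensor operations.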





In [0]:
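# Showing a simple Keras model
# It uses a dense layer
# Assumes train_xs and train_ys are NumPy arrays defined earlier in the notebook

# Colab-only magic that selects TensorFlow 2.x; remove it when running elsewhere
%tensorflow_version 2.x
from tensorflow.keras import models
from tensorflow.keras import layers

# one input feature -> one output unit, i.e. a linear regression model
network = models.Sequential()
network.add(layers.Dense(1, input_shape=(1,)))

network.compile(optimizer='sgd', loss='mse', metrics=['mse'])

network.fit(train_xs, train_ys, epochs=10, batch_size=8)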

## 2. Basic Concepts

A regression model predicts continuous values. 

A classification model predicts discrete values.

When training a model, the goal is to find the set of weights and biases that gives low loss, on average, across all examples. (Loss is the penalty for a bad prediction. Thus, a perfect prediction has a loss of 0 and anything else has a loss above that.)



### Linear Regression

In linear regression, you draw a straight line to show the relationship between variables. It might not necessarily pass through every point, but it summarizes the relationship clearly.

The equation for a line in ML: $\hat{y} = b + \sum_j w_j x_j$, where
*   $\hat{y}$ is the predicted label
*   $b$ is the bias (also written $w_0$)
*   $w_j$ is the weight of feature $j = 1, 2, \dots$
*   $x_j$ is the $j$th feature (input)

It uses a loss function called mean squared loss.
Mean squared loss: $L = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} (y_i - \hat{y}_i)^2$, where $m$ is the number of input examples

The model iteratively adjusts the weights and bias until the loss stops changing or changes only slightly.
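
As a small worked example of the line equation and the mean squared loss above, the sketch below computes predictions and the loss for three examples. The data, weights, and bias are illustrative assumptions chosen so the arithmetic is easy to follow.

```python
def predict(x, w, b):
    """Linear model: y_hat = b + sum_j w_j * x_j."""
    return b + sum(w_j * x_j for w_j, x_j in zip(w, x))


def mean_squared_loss(ys, y_hats):
    """Mean squared loss, including the 1/2 factor used in the formula above."""
    m = len(ys)
    return sum(0.5 * (y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)) / m


# Illustrative data (assumed): three examples with two features each
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 5.0, 8.0]
w = [2.0, 1.0]
b = 0.5

y_hat = [predict(x, w, b) for x in X]   # every prediction is 0.5 below its label
loss = mean_squared_loss(y, y_hat)
print(y_hat, loss)
```

Each prediction misses its label by 0.5, so every example contributes 0.5 * 0.5^2 = 0.125 and the mean is also 0.125.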

### Gradient Descent
To find these weights more efficiently than by trial and error, we use an algorithm called gradient descent.

The gradient ∇L is a vector whose entries are the partial derivatives of the loss function with respect to each weight.
> *  Has both a direction and a magnitude.
> * Points in the direction of steepest increase of the loss, so gradient descent takes a step in the direction of the negative gradient to reduce the loss
> * Gradient descent then updates the weights:
> > w <- w - α∇L       where α = learning rate


A batch is the set of examples used to calculate the gradient in a single iteration; the batch size is the number of examples in it.
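
The update rule can be sketched as full-batch gradient descent, where every example contributes to each gradient. This is a minimal illustration for a one-feature linear model; the function name `gradient_descent`, the data, and the default `lr` and `steps` are assumptions made for the example.

```python
def gradient_descent(xs, ys, lr=0.1, steps=1000):
    """Fit y_hat = w * x + b by full-batch gradient descent on the mean squared loss."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # partial derivatives of L = (1/m) * sum(0.5 * (y - y_hat)^2)
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / m
        # step in the direction of the negative gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Illustrative data (assumed), generated from the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(xs, ys)
print(w, b)
```

Because the data lie exactly on a line, the learned `w` and `b` approach the slope and intercept that generated them.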


In [0]:
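import numpy as np

# Assumes train_xs and train_ys are 1D NumPy arrays of equal length defined earlier

# number of epochs
epochs = 10
# learning rate
lr = 0.01

# initial value for weight w and bias b
w = np.random.randn(1)
b = np.zeros(1)

for epoch in range(epochs):
  # one weight update per training example
  for i in range(len(train_xs)):
    y_pred = w * train_xs[i] + b

    # gradients of the squared loss 0.5 * (y_pred - y)^2 with respect to w and b
    grad_w = (y_pred - train_ys[i]) * train_xs[i]
    grad_b = (y_pred - train_ys[i])

    # step against the gradient
    w -= lr * grad_w
    b -= lr * grad_b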



#### Stochastic Gradient Descent (SGD)

SGD uses a single example per iteration (batch size of 1), chosen at random.
Because each gradient is estimated from only one example, the updates are very noisy.

#### Mini-batch Stochastic Gradient Descent (mini-batch SGD)

This is the middle ground between full-batch gradient descent and SGD. In mini-batch SGD, 10 to 10,000 random examples are chosen per iteration.
The amount of noise also sits in between: it is lower than in SGD, while each iteration is still cheaper than a full-batch iteration.
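
The three variants differ only in the batch size, which the following sketch makes explicit. It is a plain-Python illustration, not the notebook's NumPy implementation; `minibatch_sgd`, its defaults, and the noise-free data are assumptions made for the example.

```python
import random


def minibatch_sgd(xs, ys, batch_size, lr=0.05, epochs=2000, seed=0):
    """Fit y_hat = w * x + b with mini-batch SGD.

    batch_size=1 gives plain SGD; batch_size=len(xs) gives full-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # average the gradient over the examples in this batch
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Illustrative, noise-free data (assumed) on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

w_sgd, b_sgd = minibatch_sgd(xs, ys, batch_size=1)          # SGD
w_mini, b_mini = minibatch_sgd(xs, ys, batch_size=2)        # mini-batch SGD
w_full, b_full = minibatch_sgd(xs, ys, batch_size=len(xs))  # full batch
```

With the same number of epochs, a smaller batch size means more (but noisier) weight updates per epoch.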

In [0]:
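# Snippet of SGD
# Assumes X_b (inputs with a bias column), y, weight, lr, epochs, m = len(X_b),
# X_new, X_new_b, the list weight_path_sgd, NumPy as np, and matplotlib.pyplot as plt
# are already defined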

weight_path_sgd.append(weight)
for epoch in range(epochs):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    
    for i in range(m):           
        xi = X_b_shuffled[i:i+1]
        yi = y_shuffled[i:i+1]
        gradient = xi.T.dot(xi.dot(weight) - yi)
        weight = weight - lr * gradient
        weight_path_sgd.append(weight)
        
        y_predict = X_new_b.dot(weight)                    
        plt.plot(X_new, y_predict, "b-")    

In [0]:
# Snippet of mini-batch SGD
# Assumes X_b, y, weight, lr, epochs, batch_size, m = len(X_b), and the list
# weight_path_mgd are already defined

weight_path_mgd.append(weight)
for epoch in range(epochs):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, batch_size):
        xi = X_b_shuffled[i:i+batch_size]
        yi = y_shuffled[i:i+batch_size]
        # average over the actual batch length, since the last batch may be smaller
        gradient = xi.T.dot(xi.dot(weight) - yi) / len(xi)
        weight = weight - lr * gradient
        weight_path_mgd.append(weight)

### Logistic Regression

Logistic regression is used for binary classification problems, meaning there are only two classes.
*   Uses the sigmoid activation function, which guarantees an output between 0 and 1:
>  $a = \sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = \sum_j w_j x_j + b$
*   Most of the time uses binary cross-entropy loss: $L = -y \log(a) - (1 - y) \log(1 - a)$

Thus, if the sigmoid output is less than 0.5, the example is labeled as class 0. On the other hand, if the sigmoid output is greater than or equal to 0.5, the example is labeled as class 1.
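
The pieces above (sigmoid, the 0.5 threshold, and the binary cross-entropy loss) can be traced for a single example as follows. The weights, bias, and input are illustrative assumptions, and the natural logarithm is used for the loss.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y, a):
    """Loss for one example with true label y (0 or 1) and predicted probability a."""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)


def predict_class(a, threshold=0.5):
    """Class 1 if the probability reaches the threshold, otherwise class 0."""
    return 1 if a >= threshold else 0


# Illustrative weights, bias, and input (assumed)
w = [1.5, -2.0]
b = 0.25
x = [2.0, 1.0]

z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b  # 3.0 - 2.0 + 0.25 = 1.25
a = sigmoid(z)
label = predict_class(a)
loss_if_true_label_is_1 = binary_cross_entropy(1, a)
loss_if_true_label_is_0 = binary_cross_entropy(0, a)
```

Since `z` is positive, the sigmoid output is above 0.5 and the example is assigned to class 1; the loss is small if the true label is 1 and larger if it is 0.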


In [0]:
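# Logistic Regression in Keras
# Assumes data_training (shape: samples x 2) and labels_training are defined earlier
import tensorflow as tf

def build_and_compile_model():
    # build model: a single sigmoid unit on two input features
    model = tf.keras.models.Sequential()

    layer = tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(2,))
    model.add(layer)

    # compile model (the optimizer argument is learning_rate; lr is deprecated)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
                loss='binary_crossentropy',
                metrics=['accuracy'])
    
    # train, then read back the learned weights and bias
    model.fit(data_training, labels_training, epochs=10, batch_size=512)
    weights, bias = layer.get_weights()
    
    return model, weights, bias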

In [0]:
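# Logistic Regression
# Assumes X, y, weights, b, lr, epochs, m = len(X), data, labels, initial_weight,
# initial_b, data_test, labels_test, and the plotting helper
# display_random_data_with_lines are defined earlier in the notebook
import math
import numpy as np

def sigmoidFunction(z):
  return 1/(1 + math.exp(-z))


for epoch in range(epochs):
  shuffled_indices = np.random.permutation(m)

  # visit the examples in a new random order every epoch
  for i in shuffled_indices:
    z = X[i].dot(weights) + b
    a = sigmoidFunction(z)
    gradient_w = (a - y[i]) * X[i]
    gradient_b = (a-y[i])   
    weights = weights - lr * gradient_w
    b = b - lr * gradient_b

display_random_data_with_lines(labels, data, weights, b, initial_weight, initial_b)

# compute the binary cross entropy and accuracy on the test set
binary_cross_entropy_loss = 0
correct_count = 0

for i in range(len(data_test)):
  a = sigmoidFunction(data_test[i].dot(weights) + b)
  # second term uses log(1 - a), matching L = -y log(a) - (1 - y) log(1 - a)
  # (log base 2 is kept from the original; Keras uses the natural log)
  binary_cross_entropy_loss += -labels_test[i]*math.log(a, 2) - (1-labels_test[i])*math.log(1 - a, 2)

  if (a < .5 and labels_test[i] == 0) or (a >= .5 and labels_test[i] == 1):
    correct_count += 1

# average over the test set
binary_cross_entropy_loss /= len(data_test)
accuracy = correct_count / len(data_test)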


## 3. Building a Model

### Convnet

Convnet is another name for a convolutional neural network (CNN).
CNNs are used when building models for image classification because they can extract higher-level representations from pixels. A CNN differs from earlier approaches in that it does not need hand-crafted features such as textures or shapes as input; it takes in the raw pixels, learns how to extract features from them, and learns to predict what the image is.

*   Convolution 
>* extracts tiles of the input feature map and applies a filter to each tile to compute new features, producing an output feature map
>>* filters are matrices 
>>* during training, the CNN learns the optimal values for the filter matrices to enable it to extract meaningful features from the input feature map
>>* the more filters are applied to the input feature map, the more features are extracted, but at the cost of longer training time
>>* each additional filter adds less value than the previous one, so the goal is to use the smallest number of filters that still extracts the features needed
>* the filter slides over the input feature map one pixel at a time (a stride of 1)
>* padding (adding rows and columns of zeros around the input) can be used to preserve the spatial size of the feature map
>>* In the early layers of our network, we want to preserve as much information about the original input volume as possible so that we can extract those low-level features 
*   ReLU - stands for rectified linear unit; it is a piecewise linear activation function that outputs the input directly if it is positive and outputs zero otherwise: max(0, x)
*   MaxPooling - an operation that downsamples the feature map, reducing its height and width while still preserving the most critical feature information
> It also extracts tiles from a feature map, except that we take the maximum value of each tile and write it into the new feature map
>> tiles are moved by a defined stride, a scalar value
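
The effect of padding and stride on the size of the output feature map follows a simple formula, sketched below together with a zero-padding helper. The function names `conv_output_size` and `zero_pad` are illustrative assumptions; square inputs and filters are assumed throughout.

```python
def conv_output_size(m, s, padding=0, stride=1):
    """Side length of the output feature map for an m x m input and an s x s filter."""
    return (m + 2 * padding - s) // stride + 1


def zero_pad(matrix, p):
    """Surround a 2D list with p rows and columns of zeros on every side."""
    width = len(matrix[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]              # top border
    for row in matrix:
        padded.append([0] * p + list(row) + [0] * p)      # left and right borders
    padded.extend([0] * width for _ in range(p))          # bottom border
    return padded


no_padding = conv_output_size(28, 5)               # the map shrinks
same_size = conv_output_size(28, 5, padding=2)     # padding preserves the size
strided = conv_output_size(5, 3, stride=2)         # a larger stride shrinks it faster
padded_example = zero_pad([[1, 2], [3, 4]], 1)
```

Without padding, a 5x5 filter removes 4 pixels from each spatial dimension; padding by 2 on every side compensates for that exactly.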


How it works:
1. The CNN receives an input feature map, a 3D tensor (height, width, channels)
2. It uses convolution, ReLU, and pooling for feature extraction
3. The extracted features pass through one or more fully connected layers
4. The last layer performs classification using the softmax activation function, which outputs a probability between 0 and 1 for each class label
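
The softmax activation used in the last step can be sketched in a few lines. The raw scores below are illustrative assumptions; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # illustrative raw outputs for three classes (assumed)
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # index of the most probable class
```

The class with the highest raw score always receives the highest probability, so the prediction is the index of the largest entry.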


A basic convnet is made up of a stack of Conv2D and MaxPooling2D layers. The output of each of these layers is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as you go deeper in the network. The next step is to flatten the last output tensor and feed it into a densely connected classifier network.
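
To see how the width and height shrink, the sketch below traces the tensor shape through an assumed stack of two 5x5 convolutions, each followed by 2x2 max pooling, on a 28x28x1 input. The helper `trace_shapes` and its tuple format are hypothetical; it assumes no padding, stride 1 for convolutions, and a pooling stride equal to the pool size.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, channels) shape after each layer.

    layers is a list of ("conv", filters, kernel_size) or ("pool", pool_size) tuples.
    """
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for layer in layers:
        if layer[0] == "conv":
            _, filters, k = layer
            # a k x k filter without padding removes k - 1 pixels per dimension
            h, w, c = h - k + 1, w - k + 1, filters
        elif layer[0] == "pool":
            _, p = layer
            # pooling divides height and width by the pool size
            h, w = h // p, w // p
        else:
            raise ValueError(f"unknown layer type: {layer[0]}")
        shapes.append((h, w, c))
    return shapes


stack = [("conv", 32, 5), ("pool", 2), ("conv", 64, 5), ("pool", 2)]
shapes = trace_shapes((28, 28, 1), stack)
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]  # size after flattening
```

The number of channels is set by the number of filters in each convolution, while height and width shrink at every layer.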

Convnets have two interesting properties: the patterns they learn are translation invariant, and they can learn spatial hierarchies of patterns. This gives them an advantage over a densely connected network. A densely connected network has to relearn a pattern if it appears in a new location, whereas a convnet learns it in a more general form. Thus, convnets need fewer samples to learn representations that generalize well.



In [0]:
# Convolution 2D (no padding, stride 1; assumes square NumPy arrays)
import numpy as np

def conv2d(input_mat, kernel_mat):
  s = kernel_mat.shape[0]
  m = input_mat.shape[0]

  # Check for invalid convolution
  if s > m:
    raise Exception('Invalid convolution inputs. The kernel matrix size is greater than the input matrix size.')

  # Calculate the size of the output matrix
  n = m - s + 1

  # Create an empty output matrix
  output_mat = np.zeros((n,n))
  
  for i in np.arange(n):
    for j in np.arange(n):
      for r in np.arange(s):
        for p in np.arange(s): 
          output_mat[i][j] += kernel_mat[r][p] * input_mat[i+r][j+p]
  

  return output_mat

In [0]:
# Maxpooling 2D

def maxpooling2d(input_mat, s):
  m = input_mat.shape[0]

  if s > m:
    raise Exception('Invalid maxpooling inputs. s is larger than the input_matrix size.')
  
  n = int(m/s)
  output_mat = np.zeros((n,n))

  for i in np.arange(n):
    for j in np.arange(n):
      output_mat[i][j] = input_mat[i*s][j*s]
      for r in np.arange(s):
        for p in np.arange(s): 
          output_mat[i][j] = max(output_mat[i][j], input_mat[r+i*s][p+j*s])
  
  return output_mat

In [0]:
# Convolutional Neural Network
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers

# two convolution + pooling stages, then a softmax classifier over 10 classes
model = models.Sequential()
model.add(layers.Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (5, 5), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))

## 4. Compiling a Model

### Learning Rate

The learning rate is symbolized as α and is a scalar value.
It is also referred to as the step size. It is used in gradient descent to scale each weight update, as described in the Gradient Descent section.

The learning rate is a hyperparameter, meaning it is not learned by the model. It is something we set before training.

If the learning rate is too small, the model might take a long time to learn. On the other hand, if the learning rate is too big, the model might never reach the minimum of the loss because each step overshoots it.
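
The effect of the learning rate can be seen on a toy problem: minimizing f(w) = w^2, whose gradient is 2w and whose minimum is at w = 0. The function and the three learning rates below are illustrative assumptions.

```python
def run_gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from w0 and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w


too_small = run_gradient_descent(lr=0.001)   # barely moves away from the start
reasonable = run_gradient_descent(lr=0.1)    # gets very close to the minimum
too_large = run_gradient_descent(lr=1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the largest rate, each update jumps past the minimum to a point farther away than before, so the weight grows in magnitude instead of converging.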



### Epoch

An epoch in machine learning is one full pass over the training dataset. The number of epochs is how many such passes are made during training.

The number of epochs is also a hyperparameter and is set when you train your model.

(You can see these hyperparameters in some of the code shown here.)
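
The relationship between epochs, batch size, and the number of weight updates can be written down directly. The helper names and the example numbers are illustrative assumptions.

```python
import math


def updates_per_epoch(num_examples, batch_size):
    """Number of batches (and therefore weight updates) in one pass over the data."""
    return math.ceil(num_examples / batch_size)


def total_updates(num_examples, batch_size, epochs):
    """Total number of weight updates over the whole training run."""
    return epochs * updates_per_epoch(num_examples, batch_size)


# Illustrative numbers (assumed): 80 examples, batches of 8, 10 epochs
per_epoch = updates_per_epoch(80, 8)
overall = total_updates(80, 8, 10)

# When the batch size does not divide the data evenly, the last batch is smaller
uneven = updates_per_epoch(1000, 512)
```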

### Optimizers

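An optimizer determines how the model's weights are updated during training, based on the gradient of the loss.

An example is gradient descent, which was explained above. RMSprop, used in the Keras logistic regression code, is another.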

### Loss function

A loss function is the quantity that will be minimized during training.

There are several loss functions, but each problem type has a standard pairing of loss function and last-layer activation:

*   binary classification -> binary_crossentropy, activation: sigmoid
*   multiclass, single-label classification -> categorical_crossentropy, activation: softmax
*   multiclass, multilabel classification -> binary_crossentropy, activation: sigmoid
*   regression to arbitrary values -> mse, activation: none
*   regression to values in [0,1] -> mse or binary_crossentropy, activation: sigmoid
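
Binary cross-entropy and mse are written out earlier in these notes; for completeness, here is a minimal sketch of categorical cross-entropy for a single example. The one-hot label and the two probability vectors are illustrative assumptions, and the natural logarithm is used.

```python
import math


def categorical_cross_entropy(one_hot_label, probs):
    """Loss for one example: -sum_k y_k * log(p_k), with y a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)


label = [0, 1, 0]                    # the true class is class 1
good_prediction = [0.1, 0.8, 0.1]    # most probability on the true class
bad_prediction = [0.7, 0.2, 0.1]     # most probability on a wrong class

loss_good = categorical_cross_entropy(label, good_prediction)
loss_bad = categorical_cross_entropy(label, bad_prediction)
```

Only the probability assigned to the true class enters the loss, so the loss shrinks toward 0 as that probability approaches 1.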



## 5. Training a Model

### Overfitting/Underfitting

Overfitting is when a model fits the training data too closely, so it is bad at predicting new data.
*   Model has low loss on training data, high loss on test data
* Model is probably more complex than it has to be and doesn't generalize well

Underfitting is when a model does not fit the training data well enough, so it can't capture the trend in the data
*   Model has high loss on both training data and test data




### Training/Test/Validation Sets

To train a model, we need to feed it data. This is where training, validation, and testing data come in.

To detect overfitting, we want to evaluate the model on values it has never seen before. Thus, we split the data into at least two sets: a training set and a testing set. The training set is the data used to train the model, and the testing set is the data used to make predictions with the trained model to see how good a model it is.


To tune the model without overfitting to the testing set, we use a validation set as well. The validation set is used to evaluate the model during development and to tune hyperparameters accordingly; once the model performs well on it, the model is evaluated a final time on the testing set.

There are a few ways to split this data. 
*   Simple hold-out
> * Split data into training set, validation set, and testing set
> * Flaw: if the validation set is too small, it might not represent the data as a whole
*   K-fold validation
> * Used when you have little data
> * Split your data into K partitions of equal size 
> * For each partition i, train a model on the remaining K - 1 partitions, and evaluate it on partition i
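
The K-fold procedure can be sketched as a function that produces the index split for each fold. The name `k_fold_splits` is hypothetical, and letting the last fold absorb leftover samples when the data does not divide evenly is an assumption of this sketch.

```python
def k_fold_splits(num_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # assumption: the last fold absorbs any leftover samples
        stop = start + fold_size if fold < k - 1 else num_samples
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, num_samples))
        yield train, validation


splits = list(k_fold_splits(num_samples=10, k=3))
for train_idx, val_idx in splits:
    # a model would be trained on train_idx and evaluated on val_idx here
    print(len(train_idx), val_idx)
```

The validation score reported for the model is then the average of the K per-fold scores. Shuffling the data before splitting is usually advisable when its order is not random.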






training set -> validation set -> testing set


In [0]:
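# Simple hold-out validation (pseudocode)
# Assumes data, test_data, num_validation_samples, and a get_model() helper whose
# models expose train() and evaluate() are defined elsewhere
validation_data = data[:num_validation_samples]
training_data = data[num_validation_samples:]

# train on the training set and evaluate on the validation set
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# ... tune the model, retrain, evaluate, and tune again as needed ...

# once the hyperparameters are tuned, train a fresh model on all non-test data
model = get_model()
model.train(data)

test_score = model.evaluate(test_data)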

## 6. Fine-tuning a Pretrained Model

To understand what fine-tuning is, we have to understand what a pretrained model is and what ‘freezing’ means.

Using a pretrained model is a common and highly effective approach to deep learning on small image datasets. A pretrained model is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. Some pretrained models are VGG16, ResNet50, Inception V3, etc. Freezing means preventing a layer's weights from being updated during training. Thus, we say we freeze a layer or a set of layers. We do this because otherwise the large updates coming from the newly added, randomly initialized layers would modify, and effectively destroy, what the pretrained layers learned before. To freeze a network, we set its trainable attribute to False.

Now we can finally understand what fine-tuning means. Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model and these top layers. It is called fine-tuning because it slightly adjusts the more abstract representations of the model being reused in order to make them more relevant to the problem at hand.

The steps for fine-tuning a network, according to Rosebrock, are:


1.   Add your custom network on top of an already-trained base network.
2.   Freeze the base network
3. Train the part you added.
4. Unfreeze some layers in the base network
5. Jointly train both these layers and the part you added.



In [0]:
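# Unfreeze conv_base from the layer named 'conv2d_4' onward; keep earlier layers frozen
# Assumes conv_base is the pretrained convolutional base loaded earlier
conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
  if layer.name == 'conv2d_4':
    set_trainable = True
  # layers before 'conv2d_4' stay frozen; that layer and all later ones are trainable
  layer.trainable = set_trainable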