<a href="https://colab.research.google.com/github/NSkuhala/Test1/blob/master/MLC7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***
Chapter 7 includes Black Box Methods - Neural Networks and Support Vector Machines. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.
***
In machine learning, some methods may at first glance appear to be magic. In engineering, these are referred to as black box processes: they transform an input into an output in a seemingly magical way.
***
First, we will look at artificial neural networks (ANNs). ANNs model the relationship between a set of input signals and an output signal. These models are derived from our understanding of the human brain and are applied to several real-world problems, such as
*   Speech, handwriting, and image recognition programs
*   Automation of smart devices, such as an office building's environmental controls, self-driving cars, and self-piloting drones
*   Sophisticated models of weather and climate patterns

Conceptual model of human brain activity:
Incoming signals are received by the cell's dendrites through a process that allows the impulse to be weighted according to its relative importance or frequency. The cell body accumulates the incoming signals until a threshold is reached. At this point the cell fires an output signal, which is transmitted down the axon. Finally, the electric signal is processed again and passed on as the output.
Applied to machine learning, the input signals become the *x* variables and the output signal becomes the *y* variable. For each dendrite, the input signal is weighted according to its importance. The input signals are summed in the cell body, and the sum is then passed on according to an activation function denoted by *f*.

$$y(x)=f(\sum_{i=1}^{n} w_{i}x_{i})$$
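
As a minimal sketch of the formula above, the following base R function computes the output of a single neuron. The name `neuron_output`, the example inputs and weights, and the choice of the logistic sigmoid for *f* are assumptions made purely for illustration.

```r
# Hypothetical helper: output of one artificial neuron, y = f(sum(w_i * x_i)).
neuron_output <- function(x, w, f) {
  stopifnot(length(x) == length(w))
  f(sum(w * x))
}

# Logistic sigmoid, assumed here as the activation function f.
sigmoid <- function(z) 1 / (1 + exp(-z))

x <- c(0.5, 0.2, 0.9)   # illustrative input signals
w <- c(0.4, -0.6, 0.3)  # illustrative connection weights

y <- neuron_output(x, w, sigmoid)
print(y)
```

Passing the activation function as an argument keeps the weighted sum separate from the choice of *f*, so other activation functions can be tried without changing the helper.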

ANNs can differ with respect to the following characteristics:


*   **Activation function**, transforms a neuron's net input signal into a single output signal
*   **Network topology**, describes the number of neurons and number of layers in the model and the manner in which they are connected
*   **Training algorithm**, specifies how connection weights are set in order to inhibit or excite neurons in proportion to the input signal

***
**Activation function**

This is the mechanism by which the artificial neuron processes incoming information and passes it throughout the network. If the summed input signals exceed the firing threshold, the neuron passes on the signal; otherwise, it doesn't. There are numerous alternative activation functions (linear, saturated linear, hyperbolic tangent, Gaussian, sigmoid); the sigmoid is the most commonly used. The primary detail that differentiates these activation functions is the range of the output signal.
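
To make the point about output ranges concrete, here is a small base R sketch that evaluates a few activation functions on the same inputs and reports the smallest and largest value each one produces. The function forms (in particular `exp(-z^2)` for the Gaussian) and the input grid are assumptions chosen for illustration, not definitions taken from any package.

```r
# Illustrative activation functions (names and forms are assumptions).
activations <- list(
  linear   = function(z) z,
  sigmoid  = function(z) 1 / (1 + exp(-z)),
  tanh     = tanh,
  gaussian = function(z) exp(-z^2)
)

z <- seq(-10, 10, by = 0.5)  # common grid of net input values

# Smallest and largest output of each function over the same inputs.
ranges <- sapply(activations, function(f) range(f(z)))
rownames(ranges) <- c("min", "max")
print(round(ranges, 4))
```

The linear function simply passes its input through, while the sigmoid stays between 0 and 1 and the hyperbolic tangent between -1 and 1.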

**Network topology**

A neural network's capacity to learn is rooted in its topology, which can differ in three key characteristics:
*   The number of layers
*   Whether information in the network is allowed to travel backward
*   The number of nodes within each layer of the network

The topology determines the complexity of the tasks that can be learned by the network.

A bit of terminology is needed to distinguish artificial neurons based on their position in the network.

The input nodes receive unprocessed signals directly from the input data, with one input node for each feature in the dataset. Each feature's value is transformed by the corresponding node's activation function and received by the output node, which then uses its own activation function to generate a final prediction.
Input and output nodes are arranged in groups known as layers. The more complex the network, the more layers are added between them; these are called hidden layers. A network with multiple hidden layers is called a deep neural network (DNN).

*Direction of information travel*

In feedforward networks, the input signal is fed continuously in one direction, from the input layer to the output layer.
In contrast, feedback networks allow signals to travel backward using loops. This allows extremely complex patterns to be learned.


*Number of nodes*

The number of input nodes is predetermined by the number of features in the input data. Similarly, the number of output nodes is predetermined by the number of outcomes. The number of hidden nodes, however, is your choice. There is no reliable rule to determine it. In general, the more complex the problem, the more hidden nodes are needed.
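
The following base R sketch shows how these node counts shape a feedforward pass through a network with one hidden layer. Everything in it is an assumption for illustration: the helper name `feedforward`, the sigmoid hidden layer, the linear output node, and the random weights, which merely stand in for values a training algorithm would learn.

```r
set.seed(1)  # only to make the illustrative random weights repeatable

n_features <- 8  # one input node per feature
n_hidden   <- 5  # chosen by the modeler
n_outputs  <- 1  # one output node for a single numeric outcome

sigmoid <- function(z) 1 / (1 + exp(-z))

# Random weights and biases stand in for learned values.
W1 <- matrix(rnorm(n_hidden * n_features), nrow = n_hidden)
b1 <- rnorm(n_hidden)
W2 <- matrix(rnorm(n_outputs * n_hidden), nrow = n_outputs)
b2 <- rnorm(n_outputs)

# Hypothetical helper: signals flow strictly from input to hidden to output.
feedforward <- function(x) {
  hidden <- sigmoid(W1 %*% x + b1)
  as.numeric(W2 %*% hidden + b2)  # linear output node
}

x <- runif(n_features)  # one example with features scaled to [0, 1]
y_hat <- feedforward(x)
print(y_hat)
```

Note that only `n_hidden` is a free choice here; the shapes of `W1` and `W2` follow from it together with the number of features and outcomes.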

***
The following lines of code show an example of modeling the strength of concrete with ANNs.

In [0]:

concrete <- read.csv("https://raw.githubusercontent.com/NSkuhala/Test1/master/concrete.csv")
str(concrete)


'data.frame':	1030 obs. of  9 variables:
 $ cement      : num  141 169 250 266 155 ...
 $ slag        : num  212 42.2 0 114 183.4 ...
 $ ash         : num  0 124.3 95.7 0 0 ...
 $ water       : num  204 158 187 228 193 ...
 $ superplastic: num  0 10.8 5.5 0 9.1 0 0 6.4 0 9 ...
 $ coarseagg   : num  972 1081 957 932 1047 ...
 $ fineagg     : num  748 796 861 670 697 ...
 $ age         : int  28 14 28 28 28 90 7 56 28 28 ...
 $ strength    : num  29.9 23.5 29.2 45.9 18.3 ...


You can see that our dataset has 1030 observations of 9 variables.
Eight variables (features) are components of the concrete that contribute to its strength. The last one is the resulting strength.
Because the values in our dataset range from 0 to over 1000, we first have to normalize them: ANNs work best when the input data is scaled to a narrow range around zero.



In [0]:
normalize <- function(x) { 
  return((x - min(x)) / (max(x) - min(x)))
}

In [0]:
concrete_norm <- as.data.frame(lapply(concrete, normalize))

For confirmation, we compare the Min. and Max. of the summaries after and before normalization.

In [0]:
summary(concrete_norm$strength)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.2664  0.4001  0.4172  0.5457  1.0000 

In [0]:
summary(concrete$strength)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.33   23.71   34.45   35.82   46.13   82.60 

Since the dataset looks reasonable, we split it into a training set (75 % of the data) and a test set (25 % of the data). Fortunately, the CSV file we used is already in random order.

In [0]:
concrete_train <- concrete_norm[1:773, ]
concrete_test <- concrete_norm[774:1030, ]
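
The row-range split above relies on the file already being in random order. If that were not the case, the rows could be drawn at random instead. The sketch below does this on a made-up data frame; `toy`, `train_idx`, and the seed are illustrative assumptions and not part of the original analysis.

```r
set.seed(123)  # arbitrary seed so the random draw is repeatable

# Made-up data frame that is deliberately sorted by its outcome.
toy <- data.frame(x = 1:20, y = seq(10, 200, by = 10))

# Draw 75 % of the row indices at random, without replacement.
n_train   <- floor(0.75 * nrow(toy))
train_idx <- sample(nrow(toy), n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(dim(toy_train))
print(dim(toy_test))
```

Negative indexing with `-train_idx` guarantees that no row ends up in both sets.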

Next step: Train the neuralnet model

In [0]:
install.packages("neuralnet")
library(neuralnet)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Using the neuralnet() function in the neuralnet package:
*   target, the outcome in the mydata data frame (df)
*   predictors, an R formula specifying the features in the mydata df
*   data, the df where the target and predictors are found
*   hidden, the number of neurons in the hidden layer (by default = 1)
*   act.fct, the activation function, either "logistic" or "tanh"

First, we will run our model with a single hidden node.
Also execute set.seed(12345) to match the outcome in the book exactly, because neural networks begin with random weights.

In [0]:
set.seed(12345)
concrete_model <- neuralnet(formula = strength ~ cement + slag +
                              ash + water + superplastic + 
                              coarseagg + fineagg + age,
                              data = concrete_train)


In [0]:
plot(concrete_model)


The weights for each of the connections are depicted, as are the bias terms. The bias terms are numeric constants that allow the value at the indicated nodes to be shifted upward or downward, much like the intercept in a linear equation. At the bottom, R reports the number of training steps and an error measure called the sum of squared errors (SSE). The lower the SSE, the more closely the model conforms to the training data.
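
To illustrate the error measure, the sketch below computes a plain sum of squared errors for two sets of made-up predictions. The helper `sse` and all values are assumptions for illustration; the exact scaling that a particular package reports for its error may differ from this plain sum.

```r
# Hypothetical helper: sum of squared errors between actual and predicted values.
sse <- function(actual, predicted) {
  sum((actual - predicted)^2)
}

actual <- c(0.20, 0.45, 0.70, 0.90)  # made-up normalized strengths
pred_a <- c(0.25, 0.40, 0.65, 0.85)  # close to the actual values
pred_b <- c(0.60, 0.10, 0.30, 0.50)  # far from the actual values

print(sse(actual, pred_a))
print(sse(actual, pred_b))
```

The predictions that sit closer to the actual values yield the smaller SSE.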

For predictions we use the compute() function

In [0]:
model_results <- compute(concrete_model, concrete_test[1:8])

In [0]:
predicted_strength <- model_results$net.result

In [0]:
cor(predicted_strength, concrete_test$strength)

0
0.8064656


We see that there is a fairly strong relationship between our predicted concrete strength and the true value.
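
Correlation shows how well the predictions track the true values, but not how far off they are. One common complement is the mean absolute error. The sketch below uses made-up vectors and a hypothetical helper `mae` to show that the two measures can disagree.

```r
# Hypothetical helper: mean absolute error.
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

actual    <- c(0.10, 0.30, 0.50, 0.70, 0.90)  # made-up values
predicted <- actual + 0.20                     # every prediction is 0.20 too high

print(cor(predicted, actual))  # perfect linear association
print(mae(actual, predicted))  # yet every prediction is off by the same amount
```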

In the next step, we want to improve our model's performance,
first by adding hidden nodes, and afterwards by adding more hidden layers and changing the network's activation function.

In [0]:
set.seed(12345)
concrete_model2 <- neuralnet(strength ~ cement + slag +
                               ash + water + superplastic + 
                               coarseagg + fineagg + age,
                               data = concrete_train, hidden = 5)


In [0]:
plot(concrete_model2)

In [0]:
# evaluate the results as we did before
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$strength)

0
0.9244533


Adding a second hidden layer and changing the activation function

In [0]:
softplus <- function(x) { log(1 + exp(x)) }
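
The softplus function is a smooth approximation of the rectifier (ReLU), which outputs zero for negative inputs and the input itself otherwise. The sketch below compares the two on a few made-up inputs; `relu` and `comparison` are illustrative names and are not used elsewhere in this notebook.

```r
softplus <- function(x) { log(1 + exp(x)) }  # same definition as in the notebook
relu     <- function(x) { pmax(0, x) }       # rectifier, for comparison only

x <- c(-5, -1, 0, 1, 5)
comparison <- data.frame(x = x, softplus = softplus(x), relu = relu(x))
print(round(comparison, 4))
```

Away from zero the two columns nearly coincide, while around zero softplus bends smoothly instead of having a kink.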

In [0]:
set.seed(12345)
concrete_model3 <- neuralnet(strength ~ cement + slag +
                               ash + water + superplastic + 
                               coarseagg + fineagg + age,
                             data = concrete_train, hidden = c(5, 5), act.fct = softplus)

In [0]:
plot(concrete_model3)

In [0]:
# evaluate the results as we did before
model_results3 <- compute(concrete_model3, concrete_test[1:8])
predicted_strength3 <- model_results3$net.result
cor(predicted_strength3, concrete_test$strength)

The following code builds a data frame comparing the original dataset's concrete strength values with their corresponding predictions side by side.

In [0]:
strengths <- data.frame(
  actual = concrete$strength[774:1030],
  pred = predicted_strength3
)

head(strengths, n = 3)


In [0]:
cor(strengths$pred, strengths$actual)


Reverse the min-max normalization procedure to get back to the original scale

In [0]:
unnormalize <- function(x) { 
  # invert min-max normalization: x * (max - min) + min
  return(x * (max(concrete$strength) -
          min(concrete$strength)) + min(concrete$strength))
}
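
Min-max normalization maps a value x to (x - min) / (max - min), so its inverse is x * (max - min) + min. A round trip on a made-up vector is a quick way to check this. The names `normalize_minmax`, `unnormalize_minmax`, and `orig` below are assumptions for illustration and take the minimum and maximum as explicit arguments.

```r
orig <- c(5, 20, 35, 50, 80)  # made-up values on the original scale

# Illustrative helpers; min (lo) and max (hi) are passed in explicitly.
normalize_minmax   <- function(x, lo, hi) (x - lo) / (hi - lo)
unnormalize_minmax <- function(x, lo, hi) x * (hi - lo) + lo

lo <- min(orig)
hi <- max(orig)

scaled   <- normalize_minmax(orig, lo, hi)
restored <- unnormalize_minmax(scaled, lo, hi)

print(range(scaled))
print(all.equal(restored, orig))
```

If the round trip does not return the original values, the inverse formula is wrong.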

Now that the predictions have been rescaled, we can compute the correlation again. Because the rescaling is a linear transformation, the correlation itself does not change.

In [0]:
strengths$pred_new <- unnormalize(strengths$pred)
strengths$error <- strengths$pred_new - strengths$actual

head(strengths, n = 3)

cor(strengths$pred_new, strengths$actual)

Correlation about 0.935 -> awesome, good job

In part 2 we're going to talk about Support Vector Machines, using optical character recognition (OCR) as an example.

In [0]:

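# stringsAsFactors = TRUE keeps letter a factor (needed for classification) in R >= 4.0
letters <- read.csv("https://raw.githubusercontent.com/NSkuhala/Test1/master/letterdata.csv", stringsAsFactors = TRUE)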
str(letters)


'data.frame':	20000 obs. of  17 variables:
 $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
 $ xbox  : int  2 5 4 7 2 4 4 1 2 11 ...
 $ ybox  : int  8 12 11 11 1 11 2 1 2 15 ...
 $ width : int  3 3 6 6 3 5 5 3 4 13 ...
 $ height: int  5 7 8 6 1 8 4 2 4 9 ...
 $ onpix : int  1 2 6 3 1 3 4 1 2 7 ...
 $ xbar  : int  8 10 10 5 8 8 8 8 10 13 ...
 $ ybar  : int  13 5 6 9 6 8 7 2 6 2 ...
 $ x2bar : int  0 5 2 4 6 6 6 2 2 6 ...
 $ y2bar : int  6 4 6 6 6 9 6 2 6 2 ...
 $ xybar : int  6 13 10 4 6 5 7 8 12 12 ...
 $ x2ybar: int  10 3 3 4 5 6 6 2 4 1 ...
 $ xy2bar: int  8 9 7 10 9 6 6 8 8 9 ...
 $ xedge : int  0 2 3 6 1 0 2 1 1 8 ...
 $ xedgey: int  8 8 7 10 7 8 8 6 6 1 ...
 $ yedge : int  0 4 3 2 5 9 7 2 1 1 ...
 $ yedgex: int  8 10 9 8 10 7 10 7 7 8 ...


As before, we split the dataset into a training (80%) and a test (20%) dataset.

In [0]:
letters_train <- letters[1:16000, ]
letters_test  <- letters[16001:20000, ]

In [42]:
install.packages("kernlab")
library(kernlab)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [45]:
# begin by training a simple linear SVM
letter_classifier <- ksvm(letter ~ ., data = letters_train,
                          kernel = "vanilladot")


 Setting default kernel parameters  


In [46]:
# look at basic information about the model
letter_classifier


Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 1 

Linear (vanilla) kernel function. 

Number of Support Vectors : 7037 

Objective Function Value : -14.1746 -20.0072 -23.5628 -6.2009 -7.5524 -32.7694 -49.9786 -18.1824 -62.1111 -32.7284 -16.2209 -32.2837 -28.9777 -51.2195 -13.276 -35.6217 -30.8612 -16.5256 -14.6811 -32.7475 -30.3219 -7.7956 -11.8138 -32.3463 -13.1262 -9.2692 -153.1654 -52.9678 -76.7744 -119.2067 -165.4437 -54.6237 -41.9809 -67.2688 -25.1959 -27.6371 -26.4102 -35.5583 -41.2597 -122.164 -187.9178 -222.0856 -21.4765 -10.3752 -56.3684 -12.2277 -49.4899 -9.3372 -19.2092 -11.1776 -100.2186 -29.1397 -238.0516 -77.1985 -8.3339 -4.5308 -139.8534 -80.8854 -20.3642 -13.0245 -82.5151 -14.5032 -26.7509 -18.5713 -23.9511 -27.3034 -53.2731 -11.4773 -5.12 -13.9504 -4.4982 -3.5755 -8.4914 -40.9716 -49.8182 -190.0269 -43.8594 -44.8667 -45.2596 -13.5561 -17.7664 -87.4105 -107.1056 -37.0245 -30.7133 -112.3218 -32.9619 -27.2971 -35.5

In [47]:
## Step 4: Evaluating model performance ----
# predictions on testing dataset
letter_predictions <- predict(letter_classifier, letters_test)

head(letter_predictions)

table(letter_predictions, letters_test$letter)

                  
letter_predictions   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
                 A 144   0   0   0   0   0   0   0   0   1   0   0   1   2   2
                 B   0 121   0   5   2   0   1   2   0   0   1   0   1   0   0
                 C   0   0 120   0   4   0  10   2   2   0   1   3   0   0   2
                 D   2   2   0 156   0   1   3  10   4   3   4   3   0   5   5
                 E   0   0   5   0 127   3   1   1   0   0   3   4   0   0   0
                 F   0   0   0   0   0 138   2   2   6   0   0   0   0   0   0
                 G   1   1   2   1   9   2 123   2   0   0   1   2   1   0   1
                 H   0   0   0   1   0   1   0 102   0   2   3   2   3   4  20
                 I   0   1   0   0   0   1   0   0 141   8   0   0   0   0   0
                 J   0   1   0   0   0   1   0   2   5 128   0   0   0   0   1
                 K   1   1   9   0   0   0   2   5   0   0 118   0   0   2   0
                 L   0   0   0   

In [48]:
# look only at agreement vs. non-agreement
# construct a vector of TRUE/FALSE indicating correct/incorrect predictions
agreement <- letter_predictions == letters_test$letter
table(agreement)
prop.table(table(agreement))

agreement
FALSE  TRUE 
  643  3357 

agreement
  FALSE    TRUE 
0.16075 0.83925 
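
The agreement vector can also be summarized directly, because the mean of a logical vector is the proportion of TRUE values. The sketch below uses made-up labels (`truth` and `pred` are illustrative, not the letter data) and additionally breaks the accuracy down by true class.

```r
# Made-up predictions and true labels for illustration.
truth <- factor(c("A", "B", "A", "C", "B", "C", "A", "B"))
pred  <- factor(c("A", "B", "C", "C", "B", "A", "A", "B"), levels = levels(truth))

agreement <- pred == truth

# Mean of a logical vector = proportion of TRUE values = overall accuracy.
accuracy <- mean(agreement)
print(accuracy)

# Share of correct predictions within each true class.
per_class <- tapply(agreement, truth, mean)
print(per_class)
```

A per-class breakdown like this can reveal which letters are hardest to recognize, something the overall proportion hides.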

In [49]:
## Step 5: Improving model performance ----

# change to a RBF kernel
RNGversion("3.5.2") # use an older random number generator to match the book
set.seed(12345)
letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot")
letter_predictions_rbf <- predict(letter_classifier_rbf, letters_test)

agreement_rbf <- letter_predictions_rbf == letters_test$letter
table(agreement_rbf)
prop.table(table(agreement_rbf))

“non-uniform 'Rounding' sampler used”


agreement_rbf
FALSE  TRUE 
  275  3725 

agreement_rbf
  FALSE    TRUE 
0.06875 0.93125 
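
To give a feeling for the difference between the two kernels, the sketch below writes both out by hand in base R: a linear kernel is a plain dot product, while a Gaussian RBF kernel measures similarity that decays with squared distance. These are hand-written illustrations and not kernlab's implementation; the function names, the vectors, and the value of `sigma` are assumptions.

```r
# Plain dot product, the idea behind a linear ("vanilla") kernel.
linear_kernel <- function(x, y) sum(x * y)

# Gaussian RBF kernel: similarity decays with squared Euclidean distance.
# sigma is a hypothetical width parameter chosen for illustration.
rbf_kernel <- function(x, y, sigma = 0.1) exp(-sigma * sum((x - y)^2))

a <- c(2, 8, 3, 5)   # made-up feature vectors
b <- c(5, 12, 3, 7)

print(linear_kernel(a, b))
print(rbf_kernel(a, a))  # identical points
print(rbf_kernel(a, b))  # more distant points give a smaller value
```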

In [0]:
# test various values of the cost parameter
cost_values <- c(1, seq(from = 5, to = 40, by = 5))

RNGversion("3.5.2") # use an older random number generator to match the book
accuracy_values <- sapply(cost_values, function(x) {
  set.seed(12345)
  m <- ksvm(letter ~ ., data = letters_train,
            kernel = "rbfdot", C = x)
  pred <- predict(m, letters_test)
  agree <- ifelse(pred == letters_test$letter, 1, 0)
  accuracy <- sum(agree) / nrow(letters_test)
  return (accuracy)
})

plot(cost_values, accuracy_values, type = "b")
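
Besides reading the best value off the plot, it can be picked programmatically. The sketch below uses made-up numbers (`toy_cost` and `toy_accuracy` are hypothetical stand-ins, not results from the sweep above) to show the idea with `which.max()`.

```r
# Made-up numbers standing in for the vectors produced by a cost sweep.
toy_cost     <- c(1, 5, 10, 15, 20)
toy_accuracy <- c(0.930, 0.960, 0.968, 0.971, 0.970)

best      <- which.max(toy_accuracy)  # index of the first maximum
best_cost <- toy_cost[best]

print(best_cost)
print(toy_accuracy[best])
```

Keep in mind that choosing the cost value by test set accuracy tunes the model to that particular test set, so the winning accuracy is likely to be somewhat optimistic.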
