<a href="https://colab.research.google.com/github/Leonardo-daVinci/Deep-Learning-PyTorch/blob/Error_Functions%26Cross_Entropy/2_Error_Functions_%26_CrossEntropy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Error Function

It tells us how badly we are performing at the moment and how far we are from the solution.  
On a Cartesian plane, it may be the Euclidean distance or some other distance between our current position and the target position.  
The goal of the error function is to help us take steps towards our target position.  
We look around our current position, find the direction that brings us closest to the target position, and take a step in that direction.  
The process is repeated until we reach the target position.  
The technique described above is called **Gradient Descent**.
  
**Conditions on the error function to apply Gradient Descent:**

1.   It should be continuous.
2.   It should be differentiable.
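
As a rough illustration of the steps described above, here is a minimal sketch of gradient descent in one dimension. The error function `(x - 3)^2`, the target position `3.0`, the learning rate and the number of steps are all assumptions chosen for this example; they are not part of the original notes.

```python
def error(x):
    # Hypothetical error function: squared distance from the target position 3.0.
    return (x - 3.0) ** 2

def error_gradient(x):
    # Derivative of the error with respect to x.
    return 2.0 * (x - 3.0)

def gradient_descent(start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        # Step against the gradient, the direction in which the error decreases.
        x -= learning_rate * error_gradient(x)
    return x

x_final = gradient_descent(start=0.0)
print(x_final, error(x_final))
```

Because this error function is continuous and differentiable, every step has a well-defined direction, and the position moves steadily towards the target.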

##Discrete vs Continuous Predictions

Since we need a continuous error function, we need to move to continuous predictions.  
For example, if we predict the answer as yes or no, we need to change it to the probability of yes, which is between 0 and 1.  
Recall that we employed a step function in the perceptron to give an output of 0 or 1. To convert these discrete outputs to continuous ones, we replace the step function with the **Sigmoid Function**.  

> sigmoid(x) = 1/(1+exp(-x))

Thus we can modify our activation function as follows:

> y' = sigmoid(Wx + b)
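
The two formulas above can be sketched with NumPy as follows. The weights `W`, the input `x` and the bias `b` are made-up values used only for illustration.

```python
import numpy as np

def sigmoid(x):
    # Maps any real number to a value strictly between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights, input and bias for a single prediction.
W = np.array([2.0, -1.0])
x = np.array([1.0, 3.0])
b = 0.5

# Continuous prediction y' = sigmoid(Wx + b); here Wx + b = -0.5.
y_pred = sigmoid(np.dot(W, x) + b)
print(y_pred)
```

A score of 0 gives a probability of exactly 0.5, large positive scores approach 1, and large negative scores approach 0.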


##Multi-Class Classification

If our model can predict more than two classes, for example deciding whether a given fruit is a banana, an orange or an apple, the problem is called **Multi-Class Classification**.  
  
We determine scores for each class and then calculate probabilities. Since the scores are obtained as the result of linear functions, they can also be negative. So, to convert the scores into positive values, we apply the exponential function to each of them.  
  
Thus we calculate the probability for each class using the **Softmax function**. Suppose there are N classes with linear function scores Z1, Z2, ..., ZN; then we define the softmax function as:
> P(class i) = exp(Zi) / (exp(Z1) + exp(Z2) + ... + exp(ZN))

In [0]:
import numpy as np

# Takes a list of numbers (scores) as input and returns
# the list of values given by the softmax function.
def softmax(L):
    exp_scores = np.exp(L)  # exponentiate every score at once
    return list(exp_scores / np.sum(exp_scores))
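
For very large scores, `np.exp` can overflow. A common workaround, not part of the original notes, is to subtract the largest score before exponentiating, which leaves the probabilities unchanged. The sketch below uses the hypothetical name `softmax_stable` for this variant.

```python
import numpy as np

def softmax_stable(scores):
    # Variant of softmax that shifts the scores so the largest one is 0.
    # The shift cancels out in the ratio, so the result is the same.
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Negative scores still produce positive probabilities that sum to 1.
probs = softmax_stable([2.0, -1.0, 0.0])
print(probs, probs.sum())

# Scores this large would overflow np.exp without the shift.
large = softmax_stable([1000.0, 1001.0, 1002.0])
print(large)
```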

## One-Hot Encoding
In our case of multiclass classification, to encode our classes, we define a variable for each class.  
Suppose there are 3 classes C1, C2 and C3, and a particular example E belongs to class C2. Then the value for C2 is 1 and for the other classes it is 0.  
That is, E would be encoded as [0,1,0]

In [0]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()

# Here we have 3 classes, so one-hot encoding produces 3 columns,
# ordered alphabetically by category: Apple, Mango, Orange.
X = [['Orange'],['Apple'],['Mango']]
X = enc.fit_transform(X).toarray()
X

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
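
The same encoding can also be sketched with NumPy alone, which makes the mechanics visible: each class gets an index, and each example picks the matching row of an identity matrix. This is an alternative illustration, not the method used in the cell above.

```python
import numpy as np

labels = np.array(['Orange', 'Apple', 'Mango'])

# np.unique returns the sorted classes and, for each label, its class index.
classes, indices = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes))[indices]

print(classes)
print(one_hot)
```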

##Maximum Likelihood
This method involves picking the model that gives the existing labels the highest probability.  
Thus, by maximizing the probability, we can pick the best possible model.  
Let us consider an example where we have 2 models that classify 4 points into blue and red.

1.   The first model, M1, classifies 2 points correctly and the other 2 points incorrectly. 
2.   The second model, M2, classifies all the points correctly.

Let us say that the probabilities each model assigns to the correct label of each point are as follows: 

1.   For M1, probabilities are 0.60, 0.7, 0.2 and 0.1 (2 correctly classified points).
2.   For M2, probabilities are 0.7, 0.9, 0.6 and 0.8 (all points correctly classified).

Now, the product of the probabilities for M1 is 0.0084, while for M2 it is 0.3024, which is significantly higher. Thus, by the method of Maximum Likelihood, we can confirm that model M2 is better than M1.  
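
The two products can be checked with a few lines of NumPy. The variable names are chosen for this sketch only.

```python
import numpy as np

# Probabilities each model assigns to the correct label of the 4 points.
m1_probs = [0.6, 0.7, 0.2, 0.1]
m2_probs = [0.7, 0.9, 0.6, 0.8]

# The likelihood of a model is the product of these probabilities.
m1_likelihood = np.prod(m1_probs)
m2_likelihood = np.prod(m2_probs)

print(m1_likelihood, m2_likelihood)
print("M2 is better" if m2_likelihood > m1_likelihood else "M1 is better")
```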
  
##Cross-Entropy
We work with the sum of the logarithms of the probabilities rather than the product of the probabilities for the following reasons:

1.   If there are thousands of data points, the product of their probabilities would be extremely small.
2.   A change in the probability of even one data point can drastically change the entire product.

To solve these issues, we use the logarithm function, which turns the product of probabilities into a sum of their logarithms, since log(ab) = log(a) + log(b).  
Since the probabilities are between 0 and 1, their logarithms are negative. To make the sum positive, we take the negative of each logarithm. The resulting sum of negative logarithms is called the **Cross-Entropy**.  

The negative logarithm of a number close to one is small, while the negative logarithm of a number close to zero is large. Thus a **good model will give us low cross-entropy while a bad model will give us high cross-entropy**.  
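
To see why the product becomes a problem with many data points, consider a sketch with 5000 points that each have probability 0.5. The number of points and the probability are arbitrary values picked for this illustration.

```python
import math
import numpy as np

# Hypothetical data set: 5000 points, each assigned probability 0.5.
probs = np.full(5000, 0.5)

# The product is too small to represent as a float and underflows to 0.
product = np.prod(probs)

# The sum of logarithms carries the same information and stays finite.
log_sum = np.sum(np.log(probs))

print(product)
print(log_sum, -log_sum)
```

The negated sum is a large positive number that can still be compared between models, while the raw product has lost all information.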





In [0]:
import numpy as np

# cross_entropy takes two lists, Y and P, of equal length:
# Y holds the true label of each point (1 or 0), and
# P holds the predicted probability that the point's label is 1.
# It returns their cross-entropy as a float.
def cross_entropy(Y, P):
    # np.float_ was removed in NumPy 2.0, so convert with np.asarray instead.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
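
As a usage sketch, we can compare the two models from the Maximum Likelihood example. The helper below is redefined under the hypothetical name `binary_cross_entropy` so that the snippet runs on its own, and it assumes natural logarithms. Since the listed probabilities are those assigned to the correct label, every entry of `Y` is set to 1.

```python
import numpy as np

def binary_cross_entropy(Y, P):
    # Y: true labels (1 or 0); P: predicted probability that the label is 1.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

# Probabilities each model assigns to the correct label of the 4 points.
m1_probs = [0.6, 0.7, 0.2, 0.1]
m2_probs = [0.7, 0.9, 0.6, 0.8]
labels = [1, 1, 1, 1]

ce_m1 = binary_cross_entropy(labels, m1_probs)
ce_m2 = binary_cross_entropy(labels, m2_probs)

# The better model (M2) has the lower cross-entropy.
print(ce_m1, ce_m2)
```

Here the cross-entropy is the negative logarithm of the likelihood, so the model with the higher product of probabilities is the one with the lower cross-entropy.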