# Topic 2. Neural Networks 

## Mathematical background



In this notebook we review the concepts of KL divergence, likelihood, and cross-entropy.

In [4]:
import numpy as np
from scipy.stats import norm
from matplotlib import pyplot as plt
#import tensorflow as tf
import seaborn as sns


## Kullback Leibler divergence

The Kullback Leibler divergence is a measure that quantifies the difference between two distributions. 

For a continuous random variable it is computed as an integral

$KL(p,q) = \int p(x) \log \left(\frac{p(x)}{q(x)} \right) dx$

For discrete random variables it is computed as a summation

$KL(p,q) = \sum_{i=1}^n p(x^i) \log \left(\frac{p(x^i)}{q(x^i)} \right) $
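As a small worked example of the discrete formula, the sketch below evaluates the sum on a two-outcome distribution. The helper name `kl_discrete` and the values of `p` and `q` are made up for illustration. It assumes the usual convention that terms with $p(x^i)=0$ contribute zero, and that $q(x^i)>0$ wherever $p(x^i)>0$.

```python
import numpy as np

def kl_discrete(p, q):
    """Discrete KL(p, q) in nats (natural logarithm)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_discrete(p, q)  # 0.5*log(0.5/0.9) + 0.5*log(0.5/0.1)
kl_qp = kl_discrete(q, p)  # 0.9*log(0.9/0.5) + 0.1*log(0.1/0.5)
print(kl_pq, kl_qp)
```

The two values differ, which illustrates that the KL divergence is not symmetric in its arguments, and `kl_discrete(p, p)` evaluates to zero.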

In [5]:
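# Converts an integer in [0, 255] to its 8-bit binary representation (array of 0/1 values)
def  Binary_8(x):
    binary = np.unpackbits(np.array([x], dtype=np.uint8))
    return binary
    
# Creates a joint distribution for 8 binary variables based on function f
# f must map a binary vector to a non-negative scalar score
def  Joint_Distribution(f):
    n = 8
    # 2**n enumerates all 256 configurations (2^n would be a bitwise XOR in Python)
    dist = np.array([ f(Binary_8(x)) for x in range(2**n)], dtype=float)
    return dist/np.sum(dist)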
    

## Exercise 1

Complete the implementation of the Kullback_Leibler function below and use it to quantify the pairwise divergences
between the three distributions defined in the following cell.

In [6]:
def Kullback_Leibler(p,q):
    KL = 0
    
    return KL

In [7]:
# Distribution associated to function f1 
f1 = np.sum
dist1 = Joint_Distribution(f1)

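# Distribution associated to function f2
# np.square alone returns a vector for vector inputs, so the score is reduced to a scalar;
# using the squared sum of the bits is an assumption about the intended f2
f2 = lambda bits: np.square(np.sum(bits, dtype=float))
dist2 = Joint_Distribution(f2)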

# Uniform distribution over the same configurations as dist1
dist3 = np.full(len(dist1),1.0/len(dist1))

kl_12 = Kullback_Leibler(dist1,dist2)
kl_13 = Kullback_Leibler(dist1,dist3)
kl_23 = Kullback_Leibler(dist2,dist3)

print("The KL divergence between the distributions are: ",kl_12,kl_13,kl_23)


The KL divergence between the distributions are:  0 0 0


## Exercise 2

The function scipy.stats.entropy computes the KL divergence when it is given two distributions (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html). Compare its results with those of Kullback_Leibler for the divergences kl_21 and kl_31.
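A minimal sketch of how `scipy.stats.entropy` can be called with two arguments is shown below. The distributions `p` and `q` are made-up toy values, not the notebook's `dist1`, `dist2`, or `dist3`. Two behaviours are worth keeping in mind when comparing results: the function uses the natural logarithm unless `base` is given, and it normalizes its inputs so that they sum to one.

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

# Direct evaluation of the discrete KL formula (natural logarithm)
manual = np.sum(p * np.log(p / q))

# With a second argument, entropy(pk, qk) returns the KL divergence of pk from qk
scipy_kl = entropy(p, q)
scipy_kl_bits = entropy(p, q, base=2)  # same quantity measured in bits

# Inputs are normalized internally, so unnormalized weights give the same value
scipy_kl_unnorm = entropy(10 * p, 3 * q)

print(manual, scipy_kl, scipy_kl_bits, scipy_kl_unnorm)
```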

## Likelihood and Log-likelihood


In the lecture we have seen that the maximum likelihood estimator is the most used way to estimate the parameters of a model. For a parametric distribution $p_{\theta}$ with parameter $\theta$ and samples $x^1,\dots,x^n$, the log-likelihood is computed as:

LL = $\sum_{i=1}^n \log p_{\theta}(x^i)$

and usually we want to find the parameter $\theta$ that maximizes the likelihood.
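As a minimal sketch of evaluating a Gaussian log-likelihood, the snippet below sums log-densities over a small made-up sample and compares two parameter choices. The names `toy_samples` and `gaussian_log_likelihood` are illustrative and not part of the notebook. It uses `norm.logpdf` from scipy.stats, which evaluates the logarithm of the density directly.

```python
import numpy as np
from scipy.stats import norm

# Made-up sample, roughly centred at 5
toy_samples = np.array([4.8, 5.1, 5.3, 4.9, 5.0])

def gaussian_log_likelihood(samples, mu, sigma):
    """Sum of the Gaussian log-densities of the samples."""
    return float(np.sum(norm.logpdf(samples, loc=mu, scale=sigma)))

ll_near = gaussian_log_likelihood(toy_samples, mu=5.0, sigma=0.2)
ll_far = gaussian_log_likelihood(toy_samples, mu=8.0, sigma=0.2)
print(ll_near, ll_far)
```

The mean close to the data yields the larger log-likelihood, which is the criterion maximum likelihood uses to choose among candidate parameters.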

## Exercise 3

Let $\theta = (\hat{\mu},\hat{\sigma})$ be the parameters of a Gaussian distribution. Given a list of samples S and $\theta$ as inputs, complete in the next cell the implementation of the function Log_Likelihood(), which computes the log-likelihood of the sample S given the model and the parameters theta.  

Suggestion: Use the function vals = norm.pdf(x,mu,sigma), which computes the density assigned to point x
by a Gaussian distribution of parameters mu,sigma. 

In [8]:
def Log_Likelihood(S,theta):
    mu = theta[0]
    sigma = theta[1]    
    LL = 0 
    for x in S:
        XXX
    return LL
    
    
param_theta = [20,24]    
set_S = [ 79.19707822, 107.80002586, 112.87202068, 110.24491734, 84.36827062, 112.3651777 , 104.68415497, 135.02754932,
     104.7278481 ,  90.23410368, 121.39874476,  74.19135635, 102.09448052,  97.62711719, 151.56860878]

ll = Log_Likelihood(set_S,param_theta)

NameError: name 'XXX' is not defined

## Exercise 4


Given a list of possible theta parameters (all_thetas), determine which theta best fits the samples in set_S.

Suggestion: Use the function Log_Likelihood() implemented above.

In [9]:
set_S = [ 79.19707822, 107.80002586, 112.87202068, 110.24491734, 84.36827062, 112.3651777 , 104.68415497, 135.02754932,
     104.7278481 ,  90.23410368, 121.39874476,  74.19135635, 102.09448052,  97.62711719, 151.56860878]

all_thetas = [[125,20],[100,30],[90,11],[135,24],[110,30],[108,11]]

## Kullback Leibler, entropy, and cross-entropy

Let us go back to the definition of the Kullback Leibler divergence: 

\begin{align}
   KL(p,q) &= \sum_{i=1}^n p(x^i) \log \left(\frac{p(x^i)}{q(x^i)} \right) \\
           &= \sum_{i=1}^n p(x^i) \log p(x^i)  -\sum_{i=1}^n p(x^i) \log q(x^i)\\
           &=  -H(p) + H(p,q)
\end{align} 

where $H(p)=-\sum_{i=1}^n p(x^i) \log p(x^i)$ is the entropy of $p$ and $H(p,q)=-\sum_{i=1}^n p(x^i) \log q(x^i)$ is called the cross-entropy between distributions $p$ and $q$.

We can see then that the Kullback Leibler divergence between two distributions is the cross-entropy between $p$ and $q$ minus the entropy of $p$. 
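The relation between the three quantities can be checked numerically. The sketch below uses two made-up distributions over three outcomes, with the entropy taken as $H(p) = -\sum_i p(x^i) \log p(x^i)$, and compares the KL divergence against the cross-entropy minus the entropy.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

entropy_p = -np.sum(p * np.log(p))         # H(p)
cross_entropy_pq = -np.sum(p * np.log(q))  # H(p, q)
kl_pq = np.sum(p * np.log(p / q))          # KL(p, q)

# Both printed values agree up to floating-point error
print(kl_pq, cross_entropy_pq - entropy_p)
```

Since the KL divergence is never negative, the cross-entropy $H(p,q)$ is always at least as large as the entropy $H(p)$.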

## Log-likelihood and Cross-entropy 

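Furthermore, there is a strong relationship between the log-likelihood and the cross-entropy. Let $\hat{p}$ be the empirical distribution of the samples, which assigns to each value $x$ the fraction of the $n$ samples equal to $x$. Then

\begin{align}
   \frac{1}{n} LL &= \frac{1}{n}\sum_{i=1}^n \log p_{\theta}(x^i) \\
           &= \sum_{x} \hat{p}(x) \log p_{\theta}(x) \\
           &= -H(\hat{p},p_{\theta})
\end{align}    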
    

Therefore, maximizing the log-likelihood is equivalent to minimizing the cross-entropy.
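The equivalence can be illustrated on a small discrete case. In the sketch below, `samples`, `model`, and the other names are hypothetical and chosen only for illustration. The average log-likelihood of the samples under the model is compared with the negative cross-entropy between the empirical distribution of the samples and the model.

```python
import numpy as np

# Observed outcomes taking values in {0, 1, 2} (made-up data)
samples = np.array([0, 2, 1, 0, 0, 2, 1, 0])
# Hypothetical model probabilities p_theta for outcomes 0, 1, 2
model = np.array([0.5, 0.3, 0.2])

# Average log-likelihood of the samples under the model
avg_ll = np.mean(np.log(model[samples]))

# Empirical distribution: relative frequency of each outcome
counts = np.bincount(samples, minlength=len(model))
p_hat = counts / counts.sum()

# Cross-entropy between the empirical distribution and the model
ce_hat_model = -np.sum(p_hat * np.log(model))

print(avg_ll, -ce_hat_model)
```

The two printed numbers coincide, so a parameter choice that raises the log-likelihood lowers this cross-entropy by the same amount (up to the factor $n$).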

In the following exercise we use the implementation of a logistic-regression classifier from the previous Lab. This implementation is an adaptation of the one found in the book *Python Machine Learning* (Raschka, S., & Mirjalili, V. (2017). Packt Publishing Ltd.).

In [None]:
class MyLogisticRegression(object):
    def __init__(self, eta=0.01, n_iter=1000, random_state=0):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state
        self.rgen = np.random.RandomState(self.random_state)
        self.w = None
        self.b = None
        
    def net_input(self, X):
        return np.dot(X, self.w) + self.b
    
    def activation(self, z):
        return 1. / (1. + np.exp(-np.clip(z, -25, 25)))
        
    def fit(self, X, y):        
        self.w = self.rgen.normal(loc=0, scale=0.01, size=X.shape[1])
        self.b = self.rgen.normal(loc=0, scale=0.01, size=1)
        self.cost = []
        
        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = y-output                                # The error is computed as the difference between the 
                                                            # prob. of the class and the prediction of the model
            
            self.w += self.eta * X.T.dot(errors)
            self.b += self.eta * errors.sum()                # Accumulate the bias update instead of overwriting the bias
            
            cost = (-y.dot(np.log(output)) - ((1-y).dot(np.log(1-output))))
            #print(i,cost,errors.sum())
            self.cost.append(cost)
            
        return self
    
    def predict(self, X):
        predicted_class = self.predict_proba(X)>0.5
        return predicted_class # Given the features "predict" outputs the classification given by the model
    
    def predict_proba(self, X):
        predicted_proba = self.activation(self.net_input(X))
        return predicted_proba # Given the features predict_proba outputs the probability that the solution belongs to the class

## Exercise 5


  a) Identify in the function "fit" where the cross-entropy is computed.
  
  b) Visualize the value of the cross-entropy for the different iterations.
  
  c) Evaluate what happens when difficult data is used (uncomment the line tr_data,c = Create_Difficult_Classification_Data(npoints,npoints) in the data-generation cell).
  
  d) Could you interpret what the curve of the cross-entropy tells us about the performance of the algorithm?
  
 

In [None]:
# Auxiliary function from previous Lab used to create some classification data

def Create_Classification_Data(number_points_Class_A,number_points_Class_B):    
    
    # Points in Class A
    xA = 20*np.random.rand(number_points_Class_A)
    shiftA = 20*np.random.rand(number_points_Class_A)
    yA = (4+xA)/2.0 - shiftA - 0.1

    # Points in Class B
    xB = 20*np.random.rand(number_points_Class_B)
    shiftB = 20*np.random.rand(number_points_Class_B)
    yB = (4+xB)/2.0 + shiftB + 0.1

    
    c = np.hstack((np.ones((number_points_Class_A)),np.zeros((number_points_Class_B))))
    #print(c.shape)

    # We create the training data concatenating examples from the two classes XA and XB
    tr_data = np.hstack((np.vstack((xA,yA)),np.vstack((xB,yB)))).transpose()
    #print(training_data.shape)

    return tr_data,c
    
    

In [None]:
# Auxiliary function from previous Lab used to create some classification data

def Create_Difficult_Classification_Data(number_points_Class_A,number_points_Class_B):    
    
    # Points in Class A
    xA1 = 20*np.random.rand(number_points_Class_A)
    shiftA1 = 20*np.random.rand(number_points_Class_A)
    yA1 = (4+xA1)/2.0 - shiftA1 + 5.0

    # Points in Class B
    xB1 = 20*np.random.rand(number_points_Class_B)
    shiftB1 = 20*np.random.rand(number_points_Class_B)
    yB1 = (4+xB1)/2.0 + shiftB1 - 5.0

    # Sinusoidal curve dividing the two classes      
    x2 = np.linspace(0, 20, 2000)
    y2 = 20*np.cos(0.2*np.pi*x2) 
    
    c = np.hstack((np.ones((number_points_Class_A)),np.zeros((number_points_Class_B))))
    #print(c.shape)

    # We create the training data concatenating examples from the two classes XA and XB
    tr_data = np.hstack((np.vstack((xA1,yA1)),np.vstack((xB1,yB1)))).transpose()
    #print(training_data.shape)

    return tr_data,c

In [None]:
# Number of points in each class
npoints = 150
# We generate the data for classification
tr_data,c = Create_Classification_Data(npoints,npoints)
#tr_data,c = Create_Difficult_Classification_Data(npoints,npoints)


# We define the LogisticRegression object
mylr = MyLogisticRegression(eta=0.01, n_iter=20, random_state=10)

# Logistic regression learns from data
mylr = mylr.fit(tr_data,c)

In [None]:
# Auxiliary function to draw the cross-entropy
def plot_cross_entropy(points,fsize):
    plt.figure()
    plt.xlabel('Iteration', fontsize=fsize)
    plt.ylabel('Cross-entropy', fontsize=fsize)
    plt.plot(points,'-m',lw=4)
    plt.show()
        

In [None]:
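# cross_entropy must hold the per-iteration cross-entropy values recorded during fit (see Exercise 5)
plot_cross_entropy(cross_entropy,14)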