# What are Optimization Algorithms?


## Gradient Descent

    Gradient descent is one of the most popular optimisation methods. The idea is to update the variables iteratively in the direction opposite to the gradient of the objective function. With every update, the variables move closer to a minimum, and the method gradually converges towards an optimal value of the objective function.
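
The update rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimiser: the function names (`gradient_descent`, `grad_f`), the quadratic objective, the learning rate and the step count are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad(x)
        # Move each variable in the opposite direction of its partial derivative.
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


# Illustrative objective (an assumption): f(x, y) = (x - 3)^2 + (y + 1)^2,
# whose minimum lies at (3, -1).
def grad_f(v):
    x, y = v
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


minimum = gradient_descent(grad_f, x0=[0.0, 0.0])
print(minimum)  # close to [3.0, -1.0]
```

On this convex objective every step shrinks the distance to the minimum by a constant factor, which is why a fixed learning rate is enough here.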


## Stochastic Gradient Descent 
    
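    Stochastic gradient descent (SGD) was proposed to address the computational cost of each iteration on large-scale data. In its standard form, the update for a single randomly chosen sample $(x^{(i)}, y^{(i)})$ is:

    $$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$$

    where $\theta$ denotes the parameters, $\eta$ the learning rate and $J$ the loss function.
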
    Back-propagation computes the gradient of the loss function with respect to each parameter. The optimiser then uses these gradients to adjust the parameters iteratively in order to reduce the loss.

    In this method, one randomly chosen sample is used to estimate the gradient and update the parameters (theta) in each iteration, instead of computing the exact gradient over the whole dataset. The stochastic gradient is an unbiased estimate of the true gradient. This reduces the time per update when dealing with large numbers of samples and removes a certain amount of computational redundancy.
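
As a rough sketch of the single-sample update, the snippet below fits a line `y = w * x + b` with SGD. The toy data, the squared-error loss, the learning rate and the helper name `sgd_linear_regression` are assumptions made for illustration only.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, n_steps=5000, seed=0):
    """Fit y = w * x + b using one randomly chosen sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        i = rng.randrange(len(xs))           # pick a single sample at random
        error = (w * xs[i] + b) - ys[i]      # prediction error on that sample
        # Gradient of 0.5 * error^2 with respect to w and b, for this sample only.
        w -= learning_rate * error * xs[i]
        b -= learning_rate * error
    return w, b


# Toy, noise-free data generated from y = 2x + 1 (an assumption for the example).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(w, b)  # close to 2.0 and 1.0
```

Each update touches one sample, so its cost does not grow with the size of the dataset; the price is a noisier path towards the minimum.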

## Adaptive Learning Rate Method 
    
    The learning rate is one of the key hyperparameters to tune. It sets the size of each update step. If the learning rate is too high, the updates can overshoot the minimum and the model may oscillate or diverge. If it is too low, training converges slowly and can stall before reaching a good solution. The learning rate has a great influence on SGD, and setting the right value can be challenging. Adaptive methods were proposed to do this tuning automatically.
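
A small experiment can make the effect of the learning rate concrete. The sketch below minimises `f(x) = x^2` with three different rates; the objective, the three values and the helper name `run_gd` are assumptions picked for illustration.

```python
def run_gd(learning_rate, n_steps=50, x0=1.0):
    """Minimise f(x) = x^2 (gradient 2x) and return the final distance to 0."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * 2.0 * x
    return abs(x)


too_high = run_gd(1.1)     # each step multiplies x by -1.2, so |x| grows
reasonable = run_gd(0.1)   # each step multiplies x by 0.8, so x shrinks quickly
too_low = run_gd(0.001)    # each step multiplies x by 0.998, so progress is slow

print(too_high, reasonable, too_low)
```

With the high rate the iterates overshoot and diverge, with the low rate they barely move in 50 steps, and only the middle value gets close to the minimum.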

    Adaptive variants of SGD are widely used in deep neural networks (DNNs). Methods such as AdaDelta, RMSProp and Adam use exponential moving averages of past (squared) gradients to provide effective updates and simplify the calculation.

    AdaGrad: parameters with large accumulated gradients get a lower learning rate, and vice versa.
    RMSProp: modifies AdaGrad with an exponentially decaying average of squared gradients, which counteracts AdaGrad's monotonically decreasing learning rate.
    Adam: similar to RMSProp, but with momentum.
    Alternating Direction Method of Multipliers (ADMM): not an adaptive learning rate method, but another alternative to stochastic gradient descent (SGD).

    The difference between plain gradient descent and AdaGrad is that the learning rate is no longer fixed. It is computed from all the historical gradients accumulated up to the latest iteration.
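
The three adaptive rules listed above can be compared side by side on a single parameter. The following is a simplified sketch using the commonly published update formulas; the hyperparameter values, the `state` dictionary and the helper names are assumptions for this example, not the API of any particular library.

```python
import math


def adagrad_step(theta, grad, state, lr=0.5, eps=1e-8):
    # Accumulate ALL past squared gradients; the effective rate only shrinks.
    state["sum_sq"] = state.get("sum_sq", 0.0) + grad * grad
    return theta - lr * grad / (math.sqrt(state["sum_sq"]) + eps)


def rmsprop_step(theta, grad, state, lr=0.01, decay=0.9, eps=1e-8):
    # Exponentially decaying average, so old gradients are gradually forgotten.
    state["avg_sq"] = decay * state.get("avg_sq", 0.0) + (1 - decay) * grad * grad
    return theta - lr * grad / (math.sqrt(state["avg_sq"]) + eps)


def adam_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # RMSProp-style scaling (v) combined with a momentum term (m).
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    v = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimise(step, n_steps=5000, theta0=5.0):
    """Minimise f(theta) = theta^2 (gradient 2 * theta) with the given rule."""
    theta, state = theta0, {}
    for _ in range(n_steps):
        theta = step(theta, 2.0 * theta, state)
    return theta


results = {
    "adagrad": minimise(adagrad_step),
    "rmsprop": minimise(rmsprop_step),
    "adam": minimise(adam_step),
}
print(results)  # all three end up near the minimum at 0
```

Note that the only difference between `adagrad_step` and `rmsprop_step` is how the squared gradients are aggregated (running sum versus decaying average), which is exactly the adjustment described in the list.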

## Conjugate Gradient Method
    
    The conjugate gradient (CG) method is used for solving large-scale linear systems of equations and nonlinear optimisation problems. First-order methods converge slowly, whereas second-order methods are resource-heavy. Conjugate gradient is an intermediate algorithm: it uses only first-order information while approaching the convergence speed of higher-order methods.
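
For the linear-system case, the textbook CG iteration fits in a short function. The sketch below assumes a symmetric positive-definite matrix stored as a list of rows; the 2x2 system and the helper names are illustrative assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(matrix, v):
    return [dot(row, v) for row in matrix]


def conjugate_gradient(A, b, tol=1e-20, max_iter=None):
    """Solve A x = b for a symmetric positive-definite A (list of rows)."""
    n = len(b)
    max_iter = max_iter or n          # in exact arithmetic CG needs at most n steps
    x = [0.0] * n
    r = b[:]                          # residual b - A x for the initial guess x = 0
    p = r[:]                          # first search direction: steepest descent
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:              # residual already negligible
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)   # exact step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction: residual made conjugate to the previous direction.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print(x)  # close to [1/11, 7/11]
```

Only matrix-vector products are needed, so no second-derivative matrix is inverted or factorised, yet the 2x2 system is solved in two iterations.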

    
## Derivative-Free Optimisation 
    
    Some optimisation problems cannot be approached through a gradient, because the derivative of the objective function may not exist or may not be easy to calculate. This is where derivative-free optimisation comes into the picture. It relies on heuristic algorithms that favour choices that have already worked well, rather than deriving solutions systematically. Classical simulated annealing, genetic algorithms and particle swarm optimisation are a few such examples.
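
As one example of such a heuristic, here is a minimal simulated annealing sketch for a one-dimensional objective that is not differentiable everywhere. The objective, the geometric cooling schedule, the proposal step size and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(f, x0, n_steps=20000, step_size=0.5,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Minimise f using random proposals and a geometric cooling schedule."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    for k in range(n_steps):
        # Temperature decays geometrically from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (k / max(n_steps - 1, 1))
        candidate = x + rng.uniform(-step_size, step_size)
        f_candidate = f(candidate)
        delta = f_candidate - fx
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, fx = candidate, f_candidate
            if fx < best_fx:
                best_x, best_fx = x, fx
    return best_x, best_fx


# Illustrative objective with kinks and several local minima (an assumption).
def objective(x):
    return abs(x - 2.0) + 0.5 * abs(math.sin(5.0 * x))


best_x, best_fx = simulated_annealing(objective, x0=-5.0)
print(best_x, best_fx)
```

The method only ever evaluates `objective`, never its derivative. Occasionally accepting worse moves while the temperature is high is what lets it escape local minima.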

## Zeroth Order Optimisation
    
    Zeroth-order optimisation was introduced more recently to address the shortcomings of derivative-free optimisation. Derivative-free methods are difficult to scale to large problems and lack convergence rate analysis.

    The advantages of zeroth-order optimisation include:

    Ease of implementation, with only a small modification of commonly used gradient-based algorithms.
    Computationally efficient approximations to derivatives when they are difficult to compute.
    Convergence rates comparable to first-order algorithms.
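
The "small modification" mentioned above can be illustrated by swapping the exact gradient in gradient descent for an estimate built from function values only. The sketch below uses a two-point estimator along random Gaussian directions; the smoothing radius `mu`, the number of directions, the test function and the helper names are assumptions for this example.

```python
import random


def zo_gradient(f, x, mu=1e-4, n_directions=10, rng=random):
    """Estimate the gradient of f at x from function evaluations only."""
    fx = f(x)
    estimate = [0.0] * len(x)
    for _ in range(n_directions):
        u = [rng.gauss(0.0, 1.0) for _ in x]               # random direction
        x_shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_shifted) - fx) / mu                   # directional finite difference
        estimate = [ei + slope * ui / n_directions
                    for ei, ui in zip(estimate, u)]
    return estimate


def zo_gradient_descent(f, x0, learning_rate=0.05, n_steps=500, seed=0):
    """Plain gradient descent, with the gradient replaced by zo_gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        g = zo_gradient(f, x, rng=rng)
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


# Illustrative black-box objective with its minimum at (1, -2) (an assumption).
def sphere(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


solution = zo_gradient_descent(sphere, [0.0, 0.0])
print(solution)  # close to [1.0, -2.0]
```

Apart from the call to `zo_gradient`, the loop is identical to ordinary gradient descent, which is what makes this family of methods easy to bolt onto existing gradient-based code.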


# What is the Activation Function in ML for Multi-class Classification?

## Softmax
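
Softmax turns a vector of raw scores (logits) into a probability distribution over the classes, which is why it is the usual output activation for multi-class classification, as in the `Dense(3, activation='softmax')` layer of the Keras model in this section. The following is a minimal plain-Python sketch of the function itself; the example scores are made up for illustration.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for three classes
print(probs)                        # roughly [0.659, 0.242, 0.099]
```

The class with the largest score receives the largest probability, and the predicted class is simply the index of the maximum.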

```python
import pandas
from keras.models import Sequential
from keras.layers import Dense
# Note: keras.wrappers.scikit_learn ships only with older Keras releases;
# it was deprecated in favour of the separate SciKeras package.
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import to_categorical
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# load dataset
dataframe = pandas.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:4].astype(float)
Y = dataset[:, 4]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = to_categorical(encoded_Y)


# define baseline model
def baseline_model():
    # create model: 4 inputs -> 8 hidden units (ReLU) -> 3 class probabilities (softmax)
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)

# 10-fold cross-validation with shuffled folds
kfold = KFold(n_splits=10, shuffle=True)

results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
```

Output from the original run (exact figures vary between runs, because the folds are shuffled and the weights are randomly initialised):

```
Baseline: 96.67% (3.33%)
```
