# Logistic Classifier

![Capture.PNG](attachment:Capture.PNG)

### Weights and Bias in TensorFlow

The goal of training a neural network is to modify the weights and biases so that they best predict the labels. To use weights and biases, we need a Tensor that can be modified. This rules out <mark>tf.placeholder()</mark> and <mark>tf.constant()</mark>, since those Tensors can't be modified. This is where the <mark>tf.Variable</mark> class comes in.

##### tf.Variable()

The <mark>tf.Variable</mark> class creates a tensor with an initial value that can be modified, much like a normal Python variable. This tensor stores its state in the session, so you must initialize the state of the tensor manually. You'll use the <mark>tf.global_variables_initializer()</mark> function to initialize the state of all the Variable tensors.

The <mark>tf.global_variables_initializer()</mark> call returns an operation that initializes all TensorFlow variables in the graph. You run the operation in a session to initialize all the variables, as shown below. Using the <mark>tf.Variable</mark> class allows us to change the weights and bias, but an initial value still needs to be chosen.

In [None]:
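## Initialization
import tensorflow as tf  # TensorFlow 1.x API

# Build an operation that initializes every tf.Variable in the graph
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # Run the operation so the variables hold their initial values
    sess.run(init)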

##### tf.truncated_normal()

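Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from getting stuck in the same place every time you train it. Choosing weights from a normal distribution prevents any one weight from overwhelming the others. The <mark>tf.truncated_normal()</mark> function returns a tensor of random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.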

``` python
## Example
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))

```
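To make the idea of truncated-normal initialization concrete, here is a minimal plain-Python sketch that does not use TensorFlow. The helper names `truncated_normal_value` and `init_weights` are hypothetical, and the rejection-sampling loop is only an assumption about how such values could be drawn: any sample further than 2 standard deviations from the mean is discarded and drawn again.

```python
import random


def truncated_normal_value(rng, mean=0.0, stddev=1.0):
    """Draw one normal sample, re-drawing until it lies within 2 stddev of the mean."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            return value


def init_weights(n_features, n_labels, seed=0):
    """Hypothetical helper: an n_features x n_labels matrix of truncated-normal values."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [
        [truncated_normal_value(rng) for _ in range(n_labels)]
        for _ in range(n_features)
    ]


weights = init_weights(n_features=120, n_labels=5)
largest = max(abs(w) for row in weights for w in row)
print(len(weights), len(weights[0]), largest)
```

Because of the truncation, no initial weight is larger than 2 in magnitude, which is one way to keep a single weight from dominating at the start of training.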


##### tf.zeros()

Since the weights are already helping prevent the model from getting stuck, you don't need to randomize the bias. Let's use the simplest solution, setting the bias to 0.
``` python
## Example
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))
```

### Softmax Probabilities

For multiclass problems, the softmax function is used to determine the class of the output. It converts the scores into proper probabilities (which sum to 1), and it naturally suppresses low scores and boosts high scores.

In logistic regression, the scores are called logits.

![softmax.PNG](attachment:softmax.PNG)


In [7]:
## Example
import tensorflow as tf

logit_data = [2.0, 1.0, 0.1]

logits = tf.placeholder(tf.float32)

softmax = tf.nn.softmax(logits)

with tf.Session() as sess:
    output = sess.run(softmax, feed_dict={logits: logit_data})
    print(output)

[ 0.65900117  0.24243298  0.09856589]
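The same numbers can be reproduced without TensorFlow. The sketch below assumes the standard softmax definition, S(y_i) = exp(y_i) / sum_j exp(y_j), and uses a hypothetical helper named `softmax`. Subtracting the largest logit before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import math


def softmax(logits):
    """Convert a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # stability trick: does not change the result
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])    # same logits as the TensorFlow cell
sharp = softmax([20.0, 10.0, 1.0])  # logits multiplied by 10
flat = softmax([0.2, 0.1, 0.01])    # logits divided by 10

print(probs)
print(sharp)
print(flat)
```

Scaling the logits up pushes the output toward a one-hot vector, while scaling them down pushes it toward a uniform distribution. This is the "boost high scores, suppress low scores" behaviour described above.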


### One Hot Encoding 

One-hot encoding represents each label/class as a vector of ones and zeros, with a single 1 at the position of that class. For example, if we have three classes A, B and C, they are encoded as follows:

A = 1,0,0

B = 0,1,0

C = 0,0,1
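The mapping above can be sketched in a few lines of plain Python. The helper name `one_hot_encode` is hypothetical and only illustrates the idea.

```python
def one_hot_encode(label, classes):
    """Return a vector with 1.0 at the position of `label` and 0.0 elsewhere."""
    return [1.0 if candidate == label else 0.0 for candidate in classes]


classes = ["A", "B", "C"]
encoded = {label: one_hot_encode(label, classes) for label in classes}

for label in classes:
    print(label, "=", encoded[label])
```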

### Cross Entropy

When the number of classes is huge (thousands or millions), the one-hot encoded vector becomes very long.
Cross entropy gives a way to measure the distance between two probability vectors, such as the softmax output and the one-hot encoded label.

To create a cross entropy function in TensorFlow, you'll need to use two new functions:

```python
x = tf.reduce_sum([1, 2, 3, 4, 5])  # 15
x = tf.log(100.0)  # 4.60517 (natural log; the input must be a float)
```

In [8]:
## Example

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# Cross entropy: -sum(one_hot * log(softmax))
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))

# Print cross entropy from session
with tf.Session() as sess:
    print(sess.run(cross_entropy, feed_dict={softmax: softmax_data, one_hot: one_hot_data}))

0.356675
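As a cross-check, the same quantity, D(S, L) = -sum_i L_i * log(S_i), can be computed in plain Python. The helper name `cross_entropy` is hypothetical. Skipping the entries where the label is 0 is an assumption made here so that a predicted probability of exactly 0 for a wrong class does not trigger `log(0)`.

```python
import math


def cross_entropy(one_hot, softmax_probs):
    """Distance between a one-hot label and a vector of predicted probabilities."""
    return -sum(
        label * math.log(prob)
        for label, prob in zip(one_hot, softmax_probs)
        if label > 0.0  # terms with a zero label contribute nothing
    )


one_hot_label = [1.0, 0.0, 0.0]
loss_good = cross_entropy(one_hot_label, [0.7, 0.2, 0.1])  # confident and correct
loss_bad = cross_entropy(one_hot_label, [0.1, 0.2, 0.7])   # confident and wrong

print(loss_good)
print(loss_bad)
```

The loss is small when the predicted probability of the true class is high, and it grows as that probability shrinks.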


### Normalized Inputs

It is good practice to give all features zero mean and equal variance. This makes it easier for the optimizer to reach the solution.

How to deal with image normalization?

Since each pixel value ranges from 0 to 255, one way to normalize is to apply the following to each color channel (shown here for the red channel R):

(R-128)/(128)
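A minimal sketch of this rule applied to one RGB pixel follows. The helper name `normalize_channel` is hypothetical, and the pixel values are made up for illustration.

```python
def normalize_channel(value):
    """Map a channel value in [0, 255] to roughly [-1, 1] using (value - 128) / 128."""
    return (value - 128.0) / 128.0


pixel = (0, 128, 255)  # illustrative (R, G, B) values
normalized = tuple(normalize_channel(channel) for channel in pixel)
print(normalized)
```

The darkest value maps to -1, the midpoint 128 maps to 0, and the brightest value maps to just under 1, so the channel is approximately centered on zero.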

### Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent performs gradient descent on small batches of observations instead of on all observations at once.

Benefits: SGD is computationally cheaper. Computing the gradient takes roughly 3 times as much compute as evaluating the loss function, so taking one big step over all the training examples requires a great deal of computation. A better tradeoff is to take many small steps, each computed on a small random sample of the training examples. Each step is only a rough estimate of the true gradient and may point in a poor direction, but because we make multiple passes over the training examples, these small steps tend to add up and reach a minimum.
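The idea can be sketched on a toy problem in plain Python. Everything here is an illustrative assumption rather than part of the original material: the function name `sgd_linear_fit`, the one-dimensional linear model, the mean squared error loss, and the hyperparameter values. The data is generated from y = 2x + 1, so the fit should recover a slope near 2 and an intercept near 1.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.2, batch_size=4, epochs=300, seed=0):
    """Fit y = w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w, grad_b = 0.0, 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * error * xs[i] / len(batch)
                grad_b += 2.0 * error / len(batch)
            # One small step using only this batch's gradient estimate
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [i / 19.0 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each update looks at only 4 of the 20 examples, yet repeated passes over the shuffled data bring the parameters close to the values that generated it.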

### Momentum and Learning Rate Decay

Two techniques help SGD:

+ Momentum: Use knowledge of the previous gradient directions to determine the next direction. One method is to step along a running average of the gradients instead of the gradient of the current batch of data.

+ Learning Rate Decay: Take smaller and smaller steps as training progresses. This helps avoid overshooting a minimum near the solution and helps converge to a better solution.

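Note: AdaGrad is one modification of SGD that adapts the step size for each parameter during training, which acts as a built-in learning rate decay and reduces the need to tune these settings by hand.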
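Both ideas can be combined in a short plain-Python sketch. The function name `descend_with_momentum`, the exponential decay schedule, the hyperparameter values, and the toy objective f(x) = (x - 3)^2 are all assumptions chosen for illustration. For simplicity the gradient is exact rather than estimated from a batch.

```python
def descend_with_momentum(grad_fn, x0, initial_lr=0.1, decay_rate=0.99,
                          beta=0.9, steps=500):
    """Gradient descent using a running average of gradients and a decaying step size."""
    x = x0
    avg_grad = 0.0
    for step in range(steps):
        lr = initial_lr * (decay_rate ** step)  # learning rate decay
        grad = grad_fn(x)
        # Momentum: blend the new gradient into the running average
        avg_grad = beta * avg_grad + (1.0 - beta) * grad
        x -= lr * avg_grad
    return x


# Gradient of f(x) = (x - 3)^2, whose minimum is at x = 3
x_min = descend_with_momentum(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

The running average smooths the direction from step to step, while the shrinking learning rate lets the later steps settle near the minimum instead of jumping past it.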

### GOAL

![GOAL.PNG](attachment:GOAL.PNG)

+ Implement get_weights to return a tf.Variable of weights
+ Implement get_biases to return a tf.Variable of biases
+ Implement xW + b in the linear function
+ Initialize all weights
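The linear function xW + b from the list above can be sketched in plain Python before writing it in TensorFlow. This is not the TensorFlow solution to the exercise: `get_biases` and `linear` here are hypothetical pure-Python stand-ins that operate on nested lists, and the input and weight values are made up so the arithmetic is easy to follow.

```python
def get_biases(n_labels):
    """Biases start at zero, one per label."""
    return [0.0] * n_labels


def linear(x, weights, biases):
    """Compute xW + b for x of shape (n_samples, n_features)
    and weights of shape (n_features, n_labels)."""
    n_features = len(weights)
    n_labels = len(biases)
    return [
        [
            sum(row[k] * weights[k][j] for k in range(n_features)) + biases[j]
            for j in range(n_labels)
        ]
        for row in x
    ]


x = [[1.0, 2.0], [3.0, 4.0]]               # 2 samples, 2 features
W = [[0.5, -1.0, 0.0], [0.25, 0.0, 2.0]]   # 2 features, 3 labels
b = get_biases(3)

logits = linear(x, W, b)
print(logits)
```

Each row of the result holds the logits for one sample, which would then be passed through softmax to obtain class probabilities.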