In [1]:
import numpy as np
from math import *
import cv2 as cv

Problem with imbalanced dataset:<br>
Most machine learning algorithms implicitly assume that the data is balanced, i.e., that the examples are distributed roughly equally among all classes. When a model is trained on an imbalanced dataset, learning becomes biased towards the majority classes. With more examples to learn from, the model performs well on the majority classes, but for lack of enough examples it fails to learn meaningful patterns for the minority classes.<br>
Loss functions for multi-class classification with imbalanced dataset:<br>
Focal Loss<br>
Ref1(Research paper): https://arxiv.org/pdf/1708.02002.pdf<br>
Ref2: https://youtu.be/Y8_OVwK4ECk<br>
Ref3: https://medium.com/gumgum-tech/handling-class-imbalance-by-introducing-sample-weighting-in-the-loss-function-3bdebd8203b4 <br>

In [4]:
#To understand focal loss we need to first understand what is cross-entropy loss
#cross entropy loss
def cross_entropy(x,y):
    assert(len(x)==len(y))
    for e in y:
        assert((e==0 or e==1))
    for e in x:
        assert(e<1 and e>0)
    m=len(x)
    loss=lambda m,n:-log(m) if n==1 else -log(1-m)
    
    sigma=0
    i=0
    for e in x:
        sigma+=loss(e,y[i])
        i+=1
    result=sigma/m
    return result

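For the above cost function, plotting the loss against the predicted probability shows that even when the probability of the correct class is significantly greater than 0.5, the loss is still non-trivial. Summed over the many easy examples of the majority class, this loss can dominate training, which should be avoided. Focal loss addresses this with two hyperparameters: alpha (a class weight for imbalanced data) and gamma (the power to which the factor (1-p) multiplying the cross-entropy term is raised, so that well-classified examples contribute less).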
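
To make the down-weighting concrete, here is a minimal sketch comparing the per-example cross-entropy and focal terms for a positive example (y=1). The helper names `ce_term` and `focal_term` are only illustrative, and gamma=2 is an assumed setting.

```python
from math import log

def ce_term(p):
    # cross-entropy for a single positive example predicted with probability p
    return -log(p)

def focal_term(p, gamma=2):
    # focal loss for the same example: cross-entropy scaled by (1 - p)**gamma
    return (1 - p) ** gamma * ce_term(p)

# the more confident the (correct) prediction, the more the focal term shrinks
for p in (0.6, 0.9, 0.99):
    print(p, ce_term(p), focal_term(p))
```

With gamma=2 the scaling factor is 0.16 at p=0.6 but only 0.01 at p=0.9, so easy examples are suppressed far more than hard ones.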

In [9]:
def focal_loss(x,y,gamma=2,alpha=None):
    assert(len(x)==len(y))
    for e in y:
        assert(e==0 or e==1)
    for e in x:
        assert(e<1 and e>0)
    
    m=len(x)
    if alpha is None or alpha<0 or alpha>1:
        loss=lambda m,n:(-(1-m)**gamma)*log(m) if n==1 else (-(m**gamma)*log(1-m))
    else:
        loss=lambda m,n:-(alpha*(1-m)**gamma)*log(m) if n==1 else (-(1-alpha)*(m**gamma)*log(1-m))
    sigma=0
    i=0
    for e in x:
        sigma+=loss(e,y[i])
        i+=1
    result=sigma/m
    return result
    

In [10]:
#testing 
x=[0.2,0.2,0.2,0.2,0.2,0.2,0.9,0.2,0.2,0.2]
y=np.round(x)
print(cross_entropy(x,y))
print(focal_loss(x,y))
print(focal_loss(x,y,2,0.25))
print(focal_loss(x,y,3,0.25))

0.21136524774857138
0.008138528362969378
0.00605121601439812
0.0012076091899881787


Another way of dealing with an imbalanced dataset is to assign each class a weight based on its number of samples, so that under-represented classes have a larger effect on the cost function.
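
As a rough illustration of how such weights could be computed, the sketch below derives per-class weights from the label counts. The function name `class_weights` is hypothetical. INS uses 1/n and ISNS uses 1/sqrt(n); the ENS branch assumes the effective-number form (1-beta^n)/(1-beta) discussed in Ref3, with `beta` an assumed hyperparameter.

```python
from collections import Counter
from math import sqrt

def class_weights(labels, scheme="INS", beta=0.999):
    counts = Counter(labels)  # number of samples per class
    if scheme == "INS":
        raw = {c: 1 / n for c, n in counts.items()}
    elif scheme == "ISNS":
        raw = {c: 1 / sqrt(n) for c, n in counts.items()}
    elif scheme == "ENS":
        # inverse of the effective number of samples (assumed form, see lead-in)
        raw = {c: (1 - beta) / (1 - beta ** n) for c, n in counts.items()}
    else:
        raise ValueError("scheme must be 'INS', 'ISNS' or 'ENS'")
    largest = max(raw.values())
    # normalize so that the rarest class gets weight 1
    return {c: w / largest for c, w in raw.items()}

labels = [0] * 90 + [1] * 10  # 90 majority samples, 10 minority samples
print(class_weights(labels, "INS"))
print(class_weights(labels, "ISNS"))
print(class_weights(labels, "ENS"))
```

ISNS gives a milder correction than INS: here the majority class gets weight 1/3 instead of 1/9.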

In [None]:
#Weighted loss
#there are several ways of evaluating the weights of each class:
#Inverse number of samples(INS)
#Inverse of square root of number of samples(ISNS) 
#Effective number of samples(ENS), for this read Ref 3

#input for the below function is x(a list of lists having softmax activations corresponding to each y_i) 
#and the expected output y(a list of class indices)
def weighted_loss(x,y,scheme="INS"):
    assert(len(x)==len(y))
    y=list(y)
    y_dash=sorted(set(y))
    weights=[]
    for e in y_dash:
        freq=y.count(e)
        if scheme=="INS":
            weights.append(1/freq)
        elif scheme=="ISNS":
            weights.append(1/sqrt(freq))
        else:
            raise ValueError("scheme must be 'INS' or 'ISNS'")
    #normalize so that the largest weight is 1
    max_weight=max(weights)
    normalized=[w/max_weight for w in weights]
    
    def loss(p,q):
        #weight of the expected class q
        weight=normalized[y_dash.index(q)]
        max_p=max(p)
        #compare the expected class with the predicted(arg-max) class
        if q==p.index(max_p):
            return -weight*log(max_p)
        else:
            return -weight*log(1-max_p)
        
    sigma=0
    i=0
    for e in x:
        sigma+=loss(e,y[i])
        i+=1
    
    result=sigma/len(y)
    
    return result

    

Evaluation metrics for object detection<br>
mAP (mean Average Precision) is the standard metric for evaluating the performance of an object detection algorithm. <br>
Ref1 for mAP: https://blog.roboflow.com/mean-average-precision/<br>
Ref2 for mAP: https://youtu.be/FppOzcDvaDI<br>

Before going to mAP, it is necessary to know Precision, Recall, F1, IOU (Intersection over Union) and AP (Average Precision). Below, T.P., F.P. and F.N. denote the number of true positives, false positives and false negatives.<br>
Precision=T.P./(T.P.+F.P.)<br>
Recall=T.P./(T.P.+F.N.)<br>
F1 Score=2*P*R/(P+R)<br>
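
The three formulas above can be checked with a small sketch. The function name `precision_recall_f1` and the counts are made up for illustration, and returning 0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    # each metric falls back to 0.0 when its denominator is zero (assumed convention)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# 8 correct detections, 2 spurious detections, 4 missed objects
print(precision_recall_f1(tp=8, fp=2, fn=4))
```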

Semantic segmentation, which labels every pixel of an image with a class, uses evaluation metrics similar to those for object detection.<br>
Ref for semantic segmentation:https://youtu.be/uiE56h5LyXc

In [13]:
#IOU(Intersection over union):It is the ratio of the area of intersection of the bounding box predicted by a model 
#and the ground truth bounding box to the area of their union.
def IOU(x,y):# input format (bottom left x,bottom left y, width, height)
    assert(len(x)==4 and len(y)==4)
    
    #width and height of the overlapping rectangle
    width=min(x[0]+x[2],y[0]+y[2])-max(x[0],y[0])
    height=min(x[1]+x[3],y[1]+y[3])-max(x[1],y[1])
    
    #no overlap(or the boxes only touch)
    if width<=0 or height<=0:
        return 0
    
    else:
        intersection=width*height
        union=x[2]*x[3]+y[2]*y[3]-intersection
        return intersection/union
    
    

IOU is used as a threshold for tagging a predicted bounding box as T.P. or F.P.: a prediction counts as T.P. if its IOU with a ground truth box is at least the threshold, and as F.P. otherwise.<br> 
The precision-recall graph is plotted after classifying each bounding box as discussed above. Note that the graph is built progressively: the predictions are sorted by confidence, and precision and recall are recalculated after each prediction is added. Watch the Ref2 video for the details.
To calculate the AP of a particular class, the area under this graph is calculated. The mAP is then the mean of the APs of all the classes. Finally, the mean is taken again over various IOU threshold values (say from a to b in steps of c).
This is represented as mAP@a:c:b.
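
A minimal sketch of the progressive AP calculation for one class is given below. It assumes each detection has already been tagged as T.P. or F.P. using the IOU threshold, and it sums the area under the raw precision-recall points with a simple rectangle rule. Benchmarks such as PASCAL VOC and COCO use interpolated variants, so their numbers can differ. The function name `average_precision` and the sample values are hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    # process detections from the most to the least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        # area of the rectangle added by this step of the curve
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

# four detections for one class, four ground truth boxes in total
ap_class_a = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], num_gt=4)
ap_class_b = average_precision([0.95, 0.5], [True, True], num_gt=2)
# mAP at a single IOU threshold is the mean of the per-class APs
print(ap_class_a, ap_class_b, (ap_class_a + ap_class_b) / 2)
```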

Batch normalization<br>
It is often necessary to normalize the input dataset (rescale the data to the interval [0,1]) when the features are on varied scales, in order to stabilize the training of a neural network and to speed up convergence. In batch normalization, the outputs of the layers inside the neural network are normalized as well, for the same reasons.<br>Precisely, what batch normalization does is standardization (subtracting the mean and then dividing by the standard deviation) of each activation across the samples in a batch (and NOT across the various activations of a single sample).<br>
Ref:https://youtu.be/dXB-KQYkzNU <br>
<br>
Layer normalization<br>
The only difference from batch normalization is that the standardization carried out after each layer is w.r.t. all the activations corresponding to a single input sample. This is not very useful when the inputs are images, but it is useful for networks whose inputs are of variable length (e.g., an RNN (Recurrent Neural Network)).<br>
Ref for layer normalization :https://youtu.be/eyPZ9Mrhri4<br>
Ref for RNN:https://youtu.be/Y2wfIKQyd1I<br>
RNNs are used for sequential data, where the depth of the unrolled network depends on the length of the input (usually the case in NLP).<br>
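The difference between batch and layer normalization comes down to the axis along which the mean and standard deviation are computed. The sketch below assumes activations stored as a (samples, features) array. It leaves out the learnable scale and shift parameters and the running statistics used at inference time. The helper name `standardize` is illustrative.

```python
import numpy as np

def standardize(a, axis, eps=1e-5):
    # subtract the mean and divide by the standard deviation along the given axis
    mean = a.mean(axis=axis, keepdims=True)
    var = a.var(axis=axis, keepdims=True)
    return (a - mean) / np.sqrt(var + eps)

# 4 samples in the batch, 3 activations per sample
activations = np.array([[1.0, 2.0, 3.0],
                        [4.0, 6.0, 8.0],
                        [7.0, 10.0, 13.0],
                        [10.0, 14.0, 18.0]])

batch_normed = standardize(activations, axis=0)  # statistics over the batch, per activation
layer_normed = standardize(activations, axis=1)  # statistics over the activations, per sample

print(batch_normed.mean(axis=0))
print(layer_normed.mean(axis=1))
```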
<br>
Activation Functions<br>
Activation functions are functions that produce an output based on the input they receive from a layer in a neural network. Generally, non-linear activation functions (e.g., tanh, sigmoid, ReLU) are used, because it can be shown that with linear activation functions the output y is ultimately just a linear function of the inputs, so the hidden layers add nothing to the network's ability to express complex functions.<br>
Ref: https://youtu.be/NkOv_k7r6no<br>
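The claim about linear activations can be verified numerically. In the sketch below (random weights, biases omitted for brevity, all names illustrative), two stacked linear layers give exactly the same output as a single layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))  # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=(3,))

two_linear_layers = W2 @ (W1 @ x)   # hidden layer with a linear (identity) activation
one_linear_layer = (W2 @ W1) @ x    # a single layer with the combined weight matrix

print(np.allclose(two_linear_layers, one_linear_layer))
```

Inserting a non-linearity such as ReLU between the two layers breaks this equivalence in general, which is what lets the hidden layer contribute.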
<br>

Dropout regularization<br>
Regularization is, in layman's terms, a set of ways of reducing the chances of over-fitting by a deep learning network. Dropout regularization, in particular, involves randomly deactivating neurons for different samples during training, so that a particular feature cannot depend too heavily on the weights connected to any one activation. This way the problem of over-fitting is taken care of.<br>
Ref:https://youtu.be/lcI8ukTUEbo<br>
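A minimal sketch of dropout at training time is given below. It assumes the common "inverted dropout" variant, in which the surviving activations are divided by `keep_prob` so that their expected value is unchanged. At test time no units are dropped. The function name and the value of `keep_prob` are illustrative.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    # each unit is kept independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # rescale the kept units so the expected activation stays the same
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((2, 5))
print(dropout(a, keep_prob=0.8, rng=rng))
```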
<br>

Pooling layers<br>
Mainly of two types:-<br>
Max pooling(most common):Based on the hyperparameters (stride and dimensions of the filter), the filter is slid over a matrix (say 2-D) and outputs the max of all the elements in each window. If the number of channels is say n_c, then the output also has the same number of channels and pooling is applied on each channel separately.<br>
Average pooling(rare):Outputs the average of the elements in each window.<br>
Note that there are no parameters to learn. Also, usually no padding is used. Intuitively, the main purpose of pooling is to keep the strongest feature response in each part of the matrix. <br>
Ref: https://www.coursera.org/learn/convolutional-neural-networks/lecture/hELHk/pooling-layers<br>
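The pooling operation can be sketched in a few lines of NumPy. The function name `pool2d`, the (height, width, channels) layout and the 2X2 filter with stride 2 are assumptions made for illustration; no padding is used.

```python
import numpy as np

def pool2d(a, f=2, stride=2, mode="max"):
    # a has shape (height, width, n_c); pooling is applied to each channel separately
    h, w, n_c = a.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w, n_c))
    reduce = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = a[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = reduce(window, axis=(0, 1))
    return out

a = np.arange(16, dtype=float).reshape(4, 4, 1)  # a single-channel 4x4 input
print(pool2d(a, mode="max")[:, :, 0])
print(pool2d(a, mode="average")[:, :, 0])
```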
<br>

R-CNN<br>
Regions with CNN features (R-CNN) is a 2-stage object detection algorithm with several variants (Fast R-CNN and Faster R-CNN). Superficially, the first stage is the selective search algorithm, which returns a given number of candidate regions (bounding boxes). Each region is resized to match the input dimensions of AlexNet (a CNN architecture), which forms the second stage of the algorithm. R-CNN is no longer the cutting-edge algorithm for object detection: the YOLO paper came out in 2015, and YOLO is now used in many areas involving object detection. The problem with R-CNN is that it is computationally heavy (though not as heavy as the sliding-window approach) and very complex. YOLO is a faster, single-stage object detection algorithm. 
<br>
Ref1:https://arxiv.org/pdf/1311.2524.pdf<br>
Ref2:https://towardsdatascience.com/understanding-regions-with-cnn-features-r-cnn-ec69c15f8ea7<br>
Ref3(selective search):https://learnopencv.com/selective-search-for-object-detection-cpp-python/ <br>
Ref4(YOLO algorithm): https://youtu.be/ag3DLKsl2vk<br>
Ref5(new versions): https://youtu.be/IfRMV2MY9n0<br>
<br>

RetinaNet<br>

RetinaNet=ResNet+Feature Pyramid Net+Focal Loss<br>
RetinaNet was the first one-stage object detection architecture that surpassed two-stage detectors w.r.t. accuracy. Two-stage detectors usually have higher accuracy because of the region proposal stage, which significantly reduces the number of negatives. The second stage is a classification network. Because of these two stages, the computation time is higher. On the other hand, one-stage detectors skip the region proposal stage, resulting in lower computation time but lower accuracy due to too many negatives (a single easy negative does not have much effect on learning, but too many of them together do).<br>
Ref: https://youtu.be/infFuZ0BwFQ<br>
<br>

Bounding box regression<br>
For bounding box regression, the y labels of a simple classification network are extended with the parameters of the bounding box (hence the dimensionality of the output increases). 
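
As a rough sketch of what such an extended label could look like, the snippet below appends four box parameters to a one-hot class vector. The layout (class scores first, then x, y, width, height) and the function name `make_target` are assumptions for illustration; actual detectors differ in the exact encoding.

```python
import numpy as np

def make_target(class_index, num_classes, box):
    # one-hot class label, as in a plain classification network
    one_hot = np.zeros(num_classes)
    one_hot[class_index] = 1.0
    # append the bounding box parameters (x, y, width, height)
    return np.concatenate([one_hot, np.asarray(box, dtype=float)])

# class 1 out of 3 classes, with a box given relative to the image size
target = make_target(1, 3, (0.4, 0.3, 0.2, 0.5))
print(target, target.shape)
```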
<br>
<br>

ROI Alignment<br>
??
<br>
<br>

Use of 1X1 convolution<br>
A 1X1 convolution might seem to trivially multiply all the elements by some scalar; however, that is only true for 2-D input matrices. For 3-D inputs, a 1X1 convolution acts like a network within the network: at each spatial position, the filter (of size 1X1XN_C) takes the 1X1XN_C slice of the input and outputs a single value. Using fewer filters than N_C is therefore useful when it is required to reduce the number of channels in a layer.<br>
Ref1: https://arxiv.org/pdf/1312.4400.pdf<br>
Ref2: https://www.coursera.org/learn/convolutional-neural-networks/lecture/ZTb8x/networks-in-networks-and-1x1-convolutions<br>
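The channel reduction can be sketched as follows: applying several 1X1XN_C filters amounts to the same small matrix multiplication over the channels at every spatial position. The shapes (a 6X6X32 input reduced to 8 channels) and the function name `conv1x1` are arbitrary choices for illustration; bias and activation are left out.

```python
import numpy as np

def conv1x1(a, filters):
    # a: (height, width, n_c), filters: (n_c, n_filters)
    # every pixel's channel vector is multiplied by the same filter matrix
    return np.einsum("hwc,cf->hwf", a, filters)

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 6, 32))      # input volume with 32 channels
filters = rng.normal(size=(32, 8))   # eight filters, each of size 1X1X32
out = conv1x1(a, filters)
print(out.shape)
```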
<br>

ResNets/Skip Connections<br>
Although it might seem that adding more layers should not harm a network, experiments showed that learning identity functions is difficult for plain deep networks, so this claim does not hold in practice. What ResNets do is take the input to a layer and add it to the output of the layer two steps ahead (a skip connection). Experiments have shown that this makes deep networks less prone to deterioration when more layers are added.<br>
Ref: https://youtu.be/ZILIbUvp5lk<br>
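A minimal sketch of a residual block with fully connected layers is given below. It assumes the skip connection is added before the final ReLU and omits biases. The names are illustrative. With all weights at zero the block simply passes a non-negative input through, which shows why the identity function becomes easy to learn.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(x, W1, W2):
    # two layers, with the block input added back before the final activation
    hidden = relu(W1 @ x)
    return relu(W2 @ hidden + x)

x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))
# with zero weights the block reduces to relu(x), i.e. the identity for x >= 0
print(residual_block(x, zeros, zeros))
```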
<br>



In [None]:
#real time object detection
