**Difference between traditional CV and DL based CV**

Traditional computer vision (CV) and deep learning (DL) based CV are two different approaches to solving problems in computer vision.

Traditional CV relies on manually designing and engineering features and algorithms to extract useful information from images or videos. This includes techniques such as edge detection, corner detection, template matching, and feature extraction using SIFT or SURF. These features are then fed into classifiers or other algorithms to perform tasks such as object detection, tracking, and recognition.

On the other hand, DL based CV uses neural networks to automatically learn useful features and representations from images or videos. These networks are typically trained on large datasets and can learn to recognize patterns and features that are difficult to engineer by hand. DL based CV includes techniques such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs).

DL based CV has achieved state-of-the-art performance on many computer vision tasks, including image classification, object detection, segmentation, and more. It has also enabled new applications such as image style transfer, super-resolution, and image generation.

Overall, traditional CV and DL based CV are complementary approaches, and the choice of which to use depends on the specific problem and available data. Traditional CV is still useful in scenarios where the available data is limited or the task is simple, while DL based CV is generally more effective for complex tasks with large amounts of data.


Traditional Computer Vision
Relies on basic image processing techniques
Handcrafted feature extraction/Mathematical feature extraction
Classic Machine Learning techniques e.g. Image classification using SVM, Naïve Bayes, Logistic Regression, etc. Image Segmentation using K-Means 


DL based Computer Vision
Data-driven approach (requires image data)
Feature extraction and pattern recognition are learned during the training process
Training neural networks on a dataset of images for the desired task (classification, segmentation, object detection, etc)


**Typical tasks in CV**


Computer vision (CV) is a rapidly growing field with a wide range of applications. Here are some typical tasks in CV:

Image classification: This involves assigning a label or category to an image, such as "dog," "cat," or "car."

Object detection: This involves identifying the location and size of objects within an image or video, and classifying them into specific categories.

Object tracking: This involves following an object over time in a video, even as it moves or changes in appearance.

Image segmentation: This involves dividing an image into different regions or segments based on their properties, such as color, texture, or intensity.

Image registration: This involves aligning two or more images that may be taken from different viewpoints or at different times.

Image restoration: This involves removing noise or other artifacts from an image to improve its quality.

Face recognition: This involves identifying individuals based on their facial features, often used in security systems or social media applications.

3D reconstruction: This involves creating a 3D model of an object or scene from multiple 2D images or video frames.

Optical character recognition (OCR): This involves recognizing text within an image and converting it into a machine-readable format.

Image synthesis: This involves generating new images or videos based on existing data or models, such as image style transfer or video prediction.

These are just a few examples of the many tasks in CV, and new applications are being developed all the time.

**CV pipeline**

A computer vision (CV) pipeline is a series of steps or processes used to perform an analysis or solve a problem in CV. Here's a typical CV pipeline:

Data acquisition: This involves collecting or creating a dataset of images or videos to be used in the analysis.

Preprocessing: This involves applying techniques such as resizing, cropping, and normalization to prepare the images or videos for analysis.

Feature extraction: This involves using traditional CV techniques or deep learning models to extract meaningful features or representations from the images or videos.

Training: This involves using the extracted features or representations to train a model or algorithm to perform a specific task, such as image classification, object detection, or segmentation.

Evaluation: This involves testing the trained model on a separate set of data to measure its performance and assess its accuracy.

Deployment: This involves integrating the trained model into a larger system or application, such as a self-driving car or a medical diagnosis tool.

The specific steps in a CV pipeline can vary depending on the problem and the available data, but this general pipeline provides a framework for organizing and executing the various tasks involved in CV analysis. It's important to note that the pipeline is an iterative process, with each step potentially feeding back into earlier steps to improve the overall performance of the system.

**Digital image as a 2D function and image representation**


An image may be defined as a two-dimensional function, f(x, y), where x and y are spatial (plane) coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the intensity or gray level of the image at that point.

In an 8-bit image each pixel occupies exactly one byte. This means each pixel has 256 possible numerical values, from 0 to 255. Therefore, the color palette for an 8-bit image normally contains 256 entries, defining color 0 through color 255.
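As a minimal sketch of this definition, the snippet below treats a tiny grayscale image as a NumPy array, so that f(x, y) becomes an array lookup. The pixel values are made up for illustration, and the helper name `f` simply mirrors the notation above. Reading the row index as y and the column index as x is the usual array convention, not something the definition itself fixes.

```python
import numpy as np

# A hypothetical 3x4 grayscale image; uint8 gives the 8-bit range 0..255
image = np.array(
    [[0, 64, 128, 255],
     [10, 20, 30, 40],
     [255, 255, 0, 0]],
    dtype=np.uint8,
)

def f(x, y):
    """Intensity at column x, row y (arrays are indexed [row, column])."""
    return int(image[y, x])

height, width = image.shape
print(width, height)               # image size in pixels
print(f(3, 0))                     # intensity of the top-right pixel
print(image.min(), image.max())    # darkest and brightest values present
```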

**Contour Detection**

A contour is a curve joining all the continuous points along a boundary that have the same color or intensity. Contours come in handy for shape analysis, measuring the size of an object of interest, and object detection.
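To make the idea of a boundary concrete, here is a rough NumPy sketch that marks the boundary pixels of a binary shape. It is a simplification: it only flags boundary pixels and does not trace them into ordered curves the way OpenCV's `cv2.findContours` does. The mask and its size are invented for the example.

```python
import numpy as np

# Hypothetical binary mask: a filled 4x6 rectangle on an 8x10 background
mask = np.zeros((8, 10), dtype=bool)
mask[2:6, 2:8] = True

# Pad with background so that looking at neighbours never leaves the array
padded = np.pad(mask, 1, constant_values=False)

# A pixel is interior if its up, down, left and right neighbours are all foreground
interior = (
    padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
)

# Boundary = foreground pixels that are not interior
boundary = mask & ~interior

print(boundary.astype(int))
print(int(boundary.sum()))   # number of boundary pixels
```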


**Edge Detection**

Edge detection is a fundamental image processing technique that aims to identify the boundaries between objects and the background in an image. The edges can be defined as areas of rapid intensity change in an image, and they can be classified into three categories: step edges, ramp edges, and roof edges.

Edge detection is typically performed using various mathematical techniques such as convolution, gradient-based methods, and Laplacian-based methods. One of the most widely used edge detection algorithms is the Canny edge detector, which uses a multi-stage process to identify edges in an image. The first stage involves applying a Gaussian filter to smooth the image and reduce noise. The second stage involves computing the gradient of the image to identify regions of rapid intensity change. Finally, non-maximum suppression and hysteresis thresholding are used to remove false edges and detect true edges.
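The gradient step can be illustrated without any library support. The sketch below slides a 3x3 Sobel kernel over a small synthetic image that contains a single vertical step edge; the image and the helper name `correlate_valid` are assumptions made for the example, and real code would normally call `cv2.Sobel` instead.

```python
import numpy as np

# Hypothetical 5x6 image with a vertical step edge: dark left half, bright right half
image = np.zeros((5, 6), dtype=np.float64)
image[:, 3:] = 100.0

# 3x3 Sobel kernel that responds to intensity change along x
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)

def correlate_valid(img, kernel):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

grad_x = correlate_valid(image, sobel_x)
print(grad_x)   # large values only where the window straddles the step
```

The response is zero in the flat regions and large where the window covers both sides of the step, which is the "rapid intensity change" that edge detectors look for.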

Edge detection is used in many applications, such as object recognition and image segmentation, and in fields including medical imaging, robotics, and autonomous vehicles.

**Code for Edge detection**

In [1]:
import cv2

# read input image as grayscale (flag 0); imread returns None if the file cannot be read
img = cv2.imread('input_image.jpg', 0)
if img is None:
    raise FileNotFoundError("could not read 'input_image.jpg'")

# apply Gaussian filter to smooth the image
blur = cv2.GaussianBlur(img, (3, 3), 0)

# compute the gradient magnitude with the Sobel operator
# (shown for illustration; Canny recomputes the gradient internally)
grad_x = cv2.Sobel(blur, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(blur, cv2.CV_64F, 0, 1, ksize=3)
grad = cv2.magnitude(grad_x, grad_y)

# run Canny on the smoothed 8-bit image; it needs uint8 input, not the float gradient
edges = cv2.Canny(blur, 100, 200, apertureSize=3, L2gradient=True)

# display output image
cv2.imshow('Output Image', edges)
cv2.waitKey(0)
cv2.destroyAllWindows()


In this code, we first read the input image in grayscale using the OpenCV library and stop early if the file cannot be read, because imread() returns None instead of raising an error. Then we apply a Gaussian filter to smooth the image and reduce noise. Next, we compute the gradient of the image using the Sobel operator, which is shown only to illustrate the gradient step. We then pass the smoothed image to Canny() and display the output image using the imshow() function. The waitKey() and destroyAllWindows() functions are used to wait for a key press and close the window, respectively.

Note that the Canny() function in OpenCV performs gradient computation, non-maximum suppression, and hysteresis thresholding internally, so we don't need to perform those steps manually. It does not smooth the image itself, so the Gaussian blur is still applied beforehand. Canny() also expects an 8-bit input image, which is why it receives the blurred image rather than the floating-point Sobel output.

**Canny detection**

The Canny edge detection algorithm is a popular method for detecting edges in an image. It was developed by John F. Canny in 1986 and is still widely used today due to its accuracy and simplicity. The algorithm consists of several steps, including smoothing the image, computing the gradient magnitude and direction, non-maximum suppression, and hysteresis thresholding.

Here is the step-by-step process of the Canny edge detection algorithm:

Apply Gaussian blur to the input image to remove noise.
Compute the gradient magnitude and direction using the Sobel operator.
Perform non-maximum suppression to thin the edges by suppressing non-maximum pixels along the direction of the gradient.
Apply hysteresis thresholding to remove weak edges and connect strong edges. This involves setting two threshold values: a low threshold and a high threshold. Any edge with a magnitude below the low threshold is suppressed, while edges with a magnitude above the high threshold are retained. Edges with a magnitude between the low and high thresholds are retained only if they are connected to strong edges.
Finally, display the detected edges on the output image.
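The hysteresis step is the least obvious of these, so here is a small NumPy sketch of it. It assumes 8-connectivity between pixels and uses made-up gradient magnitudes; the function name `hysteresis` is hypothetical and this is not how OpenCV implements the step internally.

```python
import numpy as np

def hysteresis(magnitude, low, high):
    """Keep strong pixels, plus weak pixels connected (8-neighbourhood) to a strong one."""
    strong = magnitude >= high
    candidate = magnitude >= low          # strong and weak pixels together
    edges = strong.copy()
    height, width = edges.shape
    while True:
        # Grow the current edge set by one pixel in all 8 directions
        padded = np.pad(edges, 1, constant_values=False)
        grown = np.zeros_like(edges)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                grown |= padded[dy:dy + height, dx:dx + width]
        updated = grown & candidate       # only weak or strong pixels may join
        if np.array_equal(updated, edges):
            return edges
        edges = updated

# Hypothetical magnitudes: one strong pixel (250) with a weak chain (120) attached,
# and an isolated weak pixel (120) in the top-right corner
magnitude = np.array([
    [  0,   0,   0,   0, 120],
    [  0,   0,   0,   0,   0],
    [250, 120, 120,   0,   0],
    [  0,   0,   0,   0,   0],
])
result = hysteresis(magnitude, low=100, high=200)
print(result.astype(int))
```

The weak pixels attached to the strong one survive, while the isolated weak pixel in the corner is discarded.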

In [2]:
import cv2

# Read the input image as grayscale; imread returns None if the file cannot be read
img = cv2.imread('input_image.jpg', 0)
if img is None:
    raise FileNotFoundError("could not read 'input_image.jpg'")

# Apply Gaussian blur
img_blur = cv2.GaussianBlur(img, (5, 5), 0)

# Compute gradient magnitude and direction
# (shown for illustration; Canny recomputes them internally)
grad_x = cv2.Sobel(img_blur, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(img_blur, cv2.CV_64F, 0, 1, ksize=3)
grad_mag = cv2.magnitude(grad_x, grad_y)
grad_dir = cv2.phase(grad_x, grad_y, angleInDegrees=True)

# Canny performs non-maximum suppression and hysteresis thresholding itself;
# it expects the 8-bit blurred image, not the float gradient magnitude
edges = cv2.Canny(img_blur, 100, 200, apertureSize=3, L2gradient=True)

# Display the output image
cv2.imshow('Canny Edge Detection', edges)
cv2.waitKey(0)
cv2.destroyAllWindows()



**Line Detection**

Line detection is a common task in computer vision and image processing. The Hough transform is a popular technique for detecting lines in an image. It works by representing lines in the image space as points in the Hough space, where each point corresponds to a line in the image space. The Hough transform can detect lines of any orientation and can handle noisy images.

Here are the steps involved in the Hough transform for line detection:

Convert the input image to a binary image using thresholding or edge detection.
Initialize an accumulator matrix (Hough space) with zeros.
For each non-zero pixel in the binary image, compute the corresponding lines in the image space and increment the corresponding cells in the accumulator matrix.
Threshold the accumulator matrix to extract the lines with the highest vote counts.
Convert the lines from Hough space back to the image space.

Here is Python code for line detection using the Hough transform in OpenCV:

In [3]:
import cv2
import numpy as np

# Read the input image; imread returns None if the file cannot be read
img = cv2.imread('input_image.jpg')
if img is None:
    raise FileNotFoundError("could not read 'input_image.jpg'")

# Convert the input image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply Canny edge detection
edges = cv2.Canny(gray, 50, 150, apertureSize=3)

# Apply Hough transform for line detection
lines = cv2.HoughLines(edges, 1, np.pi/180, 100)

# Draw the detected lines on the input image
if lines is not None:
    for line in lines:
        rho, theta = line[0]
        a = np.cos(theta)
        b = np.sin(theta)
        # (x0, y0) is the point on the line closest to the origin
        x0 = a * rho
        y0 = b * rho
        # Step 1000 pixels along the line in both directions to get two endpoints
        x1 = int(x0 + 1000*(-b))
        y1 = int(y0 + 1000*(a))
        x2 = int(x0 - 1000*(-b))
        y2 = int(y0 - 1000*(a))
        cv2.line(img, (x1, y1), (x2, y2), (0, 0, 255), 2)

# Display the output image
cv2.imshow('Line Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

In this code, we first read the input image using the imread() function of OpenCV and stop early if the file cannot be read. Then we convert the image to grayscale using the cvtColor() function, and apply Canny edge detection using the Canny() function. Next, we use the HoughLines() function to detect lines in the image. The first argument of the function is the binary edge image, the second argument is the distance resolution in pixels, the third argument is the angle resolution in radians, and the fourth argument is the threshold for the minimum number of votes required to accept a line. The function returns an array of lines in polar coordinates (rho and theta), or None if no line reaches the threshold.

Finally, we iterate over the lines and draw them on the input image using the line() function of OpenCV. Because each line is given only as (rho, theta), we first compute the point on the line closest to the origin and then step 1000 pixels along the line in both directions to obtain two endpoints. The first argument of line() is the input image, the second and third arguments are the coordinates of the start and end points of the line, the fourth argument is the color of the line in BGR order, and the fifth argument is the thickness of the line. We then display the output image using the imshow() function, wait for a key press using waitKey(), and close the window using destroyAllWindows().


**Hough transformation**

The Hough transform is a feature extraction technique used in computer vision and image analysis for detecting shapes in an image. The most common application of the Hough transform is to detect lines, but it can also be used to detect other shapes such as circles and ellipses.

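The Hough transform works by converting image points to a parameter space representation, where each point in the parameter space represents a possible shape that could pass through a set of image points. In the case of line detection, each image point maps to a curve in parameter space that collects every line passing through that point. With the slope-intercept form y = m*x + c, that curve is itself a line in (m, c) space. The slope is unbounded for vertical lines, however, so in practice (and in OpenCV's HoughLines) lines are written in polar form, rho = x*cos(theta) + y*sin(theta), and each image point maps to a sinusoid in (rho, theta) space.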

The Hough transform algorithm works as follows:

Edge detection: Detect edges in the input image using an edge detection algorithm such as Canny edge detector.

Parameter space creation: For each edge point in the input image, compute the parameters of every candidate line that could pass through that point.

Voting: For each candidate line, increment its vote counter in the accumulator. Lines that pass through many edge points collect many votes.

Thresholding: Set a threshold for the minimum number of votes required for a line to be considered a valid detection.

Line extraction: Extract the lines with the highest vote counts from the parameter space.

Return: The output of the algorithm is a set of detected lines in the input image.
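The voting step above can be sketched in plain NumPy. This is an illustrative version under simple assumptions (1 degree angular resolution, 1 pixel distance resolution, a synthetic edge image) rather than OpenCV's implementation, and the function name `hough_accumulator` is hypothetical.

```python
import numpy as np

def hough_accumulator(edges, n_theta=180):
    """Vote in (rho, theta) space for every non-zero pixel of a binary edge image."""
    height, width = edges.shape
    thetas = np.deg2rad(np.arange(n_theta))            # 0..179 degrees
    max_rho = int(np.ceil(np.hypot(height, width)))    # largest possible |rho|
    accumulator = np.zeros((2 * max_rho + 1, n_theta), dtype=np.int64)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        # rho = x*cos(theta) + y*sin(theta), rounded to the nearest integer bin
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rhos + max_rho, np.arange(n_theta)] += 1
    return accumulator, thetas, max_rho

# Hypothetical edge image containing a single horizontal line at y = 5
edges = np.zeros((50, 100), dtype=np.uint8)
edges[5, :] = 1

acc, thetas, max_rho = hough_accumulator(edges)
rho_idx, theta_idx = np.unravel_index(np.argmax(acc), acc.shape)
best_rho, best_theta_deg = int(rho_idx) - max_rho, int(theta_idx)
print(best_rho, best_theta_deg)   # parameters of the strongest line
```

For a horizontal line the normal points straight down the y axis, so the peak sits at theta = 90 degrees with rho equal to the line's y coordinate.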

In [4]:
import cv2
import numpy as np

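# Read the input image; imread returns None if the file cannot be read
img = cv2.imread('input_image.jpg')
if img is None:
    raise FileNotFoundError("could not read 'input_image.jpg'")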

# Convert the input image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply Canny edge detection
edges = cv2.Canny(gray, 50, 150, apertureSize=3)

# Apply Hough transform for line detection
lines = cv2.HoughLines(edges, 1, np.pi/180, 200)

# Draw the detected lines on the input image
if lines is not None:
    for line in lines:
        rho, theta = line[0]
        a = np.cos(theta)
        b = np.sin(theta)
        x0 = a * rho
        y0 = b * rho
        x1 = int(x0 + 1000*(-b))
        y1 = int(y0 + 1000*(a))
        x2 = int(x0 - 1000*(-b))
        y2 = int(y0 - 1000*(a))
        cv2.line(img, (x1, y1), (x2, y2), (0, 0, 255), 2)

# Display the output image
cv2.imshow('Line Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()




**Image Features**

Image features, also known as keypoints, are points in an image that represent distinctive local features that can be used for tasks such as image matching, object recognition, and 3D reconstruction. These keypoints are typically robust to changes in lighting, viewpoint, and scale, making them useful for a variety of computer vision tasks.

There are several types of image features that are commonly used in computer vision:

Corner features: Corner features are points in an image where the gradient changes abruptly in multiple directions. Examples of corner features include Harris corners and FAST corners.

Blob features: Blob features are regions in an image that differ from their surroundings in properties such as brightness or color. Examples of blob detectors include the Laplacian of Gaussian (LoG) and the Difference of Gaussians (DoG), which the Scale-Invariant Feature Transform (SIFT) uses to find keypoints.

Edge features: Edge features are locations in an image where there is a significant change in intensity in a single direction. Examples of edge features include Canny edges and Sobel edges.

Line features: Line features are straight or curved line segments in an image. Examples of line features include Hough transform lines and LSD lines.

Region features: Region features are groups of pixels that form a distinct region in an image. Examples of region features include Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP).

Once image features are detected, they can be described using feature descriptors, which are vectors that summarize the local image information around each feature point. Common feature descriptors include SIFT, SURF, and ORB.

Image features and their descriptors are often used in combination with machine learning algorithms such as support vector machines (SVM) and random forests to perform tasks such as object detection, image classification, and image retrieval.

**SIFT algorithm**

SIFT (Scale-Invariant Feature Transform) is a widely used computer vision algorithm for detecting and describing local features in images. It was developed by David Lowe in 1999 and has become one of the most popular algorithms for feature detection and matching due to its robustness to changes in scale, orientation, and lighting.

The SIFT algorithm works as follows:

Scale-space extrema detection: The first step in the SIFT algorithm is to find potential feature locations at multiple scales in the image. This is achieved by computing the difference of Gaussian (DoG) pyramid, which is a series of images obtained by convolving the input image with Gaussian filters of increasing standard deviation, and subtracting adjacent levels of the pyramid.

Keypoint localization: Next, potential keypoints are selected from the scale-space extrema by comparing them to their neighboring pixels in scale and space. Keypoints that are poorly localized or have low contrast are discarded.

Orientation assignment: For each keypoint, a dominant orientation is assigned based on the gradient orientations in the local neighborhood around the keypoint. This makes the feature descriptor rotationally invariant.

Descriptor generation: Finally, a feature descriptor is generated for each keypoint by computing the gradient magnitude and orientation in a region around the keypoint, and building a histogram of gradient orientations weighted by the gradient magnitudes. The resulting histogram is normalized and concatenated to form a feature vector.

In [6]:
import cv2

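# Load two images; imread returns None if a file cannot be read
img1 = cv2.imread('image1.jpg')
img2 = cv2.imread('image2.jpg')
if img1 is None or img2 is None:
    raise FileNotFoundError("could not read 'image1.jpg' or 'image2.jpg'")

# Convert images to grayscale
gray1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
gray2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# Initialize SIFT detector (in the main module since OpenCV 4.4;
# older contrib builds expose it as cv2.xfeatures2d.SIFT_create)
sift = cv2.SIFT_create()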

# Detect and compute keypoints and descriptors for both images
kp1, desc1 = sift.detectAndCompute(gray1, None)
kp2, desc2 = sift.detectAndCompute(gray2, None)

# Create a BFMatcher object for matching keypoints
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)

# Match keypoints in both images
matches = bf.match(desc1, desc2)

# Sort matches by distance
matches = sorted(matches, key = lambda x:x.distance)

# Draw the top N matches between the images
img_matches = cv2.drawMatches(img1, kp1, img2, kp2, matches[:10], None, flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)

# Display the output image
cv2.imshow('SIFT Matches', img_matches)
cv2.waitKey(0)
cv2.destroyAllWindows()




**Affine transformation**

An affine transformation preserves straight lines, parallelism, and ratios of distances along a line, but in general not angles or lengths. It is a linear mapping followed by a translation and can be represented by a 2x3 matrix, which transforms a point (x, y) to a new point (x', y') in the form:

x' = a * x + b * y + c
y' = d * x + e * y + f

where a, b, d, e form the linear part, which encodes scaling, rotation, and shear, and c, f are the translation offsets. For example, a pure scaling has b = d = 0, while a rotation by an angle t has a = cos(t), b = -sin(t), d = sin(t), e = cos(t).

Affine transformations can be used in image processing for a variety of tasks, such as image scaling, rotation, translation, and skew correction. It is also a fundamental tool in computer vision for geometric transformations of images, object detection, and recognition.
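To see the 2x3 matrix in action on points rather than on a whole image, the sketch below applies it with homogeneous coordinates. The helper name `apply_affine` and the chosen transform (a 90 degree rotation followed by a shift) are assumptions for illustration only.

```python
import numpy as np

def apply_affine(M, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of (x, y) points."""
    points = np.asarray(points, dtype=np.float64)
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.hstack([points, ones])   # (x, y) -> (x, y, 1)
    return homogeneous @ M.T                  # rows are (x', y')

# Hypothetical transform: rotate by 90 degrees, then translate by (10, 20)
t = np.pi / 2
M = np.array([[np.cos(t), -np.sin(t), 10.0],
              [np.sin(t),  np.cos(t), 20.0]])

square = [[0, 0], [1, 0], [1, 1], [0, 1]]
transformed = apply_affine(M, square)
print(np.round(transformed, 6))
```

The unit square is still a square after this particular transform because rotation and translation preserve angles; a matrix with shear would turn it into a parallelogram while keeping opposite sides parallel.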

In [7]:
import cv2
import numpy as np

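# Load image; imread returns None if the file cannot be read
img = cv2.imread('image.jpg')
if img is None:
    raise FileNotFoundError("could not read 'image.jpg'")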

# Set source and destination points
pts1 = np.float32([[50,50],[200,50],[50,200]])
pts2 = np.float32([[10,100],[200,50],[100,250]])

# Compute affine matrix using the source and destination points
M = cv2.getAffineTransform(pts1, pts2)

# Apply affine transformation to the image
img_affine = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

# Display the original and transformed images
cv2.imshow('Original Image', img)
cv2.imshow('Affine Transformed Image', img_affine)
cv2.waitKey(0)
cv2.destroyAllWindows()



**Intro to CNN**

A Convolutional Neural Network (CNN) is a type of deep learning model that is commonly used in image recognition and processing tasks. The main idea behind CNNs is to learn the features of images by applying a series of convolutional filters to the input image. Each filter slides over the image, multiplies its weights element-wise with the pixel values in the region it covers, and sums the products to produce one value of the feature map.

Each filter is responsible for detecting a specific feature in the image, such as edges, corners, or textures. By stacking multiple convolutional layers, the CNN can learn increasingly complex and abstract features in the input image. The output of the last convolutional layer is then flattened and fed into a fully connected (dense) layer for classification or regression.

The key components of a CNN include:

Convolutional Layer: Applies a set of filters to the input image to produce a set of feature maps.

Pooling Layer: Subsamples the feature maps to reduce their spatial dimensions and extract the most important features.

Activation Function: Introduces non-linearity into the network and enables the model to learn complex patterns.

Fully Connected Layer: Takes the flattened output from the convolutional layers and applies a set of weights to produce the final output (classification or regression).
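The way these components change the tensor shape and contribute parameters can be worked out by hand. The sketch below does the arithmetic for an assumed small network (64x64 RGB input, two 3x3 convolutions with 32 and 64 filters, each followed by 2x2 max pooling, then a 128-unit dense layer and a single output unit), with stride 1 and no padding. The helper names are hypothetical.

```python
def conv2d_shape(h, w, kernel, filters):
    """Output shape of a stride-1, unpadded ('valid') convolution."""
    return h - kernel + 1, w - kernel + 1, filters

def conv2d_params(kernel, in_channels, filters):
    """One kernel x kernel x in_channels weight block plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def pool_shape(h, w, c, pool):
    """Output shape of non-overlapping pooling; leftover rows and columns are dropped."""
    return h // pool, w // pool, c

shape = (64, 64, 3)                      # assumed input: 64x64 RGB image
total = 0
for filters in (32, 64):                 # two conv + pool stages
    total += conv2d_params(3, shape[2], filters)
    shape = conv2d_shape(shape[0], shape[1], 3, filters)
    shape = pool_shape(*shape, 2)
    print(shape)

flat = shape[0] * shape[1] * shape[2]    # size after flattening
total += (flat + 1) * 128                # dense layer with 128 units
total += (128 + 1) * 1                   # single-unit output layer
print(flat, total)
```

Most of the parameters sit in the first dense layer, which is one reason later architectures replace the flatten step with global average pooling.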

In [8]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Create a Sequential model
model = Sequential()

# Add a convolutional layer with 32 filters, a 3x3 kernel, and ReLU activation
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))

# Add a max pooling layer with a 2x2 pool size
model.add(MaxPooling2D((2, 2)))

# Add another convolutional layer with 64 filters, a 3x3 kernel, and ReLU activation
model.add(Conv2D(64, (3, 3), activation='relu'))

# Add another max pooling layer with a 2x2 pool size
model.add(MaxPooling2D((2, 2)))

# Flatten the output of the previous layer
model.add(Flatten())

# Add a fully connected layer with 128 units and ReLU activation
model.add(Dense(128, activation='relu'))

# Add an output layer with 1 unit and sigmoid activation for binary classification
model.add(Dense(1, activation='sigmoid'))

# Compile the model with binary crossentropy loss and Adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model summary
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 62, 62, 32)        896       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 31, 31, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 29, 29, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 14, 14, 64)       0         
 2D)                                                             
                                                                 
 flatten (Flatten)           (None, 12544)             0         
                                                                 
 dense (Dense)               (None, 128)               1

**Image augmentation**

Image augmentation is a technique used in computer vision and deep learning to artificially expand the size of a training dataset by generating new images through various transformations. These transformed images are still representative of the original images but are slightly modified in some way. Image augmentation is used to improve the accuracy and robustness of deep learning models by exposing the model to a larger and more diverse set of training data.

Here are some commonly used image augmentation techniques:

Rotation: Rotates the image by a specified angle.

Flipping: Flips the image horizontally or vertically.

Scaling: Rescales the image by a specified factor.

Translation: Translates the image horizontally or vertically.

Noise addition: Adds random noise to the image.

Color jitter: Randomly adjusts the brightness, contrast, and saturation of the image.

Shearing: Applies a shearing transformation to the image.

Cropping: Crops a portion of the image.

Perspective transformation: Applies a perspective transformation to the image.

Elastic transformation: Applies a random deformation to the image.
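Several of these transformations are simple array operations. The following NumPy sketch applies a few of them to a small random image; the image size, noise level, and shift amount are arbitrary choices for illustration, and libraries such as Keras add random sampling of the parameters on top of operations like these.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is repeatable

# Hypothetical 4x4 RGB image with values in 0..255
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

flipped_h = image[:, ::-1]               # horizontal flip (mirror left-right)
flipped_v = image[::-1, :]               # vertical flip (mirror top-bottom)
rotated = np.rot90(image, k=1)           # rotate by 90 degrees counter-clockwise
shifted = np.roll(image, shift=1, axis=1)  # translate right by 1 pixel (wraps around)

# Additive Gaussian noise, clipped back to the valid 8-bit range
noise = rng.normal(0.0, 10.0, size=image.shape)
noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)

cropped = image[1:3, 1:3]                # central 2x2 crop

print(flipped_h.shape, rotated.shape, noisy.dtype, cropped.shape)
```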

In [9]:
from keras.preprocessing.image import ImageDataGenerator

# Define the image augmentation parameters (each one is sampled randomly per image)
datagen = ImageDataGenerator(
        rotation_range=20,           # Rotate by a random angle of up to 20 degrees either way
        width_shift_range=0.2,      # Shift horizontally by up to 20% of the width
        height_shift_range=0.2,     # Shift vertically by up to 20% of the height
        zoom_range=0.2,             # Zoom by a random factor between 0.8 and 1.2
        horizontal_flip=True,       # Randomly flip the image horizontally
        vertical_flip=True)         # Randomly flip the image vertically

# Load the image dataset
train_dataset = datagen.flow_from_directory(
        'train/',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary')

# Train the model using the augmented dataset
model.fit(
        train_dataset,
        steps_per_epoch=len(train_dataset),
        epochs=10)



**Dropout
Feature scaling
Internal covariate shift
Batch normalization
Global average pooling
Transfer learning - AlexNet, VGG16
Pretrained network as classifier
Fine tuning
Model callbacks
ROC AUC**

Dropout: Dropout is a regularization technique used in deep learning to prevent overfitting. It randomly drops out some of the neurons during training, forcing the network to learn more robust features.
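A minimal sketch of the idea, assuming the common "inverted dropout" variant in which surviving activations are rescaled during training so that nothing needs to change at inference time. The function name and the drop rate are illustrative.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the survivors."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(42)
activations = np.ones(10_000)
dropped = dropout(activations, rate=0.5, rng=rng)

print(np.mean(dropped == 0))   # roughly half of the units are zeroed
print(dropped.mean())          # close to 1.0, so the expected activation is unchanged
```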

Feature scaling: Feature scaling is the process of scaling the input features to a common range to improve the performance of the machine learning algorithms. Commonly used scaling techniques include min-max scaling, z-score normalization, and log normalization.
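The first two techniques can be written in a couple of lines each. The data below is invented, with two features on very different scales, to show that both methods bring the columns onto a comparable range.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are features
features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])

# Min-max scaling: map each column to the range [0, 1]
col_min, col_max = features.min(axis=0), features.max(axis=0)
min_max = (features - col_min) / (col_max - col_min)

# Z-score normalization: zero mean and unit standard deviation per column
z_score = (features - features.mean(axis=0)) / features.std(axis=0)

print(min_max)
print(z_score)
```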

Internal covariate shift: Internal covariate shift is the change in the distribution of each layer's inputs during training, caused by updates to the parameters of the preceding layers. It can slow down training and was the original motivation for batch normalization.

Batch normalization: Batch normalization is a technique used in deep learning to improve the stability and convergence speed of a neural network. It normalizes the activations of the previous layer across a mini-batch of inputs, reducing the internal covariate shift.
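As a rough sketch of the training-time forward pass: each feature is normalized with the mean and variance of the current mini-batch, then scaled and shifted by learnable parameters (called `gamma` and `beta` here). At inference time, frameworks use running averages of the statistics instead, which this sketch leaves out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical mini-batch: 4 samples, 3 features on very different scales
batch = np.array([[1.0, 100.0, -5.0],
                  [2.0, 300.0, -3.0],
                  [3.0, 500.0, -1.0],
                  [4.0, 700.0,  1.0]])
out = batch_norm(batch, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0))  # approximately 0 for every feature
print(out.std(axis=0))   # approximately 1 for every feature
```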

Global average pooling: Global average pooling is a technique used in deep learning to reduce the spatial dimensions of the feature maps by computing the average of each feature map.
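In array terms this is a mean over the two spatial axes. The sketch assumes a channels-last layout and a made-up 7x7x512 feature volume.

```python
import numpy as np

# Hypothetical feature maps for one image: height 7, width 7, 512 channels
feature_maps = np.random.default_rng(0).random((7, 7, 512))

# Average over the two spatial axes, leaving one number per channel
pooled = feature_maps.mean(axis=(0, 1))

print(feature_maps.shape, '->', pooled.shape)
```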

Transfer learning - AlexNet, VGG16: Transfer learning is the process of using a pre-trained deep learning model as a starting point for a new task. AlexNet and VGG16 are popular deep learning models that have been pre-trained on large image datasets and can be used as a starting point for image classification tasks.

Pretrained network as classifier: A pre-trained network can be used as a classifier by removing the last fully connected layer and replacing it with a new output layer that matches the number of classes in the new task.

Fine tuning: Fine-tuning is a technique used in transfer learning where a pre-trained network is further trained on a new task by freezing some of the early layers and training the remaining layers.

Model callbacks: Model callbacks are functions that are called during the training process of a deep learning model. They can be used to implement early stopping, learning rate scheduling, and other custom functionality.

ROC AUC: ROC AUC is a performance metric used in binary classification problems to measure the ability of a classifier to distinguish between positive and negative classes. It represents the area under the receiver operating characteristic (ROC) curve.
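One way to compute the area without drawing the curve is to use its ranking interpretation: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below implements that directly; the function name and the toy labels and scores are illustrative, and it compares all pairs, so it is only suitable for small inputs.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=np.float64)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # every positive against every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)   # 3 of the 4 positive-negative pairs are ordered correctly
```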

**Inception V1
ResNet
MobileNet
Class activation maps**

Inception V1:

Inception V1, also known as GoogLeNet, is a deep convolutional neural network that was designed to improve the efficiency of deep learning models by using a combination of small filters and parallel pooling operations. The Inception module, which consists of multiple parallel convolutional layers with different filter sizes, is the key building block of the network. Inception V1 achieved state-of-the-art performance on the ImageNet classification task in 2014.

ResNet:

ResNet, short for Residual Network, is a deep neural network architecture that introduces residual connections, also known as skip connections, to overcome the problem of vanishing gradients in very deep networks. The residual connections allow the gradient to flow directly through the network without being affected by intermediate layers, which can significantly improve the accuracy and convergence speed of deep networks. ResNet achieved state-of-the-art performance on the ImageNet classification task in 2015.
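The skip connection itself is just an addition. The sketch below uses two small dense layers as a stand-in for the convolution and batch normalization layers of a real residual block, so it should be read as an illustration of the wiring only; the function names and weights are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two small dense layers standing in for conv layers."""
    fx = relu(x @ w1) @ w2
    return relu(fx + x)       # the skip connection adds the input back

x = np.array([[1.0, 2.0, 3.0]])
zeros = np.zeros((3, 3))

# With all-zero weights F(x) = 0, so the block passes positive inputs through unchanged
y = residual_block(x, zeros, zeros)
print(y)
```

Because the block only has to learn a correction F(x) on top of the identity, adding more blocks does not force the network to relearn what earlier layers already represent.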

MobileNet:

MobileNet is a family of lightweight convolutional neural networks designed for mobile and embedded applications. MobileNet uses depthwise separable convolutions, which separate the spatial and channel-wise convolutions, to significantly reduce the number of parameters and computation required by the network. MobileNet achieves high accuracy on various image classification tasks while being much faster and smaller than traditional deep neural networks.
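
To make the saving concrete, the sketch below counts weights for a standard convolution and for its depthwise separable counterpart. The function names and the layer sizes (3x3 kernel, 64 input channels, 128 output channels) are assumptions chosen for illustration, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Each of the c_out filters spans all c_in channels with a kernel x kernel window."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, round(standard / separable, 2))
```

For this layer the separable version needs roughly one eighth of the weights, and the gap widens as the number of output channels grows.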

Class activation maps:

Class activation mapping (CAM) is a technique used in convolutional neural networks to visualize the regions of an input image that are most important for a specific class. It applies to networks in which the final convolutional layer is followed by global average pooling and a single fully connected output layer. The map for a class is the weighted sum of the feature maps of the final convolutional layer, where the weights are those that connect each pooled feature map to that class in the fully connected layer. The resulting map is upsampled to the input size and overlaid on the image to highlight the regions that contribute most to the classification decision. CAM can be used to explain a network's decisions and to identify the salient features of an input image.
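
The weighted sum at the heart of CAM can be sketched as follows. The feature maps and class weights here are tiny made-up values, and `class_activation_map` and `normalize` are hypothetical helper names. In a real network the maps would come from the last convolutional layer and the weights from the output layer.

```python
def class_activation_map(feature_maps, class_weights):
    """feature_maps: K maps of size H x W (nested lists); class_weights: K weights for one class."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


def normalize(cam):
    """Rescale to [0, 1] so the map can be rendered as a heatmap."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in cam]


feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],   # map 0 fires in the top-left corner
    [[0.0, 0.0], [0.0, 2.0]],   # map 1 fires in the bottom-right corner
]
class_weights = [0.5, 1.5]      # weights of the target class for each map
cam = class_activation_map(feature_maps, class_weights)
heatmap = normalize(cam)
print(cam, heatmap)
```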

**Visual embeddings
Reverse image search
Face recognition / identification
Siamese network**

Visual embeddings:

Visual embeddings, also known as image embeddings or visual features, are a compact representation of an image that captures its key visual information. Visual embeddings are learned by deep neural networks, such as convolutional neural networks (CNNs), and are often used as a starting point for various computer vision tasks, such as image classification, object detection, and image retrieval.

Reverse image search:

Reverse image search is a technique used to find similar images to a given image by using the image as a query. Reverse image search works by extracting visual embeddings from the query image and comparing them to the embeddings of a large database of images. The search engine then returns a list of images that are visually similar to the query image. Reverse image search can be used for various applications, such as finding the source of an image, identifying similar products, and detecting fake news.
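
A minimal sketch of the comparison step, assuming the embeddings have already been extracted. The three-dimensional vectors and file names are invented for illustration, and cosine similarity is one common choice of similarity measure rather than the only one.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, database, top_k=2):
    """Return the names of the top_k database images most similar to the query."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in database.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical embeddings; real ones usually have hundreds of dimensions.
database = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.9],
}
query_embedding = [1.0, 0.1, 0.0]
results = search(query_embedding, database)
print(results)
```

This brute-force scan compares the query with every stored embedding. Large databases generally rely on approximate nearest-neighbour indexes instead.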

Face recognition / identification:

Face recognition, also known as face identification, is a computer vision task that involves identifying a person from a digital image or a video frame. Face recognition can be done by extracting visual features, such as facial landmarks and texture, from the input image and comparing them to a database of known faces. Deep neural networks, such as Siamese networks and convolutional neural networks (CNNs), have been shown to achieve state-of-the-art performance on face recognition tasks.

Siamese network:

A Siamese network is a neural network architecture that consists of two identical subnetworks that share the same weights. Siamese networks are often used for tasks that involve measuring similarity or dissimilarity between pairs of inputs, such as image matching, face recognition, and text similarity. Each of the two inputs is passed through one of the subnetworks, and a similarity score is computed from the distance between the two output embeddings. Siamese networks can be trained with a contrastive loss function, which pulls the embeddings of similar inputs together and pushes the embeddings of dissimilar inputs apart until they are at least a margin away from each other.
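
One common form of the contrastive loss can be sketched as below, assuming the two embeddings have already been produced by the shared subnetwork. The function names, the `margin` default, and the example vectors are illustrative choices.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Penalize distance for similar pairs, and closeness within the margin for dissimilar pairs."""
    d = euclidean_distance(emb_a, emb_b)
    if is_similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


similar_loss = contrastive_loss([0.0, 0.0], [0.3, 0.4], is_similar=True)
close_dissimilar_loss = contrastive_loss([0.0, 0.0], [0.3, 0.4], is_similar=False)
far_dissimilar_loss = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_similar=False)
print(similar_loss, close_dissimilar_loss, far_dissimilar_loss)
```

Dissimilar pairs that are already farther apart than the margin contribute no loss, so training effort goes to the pairs that are still confusable.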

**Type I and II errors
ROC AUC
Evaluation of models
Video processing pipeline**

Type I and II errors:

Type I and Type II errors are two types of errors that can occur when testing a hypothesis or making a decision based on a statistical model. A Type I error occurs when a null hypothesis is rejected even though it is true, while a Type II error occurs when a null hypothesis is not rejected even though it is false. In other words, a Type I error is a false positive, while a Type II error is a false negative.
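
For a classifier that outputs scores, the balance between the two error types depends on the decision threshold. The sketch below uses made-up labels and scores, and `error_counts` is a hypothetical helper, to show how a strict threshold trades false positives for false negatives.

```python
def error_counts(y_true, scores, threshold):
    """Return (false positives, false negatives) when predicting positive for score >= threshold."""
    false_positives = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    false_negatives = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return false_positives, false_negatives


y_true = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]

strict = error_counts(y_true, scores, threshold=0.8)    # few Type I, more Type II errors
lenient = error_counts(y_true, scores, threshold=0.3)   # more Type I, few Type II errors
print(strict, lenient)
```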

ROC AUC:

ROC AUC, or Receiver Operating Characteristic Area Under the Curve, is a metric used to evaluate the performance of binary classification models. The ROC curve is a plot of the true positive rate against the false positive rate at various classification thresholds. The AUC score is the area under the ROC curve and summarizes how well the model separates positive from negative examples across all thresholds. A higher AUC score indicates better performance: a perfect classifier has an AUC of 1, while random guessing gives an AUC of about 0.5.
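
The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes AUC that way. The labels and scores are invented, and the quadratic loop is meant for clarity, not efficiency.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```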

Evaluation of models:

The evaluation of models is an essential step in the machine learning pipeline. There are several metrics used to evaluate the performance of a model, depending on the specific task and the type of model. For classification tasks, metrics such as accuracy, precision, recall, and F1 score can be used. For regression tasks, metrics such as mean squared error (MSE), mean absolute error (MAE), and R-squared can be used. In addition to these metrics, cross-validation and hyperparameter tuning can also be used to assess the generalization and optimize the performance of the model.
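
The classification metrics named above can be computed directly from the counts of true and false positives and negatives. The sketch below assumes binary labels encoded as 0 and 1. The function name and the example data are illustrative.

```python
def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # how many predicted positives are correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many actual positives are found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```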

Video processing pipeline:

The video processing pipeline refers to the series of steps involved in processing and analyzing video data. The pipeline typically consists of four main stages: acquisition, preprocessing, analysis, and output. In the acquisition stage, the video data is captured from a camera or a video file. In the preprocessing stage, the video data is preprocessed to remove noise, stabilize the image, and normalize the lighting conditions. In the analysis stage, various computer vision techniques, such as object detection, tracking, and recognition, are used to extract information from the video data. In the output stage, the results of the analysis are presented to the user in a meaningful way, such as a visualization or a report. The video processing pipeline can be applied to various applications, such as surveillance, robotics, and sports analysis.
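
The four stages can be sketched as a chain of Python generators. Everything here is a stand-in: the frames are synthetic nested lists rather than camera images, the preprocessing is a simple rescaling, and mean brightness takes the place of real analysis such as detection or tracking. All function names are hypothetical.

```python
def acquire(num_frames=4, width=4, height=2):
    """Acquisition: yield synthetic grayscale frames (a stand-in for a camera or file reader)."""
    for t in range(num_frames):
        # Brightness grows with time so the later stages have something to measure.
        yield [[(10 * t + x) % 256 for x in range(width)] for _ in range(height)]


def preprocess(frames):
    """Preprocessing: scale pixel values from [0, 255] to [0, 1]."""
    for frame in frames:
        yield [[pixel / 255.0 for pixel in row] for row in frame]


def analyze(frames):
    """Analysis: compute the mean brightness of each frame."""
    for index, frame in enumerate(frames):
        pixels = [p for row in frame for p in row]
        yield {"frame": index, "mean_brightness": sum(pixels) / len(pixels)}


def output(results):
    """Output: collect the per-frame results into a simple report."""
    return list(results)


report = output(analyze(preprocess(acquire())))
for entry in report:
    print(entry)
```

Because each stage is a generator, frames flow through the pipeline one at a time instead of being loaded into memory all at once.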

**Object detection
Bounding box regression
RCNN family
Mean average precision**

Object detection:

Object detection is a computer vision task that involves detecting and localizing objects within an image or a video. The goal of object detection is to identify the objects of interest within an image and provide information about their location and size.

Bounding box regression:

Bounding box regression is a technique used in object detection to refine the location and size of bounding boxes around objects in an image. In this technique, a regression model is trained to predict the offsets between the initial bounding boxes and the true bounding boxes of the objects. The predicted offsets are then used to refine the initial bounding boxes and obtain more accurate locations and sizes of the objects.
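
The offsets are usually expressed relative to the size of the initial box. The sketch below uses the parameterization from the R-CNN paper, with boxes given as (center x, center y, width, height). The `encode` and `decode` names and the example boxes are illustrative.

```python
import math


def encode(proposal, ground_truth):
    """Regression targets that map a proposal box onto the ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode(proposal, offsets):
    """Apply predicted offsets to a proposal to obtain the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (54.0, 46.0, 30.0, 40.0)

targets = encode(proposal, ground_truth)   # what the regressor is trained to predict
refined = decode(proposal, targets)        # a perfect prediction recovers the ground truth
print(targets, refined)
```

Normalizing the shifts by the box size and using log ratios for width and height keeps the targets on a similar scale for small and large objects.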

RCNN family:

R-CNN, or Region-based Convolutional Neural Network, is a family of object detection models that use a region-based approach to detect and localize objects in an image. The family includes several variants, such as Fast R-CNN, Faster R-CNN, and Mask R-CNN. These models first generate candidate regions of interest and then use a convolutional neural network (CNN) to classify each region and refine its bounding box. R-CNN and Fast R-CNN rely on an external region proposal algorithm (selective search), Faster R-CNN replaces it with a learned region proposal network, and Mask R-CNN adds a branch that predicts a segmentation mask for each detected object.

Mean Average Precision:

Mean Average Precision (mAP) is a commonly used metric for evaluating the performance of object detection models. A detection counts as correct when its overlap with a ground-truth box, measured by intersection over union (IoU), exceeds a chosen threshold. The average precision (AP) for each class is the area under the precision-recall curve for that class, and mAP is the mean of the AP scores over all classes. A higher mAP score indicates better performance of the object detection model.
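
A simplified sketch of the calculation follows. It assumes the detections of each class are already sorted by descending confidence and already marked as true or false positives, so the matching of detections to ground-truth boxes is not shown. The function names and the toy data are illustrative, and benchmarks differ in how they interpolate the curve.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    tp = 0
    recalls, precisions = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / rank)
    # Make precision non-increasing from left to right (the usual envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangle areas wherever recall increases.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class):
    aps = [average_precision(flags, n_gt) for flags, n_gt in per_class.values()]
    return sum(aps) / len(aps)


# class name -> (true-positive flags in confidence order, number of ground-truth boxes)
detections = {
    "cat": ([True, False, True], 2),
    "dog": ([True, True], 2),
}
map_score = mean_average_precision(detections)
print(map_score)
```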

**YOLOv1**

YOLOv1, or You Only Look Once version 1, is a real-time object detection system introduced by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in 2015. It is a single-stage detector that uses a neural network to directly predict bounding boxes and class probabilities for objects in an image, without requiring a separate region proposal stage.

The YOLOv1 architecture consists of a single convolutional neural network (CNN) that processes the entire image in one forward pass. The network divides the image into an S x S grid of cells (7 x 7 in the original paper). Each cell predicts B bounding boxes (B = 2 in the paper), each with a confidence score that reflects both how likely the box is to contain an object and how well the box fits it, together with one set of class probabilities per cell. Box centers are predicted relative to the cell, while widths and heights are predicted relative to the whole image. The final detections are obtained by discarding low-confidence boxes and applying non-maximum suppression to remove duplicates.
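
The size of the prediction grid and the conversion of a cell-relative box to image coordinates can be sketched as follows. The values S = 7, B = 2, and C = 20 and the 448 x 448 input size are those reported in the YOLOv1 paper for PASCAL VOC. `decode_box` and the example prediction are hypothetical.

```python
S, B, C = 7, 2, 20                 # grid size, boxes per cell, number of classes
values_per_cell = B * 5 + C        # each box has x, y, w, h, confidence
output_size = S * S * values_per_cell


def decode_box(row, col, x, y, w, h, image_width, image_height, grid=S):
    """Convert one predicted box to (center x, center y, width, height) in pixels.

    x and y are offsets within the grid cell (0 to 1); w and h are fractions of the image.
    """
    center_x = (col + x) / grid * image_width
    center_y = (row + y) / grid * image_height
    return (center_x, center_y, w * image_width, h * image_height)


# A box centered in the middle cell of a 448 x 448 image.
box = decode_box(row=3, col=3, x=0.5, y=0.5, w=0.2, h=0.4,
                 image_width=448, image_height=448)
print(values_per_cell, output_size, box)
```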

YOLOv1 was designed for real-time object detection in applications such as surveillance, robotics, and self-driving cars. Because it needs only a single forward pass per image, it ran much faster than contemporary two-stage detectors such as Faster R-CNN, although its localization accuracy was lower, particularly for small or closely grouped objects. Later versions of YOLO and other object detection models have since improved on both its accuracy and its speed.