Skip to content

CNNs and Object Detection

Danny Tamiru edited this page May 7, 2023 · 4 revisions

The Litgo detector service uses a Mask R-CNN model with a ResNet-50-FPN backbone that is finetuned on either the TACO or UAVVaste training data. What does this all mean? Let's break it down.

Machine Learning

AI is a field of study around the idea that we can build machines that are intelligent. Formally, an artificially intelligent machine is an agent that acts rationally in its environment. There are many ways to go about accomplishing this (building complex probabilistic models of an environment, efficiently determining optimal outcomes, etc.) but perhaps the most successful method is a branch of AI called Machine Learning.

Machine learning (ML) addresses the fact that many real-world problems simply involve too many variables to have a humanly codable solution. It does so by mimicking Humans' ability to learn from past experience to complete tasks faster or better. More concretely, this means that given examples of a problem (an assignment to all input variables called 'features'), the agent uses its learning algorithm to improve its ability to solve the problem generally.

There are 3 classes of learning algorithms: Reinforcement, Unsupervised, and Supervised. We're interested in the latter. Supervised learning algorithms are ones where training data is labeled, meaning problem examples also come with assigned output features, i.e. the solutions to the problem.

Neural Networks

Neural networks are a type of supervised learning algorithm inspired by the way the human brain works. They can be used for a wide range of tasks, such as image and speech recognition, natural language processing, and predictive modeling. They consist of interconnected nodes or "neurons" arranged in discrete layers/rows that process and transmit information. Each neuron takes in input, performs a calculation on that input, and then outputs a result. The connections between neurons are weighted, and these weights are adjusted through a process called "training" to improve the accuracy of the network's output. Neurons receive a weighted sum of the values from nodes in the previous layer and apply to it what is called an 'activation function', a non-linear function that decides whether the neuron was fired or not.

image

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a type of neural network that is used to solve image and video processing problems. They consist of a series of interconnected layers that learn to identify patterns in an image by passing it through a sequence of convolutional and pooling layers.

The convolutional layers apply filters to the input image, detecting specific features like edges and shapes. The pooling layers reduce the spatial dimensions of the output from the convolutional layer, simplifying the information and allowing for better generalization

CNNs are able to learn and identify complex patterns in images and have been used for applications namely object recognition, image segmentation, and facial recognition.

image

Mask R-CNN

To understand Mask R-CNN and its predecessors, check out this medium article.

image

Clone this wiki locally