# Datasets for Computer Vision tasks



For any computer vision task, the most important preliminary step is finding a suitable dataset. Depending on the task we are trying to achieve, several types of vision data are available.

Ordinary images are the most basic type of data for computer vision. They can come from social media, scans, or other sources. They may be taken by a casual user or a professional photographer, in a controlled or an uncontrolled environment.

Depending on the quality of the images, we as researchers or engineers have to make adjustments to our models and systems.

- Larger images require more memory to train ML models.
- Larger images can mean more parameters to train (for example, in fully connected layers whose size depends on the input) and more computation per image.
- High-resolution images tend to be noisier in low-light environments.
- High-resolution images cost more to store and transmit.

To pick a good image size for ML work, the best approach is to find the resolution that is adequate for the problem while keeping the other resource constraints in check.
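As a rough illustration of how resolution drives memory use, the sketch below estimates the size of a single input batch. The function name `batch_memory_bytes` and its defaults (RGB images, a batch of 32, 32-bit floats) are assumptions made for this example. It counts only the input tensor, not activations, gradients, or model weights.

```python
def batch_memory_bytes(height, width, channels=3, batch_size=32, bytes_per_value=4):
    """Estimate the bytes needed to hold one batch of images as a dense tensor."""
    return batch_size * height * width * channels * bytes_per_value


# Compare a few square resolutions (hypothetical sizes chosen for illustration).
for side in (224, 512, 1024):
    mib = batch_memory_bytes(side, side) / 2**20
    print(f"{side}x{side}: {mib:.1f} MiB per batch")
```

Doubling the side length quadruples the estimate, so settling on the smallest resolution that still preserves the detail the task needs can save a lot of memory.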

Besides the ordinary images we use every day, other types of data are also available for computer vision tasks.

- Many instruments, such as X-ray, MRI, and LIDAR, produce 2D or 3D images of a space. Most of the time these contain a single channel and can therefore be treated as greyscale images. A CT scan returns a stack of 2D slices through the scanned volume. We can treat those slices as multiple channels during calculations.

- One interesting type of vision data is the polar grid, which is produced by radar and ultrasound sensors. Because of how the sensor works, the data generally has a cone shape, so it has to be transformed before it is useful in ML applications. One method is to unwrap the content into a rectangular grid. Another is to use the polar grid as it is, with an additional channel indicating how far each point is from the center.

<center><image src="./imgs/21.png" width="300px"/></center>

- Geospatial data such as land ownership, topography, and population density can be treated as several channels of a single image, provided all layers cover the same area and use the same projection.
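Two of the ideas in the list above, stacking aligned layers as channels and adding a distance-from-center channel to a polar grid, can be sketched with NumPy. The helper names `stack_layers` and `add_radius_channel` are hypothetical, and the default center (the middle of the image) is an assumption. For a real radar or ultrasound image the sensor origin would be passed in explicitly.

```python
import numpy as np


def stack_layers(layers):
    """Stack co-registered 2D layers into one (H, W, C) image."""
    shapes = {layer.shape for layer in layers}
    if len(shapes) != 1:
        raise ValueError(f"layers must share one shape, got {shapes}")
    return np.stack(layers, axis=-1)


def add_radius_channel(image, center=None):
    """Append a channel holding each pixel's normalized distance from `center`."""
    if image.ndim == 2:
        image = image[..., np.newaxis]  # promote (H, W) to (H, W, 1)
    height, width = image.shape[:2]
    if center is None:
        center = ((height - 1) / 2.0, (width - 1) / 2.0)  # assumed default
    rows, cols = np.mgrid[0:height, 0:width]
    radius = np.hypot(rows - center[0], cols - center[1])
    radius = radius / max(radius.max(), 1e-12)  # scale to [0, 1]
    radius = radius[..., np.newaxis].astype(np.float32)
    return np.concatenate([image.astype(np.float32), radius], axis=-1)
```

The extra channel gives a convolutional model access to position information that is otherwise lost when the cone-shaped data is treated as a plain rectangular image.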


All of the image types above are essentially very similar. In some cases, however, we can also apply typical vision ML techniques to 1D or 3D data such as audio and video.

### Spectrogram

To do machine learning on audio, we can first split the data into chunks and then apply ML to those chunks.
Audio is a 1D signal, so it is possible to use Conv1D operations in place of Conv2D, which amounts to processing the signal in the time domain. In practice, however, it is usually better to represent audio as a spectrogram: a stacked view of how the frequency spectrum of the signal varies over time. In simple terms, the x axis of a spectrogram shows time and the y axis shows frequency. Each pixel value represents the spectral density (loudness) of the audio signal at that frequency and time.

<center><image src="./imgs/22.png" width="500px"/></center>

> Representing an audio signal in the above format provides an interesting capability from an ML perspective. Since the audio signal is now a 2D image, we can use computer vision techniques and models to process audio. In fact, the same idea can be used in NLP problems as well: we can convert text into embeddings, arrange them as a 2D representation, and then apply vision ML algorithms on top of it.
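A minimal sketch of how a spectrogram can be computed with a short-time Fourier transform in NumPy follows. The function name `spectrogram` and the parameters `frame_len` and `hop` are assumptions for this example. In practice, libraries such as SciPy or librosa offer more complete implementations.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Return a power spectrogram of shape (frequency_bins, time_frames)."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)  # one spectrum per frame
    power = np.abs(spectrum) ** 2
    return power.T  # rows are frequencies (y axis), columns are time (x axis)


# Usage: a pure 1 kHz tone sampled at 8 kHz (synthetic data for illustration).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(tone)
print(spec.shape)
```

The resulting 2D array can be passed to an image model like any single-channel image, often after converting the power values to a logarithmic (decibel) scale.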

### Videos

Since a video consists of a continuous sequence of image frames, we can simply apply ML techniques to each frame, especially for tasks like classification and object identification.
Alternatively, we can process multiple frames at a time using a rolling window and then apply ML algorithms, for example with 3D convolutions. We can also use RNNs and other sequential processing methods (such as attention) to apply ML to videos.
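The rolling-window idea can be sketched as follows: a video stored as a `(T, H, W, C)` array is cut into overlapping clips that a 3D convolution or a sequence model could consume. The function name `sliding_clips` and the parameters `clip_len` and `stride` are assumptions for this example.

```python
import numpy as np


def sliding_clips(frames, clip_len, stride=1):
    """Cut a (T, H, W, C) video into clips of shape (N, clip_len, H, W, C)."""
    n_frames = frames.shape[0]
    if clip_len > n_frames:
        raise ValueError("clip_len exceeds the number of frames")
    starts = range(0, n_frames - clip_len + 1, stride)
    return np.stack([frames[s : s + clip_len] for s in starts])


# Usage: 10 blank 4x4 RGB frames, clips of 4 frames, moving 2 frames at a time.
video = np.zeros((10, 4, 4, 3), dtype=np.uint8)
clips = sliding_clips(video, clip_len=4, stride=2)
print(clips.shape)
```

A smaller stride gives more overlapping clips and therefore more training samples, at the cost of more memory and more redundant computation.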

## Labeling the Datasets

Labeling the dataset is one of the first tasks almost every data science team has to do in a new project. Depending on the goal we need to achieve, the data has to be labeled in different ways.

For classification-like tasks, the most common ways to label data are through the folder structure or with a metadata table. The problem with using folders for labeling is that it leads to data duplication when an image has multiple labels. In such situations a metadata table is preferred.
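The difference between the two approaches can be sketched with the standard library alone. The folder layout `data/<label>/<file>`, the CSV column names `filename` and `labels`, and the `|` separator are all assumptions chosen for this example.

```python
import csv
import io
from pathlib import PurePosixPath


def labels_from_folders(paths):
    """Single label per image: the parent folder name is the label."""
    return {PurePosixPath(p).name: PurePosixPath(p).parent.name for p in paths}


def labels_from_metadata(csv_text):
    """Multiple labels per image: one row per file, labels separated by '|'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["filename"]: row["labels"].split("|") for row in reader}


# Usage with hypothetical file names.
folder_labels = labels_from_folders(["data/rose/001.jpg", "data/tulip/002.jpg"])
table_labels = labels_from_metadata(
    "filename,labels\n001.jpg,rose|red\n002.jpg,tulip\n"
)
print(folder_labels)
print(table_labels)
```

With folders, giving `001.jpg` the second label `red` would require copying the file into another folder. The metadata table stores both labels against a single copy.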

When labeling data for an object detection task, we need to store the position and dimensions of each bounding box. To do that, we can add extra fields to the metadata table.
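One possible shape for such a metadata row is sketched below. The class name `BoxAnnotation` and the choice of storing the top-left corner plus width and height in pixels are assumptions. Other conventions, such as storing two corners or normalized coordinates, are also common, so the sketch includes conversions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxAnnotation:
    """One labeled bounding box, stored as top-left corner plus size in pixels."""

    filename: str
    label: str
    x_min: float
    y_min: float
    width: float
    height: float

    def corners(self):
        """Return (x_min, y_min, x_max, y_max)."""
        return (self.x_min, self.y_min, self.x_min + self.width, self.y_min + self.height)

    def normalized(self, image_width, image_height):
        """Return (x_min, y_min, width, height) as fractions of the image size."""
        return (
            self.x_min / image_width,
            self.y_min / image_height,
            self.width / image_width,
            self.height / image_height,
        )


# Usage with a hypothetical annotation.
box = BoxAnnotation("imgs/rose.jpg", "rose", x_min=10, y_min=20, width=30, height=40)
print(box.corners())
```

An image with several objects simply gets several rows, one per box, all sharing the same `filename`.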





### Labeling at scale

Many ML use cases require a sizable dataset to train the model. Several tools and techniques are available to build such a dataset efficiently and accurately.

> One such tool is the `Computer Vision Annotation Tool` (CVAT). It is a free, web-based image annotation tool that can also be installed locally.

If we need to annotate images for multiple tasks, it is efficient to use an interactive interface such as a Jupyter notebook. A package named `multi-label-pigeon` can be used for such annotation tasks.

Install the package using 
<pre>pip install multi-label-pigeon</pre>




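```python
from multi_label_pigeon import multi_label_annotate
from IPython.display import display, Image

# Images to annotate
files = ['imgs/rose.jpg']

# Each key in `options` is an independent label group with its own choices
annotations = multi_label_annotate(
    files,
    options={'flowers':['tulip', 'rose', 'sunflower'], 'color':['red', 'green', 'blue']},
    # Render each image inline in the notebook so it can be labeled
    display_fn=lambda filename: display(Image(filename, width=250))
)
```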

Running the cell renders an interactive widget in the notebook: a progress counter, one row of buttons per label group (`flowers` and `color`), and `done` and `back` buttons.

Above is an example of using the annotation package.

Another method for labeling at massive scale is voting and crowdsourcing. Images are pushed to a voting system so that multiple raters can label or tag each image. Images with consistent tags from multiple raters are assigned those tags, while images with conflicting labels are sent for verification or ignored.
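The voting step can be sketched as a simple majority vote with an agreement threshold. The function name `aggregate_votes` and the default threshold of 0.7 are assumptions for this example. Real crowdsourcing platforms often also weight raters by their reliability.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.7):
    """Split images into accepted labels and images that need review.

    `votes` maps an image name to the list of labels given by its raters.
    """
    accepted, needs_review = {}, []
    for image, labels in votes.items():
        if not labels:
            needs_review.append(image)
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[image] = label  # raters agree strongly enough
        else:
            needs_review.append(image)  # conflicting labels
    return accepted, needs_review


# Usage with hypothetical rater votes.
accepted, needs_review = aggregate_votes(
    {
        "a.jpg": ["rose", "rose", "rose"],
        "b.jpg": ["rose", "tulip", "sunflower"],
    }
)
print(accepted, needs_review)
```

Raising `min_agreement` yields cleaner labels but sends more images to manual verification.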

## Automated Labeling

Many deep learning applications need a massive amount of labeled data, but time and cost constraints make it impossible to manually label every data point. In such scenarios we can use several workaround techniques.

One is inferring labels from related data. In some cases we can infer the label of a sample by looking only at a specific part of the data or at its context. This way we can easily obtain the label of the data point in question.
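As one hypothetical example of inferring a label from context, the sketch below assigns a label to an image based on keywords in its caption. The function name `infer_label`, the keyword table, and the rule of skipping ambiguous captions are all assumptions made for illustration.

```python
import re


def infer_label(caption, keyword_to_label):
    """Return the label implied by the caption, or None if missing or ambiguous."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    matches = {label for keyword, label in keyword_to_label.items() if keyword in words}
    return matches.pop() if len(matches) == 1 else None


# Usage with hypothetical captions.
keywords = {"rose": "rose", "roses": "rose", "tulip": "tulip", "tulips": "tulip"}
print(infer_label("A red Rose in my garden!", keywords))
print(infer_label("Roses and tulips on the table", keywords))
```

Captions that match no label or more than one label are left unlabeled, so they can be handled manually instead of adding noisy labels to the dataset.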

Another method is a technique called `Noisy Student`. The idea is to use an ML model trained on a small labeled dataset to label a large amount of data iteratively. Its general steps are as follows.

- Manually label a small dataset.
- Train a small ML model (the teacher model) on this dataset.
- Use the teacher model to label a large set of unlabeled data.
- Train another model (the student model) on the manually labeled dataset together with the newly labeled data. It is important to add noise such as dropout and data augmentation so that the model generalizes.
- Iterate the process, making the student model the new teacher model.

This way we can build a large dataset with relatively little labeling effort.
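The loop described above can be sketched in a heavily simplified form. This is not the original Noisy Student setup: a toy nearest-centroid classifier stands in for the neural networks, and Gaussian jitter on the inputs stands in for dropout and data augmentation. All function names and parameters here are assumptions for illustration.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Toy 'model': one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict(centroids, X):
    """Assign each sample to the class of its nearest centroid."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def self_train(X_lab, y_lab, X_unlab, n_classes, rounds=3, noise_std=0.1, seed=0):
    """Teacher labels the unlabeled data, then a noised student is trained on all of it."""
    rng = np.random.default_rng(seed)
    teacher = fit_centroids(X_lab, y_lab, n_classes)  # trained on manual labels only
    for _ in range(rounds):
        pseudo_labels = predict(teacher, X_unlab)  # teacher labels the large set
        X_all = np.concatenate([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo_labels])
        # Noise is applied only while training the student (stand-in for augmentation).
        X_noisy = X_all + rng.normal(0.0, noise_std, size=X_all.shape)
        student = fit_centroids(X_noisy, y_all, n_classes)
        teacher = student  # the student becomes the next teacher
    return teacher
```

In a real pipeline, low-confidence pseudo-labels are usually filtered out before training the student, which this sketch omits.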