<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/lecture_12-005.png" width="700">

Besides classification, computer vision (CV) solves other tasks as well.

## COCO

COCO (Common Objects in Context) is one of the most popular datasets containing data for segmentation and detection.

It includes:

- categories
- masks
- bounding boxes
- captions
- person_keypoints
...

https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoDemo.ipynb

In [None]:
!wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
!unzip annotations_trainval2017.zip

The `pycocotools` package is used to work with the dataset.

The COCO instance annotations contain 80 object categories; their IDs range from 1 to 90 because some IDs are unused.

https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch

In [None]:
from pycocotools.coco import COCO

coco=COCO('annotations/instances_val2017.json')

cat_ids = coco.getCatIds()
print("Categories count",len(cat_ids))
print(cat_ids)

0 is not used as a category ID. It is usually reserved for the background class.

In [None]:
# display COCO categories and supercategories

cats = coco.loadCats(coco.getCatIds())
num2cat =  {}
print('COCO categories: ')
for cat in cats:
  num2cat[cat['id']] = cat['name']
  print( cat['id'], ":" , cat['name'],end="   " )
#print('COCO categories: \n{}\n'.format(' '.join(nms)))



There are also supercategories.

In [None]:
print(cats[2])
print(cats[3])


nms = set([cat['supercategory'] for cat in cats])
print('COCO supercategories: \n{}'.format(' '.join(nms)))

The dataset is large, so it is convenient to load the data in parts.

In [None]:
# get all images containing the given categories and take the first one
catIds = coco.getCatIds(catNms=['person','cat']); # person and cat
imgIds = coco.getImgIds(catIds=catIds );
#imgIds = coco.getImgIds() #imgIds = [324158]
print("Total images with person and cat ",len(imgIds))
print(imgIds)
img_list = coco.loadImgs(imgIds[0])
img = img_list[0]
print("Image data", img)

Load image

In [None]:
import skimage.io as io
import matplotlib.pyplot as plt

I = io.imread(img['coco_url'])
plt.axis('off')
plt.imshow(I)
plt.show()

Conversion to PIL format

In [None]:
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

def coco2pil(url):
  print(url)
  response = requests.get(url) # download the image given by the argument, not the global img
  return Image.open(BytesIO(response.content))

pil_img = coco2pil(img['coco_url'])
plt.imshow(pil_img)

Annotation information

https://cocodataset.org/#format-data


```
  "segmentation" : RLE or [polygon],
  "area" : float,
  "bbox" : [x,y,width,height],
  "iscrowd" : 0 or 1,
```

A polygon is a flat list of coordinates [x1,y1, x2,y2 ... ]

An object can be described by several polygons (for example, when it is partly occluded).
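
To make the flat polygon layout concrete, here is a minimal sketch in plain Python. The polygon values are made up for illustration, and `polygon_to_points`, `polygon_area` and `polygon_bbox` are hypothetical helpers, not part of `pycocotools`.

```python
def polygon_to_points(flat):
    """Turn [x1, y1, x2, y2, ...] into [(x1, y1), (x2, y2), ...]."""
    return list(zip(flat[0::2], flat[1::2]))

def polygon_area(points):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_bbox(points):
    """Bounding box in the COCO layout [x, y, width, height]."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]

# Made-up annotation: one object described by two polygons (two rectangles)
segmentation = [
    [10, 10, 50, 10, 50, 30, 10, 30],
    [60, 10, 80, 10, 80, 30, 60, 30],
]
parts = [polygon_to_points(p) for p in segmentation]
area = sum(polygon_area(p) for p in parts)
bbox = polygon_bbox(parts[0] + parts[1])
print(parts[0], area, bbox)
```

The area of the object is the sum over its polygons, while the bounding box is taken over all of their points together.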

In [None]:
# load and display instance annotations
plt.imshow(I); plt.axis('off')
annIds = coco.getAnnIds(imgIds=img['id'])
anns = coco.loadAnns(annIds)

def dump_anns(anns):
  for i, a in enumerate(anns):
    print(f"#{i}")
    for k in a.keys():
      if k == 'category_id' and num2cat.get(a[k],None):
        print(k,": ",a[k], num2cat[a[k]]) # Show cat. name
      else:
        print(k,": ",a[k])

dump_anns(anns)
coco.showAnns(anns)

What is [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)?

Run-length encoding: a sequence is stored as the lengths of its runs of identical values.

https://www.youtube.com/watch?v=h6s61a_pqfM
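
As an illustration, the sketch below encodes and decodes a tiny binary mask in plain Python. It assumes the uncompressed COCO layout: pixels are read column by column and the first count always refers to zeros. `rle_encode` and `rle_decode` are hypothetical helpers written for this example, not `pycocotools` functions.

```python
def rle_encode(mask):
    """mask: list of rows with 0/1 values. Returns {'counts': [...], 'size': [h, w]}."""
    h, w = len(mask), len(mask[0])
    # Column-major (Fortran) order, as assumed for the COCO format
    flat = [mask[r][c] for c in range(w) for r in range(h)]
    counts, prev, run = [], 0, 0  # the first run always counts zeros
    for v in flat:
        if v == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = v, 1
    counts.append(run)
    return {'counts': counts, 'size': [h, w]}

def rle_decode(rle):
    """Rebuild the list-of-rows mask from the run lengths."""
    h, w = rle['size']
    flat, value = [], 0
    for run in rle['counts']:
        flat.extend([value] * run)
        value = 1 - value  # runs alternate between 0 and 1
    return [[flat[c * h + r] for c in range(w)] for r in range(h)]

mask = [
    [0, 1, 1],
    [0, 1, 0],
]
rle = rle_encode(mask)
print(rle)
print(rle_decode(rle) == mask)
```

A mask that starts with a 1 gets a leading count of 0, so the decoder can always begin with zeros.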

In [None]:
plt.rcParams["figure.figsize"] = (160,80)
print(catIds)
annIds = coco.getAnnIds(catIds=catIds, iscrowd = True)
anns = coco.loadAnns(annIds[0:1])

dump_anns(anns)

img = coco.loadImgs(anns[0]['image_id'])[0]
I = io.imread(img['coco_url'])
plt.imshow(I); #plt.axis('off')
coco.showAnns(anns) # People in the stands 
seg = anns[0]['segmentation']

print('Counts',len(seg['counts']))
print('Size',seg['size'])

How do we get the mask as an array?

In [None]:
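plt.rcParams["figure.figsize"] = (10,6)
import numpy as np
img_id = 448263
annIds = coco.getAnnIds(imgIds=[img_id])

anns = coco.loadAnns(annIds)
# Take the mask size from the image metadata rather than from a previous cell
img_info = coco.loadImgs(img_id)[0]
msk = np.zeros((img_info['height'], img_info['width']))
for ann in anns:
  msk += coco.annToMask(ann) # binary mask (h, w) of one annotation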
print(msk.shape)
plt.figure()
plt.imshow(msk)

print(msk)
print(np.unique(msk))

## Semantic segmentation

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/lecture_12-007.png" width="700">

Problem statement:

Predict a class for every pixel.

Input data (annotation): mask

[ x,y - > class_num ]

Output data: mask

[ x,y - > class_num ]


Approaches

*  a) Naive.
<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/lecture_12-012.png" width="700">


Slide a window over the image and predict the class of each pixel, taking its neighbors into account.

*  b) Reasonable
<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/lecture_12-015.png" width="700">


Remove the linear layer at the end of the network. Determine the class of each pixel from the features across all channels.

<img src ="http://edunet.kea.su/repo/src/L09_CNN_Architectures/img/L09_CNN_Architectures_15.png"  width="700">

In lecture 8 we discussed that a 1x1 convolution can be viewed as a fully connected layer.


This is exactly how it is used in segmentation.

The number of output channels equals the number of classes.
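
A minimal sketch of this idea in plain Python: the same small linear layer is applied to the channel vector of every pixel, and the predicted class is the argmax over the output channels. The feature values, weights and the helper names `conv1x1` and `argmax_over_classes` are made up for illustration.

```python
def conv1x1(feature_map, weights, bias):
    """feature_map: [C][H][W]; weights: [K][C]; bias: [K]. Returns scores [K][H][W]."""
    C, H, W = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [[[bias[k] + sum(weights[k][c] * feature_map[c][y][x] for c in range(C))
              for x in range(W)]
             for y in range(H)]
            for k in range(len(weights))]

def argmax_over_classes(scores):
    """Pick the class with the highest score at every pixel."""
    K, H, W = len(scores), len(scores[0]), len(scores[0][0])
    return [[max(range(K), key=lambda k: scores[k][y][x]) for x in range(W)]
            for y in range(H)]

# 2 feature channels on a 2x2 map, 3 classes
features = [[[1, 0], [0, 2]],
            [[0, 1], [3, 0]]]
weights = [[1, 0], [0, 1], [-1, -1]]
bias = [0, 0, 0.5]

scores = conv1x1(features, weights, bias)
class_map = argmax_over_classes(scores)
print(class_map)
```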


Problems:
- a large receptive field is needed, hence many layers (L 3x3 conv layers -> 1+2L receptive field)
- very slow on full-resolution activation maps
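
The receptive field formula can be checked with a few lines of Python. This is a sketch that assumes stride-1 convolutions without dilation; the target size of 224 pixels is only an illustrative number.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of stride-1 convolutions without dilation."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1  # each layer widens the field by kernel_size - 1
    return rf

def layers_needed(target_rf, kernel_size=3):
    """Smallest number of layers whose receptive field covers target_rf pixels."""
    layers = 0
    while receptive_field(layers, kernel_size) < target_rf:
        layers += 1
    return layers

for L in (1, 2, 5, 10):
    print(L, receptive_field(L))
print(layers_needed(224))
```

Covering a 224-pixel context this way already takes more than a hundred layers running at full resolution, which is why the next approach reduces the spatial size first.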

*  c) Efficient

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/lecture_12-017.png" width="700">


Let us return to the traditional architecture that shrinks the spatial dimensions, and add an expanding block after it.

## Autoencoder

This architecture is quite popular and is used not only for segmentation:


<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-07.png" width="700">


- noise smoothing;
- dimensionality reduction -> feature vector;
- data generation;
...


## How does the expanding (upsample) block work?

### Resizing images

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-08.png" width="700">


<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-10.png" width="700">

Feature maps can be treated the same way as pixels. PyTorch provides the `nn.Upsample` module for this.

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-11.png" width="700">
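
To show what the simplest mode does, here is a plain Python sketch of nearest-neighbor upsampling on a 2x2 grid. `upsample_nearest` is a hypothetical helper for illustration; it is not how `nn.Upsample` is implemented.

```python
def upsample_nearest(grid, scale=2):
    """grid: list of rows. Each value is repeated `scale` times along both axes."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(scale)]   # repeat along the width
        out.extend([list(wide) for _ in range(scale)])  # repeat along the height
    return out

small = [[1, 2],
         [3, 4]]
big = upsample_nearest(small)
for row in big:
    print(row)
```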

In [None]:
import torch
from torch import nn
import torchvision.transforms.functional as TF
import matplotlib.pyplot as plt

def upsample( pil, mode='nearest' ):
  tensor = TF.to_tensor(pil)
  upsampler = nn.Upsample(scale_factor=2, mode=mode) # use the requested interpolation mode
  tensor_128 = upsampler(tensor.unsqueeze(0))
  # bicubic interpolation can overshoot [0, 1], so clamp before converting to an image
  im_128 = TF.to_pil_image(tensor_128.squeeze().clamp(0, 1)).convert("RGB")
  fig = plt.figure()
  fig.suptitle(mode)
  plt.imshow(im_128)


man_with_cat = coco2pil('http://images.cocodataset.org/val2017/000000223747.jpg')
pil_64 = man_with_cat.resize((64,64))
plt.figure()
plt.imshow(pil_64)


upsample( pil_64, mode='nearest' )
upsample( pil_64, mode='bilinear' )
upsample( pil_64, mode='bicubic' )


### Bed of Nails


<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-12.png" width="700">

A way to restore the spatial size in which the elements of the input feature map are copied unchanged and the new cells around them are filled with zeros.
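
A minimal plain Python sketch of this scheme, assuming each value goes to the top-left corner of its enlarged cell; `bed_of_nails` is a hypothetical helper name.

```python
def bed_of_nails(grid, stride=2):
    """Copy every value to the top-left corner of a stride x stride cell; the rest stays zero."""
    h, w = len(grid), len(grid[0])
    out = [[0] * (w * stride) for _ in range(h * stride)]
    for y in range(h):
        for x in range(w):
            out[y * stride][x * stride] = grid[y][x]
    return out

small = [[1, 2],
         [3, 4]]
nails = bed_of_nails(small)
for row in nails:
    print(row)
```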

### MaxUnpooling

The key difference is that the indices of the elements are remembered.

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/lecture_12-019.png" width="700">

We save the indices from every max pooling layer.
When increasing the resolution, we copy the values from the max pooling output to the positions given by the saved indices.
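
The two steps can be sketched in plain Python on a 4x4 grid. This illustration stores the position of each maximum as a (row, col) pair, whereas PyTorch stores flat indices; the helper names are hypothetical.

```python
def max_pool_with_indices(grid, k=2):
    """Return pooled values and, for each of them, the (row, col) of the maximum in the input."""
    h, w = len(grid) // k, len(grid[0]) // k
    pooled, indices = [], []
    for py in range(h):
        prow, irow = [], []
        for px in range(w):
            window = [(grid[y][x], (y, x))
                      for y in range(py * k, (py + 1) * k)
                      for x in range(px * k, (px + 1) * k)]
            value, pos = max(window, key=lambda item: item[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices

def max_unpool(pooled, indices, out_h, out_w):
    """Put every pooled value back at its remembered position; all other cells are zero."""
    out = [[0] * out_w for _ in range(out_h)]
    for prow, irow in zip(pooled, indices):
        for value, (y, x) in zip(prow, irow):
            out[y][x] = value
    return out

grid = [[1, 2, 6, 3],
        [3, 5, 2, 1],
        [1, 2, 2, 1],
        [7, 3, 4, 8]]
pooled, indices = max_pool_with_indices(grid)
restored = max_unpool(pooled, indices, 4, 4)
print(pooled)
for row in restored:
    print(row)
```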


<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-14.png" width="700">

https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html

https://pytorch.org/docs/stable/generated/torch.nn.MaxUnpool2d.html?highlight=unpooling

In [None]:
import torch
from torch import nn
import torchvision.transforms.functional as TF
import matplotlib.pyplot as plt

def tensor_show( tensor,title = ''):
  im = TF.to_pil_image(tensor.squeeze()).convert("RGB")
  fig = plt.figure()
  fig.suptitle(title + str(im.size))
  plt.imshow(im)

pool = nn.MaxPool2d(kernel_size = 2, return_indices = True) # False by default
unpool = nn.MaxUnpool2d(kernel_size = 2)

pil = coco2pil('http://images.cocodataset.org/val2017/000000223747.jpg')
fig = plt.figure()
fig.suptitle('original ' +str(pil.size))
plt.imshow(pil)
tensor = TF.to_tensor(pil).unsqueeze(0)
print("Initial shape",tensor.shape)

# Downsample
tensor_half_res, indexes1  = pool(tensor)
print("Indexes shape",indexes1.shape)
tensor_show( tensor_half_res,"1/2 down")



tensor_q_res, indexes2  = pool(tensor_half_res)
tensor_show( tensor_q_res, "1/4 down")

# Upsample
tensor_half_res1 = unpool(tensor_q_res,indexes2)

#https://pytorch.org/docs/stable/nn.html#padding-layers
pad = nn.ZeroPad2d((0,0,0,1))
tensor_half_res1 = pad(tensor_half_res1)
tensor_show( tensor_half_res1 ,"1/2 up")

tensor_recovered = unpool(tensor_half_res1, indexes1)
tensor_show( tensor_recovered, "full size up")




Why is the pad needed?
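
One way to reason about it is to follow the height of the tensor through the layers. The sketch below assumes `MaxPool2d(kernel_size=2)` with its default stride and floor rounding, and the default output size of `MaxUnpool2d`. The image height of 426 is a hypothetical value chosen so that the half-resolution height is odd.

```python
def pool_size(n, k=2):
    """Output length of max pooling with kernel k and stride k along one axis (floor mode)."""
    return n // k

def unpool_size(n, k=2):
    """Default output length of max unpooling with kernel k and stride k along one axis."""
    return n * k

height = 426                    # hypothetical image height
half = pool_size(height)        # 213
quarter = pool_size(half)       # 106, the last row of the 213 is dropped by the floor
half_up = unpool_size(quarter)  # 212, one row short of `half`
print(half, quarter, half_up, half - half_up)
```

Under these assumptions the second unpooling needs an input with the same shape as `indexes1`, so the missing row has to be added back before it. The width is not padded in the notebook, which suggests its half-resolution width is even.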


### Transpose convolution

The methods for restoring spatial dimensions that we have considered so far contain no trainable parameters.


Regular convolution:

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-16.png" width="700">


Upsample/transpose convolution

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-17.png" width="700">


<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-15.png" width="700">

https://medium.com/@_init_/an-illustrated-explanation-of-performing-2d-convolutions-using-matrix-multiplications-1e8de8cd2544
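
A one-dimensional sketch in plain Python may help: every input value adds a scaled copy of the kernel to the output, and consecutive copies are shifted by `stride`. The helper `conv_transpose_1d` is hypothetical and ignores padding, bias and channels.

```python
def conv_transpose_1d(x, kernel, stride=2):
    """Each input value adds a scaled copy of the kernel to the output, `stride` cells apart."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight  # overlapping copies are summed
    return out

overlap = conv_transpose_1d([1.0, 2.0], [1.0, 1.0, 1.0])
nails = conv_transpose_1d([1.0, 2.0], [1.0, 0.0])
print(overlap)
print(nails)
```

With the kernel `[1, 0]` and stride 2 the result matches Bed of Nails, so a transposed convolution can be seen as its trainable generalization.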


PyTorch
<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/L8-19.png" width="700">

https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html?highlight=transpose#convtranspose2d

In [None]:
# With square kernels and equal stride
m = nn.ConvTranspose2d(16, 33, 3, stride=2)
# non-square kernels and unequal stride and with padding
m = nn.ConvTranspose2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
input = torch.randn(20, 16, 50, 100)
output = m(input)
# exact output size can be also specified as an argument
input = torch.randn(1, 16, 12, 12)
downsample = nn.Conv2d(16, 16, 3, stride=2, padding=1)
upsample = nn.ConvTranspose2d(16, 16, 3, stride=2, padding=1)
h = downsample(input)
print(h.size())
output = upsample(h, output_size=input.size())
print(output.size())
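
The need for `output_size` in the last example follows from the size formulas. Below is a small sketch in plain Python, assuming dilation 1; the function names are hypothetical, and the formulas are the ones given in the PyTorch documentation for `Conv2d` and `ConvTranspose2d`.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_transpose_out(n, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

n = 12
h = conv_out(n, kernel=3, stride=2, padding=1)             # 6
up = conv_transpose_out(h, kernel=3, stride=2, padding=1)  # 11
up_fixed = conv_transpose_out(h, kernel=3, stride=2, padding=1, output_padding=1)  # 12
print(h, up, up_fixed)
print(conv_out(11, kernel=3, stride=2, padding=1))  # 6 as well
```

Inputs of length 11 and 12 both shrink to 6, so the inverse mapping is ambiguous. Passing `output_size` tells the layer which of the two sizes to produce.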


## Unet

## Pretrained models for segmentation

FCN

Fully Convolutional Network.
To avoid confusion with a Fully Connected Network,
the latter is called an MLP (Multi Layer Perceptron).

Usage example
https://pytorch.org/hub/pytorch_vision_fcn_resnet101/

https://pytorch.org/vision/stable/models.html#semantic-segmentation

The pre-trained models have been trained on a subset of COCO train2017, on the 20 categories that are present in the Pascal VOC dataset. You can see more information on how the subset has been selected in references/segmentation/coco_utils.py. The classes that the pre-trained model outputs are the following, in order:
['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']

In [None]:
import torchvision
from PIL import Image
from torchvision import transforms
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

classes = ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']

fcn_model = torchvision.models.segmentation.fcn_resnet50(pretrained=True,  num_classes=21)
fcn_model.eval() # inference mode: BatchNorm uses its stored statistics

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), #ImageNet
])

input_tensor = preprocess(pil_img)

with torch.no_grad():
    output = fcn_model(input_tensor.unsqueeze(0))#['out'][0]




Two tensors are returned:

* out - at each location, there are unnormalized probabilities corresponding to the prediction of each class

* aux - contains the auxiliary loss values per-pixel. In inference mode, output['aux'] is not useful

In [None]:
a = torch.tensor([
                  [[1,4],[7,0.99]],
                  [[0.8,1],[9,12]]
                ]
                  )
a
torch.argmax(a, dim=0)


In [None]:
print(output.keys()) # Ordered dictionary
print("out", output['out'].shape,"Batch, class_num, h, w")
print("aux", output['aux'].shape,"Batch, class_num, h, w")
# output['aux'] contains the auxiliary loss values per-pixel. In inference mode it is not useful

# At each location there are unnormalized probabilities for each class; argmax picks the predicted class
output_predictions = output['out'][0].argmax(0)  # for the first element of the batch
print(output_predictions.shape)
print(output_predictions,torch.max(output_predictions))
indexes = output_predictions
for i, cls_name in enumerate(classes):
  # Binary mask of the pixels assigned to class i
  mask = torch.zeros(indexes.shape)
  mask[indexes == i] = 255
  fig = plt.figure()
  fig.suptitle(cls_name)
  plt.imshow(mask)

In [None]:
# create a color palette, selecting a color for each class
palette = torch.tensor([2 ** 25 - 1, 2 ** 15 - 1, 2 ** 21 - 1])
colors = torch.as_tensor([i for i in range(21)])[:, None] * palette
colors = (colors % 255).numpy().astype("uint8")

# plot the semantic segmentation predictions of 21 classes in each color
r = Image.fromarray(output_predictions.byte().cpu().numpy()).resize(pil_img.size)
r.putpalette(colors)

import matplotlib.pyplot as plt
plt.imshow(r)
# plt.show()

DeepLab

## Accuracy evaluation

### IoU

<img src ="http://edunet.kea.su/repo/src/L12_Segmentation_Detection/img/lecture_12-115.png" width="700">
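
For segmentation masks, IoU is the number of pixels in the intersection divided by the number of pixels in the union. Below is a minimal plain Python sketch on made-up class maps. `iou` and `mean_iou` are hypothetical helpers, and averaging only over the classes that are present is one possible convention.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as lists of rows."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0

def mean_iou(pred, target, num_classes):
    """Average IoU over the classes present in the prediction or in the target."""
    scores = []
    for c in range(num_classes):
        p = [[v == c for v in row] for row in pred]
        t = [[v == c for v in row] for row in target]
        if any(map(any, p)) or any(map(any, t)):
            scores.append(iou(p, t))
    return sum(scores) / len(scores)

pred   = [[0, 1, 1],
          [0, 1, 0]]
target = [[0, 1, 1],
          [0, 0, 0]]
print(iou(pred, target))          # IoU of class 1 (masks are already binary)
print(mean_iou(pred, target, 2))  # averaged over classes 0 and 1
```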

### COCO mAP

Converting segmentation results to the COCO format.

https://www.javaer101.com/en/article/18652684.html

Synchronize the labels

In [None]:
pascal2coco = {}
print(classes)
def find_in_dic(dic,val):
  for key in dic.keys():
    if dic.get(key,None) == val:
      return key
  return 0 # Assign missed classes to bg

print(num2cat)
for i in range(1, len(classes)): # Skip BG
  #if cats.get()
  coco_ind = find_in_dic(num2cat,classes[i])
  pascal2coco[i] = coco_ind

print(pascal2coco) 
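
Exact name matching sends several Pascal VOC classes to 0, because COCO spells some names differently. The sketch below shows one possible fix with an alias table. The correspondences in `voc_to_coco_name` are an assumption to verify against the printed `num2cat`, and `build_mapping`, `demo_num2cat` and `demo_classes` are hypothetical names used only so the example runs on its own.

```python
# Assumed Pascal VOC -> COCO name correspondences; verify against the printed num2cat
voc_to_coco_name = {
    'aeroplane': 'airplane',
    'diningtable': 'dining table',
    'motorbike': 'motorcycle',
    'pottedplant': 'potted plant',
    'sofa': 'couch',
    'tvmonitor': 'tv',
}

def build_mapping(voc_classes, coco_id_to_name, aliases):
    """Map the index of each VOC class to a COCO category id (0 if no match is found)."""
    name_to_id = {name: cat_id for cat_id, name in coco_id_to_name.items()}
    mapping = {}
    for i, voc_name in enumerate(voc_classes):
        if i == 0:
            continue  # skip background
        coco_name = aliases.get(voc_name, voc_name)
        mapping[i] = name_to_id.get(coco_name, 0)
    return mapping

# Tiny stand-in for num2cat and classes
demo_num2cat = {1: 'person', 5: 'airplane', 63: 'couch'}
demo_classes = ['__background__', 'aeroplane', 'person', 'sofa', 'unicorn']
demo_mapping = build_mapping(demo_classes, demo_num2cat, voc_to_coco_name)
print(demo_mapping)
```

In the notebook the same function could be called with `classes` and `num2cat` in place of the demo values.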


Create gt file


annIds = coco.getAnnIds(catIds=catIds, iscrowd = True)
anns = coco.loadAnns(annIds[0:1])

In [None]:
from pycocotools import mask as mask_utils

import numpy as np
from itertools import groupby

def binary_mask_to_rle(binary_mask):
    # Uncompressed RLE: run lengths in column-major order, starting with a run of zeros
    rle = {'counts': [], 'size': list(binary_mask.shape)}
    counts = rle.get('counts')
    for i, (value, elements) in enumerate(groupby(binary_mask.ravel(order='F'))):
        if i == 0 and value == 1:
            counts.append(0)
        counts.append(len(list(elements)))
    return rle

detection_res = []
for i, cls_name in enumerate(classes):
  binary_mask = torch.zeros(indexes.shape)
  binary_mask[indexes == i] = 1

  if i > 0 and torch.max(binary_mask) > 0: # skip background and absent classes
    uncompressed_rle = binary_mask_to_rle(binary_mask.numpy())
    # pycocotools expects a Fortran-ordered uint8 array
    fortran_binary_mask = np.asfortranarray(binary_mask.numpy().astype('uint8'))
    encoded_mask = mask_utils.encode(fortran_binary_mask)
    bbox = mask_utils.toBbox(encoded_mask).tolist() # [x, y, width, height]
    print(bbox)

    detection_res.append({
        'score': 1., #dummy
        'category_id': pascal2coco[i],
        #'segmentation' : uncompressed_rle,
        'bbox': bbox,
        'image_id': 448263,
    })

print(detection_res)

In [None]:
import json 
'''
for anno in anns:
        detection_res.append({
            'score': 1.,
            'category_id': anno['category_id'],
            'bbox': anno['bbox'],
            'image_id': anno['image_id']
        })
'''    
 
with open('seg_gt.json', 'w', encoding='utf-8') as f:
  json.dump(anns, f, ensure_ascii=False, indent=4)

In [None]:
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json

# Save the predictions as a COCO-style results file
with open('seg_res.json', 'w', encoding='utf-8') as f:
  json.dump(detection_res, f, ensure_ascii=False, indent=4)

# loadRes builds a new COCO instance from a results file,
# reusing the images and categories of the object it is called on
coco_gt = coco.loadRes('seg_gt.json')
coco_dt = coco_gt.loadRes('seg_res.json')

cocoEval = COCOeval(coco_gt, coco_dt,'bbox') # 'segm', 'bbox'
cocoEval.evaluate()
cocoEval.accumulate()
cocoEval.summarize()

print(cocoEval.stats)