# Example Cognitive Vision System (CVS)

In this tutorial, we present an example CVS that uses neural networks and Bayesian networks for classification and scene understanding in videos. 
* The neural network classifies each frame in a video. 
* The Bayesian network uses scene understanding to improve the probabilistic inference of the neural network.


---


See the notebook that uses cogvis for image classification before diving into this one. Details covered in the classification notebook are omitted here for brevity. 

## Initialisation

In [3]:
!pip install av

Collecting av
  Downloading https://files.pythonhosted.org/packages/41/b7/4b1095af7f8e87c0f54fc0a3de9472d09583eaf2e904a60f0817819fff11/av-8.0.3-cp36-cp36m-manylinux2010_x86_64.whl (37.2MB)
Installing collected packages: av
Successfully installed av-8.0.3


In [4]:
!pip install cogvis-0.0.1-py3-none-any.whl

Processing ./cogvis-0.0.1-py3-none-any.whl
Installing collected packages: cogvis
Successfully installed cogvis-0.0.1


In [5]:
import random

import av
import pandas as pd
import pickle
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from PIL import Image
from scipy import special
from torch.optim import lr_scheduler
# and most importantly...
from cogvis.classification import nn_classification

In [6]:
SEED = 1234

random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Dataset

We use a subset of the CIFAR-100 dataset in this application. There are ten classes, each falling into one of two categories: indoor scenery (bed, chair, couch, table, wardrobe) or outdoor scenery (forest, mountain, plain, sea, cloud). This separation of classes allows us to incorporate scene understanding into our application using Bayesian networks.  

There are only 500 training images for each class and each image is only 32 x 32 pixels. So, we are not expecting a terribly accurate model. All we need is a reasonable model that we can improve using Bayesian networks.



---

The images from CIFAR-100 are stored in HDF5 files. The code that produces these files is included below. Before you run any of this code, you must download and `tar -xvzf` the relevant files from this [website](https://www.cs.toronto.edu/~kriz/cifar.html). 

In [7]:
import h5py

def unpickle(file):
    with open(file, 'rb') as fo:
        data_dict = pickle.load(fo, encoding='latin1')
    return data_dict


def save_images(file_path, meta_path, save_path, target_fine_labels):
    meta = unpickle(meta_path)
    data = unpickle(file_path)
    fine_label_names = meta['fine_label_names']

    idx_to_label = {idx: l for idx, l in enumerate(fine_label_names)}
    label_idxs = data['fine_labels']
    imgs = data['data']

    imgs = imgs.reshape(len(imgs),3,32,32).transpose(0,2,3,1) 
    img_dict = {label: [] for label in target_fine_labels}

    for label_idx, img in zip(label_idxs, imgs):
        label = idx_to_label[label_idx]
        if label in target_fine_labels:
            img_dict[label].append(img)

    # Write one dataset per class label; the context manager closes the file
    with h5py.File(save_path, 'w') as f:
        for label, imgs in img_dict.items():
            imgs = np.array(imgs)
            f.create_dataset(
                    label, 
                    np.shape(imgs), 
                    h5py.h5t.STD_U8BE, # 8-bit unsigned integer
                    data=imgs,
                    )


def main():
    meta_path = 'cifar-100-python/meta'
    train_path = 'cifar-100-python/train'
    val_path = 'cifar-100-python/test'
    target_labels = ['cloud', 'forest', 'mountain', 'plain', 'sea', 'bed', 
            'chair', 'couch', 'table', 'wardrobe']
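    save_images(train_path, meta_path, 'train.hdf5', target_labels)
    # val.hdf5 is loaded later in this notebook, so write it as well
    save_images(val_path, meta_path, 'val.hdf5', target_labels)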
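
The reshape and transpose in `save_images` are worth a quick check. As described on the CIFAR page, each row of the data array stores 3072 bytes: 1024 red values, then 1024 green, then 1024 blue, each plane in row-major order. The sketch below uses a synthetic row (made-up values, not real CIFAR data) to confirm that the result is a height x width x channel image.

```python
import numpy as np

# One synthetic "image" in CIFAR's flat layout: 1024 red, 1024 green, 1024 blue.
# The constant values 10/20/30 are made up so each channel is easy to recognise.
flat = np.concatenate([
    np.full(1024, 10, dtype=np.uint8),  # red plane
    np.full(1024, 20, dtype=np.uint8),  # green plane
    np.full(1024, 30, dtype=np.uint8),  # blue plane
])
rows = flat[np.newaxis, :]  # shape (1, 3072), like data['data'] with one image

# Same operation as in save_images: (N, 3, 32, 32) -> (N, 32, 32, 3)
imgs = rows.reshape(len(rows), 3, 32, 32).transpose(0, 2, 3, 1)

print(imgs.shape)      # (1, 32, 32, 3)
print(imgs[0, 0, 0])   # [10 20 30]
```

Every pixel ends up holding its own (red, green, blue) triple, which is the layout PIL and matplotlib expect.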

## Data loading and transforms
We apply augmentation transforms to the training data to improve the robustness of the resulting model, and only the necessary transforms to the validation data. Images are normalised with a per-channel mean and standard deviation, estimated from a single batch of 1000 training images.

In [8]:
train_path = '/content/drive/My Drive/cifar_data_hdf5/train.hdf5'
val_path = '/content/drive/My Drive/cifar_data_hdf5/val.hdf5'
#train_path = '/content/drive/My Drive/cifar_data/train/'
#val_path = '/content/drive/My Drive/cifar_data/val/'

transform = nn_classification.data_transform(ToTensor=())
_, loader = nn_classification.data_loader(train_path, 1000, transform=transform)
data, _ = next(iter(loader))
means = [data[:,c].mean().tolist() for c in range(len(data[0]))] 
stds = [data[:,c].std().tolist() for c in range(len(data[0]))]

batch_size = 32
train_transform = nn_classification.data_transform(
    ColorJitter=(),
#    RandomHorizontalFlip=(), # only works with PIL images not HDF5
#    RandomPerspective=(),    # only works with PIL images not HDF5
    ToTensor=(),
    Normalize=(means, stds),
)

val_transform = nn_classification.data_transform(
    ToTensor=(),
    Normalize=(means, stds),
) 

train_size, train_loader = nn_classification.data_loader(train_path, batch_size, 
                                                      transform=train_transform)

val_size, val_loader = nn_classification.data_loader(val_path, batch_size, 
                                                     transform=val_transform)
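
The per-channel statistics computed at the top of the cell above can be sketched with plain NumPy. This is an illustration on a synthetic batch, not the cogvis implementation; the array name `batch` and its random contents are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one batch of images after ToTensor: (N, C, H, W) in [0, 1]
batch = rng.random((8, 3, 32, 32))

# One mean and one standard deviation per channel, pooled over images and pixels
means = batch.mean(axis=(0, 2, 3))
stds = batch.std(axis=(0, 2, 3), ddof=1)  # ddof=1 matches torch.Tensor.std's default

# Normalize-style standardisation: broadcast the statistics over N, H and W
normalised = (batch - means[None, :, None, None]) / stds[None, :, None, None]

print(normalised.mean(axis=(0, 2, 3)))         # approximately 0 per channel
print(normalised.std(axis=(0, 2, 3), ddof=1))  # approximately 1 per channel
```

The statistics come from the training data only and are reused for validation and video frames, so all inputs are scaled the same way.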

## Training/Validation

We use an existing neural network architecture, modified to output the desired number of classes. The returned model has the weights that produced the highest validation accuracy.

In [52]:
criterion = nn.CrossEntropyLoss()
nn_model, params = nn_classification.existing('resnet18', 10)
optimizer = optim.SGD(params, lr=0.001, momentum=0.9)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

nn_model = nn_classification.train(
        nn_model, 
        train_loader, val_loader, 
        train_size, val_size, 
        criterion, 
        optimizer,
        exp_lr_scheduler, 
        num_epochs=10,
        )

Epoch: 0/9
----------
Train Loss: 1.7210 Acc: 0.3942
Val. Loss: 1.4807 Acc: 0.4850

Epoch: 1/9
----------
Train Loss: 1.2944 Acc: 0.5402
Val. Loss: 1.3385 Acc: 0.5280

Epoch: 2/9
----------
Train Loss: 1.0915 Acc: 0.6124
Val. Loss: 1.2949 Acc: 0.5500

Epoch: 3/9
----------
Train Loss: 0.9091 Acc: 0.6772
Val. Loss: 1.2796 Acc: 0.5740

Epoch: 4/9
----------
Train Loss: 0.7751 Acc: 0.7262
Val. Loss: 1.2608 Acc: 0.5830

Epoch: 5/9
----------
Train Loss: 0.6204 Acc: 0.7854
Val. Loss: 1.2185 Acc: 0.5970

Epoch: 6/9
----------
Train Loss: 0.4805 Acc: 0.8344
Val. Loss: 1.3942 Acc: 0.6110

Epoch: 7/9
----------
Train Loss: 0.2797 Acc: 0.9168
Val. Loss: 1.2537 Acc: 0.6330

Epoch: 8/9
----------
Train Loss: 0.2032 Acc: 0.9454
Val. Loss: 1.2329 Acc: 0.6440

Epoch: 9/9
----------
Train Loss: 0.1835 Acc: 0.9530
Val. Loss: 1.2324 Acc: 0.6420

Best Val. Acc: 0.644000
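
Returning the weights with the highest validation accuracy amounts to keeping a snapshot whenever validation improves. The sketch below replays the validation accuracies from the log above. The `weights` dictionary is a hypothetical stand-in for a model's state, and this is not necessarily how `nn_classification.train` is implemented.

```python
import copy

# Validation accuracies copied from the training log above (epochs 0 to 9)
val_accs = [0.4850, 0.5280, 0.5500, 0.5740, 0.5830,
            0.5970, 0.6110, 0.6330, 0.6440, 0.6420]

weights = {'w': 0.0}          # hypothetical stand-in for model.state_dict()
best_acc, best_weights, best_epoch = 0.0, copy.deepcopy(weights), None

for epoch, acc in enumerate(val_accs):
    weights['w'] += 1.0       # stand-in for one epoch of training
    if acc > best_acc:        # keep a snapshot only when validation improves
        best_acc = acc
        best_epoch = epoch
        best_weights = copy.deepcopy(weights)

weights = best_weights        # stand-in for model.load_state_dict(best_weights)
print(best_epoch, best_acc, weights)
```

With these numbers the snapshot from epoch 8 is kept, which agrees with the `Best Val. Acc: 0.644000` line. The deep copy matters: without it, later epochs would overwrite the stored snapshot.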


## Prediction

We now obtain the neural network's predictions and associated probabilities for two sample videos. The first video contains indoor scenery and the second contains outdoor scenery. Since we know the scenery of each video, we can measure the accuracy of the model per frame: a frame counts as correct when more than half of its probability mass is assigned to classes from the video's environment (indoor vs outdoor).

In [53]:
def vid_to_arrays(file_path):
  arrays = []
  container = av.open(file_path)
  for packet in container.demux():
    for frame in packet.decode():            
      if isinstance(frame, av.video.frame.VideoFrame):
        img = frame.to_image()
        arr = np.asarray(img)
        arrays.append(arr)
  return arrays


def arrays_to_inputs(arrays, transform):
  inputs = []
  for arr in arrays:
    img = Image.fromarray(arr)
    input = transform(img)
    inputs.append(input.numpy())
  return torch.Tensor(np.array(inputs))


def accuracy(preds, probs, env_target, class_to_idx, env_to_class):
  target_class_labels = env_to_class[env_target]
  target_class_idxs = [class_to_idx[cls] for cls in target_class_labels]
  num_correct = 0
  for prob in probs:
    target_probs = [prob[idx] for idx in target_class_idxs]
    if sum(target_probs) > 0.5:
      num_correct += 1
  return num_correct/len(probs)


transform = nn_classification.data_transform(
    Resize=(40,),
    CenterCrop=(32,), # same size as training images
    ToTensor=(),
    Normalize=(means, stds),
)

vid1 = '/content/drive/My Drive/CVS_videos/indoor.mp4'
vid2 = '/content/drive/My Drive/CVS_videos/outdoor.mp4'

arrays1 = vid_to_arrays(vid1)
arrays2 = vid_to_arrays(vid2)

inputs1 = arrays_to_inputs(arrays1, transform)
inputs2 = arrays_to_inputs(arrays2, transform)

class_to_idx = train_loader.dataset.class_to_idx
env_to_class = {
      'in': ['bed', 'chair', 'couch', 'table', 'wardrobe'],
      'out': ['cloud', 'forest', 'mountain', 'plain', 'sea'],
  }

preds1, probs1 = nn_classification.predict(nn_model, inputs1)
preds2, probs2 = nn_classification.predict(nn_model, inputs2)

acc1 = accuracy(preds1, probs1, 'in', class_to_idx, env_to_class)
acc2 = accuracy(preds2, probs2, 'out', class_to_idx, env_to_class)

print(f'Indoor accuracy: {acc1*100:.2f}')
print(f'Outdoor accuracy: {acc2*100:.2f}')


Indoor accuracy: 73.22
Outdoor accuracy: 80.38
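
To make the accuracy criterion concrete, here is a small worked example of the same rule used in `accuracy`. The four-class setup and the probabilities are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical class order and probabilities for three frames of an indoor video
class_to_idx = {'bed': 0, 'chair': 1, 'cloud': 2, 'sea': 3}
indoor_idxs = [class_to_idx[c] for c in ('bed', 'chair')]

probs = np.array([
    [0.50, 0.30, 0.10, 0.10],  # indoor mass 0.80 -> correct
    [0.20, 0.20, 0.35, 0.25],  # indoor mass 0.40 -> incorrect
    [0.05, 0.60, 0.30, 0.05],  # indoor mass 0.65 -> correct
])

# A frame is correct when the indoor classes together receive more than 0.5
indoor_mass = probs[:, indoor_idxs].sum(axis=1)
frame_correct = indoor_mass > 0.5
print(frame_correct.mean())  # fraction of correct frames: 2 of 3
```

Note that this measures accuracy at the level of the environment, not of individual classes: a frame can be counted as correct even if the top-scoring class is not the object actually shown.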


## Bayesian network

This section will serve as a ***proof of concept*** for using Bayesian Networks (BN) for scene understanding in videos. The functionality in this section should serve as inspiration rather than the starting point of the cogvis BN component.

An attempt was made to use existing frameworks such as [BayesPy](https://www.bayespy.org/) and [Pomegranate](https://pomegranate.readthedocs.io/en/latest/). These are both small projects that are fairly narrow in scope (and BayesPy isn't actively maintained). The functionality they provide is specific to a select few problems and didn't seem extensible in a general way. I am of the opinion that these frameworks are not suitable for the BN component of cogvis, although further consideration may still be needed before making a decision.


---


Here, I create a simple BN using given conditional probabilities. Further work includes:
* d-separation
* automatic learning
* plate notation for BN instantiation
* variable elimination (with optimal ordering)


In [54]:
# ------------------------------------------------------------------ #

# IMPORTANT: untested, rough idea for how to implement a BN framework
# TODO: much consideration and improvement
# NOTE: probably best used as inspiration rather than starting code

# ------------------------------------------------------------------ #


class Node():
  def __init__(self, name, distribution):
    self.name = name
    self.distribution = distribution
    self.parents = []
  
  def __eq__(self, other):
    if type(other) is str:
      return self.name == other
    return self.name == other.name


class BayesianNetwork():
  
  def __init__(self, nodes, edges):
    for parent, child in edges:
      child.parents.append(parent)
    self.nodes = nodes
  
  def observe(self, target, dist):
    idx = self.nodes.index(target)
    node = self.nodes[idx]
    node.distribution = dist

  def probability(self, target):
    # NOTE: currently only works with the implemented example below (2 nodes)
    # NOTE: That is, one initial distribution and the other a conditional distribution
    # TODO: generalise
    node = self.nodes[self.nodes.index(target)]
    joint_dist = self.joint(node.distribution, node.parents[0].distribution)
    prob_dist = self.marginal(joint_dist, 0)
    return prob_dist

  def ancestors(self, node):
    if not node.parents:
      return [node]
    ancestors = []
    for parent in node.parents:
      ancestors = ancestors + self.ancestors(parent)
    return [node] + ancestors

  def joint(self, child_dist, parent_dist):
    assert(all([i == c for i, c in zip(child_dist.index, parent_dist.columns)]))
    joint_vars = [(c1, c2) for c1 in child_dist.columns for c2 in parent_dist.columns]

    heading = lambda x: x[0] + ' ' + x[1]
    columns = list(map(heading, joint_vars))
    idxs = parent_dist.index
    joint_dist = pd.DataFrame(columns=columns, index=idxs)

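    for c1, c2 in joint_vars:
      for idx in idxs:
        # P(child=c1, parent=c2) = P(c1 | c2) * P(c2)
        prob = child_dist[c1][c2] * parent_dist[c2][idx]
        col = c1 + ' ' + c2
        # .loc avoids chained assignment, which may not write through in recent pandas
        joint_dist.loc[idx, col] = prob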
    return joint_dist


  def marginal(self, dist, target_index):
    marg_cols = list(set([c.split(' ')[target_index] for c in dist.columns]))
    marg_dist = pd.DataFrame(0.0, columns=marg_cols, index=dist.index)  # float dtype for accumulating probabilities
    for col in dist.columns:
      for idx in dist.index:
        marg_col = col.split(' ')[target_index]
        marg_dist.loc[idx, marg_col] += dist[col][idx]
    return marg_dist


def combine_cond(cond1, cond2, dist):
  # NOTE: currently unused - combines parents P(X|Y0) and P(X|Y1) to P(X|Y0,Y1)
  # TODO: incorporate into Bayesian Network class
  # TODO: modify and standardise along with all other BN methods
  assert(all([n1 == n2 for n1, n2 in zip(cond1.columns, cond2.columns)]))
  index = [c1 + ' ' + c2 for c1 in cond1.index for c2 in cond2.index]
  cond = pd.DataFrame(index=index, columns=cond1.columns)
  for col in cond1.columns:
    for c1 in cond1.index:
      for c2 in cond2.index:
        row = c1 + ' ' + c2
        # .loc avoids chained assignment; dist is an initial distribution (default index 0)
        cond.loc[row, col] = (cond1[col][c1] * cond2[col][c2]) / dist.loc[0, col]
  arr = cond.to_numpy(dtype=float)
  normalised = special.softmax(arr, axis=1)
  cond = pd.DataFrame(normalised, index=cond.index, columns=cond.columns)
  return cond


def csv_to_cond(filename):
  cond = pd.read_csv(filename, header=0, index_col=0)
  # rename returns a new DataFrame, so the result must be assigned
  cond = cond.rename(columns=str.strip, index=str.strip)
  col_names = [c for c in cond.columns]
  cond[col_names] = cond[col_names].astype(float)
  return cond

Let's create a simple BN that uses scene understanding to improve the probabilistic predictions of our model. Each node in the network contains an (initial or conditional) distribution.  

Let's assume we know whether we are outside or inside. That is to say, we observe the state of the environment
and infer the probability of each object using the given conditional probability table. 


---

Each conditional distribution is represented by a pandas `DataFrame`, where each column heading is a value of the random variable and each row index is a value of the conditioning variable.  
* Joint distributions contain multiple variables in the column heading (separated by a `' '`).
* Distributions conditioned on multiple variables have these variables as an index (separated by a `' '`).
* Initial distributions have the default index.
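
The three representations can be illustrated on a tiny example. This is a sketch with hypothetical numbers and only two objects. It builds the tables directly rather than through the `BayesianNetwork` class, and computes P(obj) as the sum over environments of P(obj | env) P(env).

```python
import pandas as pd

# Initial distribution P(env): default index, one column per value
env = pd.DataFrame({'in': [0.5], 'out': [0.5]})

# Conditional distribution P(obj | env): rows are env values, columns are obj values.
# The numbers are hypothetical and only two objects are used to keep the table small.
obj_env = pd.DataFrame(
    {'chair': [0.9, 0.2], 'cloud': [0.1, 0.8]},
    index=['in', 'out'],
)

# Joint distribution P(obj, env): column headings are 'obj env' pairs
joint = pd.DataFrame({
    f'{obj} {e}': [obj_env.loc[e, obj] * env.loc[0, e]]
    for obj in obj_env.columns
    for e in obj_env.index
})

# Marginal P(obj): sum the joint columns that share the same object name
marginal = pd.DataFrame({
    obj: [sum(joint.loc[0, f'{obj} {e}'] for e in obj_env.index)]
    for obj in obj_env.columns
})

print(joint)
print(marginal)
```

If `env` is replaced by a one-hot distribution such as `{'in': [1.0], 'out': [0.0]}`, the marginal reduces to the `'in'` row of the conditional table, which is what observing the environment achieves.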



## Improving inference using a BN (known environment)

* We instantiate a Bayesian network that has two nodes, one being an initial distribution of the environment and the other being a conditional distribution of an object given an environment. 
* Next, we assume the environment is known. We insert this evidence using the `observe` method. 
* Finally, we get the probability distribution of the required variable.

In [55]:
env = pd.DataFrame.from_dict({'in': [0.5], 'out': [0.5]}) # P(env)
obj_env = csv_to_cond('/content/drive/My Drive/bayes_conditionals/obj_env.csv') # P(obj | env)

A = Node('A', env)
B = Node('B', obj_env)
nodes = [A, B]
edges = [(A, B)]
bn = BayesianNetwork(nodes, edges)

# Let's first assume we are inside
env = pd.DataFrame.from_dict({'in': [1.0], 'out': [0.0]})
bn.observe('A', env)
bn_probs1 = bn.probability('B') # P(obj)
print(bn_probs1, '\n')

# Now let's assume we are outside
env = pd.DataFrame.from_dict({'in': [0.0], 'out': [1.0]})
bn.observe('A', env)
bn_probs2 = bn.probability('B') # P(obj)
print(bn_probs2)

   table  wardrobe   bed  cloud  chair   sea  mountain  plain  forest  couch
0   0.24      0.15  0.15   0.02   0.24  0.01      0.02   0.01    0.01   0.15 

   table  wardrobe   bed  cloud  chair  sea  mountain  plain  forest  couch
0   0.02      0.01  0.01    0.2   0.05  0.2      0.15   0.15     0.2   0.01


Using these probabilities, we can modify the prediction probabilities of our neural network.

In [56]:
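def modify(vid_probs, bn_probs, class_to_idx, env_to_class, target):
  # TODO: this is janky
  # Work on a copy so the caller's data (and any array it wraps) is not zeroed in place
  vid_probs = vid_probs.copy()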
  target_classes = env_to_class[target]
  other_classes = [c for c in class_to_idx.keys() if c not in target_classes]

  target_idxs = [class_to_idx[c] for c in target_classes]
  other_idxs = [class_to_idx[c] for c in other_classes]

  bn_cols = sorted(bn_probs.columns, key=lambda c: class_to_idx[c])
  bn_probs = [bn_probs.loc[0, c] for c in bn_cols]

  modified_probs = pd.DataFrame(columns=vid_probs.columns, index=vid_probs.index)
  for idx in vid_probs.index:
    error_prob = sum(vid_probs.iloc[idx, other_idxs])
    vid_probs.iloc[idx, other_idxs] = 0
    delta_probs = np.asarray([p * error_prob for p in bn_probs])
    modified_probs.loc[idx,:] = delta_probs + vid_probs.loc[idx, :].to_numpy()
  return modified_probs


columns = list(range(len(bn_probs1.columns)))
vid_probs1 = pd.DataFrame(probs1, columns=columns)
vid_probs2 = pd.DataFrame(probs2, columns=columns)

final_probs1 = modify(vid_probs1, bn_probs1, class_to_idx, env_to_class, 'in') # inside video
final_probs2 = modify(vid_probs2, bn_probs2, class_to_idx, env_to_class, 'out') # outside video
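
The adjustment made by `modify` is easiest to follow on a single frame. The sketch below uses a hypothetical four-class setup with made-up probabilities. It mirrors the steps of `modify` (measure the misplaced mass, zero it, redistribute it according to the BN distribution) without using the function itself.

```python
import numpy as np

# Hypothetical 4-class setup: indices 0-1 are indoor classes, 2-3 are outdoor classes
indoor_idxs, outdoor_idxs = [0, 1], [2, 3]

nn_probs = np.array([0.50, 0.20, 0.20, 0.10])  # network output for one indoor frame
bn_probs = np.array([0.60, 0.30, 0.05, 0.05])  # made-up P(obj | env = in) from the BN

# Probability mass the network placed on classes from the wrong environment
error_prob = nn_probs[outdoor_idxs].sum()

# Zero the wrong-environment classes, then hand the removed mass back
# in proportion to the BN distribution
kept = nn_probs.copy()
kept[outdoor_idxs] = 0.0
modified = kept + error_prob * bn_probs

print(error_prob)      # approximately 0.3
print(modified)        # approximately [0.68, 0.29, 0.015, 0.015]
print(modified.sum())  # approximately 1.0
```

The result is still a probability distribution because the BN probabilities sum to one, so exactly the removed mass is added back.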

Next, we simply plot and save the results.

In [60]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt


def save_plot(file_path, img, class_to_idx, nn_probs, bn_probs, transform):
  fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 5))

  ax1.imshow(img)
  ax1.axes.xaxis.set_visible(False)
  ax1.axes.yaxis.set_visible(False)

  x = [1] * len(nn_probs)
  y = list(range(1, len(nn_probs)+1))

  for cls_name, idx in class_to_idx.items():
    label = f'{cls_name}  {nn_probs[idx]*100:.2f}%'
    ax2.annotate(label, (x[idx] + 1, y[idx]))

  for cls_name, idx in class_to_idx.items():
    label = f'{cls_name}  {bn_probs[idx]*100:.2f}%'
    ax3.annotate(label, (x[idx] + 1, y[idx]))
  
  s = [1000 * p for p in nn_probs]
  ax2.set_xlim(0, len(nn_probs) + 1)
  ax2.set_ylim(0, len(nn_probs) + 1)
  ax2.scatter(x, y, s)
  ax2.axes.xaxis.set_visible(False)
  ax2.set_title('nn')

  s = [1000 * p for p in bn_probs]
  ax3.set_xlim(0, len(bn_probs) + 1)
  ax3.set_ylim(0, len(bn_probs) + 1)
  ax3.scatter(x, y, s)
  ax3.axes.xaxis.set_visible(False)
  ax3.set_title('nn with bn')

  plt.savefig(file_path)
  plt.close(fig)


def save_plots(save_path, imgs, class_to_idx, nn_probs, bn_probs, transform):
  for i, img in enumerate(imgs):
    img = transform(img)
    path = f'{save_path}/{i}.png'
    save_plot(path, img, class_to_idx, nn_probs[i], bn_probs[i], transform)


transform = nn_classification.data_transform(
    Resize=(40,),
    CenterCrop=(32,),
)

arrays1 = vid_to_arrays(vid1)
imgs1 = [Image.fromarray(arr) for arr in arrays1]
image_folder = '/content/drive/My Drive/CVS_videos/temp_images'
save_plots(image_folder, imgs1, class_to_idx, probs1, final_probs1.to_numpy(), transform)

# arrays2 = vid_to_arrays(vid2)
# imgs2 = [Image.fromarray(arr) for arr in arrays2]
# image_folder = '/content/drive/My Drive/CVS_videos/temp_images'
# save_plots(image_folder, imgs2, class_to_idx, probs2, final_probs2.to_numpy(), transform)



In [61]:
import cv2
import os

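def save_vid(image_folder, vid_path, fps):
  # os.listdir order is arbitrary, so sort frames numerically (0.png, 1.png, ..., 10.png).
  # Assumes the folder only holds the '<index>.png' files written by save_plots.
  images = [img for img in os.listdir(image_folder) if img.endswith(".png")]
  images.sort(key=lambda name: int(os.path.splitext(name)[0]))
  frame = cv2.imread(os.path.join(image_folder, images[0]))
  height, width, layers = frame.shape
  size = (width, height)
  # Use the vid_path argument rather than the global video_path
  video = cv2.VideoWriter(vid_path, cv2.VideoWriter_fourcc(*'DIVX'), fps, size)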

  for image in images:
      video.write(cv2.imread(os.path.join(image_folder, image)))

  cv2.destroyAllWindows()
  video.release()


video_path = '/content/drive/My Drive/CVS_videos/nn_indoor.avi'
save_vid(image_folder, video_path, fps=24)

# video_path = '/content/drive/My Drive/CVS_videos/nn_outdoor.avi'
# save_vid(image_folder, video_path, fps=24)
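
Frame order matters when the saved plots are assembled into a video. `os.listdir` returns names in arbitrary order, and a plain `sorted` is lexicographic, so `10.png` would come before `2.png`. A minimal sketch of the difference, using a made-up list of file names in the `<frame index>.png` pattern written by `save_plots`:

```python
import os

# File names in the pattern save_plots writes: '<frame index>.png'
names = ['10.png', '2.png', '0.png', '1.png', '11.png']

lexicographic = sorted(names)
numeric = sorted(names, key=lambda name: int(os.path.splitext(name)[0]))

print(lexicographic)  # ['0.png', '1.png', '10.png', '11.png', '2.png']
print(numeric)        # ['0.png', '1.png', '2.png', '10.png', '11.png']
```

Sorting on the integer part of the file name keeps the frames in their original sequence.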


# **Incomplete**


---

## Improving inference using a BN (unknown environment)

The idea here was to use previous frames to model predictions of the current frame. 
* We would assume that the neural network is correct about the environment on average every 24 frames (1 second). We would therefore assume the previous environment (`T0`) is known. This becomes evidence for our Bayesian network.  
* We would also assume the neural network is correct about the ***probabilistic predictions*** of objects in the video on average every 12 frames (0.5 seconds). We would therefore assume the object probabilities of the previous frame (`S0`) are known. 
* `T1` is the current environment and `S1` is the current object.

The aim is to obtain object probabilities of the current frame (`S1`) using the distribution described by the Bayesian network. We will then use these probabilities to improve the probabilistic predictions of the neural network.

In [None]:
num_classes = len(class_to_idx)
obj_dist = {c: 1/num_classes for c in class_to_idx.keys()}

env = pd.DataFrame.from_dict({'indoor': [0.5], 'outdoor': [0.5]}) # P(env)
obj = pd.DataFrame.from_dict({c: [1/num_classes] for c in class_to_idx.keys()}) # P(obj)
env_env = csv_to_cond('/content/drive/My Drive/bayes_conditionals/env_env.csv') # P(current_env|prev_env)
obj_env = csv_to_cond('/content/drive/My Drive/bayes_conditionals/obj_env.csv') # P(obj|current_env)
obj_obj = csv_to_cond('/content/drive/My Drive/bayes_conditionals/obj_obj.csv') # P(obj|prev_obj)
obj_obj_env = combine_cond(obj_obj, obj_env, obj) # P(obj|prev_obj, current_env)

T0 = Node('T0', env)
S0 = Node('S0', obj)
T1 = Node('T1', env_env)
S1 = Node('S1', obj_obj_env)

e0 = (T0, T1)
e1 = (T1, S1)  # the current environment influences the current object
e2 = (S0, S1)

nodes = [T0, S0, T1, S1]
edges = [e0, e1, e2]

bn = BayesianNetwork(nodes, edges)
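
Since this section is incomplete, the following is only one possible reading of the "correct on average every 24 frames" assumption: average the network's indoor probability mass over the previous 24 frames and use the result as soft evidence for `T0`. The synthetic `probs` array, the class layout and the frame index `t` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-frame network output: 48 frames x 4 classes, rows sum to 1.
# Columns 0-1 stand for indoor classes and 2-3 for outdoor classes (made up).
probs = rng.dirichlet(alpha=[4, 3, 1, 1], size=48)
indoor_idxs = [0, 1]
window = 24  # frames, i.e. one second at 24 fps

# Indoor probability mass per frame, then its mean over the previous `window` frames
indoor_mass = probs[:, indoor_idxs].sum(axis=1)
t = 30                                    # index of the current frame
p_in = indoor_mass[t - window:t].mean()   # uses frames t-24 .. t-1 only

# Soft evidence for T0 in the same column order as the env table: [P(in), P(out)]
t0_evidence = np.array([p_in, 1.0 - p_in])
print(t0_evidence)
```

A vector like this could be passed to `observe` for `T0` in place of a hard 0/1 observation, once `probability` has been generalised beyond the two-node case.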