# Working on MoveNet Thunder
In this notebook, we will be taking a look at the MoveNet Thunder model.

## 0- Model Details

MoveNet is a convolutional neural network that runs on RGB images and predicts the joint locations of a single person. It comes in several variants. This variant, MoveNet.SinglePose.Thunder, is a higher-capacity model than MoveNet.SinglePose.Lightning: it gives better prediction quality while still achieving real-time (>30 FPS) speed. Naturally, Thunder lags behind Lightning in speed, but it packs a punch.<sup>[[1]](https://www.kaggle.com/models/google/movenet/frameworks/tfLite/variations/singlepose-thunder)</sup> Since we are running this notebook on a relatively strong laptop GPU (RTX 3080), Thunder is the model we chose.

### Model Architecture:

[MobileNetV2](https://arxiv.org/abs/1801.04381) image feature extractor with [Feature Pyramid Network](https://arxiv.org/abs/1612.03144) decoder (to stride of 4) followed by [CenterNet](https://arxiv.org/abs/1904.07850) prediction heads with custom post-processing logic. Thunder uses depth multiplier 1.75.<sup>[[2]](https://storage.googleapis.com/movenet/MoveNet.SinglePose%20Model%20Card.pdf)</sup>

### Inputs:
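For Thunder, a frame of video or an image, represented as an int32 tensor of shape 256x256x3. Channel order: RGB with values in [0, 255].<sup>[[3]](https://storage.googleapis.com/movenet/MoveNet.SinglePose%20Model%20Card.pdf)</sup> Note that the int8-quantized TFLite file loaded in this notebook reports a `uint8` input of shape [1, 256, 256, 3] (see the `get_input_details()` output in section 4).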

### Outputs:

A float32 tensor of shape [1, 1, 17, 3]. Explanation of the dimensions:
- First dimension is the batch size. For this model, it is set to 1, meaning the model processes one image at a time.
- Second dimension represents the number of detected poses in each image. Since this is a SinglePose model, it is always 1.
- Third dimension represents the keypoints. There are 17 keypoints that this model looks for. Those are in the order of: [nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle]
- Last one represents the channels. Those channels are in this order:
    - y coordinates, normalized to [0, 1].
    - x coordinates, normalized to [0, 1].
    - The prediction confidence scores of each keypoint.
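
As a minimal sketch of how this layout can be read, the helper below pairs each of the 17 rows with its keypoint name. `KEYPOINT_NAMES` and `name_keypoints` are hypothetical names introduced here for illustration, not part of the model's API, and the sample values are made up.

```python
# Keypoint names in the order listed above (index 0 = nose, ..., 16 = right ankle).
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def name_keypoints(output, min_score=0.0):
    """Map keypoint names to (y, x, score) from a [1, 1, 17, 3] output.

    `output` can be a NumPy array or nested lists; only indexing is used.
    Keypoints scoring below `min_score` are left out.
    """
    rows = output[0][0]  # drop the batch and pose dimensions -> 17 rows
    named = {}
    for name, (y, x, score) in zip(KEYPOINT_NAMES, rows):
        if score >= min_score:
            named[name] = (float(y), float(x), float(score))
    return named


# Made-up output: every keypoint at the image centre with a low score,
# except the nose, which gets a high score.
fake_output = [[[[0.5, 0.5, 0.1] for _ in KEYPOINT_NAMES]]]
fake_output[0][0][0] = [0.25, 0.5, 0.9]
print(name_keypoints(fake_output, min_score=0.4))
```

With a threshold of 0.4, only the nose entry survives in this made-up example.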

## 1- Imports 

In [76]:
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
import cv2

## 2- Loading the Model

First, to download the model to our local machine, we run this command in a terminal:

wget -O model.tflite "https://tfhub.dev/google/lite-model/movenet/singlepose/thunder/tflite/int8/4?lite-format=tflite"

Or, if you prefer to download it from within this environment, run this code:

In [None]:
# import requests

# url = "https://tfhub.dev/google/lite-model/movenet/singlepose/thunder/tflite/int8/4?lite-format=tflite"
# response = requests.get(url)

# with open('model.tflite', 'wb') as f:
#     f.write(response.content)

Now, let's load our model:

In [77]:
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()

## 3- Helper Functions

We need some functions to draw the keypoints and connect them. First, let's implement the one to draw the keypoints.

In [78]:
def draw_keypoints(frame, keypoints, confidence_threshold):
    y, x, c = frame.shape
    shaped = np.squeeze(np.multiply(keypoints, [y,x,1]))
    
    for kp in shaped:
        ky, kx, kp_conf = kp
        if kp_conf > confidence_threshold:
            cv2.circle(frame, (int(kx), int(ky)), 4, (0,255,0), -1) 

Now, to draw the edges. We must not connect every pair of keypoints: connecting the nose to the left elbow, for example, would make no sense. Luckily, there is a predefined dictionary that states which points should be connected.

In [79]:
EDGES = {
    (0, 1): 'm',
    (0, 2): 'c',
    (1, 3): 'm',
    (2, 4): 'c',
    (0, 5): 'm',
    (0, 6): 'c',
    (5, 7): 'm',
    (7, 9): 'm',
    (6, 8): 'c',
    (8, 10): 'c',
    (5, 6): 'y',
    (5, 11): 'm',
    (6, 12): 'c',
    (11, 12): 'y',
    (11, 13): 'm',
    (13, 15): 'm',
    (12, 14): 'c',
    (14, 16): 'c'
}

In [80]:
def draw_connections(frame, keypoints, edges, confidence_threshold):
    y, x, c = frame.shape
    shaped = np.squeeze(np.multiply(keypoints, [y,x,1]))
    
    for edge, color in edges.items():
        p1, p2 = edge
        y1, x1, c1 = shaped[p1]
        y2, x2, c2 = shaped[p2]
        
        if c1 > confidence_threshold and c2 > confidence_threshold:
            cv2.line(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0,0,255), 2)
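
Note that `draw_connections` draws every edge in the same colour and ignores the colour codes stored in `EDGES`. If per-edge colours are wanted, one possible sketch is to translate the codes into BGR tuples, which is the channel order OpenCV drawing functions use. `COLOR_CODES_BGR` and `edge_color` are hypothetical helpers, and the assumption is that 'm', 'c' and 'y' stand for magenta, cyan and yellow as in matplotlib.

```python
# Assumed meaning of the single-letter codes (matplotlib convention),
# expressed as (blue, green, red) tuples for OpenCV.
COLOR_CODES_BGR = {
    "m": (255, 0, 255),  # magenta
    "c": (255, 255, 0),  # cyan
    "y": (0, 255, 255),  # yellow
}


def edge_color(code, default=(0, 0, 255)):
    """Return the BGR tuple for a colour code, falling back to red."""
    return COLOR_CODES_BGR.get(code, default)


print(edge_color("c"))
```

Inside the loop, passing `edge_color(color)` to `cv2.line` in place of the fixed `(0,0,255)` would then colour each edge according to its code.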

## 4- Main Function

Now it is time for the main function. We will build it up in several versions, so that we can examine each step and understand what the model does behind the scenes. This first version shows what the *frame* variable returned by `VideoCapture.read()` looks like.

In [81]:
def base():
    cap = cv2.VideoCapture(0)

    while cap.isOpened():
        ret, frame = cap.read()
    
        cv2.imshow('MoveNet Thunder', frame)
    
        if cv2.waitKey(10) & 0xFF==ord('q'):
            break
        
    cap.release()
    cv2.destroyAllWindows()
    return frame.shape

In [82]:
print(base())

(480, 640, 3)


Now we need to resize the image: our frame has shape (480, 640, 3), but the SinglePose.Thunder model takes images of shape 256x256x3.

In [84]:
def reshaped():
    cap = cv2.VideoCapture(0)

    while cap.isOpened():
        ret, frame = cap.read()

        #====================================================================#

        img = frame.copy()
        img = tf.image.resize_with_pad(np.expand_dims(img, axis=0), 256, 256)
        input_image = tf.cast(img, dtype=tf.uint8)
        
        #====================================================================#

        cv2.imshow('MoveNet Thunder', frame)
    
        if cv2.waitKey(10) & 0xFF==ord('q'):
            break
        
    cap.release()
    cv2.destroyAllWindows()
    return input_image.shape

In [85]:
print(reshaped())

(1, 256, 256, 3)


From this output, we can see that the array gained a dimension. `np.expand_dims()` adds a leading batch dimension, which matches the [1, 256, 256, 3] input the model expects; `tf.image.resize_with_pad()` then resizes the image to 256x256 while preserving its aspect ratio and padding the remainder.
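
`tf.image.resize_with_pad` keeps the aspect ratio and fills the rest of the target with padding. The arithmetic behind that can be sketched as follows; `letterbox_geometry` is a hypothetical helper written for illustration, and it assumes the padding is split evenly between the two sides, so the exact rounding may differ from TensorFlow's.

```python
def letterbox_geometry(height, width, target=256):
    """Return (new_h, new_w, pad_top, pad_left) for an aspect-preserving resize."""
    scale = target / max(height, width)   # the longer side fills the target
    new_h = round(height * scale)
    new_w = round(width * scale)
    pad_top = (target - new_h) // 2       # padding above the image
    pad_left = (target - new_w) // 2      # padding left of the image
    return new_h, new_w, pad_top, pad_left


print(letterbox_geometry(480, 640))  # (192, 256, 32, 0)
```

For the 480x640 webcam frame, this gives a 192x256 image with 32 rows of padding above and below it.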

Next, we need the input and output details of the model. They are obtained from the interpreter, so they can also be inspected directly, without running the webcam loop again. We can also drop the return statement, since it is no longer needed.

In [86]:
def reshaped():
    cap = cv2.VideoCapture(0)

    while cap.isOpened():
        ret, frame = cap.read()

        img = frame.copy()
        img = tf.image.resize_with_pad(np.expand_dims(img, axis=0), 256, 256)
        input_image = tf.cast(img, dtype=tf.uint8)

        #====================================================================#

        input_details = interpreter.get_input_details()
        output_details = interpreter.get_output_details()

        #====================================================================#

        cv2.imshow('MoveNet Thunder', frame)
    
        if cv2.waitKey(10) & 0xFF==ord('q'):
            break
        
    cap.release()
    cv2.destroyAllWindows()
    # return input_image.shape

In [87]:
interpreter.get_input_details()

[{'name': 'serving_default_input:0',
  'index': 0,
  'shape': array([  1, 256, 256,   3]),
  'shape_signature': array([  1, 256, 256,   3]),
  'dtype': numpy.uint8,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32),
   'quantized_dimension': 0},
  'sparsity_parameters': {}}]

In [88]:
interpreter.get_output_details()

[{'name': 'StatefulPartitionedCall:0',
  'index': 332,
  'shape': array([ 1,  1, 17,  3]),
  'shape_signature': array([ 1,  1, 17,  3]),
  'dtype': numpy.float32,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32),
   'quantized_dimension': 0},
  'sparsity_parameters': {}}]
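
These dictionaries can also be used to check a prepared input before handing it to the interpreter. The sketch below is one way to do that; `input_matches` is a hypothetical helper, and the sample dictionary is a trimmed copy of the output above with the dtype written as a string so that the example runs without TensorFlow.

```python
def input_matches(shape, dtype_name, detail):
    """Check a prepared input's shape and dtype name against one details entry."""
    expected_shape = tuple(int(d) for d in detail["shape"])
    dtype = detail["dtype"]
    # NumPy scalar types such as numpy.uint8 carry __name__; plain strings do not.
    expected_dtype = getattr(dtype, "__name__", str(dtype))
    return tuple(shape) == expected_shape and dtype_name == expected_dtype


sample_detail = {"shape": [1, 256, 256, 3], "dtype": "uint8"}
print(input_matches((1, 256, 256, 3), "uint8", sample_detail))    # True
print(input_matches((1, 256, 256, 3), "float32", sample_detail))  # False
```

With the real objects, the call could look like `input_matches(input_image.shape, input_image.dtype.name, interpreter.get_input_details()[0])`.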

Now, on to making predictions. Our function becomes:

In [89]:
def predicts():
    cap = cv2.VideoCapture(0)

    while cap.isOpened():
        ret, frame = cap.read()

        img = frame.copy()
        img = tf.image.resize_with_pad(np.expand_dims(img, axis=0), 256, 256)
        input_image = tf.cast(img, dtype=tf.uint8)

        input_details = interpreter.get_input_details()
        output_details = interpreter.get_output_details()

        #====================================================================#

        interpreter.set_tensor(input_details[0]['index'], np.array(input_image))
        interpreter.invoke()
        keypoints_with_scores = interpreter.get_tensor(output_details[0]['index'])

        #====================================================================#

        cv2.imshow('MoveNet Thunder', frame)
    
        if cv2.waitKey(10) & 0xFF==ord('q'):
            break
        
    cap.release()
    cv2.destroyAllWindows()
    return keypoints_with_scores

In [90]:
keypoints_with_scores = predicts()

In [91]:
left_eye = keypoints_with_scores[0][0][1]
right_eye  = keypoints_with_scores[0][0][2]

In [92]:
left_eye

array([0.4576139 , 0.57402444, 0.5017696 ], dtype=float32)

In [93]:
right_eye

array([0.46162802, 0.45359972, 0.6262084 ], dtype=float32)

In [94]:
left_eye[:2]*[480,640]

array([219.65466499, 367.37564087])

In [95]:
right_eye[:2]*[480,640]

array([221.58144951, 290.30382156])
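
The multiplication above treats the normalized coordinates as relative to the original 480x640 frame. Because the frame was letterboxed by `tf.image.resize_with_pad` first, the model's coordinates are arguably relative to the padded square instead, which would shift the y values for a landscape frame. Below is a sketch of a padding-aware conversion, assuming even padding on both sides; `to_frame_coords` is a hypothetical helper, not part of the original code.

```python
def to_frame_coords(y_norm, x_norm, frame_h, frame_w):
    """Map normalized coordinates on the padded square back to frame pixels."""
    side = max(frame_h, frame_w)   # padded square, measured in frame pixels
    pad_y = (side - frame_h) / 2   # padding above the frame
    pad_x = (side - frame_w) / 2   # padding left of the frame
    return y_norm * side - pad_y, x_norm * side - pad_x


# Left-eye values from the output above.
y_px, x_px = to_frame_coords(0.4576139, 0.57402444, 480, 640)
print(round(y_px, 1), round(x_px, 1))
```

Under this assumption, the left eye lands at roughly y = 213 rather than 220, while x stays at about 367.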

Now, it is time to finalize the main function.

In [73]:
def main():
    cap = cv2.VideoCapture(0)

    while cap.isOpened():
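        ret, frame = cap.read()
        if not ret:  # stop if the camera did not deliver a frame
            break

        # OpenCV delivers BGR frames, while the model expects RGB.
        img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = tf.image.resize_with_pad(np.expand_dims(img, axis=0), 256, 256)
        input_image = tf.cast(img, dtype=tf.uint8)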

        input_details = interpreter.get_input_details()
        output_details = interpreter.get_output_details()

        interpreter.set_tensor(input_details[0]['index'], np.array(input_image))
        interpreter.invoke()
        keypoints_with_scores = interpreter.get_tensor(output_details[0]['index'])

        #====================================================================#

        # The model always returns a prediction for every keypoint, so we use a
        # confidence threshold of 0.4 to avoid drawing unreliable keypoints and connections.
        draw_connections(frame, keypoints_with_scores, EDGES, 0.4)
        draw_keypoints(frame, keypoints_with_scores, 0.4)

        #====================================================================#

        cv2.imshow('MoveNet Thunder', frame)
    
        if cv2.waitKey(10) & 0xFF==ord('q'):
            break
        
    cap.release()
    cv2.destroyAllWindows()

In [74]:
if __name__ == '__main__':
    main()