## Honors Project Report: Distributed Facial Expression Recognition
#### Samuel Zeleke & Enoch Mwesigwa


### Vision
Multi-label classification and object detection models have gained significant popularity in recent years. They have become
integral components of systems ranging from autonomous cars and security systems to social media platforms and search
engines. Our project aims to build a system that labels facial expressions in real time using an Aryzon headset and
CNNs built with Keras. We use OpenCV's cascade classifier to extract faces from frames and train a separate CNN to classify the facial
expression. The output is used to draw a labelled box around each face and display the result on the client.

### Background
Object detection entails recognizing multiple objects placed at different locations in an image. Unfortunately, regular
ConvNets cannot (at least not "normally") solve this problem: their architecture only allows inputs and outputs
with fixed sizes. So, they are restricted to labelling whole images.

There have been several attempts to work around this problem. The most significant ones are R-CNN and YOLO. [R-CNN](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e)
and its descendants use additional preprocessing to generate thousands of candidate regions (called "proposals") in the image
and pass each region to a ConvNet for classification. This is a very resource-intensive process (training
some architectures takes days), and these models are not fast enough for real-time object recognition. [YOLO](https://medium.com/analytics-vidhya/yolo-v3-theory-explained-33100f6d193), on the other hand, uses a single (very) deep ConvNet to
both find regions of interest and classify those regions. This method is several times faster and more
efficient than R-CNNs. Unfortunately, the architecture needs a lot of training data and uses NN layers we were not familiar with.

So, building on Gaurav Sharma's article "[Real Time Facial Expression Recognition](https://medium.com/datadriveninvestor/real-time-facial-expression-recognition-f860dacfeb6a)",
we chose to create a simpler system that uses OpenCV's pre-trained cascade classifiers to extract the
faces and a small trained NN to classify the facial expressions. This largely avoids the inefficiencies of R-CNNs and significantly reduces the
amount of training data we need to get decent predictions.

### Implementation

#### Training
Like R-CNNs, our system divides the facial expression task into two stages. In the first stage, we use OpenCV's pre-trained cascade classifiers
to find regions containing faces. For the second stage, Sharma recommends taking advantage of Keras' pretrained models using transfer learning.
(Transfer learning involves reusing layers from a model trained on a different dataset. It lets the recipient model take advantage of the "donor"
model's training by using its weights for predictions.) Unfortunately, we didn't have enough training data to achieve significant training set accuracy this way.
Additionally, the resulting models took up a lot of storage. So, instead, we adapted the architecture we used for the Fashion-MNIST homework to build a small
CNN that classifies facial expressions.

*Code for second-stage model training*

In [None]:
def train_model(batch_size = 10):
    import os
    import zipfile
    import cv2
    import tensorflow as tf
    import keras
    from keras import layers
    # import matplotlib.pyplot as plt
    import numpy

    # unzipfile
    with zipfile.ZipFile("/content/drive/My Drive/tif_extended.zip", 'r') as zip_ref:
        zip_ref.extractall("./trainingSrc")

    PATH = "./trainingSrc/tif_extended"
    CLASSES = os.listdir(PATH)

    image_generator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, validation_split = 0.2)
    images = image_generator.flow_from_directory(
        PATH,
        batch_size = 1,
        target_size = (256, 256),
    )
    sample, label = next(images)
    feature_dataframe = []
    target_dataframe = []

    for i in range(len(images)):
      feature_dataframe.append(images[i][0][0])
      target_dataframe.append(images[i][1][0])

    feature_dataframe = numpy.array(feature_dataframe)
    target_dataframe = numpy.array(target_dataframe)

    # create network
    model = keras.Sequential()

    # first convolution block: 30 filters (2x2), then 60 filters (5x5) with stride 2 to downsample
    model.add(keras.layers.Conv2D(30, 2, activation="relu", input_shape = (256, 256, 3)))
    model.add(keras.layers.Conv2D(60, kernel_size=5, strides=(2, 2), activation="relu"))
    # second convolution block: 60 filters (5x5), then 30 filters (3x3), followed by 2x2 max pooling
    model.add(keras.layers.Conv2D(60, 5, activation="relu"))
    model.add(keras.layers.Conv2D(30, 3, activation="relu"))
    model.add(keras.layers.MaxPooling2D(2))

    # third convolution block: 60 filters (5x5), followed by 2x2 max pooling
    model.add(keras.layers.Conv2D(60, 5, activation="relu"))
    model.add(keras.layers.MaxPooling2D(2))

    #flatten
    model.add(keras.layers.Flatten())
    # three dense layers
    model.add(keras.layers.Dense(120, activation="relu"))
    model.add(keras.layers.Dense(28, activation="relu"))
    model.add(keras.layers.Dense(7, activation="softmax"))

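    # the 7-way softmax output with one-hot targets calls for categorical cross-entropy
    # (this also matches the loss used when the server recompiles the model)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["acc"]
    )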

    model.summary()

    model.fit(
    x = feature_dataframe,
    y = target_dataframe,
    batch_size = batch_size,
    epochs = 10,
    validation_split = 0.2
    )

    return model
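
To see how the stacked layers shrink a 256 x 256 input, the sketch below traces the spatial size through the network by hand. It assumes Keras' default "valid" (unpadded) convolutions and a pooling stride equal to the pool size; the helper names `conv_out` and `trace_shapes` are ours and are not part of Keras.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of an unpadded ('valid') convolution or pooling layer."""
    return (size - kernel) // stride + 1


# (description, kernel size, stride, output channels), in the order used by train_model
LAYERS = [
    ("conv 2x2", 2, 1, 30),
    ("conv 5x5 stride 2", 5, 2, 60),
    ("conv 5x5", 5, 1, 60),
    ("conv 3x3", 3, 1, 30),
    ("max pool 2", 2, 2, 30),
    ("conv 5x5", 5, 1, 60),
    ("max pool 2", 2, 2, 60),
]


def trace_shapes(size=256, layers=LAYERS):
    """Return (description, height, width, channels) after each layer."""
    shapes = []
    for name, kernel, stride, channels in layers:
        size = conv_out(size, kernel, stride)
        shapes.append((name, size, size, channels))
    return shapes


shapes = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name:<18} -> {h} x {w} x {c}")

# The flattened feature vector feeds the first dense layer (120 units plus biases).
_, h, w, c = shapes[-1]
flattened = h * w * c
first_dense_params = flattened * 120 + 120
print("flattened features:", flattened)
print("first dense layer parameters:", first_dense_params)
```

Under these assumptions the first dense layer holds most of the network's weights, which is one reason the stride-2 convolution and the two pooling layers matter for keeping the model small.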

Finding useful training data for our project was a challenging process. Most face detection datasets are made to train binary classifiers that
detect whether there is a face in an image or not. The larger datasets mentioned in Sharma's article were either unavailable or not appropriately labelled. So,
we used the following code to scrape images from Google Image search. Our scraper grabs the images from the site and uses the cascade classifier to detect faces and generate new images that contain only the face.
It then saves the images into directories named after the search term.

*Code for scraper*

In [None]:
import os
import time
import selenium
from selenium import webdriver
import cv2 as cv
import requests
import numpy as np
from PIL import Image

DRIVER_PATH = "./chromedriver"

class FaceScraper:

    def __init__(self, path_to_driver=DRIVER_PATH, path_to_face_model="./face_detector.xml", path_to_eye_model="./eye_detector.xml"):

        self._img_urls = dict()
        self._images = dict()
        self._wdriver_path = path_to_driver  # honor the constructor argument instead of the module constant
        self._face_cascade = cv.CascadeClassifier()
        self._eye_cascade = cv.CascadeClassifier()

        #-- 1. Load the cascades
        if not self._face_cascade.load(path_to_face_model):
            print('--(!)Error loading face cascade')
            exit(0)
        if not self._eye_cascade.load(path_to_eye_model):
            print('--(!)Error loading eye cascade')
            exit(0)

    def getImgUrls(self, search_terms=None, max_num_links=100):
        # use the default emotion search terms unless the caller supplies their own
        if search_terms is None:
            search_terms = ["smiling", "sad", "surprised", "angry", "neutral", "disgust"]
        self._search_terms = search_terms

        wd = webdriver.Chrome(executable_path=self._wdriver_path)

        for term in self._search_terms:
            self._img_urls[term] = self._fetch_image_urls(term, wd, max_links_to_fetch=max_num_links, sleep_between_interactions=0.1)

        wd.quit()

    def extractFaces(self):
        if len(self._img_urls.keys()) == 0:
            raise ValueError("No images in object.")

        for label in self._img_urls.keys():
            results = []
            print("Extracting label: %s\n" % label)

            i = -1
            for url in self._img_urls[label]:
                try:
                    i += 1
                    print("  Extracting label: {} no: {}; url: {}".format(label, i, url))
                    resp = requests.get(url, stream=True).raw
                    print("\tGrabbed image from server")
                    image = np.asarray(bytearray(resp.read()), dtype="uint8")
                    print("\tConverted to an array")
                    image = cv.imdecode(image, cv.IMREAD_COLOR)
                    print("\tDecoded image")

                    print("\tGetting faces")
                    for face in self._detectFace(image):
                        results.append(face)
                except Exception as err:
                    # skip images that fail to download or decode, but report why
                    print("    error: couldn't extract faces for url (%s)" % err)
                    continue

            self._images[label] = results

    def saveCropped(self, parent_dir=os.getcwd(), image_type = "png"):
        if len(self._images.keys()) == 0:
            raise ValueError("No images in object.")

        base_path = parent_dir + "/scrapped_images"
        path_to_scraped = base_path
        i = 1
        # add a numeric suffix to the base name until we find an unused directory
        while os.path.exists(path_to_scraped):
            path_to_scraped = base_path + "_" + str(i)
            i += 1
        print("Saving to %s" % path_to_scraped)
        os.mkdir(path_to_scraped)
        for label in self._images:
            try:
                print("Saving images for %s" % label)
                labelDir = path_to_scraped + "/" + label
                os.mkdir(labelDir)
            except OSError:
                print("Unable to write images under %s label\n" % label)
                continue

            total = len(self._images[label])
            for index in range(total):
                image_path = labelDir + "/" + label + "_" + str(index) + "." + image_type
                cv.imwrite(image_path, self._images[label][index])
                # %% prints a literal percent sign; the parentheses format the whole expression
                print("Progress: %.2f%%" % (100 * (index + 1) / total))

    def _detectFace(self, frame):
        print("\t  preprocessing image", end="... ")
        frame_gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
        frame_gray = cv.equalizeHist(frame_gray)
        #-- Detect faces and eyes
        print("Detecting faces in image", end="... ")
        faces = self._face_cascade.detectMultiScale(frame_gray)
        eyes = self._eye_cascade.detectMultiScale(frame_gray)
        print("verifying", end="... ")
        real_faces = []
        for (x, y, w, h) in faces:
            # x_min, y_min, x_max, y_max = x, y, x + w, y + h
            # for (x_eye, y_eye, w_eye, h_eye) in eyes:
            #     if (x_min <= x_eye and x_eye <= x_max) and (y_min <= y_eye and y_eye <= y_max):
            #         real_faces.append((x, y, w, h))
            #         break
            real_faces.append((x, y, w, h))
            if len(real_faces) > 5:
                break
        print("\t  Done!")
        return [frame[y:y+h,x:x+w] for (x,y,w,h) in real_faces]

    # source
    # https://towardsdatascience.com/image-scraping-with-python-a96feda8af2d
    def _fetch_image_urls(self, query, wd, max_links_to_fetch, sleep_between_interactions=1):
        def scroll_to_end(wd):
            wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)

        # build the google query
        search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

        # load the page
        wd.get(search_url.format(q=query))

        image_urls = set()
        image_count = 0
        results_start = 0
        while image_count < max_links_to_fetch:
            scroll_to_end(wd)

            # get all image thumbnail results
            thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
            number_results = len(thumbnail_results)

            print("Found: {0} search results. Extracting links from {1}:{0}".format(number_results, results_start))

            for img in thumbnail_results[results_start:number_results]:
                # try to click every thumbnail such that we can get the real image behind it
                try:
                    img.click()
                    time.sleep(sleep_between_interactions)
                except Exception:
                    continue

                # extract image urls
                actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
                for actual_image in actual_images:
                    if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                        image_urls.add(actual_image.get_attribute('src'))

                image_count = len(image_urls)

                if len(image_urls) >= max_links_to_fetch:
                    print("Found: {} image links, done!".format(len(image_urls)))
                    break
            else:
                print("Found:", len(image_urls), "image links, looking for more ...")
                time.sleep(15)
                #return image_urls
                load_more_button = wd.find_element_by_css_selector(".mye4qd")
                if load_more_button:
                    wd.execute_script("document.querySelector('.mye4qd').click();")
                else:
                    return image_urls


            # move the result startpoint further down
            results_start = len(thumbnail_results)

        return image_urls



def run():
    list_of_search_terms = [
        "people at weddings",
        "depression human face",
        "people at senate hearing",
        "disgusted face",
        "people shocked"
        ]

    getFaces = FaceScraper()
    getFaces.getImgUrls(list_of_search_terms, 200)
    getFaces.extractFaces()
    getFaces.saveCropped()

# Run scraper
#run()

This process increased our training data 3-fold. However, because the images were grabbed from a search engine, they also introduced some unintended bias.
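
The scraper's `_detectFace` method contains a commented-out check that keeps a face only when a detected eye falls inside its bounding box. The sketch below shows that idea on plain `(x, y, w, h)` tuples, the box format returned by `detectMultiScale`. The function names and the sample boxes are hypothetical and only illustrate the logic; they are not part of the scraper.

```python
def eye_inside_face(face, eye):
    """True if the eye box's top-left corner lies within the face box."""
    x, y, w, h = face
    x_eye, y_eye, _, _ = eye
    return x <= x_eye <= x + w and y <= y_eye <= y + h


def verified_faces(faces, eyes, max_faces=6):
    """Keep faces that contain at least one eye, stopping after max_faces."""
    kept = []
    for face in faces:
        if any(eye_inside_face(face, eye) for eye in eyes):
            kept.append(face)
        if len(kept) >= max_faces:
            break
    return kept


# Made-up boxes: the first "face" contains both eyes, the second contains none.
faces = [(10, 10, 100, 100), (300, 50, 80, 80)]
eyes = [(30, 40, 20, 10), (70, 40, 20, 10)]
print(verified_faces(faces, eyes))
```

A filter like this would discard cascade false positives that contain no eyes, at the cost of also dropping real faces whose eyes the eye cascade misses.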

#### Recognition
Our project's original aim was to deploy the model on a server hosted on a Raspberry Pi that gets video feeds from a client
(a headset). However, to meet the Pi's hardware performance restrictions, we decided to run the first stage of our classifier
on the client. The client then publishes the cropped faces as MQTT messages to a broker server.

The Raspberry Pi is subscribed to the client's topic. When it gets a message (an image) from a client, it grabs the image and
publishes a string representing the emotion to the broker server. The client then grabs the string and displays it on the screen.

*Code for client*

***DO NOT RUN IN JUPYTER NOTEBOOK:*** The code uses OpenCV's imshow to display the labelled frame. This function can make the Jupyter server crash.

In [None]:
def run_client():
    import cv2
    import pickle
    import socket
    import struct

    TCP_IP = '127.0.0.1'
    TCP_PORT = 9502
    video_file = 'facesVid.webm'

    # Receive facial expression labels from the server
    def receiveLabels(mySocket):
        # decode the bytes; str(data) would yield a "b'...'" wrapper in Python 3
        label = mySocket.recv(1024).decode("utf-8")
        print(label)
        return label

    # detects the faces in a frame
    def detectAndDisplay(frame):
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frame_gray = cv2.equalizeHist(frame_gray)
        #-- Detect faces
        faces = face_cascade.detectMultiScale(frame_gray)
        face_list = []
        i = 0
        for (x,y,w,h) in faces:
            center = (x + w//2, y + h//2)
            face_list.append(frame[y:y+h, x:x+w])

        return frame, face_list, faces

    face_cascade_name = "./face_detector.xml"#args.face_cascade
    face_cascade = cv2.CascadeClassifier()

    #-- 1. Load the cascades
    if not face_cascade.load(face_cascade_name):
        print('--(!)Error loading face cascade')
        exit(0)

    print("Starting server...\n")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # establishing a tcp connection
    sock.bind((TCP_IP, TCP_PORT))
    sock.listen(5)

    while True:
        (client_socket, client_address) = sock.accept()  # wait for the other machine to connect
        print('connection established with ' + str(client_address))
        cap = cv2.VideoCapture(video_file)
        pos_frame = cap.get(cv2.CAP_PROP_POS_FRAMES)
        # send frames
        while True:
            flag, frame = cap.read()
            labels = []
            if flag:
                frame_labelled, face_list, bounds = detectAndDisplay(frame)

                for i in range(len(face_list)):
                    a_face = face_list[i]
                    a_face = pickle.dumps(a_face)
                    size = len(a_face)
                    p = struct.pack('I', size)
                    a_face = p + a_face
                    client_socket.sendall(a_face)

                    label = receiveLabels(client_socket)

                    frame = cv2.rectangle(
                        frame,
                        (bounds[i][0], bounds[i][1]),
                        (bounds[i][0] + bounds[i][2], bounds[i][1] + bounds[i][3]),
                        (0, 0, 0),
                        1
                    )
                    frame = cv2.putText(frame, label, (bounds[i][0], bounds[i][1] + bounds[i][3] + 5), cv2.FONT_HERSHEY_COMPLEX, 0.5, (0, 0, 0), 1)

            else:
                cap.set(cv2.CAP_PROP_POS_FRAMES, pos_frame - 1)

            if cap.get(cv2.CAP_PROP_POS_FRAMES) == cap.get(cv2.CAP_PROP_FRAME_COUNT):
                # signal the end of the video with a zero-length message,
                # which the receiving side treats as "no more frames"
                client_socket.sendall(struct.pack("I", 0))
                break

            cv2.imshow("Frame", frame)
            cv2.waitKey(1)

# Run client
# run_client()

*Note: The client uses server sockets. This is an unintended mistake. It will be fixed soon.*
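
The client prefixes each pickled face with its size so the receiver knows where one message ends and the next begins. The sketch below isolates that length-prefix framing and runs it on an in-memory byte buffer instead of a socket. It is a minimal illustration under two assumptions of ours: it packs the length in network byte order (`"!I"`) rather than the native `"I"` used in the notebook, and the names `pack_message`, `unpack_messages` and `END_OF_STREAM` are hypothetical.

```python
import pickle
import struct

HEADER = struct.Struct("!I")  # 4-byte unsigned length, network byte order (our choice)
END_OF_STREAM = HEADER.pack(0)  # a zero-length message marks the end of the stream


def pack_message(obj):
    """Serialize obj and prefix it with the payload length."""
    payload = pickle.dumps(obj)
    return HEADER.pack(len(payload)) + payload


def unpack_messages(buffer):
    """Split buffer into complete messages.

    Returns (objects, finished, leftover): the decoded objects, whether the
    end-of-stream marker was seen, and any bytes of an incomplete message.
    """
    objects = []
    while len(buffer) >= HEADER.size:
        (size,) = HEADER.unpack_from(buffer)
        if size == 0:
            return objects, True, buffer[HEADER.size:]
        if len(buffer) < HEADER.size + size:
            break  # wait for the rest of this message
        start = HEADER.size
        objects.append(pickle.loads(buffer[start:start + size]))
        buffer = buffer[start + size:]
    return objects, False, buffer


stream = pack_message([1, 2, 3]) + pack_message("face") + END_OF_STREAM

# Simulate a partial read: the first 7 bytes do not hold a complete message.
objects, finished, leftover = unpack_messages(stream[:7])
print(objects, finished)

# Once the remaining bytes arrive, both messages and the end marker are decoded.
objects, finished, leftover = unpack_messages(leftover + stream[7:])
print(objects, finished)
```

Keeping the leftover bytes between reads is what lets the receiver cope with TCP delivering a message in several pieces, or several messages in one piece.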

The server then uses the model trained on Google Colab to reply with a string
containing the two most likely facial expressions.

*Code for the server*

In [None]:
def run_server():
    import cv2
    import socket
    import struct
    import pickle
    import keras
    import numpy
    import tensorflow
    import math

    print("Getting model files...")
    # load json and create model
    with open('model.json', 'r') as json_file:
        loaded_model_json = json_file.read()
    print("Loading models...")
    loaded_model = keras.models.model_from_json(loaded_model_json)
    loaded_model.load_weights("model.h5")
    print("Loaded model from disk")

    # # evaluate loaded model on test data
    loaded_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # #print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))
    print("Model compiled")

    TCP_IP = '153.106.213.22'
    TCP_PORT = 9502
    server_address = (TCP_IP, TCP_PORT)
    i = 0

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((TCP_IP, TCP_PORT))
    data = b''
    payload_size = struct.calcsize("I")

    def getframe(data):
        while len(data) < payload_size:
            data += sock.recv(4096)
        packed_msg_size = data[:payload_size]
        data = data[payload_size:]
        msg_size = struct.unpack("I", packed_msg_size)[0]
        while len(data) < msg_size:
            data += sock.recv(4096)
        frame_data = data[:msg_size]
        data = data[msg_size:]

        if frame_data == b'':
            return -1, data, None

        return 0, data, pickle.loads(frame_data)

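    # send the predicted label(s) back over the socket
    def sendLabel(socket_, prediction):
        emotions = ["neutral", "smiling", "sad", "surprise-shock", "angry", "disgusted", "fearful"]

        # rank class indices by probability so each label stays aligned with its score
        # (popping from the list would shift the indices of the remaining classes)
        ranked = sorted(range(len(prediction)), key=lambda idx: prediction[idx], reverse=True)
        response = emotions[ranked[0]]
        # softmax scores sum to 1, so the runner-up is at most 0.5 and this branch effectively never fires
        if prediction[ranked[1]] > 0.5:
            response = response + " / " + emotions[ranked[1]]

        socket_.send(bytearray(response, "utf-8"))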


    while True:
        flag, data, frame = getframe(data)
        if (flag == -1):
            break
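        frame = cv2.resize(frame, (256, 256), interpolation=cv2.INTER_AREA)
        # match the training preprocessing: RGB channel order and pixel values scaled to [0, 1]
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).astype("float32") / 255.0
        predictions = loaded_model.predict(numpy.reshape(frame, (1, 256, 256, 3)))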
        print(predictions)
        sendLabel(sock, predictions[0].tolist())

    sock.close()

# Run server
# run_server()

*Note: The server uses client sockets. This is an unintended mistake. It will be fixed soon.*
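
The reply is meant to carry the two most likely expressions. One way to pick them without disturbing the label order is to rank the class indices by probability, as sketched below. The function name `top_two_label`, the 0.25 threshold and the sample probabilities are our assumptions for illustration; the emotion list is the one used by the server.

```python
EMOTIONS = ["neutral", "smiling", "sad", "surprise-shock", "angry", "disgusted", "fearful"]


def top_two_label(probabilities, threshold=0.25, labels=EMOTIONS):
    """Return the best label, plus the runner-up when its score exceeds threshold."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda idx: probabilities[idx],
        reverse=True,
    )
    best, runner_up = ranked[0], ranked[1]
    response = labels[best]
    if probabilities[runner_up] > threshold:
        response += " / " + labels[runner_up]
    return response


# Made-up softmax outputs: one ambiguous prediction and one confident prediction.
print(top_two_label([0.05, 0.02, 0.03, 0.45, 0.40, 0.03, 0.02]))
print(top_two_label([0.90, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]))
```

Because softmax scores sum to 1, the runner-up can never be larger than 0.5, so a threshold below 0.5 is needed for the second label to appear at all.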

### Results
*Please watch recognition_trial.mp4 for a demonstration*

Overall, our project was a failure. Significant challenges in creating our training dataset heavily influenced our model's performance. Infrastructure problems in Unity prevented us
from integrating our project with the AR headset and limited our project to a facial expression recognizer that spans two computers.

Even though our model performed about as well as Sharma's model, with 85% accuracy, it did not have enough training data to understand the
nuances of facial expressions. For example, the model has a hard time differentiating between shock and anger. This is because the data points
for both expressions showed people with their mouths open.

Additionally, our models were significantly influenced by the peculiarities of our training data. Because we searched with phrases associated with the emotions,
our images shared similarities inherent to each phrase. (For example, if we searched "syrian war" for sad, the resulting images would show extreme forms of sadness.)
We believe this led our model to make unhelpful associations between those shared features and the emotion. For example, one of the phrases used to search for images
of neutral facial expressions was "passport photo." This meant that the pictures had eyes looking directly at the camera. This is reflected in our observations: frames in which a
subject is looking directly at the camera are labelled as neutral.

### Implications

There are obvious ethical issues regarding applications that capture video and transmit it elsewhere over the internet. Were this
to be a product that users purchased, there would need to be specific security and privacy assurances. Though we never
achieved it, our original goal was to have the facial expression recognizer running on an AR headset as the client. This would be a prototype
for a future product aimed at autistic children. One of the symptoms of autism, particularly in children, is
difficulty reading facial expressions. Such an application would run on a set of AR glasses, which are indistinguishable
from regular glasses these days. A child with autism could get some assistance identifying facial expressions, which
could help the child learn to read them as well.

### Conclusion
Facial recognition is a growing sub-field of object-detection research. Our project focused on detecting faces and
classifying their facial expressions. We used OpenCV's cascade classifier models to find faces in images and trained a CNN
to classify the facial expressions. Even though our project was a failure, it still demonstrated the versatility of AI systems
by building a pipeline that spans multiple devices. With a fast enough messaging system, this division of labor lets a network
of simple machines, like the Raspberry Pi, conduct resource-intensive AI tasks as a unit.

### Citations

Sharma, Gaurav. Real Time Facial Expression Recognition. Medium, n.d. https://medium.com/datadriveninvestor/real-time-facial-expression-recognition-f860dacfeb6a.

https://medium.com/analytics-vidhya/yolo-v3-theory-explained-33100f6d193