# Data Context

We use the AR face database, which is public and free to access. To enable detailed testing and
model building, the AR face images have been manually labelled with 22 facial feature points on each
face. The 22 points chosen are consistent across all images. This labelled database contains face
images of 136 persons (76 men and 60 women). The images show frontal-view faces with different facial
expressions and illumination conditions.

# Data Format

- Male images are stored as `m-xx-yy.pts`.
- Female images are stored as `w-xx-yy.pts`.
- `xx` is a unique person identifier (from "00" to "76" for men and from "00" to "60" for
women). `yy` specifies the expression or lighting condition, with the following meanings:

```text
1: Neutral expression
2: Smile
3: Anger
5: Left light on
```
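The naming scheme above can be turned into structured fields with a small parser. This is a minimal sketch, not the notebook's own code: the names `PtsName`, `CONDITIONS`, and `parse_pts_name` are hypothetical, and the pattern assumes the identifier and condition are plain runs of digits without fixing their zero-padding width.

```python
import re
from typing import NamedTuple

# Condition codes taken from the list above; the label strings are illustrative.
CONDITIONS = {1: "neutral", 2: "smile", 3: "anger", 5: "left_light"}


class PtsName(NamedTuple):
    gender: str      # "m" or "w"
    person_id: str   # kept as a string so zero-padding is preserved
    condition: str   # one of the CONDITIONS labels


# Assumption: <gender>-<digits>-<digits>.pts, with no fixed padding width.
_PTS_NAME = re.compile(r"^(m|w)-(\d+)-(\d+)\.pts$")


def parse_pts_name(file_name: str) -> PtsName:
    """Split a .pts file name into gender, person identifier, and condition."""
    match = _PTS_NAME.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    gender, person_id, code = match.groups()
    if int(code) not in CONDITIONS:
        raise ValueError(f"Unknown condition code {code!r} in {file_name!r}")
    return PtsName(gender, person_id, CONDITIONS[int(code)])


print(parse_pts_name("m-061-02.pts"))
```

Rejecting unknown names early, rather than skipping them silently, makes stray files in the dataset folder visible.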

# Extract Workload

The core focus of the extract workload is to create a flat-file representation of all the individuals,
which includes their gender, ID, and emotional state alongside each of the individual points that
were gathered from each of their facial expression image(s). This flat file is a CSV
located in the `ex_res` folder, which holds the extracted and minimally preprocessed data.

The goal here is to be able to retain as much information as possible, and allow us to craft new features
across the whole dataset easily. The CSV will allow us to create a Pandas dataframe which can be easily
manipulated into our desired shape(s) when it comes to feature engineering.
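As a small illustration of why the flat layout is convenient for feature engineering, the sketch below regroups the point columns of one row into `(x, y)` pairs and derives a distance feature from them. The column naming `p_{i}_x` / `p_{i}_y` follows the extract code in this notebook; the helper names `row_to_points` and `point_distance` and the sample row are hypothetical.

```python
import math
from typing import List, Mapping, Tuple


def row_to_points(row: Mapping[str, float], num_points: int) -> List[Tuple[float, float]]:
    """Regroup the flat p_{i}_x / p_{i}_y columns of one row into (x, y) pairs."""
    return [(float(row[f"p_{i}_x"]), float(row[f"p_{i}_y"])) for i in range(num_points)]


def point_distance(row: Mapping[str, float], first: int, second: int, num_points: int) -> float:
    """Euclidean distance between two landmark points of the same row."""
    points = row_to_points(row, num_points)
    return math.dist(points[first], points[second])


# Made-up row with two points, used only to show the call.
sample_row = {"p_0_x": 0.0, "p_0_y": 0.0, "p_1_x": 3.0, "p_1_y": 4.0}
print(point_distance(sample_row, 0, 1, num_points=2))  # 5.0
```

The same helpers accept a dictionary built from a dataframe row, so derived features can be added without changing the CSV layout.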

## Verify FaceMarkupARDatabase

Ensure the end user has our original dataset and not a manipulated or malformed version.

In [1]:
import hashlib
import os

def hash_file(file_path):
    # Generate a hash for a file
    hasher = hashlib.sha256()
    with open(file_path, 'rb') as file:
        while True:
            data = file.read(65536)  # Read in 64k chunks
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()

def hash_folder(folder_path):
    # Generate a hash for a folder
    folder_hasher = hashlib.sha256()
    for root, dirs, files in os.walk(folder_path):
        for filename in files:
            file_path = os.path.join(root, filename)
            folder_hasher.update(hash_file(file_path).encode('utf-8'))
    return folder_hasher.hexdigest()

folder_path = "../../FaceMarkupARDatabase"
expected_hash = "a3b9a9a41f515586fb41411eb6b184dd3d291d7a04f2d3d0a8527181d1ded25a"
folder_hash = hash_folder(folder_path)

if folder_hash != expected_hash:
    raise ValueError(f"The hash of the folder does not match the expected value. Expected: {expected_hash}, Actual: {folder_hash}")
else:
    print("FaceMarkupARDatabase has been verified")

ValueError: The hash of the folder does not match the expected value. Expected: a3b9a9a41f515586fb41411eb6b184dd3d291d7a04f2d3d0a8527181d1ded25a, Actual: ac980862a82d84574ab4037e5dcde757750aa1419e55ec37717c4105493060d8
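A mismatch like the one above does not necessarily mean the data was altered. One possible cause is that `os.walk` yields files in an order that depends on the operating system and file system, and the folder hash above depends on that order; stray files such as editor or OS metadata would also change it. The sketch below is a hypothetical order-independent variant (`hash_folder_sorted` is not part of the original pipeline). It produces a different digest from `hash_folder`, so the expected value would have to be recomputed with it before use.

```python
import hashlib
import os


def hash_folder_sorted(folder_path: str) -> str:
    """Hash relative paths and contents of every file, visiting them in sorted order."""
    folder_hasher = hashlib.sha256()
    for root, dirs, files in os.walk(folder_path):
        dirs.sort()  # sorting in place makes os.walk descend in a fixed order
        for filename in sorted(files):
            file_path = os.path.join(root, filename)
            # Include the relative path so renamed or moved files change the hash
            rel_path = os.path.relpath(file_path, folder_path).replace(os.sep, "/")
            folder_hasher.update(rel_path.encode("utf-8"))
            with open(file_path, "rb") as file:
                # Read in 64k chunks until read() returns an empty bytes object
                for chunk in iter(lambda: file.read(65536), b""):
                    folder_hasher.update(chunk)
    return folder_hasher.hexdigest()
```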

## Create Extract Results Folder At Project Root

In [6]:
import os

def create_folder_structure():
    # Use the current working directory as the root; os.getcwd() does not locate the Git project root by itself
    root_dir = os.getcwd()

    # Define the path for the ex_res folder
    ex_res_folder = os.path.join(root_dir, "ex_res")

    # Check if the folder structure already exists
    if not os.path.exists(ex_res_folder):
        # If the ex_res folder does not exist, create the folder
        os.makedirs(ex_res_folder)
        print("Folder structure created successfully.")
    else:
        print("Folder structure already exists! Please delete it and re-run the extract pipeline if you are running into issues.")
    
    return ex_res_folder

ex_res_folder = create_folder_structure()

Folder structure already exists! Please delete it and re-run the extract pipeline if you are running into issues.
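The cell above creates `ex_res` under the current working directory, which matches the project root only if the notebook is started from there. If the folder should always live at the repository root, one option is to walk up the directory tree until a `.git` entry is found. This is a sketch under that assumption; `find_project_root` and `ensure_ex_res` are hypothetical helpers, not part of the original pipeline.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None) -> Path:
    """Return the closest ancestor directory (or start itself) that contains .git."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(f"No .git directory found above {current}")


def ensure_ex_res(root: Path) -> Path:
    """Create root/ex_res if needed and return its path."""
    ex_res = root / "ex_res"
    ex_res.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return ex_res


# Example call (commented out because it requires running inside a Git checkout):
# ex_res_folder = ensure_ex_res(find_project_root())
```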


## Extract & Preprocess

Each facial expression from the `.pts` file(s) will be extracted and transformed accordingly:

- Gender is a binary column where `0 = Female | 1 = Male`
- The expression or lighting condition is one-hot encoded into four columns: `['neutral', 'smile', 'anger', 'left_light']`
    - Exactly one column holds a 1, marking the condition of that image
    - All other columns are 0
- The person's unique ID is a string that combines the gender prefix and the person identifier parsed from the `.pts` file name, for example: `'m' + '001' = 'm001'`

The expected result of this workflow should leave us with a file structure like this:

```text
└───ex_res
    └───ex_res.csv
```

#### Interesting Notes:

Women 047-060 only have anger and left_light images, which severely limits the data available for identifying them.
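Gaps like this can be detected from the file names alone. The sketch below groups names by person and reports which condition codes are missing; `missing_conditions` is a hypothetical helper, the two-digit codes follow the `01, 02, 03, 05` scheme used by the extract code, and the sample names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Two-digit condition codes expected for every person.
EXPECTED_CODES = {"01", "02", "03", "05"}


def missing_conditions(file_names: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each person ID to the condition codes that have no .pts file."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for name in file_names:
        # Skip anything that is not of the form <gender>-<id>-<code>.pts
        if not name.endswith(".pts") or name.count("-") != 2:
            continue
        gender, person_id, rest = name.split("-")
        seen[gender + person_id].add(rest[:2])
    return {
        person: EXPECTED_CODES - codes
        for person, codes in seen.items()
        if EXPECTED_CODES - codes
    }


names = [
    "w-047-03.pts", "w-047-05.pts",
    "m-001-01.pts", "m-001-02.pts", "m-001-03.pts", "m-001-05.pts",
    "dummy.pts",
]
print(missing_conditions(names))
```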

In [7]:
from dataclasses import dataclass
from typing import List

@dataclass
class Point:
    x: float
    y: float
    
def read_points_from_file(file_name):
    points: List[Point] = []
    with open(file_name, 'r') as file:
        for line in file:
            # str.strip() never returns None; an empty result means the line was blank
            line = line.strip()
            if not line:
                continue
            
            # Ignore header and brace lines that are not facial points
            if line.startswith(('version', 'n_points', '{', '}')):
                continue
                
            x, y = map(float, line.split())
            points.append(Point(x, y))
    
    return points
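To make the parsing rules concrete, the sketch below runs a similar parser over an in-memory sample and additionally checks the point count against the header. The sample text is made up; its layout (a `version` line, an `n_points: N` line, and points between braces) is an assumption inferred from the prefixes the reader above skips, and `parse_pts_text` is a hypothetical name.

```python
from typing import List, Optional, Tuple

# Made-up sample; real files are assumed to declare 22 points.
SAMPLE_PTS = """version: 1
n_points: 3
{
10.5 20.25
30.0 40.0

50.75 60.5
}
"""


def parse_pts_text(text: str) -> List[Tuple[float, float]]:
    """Parse .pts-style text and verify the declared number of points."""
    declared: Optional[int] = None
    points: List[Tuple[float, float]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, braces, and the version header
        if not line or line in ("{", "}") or line.startswith("version"):
            continue
        if line.startswith("n_points"):
            declared = int(line.split(":")[1])
            continue
        x, y = map(float, line.split())
        points.append((x, y))
    if declared is not None and declared != len(points):
        raise ValueError(f"Header declares {declared} points, found {len(points)}")
    return points


print(parse_pts_text(SAMPLE_PTS))
```

Checking the count at parse time surfaces truncated files before they produce rows with the wrong number of columns.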

In [8]:
from enum import Enum

class EmotionalExpression(Enum):
    NEUTRAL = 1
    SMILE = 2
    ANGER = 3
    LEFT_LIGHT = 5

def one_hot_encode_emotion(emotion_name):
    if emotion_name == 'NEUTRAL':
        return [1, 0, 0, 0]
    elif emotion_name == 'SMILE':
        return [0, 1, 0, 0]
    elif emotion_name == 'ANGER':
        return [0, 0, 1, 0]
    elif emotion_name == 'LEFT_LIGHT':
        return [0, 0, 0, 1]
    else:
        raise ValueError("Emotion Name not recognized")

def transform_df_friendly(gender: str, person_id: str, emotional_expr: str):
    df_gender = 1 if gender == 'm' else 0
    df_person_id = gender + person_id
    df_emotional_expr = one_hot_encode_emotion(EmotionalExpression(int(emotional_expr)).name)

    return df_gender, df_person_id, df_emotional_expr
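The `if`/`elif` chain above has to be kept in sync with the enum and with the column order by hand. A possible alternative, shown here only as a sketch, derives both the column names and the encoding from the enum's definition order. The names `Condition`, `CONDITION_COLUMNS`, and `one_hot` are hypothetical and chosen to avoid clashing with the cell above.

```python
from enum import Enum
from typing import List


class Condition(Enum):
    NEUTRAL = 1
    SMILE = 2
    ANGER = 3
    LEFT_LIGHT = 5


# Enum members iterate in definition order, so columns and encoding stay aligned.
CONDITION_COLUMNS = [member.name.lower() for member in Condition]


def one_hot(code: int) -> List[int]:
    """Return a one-hot list for a condition code; unknown codes raise ValueError."""
    target = Condition(code)
    return [1 if member is target else 0 for member in Condition]


print(CONDITION_COLUMNS)
print(one_hot(5))  # [0, 0, 0, 1]
```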


In [9]:
import pandas as pd

database_path = "../../FaceMarkupARDatabase/points_22"

num_face_points = 22
column_names = ['gender', 'person_id', 'neutral', 'smile', 'anger', 'left_light']

for i in range(num_face_points):
    column_names.extend([f'p_{i}_x', f'p_{i}_y'])

def traverse_facial_expressions():
    # Walk through the FaceMarkupARDatabase/points_22 folder
    # Create a dataframe that will have gender, person_id, emotional_expr, and 22 points from x and y

    rows = []
    for dirpath, _, files in os.walk(database_path):
        if 'dummy.pts' in files: # Skip any directory that contains dummy.pts, as it's not relevant
            continue
        for file in files:
            if file.endswith(".pts"):
                # Extract Gender & Person Unique ID & emotional_expr
                gender, person_id, emotional_expr = file.split('-')
                # Expression also contains the suffix of the file extension
                # Expression is always two digits: 01, 02, 03, or 05
                emotional_expr = emotional_expr[:2]
                facial_expression_points: List[Point] = read_points_from_file(os.path.join(dirpath, file))
                
                # Preprocess columns
                df_gender, df_person_id, df_emotional_expr_lst = transform_df_friendly(gender, person_id, emotional_expr)
                
                # Craft Dataframe Row for specific person in specific emotional state
                df_row = [df_gender, df_person_id] + df_emotional_expr_lst
                
                for point in facial_expression_points:
                    df_row.extend([point.x, point.y])
                
                # Collect the row; building the dataframe once avoids repeated df.loc appends
                rows.append(df_row)

    return pd.DataFrame(rows, columns=column_names)

flat_file_df = traverse_facial_expressions()
flat_file_df.to_csv(os.path.join(ex_res_folder, "ex_res.csv"), index=True, index_label='index')

In [11]:
test_df = pd.read_csv(os.path.join(ex_res_folder, "ex_res.csv"))
test_df.head()

|  | index | gender | person_id | neutral | smile | anger | left_light | p_0_x | p_0_y | p_1_x | ... | p_17_x | p_17_y | p_18_x | p_18_y | p_19_x | p_19_y | p_20_x | p_20_y | p_21_x | p_21_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | m061 | 0 | 1 | 0 | 0 | 337.065 | 268.794 | 450.991 | ... | 398.359 | 366.73 | 397.026 | 400.041 | 398.359 | 464.666 | 286.432 | 378.722 | 511.618 | 375.391 |
| 1 | 1 | 1 | m061 | 1 | 0 | 0 | 0 | 327.346 | 262.923 | 441.29 | ... | 386.599 | 374.627 | 387.108 | 393.943 | 389.141 | 462.566 | 280.87 | 383.777 | 497.921 | 383.777 |
| 2 | 2 | 1 | m061 | 0 | 0 | 0 | 1 | 325.671 | 257.013 | 443.241 | ... | 385.823 | 364.557 | 386.734 | 383.696 | 386.734 | 455.696 | 275.544 | 379.139 | 489.722 | 377.316 |
| 3 | 3 | 1 | m061 | 0 | 0 | 1 | 0 | 323.919 | 295.817 | 439.65 | ... | 380.946 | 420.353 | 380.946 | 442.577 | 380.107 | 497.088 | 292.471 | 414.902 | 485.775 | 421.192 |
| 4 | 4 | 1 | m066 | 1 | 0 | 0 | 0 | 331.813 | 243.564 | 455.946 | ... | 390.842 | 362.18 | 389.509 | 375.857 | 388.926 | 465.606 | 262.461 | 373.525 | 516.556 | 377.605 |
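A quick sanity check on the extracted file could confirm that every row has exactly one condition flag set. The sketch below uses the standard-library `csv` module on a small in-memory sample rather than the real `ex_res.csv`; the function name `rows_with_bad_flags` and the sample rows are hypothetical, while the column names match the ones written by the extract step.

```python
import csv
import io
from typing import List, TextIO

FLAG_COLUMNS = ["neutral", "smile", "anger", "left_light"]


def rows_with_bad_flags(csv_file: TextIO) -> List[str]:
    """Return the 'index' values of rows whose condition flags do not sum to 1."""
    bad = []
    for row in csv.DictReader(csv_file):
        if sum(int(row[col]) for col in FLAG_COLUMNS) != 1:
            bad.append(row["index"])
    return bad


# Made-up sample: the second row wrongly sets two flags.
sample = io.StringIO(
    "index,gender,person_id,neutral,smile,anger,left_light\n"
    "0,1,m061,0,1,0,0\n"
    "1,1,m061,1,0,1,0\n"
)
print(rows_with_bad_flags(sample))  # ['1']
```

For the real file, the same function can be called on a handle opened with `open(path, newline="")`.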
