%----------------------------------------------------------------------------------------
%	PACKAGES AND OTHER DOCUMENT CONFIGURATIONS
%----------------------------------------------------------------------------------------

\documentclass[fleqn,10pt]{SelfArx} % Document font size and equations flushed left

\usepackage[english]{babel} % Specify a different language here - english by default
\usepackage[T1]{fontenc}
\usepackage{array}
\usepackage{booktabs}
\usepackage[ruled]{algorithm2e}

\usepackage{mdframed}
\usepackage{lipsum} % Required to insert dummy text. To be removed otherwise
\usepackage{hyperref}
\hypersetup{
  colorlinks=true,
  linkcolor=blue,
  filecolor=magenta,      
  urlcolor=cyan,
}
%----------------------------------------------------------------------------------------
%	COLUMNS
%----------------------------------------------------------------------------------------

\setlength{\columnsep}{0.55cm} % Distance between the two columns of text
\setlength{\fboxrule}{0.75pt} % Width of the border around the abstract

%----------------------------------------------------------------------------------------
%	COLORS
%----------------------------------------------------------------------------------------

\definecolor{color1}{RGB}{0,0,90} % Color of the article title and sections
\definecolor{color2}{RGB}{0,20,20} % Color of the boxes behind the abstract and headings

%----------------------------------------------------------------------------------------
%	HYPERLINKS
%----------------------------------------------------------------------------------------

\usepackage{hyperref} % Required for hyperlinks

\hypersetup{
	hidelinks,
	colorlinks,
	breaklinks=true,
	urlcolor=color2,
	citecolor=color1,
	linkcolor=color1,
	bookmarksopen=false,
	pdftitle={Title},
	pdfauthor={Author},
}

%----------------------------------------------------------------------------------------
%	ARTICLE INFORMATION
%----------------------------------------------------------------------------------------

\PaperTitle{Face Recognition} % Article title
\Authors{Nguyen Minh Cuong\textsuperscript{1, *}, Tran Le My Linh\textsuperscript{1, *}, Doan Ngoc Cuong\textsuperscript{1, *}, Nguyen Thanh Long\textsuperscript{1, *},
\\ Bui Khanh Linh\textsuperscript{1, *}} % Authors
\affiliation{\textsuperscript{1}\textit{The School of Information and Communication Technology - Hanoi University of Science and Technology}} % Author affiliation
\affiliation{*\textbf{Corresponding author}:
\\ cuong.nm210140@sis.hust.edu.vn, 
\\ linh.tlm210535@sis.hust.edu.vn, 
\\ cuong.dn210141@sis.hust.edu.vn,
\\ long.nt214912@sis.hust.edu.vn, 
\\ linh.bk214910@sis.hust.edu.vn
}% Corresponding author

\Keywords{face recognition; face verification; machine learning} % Keywords - if you don't want any simply remove all the text between the curly brackets
\newcommand{\keywordname}{Keywords} % Defines the keywords heading name

%----------------------------------------------------------------------------------------
%	ABSTRACT
%----------------------------------------------------------------------------------------

\Abstract{This research is driven by ...
}

%----------------------------------------------------------------------------------------

\begin{document}

\maketitle % Output the title and abstract box
\tableofcontents % Output the contents section

\thispagestyle{empty} % Removes page numbering from the first page

%----------------------------------------------------------------------------------------
%	ARTICLE CONTENTS
%----------------------------------------------------------------------------------------

\section{Introduction} % The \section*{} command stops section numbering

Face Recognition has emerged as a transformative technology, reshaping the landscape of security, identity verification, and human-computer interaction. This computational process aims to identify and verify individuals based on their facial features, offering a seamless and non-intrusive means of authentication. Solving the face recognition problem involves several steps, from locating a face in an image to comparing its features against known identities.

In recent years, advancements in computer vision, machine learning, and deep neural networks have propelled Face Recognition to the forefront of biometric technologies. This technology not only holds immense promise for enhancing security protocols but also extends its applications to diverse fields, including access control, surveillance, and personalized user experiences.

The fundamental challenge in Face Recognition lies in the need to orchestrate a series of intricate tasks. From the initial detection of facial landmarks and alignment to the extraction of discriminative features and subsequent classification, each step plays a crucial role in achieving accurate and reliable results. The fusion of computer vision algorithms and artificial intelligence methodologies has paved the way for robust Face Recognition systems capable of handling diverse scenarios, facial expressions, and environmental variations.

\subsection{Organization}
\ 
The remainder of this paper describes the datasets and data processing, introduces the preliminaries on face verification losses and model integration, presents the face detection and face verification models, reports the evaluation results, and concludes with directions for future work.
%------------------------------------------------
\section{Data processing}

\subsection{Data Overview}

\subsection{Face Detection}
To train the face detection models, we used the WIDER FACE dataset, which contains 32,203 images with 393,703 labeled faces exhibiting a high degree of variability in scale, pose, and occlusion. The WIDER FACE dataset is organized into 61 event classes. For the face alignment task, we trained the model with the LFW dataset, which contains 5,590 LFW images and 7,876 other images downloaded from the web. The training and validation sets are defined in \texttt{trainImageList.txt} and \texttt{testImageList.txt}, respectively. Each line of these text files starts with the image name, followed by the boundary positions of the face bounding box returned by our face detector, and then the positions of the five facial points.

\subsection{Face Verification}
For training, we use a subset of the MS-Celeb-1M (MS1M) dataset, which contains about 10,000,000 images of roughly 100,000 people. Because of limited hardware, we train on a randomly picked subset of 10,000 labels (approx. 100,000 images). The variant we use has been cropped to the face and aligned using RetinaFace. Each image has shape (3, 112, 112), that is, a 3-channel RGB image of $112 \times 112$ pixels.
%------------------------------------------------
\section{Preliminaries}

\subsection{Face Verification}
In Face Verification, the primary objective is to determine whether two images depict the same person. A simplistic approach is to compare the images pixel by pixel, but this is unreliable because of variations in lighting, facial orientation, and other factors. To address these challenges, an encoding function, denoted $f(\text{img})$, is introduced. Comparing the encodings element-wise, rather than the raw pixels, results in more accurate judgments regarding the similarity of two images.

\subsubsection{Face Encoding}
To enhance Face Verification, we utilize advanced neural network architectures to process RGB face images, generating 128-dimensional embedding vectors containing distinctive features of those faces.

Calculating the distance between two embedding vectors and applying a threshold makes it possible to ascertain whether two pictures represent the same person. The effectiveness of an encoding is measured based on the following criteria:

\begin{itemize}
    \item The embeddings of two images of the same person should be notably similar.
    \item The embeddings of two images of different individuals should exhibit significant dissimilarity.
\end{itemize}
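
As a compact way to state this decision rule, one possible formalization is given below. The squared Euclidean distance and the threshold symbol $\tau$ are assumptions made for illustration; the distance is chosen to match the norm used in the triplet loss constraint.
\[
d(x_1, x_2) = \| f(x_1) - f(x_2) \|_2^2, \qquad \text{same person} \iff d(x_1, x_2) \leq \tau
\]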

\subsubsection{The Triplet Loss}

For an image $x$, we denote its encoding by $f(x)$, where $f$ is the function computed by the neural network. Training uses triplets of images ($A$, $P$, $N$):

\begin{itemize}
    \item $A$ is the ``Anchor'' image: a picture of a person.
    \item $P$ is the ``Positive'' image: a picture of the same person as the Anchor image.
    \item $N$ is the ``Negative'' image: a picture of a different person than the Anchor image.
\end{itemize}

These triplets are selected from the training dataset and are represented as ($A^{(i)}$, $P^{(i)}$, $N^{(i)}$) for the $i$-th training example. The objective is to ensure that the encoding of an image $A^{(i)}$ of an individual is closer to the Positive image $P^{(i)}$ than to the Negative image $N^{(i)}$ by at least a margin $\alpha$:
\[
\| f(A^{(i)}) - f(P^{(i)}) \|_2^2 + \alpha \leq \| f(A^{(i)}) - f(N^{(i)}) \|_2^2
\]
This constraint guides the training process to embed similar faces closer in the feature space while pushing dissimilar faces farther apart.
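
The constraint above is usually turned into a hinge-style objective that is zero whenever the constraint holds. The form below follows the standard triplet loss formulation; the symbol $\mathcal{L}$, the number of triplets $M$, and the numeric values in the example are illustrative assumptions rather than values from our experiments.
\[
\mathcal{L} = \sum_{i=1}^{M} \max\left( \| f(A^{(i)}) - f(P^{(i)}) \|_2^2 - \| f(A^{(i)}) - f(N^{(i)}) \|_2^2 + \alpha,\; 0 \right)
\]
For example, with $\alpha = 0.2$, a triplet with anchor-positive distance $0.5$ and anchor-negative distance $0.6$ contributes $\max(0.5 - 0.6 + 0.2,\, 0) = 0.1$, whereas a triplet with anchor-negative distance $0.9$ already satisfies the constraint and contributes $0$.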

\subsubsection{Online Triplet Loss}
In the original implementation of triplet loss, known as offline triplet loss, triplets are pre-selected and stored in a dataset before the training starts. Concretely, batches of triplets of size \(B\) are created, resulting in the computation of \(3B\) embeddings (anchor, positive, negative) to obtain \(B\) valid triplets. The loss for these triplets is then computed, and backpropagation is performed to update the network.

In contrast, in online triplet loss \cite{onlinetriplet}, triplets are selected dynamically during each training iteration. For each training batch, an anchor sample is chosen, and positive and negative samples are selected based on the current state of the model. Given a batch of \(B\) examples (e.g., \(B\) images of faces), \(B\) embeddings are computed, and a maximum of \(B^3\) triplets can be generated. However, many of these triplets are invalid, meaning they do not consist of two images of the same person (anchor and positive) and one image of a different person (negative).

Using a sampler in the DataLoader helps ensure a specific number of valid triplets in each batch. In the PyTorch Metric Learning library, \texttt{MPerClassSampler} allows the number of samples per class in a batch to be specified. For instance, with 4 samples per class and 16 classes per batch, each batch contains 64 samples. This approach stabilizes training and maximizes the number of valid triplets per batch.
\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{Image/OnlineLoss/online_triplet.png}
    \caption{Implementation of online triplet mining.}

    \label{fig:onlinetriplet}
\end{figure}

Strategies in online mining: When dealing with online triplet loss, a batch of \(B\) embeddings is computed from a batch of \(B\) inputs. The goal is to generate triplets from these \(B\) embeddings. For three indices \(i, j, k \in [1, B]\), if examples \(i\) and \(j\) share the same label but are distinct, and example \(k\) has a different label, the triplet \((i, j, k)\) is considered valid. The objective is to choose triplets wisely for computing the loss.

Assume the input batch of faces has size \(B = PK\), where \(P\) is the number of distinct persons and \(K\) is the number of images per person (typically \(K = 4\)). Two strategies are commonly employed \cite{metriclearning}:

\textbf{Batch All:}
   \begin{itemize}
      \item Select all valid triplets and average the loss on hard and semi-hard triplets.
      \item Do not consider easy triplets (those with loss \(0\)), as averaging on them would make the overall loss very small.
      \item This results in a total of \(PK(K-1)(PK-K)\) triplets (with \(PK\) anchors, \(K-1\) possible positives per anchor, and \(PK-K\) possible negatives).
   \end{itemize}

\textbf{Batch Hard:}
   \begin{itemize}
      \item For each anchor, select the hardest positive (largest distance \(d(a,p)\)) and the hardest negative (smallest distance \(d(a,n)\)) within the batch.
      \item This strategy produces \(PK\) triplets.
      \item The selected triplets are the hardest among the batch.
   \end{itemize}
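
To make the two strategies concrete, the triplet counts can be worked out for the sampler configuration mentioned earlier ($P = 16$ persons with $K = 4$ images each, so $B = 64$). This is a worked illustration, not a reported result.
\[
\text{Batch All: } PK(K-1)(PK-K) = 64 \cdot 3 \cdot 60 = 11{,}520 \text{ triplets}
\]
\[
\text{Batch Hard: } PK = 64 \text{ triplets}
\]
One common way to write the Batch Hard objective is shown below, where $x_a^i$ denotes the $a$-th image of person $i$, the indices $p$ and $n$ range over $1, \dots, K$, and $[\cdot]_+ = \max(\cdot, 0)$; this notation is assumed here for illustration.
\[
\mathcal{L}_{\text{BH}} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ \alpha + \max_{p} d\big(f(x_a^i), f(x_p^i)\big) - \min_{j \neq i,\, n} d\big(f(x_a^i), f(x_n^j)\big) \Big]_+
\]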
   
\subsubsection{Additive Angular Margin Loss}

Additive Angular Margin Loss (ArcFace) \cite{deng2022arcface} aims to further improve the discriminative power of the face recognition model and to stabilize the training process by adding a margin penalty $m$ to the angle between the current feature and the target weight:

\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{Image/arcface/training.png}
    \caption{Training a DCNN for face recognition supervised by the ArcFace loss \cite{deng2022arcface}. Based on the feature $x_i$ and weight $W$ normalization, we get the $\cos \theta_j$ (logit) for each class as $W_j^T x_i$. We calculate $\arccos(\cos \theta_{y_i})$ to get the angle between the feature $x_i$ and the ground-truth weight $W_{y_i}$. In fact, $W_j$ provides a kind of centre for each class. Then, we add an angular margin penalty $m$ on the target (ground truth) angle $\theta_{y_i}$. After that, we calculate $\cos(\theta_{y_i} + m)$ and multiply all logits by the feature scale $s$. The logits then go through the softmax function and contribute to the cross-entropy loss.}

    \label{fig:arcfacetraining}
\end{figure}

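As a result, the full loss function is:
\[L_{\text{ArcFace}} = -\frac{1}{N} \sum_{i = 1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j = 1, j \neq y_{i}}^{n} e^{s \cos\theta_j}} \]
where $N$ is the number of training samples in the batch, $n$ is the number of classes, $y_i$ is the ground-truth class of sample $i$, and $s$ is the feature scale.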

By doing so, the ArcFace loss enforces a more evident gap between the nearest classes, whereas the popular softmax loss produces roughly separable feature embeddings with noticeably ambiguous decision boundaries.
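
The effect of the margin can be seen in the two-class case. The comparison below is a sketch under the assumption of normalized features and weights, with $\theta_1$ and $\theta_2$ denoting the angles between a feature and the two class centres (symbols introduced here for illustration). A sample is assigned to class 1 when:
\[
\text{softmax: } \cos\theta_1 > \cos\theta_2, \qquad \text{ArcFace: } \cos(\theta_1 + m) > \cos\theta_2
\]
For $\theta_1 \in [0, \pi - m]$ and $\theta_2 \in [0, \pi]$, the ArcFace condition is equivalent to $\theta_1 + m < \theta_2$, so a training sample of class 1 must be closer to its own centre by at least the angular margin $m$. For instance, with $m = 0.5$ rad, a feature with $\theta_1 = 0.6$ and $\theta_2 = 0.9$ satisfies the softmax condition but not the ArcFace one, since $0.6 + 0.5 > 0.9$, so the loss keeps pushing it toward its class centre.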

\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{Image/arcface/compare_to_softmax.png}
    \caption{Toy examples under the softmax and ArcFace loss on 8 identities with 2D features \cite{deng2022arcface}. Dots indicate samples and lines refer to the centre direction of each identity. Based on the feature normalisation, all face features are pushed to the arc space with a fixed radius. The geodesic distance gap between closest classes becomes evident as the additive angular margin penalty is incorporated.}

    \label{fig:arcfacecompare}
\end{figure}


\subsection{Model Integration}
The face recognition problem requires at least the following three steps:

\paragraph{Face Detection and Face Alignment}
This step focuses on determining the position of the face in an image or video frame. A rectangle is created around the facial region for precise localization. Additionally, this process includes aligning the facial angle to ensure accuracy and uniformity in identifying facial features.

\paragraph{Face Extraction (Face Embedding)}
After determining the face's position, this step concentrates on extracting facial features. These features are represented as a vector in a high-dimensional space, typically a 128-dimensional vector. This numeric representation of the face is used to differentiate between individuals.

\paragraph{Face Classification (Face Verification)}
The final decision-making step is to classify the face based on the generated feature vector. In this task, the model compares the distance between the feature vector of the new face and those known beforehand in the dataset. A smaller distance indicates similarity between the two faces, allowing the model to determine whether the person in the new image is someone known or unknown from the dataset.
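
The comparison step can be summarized as a nearest-neighbour rule with rejection. The formulation below is a sketch: the embeddings $g_1, \dots, g_M$ of the known faces, the distance $d$, and the threshold $\tau$ are notation assumed here for illustration.
\[
k^{*} = \arg\min_{k \in \{1, \dots, M\}} d\big(f(x), g_k\big)
\]
The new face $x$ is assigned the identity of $g_{k^{*}}$ if $d(f(x), g_{k^{*}}) \leq \tau$, and is labelled as unknown otherwise.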


\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\textwidth]{Image/Model Inte/option2.png}
    \caption{Processing flow of the Face Recognition problem.}
    \label{fig:flowproblem}
\end{figure}
%------------------------------------------------
\section{Modelling}
\subsection{Face Detection}
\subsection{Face Verification}
\subsubsection{Model with ArcFace Loss}
For this loss function, we deploy a model consisting of a MobileNetV3 \cite{howard2019searching} backbone and an embedding head that outputs a 128-dimensional vector. We use the small MobileNetV3 variant from the TorchVision library, pretrained on ImageNet. Its classifier is replaced with the aforementioned embedding head, which consists of a linear layer preceded by a batch normalization layer and a dropout layer for better convergence. During training, we also randomly initialize the classification weight matrix $W$, which is discarded at inference time.

\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{Image/arcface/mobilenetv3.png}
    \caption{MobileNetV3-Small architecture \cite{howard2019searching}.}

    \label{fig:mobilenetv3}
\end{figure}

The input data for this model is first augmented using Random Erasing, Random Crop, and Random Horizontal Flip to simulate different real-world variations of the same face, and then resized to $224 \times 224$ before being fed to the model.


%--------
\subsubsection{Model with Triplet Loss}

For this loss function, we deploy a model consisting of an InceptionV2 \cite{inceptionv2} backbone and an embedding head that outputs a 128-dimensional vector. The paper ``Rethinking the Inception Architecture for Computer Vision'' proposed several upgrades, notably factorized convolutions and aggressive dimension reduction inside the network, resulting in networks with relatively low computational cost while maintaining high quality \cite{inceptionv2}.

\paragraph{General Design Principles}

Avoid representational bottlenecks, especially early in the network, by downscaling the input image and feature maps gently. A larger number of distinct filters yields more distinct feature maps, which leads to faster learning. Spatial aggregation can be done over lower-dimensional embeddings, that is, after dimension reduction, with little or no loss in representational power.

\paragraph{Factorizing Convolutions}

\begin{itemize}
    \item A $5 \times 5$ convolution is factorized into two stacked $3 \times 3$ convolutions, which reduces the computational load to $(9+9)/25$ of the original, a relative saving of 28\% \cite{inceptionv2}.
    \item Factorizing $n \times n$ convolutions into a combination of $1 \times n$ and $n \times 1$ convolutions, termed asymmetric convolutions, proves to be a cost-effective alternative.
\end{itemize}
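
The savings can be checked by counting kernel weights per pair of input and output channels. The short calculation below is a worked illustration under the assumption that the number of channels is kept fixed, so that the cost is proportional to the number of kernel weights.
\[
\frac{3 \cdot 3 + 3 \cdot 3}{5 \cdot 5} = \frac{18}{25} = 0.72 \quad (28\% \text{ saving})
\]
\[
\frac{1 \cdot n + n \cdot 1}{n \cdot n} = \frac{2}{n}, \qquad \text{e.g. } n = 3:\ \frac{6}{9} \approx 0.67 \quad (33\% \text{ saving})
\]
The relative saving of the asymmetric factorization therefore grows with $n$.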

\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{Image/InceptionV2/5x5convo.png}
    \caption{Factorizing a 5x5 convolution into two stacked 3x3 convolutions \cite{inceptionv2}.}
    \label{fig:inceptionv2convo}
\end{figure}

\paragraph{Grid Size Reduction}

Traditionally, convolutional networks use pooling before convolution operations to reduce the grid size of the feature maps, but this can introduce a representational bottleneck. The authors propose increasing the number of filters to remove the bottleneck, achieved by the inception module.

\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{Image/InceptionV2/grid.png}
    \caption{Bottleneck Feature Learning Module.}
    \label{fig:inceptionv2grid}
\end{figure}

In the left picture, a representational bottleneck is introduced by first reducing the grid size and then expanding the filter bank; the right picture does the opposite, expanding the filter bank before reducing the grid size.

However, the approach on the right is more expensive, so the authors proposed another solution that reduces the computational cost while eliminating the bottleneck by using two parallel stride-2 blocks, one pooling and one convolution.
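
The cost difference can be quantified with the example used in \cite{inceptionv2}: reducing a $d \times d$ grid with $k$ filters to a $\frac{d}{2} \times \frac{d}{2}$ grid with $2k$ filters. The operation counts below are approximate and omit the constant kernel-size factor.
\[
\text{convolution, then pooling: } 2 d^2 k^2
\]
\[
\text{pooling, then convolution: } 2 \left(\frac{d}{2}\right)^2 k^2 = \frac{1}{2} d^2 k^2
\]
Pooling first is four times cheaper but creates the bottleneck, which motivates the parallel design.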

\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{Image/InceptionV2/samesolutiongrid.png}
    \caption{The diagram on the right represents the same solution but from the perspective of grid sizes rather than the operations.}
    \label{fig:inceptionv2samesolutiongrid}
\end{figure}

\paragraph{Inception-V2}
The model takes an input image of size $112 \times 112$ pixels with 3 color channels (RGB); an \texttt{nn.Upsample} layer resizes the input image to $112 \times 112$ pixels. A \texttt{Linear} layer predicts the output class (classification) with 128 classes. An additional fully connected layer creates a 128-dimensional embedding from the last layer, which can be used in applications such as unsupervised learning. Dropout is applied to reduce overfitting.

\begin{table}[h]
\centering
\begin{tabular}{|l|l|l|}
\hline
\textbf{Type}                   & \textbf{Patch Size/Stride} & \textbf{Input Size} \\
\hline
ConvBlock                        & 3x3/2                             & 112x112x3           \\
ConvBlock                        & 3x3/1                             & 56x56x32            \\
MaxPool2d                        & 3x3/2                             & 28x28x64            \\
ConvBlock                        & 3x3/1                             & 28x28x64            \\
ConvBlock                        & 3x3/2                             & 14x14x80            \\
ConvBlock                        & 3x3/1                             & 14x14x192           \\
ConvBlock                        & 3x3/1                             & 14x14x288           \\
InceptionF5          & as in figure 5                    & 14x14x288           \\
InceptionF6          & as in figure 6                    & 14x14x768           \\
InceptionF7          & as in figure 7                    & 7x7x1280            \\
InceptionF7         & as in figure 7                    & 7x7x2048            \\
AdaptiveAvgPool2d               & 1x1                               & 1x1x2048            \\
Linear (Logits)                  & -                                 & 128                 \\
\hline
\end{tabular}
\caption{InceptionV2 Model Architecture}
\label{tab:inceptionv2}
\end{table}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\textwidth]{Image/InceptionV2/figure5.png}
    \caption{Inception Block in figure 5, 6, 7.}
    \label{fig:inceptionv2archi}
\end{figure}

The input data for this model is first augmented using Random Rotation and Random Horizontal Flip to simulate different real-world variations of the same face, then resized to 112 x 112 before being fed to the model.

\subsubsection{Model with Online Triplet Loss}
For the model trained with the Online Triplet Loss function, we deploy the VGG16 architecture. The VGG16 model is known for its simplicity and effectiveness in various computer vision tasks. Originally introduced in the paper ``Very Deep Convolutional Networks for Large-Scale Image Recognition,'' the VGG16 architecture has proven to be a strong candidate for feature extraction due to its straightforward design principles. We keep the advantages of VGG16 and make some modifications to suit our problem.
\paragraph{General Design Principles} First, the model uses a small $3 \times 3$ receptive field with a 1-pixel stride; for comparison, AlexNet used an $11 \times 11$ receptive field with a 4-pixel stride. Stacked $3 \times 3$ filters combine to provide the function of a larger receptive field. The benefit of using multiple smaller layers rather than a single large layer is that more non-linear activation layers accompany the convolution layers, improving the decision functions and allowing the network to converge quickly. Second, VGG16 uses small convolutional filters, which decreases the probability of overfitting during training. A $3 \times 3$ filter is the smallest size that can capture left-right and up-down information.
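
The receptive-field argument can be made concrete with a short calculation. It is a sketch that assumes stride-1 convolutions with $C$ input and $C$ output channels and ignores biases; $C$ and $L$ are symbols introduced here for illustration.
\[
\text{receptive field of } L \text{ stacked } 3 \times 3 \text{ layers} = (2L + 1) \times (2L + 1)
\]
\[
\text{three } 3 \times 3 \text{ layers: } 3 \cdot (3^2 C^2) = 27 C^2 \text{ weights}
\]
\[
\text{one } 7 \times 7 \text{ layer: } 7^2 C^2 = 49 C^2 \text{ weights}
\]
Three stacked layers thus cover the same $7 \times 7$ region with about 45\% fewer weights and three non-linearities instead of one.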

\paragraph{VGG16 Architecture} The VGG16 architecture is known for its simplicity and uniform structure, consisting of 16 weight layers, including 13 convolutional layers and 3 fully connected layers. 

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\textwidth]{Image/VGG16/archi.png}
    \caption{Architecture of VGG16.}
    \label{fig:vgg16archi}
\end{figure}

Here is a quick outline of the network architecture: 
\begin{itemize}
    \item \textbf{Input}: The image is first converted to a tensor and resized to $256 \times 256$ before being fed to the model. This differs from the original VGG16, which was designed for the ImageNet competition and takes images with a resolution of $224 \times 224$ pixels. There are several reasons for the larger size: a higher resolution provides more information to the model to handle variations in the input effectively, and faces contain fine details that a higher resolution helps capture.

    \item \textbf{Convolutional layers}: The convolutional filters of the network use the smallest possible receptive field of $3 \times 3$. VGG also uses a $1 \times 1$ convolution filter as a linear transformation of the input.

    \item \textbf{ReLU activation}: Rectified Linear Unit (ReLU) activation functions are used after each convolutional layer. ReLU is a piecewise-linear function that returns its input for positive values and 0 for negative values. VGG uses a fixed convolution stride of 1 pixel to preserve the spatial resolution after convolution.

    \item \textbf{Hidden layers}: All of the network's hidden layers use ReLU. Unlike AlexNet, the network does not use Local Response Normalization, which increases training time and memory consumption with little improvement to overall accuracy.

    \item \textbf{Pooling layers}: A pooling layer follows several convolutional layers; this helps reduce the spatial dimensionality and the number of parameters of the feature maps created by each convolution step. Pooling is crucial given the rapid growth of the number of filters from 64 to 128, 256, and eventually 512 in the final layers.

    \item \textbf{Fully connected layers}: The network includes three fully connected layers. The first two layers each have 4096 channels, and the third layer has 256 channels, which is the dimension of the face embedding vector (in the original VGG16 network, the output layer has 1000 channels).
\end{itemize}
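
As a rough illustration of what the larger input implies, the calculation below assumes the standard VGG16 layout with five $2 \times 2$ max-pooling layers of stride 2 and no adaptive pooling before the fully connected layers; the modified network may differ in these details.
\[
\frac{256}{2^5} = 8 \quad \Rightarrow \quad 8 \times 8 \times 512 = 32{,}768 \text{ features}
\]
\[
\frac{224}{2^5} = 7 \quad \Rightarrow \quad 7 \times 7 \times 512 = 25{,}088 \text{ features}
\]
Under this assumption, the first fully connected layer receives about 31\% more inputs than in the original $224 \times 224$ configuration.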
%---------------------------------------------------

\section{Result}
\subsection{Evaluation Methods}
\subsubsection{Face Detection}
\subsubsection{Face Verification}
We evaluated our models on the standard Labeled Faces in the Wild (LFW) dataset. Concretely, we follow the dataset's protocol and use 10-fold cross-validation on the predefined splits. We also report the Validation Rate (True Positive Rate, TPR) at False Acceptance Rates (False Positive Rate, FAR) of 0.1, 0.01, and 0.001. These two quantities together specify how often the model is able to verify a user given a threshold that satisfies a predefined FAR.

Furthermore, we collected 60 images of 12 different people and report the TPR at the aforementioned FAR values.
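
For reference, these rates can be defined as follows. The notation is assumed here for illustration: $\mathcal{P}_{\text{same}}$ and $\mathcal{P}_{\text{diff}}$ are the sets of image pairs showing the same person and different persons, $D(x_i, x_j)$ is the distance between the embeddings of a pair, and $d$ is the decision threshold.
\[
\text{TPR}(d) = \frac{ |\{ (i,j) \in \mathcal{P}_{\text{same}} : D(x_i, x_j) \leq d \}| }{ |\mathcal{P}_{\text{same}}| }
\]
\[
\text{FAR}(d) = \frac{ |\{ (i,j) \in \mathcal{P}_{\text{diff}} : D(x_i, x_j) \leq d \}| }{ |\mathcal{P}_{\text{diff}}| }
\]
Reporting the TPR at FAR $= 0.01$, for example, means choosing the largest threshold $d$ for which at most 1\% of different-person pairs are accepted, and then measuring the fraction of same-person pairs accepted at that threshold.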
\subsection{Performance}
\subsubsection{Face Detection}
\subsubsection{Face Verification}
\paragraph{Evaluation on Labeled Faces in the Wild}
\onecolumn
\begin{tabular}{@{} *5l @{}}    \toprule
\emph{Model} & \emph{Accuracy (\%)} & \emph{TPR (\%) @ FAR=10\%} & \emph{TPR (\%) @ FAR=1\%} & \emph{TPR (\%) @ FAR=0.1\%} \\\midrule
InceptionNetV2 + Triplet & 86.87 & 80.33 & 32.40 & 5.20 \\
VGG16 + Triplet + Mining & 77.35 & 60.00 & 59.03 & 59.03 \\
Model $Y$ & Y1 & Y2 & Y3 & Y4\\\bottomrule
\end{tabular}


%------------------------------------------------
\clearpage

\phantomsection
\section{Conclusion}
In conclusion, this project has 
%------------------------------------------------

\section{Future Work}
%----------------------------------------------------------------------------------------
%	REFERENCE LIST
%----------------------------------------------------------------------------------------
\onecolumn

\phantomsection
\bibliographystyle{abbrv}
\bibliography{sample}

%----------------------------------------------------------------------------------------

\end{document}