### Artificial Neural Networks
Artificial Neural Networks (ANNs) were first inspired by the structure of the human brain. Within the human brain exist neurons, cells that carry information throughout the body. These neurons, however, are only activated when they receive a stimulus above a certain threshold. In simpler terms, a neuron *N*'s activity is affected by the neurons that send signals to it, and the neuron *N* in turn affects the activation of succeeding neurons.

ANNs mimic this behavior with their interconnected nodes. You've probably seen a representation like the one below: 

<p align="center">
    <img src="../res/artificial_neural_network.jpg" height="400">
</p>

Each of the circles, or nodes, above represents an "artificial neuron," mathematically designed to simulate the behavior of its biological equivalent. As inputs pass through the ANN above from left to right, each node performs a calculation to determine whether or not to fire to the next neurons (the ones it is connected to on the right). This process is repeated until the outputs emerge from the network. The result can then be used for a variety of tasks, such as classification.

Delving a little bit deeper, this is what a single artificial neuron may look like:

<p align="center">
    <img src="../res/artificial_neuron.jpg" height="400">
</p>

It comprises two main sub-modules: one for the weighted sum operation, and one for the activation. The former processes the outputs of the previous neurons, while the latter determines whether or not the result is large enough for the neuron to fire to its subsequent neurons. By connecting these artificial neurons, an ANN aims to mimic the behavior of the human brain.
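As a minimal sketch of the two sub-modules described above, the snippet below models one neuron in plain Python. The bias term, the threshold of 0, the sigmoid alternative, and all numeric values are illustrative assumptions rather than details taken from the figure.

```python
import math

def weighted_sum(inputs, weights, bias):
    """Sub-module 1: combine the outputs of the previous neurons into one number."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def step_activation(z, threshold=0.0):
    """Sub-module 2: fire (1.0) only when the weighted sum exceeds the threshold."""
    return 1.0 if z > threshold else 0.0

def sigmoid_activation(z):
    """A smooth alternative that returns a value between 0 and 1 instead of a hard 0/1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=step_activation):
    """A full neuron: weighted sum followed by an activation."""
    return activation(weighted_sum(inputs, weights, bias))

# Made-up values for illustration only.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
bias = -0.05

fired = neuron(inputs, weights, bias)
smooth = neuron(inputs, weights, bias, activation=sigmoid_activation)
print(fired, smooth)
```

In practice, smooth activations such as the sigmoid tend to be preferred over a hard threshold because they can be differentiated, which matters once the network has to learn.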

You will have noticed that the inputs of each neuron are numeric. To be more precise, the inputs are typically not single numbers, but rather vectors of numbers. This design is intentional; real-world data, such as text or images, is typically too complex to be represented by a single value. Vectors, on the other hand, are multi-dimensional and can represent such data far more effectively. Since a vector can be thought of as a list of numbers, it may help to imagine each of the entries as pertaining to a particular feature of the data. For example, each entry of the vector representation of an image may correspond to the Red, Green, or Blue value for a particular pixel.

<p align="center">
    <img src="../res/img_to_vec.jpg" height="400">
</p>
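The conversion from an image to a vector could be sketched as follows. The 2x2 image, the helper name `image_to_vector`, and the scaling of each channel to the range 0 to 1 are assumptions made for illustration; real pipelines may order or normalize the values differently.

```python
# A hypothetical 2x2 image: each pixel is an (R, G, B) triple in the range 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def image_to_vector(pixels):
    """Flatten rows of RGB pixels into one list, scaling each value to [0, 1]."""
    return [channel / 255 for row in pixels for pixel in row for channel in pixel]

vector = image_to_vector(image)
print(len(vector))  # 2 rows * 2 columns * 3 channels = 12 entries
print(vector)
```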

The sample procedure detailed above is known as embedding, a process in which you create vectors for a deep learning model. The resulting vectors are known as "embedding vectors" or simply "embeddings". Most models, whether in vision or natural language processing, require an embedding process. In an artificial neural network, the resulting embedding vectors are processed by its neurons across multiple layers to produce a final output (also in the form of a vector). This final output is then further processed; the exact procedure typically depends on the task that the model was created for.

For example, classification (choosing which of many classes an instance belongs to) and next token prediction (predicting the next "text" that should appear based on previous inputs) aim to produce data of different forms. The former requires outputs in the form of labels, where each label belongs to a particular class 

```
Labels and Corresponding Classes
0 : 'dog'
1 : 'cat'
2 : 'bird'
```

while the latter requires outputs in the form of text.

```
> I
> I have
> I have a
> I have a cat.
```
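For the classification case, one common way to turn a final output vector into a label is to convert the scores into probabilities and pick the largest one. The sketch below assumes a softmax followed by an argmax; the output values are made up, and the label mapping is the one shown above.

```python
import math

LABELS = {0: "dog", 1: "cat", 2: "bird"}

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(output_vector):
    """Return the label (index) with the highest probability and its class name."""
    probabilities = softmax(output_vector)
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label, LABELS[label]

output_vector = [1.2, 3.4, 0.3]  # made-up final output of a network
label, class_name = classify(output_vector)
print(label, class_name)
```

Next token prediction can be handled in a similar way, except that the indices would refer to entries in a vocabulary of tokens rather than to class names.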

To summarize, artificial neural networks are structures composed of interconnected neurons. After a conversion process that turns data (in the form of text, images, etc.) into embedding vectors, these neurons perform mathematical operations on their inputs and determine whether, and what, information should be passed to succeeding neurons. After many calculations performed across multiple layers of neurons, a final vector is returned, which is further processed in a way that suits the task that the artificial neural network was created for.
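To make the summary concrete, here is a small sketch of a vector passing through two layers of neurons. The layer sizes, weights, biases, and the choice of a sigmoid activation are all hypothetical; a trained network would have learned its weights rather than having them written by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_rows, biases):
    """One layer: every neuron takes a weighted sum of all inputs, then applies the activation."""
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)
        for weights, bias in zip(weight_rows, biases)
    ]

# Hypothetical weights: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.5, -2.0]]
output_biases = [0.05]

embedding = [1.0, 0.5, -1.0]  # stand-in for an embedding vector
hidden = layer(embedding, hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)
print(hidden, output)
```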

### Deep Learning
First, a few definitions to introduce the difference between Deep Learning and Machine Learning. This is how a [Coursera article](https://www.coursera.org/articles/ai-vs-deep-learning-vs-machine-learning-beginners-guide) explains deep learning:
> Deep learning is a subset of machine learning that uses artificial neural networks to mimic the learning process of the human brain.

And for additional context, machine learning may be defined as follows:
> Machine learning is a process in which a model gradually, but automatically, improves upon a particular task by processing data with its algorithm.

Imagine that you have some data ***X*** and you want to use that data to predict some ***Y***. Deep learning is the process in which artificial neural networks learn to do that. To enable this, a machine performs **backpropagation** with an **objective function**. An **objective function** is like a rubric; it details the criteria with which a model should evaluate its predictions and, more importantly, how far off its predictions are from the target value ***Y***. In linear regression, for example, one aims to minimize the Mean Squared Error (MSE). During this process, one observes that different values for the slope (***a***) and vertical intercept (***b***) cause fluctuations in the MSE. Analyzing *how* these fluctuations occur can provide some insight into the direction in which ***a*** and ***b*** should be modified.

<p align="center">
    <img src="../res/mean_squared_error.jpg" height="200">
</p>
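The way the MSE fluctuates with ***a*** and ***b*** can be seen in a short sketch. The data points below are made up (they roughly follow y = 2x + 1), and the function names are chosen for illustration.

```python
def predict(x, a, b):
    """Linear regression prediction: a is the slope, b the vertical intercept."""
    return a * x + b

def mean_squared_error(xs, targets, a, b):
    """Average of the squared differences between targets and predictions."""
    errors = [(t - predict(x, a, b)) ** 2 for x, t in zip(xs, targets)]
    return sum(errors) / len(errors)

# Made-up data that roughly follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.1, 4.9, 7.0]

# Different choices of a and b give different MSE values.
for a, b in [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]:
    print(f"a={a}, b={b}, MSE={mean_squared_error(xs, targets, a, b):.3f}")
```

The closer ***a*** and ***b*** get to the values that generated the data, the smaller the MSE becomes.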

**Backpropagation** refers to the process in which a model works out how much each of its weights (also called parameters) contributed to the error, so that the weights can be updated to better align its predictions with the labels. Returning to the example of MSE, the objective function can be re-expressed in terms of ***a***, ***x***, and ***b***. Note that ***y-hat*** refers to the ideal predictions (labels) while ***y*** refers to the actual predictions based on ***x***.

<p align="center">
    <img src="../res/mean_squared_error_re.jpg" height="200">
</p>

This means that given a specified value for ***x*** and its corresponding label ***y-hat***, the objective function is expressed in terms of ***a*** and ***b***, precisely the parameters that we want to update in linear regression. This new expression is helpful because with calculus (differential calculus in particular), we can measure how changes in the input variables lead to changes in the output (in this case, the MSE). This information can then be used to determine the *direction* in which weight updates should take place, in a process called ***gradient descent***.
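Written out for a single data point, and assuming the linear model ***y = ax + b*** with the squared error as the objective (the averaging over a dataset is left out here for brevity), the objective and its partial derivatives would look like this. The symbol ***E*** is introduced only for this sketch; as elsewhere in this section, ***y-hat*** is the label and ***y*** is the prediction.

$$
E(a, b) = (\hat{y} - y)^2 = \bigl(\hat{y} - (ax + b)\bigr)^2
$$

$$
\frac{\partial E}{\partial a} = -2x\,\bigl(\hat{y} - (ax + b)\bigr), \qquad
\frac{\partial E}{\partial b} = -2\,\bigl(\hat{y} - (ax + b)\bigr)
$$

The sign of each partial derivative indicates whether increasing that parameter would raise or lower the error, which is the directional information the paragraph above refers to.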

Gradient descent is much easier to understand with visualizations. The image below is a 3D visualization of how the MSE changes with different values of ***a*** and ***b*** given an ***x*** value of 0.5 and a ***y-hat*** value of 0.

<p align="center">
    <img src="../res/mse_visualization.png" height="500">
</p>

We want the MSE to be as low as possible (the lowest possible value in this case is 0), which means that given initial values of ***a*** and ***b***, we want to update ***a*** and ***b*** such that the MSE decreases. The aim is not to find the exact values of ***a*** and ***b*** where the MSE is 0; remember that the above graph is a visualization of one specific data point (where ***x*** is 0.5 and ***y-hat*** is 0). Since we don't know what the other data points are, the best we can do is update the parameters in a way that would *reduce* the MSE, in the hopes that repeating such updates across many data points will ultimately lower the overall cost for the dataset. First, let us specify an initial starting point for our ***a*** and ***b***:

<p align="center">
    <img src="../res/mse_visualization_init.png" height="500">
</p>

Given a set starting point for ***a*** and ***b***, we can calculate the first-order partial derivatives of the objective function. Since we take the partial derivatives with respect to multiple variables (***a***, ***b***), we represent them in the form of vectors, where the value in each row is the partial derivative with respect to a particular variable. Assuming the calculations have been completed, we might get values like the following:

<p align="center">
    <img src="../res/" height="500">
</p>
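As a hedged illustration of such a gradient vector, the sketch below computes the partial derivatives for the data point used in the visualization (***x*** = 0.5, ***y-hat*** = 0) and takes a single step in the opposite direction. The starting point (1.0, 1.0) and the step size of 0.1 are hypothetical values, not the ones shown in the figures.

```python
def squared_error(a, b, x, target):
    """Squared error for one data point, with prediction a * x + b."""
    return (target - (a * x + b)) ** 2

def gradient(a, b, x, target):
    """Vector of partial derivatives [dE/da, dE/db] for one data point."""
    residual = target - (a * x + b)
    return [-2 * x * residual, -2 * residual]

x, target = 0.5, 0.0   # the data point used in the visualization
a, b = 1.0, 1.0        # hypothetical starting point
learning_rate = 0.1    # hypothetical step size

grad_a, grad_b = gradient(a, b, x, target)
new_a = a - learning_rate * grad_a  # step against the gradient
new_b = b - learning_rate * grad_b

before = squared_error(a, b, x, target)
after = squared_error(new_a, new_b, x, target)
print([grad_a, grad_b], before, after)
```

Moving against the gradient lowers the error for this data point; the step size controls how far each update moves.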

Recall that a neural network's neurons are composed of mathematical operations; this means that chaining together layers of neurons is like chaining together mathematical functions. In a 3-layer neural network, for example, layer 1 might be represented by ***f(x)***, layer 2 by ***g(x)***, and layer 3 by ***h(x)***. The output of the entire network could then be represented by the single chained function ***h(g(f(x)))***. This output is then evaluated using an objective function like MSE, which means that the objective function can be expressed in terms of ***h(x)***, ***g(x)***, or even ***f(x)***:

<p align="center">
    <img src="../res/mean_squared_error_composite.jpg" height="200">
</p>


With this newly expressed function, it becomes possible to measure how ***h(x)***, ***g(x)***, and ***f(x)*** each affect the loss. You've probably already learned how exactly this takes place: think back to differentiation and, more specifically, the chain rule, which describes how a composite function changes with respect to its input. Our goal is (typically) to either minimize or maximize the objective function (in the case of MSE, minimize). In other words, we are looking for extrema, where the first-order derivative is 0. This is achieved through ***gradient descent***, an algorithm that uses the objective function's first-order partial derivatives (its gradient) to determine the direction in which parameter updates should take place.
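The chain rule can be traced by hand for a toy version of ***h(g(f(x)))***. In the sketch below each layer is simplified to a single multiplication by one weight (`w1`, `w2`, `w3`), which is an assumption made to keep the arithmetic readable; real layers would also include activations and many weights.

```python
def f(x, w1):
    return w1 * x  # layer 1

def g(u, w2):
    return w2 * u  # layer 2

def h(v, w3):
    return w3 * v  # layer 3

def loss_and_gradients(x, target, w1, w2, w3):
    # Forward pass: compute h(g(f(x))) and the squared error.
    u = f(x, w1)
    v = g(u, w2)
    out = h(v, w3)
    loss = (target - out) ** 2

    # Backward pass: apply the chain rule from the loss back to each weight.
    d_out = -2 * (target - out)  # dLoss/d_out
    d_w3 = d_out * v             # because out = w3 * v
    d_v = d_out * w3
    d_w2 = d_v * u               # because v = w2 * u
    d_u = d_v * w2
    d_w1 = d_u * x               # because u = w1 * x
    return loss, (d_w1, d_w2, d_w3)

loss, grads = loss_and_gradients(x=2.0, target=1.0, w1=0.5, w2=-1.5, w3=2.0)
print(loss, grads)
```

Each gradient reuses the values computed for the layer after it, which is why the derivatives are worked out from the last layer back to the first.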

This is what a general deep learning process might look like:
1. Get the input values ***X***
2. Feed ***X*** through the artificial neural network to get predictions
3. Compare predictions with ideal values using the **objective function**
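4. Use **backpropagation** and **gradient descent** to update the network's weights so that the objective function's value decreases, then repeat with more data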
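The steps above, followed by a parameter update, can be sketched as a small training loop. The linear regression model from earlier stands in for a neural network here, and the data, learning rate, and number of passes over the data (`epochs`) are assumptions chosen for illustration.

```python
def train(xs, targets, a=0.0, b=0.0, learning_rate=0.05, epochs=200):
    """Repeat the predict / compare / update cycle over the dataset."""
    for _ in range(epochs):
        for x, target in zip(xs, targets):            # get an input value
            prediction = a * x + b                    # feed it through the model
            residual = target - prediction            # compare with the ideal value
            a -= learning_rate * (-2 * x * residual)  # update each parameter against
            b -= learning_rate * (-2 * residual)      # its partial derivative
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

a, b = train(xs, targets)
print(a, b)
```

With this data the learned values should approach the slope and intercept that generated the targets.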

### A Brief History on Language Models
While the architecture known as the "transformer" is ubiquitous today, this was not always the case. 

### Parameter-Efficient Fine-Tuning (PEFT) with Quantized Low Rank Adaptation (QLoRA)
Model fine-tuning is a process that occurs after model training. In other words, the model that we are working with will have weights/parameters that have been learned during the initial training (also called pre-training) process.

In very basic terms, fine-tuning can be thought of as a process that biases the model in a particular direction, generally defined by the dataset that the model is fine-tuned on. If, for example, a model is fine-tuned on movie review classification, we hope that the resulting fine-tuned model will be better at that particular task.

In more technical terms, we are iteratively modifying the model's weights such that the model learns to better process (predict) the data that we feed it.