
Frequently Asked Questions (FAQs) on machine learning and artificial intelligence

Photo by Alina Grubnyak on Unsplash

 

This is a collection of FAQs about ML, AI, neural networks, and LLMs, together with their answers. Many of these questions have been asked to me by students or other ML practitioners, so I decided to collect and discuss them here.

The list is under construction and I am expanding it. Write to me or open an issue with any suggestions or additional questions.

Index

 

FAQ on machine learning

Statistics and data science

Which is better: Pandas or Polars?

Python is a great ecosystem for data science; on the other hand, some basic analyses become complex as datasets grow (requiring parallel processing, query optimization, and lazy evaluation), especially if you use Pandas. Pandas is in fact single-threaded (it runs on only one CPU core), requires the entire dataset to fit in memory, and does not optimize the order of operations (operations are executed sequentially in the order of declaration, which is not always the best choice).

Polars, on the other hand, offers:

  • Parallel computing, using all available cores.
  • Efficient memory handling, since it leverages Apache Arrow.
  • Lazy file scanning, which allows you to work with very large files without keeping the whole file in memory.

Polars is also similar to Pandas in syntax, so switching between the libraries is fairly easy. Since the ecosystem compatible with Pandas is larger, it is recommended to use Pandas for small/medium datasets and Polars for huge datasets.
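
For example, here is a minimal sketch of how the Polars lazy API defers and optimizes work (the file name and column names are hypothetical):

import polars as pl

# Lazy scan: the file is not loaded into memory; Polars builds a query plan
# and optimizes it (e.g., predicate pushdown) before executing anything.
lazy_df = (
    pl.scan_csv("big_dataset.csv")
      .filter(pl.col("price") > 100)
      .group_by("category")
      .agg(pl.col("price").mean().alias("mean_price"))
)
result = lazy_df.collect()  # only here is the optimized plan executed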

Which correlation should I use?

There are different types of correlation, the most famous of which is the Pearson correlation. The Pearson correlation coefficient measures the linear relationship between two variables and has this formula:

$$r_{XY} = \frac{\sum (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum (X_i - \overline{X})^2 \sum (Y_i - \overline{Y})^2}}$$

Where X and Y are the two variables, and $\overline{X}$ and $\overline{Y}$ represent the means.

Spearman correlation is another popular alternative:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ represents the difference between the ranks of corresponding values $X_i$ and $Y_i$, and $n$ is the number of observations.

The main differences between the two correlations are:

  • Pearson measures linear relationships, while Spearman measures the correlation between variables that have a monotonic relationship. Pearson assumes that the variables are normally distributed.
  • Pearson is based on covariance, while Spearman is based on ranked data. However, both range between -1 and 1.
  • Pearson is more sensitive to outliers. Pearson is recommended for interval and ratio data, while Spearman is for ordinal and non-normally distributed data.

In short, Pearson is recommended for linear relationships, while Spearman is recommended for monotonic associations (there is also Kendall correlation, but it is very similar to Spearman in its assumptions). Linear relations are a special case of monotonic functions: a monotonic relation is one where there is no change in direction, i.e., the relationship is always increasing or always decreasing (not necessarily linearly).

correlation relationship from here

This means, however, that there are cases where there is an association between two variables (neither linear nor monotonic) that none of these three types of correlation can detect.

In a 2020 study, a new coefficient is proposed, which measures how much Y is a function of X (rather than whether there is a monotonic or linear relationship between the two). This new correlation is also based on ranking and has two possible formulas, one for the case where there are no ties between the values (or ties are improbable) and one for the case where there are ties. The first (no ties):

$$\xi_n(X, Y) = 1 - \frac{3 \sum_{i=1}^{n-1} \lvert r_{i+1} - r_i \rvert}{n^2 - 1}$$

and if there are ties:

$$\xi_n(X, Y) = 1 - \frac{n \sum_{i=1}^{n-1} \lvert r_{i+1} - r_i \rvert}{2 \sum_{i=1}^{n} l_i (n - l_i)}$$

where $r_i$ is the rank of the $i$-th value of $Y$ once the data are sorted by $X$, and $l_i$ is the number of $j$ such that $Y_{(j)} \geq Y_{(i)}$; without ties, this reduces to the first formula.

As can be seen, the new correlation method is not affected by the direction of the relationship (the range is between 0 and 1, with 1 indicating the strongest relationship). Where Pearson concludes that there is no relationship (for example, in the parabolic or sinusoidal case), this new method instead succeeds in showing one.

correlation relationship Values of $\xi_n(X, Y)$ for various kinds of scatterplots, with $n = 100$. Noise increases from left to right. from here

If you want to try it in Python, here is the code:

from numpy import array, random, arange

def xicor(X, Y, ties=True):
    # Chatterjee's rank-based coefficient xi_n(X, Y)
    random.seed(42)  # ties are broken at random; fix the seed for reproducibility
    n = len(X)
    # indices that sort the data by X
    order = array([i[0] for i in sorted(enumerate(X), key=lambda x: x[1])])
    if ties:
        # l[i]: how many Y values are <= the i-th Y, with the data sorted by X
        l = array([sum(y >= Y[order]) for y in Y[order]])
        r = l.copy()
        for j in range(n):
            # break tied ranks by assigning a random permutation of the tied positions
            if sum([r[j] == r[i] for i in range(n)]) > 1:
                tie_index = array([r[j] == r[i] for i in range(n)])
                r[tie_index] = random.choice(r[tie_index] - arange(0, sum([r[j] == r[i] for i in range(n)])), sum(tie_index), replace=False)
        return 1 - n*sum( abs(r[1:] - r[:n-1]) ) / (2*sum(l*(n - l)))
    else:
        # without ties, r[i] is simply the rank of the i-th Y (data sorted by X)
        r = array([sum(y >= Y[order]) for y in Y[order]])
        return 1 - 3 * sum( abs(r[1:] - r[:n-1]) ) / (n**2 - 1)
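
As a quick sanity check (a made-up example), we can compare it with Pearson on a noiseless parabola, where Pearson sees no relationship but the new coefficient does:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 100)  # continuous X, so ties are (almost surely) absent
Y = X ** 2                   # noiseless parabola: Y is a function of X

print(np.corrcoef(X, Y)[0, 1])  # ~0: Pearson misses the relationship
print(xicor(X, Y, ties=False))  # ~0.9: xi detects that Y depends on X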

Others have noticed that this correlation is not exempt from issues either; for this reason, they suggest a mutual information-based coefficient R. Comparing the coefficients:

  • Non-linearity. xicor does not capture all types of non-linearity (like donut-shaped relationships).
  • Symmetry. A correlation should be symmetric (ρ(x,y) = ρ(y,x)); this is true for Pearson and R but not for xicor.
  • Consistency. xicor is consistent in all cases.
  • Scalability. R scales better as the number of data points increases.
  • Precision. R is more precise (precision being defined as stdev(A)/mean(A), meaning the variance across runs should be small).

In any case, you can test these different correlations yourself. While they are not present in a standard Python package, I have collected them in a Python script you can easily import (check here).

Suggested reading:

Machine learning in general

What is machine learning?

Machine Learning is the field of study that deals with learning or predicting something from data. In the traditional paradigm, you had to write hard-coded rules. A machine learning algorithm should infer these rules on its own.

"Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions" - from Wikipedia

As an example, suppose we want to create a model for sentiment analysis of movie reviews. A traditional approach would be to write rules (something like if/then; for example, if the word "amazing" is present, the review is positive). A machine learning approach instead takes a dataset with labeled elements (a set of reviews that have already been annotated as positive or negative) and derives the rules on its own.

In some cases, these rules can be displayed. In this case, the model has learned boundaries to separate the various classes in the iris dataset:

neuron from here

What is the central limit theorem?

"In probability theory, the central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. " - source

In a nutshell, the central limit theorem (CLT) says that the distribution of sample means approximates a normal distribution as you increase the sample size. It is one of the most fundamental statistical theorems and an important assumption for many algorithms. A key aspect is that the average of the sample means will be equal to the true population mean (and the spread of the sample means shrinks with the sample size). In other words, with a sufficiently large sample size, we can predict the characteristics of a population.
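
A quick simulation makes this concrete. Here is a minimal sketch using a clearly non-normal (exponential) population:

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=1_000_000)  # skewed, not normal

for n in (2, 30, 500):
    # draw 10,000 samples of size n and look at the distribution of their means
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of sample means={means.mean():.3f}  std={means.std():.3f}")

# The mean of the sample means stays near 2.0 (the population mean), the spread
# shrinks as sigma/sqrt(n), and a histogram of `means` looks increasingly normal.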

What is overfitting? How to prevent it?

Overfitting is one of the most important concepts in machine learning; it usually occurs when the model is too complex for a dataset. The model then tends to learn patterns that are only present in the training set and thus fails to generalize effectively, so it will not perform adequately on unseen data.

Underfitting is the opposite concept, where the model is too simple and fails to generalize because it has not identified the right patterns. With both overfitting and underfitting, the model underperforms.

neuron from Wikipedia

Solutions are usually:

  • Collect more data when possible.
  • Eliminate features that are not needed.
  • Start with a simple model such as logistic regression as a baseline and compare it with progressively more complex models.
  • Add regularization techniques, or use ensembles.

When to use K-fold cross-validation or group K-fold?

K-fold cross-validation is one of the most widely used evaluation methods for a machine learning model; it is used to estimate how a model behaves on unseen data. K-fold cross-validation is simple: we have a dataset X and a target variable y. The dataset is divided into K folds (each a subset of X and y) and at each iteration we train the model on K-1 folds and calculate the error on the remaining fold. If we have 100 examples and K = 5, it means that at each iteration we select 20 examples as the held-out fold, train the model on the other 80 examples, and calculate the performance on those 20 examples.

neuron from Wikipedia

The main problem with K-fold cross-validation is that we assume that all the different folds have the same distribution. This is not true in a number of cases where the dataset is stratified by an additional temporal, group, or spatial dimension. This causes a so-called information leak, which is easily understood when we look at temporally stratified data: if we use random shuffling, the model will see into the future, and we get what is called data leakage.

In these settings, naive cross-validation leads to predictions that are overly optimistic (overly confident) and favors models that are prone to overfitting. So for real-world cases, we need an alternative that avoids leakage between folds. This can be achieved with group folds:

"GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training sets. For example if the data is obtained from different subjects with several samples per-subject and if the model is flexible enough to learn from highly person specific features it could fail to generalize to new subjects. GroupKFold makes it possible to detect this kind of overfitting situations." -source

neuron from scikit-learn
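
A minimal scikit-learn sketch (with made-up subjects as groups) shows that each group ends up entirely in either training or testing:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])  # e.g., subject IDs

gkf = GroupKFold(n_splits=2)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # no subject appears on both sides of the split
    print("train groups:", set(groups[train_idx]), "| test groups:", set(groups[test_idx]))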

Should I use class imbalance corrections?

In general, there are plenty of methods for correcting class-imbalanced data, though it is not always a good idea to do so. Imbalanced data occurs when, in a classification dataset, there is an overabundance of one of the class labels. In this case, the most abundant class is called the majority class and the other the minority class (this is the binary classification case, but class imbalance can also occur in multiclass classification).

neuron from here

Generally, the most commonly used strategies are:

  • Downsampling the majority class. In this case, the goal is to reduce the number of examples in the majority class to obtain a balanced dataset (but we risk losing information).
  • Upsampling the minority class. Conversely, we increase the number of examples in the minority class, possibly exploiting machine learning or artificial intelligence approaches (this can introduce bias, though).

For example, when we are interested in a well-calibrated model, oversampling may do more harm than good. A model is calibrated when we can interpret its output in terms of probability. We might assume that all models are calibrated, but in general models are overconfident (for example, in binary classification an overconfident model predicts values close to 0 and 1 in many cases where it should not).

"Model calibration captures the accuracy of risk estimates, relating to the agreement between the estimated (predicted) and observed number of events. In clinical applications where a patient’s predicted risk is the entity used to inform clinical decisions, it is essential to assess model calibration. If a model is poorly calibrated, it may produce risk estimates that do not approximate a patient’s true risk well [3]. A poorly calibrated model may produce predicted risks that consistently over- or under-estimate true risk or that are too extreme (too close to 0 or 1) or too modest (too close to event prevalence)" -- from here

As a simpler example, take a model that predicts the probability of a fire in a building: if the model is calibrated, probabilities of 0.8 and 0.2 mean that in the first case a fire is four times more likely. With an uncalibrated model, these probabilities do not carry the same meaning.

"Overall, as imbalance between the classes was magnified, model calibration deteriorated for all prediction models. All imbalance corrections affected model calibration in a very similar fashion. Correcting for imbalance using pre-processing methods (RUS, ROS, SMOTE, SENN) and/or by using an imbalance correcting algorithm (RB, EE) resulted in prediction models which consistently over-estimated risk. On average, no model trained with imbalance corrected data outperformed the control models in which no imbalance correction was made, with respect to model calibration." -- from here

In other words, when we are interested in a calibrated model, dataset-balancing techniques do more harm than good.

This is in line with early reports showing that SMOTE works only with weak learners and is destructive to model calibration.

In fact, you rarely see it used on Kaggle (or at least it does not seem to be a winning strategy). According to this, it stems from the fact that methods like SMOTE implicitly assume that the class distribution is sufficiently homogeneous around the minority class instances, which is not necessarily the case (especially since not all variables are so homogeneous).

So we should never use it?

According to this article as a general rule:

  • Balancing could improve prediction performance for weak classifiers but not for SOTA classifiers. Strong classifiers (without balancing) yield better prediction quality than weak classifiers with balancing.
  • For label metrics (e.g., F1), optimizing the decision threshold is recommended due to its simplicity and lower compute cost; nowhere is it written that a threshold of 0.5 must necessarily be used. A minimal sketch is shown below.
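
For example, this is a minimal sketch of tuning the threshold for F1 (the function name and the grid of candidate thresholds are illustrative):

import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, y_proba):
    # scan candidate thresholds and keep the one that maximizes F1;
    # y_proba are the predicted probabilities for the positive class
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, (y_proba >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]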

Suggested reading:

What is gradient descent? What are the alternatives? !

Clustering

Should I use the elbow method?

The elbow method has long been the most widely used method to choose the number of clusters for k-means.

Starting with a dataset, we create a loop in which we test an increasing number of clusters. The idea is simple: at each iteration of k-means we calculate the inertia (the sum of squared distances between each point and the center of the cluster it is assigned to). The inertia goes down with the number of clusters because the clusters get smaller and smaller, and thus the points get closer and closer to their cluster center. Plotting the inertia on the y-axis against the number of clusters, there is a sweet spot that represents the point of maximum curvature (beyond this point, increasing the number of clusters is no longer convenient).
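
In code, the loop looks roughly like this (a minimal sketch with scikit-learn on synthetic blobs):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to cluster centers

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()  # look for the elbow (here around k=4)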

There are many other methods for analyzing the appropriate number of clusters for an algorithm.

Tree-based models

What are bagging and boosting?

Bagging and boosting are two different ensemble techniques used to improve the performance of an ensemble of decision trees by reducing its error. The basic idea is that each individual decision tree is trained on a different dataset. The main difference is that bagging trains the different models on different subsets of data in parallel, while boosting conducts the training sequentially, focusing on the errors committed by the previous model.

An ensemble is a machine learning technique in which we combine different models to improve performance: by combining different weak learners we get a model with better performance. The disadvantage is that these models, if not properly regularized, can overfit.

More in detail, Bagging combines multiple models that are trained on different datasets with the aim of reducing the variance of the system (by averaging the error of the different models that make up the ensemble). Given a dataset, different datasets are created for each ensemble decision tree (this is usually conducted by bootstrapping). For predictions, each model produces a prediction and the majority prediction is usually chosen. An example of this approach is Random Forest.

We then initially randomly conduct bootstrap sampling of the initial dataset (sampling with replacement) and train a single model on this subset. For each weak learner, this process is repeated. For classification, we combine predictions with majority voting (while for regression we average the predictions).

bagging from here

In boosting, each model depends on the models that have been previously trained. This allows for a system that is better adapted to the dataset. Sampling of the data is conducted and then a weak learner (a tree) is trained on that data. Initially, for each sample in the dataset, we have the same weight. The error for each sample is then calculated, the greater the error the greater the weight that will be assigned to that sample. The data are passed to the next model. Each model also has an associated weight based on the goodness of its predictions. The model weight is used to conduct a weighted average over the final predictions.

Boosting attempts to sequentially reduce the error of the models in the ensemble by trying to correct the misclassifications of the previous model (thus reducing both the bias and the variance of the system). Examples of boosting algorithms are AdaBoost, XGBoost, and Gradient Boosting Machines.
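
A minimal scikit-learn sketch contrasting the two approaches on synthetic data (hyperparameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# bagging: independent trees on bootstrap samples, combined by majority vote
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
# boosting: models trained sequentially, reweighting previously misclassified examples
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())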

boosting from here

Suggested reading:

Why there are different impurity metrics? !
Should I use XGBoost? Or Catboost, LightGBM, random forest? !
Are tree-based models better than neural networks for tabular data? !

 

FAQ on artificial intelligence

What are the differences between machine learning and artificial intelligence?

In essence, artificial intelligence is a set of algorithms and techniques that exploit neural networks to solve various tasks. These neural networks are composed of several layers that can extract complex features from a dataset. This makes it possible to avoid manual feature engineering and at the same time learn a complex representation of the data. Neural networks can also learn nonlinear relationships, and thus complex patterns that other machine learning algorithms can hardly learn. This clearly means having models with many more parameters and thus the need for appropriate learning algorithms (such as backpropagation).

What is supervised learning? self-supervised learning?

Supervised learning uses labeled training data and self-supervised learning does not. In other words, when we have labels we can use supervised learning; sometimes we don't have them, and for that we need other algorithms.

In supervised learning we usually have a dataset that has labels (for example, pictures of dogs and cats), we divide our dataset into training and testing and train the model to solve the task. Because we have the labels we can check the model's responses and its performance. In this case, the model is learning a function that binds input and output data and tries to find the relationships between the various features of the dataset and the target variable. Supervised learning is used for classification, regression, sentiment analysis, spam detection, and so on.

In unsupervised learning (or self-supervised), on the other hand, the purpose of the model is to learn the structure of the data without specific guidance. For example, if we want to divide our consumers into clusters, the model has to find patterns underlying the data without knowing what the actual labels are (we don't have them after all). These patterns are used for anomaly detection, big data visualization, customer segmentation, and so on.

neuron from here

Semi-supervised learning is an intermediate case, where we have few labels and a large dataset. The goal is to use the few labeled examples to label a larger amount of unlabeled data

neuron from here

Self-supervised learning is now strictly related to transfer learning (check below). In fact, many models are trained without labels on a huge amount of data. For example, Transformers are pretrained on a huge amount of textual data. During this pretraining phase, language modeling or another pretext task is used to make the model learn knowledge of the language. Labeling this data would be too expensive, so the purpose of the model is to learn the structure of the language during pretraining. Only at a later stage is the model adapted for a specific task. So in this case our purpose is to take advantage of the amount of data and of a task that allows us to train the model without having to annotate the data or specify the task.

What is transfer learning?

Transfer learning is a process in which we exploit a model's abilities on a task different from the one it was originally trained on.

"Transfer learning and domain adaptation refer to the situation where what has been learned in one setting … is exploited to improve generalization in another setting" -source

"Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned." -source

Transfer learning requires the model to learn features that are general. So they are usually models that are trained on a huge amount of data and can learn very different patterns from each other.

For example, a large convolutional network such as ResNet is trained on a large image dataset such as ImageNet, then the model is retrained to classify images of dogs or cats. In this case, the model head that is specific to the original task (ImageNet classification) is removed and replaced with a final layer for the specific task, as sketched in the code below.

neuron from here
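
In PyTorch/torchvision, the dog/cat example above looks roughly like this (a minimal sketch, not a full training script):

import torch.nn as nn
import torchvision

# load a ResNet pretrained on ImageNet and freeze the backbone
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
for param in model.parameters():
    param.requires_grad = False

# replace the ImageNet-specific head with a new 2-class layer (dogs vs. cats);
# only this layer is then trained on the new dataset
model.fc = nn.Linear(model.fc.in_features, 2)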

Another widely used case is the transformer. The transformer is a very large model that can learn a large number of patterns. In this case, the model is not trained for a specific task but with self-supervised learning. This first phase is called the pre-training phase. During this initial phase, the model learns language features by predicting the next word in a word sequence (or as a masked language model, where some words are masked and the model has to predict them). Once this is done, the model can be repurposed for other tasks such as sentiment analysis (a classification task).

This approach can be used with so many types of data, for example for images, you can mask part of the images and the model has to predict what is in the masked patch.

What is knowledge distillation?

LLMs are getting bigger and bigger, reaching even more than 100B parameters. These models excel at different tasks and achieve state-of-the-art results on many benchmarks. But do we need an LLM for every task?

Sometimes we need a model that is capable of accomplishing a task with as much accuracy as possible but is also computationally efficient (e.g., a classification task that must run on a device). For this, a smaller model might be fine; the important thing is that it is capable of doing the task as well as possible. The idea behind knowledge distillation is that we can distill a complex model into a smaller, more efficient model that needs limited resources. The intuition is that the larger model is a "teacher" and the smaller model is a "student." In practice, we use the soft probabilities (or logits) of the teacher network to supervise the student network along with the class labels, because these probabilities provide more information than a label alone and allow the student to learn better.

neuron from here

To give a more concrete example, we have a model like ResNet-50 that is trained on millions of images of a thousand different classes, but we need a model that can recognize dogs and cats (two classes) and has few layers. The idea is to create a model (student) that is able to mimic the generalization ability of the more complex model (i.e. ResNet) and we use the probabilities generated by the teacher network to train it. The advantage is that we generally need much less data than training the student model from scratch and without a teacher

neuron from here

So we have a teacher model (ResNet in our example) that generates probabilities for each class. After that, we take the student model and train it on the same data, again obtaining a probability distribution; exploiting a distillation loss, we try to make this probability distribution similar to that of the teacher. In addition, we have a cross-entropy loss in which we use the actual labels of the data. The student model is trained by exploiting these two losses, so it learns from both the teacher model and the real labels.
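
The classic combined loss can be sketched in PyTorch as follows (the temperature T and the mixing weight alpha are hyperparameters; the function name is illustrative):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL term between softened teacher and student distributions
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    # usual cross-entropy on the true labels
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce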

An interesting 2024 study states that it is not always necessary to use a larger model as a teacher. A larger model is more expensive; certainly, it is more accurate, but for the same compute budget a smaller model guarantees more coverage (more examples) and more diversity. This is especially interesting when it comes to reasoning. The authors have shown that this approach works best when resources are limited.

knowledge distillation when the computing budget is limited from here

In this study, they propose a new approach called Learning with Less Computational Resources and less data for Knowledge Distillation (LLKD). In this approach, they prioritize examples where the teacher model exhibits high confidence in its labeling (so the labels should be correct) and the student model exhibits high uncertainty (so examples that the student finds difficult). In simple words: use examples that are difficult for the student but about which the teacher is confident.

  1. The teacher generates labels for texts along with a confidence score (how sure it is that the label is the right one for a given text).
  2. The student is trained on these texts along with the pseudo-labels (the labels generated by the teacher). For each example, it also produces an uncertainty estimate.
  3. Data are chosen so that the examples are of good quality (the teacher is sure of the labels) and informative to the student (the student is uncertain).

knowledge distillation new approach to leverage confidence of the LLM from here

Suggested reading:

Neural networks

What is an artificial neuron?

A neural network is a collection of artificial neurons. The artificial neuron is clearly inspired by its biological counterpart. The first part of the biological neuron receives information from other neurons. If this signal is relevant, the neuron gets excited and its electrical potential rises; if it exceeds a certain threshold, the neuron activates and passes the signal to other neurons. The signal travels through the axon, whose terminals are connected to other neurons.

neuron from Wikipedia

This is the corresponding equation for the artificial neuron:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

As we can see, this equation mimics the behavior of the biological neuron. The inputs $x_i$ are the signals coming from other neurons; the neuron weighs their importance by multiplying them with a set of weights. This weighted sum is called the transfer function. If the information is relevant it must pass a threshold, given here by the activation function. If it is above the threshold, the neuron is activated and passes the information on (in the biological neuron, this is called firing). This output becomes the input for the next neuron.
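
As a minimal numpy sketch (with ReLU standing in as the threshold-like activation):

import numpy as np

def artificial_neuron(x, w, b):
    z = np.dot(w, x) + b       # transfer function: weighted sum plus bias
    return np.maximum(0.0, z)  # activation function (here ReLU)

# example: three inputs coming from upstream neurons
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, 0.1])
print(artificial_neuron(x, w, b=-0.1))  # non-zero only if z passes the threshold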

What are neural networks?

In a nutshell, Neural networks are a series of layers composed of different artificial neurons. We have an input layer, several hidden layers, and an output layer. The first layer takes inputs and is therefore specific to inputs. The hidden layers learn a representation of the data, while the last layer is specific to the task (classification, regression, auto-encoder, and so on).

neuron from here

We generally distinguish:

  • Shallow neural networks, where you have one or a few hidden layers.
  • Deep neural networks, where you have many hidden layers (more below).

Neural networks, despite having many parameters, have the advantage of extracting sophisticated and complicated representations from data. They also learn high-level features that can be reused for other tasks. An additional advantage is that neural networks generally do not require complex pre-processing like traditional machine learning algorithms. Neural networks were designed so that models would learn features on their own, even when the data is noisy (a kind of automatic "feature engineering").

What is a Convolutional Neural Network (CNN)?

Convolutional neural networks are a subset of neural networks that specialize in image analysis. These neural networks are inspired by the human cortex, especially the visual cortex. The two main concepts are the creation of a hierarchical representation (increasingly complex features as you go deeper into the network) and the fact that successive layers have an increasingly larger receptive field (layers further along in the network see a larger part of the image).

neuron

This causes convolutional neural networks to create an increasingly complex representation, where the early layers recognize simple features (edges, textures) and the deeper layers recognize more complex features (objects or faces).

neuron

How does it actually work?

Pixels that are close together usually represent the same object or pattern, so they should be processed together by the neural network. These networks consist of three main layer types: a convolutional layer, to extract features; a pooling layer, to reduce the spatial dimension of the representation; and a fully connected layer, usually among the last layers, to map the representation between input and output (for example, if we want to classify various objects).

A convolutional layer basically computes the dot product between a filter and a patch of the input matrix. In other words, we have a filter sliding over an image to learn and map features. This makes the convolutional network particularly efficient because it leads to sparse interactions and fewer parameters to store.
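
A naive implementation makes the sliding dot product explicit (a minimal sketch; like most deep learning libraries, it actually computes cross-correlation):

import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # dot product between the filter and the current image patch
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(8, 8)
kernel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])  # vertical edge detector
print(conv2d(image, kernel).shape)  # (6, 6): no padding, so the output shrinks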

neuron

A convolutional network is the repetition of these elements, in which convolution and pooling layers are interspersed, and at the end we have a series of fully connected layers.

neuron

What is a Recurrent Neural Network (RNN)?

An RNN is a subtype of neural network that specializes in sequential data (a sequence of data $x_t$ with $t$ ranging from 1 to time $T$). They are called recurrent because they perform the same task for each element of the sequence, and the output for an element depends on previous computations. In simpler terms, at each input of the sequence they perform a simple computation and update the hidden state (or memory); this memory is then used for subsequent computations. This computation can therefore be pictured as a loop, unrolled over time, because at each step the output of the previous step matters:

neuron

More formally, the hidden state $h_t$ of an RNN at time step $t$ is updated by:

$$h_t = f(W_h h_{t-1} + W_x x_t + b)$$

And the output at time step $t$ is given by:

$$y_t = g(W_y h_t + b_y)$$

where:

  • $h_t$ is the hidden state at time step $t$,
  • $h_{t-1}$ is the hidden state at the previous time step $t-1$,
  • $x_t$ is the input at time step $t$,
  • $W_h$, $W_x$, and $W_y$ are the weight matrices for the hidden state, input, and output, respectively,
  • $b$ and $b_y$ are the bias terms,
  • $f$ is the activation function for the hidden state, often tanh or ReLU,
  • $g$ is the activation function for the output, which depends on the specific task (e.g., softmax for classification).

which can be also represented as:

neuron

RNNs can theoretically process inputs of indefinite length without the model increasing in size (the same neurons are reused). The model also takes into account historical information, and weights are shared across time steps. In practice, they are computationally slow, inefficient to train, and after a few time steps they forget past inputs.
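
The update equations above translate almost line by line into numpy (a minimal sketch with toy dimensions):

import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, W_y, b, b_y):
    # the new hidden state mixes the previous state and the current input
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
    y_t = W_y @ h_t + b_y  # apply softmax etc. depending on the task
    return h_t, y_t

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 5, 2
W_h = rng.normal(size=(d_h, d_h)); W_x = rng.normal(size=(d_h, d_in))
W_y = rng.normal(size=(d_out, d_h)); b = np.zeros(d_h); b_y = np.zeros(d_out)

h = np.zeros(d_h)
for x_t in rng.normal(size=(10, d_in)):  # the same weights are reused at every step
    h, y = rnn_step(x_t, h, W_h, W_x, W_y, b, b_y)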

What is a deep network?

Deep neural networks are basically neural networks in which there are many more layers. Today almost all of the most widely used models belong to this class.

neuron

Obviously, models with more hidden layers can build a more complex and sophisticated representation of the data. However, this comes at a cost in terms of data required (more layers, more data, especially to avoid overfitting), time, and computational resources (more parameters often take more time and more hardware). In addition, training requires special tuning to avoid problems such as overfitting, underfitting, or vanishing gradients.

What’s the Dying ReLU problem?

The rectifier or ReLU (rectified linear unit) activation function has a number of advantages but also a number of disadvantages:

  • It is not differentiable at zero (it is not "smooth" there).
  • It is not zero-centered.
  • The dying ReLU problem.

The Dying ReLU problem refers to the scenario in which many ReLU neurons only output values of 0. In fact, if before the activation function the output of the neuron is less than zero, after ReLU is zero:

$$\text{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

neuron Plot of the ReLU rectifier (blue) and GELU (green) functions near x = 0, from Wikipedia

The problem occurs when most of the inputs are negative. The worst case is when the entire network dies: at that point the gradient fails to flow during backpropagation and the weights are no longer updated. The entire network, or a significant part of it, becomes inactive and stops learning, and once this happens there is no way to reactivate it.

The Dying ReLU problem is related to two different factors:

  • A high learning rate. A high learning rate can make the weights negative, since a large amount is subtracted during the update, which in turn leads to negative inputs for ReLU.
  • A large negative bias. Since the bias contributes to the pre-activation sum, this is another cause.

So the solution is either to use a smaller learning rate or to try one of the several alternatives to ReLU.
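
For example, Leaky ReLU keeps a small slope for negative inputs, so the gradient never becomes exactly zero and the neuron cannot fully die (a minimal numpy sketch):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small slope alpha for x < 0 keeps a non-zero gradient
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0.    0.    0.    2.]   -> zero gradient for x < 0
print(leaky_relu(x))  # [-0.03 -0.005 0.    2.]  -> gradient alpha for x < 0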

Suggested reading:

What is the vanishing gradient?

The vanishing gradient problem refers to the gradient shrinking toward zero as more layers are added to a neural network. The deeper the network gets, the less gradient reaches the first layers and the more difficult they become to train.

This is easy to understand when the sigmoid function is used as the activation function. The sigmoid squashes any input into the range (0, 1), so a large change in input does not correspond to a large change in output. In addition, its derivative is small for inputs of large magnitude.

sigmoid activation function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

derivative:

$$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$$

neuron from here

When there are few layers this is not a problem, but the more layers are added, the more the gradient shrinks and training is impacted. Since backpropagation proceeds from the final layer back to the initial layers, if there are n layers with sigmoid it means that n small derivatives are multiplied together (backpropagation uses the chain rule of derivatives). With little gradient, the first layers receive little update, so these layers will not be trained efficiently.
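
A back-of-the-envelope computation shows the effect: the sigmoid derivative is at most 0.25, and the chain rule multiplies one such factor per layer:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x) * (1 - sigma(x)) <= 0.25
d = sigmoid(1.0) * (1 - sigmoid(1.0))  # ~0.20 for x = 1
for depth in (2, 5, 10, 20):
    # the gradient reaching the first layer shrinks exponentially with depth
    print(f"{depth:2d} layers: gradient factor ~ {d ** depth:.2e}")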

ReLU has been used as a solution, and in fact it has gradually become the most widely used activation function in neural networks. For particularly deep networks, residual connections are also used: they let the signal skip layers, so the gradient can flow without being repeatedly shrunk by small derivatives. Batch normalization is another solution, as it keeps inputs away from the saturated tails of the sigmoid.

What is dropout? How should I use it efficiently?

Dropout is a regularization technique that aims to reduce network complexity to avoid overfitting. The problem is that models can learn statistical noise; the best way to avoid this would be to train many models with different parameters and aggregate them, but this would be very computationally expensive. Dropout instead provides an implicit ensemble.

"Dropout is a technique that addresses both these issues. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently. The term “dropout” refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. The choice of which units to drop is random. " -source: original papers

neuron from the original papers

During training, a certain number of neurons (decided by a probability p) is deactivated. If the probability p is 50% for a layer, it means that randomly 50% of its neurons will be set to zero. This means that the model cannot rely on a particular neuron or combination of neurons, but has to learn redundant representations. This acts on overfitting because, during overfitting, one neuron might compensate for the error of another neuron. This process is called co-adaptation, in which several neurons are in "collusion," and it reduces the generalization ability of the network. If we use dropout, we prevent neuron co-adaptation because some neurons are set to zero at random.

" According to this theory, the role of sexual reproduction is not just to allow useful new genes to spread throughout the population, but also to facilitate this process by reducing complex co-adaptations that would reduce the chance of a new gene improving the fitness of an individual. Similarly, each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes. However, the hidden units within a layer will still learn to do different things from each other. " -source: original papers

This also allows us to learn features that are better generalizable:

neuron from the original papers

During training, if you set the probability to 50%, the outputs of the remaining neurons are rescaled by an equivalent factor (e.g., 2x). At inference, on the other hand, the probability p is zero and all neurons are active.

neuron from the original papers

Tips for using Dropout:

  • When using ReLU, according to some it is better to put dropout before the activation function (fully connected, dropout, ReLU).
  • The general rule of thumb is to start with low dropout rates (p = 0.1/0.2) and then increase until performance starts to decrease. The original article suggests 0.5 as a general value for hidden units across a variety of tasks. Some articles suggest a retention probability of 0.8 for the input layer (so that only 20% of the input units are dropped) and 0.5 for hidden layers; at any rate, drop fewer neurons in the first layers.
  • Dropout is especially recommended for large networks and small datasets. A minimal PyTorch sketch of these tips follows.
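
Layer sizes below are illustrative; note that nn.Dropout is automatically disabled at inference by model.eval():

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.Dropout(p=0.2),  # lower rate in the first layers
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.Dropout(p=0.5),  # 0.5 as suggested for hidden units
    nn.ReLU(),
    nn.Linear(128, 10),
)
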
What is batch normalization? How should I use it efficiently? !
What are Kolmogorov-Arnold Network (KAN)? Why the hype about it?

Kolmogorov-Arnold Networks (KANs) are a new type of neural network based on the Kolmogorov-Arnold representation theorem (while classical neural networks are based on the universal approximation theorem, according to which a neural network can approximate any continuous function).

According to the Kolmogorov-Arnold representation theorem, any continuous multivariate function can be expressed as a finite composition of continuous univariate functions combined by addition. To make a simpler example, we can imagine a cake as the result of a series of ingredients combined together in some way. In short, a complex object can be seen as the sum of individual elements combined in a specific way; in a recipe, we add only one ingredient at a time to make the process simpler.

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$

Observing this equation, we have a multivariate function $f$ (our cake), univariate functions $\phi_{q,p}$ (our ingredients), and outer functions $\Phi_q$ explaining how they are combined (the recipe steps). In short, from a finished product we want to reconstruct the recipe.

Why is this theorem of interest to us? Because in machine learning we need systems that allow us to approximate complex functions efficiently and accurately. Especially when there are so many dimensions, neural networks are in danger of falling into what is called the curse of dimensionality.

The second theoretical element we need is the concept of a spline. A spline is a piecewise polynomial function that defines a smooth curve through a series of points; B-splines are a particular way to fit one. For example, let's imagine that we have collected temperature data throughout the day at varying intervals and we want a curve that shows the trend. We could use a single polynomial curve, but high-degree polynomials tend to fluctuate quite a bit (Runge's phenomenon, for friends). A spline fits better because it divides the data into segments and fits an individual polynomial curve to each segment (instead of one curve for all the data). B-splines are an improvement that allows for better-fitted curves; in short, they provide better accuracy, achieved by using control points to guide the fitting.

comparison spline and polynomial function

Mathematically this is the equation of the b-spline:

$$C(t) = \sum_{i=0}^{n} P_i N_{i,k}(t)$$

where $P_i$ are the control points, $N_{i,k}$ are the basis functions of degree $k$, and $t$ is the parameter running over the knot vector.
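
Continuing the temperature example, here is a minimal scipy sketch (the readings are made up):

import numpy as np
from scipy.interpolate import BSpline, splrep

hours = np.array([0, 3, 5, 8, 11, 13, 16, 19, 22, 24], dtype=float)
temps = np.array([12, 10, 11, 15, 20, 23, 22, 18, 14, 12], dtype=float)

# fit a cubic smoothing B-spline (k=3); s controls the amount of smoothing
tck = splrep(hours, temps, k=3, s=2.0)
spline = BSpline(*tck)

grid = np.linspace(0, 24, 200)
trend = spline(grid)  # smooth daily trend, without the oscillations a single
                      # high-degree polynomial would show (Runge's phenomenon)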

Now we have the theoretical elements, what we need to keep in mind is:

  • Given a complex function, we have a theorem that tells us we can reconstruct it from single unitary elements combined through a series of steps.
  • We can fit a curve with great accuracy through a series of points and thus highlight trends and support other analyses.

Why do we care about it? How can we use it for a neural network?

The classical neural network has some limitations:

  • Fixed activation functions on the nodes. Each neuron has a predetermined activation function (like ReLU or sigmoid). This is fine in many cases, though it reduces the flexibility and adaptability of the network; in some cases, it is difficult for a neural network to optimize a certain function or adapt to certain data.
  • Interpretability. Neural networks are poorly interpretable, and the more parameters there are, the worse it becomes. Understanding the internal decision-making process becomes difficult, and therefore it is harder to trust the predictions.

At this point, KANs have recently been proposed to solve these two problems.

KAN introduction from the original papers

KANs are based on joining the Kolmogorov-Arnold representation (KAR) theorem with B-splines: on each edge of the network, we place a learnable B-spline activation function. In other words, the model learns the decomposition of the data (our cake) into a series of B-splines (our ingredients).

KAN versus MLP from the original papers

Now let us go into a little more detail. In a KAN, the matrix of weights is replaced by a set of learnable univariate functions on the edges of the network. Each node can then be seen as the sum of these (nonlinear) functions. In contrast, in MLPs we have a linear transformation (the multiplication with the weight matrix) followed by a nonlinear function. In formulas, we can clearly see the difference:

$$\text{KAN}(\mathbf{x}) = \left( \Phi_{L-1} \circ \Phi_{L-2} \circ \ \cdots \circ \Phi_1 \circ \Phi_0 \right) \mathbf{x}$$

$$\text{MLP}(\mathbf{x}) = \left( \mathbf{W}_{L-1} \circ \sigma \circ \cdots \circ \mathbf{W}_1 \circ \sigma \circ \mathbf{W}_0 \right) \mathbf{x}$$

In a more compact version, we can rewrite it like this:

$$f(x_1, x_2, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$

where $\phi_{q,p}$ are univariate functions (B-splines) and $\Phi_q$ are the outer functions that assemble everything.

Now each layer of a KAN network can be seen like this:

$$x_j^{(l+1)} = \sum_{i=1}^{n_l} \phi_{l,j,i}\left(x_i^{(l)}\right)$$

where $x_i^{(l)}$ is the $i$-th activation at layer $l$ (basically the partially cooked dish after a number of steps and added ingredients) and $\phi_{l,j,i}$ are the functions on the edges between layer $l$ and layer $l+1$.

Notice that the authors want to maintain this parallelism with the MLP. In MLPs you can stack different layers and then each layer learns a different representation of the data. The authors use the same principle, where they have a KAN layer and more layers can be added.

KAN B-splines representation from the original papers

B-splines allow us to learn complex relationships in the data, since they adjust their shape to minimize the approximation error. This flexibility allows the network to learn complex yet subtle patterns.

The beauty of B-splines is that they are controlled by a set of control points (called grid points): the more grid points, the more detail a spline can capture in the data. The authors therefore use a technique, grid extension, to learn more detailed patterns (adding grid points) without retraining from scratch.

As you can see, the model starts with a coarse grid (fewer intervals). The idea is to start by learning the basic structure of the data without focusing on the details. As learning progresses, points are added (the grid is refined), and this allows the model to capture more details in the data. This is achieved by using least-squares optimization to minimize the difference between the refined spline and the original one (the one with fewer points); in short, learning more detail without losing the overall knowledge about the data previously learned. We could describe it as starting with a sketch of a drawing, to which we gradually add details, trying not to distort the original basic idea.

KAN Grid extension from the original papers

In principle, a spline can be made arbitrarily accurate to a target function as the grid can be made arbitrarily fine-grained. This good feature is inherited by KANs. By contrast, MLPs do not have the notion of “fine-graining”. Admittedly, increasing the width and depth of MLPs can lead to improvement in performance (“neural scaling laws”). However, these neural scaling laws are slow -source

In other words, grid extension allows us to make a KAN more accurate without having to increase the number of parameters (by adding functions, for example). For the authors, this allows even small KANs to be accurate (sometimes even more so than larger ones with more layers and more functions, probably because the latter capture more noise).

To summarize:

The architecture of Kolmogorov-Arnold Networks (KAN) is unique in that its core idea is to replace traditional fixed linear weights with learnable univariate functions, achieving greater flexibility and adaptability. The architecture of KAN consists of multiple layers, each containing several nodes and edges. Each node is responsible for receiving input signals from the previous layer and applying nonlinear transformations to these signals via learnable univariate functions on the edges. These univariate functions are typically parameterized by spline functions to ensure smoothness and stability in the data processing process-source

The use of splines makes it possible to capture complex patterns in the data as nonlinear relationships. The advantage of this architecture is that it is adaptable to various patterns in the data, and these functions can be adapted dynamically and also be refined in the process. This then allows the model to be not only adaptable but also very expressive.

One of the strengths of KANs is precisely the interpretability. To improve this, the authors use two techniques:

  • Sparsification and Pruning
  • Symbolification

Sparsification is used to sparsify the network so that we can eliminate connections that are not needed. To do this, the authors use L1 regularization. Usually in neural networks, the L1 norm reduces the magnitude of the weights, inducing sparsification (making them zero or close to zero). In KANs there are no "weights" proper, so we have to define the L1 norm of the activation functions. In this case, we therefore act on the function:

$$\left\lvert \phi \right\rvert_1 = \frac{1}{N_p} \sum_{s=1}^{N_p} \left\lvert \phi(x_s) \right\rvert$$

with $\phi(x_s)$ representing the value of the function on input sample $x_s$ and $N_p$ the number of input samples. In short, with this $L_1$ norm we evaluate the mean absolute value of the activation function and try to reduce it.

Pruning is another technique used in neural networks in which we eliminate connections (neurons or edges) that are below a certain threshold value. This is because weights that are too small do not have an impact and can be eliminated. So by pruning, we can get a smaller network (a subnetwork). In general, this subnetwork is less heavy and more efficient, but it maintains the performance of the original network (like when pruning dead branches in a tree, the tree usually grows better). After pruning and sparsification, we then have a network that is less complex and potentially more interpretable.

Symbolification is an interesting approach because the goal is to replace the learned univariate functions with known symbolic functions (such as cosine, sine, or log). So, given a univariate function, we want to identify a potential symbolic function that approximates it, obtaining something better known and humanly readable. This task is not as easy as it may seem:

However, we cannot simply set the activation function to be the exact symbolic formula, since its inputs and outputs may have shifts and scalings. -source

So, taking the input $x$ and the output $y$, we want to substitute a known function $f$, learning parameters $(a, b, c, d)$ so that the original univariate function is approximated by $y \approx c f(ax + b) + d$. This is done with grid search and linear regression.
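
A minimal sketch of this idea (not PyKAN's actual implementation; the grids and names are illustrative): grid-search the affine input parameters (a, b) and solve (c, d) in closed form by linear regression:

import numpy as np

def fit_symbolic(x, y, f, a_grid, b_grid):
    best = (np.inf, None)
    for a in a_grid:
        for b in b_grid:
            g = f(a * x + b)
            # linear regression for (c, d) in y ~ c * g + d
            A = np.stack([g, np.ones_like(g)], axis=1)
            (c, d), *_ = np.linalg.lstsq(A, y, rcond=None)
            err = np.mean((c * g + d - y) ** 2)
            if err < best[0]:
                best = (err, (a, b, c, d))
    return best

x = np.linspace(0, 5, 200)
y = 2.0 * np.sin(1.5 * x + 0.3) - 1.0  # "learned" function to symbolify
err, params = fit_symbolic(x, y, np.sin, np.linspace(0.5, 3, 26), np.linspace(0, 1, 11))
print(params)  # close to (a, b, c, d) = (1.5, 0.3, 2.0, -1.0)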

KAN Grid extension from the original papers

At this point:

  • we know how to train a KAN network.
  • we have eliminated unnecessary connections by sparsification and pruning.
  • we have made it more interpretable because now our network is no longer composed of univariate functions but of symbolic functions

The authors provide in the paper some examples where the elements we have seen are useful. For example, KANs are most efficient at tasks such as function fitting with inputs of various sizes. For the authors, KANs are more expressive and, more importantly, more efficient than MLPs (they require fewer parameters and scale better). They are also easy to interpret when the relationship between X and y is a symbolic function.

KAN scaling in comparison to MLP from the original papers

Another interesting point is that, for the authors, KANs work best for continual learning. According to them, these networks are better able to retain learned information while adapting to new information (we return to continual learning in more detail later).

When a neural network is trained on task 1 and then shifted to being trained on task 2, the network will soon forget about how to perform task 1. A key difference between artificial neural networks and human brains is that human brains have functionally distinct modules placed locally in space. When a new task is learned, structure re-organization only occurs in local regions responsible for relevant skills, leaving other regions intact. -source

For the authors, this favors KANs thanks to the local nature of splines. This is because MLPs rely on global activations that impact the entire model, while KANs have a more local optimization for each new example (as a new example arrives, they only change a limited set of spline coefficients). In MLPs, on the other hand, any local change propagates throughout the system, likely damaging learned knowledge (catastrophic forgetting).

As expected, KAN only remodels regions where data is present in the current phase, leaving previous regions unchanged. By contrast, MLPs remodel the whole region after seeing new data samples, leading to catastrophic forgetting. -source

As can be seen in this case, KANs do not forget the information learned up to that point:

KAN continual learning from the original papers

Currently, the biggest bottleneck of KANs lies in its slow training. KANs are usually 10x slower than MLPs, given the same number of parameters. We should be honest that we did not try hard to optimize KANs’ efficiency though, so we deem KANs’ slow training more as an engineering problem to be improved in the future rather than a fundamental limitation. If one wants to train a model fast, one should use MLPs. In other cases, however, KANs should be comparable or better than MLPs, which makes them worth trying. -source

KAN guide from the original papers

The question is: are these KANs better than MLPs?

Not everyone agrees with the supposed superiority of KANs. For example, this report offers four criticisms of the original article:

  1. MLPs have learnable activation functions as well. Indeed, learnable functions can be used as activation functions; something similar has already been explored here. For example, if we consider a very simple two-layer MLP and rewrite it with a learnable activation function, it is very reminiscent of a KAN: $f(\mathbf{x}) = \mathbf{W}_2 \sigma(\mathbf{W}_1 \mathbf{x}) = \mathbf{W}_2 \phi_1(\mathbf{x})$.
  2. The content of the paper does not justify the name Kolmogorov-Arnold Networks (KANs). An MLP can also be rewritten as a sum. The difference between the Kolmogorov-Arnold theorem (KAT) and the Universal Approximation Theorem (UAT) is that the latter needs infinitely many neurons in a two-layer network to approximate any function, while the KAT reduces this to (2n + 1) functions in the hidden layer; however, the authors do not consistently use (2n + 1) in the hidden layer.
  3. KANs are MLPs with a spline basis as the activation function. Rather than new neural networks, they would be MLPs in which the activation function is a spline.
  4. KANs do not beat the curse of dimensionality. For the author, the claim is unwarranted by the evidence.

How do they compare with MLPs?

In this paper, instead, they conduct an in-depth comparison between MLPs and KANs across different domains (controlling the number of parameters and evaluating different tasks). The authors comment: "Under these fair settings, we observe that KAN outperforms MLP only in symbolic formula representation tasks, while MLP typically excels over KAN in other tasks." In the ablation studies, they show that the B-spline activation function is what gives KANs an advantage in symbolic formula representation.

KAN guide from the original papers

Overall, the article has the positive effect of making the use of B-splines in neural networks more tractable and proposing a system that is more interpretable (via sparsification, pruning, and symbolification). The system is not yet optimized, so in computational terms it is not exactly competitive with an MLP, but it is an interesting alternative that is still developing.

In this other article they compared the ability of KAN and MLP to represent and approximate functions.

In this article, we study the approximation theory and spectral bias of KANs and compare them with MLPs. Specifically, we show that any MLP with the ReLUk activation function can be reparameterized as a KAN with a comparable number of parameters. This establishes that the representation and approximation power of KANs is at least as great as that of MLPs. -source

Thus, KANs can represent MLPs that have a similar size (in other words, they have the same approximation power).

On the other hand, we also show that any KAN (without SiLU non-linearity) can be represented using an MLP. However, the number of parameters in the MLP representation is larger by a factor proportional to the grid size of the KAN. -source

MLPs can represent KANs, but at the cost of increasing the number of parameters. The number of parameters of the MLP grows significantly with the grid size of the KAN, so if a task requires a KAN with a large grid, it is not efficient to use an MLP.

Another interesting result is the spectral bias analysis. MLPs are known to have a spectral bias toward low-frequency components (the smooth, gradual changes in the function the model wants to learn). Gradient descent favors low frequencies, which are learned earlier in training; in contrast, high-frequency details require finer adjustments, which typically come later in training as the model fits the more detailed features of the data. This spectral bias acts as a regularizer and seems useful for many machine-learning applications. Sometimes, though, high frequencies (sharp, rapid changes or very detailed variations, where the function's value changes rapidly over a short interval) are useful to learn. In these cases, high-frequency information often has to be encoded using methods like Fourier feature mapping or different nonlinear functionals. For the authors, KANs theoretically have a reduced spectral bias.

MLPs manifest strong spectral biases (top), while KANs do not (bottom). from the original papers

Precisely since KANs are not susceptible to spectral biases, they are likely to overfit to noise. As a consequence, we notice that KANs are more subject to overfitting on the training data when the task is very complicated. On the other hand, we can increase the number of training points to alleviate the overfitting. -source

So being less susceptible to spectral bias comes at the cost of a greater risk of overfitting. For smooth functions, one can use a KAN with a small grid size (fewer points in the B-splines), while for high-frequency functions a KAN with a large grid is better. For large-scale and smooth functions, however, an MLP is recommended.

Working with KAN

It is actually very easy to train KANs with the official Python library, PyKAN. For example, you just need to define the model and train it:

import torch
from kan import KAN  # official pykan library

# width=[4, 5, 3]: 4 inputs, one hidden layer of 5 nodes, 3 outputs;
# grid=5 spline intervals, k=3 spline order (device and dataset defined elsewhere)
model = KAN(width=[4, 5, 3], grid=5, k=3, seed=0, device=device)
results = model.fit(dataset, opt="Adam", steps=100, metrics=(train_acc, test_acc),
                    loss_fn=torch.nn.CrossEntropyLoss(),
                    lamb=0.01, lamb_entropy=10.)
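Here, train_acc and test_acc are user-supplied metric callbacks (and dataset is the usual pykan dataset dictionary); a minimal sketch in the style of the pykan tutorials, assuming a classification dataset with train_input/train_label keys:

# Metric callbacks read the model and dataset from the enclosing scope,
# as done in the pykan tutorials.
def train_acc():
    preds = torch.argmax(model(dataset['train_input']), dim=1)
    return torch.mean((preds == dataset['train_label']).float())

def test_acc():
    preds = torch.argmax(model(dataset['test_input']), dim=1)
    return torch.mean((preds == dataset['test_label']).float())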

The image below shows the progressive sparsification effect that happens with KANs, which increases the interpretability of the system:

KAN guide

Applications of KAN

KAN guide Fast development of KAN in one year. from the original papers

KANs have been efficiently adapted to deep learning through a series of modifications and new architectures. This means they can also be used for a variety of applications beyond those originally envisioned.

Several articles have since come out presenting applications of KANs; below are some examples:

  • KAN-based models for medical image segmentation (U-net based) have been proposed in computer vision.
  • There are also KANs proposed for time series, as here, here or here; they show results similar to RNNs with fewer parameters.
  • For graph analysis, like graph collaborative filtering (here) or molecular representation (here)
  • Wav-KAN showcasing KANs’ broad applicability in modern fields like signal processing.

Looking at some applications in detail, this study proposes an extension of KANs for survival modeling (also called time-to-event analysis, where one models the time until an event happens). In this task, deep learning-based models usually perform better than traditional machine learning models, but at the cost of interpretability. Precisely because of the need for interpretability, KANs have been considered a good alternative:

The key contributions of this paper are in demonstrating that (a) CoxKAN finds interpretable symbolic formulas for the hazard function, (b) CoxKAN identifies biomarkers and complex variable interactions, and (c) CoxKAN achieves performance that is superior to CoxPH and consistent with or better than DeepSurv (the equivalent MLP-based model). --source

CoxKAN can be seen as an extension of KANs in which censored regression is conducted. Everything is pretty much the same except the loss (the Cox loss plus a regularization coefficient to sparsify). Since KANs are slower to train than MLPs, regularization helps speed up training. Unlike with MLPs, we also have pruning and symbolic fitting; these two steps help make the model more interpretable.

Cox-KAN pipeline from the original article

Once the specific library is installed:

# Install coxkan
! pip install coxkan scikit-survival

We also install scikit-survival to compare with a trained gradient-boosting model based on decision trees.

At this point, we can train our neural network:

from coxkan import CoxKAN
from sklearn.model_selection import train_test_split
import numpy as np
from coxkan.datasets import gbsg

# load dataset
df_train, df_test = gbsg.load(split=True)
name, duration_col, event_col, covariates = gbsg.metadata()

# init CoxKAN
ckan = CoxKAN(width=[len(covariates), 1], seed=42)

# pre-process and register data
df_train, df_test = ckan.process_data(df_train, df_test, duration_col, event_col, normalization='standard')

# train CoxKAN
_ = ckan.train(
    df_train, 
    df_test, 
    duration_col=duration_col, 
    event_col=event_col,
    opt='Adam',
    lr=0.01,
    steps=100)

print("\nCoxKAN C-Index: ", ckan.cindex(df_test))

# Auto symbolic fitting
fit_success = ckan.auto_symbolic(verbose=False)
display(ckan.symbolic_formula(floating_digit=2)[0][0])

# Plot coxkan
fig = ckan.plot(beta=20)

Results for CoxKAN:

KAN network for survival

import pandas as pd
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

# Prepare the target variables for survival analysis
y_train = Surv.from_arrays(event=df_train['event'].astype(bool), time=df_train['duration'])
y_test = Surv.from_arrays(event=df_test['event'].astype(bool), time=df_test['duration'])

# Prepare the feature matrices
X_train = df_train.drop(['duration', 'event'], axis=1)
X_test = df_test.drop(['duration', 'event'], axis=1)

# Initialize and train the model
model = GradientBoostingSurvivalAnalysis()
model.fit(X_train, y_train)

# Predict risk scores on the test set
pred_test = model.predict(X_test)

# Compute the concordance index
cindex = concordance_index_censored(y_test['event'], y_test['time'], pred_test)

print("C-index on the test set:", cindex[0])

Results for the gradient boosting method: scikit-survival comparison with CoxKAN

We can notice three things:

  • The result is similar to that obtained with traditional machine learning.
  • We can get the symbolic formula for our KAN that allows us to interpret the relationship between the various features.
  • We can also visualize these features.

Suggested readings:

Other resources:

What are (Comple)XNet or Xnet?

One of the most important challenges in computational mathematics and artificial intelligence (AI) is finding the most appropriate function to accurately model a given dataset. Traditional methods involve a number of classes of functions (such as polynomials or Fourier series) that are simple and computationally inexpensive. Deep learning methods, on the other hand, use locally linear functions with nonlinear activations. In 2024, KANs were proposed, which instead are inspired by the Kolmogorov-Arnold representation theorem and use B-splines as learnable functions (MLPs, by contrast, use fixed activation functions).

Instead, in this paper the authors start from another theorem, the Cauchy integral formula, to extend real-valued functions into the complex domain. This theorem states that if you have an analytic (smooth and differentiable) function in a region of the complex plane, then the integral of the function along any closed loop within that region is zero. So the function must have a derivative at every point in the region, and that derivative is continuous. A closed loop is simply a path in the complex plane that begins at one point and ends at the same point. This theorem is useful for predicting the behavior of functions.

Building on this, the authors develop a new activation function called the Cauchy Activation Function:

![xNET](https://github.com/SalvatoreRa/tutorial/blob/main/images/Cauchy_activation_function.png?raw=true) from here

where λ1, λ2, and d are parameters optimized during training. The beauty of this function is that it can approximate any smooth function to its highest possible order. The function is localized and decays at both ends: the output decreases as the input moves away from the center, allowing it to focus on local data. It is also mathematically efficient.

xNET from here

In contrast to KANs, we can build an activation function and insert it directly into our neural network, without the need for major adaptations.

import torch 
import torch.nn as nn

# Create a class to represent the function
class CauchyActivation(nn.Module):
  def __init__(self):
    super().__init__()
    # Set the trainable parameters: λ1, λ2, d 
    self.lambda1 = nn.Parameter(torch.tensor(1.0))
    self.lambda2 = nn.Parameter(torch.tensor(1.0))
    self.d = nn.Parameter(torch.tensor(1.0))
  def forward(self, x):
    x2_d2 = x ** 2 + self.d ** 2
    return self.lambda1 * x / x2_d2 + self.lambda2 / x2_d2
# Define and insert directly in a neural net
cauchy_activation = CauchyActivation() 
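
For instance, a minimal sketch (with illustrative layer sizes of our choosing) of dropping it into a small feed-forward network in place of a fixed activation:

net = nn.Sequential(
    nn.Linear(16, 32),
    CauchyActivation(),  # learnable activation instead of a fixed ReLU/SiLU
    nn.Linear(32, 1),
)
out = net(torch.randn(8, 16))  # behaves like any other activation module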

For the authors, this is enough to have better-performing networks:

xNET from here

How to deal with overfitting in neural networks?

"The central challenge in machine learning is that we must perform well on new, previously unseen inputs — not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization."-source

Neural networks are sophisticated and complex models that can be composed of a great many parameters. This makes it possible for neural networks to store spurious patterns and correlations that are only present in the training set. There are several techniques that can be used to reduce the risk of overfitting.

Collecting more data and data augmentation

Obviously, the best way is to collect more data, especially quality data: the model is exposed to more patterns and can better identify the relevant ones.

Data augmentation simulates having more data and makes it harder for the model to learn spurious correlations or to memorize patterns or even whole examples (deep networks have a very large capacity in terms of parameters).

neuron from here
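
As an illustration, a minimal sketch of image data augmentation with torchvision (the specific transforms and their parameters are illustrative choices, not a recommended recipe):

import numpy as np
from PIL import Image
from torchvision import transforms

# Each pass through the pipeline yields a slightly different image,
# simulating a larger training set.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

img = Image.fromarray(np.uint8(np.random.rand(256, 256, 3) * 255))  # stand-in image
augmented = augment(img)  # a (3, 224, 224) tensor, different on every call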

What is the lottery ticket hypothesis?

The lottery ticket hypothesis was proposed in 2019 to explain why neural networks can be pruned after training. Once we have trained a neural network with lots of parameters, we want to remove the weights that do not serve the task and create a lighter network (pruning). This allows for smaller, faster neural networks that consume fewer resources. Many researchers have wondered: can't we eliminate these weights before training? If these weights are not useful afterward, they may not be useful during training either.

"The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. "-source

neuron from here

What the authors do is a process called Iterative Magnitude Pruning: basically, they train the network, eliminate the smallest weights, and extract a subnetwork. This subnetwork is reset to its original initialization values and re-trained until convergence. The subnetwork is called a "winning ticket" because, by chance, it received the right initial weights to reach the best performance.

Now, this leads to two important considerations: there is a subnetwork that is more computationally efficient and can be further trained to improve performance, and this subnetwork has better generalization capabilities. If it could be identified before training, it would reduce the need to use large dense networks.

According to some authors, the lottery ticket hypothesis is one of the reasons why neural networks form sophisticated circuits. Thus, it is not that weights gradually improve, but rather that improvement happens if these circuits are already present (weights that have won the lottery). This would then be the basis of grokking.

Suggested reading:

What is grokking?

Overfitting is considered one of the major problems in neural networks: an overfitting model shows good results on the training dataset but poor results on test data (i.e., it fails to generalize what it learned from the training data). Overfitting can have many causes, such as too little data, lack of regularization, or a model that is too complex (see above).

Recently, however, the concept of overfitting has been challenged by so-called grokking. Grokking is defined as delayed generalization: in a first stage the model seems to memorize the training data rather than generalize (this is seen from the loss and accuracy curves); then, continuing training, at a certain point there is a rapid decrease in the validation loss (the model "groks").

grokking Grokking: A dramatic example of generalization far after overfitting on an algorithmic dataset. from here

We discussed this in detail in this article, but basically there are various forces at work:

  • The model tends to memorize elements because it is more efficient. These circuits memorize training examples.
  • Under regularization forces such as weight decay, generalization circuits slowly emerge. These circuits are able to capture the patterns underlying the data.
  • This balance is dependent on various elements such as dataset size.

Grokking might seem like a theoretical curiosity without practical applications, especially since it needs many iterations to emerge. A recent paper discusses the possibility of an algorithm called Grokfast to accelerate model convergence toward generalization.

The system decomposes the gradient of a parameter into two components: a fast-varying component and a slow-varying component. The former is responsible for overfitting, the latter for generalization (inspired by the circuits described in other articles). Exploiting this, one can speed up convergence by simply strengthening the influence of the slow-varying component. Here, the code.
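
A minimal sketch of this idea (an EMA low-pass filter over gradients whose slow component is amplified before each optimizer step; the model, data, and the alpha/lamb values here are illustrative stand-ins, not the paper's setup):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))  # toy batch

# One EMA buffer per parameter holds the slow-varying gradient component.
ema = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
alpha, lamb = 0.98, 2.0

for step in range(100):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            ema[n] = alpha * ema[n] + (1 - alpha) * p.grad  # slow component
            p.grad += lamb * ema[n]                         # amplify it before the step
    optimizer.step()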

Articles describing in detail:

What is continual learning? Why do neural networks struggle with continual learning?

Continual learning refers to the ability of neural networks to adapt as data or tasks change.

"Continual learning, sometimes referred to as lifelong learning or incremental learning, is a subfield of machine learning that focuses on the challenging problem of incrementally training models on a stream of data with the aim of accumulating knowledge over time. This setting calls for algorithms that can learn new skills with minimal forgetting of what they had learned previously, transfer knowledge across tasks, and smoothly adapt to new circumstances when needed." -- source

continual learning introduction A conceptual framework of continual learning. from here

In general, this is an extremely interesting problem for deep learning: we want our model to adapt to new data without needing to train it from scratch. For example, an LLM has a large memory capacity and shows several skills, but we may want to update its knowledge with facts that occurred after its training. To do this, we usually use fine-tuning techniques, but these can reduce the model's performance or lead it to produce hallucinations. Fine-tuning is therefore not optimal, and the problem remains open.

There are two main problems with continual learning (also called incremental learning or lifelong learning) that we will discuss more fully below:

  • catastrophic forgetting - when fitting new data, the model forgets previous data. For example, a model adapted to the medical domain loses knowledge of other domains. The same is true for tasks: a model trained for image classification loses its initial capabilities when fine-tuned.
  • loss of learning plasticity - the ability to acquire new information. A model that lacks plasticity is unable to learn new information or solve new tasks.

In addition to solving these two issues, we would like our model to have strong generalizability, that is, to be able to use new data to learn new capabilities. We would also like methods for continual learning to be resource-efficient: models today are expensive to train, and we would not want an expensive method for continued training. Finally, we would like continual learning methods not to lead to the expansion of a model (e.g., starting from a model of 1B parameters and, after a number of iterations, ending up with a model of 10B).

Moreover, continual learning does not only mean learning new knowledge or new tasks: it also includes the possibility of model editing. Models are often imperfect, learn decision shortcuts, or have biases, outdated data, facts that have changed, and knowledge that no longer complies with new regulations (e.g., on privacy). So it is important to find a way to selectively edit knowledge without losing other relevant knowledge.

As mentioned, neural networks lose plasticity during training. Several studies have tried to explain why this happens. According to this study, there are two phases of learning during training: a first phase of memorization and a second phase of information reduction (a reorganization or forgetting phase), which occurs even while performance is still growing. According to the authors, disrupting this memorization phase significantly reduces performance, leading to a lasting performance deficit.

perturbation of the learning phase of an NN DNNs exhibit critical periods, and alterations of these critical periods bring a performance deficit. from here

The authors do not make this argument, but as we saw with grokking there are two stages, memorization and generalization; the two findings may be related. A later study notes that several factors impact the plasticity of a network. The authors observe that regularization (L2-regularization) and noise injection improve network plasticity. They then analyze the factors that change during the loss of plasticity, identifying three in particular:

  • a continual increase in the fraction of constant units. One of the known problems of ReLU is the dying ReLU problem: during training, some neurons begin to receive only negative pre-activations, which ReLU maps to zero. A constant unit passes no gradient and no longer undergoes updates, so it loses plasticity. With continued training, the percentage of dead units increases over the iterations. The increase in dead units also decreases the total capacity of the network and thus its ability to learn new information (loss of plasticity).
  • steady growth of the network's average weight magnitude. The weights become larger and larger. Weight growth is considered a problem because it slows network learning and convergence.
  • a drop in the effective rank of the representation. The effective rank represents the number of linearly independent dimensions. A high effective rank indicates that most dimensions contribute to the transformation a matrix performs; simply put, most of the dimensions contain relevant information. A low effective rank means that there is a lot of redundant information. In the case of neural networks, the rank of a hidden layer measures the number of neurons that contribute to the output: a low effective rank means that most neurons are not providing useful information. Gradient descent, through implicit regularization of the loss function, implicitly minimizes the effective rank. Decreasing the effective rank is detrimental to learning new information because it decreases the representational power of a layer (which does not help in learning a new task).

causes of loss of neuronal plasticity

For the authors, these factors explain why regularization helps but does not solve the problem. L2-regularization incentivizes solutions with low weight magnitude (i.e., it penalizes weights when they grow too large), while other techniques introduce a little Gaussian noise; this reduces the problem of dying units because adding noise means units are not stuck at zero. Also, according to the authors, loss of plasticity could be related to the lottery ticket hypothesis. Neural networks are initialized with random weights, which makes some subnetworks winning tickets: there are subnetworks suited to the task simply by "luck" during initialization. During training these subnetworks are rewarded and preserved, but the network also eliminates everything that is not needed for the task (so it loses the randomness and diversity of the original network). This means that if we want our model to learn a new task, there will be fewer winning tickets available. Therefore, the authors propose continual backpropagation, in which weights that no longer contribute to the output are re-initialized. These are the neurons that many approaches prune when reducing the size of the network; the authors propose instead to reinitialize them so they can be used to learn new tasks.

Catastrophic forgetting is the other problem of continual learning. In this case, the interest is in maintaining the stability of the information and tasks already learned. The problem is that neural networks optimize their parameters for a task; when the task changes, the model changes its parameters, including those that were specific to the previous task. Some proposed solutions are Progressive Neural Networks, where we add a new network for each new task while maintaining connections to the previously trained network, so the weights that were important for a task remain intact. Other solutions act on the dataset, keeping part of the data from previous tasks (replay techniques). Another alternative is Elastic Weight Consolidation (EWC), where a regularization term is added that penalizes changes to weights that are important for previously learned tasks, as sketched below.
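
A minimal sketch of the EWC penalty (illustrative: the Fisher term here is a stand-in; in practice it is estimated from squared gradients of the log-likelihood on the old task):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Snapshot taken after training on task A, plus a diagonal Fisher estimate.
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # stand-in

def ewc_penalty(model, old_params, fisher, lam=100.0):
    # Quadratic pull toward the old weights, scaled by their importance.
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam / 2 * penalty

# On task B: total_loss = task_loss + ewc_penalty(model, old_params, fisher)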

Suggested readings:

Embeddings

What is an embedding?

An embedding is a low-dimensional representation of vectors from a high-dimensional space. For example, one could represent the words of a sentence with sparse vectors (one-hot encoding or other techniques); an embedding allows us to obtain a compact representation. Although embeddings originated for text, they can be applied to all kinds of data: for example, we can have a vector representing an image.

sparse word representation:

neuron from here

In general, the term embedding became a fundamental concept of machine learning after 2013, thanks to Word2Vec. Word2Vec made it possible to learn a vector representation for each word in a vocabulary. This vector captures features of a word such as its semantic relationships, definitions, context, and so on. In addition, this vector is numeric and can be used for operations (or for downstream tasks such as classification).

neuron from here

Word2Vec thus succeeds in grouping similar word vectors (distances in the embedding space are meaningful). Word2Vec estimates the meaning of a word based on its occurrences in the text. The system is simple: we try to predict a word from its neighbors or its context. In this way, the model learns the context of a word.
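
A minimal training sketch with gensim (toy corpus; real training needs far more text):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 selects skip-gram (predict the context from a word).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vec = model.wv["cat"]                   # the learned 50-dimensional vector
similar = model.wv.most_similar("cat")  # nearest words in the embedding space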

Suggested reading:

What are embedding vectors and latent space? What is a latent representation?

As mentioned above, an embedding vector is a reduced-size representation of high-dimensional data. In general, embeddings are continuous vectors of real numbers, although the starting vectors may be sparse (0-1) or integer-valued (pixel values). The value of these embedding vectors is that they maintain a notion of similarity or distance: two elements that are close in the initial space must also be close in the embedding space.

A representation is a transformation of the data. For example, a neural network learns a representation of the data at each layer. Intermediate layer representations within the neural network are also called latent representations (or the latent space). The representation can also be the output of a neural network, and the resulting vectors can be used as embeddings.

For example, we can take images, pass them through a transformer, and use the vectors obtained after the last layer. These vectors have a small size and are both a representation and an embedding, so the terms often have the same meaning. However, since a sparse encoding like one-hot is a representation but not an embedding, in some cases the terms differ.

Can we visualize embeddings? Is it interesting to do so?

As mentioned, distances are meaningful in the embedding space. In addition, operations such as searching for similar terms can be performed; embeddings have been used to search for similar documents or to conduct clustering.

neuron from here
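
A minimal sketch of similarity search over embeddings via cosine similarity (random vectors as stand-ins for real document embeddings):

import numpy as np

emb = np.random.randn(1000, 128)   # stand-in document embeddings
query = np.random.randn(128)       # stand-in query embedding

# Cosine similarity = dot product of L2-normalized vectors.
sims = emb @ query / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))
top5 = np.argsort(-sims)[:5]       # indices of the most similar documents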

After obtaining an embedding, it is good practice to conduct a visual analysis: it gives us visual insight and helps us understand whether the process went well. Because embeddings have many dimensions (usually up to 1024, although they can be more), we need dimensionality-reduction techniques such as PCA and t-SNE to obtain a two-dimensional projection.

neuron from here
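
A minimal visualization sketch with scikit-learn (again with random stand-in embeddings; with real data, points from the same cluster should appear close together):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.randn(500, 256)  # stand-in for real embeddings

pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], s=5)
axes[0].set_title("PCA")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], s=5)
axes[1].set_title("t-SNE")
plt.show()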

Transformer

What is self-attention?
What is a transformer?
What is a Vision Transformer (ViT)?
Why are there so many transformers?
Can we use BERT-like models for text generation?

**Masked language models (MLM)** have long dominated NLP, especially because they were adaptable to multiple tasks. The paradigm shift occurred in 2020, when it was shown that GPT-like models could do **in-context learning** (ICL). ICL allows a model to perform a task directly, without fine-tuning: by providing some examples in the prompt, the model understands the task and executes it.

“BERT-style models are very restricted in their generative capabilities. Because of the cumbersomeness of task-specific classification heads, we strongly do not recommend using this class of autoencoding models moving forward and consider them somewhat deprecated.” — source

This sealed the fate of BERT-like models and the rise of auto-regressive models (GPT-like).

The point is that it was thought that these models could not generate text efficiently, until two studies in 2024 showed that this was not the case. Not only can they generate text, but they can also do ICL:

There are two methods used to solve tasks with in-context learning: text generation where the model completes a given prompt (e.g. for translation) and ranking where the model chooses an answer from several options (e.g. for multiple choice questions). — source

BERT model text generation from here

The result is that a model trained with masked language modeling is competitive with GPT-like models for text generation and in-context learning: it shows in fact "similar absolute performance, similar scaling behavior, and with similar improvement when given more few-shot demonstrations." Since these models are not designed for generation, some accommodations must be used, but there is no need for additional training.

BERT model text generation from here

Another study confirms these results. For the authors, MLM outperforms Causal Language Modeling (CLM) for text generation: its output is more aligned with the original reference text.

Articles describing in detail:

Suggested readings:

What is a mamba model? Is it an alternative to a transformer?

Recently there has been talk of Mamba and State Space Models (SSMs) as possible alternatives to transformers. These models seem promising because they have a nearly linear computational cost, which is especially attractive when long sequences of text (up to one million tokens) have to be modeled.

mamba scaling from here

The authors of the model describe it as: "Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics"

SSMs originated as a framework for describing the dynamic behavior of a system over time. Simply put, considering a maze, the "state space" is the set of all possible locations (or states), while the "state space representation" is a description of where we are, where we can go, and how to get there. Where we are and how far we are from the exit of the maze can be described with a vector (the "state vector"). In the context of a neural network for text, given the state of the system (or hidden state), we can generate a new token (metaphorically, switch state or move on the map).

The basic assumption is that if one knows the current state of the world and how it will evolve, one can decide how to move.

More technically, SSMs try to map an input sequence x(t) to a latent state representation h(t) and predict an output y(t). For example, given a text sequence, they generate a latent representation and use it to predict the next word. A dynamical system can be described at each time t using these two equations:

equation 1: $$h'(t) = A h(t) + B x(t)$$

equation 2: $$y(t) = C h(t) + D x(t)$$

The goal is then to determine h(t) so that, given x(t), we can determine the output y(t). Equation 1 tells us that the state h changes as a function of the current state (through a matrix A) and the input x (through a matrix B). Equation 2 tells us that the output is obtained from the state h (through a matrix C) and the input x (through an associated matrix D). In simple words, given an input x the state h is updated, and the output is affected by both the input x and the state h (for an input x we get an update of the state h and an output y). More formally, the evolution of the system depends on newly acquired information and on its current state. Applied to a language model: given an input x and a model with state h, we can predict the next token y (note how close this is to an autoregressive language model, or to how transformers work).

mamba structure from here

Briefly, we can interpret the different matrices as follows:

  • A is the state transition matrix, which guides the transition from one state to another. Intuitively, it represents how we can forget the less relevant parts of the state.
  • B maps the input to the new state, controlling which part of the input we need to remember.
  • C allows us to map the output from the model state, or how to use the model state to make a prediction.
  • D is considered a kind of skip connection, or how the input affects the prediction

Calculating the state representation h(t) analytically is complex, especially if the signal is continuous. Since text is discrete by nature, discretizing the model makes our lives easier. Zero-Order Hold (ZOH) is the technique used in Mamba to discretize the model. After applying ZOH, the equations become:

$$h_k = \overline{A} h_{k-1} + \overline{B} x_k$$

$$y_k = C h_k$$

where $\overline{A} = \exp(\Delta A)$, $\overline{B} = (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B$, and $k$ is the discrete time step.

mamba structure unrolled from here
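
To make the recurrence concrete, here is a minimal numpy sketch of the discretized update (single input channel; the matrices are illustrative stand-ins, while a real Mamba layer learns A, B, C and the step Δ):

import numpy as np

d_state, seq_len = 4, 10
A_bar = np.eye(d_state) * 0.9        # stand-in for exp(ΔA)
B_bar = np.ones((d_state, 1)) * 0.1  # stand-in for the discretized B
C = np.ones((1, d_state))

x = np.random.randn(seq_len)         # input sequence
h = np.zeros((d_state, 1))           # hidden state
ys = []
for k in range(seq_len):
    h = A_bar @ h + B_bar * x[k]     # h_k depends on h_{k-1} and x_k
    ys.append((C @ h).item())        # y_k = C h_k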

In addition, we can also see this process as a convolution, in which we slide a kernel over the various tokens. At each time step, we can calculate the output in this way:

mamba convolutional calculation from here

The three representations we have seen (the continuous representation and its discretization into recurrent and convolutional forms) have different advantages and disadvantages. The recurrent representation is efficient at inference but does not allow parallel training, so training is conducted with the convolutional representation (which allows parallelization).

Another interesting modification in Mamba is the use of High-order Polynomial Projection Operators (HiPPO) to initialize the matrix A during training. HiPPO is used to compress input signals into vectors of coefficients. The idea is to exploit this concept in the A matrix, so that it captures the recent tokens while the information from older tokens decays. After all, the A matrix is used to carry information from previous states into the new state. In this way, the model's ability to handle long-range dependencies improves.

In addition, Mamba uses two particular additions:

  • selective scan algorithm to filter out irrelevant information. This is especially important to allow the model to be context-aware and not treat all tokens equally.
  • hardware-aware algorithm that allows it to store intermediate results. Hardware-aware algorithm allows it to better exploit the capabilities of the GPU (similar to what flash attention does).

An SSM compresses the entire history (i.e., everything seen up to that moment) and does so efficiently. Transformers do not compress the history; they are, however, very powerful at looking at and attending to the sequence (searching for what is important and modeling relationships). In a sense, the internal state of a transformer can be seen almost as a cache of the whole history. SSMs are not as powerful, so Mamba tries to have a state that is both efficient and powerful. Therefore, the B and C matrices are input-dependent and different for each input token (thus ensuring context awareness). These two matrices help choose which information to retain and which to discard.

These improvements can be seen in the Mamba block, where the SSM combines discretization, HiPPO initialization, and the selective scan algorithm, accelerated by the hardware-aware algorithm.

mamba block from here

Suggested readings:

What are other alternatives to the transformer?

There are several; here we will discuss the proposals that are considered the main ones:

  • xLSTM
  • Hyenas
  • RWKV

xLSTM

xLSTM was proposed in 2024 as a model that can compete with transformers. It comes from the same group that introduced the LSTM several years ago. The authors note that the LSTM has three main problems:

  • **Inability to revise storage decisions.** The LSTM struggles to update stored information when it encounters more relevant information later in the sequence (this was shown with a nearest-neighbor search problem, in which the LSTM struggles to identify the most similar vectors).
  • **Limited storage capacity.** LSTM memory cells compress information into a single scalar value; this heavy compression limits the amount of information that can be stored. This is especially a problem when modeling nuances, predicting rare tokens, and so on.
  • **Lack of parallelizability.** The LSTM is sequential by nature (each hidden state depends on the previous one) and therefore cannot exploit hardware such as GPUs, which also makes it impossible to train on huge amounts of text.

xlSTM structure from here

To solve these problems, the authors included a series of modifications that led to the creation of two variants of the xLSTM cell: sLSTM (scalar LSTM) and mLSTM (matrix LSTM).

For example sLSTM employs a series of modifications that improve its profile:

  • Exponential gating. It incorporates exponential activation functions for the input and forget gates, improving control over information flow.
  • Normalization and stabilization. To improve stability, the authors introduce a normalizer state for the input and forget gates.
  • Memory mixing. sLSTM supports multiple memory cells and allows memory mixing through recurrent connections, which enables better pattern extraction.

 xLSTM block from here

mLSTM, on the other hand, increases the memory of the system, allowing it to process and store information in parallel (instead of a scalar, here we have a matrix). This variant uses a memory matrix to allow more efficient storage and retrieval. Inspired by attention mechanisms, it uses a covariance update rule to store key-value pairs.

 xLSTM performance from here

xLSTM seems to have an advantage over the transformer where memory mixing is required (e.g., parity tasks). Transformers and SSM models (such as Mamba) fail this task because they cannot conduct state tracking; therefore, for the authors, the model is more expressive than the transformer. According to the authors, the model also efficiently models long context, has enhanced memory capacity, and has advantages in computational cost (favorable scaling behavior). At present, however, it has not been widely adopted by the community.

Hyenas

Because the attention mechanism has a quadratic cost in relation to sequence length, several studies have focused on how to make it linear. From this line of research, the authors of this study tried to design an architecture that could handle sequences of millions of tokens at a linear cost.

The authors then introduced a new operator called Hyena hierarchy that combines long convolutions and element-wise multiplicative gating to reduce the computational cost of self-attention without, however, losing the same expressiveness.

The trick, according to the authors, is to maintain the data control that the attention mechanism provides (via the attention matrix). In Hyena, this data control is achieved by defining an implicit decomposition into a sequence of matrices (evaluated via a for-loop). The system basically computes a series of linear projections of the data and then combines them with long convolutions. This saves a lot of computational resources.

Hyenas architecture structure from here

The Hyena operator builds on the lessons of previous models (the H3 model), which use special matrices to capture relationships within the data. In addition, there are learned filters that capture important relationships and specialize during training.

Hyenas performance from here

The performance of this architecture seems comparable, but more importantly, it is much more computationally efficient.

RWKV

The architecture of RNNs (and derivatives) was a staple of NLP and sequence modeling for a long time, until it was replaced by the transformer. Unlike classical neural networks, RNNs do not take a fixed-size input but a sequence. An intriguing fact about RNNs is that they are more versatile than people think and can be used in different scenarios: one-to-one (image classification), one-to-many (image captioning), many-to-one (sequence classification), many-to-many (sequence generation), and so on.

RNN plasticity from here

As we mentioned above, RNNs are not without problems, such as the vanishing gradient, difficulty storing information over long sequences, and difficulty with parallelization. If the first two problems were solved by LSTM and GRU, the third was solved by the transformer. During training, the transformer not only allows parallelization but also learns contextual information and even long dependencies within a sequence (at a rather high computational cost).

'During inference, RNNs have some advantages in speed and memory efficiency. These advantages include simplicity, due to needing only matrix-vector operations, and memory efficiency, as the memory requirements do not grow during inference. Furthermore, the computation speed remains the same with context window length due to how computations only act on the current token and the state.' -source

RWKV is inspired by Apple's Attention Free Transformer, though with a number of simplifications, along with a number of tricks to improve efficiency. RWKV tries to solve the context-length problem, up to modeling sequences of thousands of tokens, while also making the system parallelizable so that it can be trained on GPUs (or similar hardware).

The system retains many elements of a transformer: an embedding layer, a series of stacked blocks, layer normalization, and training by causal language modeling. What is eliminated is the attention layer of classical transformers, which is replaced with a new type of layer.

RWKV scheme from here

There are two important elements to consider:

  • Channel mixing. Channel mixing is the process where the next token being generated is mixed with the previous layer's state output to update this "state of mind" (source). This can be seen as a short-term, accurate memory.
  • Time mixing. Time mixing is a similar process, but it allows the model to retain part of the previous state of mind, enabling it to choose and store state information over a longer period, as chosen by the model (source). In this case, it is a lower-accuracy but long-term memory.

These two elements together replace the attention of the transformer and work in somewhat the same way, while reducing the computational burden.

RWKV architecture for language modeling from here

Articles describing in detail:

Suggested readings:

Large Language Models

What is a Large Language Model (LLM)?

"language modeling (LM) aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens" -from here

In short, starting from a sequence x of tokens (sub-words), we want to predict the probability of the next token. An LLM is a model trained with this goal in mind. "Large" because, by convention, it has more than 10 billion parameters.

These LLMs are obtained by scaling from the transformer and have general capabilities. Scaling means increasing parameters, training budget, and training dataset.

"Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters, which are trained on massive text data" -from here

LLM from the original article

LLMs have proven flexible across many different tasks and have shown reasoning skills, all just from text as input. That is why so many have been developed in recent years:

  • Closed source such as ChatGPT, Gemini, or Claude.
  • Open source such as LLaMA, Mistral, and so on.

LLM from the original article

Articles describing in detail:

Suggested reading:

What are emergent properties? What is the scaling law?

The scaling law

In 2020 OpenAI proposed a power law for the performance of LLMs: according to this scaling law, the model loss $L$ is related to three main factors: model size (N), dataset size (D), and the amount of training compute (C). Given these factors, we can predict the performance of a model:

scaling law from the original article
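
Schematically, each of these power laws has the loss decreasing as a power of the relevant factor, when the other two are not the bottleneck (here $N_c, D_c, C_c$ and $\alpha_N, \alpha_D, \alpha_C$ are empirically fitted constants):

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$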

A later work suggests that vocabulary size also follows a scaling law (model performance is also impacted by vocabulary size). A larger vocabulary allows more concepts and nuances to be represented. According to the authors, vocabulary size in today's models is not optimal: it is typically underestimated.

Vocabulary scaling law from the original article

After the publication of OpenAI's o1, extending the scaling law to inference time was also discussed. o1 was trained on a large amount of chain-of-thought data (i.e., with a series of intermediate reasoning steps) to improve its reasoning ability. The model is thus trained to conduct a whole series of reasoning steps, improving its ability on complex problems that require reasoning.

We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).--source

inference scaling law from here

This is why some researchers suggest that we can also talk about a kind of inference scaling law (relating inference-time compute to accuracy); the topic is still debated.

What are emergent properties?

Emergent properties are properties that appear only with scale (as the number of parameters increases)

"In the literature, emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distinguish LLMs from previous PLMs."-source

emergent_properties from the original article

emergent_properties from the original article

More formally, four definitions have been proposed for an emergent property in LLM:

  1. A property is emergent if it is present in the large models but not in the small models. A property then emerges with scale (the definition we saw above)
  2. A property that a model exhibits without being trained specifically for. For example, the emergence of in-context learning (fitting a model to a downstream task without the need for training but only providing examples) is a property that emerges that was neither anticipated nor the model was trained for (more info here and here)
  3. A property that emerges from what the model learns from the pre-training data. This definition takes the magic out of the model since it means that the model learns successfully from the data and so the emergent property is just a learned property. In addition, it means that the emergent property is derived only from the training dataset and thus from its quality (more info here)
  4. Similar to the first, but requires two conditions: sharpness (transitioning seemingly instantaneously from not present to present) and unpredictability (transitioning at seemingly unforeseeable model scales). The first condition means that, when plotting performance against model scale, the property emerges instantaneously at some point, and this transition is discontinuous (flat and then a sharp increase).

These definitions are themselves controversial. For example, definition 3 is simply the definition of learning (in unsupervised learning we expect the model to learn patterns and properties from the data). Definition 1 is confusing, since larger models are expected to be better than smaller ones (e.g., lower loss, better capabilities); if not, why spend all these resources? In agreement with definition 3, LLMs are better than small language models (SLMs) because LLMs simply have more resources to learn from the data. Consequently, we do not have a clear definition of emergence.

Moreover, not everyone agrees on the real existence of these emerging properties

" There are also extensive debates on the rationality of emergent abilities. A popular speculation is that emergent abilities might be partially attributed to the evaluation setting for special tasks (e.g., the discontinuous evaluation metrics)."-source

So using discontinuous metrics favors the appearance of discontinuous properties (aka emergent properties). In addition, only a few data points were considered (1B, 7B, 13B, 70B, 150B) with no results for intermediate points, which favors discontinuous behavior (with more data points, many properties would look less discontinuous). The unpredictability itself is also controversial: many behaviors of an LLM can be predicted with smaller models (although this is a difficult task, both technically and because there are still things we do not understand about the scaling of LLMs).

At present, only definition 2 stands (the property emerges without the model being trained for it). This definition is a bit vague and could refer to one of two cases:

  • The property emerges even if the model is not exposed to the data for that property.
  • The model is exposed to the data for the property but the developers do not know about it.

The problem is that we often do not know what is in the training set of many LLMs (especially the closed-source ones). For example, ChatGPT was not explicitly trained to play chess, but because there is a huge amount of discussion about chess on the internet, the model has been exposed to the game, its rules, moves played by users, and so on. This makes it complicated to decide whether a property is emergent or not (especially if we do not know what the model has seen).

There are also other reasons why many researchers remain skeptical of emergent properties. The first is prompt sensitivity, the phenomenon whereby LLMs respond differently depending on the prompt (even when the semantic meaning is equivalent). This phenomenon is observed in all LLMs and would seem to indicate that LLMs are statistical pattern machines more than systems capable of reasoning. In this study, they show how in-context learning depends on the structure of the prompt. In another paper, they show how perturbations of the prompt reduce performance. If the model were capable of reasoning, or if reasoning were an emergent property, this should not happen. On the contrary, the appearance of this phenomenon indicates that the model is just matching the request in the prompt against its data. This phenomenon is also called token bias, and it indicates that the model is biased toward data it saw in training:

A strong token bias suggests that the model is relying on superficial patterns in the input rather than truly understanding the underlying reasoning task--source

example of token bias in LLM from the original article

Another indication is that the model seems to regurgitate data more than understand it (the model performs correctly only if it has seen the data). This article shows that there is a correlation between model accuracy and copyright-issue risk. This agrees with the results of this study, where the performance of the model depends on the frequency of the relevant terms in the training data.

Effects on the performance of large language models that are attributable to the fact that they are statistical next-word prediction systems. from the original article

This other study suggests that emergent properties could be reduced to in-context learning, and that this is favored by instruction tuning. In-context learning (ICL) itself appears to emerge only when transformers are trained on sequences similar to those on which ICL is tested. This study suggests that there is some contamination in the datasets, since ChatGPT performs better on old benchmarks than on new ones.

In contrast, subsequent articles renew interest in emergent properties. This article agrees that we need another definition of emergent property. The authors define three characteristics that an emergent property must have (inspired by physics, where emergent properties are well characterized):

Specifically, we argue three characteristics should be observed to claim a capability is emergent (see Def. 1): beyond (i) sudden performance improvement for a specific task, we claim emergence is more likely to represent a meaningful concept if (ii) performance on several tasks improves simultaneously and (iii) there are precise structural changes in the model at the point of emergence. The intuition, borrowed from the study of emergence in other fields -source

definition of emergent properties in LLMs from the original article

The authors then state that an emergent property appears because something has changed in the structure of the model. For example, in their experiments, some properties appear because the model has learned the grammar of the system and the constraints the authors defined. They talk extensively in the paper about the relationship between memorization and generalization, which would probably be interesting to discuss in terms of grokking. The authors use a toy model and dataset, so the question about emergent properties remains open.

Articles describing in detail:

Suggested reading:

What does context length mean?

Context length is the maximum amount of information an LLM can take as input. It is generally measured in tokens (sub-words). So an LLM with a context length of 1000 tokens can take roughly 750 words (as a rule of thumb, a token is about 3/4 of a word; see the quick check after the list below). Context length affects accuracy, consistency, and how much information the model can parse:

  • The longer the context length, the slower the model generally is.
  • Models with small context lengths use resources more efficiently.
  • Larger context lengths have more conversation memory and more contextual understanding.
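
As a quick check, you can count tokens with a tokenizer library; a minimal sketch with tiktoken (cl100k_base is the encoding used by several OpenAI models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Context length is measured in tokens, not words."
tokens = enc.encode(text)
print(len(tokens), "tokens for", len(text.split()), "words")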

There are methods to extend the context length of the models:

context_length from the original article

Articles describing in detail:

Suggested reading:

What does it mean when an LLM hallucinates?

Anyone who has interacted with ChatGPT will have noticed that the model generates responses that seem consistent and convincing but are occasionally completely wrong.

hallucination from the original article

Several solutions have obviously been proposed:

  • Provide some context in the prompt (an article, Wikipedia, and so on). Or see RAG below.
  • If you have control over model parameters, you can play with temperature or other parameters.
  • Provide instructions to the model to answer "I do not know" when it does not know the answer.
  • Provide examples so it can better understand the task (for reasoning tasks).
  • Adding an external memory

Investigating further, this study provides three reasons why hallucinations emerge, specifically for a model that generates a sequence of tokens in an autoregressive manner:

  • The LLM commits to one token at a time. Even when we choose a low temperature, during decoding we maximize the likelihood of each token given the previous tokens, but the probability of the entire correct sequence may be low. Once part of the sequence is generated, the model continues from there and does not correct itself (for a sequence to be completed such as "Pluto is the", if the model continues with "Pluto is the smallest," it will likely continue with "Pluto is the smallest dwarf planet in our solar system." and not with the correct completion). The model completes a sentence that it does not know how to complete. For example, when describing a city, the model plans to describe its population, but having no information about the population in its parametric memory, it generates a hallucination.
  • The second reason is that, although there are multiple correct ways to complete the sequence, an incorrect one may have a higher likelihood.
  • Third, when we use an LLM we do not always take the maximum-probability next word, but sample according to the distribution over words. This means that in some cases we sample words that result in false information (see the sketch below).
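
The third point is easy to see in a toy sketch of temperature sampling (the vocabulary and logits are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
tokens = ["dwarf", "smallest", "ninth"]
logits = np.array([2.0, 1.0, 0.5])  # "dwarf" is the most likely continuation

def sample(logits, temperature=1.0):
    p = np.exp(logits / temperature)
    p /= p.sum()
    return rng.choice(len(logits), p=p)

# Sampling does not always pick the argmax: lower-probability (possibly
# wrong) continuations are occasionally drawn.
print([tokens[sample(logits)] for _ in range(10)])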

hallucination LLM causes from the original article

Other studies point to problems with the training dataset among the causes. Lack of relevant data to answer a question can lead to hallucinations, as can the presence of duplicated data: duplicated data impacts the performance of the model, with smaller models seeming more sensitive to repeated data. Repetitions seem to cause the model to memorize this data, leading to performance degradation (for more details, you can read here and here). For some authors, hallucinations also derive from inherent model limitations. This study shows that LLMs still rely on memorization at the sentence level and statistical patterns at the corpus level instead of robust reasoning, and this is one of the reasons hallucinations manifest. This is also observed in the fact that LLMs are sensitive to the reversal curse (a lack of logical deduction, where an LLM trained on "A implies B" hallucinates when questioned on "B implies A"; here more details). Other causes of hallucinations include the tendency of the model to be overconfident (here), favoring co-occurring words over factual answers and thus generating spurious correlations (here), and the tendency toward sycophancy to please the user (here).

an example of sycophancy, from the original article

Recent studies show how fine-tuning and instruction tuning can increase an LLM's tendency to hallucinate (there is a correlation between fine-tuning examples unknown to the model and the tendency to hallucinate on pre-existing knowledge). Hallucinations may then also emerge from a discrepancy between new knowledge and previously acquired knowledge (more details here).

There is a dissonance between the meaning of the term "hallucination" in human psychology ("when you hear, see, smell, taste, or feel things that appear to be real but only exist in your mind") and what is meant by hallucination in machine learning. A recent article takes care to align these two definitions, dividing the hallucinations seen in LLMs into different types. This new classification is interesting because it is difficult to resolve all causes of hallucination in an LLM with a single approach. With a classification of the various subtypes, one can instead act on each subtype (perhaps on the one most relevant to our task):

By grounding our discussion in specific psychological constructs, we seek to shed light on these phenomena in language models, paving the way for the development of targeted solutions for different types of "hallucinations." - source

hallucination LLM causes from the original article

For example, confabulation is a hallucination that emerges from the LLM unpredictably, owing to internal factors that are unrelated to the prompt. In a sense, this type of hallucination is associated with the LLM's uncertainty in responding to the prompt. This paper shows that high uncertainty in the response is an indication of confabulation (this uncertainty can be estimated with an entropy associated with the meaning of the response).

Another type of hallucination is contextual hallucination. In this case, although we provide the context (and thus the correct facts) in the prompt, the model fails to generate the correct output. According to this study, contextual hallucinations are related to the extent to which an LLM attends to the provided contextual information: they depend on the ratio between the attention weights devoted to the context and those devoted to the newly generated tokens. One can therefore predict when a model will generate this kind of hallucination by extracting the attention weights and training a linear classifier.

hallucination RAG causes from the original article

One reason for the emergence of these contextual hallucinations is that the model gives too much attention to noise:

we visualize the normalized attention scores assigned to different parts of the context by a Transformer. The task is to retrieve an answer embedded in the middle of a pile of documents. The visualization reveals that Transformer tends to allocate only a small proportion of attention scores to the correct answer, while disproportionately focusing on irrelevant context. --source

Transformer often over-attends to irrelevant context (i.e., attention noise) from the original article

Too much attention score is thus assigned to irrelevant context; this confuses the model and leads it to generate a wrong response. This misallocation of attention scores is a major cause of contextual hallucinations, so one way to reduce this type of hallucination is to reduce the noise in the attention pattern. This article proposes the differential transformer (DIFF Transformer) for the very purpose of reducing this noise.

The differential attention mechanism is proposed to cancel attention noise with differential denoising. Specifically, we partition the query and key vectors into two groups and compute two separate softmax attention maps. Then the result of subtracting these two maps is regarded as attention scores. The idea is analogous to differential amplifiers [19] proposed in electrical engineering, where the difference between two signals is used as output, so that we can null out the common-mode noise of the input. --source

The differential attention mechanism maps query, key, and value vectors to outputs from the original article

Evaluation of contextual hallucination on text summarization and question answering from the original article

In simple words, the model learns how to reduce the weight given to attention noise and to focus only on the important information. The system is inspired by the idea that the difference between two signals cancels out their common noise. On two retrieval tasks, the authors show that the DIFF Transformer reduces contextual hallucinations (they focus on the cases where the input context contains correct facts but the model still fails to produce accurate outputs). They also show that the model improves in both in-context learning and long-context retrieval. A minimal sketch of the mechanism follows.
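A minimal NumPy sketch of the differential attention idea (single head, no masking; `lam` stands in for the learnable scalar λ of the paper, and all shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    # Two separate query/key projections give two softmax attention maps;
    # their difference cancels the common-mode "attention noise".
    d = Wk1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))
    return (A1 - lam * A2) @ (X @ Wv)

rng = np.random.default_rng(0)
n, d_in, d_k = 6, 16, 8
X = rng.normal(size=(n, d_in))
out = diff_attention(X, *(rng.normal(size=(d_in, d_k)) for _ in range(5)))
print(out.shape)  # (6, 8)
```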


Another interesting question is whether one can identify the presence of an error from the internal state of the model. This paper discusses how signals about the truthfulness of the response can be derived from the model's internal state:

another line of work has explored the internal representations of LLMs, suggesting that LLMs encode signals of truthfulness. In this work, we reveal that the internal representations of LLMs encode much more information about truthfulness than previously recognized. source

An interesting point is that they describe hallucinations as a function of model responses and model perception of error. They conduct sampling for each question and check whether the model responds correctly or incorrectly:

More generally, we categorize the errors by logging three specific features for each example: (a) the number of different answers generated; (b) the frequency of the correct answer; and (c) the frequency of the most common incorrect answer. source

This reveals some interesting patterns:

  • Refuses to answer. The model responds that it cannot answer the question.
  • Consistently correct. Answers correctly in at least half of the cases. Interestingly, the model either always responds correctly or sometimes responds incorrectly despite being right in most cases.
  • Consistently incorrect. Consistently generates the same incorrect response in at least half of the cases. Paradoxically, in some cases the model responds incorrectly for most of the samples and occasionally responds correctly, which shows that the model has some understanding of the subject even when it is wrong.
  • Two competing. Generates both correct and incorrect responses at similar rates.
  • Many answers. Generates over 10 distinct answers.

taxonomy of errors generated by the LLM. The figure illustrates three representative error types. from the original article

After that, they conduct an interesting experiment: by accessing the internal state of the model, they train a classifier to see if they can predict the error. This shows that the model has a greater concept of truthfulness than previously thought (truthfulness here refers to how an LLM internally represents the accuracy of its responses). Second, understanding how the model represents truthfulness allows us to develop diagnostic tools and thus reduce hallucinations. Contrary to earlier thinking, LLMs do not have a universal sense of truth; rather, it is associated with specific skills. This makes it more complicated to develop diagnostic tools, because a probe developed within one skill (such as factual accuracy) fails to generalize across different types of tasks, such as from fact-checking to sentiment analysis. A minimal sketch of such a probe follows.
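A minimal sketch of such a probe on synthetic data (in practice, `hidden_states` would be activation vectors collected from a chosen layer of the LLM, one per answer, with labels marking whether each answer was correct; the setup here is an assumption for illustration, not the paper's exact protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for hidden-state vectors and correctness labels.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 64))
is_correct = (hidden_states[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, is_correct, test_size=0.2, random_state=0)

# A linear probe: if it predicts errors better than chance, the internal
# representation carries a signal about truthfulness.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```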

taxonomy of errors generated by the LLM. This implies that the apparent generalization does not stem from a universal internal encoding of truthfulness but rather reflects information already accessible through external features like logits. from the original article

Also interesting is the fact that an LLM can generate a wrong answer even when its internal state encodes the correct answer (a discrepancy between the model's external behavior and its internal state). According to the authors, this could result from the fact that models have a bias toward predicting the most likely tokens, which overrides the internal mechanisms promoting truthfulness; after all, the correct answer may not be the most likely one.

Articles describing in detail:

Suggested lecture:

Do LLMs have biases?

Yes.

Garbage in, garbage out: most biases come from the training dataset, and LLMs inherit them. Biases are a particularly tricky problem, especially if the model is to be used for sensitive applications.

" Laying behind these successes, however, is the potential to perpetuate harm. Typically trained on an enormous scale of uncurated Internet-based data, LLMs inherit stereotypes, misrepresentations, derogatory and exclusionary language, and other denigrating behaviors that disproportionately affect already-vulnerable and marginalized communities"-source

bias from the original article

Articles describing in detail:

Suggested lecture:

What are adversarial prompts?

Adversarial prompting is an interesting field of both research and application because it serves to understand the limits and the safety of a model. These techniques have been shown to work on ChatGPT and other LLMs.

For example, prompt injection is a technique in which several prompts are combined: a safe one, and one designed to elicit unexpected behavior instead.

an example of prompt injection from Twitter:

prompt

Jailbreak and DAN (Do Anything Now) are examples of techniques used to overcome safety controls and try to make a model say something illegal.

jailbreak

Articles discussing the topic:

Suggested lecture:

What is model merging?

Model merging, also known as model fusion, is an effective technique that merges the parameters of multiple separate models with different capabilities to build a universal model without needing access to the original training data or expensive computation - source

Model merging is an exciting technique that has emerged recently. In ensemble learning, we have several models whose predictions we merge during the inference phase (e.g., in a random forest), whereas in model merging we merge the various models at the parameter level, obtaining a single model (which yields significant computational savings at inference).

model merging overview from here

There are several methods for conducting merging, these strategies can be divided into two groups of approaches:

  • Before Merging Methods - methods applied before the actual merging to make the merging easier afterwards. For example, when model architectures are inconsistent, they must be modified before merging. Other approaches try to align model parameters before merging.
  • During Merging Methods - methods that solve potential problems arising during merging and then actually merge the models.

model merging methods overview from here

Before merging models, we need to be sure that we can merge them. In this study, the authors merged several GANs with different architectures; to conduct the merging, they transformed the various GANs into a specific target architecture. In this other study, the authors proposed merging several LLMs (NH2-Mixtral-8x7B, NH2-Solar-10.7B, OpenChat-3.5-7B) into one model. These models had different architectures and scales, so the authors applied a knowledge distillation step to transform all the architectures to match that of OpenChat-3.5-7B, and only then merged them.

When you have the same architecture but models that have been fine-tuned differently, you can use other approaches. According to the linear mode connectivity (LMC) property of neural networks, there is a connected path between local minima in the loss landscape, so there are multiple local minima in the weight space that can represent the same features. Some authors have therefore studied permutations of the weights to align the models before merging.

model premerging methods overview from here

Basic merging methods can be as simple as averaging the parameters of multiple models (not recommended, because it leads to poor results). An interesting approach defines task vectors, which represent the difference between a fine-tuned model and its base model. Adding a task vector to the base model leads to learning a task, and subtracting this vector from the fine-tuned model leads to forgetting it. This approach can then be used to combine different tasks on models (or to work with adapters); see the sketch after the figure below.

model merging task vectors from here
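A minimal sketch of task-vector arithmetic on checkpoints represented as plain dictionaries of arrays (in practice these would be PyTorch state_dicts with matching keys; the tiny one-parameter "models" are purely illustrative):

```python
import numpy as np

def task_vector(base, finetuned):
    # tau = theta_finetuned - theta_base, computed per parameter tensor
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, vectors, scale=1.0):
    # theta_merge = theta_base + scale * sum(tau_i); a negative scale on a
    # single vector corresponds to "forgetting" that task
    merged = {k: v.copy() for k, v in base.items()}
    for tv in vectors:
        for k in merged:
            merged[k] = merged[k] + scale * tv[k]
    return merged

base = {"w": np.zeros(3)}
ft_a = {"w": np.array([1.0, 0.0, 0.0])}   # fine-tuned on task A
ft_b = {"w": np.array([0.0, 1.0, 0.0])}   # fine-tuned on task B
merged = apply_task_vectors(base, [task_vector(base, ft_a),
                                   task_vector(base, ft_b)], scale=0.5)
print(merged["w"])  # [0.5 0.5 0. ]
```

The `scale` argument plays the role of the weighting coefficients discussed next: a weighted merge simply uses a different coefficient per task vector.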

Other possibilities include weight-based merging. The basic idea is that different models (or task vectors) represent different functions, and each has a certain importance. These methods try to determine the importance coefficients: instead of simply averaging the weights, the merge becomes $\Theta_{\text{merge}} = \lambda_1^* \Theta^{(1)} + \lambda_2^* \Theta^{(2)}$, where $\Theta$ are the parameters of a model.

These coefficients can then be learned by minimizing a loss function.

Another alternative is to exploit the over-parameterized nature of neural networks and model pruning. In other words, sparsify the models before merging. This approach is referred to as Subspace-based Merging because it then projects the models into a small, sparse space. The advantage of this approach is that it potentially eliminates parameter conflicts, reduces noise during merging, and results in a smaller model.

The previous approaches are static approaches, so some authors have proposed dynamic approaches that conduct different merging depending on the tasks or datasets. This idea is the basis of Routing-based Merging.

model merging from here

Some challenges still remain:

  • There is still a gap between independent models and models obtained after merging. This means that the merged models must then be fine-tuned, and this performance-recovery process is delicate.
  • The merging process is expensive and inefficient. The resulting model performs well at inference, but the process to obtain it requires large computational resources.
  • Merging homogeneous models is much easier than merging heterogeneous models; however, heterogeneous models are more diverse and perform better overall. Before we can conduct merging, we have to transform heterogeneous models into homogeneous ones, and in this step we lose some of the performance.
  • There is still a lack of in-depth theoretical analysis of model merging. Many of the approaches have an empirical nature rather than a theoretical background.
  • Legal boundaries are not well defined. It is still unclear whether merging models that are intellectual property into a new model makes the latter new intellectual property or not.

Suggested lectures:

What is the role of small models in the LLM era?

small model overview from here

LLMs have shown incredible abilities and some reasoning properties. Motivated by the growth in performance, this has led to an increase in the number of parameters and to larger and larger models. Obviously, this scaling has a huge computational and energy cost. The question therefore remains: is there still room for small models?

small model versus large model comparison from here

Building on these differences, there is a heated discussion about the future of small LLMs. We can imagine that there are two potential settings:

  • Collaboration. In this case large LLMs and small models collaborate, thus trying to balance efficiency and accuracy
  • Competition. In certain niches it is preferable to leverage a small model than a large model, especially when we are interested in low cost, ease of deployment, and interoperability.

Speaking of collaboration we can see several possibilities:

  • Data curation. It is critical for an LLM to have a good training dataset (its performance largely depends on it). A small model can help in data curation by allowing faster filtering and deduplication of the dataset, or a small LM can be used to curate instruction-tuning datasets. There is also weak-to-strong generalization, in which a small LM is used to improve the capabilities of a larger model. This is especially useful because the capabilities of LLMs are getting stronger and stronger, and you need experienced people to curate alignment datasets.
  • Efficient inference. As the number of users increases, the cost of inference grows. Since user queries are not always complex, simpler models might be enough to answer them, so it has been proposed to make inference more efficient by using an ensemble of models of different sizes. Alternatively, one can use model cascading, where models of increasing size are arranged in a cascade: a smaller model tries to answer and transfers the query to a larger one if it has difficulty (see the sketch after the figure below). Model routing is another alternative, where a router decides which LLM responds. Speculative decoding is yet another technique, in which a small LM helps the LLM by proposing multiple candidate tokens in parallel that are then evaluated and refined by the LLM, allowing faster inference.
  • Evaluating LLMs. Evaluating LLMs is a complex task, and the more they improve their skills, the more difficult it becomes for humans to create benchmarks. A small LM can be used to analyze the performance of a model, or to predict the potential performance of an LLM we want to train (e.g., for fine-tuning or instruction tuning).
  • Domain adaptation. Fine-tuning large models is quite expensive (especially above 100B parameters), while fine-tuning a small model is much cheaper, and the two types of models can be combined. In white-box adaptation, we use a domain fine-tuned small model to adjust the token distribution of a frozen LLM for a specific domain. In black-box adaptation, we use a small model to provide knowledge to the LLM to answer a question or conduct a task; Retrieval Augmented Generation (RAG) is an example of this approach, in which a small retrieval model provides knowledge to the LLM.
  • Prompt-based learning. In this case, LLM prompts are crafted to facilitate the resolution of a task (e.g., in few-shot learning). A small model can be used to improve the prompt that is used with the large model.
  • Deficiency repair. An LLM can generate outputs that are toxic, hallucinatory, or incorrect, and one can use a small LM to correct these defects. An interesting approach is contrastive decoding, where you use an LLM and a small LM and exploit the contrast between their outputs, choosing tokens that maximize the difference in their log-likelihoods.

How small LM model can enhance LLMs from here
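A minimal sketch of model cascading (`small_lm` and `large_lm` are hypothetical stand-ins returning an answer and a confidence score, e.g., a mean token log-probability; the threshold is arbitrary):

```python
def small_lm(query):
    # Hypothetical small model: returns (answer, confidence score).
    return "draft answer", (-0.2 if len(query) < 50 else -1.5)

def large_lm(query):
    # Hypothetical large model: slower but more reliable.
    return "careful answer", -0.1

def cascade(query, threshold=-0.5):
    answer, confidence = small_lm(query)
    if confidence >= threshold:
        return answer            # cheap path: the small model is confident
    return large_lm(query)[0]    # escalate hard queries to the large model

print(cascade("Which is the color of the sea?"))
```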

Similarly, LLMs can enable the enhancement of small models:

  • Knowledge Distillation. This is the most classic case, where we use the LLM's knowledge to train a small LM. Depending on whether we have access to the LLM's internals, we can have two different approaches: white-box, where we exploit the output distribution, or black-box, where we use the LLM to create a dataset for training.
  • Data Synthesis. It is now efficient to create quality datasets using an LLM to train small LMs. An LLM can also be used to augment a dataset we have already collected (data augmentation).

How large LM model can enhance small LM from here

There are three scenarios in which it is better to have a small LM than a large LLM:

  • Computation-constrained environment. LLMs are expensive both to train and at inference, and this also means more latency. You don't always have the resources for training or deployment; in these conditions it is better to opt for a small LM. Often you do not even need an LLM but rather a model fine-tuned on a specific dataset.
  • Task-specific environment. For some domains the amount of data is limited, and an LLM could not be trained on it. A small LM fine-tuned on these data gives better results than a large LLM (fine-tuning an LLM is an expensive task, and conducting multiple updates is not ideal).
  • Interpretability-required environment. As regulations grow, there are areas or tasks where an interpretable model is required. For example, in healthcare it is important to have interpretable models and to understand why a model makes a decision. Smaller models are more interpretable.

Evolution of small LLMs

Small LMs (SLMs) are not simply LLMs with fewer parameters; they too have gone through an evolution. Recently, this survey analyzed them:

With such criteria, we select 59 SLMs as detailed in Table 1. Our selection encompasses a wide range of models from both industry and academia, based on factors such as model architecture, parameter size, and data availability. While all selected SLMs share similar architectures, they differ in specific hyperparameters and training datasets, with some datasets remaining closed-source. These variations lead to differing performance across tasks--source

In this survey, the authors present the various modifications that have been made with respect to the original transformer.

transformer modification in SLMs The architecture modification of the SLM, highlighting 6 configurations: attention type, FFN type, FFN ratio, FFN activation, vocabulary size, and normalization type. from here

The attention mechanism has been modified since its first appearance, and today we can see four types of attention mechanisms:

  • Multi-Head Attention (MHA), the original version, where each head has its own query, key, and value projections.
  • Multi-Query Attention (MQA), a simplified version of MHA where a single key head and a single value head are shared across all query heads (each head keeps its own query projection). This reduces both complexity and inference time.
  • Group-Query Attention (GQA). Similar to MQA, but query heads are divided into groups, and each group shares one key head and one value head. This system tries to reduce complexity while preserving the expressiveness of the model (effectiveness and diversity). A shape-level sketch of MHA, GQA, and MQA follows the figure below.
  • Multi-Head Latent Attention (MLA), which uses low-rank joint compression of keys and values to reduce the complexity of the attention mechanism. In other words, there is a projection step before the MHA computation.

The type of self-attention in SLMs from here
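A shape-level NumPy sketch of the three schemes: the only difference is how many key/value heads exist and how each is shared among the query heads (all sizes are illustrative):

```python
import numpy as np

def attention(q, k, v):
    # q: (h_q, n, d); k, v: (h_kv, n, d) with h_q divisible by h_kv
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)   # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 4, 8
q = rng.normal(size=(8, n, d))  # 8 query heads in all three cases
mha = attention(q, rng.normal(size=(8, n, d)), rng.normal(size=(8, n, d)))  # 8 KV heads
gqa = attention(q, rng.normal(size=(2, n, d)), rng.normal(size=(2, n, d)))  # 2 KV groups
mqa = attention(q, rng.normal(size=(1, n, d)), rng.normal(size=(1, n, d)))  # 1 shared KV head
print(mha.shape, gqa.shape, mqa.shape)  # all (8, 4, 8)
```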

Two variants exist today with regard to feed-forward networks:

  • Standard FFN. Two layers of neurons with a nonlinear activation (i.e., a classical MLP).
  • Gated FFN. There is an additional gate layer that allows the network to control and regulate the flow of information (see the sketch after the figure below).

The type of feed-forward neural network in SLMs from here
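A minimal NumPy sketch of the two variants (weight shapes are illustrative; modern gated FFNs typically use SiLU, as in SwiGLU):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def standard_ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2              # two layers with ReLU

def gated_ffn(x, W_gate, W_up, W_down):
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down  # gate modulates the flow

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
print(standard_ffn(x, rng.normal(size=(16, 64)), rng.normal(size=(64, 16))).shape)
print(gated_ffn(x, *(rng.normal(size=s) for s in [(16, 64), (16, 64), (64, 16)])).shape)
```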

Another parameter that is evaluated is the intermediate ratio of the feed-forward network (the size of the hidden layer relative to the input/output dimension). For example, if the intermediate ratio is 4, the hidden layer in the middle of the feed-forward network is 4 times larger than the input/output layers of the network. This expansion (also called temporal expansion) allows more complex transformations, and thus the learning of a better representation of the data.

Vocabulary size is also an important parameter, and it has varied over the years.

The type of feed-forward neural network in SLMs from here

There have also been changes in the activation function of the feed-forward neural networks (FFN); in transformers the following functions are found:

  • ReLU (Rectified Linear Unit).
  • GELU (Gaussian Error Linear Unit), which allows a smooth transition between zero and positive values.
  • SiLU (Sigmoid Linear Unit), a combination of the properties of the sigmoid and ReLU.
  • GELU TanH (a tanh-based approximation of GELU).

 The activation function of the feed-forward neural network from here

Finally, we have several types of layer normalization:

  • LayerNorm normalizes all the features in a layer. It takes the mean and variance of all the features in a layer (or the entire set of input features) and scales them to have zero mean and unit variance. After normalization, learned scaling and shift parameters are applied.
  • RMSNorm focuses only on normalizing the variance (i.e., the "scaling" of the input), without centering the data by subtracting the mean. It calculates the root mean square (RMS) of the input and scales the data accordingly (see the sketch after the figure below).

The type of layer normalization in SLM from here
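A minimal NumPy sketch contrasting the two normalizations (learned scale and shift parameters omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # center, then scale to unit variance

def rms_norm(x, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return x / rms                          # scale only, no centering

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero-mean, unit-variance features
print(rms_norm(x))    # same direction as x, rescaled by its RMS
```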

The authors analyze the different trends by showing:

  • MHA appears to be slowly being replaced by GQA.
  • Standard FFN is being replaced by Gated FFN. Today almost all models use the latter.
  • The intermediate ratio of the feed-forward neural network is set at 4 for the Standard FFN, while there is a wider range for the Gated FFN (2 to 8).
  • The vocabulary size is the number of unique tokens an SLM can recognize and this value has grown over the years to be greater than 50K
  • The type of layer normalization in the original transformer was LayerNorm, today most new models use RMS normalization

architecture trends in SLM from here

As of August 2024, a typical SLM architecture tends to use group-query attention, gated FFN with SiLU activation, an intermediate ratio of FFN between 2 and 8, RMS normalization, and a vocabulary size larger than 50K. However, the choice of such settings is mostly empirical, without strict and public validation on the superiority of such model’s capacity --source

Another intriguing point is that there are some architectural innovations specific to SLMs that are present in modern SLMs today: layer-wise parameter scaling and nonlinearity compensation.

Layer-wise parameter scaling (often realized as parameter sharing) is a technique that allows the same set of weights to be reused in different parts of the model (thus reducing the number of parameters while maintaining performance). We typically see two approaches:

  • Connect input and output embeddings: input embeddings (used to convert words or tokens into vectors) and output embeddings (used to predict the next token) can share the same parameters. This reduces the number of learned parameters without a significant loss of accuracy (see the sketch after the figure below).
  • Layer sharing (weight tying): some models share parameters among different layers of the network. Instead of learning a different set of weights for each layer, a single set of weights is reused across multiple layers. For example, the weights of the attention mechanism or of the feed-forward layers can be the same in different layers, helping to reduce the model footprint.

parameter sharing in small language model from here
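A minimal sketch of input/output embedding sharing (shapes are illustrative):

```python
import numpy as np

vocab, d_model = 50_000, 512
embedding = np.random.randn(vocab, d_model) * 0.02  # one shared matrix

token_ids = np.array([1, 42, 7])
hidden = embedding[token_ids]   # input side: embedding lookup
logits = hidden @ embedding.T   # output side: the same matrix, transposed
print(logits.shape)             # (3, 50000) -- no separate output matrix learned
```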

Nonlinearity compensation was introduced by this article, where the authors note the so-called feature collapse phenomenon: in the deeper layers of an LLM, the variety of the features (or of their representations) is reduced. So in deeper layers features are more similar to each other, which impacts the quality and creativity of the model. In this paper, they solve the problem by increasing the nonlinearity of the model, with augmented shortcuts in the Multi-Head Attention (MHA) and a series activation function in the FFN layer.

Nonlinearity compensatio in small language model from here

They note in the article that data quality is critical to the quality of the model; for example, state-of-the-art SLMs take special care with their datasets. Also, today SLMs are trained over large amounts of tokens (typically >1.5T, and even up to 12T). These models would theoretically be over-trained (at least according to Chinchilla's scaling law), but being over-trained has utility for SLMs that are deployed on resource-constrained devices. The performance of SLMs grew from 2022 to 2024, showing future potential for SLMs on devices. It is true that larger models usually perform better, but smaller models can still excel at specific tasks. SLMs show good in-context learning ability, although they remain inferior to the large closed-source LLMs. SLMs are closing the gap with closed-source models on commonsense tasks, though there remains a significant gap in tasks requiring complex reasoning or logic.

SLM capabilities over time from here

Other results shown in the survey are:

  • Apart from the model size, the model architecture also impacts latency. Factors such as the number of layers, the width of the FFN, the vocabulary size, and whether parameters are shared play significant roles. The impact on memory is related to the number of parameters and the size of the vocabulary.
  • Greater quantization precision means better performance; 4-bit seems to be a sweet spot.
  • The prefill phase (feeding the prompt to the model) is the phase that uses the GPU most heavily in inference, while decoding is conducted one token at a time. Matrix-by-vector multiplication seems to be the most time-consuming operation (about 70% of end-to-end inference time). Increasing the context length seriously impacts memory utilization.

Suggested lectures:

Prompt engineering

What is a prompt? What is prompt engineering?

A prompt is a textual instruction used to interact with a Large Language Model. Prompt engineering, on the other hand, is a set of methods and techniques for developing and optimizing prompts. Prompt engineering is specifically designed to improve the capabilities of models on complex tasks that require reasoning (question answering, solving mathematical problems, and so on).

Prompt engineering can have other functions such as improving LLM safety (instructions that serve to prevent the model from responding in a toxic manner) or providing additional knowledge to a model


A prompt, in its simplest form, is a set of instructions or a question. In addition, it might also contain other elements such as context, inputs, or examples.

prompt:

Which is the color of the sea?

output from a LLM:

Blue

Formally, a prompt contains one or more of these elements:

  • Instruction. Information about the task you want the LLM to execute.

  • Context. Additional or external information the model has to take into account.

  • Input data. The input data to be processed.

  • Output indicator. Additional requirements we can provide (the type or format of the output).

    prompt:

      Classify the sentiment of this review as in the examples:
    
    The food is amazing - positive
    the chicken was too raw - negative
    the waitress was rude - negative
    the salad was too small -
    

    Articles describing in detail:

    Suggested lecture:

What is zero-shot prompting? What is few-shot prompting?

LLMs are trained with a large amount of text. This allows them to learn how to perform different tasks. These skills are honed during instruction tuning. As shown in this article, instruction tuning improves the model's performance in following instructions. During reinforcement learning from human feedback (RLHF) the model is aligned to follow instructions.

Zero-shot means that we provide nothing but the instruction. The model must therefore understand the instruction and execute it:

zero-shot prompt:

  Classify the sentiment of this review :
  review: the salad was too small
  sentiment:

This is not always enough, so sometimes it is better to help the model understand the task. In this case, we provide some examples that help the model improve its performance.

This was discussed during the GPT-3 presentation:

few shot learning from the original article

Then a few-shot prompt is:

  Classify the sentiment of this review as in the examples:

The food is amazing - positive
the chicken was too raw - negative
the waitress was rude - negative
the salad was too small -

A couple of notes:

What is Chain-of-Thought (CoT)?

Chain-of-thought (CoT)

Chain-of-thought (CoT) prompting is a technique that pushes the model to reason through intermediate steps. In other words, we provide the model with intermediate steps to solve a problem so that it understands how to approach it:

"We explore the ability of language models to perform few-shot prompting for reasoning tasks, given a prompt that consists of triples: input, a chain of thought, and output. A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output, and we refer to this approach as chain-of-thought prompting."--source

Cot Prompting from the original article

zero-shot Chain-of-thought (CoT)

Instead of having to provide examples of intermediate reasoning, the authors of this study found that simply providing "Let's think step by step" was enough to push the model to reason through intermediate steps:

" Despite the simplicity, our Zero-shot-CoT successfully generates a plausible reasoning path in a zero-shot manner and reaches the correct answer in a problem where the standard zero-shot approach fails. Importantly, our Zero-shot-CoT is versatile and task-agnostic, unlike most prior task-specific prompt engineering in the forms of examples (few-shot) or templates (zero-shot)"--source

zero-shot Cot Prompting from the original article
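A minimal illustrative zero-shot CoT prompt (the question is invented for this FAQ):

prompt:

  Q: A bakery sold 14 cakes in the morning and twice as many in the afternoon. How many cakes did it sell in total?
  A: Let's think step by step.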

Constrained-CoT

This is another variation, in which the model is forced to reduce the number of tokens in its output. According to the authors, today's LLMs are unnecessarily verbose and produce more tokens than necessary; in addition, larger models are generally more verbose and thus produce more tokens. This has a latency cost (as well as, obviously, a computational one) that can be problematic for serving users. More verbosity also means a lot of irrelevant, unnecessary detail and a greater risk of hallucinations. Moreover, CoT increases the number of generated tokens by requiring intermediate reasoning steps to be generated.

verbosity of CoT prompting CoT is increasing the number of generated tokens from an LLM. from the original article

Therefore, the authors of this study suggest a new prompt that is a variation of the zero-shot CoT prompt: "Let's think step by step and limit the length of the answer to n words", with n being the desired number of words.

constrained CoT prompt Example of constrained CoT prompting, where n is 45. from the original article

For the authors, this prompt not only reduces the number of tokens generated but in several cases also leads to better reasoning (more exact answers on a reasoning benchmark). This better reasoning is seen only with some models (LLaMA-2 70B, and not with smaller models).

Is CoT really useful?

"CoT only helps substantially on problems requiring mathematical, logical, or algorithmic reasoning" -source

A recently published study suggests that CoT is useful only in some cases, such as problems requiring mathematical or logical reasoning, while for problems requiring commonsense reasoning there is not much advantage. CoT would thus help only when you have a symbolic reasoning problem ("We consider a problem to be symbolic if it can be grounded in a natural, well agreed-upon formal system"), that is, one that can be described in a formal system.

is Chain-of-thought really useful from the original article

CoT primarily helps with the execution step that performs computation and symbolic manipulation, but falls short of what LLMs with tool augmentation can do. We find that LMs prompted with CoT can generate executable formal solution plans and execute those plans better than direct answering. But using LMs to generate a solution plan and then using an external symbolic solver to solve the plan outperforms using CoT for both stages for these tasks. -source

In other words, CoT is needed when the problem requires a multi-step solution and you need to force the model to adhere to a plan. However, if we have the plan, we can more efficiently use an external tool. For example, if we need to conduct calculations to solve a mathematical problem, we can either use CoT or use the LLM to generate an executable plan for another tool (such as Python code to run with the interpreter). The second approach gives better results, according to the authors.

is Chain-of-thought really useful from the original article

"First, CoT is unnecessary for many problems where it is widely employed: there exist more efficient prompting strategies that yield similar performance for much lower inference cost. Second, we see a critical need to move beyond prompt-based CoT to more sophisticated approaches based on search, interacting agents, or models more heavily fine-tuned for CoT. " -source

CoT is expensive and should not be used for problems where it brings no benefit (commonsense reasoning). Second, for symbolic problems we can leverage better approaches than simple CoT (such as agents and external tools).

Articles describing in detail:

Suggested lecture:

What is Self-Consistency?

Self-consistency is, in simple terms, generating multiple chains of thought and selecting the most consistent response:

"We first prompt the language model with chain-of-thought prompting, then instead of greedily decoding the optimal reasoning path, we propose a “sample-and-marginalize” decoding procedure: we first sample from the language model’s decoder to generate a diverse set of reasoning paths; each reasoning path might lead to a different final answer, so we determine the optimal answer by marginalizing out the sampled reasoning paths to find the most consistent answer in the final answer set."--source

Self-Consistency Prompting from the original article

For the authors, this approach is analogous to how humans reason. In other words, humans test multiple ways of thinking and evaluate whether these lead to the same response. If there is consensus among the various responses, we humans feel more confident.

An example question:

  Henry made two stops during his 60-mile bike trip. He first stopped after 20 miles. His second stop was 15 miles before the end of the trip. How many miles did he travel between his first and second stops?

The first answer is wrong:

  He traveled 60 miles in total. He stopped after 20 miles, so he traveled 60 - 20 = 40 miles between the first and second stops. The answer is 40

The second time:

 Henry traveled a total of 60 miles. His first stop was 20 miles in, and his second stop was 15 miles before the end. So between his first and second stops he travelled 60 - 20 -15 = 25 miles. The answer is 25. 

Third time:

  He made two stops during a 60-mile trip. The first was 20 miles into the trip. The second was 15 miles before the end of the trip. This means the second stop was 60 - 15 = 45 miles into the trip. Since he made the stops in order, the second stop must have been 45 - 20 = 25 miles after the first stop. The answer is 25.

We have an answer that holds the majority. This is a simplified version; a minimal code sketch of the voting procedure follows.
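A minimal sketch of the sample-and-vote step (`generate` is a stand-in for a sampled LLM call at temperature > 0; the canned completions echo the example above and the answer extraction is a naive heuristic):

```python
from collections import Counter
import random
import re

def extract_final_answer(text):
    # Naive heuristic: take the last number mentioned in the completion.
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else text.strip()

def generate(prompt, temperature=0.7):
    # Stand-in for a sampled LLM call returning a CoT completion.
    return random.choice(["... so 60 - 20 = 40. The answer is 40",
                          "... so 60 - 20 - 15 = 25. The answer is 25",
                          "... 45 - 20 = 25. The answer is 25"])

def self_consistency(prompt, n_samples=9):
    answers = [extract_final_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("Henry made two stops during his 60-mile bike trip..."))
```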

Articles describing in detail:

Suggested lecture:

What is Tree of Thoughts (ToT)?

Tree of Thoughts (ToT) can be seen as a generalization of CoT, especially for more complex tasks. The idea is that the model maintains some degree of exploration to arrive at the solution.

In this framework, the model produces several steps of thought as it moves toward solving the problem (similar to CoT or self-consistency). In this case, though, the model not only generates these thoughts but also evaluates them. In addition, a search algorithm (breadth-first search or depth-first search) allows exploration but also backtracking.

TOT from the original article

An interesting development is to add an agent to conduct the search. In this article, they add a controller that precisely controls the search in the ToT

TOT2 from the original article

A clarification: these two methods are both complex and laborious. In the first, you have to generate the various steps, evaluate them, and conduct the search. In the second, an external module trained with reinforcement learning conducts the control and the search. So, instead of conducting multiple calls, Hulbert proposed a simple prompt to conduct ToT. This approach is simplified but seems to give superior results to simple CoT:

Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realises they're wrong at any point then they leave.
The question is...

Articles describing in detail:

Suggested lecture:

What is Emotion prompt?

A new approach that incorporates emotional cues into the prompt improves LLM performance. The approach was inspired by psychology and cognitive science:

"Emotional intelligence denotes the capacity to adeptly interpret and manage emotion-infused information, subsequently harnessing it to steer cognitive tasks, ranging from problem-solving to behaviors regulations. Other studies show that emotion regulation can influence human’s problem-solving performance as indicated by self-monitoring , Social Cognitive theory, and the role of positive emotions. "--source

EmotionPrompt from the original article

The idea is to add phrases that can uplift the LLM ("you can do this," "I know you'll do great!") at the end of the prompt; a minimal example follows.
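A minimal illustrative example (the review reuses the earlier sentiment example in this FAQ; the appended sentence is the kind of emotional stimulus reported in the paper):

prompt:

  Classify the sentiment of this review: the salad was too small.
  This is very important to my career.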

The advantages of this approach are:

  • Improved performance.
  • Emotional stimuli actively contribute to the gradients in LLMs by gaining larger weights, thus enriching the representation of the original prompt.
  • Combining several emotional stimuli brings additional performance boosts.
  • The method is very simple.

EmotionPrompt from the original article

Articles describing in detail:

Suggested lecture:

Does multimodal prompt engineering exist?

Imagine reading a textbook with no figures or tables. Our ability to knowledge acquisition is greatly strengthened by jointly modeling diverse data modalities, such as vision, language, and audio. -source

With the arrival of LLMs, there has been increased interest in multimodal models. Several multimodal models already exist, but most prompt engineering techniques are designed for traditional LLMs and do not take into account modalities other than text.

multimodal versus unimodal COT from the original article

Chain-of-thought (CoT) is a technique to improve reasoning skills, but it is not adapted to the presence of other modalities. In this article, the authors proposed a variant called Multimodal-CoT. In the same manner as CoT, Multimodal-CoT decomposes multi-step problems into intermediate reasoning steps (the rationale) and then infers the answer. In this work they used a 1B model purpose-built to consider both image and text modalities. The approach is a two-stage framework: in the first stage, the rationale (the chain of thought) is created based on the multimodal information; after that, the model generates the answer.

multimodal COT from the original article

Here, is the pipeline in detail:

multimodal COT from the original article

Articles describing in detail:

What is ReAct Prompting?

ReAct prompting was introduced in this article. It is based on the idea that humans can accomplish tasks and reason about those tasks at the same time. Thus, in ReAct prompting, both reasoning and task actions are conducted.

The process alternates between retrieving information (which can come from external sources such as a search engine), evaluating the process, and, if necessary, updating the plan. This can be combined with chain of thought to follow a plan and track intermediate reasoning. This approach has shown promise especially when the model needs to conduct searches or take actions.

However, this “chain-of-thought” reasoning is a static black box, in that the model uses its own internal representations to generate thoughts and is not grounded in the external world, which limits its ability to reason reactively or update its knowledge - source

ReAct tries to solve this by giving the model access to external information. To avoid hallucinations or wrong conclusions, this approach tries to combine the retrieved information with an internal assessment. The approach combines reasoning traces and actions; it is also dynamic, because it creates, maintains, and adjusts the plan as it unfolds. An illustrative trace follows the figure below.

ReAct prompting An example of ReAct prompting. from the original article
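An illustrative trace of the Thought/Action/Observation pattern (the Search tool and the question are invented for illustration; they are not from the original paper):

  Question: In which year was the company whose researchers introduced the Transformer architecture founded?
  Thought: The Transformer was introduced by researchers at Google. I need to find Google's founding year.
  Action: Search[Google founding year]
  Observation: Google was founded in 1998.
  Thought: I now know the answer.
  Answer: 1998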

This approach has advantages and disadvantages:

  • It reduces hallucinations in comparison with CoT.
  • It is flexible and allows for complex actions.
  • It depends heavily on the information found: if it does not find relevant results, it fails to formulate useful thoughts.
  • It often works well with large models that have good reasoning skills; with small models, it risks meaningless and endless reasoning chains that never get to the answer.

Retrieval Augmented Generation (RAG)

What is Retrieval Augmented Generation (RAG)?

LLMs can accomplish many tasks, but for tasks that require knowledge they can hallucinate. Especially when tasks are complex and knowledge-intensive, the model may produce a factually incorrect response. If for tasks that require reasoning we can use the techniques seen above, for tasks that require additional knowledge we can use Retrieval Augmented Generation (RAG).

In short, given a query, the system looks for the most relevant documents in a database. We use an embedder to build a database of vectors; then, through similarity search, we find the vectors most similar to the embedding of the query. The retrieved documents are used to augment the generation of our model (a minimal sketch of the retrieval step follows the figure below).

RAG from this article
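A minimal sketch of the retrieval step (the `embed` function is a random stand-in for a real sentence-embedding model, and the documents are toy examples):

```python
import numpy as np

def embed(text, dim=64):
    # Stand-in embedder: a real system would use a sentence-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def top_k(query, docs, k=2):
    doc_vecs = np.stack([embed(d) for d in docs])
    q = embed(query)
    # Cosine similarity between the query vector and every document vector.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-sims)[:k]]

docs = ["Pluto is the largest dwarf planet.",
        "The sea appears blue.",
        "Paris is the capital of France."]
context = top_k("Which is the color of the sea?", docs)
# The retrieved chunks are then prepended to the prompt before generation.
print(context)
```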

Articles describing in detail:

How to select the right chunk strategy for the RAG?

LLMs have a limited context length, and this is also true for the embedding models used in RAG. This means that documents must be divided into chunks. Chunking refers to the step in which a body of documents is divided into smaller, more manageable chunks.

As you can see from HuggingFace's MTEB leaderboard, different models have different context lengths, thus a maximum number of tokens they can take as input. This means that a chunk can have a maximum number of tokens.

![RAG](https://github.com/SalvatoreRa/tutorial/blob/main/images/MTEB Leaderboard.png?raw=true) from here

The chunking strategy also has an impact on model performance:

  • Relevance and Precision. Chunks that are too large may contain irrelevant information, diluting the relevant content; conversely, if chunks are too small, relevant information may not be retrieved.
  • Efficiency and Performance. Larger chunks require more processing time and more computational resources, which has an impact on inference; similarly, selecting too many small chunks has a cost.
  • Generation. The quality of the answer derives from the information found, so it is important to retrieve the right chunks.
  • Scalability. Document corpora can be very large, so choosing the right strategy is critical.

For this, several strategies have been developed for an efficient chunking system:

  • Naive Chunking. The simplest strategy, in which chunks of an arbitrary, fixed number of tokens are chosen (see the sketch after the figure below).
  • Semantic Chunking. In this case, instead of an arbitrary number of tokens, the split is conducted according to semantic rules: basically, we split when a sentence ends. This can be achieved with standard libraries such as NLTK and spaCy. A variation on the theme is to search for sentences that are semantically similar (using cosine similarity, for example) and group them together consecutively.
  • Compound Semantic Chunking. An approach derived from the previous one, developed to avoid chunks that are too short. The main difference is that sentences are concatenated until a certain threshold is reached.
  • Recursive Chunking. A strategy used for structured documents, such as HTML documents, in which the presence of HTML tags is exploited to split. It is recursive because various tags are exploited in successive rounds of separation.
  • Specialized chunking. Similar to the previous one but focused on a particular type of data, such as Markdown, LaTeX, or code.
  • Context-Enriched. In this case, the goal is to obtain chunks with useful information and meaningful summaries. In some variants, we generate summaries of the different chunks and calculate the similarity between the query and these summaries.

RAG example of chunking of a document, from here
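A minimal sketch of naive and sentence-based chunking under simple assumptions (whitespace-separated words stand in for tokens, and sentences are split with a regex; real pipelines would use a proper tokenizer or NLTK/spaCy):

```python
import re

def naive_chunks(text, max_tokens=200):
    # Fixed-size chunks of an arbitrary number of "tokens" (here: words).
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def sentence_chunks(text, max_chars=500):
    # Concatenate whole sentences until a size threshold is reached,
    # so no chunk cuts a sentence in half.
    chunks, current = [], ""
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if len(current) + len(sent) > max_chars and current:
            chunks.append(current.strip())
            current = ""
        current += sent + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

doc = "Pluto is the largest dwarf planet. It orbits in the Kuiper belt. " * 20
print(len(naive_chunks(doc, max_tokens=50)), len(sentence_chunks(doc)))
```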

The choice of chunking strategy depends on several factors. The first is clearly which embedding model you use: if the embedder can only take 512 tokens, the chunking strategy must respect that maximum. Depending on the origin of the documents, a strategy that respects their structure should be chosen. The paragraph structure of a document is an important element, so it makes sense to choose a strategy that respects it, just as it is important to divide code into chunks that are meaningful. In any case, the best approach is to test different chunk sizes and strategies and evaluate the results.

How to evaluate a RAG system?

The evaluation of a RAG application should take several factors into account. The main factors affecting the performance of a RAG system are:

  • LLM. Obviously, the chosen LLM will influence both how the context is used and the generation.
  • The prompt. The prompt is more critical than you might think and influences how the model deals with the task.
  • The RAG components. The model chosen as the embedder, whether additional components are present, and the chunking strategy are all elements that influence the outcome.
  • Data quality. Obviously, if the database contains low-quality data, this will have a negative impact.

Because several factors are involved, we can do a qualitative analysis of the responses, but it would be better to have quantitative methods and libraries that make the process more automatic and faster. In any case, the perfect RAG should follow two principles:

  • In retrieval, it should find all the relevant data but only the relevant data (inclusive but with no frills).
  • In generation, the LLM should be able to synthesize the documents found and resolve any conflict between its knowledge and what it found in the documents.

Evaluating a RAG application may seem like a difficult task, but there are already Python libraries with this function, for example RAGAS:

"Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM’s context. There are existing tools and frameworks that help you build these pipelines but evaluating it and quantifying your pipeline performance can be hard. This is where Ragas (RAG Assessment) comes in." --source

The interesting point about RAGAS is that it is reference-free, i.e., one does not need a reference dataset; it exploits an LLM under the hood. RAGAS provides several metrics to assess the quality of a RAG pipeline (a usage sketch follows the list below):

  • Context precision. Evaluates the signal-to-noise ratio of the retrieval, basically whether the relevant elements are ranked higher in the context (ideally the most important chunks should be at the top).
  • Context recall. Measures whether all the relevant information has been retrieved.
  • Faithfulness. Measures the factual consistency of the generated answer; the claims in the answer must be supported by the retrieved context.
  • Answer relevance. Focuses on how relevant the generated answer is to the question in the prompt.
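A hedged usage sketch (the exact RAGAS API and column names vary between versions, so treat this as an outline of the workflow rather than a definitive recipe):

```python
# Column names such as "ground_truth" differ across RAGAS versions;
# check the documentation of your installed version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

data = Dataset.from_dict({
    "question": ["Which is the color of the sea?"],
    "answer": ["Blue"],
    "contexts": [["The sea usually appears blue to the human eye."]],
    "ground_truth": ["Blue"],
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)
```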

There are also other libraries, such as TruLens, which focuses specifically on retrieval relevance. This library calculates the percentage of sentences in the retrieved documents that are relevant to the question.

A limitation of these approaches is that they rely on the assumption that an LLM knows how to evaluate retrieval and knows enough about the question. This assumption is difficult to justify if our RAG deals with technical or complex documents, or with private data.

Moreover, LLMs can have a positional bias, which complicates the evaluation when there are different documents:

*"In this paper, we take a sober look at the LLMsas-evaluator paradigm and uncover a significant positional bias. Specifically, we demonstrate that GPT-4 exhibits a preference for the first displayed candidate response by consistently assigning it higher scores, even when the order of candidates is subtly altered." from here *

RAG evaluation from here

In these cases, the automatic assessment should be complemented by human analysis.

Suggested lectures:

Is multimodal RAG possible?

Most organizations have more than just text data. By modality we mean the type of data (text, images, audio, video, tables, and so on). Each of these data types is a modality; when a dataset is composed of multiple modalities, it is called multimodal. Obviously, there is interest in being able to search both images and text at the same time.

The precursor of these searches is OpenAI's CLIP, in which a joint embedding for images and text is learned. For example, we can use a text ("a castle on a hill") to retrieve all the images similar to that description (in this case, images of castles, especially on hills). Similarly, CLIP allows us to search from an image for its description.

Summary of CLIP from here

The idea of multimodal RAG is an extension of this concept. Basically, given a query q, we want to find all the documents d that allow the query to be answered. This time, though, the documents d are not only textual but can also be images, audio, video, and so on.

There are three possible approaches:

  • Embed all modalities into the same vector space. We can use a model such as CLIP as an embedder to project text and images into the same vector space (see the sketch after this list). At query time, we retrieve both images and text and can then provide both to a multimodal LLM (such as BLIP3).
  • Ground all modalities into one primary modality. In this case we transform all modalities into text (textual descriptions for images, audio transcripts for audio, and so on) and then conduct text embedding. At query time we conduct the similarity search only on the text. We can then either put just the text in the context for generation or use a multimodal LLM.
  • Separate stores for different modalities. In this case, for each modality we have an embedder (for example an image/text aligned model, an audio/text aligned model, a text embedder, and so on). We then get a vector database for each modality. At query time, we conduct the search separately in each database and retrieve results for each modality. Since we now have top-k examples per modality, a multimodal re-ranker is usually used to merge them.
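A minimal sketch of the first approach, using the public CLIP checkpoint via Hugging Face transformers (the local file castle.jpg is an assumption; in a real pipeline the image embeddings would be precomputed and stored in a vector database):

```python
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["a castle on a hill"],
                   images=[Image.open("castle.jpg")],  # assumed local image
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# Text and image embeddings live in the same vector space, so cosine
# similarity between them can drive cross-modal retrieval.
text_emb = outputs.text_embeds
image_emb = outputs.image_embeds
print(F.cosine_similarity(text_emb, image_emb).item())
```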

Additional lectures:

What is GraphRAG? Is it better than vector RAG?

Traditional RAG is also called vector RAG because we take text and obtain vectors by embedding it. Vector RAG has a number of advantages but also a number of shortcomings:

  • Neglecting Relationships. Textual information is interconnected (chunks are not actually separate but connected to each other), and traditional RAG fails to capture this relational information. For example, scientific articles cite each other, and these citations are important relational information that is lost with vector RAG.
  • Redundant Information. Different chunks may contain the same information, as well as irrelevant information; such noisy data is then a problem for generation.
  • Lacking Global Information. RAG finds only a subset of the documents, so it retrieves locally relevant information but not globally relevant information; this is a problem especially for summarization tasks.

Therefore a new paradigm has been suggested that can solve these problems:

Graph Retrieval-Augmented Generation (GraphRAG) emerges as an innovative solution to address these challenges. Unlike traditional RAG, GraphRAG retrieves graph elements containing relational knowledge pertinent to a given query from a pre-constructed graph database. - source

Graph RAG introduction from here

In general, a knowledge graph (KG) is used as the graph. This is a special type of heterogeneous graph consisting of triplets (head, relation, tail). So we have to find a way to extract these elements (entities and relations) from the text. There are several methods to do this (manual, traditional machine learning, and so on), but today it can also be done using an LLM. An LLM can be used both to populate the graph with entities and relationships and to conduct post-processing of our KG.

GraphRAG inserts a step with a graph into the whole process:

  • Graph-Based Indexing (G-Indexing). In this initial step we extract entities and relationships and literally build the graph. There may then be additional steps, such as mapping properties to nodes and edges or organizing the graph for fast retrieval.
  • Graph-Guided Retrieval (G-Retrieval). In this step we search the graph to find the information needed to answer a user query. This search can be conducted with different algorithms (a toy sketch follows this list).
  • Graph-Enhanced Generation (G-Generation). At this point the retrieved entities and relationships are placed in the model's context in order to generate the answer to the query. If needed, we can use specific prompts to enhance the generation process. In addition, refinement steps on this context can be conducted, just as is done with RAG.
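
For intuition, this toy sketch of G-Retrieval assumes the KG lives in a networkx graph whose edges carry a relation attribute; real systems use graph databases and more sophisticated traversal, so treat this as a schematic only:

```python
# Toy sketch of G-Retrieval, assuming the KG is a networkx graph whose edges
# carry a "relation" attribute; real systems use graph databases and richer
# traversal or learned retrieval.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Marie Curie", "polonium", relation="discovered")
kg.add_edge("Marie Curie", "Nobel Prize", relation="won")

def retrieve_subgraph(graph: nx.DiGraph, query_entities: list[str], hops: int = 1):
    """Collect all triplets within `hops` of the entities found in the query."""
    nodes = set()
    for entity in query_entities:
        if entity in graph:
            nodes |= set(nx.ego_graph(graph.to_undirected(), entity, radius=hops))
    sub = graph.subgraph(nodes)
    return [(h, d["relation"], t) for h, t, d in sub.edges(data=True)]

# The retrieved triplets are then serialized into the prompt (G-Generation).
print(retrieve_subgraph(kg, ["Marie Curie"]))
```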

Graph RAG architecture from here

So, will GraphRAG replace traditional RAG?

Even GraphRAG is not perfect: especially when semantic context is important, GraphRAG loses this richness:

GraphRAG enables more accurate and context-aware generation of responses based on the structured information extracted from financial documents. But GraphRAG generally underperforms in abstractive Q&A tasks or when there is not explicit entity mentioned in the question. source

So, in this study, the authors suggest we can take the best of both worlds and combine the two methods:

The amalgamation of these two contexts allows us to leverage the strengths of both approaches. The VectorRAG component provides a broad, similarity-based retrieval of relevant information, while the GraphRAG element contributes structured, relationship-rich contextual data. source

In other words, the systems of the future are likely to be hybrid.
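
A schematic sketch of such a hybrid pipeline is shown below, with toy stand-ins for the two retrieval components; all names here are illustrative assumptions, not an established API:

```python
# Schematic sketch of hybrid (Vector + Graph) RAG; the callables below are toy
# stand-ins for a vector store search and a KG search, purely for illustration.
def hybrid_context(query: str,
                   vector_search,   # callable: query -> list[str] of text chunks
                   graph_search):   # callable: query -> list of (h, r, t) triplets
    chunks = vector_search(query)   # broad, similarity-based context (VectorRAG)
    triplets = graph_search(query)  # structured, relationship-rich context (GraphRAG)
    graph_text = "\n".join(f"{h} --{r}--> {t}" for h, r, t in triplets)
    return ("Context (passages):\n" + "\n".join(chunks) +
            "\n\nContext (knowledge graph):\n" + graph_text +
            "\n\nQuestion: " + query)

# Toy usage with hard-coded retrievers:
prompt = hybrid_context(
    "What did Marie Curie discover?",
    vector_search=lambda q: ["Marie Curie discovered polonium in 1898."],
    graph_search=lambda q: [("Marie Curie", "discovered", "polonium")],
)
print(prompt)  # this combined context is then passed to the LLM for generation
```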

Suggested readings:

Will long-context LLMs substitute RAG?

Recently the context length of LLMs has increased dramatically (it went from 4K to 32K tokens in a few months, and then up to over 100K). This has led several researchers to claim that RAG is dead: in a sense, putting all the documents directly into the prompt means there is no longer a need to conduct retrieval.

The question then is: do long-context LLMs (LC-LLMs) make RAG obsolete?

Long-context LLMs still do not solve all the problems of LLMs. LC-LLMs address some, but not all, of the needs that RAG covers:

  • Developers will have to worry less about chunking strategy. The longer the context length of an LLM grows, the less chunking optimization is needed, and chunks can be as large as an entire document.
  • There will be less tuning regarding chain-of-thought and other reasoning techniques.
  • Summarization will be much easier. In fact, it will be enough to put all the documents into the context.
  • Better conversational assistants, because the model will be able to load the entire previous chat into the context.

LC-LLMs cannot replace RAG because even a context length of 10M tokens does not cover the amount of data an organization has. Most organizations have huge amounts of structured and unstructured data, so they will still have to conduct some form of information retrieval. Another important point is that the more the context length grows, the more the cost of inference grows; RAG, on the other hand, is cheaper.
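
A back-of-the-envelope comparison makes the cost point concrete; the price and token counts below are hypothetical, chosen only for illustration:

```python
# Back-of-the-envelope cost comparison with hypothetical numbers: stuffing a
# 100K-token corpus into every prompt vs. retrieving a few relevant chunks.
PRICE_PER_1K_INPUT_TOKENS = 0.01   # hypothetical $/1K input tokens

long_context_tokens = 100_000      # whole corpus in the prompt, every query
rag_tokens = 4_000                 # only the top-k retrieved chunks

print(f"LC-LLM cost/query: ${long_context_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.2f}")
print(f"RAG cost/query:    ${rag_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.2f}")
# ~25x cheaper per query in this toy setting, before retrieval infrastructure costs.
```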

Other studies show that LC-LLMs still do not use their context length efficiently: although the model is capable of accepting many more tokens, it does not exploit them well. The authors of this study showed that the model is actually sensitive to the location of information in the context. In other words, the model does not use the information efficiently because it has a bias for the position of the tokens:

Furthermore, we observe a distinctive U-shaped performance curve; language model performance is highest when relevant information occurs at the very beginning (primacy bias) or end of its input context (recency bias), and performance significantly degrades when models must access and use information in the middle of their input context. - source

Long context LLMs do not efficiently use the context length from here

In addition, one of the advantages of an LC-LLM would be that the model can conduct reasoning over the entire document. The reasoning capabilities of an LLM, however, degrade as the number of tokens increases. So by adding information to the prompt, you reduce the reasoning capabilities of the model (and thus the benefit of using an LC-LLM). Furthermore, techniques such as Chain-of-Thought (CoT) prompting do not compensate for this performance degradation. LLMs with long inputs also tend not to follow instructions and have more difficulty incorporating relevant information.

Long context LLMs reasoning capabilities decrease with the increase in input length from here

This other study shows that, for long inputs, LC-LLMs often prefer to rely on their parametric memory rather than on the provided context when answering a question. It also shows that non-Transformer architectures, such as RWKV and Mamba, still lag behind Transformers by large margins in using long context.

Another important point is that there are no studies so far claiming that LC-LLMs reduce hallucinations. One of the reasons RAG was designed is to reduce hallucinations, and so far, studies on LC-LLMs do not show improvement in this important aspect.

Therefore, rigorous comparisons between RAG and LC-LLMs are needed before claiming that the latter makes the former obsolete. This study conducted such a comparison, suggesting that LC-LLMs are superior to RAG, but the results did not seem convincing. A later study compared the two more thoroughly and noted that RAG is still superior on questions requiring long context. Additional context is indeed beneficial for an LLM: adding chunks to the context initially improves the performance of the model, but then performance begins to decline. According to the authors, this occurs because the model initially benefits from having more relevant chunks, but later more irrelevant chunks also enter, and this noise hurts model performance. LLMs are sensitive to noise, so they struggle to reason and find response-relevant information when there are too many irrelevant documents.

While recent trends have favored long-context LLMs over RAG for their ability to incorporate extensive text sequences, our research challenges this perspective. We argue that extremely long contexts in LLMs can lead to a diminished focus on relevant information, potentially degrading answer quality in question-answering tasks. - source

Long context versus RAG from here

Articles discussing it in detail:

Suggested readings: