# <strong> Kolmogorov-Arnold Networks </strong>
The KAN paper introduces Kolmogorov-Arnold Networks (KANs) as innovative alternatives to traditional Multi-Layer Perceptrons (MLPs). Inspired by the Kolmogorov-Arnold representation theorem, KANs replace the fixed activation functions of MLPs with learnable activation functions on edges, modeled as univariate spline functions. This design eliminates linear weights, allowing KANs to achieve better accuracy and interpretability than MLPs while using fewer parameters.

Key advantages highlighted in the paper include:

1. Accuracy: KANs outperform MLPs in tasks such as data fitting and solving partial differential equations (PDEs), demonstrating faster scaling laws and requiring smaller networks to achieve similar or superior results.
2. Interpretability: The edge-based functional representation enables intuitive visualization and interaction, making KANs particularly useful for scientific applications, such as rediscovering mathematical and physical laws.
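3. Parameter efficiency: The paper reports that much smaller KANs can match or exceed the accuracy of larger MLPs on its benchmark tasks, which reduces the parameter count. This does not translate into faster training or inference (see the questions below).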

* Kolmogorov-Arnold Networks
* https://arxiv.org/abs/2404.19756

<img src="src/kan1.png">

## <strong> Universal Approximation Theorem </strong>

The Universal Approximation Theorem is a fundamental result in the theory of neural networks. It states that a neural network with at least one hidden layer and a suitable activation function can approximate any continuous function on a compact domain to an arbitrary degree of accuracy, given enough neurons in the hidden layer.

Key Points:

##### 1. Applicability:
The theorem applies to feedforward neural networks with a single hidden layer (sometimes called shallow networks; a single-layer perceptron, by contrast, has no hidden layer).
It requires the activation function to be non-linear (more precisely, non-polynomial), such as the sigmoid function or ReLU.

##### 2. Implications:
The theorem shows that neural networks have the potential to represent complex functions, making them powerful tools for approximation tasks in machine learning.
It does not guarantee practical success; training a network to achieve the desired approximation may still face challenges such as optimization, overfitting, or computational limitations.

##### 3. Limitations:
While the theorem ensures existence, it does not specify how many neurons are needed or how to train the network.
The result is limited to continuous functions; approximation of discontinuous functions requires additional considerations.

##### 4. Extensions:
The theorem has been extended to deep networks, showing that for some classes of functions deeper architectures can achieve similar approximations with fewer neurons than shallow networks.
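
The points above can be illustrated with a minimal sketch, assuming NumPy is available. It is not how networks are normally trained: the hidden weights are drawn at random and only the output weights are fit by least squares, which is a simplification chosen here to keep the example short. The function name `fit_one_hidden_layer`, the target `sin(2x)`, and the widths tried are all hypothetical choices for illustration.

```python
import numpy as np

def fit_one_hidden_layer(x, y, n_hidden, seed=0):
    """Fit y ~ tanh(x * w + b) @ a with random (w, b) and least-squares a."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=3.0, size=n_hidden)       # random hidden weights (not trained)
    b = rng.uniform(-3.0, 3.0, size=n_hidden)      # random hidden biases (not trained)
    hidden = np.tanh(np.outer(x, w) + b)           # shape (n_samples, n_hidden)
    a, *_ = np.linalg.lstsq(hidden, y, rcond=None) # output weights via least squares
    return lambda x_new: np.tanh(np.outer(x_new, w) + b) @ a

x = np.linspace(-np.pi, np.pi, 400)
y = np.sin(2 * x)

# Maximum absolute error on the training grid for increasing hidden widths.
errors = {}
for n_hidden in (2, 8, 64):
    model = fit_one_hidden_layer(x, y, n_hidden)
    errors[n_hidden] = float(np.max(np.abs(model(x) - y)))
    print(n_hidden, errors[n_hidden])
```

The error should shrink as the hidden layer widens, which matches the "given enough neurons" part of the theorem. The theorem itself says nothing about how many neurons are needed for a given accuracy.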




> mynote: <strong> if we can increase our numeric precision, e.g. from 0.1 to 0.0000...01, we can fit any function to a single perceptron, because infinitely many numbers fit between any two real numbers. </strong>

## <strong> Kolmogorov-Arnold Representation Theorem </strong>

The Kolmogorov-Arnold Representation Theorem is a result in mathematics that deals with the representation of continuous functions of multiple variables. It states that any multivariate continuous function can be expressed as a finite sum of compositions of continuous functions of one variable and addition.

## Key Points

**Theorem Statement**:
   Any continuous function \( f(x_1, x_2, \ldots, x_n) \), defined on a compact subset of \( \mathbb{R}^n \), can be written in the form:
   \[
   f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2n+1} \phi_i\left(\sum_{j=1}^n \psi_{ij}(x_j)\right),
   \]
   where:
   - \( \phi_i \): Continuous outer functions of one variable, which depend on \( f \).
   - \( \psi_{ij} \): Continuous inner functions of one variable, which can be chosen independently of \( f \).
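
As a small worked example of this form (not the construction used in the proof of the theorem), the product \( f(x, y) = xy \) can be written as \( \tfrac{1}{4}(x+y)^2 - \tfrac{1}{4}(x-y)^2 \). That is a sum of two outer functions, each applied to a sum of univariate inner functions. The sketch below checks this numerically, assuming NumPy; the names `phi`, `psi`, and `ka_form` are hypothetical and chosen to mirror the symbols in the formula. Only two of the \( 2n+1 = 5 \) terms are needed here, and the remaining ones can be taken as zero.

```python
import numpy as np

# Inner functions psi[i][j]: each acts on a single input variable.
psi = [
    [lambda x: x, lambda y: y],    # first term uses x + y
    [lambda x: x, lambda y: -y],   # second term uses x - y
]
# Outer functions phi[i]: each acts on the sum of its inner functions.
phi = [
    lambda u: u**2 / 4.0,
    lambda u: -(u**2) / 4.0,
]

def ka_form(x, y):
    """Evaluate sum_i phi_i(psi_i1(x) + psi_i2(y))."""
    total = 0.0
    for phi_i, (psi_i1, psi_i2) in zip(phi, psi):
        total = total + phi_i(psi_i1(x) + psi_i2(y))
    return total

rng = np.random.default_rng(0)
xs, ys = rng.uniform(-2, 2, size=(2, 1000))
max_error = float(np.max(np.abs(ka_form(xs, ys) - xs * ys)))
print(max_error)
```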

* Youtube Helpful Link:
* https://www.youtube.com/watch?v=nS2hnm0JRBk&list=WL&index=6&t=889s

> mynote: <strong> install Markdown Preview Enhanced and use cmd-shift-v to see correct format.</strong>

### <strong> Why Kolmogorov-Arnold Representation </strong>

***Simpler for Backpropagation***
    Because the representation is built from sums of univariate functions, taking the partial derivative with respect to one variable makes every term that does not depend on that variable vanish. This is also one reason the architecture is interpretable.
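
As a worked derivation, assume the functions \( \phi_i \) and \( \psi_{ij} \) in the representation formula are differentiable. The theorem itself does not guarantee this, but it holds for spline-parametrized functions. For a fixed variable \( x_k \), the inner sum then reduces to a single term:
\[
\frac{\partial}{\partial x_k} \sum_{j=1}^n \psi_{ij}(x_j) = \psi_{ik}'(x_k),
\]
and the chain rule gives
\[
\frac{\partial f}{\partial x_k} = \sum_{i=1}^{2n+1} \phi_i'\left(\sum_{j=1}^n \psi_{ij}(x_j)\right) \psi_{ik}'(x_k).
\]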

***outside & inside***
    Despite their elegant mathematical interpretation, KANs are nothing more than combinations of
splines and MLPs, leveraging their respective strengths and avoiding their respective weaknesses.
Splines are accurate for low-dimensional functions, easy to adjust locally, and able to switch between
different resolutions. However, splines have a serious curse of dimensionality (COD) problem,
because of their inability to exploit compositional structures. MLPs, on the other hand, suffer less
from COD thanks to their feature learning, but are less accurate than splines in low dimensions,
because of their inability to optimize univariate functions. The link between MLPs using ReLU-k
as activation functions and splines has been established in [17, 18] (reference numbers as in the KAN paper). To learn a function accurately,
a model should not only learn the compositional structure (external degrees of freedom), but should
also approximate well the univariate functions (internal degrees of freedom). KANs are such models
since they have MLPs on the outside and splines on the inside. As a result, KANs can not only learn features (thanks to their external similarity to MLPs), but can also optimize these learned features to great accuracy (thanks to their internal similarity to splines).
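
The "MLP on the outside, splines on the inside" idea can be sketched as a toy layer, assuming NumPy. This is a simplified illustration, not the paper's implementation: it uses piecewise-linear hat functions (order-1 B-splines) on a uniform grid instead of the higher-order B-splines described in the paper, it omits training, and the names `hat_basis` and `ToyKANLayer` are hypothetical.

```python
import numpy as np

def hat_basis(x, grid):
    """Piecewise-linear basis values, shape x.shape + (len(grid),).

    Assumes a uniform grid. Inputs outside the grid range get all-zero rows.
    """
    step = grid[1] - grid[0]
    return np.clip(1.0 - np.abs(x[..., None] - grid) / step, 0.0, None)

class ToyKANLayer:
    """Each edge (input i -> output j) carries its own univariate function."""

    def __init__(self, n_in, n_out, grid_size=8, x_range=(-1.0, 1.0), seed=0):
        rng = np.random.default_rng(seed)
        self.grid = np.linspace(x_range[0], x_range[1], grid_size)
        # One coefficient vector per edge: shape (n_in, n_out, grid_size).
        self.coef = rng.normal(scale=0.1, size=(n_in, n_out, grid_size))

    def __call__(self, x):
        basis = hat_basis(x, self.grid)            # (batch, n_in, grid_size)
        # phi_ij(x_i) = sum_g coef[i, j, g] * basis_g(x_i); node j sums over i.
        return np.einsum("big,iog->bo", basis, self.coef)

layer = ToyKANLayer(n_in=2, n_out=3)
x = np.random.default_rng(1).uniform(-1, 1, size=(5, 2))
out = layer(x)
print(out.shape)
```

The nodes only add up their incoming edge functions (the "external" structure), while all learnable parameters live in the per-edge coefficient vectors (the "internal" univariate functions).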

<img src="src/sumSimplicityOnDeriation.png">

### <strong> Important Questions </strong>

***1. Why are Kolmogorov-Arnold Networks (KANs) slower during training despite simplifying backpropagation?***
KANs, inspired by the Kolmogorov-Arnold Representation Theorem, avoid multiplications by using a sum of univariate functions, theoretically simplifying backpropagation. However, in practice, they are slower because each "weight" in KANs is replaced by a univariate function (often splines), which is computationally expensive to evaluate and optimize. This complexity makes gradient computation during backpropagation more resource-intensive compared to simpler linear weight updates in traditional MLPs. While KANs can generalize better and require fewer parameters, the overhead of managing spline-based weights offsets the theoretical benefits during training.
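
A rough way to see the per-edge overhead is to count parameters for one layer of equal width. The sketch below assumes one spline per edge with `grid_size + spline_order` coefficients, and it ignores any extra per-edge scale or base-function weights an actual implementation may add. The function names and the example sizes are hypothetical.

```python
def mlp_layer_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out

def kan_layer_params(n_in, n_out, grid_size, spline_order):
    """One spline per edge, each with (grid_size + spline_order) coefficients."""
    return n_in * n_out * (grid_size + spline_order)

mlp = mlp_layer_params(64, 64)
kan = kan_layer_params(64, 64, grid_size=5, spline_order=3)
ratio = kan / mlp
print(mlp, kan, round(ratio, 2))
```

At the same width, each KAN edge stores and evaluates several coefficients where an MLP edge stores a single weight, so any savings have to come from using a much smaller network.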

***2. Are Kolmogorov-Arnold Networks faster during feedforward inference?***
Despite their smaller size, KANs are generally not faster during feedforward inference compared to traditional MLPs. The reason lies in the computational complexity of their architecture: each connection in a KAN uses spline-based or similar parameterizations, which require more operations than the linear weights in MLPs. Although KANs achieve similar or better accuracy with fewer parameters, this does not directly translate to faster inference. The additional computational demands of evaluating the univariate functions make them slower, particularly in real-time applications.

***3. If KANs are slower, what is the practical need for them?***
KANs are valuable in scenarios where model size and accuracy matter more than speed. Their ability to achieve high accuracy with fewer parameters makes them a good fit for memory-constrained environments or applications requiring strong generalization, such as scientific modeling or solving partial differential equations (PDEs). However, they are less suitable for real-time systems where inference speed is critical. The trade-off lies in their expressiveness and ability to learn more complex functions, which may offer long-term benefits in applications with large-scale data or complex relationships.

### <strong> Learnable activation networks (LANs) </strong>

Besides KANs, the paper also proposes another type of network, learnable activation networks (LANs), which are almost MLPs but with learnable activation functions parametrized as splines. KANs make two main changes to standard MLPs: (1) the activation functions become learnable rather than fixed; (2) the activation functions are placed on edges rather than nodes. To disentangle these two factors, the authors propose LANs, which have learnable activations that still sit on nodes.
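
The difference between activations on nodes and activations on edges can be sketched as a toy LAN layer, assuming NumPy. As a simplification, it uses piecewise-linear hat functions on a uniform grid rather than the paper's spline parametrization. The names `hat_basis` and `ToyLANLayer` are hypothetical, and training is omitted.

```python
import numpy as np

def hat_basis(x, grid):
    """Piecewise-linear basis values on a uniform grid, shape x.shape + (len(grid),)."""
    step = grid[1] - grid[0]
    return np.clip(1.0 - np.abs(x[..., None] - grid) / step, 0.0, None)

class ToyLANLayer:
    """Linear weights on edges, one learnable activation per output node."""

    def __init__(self, n_in, n_out, grid_size=8, x_range=(-3.0, 3.0), seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(scale=0.5, size=(n_in, n_out))   # as in an MLP
        self.grid = np.linspace(x_range[0], x_range[1], grid_size)
        # One coefficient vector per node: shape (n_out, grid_size).
        self.coef = rng.normal(scale=0.1, size=(n_out, grid_size))

    def __call__(self, x):
        pre = x @ self.weight                       # (batch, n_out)
        basis = hat_basis(pre, self.grid)           # (batch, n_out, grid_size)
        # Node j applies its own activation to its pre-activation.
        return np.einsum("bog,og->bo", basis, self.coef)

layer = ToyLANLayer(n_in=2, n_out=3)
x = np.random.default_rng(1).uniform(-1, 1, size=(5, 2))
out = layer(x)
print(out.shape, layer.coef.size)
```

Here the learnable activations need `n_out * grid_size` coefficients, whereas placing one function on every edge, as a KAN does, would need `n_in * n_out * grid_size`.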

> note: <strong> my overall feeling is that this architecture may replace MLPs in some specific fields, such as complex context understanding in new generations of LLMs or complex physics and math functions, but MLPs will still be used. </strong>