# Neural Network Architecture

This notebook can also be found [here](https://nbviewer.org/github/BoeJaker/Python-Neural-Networks/blob/master/notebook.ipynb)

A neural network is a type of machine learning algorithm that is modeled after the structure of the human brain. It consists of layers of interconnected nodes, or neurons, that process information and output a result. Neural network architecture refers to the design and configuration of these layers, including the number of layers, the number of neurons in each layer, and the types of connections between them. 

There are many different configurations and methods for building neural networks, such as feedforward networks, recurrent networks, convolutional networks, and more. These different architectures and techniques are suited for different types of problems and data, and choosing the right one can greatly improve the accuracy and efficiency of your model. 

In this Jupyter notebook, we will explore the basics of neural network architecture and examine various configurations and methods for building and training effective models.

***

## Contents

### Architecture</br>
[Perceptron (SLP)](#Perceptron)
    </br>A perceptron is a type of artificial neural network that is used for binary classification.</br>
[Multilayer Perceptron (MLP)](#MLP)
    </br>A type of neural network made up of multiple perceptrons arranged in layers. It's commonly used for tasks like classification and prediction.</br>
[Feedforward Neural Network](#)
    </br>These neural networks are the simplest type of neural network where the information flows in one direction from input to output layer without any feedback.</br>
[Convolutional Neural Networks (CNN)](#)
    </br>Used for image recognition and analysis, and are characterized by the use of convolutional layers, pooling layers, and fully connected layers.</br>
[Recurrent Neural Network (RNN)](#RNN)
    </br>Designed for processing sequential data, such as time-series or natural language processing (NLP) tasks. They are characterized by the use of feedback loops that allow the network to maintain a "memory" of past inputs. </br>
[Long Short Term Memory (LSTM)](#LSTM)
    </br>A type of RNN that can selectively remember or forget past inputs. They are commonly used for NLP tasks such as language translation or text summarization.</br>
[Autoencoder Neural Networks](#)
    </br>Used for unsupervised learning and feature extraction. They consist of an encoder network that maps the input to a lower-dimensional space, and a decoder network that maps the lower-dimensional representation back to the original input.</br>
[Generative Adversarial Networks](#)
    </br>Used for generative modeling and image synthesis. They consist of a generator network that generates fake images and a discriminator network that attempts to distinguish between the fake images and real images.</br>
[Siamese Neural Networks](#)
    </br>Siamese networks are used for tasks such as image comparison and similarity scoring. They consist of two identical neural networks that share the same weights and process two inputs independently before combining them to produce an output.</br>
[Reinforcement Learning Neural Networks](#)
    </br>Reinforcement learning networks are used for learning through trial and error. They consist of an agent that interacts with an environment and receives rewards or punishments for its actions. The network learns to optimize its actions to maximize its rewards over time.</br>
[Modular Neural Network](#)
    </br>These networks consist of multiple independent neural networks that can be combined to form a larger network. This allows for more efficient training and better performance on complex tasks.</br>
[Spiking Neural Network](#)
    </br> These networks are designed to simulate the behavior of biological neurons, where information is transmitted through spikes of electrical activity. They are commonly used for modeling biological systems and for applications such as robotics and control systems.</br>
[Deep Belief Networks (DBN)](#)
    </br>A type of generative neural network that uses a stack of Restricted Boltzmann Machines (RBMs) to learn a hierarchical representation of the data. They are commonly used for feature extraction and unsupervised learning.</br>
[Echo State Networks](#)
    </br>These networks are a type of recurrent neural network that use a large reservoir of randomly connected neurons to process input data. The output of the network is determined by a linear combination of the reservoir neurons. They are commonly used for time-series prediction and control tasks.</br>
[Capsule Networks](#)
    </br>A type of neural network that uses capsules instead of neurons as the basic processing unit. Each capsule represents an instantiation of a specific entity or feature, and the network learns to recognize objects by combining the output of multiple capsules. They are commonly used for image recognition tasks.</br>
[Neural Turing Machines](#)
    </br>These networks combine a neural network with an external memory system, allowing the network to learn algorithms and perform tasks that require long-term memory. They are commonly used for tasks such as program synthesis and language modeling.</br>
[Attention Mechanism Networks](#)
    </br>Attention mechanism networks use an attention mechanism to selectively focus on different parts of the input data when processing it. This allows the network to selectively attend to important features and ignore irrelevant ones. They are commonly used for NLP tasks such as machine translation and text summarization.</br>
[Transformer Network](#)
    </br>A type of neural network architecture commonly used in natural language processing (NLP) tasks, such as machine translation, language modeling, and text classification. It was introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017.</br>

### Encoding
[One-Hot](#)
    </br>This is a technique that converts categorical variables into a set of binary features, where each feature corresponds to a unique category.</br>
[Label](#)
    </br>This is a technique that assigns a unique integer to each category of a categorical variable.</br>
[Binary Encoding](#)
    </br>This is a technique that assigns each category an integer and then writes that integer in binary, using one feature per bit. Unlike one-hot encoding, several features can be active (set to 1) for the same sample, so fewer features are needed.</br>
[Count Encoding](#)
    </br>This is a technique that replaces each category with the count of its occurrences in the dataset.</br>
[Continuous](#)
    </br>   
[Circular](#)
    </br>If the time of day is an important feature, it can be encoded using a variety of methods. One common approach is circular encoding, where the time of day is mapped onto a 24-hour circle (e.g. midnight at the top, noon at the bottom) and represented by the sine and cosine of its angle, which captures the cyclical nature of time.</br>
[Embedding](#)
    </br>This is a technique that maps categorical variables to a low-dimensional vector space, where each category is represented by a vector.</br>
[Scaling and Normalisation](#)
    </br>These techniques rescale continuous variables so that they have a similar range and distribution.</br>
[Binning](#)
    </br>This is a technique that discretizes continuous variables into a set of categories or bins.</br>
[Feature Extraction](#)
    </br>This is a technique that extracts relevant features from the input data and represents them as a set of input features for the neural network.</br>
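
As a rough illustration of two of the encodings listed above, the sketch below implements one-hot and circular encoding with NumPy. The function names `one_hot` and `circular_encode` are hypothetical helpers introduced here, and the sine/cosine form is one common way to implement circular encoding rather than the only one.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes), dtype=int)
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

def circular_encode(hours, period=24.0):
    """Map hours onto a circle and return (sin, cos) pairs, one row per hour."""
    angle = 2 * np.pi * np.asarray(hours, dtype=float) / period
    return np.column_stack((np.sin(angle), np.cos(angle)))

# Three samples with integer category labels 0, 2 and 1
print(one_hot([0, 2, 1], num_classes=3))

# 23:00 and 00:00 end up close together on the circle,
# which a plain 0-23 integer feature would not capture
print(np.round(circular_encode([0, 6, 12, 23]), 3))
```

With this representation, the distance between 23:00 and 00:00 is small, while 00:00 and 12:00 sit on opposite sides of the circle.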


### Training Data

#### Text Corpus
 [Common Crawl](#Training)
 </br>A very large corpus built from regular crawls of the public web</br>
 [Web Text](#)
 </br>A corpus of web pages collected from links shared on Reddit</br>
 [Wikipedia](#)
 </br>The Online Encyclopedia</br>

***

## [Perceptron](Architectures/perceptron.py)

A perceptron is a type of artificial neural network that is used for binary classification.

[This writeup can also be found here](https://boejaker.com/index.php/2023/03/16/how-to-build-a-simple-neural-network-in-python-using-numpy/)

### Introduction

Neural networks are a type of machine learning model that are designed to mimic the behavior of the human brain. They are used for a wide range of applications, from image and speech recognition to natural language processing and predictive analytics.

In this section, I will show you how to write a simple neural network, known as a perceptron, in Python using the NumPy library. This neural network will consist of an input layer and an output layer, with a sigmoid activation function. The following steps are the basic building blocks of typical modern neural network architectures, including those behind AI systems such as ChatGPT, DALL-E, and Copilot.

### Step 1: Import NumPy

First, we need to import the NumPy library, which provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions. This is the only library we need.
    

In [None]:
import numpy as np

### Step 2: Define the sigmoid activation function

![sigmoid](assets/images/sigmoid.png)*A visual representation of a sigmoid function*
</br>
The sigmoid() function is a mathematical function that maps any input value to a value between 0 and 1, which is useful for modeling the behavior of neurons in a neural network.

An activation function is a crucial step in both artificial and biological neural networks. It allows neurons to do more than simply output the input they receive. Instead, the activation function works in a way that is similar to the rate at which action potentials fire in the brain.

In [None]:
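# Sigmoid activation function
# With deriv=True, x is expected to already be a sigmoid output,
# so x*(1-x) is the slope of the sigmoid at that point

def sigmoid(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))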
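
One detail of the function above is easy to miss: the `deriv=True` branch computes `x*(1-x)`, which is only the slope of the sigmoid if `x` is already a sigmoid output. The short check below is a sketch that compares this shortcut against a numerical derivative. The test point `z = 0.5` and step size `eps` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(x, deriv=False):
    # With deriv=True, x must already be a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

z = 0.5                      # arbitrary pre-activation value
out = sigmoid(z)             # forward pass

# Slope via the shortcut: pass the sigmoid OUTPUT, not z itself
analytic = sigmoid(out, deriv=True)

# Slope via a central finite difference on the raw input
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(analytic, numeric)     # the two values should agree closely
```

This is why the training code passes `L1` (the layer output) rather than the raw weighted sum when it asks for the derivative.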

### Step 3: Define the neural network architecture

![Single Layer Perceptron](assets/images/perceptron.png)*Diagram of a single layer perceptron*
</br>
Our neural network will consist of an input layer with three nodes, and an output layer with one node. This is known as a single layer perceptron.

The specific size and number of layers in a neural network depend on factors such as problem complexity, dataset size, and available resources. A general rule of thumb is to match the input and output layers to the data. Hidden layers between the two will be covered in the next article.

Ultimately, the best architecture should be determined through experimentation to find the one that performs best for the specific problem.

In [None]:
input_layer = 3
output_layer = 1


# Input matrix, 4 entries each with 3 inputs
X = np.array([  [1,0,0],
                [0,0,1],
                [0,1,0],
                [1,0,1] ])
    
# Output set, 1 output per input entry            
y = np.array([  [0],
                [1],
                [0],
                [1]])

### Step 4: Initialize the weights

The weights are the parameters that the neural network will learn during training. We will seed a numpy random number generator with np.random.seed(1), then initialize the weights randomly using the np.random.random() function.

In [None]:
np.random.seed(1)

# Initialize the array of weights randomly
W0 = np.random.random((input_layer,output_layer))

### Step 5: Define the forward propagation function

The forward_propagate() function computes the output of the perceptron for a given input. It does this by multiplying the input by the weights that connect the input layer to the output layer, then applying the sigmoid activation function.

In [None]:
def forward_propagate(X):
    L0 = X
    L1 = sigmoid(np.dot(L0,W0))
    return L1

### Step 6: Train the perceptron

The train() function passes the input through the forward propagation step. It then performs a series of calculations known as backpropagation, which updates the weights of the network based on the error between the predicted output and the target output. Backpropagation here consists of three steps:

1. The layer one error is calculated by subtracting the predicted output (L1) from the target output (y).
2. The layer one delta is computed by multiplying that error by the slope of the sigmoid at the predicted output.
3. The delta is then used to update the weights of the network in the direction that reduces the error.

This process is repeated for a specified number of epochs, or until the error is minimized.
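
One way to read the three steps above is as a single gradient-descent step. The notation below is introduced here for illustration: $X$ is the input matrix, $W_0$ the weights, $\hat{y}$ the prediction (L1), $\sigma$ the sigmoid, and $\odot$ element-wise multiplication. Assuming a squared-error loss and a learning rate of 1 (neither is stated explicitly in the text), the update works out as:

$$
\begin{aligned}
\hat{y} &= \sigma(X W_0) \\
e &= y - \hat{y} \\
\delta &= e \odot \hat{y} \odot (1 - \hat{y}) \\
W_0 &\leftarrow W_0 + X^{\top} \delta
\end{aligned}
$$

Under that assumption, with $E = \tfrac{1}{2}\sum (y - \hat{y})^2$, we have $\partial E / \partial W_0 = -X^{\top}\delta$, so adding $X^{\top}\delta$ moves the weights downhill on the error.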

In [None]:
def train():
    global W0
    for iter in range(10000):

        # Forward propagation
        L0 = X
        L1 = sigmoid(np.dot(L0,W0))

        # Calculate the difference between the predicted output (L1) 
        #  and target output (y)
        L1_error = y - L1

        # Multiply how much we missed by the slope of the sigmoid 
        #  at the values in L1
        L1_delta = L1_error * sigmoid(L1,True)

        # Update weights using the delta
        W0 += np.dot(L0.T,L1_delta)

    print("Output After Training:")
    print(L1)

### Step 7: Test the perceptron
To test our perceptron, we need to execute the train function, then we can pass a sample input through the forward_propagate() function and print the output.

The forward_propagate() function returns a value between 0 and 1, which represents the perceptron's prediction for the output. For clarity, we round it to either 0 or 1 with np.round() before printing it.

This test results in the output [[1.]], which is the correct output as per the training data.
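
The test step can be sketched as follows. To keep the block runnable on its own, it restates the definitions from the earlier steps in compact form. The `epochs` parameter and the reuse of `forward_propagate()` inside `train()` are small conveniences added here, not part of the original code.

```python
import numpy as np

def sigmoid(x, deriv=False):
    # With deriv=True, x is assumed to already be a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Training data: 4 entries with 3 inputs each, and 1 target per entry
X = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 1]])
y = np.array([[0], [1], [0], [1]])

np.random.seed(1)
W0 = np.random.random((3, 1))   # weights from 3 inputs to 1 output

def forward_propagate(inputs):
    return sigmoid(np.dot(inputs, W0))

def train(epochs=10000):
    global W0
    for _ in range(epochs):
        L1 = forward_propagate(X)
        L1_delta = (y - L1) * sigmoid(L1, True)
        W0 += np.dot(X.T, L1_delta)

# Train first, then pass a sample input through the network
train()
result = np.round(forward_propagate(np.array([[1, 0, 1]])))
print(result)
```

The sample `[1, 0, 1]` is one of the training entries, so this checks that the perceptron reproduces its training data rather than how well it generalizes.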

## The Complete Code

In [None]:
import numpy as np

input_layer = 3
output_layer = 1

# Input matrix, 4 entries each with 3 inputs
X = np.array([  [1,0,0],
                [0,0,1],
                [0,1,0],
                [1,0,1] ])

# Output set, 1 output per input entry
y = np.array([  [0],
                [1],
                [0],
                [1]])

# Sigmoid activation function
def sigmoid(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    else:
        return 1/(1+np.exp(-x))

np.random.seed(1)

# Initialize the array of weights randomly
W0 = np.random.random((input_layer,output_layer))

# Define our forward propagation function
def forward_propagate(X):
    L0 = X
    L1 = sigmoid(np.dot(L0,W0))
    return L1

def train():
    global W0
    for iter in range(10000):

        # Forward propagation
        L0 = X
        L1 = sigmoid(np.dot(L0,W0))

        # Calculate the difference between the predicted output (L1) 
        #  and target output (y)
        L1_error = y - L1

        # Multiply how much we missed by the slope of the sigmoid 
        #  at the values in L1
        L1_delta = L1_error * sigmoid(L1,True)

        # Update weights using the delta
        W0 += np.dot(L0.T,L1_delta)

    print("Output After Training:")
    print(L1)

train()    

result = np.round(forward_propagate(np.array([[1,0,1]])))

print(result)


***

## [LSTM](Architectures/lstm.py)

![LSTM Architecture](assets/images/lstm.png)*A high level diagram of an LSTM architecture*

### Introduction:
This section gives an overview of a recurrent neural network written in Python. Note that the code implements a simple recurrent network with a single hidden state that is fed back at every timestep, rather than the gated cells of a full LSTM. It includes a function for computing the sigmoid nonlinearity, a function for converting the output of the sigmoid function to its derivative, and code that generates the training dataset. It also initializes the network weights, defines the input variables, and contains the training logic that trains the network on a simple binary addition problem.

### Importing Libraries:
The first step in the code is to import two libraries, "copy" and "numpy". These libraries are required for various functions in the code. The numpy library is used to work with arrays and mathematical operations, while the copy library is used for copying objects.

### Sigmoid Function:
The sigmoid function is a common activation function used in neural networks. It converts any input value to a value between 0 and 1. The sigmoid function defined in this code takes a parameter "x" and applies the formula 1 / (1 + exp(-x)) to it. The result is returned as the output of the function.

### Sigmoid Output to Derivative Function:
The sigmoid output to derivative function defined in this code takes a parameter "output", which is the output value of the sigmoid function applied to some input. The function then returns the derivative of the sigmoid function applied to that output value.

### Training Dataset Generation:
The next step in the code prepares the training data. It builds a lookup table (int2binary) that maps every integer from 0 to 255 to its 8-bit binary representation. During training, two random integers are drawn, each below half the largest number so that their sum still fits in 8 bits. The binary encodings of the two numbers are the inputs, and the binary encoding of their sum is the target the network learns to predict.
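
As a small illustration of the lookup table described above, the sketch below builds the same kind of integer-to-binary mapping and decodes a result back to an integer. The `decode` helper and the example operands 9 and 60 are chosen here for illustration.

```python
import numpy as np

binary_dim = 8
largest_number = 2 ** binary_dim

# One row per integer 0..255, each row holding its 8 bits (most significant first)
binary = np.unpackbits(
    np.arange(largest_number, dtype=np.uint8).reshape(-1, 1), axis=1)
int2binary = {i: binary[i] for i in range(largest_number)}

def decode(bits):
    """Turn a most-significant-first bit array back into a Python int."""
    return sum(int(bit) << index for index, bit in enumerate(reversed(bits)))

a_int, b_int = 9, 60
c_int = a_int + b_int

print(int2binary[a_int])   # bits of 9
print(int2binary[b_int])   # bits of 60
print(int2binary[c_int])   # bits of 69, the target the network should predict
print(decode(int2binary[c_int]))
```

Because both operands stay below 128, their sum is at most 254 and never overflows the 8-bit table.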

### Input Variables:
The input variables defined in this code include alpha, input_dim, hidden_dim, and output_dim. Alpha represents the learning rate of the neural network. Input_dim represents the number of input neurons, hidden_dim represents the number of hidden neurons, and output_dim represents the number of output neurons.

### Initialize Neural Network Weights:
The code initializes the neural network weights using a random function from the numpy library. The weights are initialized for the input-hidden, hidden-output, and hidden-hidden layers of the neural network.

### Training Logic:
The code then enters the training logic, which iterates 10,000 times. It generates two random numbers and their binary representations, computes their sum, and stores their binary representation as the true answer. It then initializes the variable "overallError" to 0 and initializes two lists, "layer_2_deltas" and "layer_1_values", to store the error in the output layer and the hidden layer values, respectively.

It then moves along the binary representation of the numbers, starting from the least significant bit, generating input and output values for each position. It computes the hidden layer values and the output layer values using the sigmoid function. It then calculates the error in the output layer, multiplies it by the sigmoid derivative, and stores the resulting delta in the "layer_2_deltas" list. It also rounds the output to a binary value and stores it in the variable "d". Finally, it stores the hidden layer values in the "layer_1_values" list for use in the next timestep.

After the binary digits have been iterated through, the code enters the backpropagation step. It initializes the variable "future_layer_1_delta" to 0 and iterates through the binary digits in reverse order. For each digit, it computes the error in the output and the hidden layer, updates the weights, and stores the delta values for use in the next iteration.

The weights are updated using the learning rate, the delta values, and the input and hidden layer values. The update values for each weight are stored in three separate update matrices, which are then added to the weight matrices. Finally, the update matrices are reset to zero, and every 1,000 iterations the code prints the overall error along with the predicted and true values.
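
To make the forward pass described above more concrete, here is a minimal sketch of the recurrence on its own, without any training. The `step` function, the random generator seed, and the three-bit example inputs are assumptions made for illustration. The weight shapes mirror the input-hidden, hidden-hidden, and hidden-output matrices described in the text.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)          # illustrative seed
input_dim, hidden_dim, output_dim = 2, 16, 1

# Weights drawn uniformly from [-1, 1)
synapse_0 = 2 * rng.random((input_dim, hidden_dim)) - 1    # input -> hidden
synapse_h = 2 * rng.random((hidden_dim, hidden_dim)) - 1   # hidden -> hidden
synapse_1 = 2 * rng.random((hidden_dim, output_dim)) - 1   # hidden -> output

def step(x_t, prev_hidden):
    """One timestep: mix the current input with the previous hidden state."""
    hidden = sigmoid(x_t @ synapse_0 + prev_hidden @ synapse_h)
    output = sigmoid(hidden @ synapse_1)
    return hidden, output

# Bits of two small numbers, least significant bit first
a_bits = [1, 0, 1]
b_bits = [1, 1, 0]

hidden = np.zeros((1, hidden_dim))      # the hidden state starts at zero
outputs = []
for a_bit, b_bit in zip(a_bits, b_bits):
    x_t = np.array([[a_bit, b_bit]])
    hidden, output = step(x_t, hidden)
    outputs.append(float(output[0, 0]))

print(outputs)   # untrained, so these values are not yet meaningful sums
```

The hidden state passed from one call of `step` to the next is what lets a trained network remember whether a carry occurred at the previous bit position.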

In [1]:
import copy, numpy as np
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)


# training dataset generation
int2binary = {}
binary_dim = 8

largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]


# input variables
alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1


# initialize neural network weights
synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

# training logic
for j in range(10000):
    
    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number//2) # int version
    a = int2binary[a_int] # binary encoding

    b_int = np.random.randint(largest_number//2) # int version
    b = int2binary[b_int] # binary encoding

    # true answer
    c_int = a_int + b_int
    c = int2binary[c_int]
    
    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)

    overallError = 0
    
    layer_2_deltas = list()
    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))
    
    # moving along the positions in the binary encoding
    for position in range(binary_dim):
        
        # generate input and output
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T

        # hidden layer (input ~+ prev_hidden)
        layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))

        # output layer (new binary representation)
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # did we miss?... if so, by how much?
        layer_2_error = y - layer_2
        layer_2_deltas.append((layer_2_error)*sigmoid_output_to_derivative(layer_2))
        overallError += np.abs(layer_2_error[0])
    
        # decode estimate so we can print it out
        d[binary_dim - position - 1] = np.round(layer_2[0][0])
        
        # store hidden layer so we can use it in the next timestep
        layer_1_values.append(copy.deepcopy(layer_1))
    
    future_layer_1_delta = np.zeros(hidden_dim)
    
    for position in range(binary_dim):
        
        X = np.array([[a[position],b[position]]])
        layer_1 = layer_1_values[-position-1]
        prev_layer_1 = layer_1_values[-position-2]
        
        # error at output layer
        layer_2_delta = layer_2_deltas[-position-1]
        # error at hidden layer
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)

        # let's update all our weights so we can try again
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
        synapse_0_update += X.T.dot(layer_1_delta)
        
        future_layer_1_delta = layer_1_delta
    

    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha    

    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0
    
    # print out progress
    if(j % 1000 == 0):
        print( "Error:" + str(overallError))
        print( "Pred:" + str(d))
        print( "True:" + str(c))
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print( str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")

        


Error:[3.45638663]
Pred:[0 0 0 0 0 0 0 1]
True:[0 1 0 0 0 1 0 1]
9 + 60 = 1
------------
Error:[3.63389116]
Pred:[1 1 1 1 1 1 1 1]
True:[0 0 1 1 1 1 1 1]
28 + 35 = 255
------------
Error:[3.91366595]
Pred:[0 1 0 0 1 0 0 0]
True:[1 0 1 0 0 0 0 0]
116 + 44 = 72
------------
Error:[3.72191702]
Pred:[1 1 0 1 1 1 1 1]
True:[0 1 0 0 1 1 0 1]
4 + 73 = 223
------------
Error:[3.5852713]
Pred:[0 0 0 0 1 0 0 0]
True:[0 1 0 1 0 0 1 0]
71 + 11 = 8
------------
Error:[2.53352328]
Pred:[1 0 1 0 0 0 1 0]
True:[1 1 0 0 0 0 1 0]
81 + 113 = 162
------------
Error:[0.57691441]
Pred:[0 1 0 1 0 0 0 1]
True:[0 1 0 1 0 0 0 1]
81 + 0 = 81
------------
Error:[1.42589952]
Pred:[1 0 0 0 0 0 0 1]
True:[1 0 0 0 0 0 0 1]
4 + 125 = 129
------------
Error:[0.47477457]
Pred:[0 0 1 1 1 0 0 0]
True:[0 0 1 1 1 0 0 0]
39 + 17 = 56
------------
Error:[0.21595037]
Pred:[0 0 0 0 1 1 1 0]
True:[0 0 0 0 1 1 1 0]
11 + 3 = 14
------------


***

## Training


### Common Crawl Requests

Common Crawl is a web corpus that contains a vast amount of text data in multiple languages. The corpus is created by continuously crawling the internet and indexing the text from web pages, making it a valuable resource for training language models. Because Common Crawl contains text from a diverse range of sources, it provides a broad and varied sample of natural language, which is essential for training language models that can understand and generate human-like text. Additionally, because Common Crawl is freely available and accessible to researchers and developers worldwide, it has become a popular resource for training language models, with many state-of-the-art models using the dataset as a basis for their training.


### [Common Crawl Wet Requests](Training%20Data/wet_requests.py)

WET files are a type of web archive format used by Common Crawl to store the plain text extracted from web pages. The script uses the requests library to download the list of WET file paths for the CC-MAIN-2023-06 (January/February 2023) crawl from Common Crawl's website. It then loops over the file paths, downloads the corresponding WET file, decompresses it, and extracts its content using the warcio library. Finally, it decodes and prints the contents of the first three records (this can be changed) of the first WET file, then stops to keep the demo output brief. This code can be useful for researchers and developers who want to extract text data from Common Crawl for use in natural language processing or machine learning applications.

The [code](Training%20Data/wet_requests.py) below imports the necessary libraries to download and process the web crawl data. It then sets the URL of the WET file paths for the CC-MAIN-2023-06 (January/February 2023) crawl, which lists the web crawl data files in a compressed format.

Next, it downloads the list of WET file paths using the requests library and decompresses the file using gzip. The decompressed file contains a list of URLs of individual WET files.

After splitting the file content into individual file paths, the code loops over each file path, downloads the corresponding WET file, and prints its contents. The warcio library is used to create a WARC iterator, which iterates over the records in the WET file. The contents of the first three records are printed for demonstration purposes.

In [3]:
import requests
import gzip
from io import BytesIO
import warcio

# Set the URL of the WET file paths for the CC-MAIN-2023-06 (January/February 2023) crawl
url = 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/wet.paths.gz'

# Download the list of WET file paths
response = requests.get(url)
compressed_file = response.content

# Decompress the file
file_content = gzip.decompress(compressed_file)

# Split the file content into individual file paths
file_paths = file_content.decode().split()
# Loop over each file path, download the corresponding WET file, and print its contents
for path in file_paths:
    # Construct the URL of the WET file
    print(path)
    wet_url = 'https://data.commoncrawl.org/' + path
    
    # Download the WET file
    response = requests.get(wet_url)
    compressed_file = response.content
    
    # Decompress the file
    file_content = gzip.decompress(compressed_file)
    
    # Create a WARC iterator
    records = warcio.ArchiveIterator(BytesIO(file_content))
    
    # Iterate over the records and print the contents of the first three records
    for index, record in enumerate(records):
        print(record.content_stream().read().decode('utf-8','ignore'))
        if index >= 2 :
            # To keep the demo output brief, break out of the loops early
            break 

    break

crawl-data/CC-MAIN-2023-06/segments/1674764494826.88/wet/CC-MAIN-20230126210844-20230127000844-00000.warc.wet.gz
Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20230123022639
Extracted-Date: Thu, 09 Feb 2023 17:39:18 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2023-06
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for January/February 2023
publisher: Common Crawl


Изменение цвета кальмара для камуфляжа заснято на видео
НЕ ПРОПУСТИ
Экс-президент Польши заявил об «уникальном шансе разобраться с Россией»
Названо оружие НАТО, способное долететь до Москвы и Санкт-Петербурга
Бизнесмен Пригожин обратился к Володину с просьбой ввести уголовную ответственность за дискредитацию участников боевых действий — Блокнот Россия
Депутатам Госдумы усложнили отдых за границей
«Я кровь не останавливаю, а пускаю её врагам»: Пригожин назвал различие между собой и Распутиным
Борис Джонсон не

As you can see, the output of the Common Crawl WET archives is irregular and multilingual, and it captures page elements (for example, menu and headline text) as well as body text. This data will require filtering before it can be used for training.
Filtering could be achieved via another neural network designed to extract English text that is over a threshold length.

After filtering the WET data using existing tools or custom scripts, you may want to further refine the data to meet your specific requirements. For example, you may want to exclude certain types of content, such as pages written in a particular language or containing specific keywords. Alternatively, you may want to prioritize certain types of content, such as news articles or blog posts.

To achieve this level of filtering, you can use more advanced techniques such as machine learning classifiers or natural language processing (NLP) algorithms. These tools can help you identify patterns and extract relevant information from the text, such as sentiment, topic, or author. By applying these techniques, you can create a more targeted and high-quality dataset for training your neural network.

Overall, the process of filtering the WET data involves a combination of manual and automated techniques, depending on your specific goals and resources. It requires careful planning and experimentation to find the right balance between relevance, quality, and scalability. However, with the right tools and techniques, you can create a powerful training dataset for your language model and unlock its full potential.
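
As a starting point for the filtering discussed above, the sketch below uses a simple character-based heuristic rather than the neural network or NLP classifiers the text suggests. It keeps lines that are long enough and consist mostly of ASCII letters and spaces. The function names and both thresholds are hypothetical and would need tuning on real data.

```python
def looks_like_english(line, min_length=40, min_ascii_letter_ratio=0.8):
    """Rough heuristic: long enough and mostly ASCII letters or spaces."""
    stripped = line.strip()
    if len(stripped) < min_length:
        return False
    ascii_letters = sum(
        1 for ch in stripped
        if ch.isascii() and (ch.isalpha() or ch.isspace())
    )
    return ascii_letters / len(stripped) >= min_ascii_letter_ratio

def filter_record_text(text, **kwargs):
    """Return the lines of a WET record that pass the heuristic."""
    return [line.strip() for line in text.splitlines()
            if looks_like_english(line, **kwargs)]

# A small sample mixing non-English text, navigation residue, and a sentence
sample = """НЕ ПРОПУСТИ
Депутатам Госдумы усложнили отдых за границей
Home | About | Contact
Common Crawl provides a freely available corpus of web pages for research use."""

kept = filter_record_text(sample)
print(kept)
```

A heuristic like this cannot tell English from other languages written in the Latin alphabet, so a proper language-identification step would still be needed for a production dataset.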

## Transformer

The transformer model relies on a self-attention mechanism, which allows it to process input sequences in parallel, rather than sequentially like traditional recurrent neural networks (RNNs). This makes it more efficient and better suited for longer sequences.

In a transformer, the input sequence is first embedded into a high-dimensional space, and then multiple layers of self-attention and feedforward neural networks are applied. The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence when making predictions. The feedforward neural networks then transform the representations learned by the self-attention layers into a form that can be used for the final prediction.
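
The self-attention step described above can be sketched in a few lines of NumPy. This is a single-head, unmasked version of scaled dot-product attention with randomly initialized projection matrices. The names `W_q`, `W_k`, `W_v` and the toy dimensions are assumptions made for illustration, and a real transformer adds multiple heads, residual connections, normalization, and positional information.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every position scores every other position in one matrix product
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # toy sizes
X = rng.normal(size=(seq_len, d_model))  # stand-in for embedded tokens
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)          # one output vector per input position
print(weights.sum(axis=1))   # attention weights per position sum to 1
```

Because the scores for all positions come from a single matrix product, no position has to wait for the previous one, which is the parallelism the text contrasts with recurrent networks.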

Transformers have shown state-of-the-art performance on a variety of NLP tasks and have been widely adopted in both research and industry.