# HuggingFace Models

- BERT:
    - Architecture:

        BERT is based on the transformer architecture but includes only the encoder stack (tokens to contextual embeddings) and not the decoder (embeddings back to tokens). The layers include the following:

            - Embedding Layer: Converts each token to a vector by summing its token, position, and segment embeddings.
            - Multi-Head Attention Layer: Computes the attention scores and attention weights for each token using the Query, Key, and Value matrices.
            - Feed Forward Neural Network: A position-wise network applied to each token's attention output to further transform its representation. It does not output probabilities itself. For a task such as sentiment analysis, a separate classification head on top of the final encoder layer produces 2 logits (positive and negative sentiment), which a softmax turns into probabilities.

            - Within the attention layer, the outputs of all attention heads are concatenated and passed through a linear projection

            - The result is fed into the Feed Forward Neural Network, and this attention and feed forward pair is repeated for every encoder layer
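
The embedding layer can be sketched as three lookup tables whose rows are added together. This is a minimal illustration only: the vocabulary, the sizes, and the random values are all assumptions, and the layer normalization and dropout that BERT applies after the sum are left out.

```python
import random

random.seed(0)

HIDDEN = 4  # hypothetical embedding size (BERT-base uses 768)

def make_table(rows, dim):
    """Create a random lookup table standing in for learned embeddings."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

# Hypothetical toy vocabulary; the real WordPiece vocabulary is far larger.
vocab = {"[CLS]": 0, "i": 1, "love": 2, "learning": 3, ".": 4, "[SEP]": 5}

token_table = make_table(len(vocab), HIDDEN)  # one row per vocabulary entry
position_table = make_table(16, HIDDEN)       # one row per position in the sequence
segment_table = make_table(2, HIDDEN)         # sentence A / sentence B

def embed(tokens, segment_id=0):
    """Sum the token, position, and segment embeddings for each token."""
    out = []
    for pos, tok in enumerate(tokens):
        t = token_table[vocab[tok]]
        p = position_table[pos]
        s = segment_table[segment_id]
        out.append([t[i] + p[i] + s[i] for i in range(HIDDEN)])
    return out

tokens = ["[CLS]", "i", "love", "learning", ".", "[SEP]"]
embeddings = embed(tokens)
print(len(embeddings), len(embeddings[0]))  # 6 4
```

Because the position embedding is part of the sum, the same word at two different positions gets two different input vectors.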



    - Tokenization (STEP 1):

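        BERT uses a tokenization technique called WordPiece, which splits a sequence into words and subwords. A word that is not in the vocabulary is broken into smaller known pieces, with continuation pieces prefixed by `##` (e.g. "playing" might become "play" + "##ing").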
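
WordPiece-style splitting is commonly implemented as a greedy longest-match-first search over the vocabulary. The sketch below shows that idea for a single word. The vocabulary is a made-up toy set, and the function is not the HuggingFace tokenizer: it skips lowercasing, punctuation splitting, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"i", "like", "##d", "the", "movie", "un", "##believ", "##able"}

print(wordpiece_tokenize("liked", toy_vocab))         # ['like', '##d']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("score", toy_vocab))         # ['[UNK]']
```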

    - Bidirectional Embeddings (STEP 2):
        
        BERT is also bidirectional: when generating a token's representation (embedding), it uses context on both the left and the right of that token, which captures richer semantic relationships between words. Each token is therefore compared with all the other tokens in the sequence to create its embedding. By contrast, for text generation with a causal language model (i.e. predicting the next word or a continuation of the prompt), only the tokens to the left of the new token matter.
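
One way to picture the difference between bidirectional and causal processing is the attention mask each one uses. The helper below is a hypothetical illustration, not a library function: a 1 at (row, column) means the row token is allowed to look at the column token.

```python
def attention_mask(n_tokens, causal=False):
    """Return an n x n mask: 1 means 'row token may attend to column token'."""
    return [
        [1 if (not causal or col <= row) else 0 for col in range(n_tokens)]
        for row in range(n_tokens)
    ]

tokens = ["I", "love", "learning", "."]

bidirectional = attention_mask(len(tokens))        # encoder style, as in BERT
causal = attention_mask(len(tokens), causal=True)  # left-to-right generation

for name, mask in [("bidirectional", bidirectional), ("causal", causal)]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>8}: {row}")
```

In the bidirectional mask every row is all ones, while in the causal mask each token only sees itself and the tokens before it.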

    - Pre-training (STEP 3A):

        BERT is pre-trained on a large corpus of unlabeled text using self-supervised objectives: masked language modeling (MLM, predicting randomly masked tokens) and next sentence prediction (NSP). Through this it learns general properties of language such as semantic similarity and contextual meaning. This knowledge is stored in the model's weights, including the projection weights that turn the token embeddings into the Query, Key, and Value matrices.
        
        Note: Pre-training is normally already complete before you fine-tune a model; in most cases you start from a checkpoint that was pre-trained by other developer(s).

        - Extracting Query, Key, and Value Matrices: 
            The matrices are computed from the input embeddings X by multiplying them with three learned weight matrices: Q = X·W_Q, K = X·W_K, V = X·W_V. The weight matrices are the learnable parameters of the model. The embeddings pass through a repeated two-step block (multi-head attention, then feed forward network) n times, and each block has its own weights.

              
            e.g.   "I liked the movie, but hated the score." (the comma is left out of the token lists below for simplicity)

            - Query: For each token, a vector describing what that token is looking for in the other tokens. In the real model these are dense learned vectors (one row of Q = X·W_Q per token) with no human-readable labels, and what they encode depends on the task the model is trained for, e.g. NER, sentiment analysis, or MLM. As a loose intuition only: if BERT is trained on a movie review dataset for sentiment analysis, you can picture Q as a 2xn matrix, where n is the number of tokens (the 9 tokens of the example sentence give 9 columns, one per token). The two rows (Q_pos, Q_neg) represent the positive and negative sentiment targets, so the sentiment of each word is taken into account. The model then uses the matrix to identify the specific words (tokens) that convey a positive or negative sentiment.

                query_token_sentiment_dataset = [
                    [Q_pos(I), Q_pos(liked), Q_pos(the), Q_pos(movie), Q_pos(but), Q_pos(hated), Q_pos(the), Q_pos(score), Q_pos(.)],
                    [Q_neg(I), Q_neg(liked), Q_neg(the), Q_neg(movie), Q_neg(but), Q_neg(hated), Q_neg(the), Q_neg(score), Q_neg(.)]
                    ]

                Note: Q_neg can be written as (1 - Q_pos)

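            - Key: For each token, a vector describing what that token offers to the queries of the other tokens in the input sequence (one row of K = X·W_K per token). Like the queries, real key vectors are learned and unlabeled. The examples below are loose intuitions for the kinds of information a key might encode: 
            
                - Relevance towards sentiment analysis: the part of speech of each word (noun, verb, adjective, etc.), since some word types have a bigger impact on the sentiment than others. Pictured here as a dataset of words and their word types.
                

                key_sentiment_relevance_dataset = {
                    "I": "pronoun",
                    "liked": "verb",
                    "the": "",
                    "movie": "noun",
                    "but": "",
                    "hated": "verb",
                    "score": "noun",
                    ".": ""
                }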

                - Contextual information: In BERT, every token attends to every other token in the sequence, so no fixed window is involved. To keep the illustration small, picture an adjustable-sized window that slides over the sequence, so that each token is considered together with its neighboring words before and/or after it. The combined sentiment of all the words in each window represents the contextual value for that particular token. Additionally, the K matrix is different for each head in multi-head attention because each head has its own learned projection weights (W_K, and likewise W_Q and W_V). The token order is never changed; each head looks at the same tokens through a different projection. 

                e.g. with a window size of 1, a single token before and after the main token is taken into account. 

                key_contextual_info_dataset = [
                    ["<PAD>", "I", "liked"], # I
                    ["I", "liked", "the"], # liked
                    ["liked", "the", "movie"], # the 
                    ["the", "movie", "but"], # movie
                    ["movie", "but", "hated"], # but
                    ["but", "hated", "the"], # hated
                    ["hated", "the", "score"], # the  
                    ["the", "score", "."], # score
                    ["score", ".", "<PAD>"] # .
                ]

                Each token is represented by its window of tokens, and each window is summarized as a single sentiment value (the numbers are illustrative):
                
                key_contextual_info_sentiment_dataset = [
                    [.67], # I: ["<PAD>", "I", "liked"]
                    [.74], # liked: ["I", "liked", "the"]
                    [.82], # the: ["liked", "the", "movie"]
                    [.55], # movie: ["the", "movie", "but"]
                    [-.94], # but: ["movie", "but", "hated"]
                    [-.79], # hated: ["but", "hated", "the"]
                    [-.45], # the: ["hated", "the", "score"]
                    [.27], # score: ["the", "score", "."]
                    [.09] # .: ["score", ".", "<PAD>"]
                ]

                - Semantic meaning: A sentiment score for each token, pictured here as a lookup in a pre-built sentiment lexicon dataset that maps words to sentiment scores from -1 (negative) to 1 (positive).

                key_semantic_meaning_dataset = {
                    "I": .50,
                    "liked": .82,
                    "the": .46,
                    "movie": .43,
                    "but": -.56,
                    "hated": -.97,
                    "score": .56,
                    ".": .00
                }

            - Value: The actual information that is passed along once attention has decided where to look. The matrix V holds one vector per token, computed from the token embeddings as V = X·W_V. Each attention head has its own W_V, so the same tokens, in the same order, yield different value vectors in each head. For further details, see the Multi-Head section below. 

                value_actual_token_embeddings = [v1, v2, v3, v4, v5, v6, v7, v8, v9]

                where vn is the value vector of the n-th token. 
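
The Query, Key, and Value matrices described above are all linear projections of the same token embeddings. The sketch below shows this with tiny hand-picked numbers. The embeddings `X` and the weight matrices `W_Q`, `W_K`, `W_V` are assumptions chosen for readability; in a trained model they would be learned.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical 2-dimensional embeddings for 3 tokens (one row per token).
X = [
    [1.0, 0.0],  # "I"
    [0.0, 1.0],  # "love"
    [1.0, 1.0],  # "learning"
]

# Hypothetical projection weights (in BERT these come from training).
W_Q = [[1.0, 0.0], [0.0, 2.0]]
W_K = [[0.5, 0.5], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 1.0]]

Q = matmul(X, W_Q)  # what each token is looking for
K = matmul(X, W_K)  # what each token offers
V = matmul(X, W_V)  # the information that gets passed along

print("Q =", Q)  # [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
print("K =", K)  # [[0.5, 0.5], [1.0, 0.0], [1.5, 0.5]]
print("V =", V)  # [[2.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
```

All three matrices keep one row per token and keep the tokens in their original order; only the weights differ.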


        - Self Attention (Attention Scores):
            - BERT's self-attention mechanism computes, for each token, an attention score against every other token in the sequence. A token's scores represent how much each of the other tokens contributes to its representation. 
            - This is done by computing the dot product between the Query and Key vectors within the Q and K matrices (Q·Kᵀ), scaled by dividing by the square root of the dimensionality of the key vectors. A softmax function is then applied to each row, so that every token's scores are positive and sum to 1. No tokens are discarded at this step; tokens with higher raw scores simply receive a larger share.
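
The scaled dot product followed by a row-wise softmax can be sketched in a few lines of plain Python. The query and key vectors below are hypothetical values chosen for illustration, not anything taken from a trained model.

```python
import math

def softmax(row):
    """Turn raw scores into positive values that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """softmax(Q K^T / sqrt(d_k)), giving one row per query token."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    return weights

# Hypothetical query/key vectors for 3 tokens, 2 dimensions each.
Q = [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
K = [[0.5, 0.5], [1.0, 0.0], [1.5, 0.5]]

for row in attention_weights(Q, K):
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Dividing by the square root of the key dimension keeps the dot products from growing with vector size, which would otherwise push the softmax towards a near one-hot output.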

        - Attention Weights:
        The softmax-normalized scores act as weights. For each token, take the weighted sum of the value vectors using that token's row of the attention score matrix (a matrix multiplication of the score matrix with the value vectors) to yield the attention output for that token. 

            e.g. 

                input = 'I love learning.'
                tokens = ['I', 'love', 'learning', '.']

                # Value vectors, with vN being the value vector of the N-th token
                value_vectors = [v1, v2, v3, v4]

                I = v1
                love = v2
                learning = v3
                . = v4
                
                # attention_scores = softmax(Q . K^T / sqrt(d_k)), computed row by row.
                # The projection weights that produce Q and K were learned during the original
                # training process through gradient descent and backpropagation.
                # Each row sums to 1.
                attention_scores = [ 
                [.80, .10, .06, .04],  # I
                [.30, .63, .03, .04],  # love
                [.10, .02, .80, .08],  # learning
                [.01, .03, .05, .91]   # .
                ]

                attention_outputs = matmul(attention_scores, value_vectors)

                i.e.
                                                I            love        learning         .
                attention_output_I        = (v1 * .80) + (v2 * .10) + (v3 * .06) + (v4 * .04)
                attention_output_love     = (v1 * .30) + (v2 * .63) + (v3 * .03) + (v4 * .04)
                attention_output_learning = (v1 * .10) + (v2 * .02) + (v3 * .80) + (v4 * .08)
                attention_output_.        = (v1 * .01) + (v2 * .03) + (v3 * .05) + (v4 * .91)
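
To make the weighted sum concrete, the sketch below plugs small numeric value vectors into the same kind of computation. Both the weights (illustrative, with each row normalized to sum to 1) and the 2-dimensional value vectors are assumptions for demonstration.

```python
def weighted_sum(weights, values):
    """Attention output: each row of weights mixes all value vectors."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]

tokens = ["I", "love", "learning", "."]

# Rows are softmax-normalized attention weights (each row sums to 1).
weights = [
    [0.80, 0.10, 0.06, 0.04],  # I
    [0.30, 0.63, 0.03, 0.04],  # love
    [0.10, 0.02, 0.80, 0.08],  # learning
    [0.01, 0.03, 0.05, 0.91],  # .
]

# Hypothetical 2-dimensional value vectors v1..v4.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

outputs = weighted_sum(weights, values)
for tok, out in zip(tokens, outputs):
    print(f"{tok:>8}: {[round(x, 2) for x in out]}")
```

Each output vector is a blend of all four value vectors, dominated by whichever tokens received the largest weights in that row.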


        - Multi-head Attention:
        The Multi-head Attention Layer runs several attention heads in parallel, each with its own Query, Key, and Value matrices. The attention outputs of all heads are then concatenated with each other and linearly transformed to produce the output of that MHA Layer, before passing on to the Feed Forward NN. 
        
        Note: All heads see the same input tokens in the same order. What differs per head is the learned projection weights (W_Q, W_K, W_V), so the Query, Key, and Value matrices contain different values in each head, and each head can focus on a different kind of relationship between tokens. 

            e.g. 

                input = 'I love learning.'
                tokens = ['I', 'love', 'learning', '.']

                # Assume x1..x4 are the input embeddings for EACH token above,
                # and W_V1, W_V2, W_V3 are the value projection weights of each head.

                # For Attention Head 1
                value_vectors_1 = [x1 @ W_V1, x2 @ W_V1, x3 @ W_V1, x4 @ W_V1]

                # For Attention Head 2
                value_vectors_2 = [x1 @ W_V2, x2 @ W_V2, x3 @ W_V2, x4 @ W_V2]

                # For Attention Head 3
                value_vectors_3 = [x1 @ W_V3, x2 @ W_V3, x3 @ W_V3, x4 @ W_V3]
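
Multi-head attention can be sketched by running the same single-head computation with different weights and concatenating the results per token. This is a minimal illustration under stated assumptions: the embeddings, the two heads' weights, and the output projection `W_O` are all made-up values, and details such as biases and dropout are omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_Q, W_K, W_V):
    """One head: project X, score queries against keys, mix the values."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_O):
    """Run every head on the same X, concatenate per token, then project."""
    head_outputs = [attention_head(X, *h) for h in heads]
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

# Hypothetical embeddings: 3 tokens, model dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two heads, each with its own hypothetical (W_Q, W_K, W_V), head dimension 1.
heads = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
    ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]),
]

# Output projection maps the concatenated heads (2 values) back to dimension 2.
W_O = [[1.0, 0.0], [0.0, 1.0]]

output = multi_head_attention(X, heads, W_O)
print(len(output), len(output[0]))  # 3 2
```

Every head receives the identical `X`; the heads differ only in their weight matrices, which is what lets them attend to different aspects of the same sequence.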


    - Feed Forward Neural Network (STEP 3B):
     Feed the output of the Multi-head Attention mechanism (the concatenated and linearly transformed outputs of each head) into a feedforward network, which is applied to each token position independently and learns further hierarchical features. The result is then fed to the next attention layer, where the process is repeated. 
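
A position-wise feed forward network can be sketched as two linear layers with a GELU activation in between, applied to each token separately. The weights and input vectors below are hypothetical, and the residual connections and layer normalization that surround this block in BERT are left out to keep the sketch short.

```python
import math

def gelu(x):
    """GELU activation, the non-linearity used in BERT's feed forward layers."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, W, b):
    """Compute vec @ W + b for a single vector."""
    return [sum(v * w for v, w in zip(vec, col)) + bias for col, bias in zip(zip(*W), b)]

def feed_forward(tokens, W1, b1, W2, b2):
    """Expand, apply GELU, and project back, independently for each token."""
    out = []
    for vec in tokens:
        hidden = [gelu(h) for h in linear(vec, W1, b1)]
        out.append(linear(hidden, W2, b2))
    return out

# Hypothetical sizes: model dimension 2, inner dimension 4.
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, 0.5, -1.0]]    # 2 x 4
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.5]]  # 4 x 2
b2 = [0.0, 0.0]

attention_output = [[0.86, 0.16], [0.33, 0.66]]  # hypothetical attention outputs
ffn_output = feed_forward(attention_output, W1, b1, W2, b2)
print(len(ffn_output), len(ffn_output[0]))  # 2 2
```

Because the same weights are applied to every position, a token's result here depends only on its own attention output; mixing between tokens happens in the attention step.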


    - Fine-tuning (STEP 4):
    After pre-training, BERT's checkpoint (the pre-trained model parameters) can be fine-tuned towards a specific downstream task with a relatively small amount of labeled data. A task-specific output layer is typically added on top, and the parameters are then updated as the model learns the target task.
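
As a rough picture of what adding a task-specific layer means, the sketch below trains a small logistic-regression head on fixed sentence vectors. This is a deliberate simplification and not actual BERT fine-tuning: the feature vectors are hypothetical stand-ins for BERT's sentence-level output, only the new head is trained, and real fine-tuning usually updates the BERT weights as well.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head with plain gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the log loss with respect to the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(x, w, b):
    """Return 1 (positive) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical sentence vectors standing in for BERT's sentence-level output.
features = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
labels = [1, 1, 0, 0]  # 1 = positive review, 0 = negative review

w, b = train_head(features, labels)
print([predict(x, w, b) for x in features])
```

The learning rate and epoch count are arbitrary choices for this toy data; a real fine-tuning run would use a much smaller learning rate and a proper optimizer.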





- DistilBERT (6 layers): 

    A distilled version of BERT with the same general architecture but 6 encoder layers instead of the 12 in BERT-base. It is about 40% smaller and about 60% faster, while retaining most (~97%) of BERT's language understanding performance, which makes it useful all around. 



How to move an LLM into production for NLP tasks?

1. Model Selection and Training
    1. Choose the right model for the appropriate task
    2. Fine-Tune or train model 
2. Infrastructure setup
    1. Choose Cloud Service
    2. Ensure the cloud resources are sufficient for the model workloads
3. Model Deployment
    1. Containerize the model with Docker, and store the model artifacts and images in S3 or a registry such as Artifact Registry
    2. Deploy further with Kubernetes on virtual servers (e.g. Compute Engine) to manage the containers and scale the application (load balancer and auto-scaling)
4. API Development
    1. Build a Flask app that loads the ML model (packaged in the Docker container image) and exposes an endpoint for serving model inference
    2. Load Balancing: Implement load balancing to evenly distribute incoming requests to multiple instances (pods)
5. Monitoring and Logging
    1. Monitor model performance, latency (how long a user request takes), and uptime (the share of time the service is available for users to make requests). Tools include Prometheus and Grafana
    2. Set up logging to help track and analyze errors and usage patterns. Tools include the ELK stack (Elasticsearch, Logstash, Kibana) 
6. Scaling
    1. Horizontal Scaling: Add more instances (VMs) to handle a larger number of workloads
    2. Vertical Scaling: Move to more powerful GPUs (or CPUs) per instance to handle larger workloads.
7. Security
    1. Authentication and Authorization: Implement OAuth 2.0 as an extra layer of verification to help secure the API
    2. Data Encryption: Use HTTPS to protect data while it is being transferred, and encrypt stored data at rest (HTTPS alone does not cover stored data)
8. Model Optimization
    1. Reduce latency by refactoring pipelines to use efficient data structures.
    2. Manage cost by monitoring and optimizing the resources that are being billed to the company 
9. Maintenance and Updates:
    1. Set up a CI/CD pipeline (e.g. GitHub Actions, cloud functions, and source repositories) for automated testing and deployment, so that code (including new model versions) runs through a series of tests before being merged to the main branch, moved to the dev environment, and then promoted to the production environment.
    2. Update the model with new data to counter model drift. Model drift comes in two types: data drift and concept drift. Data drift is when the distribution of the features changes. Concept drift is when the relationship between the features and the target changes.
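
A very simple way to watch for data drift is to compare a feature's mean in production against its mean at training time. The check below is a crude heuristic sketched for illustration: the feature, the sample values, and the threshold of 2 standard deviations are all assumptions, and production systems typically use more robust statistical tests.

```python
import statistics

def mean_shift(reference, current):
    """Shift of the current mean, measured in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(current) - ref_mean) / ref_std

def has_data_drift(reference, current, threshold=2.0):
    """Flag drift when the mean moved more than `threshold` std devs (assumed cutoff)."""
    return mean_shift(reference, current) > threshold

# Hypothetical feature: review length in tokens, at training time vs in production.
training_lengths = [48, 52, 50, 47, 53, 49, 51, 50]
production_same = [49, 51, 50, 52, 48, 50]
production_shifted = [95, 102, 98, 110, 99, 101]

print(has_data_drift(training_lengths, production_same))     # False
print(has_data_drift(training_lengths, production_shifted))  # True
```

Note that this only looks at the input features, so it can catch data drift but not concept drift; detecting the latter requires comparing predictions against fresh labels.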

Tools and Technologies
* Containerization and Orchestration: Docker, Kubernetes
* API Development: Flask, FastAPI, Django
* Monitoring and Logging: Prometheus, Grafana, ELK Stack
* CI/CD: Jenkins, GitHub Actions, GitLab CI
* Security: OAuth2.0, HTTPS, JWT

Example Workflow
1. Development: Train and fine-tune your model locally.
2. Containerization: Create a Docker image of your application.
3. Deployment: Deploy the Docker image on a Kubernetes cluster.
4. API: Expose the model through a RESTful API.
5. Monitoring: Set up monitoring and logging.
6. Scaling: Implement horizontal and vertical scaling based on demand.
7. Maintenance: Use CI/CD pipelines for automated testing and deployment.
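
Step 4 of the workflow (exposing the model through an API) can be sketched with only the Python standard library. This is an assumption-heavy stand-in: `predict_sentiment` is a hypothetical keyword rule in place of a real model, the `/predict` route and port are arbitrary, and a real service would more likely use Flask or FastAPI behind a production server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_sentiment(text):
    """Placeholder model: a hypothetical keyword rule standing in for a real model."""
    negative_words = {"hated", "bad", "boring"}
    is_negative = any(word in negative_words for word in text.lower().split())
    return {"label": "negative" if is_negative else "positive"}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read the JSON request body and run the placeholder model on it.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict_sentiment(payload.get("text", ""))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Start the server; this call blocks until the process is interrupted."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()

# Call serve() to start listening, then POST JSON such as {"text": "..."} to /predict.
print(predict_sentiment("I hated the score"))  # {'label': 'negative'}
```

Keeping the prediction logic in its own function, separate from the request handler, makes it possible to test the model path without starting a server.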

