# Salesforce XGen tutorial

## Table of contents

1. Introduction
2. Why choose XGen 7B over others?
3. XGen 7B training details
4. Prerequisites and installation
5. Working with XGen 7B
6. Maximizing model performance
7. Building a summarizer model
8. Conclusion

### Introduction

Right now, these six are the hottest open-source LLMs:

1. LLaMA2
2. BLOOM
3. Falcon 180B
4. OPT-175B
5. GPT-NeoX
6. Vicuna-13B

Most of them share the same disadvantage: a short context length, typically only 2,048 tokens (LLaMA 2 reaches 4,096). Compared to proprietary models like GPT-3.5 and GPT-4, which offer context lengths of up to 32k tokens (about 50 pages of text!), open-source LLMs seem to be at a heavy disadvantage.

Context length is essentially the "memory" of an LLM. A 2,048-token context window means the model can only attend to 2,048 tokens of the conversation at a time; anything earlier is effectively forgotten. This significantly hurts performance in tasks where a large context is crucial, such as summarization, translation, and code generation.
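To make the "memory" idea concrete, here is a minimal sketch of what a fixed context window does to a long conversation. It is an illustration only: the `fit_to_context` helper is hypothetical, and a list of placeholder strings stands in for the output of a real tokenizer.

```python
def fit_to_context(tokens, context_length=2048):
    """Keep only the most recent tokens that fit in the context window."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    # Everything before the last `context_length` tokens is dropped,
    # i.e. the model can no longer "remember" it.
    return tokens[-context_length:]


# Placeholder tokens (assumption: a real tokenizer would produce integer IDs).
conversation = [f"tok{i}" for i in range(5000)]

visible_2k = fit_to_context(conversation, context_length=2048)
visible_8k = fit_to_context(conversation, context_length=8192)

print(len(visible_2k), visible_2k[0])  # 2048 tok2952
print(len(visible_8k), visible_8k[0])  # 5000 tok0
```

With a 2,048-token window, the first 2,952 tokens of this 5,000-token conversation are lost, while an 8k window keeps all of it.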

To address this critical issue, Salesforce announced its XGen-7B model with a whopping context length of 8k tokens (four times longer than the 2,048-token window of most of the models above). This article covers the key characteristics of the model and shows how to build a text summarizer with it.

### Why choose XGen 7B over others?

For most people, statistics like context length don't mean much until they are translated into tangible benefits. So, here are XGen's main features and the impact they can have on your own projects:

**Compact yet powerful**

Despite its relatively small size of 7 billion parameters, XGen punches well above its weight, delivering performance that rivals or exceeds that of much larger models. This efficiency is a game-changer for developers and researchers: it makes it possible to run and deploy cutting-edge AI applications directly on high-end local machines, without access to vast cloud computing resources. This balance between size and performance makes XGen appealing to a wide range of users, from small startups to academic researchers.


**Versatile model variants**

To cover different user needs, XGen comes in three versions, each suited to specific applications:
   - XGen-7B-4K-base: With a 4k-token sequence length, this version is suited for tasks requiring moderate context sizes. It is released under the Apache 2.0 license.
   - XGen-7B-8K-base: This is the flagship model, with an 8k-token sequence length, designed for complex tasks that benefit from analyzing large blocks of text. Like its sibling, it is available under the Apache 2.0 license, which means it can be used for almost any purpose.
   - XGen-7B-{4K,8K}-inst: Fine-tuned on public instructional data, these models are specialized for interactive and instructional applications and are available for non-commercial use only. This variant is ideal for educational tools, interactive bots, and other applications where guidance or instruction is important.
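The choice between these variants can be sketched as a small lookup. This is only an illustration of the trade-offs listed above: the `XGenVariant` class and `pick_variant` helper are hypothetical, and the exact token counts (4,096 and 8,192) are an assumption about what "4k" and "8k" mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XGenVariant:
    name: str
    context_length: int      # assumed: "4k" = 4096, "8k" = 8192
    commercial_use: bool     # True for the Apache 2.0 base models
    instruction_tuned: bool


VARIANTS = [
    XGenVariant("XGen-7B-4K-base", 4096, True, False),
    XGenVariant("XGen-7B-8K-base", 8192, True, False),
    XGenVariant("XGen-7B-4K-inst", 4096, False, True),
    XGenVariant("XGen-7B-8K-inst", 8192, False, True),
]


def pick_variant(needed_tokens, commercial, instruction_tuned=False):
    """Return the smallest-context variant that meets the requirements, or None."""
    candidates = [
        v for v in VARIANTS
        if v.context_length >= needed_tokens
        and (v.commercial_use or not commercial)
        and v.instruction_tuned == instruction_tuned
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda v: v.context_length)


print(pick_variant(6000, commercial=True).name)                      # XGen-7B-8K-base
print(pick_variant(3000, commercial=True, instruction_tuned=True))   # None
```

The second call returns `None` because the instruction-tuned models are limited to non-commercial use.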

**High performance on benchmarks**

The true measure of the model's strength is reflected in the benchmarks. According to Salesforce's reported results, XGen comes out on top on a diverse set of benchmarks, such as MMLU and HumanEval, when compared to models of similar size. For an in-depth analysis, the [announcement post](https://blog.salesforceairesearch.com/xgen/#results-on-standard-benchmarks) provides a comprehensive overview of XGen's results across benchmarks.

**Optimization for long-sequence tasks**

At the risk of redundancy, I reiterate that XGen is highly optimized for tasks that require large context windows. This capability is critical for applications like detailed document summarization, where understanding the entirety of a text is important for generating accurate summaries. Similarly, in comprehensive question answering and long-form content generation, XGen's ability to process large amounts of information results in more coherent, contextually relevant outputs.
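As a rough illustration of why the window size matters for summarization, the sketch below counts how many separate pieces a document must be split into before a model can read all of it. The `split_into_windows` helper, the 256-token output reserve, and the placeholder tokens are all assumptions made for this example, not part of XGen's tooling.

```python
def split_into_windows(tokens, context_length, reserved_for_output=256):
    """Split tokens into consecutive chunks that each fit in the context window.

    `reserved_for_output` leaves room in the window for the generated summary.
    """
    budget = context_length - reserved_for_output
    if budget <= 0:
        raise ValueError("context_length must exceed reserved_for_output")
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]


# A 7,000-token document (placeholder tokens instead of real tokenizer output).
document = ["w"] * 7000

chunks_2k = split_into_windows(document, context_length=2048)
chunks_8k = split_into_windows(document, context_length=8192)

print(len(chunks_2k))  # 4
print(len(chunks_8k))  # 1
```

With a 2,048-token window, the document has to be summarized in four separate pieces, none of which sees the others. With an 8k window, the model reads the whole text in a single pass.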

### XGen 7B training details

So, how does XGen achieve these impressive results? Of course, the answer lies in the training and optimization methods used by Salesforce AI researchers.

The training strategy of XGen consists of two stages. In stage one, a fresh model was trained on 1.37 trillion tokens containing a mix of natural language data and code.

![image.png](attachment:a1e38753-9ad4-4d03-b708-aa9b46283367.png)

In stage two, an additional 55 billion tokens of code were used to improve code generation:

![image.png](attachment:9e590dfc-2d19-4d01-ad8b-1cbc9dc8b835.png)

The training was done using an in-house library called [JaxFormer](https://github.com/salesforce/jaxformer), which is designed for efficient LLM training under both data and model parallelism on TPU-v4 hardware.

### Prerequisites and installation

### Working with XGen 7B

### Maximizing model performance

### Building a summarizer model

### Conclusion