# Inference with Yi Model Using Transformers

Welcome to this tutorial on using the Hugging Face Transformers library for inference with the Yi model! In this notebook, we'll walk step by step through loading and running the Yi-1.5-6B-Chat model with Transformers. Don't worry if you're new to this; the process is simple and easy to follow.

## Why Transformers?

The Hugging Face Transformers library is a popular open-source Python library that offers:
- A vast collection of pre-trained models
- User-friendly APIs
- Strong community support

With Transformers, you can easily download, load, and use various models based on the Transformer architecture, including the Yi model we'll be working with today.

Let's get started!

## Step 1: Installing the Necessary Libraries

First things first, we need to install the Transformers library and other essential dependencies. Run the cell below to get everything set up:

In [None]:
# Quote the version specifiers so the shell does not treat > and < as redirections
!pip install "transformers>=4.36.2"
!pip install "gradio>=4.13.0"
!pip install "torch>=2.0.1,<=2.3.0"
!pip install accelerate
!pip install sentencepiece

Here's what each of these libraries does:
- `transformers`: For loading and using the Yi model
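- `gradio`: For creating a simple web interface (optional; this notebook does not use it)
- `torch`: The PyTorch library for deep learning computations
- `accelerate`: Required by `device_map="auto"` to place the model weights on the available devices
- `sentencepiece`: The tokenizer backend used by the slow (non-fast) tokenizer we load with `use_fast=False`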

Once the installation is complete, we're ready to start using the model!
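If you want to confirm what actually got installed before moving on, a small check like the one below can help. It is only a sketch: `installed_versions` is a hypothetical helper written for this notebook, and it reports versions without checking them against the version ranges in the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the install cell above
PACKAGES = ["transformers", "gradio", "torch", "accelerate", "sentencepiece"]


def installed_versions(packages):
    """Return {package name: version string, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so missing installs are easy to spot
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If a package shows up as `NOT INSTALLED`, re-run the install cell and restart the kernel before continuing.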

## Step 2: Importing Libraries and Loading the Model

Now, let's import the necessary libraries and load the Yi-1.5-6B-Chat model.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Set the model path
model_path = '01-ai/Yi-1.5-6B-Chat'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",  # Automatically choose available devices
    torch_dtype='auto'  # Automatically select suitable data type
).eval()  # Set the model to evaluation mode

print("Model loaded successfully!")

Let's break down what's happening in this code:
- `AutoTokenizer` is used to load the tokenizer that matches our model.
- `AutoModelForCausalLM` loads the language model itself.
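- `device_map="auto"` lets Accelerate place the model weights on the available devices, using the GPU when there is one and falling back to the CPU otherwise.
- `torch_dtype='auto'` loads the weights in the data type the checkpoint was saved in, instead of PyTorch's default float32.
- `.eval()` sets the model to evaluation mode, which turns off training-only behavior such as dropout. `from_pretrained` already does this by default, so the explicit call is just a safeguard.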

⚠️ Note: The first run downloads the model weights, so loading may take a while depending on your internet speed and hardware. Later runs load from the local cache.
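The data type matters mostly because of memory. As a rough back-of-the-envelope sketch, the cell below estimates how much memory the weights alone would need in different data types. The parameter count is an assumption read off the "6B" in the model name, and the estimate ignores activations and the generation cache, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough parameter count implied by the "6B" in the model name (an approximation)
NUM_PARAMS = 6e9

# Bytes used to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Half-precision types need half the memory of float32, which is often the difference between the model fitting on a single GPU or not.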

## Step 3: Preparing Input and Running Inference

Now that our model is loaded, let's try a simple conversation!

In [None]:
# Prepare the conversation
messages = [
    {"role": "user", "content": "Hello!"}
]

# Convert the conversation to a format the model can understand
input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt')

# Generate a response using the model
# (send the inputs to the model's own device rather than hard-coding 'cuda')
output_ids = model.generate(input_ids.to(model.device))

# Decode the model's output
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)

print("User: Hello!")
print(f"Yi: {response}")

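Here's what this code does:

1. We start by creating a list containing the user's message.
2. The `apply_chat_template` method wraps our messages in the model's chat format and tokenizes them. `add_generation_prompt=True` appends the marker that tells the model it is the assistant's turn to speak.
3. `model.generate` takes the token IDs and generates a continuation. Its output contains the prompt tokens followed by the new tokens.
4. Finally, we slice off the prompt with `output_ids[0][input_ids.shape[1]:]` and use `tokenizer.decode` to convert the newly generated tokens back into readable text.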
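To make the chat-format and slicing steps more concrete, here is a simplified sketch in plain Python that needs no model. `render_chatml` is a hypothetical stand-in that assumes a ChatML-style layout; Yi's real template is stored with the tokenizer and may differ in detail. The token IDs are made-up numbers used only to show the slicing.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical stand-in for a ChatML-style chat template (text only, no tokenization)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Hello!"}]
prompt = render_chatml(messages)
print(prompt)

# Slicing illustration with made-up token IDs:
# generate() returns the prompt tokens followed by the newly generated tokens
prompt_ids = [11, 12, 13, 14]
output_ids = prompt_ids + [21, 22, 23]
new_token_ids = output_ids[len(prompt_ids):]  # keep only the generated part
print(new_token_ids)
```

To see the string the real template produces, you can call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and print the result.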

Feel free to modify the `messages` list to try different conversations!

## Advanced: Creating a Simple Chat Function

To make it easier to have multi-turn conversations with the model, let's create a simple function:

In [None]:
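def chat_with_yi(user_input, history=None):
    # Start a fresh history when none is given
    # (a mutable default such as [] would be shared across calls)
    if history is None:
        history = []

    # Add the new user input to the conversation history
    history.append({"role": "user", "content": user_input})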
    
    # Prepare the input
    input_ids = tokenizer.apply_chat_template(conversation=history, tokenize=True, add_generation_prompt=True, return_tensors='pt')
    
    # Generate a response (inputs go to the model's own device rather than a hard-coded 'cuda')
    output_ids = model.generate(input_ids.to(model.device), max_new_tokens=100)
    
    # Decode the response
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    
    # Add the model's response to the conversation history
    history.append({"role": "assistant", "content": response})
    
    return response, history

# Test the chat function
history = []
user_inputs = ["Hello!", "Can you tell me a joke?", "Thank you, goodbye!"]

for user_input in user_inputs:
    print(f"User: {user_input}")
    response, history = chat_with_yi(user_input, history)
    print(f"Yi: {response}\n")

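Each call appends the user's message and the model's reply to `history`, so passing the returned history into the next call gives the model the full conversation as context. Replies are capped at 100 new tokens by `max_new_tokens`; raise that value if answers get cut off. Feel free to add more user inputs to test the model's performance.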
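Because the whole history is sent to the model on every turn, long conversations make each prompt longer. One simple way to keep this bounded is to keep only the most recent messages. The sketch below shows that idea; `trim_history` and its `max_messages` limit are hypothetical and not part of Transformers, and the assistant replies in the example are placeholders rather than real model output.

```python
def trim_history(history, max_messages=6):
    """Return a copy of history that keeps only the most recent messages.

    Hypothetical helper: max_messages is an arbitrary illustrative limit.
    """
    if max_messages <= 0:
        return []
    trimmed = history[-max_messages:]
    # Drop a leading assistant message so the conversation still starts with a user turn
    if trimmed and trimmed[0]["role"] == "assistant":
        trimmed = trimmed[1:]
    return trimmed


# Example conversation (assistant replies are placeholders, not real model output)
example_history = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "(placeholder reply 1)"},
    {"role": "user", "content": "Can you tell me a joke?"},
    {"role": "assistant", "content": "(placeholder reply 2)"},
    {"role": "user", "content": "Thank you, goodbye!"},
]

recent = trim_history(example_history, max_messages=3)
print([message["role"] for message in recent])
```

Trimming trades memory of earlier turns for shorter prompts, so the model will no longer see anything that was cut.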

## Conclusion

Congratulations! You've successfully loaded and run the Yi-1.5-6B-Chat model using the Transformers library. Now you can:
- Try different prompts
- Engage in multi-turn conversations
- Explore various capabilities of the model

Remember, when using large language models:
- The model may produce inaccurate or biased responses
- Don't share sensitive or personal information
- Always maintain critical thinking towards the model's outputs

We hope you found this tutorial helpful! If you have any questions, don't hesitate to check the [official Transformers documentation](https://huggingface.co/docs/transformers/index) or seek help from the community. Have fun on your AI journey!