<a href="https://colab.research.google.com/github/TMarafon/mistral_7b/blob/main/Mistral_7B_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Doing curl requests
In this notebook, we'll experiment with the instruct model, which is fine-tuned to follow instructions. The model card describes the expected prompt format as follows:

In order to leverage instruction fine-tuning, your prompt should be surrounded by [INST] and [/INST] tokens. The very first instruction should begin with a begin-of-sentence id. The next instructions should not. The assistant generation will be ended by the end-of-sentence token id.

<s>[INST] {{ user_msg_1 }} [/INST] {{ model_answer_1 }}</s> [INST] {{ user_msg_2 }} [/INST] {{ model_answer_2 }}</s>
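
The template above can be assembled with a few lines of Python. This is a minimal sketch, not part of the model card: the helper name `build_prompt` and its arguments are our own, and it assumes the conversation is kept as a list of `(user_msg, model_answer)` pairs.

```python
def build_prompt(turns, next_user_msg):
    """Build an instruct prompt following the template from the model card.

    turns: list of (user_msg, model_answer) pairs that are already complete.
    next_user_msg: the new instruction the model should answer.
    Both names are hypothetical, chosen for this sketch.
    """
    prompt = "<s>"  # begin-of-sentence marker, only once at the very start
    for user_msg, model_answer in turns:
        # Each finished exchange is closed by the end-of-sentence marker
        prompt += f"[INST] {user_msg} [/INST] {model_answer}</s> "
    # The final instruction is left open so the model generates the answer
    prompt += f"[INST] {next_user_msg} [/INST]"
    return prompt


print(build_prompt([], "What is your favourite condiment?"))
```

Passing earlier exchanges in `turns` produces the multi-turn form of the template, with `</s>` after each previous answer and no second `<s>`.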

We'll start with an initial query without prompt formatting, which works fine for simple queries.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [2]:
!curl https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1 \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Explain ML as a pirate", "parameters": {"max_new_tokens": 50}}' \
  -H "Authorization: Bearer Your_token"  # replace Your_token with your Hugging Face API token

[{"generated_text":"Explain ML as a pirate.\n\nML is like a treasure map for pirates. Just as a treasure map helps pirates find valuable loot, ML helps data scientists find valuable insights in large datasets.\n\nPirates use their knowledge of the ocean and their"}]
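
Note that the response above starts with the prompt itself. If we only want the continuation, we can parse the JSON and drop the echoed prompt. The sketch below assumes the response body has the same shape as the output above (a JSON list with one object holding `generated_text`); the helper name `strip_prompt` is our own.

```python
import json


def strip_prompt(response_body, prompt):
    """Return the generated text without the echoed prompt (hypothetical helper)."""
    items = json.loads(response_body)  # the endpoint returned a JSON list above
    text = items[0]["generated_text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text


# Shortened copy of the response shown above; the raw string keeps the JSON escapes
body = r'[{"generated_text":"Explain ML as a pirate.\n\nML is like a treasure map for pirates."}]'
print(strip_prompt(body, "Explain ML as a pirate"))
```

The remaining text begins with a period because the model continued the unformatted prompt as if it were an unfinished sentence.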

# Programmatic usage with Python

Raw HTTP requests work for simple cases, but the huggingface_hub library provides utilities that make the model easier to use. Among them:

- InferenceClient and AsyncInferenceClient to perform inference either synchronously or asynchronously.
- Token streaming: receive tokens as they are generated instead of waiting for the full response.
- Easily configure generation params, such as temperature, nucleus sampling (top-p), repetition penalty, stop sequences, and more.
- Obtain details of the generation (such as the probability of each token or whether a token is the last token).

In [3]:
%%capture
!pip install huggingface_hub gradio

In [5]:
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    token="",  # your Hugging Face API token
)

# Per the format above, </s> comes after the model's answer, so the prompt ends at [/INST]
prompt = "<s>[INST] What is your favourite condiment? [/INST]"

res = client.text_generation(prompt, max_new_tokens=95)
print(res)


I don't have personal experiences or preferences. However, I can tell you that people's favorite condiments can vary widely based on personal taste and cultural background. Some popular condiments include ketchup, mustard, mayonnaise, hot sauce, soy sauce, and olive oil.
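
The generation parameters mentioned earlier can be passed as keyword arguments to `text_generation`. The following is a sketch rather than a recommended configuration: the wrapper name `generate_with_params` and all the values are illustrative assumptions, and `stop_sequences` is the argument name as we know it from huggingface_hub releases of this period (newer releases may prefer a differently named argument).

```python
def generate_with_params(client, prompt):
    """Call text_generation with explicit sampling settings.

    client is assumed to be a huggingface_hub InferenceClient like the one
    created above. The values below are illustrative, not tuned.
    """
    return client.text_generation(
        prompt,
        max_new_tokens=95,
        do_sample=True,            # sample instead of greedy decoding
        temperature=0.7,           # lower values make output more deterministic
        top_p=0.95,                # nucleus sampling threshold
        repetition_penalty=1.1,    # values above 1 discourage repeated tokens
        stop_sequences=["</s>"],   # assumption: argument name may differ by version
    )
```

With the client and prompt from the cell above, this would be called as `generate_with_params(client, prompt)`.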


In [8]:
res = client.text_generation(prompt, max_new_tokens=35, stream=True, details=True, return_full_text=False)
for r in res:  # res is a generator that yields one item per generated token
    # Each item exposes the token; print its text as it arrives
    print(r.token.text)



I
 don
'
t
 have
 personal
 experiences
 or
 preferences
.
 However
,
 I
 can
 tell
 you
 that
 people
'
s
 favorite
 cond
iments
 can
 vary
 widely
 based
 on
 personal
 taste
 and
 cultural
 background
.
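
Printing one token per line is useful for inspection, but in practice we usually want to reassemble the text. Below is a minimal sketch that joins the streamed pieces and skips special tokens. It assumes each streamed item has `token.text` (as used above) and a boolean `token.special` flag; the helper name `collect_stream` is our own, and the fake stream built with `SimpleNamespace` is only a stand-in so the example runs without calling the API.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join streamed token texts into one string, skipping special tokens."""
    pieces = []
    for r in stream:
        # Assumption: special tokens such as </s> are flagged via token.special
        if getattr(r.token, "special", False):
            continue
        pieces.append(r.token.text)
    return "".join(pieces)


# Offline stand-in for the generator returned by text_generation(..., stream=True, details=True)
fake_stream = [
    SimpleNamespace(token=SimpleNamespace(text="I", special=False)),
    SimpleNamespace(token=SimpleNamespace(text=" don", special=False)),
    SimpleNamespace(token=SimpleNamespace(text="'", special=False)),
    SimpleNamespace(token=SimpleNamespace(text="t", special=False)),
    SimpleNamespace(token=SimpleNamespace(text="</s>", special=True)),
]
print(collect_stream(fake_stream))
```

With a real stream, `collect_stream(res)` would replace the printing loop, since a generator can only be consumed once.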
