<center> <h1> Prompt Engineering with Open-Source Large Language Models (LLMs) using HuggingFace Serverless APIs</h1> </center>

<p style="margin-bottom:1cm;"></p>

_____


In this notebook we will learn how to run open-source LLMs via the HuggingFace Inference APIs from this Colab notebook. You can also run the notebook on your local machine without worrying about having enough infrastructure to host these models!

Thankfully, HuggingFace has made its [__Inference API__](https://huggingface.co/docs/api-inference/quicktour) free to use, with some basic rate limits in place so that you don't end up making unlimited requests to its servers.

The best part is that you can access 150,000+ deep learning models without worrying about your infrastructure.

The models we will be trying here include:

- __[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)__ is a 7B-parameter transformer LLM built by the young French company [MistralAI](https://mistral.ai/company/). It is an instruction fine-tuned model that follows [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), which is based on their first generative text model, [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), and was fine-tuned using a variety of publicly available conversation datasets.

- __[gemma-2b-it](https://huggingface.co/google/gemma-1.1-2b-it)__ is part of Google's Gemma series: a 2-billion-parameter transformer model fine-tuned for instruction-following tasks, enabling it to handle a wide range of language processing tasks.



__You just need an internet connection and a HuggingFace Account and API Key to use these models.__


In [None]:
from getpass import getpass
import requests

# Prompt for the key so that it is never stored in the notebook itself
API_KEY = getpass("Enter HuggingFace API Key: ")

headers = {"Authorization": "Bearer " + API_KEY}

def query(payload, MODEL_API_URL):
  """POST a payload to an Inference API endpoint; return the parsed JSON, or None on error."""
  response = requests.post(MODEL_API_URL, headers=headers, json=payload)
  print('API Response:', response)

  # Return the parsed JSON only when the request succeeded
  if response.status_code == 200:
    return response.json()
  else:
    print(f"Error: {response.status_code} - {response.text}")
    return None


In [None]:
# Updated to use Mistral-7B-Instruct-v0.3 (v0.2 is deprecated and not available)
MISTRAL7B_API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3"
mistral_params = {
                  "wait_for_model": True,
                  "do_sample": False,
                  "return_full_text": False,
                  "max_new_tokens": 1000,
                }

# Alternatives: if v0.3 doesn't work, try one of these:
# Option 1: Use an older version of the same 7B Mistral model
MISTRAL7B_ALTERNATIVE_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1"

# Option 2: Use a different conversational model from Microsoft (DialoGPT-medium is not a Mistral model)
MISTRAL7B_MICROSOFT_URL = "https://api-inference.huggingface.co/models/microsoft/DialoGPT-medium"

GEMMA2B_IT_API_URL = "https://api-inference.huggingface.co/models/google/gemma-1.1-2b-it"
gemma_params = {
                    "wait_for_model": True,
                    "do_sample": False,
                    "return_full_text": False,
                    "max_new_tokens": 1000,
                  }


In [None]:
# Basic Q&A, falling back to an alternative model if the first one is unavailable
prompt = """Can you explain what is quantum computing to a 5th grader?"""
print(prompt)

# query() returns None on an HTTP error, so the output is checked before use
output = query(payload={
                "inputs": prompt,
                "parameters": mistral_params
                },
                MODEL_API_URL=MISTRAL7B_API_URL)

# Check if output is valid before accessing it
if output and len(output) > 0 and 'generated_text' in output[0]:
    print("✅ Mistral-7B-v0.3 Response:")
    print(output[0]['generated_text'])
else:
    print("❌ Mistral-7B-v0.3 model is not available. Trying alternative...")
    # Try alternative model
    output = query(payload={
                    "inputs": prompt,
                    "parameters": mistral_params
                    },
                    MODEL_API_URL=MISTRAL7B_ALTERNATIVE_URL)
    
    if output and len(output) > 0 and 'generated_text' in output[0]:
        print("✅ Using alternative Mistral-7B-v0.1 model:")
        print(output[0]['generated_text'])
    else:
        print("❌ All Mistral models are unavailable. Please try Gemma model below.")


In [None]:
# Same prompt with the Gemma model, checking the output before use
print("\n" + "="*50)
print("Testing Gemma-2B model:")
print("="*50)

output = query(payload={
                "inputs": prompt,
                "parameters": gemma_params
                },
                MODEL_API_URL=GEMMA2B_IT_API_URL)

if output and len(output) > 0 and 'generated_text' in output[0]:
    response = output[0]['generated_text']
    print("✅ Gemma-2B Response:")
    print(response)
else:
    print("❌ Gemma model is also not available. Please check your API key and internet connection.")
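The validity check `output and len(output) > 0 and 'generated_text' in output[0]` is repeated in each cell above. As a minimal sketch, it could be factored into one helper. The name `extract_generated_text` is hypothetical and not part of the original notebook, and it assumes (as the cells above do) that a successful response is a list of dictionaries with a `generated_text` key.

```python
def extract_generated_text(output):
    """Return the generated text from an Inference API response, or None.

    Assumes a successful response looks like [{"generated_text": "..."}].
    """
    # query() returns None on HTTP errors; error payloads may also be dicts
    if not isinstance(output, list) or not output:
        return None
    first = output[0]
    if isinstance(first, dict) and "generated_text" in first:
        return first["generated_text"]
    return None


# Usage with canned responses, so no network call is needed
print(extract_generated_text([{"generated_text": "Hello!"}]))
print(extract_generated_text({"error": "Model is loading"}))
print(extract_generated_text(None))
```

Checking the type first avoids a `KeyError` when the API returns an error dictionary instead of a list.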


In [16]:
!pip install huggingface_hub

Collecting huggingface_hub
  Using cached huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Using cached huggingface_hub-0.34.4-py3-none-any.whl (561 kB)
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.34.4


In [17]:
import huggingface_hub

## Get your API Key

Remember to go to your [HuggingFace Account Settings](https://huggingface.co/settings/account) and generate an API key by creating a new token from the [Access Tokens](https://huggingface.co/settings/tokens) section.


## Load HuggingFace API Credentials

Enter your key from [here](https://huggingface.co/settings/tokens)

In [1]:
from getpass import getpass

API_KEY = getpass("Enter HuggingFace API Key: ")

Enter HuggingFace API Key:  ········


### Create LLM API Access Function

Here we create a basic function which can access any LLM API endpoint available on HuggingFace.

For more details refer to the [detailed documentation](https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task) as needed.

In [2]:
import requests

headers = {"Authorization": "Bearer "+API_KEY}

def query(payload, MODEL_API_URL):
  response = requests.post(MODEL_API_URL, headers=headers, json=payload)
  print('API Response:', response)
  return response.json()

## Create LLM API Access Config

Here we decide which LLMs we will access by getting their inference API endpoints.

We also set some general configuration settings. You can find the [detailed documentation](https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task) here.

Some useful config settings include:

- max_new_tokens: The number of new tokens to generate in the response
- do_sample: Whether or not to use sampling. False means greedy decoding, i.e. the most likely token is always picked (comparable to temperature=0)
- temperature: The value used to modulate the next-token probabilities, typically set between 0 and 1. A higher temperature means the results may vary more and be more creative
- return_full_text: If set to False, the response does not include your input prompt
- wait_for_model: If the model is not ready, wait for it instead of receiving a 503 error. This limits the number of requests required to get your inference done
- repetition_penalty: The more a token is used within generation, the more it is penalized so that it is less likely to be picked in successive generation passes
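To make these settings concrete, here is a minimal sketch of how a request payload could be assembled. The helper name `build_payload` is hypothetical. It assumes, following the detailed-parameters documentation linked above, that generation settings go under `parameters` while `wait_for_model` goes under a separate `options` key; the notebook's own config cells pass `wait_for_model` inside `parameters` instead, so treat this layout as something to verify against the current docs.

```python
def build_payload(prompt, max_new_tokens=1000, do_sample=False,
                  temperature=None, repetition_penalty=None,
                  return_full_text=False, wait_for_model=True):
    """Assemble a text-generation payload (hypothetical helper)."""
    parameters = {
        "max_new_tokens": max_new_tokens,
        "do_sample": do_sample,
        "return_full_text": return_full_text,
    }
    # Temperature only matters when sampling is enabled
    if do_sample and temperature is not None:
        parameters["temperature"] = temperature
    if repetition_penalty is not None:
        parameters["repetition_penalty"] = repetition_penalty
    return {
        "inputs": prompt,
        "parameters": parameters,
        # Assumption: wait_for_model is an option, not a generation parameter
        "options": {"wait_for_model": wait_for_model},
    }


payload = build_payload("Hello", do_sample=True, temperature=0.7)
print(payload)
```

The resulting dictionary can be passed as the `payload` argument of the `query` function defined earlier.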

In [3]:
MISTRAL7B_API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3"
mistral_params = {
                  "wait_for_model": True,
                  "do_sample": False,
                  "return_full_text": False,
                  "max_new_tokens": 1000,
                }

GEMMA2B_IT_API_URL = "https://api-inference.huggingface.co/models/google/gemma-1.1-2b-it"
gemma_params = {
                    "wait_for_model": True,
                    "do_sample": False,
                    "return_full_text": False,
                    "max_new_tokens": 1000,
                  }

## Prompting with Open-Source LLM APIs

Now we will use the HuggingFace LLM APIs and try some tasks with prompting.

### 1. Basic Q & A

In [7]:
prompt = """Can you explain what is quantum computing to a 5th grader?"""
print(prompt)

Can you explain what is quantum computing to a 5th grader?


In [22]:
MISTRAL7B_API_URL

'https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'

In [29]:
from huggingface_hub import InferenceClient

client = InferenceClient()
# Correct format: messages should be a list of dictionaries
response = client.chat_completion(
    messages=[
        {"role": "user", "content": "Explain relativity simply."}
    ],
    model="mistralai/Mistral-7B-Instruct-v0.3"
)
print(response.choices[0].message.content)


 Relativity is a theory proposed by Albert Einstein that describes the fundamental laws of physics, particularly how objects move and interact, in a way that is consistent with the principles of special and general relativity.

1. Special Relativity: This theory, published in 1905, deals with objects moving at constant speeds, specifically those moving at or near the speed of light (approximately 299,792 kilometers per second). It introduces two key concepts:

   a. Time Dilation: Time can appear to move slower for an object in motion compared to an object at rest. This means that a clock on a fast-moving spaceship would appear to run slower than a clock on Earth.

   b. Length Contraction: Objects in motion can appear shorter in the direction of motion. This means that a spaceship moving at high speeds would appear shorter to an observer on Earth.

2. General Relativity: Published in 1915, this theory extends special relativity to include gravity. It states that massive objects cause 

In [31]:
# output = query(payload={
#                 "inputs": prompt,
#                 "parameters": mistral_params
#                 },
#                 MODEL_API_URL=MISTRAL7B_API_URL)

# print(output[0]['generated_text'])

In [35]:
# # with Gemma
# output = query(payload={
#                 "inputs": prompt,
#                 "parameters": gemma_params
#                 },
#                 MODEL_API_URL=GEMMA2B_IT_API_URL)
# response = output[0]['generated_text']
# print(response)

In [33]:
from IPython.display import display, Markdown
display(Markdown(response.choices[0].message.content))

 Relativity is a theory proposed by Albert Einstein that describes the fundamental laws of physics, particularly how objects move and interact, in a way that is consistent with the principles of special and general relativity.

1. Special Relativity: This theory, published in 1905, deals with objects moving at constant speeds, specifically those moving at or near the speed of light (approximately 299,792 kilometers per second). It introduces two key concepts:

   a. Time Dilation: Time can appear to move slower for an object in motion compared to an object at rest. This means that a clock on a fast-moving spaceship would appear to run slower than a clock on Earth.

   b. Length Contraction: Objects in motion can appear shorter in the direction of motion. This means that a spaceship moving at high speeds would appear shorter to an observer on Earth.

2. General Relativity: Published in 1915, this theory extends special relativity to include gravity. It states that massive objects cause a distortion in space-time, which is felt as gravity. This means that a planet like Earth bends the space around it, and objects moving near it follow this bent path, which we perceive as gravity.

In simple terms, relativity tells us that the laws of physics are the same for all observers, regardless of their motion or the gravity they are in. It also tells us that time and space are interconnected and can be warped by mass and energy. This has profound implications for our understanding of the universe, including the behavior of black holes and the expansion of the universe.

### 2. Report Summarization

In [34]:
report = """
Generative AI is a type of artificial intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data. The recent buzz around generative AI has been driven by the simplicity of new user interfaces for creating high-quality text, graphics and videos in a matter of seconds.
The technology, it should be noted, is not brand-new. Generative AI was introduced in the 1960s in chatbots. But it was not until 2014, with the introduction of generative adversarial networks, or GANs -- a type of machine learning algorithm -- that generative AI could create convincingly authentic images, videos and audio of real people.
On the one hand, this newfound capability has opened up opportunities that include better movie dubbing and rich educational content. It also unlocked concerns about deepfakes -- digitally forged images or videos -- and harmful cybersecurity attacks on businesses, including nefarious requests that realistically mimic an employee's boss.
Two additional recent advances that will be discussed in more detail below have played a critical part in generative AI going mainstream: transformers and the breakthrough language models they enabled. Transformers are a type of machine learning that made it possible for researchers to train ever-larger models without having to label all of the data in advance. New models could thus be trained on billions of pages of text, resulting in answers with more depth. In addition, transformers unlocked a new notion called attention that enabled models to track the connections between words across pages, chapters and books rather than just in individual sentences. And not just words: Transformers could also use their ability to track connections to analyze code, proteins, chemicals and DNA.
The rapid advances in so-called large language models (LLMs) -- i.e., models with billions or even trillions of parameters -- have opened a new era in which generative AI models can write engaging text, paint photorealistic images and even create somewhat entertaining sitcoms on the fly. Moreover, innovations in multimodal AI enable teams to generate content across multiple types of media, including text, graphics and video. This is the basis for tools like Dall-E that automatically create images from a text description or generate text captions from images.
These breakthroughs notwithstanding, we are still in the early days of using generative AI to create readable text and photorealistic stylized graphics. Early implementations have had issues with accuracy and bias, as well as being prone to hallucinations and spitting back weird answers. Still, progress thus far indicates that the inherent capabilities of this generative AI could fundamentally change enterprise technology how businesses operate. Going forward, this technology could help write code, design new drugs, develop products, redesign business processes and transform supply chains.
"""

prompt = f"""
Summarize the following report delimited by triple backticks on Generative AI in max 5 lines

Report:
```{report}```
"""

print(prompt)


Summarize the following report delimited by triple backticks on Generative AI in max 5 lines

Report:
```
Generative AI is a type of artificial intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data. The recent buzz around generative AI has been driven by the simplicity of new user interfaces for creating high-quality text, graphics and videos in a matter of seconds.
The technology, it should be noted, is not brand-new. Generative AI was introduced in the 1960s in chatbots. But it was not until 2014, with the introduction of generative adversarial networks, or GANs -- a type of machine learning algorithm -- that generative AI could create convincingly authentic images, videos and audio of real people.
On the one hand, this newfound capability has opened up opportunities that include better movie dubbing and rich educational content. It also unlocked concerns about deepfakes -- digitally forged images or videos -- and harmfu
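The prompt above wraps the report in triple backticks so that the model can tell the instruction apart from the text to summarize. A minimal sketch of the same pattern as a reusable function follows; `build_summary_prompt` and its parameters are hypothetical names introduced for illustration.

```python
FENCE = "`" * 3  # triple backticks used as the delimiter


def build_summary_prompt(text, topic, max_lines=5):
    """Build a summarization prompt with the text wrapped in delimiters."""
    # A delimiter inside the text would make the boundaries ambiguous
    if FENCE in text:
        raise ValueError("text must not contain the delimiter itself")
    return (
        f"Summarize the following report delimited by triple backticks "
        f"on {topic} in max {max_lines} lines\n\n"
        f"Report:\n{FENCE}{text.strip()}{FENCE}\n"
    )


print(build_summary_prompt("Generative AI can produce text and images.",
                           "Generative AI", max_lines=2))
```

Rejecting text that already contains the delimiter is one design choice; picking a different delimiter for such inputs would work as well.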

In [36]:
from huggingface_hub import InferenceClient

client = InferenceClient()
# Correct format: messages should be a list of dictionaries
response = client.chat_completion(
    messages=[
        {"role": "user", "content": prompt}
    ],
    model="mistralai/Mistral-7B-Instruct-v0.3"
)
print(response.choices[0].message.content)


 Generative AI is a technology that creates various content types, including text, images, audio, and synthetic data. Introduced in the 1960s, it gained prominence in 2014 with the advent of Generative Adversarial Networks (GANs). Recent advancements like transformers and large language models have made it possible for AI to generate more detailed and accurate content, opening opportunities in areas like movie dubbing and education. However, concerns about deepfakes and cybersecurity attacks persist. These advancements have led to tools like Dall-E, which can generate images from text descriptions or vice versa. Despite early issues with accuracy, bias, and hallucinations, the potential for generative AI to revolutionize enterprise technology by writing code, designing drugs, developing products, and transforming supply chains is significant.


In [38]:

client = InferenceClient()
# Correct format: messages should be a list of dictionaries
response = client.chat_completion(
    messages=[
        {"role": "user", "content": prompt}
    ],
    model="mistralai/Mistral-7B-Instruct-v0.3"
)

response = response.choices[0].message.content
display(Markdown(response))

 Generative AI is a technology that creates various content types, such as text, images, audio, and synthetic data. It gained recent prominence due to user-friendly interfaces producing high-quality content quickly. First introduced in chatbots in the 1960s, significant advancements came with the introduction of Generative Adversarial Networks (GANs) in 2014, enabling realistic images, videos, and audio.

Recent advancements in transformers and large language models have made generative AI more powerful, allowing for the creation of engaging text, photorealistic images, and even multimedia content. However, early implementations still face issues with accuracy, bias, and hallucinations. Despite these challenges, generative AI has the potential to revolutionize enterprise technology by writing code, designing drugs, developing products, redesigning business processes, and transforming supply chains.

In [14]:
# with Gemma
output = query(payload={
                "inputs": prompt,
                "parameters": gemma_params
                },
                MODEL_API_URL=GEMMA2B_IT_API_URL)
response = output[0]['generated_text']
display(Markdown(response))

API Response: <Response [200]>


Generative AI is a rapidly evolving field with the potential to revolutionize various industries.

### 3. Sentiment Analysis

In [15]:
review = """I recently worked with this real estate company to purchase my first home,
    and the experience was outstanding. The agent was knowledgeable, patient, and incredibly responsive.
    They guided me through every step of the process, making what could have been a stressful
    experience very smooth and enjoyable.
    """

In [16]:
prompt = f"""
Act as a customer review analyst, given the following customer review text,
do the following tasks:
- Find the sentiment (positive, negative or neutral)
- Extract max 5 key topics or phrases of the good or bad in the review
Review Text:
{review}
"""

mistral_output = query(payload={
              "inputs": prompt,
              "parameters": mistral_params
              },
              MODEL_API_URL=MISTRAL7B_API_URL)

response = mistral_output[0]['generated_text']
display(Markdown(response))
# print(mistral_output[0]['generated_text'])

API Response: <Response [200]>


Sentiment: Positive

Key Topics or Phrases:
1. Outstanding experience
2. Knowledgeable agent
3. Patient and incredibly responsive
4. Smooth and enjoyable process
5. Guided through every step.
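The free-text answer above follows a fairly regular layout, so it can be turned into structured data for downstream use. This is a rough sketch that assumes the model keeps a `Sentiment:` label and a numbered list, which is not guaranteed from one call to the next; `parse_review_analysis` is a hypothetical helper.

```python
import re

SENTIMENT_RE = re.compile(r"^sentiment:\s*(\w+)", re.IGNORECASE)
ITEM_RE = re.compile(r"^\d+\.\s*(.+?)\.?$")


def parse_review_analysis(text):
    """Extract the sentiment label and numbered topics from a model answer."""
    sentiment = None
    topics = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        sentiment_match = SENTIMENT_RE.match(line)
        if sentiment_match and sentiment is None:
            sentiment = sentiment_match.group(1).lower()
            continue
        item_match = ITEM_RE.match(line)
        if item_match:
            topics.append(item_match.group(1))
    return {"sentiment": sentiment, "topics": topics}


sample = """Sentiment: Positive

Key Topics or Phrases:
1. Outstanding experience
2. Knowledgeable agent
3. Patient and incredibly responsive
4. Smooth and enjoyable process
5. Guided through every step."""

print(parse_review_analysis(sample))
```

If the model changes its layout, the function returns `None` for the sentiment or an empty topic list instead of raising, so callers can detect the miss.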

In [17]:
# with Gemma
gemma_output = query(payload={
              "inputs": prompt,
              "parameters": gemma_params
              },
              MODEL_API_URL=GEMMA2B_IT_API_URL)

response = gemma_output[0]['generated_text']
display(Markdown(response))

API Response: <Response [200]>


    The only downside was the high price of the property.

Overall, I would rate my experience with this real estate company as 4 out of 5 stars.

**Sentiment:**
The sentiment of the review is positive. The customer is expressing satisfaction with the service provided by the real estate agent and the overall experience.

**Key Topics:**
1. Excellent agent knowledge and patience
2. Smooth and enjoyable process
3. High price of the property
4. Responsive agent
5. Overall positive experience
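Gemma's answer above begins by continuing the review (the sentence about the high price and the star rating are not in the input) before analyzing it. One plausible explanation, offered here as an assumption and not a diagnosis, is that the raw text-generation endpoint received the prompt without the model's chat formatting. The sketch below wraps a prompt in the turn markers published for Gemma and Mistral instruct models; the helper names are hypothetical, and the exact templates should be checked against each model card.

```python
def format_gemma_prompt(user_message):
    """Wrap a message in Gemma-style turn markers (assumed template)."""
    return (
        f"<start_of_turn>user\n{user_message.strip()}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


def format_mistral_prompt(user_message):
    """Wrap a message in Mistral-style instruction markers (assumed template)."""
    return f"[INST] {user_message.strip()} [/INST]"


print(format_gemma_prompt("Find the sentiment of this review: Great service!"))
print(format_mistral_prompt("Find the sentiment of this review: Great service!"))
```

The formatted string would replace the plain prompt in the `inputs` field of the payload.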