### Invoke Model

In AWS Bedrock, invoking a model means sending a request to a foundation model (FM) hosted on Bedrock to generate text, embeddings, or other outputs. You don’t have to manage any infrastructure; you just call the API.

In [1]:
!pip install boto3

This installs the AWS SDK for Python (boto3) into your environment so you can call AWS services from Python.

In [3]:
import boto3
import json

This imports the AWS SDK (boto3) to talk to AWS and the standard json library to build and parse JSON data.

In [4]:
bedrock_runtime = boto3.client('bedrock-runtime')
model_id = "amazon.nova-micro-v1:0"

- boto3.client('bedrock-runtime') → Creates a client object for the AWS Bedrock Runtime service using the boto3 SDK. This client lets your Python program send requests to invoke foundation models hosted on AWS Bedrock. The region and credentials come from your default AWS configuration unless you pass them explicitly.
- model_id = "amazon.nova-micro-v1:0" → Defines the specific foundation model you want to use (in this case, Amazon’s Nova Micro model, version v1:0).

In [5]:
payload = {
    "messages":[{"role":"user",
                 "content":[{"text":"what is capital of Nepal"}]}],
    "inferenceConfig":{
        "maxTokens":50,
        "temperature":0.5,
        "topP":0.9
    }
}

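- role → Defines who is speaking in a message ("user" or "assistant"). For Nova models, a system prompt is normally supplied in a separate top-level system field rather than as a message role.
- content → The actual input, given as a list of content blocks (here a single text block).
- inferenceConfig → Controls how the model generates its response.
- maxTokens → The maximum number of tokens (words or word pieces) the model can generate in its reply. If the limit is reached, the reply is cut off and stopReason is reported as max_tokens.
- temperature → Controls randomness/creativity in the output. Range: 0.0 (close to deterministic, predictable) → 1.0 (more creative, varied).
- topP → Probability threshold for sampling tokens (a technique called nucleus sampling). The model samples only from the smallest set of most likely tokens whose cumulative probability reaches topP.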
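
To make the topP idea more concrete, here is a toy sketch of nucleus filtering in plain Python. The function name nucleus_filter and the token probabilities are made up for illustration, and this is only a common textbook formulation, not a description of how Bedrock implements sampling internally.

```python
def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    # Rank candidate tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1 before sampling.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution (illustrative numbers only).
toy_probs = {"Kathmandu": 0.70, "Pokhara": 0.15, "Lalitpur": 0.10, "Delhi": 0.05}
nucleus = nucleus_filter(toy_probs, top_p=0.9)
print(sorted(nucleus))
```

With these numbers the least likely token falls outside the nucleus, so it can never be sampled. Lowering top_p shrinks the candidate set further.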

In [6]:
model_invoke = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(payload))

- bedrock_runtime.invoke_model(...) → Calls the AWS Bedrock Runtime client to send a request to the foundation model you specified earlier.
- body=json.dumps(payload) → Converts the Python dictionary payload into a JSON string, which is the format the API expects for the request body.
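
If you send several prompts, it can help to wrap the payload construction in a small function. The sketch below is one possible way to do that. build_payload is a hypothetical helper (not part of boto3), and the 0.0 to 1.0 range checks simply mirror the ranges described above rather than any limits confirmed by the service.

```python
import json

def build_payload(prompt, max_tokens=50, temperature=0.5, top_p=0.9):
    """Hypothetical helper: build a request body in the same shape as the payload above."""
    # Assumed ranges, taken from the parameter descriptions in this notebook.
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature is expected to be between 0.0 and 1.0")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p is expected to be between 0.0 and 1.0")
    return {
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p,
        },
    }

# json.dumps turns the dictionary into the JSON string passed as body=...
body = json.dumps(build_payload("what is capital of Nepal"))
print(type(body).__name__)
```

The resulting string could then be passed as body=body to invoke_model in place of json.dumps(payload).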

In [7]:
model_invoke

{'ResponseMetadata': {'RequestId': '9ca62a20-d41a-4076-992f-64d8c26e4c0a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Sun, 04 Jan 2026 07:45:09 GMT',
   'content-type': 'application/json',
   'content-length': '465',
   'connection': 'keep-alive',
   'x-amzn-requestid': '9ca62a20-d41a-4076-992f-64d8c26e4c0a',
   'x-amzn-bedrock-invocation-latency': '360',
   'x-amzn-bedrock-cache-write-input-token-count': '0',
   'x-amzn-bedrock-cache-read-input-token-count': '0',
   'x-amzn-bedrock-output-token-count': '50',
   'x-amzn-bedrock-input-token-count': '5'},
  'RetryAttempts': 0},
 'contentType': 'application/json',
 'body': <botocore.response.StreamingBody at 0x205235309d0>}

In [8]:
response = json.loads(model_invoke['body'].read())
response

{'output': {'message': {'content': [{'text': "The capital of Nepal is Kathmandu. Kathmandu is not only the political capital but also the largest city of Nepal. It serves as the hub of the country's administrative, commercial, and cultural activities. The city is located in the central part of the"}],
   'role': 'assistant'}},
 'stopReason': 'max_tokens',
 'usage': {'inputTokens': 5,
  'outputTokens': 50,
  'totalTokens': 55,
  'cacheReadInputTokenCount': 0,
  'cacheWriteInputTokenCount': 0}}

In [9]:
response['output']['message']['content'][0]['text']

"The capital of Nepal is Kathmandu. Kathmandu is not only the political capital but also the largest city of Nepal. It serves as the hub of the country's administrative, commercial, and cultural activities. The city is located in the central part of the"
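
Notice that the reply above stops mid-sentence and stopReason is 'max_tokens'. A small helper can pull out the text and flag that kind of truncation in one step. This is only a sketch: extract_text is a hypothetical name, and sample_response is a shortened copy of the parsed response shown above so the snippet runs without calling AWS.

```python
def extract_text(response):
    """Join all text blocks in a parsed response and report whether it was cut off."""
    blocks = response["output"]["message"]["content"]
    text = "".join(block.get("text", "") for block in blocks)
    # 'max_tokens' means the model hit the maxTokens limit before finishing.
    truncated = response.get("stopReason") == "max_tokens"
    return text, truncated

# Shortened stand-in for the parsed response shown above.
sample_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [{"text": "The capital of Nepal is Kathmandu."}],
        }
    },
    "stopReason": "max_tokens",
    "usage": {"inputTokens": 5, "outputTokens": 50, "totalTokens": 55},
}

text, truncated = extract_text(sample_response)
print(text)
print("truncated:", truncated)
```

If truncated is True, raising maxTokens in inferenceConfig is the usual first thing to try.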

### Invoke Model with Response Stream

invoke_model_with_response_stream is the Bedrock API call that streams a model’s response back in real time, chunk by chunk.

#### Difference
- invoke_model → Returns the entire response only after the model finishes generating.
- invoke_model_with_response_stream → Sends back chunks of the response as they are produced, so you can start processing or displaying text immediately (like a live chat).

In [10]:
model_stream = bedrock_runtime.invoke_model_with_response_stream(modelId=model_id, body=json.dumps(payload))

In [11]:
model_stream

{'ResponseMetadata': {'RequestId': '8dfb4567-4d7a-40a0-bf65-94c3ba940964',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Sun, 04 Jan 2026 07:51:59 GMT',
   'content-type': 'application/vnd.amazon.eventstream',
   'transfer-encoding': 'chunked',
   'connection': 'keep-alive',
   'x-amzn-requestid': '8dfb4567-4d7a-40a0-bf65-94c3ba940964',
   'x-amzn-bedrock-content-type': 'application/json'},
  'RetryAttempts': 0},
 'contentType': 'application/json',
 'body': <botocore.eventstream.EventStream at 0x20523531670>}

In [12]:
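full_text = ""
# Each item in the event stream is a dict; text arrives inside "chunk" events.
for event in model_stream["body"]:
    if "chunk" in event:
        # The chunk payload is JSON encoded as UTF-8 bytes.
        chunk = event["chunk"]["bytes"]
        data = json.loads(chunk.decode("utf-8"))
        # Only contentBlockDelta events carry newly generated text.
        if "contentBlockDelta" in data:
            delta = data["contentBlockDelta"]["delta"]
            if "text" in delta:
                # Print immediately and keep a copy of the full reply.
                print(delta["text"], end="", flush=True)
                full_text += delta["text"]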

The capital of Nepal is Kathmandu. Kathmandu is not only the political capital but also the largest city of Nepal. It serves as the center of the country's culture, history, and economy. The city is located in the Kathmandu Valley and is surrounded

This code processes the streamed response from a Bedrock model, printing each text chunk in real time while also concatenating the chunks into full_text for later use. Each decoded chunk contains one event. The event types in Bedrock streaming include messageStart, contentBlockStart, contentBlockDelta, contentBlockStop, messageStop, and metadata, each marking a different stage of the model’s streamed output.

- messageStart → Indicates the start of a new message in the stream. It typically carries the role of the message (assistant).
- contentBlockStart → Signals the beginning of a new content block within the message. It identifies the block rather than carrying generated text.
- contentBlockDelta → The most frequent type; it carries incremental text output.
- contentBlockStop → Marks the end of a content block. Useful for knowing when the model has finished that block.
- messageStop → Signals the end of the entire streamed message and typically reports why generation stopped (the stop reason).
- metadata → Provides extra information about the generation process (e.g., token counts, latency). Not text, but useful for logging or monitoring.
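
The event types listed above can be handled in a single loop. The sketch below collects the text, the stop reason, and the usage data from a stream. To keep it runnable without AWS, it uses hand-written events built by a hypothetical fake_event helper. The field names inside each simulated event (role, stopReason, usage) are assumptions about the event layout, so check them against a real stream before relying on them.

```python
import json

def collect_stream(events):
    """Gather text, stop reason, and usage from decoded stream events."""
    text_parts, stop_reason, usage = [], None, None
    for event in events:
        if "chunk" not in event:
            continue
        data = json.loads(event["chunk"]["bytes"].decode("utf-8"))
        if "contentBlockDelta" in data:
            # Incremental text output.
            text_parts.append(data["contentBlockDelta"]["delta"].get("text", ""))
        elif "messageStop" in data:
            # Assumed field name for why generation ended.
            stop_reason = data["messageStop"].get("stopReason")
        elif "metadata" in data:
            # Assumed field name for token counts.
            usage = data["metadata"].get("usage")
        # messageStart, contentBlockStart, and contentBlockStop carry no text.
    return "".join(text_parts), stop_reason, usage

def fake_event(payload):
    """Hypothetical helper: wrap a dict the way the stream delivers chunks."""
    return {"chunk": {"bytes": json.dumps(payload).encode("utf-8")}}

# Simulated events standing in for model_stream["body"].
simulated = [
    fake_event({"messageStart": {"role": "assistant"}}),
    fake_event({"contentBlockDelta": {"delta": {"text": "The capital of Nepal "}}}),
    fake_event({"contentBlockDelta": {"delta": {"text": "is Kathmandu."}}}),
    fake_event({"contentBlockStop": {}}),
    fake_event({"messageStop": {"stopReason": "end_turn"}}),
    fake_event({"metadata": {"usage": {"inputTokens": 5, "outputTokens": 8}}}),
]

full_text, stop_reason, usage = collect_stream(simulated)
print(full_text)
print(stop_reason, usage)
```

With a real response, collect_stream(model_stream["body"]) would take the place of the simulated list. Note that an event stream can only be iterated once.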