## Asyncio

**You can only use await inside of functions created with async def**
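
A minimal sketch of this rule, using only the standard library. The names `fetch_answer`, `main`, and `broken` are illustrative and not part of any API.

```python
import asyncio


async def fetch_answer() -> int:
    # `await` is legal here because we are inside an `async def` function.
    await asyncio.sleep(0.01)  # stands in for real I/O
    return 42


async def main() -> int:
    # Awaiting a coroutine runs it to completion and hands back its result.
    return await fetch_answer()


# Outside `async def`, start the event loop explicitly instead of awaiting.
# (In a Jupyter notebook, which already runs a loop, write `await main()`.)
answer = asyncio.run(main())
print(answer)

# Using `await` inside a plain `def` is rejected when the code is compiled.
rejected = False
try:
    compile("def broken():\n    await fetch_answer()\n", "<example>", "exec")
except SyntaxError:
    rejected = True
print("plain def with await rejected:", rejected)
```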

Multiprocessing

### Difference between Threads and Coroutines
Threads and coroutines in Python are both mechanisms for achieving concurrency, but they are fundamentally different in how they operate and are managed.

**Threads** are managed by the operating system. Each thread is a separate flow of execution with its own stack, but all threads in a process share the same memory space. The OS schedules threads preemptively, and on multi-core CPUs they can run in parallel. In general this makes threads suitable for work that needs true parallelism, such as CPU-bound operations; in CPython, however, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so threads help most with blocking I/O. Threads also carry overhead from context switching and require careful synchronization when accessing shared resources.

**Coroutines** are a programming construct managed at the language level, not by the OS. In Python, a coroutine is a function defined with `async def` that can pause at an `await` and resume later, letting other coroutines run in the meantime. This is cooperative multitasking: only one coroutine runs at a time within a **single thread**, and each one explicitly yields control. Coroutines are lightweight, have much lower overhead than threads, and are especially well suited to I/O-bound workloads with many concurrent tasks.
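
Cooperative switching can be observed with a small sketch. The `worker` coroutine and the `log` list are illustrative names; `asyncio.sleep(0)` is used here only as an explicit point where a coroutine hands control back to the event loop.

```python
import asyncio
import threading

log = []  # (worker name, step, thread id) in the order the steps ran


async def worker(name: str, steps: int) -> None:
    for step in range(steps):
        log.append((name, step, threading.get_ident()))
        # Explicit yield point: control returns to the event loop,
        # which then resumes the other waiting coroutine.
        await asyncio.sleep(0)


async def main() -> None:
    # Both workers run concurrently, but never at the same instant.
    await asyncio.gather(worker("A", 3), worker("B", 3))


asyncio.run(main())  # in a notebook, use `await main()` instead

for name, step, thread_id in log:
    print(name, step, thread_id)
```

The two workers take turns at each `await`, and every entry reports the same thread id, which is the single-thread behavior described above.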

| Feature         | Threads                                 | Coroutines                                   |
|-----------------|-----------------------------------------|-----------------------------------------------|
| **Management**  | OS-level                               | Language/runtime-level                        |
| **Parallelism** | Can run in parallel (multi-core CPUs)  | Run cooperatively within a single thread      |
| **Overhead**    | High (context switching, memory)       | Low (lightweight, minimal context switching)  |
| **Use case**    | Blocking I/O; CPU-bound work (GIL-limited in CPython) | I/O-bound, high concurrency, async tasks      |
| **Control**     | Preemptive (OS decides when to switch) | Cooperative (programmer decides when to yield)|


In summary, threads are scheduled preemptively by the OS and can run in parallel, while coroutines multitask cooperatively under the control of the Python runtime. Coroutines are not a replacement for threads, but they offer a more efficient and simpler way to handle many concurrent I/O-bound tasks in Python.

### Concurrency and Parallelism: Key Concepts

#### What Is Concurrency?
Concurrency in computing means the ability of a system to handle multiple tasks or operations at (almost) the same time. These tasks may not actually run simultaneously; instead, they are managed so that progress is made on each over a period of time. The operating system or language runtime switches between them efficiently, giving the illusion that they are happening together.

* Concurrency is about dealing with many things at once, not necessarily doing them at the same instant.
* Examples: Handling multiple network requests in a web server, or responsive user interfaces where several activities seem active together.

#### What Is Parallelism?
Parallelism refers to actually performing multiple computations or tasks at the same time. This is only possible on hardware with multiple processors or cores.

* Parallelism is about doing many things simultaneously, leveraging the hardware’s ability to run code on different cores or machines.
* Examples: Processing parts of a large image on different CPU cores, or running multiple calculations at the same time in scientific computing.

#### How Do Threads and Coroutines Relate?

**Threads**

* Threads are a way to achieve both concurrency and, on multi-core systems, parallelism.
* In Python, threads allow multiple parts of a program to appear to run independently. Operating systems can truly run threads in parallel on different CPU cores.
* However, due to Python’s Global Interpreter Lock (GIL), true parallelism is limited for CPU-bound tasks in standard CPython. For I/O-bound programs (like waiting for files or network), threads are very effective.

| Feature                | Threads                                   |
|------------------------|-------------------------------------------|
| Managed by             | Operating system                          |
| Can provide parallelism| Yes, on multi-core CPUs                   |
| Use-case               | Both concurrency and parallelism          |
| Typical tasks          | I/O- or CPU-bound                         |
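
The I/O-bound case can be sketched with the standard library alone. Here `fake_io` is a hypothetical stand-in for a blocking network or file call, and `time.sleep` is used because, like real blocking I/O, it releases the GIL while waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(task_id: int) -> int:
    # Simulates a blocking I/O call; the GIL is released during the sleep,
    # so the other threads can make progress in the meantime.
    time.sleep(0.2)
    return task_id


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order in its results.
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f}s")
```

Run sequentially, the four calls would take about 0.8 seconds; with four threads the waits overlap, so the total should be close to a single wait.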


**Coroutines**

* Coroutines are a lightweight form of concurrency, managed by Python itself (not the OS).
* They enable the program to switch between tasks at certain points (like when one is waiting for data), but only one coroutine runs at a time within a single thread.
* Coroutines do not give true parallelism (all run on the same thread) but can provide very high concurrency for I/O-bound programs, such as thousands of simultaneous web requests.

| Feature                | Coroutines                                |
|------------------------|-------------------------------------------|
| Managed by             | Python interpreter (language-level)       |
| Can provide parallelism| No (single thread, cooperative switching) |
| Use-case               | High concurrency, I/O-bound workloads     |
| Typical tasks          | Asynchronous network/server programming   |
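
To give a feel for the scale coroutines allow, the sketch below runs one thousand simulated requests on a single thread. `fake_request` is a hypothetical placeholder; a real program would await a network call where this one awaits `asyncio.sleep`.

```python
import asyncio
import threading
import time


async def fake_request(i: int) -> int:
    # While this coroutine waits, the event loop runs the others.
    await asyncio.sleep(0.1)
    return threading.get_ident()  # record which thread served the request


async def main(n: int = 1000) -> list:
    # Schedule all n coroutines at once and wait for every result.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


start = time.perf_counter()
thread_ids = asyncio.run(main())  # in a notebook, use `await main()`
elapsed = time.perf_counter() - start

print(len(thread_ids), "requests on", len(set(thread_ids)), "thread(s),", f"{elapsed:.2f}s")
```

All requests report the same thread id, and the total time stays far below the 100 seconds that one thousand sequential 0.1-second waits would need.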


#### Comparing Concurrency and Parallelism in Python

| Mechanism    | Concurrency | Parallelism | Where Used       | When to Use                                   |
|--------------|-------------|-------------|------------------|-----------------------------------------------|
| Threads      | Yes         | Limited*    | OS-level tasks   | I/O-bound (good), CPU-bound (limited by GIL)  |
| Coroutines   | Yes         | No          | async/await code | High I/O concurrency, very lightweight        |

*Python threads can use multiple cores only in implementations without the GIL. In standard CPython, multi-core parallelism for Python code requires separate processes, for example via the `multiprocessing` module.

#### Summary

* Concurrency is about managing many tasks efficiently, switching between them.
* Parallelism is about running multiple tasks at the same time, on separate processors/cores.
* Threads can give both, but in CPython the GIL prevents CPU-bound Python code from running in parallel. Threads remain effective for I/O-bound work because the GIL is released while a thread waits on I/O.
* Coroutines offer high concurrency by pausing and resuming tasks in the same thread, with minimal resource use, but do not provide actual parallel execution.

The `asyncio` module does not work, or is only partially available, in Python environments that run on WebAssembly platforms.

The main reason is that these platforms do not support many of the low-level system features that `asyncio` and its default event loop rely on, such as sockets, subprocesses, and threads.

Therefore, run the examples in this guide in a regular terminal environment where all necessary system features are available.

## Streaming in OpenAI API
Streaming makes large language model (LLM) apps feel faster and more interactive. Instead of waiting for the full response to be generated, the API sends it in small parts (token by token) as soon as they’re ready. This improves user experience, lets the client process output incrementally instead of buffering it all, allows early cancellation, and handles long responses more efficiently.

In OpenAI's API, this is enabled by setting `stream=True` in the request. When streaming is enabled, the response is delivered using **Server-Sent Events (SSE)**: the HTTP response is kept open and data is sent incrementally, typically with chunked transfer encoding.
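
To make the SSE framing concrete, here is a rough sketch of what a client does with the raw response body. The `raw_body` string is a hypothetical, heavily simplified payload (real chunks carry more fields such as `id` and `model`), and `iter_sse_data` is an illustrative helper, not part of the OpenAI client, which handles this parsing for you.

```python
import json

# Hypothetical, simplified SSE body: each event is a `data:` line followed
# by a blank line, and the stream ends with a `[DONE]` sentinel.
raw_body = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    "data: [DONE]\n\n"
)


def iter_sse_data(body: str):
    """Yield the decoded JSON payload of each `data:` line until [DONE]."""
    for line in body.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separators and any other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Concatenate the incremental text carried by each event.
text = "".join(
    event["choices"][0]["delta"].get("content") or ""
    for event in iter_sse_data(raw_body)
)
print(text)
```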

Below, we’ll look at how responses differ with and without streaming, then explore how tools like LangChain and LangGraph build on this concept using Python's async and await.

In [43]:
import dotenv
dotenv.load_dotenv()

True

### Synchronous OpenAI API Call **without** Streaming
This code invokes the OpenAI API using the synchronous Python client and does not use streaming. The entire model output is returned in one HTTP response, and the call is blocking: program execution pauses until the response arrives. This approach is straightforward and suitable for scripts or applications where concurrent handling of multiple requests or real-time partial output is not required.
* “Synchronous” means the program waits for the API call to complete before moving forward.
* “Blocking” means no other statements after the API call are executed until the full response is available.


In [46]:
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {
            "role": "user",
            "content": "Say 'double bubble bath' ten times slowly.",
        },
    ],
)
print("typeof(response):", type(response), "\n")
print(response.choices[0].message.content)

typeof(response): <class 'openai.types.chat.chat_completion.ChatCompletion'> 

Sure! Here it is:

double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.


### Synchronous OpenAI API Call **with** Streaming
When `stream=True` is set, the OpenAI API keeps the HTTP connection open and transmits the model’s output incrementally as it is generated. This allows your program to receive and process each chunk of the response in real time, rather than waiting for the complete output. Internally, the API sends these chunks using server-sent events and chunked transfer encoding, enabling your application to start displaying or utilizing the model’s reply as soon as the first data arrives.


In [66]:
# This script demonstrates how to use the OpenAI Python client to create a response with stream=True

from openai import OpenAI
from types import GeneratorType
from collections.abc import Iterator

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {
            "role": "user",
            "content": "Say 'double bubble bath' ten times slowly.",
        },
    ],
    stream=True,
)
print("typeof(stream):", type(stream), " Is Generator: ", isinstance(stream, GeneratorType), " Is Iterator: ", isinstance(stream, Iterator), "\n")

typeof(stream): <class 'openai.Stream'>  Is Generator:  False  Is Iterator:  True 



In [67]:
# Pull the first three chunks by hand with the built-in next()
print(next(stream).choices[0].delta.content)
print(next(stream).choices[0].delta.content)
print(next(stream).choices[0].delta.content)


Sure
!


In [68]:
for chunk in stream:
    # The final chunk carries delta.content = None; print nothing for it
    print(chunk.choices[0].delta.content or "", end="", flush=True)

 Double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.
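
When the full text is needed after streaming, the pieces can be collected as they arrive. The sketch below assumes chunk objects shaped like the ones used in this section (`chunk.choices[0].delta.content`), where the last chunk usually has a `content` of `None`. `collect_text` and `fake_chunk` are illustrative helpers, and the fake chunks stand in for a real `stream` so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_text(chunks: Iterable) -> str:
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # defensively skip chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content is not None:  # the closing chunk has no text
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    # Hypothetical stand-in mimicking only the attributes read above.
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [fake_chunk(""), fake_chunk("double "), fake_chunk("bubble"), fake_chunk(None)]
full_text = collect_text(demo_stream)
print(repr(full_text))
```

With a real request, `collect_text(stream)` could be called on the object returned by `client.chat.completions.create(..., stream=True)`.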

In [None]:
"""
for event in stream:

    print("****" * 20)
    print(json.dumps(event.model_dump(), indent=2))
    print("@@@@" * 20)
    print(event)
    print("----" * 20)

    print(event.choices[0].delta.content, end="", flush=True)
"""

### Asynchronous OpenAI API Call


In [None]:
import asyncio
from openai import AsyncOpenAI

# Pass your key explicitly, or omit api_key to read OPENAI_API_KEY from the environment
client = AsyncOpenAI(api_key="<your key>")


async def main():
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": "Write a one-sentence bedtime story about a unicorn."}
        ],
        stream=True,
    )

    async for event in stream:
        print(event)


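# In a script, asyncio.run() starts the event loop. In a Jupyter notebook,
# which already runs a loop, use `await main()` instead.
asyncio.run(main())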
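
The benefit of the async client shows up when several streams are consumed at once. The sketch below illustrates the `async for` pattern without calling the API: `fake_stream` is a hypothetical async generator standing in for the object returned by `client.chat.completions.create(..., stream=True)`, and `consume` is an illustrative helper.

```python
import asyncio


async def fake_stream(tokens, delay: float = 0.01):
    # Async generator: yields one token at a time, pausing between tokens
    # the way a network stream pauses between chunks.
    for token in tokens:
        await asyncio.sleep(delay)
        yield token


async def consume(tokens) -> str:
    parts = []
    # `async for` awaits each item, so other tasks can run between chunks.
    async for token in fake_stream(tokens):
        parts.append(token)
    return "".join(parts)


async def main() -> list:
    # Two streams are consumed concurrently on one thread;
    # gather returns the results in the order the coroutines were passed.
    return await asyncio.gather(
        consume(["Once ", "upon ", "a ", "time."]),
        consume(["The ", "end."]),
    )


stories = asyncio.run(main())  # in a notebook, use `await main()`
print(stories)
```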

### Generator `yield` and Generator Expressions