# Building Apps With AI Models

### Introduction

<img src="./ailabs.jpeg" alt="AI Model Architecture" width="600"/>


What you'll learn in this workshop:

1. What AI models are and the types available.
2. How to build an application on top of AI models.
3. How to build a real-world AI-powered voice assistant.




# 1. What Is an AI Model?

**AI models** are computer programs trained to perform tasks that typically require human intelligence. These tasks include understanding language, recognizing images, making decisions, translating languages, and more.

#### In simple terms:

<img src="./brain.png" alt="brain Architecture" width="200"/>

An **AI model** is like a brain that a computer uses to solve problems or understand things — after being taught using examples (data).



### Types of AI Models

#### 1. **Machine Learning Models**

These models learn patterns from data and use those patterns to make predictions or decisions.

<img src="./ml.png" alt="ml " width="600"/>

**Examples:**

* **Linear Regression** – Predicts continuous values (e.g., price of a house).
* **Decision Trees** – Classifies inputs into categories (e.g., spam vs. not spam).



#### 2. **Deep Learning Models**

These models use **neural networks**, layers of interconnected nodes loosely inspired by how the human brain works, to process more complex data such as images, audio, and text.

<img src="./dl.png" alt="deep learning" width="600"/>

**Examples:**

* **CNNs (Convolutional Neural Networks)** – Used for image and video recognition (e.g., detecting cats in photos).
* **RNNs (Recurrent Neural Networks)** – Used for processing sequences, such as text or speech.



#### 3. **Large Language Models (LLMs)**

LLMs are trained on massive amounts of text to **understand**, **generate**, and **reason with** human language.

<img src="./llm.png" alt="llm " width="600"/>

**Examples:**

* **GPT-4** – Powers ChatGPT, writes emails, summarizes documents, answers questions.
* **Claude**, **Gemini**, **Mistral** – Other modern LLMs used for tasks like search, coding, tutoring, and more.



> ✅ **Note:**
> For this workshop, **we'll mainly focus on Large Language Models (LLMs)** especially those that work via APIs like **OpenAI’s GPT-4**, **Anthropic’s Claude**, and **Google’s Gemini** to help you build apps that can chat, summarize, generate content, and assist users intelligently.






# 2. How to Use AI Models

There are **three main ways** to use AI models in your applications, depending on your goals, resources, and technical setup.



### 1. **Direct Model Inference**

You **download or load** the AI model directly into your code and run it locally or on your own server. This typically involves tools and libraries such as **Ollama** or the **Hugging Face Transformers** `pipeline`.


<img src="./inference.png" alt="inference " width="600"/>

**Good for:**

* Full control over model behavior
* Running models offline (air-gapped or private environments)
* Custom training and fine-tuning on your own data

**Popular Open Source Providers:**

*  **[Hugging Face](https://huggingface.co/)** – Transformers (BERT, GPT-2, etc.)
*  **DeepSeek** – Open-source chat & coding LLMs
*  **OpenLLM** – Serving models with FastAPI integration
*  **Qwen (Alibaba)** – Qwen1.5, Qwen2 family of open LLMs
*  **Mistral AI** – Lightweight yet powerful models (Mistral 7B, Mixtral)

> 🔗 For hundreds of pre-trained models: [Visit Hugging Face](https://huggingface.co/)
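As a minimal sketch of direct inference, the snippet below calls a locally running Ollama server through its REST API using only the Python standard library. It assumes Ollama is installed and listening on its default port `11434`, and that you have already pulled a model; the `llama3.2` name is an assumption, so substitute whichever model you pulled.

```python
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(prompt, model="llama3.2"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3.2"):
    """Send the prompt to the local model and return its generated text."""
    with urlopen(build_generate_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    try:
        print(generate("Explain what an AI model is in one sentence."))
    except OSError as err:  # URLError subclasses OSError; raised if Ollama isn't reachable
        print(f"Could not reach Ollama at {OLLAMA_URL}: {err}")
```

Because the model runs on your machine, no API key is involved and the prompt never leaves your environment, which is the main appeal of this approach.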



### 2. **Frameworks**

Frameworks make it easier to **build AI-powered workflows** by abstracting complex tasks like chaining models, adding memory, retrieving documents, and logging prompts.


<img src="./frameworks.png" alt="frameworks " width="600"/>

**Good for:**

* Fast prototyping of AI agents
* Adding memory, tools, or APIs to models
* Building smart assistants and chatbots

**Popular Frameworks:**

*  **LangChain** – Chain LLMs with memory, tools, and APIs
*  **LlamaIndex** – Connect LLMs to your custom data (RAG)
*  **Haystack** – Production-ready RAG pipelines for enterprise
*  **PromptLayer** – Track, debug, and version prompt interactions
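To make "chaining" and "memory" concrete, here is a framework-free sketch of what libraries like LangChain automate: a prompt-building step, a model call, and an output parser composed into one chain, plus a conversation memory that feeds earlier turns back into each prompt. `fake_llm` is a hypothetical stand-in for a real model call, and none of these names are LangChain's actual API.

```python
from dataclasses import dataclass, field


def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM: replies to the last line of the prompt.

    The reply is padded with whitespace so the parser step has something to clean up.
    """
    last_line = prompt.strip().splitlines()[-1]
    return "  You said: " + last_line.split(": ", 1)[-1] + "  "


@dataclass
class ConversationMemory:
    """Stores past turns so each new prompt includes the conversation so far."""
    turns: list = field(default_factory=list)

    def add(self, role, text):
        self.turns.append((role, text))

    def as_text(self):
        return "\n".join(f"{role}: {text}" for role, text in self.turns)


def chain(*steps):
    """Compose steps left to right: each step's output is the next step's input."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def make_chatbot(llm, memory):
    def build_prompt(user_input):
        memory.add("User", user_input)
        return "You are a helpful assistant.\n" + memory.as_text()

    def parse_output(raw):
        answer = raw.strip()
        memory.add("Assistant", answer)
        return answer

    return chain(build_prompt, llm, parse_output)


if __name__ == "__main__":
    bot = make_chatbot(fake_llm, ConversationMemory())
    print(bot("Hello"))
    print(bot("What did I just say?"))
```

Swapping `fake_llm` for a function that calls a real model is all it takes to turn this into a working chatbot; frameworks add the same idea plus tools, retrieval, and logging.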



### 3. **3rd-Party API Providers**

You send a request to an **external API**, and it returns a response — no need to worry about servers, GPUs, or model training.

<img src="./aiprovider.png" alt="aiprovider " width="600"/>


**Good for:**

* Rapid deployment
* Scalable infrastructure
* Low maintenance

| **Provider**                   | **What They Offer**                                           |
| ------------------------------ | ---------------------------------------------------------------- |
| **OpenAI**                     | GPT-4, GPT-4o, DALL·E (image gen), Whisper (speech), fine-tuning |
| **Google (Gemini)**            | Text + multimodal AI (image, audio, video)                       |
| **Anthropic (Claude)**         | Friendly chatbots with long-context support                      |
| **Mistral (via platforms)**    | Open chat models like Mixtral (API via Hugging Face)             |
| **Cohere**                     | Text embeddings, generation, RAG APIs                            |
| **Replicate**                  | Access models like SDXL, Whisper, etc. via hosted APIs           |
| **Hugging Face Inference API** | Hosted models for text, image, speech, and translation           |
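Every provider in the table follows the same request/response pattern. As a sketch, here is a raw HTTPS call to OpenAI's Chat Completions endpoint using only the standard library; the official SDKs essentially wrap this kind of request for you. It assumes an `OPENAI_API_KEY` environment variable, and the model name is just an example.

```python
import json
import os
from urllib.request import Request, urlopen

API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(prompt, api_key, model="gpt-4.1-nano"):
    """Build an authenticated POST request containing a single user message."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt, api_key):
    """Send the request and return the assistant's reply text."""
    with urlopen(build_chat_request(prompt, api_key), timeout=60) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]


if __name__ == "__main__":
    key = os.getenv("OPENAI_API_KEY")
    if key:
        print(chat("Give me one tip for writing clear prompts.", key))
    else:
        print("Set OPENAI_API_KEY to send a real request.")
```

No GPU or model files are involved: the provider runs the model, and your app only handles JSON in and JSON out.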


## **Use Cases**

AI models come in different types, each designed to handle specific kinds of input, from text and images to audio and beyond. Let’s explore the most common categories and what they can do.

> ✅ **Note:**
> In this workshop, we'll primarily use OpenAI models (such as `gpt-4.1-nano`) to explore these use cases. These models offer powerful capabilities for working with text, images, and audio through easy-to-use APIs.

```python
# Install necessary packages
!pip install openai --quiet
!pip install python-dotenv --quiet
```

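```python
from openai import OpenAI
from dotenv import load_dotenv
import os

# Load OPENAI_API_KEY (and any other variables) from a local .env file
load_dotenv()

# Shared client reused by the cells below
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
```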



###  Text Models

Used for: Chatbots, summarization, translation, content generation

**Integration**: Use an API from OpenAI, Anthropic (Claude), or Google (Gemini)




```python
# Reuses the `client` created in the setup cell above
response = client.responses.create(
    model="gpt-4.1-nano",
    input="Write a one-sentence bedtime story about a unicorn."
)

# output_text is a convenience property that collects the generated text
print(response.output_text)
```

Sample output:

```text
Once upon a time, a gentle unicorn named Luna used her shimmering horn to light up the stars and guide sleepy creatures home through the moonlit night.
```





### Vision Models (Image + Video)

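Used for: Object detection, image classification, video analysis, image generation

**Integration**:

* Use Hugging Face models like `YOLOv8`, `CLIP`, or `Segment Anything`
* Or use OpenAI's vision-capable models (e.g., GPT-4.1) for image-to-text, and its image generation tool for text-to-image, as in the example below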


```python
import base64

# Ask the model to call the built-in image generation tool (reuses `client` from setup)
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Generate an image of a gray tabby cat hugging an otter with an orange scarf",
    tools=[{"type": "image_generation"}],
)

# Collect the base64-encoded image(s) returned by the image generation tool
image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

# Decode the first image and save it to a file
if image_data:
    image_base64 = image_data[0]
    with open("cat_and_otter.png", "wb") as f:
        f.write(base64.b64decode(image_base64))
```

> **Note:** Running this cell may fail with `PermissionDeniedError: Error code: 403`. Image generation uses the `gpt-image-1` model, which requires your OpenAI organization to be verified. Go to https://platform.openai.com/settings/organization/general and click **Verify Organization**; after verifying, access can take up to 15 minutes to propagate.
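The cell above covers text-to-image. For the image-to-text direction mentioned in the integration list, here is a sketch using the Responses API with an `input_image` content part. Passing the client in as a parameter (so you can reuse the notebook's `client`) and the `describe_image` / `to_image_url` helper names are illustrative choices, not part of the SDK; local files are sent as base64 data URLs.

```python
import base64
import mimetypes
import os


def to_image_url(path_or_url):
    """Return http(s) URLs unchanged; turn a local file into a base64 data URL."""
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url
    mime = mimetypes.guess_type(path_or_url)[0] or "image/png"
    with open(path_or_url, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def describe_image(client, path_or_url, question="What's in this image?", model="gpt-4.1-mini"):
    """Ask a vision-capable model a question about one image."""
    response = client.responses.create(
        model=model,
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": question},
                {"type": "input_image", "image_url": to_image_url(path_or_url)},
            ],
        }],
    )
    return response.output_text


# Only call the real API when a key is configured and the generated image exists
if __name__ == "__main__" and os.getenv("OPENAI_API_KEY") and os.path.exists("cat_and_otter.png"):
    from openai import OpenAI
    print(describe_image(OpenAI(), "cat_and_otter.png"))
```

In the notebook you can simply run `describe_image(client, "cat_and_otter.png")` after the generation cell succeeds.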



### Audio Models

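Used for: Speech-to-text (STT), text-to-speech (TTS), emotion detection

**Integration**:

* Use OpenAI Whisper for transcription
* Use Coqui TTS for speech generation
* Or use an audio-capable chat model such as `gpt-4o-audio-preview` to answer with spoken audio, as in the example below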



```python
import base64

# Ask an audio-capable model to answer in both text and speech (reuses `client` from setup)
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

# The audio payload is long base64 data, so print its transcript instead
audio = completion.choices[0].message.audio
print(audio.transcript)

# Decode the WAV data and save it to a file
wav_bytes = base64.b64decode(audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)
```

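The cell above covers text-to-speech; the integration list also names Whisper for the reverse direction. A sketch of speech-to-text using the OpenAI SDK's `audio.transcriptions.create` method, transcribing the `dog.wav` file saved by that cell. The `transcribe` helper name is illustrative, and the client is passed in so you can reuse the notebook's `client`.

```python
import os


def transcribe(client, audio_path, model="whisper-1"):
    """Send a local audio file to OpenAI's transcription endpoint and return the text."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Only call the real API when a key is configured and the audio file exists
if __name__ == "__main__" and os.getenv("OPENAI_API_KEY") and os.path.exists("dog.wav"):
    from openai import OpenAI
    # Round trip: turn the model's spoken answer back into text
    print(transcribe(OpenAI(), "dog.wav"))
```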

# 3. Choosing the Right Architecture

Below are five common architecture patterns for generative AI applications, along with their pros, cons, ideal use cases, and example apps that benefit from each approach.

## Comparison Table of Generative AI Architecture Patterns

| Architecture Pattern | Pros                                                                 | Cons                                                                 | Ideal Use Cases                                                   | Example Apps                                                                 |
|----------------------|----------------------------------------------------------------------|----------------------------------------------------------------------|-------------------------------------------------------------------|-------------------------------------------------------------------------------|
| **Serverless**        | Cost-effective, scales automatically, minimal maintenance             | Cold-start latency, limited GPU support, less control                | On-demand inference, sporadic usage                              | ChatGPT (API-based usage), DALL·E mini, Quick-Response Virtual Assistants    |
| **Microservices**     | Scalable, modular, fault-tolerant                                     | Complexity, inter-service latency, higher maintenance                | Complex AI pipelines, large systems needing modular services     | Spotify’s recommendation and personalization system, AI-powered content moderation |
| **Edge Computing**    | Low latency, privacy, reduced bandwidth usage                         | Device constraints, higher development costs, requires model optimization | Real-time AR/VR, smart healthcare devices, real-time security    | Snapchat Lenses, AI-driven AR filters, autonomous vehicles                   |
| **Hybrid Cloud**      | Data privacy, flexible resource allocation, control over sensitive data | Complex setup, higher operational costs, latency in data transfer    | Enterprise solutions, sensitive data handling                    | Financial recommendation systems, personalized healthcare diagnostics         |
| **Batch Processing**  | Efficient for bulk processing, cost-effective, high throughput         | Not real-time, resource-intensive, complex workflow                  | Content generation in bulk, data-intensive preprocessing         | Automated video generation, AI-based marketing content creators (like Jasper) |


1. Serverless Architecture
2. Microservices Architecture
3. Edge Computing
4. Hybrid Cloud Architecture
5. Batch Processing Architecture

## 1. Serverless Architecture for Generative AI Inference

Serverless architectures let cloud providers manage the infrastructure, which automatically scales based on demand, making it suitable for applications with unpredictable usage patterns.

<img src="./images/serverless.png" alt="serverless architecture" width="600"/>

**Examples:**


1.  Voice-based virtual assistants  – Respond to spoken commands using cloud functions.

2.  AI-powered resume scanners – Process uploads instantly and return results.

3.  Quiz generators for educators – Teachers generate quizzes from text input using serverless inference.
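As a sketch of the quiz-generator example, here is what a serverless function might look like on AWS Lambda behind API Gateway (an assumed platform; other providers use similar handler shapes). The handler is stateless: it parses the request, calls a model, and returns a response, which is what lets the platform run as many copies as demand requires. `generate_quiz` is a hypothetical placeholder for a real LLM call.

```python
import json


def generate_quiz(text, num_questions):
    """Hypothetical placeholder for an LLM call that turns text into quiz questions."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [f"Explain: {s}?" for s in sentences[:num_questions]]


def lambda_handler(event, context):
    """AWS Lambda entry point for an API Gateway proxy event (JSON string in event["body"])."""
    try:
        body = json.loads(event.get("body") or "{}")
        text = body["text"]
        num_questions = int(body.get("num_questions", 3))
    except (KeyError, TypeError, ValueError):  # ValueError also covers json.JSONDecodeError
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "Expected a JSON body with a 'text' field"}),
        }

    # Nothing is kept between invocations, so instances can be added or removed freely
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"questions": generate_quiz(text, num_questions)}),
    }
```

You can exercise it locally with `lambda_handler({"body": json.dumps({"text": "...", "num_questions": 2})}, None)` before deploying.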



##  2. Microservices Architecture for Modular AI Pipelines
In a microservices architecture, different parts of an AI application are divided into independently deployable services. This is ideal for complex systems where each component performs a distinct task in a larger pipeline.

<img src="./images/microservice.png" alt="microservices architecture" width="600"/>



1.  News aggregation bots – Modular services for scraping, summarizing, and classifying articles.
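A sketch of the news-aggregation idea: the summarizer and classifier each run as their own small HTTP service, and an orchestrator calls them over the network and merges the results. To stay self-contained, both services run on localhost with the standard library's `http.server`; in production each would be a separately deployed container. `summarize` and `classify` are hypothetical stand-ins for LLM-backed services.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener

# Talk to localhost directly, ignoring any proxy environment variables
_opener = build_opener(ProxyHandler({}))


def make_handler(process):
    """Build a handler class that runs `process` on each POSTed JSON body."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(process(payload)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging for the demo

    return Handler


def start_service(process):
    """Run one service on a free localhost port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), make_handler(process))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"


def call_service(url, payload):
    """POST a JSON payload to a service and decode its JSON reply."""
    request = Request(url, data=json.dumps(payload).encode("utf-8"),
                      headers={"Content-Type": "application/json"}, method="POST")
    with _opener.open(request, timeout=5) as response:
        return json.loads(response.read())


# Hypothetical stand-ins for LLM-backed services
def summarize(payload):
    return {"summary": payload["text"].split(".")[0] + "."}


def classify(payload):
    text = payload["text"].lower()
    for keyword, label in {"election": "politics", "match": "sports", "stock": "business"}.items():
        if keyword in text:
            return {"category": label}
    return {"category": "general"}


def process_article(text, summarize_url, classify_url):
    """Orchestrator: call each independent service over HTTP and merge the results."""
    summary = call_service(summarize_url, {"text": text})["summary"]
    category = call_service(classify_url, {"text": text})["category"]
    return {"summary": summary, "category": category}


if __name__ == "__main__":
    summarizer, summarize_url = start_service(summarize)
    classifier, classify_url = start_service(classify)
    print(process_article("The stock market rallied today. Analysts expect more gains.",
                          summarize_url, classify_url))
    for server in (summarizer, classifier):
        server.shutdown()
        server.server_close()
```

Because the orchestrator only knows each service's URL and JSON contract, you could replace the classifier with a real model or scale the summarizer independently without touching the other parts.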



## 3. Edge Computing for Real-Time Generative AI Applications
Edge computing places AI processing close to the user, reducing latency and improving response times. It’s particularly useful for applications requiring near-instant feedback or privacy-focused data handling.

<img src="./images/edgecomputing.png" alt="edge computing architecture" width="600"/>


1. Camera filter lenses (e.g., Snapchat Lenses) – Run AI models on the device for face filters and AR effects.

2. Wearable fitness trackers – Use AI models locally to analyze heart rate and activity patterns.




## 4. Hybrid Cloud Architecture for Model Training and Inference
Hybrid cloud combines cloud and on-premises resources, making it a strong choice for enterprises with specific data privacy or regulatory needs. Intensive model training can be done on-premises, while inference is handled in the cloud for scalability.

<img src="./images/Hybrid.png" alt="hybrid cloud architecture" width="600"/>


1. Healthcare AI apps – Train models on-premises with patient data, serve diagnostics from cloud.

2. Financial advisory systems – Local data storage for compliance, cloud for model inference.

3. Government AI platforms – On-prem training with sensitive data, scalable public cloud endpoints.

4. Retail personalization engines – Customer data resides locally; cloud handles ML inference and retraining.

5. Pharmaceutical research platforms – Train models locally with proprietary data, then share cloud-based APIs for predictions.


## 5. Batch Processing Architecture for Bulk Generative Tasks
Batch processing handles large volumes of data in a non-real-time manner, ideal for applications that generate content in bulk or perform heavy data preprocessing.


<img src="./images/batch.png" alt="batch processing architecture" width="600"/>

1. AI-based email campaign writers – Write and A/B test thousands of email variations.

2. Automated report generators – Corporate tools that create investor reports or client summaries in large batches overnight.
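For the email-campaign example, one way to run this as a batch job is OpenAI's Batch API, which takes a JSONL file where each line is one request tagged with a `custom_id`. The sketch below only builds and writes that file; the product names, tones, and prompt wording are illustrative. Uploading and starting the job are shown as comments and not executed here.

```python
import json
import os
import tempfile


def build_batch_lines(products, tones, model="gpt-4.1-nano"):
    """One Chat Completions request per (product, tone) pair, in Batch API JSONL format."""
    lines = []
    for product in products:
        for tone in tones:
            request = {
                "custom_id": f"{product}-{tone}",  # lets you match each result to its input
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{
                        "role": "user",
                        "content": f"Write a short {tone} marketing email for {product}.",
                    }],
                },
            }
            lines.append(json.dumps(request))
    return lines


def write_jsonl(path, lines):
    """Write one JSON object per line, as the Batch API expects."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    lines = build_batch_lines(["running shoes", "yoga mat"], ["friendly", "urgent"])
    path = os.path.join(tempfile.gettempdir(), "email_batch.jsonl")
    write_jsonl(path, lines)
    print(f"Wrote {len(lines)} requests to {path}")
    # Next steps (not run here), using the notebook's `client`:
    #   batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    #   client.batches.create(input_file_id=batch_file.id,
    #                         endpoint="/v1/chat/completions", completion_window="24h")
```

The trade-off matches the table above: results arrive within the completion window rather than in real time, in exchange for high throughput on thousands of variations.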

# 4. Real-World AI-Powered Call Assistant

In this section, we’ll build a real-time AI call assistant using **Next.js** and **Hume AI**, a voice AI platform whose models understand emotional cues and respond with natural, human-like conversation.

To get started:

1. **Clone the AI Cookbook repo**
   Click [this link](https://github.com/AmosMaru/ai_cookbook.git) to visit the **AI Cookbook** and clone the repository:

   ```bash
   git clone https://github.com/AmosMaru/ai_cookbook.git
   cd ai_cookbook
   ```

2. **Install dependencies**

   ```bash
   npm install
   ```

3. **Get Hume AI credentials**
   Go to the [Hume AI Console](https://console.hume.ai/), create an account, and copy your API key and secret key.

   In your `.env` file (you can start from a copy of `.env.example`), add:

   ```env
   HUME_API_KEY="<YOUR API KEY>"
   HUME_SECRET_KEY="<YOUR SECRET KEY>"
   ```

4. **Run the app**

   ```bash
   npm run dev
   ```

5. **Customize your assistant**
   Next, we’ll update the configuration to define our own **system prompt** and behavior for the AI call assistant — such as tone, personality, and role. After editing, **save the model** to lock in the assistant’s persona and instructions.
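Why two keys? They are meant to stay on the server: a common pattern is to exchange them for a short-lived access token that the browser then uses to open the voice connection. The Next.js app handles this itself; below is a hedged illustration in Python, assuming Hume's OAuth2 client-credentials endpoint `https://api.hume.ai/oauth2-cc/token` and an `access_token` field in the reply. Verify both against Hume's current authentication docs before relying on them.

```python
import base64
import json
import os
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint from Hume's token-authentication docs; verify before use
TOKEN_URL = "https://api.hume.ai/oauth2-cc/token"


def build_token_request(api_key, secret_key):
    """Client-credentials request: HTTP Basic auth with the API key / secret key pair."""
    credentials = base64.b64encode(f"{api_key}:{secret_key}".encode()).decode("ascii")
    return Request(
        TOKEN_URL,
        data=urlencode({"grant_type": "client_credentials"}).encode(),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(api_key, secret_key):
    """Return the short-lived token (assumes an `access_token` field in the JSON reply)."""
    with urlopen(build_token_request(api_key, secret_key), timeout=30) as resp:
        return json.loads(resp.read())["access_token"]


if __name__ == "__main__":
    key, secret = os.getenv("HUME_API_KEY"), os.getenv("HUME_SECRET_KEY")
    if key and secret:
        token = fetch_access_token(key, secret)
        print("Got an access token ending in", token[-6:])
    else:
        print("Set HUME_API_KEY and HUME_SECRET_KEY first.")
```

Only the token, never the secret key, should reach the browser; that is why the keys live in a server-side `.env` file.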

Ready? Let’s build something that sounds human.
