# AI Domains: NLP, Speech, and Computer Vision

This notebook explores **three practical domains of AI**: Natural Language Processing (NLP), Speech Recognition, and Computer Vision.

You'll see how models interpret human language, convert speech to text, and perceive the visual world. Together, these capabilities form the foundation of human-AI interaction.

---
### Objectives
- Define NLP, Speech, and Vision in AI.
- Perform text tokenization, sentiment analysis, and entity recognition.
- Demonstrate basic speech recognition and text-to-speech synthesis.
- Apply pre-trained neural networks for image recognition.
- Understand how these domains integrate into multimodal AI systems.

## 💬 1. Natural Language Processing (NLP)

Natural Language Processing enables computers to **understand, interpret, and generate human language**.

NLP involves several key subfields:
- **Tokenization:** Breaking sentences into smaller parts (tokens).
- **Part-of-Speech Tagging:** Identifying grammatical roles.
- **Named Entity Recognition (NER):** Detecting entities like people, organizations, or locations.
- **Sentiment Analysis:** Determining emotional tone.
- **Language Generation:** Producing text that reads naturally.

Let's see this in action using Hugging Face's pre-trained models.

In [None]:
from transformers import pipeline

# Example text
text = "IBM Watson and OpenAI are transforming the AI landscape in healthcare and finance."

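# Named Entity Recognition
# aggregation_strategy='simple' merges sub-word tokens into whole entities
# (it replaces the deprecated grouped_entities=True argument)
ner = pipeline('ner', aggregation_strategy='simple')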
entities = ner(text)
print('Named Entities:', entities)

# Sentiment Analysis
sentiment = pipeline('sentiment-analysis')
result = sentiment("Artificial intelligence is revolutionizing technology.")
print('Sentiment:', result)
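
Tokenization, the first item in the list above, can be illustrated without downloading a model. The following is a minimal sketch that assumes a simple regex rule: a token is either a run of word characters or a single punctuation mark. The helper name `simple_tokenize` is hypothetical, and pre-trained models typically use learned subword tokenizers rather than a rule like this.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative rule only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("IBM Watson and OpenAI are transforming the AI landscape.")
print(tokens)
```

Keeping punctuation as separate tokens lets later steps, such as sentence splitting, see it explicitly.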

### 🧠 Insights
NLP systems combine statistical, neural, and linguistic models to interpret meaning and emotion. They enable chatbots, translators, summarizers, and search engines.

## 🗣️ 2. Speech Recognition and Synthesis

- **Speech-to-Text (STT):** Converts spoken language into text.
- **Text-to-Speech (TTS):** Converts written text into spoken audio.

Together, these create seamless voice interfaces, used in assistants like Siri, Alexa, and Google Assistant.

We'll run TTS locally with `pyttsx3` and simulate STT with a placeholder function.

In [None]:
# Text-to-Speech demonstration
import pyttsx3

engine = pyttsx3.init()
engine.setProperty('rate', 160)
engine.say('Welcome to Artificial Intelligence speech synthesis demonstration.')
engine.runAndWait()
print('✅ Text spoken successfully.')

### 🧠 STT Concept (Simulated Example)

Real STT would use a library like `SpeechRecognition` or a cloud API like Google Cloud Speech-to-Text.
Here's a simulated transcription pipeline that returns a fixed transcript regardless of the input:

In [None]:
def simulated_stt(audio_input):
    print(f"[Simulated Recognition] Input: {audio_input}")
    text = "AI systems are enhancing communication."
    print(f"Transcribed Output: {text}")
    return text

simulated_stt('sample_audio.wav')

### 🔄 Integration Example
1. **STT:** Captures the user's voice input and converts it to text.
2. **NLP:** Interprets text for intent/sentiment.
3. **TTS:** Responds back to the user in natural speech.
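
The three steps above can be sketched as a single conversational turn. This is a minimal illustration under stated assumptions: every function here (`stt_stub`, `detect_intent`, `tts_stub`, `voice_assistant_turn`) is a hypothetical stand-in, the intent detector is a toy keyword rule rather than a trained model, and no audio is actually read or played.

```python
def stt_stub(audio_input):
    """Stand-in for speech recognition: returns a fixed transcript."""
    return "AI systems are enhancing communication."

def detect_intent(text):
    """Toy keyword-based intent detector (a real system would use an NLP model)."""
    lowered = text.lower()
    if "?" in text or lowered.startswith(("what", "how", "why")):
        return "question"
    if "ai" in lowered.split():
        return "ai_statement"
    return "unknown"

def tts_stub(text):
    """Stand-in for speech synthesis: tags the reply instead of speaking it."""
    return f"[Spoken] {text}"

def voice_assistant_turn(audio_input):
    transcript = stt_stub(audio_input)   # 1. STT: audio -> text
    intent = detect_intent(transcript)   # 2. NLP: text -> intent
    reply = f"Intent '{intent}' detected for: {transcript}"
    return tts_stub(reply)               # 3. TTS: text -> (simulated) speech

print(voice_assistant_turn('sample_audio.wav'))
```

Because each stage is a plain function, a stub can later be swapped for a real engine without changing the rest of the loop.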

## 👁️ 3. Computer Vision (CV)

Computer Vision enables AI to interpret visual data such as images and videos.
It involves three main tasks:

| Task | Description | Example |
|------|--------------|----------|
| **Image Classification** | Identify object type | Cat vs Dog |
| **Object Detection** | Locate multiple objects | Cars, People in frame |
| **Segmentation** | Label every pixel | Medical imaging, AR |

We'll use a pre-trained **ResNet-18** model for classification.

In [None]:
import torch
from torchvision import models, transforms
from PIL import Image
import requests

# Load pre-trained ResNet
model = models.resnet18(weights='IMAGENET1K_V1')
model.eval()

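# Load an example image
url = 'https://upload.wikimedia.org/wikipedia/commons/9/9a/Pug_600.jpg'
# Wikimedia asks clients to identify themselves; this User-Agent string is a placeholder
headers = {'User-Agent': 'ai-domains-notebook/0.1 (educational example)'}
response = requests.get(url, headers=headers, stream=True, timeout=10)
response.raise_for_status()
# Convert to RGB so grayscale or RGBA images still match the 3-channel normalization
img = Image.open(response.raw).convert('RGB')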

# Transform for model
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(img).unsqueeze(0)

# Forward pass
with torch.no_grad():
    output = model(input_tensor)

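predicted_idx = torch.argmax(output[0]).item()
# Map the index to a human-readable ImageNet label via the weights metadata
categories = models.ResNet18_Weights.IMAGENET1K_V1.meta['categories']
print(f"Predicted class index: {predicted_idx} ({categories[predicted_idx]})")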
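
The model's output is a vector of raw scores (logits), one per class, and `argmax` reports only the largest. The following is a minimal sketch, in plain Python, of how logits become probabilities via softmax and how the top-k classes can be read off. The logits and label names are made up for illustration; they are not outputs of ResNet-18.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits and labels, for illustration only
labels = ['pug', 'bulldog', 'tabby cat', 'sports car']
logits = [4.0, 2.5, 0.5, -1.0]

probs = softmax(logits)
for label, p in top_k(probs, labels, k=2):
    print(f"{label}: {p:.3f}")
```

In PyTorch, the same idea would apply `torch.softmax(output[0], dim=0)` followed by `torch.topk` to the model output.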

### 🧠 Vision System Insights
CNN-based models interpret visual information efficiently: ResNet is a widely used image classification backbone, while YOLO is designed for real-time object detection.
Applications include autonomous driving, surveillance, quality inspection, and AR/VR.

## 🔗 4. Integrating NLP, Speech, and Vision

Modern AI systems combine multiple modalities (**language, audio, and vision**) to create intelligent assistants and multimodal agents.

Example pipeline:
1. **User speaks a query → STT → Text**
2. **NLP → Intent Detection → Context Understanding**
3. **CV → Visual Confirmation or Scene Understanding**
4. **TTS → Response Output (Human-like)**

In [None]:
from graphviz import Digraph

g = Digraph('AIDomains', format='png')
g.attr(rankdir='LR', size='9,4')
g.attr('node', shape='box', style='filled', fillcolor='lightgreen')
g.node('Speech', 'Speech Input (STT)')
g.node('NLP', 'Text Understanding (NLP)')
g.node('Vision', 'Scene Analysis (CV)')
g.node('Response', 'Response Generation (TTS)')
g.edges([('Speech','NLP'), ('NLP','Vision'), ('Vision','Response')])
g.attr(label='Multimodal AI Pipeline: Voice → Text → Vision → Voice')
g.render('ai_domains_pipeline', view=True)

## 📘 5. Summary and Insights

- **NLP** enables understanding and generation of human language.
- **Speech AI** bridges human voice and digital understanding (STT ↔ TTS).
- **Computer Vision** empowers perception and spatial awareness.
- Combined, they form the foundation of **multimodal AI systems**, powering assistants, self-driving cars, and smart robots.

Next: In the following notebook, we'll explore how **AI integrates with Cloud, Edge, and IoT ecosystems** to deliver scalable real-world intelligence.