[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/03_Data_Ingestion.ipynb)

# Data Ingestion

## Overview

This notebook demonstrates how to ingest data from various sources using Semantica's ingestion modules. You'll learn to ingest files, web content, databases, streams, feeds, repositories, emails, and MCP servers.

**Documentation**: [Ingest API Reference](https://semantica.readthedocs.io/reference/ingest/)

### Learning Objectives

- Use `FileIngestor` to load files from local and cloud storage
- Use `WebIngestor` to scrape and crawl web content
- Use `FeedIngestor` to process RSS/Atom feeds
- Use `StreamIngestor` for real-time data streams (Kafka, RabbitMQ, Kinesis, Pulsar)
- Use `DBIngestor` to extract data from databases (PostgreSQL, MySQL, SQLite, Oracle, SQL Server)
- Use `EmailIngestor` to process email messages (IMAP, POP3)
- Use `RepoIngestor` to analyze Git repositories
- Use `MCPIngestor` to connect to Model Context Protocol servers

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install semantica[all]
```

---

## Step 1: File Ingestion

Ingest files from the local filesystem or cloud storage.


```python
import os
import tempfile

from semantica.ingest import FileIngestor

file_ingestor = FileIngestor()

# Create a small sample file in a temporary directory
temp_dir = tempfile.mkdtemp()
sample_file = os.path.join(temp_dir, "sample.txt")

with open(sample_file, "w", encoding="utf-8") as f:
    f.write("Apple Inc. is a technology company. Tim Cook is the CEO.")

# Ingest the file, reading its content along with its metadata
file_object = file_ingestor.ingest_file(sample_file, read_content=True)

file_object.name, file_object.file_type, file_object.size
```
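
For intuition about the attributes inspected above (name, type, size), here is a standard-library sketch that gathers comparable metadata for a file. It is not Semantica's implementation: the `describe_file` helper and its dictionary keys are hypothetical names chosen for illustration.

```python
import mimetypes
import os
import tempfile
from pathlib import Path


def describe_file(path):
    """Collect basic metadata for one file (hypothetical helper, stdlib only)."""
    p = Path(path)
    mime_type, _ = mimetypes.guess_type(p.name)  # guessed from the extension
    return {
        "name": p.name,
        "suffix": p.suffix.lstrip("."),
        "mime_type": mime_type,
        "size": p.stat().st_size,  # size on disk, in bytes
        "text": p.read_text(encoding="utf-8"),
    }


# Build a throwaway file so the sketch runs on its own
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "sample.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("Apple Inc. is a technology company. Tim Cook is the CEO.")

info = describe_file(demo_path)
print(info["name"], info["suffix"], info["size"])
```

Reading the text is optional in a sketch like this; skipping it keeps metadata collection cheap for large files.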


## Step 2: Directory Ingestion

Ingest multiple files from a directory.


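```python
# Add a second file to the temporary directory created in Step 1
file2 = os.path.join(temp_dir, "doc2.txt")
with open(file2, "w", encoding="utf-8") as f:
    f.write("Microsoft Corporation is a technology company. Satya Nadella is the CEO.")

# Ingest every file in the directory without descending into subdirectories
file_objects = file_ingestor.ingest_directory(temp_dir, recursive=False, read_content=True)

len(file_objects)
```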
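
The `recursive` flag presumably controls whether subdirectories are walked. The standard-library sketch below illustrates that distinction under this assumption; `list_files` is a hypothetical helper, not part of Semantica.

```python
import tempfile
from pathlib import Path


def list_files(directory, recursive=False):
    """Return sorted file paths under `directory` (hypothetical helper)."""
    root = Path(directory)
    # rglob descends into subdirectories; glob stays at the top level
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file())


# Layout: one top-level file and one file in a nested folder
demo_root = Path(tempfile.mkdtemp())
(demo_root / "a.txt").write_text("top-level file", encoding="utf-8")
(demo_root / "nested").mkdir()
(demo_root / "nested" / "b.txt").write_text("nested file", encoding="utf-8")

top_level = list_files(demo_root, recursive=False)
everything = list_files(demo_root, recursive=True)
print(len(top_level), len(everything))
```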


## Step 3: Web Ingestion

Ingest content from web pages.


```python
from semantica.ingest import WebIngestor

web_ingestor = WebIngestor()

# Note: Requires an internet connection
web_content = web_ingestor.ingest_url("https://example.com")
web_content.url, web_content.title, len(web_content.text)
```
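
To show what a title and a text body mean for a web page, the offline sketch below extracts both from an inline HTML string with the standard library's `html.parser`. The `TitleAndTextParser` class and the sample markup are illustrative assumptions; how `WebIngestor` extracts content internally is not covered here.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect the <title> and the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self._skip_depth = 0  # greater than zero inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    @property
    def visible_text(self):
        return " ".join(self._chunks)


sample_html = """<html><head><title>Example Domain</title>
<style>body { color: black; }</style></head>
<body><h1>Example Domain</h1><p>This domain is for use in examples.</p></body></html>"""

page_parser = TitleAndTextParser()
page_parser.feed(sample_html)
page_parser.close()
print(page_parser.title, "|", page_parser.visible_text)
```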


## Step 4: Database Ingestion

Ingest data from databases.


```python
from semantica.ingest import DBIngestor

# Example: Database ingestion
# Requires a database connection string; replace the placeholder credentials
# db_ingestor = DBIngestor()
# table_data = db_ingestor.ingest_table("postgresql://user:pass@host/db", "table_name")
```
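
Because the example above needs a live database, here is a runnable stand-in that uses an in-memory SQLite database from the standard library. It sketches the general idea of reading a table into row dictionaries; `read_table` and the `companies` table are hypothetical, and this is not how `DBIngestor` is implemented.

```python
import sqlite3


def read_table(conn, table_name):
    """Return all rows of `table_name` as a list of dicts (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so check the catalog first
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table_name not in known:
        raise ValueError(f"unknown table: {table_name!r}")
    cursor = conn.execute(f'SELECT * FROM "{table_name}" ORDER BY rowid')
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, ceo TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("Apple Inc.", "Tim Cook"), ("Microsoft Corporation", "Satya Nadella")],
)
rows = read_table(conn, "companies")
conn.close()
print(rows)
```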


## Step 5: Stream Ingestion

Ingest data from real-time streams.


```python
from semantica.ingest import StreamIngestor

# Example: Stream ingestion (Kafka, RabbitMQ, Kinesis, Pulsar)
# Requires a running broker and stream configuration
# stream_ingestor = StreamIngestor()
# messages = stream_ingestor.ingest_stream("kafka://broker:9092/topic")
```
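
The URI in the commented example packs the backend, broker address, and topic into one string. As a sketch of how such a URI decomposes, the snippet below splits it with `urllib.parse`. The `parse_stream_uri` helper and its key names are hypothetical, and whether `StreamIngestor` parses URIs this way is an assumption.

```python
from urllib.parse import urlparse


def parse_stream_uri(uri):
    """Split a stream URI into its parts (hypothetical helper)."""
    parts = urlparse(uri)
    return {
        "backend": parts.scheme,          # e.g. "kafka"
        "host": parts.hostname,           # broker host name
        "port": parts.port,               # broker port as an int, or None
        "topic": parts.path.lstrip("/"),  # path without the leading slash
    }


stream_config = parse_stream_uri("kafka://broker:9092/topic")
print(stream_config)
```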


## Step 6: Feed Ingestion

Ingest RSS/Atom feeds.


```python
from semantica.ingest import FeedIngestor

feed_ingestor = FeedIngestor()

# Note: Requires an internet connection
feed_data = feed_ingestor.ingest_feed("https://feeds.feedburner.com/oreilly/radar")
feed_data.title, len(feed_data.items)
```
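
For an offline view of the structure behind a feed title and its items, the sketch below parses a small inline RSS 2.0 document with `xml.etree.ElementTree`. The sample feed and the `parse_rss` helper are made up for illustration; Atom feeds use different element names and namespaces, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Extract the channel title and its items from RSS 2.0 text (hypothetical helper)."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in channel.findall("item")
    ]
    return {"title": channel.findtext("title"), "items": items}


parsed_feed = parse_rss(RSS_SAMPLE)
print(parsed_feed["title"], len(parsed_feed["items"]))
```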


## Step 7: Repository Ingestion

Ingest and analyze Git repositories.


```python
from semantica.ingest import RepoIngestor

# Example: Repository ingestion
# Replace the placeholder URL with a repository you can access
# repo_ingestor = RepoIngestor()
# repo_data = repo_ingestor.ingest_repository("https://github.com/user/repo.git")
```


## Step 8: Email Ingestion

Ingest email messages from IMAP or POP3 servers.


```python
from semantica.ingest import EmailIngestor

# Example: Email ingestion
# Replace the placeholder host, credentials, and mailbox with your own
# email_ingestor = EmailIngestor()
# emails = email_ingestor.ingest_imap("imap.example.com", "user", "password", "INBOX")
```
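
Without a mail server at hand, the fields of a fetched message can still be explored offline. The sketch below parses a hand-written message with the standard library's `email` package. The raw message and the `summary` keys are invented for illustration, and the shape of the objects `EmailIngestor` returns is not assumed to match.

```python
from email import message_from_string
from email.policy import default

RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report

The report is attached.
"""

# policy=default yields an EmailMessage with the modern accessor API
msg = message_from_string(RAW_MESSAGE, policy=default)

summary = {
    "from": str(msg["From"]),
    "to": str(msg["To"]),
    "subject": str(msg["Subject"]),
    "body": msg.get_content().strip(),  # plain-text body of a non-multipart message
}
print(summary)
```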


## Step 9: MCP Ingestion

Ingest data from Model Context Protocol (MCP) servers.


```python
from semantica.ingest import MCPIngestor

# Example: MCP ingestion
# Replace the placeholder server address and port with your own
# mcp_ingestor = MCPIngestor()
# resources = mcp_ingestor.ingest_resources("mcp://server:port")
```


## Summary

You've learned how to ingest data from multiple sources:

- **FileIngestor**: Local files and directories
- **WebIngestor**: Web pages and URLs
- **FeedIngestor**: RSS/Atom feeds
- **StreamIngestor**: Real-time data streams (Kafka, RabbitMQ, Kinesis, Pulsar)
- **DBIngestor**: Database tables and queries (PostgreSQL, MySQL, SQLite, Oracle, SQL Server)
- **EmailIngestor**: Email messages (IMAP, POP3)
- **RepoIngestor**: Git repository analysis
- **MCPIngestor**: Model Context Protocol server integration

Next: Learn how to parse the ingested data in the Document_Parsing notebook.
