

- Proposed S3 folder layout for meta and embedding assets.
- Tests Bedrock connectivity via boto3 client and simple embedding call.
- Describes sentenceID pattern and validates reliability with Polars regex checks.

**Core 1**:
- Local caching of Stage 1, Stage 2 and embedding tables.
- Transforms Stage 1 facts into Stage 2 meta with ML columns.

**Core 2**:
- Details filtering modes (full vs parameterized) over CIKs and years.
- Implements token-aware embedding batches respecting Cohere limits and timing.
- Merges new embeddings into existing vectors table using concat and de-duplicate.
- Updates meta table embedding metadata only for successfully embedded sentences.
- Saves merged vectors and updated meta to S3 and local cache.

**Analytics**:
- Analyzes token distributions, model limits, and per-1k-token cost sensitivity.
- Explains importance of preserving order between sentence IDs and embedding rows.
- Breaks down S3 upload, storage, and egress costs for large parquet files.

#### Code Author: Joel Markapudi

### Potential Structure for S3.
```
├── DATA_MERGE_ASSETS/                 # existing structure
│   ├── FINRAG_FACT_SENTENCES/
│   └── FINRAG_FACT_METRICS/
│
└── ML_EMBED_ASSETS/                        
    ├── EMBED_META_FACT/
    │   └── finrag_fact_sentences_final.parquet
    │
    └── EMBED_VECTORS/
        ├── cohere_v3_768d/
        │   ├── finrag_embeddings_cohere_v3.parquet
        │   ├── metadata.json
        │   └── validation_report.json
        │
        └── titan_v2_1024d/            # Future
            └── ...
```


### Quick Tests 1 - 2
- Check that `boto3.client` can be created, then check model access through the AWS org account credentials.
- Both work.

In [4]:
# Zero-cost test (no API call)
import boto3

try:
    bedrock = boto3.client(
        service_name='bedrock-runtime',
        region_name='us-east-1'
    )
    print("✓ Bedrock client created successfully")
    print(f"  Region: {bedrock.meta.region_name}")
except Exception as e:
    print(f"✗ Failed: {e}")

✓ Bedrock client created successfully
  Region: us-east-1


In [5]:
# Cell 1: Setup
import sys
from pathlib import Path

# Add loaders to path
sys.path.append(str(Path.cwd().parent / 'loaders'))

from ml_config_loader import MLConfig

# Initialize config (loads AWS credentials automatically)
config = MLConfig()

print("✓ Config loaded")
print(f"  Bucket: {config.bucket}")
print(f"  Region: {config.region}")
print(f"  Model: {config.bedrock_model_id}")



# Cell 2: Test Bedrock API
import json

# Get Bedrock client (uses config credentials)
bedrock = config.get_bedrock_client()

# Test embedding with v4
body = json.dumps({
    "texts": ["Revenue increased significantly."],
    "input_type": config.bedrock_input_type,
    "embedding_types": ["float"],
    "output_dimension": config.bedrock_dimensions
})

response = bedrock.invoke_model(
    body=body,
    modelId=config.bedrock_model_id,
    accept='*/*',
    contentType='application/json'
)

result = json.loads(response['body'].read())
embeddings = result['embeddings']['float']

print(f"✓ Bedrock API works!")
print(f"  Model: {config.bedrock_model_id}")
print(f"  Dimensions: {len(embeddings[0])}")
print(f"  First 5 values: {embeddings[0][:5]}")
# print(f"  Cost: ~$0.0000005")



[DEBUG] ✓ Found ModelPipeline via file path: D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
✓ Config loaded
  Bucket: sentence-data-ingestion
  Region: us-east-1
  Model: cohere.embed-v4:0
✓ Bedrock API works!
  Model: cohere.embed-v4:0
  Dimensions: 1024
  First 5 values: [0.048339844, 0.040039062, -0.020874023, -0.012573242, -0.044921875]


### Prep work: 01: Load S3 fact sentences, modify and create new columns, save back to S3 - finrag_fact_sentences_meta_embeds.parquet



### Analysis and patterns.
- Apple 2016 ITEM_1:
  first: 0000320193_10-K_2016_section_1_0
  last:  0000320193_10-K_2016_section_1_99
- Pattern: `{CIK}_{filing}_{year}_section_{section_ID}_{sequence}`
- We derive previous and next sentence IDs by shifting the sequence number by minus one and plus one. This is a rough scheme. We may not rely on it later, because it breaks if the clustered or unique key from other sources does not follow exactly the same pattern.
- Local file downloaded at: ModelPipeline\finrag_ml_tg1\data_cache\stage1_facts\finrag_fact_sentences.parquet
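
The pattern and the plus/minus-one shift can be sketched with the standard library. This is only an illustration: the notebook's own validation uses Polars regex checks, and the regex below (10-digit CIK, 4-digit year, section ID without underscores) is an assumption read off the example IDs. `parse_sentence_id` and `neighbor_ids` are hypothetical helper names.

```python
import re
from typing import Optional, Tuple

# Assumed shape: {CIK}_{filing}_{year}_section_{section_ID}_{sequence}
SENTENCE_ID_RE = re.compile(
    r"^(?P<cik>\d{10})_(?P<filing>[^_]+)_(?P<year>\d{4})"
    r"_section_(?P<section>[^_]+)_(?P<seq>\d+)$"
)


def parse_sentence_id(sentence_id: str) -> Optional[dict]:
    """Split a sentenceID into its components, or return None if it does not match."""
    match = SENTENCE_ID_RE.match(sentence_id)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    parts["seq"] = int(parts["seq"])
    return parts


def neighbor_ids(sentence_id: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (previous_id, next_id) by shifting the trailing sequence number."""
    parts = parse_sentence_id(sentence_id)
    if parts is None:
        return None, None
    prefix = sentence_id.rsplit("_", 1)[0]
    seq = parts["seq"]
    prev_id = f"{prefix}_{seq - 1}" if seq > 0 else None
    next_id = f"{prefix}_{seq + 1}"  # may not exist in the data
    return prev_id, next_id
```

The next ID is generated blindly, so a caller would still need to check that it exists in the table, for example at the last sentence of a section.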

- Example: "Revenue increased 15% to $274.5 billion." Rule of thumb for English text: 1 word ≈ 1.33 tokens.
  ```
  - Word count: 6 words
  - Token count (actual): 9 tokens
    - A finer split on every symbol gives 11 pieces: ['Revenue', 'increased', '15', '%', 'to', '$', '274', '.', '5', 'billion', '.']
  - Approximation: 6 × 1.33 ≈ 8 tokens
  ```
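
The rule of thumb can be turned into a quick estimator. This is a rough sketch that splits on whitespace and applies the 1.33 factor. It is not a tokenizer, and `estimate_tokens` is a hypothetical name.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.33) -> int:
    """Approximate the token count from a whitespace word count."""
    word_count = len(text.split())
    return round(word_count * tokens_per_word)


example = "Revenue increased 15% to $274.5 billion."
print(estimate_tokens(example))  # 6 words × 1.33 = 7.98, rounded to 8
```

Numbers and symbols tend to split into more tokens than plain words, so financial text may run above this estimate.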


In [1]:
from sentID_pattern_validation import validate_sentenceid_pattern

validate_sentenceid_pattern()

Loading: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\stage1_facts\finrag_fact_sentences.parquet
✓ Loaded: 469,252 rows

SENTENCEID PATTERN VALIDATION

Total rows: 469,252

[Full Pattern: CIK_filing_year_section_sectionID_sequence]
  Valid: 469,252 (100.00%)
  Invalid: 0 (0.00%)

[Numeric Suffix Only: ends with number]
  Valid: 469,252 (100.00%)
  Invalid: 0 (0.00%)

✓ Examples of VALID sentenceIDs (parsed):
  0000034088_10-K_2006_section_10_1
    → CIK: 0000034088, Year: 2006, Section: 10, Seq: 1
  0000034088_10-K_2006_section_11_2
    → CIK: 0000034088, Year: 2006, Section: 11, Seq: 2
  0000034088_10-K_2006_section_12_10
    → CIK: 0000034088, Year: 2006, Section: 12, Seq: 10
  0000034088_10-K_2006_section_12_11
    → CIK: 0000034088, Year: 2006, Section: 12, Seq: 11
  0000034088_10-K_2006_section_12_13
    → CIK: 0000034088, Year: 2006, Section: 12, Seq: 13

[Component Validation]
  CIK extracted: 469,252 rows
  Filing ex

{'total_rows': 469252,
 'full_pattern_valid_count': 469252,
 'full_pattern_valid_pct': 100.0,
 'numeric_suffix_valid_count': 469252,
 'numeric_suffix_valid_pct': 100.0,
 'recommendation': 'reliable'}

In [2]:
# ============================================================================
# DATA PREPARATION PIPELINE
# Creates Stage 2 meta table and initializes empty vectors table
# ============================================================================
# ============================================================================
# PARAMETERS - Execution Control
# INIT_*: Creates on S3 (one-time) -- initial, such as 1st table, very first time.
# FORCE_REINIT_*: Recreates (destructive) -- deletes and recreates.
# CACHE_*: Downloads locally -- downloads from remote to local cache for faster dev, assuming your folders are empty.
# FORCE_RECACHE_*: Re-downloads -- forces re-download even if local cache exists. i.e. a basic refresh, override.
# ============================================================================

from data_preparation import DataPreparationPipeline

DataPreparationPipeline(
    cache_stage1_locally=True,
    force_recache_stage1=True,
    cache_stage2_locally=True,
    force_recache_stage2=True,
    cache_embeds_locally=True,
    force_recache_embeds=True,
    embeds_provider="cohere_1024d",
    init_meta_table=False,
    force_reinit_meta=False,
    init_vectors_table=False,
    force_reinit_vectors=False
).run()


[DEBUG] ✓ Found ModelPipeline via file path: D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
DATA PREPARATION PIPELINE
Model: cohere.embed-v4:0 (1024d)

[Stage 1 Table - Downloading from S3]
  Source: s3://sentence-data-ingestion/DATA_MERGE_ASSETS/FINRAG_FACT_SENTENCES/finrag_fact_sentences.parquet
  Size: 23.1 MB
  Downloaded: 469,252 rows (Cost: $0.0020 egress)
  ✓ Cached to: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\stage1_facts\finrag_fact_sentences.parquet

[Stage 2 Meta Table - Downloading from S3]
  Source: s3://sentence-data-ingestion/ML_EMBED_ASSETS/EMBED_META_FACT/finrag_fact_sentences_meta_embeds.parquet
  Size: 33.9 MB
  Downloaded: 469,252 rows (Cost: $0.0030 egress)
  ✓ Cached to: d:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline\finrag_ml_tg1\data_cache\meta_embeds\finrag_fact_sentence

{'stage1_cached': True,
 'stage2_meta_cached': True,
 'embeds_cached': ['cohere_1024d'],
 'meta_table_initialized': False,
 'vectors_table_initialized': False}

### Embedding Generation, Storage and Push, & Embedding metadata update in main fact_sentences_meta_embed table

- At the filtering step:
   - `filtered_sentence_ids = df_filtered['sentenceID'].to_list()`
- This list is the anchor throughout the pipeline. It determines:
   - which sentences to embed, which meta rows to update, progress tracking, and cost estimation.
- Merge (with `keep='last'`, a re-embedded sentence replaces its older row):
  - `merged = pl.concat([existing_df, new_df])`
  - `merged = merged.unique(subset=[key], keep='last')`

- Token counts per sentence:
  - p50 (median): 33 tokens
  - p75: 48 tokens
  - p95: 77 tokens
  - p99: 117 tokens
  - max: 2,281 tokens (outlier, likely a table)
  - 99.7% of sentences: <500 tokens
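
Summary figures like these can be computed from a list of per-sentence token counts. The sketch below uses the nearest-rank percentile definition, which is an assumption and may differ from the method used to produce the numbers above. Both function names are hypothetical.

```python
import math
from typing import Sequence


def nearest_rank_percentile(values: Sequence[int], pct: float) -> int:
    """Return the smallest value with at least pct percent of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def share_below(values: Sequence[int], threshold: int) -> float:
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


token_counts = [10, 20, 30, 40, 50]  # toy data, not the real distribution
print(nearest_rank_percentile(token_counts, 50))
print(share_below(token_counts, 35))
```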

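- Bedrock Cohere v4 limits (planning estimates; the pipeline constructor later in this notebook uses `max_tokens_per_sentence=1000` and `max_tokens_per_batch=128000`):
   - Max tokens per text: 512 tokens
   - Max texts per batch: 96
   - Max total request tokens: ~50K (undocumented; a conservative estimate)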
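
A token-aware batcher that respects limits like these might look as follows. This is a sketch, not the pipeline's implementation: `build_batches` is a hypothetical name, the inputs are assumed to be `(sentence_id, token_count)` pairs, and the defaults are copied from the constructor quoted later in this notebook.

```python
from typing import Iterable, List, Tuple


def build_batches(
    sentences: Iterable[Tuple[str, int]],
    max_texts_per_batch: int = 96,
    max_tokens_per_batch: int = 128_000,
    max_tokens_per_sentence: int = 1_000,
) -> Tuple[List[List[str]], List[str]]:
    """Group sentence IDs into batches; return (batches, skipped_ids)."""
    batches: List[List[str]] = []
    skipped_ids: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for sent_id, n_tokens in sentences:
        # Outlier filter: very long sentences are skipped, not truncated.
        if n_tokens > max_tokens_per_sentence:
            skipped_ids.append(sent_id)
            continue
        # Close the batch when EITHER limit would be exceeded.
        too_many_texts = len(current) >= max_texts_per_batch
        too_many_tokens = current_tokens + n_tokens > max_tokens_per_batch
        if current and (too_many_texts or too_many_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sent_id)
        current_tokens += n_tokens

    if current:
        batches.append(current)
    return batches, skipped_ids
```

With a median of 33 tokens per sentence, the 96-text limit is the one that usually closes a batch; the token budget only matters for runs of long sentences.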

### Right on the AWS Cohere page: "Smaller chunks improve retrieval and cost"
- Long chunks: more tokens = higher embedding cost
- Short chunks: fewer tokens = lower cost
- Example (both figures imply $0.0001 per 1K tokens):
  - 1,000 chunks × 500 tokens each = 500K tokens → $0.05
  - 1,000 chunks × 50 tokens each = 50K tokens → $0.005
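
The arithmetic behind these figures is sketched below. The rate of $0.0001 per 1K tokens is inferred from the two examples above rather than taken from a price list, and `embedding_cost_usd` is a hypothetical name.

```python
def embedding_cost_usd(total_tokens: int, price_per_1k_tokens: float = 0.0001) -> float:
    """Estimate embedding cost from a token count and a per-1K-token price."""
    return total_tokens / 1000 * price_per_1k_tokens


long_chunks = embedding_cost_usd(1_000 * 500)   # 500K tokens
short_chunks = embedding_cost_usd(1_000 * 50)   # 50K tokens
print(f"${long_chunks:.3f} vs ${short_chunks:.3f}")
```

Because the cost is linear in tokens, a different rate only rescales both numbers; the 10x gap between long and short chunks stays the same.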



```
# TOP: Constants & path resolution
config = MLConfig()
VECTORS_URI = ...
META_URI = ...

# STEP 1: Load meta table (helper)
df_meta = load_meta_table_with_cache(config)

# STEP 2: Filter (returns the anchor list of sentence IDs)
df_filtered, filtered_ids = filter_sentences(df_meta, config)

# STEP 3: Generate embeddings
df_vectors, embedding_id, skipped_ids = generate_embeddings_batch(...)

# STEP 4: Merge vectors (simplified - no existence checks)
merged_vectors = merge_vectors_table(df_vectors, VECTORS_URI, storage_options)

# STEP 5: Update meta (use anchor)
updated_meta = update_meta_table(df_meta, filtered_ids, skipped_ids, model_info)

# STEP 6: Save both
#   - save merged vectors
#   - save updated meta
```

Notes from the run:
- Embedding run, Apple (2016?): 47,755 tokens processed → cost $0.0048.
- Actual file on S3: 21MB (compressed with ZSTD).
- Egress: the code overstated the cost by more than 10x. 21MB / 1024 × $0.09 = $0.0018 (not $0.0246).
- Local cache layout:
  - `data_cache/stage1_facts/` → Stage-1 parquet
  - `data_cache/meta_embeds/` → Stage-2 meta (35 cols)
  - `data_cache/embeddings/<provider>/` → vectors per provider (e.g., `cohere_1024d/`)
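
Step 5 updates only the sentences that were actually embedded, which is the filtered list minus the skipped IDs. The sketch below shows that logic on plain dict rows. The real pipeline works on the meta table, and the column names `embedding_model` and `embedding_dims` as well as both function names are hypothetical.

```python
from typing import Dict, List


def successfully_embedded(filtered_ids: List[str], skipped_ids: List[str]) -> List[str]:
    """Anchor list minus skipped IDs, in the original order."""
    skipped = set(skipped_ids)
    return [sid for sid in filtered_ids if sid not in skipped]


def update_meta_rows(meta_rows: List[Dict], embedded_ids: List[str],
                     model_id: str, dims: int) -> List[Dict]:
    """Return new rows with embedding metadata set only for embedded sentences."""
    targets = set(embedded_ids)
    updated = []
    for row in meta_rows:
        if row["sentenceID"] in targets:
            # Copy the row so the input list is left untouched.
            row = {**row, "embedding_model": model_id, "embedding_dims": dims}
        updated.append(row)
    return updated


rows = [{"sentenceID": "s0"}, {"sentenceID": "s1"}, {"sentenceID": "s2"}]
embedded = successfully_embedded(["s0", "s1"], skipped_ids=["s1"])
new_rows = update_meta_rows(rows, embedded, "cohere.embed-v4:0", 1024)
```

Here `s1` was selected but skipped and `s2` was never selected, so only `s0` receives embedding metadata.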

In [1]:
# ============================================================================
# EMBEDDING GENERATION PIPELINE
# Generates embeddings for filtered sentences and merges with existing data
# ============================================================================
"""
COHERE SPECIFIC:
    Batch fills when EITHER condition met:
    1. 96 texts reached, OR
    2. Total tokens reached
def __init__(
        self,
        max_tokens_per_sentence=1000,    # Outlier filter
        max_texts_per_batch=96,          # Cohere API limit
        max_tokens_per_batch=128000,     # Cohere v4 capacity
        batch_log_interval=40            # Progress print frequency
    ):
"""

from embedding_generation import EmbeddingGenerationPipeline

# with defaults (all settings from ml_config.yaml)
summary = EmbeddingGenerationPipeline(batch_log_interval=25).run()

[DEBUG] ✓ Found ModelPipeline via file path: D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\FinSights\ModelPipeline
[DEBUG] ✓ AWS credentials loaded from aws_credentials.env
EMBEDDING GENERATION PIPELINE
Mode: parameterized
Model: cohere.embed-v4:0 (1024d)

[Resolved Paths]
  Vectors: ML_EMBED_ASSETS/EMBED_VECTORS/cohere_1024d/finrag_embeddings_cohere_1024d.parquet
  Meta: ML_EMBED_ASSETS/EMBED_META_FACT/finrag_fact_sentences_meta_embeds.parquet
✓ Using cached meta table
  Loaded: 469,252 rows × 34 columns (Cost: $0.00)

[Filtering: PARAMETERIZED MODE]
  CIKs: [34088, 59478, 104169, 200406, 320193, 789019, 813762, 814585, 890926, 909832, 1018724, 1045810, 1065280, 1141391, 1273813, 1276520, 1318605, 1326801, 1341439, 1403161, 1652044]
  Years: [2012, 2013, 2014]
  Companies selected:
    - EXXON MOBIL CORP (CIK: 34088)
    - ELI LILLY & Co (CIK: 59478)
    - Walmart Inc. (CIK: 104169)
    - JOHNSON & JOHNSON (CIK: 200406)
    - Apple Inc. (CIK: 320193)
    - MICROSOFT CORP (

### Why so much cost tracking and token-wise API tracking analysis? Costs.
- Console downloads: the S3 console runs as an AWS-managed web app, and the browser fetches a pre-signed HTTPS URL generated by the console service. This can count as in-region or intra-AWS traffic, which is often free. That does not hold for every download path.
- Personal downloads may also route through CloudFront or AWS's internal edge acceleration, which may absorb small transfer fees.
- Where awareness matters: batch ETL pulling hundreds of GB per day from S3 to local machines, and repeated egress from multiple regions.


### Order Preserve Proof: Important.
```

# Building batch (order maintained)
all_sentence_ids = []
all_embeddings = []
current_batch = []  # a list preserves insertion order
for sent_id, sent_text in sentences_data:  # assumes (id, text) pairs
    current_batch.append({'id': sent_id, 'text': sent_text})
    all_sentence_ids.append(sent_id)  # same order as the batch

# API call
texts = [item['text'] for item in current_batch]  # preserves order
batch_embeddings = _call_bedrock_api(...)  # returns one embedding per text, in input order

# Collection
all_embeddings.extend(batch_embeddings)  # appends in order


df_vectors = pl.DataFrame({
    'sentenceID': all_sentence_ids,  # [id_0, id_1, id_2, ...]
    'embedding': all_embeddings       # [emb_0, emb_1, emb_2, ...]
})

# Row 0: sentenceID[0] → embedding[0]
# Row 1: sentenceID[1] → embedding[1]
# 1-to-1 mapping holds as long as both lists grow together (same IDs, same order)

```
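
The same argument can be made runnable with a stand-in embedder. In this sketch `fake_embed` replaces the Bedrock call, `assemble_vectors` is a hypothetical name, and the length check is an added guard that is not shown in the outline above.

```python
from typing import Callable, List, Sequence, Tuple


def assemble_vectors(
    sentence_ids: Sequence[str],
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 96,
) -> List[Tuple[str, List[float]]]:
    """Embed texts in batches and pair each ID with its vector, in input order."""
    if len(sentence_ids) != len(texts):
        raise ValueError("sentence_ids and texts must have the same length")
    ids_out: List[str] = []
    vectors_out: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch_ids = list(sentence_ids[start:start + batch_size])
        batch_vectors = embed_fn(list(texts[start:start + batch_size]))
        # Guard: a short or long response would silently shift every later row.
        if len(batch_vectors) != len(batch_ids):
            raise ValueError("embedding count does not match batch size")
        # IDs are recorded only after the batch succeeds, so both lists grow together.
        ids_out.extend(batch_ids)
        vectors_out.extend(batch_vectors)
    return list(zip(ids_out, vectors_out))


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder: a one-dimensional vector holding the text length."""
    return [[float(len(t))] for t in texts]


pairs = assemble_vectors(["s0", "s1", "s2"], ["a", "bb", "ccc"], fake_embed, batch_size=2)
```

Recording IDs after the API call returns, rather than while building the batch, means a failed batch cannot leave orphan IDs without embeddings.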

### True cost analysis:
```
1. Polars: Serialize 469,252 rows → 281MB Parquet file (in RAM)
2. PyArrow: Write to temporary buffer
3. Boto3: PUT request to S3
4. S3: Receives 281MB upload
5. S3: Atomically replaces old object
6. Old version: Deleted (or kept as a noncurrent version if versioning is enabled)

Network transfer: 281MB upload (ingress = $0.00)
S3 operation: PutObject (request fee of $0.005 per 1,000 PUTs in S3 Standard, negligible here)
Storage: 281MB × $0.023/GB/month = $0.006/month
```
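
The storage and egress figures in this notebook follow from two per-GB rates. The sketch below reuses the rates quoted in the notes ($0.09/GB egress, $0.023/GB-month storage) and treats them as assumptions, since actual S3 pricing varies by region and tier. The function names are hypothetical.

```python
EGRESS_USD_PER_GB = 0.09           # rate used in the egress notes
STORAGE_USD_PER_GB_MONTH = 0.023   # rate used in the storage line


def egress_cost_usd(size_mb: float) -> float:
    """Cost of downloading size_mb out of S3 at the assumed egress rate."""
    return size_mb / 1024 * EGRESS_USD_PER_GB


def monthly_storage_cost_usd(size_mb: float) -> float:
    """Monthly cost of storing size_mb at the assumed storage rate."""
    return size_mb / 1024 * STORAGE_USD_PER_GB_MONTH


print(f"21 MB egress:   ${egress_cost_usd(21):.4f}")
print(f"281 MB storage: ${monthly_storage_cost_usd(281):.3f}/month")
```

The size passed in should be the compressed object size on S3, not the in-memory size, which is the mistake behind the overstated egress figure.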

- 0.36 + 0.26 + 0.120 = $0.740