# 🧪 GraphRAG: Skincare Intelligence System (Powered by Groq)

## 📖 Overview

This notebook demonstrates the construction of a **highly detailed Knowledge Graph** for the Skincare and Dermatology domain. Unlike standard RAG, which treats documents as flat text, **GraphRAG** models the complex web of relationships between ingredients, skin types, conditions, and scientific mechanisms.

### 🏗️ Pipeline Architecture

We implement an 8-phase pipeline:

1. **Phase 0: Foundation**: Environment setup with **Groq** as the LLM provider and "Ground Truth" seeding.
2. **Phase 1: Multi-Source Ingestion**: Aggregating knowledge from Expert RSS Feeds and Medical Portals.
3. **Phase 1.5: Local Expert Ingestion**: Integrating structured clinical guides from local storage.
4. **Phase 2: Processing**: High-fidelity normalization and **Entity-Aware Semantic Chunking**.
5. **Phase 3: Semantic Extraction**: Deep extraction using `semantica.semantic_extract` (NER, Relations) via **Llama 3.1 8B**.
6. **Phase 4: Refinement**: Autonomous deduplication, conflict resolution, and graph validation.
7. **Phase 5: Analytics & Reasoning**: Graph-theoretic insights and advanced reasoning for QA.
8. **Phase 6: Orchestration & Export**: Serializing the finished graph to JSON.

---

In [None]:
!pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq

## 🛠️ Phase 0: Environment & Foundation

Establishing a reliable baseline is critical. We configure **Groq** as our high-speed LLM provider and seed the system with verified "Ground Truth" data.
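
Seed data only anchors the graph if it is internally consistent. As a minimal sketch in plain Python (not part of the `semantica` API; the helper name `find_dangling_edges` and the `elastin` example are hypothetical), the check below confirms that every seeded relationship points at an entity ID that is actually defined:

```python
def find_dangling_edges(seed: dict) -> list[tuple[str, str, str]]:
    """Return (source, type, target) triples that reference unknown entity IDs."""
    known_ids = {entity["id"] for entity in seed["entities"]}
    return [
        (rel["source"], rel["type"], rel["target"])
        for rel in seed["relationships"]
        if rel["source"] not in known_ids or rel["target"] not in known_ids
    ]


# Illustrative seed in the same shape as the notebook's foundation data.
example_seed = {
    "entities": [
        {"id": "retinol", "name": "Retinol", "type": "Ingredient"},
        {"id": "collagen", "name": "Collagen", "type": "Protein"},
    ],
    "relationships": [
        {"source": "retinol", "target": "collagen", "type": "STIMULATES"},
        # "elastin" is deliberately missing from the entity list.
        {"source": "retinol", "target": "elastin", "type": "STIMULATES"},
    ],
}

dangling = find_dangling_edges(example_seed)
print(dangling)
```

Running a check like this before writing the seed file catches typos in IDs early, when they are cheapest to fix.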

In [None]:
import os
import json
import pandas as pd
from semantica.core import Semantica, ConfigManager
from semantica.seed import SeedDataManager

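# 1. Groq & Advanced Configuration
# Never hard-code API keys in a notebook: read the key from the environment or prompt for it.
if not os.environ.get("GROQ_API_KEY"):
    from getpass import getpass
    os.environ["GROQ_API_KEY"] = getpass("Enter your Groq API key: ")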

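config_dict = {
    "project_name": "Skincare_Graph_IQ",
    # The OpenAI embedding provider needs its own credentials (typically OPENAI_API_KEY) in addition to the Groq key
    "embedding": {"provider": "openai", "model": "text-embedding-3-small"},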
    "extraction": {
        "provider": "groq", 
        "model": "llama-3.1-8b-instant", 
        "temperature": 0.0
    },
    "vector_store": {"provider": "faiss", "dimension": 1536},
    "knowledge_graph": {"backend": "networkx", "merge_entities": True, "resolution_strategy": "fuzzy"}
}

config = ConfigManager().load_from_dict(config_dict)
core = Semantica(config=config)

# 2. Seeding Ground Truth (The "Anchor" for our Graph)
foundation_data = {
    "entities": [
        {"id": "hyaluronic_acid", "name": "Hyaluronic Acid", "type": "Ingredient", "properties": {"role": "Humectant"}},
        {"id": "retinol", "name": "Retinol", "type": "Ingredient", "properties": {"role": "Anti-aging actives"}},
        {"id": "niacinamide", "name": "Niacinamide", "type": "Ingredient", "properties": {"role": "Barrier repair"}},
        {"id": "collagen", "name": "Collagen", "type": "Protein", "properties": {"location": "Dermal Matrix"}}
    ],
    "relationships": [
        {"source": "retinol", "target": "collagen", "type": "STIMULATES", "properties": {"level": "High"}},
        {"source": "hyaluronic_acid", "target": "niacinamide", "type": "COMPLEMENTS", "properties": {"benefit": "Hydration + Barrier"}}
    ]
}

with open("skincare_base.json", "w") as f:
    json.dump(foundation_data, f)

seed_manager = SeedDataManager()
seed_manager.register_source("core_ontology", "json", "skincare_base.json")
foundation_graph = seed_manager.create_foundation_graph()

print(f"✅ Phase 0 Complete. Seeded {len(foundation_data['entities'])} primary nodes with Groq backend.")

## 📥 Phase 1: Multi-Source Web Ingestion

We pull real-world knowledge from stable, publicly accessible RSS feeds and medical portals. Choosing pages that are open to automated requests avoids failures on restricted medical endpoints.
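
For readers unfamiliar with what a feed ingester does, here is a rough standard-library sketch of the idea: parse the RSS XML, take the first few items, and prefer the full post body over the short description. The inline sample feed and the helper `first_items` are illustrative assumptions, not `semantica` internals.

```python
import xml.etree.ElementTree as ET

# RSS 2.0 feeds commonly carry full post bodies in the content module's <content:encoded> tag.
CONTENT_TAG = "{http://purl.org/rss/1.0/modules/content/}encoded"

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example Skincare Blog</title>
    <item>
      <title>Post A</title>
      <description>Short summary A</description>
      <content:encoded>Full text A</content:encoded>
    </item>
    <item>
      <title>Post B</title>
      <description>Short summary B</description>
    </item>
  </channel>
</rss>"""


def first_items(feed_xml: str, limit: int = 3) -> list[str]:
    """Return text for up to `limit` items, preferring full content over the description."""
    root = ET.fromstring(feed_xml)
    texts = []
    for item in root.iter("item"):
        if len(texts) == limit:
            break
        body = item.findtext(CONTENT_TAG) or item.findtext("description") or ""
        if body:  # skip items that carry no usable text
            texts.append(body)
    return texts


texts = first_items(SAMPLE_FEED)
print(texts)
```

Unlike a bare `content or description` expression, this version drops items with no text at all, so no empty records reach the later phases.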

In [None]:
from semantica.ingest import ingest_web, ingest_feed

sources = []

# 1. High-Stability RSS Feeds (Expert Blogs)
feeds = [
    "https://makeupandbeautyblog.com/feed",
    "https://www.thebeautylookbook.com/feed",
    "https://stylecaster.com/c/beauty/skin-care/feed/"
]

for feed_url in feeds:
    try:
        feed_data = ingest_feed(feed_url, method="rss")
        sources.extend([item.content or item.description for item in feed_data.items[:3]])
        print(f"Successfully ingested feed: {feed_url}")
    except Exception as e:
        print(f"Feed Error {feed_url}: {e}")

# 2. Targeted Web Ingestion (Clinical Summary Pages)
web_urls = [
    "https://www.niams.nih.gov/health-topics/all-health-topics", 
    "https://dermnetnz.org/topics/emollients-and-moisturisers"
]

for url in web_urls:
    try:
        content = ingest_web(url, method="url")
        sources.append(content.text)
        print(f"Successfully ingested web: {url}")
    except Exception as e:
        print(f"Web Error {url}: {e}")

print(f"✅ Phase 1 Complete. Ingested {len(sources)} total web records.")

## 📂 Phase 1.5: Local Expert Knowledge Ingestion

A professional GraphRAG system should never rely solely on ephemeral web sources. Here we demonstrate ingesting structured local expertise (e.g., Clinical Guidelines or Ingredient Whitepapers).
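
Structured guides often follow a simple `Field: value` layout, and keeping that structure can help later extraction. The sketch below is one hypothetical way to split such a document into fields with plain Python; `parse_guide` is an illustrative helper, and the notebook itself ingests the guide as unstructured text.

```python
def parse_guide(text: str) -> dict[str, str]:
    """Split 'Field: value' lines into a dict; the first line without a field is kept as the title."""
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
        elif "title" not in fields:
            fields["title"] = line
    return fields


guide = """
RETINOL CLINICAL GUIDE v2.1
Mechanism: Binds to retinoic acid receptors (RAR) to increase cellular turnover.
Precautions: Should not be used with high-concentration AHA/BHA exfoliants.
"""

parsed = parse_guide(guide)
print(parsed["title"])
print(parsed["precautions"])
```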

In [None]:
from semantica.ingest import ingest_file

# Creating a mock expert document for demonstration
expert_content = """
RETINOL CLINICAL GUIDE v2.1
Mechanism: Binds to retinoic acid receptors (RAR) to increase cellular turnover.
Precautions: Should not be used with high-concentration AHA/BHA exfoliants.
Synergy: Highly effective when paired with Niacinamide to offset potential erythema.
Target: Stratum corneum thickening and dermal collagen synthesis.
"""
with open("expert_skincare_guide.txt", "w") as f:
    f.write(expert_content)

try:
    local_data = ingest_file("expert_skincare_guide.txt")
    # FileObject content is binary, so we decode it for text processing
    expert_text = local_data.content.decode('utf-8') if local_data.content else ""
    sources.append(expert_text)
    print("✅ Local expert document ingested successfully.")
except Exception as e:
    print(f"Local Ingest Error: {e}")

## 🔧 Phase 2: High-Fidelity Processing

Before extraction, we clean the data and perform **Entity-Aware Chunking** to preserve complex ingredient descriptions.
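
To make the idea of entity-aware chunking concrete, here is a small sketch: fixed-size chunks with overlap whose boundaries are nudged so they do not cut through a known entity mention. It is not the `EntityAwareChunker` implementation; the function names are hypothetical, and the sketch assumes the entity names do not overlap one another in the text.

```python
import re


def entity_spans(text: str, entities: list[str]) -> list[tuple[int, int]]:
    """Locate every case-insensitive occurrence of each known entity name."""
    spans = []
    for name in entities:
        for match in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            spans.append(match.span())
    return sorted(spans)


def entity_aware_chunks(text, entities, chunk_size=40, overlap=10):
    """Chunk by characters, moving boundaries that would fall inside an entity mention."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    spans = entity_spans(text, entities)
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Extend the chunk if its end would land inside an entity mention.
        for s, e in spans:
            if s < end < e:
                end = e
        chunks.append(text[start:end])
        if end == len(text):
            break
        next_start = end - overlap
        # Pull the next start back to the entity's first character if needed.
        for s, e in spans:
            if s < next_start < e:
                next_start = s
        start = max(next_start, start + 1)  # always make progress
    return chunks


text = "Apply retinol at night. Pair hyaluronic acid with niacinamide for hydration."
chunks = entity_aware_chunks(text, ["retinol", "hyaluronic acid", "niacinamide"])
for chunk in chunks:
    print(repr(chunk))
```

A plain 40-character split of this sentence would end its first chunk in the middle of "hyaluronic acid"; here the boundary moves to the end of the mention, and the following chunk starts at its beginning.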

In [None]:
from semantica.normalize import TextNormalizer, DataCleaner
from semantica.split import EntityAwareChunker

# 1. Normalization & Cleaning
normalizer = TextNormalizer()
cleaner = DataCleaner()

cleaned_docs = []
for text in sources:
    if not text:
        continue
    norm_text = normalizer.normalize(text)
    cleaned_docs.append({"text": norm_text})

final_dataset = cleaner.clean_data(cleaned_docs, remove_duplicates=True)

# 2. Sophisticated Chunking
chunker = EntityAwareChunker(chunk_size=1000, chunk_overlap=200)
all_chunks = []
for doc in final_dataset:
    all_chunks.extend(chunker.chunk(doc['text']))

print(f"✅ Phase 2 Complete. Generated {len(all_chunks)} semantic chunks.")

## 🧠 Phase 3: Detailed Semantic Extraction (Powered by Llama 3.1 via Groq)

We use **Llama 3.1 8B**, served by Groq, for fast LLM-based entity and relation extraction. To keep the demonstration quick, only the first five chunks are processed.
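
LLM extractors tend to return the same entity several times with small differences in casing or whitespace. The sketch below shows one simple way to derive stable IDs and keep a single record per entity. The `mentions` list is made-up sample data and `to_entity_id` is a hypothetical helper, not a `semantica` function.

```python
import re


def to_entity_id(name: str) -> str:
    """Lowercase a surface form and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


# Hypothetical raw mentions, as an extractor might return them across several chunks.
mentions = [
    {"text": "Retinol", "label": "Ingredient"},
    {"text": "retinol ", "label": "Ingredient"},
    {"text": "Hyaluronic Acid", "label": "Ingredient"},
    {"text": "AHA/BHA", "label": "Ingredient"},
]

# Keep the first mention seen for each normalized ID.
entities = {}
for mention in mentions:
    entity_id = to_entity_id(mention["text"])
    entities.setdefault(
        entity_id,
        {"id": entity_id, "name": mention["text"].strip(), "type": mention["label"]},
    )

print(list(entities))
```

Applying the same function to relationship endpoints keeps edges pointing at the IDs used for the nodes.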

In [None]:
from semantica.semantic_extract import NERExtractor, RelationExtractor

# 1. Named Entity Recognition via Groq
ner = NERExtractor(method="llm", provider="groq", model="llama-3.1-8b-instant")

# 2. Relation Extraction via Groq
rel_ext = RelationExtractor(method="llm", provider="groq", model="llama-3.1-8b-instant")

combined_results = {"entities": [], "relationships": []}

# Process a subset for demonstration
sample_chunks = all_chunks[:5]
print("Extracting nodes and edges using Groq (Llama 3.1 8B)...")

def to_id(name):
    # One ID scheme for nodes and edge endpoints, so relationships attach to their entities
    return name.lower().replace(' ', '_')

for chunk in sample_chunks:
    txt = str(chunk.text)
    # Extract Entities
    entities = ner.extract(txt)
    combined_results["entities"].extend(
        {"name": e.text, "type": e.label, "id": to_id(e.text)} for e in entities
    )

    # Extract Relations based on detected entities
    relations = rel_ext.extract(txt, entities=entities)
    combined_results["relationships"].extend(
        {"source": to_id(r.subject.text), "target": to_id(r.object.text), "type": r.predicate}
        for r in relations
    )

print(f"✅ Phase 3 Complete. Extracted {len(combined_results['entities'])} entities using Groq.")

## ✨ Phase 4: Graph Refinement & Resolution

We build a unified graph with `semantica.kg`, merge duplicate entities with `semantica.deduplication`, resolve contradictory facts with `semantica.conflicts`, and validate the result.
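
As a rough illustration of the two ideas involved, the sketch below flags likely duplicates by string similarity and settles a conflicting property by majority vote. It uses only the standard library (`difflib`, `collections.Counter`) and is not how `DuplicateDetector` or `ConflictResolver` are implemented; the function names and sample values are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def find_duplicates(names, threshold=0.85):
    """Return (a, b, score) for every pair of names at or above the similarity threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


def resolve_by_consensus(values):
    """Pick the most frequently reported value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


names = ["Hyaluronic Acid", "hyaluronic acid ", "Niacinamide", "Retinol"]
duplicates = find_duplicates(names)
print(duplicates)

# Three sources disagree about the role of one ingredient.
role = resolve_by_consensus(["Humectant", "Humectant", "Emollient"])
print(role)
```

String similarity alone can over-merge. With this measure "Retinol" and "Retinal" score about 0.86, above the 0.85 threshold, yet they are different compounds, so it is worth pairing the score with a type or context check.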

In [None]:
from semantica.kg import GraphBuilder, GraphValidator
from semantica.deduplication import DuplicateDetector, EntityMerger
from semantica.conflicts import ConflictDetector, ConflictResolver

# 1. Unified Graph Construction
gb = GraphBuilder(merge_entities=True, entity_resolution_strategy="fuzzy")
kg = gb.build([combined_results])

# 2. Autonomous Deduplication
detector = DuplicateDetector(similarity_threshold=0.85)
duplicates = detector.detect_duplicates(kg['entities'])
if duplicates:
    kg = EntityMerger().merge_duplicates(kg, duplicates)
    print(f"- Merged {len(duplicates)} duplicate entities.")

# 3. Conflict Resolution
conflicts = ConflictDetector().detect_conflicts(kg)
if conflicts:
    kg = ConflictResolver().resolve_conflicts(kg, conflicts, strategy="consensus")
    print(f"- Resolved {len(conflicts)} knowledge conflicts.")

# 4. Quality Validation
validation = GraphValidator().validate(kg)
print(f"✅ Phase 4 Complete. Graph Integrity: {'Passed' if validation.is_valid else 'Issues Found'}")

## 📊 Phase 5: Analytics, Reasoning & Visualization

We apply graph-theoretic measures (degree centrality and Louvain community detection) and **Groq-powered reasoning** to the skincare knowledge base, then visualize the result.
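
Degree centrality is the simplest of these measures: a node's number of distinct neighbors divided by the maximum possible, `n - 1`. The sketch below computes it by hand on a tiny undirected graph and answers a "what should be avoided" question with a direct edge lookup. The edge list and the relation names `PAIRS_WITH` and `AVOID_WITH` are illustrative assumptions, not output of the pipeline.

```python
from collections import defaultdict

# (source, relation, target) triples; the last two relation names are hypothetical.
edges = [
    ("retinol", "STIMULATES", "collagen"),
    ("hyaluronic_acid", "COMPLEMENTS", "niacinamide"),
    ("retinol", "PAIRS_WITH", "niacinamide"),
    ("retinol", "AVOID_WITH", "aha_bha"),
]


def degree_centrality(edges):
    """Number of distinct neighbors of each node divided by (n - 1), ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, _relation, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


scores = degree_centrality(edges)
top_two = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_two)

# A direct graph lookup for one specific relation type.
avoid_with_retinol = [t for s, rel, t in edges if s == "retinol" and rel == "AVOID_WITH"]
print(avoid_with_retinol)
```

An LLM-based reasoner can combine lookups like this with free-text context, but the explicit edge is what makes the answer traceable.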

In [None]:
from semantica.kg import CentralityCalculator, CommunityDetector
from semantica.reasoning import GraphReasoner
from semantica.visualization import KGVisualizer
import matplotlib.pyplot as plt

# 1. Key Node Analysis
centrality = CentralityCalculator().calculate_degree_centrality(kg)
rankings = centrality.get("rankings", [])[:3]

# 2. Community Detection
communities = CommunityDetector().detect_communities(kg, algorithm="louvain")
num_communities = len(communities.get("communities", []))

# 3. Advanced Reasoning using Groq
reasoner = GraphReasoner(core=core, provider="groq", model="llama-3.1-8b-instant")
query = "What ingredients should be avoided with Retinol based on the graph?"
answer = reasoner.reason(kg, query)

# 4. Visualization
viz = KGVisualizer()
viz.visualize_network(kg, layout="spring", title="Skincare Ingredient Intelligence Graph (Groq Enhanced)")
plt.show()

print("✅ Phase 5 Complete.")
print(f"Top Ingredients: {[r['node'] for r in rankings]}")
print(f"Communities Detected: {num_communities}")
print(f"Reasoning Output: {answer}")

## 📦 Phase 6: Orchestration & Export

Finally, we serialize the knowledge graph to JSON so downstream applications can load it.
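
Whatever exporter is used, a quick round trip (write, reload, compare) is a cheap way to confirm nothing was lost. The sketch below does this with the standard `json` module. The `nodes`/`edges` layout and the helper names are assumptions for illustration and may differ from the file `GraphExporter` writes.

```python
import json
import tempfile
from pathlib import Path


def export_graph(entities, relationships, path):
    """Write the graph as a JSON document with separate node and edge lists."""
    payload = {"nodes": entities, "edges": relationships}
    Path(path).write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")


def load_graph(path):
    """Read the JSON document back into (entities, relationships)."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return payload["nodes"], payload["edges"]


entities = [
    {"id": "retinol", "name": "Retinol", "type": "Ingredient"},
    {"id": "collagen", "name": "Collagen", "type": "Protein"},
]
relationships = [{"source": "retinol", "target": "collagen", "type": "STIMULATES"}]

# Write to a temporary directory, reload, and compare with the in-memory graph.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "graph_sketch.json"
    export_graph(entities, relationships, out_path)
    loaded_entities, loaded_relationships = load_graph(out_path)

print(loaded_entities == entities and loaded_relationships == relationships)
```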

In [None]:
from semantica.export import GraphExporter

# 1. Exporting the structured knowledge
exporter = GraphExporter()
exporter.export_to_json(kg, "skincare_intelligence_graph.json")

print("🚀 Mission Complete: Skincare Intelligence Graph is ready for deployment.")