# Materials Technology RAG System
## Retrieval-Augmented Generation for ISO/DIN Standards and Technical Documents

This notebook implements a comprehensive RAG system for querying technical documents related to:
- **Metallography and Material Science**
- **Hardness Testing Standards (ISO/DIN)**
- **SEM (Scanning Electron Microscopy)**
- **Material Properties and Testing**
- **Austenitic Grain Size Analysis**

**Technologies Used:**
- OpenAI GPT-3.5-Turbo for answer generation (💰 cost-optimized)
- OpenAI text-embedding-3-small for document embeddings (💰 low-cost embedding model)
- Qdrant for vector storage and similarity search
- LangChain for document loading, text splitting, and model wrappers

**Cost Savings:**
- GPT-3.5-Turbo is roughly **20x cheaper** per token than GPT-4.
- text-embedding-3-small is roughly **5x cheaper** per token than ada-002.
- Exact ratios depend on current OpenAI pricing. 🎉
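
To make the cost comparison concrete, here is a minimal sketch of how embedding cost scales with corpus size. The helper name `estimate_embedding_cost`, the rule of thumb of about 4 characters per token, and the price of `1.0` per million tokens are all placeholder assumptions, not values from this notebook or from OpenAI's price list.

```python
def estimate_embedding_cost(
    n_chunks: int,
    avg_chars_per_chunk: float,
    price_per_million_tokens: float,
    chars_per_token: float = 4.0,  # rough heuristic, not an exact tokenizer count
) -> float:
    """Estimate the cost of embedding a corpus, in the currency of the given price."""
    total_tokens = n_chunks * avg_chars_per_chunk / chars_per_token
    return total_tokens / 1_000_000 * price_per_million_tokens


# Placeholder inputs: 10,000 chunks of about 1,000 characters each,
# priced at a hypothetical 1.0 per million tokens.
baseline_cost = estimate_embedding_cost(10_000, 1_000, price_per_million_tokens=1.0)

# A model that is 5x cheaper per token costs 5x less for the same corpus.
cheaper_cost = estimate_embedding_cost(10_000, 1_000, price_per_million_tokens=0.2)
```

Because cost is linear in both token count and price, a per-token price ratio carries over directly to the total bill for a fixed corpus.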



# 🚀 RAG with Qdrant - NO MORE CRASHES!

**Cell 3 connects to Qdrant Cloud using `QDRANT_URL` and `QDRANT_API_KEY` from the `.env` file.**

✅ No crashes - vectors are stored in Qdrant, not in the notebook's memory  
✅ No huge files - only small query results are held in memory  
✅ Fast - optimized vector search  

**Note:** Vectors uploaded to Qdrant Cloud persist after Jupyter is closed, so the index only needs to be built once. A local in-memory client (`QdrantClient(":memory:")`) would lose its data when the kernel stops.

---

## Steps:
1. Run cells 1-3 (setup)
2. Run cells 4-6 (build index, ~10 mins, first time only)
3. Run cells 7-10 (query!)


In [None]:
# CELL 1: Install Qdrant (run once)
# Uncomment if needed:
# !pip install qdrant-client -q

print("✅ Ready")


✅ Ready


In [None]:
# CELL 2: Setup
import os, gc, warnings
from pathlib import Path
warnings.filterwarnings('ignore')

from dotenv import load_dotenv
load_dotenv()

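OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
if not OPENAI_API_KEY:
    # Fail early: os.environ rejects None, and later cells need the key
    raise ValueError("⚠️ Missing OPENAI_API_KEY in .env file!")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

COLLECTION_NAME = "materials_tech_docs"

print("✅ Setup complete")


✅ Setup complete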
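
The setup cell depends on secrets loaded from `.env`. A minimal sketch of a fail-fast lookup that could be reused for every required variable is shown below. `require_env` is a hypothetical helper and `DEMO_API_KEY` is a placeholder variable name; neither is part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demo with a placeholder variable (hypothetical name and value)
os.environ["DEMO_API_KEY"] = "sk-placeholder"
demo_key = require_env("DEMO_API_KEY")
```

Raising at lookup time surfaces a missing key in the setup cell instead of as an authentication error several cells later.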


In [None]:
# CELL 3: Connect to Qdrant Cloud
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
import os

print("🔧 Connecting to Qdrant Cloud...")

# Get Qdrant credentials from .env file
QDRANT_URL = os.getenv('QDRANT_URL')
QDRANT_API_KEY = os.getenv('QDRANT_API_KEY')

if not QDRANT_URL or not QDRANT_API_KEY:
    raise ValueError("⚠️ Missing QDRANT_URL or QDRANT_API_KEY in .env file!")

# Connect with increased timeout for large uploads
client = QdrantClient(
    url=QDRANT_URL,
    api_key=QDRANT_API_KEY,
    timeout=300,  # 5 minutes timeout for large batches
)

print("✅ Connected to Qdrant Cloud!")
print(f"🌐 Cluster: {QDRANT_URL}")
print("🌐 Web Console: https://cloud.qdrant.io/")
print("⏱️  Timeout: 300 seconds\n")

# Check if collection exists
try:
    collections = client.get_collections().collections
    collection_names = [c.name for c in collections]
    
    if COLLECTION_NAME in collection_names:
        info = client.get_collection(COLLECTION_NAME)
        print(f"✅ Collection '{COLLECTION_NAME}' exists with {info.points_count} vectors!")
        print("📌 Skip to Cell 7 to query!")
    else:
        print(f"⚠️  Collection '{COLLECTION_NAME}' not found")
        print("📝 Run Cells 4, 5, 6 to build it (~10 mins)")
except Exception as e:
    # Report the real error (bad URL, invalid API key, network issue, ...)
    print(f"⚠️  Could not list collections: {e}")
    print("📝 Check QDRANT_URL and QDRANT_API_KEY, then re-run this cell")


🔧 Connecting to Qdrant Cloud...
✅ Connected to Qdrant Cloud!
🌐 Cluster: https://36660491-35a5-4e94-ac12-1a35280b8e91.europe-west3-0.gcp.cloud.qdrant.io:6333
🌐 Web Console: https://cloud.qdrant.io/
⏱️  Timeout: 300 seconds

✅ Collection 'materials_tech_docs' exists with 13400 vectors!
📌 Skip to Cell 7 to query!


In [None]:
# CELL 4: Load & chunk PDFs
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm import tqdm

pdf_dir = Path('../ISO_DIN standards_Thesis_Documents')
pdf_files = list(pdf_dir.glob('*.pdf'))
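print(f"📚 Found {len(pdf_files)} PDFs\n")
documents = []
failed = []  # PDFs that could not be parsed
for pdf_file in tqdm(pdf_files, desc="Loading"):
    try:
        loader = PyPDFLoader(str(pdf_file))
        docs = loader.load()
        if docs:
            documents.extend(docs)
    except Exception as e:
        # Record the failure instead of silently skipping the file
        failed.append((pdf_file.name, str(e)))

print(f"\n✅ Loaded {len(documents)} pages")
if failed:
    print(f"⚠️  Skipped {len(failed)} unreadable PDFs: {[name for name, _ in failed]}")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
print(f"✅ Created {len(chunks)} chunks")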

del documents
gc.collect()


📚 Found 50 PDFs

Loading: 100%|██████████| 50/50 [03:38<00:00,  4.38s/it]

✅ Loaded 5175 pages
✅ Created 18273 chunks


0
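
The splitter in Cell 4 is configured with `chunk_size=1000` and `chunk_overlap=200`. The sketch below shows how size and overlap interact, using a plain fixed-width sliding window. This is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to cut at separators such as paragraph and line breaks, so its chunk boundaries differ. The name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> list[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window advances by 800 characters
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic 2,500-character input, used only to show the window arithmetic
sample_text = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_window_chunks(sample_text)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.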

In [None]:
# CELL 5: Create embeddings (3-5 mins)
from langchain_community.embeddings import OpenAIEmbeddings
import time

print("🔄 Creating embeddings...\n")

embeddings_model = OpenAIEmbeddings(model='text-embedding-3-small', openai_api_key=OPENAI_API_KEY)

all_texts = [c.page_content for c in chunks]
all_metadatas = [c.metadata for c in chunks]

print(f"📊 Processing {len(all_texts)} chunks...\n")

batch_size = 100
all_embeddings = []

for i in tqdm(range(0, len(all_texts), batch_size), desc="Embedding"):
    batch = all_texts[i:i+batch_size]
    embs = embeddings_model.embed_documents(batch)
    all_embeddings.extend(embs)
    time.sleep(1)
    gc.collect()

print(f"\n✅ Created {len(all_embeddings)} embeddings")
print(f"📏 Dimension: {len(all_embeddings[0])}")


🔄 Creating embeddings...

📊 Processing 18273 chunks...

Embedding: 100%|██████████| 183/183 [06:56<00:00,  2.28s/it]

✅ Created 18273 embeddings
📏 Dimension: 1536
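
The progress bar above reports 183 batches for 18273 chunks at a batch size of 100. That count is a ceiling division, and the final batch is smaller than the rest. The sketch below reproduces the arithmetic; `iter_batches` is a hypothetical helper that mirrors the slicing loop in Cell 5.

```python
import math
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_chunks = 18273  # chunk count reported by Cell 4

embedding_batches = math.ceil(n_chunks / 100)  # batch size used for embedding
upload_batches = math.ceil(n_chunks / 50)      # batch size used for upload in Cell 6

# Size of the final, partial embedding batch
last_batch_size = len(list(iter_batches(range(n_chunks), 100))[-1])
```

With a one-second pause per batch, the sleep alone accounts for about three minutes of the embedding run, so the batch size directly affects wall-clock time.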





In [None]:
# CELL 6: Upload to Qdrant Cloud
from qdrant_client.models import PointStruct
import time

vector_size = len(all_embeddings[0])

# Create collection
try:
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
    )
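    print("✅ Collection created\n")
except Exception as e:
    if "already exists" in str(e).lower():
        # Point IDs below restart at 0, so upsert overwrites existing points with the same ID
        print("⚠️  Collection already exists, will add to it\n")
    else:
        raise  # Re-raise with the original traceback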

# Prepare points
print("🔄 Preparing points...")
points = []
for i in range(len(all_texts)):
    points.append(PointStruct(
        id=i,
        vector=all_embeddings[i],
        payload={
            "text": all_texts[i],
            "source": all_metadatas[i].get('source', 'Unknown'),
            "page": all_metadatas[i].get('page', 0)
        }
    ))

print(f"📦 Total points: {len(points)}")

# Upload with smaller batches and retry logic
batch_size = 50  # Reduced from 100 to avoid timeouts
max_retries = 3

print("🚀 Uploading to Qdrant Cloud...")
for i in tqdm(range(0, len(points), batch_size), desc="Uploading"):
    batch = points[i:i+batch_size]
    
    # Retry logic
    for attempt in range(max_retries):
        try:
            client.upsert(
                collection_name=COLLECTION_NAME, 
                points=batch,
                wait=True  # Wait for operation to complete
            )
            break  # Success, exit retry loop
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"\n⚠️  Batch {i//batch_size + 1} failed, retrying ({attempt + 1}/{max_retries})...")
                time.sleep(2)  # Wait before retry
            else:
                print(f"\n❌ Batch {i//batch_size + 1} failed after {max_retries} attempts: {str(e)}")
                raise  # Re-raise with the original traceback
    
    time.sleep(0.5)  # Small delay between batches
    gc.collect()

print(f"\n✅ Uploaded {len(points)} vectors to Qdrant Cloud!")
print("🎉 Ready to query!")
print("🌐 View in console: https://cloud.qdrant.io/")

del all_embeddings, all_texts, all_metadatas, chunks, points
gc.collect()


⚠️  Collection already exists, will add to it

🔄 Preparing points...
📦 Total points: 18273
🚀 Uploading to Qdrant Cloud...

Uploading: 100%|██████████| 366/366 [09:08<00:00,  1.50s/it]

✅ Uploaded 18273 vectors to Qdrant Cloud!
🎉 Ready to query!
🌐 View in console: https://cloud.qdrant.io/





0
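
Cell 6 uses the list index as the point ID. Because an upsert replaces any existing point with the same ID, re-running the upload after the PDF set or chunking changes can overwrite unrelated points. One possible alternative, sketched below under stated assumptions, derives a stable UUID from the chunk's origin. Qdrant accepts UUIDs as well as unsigned integers for point IDs. The helper `make_point_id`, the key layout, and the choice of `uuid.NAMESPACE_URL` are illustrative assumptions, not part of the notebook.

```python
import uuid


def make_point_id(source: str, page: int, chunk_index: int) -> str:
    """Derive a deterministic UUID string from where a chunk came from."""
    key = f"{source}|{page}|{chunk_index}"
    # uuid5 hashes the key, so the same input always yields the same ID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


# Placeholder path and positions, for illustration only
id_first = make_point_id("standards/example.pdf", page=3, chunk_index=0)
id_repeat = make_point_id("standards/example.pdf", page=3, chunk_index=0)
id_other = make_point_id("standards/example.pdf", page=3, chunk_index=1)
```

With IDs like these, re-uploading the same chunk updates it in place, while chunks from new documents get new IDs instead of colliding with old ones.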

---
## 🎯 Query Time!
Collection ready! Ask questions below.
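
The collection is created with `Distance.COSINE`, so each query returns the stored vectors whose direction is closest to the query embedding. The brute-force sketch below shows that ranking on toy 2-D vectors. It is a conceptual illustration only: Qdrant uses an approximate index rather than scanning every vector, and the function names and toy data here are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int = 3) -> list[tuple[float, int]]:
    """Return (score, index) pairs for the k most similar vectors."""
    scored = [(cosine_similarity(query, v), idx) for idx, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings"; real ones in this notebook have 1536 dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], toy_vectors, k=3)
```

Here the query points almost along the first axis, so vector 0 ranks first, the diagonal vector 2 second, and vector 1 third, while the opposite-facing vector 3 is excluded.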


In [None]:
# CELL 7: Query function
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.chat_models import ChatOpenAI

embeddings_model = OpenAIEmbeddings(model='text-embedding-3-small', openai_api_key=OPENAI_API_KEY)
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0.1, openai_api_key=OPENAI_API_KEY, max_tokens=1500)

def ask(question):
    print(f"❓ {question}\n")
    
    # Get embedding
    query_vector = embeddings_model.embed_query(question)
    
    # Search Qdrant (happens in Qdrant process, not your memory!)
    results = client.query_points(collection_name=COLLECTION_NAME, query=query_vector, limit=3).points
    
    # Build context
    context_texts = [r.payload['text'] for r in results]
    # Deduplicate file names and sort them so the source list is stable across runs
    sources = sorted({Path(r.payload['source']).name for r in results})
    context = "\n\n".join(context_texts)
    
    # Generate answer
    prompt = f"""You are a Materials Science expert. Answer using the context below.
Include ISO/DIN standards when relevant.

Context:
{context}

Question: {question}

Answer:"""
    
    answer = llm.invoke(prompt).content
    
    print("💡 ANSWER:")
    print("-"*70)
    print(answer)
    print("-"*70)
    print(f"\n📚 SOURCES: {', '.join(sources)}\n")
    print("="*70 + "\n")
    
    gc.collect()

print("✅ Function ready!")


✅ Function ready!
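
The prompt assembly inside `ask` can be checked without calling OpenAI or Qdrant if it is pulled into a pure function. The sketch below uses the same prompt wording as Cell 7; `build_prompt` and the placeholder hits are hypothetical, and each hit is assumed to be a dict with the `text` and `source` payload fields written in Cell 6.

```python
from pathlib import PurePosixPath


def build_prompt(question: str, hits: list[dict]) -> tuple[str, list[str]]:
    """Assemble the LLM prompt and a deduplicated, sorted list of source files."""
    context = "\n\n".join(hit["text"] for hit in hits)
    sources = sorted({PurePosixPath(hit["source"]).name for hit in hits})
    prompt = (
        "You are a Materials Science expert. Answer using the context below.\n"
        "Include ISO/DIN standards when relevant.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
    return prompt, sources


# Placeholder payloads standing in for Qdrant search results
demo_hits = [
    {"text": "Placeholder passage A.", "source": "docs/a.pdf"},
    {"text": "Placeholder passage B.", "source": "docs/a.pdf"},
    {"text": "Placeholder passage C.", "source": "docs/b.pdf"},
]
demo_prompt, demo_sources = build_prompt("Placeholder question?", demo_hits)
```

Separating prompt construction from retrieval and generation makes it easier to inspect exactly what context the model receives for a given question.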


In [None]:
# CELL 8: Sample questions
Q1 = "What are differences between Brinell and Vickers hardness testing?"
Q2 = "How is grain size measured according to ISO 643?"
Q3 = "What are mechanical properties of EN-AC44300 aluminum alloy?"

print("📋 Questions:")
print(f"1. {Q1}")
print(f"2. {Q2}")
print(f"3. {Q3}")


📋 Questions:
1. What are differences between Brinell and Vickers hardness testing?
2. How is grain size measured according to ISO 643?
3. What are mechanical properties of EN-AC44300 aluminum alloy?




📝 Creating Streamlit app...

✅ Streamlit app created at: ../apps/materials_rag_streamlit_app.py

🚀 To run the app:
   cd ../apps
   streamlit run materials_rag_streamlit_app.py

📝 Or run: streamlit run apps/materials_rag_streamlit_app.py


In [52]:
# CELL 9: Ask question
ask(Q1)


❓ What are differences between Brinell and Vickers hardness testing?

💡 ANSWER:
----------------------------------------------------------------------
One of the main differences between Brinell and Vickers hardness testing is the shape of the indenter used. Brinell testing uses a spherical indenter, while Vickers testing uses a pyramid-shaped diamond indenter. This allows Vickers testing to be used on harder materials, such as high-strength steels.

Another difference is in how the hardness number is calculated. In Brinell testing, the hardness number is calculated by dividing the load by the surface area of the indentation, while in Vickers testing, the hardness number is determined by the size of the indentation made by the indenter.

Additionally, Vickers hardness values are independent of the applied force, meaning that the hardness value obtained with a 10 kgf load should be the same as that obtained with a 50 kgf load on homogeneous material. This is not the case with Brine

In [None]:
# Another question
ask("How is grain size measured according to ISO 643")


❓ How is grain size measured according to ISO 643

💡 ANSWER:
----------------------------------------------------------------------
Grain size is measured according to ISO 643 using the grain size index G, which is calculated using Formula (1), (2), or (3). The index G is then compared with standard grain size charts defined in ASTM E112 at a magnification of ×100. The number of grains per square millimeter, mean diameter of grain, mean area of grain, mean intersected segment, and mean number of intercepts on the measuring line are all parameters used to characterize grain size. Additionally, the Snyder-Graff method can be used for determining the prior-austenitic grain size of hardened and tempered high-speed steels by the linear intercept method. Other methods such as ultrasonic methods and automatic image analysis can also be used for grain size measurement, provided their accuracy has been proven through cross correlation.
-----------------------------------------------------