### Data Intake

In [12]:
%pip install pypdf langchain_community

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("1.pdf")
docs = loader.load()

print(docs[0].page_content[:300])



Note: you may need to restart the kernel to use updated packages.
The TESS Objects of Interest Catalog from the TESS Prime Mission
Natalia M. Guerrero1 , S. Seager1,2,3 , Chelsea X. Huang1,53 , Andrew Vanderburg4,5,54 , Aylin Garcia Soto6 ,
Ismael Mireles1 , Katharine Hesse1 , William Fong1 , Ana Glidden1,2 , Avi Shporer1 , David W. Latham7 ,
Karen A. Collins7 , S


### Chunking

In [13]:
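from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split into chunks of at most 500 characters, with 50 characters of overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

print(len(chunks))             # number of chunks produced
print(chunks[0].page_content)  # text of the first chunk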



320
The TESS Objects of Interest Catalog from the TESS Prime Mission
Natalia M. Guerrero1 , S. Seager1,2,3 , Chelsea X. Huang1,53 , Andrew Vanderburg4,5,54 , Aylin Garcia Soto6 ,
Ismael Mireles1 , Katharine Hesse1 , William Fong1 , Ana Glidden1,2 , Avi Shporer1 , David W. Latham7 ,
Karen A. Collins7 , Samuel N. Quinn7 , Jennifer Burt8 , Diana Dragomir9 , Ian Crossfield1,10, Roland Vanderspek1 ,
Michael Fausnaugh1 , Christopher J. Burke1 , George Ricker1 , Tansu Daylan1,55 , Zahra Essack1,2 ,
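
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, so its chunks will not match these exactly. The helper name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each piece begins with the last three characters of the previous one, which is what lets a sentence cut at a chunk boundary still appear whole in at least one chunk.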


### Embedding

In [14]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

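# Sentence-transformers model used to embed each chunk
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Embed the chunks and store them in a Chroma collection persisted on disk
vectordb = Chroma.from_documents(
    chunks,
    embedding=embeddings,
    persist_directory="chroma_db",
    collection_name="testing",
)

# Number of entries now in the collection (read via a private attribute)
print(vectordb._collection.count())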


355
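
The count printed above (355) is larger than the 320 chunks produced earlier. One possible explanation, which is an assumption and not something the notebook confirms, is that `chroma_db` already held entries from a previous run, since writing to an existing persisted collection adds to it. A common way to keep re-runs idempotent is to derive a stable ID for each chunk and pass the list through the `ids` argument of `Chroma.from_documents`. The sketch below shows only the ID derivation in plain Python; `chunk_id` is a hypothetical helper and the texts are illustrative.

```python
import hashlib


def chunk_id(source: str, index: int, text: str) -> str:
    """Stable ID: the same source, position, and text always give the same ID."""
    payload = f"{source}|{index}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


texts = ["first chunk", "second chunk", "first chunk"]  # illustrative data
ids = [chunk_id("1.pdf", i, t) for i, t in enumerate(texts)]
ids_again = [chunk_id("1.pdf", i, t) for i, t in enumerate(texts)]

# IDs are reproducible across runs and unique within one run,
# even when two chunks happen to contain identical text.
print(ids == ids_again, len(set(ids)) == len(ids))
```

Including the position in the hash keeps IDs unique when the same text appears twice, at the cost of IDs changing if the chunking parameters change.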


In [15]:
sample_vector = embeddings.embed_query("What is this document about?")
print(sample_vector[:10])

[-0.10597019642591476, 0.16800662875175476, 0.02391703799366951, 0.017985714599490166, 0.009284470230340958, 0.04012048617005348, 0.021706245839595795, -0.00509650306776166, -0.025474686175584793, 0.010306820273399353]


### Retrieval

In [16]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
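
Conceptually, a retriever with `k=3` embeds the query, scores it against the stored chunk vectors, and returns the three closest chunks. The sketch below illustrates that idea with cosine similarity over tiny hand-made vectors. It is not how Chroma works internally (Chroma uses an approximate nearest-neighbour index and its own distance setting), and the names `cosine`, `top_k`, and `store` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda name: cosine(query, store[name]), reverse=True)
    return ranked[:k]


# Made-up 3-dimensional "embeddings" keyed by a label for each chunk
store = {
    "light curves": [0.9, 0.1, 0.0],
    "author list": [0.0, 0.2, 0.9],
    "transit depth": [0.8, 0.3, 0.1],
    "acknowledgements": [0.1, 0.9, 0.2],
}
result = top_k([1.0, 0.0, 0.0], store, k=2)
print(result)
```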

### AI Model Definition

In [18]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA

load_dotenv()

# Read the Groq API key from the environment (populated from .env by load_dotenv)
groq_api_key = os.getenv("GROQ_API_KEY")

llm = ChatGroq(model="openai/gpt-oss-20b", api_key=groq_api_key)

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an assistant. Use the retrieved context to answer the question.

Context:
{context}

Question:
{question}
"""
)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={
        "prompt": prompt,
    },
)
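
With its default chain type (`stuff`), `RetrievalQA` places the text of the retrieved chunks into the `{context}` slot of the prompt and the user's query into `{question}`. The sketch below imitates that step with plain string formatting so the final prompt is visible. `build_prompt` is a hypothetical helper, the chunk texts are made up, and joining chunks with a blank line is an assumption about the separator.

```python
TEMPLATE = """
You are an assistant. Use the retrieved context to answer the question.

Context:
{context}

Question:
{question}
"""


def build_prompt(chunks: list[str], question: str) -> str:
    """Fill the template with the joined chunk texts and the question."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    ["Chunk A text.", "Chunk B text."],  # stand-ins for retrieved chunks
    "Tell me about light curves",
)
print(prompt_text)
```

Printing the filled prompt like this is a quick way to check what the model actually sees when an answer looks off.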


### Final Testing

In [19]:
query = "Tell me about light curves"

answer = rag_chain.invoke(query)

print("Question:", query)
print("\nAnswer:\n")
print(answer["result"])


Question: Tell me about light curves

Answer:

**Light curves – a quick guide**

A *light curve* is simply a time‑series record of a star’s brightness (flux) measured in a given photometric band. In exoplanet studies the most common goal is to detect the tiny, periodic dimming that occurs when a planet passes (or *transits*) in front of its host star. The shape, depth, and duration of that dimming encode the planet’s size, orbital period, and sometimes even atmospheric properties.

---

### 1. Why light curves are noisy

Real observations are never perfect. Several sources of low‑frequency variability contaminate the raw flux:

| Source | Effect on light curve |
|--------|-----------------------|
| **Stellar activity** (spots, faculae, rotation) | Slow, quasi‑periodic brightness changes |
| **Instrumental drifts** (thermal changes, focus shifts) | Long‑term trends, jumps |
| **Spacecraft events** (momentum dumps, reaction‑wheel adjustments) | Sudden outliers or gaps |
| **Scattered lig