In [1]:
# Learn how to split large documents into smaller chunks using RecursiveCharacterTextSplitter.
# Why this matters:
# LLMs have context-length limits (e.g., 8k or 32k tokens).
# Splitting enables efficient retrieval while preserving context.
# Overlap between chunks helps avoid cutting important sentences midway.
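
As a rough illustration of what overlap does, the sketch below slides a fixed-width window over a string. It is a simplification under stated assumptions: `chunk_with_overlap` is a hypothetical helper, not part of LangChain, and RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and spaces rather than at fixed offsets.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-width sliding window; hypothetical helper for illustration only."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


alphabet = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_with_overlap(alphabet, chunk_size=10, overlap=0)
with_overlap = chunk_with_overlap(alphabet, chunk_size=10, overlap=3)
print(no_overlap)    # each character appears in exactly one chunk
print(with_overlap)  # each chunk repeats the last 3 characters of the previous one
```

With overlap, text that straddles a chunk boundary appears in both neighbours, at the cost of storing some characters twice.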

In [None]:
%pip install -q langchain langchain_groq pypdf langchain_community langchain-text-splitters

In [None]:
!pip show langchain langchain_groq pypdf langchain_community

Name: langchain
Version: 1.2.3
Summary: Building applications with LLMs through composability
Home-page: https://docs.langchain.com/
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: langchain-core, langgraph, pydantic
Required-by: 
---
Name: langchain-groq
Version: 1.1.1
Summary: An integration package connecting Groq and LangChain
Home-page: https://docs.langchain.com/oss/python/integrations/providers/groq
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: groq, langchain-core
Required-by: 
---
Name: pypdf
Version: 6.6.0
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page: 
Author: 
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: 
Required-by: 
---
Name: langchain-community
Version: 0.4.1
Summary: Community contributed LangChain integrations.
Home-page:

In [None]:
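from langchain_community.document_loaders import PyPDFLoader

# Path on Google Drive; assumes Drive is already mounted at /content/drive.
pdf_path = "/content/drive/MyDrive/GEN_AI/scholarship-rules-role-based.pdf"
loader = PyPDFLoader(pdf_path)

# PyPDFLoader returns one Document per PDF page.
docs = loader.load()

print(f"Loaded {len(docs)} pages.")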


Loaded 3 pages.


In [None]:
docs

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250812134403', 'source': '/content/drive/MyDrive/GEN_AI/scholarship-rules-role-based.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'}, page_content='CFA INSTITUTE ROLE BASED SCHOLARSHIP RULES | UPDATED: AUGUST 2025 | PAGE 1  \n  \n \nCFA Institute Role Based Scholarship Rules \n(Excludes Access Scholarships) \nLast Updated: August 2025  \nThe CFA Institute role-based scholarships are designed to raise global awareness of CFA Institute programs \namong key influencers within academia, the regulatory public sector and across the investment \nmanagement industry.  Except for Professor and Accelerate Scholarships, these will be distributed by the \nrelevant sponsor organization (e.g., a university, a regulatory body or other strategic partner (Sponsor)) \n \nEligibility  \nApplicants must meet all CFA Program enrollment requirements, including  not having exceeded the \nmaximum number of exam attempts 
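
The `creationdate` value in the metadata above uses the PDF date format, which starts with `D:` followed by year, month, day, hour, minute and second. A minimal sketch for turning it into a `datetime` is shown below; `parse_pdf_date` is a hypothetical helper, and it assumes all 14 digits are present and ignores any time-zone suffix, which the PDF format allows but this file does not include.

```python
from datetime import datetime


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date such as 'D:20250812134403' (hypothetical helper)."""
    # Drop the optional 'D:' prefix and keep the YYYYMMDDHHMMSS part only.
    digits = value.removeprefix("D:")[:14]
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


created = parse_pdf_date("D:20250812134403")
print(created.isoformat())  # 2025-08-12T13:44:03
```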

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size is measured in characters by default; overlap is disabled here.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
splits = text_splitter.split_documents(docs)

In [None]:
print(splits[0].page_content)

CFA INSTITUTE ROLE BASED SCHOLARSHIP RULES | UPDATED: AUGUST 2025 | PAGE 1


In [None]:
print(splits[1].page_content)

CFA Institute Role Based Scholarship Rules 
(Excludes Access Scholarships)


In [None]:
print(splits[3].page_content)

The CFA Institute role-based scholarships are designed to raise global awareness of CFA Institute


In [None]:
print(splits[4].page_content)

programs
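
The one-word chunk above is a side effect of the 100-character limit: the sentence's first line is slightly longer than 100 characters, so its last word spills into a chunk of its own. The sketch below reproduces that effect with a simplified greedy word packer. `pack_words` is a hypothetical helper and only approximates one step of the splitter (packing the words of a single over-long line); it is not the library's actual implementation.

```python
def pack_words(line: str, chunk_size: int) -> list[str]:
    """Greedily pack whole words into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for word in line.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            # The word fits, or it is the first word of a new chunk.
            current = candidate
        else:
            # The word would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


line = (
    "The CFA Institute role-based scholarships are designed to raise "
    "global awareness of CFA Institute programs"
)
small = pack_words(line, chunk_size=100)
large = pack_words(line, chunk_size=1000)
print(small)
print(large)
```

With `chunk_size=100` the final word ends up alone, while `chunk_size=1000` keeps the line in one piece.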


In [39]:
from google.colab import userdata
GROQ_API_KEY=userdata.get('GROQ_API_KEY')

In [None]:
from langchain_groq import ChatGroq
llm = ChatGroq(model="openai/gpt-oss-20b", api_key=GROQ_API_KEY)

In [41]:
sample_data=splits[0].page_content
summary=llm.invoke(f"Summarize this chunk of data in 3 bullet points:\n\n{sample_data}")
print("Groq LLM Summary:\n",summary.content)

Groq LLM Summary:
 - **Updated Scholarship Framework** – The CFA Institute has released a new set of role‑based scholarship rules, refreshed and officially dated August 2025.  
- **Key Focus Areas** – The document outlines eligibility requirements, the application process, and the criteria used to award scholarships to candidates in specific roles (e.g., students, professionals, educators).  
- **Document Reference** – This excerpt represents the first page of the official rules publication, serving as the introductory reference for applicants and administrators.


In [42]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Re-split with a larger chunk size so each chunk carries more surrounding context.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
splits = text_splitter.split_documents(docs)
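
Because `chunk_size` counts characters while model limits are stated in tokens, it can help to estimate how many chunks fit in a prompt. The sketch below is a back-of-the-envelope calculation only: the ratio of about 4 characters per token is a common rule of thumb for English text rather than a property of any particular tokenizer, and `estimate_tokens`, `chunks_that_fit` and the 1,000-token reserve are hypothetical choices for illustration.

```python
import math


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the chars_per_token ratio is an assumption."""
    return math.ceil(num_chars / chars_per_token)


def chunks_that_fit(context_tokens: int, chunk_chars: int, reserved_tokens: int = 0) -> int:
    """How many chunks of chunk_chars characters fit in the remaining token budget."""
    budget = context_tokens - reserved_tokens
    return max(budget // estimate_tokens(chunk_chars), 0)


# An 8k-token context with 1,000 tokens reserved for instructions and the answer.
fit_small = chunks_that_fit(context_tokens=8000, chunk_chars=100, reserved_tokens=1000)
fit_large = chunks_that_fit(context_tokens=8000, chunk_chars=1000, reserved_tokens=1000)
print(fit_small, fit_large)
```

For exact counts, the tokenizer of the target model would need to be used instead of the fixed ratio.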

In [43]:
sample_data=splits[0].page_content
summary=llm.invoke(f"Summarize this chunk of data in 3 bullet points:\n\n{sample_data}")
print("Groq LLM Summary:\n",summary.content)

Groq LLM Summary:
 - **Purpose & Scope** – The scholarships aim to boost global awareness of CFA Institute programs among key influencers in academia, regulatory bodies, and the investment‑management industry.  
- **Distribution Mechanism** – Except for Professor and Accelerate Scholarships, awards are administered by sponsoring partners (universities, regulators, or other strategic partners).  
- **Eligibility Requirements** – Candidates must satisfy all CFA Program enrollment criteria (including not exceeding the maximum exam attempts) and may be either new or continuing candidates; specific disqualifying circumstances are outlined but not detailed in the excerpt.


In [45]:
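# Note: splits[2] and splits[3] are interpolated as whole Document objects, so their
# metadata (including the PDF creation date) reaches the prompt along with the text.
# Use .page_content on each split to send only the text.
summary = llm.invoke(f"Extract key dates from this data:\n\n{splits[1].page_content} {splits[2]} {splits[3]}")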
print("Groq LLM Summary:\n",summary.content)

Groq LLM Summary:
 **Key dates extracted from the provided text**

| Source snippet | Date found | Format / Notes |
|-----------------|------------|----------------|
| “effective with scholarships awarded **1 September 2024**” | **1 September 2024** | Start of the new scholarship rule set |
| “CFA INSTITUTE ROLE BASED SCHOLARSHIP RULES | **UPDATED: AUGUST 2025** | **August 2025** | Date of the latest update to the rules |
| PDF metadata: `creationdate: 'D:20250812134403'` | **12 August 2025 13:44:03** | Exact timestamp when the PDF was generated |

These are the only explicit dates mentioned in the text you provided.
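
The PDF metadata row in the table above appears because formatting a whole document object into an f-string includes its metadata, not just its text. The sketch below shows the difference using `Doc`, a hypothetical dataclass standing in for a LangChain `Document` so the example runs without any dependencies; the sample strings are taken from the outputs shown earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


chunks = [
    Doc("effective with scholarships awarded 1 September 2024",
        {"creationdate": "D:20250812134403"}),
    Doc("Last Updated: August 2025",
        {"creationdate": "D:20250812134403"}),
]

# Formatting the objects themselves pulls the metadata into the prompt.
naive_prompt = f"Extract key dates from this data:\n\n{chunks[0]} {chunks[1]}"

# Joining only page_content keeps the prompt limited to the document text.
text_only = "\n\n".join(chunk.page_content for chunk in chunks)
clean_prompt = f"Extract key dates from this data:\n\n{text_only}"

print("creationdate" in naive_prompt)
print("creationdate" in clean_prompt)
```

Whether metadata belongs in the prompt depends on the task; for date extraction it can add dates that are not part of the document's content.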


In [46]:
# Vector databases and retrievers