<a href="https://colab.research.google.com/github/Diego-Hernandez-Jimenez/RAG_NATO_streamlit/blob/main/notebooks/create_vector_database_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd '/content/drive/MyDrive/NATO RAG'

/content/drive/MyDrive/NATO RAG


In [3]:
!pip install -qU langchain-community --quiet
!pip install -qU langchain-chroma --quiet
!pip install -qU langchain-huggingface --quiet # not necessary if we finally use gemini embeddings
!pip install -qU langchain-google-genai --quiet
!pip install -qU langchain-groq --quiet

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m28.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m611.1/611.1 kB[0m [31m14.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m43.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m278.6/278.6 kB[0m [31m18.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.8/94.8 kB[0m [31m7.5 MB/s[0m eta [36m0:0

In [5]:
from google.colab import userdata
import os
from re import sub as regex_sub
from random import randint

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings # not necessary if we finally use gemini embeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_groq import ChatGroq

## Load the document(s)

In [6]:
loaded_doc = TextLoader('./alta-handbook_MinerU/auto/edited_alta-handbook.md').load()
print(f'{loaded_doc[0].metadata["source"]} has {len(loaded_doc[0].page_content)} characters')
print(f'\n{loaded_doc[0].page_content[:250]}\n...')

./alta-handbook_MinerU/auto/edited_alta-handbook.md has 207941 characters

![](images/d106f1972cd1f87845f2df3a06fb6506f5fab195d30a46ac2d0a8b43d0b1fcfa.jpg)  

# The NATO Alternative Analysis Handbook  

Second Edition – December 2017  

The NATO Alternative Analysis Handbook  

Second Edition, December 2017  

The Alliance 
...


### Preprocessing

In [7]:
def clean_text(text):
    # remove the placeholder sentence printed on "blank" pages (the final dot is literal)
    text = regex_sub(r'This page is intentionally left blank\.', '', text)
    # remove image links and their captions; they are not used downstream
    # captions are matched as letters and spaces only
    text = regex_sub(r' ?\n?!\[\]\(images/.+\.jpg\)  \nFigure \d\d? – [A-Za-z ]+ \n', '', text)

    return text

markdown_document = clean_text(loaded_doc[0].page_content)
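
A quick sanity check of the two cleaning patterns can be run on a small hand-made string before applying them to the whole handbook. This is only a sketch: the `sample` text and the file name `abc123.jpg` are made up to mimic the MinerU layout, and the patterns are written with the final dot escaped and the caption class spelled as `[A-Za-z ]`.

```python
import re

# Hypothetical sample mimicking the MinerU markdown layout (not taken from the handbook)
sample = (
    "Some paragraph text.  \n"
    "![](images/abc123.jpg)  \n"
    "Figure 3 – Example caption \n"
    "This page is intentionally left blank.\n"
    "Next paragraph."
)

BLANK_PAGE = r'This page is intentionally left blank\.'
IMAGE_AND_CAPTION = r' ?\n?!\[\]\(images/.+\.jpg\)  \nFigure \d\d? – [A-Za-z ]+ \n'

cleaned = re.sub(BLANK_PAGE, '', sample)
cleaned = re.sub(IMAGE_AND_CAPTION, '', cleaned)

# repr() makes the remaining whitespace and newlines visible
print(repr(cleaned))
```

Printing with `repr` is handy here because the patterns depend on exact trailing spaces and newlines, which are otherwise invisible.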

## Split the document into manageable text chunks

In [8]:
# markdown_document = loaded_doc[0].page_content
# the text seems to contain only level-1 ("#") headers, but listing deeper levels does no harm
headers_to_split_on = [
    ("#", "Topic 1"),
    ("##", "Topic 2"),
    ("###", "Topic 3"),
]

# I keep the headers because the titles add a lot of important semantic content to the chunk
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)[6:] # discard first 6 chunks (cover, index)

# concatenate each chunk's header metadata into a single string
header_names = [''.join(doc.metadata.values()) for doc in md_header_splits]

for doc in md_header_splits:
    print(f'metadata: {doc.metadata}\ncontent: {doc.page_content}\n...')

metadata: {'Topic 1': '1 Alternative Analysis Explained'}
content: # 1 Alternative Analysis Explained  
Alternative Analysis (AltA) is a capability described as follows:  
AltA is the deliberate application of independent, critical thought and alternative perspective to improve decision-making.  
The key words in this description are independent, critical thought, and alternative perspective. First, independent refers to being free from influence or control by others in matters of belief or thinking. Second, critical thought – also known as critical thinking – is the intellectually disciplined process of conceptualizing, applying, analysing, synthesizing, and evaluating information. It is necessary for valid reasoning when drawing conclusions about goals, problems, assumptions, concepts, evidence, implications, and consequences. Finally, alternative perspective is the result of looking at a situation, problem, or fact through a different mindset, cultural frame, or value and belief str
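
To make the header-based splitting more concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration and not LangChain's implementation: the function name `split_on_h1` is hypothetical, it only looks at level-1 `# ` headers, it ignores fenced code blocks, and the `toy` text is made up. As in the cell above, the header line is kept inside the chunk content and also stored as metadata.

```python
def split_on_h1(markdown_text):
    """Split markdown into chunks that each start at a level-1 '# ' header.

    Simplified illustration: deeper header levels and code fences are ignored.
    """
    chunks = []
    header, lines = None, []

    def flush():
        # store the current chunk if it has any non-blank content
        content = "\n".join(lines).strip()
        if content:
            metadata = {"Topic 1": header} if header else {}
            chunks.append({"metadata": metadata, "content": content})

    for line in markdown_text.splitlines():
        if line.startswith("# "):
            flush()
            header, lines = line[2:].strip(), [line]  # keep the header in the content
        else:
            lines.append(line)
    flush()
    return chunks


# Made-up toy document, only to show the shape of the result
toy = "# Benefits\nPMI takes little time.\n# Further reading\nA web page about PMI."
chunks = split_on_h1(toy)
for chunk in chunks:
    print(chunk["metadata"], len(chunk["content"]))
```

Because every header starts a new chunk, short sections such as "Benefits" or "Further reading" become chunks of their own, which is worth keeping in mind when looking at retrieval results.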

## Create vector store

In [11]:
# embeddings = HuggingFaceEmbeddings(model_name='all-mpnet-base-v2') # in v1 I used models from sentence transformers
embeddings = GoogleGenerativeAIEmbeddings(model='models/text-embedding-004', google_api_key=userdata.get('google_api_key'))

In [15]:
db_dir = './vector_db_alta_v2'

vector_db = Chroma.from_documents(
    collection_name='alta_handbook',
    documents=md_header_splits,
    embedding=embeddings,
    persist_directory=db_dir
)
# if the database has already been created, load it instead of rebuilding it:
# vector_db = Chroma(collection_name='alta_handbook', persist_directory=db_dir, embedding_function=embeddings)
retriever = vector_db.as_retriever(search_type='similarity', search_kwargs={'k': 3})

In [19]:
# query = 'What does AltA mean?'
# query = 'What are the benefits of brainstorming?'
# query = 'How can I manage disruptive behaviour during a session?'
# query = 'Point me to some resources to learn more about SWOT'
query = 'What is PMI?'
retrieved = retriever.invoke(query)

for i, doc in enumerate(retrieved, start=1):
    print(f'Result {i}:\n')
    print(doc.page_content)
    print('-'*100)

Result 1:

# Further reading  
https://www.mindtools.com/pages/article/newTED_05.htm Web page about PMI.
https://www.stickyminds.com/article/mind-changing-exercise Web page about PMI.
----------------------------------------------------------------------------------------------------
Result 2:

# Benefits  
PMI:  
allows you to look at a topic from different angles.  
takes little time to complete, but is nevertheless very effective.
----------------------------------------------------------------------------------------------------
Result 3:

# Plusses, Minuses, Interesting (PMI)  
(for individual, 2–10, or more than 10 people; easy)  
A very simple technique that weighs up the pros and cons as well as any interesting points regarding a decision by contrasting them with each other.
----------------------------------------------------------------------------------------------------
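
For intuition about what a top-`k` similarity search does, the sketch below ranks a few toy vectors against a query vector. It is an illustration under stated assumptions: it uses cosine similarity, whereas the metric actually used by the Chroma collection depends on its configuration, and the names `cosine_similarity`, `top_k` and `toy_vectors` are hypothetical. Real embeddings have hundreds of dimensions, not two.

```python
import math


def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    # score every stored vector against the query and keep the k best names
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy 2-dimensional "embeddings" (made up for the example)
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.8, 0.6],
    "chunk_c": [0.0, 1.0],
    "chunk_d": [-1.0, 0.0],
}

ranking = top_k([1.0, 0.1], toy_vectors, k=3)
print(ranking)
```

The ranking depends only on how close the vectors are to the query, not on how informative a chunk is, which is consistent with a short "Further reading" chunk that repeats "PMI" ranking above the section that actually defines it.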
