Loading and reading the data

In [2]:
from bs4 import BeautifulSoup
with open('/kaggle/input/gdpr-chromdb/gdprrr.html', 'r', encoding='utf-8') as file:
    html_content = file.read()
soup = BeautifulSoup(html_content, 'html.parser')

paragraphs = soup.find_all('p')

text_content = "\n".join([p.get_text() for p in paragraphs])

text_content[:1000]

'4.5.2016\xa0\xa0\xa0\nEN\nOfficial Journal of the European Union\nL 119/1\n\n            REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\n         \nof 27 April 2016\n         \non the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive\xa095/46/EC (General Data Protection Regulation)\n(Text with EEA relevance)\nTHE EUROPEAN PARLIAMENT AND THE COUNCIL OF THE EUROPEAN UNION,\nHaving regard to the Treaty on the Functioning of the European Union, and in particular Article\xa016 thereof,\nHaving regard to the proposal from the European Commission,\nAfter transmission of the draft legislative act to the national parliaments,\nHaving regard to the opinion of the European Economic and Social Committee\xa0(1),\nHaving regard to the opinion of the Committee of the Regions\xa0(2),\nActing in accordance with the ordinary legislative procedure\xa0(3),\nWhereas:\n(1)\nThe protection of na
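The extracted text above still contains non-breaking spaces (`\xa0`) and stray indentation from the HTML. A minimal cleanup sketch, assuming it is acceptable to normalize the text before chunking; `clean_text` is a hypothetical helper, not part of the original notebook, and the sample string is a shortened copy of the output above:

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize compatibility characters and tidy whitespace."""
    # NFKC maps the non-breaking space U+00A0 to a regular space.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces/tabs, but keep newlines as paragraph separators.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop lines that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


sample = "4.5.2016\xa0\xa0\xa0\nEN\n\n            REGULATION (EU) 2016/679\n         \nDirective\xa095/46/EC"
print(clean_text(sample))
```

Note that NFKC also rewrites other compatibility characters (for example superscript digits), so it may be worth checking a few recitals by eye before relying on it.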

In [3]:
!pip install -U langchain-community


Collecting langchain-community
  Downloading langchain_community-0.2.6-py3-none-any.whl.metadata (2.5 kB)
Collecting langchain<0.3.0,>=0.2.6 (from langchain-community)
  Downloading langchain-0.2.6-py3-none-any.whl.metadata (7.0 kB)
Collecting langchain-core<0.3.0,>=0.2.10 (from langchain-community)
  Downloading langchain_core-0.2.10-py3-none-any.whl.metadata (6.0 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain-community)
  Downloading langsmith-0.1.82-py3-none-any.whl.metadata (13 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain<0.3.0,>=0.2.6->langchain-community)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting packaging<25,>=23.2 (from langchain-core<0.3.0,>=0.2.10->langchain-community)
  Downloading packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.0->langchain-community)
  Downloading orjson-3.10.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_6

In [4]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-3.0.1


Splitting the text into chunks with hierarchical chunking based on the HTML headers,
then creating embeddings from the chunks

In [25]:
import nltk
from nltk.tokenize import sent_tokenize
from transformers import AutoTokenizer
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from bs4 import BeautifulSoup


nltk.download('punkt')


tokenizer = AutoTokenizer.from_pretrained('bigscience/bloomz')

def chunk_text_based_on_tokens(text, max_tokens=200):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(tokenizer.tokenize(sentence))
        if current_length + sentence_length <= max_tokens:
            current_chunk.append(sentence)
            current_length += sentence_length
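        else:
            # Flush the current chunk. Skip it when empty, which happens if the
            # very first sentence already exceeds max_tokens.
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_length = sentence_length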


    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks


def extract_sections_articles_chapters(soup):
    sections = []
    current_section = []
    for element in soup.find_all(['h1', 'h2', 'h3', 'p']):
        if element.name in ['h1', 'h2', 'h3']:
            if current_section:
                sections.append(" ".join(current_section))
                current_section = []
            current_section.append(element.get_text())
        else:
            current_section.append(element.get_text())
    if current_section:
        sections.append(" ".join(current_section))
    return sections


with open('/kaggle/input/gdpr-chromdb/gdprrr.html', 'r', encoding='utf-8') as file:
    html_content = file.read()


soup = BeautifulSoup(html_content, 'html.parser')


sections = extract_sections_articles_chapters(soup)


all_chunks = []
for section in sections:
    all_chunks.extend(chunk_text_based_on_tokens(section))


model_name = "BAAI/bge-large-en"
encode_kwargs = {'normalize_embeddings': True}

model_norm = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)


embeddings = model_norm.embed_documents(all_chunks)


print(f"Number of chunks: {len(all_chunks)}")
print(f"Sample Embedding: {embeddings[0]}")


for i, chunk in enumerate(all_chunks[:]):
    print(f"Chunk {i+1}:\n{chunk}\n")


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Number of chunks: 364
Sample Embedding: [0.009983654133975506, 0.005418587010353804, -0.017273331061005592, 0.025197450071573257, -0.02126961573958397, 0.012419218197464943, -0.02804456651210785, 0.03317546844482422, 0.014056425541639328, 0.03964075818657875, 0.05111420527100563, -0.010792930610477924, 0.004926718771457672, -0.03951730206608772, -0.021753985434770584, 0.01878412812948227, -0.006961434613913298, -0.04184645414352417, -0.019962972030043602, -0.0016141011146828532, -4.094945325050503e-05, -0.010066633112728596, -0.05554582551121712, 0.005934289190918207, 0.0010987356072291732, 0.01677645370364189, 0.01775052584707737, 0.004509891849011183, 0.04856783524155617, 0.05027959868311882, -0.033771708607673645, -0.043453067541122437, 0.01772129163146019, -0.044256843626499176, -0.035511694848537445, -0.012992396019399166, 0.018195191398262978, -0.039035260677337646,
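One thing the chunker above does not handle: a single sentence longer than `max_tokens` is emitted as one oversized chunk, and legal text has plenty of long sentences. A minimal sketch of a fallback splitter, assuming whitespace tokens as a stand-in for the real tokenizer; `split_oversized` is a hypothetical helper and is not used anywhere in the notebook:

```python
from typing import Callable, List


def split_oversized(sentence: str, max_tokens: int,
                    tokenize: Callable[[str], List[str]] = str.split) -> List[str]:
    """Split one sentence into pieces of at most max_tokens tokens."""
    tokens = tokenize(sentence)
    if len(tokens) <= max_tokens:
        return [sentence]
    # Cut the token list into consecutive windows of max_tokens.
    # Joining with spaces is only faithful for whitespace tokens; a subword
    # tokenizer would need its own decode step instead.
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


print(split_oversized("one two three four five", max_tokens=2))
```

Each piece could then be fed through the same greedy packing loop as ordinary sentences.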

In [14]:

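# Note: this prints embeddings[86], i.e. the 87th chunk (0-based indexing),
# so the guard checks that index 86 actually exists.
chunk_index = 86
if len(embeddings) > chunk_index:
    chunk_85_embedding = embeddings[chunk_index]
    print(f"Embedding for chunk 85: {chunk_85_embedding}")
else:
    print(f"Expected more than {chunk_index} chunks, but got {len(embeddings)}")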


Embedding for chunk 85: [0.0001964164839591831, 0.00540191400796175, -0.01493762992322445, 0.002094501629471779, -0.028230808675289154, -0.0006597876781597733, -0.006371537689119577, 0.006491244770586491, 0.033829230815172195, 0.021910417824983597, 0.034737955778837204, -0.007520964369177818, 0.015543515793979168, -0.007040165830403566, -0.02521529421210289, 0.03156011924147606, -0.036894336342811584, -0.012767880223691463, -0.03275051340460777, -0.0027576338034123182, 0.03758380189538002, 0.01695163920521736, -0.05268487334251404, -0.030427560210227966, -0.02565152756869793, 0.03702085092663765, 0.03414197638630867, 0.007072494365274906, 0.07895106077194214, 0.06399844586849213, -0.02766314148902893, -0.018347280099987984, 0.013414734043180943, -0.04161306843161583, -0.029052210971713066, -0.005673940759152174, 0.023307347670197487, -0.0693320780992508, -0.009348267689347267, -0.05278225615620613, 0.015038559213280678, -0.0012698137434199452, 0.05128074437379837, -0.04027838632464409,

Using Chroma DB as the vector database to store the embeddings and retrieve them later

In [7]:
!pip install chromadb


  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting chromadb
  Downloading chromadb-0.5.3-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.18.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Collecting opentelemetry-instrumentation-fastapi>=0.41b0 (from chromadb)
  Downloading opentelemetry_instrumentation_fastapi-0.46b0-py3-none-any.whl.metadata (2.0 kB)
Collecting pypika>=0.48.9 (from chromadb)
  Downloading PyPika-0.48.9.tar.gz (67 kB)
  Inst

In [15]:
import chromadb
chroma_client = chromadb.Client()

In [16]:
collection_name = "embeddings_gdpr_collection_ivf_cosine"


try:
    chroma_client.delete_collection(name=collection_name)
    print(f"Collection {collection_name} deleted successfully.")
except Exception as e:
    print(f"Error deleting collection: {e}")


try:
    collection = chroma_client.create_collection(name=collection_name)
    print(f"Collection {collection_name} created successfully.")
except Exception as e:
    print(f"Error creating collection: {e}")


Collection embeddings_gdpr_collection_ivf_cosine deleted successfully.
Collection embeddings_gdpr_collection_ivf_cosine created successfully.
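A note on the collection name: it says "cosine", but the collection is created without any distance setting, and as far as I know Chroma then falls back to its default squared L2 distance (the `hnsw:space` metadata key is the usual way to change it). Because the embeddings were created with `normalize_embeddings=True`, this should not change the ranking: for unit vectors, squared L2 distance equals 2 - 2 * cosine similarity. A small pure-Python check of that identity, with made-up vectors:

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, normalized the same way the embedding model normalizes its output.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)            # cosine similarity of unit vectors is just the dot product
distance = squared_l2(a, b)   # squared Euclidean distance

print(abs(distance - (2 - 2 * cosine)) < 1e-9)
```

So a smaller squared L2 distance always means a larger cosine similarity here, and the top-k order is the same under both metrics.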


In [17]:

# Add all chunks in one call; Chroma accepts parallel lists of equal length.
collection.add(
    documents=all_chunks,
    ids=[f"id_{i}" for i in range(len(all_chunks))],
    embeddings=embeddings
)



Create an embedding for my query so I can compare it later with the embeddings stored in Chroma DB

In [18]:
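def embed_query(query, embedding_model):
    # embedding_model is the embeddings object (model_norm), not a model name.
    # The query is embedded the same way as the documents here; the class also
    # offers embed_query(), which prepends a BGE query instruction.
    query_embedding = embedding_model.embed_documents([query])
    return query_embedding[0]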


In [19]:
def query_chroma_db(query_embedding, collection, top_k=5):
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results
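`collection.query` returns a dict of lists of lists: one inner list per query embedding, under keys such as `ids`, `documents` and `distances`. A minimal sketch of turning that into a ranked list for one query; `ranked_hits` is a hypothetical helper and the result dict below is made up to mimic the structure, not real output:

```python
def ranked_hits(results, query_index=0):
    """Pair up ids, distances and documents for one query of a Chroma-style result."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    documents = results["documents"][query_index]
    return [
        {"rank": rank, "id": doc_id, "distance": dist, "text": text}
        for rank, (doc_id, dist, text) in enumerate(
            zip(ids, distances, documents), start=1
        )
    ]


# Hypothetical result for a single query; ids, distances and texts are invented.
fake_results = {
    "ids": [["id_12", "id_40"]],
    "distances": [[0.21, 0.35]],
    "documents": [["Consistent and homogenous application ...",
                   "(13) In order to ensure ..."]],
}

hits = ranked_hits(fake_results)
for hit in hits:
    print(hit["rank"], hit["id"], hit["distance"], hit["text"][:40])
```

Printing the distance next to each chunk makes it easier to see where relevance drops off than printing the raw `results['documents']` list.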

Setting the question and getting an answer

In [26]:
user_query = "How does the regulation ensure a consistent level of protection for personal data across all Member States while allowing for national specificity in certain processing situations?"

query_embedding = embed_query(user_query, model_norm)


results = query_chroma_db(query_embedding, collection, top_k=10)


for result in results['documents']:
    print(result)


['Consistent and homogenous application of the rules for the protection of the fundamental rights and freedoms of natural persons with regard to the processing of personal data should be ensured throughout the Union. Regarding the processing of personal data for compliance with a legal obligation, for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller, Member\xa0States should be allowed to maintain or introduce national provisions to further specify the application of the rules of this Regulation.', '(13) In order to ensure a consistent level of protection for natural persons throughout the Union and to prevent divergences hampering the free movement of personal data within the internal market, a Regulation is necessary to provide legal certainty and transparency for economic operators, including micro, small and medium-sized enterprises, and to provide natural persons in all Member\xa0States with the same leve

trial


Using TF-IDF to check the answer

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Initial keyword-based retrieval using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(all_chunks)
query_tfidf = tfidf_vectorizer.transform([user_query])

# Get top 50 documents using TF-IDF
tfidf_scores = cosine_similarity(query_tfidf, tfidf_matrix).flatten()
top_tfidf_indices = tfidf_scores.argsort()[-50:][::-1]
top_tfidf_documents = [all_chunks[i] for i in top_tfidf_indices]

# Embedding-based reranking with the BGE model, loaded through sentence-transformers
model = SentenceTransformer('BAAI/bge-large-en')
query_embedding = model.encode(user_query)
document_embeddings = model.encode(top_tfidf_documents)

cosine_scores = cosine_similarity([query_embedding], document_embeddings).flatten()
top_indices = cosine_scores.argsort()[-10:][::-1]

# Final top 10 documents
top_documents = [top_tfidf_documents[i] for i in top_indices]

# Select the most relevant document (first in the ranked list)
most_relevant_document = top_documents[0]


# Present the most relevant document as the answer
print("\nGenerated Answer:")
print(most_relevant_document)



Generated Answer:
(10) In order to ensure a consistent and high level of protection of natural persons and to remove the obstacles to flows of personal data within the Union, the level of protection of the rights and freedoms of natural persons with regard to the processing of such data should be equivalent in all Member States. Consistent and homogenous application of the rules for the protection of the fundamental rights and freedoms of natural persons with regard to the processing of personal data should be ensured throughout the Union. Regarding the processing of personal data for compliance with a legal obligation, for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller, Member States should be allowed to maintain or introduce national provisions to further specify the application of the rules of this Regulation. In conjunction with the general and horizontal law on data protection implementing Directive 9

Let's look at the cosine similarity and the semantic similarity between the generated answer and a reference answer (I take the first, long retrieval output as the generated answer)

In [21]:
!pip install scikit-learn
!pip install transformers


  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import nltk
nltk.download('punkt')

# Cosine Similarity Evaluation (on TF-IDF vectors)
def evaluate_cosine_similarity(reference_answer, generated_answer):
    # Fit TF-IDF on just the two texts and compare their vectors.
    tfidf_matrix = TfidfVectorizer().fit_transform([reference_answer, generated_answer])
    cosine_sim = cosine_similarity(tfidf_matrix)
    return cosine_sim[0, 1]


# Semantic Similarity Evaluation
def evaluate_semantic_similarity(reference_answer, generated_answer):
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModel.from_pretrained('bert-base-uncased')

    def embed(text):
        # truncation=True cuts the input at the model's 512-token limit,
        # so only the beginning of a long text is compared.
        inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
        with torch.no_grad():
            # Mean-pool the token embeddings into one vector per text.
            return model(**inputs).last_hidden_state.mean(dim=1)

    reference_embedding = embed(reference_answer)
    generated_embedding = embed(generated_answer)
    similarity = torch.nn.functional.cosine_similarity(reference_embedding, generated_embedding).item()
    return similarity

# Example reference and generated answers (replace these with actual values)
reference_answer = """		
	
In order to ensure a consistent and high level of protection of natural persons and to remove the obstacles to flows of personal data within the Union, the level of protection of the rights and freedoms of natural persons with regard to the processing of such data should be equivalent in all Member States. Consistent and homogenous application of the rules for the protection of the fundamental rights and freedoms of natural persons with regard to the processing of personal data should be ensured throughout the Union. Regarding the processing of personal data for compliance with a legal obligation, for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller, Member States should be allowed to maintain or introduce national provisions to further specify the application of the rules of this Regulation. In conjunction with the general and horizontal law on data protection implementing Directive 95/46/EC, Member States have several sector-specific laws in areas that need more specific provisions. This Regulation also provides a margin of manoeuvre for Member States to specify its rules, including for the processing of special categories of personal data (‘sensitive data’). To that extent, this Regulation does not exclude Member State law that sets out the circumstances for specific processing situations, including determining more precisely the conditions under which the processing of personal data is lawful.
"""
generated_answer = """
['Consistent and homogenous application of the rules for the protection of the fundamental rights and freedoms of natural persons with regard to the processing of personal data should be ensured throughout the Union. Regarding the processing of personal data for compliance with a legal obligation, for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller, Member\xa0States should be allowed to maintain or introduce national provisions to further specify the application of the rules of this Regulation.', '(13) In order to ensure a consistent level of protection for natural persons throughout the Union and to prevent divergences hampering the free movement of personal data within the internal market, a Regulation is necessary to provide legal certainty and transparency for economic operators, including micro, small and medium-sized enterprises, and to provide natural persons in all Member\xa0States with the same level of legally enforceable rights and obligations and responsibilities for controllers and processors, to ensure consistent monitoring of the processing of personal data, and equivalent sanctions in all Member\xa0States as well as effective cooperation between the supervisory authorities of different Member\xa0States.', 'Such a difference in levels of protection is due to the existence of differences in the implementation and application of Directive 95/46/EC. 
(10) In order to ensure a consistent and high level of protection of natural persons and to remove the obstacles to flows of personal data within the Union, the level of protection of the rights and freedoms of natural persons with regard to the processing of such data should be equivalent in all Member\xa0States.', 'With regard to the processing of personal data by those competent authorities for\xa0purposes falling within scope of this Regulation, Member\xa0States should be able to maintain or introduce more specific provisions to adapt the application of the rules of this Regulation. Such provisions may determine more precisely specific requirements for the processing of personal data by those competent authorities for those other purposes, taking into account the constitutional, organisational and administrative structure of the respective Member State.', '(11) Effective protection of personal data throughout the Union requires the strengthening and setting out in detail of the rights of data subjects and the obligations of those who process and determine the processing of personal data, as well as equivalent powers for monitoring and ensuring compliance with the rules for the protection of personal data and equivalent sanctions for infringements in the Member\xa0States.', 'Differences in the level of protection of the rights and freedoms of natural persons, in particular the right to the protection of personal data, with regard to the processing of personal data in the Member\xa0States may prevent the free flow of personal data throughout the Union. 
Those differences may therefore constitute an obstacle to the pursuit of economic activities at the level of the Union, distort competition and impede authorities in the discharge of their responsibilities under Union law.', 'Those safeguards should ensure compliance with data protection requirements and the rights of the data subjects appropriate to processing within the Union, including the availability of enforceable data subject rights and of effective legal remedies, including to obtain effective administrative or judicial redress and to claim compensation, in the Union or in a third country. They should relate in particular to compliance with the general principles relating to personal data processing, the principles of data protection by design and by default.', 'In conjunction with the general and horizontal law on data protection implementing Directive 95/46/EC, Member\xa0States have several sector-specific laws in areas that need more specific provisions. This Regulation also provides a margin of manoeuvre for Member\xa0States to specify its rules, including for the processing of special categories of personal data (‘sensitive data’).', 'Therefore, this Regulation should provide for harmonised conditions for the processing of special categories of personal data concerning health, in respect of specific needs, in particular where the processing of such data is carried out for certain health-related purposes by persons subject to a legal obligation of professional secrecy. Union or Member State law should provide for specific and suitable measures so as to protect the fundamental rights and the personal data of natural persons.', 'Member States should be allowed to maintain or introduce further conditions, including limitations, with regard to the processing of genetic data, biometric data or data concerning health. 
However, this should not hamper the free flow of personal data within the Union when those conditions apply to cross-border processing of such data. (54) The processing of special categories of personal data may be necessary for reasons of public interest in the areas of public health without consent of the data subject.']
"""

# Evaluate Cosine Similarity
cosine_sim = evaluate_cosine_similarity(reference_answer, generated_answer)
print(f"Cosine Similarity: {cosine_sim:.4f}")

# Evaluate Semantic Similarity
semantic_similarity = evaluate_semantic_similarity(reference_answer, generated_answer)
print(f"Semantic Similarity: {semantic_similarity:.4f}")


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Cosine Similarity: 0.9488
Semantic Similarity: 0.9662
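The `generated_answer` above was pasted as the printed Python list, so the brackets, quotes and commas of the list formatting end up in the text that gets scored. A minimal sketch of building the string directly from the query results instead, assuming the Chroma-style result structure; `join_retrieved` is a hypothetical helper and the dict below is a made-up stand-in for `results`:

```python
def join_retrieved(results, query_index=0):
    """Concatenate the retrieved chunks for one query into a single plain string."""
    documents = results["documents"][query_index]
    # Replace non-breaking spaces so "Member\xa0States" matches "Member States".
    cleaned = (doc.replace("\xa0", " ").strip() for doc in documents)
    return " ".join(cleaned)


# Made-up stand-in for the dict returned by collection.query.
fake_results = {
    "documents": [[
        "Member\xa0States should be allowed to maintain national provisions.",
        " (13) In order to ensure a consistent level of protection ... ",
    ]]
}

generated_answer = join_retrieved(fake_results)
print(generated_answer)
```

Two caveats when reading the scores above: the reference answer is recital 10 itself, which is also among the retrieved chunks, so a high TF-IDF overlap is expected; and the BERT comparison only sees the first 512 tokens of each text.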
