In [None]:
text_corpora = """Machine learning is a subset of artificial intelligence (AI) that allows systems to learn and improve from experience without being explicitly programmed. It involves creating algorithms that can identify patterns in data and make decisions or predictions based on that data. The field has seen rapid growth in recent years, with applications ranging from speech recognition to self-driving cars.

Types of Machine Learning

Machine learning can be classified into three main types:

Supervised Learning: In supervised learning, the algorithm is trained on labeled data. The training set includes input-output pairs, where the model learns to map inputs to the correct output. A common example is image classification, where the model is trained with images that are labeled as belonging to specific categories (e.g., cat, dog, etc.). The goal is to learn a mapping that can be used to predict labels for new, unseen data.

Unsupervised Learning: Unlike supervised learning, unsupervised learning uses data that is not labeled. The goal is to identify patterns or groupings in the data without prior knowledge of the output. Clustering is a common technique in unsupervised learning, where the algorithm groups similar data points together. An example is customer segmentation, where unsupervised learning can group customers with similar purchasing behavior.

Reinforcement Learning: Reinforcement learning involves an agent that interacts with an environment and learns by receiving rewards or penalties. The agent takes actions to maximize its cumulative reward. This type of learning is often used in areas such as robotics, game playing, and autonomous vehicles.

Applications of Machine Learning

Machine learning is used in various fields and industries, from healthcare to finance. Some notable applications include:

Healthcare: Machine learning is used to predict disease outbreaks, analyze medical images, and personalize treatments. For example, machine learning models can analyze X-ray images to detect signs of pneumonia or identify tumors in MRI scans.

Finance: In finance, machine learning algorithms are used for fraud detection, risk assessment, and algorithmic trading. By analyzing historical data, machine learning models can identify patterns that might indicate fraudulent activity or predict market trends.

Natural Language Processing (NLP): NLP is a branch of machine learning that focuses on the interaction between computers and human language. It includes tasks such as sentiment analysis, language translation, and chatbots. For example, AI models like GPT-3 can generate human-like text based on a given prompt.

Challenges in Machine Learning

Despite its rapid growth and success, machine learning faces several challenges:

Data Quality: The quality of the data used to train machine learning models is crucial. Inaccurate, incomplete, or biased data can lead to poor model performance and incorrect predictions.

Overfitting: Overfitting occurs when a model performs well on training data but fails to generalize to new, unseen data. This happens when the model is too complex or learns noise in the training data.

Ethical Concerns: Machine learning algorithms can perpetuate biases in data, leading to unfair or discriminatory outcomes. Ensuring fairness and transparency in machine learning models is an ongoing challenge.

Conclusion

Machine learning has become an integral part of modern technology, with applications across many domains. While there are challenges to overcome, the potential for machine learning to drive innovation and improve decision-making is immense. As technology advances, the future of machine learning looks promising, with new techniques and applications emerging regularly."""

In [None]:
#Token Splitting

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size = 40, chunk_overlap = 10)
chunks = splitter.split_text(text_corpora)
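
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch in plain Python. It assumes whitespace-separated words as stand-in tokens (the real splitter counts tokenizer tokens), and `split_with_overlap` is a hypothetical helper, not part of LangChain.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return windows

# Assumption: words stand in for real tokenizer tokens.
words = "the quick brown fox jumps over the lazy dog again".split()
demo_chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(" ".join(chunk))
```

Each window repeats the last `chunk_overlap` tokens of the previous one, which is how some context carries across chunk boundaries.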

In [None]:
# Splitting with the sentence-transformers tokenizer and a chunk_size, to help preserve semantic meaning and context.

from langchain_text_splitters import SentenceTransformersTokenTextSplitter
from sentence_transformers import SentenceTransformer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

splitter = SentenceTransformersTokenTextSplitter(chunk_size = 50)
sentence_chunks = splitter.split_text(text_corpora)

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(HuggingFaceEmbeddings())
semantic_chunks = splitter.split_text(text_corpora)


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size = 60 , chunk_overlap = 20)
char_chunks = splitter.split_text(text_corpora)

In [72]:
import numpy as np
from transformers import pipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document
from sklearn.metrics.pairwise import cosine_similarity

# Load HuggingFace Embeddings (using SBERT)
embedding_model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"  # High-quality embeddings
hf_embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# Initialize HuggingFace text-generation pipeline (e.g., GPT-2)
generation_model = pipeline('text-generation', model='gpt2', tokenizer='gpt2')

# Example chunks obtained from different chunking methods
# Each method gives a different set of chunks
chunk_sets = {
    "token-based chunking with overlapping using TokenTextSplitter":chunks,
    "Tokenizing at a sentence level using SentenceTransformerTokenTextSplitter": sentence_chunks,
    "Semantic Chunker": semantic_chunks,
    "character-based chunking using RecursiveCharacterTextSplitter" : char_chunks
}

# Step 1: Convert chunks to Document objects for each chunk set
document_sets = {
    method: [Document(page_content=chunk) for chunk in chunks if isinstance(chunk, str) and chunk.strip()]
    for method, chunks in chunk_sets.items()
}

# Step 2: Create vector stores for each chunk set using HuggingFaceEmbeddings
vectorstores = {}
for method, documents in document_sets.items():
    vectorstore = FAISS.from_documents(documents, hf_embeddings)
    vectorstores[method] = vectorstore

# Step 3: Define function to retrieve relevant chunk based on cosine similarity for each set
def retrieve_relevant_chunks(query, top_k=1, method="fixed_length"):
    query_embedding = hf_embeddings.embed_documents(query)
    vectorstore = vectorstores[method]

    similarities = []
    for doc in vectorstore.index:
        doc_embedding = doc.embedding  # Assuming FAISS index holds embeddings
        similarity_score = cosine_similarity([query_embedding], [doc_embedding])[0][0]
        similarities.append((doc.page_content, similarity_score))

    # Sort by similarity and return top-k results
    sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
    return [doc[0] for doc in sorted_similarities[:top_k]]

# Step 4: Define function to generate text with retrieved chunks from different methods
def generate_responses(query, top_k=1):
    # Retrieve relevant chunks from each chunking method
    responses = {}
    for method in vectorstores:
        relevant_chunks = retrieve_relevant_chunks(query, top_k, method)
        context = " ".join(relevant_chunks)  # Combine retrieved chunks as context

        # Generate a response based on the context for the current chunking method
        generated_response = generation_model(context, max_length=100, num_return_sequences=1)
        responses[method] = generated_response[0]['generated_text']

    return responses

# Example usage
query = "Tell me about Machine learning."
responses = generate_responses(query)
for method, response in responses.items():
    print(f"Response from {method} chunking:")
    print(response)
    print("\n")


Device set to use cpu


TypeError: 'IndexFlatL2' object is not iterable
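
The TypeError above occurs because `vectorstore.index` is a raw FAISS index (`IndexFlatL2`), not an iterable collection of documents. As a rough sketch of the ranking that loop was aiming for, the following brute-force version scores toy vectors by cosine similarity. The vectors, texts, and the helper `top_k_by_cosine` are all hypothetical stand-ins for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, chunk_vecs, chunk_texts, top_k=1):
    """Score every chunk against the query and keep the top_k best matches."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in zip(chunk_texts, chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Assumption: 2-d toy vectors in place of real sentence embeddings.
toy_texts = ["supervised learning", "fraud detection", "overfitting"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [0.9, 0.1]

ranked = top_k_by_cosine(toy_query, toy_vecs, toy_texts, top_k=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

A vector store performs the same kind of nearest-neighbour lookup internally, so in practice it is simpler to call its search method than to iterate the index by hand.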

In [76]:
import numpy as np
from transformers import pipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document
from sklearn.metrics.pairwise import cosine_similarity

# Load HuggingFace Embeddings (using SBERT)
embedding_model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"  # High-quality embeddings
hf_embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# Initialize HuggingFace text-generation pipeline (e.g., GPT-2)
generation_model = pipeline('text-generation', model='gpt2', tokenizer='gpt2')

# Example chunks obtained from different chunking methods
# Each method gives a different set of chunks
chunk_sets = {
    "token-based chunking with overlapping using TokenTextSplitter": chunks,
    "Tokenizing at a sentence level using SentenceTransformerTokenTextSplitter": sentence_chunks,
    "Semantic Chunker": semantic_chunks,
    "character-based chunking using RecursiveCharacterTextSplitter": char_chunks
}

# Step 1: Convert chunks to Document objects for each chunk set
document_sets = {
    method: [Document(page_content=chunk) for chunk in chunks if isinstance(chunk, str) and chunk.strip()]
    for method, chunks in chunk_sets.items()
}

# Step 2: Create vector stores for each chunk set using HuggingFaceEmbeddings
vectorstores = {}
for method, documents in document_sets.items():
    vectorstore = FAISS.from_documents(documents, hf_embeddings)
    vectorstores[method] = vectorstore

# Step 3: Define function to retrieve the most relevant chunks from a given chunk set
# (the default method must be a key of `vectorstores`)
def retrieve_relevant_chunks(query, top_k=1, method="Semantic Chunker"):
    vectorstore = vectorstores[method]

    # FAISS embeds the query itself and returns the top_k nearest documents
    results = vectorstore.similarity_search(query, k=top_k)

    # Return only the chunk text; similarity_search does not return scores
    return [doc.page_content for doc in results]

# Step 4: Define function to generate text with retrieved chunks from different methods
def generate_responses(query, top_k=1):
    # Retrieve relevant chunks from each chunking method
    responses = {}
    for method in vectorstores:
        relevant_chunks = retrieve_relevant_chunks(query, top_k, method)
        context = " ".join(relevant_chunks)  # Combine retrieved chunks as context

        # Generate a response for the current chunking method.
        # Note: only the retrieved context is used as the prompt; the query itself is not included.
        generated_response = generation_model(context, max_length=500, num_return_sequences=1)
        responses[method] = generated_response[0]['generated_text']

    return responses

# Example usage
query = "Tell me about the types of Machine Learning."
responses = generate_responses(query)
print('\n\n')
for method, response in responses.items():
    print('-------------------------------------------------------------------')
    print(f"\033[1mResponse from {method} chunking:\033[0m")
    print(response)
    print('---------------------------------------------------------------------------------------------------')
    print("\n\n")


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.





-------------------------------------------------------------------
Response from token-based chunking with overlapping using TokenTextSplitter chunking:
, the future of machine learning looks promising, with new techniques and applications emerging regularly.

But the future is different in many areas, particularly in data analysis. Most of the great minds who have defined data science today have gone through research and technical education before starting to use machine learning and machine learning for data analysis. Here's what I know for sure: If you want to get started, look at the following four resources (PDF, 4.3MB). This is a good one to start:

The Future of Machine Learning by Lawrence Twomey, CTO of Hadoop and author of The Future of Machine Learning, is now available from Amazon. Don.

This is The Future: A Data Science History by George C. Marshall, co-founder of IBM and author of the book, The Data Revolution: A Technical Introduction, which is available fro

In [79]:
import numpy as np

def evaluate_coherence(chunk_sets):
    coherence_scores = {}
    for method, chunks in chunk_sets.items():
        similarities = []
        for i in range(len(chunks) - 1):
            chunk1 = hf_embeddings.embed_documents([chunks[i]])[0]
            chunk2 = hf_embeddings.embed_documents([chunks[i + 1]])[0]
            similarity = cosine_similarity([chunk1], [chunk2])[0][0]
            similarities.append(similarity)

        avg_similarity = np.mean(similarities)
        coherence_scores[method] = avg_similarity

    return coherence_scores


coherence_results = evaluate_coherence(chunk_sets)

print("Coherence Scores:")
for method, score in coherence_results.items():
    print(f"{method}: {np.round(score , 4)}\n")

Coherence Scores:
token-based chunking with overlapping using TokenTextSplitter: 0.5912

Tokenizing at a sentence level using SentenceTransformerTokenTextSplitter: 0.6349

Semantic Chunker: 0.5939

character-based chunking using RecursiveCharacterTextSplitter: 0.4677
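
The coherence score above is the mean cosine similarity between each chunk and the next one. A minimal sketch of that calculation follows, assuming toy 2-d vectors in place of real embeddings; `mean_adjacent_cosine` is a hypothetical helper. It normalises every vector once, so each chunk only needs to be embedded a single time.

```python
import math

def mean_adjacent_cosine(vectors):
    """Average cosine similarity between each vector and its successor."""
    if len(vectors) < 2:
        return float("nan")  # no adjacent pairs to compare
    # Normalise once so each pairwise similarity is a plain dot product.
    # Assumption: no vector is all zeros.
    unit = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])
    sims = [
        sum(a * b for a, b in zip(unit[i], unit[i + 1]))
        for i in range(len(unit) - 1)
    ]
    return sum(sims) / len(sims)

# Assumption: toy vectors standing in for chunk embeddings.
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
score = mean_adjacent_cosine(toy_embeddings)
print(round(score, 4))
```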



In [89]:
reference_response = """
Machine learning, a subset of AI, has three main types:

Supervised Learning: The algorithm learns from labeled data to map inputs to correct outputs, like in image classification.

Unsupervised Learning: Works with unlabeled data to find patterns or groupings, such as customer segmentation.

Reinforcement Learning: An agent learns by interacting with an environment and receiving rewards or penalties, used in robotics and game playing.

These types are key to applications in fields like image recognition, segmentation, and autonomous driving.
"""

def evaluate_relevance(chunk_sets, query):
    # Note: `query` is not used; relevance is measured against `reference_response`.
    # Embed the reference once, since it is the same for every chunking method.
    reference_embedding = hf_embeddings.embed_documents([reference_response])[0]

    relevance_scores = {}
    for method, chunks in chunk_sets.items():
        chunk_relevances = []
        for chunk in chunks:
            chunk_embedding = hf_embeddings.embed_documents([chunk])[0]
            similarity = cosine_similarity([reference_embedding], [chunk_embedding])[0][0]
            chunk_relevances.append(similarity)

        # Average similarity of all chunks in this set to the reference
        relevance_scores[method] = np.mean(chunk_relevances)

    return relevance_scores

relevance_results = evaluate_relevance(chunk_sets, query)

print("Relevance Scores:\n")
for method, score in relevance_results.items():
    print(f"{method}: {np.round(score , 4)}\n")

Relevance Scores:

token-based chunking with overlapping using TokenTextSplitter: 0.4964

Tokenizing at a sentence level using SentenceTransformerTokenTextSplitter: 0.6529

Semantic Chunker: 0.5678

character-based chunking using RecursiveCharacterTextSplitter: 0.4896



In [84]:
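# Note: this computes the same adjacent-chunk cosine similarity as evaluate_coherence,
# so the scores it reports are identical to the coherence scores.
def evaluate_context_preservation(chunk_sets):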
    context_scores = {}
    for method, chunks in chunk_sets.items():
        similarities = []
        for i in range(len(chunks) - 1):
            chunk1 = hf_embeddings.embed_documents([chunks[i]])[0]
            chunk2 = hf_embeddings.embed_documents([chunks[i + 1]])[0]
            similarity = cosine_similarity([chunk1], [chunk2])[0][0]
            similarities.append(similarity)

        avg_context = np.mean(similarities)
        context_scores[method] = avg_context

    return context_scores


context_results = evaluate_context_preservation(chunk_sets)

print("Context Preservation Scores:\n")
for method, score in context_results.items():
    print(f"{method}: {np.round(score,4)}\n")

Context Preservation Scores:

token-based chunking with overlapping using TokenTextSplitter: 0.5912

Tokenizing at a sentence level using SentenceTransformerTokenTextSplitter: 0.6349

Semantic Chunker: 0.5939

character-based chunking using RecursiveCharacterTextSplitter: 0.4677



In [87]:
import textstat

def evaluate_readability(chunk_sets):
    readability_scores = {}
    for method, chunks in chunk_sets.items():
        readability_values = []
        for chunk in chunks:
            score = textstat.flesch_reading_ease(chunk)
            readability_values.append(score)

        avg_readability = np.mean(readability_values)
        readability_scores[method] = avg_readability

    return readability_scores

readability_results = evaluate_readability(chunk_sets)

print("Readability Scores:\n")
for method, score in readability_results.items():
    print(f"{method}: {np.round(score,4)}\n")

Readability Scores:

token-based chunking with overlapping using TokenTextSplitter: 43.4129

Tokenizing at a sentence level using SentenceTransformerTokenTextSplitter: 41.075

Semantic Chunker: 34.11

character-based chunking using RecursiveCharacterTextSplitter: 41.025
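
For reference, Flesch Reading Ease is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), where higher values mean easier text. The sketch below applies that formula with a deliberately crude syllable counter (groups of consecutive vowels). That heuristic is an assumption, so its values will not match `textstat` exactly, and `flesch_reading_ease_sketch` is a hypothetical name.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease_sketch(text):
    """Apply the Flesch Reading Ease formula using simple regex-based counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return float("nan")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

simple = flesch_reading_ease_sketch("The cat sat on the mat.")
dense = flesch_reading_ease_sketch(
    "Reinforcement learning involves cumulative reward maximization."
)
print(simple > dense)
```

Because the score depends on sentence and word counts, chunks that are cut mid-sentence can score differently from the same text split at sentence boundaries.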

