# Advanced Embedding Techniques for Textual Data

In [1]:
text = """

**The History and Future of Artificial Intelligence**  

### 1. Introduction  
Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect of modern technology. Once confined to academic discussions, AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI evolve, and where is it headed?  

### 2. The Early Days of AI  
The idea of artificial intelligence dates back to ancient myths, where mechanical beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially marked the birth of AI as a field. Early AI models relied on rule-based systems, which worked well for structured tasks but struggled with uncertainty and complexity.  

### 3. The Rise of Machine Learning  
By the 1990s, a shift occurred—machine learning (ML) began to outperform traditional rule-based AI. Instead of manually encoding rules, ML models learned patterns from data. Breakthroughs in deep learning, powered by neural networks, revolutionized fields like image recognition, natural language processing (NLP), and game-playing AI. Notable achievements include IBM’s Deep Blue defeating world chess champion Garry Kasparov in 1997 and Google DeepMind’s AlphaGo outplaying human Go masters.  

### 4. Modern AI Applications  
AI now permeates various industries:  
- **Healthcare**: AI assists in diagnosing diseases, predicting patient outcomes, and personalizing treatments.  
- **Finance**: Fraud detection and algorithmic trading rely heavily on AI-driven analytics.  
- **Transportation**: Autonomous vehicles use AI to navigate complex environments.  
- **Entertainment**: Recommendation systems, such as those used by Netflix and Spotify, tailor content based on user preferences.  

### 5. Challenges and Ethical Considerations  
Despite its potential, AI faces several challenges:  
- **Bias and Fairness**: AI models can inherit biases from training data, leading to unfair outcomes.  
- **Privacy Concerns**: The collection of massive datasets raises questions about user privacy and data security.  
- **Job Displacement**: Automation threatens traditional job roles, prompting discussions on reskilling the workforce.  
- **AI Safety**: As AI systems become more autonomous, ensuring they align with human values is critical.  

### 6. The Future of AI  
Looking ahead, AI research focuses on:  
- **Explainable AI (XAI)**: Making AI decisions transparent and interpretable.  
- **General AI**: Developing systems that can reason and learn across multiple domains.  
- **AI and Creativity**: From generating art to composing music, AI is expanding its creative potential.  
- **Human-AI Collaboration**: Future AI systems will likely augment human intelligence rather than replace it.  

### 7. Conclusion  
Artificial intelligence is not just a technological evolution—it is a societal transformation. While challenges remain, AI’s potential to enhance lives is undeniable. As research progresses, the key will be balancing innovation with ethical responsibility, ensuring AI serves humanity's best interests.  

"""

## Chunking (see chunking.ipynb)

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_chunking(text, chunk_size=200, overlap=50):
    """
    Splits text into overlapping chunks using LangChain's RecursiveCharacterTextSplitter.
    
    Parameters:
        text (str): The input text to be chunked.
        chunk_size (int): Maximum length of each chunk, in characters.
        overlap (int): Maximum number of characters shared between consecutive chunks.

    Returns:
        List[str]: A list of recursively split text chunks.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len
    )
    
    return splitter.split_text(text)


chunks = recursive_chunking(text, chunk_size=100, overlap=20)

In [3]:
chunks

['**The History and Future of Artificial Intelligence**',
 '### 1. Introduction',
 'Artificial Intelligence (AI) has transformed from a theoretical concept into a fundamental aspect',
 'fundamental aspect of modern technology. Once confined to academic discussions, AI now powers',
 'AI now powers everyday applications, from voice assistants to medical diagnostics. But how did AI',
 'But how did AI evolve, and where is it headed?',
 '### 2. The Early Days of AI',
 'The idea of artificial intelligence dates back to ancient myths, where mechanical beings were',
 'beings were depicted as sentient. However, modern AI research formally began in the 1950s. Alan',
 'in the 1950s. Alan Turing, a pioneer in computing, introduced the concept of machine intelligence',
 'intelligence through his famous "Turing Test." In 1956, the Dartmouth Conference officially marked',
 'officially marked the birth of AI as a field. Early AI models relied on rule-based systems, which',
 'systems, which worked well
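
The overlap is visible in the output above, where a phrase such as 'fundamental aspect' ends one chunk and starts the next. The sketch below illustrates the same idea with a plain fixed-size sliding window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries separators such as paragraph breaks, line breaks, and spaces); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Cut text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample input, chosen so the overlap is easy to see.
sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_window_chunks(sample, chunk_size=10, overlap=3)
print(demo)
```

With `chunk_size=10` and `overlap=3`, every chunk begins with the last three characters of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed.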

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
import numpy as np



## TF-IDF Embedding

🔹 TF-IDF (Most Used in Classic NLP & Search)

📌 Why?

- **Simple & Efficient**: Does not require deep learning.
- **Interpretable**: Words with high importance are easy to extract.
- **Best for**: Keyword-based search, document classification, text ranking.

📌 When to Use?

- ✅ Traditional search engines
- ✅ Keyword extraction
- ✅ Basic text ranking tasks
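
To make the "interpretable" point concrete, here is a minimal standard-library sketch of the weighting itself. It is intended to mirror `TfidfVectorizer`'s documented defaults (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization), but treat any numerical agreement with scikit-learn as approximate. The helper names and the three toy documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rare terms get larger values.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


# Hypothetical toy corpus.
toy_docs = ["ai powers search", "ai powers chess", "music and art"]
weights = tfidf_weights(toy_docs)
# Terms of the first document, highest weight first.
top = sorted(weights[0].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

In the first toy document, "search" appears in no other document, so it ranks above "ai" and "powers", which are shared with the second document. Sorting a row by weight like this is the basis of simple keyword extraction.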

In [6]:
def compute_tfidf_embeddings(text_corpus):
    """
    Computes TF-IDF embeddings for a given corpus.
    
    Parameters:
        text_corpus (list of str): List of text documents.
    
    Returns:
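        np.ndarray: Dense TF-IDF matrix of shape (n_documents, n_terms).
    """
    vectorizer = TfidfVectorizer()
    # fit_transform returns a SciPy sparse matrix; most entries are zero.
    tfidf_matrix = vectorizer.fit_transform(text_corpus)
    # Densify for easy inspection; keep it sparse for large corpora.
    return tfidf_matrix.toarray()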


tfidf_embeddings = compute_tfidf_embeddings(chunks)


print("TF-IDF Embeddings:", tfidf_embeddings)

TF-IDF Embeddings: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## Word2Vec Embedding

🔹 Word2Vec (Most Used for Word-Level Semantics)

📌 Why?

- **Word-level semantic similarity**: Vector arithmetic captures relations (e.g., king - man + woman ≈ queen).
- **Trainable from scratch**: Lightweight enough to train on your own corpus when no pretrained model fits the domain, though very small corpora yield noisy vectors.
- **Best for**: Word-level tasks, analogy reasoning, and knowledge representation.

🔥 Most Used Models:

- Google’s Pretrained Word2Vec (word2vec-google-news-300)
- FastText (for subword-based embeddings)

📌 When to Use?

- ✅ Word-level similarity tasks
- ✅ Linguistic feature extraction
- ✅ Small-scale text classification
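
The analogy arithmetic mentioned above can be sketched with hand-built vectors. The two-dimensional `toy_vectors` below are invented so that one axis loosely stands for "royalty" and the other for "gender"; they are not learned embeddings, and real Word2Vec vectors only approximate this behavior.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical hand-made vectors: [royalty, gender].
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With a trained gensim model, the equivalent query is usually written with `most_similar(positive=[...], negative=[...])` on the model's `wv` attribute, which likewise excludes the input words from the results.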

In [9]:
def compute_word2vec_embeddings(text_corpus, vector_size=100, window=5, min_count=1):
    """
    Computes Word2Vec embeddings for a given corpus.
    
    Parameters:
        text_corpus (list of str): List of sentences.
        vector_size (int): Dimensionality of the word vectors.
        window (int): Maximum distance between current and predicted word.
        min_count (int): Ignores words with total frequency lower than this.
    
    Returns:
        gensim Word2Vec model.
    """
    # Whitespace tokenization: case-sensitive, and punctuation stays attached to words.
    tokenized_sentences = [sentence.split() for sentence in text_corpus]
    model = Word2Vec(sentences=tokenized_sentences, vector_size=vector_size, window=window, min_count=min_count)
    return model

word2vec_model = compute_word2vec_embeddings(chunks)

print("Word2Vec Embeddings:", word2vec_model.wv['Artificial'] if 'Artificial' in word2vec_model.wv else "Word not in vocabulary")


Word2Vec Embeddings: [-7.0266607e-03 -2.4170219e-03 -7.9815732e-03  7.5089652e-03
  6.1785416e-03  5.1360563e-03  8.4277727e-03 -5.6208338e-04
 -9.3858130e-03  9.1005322e-03 -4.9478901e-03  7.7800592e-03
  5.4980856e-03 -1.1094318e-03 -7.6320781e-03 -1.5357598e-03
  6.2994468e-03 -7.0587639e-03  1.3619760e-03 -8.0927676e-03
  8.7600825e-03 -2.8296749e-03  9.4858957e-03 -5.7215313e-03
 -9.7505888e-03 -8.5913725e-03 -4.1396548e-03  4.6929359e-03
 -3.3762614e-04  9.2479391e-03  3.1300674e-03  3.7870589e-03
  2.9500066e-03  8.1276819e-03 -2.4023876e-03  7.5090188e-03
 -9.5718857e-03  2.8705851e-03 -7.1649021e-04  3.2881487e-04
  6.8896785e-03 -2.9296693e-03 -2.4368847e-03 -8.8146189e-05
 -4.2771694e-04 -3.5484161e-03  6.1927396e-03 -6.5928726e-03
  7.9680011e-03 -8.5454900e-05  2.6496109e-03  3.1652716e-03
 -2.5351156e-04  1.6863156e-03 -3.1917440e-03  4.8248223e-03
  2.3161176e-04 -3.2821479e-03 -8.8147074e-03 -9.9258935e-03
  3.3968233e-04 -5.7347235e-03 -1.0923446e-03 -4.3193013e-03
 -8
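
The model above yields one vector per word, whereas the TF-IDF and sentence-transformer cells produce one vector per chunk. A common, simple way to bridge that gap is to average the vectors of the words in a chunk. The sketch below assumes this mean-pooling approach; `mean_pool` and `toy_word_vectors` are hypothetical names, and the toy values are not taken from the trained model.

```python
def mean_pool(tokens, word_vectors):
    """Average the vectors of the tokens found in word_vectors.

    Returns None when no token is in the vocabulary.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical two-dimensional word vectors.
toy_word_vectors = {"deep": [1.0, 0.0], "learning": [0.0, 1.0], "ai": [1.0, 1.0]}
# "rocks" is out of vocabulary and is skipped.
chunk_vector = mean_pool("deep learning rocks".split(), toy_word_vectors)
print(chunk_vector)
```

Since `word2vec_model.wv` supports both `in` and `[]` lookups (as the cell above uses), it could probably be passed in place of `toy_word_vectors`, with tokens produced by the same `split()` call used for training. Averaging discards word order, so it is a baseline rather than a substitute for sentence-level models.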

## BERT Embedding

🔹 BERT-Based Embeddings (Most Used in NLP & AI Applications)

📌 Why?

- **Contextualized**: Unlike TF-IDF and Word2Vec, BERT-based embeddings consider word meaning based on context.
- **Pretrained Models**: Hugging Face's Sentence Transformers (all-MiniLM, mpnet, distilbert) are widely used.
- **Best for**: Semantic search, text similarity, retrieval-augmented generation (RAG), and deep NLP tasks.

🔥 Most Used Models:

- Sentence-BERT (SBERT): sentence-transformers/all-MiniLM-L6-v2
- OpenAI’s Embeddings: text-embedding-ada-002 (a hosted API alternative, for LLM applications)

📌 When to Use?

- ✅ Semantic search (e.g., with a vector index such as FAISS or Pinecone)
- ✅ Information retrieval (RAG)
- ✅ Text similarity (e.g., recommendation systems)
- ✅ Conversational AI / Chatbots
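
Semantic search over embeddings usually comes down to ranking stored vectors by their similarity to a query vector. The following is a minimal brute-force sketch using cosine similarity; the three-dimensional `toy_doc_vecs` and `toy_query` are invented stand-ins for real sentence embeddings, and `rank_by_similarity` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; real ones have hundreds of dimensions.
toy_doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.3, 0.1],
]
toy_query = [1.0, 0.2, 0.0]

results = rank_by_similarity(toy_query, toy_doc_vecs)
print(results)
```

In practice the query would be embedded with the same model as the documents. A brute-force scan like this is fine for a few thousand chunks; vector indexes such as FAISS exist to make the same lookup fast at larger scale.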

In [8]:
def compute_bert_embeddings(text_corpus, model_name='all-MiniLM-L6-v2'):
    """
    Computes BERT embeddings for a given corpus using SentenceTransformers.
    
    Parameters:
        text_corpus (list of str): List of sentences.
        model_name (str): Pretrained model name from sentence-transformers.
    
    Returns:
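        np.ndarray: Sentence embeddings of shape (n_texts, embedding_dim).
    """
    model = SentenceTransformer(model_name)
    # encode() returns a NumPy array by default, one row per input text.
    embeddings = model.encode(text_corpus)
    return np.array(embeddings)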

bert_embeddings = compute_bert_embeddings(chunks)

print("BERT Embeddings:", bert_embeddings)

BERT Embeddings: [[-0.07689003  0.03038044  0.02345188 ...  0.06512355  0.06100931
  -0.01098409]
 [-0.03123283  0.03661765  0.04304357 ...  0.08067703  0.03505589
   0.01680759]
 [-0.04303879 -0.00884608 -0.03495566 ...  0.09574018  0.10354136
  -0.02250312]
 ...
 [-0.05084278  0.02494333 -0.01725427 ...  0.03963003  0.00828772
  -0.08778213]
 [-0.08628564  0.04966684 -0.04355802 ... -0.03549082  0.09674251
  -0.07057389]
 [ 0.00073055  0.03807557  0.02349321 ...  0.07508615 -0.00896475
  -0.07838655]]
