Analytics & Data Science

Universidad de Antioquia - ML2

Feb 2024

Melissa Ortega Alzate CC.1036964792

# Libraries

In [432]:
# Natural Language Toolkit
import nltk         
from nltk.tokenize import sent_tokenize         # Tokenization is used to divide the text into words, sentences or other units.

# Calculate distances
from sklearn.metrics.pairwise import cosine_similarity

# Open Source library to build embeddings
#! pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer



# Functions

In [433]:
# Define function to read the text file
def read_text_from_file(file_name):
    """
    Read text data from a file.
    Args:
        file_name (str): The name of the file to read.
    Returns:
        str: The text read from the file.
    """
    with open(file_name, 'r', encoding='iso-8859-1') as file:
        text = file.read()
    return text

In [434]:
# Define function to split the text into fragments of a desired number of sentences
def split_text(text, sentences_per_fragment):
    """
    Split text into fragments containing a fixed number of sentences.

    Sentences are approximated by splitting the text on periods ('.').

    Args:
        text (str): The text to be split into fragments.
        sentences_per_fragment (int): The number of sentences per fragment.

    Returns:
        list of str: List of fragments, each containing up to the specified number of sentences.
    """
    lines = text.split('.')
    fragments = []
    
    for i in range(0, len(lines), sentences_per_fragment):
        fragment = '.'.join(lines[i: i+sentences_per_fragment])
        fragments.append(fragment)
    
    return fragments
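
`split_text` splits on every period, so the fragments lose their final punctuation, keep leading whitespace, and can break on periods that do not end a sentence. The following is a minimal sketch of an alternative, assuming a regular-expression split is acceptable; the name `split_sentences` is hypothetical and not part of the original notebook. It still mis-splits abbreviations such as "e.g.", which a dedicated tokenizer like NLTK's `sent_tokenize` (imported above) is designed to handle.

```python
import re

def split_sentences(text, sentences_per_fragment=1):
    """Split text into fragments of N sentences, keeping end punctuation."""
    # Split after '.', '!' or '?' when the mark is followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text)]
    # Drop empty strings left over from leading or trailing whitespace.
    sentences = [s for s in sentences if s]
    # Group consecutive sentences into fragments of the requested size.
    return [
        ' '.join(sentences[i:i + sentences_per_fragment])
        for i in range(0, len(sentences), sentences_per_fragment)
    ]

sample = "What is machine learning? It finds patterns in data. It then predicts outcomes."
print(split_sentences(sample, 2))
```

With two sentences per fragment, the sample yields one fragment holding the first two sentences and a second fragment holding the last one.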

# 1. In your own words, describe what vector embeddings are and what they are useful for.

Embeddings are numerical representations that capture not only the words of a language but also their context and semantics. They are fixed-dimensional vectors built from text, audio, images, or video so that ML models can work with that data. An embedding therefore places each input in a high-dimensional space where position reflects meaning: inputs with similar context and semantics lie close together. Because embeddings are vectors, they can be manipulated with standard linear algebra techniques.

Embeddings are useful in many areas. Recommendation systems (YouTube, Netflix), semantic search engines (YouTube), translators, ChatGPT, text classifiers and, more generally, language-understanding AI models are built on them.
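
The point that embeddings can be handled with ordinary linear algebra can be illustrated with a minimal sketch. The vectors below are made-up three-dimensional values, not the output of any model, and the helper names (`dot`, `norm`, `cosine_sim`, `euclidean_distance`) are illustrative only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_sim(u, v):
    """Cosine of the angle between u and v (1 = same direction)."""
    return dot(u, v) / (norm(u) * norm(v))

def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy "embeddings" (assumed values for illustration).
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice the length
c = [0.0, 0.0, 3.0]   # orthogonal to a

print("cosine(a, b):", cosine_sim(a, b))
print("cosine(a, c):", cosine_sim(a, c))
print("euclidean(a, b):", euclidean_distance(a, b))
```

Here `b` points in the same direction as `a`, so their cosine similarity is approximately 1 even though their Euclidean distance is not zero; `a` and `c` are orthogonal, so their cosine similarity is 0.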

# 2. What do you think is the best distance criterion to estimate how far two embeddings ...

# 3. Let us build a Q&A (question answering) system!

### a. Pick a text

The text was taken and edited from: https://aws.amazon.com/what-is/machine-learning/?nc1=h_ls

In [435]:
# Load the text using the predefined function
text = read_text_from_file('Lab3.txt')

# Text description
print("The length of the text is", len(text), "characters\n")
print("The text up to character 342 is:\n", text[:342])

FileNotFoundError: [Errno 2] No such file or directory: 'Lab3.txt'
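
The traceback above indicates that `Lab3.txt` was not in the working directory when the cell ran. A minimal sketch of a guard, assuming it is acceptable to return `None` instead of raising; the function name `read_text_if_present` is hypothetical, and the encoding default mirrors `read_text_from_file`.

```python
from pathlib import Path

def read_text_if_present(file_name, encoding='iso-8859-1'):
    """Return the file's text, or None when the file does not exist."""
    path = Path(file_name)
    if not path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot.
        print(f"File not found: {path.resolve()}")
        return None
    return path.read_text(encoding=encoding)

text = read_text_if_present('Lab3.txt')
if text is not None:
    print("The length of the text is", len(text), "characters")
```

Printing the resolved path makes it clear which directory the notebook kernel is actually reading from.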

### b. Split that text into meaningful chunks/pieces.

In [None]:
# Split the text into fragments
fragments = split_text(text, 1)

# Print the total number of fragments
print(f"\nThe text was divided into {len(fragments)} fragments")

# Print the fragments
for i, fragment in enumerate(fragments):
    print(f"Fragment {i+1}: {fragment}\n")



The text was divided into 139 fragments
Fragment 1: What is machine learning?
Machine learning is the science of developing algorithms and statistical models that computer systems use to perform tasks without explicit instructions, relying on patterns and inference instead

Fragment 2:  Computer systems use machine learning algorithms to process large quantities of historical data and identify data patterns

Fragment 3:  This allows them to predict outcomes more accurately from a given input data set

Fragment 4:  For example, data scientists could train a medical application to diagnose cancer from x-ray images by storing millions of scanned images and the corresponding diagnoses

Fragment 5: 
Why is machine learning important?
Machine learning helps businesses by driving growth, unlocking new revenue streams, and solving challenging problems

Fragment 6:  Data is the critical driving force behind business decision-making but traditionally, companies have used data from various sourc

### c. Implement the embedding generation logic

In [None]:
# Instantiate the pretrained model
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Generate embeddings for each fragment
embeddings = model.encode(fragments)

# Sample data
print("Number of embeddings:", len(embeddings))
print(f"\nEmbedding for fragment 1:\n {embeddings[0]}")

Number of embeddings: 139

Embedding for fragment 1:
 [ 2.38300525e-02 -4.64637466e-02  1.68310502e-03 -7.91456923e-02
 -7.04026744e-02 -4.56668362e-02 -4.79449369e-02 -3.39060165e-02
 -1.30035058e-01  9.43814777e-03  1.01110913e-01  1.86520144e-01
 -5.41355535e-02  1.79512560e-01  7.76648968e-02 -2.26610407e-01
  1.84813030e-02 -7.92898536e-02  7.46685639e-02  6.74832389e-02
 -5.94234839e-03 -6.97766840e-02  9.98755731e-03  7.57842697e-03
 -1.73620563e-02 -1.07027419e-01 -7.51963956e-03 -1.49777204e-01
 -7.76965544e-02  1.25768930e-01 -3.32734995e-02 -1.88988045e-01
  1.09856255e-01  8.68013278e-02  5.63275442e-03 -2.92040594e-02
 -4.13225964e-03  1.08653918e-01  9.55615193e-02 -1.62098229e-01
  2.59939730e-01 -3.77659462e-02  1.28140509e-01 -2.98630744e-02
  3.84254418e-02  2.73570895e-01  9.19334516e-02 -8.78350884e-02
 -6.15724660e-02  1.32574901e-01 -2.61524366e-03  5.81351444e-02
 -2.27538362e-01  5.67931496e-02 -1.23464674e-01 -6.72227293e-02
 -6.74599931e-02  1.10295452e-01 -3.

In [None]:
# Print embeddings list characteristics
print("Variable type of embeddings:", type(embeddings))
print("Dimensions of embeddings:", embeddings.shape)

Variable type of embeddings: <class 'numpy.ndarray'>
Dimensions of embeddings: (139, 768)


- Each row is the embedding of one fragment, and each column is one dimension of the embedding space, so the 139 fragments produce a 139 × 768 matrix.

- The pre-trained model is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

### d. For every question, return a sorted list of the N pieces that relate the most to the question.

In [None]:
def get_answers(user_question, text_fragments, n=3):
    """
    Get the most related text fragments to a user question.

    Args:
        user_question (str): The question asked by the user.
        text_fragments (list): List of text fragments to compare with the user question.
        n (int, optional): Number of most related fragments to return. Defaults to 3.

    Returns:
        list of tuple: The top N (fragment, similarity_score) pairs, sorted by
            descending similarity.
    """
    # Instantiate the pretrained model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Generate embedding for user question
    embedding_question = model.encode(user_question)

    # Initialize a list to store the similarity between the question and each fragment
    similarities = []

    # Generate embeddings for each text fragment and calculate cosine similarity
    for fragment in text_fragments:
        embedding_fragment = model.encode(fragment)
        similarity = cosine_similarity([embedding_question], [embedding_fragment], dense_output=True)[0][0]
        similarities.append((fragment, similarity))

    # Sort text fragments based on their similarity with the user question
    sorted_fragments = sorted(similarities, key=lambda x: x[1], reverse=True)

    # Return the top N most relevant fragments
    return sorted_fragments[:n]
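
`get_answers` loads the model and encodes every fragment again on each call. A minimal sketch of the ranking step on its own, assuming the embeddings have already been computed once (for example with `model.encode(fragments)` and `model.encode(user_question)` using the same model for both); the names `cosine` and `top_n_fragments` and the toy vectors are illustrative, not part of the original notebook.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_fragments(question_vec, fragment_vecs, fragments, n=3):
    """Return the n (fragment, similarity) pairs most similar to the question."""
    scored = [
        (frag, cosine(question_vec, vec))
        for frag, vec in zip(fragments, fragment_vecs)
    ]
    # nlargest returns the pairs sorted by descending similarity.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Toy 2-dimensional vectors standing in for real model output (assumed values).
toy_fragments = ["about retail", "about finance", "about media"]
toy_fragment_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_question_vec = [1.0, 0.1]

for frag, score in top_n_fragments(toy_question_vec, toy_fragment_vecs, toy_fragments, n=2):
    print(f"{frag}: {score:.3f}")
```

Separating ranking from encoding means the fragments only need to be embedded once, however many questions are asked.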


In [None]:
# Questions to run against the text fragments

questions = [
    "How does machine learning help businesses?",
    "How does machine learning improve financial services?",
    "What role does machine learning play in retail?",
    "How is machine learning used in media and entertainment?",
    "What are the strengths of supervised learning?",
    "How does unsupervised machine learning differ from supervised learning?",
    "What are some applications of machine learning in manufacturing?",
]

# Number of fragments
n = 2

In [None]:
# Iterating over the list of questions
for user_question in questions:
    related_fragments = get_answers(user_question, fragments, n=n)

    # Results
    print(f"============ Question: {user_question} =============")
    for fragment, similarity in related_fragments:
        print(f"{fragment}\n=> Similarity: {similarity}\n")



Why is machine learning important?
Machine learning helps businesses by driving growth, unlocking new revenue streams, and solving challenging problems
=> Similarity: 0.8577966094017029

 Machine learning technology also helps companies improve logistical solutions, including assets, supply chain, and inventory management
=> Similarity: 0.7826176881790161



Financial services
Financial machine learning projects improve risk analytics and regulation
=> Similarity: 0.767458975315094


Why is machine learning important?
Machine learning helps businesses by driving growth, unlocking new revenue streams, and solving challenging problems
=> Similarity: 0.7541710138320923


Retail
Retail can use machine learning to improve customer service, stock management, upselling and cross-channel marketing
=> Similarity: 0.8252942562103271


Why is machine learning important?
Machine learning helps businesses by driving growth, unlocking new revenue streams, and solving challenging problems
=> Similar

# Bibliography



- Text: AWS, "What is Machine Learning?" (https://aws.amazon.com/what-is/machine-learning/?nc1=h_ls)
- NLTK tokenization API: NLTK Tokenization Documentation
- Pretrained model: Hugging Face, sentence-transformers/all-MiniLM-L6-v2
- Cosine similarity: Scikit-learn, Cosine Similarity documentation