## 2.5 Hugging Face Pipeline for Quick Prototyping

### Uses:
* Transformers
* PyTorch

### Topics Covered:
* Pipelines
* Pipelines in Hugging Face
* Pipeline Tasks
* Example Pipelines Including:
    * Sentiment Analysis Pipeline
    * Question Answering Pipeline
    * Translation Pipeline
    * Text Generation Pipeline
    * Conversation Pipeline
        * One using model and tokenizer import
        * One using pipeline import

### Syntax Segments Summary:
(See notebook 2.5 for an in-depth conversation model)

In [None]:
from transformers import pipeline
# the pipeline function allows for easy use of pre-trained models for NLP tasks

from transformers import AutoModelForCausalLM, AutoTokenizer
# AutoModelForCausalLM class automatically loads a pre-trained language model for causal language modeling
# AutoTokenizer class helps tokenize input text

import torch
# Library for manipulating tensors

from transformers import Conversation
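# 'Conversation' class represents the conversation history (user inputs and model responses) in a form the conversational pipeline understands
# note: Conversation and the "conversational" task exist in older transformers releases; newer releases removed them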

In [None]:
model = pipeline(task="sentiment-analysis")
# downloads the default model for sentiment analysis
# without specifying a model, the default model for the specified task is downloaded

model = pipeline(model = "distilbert-base-uncased-finetuned-sst-2-english")
# downloads a specific model trained for sentiment analysis

model = pipeline(task="translation_en_to_fr")
# the translation_en_to_fr task downloads the default model for translating English to French

model = pipeline(model="gpt2")
# the gpt2 model used for text generation

model = pipeline("conversational", "microsoft/DialoGPT-medium", pad_token_id=50256)
# a specific model trained for conversation between it and a user
# pad_token_id=50256 reuses GPT-2's end-of-text token id as the padding token

In [None]:
model = pipeline(task="sentiment-analysis")
# downloads the default model for sentiment analysis
# without specifying a model, the default model for the specified task is downloaded

model = pipeline(model = "distilbert-base-uncased-finetuned-sst-2-english")
# downloads a specific model trained for sentiment analysis

comp = model(text)
# runs the sentiment-analysis pipeline on the given text (a string or a list of strings)
# returns a list of dictionaries, each with a 'label' and a 'score'

In [None]:
model = pipeline(task="question-answering")
# downloads the default model for question answering

answer = model(
    question = "What date is Christmas?", 
    # pass a question for the model to answer
    context = "Christmas is on Monday, the 25th of December, 2023"
    # provide context for the model to search for the answer to the question in
)
# answer is a dictionary with the keys:
# 'score' -> the model's confidence in the answer (between 0 and 1)
# 'start' -> the character index inside the context string where the answer begins
# 'end' -> the character index inside the context string where the answer ends
# 'answer' -> the piece of text that answers the question
    # extracted from the context, not freely generated by the model
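
The relationship between 'start', 'end', and 'answer' can be sketched without running a model. The dictionary below is built by hand for illustration: the answer span and the score are assumptions, not real model output. It only shows that slicing the context with 'start' and 'end' recovers the 'answer' text.

```python
# Hypothetical illustration: this dictionary is built by hand, not returned by a model.
context = "Christmas is on Monday, the 25th of December, 2023"
answer_text = "Monday, the 25th of December, 2023"  # assumed answer span

start = context.find(answer_text)  # character index where the span begins
end = start + len(answer_text)     # character index just past the span

# The score below is made up for the example.
answer = {"score": 0.9, "start": start, "end": end, "answer": answer_text}

# Slicing the context with 'start' and 'end' gives back the 'answer' text.
recovered = context[answer["start"]:answer["end"]]
print(recovered)
```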

In [None]:
print("DialoGPT>> " + conversation.generated_responses[-1])
# .generated_responses holds a list of the model's responses
    # [-1] accesses the most recent response

## 2.7 Evaluating a Sentiment Analysis Model

### Uses:
* Datasets
* Transformers

### Topics Covered:
* IMDb
* Working with datasets
* Evaluating a sentiment model using accuracy metrics
* Testing the default, SST-2, and Tweets models

### Syntax Segments Summary:

In [None]:
from datasets import load_dataset, load_metric
# load_dataset function loads datasets from the Hugging Face datasets repository
# load_metric function loads evaluation metrics used for measuring the performance of NLP models
# note: newer releases of datasets dropped load_metric in favor of the separate evaluate library

from transformers import pipeline
# pipeline function allows you to create a pipeline for a specific task

import pandas as pd
# used for data manipulation

In [None]:
# Download the IMDb dataset
dataset = load_dataset("imdb", split="test")
# "imdb" = name of the dataset being loaded with the Hugging Face 'datasets' library
# split parameter specifies which part of the dataset to load
    # 'test' typically refers to the subset of a dataset used to evaluate performance
# contains the features 'text' and the corresponding 'label'

In [None]:
df = pd.DataFrame(dataset)
# creates a pandas dataframe using the given information

all_texts = df["text"].values.tolist()
# df["text"] selects the "text" column
    # .values converts selected column into NumPy array
        # .tolist() converts the NumPy array into Python list

In [None]:
all_sentiments = model(all_texts, truncation=True, max_length=512)
# Would take ~16.5 hrs for me to run
# the model runs sentiment analysis on every text in the all_texts list
# truncation=True cuts texts that are too long down to max_length tokens
# max_length sets the maximum number of tokens that are allowed per text
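
The effect of truncation can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: whitespace-separated words stand in for real model tokens (an assumption; real tokenizers split text into subwords and add special tokens), and `truncate_tokens` is a hypothetical helper.

```python
def truncate_tokens(tokens, max_length):
    """Keep at most max_length tokens, dropping the rest from the end."""
    return tokens[:max_length]

# Assumption: whitespace-separated words stand in for real model tokens.
long_review = "word " * 600   # a review with 600 stand-in tokens
tokens = long_review.split()

truncated = truncate_tokens(tokens, max_length=512)
print(len(tokens), "->", len(truncated))
```

Texts that are already shorter than max_length pass through unchanged.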

In [None]:
score = metric.compute(predictions=predictions, references=references)
# .compute() function of metric object calculates the accuracy score -
    # by comparing the predicted values to the reference (true/accurate) values
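
What the accuracy metric computes can be sketched by hand. The pipeline outputs and reference labels below are hypothetical, written only for illustration. The sketch assumes the model reports "POSITIVE"/"NEGATIVE" string labels and that the dataset encodes 1 = positive and 0 = negative, so the strings are mapped to integers before comparing.

```python
# Hypothetical pipeline outputs, written by hand for illustration.
all_sentiments = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.95},
    {"label": "POSITIVE", "score": 0.60},
    {"label": "NEGATIVE", "score": 0.80},
]
references = [1, 0, 0, 0]  # assumed true labels: 1 = positive, 0 = negative

# Map the string labels to the integer labels used by the dataset.
label_to_id = {"NEGATIVE": 0, "POSITIVE": 1}
predictions = [label_to_id[s["label"]] for s in all_sentiments]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(p == r for p, r in zip(predictions, references))
accuracy = correct / len(references)
print(accuracy)
```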

## 2.10 Semantic Search on Big Data

### Uses:
* NumPy
* Scikit-Learn
* Faiss

### Topics Covered:
*  What is Semantic Search?
*  Using Faiss to speed up Semantic Search
*  Generating random vectors
*  Brute-force Semantic Search
*  Semantic Search with Space-partitioning Index

### Syntax Segments Summary:

In [None]:
import numpy as np
# used for manipulating data

from sklearn.preprocessing import normalize
# normalize function used for normalizing arrays or vectors

import faiss
# Facebook AI Similarity Search
# Efficient at similarity search and the clustering of dense vectors

In [None]:
np.random.seed(1234)
# sets seed for NumPy random number generator

In [None]:
vectors = np.random.random((number_of_vectors, num_dimensions)).astype('float32')
# np.random.random() generates random floats uniformly distributed in the range [0.0, 1.0)
# (number_of_vectors, num_dimensions) specifies the shape of the generated array
# .astype('float32') method converts the data type of the generated array into float32
    # a data type representing 32-bit floating-point numbers

vectors = normalize(vectors)
# normalizes each vector (each row of the array) to unit length, using the L2 norm by default
# This divides each component by the vector's Euclidean norm. Think of this norm as the magnitude of the vector.
    # The new, normalized (divided) vector has a magnitude of 1, but points in the same direction

In [None]:
index = faiss.IndexFlatL2(num_dimensions)
# Creates an IndexFlatL2 in a space with a specified number of dimensions
# A flat index is an index where all vectors are stored as-is, without hierarchy
    # (a search compares the query against every stored vector)
# L2 is the Euclidean distance metric used to find nearest neighbors
    # (the most similar vectors are the ones with the least distance from each other)

index.add(vectors)
# fills the index with the vectors

retrieved_vector = index.reconstruct(0)
# .reconstruct retrieves a single data point from the index based on the provided position/index (0)

In [None]:
query_vector = np.random.random((1, num_dimensions)).astype('float32')
# creates a single vector with 512 dimensions

# Here is a mathematical explanation of normalization
query_vector = normalize(query_vector)
# divides the vector by its L2 norm, so its magnitude becomes 1 while the direction stays the same
# L2 norm is a measure of the magnitude of a vector:
    # |v| = sqrt((c1)^2 + (c2)^2 + ... + (cn)^2)
    # where all c are components of the vector
# L1 norm is the sum of the absolute values of all components in a vector
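
The two norms and the effect of normalization can be checked on a small vector. This is a minimal sketch using plain NumPy rather than scikit-learn's normalize; the vector [3, 4] is chosen only because its norms are easy to verify by hand.

```python
import numpy as np

v = np.array([3.0, 4.0], dtype="float32")

l2_norm = np.sqrt(np.sum(v ** 2))  # sqrt(3^2 + 4^2)
l1_norm = np.sum(np.abs(v))        # |3| + |4|

unit_v = v / l2_norm               # same direction, magnitude 1
print(l2_norm, l1_norm, unit_v)
```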

In [None]:
distances, indices = index.search(query_vector, num_neighbors)
# .search() performs a nearest neighbor search provided:
    # A starting point (query_vector)
    # A number of neighbors to find (num_neighbors, here 4)
# Finds 1st, 2nd, 3rd, and 4th closest vectors to the query_vector
# Returns:
    # the distances between each neighbor and the query_vector (IndexFlatL2 reports squared L2 distances)
    # the index of each neighbor inside of the IndexFlatL2 obj
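
A brute-force search of this kind can be sketched in NumPy. This is an illustration of the idea, not Faiss's implementation: `brute_force_search` is a hypothetical helper that handles a single query and returns 1-D arrays, and the small 2-D vectors are made up so the distances can be checked by hand.

```python
import numpy as np

def brute_force_search(vectors, query_vector, num_neighbors):
    """Return (squared L2 distances, indices) of the closest stored vectors."""
    diffs = vectors - query_vector                # broadcast the (1, d) query against (n, d)
    sq_dists = np.sum(diffs ** 2, axis=1)         # squared Euclidean distance per vector
    order = np.argsort(sq_dists)[:num_neighbors]  # indices of the smallest distances
    return sq_dists[order], order

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
query_vector = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = brute_force_search(vectors, query_vector, num_neighbors=2)
print(distances, indices)
```

Every stored vector is compared with the query, so the cost grows linearly with the number of vectors.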

In [None]:
IVFindex = faiss.IndexIVFFlat(quantizer, num_dimensions, n_cells)
# Inverted File with Flat Index
    # partitions the vector space into smaller cells (or clusters) of vectors
    # each cell contains a subset of vectors based on their proximity (or similarity)
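# The quantizer is another index (for example an IndexFlatL2) that serves as a coarse quantizer
    # it finds the cells closest to a query; the nearest neighbor search then runs only inside those cells
# num_dimensions must match the number of dimensions of the vectors being indexed
# n_cells sets how many cells the vector space is partitioned into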

In [None]:
IVFindex.train(vectors)
# trains the index using the vectors dataset
# training clusters the vectors to learn the centroid of each cell
# prepares the data structure for efficient search

IVFindex.add(vectors)
# adds the vectors to the index structure
# the vectors are partitioned in cells based on the quantizer to be used for nearest neighbor searches
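
The partition-then-search idea can be sketched in NumPy. This is a simplified illustration under stated assumptions, not Faiss's implementation: the two centroids are picked by hand instead of being learned by training, `nearest_cells` and `ivf_search` are hypothetical helpers, and the `nprobe` argument (named after Faiss's setting for how many cells to visit) defaults to 1.

```python
import numpy as np

# Assumption: two hand-picked centroids stand in for what training would learn.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]], dtype="float32")
vectors = np.array(
    [[0.5, 0.2], [1.0, 1.0], [9.0, 9.5], [10.5, 10.0], [0.1, 0.9]], dtype="float32"
)

def nearest_cells(points, centroids, n):
    """Indices of the n closest centroids for each point (coarse quantization)."""
    sq = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(sq, axis=1)[:, :n]

# "add": store each vector id in the inverted list of its nearest cell.
inverted_lists = {cell: [] for cell in range(len(centroids))}
for vec_id, cell in enumerate(nearest_cells(vectors, centroids, 1)[:, 0]):
    inverted_lists[int(cell)].append(vec_id)

def ivf_search(query, num_neighbors, nprobe=1):
    """Search only the vectors stored in the nprobe cells closest to the query."""
    cells = nearest_cells(query[None, :], centroids, nprobe)[0]
    candidates = [i for c in cells for i in inverted_lists[int(c)]]
    sq = ((vectors[candidates] - query) ** 2).sum(axis=1)
    order = np.argsort(sq)[:num_neighbors]
    return [candidates[i] for i in order]

query = np.array([9.8, 9.9], dtype="float32")
print(inverted_lists)
print(ivf_search(query, num_neighbors=2))
```

Visiting fewer cells makes the search faster but can miss true neighbors that fall in a cell that was skipped.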

## 2.17 Question Answering

### Uses:
* Transformers

### Topics Covered:
* Question Answering (QA) Including:
    * Question Answering Variants
    * Question Answering Datasets (Benchmarks)
* QA using Python
* Fast QA using Python

### Syntax Segments Summary:

In [None]:
from transformers import pipeline
# pipeline function allows you to create a pipeline for a specific task

In [None]:
qa_model = pipeline("question-answering")
qa_response = qa_model(question=question, context=context)
# Model returns a dictionary containing keys:
    # score: the confidence of the model in extracting the answer from the context
    # start: the index of the character in the context that corresponds to the start of the extracted answer.
    # end: the index of the character in the context that corresponds to the end of the extracted answer.
    # answer: the text extracted from the context, which should contain the answer.

In [None]:
passage_pipe = pipeline("text-classification", model="cross-encoder/ms-marco-TinyBERT-L-2")
rankings = passage_pipe(passages_with_question)
# scores each entry in a list, where each entry is a passage paired with the question
# the passage with the highest score is the most relevant to the question
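
Picking the most relevant passage from the scores can be sketched as follows. The passages, labels, and scores are hypothetical and written by hand for illustration; the sketch assumes the pipeline returns one dictionary per input with a 'score' key.

```python
# Hypothetical passages and relevance scores, written by hand for illustration.
passages = [
    "Christmas is on the 25th of December.",
    "Faiss speeds up similarity search.",
    "IMDb is a dataset of movie reviews.",
]
rankings = [
    {"label": "LABEL_0", "score": 0.97},  # assumed output shape: one dict per passage
    {"label": "LABEL_0", "score": 0.02},
    {"label": "LABEL_0", "score": 0.05},
]

# Sort passage indices from the highest score to the lowest.
ranked_ids = sorted(range(len(passages)), key=lambda i: rankings[i]["score"], reverse=True)
best_passage = passages[ranked_ids[0]]
print(best_passage)
```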

## 2.18 Text Summarization

### Uses:
* Transformers

### Topics Covered:
* Summarizing Text in Python

### Syntax Segments Summary:

In [None]:
from transformers import pipeline
# pipeline function allows you to create a pipeline for a specific task

In [None]:
sum_model = pipeline("summarization")
resp = sum_model(text)[0]
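# sum_model(text) returns a list containing one dictionary with 1 key -> 'summary_text'
# [0] selects that dictionary, so the summary itself is resp['summary_text']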