# Reading The Scala 2 Cookbook

<img src="https://m.media-amazon.com/images/I/91AfDQsL7SL._AC_UF1000,1000_QL80_.jpg" width="300" />


In [27]:
import PyPDF2
from tqdm import tqdm
import os


def divide_chunks(text_body, min_chunk_size=300, max_chunk_size=1000):
    in_chunks = text_body.split(".\n")
    out_chunks = []
    curr_chunk = ""
    for ic in in_chunks:
        if len(ic) < min_chunk_size:
            if len(ic)+len(curr_chunk) < max_chunk_size:
                curr_chunk = curr_chunk + "\n" + ic
            else:
                out_chunks.append(curr_chunk)
                curr_chunk = ""
    return out_chunks


def clean_chunk(raw_chunk):
    chunk = raw_chunk.replace('-\n','').replace('\n',' ')
    return chunk


body = ""
with open(f"data/pdf/scala_book/Scala_Cookbook.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    for page in tqdm(reader.pages):
        text = page.extract_text()
        body += "\n\n" + text
    chunk_list = [clean_chunk(c) for c in divide_chunks(body)]

100%|██████████| 722/722 [00:09<00:00, 79.30it/s]


# Embeddings
##### Multidimensional Vector Representation of Semantic Meanings
<img src="https://corpling.hypotheses.org/files/2018/04/Screen-Shot-2018-04-25-at-13.21.44.png" width="400" />

In [32]:
from sentence_transformers import SentenceTransformer
import itertools

model_name = "sentence-transformers/all-MiniLM-L12-v2"

model = SentenceTransformer(model_name, device='cuda')

sentences = list(itertools.chain(*chunk_list))

embeddings = []
for s in tqdm(sentences):
  embeddings.append([float(x) for x in model.encode(s)])

100%|██████████| 175949/175949 [37:47<00:00, 77.59it/s]


# Database
##### Creating a Vector Database of Embeddings using the open-source ChromaDB
<img src="https://www.mlq.ai/content/images/2023/08/1_admwyPyR6v_IZI0EYE--eA-1.webp" width="250" />


In [34]:
import chromadb
from chromadb.utils import embedding_functions


chroma_client = chromadb.Client()

default_ef = embedding_functions.DefaultEmbeddingFunction()

collection = chroma_client.get_or_create_collection(name="scala_book_chunks", embedding_function=default_ef)

MAX_CHUNKS = 41600

collection.add(
    documents=sentences[:MAX_CHUNKS],
    embeddings=embeddings[:MAX_CHUNKS],
    ids=[str(x) for x in range(MAX_CHUNKS)]
)

# ChatBot
##### Using the open-source Databricks' Dolly model fine-tuned through the RAG techinique
<img src="https://www.databricks.com/sites/default/files/2023-04/Dolly-logo.png" width="300" />

In [35]:
from transformers import pipeline
import torch
from langchain import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.chains.question_answering import load_qa_chain


def build_qa_chain():

    model_name = "databricks/dolly-v2-3b" # Dolly smallest version (3 billion params)

    instruct_pipeline = pipeline(model=model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
                                 return_full_text=True, max_new_tokens=4096, top_p=0.95, top_k=50,
                                 device=0) #cuda

    template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert Scala developer.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions and code expamples.

    {context}

    Question: {question}

    Response:
    """

    prompt = PromptTemplate(input_variables=['context', 'question'], template=template)

    hf_pipe = HuggingFacePipeline(pipeline=instruct_pipeline)

    return load_qa_chain(llm=hf_pipe, chain_type="stuff", prompt=prompt, verbose=True)

In [None]:
# Building the chain will load Dolly and can take several minutes depending on the model size
qa_chain = build_qa_chain()

In [37]:
class Document():
    def __init__(self, content):
        self.page_content = content
        self.metadata = {"metadata": ""}

def get_similar_docs(question):
    results = collection.query(
        query_embeddings=[float(x) for x in model.encode(question)],
        n_results=3
    )
    return results["documents"]

def answer_question(question):
    similar_docs = [Document(x) for x in get_similar_docs(question)]
    result = qa_chain({"input_documents": similar_docs, "question": question})
    return result

In [40]:
import os
os.environ['CURL_CA_BUNDLE'] = ''

question = "Which are the strenghts of the Scala programming language?"

answer = answer_question(question)

print(answer["output_text"])



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mBelow is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert Scala developer.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions and code expamples.

    ['s', 's', 's']

    Question: Which are the strenghts of the Scala programming language?

    Response:
    [0m

[1m> Finished chain.[0m

[1m> Finished chain.[0m

The following are the strenghts of the Scala programming language:

- static typing, meaning that the type of an expression is fully determined at compile time
- strong type inference, meaning that type information is used to perform more efficient runtime checks
- functional programming style, meaning that imperative constructs are not required


In [41]:
question = "What are side effects?"

answer = answer_question(question)

print(answer["output_text"])



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mBelow is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert Scala developer.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions and code expamples.

    ['+', '+', '+']

    Question: What are side effects?

    Response:
    [0m

[1m> Finished chain.[0m

[1m> Finished chain.[0m

Side effects are an operations aspect of programming languages that can influence the result of an operation, especially one that writes data to a system's file system.

To some extent, the distinction between "side effects" and other effects is semantic rather than a matter of whether one should count something as a side effect or not, so it is useful to talk about both.


In [42]:
question = "How do I transform a sequence of string to uppercase?"

answer = answer_question(question)

print(answer["output_text"])



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mBelow is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert Scala developer.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions and code expamples.

    ['s', 's', 's']

    Question: How do I transform a sequence of string to uppercase?

    Response:
    [0m

[1m> Finished chain.[0m

[1m> Finished chain.[0m

Short textual description:

You can use the toUpper method of the String class.

Example:

scala> 'hello'.toUpper
res0: String = HELLO

Longer textual description:

You can use the toUpper method of the String class.

It applies a global transformation to each of its characters.

The toUpper method transforms a string into a new string that contains all the characters of the string,
but with each of them tr