### Read Text

In [71]:
text = """
Ensemble Algorithms
A powerful and more advanced type of machine learning algorithm are ensemble algorithms. These are techniques that combine the predictions from multiple models in order to provide more accurate predictions. In this part you will be introduced to two of the most used ensemble methods:

      Bagging and Random Forests which are among the most powerful algorithms available.
      Boosting ensemble and the AdaBoost algorithm that successively corrects the predictions of weaker models.

What This Book is Not
      This is not a machine learning textbook. We will not be going into the theory behind why things work or the derivations of equations. This book is about teaching how machine learning algorithms work, not why they work.
      This is not a machine learning programming book. We will not be designing machine learning algorithms for production or operational use. All examples in this book are for demonstration purposes only.

How To Best Use this Book
This book is intended to be read linearly from one end to the other. Reading this book is not enough. To make the concepts stick and actually learn machine learning algorithms you need to work through the tutorials. You will get the most out of this book if you open a spreadsheet along side the book and work through each tutorial.
Working through the tutorials will give context to the representation, learning and prediction procedures described for each algorithm. From there, you can translate the ideas to your own programs and to your usage of these algorithms in practice.
I recommend completing one chapter per day, ideally in the evening at the computer so you can immediately try out what you have learned. I have intentionally repeated key equations and descriptions to allow you to pick up where you left off from day to day.

Summary
It is time to finally understand machine learning. This book is your ticket to machine learning algorithms. Next up you will build a foundation to understand the underlying problem that all machine learning algorithms are trying to solve.


"""

In [73]:
import spacy
import pandas as pd
import re
import time

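def brain_dataset(text, threshold=50, dataset_name="trial"):
    # Keep only letters, digits, periods, percent signs, and whitespace.
    cleaned_text = re.sub(r'[^a-zA-Z0-9.%\s]', '', text)
    # Drop blank lines and duplicates; dict.fromkeys keeps the original order (set() does not).
    lines = [line.strip() for line in cleaned_text.split("\n") if line.strip()]
    info_list = list(dict.fromkeys(lines))

    # Requires the model to be installed: python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

    # Build new lists instead of popping from info_list while iterating over it.
    paragraphs = []
    ner_results = []
    for para in info_list:
        if len(para.split()) > threshold:
            # Split long paragraphs into sentences and collect their named entities.
            doc = nlp(para)
            paragraphs.extend(sent.text for sent in doc.sents)
            ner_results.extend((ent.text, ent.label_) for ent in doc.ents)
        else:
            paragraphs.append(para)

    df = pd.DataFrame(paragraphs, columns=['paragraph_info'])
    # to_csv returns None when given a path, so return the DataFrame instead.
    df.to_csv(f"{dataset_name}.csv", index=False)

    return df, ner_results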

start_time = time.time()
brain_dataset(text)
processing_time = round((time.time() - start_time), 2)
print(processing_time)

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
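
The `OSError` above indicates that the `en_core_web_sm` model is not installed in the current environment; it is normally fetched with `python -m spacy download en_core_web_sm`. Where installing the model is not an option, a rough stand-in for the sentence-splitting step can be sketched with the standard library alone. The regex splitter and the `split_long_paragraphs` name below are assumptions for illustration, not part of spaCy, and they produce no named-entity results.

```python
import re

def split_long_paragraphs(text, threshold=50):
    """Split paragraphs longer than `threshold` words into sentences.

    Hypothetical stand-in for the spaCy step: a naive regex treats '.',
    '!' or '?' followed by whitespace as a sentence boundary, so
    abbreviations such as "e.g. this" will be split incorrectly.
    """
    # Unique non-empty lines, in their original order.
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    paragraphs = list(dict.fromkeys(lines))

    result = []
    for para in paragraphs:
        if len(para.split()) > threshold:
            # Build a new list instead of mutating one while iterating over it.
            result.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)
        else:
            result.append(para)
    return result


sample = "Short line.\nOne two three. Four five six. Seven eight nine.\nShort line."
chunks = split_long_paragraphs(sample, threshold=5)
print(chunks)
```

The duplicate "Short line." is dropped and only the nine-word paragraph is split, so the short paragraph passes through unchanged.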

In [1]:
import pandas as pd

data = pd.read_csv("dataset.csv")

In [2]:
data.head(10)

Unnamed: 0,paragraph_info
0,Astonishingly statistical analyses are virtual...
1,Teaching statistics from a mathematical rather...
2,Recommendations by statistics educators includ...
3,In 2003 the National Research Council publishe...
4,In 2003 National Research Council published re...


In [3]:
texts = data["paragraph_info"]

### Convert to embeddings

In [14]:
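import torch
from sentence_transformers import SentenceTransformer

model_id = "sentence-transformers/paraphrase-MiniLM-L3-v2"
dim = 384  # embedding size produced by this model

# Fall back to the CPU when no GPU is available.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = SentenceTransformer(model_id, device=device)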

In [15]:
encoded_text = model.encode(texts).tolist()

In [16]:
texts = list(texts)  # plain list for ChromaDB; safe to re-run, unlike Series.tolist()

In [7]:
ids = [str(i) for i in range(len(encoded_text))]

### Store in ChromaDB

In [18]:
import chromadb

chroma_client = chromadb.PersistentClient(path="./chromadb-t-docs")

In [19]:
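# get_or_create_collection avoids an error when the cell is re-run
# against the same persistent directory.
collection = chroma_client.get_or_create_collection(
    name="book",
    metadata={"hnsw:space": "cosine"}
)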

In [20]:
collection.add(
    documents=texts,
    embeddings=encoded_text,
    ids=ids
)

### Get the output

In [25]:
question = "what happend in 2003 ?"
question_embed = model.encode(question)

results = collection.query(
    query_embeddings=question_embed.tolist(),
    n_results=3,
)

print(results)

{'ids': [['3', '9', '4']], 'distances': [[0.8786217569932246, 0.8915272446710168, 0.9205692562979663]], 'metadatas': [[None, None, None]], 'embeddings': None, 'documents': [['In 2003 the National Research Council published. Undergraduate Education to Prepare Biomedical Research Scientists that supported stronger backgrounds in physics and mathematics and suggested that biology faculty integrate these subjects into their courses.', 'Another strategy establish clear link statistics application real world.', 'In 2003 National Research Council published recommendations revising undergraduate biology education BIO2010 Undergraduate Education Prepare Biomedical Research Scientists supported stronger backgrounds physics mathematics suggested biology faculty integrate subjects courses.']]}
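
With `"hnsw:space": "cosine"`, Chroma reports cosine distance, that is, one minus the cosine similarity, so 0 means the two vectors point the same way and values near 1 mean they are nearly unrelated. The distances of roughly 0.88 to 0.92 above therefore suggest fairly weak matches. A minimal sketch of that computation, and of flattening the nested result dictionary, follows; `cosine_distance` and `flatten_results` are hypothetical helper names, and the sample dictionary is made-up data that only mimics the shape of the output above.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def flatten_results(results, query_index=0):
    """Turn one query's nested lists into (id, distance, document) tuples."""
    return list(zip(
        results["ids"][query_index],
        results["distances"][query_index],
        results["documents"][query_index],
    ))

# Made-up data in the same shape as the query output (one query, two hits).
sample = {
    "ids": [["3", "9"]],
    "distances": [[0.25, 0.5]],
    "documents": [["first paragraph", "second paragraph"]],
}
hits = flatten_results(sample)
for doc_id, distance, document in hits:
    # Convert the distance back to a similarity for easier reading.
    print(doc_id, round(1.0 - distance, 2), document)
```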


### BERT LM

In [56]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

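def casy_response(question, top_paragraph):
    # Tokenize the question and context as one pair, truncated to BERT's 512-token limit.
    inputs = tokenizer.encode_plus(question, top_paragraph, return_tensors="pt", max_length=512, truncation=True)

    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        output = model(**inputs)

    # Most likely start and end token positions of the answer span.
    answer_start = int(torch.argmax(output.start_logits))
    answer_end = int(torch.argmax(output.end_logits))
    # Widen the span by up to 10 tokens on the left for extra context.
    span_start = max(0, answer_start - 10)
    span_end = min(inputs['input_ids'].shape[1] - 1, answer_end)
    answer = tokenizer.decode(inputs['input_ids'][0, span_start:span_end + 1], skip_special_tokens=True)
    # Remove the question text if the widened span reaches back into it.
    answer = answer.replace(question, "").strip().capitalize()

    return answer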

In [61]:
top_paragraph = ' '.join(results['documents'][0])
question = "what happend in 2003?"

casy_response(question, top_paragraph)

''
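
The empty string may come from the span selection: the start and end positions are chosen by two independent `argmax` calls, so the end can land well before the start, or both can point at the `[CLS]` token, and either case decodes to nothing. This is a guess about the cause, not a confirmed diagnosis. A common alternative is to score every pair with `start <= end` and keep the best one. The sketch below shows that on plain Python lists; `best_span` and `max_answer_len` are hypothetical names, and in the notebook the inputs would be `output.start_logits[0].tolist()` and `output.end_logits[0].tolist()`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best = None
    for start, s_logit in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

# Toy logits where independent argmax would give start=3 and end=1,
# i.e. an end position before the start position.
start_logits = [0.1, 0.2, 0.3, 2.0]
end_logits = [0.0, 3.0, 0.5, 1.0]
start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

A fuller version would also restrict the search to context tokens so the span cannot fall inside the question.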

### Gemini LLM

In [None]:
# Requires Python 3.9 or later
! pip install -q -U google-generativeai

In [None]:
import os
import google.generativeai as genai

# Read the API key from the environment instead of hard-coding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before starting the kernel.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel('gemini-pro')
chat = model.start_chat(history=[])

response = chat.send_message("what happened in 2003?", stream=True)

# With stream=True the reply arrives in chunks; print each one as it comes in.
for chunk in response:
    print(chunk.text)
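
The chat call above sends only the question, so the model answers without seeing the paragraphs retrieved from ChromaDB. One way to connect the two steps is to place the retrieved paragraphs in the prompt. The sketch below only builds the prompt string; `build_prompt`, its wording, and the `max_chars` budget are assumptions, not part of the original notebook or of the `google.generativeai` API, and the paragraphs are placeholders.

```python
def build_prompt(question, paragraphs, max_chars=4000):
    """Combine retrieved paragraphs and a question into a single prompt."""
    context_lines = []
    used = 0
    for i, para in enumerate(paragraphs, start=1):
        line = f"[{i}] {para.strip()}"
        # Stop adding context once the (arbitrary) character budget is used up.
        if used + len(line) > max_chars:
            break
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Placeholder paragraphs standing in for results['documents'][0].
prompt = build_prompt(
    "What happened in 2003?",
    ["first retrieved paragraph", "second retrieved paragraph"],
)
print(prompt)
```

In the notebook this could be called as `build_prompt(question, results['documents'][0])`, with the returned string passed to `chat.send_message`.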