A preprocessing notebook that loads the podcast transcripts from docs/youtube/, splits them into chunks, and exports them for upload to Milvus.

In [1]:
from tqdm import tqdm
import re

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader, DirectoryLoader
from src.utils import reformat_text
from src.vectorstore import jsonize_document

import pandas as pd
import json


In [2]:
loader = DirectoryLoader("docs/youtube/", glob="*.txt", show_progress=True)
docs = loader.load()

  0%|          | 0/14 [00:00<?, ?it/s][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/damir/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
100%|██████████| 14/14 [00:41<00:00,  2.99s/it]


In [3]:
docs

[Document(page_content="Welcome to the Huberman Lab Podcast,\n\nwhere we discuss science\n\nand science-based tools for everyday life.\n\nI'm Andrew Huberman,\n\nand I'm a professor of neurobiology and ophthalmology\n\nat Stanford School of Medicine.\n\nToday, we are discussing headaches.\n\nHeadaches are something that everybody will suffer\n\nat some point in their lifetime.\n\nOf course, some people suffer from headaches\n\nfar more often than others.\n\nAnd for many people, headaches can be incredibly debilitating,\n\nlimiting their ability to work, to socialize,\n\nto sleep, to exercise,\n\nessentially to live life in any kind of normal way.\n\nAs we'll soon discuss,\n\nthere are many different kinds of headache.\n\nWe have migraine headaches,\n\ntension headaches, cluster headaches.\n\nToday, we'll review all the different types of headaches\n\nand what the underlying biology\n\nof each and every one of those types of headaches is,\n\nas well as, fortunately,\n\nthe many excellen

In [4]:
for doc in docs:
    doc.page_content = reformat_text(doc.page_content)
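
The source of `reformat_text` (from `src.utils`) is not shown in this notebook. Judging only from the `docs` output before and after this cell, it appears to join the caption fragments into running text with one sentence per line. A minimal sketch of that behaviour follows; the function name `reformat_text_sketch` and both regular expressions are assumptions for illustration, not the project's actual implementation.

```python
import re


def reformat_text_sketch(text: str) -> str:
    """Hypothetical stand-in for src.utils.reformat_text (real helper not shown)."""
    # Join the blank-line-separated caption fragments into one running string.
    flat = re.sub(r"\s*\n+\s*", " ", text).strip()
    # Start a new line after sentence-ending punctuation followed by whitespace.
    return re.sub(r"(?<=[.!?])\s+", "\n", flat)


raw = (
    "Welcome to the Huberman Lab Podcast,\n\nwhere we discuss science\n\n"
    "and science-based tools for everyday life.\n\nI'm Andrew Huberman,"
)
formatted = reformat_text_sketch(raw)
print(formatted)
```

On this excerpt the sketch yields two lines, which matches the shape of the reformatted `page_content` shown further down. The real helper may handle abbreviations, timestamps, or other cases differently.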

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
)
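
To make the two size parameters concrete, here is a simplified sliding-window chunker. It is only an illustration of what `chunk_size` and `chunk_overlap` mean when lengths are measured with `len`. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter breaks text on separators (paragraphs, lines, spaces) and merges the pieces, so its chunks end on natural boundaries and are often shorter than `chunk_size`. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With the settings used in this notebook (`chunk_size=2000`, `chunk_overlap=100`), consecutive chunks can repeat up to 100 characters, which helps keep a sentence that straddles a boundary retrievable from either chunk.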

In [7]:
docs

[Document(page_content="Welcome to the Huberman Lab Podcast, where we discuss science and science-based tools for everyday life.\nI'm Andrew Huberman, and I'm a professor of neurobiology and ophthalmology at Stanford School of Medicine.\nToday, we are discussing music and your brain.\nHowever, this episode could have just as easily been entitled, music is your brain, or your brain is music.\nAnd that's because music, believe it or not, is a neurological phenomenon.\nMost of us think of music as something that happens outside of us, the sounds we hear, the lyrics we hear, their meaning, how they anchor us to pieces of our history, both emotional or social.\nIt turns out that when we listen to music, it activates nearly every piece of our brain.\nMoreover, when we listen to music, it activates our brain in ways that our brain itself, and indeed our body as well, help to create that music at the level of so-called neuro-ensemble, that is the firing of neurons.\nIn other words, when we lis

In [8]:
splitted_docs = text_splitter.split_documents(docs)

In [9]:
len(splitted_docs)

419

In [10]:
jsons = []

for d in tqdm(splitted_docs, desc="Processing texts"):
    # Use a name other than `json` so the json module imported above is not shadowed.
    record = jsonize_document(d)
    jsons.append(record)

Processing texts: 100%|██████████| 419/419 [21:09<00:00,  3.03s/it]


In [13]:
import json
with open('jsons.json', 'w') as file:
    json.dump(jsons, file, indent=4)

In [14]:
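# Read back the file written in the previous cell (assumes the notebook runs from the project root).
df = pd.read_json("jsons.json")
# Use the row index as the primary key column.
df['pk'] = df.index
df.to_csv("dataframe.csv", index=False)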
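
For readers unfamiliar with the pandas idiom above, the following standard-library sketch shows roughly what the cell produces: each record receives a 0-based `pk` equal to its position, and the rows are written as CSV with `pk` as the last column. The `text` field and the two sample records are hypothetical; the actual fields depend on what `jsonize_document` returns.

```python
import csv
import io

# Hypothetical records standing in for the contents of jsons.json.
records = [{"text": "first chunk"}, {"text": "second chunk"}]

# Equivalent of df['pk'] = df.index for a freshly loaded frame: a 0-based counter.
rows = [{**rec, "pk": i} for i, rec in enumerate(records)]

# Write to an in-memory buffer instead of dataframe.csv to keep the sketch side-effect free.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the key is derived from row order, re-running the pipeline with a different chunking configuration will assign different `pk` values to the same passage.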