### CSV Loader

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader("./heart_failure_clinical_records.csv")

loader.load()[:5]

[Document(page_content='age: 55.0\nanaemia: 0\ncreatinine_phosphokinase: 748\ndiabetes: 0\nejection_fraction: 45\nhigh_blood_pressure: 0\nplatelets: 263358.03\nserum_creatinine: 1.3\nserum_sodium: 137\nsex: 1\nsmoking: 1\ntime: 88\nDEATH_EVENT: 0', metadata={'source': './heart_failure_clinical_records.csv', 'row': 0}),
 Document(page_content='age: 65.0\nanaemia: 0\ncreatinine_phosphokinase: 56\ndiabetes: 0\nejection_fraction: 25\nhigh_blood_pressure: 0\nplatelets: 305000.0\nserum_creatinine: 5.0\nserum_sodium: 130\nsex: 1\nsmoking: 0\ntime: 207\nDEATH_EVENT: 0', metadata={'source': './heart_failure_clinical_records.csv', 'row': 1}),
 Document(page_content='age: 45.0\nanaemia: 0\ncreatinine_phosphokinase: 582\ndiabetes: 1\nejection_fraction: 38\nhigh_blood_pressure: 0\nplatelets: 319000.0\nserum_creatinine: 0.9\nserum_sodium: 140\nsex: 0\nsmoking: 0\ntime: 244\nDEATH_EVENT: 0', metadata={'source': './heart_failure_clinical_records.csv', 'row': 2}),
 Document(page_content='age: 60.0\nana

#### Specify a column to identify the document source

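Use the source_column argument to take each document's source from a column of its row. If it is omitted, file_path is used as the source for every document created from the CSV file. In the cell below, source_column="diabetes" sets each document's source to that row's diabetes value ('0' or '1').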
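To make the effect of source_column concrete, here is a minimal standard-library sketch of a row-per-document CSV loader. It is not CSVLoader's actual implementation: `rows_to_documents` is a hypothetical helper, the two-column sample data is made up, and plain dicts stand in for Document objects.

```python
import csv
import io


def rows_to_documents(csv_file, file_path, source_column=None):
    """Hypothetical stand-in for a row-per-document CSV loader."""
    docs = []
    for i, row in enumerate(csv.DictReader(csv_file)):
        # One "key: value" line per column, joined with newlines
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Without source_column, every row shares the file path as its source
        source = row[source_column] if source_column is not None else file_path
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs


# Made-up sample data (not the heart failure dataset)
sample = "age,diabetes\n55.0,0\n45.0,1\n"

default_docs = rows_to_documents(io.StringIO(sample), "./sample.csv")
by_column = rows_to_documents(io.StringIO(sample), "./sample.csv", source_column="diabetes")

print(default_docs[0]["metadata"])
print(by_column[1]["metadata"])
```

Because CSV values are read as text, a source taken from a column is a string, which matches the 'source': '0' and 'source': '1' entries in the output of the cell below.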

<u><i>This is useful when using documents loaded from CSV files for chains that answer questions using sources.</i></u>

In [10]:
loader = CSVLoader(
    file_path="./heart_failure_clinical_records.csv",
    csv_args={"delimiter": ",", "quotechar": '"'},
    source_column="diabetes",
)

loader.load()[:5]

#print(loader.source_column)

[Document(page_content='age: 55.0\nanaemia: 0\ncreatinine_phosphokinase: 748\ndiabetes: 0\nejection_fraction: 45\nhigh_blood_pressure: 0\nplatelets: 263358.03\nserum_creatinine: 1.3\nserum_sodium: 137\nsex: 1\nsmoking: 1\ntime: 88\nDEATH_EVENT: 0', metadata={'source': '0', 'row': 0}),
 Document(page_content='age: 65.0\nanaemia: 0\ncreatinine_phosphokinase: 56\ndiabetes: 0\nejection_fraction: 25\nhigh_blood_pressure: 0\nplatelets: 305000.0\nserum_creatinine: 5.0\nserum_sodium: 130\nsex: 1\nsmoking: 0\ntime: 207\nDEATH_EVENT: 0', metadata={'source': '0', 'row': 1}),
 Document(page_content='age: 45.0\nanaemia: 0\ncreatinine_phosphokinase: 582\ndiabetes: 1\nejection_fraction: 38\nhigh_blood_pressure: 0\nplatelets: 319000.0\nserum_creatinine: 0.9\nserum_sodium: 140\nsex: 0\nsmoking: 0\ntime: 244\nDEATH_EVENT: 0', metadata={'source': '1', 'row': 2}),
 Document(page_content='age: 60.0\nanaemia: 1\ncreatinine_phosphokinase: 754\ndiabetes: 1\nejection_fraction: 40\nhigh_blood_pressure: 1\nplate

### File Directory Loader: <i>Load all documents in a directory</i>

In [7]:
from langchain_community.document_loaders import DirectoryLoader
import unstr


In [12]:
loader = DirectoryLoader(".",glob="*.csv")

loader.load()

Error loading file heart_failure_clinical_records.csv


ImportError: partition_csv is not available. Install the csv dependencies with pip install "unstructured[csv]"
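The error above arises because DirectoryLoader, by default, hands each file to an unstructured-based loader, which needs the optional CSV extras. Installing them as the message suggests is one fix. Another possibility, sketched below, is to pass loader_cls=CSVLoader so that each matched file is parsed row by row instead. The helper names `matching_files` and `load_csvs_without_unstructured` are hypothetical, and the sketch has not been run against this notebook's data.

```python
from pathlib import Path


def matching_files(directory, pattern="*.csv"):
    """List the file names a non-recursive glob pattern matches in `directory`."""
    return sorted(p.name for p in Path(directory).glob(pattern) if p.is_file())


def load_csvs_without_unstructured(directory="."):
    """Load every CSV in `directory` with CSVLoader (one document per row)."""
    # Imported lazily so matching_files() works even without LangChain installed
    from langchain_community.document_loaders import DirectoryLoader
    from langchain_community.document_loaders.csv_loader import CSVLoader

    loader = DirectoryLoader(directory, glob="*.csv", loader_cls=CSVLoader)
    return loader.load()


# Preview which files the glob would pick up before loading anything
print(matching_files("."))
```

Checking the glob first is cheap and shows exactly which files the loader will attempt, which helps when a directory mixes file types.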

### PDF Loader

In [14]:
from langchain_community.document_loaders import PyPDFLoader

In [16]:
loader = PyPDFLoader("2404.19553v1.pdf")
pages = loader.load_and_split()

In [17]:
pages[0]

Document(page_content='Extending Llama-3’s Context Ten-Fold Overnight\nPeitian Zhang1,2, Ninglu Shao1,2, Zheng Liu1∗, Shitao Xiao1, Hongjin Qian1,2,\nQiwei Ye1, Zhicheng Dou2\n1Beijing Academy of Artificial Intelligence\n2Gaoling School of Artificial Intelligence, Renmin University of China\nnamespace.pt@gmail.com zhengliu1026@gmail.com\nAbstract\nWe extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA\nfine-tuning2. The entire training cycle is super efficient, which takes 8 hours on one\n8xA800 (80G) GPU machine. The resulted model exhibits superior performances\nacross a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-\ncontext language understanding; meanwhile, it also well preserves the original\ncapability over short contexts. The dramatic context extension is mainly attributed\nto merely 3.5K synthetic training samples generated by GPT-4 , which indicates\nthe LLMs’ inherent (yet largely underestimated) potential to extend its origin

In [28]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from pprint import pprint

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_store = FAISS.from_documents(documents=pages, embedding=embeddings)

docs = vector_store.similarity_search("What was the increased context length", k=2)

for doc in docs:
    print(f"{doc.metadata['page']}:", doc.page_content)

1: [Figure 1 plot: Needle In A HayStack heatmap; x-axis: Context Length; y-axis: Depth Percent; color scale: Accuracy Score from GPT3.5]
Figure 1: The accuracy score of Llama-3-8B-Instruct-80K-QLoRA on Needle-In-A-HayStack task.
The blue vertical line indicates the training length, i.e. 80K.
the same cluster to form each heterogeneous context. Therefore, the grouped texts share
some semantic similarity. We then prompt GPT-4 to ask about the similarities/dissimilarities
across these texts.
3.Biography Summarization : we prompt GPT-4 to write a biography for each main character
in a given book.
For all three tasks, the length of context is between 64K to 80K. Note that longer data can also be
synthesized following the same methodology. When training, we organize the question-answer pairs
for the same context in one multi-turn conversation then fine-tune the LLM to correctly answer th
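Conceptually, similarity_search embeds the query and returns the k stored chunks whose vectors lie closest to it. The sketch below illustrates that ranking step with the standard library only. It assumes Euclidean (L2) distance, uses made-up 2-D vectors in place of real embeddings, and `top_k_by_l2` is a hypothetical helper, not part of FAISS or LangChain.

```python
import math


def top_k_by_l2(query, vectors, k=2):
    """Return the indices of the k vectors closest to `query` by Euclidean distance."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    return [idx for _, idx in sorted(distances)[:k]]


# Toy 2-D "embeddings" (made up; real embeddings have far more dimensions)
chunk_vectors = [(0.0, 1.0), (0.9, 0.1), (1.0, 0.0), (-1.0, 0.0)]
query_vector = (1.0, 0.0)

print(top_k_by_l2(query_vector, chunk_vectors, k=2))  # → [2, 1]
```

The ranking depends only on vector proximity, so a chunk can score well without actually answering the question. That may be why the first result above is mostly text extracted from a figure.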

### JSON Loader

In [None]:
from langchain_community.document_loaders import JSONLoader

In [38]:
# Create a JSON file
import requests
from pprint import pprint
import json

# Download JSON from the URL and fail early on an HTTP error
url = "https://filesamples.com/samples/code/json/sample4.json"
response = requests.get(url)
response.raise_for_status()
json_data = response.json()

# Serialize the parsed response data to a JSON string
json_object = json.dumps(json_data)

# Write the JSON string to a file
file = "JsonData.json"
with open(file, "w") as json_file:
    json_file.write(json_object)

In [None]:
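# JSONLoader needs a file path and a jq schema (requires the jq package).
# "." selects the whole document; text_content=False allows non-string content.
loader = JSONLoader(file_path=file, jq_schema=".", text_content=False)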
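JSONLoader picks document content out of the file with a jq expression. As a rough illustration of that idea, the sketch below does the equivalent of the jq expression `.people[].name` in plain Python. The payload, the `people` and `name` keys, and the `records_to_documents` helper are all hypothetical, since the structure of the downloaded sample file is not shown here. The metadata fields are modeled on what the loader attaches and should be read as an assumption.

```python
import json


def records_to_documents(json_text, list_key, content_key, source):
    """Turn each record under `list_key` into a document-like dict."""
    data = json.loads(json_text)
    docs = []
    # Sequence numbers start at 1 in this sketch
    for seq_num, record in enumerate(data[list_key], start=1):
        docs.append(
            {
                "page_content": str(record[content_key]),
                "metadata": {"source": source, "seq_num": seq_num},
            }
        )
    return docs


# Made-up payload for illustration only
payload = json.dumps({"people": [{"name": "Ada"}, {"name": "Grace"}]})
docs = records_to_documents(payload, "people", "name", "example.json")

print(docs[0])
```

Selecting one field per record keeps each document small, whereas a schema that selects the whole file produces a single large document.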