# Data Loaders
* Load all kinds of data and then ask the LLM questions about it.
* Connect with data sources and load private documents.

## LangChain built-in data loaders
* Labeled as "integrations".
* Most of them require you to install the corresponding libraries.

## LangChain documentation on Document Loaders
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/).
* See the list of built-in document loaders [here](https://python.langchain.com/v0.1/docs/integrations/document_loaders/).

## Setup

#### After you download the code from the GitHub repository to your computer
In terminal:
* cd project_name
* pyenv local 3.11.4
* poetry install
* poetry shell

#### To open the notebook with Jupyter Notebooks
In terminal:
* jupyter lab

Go to the notebooks folder and open the notebook for this lesson.

#### To see the code in Visual Studio Code or your editor of choice
* open Visual Studio Code or your editor of choice.
* open the project folder
* open the 001-data-loaders.py file

## Create your .env file
* The GitHub repo includes a file named .env.example.
* Rename that file to .env and add your confidential API keys to it. Remember to include:
* OPENAI_API_KEY=your_openai_api_key
* LANGCHAIN_TRACING_V2=true
* LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
* LANGCHAIN_API_KEY=your_langchain_api_key
* LANGCHAIN_PROJECT=your_project_name

We will call our LangSmith project **001-data-loaders**.
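
Before running the rest of the notebook, it can help to confirm that every variable listed above is actually present. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical, and the check is only meaningful once the .env file has been loaded into the environment (for example with `load_dotenv`).

```python
import os

# Variable names taken from the .env list above.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Hypothetical usage: report anything that still needs to be added to .env.
missing = missing_env_vars()
if missing:
    print("Missing in environment:", ", ".join(missing))
else:
    print("All required variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching real keys.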

## Track operations
From now on, we can track the operations **and the cost** of this project from LangSmith:
* [smith.langchain.com](https://smith.langchain.com)

## Connect with the .env file located in the same directory as this notebook

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [1]:
#pip install python-dotenv

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

#### Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [3]:
#!pip install langchain

## Connect with an LLM

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [4]:
#!pip install langchain-openai

* NOTE: We use OpenAI by default because, at the time of writing, we consider its models the strongest on the market. A later lesson shows how to connect with open-source LLMs such as Llama3 or Mistral.

## Simple data loading

#### Loading a .txt file

In [5]:
from langchain_openai import ChatOpenAI

chatModel = ChatOpenAI(model="gpt-3.5-turbo-0125")

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [6]:
#!pip install langchain-community

In [7]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/be-good.txt")

loaded_data = loader.load()

* If you uncomment and execute the next cell you will see the contents of the loaded document.

In [8]:
#loaded_data
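
The loader returns a list of Document objects, each with a `page_content` string and a `metadata` dictionary. Printing the whole list can be noisy, so here is a minimal sketch of a summary helper. The name `preview_documents` is hypothetical, and the sample text is a made-up stand-in (not the real contents of be-good.txt) so the sketch runs without the data file.

```python
from types import SimpleNamespace


def preview_documents(docs, max_chars=80):
    """Return one summary line per document: index, source, size, and a text preview."""
    lines = []
    for i, doc in enumerate(docs):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:max_chars].replace("\n", " ")
        lines.append(
            f"[{i}] source={source} chars={len(doc.page_content)} preview={preview!r}"
        )
    return lines


# Stand-in objects with the same two attributes as a loaded Document.
# In the notebook, pass `loaded_data` instead.
sample_docs = [
    SimpleNamespace(
        page_content="Be good.\nBe kind.",
        metadata={"source": "./data/be-good.txt"},
    ),
]

for line in preview_documents(sample_docs):
    print(line)
```

The helper only reads `page_content` and `metadata`, so it should work the same way on the output of the other loaders in this notebook.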

#### Loading a CSV file

In [9]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader('./data/Street_Tree_List.csv')

loaded_data = loader.load()

In [10]:
#loaded_data
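
To get a feel for what comes back, it helps to know that CSVLoader produces one document per row, with the row rendered as `column: value` lines. The sketch below imitates that shape with the standard `csv` module only. It is an illustration, not the library's implementation, and the sample columns are hypothetical rather than taken from Street_Tree_List.csv.

```python
import csv
import io


def rows_to_documents(csv_text, source="in-memory.csv"):
    """Turn each CSV row into a dict with page_content and metadata keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text block.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Hypothetical sample rows; the real file has its own columns.
sample_csv = "TreeID,Species\n1,Maple\n2,Oak\n"

for doc in rows_to_documents(sample_csv):
    print(doc["metadata"], "->", doc["page_content"].replace("\n", " | "))
```

Because every row becomes its own document, a large CSV yields a long list, which is worth keeping in mind before passing the result to an LLM.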

#### Loading an .html file

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [11]:
#!pip install bs4 unstructured

In [12]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader('./data/100-startups.html')

loaded_data = loader.load()

In [13]:
#loaded_data

#### Loading a .pdf file

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [14]:
#!pip install pypdf

In [15]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./data/5pages.pdf')

loaded_data = loader.load_and_split()

In [16]:
#loaded_data[0].page_content
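
Since `load_and_split()` can return several chunks for one page, a quick count per page shows how the PDF was divided. This is a minimal sketch that assumes each chunk carries a `page` entry in its metadata, as PyPDFLoader output typically does. The helper name `chunks_per_page` is hypothetical and the stand-in chunks are invented so the code runs without the PDF.

```python
from collections import Counter
from types import SimpleNamespace


def chunks_per_page(docs):
    """Count how many chunks came from each page, keyed by the 'page' metadata value."""
    counts = Counter(doc.metadata.get("page") for doc in docs)
    return dict(counts)


# Stand-in chunks; in the notebook, pass `loaded_data` instead.
sample_chunks = [
    SimpleNamespace(page_content="first part of page one", metadata={"page": 0}),
    SimpleNamespace(page_content="second part of page one", metadata={"page": 0}),
    SimpleNamespace(page_content="page two", metadata={"page": 1}),
]

print(chunks_per_page(sample_chunks))
```

Chunks without a `page` entry are grouped under the key `None` rather than raising an error.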

#### Loading a Wikipedia page and asking questions about it

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [17]:
#!pip install wikipedia

In [32]:
from langchain_community.document_loaders import WikipediaLoader

name = "JFK"

loader = WikipediaLoader(query=name, load_max_docs=1)

loaded_data = loader.load()[0].page_content

In [33]:
from langchain_core.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    [
        ("human", "Answer this {question}, here is some extra {context}"),
    ]
)

messages = chat_template.format_messages(
    question="Where was JFK assassinated?",
    context=loaded_data
)
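
The template above substitutes text into its two placeholders, so the whole Wikipedia page text ends up inside a single human message. If the context is long, one option is to trim it before formatting. The sketch below shows that idea with plain string formatting and no LangChain dependency. The helper names and the `max_chars` budget are assumptions for illustration, not values recommended by the library.

```python
# Same wording as the template used in the notebook cell above.
PROMPT = "Answer this {question}, here is some extra {context}"


def truncate_context(text, max_chars=2000):
    """Cut text to at most max_chars characters, ending on a word boundary when possible."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_space = cut.rfind(" ")
    return cut[:last_space] if last_space > 0 else cut


def build_prompt(question, context, max_chars=2000):
    """Fill the template with the question and a trimmed copy of the context."""
    return PROMPT.format(question=question, context=truncate_context(context, max_chars))


# Hypothetical stand-in for the Wikipedia text held in `loaded_data`.
sample_context = "word " * 1000
prompt = build_prompt("Where was JFK assassinated?", sample_context, max_chars=100)
print(len(prompt))
print(prompt)
```

In the notebook, the trimmed string could be passed as `context=` to `chat_template.format_messages` in place of the full `loaded_data`.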

In [34]:
response = chatModel.invoke(messages)

In [35]:
response

AIMessage(content='rants. Kennedy had eight siblings: Joseph Jr., Rosemary, Kathleen, Eunice, Patricia, Robert, Jean, and Edward.\n\nKennedy attended the Dexter School in Brookline before enrolling at the Choate School in Wallingford, Connecticut, for his high school education. He then went on to attend Harvard College, where he graduated in 1940 with a degree in government. Kennedy wrote his senior thesis on British foreign policy in the Munich Agreement, which later became the best-selling book, "Why England Slept."\n\nAfter college, Kennedy joined the U.S. Naval Reserve in 1941 and was stationed in the Pacific theater during World War II. He commanded a patrol torpedo boat, PT-109, which was sunk by a Japanese destroyer in 1943. Kennedy\'s actions in saving his crew earned him the Navy and Marine Corps Medal and the Purple Heart.\n\nKennedy returned to the U.S. in 1944 and briefly worked as a journalist before running for political office. He was elected to the U.S. House of Represe

## How to execute the code from Visual Studio Code
* In Visual Studio Code, open the file 001-data-loaders.py
* In terminal, make sure you are in the directory of the file and run:
    * python 001-data-loaders.py