### We can load multiple file types using LangChain. We have already seen TextLoader. We can also load PDFs, web pages, and files from Google Drive.

# TextLoader

In [1]:
from langchain.document_loaders import TextLoader



In [2]:
text_loader=TextLoader('data/oppenheimer.txt')

In [3]:
data=text_loader.load()

In [4]:
data[0].page_content[:250]

'Julius Robert Oppenheimer (April 22, 1904 – February 18, 1967) was an American theoretical physicist and director of the Manhattan Project\'s Los Alamos Laboratory during World War II. He is often called the "father of the atomic bomb".\n\nBorn in New'
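
Files containing non-ASCII characters, such as the dash in the date range above, can come out garbled when they are decoded with a different encoding than the one they were saved in. The standard-library sketch below reproduces that effect on a made-up sample string and shows that decoding as UTF-8 avoids it. `TextLoader` also accepts an `encoding` argument (for example `TextLoader('data/oppenheimer.txt', encoding='utf-8')`); treat that as something to confirm against your installed LangChain version.

```python
# Illustrative only: a made-up sample string containing an en dash.
original = "April 22, 1904 \u2013 February 18, 1967"

# Bytes as they would be stored in a UTF-8 encoded file.
raw_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes with a Windows code page garbles the dash.
garbled = raw_bytes.decode("cp1252")

# Decoding with the encoding the file was written in round-trips cleanly.
decoded = raw_bytes.decode("utf-8")

print(garbled)
print(decoded)
```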

# PyPDFLoader

#### To use PyPDFLoader we need to install the pypdf library
`!pip install -q pypdf`

In [5]:
from langchain.document_loaders import PyPDFLoader

In [6]:
pdf_loader=PyPDFLoader('data/Consciousness in Artificial Intelligence.pdf')

In [7]:
pdf=pdf_loader.load_and_split()

In [8]:
print(len(pdf))

92
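
`load_and_split()` returns chunks produced by a text splitter rather than whole pages, so the length printed above counts chunks. As a rough illustration of the idea only (this is not LangChain's splitter, and `chunk_text`, `chunk_size`, and `overlap` are hypothetical names), a fixed-size character splitter with overlap might look like this:

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "Whether current or near-term AI systems could be conscious"
chunks = chunk_text(sample, chunk_size=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means the end of one chunk is repeated at the start of the next, which helps keep sentences that straddle a boundary retrievable from at least one chunk.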


In [9]:
print(pdf[0])

page_content='Consciousness in Artificial Intelligence:\nInsights from the Science of Consciousness\nPatrick Butlin* Robert Long* Eric Elmoznino\nYoshua Bengio Jonathan Birch Axel Constant\nGeorge Deane Stephen M. Fleming Chris Frith\nXu Ji Ryota Kanai Colin Klein\nGrace Lindsay Matthias Michel Liad Mudrik\nMegan A. K. Peters Eric Schwitzgebel Jonathan Simon\nRufin VanRullen\nAbstract\nWhether current or near-term AI systems could be conscious is a topic of scientific interest and\nincreasing public concern. This report argues for, and exemplifies, a rigorous and empirically\ngrounded approach to AI consciousness: assessing existing AI systems in detail, in light of our\nbest-supported neuroscientific theories of consciousness. We survey several prominent scientific\ntheories of consciousness, including recurrent processing theory, global workspace theory, higher-\norder theories, predictive processing, and attention schema theory. From these theories we derive\n”indicator properties” 

In [10]:
print(type(pdf[0]))

<class 'langchain.schema.Document'>


<b>Hence, we see that all the document loaders create a standard Document object (similar to the object on which we used a QARetriever in the previous lesson).</b>
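
Because every loader returns the same kind of object, downstream code can treat text, PDF, and web content uniformly. The sketch below uses a hypothetical stand-in class, `SimpleDocument`, with the two attributes this notebook reads (`page_content` and `metadata`). It is not LangChain's `Document` class, and the sample values and the `preview` helper are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the page_content/metadata shape."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def preview(docs: List[SimpleDocument], width: int = 40) -> List[str]:
    """Return one short preview line per document, whatever its origin."""
    return [
        f"{d.metadata.get('source', '?')}: {d.page_content[:width]}"
        for d in docs
    ]


docs = [
    SimpleDocument("Plain text from a file.", {"source": "example.txt"}),
    SimpleDocument("Text extracted from a PDF page.", {"source": "example.pdf"}),
    SimpleDocument("Text scraped from a web page.", {"source": "https://example.com"}),
]
previews = preview(docs)
for line in previews:
    print(line)
```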

# Webpages

#### For web pages, we use SeleniumURLLoader, which relies on Selenium WebDriver to open each page in a browser such as Chrome or Firefox and extract its content
`!pip install -q unstructured==0.7.7 selenium==4.10.0`
<br>These version pins are needed only if you are using langchain==0.0.208, as mentioned in the course.

In [11]:
from langchain.document_loaders import SeleniumURLLoader

In [12]:
url_list=['https://news.ycombinator.com/item?id=37220667','https://news.ycombinator.com/item?id=37219779','https://news.ycombinator.com/item?id=37220744'] #3 of the top 4 Hacker News items

In [13]:
url_loader=SeleniumURLLoader(url_list,browser='firefox')

In [14]:
import time

In [15]:
start_time=time.time()
url_data=url_loader.load()
end_time=time.time()

In [16]:
print('Time taken to load the data:',end_time-start_time)

Time taken to load the data: 13.066612005233765
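
For timing a call like this, `time.perf_counter()` is generally a better fit than `time.time()` because it is a monotonic, high-resolution clock. A minimal sketch, in which `timed` is a hypothetical helper and `fake_load` stands in for `url_loader.load()`:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results: dict, label: str):
    """Record the elapsed seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def fake_load():
    """Placeholder for a slow call such as url_loader.load()."""
    time.sleep(0.05)
    return ["doc1", "doc2", "doc3"]


timings = {}
with timed(timings, "load"):
    loaded = fake_load()

print("Time taken to load the data:", timings["load"])
```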


In [17]:
len(url_data)

3

In [27]:
for sample in url_data:
    content=sample.page_content
    content=content.replace('\n\n','\n')
    print(sample.metadata)
    print(content[:300])

{'source': 'https://news.ycombinator.com/item?id=37220667'}
Hacker News
                            new | past | comments | ask | show | jobs | submit            
                              login
Why KPIs are destroying businesses (promaton.com)
          162 points by atorok 2 hours ago  | hide | past | favorite | 94 comments        
              
     
{'source': 'https://news.ycombinator.com/item?id=37219779'}
Hacker News
                            new | past | comments | ask | show | jobs | submit            
                              login
Arm Announces Public Filing for Proposed Initial Public Offering (arm.com)
          208 points by lultimouomo 5 hours ago  | hide | past | favorite | 162 commen
{'source': 'https://news.ycombinator.com/item?id=37220744'}
Hacker News
                            new | past | comments | ask | show | jobs | submit            
                              login
Consciousness in AI: Insights from the Science of Consciousness (arxiv.org)
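
The `replace('\n\n','\n')` call in the loop makes a single pass, so runs of three or more newlines are only partly collapsed and the long runs of spaces stay in place. A possible alternative, sketched with the standard `re` module (the helper name `tidy_whitespace` and the sample string are made up for illustration):

```python
import re


def tidy_whitespace(text: str) -> str:
    """Strip each line, collapse inner runs of spaces/tabs, drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "Hacker News\n\n\n      new | past | comments      \n        login\n\n"
cleaned = tidy_whitespace(raw)
print(cleaned)
```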
   