## HuggingFace Datasets

- Use `HuggingFaceDatasetLoader` to load a Hugging Face dataset.
- The loaded dataset is converted into the document format that LangChain works with.
- This lets you use Hugging Face datasets together with LangChain's other features.
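The conversion described above can be approximated in plain Python. This is a minimal sketch, not the loader's actual code: `SimpleDocument` and `rows_to_documents` are hypothetical names, the sample rows are made up, and the JSON-encoding of the content column is an assumption based on the quoted `page_content` that appears in the loader output later in this notebook.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column):
    """Turn dataset rows (dicts) into documents.

    The chosen column becomes page_content; every other column
    is kept as metadata.
    """
    docs = []
    for row in rows:
        row = dict(row)  # copy so the caller's row is not mutated
        content = row.pop(page_content_column)
        # Assumption: the content is JSON-encoded, mirroring the quoted output.
        docs.append(SimpleDocument(page_content=json.dumps(content), metadata=row))
    return docs


# Made-up rows shaped like a text-classification dataset.
rows = [{"text": "A great film.", "label": 1}, {"text": "Not for me.", "label": 0}]
docs = rows_to_documents(rows, "text")
print(docs[0])
```

The point of the sketch is the split between content and metadata: one column is promoted to the document body, and the remaining columns (here `label`) travel along as metadata.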

In [1]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader

In [2]:
dataset_name = "imdb"  # Name of the dataset to load.
page_content_column = "text"  # Column whose values become the page content.

# Create the loader from the dataset name and the page-content column.
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

In [4]:
# !pip install datasets

In [5]:
data = loader.load()  # Load the dataset as a list of documents.

Downloading readme: 100%|██████████| 7.81k/7.81k [00:00<00:00, 7.81MB/s]
Downloading data: 100%|██████████| 21.0M/21.0M [00:02<00:00, 8.21MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:02<00:00, 8.35MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:04<00:00, 8.77MB/s]
Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 375576.58 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 438197.52 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 452048.40 examples/s]


In [6]:
# Select the first three documents.
data[:3]

[Document(metadata={'label': 0}, page_content='"I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered \\"controversial\\" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, th
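The `page_content` above starts with a double quote and contains escaped quotes, which suggests the text is stored as a JSON-encoded string. If that assumption holds, the content can be decoded before further processing. The following is a hedged sketch: `clean_page_content` is a hypothetical helper, and `raw` is a shortened, made-up sample in the same shape as the output rather than the real review.

```python
import json

# Shortened, hypothetical sample shaped like the page_content above.
raw = '"I really had to see this \\"controversial\\" film.<br /><br />The plot..."'


def clean_page_content(text):
    """Decode a JSON-encoded string and replace <br /> tags with newlines."""
    try:
        decoded = json.loads(text)
    except json.JSONDecodeError:
        decoded = text  # not JSON: keep the text unchanged
    if not isinstance(decoded, str):
        decoded = text  # only unwrap plain JSON strings
    return decoded.replace("<br />", "\n")


cleaned = clean_page_content(raw)
print(cleaned)
```

In the notebook this could be applied to each `doc.page_content` in `data`, falling back to the original text whenever decoding fails.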

In [7]:
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.document_loaders.hugging_face_dataset import (
    HuggingFaceDatasetLoader,
)

In [8]:
dataset_name = "tweet_eval"  # Name of the dataset to load.
page_content_column = "text"  # Column whose values become the page content.
name = "stance_climate"  # Name identifying the specific subset of the dataset.

# Create the loader for the chosen subset of the dataset.
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column, name)

Use the `VectorstoreIndexCreator` class to build a vector store index from the loader.

- Create an instance of `VectorstoreIndexCreator`.
- Call its `from_loaders` method with a list of loaders.
- A vector store index is built from the data extracted by the loaders.
- The resulting index is assigned to the `index` variable.
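To make the idea of a vector store index concrete, here is a toy version in plain Python. It is only a sketch of the general technique (embed texts, then rank them by cosine similarity to a query) and not how `VectorstoreIndexCreator` is implemented. `toy_embed`, `cosine`, and `ToyIndex` are hypothetical names, the bag-of-words embedding stands in for a real embedding model, and the sample texts are made up.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9#]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Stores (text, vector) pairs and returns the closest texts to a query."""

    def __init__(self, texts):
        self.entries = [(t, toy_embed(t)) for t in texts]

    def search(self, query, k=2):
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


# Made-up texts for illustration.
texts = [
    "Climate change is real #SemST",
    "We need clean energy now #SemST",
    "I had pasta for dinner",
]
toy_index = ToyIndex(texts)
print(toy_index.search("clean energy", k=1))
```

A real index replaces the count vectors with learned embeddings and the sorted list with a vector database, but the retrieve-by-similarity step has the same shape.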

In [12]:
# Manage API keys through environment variables stored in a settings file.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()

True
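Because the embedding and chat calls that follow depend on an API key, it can help to confirm the key is actually present in the environment before continuing. A minimal sketch, assuming the key is stored under `OPENAI_API_KEY`; `has_api_key` is a hypothetical helper, not part of `dotenv` or LangChain.

```python
import os


def has_api_key(name="OPENAI_API_KEY", environ=None):
    """Return True if the named variable is set to a non-empty value."""
    environ = os.environ if environ is None else environ
    return bool(environ.get(name, "").strip())


print(has_api_key())
```

The optional `environ` argument exists only so the check can be exercised with a plain dictionary instead of the real process environment.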

In [22]:
from langchain_openai import OpenAIEmbeddings, OpenAI, ChatOpenAI

In [14]:
embeddings = OpenAIEmbeddings()

In [15]:
# Build a vector store index from the loader.
index = VectorstoreIndexCreator(embedding=embeddings).from_loaders([loader])

Downloading readme: 100%|██████████| 23.9k/23.9k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 28.1k/28.1k [00:00<00:00, 68.4kB/s]
Downloading data: 100%|██████████| 14.9k/14.9k [00:00<00:00, 36.8kB/s]
Downloading data: 100%|██████████| 5.47k/5.47k [00:00<00:00, 14.1kB/s]
Generating train split: 100%|██████████| 355/355 [00:00<00:00, 177322.61 examples/s]
Generating test split: 100%|██████████| 169/169 [00:00<00:00, 56261.40 examples/s]
Generating validation split: 100%|██████████| 40/40 [00:00<00:00, 19970.50 examples/s]


In [23]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [24]:
query = "What are the most used hashtag?"  # Question to ask about the indexed tweets.
result = index.query(query, llm=llm)  # Run the query against the index and get the answer.

In [25]:
result

'The most used hashtags mentioned in the provided context are #SemST, #LoveWins, and #ThanksObama.'
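Note that the answer speaks of hashtags "mentioned in the provided context", which suggests it is based on the retrieved snippets and not on a count over the whole dataset. To cross-check such an answer, hashtags can be counted directly. This is a hedged sketch: `count_hashtags` is a hypothetical helper and `sample` is made-up data, not the contents of `tweet_eval`.

```python
import re
from collections import Counter


def count_hashtags(texts):
    """Count hashtags case-insensitively across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in re.findall(r"#\w+", text))
    return counts


# Hypothetical sample tweets for illustration.
sample = [
    "Act now on climate #SemST",
    "Love is love #LoveWins #SemST",
    "Gas prices are down #ThanksObama #semst",
]
print(count_hashtags(sample).most_common(2))
```

In the notebook, the same function could be given `(doc.page_content for doc in loader.load())` to count over every loaded document.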