In [6]:
from huggingface_hub import snapshot_download  # Function for downloading a dataset or model repository from the Hugging Face Hub

snapshot_download(  # Download a snapshot of a specific repository from the Hugging Face Hub
  repo_id='neural-bridge/rag-dataset-1200',  # ID of the repository to download
  repo_type='dataset',  # Repository type ('dataset')
  local_dir='./res/rag-custom',  # Local directory in which to store the downloaded data
  local_dir_use_symlinks=False  # Copy files instead of symlinking; recent huggingface_hub releases deprecate and ignore this argument
)

print()  # Empty print() to emit a blank line after the progress output

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/2.31k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.15k [00:00<?, ?B/s]

(…)-00000-of-00001-f0c158413defd454.parquet:   0%|          | 0.00/2.32M [00:00<?, ?B/s]

(…)-00000-of-00001-06d83c58a8ea10e8.parquet:   0%|          | 0.00/604k [00:00<?, ?B/s]
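The downloaded parquet files carry a hash suffix in their names, so a hard-coded path is brittle if the dataset is ever re-uploaded. A minimal sketch of locating a split by pattern instead, assuming the `data/<split>-*.parquet` layout shown in the listing above; `find_split_files` is a hypothetical helper, not part of `huggingface_hub`:

```python
from pathlib import Path


def find_split_files(local_dir, split):
    """Return the parquet files for one split, sorted by name.

    Assumes the snapshot stores its files as <local_dir>/data/<split>-*.parquet,
    which matches the file names in the download log above.
    """
    data_dir = Path(local_dir) / "data"
    return sorted(data_dir.glob(f"{split}-*.parquet"))


if __name__ == "__main__":
    # './res/rag-custom' is the local_dir used in the download cell.
    for path in find_split_files("./res/rag-custom", "test"):
        print(path)
```

The first element of the returned list can then be passed to `pd.read_parquet` in place of the literal file name.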




In [12]:
import pandas as pd  # pandas provides the DataFrame type and data analysis utilities

df = pd.read_parquet('./res/rag-custom/data/test-00000-of-00001-06d83c58a8ea10e8.parquet')  # Read the Parquet file into a pandas DataFrame
df  # Display the DataFrame contents

Unnamed: 0,context,question,answer
0,Trail Patrol Training\nWant to be a part of th...,What are some of the skills taught in the Trai...,The course teaches the essential skills necess...
1,"Lot Of Cbi Theater Ww2 Letters, 2 Newspapers, ...",Who was the original owner of the lot of items...,The original lot of items belonged to Lt. Neil...
2,Just.\nWe are a small all volunteer NGO Humani...,What is the main objective of Humanity Road as...,The main objective of Humanity Road is to 'clo...
3,One of two convicted killers who escaped from ...,Who were the two convicted killers that escape...,The two convicted killers that escaped from an...
4,(Continued from Part 1...)\n(Thirty years late...,Who was the person that came to help when Isaa...,Jesus was the person who came to help when Isa...
...,...,...,...
235,Inventor of the water-powered car screamed 'th...,Who was Stanley Meyer and what was his controv...,Stanley Meyer was one of the most controversia...
236,The Coleman Archive Volume 1: The Living Tradi...,What is the key instrument of choice in the ar...,The fiddle was the key instrument of choice in...
237,From November 1981 thru February 1983 the Mona...,What was the purpose of the monthly newsletter...,The monthly newsletter produced by the Monadno...
238,.\nDiscount Street Jackets More in the Main St...,What is the sale price of the 2117 of Sweden B...,The sale price of the 2117 of Sweden Bjorklide...


In [9]:
print(df.iloc[0]['context']) 

Trail Patrol Training
Want to be a part of the Trail Patrol ?? Join an Orientation & Hike on the 1st Tuesday of each month. This course is required for all PATC members interested in joining the PATC Trail Patrol.
The course teaches the essential skills necessary to be a trail patrol member and to provide a reassuring presence on the trail while teaching safety and environmental responsibility. A Trail Patrol handbook is provided to all students. Please bring a pencil, your hiking daypack & lunch.
More Info: View the Calendar or contact TP Training or visit the Trail Patrol Training web pages.
Hike Leader Class.
More Info: Contact Hike Leader Training or click here to register.
Backpacking Classes
Educating people in safe and environmentally friendly practices for traveling into the backcountry is one of Trail Patrol’s core responsibilities. We offer backpacking classes for novices seeking to take up backpacking as well as for experienced backpackers.
Backpacking 101: An Introductory C
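Printing a whole context produces a lot of output. When only a quick look is needed, the text can be shortened first. A small sketch using the standard library's `textwrap.shorten`; `preview` and its default width are illustrative choices, not something the notebook defines:

```python
import textwrap


def preview(text, width=200):
    """Collapse runs of whitespace (including newlines) and cut the
    result to at most `width` characters, ending with ' ...' if cut."""
    return textwrap.shorten(text, width=width, placeholder=" ...")


if __name__ == "__main__":
    sample = "Trail Patrol Training\nWant to be a part of the Trail Patrol ??"
    print(preview(sample, width=40))
```

In the notebook this would be called as `print(preview(df.iloc[0]['context']))`.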

In [10]:
# Inspect the dataset structure
print("DataFrame structure:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nFirst row sample:")
print(df.iloc[0])

DataFrame structure:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   context   240 non-null    object
 1   question  240 non-null    object
 2   answer    240 non-null    object
dtypes: object(3)
memory usage: 5.8+ KB

First row sample:
context     Trail Patrol Training\nWant to be a part of th...
question    What are some of the skills taught in the Trai...
answer      The course teaches the essential skills necess...
Name: 0, dtype: object


In [15]:
import pandas as pd
import pickle

# Convert each column to a list
contexts = df['context'].tolist()  # Contexts are used as-is, without preprocessing
questions = df['question'].tolist()
answers = df['answer'].tolist()

# Build the data dictionary
rag_data = {
    'questions': questions,
    'contexts': contexts,
    'contexts_answer_idx': [0] * len(df),  # Every entry is set to index 0
    'contexts_answers': df['context'].tolist(),  # Same text as 'contexts'
    'answers': answers
}

# Save as a pickle (.pkl) file
with open('./res/rag-custom.pkl', 'wb') as f:
    pickle.dump(rag_data, f)

# Check the saved data
print("Data statistics:")
print(f"Total rows: {len(df)}")
print(f"Number of questions: {len(rag_data['questions'])}")
print(f"Number of contexts: {len(rag_data['contexts'])}")
print(f"Number of answers: {len(rag_data['answers'])}")

print("\nFirst sample:")
print(f"Question: {rag_data['questions'][0]}")
print(f"Context: {rag_data['contexts'][0]}")
print(f"Answer: {rag_data['answers'][0]}")

Data statistics:
Total rows: 240
Number of questions: 240
Number of contexts: 240
Number of answers: 240

First sample:
Question: What are some of the skills taught in the Trail Patrol Training course?
Context: Trail Patrol Training
Want to be a part of the Trail Patrol ?? Join an Orientation & Hike on the 1st Tuesday of each month. This course is required for all PATC members interested in joining the PATC Trail Patrol.
The course teaches the essential skills necessary to be a trail patrol member and to provide a reassuring presence on the trail while teaching safety and environmental responsibility. A Trail Patrol handbook is provided to all students. Please bring a pencil, your hiking daypack & lunch.
More Info: View the Calendar or contact TP Training or visit the Trail Patrol Training web pages.
Hike Leader Class.
More Info: Contact Hike Leader Training or click here to register.
Backpacking Classes
Educating people in safe and environmentally friendly practices for traveling into the backcountry is one of Trail Patrol’s core responsibilities
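The cell above checks the in-memory dictionary, not the file it wrote. A minimal sketch of reading the pickle back and confirming that every per-sample list has the same length; `load_rag_data` is a hypothetical helper, and the check assumes every value in the dictionary is a list with one entry per sample, as in the cell above:

```python
import pickle
from pathlib import Path


def load_rag_data(path):
    """Load the pickled dict and verify that all lists have equal length.

    Only unpickle files you created yourself: pickle can execute
    arbitrary code while loading untrusted data.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)
    lengths = {key: len(value) for key, value in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Mismatched list lengths: {lengths}")
    return data


if __name__ == "__main__":
    pkl_path = Path("./res/rag-custom.pkl")  # path written by the cell above
    if pkl_path.exists():
        data = load_rag_data(pkl_path)
        print(sorted(data.keys()), len(data["questions"]))
```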