## Building a paper chatbot

Document-based question answering system  
LLM (HuggingFace model)  
RAG (information-retrieval agent)

## Package installation


In [12]:
# Install
!pip install langchain faiss-cpu transformers torch sentencepiece
!pip install sentence-transformers
!pip install -U langchain-community
!pip install pypdf

Collecting pypdf
  Using cached pypdf-6.1.1-py3-none-any.whl.metadata (7.1 kB)
Using cached pypdf-6.1.1-py3-none-any.whl (323 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.1.1


In [2]:

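# LangChain integrations (loaders, vector stores, embeddings, LLM wrappers) are imported from langchain_community
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline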
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline




In [3]:
# Load the PDF
loader = PyPDFLoader("chat_doc/DAEM-ERC.pdf")
docs = loader.load()

# Split the text into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Create the embeddings and the vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
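
To make `chunk_size=500` and `chunk_overlap=100` concrete, here is a minimal sketch of overlapping chunking. It is a simplification: it slides a fixed character window, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, so real chunk boundaries will differ. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical 1,200-character document made of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
pieces = split_with_overlap(sample)

print([len(p) for p in pieces])            # [500, 500, 400]
print(pieces[0][-100:] == pieces[1][:100])  # True: neighbours share 100 characters
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.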




In [25]:

# Load a local LLM (HuggingFace model)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=True,  # sampling must be enabled for temperature and top_p to take effect
    temperature=0.4,
    top_p=0.9,
    repetition_penalty=1.1,
)



Device set to use cpu
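
The `temperature` and `top_p` settings shape how the next token is sampled. In `transformers` they apply only when sampling is enabled (`do_sample=True`); with greedy decoding they are ignored. The sketch below is a conceptual illustration in plain Python, not the library's implementation, and the logits are made-up numbers for a hypothetical 4-token vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores, one per token
default = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.4)

print(sharp[0] > default[0])                     # True: the top token gains probability
print(sorted(top_p_filter(default, top_p=0.9)))  # [0, 1, 2]: the least likely token is dropped
```

With `temperature=0.4` most of the probability mass moves to the top-scoring token, which is why low temperatures give more deterministic answers.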


In [26]:
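# Question-answering functions
def tinyllama_chat_prompt(context, question):
    # The tokenizer builds the Llama-family chat template automatically.
    # The prompt text stays in Korean because the model is asked to answer in Korean.
    messages = [
        # System: "You are a document-based Korean QA model. Find evidence only in the given
        # context and answer concisely in one or two Korean sentences. If the answer is not in
        # the context, say '모르겠습니다.' (I don't know)."
        {"role": "system",
         "content": "너는 문서 기반 한국어 QA 모델이야. 주어진 문맥에서만 근거를 찾아 한국어 한두 문장으로 간결히 답해. 문맥에 없으면 '모르겠습니다.'라고 말해."},
        # User: "Answer the question based on the following context. [Context] ... [Question] ...
        # Rules: no unnecessary preamble, no unsupported guesses, output only one or two sentences."
        {"role": "user",
         "content": f"다음 문맥을 근거로 질문에 답해.\n\n[문맥]\n{context}\n\n[질문]\n{question}\n\n규칙: 불필요한 서론 금지, 근거 없는 추측 금지, 한두 문장만 출력."}
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)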

def ask(question, k=2):
    # 1) Retrieve the k most similar chunks
    rel = vectorstore.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(d.page_content for d in rel)

    # 2) Build the prompt (chat template)
    prompt = tinyllama_chat_prompt(context, question)

    # 3) Generate the answer
    out = pipe(prompt, return_full_text=False, max_new_tokens=256)[0]["generated_text"]
    return out
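
The retrieval step in `ask` can be pictured with a small self-contained sketch. It assumes ranking by Euclidean (L2) distance, which to my knowledge is the default for LangChain's FAISS wrapper, and it replaces the real embedding model with a hypothetical word-count embedding (`toy_embed`), so the scores are only illustrative. The documents and query are made up.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def toy_similarity_search(query, documents, vocabulary, k=2):
    """Return the k documents whose vectors are closest to the query vector (Euclidean distance)."""
    query_vec = toy_embed(query, vocabulary)
    ranked = sorted(documents, key=lambda doc: math.dist(query_vec, toy_embed(doc, vocabulary)))
    return ranked[:k]


documents = [
    "accuracy and f1 score are reported",
    "the model fuses text and audio features",
    "weighted f1 score is the main metric",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})

hits = toy_similarity_search("which metric f1 score", documents, vocabulary, k=2)
context = "\n\n---\n\n".join(hits)  # same separator as in ask()
print(context)
```

Only the two closest chunks reach the prompt, so a small `k` keeps the context short for a 1.1B model but risks missing the relevant passage.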


In [27]:
# Test
print(ask("사용한 성능지표가 뭐야?"))  # "Which performance metrics were used?"

The study aims to improve the performance of sentiment analysis models by using data augmentation techniques such as over-sampling for text, audio, and body sensor data. The researchers use different types of data (text, audio, and body sensor data) but apply the same over-sampling technique to all three datasets. They also propose a model that uses both text, audio, and body sensor data for classification, with each dataset having its own set of features. The performance of the model is evaluated using various metrics, including accuracy, precision, recall, F1 score, and weighted F1 score. The results show that the proposed model outperforms other models in terms of accuracy, precision, and F1 score, with a significant improvement in the case of the body sensor data. Additionally, the authors note that the use of over-sampling techniques can help alleviate the problem of class imbalance in the data and improve the overall performance of the model.
