In [71]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [72]:
#!pip install langchain sentence-transformers faiss-cpu pypdf transformers torch langchain-community PyPDF2

In [73]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline#, AutoModelForSeq2SeqLM
from transformers import AutoModelForQuestionAnswering  # this is the one we will use
from langchain.chains import RetrievalQA

# YOU MAY IMPORT OTHER LIBRARIES, BUT REMEMBER TO INSTALL THEM IN THE CELL ABOVE!!!

# 0. Importing the PDF files

In [74]:
pdf_files = [  # I chose 3 articles analyzing the housing market
    '/content/drive/MyDrive/ml-zadanie4/housing_crisis_1.pdf',
    '/content/drive/MyDrive/ml-zadanie4/housing_crisis_2.pdf',
    '/content/drive/MyDrive/ml-zadanie4/housing_crisis_3.pdf'
]

all_data = []

# split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

for path in pdf_files:
    print(f"Loading {path}...")
    loader = PyPDFLoader(path)
    pages = loader.load()
    print(f"  -> loaded {len(pages)} pages")

    docs = text_splitter.split_documents(pages)
    print(f"  -> split into {len(docs)} chunks")

    all_data.extend(docs)

print(f"Loaded {len(all_data)} chunks")

Loading /content/drive/MyDrive/ml-zadanie4/housing_crisis_1.pdf...
  -> loaded 11 pages
  -> split into 135 chunks
Loading /content/drive/MyDrive/ml-zadanie4/housing_crisis_2.pdf...
  -> loaded 36 pages
  -> split into 453 chunks
Loading /content/drive/MyDrive/ml-zadanie4/housing_crisis_3.pdf...
  -> loaded 13 pages
  -> split into 166 chunks
Loaded 754 chunks


pypdf reported problems while parsing, but at worst part of a document will remain unloaded.
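
The effect of `chunk_size=300` and `chunk_overlap=100` can be illustrated with a minimal sketch. It is a simplification: `RecursiveCharacterTextSplitter` prefers to cut at the listed separators, whereas the hypothetical helper `sliding_chunks` below cuts at fixed character offsets. It only shows how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=100):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters.

    Hypothetical helper for illustration; not the LangChain implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # with the defaults, a new chunk starts every 200 characters
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 700-character text, used only to make the overlap visible.
sample = "".join(str(i % 10) for i in range(700))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])
```

With an overlap of one third of the chunk size, roughly a third of every chunk duplicates its neighbour. This overlap is also why neighbouring chunks returned by a search can repeat the same sentence.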

# 1. Vector database

In [75]:
# I chose the sentence-transformers/all-MiniLM-L6-v2 model

embed_model_name = 'sentence-transformers/all-MiniLM-L6-v2'

embeddings = HuggingFaceEmbeddings(model_name=embed_model_name)
# index all_data (chunks from all three PDFs); docs holds only the last file's chunks
db = FAISS.from_documents(all_data, embeddings)

In [76]:
# Save to disk

db.save_local("vector_db")

In [77]:
# semantic search check (top 3 results)

def retrieve_relevant_docs(query, top_k=3):
    similar_docs = db.similarity_search(query, k=top_k)  # the top_k most similar chunks
    return similar_docs

query = "What are the causes of housing crisis?"
similar_docs = retrieve_relevant_docs(query)



for doc in similar_docs:
    print(doc.page_content + "...\n---")

systemic challenges:  
• Incomes not keeping up with housing costs, and  
• The severe shortage of homes affordable and available to households with the 
lowest incomes.  
The Gap Between Incomes and Housing Costs 
A major cause of housing instability is the fundamental mismatch between growing...
---
A major cause of housing instability is the fundamental mismatch between growing 
housing costs, and comparatively stagnant incomes for people with the lowest incomes. 
Over one-third of extremely low-income renters are in the labor force (35%), while the...
---
essential to provide stable, affordable housing for those who the private sector cannot 
serve on their own: households with the lowest incomes. 
Underlying Causes of the Affordable Housing Crisis  
The United States is experiencing an affordable housing and homelessness crisis...
---


Interesting output. It looks relevant to the query, although the chunks are sometimes incomplete because they are cut off mid-sentence.
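
Conceptually, `similarity_search` embeds the query and returns the chunks whose vectors lie closest to it. Below is a minimal sketch of that ranking step, assuming Euclidean distance and using made-up 3-dimensional vectors in place of real embeddings (`all-MiniLM-L6-v2` produces 384-dimensional ones). The function name `top_k_nearest` and the toy data are hypothetical.

```python
import math


def top_k_nearest(query_vec, indexed, k=3):
    """Return the k texts whose vectors are closest to query_vec (brute force)."""
    scored = [(math.dist(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0])  # smaller distance = more similar
    return [text for _, text in scored[:k]]


# Toy "embeddings"; real ones would come from the embedding model.
indexed = [
    ("incomes lag behind housing costs", (0.9, 0.1, 0.0)),
    ("shortage of affordable homes", (0.8, 0.3, 0.1)),
    ("tax credit program details", (0.1, 0.9, 0.2)),
    ("appendix: methodology", (0.0, 0.1, 0.9)),
]
query_vec = (1.0, 0.2, 0.0)  # stands in for the embedded question

for text in top_k_nearest(query_vec, indexed, k=2):
    print(text)
```

A vector index such as FAISS performs the same kind of lookup, but is built to stay fast when there are many more vectors than in this toy list.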

# 2. Generation model

In [78]:
# I chose the deepset/bert-base-cased-squad2 model

model_name = 'deepset/bert-base-cased-squad2'

model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [89]:
# max_seq_len limits the combined question + context length in tokens for this pipeline
qa_pipeline = pipeline(task="question-answering", model=model, tokenizer=tokenizer, max_seq_len=512)
llm = HuggingFacePipeline(pipeline=qa_pipeline)

def answer_question(query):
    relevant_docs = retrieve_relevant_docs(query)

    answers = []

    for doc in relevant_docs:
        result = qa_pipeline(question=query, context=doc.page_content)
        answers.append({
            'answer': result['answer'],
            'score': result['score'],
            'context': doc
        })

    best_answer = max(answers, key=lambda x: x['score'])

    if best_answer['score'] < 0.2:
        return "Try asking a different question."
    if best_answer['score'] < 0.3:
        return best_answer['answer'] + " (This answer may be highly inaccurate!)"

    return best_answer['answer']

# question 1
query = "What causes the housing crisis?"
answer = answer_question(query)
print(f"AI Assistant: {answer}")

# question 2
query = "Who suffers from the housing crisis?"
answer = answer_question(query)
print(f"AI Assistant: {answer}")

# off-topic question
query = "What time is it?"
answer = answer_question(query)
print(f"AI Assistant: {answer}")

Device set to use cpu


AI Assistant: growing 
housing costs
AI Assistant: households with the 
lowest incomes
AI Assistant: Try asking a different question.
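
The answers above keep the line breaks of the PDF text (for example, `growing` and `housing costs` end up on separate lines). Below is a minimal sketch of the selection step of `answer_question`, separated from retrieval so it can be tried without loading a model. The helper name `pick_answer`, the empty-list guard and the whitespace normalization are additions for illustration; the 0.2 and 0.3 thresholds are the ones used above.

```python
def pick_answer(candidates, reject_below=0.2, warn_below=0.3):
    """Choose the highest-scoring candidate and apply the score thresholds.

    candidates: list of dicts with 'answer' and 'score' keys, shaped like
    the result of a question-answering pipeline call.
    """
    if not candidates:
        return "Try asking a different question."
    best = max(candidates, key=lambda c: c["score"])
    if best["score"] < reject_below:
        return "Try asking a different question."
    text = " ".join(best["answer"].split())  # collapse line breaks and repeated spaces
    if best["score"] < warn_below:
        return text + " (This answer may be highly inaccurate!)"
    return text


# Made-up scores, for illustration only.
print(pick_answer([
    {"answer": "growing \nhousing costs", "score": 0.62},
    {"answer": "stagnant incomes", "score": 0.35},
]))
print(pick_answer([{"answer": "noon", "score": 0.05}]))
```

Keeping the thresholds as parameters makes it easy to try other cut-off values without touching the retrieval code.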


# 3. "Chatbot" loop

In [92]:
print("Hello! Type 'end' to exit.\n")

while True:
    query = input("User: ")
    if query == "end":
        print("Goodbye!")
        break
    answer = answer_question(query)
    print(f"AI Assistant: {answer}")

Hello! Type 'end' to exit.

User: how much affordable housing shortage is it?
AI Assistant: 7.3 million
User: what country is experiencing the housing crisis?
AI Assistant: United States
User: who are most low-income renters?
AI Assistant: seniors
User: what are the available tax credits?
AI Assistant: Low-Income Housing Tax Credit 
Program
User: what percent of income is paid for housing?
AI Assistant: half
User: who needs to be supplied housing?
AI Assistant: lowest-income renters
User: what needs to be done?
AI Assistant: improvements to federal programs
User: how much people experience homelessness
AI Assistant: millions of the lowest-income and most 
marginalized households
User: end
Goodbye!


A brief analysis of the question series:

---

User: how much affordable housing shortage is it?
AI Assistant: 7.3 million

This model is good at finding numbers.

---

User: what country is experiencing the housing crisis?
AI Assistant: United States

This shows that the data is limited to these three documents.

---

User: what percent of income is paid for housing?
AI Assistant: half

Probably taken out of context.

---

User: what needs to be done?
AI Assistant: improvements to federal programs

Blunt.

---

# Summary

- The model answers questions correctly, provided they are phrased fairly precisely

- The score-based check (thresholds on the QA model's confidence score) helps to judge whether the model is starting to make up answers

- We are working with a very limited knowledge base (three documents), which limits the quality of the results

- The answers are essentially sentence fragments lifted, almost unchanged, from a chunk that the embedding model judged to be related to the question

- Next time I would pick a different topic; this one is broad and has somewhat "humanities-style" content, i.e. few concrete facts and many long sentences

- Unfortunately, my earlier attempts with Polish-language models did not succeed
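
One way to soften the "fragments out of context" issue would be to print each answer together with the text around it. The question-answering pipeline result also contains `start` and `end` character offsets into the context. The sketch below assumes those offsets and uses a hypothetical helper, `answer_in_context`, on a hard-coded example string instead of a real pipeline result.

```python
def answer_in_context(context, start, end, margin=40):
    """Return the answer span, marked with [ ], plus up to `margin` characters on each side."""
    left = max(0, start - margin)
    right = min(len(context), end + margin)
    snippet = context[left:start] + "[" + context[start:end] + "]" + context[end:right]
    snippet = " ".join(snippet.split())  # collapse line breaks from the PDF text
    prefix = "..." if left > 0 else ""
    suffix = "..." if right < len(context) else ""
    return prefix + snippet + suffix


context = ("A major cause of housing instability is the fundamental mismatch "
           "between growing housing costs, and comparatively stagnant incomes.")
answer = "growing housing costs"
start = context.index(answer)  # a real pipeline result would supply start and end
end = start + len(answer)
print(answer_in_context(context, start, end, margin=20))
```

In `answer_question`, the same helper could be called with `doc.page_content`, `result['start']` and `result['end']` for the best-scoring chunk.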