<div style="width: 30%; float: right; margin: 10px; margin-right: 5%;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/FHNW_Logo.svg/2560px-FHNW_Logo.svg.png" width="500" style="float: left; filter: invert(50%);"/>
</div>

# Phi-2 Few-Shot learning

In diesem Notebook werden wir einen Chatbot für Schweizer Immobilien Empfehlungen mittels Few-Shot learning erstellen. <br>
Dabei verwenden wir das LLM [phi-2](https://huggingface.co/microsoft/phi-2 )von Microsoft.



---
Bearbeitet durch Si Ben Tran, Yannic Lais, Rami Tarabishi im HS 2023.<br>
Bachelor of Science FHNW in Data Science.

## Einleitung

### Allgemeines Vorgehen

- Name entity recognition auf den Prompt
- Entities werden für die Datenbankabfrage extrahiert
- Prompt wird mit den Trainingsexamples sowie der Datenbankabfrage an das Phi-2 Modell gesendet

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import spacy
import pandas as pd
import re

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
torch.cuda.is_available()

True

## Immobiliendaten

In [3]:
df = pd.read_csv('data\immo_data_202208.csv')

In [4]:
# clean price
df['price'] = df['price'].str.replace('CHF', '')
df['price'] = df['price'].str.replace(' ', '')
df['price'] = df['price'].str.replace('.', '')
df['price'] = df['price'].str.replace(',', '')
df['price'] = df['price'].str.replace('—', '')
# numeric, if error, set to nan
df['price'] = pd.to_numeric(df['price'], errors='coerce')

In [5]:
# clean location drop everything before the last number
df['location'] = df['location'].str.replace('.*\d', '')
# drop everything after a comma, including the comma
df['location'] = df['location'].str.replace(',.*', '')

In [6]:
df.head(10)

Unnamed: 0,price,type,location,detailed_description,Zip,url
0,1150000.0,penthouse,"5023 Biberstein, AG",DescriptionLuxuriöse Attika-Wohnung direkt an ...,5023,https://www.immoscout24.ch//en/d/penthouse-buy...
1,1420000.0,terrace-house,"Buhldenstrasse 8d5023 Biberstein, AG",DescriptionStilvolle Liegenschaft an ruhiger L...,5023,https://www.immoscout24.ch//en/d/terrace-house...
2,720000.0,penthouse,"5022 Rombach, AG","detail_responsive#description_title2,5 Zimmerw...",5000,https://www.immoscout24.ch//en/d/penthouse-buy...
3,1430000.0,detached-house,"Buhaldenstrasse 8A5023 Biberstein, AG",DescriptionDieses äusserst grosszügige Minergi...,5023,https://www.immoscout24.ch//en/d/detached-hous...
4,995000.0,flat,"5022 Rombach, AG",DescriptionAus ehemals zwei Wohnungen wurde ei...,5022,https://www.immoscout24.ch//en/d/flat-buy-romb...
5,2160000.0,detached-house,"Buchhalde 365018 Erlinsbach, AG",DescriptionDer Blick in die Weite vermittelt R...,5018,https://www.immoscout24.ch//en/d/detached-hous...
6,550000.0,terrace-house,"5023 Biberstein, AG",DescriptionZum Objekt:Kompakt und doch sehr ge...,5023,https://www.immoscout24.ch//en/d/terrace-house...
7,590000.0,flat,"5004 Aarau, AG",DescriptionNaturnah und doch am Zentrum diese ...,5000,https://www.immoscout24.ch//en/d/flat-buy-aara...
8,547000.0,flat,"Siebenmatten 495032 Aarau Rohr, AG",DescriptionDie Überbauung Siebenmatten in Aara...,5032,https://www.immoscout24.ch//en/d/flat-buy-aara...
9,1125000.0,stepped-house,"5018 Erlinsbach, AG","DescriptionTreten Sie ein, in Ihr neues, liebe...",5018,https://www.immoscout24.ch//en/d/stepped-house...


## Phi-2 Model

In [7]:
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", device_map="cuda", trust_remote_code=True)
model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

CPU Autocast only supports dtype of torch.bfloat16 currently.
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.85s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### Einzelner Input

In [8]:
inputs = tokenizer("Is a penguin a bird or a mamal?", return_tensors="pt").to('cuda')

# Generate outputs and decode
outputs = model.generate(**inputs, max_length=40)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text)

Is a penguin a bird or a mamal?
Answer: A penguin is a bird.

Exercise 2:
What is the difference between a simile and a metaphor


## Name entitiy recognition (NER)

In [9]:
nlp = spacy.load("en_core_web_sm")

In [10]:
prompt = "Hey, i'm looking for an appartement in Bern which costs less than 700'000 dollars. Can you help me?"

In [11]:
doc = nlp(prompt)
entities = {ent.label_: ent.text for ent in doc.ents}

In [12]:
print(entities)

{'ORG': 'Bern', 'MONEY': "less than 700'000 dollars"}


## Training Examples

In [13]:
few_shot_examples = [
    {
        "Question": "I am looking for an apartment in Zurich under 1'000'000 CHF.", 
        "Answer": "Here are some options for apartments in Zurich under 1'000'000 CHF: (max_price = 1000000, location_keyword = 'Zurich', property_type = 'apartment')"
    },
    {
        "Question": "Are there terraced houses in Bern in the CHF 500,000 to 700,000 range?",
        "Answer": "Yes, there are terraced houses in Bern in the CHF 500,000 to 700,000 range: (max_price = 700000, min_price = 500000, location_keyword = 'Bern', property_type = 'terraced_house')"
    },
    {
        "Question": "I need a detached house in Lucerne with a garden for around CHF 1,200,000.",
        "Answer": "In Lucerne you can find detached houses with a garden for around CHF 1,200,000: (location_keyword = 'Bern', property_type = 'house', arround_price = 1200000)"
    },
    {
        "Question": "Are modern apartments available in Basel for under CHF 900,000?",
        "Answer": "Modern apartments in Basel under 900'000 CHF are available: (max_price = 900000, location_keyword = 'Basel', property_type = 'apartment')"
    },
    {
        "Question": "I am looking for a large house in Lausanne, at least 5 rooms, up to 1'500'000 CHF.",
        "Answer": "Large houses in Lausanne with at least 5 rooms up to 1'500'000 CHF can be found here: (max_price = 1500000, location_keyword = 'Lausanne', property_type = 'house')"
    }

]


In [14]:
def filter_dataframe(df = df, max_price = None, min_price = None, arround_price = None, location_keyword = None, property_type = None):

    # Apply filters
    if arround_price:
        filtered_df = df[df['price'] <= arround_price * 1.1]
        filtered_df = df[df['price'] >= arround_price * 0.9]
    if max_price:
        filtered_df = df[df['price'] <= max_price]
    if min_price:
        filtered_df = df[df['price'] >= min_price]
    if location_keyword:
        filtered_df = filtered_df[filtered_df['location'].str.contains(location_keyword, case=False, na=False)]
    if property_type:
        filtered_df = filtered_df[filtered_df['type'].str.contains(property_type, case=False, na=False)]

    # Return 5 random samples
    if len(filtered_df) >= 5:
        return filtered_df.sample(n=5)
    else:
        return filtered_df

In [15]:
test = filter_dataframe(df, max_price = 700000, location_keyword = 'Basel')

In [16]:
test.head()

Unnamed: 0,price,type,location,detailed_description,Zip,url
2714,685000.0,flat,"Wiesenschanzweg 304057 Basel, BS",detail_responsive#description_titleDiese helle...,4057,https://www.immoscout24.ch//en/d/flat-buy-base...
2736,620000.0,flat,"4054 Basel, BS",DescriptionAuf der Suche nach einem neuen Zuha...,4051,https://www.immoscout24.ch//en/d/flat-buy-base...
2742,495000.0,flat,"Julia Gauss-Strasse 154056 Basel, BS",DescriptionObjektbeschrieb An idealer Lage im...,4056,https://www.immoscout24.ch//en/d/flat-buy-base...
3990,320000.0,detached-house,"Giassa Baselgia 47456 Sur, GR",Description3 Zimmer Maiensääs im Park Ela am F...,7460,https://www.immoscout24.ch//en/d/detached-hous...
2737,470000.0,flat,"In den Klostermatten 104052 Basel, BS",DescriptionIm Breite-Quartier wird diese heime...,4052,https://www.immoscout24.ch//en/d/flat-buy-base...


## Process Prompt

In [17]:
def get_model_response(query, model, tokenizer):
    # Load model and tokenizer
    model = model
    tokenizer = tokenizer

    # Format the input with few-shot examples
    prompt_text = "\n\n".join([f"Question: {ex['Question']}\nAnswer: {ex['Answer']}" for ex in few_shot_examples])
    prompt_text += f"\n\nQuestion: {query}"

    # Encode and send to model
    inputs = tokenizer(prompt_text, return_tensors="pt").to('cuda')
    outputs = model.generate(**inputs, max_length=600, num_return_sequences=1)

    # Decode the output
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extracting the answer corresponding to the specific query
    response_parts = full_response.split("Answer:")
    for i, part in enumerate(response_parts[:-1]):
        if f"Question: {query}" in part:
            return response_parts[i + 1].split("\n")[0].strip()

    return "No specific answer found."

In [18]:
# Example usage
query = "Show me a few appartements in Basel which cost less than 700'000 CHF."
response = get_model_response(query, model=model, tokenizer=tokenizer)
print(response)

Here are some appartements in Basel that cost less than 700'000 CHF: (max_price = 700000, location_keyword = 'Basel', property_type = 'apartment')
