<div style="width: 30%; float: right; margin: 10px; margin-right: 5%;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/FHNW_Logo.svg/2560px-FHNW_Logo.svg.png" width="500" style="float: left; filter: invert(50%);"/>
</div>

# Phi-2 Few-Shot Learning

In this notebook we build a chatbot that recommends Swiss real estate using few-shot learning. <br>
We use Microsoft's LLM [phi-2](https://huggingface.co/microsoft/phi-2).



---
Prepared by Si Ben Tran, Yannic Lais, and Rami Tarabishi in the autumn semester (HS) 2023.<br>
Bachelor of Science FHNW in Data Science.

## Introduction

### General Approach

- Run named entity recognition on the prompt
- Extract the entities for the database query
- Send the prompt, together with the training examples and the database query, to the Phi-2 model

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import spacy
import pandas as pd
import re
from torchsummary import summary



In [2]:
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

Using device: cuda

NVIDIA GeForce RTX 4060 Laptop GPU
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB


## Real Estate Data

In [3]:
# read parquet file
df = pd.read_parquet('data/immo_data_202208.parquet')

In [4]:
# filter for important columns
# .copy() avoids pandas' SettingWithCopyWarning when adding the price column below
df = df[['Municipality', 'detailed_description', 'price_cleaned', 'type', 'rooms', 'url']].copy()
df['price'] = df['price_cleaned'].astype(float)
df = df.drop(columns=['price_cleaned'])
# drop rows with missing values
df = df.dropna()

In [5]:
df.head()

Unnamed: 0,Municipality,detailed_description,type,rooms,url,price
0,Biberstein,DescriptionLuxuriöse Attika-Wohnung direkt an ...,penthouse,5.0,https://www.immoscout24.ch//en/d/penthouse-buy...,1150000.0
1,Biberstein,DescriptionStilvolle Liegenschaft an ruhiger L...,terrace-house,5.0,https://www.immoscout24.ch//en/d/terrace-house...,1420000.0
3,Biberstein,DescriptionDieses äusserst grosszügige Minergi...,detached-house,5.0,https://www.immoscout24.ch//en/d/detached-hous...,1430000.0
4,Küttigen,DescriptionAus ehemals zwei Wohnungen wurde ei...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-romb...,995000.0
5,Erlinsbach (AG),DescriptionDer Blick in die Weite vermittelt R...,detached-house,5.0,https://www.immoscout24.ch//en/d/detached-hous...,2160000.0


## Phi-2 Model

In [6]:
# device_map="cuda" already places the model on the GPU, so no extra .to('cuda') is needed
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", device_map="cuda", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

CPU Autocast only supports dtype of torch.bfloat16 currently.
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.88s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
# Model summary (disabled)
# input_size = (10, 512)
# summary(model, input_size=input_size)

#### Single Input

In [8]:
inputs = tokenizer("Is a penguin a bird or a mamal?", return_tensors="pt").to('cuda')

# Generate outputs and decode
outputs = model.generate(**inputs, max_length=40)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text)

Is a penguin a bird or a mamal?
Answer: A penguin is a bird.

Exercise 2:
What is the difference between a simile and a metaphor


## Process Prompt

In [9]:
# Few-shot examples
few_shot_examples = [
    {
        "Question": "I am looking for a flat in Zurich under 1'000'000 CHF.", 
        "Answer": "Here are some options for apartments in Zurich under 1'000'000 CHF: (max_price = 1000000, location_keyword = Zürich, property_type = flat)"
    },
    {
        "Question": "Are there terraced houses in Bern in the CHF 500,000 to 700,000 range?",
        "Answer": "Yes, there are terraced houses in Bern in the CHF 500,000 to 700,000 range: (max_price = 700000, min_price = 500000, location_keyword = Bern, property_type = terrace-house)"
    },
    {
        "Question": "I need a detached house in Lucerne with a garden for around CHF 1,200,000.",
        "Answer": "In Lucerne you can find detached houses with a garden for around CHF 1,200,000: (location_keyword = Luzern, property_type = detached-house, arround_price = 1200000)"
    },
    {
        "Question": "Are there modern apartments with 3.5 rooms available in Basel for under CHF 900,000?",
        "Answer": "Modern apartments in Basel under 900'000 CHF are available: (max_price = 900000, location_keyword = Basel, property_type = flat, rooms = 3.5)"
    },
    {
        "Question": "I am looking for a large house in Lausanne, at least 5 rooms, up to 1'500'000 CHF.",
        "Answer": "Large houses in Lausanne with at least 5 rooms up to 1'500'000 CHF can be found here: (max_price = 1500000, location_keyword = Lausanne, property_type = house)"
    }

]


In [10]:
def filter_dataframe(df = df, max_price = None, min_price = None, arround_price = None, location_keyword = None, property_type = None, rooms = None):

    # Known property types (lowercased), plus the generic 'house'
    type_list = [str(x).lower() for x in df['type'].unique()]
    type_list.append('house')

    # Start from the full data and narrow it down filter by filter,
    # so that every condition is applied to the already filtered rows
    filtered_df = df
    if arround_price:
        # Keep prices within +/- 10% of the target price
        filtered_df = filtered_df[filtered_df['price'].between(arround_price * 0.9, arround_price * 1.1)]
    if max_price:
        filtered_df = filtered_df[filtered_df['price'] <= max_price]
    if min_price:
        filtered_df = filtered_df[filtered_df['price'] >= min_price]
    if location_keyword:
        filtered_df = filtered_df[filtered_df['Municipality'].str.contains(location_keyword, case=False, na=False)]
    # Only filter by type if one was given and it is a known type
    if property_type and property_type.lower() in type_list:
        filtered_df = filtered_df[filtered_df['type'].str.contains(property_type, case=False, na=False)]
    if rooms:
        filtered_df = filtered_df[filtered_df['rooms'] == rooms]

    # Return 5 random samples
    if len(filtered_df) >= 5:
        return filtered_df.sample(n=5, random_state = 42)
    else:
        return filtered_df
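
The filters are meant to narrow the result step by step, so each condition has to be applied to the rows that survived the previous one. The following is a minimal, self-contained sketch of that chaining on plain dictionaries instead of a DataFrame. The names `price_bounds` and `filter_listings`, the toy `listings`, and the 10% default tolerance are assumptions for illustration and are not part of the notebook.

```python
def price_bounds(around_price, tolerance=0.1):
    """Return the (low, high) price band around a target price."""
    return around_price * (1 - tolerance), around_price * (1 + tolerance)


def filter_listings(listings, around_price=None, max_price=None, min_price=None):
    """Apply every given price condition to the same, shrinking result list."""
    result = list(listings)
    if around_price is not None:
        low, high = price_bounds(around_price)
        result = [x for x in result if low <= x["price"] <= high]
    if max_price is not None:
        result = [x for x in result if x["price"] <= max_price]
    if min_price is not None:
        result = [x for x in result if x["price"] >= min_price]
    return result


# Hypothetical toy data, not taken from the dataset
listings = [{"price": 450_000}, {"price": 950_000}, {"price": 1_050_000}, {"price": 1_500_000}]
around = filter_listings(listings, around_price=1_000_000)
ranged = filter_listings(listings, min_price=400_000, max_price=1_000_000)
```

Because every step reads from `result` rather than from the original list, a combined `min_price` and `max_price` query keeps only the listings that satisfy both bounds.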

In [11]:
test = filter_dataframe(df, max_price = 1200000, location_keyword = 'Basel', property_type = 'flat')

In [12]:
test.head()

Unnamed: 0,Municipality,detailed_description,type,rooms,url,price
2742,Basel,DescriptionObjektbeschrieb An idealer Lage im...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,495000.0
2738,Basel,DescriptionDiese Wohnung im 3. Obergeschoss bi...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,650000.0
2671,Basel,DescriptionWollen Sie sich endlich Ihren Traum...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,1130000.0
2737,Basel,DescriptionIm Breite-Quartier wird diese heime...,flat,2.0,https://www.immoscout24.ch//en/d/flat-buy-base...,470000.0
2733,Basel,DescriptionBaujahr: 1961Renovation: 2022Wohnfl...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,695000.0


In [13]:
def get_model_response(query, model, tokenizer):
    # Format the input with few-shot examples
    prompt_text = "\n\n".join([f"Question: {ex['Question']}\nAnswer: {ex['Answer']}" for ex in few_shot_examples])
    prompt_text += f"\n\nQuestion: {query}"

    # Encode and send to the model (on the same device as the model)
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    # max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(**inputs, max_length=600, num_return_sequences=1)

    # Decode the output
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract the answer that follows the specific query:
    # the text before each "Answer:" is checked for the query,
    # and the first line after it is returned
    response_parts = full_response.split("Answer:")
    for i, part in enumerate(response_parts[:-1]):
        if f"Question: {query}" in part:
            return response_parts[i + 1].split("\n")[0].strip()

    return "No specific answer found."
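
The extraction step at the end of `get_model_response` can be checked without loading the model. The sketch below repeats the same splitting logic as a standalone function and runs it on a hand-written string. The name `extract_answer` and the sample text are hypothetical; they only illustrate why the first line after the matching `Answer:` is returned and why anything the model invents afterwards is ignored.

```python
def extract_answer(full_response, query):
    """Return the first line of the answer that follows the given query."""
    parts = full_response.split("Answer:")
    for i, part in enumerate(parts[:-1]):
        if f"Question: {query}" in part:
            return parts[i + 1].split("\n")[0].strip()
    return "No specific answer found."


# Hand-written stand-in for a decoded model output (not a real generation)
sample = (
    "Question: Example one?\nAnswer: First answer (max_price = 1)\n\n"
    "Question: My query?\nAnswer: Second answer (max_price = 2)\n\n"
    "Question: Invented follow-up?\nAnswer: Ignored"
)
answer = extract_answer(sample, "My query?")
missing = extract_answer(sample, "Unknown query?")
```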

In [14]:
def parse_and_filter(input_str, df):
    # Extract the parameter string between the first pair of parentheses
    match = re.search(r'\((.*?)\)', input_str)
    if match is None:
        return "No specific answer found."
    params_str = match.group(1)

    # Initialize parameters with default values
    params = {
        'df': df,
        'max_price': None,
        'min_price': None,
        'arround_price': None,
        'location_keyword': None,
        'property_type': None,
        'rooms': None
    }

    # Split the parameter string and iterate over each parameter
    for param in params_str.split(','):
        if '=' not in param:
            continue
        key, value = param.split('=', 1)
        key = key.strip()
        value = value.strip()

        # Ignore keys that filter_dataframe does not accept
        if key == 'df' or key not in params:
            continue
        # Convert numeric values; float also covers room counts such as 3.5
        if key in ['max_price', 'min_price', 'arround_price', 'rooms']:
            try:
                value = float(value)
            except ValueError:
                continue
        # Update the parameters dictionary
        params[key] = value

    # Call the filter_dataframe function with unpacked arguments
    return filter_dataframe(**params)
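
The parsing itself does not depend on the DataFrame, so it can be tried on a plain string. The following sketch assumes the answer format used in the few-shot examples, `(key = value, ...)`. The helper name `parse_params` and the set `NUMERIC_KEYS` are hypothetical names introduced here for illustration.

```python
import re

# Keys whose values are numbers (spelling as used in the few-shot examples)
NUMERIC_KEYS = {"max_price", "min_price", "arround_price", "rooms"}


def parse_params(answer):
    """Parse '(key = value, ...)' from a model answer into a dict, or return None."""
    match = re.search(r"\((.*?)\)", answer)
    if match is None:
        return None
    params = {}
    for item in match.group(1).split(","):
        key, sep, value = item.partition("=")
        if not sep:
            continue  # skip fragments without '='
        key, value = key.strip(), value.strip()
        # float() also accepts room counts such as 3.5
        params[key] = float(value) if key in NUMERIC_KEYS else value
    return params


parsed = parse_params("Modern apartments in Basel: (max_price = 900000, location_keyword = Basel, rooms = 3.5)")
no_params = parse_params("Yes, we have bread.")
```

An answer without a parenthesised parameter list, such as a reply to an off-topic question, yields `None` in this sketch instead of raising an exception.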

### Normal Query close to Examples

In [15]:
# Example usage
query = "Show me real estate in genf which cost less than 4'000'000 CHF."

response = get_model_response(query, model=model, tokenizer=tokenizer)
filtered_df = parse_and_filter(response, df)

print(response.split(':')[0])
filtered_df

Here are some options for real estate in Genf under 4'000'000 CHF


Unnamed: 0,Municipality,detailed_description,type,rooms,url,price
14086,Genf,"Description\n""Magnifique appartement rénové au...",Apartment,4rm,https://www.homegate.ch/buy/3001932055,3950000.0
14293,Genf,"Description\n""Attique meublé au centre ville a...",Apartment,2rm,https://www.homegate.ch/buy/3002030169,2900000.0
14294,Genf,"Description\n""Stresa Lac Majeur - Lago Maggior...",Apartment,2rm,https://www.homegate.ch/buy/3002017927,289000.0


#### Evaluation

In [16]:
# count the listings whose municipality contains the keyword 'Genf'
df[df['Municipality'].str.contains('Genf', case=False, na=False)].count()

Municipality            3
detailed_description    3
type                    3
rooms                   3
url                     3
price                   3
dtype: int64

### Query in German

In [17]:
query_german = "Zeige mir ein paar Wohnungen in Basel, die weniger als 1'200'000 CHF kosten."

response = get_model_response(query_german, model=model, tokenizer=tokenizer)
filtered_df = parse_and_filter(response, df)

print(response.split(':')[0])
filtered_df

Here are a few apartments in Basel that cost less than 1'200'000 CHF


Unnamed: 0,Municipality,detailed_description,type,rooms,url,price
21988,Basel,"Description\n""4,5-Zimmer Eigentumswohnung mit ...",Apartment,4.5rm,https://www.homegate.ch/buy/3002036691,1075000.0


#### Evaluation

The model automatically maps "Wohnungen" to apartments. The dataset contains further matching listings under other property_type values such as flat, apartment, etc.
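
One way to close this gap would be to expand a requested property type to all labels that mean the same thing before filtering. The mapping below is a hypothetical sketch: `TYPE_SYNONYMS` and `expand_property_type` are illustrative names, and the actual labels would have to be taken from `df['type'].unique()` first.

```python
# Hypothetical synonym groups; the real labels must be checked against the dataset
TYPE_SYNONYMS = {
    "flat": {"flat", "apartment"},
}


def expand_property_type(property_type):
    """Return all lowercase labels that should match the requested property type."""
    wanted = property_type.strip().lower()
    for labels in TYPE_SYNONYMS.values():
        if wanted in labels:
            return sorted(labels)
    return [wanted]


labels = expand_property_type("Apartment")
other = expand_property_type("penthouse")
```

The resulting list could then be compared against the lowercased `type` column, for example with `Series.isin`.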

In [20]:
query_german_df = filter_dataframe(df, max_price = 1200000, location_keyword = 'Basel', property_type = 'Flat')
query_german_df.head()

Unnamed: 0,Municipality,detailed_description,type,rooms,url,price
2742,Basel,DescriptionObjektbeschrieb An idealer Lage im...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,495000.0
2738,Basel,DescriptionDiese Wohnung im 3. Obergeschoss bi...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,650000.0
2671,Basel,DescriptionWollen Sie sich endlich Ihren Traum...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,1130000.0
2737,Basel,DescriptionIm Breite-Quartier wird diese heime...,flat,2.0,https://www.immoscout24.ch//en/d/flat-buy-base...,470000.0
2733,Basel,DescriptionBaujahr: 1961Renovation: 2022Wohnfl...,flat,5.0,https://www.immoscout24.ch//en/d/flat-buy-base...,695000.0


### Random Query with different Intent

In [18]:
query_random = "Hello do you have bread?"

response = get_model_response(query_random, model=model, tokenizer=tokenizer)
filtered_df = parse_and_filter(response, df)

print(response.split(':')[0])
filtered_df

Yes, we have bread.


'No specific answer found.'

#### Evaluation

### Improvements

- Also accept the English names of well-known cities (Genf -> Geneva)
- (Compare with another SLM)
- Test the limits of the model (let it generate code itself, let it add a price range of +/- 10% itself)
- Evaluate the model
- Connect to a RASA chatbot for intent recognition
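
The first point could be sketched as a small alias table that is applied to `location_keyword` before filtering. This is only an illustration under assumptions: `CITY_ALIASES` and `normalize_location` are hypothetical names, and of the spellings below only "Genf" is confirmed by the outputs in this notebook.

```python
# Assumed English -> dataset spelling; only 'Genf' appears in the outputs above
CITY_ALIASES = {"geneva": "Genf", "zurich": "Zürich", "lucerne": "Luzern"}


def normalize_location(location_keyword):
    """Translate a known English city name to the spelling assumed in the dataset."""
    cleaned = location_keyword.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)


normalized = normalize_location("Geneva")
unchanged = normalize_location("Basel")
```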

## Conclusion

The performance is surprisingly good. We were surprised by how simple the implementation is and by how fast the queries run.

### Outlook

- Much more data wrangling
- Integration into a RASA chatbot

## Insights

- The maximum length of the output was a problem at first; with the few-shot examples it no longer was