<a href="https://www.kaggle.com/code/neesham/project-property-offer-suggestion?scriptVersionId=125685147" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

1. This project answers user questions about property deals with natural language processing: semantic search retrieves the relevant offers, a summarization model condenses them, and a question answering model extracts the answer.
2. The motivation is to help users find information about real estate offers quickly and accurately.
3. The tools used are the Hugging Face `datasets` and `transformers` libraries, FAISS for similarity search, and the Hugging Face Inference API for the summarization and question answering models.

<br>
<br>


> **Introducing my latest Kaggle notebook on answering user questions about property deals using cutting-edge natural language processing techniques! 📚💡**

> **Whether you're a student or a professional, this notebook is perfect for your school or college projects. 🎓🏫 It's packed with innovative approaches like semantic search, summarization, and question answering models to help users find accurate answers to property deal-related queries in a flash! 💼🔎**

> **From explaining the concept of semantic search to showcasing the results of our powerful models, this notebook covers it all! 📊🔍**
> **🚀 Don't miss out on this opportunity to impress your peers and teachers with a top-notch project.🌟 So go ahead, modify this notebook, and showcase it in your school and college projects to make a lasting impression! 💼🔍📚**



# Importing Packages

In [1]:
import numpy as np 
import pandas as pd 

# Installing datasets, evaluate, transformers and faiss (Facebook AI Similarity Search).

In [2]:
!pip install faiss-gpu
!pip install datasets evaluate transformers[sentencepiece]

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting protobuf<=3.20.2
  Downloading protobuf-3.20.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
Installing collected packages: protobuf, evaluate
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.3
    Uninstalling protobuf-3.20.3:
      Successfully uninstalled protobuf-3.20.3
ERROR:

In [3]:
df = pd.read_csv("/kaggle/input/property-offers-real-estate/propery_offers.csv")


In [4]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | offer_short_name | start_date | end_date | moreinfo_name | full_description | deal_days |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Exclusive Festival Offer | 2018-10-17 | 2018-10-31 | Fest Offer | Avail flat Rs.100/sft off on base price. Hurry... | 15 |
| 1 | 1 | No GST | 2018-12-10 | 2019-01-31 | No GST | Avail Zero GST offer on Ready to Possession un... | 53 |
| 2 | 2 | No GST | 2019-02-04 | 2019-03-31 | No GST | Avail Zero GST offer on Ready to Possession un... | 56 |
| 3 | 3 | No GST | 2019-04-02 | 2019-06-06 | No GST | Avail Zero GST offer on Ready to Possession un... | 66 |
| 4 | 4 | Pre EMI Offer | 2018-09-22 | 2018-10-05 | Pre EMI | Book your Dream Home in Prestige Group and Pay... | 14 |
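
In the rows shown above, `deal_days` appears to count both the start date and the end date. A minimal sketch that checks this reading against those five rows using only the standard library; `inclusive_days` is a hypothetical helper, not part of the notebook.

```python
from datetime import date

def inclusive_days(start, end):
    """Number of calendar days from start to end, counting both endpoints."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days + 1

# (start_date, end_date, deal_days) copied from the rows shown above.
rows = [
    ("2018-10-17", "2018-10-31", 15),
    ("2018-12-10", "2019-01-31", 53),
    ("2019-02-04", "2019-03-31", 56),
    ("2019-04-02", "2019-06-06", 66),
    ("2018-09-22", "2018-10-05", 14),
]

# Collect any row whose stored duration disagrees with the inclusive count.
mismatches = [row for row in rows if inclusive_days(row[0], row[1]) != row[2]]
print(mismatches)  # [] for these five rows
```

This only covers the five rows displayed here; the remaining rows of the dataset would need the same check before relying on it.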


In [5]:
df.duplicated().sum()

0

## Converting the pandas DataFrame to a Hugging Face Dataset
A `Dataset` object is easy to work with, and Hugging Face tokenizers and models can be applied to it directly.

In [6]:
from datasets import Dataset

property_dataset = Dataset.from_pandas(df)

property_dataset

Dataset({
    features: ['Unnamed: 0', 'offer_short_name', 'start_date', 'end_date', 'moreinfo_name', 'full_description', 'deal_days'],
    num_rows: 143
})

Concatenate the text fields into a single string per offer so that each offer gets one embedding vector covering all of its relevant data.


In [7]:

def concatenate_text(row):
    # Merge the description, dates, duration, and short name into one string per offer.
    return {"text": (
        f"{row['full_description']} The offer was started on {row['start_date']}."
        f" The offer was ended on {row['end_date']}."
        f" The offer was available for {row['deal_days']} days."
        f" The short name is {row['offer_short_name']}."
    )}


property_dataset = property_dataset.map(concatenate_text)

property_dataset

Dataset({
    features: ['Unnamed: 0', 'offer_short_name', 'start_date', 'end_date', 'moreinfo_name', 'full_description', 'deal_days', 'text'],
    num_rows: 143
})

### Result of concatenation

In [8]:
property_dataset['text'][0]

'Avail flat Rs.100/sft off on base price. Hurry Limited Period Offer. Offer is valid up to 31 Oct 2018 The offer was started on 2018-10-17. The offer was ended on 2018-10-31. The offer was available for 15 days. The short name is Exclusive Festival Offer.'

## Importing Model and Tokenizer from HuggingFace

In [9]:
from transformers import AutoTokenizer, TFAutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMPNetModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFMPNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMPNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


In [10]:
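def cls_pooling(model_output):
    # Use the final hidden state of the first token as the embedding of the whole text.
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    # Tokenize, padding to the longest text in the batch and truncating over-long inputs.
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    # Convert the tokenizer output to a plain dict of TensorFlow tensors.
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)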

# Code Explanation

1. **cls_pooling(model_output):** Takes the output of the transformer model (model_output) and returns the last hidden state of the first token of each input sequence. That first position holds the special start-of-sequence token (called [CLS] in BERT-style models), whose final hidden state is commonly used as a single-vector representation of the whole input.

2. **get_embeddings(text_list):** Takes a list of texts (text_list). The tokenizer encodes them with padding and truncation and returns TensorFlow tensors, which are unpacked into a dictionary and passed to the pre-trained model (model). The model produces a hidden state for every token in each sequence, and cls_pooling() reduces these to one vector per text, which serves as the final embedding of that text (768 dimensions for this checkpoint).
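
To make the indexing in `cls_pooling` concrete, here is a minimal NumPy sketch on a toy array that stands in for `model_output.last_hidden_state`. The array values and the name `cls_pooling_np` are made up for illustration; the real tensor has a hidden size of 768.

```python
import numpy as np

# Toy stand-in for last_hidden_state with shape (batch, tokens, hidden):
# 2 texts, 4 tokens each, hidden size 3.
last_hidden_state = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

def cls_pooling_np(hidden_states):
    """Keep only the first token's vector for every text in the batch."""
    return hidden_states[:, 0]

pooled = cls_pooling_np(last_hidden_state)
print(pooled.shape)  # (2, 3): one vector per text
print(pooled)
```

The slice `[:, 0]` drops the token axis, so each text is reduced from a matrix of token vectors to a single vector.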

## Debugging the Output

In [11]:
# embedding = get_embeddings(property_dataset['text'][0])

# embedding

# Now let's apply the function to the whole dataset.

> This will take some time so be patient 🙃.

In [12]:
embeddings_dataset = property_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)

In [13]:
# Debugging

embeddings_dataset

Dataset({
    features: ['Unnamed: 0', 'offer_short_name', 'start_date', 'end_date', 'moreinfo_name', 'full_description', 'deal_days', 'text', 'embeddings'],
    num_rows: 143
})

# Adding the FAISS index

In [14]:
embeddings_dataset.add_faiss_index(column='embeddings')

Dataset({
    features: ['Unnamed: 0', 'offer_short_name', 'start_date', 'end_date', 'moreinfo_name', 'full_description', 'deal_days', 'text', 'embeddings'],
    num_rows: 143
})
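
As a rough picture of what the index does at query time: a flat index compares the query vector with every stored vector and returns the closest ones. The NumPy sketch below assumes squared L2 distance, which is what a default flat FAISS index reports. `nearest_examples` and the toy vectors are hypothetical, and the code does not call FAISS itself.

```python
import numpy as np

def nearest_examples(index_vectors, query_vector, k=5):
    """Return (distances, row_ids) of the k rows closest to the query."""
    diffs = index_vectors - query_vector        # broadcast the query over all rows
    distances = np.sum(diffs * diffs, axis=1)   # squared L2 distance per row
    order = np.argsort(distances)[:k]           # smallest distance first
    return distances[order], order

# Three toy 2-dimensional "embeddings" and one query.
toy_index = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
toy_query = np.array([2.0, 2.0])

dists, ids = nearest_examples(toy_index, toy_query, k=2)
print(ids)    # row 1 is closest, then row 2
print(dists)  # squared distances 2.0 and 5.0
```

Under this metric a lower score means a closer match, which matters when sorting the retrieved samples.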

# Testing

In [15]:
question = "Diwali offer"

question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

(1, 768)

In [16]:
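# Search embeddings_dataset for the offers most relevant to the question (semantic search)

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
# Assuming the default FAISS index (L2 distance), a lower score means a closer match,
# so this descending sort lists the closest offer last.
samples_df.sort_values("scores", ascending=False, inplace=True)

result = samples_df['text']

# This will be used as the context for the question answering model
context = ""

for i in result:
    # Note: the texts are appended without a separator between offers.
    context += i
    print(i)
    print()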
    

Limited Time Festival Offer on price - Rs 4500/- per sft The offer was started on 2020-10-27. The offer was ended on 2020-11-30. The offer was available for 35 days. The short name is Festive Offer.

Discount upto Rs. 200/sft. The offer was started on 2019-09-04. The offer was ended on 2019-10-31. The offer was available for 58 days. The short name is Discount Offer.

Buy a plot, Get a Weekend Home FREE!!!
Per Sq.yrd price: 9999/- Fixed.
1BHK 350SFT Weekend Home we'll give it for free.
Loan Facility Available.
*Elevation is Tentative The offer was started on 2019-10-05. The offer was ended on 2019-10-31. The offer was available for 27 days. The short name is Diwali.

Exclusive Offer in this Festive season. Rs.200/ off on Basic price. Basic price is 5200/sft. Offer Valid up to Diwali.  The offer was started on 2018-10-23. The offer was ended on 2018-11-20. The offer was available for 29 days. The short name is Festival Offer.

Festive Offer - Rs. 200/sft OFF. The Great Way to Begin the 
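
The cell above appends the retrieved texts without a separator, so the end of one offer runs straight into the start of the next. A minimal sketch of an alternative, assuming the scores are distances (lower is closer); `build_context`, its `separator` parameter, and the toy data are hypothetical.

```python
def build_context(scores, texts, separator=" "):
    """Join retrieved texts into one context string, closest match first."""
    ranked = sorted(zip(scores, texts), key=lambda pair: pair[0])
    return separator.join(text.strip() for _, text in ranked)

# Toy distances and texts; not taken from the dataset.
toy_scores = [31.5, 28.1, 30.2]
toy_texts = ["Offer C.", "Offer A.", "Offer B."]

context_sketch = build_context(toy_scores, toy_texts)
print(context_sketch)  # Offer A. Offer B. Offer C.
```

Putting the closest offer first and separating offers cleanly may help an extractive question answering model, though the effect on this dataset has not been measured here.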

# Summarization Model

In [17]:
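import os
import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
# Read the API token from the environment instead of hard-coding it in the notebook.
# HF_API_TOKEN is an assumed variable name; set it to your own Hugging Face token.
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

for i in range(5):
    # result[i] looks up by index label, so result[0] is the first sample
    # returned by the search, not the first row after sorting.
    output = query({
        "inputs": result[i],
    })

    print(f">> Summary {i + 1} : ", output[0]['summary_text'])
    print()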




>> Summary 1 :  Festive Offer - Rs. 200/sft OFF. The Great Way to Begin the Festivals. The short name is Diwali. The offer was started on 2019-10-26. The Offer was ended on 2019/10-27. It was available for 2 days.

>> Summary 2 :  The short name is Festival Offer. The offer was available for 29 days. Basic price is 5200/sft. Offer Valid up to Diwali. Rs.200/ off on Basic price. Offer was started on 2018-10-23. The Offer was ended on2018-11-20.

>> Summary 3 :  Buy a plot, Get a Weekend Home FREE!!! Per Sq.yrd price: 9999/- Fixed.1BHK 350SFT Weekend Home we'll give it for free.Loan Facility Available.*Elevation is Tentative The offer was started on 2019-10-05.

>> Summary 4 :   discount upto Rs. 200/sft. The offer was started on 2019-09-04 and ended 2019-10-31. The short name is Discount Offer. The Offer was available for 58 days and was available to all customers. The discount was offered for a period of 58 days. It was only available to customers with a valid bank account.

>> Summary
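
The loop above indexes `output[0]['summary_text']` directly, which fails if the API returns anything other than a list of summaries. A minimal sketch of a defensive parser, assuming errors arrive as a dict with an `error` key (the shape the hosted Inference API typically uses, for example while a model is still loading); `extract_summary` is a hypothetical helper and makes no network call.

```python
def extract_summary(api_response):
    """Return the summary text, or None if the response is not a summary."""
    # Assumed error shape: a dict with an "error" key.
    if isinstance(api_response, dict) and "error" in api_response:
        return None
    # Expected success shape: a non-empty list whose first item holds the summary.
    if isinstance(api_response, list) and api_response:
        first = api_response[0]
        if isinstance(first, dict):
            return first.get("summary_text")
    return None

# Hand-written responses that mimic the two shapes; not real API output.
ok = extract_summary([{"summary_text": "Festive Offer - Rs. 200/sft OFF."}])
failed = extract_summary({"error": "model is loading"})
print(ok)
print(failed)
```

A caller could skip or retry the request when `None` comes back instead of raising a `KeyError` in the middle of the loop.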

# Question Answering Model

In [18]:
import os

question = "How much per square ft will cost on diwali offer."

API_URL = "https://api-inference.huggingface.co/models/deepset/roberta-base-squad2"
# HF_API_TOKEN is an assumed environment variable name holding your Hugging Face token.
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Ask the extractive QA model to find the answer span inside the retrieved context.
answer = query({
    "inputs": {
        "question": question,
        "context": context
    },
})

answer['answer']

'5200'
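
An extractive question answering model returns a span of the context, so the bare answer `'5200'` does not say which offer it came from. The sketch below assumes the response also carries `start` and `end` character offsets, as question answering pipeline output usually does. `answer_with_evidence` and its `window` parameter are hypothetical, and the toy response is built by hand rather than returned by the API.

```python
def answer_with_evidence(context, qa_response, window=30):
    """Return the answer plus the surrounding text it was extracted from."""
    start, end = qa_response["start"], qa_response["end"]
    snippet = context[max(0, start - window):end + window]
    return qa_response["answer"], snippet

toy_context = (
    "Rs.200/ off on Basic price. Basic price is 5200/sft. "
    "Offer Valid up to Diwali."
)
# Hand-built stand-in for an API response (offsets computed, not predicted).
start = toy_context.index("5200")
toy_response = {"answer": "5200", "start": start, "end": start + 4}

answer_text, evidence = answer_with_evidence(toy_context, toy_response)
print(answer_text)
print(evidence)
```

Showing the surrounding text lets the reader confirm that the number is a base price per square foot and see which offer it belongs to.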

# Conclusion
In this notebook, we used semantic search to go beyond keyword matching: each offer and each user question is embedded with a sentence-transformers model, and a FAISS index retrieves the offers closest to the question. Summarization (facebook/bart-large-cnn) and question answering (deepset/roberta-base-squad2) models, called through the Hugging Face Inference API, then turn the retrieved offers into concise summaries and direct answers.

We also used the Hugging Face `datasets` library to prepare the data: converting the DataFrame, concatenating the text fields, computing the embeddings, and adding the FAISS index.



### Thanks for reading this notebook. Upvote it if you found it useful 😇.
### Check out my other notebooks 🙃
* [XGBoost V/S LightGBM](https://www.kaggle.com/code/neesham/xgboost-v-s-lightgbm)
* [🔥 Pandas V/S SQL](https://www.kaggle.com/code/neesham/pandas-v-s-sql)
* [🔥 Transformers for Beginners (P1)](https://www.kaggle.com/code/neesham/transformers-for-beginners-p1)