Fine-Tuning a Pre-Trained BERT Model for Intent Classification

This code fine-tunes a pre-trained BERT model, specifically ParsBERT, for intent classification on a custom dataset. The dataset consists of text samples with corresponding labels, where each label represents a specific intent.

The code defines a custom dataset class IntentDataset to load and preprocess the data, and a custom classifier model ParsBertClassifier that builds upon the pre-trained BERT model. The classifier adds a dropout layer and a linear layer on top of BERT's pooled output and returns one raw score (logit) per intent class; the cross-entropy loss applies the softmax internally during training.
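
The linear layer emits one unnormalized score (logit) per intent class rather than a probability. A minimal sketch of the softmax step that turns such scores into probabilities, written in plain Python so that it runs without the model; the logit values are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two intent classes
example_logits = [1.2, -0.3]
probs = softmax(example_logits)
print(probs)
```

On a batch of PyTorch logits, `torch.softmax(outputs, dim=1)` performs the same normalization.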

The code then loads the pre-trained BERT model and tokenizer, creates a dataset and data loader for the training data, and defines the training parameters, including the optimizer and learning rate.
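
Before samples can be batched by the data loader, every tokenized text has to have the same length. A simplified sketch of what padding to a fixed length and building the attention mask amount to; the token ids are invented, `pad_or_truncate` is a hypothetical helper, and a real tokenizer additionally keeps its special tokens when truncating:

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                 # truncate long sequences
    padding = [pad_id] * (max_len - len(kept)) # pad short sequences
    input_ids = kept + padding
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask

print(pad_or_truncate([2, 845, 912, 4], max_len=6))
print(pad_or_truncate([2, 845, 912, 77, 31, 4], max_len=4))
```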

The model is trained for 3 epochs using the Adam optimizer and cross-entropy loss. After each epoch, the average training loss is printed to track progress; no separate validation set is used.

After training, the fine-tuned model can be used to classify the intent of new, unseen text, although the three-sentence training set used here is only a toy example.

In [None]:
!pip install -qU hazm

In [None]:
from transformers import AutoConfig, AutoTokenizer, AutoModel
from torch.utils.data import DataLoader, Dataset
import torch

config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")



# Define a custom dataset for loading data
class IntentDataset(Dataset):
    """Pairs each text with its intent label and tokenizes it on access."""

    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = str(self.texts[item])
        label = self.labels[item]
        # Pad or truncate every sample to max_len so batches stack cleanly
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'text': text,
            # flatten() drops the batch dimension added by return_tensors='pt'
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }



import torch.nn as nn

class ParsBertClassifier(nn.Module):
    """ParsBERT encoder followed by dropout and a linear classification head."""

    def __init__(self, bert, num_classes):
        super().__init__()
        self.bert = bert  # pre-trained encoder, passed in instead of read from a global
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs.pooler_output)
        return self.classifier(pooled_output)  # raw logits, one per intent class



# Load tokenizer and data
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
train_texts = [
    "سفارش من ارسال شده است؟",    # "Has my order been shipped?"
    "تقاضای عودت سفارش را دارم",  # "I would like to return my order"
    "بازگشت محصول",               # "Product return"
]
train_labels = [0, 1, 1]  # 0: GetOrderStatus, 1: RequestRefund

train_dataset = IntentDataset(train_texts, train_labels, tokenizer, max_len=32)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

# Load the pre-trained encoder and wrap it in the classifier
bert = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

num_classes = 2  # Replace with the actual number of classes
model = ParsBertClassifier(bert, num_classes)


# Define training parameters
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-5)


loss_fn = nn.CrossEntropyLoss()  # expects raw logits and integer class labels

for epoch in range(3):
    model.train()
    total_loss = 0.0
    for batch in train_loader:
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    # Average training loss over the batches of this epoch
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}')


print("Model fine-tuned successfully.")


Epoch 1, Loss: 0.8195169866085052
Epoch 2, Loss: 0.5882050693035126
Epoch 3, Loss: 0.40464162826538086
Model fine-tuned successfully.
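
The loss printed above is computed on the same three sentences the model is trained on, so it says little about unseen queries. A minimal sketch, assuming a larger labeled dataset were available, of holding out a validation split and scoring predictions by accuracy; `split_dataset` and `accuracy` are hypothetical helpers, not part of the notebook, and the data is a placeholder:

```python
import random

def split_dataset(texts, labels, val_fraction=0.2, seed=0):
    """Shuffle paired texts/labels and split them into train and validation parts."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed for a reproducible split
    n_val = max(1, int(len(pairs) * val_fraction))
    return pairs[n_val:], pairs[:n_val]

def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    if not expected:
        raise ValueError("expected labels must not be empty")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Placeholder data standing in for a real labeled intent dataset
demo_texts = [f"query {i}" for i in range(10)]
demo_labels = [i % 2 for i in range(10)]
train_pairs, val_pairs = split_dataset(demo_texts, demo_labels)
print(len(train_pairs), len(val_pairs))
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```

The validation pairs would be run through the model after each epoch, without gradient updates, and their accuracy tracked alongside the training loss.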


In [None]:
labels

tensor([0, 1])

Model Selection and Training

Select a Pretrained Model: Use a pretrained model like BERT, RoBERTa, or DistilBERT from the Hugging Face Transformers library.
Fine-Tune the Model: Fine-tune the pretrained model on your labeled dataset to adapt it to your specific intents.
The training cell above shows an example using Hugging Face Transformers and PyTorch.

Inference

Predict the Intent: Once the model is trained, use it to predict the intent of new user queries.

In [None]:
def predict_intent(model, tokenizer, text):
    # Tokenize with the same settings used for the training data
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=32,
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():  # no gradients are needed for inference
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    _, prediction = torch.max(outputs, dim=1)

    return prediction.item()

#user_query = "وضعیت سفارش من چگونه است؟"  # "What is the status of my order?"
user_query ="چگونه کالا را بازگشت بدهم؟"  # "How do I return the item?"
#user_query ="چگونه کالا را برگردانم؟"  # "How do I send the item back?"
#user_query ="سفارش من چه زمانی به دست من میرسد؟"  # "When will my order reach me?"
#user_query ="چگون با واحد پشتیبانی صحبت کنم"  # "How can I talk to the support team?"
intent_id = predict_intent(model, tokenizer, user_query)
intents = {0: "GetOrderStatus", 1: "RequestRefund"}
print(f'Intent: {intents[intent_id]}')


Intent: GetOrderStatus
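
The sample query asks how to return an item, yet the toy model answers GetOrderStatus; with only three training sentences that is not surprising. One common safeguard, sketched here under the assumption that the class scores have already been converted to probabilities, is to fall back to a clarifying response when the top probability is low. `choose_intent`, the threshold value, and the "Unknown" label are illustrative assumptions:

```python
def choose_intent(probabilities, intents, threshold=0.7):
    """Return the most probable intent name, or "Unknown" if confidence is low."""
    best_id = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_id] < threshold:
        return "Unknown"  # caller can ask the user to rephrase
    return intents[best_id]

intents = {0: "GetOrderStatus", 1: "RequestRefund"}
print(choose_intent([0.55, 0.45], intents))  # low confidence
print(choose_intent([0.10, 0.90], intents))  # confident prediction
```

A suitable threshold would have to be tuned on held-out data rather than taken from this sketch.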


Entity Extraction
What is Entity Extraction?
Entity extraction (or named entity recognition, NER) involves identifying and classifying key pieces of information (entities) within a user’s input. For example, in “Track order 12345”, “12345” would be the entity recognized as an OrderID.

Steps to Implement Entity Extraction
Data Preparation

Collect and Label Data: Annotate training sentences with entities. For example:
Sentence: "Track order 12345"
Entities: {"OrderID": "12345"}
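
Token classification models are usually trained on one label per token rather than on a sentence-level annotation like the one above. A minimal sketch of converting such an annotation into BIO tags (B- for the first token of an entity, I- for continuation tokens, O for everything else), assuming simple whitespace tokenization; `to_bio_tags` is a hypothetical helper:

```python
def to_bio_tags(sentence, entities):
    """Tag whitespace tokens with BIO labels for the given {type: text} entities."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    for entity_type, entity_text in entities.items():
        entity_tokens = entity_text.split()
        n = len(entity_tokens)
        if n == 0:
            continue  # nothing to tag for an empty entity string
        # Slide over the sentence and mark the first exact match of the entity
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == entity_tokens:
                tags[start] = f"B-{entity_type}"
                for k in range(start + 1, start + n):
                    tags[k] = f"I-{entity_type}"
                break
    return list(zip(tokens, tags))

print(to_bio_tags("Track order 12345", {"OrderID": "12345"}))
```

Subword tokenizers such as BERT's split words further, so in practice these word-level tags still need to be aligned with the word pieces.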

Model Selection and Training

Select a Pretrained NER Model: Use a pretrained token classification model, such as a BERT-based NER model from Hugging Face or a spaCy NER pipeline.
Fine-Tune: Fine-tune the model on your custom dataset if specific entity types are needed.
The cells below show a simple regex baseline for order numbers, followed by a pretrained ParsBERT token classification model from Hugging Face Transformers:

In [None]:
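import re

def extract_order_number(text):
    """Return the first standalone 5-digit number in text, or an error message."""
    pattern = r'\b\d{5}\b'  # exactly five digits, not part of a longer word or number
    match = re.search(pattern, text)
    if match:
        return match.group()
    else:
        return "Invalid order number. Please enter a 5-digit number."

# Test the function
# English: "The tracking code of my order for a coffee maker is 52374"
text = "کد رهگیری سفارش من برای خرید یک دستگاه قهوه ساز 52374 است "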
# texts = [
#    "مدیرکل محیط زیست استان البرز با بیان اینکه با بیان اینکه موضوع شیرابه‌های زباله‌های انتقال یافته در منطقه حلقه دره خطری برای این استان است، گفت: در این مورد گزارشاتی در ۲۵ مرداد ۱۳۹۷ تقدیم مدیران استان شده است.",
#    "به گزارش خبرگزاری تسنیم از کرج، حسین محمدی در نشست خبری مشترک با معاون خدمات شهری شهرداری کرج که با حضور مدیرعامل سازمان‌های پسماند، پارک‌ها و فضای سبز و نماینده منابع طبیعی در سالن کنفرانس شهرداری کرج برگزار شد، اظهار داشت: ۸۰٪  جمعیت استان البرز در کلانشهر کرج زندگی می‌کنند.",
#    "وی افزود: با همکاری‌های مشترک بین اداره کل محیط زیست و شهرداری کرج برنامه‌های مشترکی برای حفاظت از محیط زیست در شهر کرج در دستور کار قرار گرفته که این اقدامات آثار مثبتی داشته و تاکنون نزدیک به ۱۰۰ میلیارد هزینه جهت خریداری اکس-ریس شیراز صورت گرفته است.",
# ]
order_number = extract_order_number(text)
print(order_number)

52374
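
In Python 3, `\d` in a `str` pattern also matches non-ASCII decimal digits, so an order number typed with Persian digits would be matched but returned in that script. A minimal sketch, assuming downstream systems expect ASCII digits, that normalizes Persian and Arabic-Indic digits before matching; `extract_order_number_ascii` is a hypothetical variant of the function above that returns None when nothing is found:

```python
import re

# Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to ASCII
_DIGIT_TABLE = {0x06F0 + d: str(d) for d in range(10)}
_DIGIT_TABLE.update({0x0660 + d: str(d) for d in range(10)})

def extract_order_number_ascii(text):
    """Return the first standalone 5-digit number as ASCII digits, or None."""
    normalized = text.translate(_DIGIT_TABLE)
    match = re.search(r"\b\d{5}\b", normalized)
    return match.group() if match else None

# Build a Persian-digit version of "52374" from code points for clarity
persian_number = "".join(chr(0x06F0 + int(c)) for c in "52374")
print(extract_order_number_ascii(f"order {persian_number} please"))
print(extract_order_number_ascii("order 52374 please"))
print(extract_order_number_ascii("no number here"))
```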


In [None]:
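import hazm
import tensorflow as tf  # the NER functions below use the TensorFlow model class
from IPython.display import HTML, display
from transformers import AutoConfig, AutoTokenizer, TFAutoModelForTokenClassification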

ner_translate = {
    "B-date": "تاریخ",
    "B-event": "رویداد",
    "B-facility": "امکانات",
    "B-location": "موقعیت",
    "B-money": "پول",
    "B-organization": "سازمان",
    "B-person": "شخص",
    "B-product": "محصول",
    "B-time": "زمان",
    "B-percent": "درصد",
    "I-date": "تاریخ",
    "I-event": "رویداد",
    "I-facility": "امکانات",
    "I-location": "موقعیت",
    "I-money": "پول",
    "I-organization": "سازمان",
    "I-person": "شخص",
    "I-product": "محصول",
    "I-time": "زمان",
    "I-percent": "درصد",
    "O": None
}

normalizer = hazm.Normalizer()

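def cleanize(text):
    """Normalize the text with hazm (further cleaning steps could be added here)."""
    return normalizer.normalize(text)

def parsbert_ner_load_model(model_name):
    """Load the token classification model, its tokenizer, and its label list.

    Returns (None, None, None) if anything fails to load.
    """
    try:
        config = AutoConfig.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = TFAutoModelForTokenClassification.from_pretrained(model_name)
        labels = list(config.label2id.keys())

        return model, tokenizer, labels
    except Exception as exc:  # report the failure instead of silently hiding it
        print(f"Could not load {model_name!r}: {exc}")
        return None, None, None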

def parsbert_ner(texts, model_name, label_translate, visualize=True):
    """Predict and visualize the NER!"""
    global css_is_load

    css_is_load = False
    css = """<style>
    .ner-box {
        direction: rtl;
        font-size: 18px !important;
        line-height: 20px !important;
        margin: 0 0 15px;
        padding: 10px;
        text-align: justify;
        color: #343434 !important;
    }
    .token, .token span {
        display: inline-block !important;
        padding: 2px;
        margin: 2px 0;
    }
    .token.token-ner {
        background-color: #f6cd61;
        font-weight: bold;
        color: #000;
    }
    .token.token-ner .ner-label {
        color: #9a1f40;
        margin: 0px 2px;
    }
    </style>"""

    if not css_is_load:
        display(HTML(css))
        css_is_load = True

    model, tokenizer, labels = parsbert_ner_load_model(model_name)

    if not model or not tokenizer or not labels:
        # Loading failed; return a message instead of raising
        return 'Something went wrong while loading the model!'

    output_predictions = []
    for sequence in texts:
        sequence = cleanize(sequence)
        tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
        inputs = tokenizer.encode(sequence, return_tensors="tf")
        outputs = model(inputs)[0]
        predictions = tf.argmax(outputs, axis=2)
        predictions = [(token, label_translate[labels[prediction]]) for token, prediction in zip(tokens, predictions[0].numpy())]

        if not visualize:
            output_predictions.append(predictions)
        else:
            pred_sequence = []
            for token, label in predictions:
                if token not in ['[CLS]', '[SEP]']:
                    if label:
                        pred_sequence.append(
                            '<span class="token token-ner">%s<span class="ner-label">%s</span></span>'
                            % (token, label))
                    else:
                        pred_sequence.append(
                            '<span class="token">%s</span>'
                            % token)

            html = '<p class="ner-box">%s</p>' % ' '.join(pred_sequence)
            display(HTML(html))

    return output_predictions




In [None]:
model_name = 'HooshvareLab/bert-base-parsbert-armanner-uncased'
# parsbert_ner expects a list of texts and the model *name*; it normalizes each text itself
x = parsbert_ner([text], model_name, ner_translate, visualize=True)

In [None]:
# The returned list stays empty when visualize=True; pass visualize=False to get (token, label) pairs
print(x)


In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

# Note: this is the base ParsBERT checkpoint, which has no fine-tuned token
# classification head, so the labels it predicts are not meaningful.
pipe = pipeline("ner", model="HooshvareLab/bert-base-parsbert-uncased")

text = "شماره سفارش من 123456 است"  # "My order number is 123456"
named_entities = pipe(text)
print(named_entities)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity': 'LABEL_0', 'score': 0.6116472, 'index': 1, 'word': 'شماره', 'start': 0, 'end': 5}, {'entity': 'LABEL_0', 'score': 0.66270363, 'index': 2, 'word': 'سفارش', 'start': 6, 'end': 11}, {'entity': 'LABEL_0', 'score': 0.62693554, 'index': 3, 'word': 'من', 'start': 12, 'end': 14}, {'entity': 'LABEL_0', 'score': 0.7503306, 'index': 4, 'word': '123456', 'start': 15, 'end': 21}, {'entity': 'LABEL_0', 'score': 0.5664961, 'index': 5, 'word': 'است', 'start': 22, 'end': 25}]
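
Every token above receives the placeholder label LABEL_0 because, as the warning says, the classification head of the base checkpoint is newly initialized. With a fine-tuned model, the same list-of-dicts structure carries real labels, and a small filter can keep only confident predictions. A sketch over hand-written sample data; `keep_confident_entities`, the B-ORDERID label, and the scores are hypothetical:

```python
def keep_confident_entities(predictions, min_score=0.8, ignore_labels=("O",)):
    """Filter pipeline-style predictions by label and confidence score."""
    return [
        {"word": p["word"], "entity": p["entity"], "score": p["score"]}
        for p in predictions
        if p["entity"] not in ignore_labels and p["score"] >= min_score
    ]

# Hand-written predictions in the same shape as the pipeline output above
sample_predictions = [
    {"entity": "O", "score": 0.98, "index": 1, "word": "order", "start": 0, "end": 5},
    {"entity": "B-ORDERID", "score": 0.95, "index": 2, "word": "12345", "start": 6, "end": 11},
    {"entity": "B-ORDERID", "score": 0.40, "index": 3, "word": "now", "start": 12, "end": 15},
]
print(keep_confident_entities(sample_predictions))
```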


Inference

Extract Entities: Use the trained model to identify entities in new inputs.

In [None]:
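def extract_entities(model, tokenizer, text):
    # `model` may be a checkpoint name or a token classification model object
    nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)
    return nlp_ner(text)

# The intent classifier cannot be used here: the NER pipeline needs a token
# classification checkpoint, so the ArmanNER ParsBERT model is loaded by name
# (with tokenizer=None, the pipeline loads the matching tokenizer itself).
# This checkpoint was not trained on an OrderID label; tagging order numbers
# would require a model fine-tuned on that entity type.
ner_model_name = 'HooshvareLab/bert-base-parsbert-armanner-uncased'
user_query = "Can you track order 12345?"
entities = extract_entities(ner_model_name, None, user_query)
print(entities)  # Outputs the entities identified in the text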

Integrating Intent Recognition and Entity Extraction
After implementing both intent recognition and entity extraction, integrate them to handle user inputs effectively:

Predict the Intent: First, determine what the user wants.
Extract Entities: Then, identify the key information needed to fulfill the user’s request.
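
Once both pieces exist, the intent decides which action to take and the entities supply its arguments. A minimal sketch of that routing step, using plain dictionaries in place of model outputs; `route_request`, the action names, and the B-ORDERID label are illustrative assumptions:

```python
def route_request(intent, entities):
    """Combine a predicted intent with extracted entities into an action."""
    # Collect the words tagged as order IDs (the label name is an assumption)
    order_ids = [e["word"] for e in entities if e["entity"].endswith("ORDERID")]
    if intent in ("GetOrderStatus", "RequestRefund"):
        if not order_ids:
            # The intent is clear but a required piece of information is missing
            return {"action": "ask_for_order_id", "intent": intent}
        action = "lookup_status" if intent == "GetOrderStatus" else "start_refund"
        return {"action": action, "order_id": order_ids[0]}
    return {"action": "fallback"}

print(route_request("GetOrderStatus", [{"word": "12345", "entity": "B-ORDERID"}]))
print(route_request("RequestRefund", []))
```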

In [None]:
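user_query = "What's the status of order 12345?"
intent_id = predict_intent(model, tokenizer, user_query)
# Use a token classification checkpoint for NER, not the intent classifier
entities = extract_entities('HooshvareLab/bert-base-parsbert-armanner-uncased', None, user_query)

# Illustrative outputs, assuming models fine-tuned on these intents and on an OrderID entity
print(f'Intent: {intents[intent_id]}')  # e.g. Intent: GetOrderStatus
print(f'Entities: {entities}')          # e.g. Entities: [{'word': '12345', 'entity': 'B-ORDERID', 'score': 0.99}]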

Conclusion
Combining intent recognition and entity extraction allows your chatbot to understand user queries and act accordingly. Hugging Face Transformers and Python provide the building blocks for both, although handling a wide range of customer interactions requires far more labeled data than the toy examples used here.