
# **INTENT** **DETECTION**

In [None]:
%pip install transformers



In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="huggingface_hub.utils._auth")

# **Tokenization**
Tokenization is the process of breaking text into smaller units, called tokens, which are then mapped to numeric IDs that a model can consume. It is the first preprocessing step for most machine learning and Natural Language Processing (NLP) tasks on text.
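
To make the idea concrete, here is a minimal sketch of WordPiece-style tokenization, the greedy longest-match scheme that BERT tokenizers are based on. The toy vocabulary and its subword IDs are made up for illustration; only the special-token IDs (`[UNK]` = 100, `[CLS]` = 101, `[SEP]` = 102) follow bert-base-uncased. The sketch splits on whitespace only, so unlike the real tokenizer it does not separate punctuation.

```python
# Toy vocabulary: the subword entries and their IDs are hypothetical.
# Only the special-token IDs match bert-base-uncased.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "i": 1, "want": 2, "in": 3, "install": 4, "##ment": 5, "em": 6, "##i": 7,
}


def wordpiece(word, vocab):
    """Split one lowercase word into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, and add [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[token] for token in tokens]


tokens, ids = encode("I want in installment", TOY_VOCAB)
print(tokens)
print(ids)
```

With this toy vocabulary, "installment" is split into two pieces. The real bert-base-uncased vocabulary is much larger and may keep such a word whole.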

In [None]:
from transformers import BertTokenizer

# Read the CSV file with pandas
import pandas as pd
data = pd.read_csv('/content/sofmattress_train.csv')

# Load the BERT tokenizer that matches the pre-trained model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize each sentence: split it into WordPiece tokens and map them to vocabulary IDs,
# adding the special [CLS] (101) and [SEP] (102) tokens
data['tokenized'] = data['sentence'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

print(data[['sentence', 'tokenized']].head())

                                         sentence  \
0                    You guys provide EMI option?   
1  Do you offer Zero Percent EMI payment options?   
2                                         0% EMI.   
3                                             EMI   
4                           I want in installment   

                                           tokenized  
0    [101, 2017, 4364, 3073, 12495, 5724, 1029, 102]  
1  [101, 2079, 2017, 3749, 5717, 3867, 12495, 790...  
2                [101, 1014, 1003, 12495, 1012, 102]  
3                                  [101, 12495, 102]  
4                [101, 1045, 2215, 1999, 18932, 102]  


# **Dataset** **Preparation**
1. Encode the categorical labels into numeric format using LabelEncoder.
2. Split the data into training (80%) and validation (20%) sets for model evaluation.
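
The two steps above can be sketched in plain Python to show what happens conceptually. The label names are taken from the dataset, but the ten rows are made up, and the shuffle here is not the one scikit-learn produces for `random_state=42`; treat it as an illustration of the idea rather than a reimplementation of `LabelEncoder` or `train_test_split`.

```python
import random

# Hypothetical sample: the label names appear in the dataset, the rows are made up.
labels = ["EMI", "COD", "EMI", "ORDER_STATUS", "COD", "EMI", "LEAD_GEN", "COD", "EMI", "OFFERS"]

# Step 1: map each distinct label to an integer, in sorted order
# (LabelEncoder also stores its classes_ as the sorted unique values).
classes = sorted(set(labels))
label_to_id = {name: index for index, name in enumerate(classes)}
encoded = [label_to_id[name] for name in labels]

# Step 2: shuffle the row indices with a fixed seed, then cut at 80%.
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)
cut = len(indices) * 80 // 100
train_idx, val_idx = indices[:cut], indices[cut:]

print(label_to_id)
print("encoded labels:", encoded)
print("train rows:", train_idx)
print("validation rows:", val_idx)
```

Fixing the seed makes the split reproducible, which is the role `random_state=42` plays in the notebook.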

In [None]:
from transformers import BertTokenizer
import torch
from sklearn.model_selection import train_test_split

# Tokenize with padding and truncation so that all input sequences have the same length
inputs = tokenizer(list(data['sentence']), max_length=128, padding=True, truncation=True, return_tensors="pt")

# Encode the labels: convert the categorical intent names into integer class IDs
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
labels = torch.tensor(encoder.fit_transform(data['label']))

# Split input IDs, attention masks, and labels in a single call so the rows stay aligned
train_inputs, val_inputs, train_masks, val_masks, train_labels, val_labels = train_test_split(
    inputs['input_ids'], inputs['attention_mask'], labels, test_size=0.2, random_state=42
)
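
The padding and truncation step in the cell above produces two aligned tensors, `input_ids` and `attention_mask`. A minimal plain-Python sketch of that idea follows. `pad_batch` is a hypothetical helper, not part of the tokenizer API, and it truncates by simple slicing, whereas the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
PAD_ID = 0  # ID of the [PAD] token in bert-base-uncased


def pad_batch(sequences, max_length):
    """Truncate to max_length, pad to the longest remaining sequence, and build masks."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        padding = width - len(seq)
        input_ids.append(seq + [PAD_ID] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Token IDs copied from the tokenizer output printed earlier in the notebook.
batch = [
    [101, 2017, 4364, 3073, 12495, 5724, 1029, 102],
    [101, 12495, 102],
    [101, 1045, 2215, 1999, 18932, 102],
]
input_ids, attention_mask = pad_batch(batch, max_length=128)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The attention mask tells the model which positions to ignore, so the padding does not influence the prediction.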

# **Model** **Initialization**
1. Load a pre-trained BERT model (bert-base-uncased) with a classification head for sequence classification.
2. The classification head is sized to match the number of unique intent classes in the dataset (num_labels).
3. Starting from pre-trained weights gives the model general language representations; only the classification head is newly initialized and has to be learned during fine-tuning.
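
As a rough picture of what the classification head does, the sketch below applies a single linear layer to a sentence vector and returns one logit per class. The sizes, the random initialization, and the stand-in sentence vector are all assumptions for illustration; the actual head in `BertForSequenceClassification` operates on BERT's 768-dimensional output.

```python
import random

HIDDEN_SIZE = 8  # bert-base-uncased uses 768; kept small here so the output is readable
NUM_LABELS = 5   # hypothetical; the notebook passes len(encoder.classes_)

rng = random.Random(0)

# Newly initialized head: small random weights and a zero bias (values are illustrative).
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def classification_head(sentence_vector):
    """Linear layer producing one logit per class: logits[k] = weight[k] . x + bias[k]."""
    return [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weight, bias)
    ]


# Stand-in for the sentence representation that BERT would produce.
sentence_vector = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
logits = classification_head(sentence_vector)
print([round(value, 4) for value in logits])
```

Before fine-tuning, the logits of such a head are all close to zero, so no class is clearly preferred. This is why the library warns that the model should be trained before it is used for prediction.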

In [None]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(encoder.classes_))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# **Fine**-**Tuning**
1. Optimizer:
   - AdamW optimizer is used with a learning rate of 5e-5 for effective fine-tuning of the BERT model.
   - It incorporates weight decay to reduce overfitting.
2. Training Loop:
   - The model is trained in batches, using the DataLoader for efficient iteration.
   - Loss is computed for each batch and used to update the model weights via backpropagation.
3. Epochs:
   - Training is repeated for multiple epochs (3 in this case) to allow the model to converge to better performance.
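
To show where weight decay enters the update, here is a minimal sketch of one AdamW step for a single scalar parameter. The default values for `beta1`, `beta2`, `eps`, and `weight_decay` mirror the defaults of `torch.optim.AdamW`; the toy loss and the learning rate of 0.1 are assumptions chosen so that the example converges in a few hundred steps.

```python
import math


def adamw_step(param, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a single scalar parameter and return the new value."""
    state["step"] += 1
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient.
    param *= 1.0 - lr * weight_decay
    # Exponential moving averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1.0 - beta1 ** state["step"])
    v_hat = state["v"] / (1.0 - beta2 ** state["step"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy problem (an assumption for illustration): minimize loss(x) = (x - 3)^2.
x = 0.0
state = {"step": 0, "m": 0.0, "v": 0.0}
for _ in range(300):
    grad = 2.0 * (x - 3.0)                  # d(loss)/dx, the role of loss.backward()
    x = adamw_step(x, grad, state, lr=0.1)  # the role of optimizer.step()
print(f"x after 300 steps: {x:.3f}")
```

Because the decay is applied to the parameter itself rather than added to the gradient, it is not rescaled by the adaptive step size. That is the difference between AdamW and Adam with L2 regularization.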

In [None]:
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW

batch_size = 16  # number of training examples processed in one iteration
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, batch_size=batch_size)

optimizer = AdamW(model.parameters(), lr=5e-5)  # Adam with decoupled weight decay

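for epoch in range(3):  # Loop over epochs
    model.train()  # enable dropout for training
    for batch in train_dataloader:
        b_input_ids, b_attention_mask, b_labels = batch
        # Passing labels makes the model return the cross-entropy loss
        outputs = model(b_input_ids, attention_mask=b_attention_mask, labels=b_labels)
        loss = outputs.loss
        loss.backward()        # compute gradients
        optimizer.step()       # update the weights
        optimizer.zero_grad()  # clear gradients before the next batch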

# **Model** **Evaluation**
1. Model Evaluation Mode:
   - The model is set to eval() mode, which disables dropout so that predictions are deterministic. It does not turn off gradient tracking; torch.no_grad() does that.
2. Predictions:
   - For each batch in the validation set, logits (raw class scores) are generated.
   - The class with the highest score is selected as the predicted label using argmax.
3. Metrics:
   - Accuracy: Measures the overall proportion of correctly predicted samples.
   - Classification Report: Provides detailed metrics (precision, recall, F1-score) for each class to assess the model's performance on different intents.
4. Gradient-Free Evaluation:
   - Wrapping inference in torch.no_grad() skips gradient tracking, which reduces memory use and speeds up evaluation.
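
The prediction and metric steps above can be sketched in plain Python to show how the numbers in a classification report are derived. The logits and true labels below are made up, and `per_class_metrics` is a hypothetical helper rather than the scikit-learn implementation.

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda index: scores[index])


def per_class_metrics(y_true, y_pred, num_classes):
    """Return accuracy and a list of (precision, recall, f1, support), one per class."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    rows = []
    for cls in range(num_classes):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        support = sum(t == cls for t in y_true)
        precision = tp / predicted if predicted else 0.0  # never predicted: reported as 0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((precision, recall, f1, support))
    return accuracy, rows


# Hypothetical logits for four validation examples and three classes.
logits = [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5], [1.1, 0.9, 0.2], [0.3, 0.2, 2.2]]
y_true = [0, 2, 1, 2]
y_pred = [argmax(row) for row in logits]

accuracy, rows = per_class_metrics(y_true, y_pred, num_classes=3)
print(f"accuracy={accuracy:.2f}")
for cls, (precision, recall, f1, support) in enumerate(rows):
    print(cls, f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

A class that the model never predicts has an undefined precision. scikit-learn reports it as 0 and emits a warning, which is worth keeping in mind when validation classes have only one or two examples.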

In [None]:
from sklearn.metrics import classification_report,accuracy_score
from torch.utils.data import DataLoader, TensorDataset

# Create the validation dataloader
validation_data = TensorDataset(val_inputs, val_masks, val_labels)
validation_dataloader = DataLoader(validation_data, batch_size=batch_size)

model.eval()
val_predictions = []
with torch.no_grad():
    for batch in validation_dataloader:
        # unpack the batch into three variables
        b_input_ids, b_attention_mask, b_labels = batch
        logits = model(b_input_ids, attention_mask=b_attention_mask).logits
        val_predictions.append(logits.argmax(dim=-1))

# Concatenate the per-batch predictions into one NumPy array
val_predictions = torch.cat(val_predictions).cpu().numpy()

# Convert the true validation labels to a NumPy array
val_true_labels = val_labels.cpu().numpy()

# Calculate and print evaluation metrics
accuracy = accuracy_score(val_true_labels, val_predictions)
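# labels keeps the rows aligned with target_names; zero_division=0 reports 0 without warnings
report = classification_report(
    val_true_labels, val_predictions, labels=list(range(len(encoder.classes_))),
    target_names=encoder.classes_, zero_division=0,
)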

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", report)

Accuracy: 0.5152
Classification Report:
                        precision    recall  f1-score   support

100_NIGHT_TRIAL_OFFER       1.00      0.50      0.67         4
   ABOUT_SOF_MATTRESS       0.00      0.00      0.00         3
         CANCEL_ORDER       0.00      0.00      0.00         2
        CHECK_PINCODE       1.00      1.00      1.00         1
                  COD       1.00      0.50      0.67         2
           COMPARISON       0.00      0.00      0.00         1
    DELAY_IN_DELIVERY       0.00      0.00      0.00         2
         DISTRIBUTORS       0.88      0.88      0.88         8
                  EMI       0.80      0.80      0.80         5
        ERGO_FEATURES       0.67      1.00      0.80         4
             LEAD_GEN       0.23      0.75      0.35         4
        MATTRESS_COST       0.75      1.00      0.86         3
               OFFERS       1.00      0.67      0.80         3
         ORDER_STATUS       0.09      1.00      0.17         1
       ORTHO_



# **Predict** **Function** **for** **User** **Input**
1. Tokenization:
   - Converts the user-provided query into input IDs and attention masks using the same tokenizer as during training.
   - Applies padding and truncation to ensure compatibility with the model's expected input format.
2. Model Prediction:
   - The model processes the tokenized query in evaluation mode (eval()).
   - Logits (raw scores) are computed, and the class with the highest score is selected using argmax.
3. Label Decoding:
   - The numerical prediction is mapped back to its original string label using the LabelEncoder.
4. Interactive Query Input:
   - Continuously prompts the user to input queries.
   - Displays the predicted intent for each query.
   - Exits the loop when the user types "exit".
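
The prediction and label-decoding steps can be sketched without the model. The version below also applies a softmax so that the winning class comes with a probability; that part is an addition for illustration, since the notebook's function returns only the label. The logits are made up, and `CLASSES` is a hypothetical three-label subset standing in for `encoder.classes_`.

```python
import math

# Hypothetical subset of intent labels, in the sorted order LabelEncoder would store them.
CLASSES = ["COD", "EMI", "ORDER_STATUS"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode_prediction(logits, classes):
    """Pick the highest-scoring class and return its label with its probability."""
    probabilities = softmax(logits)
    best = max(range(len(logits)), key=lambda index: logits[index])
    return classes[best], probabilities[best]


label, confidence = decode_prediction([0.4, 2.9, 0.7], CLASSES)  # made-up logits
print(label, round(confidence, 3))
```

Softmax does not change which class wins, so the predicted label is the same as with argmax on the raw logits. The probability can help flag low-confidence predictions for review.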

In [None]:
# Define the prediction function for user queries
def predict_intent(user_input):
    inputs = tokenizer(user_input, max_length=128, padding=True, truncation=True, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        outputs = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])
        logits = outputs.logits
        prediction = logits.argmax(dim=-1).item()  # Get the predicted class
    predicted_label = encoder.inverse_transform([prediction])[0]
    return predicted_label

# Read user queries in a loop and print the predicted intent for each
while True:
    user_query = input("Enter your query (or 'exit' to quit): ")
    if user_query.lower() == 'exit':
        break
    predicted_intent = predict_intent(user_query)
    print(f"Predicted Intent: {predicted_intent}")

Enter your query (or 'exit' to quit): back pain
Predicted Intent: ORTHO_FEATURES
Enter your query (or 'exit' to quit): i need emi payment?
Predicted Intent: EMI
Enter your query (or 'exit' to quit): order
Predicted Intent: ORDER_STATUS
Enter your query (or 'exit' to quit): Do you provide EMI options for this product?
Predicted Intent: PRODUCT_VARIANTS
Enter your query (or 'exit' to quit): Can I buy this on installments?
Predicted Intent: EMI
Enter your query (or 'exit' to quit): cash on delivery is available?
Predicted Intent: COD
Enter your query (or 'exit' to quit): what is the status?
Predicted Intent: ORDER_STATUS
Enter your query (or 'exit' to quit): exit
