
### Loading and extracting the JSON data

In [3]:
import json
import os

try:
    with open('intents_dataset.json', 'r') as f:
        data = json.load(f)
        print("Data loaded successfully.")
        print(data)  # Print the full dataset for inspection

except FileNotFoundError:
    print("Error: The file was not found.")
except json.JSONDecodeError:
    print("Error: Failed to decode JSON from the file.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")



Data loaded successfully.


In [4]:
import pandas as pd

rows = []
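# The JSON is nested as category -> intent -> list of messages
for category, intents in data.items():
    for intent, messages in intents.items():
        for message in messages:  # distinct name so the list is not shadowed
            rows.append({'category': category, 'intent': intent, 'text': message})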

df = pd.DataFrame(rows)

print(df.head())

         category        intent                                text
0  software_issue  cannot_login                      I can't log in
1  software_issue  cannot_login           My password isn't working
2  software_issue  cannot_login        I'm locked out of my account
3  software_issue  cannot_login  It says my credentials are invalid
4  software_issue  cannot_login                I forgot my password
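
The loop above assumes the JSON is nested two levels deep: category, then intent, then a list of example messages. A minimal sketch of that shape follows, using a made-up `sample` dictionary (the second category and intent are hypothetical, since the full file is not shown here):

```python
# Hypothetical sample mirroring the assumed layout of intents_dataset.json:
# {category: {intent: [message, ...]}}
sample = {
    "software_issue": {
        "cannot_login": ["I can't log in", "I forgot my password"],
    },
    "billing": {  # made-up category and intent, for illustration only
        "refund_request": ["I want my money back"],
    },
}

# Flatten to one record per message
sample_rows = [
    {"category": category, "intent": intent, "text": message}
    for category, intents in sample.items()
    for intent, messages in intents.items()
    for message in messages
]

for row in sample_rows:
    print(row)
```

Each message becomes its own row, so an intent with many example phrasings contributes many training examples.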


#### Basic normalization

In [5]:
print(df.columns)

Index(['category', 'intent', 'text'], dtype='object')


In [6]:
import re
import pandas as pd

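def normalize_text(text):
    """Lowercase text, strip URLs and HTML tags, and collapse whitespace."""
    if isinstance(text, str):
        text = text.lower()
        # Collapse whitespace first so tags and URLs sit on a single line
        text = ' '.join(text.split())
        # Remove URLs (http://, https://, and bare www. links)
        text = re.sub(r'http\S+|www\S+', '', text)
        # Remove HTML tags
        text = re.sub(r'<.*?>', '', text)
        # Replace anything other than word characters, whitespace, and . , ! ? ' - / with a space
        text = re.sub(r'[^\w\s\.\,\!\?\'\-\/]+', ' ', text)
        # Collapse the whitespace left behind by the removals
        text = ' '.join(text.split())
    return text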

df['cleaned_text'] = df['text'].apply(normalize_text)

df.to_csv('cleaned_intents_dataframe.csv', index=False)

print("DataFrame head (showing original vs. cleaned text):")
print(df[['text', 'cleaned_text', 'intent']].head())
print("\nDataFrame info:")
df.info()

DataFrame head (showing original vs. cleaned text):
                                 text                        cleaned_text  \
0                      I can't log in                      i can't log in   
1           My password isn't working           my password isn't working   
2        I'm locked out of my account        i'm locked out of my account   
3  It says my credentials are invalid  it says my credentials are invalid   
4                I forgot my password                i forgot my password   

         intent  
0  cannot_login  
1  cannot_login  
2  cannot_login  
3  cannot_login  
4  cannot_login  

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 446 entries, 0 to 445
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   category      446 non-null    object
 1   intent        446 non-null    object
 2   text          446 non-null    object
 3   cleaned_text  446 non-null    object
dtypes: object(4)
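
The sample rows above are already clean, so they only show the lowercasing step. As a rough illustration of the other steps, here is the same function applied to a made-up noisy string (the input is hypothetical and not taken from the dataset); the function is repeated so the snippet runs on its own:

```python
import re

def normalize_text(text):
    # Same steps as the notebook's function, repeated for a standalone demo
    if isinstance(text, str):
        text = text.lower()
        text = ' '.join(text.split()).strip()
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        text = re.sub(r'<.*?>', '', text)
        text = re.sub(r'[^\w\s\.\,\!\?\'\d\-\/]+', ' ', text)
        text = ' '.join(text.split()).strip()
    return text

# Made-up noisy input: extra spaces, HTML tags, a URL, and symbols
noisy = "  Check <b>THIS</b> out: https://example.com/help  it's 50% off!! "
cleaned = normalize_text(noisy)
print(cleaned)  # check this out it's 50 off!!
```

Note that the colon and percent sign are replaced with spaces while apostrophes and exclamation marks are kept, and that non-string values (such as NaN) pass through unchanged.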

### Setting up the model

In [7]:
%pip install -q transformers datasets scikit-learn accelerate

In [10]:
# imports
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer



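* from ***datasets*** import Dataset: Imports the Hugging Face ***Dataset*** class, an Apache Arrow-backed table format that the ***Trainer*** accepts directly and that supports fast, batched preprocessing.
* ***AutoTokenizer***: Loads the tokenizer that matches a checkpoint. It splits text into tokens and converts them to integer IDs.
* ***AutoModelForSequenceClassification***: Loads a pre-trained model with a classification head on top, the architecture used for categorizing text.
* ***TrainingArguments***: The training configuration (hyperparameters such as learning rate, batch size, and number of epochs).
* ***Trainer***: The "manager" that runs the training and evaluation loops with the model, data, and arguments you give it.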

---

### Loading and Labeling

In [16]:
# Load the cleaned data
try:
    df = pd.read_csv('cleaned_intents_dataframe.csv')
    print("Data loaded successfully.")
except FileNotFoundError:
    print("Error: The file was not found.")

# Create the label maps (intent name <-> integer ID)
try:
    unique_intents = df['intent'].unique().tolist()
    label2id = {intent: i for i, intent in enumerate(unique_intents)}
    id2label = {i: intent for i, intent in enumerate(unique_intents)}
    print("Label maps created successfully.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

# Add an integer label column
try:
    df['labels'] = df['intent'].map(label2id)
    print("Labels applied.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Data loaded successfully.
Label maps created successfully.
Labels applied.


##### ***`df['intent'].unique().tolist()`***: Looks at the 'intent' column, finds every distinct intent name (e.g., "cannot_login") in order of first appearance, and puts them in a plain Python list.

##### ***`label2id = ...`***: A dictionary comprehension that maps each intent name to an integer ID (its position in the list). The model trains on these numbers, not on the names.

##### ***`id2label = ...`***: The reverse dictionary, mapping IDs back to names. The model outputs numbers, so we need this to translate a prediction back into an intent name later.

##### ***`df['labels'] = ...`***: Creates a new column called 'labels' in the DataFrame. It looks up each row's intent in the label2id dictionary and stores the corresponding number (0, 1, 2, etc.).
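
To make the two maps concrete, here is a small standalone sketch on a made-up list of intents (only `cannot_login` appears in the output above; the other names are hypothetical). It uses plain Python instead of pandas so it runs without any dependencies:

```python
# Made-up intent column; only "cannot_login" is taken from the notebook output
demo_intents = ["cannot_login", "cannot_login", "reset_password",
                "cannot_login", "update_email"]

# dict.fromkeys keeps first-appearance order, like pandas' unique()
demo_unique = list(dict.fromkeys(demo_intents))

demo_label2id = {intent: i for i, intent in enumerate(demo_unique)}
demo_id2label = {i: intent for intent, i in demo_label2id.items()}

# Equivalent of df['intent'].map(label2id)
demo_labels = [demo_label2id[intent] for intent in demo_intents]

print(demo_label2id)
print(demo_labels)  # [0, 0, 1, 0, 2]
```

Because IDs follow first appearance, reordering the rows would change which number each intent gets. Sorting the unique intents before enumerating is one way to keep the mapping stable across runs.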


#### Dataset Creation & Splitting

In [17]:
# create hugging face Datasets
ds = Dataset.from_pandas(df[['cleaned_text' , 'labels']])

# split data
ds = ds.train_test_split(test_size = 0.2)

##### ***`Dataset.from_pandas(...)`***: Converts your pandas DataFrame into a Hugging Face Dataset object. We select only the columns we need: the text (cleaned_text) and the answers (labels).

##### ***`ds.train_test_split(test_size=0.2)`***: Randomly cuts your data into two piles:
  - **Train (80%)**: Used to teach the model.
  - **Test (20%)**: Hidden from the model during training, used later to check if it actually learned or just memorized.
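
With only a few hundred rows spread over many intents, a purely random split can leave some intents with very few test examples, and without a fixed seed the split changes on every run (`train_test_split` also accepts a `seed` argument). The idea of a seeded, per-intent (stratified) split can be sketched in plain Python as follows. This is an illustration under those assumptions, not the library's implementation, and `stratified_split` is a hypothetical helper name:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly test_size of each label in test."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Put at least one example of each label in test, unless it has only one
        n_test = max(1, round(len(indices) * test_size)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 10 examples each of intents 0 and 1, and 5 of intent 2
demo_split_labels = [0] * 10 + [1] * 10 + [2] * 5
train_idx, test_idx = stratified_split(demo_split_labels)
print(len(train_idx), len(test_idx))  # 20 5
```

Every intent ends up on both sides of the split, so the test score reflects all classes rather than whichever ones the random draw happened to include.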

#### Tokenization

In [18]:
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
  return tokenizer(examples["cleaned_text"], padding="max_length", truncation=True)

tokenized_ds = ds.map(tokenize_function, batched=True)

Map:   0%|          | 0/356 [00:00<?, ? examples/s]

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

##### ***`model_checkpoint = ...`***: A string storing the exact name of the model we want to download from the Hugging Face Hub.

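##### ***`tokenizer = ...`***: Downloads the tokenizer (vocabulary and tokenization rules) that matches the DistilBERT checkpoint. It knows that "refund" = ID 4852 (hypothetically).

##### ***`padding="max_length"`***: "Stretches" short sequences by appending pad tokens (ID 0 in this vocabulary) until they reach a fixed length, so every example has the same shape and can be stacked into a batch. Because the call does not pass `max_length`, that length is the tokenizer's model maximum (512 tokens for `distilbert-base-uncased`), not 128. Pass `max_length=128` explicitly if you want shorter, cheaper sequences.

##### ***`truncation=True`***: "Cuts" sequences that are too long by dropping everything beyond the maximum length from the end. Lengths are counted in tokens (including the special `[CLS]` and `[SEP]` tokens), not words.

##### ***`ds.map(..., batched=True)`***: Applies the function we just defined to the dataset in batches (1,000 examples at a time by default) rather than one example at a time. The fast tokenizer processes a whole batch per call, which is much faster than a Python for loop.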
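
The effect of padding and truncation on a single sequence can be shown with a toy function. This is a simplified sketch, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop the overflow
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)         # padding: fill the remainder
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids = [11, 12, 13]                  # made-up token IDs
long_ids = [11, 12, 13, 14, 15, 16, 17, 18]

print(pad_or_truncate(short_ids, max_length=6))
# ([11, 12, 13, 0, 0, 0], [1, 1, 1, 0, 0, 0])
print(pad_or_truncate(long_ids, max_length=6))
# ([11, 12, 13, 14, 15, 16], [1, 1, 1, 1, 1, 1])
```

The attention mask is what lets the model ignore the pad positions, which is why the tokenizer returns it alongside the input IDs.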

### Model setup

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(unique_intents),
    id2label=id2label,
    label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### ***`AutoModelForSequenceClassification`***: Loads the pre-trained DistilBERT weights and places a sequence-classification head on top.

##### ***`num_labels=len(unique_intents)`***: This is the crucial part. The pre-trained checkpoint has no classification layer for your intents, so the library adds a new, randomly initialized head (`pre_classifier` and `classifier` in the warning above) with exactly one output per intent (e.g., 15 outputs for 15 intents). That head only becomes useful after fine-tuning, which is what the warning is asking for.

##### ***`id2label / label2id`***: We save the dictionaries inside the model configuration. This ensures that when you save the model and load it later, it still remembers that 0 means "cannot_login".
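
To show why the model needs `id2label`, here is a plain-Python sketch of how one row of output scores (logits) would be turned back into an intent name. The logits and the two extra intent names are made up for illustration; a real prediction would come from the fine-tuned model.

```python
import math

# Hypothetical mapping and scores; only "cannot_login" appears in the notebook
demo_id2label = {0: "cannot_login", 1: "reset_password", 2: "update_email"}
demo_logits = [0.5, 2.0, -1.0]  # one score per intent, as the classifier head emits

def softmax(values):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(values)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(demo_logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(demo_id2label[pred_id])  # reset_password
```

Before fine-tuning, the head's weights are random, so predictions decoded this way carry no meaning yet.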