# Employee Chatbot: Data Augmentation for Training

This notebook outlines the steps to augment a dataset of employee queries for training a chatbot. The goal is to increase the dataset size from roughly 1,000 rows (1,022 in the loaded file) to 20,000 rows using paraphrasing, back-translation, and synthetic data generation.

---

### *Libraries Used*
We will use the following Python libraries:
- *Data Manipulation*: pandas, numpy
- *Data Augmentation*: nlpaug, transformers, torch
- *Back-Translation*: googletrans, deep-translator
- *Grammar Correction*: language-tool-python

---

### *Installation*
List the required libraries in a requirements.txt file, then install them with:

```bash
pip install -r requirements.txt
```

---

### *Dataset Overview*
The dataset contains employee queries with the following columns:
- user_query: The employee's question or problem.
- intent: The category or purpose of the query (e.g., "password_reset").
- solution: The response or solution to the query.

Our goal is to augment the user_query column while preserving the intent and solution columns.
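
As a minimal sketch of what "preserving the intent and solution columns" means in practice, the hypothetical helper below (`expand_row` is an illustrative name, not part of the notebook) copies the labels of one source row onto each new query variant. It works on plain dictionaries, so it runs without the dataset; the example row is made up.

```python
def expand_row(row, variants):
    """Return one new record per variant, reusing the source row's labels.

    `row` is a dict with "user_query", "intent" and "solution" keys;
    `variants` is a list of rewritten queries. Hypothetical helper.
    """
    records = []
    for text in variants:
        text = text.strip()
        # Skip empty variants and variants identical to the source query.
        if not text or text == row["user_query"]:
            continue
        records.append({
            "user_query": text,
            "intent": row["intent"],
            "solution": row["solution"],
        })
    return records


# Made-up example row, for illustration only.
example_row = {
    "user_query": "I forgot my password.",
    "intent": "reset_password",
    "solution": "Use the 'Forgot Password' link.",
}
new_rows = expand_row(
    example_row,
    ["I can't remember my password.", "  ", "I forgot my password."],
)
print(new_rows)
```

Only the first variant survives: the blank one and the exact copy of the source query are dropped, and the kept record carries the same `intent` and `solution` as its source.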

In [1]:
import pandas as pd

#load the dataset
df = pd.read_csv("../original-data/helpdesk_dataset.csv")
df.head()

Unnamed: 0,user_query,intent,solution
0,"My password isn't working, can you help me res...",reset_password,Please follow the steps in the password reset ...
1,I can't log in; I think I forgot my password. ...,reset_password,A password reset link has been sent to your em...
2,I'm unable to access my account because my pas...,reset_password,Try resetting your password using the 'Forgot ...
3,Can you reset my password for me? I’ve been lo...,reset_password,Please check your email for a reset link and f...
4,I can't remember my password and need to reset...,reset_password,A reset link has been sent to your email. Use ...


In [3]:
df["intent"].unique()

array(['reset_password', 'access_issue', 'software_install',
       'ticket_status', 'vpn_setup', 'hardware_issue', 'change_username',
       'new_hardware_request', 'unlock_account', 'network_support',
       'slow_computer', 'shared_folder_access', 'external_drive_issue',
       'software_updates', 'internet_issue', 'reset_security_questions',
       'email_signature_setup', 'change_language_settings',
       'report_phishing', 'ticketing_system_guide'], dtype=object)

In [2]:
df_copy = df.copy()

## Step 1: Paraphrasing with Grammar Correction

We use the *nlpaug* library to paraphrase the user_query column. This generates new variations of the same query while preserving the intent and solution. To ensure high-quality outputs, we also apply *grammar correction* to the paraphrased text.

### Steps:
1. Initialize the paraphrasing augmenter.
2. Initialize the grammar correction tool.
3. Sample 50% of the dataset.
4. Generate 4 paraphrased versions for each query.
5. Correct grammar for each paraphrased query.
6. Append the augmented data to a new DataFrame.

In [17]:
import nlpaug.augmenter.word as naw
import language_tool_python

# Initialize the grammar checker
tool = language_tool_python.LanguageTool("en-US")

# initialize the paraphrasing augmenter
paraphrase_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute")

# Correct grammar in a piece of text
def correct_grammar(text):
    matches = tool.check(text)
    corrected_text = language_tool_python.utils.correct(text, matches)
    return corrected_text

# Augment 50% of the data
augmented_data = []
for index, row in df_copy.sample(frac=0.5).iterrows(): # Sample 50% of the data
    augmented_query = paraphrase_aug.augment(row["user_query"], n=4)
    for text in augmented_query:
        corrected_text = correct_grammar(text=text)
        augmented_data.append({
            "user_query": corrected_text, 
            "intent": row["intent"],
            "solution": row["solution"]
        })

# Convert to DataFrame
augmented_df = pd.DataFrame(augmented_data)
augmented_df.head()

Unnamed: 0,user_query,intent,solution
0,She would like TAE change my business to somet...,change_username,Submit a request with the new professional use...
1,I would try to consider my username to be pure...,change_username,Submit a request with the new professional use...
2,I still need to amend my username for somethin...,change_username,Submit a request with the new professional use...
3,I would probably might change the username for...,change_username,Submit a request with the new professional use...
4,How will I install any latest version on this ...,software_install,Download the latest version from the official ...


In [18]:
augmented_df.to_csv("data/paraphrasing.csv", index=False)
print(f"Data size: {len(augmented_df)}")


Data size: 2044


In [49]:
augmented_df_copy = augmented_df.copy()

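### Second Attempt (Why Not)

Dataset size after each additional paraphrasing pass (4 paraphrases per processed row):

- Pass 1: 2044 -> 4088 (50% sample of the 2044 rows)
- Pass 2: 4088 -> 8176
- Pass 3: stays at 8176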
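
The sizes listed above follow from simple arithmetic: a pass emits 4 paraphrases for every row it processes. The sketch below reproduces the recorded numbers. `expected_size` is a hypothetical helper, and rounding the sampled row count is an assumption about how `DataFrame.sample(frac=...)` chooses its sample size.

```python
def expected_size(n_rows, frac, n_variants):
    """Rows produced by one pass: sampled rows times variants per row."""
    sampled = int(round(n_rows * frac))  # assumed rounding of the sample size
    return sampled * n_variants


print(expected_size(1022, 0.5, 4))  # first paraphrasing run on the original data: 2044
print(expected_size(2044, 0.5, 4))  # 50% sample of 2044 rows: 4088
print(expected_size(2044, 1.0, 4))  # all 2044 rows, no sampling: 8176
```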

In [42]:
# Augment 50% of the data
augmented_data = []
for index, row in augmented_df.sample(frac=0.5).iterrows(): # Sample 50% of the data
    augmented_query = paraphrase_aug.augment(row["user_query"], n=4)
    for text in augmented_query:
        corrected_text = correct_grammar(text=text)
        augmented_data.append({
            "user_query": corrected_text, 
            "intent": row["intent"],
            "solution": row["solution"]
        })

# Convert to DataFrame
augmented_df_2 = pd.DataFrame(augmented_data)
augmented_df_2.head()

Unnamed: 0,user_query,intent,solution
0,Where can she build another report against my ...,ticketing_system_guide,"Go to the 'Reports' section in the system, sel..."
1,How can she build one report on global health?,ticketing_system_guide,"Go to the 'Reports' section in the system, sel..."
2,But did she build a report on my marriage?,ticketing_system_guide,"Go to the 'Reports' section in the system, sel..."
3,How can one build a report about my life?,ticketing_system_guide,"Go to the 'Reports' section in the system, sel..."
4,My speed is seriously poor. All' haven even tr...,internet_issue,Ensure that there are no other devices using e...


In [52]:
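def paraphrasing(df):
    # Generate 4 paraphrases for every row of the DataFrame passed in.
    # The sizes recorded below (8176 = 4 x 2044) came from a run that
    # iterated over the global augmented_df instead of the argument.
    augmented_data = []
    for index, row in df.iterrows():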
        augmented_query = paraphrase_aug.augment(row["user_query"], n=4)
        for text in augmented_query:
            corrected_text = correct_grammar(text=text)
            augmented_data.append({
                "user_query": corrected_text, 
                "intent": row["intent"],
                "solution": row["solution"]
            })
    return augmented_data        

In [56]:
augmented_data = paraphrasing(augmented_df_2)

# Convert to DataFrame
augmented_df_2 = pd.DataFrame(augmented_data)
print(len(augmented_df_2))
augmented_df_2.head()

8176


Unnamed: 0,user_query,intent,solution
0,She would like TAE move their attitude to some...,change_username,Submit a request with the new professional use...
1,It would like me change my business in somethi...,change_username,Submit a request with the new professional use...
2,She would need ta change this business to some...,change_username,Submit a request with the new professional use...
3,She would like TAE change her clothes or becom...,change_username,Submit a request with the new professional use...
4,Parents will try to consider this username to ...,change_username,Submit a request with the new professional use...


In [58]:
augmented_data = paraphrasing(augmented_df_2)

# Convert to DataFrame
augmented_df_3 = pd.DataFrame(augmented_data)
augmented_df_3.head()

Unnamed: 0,user_query,intent,solution
0,India did prefer TAE change my business to bei...,change_username,Submit a request with the new professional use...
1,She felt like TAE expand my business through s...,change_username,Submit a request with the new professional use...
2,They would like to guide my business to someth...,change_username,Submit a request with the new professional use...
3,She would need TAE change my business or somet...,change_username,Submit a request with the new professional use...
4,I would try no consider a job as be purely pro...,change_username,Submit a request with the new professional use...


In [59]:
augmented_df_3.to_csv("data/paraphrasing-3.csv", index=False)
print(f"Data size (2nd Attempt): {len(augmented_df_3)}")

Data size (2nd Attempt): 8176
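
Several rows in the outputs above have drifted away from their intent (for example, a `ticketing_system_guide` query that became a question about "my marriage"). One possible guard, which the notebook does not apply, is to keep a paraphrase only if it still shares enough words with its source query. The sketch below uses Jaccard overlap on lowercase word sets. The function names and the 0.5 threshold are illustrative assumptions rather than tuned values, and the source query is made up because the original is truncated in the output.

```python
import re


def tokens(text):
    """Lowercase word set of a query (letters, digits and apostrophes)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def keep_paraphrase(source, paraphrase, threshold=0.5):
    """True if the paraphrase overlaps enough with its source query."""
    return jaccard(source, paraphrase) >= threshold


# Hypothetical source query, used only to exercise the filter.
source = "How can I build a report on my tickets?"
print(keep_paraphrase(source, "How can I create a report on my tickets?"))    # True
print(keep_paraphrase(source, "But did she build a report on my marriage?"))  # False
```

Word overlap is a crude signal: it misses paraphrases that legitimately use different vocabulary, so a sentence-embedding similarity would likely be a better fit if one is available.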


### Correcting Grammar

In [14]:
import language_tool_python

# Initialize the grammar checker
tool = language_tool_python.LanguageTool("en-US")

def correct_grammar(text):
    matches = tool.check(text)
    corrected_text = language_tool_python.utils.correct(text, matches)
    return corrected_text
correct_grammar("i are unable to come upstairs as my door is locked. can jason help them with this?")

'I am unable to come upstairs as my door is locked. Can Jason help them with this?'

In [None]:
augmented_df = augmented_df_2

## Step 2: Back-Translation

We use back-translation to generate additional variations of the user_query column. This involves translating the text to an intermediate language (e.g., French) and then back to English.

### Steps:
1. Define a back-translation function.
2. Sample 50% of the dataset.
3. Generate a back-translated version of each query.
4. Append the augmented data to a new DataFrame.

In [19]:
from googletrans import Translator

translator = Translator()

def back_translate(text, src_lang="en", intermediate_lang="fr"):
    # Translate to the intermediate language
    translated = translator.translate(text, src=src_lang, dest=intermediate_lang).text
    # Translate back to the source language
    back_translated = translator.translate(translated, src=intermediate_lang, dest=src_lang).text

    return back_translated

# Apply back-translation to a random 50% sample of the data
back_translated_data = []
for index, row in df_copy.sample(frac=0.5).iterrows(): 
    back_translated_query = back_translate(row["user_query"])
    back_translated_data.append({
        "user_query": back_translated_query,
        "intent": row["intent"],
        "solution": row["solution"]
    })

# Convert to DataFrame
back_translated_df = pd.DataFrame(back_translated_data)
back_translated_df.head()



Unnamed: 0,user_query,intent,solution
0,Can we get a new set of high-performance monit...,new_hardware_request,Submit a request detailing the type and number...
1,I have an update waiting for my software.,software_updates,"Go to the software's update section, and you s..."
2,Can you help me with the VPN configuration for...,vpn_setup,Ensure that you have the correct VPN configura...
3,"My external reader shows as ""offline"" in disk ...",external_drive_issue,Try bringing the drive online in Disk Manageme...
4,My internet connection is only very slow durin...,internet_issue,This might be due to network congestion. Conta...


In [22]:
back_translated_df.to_csv("data/back-translated.csv", index=False)
print(f"Data size: {len(back_translated_df)}")

Data size: 511


In [7]:
from deep_translator import GoogleTranslator
import pandas as pd
from tqdm import tqdm
import time

def back_translate_with_deep(text, src_lang='en', intermediate_lang='fr', max_retries=3):
    """
    Perform back translation with error handling and retries
    """
    if not isinstance(text, str) or text.strip() == '':
        return ''
    
    for attempt in range(max_retries):
        try:
            # Forward translation
            translated = GoogleTranslator(source=src_lang, target=intermediate_lang).translate(text.strip())
            time.sleep(0.5)  # Small delay between translations
            
            # Back translation
            back_translated = GoogleTranslator(source=intermediate_lang, target=src_lang).translate(translated)
            return back_translated
            
        except Exception as e:
            if attempt == max_retries - 1:
                print(f"\nFailed to translate after {max_retries} attempts: {str(e)[:100]}...")
                return text  # Return original text if all attempts fail
            time.sleep(1)  # Wait before retry

In [8]:
def process_back_translations(df, batch_size=50):
    """
    Process back translations with progress tracking and batching
    """
    back_translated_data = []
    
    # Calculate total number of batches for the progress bar
    total_batches = (len(df) + batch_size - 1) // batch_size
    
    # Process in batches
    for batch_start in tqdm(range(0, len(df), batch_size), 
                           desc="Processing back translations", 
                           total=total_batches):
        batch_end = min(batch_start + batch_size, len(df))
        batch = df.iloc[batch_start:batch_end]
        
        for _, row in batch.iterrows():
            back_translated_query = back_translate_with_deep(row["user_query"])
            back_translated_data.append({
                "user_query": back_translated_query,
                "intent": row["intent"],
                "solution": row["solution"]
            })
        
        # Save checkpoint after each batch
        temp_df = pd.DataFrame(back_translated_data)
        temp_df.to_csv("data/back-translation-checkpoint.csv", index=False)
        
        # Small delay between batches to avoid rate limiting
        time.sleep(1)
    
    return pd.DataFrame(back_translated_data)
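
`process_back_translations` writes a checkpoint after every batch, but an interrupted run still starts again from the first row. A hedged sketch of how a run might resume: count the rows already in the checkpoint file and skip that many input rows. `load_checkpoint` and `rows_to_process` are hypothetical helpers, and the sketch assumes the checkpoint rows are in the same order as the input.

```python
import csv
import os
import tempfile


def load_checkpoint(path):
    """Return the rows already saved in the checkpoint CSV, or [] if none."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def rows_to_process(rows, done):
    """Skip the first len(done) input rows; they are already translated."""
    return rows[len(done):]


# Demo with a temporary checkpoint that holds one finished row.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["user_query", "intent", "solution"])
        writer.writeheader()
        writer.writerow({"user_query": "q1", "intent": "vpn_setup", "solution": "s1"})
    done = load_checkpoint(path)
    pending = rows_to_process(["row1", "row2", "row3"], done)

print(len(done), pending)  # 1 ['row2', 'row3']
```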

In [9]:
def main():    
    print(f"Starting back translation for {len(df_copy)} entries...")
    
    # Process the translations
    back_translated_df = process_back_translations(df_copy)
    
    # Save final results
    output_file = "data/back-translated-100.csv"
    back_translated_df.to_csv(output_file, index=False)
    
    print(f"\nBack translation completed!")
    print(f"Data size: {len(back_translated_df)}")
    print(f"Results saved to: {output_file}")

In [10]:
main()

Starting back translation for 1022 entries...


Processing back translations: 100%|██████████| 21/21 [31:57<00:00, 91.32s/it]


Back translation completed!
Data size: 1022
Results saved to: data/back-translated-100.csv
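
Because `back_translate_with_deep` returns the original text when all retries fail, and a round trip through French can also come back unchanged, some of the 1022 rows may add no new variation. A minimal sketch of a check for this, assuming the original and back-translated queries are available as two parallel lists (`normalize` and `changed_pairs` are hypothetical names, and the example sentences are made up):

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def changed_pairs(originals, back_translated):
    """Keep (original, new) pairs whose new text differs after normalization."""
    return [
        (orig, new)
        for orig, new in zip(originals, back_translated)
        if normalize(orig) != normalize(new)
    ]


originals = ["I have a pending update for my software.", "My VPN is not working."]
round_trip = ["I have an update waiting for my software.", "my VPN is  not working."]
kept = changed_pairs(originals, round_trip)
print(kept)
```

Only the first pair is kept here; the second differs from its source only in case and spacing.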





## Step 3: Synthetic Data Generation

We use a pre-trained GPT model to generate synthetic queries based on the intent column.

### Steps:
1. Load a pre-trained GPT model: **GPT-2**.
2. Generate 2 synthetic queries for each row, prompted by its intent.
3. Append the synthetic data to a new DataFrame.

In [32]:
# from transformers import pipeline

# # Load GPT-2 text generation pipeline
# generator = pipeline("text-generation", model="gpt2")

# def generate_variations(prompt, num_variations=5):
#     variations = []
#     for _ in range(num_variations):
#         generated_text = generator(prompt, max_length=50, num_return_sequences=1)[0]["generated_text"]
#         variations.append(generated_text.strip())

#     return variations

# # Generate synthetic data
# synthetic_data = []
# for index, row in df_copy.iterrows():
#     prompt = f"Generate a question about : {row['intent']}"
#     variations = generate_variations(prompt=prompt, num_variations=2) # Generate 2 variations per row
#     for variation in variations:
#         synthetic_data.append({
#             "user_query": variation,
#             "intent": row["intent"],
#             "solution": row["solution"]
#         })

# # Convert to DataFrame
# synthetic_df = pd.DataFrame(synthetic_data)
# synthetic_df.head()
    

- The second version below uses a pre-trained GPT-2 model to generate new queries from a real example query and its solution.
- Generated text is kept only if it contains intent-specific keywords, which are extracted dynamically from the dataset using *TF-IDF*.

In [39]:
# from sklearn.feature_extraction.text import TfidfVectorizer
# from transformers import pipeline
# import pandas as pd
# import re

# class SyntheticDataGenerator:
#     def __init__(self, model_name="gpt2"):
#         self.generator = pipeline("text-generation", model=model_name)
        
#     def extract_intent_keywords(self, df, top_n=5):
#         """Extract top keywords for each intent using TF-IDF"""
#         intent_keywords = {}
#         for intent in df["intent"].unique():
#             # Filter queries for the current intent
#             queries = df[df["intent"] == intent]["user_query"]
            
#             # Compute TF-IDF
#             vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
#             tfidf_matrix = vectorizer.fit_transform(queries)
            
#             # Get top keywords
#             feature_names = vectorizer.get_feature_names_out()
#             tfidf_scores = tfidf_matrix.sum(axis=0).A1
#             top_indices = tfidf_scores.argsort()[-top_n:][::-1]
#             intent_keywords[intent] = [feature_names[i] for i in top_indices]
        
#         return intent_keywords

#     def create_intent_prompt(self, example_query, solution):
#         """Create focused prompts using real examples"""
#         return f"""Generate 2 variations of this technical support question:
#                 Original: {example_query}
#                 Solution Context: {solution}
#                 Variations should:
#                 - Be phrased differently
#                 - Use casual language
#                 - Describe real-world scenarios
#                 Examples:"""

#     def generate_variations(self, prompt, num_variations=2):
#         """Generate with constrained randomness"""
#         try:
#             results = self.generator(
#                 prompt,
#                 max_length=512,
#                 max_new_tokens=50,
#                 num_return_sequences=num_variations,
#                 temperature=0.3, # Reduced randomness
#                 top_p=0.95,
#                 repetition_penalty=1.2,
#                 clean_up_tokenization_spaces=True
#             )
#             return [self.clean_output(res["generated_text"]) for res in results]
#         except Exception as e:
#             print(f"Generation error: {e}")
#             return []

#     def clean_output(self, text):
#         """Extract only the generated variation"""
#         # Remove prompt contamination
#         cleaned = re.split(r'Examples?:|Variations?:', text)[-1]
#         # Remove quotes and markdown
#         cleaned = re.sub(r'["*]', '', cleaned).strip()
#         # Get first complete sentence
#         if '.' in cleaned:
#             return cleaned.split('.')[0] + '.'
#         return cleaned

#     def is_valid_query(self, text, intent_keywords):
#         """Validate relevance to intent"""
#         text_lower = text.lower()
#         return (
#             len(text) > 10 and
#             any(kw in text_lower for kw in intent_keywords) and
#             '?' in text and
#             not ('example' in text_lower or 'http' in text)
#         )

#     def generate_synthetic_dataset(self, df, num_variations=2):
#         synthetic_data = []
        
#         # Extract intent keywords dynamically
#         intent_keywords = self.extract_intent_keywords(df)
#         print("Extracted Intent Keywords:", intent_keywords)

#         for _, row in df.iterrows():
#             prompt = self.create_intent_prompt(row["user_query"], row["solution"])
#             variations = self.generate_variations(prompt, num_variations)
            
#             for var in variations:
#                 if self.is_valid_query(var, intent_keywords[row["intent"]]):
#                     synthetic_data.append({
#                         "user_query": var,
#                         "intent": row["intent"],
#                         "solution": row["solution"]
#                     })

#         return pd.DataFrame(synthetic_data)

# # Usage
# generator = SyntheticDataGenerator()
# synthetic_df = generator.generate_synthetic_dataset(df_copy, num_variations=2)
# synthetic_df.head()

In [41]:
# synthetic_df.to_csv("data/synthetic-gpt2.csv", index=False)
# print(f"Data size: {len(synthetic_df)}")

## Step 4: Combine All Augmented Data

We combine the original dataset with the augmented data (paraphrased, back-translated, and synthetic) to create a final dataset.

In [None]:
# synthetic_df exists only if the Step 3 cells are uncommented and run
final_df = pd.concat([df_copy, augmented_df, back_translated_df, synthetic_df], ignore_index=True)

# Remove duplicates (if any)
final_df = final_df.drop_duplicates(subset=["user_query"])

final_df.to_csv("augmented_helpdesk_dataset.csv", index=False)
print(f"Original dataset size: {len(df)} \n Final dataset size: {len(final_df)}")
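
After combining, it can be worth checking how the rows are spread across intents, since each augmentation step samples rows at random. A small sketch using only the standard library: `intent_counts` is a hypothetical helper, and with the real data the intents could be passed as `final_df["intent"].tolist()`.

```python
from collections import Counter


def intent_counts(intents):
    """Return (intent, count, share) tuples, most frequent first."""
    counts = Counter(intents)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]


# Made-up labels, for illustration only.
sample_intents = ["reset_password", "vpn_setup", "reset_password", "internet_issue"]
for name, n, share in intent_counts(sample_intents):
    print(f"{name}: {n} rows ({share:.0%})")
```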

## Step 5: Evaluate Augmented Data

We evaluate the quality of the augmented data by:
1. Checking for grammatical errors.
2. Ensuring the meaning of the queries is preserved.
3. Verifying diversity in the augmented queries.

In [None]:
from language_tool_python import LanguageTool

# Initialize grammar checker
tool = LanguageTool('en-US')

# Check grammar for a sample of augmented data
sample = final_df.sample(10)
for index, row in sample.iterrows():
    matches = tool.check(row['user_query'])
    if matches:
        print(f"Query: {row['user_query']}")
        print(f"Grammar issues: {matches}")
        print("------")
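
The cell above covers only the grammar check. For item 3, one rough proxy for diversity is the share of distinct word n-grams across all queries (often called distinct-n). The sketch below is an illustration, not the notebook's method; `distinct_n` is a hypothetical helper and whitespace tokenization is a simplifying assumption.

```python
def distinct_n(queries, n=2):
    """Ratio of unique word n-grams to total n-grams across all queries."""
    total = 0
    unique = set()
    for query in queries:
        words = query.lower().split()
        ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Made-up queries, for illustration only.
repetitive = ["reset my password", "reset my password"]
varied = ["reset my password", "forgot login credentials"]
print(distinct_n(repetitive))  # 0.5
print(distinct_n(varied))      # 1.0
```

A value near 1.0 means few repeated phrases; comparing the score of the original data with that of the augmented data gives a quick sense of whether augmentation added variety or mostly near-duplicates.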