**Generative Text for Customer Support Automation**

Project Overview: Develop an AI-powered system that automates customer support interactions using generative language models such as GPT-3.5. The cells below fine-tune GPT-2.

In [1]:
%pip install transformers torch pandas nltk



In [2]:
import pandas as pd

# Create sample customer support data
data = {
    'text': [
        "How can I reset my password?",
        "What are your opening hours?",
        "Can I change my subscription plan?",
        "How do I update my billing information?",
        "I have a problem with my order. Can you help?",
        "Where can I find the user manual?",
        "My account is locked. What should I do?",
        "How do I contact customer support?",
        "What is the refund policy?",
        "Can I track my shipment online?",
        "I need help with the installation process.",
        "How can I cancel my order?",
        "Is there a discount for bulk purchases?",
        "Can I change the delivery address?",
        "How do I use the promotional code?",
        "The product I received is damaged. What now?",
        "How long does the warranty last?",
        "Can I return a product without the receipt?",
        "How do I know if my payment was successful?",
        "What payment methods do you accept?"
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)
# Save to CSV
file_path = "customer_support_data.csv"
df.to_csv(file_path, index=False)


In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [4]:
# Data cleaning and preprocessing function
# Build the stopword set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize text
    words = word_tokenize(text)
    # Remove stopwords and lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # Join words back into a single string
    return ' '.join(words)

# Apply preprocessing to the text data
df['text'] = df['text'].apply(preprocess_text)

# Save to CSV
file_path = "cleaned_customer_support_data.csv"
df.to_csv(file_path, index=False)
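
To see what this cleaning does to a single query, the sketch below re-implements the lowercase, regex, and stopword steps in plain Python. It is only an approximation: `SAMPLE_STOPWORDS` is a hand-picked subset standing in for NLTK's English stopword list, `clean_query` is a hypothetical helper, whitespace splitting replaces `word_tokenize`, and lemmatization is left out.

```python
import re

# Assumption: a small hand-picked subset of NLTK's English stopword list
SAMPLE_STOPWORDS = {"how", "can", "i", "my", "what", "are", "your", "do", "the", "is"}

def clean_query(text):
    """Lowercase, drop non-letters, and remove stopwords (no lemmatization)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    words = [w for w in text.split() if w not in SAMPLE_STOPWORDS]
    return " ".join(words)

print(clean_query("How can I reset my password?"))   # reset password
print(clean_query("What are your opening hours?"))   # opening hours
```

The question words disappear entirely. That can suit keyword-style matching, but it removes much of the phrasing a generative model would learn from. The next cell reads `customer_support_data.csv`, the uncleaned file.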



In [5]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from transformers import TextDataset, DataCollatorForLanguageModeling

# Load the raw (uncleaned) customer support data written in cell 2
data = pd.read_csv('customer_support_data.csv')

# Save the text data to a file
with open("train.txt", "w") as f:
    for line in data['text']:
        f.write(line + "\n")

In [6]:
# Function to load dataset
def load_dataset(file_path, tokenizer, block_size=128):
    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size
    )
    return dataset
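
As far as I understand `TextDataset`, it tokenizes the whole file into one long stream of token IDs and cuts that stream into consecutive blocks of `block_size`, discarding any trailing remainder. The plain-Python sketch below illustrates that chunking under this assumption; `chunk_token_ids` is a hypothetical helper, not part of the library.

```python
def chunk_token_ids(token_ids, block_size):
    """Split one long list of token IDs into full blocks; drop the short tail."""
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

# Hypothetical stream of 10 token IDs, chunked into blocks of 4
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)       # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(len(blocks))  # 2
```

A small text file can therefore produce very few training examples: with `block_size=128`, a file of fewer than 256 tokens yields a single block.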



In [7]:
# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [8]:
# Prepare the dataset and data collator
train_dataset = load_dataset("train.txt", tokenizer)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
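
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of the input IDs, and padding positions (when a pad token is defined) are set to -100 so the loss skips them. The sketch below mimics that behavior on plain lists; `collate_causal_lm` is a hypothetical stand-in, not the library implementation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal_lm(examples, pad_token_id=None):
    """Build a causal-LM batch: labels mirror input_ids, padding is masked out."""
    labels = [
        [IGNORE_INDEX if pad_token_id is not None and tok == pad_token_id else tok
         for tok in ex]
        for ex in examples
    ]
    return {"input_ids": [list(ex) for ex in examples], "labels": labels}

# Hypothetical batch where token ID 0 is used for padding
batch = collate_causal_lm([[5, 6, 7, 0], [8, 9, 0, 0]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7, -100], [8, 9, -100, -100]]
```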



In [9]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.32.1-py3-none-any.whl (314 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.32.1


In [10]:
!pip uninstall -y accelerate transformers
!pip install accelerate transformers[torch] --quiet

Found existing installation: accelerate 0.32.1
Uninstalling accelerate-0.32.1:
  Successfully uninstalled accelerate-0.32.1
Found existing installation: transformers 4.41.2
Uninstalling transformers-4.41.2:
  Successfully uninstalled transformers-4.41.2

In [11]:
import accelerate
import transformers
import torch

print("Accelerate version:", accelerate.__version__)
print("Transformers version:", transformers.__version__)
print("Torch version:", torch.__version__)

Accelerate version: 0.32.1
Transformers version: 4.41.2
Torch version: 2.3.0+cu121


In [12]:
from packaging import version
import accelerate

required_version = "0.21.0"
installed_version = accelerate.__version__

if version.parse(installed_version) >= version.parse(required_version):
    print(f"Accelerate version {installed_version} is compatible.")
else:
    raise ImportError(
        f"Accelerate version {installed_version} is not compatible. "
        f"Please install accelerate>={required_version}."
    )

Accelerate version 0.32.1 is compatible.
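
The check above goes through `version.parse` rather than comparing the raw strings, because string comparison works character by character and misorders versions. A minimal illustration, using a hypothetical `version_tuple` helper that only handles plain `X.Y.Z` strings:

```python
def version_tuple(v):
    """Turn a plain 'X.Y.Z' string into a tuple of ints (no pre-release tags)."""
    return tuple(int(part) for part in v.split("."))

print("0.9.0" >= "0.21.0")                                 # True  (string comparison, misleading)
print(version_tuple("0.9.0") >= version_tuple("0.21.0"))   # False (numeric comparison)
print(version_tuple("0.32.1") >= version_tuple("0.21.0"))  # True
```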


In [13]:
!pip install transformers torch --quiet


In [14]:
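# Tokenize the dataset
# GPT-2 has no padding token by default, so reuse EOS; without this,
# padding="max_length" raises an error when the function is called
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)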
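
For intuition, `truncation=True` with `padding="max_length"` makes every example exactly `max_length` tokens long and returns an attention mask marking which positions are real. The sketch below shows the idea on plain lists, assuming right-side padding; `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]        # truncation: cut anything past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)     # number of pad positions to append
    return ids + [pad_token_id] * padding, mask + [0] * padding

ids, mask = pad_or_truncate([11, 12, 13], max_length=5, pad_token_id=0)
print(ids)   # [11, 12, 13, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```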



In [15]:
!pip install pyarrow==10.0.1 datasets==2.11.0 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source

In [16]:
!pip uninstall -y pyarrow datasets cudf-cu12

Found existing installation: pyarrow 10.0.1
Uninstalling pyarrow-10.0.1:
  Successfully uninstalled pyarrow-10.0.1
Found existing installation: datasets 2.11.0
Uninstalling datasets-2.11.0:
  Successfully uninstalled datasets-2.11.0
Found existing installation: cudf-cu12 24.4.1
Uninstalling cudf-cu12-24.4.1:
  Successfully uninstalled cudf-cu12-24.4.1


In [17]:
%pip install --upgrade pip



Collecting pip
  Downloading pip-24.1.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.1.2


In [18]:

!pip install pyarrow==14.0.1 datasets --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.

In [19]:
# Initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

# Optimizer for the GPT-2 model loaded above
from torch.optim import AdamW
learning_rate = 1e-5  # Set your desired learning rate here
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Prepare the model and optimizer for the available device(s).
# Note: train_dataset is a Dataset, not a DataLoader, so prepare() hands it
# back as-is and iterating over it yields one unbatched example at a time.
model, optimizer, train_dataloader = accelerator.prepare(
    model,
    optimizer,
    train_dataset
)
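
`accelerator.prepare` wraps models, optimizers, and `DataLoader` objects; as far as I know, other objects come back as given. Passing a `Dataset` directly therefore means each iteration yields one example rather than a batch. In this notebook the usual route would be to wrap the dataset first, for example with `torch.utils.data.DataLoader(train_dataset, batch_size=..., collate_fn=data_collator)`. The plain-Python sketch below shows only the grouping such a loader adds; `iter_batches` and the sample data are hypothetical.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive groups of up to batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# Hypothetical dataset of five fixed-length examples (block_size = 3)
examples = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
for batch in iter_batches(examples, batch_size=2):
    print(len(batch), "x", len(batch[0]))
# prints "2 x 3", "2 x 3", "1 x 3"
```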

In [21]:
# Inspect the first item the loader yields before writing the training loop
num_train_epochs = 1
for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Check if 'batch' is a dictionary before proceeding
        if isinstance(batch, dict):
            print(f"Step {step} batch keys: {batch.keys()}")  # Print keys of the batch dictionary
            for key in batch:
                # Check if the item is a tensor before printing its shape
                if isinstance(batch[key], torch.Tensor):
                    print(f"{key} shape: {batch[key].shape}")
                else:
                    print(f"{key} is not a tensor")
        else:
            print(f"Step {step}: Batch is not a dictionary, it's a {type(batch)}")
        break

Step 0: Batch is not a dictionary, it's a <class 'torch.Tensor'>


In [22]:
import torch

# Training loop
num_train_epochs = 1
for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Each item is a 1-D tensor of token IDs (see the debug output above)
        print(f"Step {step} batch shape: {batch.shape}")

        # Add a batch dimension so the model receives shape (1, seq_len)
        if batch.dim() == 1:
            batch = batch.unsqueeze(0)

        # Move tensor to the appropriate device
        batch = batch.to(accelerator.device)

        # Blocks are fixed-size with no padding, so attend to every token
        attention_mask = torch.ones_like(batch)

        # Forward pass; passing labels makes the model compute the shifted
        # next-token cross-entropy loss internally
        outputs = model(input_ids=batch, attention_mask=attention_mask, labels=batch)
        loss = outputs.loss

        # Backward pass and optimization
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item()}")



Step 0 batch shape: torch.Size([128])
Epoch 0, Step 0, Loss: 2.3958230018615723
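
The loss reported here is the mean next-token cross-entropy: the logits at position t are scored against the token at position t+1. The sketch below computes the same quantity in plain Python for a toy case; `causal_lm_loss`, the 4-token vocabulary, and the uniform logits are all hypothetical and chosen so the answer is easy to check by hand.

```python
import math

def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy: position t is scored against token t+1."""
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[token_ids[t + 1]]  # -log softmax(row)[target]
    return total / (len(token_ids) - 1)

# Hypothetical 4-token vocabulary with uniform logits: loss equals log(4)
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = causal_lm_loss(uniform, [2, 0, 3])
print(round(loss, 4))            # 1.3863
print(round(math.exp(loss), 4))  # 4.0 (perplexity)
```

Exponentiating the loss gives perplexity. By the same conversion, the loss of about 2.40 printed above corresponds to a perplexity of roughly 11.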


In [23]:
# Save the model
model.save_pretrained("./gpt2-customer-support")
tokenizer.save_pretrained("./gpt2-customer-support")

('./gpt2-customer-support/tokenizer_config.json',
 './gpt2-customer-support/special_tokens_map.json',
 './gpt2-customer-support/vocab.json',
 './gpt2-customer-support/merges.txt',
 './gpt2-customer-support/added_tokens.json')

In [24]:
# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained("./gpt2-customer-support")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-customer-support")

In [25]:
# Function to generate responses
def generate_response(prompt):
    inputs = tokenizer.encode(prompt, return_tensors='pt')
    outputs = model.generate(inputs, max_length=150, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response


In [26]:
def generate_response(prompt):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors='pt')

    # Attention mask returned by the tokenizer
    attention_mask = inputs['attention_mask']

    # Generate response with beam search. Sampling settings (temperature,
    # top_k, top_p) are omitted: they only apply when do_sample=True.
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=attention_mask,
        max_length=250,  # Upper bound on total length (prompt + generated tokens)
        min_length=50,   # Ensure a minimum length for the response
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,  # Prevents repetition of 3-grams
        num_beams=5,  # Beam search for better output
    )

    # Decode the output
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Add specific instructions for changing the password
    if "change my password" in prompt.lower() or "reset my password" in prompt.lower():
        response += "\n\nTo change your password, follow these steps:\n"
        response += "1. Log in to your account.\n"
        response += "2. Go to 'Account Settings' or 'Profile'.\n"
        response += "3. Click on 'Security' or 'Password Management'.\n"
        response += "4. Enter your current password.\n"
        response += "5. Enter your new password and confirm it.\n"
        response += "6. Save the changes.\n"
        response += "If you encounter any issues, please contact support for further assistance."

    return response

# Example usage
customer_query = "How can I reset my password?"
response = generate_response(customer_query)
print("Response:", response)




Response: How can I reset my password?

You can reset your password at any time by going to Settings > Security > Reset Password.

How do I change my password if I don't want to use my account

If you want to change your password, you can do so by following these steps:

To change your password, follow these steps:
1. Log in to your account.
2. Go to 'Account Settings' or 'Profile'.
3. Click on 'Security' or 'Password Management'.
4. Enter your current password.
5. Enter your new password and confirm it.
6. Save the changes.
If you encounter any issues, please contact support for further assistance.
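
The `no_repeat_ngram_size=3` setting used above stops the model from producing any 3-gram twice. Conceptually, before each step the generator bans every token that would complete an n-gram already present in the output. The sketch below is a simplified plain-Python illustration of that rule, assuming `ngram_size >= 2`; `banned_next_tokens` is a hypothetical helper, not the library's implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`.

    Assumes ngram_size >= 2.
    """
    if len(generated) < ngram_size:
        return set()
    prefix = tuple(generated[-(ngram_size - 1):])  # last n-1 tokens
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned

tokens = ["to", "change", "your", "password", "to", "change"]
print(banned_next_tokens(tokens, ngram_size=3))  # {'your'}
```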


In [30]:
def generate_response(prompt):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors='pt')

    # Attention mask returned by the tokenizer
    attention_mask = inputs['attention_mask']

    # Generate response with beam search. Sampling settings (temperature,
    # top_k, top_p) are omitted: they only apply when do_sample=True.
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=attention_mask,
        max_length=250,  # Upper bound on total length (prompt + generated tokens)
        min_length=50,   # Ensure a minimum length for the response
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,  # Prevents repetition of 3-grams
        num_beams=5,  # Beam search for better output
    )

    # Decode the output
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Example usage
customer_query = "What are the avialable policy?"
response = generate_response(customer_query)
print("Response:", response)

Response: What are the avialable policy?

There are two main types of policy. The first is the policy of the government. The second type of policy is that of the private sector.

The private sector is the most important sector of the economy. It is responsible for the production of goods and services for the public. The private sector does not have a monopoly on the production and distribution of goods. It does not own the means of production. It has no control over the production or distribution of the goods or services. The government does not control the production, distribution or sale of the products or services of the public sector. In fact, it does not even control the distribution of these products and services. In other words, it is the government that controls the supply and demand of the product and service. This is why it is called the "private sector".

What is the difference between the government and private sector? What are the differences between the two types of gover