<a href="https://colab.research.google.com/github/Pranavi2606/Generative-Text-IIT-G/blob/main/T_Pranavi_IIT_G_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Case Study 1: Generative Text for Customer Support Automation**

Project Overview: Develop an AI-powered system to automate customer support interactions using generative models like GPT-3.5. This notebook prototypes the idea by fine-tuning GPT-2 on a small set of sample customer queries.

Use Cases:
1. Automated Response Generation:
*   Problem Statement: Customer support teams are overwhelmed by repetitive inquiries that could be handled by automated systems.
*   Solution: Implement a generative AI model to automatically generate accurate and context-aware responses to common customer queries, reducing the load on human agents and improving response times.

2. Personalized Customer Engagement:
*   Problem Statement: Customers expect personalized interactions that cater to their specific needs and preferences.
*   Solution: Use generative AI to create personalized engagement messages based on customer data and interaction history, enhancing customer satisfaction and loyalty.




In [1]:
!pip install transformers torch pandas nltk



In [2]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

Creating a customer care dataset

In [3]:
data = {
    'utterances': [
        "how can i reset my password?",
        "what are your opening hours?",
        "can i change my subscription plan?",
        "how do i update my billing information?",
        "i have a problem with my order. can you help?",
        "where can i find the user manual?",
        "my account is locked. what should i do?",
        "how do i contact customer support?",
        "What is the refund policy?",
        "I want to view the delivery options",
        "How do I track my order?",
        "Can I track my shipment online?",
        "I need help with the installation process.",
        "How can I cancel my order?",
        "I want to file a complaint for a service",
        "Is there a discount for bulk purchases?",
        "Can I change the delivery address?",
        "How do I use the promotional code?",
        "tell me how soon I can expect my ticket to arrive",
        "i have a question about my account",
        "The product I received is damaged. What now?",
        "How long does the warranty last?",
        "i am happy with the service, could i lodge an opinion?",
        "Can I return a product without the receipt?",
        "How do I know if my payment was successful?",
        "What payment methods do you accept?",
        "I want to subscribe to the newsletter"
    ]
}

Convert to DataFrame

In [4]:
df = pd.DataFrame(data)
df.head()

|   | utterances |
|---|---|
| 0 | how can i reset my password? |
| 1 | what are your opening hours? |
| 2 | can i change my subscription plan? |
| 3 | how do i update my billing information? |
| 4 | i have a problem with my order. can you help? |


Save the CSV file

In [5]:
df.to_csv("Customer_Support_Automation.csv", index=False)

Read the saved CSV File

In [6]:
df = pd.read_csv("Customer_Support_Automation.csv")

Download Natural Language Toolkit resources

In [7]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Data cleaning and preprocessing function

In [8]:
def preprocess_alltext(utterances):
    utterances = utterances.lower() # Convert all utterances to lowercase
    utterances = re.sub(r'[^a-z\s]', '', utterances) # Remove punctuation and numbers
    words = word_tokenize(utterances) # Tokenize utterances
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words] # Remove stopwords
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words] # Lemmatize words
    utterances = ' '.join(words) # Join words back into a single string
    return utterances
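
To see what the first two cleaning steps do on their own, here is a small sketch that needs no NLTK downloads. The helper name `strip_non_letters` and the sample strings are illustrative and not part of the notebook; stopword removal and lemmatization are deliberately left out.

```python
import re

def strip_non_letters(text):
    """Lowercase the text, then drop every character that is not a-z or whitespace."""
    text = text.lower()
    return re.sub(r'[^a-z\s]', '', text)

# Illustrative inputs: one from the dataset, one made up to show digit removal
samples = [
    "How can I reset my password?",
    "Order #123 arrived damaged!",
]
for sample in samples:
    print(repr(strip_non_letters(sample)))
```

Removing `#123` leaves two consecutive spaces in the second sample. In `preprocess_alltext` this is harmless, because tokenizing and re-joining with `' '.join` collapses the gap.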

Applying preprocessing to the utterances data

In [9]:
df['utterances'] = df['utterances'].apply(preprocess_alltext)

In [10]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from transformers import TextDataset, DataCollatorForLanguageModeling

with open("train.txt", "w") as f:
    for line in df['utterances']:
        f.write(line + "\n")

Function to load dataset

In [11]:
def load_dataset(file_path, tokenizer, block_size=128):
    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size
    )
    return dataset
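
The `block_size` argument controls how the tokenized file is cut into training examples. The sketch below is a simplified stand-in, assuming the dataset slices the token stream into consecutive fixed-size blocks and discards a trailing remainder shorter than `block_size`. The function name `chunk_into_blocks` and the integer "token ids" are hypothetical.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of exactly block_size ids.

    Assumption: a trailing remainder shorter than block_size is dropped, not padded.
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

# 300 fake token ids give two full blocks of 128; the last 44 ids are dropped
print(len(chunk_into_blocks(list(range(300)), 128)))
# Fewer ids than one block give no blocks at all
print(len(chunk_into_blocks(list(range(100)), 128)))
```

If that assumption holds, a corpus whose total token count is below `block_size` would yield an empty dataset, so it may be worth checking `len(train_dataset)` before training on a file as small as `train.txt`.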

Loading the pre-trained GPT-2 model and tokenizer

In [12]:
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)


Prepare the dataset and data collator

In [13]:
train_dataset = load_dataset("train.txt", tokenizer)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
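
With `mlm=False`, the collator prepares batches for causal (left-to-right) language modeling rather than masked language modeling: the labels are a copy of the input ids, and the model shifts them by one position when it computes the loss. A minimal sketch of the resulting prediction targets follows. The function name `causal_lm_pairs` and the toy ids are made up for illustration.

```python
def causal_lm_pairs(input_ids):
    """Return (context, target) pairs for next-token prediction.

    The labels start out as a copy of input_ids; shifting by one position
    aligns each prefix with the token that follows it.
    """
    labels = list(input_ids)
    return [(input_ids[:i], labels[i]) for i in range(1, len(input_ids))]

toy_ids = [11, 22, 33, 44]  # made-up token ids
for context, target in causal_lm_pairs(toy_ids):
    print(context, "->", target)
```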



In [14]:
!pip install accelerate -U



In [15]:
import accelerate
import transformers
import torch

print("Accelerate version:", accelerate.__version__)
print("Transformers version:", transformers.__version__)
print("Torch version:", torch.__version__)


Accelerate version: 0.32.1
Transformers version: 4.41.2
Torch version: 2.3.0+cu121


Tokenize the dataset

In [16]:
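def tokenize_function(examples):
    # GPT-2 ships without a padding token, so reuse the end-of-sequence token before padding
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)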
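
The `truncation`, `padding="max_length"`, and `max_length` options force every example to the same length. The sketch below imitates that on plain lists. The helper `pad_and_truncate` is hypothetical, padding on the right is assumed, and the pad id assumes GPT-2's end-of-sequence id (50256) is reused for padding, since GPT-2 defines no padding token of its own.

```python
def pad_and_truncate(token_ids, max_length, pad_id):
    """Cut token_ids to max_length, then right-pad with pad_id.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding

ASSUMED_PAD_ID = 50256  # GPT-2's end-of-sequence id, reused as padding by assumption

ids, mask = pad_and_truncate([10, 20, 30], max_length=5, pad_id=ASSUMED_PAD_ID)
print(ids)
print(mask)
```

With `max_length=512` and utterances only a few tokens long, almost every position would be padding, so a much smaller `max_length` may suit this dataset better.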

In [17]:
!pip install pyarrow==14.0.1 datasets --quiet


Initializing Accelerator

In [18]:
from accelerate import Accelerator
accelerator = Accelerator()

from torch.optim import AdamW
learning_rate = 1e-4
optimizer = AdamW(model.parameters(), lr=learning_rate)

Prepare the model, optimizer, and dataloader for distributed training

In [19]:
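from torch.utils.data import DataLoader

# accelerator.prepare() only wraps DataLoader objects, so batch the dataset first.
# batch_size=2 is an arbitrary choice for this small corpus.
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
model, optimizer, train_dataloader = accelerator.prepare(
    model,
    optimizer,
    train_dataloader
)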

Creating a training loop

In [20]:
num_train_epochs = 1
for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # TextDataset yields plain tensors of token ids, so wrap them in a dict first
        if isinstance(batch, torch.Tensor):
            batch = {'input_ids': batch, 'attention_mask': torch.ones_like(batch)}
        # Add a batch dimension to single examples and move tensors to the training device
        batch = {
            key: (value.unsqueeze(0) if value.dim() == 1 else value).to(accelerator.device)
            for key, value in batch.items()
        }
        # Passing labels makes the model return the language-modeling loss
        outputs = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['input_ids'],
        )
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item()}")

In [21]:
for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Check if 'batch' is a dictionary before proceeding
        if isinstance(batch, dict):
            print(f"Step {step} batch keys: {batch.keys()}")  # Print keys of the batch dictionary
            for key in batch:
                # Check if the item is a tensor before printing its shape
                if isinstance(batch[key], torch.Tensor):
                    print(f"{key} shape: {batch[key].shape}")
                else:
                    print(f"{key} is not a tensor")
        else:
            print(f"Step {step}: Batch is not a dictionary, it's a {type(batch)}")
        break  # Stop after processing the first batch

In [22]:
import torch

# Training loop
num_train_epochs = 1
for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Print batch shape
        print(f"Step {step} batch shape: {batch.shape}")

        # Ensure tensors have the correct dimensions
        if batch.dim() == 1:
            batch = batch.unsqueeze(0)

        # Move tensor to the appropriate device
        batch = batch.to(accelerator.device)

        # Assuming batch contains the input_ids and attention_mask
        # If batch is a tensor, you may need to generate attention masks
        attention_mask = torch.ones(batch.shape, device=batch.device)

        # Forward pass
        outputs = model(input_ids=batch, attention_mask=attention_mask, labels=batch) # Add labels for loss calculation
        # Check if model outputs a loss directly
        if hasattr(outputs, 'loss'):
            loss = outputs.loss
        else:
            # Calculate loss manually if needed (example with cross-entropy loss)
            loss_fn = torch.nn.CrossEntropyLoss()
            logits = outputs.logits  # Assuming your model outputs logits
            loss = loss_fn(logits.view(-1, logits.size(-1)), batch.view(-1))

        # Backward pass and optimization
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item()}")
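
When `labels` is passed to the forward pass, the model returns a next-token cross-entropy loss: the logits at position i are scored against the token at position i + 1. The pure-Python sketch below reproduces that calculation on toy numbers. The function name `causal_lm_loss` and the inputs are illustrative, and it is not the transformers implementation.

```python
import math

def causal_lm_loss(logits, labels):
    """Mean cross-entropy where the logits at position i are scored against labels[i + 1].

    logits: one list of vocabulary scores per position.
    labels: one token id per position (needs at least two positions).
    """
    total = 0.0
    count = 0
    for position_logits, target in zip(logits[:-1], labels[1:]):
        # Numerically stable log-sum-exp for the softmax normalizer
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[target]
        count += 1
    return total / count

vocab_size = 4
uniform_logits = [[0.0] * vocab_size for _ in range(3)]
# A model with no preference among 4 tokens scores log(4) per position
print(causal_lm_loss(uniform_logits, labels=[0, 1, 2]))
print(math.log(vocab_size))
```

The one-position shift matters: a manual loss that compares logits and labels at the same position would score each position against a token the model has already seen.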


In [23]:
model.save_pretrained("./gpt2-customer-support")
tokenizer.save_pretrained("./gpt2-customer-support")

('./gpt2-customer-support/tokenizer_config.json',
 './gpt2-customer-support/special_tokens_map.json',
 './gpt2-customer-support/vocab.json',
 './gpt2-customer-support/merges.txt',
 './gpt2-customer-support/added_tokens.json')

Load the fine-tuned model and tokenizer

In [24]:
model = GPT2LMHeadModel.from_pretrained("./gpt2-customer-support")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-customer-support")

Function to generate responses

In [25]:
def generate_response(query):
    # `query` avoids shadowing the built-in input()
    inputs = tokenizer.encode(query, return_tensors='pt')
    outputs = model.generate(inputs, max_length=150, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

In [26]:
def generate_response(query):
    # `query` avoids shadowing the built-in input()
    inputs = tokenizer(query, return_tensors='pt')
    attention_mask = inputs['attention_mask']
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=attention_mask,
        max_length=250,
        min_length=50,
        do_sample=True,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_beams=5,
        temperature=0.7,
        top_k=50,
        top_p=0.9)

    # Decode the output
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Append the support hotline when the customer asks to reach a human agent
    if "contact customer support" in query.lower() or "talk to person" in query.lower():
        response += "\n\nTo talk to a customer support executive. Please dial 1800-1234-5678"

    return response

# Example usage
customer_query = "How can I contact customer support executive?"
response = generate_response(customer_query)
print("Generated Response:", response)



Generated Response: How can I contact customer support executive?

If you have any questions or concerns, please contact Customer Support.

How do I find out if I am eligible to apply for a credit card?


You can apply for credit card at any of the following locations:

CVS Pharmacy

P.O. Box 622

San Francisco, CA 94103

Phone: (415) 639-7000

Email: support@cvspharmacy.com

Website: www.cvs.com/creditcard

What if I don't have my credit card number or credit card information on file with the credit card company or if I'm not sure if I need to contact the card company to get my information?




If your credit card is not on file, you may not be able to contact your card company. If you do have your card number on file and you don't want to contact them, you can file a complaint with the Consumer Financial Protection Bureau (CFPB) at 1-800-FRA-1222. You can also file an online complaint at www.consumerfinance.gov/consumer-finance-complaint

To talk to a customer support executive. Plea
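
The `temperature`, `top_k`, and `top_p` arguments passed to `model.generate` in the sampling cell reshape the next-token distribution before a token is drawn. The following is a simplified, pure-Python sketch of that idea, not the transformers implementation: the function names and toy values are made up for illustration, and beam search (`num_beams`) is not modeled.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature below 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized probability mass reaches top_p. Returns {token_id: probability}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    kept_mass = sum(p for _, p in ranked)
    ranked = [(token_id, p / kept_mass) for token_id, p in ranked]
    nucleus, cumulative = [], 0.0
    for token_id, p in ranked:
        nucleus.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    nucleus_mass = sum(p for _, p in nucleus)
    return {token_id: p / nucleus_mass for token_id, p in nucleus}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print(top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8))
print(softmax([2.0, 1.0, 0.0], temperature=0.7))
```

In this sketch a token is then sampled from the surviving entries, so lowering `top_p` or `temperature` trades variety for more predictable text.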