### Introduction

This project uses natural language processing (NLP) tools from the Hugging Face Transformers library to analyze and classify text data. It focuses on two functionalities: sentiment analysis and zero-shot classification. Both rely on pretrained models, so they can be applied without retraining or pre-labeled datasets.



### Project Summary
The implementation demonstrates how sentiment analysis can quickly determine the emotional tone of a statement, while zero-shot classification categorizes text into user-defined categories. The real-world use case analyzes a dataset of service desk complaint tickets. The challenge is to classify customer complaints into predefined ticket types, simulating a scenario where this information is missing. Using the "facebook/bart-large-mnli" model, the system assigns the most relevant category to each complaint.

Note: Zero-shot classification, as demonstrated here, does not require labeled training data, which makes it suitable for settings where the categories change often. By automating classification and analysis, businesses can save time and improve customer experiences across various touchpoints.

In [None]:
# Install the version that works with the models

!pip install transformers==4.47.1





In [None]:
!pip install torch



In [None]:
# Import the relevant modules

from transformers import pipeline

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu
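The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so follows, reusing the model name and revision printed in the log. The constant names and the `build_sentiment_pipeline` helper are illustrative assumptions, not part of the original notebook.

```python
# Model name and revision copied from the warning printed above.
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "714eb0f"


def build_sentiment_pipeline():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so the constants can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=SENTIMENT_MODEL,
        revision=SENTIMENT_REVISION,
    )
```

Calling `build_sentiment_pipeline()` should then load the same model as the default, without relying on the library's default choice staying the same.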


In [None]:
# Test the sentiment analysis pipeline on a sample sentence to see how it works

sentiment_pipeline("I love cooking and dancing")

[{'label': 'POSITIVE', 'score': 0.9996930360794067}]
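The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. When a single number is easier to work with, the two can be combined into a signed score. This is a small sketch; `signed_sentiment` is a hypothetical helper and it assumes the model's labels are `POSITIVE` and `NEGATIVE`.

```python
def signed_sentiment(prediction):
    """Map a pipeline prediction to a score in [-1, 1]; negative means NEGATIVE."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Result copied from the cell output above.
example = [{"label": "POSITIVE", "score": 0.9996930360794067}]
print(signed_sentiment(example[0]))
```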

In [None]:
# Initialize a zero-shot classification pipeline using the "facebook/bart-large-mnli" model.
# Zero-shot classification categorizes text into custom, user-defined labels without any prior training on those labels.
# It is particularly useful when the categories vary or are not known at model training time.

zero_shot_classification = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

Device set to use cpu
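The zero-shot pipeline builds on a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, and the model scores how strongly the input text entails it. The sketch below only shows how such premise and hypothesis pairs could be built. It assumes the template `"This example is {}."`, and `build_nli_pairs` is a hypothetical helper, not a library function.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Return (premise, hypothesis) pairs, one per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs("one day I will see the world", ["travel", "cooking", "dancing"])
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

Each pair is scored separately, so the cost of one classification grows with the number of candidate labels.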


### Let's test this model on some sentences and see how well it performs.

In [None]:

sequence_to_classify = "one day I will see the world"
candidate_labels = ["travel", "cooking", "dancing"]
zero_shot_classification(sequence_to_classify, candidate_labels)

{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938650727272034, 0.0032737907022237778, 0.002861032262444496]}

In [None]:
sequence_to_classify = "I am learning how to cook a new recipe"
candidate_labels = ["travel", "cooking", "dancing"]
zero_shot_classification(sequence_to_classify, candidate_labels)

{'sequence': 'I am learning how to cook a new recipe',
 'labels': ['cooking', 'dancing', 'travel'],
 'scores': [0.9957269430160522, 0.002226715674623847, 0.0020463846158236265]}

In [None]:
sequence_to_classify = "I am learning some frameworks used by AI engineers so I can become the best AI engineer ever"
candidate_labels = ["travel", "cooking", "dancing", "education", "career"]
zero_shot_classification(sequence_to_classify, candidate_labels)

{'sequence': 'I am learning some frameworks used by AI engineers so I can become the best AI engineer ever',
 'labels': ['career', 'education', 'travel', 'dancing', 'cooking'],
 'scores': [0.7340741753578186,
  0.15928830206394196,
  0.0577242448925972,
  0.028168536722660065,
  0.02074473537504673]}
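In this last example, two labels (career and education) plausibly apply. One way to surface more than the top label is to keep every label whose score clears a cutoff. The sketch below does that on the result shown above; the `labels_above` helper and the 0.1 threshold are arbitrary assumptions for illustration.

```python
def labels_above(result, threshold=0.1):
    """Return (label, score) pairs whose score is at least `threshold`."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


# Result copied from the cell output above.
result = {
    "sequence": "I am learning some frameworks used by AI engineers so I can become the best AI engineer ever",
    "labels": ["career", "education", "travel", "dancing", "cooking"],
    "scores": [
        0.7340741753578186,
        0.15928830206394196,
        0.0577242448925972,
        0.028168536722660065,
        0.02074473537504673,
    ],
}
print(labels_above(result))
```

The pipeline also accepts `multi_label=True`, which scores each label independently instead of normalizing the scores across labels; that may be a better fit when several categories can apply at once.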

### Test on a Real-World Dataset

In this example, we'll use a service desk complaint tickets dataset. While the dataset includes predefined ticket types, we will simulate a scenario where this information is missing.

Although I loaded the data manually, it can also be downloaded with `kagglehub`:


```
import kagglehub

# Download the dataset; `path` points to the local copy of the downloaded files
path = kagglehub.dataset_download("suraj520/customer-support-ticket-dataset")

```
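If the dataset is downloaded this way, the CSV file still has to be located before it can be read. The sketch below assumes that `dataset_download` returns the local directory holding the dataset files; `find_csv_files` is a hypothetical helper that lists the CSV files under that directory.

```python
from pathlib import Path


def find_csv_files(dataset_dir):
    """Return the CSV files found under `dataset_dir`, sorted by path."""
    return sorted(Path(dataset_dir).rglob("*.csv"))


# Possible usage once `path` has been set by the download step:
# csv_files = find_csv_files(path)
# data = pd.read_csv(csv_files[0])
```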



In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the CSV file containing the customer support tickets extracted from Kaggle
path = '/content/drive/My Drive/customer_support_tickets.csv'
data = pd.read_csv(path)

In [None]:
data.head()

Unnamed: 0,Ticket ID,Customer Name,Customer Email,Customer Age,Customer Gender,Product Purchased,Date of Purchase,Ticket Type,Ticket Subject,Ticket Description,Ticket Status,Resolution,Ticket Priority,Ticket Channel,First Response Time,Time to Resolution,Customer Satisfaction Rating
0,1,Marisa Obrien,carrollallison@example.com,32,Other,GoPro Hero,2021-03-22,Technical issue,Product setup,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Social media,2023-06-01 12:15:36,,
1,2,Jessica Rios,clarkeashley@example.com,42,Female,LG Smart TV,2021-05-22,Technical issue,Peripheral compatibility,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Chat,2023-06-01 16:45:38,,
2,3,Christopher Robbins,gonzalestracy@example.com,48,Other,Dell XPS,2020-07-14,Technical issue,Network problem,I'm facing a problem with my {product_purchase...,Closed,Case maybe show recently my computer follow.,Low,Social media,2023-06-01 11:14:38,2023-06-01 18:05:38,3.0
3,4,Christina Dillon,bradleyolson@example.org,27,Female,Microsoft Office,2020-11-13,Billing inquiry,Account access,I'm having an issue with the {product_purchase...,Closed,Try capital clearly never color toward story.,Low,Social media,2023-06-01 07:29:40,2023-06-01 01:57:40,3.0
4,5,Alexander Carroll,bradleymark@example.com,67,Female,Autodesk AutoCAD,2020-02-04,Billing inquiry,Data loss,I'm having an issue with the {product_purchase...,Closed,West decision evidence bit.,Low,Email,2023-06-01 00:12:42,2023-06-01 19:53:42,1.0


In [None]:
# Extract unique ticket types defined by the organization to provide contextual knowledge.
# These ticket types will serve as our candidate labels for classification, as referenced earlier.

data["Ticket Type"].unique()

array(['Technical issue', 'Billing inquiry', 'Cancellation request',
       'Product inquiry', 'Refund request'], dtype=object)

The `Ticket Description` column contains the complaints or queries raised by customers. These descriptions will be classified into predefined categories (candidate labels). The candidate labels are the five ticket types found in the dataset (Technical issue, Billing inquiry, Cancellation request, Product inquiry, Refund request) plus one additional label, Customer service, which does not appear in the `Ticket Type` column.

Using zero-shot classification, we assign each ticket description to the most relevant label.

In [None]:
# Classify a single ticket description first (pass one string, not the whole column)
sequence_to_classify = data['Ticket Description'].iloc[0]
candidate_labels = ['Technical issue', 'Billing inquiry', 'Cancellation request',
                    'Product inquiry', 'Refund request', 'Customer service']
zero_shot_classification(sequence_to_classify, candidate_labels)

{'sequence': "I'm having an issue with the {product_purchased}. Please assist.\n\nYour billing zip code is: 71701.\n\nWe appreciate that you have requested a website address.\n\nPlease double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists.",
 'labels': ['Product inquiry',
  'Billing inquiry',
  'Technical issue',
  'Customer service',
  'Refund request',
  'Cancellation request'],
 'scores': [0.34255823493003845,
  0.29293471574783325,
  0.1684422492980957,
  0.1558038592338562,
  0.024653857573866844,
  0.0156070776283741]}
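The description above still contains the literal placeholder `{product_purchased}`. Filling it with the value from the `Product Purchased` column before classification may give the model more context, although whether it changes the predicted label is not tested here. A minimal sketch, with `fill_product_placeholder` as a hypothetical helper:

```python
def fill_product_placeholder(description, product):
    """Replace the literal '{product_purchased}' token with the product name."""
    return description.replace("{product_purchased}", product)


text = "I'm having an issue with the {product_purchased}. Please assist."
print(fill_product_placeholder(text, "GoPro Hero"))

# Possible usage on the full DataFrame:
# data["Ticket Description"] = data.apply(
#     lambda row: fill_product_placeholder(row["Ticket Description"], row["Product Purchased"]),
#     axis=1,
# )
```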

In [None]:
candidate_labels = ['Technical issue', 'Billing inquiry', 'Cancellation request',
       'Product inquiry', 'Refund request', 'Customer service']

In [None]:
# Define a function to classify a single sequence and return the recommended class
def get_recommended_class(description):
    result = zero_shot_classification(description, candidate_labels)
    return result['labels'][0]  # The top recommended label


In [None]:
# Apply the classifier to each ticket description and store the recommended class
data["Recommended Class"] = data["Ticket Description"].apply(get_recommended_class)
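Applying the classifier row by row over thousands of tickets can take a long time on CPU. The pipeline also accepts a list of texts, so one alternative is to send the descriptions in chunks. The sketch below is an assumption about how that could look; `classify_in_batches` and the batch size of 16 are hypothetical, and any callable with the same interface as the pipeline can be passed as `classifier`.

```python
def classify_in_batches(texts, classifier, labels, batch_size=16):
    """Return the top label for each text, calling `classifier` on lists of texts."""
    top_labels = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        results = classifier(batch, labels)
        # Some callables return a single dict for a single input; normalize to a list.
        if isinstance(results, dict):
            results = [results]
        top_labels.extend(result["labels"][0] for result in results)
    return top_labels


# Possible usage with the objects defined earlier:
# descriptions = data["Ticket Description"].tolist()
# data["Recommended Class"] = classify_in_batches(
#     descriptions, zero_shot_classification, candidate_labels
# )
```

Trying this on a small slice first (for example the first 50 tickets) gives a rough idea of the total runtime before committing to the full dataset.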

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8469 entries, 0 to 8468
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticket ID                     8469 non-null   int64  
 1   Customer Name                 8469 non-null   object 
 2   Customer Email                8469 non-null   object 
 3   Customer Age                  8469 non-null   int64  
 4   Customer Gender               8469 non-null   object 
 5   Product Purchased             8469 non-null   object 
 6   Date of Purchase              8469 non-null   object 
 7   Ticket Type                   8469 non-null   object 
 8   Ticket Subject                8469 non-null   object 
 9   Ticket Description            8469 non-null   object 
 10  Ticket Status                 8469 non-null   object 
 11  Resolution                    2769 non-null   object 
 12  Ticket Priority               8469 non-null   object 
 13  Tic


Retrieve and display the recommended classes alongside other ticket details. The selected columns provide a comprehensive view of each ticket, including:
- Ticket and customer details (e.g., ID, Name, Email, Age, Gender)
- Product and purchase information
- Ticket metadata (e.g., Type, Subject, Description)
- The recommended classification result for each ticket.

Note: Running zero-shot classification over a large dataset can be computationally intensive.
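Because the dataset still contains the original `Ticket Type` column, the recommended class can be compared against it to see how often the two agree. The sketch below shows one way to compute that rate. The `agreement_rate` helper is hypothetical, and the toy rows are made up for illustration; they are not results from the dataset.

```python
import pandas as pd


def agreement_rate(frame, truth_col="Ticket Type", predicted_col="Recommended Class"):
    """Return the fraction of rows where the predicted label equals the original one."""
    return float((frame[truth_col] == frame[predicted_col]).mean())


# Toy rows for illustration only; these are not results from the dataset.
toy = pd.DataFrame({
    "Ticket Type": ["Technical issue", "Billing inquiry", "Refund request", "Product inquiry"],
    "Recommended Class": ["Technical issue", "Product inquiry", "Refund request", "Customer service"],
})
print(agreement_rate(toy))
```

Rows predicted as Customer service can never match, since that label does not occur in `Ticket Type`.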

In [None]:
# Select the ticket details alongside the recommended class.
# This requires the "Recommended Class" column created by the apply step above, which can be slow to compute.
df = data[["Ticket ID", "Customer Name", "Customer Email", "Customer Age", "Customer Gender", "Product Purchased", "Date of Purchase",
           "Ticket Type", "Ticket Subject", "Ticket Description", "Recommended Class"]]
df.head()

KeyError: "['Recommended Class'] not in index"
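The `KeyError` above indicates that the `Recommended Class` column did not exist when the cell ran, which would happen if, for example, the classification step had not finished. A small guard can report missing columns instead of failing. This is a sketch; `select_existing_columns` is a hypothetical helper, and the toy DataFrame is for illustration only.

```python
import pandas as pd


def select_existing_columns(frame, columns):
    """Return `frame` restricted to the requested columns that exist, plus the missing names."""
    present = [col for col in columns if col in frame.columns]
    missing = [col for col in columns if col not in frame.columns]
    return frame[present], missing


# Toy DataFrame for illustration only.
toy = pd.DataFrame({"Ticket ID": [1, 2], "Ticket Type": ["Technical issue", "Billing inquiry"]})
subset, missing = select_existing_columns(toy, ["Ticket ID", "Recommended Class"])
print(missing)
```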

### Broader Applications

This technique can be applied to various domains, including:
*   Customer Support Automation: Streamlining ticket handling by classifying inquiries into predefined categories.
*   Content Moderation: Categorizing user-generated content to ensure compliance with community guidelines.
*   Market Research: Analyzing customer feedback and reviews to identify key trends and concerns.
*   HR Operations: Classifying employee feedback or support tickets to enhance operational efficiency.
*   Education Technology: Categorizing student questions or feedback for personalized responses.



