### Introduction

Gathering and filtering customer requirements for product improvements and enhancements becomes complex when scaling to millions of customers. This simple solution aims to summarize customer conversations and prioritize the requirements with a few lines of code.

**ML Approach:**
1. Read the data
2. Summarize the conversations
3. Identify the unmet needs
4. Score the needs
5. Save results

**Output**
1. unmet_needs.csv contains the unmet customer needs extracted from the conversations, with their scores
2. result.csv contains the entire output, including the summary and needs for each conversation

**Assumptions:**
1. All conversations revolve around the same product. 
2. Scoring of the needs is based on generic business impact.

### Imports

In [None]:
# !pip install webvtt-py
# !pip install tiktoken
# !pip install "openai<1.0"  # this notebook uses the legacy openai.ChatCompletion interface

In [None]:
# Import Libraries
import webvtt
import tiktoken
import time
import os
import pandas as pd
import re
import json
import openai

### Variables

Extract the zip file and set folder_path to the unzipped folder.

In [3]:
# Input unzipped folder location
folder_path = "transcripts"
# Default OpenAI model used
MODEL = "gpt-3.5-turbo-16k"
# OpenAI key
openai.api_key = "insert-your-key-here"

### Helper Functions

In [4]:
def timing(func):
    """
    A decorator to measure the execution time of a function.

    Args:
        func (callable): The function to be timed.

    Returns:
        callable: A wrapped function that measures execution time.
    """
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        elapsed_time = end_time - start_time
        print(f"{func.__name__} took {elapsed_time:.4f} seconds to execute.")
        return result
    return wrapper

In [5]:
def save_dataframe_to_csv(df, filename, index=False):
    """
    Save a DataFrame to a CSV file.

    Args:
        df (pd.DataFrame): The DataFrame to be saved.
        filename (str): The name of the CSV file to be created.
        index (bool): Whether or not to include the DataFrame index in the CSV file (default is False).

    Returns:
        None
    """
    df.to_csv(filename, index=index)

In [6]:
@timing
def read_csv_to_dataframe(file_path):
    """
    Read a CSV file into a DataFrame.

    Args:
        file_path (str): The path to the CSV file to be read.

    Returns:
        pd.DataFrame or None: The DataFrame if successful, None if there was an error.
    """
    try:
        df = pd.read_csv(file_path)
        return df
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return None
    except Exception as e:
        print(f"Error: An exception occurred - {e}")
        return None

In [7]:
def num_tokens_from_string(string, model_name="gpt-3.5-turbo"):
    """
    Calculate the number of tokens in a text string using the tokenizer of a given model.

    Args:
        string (str): The input text string.
        model_name (str): The name of the model whose encoding is used.

    Returns:
        int: The number of tokens in the input string.
    """
    # encoding_for_model expects a model name, not an encoding name
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [8]:
def json_string_to_dict(json_str):
    """
    Convert a JSON string to a Python dictionary.

    Args:
        json_str (str): The JSON string to be converted.

    Returns:
        dict or None: The dictionary representation of the JSON string or None if decoding fails.
    """
    try:
        # Use json.loads() to parse the JSON string into a dictionary
        data_dict = json.loads(json_str)
        return data_dict
    except (json.JSONDecodeError, TypeError) as e:
        # TypeError covers a None input, e.g. when the model call failed upstream
        print(f"Error decoding JSON: {e}")
        return None

In [9]:
@timing
def get_completion_from_messages(messages, model=MODEL, temperature=0):
    """
    Generate a completion from a list of messages using an OpenAI chat model.

    Args:
        messages (list): A list of message objects in the conversation.
        model (str): The name of the chat model to use (default is MODEL).
        temperature (float): The temperature parameter for randomness in text generation (default is 0).

    Returns:
        str or None: The generated completion text, or None if an error occurs.
    """
    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            temperature=temperature,
        )
        if response.choices and response.choices[0].message:
            return response.choices[0].message["content"]
        else:
            print("Error: No response content received.")
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None

In [10]:
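def lists_to_dataframe(*lists, columns=None):
    """
    Convert a variable number of lists into a DataFrame.

    Args:
        *lists: Variable number of equal-length lists, one per column.
        columns (list of str, optional): Column names (default is 'col_0', 'col_1', ...).

    Returns:
        pd.DataFrame: A DataFrame with one column per input list.
    """
    # A function cannot recover the caller's variable names, so names are passed explicitly
    if columns is None:
        columns = [f"col_{i}" for i in range(len(lists))]
    data = dict(zip(columns, lists))
    df = pd.DataFrame(data)
    return df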

### Data Loading

In [11]:
def get_vtt_files(folder_path):
    """
    Get a list of .vtt files in the specified folder.

    Args:
        folder_path (str): The path to the folder to search for .vtt files.

    Returns:
        list: File paths to the .vtt files in the folder.
    """
    # Use a list comprehension to create a list of file paths
    files = [os.path.join(folder_path, file) for file in os.listdir(folder_path) if file.lower().endswith(".vtt")]
    return files

In [12]:
@timing
def read_vtt_files(files):
    """
    Read .vtt files and create a DataFrame.

    :param files: List of file paths to .vtt files.
    :type files: list
    :return: DataFrame containing the parsed data.
    :rtype: pandas.DataFrame
    """
    start_times, end_times, captions, identifiers, sources = [], [], [], [], []
    
    for file in files:
        try:
            for caption in webvtt.read(file):
                start_times.append(caption.start)
                end_times.append(caption.end)
                captions.append(caption.text)
                identifiers.append(caption.identifier)
                sources.append(file)
        except Exception as e:
            print(f"Error reading file '{file}': {e}")
    
    data = {
        'start_time': start_times,
        'end_time': end_times,
        'caption': captions,
        'identifier': identifiers,
        'source': sources
    }
    
    df = pd.DataFrame(data)
    return df

In [13]:
# Get data
files = get_vtt_files(folder_path)
df = read_vtt_files(files)
# df.head()
df.tail()

read_vtt_files took 0.1352 seconds to execute.


Unnamed: 0,start_time,end_time,caption,identifier,source
3401,00:38:11.594,00:38:14.997,good day and you hope you enjoy\nyour weekend ...,db45b327-207f-49d4-ac7b-4a7643aee86e-8,transcripts/call-4.vtt
3402,00:38:14.997,00:38:15.699,closer to it.,db45b327-207f-49d4-ac7b-4a7643aee86e-9,transcripts/call-4.vtt
3403,00:38:17.160,00:38:18.250,"Thank you, Bernard.",f40fd9f1-c9e5-4012-aff4-6df494e75515-0,transcripts/call-4.vtt
3404,00:38:19.600,00:38:20.200,Personas.,b145f7c4-a6e7-4e71-83ef-f03449bfb101-0,transcripts/call-4.vtt
3405,00:38:22.550,00:38:25.250,"Alright, cheers. Bye, bye.",b63b24f8-6717-4e75-814d-3b61dc052e45-0,transcripts/call-4.vtt


### Data Pre-processing

In [14]:
@timing
def preprocess_dataframe(df):
    """
    Preprocess a DataFrame containing captions.

    Args:
        df (pd.DataFrame): The DataFrame to be preprocessed.

    Returns:
        pd.DataFrame: The preprocessed DataFrame.
    """
    df = df.sort_values(by=['source', 'start_time'])
    df['caption'] = df['caption'].astype(str)
    df['caption'] = df['caption'].str.replace('\n', ' ')
    df['identifier'] = df['identifier'].apply(lambda x: re.sub(r'-\d+$', '', x))
    mask = df['identifier'] != df['identifier'].shift()
    
    df['group_id'] = mask.cumsum()
    df = df.groupby('group_id').agg({'start_time': 'first', 'end_time': 'last', 'caption': ' '.join, 'identifier': 'first', 'source': 'first'}).reset_index()
    df = df.drop('group_id', axis=1)
    return df

In [15]:
df = preprocess_dataframe(df)

preprocess_dataframe took 0.0582 seconds to execute.


In [16]:
def condense_caption(df):
    """
    Concatenate the captions of each call into a single row per 'source'.

    Args:
        df (pd.DataFrame): The DataFrame to be processed.

    Returns:
        pd.DataFrame: The processed DataFrame, one row per source file.
    """
    # Join all captions of a call in one aggregation, without modifying the caller's DataFrame
    df = df.groupby('source').agg({'start_time': 'first', 'end_time': 'last', 'caption': ' '.join, 'identifier': 'first'}).reset_index()
    return df

In [17]:
df = condense_caption(df)
df.head()

Unnamed: 0,source,start_time,end_time,caption,identifier
0,transcripts/call-1.vtt,00:00:02.240,00:51:16.610,"The platform through its. I mean, every other ...",eaba48af-ba4d-4649-8573-02552246990a
1,transcripts/call-2.vtt,00:00:02.260,00:29:46.980,Sure. That's your transition to the new platfo...,7fb9629a-5ce5-4f06-b23c-9a766a37533a
2,transcripts/call-3.vtt,00:00:04.410,00:39:46.980,"OK, cool. Um, so as a as, as sort of in terms ...",95b08182-2301-48f1-884d-66b5e5a1f273
3,transcripts/call-4.vtt,00:00:02.220,00:38:25.250,Directly with the people who build the platfor...,e0968d7f-6ad8-47ec-88c8-4f5bbf9cea9b
4,transcripts/call-5.vtt,00:00:02.780,00:47:49.320,"Um, So what I'll do is um. I'll share my scree...",a3b83e0e-077b-4d3a-9d9a-e1ab66edeb2d


In [18]:
# Get max number of tokens
token_length = df['caption'].apply(num_tokens_from_string)
print(max(token_length))

10102


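The longest conversation is 10,102 tokens, which fits within the 16K-token context window of gpt-3.5-turbo-16k. Note that the instruction and the model's response count toward the same window. If a conversation were longer, we would need either a model with a larger context window or to split the input text into smaller chunks that each fit.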
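If a transcript did exceed the window, the chunking option could look roughly like the sketch below. It is a minimal illustration rather than part of this notebook: `chunk_by_token_budget` and its `count_tokens` parameter are hypothetical names, and the default whitespace word count is only a crude stand-in for a real tokenizer such as the `num_tokens_from_string` helper defined earlier.

```python
from typing import Callable, List


def chunk_by_token_budget(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Split text into chunks whose token count stays within max_tokens.

    Sentences are kept whole. A single sentence longer than the budget
    still becomes its own (oversized) chunk in this sketch.
    """
    # Naive sentence split on ". "; a real pipeline may want a proper splitter
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current}. {sentence}" if current else sentence
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "The export is slow. We need filters. Charts would help. Thanks"
chunks = chunk_by_token_budget(sample, max_tokens=6)
print(chunks)
```

Each chunk could then be summarized separately and the partial summaries combined in a final prompt; that second step is left out here.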

### Modeling

In [20]:
def get_unmet_need(instruction, text):
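    # The instruction ends with an opening triple backtick; close the delimiter after the conversation
    prompt = instruction + text + "\n```"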
    messages = [{'role': 'user', 'content': prompt}]
    response = get_completion_from_messages(messages)
    return response

In [21]:
def process_conversations(df, instruction):
    result = []
    for conv in df['caption'].tolist():
        response_str = get_unmet_need(instruction, conv)
        response_dict = json_string_to_dict(response_str)
        result.append(response_dict)
    return result

In [22]:
# Instruction
instruction = """
1. Create a summary of the conversation delimited by triple backticks in fewer than 100 words.
2. Identify the unmet product needs from the conversation. Give each need a score from 1 to 10, where 10 is the highest business impact of fulfilling the need. Provide no more than 5 needs.
Please provide the response in JSON format with 2 keys: 1. 'summary', 2. 'needs', where 'needs' maps each need to its score.
```
"""
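The prompt asks for a specific JSON shape, but the model is not guaranteed to comply. A minimal sketch of a shape check is shown below; `parse_model_response` and `max_needs` are hypothetical names introduced for illustration, and the rules (two keys, at most 5 needs, scores from 1 to 10) are taken from the instruction text.

```python
import json
from typing import Optional


def parse_model_response(response_str: Optional[str], max_needs: int = 5) -> Optional[dict]:
    """Return the parsed response if it matches the requested shape, else None."""
    if response_str is None:
        return None
    try:
        data = json.loads(response_str)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    summary, needs = data.get("summary"), data.get("needs")
    if not isinstance(summary, str) or not isinstance(needs, dict):
        return None
    if len(needs) > max_needs:
        return None
    # Every score must be a number between 1 and 10 (bool is excluded explicitly)
    for score in needs.values():
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            return None
        if not 1 <= score <= 10:
            return None
    return data


good = '{"summary": "Short recap.", "needs": {"Export visuals": 9}}'
bad = '{"summary": "Short recap.", "needs": {"Export visuals": 12}}'
parsed_good = parse_model_response(good)
parsed_bad = parse_model_response(bad)
print(parsed_good, parsed_bad)
```

Rejected responses could be retried or logged for manual review instead of being written to the result file.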

In [23]:
result = process_conversations(df, instruction)

get_completion_from_messages took 8.6151 seconds to execute.
get_completion_from_messages took 6.5500 seconds to execute.
get_completion_from_messages took 9.4113 seconds to execute.
get_completion_from_messages took 7.9129 seconds to execute.
get_completion_from_messages took 6.4599 seconds to execute.


In [24]:
result

[{'summary': 'The conversation was about the new platform and its features. The participants discussed the need for a simple user interface, the ability to customize the dashboard, and the importance of clear definitions for data fields. They also mentioned the need for better organization of data and the option to collapse or expand sections. The participants provided feedback on the report builder and suggested improvements for data visualization. They mentioned that they do not use data on the go and prefer to access platforms on laptops or desktops.',
  'needs': {'Simple user interface': 8,
   'Customizable dashboard': 7,
   'Clear definitions for data fields': 9,
   'Better organization of data': 6,
   'Improved data visualization': 7}},
 {'summary': 'The conversation was about the new platform and its features. Belinda, an information scientist advisor, discussed her use of the platform for research and development purposes. The platform allows her to access information on compet

### Post Processing

In [25]:
result_df = pd.DataFrame(result)
result = pd.concat([df, result_df], axis=1)
save_dataframe_to_csv(result, 'result.csv')

In [26]:
def extract_needs_and_scores(result):
    """
    Extracts 'needs' and 'score' values from a DataFrame column.

    Args:
        result (pd.DataFrame): The DataFrame containing the 'needs' column.

    Returns:
        tuple: A tuple of two lists, 'needs' and 'score'.
    """
    needs = []
    score = []

    for needs_dict in result['needs']:
        for need, s in needs_dict.items():
            needs.append(need)
            score.append(s)

    return needs, score

In [27]:
needs, score = extract_needs_and_scores(result)

# Save unmet needs
unmet_need_df = pd.DataFrame({'needs':needs, 'score':score})
unmet_need_df.sort_values('score', ascending=False, inplace = True)
save_dataframe_to_csv(unmet_need_df, 'unmet_needs.csv')

### Summary

In [28]:
# Print top unmet needs
top_unmet_needs = unmet_need_df['needs'].head(5).to_list()
output = '\n'.join(top_unmet_needs)
print(f"The key unmet needs from the product in the conversations are:\n{output}")

The key unmet needs from the product in the conversations are:
Ability to filter and analyze data at a granular level
Export visuals
Clear definitions for data fields
Contact analyst functionality
Charts for sales data


### Further Exploration

**Data**
- Text chunking: make the pipeline more robust by splitting the input text into smaller chunks if its token length exceeds the 16K-token context window.

**Engineering**
- Modular, object-oriented code with unit tests. Analyse the processing time of functions and improve it if required (vectorization, caching, etc.).
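As one illustration of the caching idea, repeated prompts could be served from memory instead of triggering a new model call. The sketch below uses the standard library's `functools.lru_cache`; `expensive_completion` is a hypothetical stand-in for a slow model call, and reusing a cached answer is only reasonable if identical prompts are expected to give the same answer (the notebook calls the model with temperature 0).

```python
from functools import lru_cache

call_count = 0


def expensive_completion(prompt: str) -> str:
    """Hypothetical stand-in for a slow model call; counts how often it runs."""
    global call_count
    call_count += 1
    return prompt.upper()


@lru_cache(maxsize=256)
def cached_completion(prompt: str) -> str:
    """Return a cached result when the same prompt has been seen before."""
    return expensive_completion(prompt)


first = cached_completion("summarize call 1")
second = cached_completion("summarize call 1")
print(call_count)
```

The second call is answered from the cache, so the underlying function runs only once.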

**Prompt Design**
- Prompt chaining: break the prompt down into sub-tasks and evaluate performance in terms of the NLP task and processing time, e.g. summarize prompt > unmet need extractor > validation.
- Response verification: add subsequent prompts to verify that each extracted unmet need is indeed present in the conversation.
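The two ideas above can be sketched together as a small chain. This is an assumption-laden illustration: `chain_prompts`, `call_model`, and `fake_model` are hypothetical names, the prompt wording is invented for the example, and `fake_model` replaces the real API call so the sketch runs offline.

```python
from typing import Callable, List


def chain_prompts(conversation: str, call_model: Callable[[str], str]) -> List[str]:
    """Summarize, extract needs from the summary, then verify each need."""
    summary = call_model(f"Summarize this conversation:\n{conversation}")
    raw_needs = call_model(f"List unmet product needs, one per line:\n{summary}")
    needs = [line.strip() for line in raw_needs.splitlines() if line.strip()]
    verified = []
    for need in needs:
        # Verification step: keep a need only if the model confirms it
        answer = call_model(
            f"Answer yes or no: is the need '{need}' present in this conversation?\n{conversation}"
        )
        if answer.strip().lower().startswith("yes"):
            verified.append(need)
    return verified


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("Summarize"):
        return "Customer wants to export charts."
    if prompt.startswith("List"):
        return "Export visuals\nDark mode"
    return "yes" if "Export visuals" in prompt else "no"


verified_needs = chain_prompts("We really need to export charts.", fake_model)
print(verified_needs)
```

Chaining costs one model call per step plus one per extracted need, so processing time should be measured against the single-prompt approach.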

**Post-processing**
- Group similar extracted needs into one to avoid repetition. The grouping can also be used as an input feature for scoring.
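A minimal sketch of such grouping, assuming plain string similarity is good enough for near-duplicate wording, is shown below. `group_similar_needs` and the 0.75 threshold are hypothetical choices; `difflib.SequenceMatcher` compares characters only, so semantically similar needs with different wording would need embeddings or another model call instead.

```python
from difflib import SequenceMatcher
from typing import List


def group_similar_needs(needs: List[str], threshold: float = 0.75) -> List[List[str]]:
    """Greedily group needs whose similarity to a group's first item meets the threshold."""
    groups: List[List[str]] = []
    for need in needs:
        for group in groups:
            ratio = SequenceMatcher(None, need.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(need)
                break
        else:
            # No existing group is close enough: start a new one
            groups.append([need])
    return groups


groups = group_similar_needs(
    ["Export visuals", "Exporting visuals", "Customizable dashboard"]
)
print(groups)
```

The size of each group (how many conversations raised the same need) is one candidate signal for the scoring step.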

**Evaluation**
- Annotate the conversations manually and measure how effective the model is against those annotations.
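One simple way to quantify this, assuming extracted and annotated needs can be matched by exact (case-insensitive) text, is precision and recall. The function name and the sample needs below are illustrative; real annotations would rarely match word for word, so a fuzzy or manual matching step would likely be needed first.

```python
from typing import Dict, Iterable


def precision_recall(predicted: Iterable[str], annotated: Iterable[str]) -> Dict[str, float]:
    """Compare extracted needs with manual annotations using exact, case-insensitive matches."""
    pred = {p.strip().lower() for p in predicted}
    gold = {a.strip().lower() for a in annotated}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall}


metrics = precision_recall(
    ["Export visuals", "Clear definitions for data fields", "Dark mode", "Charts for sales data"],
    ["export visuals", "Charts for sales data"],
)
print(metrics)
```

Here two of the four extracted needs appear in the annotations and both annotated needs were found, giving a precision of 0.5 and a recall of 1.0.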

**Model**
- Explore how results vary across models, e.g. by trying GPT-4.