### Objective

In this notebook, we implement the end-to-end process of simulating data science problem-solving with role-playing chatbots. The complete system consists of three parts: a scenario generator, a dual-chatbot module (a client bot and a data scientist bot), and an assessor.

In [1]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory
from langchain.prompts import (
    PromptTemplate,
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate
)
from langchain.chains import ConversationChain
import utilities
import os

from abc import ABC, abstractmethod

### 1. Abstract base class

We create an `LLMBot` abstract base class to define the common methods shared by various bots we will define later.

In [2]:
class LLMBot(ABC):
    """Class definition for a single LLM bot"""
    
    def __init__(self, endpoint_type, temperature):
        """Initialize the large language model and its associated memory.

        Args:
        --------------
        endpoint_type: "chat" or "completion".
        temperature: temperature of the LLM.
        """        
        # Instantiate llm
        # Reminder: need to set up openAI API key 
        # (e.g., via environment variable OPENAI_API_KEY)
        if endpoint_type == 'chat':
            self.llm = ChatOpenAI(model_name="gpt-3.5-turbo", 
                                temperature=temperature)
        elif endpoint_type == 'completion':
            self.llm = OpenAI(model_name="text-davinci-003", 
                            temperature=temperature)
        else:
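            # An unsupported argument value is a ValueError, not a KeyError
            raise ValueError(f"Unsupported endpoint type: {endpoint_type!r}. "
                             "Use 'chat' or 'completion'.")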

    @abstractmethod
    def instruct(self):
        """Determine the context of LLM bot behavior. 
        """
        pass

    @abstractmethod
    def step(self):
        """Response produced by the LLM bot. 
        """
        pass
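
As a quick illustration of what the abstract base class enforces, the sketch below uses a stripped-down stand-in with no LLM involved. `BotBase`, `EchoBot`, and `IncompleteBot` are hypothetical names introduced only for this example. It shows that a subclass that does not implement both `instruct` and `step` cannot be instantiated.

```python
from abc import ABC, abstractmethod


class BotBase(ABC):
    """Hypothetical, LLM-free stand-in for LLMBot."""

    @abstractmethod
    def instruct(self):
        """Set the context of the bot."""

    @abstractmethod
    def step(self):
        """Produce one response."""


class EchoBot(BotBase):
    """Implements both abstract methods, so it can be instantiated."""

    def instruct(self, role="assistant"):
        self.role = role

    def step(self, prompt=""):
        return f"[{self.role}] {prompt}"


class IncompleteBot(BotBase):
    """Does not implement step(), so instantiation fails."""

    def instruct(self):
        pass


bot = EchoBot()
bot.instruct(role="client")
reply = bot.step("hello")

try:
    IncompleteBot()
    incomplete_error = None
except TypeError as err:
    # Python refuses to instantiate a class with unimplemented abstract methods
    incomplete_error = err
```

Note that Python only checks that the abstract methods are overridden, not that their signatures match. This is why the concrete bots in this notebook can each define `instruct` and `step` with different arguments.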

### 2. Scenario generator

The main purpose of the scenario generator bot is to generate a detailed description of a concrete data science project. We divide this task into two steps: the first step generates a draft description of the scenario, and the second step fills in industry-specific details.

In [3]:
class ScenarioGenerator(LLMBot):

    industry_specifics = {
        'healthcare': """types of patients treated (e.g., age, medical conditions), 
        common treatments and procedures, challenges faced in patient care, medical equipment used.""",
        
        'finance': """types of financial products and services offered, 
        common financial transactions, challenges faced in fraud detection, 
        tools and technologies used for financial analysis.""",
        
        'retail': """types of products sold, common sales channels (e.g., online, in-store), 
        challenges faced in inventory management, tools and technologies used for sales analysis.""",
        
        'technology': """types of technology products or services offered, 
        common challenges faced in product development, tools and technologies 
        used for data analysis and product testing.""",
        
        'manufacturing': """types of products manufactured, machines used in the production process, 
        common issues faced by the company, tools and technologies used for quality control.""",
        
        'transportation': """types of transportation services provided, 
        common routes and destinations, challenges faced in route optimization, 
        tools and technologies used for vehicle maintenance and monitoring.""",
        
        'energy': """types of energy produced (e.g., renewable, non-renewable), 
        common challenges faced in energy production and distribution, 
        tools and technologies used for energy monitoring and analysis.""",
        
        'real estate': """types of properties managed or sold, 
        common challenges faced in property management or sales, 
        tools and technologies used for property analysis and valuation.""",
        
        'education': """levels of education provided (e.g., primary, secondary, tertiary), 
        common challenges faced in student assessment and curriculum development, 
        tools and technologies used for educational analysis and research.""",
        
        'government': """types of public services provided, 
        common challenges faced in public service delivery and management, 
        tools and technologies used for data analysis and decision-making.""",
        
        'non-profit': """types of services or programs provided, 
        common challenges faced in program delivery and management, 
        tools and technologies used for data analysis and decision-making."""
    }
    
    
    def __init__(self, temperature=1.0):       
        """Setup scenario generator bot.
        
        Args:
        --------------
        temperature: temperature of the LLM.
        """   
        
        # Instantiate llm
        super().__init__('chat', temperature)
            
        # Instantiate memory
        self.memory = ConversationBufferMemory(return_messages=True)
        
    
    def instruct(self, industry, business_size, problem_type, details=None):
        """Determine the context of scenario generator. 
        
        Args:
        --------------
        industry: industry of interest, e.g., healthcare, finance, etc.
        business_size: large, medium, small.
        problem_type: type of machine learning problem, e.g., classification, regression, etc.
        details: specific details to add to the description. If omitted, 
                 the `industry_specifics` entry for the given industry is used.
        """        
        
        self.industry = industry
        self.business_size = business_size
        self.problem_type = problem_type
        # Use caller-supplied details when given (previously this argument was 
        # silently ignored); otherwise fall back to the industry defaults
        self.details = details or ScenarioGenerator.industry_specifics[industry]
        
        prompt = ChatPromptTemplate.from_messages([
            MessagesPlaceholder(variable_name="history"),
            HumanMessagePromptTemplate.from_template("""{input}""")
        ])

        self.scen_generator = ConversationChain(memory=self.memory, prompt=prompt, 
                                          llm=self.llm)
        
    def step(self):
        """Interact with the LLM bot. 
        """       
        
        # 1st stage (draft)
        print("Generating scenario description: drafting stage...")
        prompt_1st = self._get_1st_stage_prompt()
        self.interm_scenario = self.scen_generator.predict(input=prompt_1st)
        
        # 2nd stage (review and enrich)
        print("Generating scenario description: refining stage...")
        prompt_2nd = self._get_2nd_stage_prompt()
        self.scenario = self.scen_generator.predict(input=prompt_2nd)
        print("Scenario description generated!")
        
        return self.scenario
    
    
    def _get_1st_stage_prompt(self):
        
        prompt = f"""For a {self.industry} company of {self.business_size} size focusing on {self.problem_type} problems, 
        generate a concrete data science project scenario that a data scientist might encounter in real life. 
        Please provide concrete and specific details relevant to the selected industry and problem type.

        For the generated scenario, please provide:
        1. A specific and realistic description of a problem faced by the company.
        2. The desired outcome that the company is hoping to achieve by solving the problem.
        3. A list of the top 3 most relevant data sources that might be available for solving the problem.

        Output format:
        Problem description: [content of problem description]
        Desired outcome: [content of desired outcome]
        Available data: [content of available data]
        """
        
        return prompt
    
    
    def _get_2nd_stage_prompt(self):
        
        prompt = f"""Based on the previously generated scenario, please enrich the problem description 
        by providing more specific details (such as {self.details}) about the problem.

        Output format:
        Enriched problem description: [content of enriched problem description]
        Desired outcome: [content of desired outcome]
        Available data: [content of available data]
        """
        
        return prompt
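
The second-stage prompt refers to "the previously generated scenario" without restating it, so it depends on the conversation memory carrying the first-stage exchange forward. The sketch below illustrates that dependency with a hypothetical, LLM-free stand-in. `FakeChat` is not a LangChain class; it only mimics the `predict(input=...)` call and a message buffer.

```python
class FakeChat:
    """Hypothetical stand-in for a conversation chain with buffer memory."""

    def __init__(self):
        self.history = []  # list of (role, text) tuples

    def predict(self, input):
        # A real chain would send history + input to the LLM. Here we only
        # report how many earlier messages the "model" would have seen.
        turn = len(self.history) // 2 + 1
        reply = f"reply #{turn} (saw {len(self.history)} earlier messages)"
        self.history.append(("human", input))
        self.history.append(("ai", reply))
        return reply


chat = FakeChat()

# Stage 1: draft. The history is empty at this point.
draft = chat.predict(input="Draft a scenario.")

# Stage 2: refine. The draft prompt and its reply are now in the history,
# which is what lets the prompt say "the previously generated scenario".
refined = chat.predict(input="Enrich the previously generated scenario.")
```

If a fresh memory object were used for the second call, the refining prompt would have nothing to refer back to.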

Quick test of generating the data science case study

In [4]:
# Basic information
industry = "manufacturing"
business_size = "medium"
problem_type = "anomaly detection"
details = "Types of products manufactured, machines used in the production process, \
common issues faced by the company, tools and technologies used for quality control."

# Instantiate scenario generator
scen_generator = ScenarioGenerator(temperature=1.0)
scen_generator.instruct(industry, business_size, problem_type, details)

# Create the problem description
scenario = scen_generator.step()
print(scenario)

Generating scenario description: drafting stage...
Generating scenario description: refining stage...
Scenario description generated!
Enriched problem description: The medium-sized manufacturing company produces a variety of electronic components such as resistors, capacitors, and diodes used in smartphones, laptops, and other electronic devices. The production process involves multiple steps, including cutting, shaping, cleaning, component placement, soldering, and surface finishing. The company uses machines such as pick and place machines, reflow ovens, wave soldering machines, and automatic optical inspection machines to automate the production process.

Despite implementing different quality control measures such as visual inspections and statistical process control, the company is experiencing multiple issues such as surface finish defects, poor solder connections, and misplaced components. These defects result in increased production costs, lower quality products, and customer c

### 3. Dual-chatbot module

#### 3.1 Client chatbot

The purpose of the client chatbot is to clarify the situation and give feedback on possible solutions.

In [6]:
class ClientBot(LLMBot):
    """Class definition for the client bot, created with LangChain."""
    
    def __init__(self, temperature=0.8):
        """Set up the client bot.
        
        Args:
        --------------
        temperature: temperature of the LLM.
        """   
        
        # Instantiate llm
        super().__init__('chat', temperature)
            
        # Instantiate memory
        self.memory = ConversationBufferMemory(return_messages=True)
        
        
    def instruct(self, industry, business_size, scenario):
        """Determine the context of client chatbot. 
        """
        
        self.industry = industry
        self.business_size = business_size
        self.scenario = scenario
        
        # Define prompt template
        prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(self._specify_system_message()),
            MessagesPlaceholder(variable_name="history"),
            HumanMessagePromptTemplate.from_template("""{input}""")
        ])
        
        # Create conversation chain
        self.conversation = ConversationChain(memory=self.memory, prompt=prompt, 
                                              llm=self.llm, verbose=False)
        

    def step(self, prompt):
        """Client chatbot speaks. 
        """
        response = self.conversation.predict(input=prompt)
        
        return response
        

    def _specify_system_message(self):
        """Specify the behavior of the client chatbot.
        """      
        
        # Prompt
        prompt = f"""You are role-playing a representative from a {self.industry} company of {self.business_size} size and 
        you are meeting with a data scientist (which is played by another bot), to discuss how to leverage machine learning 
        to address a problem your company is facing. 
        
        The problem description, desired outcome, and available data are:
        {self.scenario}.
        
        Your ultimate goal is to work with the data scientist to define a clear problem and agree on a suitable data science solution or approach.

        Guidelines to keep in mind:
        - **Get Straight to the Point**: Start the conversation by directly addressing the problem at hand. There is no need for pleasantries or introductions.
        - **Engage in Conversation**: Respond to the data scientist's questions and prompts. Do not provide all the information at once or provide the entire conversation yourself.
        - **Clarify and Confirm**: Always make sure to clarify and confirm the problem, desired outcome, and any proposed solutions with the data scientist. 
        - **Stay in Role**: Your role as a client is to represent your company's needs and work with the data scientist to define a clear problem and agree on a suitable data science solution or approach. Do not try to propose solutions.
        - **Provide Information as Needed**: Provide information about the problem, available data, constraints, and requirements as it becomes relevant in the conversation. If the data scientist asks a question and the information was not provided in the problem description, it is okay to improvise and create details that seem reasonable.
        - **Collaborate**: Collaborate with the data scientist to clearly define the problem and to consider any proposed solutions or approaches.
        """

        return prompt
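
One thing to watch for: the system message is built with an f-string and then handed to a prompt template, so any literal `{` or `}` in the generated scenario text would be read as a template variable. The sketch below assumes the template uses Python `str.format`-style placeholders and demonstrates the issue with plain `str.format`. `escape_braces` is a hypothetical helper, not part of LangChain.

```python
def escape_braces(text):
    """Double every brace so str.format-style templates treat it literally."""
    return text.replace("{", "{{").replace("}", "}}")


# Hypothetical scenario text that happens to contain braces
scenario = 'Each sensor record looks like {"temp": 71.5}.'

raw_template = "Scenario: " + scenario + "\nClient says: {input}"
safe_template = "Scenario: " + escape_braces(scenario) + "\nClient says: {input}"

# Without escaping, the braces in the scenario are parsed as a placeholder
try:
    raw_template.format(input="Start the conversation")
    raw_failed = False
except (KeyError, IndexError, ValueError):
    raw_failed = True

# With escaping, only {input} is substituted
rendered = safe_template.format(input="Start the conversation")
```

In the client bot, this would correspond to escaping `self.scenario` before interpolating it into the system message.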

#### 3.2 Data scientist bot

The purpose of the data scientist chatbot is to understand the problem in depth and to propose possible solutions.

In [7]:
class DataScientistBot(LLMBot):
    """Class definition for the data scientist bot."""

    def __init__(self, temperature=0.8):
        """Set up the data scientist bot.
        
        Args:
        --------------
        temperature: temperature of the LLM.
        """   
        
        # Instantiate llm
        super().__init__('chat', temperature)
            
        # Instantiate memory
        self.memory = ConversationBufferMemory(return_messages=True)

        
    def instruct(self, industry, business_size, problem_type):
        """Determine the context of data scientist chatbot. 
        """
        
        self.industry = industry
        self.business_size = business_size
        self.problem_type = problem_type
        
        # Define prompt template
        prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(self._specify_system_message()),
            MessagesPlaceholder(variable_name="history"),
            HumanMessagePromptTemplate.from_template("""{input}""")
        ])
        
        # Create conversation chain
        self.conversation = ConversationChain(memory=self.memory, prompt=prompt, 
                                              llm=self.llm, verbose=False)
        

    def step(self, prompt):
        """Data scientist chatbot speaks. 
        """
        response = self.conversation.predict(input=prompt)
        
        return response
        

    def _specify_system_message(self):
        """Specify the behavior of the data scientist chatbot.
        """      
        
        # Prompt
        prompt = f"""You are role-playing a data scientist meeting with a representative (which is played by another chatbot) 
        from a {self.industry} company of {self.business_size} size. They are currently concerned with 
        a {self.problem_type} problem.

        Your ultimate goal is to understand the problem in depth and agree on a suitable data science solution or approach 
        by engaging in a conversation with the client representative. 

        Guidelines to keep in mind:
        - **Engage in Conversation**: You are only the data scientist. Do not provide the entire conversation yourself.
        - **Understand the Problem**: Make sure to ask questions to get a clear and detailed understanding of the problem, the desired outcome, available data, constraints, and requirements.
        - **Propose Solutions**: Based on the information provided by the client, suggest possible data science approaches or solutions to address the problem.
        - **Consider Constraints**: Be mindful of any constraints that the client may have, such as budget, timeline, or data limitations, and tailor your proposed solutions accordingly.
        - **Collaborate**: Collaborate with the client to refine the problem definition, proposed solutions, and ultimately agree on a suitable data science approach.
        """

        return prompt

#### 3.3 Simulate client-data scientist chatbot conversation

In [8]:
# Create two chatbots
client = ClientBot()
data_scientist = DataScientistBot()

# Specify instructions
client.instruct(industry, business_size, scenario)
data_scientist.instruct(industry, business_size, problem_type)

In [9]:
# Book-keeping
question_list = []
answer_list = []

# Start conversation
for i in range(6):
    if i == 0:
        question = client.step('Start the conversation')
    else:
        question = client.step(answer)
    question_list.append(question)
    print("👨‍💼 Client: " + question)
    
    answer = data_scientist.step(question)
    answer_list.append(answer)

    print("👩‍💻 Data Scientist: " + answer)
    print("\n\n")

👨‍💼 Client: Hello, I'm from a medium-sized manufacturing company that produces electronic components such as resistors, capacitors, and diodes. We are having some issues with defects in our products, and we were hoping to discuss with you how we can leverage machine learning to solve this problem.
👩‍💻 Data Scientist: Of course, I'd be happy to help. Can you give me more details about the types of defects you are experiencing? Are they consistent or do they vary? And how are they currently being detected?



👨‍💼 Client: We are experiencing various defects such as surface finish defects, poor solder connections, and misplaced components. These defects are not consistent and can vary from batch to batch. We are currently detecting the defects through visual inspections, statistical process control, magnifying glasses, x-ray machines, and automatic optical inspection machines, but we are still not able to identify the root cause of the defects consistently.
👩‍💻 Data Scientist: I see. It so
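
The turn-taking logic used above can be factored into a small reusable function. The following is a minimal sketch under stated assumptions: `simulate_conversation` and `make_stub` are hypothetical names, and the stubs replace the real bots' `step` methods so the example runs without an API key.

```python
def simulate_conversation(client_step, scientist_step, n_rounds,
                          opener="Start the conversation"):
    """Alternate turns: the client speaks first, then the data scientist."""
    questions, answers = [], []
    message = opener
    for _ in range(n_rounds):
        question = client_step(message)    # client reacts to the last answer
        answer = scientist_step(question)  # data scientist reacts to the client
        questions.append(question)
        answers.append(answer)
        message = answer                   # feed the answer into the next round
    return questions, answers


def make_stub(name, received):
    """Build a fake step() that logs its input and returns a numbered reply."""
    def step(prompt):
        received.append(prompt)
        return f"{name}-{len(received)}"
    return step


client_inputs, scientist_inputs = [], []
questions, answers = simulate_conversation(
    make_stub("client", client_inputs),
    make_stub("ds", scientist_inputs),
    n_rounds=3,
)
```

With the real bots, `client.step` and `data_scientist.step` would be passed in place of the stubs.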

### 4. Assessment

To further enhance the user's learning experience, it is beneficial to reflect on the conversation and extract the key learning points for the user to review. These key learning points could include the strategy the data scientist bot adopted to scope the problem, the aspects that were or were not covered in the conversation, and potential follow-up questions or topics for discussion.

We adopt a two-stage approach: we first condense the generated conversation script one round at a time, and then feed the condensed script (together with the industry, business size, and problem type) to the assessor bot, which analyzes the conversation.

In [11]:
class SummarizerBot(LLMBot):

    def __init__(self, temperature=0.8):       
        """Setup summarizer bot.
        
        Args:
        --------------
        temperature: temperature of the LLM.
        """   
        
        # Instantiate llm
        super().__init__('completion', temperature)
        
    
    def instruct(self):
        """Determine the context of summarizer. 
        """        
        
        template = """Please concisely summarize the following segment of a conversation between a client and 
        a data scientist discussing a potential data science project:

        {conversation}
        """

        self.prompt = PromptTemplate(
            template=template,
            input_variables=["conversation"],
        )


    def step(self, q_list, a_list):
        """Summarize the conversation script. 
        
        Args:
        ---------
        q_list: list of responses from the client bot
        a_list: list of responses from the data scientist bot
        """     
        
        # Loop over individual rounds
        conversation_summary = []
        for i, (q, a) in enumerate(zip(q_list, a_list)):
            print(f"Processing {i+1}/{len(q_list)}th conversation round.")

            # Compile one round of conversation
            conversation_round = ''
            conversation_round += 'Client: ' + q + '\n\n'
            conversation_round += 'Data scientist: ' + a

            response = self.llm.predict(self.prompt.format(conversation=conversation_round))
            conversation_summary.append(response)
            
        return conversation_summary

In [12]:
class AssessorBot(LLMBot):

    def __init__(self, temperature=0.8):       
        """Setup assessor bot.
        
        Args:
        --------------
        temperature: temperature of the LLM.
        """   
        
        # Instantiate llm
        super().__init__('completion', temperature)
        
    
    def instruct(self, industry, business_size, problem_type):
        """Determine the context of assessor. 
        """        
        
        self.industry = industry
        self.business_size = business_size
        self.problem_type = problem_type
        
        
        template = """You are a senior data scientist who has been asked to review a conversation between a data scientist 
        and a client from a {industry} company of {business_size} size, focusing on a {problem_type} problem. 
        The client and data scientist are discussing how to define and scope a data science project to address the problem.

        Please provide an assessment of the conversation, focusing on the strategy adopted by the data scientist to 
        define and scope the problem, any potential room for improvement, and any other key points you think are important.
        Please organize your response with nicely formatted bullet points.

        Here is the conversation: 
        {conversation}
        """

        self.prompt = PromptTemplate(
            template=template,
            input_variables=["industry", "business_size", "problem_type", "conversation"],
        )


    def step(self, conversation_summary):
        """Assess the conversation script. 
        
        Args:
        ---------
        conversation_summary: condensed version of the conversation script.
        """     
        
        analysis = self.llm.predict(self.prompt.format(industry=self.industry,
                                                        business_size=self.business_size,
                                                        problem_type=self.problem_type,
                                                        conversation=' '.join(conversation_summary)))
        
        return analysis
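
The two-stage flow implemented by `SummarizerBot` and `AssessorBot` can be sketched without any LLM calls. In the example below, `summarize_rounds`, `assess`, and the two stub functions are hypothetical stand-ins: the stub "summarizer" simply keeps the client's line, and the stub "assessor" echoes what it received, so the data flow is easy to inspect.

```python
def summarize_rounds(q_list, a_list, summarize):
    """Stage 1: condense each client/data scientist round separately."""
    summaries = []
    for q, a in zip(q_list, a_list):
        round_text = "Client: " + q + "\n\n" + "Data scientist: " + a
        summaries.append(summarize(round_text))
    return summaries


def assess(summaries, assess_fn):
    """Stage 2: join the per-round summaries and assess them in one pass."""
    return assess_fn(" ".join(summaries))


def stub_summarize(text):
    # Stand-in for the LLM call: keep only the first line of the round
    return text.splitlines()[0]


def stub_assess(text):
    # Stand-in for the LLM call: echo the condensed script
    return "reviewed: " + text


q_list = ["We see defects.", "They vary by batch."]
a_list = ["Which defects?", "How are they detected?"]

summaries = summarize_rounds(q_list, a_list, stub_summarize)
report = assess(summaries, stub_assess)
```

Summarizing round by round keeps each summarization call short, and the assessor then works on the condensed script rather than the full transcript.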

Test the proposed workflow for analyzing the simulated conversation

In [13]:
# Instantiate the summarizer
summarizer = SummarizerBot()
summarizer.instruct()

# Create conversation summary
conversation_summary = summarizer.step(question_list, answer_list)

Processing 1/6th conversation round.
Processing 2/6th conversation round.
Processing 3/6th conversation round.
Processing 4/6th conversation round.
Processing 5/6th conversation round.
Processing 6/6th conversation round.


In [14]:
# Instantiate the assessor
assessor = AssessorBot()
assessor.instruct(industry, business_size, problem_type)

# Perform assessment
analysis = assessor.step(conversation_summary)
print(analysis)



Assessment of the Conversation: 
• The data scientist has taken a structured approach to define and scope the problem by asking relevant questions about the manufacturing process, data availability, and existing algorithms. 
• The data scientist has proposed an unsupervised anomaly detection approach which is a good fit for the problem and is well-suited for detecting patterns or anomalies without the need for labels, which would be difficult to obtain for this type of problem. 
• The data scientist has also proposed a feasibility study to determine if this is the best approach for the client's timeline and budget constraints, providing an estimation of the cost and timeline for the project. 
• The data scientist has also proposed a more detailed project plan with timeline and costs for the implementation phase. 
• The data scientist could have asked more questions about the client's expectations, such as required accuracy or false positives and false negatives and how the solution w