### Objective

In this notebook, we develop an LLM workflow that generates a specific data science project scenario for a given industry, business size, and problem type.

In [1]:
from langchain.prompts import (
    PromptTemplate,
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI, AzureOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
import utilities
import os

In [2]:
problem_type_list = ["classification", "regression", "clustering",
                    "anomaly detection", "recommendation", "time series analysis",
                    "natural language processing", "computer vision"]
business_size_list = ["small (less than 100 employees)",
                    "medium (100-500 employees)",
                    "large (more than 500 employees)"]
industry_list = ["healthcare", "finance", "retail", "technology",
                "manufacturing", "transportation", "energy",
                "real estate", "education", "government", "non-profit"]

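The three lists above are not used by the later cells, which hard-code a single combination. A minimal sketch of how a combination could instead be drawn at random, assuming the standard-library `random` module. The helper `sample_scenario_config` and its `seed` parameter are hypothetical names, not part of the original notebook, and the lists are abbreviated so the block runs on its own.

```python
import random

# Abbreviated copies of the notebook's lists so this block is self-contained.
problem_type_list = ["classification", "regression", "anomaly detection"]
business_size_list = ["small (less than 100 employees)",
                      "medium (100-500 employees)",
                      "large (more than 500 employees)"]
industry_list = ["healthcare", "finance", "manufacturing"]


def sample_scenario_config(seed=None):
    """Draw one (industry, business_size, problem_type) combination.

    Hypothetical helper: passing a fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    return {
        "industry": rng.choice(industry_list),
        "business_size": rng.choice(business_size_list),
        "problem_type": rng.choice(problem_type_list),
    }


config = sample_scenario_config(seed=0)
print(config)
```

The returned dictionary can be unpacked into any of the prompt templates below, for example `prompt.format(**config)`.
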
### Strategy 1

In [None]:
scen_generator = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=1.0)
memory = ConversationBufferMemory(return_messages=True)
prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="history"),
    HumanMessagePromptTemplate.from_template("""{input}""")
])

conversation = ConversationChain(memory=memory, prompt=prompt, 
                                  llm=scen_generator, verbose=False)

In [None]:
# Generate description of the scenario
industry = "manufacturing"
business_size = "medium"
problem_type = "anomaly detection"

# Prompt
scene_generation = f"""Given a {industry} company of {business_size} size focusing on {problem_type} problems, 
generate a detailed and specific data science project scenario. Please provide:
1. A detailed description of the problem faced by the company.
2. The desired outcome from solving the problem.
3. The type of data that might be available for solving the problem.

Output format:
Problem description: [content of problem description]
Desired outcome: [content of desired outcome]
Data availability: [content of data availability]
"""

response = conversation.predict(input=scene_generation)
print(response)

In [None]:
# Prompt
additional_aspects = f"""Based on the scenario generated, please provide additional details that are 
relevant to defining and scoping the data science project. For example, consider the following aspects:
- Constraints: Any constraints that must be considered when solving the problem.
- Current System: A description of the current system or process in place.
- Success Metrics: How success will be measured for this project.
- Stakeholders: Key stakeholders involved in the project.
- Technical Infrastructure: The technical infrastructure available or required for the project.
- Data Privacy and Security: Any data privacy and security considerations that must be taken into account.

Output format:
[aspect name]: [aspect content]
"""

response = conversation.predict(input=additional_aspects)
print(response)

### Strategy 2

In [None]:
scen_generator = OpenAI(model_name="text-davinci-003", temperature=1.0)

In [None]:
template = """For a {industry} company of {business_size} size facing {problem_type} problems, 
generate a detailed and specific data science project scenario. Please provide:

1. A detailed description of the problem faced by the company, including any specific challenges, constraints, or requirements.
2. A detailed description of the desired outcome from solving the problem, including any specific goals, metrics, or success criteria.
3. A detailed description of the type of data that might be available for solving the problem, 
including any specific data sources, data types, or data formats.

Your output format should be:
Problem description: [content of problem description]
Desired outcome: [content of desired outcome]
Available data: [content of available data types]
Constraints and requirements: [content of constraints and requirements]

Make sure that the generated scenario is realistic and typical of the problems that a company in the {industry} 
industry might encounter.
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["industry", "business_size", "problem_type"],
)

In [None]:
industry = "manufacturing"
business_size = "medium"
problem_type = "anomaly detection"

# Generate scenarios
response = scen_generator.predict(prompt.format(industry=industry, 
                                                business_size=business_size, 
                                                problem_type=problem_type))
print(response)

### Strategy 3

In this strategy, we no longer ask the LLM for constraints and focus only on the problem description, desired outcome, and data availability. The goal is a specific scenario description. To get there, we split the task into two steps: first, we prompt the LLM to generate a generic scenario; second, we prompt it to fill in the details.
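
The two-step idea can be sketched independently of any LLM library. The block below is an illustration under stated assumptions, not the notebook's implementation: `two_stage_scenario`, `STAGE_1`, `STAGE_2`, and `fake_llm` are hypothetical names, the prompts are shortened paraphrases of the templates used in this section, and the model is replaced by a stand-in callable so the block runs without API access.

```python
from typing import Callable

# Shortened paraphrases of the two prompts, for illustration only.
STAGE_1 = (
    "For a {industry} company of {business_size} size focusing on "
    "{problem_type} problems, generate a generic data science scenario."
)
STAGE_2 = (
    "Enrich the problem description below with more specific details "
    "(such as {details}).\n\nproblem description: {problem}"
)


def two_stage_scenario(llm: Callable[[str], str], industry: str,
                       business_size: str, problem_type: str,
                       details: str) -> str:
    """Run stage 1, then feed its output into the stage 2 prompt."""
    generic = llm(STAGE_1.format(industry=industry,
                                 business_size=business_size,
                                 problem_type=problem_type))
    return llm(STAGE_2.format(details=details, problem=generic))


# Stand-in for a real model call: records each prompt, returns a canned reply.
prompts_seen = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"reply {len(prompts_seen)}"


result = two_stage_scenario(fake_llm, "manufacturing", "medium",
                            "anomaly detection", "machines used")
print(result)
```

With a real model, `llm` could be any function that takes a prompt string and returns the completion text.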

In [3]:
# Initialize an LLM
scen_generator = AzureOpenAI(
                    deployment_name="deployment-5af509f3323342ee919481751c6f8b7d",
                    model_name="text-davinci-003",
                    openai_api_base="https://abb-chcrc.openai.azure.com/",
                    openai_api_version="2023-03-15-preview",
                    openai_api_key=os.environ["OPENAI_API_KEY_AZURE"],
                    openai_api_type="azure",
                    temperature=1.0
                )

In [23]:
# 1st Stage
template = """For a {industry} company of {business_size} size focusing on {problem_type} problems, 
generate a concrete data science project scenario that a data scientist might encounter in real life. 
Please provide concrete and specific details relevant to the selected industry and problem type.

For the generated scenario, please provide:
1. A specific and realistic description of a problem faced by the company.
2. The desired outcome that the company is hoping to achieve by solving the problem.
3. A list of potential data sources that might be available for solving the problem.

Output format:
Problem description: \n
[content of problem description]
Desired outcome: \n
[content of desired outcome]
Available data: \n
[list of potential data sources]

"""

prompt = PromptTemplate(
    template=template,
    input_variables=["industry", "business_size", "problem_type"],
)

In [34]:
industry = "manufacturing"
business_size = "medium"
problem_type = "anomaly detection"

# Generate scenarios
response = scen_generator.predict(prompt.format(industry=industry, 
                                                business_size=business_size, 
                                                problem_type=problem_type))
print(response)


Problem description: 
The manufacturing company produces goods in large volumes and runs extensive production lines. Due to the large number of production items, the company often finds small anomalies or errors that lead to a waste of resources. As such, the company is striving to develop a system to detect and identify these anomalies in an automated way, which can save considerable time and resources. 

Desired outcome: 
The company wants to develop a system to identify anomalies on the production line, allowing the production process to be more efficient and to reduce waste, while ensuring a high-quality output.

Available data sources: 
1. Production records containing information such as production date, production volumes, quality records, and any other relevant production-related data. 
2. Environmental conditions (temperature, humidity, etc.)
3. Product specifications (dimensions, design parameters, etc.)
4. Machine data (performance, logs, etc.)
5. Human resource records (em

In [35]:
response.split('\n')

['',
 'Problem description: ',
 'The manufacturing company produces goods in large volumes and runs extensive production lines. Due to the large number of production items, the company often finds small anomalies or errors that lead to a waste of resources. As such, the company is striving to develop a system to detect and identify these anomalies in an automated way, which can save considerable time and resources. ',
 '',
 'Desired outcome: ',
 'The company wants to develop a system to identify anomalies on the production line, allowing the production process to be more efficient and to reduce waste, while ensuring a high-quality output.',
 '',
 'Available data sources: ',
 '1. Production records containing information such as production date, production volumes, quality records, and any other relevant production-related data. ',
 '2. Environmental conditions (temperature, humidity, etc.)',
 '3. Product specifications (dimensions, design parameters, etc.)',
 '4. Machine data (performa

In [32]:
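# Map each header line (a line containing ':') to the line that follows it.
# Only the first line after a header is kept, so multi-line sections such as
# a numbered list of data sources are cut to their first item.
scenario = {}
sections = response.split('\n')
for i, section in enumerate(sections):
    if ':' in section and i + 1 < len(sections):
        key, _ = section.split(':', 1)
        scenario[key.strip()] = sections[i+1]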

In [33]:
scenario

{'Problem description': 'The manufacturing company has experienced significant losses due to defects in their production process. With increasingly complex production processes and tight deadlines, it is becoming increasingly difficult for workers to identify anomalies or faulty products before they are shipped out. ',
 'Desired outcome': 'The manufacturing company would like to be able to detect anomalies in their product processes as efficiently as possible in order to minimize losses and increase customer satisfaction. ',
 'Available data': '1. Historical records of product defects.'}
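
Note that `'Available data'` above holds only the first list item, because the loop stores a single line per header. One possible way to keep whole sections is sketched below. It assumes that a header is a line ending in `:` once stripped, as in the outputs shown in this notebook. `parse_sections` is a hypothetical helper, and `sample` is a hand-written illustration, not model output.

```python
def parse_sections(text):
    """Split a response into {header: body}.

    Assumption: a header is a line that ends with ':' once stripped.
    Every following non-empty line belongs to that header until the
    next header appears.
    """
    sections = {}
    key = None
    for line in text.split("\n"):
        line = line.strip()
        if line.endswith(":"):
            key = line[:-1].strip()
            sections[key] = []
        elif key is not None and line:
            sections[key].append(line)
    return {k: "\n".join(v) for k, v in sections.items()}


# Hand-written sample in the same layout as the responses above.
sample = (
    "\nProblem description: \n"
    "Defects are found late.\n\n"
    "Available data: \n"
    "1. Production records\n"
    "2. Machine data (performance, logs, etc.)\n"
)
parsed = parse_sections(sample)
print(parsed["Available data"])
```

This sketch does not handle responses that put the header and its content on the same line, and a content line that happens to end with a colon would be misread as a header.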

In [36]:
# 2nd stage
template = """Enrich the problem description below by providing more specific details (such as {details}) 
about the problem.

problem description: {problem}

Output format:
Problem description: [content of problem description]
Desired outcome: [copy of desired outcome in the provided problem description]
Available data: [copy of potential data sources in the provided problem description]
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["details", "problem"],
)

In [37]:
details = "Types of products manufactured, machines used in the production process, \
common issues faced by the company, tools and technologies used for quality control."

# Generate scenarios
detailed_response = scen_generator.predict(prompt.format(details=details,
                                               problem=response))
print(detailed_response)


Enrichment: 
Types of Products Manufactured: The manufacturing company produces a variety of products, ranging from consumer electronics to automotive parts. 

Machines Used in Production Process: The company uses a combination of automated and manual machines to produce their goods, including robotic arms, conveyor belts, and injection molding machines. 

Common Issues Faced by the Company: Common issues faced by the company include quality control issues due to minor imperfections in production, high levels of scrap due to production errors, and long lead times due to inefficient production processes. 

Tools and Technologies Used for Quality Control: The company uses various tools and technologies for quality control, including automated visual inspection systems, automated testing platforms, and computerized measurement systems.
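
The output above does not follow the requested format: it opens with `Enrichment:` and lists the details instead of returning the three requested sections. A minimal sketch of a check that could flag such responses before they are parsed or passed on. `missing_headers` and `EXPECTED_HEADERS` are hypothetical names, and the matching rule (a line that starts with the header text and contains a colon) is an assumption chosen to tolerate variants such as `Available data sources:`.

```python
EXPECTED_HEADERS = ["Problem description", "Desired outcome", "Available data"]


def missing_headers(response, expected=EXPECTED_HEADERS):
    """Return the expected headers that never start a line in the response."""
    lines = [line.strip().lower() for line in response.split("\n")]
    return [
        header for header in expected
        if not any(line.startswith(header.lower()) and ":" in line
                   for line in lines)
    ]


# Shortened version of the off-format reply shown above.
off_format = "Enrichment: \nTypes of Products Manufactured: consumer electronics"
print(missing_headers(off_format))
```

A non-empty result could be used to trigger another call to the model or to skip the response.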


### Strategy 4

We solve the same two-step task in a chat setting, where the conversation memory carries the first-stage scenario into the second prompt.
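
The role of the conversation memory can be illustrated with a toy wrapper. This is a sketch of the general idea only, not LangChain's implementation: `BufferChat` and `fake_model` are hypothetical names, and the model is a stand-in that simply reports how many messages it received.

```python
class BufferChat:
    """Toy conversation wrapper: keeps every turn and resends the history."""

    def __init__(self, model):
        self.model = model  # callable: list of (role, text) -> reply text
        self.history = []

    def predict(self, user_input):
        # Add the new user turn, then send the full history to the model.
        self.history.append(("human", user_input))
        reply = self.model(list(self.history))
        self.history.append(("ai", reply))
        return reply


def fake_model(messages):
    # Stand-in for a chat model: reports how many messages it received.
    return f"saw {len(messages)} message(s)"


chat = BufferChat(fake_model)
first = chat.predict("Generate a scenario.")
second = chat.predict("Enrich the previous scenario.")
print(first)
print(second)
```

Because the second call receives the first prompt and its reply as well, the follow-up prompt can refer to "the previously generated scenario" without repeating it.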

In [50]:
scen_generator = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=1.0)
memory = ConversationBufferMemory(return_messages=True)
prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="history"),
    HumanMessagePromptTemplate.from_template("""{input}""")
])

conversation = ConversationChain(memory=memory, prompt=prompt, 
                                  llm=scen_generator, verbose=False)

In [51]:
# Generate description of the scenario
industry = "manufacturing"
business_size = "medium"
problem_type = "anomaly detection"

# Prompt
scene_generation = f"""For a {industry} company of {business_size} size focusing on {problem_type} problems, 
generate a concrete data science project scenario that a data scientist might encounter in real life. 
Please provide concrete and specific details relevant to the selected industry and problem type.

For the generated scenario, please provide:
1. A specific and realistic description of a problem faced by the company.
2. The desired outcome that the company is hoping to achieve by solving the problem.
3. A list of the top 3 most relevant data sources that might be available for solving the problem.

Output format:
Problem description: [content of problem description]
Desired outcome: [content of desired outcome]
Available data: [content of available data]
"""

response = conversation.predict(input=scene_generation)
print(response)

Problem description: 

A medium-sized manufacturing company that produces automotive parts is facing a problem with quality control. The company has been experiencing an increasing number of defective products, leading to customer complaints and returns. The defect rate has been steadily rising, causing the company to incur additional costs for rework and affecting customer satisfaction. The company wants to identify the root cause of the defects and implement measures to reduce the defect rate.

Desired outcome: 

The company aims to identify the factors contributing to the increasing defect rate and develop a predictive model to detect anomalies during the manufacturing process. Ultimately, the goal is to reduce the defect rate to an acceptable level, improve product quality, and minimize customer complaints and returns.

Available data: 

1. Manufacturing data: The company keeps records of various manufacturing parameters such as temperature, pressure, speed, and time for each step 

In [52]:
# Prompt
additional_details = f"""Based on the previously generated scenario, please enrich the problem description 
by providing more specific details (such as {details}) about the problem.

Output format:
Enriched problem description: [content of enriched problem description]
Desired outcome: [content of desired outcome]
Available data: [content of available data]
"""

response = conversation.predict(input=additional_details)
print(response)

Enriched problem description:

A medium-sized manufacturing company in the automotive industry specializes in producing engine components. They have a range of products such as piston rings, crankshafts, and cylinder heads. The production process involves multiple steps, including machining, heat treatment, and assembly. The company uses various machines, such as CNC (Computer Numerical Control) machines for precision cutting, hydraulic presses for forging, and robotic arms for assembly.

Recently, the company has been facing a rise in product defects, including dimensional inaccuracies, improper surface finishes, and misalignments. These defects not only affect the performance and reliability of the products but also lead to customer complaints and returns. The company's quality control team has been struggling to identify the root causes of these defects, as they occur across different stages of the production process.

Desired outcome:

The company aims to pinpoint the specific fact