# Generating keywords for new tech in earnings calls
In this example, we show how to extract technology keywords from a company's earnings call transcript and cluster them by semantic meaning. The transcript comes from CSV data scraped from Seeking Alpha, and the keywords are generated with OpenAI's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).

For this example, we use the earnings call transcript from Emeren Group (https://seekingalpha.com/article/4632495-emeren-group-ltd-sol-q2-2023-earnings-call-transcript).

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
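
Before calling the model, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch, assuming the key is exposed as `OPENAI_API_KEY`; the helper name `require_env` is hypothetical and is not part of `uniflow`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value
```

Once `load_dotenv()` has run, calling `require_env("OPENAI_API_KEY")` either returns the key or fails early with a readable message, rather than failing later inside an API call.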


### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install langchain pandas pypdf



In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.op.op import OpScope
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformForClassificationOpenAIGPT3p5Config, TransformForClusteringOpenAIGPT4Config
from uniflow.op.model.model_config import OpenAIModelConfig
from langchain.text_splitter import RecursiveCharacterTextSplitter
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()




True

### Prepare the input csv
First, we split the contents of the earnings call transcript into text chunks that we can feed into the model. We use `RecursiveCharacterTextSplitter` from langchain.

In [4]:
file = "earnings_call_sample_data.csv"

##### Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", file)

In [6]:
df = pd.read_csv(input_file)

In [7]:
df['company'][0]

'Emeren Group Ltd (SOL) Stock'

In [8]:
text_to_split = df['content'][0]

##### Load and split the text

In [9]:
splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 100)

In [10]:
chunks = splitter.split_text(text_to_split)

In [11]:
print(len(chunks))
chunks[1]


37


"after the market close today and is available on our website at ir.emeren.com. We also provided a supplemental presentation that's posted on our IR website that we will reference during our prepared remarks. On the call with me today are Mr. Himanshu Shah, Chairman of the Board; Mr. Yumin Liu, Chief Executive Officer; and Mr. Ke Chen, Chief Financial Officer. Before we continue, please turn to slide two. Let me remind you that remarks made during this call may include predictions, estimates and other information that might be considered forward-looking. These forward-looking statements represent Emeren Group's current judgment for the future. However, they are subject to risks and uncertainties that could cause actual results to differ materially. Those risks are described under Risk Factors and elsewhere and Emeren Group's filings with the SEC. Please do not place undue reliance on these forward-looking statements, which reflect Emeren Group's opinions only as of the date of this"

In [12]:
data = [Context(context=p) for p in chunks]
# data = [Context(context = chunks[1])]
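
To build intuition for the `chunk_size` and `chunk_overlap` settings used when splitting the transcript, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` implements (that splitter also tries to break on paragraph, line, and word boundaries); the function name `chunk_with_overlap` is hypothetical and only illustrates how consecutive chunks share characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence cut at a chunk boundary still appears with some of its surrounding text in the next chunk, which reduces the chance that a technology mention is split in two.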

### Prepare sample prompts

First, we need to show the LLM what kind of output we expect. Because we are not generating the default questions and answers, we need a custom `instruction` and custom examples (`few_shot_prompt`), which we configure in the `PromptTemplate` class.

The custom `instruction` tells the LLM to list the technologies mentioned in the text instead of generating the default questions and answers.

Next, we give the `PromptTemplate` a few sample `Context` examples. Each one pairs a `context` with the expected `answer`, including one example where the correct answer is an empty list.

In [13]:
guided_prompt = PromptTemplate(instruction="""
            Does the text mention any cutting-edged technology applications, any new technology methods, or any new area of innovations? If yes, return the names of each technology in a list of strings as the answer. If no, return an empty list.
            """,
            few_shot_prompt=[
                Context(
                    context="Our new business wins are supported by our product leadership strategy of bringing new product to market that provides value for our customers, such as market-leading 500 bar GDi technology, helping customers improve efficiency, reduce emissions and lower costs leveraging our GDi technology and capital to provide a value-focused solution for our off-highway diesel applications and hydrogen ICE that differentiates us from our competition. We're helping our customers move towards carbon neutral and carbon-free fuels with solutions using ethanol, biofuels and hydrogen, as it's our view that a liquefied or gaseous fuel is going to be a key element of our journey to carbon neutrality.",
                    answer=["500 bar GDi technology", "carbon neutral"]
                ),
                Context(
                    context="The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.",
                    answer=[],
                ),
            ],
)
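
To illustrate the general idea behind few-shot prompting, the sketch below flattens an instruction, a few worked examples, and a new input into a single prompt string. This is an assumption-laden illustration only: the function `render_few_shot_prompt` and the `instruction:` / `context:` / `answer:` layout are hypothetical, and `uniflow` may serialize a `PromptTemplate` differently.

```python
def render_few_shot_prompt(instruction, examples, new_context):
    """Flatten an instruction, worked examples, and a new input into one prompt string."""
    parts = [f"instruction: {instruction.strip()}"]
    for example in examples:
        parts.append(f"context: {example['context']}")
        parts.append(f"answer: {example['answer']}")
    # The final block leaves the answer open for the model to complete.
    parts.append(f"context: {new_context}")
    parts.append("answer:")
    return "\n".join(parts)


demo_examples = [
    {"context": "The Eiffel Tower is located in Paris.", "answer": []},
]
print(render_few_shot_prompt("List new technologies.", demo_examples, "We deploy battery storage."))
```

Including a negative example (an empty list) alongside a positive one shows the model that returning nothing is a valid answer when the text mentions no technology.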

### Use LLM to generate data

In this step, we use `TransformForClassificationOpenAIGPT3p5Config` with the default [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) to extract technology keywords from each text chunk.


In [13]:
config1 = TransformForClassificationOpenAIGPT3p5Config(
    model_config=OpenAIModelConfig(),
)

In [14]:
with OpScope(name="transform_flow1"):
    client1 = TransformClient(config1)

In [15]:
client1 = TransformClient(config1)
out = client1.run(data)

  0%|          | 0/37 [00:00<?, ?it/s]

100%|██████████| 37/37 [00:44<00:00,  1.21s/it]


In [16]:
out

[{'output': [{'response': ['answer: []'], 'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7f5526727c10>},
 {'output': [{'response': ['answer: []'], 'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7f5526d01150>},
 {'output': [{'response': ['answer: []'], 'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7f55267561a0>},
 {'output': [{'response': ['answer: []'], 'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7f5526757550>},
 {'output': [{'response': ["Answer: ['solar project', 'renewable energy markets', 'battery energy storage system']"],
    'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7f5526757a00>},
 {'output': [{'response': ["answer: ['battery energy storage system']"],
    'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7f5526757a60>},
 {'output': [{'response': ["answer: ['rooftop distributed generation']"],
    'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7f5526757b20>},
 {'output': [{'response': ["an

### Process the output

Let's take a look at the generated output. After combining the responses and removing duplicates, we get a list of keywords.

In [17]:
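import re

# Collect keywords in a set so terms repeated across chunks appear only once.
unique_words = set()

for entry in out:
    for output in entry['output']:
        # Skip missing responses and responses that report no technologies.
        if output['response'] and output['response'][0] != 'answer: []':
            # Pull every single-quoted string out of the response text.
            words = re.findall(r"'([^']*)'", output['response'][0])
            unique_words.update(words)

aggregated_words_list = list(unique_words)
print(aggregated_words_list)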

['storage development', 'EPC services', 'ancillary services', 'solar pipeline', 'module prices', 'solar project', 'renewable energy', 'storage projects', 'solar independent storage', 'storage pipeline', 'storage', 'RTP NTPC', 'solar development', 'solar side', 'battery energy storage system', 'solar and storage', 'IPP', 'IPP construction', 'batteries', 'battery market', 'solar project pipeline', 'renewable energy markets', 'solar', 'energy storage', 'solar projects in China', 'Solar Plus', 'rooftop distributed generation', 'solar panels']
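
The regular expression above works for the responses shown, but it would miss a keyword that contains an apostrophe, and it relies on the model always using single quotes. As a hedged alternative, the sketch below strips the `answer:` prefix case-insensitively (the output contains both `answer:` and `Answer:`) and parses the remainder as a Python literal with `ast.literal_eval`. The helper name `parse_keyword_response` is hypothetical.

```python
import ast
import re


def parse_keyword_response(response: str) -> list[str]:
    """Extract the list that follows an 'answer:' prefix, ignoring prefix case."""
    payload = re.sub(r"^\s*answer:\s*", "", response, flags=re.IGNORECASE)
    try:
        parsed = ast.literal_eval(payload)
    except (ValueError, SyntaxError):
        return []  # Malformed model output is skipped rather than guessed at.
    if not isinstance(parsed, list):
        return []
    # Keep only string items in case the model mixes in other types.
    return [item for item in parsed if isinstance(item, str)]


print(parse_keyword_response("Answer: ['solar project', 'renewable energy markets']"))
# ['solar project', 'renewable energy markets']
```

With this helper, the aggregation loop could call `unique_words.update(parse_keyword_response(output['response'][0]))` for each response, at the cost of silently dropping responses that are not valid Python literals.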


Next, we group the terms in `aggregated_words_list` into key-value pairs based on their semantic meaning, using the following prompt:

In [18]:
guided_prompt = PromptTemplate(instruction="""
                As an expert in cutting-edge technologies, your task is to analyze a given list of technology-related terms. Your goal is to cluster these terms into groups based on their semantic similarities. Each group represents a unique category or 'signal' of technology. You will return your analysis as a dictionary. In this dictionary, each key is a 'signal', representing a specific category, and the associated value is a list of technology terms that belong to that category based on their semantic meaning.
            """,
            few_shot_prompt = [
                Context(
                    context=["artificial intelligence", "AI", "500 bar GDi technology", "ML", "500 bar GDi", "machine learning"],
                    answer={
                        "500_BAR_GDI": ["500 bar GDi technology", "500 bar GDi"],
                        "AIML": ["artificial intelligence", "AI", "ML", "machine learning"],
                    }
                ),
                Context(
                    context=["cryptocurrency", "blockchain", "Bitcoin", "Ethereum", "digital currency", "crypto mining"],
                    answer={
                        "CRYPTO_CURRENCY": ["cryptocurrency", "Bitcoin", "Ethereum", "digital currency"],
                        "BLOCKCHAIN_TECH": ["blockchain", "crypto mining"],
                    },
                ),
            ]
)

In [19]:
config2 = TransformForClusteringOpenAIGPT4Config(
    model_config=OpenAIModelConfig(),
)

In [20]:
with OpScope(name="transform_flow2"):
    client2 = TransformClient(config2)

In [21]:
context = Context(context=aggregated_words_list)
output2 = client2.run([context])

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:06<00:00,  6.29s/it]


In [22]:
data_str = output2[0]['output'][0]['response'][0]

In [23]:
import json
data_str = output2[0]['output'][0]['response'][0]
data_str = data_str.replace("answer: ", "")
data_str = data_str.replace("'", '"')

data_dict = json.loads(data_str)

df = pd.DataFrame([(key, ', '.join(values)) for key, values in data_dict.items()], columns=['Category', 'Technology Strings'])
df

| | Category | Technology Strings |
|---|---|---|
| 0 | SOLAR_TECH | solar pipeline, solar project, solar independe... |
| 1 | RENEWABLE_ENERGY | renewable energy, renewable energy markets |
| 2 | BATTERY_TECH | battery energy storage system, batteries, batt... |
| 3 | EPC_SERVICES | storage development, EPC services, ancillary s... |
| 4 | SOLAR_AND_STORAGE | solar and storage, energy storage |
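
Replacing every single quote with a double quote works for the response above, but it would produce invalid JSON if any term contained an apostrophe. A more tolerant approach, sketched here under the assumption that the response is either valid JSON or a Python dict literal, tries `json.loads` first and falls back to `ast.literal_eval`. The helper name `parse_cluster_response` is hypothetical.

```python
import ast
import json


def parse_cluster_response(response: str) -> dict:
    """Parse a dict-shaped model response, trying JSON first and Python literals second."""
    payload = response.strip()
    if payload.lower().startswith("answer:"):
        payload = payload[len("answer:"):].strip()
    try:
        parsed = json.loads(payload)
    except json.JSONDecodeError:
        # Single-quoted dicts are not valid JSON but are valid Python literals.
        parsed = ast.literal_eval(payload)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary mapping categories to terms")
    return parsed


clusters = parse_cluster_response("answer: {'SOLAR_TECH': ['solar pipeline', \"utility's solar\"]}")
print(clusters)
```

The returned dictionary can be passed to the same `pd.DataFrame` construction used above in place of `data_dict`.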
