# Code Execution with TaskWeaver

This notebook introduces code execution with Code Interpreters (CIs) such as the Assistants API and TaskWeaver. We will experiment with different prompts in TaskWeaver and see how to get a CI to generate and execute code for us.


### Python Imports


In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
sys.path.append(os.path.join('..', 'code'))  # portable form of '..\\code', which only resolves on Windows

from dotenv import load_dotenv
load_dotenv()

from IPython.display import display, Markdown, HTML
from PIL import Image
import logging
import webbrowser
from doc_utils import *


def show_img(img_path, width = None):
    if width is not None:
        display(HTML(f'<img src="{img_path}" width={width}>'))
    else:
        display(Image.open(img_path))


## TaskWeaver Installation
TaskWeaver requires **Python >= 3.10**. It can be installed by running the following commands from the project root folder.

Follow the commands below **very carefully**, starting by creating a **new** conda environment:


```bash
# create the conda environment
conda create -n mmdoc python=3.10 # or 3.11

# activate the conda environment
conda activate mmdoc

# install the project requirements
pip install -r requirements.txt

# clone the repository
git clone https://github.com/microsoft/TaskWeaver.git

# cd into TaskWeaver
cd TaskWeaver

# install the TaskWeaver requirements
pip install -r requirements.txt

# copy the TaskWeaver project directory into the root folder and name it 'test_project'
cp -r project ../test_project/

```
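
Since TaskWeaver requires Python 3.10 or later, it can help to confirm the interpreter version from inside the newly activated environment before going further. The following is a minimal sketch using only the standard library; the helper name `meets_minimum` is illustrative and not part of TaskWeaver.

```python
import sys


def meets_minimum(version_info, minimum=(3, 10)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Check the interpreter that is running this code.
if meets_minimum(sys.version_info):
    print("Python version is sufficient for TaskWeaver.")
else:
    print("Python >= 3.10 is required; recreate the conda environment.")
```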

<br/>




### Populate the Config file

Inside the newly copied `test_project` directory there is a file called `taskweaver_config.json` that needs to be populated. Take the `taskweaver_config.sample.json` file in the root folder of this repo, fill in the Azure OpenAI model values for GPT-4-Turbo, rename it to `taskweaver_config.json`, and copy it into `test_project`, overwriting the existing file if there is one.

<br/>


```json
{
    "llm.api_base": "https://<resource>.openai.azure.com/",
    "llm.api_key":  "<key>",
    "llm.api_type": "azure",
    "llm.api_version": "2023-07-01-preview",
    "llm.model": "gpt-4",
    "llm.embedding_api_type": "openai",
    "llm.embedding_model": "text-embedding-ada-002",
    "session.code_interpreter_only":"false",
    "code_generator.enable_auto_plugin_selection":"false",
    "code_generator.auto_plugin_selection_topk":5,
    "code_generator.prompt_compression":"true",
    "round_compressor.rounds_to_compress":3,
    "round_compressor.rounds_to_retain":4,
    "session.max_internal_chat_round_num":20
}
```
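
A forgotten placeholder such as `<key>` is an easy mistake to make when filling in this file. The sketch below shows one way to catch that before starting TaskWeaver. It is not a TaskWeaver feature: the function name `find_config_problems` and the choice of which keys to treat as mandatory are assumptions made for illustration.

```python
import json

# Assumption: these are the keys we choose to treat as mandatory for this notebook.
REQUIRED_KEYS = ["llm.api_base", "llm.api_key", "llm.api_type", "llm.model"]


def find_config_problems(config_text):
    """Return the required keys that are missing, empty, or still hold a <placeholder>."""
    config = json.loads(config_text)
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not value or "<" in str(value):
            problems.append(key)
    return problems


# Example with the resource placeholder left in by mistake.
sample_config = json.dumps({
    "llm.api_base": "https://<resource>.openai.azure.com/",
    "llm.api_key": "abc123",
    "llm.api_type": "azure",
    "llm.model": "gpt-4",
})
print(find_config_problems(sample_config))
```

To check the real file, read its contents (for example from `../test_project/taskweaver_config.json`) and pass the text to the same function.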


### Make sure we have the OpenAI Models information

We will need the GPT-4-Turbo and GPT-4-Vision models for this notebook.

When you run the cell below, the values should reflect the Azure OpenAI resource you configured in the `.env` file.

In [2]:
model_info = {
        'AZURE_OPENAI_RESOURCE': os.environ.get('AZURE_OPENAI_RESOURCE'),
        'AZURE_OPENAI_KEY': os.environ.get('AZURE_OPENAI_KEY'),
        'AZURE_OPENAI_MODEL_VISION': os.environ.get('AZURE_OPENAI_MODEL_VISION'),
        'AZURE_OPENAI_MODEL': os.environ.get('AZURE_OPENAI_MODEL'),
}
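
Because `os.environ.get` returns `None` for an unset variable, a missing `.env` entry only surfaces later as a failed API call. A minimal sketch of an early check follows; the helper name `missing_settings` and the example values are hypothetical.

```python
def missing_settings(settings):
    """Return the names of settings whose value is None or an empty string."""
    return [name for name, value in settings.items() if not value]


# Hypothetical example: two of the four entries were never set in .env.
example_info = {
    'AZURE_OPENAI_RESOURCE': 'my-resource',
    'AZURE_OPENAI_KEY': None,
    'AZURE_OPENAI_MODEL_VISION': '',
    'AZURE_OPENAI_MODEL': 'gpt-4',
}
print(missing_settings(example_info))
```

Calling `missing_settings(model_info)` on the dictionary built above should return an empty list when the `.env` file is complete.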


### Sample Data

Generate the sample data we will work with in this notebook. 

In [3]:
doc = read_pdf("sample_data/1_slide_1.pdf")
png_files = extract_pages_as_png_files(doc, work_dir = 'sample_data/extracted_images')


result, description = call_gpt4v(png_files[0], gpt4v_prompt = image_description_prompt, temperature = 0.2, model_info=model_info)
print(f"Status: {description}")
Markdown(result)

PDF File 1_slide_1.pdf has 1 pages.
Page 0 saved as sample_data/extracted_images/page_0.png

Start of GPT4V Call to process file(s) ['sample_data/extracted_images/page_0.png'] with model: https://oai-tst-sweden.openai.azure.com/
endpoint https://oai-tst-sweden.openai.azure.com/openai/deployments/gpt4v/extensions/chat/completions?api-version=2023-12-01-preview

End of GPT4V Call to process file(s) ['sample_data/extracted_images/page_0.png'] with model: https://oai-tst-sweden.openai.azure.com/
Status: Image was successfully explained, with Status Code: 200


The image is a document page with multiple charts and graphics related to infrastructure investment and market trends. 

The first chart at the top left is a bar graph titled "Industry infrastructure AUM^1" which shows the growth of Assets Under Management (AUM) in the infrastructure industry from 2017 to 2027. The bars represent the AUM in trillions of dollars, with a Compound Annual Growth Rate (CAGR) indicated by arrows. The values are as follows: 729 billion in 2017 with a 10% CAGR, 1,190 billion in 2022 with the same CAGR, and a forecast of 2,541 billion in 2027 with a 16% CAGR.

The second chart at the top right is a stacked bar graph titled "$75T global infrastructure funding need^2" which shows the cumulative infrastructure investment and needs from 2022 to 2040 in trillions of dollars, segmented by Energy, Telecom & digital, Transport, and Water. The solid part of the bars represents the investment, and the dotted part represents the needs. The values are as follows: Energy has 23 trillion in investment and 40 trillion in needs, Telecom & digital has 7 trillion in investment and an equal amount in needs, Transport has 5 trillion in investment and needs, and Water is not labeled with a value.

The third chart at the bottom left is a bar graph titled "Clients allocating more to infra in new market regime^3" which shows the percentage of clients allocating more, the same, or less capital to various investment types. The investment types are Private Debt, Infrastructure, Private Equity, Hedge Funds, Venture Capital, and Real Estate. The bars are color-coded: red for more capital, gray for the same amount of capital, and black for less capital.

The fourth chart at the bottom right is a bar graph titled "Infrastructure fares well in inflationary environments^4" which compares the 20-year total returns (annualized) of various investment types in high growth/high inflation and low growth/high inflation environments. The investment types are Global Direct Infrastructure, Global Direct Real Estate, Global Equities, and Global Fixed Income. The bars are color-coded: red for high growth/high inflation and black for low growth/high inflation.

The purpose of this image seems to be to inform about the growth potential of infrastructure as an investment segment, the funding needs for global infrastructure, how clients are adjusting their capital allocation in response to market changes, and how infrastructure investments perform in different inflationary environments.

Here is the Markdown representation of the data from the charts:

```markdown
## Industry infrastructure AUM
| Year | AUM ($T) | CAGR (%) |
|------|----------|----------|
| 2017 | 0.729    | 10       |
| 2022 | 1.190    | 10       |
| 2027 | 2.541    | 16       |

## $75T global infrastructure funding need
| Sector            | Investment ($T) | Needs ($T) |
|-------------------|-----------------|------------|
| Energy            | 23              | 40         |
| Telecom & digital | 7               | 7          |
| Transport         | 5               | 5          |
| Water             | Not labeled     | Not labeled|

## Clients allocating more to infra in new market regime
| Investment Type | More Capital (%) | Same Amount of Capital (%) | Less Capital (%) |
|-----------------|------------------|----------------------------|------------------|
| Private Debt    | 43               | 39                         | 18               |
| Infrastructure  | 37               | 45                         | 18               |
| Private Equity  | 28               | 57                         | 15               |
| Hedge Funds     | 22               | 50                         | 28               |
| Venture Capital | 22               | 43                         | 35               |
| Real Estate     | 18               | 52                         | 30               |

## Infrastructure fares well in inflationary environments
| Investment Type           | High Growth / High Inflation (%) | Low Growth / High Inflation (%) |
|---------------------------|----------------------------------|--------------------------------|
| Global Direct Infrastructure | 17                               | 23                             |
| Global Direct Real Estate    | 16                               | 8                              |
| Global Equities              | 15                               | 2                              |
| Global Fixed Income          | 0                                | 8                              |
```

The Python code to create the Pandas Dataframe for the "Industry infrastructure AUM" chart is as follows:

```python
import pandas as pd

df_infrastructure_aum_123456 = pd.DataFrame({
    'Year': [2017, 2022, 2027],
    'AUM ($T)': [0.729, 1.190, 2.541],
    'CAGR (%)': [10, 10, 16]
})
```

For the other charts, similar dataframes can be created by following the structure of the dataframe above, adjusting the column names and values accordingly.
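
The response above embeds its DataFrame code in a fenced `python` block, and a later cell pulls that code out with `extract_code` from `doc_utils`. The source of that helper is not shown in this notebook, so the following is only a sketch of how such extraction could work, under the hypothetical name `extract_python_blocks`. It may differ from what `doc_utils` actually does.

```python
import re

# Built programmatically so this example can itself live inside a fenced block.
FENCE = "`" * 3


def extract_python_blocks(markdown_text):
    """Join the bodies of all python-fenced blocks found in a Markdown string."""
    pattern = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(markdown_text)
    return "\n\n".join(block.strip() for block in blocks)


# A small stand-in for a model response that mixes prose and code.
sample = (
    "Some explanation\n"
    f"{FENCE}python\nimport pandas as pd\nx = 1\n{FENCE}\n"
    "More explanation"
)
print(extract_python_blocks(sample))
```

Blocks fenced with another language tag, such as the `markdown` block in the response, are ignored by this pattern.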

### Start with TaskWeaver

Create the TaskWeaver app and session.

In [8]:
import sys
sys.path.append(os.path.abspath("../TaskWeaver/"))

from taskweaver.app.app import TaskWeaverApp

app = TaskWeaverApp(app_dir='../test_project')
session = app.get_session()

taskweaver_logger = logging.getLogger('taskweaver.logging')
taskweaver_logger.setLevel(logging.ERROR)

INFO:taskweaver.logging:Session 20240317-135527-b90f68f3 is initialized
INFO:taskweaver.logging:Planner initialized successfully
INFO:taskweaver.logging:CodeGenerator initialized successfully
INFO:taskweaver.logging:CodeInterpreter initialized successfully.


### Send Message 

Send a message to the Session to start TaskWeaver's Planner and Code Interpreter.
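
The query sent in the next cell asks for the infrastructure AUM in 2025. As an independent hand check on whatever TaskWeaver returns, the compound-growth arithmetic can be sketched directly. This assumes that the 16% CAGR the extracted table lists against 2027 applies uniformly to every year after 2022; the function name `project_value` is illustrative.

```python
def project_value(start_value, annual_rate, years):
    """Compound start_value forward at annual_rate for the given number of years."""
    return start_value * (1 + annual_rate) ** years


aum_2022_trillions = 1.190  # 2022 value from the extracted table
cagr_2022_2027 = 0.16       # assumption: applies to each year from 2022 onward

aum_2025 = project_value(aum_2022_trillions, cagr_2022_2027, 2025 - 2022)
print(f"Projected 2025 AUM: ${aum_2025:.3f}T")
```

Under these assumptions the projection is about $1.857T. Carrying the same rate through to 2027 gives roughly $2.50T rather than the $2.541T in the table, which suggests the 16% figure is rounded, so small differences from TaskWeaver's answer are to be expected.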

In [10]:
user_query = """
The below is the Context:
## START OF CONTEXT
{py_code}
## END OF CONTEXT

Here is the Chain of Thought and the step-by-step that you should follow:

    1. Please analyze the question first, and locate the variables of interests in the question. For each variable, try to locate the relevant dataframes in the Context above (delimited by '## START OF CONTEXT' and '## END OF CONTEXT') and the relevant variable assignment statements.
    2. Use the Context delimited by '## START OF CONTEXT' and '## END OF CONTEXT' to identify and print to the output the variables of interest. Include the variable assignment statements in the output. Limit this list to the relevant variables **ONLY**. Generate the Python code that will do this step and execute it.
    3. Use the Context delimited by '## START OF CONTEXT' and '## END OF CONTEXT' to identify and print to the output the relevant variables or dataframes names, and print to the output all their columns. Also print all the variable assignment statements. Include the variables or dataframes assignment statements in the output. Limit this list to the relevant variables or dataframes **ONLY**. Generate the Python code that will do this step and execute it.
    4. If you have trouble accessing the previously defined variables or the dataframes for any reasons, then use the Context delimited by '## START OF CONTEXT' and '## END OF CONTEXT' to extract the information you need, and then generate the needed Python code.
    5. Generate the answer to the query. You **MUST** clarify AND print to the output **ALL** calculation steps leading up to the final answer.
    6. You **MUST** detail how you came up with the answer. Please provide a complete description of the calculation steps taken to get to the answer. Please reference the PDF Document and the page number you got the answer from, e.g. "This answer was derived from document 'Sales_Presentation.pdf', page 34".
    7. Generate in **FULL** the answer with all explanations and calculations steps associated with it, and share it with the user in text.
    8. If the answer contains numerical data, then you **MUST** create an Excel file with an extension .xlsx with the data, you **MUST** include inside the Excel the steps of the calculations, the justification, and **ALL** the reference and source numbers and tables that you used to come up with a final answer in addition to the final answer (this Excel is meant for human consumption, do **NOT** use programming variable names as column or row headers, instead use names that are fully meaningful to humans), you **MUST** be elaborate in your comments and rows and column names inside the Excel, you **MUST** save it to the working directory, and then you **MUST** print the full path of the Excel sheet with the final answer - use os.path.abspath() to print the full path.
    9. **VERY IMPORTANT**: do **NOT** attempt to create a list of variables or dataframes directly. Instead, you should access the data from the variables and dataframes that were defined in the Python file that was run.
    

Question: {query}

In your final answer, be elaborate in your response. Describe your logic and the calculation steps to the user, and describe how you deduced the answer step by step. If there are any assumptions you made, please state them clearly. Describe in detail the computation steps you took, quote values and quantities, describe equations as if you are explaining a solution of a math problem to a 12-year-old student. Please relay all steps to the user, and clarify how you got to the final answer. Please reference the PDF Document and the page number you got the answer from, e.g. "This answer was derived from document 'Sales_Presentation.pdf', page 34". After generating the final response, and if the final answer contains numerical data, then you **MUST** create an Excel file with an extension .xlsx with the data, you **MUST** include inside the Excel the steps of the calculations, the justification, and **ALL** the reference and source numbers and tables that you used to come up with a final answer in addition to the final answer (this Excel is meant for human consumption, do **NOT** use programming variable names as column or row headers, instead use names that are fully meaningful to humans), you **MUST** be elaborate in your comments and rows and column names inside the Excel, you **MUST** save it to the working directory, and then you **MUST** print the full path of the Excel sheet with the final answer - use os.path.abspath() to print the full path.

"""

query = "What will be the value of the infra AUM in the year 2025? Mind the CAGR."

prompt = user_query.format(py_code=extract_code(result), query = query)

response_round = session.send_message(prompt, event_handler=TWHandler(verbose=True))

logc(response_round.to_dict()['post_list'][-1]['message'])


Taskweaver: >>> Agent: Planner
>>> Message:
Round started

Taskweaver: >>> Agent: Planner
>>> Message:
1. Analyze the question and locate variables of interest
2. Print relevant variables and their assignment statements <sequentially depends on 1>
3. Print relevant dataframe names and their columns, and variable assignment statements <sequentially depends on 2>
4. Generate Python code to access variables/dataframes if needed <sequentially depends on 3>
5. Calculate the infra AUM value for the year 2025 considering CAGR <sequentially depends on 4>
6. Detail the calculation steps <sequentially depends on 5>
7. Generate the full answer with explanations and calculation steps <sequentially depends on 6>
8. Create an Excel file with the data and calculation steps if the answer contains numerical data <sequentially depends on 7>
9. Save the Excel file and print the full path <sequentially depends on 8>

Taskweaver: >>> Agent: Planner
>>> Message:
1. Ana