# Overview 

<span style="color:red; font-size:18px; font-weight:bold;">In this notebook, we build a workflow between LangChain and the Mixtral LLM.<br>
We want to accomplish the following:</span>
1. Establish a pipeline to read in a CSV file and pass it to our LLM.
2. Establish a basic inference agent.
3. Test the basic inference agent on a few tasks.

# Setting up the Environment 

In [21]:
####################################################################################################
import os
import re
import subprocess
import sys

from langchain import hub
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_experimental.utilities import PythonREPL
from langchain.agents import Tool
from langchain.agents.format_scratchpad.openai_tools import (
    format_to_openai_tool_messages,
)
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser

from langchain.agents import AgentExecutor

from langchain.agents import create_structured_chat_agent
from langchain.memory import ConversationBufferWindowMemory

from langchain.output_parsers import PandasDataFrameOutputParser
from langchain.prompts import PromptTemplate

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

from langchain_community.chat_models import ChatAnyscale
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np

plt.style.use('ggplot')
####################################################################################################

In [3]:
# Insert your API key here. Never commit a real key to a shared notebook.
os.environ["ANYSCALE_API_KEY"] = "<YOUR_ANYSCALE_API_KEY>"
memory_key = "history"

# Reading the Dataframe and saving it 
<span style="font-size:16px;color:red;">We save the dataframe to a CSV file in the current working directory because each inference may update it. This way the model always has the most up-to-date version of the dataframe at its disposal.</span>
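
The cell that creates `df` and writes `df.csv` is not shown in this notebook. A minimal sketch of that step, assuming pandas: the `save_working_copy` helper and the two-row stand-in frame are hypothetical, and in practice `df` would come from `pd.read_csv` on your own source file.

```python
import os
import tempfile

import pandas as pd


def save_working_copy(df: pd.DataFrame, path: str = "df.csv") -> pd.DataFrame:
    """Write df to path and read it back, so later cells see what the agent will see."""
    df.to_csv(path, index=False)
    return pd.read_csv(path)


# Stand-in data (hypothetical values); replace with pd.read_csv("<your source file>").
df = pd.DataFrame({"Fare": [7.25, 71.28], "Survived": [0, 1]})

# Demonstrated in a temporary directory so the sketch has no side effects;
# the notebook itself keeps 'df.csv' in the current working directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "df.csv")
    reloaded = save_working_copy(df, csv_path)

print(reloaded.shape)
```

Writing with `index=False` matters here: without it, every save and reload cycle would add another unnamed index column to the file.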

# DECLARE YOUR LLM HERE 

In [5]:
llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0)

# REPL Tool 
<span style="font-size:16px;color:red;">This tool enables our LLM to 'execute' python code so that we can actually display the results to the end-user.</span>

In [6]:
python_repl = PythonREPL()

repl_tool = Tool(
    name="python_repl",
    description="""A Python shell. The shell can display charts too. Use this to execute Python commands.\
    You have access to all libraries in python including but not limited to sklearn, pandas, numpy,\
    matplotlib.pyplot, seaborn etc. Input should be a valid python command. If the user has not explicitly\
    asked you to plot the results, always print the final output using print(...)""",
    func=python_repl.run,
)

tools = [repl_tool]
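
Conceptually, a REPL tool takes a string of Python source, executes it, and returns whatever was printed. The sketch below illustrates that idea with the standard library only. `run_python` is a hypothetical stand-in, not LangChain's `PythonREPL` implementation, and it provides no sandboxing.

```python
import contextlib
import io


def run_python(command: str) -> str:
    """Execute command and return captured stdout, or the error's repr on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(command, {})  # fresh globals for every call
    except Exception as exc:
        return repr(exc)
    return buffer.getvalue()


ok_output = run_python("print(1 + 1)")
err_output = run_python("1 / 0")
print(ok_output, err_output)
```

Because only printed text comes back from a tool like this, the tool description asks the model to finish with `print(...)`.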

# Building an Agent 

In [30]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a Machine Learning Inference agent. Your job is to use your tools to answer a user query\
            in the best manner possible.\
            The user gives you a query that describes a target task.
            Your job is to extract what you need to do from this query and then accomplish that task.\
            In case the user asks you to do things like regression/classification, you will need to follow this:\
            READ IN YOUR DATAFRAME FROM: 'df.csv'
            1. Split the dataframe into X and y (y= target, X=features)
            2. Perform some preprocessing on X as per user instructions. 
            3. Split preprocessed X and y into train and test sets.
            4. If a model is specified, use that model or else pick your model of choice for the given task.
            5. Run this model on train set. 
            6. Predict on Test Set.
            7. Print accuracy score by predicting on the test.
            
            Provide no explanation for your code. Enclose all your code between triple backticks ``` """,
        ),
        ("user", "Dataframe named df:\n{df}\nQuery: {input}\nList of Tools: {tools}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

In [None]:
# exist_ok=True lets this cell be re-run without raising FileExistsError.
os.makedirs('code', exist_ok=True)

In [18]:
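def run_code(code):
    """Write `code` to code/code.py, run it in a subprocess, and return (output, error_flag)."""
    script_path = os.path.join(os.getcwd(), 'code', 'code.py')
    with open(script_path, 'w') as file:
        file.write(code)

    try:
        print("Running code ...\n")
        # check=True raises CalledProcessError on a non-zero exit status.
        result = subprocess.run([sys.executable, script_path], capture_output=True, text=True, check=True, timeout=20)
        return result.stdout, False

    except subprocess.CalledProcessError as e:
        print("Error in code!\n")
        # Return stdout and the traceback together.
        return e.stdout + e.stderr, True

    except subprocess.TimeoutExpired:
        return "Execution timed out.", True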
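
To make the return contract concrete: `run_code` yields an `(output, error_flag)` pair. The self-contained sketch below mirrors it with a temporary file instead of `code/code.py` (the name `run_snippet` is hypothetical) and shows one successful and one failing run.

```python
import os
import subprocess
import sys
import tempfile


def run_snippet(code: str, timeout: int = 20):
    """Run code in a fresh interpreter and return (output, error_flag)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "snippet.py")
        with open(path, "w") as file:
            file.write(code)
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True, text=True, check=True, timeout=timeout,
            )
            return result.stdout, False
        except subprocess.CalledProcessError as e:
            # Non-zero exit status: return stdout and the traceback together.
            return e.stdout + e.stderr, True
        except subprocess.TimeoutExpired:
            return "Execution timed out.", True


ok_out, ok_err = run_snippet("print(2 + 3)")
bad_out, bad_err = run_snippet("raise ValueError('boom')")
print(ok_out, ok_err)
print(bad_err)
```

Running the generated code in a separate process keeps a crash or an endless loop in that code from taking down the notebook kernel.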

In [32]:
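def infer(user_input, df, llm, max_attempts=3):
    """Ask the agent for code, run it, and retry while the generated code fails.

    `df` is kept for call compatibility; the dataframe is re-read from 'df.csv' on every attempt.
    """
    agent = (
        {
            "input": lambda x: x["input"],
            "tools": lambda x: x["tools"],
            "df": lambda x: x["df"],
            "agent_scratchpad": lambda x: format_to_openai_tool_messages(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm
        | OpenAIToolsAgentOutputParser()
    )

    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

    res = None
    # Bound the retries so that persistently failing code cannot loop forever.
    for _ in range(max_attempts):
        result = list(agent_executor.stream({"input": user_input,
                                             "df": pd.read_csv('df.csv'), "tools": tools}))

        # Extract every fenced python block from the model's final answer.
        pattern = r"```python\n(.*?)\n```"
        matches = re.findall(pattern, result[0]['output'], re.DOTALL)
        code = "\n".join(matches)
        # Persist any changes the generated code made to df for the next call.
        code += "\ndf.to_csv('./df.csv', index=False)"

        res, error_flag = run_code(code)
        print(res)

        if not error_flag:
            break

    return res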
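
The extraction step inside `infer` relies on a non-greedy regex with `re.DOTALL`, so that `.` also matches newlines and each fenced block is captured separately. A small standalone sketch follows; the `extract_code` helper and the sample response are made up for illustration.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this block readable
PATTERN = re.compile(FENCE + r"python\n(.*?)\n" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Join every fenced python block found in response into one script."""
    return "\n".join(PATTERN.findall(response))


response = (
    "Here is the code:\n"
    f"{FENCE}python\nimport math\nprint(math.sqrt(16))\n{FENCE}\n"
    "And a second block:\n"
    f"{FENCE}python\nprint('done')\n{FENCE}\n"
)
code = extract_code(response)
print(code)
```

If the model replies without a fenced block, the result is an empty string. In that case the script consists only of the appended `df.to_csv` line, which fails because `df` is undefined there, so the run is reported as an error.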

In [33]:
infer('help me predict the fare using linear regression. Drop Nan rows and non-numerical columns.', df, llm)



> Entering new AgentExecutor chain...
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
import pandas as pd
import numpy as np

df = pd.read_csv('df.csv')

# Drop rows with NaN values and non-numerical columns
df = df.dropna().select_dtypes(include=['int64', 'float64'])

X = df.drop('Fare', axis=1)
y = df['Fare']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(model.score(X_test, y_test))
```

> Finished chain.
Running code ...

0.3610845260081037



'0.3610845260081037\n'

# Testing the Agent

### First we perform some data cleaning. Observe how each infer call updates the df and passes it onwards.

In [11]:
df.head(1)

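| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |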


In [12]:
# infer("Please drop the following columns inplace from the original dataframe: Name, Ticket, Cabin, PassengerId.")

In [13]:
# infer("Please map the strings in the following columns of the original dataframe itself to integers: sex, embarked")

In [14]:
# infer("Please remove NaNs with most frequent value in the respective column. Do this inplace in original dataframe.")
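
For reference, the three cleaning prompts above ask for roughly the following pandas operations. This is a hand-written sketch on a tiny stand-in frame with hypothetical values, not the code the agent actually produced.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) that reuses some of the column names.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", None, "S"],
    "Age": [22.0, None, 22.0],
})

# 1. Drop columns that are not useful as features.
df = df.drop(columns=["Name"])

# 2. Map string categories to integers; missing values stay NaN for step 3.
for column in ["Sex", "Embarked"]:
    categories = sorted(df[column].dropna().unique())
    mapping = {value: code for code, value in enumerate(categories)}
    df[column] = df[column].map(mapping)

# 3. Replace NaNs with the most frequent value of each column.
df = df.fillna(df.mode().iloc[0])

print(df)
```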

### Linear Regression to predict fare

In [None]:
infer("""Now I want you to help me run linear regression to predict the Fare. Please drop NaN rows and drop
    non-numerical columns too.""", df, llm)

### SVR to predict Fare

In [None]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Fare column.
2. Split X and y into train, test sets.
3. Fit a SVR model to the train set.
4. Predict on test set.
5. print the MSE loss on the predictions.

Run all steps from 1 to 5.
""", df, llm)

### Logistic Regression to Classify Survival

In [None]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train, test sets.
3. Fit a Logistic Regression model to the train set.
4. Predict on test set.
5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 5.
""", df, llm)

### SVM to Classify Survival 

In [None]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train, test sets.
3. Fit an SVM with a polynomial kernel to the train set.
4. Predict on test set.
5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 5.
""", df, llm)

### Naive Bayes to Classify Survival

In [None]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train, test sets.
3. Fit a Naive Bayes model to the train set.
4. Predict on test set.
5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 5.
""", df, llm)

### KNN to Classify Survival
<span style="font-size:16px;">We perform a grid search to figure out the best n_neighbors parameter!</span>
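
One way such a grid search could look, assuming scikit-learn is installed. The data here is synthetic (generated with `make_classification`) so the sketch runs on its own; in the notebook, `X` and `y` would come from `df` with `Survived` as the target.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the notebook's dataframe.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# range(1, 10) covers n_neighbors = 1..9, following Python's half-open range.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 10))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
accuracy = search.score(X_test, y_test)
print(f"best n_neighbors: {best_k}")
print(f"test accuracy: {accuracy:.4f}")
```

The search cross-validates on the training set only, so the held-out test accuracy remains an unbiased check of the selected `n_neighbors`.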

In [None]:
infer("""Now I want you to follow all these steps in order:\n
1. Split the dataframe into X and y. Where y is Survived column.
2. Split X and y into train and test sets.
3. Fit a KNN model to the train set.
4. Find the best n_neighbors parameter out of range (1,10). Import some grid search method to do this.
5. Predict with best model on test set.
6. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.

Run all steps from 1 to 6.
""", df, llm)