<a href="https://colab.research.google.com/github/TheDumbEngineer/AI_Assistants/blob/main/ReverseEngineeringAssistantV2_0_RAG_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reverse Engineering Assistant V2.0


In [1]:
!pip install langchain langchain_community langchain_chroma langchain-openai langchainhub unstructured[pdf]



In [2]:
import getpass
import os

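# Prompt for the key only if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")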

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")



In [3]:
#Using the UnstructuredPDFLoader for loading the document
from langchain_community.document_loaders import UnstructuredPDFLoader

In [4]:
#Create the loader
loader = UnstructuredPDFLoader("/content/drive/MyDrive/Book PDFs/PracticalBinary.pdf")

In [5]:
#Get the loaded data
data = loader.load()

In [6]:
#Importing the libraries
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain import hub
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chains import TransformChain
# ChatOpenAI is already imported from langchain_openai above, so the
# deprecated langchain.chat_models import is not repeated here.
import sys

In [26]:
# Split the loaded data into overlapping chunks and create a vector store
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(data)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
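
To make the `chunk_size=1000, chunk_overlap=200` settings concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch does not do. The helper name `chunk_text` is hypothetical and is not part of LangChain.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 2500 characters of dummy text standing in for the PDF content
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])
```

With these settings each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk.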

In [8]:
# Retrieve the relevant chunks from the PDF and generate an answer from them
retriever = vectorstore.as_retriever()
prompt = hub.pull("byteberzerker/reverse_helper")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("Give me a step by step guide on binary analysis, complete with the tools used and how to use them")





'Binary analysis is a complex but fascinating topic in hacking and computer science. It involves analyzing binary programs to understand how they work, identify vulnerabilities, and improve security. Here is a step-by-step guide to binary analysis along with the tools used and how to use them:\n\n1. **Understand Binary Formats**: \n   - Learn about the C compilation process, which involves preprocessing, compiling, assembling, and linking.\n   - Familiarize yourself with the anatomy of a binary, including the structure of the binary file, headers, sections, and segments.\n\n2. **Learn Assembly Language**:\n   - Study the instruction set architecture and assembly syntax of the target platform.\n   - Practice writing and understanding assembly code to get a deeper understanding of how binaries are executed.\n\n3. **Use Tools for Binary Analysis**:\n   - Start with basic tools like Radare, IDA Pro, or OllyDbg for static and dynamic analysis of binaries.\n   - Explore more advanced tools l
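
The chain in the cell above passes the question to the retriever, joins the retrieved chunks into a context string, fills the prompt, and sends it to the model. The sketch below traces that data flow with plain Python stand-ins so it runs without an API key. `FakeDoc`, `fake_retriever`, `fake_prompt`, `fake_llm`, and `run_chain` are all hypothetical placeholders, and the prompt layout is an assumption; the actual `byteberzerker/reverse_helper` template is not shown in this notebook.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a LangChain document; only page_content is modelled."""
    page_content: str


def format_docs(docs):
    # Same helper as in the notebook: join chunks with blank lines
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # A real retriever would embed the question and query the vector store
    return [
        FakeDoc("ELF binaries contain sections."),
        FakeDoc("objdump disassembles binaries."),
    ]


def fake_prompt(inputs):
    # Assumed layout; the real hub prompt may be structured differently
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    # A real model call would go here
    return f"[model answer based on {len(prompt_text)} prompt characters]"


def run_chain(question):
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()}
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return fake_llm(fake_prompt(inputs))


question = "How do I disassemble a binary?"
print(fake_prompt({"context": format_docs(fake_retriever(question)), "question": question}))
print(run_chain(question))
```

The point of the dictionary step is that the same question string is used twice: once as the retrieval query and once, unchanged, as the `question` field of the prompt.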

# With Nice Formatting, No Memory

In [27]:
# Return the 6 most similar chunks for each query
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
prompt = hub.pull("byteberzerker/reverse_helper")

question = "I have a Windows 64-bit binary, what tools can I use to analyze it and how do I use the tools?"

# Check how many documents the retriever returns (should match k)
retrieved_docs = retriever.invoke(question)
print(len(retrieved_docs))

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result = rag_chain.invoke(question)



# Function to print the result nicely
def print_nice_result(result):
    print("\n=== RAG Chain Result ===\n")
    sections = result.split("\n\n")
    for idx, section in enumerate(sections, start=1):
        print(f"Section {idx}:\n{section.strip()}\n")
        print("=" * 40)

# Print the nicely formatted result
print_nice_result(result)

# Clean up: delete the collection from the vector store
vectorstore.delete_collection()



6

=== RAG Chain Result ===

Section 1:
To analyze a Windows 64-bit binary, you can use the following tools:

Section 2:
1. **Angr**: Angr is a Python-oriented reverse engineering platform that can be used to build your own binary analysis tools. It offers advanced features like backwards slicing and symbolic execution. Angr is free and open source.

Section 3:
2. **Pin**: Pin is a dynamic binary instrumentation engine that allows you to build tools to add or modify a binary's behavior at runtime. It is free but not open source and is developed by Intel. Pin supports Intel CPU architectures.

Section 4:
3. **Dyninst**: Dyninst is a dynamic binary instrumentation API that can also be used for disassembly. It is free and open source and is more research-oriented than Pin.

Section 5:
4. **Unicorn**: Unicorn is a lightweight CPU emulator that supports multiple platforms. It can be used for binary analysis and emulation. Unicorn is free and open source.

Section 6:
To use these tools, you 
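
The retriever in this section is configured with `search_type="similarity"` and `k=6`, which is why the cell prints `6` before the answer. As a rough illustration of what a top-k similarity lookup does, the sketch below ranks toy vectors by cosine similarity. This is an assumption made for illustration: the metric Chroma actually uses depends on how the collection is configured, and the real vectors come from `OpenAIEmbeddings`, not from hand-written lists. The names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=6):
    """Return the texts of the k entries most similar to query_vec.

    indexed is a list of (text, vector) pairs standing in for the vector store.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions
toy_index = [
    ("chunk about disassemblers", [1.0, 0.0, 0.0]),
    ("chunk about debuggers", [0.8, 0.6, 0.0]),
    ("chunk about compilers", [0.0, 1.0, 0.0]),
    ("chunk about linkers", [0.0, 0.0, 1.0]),
]
print(top_k([1.0, 0.1, 0.0], toy_index, k=2))
```

A larger `k` gives the model more context to draw on at the cost of a longer prompt, which is the trade-off behind raising it from the retriever default to 6 here.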