# Data Ingestion Techniques

In [1]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("test.txt")
document = loader.load()
document

[Document(page_content='The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. \nThe best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. \nExperiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. \nOn the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literatur

In [2]:
from langchain_community.document_loaders import WebBaseLoader
import bs4
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2018-06-24-attention/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)
document = loader.load()
document

USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(page_content='\n\n      Attention? Attention!\n    \nDate: June 24, 2018  |  Estimated Reading Time: 21 min  |  Author: Lilian Weng\n\n\n\n[Updated on 2018-10-28: Add Pointer Network and the link to my implementation of Transformer.]\n[Updated on 2018-11-06: Add a link to the implementation of Transformer model.]\n[Updated on 2018-11-18: Add Neural Turing Machines.]\n[Updated on 2019-07-18: Correct the mistake on using the term “self-attention” when introducing the show-attention-tell paper; moved it to Self-Attention section.]\n[Updated on 2020-04-07: A follow-up post on improved Transformer models is here.]\nAttention is, to some extent, motivated by how we pay visual attention to different regions of an image or correlate words in one sentence. Take the picture of a Shiba Inu in Fig. 1 as an example.\n\nFig. 1. A Shiba Inu in a men’s outfit. The credit of the original photo goes to Instagram @mensweardog.\nHuman visual attention allows us to focus on a certain region with 
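
The `USER_AGENT` warning in the output above is harmless, but it can usually be avoided by setting the variable before the loader module is imported. A minimal sketch; the value `rag-notebook/0.1` is a hypothetical placeholder, not something the original notebook used:

```python
import os

# Set a User-Agent string before importing WebBaseLoader so the library
# can pick it up at import time. "rag-notebook/0.1" is a placeholder;
# use a value that identifies your own application.
os.environ.setdefault("USER_AGENT", "rag-notebook/0.1")
user_agent = os.environ["USER_AGENT"]
```

`setdefault` leaves any value already exported in the shell untouched.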

In [3]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("Java_Full_Notes.pdf")
document = loader.load()
document[:5]

[Document(page_content='Java Notes  \nUnit - 1 \nJava: java is a n object -oriented  programing language which is developed by James Gosling  at sun \nmicrosystems in 1991. It was initially named  Oak.  \nJava Buzzwords:  \n➔ Simple  \n➔ Platform  independence  \n➔ Secure  \n➔ Portable  \n➔ Multi -threaded  \n➔ Robust  \n➔ Reach standard libraries  \n➔ Object oriented programing  \nData Type  \njava datatype is the type of variable  to store different types of data. In java there are 8 types  od data \ntypes, byte, short, char, int, long, float, double and Boolean which are categories in 4 types.  \n➔ Integers (byte, short, int and long)  \n➔ Floating Point ( float and double)  \n➔ Character (char)  \n➔ Boolean (true/false)  \nVariables  \nVariables in java are the basic unit of storage in which we can store different types of data. In simple \nterm variables are the containers in java in which we can store different data with their  different data \ntypes. Variables have their  own sc

# Splitting text into chunks

In [29]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
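splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum chunk length, as measured by length_function
    chunk_overlap=20,      # maximum overlap between consecutive chunks
    separators=["\n"],     # split on newlines only
    length_function=len,   # measure length in characters
)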

In [30]:
documents = splitter.split_documents(document)
documents[:5]

[Document(page_content='Java Notes  \nUnit - 1 \nJava: java is a n object -oriented  programing language which is developed by James Gosling  at sun \nmicrosystems in 1991. It was initially named  Oak.  \nJava Buzzwords:  \n➔ Simple  \n➔ Platform  independence  \n➔ Secure  \n➔ Portable  \n➔ Multi -threaded  \n➔ Robust  \n➔ Reach standard libraries  \n➔ Object oriented programing  \nData Type  \njava datatype is the type of variable  to store different types of data. In java there are 8 types  od data \ntypes, byte, short, char, int, long, float, double and Boolean which are categories in 4 types.  \n➔ Integers (byte, short, int and long)  \n➔ Floating Point ( float and double)  \n➔ Character (char)  \n➔ Boolean (true/false)  \nVariables  \nVariables in java are the basic unit of storage in which we can store different types of data. In simple \nterm variables are the containers in java in which we can store different data with their  different data', metadata={'source': 'Java_Full_Note
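
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of newline-based chunking with overlap. It is not the library's implementation; the function name `split_with_overlap` and the sample text are hypothetical, and the small sizes are chosen only to keep the example readable.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=20, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks.

    Each chunk is at most chunk_size characters long (unless a single
    piece is itself longer), and up to chunk_overlap trailing characters
    (whole pieces only) are carried over into the next chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: ten short lines, split with deliberately small sizes.
sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=25, chunk_overlap=8)
```

With these settings, each chunk repeats the last line of the previous one, which is the effect `chunk_overlap` is meant to produce.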

# Vector embedding and database

In [31]:
from langchain_community.embeddings import HuggingFaceHubEmbeddings
from dotenv import load_dotenv

# Load API keys (e.g. HUGGINGFACEHUB_API_TOKEN) from a local .env file
load_dotenv()
embeddings = HuggingFaceHubEmbeddings(model="sentence-transformers/all-MiniLM-L6-v2")

In [32]:
text = "Hi"
embeddings.embed_query(text)[:5]

[-0.09047622233629227,
 0.040439605712890625,
 0.023905642330646515,
 0.05894797295331955,
 -0.022882355377078056]
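
An embedding is just a list of floats, and two texts are compared by comparing their vectors. Cosine similarity is one common measure; the sketch below uses made-up three-dimensional toy vectors rather than real model output, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, not produced by an embedding model).
v_query = [1.0, 0.0, 1.0]
v_close = [2.0, 0.0, 2.0]   # same direction as v_query
v_far = [0.0, 1.0, 0.0]     # orthogonal to v_query

sim_close = cosine_similarity(v_query, v_close)
sim_far = cosine_similarity(v_query, v_far)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.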

In [33]:
from langchain_community.vectorstores import Chroma
db = Chroma.from_documents(documents, embeddings)

In [34]:
query="what is java and who is the author of java?"
res = db.similarity_search(query)
res[0].page_content

'Java Notes  \nUnit - 1 \nJava: java is a n object -oriented  programing language which is developed by James Gosling  at sun \nmicrosystems in 1991. It was initially named  Oak.  \nJava Buzzwords:'

In [35]:
from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(documents, embeddings)

In [36]:
query="what is java and who is the author of java?"
res = db.similarity_search(query)
res[0].page_content

'o JDK consists of:  \n▪ Javac: Java compiler, it is used to compile the java code to the java byte code.  \n▪ Java: Java interpreter, it is used to execute the java byte code in JVM.  \n▪ Javap: Java dissembler, it is used to examine the byte code instructions.  \n▪ Javadoc: Documentation generator, generate the document for the java program  \n▪ Debugger: tool for debugging the java application.  \n➔ JVM: It stands for Java Virtual Machine (JVM)  \no JVM is used to compile the java code to the java byte code to run the java program  \no It is used to run the java program into any machine  \no It consists of:  \n▪ Memory management: it is used to mange memory or to perform garbase \ncollection auto magically . \n▪ Used to debug the java byte code  \n▪ Also act as an interpreter to convert the java byte code to the machine native \nlanguage.  \n➔ JRE: It stands for Java Runtime Environment (JRE)  \no JRE is the subset of JDK'
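
Conceptually, `similarity_search` ranks the stored chunk vectors by their closeness to the query vector and returns the best matches. The brute-force sketch below illustrates that idea with Euclidean distance as an example metric; real vector stores use optimized indexes, and the function `top_k`, the toy vectors, and the shortened texts are all hypothetical.

```python
import math


def top_k(query_vec, store, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, math.dist(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "vector store": (text, vector) pairs with made-up 2-D vectors.
store = [
    ("Java was developed by James Gosling.", [0.9, 0.1]),
    ("JDK consists of javac, java, javap.", [0.5, 0.5]),
    ("Variables are containers for data.", [0.1, 0.9]),
]

results = top_k([1.0, 0.0], store, k=2)
```

Returning the distance alongside each text makes it easier to judge how strong a match is, which is useful when two stores rank the same query differently.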

# Creating the prompt template and loading the LLM

In [37]:
from langchain_groq import ChatGroq
# ChatGroq reads its API key from the GROQ_API_KEY environment variable
llm = ChatGroq(model="llama3-8b-8192", temperature=0.8)
llm.invoke("What is ML")

AIMessage(content='ML stands for Machine Learning, which is a subset of Artificial Intelligence (AI) that involves training algorithms to learn from data and make predictions or decisions without being explicitly programmed.\n\nMachine Learning is a type of algorithm that enables a computer to learn from experience and improve its performance on a task over time. The algorithm learns by recognizing patterns in data, identifying relationships between data points, and making predictions or decisions based on that information.\n\nThere are three main types of Machine Learning:\n\n1. **Supervised Learning**: In this type of learning, the algorithm is trained on labeled data, meaning the data is accompanied by correct outputs or responses. The algorithm learns to map inputs to outputs based on the labeled data and makes predictions on new, unseen data.\n2. **Unsupervised Learning**: In this type of learning, the algorithm is trained on unlabeled data, and it must find patterns or relationsh

In [38]:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context.
Think step by step before providing a detailed answer.
I will tip you $1000 if the user finds the answer helpful.
<context>
{context}
</context>
Question: {input}"""
)

# Creating a stuff chain to combine the LLM with the prompt

In [39]:
from langchain.chains.combine_documents import create_stuff_documents_chain
document_chain = create_stuff_documents_chain(
    llm = llm,
    prompt = prompt
)
document_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), config={'run_name': 'format_inputs'})
| ChatPromptTemplate(input_variables=['context', 'input'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], template='\nAnswer the following question based only on the provided context.\nThink step by step before providing a detailed answer.\nI will tip you $1000 if the user finds the answer helpful.\n<context>\n{context}\n</context>\nQuestion: {input}'))])
| ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x0000022988A9D4D0>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x0000022988A9E310>, model_name='llama3-8b-8192', temperature=0.8, groq_api_key=SecretStr('**********'))
| StrOutputParser(), config={'run_name': 'stuff_documents_chain'})
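
"Stuffing" means that the text of all retrieved documents is concatenated and placed into the `{context}` slot of the prompt. The sketch below reproduces that step with plain strings. It is an illustration, not the library's code: the template is abbreviated, the blank-line separator is an assumption, and `build_prompt` and the sample documents are hypothetical.

```python
# Abbreviated version of the notebook's prompt template.
TEMPLATE = """
Answer the following question based only on the provided context.
<context>
{context}
</context>
Question: {input}"""


def format_docs(docs, separator="\n\n"):
    """Join document texts into a single context string."""
    return separator.join(docs)


def build_prompt(docs, question):
    """Fill the template with the stuffed context and the question."""
    return TEMPLATE.format(context=format_docs(docs), input=question)


# Toy documents standing in for retrieved chunks.
docs = ["Java was initially named Oak.", "JVM stands for Java Virtual Machine."]
prompt_text = build_prompt(docs, "What was Java initially named?")
```

Because every document is inserted in full, this approach only works while the combined text fits in the model's context window.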

# Creating a retrieval chain to fetch documents from the vector store and pass them to the document chain

In [40]:
from langchain.chains import create_retrieval_chain
retrieval_chain = create_retrieval_chain(
    db.as_retriever(),
    document_chain
)
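
The retrieval chain can be thought of as two steps glued together: fetch the relevant documents for the question, then hand them to the document chain together with the question. A minimal sketch of that control flow, using hypothetical stand-in functions (`fake_retrieve`, `fake_combine`) instead of a real retriever and LLM:

```python
def make_retrieval_chain(retrieve, combine_docs):
    """Return a function that retrieves context and then answers with it."""

    def invoke(inputs):
        question = inputs["input"]
        docs = retrieve(question)
        answer = combine_docs({"input": question, "context": docs})
        # Mirror the result shape used in this notebook:
        # the question, the retrieved context, and the answer.
        return {"input": question, "context": docs, "answer": answer}

    return invoke


# Hypothetical stand-ins for the retriever and the document chain.
def fake_retrieve(question):
    return ["Java was developed by James Gosling at Sun Microsystems."]


def fake_combine(inputs):
    return f"Answer based on {len(inputs['context'])} document(s)."


chain = make_retrieval_chain(fake_retrieve, fake_combine)
result = chain({"input": "Who developed Java?"})
```

Keeping the retrieved documents in the result makes it possible to inspect which chunks an answer was based on.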

# Testing

In [42]:
response = retrieval_chain.invoke({"input": "what is java and who is the author of java?"})
response

{'input': 'what is java and who is the author of java?',
 'context': [Document(page_content='o JDK consists of:  \n▪ Javac: Java compiler, it is used to compile the java code to the java byte code.  \n▪ Java: Java interpreter, it is used to execute the java byte code in JVM.  \n▪ Javap: Java dissembler, it is used to examine the byte code instructions.  \n▪ Javadoc: Documentation generator, generate the document for the java program  \n▪ Debugger: tool for debugging the java application.  \n➔ JVM: It stands for Java Virtual Machine (JVM)  \no JVM is used to compile the java code to the java byte code to run the java program  \no It is used to run the java program into any machine  \no It consists of:  \n▪ Memory management: it is used to mange memory or to perform garbase \ncollection auto magically . \n▪ Used to debug the java byte code  \n▪ Also act as an interpreter to convert the java byte code to the machine native \nlanguage.  \n➔ JRE: It stands for Java Runtime Environment (JRE)

In [43]:
def ask_question(question):
    """Run the retrieval chain on a question and return only the answer text."""
    return retrieval_chain.invoke({"input": question})['answer']

In [45]:
response = ask_question("Provide me the simple code for class creation")
print(response)

Based on the provided context, here is a simple code for class creation in Java:

```java
class MyClass{
    // Properties (fields)
    public int myInt;
    private String myString;

    // Constructors
    public MyClass(){   // default constructor
        myInt = 0;
        myString = "";
    }
    public MyClass(int num, String str){   // parameterized constructor
        myInt = num;
        myString = str;
    }
    public MyClass(MyClass obj){  // copy constructor
        myInt = obj.myInt;
        myString = obj.myString;
    }

    // Methods
    public void setValues(int num, String str){
        myInt = num;
        myString = str;
    }
    public void getValues(){
        System.out.println("myInt = " + myInt);
        System.out.println("myStr = "+ myString);
    }
}
```

Note that this code defines a class `MyClass` with properties `myInt` and `myString`, two constructors (default and parameterized), and two methods (`setValues` and `getValues`).
