## Document Loaders In LangChain

In [1]:
from langchain_community.document_loaders import Docx2txtLoader

In [4]:
loader = Docx2txtLoader("resource/Mufaddal Shakir.docx")

data = loader.load()

In [5]:
data[0].page_content

'Mufaddal Shakir \n\nEmail ID: mufaddal@infraspec.dev\n\nPhone: +91 9175593155\n\n\n\nSUMMARY\n\nComputer Engineering graduate with a strong commitment to continuous learning and problem-solving.\n\nDeveloped over 12 mobile applications during college, demonstrating proficiency in Android Studio, Java, Firebase, Flutter, and web technologies.\n\nWell-versed in CI/CD workflows and Git, with comprehensive knowledge of DevOps best practices.\n\nImplemented Argo Rollouts and ArgoCD for Kubernetes deployment.\n\nProvisioned infrastructure using Terraform, adhering to best practices for efficient and scalable deployment.\n\nExperienced in leading successful migrations, optimizing deployments, and contributing to ongoing feature development. \n\nA dynamic individual, always ready for new experiences and adept at meeting challenges head-on.\n\n\n\nSKILLS\n\nLanguages: C++, Python, Java, SQL (MySQL)\n\nFrameworks: Flutter, Nodejs, MongoDB, Android(Java)\n\nPlatforms: Firebase, AWS\n\nTools: Git

In [6]:
data[0].metadata

{'source': 'resource/Mufaddal Shakir.docx'}

## Text Splitters

Why do we need text splitters in the first place?

LLMs have token limits, so a large text must be split into smaller chunks that each fit under the limit. LangChain provides several text splitter classes for this.

Splitting data into chunks can be done in native Python, but it is a tedious process. You may also need to experiment with various delimiters iteratively to ensure that no chunk exceeds the token limit of the LLM you are using.
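To illustrate why the manual route is tedious, here is a minimal sketch of fixed-width slicing in plain Python. The helper name `naive_split` and the sample string are made up for this illustration. Notice that the slices ignore word and sentence boundaries, which is exactly the problem delimiter-based splitting tries to avoid.

```python
def naive_split(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive slices of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Illustrative sample text (not loaded from the document above)
sample = "Well-versed in CI/CD workflows and Git. Implemented Argo Rollouts and ArgoCD."

pieces = naive_split(sample, 30)
for piece in pieces:
    # repr() makes it easy to see where words were cut in half
    print(repr(piece))
```

Every slice respects the size limit, but fixing the mid-word cuts would mean adding delimiter handling by hand.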

**LangChain provides a better way through its text splitter classes.**

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # separators tried in order (this list is also the default)
    chunk_size=1000,  # maximum size of each chunk, as measured by length_function
    chunk_overlap=0,  # size of the overlap between consecutive chunks, used to preserve context
    length_function=len,  # function that measures size; len counts characters, but any token counter can be passed
)
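The `length_function` argument accepts any callable that takes a string and returns an integer. As a rough sketch, a whitespace-based word count can stand in for a real tokenizer. `approx_token_count` is a hypothetical helper written for this example, and word counts only approximate how an actual model tokenizes text.

```python
def approx_token_count(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Illustrative sample text
sample = "Provisioned infrastructure using Terraform"

print(len(sample))                 # size measured in characters
print(approx_token_count(sample))  # size measured in words
```

Passing `length_function=approx_token_count` to the splitter would make `chunk_size` count words rather than characters.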

In [8]:
chunks = r_splitter.split_text(data[0].page_content)

In [10]:
len(chunks)

8

The recursive text splitter uses a list of separators, i.e. `separators = ["\n\n", "\n", " ", ""]`.

It first splits the text on `\n\n`. If a resulting piece is still longer than the `chunk_size` parameter (1000 in our case), it splits that piece again using the next separator, `\n`, and so on down the list. Pieces smaller than `chunk_size` are merged back together as long as the merged chunk stays within the limit.
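The idea can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops empty pieces. It also uses a small `chunk_size` of 20 so that the fallback to later separators is visible on a short sample.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch: split on the first separator, merge small pieces,
    and recurse with the remaining separators on pieces that are too large."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too large on its own: try the next separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


# Illustrative sample text
sample = "SUMMARY\n\nComputer Engineering graduate\n\nSKILLS\n\nLanguages: C++, Python, Java"
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], 20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Lines that fit within the limit stay whole, while longer lines are broken at spaces rather than in the middle of a word.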

In [11]:
for chunk in chunks:
    print(len(chunk))

974
938
988
928
937
842
982
67


LangChain can recursively split text based on a list of separators. The class that does this is RecursiveCharacterTextSplitter.

In [14]:
print(chunks[0])

Mufaddal Shakir 

Email ID: mufaddal@infraspec.dev

Phone: +91 9175593155



SUMMARY

Computer Engineering graduate with a strong commitment to continuous learning and problem-solving.

Developed over 12 mobile applications during college, demonstrating proficiency in Android Studio, Java, Firebase, Flutter, and web technologies.

Well-versed in CI/CD workflows and Git, with comprehensive knowledge of DevOps best practices.

Implemented Argo Rollouts and ArgoCD for Kubernetes deployment.

Provisioned infrastructure using Terraform, adhering to best practices for efficient and scalable deployment.

Experienced in leading successful migrations, optimizing deployments, and contributing to ongoing feature development. 

A dynamic individual, always ready for new experiences and adept at meeting challenges head-on.



SKILLS

Languages: C++, Python, Java, SQL (MySQL)

Frameworks: Flutter, Nodejs, MongoDB, Android(Java)

Platforms: Firebase, AWS

Tools: Git & Github


In [15]:
# test_imports.py
try:
    from transformers import AutoModel, AutoTokenizer
    from sentencepiece import SentencePieceProcessor
    print("Imports successful")
except Exception as e:
    print(f"Error: {e}")

Error: No module named 'sentencepiece'


In [16]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


In [18]:
text_splitter = SemanticChunker(embeddings)

In [91]:
import sys
!{sys.executable} -m pip install --user numpy




In [19]:
from transformers import AutoModel