📘 LangChain Learning Repository
A structured, practical learning journey covering LangChain, LLMs, data ingestion, text processing, embeddings, vector databases, and the fundamentals required to build modern AI applications such as RAG systems, chatbots, and summarizers.
This repo contains hands-on notebooks, examples, and notes from my personal study and practice with LangChain.
📂 Repository Structure
Each folder represents a key stage of the LLM pipeline.
01 — Data Ingestion
Learn how to load and extract data from multiple sources:
PDF ingestion
Text documents
Web pages
YouTube transcripts
Directory loaders
Preprocessing raw text
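Whatever the source, LangChain loaders all return `Document` objects: the raw text plus metadata about where it came from. As a library-free illustration of that shape, here is a minimal directory-loader sketch in pure Python; the `Document` class and `load_directory` function below are simplified stand-ins, not the real LangChain API (which provides `DirectoryLoader`, `TextLoader`, etc. in `langchain_community.document_loaders`):

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Mimics the shape of LangChain's Document: raw text plus source metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_directory(root: str, pattern: str = "*.txt") -> list[Document]:
    """Load every file matching `pattern` under `root` into Document objects."""
    return [
        Document(
            page_content=path.read_text(encoding="utf-8"),
            metadata={"source": str(path)},
        )
        for path in sorted(Path(root).glob(pattern))
    ]
```

Keeping the source path in `metadata` is what later lets a RAG system cite where an answer came from.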
02 — Data Transformation
Transform raw text into usable chunks:
Text cleaning
RecursiveCharacterTextSplitter
Character splitter
HTML/JSON splitters
Understanding chunk size, overlap, and tokenization
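The key knobs on any splitter are `chunk_size` and `chunk_overlap`. A pure-Python sketch of fixed-size character splitting with overlap (a simplification of LangChain's `CharacterTextSplitter`; the real splitters also respect separators and token counts):

```python
def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Cut `text` into chunks of at most `chunk_size` characters.

    Each chunk repeats the last `chunk_overlap` characters of the
    previous one, so context straddling a boundary is not lost.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Note that the final chunks may be shorter than `chunk_size`; that is expected at the end of a document.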
03 — Embeddings
Convert text into vector form for semantic tasks:
HuggingFace embeddings
Ollama embeddings
Sentence Transformers
Embedding dimensions & similarity metrics
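Similarity between embeddings is most often measured with cosine similarity, i.e. the cosine of the angle between two vectors. A minimal pure-Python version (real pipelines use NumPy or the vector store's built-in metric, but the math is the same):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 = same direction, 0.0 = orthogonal (unrelated), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Because the norms cancel out scale, two vectors pointing the same way score 1.0 regardless of their magnitudes.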
04 — Vector Databases
Store and retrieve embeddings for semantic search:
Chroma DB
FAISS Index
Adding documents
Running queries
Building the foundation of a RAG system
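Under the hood, a vector store keeps (embedding, text) pairs and answers a query by ranking the stored vectors against the query vector. The brute-force sketch below shows the idea in pure Python; Chroma and FAISS do the same job with indexes that scale to millions of vectors, and the class name here is illustrative, not part of either library:

```python
import math


class ToyVectorStore:
    """Brute-force stand-in for Chroma/FAISS: stores (vector, text) pairs
    and returns the k texts whose vectors are most similar to the query."""

    def __init__(self) -> None:
        self._entries: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._entries.append((vector, text))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (
            math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        )

    def query(self, vector: list[float], k: int = 1) -> list[str]:
        # Rank every stored entry by similarity to the query; an index
        # like FAISS avoids this full scan, but returns the same result.
        ranked = sorted(
            self._entries, key=lambda e: self._cosine(vector, e[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]
```

In a real RAG pipeline the query vector comes from the same embedding model as the stored documents; mixing models breaks the similarity comparison.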
🎯 Purpose of This Repository
This repo is a learning-oriented notebook collection, meant to demonstrate mastery of:
RAG Foundations
LangChain Components
Vector Databases
Semantic Search
Data Pipelines
Text Processing for AI
Embedding-based retrieval
It is not a final project, but a structured knowledge base documenting the concepts learned along the way.
🛠 Technologies Used
LangChain
LangChain Community Tools
LangChain Groq
HuggingFace Embeddings
ChromaDB
FAISS
PyMuPDF (PDF extraction)
BeautifulSoup (HTML parsing)
Sentence Transformers
Python Dotenv
📦 Installation
Create a Python environment (optional):
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
Install dependencies:
pip install -r requirements.txt
🚀 How to Use
Open any notebook using Jupyter or VS Code:
jupyter notebook
or simply use VS Code's built-in notebook support.
Explore concepts in order:
01 → 02 → 03 → 04
This progresses from data ingestion → splitting → embeddings → vector search.
📚 What I Learned
How to structure data for LLMs
How chunking affects accuracy
How embeddings work behind the scenes
How vector databases perform similarity search
Why RAG pipelines depend on clean chunking
The power of LangChain’s modular components
How different loaders and splitters work
⭐ Future Additions
RAG mini-project
Chatbot project
Semantic search example
Document QA pipeline using Groq models
Updated examples with latest LangChain versions
👨‍💻 Author
Shehjad Patel
AI Developer | Python | LangChain | LLM Engineering