# HSN Code Classification System using RAG and Knowledge Graphs

## AI/ML Engineering Course Assignment

**Author:** Santosh Kumar  
**Date:** January 8, 2026

---

### System Overview

This notebook implements an intelligent HSN (Harmonized System Nomenclature) Code Classification System combining:
- Knowledge Graph construction for hierarchical relationships
- RAG (Retrieval-Augmented Generation) for semantic search
- Vector embeddings for similarity matching
- Natural language query processing

### Architecture

```
User Query → Query Processor → Vector Search → Knowledge Graph → Disambiguation → Result
```

---
## Section 1: Introduction and Setup
---

In [None]:
import pandas as pd
import numpy as np
import json
import warnings
from typing import List, Dict, Tuple, Optional
from collections import defaultdict
import re
import pdfplumber

warnings.filterwarnings('ignore')

print("✓ Core libraries loaded")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from matplotlib.patches import FancyBboxPatch

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Visualization libraries loaded")

In [None]:
from sentence_transformers import SentenceTransformer
import faiss

print("✓ ML libraries loaded")

### Load and Explore Dataset from PDF

In [None]:
def extract_hsn_data_from_pdf(pdf_path: str) -> pd.DataFrame:
    """
    Extract HSN data from PDF using pdfplumber for table extraction
    """
    print(f"Extracting data from PDF: {pdf_path}...")
    
    all_tables = []
    
    with pdfplumber.open(pdf_path) as pdf:
        print(f"Total pages: {len(pdf.pages)}")
        
        for page_num, page in enumerate(pdf.pages, 1):
            tables = page.extract_tables()
            
            if tables:
                for table in tables:
                    if table and len(table) > 0:
                        all_tables.extend(table)
    
    if not all_tables:
        raise ValueError("No tables found in PDF")
    
    # Assume first row is header
    header = all_tables[0]
    data = all_tables[1:]
    
    # Create DataFrame
    df = pd.DataFrame(data, columns=header)
    
    # Clean column names
    df.columns = df.columns.str.strip()
    
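    # Convert HSN Code to numeric; non-numeric cells (e.g. repeated page headers) become NaN
    if 'HSN Code' not in df.columns:
        raise ValueError(f"'HSN Code' column not found; columns are {list(df.columns)}")
    df['HSN Code'] = pd.to_numeric(df['HSN Code'], errors='coerce')
    
    # Remove rows with missing HSN codes, then cast back to int: coercion to NaN
    # makes the column float, and str(40011010.0) would break the 8-digit slicing
    df = df.dropna(subset=['HSN Code']).astype({'HSN Code': 'int64'})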
    
    print(f"✓ Extracted {len(df)} records from PDF")
    
    return df

# Extract data from PDF
df = extract_hsn_data_from_pdf('hsn_data.pdf')

print(f"\nDataset Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData Types:")
print(df.dtypes)
print(f"\nMissing Values:")
print(df.isnull().sum())
print(f"\nFirst 3 records:")
df.head(3)
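
When a table runs over several pages, the header row may come back once per page. The extraction above relies on numeric coercion to discard those repeats. A minimal sketch of handling them explicitly, using plain Python lists in the shape `extract_tables()` returns (a list of tables, each a list of rows). `merge_page_tables` is a hypothetical helper, not part of pdfplumber, and the sample rows are made up for illustration.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]


def merge_page_tables(tables: List[List[Row]]) -> Tuple[List[str], List[List[str]]]:
    """Merge per-page tables into one header and a list of data rows."""
    header: List[str] = []
    rows: List[List[str]] = []
    for table in tables:
        for raw in table:
            # Empty cells can arrive as None; normalise to stripped strings.
            cells = [(cell or "").strip() for cell in raw]
            if not any(cells):
                continue  # skip fully blank rows
            if not header:
                header = cells  # first non-blank row is taken as the header
            elif cells == header:
                continue  # header repeated on a later page
            else:
                rows.append(cells)
    return header, rows


# Illustrative data only: two "pages", the second repeating the header.
pages = [
    [["HSN Code", "Description"], ["40011010", "Prevulcanised latex"]],
    [["HSN Code ", "Description"], [None, None], ["40011020", "Other latex"]],
]
header, rows = merge_page_tables(pages)
print(header)
print(rows)
```

Keeping the codes as strings at this stage also preserves leading zeros until they are deliberately converted.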

In [None]:
print("Dataset Statistics:\n")
print(f"Total HSN Codes: {df['HSN Code'].nunique()}")
print(f"Total Chapters: {df['ChapterNumber'].nunique()}")
print(f"\nChapter Distribution:")
print(df['ChapterNumber'].value_counts())

---
## Section 2: Data Processing and Enhancement
---

### 2.1 Extract Hierarchical Structure

In [None]:
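def extract_hierarchy(hsn_code: int) -> Dict[str, str]:
    # int() guards against float inputs such as 40011010.0, which would otherwise
    # yield a 10-character full_code ending in ".0"
    hsn_str = str(int(hsn_code)).zfill(8)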
    
    return {
        'chapter': hsn_str[:2],
        'heading': hsn_str[:4],
        'subheading': hsn_str[:6],
        'full_code': hsn_str
    }

df['hierarchy'] = df['HSN Code'].apply(extract_hierarchy)
df['chapter'] = df['hierarchy'].apply(lambda x: x['chapter'])
df['heading'] = df['hierarchy'].apply(lambda x: x['heading'])
df['subheading'] = df['hierarchy'].apply(lambda x: x['subheading'])
df['full_code'] = df['hierarchy'].apply(lambda x: x['full_code'])

print("✓ Hierarchy extracted")
df[['HSN Code', 'chapter', 'heading', 'subheading', 'full_code']].head()
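
The hierarchy comes purely from digit prefixes: 2 digits for the chapter, 4 for the heading, 6 for the subheading. A minimal standalone sketch of that slicing, assuming codes may arrive as int, float, or string. `split_hsn_code` is a hypothetical name and the sample values are illustrative only.

```python
from typing import Dict, Union


def split_hsn_code(code: Union[int, float, str]) -> Dict[str, str]:
    """Split an HSN code into chapter, heading, subheading and full code."""
    # Go through int() so float values such as 40011010.0 lose the ".0",
    # then left-pad so codes from chapters 01-09 keep their leading zero.
    digits = str(int(float(code))).zfill(8)
    if len(digits) != 8:
        raise ValueError(f"expected at most 8 digits, got {digits!r}")
    return {
        "chapter": digits[:2],
        "heading": digits[:4],
        "subheading": digits[:6],
        "full_code": digits,
    }


print(split_hsn_code(40011010.0))
print(split_hsn_code(1012100))
```

The second call shows why the zero padding matters: without it a 7-digit integer would be sliced one position off.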

### 2.2 Create Enriched Documents

In [None]:
def create_enriched_document(row: pd.Series) -> Dict:
    hierarchy = extract_hierarchy(row['HSN Code'])
    
    doc = {
        'hsn_code': int(row['HSN Code']),  # plain int keeps json.dumps happy
        'full_code': hierarchy['full_code'],
        'description': row['Description'],
        'trade_status': row['FinalHSN'],
        'hierarchy': {
            'chapter': {
                'code': hierarchy['chapter'],
                'description': row['Chapter_Description']
            },
            'heading': {
                'code': hierarchy['heading'],
                'description': row['Heading_Description']
            },
            'subheading': {
                'code': hierarchy['subheading'],
                'description': row['Subheading_Description']
            },
            'specific': {
                'code': hierarchy['full_code'],
                'description': row['Description']
            }
        },
        'full_context': f"""HSN Code: {hierarchy['full_code']}
Product: {row['Description']}
Chapter {hierarchy['chapter']}: {row['Chapter_Description']}
Heading {hierarchy['heading']}: {row['Heading_Description']}
Subheading {hierarchy['subheading']}: {row['Subheading_Description']}
Trade Status: {row['FinalHSN']}"""
    }
    
    return doc

enriched_documents = [create_enriched_document(row) for _, row in df.iterrows()]

print(f"✓ Created {len(enriched_documents)} enriched documents")
print("\nSample enriched document:")
print(json.dumps(enriched_documents[0], indent=2))

### 2.3 Data Validation

In [None]:
def validate_hsn_data(documents: List[Dict]) -> Dict[str, float]:
    validation_results = {
        'total_documents': len(documents),
        'unique_hsn_codes': len(set(doc['hsn_code'] for doc in documents)),
        'unique_chapters': len(set(doc['hierarchy']['chapter']['code'] for doc in documents)),
        'missing_descriptions': sum(1 for doc in documents if not doc['description']),
        'valid_hierarchy': sum(1 for doc in documents if len(doc['full_code']) == 8)
    }
    
    # Guard against an empty document list to avoid ZeroDivisionError
    total = validation_results['total_documents']
    validation_results['data_quality_score'] = (
        validation_results['valid_hierarchy'] / total * 100 if total else 0.0
    )
    
    return validation_results

validation = validate_hsn_data(enriched_documents)
print("Data Validation Results:\n")
for key, value in validation.items():
    print(f"{key}: {value}")
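
The `valid_hierarchy` check above looks only at string length, so a value such as `4001.101` would still count as valid. A minimal sketch of a stricter check that also reports duplicates. `check_codes` is a hypothetical helper and the input list is made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, List

EIGHT_DIGITS = re.compile(r"\d{8}")


def check_codes(full_codes: List[str]) -> Dict[str, List[str]]:
    """Report malformed and duplicated codes in a list of code strings."""
    # fullmatch requires the whole string to be exactly eight digits.
    malformed = [c for c in full_codes if not EIGHT_DIGITS.fullmatch(c)]
    counts = Counter(full_codes)
    duplicated = sorted(c for c, n in counts.items() if n > 1)
    return {"malformed": malformed, "duplicated": duplicated}


report = check_codes(["40011010", "40011020", "40011010", "4001.101", "40011010.0"])
print(report)
```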

---
## Section 3: Knowledge Graph Construction
---

### 3.1 Build Knowledge Graph

In [None]:
class HSNKnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()
        self.hsn_data = {}
        
    def build_graph(self, documents: List[Dict]):
        for doc in documents:
            hsn_code = doc['full_code']
            hierarchy = doc['hierarchy']
            
            self.hsn_data[hsn_code] = doc
            
            chapter_code = hierarchy['chapter']['code']
            heading_code = hierarchy['heading']['code']
            subheading_code = hierarchy['subheading']['code']
            
            self._add_node(chapter_code, 'chapter', hierarchy['chapter']['description'])
            self._add_node(heading_code, 'heading', hierarchy['heading']['description'])
            self._add_node(subheading_code, 'subheading', hierarchy['subheading']['description'])
            self._add_node(hsn_code, 'specific', doc['description'])
            
            self.graph.add_edge(chapter_code, heading_code, relation='contains')
            self.graph.add_edge(heading_code, subheading_code, relation='contains')
            self.graph.add_edge(subheading_code, hsn_code, relation='contains')
            
    def _add_node(self, code: str, level: str, description: str):
        if code not in self.graph:
            self.graph.add_node(code, level=level, description=description)
            
    def get_hierarchy_path(self, hsn_code: str) -> List[Dict]:
        if hsn_code not in self.hsn_data:
            return []
            
        doc = self.hsn_data[hsn_code]
        hierarchy = doc['hierarchy']
        
        return [
            {'level': 'chapter', 'code': hierarchy['chapter']['code'], 
             'description': hierarchy['chapter']['description']},
            {'level': 'heading', 'code': hierarchy['heading']['code'], 
             'description': hierarchy['heading']['description']},
            {'level': 'subheading', 'code': hierarchy['subheading']['code'], 
             'description': hierarchy['subheading']['description']},
            {'level': 'specific', 'code': hsn_code, 
             'description': doc['description']}
        ]
        
    def get_children(self, code: str) -> List[str]:
        if code not in self.graph:
            return []
        return list(self.graph.successors(code))
        
    def get_siblings(self, hsn_code: str) -> List[str]:
        if hsn_code not in self.graph:
            return []
            
        parents = list(self.graph.predecessors(hsn_code))
        if not parents:
            return []
            
        parent = parents[0]
        siblings = [child for child in self.graph.successors(parent) if child != hsn_code]
        return siblings
        
    def get_statistics(self) -> Dict:
        levels = defaultdict(int)
        for node in self.graph.nodes():
            level = self.graph.nodes[node]['level']
            levels[level] += 1
            
        return {
            'total_nodes': self.graph.number_of_nodes(),
            'total_edges': self.graph.number_of_edges(),
            'nodes_by_level': dict(levels),
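            # Leaves sit at the deepest level, so their ancestor count is the depth;
            # default=0 keeps an empty graph from raising ValueError
            'max_depth': max((len(nx.ancestors(self.graph, node))
                              for node in self.graph.nodes() if self.graph.out_degree(node) == 0),
                             default=0)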
        }

kg = HSNKnowledgeGraph()
kg.build_graph(enriched_documents)

print("✓ Knowledge Graph constructed")
print("\nGraph Statistics:")
stats = kg.get_statistics()
for key, value in stats.items():
    print(f"{key}: {value}")
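
The graph encodes one relation, `contains`, between consecutive prefix levels. A minimal sketch of the same structure with plain dictionaries, assuming 8-digit code strings. The function names and sample codes are illustrative, and this is not a replacement for the NetworkX class above.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_children(codes: List[str]) -> Dict[str, Set[str]]:
    """Map each prefix (chapter, heading, subheading) to its direct children."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for code in codes:
        chapter, heading, subheading = code[:2], code[:4], code[:6]
        children[chapter].add(heading)
        children[heading].add(subheading)
        children[subheading].add(code)
    return children


def siblings(code: str, children: Dict[str, Set[str]]) -> List[str]:
    """Return the other 8-digit codes under the same 6-digit subheading."""
    return sorted(children.get(code[:6], set()) - {code})


codes = ["40011010", "40011020", "40012100", "40021100"]  # illustrative codes
tree = build_children(codes)
print(sorted(tree["40"]))
print(siblings("40011010", tree))
```

Because node identifiers differ in length at every level, a chapter, heading, subheading, and full code can never collide as keys.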

### 3.2 Visualize Knowledge Graph

In [None]:
def visualize_kg_subset(kg: HSNKnowledgeGraph, chapter_code: str = '40', max_nodes: int = 20):
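    # Breadth-first walk from the chapter, taking up to 5 children per node,
    # so headings, subheadings and specific codes all reach the plot
    if chapter_code not in kg.graph:
        print(f"Chapter {chapter_code} not found in knowledge graph")
        return
    
    selected = [chapter_code]
    for node in selected:  # the list grows while iterating, acting as a queue
        if len(selected) >= max_nodes:
            break
        for child in list(kg.graph.successors(node))[:5]:
            if child not in selected and len(selected) < max_nodes:
                selected.append(child)
            
    subgraph = kg.graph.subgraph(selected)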
    
    plt.figure(figsize=(16, 10))
    pos = nx.spring_layout(subgraph, k=2, iterations=50, seed=42)
    
    color_map = {
        'chapter': '#FF6B6B',
        'heading': '#4ECDC4',
        'subheading': '#45B7D1',
        'specific': '#96CEB4'
    }
    
    node_colors = [color_map[subgraph.nodes[node]['level']] for node in subgraph.nodes()]
    node_sizes = [3000 if subgraph.nodes[node]['level'] == 'chapter' else 
                  2000 if subgraph.nodes[node]['level'] == 'heading' else
                  1500 if subgraph.nodes[node]['level'] == 'subheading' else 1000
                  for node in subgraph.nodes()]
    
    nx.draw_networkx_nodes(subgraph, pos, node_color=node_colors, 
                          node_size=node_sizes, alpha=0.9)
    nx.draw_networkx_edges(subgraph, pos, edge_color='gray', 
                          arrows=True, arrowsize=20, width=2, alpha=0.6)
    nx.draw_networkx_labels(subgraph, pos, font_size=8, font_weight='bold')
    
    legend_elements = [plt.Line2D([0], [0], marker='o', color='w', 
                                 markerfacecolor=color, markersize=10, label=level.capitalize())
                      for level, color in color_map.items()]
    plt.legend(handles=legend_elements, loc='upper left', fontsize=10)
    
    plt.title(f'HSN Knowledge Graph - Chapter {chapter_code} Hierarchy', fontsize=16, fontweight='bold')
    plt.axis('off')
    plt.tight_layout()
    plt.show()
    
visualize_kg_subset(kg, '40', max_nodes=25)

### 3.3 Test Knowledge Graph Queries

In [None]:
test_hsn = '40011010'

print(f"Testing Knowledge Graph with HSN: {test_hsn}\n")
print("="*60)

print("\n1. Hierarchy Path:")
hierarchy_path = kg.get_hierarchy_path(test_hsn)
for item in hierarchy_path:
    print(f"  {item['level'].upper()}: {item['code']} - {item['description']}")

print("\n2. Sibling Codes:")
siblings = kg.get_siblings(test_hsn)
for sibling in siblings[:5]:
    if sibling in kg.hsn_data:
        print(f"  {sibling}: {kg.hsn_data[sibling]['description']}")

---
## Section 4: RAG System Implementation
---

### 4.1 Initialize Embedding Model

In [None]:
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"✓ Model loaded: {embedding_model.get_sentence_embedding_dimension()} dimensions")

### 4.2 Create Vector Store

In [None]:
class HSNVectorStore:
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model
        self.index = None
        self.documents = []
        self.embeddings = None
        
    def create_embeddings(self, documents: List[Dict]):
        self.documents = documents
        
        texts = [doc['full_context'] for doc in documents]
        
        print(f"Creating embeddings for {len(texts)} documents...")
        self.embeddings = self.embedding_model.encode(texts, show_progress_bar=True)
        
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(self.embeddings.astype('float32'))
        
        print(f"✓ Vector store created with {self.index.ntotal} vectors")
        
    def search(self, query: str, top_k: int = 5) -> List[Tuple[Dict, float]]:
        query_embedding = self.embedding_model.encode([query])
        
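        # IndexFlatL2 returns squared L2 distances; an index of -1 marks an
        # empty slot when top_k exceeds the number of indexed vectors
        distances, indices = self.index.search(query_embedding.astype('float32'), top_k)
        
        results = []
        for idx, distance in zip(indices[0], distances[0]):
            if 0 <= idx < len(self.documents):
                similarity_score = float(1 / (1 + distance))
                results.append((self.documents[idx], similarity_score))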
                
        return results
    
    def get_statistics(self) -> Dict:
        return {
            'total_vectors': self.index.ntotal if self.index else 0,
            'embedding_dimension': self.embeddings.shape[1] if self.embeddings is not None else 0,
            'total_documents': len(self.documents)
        }

vector_store = HSNVectorStore(embedding_model)
vector_store.create_embeddings(enriched_documents)

print("\nVector Store Statistics:")
vs_stats = vector_store.get_statistics()
for key, value in vs_stats.items():
    print(f"{key}: {value}")

### 4.3 Test Vector Search

In [None]:
test_query = "natural rubber latex prevulcanised"
print(f"Test Query: '{test_query}'\n")
print("="*60)

results = vector_store.search(test_query, top_k=3)

for i, (doc, score) in enumerate(results, 1):
    print(f"\nResult {i} (Similarity: {score:.4f}):")
    print(f"HSN Code: {doc['full_code']}")
    print(f"Description: {doc['description']}")
    print(f"Chapter: {doc['hierarchy']['chapter']['description']}")
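
The similarity score printed above is `1 / (1 + distance)`, where `IndexFlatL2` reports the squared Euclidean distance. For unit-length vectors that distance equals `2 - 2 * cosine`, so the score can be read as a cosine similarity. A minimal sketch of the conversion, assuming the embeddings are unit-normalised (for example by passing `normalize_embeddings=True` to `encode`). The function names are hypothetical.

```python
def l2sq_from_cosine(cosine: float) -> float:
    """Squared L2 distance between two unit vectors with the given cosine."""
    return 2.0 - 2.0 * cosine


def score_from_distance(distance: float) -> float:
    """The notebook's mapping from distance to a score in (0, 1]."""
    return 1.0 / (1.0 + distance)


def cosine_from_score(score: float) -> float:
    """Invert both steps: score -> squared distance -> cosine."""
    distance = 1.0 / score - 1.0
    return 1.0 - distance / 2.0


for cosine in (1.0, 0.5, 0.0, -1.0):
    score = score_from_distance(l2sq_from_cosine(cosine))
    print(f"cosine={cosine:+.1f} -> score={score:.3f}")

print(f"score 0.3 -> cosine {cosine_from_score(0.3):.3f}")
```

Under this assumption the score cannot drop below 0.2, and a score of 0.3 corresponds to a slightly negative cosine. A cut-off near 0.3 would therefore admit almost every candidate, which is worth keeping in mind when reading the confidence values in later sections.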

---
## Section 5: Intelligent Query Processing
---

### 5.1 Query Processor with Disambiguation

In [None]:
class HSNQueryProcessor:
    def __init__(self, vector_store: HSNVectorStore, knowledge_graph: HSNKnowledgeGraph):
        self.vector_store = vector_store
        self.kg = knowledge_graph
        self.similarity_threshold = 0.3
        
    def process_query(self, query: str) -> Dict:
        query_lower = query.lower().strip()
        
        hsn_pattern = r'\b(\d{8})\b'
        hsn_match = re.search(hsn_pattern, query)
        if hsn_match:
            return self._handle_direct_hsn_lookup(hsn_match.group(1))
        
        if any(keyword in query_lower for keyword in ['chapter', 'broad', 'category', 'classification']):
            return self._handle_broad_category_query(query)
        
        return self._handle_product_query(query)
    
    def _handle_direct_hsn_lookup(self, hsn_code: str) -> Dict:
        if hsn_code in self.kg.hsn_data:
            doc = self.kg.hsn_data[hsn_code]
            hierarchy_path = self.kg.get_hierarchy_path(hsn_code)
            siblings = self.kg.get_siblings(hsn_code)
            
            return {
                'query_type': 'direct_lookup',
                'status': 'success',
                'result': {
                    'hsn_code': hsn_code,
                    'description': doc['description'],
                    'trade_status': doc['trade_status'],
                    'hierarchy': hierarchy_path,
                    'related_codes': siblings[:5]
                }
            }
        else:
            return {
                'query_type': 'direct_lookup',
                'status': 'not_found',
                'message': f'HSN Code {hsn_code} not found in database'
            }
    
    def _handle_broad_category_query(self, query: str) -> Dict:
        results = self.vector_store.search(query, top_k=10)
        
        chapters = {}
        for doc, score in results:
            chapter_code = doc['hierarchy']['chapter']['code']
            if chapter_code not in chapters:
                chapters[chapter_code] = {
                    'code': chapter_code,
                    'description': doc['hierarchy']['chapter']['description'],
                    'count': 0
                }
            chapters[chapter_code]['count'] += 1
        
        return {
            'query_type': 'broad_category',
            'status': 'needs_refinement',
            'message': 'Multiple chapters found. Please provide more specific product details.',
            'chapters': list(chapters.values())
        }
    
    def _handle_product_query(self, query: str) -> Dict:
        results = self.vector_store.search(query, top_k=10)
        
        high_confidence_results = [(doc, score) for doc, score in results 
                                   if score >= self.similarity_threshold]
        
        if not high_confidence_results:
            return {
                'query_type': 'product_query',
                'status': 'no_match',
                'message': 'No matching HSN codes found. Please refine your query.',
                'suggestions': [doc['description'] for doc, _ in results[:5]]
            }
        
        if len(high_confidence_results) == 1:
            doc, score = high_confidence_results[0]
            return {
                'query_type': 'product_query',
                'status': 'single_match',
                'confidence': score,
                'result': {
                    'hsn_code': doc['full_code'],
                    'description': doc['description'],
                    'trade_status': doc['trade_status'],
                    'hierarchy': self.kg.get_hierarchy_path(doc['full_code'])
                }
            }
        
        return self._disambiguate_results(high_confidence_results, query)
    
    def _disambiguate_results(self, results: List[Tuple[Dict, float]], query: str) -> Dict:
        unique_codes = {}
        for doc, score in results:
            hsn_code = doc['full_code']
            if hsn_code not in unique_codes or score > unique_codes[hsn_code]['score']:
                unique_codes[hsn_code] = {
                    'hsn_code': hsn_code,
                    'description': doc['description'],
                    'score': score,
                    'hierarchy': doc['hierarchy'],
                    'trade_status': doc['trade_status']
                }
        
        options = sorted(unique_codes.values(), key=lambda x: x['score'], reverse=True)[:5]
        
        return {
            'query_type': 'product_query',
            'status': 'disambiguation_needed',
            'message': 'Multiple matching HSN codes found. Please select the most appropriate one:',
            'options': options
        }
    
    def format_response(self, result: Dict) -> str:
        if result['status'] == 'success':
            r = result['result']
            output = f"""\n{'='*70}
HSN CODE DETAILS
{'='*70}

HSN Code: {r['hsn_code']}
Description: {r['description']}
Trade Status: {r['trade_status']}

HIERARCHY:
"""
            for level in r['hierarchy']:
                output += f"  {level['level'].upper()}: {level['code']} - {level['description']}\n"
            
            if r.get('related_codes'):
                output += f"\nRELATED CODES: {', '.join(r['related_codes'][:5])}\n"
            
            return output
        
        elif result['status'] == 'single_match':
            r = result['result']
            output = f"""\n{'='*70}
HSN CODE MATCH (Confidence: {result['confidence']:.2%})
{'='*70}

HSN Code: {r['hsn_code']}
Description: {r['description']}
Trade Status: {r['trade_status']}

HIERARCHY:
"""
            for level in r['hierarchy']:
                output += f"  {level['level'].upper()}: {level['code']} - {level['description']}\n"
            
            return output
        
        elif result['status'] == 'disambiguation_needed':
            output = f"""\n{'='*70}
{result['message']}
{'='*70}\n"""
            
            for i, option in enumerate(result['options'], 1):
                output += f"""\nOPTION {i} (Confidence: {option['score']:.2%}):
  HSN Code: {option['hsn_code']}
  Description: {option['description']}
  Chapter: {option['hierarchy']['chapter']['description']}
  Heading: {option['hierarchy']['heading']['description']}
  Trade Status: {option['trade_status']}\n"""
            
            return output
        
        elif result['status'] == 'needs_refinement':
            output = f"""\n{'='*70}
{result['message']}
{'='*70}\n"""
            
            for chapter in result['chapters']:
                output += f"\nChapter {chapter['code']}: {chapter['description']} ({chapter['count']} matches)\n"
            
            return output
        
        else:
            return f"\n{result['message']}\n"

query_processor = HSNQueryProcessor(vector_store, kg)
print("✓ Query processor initialized")
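
The routing in `process_query` is a three-way decision: an 8-digit number wins, then the broad-category keywords, then the product search. A minimal standalone sketch of that decision, mirroring the regex and keyword list above. `route_query` is a hypothetical name, and the last sample query is made up to show a side effect of substring matching.

```python
import re

HSN_PATTERN = re.compile(r"\b(\d{8})\b")
BROAD_KEYWORDS = ("chapter", "broad", "category", "classification")


def route_query(query: str) -> str:
    """Return the name of the handler a query would be sent to."""
    if HSN_PATTERN.search(query):
        return "direct_lookup"
    lowered = query.lower()
    # Substring test: "broad" also matches inside longer words.
    if any(keyword in lowered for keyword in BROAD_KEYWORDS):
        return "broad_category"
    return "product_query"


samples = [
    "Tell me about HSN 40011010",
    "Rubber products classification",
    "Natural rubber latex",
    "broadcloth fabric",
]
for q in samples:
    print(f"{q!r} -> {route_query(q)}")
```

The final sample is routed as a broad-category query because `broad` occurs inside `broadcloth`. Matching on whole words would be one way to avoid that.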

---
## Section 6: Test Cases and Validation
---

### Test Case 1: Direct Product Query

In [None]:
query1 = "What is the HSN code for natural rubber latex?"
print(f"Query: {query1}")

result1 = query_processor.process_query(query1)
print(query_processor.format_response(result1))

### Test Case 2: Specific Product Type

In [None]:
query2 = "HSN code for prevulcanised rubber"
print(f"Query: {query2}")

result2 = query_processor.process_query(query2)
print(query_processor.format_response(result2))

### Test Case 3: Broad Category Query

In [None]:
query3 = "Rubber products classification"
print(f"Query: {query3}")

result3 = query_processor.process_query(query3)
print(query_processor.format_response(result3))

### Test Case 4: Similar Products Disambiguation

In [None]:
query4 = "Natural rubber latex"
print(f"Query: {query4}")

result4 = query_processor.process_query(query4)
print(query_processor.format_response(result4))
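
When several hits pass the threshold, the disambiguation step keeps the best score per HSN code and returns the top few as options. A minimal sketch of that step on `(code, score)` pairs. `best_per_code` is a hypothetical helper and the scores are made up for illustration.

```python
from typing import Dict, List, Tuple


def best_per_code(hits: List[Tuple[str, float]], limit: int = 5) -> List[Tuple[str, float]]:
    """Keep the highest score per code, then return the top `limit` codes."""
    best: Dict[str, float] = {}
    for code, score in hits:
        if score > best.get(code, float("-inf")):
            best[code] = score
    # Highest score first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


hits = [("40011010", 0.61), ("40011020", 0.58), ("40011010", 0.64), ("40012100", 0.40)]
top = best_per_code(hits, limit=2)
print(top)
```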

### Test Case 5: Direct HSN Lookup

In [None]:
query5 = "Tell me about HSN 40011010"
print(f"Query: {query5}")

result5 = query_processor.process_query(query5)
print(query_processor.format_response(result5))

### Additional Test Cases

In [None]:
additional_queries = [
    "conveyor belts",
    "synthetic rubber latex",
    "transmission belts for machinery",
    "vulcanised rubber thread",
    "reclaimed rubber"
]

print("Additional Test Cases:\n")
print("="*70)

for query in additional_queries:
    print(f"\n\nQuery: '{query}'")
    result = query_processor.process_query(query)
    print(query_processor.format_response(result))

---
## Section 7: Performance Metrics and Analysis
---

In [None]:
import time

def benchmark_system(query_processor, test_queries: List[str], num_runs: int = 10):
    results = {
        'query_times': [],
        'disambiguation_rate': 0,
        'direct_match_rate': 0
    }
    
    status_counts = defaultdict(int)
    
    for query in test_queries:
        # Time each query num_runs times; perf_counter is a monotonic,
        # high-resolution clock suited to short intervals
        for _ in range(max(1, num_runs)):
            start_time = time.perf_counter()
            result = query_processor.process_query(query)
            end_time = time.perf_counter()
            results['query_times'].append(end_time - start_time)
        
        # Count the status once per query so the rates stay per-query
        status_counts[result['status']] += 1
    
    total_queries = len(test_queries)
    results['disambiguation_rate'] = status_counts['disambiguation_needed'] / total_queries
    results['direct_match_rate'] = (status_counts['success'] + status_counts['single_match']) / total_queries
    results['avg_query_time'] = np.mean(results['query_times'])
    results['status_distribution'] = dict(status_counts)
    
    return results

test_queries = [
    "What is the HSN code for natural rubber latex?",
    "HSN code for prevulcanised rubber",
    "Rubber products classification",
    "Natural rubber latex",
    "Tell me about HSN 40011010",
    "conveyor belts",
    "synthetic rubber",
    "transmission belts"
]

benchmark_results = benchmark_system(query_processor, test_queries)

print("System Performance Metrics:\n")
print("="*70)
print(f"Average Query Time: {benchmark_results['avg_query_time']:.4f} seconds")
print(f"Direct Match Rate: {benchmark_results['direct_match_rate']:.2%}")
print(f"Disambiguation Rate: {benchmark_results['disambiguation_rate']:.2%}")
print(f"\nStatus Distribution:")
for status, count in benchmark_results['status_distribution'].items():
    print(f"  {status}: {count} ({count/len(test_queries):.2%})")
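
A single timing per query is noisy, and the mean is pulled up by the first call, which may include one-off warm-up cost. A minimal sketch of repeated timing with a median, using only the standard library. `time_queries` is a hypothetical helper, and the lambda is a stand-in for `query_processor.process_query` so the sketch runs on its own.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_queries(handler: Callable[[str], object], queries: List[str],
                 num_runs: int = 10) -> Dict[str, float]:
    """Time `handler` on every query `num_runs` times and summarise."""
    samples: List[float] = []
    for query in queries:
        for _ in range(num_runs):
            start = time.perf_counter()  # monotonic, high-resolution clock
            handler(query)
            samples.append(time.perf_counter() - start)
    return {
        "samples": float(len(samples)),
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


# Stand-in handler; in the notebook this would be query_processor.process_query.
summary = time_queries(lambda q: q.lower().split(), ["conveyor belts", "reclaimed rubber"], num_runs=5)
print(summary)
```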

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].hist(benchmark_results['query_times'], bins=20, color='skyblue', edgecolor='black')
axes[0].set_xlabel('Query Time (seconds)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Query Response Time Distribution')
axes[0].axvline(benchmark_results['avg_query_time'], color='red', linestyle='--', 
                label=f"Avg: {benchmark_results['avg_query_time']:.4f}s")
axes[0].legend()

status_data = benchmark_results['status_distribution']
axes[1].bar(range(len(status_data)), list(status_data.values()), color='coral', edgecolor='black')
axes[1].set_xticks(range(len(status_data)))
axes[1].set_xticklabels(list(status_data.keys()), rotation=45, ha='right')
axes[1].set_xlabel('Query Status')
axes[1].set_ylabel('Count')
axes[1].set_title('Query Status Distribution')

plt.tight_layout()
plt.show()

---
## Section 8: System Limitations and Future Improvements
---

### Current Limitations

1. **Limited Dataset**: Sample dataset contains only Chapter 40 (Rubber products)
2. **Embedding Model**: Using lightweight model (all-MiniLM-L6-v2) for efficiency
3. **No LLM Integration**: Classification based purely on embeddings without generative AI
4. **Static Threshold**: Fixed similarity threshold may not work for all product types
5. **No User Feedback Loop**: System doesn't learn from user corrections

### Future Improvements

1. **Expand Dataset**: Include all HSN chapters for comprehensive coverage
2. **Advanced Embeddings**: Use domain-specific or larger embedding models
3. **LLM Integration**: Add GPT/Claude for natural language understanding and generation
4. **Dynamic Thresholds**: Implement adaptive similarity thresholds per category
5. **Active Learning**: Incorporate user feedback to improve classification accuracy
6. **Multi-language Support**: Add support for product descriptions in multiple languages
7. **Image Integration**: Allow product image-based HSN code classification
8. **Regulatory Updates**: Automatic synchronization with HSN code updates
9. **Export Compliance**: Add trade restrictions and compliance checking
10. **API Development**: Build REST API for enterprise integration

### Scalability Considerations

- **Vector Database**: Migrate to Pinecone/Weaviate for production scale
- **Caching**: Implement Redis for frequent query caching
- **Load Balancing**: Distribute requests across multiple instances
- **Monitoring**: Add comprehensive logging and performance monitoring

---
## Section 9: Conclusion
---

### Summary

This notebook successfully implements an HSN Code Classification System combining:

1. **Knowledge Graph**: Hierarchical representation of HSN codes with 4 levels (Chapter → Heading → Subheading → Specific)
2. **Vector Store**: FAISS-based similarity search using sentence transformers
3. **RAG System**: Retrieval stage of a RAG pipeline that returns ranked HSN candidates; no generative model is used yet
4. **Query Processing**: Intelligent disambiguation and multi-strategy query handling

### Key Achievements

✓ Data processing pipeline with hierarchical enhancement  
✓ Knowledge graph construction with NetworkX  
✓ Vector embeddings and semantic search  
✓ Intelligent query disambiguation  
✓ Test queries covering direct lookup, broad category, and disambiguation paths  
✓ Performance benchmarking and visualization  

### Real-World Application

This system can be deployed for:
- Export documentation automation
- Customs declaration assistance
- Trade compliance verification
- E-commerce product categorization
- Supply chain management

### Learning Outcomes

Through this assignment, I gained hands-on experience with:
- Building knowledge graphs from structured data
- Implementing vector similarity search
- Designing intelligent query processing systems
- Handling ambiguous queries with disambiguation
- Performance optimization and benchmarking
- Real-world AI system architecture

---
## References
---

1. **Sentence Transformers**: Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
2. **FAISS**: Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs
3. **NetworkX**: Hagberg, A., Schult, D., & Swart, P. (2008). Exploring network structure, dynamics, and function using NetworkX
4. **HSN Classification**: World Customs Organization - Harmonized System Nomenclature
5. **RAG Systems**: Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

---

**End of Notebook**