This project implements a simplified search engine based on Section 23.6 of the textbook, designed to index and search HTML documents from a small website. The implementation uses inverted indexing as the core data structure and TF-IDF (Term Frequency-Inverse Document Frequency) for ranking search results.
Ankit Dhandharia
Course: CS600
Project: Search Engine Implementation
✅ Inverted Index: Efficient term-to-document mapping
✅ NLTK Stopwords: Comprehensive English stopword filtering
✅ TF-IDF Ranking: Relevance-based result ranking
✅ Hyperlink Extraction: Web crawler component
✅ Boundary Condition Handling: Robust error handling
✅ Interactive & Test Modes: Dual operation modes
Description: An inverted index is the fundamental data structure used in search engines. It maps each term (word) to the set of documents that contain it.
Data Structure: defaultdict(set)
- Key: Term (string)
- Value: Set of document filenames containing the term
Example:
{
"security": {"index.html", "network_security.html", "cloud_security.html"},
"encryption": {"cryptography.html", "cloud_security.html"},
"malware": {"malware_analysis.html", "incident_response.html"}
}Algorithm:
- For each HTML document:
- Parse HTML and extract text content
- Tokenize text into individual words
- Filter out stopwords and single characters
- For each remaining term:
- Add the document to that term's set in the inverted index
Time Complexity: O(D × T) where D = number of documents, T = average terms per document
Space Complexity: O(V × D) where V = vocabulary size
Benefits:
- O(1) lookup to find documents containing a term
- Efficient for AND queries (set intersection)
- Scalable for large document collections
Description: TF-IDF measures how important a term is to a document in a collection. It combines term frequency (how often a term appears in a document) with inverse document frequency (how rare the term is across all documents).
Formula:
TF-IDF(term, document) = TF(term, document) × IDF(term)
Where:
TF(term, document) = (count of term in document) / (total terms in document)
IDF(term) = log(total documents / documents containing term)
Implementation:
def calculate_tf_idf(self, term, document):
# Term Frequency (normalized)
term_count = self.term_frequency[term][document]
doc_length = self.document_lengths[document]
tf = term_count / doc_length if doc_length > 0 else 0
# Inverse Document Frequency
total_docs = len(self.documents)
docs_with_term = self.document_frequency[term]
idf = math.log(total_docs / docs_with_term) if docs_with_term > 0 else 0
# TF-IDF Score
return tf * idfRanking Process:
- Find all documents containing ALL query terms (AND semantics)
- For each matching document:
- Calculate TF-IDF score for each query term
- Sum the scores across all query terms
- Sort documents by total score (highest first)
Benefits:
- Gives higher scores to documents where query terms are frequent
- Penalizes common terms that appear in many documents
- Rewards rare, specific terms that are more discriminative
- Better than simple term frequency counting
Term Frequency: defaultdict(Counter)
- Maps each term to a Counter of {document: frequency}
- Used in TF-IDF calculation
- Example:
{"security": {"doc1.html": 5, "doc2.html": 3}}
Document Frequency: Counter
- Counts how many documents contain each term
- Used for IDF calculation
- Example:
{"security": 4, "encryption": 2}
Document Lengths: dict
- Maps each document to its total number of terms (after filtering)
- Used for normalizing term frequency
- Example:
{"index.html": 97, "cryptography.html": 260}
Hyperlinks: defaultdict(set)
- Maps each document to the set of URLs it links to
- Supports web crawler functionality
- Example:
{"index.html": {"cybersecurity_fundamentals.html", "cloud_security.html"}}
- Read HTML files from the
webpages/directory - Use BeautifulSoup to parse HTML and extract text
- Extract hyperlinks (anchor tags) for web crawler component
- Tokenize text using regex:
\b\w+\b - Convert all tokens to lowercase for case-insensitive matching
- Remove stopwords using NLTK's English stopwords corpus
- Stopwords: common words like "the", "and", "is", "are", etc.
- These words appear frequently but provide little semantic value
- Remove single-character tokens
- Keep only alphanumeric words
- Count term occurrences in each document (term frequency)
- Update inverted index: add document to each term's set
- Update document frequency: count documents containing each term
- Store document length for normalization
- Parse and filter the search query (same as documents)
- Find documents containing ALL query terms (AND semantics)
- Use set intersection on inverted index entries
- Calculate TF-IDF scores for matching documents
- Rank results by total TF-IDF score (descending)
- Display results with URLs and scores
- Empty query: Detect and prompt user for input
- Stopwords-only query: Notify user and request specific terms
- Non-existent terms: Report terms not found in any document
- No matching documents: Clear message when no results found
- File errors: Handle missing files and parsing errors gracefully
The project requires the following Python libraries:
pip install beautifulsoup4 nltk- BeautifulSoup4: HTML parsing and text extraction
- NLTK: Natural Language Toolkit for comprehensive stopwords
Project/
├── search_engine.py # Main search engine implementation
├── webpages/ # Directory containing HTML documents
│ ├── input.txt # URL mapping file (filename URL)
│ ├── index.html # Homepage with navigation
│ ├── cybersecurity_fundamentals.html
│ ├── cloud_security.html
│ ├── cryptography.html
│ ├── network_security.html
│ ├── malware_analysis.html
│ ├── incident_response.html
│ └── ethical_hacking.html
├── output.txt # Test output with boundary conditions
└── README.md # This file
The project includes 8 HTML pages on cybersecurity topics:
- index.html: Homepage with navigation to all topics
- cybersecurity_fundamentals.html: Core security principles (CIA triad)
- cloud_security.html: Cloud service models and security challenges
- cryptography.html: Encryption, hashing, digital signatures
- network_security.html: Firewalls, IDS/IPS, network attacks
- malware_analysis.html: Malware types and analysis techniques
- incident_response.html: IR lifecycle and best practices
- ethical_hacking.html: Penetration testing methodologies
Each page contains:
- Rich, relevant content (300-500 words)
- Hyperlinks to related pages (web crawler aspect)
- Proper HTML structure for parsing
Run the search engine in interactive mode to enter queries manually:
python search_engine.pyFeatures:
- Enter search queries and see ranked results
- Type
statsto view search engine statistics - Type
exitorquitto exit
Example Session:
🔍 Mini Search Engine - Interactive Mode
Type 'exit' or 'quit' to exit
Type 'stats' to see statistics
Enter search query: cloud security
Search terms (after filtering): ['cloud', 'security']
'cloud' → found in 2 documents
'security' → found in 6 documents
Found 2 matching document(s)
1. cloud_security.html
Relevance Score (TF-IDF): 0.123456
URL: https://www.coursera.org/articles/cybersecurity-analyst-skills
Links: 7 outgoing hyperlink(s)
2. index.html
Relevance Score (TF-IDF): 0.045678
URL: https://cybersecurity-hub.example.com/
Links: 7 outgoing hyperlink(s)
Run predefined test queries and save output to output.txt:
python search_engine.py --testTest Cases:
- Empty query
- Stopwords-only query
- Non-existent term
- Single-term queries
- Multi-term queries (AND semantics)
- Complex multi-word queries
The output demonstrates:
- Boundary condition handling
- Ranking algorithm effectiveness
- Proper error messages
The webpages/input.txt file maps filenames to URLs:
filename URL
// Example:
index.html https://cybersecurity-hub.example.com/
cybersecurity_fundamentals.html https://www.securitymagazine.com/articles/101473-the-fundamentals-of-cybersecurity-in-the-age-of-ai
cloud_security.html https://www.coursera.org/articles/cybersecurity-analyst-skills
cryptography.html https://cybersecurity-hub.example.com/cryptography
network_security.html https://www.malwarebytes.com/cybersecurity
malware_analysis.html https://www.coursera.org/articles/cybersecurity-analyst-skills
incident_response.html https://www.linkedin.com/pulse/fundamentals-cybersecurity-comprehensive-guide-beginners-mytectra-l8eoc/
ethical_hacking.html https://www.purdueglobal.edu/blog/information-technology/ethical-hacker/
-
Empty Query
- Input:
"" - Expected: Error message requesting search terms
- Input:
-
Stopwords-Only Query
- Input:
"the and is are" - Expected: Message indicating only stopwords, request specific terms
- Input:
-
Non-Existent Term
- Input:
"xyzabc123notfound" - Expected: Message indicating term not found in any document
- Input:
-
Single Term Positive Match
- Input:
"security","encryption","malware" - Expected: Ranked list of matching documents
- Input:
-
Multi-Term AND Query
- Input:
"cloud security","network attack","encryption cryptography" - Expected: Only documents containing ALL terms, ranked by TF-IDF
- Input:
-
Complex Query
- Input:
"security threats protection" - Expected: Documents with all three terms, demonstrating ranking
- Input:
- Run
python search_engine.py --test - Review
output.txtfor all test cases - Verify boundary conditions are handled correctly
- Confirm TF-IDF ranking produces sensible results
- Check that hyperlinks are extracted from all pages
Preprocessing: O(Q) where Q = query length
- Tokenize and filter query terms
Document Retrieval: O(T) where T = number of query terms
- For each term, lookup in inverted index: O(1)
- Set intersection of document sets: O(D) where D = avg docs per term
Ranking: O(M × T) where M = matching documents
- Calculate TF-IDF for each term in each matching document
Sorting: O(M log M)
- Sort matching documents by score
Total: O(Q + T + D + M×T + M log M)
For typical queries (few terms, few matches), this is very efficient.
Inverted Index: O(V × D_avg)
- V = vocabulary size
- D_avg = average documents per term
Term Frequency: O(V × D_avg)
Total: O(V × D) where D = total documents
This is acceptable for small to medium collections. For large scale, additional optimization techniques like compression and sharding would be needed.
- NLTK Stopwords: More comprehensive (179 words) vs hardcoded (~25 words)
- TF-IDF Ranking: Better relevance scoring vs simple term frequency
- Hyperlink Extraction: Supports web crawler functionality
- Better Documentation: Extensive comments and docstrings
- Error Handling: Robust boundary condition handling
- Statistics Display: Useful insights into indexed data
- Dual Modes: Interactive for exploration, test for automated validation
- AND-only queries (no OR or NOT operators)
- No phrase search (e.g., "cloud computing")
- No stemming or lemmatization (e.g., "encrypt" vs "encryption")
- Limited to local HTML files (no actual web crawling)
- Add query operators (OR, NOT, phrase search)
- Implement Porter Stemmer for word normalization
- Add caching for faster repeated queries
- Implement actual web crawler to fetch pages
- Support for proximity search (terms near each other)
- Add search suggestions and auto-complete
- Build web interface for easier interaction
- Course Textbook: Section 23.6 - Search Engine
- NLTK Documentation: https://www.nltk.org/
- BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/
- TF-IDF Algorithm: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- Inverted Index: https://en.wikipedia.org/wiki/Inverted_index
This search engine implementation demonstrates core information retrieval concepts including inverted indexing, stopword filtering, and TF-IDF ranking. The system efficiently searches a small collection of cybersecurity-related web pages, handling various query types and boundary conditions robustly.
The use of well-established data structures (sets, dictionaries, counters) from Python's standard library ensures both efficiency and code clarity. NLTK integration provides industrial-strength stopword filtering, while TF-IDF ranking produces relevance scores that align with user expectations.
For questions or issues, please contact: ankitdhandharia21@gmail.com