## 1. Abstract
- Brief summary
- Objectives:
  - Crawl web pages, extract content.
  - Build an index for search.
  - Enable ranked query results via Flask API.
- Next steps:
  - Planned improvements

## 2. Overview
- Solution Outline: Explain the full workflow (crawl, index, query).
- Relevant Literature: Briefly cite research/articles on web crawling, TF-IDF, cosine similarity, or search systems.
- Proposed System: Highlight architecture and its intended functionality.
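
To make the TF-IDF and cosine-similarity ideas named above concrete, here is a minimal pure-Python sketch. It assumes the textbook weighting `tf * log(N / df)` and whitespace tokenization; scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed and normalized variant, so the numbers will differ. The sample sentences are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up documents: the first two share "happy", the first and third share nothing.
docs = ["better to be happy", "happy dogs chase cats", "cats sleep all day"]
vecs = tf_idf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
```

Documents that share a term get a positive similarity, while documents with no terms in common score exactly zero.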

## 3. Design
- System Capabilities:
  - Max pages/depth crawling
  - TF-IDF index construction
  - Top-K query results
- Interactions: How crawler, indexer, and query processor communicate.
- Integration: File formats (JSON index, CSV queries), data flow.
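
As a rough illustration of the CSV-queries side of the data flow, the sketch below parses a query file into term lists. The column names `query_id` and `query` are assumptions, since the outline does not fix a CSV layout, and an in-memory string stands in for a real file.

```python
import csv
import io
import json

def read_queries(csv_file):
    """Yield (query_id, list_of_terms) from a CSV with 'query_id' and 'query' columns."""
    for row in csv.DictReader(csv_file):
        terms = row["query"].lower().split()
        if terms:  # skip rows whose query is empty
            yield row["query_id"], terms

# In-memory stand-in for a queries.csv file (hypothetical layout and content).
sample_csv = io.StringIO("query_id,query\nq1,better to be\nq2,\nq3,Love Life\n")
queries = list(read_queries(sample_csv))

# A JSON-serializable view of the parsed queries, e.g. for logging or an API response.
payload = json.dumps({qid: terms for qid, terms in queries})
```

Each parsed term list can then be handed to the query processor in the same form as the `query` list used in the search cell.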

## 4. Architecture
- Software Components:
  - Scrapy spider
  - Scikit-Learn indexer
  - Flask query processor
- Interfaces: APIs, endpoints, command-line interfaces.
- Implementation Notes: Python version, libraries (Scrapy, sklearn, Flask)
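
The notebook cells below do not yet include the Flask query processor, so the following is only a sketch of what such an endpoint might look like. The route `/search`, the parameters `q` and `k`, and the `rank` callable are all assumptions made for illustration; a stand-in ranker replaces the real TF-IDF search so the example runs on its own.

```python
from flask import Flask, jsonify, request

def create_app(rank):
    """Build a Flask app around a ranking callable rank(terms, top_k) -> list."""
    app = Flask(__name__)

    @app.route("/search")
    def search_endpoint():
        query = request.args.get("q", "").strip()
        if not query:
            return jsonify({"error": "missing query parameter 'q'"}), 400
        # Falls back to 5 when 'k' is absent or not an integer.
        top_k = request.args.get("k", default=5, type=int)
        hits = rank(query.lower().split(), top_k)
        return jsonify({"query": query, "results": hits})

    return app

# Stand-in ranker so the sketch runs without an index (hypothetical data).
def fake_rank(terms, top_k):
    return [{"doc_id": "doc-1", "score": 0.5}][:top_k]

app = create_app(fake_rank)
```

Passing the ranker in as an argument keeps the HTTP layer separate from the index, which makes the endpoint straightforward to exercise with Flask's test client.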

## 5. Operation
- Installation Instructions: Dependencies, environment setup, pip/venv usage.
- Software Commands:
  - Running crawler
  - Building the index
  - Running Flask server and querying
- Inputs: Seed URL, CSV queries, configuration options.
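
One possible way to expose the three commands and their inputs is a small `argparse` front end. The sub-command names and flags below are hypothetical; the defaults mirror the values used in the crawler and search cells (100 pages, depth 5, top 5 results).

```python
import argparse

def build_parser():
    """Create a parser with one sub-command per pipeline stage."""
    parser = argparse.ArgumentParser(description="Crawl, index, and query a small site.")
    sub = parser.add_subparsers(dest="command", required=True)

    crawl = sub.add_parser("crawl", help="download pages starting from a seed URL")
    crawl.add_argument("seed_url")
    crawl.add_argument("--max-pages", type=int, default=100)
    crawl.add_argument("--max-depth", type=int, default=5)

    sub.add_parser("index", help="build the index from the saved pages")

    query = sub.add_parser("query", help="rank documents for each query in a CSV file")
    query.add_argument("queries_csv")
    query.add_argument("--top-k", type=int, default=5)
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["crawl", "https://quotes.toscrape.com", "--max-pages", "20"])
```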

## 6. Conclusion
- Success/Failure:
- Outputs:

## 7. Data Sources
- Links to websites crawled (if allowed).

## 8. Test Cases
- Framework: Pytest, unittest, or custom scripts.
- Harness: How to test crawler, indexer, query processor.
- Coverage: Edge cases like empty pages, invalid URLs, long queries, or spelling errors.
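
A few of the edge cases listed above can be written as pytest-style tests. The sketch below targets a standalone `tokenize` helper (a hypothetical name) that uses the same regular expression as the indexing cell; the functions use plain `assert`, so they also run without pytest.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters, as the indexer does."""
    return re.findall(r"\b[a-z]+\b", text.lower())

def test_empty_page_yields_no_tokens():
    assert tokenize("") == []

def test_punctuation_and_case_are_normalised():
    assert tokenize("Better, to BE!") == ["better", "to", "be"]

def test_tokens_touching_digits_are_dropped():
    # "py3" has no word boundary between "py" and "3", so the pattern skips it.
    assert tokenize("py3 rocks") == ["rocks"]

def test_long_query_keeps_every_term():
    assert len(tokenize("word " * 1000)) == 1000
```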

## 9. Source Code
- https://docs.scrapy.org/en/latest/intro/tutorial.html

## 10. Bibliography

In [8]:
import nest_asyncio
nest_asyncio.apply()

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
import os
import uuid

### Step 1: Web Crawling

In [None]:
os.makedirs("pages", exist_ok=True)

class NotebookCrawler(scrapy.Spider):
    name = "notebook_crawler"
    
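    def __init__(self, seed_url, allowed_domain, max_pages, max_depth, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [seed_url]
        self.allowed_domains = [allowed_domain]
        self.max_pages = max_pages
        self.max_depth = max_depth
        self.visited = set()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Assigning custom_settings inside __init__ has no effect, because Scrapy
        # reads it from the class before the spider is instantiated. Apply the
        # argument-dependent settings here instead (supported since Scrapy 2.11).
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.settings.set('DEPTH_LIMIT', spider.max_depth, priority='spider')
        spider.settings.set('AUTOTHROTTLE_ENABLED', True, priority='spider')
        spider.settings.set('LOG_ENABLED', True, priority='spider')
        spider.settings.set('CLOSESPIDER_PAGECOUNT', spider.max_pages, priority='spider')
        return spider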
    
    def parse(self, response):
        # Stop if max pages reached
        if len(self.visited) >= self.max_pages:
            self.logger.info(f"Reached max pages limit: {self.max_pages}")
            return
        
        # Generate UUID as the complete filename
        page_uuid = str(uuid.uuid4())
        filename = f"{page_uuid}.html"
        
        # Save HTML content
        with open(f"pages/{filename}", "w", encoding="utf-8") as f:
            f.write(response.text)
        
        self.visited.add(response.url)
        self.logger.info(f"Saved page {len(self.visited)}/{self.max_pages}: {filename}")
        
        # Extract and follow links
        if len(self.visited) < self.max_pages:
            links = LinkExtractor(allow_domains=self.allowed_domains).extract_links(response)
            for link in links:
                if link.url not in self.visited:
                    yield response.follow(link.url, self.parse)

# Usage with configurable parameters
process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',
})

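process.crawl(
    NotebookCrawler,
    seed_url='https://quotes.toscrape.com',
    allowed_domain='quotes.toscrape.com',
    max_pages=100,
    max_depth=5
)
try:
    process.start()
except Exception as exc:
    # Re-running this cell in the same kernel typically fails because the
    # Twisted reactor cannot be restarted; report the error instead of hiding it.
    print(f"Crawler did not start: {exc!r}")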

2025-11-23 17:43:55 [scrapy.utils.log] INFO: Scrapy 2.13.3 started (bot: scrapybot)
2025-11-23 17:43:55 [scrapy.utils.log] INFO: Versions:
{'lxml': '6.0.2',
 'libxml2': '2.11.9',
 'cssselect': '1.3.0',
 'parsel': '1.10.0',
 'w3lib': '2.3.1',
 'Twisted': '25.5.0',
 'Python': '3.14.0 (tags/v3.14.0:ebf955d, Oct  7 2025, 10:15:03) [MSC v.1944 '
           '64 bit (AMD64)]',
 'pyOpenSSL': '25.3.0 (OpenSSL 3.5.4 30 Sep 2025)',
 'cryptography': '46.0.3',
 'Platform': 'Windows-11-10.0.26200-SP0'}
2025-11-23 17:43:55 [scrapy.addons] INFO: Enabled addons:
[]
2025-11-23 17:43:55 [scrapy.extensions.telnet] INFO: Telnet Password: e50700739db06670
2025-11-23 17:43:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2025-11-23 17:43:55 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 'INFO'}
2025-11-23 17:43:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['s

2025-11-23 17:43:55 [scrapy.core.engine] INFO: Spider opened
2025-11-23 17:43:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-11-23 17:43:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-11-23 17:43:55 [notebook_crawler] INFO: Saved page 1/100: 356811ed-d500-4aa1-9fcc-6f844fbd1b04.html
2025-11-23 17:43:56 [notebook_crawler] INFO: Saved page 2/100: 03533d63-e885-4bdf-8e74-b63882da4dd8.html
2025-11-23 17:43:56 [notebook_crawler] INFO: Saved page 3/100: 86a78166-6e2c-4cdd-a
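
Because pages are saved under random UUID filenames, the link between a saved file and its source URL is lost once the crawl ends. One possible remedy, sketched below, is to record a `doc_id -> URL` manifest while saving. The helper names and the `manifest.json` filename are hypothetical and are not part of the crawler above; the demonstration writes to a temporary directory.

```python
import json
import tempfile
import uuid
from pathlib import Path

def save_page(pages_dir, url, html, manifest):
    """Write html under a UUID filename and remember which URL it came from."""
    doc_id = str(uuid.uuid4())
    (Path(pages_dir) / f"{doc_id}.html").write_text(html, encoding="utf-8")
    manifest[doc_id] = url
    return doc_id

def write_manifest(pages_dir, manifest):
    """Persist the doc_id -> URL mapping next to the saved pages."""
    path = Path(pages_dir) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path

# Demonstration in a temporary directory so nothing touches the real pages/ folder.
with tempfile.TemporaryDirectory() as tmp:
    manifest = {}
    doc_id = save_page(tmp, "https://quotes.toscrape.com", "<html></html>", manifest)
    manifest_path = write_manifest(tmp, manifest)
    loaded = json.loads(manifest_path.read_text(encoding="utf-8"))
```

With such a manifest, ranked results could be shown as URLs rather than bare UUIDs.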

### Step 2: Indexing

In [25]:
from bs4 import BeautifulSoup
import os

# Create folder for cleaned text files
os.makedirs("cleaned_text", exist_ok=True)

def html_to_text(html_file):
    with open(html_file, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
        
    for script in soup(["script", "style"]):
        script.decompose()
    
    text = soup.get_text()

    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = ' '.join(chunk for chunk in chunks if chunk)

    return text

pages_dir = "pages"
documents = {}

for filename in os.listdir(pages_dir):
    if filename.endswith('.html'):
        filepath = os.path.join(pages_dir, filename)
        clean_text = html_to_text(filepath)
        documents[filename] = clean_text
        
        text_filename = filename.replace('.html', '.txt')
        with open(f"cleaned_text/{text_filename}", 'w', encoding='utf-8') as f:
            f.write(clean_text)
        
        print(f"Processed: {filename} ({len(clean_text)} characters)")

Processed: 0151dabc-1519-4c16-89c2-7e01dbef2723.html (352 characters)
Processed: 0200b28a-d172-4eaf-bcb8-f6cbefa3d371.html (295 characters)
Processed: 03533d63-e885-4bdf-8e74-b63882da4dd8.html (465 characters)
Processed: 05d98930-8c67-4077-a16b-230742c1a196.html (465 characters)
Processed: 0bb881a2-c200-43e5-abcf-e202b6f564fd.html (2525 characters)
Processed: 0da8c278-15cf-49ba-893f-4f01d48c736e.html (564 characters)
Processed: 0dae95d3-3135-4caa-b44f-ade7cc014e0b.html (1936 characters)
Processed: 11c964eb-994d-4584-9666-e8cc7088125f.html (373 characters)
Processed: 13252d39-d3c6-4918-bb23-718b1f3578bc.html (413 characters)
Processed: 199fd625-5091-4809-bdef-684692e18622.html (1673 characters)
Processed: 19a234b6-7df4-49e1-b6bd-364fd0d6ea23.html (401 characters)
Processed: 1c550459-5f94-403a-a234-2cea3d84b090.html (385 characters)
Processed: 21109878-e79f-4b27-9f51-b7e70480a36b.html (3967 characters)
Processed: 2343ecfb-cea0-403b-883d-2809b1ce179e.html (1936 characters)
Processed: 28cb

In [None]:
import json
import re
from collections import defaultdict

def tokenize_with_positions(text):
    text = text.lower()
    tokens = re.findall(r'\b[a-z]+\b', text)
    
    token_positions = defaultdict(list)
    for pos, token in enumerate(tokens):
        token_positions[token].append(pos)
    
    return dict(token_positions)

def build_positional_inverted_index(documents):
    inverted_index = defaultdict(list)
    
    for doc_id, text in documents.items():
        clean_doc_id = doc_id.replace('.html', '')
        
        token_positions = tokenize_with_positions(text)
        
        for token, positions in token_positions.items():
            inverted_index[token].append([clean_doc_id, positions])
    
    return dict(inverted_index)

inverted_index = build_positional_inverted_index(documents)

# Save to JSON file with custom formatting
with open('index.json', 'w', encoding='utf-8') as f:
    f.write('{\n')
    items = list(inverted_index.items())
    for i, (token, entries) in enumerate(items):
        f.write(f'  "{token}": [\n')
        for j, entry in enumerate(entries):
            entry_json = json.dumps(entry)
            if j < len(entries) - 1:
                f.write(f'    {entry_json},\n')
            else:
                f.write(f'    {entry_json}\n')
        if i < len(items) - 1:
            f.write('  ],\n')
        else:
            f.write('  ]\n')
    f.write('}\n')

print("Inverted index saved to index.json")
print(f"Total unique tokens: {len(inverted_index)}")


Inverted index saved to index.json
Total unique tokens: 2586
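
Storing positions makes phrase queries possible, which a plain bag-of-words index cannot answer. The sketch below assumes the same `{token: [[doc_id, positions], ...]}` layout that the cell above writes to `index.json`; the function name `phrase_matches` and the toy index are made up for illustration.

```python
def phrase_matches(index, phrase_terms):
    """Return doc_ids where phrase_terms occur at consecutive positions."""
    if not phrase_terms or any(term not in index for term in phrase_terms):
        return []
    # Re-shape each posting list into {doc_id: set(positions)} for fast lookup.
    postings = [{doc_id: set(pos) for doc_id, pos in index[term]} for term in phrase_terms]
    matches = []
    for doc_id, starts in postings[0].items():
        for start in starts:
            # The k-th term of the phrase must sit exactly k positions after the first.
            if all(start + offset in postings[offset].get(doc_id, ())
                   for offset in range(1, len(phrase_terms))):
                matches.append(doc_id)
                break
    return sorted(matches)

# Tiny hand-made index in the same [doc_id, positions] layout (hypothetical data).
toy_index = {
    "to": [["doc-a", [1, 4]], ["doc-b", [0]]],
    "be": [["doc-a", [2]], ["doc-b", [3]]],
}
hits = phrase_matches(toy_index, ["to", "be"])
```

Here only `doc-a` matches, because "be" directly follows "to" there, while in `doc-b` the two words are three positions apart.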


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import os

def load_document_text(doc_id):
    filepath = f"cleaned_text/{doc_id}.txt"
    if os.path.exists(filepath):
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    return ""

def search(query_terms, top_k):
    doc_id_list = sorted([f.replace('.txt', '') for f in os.listdir('cleaned_text') if f.endswith('.txt')])
    
    corpus = [load_document_text(doc_id) for doc_id in doc_id_list]
    
    vectorizer = TfidfVectorizer(lowercase=True, token_pattern=r'\b[a-z]+\b')
    tfidf_matrix = vectorizer.fit_transform(corpus)
    
    query_string = ' '.join(query_terms).lower()

    query_vector = vectorizer.transform([query_string])

    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

    results = []
    for idx, doc_id in enumerate(doc_id_list):
        score = similarities[idx]

        # Look up each query term's column via the fitted vocabulary;
        # terms absent from the corpus get a weight of 0.0.
        tfidf_weights = {}
        doc_vector = tfidf_matrix[idx]

        for query_term in query_terms:
            query_term_lower = query_term.lower()
            term_idx = vectorizer.vocabulary_.get(query_term_lower)
            if term_idx is None:
                tfidf_weights[query_term_lower] = 0.0
            else:
                tfidf_weights[query_term_lower] = doc_vector[0, term_idx]
        
        results.append((doc_id, score, tfidf_weights))
    
    results.sort(key=lambda x: x[1], reverse=True)
    
    print(f"\nTop {top_k} results for query: {query_terms}\n")
    for rank, (doc_id, score, tfidf_weights) in enumerate(results[:top_k], 1):
        print(f"{rank}. Document: {doc_id}")
        print(f"   Cosine Similarity Score: {score:.4f}")
        print(f"   TF-IDF Weights: {', '.join([f'{term}: {weight:.4f}' for term, weight in tfidf_weights.items()])}")
        print()
    
    return results[:top_k]

# Example query
query = ["better", "to", "be"]
results = search(query, top_k=5)


Top 5 results for query: ['better', 'to', 'be']

1. Document: f75ec424-708b-4640-ac8b-3a640a13a4b6
   Cosine Similarity Score: 0.3420
   TF-IDF Weights: better: 0.1790, to: 0.1715, be: 0.3064

2. Document: 3f86168d-959e-4134-83ac-3adf63083b23
   Cosine Similarity Score: 0.2103
   TF-IDF Weights: better: 0.1006, to: 0.1125, be: 0.2010

3. Document: 6f4b0467-50a1-42c9-bf00-ead1e8651c44
   Cosine Similarity Score: 0.2103
   TF-IDF Weights: better: 0.1006, to: 0.1125, be: 0.2010

4. Document: 9ecbcd87-0af8-471e-a0e2-1168d0a8495c
   Cosine Similarity Score: 0.2103
   TF-IDF Weights: better: 0.1006, to: 0.1125, be: 0.2010

5. Document: ad531183-964e-46d7-bafa-ff9869d87f75
   Cosine Similarity Score: 0.2103
   TF-IDF Weights: better: 0.1006, to: 0.1125, be: 0.2010
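
The `search` function above re-reads the corpus and refits the vectorizer on every call, which is wasteful once queries arrive through an API. A possible alternative, sketched here, fits once and reuses the matrix. The class name `TfidfSearcher` and the toy documents are hypothetical; the vectorizer arguments match those in the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class TfidfSearcher:
    """Fit the vectorizer once, then answer many queries against the same matrix."""

    def __init__(self, documents):
        # documents: {doc_id: text}; sorted so the row order is reproducible.
        self.doc_ids = sorted(documents)
        self.vectorizer = TfidfVectorizer(lowercase=True, token_pattern=r"\b[a-z]+\b")
        self.matrix = self.vectorizer.fit_transform([documents[d] for d in self.doc_ids])

    def search(self, query_terms, top_k=5):
        """Return the top_k (doc_id, cosine score) pairs, best first."""
        query_vector = self.vectorizer.transform([" ".join(query_terms)])
        scores = cosine_similarity(query_vector, self.matrix).flatten()
        best = np.argsort(-scores)[:top_k]
        return [(self.doc_ids[i], float(scores[i])) for i in best]

# Made-up documents standing in for the cleaned_text/ files.
toy_docs = {
    "doc-a": "it is better to be hated for what you are",
    "doc-b": "a day without sunshine is like night",
    "doc-c": "try not to become a man of success",
}
searcher = TfidfSearcher(toy_docs)
top = searcher.search(["better", "to", "be"], top_k=2)
```

In this toy corpus `doc-a` ranks first because it contains all three query terms, and `doc-c` follows because it shares only "to".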

