## 1. Abstract
- Brief summary
- Objectives:
  - Crawl web pages, extract content.
  - Build an index for search.
  - Enable ranked query results via Flask API.
- Next steps:
  - Improvements

## 2. Overview
- Solution Outline: Explain the full workflow (crawl pages, build the index, serve ranked queries).
- Relevant Literature: Briefly cite research/articles on web crawling, TF-IDF, cosine similarity, or search systems.
- Proposed System: Highlight architecture and its intended functionality.

## 3. Design
- System Capabilities:
  - Max pages/depth crawling
  - TF-IDF index construction
  - Top-K query results
- Interactions: How crawler, indexer, and query processor communicate.
- Integration: File formats (JSON index, CSV queries), data flow.
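
The TF-IDF index and Top-K ranking listed above can be illustrated with a minimal, standard-library-only sketch. It is a stand-in for the Scikit-Learn indexer, not the project's implementation: the function names (`tokenize`, `build_index`, `cosine`, `top_k`), the whitespace tokenizer, and the sample documents are all assumptions made for illustration. The IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real indexer would
    # also strip punctuation and HTML markup.
    return text.lower().split()


def build_index(docs):
    """Return (idf, vectors) for a dict mapping document id -> text."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, idf, vectors, k=3):
    """Return up to k (doc_id, score) pairs with a non-zero score, best first."""
    counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in counts.items()}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [(doc_id, round(score, 3)) for score, doc_id in ranked[:k]]


# Hypothetical sample documents standing in for crawled pages.
docs = {
    "a.html": "life is what happens",
    "b.html": "success is not final",
    "c.html": "the world as we have created it",
}
idf, vectors = build_index(docs)
results = top_k("success is final", idf, vectors, k=2)
```

Storing each vector as a plain dict keeps the index directly serializable with `json.dump`, which fits the JSON index format mentioned under Integration.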

## 4. Architecture
- Software Components:
  - Scrapy spider
  - Scikit-Learn indexer
  - Flask query processor
- Interfaces: APIs, endpoints, command-line interfaces.
- Implementation Notes: Python version, libraries (Scrapy, sklearn, Flask)

## 5. Operation
- Installation Instructions: Dependencies, environment setup, pip/venv usage.
- Software Commands:
  - Running crawler
  - Building the index
  - Running Flask server and querying
- Inputs: Seed URL, CSV queries, configuration options.
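
As a rough illustration of the CSV query input, the sketch below loads queries with the standard `csv` module. The column names `query_id` and `query`, the helper name `load_queries`, and the sample rows are assumptions; the actual file layout used by the project may differ.

```python
import csv
import io

# Hypothetical sample input; the real query file may use other columns.
SAMPLE_CSV = "query_id,query\n1,life and love\n2,  \n3,inspirational quotes\n"


def load_queries(csv_file):
    """Return (query_id, query) pairs, skipping rows whose query is blank."""
    queries = []
    for row in csv.DictReader(csv_file):
        text = (row.get("query") or "").strip()
        if text:
            queries.append((row["query_id"], text))
    return queries


queries = load_queries(io.StringIO(SAMPLE_CSV))
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.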

## 6. Conclusion
- Success/Failure:
- Outputs:

## 7. Data Sources
- Links to websites crawled (if allowed).

## 8. Test Cases
- Framework: Pytest, unittest, or custom scripts.
- Harness: How to test crawler, indexer, query processor.
- Coverage: Edge cases like empty pages, invalid URLs, long queries, or spelling errors.
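
One way to cover such edge cases is a few pytest-style functions around small, pure helpers. The sketch below tests a hypothetical `url_to_filename` helper that follows the crawler's page-naming step, with one assumed extension: every character other than letters, digits, `-`, `.`, and `_` is replaced by `_`. The helper name and the test URLs are illustrative only.

```python
def url_to_filename(url):
    """Hypothetical helper: map a URL to a filesystem-safe .html file name."""
    stripped = url.replace("https://", "").replace("http://", "")
    safe = "".join(c if c.isalnum() or c in "-._" else "_" for c in stripped)
    return f"{safe}.html"


def test_plain_url():
    assert url_to_filename("https://quotes.toscrape.com") == "quotes.toscrape.com.html"


def test_path_separators_are_replaced():
    name = url_to_filename("https://quotes.toscrape.com/tag/value/page/1/")
    assert name == "quotes.toscrape.com_tag_value_page_1_.html"


def test_query_string_is_filesystem_safe():
    name = url_to_filename("http://example.com/search?q=a b&page=2")
    # None of the characters Windows forbids in file names may remain.
    assert not set(name) & set('\\/:*?"<>| ')
```

Saved as a `test_*.py` file, these functions are collected automatically by `pytest`; they can also be called directly as plain functions.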

## 9. Source Code

## 10. Bibliography

In [1]:
import nest_asyncio
nest_asyncio.apply()

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
import os

### Step 1: Web Crawling

In [None]:
os.makedirs("pages", exist_ok=True)

class NotebookCrawler(scrapy.Spider):
    name = "notebook_crawler"
    
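    def __init__(self, seed_url, allowed_domain, max_pages, max_depth, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [seed_url]
        self.allowed_domains = [allowed_domain]
        self.max_pages = max_pages
        self.max_depth = max_depth
        self.visited = set()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Assigning custom_settings inside __init__ has no effect, because Scrapy
        # reads it from the class before the spider is instantiated. Scrapy 2.11+
        # lets from_crawler() adjust settings based on the spider arguments.
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.settings.set('DEPTH_LIMIT', spider.max_depth, priority='spider')
        crawler.settings.set('CLOSESPIDER_PAGECOUNT', spider.max_pages, priority='spider')
        crawler.settings.set('AUTOTHROTTLE_ENABLED', True, priority='spider')
        return spider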
    
    def parse(self, response):
        # Stop if max pages reached
        if len(self.visited) >= self.max_pages:
            self.logger.info(f"Reached max pages limit: {self.max_pages}")
            return
        
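        # Save HTML content under a filesystem-safe name ("?" and ":" are not
        # allowed in Windows file names)
        url_safe = response.url.replace("https://", "").replace("http://", "")
        url_safe = "".join(c if c.isalnum() or c in "-._" else "_" for c in url_safe)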
        with open(f"pages/{url_safe}.html", "w", encoding="utf-8") as f:
            f.write(response.text)
        
        self.visited.add(response.url)
        self.logger.info(f"Saved page {len(self.visited)}/{self.max_pages}: {response.url}")
        
        # Extract and follow links
        if len(self.visited) < self.max_pages:
            links = LinkExtractor(allow_domains=self.allowed_domains).extract_links(response)
            for link in links:
                if link.url not in self.visited:
                    yield response.follow(link.url, self.parse)

# Usage with configurable parameters
process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',
})

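process.crawl(
    NotebookCrawler,
    seed_url='https://quotes.toscrape.com',
    allowed_domain='quotes.toscrape.com',
    max_pages=100,
    max_depth=5
)

# The Twisted reactor cannot be restarted: running this cell a second time in the
# same kernel raises ReactorNotRestartable, so restart the kernel before re-crawling.
process.start()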

2025-11-23 16:55:33 [scrapy.utils.log] INFO: Scrapy 2.13.3 started (bot: scrapybot)
2025-11-23 16:55:33 [scrapy.utils.log] INFO: Versions:
{'lxml': '6.0.2',
 'libxml2': '2.11.9',
 'cssselect': '1.3.0',
 'parsel': '1.10.0',
 'w3lib': '2.3.1',
 'Twisted': '25.5.0',
 'Python': '3.14.0 (tags/v3.14.0:ebf955d, Oct  7 2025, 10:15:03) [MSC v.1944 '
           '64 bit (AMD64)]',
 'pyOpenSSL': '25.3.0 (OpenSSL 3.5.4 30 Sep 2025)',
 'cryptography': '46.0.3',
 'Platform': 'Windows-11-10.0.26200-SP0'}
2025-11-23 16:55:33 [scrapy.addons] INFO: Enabled addons:
[]
2025-11-23 16:55:33 [scrapy.extensions.telnet] INFO: Telnet Password: ac4c4e6e96f16494
2025-11-23 16:55:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2025-11-23 16:55:33 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 'INFO'}
2025-11-23 16:55:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['s

ReactorNotRestartable: 

2025-11-23 16:55:33 [scrapy.core.engine] INFO: Spider opened
2025-11-23 16:55:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-11-23 16:55:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-11-23 16:55:34 [notebook_crawler] INFO: Saved page 1/100: https://quotes.toscrape.com
2025-11-23 16:55:34 [notebook_crawler] INFO: Saved page 2/100: https://quotes.toscrape.com/tag/value/page/1/
2025-11-23 16:55:34 [notebook_crawler] INFO: Saved page 3/100: https://quotes.toscrape.com/tag/aliteracy/page/1/
2025-11-23 16:55:34 [notebook_crawler] INFO: Saved page 4/100: https://quotes.toscrape.com/tag/adulthood/page/1/
2025-11-23 16:55:34 [notebook_crawler] INFO: Saved page 5/100: https://quotes.toscrape.com/tag/success/page/1/
2025-11-23 16:55:34 [notebook_crawler] INFO: Saved page 6/100: https://quotes.toscrape.com/tag/classic/page/1/
2025-11-23 16:55:34 [notebook_crawler] INFO: Saved page 7/100: https://qu