247 changes: 247 additions & 0 deletions config/scraping.yaml
@@ -0,0 +1,247 @@
# Code4Ved Web Scraping Configuration

# General settings
user_agent: "Code4Ved/1.0"
timeout: 30
max_retries: 3
respect_robots: true

# Rate limiting
default_rate_limit: 1.0
burst_size: 5

# Content filtering
min_text_length: 100
max_text_length: 1000000
allowed_formats:
- html
- pdf
- plaintext
- xml
- json

# Storage settings
storage_path: "data/raw"
create_directories: true
duplicate_detection: true

# Logging
log_level: "INFO"
log_format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
log_file: null

# Performance settings
max_concurrent_requests: 5
request_queue_size: 100

# Validation settings
validate_content: true
validate_encoding: true

# SSL settings
verify_ssl: false # Disable SSL verification for development
ssl_warnings: false # Disable SSL warnings
Comment on lines +41 to +43

⚠️ Potential issue | 🔴 Critical

Critical: Disabling SSL verification should be environment-based, not hardcoded.

Disabling SSL verification in a committed configuration file creates a MITM attack surface that could accidentally propagate to production. The comment says "for development", but a committed config file is not a development-only artifact.

Recommend one of these approaches:

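Option 1 (preferred): Enable SSL verification by default and allow environment-variable overrides for development. Plain YAML has no variable interpolation, so the config loader would have to expand the ${VAR:-default} placeholders itself: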

-verify_ssl: false  # Disable SSL verification for development
-ssl_warnings: false  # Disable SSL warnings
+verify_ssl: ${VERIFY_SSL:-true}
+ssl_warnings: ${SSL_WARNINGS:-true}

Option 2: Move the insecure settings to a separate .env.development file that is listed in .gitignore and loaded only during local testing, so the committed configuration stays secure by default.

This aligns with the principle of secure-by-default configurations.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In config/scraping.yaml around lines 41-43, the file hardcodes verify_ssl: false
and ssl_warnings: false, which is insecure. Change the committed defaults to
verify_ssl: true and either set ssl_warnings: true or omit it. Then do one of
the following: update the application to read an environment variable (e.g.
SCRAPING_VERIFY_SSL) that can override the setting to false only in local/dev
environments, or move the false settings into a separate .env.development or
YAML file that is listed in .gitignore and loaded only in local runs. Ensure
the documentation and the config-loading logic prefer the secure default and
allow only explicit, non-committed overrides for development.
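
The override described above could be sketched as follows. This is a minimal illustration, not the project's actual loader: the function name resolve_verify_ssl is hypothetical, the variable name SCRAPING_VERIFY_SSL is only the example used in this comment, and the config is assumed to be already parsed into a dict.

```python
import os

# Hypothetical variable name, taken from the example in the review comment.
ENV_VAR = "SCRAPING_VERIFY_SSL"
_FALSE_VALUES = {"0", "false", "no", "off"}


def resolve_verify_ssl(config, environ=None):
    """Return the effective verify_ssl setting, secure by default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None:
        # No override set: use the committed value, or True if the key is absent.
        return bool(config.get("verify_ssl", True))
    # Only an explicit false-like value disables verification.
    return raw.strip().lower() not in _FALSE_VALUES
```

With this shape, the committed file can carry verify_ssl: true, and a developer opts out locally by exporting SCRAPING_VERIFY_SSL=false in their own shell rather than editing the shared config.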


# Source configurations
sources:
  vedicheritage:
    name: "vedicheritage"
    base_url: "https://vedicheritage.gov.in"
    description: "Government of India Vedic Heritage Portal"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://vedicheritage.gov.in/robots.txt"
    rate_limit: 1.0
    max_pages: 1000
    supported_formats:
      - html
      - pdf

  gretil:
    name: "gretil"
    base_url: "http://gretil.sub.uni-goettingen.de"
    description: "Göttingen Register of Electronic Texts"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "http://gretil.sub.uni-goettingen.de/robots.txt"
    rate_limit: 0.5
    max_pages: 500
    supported_formats:
      - html
      - plaintext
      - xml
Comment on lines +60 to +72

⚠️ Potential issue | 🟠 Major

Upgrade GRETIL source to HTTPS.

Line 62 configures the gretil base_url with http:// instead of https://, and robots_txt_url on line 66 has the same problem. Plain HTTP exposes the scraper to man-in-the-middle attacks: anyone on the network path could alter the Vedic heritage texts or the robots.txt rules in transit.

Apply this fix:

  gretil:
    name: "gretil"
-   base_url: "http://gretil.sub.uni-goettingen.de"
+   base_url: "https://gretil.sub.uni-goettingen.de"
    description: "Göttingen Register of Electronic Texts"
    language: "en"
    encoding: "utf-8"
-   robots_txt_url: "http://gretil.sub.uni-goettingen.de/robots.txt"
+   robots_txt_url: "https://gretil.sub.uni-goettingen.de/robots.txt"
🤖 Prompt for AI Agents
In config/scraping.yaml around lines 60 to 72, the gretil source is configured
with http URLs; update the base_url and robots_txt_url to use https (change
"http://gretil.sub.uni-goettingen.de" to "https://gretil.sub.uni-goettingen.de"
and "http://gretil.sub.uni-goettingen.de/robots.txt" to
"https://gretil.sub.uni-goettingen.de/robots.txt"), leaving all other fields
unchanged.
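
A check like the one below could catch plain-HTTP sources before they are merged. It is a sketch under assumptions: find_insecure_urls is a hypothetical helper, and sources is assumed to be the already-parsed `sources` mapping from the config, with the field names base_url and robots_txt_url as they appear in this file.

```python
from urllib.parse import urlsplit

# Fields in each source entry that hold URLs (names as used in scraping.yaml).
URL_FIELDS = ("base_url", "robots_txt_url")


def find_insecure_urls(sources):
    """Return (source_name, field, url) for every URL that is not HTTPS."""
    findings = []
    for source_name, source in sources.items():
        for field in URL_FIELDS:
            url = source.get(field)
            # Skip missing fields; flag anything whose scheme is not https.
            if url and urlsplit(url).scheme.lower() != "https":
                findings.append((source_name, field, url))
    return findings
```

Run against this config, such a check would be expected to report the gretil and sanskritlinguistics entries, since those are the only sources listed with http:// URLs.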


  ambuda:
    name: "ambuda"
    base_url: "https://ambuda.org"
    description: "Open Source Sanskrit Platform"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://ambuda.org/robots.txt"
    rate_limit: 1.0
    max_pages: 2000
    supported_formats:
      - html
      - json

  sanskritdocuments:
    name: "sanskritdocuments"
    base_url: "https://sanskritdocuments.org"
    description: "Sanskrit Documents Archive"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://sanskritdocuments.org/robots.txt"
    rate_limit: 0.5
    max_pages: 1000
    supported_formats:
      - html
      - pdf
      - plaintext

  vedpuran:
    name: "vedpuran"
    base_url: "https://vedpuran.net"
    description: "Ved and Puran PDF Downloads"
    language: "hi"
    encoding: "utf-8"
    robots_txt_url: "https://vedpuran.net/robots.txt"
    rate_limit: 0.5
    max_pages: 500
    supported_formats:
      - pdf
      - html

  veducation:
    name: "veducation"
    base_url: "https://www.veducation.world"
    description: "Vedic Education Library"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://www.veducation.world/robots.txt"
    rate_limit: 1.0
    max_pages: 800
    supported_formats:
      - html
      - pdf

  ignca:
    name: "ignca"
    base_url: "https://ignca.gov.in"
    description: "Indira Gandhi National Centre for the Arts"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://ignca.gov.in/robots.txt"
    rate_limit: 0.5
    max_pages: 1000
    supported_formats:
      - html
      - pdf

  sanskritebooks:
    name: "sanskritebooks"
    base_url: "https://www.sanskritebooks.org"
    description: "Sanskrit E-books Collection"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://www.sanskritebooks.org/robots.txt"
    rate_limit: 0.5
    max_pages: 600
    supported_formats:
      - html
      - pdf
      - plaintext

  sanskritlinguistics:
    name: "sanskritlinguistics"
    base_url: "http://www.sanskrit-linguistics.org"
    description: "Sanskrit Linguistics Digital Corpus"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "http://www.sanskrit-linguistics.org/robots.txt"
    rate_limit: 0.3
    max_pages: 400
    supported_formats:
      - html
      - xml
      - plaintext
Comment on lines +154 to +166

⚠️ Potential issue | 🟠 Major

Upgrade sanskritlinguistics source from HTTP to HTTPS.

The sanskritlinguistics source uses unencrypted HTTP for both base_url and robots_txt_url. Apart from gretil, every other source in this configuration uses HTTPS. Upgrade to HTTPS for consistency and security.

Apply this diff:

  sanskritlinguistics:
    name: "sanskritlinguistics"
-   base_url: "http://www.sanskrit-linguistics.org"
+   base_url: "https://www.sanskrit-linguistics.org"
    description: "Sanskrit Linguistics Digital Corpus"
    language: "en"
    encoding: "utf-8"
-   robots_txt_url: "http://www.sanskrit-linguistics.org/robots.txt"
+   robots_txt_url: "https://www.sanskrit-linguistics.org/robots.txt"
    rate_limit: 0.3
    max_pages: 400
    supported_formats:
      - html
      - xml
      - plaintext
🤖 Prompt for AI Agents
In config/scraping.yaml around lines 154 to 166, the sanskritlinguistics source
uses unencrypted HTTP for base_url and robots_txt_url; update both values to use
HTTPS (change "http://www.sanskrit-linguistics.org" to
"https://www.sanskrit-linguistics.org" and
"http://www.sanskrit-linguistics.org/robots.txt" to
"https://www.sanskrit-linguistics.org/robots.txt"), then verify the HTTPS
endpoints are reachable and adjust any trailing slashes or site-specific
redirects as needed.
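
The upgrade-then-verify step could look roughly like this. It is a sketch only: to_https and is_reachable are hypothetical helper names, the reachability probe uses a plain HEAD request from the standard library, and whether the host actually serves HTTPS has to be confirmed by running it, not assumed.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_https(url):
    """Rewrite an http:// URL to https://, leaving other URLs unchanged."""
    parts = urlsplit(url)
    if parts.scheme.lower() == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def is_reachable(url, timeout=10):
    """Return True if a HEAD request to url succeeds with a status below 400."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError, TLS failures and timeouts all derive from OSError.
        return False
```

A caller would apply to_https to both base_url and robots_txt_url and keep the rewritten values only if is_reachable returns True for them. Some servers reject HEAD requests, so a failed probe is a reason to check manually, not proof that HTTPS is unavailable.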


  sanskritlibrary:
    name: "sanskritlibrary"
    base_url: "https://sanskritlibrary.org"
    description: "Sanskrit Library Digital Repository"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://sanskritlibrary.org/robots.txt"
    rate_limit: 0.5
    max_pages: 800
    supported_formats:
      - html
      - xml
      - pdf

  titus:
    name: "titus"
    base_url: "https://titus.fkidg1.uni-frankfurt.de"
    description: "TITUS Database of Indo-European Texts"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://titus.fkidg1.uni-frankfurt.de/robots.txt"
    rate_limit: 0.3
    max_pages: 500
    supported_formats:
      - html
      - xml
      - plaintext

  templepurohit:
    name: "templepurohit"
    base_url: "https://www.templepurohit.com"
    description: "Temple Purohit - Hindu Texts and Resources"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://www.templepurohit.com/robots.txt"
    rate_limit: 0.5
    max_pages: 300
    supported_formats:
      - html
      - pdf

  vyasaonline:
    name: "vyasaonline"
    base_url: "https://www.vyasaonline.com"
    description: "Vyasa Online - Maha Puranas Collection"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://www.vyasaonline.com/robots.txt"
    rate_limit: 0.5
    max_pages: 400
    supported_formats:
      - html
      - pdf

  gitasupersite:
    name: "gitasupersite"
    base_url: "https://www.gitasupersite.iitk.ac.in"
    description: "Gita Supersite - IIT Kanpur Bhagavad Gita Portal"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://www.gitasupersite.iitk.ac.in/robots.txt"
    rate_limit: 0.5
    max_pages: 200
    supported_formats:
      - html
      - pdf
      - xml

  adhyeta:
    name: "adhyeta"
    base_url: "https://www.adhyeta.org.in"
    description: "Adhyeta - Vedic Learning Platform"
    language: "en"
    encoding: "utf-8"
    robots_txt_url: "https://www.adhyeta.org.in/robots.txt"
    rate_limit: 0.5
    max_pages: 300
    supported_formats:
      - html
      - pdf
102 changes: 102 additions & 0 deletions data/raw/.content_hashes.json
@@ -0,0 +1,102 @@
[
"3d59c42d1ed471e77266773c20f93dd7e350d1434d65769c84e5bf24778e99ac",
"fa11f8cc00f9d51181a08c5b1e72decb75de3d623b2fc0e86a0cb0e540a82067",
"13456ae0fe181d00735a25aa3af5310a6d4c96d7e1d1d746cfab760f27b0dc5a",
"b97ca896d765d40192d06f9a8a42581d3b4a5151c33f84b768c6f2be7c905bfd",
"5d0476abbc0a34eacec8b2d920ed43d5b3f15760758e8b2a50281260ee8f6121",
"547778d0bae0630b675efb8d1f3f24e0ed0c0935a96094b46c7c9f93308fd904",
"1fa58dcb5a7e05079c42457eb9151db57badd964d5db5adc68dfe5e6630d05ad",
"699b5b89e7eb6303040ff4d45c1f5c205a96c3b420cfd3e1bbbdcd8203854eed",
"cc4f3c14ca420e81c09af5e496dcfce454f5ee59aae189f85c61c6a86222b623",
"a7efe70f87a1e2055fe91703e4106c466abc574fadaba9e0aac5095e67afe6f4",
"b51f63cd206661a3f3756b8059c4c179f8ba9c5b1f164140696918a6faedffd0",
"8948ad34e41581bb91cba252ba915a8a9fa07c99877c32774ddcf50ae743ad50",
"a2978fbc8718c7491f02aa65c3c097752560acc8d6326462fd8165d8f910aead",
"4815bfe6d606d5f2ff35ef762624e3916b9670885806559435bf40ce7e2449d0",
"e0ceac2c56f4cf5db749de757aa9d8158a8eb7f8f35705bdc2618555bedd7e9e",
"5256fb5c73fdb8e93888bde5857d3bcc4794e0c59c92828312b46e35d62b997a",
"f9039c79bb1eec7deefb26a7bd9a3b2ee07e9dfb78a44543eb85dd93869642f2",
"26ae7dbf827846dd20eec9a3c5f765bf4d6044d243b70146745e56468af3e5ac",
"ed2ef87b27c861cbf2ca727ccb235428ad367225da32984449af023d9b18191b",
"8cea88988fdfda6114b02d5eadeb9ba21a8c419c8b7c9575dc773f529630dc05",
"0a8ef9b96fef3f01fc411e66bd5149c75c28b4a1f158f94c83179f8c78776b51",
"a0eec7f4dce9dfbef6b8859c740185ff1d4a4744791e231acc8ac2852fd8e8e6",
"e0a16b8993ea913a435aa7e66eddaf69b9d91f36837eb74d2933f4d014d8614b",
"4af14c3e85424e23c47421bd76bdc1f30a71f65e7ab4dfc157dba36eec45d9ae",
"350737b980626774dd2baf8d8193fb50a35e8fbaab4bcbe1387d35572011cbd8",
"045bc965db60fefc75f0ba6719ea0760af9ecd3ef7bf2b7a90350953004c81e0",
"f8a94e96985fdfbe624f350e7e01aae4e28a3402b81c1cdd2d7ac10093ff4c14",
"cd6868d950ee9757570c07f4df9e6a72f3c1d768da2ec72f831d4518f0fe41a0",
"5990cd0c87753d1eab87bcbdac62c14dbf427f681bee73b43fd68d09d3742bbb",
"05370e80dd26c1a3a146576cb53df812998f55e43261ad7c1c6d721856597376",
"994c2556954fae00f8d36a241d0e7b8a6396e883a075e04569386c34e31b4751",
"92b72461174c394eb02c0e3b5910c6e8f2b57bc07f55689c3fbdd171ffe9255f",
"5e81fae843388546ec4338beac47278d4431e87f254e8d5d71c327c34e2a5cf0",
"d116089095e32e4976097ec55540eebca66e25740ad30e1aa4afee5b663afd5c",
"24f9fdcb77e4af450dfb3cdee33df3851ca3f8375967c6d8dd1cb2a7409c2865",
"9eab89bced3669f1a1b1e4c18b87efa086f6b063b936d9072eca0a607c792d55",
"431dbc84fcc2f1aaf38ea29cacad036592b9f695f0a93534b367f0911b560183",
"01d8486a76e11e301e29ff14cf6862555d57f2678fbd56b5348d32cf895ecb28",
"7324ff565ed6e36b63b81541b8b1231ced18373841f5cc894bbff5c3cc5c67b7",
"716187b6e9a1a839d52d4c2a6f1c93f1e492661e5de5d3c871e7b3841f2757f2",
"15cea0ba23fb15b5f5baf5b0d6170beb29f15fd9cf5904323b04e25b6bb098d9",
"b5f43d15e81c80c8d2f00e63b4543e0a21814e0e1f31de5c5fe95f8113e25c68",
"17572a489f6c9a4a16463b745f23bf8c981e5f205cd3f908446cd205aa654bff",
"604a3ff1f6ef34f7d390e3b45ded06a4659c3b511d346d5896c2761266f3892f",
"c2e63044f6115c83f59f9a11c7def2d34cd14ab02af5d932be8d700f1530e877",
"b0123964b05fb425ed2072ba587b5afcf3e805d8a7d1bb884105de9955ee321f",
"136933c64fddf2960b870dcf11e991d4af12119dbadd4679f2b5ccb6f5824f95",
"0c34679eb906c0a7acc249b5a6df39d166d561a1d8c2196e341e311c08e235b4",
"cc58913a09e816a9d60654a18175f204816c155e874633424f50969f1c46ed0a",
"0435a72f76f5b8535a47db68b20661f461e9dbb97bd404bcd90ac13dca1851ed",
"ae7a14c3b289b2f9ab341011d1bd525c38f5d21ccd347d69483e5abafab7c36f",
"83f721f9cad7903237fb54899285f471402abad65ddcbae7f2144ace9492c0c4",
"581eede925c2c4813874883fe196749b9bad49f7aedf94cbe3ce0cab44c3b757",
"d4e373f12c8e07abe18620ed6e24f59747d0c8ef11c06e2012c4ca3612106c34",
"99862d1959186fe55cd06d40ea3c6aa6fce5fb006d4e20944ab5ef65c329e159",
"5cacdfd0614f085e93623b5e21d013e1ad607e94d2af1c79c2faad1a567dc2b9",
"96e68c3308686b13a16802203e63ad51715dea794282fc014da679ee55f71307",
"d795432ec079871f0cfb349db3dfad911bcb60482b3284167e887e2c551e90b8",
"a8b8f0e8dfe5737537ec735c4eaa61a10f6e3d871534427fe94abfadd5c6b38a",
"2fb0c8de41975c1b99326538a6c955b0459e8d8119ed0497ea45f0d1a764896e",
"d6be73d5773fbe78e90dc0e96f500d2c475441d7869c32cbf229521dc7d06bc6",
"cb16b9ce4baf8dba67b3c8140e7e0ebc6ceac0614e3fcb70f8327cdf40ead836",
"c55a5afa6894ccebfaa93d59632eaa459eb0cd94d352226d7219771a596d612d",
"907f05b81a4d7267629ddeea2e1abfe210910f73a36ee26221ab7da8aa0b1dfb",
"18a7ddb4cb7d91a17b184f86dd1af5b1586f4836142576fea167299220c27f33",
"188e699b5d92e5dfa809a7cc74045500cb9eb979ebd468c106e97d19cafbacf4",
"a21d8f76f8b5e63d1554f705f76449b842f200459c6ed37cdcf9c014eb4b41cb",
"7f1b8c00161c7696028da0373589c1531a85c565049729c64b38e390287c8cf3",
"b388dd9aa0a5d894c2031fd83efc698db383ce72418c01cc4396e8b15fa250a6",
"a6decf1cf753e2321f487f942cab081e6a16a2bc75774953353e84fec963a72a",
"3421097cf013a7e6cf7c3687653cbeed7df9da1a935c906a0409dc24121dca86",
"77426a1f6182b1aab1b0edfeb3694034292e5de536b013e21c14966746a297b3",
"6245e1feaf8cf55f75d9ac3bd8dce9e3db5d0db36e6eccdef46ceb91641b6119",
"2000e36342fb095db2ab954b9407fe23ed566a036bfec8198c3e81fd0d74cab6",
"bf1b2ddc29768de857588fc323945944eede8d7d8720ff2038f27f1e899b6930",
"4ac20e40f8de51b3762117e52c04377a12325b0b75c2fdd373a68c9e51817dce",
"1234e67b3bc61901446a6376b547de9b9bd1bfaf10063fadbd605f14a9671ae3",
"bc79b5a80eedbbf8a20291756224eb922358ab22a55a56330ec5d35ebfaeae32",
"134b8719cda2d6c738c5971746b2bd4f330e63cf1aaebc273043038e453ed876",
"10fce7e88642916cdf161bacf8b0cdeeb1b8913c7fde1ffee1390af86250ad74",
"e681513f32493bd8567a9eb0bd038e4403f90cd58bf669ee33a9a4eba07fef5d",
"8b33eebbbfa137e759f48d9cfbd4db579bfe0adc6e4bff125d79cbb32155ea31",
"53b7f1a8d9a8130807de1edb4ac10239cd64586361c9f98e7c2d9251587759bf",
"d105de7e7aad8173dc22aa4556ba226382b573e500a00879dbe0bbfab0f22a3c",
"a5257fbdf5f16f92e88378c6842511e229c0613152ce9e7d84862c4a0c4825cd",
"1b50334e5056aecb2b95ab080b4558b61f0fc2734a149aa32097fbb066ee5a37",
"69e058b3ded0113f57672d943fa8e8cfd67912eba04ea7a653e91317193b37ac",
"899fb523aaaf21ab722aef3ebcb1b8c6d5b626185f719ff594ce5c27bf7878eb",
"09c00535006f1f43e3966dab1e800bb4bc23daf27728cbf4d629132aef40557e",
"dad3ab8ea792b9a7afe534f74ce9e9d102ae0df566f421d66c69da0cebb4a73a",
"50133c816799e34f0455cae85d5289f55c275d452744fd6e1b5325360f5b3893",
"09f211b65c6329067be86ccd4f1108ca37b81af419a22fe82a893621fb7d1397",
"f6ff273ca8c6fbe414a524cada508be71b4a9db6ca397d80ccbe481d3f6321df",
"dffc061e84e576c1c8f91bf1ae57dd20025324f74a5d748d8ff375426fa4d517",
"c6655ed87ea080b554356ef78abe6e804e6d70506079279506fbfa71b77a3fd5",
"c389f4f7cedf1b7fd69c62fac07882389982808c2eabeee814bc6f4e1b72ec25",
"c51af4adfadf83d6c40c6b75d17170ad81af854027159f605f9e576006efc0eb",
"60cc83d70db46bfab870d13eff93cff6b44e0892efb0ee8a37800fb595c95557",
"edd99f84fb6ddb058e713ded5cefefd0536b6092da8906955f76fc043b408c48",
"344a24b2a98b80072a35601a9bd0c53d6f6e2f0b192a7052d7aa80c6a4ae7a27"
]
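
Each entry in this file is 64 hexadecimal characters, which is consistent with SHA-256 digests backing the duplicate_detection setting in config/scraping.yaml. The PR does not show the hashing code, so the sketch below is an assumption about how such a list could be used: content_hash and DuplicateDetector are hypothetical names, and hashing the UTF-8 text (as opposed to raw response bytes or normalized text) is a guess.

```python
import hashlib


def content_hash(text):
    """Return the SHA-256 hex digest of the text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DuplicateDetector:
    """Tracks digests of content already stored, as in .content_hashes.json."""

    def __init__(self, known_hashes=()):
        self._seen = set(known_hashes)

    def is_new(self, text):
        """Record the text's digest; return False if it was seen before."""
        digest = content_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

    def to_list(self):
        """Return the known digests as a sorted list, ready for json.dump."""
        return sorted(self._seen)
```

A scraper would load the JSON array into DuplicateDetector at startup, call is_new before writing each document to storage_path, and write to_list() back out at the end of the run. Committing generated state like this file is a separate question the PR may want to revisit.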