shreevatsa/sanskrit

NOTE (July 2018): The description below is currently out of date. Use it only to get an approximate idea of how the code works.

What

Code to identify the metre of a Sanskrit verse.

The web version is currently served at http://sanskritmetres.appspot.com/

The code can also be used as a Python library.

Examples

In the web version, try the following inputs.

kāṣṭhād agnir jāyate mathya-mānād-
bhūmis toyaṃ khanya-mānā dadāti|
sotsāhānāṃ nāsty asādhyaṃ narāṇāṃ
mārgārabdhāḥ sarva-yatnāḥ phalanti||

or (note that this one intentionally has many typos):

काष्ठाद् अग्नि जायते
मथ्यमानाद्भूमिस्तोय खन्यमाना ददाति।
सोत्साहानां नास्त्यसाध्यं
नराणां मार्गारब्धाः सवयत्नाः फलन्ति॥

If using as a library (TODO: document this better):

import identifier_pipeline

verse = r'''kāṣṭhād agnir jāyate mathya-mānād-
bhūmis toyaṃ khanya-mānā dadāti|
sotsāhānāṃ nāsty asādhyaṃ narāṇāṃ
mārgārabdhāḥ sarva-yatnāḥ phalanti||'''

# Build the pipeline, then identify the metre of the verse.
identifier = identifier_pipeline.IdentifierPipeline()
match_results = identifier.IdentifyFromText(verse)

How

The design of the program is as follows.

Transform the input (Read, Scan)

The input passes through the following representations.

The raw input

This is whatever is typed into the textarea (for the web interface) or given as input to `IdentifierPipeline`. Consider the examples above.

The input in slp1

Whatever input script (or transliteration scheme) is used, the input is cleaned up and “read” into a limited Sanskrit alphabet (SLP1). For instance, the examples above are read as the following:

kAzWAdagnirjAyatemaTyamAnAd
BUmistoyaMKanyamAnAdadAti
sotsAhAnAMnAstyasADyaMnarARAM
mArgArabDAHsarvayatnAHPalanti

and

kAzWAdagnijAyate
maTyamAnAdBUmistoyaKanyamAnAdadAti
sotsAhAnAMnAstyasADyaM
narARAMmArgArabDAHsavayatnAHPalanti

respectively.
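As a rough illustration of this “read” step, the sketch below converts one line of IAST input (the scheme of the first example) to SLP1 and drops everything that is not a Sanskrit letter. It is not the code in the read directory: the name `iast_to_slp1` and the table are assumptions made for this example, and it ignores Devanagari input, vowel hiatus such as a + i, and other edge cases.

```python
import re
import unicodedata

# IAST letters whose SLP1 form differs; all other letters map to themselves.
# Keys are NFC-normalized so that they compare equal to normalized input.
IAST_TO_SLP1 = {unicodedata.normalize('NFC', k): v for k, v in {
    'ā': 'A', 'ī': 'I', 'ū': 'U', 'ṛ': 'f', 'ṝ': 'F', 'ḷ': 'x', 'ḹ': 'X',
    'ai': 'E', 'au': 'O', 'ṃ': 'M', 'ḥ': 'H',
    'kh': 'K', 'gh': 'G', 'ṅ': 'N', 'ch': 'C', 'jh': 'J', 'ñ': 'Y',
    'ṭh': 'W', 'ṭ': 'w', 'ḍh': 'Q', 'ḍ': 'q', 'ṇ': 'R',
    'th': 'T', 'dh': 'D', 'ph': 'P', 'bh': 'B', 'ś': 'S', 'ṣ': 'z',
}.items()}
UNCHANGED = set('aiueokgcjtdnpbmyrlvsh')

# Try two-letter units (such as "ṭh") before single letters.
_KEYS = sorted(IAST_TO_SLP1, key=len, reverse=True)
_TOKEN = re.compile('|'.join(map(re.escape, _KEYS)) + '|.', re.DOTALL)


def iast_to_slp1(line):
    """Transliterate one line of IAST to SLP1, discarding junk characters."""
    line = unicodedata.normalize('NFC', line.lower())
    out = []
    for token in _TOKEN.findall(line):
        if token in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[token])
        elif token in UNCHANGED:
            out.append(token)
        # Anything else (spaces, hyphens, dandas, digits) is dropped.
    return ''.join(out)


print(iast_to_slp1('kāṣṭhād agnir jāyate mathya-mānād-'))
```

Spaces are only removed after the letters have been matched, so a “t” and an “h” that belong to different words are never merged into an aspirate.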

The metrical signature of the input

We next scan the input, to reduce it to a pattern of laghu (denoted L) and guru (denoted G) syllables.

Our two examples above are scanned into the lists:

['GGGGGLGGLGG',
 'GGGGGLGGLGL',
 'GGGGGLGGLGG',
 'GGGGGLGGLGL']

and

['GGGLGLG',
 'GLGGGGGLGLGGLGL',
 'GGGGGLGG',
 'LGGGGGGLLGGLGL']

respectively.
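The scanning rule can be sketched as follows. This is a simplified illustration, not the code in scan.py: the name `scan_line` is hypothetical. It treats a syllable as guru when its vowel is long, or when the vowel is followed by an anusvāra or visarga, or by two or more consonants before the next vowel. Treating a line-final closed syllable as guru is an additional assumption of this sketch.

```python
SHORT_VOWELS = set('aiufx')
LONG_VOWELS = set('AIUFXeEoO')
VOWELS = SHORT_VOWELS | LONG_VOWELS


def scan_line(slp1_line):
    """Reduce one line of SLP1 text to a string over {'L', 'G'}."""
    pattern = []
    n = len(slp1_line)
    i = 0
    while i < n:
        ch = slp1_line[i]
        if ch not in VOWELS:
            i += 1
            continue
        # Collect everything between this vowel and the next one.
        j = i + 1
        while j < n and slp1_line[j] not in VOWELS:
            j += 1
        following = slp1_line[i + 1:j]
        is_guru = (
            ch in LONG_VOWELS
            or 'M' in following          # anusvāra
            or 'H' in following          # visarga
            or len(following) >= 2       # conjunct consonant
            or (bool(following) and j == n)  # assumed: closed final syllable
        )
        pattern.append('G' if is_guru else 'L')
        i = j
    return ''.join(pattern)


print(scan_line('kAzWAdagnirjAyatemaTyamAnAd'))
```

Each line is scanned on its own here, which is enough to reproduce the lists shown above.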

Identify

Finally, we compare this metrical signature (or “pattern lines”) against a database of known patterns.

For example, in our database we have the information that Śālinī is a sama-vṛtta metre consisting of 4 lines (pāda-s / quarters) each having the pattern

GGGG—GLGGLGG

Thus Śālinī is recognized as the (probable, best-guess) metre of the input verse.
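For the first example, this comparison can be sketched in a few lines. The table and the name `guess_metre` are made up for illustration and are not the project's database or API. The sketch drops the dash from the pattern (taken here to mark the caesura) and, following the usual convention, lets the last syllable of a pāda be either laghu or guru.

```python
import re

# Hypothetical one-entry "database": metre name -> pattern of one pāda.
KNOWN_PADA_PATTERNS = {'Śālinī': 'GGGGGLGGLGG'}


def pada_regex(pattern):
    """Regex for one pāda whose final syllable may be L or G."""
    return re.compile(pattern[:-1] + '[LG]')


def guess_metre(pattern_lines):
    """Return the first metre whose pāda pattern matches every line."""
    for name, pattern in KNOWN_PADA_PATTERNS.items():
        regex = pada_regex(pattern)
        if pattern_lines and all(regex.fullmatch(line) for line in pattern_lines):
            return name
    return None


first_example = ['GGGGGLGGLGG', 'GGGGGLGGLGL', 'GGGGGLGGLGG', 'GGGGGLGGLGL']
print(guess_metre(first_example))
```

A strict line-by-line check like this one finds nothing for an input whose lines are broken in the wrong places, which is why heuristics beyond exact matching are needed.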

Note that in the second example, even though no line matches a line of Śālinī, the program is still clever enough to detect a match.

Look at the README inside the identify directory for more details on the matching heuristics used.

Thus the code can detect partial matches: if there are metrical errors in the verse, but some parts of it are in some metre, then that metre still has a chance of being recognized.

There may also be multiple results when several metres are guessed, such as when different lines are in different metres.

Display

The detected metre is displayed, along with how the verse fits the metre, and information about the metre.

TODO: Describe this.


(Everything below this line needs even more rewriting.)

Code organization

See deps.png for the dependency graph.

Read

Covered by the files in read and their dependencies.

This step detects the transliteration format of the input, removes junk characters that are not part of the verse, and transliterates the input to SLP1 (the encoding we use internally).
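A crude version of the format detection might look like the sketch below. The function `detect_scheme` and its three labels are assumptions for illustration only. The real detection has to tell apart more schemes than this, and its logic is not reproduced here.

```python
import unicodedata

# Letters with diacritics that are characteristic of IAST (NFC-normalized).
IAST_DIACRITICS = set(unicodedata.normalize('NFC', 'āīūṛṝḷḹṃḥṅñṭḍṇśṣ'))


def detect_scheme(text):
    """Guess the input script from the characters it contains."""
    text = unicodedata.normalize('NFC', text.lower())
    # The Devanagari Unicode block is U+0900 to U+097F.
    if any('\u0900' <= ch <= '\u097f' for ch in text):
        return 'devanagari'
    if any(ch in IAST_DIACRITICS for ch in text):
        return 'iast'
    # Plain ASCII input could be any of several schemes; not resolved here.
    return 'ascii'


print(detect_scheme('kāṣṭhād agnir jāyate'))
```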

Scan

This step determines the pattern of gurus and laghus.

The functions in scan.py take the cleaned-up verse and convert it to a pattern of laghus and gurus. A “pattern” means a sequence over the alphabet {‘L’, ‘G’}.

Identify

Identification algorithm: Given a verse,

  1. Look for the full verse’s pattern in known_metre_patterns.
  2. Loop through known_metre_regexes and see if any matches the full verse’s pattern.
  3. Look in known_partial_patterns (then known_partial_regexes) for the whole verse, each line, each half, and each quarter.
  4. [TODO/Maybe] Look for substrings, find closest match, etc.? Might have to restrict to the popular metres for efficiency.

Metrical data

  • A “pattern” means a sequence over the alphabet {‘L’, ‘G’}.
  • A “regex” (for us) is a regular expression that matches some patterns.

(TODO: This is obsolete.) We use the following data structures:

  • known_metre_patterns, a dict mapping a pattern to a MatchResult.
  • known_metre_regexes, a list of pairs (regex, MatchResult).
  • known_partial_patterns, a dict mapping a pattern to MatchResult-s.
  • known_partial_regexes, a list of pairs (regex, MatchResult).

A MatchResult is usually arrived at by looking at a pattern (or list of patterns), and can be seen as a tuple (metre_name, match_type):

  • metre_name: the name of the metre.
  • match_type: used to distinguish between matching one pāda (quarter) and one ardha (half) of a metre. In ardha-sama metres, it can also distinguish between odd and even pādas.
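The four tables and the lookup order given under Identify can be sketched together as below. The table names come from the description above; everything else is an assumption of this sketch, since the description is marked as obsolete. That covers the single Śālinī entry, the match_type strings, the helper `parts_of`, and the choice to cut halves and quarters by syllable count.

```python
import collections
import re

MatchResult = collections.namedtuple('MatchResult', ['metre_name', 'match_type'])

PADA = 'GGGGGLGGLGG'        # one pāda of Śālinī, caesura mark removed
PADA_RE = 'GGGGGLGGLG[LG]'  # the same with a free final syllable

known_metre_patterns = {PADA * 4: MatchResult('Śālinī', 'full')}
known_metre_regexes = [
    (re.compile('(%s){4}' % PADA_RE), MatchResult('Śālinī', 'full'))]
known_partial_patterns = {
    PADA: [MatchResult('Śālinī', 'pāda')],
    PADA * 2: [MatchResult('Śālinī', 'ardha')],
}
known_partial_regexes = [(re.compile(PADA_RE), MatchResult('Śālinī', 'pāda'))]


def parts_of(pattern_lines):
    """Whole verse, each line, each half and each quarter (by syllable count)."""
    full = ''.join(pattern_lines)
    parts = [full] + list(pattern_lines)
    for k in (2, 4):
        if full and len(full) % k == 0:
            size = len(full) // k
            parts += [full[i:i + size] for i in range(0, len(full), size)]
    return parts


def identify(pattern_lines):
    full = ''.join(pattern_lines)
    # Step 1: exact lookup of the full verse.
    if full in known_metre_patterns:
        return [known_metre_patterns[full]]
    # Step 2: regexes against the full verse.
    hits = [r for regex, r in known_metre_regexes if regex.fullmatch(full)]
    if hits:
        return hits
    # Step 3: partial patterns, then partial regexes, for each part.
    results = []
    for part in parts_of(pattern_lines):
        found = list(known_partial_patterns.get(part, []))
        if not found:
            found = [r for regex, r in known_partial_regexes
                     if regex.fullmatch(part)]
        for r in found:
            if r not in results:
                results.append(r)
    return results


print(identify(['GGGLGLG', 'GLGGGGGLGLGGLGL', 'GGGGGLGG', 'LGGGGGGLLGGLGL']))
```

In this toy version the typo-ridden second example is still attributed to Śālinī, because its third quarter by syllable count happens to be a clean pāda. The actual heuristics are described in the README inside the identify directory.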

Display

Display the list of metres found as possible guesses. For vṛtta metres, we also try to “align” the input verse to the metre, so that it is clearer where to break the lines. When the input verse has metrical errors, the alignment also makes clear what they are.
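One way to picture such an alignment is to diff the scanned pattern against the metre's pattern. The sketch below uses Python's standard difflib as a stand-in. It is not the project's alignment code, and `describe_deviations` is a hypothetical name.

```python
import difflib


def describe_deviations(expected, actual):
    """List the places where the scanned pattern departs from the metre.

    Each item is (tag, expected_part, actual_part), where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [(tag, expected[i1:i2], actual[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']


# A pāda of Śālinī versus a scanned line that is one syllable short.
for deviation in describe_deviations('GGGGGLGGLGG', 'GGGGGLGLGG'):
    print(deviation)
```

An empty list means the line fits the metre exactly; anything else points at the syllables that would need attention.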