# Sanskrit Stemmer using Sanskrit Heritage Segmenter

## Usage

The [Sanskrit Heritage Reader](https://sanskrit.inria.fr/) segments a given sentence and also produces the morphological analysis of each segment. The segmenter is therefore used here as a morphological analyser.

Import `sh_segmenter` to start using the stemmer. `sh_segmenter.py` provides functions that read the stemming format, normalize `input_text`, and then query the Sanskrit Heritage Segmenter to generate the morphological analyses.

In [None]:
import sh_segmenter as sh

`urlname` is a global variable containing the Sanskrit Heritage Segmenter's URL (http://sanskrit.inria.fr/cgi-bin/SKT/sktgraph2.cgi). If the Segmenter is installed on the local machine, change it to http://localhost/cgi-bin/SKT/sktgraph2 (or the corresponding CGI file).

In [4]:
urlname = "http://sanskrit.inria.fr/cgi-bin/SKT/sktgraph2.cgi"

Call `run_stemmer` to get the morphological analysis.

### Input

`run_stemmer` takes two inputs:

1. input_text -> input word or sentence in WX notation
2. stemmer_format -> `word`, `sent-iter`, or `sent-joint`
    * word -> gets the morphological analyses of a single word
    * sent-iter -> gets the morphological analyses of each of the words individually and then merges the results
    * sent-joint -> segments the given sentence and then gets the morphological analyses of each of the segments

In [2]:
sh.run_stemmer("gacCawi", "word")

'{"word": ["gacCawi"], "morph": [{"derived_stem": "gam", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["pr. [1] ac. sg. 3"]}, {"derived_stem": "gacCaw", "base": "gam", "derivatioanal_morph": "ppr. [1] ac.", "inflectional_morphs": ["n. sg. loc.", "m. sg. loc."]}]}'

### Output

The output is a JSON string with two key-value pairs:

1. word -> segmented form
2. morph -> list containing the possible morphological analyses with each entry having the following key-value pairs:
    * derived_stem -> the stem/root (prātipadika/dhātu)
    * base -> if the form is derived, this field has the base stem/root
    * derivatioanal_morph -> if the form is derived, this field has the derivational morph analysis
    * inflectional_morphs -> list of possible inflectional morph analyses
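
The structure above can be unpacked with the standard `json` module. The following is a minimal sketch, assuming `run_stemmer` returns a JSON string as the quoted output of the `gacCawi` example suggests; the helper name `summarize_analyses` is hypothetical and not part of `sh_segmenter`.

```python
import json

# Output string copied from the "gacCawi" example above.
raw = '{"word": ["gacCawi"], "morph": [{"derived_stem": "gam", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["pr. [1] ac. sg. 3"]}, {"derived_stem": "gacCaw", "base": "gam", "derivatioanal_morph": "ppr. [1] ac.", "inflectional_morphs": ["n. sg. loc.", "m. sg. loc."]}]}'


def summarize_analyses(raw_json):
    """Return (stem, base, derivational, inflections) tuples (hypothetical helper)."""
    result = json.loads(raw_json)
    rows = []
    for entry in result["morph"]:
        rows.append((
            entry["derived_stem"],
            entry["base"],
            entry["derivatioanal_morph"],  # key spelled exactly as in the output
            entry["inflectional_morphs"],
        ))
    return rows


analyses = summarize_analyses(raw)
for stem, base, deriv, infl in analyses:
    # Empty strings mean the form is not derived; show them as "-".
    print(stem, base or "-", deriv or "-", infl)
```

Note that the derivational key is spelled `derivatioanal_morph` in the output, so code reading the result has to use that spelling.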

The stem/root may contain '#number' indicating the homonymy index according to the Sanskrit Heritage Engine's lexicon.

In [6]:
sh.run_stemmer("hiwam", "word")

'{"word": ["hiwam"], "morph": [{"derived_stem": "hiwa#1", "base": "hi#2", "derivatioanal_morph": "pp.", "inflectional_morphs": ["n. sg. acc.", "n. sg. nom.", "m. sg. acc."]}, {"derived_stem": "hiwa#2", "base": "XA#1", "derivatioanal_morph": "pp.", "inflectional_morphs": ["n. sg. acc.", "n. sg. nom.", "m. sg. acc."]}]}'
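
When only the bare stem is needed, the homonymy index can be separated from it. This is a small sketch, assuming the index always follows a single `#` and is numeric, as in the examples above; `split_homonymy_index` is a hypothetical helper, not a function of `sh_segmenter`.

```python
def split_homonymy_index(stem):
    """Split 'hiwa#1' into ('hiwa', 1); return (stem, None) when no index is present."""
    name, sep, index = stem.partition("#")
    if sep and index.isdigit():
        return name, int(index)
    # No '#', or a non-numeric suffix: leave the stem untouched.
    return stem, None


print(split_homonymy_index("hiwa#1"))
print(split_homonymy_index("gam"))
```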

### Stemming a sentence

To analyse a whole sentence jointly, pass `sent-joint` as the `stemmer_format`.

In [7]:
sh.run_stemmer("kaSciw kAnwAvirahaguruNA svAXikArAw pramawwaH", "sent-joint")

'{"word": ["kaSciw kAnwA-viraha-guruNA svAXikArAw pramawwaH", "kaSciw kAnwA viraha-guruNA svAXikArAw pramawwaH", "kaH ciw-kAnwA-viraha-guruNA svAXikArAw pramawwaH", "kaH ciw-kAnwA viraha-guruNA svAXikArAw pramawwaH", "kaSciw kAnwA-viraha-guruNA sva-aXikArAw pramawwaH", "kaH ciw kAnwA-viraha-guruNA svAXikArAw pramawwaH", "kaSciw kAnwA viraha-guruNA sva-aXikArAw pramawwaH", "kaH ciw-kAnwA-viraha-guruNA sva-aXikArAw pramawwaH", "kaH ciw-kAnwA viraha-guruNA sva-aXikArAw pramawwaH", "kaH ciw kAnwA viraha-guruNA svAXikArAw pramawwaH"], "morph": [{"derived_stem": "kim", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["m. sg. nom."]}, {"derived_stem": "kiFciw", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["m. sg. nom."]}, {"derived_stem": "ciw#2", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["iic."]}, {"derived_stem": "ciw#2", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["m. sg. nom.", "n. sg. acc.", "n. sg. nom.", "f. sg. no

For sentence-based morph analysis, the `word` key holds the possible segmentation solutions, and the `morph` key holds the analyses of all the segments that occur in those solutions.
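
Each segmentation solution can be broken into its individual segments. The sketch below uses the first two solutions from the output above and assumes that spaces separate words and hyphens separate the components of a compound, which is how the output appears to be laid out; `split_solution` is a hypothetical helper.

```python
# First two segmentation solutions, copied from the "word" list in the output above.
solutions = [
    "kaSciw kAnwA-viraha-guruNA svAXikArAw pramawwaH",
    "kaSciw kAnwA viraha-guruNA svAXikArAw pramawwaH",
]


def split_solution(solution):
    """Split a solution into words, and each word into its hyphen-separated parts."""
    return [word.split("-") for word in solution.split()]


for solution in solutions:
    print(split_solution(solution))
```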

### Errors

Sometimes the stemmer produces an incorrect analysis. This can happen for the following reasons:
1. The Sanskrit Heritage Lexicon does not contain the intended stem or root (out-of-vocabulary words).
2. The input is segmented incorrectly.

Such cases are indicated by the following cues:
1. `derived_stem` has the same value as the input
2. `inflectional_morphs` has "?" as its entry

In [8]:
sh.run_stemmer("anasUyanwaH","word")

'{"word": ["anasUyanwaH"], "morph": [{"derived_stem": "anasUyanwaH", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["?"]}]}'
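
Unrecognized forms can be filtered out programmatically. This is a minimal sketch that checks only the "?" cue, on the assumption that it is sufficient to identify a failed analysis; the comparison of `derived_stem` with the input could be added in the same way. The helper name `unrecognized_entries` is hypothetical.

```python
import json

# Output string copied from the "anasUyanwaH" example above.
raw = '{"word": ["anasUyanwaH"], "morph": [{"derived_stem": "anasUyanwaH", "base": "", "derivatioanal_morph": "", "inflectional_morphs": ["?"]}]}'


def unrecognized_entries(raw_json):
    """Return the morph entries whose inflectional analysis contains '?'."""
    result = json.loads(raw_json)
    return [
        entry for entry in result["morph"]
        if "?" in entry["inflectional_morphs"]
    ]


failed = unrecognized_entries(raw)
print([entry["derived_stem"] for entry in failed])
```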

## Running the scripts directly

The script `sanskrit_stemmer.py` can be directly run with input_text and stemmer_format as arguments:

```
python3 sanskrit_stemmer.py "gacCawi" "word"
python3 sanskrit_stemmer.py "kaSciw kAnwAvirahaguruNA svAXikArAw pramawwaH" "sent-joint"
```

A file containing words or sentences (one per line) can also be passed to the script `sanskrit_stemmer_file.py`, along with an output file (JSON) and the stemmer_format. The results are written to the output file in JSON format.

```
python3 sanskrit_stemmer_file.py sample_input_words.txt sample_output_words.json "word"
python3 sanskrit_stemmer_file.py sample_input_sents.txt sample_output_sents.json "sent-joint"
```

A shell script `sanskrit_stemmer.sh` and sample input files (`sample_input_words.txt` and `sample_input_sents.txt`) are also provided.

## Additional Information

### Modes of Segmenter

There are two modes of the Segmenter:

1. In mode 1 (indicated by the parameter value 'g'), all segments are generated
2. In mode 2 (indicated by the parameter value 'b'), best segments are chosen from the set of all segments using some heuristics.

This stemmer uses the best mode to choose the best possible segments and rejects the less likely segments. Further, the best mode can be made to return either the best solution or the best 10 solutions.

The functions `request_word_analysis` and `request_sentence_analysis` initialize the parameters in the variable `env_vars`. The `mode` parameter can be modified as required:
* 'b' -> for the best 10 solutions
* 'f' -> for the first solution alone

Examples will be uploaded soon.

---
These examples are provided in the sample_stemming notebook