Uses n-grams to extract the words that appear in a specific order and at a specific distance from a given word, together with their frequency of occurrence, from a text file or from all files in a directory and its sub-directories.
You need Apache Spark, Python 3.*, and the dependency packages toolz and PyYAML installed globally:
Install Spark by following its official installation instructions.
If PySpark is not using Python 3, set the environment variable:
export PYSPARK_PYTHON=python3
Now clone the repository, install the requirements, and run the collector:
$ git clone https://github.com/Dineshkarthik/n-gram_processor.git
$ cd n-gram_processor
$ pip install -r requirements.txt
$ spark-submit ngrams_collector.py
The run is controlled by the following configuration values:
path: '/path/to/your/input/'
outpath: '/path/to/your/output/'
function: 'frequency'
order: 'right'
distance: 3
selector: 'python'
case_sensitive: 'yes'
- path - path to your input directory
- outpath - directory where the output files are written
- function - frequency or words
  - frequency - returns the frequency of the words found at the given distance, e.g. [('programming', 3), ('functions', 2)]
  - words - returns the set of words present within the given distance, e.g. ['python programming language', 'python function programming']
- order - left or right
- left - from left to right
- right - from right to left
- distance - position of the word to be selected
- selector - the word from which the distance is calculated
- case_sensitive - yes or no: yes if the text is case sensitive, otherwise no
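
To make the options above more concrete, here is a minimal plain-Python sketch (no Spark) of how function, order, distance, selector and case_sensitive could interact. It is not the code in ngrams_collector.py: the helper names windows, words and frequency are hypothetical, and it assumes that order 'right' collects the words following the selector, order 'left' collects the words preceding it, and frequency counts the word at the far end of each window. The real tool may interpret these options differently.

```python
from collections import Counter


def windows(text, selector, distance, order="right", case_sensitive=True):
    """Return every run of `distance` tokens that starts (order='right')
    or ends (order='left') at `selector`. Hypothetical helper."""
    if not case_sensitive:
        text, selector = text.lower(), selector.lower()
    tokens = text.split()
    found = []
    for i, token in enumerate(tokens):
        if token != selector:
            continue
        if order == "right":
            span = tokens[i:i + distance]
        else:
            span = tokens[max(0, i - distance + 1):i + 1]
        # Skip windows cut short by the start or end of the text.
        if len(span) == distance:
            found.append(span)
    return found


def words(text, selector, distance, order="right", case_sensitive=True):
    """Assumed behaviour of function 'words': the matching n-grams."""
    spans = windows(text, selector, distance, order, case_sensitive)
    return [" ".join(span) for span in spans]


def frequency(text, selector, distance, order="right", case_sensitive=True):
    """Assumed behaviour of function 'frequency': counts of the word
    sitting `distance` positions away from the selector."""
    spans = windows(text, selector, distance, order, case_sensitive)
    picked = [span[-1] if order == "right" else span[0] for span in spans]
    return Counter(picked).most_common()


sample = "python programming language and Python function programming"
print(words(sample, "python", 3, case_sensitive=False))
print(frequency(sample, "python", 3, case_sensitive=False))
```

With case_sensitive set to no, both "python" and "Python" match the selector, so the sample yields two three-word windows; with yes, only the first one matches.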