n-gram_processor

Uses n-grams to collect the words, and their frequency of occurrence, that appear in a specific order at a specific distance from a given word. The input can be a single text file or a directory and its sub-directories.

Installation

You need Apache Spark and Python 3, plus the dependency packages toolz and PyYAML, installed globally:

Install Spark by following its installation instructions.

If PySpark is not using Python 3, set the environment variable:

export PYSPARK_PYTHON=python3

Now clone the repository, install the requirements, and run the job:

$ git clone https://github.com/Dineshkarthik/n-gram_processor.git
$ cd n-gram_processor
$ pip install -r requirements.txt
$ spark-submit ngrams_processor.py

Configuration

path: '/path/to/your/input/'
outpath: '/path/to/your/output/'
function: 'frequency'
order: 'right'
distance: 3
selector: 'python'
case_sensitive: 'yes'

path - path to your input directory or text file
outpath - directory where the output files are written
function - frequency or words
- frequency - returns the frequency of the words found at the given distance, e.g. [('programming', 3), ('functions', 2)]
- words - returns the set of word sequences found within the given distance, e.g. ['python programming language', 'python function programming']
order - left or right
- left - read from left to right
- right - read from right to left
distance - position of the word to be selected, counted from the selector
selector - the word from which the distance is calculated
case_sensitive - yes or no; yes if the text should be matched case-sensitively, otherwise no
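
The options above can be illustrated with a minimal pure-Python sketch. It is not the project's Spark code, and the helper names (tokenize, windows, collect) are hypothetical. It rests on three assumptions: that distance counts tokens including the selector itself, that order 'left' scans left to right (the selector plus the tokens after it) while 'right' scans right to left (the selector plus the tokens before it), and that frequency counts the token at the far end of each window. The project may define any of these differently.

```python
from collections import Counter


def tokenize(text, case_sensitive="yes"):
    """Split on whitespace; lower-case first when case_sensitive is 'no'."""
    if case_sensitive == "no":
        text = text.lower()
    return text.split()


def windows(tokens, selector, distance, order="left"):
    """Yield every full window of `distance` tokens anchored at the selector.

    Assumption: 'left' scans left to right (tokens after the selector),
    'right' scans right to left (tokens before the selector).
    """
    for i, token in enumerate(tokens):
        if token != selector:
            continue
        if order == "left":
            window = tokens[i:i + distance]
        else:
            window = tokens[max(0, i - distance + 1):i + 1]
        # Skip windows cut short by the start or end of the text.
        if len(window) == distance:
            yield window


def collect(text, config):
    """Apply the 'words' or 'frequency' function to one piece of text."""
    tokens = tokenize(text, config["case_sensitive"])
    selector = config["selector"]
    if config["case_sensitive"] == "no":
        selector = selector.lower()
    found = list(windows(tokens, selector, config["distance"], config["order"]))
    if config["function"] == "words":
        # Unique word sequences, sorted for a stable result.
        return sorted({" ".join(window) for window in found})
    # 'frequency' (assumed): count the token farthest from the selector.
    far_end = -1 if config["order"] == "left" else 0
    return Counter(window[far_end] for window in found).most_common()


config = {
    "function": "words",
    "order": "left",
    "distance": 3,
    "selector": "python",
    "case_sensitive": "yes",
}
text = ("python programming language and python function programming "
        "with python programming language")
print(collect(text, config))
print(collect(text, dict(config, function="frequency")))
```

Under these assumptions, the words call returns the two distinct three-word sequences that start with the selector, and the frequency call returns how often each third word occurs.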
