n-gram_processor

Using n-grams, get the set of words, and their frequency of occurrence, that appear in a specific order at a specific distance from a given word in a directory, sub-directory, or text file.

Installation

You need Apache Spark, Python 3.*, and the dependency packages toolz and PyYAML installed globally:

Install Spark using the instructions.

If PySpark is not using Python 3, set the environment variable:

export PYSPARK_PYTHON=python3

Now clone the repository, install the requirements, and run the job:

$ git clone https://github.com/Dineshkarthik/n-gram_processor.git
$ cd n-gram_processor
$ pip install -r requirements.txt
$ spark-submit ngrams_collector.py

Configuration

path: '/path/to/your/input/'
outpath: '/path/to/your/output/'
function: 'frequency'
order: 'right'
distance: 3
selector: 'python'
case_sensitive: 'yes'
  • path - path to your input directory
  • outpath - directory where the output files are written
  • function - frequency or words
    • frequency - returns the frequency of the words found at the given distance, e.g. [('programming', 3), ('functions', 2)]
    • words - returns the set of word sequences found within the given distance, e.g. ['python programming language', 'python function programming']
  • order - left or right
    • left - from left to right
    • right - from right to left
  • distance - position of the word to be selected, counted from the selector
  • selector - word from which the distance is calculated
  • case_sensitive - yes or no; yes if the text should be matched case-sensitively, otherwise no
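
To make the options above concrete, here is a minimal plain-Python sketch of how selector, distance, order and case_sensitive could interact. It is not the code in ngrams_collector.py and does not use Spark. The function names `ngrams` and `collect` are hypothetical, and the interpretation is an assumption drawn from the examples above: distance is taken as the n-gram length, order 'right' puts the selector at the start of the n-gram, and order 'left' puts it at the end. Text is split on whitespace only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def collect(text, selector, distance, order="right", case_sensitive=True):
    """Return (phrases, frequencies) for n-grams anchored on `selector`.

    Assumed semantics (not confirmed by the project):
      - `distance` is the n-gram length;
      - order='right': selector is the first word, the counted word is the last;
      - order='left':  selector is the last word, the counted word is the first.
    """
    if order not in ("left", "right"):
        raise ValueError("order must be 'left' or 'right'")
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if not case_sensitive:
        text, selector = text.lower(), selector.lower()

    tokens = text.split()  # naive whitespace tokenizer, punctuation is kept
    phrases = []
    far_words = Counter()
    for gram in ngrams(tokens, distance):
        if order == "right":
            anchor, far = gram[0], gram[-1]
        else:
            anchor, far = gram[-1], gram[0]
        if anchor == selector:
            phrases.append(" ".join(gram))
            far_words[far] += 1
    return phrases, far_words.most_common()


sample = ("python programming language and python function programming "
          "with Python programming language")
phrases, freq = collect(sample, "python", 3, order="right", case_sensitive=True)
print(phrases)  # ['python programming language', 'python function programming']
print(freq)     # [('language', 1), ('programming', 1)]
```

Under these assumptions, the first returned value corresponds to the words function and the second to the frequency function. With case_sensitive set to False, the capitalized "Python" in the sample is matched as well, so 'language' is counted twice.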
