Skip to content

extract topics from corpus. Supervised and Unsupervised modes.

Notifications You must be signed in to change notification settings

PyAntony/nlp-topic-modeling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NLP Topic Modeling

Topic modeling application to extract top features/topics. It supports supervised and unsupervised modes.

Usage

  • You can use the nlp_env.yml file to create your conda environment. Example:
conda env create -f /home/user/Desktop/nlp_env.yml

Otherwise you can manually install the needed packages (Anaconda is recommended).

  • IMPORTANT: after installing Spacy you must download the required model running:
python -m spacy download en_core_web_sm
  • To see all options go to the root directory and run:
python nlp.py --help
Usage: nlp.py [OPTIONS]

  Topic modeling application. Modes:

  --> Supervised (default mode):

  Logistic Regression is used to get top coefficients. Application prints
  and saves the top coefficients (log of the odds), the classes support, and
  the trained model in the 'output' directory.

  Usage: place your file (csv of json) in the 'input' directory and specify
  your X and y labels using the command line options. Application expects 1
  file only, if multiple files are present only the first one is considered.
  To run navigate to root directory and type "python nlp.py" + options,
  e.g.:

  "python nlp.py --x_train text --y_train target"

  --> Unsupervised:

  latent Dirichlet allocation (LDA) is used with multiple components
  (topics). It is activated when passing the argument 'lda_components'.
  Multiple files can be processed. A directory is created inside the
  'output' directory per file processed. Output includes, per file, per
  number of topics:

  - Count of documents with topic (png file).
  - Percentage of documents with topic (png file).
  - Total percentage across documents normalized (png file).
  - Top features/words (csv file).

  Also, per file:

  - Line plot with likelihood for all number of topics (png file).
  - Line plot with perplexity for all number of topics (png file).

  Usage: place your file or files in the 'input' directory and pass the
  range of topics desired using the 'lda_components' option, e.g.:

  "python nlp.py --x_train text --lda_components 2-8"

Options:
  --x_train TEXT         X_train column name. Multiple column names can be
                         passed with '*' between. Text for all columns will be
                         concatenated as one.  [default: X_train]
  --y_train TEXT         y_train column name. There must be 1 label per
                         instance.  [default: y_train]
  --n_features TEXT      Get the top n_features.  [default: 20]
  --lda_components TEXT  Range of LDA components to process. Example: '2-6'.
  --help                 Show this message and exit.

To Do

  • [] Validate options before computing.
  • [] Add output examples to documentation.
  • [] Use cross validation to get LDA scores (perhaps).
  • [] Decouple the production of the plots (matplotlib objects) and saving to directories (no side effects).
  • [] Add module for Logistic Regression prediction.

About

extract topics from corpus. Supervised and Unsupervised modes.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages