nikitaeverywhere/edu-text-analysis-experiments

TF-IDF, Sigma and Other (Experimental) Text Analysis Tools

TF-IDF and Sigma analysis written in Python, with results written to *.xlsx spreadsheets for detailed analysis.

TF-IDF analysis detects the most "important" words in a given text from a text corpus (a set of articles, etc.). These "important" words are those that occur frequently in a particular document but rarely in the other documents of the same corpus.
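
As a rough illustration of the idea, here is a minimal sketch of a textbook TF-IDF score (term frequency times the logarithm of inverse document frequency). The function name `tf_idf`, the toy corpus, and the exact weighting formula are assumptions made for this example; tf-idf.py in this repository may tokenize and weight words differently.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Score every word of every document.

    corpus: dict mapping a document name to its list of tokens.
    Returns a dict mapping document name -> {word: TF-IDF score}.
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in corpus.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            # TF = share of the document taken by the word,
            # IDF = log(number of documents / documents containing the word).
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Toy corpus (hypothetical data, not from the repository).
corpus = {
    "a": "the cat sat on the mat".split(),
    "b": "the dog sat on the log".split(),
    "c": "the cat chased the dog".split(),
}
scores = tf_idf(corpus)
top_word_in_a = max(scores["a"], key=scores["a"].get)
print(top_word_in_a)
```

In this toy corpus, "the" appears in every document and therefore scores zero, while a word unique to one document ranks highest there.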

While TF-IDF analysis suits a set of articles, Sigma analysis finds the most "important" words in a single, usually large text (a book, a long document, etc.).
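
The README does not define the Sigma score, so the sketch below rests on an assumption: that "Sigma" means the normalized standard deviation of the gaps between successive occurrences of a word, a common single-text keyword measure. Words that cluster in certain parts of the text get a high score, while evenly spread words score close to zero. The function name `sigma_scores` and the `min_count` parameter are hypothetical, and sigma.py may compute something different.

```python
import math
from collections import defaultdict


def sigma_scores(tokens, min_count=3):
    """Return {word: sigma} for words occurring at least min_count times.

    sigma = standard deviation of the gaps between successive
    occurrences of the word, divided by the mean gap (assumed definition).
    """
    positions = defaultdict(list)
    for index, word in enumerate(tokens):
        positions[word].append(index)

    scores = {}
    for word, pos in positions.items():
        if len(pos) < min_count:
            continue  # too few occurrences for a meaningful spread
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean = sum(gaps) / len(gaps)
        variance = sum(g * g for g in gaps) / len(gaps) - mean * mean
        # Guard against tiny negative values caused by rounding.
        scores[word] = math.sqrt(max(variance, 0.0)) / mean
    return scores


# Toy text: "x" is evenly spaced, "y" is clustered near the start.
tokens = "x y y x y a x b c x d e x y".split()
sigma = sigma_scores(tokens)
```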

There are a couple of more advanced scripts:

  • Matrix output for Gephi in gephi.py. A sample output file, gephi.csv, is included in this repository.
  • Horizontal visibility graph building with hor-vis-graph.py. A couple of sample files are included in the hor-vis-graph/ directory.
  • Other experiments (see below)
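
For readers unfamiliar with horizontal visibility graphs, here is a minimal sketch of the standard construction over a numeric series: two points are linked when every value strictly between them is lower than both. The function name `horizontal_visibility_edges` and the sample series are assumptions for this example. How hor-vis-graph.py turns a text into a series is not described here, so the sketch starts from plain numbers.

```python
def horizontal_visibility_edges(series):
    """Return the edges (i, j), i < j, of the horizontal visibility graph.

    Nodes i and j are connected when all values strictly between them
    are smaller than min(series[i], series[j]).
    """
    edges = []
    n = len(series)
    for i in range(n - 1):
        highest_between = float("-inf")  # tallest value seen between i and j
        for j in range(i + 1, n):
            if highest_between < min(series[i], series[j]):
                edges.append((i, j))
            highest_between = max(highest_between, series[j])
            # Once something at least as tall as series[i] appears,
            # nothing further to the right is visible from i.
            if highest_between >= series[i]:
                break
    return edges


edges = horizontal_visibility_edges([3, 1, 2, 4])
print(edges)
```

Neighbouring points are always connected, so the graph of a series with n points has at least n - 1 edges.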

Preview / Examples

Experimental semantic network builder (main concepts from this article):

Graph

TF-IDF applied to a corpus of news articles:

Excel Spreadsheet

Sigma method applied to the book "The Hunger Games":

Excel Spreadsheet

Analysis of an article about Putin using a horizontal visibility graph, together with a corpus of other articles:

Graph

Usage

  1. Install Python 3, clone the repository, and enter the repository directory with cd edu-text-analysis-experiments.
  2. Install required dependencies: pip3 install -r requirements.txt.
  3. Place the texts to analyze in the texts/ directory (a few are already included).
  4. Run an analyzer, for example with the py tf-idf.py command (there are several scripts; see the examples that follow).

Example

TF-IDF: Run the program (by default, picks texts from texts/news):
py tf-idf.py

Result:

Reading texts...
Done! Computing TF-IDF ranks...
Progressing text 2225/2225
Done! Writing results...
Writing worksheet 2225/2225
Done!

Output goes to the tf-idf.xlsx file, ready for analysis.

Sigma method (by default, picks texts from texts/books):
py sigma.py

The result goes to the sigma.xlsx file.

Horizontal Visibility Graph (exports to hor-vis-graph/ directory, picks from texts/news):
py hor-vis-graph.py

Check the result in the hor-vis-graph/ directory and visualize it using Gephi.
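
If you build a graph yourself and want to open it in Gephi, one simple option is an edge-list CSV with Source and Target columns, which Gephi's spreadsheet import recognizes. The sketch below is an assumption-laden example, not the format this repository produces: the `write_edge_list` helper is hypothetical, and the files exported by hor-vis-graph.py and gephi.py (the latter is described above as matrix output) may be laid out differently.

```python
import csv
import io


def write_edge_list(edges, out):
    """Write (source, target) pairs as a CSV edge list with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)


# Write to an in-memory buffer here; pass an open file object
# (opened with newline="") to write to disk instead.
buffer = io.StringIO()
write_edge_list([(0, 1), (0, 2)], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```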

Individual Text Analysis

Run the experimental semantic network builder with

py analyze_text.py texts/news/tech/001.txt

Check the result in the analyzed/<text-title> directory and visualize it using Gephi.

License

MIT © Nikita Savchenko