GitHub - IraPS/rusdracor_topic_modeling: Topic Modeling 200 Years of Russian Drama

This is the repository for the project "Topic Modeling 200 Years of Russian Drama". It contains:

- the processed TEI XML files, with the characters' proper names excluded
- the script for extracting text from the TEI files
- the stop-word and proper-name lists, and the script for removing them
- the preprocessed corpora of 90 Russian plays
  - each folder has the subfolders byauthor, bycharacter, byplay, bysex
  - lemmatisation and POS tagging were done with the pymystem3 Python module (a wrapper for Mystem)
- the corpus version that was used for the project (the Only Nouns corpus)
  - it also includes the subfolders bygenre and byyear_range
  - to check out the topic modeling (TM), which models noun-based topics only, you need only this folder
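
The noun-only restriction can be illustrated with a minimal sketch. It assumes the list-of-dicts shape that pymystem3's `Mystem().analyze()` returns (each token has a `text` key and, for recognised words, an `analysis` list whose entries carry `lex` and `gr`). The helper name `noun_lemmas` and the sample data are hypothetical; this is not the repository's own extraction script.

```python
import re


def noun_lemmas(analysis):
    """Return lemmas of tokens whose first Mystem reading is a noun (POS tag 'S')."""
    lemmas = []
    for token in analysis:
        readings = token.get("analysis")
        if not readings:  # whitespace, punctuation, or an unrecognised word
            continue
        best = readings[0]  # take the first reading offered
        # The POS tag is the part of the grammar string before the first ',' or '='.
        pos = re.split(r"[,=]", best["gr"], maxsplit=1)[0]
        if pos == "S":
            lemmas.append(best["lex"])
    return lemmas


# Hand-written sample in the assumed shape. With pymystem3 installed, one
# would instead obtain it with: analysis = Mystem().analyze(text)
sample = [
    {"text": "мама", "analysis": [{"lex": "мама", "gr": "S,жен,од=им,ед"}]},
    {"text": " "},
    {"text": "мыла", "analysis": [{"lex": "мыть", "gr": "V,несов,пе=прош,ед,изъяв,жен"}]},
    {"text": " "},
    {"text": "раму", "analysis": [{"lex": "рама", "gr": "S,жен,неод=вин,ед"}]},
    {"text": "\n"},
]
print(noun_lemmas(sample))
```

Comparing the tag for exact equality with `S` keeps other tags that merely start with the same letter, such as `SPRO` for pronouns, out of the result.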

Action	Description	Dependencies
stopwords_and_others/extract_capitalised_words.py	Extracting all capitalised words not at the beginning of a sentence	os, re, nltk
stopwords_and_others/characters(proper)_names.txt	Filtering the list to keep only the characters' proper names
stopwords_and_others/remove_characters(proper)_names_from_TEI.py	Removing proper names from the TEI documents	os, re
scripts_for_text_extraction/get_plays_texts_clean_POS_restriction.py	Extracting characters' speech texts from the TEI documents with POS restrictions (different options available)	os, re, codecs, glob, lxml, pymystem3
classification_using_TM_vectors_gender.py	Trying to choose the best model using a character-gender classification task	sklearn
semantic_vectors.py	Choosing the best model by calculating the "semdensity" of topics	sklearn, numpy, glob, re, matplotlib, wordcloud, random, gensim, logging, pymystem3, a pre-downloaded vector model
topic_modeling_predict_year.py	Applying the model to spot topics' temporal distribution	sklearn, numpy, glob, re, matplotlib, wordcloud, random
topic_modeling_predict_genre.py	Applying the model to spot topics' distribution by genre	sklearn, numpy, glob, re, matplotlib, wordcloud, random
topic_modeling_predict_author.py	Applying the model to spot topics' distribution by author	sklearn, numpy, glob, re, matplotlib, wordcloud, random
topic_modeling_predict_gender.py	Applying the model to spot topics' distribution by character's gender	sklearn, numpy, glob, re, matplotlib, wordcloud, random
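
The speech-extraction step listed in the table can be sketched as follows. This is a simplified illustration, not the repository's script: it uses the standard-library `xml.etree.ElementTree` rather than lxml, and it assumes common TEI drama markup (`<sp who="#id">` wrapping `<speaker>`, `<stage>`, `<p>` and `<l>`), which may differ from the actual files. The function name `speeches_by_character` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation


def speeches_by_character(tei_xml):
    """Map each speaker id (the `who` attribute of <sp>) to that speaker's text."""
    root = ET.fromstring(tei_xml)
    speeches = defaultdict(list)
    for sp in root.iter(TEI + "sp"):
        who = sp.get("who", "unknown").lstrip("#")
        # Keep prose (<p>) and verse (<l>) in document order;
        # <speaker> labels and <stage> directions are skipped.
        for node in sp.iter():
            if node.tag in (TEI + "p", TEI + "l"):
                text = " ".join("".join(node.itertext()).split())
                if text:
                    speeches[who].append(text)
    return {who: " ".join(parts) for who, parts in speeches.items()}


sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<sp who="#anna"><speaker>Анна</speaker><p>Добрый вечер.</p></sp>
<sp who="#ivan"><speaker>Иван</speaker><stage>входит</stage><l>Здравствуйте,</l><l>сударыня.</l></sp>
<sp who="#anna"><speaker>Анна</speaker><p>Садитесь.</p></sp>
</body></text></TEI>"""

for who, text in speeches_by_character(sample).items():
    print(who, "->", text)
```

Grouping by the `who` attribute mirrors the bycharacter layout of the corpora. Grouping by play, author or sex would need metadata that this sketch does not read.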
