# Text Analysis with Topic Models for the Humanities and Social Sciences

*Text Analysis with Topic Models for the Humanities and Social Sciences* (TAToM)
is a series of tutorials on basic procedures in quantitative text analysis. The
tutorials cover preparing a text corpus for analysis and exploring a collection
of texts using topic models and machine learning.

The procedures range from basic to somewhat advanced, and the tutorials make
extensive use of the Python programming language to organize, analyze, and
visualize data.

These tutorials are written by [Allen Riddell](http://ariddell.org). Comments
are welcome, as are reports of bugs and typos; please submit them through the
[project’s issue tracker](https://github.com/ariddell/tatom/issues).

## Contents

- [Preliminaries](preliminaries.ipynb)
  - [Required Python packages](preliminaries.ipynb#required-python-packages)
  - [Installing Python](preliminaries.ipynb#installing-python)
  - [Installing Python packages](preliminaries.ipynb#installing-python-packages)
- [Getting started](getting_started.ipynb)
  - [For those new to Python](getting_started.ipynb#for-those-new-to-python)
  - [For those new to NumPy](getting_started.ipynb#for-those-new-to-numpy)
  - [For those new to Matplotlib](getting_started.ipynb#for-those-new-to-matplotlib)
- [Working with text](working_with_text.ipynb)
  - [Creating a document-term matrix](working_with_text.ipynb#creating-a-document-term-matrix)
  - [Comparing texts](working_with_text.ipynb#comparing-texts)
  - [Visualizing distances](working_with_text.ipynb#visualizing-distances)
  - [Clustering texts based on distance](working_with_text.ipynb#clustering-texts-based-on-distance)
  - [Exercises](working_with_text.ipynb#exercises)
- [Preprocessing](preprocessing.ipynb)
  - [Tokenizing](preprocessing.ipynb#tokenizing)
  - [Chunking](preprocessing.ipynb#chunking)
  - [Grouping](preprocessing.ipynb#grouping)
  - [Exercises](preprocessing.ipynb#exercises)
- [Feature selection: finding distinctive words](feature_selection.ipynb)
  - [Measuring “distinctiveness”](feature_selection.ipynb#measuring-distinctiveness)
  - [Bayesian group comparison](feature_selection.ipynb#bayesian-group-comparison)
  - [Log likelihood ratio and $\chi^2$ feature selection](feature_selection.ipynb#log-likelihood-ratio-and-chi-2-feature-selection)
  - [Mutual information feature selection](feature_selection.ipynb#mutual-information-feature-selection)
  - [Feature selection as exploratory data analysis](feature_selection.ipynb#feature-selection-as-exploratory-data-analysis)
  - [Exercises](feature_selection.ipynb#exercises)
- [Topic modeling with MALLET](topic_model_mallet.ipynb)
  - [Running MALLET](topic_model_mallet.ipynb#running-mallet)
  - [Processing MALLET output](topic_model_mallet.ipynb#processing-mallet-output)
  - [Inspecting the topic model](topic_model_mallet.ipynb#inspecting-the-topic-model)
- [Topic modeling in Python](topic_model_python.ipynb)
  - [Using non-negative matrix factorization](topic_model_python.ipynb#using-non-negative-matrix-factorization)
  - [Inspecting the NMF fit](topic_model_python.ipynb#inspecting-the-nmf-fit)
- [Visualizing topic models](topic_model_visualization.ipynb)
  - [Visualizing topic shares](topic_model_visualization.ipynb#visualizing-topic-shares)
  - [Visualizing topic-word associations](topic_model_visualization.ipynb#visualizing-topic-word-associations)
- [Visualizing trends](visualizing_trends.ipynb)
  - [Plotting trends](visualizing_trends.ipynb#plotting-trends)
- [Classification, Machine Learning, and Logistic Regression](classification_logistic_regression.ipynb)
  - [Predicting genre classifications](classification_logistic_regression.ipynb#predicting-genre-classifications)
  - [Logistic regression](classification_logistic_regression.ipynb#logistic-regression)
- [Case study: Racine’s early and late tragedies](case_study_racine.ipynb)
  - [Corpus: sixty tragedies](case_study_racine.ipynb#corpus-sixty-tragedies)
  - [Racine’s atypical plays](case_study_racine.ipynb#racine-s-atypical-plays)
- [Datasets](datasets.ipynb)
  - [British novels](datasets.ipynb#british-novels)
  - [French plays](datasets.ipynb#french-plays)
  - [Les Misérables](datasets.ipynb#les-miserables)
  - [Stopword lists](datasets.ipynb#stopword-lists)
- [References](references.ipynb)


![Heatmap of document-topic shares](_static/plot_doctopic_heatmap.png)  
![Plot of word-topic associations](_static/plot_word_topic.png)