Digital technologies have made vast amounts of text available to researchers, and this same technological moment has given us the capacity to analyze that text faster than humanly possible. The first step in that analysis is to transform texts designed for human consumption into a form a computer can analyze. Using Python and the Natural Language Toolkit (commonly called NLTK), this workshop introduces strategies to turn qualitative texts into quantitative objects. Through that process, we will present a variety of strategies for simple analysis of text-based data.

In this workshop, you will learn how to:

  • Prepare texts for computational analysis, including strategies for transforming texts into numbers
  • Use NLTK methods such as concordance and similar
  • Clean and standardize your data with powerful tools such as stemmers and lemmatizers
  • Compare the frequency distributions of words in a text to quantify the narrative arc
  • Recognize stop words and remove them when needed
  • Use part-of-speech tagging to gather insights about a text
  • Transform any document that you have (or have access to) in .txt format into a text that can be analyzed computationally
  • Tokenize your data and put it in an NLTK-compatible format
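
As a rough preview of several of the skills listed above, the following sketch tokenizes a short string, cleans it, removes stop words, stems the remaining words, and counts them. It is a minimal illustration, not the workshop's own code: the sample sentence and the tiny `STOP_WORDS` set are assumptions made up for this example, and `wordpunct_tokenize` is used only because it needs no extra NLTK data downloads.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample text; in practice this would be read from a .txt file.
raw_text = "The cat runs. The cats were running, and the dog watched the cat run."

# Tokenize: split the raw string into word and punctuation tokens.
tokens = wordpunct_tokenize(raw_text)

# Clean: lowercase every token and keep only the alphabetic ones.
words = [t.lower() for t in tokens if t.isalpha()]

# Remove stop words. This three-word set is illustrative only;
# NLTK also ships a fuller stop word list as a separate download.
STOP_WORDS = {"the", "and", "were"}
content_words = [w for w in words if w not in STOP_WORDS]

# Stem: reduce each word to a common root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in content_words]

# Count how often each stem occurs.
freq = FreqDist(stems)
print(freq.most_common(2))

# Show each occurrence of "cat" with its surrounding context.
Text(words).concordance("cat")
```

A stemmer is used here rather than a lemmatizer because NLTK's WordNet lemmatizer requires downloading the WordNet data first.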

This workshop is estimated to take you 10 hours to complete.

Get Started


  1. Text as Data
  2. Cleaning and Normalizing
  3. Using the NLTK Corpus
  4. Searching for Words
  5. Positioning Words
  6. Types vs. Tokens
  7. Length and Unique Words
  8. Lexical Density
  9. Data Cleaning: Removing Stop Words
  10. Data Cleaning: Lemmatizing Words
  11. Data Cleaning: Stemming Words
  12. Data Cleaning: Results
  13. Make Your Own Corpus
  14. Make Your Own Corpus (continued)
  15. Part-of-Speech Tagging
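
To make the difference between types and tokens concrete, here is a small sketch in plain Python. It assumes that lexical density means the ratio of unique words (types) to total words (tokens); the lessons may define it differently. The function name `lexical_density` and the sample sentence are hypothetical, chosen only for illustration.

```python
import re


def lexical_density(text: str) -> float:
    """Return the ratio of unique words (types) to total words (tokens)."""
    # Simplified tokenization: lowercase runs of letters only, so a word
    # like "don't" is split in two. Good enough for a quick illustration.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    types = set(tokens)
    return len(types) / len(tokens)


sample = "the cat sat on the mat and the dog sat too"
sample_tokens = re.findall(r"[a-z]+", sample.lower())

print(len(sample_tokens))       # total words (tokens)
print(len(set(sample_tokens)))  # unique words (types)
print(round(lexical_density(sample), 3))
```

A higher ratio means less repetition in the text. As the ethical considerations in this workshop point out, that number alone says nothing about the quality of the writing.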

Before you get started

If you do not have experience or basic knowledge of the following workshops, you may want to look into those before you start with Text Analysis with Python and NLTK:

Ethical Considerations

Before you start the Text Analysis with Python and NLTK workshop, we want to remind you of some ethical considerations to take into account when you read through the lessons of this workshop:

  • In working with massive amounts of text, it is natural to lose the original context. We must be aware of that and be careful when analyzing it.
  • It is important to constantly question our assumptions and the indexes we are using. Numbers and graphs do not tell the story; our analysis does. We must be careful not to draw hasty and simplistic conclusions about things that are complex. Just because we found out that author A uses more unique words than author B, does it mean that A is a better writer than B?

Pre-reading suggestions

Before you start the Text Analysis with Python and NLTK workshop, you may want to read a couple of our pre-reading suggestions:

Projects that use these skills

You may also want to check out a couple of projects that use the skills discussed in this workshop:

This workshop is the result of a collaborative effort by a team of people, most of them currently or formerly involved with the Graduate Center's Digital Initiatives. If you want to see statistics for contributions to this workshop, you can do so here. This is a list of all the contributors:

Digital Research Institute (DRI) Curriculum by Graduate Center Digital Initiatives is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at When sharing this material or derivative works, preserve this paragraph, changing only the title of the derivative work, or provide comparable attribution.
