MS&E 231 Discussion Section: Text Processing and NLP
This discussion section will provide a brief introduction to a few different aspects of text processing and NLTK. There is an incredible wealth of information and software out there to make your life easier and help you accomplish a wide variety of language-related tasks; today we'll look at just a small subset, via three notebooks:
- Text processing and topic modeling
- Document vector representation and toxic comment classification
  - using scikit-learn
- Word embedding visualization
  - using GloVe word vectors and gensim (again)
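To give a feel for what "document vector representation" means before opening the notebooks, here is a minimal plain-Python sketch. It is not the notebooks' code: the toy sentences and the helper names (`tokenize`, `bag_of_words`, `cosine_similarity`) are made up for illustration, and scikit-learn's `CountVectorizer` and `TfidfVectorizer` do the same job far more thoroughly.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Represent a document as a sparse vector of word counts."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Counter returns 0 for missing words, so unshared words contribute nothing.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, chosen only to make the overlap easy to see.
docs = [
    "You are a wonderful person",
    "You are a terrible person",
    "Topic models group related words",
]
vectors = [bag_of_words(d) for d in docs]

print(cosine_similarity(vectors[0], vectors[1]))  # high: four of five words shared
print(cosine_similarity(vectors[0], vectors[2]))  # zero: no words shared
```

Note that the first two sentences come out as very similar even though their sentiment is opposite. Raw word counts alone are a weak signal, which is roughly why a trained classifier sits on top of the vectors.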
scikit-learn (e.g. using
nltk requires a further step to download corpora; I would recommend simply grabbing everything with
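A minimal sketch of that download step, assuming nltk is already installed and that the `all` collection is what you want. The wrapper name `download_nltk_data` is hypothetical; the underlying call is `nltk.download`, which fetches data into your local `nltk_data` directory and returns `True` on success.

```python
def download_nltk_data(collection="all", quiet=True):
    """Download an NLTK data collection (everything, by default).

    `collection` is an NLTK package or collection identifier; "all" is large,
    so "popular" may be a lighter choice if disk space or bandwidth is tight.
    """
    import nltk  # imported here so the helper can be defined before nltk is installed

    return nltk.download(collection, quiet=quiet)


# Uncomment to run the download (needs a network connection):
# download_nltk_data("all")
```

Running `nltk.download()` with no arguments instead opens an interactive downloader, which is handy if you would rather pick individual corpora.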
Let's get started!
Further NLP resources
We just scratched the surface in lab, and looked at these packages:
There are many other great pieces of software, among them:
- spaCy (does too many things to name)
- MALLET (document classification, topic modeling, sequence tagging)
If you're working with sentiment analysis of online speech, you might consider the Perspective API.
One Stanford course with many more resources to check out is CS 224N.