
MS&E 231 Discussion Section: Text Processing and NLP

This discussion section provides a brief introduction to a few aspects of text processing and NLP. There is an incredible wealth of information and software out there to make your life easier and help you accomplish a wide variety of language-related tasks, but today we'll look at just a small subset, via three notebooks:

  • Text processing and topic modeling (TopicModeling.ipynb)
  • Document vector representation and toxic comment classification (ToxicCommentLearning.ipynb)
  • Word embedding visualization (WordEmbeddingViz.ipynb)
    • using GloVe word vectors and gensim (again)
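
To make "document vector representation" a little more concrete before opening the notebooks, here is a minimal sketch using only the Python standard library. It is not taken from the notebooks, which rely on the libraries listed under Prereqs; the function names `bag_of_words` and `cosine_similarity`, the crude regex tokenizer, and the three example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, pull out alphabetic tokens, and count them."""
    # Crude tokenizer (an assumption for this sketch, not nltk's tokenizer).
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    # A Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example documents.
docs = [
    "You are a wonderful person",
    "What a wonderful day",
    "Topic models group words into topics",
]
vectors = [bag_of_words(d) for d in docs]

# Documents that share words score above zero; documents with no shared words score 0.0.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Each document becomes a vector of word counts, and documents that share vocabulary end up closer together. Weighted representations such as TF-IDF and learned embeddings refine this same basic idea.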

Prereqs

Install nltk, gensim, and scikit-learn (e.g. using pip3 install).

Once installed, nltk requires a further step to download corpora; I would recommend simply grabbing everything with nltk.download('all').
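
If you want to confirm the installs worked before section, a small check along these lines may help. This is a sketch, not part of the lab materials: `REQUIRED` and `missing_packages` are names made up for this example. It only checks whether each module can be found and does not download anything.

```python
import importlib.util

# Maps each pip package name to the module name you import.
# Note that scikit-learn is imported as `sklearn`.
REQUIRED = {"nltk": "nltk", "gensim": "gensim", "scikit-learn": "sklearn"}


def missing_packages(required):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("All prerequisites found.")
```

Once nothing is reported missing, run `import nltk` followed by `nltk.download('all')` in a Python session to fetch the corpora. Expect that download to take a while, since it grabs everything.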

Let's get started!

Further NLP resources

We just scratched the surface in lab, and looked at these packages:

  • nltk
  • gensim
  • scikit-learn

There are many other great pieces of software, among them:

  • spaCy (does too many things to name)
  • MALLET (document classification, topic modeling, sequence tagging)

If you're working with sentiment analysis of online speech, you might consider the Perspective API.

One Stanford course with many more resources to check out is CS 224N.

If you're interested in keeping up with state-of-the-art language understanding methods, two of them (as of this writing) are BERT and XLNet.
