### Working with unstructured data (Module 07)


- A huge area. We will just scratch the surface.
- We will focus on text, but this also applies to images, audio data, some sensor data ...  
- I first heard about this area ~7 years ago and was very excited. I think it is a little more common these days? 
- Why text? People interact with each other through language, so computational text analysis is a neat way to combine interests in computers & human society.
- Jargon: "natural language processing," "computational linguistics," "text as data," "NLP+CSS." These things have slightly different meanings, but there is a lot of overlap.

#### Resources

If six 50-minute sessions working with text data is not enough for you, there are many ways to keep going with it.

- Courses:
    - [LING 1200](https://catalog.colorado.edu/courses-a-z/ling/) (cross-listed with INFO)
    - [MS degree at CU](https://www.colorado.edu/linguistics/graduate-program/computational-linguistics-clasic-ms)


- Books: 
    - [Intro to Natural Language Processing](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf) (Available as hardback, or use the linked PDF off GitHub.)
    - [Foundations of Statistical Natural Language Processing](https://nlp.stanford.edu/fsnlp/) (Older textbook, still good. You can find PDFs online.)
    
    
- Conferences:
    - [Text as data](https://www.textasdata2019.net/) (Political science + computer science)
    - [C+J](http://cplusj.org/) (Computation + Journalism. Not strictly about text, but text comes up a lot)
    - [*CL](http://aclweb.org/) (Umbrella org for conferences focused on computers + text, aka "natural language processing")
    - [#NLProc](https://twitter.com/hashtag/nlproc?lang=en) (A group of very knowledgeable researchers and practitioners talk about this stuff all day on Twitter. If you want to keep up with the latest and greatest, this is a good way to do so.)
    

- Software:
    - [NLTK](https://github.com/nltk/nltk) Very popular software for NLP in Python. Common entry into NLP.
    - [spaCy](https://spacy.io/) Another popular Python NLP library. Much faster than NLTK, and perhaps better maintained.
    - [Hugging Face](https://github.com/huggingface) New NLP library focused on specific kinds of neural networks that are very popular. You might find this one hard to work with, but good to know about.
    - [AllenNLP](https://allennlp.org/) Another new-ish one that is good to just be aware of. More focused on research.

### Structured vs. unstructured data

Examples:
- [Structured](https://github.com/nytimes/covid-19-data/blob/master/us.csv)
- [Unstructured](https://www.reddit.com/r/cuboulder/)
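
To make the contrast concrete, here is a minimal sketch. The rows and the post below are made up for illustration (they are not copied from the linked pages), and the column names are just an assumption in the style of a simple date/cases/deaths table.

```python
import csv
import io

# Made-up structured data: every row has the same named fields.
structured = """date,cases,deaths
2020-03-01,10,0
2020-03-02,15,1
"""

rows = list(csv.DictReader(io.StringIO(structured)))
# Because the fields are named, we can look values up directly.
cases_by_date = {row["date"]: int(row["cases"]) for row in rows}
print(cases_by_date["2020-03-02"])  # 15

# Made-up unstructured data: just a string of characters.
post = "Anyone know if the rec center is open over spring break?"
# There are no fields to look up, so we have to work with the text itself.
print("rec center" in post.lower())  # True
```

With the table, the question "how many cases on this date?" is a lookup. With the post, even "what is this about?" takes some processing.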

Questions:
- What is the difference between structured and unstructured?
- What are other examples of unstructured data?

### Text step 1: Tokenization 

- Text usually starts as a string of characters
- We need to take that string of characters and turn it into a list of words
- Tokenization is the process of doing so
- _Sorry, 4604 students. Everyone needs to know this. It is the first step in most NLP pipelines._

from ptb import TreebankWordTokenizer
# This is the Penn Tree Bank tokenizer from NLTK as just one file
tok = TreebankWordTokenizer()
tok.tokenize("Hello this is a tokenizer")

tok.tokenize("I can't go outside")

['I', 'ca', "n't", 'go', 'outside']

Questions: 
    
- How is this different from `split`? (h/t Jay)
- Why do you think it might make sense to split "can't" into "ca" and "n't"? What information does this give to a dumb computer?
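
One way to explore the first question is to run both approaches on the same sentence. The sketch below uses only the standard library. `simple_tokenize` is a hypothetical toy function written for this comparison; it is not the Penn Treebank tokenizer, which handles many more cases.

```python
import re

def simple_tokenize(text):
    """Toy tokenizer: splits off "n't" and punctuation. Illustration only."""
    # Separate the "n't" ending from its word: "can't" -> "ca n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Grab "n't", runs of word characters, or single punctuation marks
    return re.findall(r"n't|\w+|[^\w\s]", text)

sentence = "I can't go outside."

print(sentence.split())
# ['I', "can't", 'go', 'outside.']

print(simple_tokenize(sentence))
# ['I', 'ca', "n't", 'go', 'outside', '.']
```

Notice that `split` only looks at whitespace, so the period stays glued to "outside" and "can't" stays in one piece.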