## Natural Language Processing in Python

In this tutorial, we explore how to analyze large bodies of text using Natural Language Processing (NLP). NLP offers perspectives on textual data that are difficult to obtain through traditional close reading alone.

We will work with the Python library *NLTK* (Natural Language Toolkit), which provides a variety of tools for working with human language data. Let's start by importing the necessary libraries.

In [38]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the NLTK resources used in this tutorial
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bge_j\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bge_j\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bge_j\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### From text to tokens
A computer does not *read* a text the way a human does. To process a collection of texts computationally, we need to approach them in a different way.

Our first step is to split the text into individual tokens. Instead of treating a text as one long string, we work with its individual words (or sentences), which allows us to compute things such as word frequencies.

Let's start by tokenizing the sentence below:

In [55]:
mySentence = "Everyday, I'm reminded of the fact that koala's fingerprints are almost identical to those of a human. That's cute!"
word_tokenize(mySentence)

['Everyday',
 ',',
 'I',
 "'m",
 'reminded',
 'of',
 'the',
 'fact',
 'that',
 'koala',
 "'s",
 'fingerprints',
 'are',
 'almost',
 'identical',
 'to',
 'those',
 'of',
 'a',
 'human',
 '.',
 'That',
 "'s",
 'cute',
 '!']

Now let's split the same text at a different level. What's the difference?

In [57]:
sent_tokenize(mySentence)

["Everyday, I'm reminded of the fact that koala's fingerprints are almost identical to those of a human.",
 "That's cute!"]

In NLP it is most common to work at the word level, as this allows for more fine-grained analyses. Let's stick to that.
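As a minimal sketch of what word-level tokens make possible, the example below counts word frequencies for the sentence tokenized earlier. The token list is copied from the `word_tokenize` output shown above. Lowercasing and dropping punctuation-only tokens are choices made here for illustration, not fixed rules.

```python
from collections import Counter

# Tokens copied from the word_tokenize output shown earlier
tokens = ['Everyday', ',', 'I', "'m", 'reminded', 'of', 'the', 'fact', 'that',
          'koala', "'s", 'fingerprints', 'are', 'almost', 'identical', 'to',
          'those', 'of', 'a', 'human', '.', 'That', "'s", 'cute', '!']

# Lowercase every token and keep only those containing a letter or digit,
# which drops pure punctuation such as ',' and '!'
words = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

# Count how often each remaining token occurs
frequencies = Counter(words)
print(frequencies.most_common(3))
```

NLTK's `FreqDist` class is a subclass of `Counter`, so it could be used here in the same way.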

Once a text is tokenized, there are multiple ways of representing the tokens numerically:

- **Bag-of-Words (BoW):** Counts the frequency of words in each document.
- **Term Frequency-Inverse Document Frequency (TF-IDF):** Weighs the frequency of words by how common or rare they are across all documents.
- **Word Embeddings:** Represent words in continuous vector space (e.g., Word2Vec, GloVe). This method considers the context of the words within the larger text.
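The first of these representations, Bag-of-Words, can be sketched in plain Python. The mini-corpus below is hypothetical and assumed to be tokenized and lowercased already. Each document becomes a row of counts over a shared vocabulary.

```python
from collections import Counter

# Hypothetical mini-corpus, already tokenized and lowercased
documents = [
    ['koalas', 'sleep', 'in', 'trees'],
    ['humans', 'sleep', 'in', 'beds'],
    ['koalas', 'eat', 'leaves', 'in', 'trees'],
]

# Shared vocabulary: every distinct word in the corpus,
# sorted so that the column order is stable
vocabulary = sorted({word for doc in documents for word in doc})

# One row per document, one column per vocabulary word
bow_matrix = []
for doc in documents:
    counts = Counter(doc)
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Word order is discarded in this representation: only the counts per document remain.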

#### tf-idf

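tf-idf gives a word a high score when it occurs often in one document but rarely in the rest of the corpus. In practice, this makes it useful for finding the words that characterize a particular document, while words that appear everywhere receive little weight.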
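To make the calculation concrete, the sketch below computes tf-idf by hand using the plain textbook definition. Term frequency is a word's count divided by the document length. Inverse document frequency is the logarithm of the total number of documents divided by the number of documents containing the word. The mini-corpus and the `tf_idf` helper are hypothetical. Libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math

# Hypothetical mini-corpus, already tokenized and lowercased
documents = [
    ['koalas', 'sleep', 'in', 'trees'],
    ['humans', 'sleep', 'in', 'beds'],
    ['koalas', 'eat', 'leaves', 'in', 'trees'],
]

def tf_idf(term, doc, corpus):
    """Unsmoothed textbook tf-idf of `term` in `doc`, relative to `corpus`."""
    # Term frequency: share of the document's tokens that are this term
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing the term
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # the term does not occur anywhere in the corpus
    # Inverse document frequency: rarer terms get a larger weight
    idf = math.log(len(corpus) / df)
    return tf * idf

print(tf_idf('in', documents[0], documents))      # occurs in every document
print(tf_idf('koalas', documents[0], documents))  # occurs in two documents
print(tf_idf('humans', documents[1], documents))  # occurs in one document
```

A word that appears in every document, such as `in` here, gets a score of zero, while a word unique to one document gets the highest weight.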

In [10]:
# Vectorize a corpus of cleaned texts into a tf-idf matrix

from sklearn.feature_extraction.text import TfidfVectorizer

