# Feature Transformations

This tutorial features:
* tokenization
* stemming
* lemmatization
* tf-idf
* encoders
* decoders


In [7]:
#Tokenization
import spacy
from nltk.tokenize import word_tokenize

sentence = """
Hello there, Karl Marx; how are you doing today? Was the play good and fun? 
Did you laugh at the jokes, filling your eyes with glee"""

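#In spaCy
# Note: the "en" shortcut only works in older spaCy releases;
# spaCy 3 needs a full package name such as "en_core_web_sm"
nlp = spacy.load("en")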
doc = nlp(sentence)
print("spacy tokens:")
tokens = list(doc)
print(tokens)

#In NLTK
print("nltk tokens:")
print(word_tokenize(sentence))

spacy tokens:
[
, Hello, there, ,, Karl, Marx, ;, how, are, you, doing, today, ?, Was, the, play, good, and, fun, ?, 
, Did, you, laugh, at, the, jokes, ,, filling, your, eyes, with, glee]
nltk tokens:
['Hello', 'there', ',', 'Karl', 'Marx', ';', 'how', 'are', 'you', 'doing', 'today', '?', 'Was', 'the', 'play', 'good', 'and', 'fun', '?', 'Did', 'you', 'laugh', 'at', 'the', 'jokes', ',', 'filling', 'your', 'eyes', 'with', 'glee']


# Token Customization

Customizing spaCy's tokenizer is fairly straightforward. See:

* https://spacy.io/docs/usage/customizing-tokenizer
* https://spacy.io/docs/api/tokenizer (api reference)
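
As a rough illustration of the kind of customization those pages describe, the sketch below adds one special-case rule to spaCy's tokenizer. It assumes a spaCy version that provides `spacy.blank` (2.x or later); the word "gimme" and the way it is split are illustrative choices, not part of this tutorial's data.

```python
import spacy
from spacy.symbols import ORTH

# A blank English pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")

# With the default rules, "gimme" stays a single token.
before = [token.text for token in nlp("gimme that")]

# Special case (illustrative): always split "gimme" into "gim" + "me".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]

print(before)
print(after)
```

Special cases only cover exact strings; broader changes (prefix, suffix, or infix rules) go through the tokenizer's other settings described in the API reference.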

To customize tokenization in NLTK, see:

* http://stackoverflow.com/questions/3930267/nltk-custom-tokenizer-and-tagger
* http://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer
* http://www.nltk.org/api/nltk.tokenize.html (api reference)
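
One simple way to get a custom tokenizer in NLTK is `RegexpTokenizer`, which splits text according to a pattern you supply. The sketch below is only an illustration: the pattern and the sample sentence are assumptions chosen to keep hyphenated words and contractions in one piece.

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern: a word may contain internal hyphens or apostrophes;
# any other non-space character becomes its own token.
# The group is non-capturing because RegexpTokenizer relies on re.findall.
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)*|[^\w\s]")

text = "Karl Marx saw a well-known play; wasn't it fun?"
tokens = tokenizer.tokenize(text)
print(tokens)
```

Unlike `word_tokenize`, this keeps "wasn't" whole instead of splitting off "n't", which may or may not be what a downstream step expects.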

In [8]:
#Stemming

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('walking')
#Note: spaCy does not include a stemmer; it offers lemmatization instead

'walk'
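
To make the behaviour a little more concrete, here is a small sketch that runs the same Porter stemmer over several word forms. The word list is an illustrative assumption. A stemmer strips suffixes by rule, so its output is not guaranteed to be a dictionary word.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word list: three inflections of "walk" plus two plural nouns.
words = ["walking", "walked", "walks", "dogs", "ponies"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

The three forms of "walk" collapse to the same stem, while "ponies" becomes "poni" rather than "pony"; that gap is what lemmatization addresses.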

In [17]:
#Lemmatization

#reference: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

sentence = "This isn't a dogs and ponies show, after all!  We are barely making any molla..." 
#NLTK
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()
print("dogs ->",wnl.lemmatize('dogs'))
print("ponies ->",wnl.lemmatize('ponies'))
print("Sentence parsed by NLTK")
# Note: lemmatize() expects a single word, so a whole sentence comes back unchanged
print(wnl.lemmatize(sentence))

#spaCy
# Note: `spacy.en` is the spaCy 1.x import path; from spaCy 2 on the module is `spacy.lang.en`
from spacy.en import English
parser = English()
parsedData = parser(sentence)
print("Sentence parsed by Spacy")
print(" ".join([token.lemma_ for token in parsedData]))
    

dogs -> dog
ponies -> pony
Sentence parsed by NLTK
This isn't a dogs and ponies show, after all!  We are barely making any molla...
Sentence parsed by Spacy
this be not a dog and pony show , after all !   -PRON- be barely make any molla ...
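
The NLTK output above is unchanged because `WordNetLemmatizer.lemmatize` works on one word at a time. A minimal sketch of lemmatizing a sentence token by token follows. The helper names `lemmatize_sentence` and `wordnet_demo` and the regular-expression tokenizer are assumptions made for illustration, and the WordNet step needs the corpus from `nltk.download("wordnet")`.

```python
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Stand-in tokenizer (assumption): words with an optional apostrophe part,
# or single punctuation marks.
_tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")


def lemmatize_sentence(sentence, lemmatize_word):
    """Tokenize `sentence`, then apply `lemmatize_word` to each token separately."""
    return [lemmatize_word(token) for token in _tokenizer.tokenize(sentence)]


def wordnet_demo(sentence):
    """Lemmatize every token with WordNet and join the results with spaces."""
    wnl = WordNetLemmatizer()
    # lemmatize() treats each word as a noun unless a `pos` argument is given.
    return " ".join(lemmatize_sentence(sentence, wnl.lemmatize))
```

Calling `wordnet_demo(sentence)` should then reduce plural nouns such as "dogs" and "ponies", as in the single-word calls earlier. Verbs generally need `pos="v"` passed to `lemmatize` before they are reduced, which is one reason spaCy's tagger-backed lemmas look different.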
