# Feature Transformations

This tutorial covers:
* tokenization
* stemming
* lemmatization
* tf-idf
* encoders
* decoders


In [7]:
#Tokenization
import spacy
from nltk.tokenize import word_tokenize

sentence = """
Hello there, Karl Marx; how are you doing today? Was the play good and fun? 
Did you laugh at the jokes, filling your eyes with glee"""

#In Spacy
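# "en" is a shortcut from older spaCy releases; newer releases need a full
# model name such as "en_core_web_sm"
nlp = spacy.load("en")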
doc = nlp(sentence)
print("spacy tokens:")
tokens = list(doc)
print(tokens)

#In NLTK
print("nltk tokens:")
print(word_tokenize(sentence))

spacy tokens:
[
, Hello, there, ,, Karl, Marx, ;, how, are, you, doing, today, ?, Was, the, play, good, and, fun, ?, 
, Did, you, laugh, at, the, jokes, ,, filling, your, eyes, with, glee]
nltk tokens:
['Hello', 'there', ',', 'Karl', 'Marx', ';', 'how', 'are', 'you', 'doing', 'today', '?', 'Was', 'the', 'play', 'good', 'and', 'fun', '?', 'Did', 'you', 'laugh', 'at', 'the', 'jokes', ',', 'filling', 'your', 'eyes', 'with', 'glee']


# Token Customization

Customizing spaCy's tokenizer is straightforward. See:

* https://spacy.io/docs/usage/customizing-tokenizer
* https://spacy.io/docs/api/tokenizer (api reference)

For customization in NLTK, see:

* http://stackoverflow.com/questions/3930267/nltk-custom-tokenizer-and-tagger
* http://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer
* http://www.nltk.org/api/nltk.tokenize.html (api reference)
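
To make the idea of a custom tokenizer concrete, here is a minimal sketch that uses only Python's `re` module rather than the spaCy or NLTK APIs linked above. The pattern and the `custom_tokenize` name are assumptions chosen for illustration: hyphenated words and simple contractions stay together, and every other punctuation mark becomes its own token.

```python
import re

# Hypothetical rule set: a word may contain internal hyphens or apostrophes;
# any other non-space, non-word character is emitted as a separate token.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def custom_tokenize(text):
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

tokens = custom_tokenize("Hello there, Karl Marx; isn't this a well-known play?")
print(tokens)
```

Unlike the default tokenizers shown earlier, this rule keeps `isn't` and `well-known` as single tokens. Changing the pattern is all it takes to change the behavior.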

In [8]:
#Stemming

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('walking')
#Note: Spacy doesn't have a stemmer

'walk'

In [17]:
#Lemmatization

#reference: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

sentence = "This isn't a dogs and ponies show, after all!  We are barely making any molla..." 
#NLTK
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()
print("dogs ->",wnl.lemmatize('dogs'))
print("ponies ->",wnl.lemmatize('ponies'))
print("Sentence parsed by NLTK")
# lemmatize() expects a single word, so a whole sentence comes back unchanged
print(wnl.lemmatize(sentence))

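#Spacy
# `spacy.en` is the import path from older spaCy releases; newer releases
# keep this class in `spacy.lang.en`
from spacy.en import English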
parser = English()
parsedData = parser(sentence)
print("Sentence parsed by Spacy")
print(" ".join([token.lemma_ for token in parsedData]))
    

dogs -> dog
ponies -> pony
Sentence parsed by NLTK
This isn't a dogs and ponies show, after all!  We are barely making any molla...
Sentence parsed by Spacy
this be not a dog and pony show , after all !   -PRON- be barely make any molla ...
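
The NLTK result above differs from the spaCy one because a word-level lemmatizer looks up one word at a time. The toy sketch below shows the effect with a plain dictionary. The `LEMMAS` table and `toy_lemmatize` function are hypothetical stand-ins, not the WordNet implementation.

```python
# Toy lookup table, for illustration only (not WordNet data).
LEMMAS = {"dogs": "dog", "ponies": "pony", "are": "be"}

def toy_lemmatize(word):
    """Return the lemma for a known word, or the input unchanged."""
    return LEMMAS.get(word, word)

sentence = "dogs and ponies are here"

# Passing the whole sentence: no entry matches, so nothing changes.
whole = toy_lemmatize(sentence)

# Passing one token at a time: each known word is mapped to its lemma.
per_token = [toy_lemmatize(word) for word in sentence.split()]

print(whole)
print(" ".join(per_token))
```

The same pattern applies to a real lemmatizer: tokenize first, then lemmatize each token.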


In [19]:
# combining Stemming and Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
wnl = WordNetLemmatizer()
stemmer = PorterStemmer()
stemmer.stem(wnl.lemmatize('walking'))

'walk'

In [84]:
# Term Frequency - Inverse Document Frequency

#Reference One: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
#Reference Two: http://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
#NLTK
import nltk
import string
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

document = """
Trump loses some sway with fellow Republicans
In the aftermath of President Trump's firing of FBI Director James Comey, two things have happened when it comes to Trump's standing with his own party. One, congressional Republicans haven't abandoned Trump on Comey (yes, some have criticized the timing, but there isn't a growing GOP demand for a special counsel). Two, enough of them are beginning to buck their president on non-Comey items. That includes Sens. John McCain (R-AZ) and Ben Sasse (R-NE) saying they're opposed to Trump's pick to be U.S. trade representative, and three Republicans who worked to defeat a GOP-backed resolution to repeal an Obama environmental regulation.

So while Republicans aren't in open revolt when it comes to Comey directly, we've seen some signs in the last 24 hours how Trump's legislative agenda, as well as his staffing priorities, could be in peril. And as for Democrats, it's hard to imagine how the president has the juice to win over any members of the opposition. Don't forget: At this same point in time of George W. Bush's presidency, there were a handful of Democrats who supported his tax cuts, and Ted Kennedy was working with his administration on education.

NBC's Lester Holt interviews Trump
NBC's Lester Holt today interviews President Trump, and it will air on "Nightly News." What will President Trump say about his recent firing of FBI Director James Comey? Be sure to watch!
"""

stemmer = PorterStemmer()
wnl = WordNetLemmatizer()

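def stem_lemmatize_tokens(tokens, stemmer):
    # the stemmer argument is accepted but not applied;
    # only WordNet lemmatization is performed on each token
    stem_lemmatized = []
    for item in tokens:
        stem_lemmatized.append(wnl.lemmatize(item))
    return stem_lemmatized

def tokenize(text):
    # split into word tokens, then lemmatize each one
    tokens = nltk.word_tokenize(text)
    stems = stem_lemmatize_tokens(tokens, stemmer)
    return stems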


lowers = document.lower()
translator = str.maketrans('', '', string.punctuation)

no_punctuation = lowers.translate(translator)
processed_text = tokenize(no_punctuation)
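#this can take some time
# fit_transform expects an iterable of documents, so each token in
# processed_text is treated as its own one-word document
tfidf = TfidfVectorizer(stop_words='english')
tfs = tfidf.fit_transform(processed_text)
# prints every token with its tf-idf row; commented out because the output is long
#for index,elem in enumerate(processed_text):
#    print(elem,tfs[index])
snippet = "NBC's Lester Holt interviews Trump"
response = tfidf.transform([snippet])
print(response)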

  (0, 98)	0.481029274755
  (0, 50)	0.619923719835
  (0, 38)	0.619923719835
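
To show where weights like these come from, here is a minimal pure-Python sketch of tf-idf. It assumes the smoothed variant that scikit-learn documents as its default: raw term counts, `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The `tfidf_weights` function and the three tiny documents are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so each document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = [
    ["trump", "fires", "comey"],
    ["holt", "interviews", "trump"],
    ["republicans", "buck", "trump"],
]
weights = tfidf_weights(docs)
print(weights[1])
```

Because `trump` occurs in every document, it receives a lower weight than `holt` or `interviews`. The notebook output above shows the same pattern: one smaller value next to two equal larger ones.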


In [None]:
# Encoder - Decoder

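from sklearn.preprocessing import LabelEncoder
# np_utils.to_categorical is assumed to be the Keras helper, which Keras
# also exposes as keras.utils.to_categorical
from keras.utils import to_categorical
import pandas as pd
import random

# Encoding
def one_hot_encoding(series):
    # encode class values as integers
    encoder = LabelEncoder()
    encoder.fit(series)
    encoded = encoder.transform(series)
    # convert integers to dummy variables (i.e. one hot encoded)
    return to_categorical(encoded)

# build a frame with some rows to fill; n_rows is an illustrative size
n_rows = 10
df = pd.DataFrame(index=range(n_rows))

df['e'] = [random.choice(('Chicago', 'Boston', 'New York', None)) for i in range(df.shape[0])]
# drop rows without a label before encoding
df = df[pd.notnull(df["e"])]
encoded = one_hot_encoding(df["e"])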

# Decoding

#http://stackoverflow.com/questions/22548731/how-to-reverse-sklearn-onehotencoder-transform-to-recover-original-data
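
The decoding step is only linked above, so here is a minimal pure-Python sketch of the round trip. It assumes classes are ordered by sorting, which is how `LabelEncoder` orders its `classes_`. The sample labels and variable names are hypothetical. With NumPy arrays, the same idea is typically an `argmax` over each row followed by the encoder's `inverse_transform`.

```python
labels = ["Chicago", "Boston", "New York", "Boston"]

# Encoding: map each class to an integer index (sorted order), then to a
# one-hot row with a single 1 at that index.
classes = sorted(set(labels))
class_index = {name: i for i, name in enumerate(classes)}
one_hot = [
    [1 if i == class_index[label] else 0 for i in range(len(classes))]
    for label in labels
]

# Decoding: find the position of the 1 in each row (the argmax) and map
# that index back to the class name.
decoded = [classes[row.index(max(row))] for row in one_hot]

print(one_hot)
print(decoded)
```

Decoding only works if the class order used for encoding is kept, so store `classes` (or the fitted encoder) alongside the encoded data.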