## 6.1 Cleaning Text

In [3]:
text_data = ["  Interrobang. By aishwarya Henriette    ",
            "Parking And Going. By Karl Gautier",
            "    Today Is The night. By Jarek Prakash"]

# strip leading and trailing whitespace
strip_whitespace = [string.strip() for string in text_data]
strip_whitespace

['Interrobang. By aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [5]:
remove_periods = [string.replace(".", "") for string in strip_whitespace]
remove_periods

['Interrobang By aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [6]:
def capitalizer(string: str) -> str:
    return string.upper()

[capitalizer(string) for string in remove_periods]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

In [9]:
import re

def replace_letters_with_X(string: str) -> str:
    return re.sub(r"[a-zA-Z]", "X", string)

[replace_letters_with_X(string) for string in remove_periods]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
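
The cleaning steps above can also be chained into a single pass. The following is a minimal sketch, assuming a hypothetical helper `clean` and a `steps` list (neither is part of the original recipe) that apply the strip, period-removal, and uppercase operations in order:

```python
from typing import Callable, List

text_data = ["  Interrobang. By aishwarya Henriette    ",
             "Parking And Going. By Karl Gautier",
             "    Today Is The night. By Jarek Prakash"]

# each step takes a string and returns a cleaned string
steps: List[Callable[[str], str]] = [
    str.strip,                     # drop leading and trailing whitespace
    lambda s: s.replace(".", ""),  # remove periods
    str.upper,                     # uppercase, as capitalizer does
]

def clean(string: str) -> str:
    # hypothetical helper: apply every step in order
    for step in steps:
        string = step(string)
    return string

cleaned = [clean(string) for string in text_data]
```

Keeping the steps in a list makes it easy to reorder them or add a new one without touching the loop.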

### See Also
* Beginners Tutorial for Regular Expressions in Python (https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/)

## 6.2 Parsing and Cleaning HTML

In [18]:
from bs4 import BeautifulSoup

html = """
    <div class='full_name'><span style='font-weight:bold'>Yan</span> Chin</div>
"""

# name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(html, "html.parser")

soup.find("div", {"class": "full_name"}).text

'Yan Chin'

### See Also
* Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## 6.3 Removing Punctuation

In [20]:
import unicodedata
import sys

text_data = ['Hi!!! I. Love. This. Song.....', '10000% Agree!!!! #LoveIT', 'Right?!?!']

# create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))

# for each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
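
When the text is known to contain only ASCII punctuation, a smaller translation table may be enough. This is a sketch of that alternative, not the recipe's method: it assumes `string.punctuation` covers the characters to remove, and the names `ascii_punctuation` and `no_punctuation` are illustrative.

```python
import string

text_data = ['Hi!!! I. Love. This. Song.....', '10000% Agree!!!! #LoveIT', 'Right?!?!']

# translation table that deletes every character in string.punctuation (ASCII only)
ascii_punctuation = str.maketrans("", "", string.punctuation)

# for each string, remove the ASCII punctuation characters
no_punctuation = [s.translate(ascii_punctuation) for s in text_data]
```

The two approaches are not equivalent. `string.punctuation` misses non-ASCII punctuation such as curly quotes, and it also includes characters such as `$` and `+`, which Unicode classifies as symbols rather than punctuation.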

## 6.4 Tokenizing Text

In [31]:
from nltk.tokenize import word_tokenize
string = "The science of today is the technology of tomorrow"

# tokenize words
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [32]:
from nltk.tokenize import sent_tokenize
string = "The science of today is the technology of tomorrow. Tomorrow is today"

# tokenize sentences
sent_tokenize(string)

['The science of today is the technology of tomorrow.', 'Tomorrow is today']

## 6.5 Removing Stop Words

In [38]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

stop_words = stopwords.words('english')

# remove stop words
[word for word in tokenized_words if word not in stop_words]

[nltk_data] Downloading package stopwords to /Users/f00/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['going', 'go', 'store', 'park']

## 6.6 Stemming Words

In [39]:
from nltk.stem.porter import PorterStemmer

tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

# create stemmer
porter = PorterStemmer()

# apply stemmer
[porter.stem(word) for word in tokenized_words]

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

### See Also
* Porter Stemming Algorithm (https://tartarus.org/martin/PorterStemmer/)

## 6.7 Tagging Part of Speech

In [41]:
from nltk import pos_tag
from nltk import word_tokenize
import nltk
nltk.download('averaged_perceptron_tagger')

text_data = "Chris loved outdoor running"

text_tagged = pos_tag(word_tokenize(text_data))

text_tagged

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/f00/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

NLTK uses the Penn Treebank part-of-speech tags. Some examples:

| Tag | Part of Speech |
|-----|----------------|
|NNP| Proper noun, singular|
|NN| Noun, singular or mass|
|RB| Adverb|
|VBD| Verb, past tense|
|VBG| Verb, gerund or present participle|
|JJ| Adjective|
|PRP| Personal pronoun|

Once the text has been tagged, we can use the tags to select words by part of speech. For example, here are all the nouns:

In [42]:
[word for word, tag in text_tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]

['Chris']
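
The same filter works for other word classes. As a small sketch, the tagged output is copied in as a literal so the snippet runs without NLTK, and the name `verbs` is illustrative. It relies on the fact that the Penn Treebank verb tags all begin with `VB`:

```python
# tagged output copied from the cell above, so NLTK is not needed to run this
text_tagged = [('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

# Penn Treebank verb tags all start with 'VB' (VB, VBD, VBG, VBN, VBP, VBZ)
verbs = [word for word, tag in text_tagged if tag.startswith('VB')]
```

Matching on the tag prefix avoids listing every verb tag by hand, at the cost of also matching any future tag that happens to start with the same letters.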

In [46]:
from sklearn.preprocessing import MultiLabelBinarizer

tweets = ["I am eating a burrito for breakfast",
         "Political science is an amazing field",
         "San Francisco is an awesome city"]

tagged_tweets = []

# tag each word in each tweet
for tweet in tweets:
    tweet_tag = nltk.pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])

# use one hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [47]:
# show feature names
one_hot_multi.classes_

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)

In [49]:
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
import nltk
nltk.download('brown')
    
# get tagged sentences from the news category of the Brown corpus
sentences = brown.tagged_sents(categories='news')

# split into 4000 sentences for training and 623 for testing
train = sentences[:4000]
test = sentences[4000:]

# create backoff tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)

# accuracy on the held-out sentences (newer NLTK releases name this method accuracy())
trigram.evaluate(test)

[nltk_data] Downloading package brown to /Users/f00/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


0.8174734002697437

### See Also
* Alphabetical list of part-of-speech tags used in the Penn Treebank Project (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

## 6.8 Encoding Text as a Bag of Words

In [50]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

bag_of_words

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [51]:
bag_of_words.toarray()

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [52]:
# show feature names (scikit-learn 1.0 and later provide get_feature_names_out() instead)
count.get_feature_names()

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']
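
To see where these counts come from, the matrix can be rebuilt with the standard library alone. This is a rough sketch rather than scikit-learn's implementation: it assumes the default behavior of lowercasing and keeping only tokens of two or more word characters, and the names `token_pattern`, `tokenized`, `vocabulary`, and `bag` are illustrative.

```python
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both']

# lowercase, then keep tokens of two or more word characters
# (assumed to mirror CountVectorizer's default tokenization)
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in text_data]

# vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# one row per document, one column per vocabulary term
counts = [Counter(tokens) for tokens in tokenized]
bag = [[count[term] for term in vocabulary] for count in counts]
```

The single-character token `I` is dropped by the pattern, which is why it never appears among the feature names.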

In [53]:
count_2gram = CountVectorizer(ngram_range=(1,2), stop_words='english', vocabulary=['brazil'])
bag = count_2gram.fit_transform(text_data)
bag.toarray()

array([[2],
       [0],
       [0]])

In [54]:
count_2gram.vocabulary_

{'brazil': 0}

### See Also
* n-gram (https://en.wikipedia.org/wiki/N-gram)
* Bag of Words Meets Bags of Popcorn (https://www.kaggle.com/c/word2vec-nlp-tutorial)

## 6.9 Weighting Word Importance

In [55]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

# create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [56]:
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

In [57]:
tfidf.vocabulary_

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}

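The tf-idf weight of a term in a document is

$$
\text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
$$

where $t$ is a word, $d$ is a document, and $\text{tf}(t, d)$ is the number of times $t$ appears in $d$. The inverse document frequency is

$$
\text{idf}(t) = \log\left(\frac{1 + n_d}{1 + \text{df}(d, t)}\right) + 1
$$

where $n_d$ is the number of documents and $\text{df}(d, t)$ is the document frequency of term $t$ (i.e. the number of documents in which the term appears). By default, scikit-learn then scales each document's row to unit Euclidean length, which is why the values in the feature matrix lie between 0 and 1.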
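
The formula can be checked by hand against the first row of the feature matrix. This is a minimal sketch under two assumptions: the logarithm is the natural log, and each row is scaled to unit Euclidean length afterwards (scikit-learn's default). The dictionaries `tf` and `df` and the helper `idf` are written out by hand for illustration.

```python
import math

n_d = 3                         # number of documents
tf = {'love': 1, 'brazil': 2}   # term counts in 'I love Brazil. Brazil!'
df = {'love': 1, 'brazil': 1}   # each term appears in exactly one document

def idf(term: str) -> float:
    # smoothed inverse document frequency from the formula above
    return math.log((1 + n_d) / (1 + df[term])) + 1

# unnormalized tf-idf weights
raw = {term: tf[term] * idf(term) for term in tf}

# scale the row to unit Euclidean (L2) length
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}
```

Because both terms share the same idf here, the normalized weights reduce to $2/\sqrt{5}$ for `brazil` and $1/\sqrt{5}$ for `love`, matching the 0.89442719 and 0.4472136 shown in the matrix.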

### See Also
* scikit-learn documentation: tf-idf term weighting (http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)