# Processing text

Some of the data comes in the form of free-form text. We'll often want to process such text to create features for our model using **NLP** - Natural Language Processing.

**Tokenization**

The first step is **tokenization**: the process of splitting a string into a list of smaller strings (tokens), typically one for each word.

For the string 'petro-vend fuel and fluids':

If we tokenize on **white space** (spaces, tabs, or newline characters), we get:

Four tokens : ['petro-vend', 'fuel', 'and', 'fluids']

If we tokenize on **white space** and **punctuation** (spaces, tabs, newline characters, or any punctuation mark), we get:

Five tokens: ['petro', 'vend', 'fuel', 'and', 'fluids']
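
The two splitting rules above can be sketched with Python's standard library. This is only an illustration: the variable names are made up for the example, and "punctuation" is approximated as any non-word character (so an underscore would stay inside a token).

```python
import re

text = 'petro-vend fuel and fluids'

# Split on runs of white space (spaces, tabs, newlines).
whitespace_tokens = text.split()

# Split on runs of white space or punctuation, approximated here as
# any character that is not a letter, digit, or underscore.
punctuation_tokens = [tok for tok in re.split(r'\W+', text) if tok]

print(whitespace_tokens)   # ['petro-vend', 'fuel', 'and', 'fluids']
print(punctuation_tokens)  # ['petro', 'vend', 'fuel', 'and', 'fluids']
```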

**Count the occurrences of our tokens**

Second, we need to count the number of times that each token occurs in a row/observation. This is called a **bag of words** representation. It is the simplest technique available for representing text as numerical data that our ML model can use. The assumption is that the number of times a word appears is sufficient information. This approach ignores word order and grammar, so that information is lost.
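
A minimal sketch of the idea, using only the standard library and two invented example rows (splitting on white space for simplicity):

```python
from collections import Counter

rows = ['fuel and fluids', 'fuel and more fuel']

# One Counter per row: token -> number of occurrences in that row.
bags = [Counter(row.split()) for row in rows]

# Shared vocabulary, sorted so the column order is reproducible.
vocabulary = sorted({token for bag in bags for token in bag})

# One count vector per row, aligned with the vocabulary columns.
matrix = [[bag[token] for token in vocabulary] for bag in bags]

print(vocabulary)  # ['and', 'fluids', 'fuel', 'more']
print(matrix)      # [[1, 1, 1, 0], [1, 0, 2, 1]]
```

Each row becomes a vector of counts, and the order of the words within the row no longer matters.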

A more sophisticated approach is to create **n-grams**. In addition to creating a column with the count of each single word (**1-grams**), we may have a column for every sequence of two adjacent words (**2-grams**), and so on. Thus:

['petro', 'vend', 'fuel', 'and', 'fluids']

- 1-grams: 'petro', 'vend', 'fuel', 'and', 'fluids' columns
- 2-grams: 'petro vend', 'vend fuel', 'fuel and', 'and fluids' columns
- 3-grams: 'petro vend fuel', 'vend fuel and', 'fuel and fluids' columns

**n** can be any positive integer.
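
The n-grams listed above can be generated with a small helper. The function name `ngrams` is hypothetical and only serves this sketch:

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings, in order."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['petro', 'vend', 'fuel', 'and', 'fluids']

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # ['petro vend', 'vend fuel', 'fuel and', 'and fluids']
print(trigrams)  # ['petro vend fuel', 'vend fuel and', 'fuel and fluids']
```

In sklearn, `CountVectorizer` can build these columns itself through its `ngram_range` parameter, e.g. `ngram_range=(1, 2)` for 1-grams and 2-grams.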

**Represent text numerically**

Tools like sklearn cannot use raw text in their models, so we need to convert the text to numerical data. sklearn provides a **bag of words** implementation, the `CountVectorizer` class. It takes an array of strings and:

- tokenizes all of the strings
- builds a **vocabulary**, a record of ALL the words/tokens that appear
- counts the occurrences of each token in each row

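We provide `CountVectorizer` with a regex through its `token_pattern` argument. The regex describes what a token looks like (it matches the tokens themselves, not the separators between them), e.g. `\\S+(?=\\s+)` matches any run of non-white-space characters that is followed by white space.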
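
A token pattern can be tried out directly with `re.findall`, which is roughly how `CountVectorizer` applies it after lowercasing the text. The sample string below is invented for illustration, and the alphanumeric pattern is the one used later in this section. Note that the lookahead `(?=\s+)` requires white space after a token, so the last word of a string is not matched, and an alphanumeric run followed directly by punctuation (like `dr.,`) is dropped by the stricter pattern.

```python
import re

TOKENS_BASIC = r'\S+(?=\s+)'
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'

# Invented sample text, lowercased as CountVectorizer does by default.
sample = 'bus dr., radio & dispatch'.lower()

basic_tokens = re.findall(TOKENS_BASIC, sample)
alphanumeric_tokens = re.findall(TOKENS_ALPHANUMERIC, sample)

print(basic_tokens)         # ['bus', 'dr.,', 'radio', '&']
print(alphanumeric_tokens)  # ['bus', 'radio']
```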

We also need to ensure that our strings do not contain any `NaN` values; we simply replace them with an empty string.

We can then use the `CountVectorizer` object just like any other estimator in sklearn: calling `.fit()` parses all the strings for tokens and creates the **vocabulary**.
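
As a rough illustration of these steps on a toy input (the three strings and the simplified pattern `[A-Za-z0-9]+`, without the lookahead, are assumptions made for this sketch and are not the patterns used on the real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents; the empty string stands in for a filled NaN value.
docs = ['petro-vend fuel and fluids', 'fuel and oil', '']

vec = CountVectorizer(token_pattern=r'[A-Za-z0-9]+')

# fit() tokenizes every string and builds the vocabulary.
vec.fit(docs)
vocabulary = sorted(vec.vocabulary_)  # vocabulary_ maps token -> column index

# transform() counts the occurrences of each token in each row.
counts = vec.transform(docs).toarray()

print(vocabulary)  # ['and', 'fluids', 'fuel', 'oil', 'petro', 'vend']
print(counts.tolist())
```

The columns of `counts` follow the alphabetical order of the vocabulary, and the empty string simply becomes a row of zeros.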

### Using sklearn's CountVectorizer

We'll look at the effects of tokenizing in different ways by comparing the **bag-of-words** representations resulting from different token patterns.

We will focus on one feature only, the `Position_Extra` column, which describes any additional information not captured by the `Position_Type` label.

In [1]:
# ignore deprecation warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# import our custom train_test split function
from multilabel import multilabel_train_test_split

# set seed for reproducibility
np.random.seed(0)

df = pd.read_csv('../data/TrainingData.csv',index_col=0)

In [2]:
df.loc[8960]

Function                                         Student Transportation
Use                                                                 O&M
Sharing                                                 Shared Services
Reporting                                                    Non-School
Student_Type                                                Unspecified
Position_Type                                                     Other
Object_Type                                  Other Compensation/Stipend
Pre_K                                                          Non PreK
Operating_Status                                      PreK-12 Operating
Object_Description        Extra Duty Pay/Overtime For Support Personnel
Text_2                                                              NaN
SubFund_Description                                          Operations
Job_Title_Description                     TRANSPORTATION,BUS DR., RADIO
Text_3                                                          

The output for row 8960 reveals that the `Object_Description` is overtime pay. For whom? The `Position_Type` is merely "Other", but the `Position_Extra` elaborates: "BUS DRIVER". The row also has a number of `NaN` values.

We'll turn the raw text in this column into a **bag-of-words** representation by creating tokens that contain only alphanumeric characters. For comparison, we'll also create a **bag-of-words** representation that splits on white space only.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
TOKENS_BASIC = '\\S+(?=\\s+)'

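# Fill NaN values in df.Position_Extra
# (assign the result back; inplace=True on a selected column may not update df in newer pandas)
df['Position_Extra'] = df['Position_Extra'].fillna('')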

# Instantiate the CountVectorizer
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)
vec_basic.fit(df.Position_Extra)

# Print the number of tokens and the first 30 tokens
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
msg = "There are {} tokens in Position_Extra if we split on non-alpha numeric"
print(msg.format(len(vec_alphanumeric.get_feature_names_out())))
print(vec_alphanumeric.get_feature_names_out()[:30].tolist())

msg = "There are {} tokens in Position_Extra if we split on white space"
print(msg.format(len(vec_basic.get_feature_names_out())))
print(vec_basic.get_feature_names_out()[:30].tolist())

There are 385 tokens in Position_Extra if we split on non-alpha numeric
['1st', '2nd', '3rd', '4th', '56', '5th', '9th', 'a', 'ab', 'accountability', 'adaptive', 'addit', 'additional', 'adm', 'admin', 'administrative', 'adult', 'aide', 'air', 'alarm', 'alt', 'and', 'any', 'area', 'arra', 'art', 'arts', 'assessment', 'assistant', 'assistive']
There are 415 tokens in Position_Extra if we split on white space
['&', '(no', '(slp)', '-', '-2nd', '1st', '2nd', '3rd', '4th', '56', '5th', '9th', 'a', 'ab', 'accountability', 'adaptive', 'addit', 'additional', 'adm', 'admin', 'administrative', 'adult', 'aide', 'air', 'alarm', 'alt', 'and', 'any', 'area', 'art']


Treating only alphanumeric characters as tokens generates a smaller number of more meaningful tokens (385 versus 415).