# Combining text columns for tokenization

To build a bag-of-words representation of all the text data in our DataFrame, we must first combine the text in each row of the DataFrame into a single string.

We previously looked at a single column, so this wasn't necessary: each row of that column was already a single string. `CountVectorizer` expects each row to be a single string, so to use all of the text columns we need a way to turn a list of strings into a single string.

We'll define the function `combine_text_columns()`. It converts all text data in the DataFrame to a single string per row, which can then be passed to the vectorizer object and made into a bag-of-words using the `.fit_transform()` method.

In [1]:
# ignore deprecation warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# set seed for reproducibility
np.random.seed(0)

df = pd.read_csv('../data/TrainingData.csv', index_col=0)

NUMERIC_COLUMNS = ['FTE', 'Total']
LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']

In [3]:
# Define combine_text_columns()
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """Combine all text columns in each row of data_frame into a single string."""
    
    # Drop only the non-text columns that are actually present in the DataFrame
    to_drop = [col for col in to_drop if col in data_frame.columns]
    text_data = data_frame.drop(columns=to_drop)
    
    # Replace NaNs with empty strings so every value can be joined
    text_data = text_data.fillna('')
    
    # Join all text items in a row, separated by a single space
    return text_data.apply(lambda x: " ".join(x), axis=1)
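
As a quick sanity check, here is a minimal sketch of the same idea on a toy DataFrame. The column names mimic the real dataset, but the values are invented for illustration, and the function is restated in compact form with a shortened, hypothetical `to_drop` default so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the values are made up for illustration only
toy = pd.DataFrame({
    "FTE": [1.0, np.nan],                              # numeric column
    "Total": [5000.0, 120.5],                          # numeric column
    "Function": ["Teacher Compensation", "NO_LABEL"],  # label column
    "Job_Title": ["Teacher", np.nan],                  # text column with a missing value
    "Text_1": ["Regular Instruction", "Supplies"],     # text column
})

def combine_text_columns(data_frame, to_drop=("FTE", "Total", "Function")):
    """Combine all text columns in each row into a single string."""
    # Drop only the non-text columns that are actually present
    to_drop = [col for col in to_drop if col in data_frame.columns]
    text_data = data_frame.drop(columns=to_drop).fillna("")
    # Join the remaining values in each row with a single space
    return text_data.apply(lambda row: " ".join(row), axis=1)

combined = combine_text_columns(toy)
print(combined.tolist())
# ['Teacher Regular Instruction', ' Supplies']
```

Note that a missing value becomes an empty string, so the second row starts with a space. This is harmless for tokenization, since both token patterns used below ignore whitespace.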

Now we'll use `combine_text_columns()` to convert all text data in our DataFrame to a single Series of strings, one per row, that can be passed to the vectorizer object and made into a bag-of-words using the `.fit_transform()` method.

We'll compare two ways of tokenizing: treating any run of non-whitespace characters as a token, and treating only runs of alphanumeric characters as a token (so the vectorizer accepts only alphanumeric characters in our tokens).

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Create the basic & alphanumeric token pattern
TOKENS_BASIC = '\\S+(?=\\s+)'
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate basic CountVectorizer & alphanumeric CountVectorizer
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = combine_text_columns(df)

# Fit and transform the models
vec_basic.fit_transform(text_vector)
vec_alphanumeric.fit_transform(text_vector)

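# Print number of tokens of vec_basic & vec_alphanumeric
# (vocabulary_ maps each token to its column index; unlike get_feature_names(),
# which was removed in newer scikit-learn releases, it is available in all versions)
print("There are {} tokens in the dataset".format(len(vec_basic.vocabulary_)))
print("There are {} alpha-numeric tokens in the dataset".format(len(vec_alphanumeric.vocabulary_)))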

There are 4757 tokens in the dataset
There are 3284 alpha-numeric tokens in the dataset
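
To see why the two patterns can produce different tokens, here is a minimal sketch that applies them to a made-up string using Python's `re` module directly rather than `CountVectorizer`. The sample text is an assumption for illustration, written in lowercase because `CountVectorizer` lowercases input by default.

```python
import re

# Same patterns as above, written as raw strings
TOKENS_BASIC = r'\S+(?=\s+)'
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'

# Hypothetical sample text (note the trailing space)
sample = "petro-vend fuel & fluids 2nd "

basic_tokens = re.findall(TOKENS_BASIC, sample)
alnum_tokens = re.findall(TOKENS_ALPHANUMERIC, sample)

print(basic_tokens)   # ['petro-vend', 'fuel', '&', 'fluids', '2nd']
print(alnum_tokens)   # ['vend', 'fuel', 'fluids', '2nd']

# The lookahead (?=\s+) requires whitespace after a token,
# so a final token with nothing after it is not matched
no_trailing_space = re.findall(TOKENS_BASIC, "fuel oil")
print(no_trailing_space)  # ['fuel']
```

The alphanumeric pattern drops the standalone `&` and keeps only the part of `petro-vend` that is followed by whitespace. On real data, many punctuated variants therefore either disappear or collapse into tokens that already exist, which is consistent with the smaller vocabulary reported above.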


Notice that tokenizing on alpha-numeric tokens reduced the number of tokens from 4757 to 3284, just as in the last exercise. We'll keep this in mind when building a better model with the Pipeline object next.