# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
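from IPython.display import display
# set_matplotlib_formats is deprecated in IPython.display; it now lives in matplotlib_inline
from matplotlib_inline.backend_inline import set_matplotlib_formats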
import myst_nb

import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'plotly_mimetype+svg'
pio.templates['book'] = go.layout.Template(
    layout=dict(
        margin=dict(l=10, r=10, t=10, b=10),
        autosize=True,
        width=350, height=250,
    )
)
pio.templates.default = 'seaborn+book'

set_matplotlib_formats('svg')
sns.set()
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:text)=
# Working with Text

Much data is recorded as text rather than as numbers: names of dog breeds, restaurant violation descriptions, street addresses, web logs, and free-form text in books, documents, blog posts, and internet comments. To organize and analyze the information contained in text, we often need to work with strings in the following ways:

+ Convert text into a standard format. This is also referred to as *canonicalizing text*. For example, we might convert characters to lowercase, use common spellings and abbreviations, or remove punctuation and blanks.
+ Extract a piece of text to create a feature. For example, a string might contain an embedded date that we want to pull out to create a date feature.
+ Transform text into a feature. For example, we might create a 0-1 variable that indicates whether a string contains any one of several words.
+ Analyze text. To analyze and compare documents, we might transform each document into a vector of word counts.
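
As a rough preview of these four tasks, here is a minimal sketch that uses only Python's standard library. The inspection string, the keyword set, and the variable names are invented for illustration and are not taken from the datasets used later in the chapter.

```python
import re
from collections import Counter

# Hypothetical example string, invented for illustration
raw = "  Inspection on 2021-03-15: Vermin found; NO hot water!  "

# 1. Canonicalize: trim blanks, lowercase, and drop punctuation
#    (hyphens are kept so the date stays intact)
canonical = re.sub(r"[^\w\s-]", "", raw.strip().lower())

# 2. Extract: pull out the embedded date, assuming a YYYY-MM-DD layout
match = re.search(r"\d{4}-\d{2}-\d{2}", raw)
date = match.group() if match else None

# 3. Transform: 0-1 indicator for whether any of several words appears
keywords = {"vermin", "rodent", "roach"}  # assumed keyword list
has_pest = int(any(word in keywords for word in canonical.split()))

# 4. Analyze: represent the text as word counts
word_counts = Counter(canonical.split())

print(canonical)
print(date, has_pest)
print(word_counts.most_common(3))
```

Steps 1 and 3 could be done with plain string methods alone; the regular expressions in steps 1 and 2 are one convenient option, not the only one.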

As with most types of data, there are many techniques for working with text to address these challenges, and in this chapter we introduce a few of them. We show that simple string manipulation tools are often all we need to put text in a standard form or to extract portions of strings. We also introduce regular expressions, which provide more general and robust pattern matching for searching strings and analyzing documents. To make these text operations concrete, we begin with several examples that need string manipulation or regular expressions to prepare text for analysis.