In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display, set_matplotlib_formats
import myst_nb

import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'plotly_mimetype+svg'
pio.templates['book'] = go.layout.Template(
    layout=dict(
        margin=dict(l=10, r=10, t=10, b=10),
        autosize=True,
        width=350, height=250,
    )
)
pio.templates.default = 'seaborn+book'

set_matplotlib_formats('svg')
sns.set()
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

In [2]:
election = pd.DataFrame({
    'County': ['De Witt County', 'Lac qui Parle County',
               'Lewis and Clark County', 'St John the Baptist Parish'],
    'State': ['IL', 'MN', 'MT', 'LA'],
    'Voted': ['97.8', '98.8', '95.2', '52.6']
})
census = pd.DataFrame({
    'County': ['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist'],
    'State': ['IL', 'MN', 'MT', 'LA'],
    'Population': ['16,798', '8,067', '55,716', '43,044']
})

# String Manipulation

A few basic string manipulation tools can be handy for cleaning text or extracting a portion of a string. These common tools manipulate strings in the following ways:

- Transform upper case characters to lower case (or vice versa).
- Replace a substring with another or delete a substring.
- Split a string into pieces at a particular character. 
- Slice a string at specified locations. 

We need only the first two of these tools to clean the county names so that they are consistent across the two tables. 
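
As a quick illustration of the four kinds of operations listed above, the sketch below applies each one to a single county name from the `election` table. The variable names are ours and are chosen only for this example.

```python
# A county name taken from the election table
name = 'St John the Baptist Parish'

lowered = name.lower()                            # transform upper case to lower case
without_parish = lowered.replace(' parish', '')   # delete a substring
words = without_parish.split(' ')                 # split into pieces at each space
prefix = name[:2]                                 # slice out the first two characters

print(lowered)
print(without_parish)
print(words)
print(prefix)
```

Each operation returns a new value and leaves `name` unchanged.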

## Converting Text to a Standard Format

We need to address the following inconsistencies between the county names in the two tables.

1.  Capitalization: `qui` vs `Qui`
1.  Different punctuation conventions: `St.` vs `St` 
1.  Omission of words: `County` and `Parish` are absent from the `census` table
1.  Use of whitespace: `DeWitt` vs `De Witt`
1.  Different abbreviation conventions: `&` vs `and` 

When we clean text, a typical first step is to convert all of the characters to lower case. It’s much easier to work with all lower case characters than to try to track combinations of upper and lower case. As a second step, we fix inconsistent words: we replace `&` with `and`, and we remove `County` and `Parish`. As a final step, we remove the punctuation and whitespace inconsistencies. We demonstrate how to do this with Python string methods.

## Python String Methods

Python provides a variety of methods for basic string manipulation. Although simple, these methods are the primitives that piece together to form more complex string operations. 
These methods are conveniently defined on all Python strings and do not require importing other modules. Although it is worth familiarizing yourself with [the complete list of string methods](https://docs.python.org/3/library/stdtypes.html#string-methods), we describe a few of the most commonly used methods in the table below.

| Method              | Description                                                                          |
| ------------------- | ------------------------------------------------------------------------------------ |
| `str.lower()`       | Returns a copy of `str` with all letters converted to lowercase                      |
| `str.replace(a, b)` | Returns a copy of `str` with every instance of substring `a` replaced by `b`         |
| `str.strip()`       | Returns a copy of `str` with leading and trailing whitespace removed                 |
| `str.split(a)`      | Returns a list of the substrings of `str` obtained by splitting at substring `a`     |
| `str[x:y]`          | Slices `str`, returning the characters at indices `x` (inclusive) to `y` (exclusive) |
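
Because Python strings are immutable, each of these methods returns a new string rather than modifying the original, which is what makes it possible to chain them. The short sketch below shows this with the `DeWitt  ` entry from the `census` table, which has two trailing spaces. The variable names are ours.

```python
raw = 'DeWitt  '   # census entry with two trailing spaces

stripped = raw.strip()             # new string without the trailing spaces
normalized = raw.strip().lower()   # methods can be chained left to right

# repr() makes the whitespace visible
print(repr(raw))
print(repr(stripped))
print(repr(normalized))
```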

Following the steps described in the previous section, we create a function called `clean_county` that normalizes an input county name.

In [3]:
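def clean_county(county):
    """Return a normalized county name for matching across tables."""
    return (county
            .lower()                 # 1. standardize capitalization
            .replace('county', '')   # 2. drop words that one table omits
            .replace('parish', '')
            .replace('&', 'and')     #    and spell out the abbreviation
            .replace('.', '')        # 3. remove punctuation
            .replace(' ', ''))       #    and all spaces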

We may now verify that the `clean_county` function produces matching names for all the counties in both tables:

In [4]:
([clean_county(county) for county in election['County']],
 [clean_county(county) for county in census['County']]
)

(['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'],
 ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'])

Since each county name in both tables now has the same transformed representation, we may successfully join the two tables.
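
With only four counties we can compare the two lists by eye, but for longer tables a programmatic check is more practical. One possible approach, sketched below, is to compare the sets of cleaned names and look at what is left over. The function is repeated so that the example runs on its own, and the list and set names are ours.

```python
def clean_county(county):
    return (county
            .lower()
            .replace('county', '')
            .replace('parish', '')
            .replace('&', 'and')
            .replace('.', '')
            .replace(' ', ''))

election_names = ['De Witt County', 'Lac qui Parle County',
                  'Lewis and Clark County', 'St John the Baptist Parish']
census_names = ['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist']

cleaned_election = {clean_county(name) for name in election_names}
cleaned_census = {clean_county(name) for name in census_names}

# Names that appear in one table but not the other
unmatched = cleaned_election ^ cleaned_census
print(unmatched)
```

An empty set means that every cleaned name in one table has a counterpart in the other.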

## String Methods in pandas

In the code above, we used list comprehensions to transform each county name. `pandas` Series objects provide a convenient way to apply string methods to each item in the series.

The `.str` property on `pandas` Series exposes many of the same string methods that Python strings have. Calling a method on the `.str` property calls the method on each item in the series.
This allows us to transform each string in the series without writing a loop.
We save the transformed counties back into their originating tables:

In [5]:
election['County'] = (election['County']
 .str.lower()
 .str.replace('parish', '')
 .str.replace('county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)  # match a literal period, not a regex wildcard
 .str.replace(' ', ''))

census['County'] = (census['County']
 .str.lower()
 .str.replace('parish', '')
 .str.replace('county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)  # match a literal period, not a regex wildcard
 .str.replace(' ', ''))



Now the two tables contain the same string representation of each county name, so we can join them.

In [6]:
election.merge(census, on=['County','State'])

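|   | County           | State | Voted | Population |
| - | ---------------- | ----- | ----- | ---------- |
| 0 | dewitt           | IL    | 97.8  | 16798      |
| 1 | lacquiparle      | MN    | 98.8  | 8067       |
| 2 | lewisandclark    | MT    | 95.2  | 55716      |
| 3 | stjohnthebaptist | LA    | 52.6  | 43044      |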


Note that we have merged on two columns: the county name and the state. We did this because some states have counties with the same name. For example, California and New York both have a Kings County.
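
To see why the second key matters, consider the small sketch below. The two tables are hypothetical: the county name is shared across two states, and the `Voted` and `Population` values are made-up placeholders, not real data.

```python
import pandas as pd

# Hypothetical tables with one county name that occurs in two states.
# The numeric values are placeholders, not real data.
left = pd.DataFrame({
    'County': ['kings', 'kings'],
    'State': ['CA', 'NY'],
    'Voted': [1, 2],
})
right = pd.DataFrame({
    'County': ['kings', 'kings'],
    'State': ['CA', 'NY'],
    'Population': [10, 20],
})

by_name = left.merge(right, on='County')             # pairs every 'kings' with every 'kings'
by_both = left.merge(right, on=['County', 'State'])  # pairs rows only within a state

print(len(by_name))
print(len(by_both))
```

Merging on the name alone produces four rows, two of which wrongly pair a county in one state with a county in the other. Merging on both columns keeps the two correct rows.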

Python's string methods form a set of simple and useful operations for string manipulation. `pandas` Series implement the same methods, applying the underlying Python method to each string in the series. You can find the complete documentation for Python's string methods [here](https://docs.python.org/3/library/stdtypes.html#string-methods) and for the `pandas` `str` methods [here](https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary).
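
The correspondence between the two styles can be checked directly. The sketch below cleans the census county names once with the `.str` accessor and once with plain Python string methods in a list comprehension, then compares the results. We pass `regex=False` so that each pattern is treated as a literal string regardless of the `pandas` version; the variable names are ours.

```python
import pandas as pd

counties = pd.Series(['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark',
                      'St. John the Baptist'])

# Vectorized version: each call applies to every string in the series
via_str = (counties
           .str.lower()
           .str.replace('&', 'and', regex=False)
           .str.replace('.', '', regex=False)
           .str.replace(' ', '', regex=False))

# Equivalent version with plain Python string methods
via_python = [name.lower().replace('&', 'and').replace('.', '').replace(' ', '')
              for name in counties]

print(via_str.tolist())
print(via_str.tolist() == via_python)
```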

We have carried out the first task using only `str.lower()` and multiple calls to `str.replace()`. Next, we tackle the task of extracting text with another string method, `str.split()`.

## Extracting a Piece of Text

Recall the issue of extracting the date from a Web log entry (shown below) to create a feature. 

In [7]:
log_entry = r'169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'

In [8]:
log_entry

'169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'

String splitting can help us narrow in on the pieces of information that form the date. For example, when we split the string on the left bracket, we get two strings:

In [9]:
log_entry.split('[')

['169.237.46.168 - - ',
 '26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"']

The second string has the date information, and if we want the day, month, and year, we can split that string on a colon. 

In [10]:
log_entry.split('[')[1].split(':')[0]

'26/Jan/2004'

To separate out these three parts of the date, we can split on the forward slash. Altogether, we split the original string three times, each time keeping only the piece we are interested in.

In [11]:
(log_entry.split('[')[1]
 .split(':')[0]
 .split('/'))

['26', 'Jan', '2004']
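
If we needed to repeat this extraction for many log entries, the three splits could be wrapped in a small function. The sketch below is one way to do that; `extract_date` is a hypothetical helper name, and it assumes that the first `[` in an entry opens the timestamp, as it does in our example. The entry is a shortened copy of the one above.

```python
def extract_date(entry):
    """Return [day, month, year] from a log entry (hypothetical helper)."""
    after_bracket = entry.split('[')[1]       # text following the first '['
    date_part = after_bracket.split(':')[0]   # text before the first ':'
    return date_part.split('/')               # day, month, and year

# Shortened copy of the log entry above
entry = ('169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]'
         '"GET /stat141/Winter04 HTTP/1.1" 301 328')

print(extract_date(entry))
```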

If we also want the hour, minute, second, and time zone of the activity, we can use `split()` six times altogether to get all seven components. But this approach feels a bit hacky. Instead, we can split on several characters at once with the simple expression below.

In [12]:
import re

# Character class that matches a left bracket, a forward slash, or a colon
pattern = r'[\[/:]'
re.split(pattern, log_entry)[1:4]

['26', 'Jan', '2004']

This alternative approach uses a regular expression. We introduce this powerful notion next.
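
As a preview, the same idea extends to the time and time zone. The sketch below is one possible way to do it, not the only one: it first isolates the text between the square brackets with two ordinary splits, and then splits that text at every forward slash, colon, or space. The entry is a shortened copy of the one above, and the variable names are ours.

```python
import re

# Shortened copy of the log entry above
entry = ('169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]'
         '"GET /stat141/Winter04 HTTP/1.1" 301 328')

# Isolate the text between the square brackets
timestamp = entry.split('[')[1].split(']')[0]

# Split at every forward slash, colon, or space
parts = re.split(r'[/: ]', timestamp)
print(parts)
```

The result holds the day, month, year, hour, minute, second, and time zone, in that order.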