In [1]:
import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

In [2]:
election = pd.DataFrame({
    'County': ['De Witt County', 'Lac qui Parle County', 'Lewis and Clark County',
               'St John the Baptist Parish'],
    'State': ['IL', 'MN', 'MT', 'LA'],
    'Voted': ['97.8', '98.8', '95.2', '52.6']
})
census = pd.DataFrame({
    'County': ['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist'],
    'State': ['IL', 'MN', 'MT', 'LA'],
    'Population': ['16,798', '8,067', '55,716', '43,044']
})

In [3]:
log_entry = r'169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'

# String Manipulation

There are a few basic string manipulation tools that we use all the time 
when working with text:

- Transform upper case characters to lower case (or vice versa).
- Replace a substring with another or delete a substring.
- Split a string into pieces at a particular character. 
- Slice a string at specified locations.

We'll show how we can combine these tools to clean up the county names data.
Remember that we have two tables that we want to join, but the county names are
spelled inconsistently. Below, we've displayed the `election` 
and `census` dataframes.

In [4]:
dfs_side_by_side(election, census)

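|     | County                     | State | Voted |
| --- | -------------------------- | ----- | ----- |
| 0   | De Witt County             | IL    | 97.8  |
| 1   | Lac qui Parle County       | MN    | 98.8  |
| 2   | Lewis and Clark County     | MT    | 95.2  |
| 3   | St John the Baptist Parish | LA    | 52.6  |

|     | County               | State | Population |
| --- | -------------------- | ----- | ---------- |
| 0   | DeWitt               | IL    | 16798      |
| 1   | Lac Qui Parle        | MN    | 8067       |
| 2   | Lewis & Clark        | MT    | 55716      |
| 3   | St. John the Baptist | LA    | 43044      |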


## Converting Text to a Standard Format with Python String Methods

We need to address the following inconsistencies between the county names in
the two tables.

1.  Capitalization: `qui` vs `Qui`
1.  Omission of words: `County` and `Parish` are absent from the `census` table
1.  Different abbreviation conventions: `&` vs `and`
1.  Different punctuation conventions: `St.` vs `St` 
1.  Use of whitespace: `DeWitt` vs `De Witt`

When we clean text, we usually start by converting all of the characters to
lowercase---it’s easier to work with all lowercase characters than to try to
track combinations of uppercase and lowercase. Next, we want to fix
inconsistent words by replacing `&` with `and` and removing `County` and
`Parish`. Finally, we'll fix up the punctuation and whitespace inconsistencies.

Following these steps, we create a function called `clean_county` that cleans
an input county name using two of Python's string methods, `lower()` and
`replace()`.

In [6]:
def clean_county(county):
    return (county
            .lower()
            .replace('county', '')
            .replace('parish', '')
            .replace('&', 'and')
            .replace('.', '')
            .replace(' ', ''))
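
To see what each call in the chain contributes, the following sketch applies the same steps one at a time to a single name. The input string and the `step` variable names are chosen here for illustration. Note that lowercasing comes first, so that the later `replace('county', '')` call matches `County`.

```python
name = 'Lewis & Clark County'  # illustrative input mixing conventions from both tables

step1 = name.lower()                 # 'lewis & clark county'
step2 = step1.replace('county', '')  # 'lewis & clark '
step3 = step2.replace('parish', '')  # no 'parish' here, so unchanged
step4 = step3.replace('&', 'and')    # 'lewis and clark '
step5 = step4.replace('.', '')       # no periods here, so unchanged
step6 = step5.replace(' ', '')       # 'lewisandclark'

print(step6)
```

Each method returns a new string rather than modifying the original, which is why the calls can be chained.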

Python provides a variety of methods for basic string manipulation. Although
simple, these methods are the primitives that we piece together to form more
complex string operations. They are defined on all Python strings and do not
require importing other modules. Although it is worth familiarizing yourself
with [the complete list of string
methods](https://docs.python.org/3/library/stdtypes.html#string-methods), we
describe a few of the most commonly used methods in the table below.

| Method              | Description                                                                          |
| ------------------- | ------------------------------------------------------------------------------------ |
| `str.lower()`       | Returns a copy of `str` with all letters converted to lowercase                      |
| `str.replace(a, b)` | Returns a copy of `str` with every instance of the substring `a` replaced by `b`     |
| `str.strip()`       | Returns a copy of `str` with leading and trailing whitespace removed                 |
| `str.split(a)`      | Returns a list of the substrings of `str` obtained by splitting at each `a`          |
| `str[x:y]`          | Slices `str`, returning the characters at indices `x` (inclusive) to `y` (exclusive) |
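
As a quick illustration of the table, the sketch below applies each operation to a made-up value. The string `raw` and the variable names are assumptions for this example and do not come from the tables above.

```python
raw = '  Lac Qui Parle County  '  # made-up value with extra whitespace

lowered = raw.lower()                        # '  lac qui parle county  '
stripped = lowered.strip()                   # 'lac qui parle county'
replaced = stripped.replace(' county', '')   # 'lac qui parle'
words = replaced.split(' ')                  # ['lac', 'qui', 'parle']
first_three = replaced[0:3]                  # 'lac'

print(words, first_three)
```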

We next verify that the `clean_county` function produces matching names for all the counties in both tables:

In [7]:
([clean_county(county) for county in election['County']],
 [clean_county(county) for county in census['County']])

(['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'],
 ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'])

Since each county name in both tables now has the same transformed
representation, we can successfully join the two tables.
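
With more than a handful of counties, comparing the two lists by eye is impractical. One possible check, sketched here with Python sets, is to look for cleaned names that appear in only one table. The county lists are copied from the tables above; the variable names are illustrative.

```python
def clean_county(county):
    return (county
            .lower()
            .replace('county', '')
            .replace('parish', '')
            .replace('&', 'and')
            .replace('.', '')
            .replace(' ', ''))

election_names = ['De Witt County', 'Lac qui Parle County',
                  'Lewis and Clark County', 'St John the Baptist Parish']
census_names = ['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark',
                'St. John the Baptist']

election_clean = {clean_county(name) for name in election_names}
census_clean = {clean_county(name) for name in census_names}

# Names left over on either side would fail to match in a join.
only_in_election = election_clean - census_clean
only_in_census = census_clean - election_clean
print(only_in_election, only_in_census)
```

Two empty sets indicate that every cleaned name has a counterpart in the other table.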

## String Methods in pandas

In the code above, we used a list comprehension to loop over the county names
and transform each one. `pandas` Series objects provide a convenient way to
apply string methods to each item in the series.

The `.str` property on `pandas` Series exposes many of the same string methods
that Python does. Calling a method on the `.str` property calls the method on
each item in the series. This allows us to transform each string in the series
without writing a loop. We save the transformed counties back into their
originating tables:

In [14]:
election['County'] = (election['County']
 .str.lower()
 .str.replace('parish', '')
 .str.replace('county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)
 .str.replace(' ', ''))

census['County'] = (census['County']
 .str.lower()
 .str.replace('parish', '')
 .str.replace('county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)
 .str.replace(' ', ''))
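
Since the same chain is applied to both tables, one option is to wrap it in a helper. The sketch below assumes a hypothetical function name, `clean_county_series`, and a small made-up Series. It passes `regex=False` on every call so that each pattern is treated as literal text, because the default for `regex` has differed across `pandas` versions.

```python
import pandas as pd

def clean_county_series(counties):
    """Apply the county-cleaning chain to every string in a Series."""
    return (counties
            .str.lower()
            .str.replace('parish', '', regex=False)
            .str.replace('county', '', regex=False)
            .str.replace('&', 'and', regex=False)
            .str.replace('.', '', regex=False)
            .str.replace(' ', '', regex=False))

# Made-up sample mixing the conventions of both tables
counties = pd.Series(['De Witt County', 'St. John the Baptist', 'Lewis & Clark'])
cleaned = clean_county_series(counties)
print(cleaned.tolist())
```

The helper returns a new Series and leaves its input unchanged, so it can be assigned back to a column in the same way as the chains above.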

Now the two tables contain the same string representation of the county names, and we can join them.

In [15]:
election.merge(census, on=['County','State'])

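|     | County           | State | Voted | Population |
| --- | ---------------- | ----- | ----- | ---------- |
| 0   | dewitt           | IL    | 97.8  | 16798      |
| 1   | lacquiparle      | MN    | 98.8  | 8067       |
| 2   | lewisandclark    | MT    | 95.2  | 55716      |
| 3   | stjohnthebaptist | LA    | 52.6  | 43044      |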


:::{note}

Note that we merged on two columns: the county name and the state. We did
this because some states have counties with the same name. For example,
California and New York both have a county called Kings.

:::
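
To illustrate why the state matters, here is a small sketch with two states that share a county name. The numbers are placeholders invented for this example, not real election or census figures.

```python
import pandas as pd

# Placeholder values for illustration only
votes = pd.DataFrame({
    'County': ['kings', 'kings'],
    'State': ['CA', 'NY'],
    'Voted': [1, 2],
})
people = pd.DataFrame({
    'County': ['kings', 'kings'],
    'State': ['CA', 'NY'],
    'Population': [10, 20],
})

# Merging on the county name alone pairs every 'kings' row with every other.
by_county = votes.merge(people, on='County')

# Merging on both columns keeps only the rows where the state also agrees.
by_both = votes.merge(people, on=['County', 'State'])

print(len(by_county), len(by_both))
```

The county-only merge produces four rows, two of which wrongly combine a California record with a New York record, while the two-column merge produces the expected two rows.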

Python's string methods form a set of simple and useful operations for string
manipulation. `pandas` Series provide the same methods through the `.str`
accessor, which applies the underlying Python method to each string in the
series. To see the complete list of methods, we recommend looking at the
[Python documentation on `str` methods][py_str] and the [`pandas`
documentation for the `.str` accessor][pd_str]. We did the canonicalization
task above using only `str.lower()` and multiple calls to `str.replace()`.
Next, we'll extract text with another string method, `str.split()`.

[py_str]: https://docs.python.org/3/library/stdtypes.html#string-methods
[pd_str]: https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary


## Splitting Strings to Extract Pieces of Text

Let's say we want to extract the date from the web server log entry shown
below.

In [5]:
log_entry

'169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'

In [6]:
from textwrap import fill
print(fill(log_entry, width=79))

169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1"
301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0;
Windows NT 5.0; .NET CLR 1.1.4322)"


String splitting can help us narrow in on the pieces of information that form
the date. For example, when we split the string on the left bracket, we get two
strings:

In [9]:
log_entry.split('[')

['169.237.46.168 - - ',
 '26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"']

The second string has the date information, and if we want the day, month, and
year, we can split that string on a colon. 

In [10]:
log_entry.split('[')[1].split(':')[0]

'26/Jan/2004'

To separate out these three parts of the date, we can split on the forward
slash. All together we split the original string three times, each time keeping
only the pieces we are interested in. 

In [11]:
(log_entry.split('[')[1]
 .split(':')[0]
 .split('/'))

['26', 'Jan', '2004']
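
The three splits can also be written with named intermediate values, which makes it easier to see what each step keeps. This is a sketch: the function name `extract_date` is hypothetical, and the sample string is a shortened copy of the log entry above.

```python
def extract_date(entry):
    """Return [day, month, year] as strings from a log entry like the one above."""
    after_bracket = entry.split('[')[1]      # everything after the first '['
    date_part = after_bracket.split(':')[0]  # text before the first ':', e.g. '26/Jan/2004'
    return date_part.split('/')              # ['26', 'Jan', '2004']

sample = ('169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]'
          '"GET /stat141/Winter04 HTTP/1.1" 301 328')
day, month, year = extract_date(sample)
print(day, month, year)
```

This approach relies on the entry having exactly this layout. For instance, a string with no `[` would raise an `IndexError` at the first step.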

By repeatedly using `split()`, we can extract all the parts of the log
entry. But this approach is complicated. If we also wanted to get the
hour, minute, second, and time zone of the activity,
we would need to use `split()` six times in total.
There's a simpler way to extract the parts:

In [34]:
import re

pattern = r'[ \[/:\]]' 
re.split(pattern, log_entry)[4:11]

['26', 'Jan', '2004', '10', '47', '58', '-0800']

This alternative approach uses a powerful tool called a regular expression,
which we'll cover in the next section.
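
For comparison, here is one possible way to get the same seven pieces using only `split()`. This is a sketch, and other sequences of splits would work as well. The entry is a shortened copy of the one above, and the variable names are illustrative.

```python
log_entry = ('169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]'
             '"GET /stat141/Winter04 HTTP/1.1" 301 328')  # shortened copy

# Isolate the text between the square brackets.
timestamp = log_entry.split('[')[1].split(']')[0]  # '26/Jan/2004:10:47:58 -0800'

# Separate the time zone from the date and time.
clock, zone = timestamp.split(' ')

# The first colon-separated piece is the date; the rest are the time fields.
date, hour, minute, second = clock.split(':')
day, month, year = date.split('/')

parts = [day, month, year, hour, minute, second, zone]
print(parts)
```

Each split needs its own delimiter and its own bookkeeping about which piece to keep, which is the complexity that the single `re.split` call avoids.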