# String Manipulation
*Curtis Miller*

Python is well equipped to manipulate strings. Common tasks include changing string format, finding substrings, and replacing string contents.

## Changing String Format
We can change the contents of strings to be more amenable to our analyses. Let's demonstrate on the opening paragraph of *Moby Dick*.

In [None]:
from pandas import DataFrame
import re, string    # Useful libraries
from datetime import datetime

In [None]:
moby = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me. "
moby

In [None]:
print(moby.upper())

In [None]:
print(moby.lower())

In [None]:
moby2 = moby.lower().split(" ")    # Make lowercase and split at spaces
moby2[0:10]

In [None]:
"-".join(moby2[:10])    # Form new string

In [None]:
string.punctuation

In [None]:
moby3 = "".join(c for c in moby.lower() if c not in string.punctuation)    # Remove punctuation
print(moby3)

In [None]:
# There is extra whitespace at the end; let's remove it
moby3.strip()

In [None]:
moby3 = moby3.strip()
# Replace all "extra" whitespace with exactly one space; need regular expressions
moby3 = re.sub(r'\s+', ' ', moby3)    # \s+ matches one or more whitespace characters; raw string keeps the backslash intact
moby3

In [None]:
# Now we can get a list containing just the words
moby4 = moby3.split(" ")
moby4[:10]
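
The cleaning steps above (lowercase, drop punctuation, normalize whitespace, split) can be collected into one helper. The following is a minimal sketch, not part of the original notebook: the name `clean_words` is hypothetical, and it swaps in `str.translate()` and argument-free `split()` for the generator expression and `re.sub()` used above.

```python
import string

# Translation table that deletes every character found in string.punctuation
_DROP_PUNCT = str.maketrans("", "", string.punctuation)

def clean_words(text):
    """Lowercase text, remove punctuation, and return the list of words (hypothetical helper)."""
    no_punct = text.lower().translate(_DROP_PUNCT)
    # split() with no argument splits on runs of whitespace and ignores
    # leading and trailing whitespace, so no strip() or re.sub() is needed
    return no_punct.split()

words = clean_words("Call me Ishmael.  Some years ago - never mind how long precisely - ")
print(words)
```

Calling `clean_words(moby)` should give the same kind of word list as `moby4`, assuming `moby` is defined as above.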

## Finding Substrings with Regular Expressions

Extracting substrings can be as simple or as complex as you need. Here, I demonstrate identifying useful information using regular expressions. I read in a log file I obtained from [here](http://www.monitorware.com/en/logsamples/apache.php). I'll extract information from this log file, reading it line by line and putting the data in a list of dictionaries. (You can read more about writing regular expressions [here](https://docs.python.org/3/library/re.html).)

In [None]:
sample_line = "64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] \"GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1\" 200 4523"
print(sample_line)

In [None]:
# Format: caller - - [datetime -offset] "REQUEST" response_code length
# (In Apache's log format the four digits after the '-' are the UTC offset; this notebook stores them under the name "port")
# Now we develop regex strings to extract this information from the log file
# Raw strings (r"...") pass backslashes through to the regex engine unchanged
caller_regex = r"[\x21-\x7E]+(?= - - )"    # Match one or more printable, non-space ASCII characters followed by " - - "
time_regex = r"[0-9]{2}/[a-zA-Z]{3}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}"    # Match date/time in ##/AAA/####:##:##:## form
port_regex = r"(?<=-)[0-9]{4}(?=\])"    # Match four digits preceded by '-' and followed by ']'
request_regex = r'(?<=").+(?=")'     # Match a request, wrapped in double quotes
trailing_regex = r'(?<=" )[0-9]+ [0-9]+'    # Match the last two numbers, preceded by a double quote and a space

In [None]:
# Compile regex identifiers
caller_prog = re.compile(caller_regex)
time_prog = re.compile(time_regex)
port_prog = re.compile(port_regex)
request_prog = re.compile(request_regex)
trailing_prog = re.compile(trailing_regex)

# Test
caller_prog.search(sample_line)

In [None]:
caller_prog.search(sample_line).group(0)

In [None]:
time_prog.search(sample_line).group(0)

In [None]:
port_prog.search(sample_line).group(0)

In [None]:
request_prog.search(sample_line).group(0)

In [None]:
trailing_prog.search(sample_line).group(0)
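
The patterns above rely on lookarounds: `(?=...)` requires that some text follows the match, and `(?<=...)` requires that some text precedes it, but neither becomes part of the matched string. The sketch below illustrates this on the same sample line and, for comparison, extracts the request with an ordinary capture group. The variable names are illustrative, and `\S+` is used here as a shorthand in place of the `[\x21-\x7E]+` class.

```python
import re

sample_line = ('64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] '
               '"GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523')

# Lookahead: match non-space characters only where " - - " follows;
# the " - - " itself is left out of the match
caller = re.search(r"\S+(?= - - )", sample_line).group(0)

# Lookbehind plus lookahead: match what sits between the double quotes
# without including the quotes
request = re.search(r'(?<=").+(?=")', sample_line).group(0)

# The same text via a capture group: the quotes are matched, but
# group(1) returns only the parenthesized part
request_via_group = re.search(r'"(.+)"', sample_line).group(1)

print(caller)
print(request)
print(request == request_via_group)
```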

In [None]:
# Now let's turn this log file into a list of records
req = list()
with open("logfile.txt") as f:
    for line in f:    # Each line read from a text file is already a str
        # Pull each field out of the line; these will be added into a dict
        # (if a pattern fails to match, search() returns None and .group(0) raises AttributeError)
        caller = caller_prog.search(line).group(0)
        calltime = time_prog.search(line).group(0)
        port = port_prog.search(line).group(0)
        request = request_prog.search(line).group(0)
        finalnum = trailing_prog.search(line).group(0).split(" ")
        req.append({"caller": caller,
                    # Parse the timestamp string into a datetime object
                    "time": datetime.strptime(calltime, "%d/%b/%Y:%H:%M:%S"),
                    "port": port,
                    "request": request,
                    "status": finalnum[0],
                    "size": int(finalnum[1])})

req[:5]

In [None]:
# A dataframe containing this data
df = DataFrame(req)
df
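
As an alternative to five separate patterns, a single pattern with named groups can pull every field out in one pass. This is a sketch under assumptions rather than the notebook's method: `LOG_PATTERN` and `parse_line` are hypothetical names, the four signed digits are labeled `offset` rather than `port`, and a size of `-` (which Apache's common log format writes when no bytes are sent) is stored as 0.

```python
import re
from datetime import datetime

# One pattern for the whole line; each (?P<name>...) is a named group
LOG_PATTERN = re.compile(
    r'(?P<caller>\S+) - - '
    r'\[(?P<time>[^\]]+) (?P<offset>[+-][0-9]{4})\] '
    r'"(?P<request>.+)" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match (hypothetical helper)."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S")
    rec["status"] = int(rec["status"])
    # Assumption: treat a "-" size as zero bytes
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

sample_line = ('64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] '
               '"GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523')
record = parse_line(sample_line)
print(record)
print(parse_line("not a log line"))
```

Returning `None` for a non-matching line lets the caller skip malformed lines instead of stopping on an `AttributeError`.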

## Replacing Contents with `format()`

If we can extract information from strings, we can certainly put it into them. Here I demonstrate using the string method `format()` to create formatted strings.

In [None]:
"My bonnie lies over the {}.".format("ocean")    # Replace {} with "ocean"

In [None]:
"My {} lies over the {}.".format("bonnie", "sea")    # Done in order

In [None]:
"My {1} lies over the {0}.".format("ocean", "bonnie")    # Give numbers for which argument to substitute

In [None]:
"Oh {verb} back my {noun} to me!".format(verb="bring", noun="bonnie")    # Keyword arguments
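
Replacement fields can also carry a format specification after a colon, which controls width, alignment, and numeric or date formatting. The values below are illustrative only; this is a brief sketch of a few common specifications, not a full tour.

```python
from datetime import datetime

# Width and alignment: pad "bonnie" to 10 characters, right-aligned
padded = "[{:>10}]".format("bonnie")

# Thousands separator and fixed decimal places
size_line = "{:,} bytes ({:.1f} KB)".format(4523, 4523 / 1024)

# datetime objects accept strftime codes after the colon
stamp = "{:%Y-%m-%d %H:%M}".format(datetime(2004, 3, 7, 16, 6, 51))

# An f-string (Python 3.6+) evaluates the expression inline
noun = "bonnie"
f_version = f"Oh bring back my {noun} to me!"

print(padded)
print(size_line)
print(stamp)
print(f_version)
```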

In [None]:
# Let's generate a revised log file that contains the same data but in a new format.
logline_template = "".join(["{time}: Client {caller} sent request \"{request}\" on port {port}; request returned code ",
                            "{status} with packet size {size}.\n"])
print(logline_template)

In [None]:
# A demonstration
s = df.iloc[0]
s

In [None]:
print(logline_template.format(**s))
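
The `**s` syntax unpacks a mapping into keyword arguments, so each `{name}` field in the template is filled from the entry with that label. The sketch below shows the same idea with a plain dictionary, using hypothetical names `record` and `template`, and compares it with `format_map()`, which takes the mapping directly.

```python
# A plain dict standing in for one row of the data frame (illustrative values)
record = {"caller": "64.242.88.10", "status": "200", "size": 4523, "unused": "ignored"}
template = "Client {caller} got status {status} ({size} bytes)"

# ** unpacks the dict into keyword arguments; keys the template never names are ignored
via_unpacking = template.format(**record)

# format_map() does the lookup on the mapping itself, without unpacking
via_format_map = template.format_map(record)

print(via_unpacking)
print(via_unpacking == via_format_map)
```

If the template names a field that the mapping lacks, both calls raise `KeyError`.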

In [None]:
loglines = df.apply(lambda s: logline_template.format(**s), axis=1)
loglines.head()

In [None]:
logstring = "".join(loglines.tolist())
print(logstring)

Regular expressions can also be used for replacing substrings.
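
As a brief sketch of that last point, `re.sub()` replaces every match of a pattern, and the replacement can be a literal string, a string that refers back to captured groups, or a function. The input strings below are illustrative.

```python
import re

# Literal replacement: collapse each run of whitespace into one space
collapsed = re.sub(r"\s+", " ", "call   me\tishmael")

# Backreferences: \1, \2, \3 refer to the captured groups, here used
# to reorder day/month/year into year-month-day
reordered = re.sub(r"([0-9]{2})/([A-Za-z]{3})/([0-9]{4})", r"\3-\2-\1", "07/Mar/2004:16:06:51")

# Function replacement: called once per match, returns the new text
masked = re.sub(r"[0-9]+", lambda m: "#" * len(m.group(0)), "64.242.88.10")

print(collapsed)
print(reordered)
print(masked)
```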