<div class="alert alert-block alert-info">
<b>

# Python for Data Science Bootcamp
## Lecture 16 Part 2
    
## Textbook reference: Python Data Science Handbook 
## Chapter 3

Here are the topics for this section:

* Introduction to Pandas string operations
* Pandas string methods
* Methods using regular expressions
* Miscellaneous methods

Let's get started...
</b>
</div>


<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Vectorized String Operations

### One strength of Python is its relative ease in handling and manipulating string data.
Pandas builds on this and provides a comprehensive set of *vectorized string operations* that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data.
### In this section, we'll walk through some of the Pandas string operations.

## Introducing Pandas String Operations

We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

In [4]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13]) # Create array of 6 elements
x * 2

array([ 4,  6, 10, 14, 22, 26])

### Vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.

For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax.

In [6]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

In [7]:
# Handling of missing data is problematic
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

### Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the ``str`` attribute of Pandas Series and Index objects containing strings.
So, for example, suppose we create a Pandas Series with this data:

In [8]:
import pandas as pd
names = pd.Series(data) # Create a Pandas Series
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

We can now call a **single method** that will capitalize all the entries, while skipping over any missing values:

In [9]:
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
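
For comparison, the plain-Python version needs an explicit guard for the missing value. The following is a minimal sketch; the variable names `guarded` and `vectorized` are illustrative only and do not appear elsewhere in this notebook.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Plain Python: every element must be checked before calling a string method
guarded = [s.capitalize() if s is not None else None for s in data]

# Pandas: the str accessor skips missing values on its own
vectorized = pd.Series(data).str.capitalize()

print(guarded)
print(vectorized)
```

Both approaches give the same capitalized names, but the Pandas version keeps the missing-value handling out of your own code.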

Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.

In [10]:
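# IPython help for the Pandas str accessor (a bare `str?` describes Python's built-in str type)
names.str?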
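
Outside an interactive session, where tab completion is not available, the same list can be approximated with `dir()`. This is a small sketch that assumes the public (non-underscore) attributes of `pd.Series.str` are the vectorized string methods; `str_methods` is an illustrative name.

```python
import pandas as pd

# Accessing .str on the Series class (not an instance) gives the accessor class itself
str_methods = [name for name in dir(pd.Series.str) if not name.startswith('_')]

print(len(str_methods))
print(str_methods[:10])
```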

## Tables of Pandas String Methods

If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties.
The examples in this section use the following series of names:

In [11]:
# Define a Pandas Series of strings
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

### Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

Notice that these have various return values. Some, like ``lower()``, **return a series of strings**:

In [12]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

But some others **return numbers**:

In [13]:
monte.str.len() # Length of each string in the Pandas Series

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

Or **Boolean values**:

In [14]:
monte.str.startswith('T') # Does each string start with the letter T?

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
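
Because the result is a Boolean Series, it can be summed to count the matches or used as a mask to select them. A brief sketch (the names `starts_with_t`, `n_starting_with_t`, and `terrys` are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

starts_with_t = monte.str.startswith('T')

# True counts as 1, so the sum is the number of names starting with 'T'
n_starting_with_t = int(starts_with_t.sum())

# Boolean indexing keeps only the rows where the mask is True
terrys = monte[starts_with_t]

print(n_starting_with_t)
print(terrys)
```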

Still others **return lists or other compound values for each element**:

In [15]:
monte.str.split() # Split each string on whitespace

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.
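
If separate columns are more convenient than a series of lists, `split()` accepts `expand=True` and returns a DataFrame instead. The sketch below assumes every name has exactly two parts; the column labels `first` and `last` are chosen here for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True spreads the pieces of each split across DataFrame columns
parts = monte.str.split(expand=True)
parts.columns = ['first', 'last']

print(parts)
```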

## Methods using regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning the groups of the first match as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |

With these, you can do a wide range of interesting operations.
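
As a quick illustration of several methods from the table, here is a short sketch applied to the `monte` series. The patterns and variable names are illustrative choices, and `regex=True` is passed to `replace()` explicitly because its default has differed between Pandas versions.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): True where the pattern is found anywhere in the string
has_terry = monte.str.contains(r'Terry')

# match(): True where the pattern matches at the start of the string
starts_with_t_or_j = monte.str.match(r'[TJ]')

# count(): number of non-overlapping matches (here, lower-case vowels)
vowel_counts = monte.str.count(r'[aeiou]')

# replace(): substitute each run of whitespace with an underscore
underscored = monte.str.replace(r'\s+', '_', regex=True)

print(has_terry.tolist())
print(starts_with_t_or_j.tolist())
print(vowel_counts.tolist())
print(underscored.tolist())
```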

### For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:

In [16]:
# Example: use a regular expression to extract the first name
names = pd.Series(["smokey bear", "mickey mouse", "donald duck", "jimmy squirrel"])
# Pattern '([A-Za-z]+)':
#   "[" ... "]"  character class: upper case A-Z or lower case a-z
#   "+"          one or more repetitions
#   "(" ... ")"  capture group returned by extract()
# In words: "capture the first run of upper- or lower-case letters"
first = names.str.extract('([A-Za-z]+)', expand=False)
print("First Names \n", first)

First Names 
 0    smokey
1    mickey
2    donald
3     jimmy
dtype: object


In [18]:
# Get last names
# Pattern r'\s([A-Za-z]+)' (raw string, so the backslash reaches the regex engine):
#   "\s"         one whitespace character
#   "[" ... "]"  character class: upper case A-Z or lower case a-z
#   "+"          one or more repetitions
# In words: "after a whitespace character, capture a run of letters"
# Alternative that anchors on the last whitespace: r'.*\s([A-Za-z]+)'
last = names.str.extract(r'\s([A-Za-z]+)', expand=False)
print("Last Names \n", last)

Last Names 
 0        bear
1       mouse
2        duck
3    squirrel
dtype: object
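
Both names can also be pulled out in a single call by giving the pattern two capture groups. The sketch below uses named groups, a feature of Python regular expressions; the group names `first` and `last` are chosen here for illustration and become the column labels of the resulting DataFrame.

```python
import pandas as pd

names = pd.Series(["smokey bear", "mickey mouse", "donald duck", "jimmy squirrel"])

# Two named capture groups separated by whitespace: one column per group
parts = names.str.extract(r'(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)')

print(parts)
```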


Or we can do something more complicated, like... 

### ...finding all names that start and end with a consonant, making use of the start-of-string  and end-of-string regular expression characters:

In [19]:
# Find all names that start and end with a consonant
# "^"        start of string
# "[^AEIOU]" any character that is not an upper-case vowel
# ".*"       zero or more characters (any character except newline)
# "[^aeiou]" any character that is not a lower-case vowel
# "$"        end of string
# Five of the 26 letters of the alphabet are vowels: A, E, I, O, and U

print(monte)
monte.str.findall('^[^AEIOU].*[^aeiou]$')

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object


0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
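
When the goal is to keep only the matching rows rather than list the matches, the same pattern can be passed to `contains()` to build a Boolean mask. A minimal sketch (`pattern`, `mask`, and `consonant_names` are illustrative names):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Same pattern as in the findall() example: consonant at both ends
pattern = r'^[^AEIOU].*[^aeiou]$'

# contains() returns True/False per element, usable for Boolean indexing
mask = monte.str.contains(pattern)
consonant_names = monte[mask]

print(consonant_names)
```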

The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries opens up many possibilities for analysis and cleaning of data.

### Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations:

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | Extract dummy variables as a DataFrame |
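
A few of these methods in action, as a rough sketch. The small `tags` series and all variable names are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# pad(): right-align each name in a field of width 16, filling with dots
padded = monte.str.pad(16, side='left', fillchar='.')

# cat(): with only a separator, concatenate the whole Series into one string
joined = monte.str.cat(sep='; ')

# repeat(): repeat each string a given number of times
doubled = pd.Series(['ab', 'c']).str.repeat(2)

# get_dummies(): split on a delimiter and build one indicator column per code
tags = pd.Series(['B|C', 'A', 'B|D'])
dummies = tags.str.get_dummies('|')

print(padded)
print(joined)
print(doubled)
print(dummies)
```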

## Vectorized item access and slicing

### The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access from each element.
For example, we can get a slice of the **first three characters** of each string using ``str.slice(0, 3)``.
Note that this behavior is also available through Python's normal indexing syntax; for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]`` (here ``df`` stands for any Series of strings, such as ``monte``):

In [20]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

Indexing via ``df.str.get(i)`` and ``df.str[i]`` works the same way.

These ``get()`` and ``slice()`` methods also let you access elements of arrays returned by ``split()``.
For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:

In [22]:
# Find Last Names
# Split each string on whitespace and take the last element (index -1)
lastname = monte.str.split().str.get(-1) 
lastname

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
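
The equivalence between the method forms and the indexing forms can be checked directly. This is a small sketch with illustrative variable names:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# get(-1) and [-1] both pick the last item of each split list
last_via_get = monte.str.split().str.get(-1)
last_via_index = monte.str.split().str[-1]

# slice(0, 3) and [0:3] both take the first three characters of each string
first_three_slice = monte.str.slice(0, 3)
first_three_index = monte.str[0:3]

print(last_via_get.equals(last_via_index))
print(first_three_slice.equals(first_three_index))
```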

###### In summary, in this notebook we discussed:

* Pandas string methods
* Methods using regular expressions
* Miscellaneous methods