Credits: This notebook contains an excerpt from the [Python Data Science Handbook]
by Jake VanderPlas.

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). <br/>
If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

<a id="home"></a>
## Working with strings

| Section | Section-name | Section | Section-name | Section | Section-name | 
| :- | :- | :- | :- | :- | :- | 
| 1.a. | [Basic strings](#1a) |  1.b. | [String indexing](#1b) |  1.c. | [Basic string operations](#1c) | 
| 1.d. | [Finding a substring](#1d) | 1.e. | [String transformations](#1e) |  1.f. | [split and join](#1f) | 
| 2.a. | [String ops via list comprehensions](#2a) | nb_regexps | [regular expressions notebooks](#nb_regexps) | 
| 3.a. | [Pandas objects containing strings](#3a) | 3.b. | [Series string methods](#3b) | 3.c. | [Using pandas string methods](#3c) |
| 3.d. | [Series and regular expressions](#3d) | 3.e. | [Series misc. string methods](#3e) | 3.f. | [Series item access and slicing](#3f) |
| 3.g. | [Indicator variables](#3g) | 

In [None]:
# Standard imports
import os
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

warnings.simplefilter("ignore")
%matplotlib inline

# Show the result of every expression in a cell, not just the last one, so each cell can demonstrate several tricks.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

[Go to the beginning of the notebook](#home)
<a id="1a"></a>
#### 1.a. Basic strings

In [None]:
print('A string is enclosed in double quotes:')
"John Smith"

print('You can also use single quotes:')
'John Smith'

print('A string can contain spaces and digits:')
'1 2 3 4 5 6 '

print('A string can also contain special characters:')
'@#2_#]&*^%$'

In [None]:
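# A triple-quoted string can span several lines
hi = """Hi there 
Hi again
Bye"""
# An unassigned triple-quoted string is often used as a multi-line comment
'''
explanation 
of 
something
'''
print(hi)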

[Go to the beginning of the notebook](#home)
<a id="1b"></a>
#### 1.b. String indexing

In [None]:
Name= "Jack Smith"
len_min1=len(Name)-1
print('Name       : '+Name)
print('len(Name)  : %d' %(len(Name)))
print('Name[5]    : '+'-'*5+Name[5])
print('Name[-1]   : '+'-'*len_min1+Name[-1])
print('Name[0:4]  : '+Name[0:4])
print('Name[::2]  : '+Name[::2])
print('Name[::-1] : '+Name[::-1])
print('Name[::-2] : '+Name[::-2])
print('Name[1:7:2]: '+Name[1:7:2])
print('hi '*3)

[Go to the beginning of the notebook](#home)
<a id="1c"></a>
#### 1.c. Basic string operations
You can find a list of all string methods in the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [None]:
s = "he'llo"
print(s)
print(s.capitalize())              # Capitalize a string; prints "He'llo"
print(s.upper())                   # Convert a string to uppercase; prints "HE'LLO"
print(s.rjust(7))                  # Right-justify to width 7, padding with spaces; prints " he'llo"
print(s.center(7))                 # Center within width 7, padding with spaces
print(s.replace('l', '(ell)'))     # Replace all instances of one substring with another;
#                                    prints "he'(ell)(ell)o"
print(s.replace('l', '(ell)', 1))  # Replace only the first instance; prints "he'(ell)lo"
print(s.replace("'", ""))          # Remove the apostrophe; prints "hello"
print('  world '.strip())          # Strip leading and trailing whitespace; prints "world"
print(', hi my name is John.'.strip(',. '))  # Strip any of the listed characters from both ends
print('שלום לכולם,'.rstrip(','))   # Strip the trailing comma
print('שלום לכולם,'.lstrip(','))   # No leading comma, so the string is unchanged

[Go to the beginning of the notebook](#home)
<a id="1d"></a>
#### 1.d. Finding a substring

In [None]:
Name
Name.find('ck')

In [None]:
Name
Name.find('lm')
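
The two cells above differ only in whether the substring exists. As a minimal sketch of that behaviour (the lowercase variable `name` is just a self-contained copy of `Name`), `find()` returns the index of the first match and `-1` when there is none, while `index()` raises an error instead:

```python
name = "Jack Smith"  # same value as Name above

pos_found = name.find('ck')    # index where 'ck' starts
pos_missing = name.find('lm')  # -1 signals "not found"

# The `in` operator is often clearer when only a yes/no answer is needed.
has_ck = 'ck' in name

# str.index behaves like find, but raises ValueError instead of returning -1.
try:
    name.index('lm')
    index_raised = False
except ValueError:
    index_raised = True

print(pos_found, pos_missing, has_ck, index_raised)
```

Because `-1` is also a valid index (the last character), it is worth checking the result of `find()` before using it to slice.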

[Go to the beginning of the notebook](#home)
<a id="1e"></a>
#### 1.e. String transformations

In [None]:
ord('A')
some_str='string with ABCE'
translation = str.maketrans('BACD', 'abcd')  # maps each character of the first string to the one at the same position in the second
translation
some_str
some_str.translate(translation)

In [None]:
ord('א')
some_str='Hebrew Letters: אבג'
translation = str.maketrans('אבג', 'abc')  # maketrans is a static method, so call it on str rather than on an unrelated variable
translation
some_str
some_str.translate(translation)
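
To make the mechanics of the cells above explicit, here is a small sketch: `str.maketrans()` builds a dictionary keyed by Unicode code points (the numbers returned by `ord()`), and `translate()` looks each character up in it. The variable names are illustrative only.

```python
text = 'string with ABCE'

# Two equal-length strings: character i of the first maps to character i of the second.
table = str.maketrans('BACD', 'abcd')
translated = text.translate(table)  # 'E' is not in the table, so it is left alone

# An optional third argument lists characters to delete.
drop_vowels = str.maketrans('', '', 'aeiou')
without_vowels = text.translate(drop_vowels)

print(table)
print(translated)
print(without_vowels)
```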

[Go to the beginning of the notebook](#home)
<a id="1f"></a>
#### 1.f. split and join

In [None]:
str_sentence  = 'This is a sentence'
str_sentence2 = 'This is, a sentence'
str_sentence
str_sentence.split(' ')
str_sentence2
str_sentence2.split(', ')

In [None]:
normalized_tokens = ['This', 'is', 'a', 'sentence','.']
normalized_tokens
norm_sentence = ' '.join(normalized_tokens)
norm_sentence
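
One detail the cells above do not show is the difference between `split()` with no argument and `split(' ')`. A short sketch, using a made-up sentence with irregular spacing:

```python
sentence = 'This  is a   sentence'  # note the repeated spaces

tokens_default = sentence.split()    # splits on runs of whitespace
tokens_space = sentence.split(' ')   # splits on every single space, leaving empty strings

# Joining the clean tokens normalizes the spacing.
rejoined = ' '.join(tokens_default)

print(tokens_default)
print(tokens_space)
print(rejoined)
```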

[Go to the beginning of the notebook](#home)
<a id="2a"></a>
### 2.a. String operations via list comprehensions
Use list comprehensions on simple Python lists:

In [None]:
lst_names1 = ['peter', 'Paul', 'MARY', 'gUIDO']
lst_names1
[s.capitalize() for s in lst_names1]
lst2=[name.capitalize().replace('Pe','Me') for name in lst_names1]
lst2
' '.join(lst2)

This is perhaps sufficient to work with some data, <br/>
but **it will break if there are any missing values**.<br/>
For example:

In [None]:
# The None entry has no capitalize() method, so the comprehension below raises AttributeError
lst_names2 = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in lst_names2]
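
One possible workaround in plain Python, sketched under the assumption that non-string entries should simply pass through unchanged, is to guard the method call inside the comprehension:

```python
names = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Only call the string method on actual strings; leave everything else as it is.
capitalized = [n.capitalize() if isinstance(n, str) else n for n in names]

print(capitalized)
```

This works, but the guard has to be repeated for every operation, which is the motivation for the Pandas string methods in section 3.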

[Go to the beginning of the notebook](#home)
<a id="nb_regexps"></a>
### Python regular expressions notebooks
At this point you could go over the regular expression notebooks:
* [regular expressions - basic notebook](Ex09_RegularExpressions.ipynb)
* [regular expressions - advanced notebook](Ex09_RegularExpressions_adv.ipynb)

[Go to the beginning of the notebook](#home)
<a id="3a"></a>
### 3.a.  Pandas Series and Index objects containing strings
String operations on a ``Series`` or ``Index`` are available via the ``str`` attribute. <br/>
So, for example, suppose we create a Pandas Series with this data:

In [None]:
import pandas as pd
sr_names = pd.Series(lst_names2)
sr_names

We can now call a single method that will convert all the entries to upper case, while skipping over any missing values:

In [None]:
sr_names.str.upper()

Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.
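
As a minimal, self-contained sketch of the missing-value behaviour described in this section (the name `sr_demo` is illustrative), the missing entry stays missing instead of raising an error:

```python
import pandas as pd

sr_demo = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

upper = sr_demo.str.upper()   # vectorized; the missing entry is skipped
missing_mask = upper.isna()   # True only where the input was missing

print(upper)
print(missing_mask.tolist())
```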

[Go to the beginning of the notebook](#home)
<a id="3b"></a>
### 3.b.  Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. <br/>
Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

Notice that these have various return values, as the examples in the next section show.

[Go to the beginning of the notebook](#home)
<a id="3c"></a>
#### 3.c.  Using Pandas String Methods
Pandas string syntax is similar to basic Python string operations.<br/>
The examples in this section use the following series of names:

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

Some methods, like ``lower()``, return a series of strings:

In [None]:
monte.str.lower()

But some others return numbers:

In [None]:
monte.str.len()

Or Boolean values:

In [None]:
monte.str.startswith('Terry')

Still others return lists or other compound values for each element:

In [None]:
s2 = monte.str.lower().str.split()
s2

In [None]:
monte.str.split("e")

We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.
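
As a small preview of working with such a series of lists, the sketch below relies on the fact that the ``str`` accessor also indexes into and measures list elements. The variable names are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

words = monte.str.lower().str.split()  # each element is a list of words

first_words = words.str[0]   # first item of each list
n_words = words.str.len()    # number of items in each list

print(first_words.tolist())
print(n_words.tolist())
```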

[Go to the beginning of the notebook](#home)
<a id="3d"></a>
#### 3.d.  Series string methods with regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, <br/>
and follow some of the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning the groups of the first match as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |

With these, you can do a wide range of interesting operations.<br/>
For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:

In [None]:
monte
monte.str.extract('([A-Za-z]+)', expand=False)

In [None]:
monte.str.extract('([a-z]+)', expand=False)  # lowercase only: the first match starts after the leading capital

Or we can do something more complicated, like finding all names that start and end with a consonant, <br/>
making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters:

In [None]:
monte

In [None]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

In [None]:
monte.str.findall(r'[^AEIOU]\w+[^aeiou]')  # without the anchors, every matching substring is returned

The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries <br/>
opens up many possibilities for analysis and cleaning of data.
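
A few of the other regular-expression methods from the table above, sketched on the same names. The patterns and variable names are made up for illustration; ``regex=True`` is passed to ``replace()`` explicitly so the pattern is treated as a regular expression.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

has_double_l = monte.str.contains(r'l{2}')  # True where 'll' appears anywhere
e_counts = monte.str.count(r'e')            # occurrences of lowercase 'e'

# Replace the first name with its initial, using a captured group.
initials = monte.str.replace(r'^(\w)\w*', r'\1.', regex=True)

# Named groups become column names in the resulting DataFrame.
name_parts = monte.str.extract(r'(?P<first>\w+) (?P<last>\w+)')

print(has_double_l.tolist())
print(e_counts.tolist())
print(initials.tolist())
print(name_parts)
```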

[Go to the beginning of the notebook](#home)
<a id="3e"></a>
#### 3.e. Series  miscellaneous string methods
Finally, there are some miscellaneous methods that enable other convenient operations:

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | extract dummy variables as a dataframe |
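
A quick sketch of four of the methods in this table, on a tiny made-up series so the results are easy to read:

```python
import pandas as pd

s = pd.Series(['ab', 'cde'])

padded = s.str.pad(5, side='left', fillchar='*')  # pad on the left up to width 5
repeated = s.str.repeat(2)                        # repeat each string twice
replaced = s.str.slice_replace(0, 1, 'X')         # replace the slice [0:1] with 'X'
joined = s.str.cat(sep='-')                       # concatenate all elements into one string

print(padded.tolist())
print(repeated.tolist())
print(replaced.tolist())
print(joined)
```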

[Go to the beginning of the notebook](#home)
<a id="3f"></a>
#### 3.f. Series item access and slicing

The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access from each string.<br/>
For example, we can get a slice of the first three characters of each string using ``str.slice(0, 3)``.<br/>
Note that this behavior is also available through Python's normal indexing syntax; for example, <br/>
``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:

In [None]:
monte.str[0:3]

Indexing via ``df.str.get(i)`` and ``df.str[i]`` is likewise similar.
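
The two equivalences can be checked directly. This is only a sketch on the `monte` names, rebuilt here so the cell runs on its own:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

first_chars = monte.str.get(0)    # same result as monte.str[0]
prefixes = monte.str.slice(0, 3)  # same result as monte.str[0:3]

same_get = first_chars.equals(monte.str[0])
same_slice = prefixes.equals(monte.str[0:3])

print(first_chars.tolist(), same_get)
print(prefixes.tolist(), same_slice)
```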

These ``get()`` and ``slice()`` methods also let you access elements of arrays returned by ``split()``.<br/>
For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:

In [None]:
monte.str.split().str.get(-1)
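
When every piece is needed rather than just one, an alternative sketch is to pass ``expand=True`` to ``split()``, which returns a ``DataFrame`` with one column per piece. The column names `first` and `last` are chosen here for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

parts = monte.str.split(expand=True)  # one column per whitespace-separated piece
parts.columns = ['first', 'last']     # every name here has exactly two pieces

last_names = monte.str.split().str.get(-1)  # the approach used above

print(parts)
print(last_names.tolist())
```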

[Go to the beginning of the notebook](#home)
<a id="3g"></a>
#### 3.g. Indicator variables

Another method that requires a bit of extra explanation is the ``get_dummies()`` method.<br/>
This is useful when your data has a column containing some sort of coded indicator.<br/>
For example, we might have a dataset that contains information in the form of codes, such as <br/>
    A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

In [None]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

The ``get_dummies()`` routine lets you quickly split out these indicator variables into a ``DataFrame``:

In [None]:
full_monte['info'].str.get_dummies('|')
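
To see what the result contains, the sketch below rebuilds the same `info` codes as a standalone series and inspects the columns. The variable names are illustrative, and the values are cast to `int` so the checks do not depend on the exact dtype Pandas returns.

```python
import pandas as pd

info = pd.Series(['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'])

dummies = info.str.get_dummies('|').astype(int)  # one 0/1 column per distinct code

columns = list(dummies.columns)
likes_spam = dummies['D'].tolist()        # 1 where code D is present
code_counts = dummies.sum().to_dict()     # how many rows carry each code

print(columns)
print(likes_spam)
print(code_counts)
```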

With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.

We won't dive further into these methods here, but I encourage you to read through ["Working with Text Data"](http://pandas.pydata.org/pandas-docs/stable/text.html) in the Pandas online documentation.