# Series string methods
When you work with a text dataset, common tasks include cleaning the text, splitting it, and extracting information about each sample from it. pandas vectorized string methods cover all of these. They are usually preferable to plain Python `str` methods applied element by element, because the calls are shorter and missing values (NA) are handled automatically.
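
To make the NA-handling point concrete, here is a minimal sketch on toy data (the `names` Series is made up for illustration and is not taken from `movie.csv`). It compares the vectorized `.str.upper()` with the plain Python `str.upper` mapped over the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
names = pd.Series(['james cameron', np.nan, 'sam mendes'], dtype=object)

# Vectorized method: the missing value is passed through as NaN
upper_vectorized = names.str.upper()

# Plain Python method applied element by element: fails on the float NaN
try:
    names.map(str.upper)
    plain_python_failed = False
except TypeError:
    plain_python_failed = True

print(upper_vectorized.tolist())
print(plain_python_failed)
```

With the plain Python approach you would have to filter or guard against missing values yourself before calling the method.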

In [11]:
import numpy as np
import pandas as pd
import re
# Limit the number of displayed rows, just for convenience
pd.set_option('display.max_rows', 8)

## Load data

In [8]:
movie = pd.read_csv('data/movie.csv')
# Select a single column as a Series
directors = movie['director_name']

## Methods
pandas string methods live under the `.str` accessor of a Series (`pd.Series.str`), and most of them share their names with Python's built-in `str` methods: `str.split()` and `pd.Series.str.split()` do the same job, the latter for every element at once. Some methods also accept a regex pattern as an argument, which gives more flexibility.  
Furthermore, you can slice string values through the accessor.
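
As a small illustration of the regex support, the sketch below uses a made-up `sample` Series (an assumption for the example, not the real `directors` column). `str.contains` interprets its pattern as a regex by default, and `str.split` treats a pattern longer than one character as a regex.

```python
import pandas as pd

# Hypothetical toy data with inconsistent separators
sample = pd.Series(['James Cameron', 'Gore  Verbinski', 'Sam-Mendes'], dtype=object)

# Does the value end with "ron" or "ski"? (non-capturing group avoids a warning)
ends_ron_or_ski = sample.str.contains(r'(?:ron|ski)$')

# Split on any run of whitespace or hyphens
parts = sample.str.split(r'[\s-]+')

print(ends_ron_or_ski.tolist())
print(parts.tolist())
```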

### Slicing

In [26]:
# Take 1st letter from all values
# Similar to directors.str.get(0)
directors.str[0]

0         J
1         G
2         S
3         C
       ... 
4912    NaN
4913      B
4914      D
4915      J
Name: director_name, Length: 4916, dtype: object

In [23]:
# Last 5 characters
directors.str[-5:]

0       meron
1       inski
2       endes
3       Nolan
        ...  
4912      NaN
4913    berds
4914     Hsia
4915     Gunn
Name: director_name, Length: 4916, dtype: object

### Length

In [15]:
# Number of characters in string
directors.str.len()

0       13.0
1       14.0
2       10.0
3       17.0
        ... 
4912     NaN
4913    16.0
4914    11.0
4915     8.0
Name: director_name, Length: 4916, dtype: float64

### Starting, ending

In [16]:
# Check whether each value starts or ends with a given substring
is_starts_with_a = directors.str.startswith('A')
is_ends_with_z = directors.str.endswith('z')
is_starts_with_a

0       False
1       False
2       False
3       False
        ...  
4912      NaN
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: object

In [14]:
# any(): at least one value is True; all(): every value is True
is_starts_with_a.any(), is_starts_with_a.all()

(True, False)
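
Because missing names produce `NaN` in the result, the mask above has `object` dtype and is awkward to use for filtering. A possible workaround, sketched on toy data (the `sample` Series is made up for the example), is the `na` argument of `startswith`, which says what to put in place of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
sample = pd.Series(['Andrew Stanton', np.nan, 'Sam Raimi'], dtype=object)

# Without na=..., the missing entry stays missing in the result
raw_mask = sample.str.startswith('A')

# na=False treats missing entries as "does not match", giving a clean boolean mask
mask = sample.str.startswith('A', na=False)

# The boolean mask can be used directly for row selection
selected = sample[mask]
print(selected.tolist())
```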

In [13]:
directors

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
              ...        
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

### Splitting

In [59]:
# Split each string into a list of substrings
# A regex pattern works as a separator too
directors.str.split(' ')

0           [James, Cameron]
1          [Gore, Verbinski]
2              [Sam, Mendes]
3       [Christopher, Nolan]
                ...         
4912                     NaN
4913     [Benjamin, Roberds]
4914          [Daniel, Hsia]
4915             [Jon, Gunn]
Name: director_name, Length: 4916, dtype: object

In [47]:
# Pick an element from each list with the str accessor: index, slice, or .str.get()
directors.str.split(' ').str[0]

0             James
1              Gore
2               Sam
3       Christopher
           ...     
4912            NaN
4913       Benjamin
4914         Daniel
4915            Jon
Name: director_name, Length: 4916, dtype: object

In [52]:
# Use split again, this time on the pipe-separated genres
genres = movie['genres'].str.split('|')
genres

0       [Action, Adventure, Fantasy, Sci-Fi]
1               [Action, Adventure, Fantasy]
2              [Action, Adventure, Thriller]
3                         [Action, Thriller]
                        ...                 
4912       [Crime, Drama, Mystery, Thriller]
4913               [Drama, Horror, Thriller]
4914                [Comedy, Drama, Romance]
4915                           [Documentary]
Name: genres, Length: 4916, dtype: object

In [56]:
# Number of genres assigned to each film
genres.str.len()

0       4
1       3
2       3
3       2
       ..
4912    4
4913    3
4914    3
4915    1
Name: genres, Length: 4916, dtype: int64
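
If the goal is one indicator column per genre rather than a list, `str.get_dummies` (it appears in the list of available methods at the end) can split and encode in one step. The sketch below runs on a few made-up pipe-separated strings, not on the real `movie['genres']` column.

```python
import pandas as pd

# Hypothetical toy data in the same pipe-separated format as the genres column
genres_raw = pd.Series(['Action|Thriller', 'Documentary', 'Drama|Thriller'])

# One column per distinct genre, 1 where the film has that genre
dummies = genres_raw.str.get_dummies(sep='|')

# Column sums give the number of films per genre
genre_counts = dummies.sum().sort_values(ascending=False)

print(dummies)
print(genre_counts)
```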

### Extracting columns
We can extract new DataFrame columns from text using regex named groups, `(?P<name>pattern)`: the group name becomes the name of the resulting Series or DataFrame column.

In [60]:
# With expand=True a DataFrame is returned even for a single group;
# with expand=False a single group is returned as a Series
name = directors.str.extract(r'(?P<name>\w+)', expand=True)
# Keep the separating space outside the group so it is not part of the surname
family = directors.str.extract(r' (?P<surname>\w+)', expand=False)
name_n_family = directors.str.extract(r'(?P<name>\w+) (?P<surname>\w+)')

print(family)
name

0          Cameron
1        Verbinski
2           Mendes
3            Nolan
           ...    
4912           NaN
4913       Roberds
4914          Hsia
4915          Gunn
Name: surname, Length: 4916, dtype: object


             name
0           James
1            Gore
2             Sam
3     Christopher
...           ...
4912          NaN
4913     Benjamin
4914       Daniel
4915          Jon


In [33]:
name_n_family

             name    surname
0           James    Cameron
1            Gore  Verbinski
2             Sam     Mendes
3     Christopher      Nolan
...           ...        ...
4912          NaN        NaN
4913     Benjamin    Roberds
4914       Daniel       Hsia
4915          Jon       Gunn
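
To check how `expand` and the number of groups affect the return type, here is a minimal sketch on toy data (the `sample` Series is made up for the example). It only inspects the types and column names of the results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
sample = pd.Series(['James Cameron', np.nan, 'Jon Gunn'], dtype=object)

# One group, expand=False: a Series named after the group
as_series = sample.str.extract(r'(?P<name>\w+)', expand=False)

# One group, expand=True: a one-column DataFrame
as_frame = sample.str.extract(r'(?P<name>\w+)', expand=True)

# Two groups: always a DataFrame, one column per group
both = sample.str.extract(r'(?P<name>\w+) (?P<surname>\w+)')

print(type(as_series).__name__, type(as_frame).__name__, list(both.columns))
```

Rows where the pattern does not match, including missing values, come back as `NaN` in every extracted column.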


### Count

In [46]:
# Count occurrences of a pattern in each string
directors.str.count('e')

0       2.0
1       2.0
2       2.0
3       1.0
       ... 
4912    NaN
4913    2.0
4914    1.0
4915    0.0
Name: director_name, Length: 4916, dtype: float64

### Concatenation
The strings in a Series can be joined into a single string, or two Series can be concatenated element-wise.

In [34]:
# Concatenate all strings of the Series into a single string
directors.str.cat(sep='-')[:100]

'James Cameron-Gore Verbinski-Sam Mendes-Christopher Nolan-Doug Walker-Andrew Stanton-Sam Raimi-Natha'

In [35]:
# Concatenate corresponding strings from Series
# Bond, James Bond
family.str.cat(directors, sep=', ')

0           Cameron, James Cameron
1        Verbinski, Gore Verbinski
2               Mendes, Sam Mendes
3         Nolan, Christopher Nolan
                   ...            
4912                           NaN
4913     Roberds, Benjamin Roberds
4914             Hsia, Daniel Hsia
4915                Gunn, Jon Gunn
Name: surname, Length: 4916, dtype: object
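
Missing values behave differently in the two modes of `str.cat`. The sketch below uses two made-up Series (`first` and `last` are assumptions for the example) to show the default behaviour and the `na_rep` argument, which substitutes a placeholder for missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one first name is missing
first = pd.Series(['James', np.nan, 'Jon'], dtype=object)
last = pd.Series(['Cameron', 'Mendes', 'Gunn'], dtype=object)

# Element-wise: a missing value on either side makes the result NaN
default_join = first.str.cat(last, sep=' ')

# na_rep replaces missing values with a placeholder before joining
filled_join = first.str.cat(last, sep=' ', na_rep='?')

# Whole Series into one string: missing values are skipped by default
joined_all = first.str.cat(sep='-')

print(default_join.tolist())
print(filled_join.tolist())
print(joined_all)
```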

### Available string methods

In [50]:
# Available str methods
list(filter(lambda x: not x.startswith('_'), dir(pd.Series.str)))

['capitalize',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'slice',
 'slice_replace',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'wrap',
 'zfill']