---
# Pandas String Functions
Pandas string functions, regex functions, and working with text

---

Code examples on the most frequently used functions - Collected, Created and Edited by Pawel Rosikiewicz www.SimpleAI.ch

__MOST IMPORTANT LINK (all pandas str functions are here)__     
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

### __What is so special about text data?__
* often messy
* usually not in a standardized format ready for analysis
* harder to plot or to draw solid conclusions from

### __String functions in pandas vs built-in string functions__
* examples of built-in string functions are in Python_data_structures notebook
* Python's string functions are for individual string objects
* Pandas functions are for Series and DataFrames

### __Pandas string functions__
* accessed through the str attribute of a Series    
* names match the corresponding built-in string functions         
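
The difference between the built-in methods and the str accessor can be sketched as follows. This is a minimal illustration; the variable names (single, names) are made up for the example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Built-in method: works on one Python string at a time
single = "pawel"
single_upper = single.upper()

# Pandas .str accessor: applies the same operation to every item
# and passes missing values through as NaN instead of raising an error
names = pd.Series(["pawel", "anna", np.nan])
series_upper = names.str.upper()

print(single_upper)
print(series_upper.tolist())
```

Calling names.upper() directly would fail, because the string methods live under the str accessor, not on the Series itself.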

In [80]:
import re
import numpy as np
import pandas as pd

## CREATE & INSPECT
---

![outliers_slide_01](images/Regex_and_String_Functions__Slide1.png)

### Create raw string & pd.Series with substrings

In [55]:
# string example
''' pd.Series may hold different dtypes; each item is called a substring here.
    Built-in string functions work only on one item at a time.
'''
s = pd.Series(['0','PawelBig', 'Anna Romanow', '01236789', np.nan, ' '])

In [56]:
# raw string example
'''used to search patterns with regex'''
pat = r'[a-z]+'

### Inspect substrings

In [58]:
s.dtype # dtype('O'), i.e. object, for mixed dtypes

dtype('O')

In [32]:
# inspect each substring
s.str.len() # length of each substring in the Series
s.str.isalpha() # np.nan returns NaN!
s.str.isnumeric() # '0' & '01236789' are numeric!
s.str.isalnum() # False if there is even one whitespace
s.str.isdigit() # True if all characters are digits
s.str.isspace() # True for strings made of whitespace only, e.g. ' '
s.str.islower() # True only if all cased characters are lowercase
s.str.isupper() # True only if all cased characters are uppercase
s.str.isdecimal() # True if all characters are decimal
s.str.istitle() # True for title case, e.g. 'This Is Title'

0    False
1    False
2     True
3    False
4      NaN
5    False
dtype: object

## SEARCH & FIND
---

* you search for a pattern in a string
* you can use regular expressions, which are covered at the end of this notebook


![outliers_slide_01](images/Regex_and_String_Functions__Slide4.png)

### Find pattern
i.e. if len(series)==8, then 8 results are returned in a pd.Series   
__.str functions may return for each substring:__
* True/False
    * __str.contains__; pat='REGEX', flags=re.IGNORECASE, na="fill value for NaN", case=True (case sensitive), regex=True (if False, pat is treated as a string literal); results can be combined with & and |, e.g. s.str.contains('0')|s.str.contains('1'), or the alternation can go into the pattern: s.str.contains('0|1')
    * __str.match__; the pattern must match at the start of the substring; accepts case and flags
    * __str.startswith__; string literal only, no case or flags parameters
    * __str.endswith__; string literal only, no case or flags parameters
    
* Index nr in each substring
    * __str.find__; the lowest index at which sub occurs, -1 if not found
    * __str.rfind__; the highest index at which sub occurs, -1 if not found
    * __str.index__; like find, but raises ValueError if sub is not found
    * __str.rindex__; like rfind, but raises ValueError if sub is not found
    
* 0/1
    * __str.get_dummies__; one column per value, filled with 0/1

#### (a) Contains, match, startswith, endswith

In [94]:
s = pd.Series(['0','PawelBig', 'Anna Romanow', '01236789', np.nan, ' '])

In [108]:
# IGNORECASE flag from the re module - works on string items only
s.str.contains(pat='big', flags=re.IGNORECASE).values.tolist()

[False, True, False, False, nan, False]

In [109]:
# select with the returned results
"""remember to replace NaN (na=False) if you use the result for selection!"""
flag = s.str.contains(pat='big', flags=re.IGNORECASE, na=False).values.tolist()
s.iloc[flag]

1    PawelBig
dtype: object

In [76]:
# more options
s.str.contains(
    pat='Pa', 
    case=True, # case sensitive
    regex=True, # if False, treats the pattern as a string literal, no regex
    na="whatever_here is Na" # fill value for NaN
).values

array([False, True, False, False, 'whatever_here is Na', False],
      dtype=object)

In [85]:
# and, or 
s.str.contains('0') | s.str.contains('1') #  0 or 1
s.str.contains('0') & s.str.contains('1') #  0 and 1
s.str.contains('0|1').values # works only with |

array([True, False, False, True, nan, False], dtype=object)
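
The effect of regex=False is easiest to see with a pattern that contains a regex metacharacter. The sketch below uses a made-up Series of file names (not from the original notebook) and assumes the default regex=True for the first call.

```python
import pandas as pd

files = pd.Series(["report.csv", "report_csv", "notes.txt"])

# As a regex, "." matches any character, so "t.csv" also matches "report_csv"
as_regex = files.str.contains("t.csv")

# With regex=False the pattern is treated as a plain string literal
as_literal = files.str.contains("t.csv", regex=False)

print(as_regex.tolist())    # [True, True, False]
print(as_literal.tolist())  # [True, False, False]
```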

#### --- match, starts/endswith 
* match is stricter than contains: the pattern must match at the start of the substring
* startswith/endswith are always case sensitive!
    * no flags!
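
A small side-by-side comparison may help to see how strict each function is. The Series below is invented for the illustration; str.fullmatch is not covered elsewhere in this notebook and is included only for contrast.

```python
import pandas as pd

words = pd.Series(["PawelBig", "BigPawel", "Big"])

anywhere = words.str.contains("Big")  # pattern may occur anywhere
at_start = words.str.match("Big")     # pattern must match at the start
whole = words.str.fullmatch("Big")    # pattern must match the entire string

print(anywhere.tolist())  # [True, True, True]
print(at_start.tolist())  # [False, True, True]
print(whole.tolist())     # [False, False, True]
```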

In [83]:
# match with flag
s.str.match(pat='Pa', flags=re.IGNORECASE).values

array([False, True, False, False, nan, False], dtype=object)

In [93]:
s.str.startswith(pat='Paw', na="no other params").values
s.str.endswith(pat='now', na="no other params").values

array([False, False, True, False, 'no other params', False], dtype=object)

#### (b) find, rfind, index, rindex
* find & rfind return the index of the pattern in each substring
    * Each returned index corresponds to the position where the pattern is fully contained between [start:end].
    * Return -1 on failure.
    * differences:
        * find - the first occurrence
        * rfind - the last occurrence
* index & rindex work the same, but raise ValueError if the pattern is not found

In [136]:
s = pd.Series(['0','PawelBig', 'Anna Romanow', '012367089', np.nan, ' '])

In [127]:
# find idx with the lowest value
s.str.find(sub="0").values

array([ 0., -1., -1.,  0., nan, -1.])

In [128]:
# .. idx with the highest value
s.str.rfind(sub="0").values

array([ 0., -1., -1.,  6., nan, -1.])

In [129]:
# restrict the search to [start:end] in each substring
s.str.find(sub="0", start=0, end=3).values

array([ 0., -1., -1.,  0., nan, -1.])
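
index and rindex are not shown above, so here is a minimal sketch of their behaviour. The Series are made up for the example; the point is that a single item without the pattern makes the whole call raise ValueError instead of returning -1.

```python
import pandas as pd

codes = pd.Series(["a0", "b0b0"])

first_zero = codes.str.index("0")   # lowest index of "0" in each item
last_zero = codes.str.rindex("0")   # highest index of "0" in each item

# If any item lacks the pattern, index raises instead of returning -1
try:
    pd.Series(["a0", "abc"]).str.index("0")
    raised = False
except ValueError:
    raised = True

print(first_zero.tolist(), last_zero.tolist(), raised)
```

Because of this, find/rfind are usually the safer choice on messy data, and index/rindex are useful when a missing pattern should be treated as an error.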

#### (c) Get_dummies
* interesting option
    * Each string in the Series is split by sep (default '|') and returned as a DataFrame of dummy/indicator variables.
* NaN does not get its own column; a NaN row is all zeros

In [145]:
s.str.get_dummies()

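| | ' ' | 0 | 012367089 | Anna Romanow | PawelBig |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 |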


In [147]:
pd.Series(['a|b', 'a', 'a|c']).str.get_dummies()

| | a | b | c |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 |
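
If the values are separated by something other than '|', the separator can be passed through the sep parameter. A short sketch with a made-up Series of tags:

```python
import pandas as pd

tags = pd.Series(["red;big", "red", "blue;big"])

# One indicator column per distinct tag, split on ";"
dummies = tags.str.get_dummies(sep=";")

print(dummies)
```

The columns come out sorted alphabetically (big, blue, red), not in order of first appearance.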


### Find pattern and return it
* __Main functions__
    * __.str.findall()__; returns a pd.Series with a list of all matches in each cell
    * __.str.extract()__; extracts the first match only, NaN if nothing is found; expand=True returns a DataFrame
    * __.str.extractall()__; extracts all matching patterns; always returns a DataFrame with a MultiIndex (original index + match number)
* __Notes__:
    * these functions match the pattern and extract it for further work or examination
    * the rest of the substring is ignored

In [161]:
s = pd.Series(['bla', 'BLA', 'some random BLA and bla','blablabla'])

In [162]:
s.str.findall('bla', flags=re.IGNORECASE)

0              [bla]
1              [BLA]
2         [BLA, bla]
3    [bla, bla, bla]
dtype: object

In [163]:
# search only the end, or start
s.str.findall('bla$', flags=re.IGNORECASE) # endswith
s.str.findall('^bla', flags=re.IGNORECASE) # startswith, but with flags:)

0    [bla]
1    [BLA]
2       []
3    [bla]
dtype: object

#### extract and extractall functions
* require a capture group in parentheses, e.g. pat=r'(bla)'

In [172]:
s.str.extract(
    pat=r'(bla)', 
    expand=False, 
    flags=re.IGNORECASE
)

0    bla
1    BLA
2    BLA
3    bla
dtype: object

In [181]:
s.str.extractall(
    pat=r'(bla)', 
    flags=re.IGNORECASE
)# caution: the result has a MultiIndex

| | match | 0 |
|---|---|---|
| 0 | 0 | bla |
| 1 | 0 | BLA |
| 2 | 0 | BLA |
| 2 | 1 | bla |
| 3 | 0 | bla |
| 3 | 1 | bla |
| 3 | 2 | bla |
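
With more than one capture group, extract returns one column per group, and named groups become the column names. The sketch below uses an invented Series of identifiers and hypothetical group names (letters, digits).

```python
import pandas as pd

ids = pd.Series(["AB-12", "CD-345", "no match"])

# Named groups (?P<name>...) are used as column names in the result
parts = ids.str.extract(r"(?P<letters>[A-Z]+)-(?P<digits>\d+)")

print(parts)
```

Rows where the pattern does not match are filled with NaN in every column.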


## SIZE/LENGTH TRANSFORMATIONS

![outliers_slide_01](images/Regex_and_String_Functions__Slide2.png)

#### (a) SLICE
* __with idx__; [idx] or [from:to]
* __str.slice__; (from, to)
* __.str.get(one_idx_only)__; accepts only one idx
* Comments:
    * with from:to, any method returns the requested slice, or as many characters as are available
    * with a single index nr, it returns NaN if that index exceeds the length of a substring
    * this works the same with all of these functions


In [185]:
s = pd.Series(['bla', 'BLA', 'some random BLA and bla','blablabla', np.nan])

In [193]:
# slice with the index
'''NaN for idx>=len(str)'''
s.str[5].values

array([nan, nan, 'r', 'a', nan], dtype=object)

In [194]:
# slice from:to
'''returns as many characters as available, not NaN, unless the item was np.nan'''
s.str[0:20].values
s.str.slice(0,20).values # same as str[0:20]

array(['bla', 'BLA', 'some random BLA and ', 'blablabla', nan],
      dtype=object)

#### (b) get exact character
* __.str.get(one_idx_only)__; accepts only one idx

In [202]:
s.str.get(5).values
s.str.get(-1).values

array(['a', 'A', 'a', 'a', nan], dtype=object)

#### (b) JOIN & CONCATENATE

In [203]:
s1 = pd.Series(["a", "b", "c", np.nan])
s2 = pd.Series(["1", "2", np.nan, "4"])

In [210]:
# join all elements of a series
'''i.e. create one long string'''
s1.str.cat() # 'abc'
s1.str.cat(sep=";_") # 'a;_b;_c'
s1.str.cat(sep=";_", na_rep="...") # 'a;_b;_c;_...'

'a;_b;_c;_...'

In [214]:
# join items in two Series
'''CAUTION: 1. best if they have the same length & indexing
            2. if na_rep is not used, it returns NaN
               for any pair with missing data
'''
s1.str.cat(s2, na_rep="_blah_")

0         a1
1         b2
2    c_blah_
3    _blah_4
dtype: object

#### (c) join strings with different lengths and indexes
* __.str.cat(join={'inner', 'outer', 'left', 'right'})__

In [223]:
s1 = pd.Series(["a", "b", "c"])
s2 = pd.Series([str(x) for x in list(range(1,5))], index=[1,2,6,7])

In [230]:
# default
'''the default is join="left"'''
s1.str.cat(s2, sep=",")

0    NaN
1    b,1
2    c,2
dtype: object

In [225]:
s1.str.cat(s2, join="outer", na_rep="----", sep=",")# use all available indexes

0    a,----
1       b,1
2       c,2
6    ----,3
7    ----,4
dtype: object

In [229]:
s1.str.cat(s2, join="inner", na_rep="----", sep=",")
s1.str.cat(s2, join="left", na_rep="----", sep=",") # use all from left
s1.str.cat(s2, join="right", na_rep="----", sep=",") # use all from right

1       b,1
2       c,2
6    ----,3
7    ----,4
dtype: object

#### (d) Divide substrings - return the ITEMS between separators

In [232]:
s = pd.Series(['bla', 'some random BLA and bla','blablabla', np.nan])

In [249]:
# split by white spaces
'''default'''
s.str.split(expand=True)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | bla | | | | |
| 1 | some | random | BLA | and | bla |
| 2 | blablabla | | | | |
| 3 | | | | | |


In [250]:
# str.split with a max nr of splits
'''no flags, 
   Caution: n is the max nr of splits, so you get up to n+1 pieces!
'''
s.str.split(
    pat="l",
    n=2, # max nr of splits; gives up to n+1 = 3 pieces
    expand=True)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | b | a | |
| 1 | some random BLA and b | a | |
| 2 | b | ab | abla |
| 3 | | | |
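
The relation between n and the number of pieces can be checked on a single item. The sketch below uses a made-up path string and also shows str.rsplit, which is not covered above and counts the splits from the right.

```python
import pandas as pd

paths = pd.Series(["a/b/c/d"])

# n=1 means one split, so two pieces
split_once = paths.str.split("/", n=1).iloc[0]    # ['a', 'b/c/d']
rsplit_once = paths.str.rsplit("/", n=1).iloc[0]  # ['a/b/c', 'd']

print(split_once, rsplit_once)
```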


#### (e) Divide substrings - return 3 ITEMS

In [241]:
s = pd.Series(['bla', 'some random BLA and bla','blablabla', np.nan])

In [245]:
# split on the FIRST pattern occurrence
s.str.partition("l", expand=True)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | b | l | a |
| 1 | some random BLA and b | l | a |
| 2 | b | l | ablabla |
| 3 | | | |


In [246]:
# split on the LAST pattern occurrence
s.str.rpartition("l", expand=True)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | b | l | a |
| 1 | some random BLA and b | l | a |
| 2 | blablab | l | a |
| 3 | | | |


## CONTENT MODIFICATIONS
---

![outliers_slide_01](images/Regex_and_String_Functions__Slide3.png)

### __(A) REPLACE__
---
* __.str.replace__; s.str.replace("from", "into"), with optional case and n parameters
* __.str.slice_replace__; takes start and stop indexes and replaces that slice with a new string; no case parameter!

In [258]:
s = pd.Series(['bla', 'some random BLA and bla','blablabla', np.nan])

s.str.replace(
    pat="l", 
    repl="-MOCCA-",
    n=2, # Number of replacements to make from start
    case=False
)


0                              b-MOCCA-a
1    some random B-MOCCA-A and b-MOCCA-a
2                  b-MOCCA-ab-MOCCA-abla
3                                    NaN
dtype: object

In [264]:
s.str.slice_replace(0,5,"______")

0                      ______
1    ______random BLA and bla
2                  ______abla
3                         NaN
dtype: object
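
str.replace can also work with regular expressions when regex=True is passed explicitly. The sketch below reorders date parts with capture groups and backreferences; the Series of dates is invented for the example.

```python
import pandas as pd

dates = pd.Series(["2021-03-15", "1999-12-31"])

# \1, \2, \3 refer to the three captured groups (year, month, day)
swapped = dates.str.replace(
    r"(\d{4})-(\d{2})-(\d{2})",
    r"\3.\2.\1",
    regex=True,
)

print(swapped.tolist())  # ['15.03.2021', '31.12.1999']
```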

### (B) CHANGE ENDS
---
* __Remove ends__
    * __.str.strip()__; both ends, s.str.strip(to_strip="characters to remove"); to_strip is a set of characters, no idx or int accepted
    * __.str.rstrip()__; only end/right
    * __.str.lstrip()__; only beginning/left
* __Padding__
    * __.str.pad()__; pads each substring to a minimum width with a fill character


In [277]:
s = pd.Series(['bla__', 'some random BLA and __bla','blabla__bla', np.nan])

In [276]:
# strip specified ends
''' to_strip is a set of characters (not a pattern); any of them is removed from both ends.
    Without to_strip, whitespace is removed.
    CAUTION - doesn't work with indexes or integers! It returns NaN only
'''
s.str.strip(to_strip="bla").values

array(['__', 'some random BLA and __', '__', nan], dtype=object)
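
lstrip and rstrip are listed above but not demonstrated, so here is a minimal sketch with made-up data. It also shows that to_strip is read as a set of characters: with "ab", any run of a or b at the ends is removed, not only the exact text "ab".

```python
import pandas as pd

s_ends = pd.Series(["xxabcxx", "  padded  "])

left = s_ends.str.lstrip("x")   # ['abcxx', '  padded  ']
right = s_ends.str.rstrip("x")  # ['xxabc', '  padded  ']
blank = s_ends.str.strip()      # default: whitespace -> ['xxabcxx', 'padded']

# "ab" is a set of characters, so the leading "abba" is removed completely
char_set = pd.Series(["abba-core-ab"]).str.strip("ab")  # ['-core-']

print(left.tolist(), right.tolist(), blank.tolist(), char_set.tolist())
```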

In [287]:
# padding
''' . width; min width of the resulting string; 
            the missing characters are filled with fillchar
    . fillchar; "character of my choice"
    . side;  "both", "right", "left"
    comments:
        - NaN values are left as NaN
        - fillchar must be a single character, not a longer string
'''
s.str.pad(
    width=30,
    fillchar=":", # MUST BE A SINGLE CHARACTER
    side="both"
)

0    ::::::::::::bla__:::::::::::::
1    ::some random BLA and __bla:::
2    :::::::::blabla__bla::::::::::
3                               NaN
dtype: object

### (C)  CHANGE COPY NR
---

In [289]:
s = pd.Series(['bla__', 'some random BLA and __bla','blabla__bla', np.nan])

In [290]:
s.str.repeat(2)

0                                           bla__bla__
1    some random BLA and __blasome random BLA and _...
2                               blabla__blablabla__bla
3                                                  NaN
dtype: object

In [293]:
pd.Series(['a', 'b', 'c']).str.repeat(repeats=[1,2,3])

0      a
1     bb
2    ccc
dtype: object

### (D)  LOWER/UPPER CASE
---

In [295]:
s = pd.Series(['bla__', 'some random BLA and __bla','blabla__bla', np.nan])

In [304]:
s.str.lower()
s.str.upper()
s.str.title()
s.str.capitalize() # upper only for the first char, the rest is lowered
s.str.swapcase()
s.str.casefold() # removes all case distinctions in the str; everything is lowercase afterwards

0                        bla__
1    some random bla and __bla
2                  blabla__bla
3                          NaN
dtype: object
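
For plain ASCII text lower and casefold give the same result. The difference shows up with some non-ASCII characters, as in this small sketch with an invented Series of German words:

```python
import pandas as pd

words = pd.Series(["Straße", "STRASSE"])

lowered = words.str.lower()    # ['straße', 'strasse']
folded = words.str.casefold()  # ['strasse', 'strasse']

print(lowered.tolist(), folded.tolist())
```

This is why casefold is usually the better choice when strings are compared without regard to case.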

---
## PART 2 - REGEX
---

![outliers_slide_01](images/Regex_and_String_Functions__Slide5.png)