## Text Methods (String methods)

A regular Python string has a variety of built-in methods available.

In [1]:
# Create a string.
mystring = 'hello'

In [3]:
# These are some string methods.
mystring.capitalize()

'Hello'

In [5]:
mystring.isalnum()

True

In [8]:
mystring.count('l')  # number of 'l' characters in the string

2

In [9]:
mystring.encode()

b'hello'

In [12]:
mystring.endswith('o')  # check whether the string ends with 'o'

True

In [13]:
mystring.upper()

'HELLO'

In [14]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  
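The help(str) output is long. As a shorter alternative, a minimal sketch like the one below lists only the public method names with dir(). The variable name public_methods is just an illustrative choice, and the exact number of methods depends on your Python version.

```python
# Collect the public (non-underscore) attribute names defined on str.
public_methods = [name for name in dir(str) if not name.startswith('_')]

print(len(public_methods))   # count varies between Python versions
print(public_methods[:5])    # first few names, in alphabetical order

# Look up the documentation for a single method instead of the whole class.
help(str.count)
```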

# Pandas and Text

Pandas can do a lot more than what we show here. Full online documentation on topics such as advanced string indexing and regular expressions with pandas can be found here:
https://pandas.pydata.org/docs/user_guide/text.html
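As a small taste of the regular expression support mentioned above, here is a hedged sketch using str.contains() and str.extract(). The patterns and the variable names is_number and parts are illustrative assumptions, not part of the lesson's data set-up.

```python
import pandas as pd

names = pd.Series(['mango', 'orange', 'egale', 'elephant', '50'])

# True where the entire string is made of digits.
is_number = names.str.contains(r'^\d+$')

# Capture the first letter and the rest of the word into two columns.
# Rows that do not match the pattern (here '50') come back as missing values.
parts = names.str.extract(r'^([a-z])([a-z]*)$')

print(is_number)
print(parts)
```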

## Text Methods on Pandas String Column

In [58]:
import pandas as pd
import numpy as np

In [16]:
names = pd.Series(['mango','orange','egale','elephant','50'])

In [17]:
names

0       mango
1      orange
2       egale
3    elephant
4          50
dtype: object

In [23]:
pd.DataFrame(names)

          0
0     mango
1    orange
2     egale
3  elephant
4        50


In [24]:
names.str.capitalize()

0       Mango
1      Orange
2       Egale
3    Elephant
4          50
dtype: object

In [25]:
names.str.upper()

0       MANGO
1      ORANGE
2       EGALE
3    ELEPHANT
4          50
dtype: object

In [26]:
names.str.isdecimal()

0    False
1    False
2    False
3    False
4     True
dtype: bool

In [27]:
names.str.islower()

0     True
1     True
2     True
3     True
4    False
dtype: bool

In [28]:
names.str.endswith('e')

0    False
1     True
2     True
3    False
4    False
dtype: bool
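Boolean results like the one above are most useful as filters. The sketch below uses the same names Series to keep only the entries ending in 'e'. The names ends_with_e and filtered are illustrative.

```python
import pandas as pd

names = pd.Series(['mango', 'orange', 'egale', 'elephant', '50'])

# Build a boolean mask, then use it to select matching rows.
ends_with_e = names.str.endswith('e')
filtered = names[ends_with_e]

print(filtered)
```

The original index labels are preserved in the filtered result, so you can still trace each value back to its row.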

## Splitting, Grabbing, Expanding

In [29]:
tech_finance = ['GOOGLE,APPLE,AMEZON','JPM,BAC,GS']

In [30]:
len(tech_finance)

2

In [31]:
tickers = pd.Series(tech_finance)

In [33]:
tickers

0    GOOGLE,APPLE,AMEZON
1             JPM,BAC,GS
dtype: object

In [38]:
# Split each string on commas
tickers.str.split(',')

0    [GOOGLE, APPLE, AMEZON]
1             [JPM, BAC, GS]
dtype: object

In [39]:
tickers.str.split(',').str[0]

0    GOOGLE
1       JPM
dtype: object

In [40]:
tickers.str.split(',',expand=True)

        0      1       2
0  GOOGLE  APPLE  AMEZON
1     JPM    BAC      GS
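With expand=True the result is a DataFrame whose columns are numbered 0, 1, 2. A possible follow-up, sketched below, is to give those columns readable names and to limit the number of splits with the n parameter. The column labels 'first', 'second' and 'third' are hypothetical.

```python
import pandas as pd

tickers = pd.Series(['GOOGLE,APPLE,AMEZON', 'JPM,BAC,GS'])

# Expand into columns, then replace the default 0/1/2 labels.
split_df = tickers.str.split(',', expand=True)
split_df.columns = ['first', 'second', 'third']  # hypothetical labels

# n=1 splits only on the first comma, leaving the remainder intact.
head_tail = tickers.str.split(',', n=1, expand=True)

print(split_df)
print(head_tail)
```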


## Cleaning or Editing Strings

In [41]:
messy_names = pd.Series(['cars    ','bo;bomb','  claire  '])

In [42]:
messy_names

0      cars    
1       bo;bomb
2      claire  
dtype: object

In [43]:
messy_names.str.replace(';','')

0      cars    
1        bobomb
2      claire  
dtype: object

In [44]:
messy_names.str.strip()

0       cars
1    bo;bomb
2     claire
dtype: object

In [48]:
messy_names.str.strip().str.replace(';','')

0      cars
1    bobomb
2    claire
dtype: object

In [49]:
messy_names.str.strip().str.replace(';','').str.capitalize()

0      Cars
1    Bobomb
2    Claire
dtype: object

## Alternative with Custom apply() call

In [51]:
def cleanup(name):
    name = name.replace(';','')
    name = name.strip()
    name = name.capitalize()
    return name

In [52]:
messy_names.apply(cleanup)

0      Cars
1    Bobomb
2    Claire
dtype: object

In [53]:
messy_names

0      cars    
1       bo;bomb
2      claire  
dtype: object
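As the output above shows, apply() returns a new Series and leaves messy_names untouched. To keep the cleaned values you have to assign the result to a name, as in this short sketch (cleaned_names is an illustrative variable name).

```python
import pandas as pd

messy_names = pd.Series(['cars    ', 'bo;bomb', '  claire  '])

def cleanup(name):
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name

# Store the result; the original Series is not modified in place.
cleaned_names = messy_names.apply(cleanup)

print(cleaned_names)
print(messy_names)
```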

## NumPy's vectorize Method

In [60]:
np.vectorize(cleanup)(messy_names)

array(['Cars', 'Bobomb', 'Claire'], dtype='<U6')
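Note that np.vectorize() hands back a plain NumPy array, so the pandas index is lost. If you need a Series again, one option is to wrap the array and reuse the original index, as sketched here. The names cleaned_array and cleaned_series are illustrative.

```python
import numpy as np
import pandas as pd

messy_names = pd.Series(['cars    ', 'bo;bomb', '  claire  '])

def cleanup(name):
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name

# The vectorized function returns a NumPy array, not a Series.
cleaned_array = np.vectorize(cleanup)(messy_names)

# Wrap it back into a Series that shares the original index.
cleaned_series = pd.Series(cleaned_array, index=messy_names.index)

print(cleaned_series)
```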

## Which one is more efficient?

In [54]:
import timeit

# code snippet to be executed only once
setup = '''
import pandas as pd
import numpy as np
messy_names = pd.Series(['cars    ','bo;bomb','  claire  '])
def cleanup(name):
    name = name.replace(';','')
    name = name.strip()
    name = name.capitalize()
    return name
'''

In [55]:
# code snippet whose execution time is to be measured
stmt_pandas_str = '''
messy_names.str.strip().str.replace(';','').str.capitalize()
'''

In [63]:
stmt_pandas_apply = '''
messy_names.apply(cleanup)
'''

In [67]:
stmt_pandas_vectorize = '''
np.vectorize(cleanup)(messy_names)
'''

In [61]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_str,
              number=1000)

0.9706704000000173

In [65]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_apply,
              number=1000)

0.319305500000155

In [68]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_vectorize,
              number=1000)

0.0491861999998946

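While the .str methods can be extremely convenient, don't forget about np.vectorize() when it comes to performance! In the timings above, on a three-element Series, np.vectorize() was the fastest of the three approaches, followed by apply(), with the chained .str calls the slowest. Review the 'Useful Methods' lecture for a deeper discussion of np.vectorize().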
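The timings above were measured on a Series with only three strings, where fixed per-call overhead plays a large role. To see how the approaches compare on more data, you could repeat the test on a longer Series, as in this sketch. The Series big, its size, and the three helper functions are assumptions made for illustration. No timings are shown because they depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

def cleanup(name):
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name

# Hypothetical larger input: the three messy names repeated 10,000 times.
big = pd.Series(['cars    ', 'bo;bomb', '  claire  '] * 10_000)

def with_str():
    return big.str.strip().str.replace(';', '', regex=False).str.capitalize()

def with_apply():
    return big.apply(cleanup)

def with_vectorize():
    return np.vectorize(cleanup)(big)

# Time each approach; timeit accepts a callable as the statement.
for fn in (with_str, with_apply, with_vectorize):
    seconds = timeit.timeit(fn, number=10)
    print(f'{fn.__name__}: {seconds:.4f} s')
```

Run it a few times and compare the ordering with the small-Series results before settling on one approach for your own data.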