# Editing Text Strings

## Summary

### Built-in string methods

### Regular Expression (re or regex) package

### lambda and .apply()

# Extracting different categories of characters from a string

## .isdigit(), .isalpha(), .isspace(), .isalnum()

### Documentation
https://docs.python.org/3/library/stdtypes.html#string-methods

In [3]:
import pandas as pd

def CharCat(sample_string):
    print(sample_string)
    print('isdigit(): {}'.format(sample_string.isdigit()))
    print('isalpha(): {}'.format(sample_string.isalpha()))
    print('isspace(): {}'.format(sample_string.isspace()))
    print('isalnum(): {}'.format(sample_string.isalnum()))

In [8]:
sample_string = 'take 5'

CharCat(sample_string)
CharCat(sample_string[:4])
CharCat('132')
# CharCat(132) would raise an AttributeError: 132 is an int, not a string
CharCat(' ')
CharCat('?!')
CharCat('take5')

take 5
isdigit(): False
isalpha(): False
isspace(): False
isalnum(): False
take
isdigit(): False
isalpha(): True
isspace(): False
isalnum(): True
132
isdigit(): True
isalpha(): False
isspace(): False
isalnum(): True
 
isdigit(): False
isalpha(): False
isspace(): True
isalnum(): False
?!
isdigit(): False
isalpha(): False
isspace(): False
isalnum(): False
take5
isdigit(): False
isalpha(): False
isspace(): False
isalnum(): True
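
Because each of these methods also works on a single character, they can be used to pull one category of character out of a string. A minimal sketch, assuming we just want the matching characters joined back together; the helper name `extract_chars` is hypothetical and not part of the notes above.

```python
def extract_chars(text, predicate):
    """Return the characters of `text` for which `predicate` returns True."""
    return ''.join(ch for ch in text if predicate(ch))

sample_string = 'take 5'

# str.isdigit and str.isalpha are called on one character at a time.
digits = extract_chars(sample_string, str.isdigit)
letters = extract_chars(sample_string, str.isalpha)

print(digits)
print(letters)
```

The whole string 'take 5' fails every test above, but its individual characters can still be sorted into categories this way.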


## Pandas' .apply() method
Series.apply() lets us specify a function and apply it to each element of a Series. (DataFrame.apply() instead passes each column, or each row with axis=1, to the function.)

In [10]:
# Create a series of dirty, annoying values.
money = pd.Series([400, 111, '$20', 57, 'Lots'])

# Running 'money.isdigit()' raises an AttributeError because .isdigit() is a
# string method, _not_ a Series method.

# print(money.isdigit())  # AttributeError

# Instead, let's define a new function that takes a string as an argument
# and returns True if the string is all digits, otherwise False.

def is_a_digit(x):
    # First make sure we're operating on a string, then use our string method.
    return str(x).isdigit()

# Now let's apply our custom function to each element in our series.
print(money.apply(is_a_digit))

0     True
1     True
2    False
3     True
4    False
dtype: bool
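
The boolean Series that .apply() returns can also be used as a mask to keep or drop values. A small sketch, assuming the same `money` Series and `is_a_digit` function as above; the names `mask`, `clean` and `dirty` are only for illustration.

```python
import pandas as pd

money = pd.Series([400, 111, '$20', 57, 'Lots'])

def is_a_digit(x):
    # Convert to a string first, then use the string method.
    return str(x).isdigit()

# True where the string form of the value is all digits.
mask = money.apply(is_a_digit)

clean = money[mask]    # values that passed the test
dirty = money[~mask]   # values that failed the test

print(clean.tolist())
print(dirty.tolist())
```

Indexing with the mask keeps the original index labels, so the clean values can still be traced back to their rows.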


## Lambda functions
The is_a_digit function above only wraps .isdigit() with one extra step (the str() conversion), so it makes little sense to def a named function for it.

Frequently, we'll define new functions on the fly.  That's where lambda functions come in.

*Lambda functions* are small, temporary, unnamed functions.
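
To make the comparison concrete, here is a sketch that puts a def and the equivalent lambda side by side. Binding the lambda to the name `is_a_digit_lambda` is only done here so the two can be compared; normally a lambda is written inline where it is used.

```python
# A named function, defined with def.
def is_a_digit(x):
    return str(x).isdigit()

# The same logic as a lambda expression (named here only for comparison).
is_a_digit_lambda = lambda x: str(x).isdigit()

values = [400, '$20', 'Lots']
named_results = [is_a_digit(v) for v in values]
lambda_results = [is_a_digit_lambda(v) for v in values]

print(named_results)
print(named_results == lambda_results)
```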

In [11]:
# Create a series of dirty, annoying values.
money = pd.Series([400, 111, '$20', 57, 'Lots'])

# Here's a lambda function that mirrors the is_a_digit function above.
# Read this print statement carefully and compare to the previous one.
print(money.apply(lambda x: str(x).isdigit()))

0     True
1     True
2    False
3     True
4    False
dtype: bool


## Filter

filter() takes two arguments:
1. A function that takes a single input and returns a boolean value.
2. An iterable, each element of which is fed into that function.

filter() returns the elements for which the function returns True.
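
A minimal plain-Python sketch of those two arguments, using a hypothetical list of strings rather than a Series:

```python
values = ['400', '$20', '57', 'Lots']

# First argument: a function returning True/False for one element.
# Second argument: the iterable to test.
result = filter(str.isdigit, values)

# filter() is lazy; list() pulls the kept elements out of the iterator.
all_digit = list(result)
print(all_digit)

# The iterator is now used up, so converting it again gives an empty list.
leftover = list(result)
print(leftover)
```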

In [13]:
# We're using list() on the result because filter() returns an iterator.

print('Filtering the whole series:')
print(list(filter(lambda x: str(x).isdigit(), money)))

print('\nApply filter() to each value in the series:')
print(money.apply(lambda x: ''.join(filter(str.isdigit, str(x)))))
# This keeps only the digit characters of each value and comes out as a
# pandas Series of strings, which is more useful.

Filtering the whole series:
[400, 111, 57]

Apply filter() to each value in the series:
0    400
1    111
2     20
3     57
4       
dtype: object


## Splitting Strings Apart

In [17]:
# Create a series of dirty, annoying strings.
words = pd.Series([
    'MollyMalone$molmal@gmail.com',
    'JeffreyJones$jefjo@hotmail.com',
    'DeadParrot$fjords@gmail.com'
])

# Split on '$' using the pandas Series.str.split() method.
# expand=True returns a DataFrame with one column per piece.
word_split = words.str.split('$', expand=True)
names = word_split[0]
emails = word_split[1]
print(names, '\n')
print(emails)

0     MollyMalone
1    JeffreyJones
2      DeadParrot
Name: 0, dtype: object 

0     molmal@gmail.com
1    jefjo@hotmail.com
2     fjords@gmail.com
Name: 1, dtype: object
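
If the separator can appear more than once, the `n` argument limits how many splits are made. A hedged sketch with hypothetical records (the second entry is made up to contain an extra '$'); `first_split` and `last_split` are illustrative names.

```python
import pandas as pd

# Hypothetical records: the second one has an extra '$' inside the name part.
records = pd.Series([
    'MollyMalone$molmal@gmail.com',
    'Cash$Money$cash@example.com',
])

# n=1 splits only at the first '$'.
first_split = records.str.split('$', n=1, expand=True)

# rsplit with n=1 splits only at the last '$', so the email stays intact.
last_split = records.str.rsplit('$', n=1, expand=True)

print(first_split, '\n')
print(last_split)
```

Without `n`, the second record would be cut into three pieces and the first record would be padded with a missing value in the extra column.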


In [15]:
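# Splitting on capital letters. The capitals are the separators, so they are
# dropped, and the first column is empty because each name starts with one.
# Just because we can doesn't mean we should:
print(names.str.split('[A-Z]', expand=True))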

  0       1      2
0      olly  alone
1    effrey   ones
2       ead  arrot


In [19]:
import re

# Each part of a name is a capital letter followed by lowercase letters,
# so the first match is the first name.
firstname = names.apply(lambda x: re.findall(r'[A-Z][a-z]*', x)[0])

# The second match is the last name.
lastname = names.apply(lambda x: re.findall(r'[A-Z][a-z]*', x)[1])

print(firstname, '\n')
print(lastname)

0      Molly
1    Jeffrey
2       Dead
Name: 0, dtype: object 

0    Malone
1     Jones
2    Parrot
Name: 0, dtype: object
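
An alternative worth knowing is Series.str.extract(), which can capture both parts in one pass. This is a sketch under the same assumption as above (each name is two capitalized words run together); the group names `first` and `last` are my own choice.

```python
import pandas as pd

names = pd.Series(['MollyMalone', 'JeffreyJones', 'DeadParrot'])

# Two named groups, each one capital letter followed by lowercase letters.
pattern = r'(?P<first>[A-Z][a-z]*)(?P<last>[A-Z][a-z]*)'

# Returns a DataFrame with one column per named group.
parts = names.str.extract(pattern)
print(parts)
```

A value that does not match the pattern gives missing values in that row, whereas indexing the findall() result with [1] would raise an IndexError.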


## Changing the content of strings

### Replace

In [20]:
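# Use pandas' useful Series.str.replace()

print(emails.str.replace('@', ' at '), '\n')

# regex=False treats '.com' as literal text; as a regex, '.' would match any character.
print(emails.str.replace('.com', '', regex=False))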

0     molmal at gmail.com
1    jefjo at hotmail.com
2     fjords at gmail.com
Name: 1, dtype: object 

0     molmal@gmail
1    jefjo@hotmail
2     fjords@gmail
Name: 1, dtype: object
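
The `regex` argument of Series.str.replace() matters whenever the pattern contains regex characters such as '.'. Its default has differed between pandas versions, so passing it explicitly is the safer habit. A small sketch with a made-up address chosen to show the difference:

```python
import pandas as pd

# Hypothetical address containing 'com' in several places.
tricky = pd.Series(['telecom@xcom.com'])

# As a regex, '.' matches any character, so 'ecom' and 'xcom' are removed too.
as_regex = tricky.str.replace('.com', '', regex=True)

# As literal text, only the exact substring '.com' is removed.
as_literal = tricky.str.replace('.com', '', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```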


### Changing case

In [21]:
print(names.str.lower(), '\n')
print(names.str.upper(), '\n')
print(names.str.capitalize())

0     mollymalone
1    jeffreyjones
2      deadparrot
Name: 0, dtype: object 

0     MOLLYMALONE
1    JEFFREYJONES
2      DEADPARROT
Name: 0, dtype: object 

0     Mollymalone
1    Jeffreyjones
2      Deadparrot
Name: 0, dtype: object
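
The same methods exist on plain Python strings, and .title() is a useful companion to .capitalize() when a value has several words (pandas also offers Series.str.title()). A quick sketch with example strings of my own:

```python
name = 'molly malone'

capitalized = name.capitalize()  # only the first character is uppercased
titled = name.title()            # the first letter of every word is uppercased

# Lowercasing both sides gives a simple case-insensitive comparison.
same_email = 'Fjords@Gmail.com'.lower() == 'fjords@gmail.com'.lower()

print(capitalized)
print(titled)
print(same_email)
```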


### Stripping whitespace

In [23]:
spacy = '  What,  on earth, is going on here?    '
print(spacy)
print(spacy.strip())

  What,  on earth, is going on here?    
What,  on earth, is going on here?


In [24]:
# Series of strings with annoying whitespace.
words = pd.Series([' duck', 'duck ', ' duck ', 'goose'])
print(words[0] == words[1])

stripped = words.str.strip()
print(stripped[0] == stripped[1])

False
True
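
.strip() has two one-sided relatives, .lstrip() and .rstrip(), and all three accept an optional set of characters to remove instead of whitespace. A short plain-Python sketch (the pandas equivalents are Series.str.lstrip() and Series.str.rstrip()); the sample values are my own:

```python
padded = '  duck  '

left_stripped = padded.lstrip()    # removes leading whitespace only
right_stripped = padded.rstrip()   # removes trailing whitespace only

# Passing characters strips those from the ends instead of whitespace.
price = '$20'.strip('$')

# Whitespace inside the string is never touched.
inner = ' What,  on earth '.strip()

print(repr(left_stripped))
print(repr(right_stripped))
print(price)
print(repr(inner))
```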
