In [1]:
import pandas as pd
import numpy as np

At some point, I want to find a way to input text into Pandas and locate all the section headers in the document by their lack of punctuation at the end, i.e.:

" . . . the end of the first section.

Header for Section Two

This is section two . . . "

This would make it easier to create TTS audiobooks from text, since I currently have to manually find all the section headers and insert punctuation in order for the TTS software to read the headers more slowly. 
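A rough sketch of how this might work, assuming the document is already a plain string with each header or paragraph on its own line. The sample `text`, the punctuation set, and all variable names are illustrative assumptions, not something from the lecture:

```python
import pandas as pd

# Hypothetical sample text; a real document would be read from a file.
text = (
    "This is the end of the first section.\n"
    "\n"
    "Header for Section Two\n"
    "\n"
    "This is section two."
)

# One line per row, with surrounding whitespace removed.
lines = pd.Series(text.split("\n")).str.strip()

# Ignore blank lines, then flag lines whose last character is punctuation.
non_blank = lines[lines != ""]
ends_with_punct = non_blank.str.contains(r"[.!?:;,]$")

# Candidate headers: non-blank lines that do not end in punctuation.
headers = non_blank[~ends_with_punct]
print(headers.tolist())

# Append a period to each candidate header so TTS software pauses after it.
fixed = lines.copy()
fixed.loc[headers.index] = headers + "."
```

This is only a heuristic: a paragraph that ends in a quotation mark or a closing parenthesis would be flagged as a header too, so the punctuation set would probably need tuning for a real book.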

On to the lecture:

In [3]:
email = 'jose@email.com'

In [4]:
email.split('@')

['jose', 'email.com']

Now we can get the username and the email domain separately
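For example, the two pieces can be unpacked straight into variables. The names `user`, `domain`, and `provider` are my own, and taking the text before the first dot as the provider is an assumption that only holds for simple domains like this one:

```python
email = 'jose@email.com'

# Unpack the two pieces directly into separate variables.
user, domain = email.split('@')

# Assumption: the provider name is the part of the domain before the first dot.
provider = domain.split('.')[0]

print(user, domain, provider)
```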

In [5]:
names = pd.Series(['andrew', 'bobo', 'claire', 'david', '5'])

In [6]:
names

0    andrew
1      bobo
2    claire
3     david
4         5
dtype: object

Note: 5 is still a string, not an int

In [7]:
names.str.upper()

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         5
dtype: object

In [8]:
names

0    andrew
1      bobo
2    claire
3     david
4         5
dtype: object

In [9]:
email.isdigit()

False

In [10]:
'5'.isdigit()

True

Even though '5' is a string, isdigit() returns True because every character in it is a digit

In [11]:
names.str.isdigit()

0    False
1    False
2    False
3    False
4     True
dtype: bool
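The boolean Series can be used as a mask to pull the numeric strings out and convert them. A small sketch (the variable names `is_number`, `numbers`, and `words` are my own):

```python
import pandas as pd

names = pd.Series(['andrew', 'bobo', 'claire', 'david', '5'])

# Boolean mask: True where every character in the string is a digit.
is_number = names.str.isdigit()

# Keep only the numeric strings and convert them to integers.
numbers = names[is_number].astype(int)

# Keep only the non-numeric entries.
words = names[~is_number]

# Caveat: signs and decimal points are not digits, so both of these are False.
print('-5'.isdigit(), '5.0'.isdigit())
```

Because of that caveat, isdigit() is probably only a safe test for non-negative whole numbers.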

In [20]:
tech_finance = ['GOOG,APPL,AMZN', 'JPM,BAC,GS']

In [21]:
len(tech_finance)

2

In [22]:
tickers = pd.Series(tech_finance)

In [23]:
tickers

0    GOOG,APPL,AMZN
1        JPM,BAC,GS
dtype: object

In [24]:
tickers.str.split(',')

0    [GOOG, APPL, AMZN]
1        [JPM, BAC, GS]
dtype: object

In [25]:
tech = 'GOOG,APPL,AMZN'

In [26]:
tech.split(',')[0]

'GOOG'

In [27]:
tickers.str.split(',').str[0]

0    GOOG
1     JPM
dtype: object

Now it's only reporting the first item of each list produced by the ',' split.

Now making three columns:

In [29]:
tickers.str.split(',', expand = True)

      0     1     2
0  GOOG  APPL  AMZN
1   JPM   BAC    GS
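Since expand=True returns a DataFrame, the columns can be renamed, and the n parameter limits how many splits are made. A short sketch; the column labels are hypothetical:

```python
import pandas as pd

tickers = pd.Series(['GOOG,APPL,AMZN', 'JPM,BAC,GS'])

# expand=True returns a DataFrame, so the columns can be given names.
split_df = tickers.str.split(',', expand=True)
split_df.columns = ['first', 'second', 'third']  # hypothetical labels

# n=1 splits only at the first comma, leaving the rest of the string intact.
first_and_rest = tickers.str.split(',', n=1, expand=True)

print(split_df)
print(first_and_rest)
```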


In [30]:
messy_names = pd.Series(['andrew  ','bo;bo','   claire   '])

In [31]:
messy_names

0        andrew  
1           bo;bo
2       claire   
dtype: object

In [32]:
messy_names[0]

'andrew  '

In [33]:
messy_names.str.replace(';','')

0        andrew  
1            bobo
2       claire   
dtype: object

In [34]:
messy_names.str.replace(';','').str.strip()

0    andrew
1      bobo
2    claire
dtype: object

In [35]:
messy_names.str.replace(';','').str.strip()[0]

'andrew'

In [36]:
messy_names.str.replace(';','').str.strip().str.capitalize()

0    Andrew
1      Bobo
2    Claire
dtype: object

These three methods have now cleaned and standardized the Series

Now doing it as a function:

In [37]:
def cleanup(name):
    name = name.replace(';','')
    name = name.strip()
    name = name.capitalize()
    return name

In [38]:
messy_names.apply(cleanup)

0    Andrew
1      Bobo
2    Claire
dtype: object

It seems like, in general, chaining string method calls is best for relatively simple applications, but once if statements get involved, writing a function is usually the easier route (conditional logic is awkward to express as a chain of string method calls)
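Here is a sketch of a case where an if statement comes into play. The function `cleanup_or_flag`, the extra '5' entry, and the 'Unknown' placeholder are all made up for illustration:

```python
import pandas as pd

messy_names = pd.Series(['andrew  ', 'bo;bo', '   claire   ', '5'])

def cleanup_or_flag(name):
    """Clean a name, or return a placeholder if the entry is numeric."""
    name = name.replace(';', '').strip()
    if name.isdigit():
        # Hypothetical rule: numeric entries are not real names.
        return 'Unknown'
    return name.capitalize()

cleaned = messy_names.apply(cleanup_or_flag)
print(cleaned)
```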

Which method is more efficient from a processing standpoint?

The chained string method calls are slightly slower than apply() with a function, and the function is much faster still if vectorized (np.vectorize). However, this may not matter until your tables are extremely large
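One way to check this on your own machine is with the standard timeit module. This is a sketch: the 3,000-row Series, the wrapper function names, and the run count are my own choices, and the timings will vary with hardware and library versions, so no results are quoted here:

```python
import timeit

import numpy as np
import pandas as pd

# A larger Series than in the lecture, so the differences are measurable.
messy_names = pd.Series(['andrew  ', 'bo;bo', '   claire   '] * 1000)

def cleanup(name):
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name

def with_str_methods():
    return messy_names.str.replace(';', '').str.strip().str.capitalize()

def with_apply():
    return messy_names.apply(cleanup)

def with_vectorize():
    # np.vectorize wraps the Python function so it accepts array-like input.
    return np.vectorize(cleanup)(messy_names)

# Time 20 runs of each approach and print the totals.
for func in (with_str_methods, with_apply, with_vectorize):
    seconds = timeit.timeit(func, number=20)
    print(f'{func.__name__}: {seconds:.4f} s for 20 runs')
```

Note that np.vectorize returns a NumPy array rather than a Series, and the NumPy documentation describes it as a convenience wrapper around a loop rather than a performance tool, so it is worth measuring rather than assuming.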