# Pandas and String Manipulation
April 24, 2019
James Cage

My notes on manipulating text and strings. Based largely on [this page in the documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html).

In [5]:
import numpy as np
import pandas as pd

In [15]:
# Create an example Series (note the missing value at position 5)

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
# s.str.lower()
s.str.upper()
# s.str.len()

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object
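The commented-out calls above behave the same way as `.str.upper()` with respect to the missing value. A minimal sketch, reusing the same example Series (the variable names `lowered` and `lengths` are mine, not from the original notebook):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# .str methods skip missing values: the NaN at position 5 stays missing
lowered = s.str.lower()

# .str.len() returns the length of each string; the missing entry stays missing
lengths = s.str.len()

print(lowered.tolist())
print(lengths.tolist())
```

Because the missing entry propagates, there is no need to drop or fill it before calling a `.str` method.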

In [34]:
# Example index

idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
idx

Index([' jack', 'jill ', ' jesse ', 'frank'], dtype='object')

In [41]:
# idx.str.strip()
# idx.str.lstrip()
# idx.str.rstrip()

# .str methods can be chained. Here's an uninteresting example:

idx.str.rstrip().str.lstrip() 

Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
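The chained call above should give the same result as a single `.str.strip()`. A quick sketch to check that, using the same example index (the names `chained` and `stripped` are only for illustration):

```python
import pandas as pd

idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])

# Strip trailing whitespace, then leading whitespace
chained = idx.str.rstrip().str.lstrip()

# Strip both sides in one call
stripped = idx.str.strip()

# Compare the two results element by element
print(stripped.equals(chained))
```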

In [53]:
# Sample dataframe

row_n, column_n = 3, 3

df = pd.DataFrame(np.random.randn(row_n, column_n), columns=[' Column A ', ' Column B ', 'Column C  '], index=range(row_n))
df

   Column A  Column B  Column C
0  0.591481 -0.328489  1.183213
1 -2.209901 -0.621937  0.338393
2  1.581862  0.725051 -0.954249


In [54]:
# It's not easy to tell, but the column names above have leading and trailing blanks.

df.columns

Index([' Column A ', ' Column B ', 'Column C  '], dtype='object')

In [55]:
# Clean those up by chaining .str methods: strip the blanks, lowercase,
# then use .str.replace() to turn the remaining spaces into underscores

# df.columns = df.columns.str.strip()
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.columns

Index(['column_a', 'column_b', 'column_c'], dtype='object')
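If this cleanup is needed on more than one DataFrame, the chain can be wrapped in a small function. This is only a sketch: `clean_columns` is a hypothetical helper name, not part of pandas, and the zero-filled frame is stand-in data. Passing `regex=False` is my addition; it makes the space a literal match regardless of which default the installed pandas version uses for `str.replace`.

```python
import numpy as np
import pandas as pd


def clean_columns(columns):
    """Hypothetical helper: strip blanks, lowercase, spaces -> underscores."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(' ', '_', regex=False)  # literal replacement, not a regex
    )


# Stand-in data with the same messy column names as the example above
df = pd.DataFrame(
    np.zeros((2, 3)),
    columns=[' Column A ', ' Column B ', 'Column C  '],
)

df.columns = clean_columns(df.columns)
print(list(df.columns))
```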

In [56]:
df

   column_a  column_b  column_c
0  0.591481 -0.328489  1.183213
1 -2.209901 -0.621937  0.338393
2  1.581862  0.725051 -0.954249
