### [Reference] <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Pandas Cheat Sheet</a>

#### https://pandas.pydata.org/docs/user_guide/text.html

**The str accessor**
- `.str` is an alias for pandas.core.strings.StringMethods
- It is an accessor attribute rather than a callable, so it is written without parentheses
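The two points above can be checked directly. This is a minimal sketch, assuming a reasonably recent pandas version; the Series name `demo` and its contents are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration.
demo = pd.Series(["suho", "AA"])

# .str is an accessor object, not a method, so it is not callable.
print(type(demo.str).__name__)  # StringMethods
print(callable(demo.str))       # False

# String methods are reached through the accessor, with no parentheses after .str.
lowered = demo.str.lower()
print(lowered.tolist())         # ['suho', 'aa']

# Adding parentheses after .str raises TypeError.
try:
    demo.str()
    call_failed = False
except TypeError:
    call_failed = True
print(call_failed)              # True
```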

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

In [6]:
s = pd.Series(
    ["suho      ","AA",np.nan,"     rabbit"]
)
s

0     suho      
1             AA
2            NaN
3         rabbit
dtype: object

str: often described as a method, but it cannot be called, so it is written without parentheses (`s.str.lower()`, not `s.str().lower()`)

#### 1) lower() / upper() / len()

In [7]:
s.str.lower()

0     suho      
1             aa
2            NaN
3         rabbit
dtype: object

In [8]:
s.str.len()

0    10.0
1     2.0
2     NaN
3    11.0
dtype: float64
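The heading also lists `upper()`, which the cells in this section do not run. The sketch below rebuilds the same Series and applies it, and it also shows that missing values stay missing. It assumes the default object dtype seen in the outputs of this notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series(["suho      ", "AA", np.nan, "     rabbit"])

# upper() changes case only; surrounding whitespace is preserved.
upper = s.str.upper()
print(repr(upper[0]))  # 'SUHO      '
print(repr(upper[3]))  # '     RABBIT'

# NaN is propagated rather than raising an error.
print(pd.isna(upper[2]))  # True

# len() counts whitespace too; the NaN row is dropped here before converting to int.
lengths = s.str.len().dropna().astype(int)
print(lengths.tolist())  # [10, 2, 11]
```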

#### 2) strip()

In [9]:
s.str.strip()

0      suho
1        AA
2       NaN
3    rabbit
dtype: object
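`strip()` removes whitespace on both sides. When only one side should be trimmed, `lstrip()` and `rstrip()` are the one-sided variants. A minimal sketch on the same Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["suho      ", "AA", np.nan, "     rabbit"])

left = s.str.lstrip()   # removes leading whitespace only
right = s.str.rstrip()  # removes trailing whitespace only

print(repr(left[0]))   # 'suho      '  (trailing spaces kept)
print(repr(left[3]))   # 'rabbit'
print(repr(right[0]))  # 'suho'
print(repr(right[3]))  # '     rabbit'  (leading spaces kept)
```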

In [11]:
df = pd.DataFrame(np.random.randn(2,2),columns=[" Column A ", " Column B "])
df

    Column A    Column B 
0   -0.003461   -0.258388
1    0.661821   -0.454846


In [13]:
df.columns.str.strip()

Index(['Column A', 'Column B'], dtype='object')

In [16]:
df.columns.str.strip().str.lower()

Index(['column a', 'column b'], dtype='object')
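The calls above return a new `Index` and leave `df` unchanged. To actually rename the columns, the result has to be assigned back. The sketch below does that and, as an optional extra step that is an assumption of this example rather than part of the notebook, replaces inner spaces with underscores.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(2, 2), columns=[" Column A ", " Column B "])

# The .str calls return a new Index, so the result is assigned back to df.columns.
# The final replace (space -> underscore) is an optional step added for this sketch.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(df.columns.tolist())  # ['column_a', 'column_b']
```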

#### 3) split()

In [17]:
ser1 = pd.Series(["ha_a_b","hi_c_d",np.nan,"ho_e_f"])
ser1

0    ha_a_b
1    hi_c_d
2       NaN
3    ho_e_f
dtype: object

In [19]:
ser1.str.split("_")

0    [ha, a, b]
1    [hi, c, d]
2           NaN
3    [ho, e, f]
dtype: object

* **expand=True** returns the result as a DataFrame, with one column per split piece

In [20]:
ser1.str.split("_",expand=True)

     0    1    2
0   ha    a    b
1   hi    c    d
2  NaN  NaN  NaN
3   ho    e    f
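Two related patterns are often useful after `split()`: picking a single piece with `.str[i]`, and limiting the number of splits with `n`. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(["ha_a_b", "hi_c_d", np.nan, "ho_e_f"])

# .str[0] takes the first element of each list produced by split().
first = ser1.str.split("_").str[0]
print(first.dropna().tolist())  # ['ha', 'hi', 'ho']

# n=1 splits only at the first underscore, so at most two columns are produced.
limited = ser1.str.split("_", n=1, expand=True)
print(limited.shape)                 # (4, 2)
print(limited[1].dropna().tolist())  # ['a_b', 'c_d', 'e_f']
```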


#### 4) replace()

In [23]:
ser1 = pd.Series(["Suho","bAAa",np.nan,"cute_dog"])
ser1

0        Suho
1        bAAa
2         NaN
3    cute_dog
dtype: object

In [25]:
ser1.str.replace("^.a|dog","***",case=False,regex=True)

0        Suho
1       ***Aa
2         NaN
3    cute_***
dtype: object

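- `^` anchors the match at the start of the string
- `.a` matches any single character followed by `a`; combined with `^`, this is the first two characters of the string
- `|` means OR (alternation)
- `dog` matches the literal substring `dog`
- `case=False` makes the match case-insensitive, which is why `bA` in `bAAa` matches
- `regex=True` treats the pattern as a regular expression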
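To see which substring the pattern actually matches in each value, the same pattern can be run through Python's standard `re` module. This is only an illustrative cross-check; the dictionary name `matches` is introduced for this sketch.

```python
import re

import numpy as np
import pandas as pd

ser1 = pd.Series(["Suho", "bAAa", np.nan, "cute_dog"])
pattern = r"^.a|dog"

# Record the first match of the pattern in each non-missing value.
matches = {}
for text in ser1.dropna():
    m = re.search(pattern, text, flags=re.IGNORECASE)
    matches[text] = m.group(0) if m else None
print(matches)  # {'Suho': None, 'bAAa': 'bA', 'cute_dog': 'dog'}

# The pandas call replaces exactly those matched parts.
replaced = ser1.str.replace(pattern, "***", case=False, regex=True)
print(replaced.dropna().tolist())  # ['Suho', '***Aa', 'cute_***']
```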

#### 5) cat: concatenating text

In [27]:
s = pd.Series(["ha","hi","ho"])

In [28]:
s.str.cat()

'hahiho'

In [29]:
s.str.cat(sep=",")

'ha,hi,ho'
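`cat()` can also join a Series element-wise with another list-like, and `na_rep` controls how missing values are handled. A minimal sketch; the second Series `t` is hypothetical sample data for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(["ha", "hi", "ho"])

# Passing another list-like joins element by element and returns a Series.
paired = s.str.cat(["A", "B", "C"], sep="-")
print(paired.tolist())  # ['ha-A', 'hi-B', 'ho-C']

# Hypothetical data with a missing value.
t = pd.Series(["ha", np.nan, "ho"])

# By default, missing values are skipped when collapsing to one string.
skipped = t.str.cat(sep=",")
print(skipped)  # ha,ho

# na_rep substitutes a placeholder instead of skipping.
filled = t.str.cat(sep=",", na_rep="?")
print(filled)   # ha,?,ho
```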

### [Exercise 2] Processing diary-style data

In [30]:
day_plan = [
    "1st_seq : getting up at 05:45am",
    "2nd_seq : swimming from 06:00 to 07:00am",
    "3nd_seq : my morning food is american style",
    "4nd_seq : Writing some proposal from 02:00pm to 06:00pm",
    "5nd_seq : Arriving at JongGak at 07:00pm",
    "6nd_seq : Fun with friends enjoy beer till 09:30pm",
    "7nd_seq : My house at 10:30pm and sleeping by 12:00pm"
]
df = pd.DataFrame(day_plan, columns=["schedule"])
df

                                            schedule
0                    1st_seq : getting up at 05:45am
1           2nd_seq : swimming from 06:00 to 07:00am
2        3nd_seq : my morning food is american style
3  4nd_seq : Writing some proposal from 02:00pm t...
4           5nd_seq : Arriving at JongGak at 07:00pm
5  6nd_seq : Fun with friends enjoy beer till 09:...
6  7nd_seq : My house at 10:30pm and sleeping by ...


In [31]:
# split()

df["schedule"].str.split()

0               [1st_seq, :, getting, up, at, 05:45am]
1     [2nd_seq, :, swimming, from, 06:00, to, 07:00am]
2    [3nd_seq, :, my, morning, food, is, american, ...
3    [4nd_seq, :, Writing, some, proposal, from, 02...
4     [5nd_seq, :, Arriving, at, JongGak, at, 07:00pm]
5    [6nd_seq, :, Fun, with, friends, enjoy, beer, ...
6    [7nd_seq, :, My, house, at, 10:30pm, and, slee...
Name: schedule, dtype: object

In [33]:
df["schedule"].str.split().str.len()

0     6
1     7
2     8
3     9
4     7
5     9
6    10
Name: schedule, dtype: int64

In [34]:
df["schedule"].str.contains("My")

0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: schedule, dtype: bool
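The boolean Series returned by `contains()` is typically used as a row filter. Note that the match is case-sensitive by default, which is why row 2 ("my morning food") is `False` above. The sketch below rebuilds the same DataFrame and filters with `case=False`.

```python
import pandas as pd

day_plan = [
    "1st_seq : getting up at 05:45am",
    "2nd_seq : swimming from 06:00 to 07:00am",
    "3nd_seq : my morning food is american style",
    "4nd_seq : Writing some proposal from 02:00pm to 06:00pm",
    "5nd_seq : Arriving at JongGak at 07:00pm",
    "6nd_seq : Fun with friends enjoy beer till 09:30pm",
    "7nd_seq : My house at 10:30pm and sleeping by 12:00pm",
]
df = pd.DataFrame(day_plan, columns=["schedule"])

# Case-sensitive match: only the row containing "My".
exact = df[df["schedule"].str.contains("My")]
print(exact.index.tolist())  # [6]

# case=False also picks up the lowercase "my" in row 2.
any_case = df[df["schedule"].str.contains("my", case=False)]
print(any_case.index.tolist())  # [2, 6]
```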

In [35]:
df["schedule"].str.count(r"\d")

0    5
1    9
2    1
3    9
4    5
5    5
6    9
Name: schedule, dtype: int64

In [None]:
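# findall() requires a pattern argument; r"\d{2}:\d{2}" (times such as 05:45) is an assumed example
df["schedule"].str.findall(r"\d{2}:\d{2}")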
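One possible way to continue this exercise is to turn the free-text schedule into structured columns. The sketch below is an assumption about where the exercise is heading, not part of the original notebook: the column names `seq`, `task`, and `first_time` and the time pattern are hypothetical choices made for this example.

```python
import pandas as pd

day_plan = [
    "1st_seq : getting up at 05:45am",
    "2nd_seq : swimming from 06:00 to 07:00am",
    "3nd_seq : my morning food is american style",
    "4nd_seq : Writing some proposal from 02:00pm to 06:00pm",
    "5nd_seq : Arriving at JongGak at 07:00pm",
    "6nd_seq : Fun with friends enjoy beer till 09:30pm",
    "7nd_seq : My house at 10:30pm and sleeping by 12:00pm",
]
df = pd.DataFrame(day_plan, columns=["schedule"])

# Split once at " : " into a sequence label and a task description.
parts = df["schedule"].str.split(" : ", n=1, expand=True)
parts.columns = ["seq", "task"]  # hypothetical column names

# Extract the first time that carries an am/pm suffix; rows without one become NaN.
parts["first_time"] = parts["task"].str.extract(
    r"(\d{2}:\d{2}(?:am|pm))", expand=False
)

print(parts.loc[0].tolist())  # ['1st_seq', 'getting up at 05:45am', '05:45am']
```

In row 1 the first suffixed time is `07:00am`, because `06:00` has no am/pm marker; row 2 contains no time at all and yields a missing value.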