In this chapter, I discuss tools for **missing data**, **duplicate data**, **string manipulation**, and some other analytical data transformations. In the next chapter, I focus on combining and rearranging datasets in various ways.

## 7.1 Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.
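
As a small illustration of that default, the sketch below uses a made-up Series `s` (not part of the chapter's data) to compare `mean()` with and without `skipna`. `count()` likewise counts only the non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical three-element Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped: (1.0 + 3.0) / 2
mean_strict = s.mean(skipna=False)  # NaN propagates into the result
n_observed = s.count()              # number of non-missing values

print(mean_default, mean_strict, n_observed)
```

Passing `skipna=False` is one way to make missing values visible in a summary rather than silently ignored.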


The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value **NaN (Not a Number)** to represent missing data. We call this a sentinel value that can be easily detected:

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [5]:
string_data = pd.Series(["aardvark", "artichoke", np.nan, "avocado"])

In [6]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [7]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [8]:
# Python's built-in None is also treated as missing in object arrays
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

### Filtering Out Missing Data

In [9]:
from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])


In [10]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [11]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to boolean indexing with `notnull`:

In [13]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [16]:
# With a DataFrame, you can drop rows or columns that are all NA or that contain any NA
data = pd.DataFrame([
    [1.,6.5,3.],
    [1.,NA,NA],
    [NA,NA,NA],
    [NA,6.5,3.]
])


In [15]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [18]:
# By default, dropna drops any row containing a missing value
cleaned = data.dropna()

In [19]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [20]:
# Passing how='all' drops only the rows that are all NA
data.dropna(how='all')


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [21]:
# To drop columns in the same way, pass axis=1; first add an all-NA column
data[4] = NA


In [22]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [23]:
data.dropna(axis=1,how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing at least a certain number of observations. You can indicate this with the **thresh** argument:

In [24]:
df = pd.DataFrame(np.random.rand(7,3))
df

Unnamed: 0,0,1,2
0,0.516086,0.076958,0.919035
1,0.340232,0.97525,0.672477
2,0.763534,0.688688,0.36868
3,0.724335,0.753198,0.880471
4,0.248853,0.049224,0.263158
5,0.483757,0.77409,0.480508
6,0.886168,0.164907,0.548951


In [25]:
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA

In [26]:
df

Unnamed: 0,0,1,2
0,0.516086,,
1,0.340232,,
2,0.763534,,0.36868
3,0.724335,,0.880471
4,0.248853,0.049224,0.263158
5,0.483757,0.77409,0.480508
6,0.886168,0.164907,0.548951


In [27]:
df.dropna()

Unnamed: 0,0,1,2
4,0.248853,0.049224,0.263158
5,0.483757,0.77409,0.480508
6,0.886168,0.164907,0.548951


In [28]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.763534,,0.36868
3,0.724335,,0.880471
4,0.248853,0.049224,0.263158
5,0.483757,0.77409,0.480508
6,0.886168,0.164907,0.548951
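
To make the meaning of `thresh` concrete, here is a minimal sketch on a small made-up frame (the name `frame` and its values are illustrative, not from the chapter). It checks that `dropna(thresh=2)` keeps exactly the rows that have at least two non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 3 observed values, row 1 has 1, row 2 has 0
frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
})

# Count the non-missing values in each row
non_na_per_row = frame.notna().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-missing values
kept = frame.dropna(thresh=2)

# The same filter written by hand with a boolean mask
manual = frame[non_na_per_row >= 2]

print(kept)
```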


### Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:


In [29]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.516086,0.0,0.0
1,0.340232,0.0,0.0
2,0.763534,0.0,0.36868
3,0.724335,0.0,0.880471
4,0.248853,0.049224,0.263158
5,0.483757,0.77409,0.480508
6,0.886168,0.164907,0.548951


Calling `fillna` with a dict, you can use a different fill value for each column:

In [30]:
df.fillna({1:0.5,2:0})

Unnamed: 0,0,1,2
0,0.516086,0.5,0.0
1,0.340232,0.5,0.0
2,0.763534,0.5,0.36868
3,0.724335,0.5,0.880471
4,0.248853,0.049224,0.263158
5,0.483757,0.77409,0.480508
6,0.886168,0.164907,0.548951


`fillna` returns a new object, but you can modify the existing object in place:

In [31]:
# With inplace=True, fillna modifies df directly and returns None
_ = df.fillna(0, inplace=True)

In [32]:
_

In [33]:
df

Unnamed: 0,0,1,2
0,0.516086,0.0,0.0
1,0.340232,0.0,0.0
2,0.763534,0.0,0.36868
3,0.724335,0.0,0.880471
4,0.248853,0.049224,0.263158
5,0.483757,0.77409,0.480508
6,0.886168,0.164907,0.548951
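
Beyond constants, one common choice (an addition here, not something the notes above demonstrate) is to fill each column with its own mean. The sketch below assumes a small made-up frame named `frame`; passing the Series returned by `frame.mean()` to `fillna` fills column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one hole in each column
frame = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 4.0, 8.0],
})

# frame.mean() skips NaN, giving x -> 2.0 and y -> 6.0
column_means = frame.mean()

# Each column is filled with its own mean; frame itself is left untouched
filled = frame.fillna(column_means)

print(filled)
```

Because `inplace` is not used, `frame` still contains its missing values afterwards, which makes it easy to compare several fill strategies side by side.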


## 7.2 Data Transformation
So far in this chapter we've been concerned with handling missing data. Filtering, cleaning, and other transformations are another class of important operations.

### Removing Duplicates

In [34]:
data = pd.DataFrame({
    'k1':['one','two'] * 3 + ['two'],
    'k2': [1,1,2,3,3,4,4]
})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [35]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

`drop_duplicates` returns a DataFrame containing the rows where the `duplicated` array is False:

In [36]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [37]:
data['v1'] = range(7)

In [38]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


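Both `duplicated` and `drop_duplicates` consider all of the columns by default; you can instead pass a subset of columns, such as `['k1']`, to detect duplicates on those alone. By default they keep the first observed value combination. Passing `keep='last'` will return the last one: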

In [41]:
data.drop_duplicates(['k1'],keep='last')

Unnamed: 0,k1,k2,v1
4,one,3,4
6,two,4,6
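
For comparison, the following sketch rebuilds the same frame and contrasts the default (`keep='first'`) with `keep='last'` when deduplicating on `k1` only. The variable names `first` and `last` are illustrative.

```python
import pandas as pd

# Same data as above, including the extra v1 column
data = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
    "v1": range(7),
})

# Default: keep the first row seen for each distinct k1 value
first = data.drop_duplicates(subset=["k1"])

# keep='last': keep the last row seen for each distinct k1 value
last = data.drop_duplicates(subset=["k1"], keep="last")

print(first)
print(last)
```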


### Transforming Data Using a Function or Mapping 

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [44]:
data = pd.DataFrame(
    {'food': ['bacon','pulled pork', 'bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
    'ounces': [4,3,12,6,7.5,8,3,5,6]
    }
)


In [45]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the kind of animal:

In [46]:
meat_to_animal = {
    'bacon':'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

The `map` method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the `str.lower` Series method:

In [48]:
lowercased = data['food'].str.lower()

In [49]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [50]:
data['animal']  = lowercased.map(meat_to_animal)

In [51]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [52]:
# We could also have passed a function that does all the work
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
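
The two approaches differ when a value has no entry in the mapping. The sketch below uses a made-up Series `foods` containing an unmapped item (`'tofu'` is an assumption for illustration): mapping with the dict yields NaN for it, while the lambda that indexes the dict directly raises `KeyError`.

```python
import pandas as pd

# Hypothetical data: 'tofu' is deliberately absent from the mapping
foods = pd.Series(["Bacon", "tofu"])
meat_to_animal = {"bacon": "pig"}

# Passing the dict to map: unmapped values become NaN
mapped = foods.str.lower().map(meat_to_animal)

# Passing a function that indexes the dict: unmapped values raise KeyError
try:
    foods.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True

print(mapped)
print("KeyError raised:", raised)
```

If missing keys are expected, `meat_to_animal.get(x.lower())` inside the function avoids the exception and returns `None` for unmapped values.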

### Replacing Values

Filling in missing data with the `fillna` method is a special case of more general value replacement. As you've already seen, `map` can be used to modify a subset of values in an object, but `replace` provides a simpler and more flexible way to do so. Consider this Series:

In [55]:
data = pd.Series([1.,-999,2.,-999.,-1000.,3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [56]:
# -999 may be a sentinel for missing data; replace returns a new Series with NaN in its place
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [57]:
# The original Series is unchanged, because replace returned a new object
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list followed by the substitute value:

In [58]:
data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes as the second argument:


In [60]:
data.replace([-999,-1000],[np.nan,0])


0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
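
The pairs can also be given as a single dict, which keeps each value next to its substitute. This is a minimal sketch on the same Series; the name `result` is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Keys are the values to find, dict values are their replacements
result = data.replace({-999: np.nan, -1000: 0})

print(result)
```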

### Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure. Here's a simple example:

In [63]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [64]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [65]:
transform = lambda x: x[:4].upper()


In [66]:
transform

<function __main__.<lambda>(x)>

In [67]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [68]:
data.index = data.index.map(transform)

In [69]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the original, a useful method is `rename`:

In [70]:
data.rename(index=str.title,columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


### Discretization and Binning 
Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [71]:
ages = [20,22,25,27,21,23,37,31,61,45,41,32]


In [73]:
bins = [18,25,35,60,100]

In [74]:
cats = pd.cut(ages,bins)

In [75]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see describes the bins computed by `pandas.cut`. You can treat it like an array of strings indicating the bin name; internally it contains a `categories` array specifying the distinct category names along with a labeling for the ages data in the `codes` attribute:

In [76]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [77]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [78]:
pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64
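
Two options of `pd.cut` are worth a quick sketch: `labels` replaces the interval notation with names, and `right=False` makes the bins closed on the left instead of the right. The group names below are illustrative choices, not part of the notes above.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Hypothetical bin names, one per interval
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
labeled = pd.cut(ages, bins, labels=group_names)

# right=False gives [18, 25), [25, 35), ... so age 25 moves to the second bin
left_closed = pd.cut(ages, bins, right=False)

print(labeled[2], left_closed.codes[2])
```

Age 25 (the third element) falls in the first bin with the default right-closed intervals and in the second bin when `right=False`.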

### Detecting and Filtering Outliers

### Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the `numpy.random.permutation` function. Calling `permutation` with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [79]:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))

In [80]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [81]:
sampler = np.random.permutation(5)

In [82]:
sampler

array([1, 4, 3, 0, 2])

In [83]:
df.take(sampler)

Unnamed: 0,0,1,2,3
1,4,5,6,7
4,16,17,18,19
3,12,13,14,15
0,0,1,2,3
2,8,9,10,11


To select a random subset without replacement, you can use the `sample` method on Series and DataFrame:

In [84]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
4,16,17,18,19
3,12,13,14,15
2,8,9,10,11
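
A few more `sample` options, sketched under the assumption that reproducibility matters: `random_state` fixes the draw, `frac=1` returns all rows in shuffled order, and `replace=True` samples with replacement. The Series `choices` and the seed value 0 are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# random_state makes the draw reproducible; the value 0 is arbitrary
subset = df.sample(n=3, random_state=0)

# frac=1 returns every row in shuffled order, i.e. a permutation of the rows
shuffled = df.sample(frac=1, random_state=0)

# replace=True allows repeated draws, so n may exceed the length of the data
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True, random_state=0)

print(subset.index.tolist(), shuffled.index.tolist(), draws.tolist())
```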


### Computing Indicator / Dummy Variables

In [85]:
df = pd.DataFrame({
    'key': ['b','b','a','c','a','b'],
    'data1': range(6)
})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [86]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [87]:
dummies = pd.get_dummies(df['key'],prefix = 'key')


In [88]:
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [89]:
df_with_dummy = df[['data1']].join(dummies)

In [90]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
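
The outputs above show 0/1 integers. To my knowledge, recent pandas releases (2.0 and later) return boolean columns from `get_dummies` by default, so a reader on a newer version may see `True`/`False` instead. A minimal sketch that requests integer indicators explicitly through the `dtype` argument:

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'b'],
    'data1': range(6)
})

# dtype=int asks for 0/1 integers regardless of the version's default
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Join the indicator columns back onto the remaining data
df_with_dummy = df[['data1']].join(dummies)

print(df_with_dummy)
```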


In [91]:
mnames = ['move_id','title','genres']

In [92]:
# A multi-character separator such as '::' requires the Python parser engine
movies = pd.read_table('../datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')


In [93]:
movies[:10]

Unnamed: 0,move_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [94]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))

In [97]:
print(len(all_genres))

6408


In [98]:
genners = pd.unique(all_genres)

In [99]:
genners

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [100]:
len(genners)

18

In [101]:
zero_matrix = np.zeros((len(movies),len(genners)))

In [102]:
zero_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [104]:
zero_matrix.shape

(3883, 18)

In [105]:
dummies = pd.DataFrame(zero_matrix,columns=genners)

In [106]:
dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [107]:
gen = movies.genres[0]
gen.split('|')

['Animation', "Children's", 'Comedy']

In [108]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2])

In [109]:
# For each movie, look up the column positions of its genres and set them to 1
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [110]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))


In [111]:
movies_windic.head()

Unnamed: 0,move_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
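
For delimiter-separated membership like this, the string accessor offers a shortcut, `Series.str.get_dummies`, which splits each value and builds the indicator columns in one step. This is an alternative to the loop above rather than the method used in these notes, sketched on three made-up genre strings so that it runs without the MovieLens file.

```python
import pandas as pd

# Hypothetical stand-in for movies.genres
genres = pd.Series([
    "Animation|Children's|Comedy",
    "Comedy|Romance",
    "Action",
])

# Split on '|' and create one 0/1 column per distinct genre
indicators = genres.str.get_dummies(sep="|")

print(indicators)
```

On the real data the result could be joined back with `movies.join(indicators.add_prefix('Genre_'))`, mirroring the step above.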


A useful recipe for statistical applications is to combine `get_dummies` with a discretization function like `cut`:

In [112]:
np.random.seed(12345)
values = np.random.rand(10)

In [113]:
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [114]:
bins = [0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


## 7.3 String Manipulation

### String Object Methods

In [115]:
val = 'a,b,    guido'
val.split(',')

['a', 'b', '    guido']

`split` is often combined with `strip` to trim whitespace (including line breaks):

In [116]:
pieces = [x.strip() for x in val.split(',')]

In [117]:
pieces

['a', 'b', 'guido']
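
A few other built-in string methods round out the picture. The sketch below joins the pieces back together with a delimiter and locates substrings; the `'::'` delimiter is an arbitrary choice.

```python
val = 'a,b,    guido'
pieces = [x.strip() for x in val.split(',')]

# Join the cleaned pieces with a two-colon delimiter
joined = '::'.join(pieces)

# Substring checks and positions
has_guido = 'guido' in val     # membership test
first_comma = val.index(',')   # position of the first comma (raises ValueError if absent)
missing = val.find(':')        # find returns -1 instead of raising
n_commas = val.count(',')      # number of occurrences

print(joined, has_guido, first_comma, missing, n_commas)
```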

### Regular Expressions

In [118]:
import re
text = "foo bar\t baz \tqux"
# Use a raw string so that \s is not treated as a string escape sequence
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']
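
When the same pattern is applied repeatedly, compiling it once with `re.compile` gives a reusable regex object. A minimal sketch on the same text, where `findall` returns the whitespace runs that `split` discards:

```python
import re

text = "foo bar\t baz \tqux"

# Compile the pattern once and reuse the resulting object
regex = re.compile(r'\s+')

parts = regex.split(text)   # the words between whitespace runs
gaps = regex.findall(text)  # the whitespace runs themselves

print(parts)
print(gaps)
```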