# § Chapter 7 Data Cleaning and Preparation

## §7.1 Handling Missing Data
-	Filtering Out Missing Data
-	Filling In Missing Data

## §7.2 Data Transformation
-	Removing Duplicates
-	Transforming Data Using a Function or Mapping
-	Replacing Values
-	Renaming Axis Indexes
-	Discretization and Binning
-	Detecting and Filtering Outliers
-	Permutation and Random Sampling
-	Computing Indicator/Dummy Variables → *One hot* 

## §7.3 String Manipulation
-	String Object Methods
-	Regular Expressions
-	Vectorized String Functions in pandas

In [1]:
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

## §7.1 Handling Missing Data
-	Filtering Out Missing Data
-	Filling In Missing Data

## Sentinel value: NaN (np.nan) for missing values

- The way pandas objects represent missing data is somewhat imperfect, but it is functional for a lot of users.
- For numeric data, pandas uses the floating-point value NaN (Not a Number) as a sentinel to represent missing data.

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data[2]

nan

In [4]:
type(string_data[2])

float

In [5]:
string_data[2] == np.nan

False

In [6]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [7]:
string_data[2] = None

In [8]:
string_data[2]

In [9]:
string_data[2] == np.nan

False

In [10]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [11]:
string_data[2] = 'NULL'

In [12]:
string_data[2]

'NULL'

In [13]:
string_data.isnull()

0    False
1    False
2    False
3    False
dtype: bool
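
The `== np.nan` comparisons above come back False even when the value really is missing, because NaN compares unequal to everything, itself included. A minimal sketch of the safer checks, assuming only NumPy and pandas (the variable names are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

s = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# NaN never compares equal, not even to itself, so == cannot detect it.
nan_equals_itself = (np.nan == np.nan)

# pd.isna recognizes both NaN and None as missing ...
nan_is_missing = pd.isna(s[2])
none_is_missing = pd.isna(None)

# ... but a placeholder string such as 'NULL' is ordinary data.
placeholder_is_missing = pd.isna('NULL')

print(nan_equals_itself, nan_is_missing, none_is_missing, placeholder_is_missing)
```

This is why the notebook's last cell reports no missing values after assigning `'NULL'`: a string placeholder has to be converted (for example with `replace`) before pandas treats it as missing.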

### §7.1.1 Filtering Out Missing Data

In pandas, we’ve adopted a convention used in the R programming language by referring
to missing data as NA, which stands for not available.

In statistics applications, NA data may either be data that does not exist or data that exists but was not observed
(through problems with data collection, for example).

In [14]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna() # drop na

0    1.0
2    3.5
4    7.0
dtype: float64

In [15]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [16]:
data.notnull()

0     True
1    False
2     True
3    False
4     True
dtype: bool

In [17]:
data[1]

nan

In [18]:
data[1] == None

False

In [19]:
data[1] == np.nan

False

In [20]:
NA

nan

In [21]:
data[1] == NA

False

In [22]:
NA?

In [23]:
# np.nan is a plain Python float, so IPython's ? introspection shows the float docstring:
# "Convert a string or number to a floating point number, if possible."
np.nan?

In [24]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [25]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [26]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [27]:
data.dropna(how='all') # drop only the rows in which every value is NaN

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [28]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [29]:
data.dropna(axis=1, how='all') # axis=1 drops columns; the default is axis=0 (rows)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [30]:
# A related way to filter out DataFrame rows tends to concern time series data. 
# Suppose you want to keep only rows containing a certain number of observations

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.204708,,
1,-0.55573,,
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [31]:
df.dropna()

Unnamed: 0,0,1,2
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [32]:
# thresh=2 keeps only the rows that have at least two non-NA values
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741
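
The `how` and `thresh` options are easy to mix up, so here is a small sketch on a hand-made frame (the data and variable names are made up for illustration). `thresh=k` is about how many non-NA values a row must have to be kept, not about how many rows are dropped:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                      'b': [2.0, 5.0, np.nan],
                      'c': [3.0, np.nan, np.nan]})

# Row 0 has 3 non-NA values, row 1 has 1, row 2 has 0.
complete_rows = frame.dropna()            # keeps rows without any NaN
not_all_nan = frame.dropna(how='all')     # drops rows that are entirely NaN
at_least_two = frame.dropna(thresh=2)     # keeps rows with >= 2 non-NA values
at_least_one = frame.dropna(thresh=1)     # keeps rows with >= 1 non-NA value

print(list(complete_rows.index), list(not_all_nan.index),
      list(at_least_two.index), list(at_least_one.index))
```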


### §7.1.2 Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways.

In [33]:
df

Unnamed: 0,0,1,2
0,-0.204708,,
1,-0.55573,,
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [34]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.204708,0.0,0.0
1,-0.55573,0.0,0.0
2,0.092908,0.0,0.769023
3,1.246435,0.0,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [35]:
df

Unnamed: 0,0,1,2
0,-0.204708,,
1,-0.55573,,
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [36]:
# the dict keys are column labels; each value is the fill value used for NaN in that column
df.fillna({1: 0.5, 2: 0}) # fill NaN in column 1 with 0.5 and in column 2 with 0

Unnamed: 0,0,1,2
0,-0.204708,0.5,0.0
1,-0.55573,0.5,0.0
2,0.092908,0.5,0.769023
3,1.246435,0.5,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [37]:
_ = df.fillna(0, inplace=True) # inplace=True modifies df itself; the call returns None
df

Unnamed: 0,0,1,2
0,-0.204708,0.0,0.0
1,-0.55573,0.0,0.0
2,0.092908,0.0,0.769023
3,1.246435,0.0,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [38]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,0.476985,3.248944,-1.021228
1,-0.577087,0.124121,0.302614
2,0.523772,,1.34381
3,-0.713544,,-2.370232
4,-1.860761,,
5,-1.265934,,


In [39]:
df.fillna(method='ffill') # forward fill; recent pandas versions deprecate method= in favor of df.ffill()

Unnamed: 0,0,1,2
0,0.476985,3.248944,-1.021228
1,-0.577087,0.124121,0.302614
2,0.523772,0.124121,1.34381
3,-0.713544,0.124121,-2.370232
4,-1.860761,0.124121,-2.370232
5,-1.265934,0.124121,-2.370232


In [40]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,0.476985,3.248944,-1.021228
1,-0.577087,0.124121,0.302614
2,0.523772,0.124121,1.34381
3,-0.713544,0.124121,-2.370232
4,-1.860761,,-2.370232
5,-1.265934,,-2.370232


In [41]:
# With fillna you can also pass a computed value, such as the mean or median of a Series:

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean()) # (1 + 3.5 + 7) / 3 = 3.8333333333333335

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
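
The filling strategies above can also be written with the dedicated `ffill` and `bfill` methods, and `fillna` accepts a Series of per-column values such as the column means. A minimal sketch on made-up data (the names `s` and `frame` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()           # carry the last valid value forward
limited = s.ffill(limit=1)    # fill at most one consecutive NaN
backward = s.bfill()          # pull the next valid value backward

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, 4.0, 8.0]})
# frame.mean() is a Series indexed by column label, so every column
# is filled with its own mean.
filled = frame.fillna(frame.mean())

print(forward.tolist(), limited.tolist(), backward.tolist())
print(filled)
```

Note that the trailing NaN survives `bfill` because there is no later value to pull from.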

## §7.2 Data Transformation
-	Removing Duplicates
-	Transforming Data Using a Function or Mapping
-	Replacing Values
-	Renaming Axis Indexes
-	Discretization and Binning
-	Detecting and Filtering Outliers
-	Permutation and Random Sampling
-	Computing Indicator/Dummy Variables

### §7.2.1 Removing Duplicates

In [42]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [43]:
# duplicated returns a boolean Series indicating whether each row 
# is a duplicate (has been observed in a previous row) or not:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [44]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [45]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [46]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [47]:
# only duplicates in 'k1' are considered; the contents of the other columns are ignored
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [48]:
data.drop_duplicates(['k1', 'k2'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5


In [49]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
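
Two further options are sometimes handy here: `keep=False` flags every member of a duplicate group, and the column list can be passed by name as `subset`. A short sketch that rebuilds the same `k1`/`k2` frame (variable names are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks all rows of a duplicate group, not only the repeats.
dup_mask = data.duplicated(keep=False)

# Keep the last row seen for each distinct value of k1.
last_per_k1 = data.drop_duplicates(subset=['k1'], keep='last')

print(dup_mask.tolist())
print(last_per_k1)
```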


### §7.2.2 Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values
in an array, Series, or column in a DataFrame.

In [57]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [58]:
# Suppose you wanted to add a column indicating the type of animal that each food came from.

meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

In [59]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [60]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


### The *map method* on a Series accepts a function or dict-like object containing a mapping

Using map is a convenient way to perform element-wise transformations and other
data cleaning–related operations.

In [61]:
# each value is looked up in the dict-like object and the result is stored in the new column
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [66]:
# Equivalently, pass a function instead of a dict:
# lowercased.map(lambda x: meat_to_animal[x.lower()])
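
The dict and function forms of `map` differ when a value has no entry in the mapping: the dict form yields NaN, while a lambda that indexes the dict directly would raise `KeyError`. A minimal sketch, assuming a made-up extra food (`'tofu'`) and an illustrative default label:

```python
import pandas as pd

foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

lowercased = foods.str.lower()

# Dict form: values without a key become NaN.
mapped = lowercased.map(meat_to_animal)

# Function form with dict.get: supply an explicit default instead.
mapped_default = lowercased.map(lambda x: meat_to_animal.get(x, 'unknown'))

print(mapped.tolist())
print(mapped_default.tolist())
```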

### §7.2.3 Replacing Values

In [67]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [68]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [69]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [70]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [71]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### §7.2.4 Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping
of some form to produce new, differently labeled objects.

In [72]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [73]:
type(data.index)

pandas.core.indexes.base.Index

In [74]:
transform = lambda x: x[:4].upper()
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [75]:
data.index

Index(['Ohio', 'Colorado', 'New York'], dtype='object')

In [76]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [77]:
# You can assign to index, modifying the DataFrame in-place:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [78]:
data.columns

Index(['one', 'two', 'three', 'four'], dtype='object')

In [79]:
type(data.columns)

pandas.core.indexes.base.Index

In [80]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [81]:
data.columns = data.columns.map(transform)

In [82]:
data

Unnamed: 0,ONE,TWO,THRE,FOUR
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [83]:
data.rename(index=str.title, columns=str.title)

Unnamed: 0,One,Two,Thre,Four
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [84]:
data

Unnamed: 0,ONE,TWO,THRE,FOUR
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [86]:
# rename can be used in conjunction with a dict-like object providing new values
# for a subset of the axis labels

# itemized replacement
data.rename(index={'OHIO': 'INDIANA'},
            columns={'THRE': 'peekaboo'})

Unnamed: 0,ONE,TWO,peekaboo,FOUR
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [87]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True) # inplace=True => also change the data itself
data

Unnamed: 0,ONE,TWO,THRE,FOUR
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [88]:
data.keys()

Index(['ONE', 'TWO', 'THRE', 'FOUR'], dtype='object')

In [89]:
data.keys()[1]

'TWO'

In [90]:
type(data.keys())

pandas.core.indexes.base.Index

In [91]:
# 'three' is not a column label here (the column is 'THRE'), so that key is silently ignored
data.rename(columns={'three': 'peekaboo', data.keys()[1]: 'bless'})

Unnamed: 0,ONE,bless,THRE,FOUR
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [92]:
data

Unnamed: 0,ONE,TWO,THRE,FOUR
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [93]:
data.rename(columns={'three': 'peekaboo', data.keys()[1]: 'bless'}, inplace=True)

In [94]:
data

Unnamed: 0,ONE,bless,THRE,FOUR
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
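
In the cells above, the `'three'` key never matches a column (the label is `'THRE'`), and `rename` quietly skips it. A small sketch of how to surface such mistakes with the `errors='raise'` option, on an illustrative frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['OHIO', 'COLO'],
                     columns=['ONE', 'TWO', 'THRE'])

# Default behavior: labels missing from the axis are ignored.
unchanged = frame.rename(columns={'three': 'peekaboo'})

# errors='raise' turns a missing label into a KeyError instead.
try:
    frame.rename(columns={'three': 'peekaboo'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(unchanged.columns.tolist(), raised)
```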


### §7.2.5 Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis.

In [95]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [96]:
# The object pandas returns is a special Categorical object.
# The output you see describes the bins computed by pandas.cut
# and indicates which interval each item belongs to.

bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [97]:
type(cats)

pandas.core.arrays.categorical.Categorical

In [98]:
# integer code of the interval each age falls into
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [99]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [100]:
pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

In [101]:
# choose which side of each bin is closed
cats2 = pd.cut(ages, [18, 26, 36, 61, 100], right=False) # right=False closes the left edge and opens the right edge

In [102]:
cats2.categories

IntervalIndex([[18, 26), [26, 36), [36, 61), [61, 100)], dtype='interval[int64, left]')

In [103]:
cats2.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [105]:
# You can also pass your own bin names by passing a list or array to the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [106]:
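# If you don't know where to put the bin edges, split evenly by value range (cut) or by count (qcut).

# If you pass an integer number of bins to cut instead of explicit bin edges, it will compute 
# equal-length bins based on the minimum and maximum values in the data.
# Here the value range is split into 4 equal-width bins.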
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]
Length: 20
Categories (4, interval[float64, right]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]

In [107]:
data

array([0.4896, 0.3773, 0.8486, 0.9111, 0.3838, 0.3155, 0.5684, 0.1878,
       0.1258, 0.6876, 0.7996, 0.5735, 0.9732, 0.6341, 0.8884, 0.4954,
       0.3516, 0.7142, 0.5039, 0.2256])

In [108]:
pd.value_counts(pd.cut(data, 4, precision=2))

(0.34, 0.55]    6
(0.55, 0.76]    5
(0.76, 0.97]    5
(0.12, 0.34]    4
dtype: int64

In [109]:
# precision=2 limits the bin edges to two decimal digits
pd.cut(data, 4, precision=2)[-2]

Interval(0.34, 0.55, closed='right')

In [110]:
# A closely related function, qcut, bins the data based on sample quantiles,
# so each of the 4 bins ends up holding the same number of elements.
data = np.random.randn(1000)  # Normally distributed
cats = pd.qcut(data, 4)  # Cut into quantiles
cats

[(-0.0265, 0.62], (0.62, 3.928], (-0.68, -0.0265], (0.62, 3.928], (-0.0265, 0.62], ..., (-0.68, -0.0265], (-0.68, -0.0265], (-2.9499999999999997, -0.68], (0.62, 3.928], (-0.68, -0.0265]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.9499999999999997, -0.68] < (-0.68, -0.0265] < (-0.0265, 0.62] < (0.62, 3.928]]

In [111]:
pd.value_counts(cats)

(-2.9499999999999997, -0.68]    250
(-0.68, -0.0265]                250
(-0.0265, 0.62]                 250
(0.62, 3.928]                   250
dtype: int64

In [112]:
# Similar to cut you can pass your own quantiles (numbers between 0 and 1, inclusive):
# here the edges are the 0th, 10th, 50th, 90th, and 100th percentiles
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

In [113]:
pd.value_counts(cats)

(-1.187, -0.0265]                400
(-0.0265, 1.286]                 400
(-2.9499999999999997, -1.187]    100
(1.286, 3.928]                   100
dtype: int64

### §7.2.6 Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations.

In [114]:
data = pd.DataFrame(np.random.randn(1000, 4))
data

Unnamed: 0,0,1,2,3
0,-0.799318,0.777233,-0.612905,0.316447
1,0.838295,-1.034423,0.434304,-2.213133
2,0.758040,0.553933,0.339231,-0.688756
3,-0.815526,-0.332420,2.406483,-1.361428
4,-0.669619,0.781199,-0.395813,-0.180737
...,...,...,...,...
995,-0.856979,-0.446678,1.229042,-1.558031
996,-0.289339,-0.232531,0.409304,-0.813190
997,0.023646,0.232781,-0.345727,1.519174
998,1.060646,-1.456358,1.128420,0.032166


In [115]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.049091,0.026112,-0.002544,-0.051827
std,0.996947,1.007458,0.995232,0.998311
min,-3.64586,-3.184377,-3.745356,-3.428254
25%,-0.599807,-0.612162,-0.687373,-0.747478
50%,0.047101,-0.013609,-0.022158,-0.088274
75%,0.756646,0.695298,0.699046,0.623331
max,2.653656,3.525865,2.735527,3.366626


In [116]:
# values in one column whose absolute value exceeds 3
col = data[2]
col[np.abs(col) > 3]

41    -3.399312
136   -3.745356
Name: 2, dtype: float64

In [117]:
# rows that contain at least one value whose absolute value exceeds 3
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
41,0.457246,-0.025907,-3.399312,-0.974657
60,1.951312,3.260383,0.963301,1.201206
136,0.508391,-0.196713,-3.745356,-1.520113
235,-0.242459,-3.05699,1.918403,-0.578828
258,0.682841,0.326045,0.425384,-3.428254
322,1.179227,-3.184377,1.369891,-1.074833
544,-3.548824,1.553205,-2.186301,1.277104
635,-0.578093,0.193299,1.397822,3.366626
782,-0.207434,3.525865,0.28307,0.544635
803,-3.64586,0.255475,-0.549574,-1.907459


In [118]:
# np.sign(data) is 1 or -1 depending on whether each value is positive or negative
data[np.abs(data) > 3] = np.sign(data) * 3 # cap values to the interval [-3, 3]
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.050286,0.025567,-0.001399,-0.051765
std,0.99292,1.004214,0.991414,0.995761
min,-3.0,-3.0,-3.0,-3.0
25%,-0.599807,-0.612162,-0.687373,-0.747478
50%,0.047101,-0.013609,-0.022158,-0.088274
75%,0.756646,0.695298,0.699046,0.623331
max,2.653656,3.0,2.735527,3.0


In [119]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,-1.0,1.0,-1.0,1.0
1,1.0,-1.0,1.0,-1.0
2,1.0,1.0,1.0,-1.0
3,-1.0,-1.0,1.0,-1.0
4,-1.0,1.0,-1.0,-1.0
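
Capping with `np.sign(data) * 3` can also be expressed with the `clip` method. A minimal sketch comparing the two on freshly generated data (the seed and variable names are illustrative, so the numbers differ from the notebook's output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach used above: overwrite the outliers with +3 or -3.
capped_sign = data.copy()
capped_sign[capped_sign.abs() > 3] = np.sign(capped_sign) * 3

# Alternative: clip limits every value to the interval [-3, 3].
capped_clip = data.clip(lower=-3, upper=3)

same_result = np.allclose(capped_sign, capped_clip)
largest = capped_clip.abs().max().max()
print(same_result, largest)
```

`clip` leaves the original frame untouched, which avoids the in-place assignment when the uncapped data is still needed.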


### §7.2.7 Permutation and Random Sampling

- Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the *numpy.random.permutation* function.
- To select a random subset without replacement, you can use the *sample* method on Series and DataFrame:

In [120]:
df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [121]:
# random permutation of the integers 0 through 4
sampler = np.random.permutation(5)
sampler

array([3, 1, 4, 2, 0])

In [122]:
# reorder the rows according to sampler
df
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
4,16,17,18,19
2,8,9,10,11
0,0,1,2,3


In [126]:
df.sample(n=3) # randomly pick 3 rows

Unnamed: 0,0,1,2,3
2,8,9,10,11
0,0,1,2,3
4,16,17,18,19


In [129]:
choices = pd.Series([5, 7, -1, 6, 4])
choices

0    5
1    7
2   -1
3    6
4    4
dtype: int64

In [130]:
# To generate a sample with replacement (to allow repeat choices), pass replace=True
# to sample:
draws = choices.sample(n=10, replace=True)
draws

4    4
4    4
3    6
4    4
1    7
0    5
3    6
4    4
3    6
1    7
dtype: int64

In [131]:
# sampling without replacement
draws = choices.sample(n=5, replace=False)
draws

0    5
3    6
2   -1
1    7
4    4
dtype: int64

In [133]:
# expected error: with replace=False, n cannot exceed the length of the Series
# draws = choices.sample(n=10, replace=False)
# draws

In [134]:
# with replace=True, n may exceed the length of the Series
draws = choices.sample(n=10, replace=True)
draws

4    4
1    7
4    4
0    5
3    6
0    5
1    7
0    5
4    4
3    6
dtype: int64

In [135]:
draws.value_counts()

4    3
5    3
7    2
6    2
dtype: int64
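
Because `sample` draws at random, the outputs above change from run to run. A minimal sketch of making a shuffle reproducible with `random_state`, and of shuffling all rows with `frac=1` (the seed value 0 is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# frac=1 samples 100% of the rows without replacement, i.e. a full shuffle.
shuffled = df.sample(frac=1, random_state=0)
shuffled_again = df.sample(frac=1, random_state=0)

# The same seed gives the same order; every row still appears exactly once.
reproducible = shuffled.equals(shuffled_again)
print(reproducible, sorted(shuffled.index))
```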

### §7.2.8 Computing Indicator/Dummy Variables

- Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix.
- If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s.
- A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut.

In [136]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [137]:
# one-hot encode the values of 'key'
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [138]:
# You may want to add a prefix to the columns in the indicator DataFrame,
# which can then be merged with the other data.
dummies = pd.get_dummies(df['key'], prefix='key')
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [139]:
# join
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
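
Depending on the pandas version, `get_dummies` may return boolean columns rather than the 0/1 integers shown above. A minimal sketch that requests integers explicitly through the `dtype` argument, reusing the same small frame:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int asks for 0/1 integer columns regardless of the version default.
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Each row has exactly one indicator set, which is what "one hot" means.
one_hot = bool((dummies.sum(axis=1) == 1).all())
print(dummies.columns.tolist(), dummies.sum().tolist(), one_hot)
```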


## If a row in a DataFrame belongs to multiple categories, things are a bit more complicated.

For much larger data, the method of constructing indicator variables with multiple membership
shown in this section is not especially speedy. It would be better to write a lower-level function
that writes directly to a NumPy array and then wrap the result in a DataFrame.
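
One possible shape for such a lower-level function is sketched below. The name `indicator_frame` and its `sep` parameter are hypothetical and not part of pandas or of the original notebook; the sketch fills a NumPy array directly and only builds the DataFrame at the end:

```python
import numpy as np
import pandas as pd

def indicator_frame(values, sep='|'):
    """Build a 0/1 indicator DataFrame from delimiter-separated labels."""
    split = [v.split(sep) for v in values]
    categories = sorted({c for parts in split for c in parts})
    position = {c: j for j, c in enumerate(categories)}

    # Write straight into a preallocated array instead of a DataFrame.
    out = np.zeros((len(split), len(categories)), dtype=np.int8)
    for i, parts in enumerate(split):
        for c in parts:
            out[i, position[c]] = 1
    return pd.DataFrame(out, columns=categories)

frame = indicator_frame(["Animation|Children's|Comedy",
                         "Comedy|Romance",
                         "Drama"])
print(frame)
```

The dictionary lookup replaces the repeated `get_indexer` calls, and assigning into the array avoids per-row DataFrame indexing; whether that is fast enough depends on the data size.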

### What is a DAT file?
A DAT file is a generic data file created by a specific application. It may contain data in binary or text format. DAT files are typically accessed only by the application that created them.

In [140]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames,
                       engine='python')  # a multi-character sep needs the Python parser engine
movies[:10]


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [141]:
movies["genres"]

0        Animation|Children's|Comedy
1       Adventure|Children's|Fantasy
2                     Comedy|Romance
3                       Comedy|Drama
4                             Comedy
                    ...             
3878                          Comedy
3879                           Drama
3880                           Drama
3881                           Drama
3882                  Drama|Thriller
Name: genres, Length: 3883, dtype: object

## Adding indicator variables for each genre requires a little bit of wrangling.

In [142]:
movies.shape

(3883, 3)

In [143]:
movies.genres[0]

"Animation|Children's|Comedy"

In [144]:
type(movies.genres[0])

str

In [145]:
movies.genres[0].split("|")

['Animation', "Children's", 'Comedy']

In [146]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)

In [147]:
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [148]:
len(genres)

18

In [149]:
# One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [150]:
gen = movies.genres[0]
gen

"Animation|Children's|Comedy"

In [151]:
dummies.columns

Index(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi',
       'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir', 'Western'],
      dtype='object')

In [152]:
gen = movies.genres[0]
gen.split('|')

['Animation', "Children's", 'Comedy']

In [153]:
dummies.columns.get_indexer?

In [154]:
# Compute indexer and mask for new index given the current index. 
# The indexer should be then used as an input to ndarray.take to align the
# current data to the new index.

dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)

In [155]:
movies.genres

0        Animation|Children's|Comedy
1       Adventure|Children's|Fantasy
2                     Comedy|Romance
3                       Comedy|Drama
4                             Comedy
                    ...             
3878                          Comedy
3879                           Drama
3880                           Drama
3881                           Drama
3882                  Drama|Thriller
Name: genres, Length: 3883, dtype: object

In [156]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [157]:
dummies.head()

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [158]:
movies_windic = movies.join(dummies.add_prefix('Genre_')) # join the prefixed indicator columns onto movies
movies_windic.iloc[0]

movie_id                                      1
title                          Toy Story (1995)
genres              Animation|Children's|Comedy
Genre_Animation                             1.0
Genre_Children's                            1.0
                               ...             
Genre_War                                   0.0
Genre_Musical                               0.0
Genre_Mystery                               0.0
Genre_Film-Noir                             0.0
Genre_Western                               0.0
Name: 0, Length: 21, dtype: object
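
For delimiter-separated labels like these genres, pandas also offers the vectorized `Series.str.get_dummies`, which does the splitting and indicator construction in one call. A minimal sketch on three genre strings copied from the output above (a stand-in for the full `movies.genres` column):

```python
import pandas as pd

genres = pd.Series(["Animation|Children's|Comedy",
                    "Adventure|Children's|Fantasy",
                    "Comedy|Romance"])

# Split each string on '|' and build one indicator column per distinct label.
indicators = genres.str.get_dummies('|')

labels_per_row = indicators.sum(axis=1).tolist()
print(indicators)
print(labels_per_row)
```

The result can be prefixed and joined exactly like the hand-built `dummies` frame.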

### A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut

In [160]:
np.random.seed(12345)
values = np.random.rand(10)
values

array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645, 0.6532,
       0.7489, 0.6536])

In [161]:
pd.get_dummies?

In [162]:
# Convert categorical variable into dummy/indicator variables.

bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


## §7.3 String Manipulation
-	String Object Methods
-	Regular Expressions
-	Vectorized String Functions in pandas

### §7.3.1 String Object Methods

In [163]:
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

In [164]:
# split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [165]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

In [166]:
# A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
    
'::'.join(pieces)

'a::b::guido'

In [167]:
'guido' in val

True

In [168]:
val

'a,b,  guido'

In [169]:
# index of the first occurrence of the substring
val.index(',')

1

In [170]:
val.find(':')

-1

In [172]:
# raises ValueError: substring not found
# val.index(':')

In [173]:
val.count(',')

2

In [174]:
val.replace(',', '::')

'a::b::  guido'

In [175]:
val.replace(',', '')

'ab  guido'

### §7.3.2 Regular Expressions

- Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. 
- A single expression, commonly called a regex, is a string formed according to the regular expression language.

### Regular expressions online

https://regex101.com/

In [176]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/sa-TUpSx1JA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text
https://www.youtube.com/watch?v=sa-TUpSx1JA

## The *re* module functions fall into three categories: 
- pattern matching, 
- substitution, and 
- splitting.

In [180]:
# split
import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text) # \s+ matches one or more whitespace characters

['foo', 'bar', 'baz', 'qux']

In [181]:
print(text)

foo    bar	 baz  	qux


In [182]:
# compile the regular expression once so it can be reused
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [183]:
# Which whitespace runs did the split actually remove?
regex.findall(text)

['    ', '\t ', '  \t']

In [184]:
# pattern matching
matches = regex.finditer(text)
for match in matches:
    print(match)

<re.Match object; span=(3, 7), match='    '>
<re.Match object; span=(10, 12), match='\t '>
<re.Match object; span=(15, 18), match='  \t'>


In [185]:
print(text[15:18])

  	


To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.
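
A quick sketch of what the raw-string prefix changes, using only the standard library (the variable names are illustrative):

```python
import re

# The raw literal and the doubled-backslash literal are the same string.
raw = r'C:\x'
escaped = 'C:\\x'
same_text = (raw == escaped)

# In a raw string, backslash-n stays two characters instead of a newline.
raw_length = len(r'\n')
newline_length = len('\n')

# Regex patterns are therefore usually written as raw strings.
pieces = re.split(r'\s+', "foo    bar\t baz  \tqux")

print(same_text, raw_length, newline_length, pieces)
```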

In [186]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}' # email regular expression

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [187]:
# Using findall on the text produces a list of the email addresses:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

match and search are closely related to findall. 
- While findall returns all matches in a string, search returns only the first match. 
- More rigidly, match only matches at the beginning of the string.
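The difference between the three methods can be seen on a short string. This is a minimal sketch using a made-up sample string and a simple digit pattern, not the email data from this chapter.

```python
import re

# Hypothetical sample text, chosen only to illustrate the three methods.
sample = "id: 42, backup id: 7"
digits = re.compile(r'\d+')

all_hits = digits.findall(sample)   # every non-overlapping match
first_hit = digits.search(sample)   # first match anywhere, or None
at_start = digits.match(sample)     # succeeds only at position 0
at_start_ok = digits.match("42 is the id")

print(all_hits)
print(first_hit.group(), first_hit.span())
print(at_start)
print(at_start_ok.group())
```

Because `sample` starts with letters, `match` returns `None` there, while `search` still finds the first number.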

In [188]:
# search returns a special match object for the first email address in the text
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [189]:
text[m.start():m.end()]

'dave@google.com'

In [190]:
text

'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'

In [191]:
# regex.match succeeds only if the pattern matches at the start of the string.
# Here text starts with 'Dave ', which is not an email address, so the result is None:
print(regex.match(text))

None


In [192]:
# search finds the first successful match anywhere in the string
print(regex.search(text))

<re.Match object; span=(5, 20), match='dave@google.com'>


In [193]:
# substitution: replace every match with the string 'REDACTED'
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



In [194]:
# Rewrite the pattern with capture groups.
# Ungrouped version used so far:
# pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# Suppose you wanted to find email addresses and simultaneously segment each
# address into its three components: 
# username, 
# domain name, and 
# domain suffix. 
# To do this, put parentheses around the parts of the pattern to segment:
    
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [195]:
text

'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'

### Note: the following two methods behave differently, so use them with care
- regex.finditer
- regex.findall

In [196]:
matches = regex.finditer(text)

In [197]:
type(matches)

callable_iterator

In [198]:
for match in matches:
    print(match.group(0))
    print(match.group(1))

dave@google.com
dave
steve@gmail.com
steve
rob@gmail.com
rob
ryan@yahoo.com
ryan


In [199]:
# finditer yields each match as an re.Match object.
# group() or group(0) returns the whole match; group(1), group(2), ... return the captured groups.
type(match)

re.Match

In [201]:
matches = regex.finditer(text)

for match in matches:
    print("The whole thing is: " + match.group())
    print("name is: " + match.group(1))
    print("company is: " + match.group(2))
    print("type of the association: " +match.group(3))
#     break

The whole thing is: dave@google.com
name is: dave
company is: google
type of the association: com
The whole thing is: steve@gmail.com
name is: steve
company is: gmail
type of the association: com
The whole thing is: rob@gmail.com
name is: rob
company is: gmail
type of the association: com
The whole thing is: ryan@yahoo.com
name is: ryan
company is: yahoo
type of the association: com


In [202]:
matches = regex.finditer(text)

for match in matches:
    print(match.group(2))

google
gmail
gmail
yahoo


In [203]:
# Because the pattern has several capture groups, findall returns each match as a tuple of the groups
matches = regex.findall(text)
matches

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [204]:
type(matches[0])

tuple

In [205]:
for match in matches:
    print(match)
    print(match[0])
    print(match[1])
    print(match[2])
    break

('dave', 'google', 'com')
dave
google
com


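### match and findall return different types:
- regex.match returns a single re.Match object (or None if the pattern does not match at the start of the string)
- regex.findall returns a list of tuples, one per match, when the pattern contains several groups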

In [206]:
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [207]:
type(m)

re.Match

In [208]:
m.group(0)

'wesm@bright.net'

In [209]:
m.group(1)

'wesm'

In [210]:
m.group(2)

'bright'

In [211]:
m.group(3)

'net'

In [212]:
text

'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'

In [213]:
matches = regex.findall(text)

In [214]:
type(matches)

list

In [215]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [216]:
type(matches[0])

tuple

In [217]:
text

'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'

In [218]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [219]:
print(regex.findall(text))

[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]


In [220]:
# Replace each occurrence of the pattern in the text with a new string; \1, \2, ... refer to the parts captured by (...) in the regex
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
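The numbered groups `\1`, `\2`, `\3` used in the substitution above can also be given names, which may be easier to read in longer patterns. The sketch below is an optional variation, not part of the original notebook: the group names `username`, `domain`, and `suffix` and the variable names are chosen here for illustration, and the text is shortened to two lines.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""

# Same email pattern, with (?P<name>...) naming each capture group.
named_pattern = (r'(?P<username>[A-Z0-9._%+-]+)@'
                 r'(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})')
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)

# groupdict() maps each group name to the text it captured.
m = named_regex.search(text)
parts = m.groupdict()

# \g<name> refers to a named group inside the replacement string.
rewritten = named_regex.sub(r'\g<username> at \g<domain>', text)

print(parts)
print(rewritten)
```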



### §7.3.3 Vectorized String Functions in pandas

### When some of the elements to process are NaN, map() runs into trouble; use the data.str methods instead

Cleaning up a messy dataset for analysis often requires a lot of string munging and
regularization. To complicate matters, *a column containing strings will sometimes
have missing data:*

- String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values
- To cope with this, Series has array-oriented methods for string operations that skip NA values
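The two points above can be sketched as follows. The Series `emails` is a shortened, hypothetical stand-in for the data used in this section, and `dtype=object` is set explicitly so that the behaviour does not depend on the default string dtype of the installed pandas version.

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com',
                    'Steve': 'steve@gmail.com',
                    'Wes': np.nan}, dtype=object)

# Calling a string method on each value fails once map reaches the float NaN.
try:
    emails.map(lambda s: s.upper())
    map_error = None
except AttributeError as exc:
    map_error = exc

# The .str accessor applies the same method but leaves NaN in place.
upper = emails.str.upper()

# Some .str methods accept na= to choose the result for missing values.
has_gmail = emails.str.contains('gmail', na=False)

print(type(map_error).__name__)
print(upper)
print(has_gmail)
```

Passing `na=False` to `contains` gives a plain boolean result that can be used directly as a mask.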

In [221]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [222]:
'gmail' in data.iloc[0]  # positional access; integer keys on a labeled Series are deprecated

False

In [223]:
'gmail' in data.iloc[1]  # positional access; integer keys on a labeled Series are deprecated

True

In [225]:
# expected error: the fourth value is NaN (a float), so the `in` test raises TypeError
# 'gmail' in data.iloc[3]

In [226]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

In [227]:
data.str

<pandas.core.strings.accessor.StringMethods at 0x21fb3ee4b70>

In [228]:
# Docstring:  
# Vectorized string functions for Series and Index. NAs stay NA unless
# handled otherwise by a particular method. 

data.str?

In [229]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [230]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [231]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [232]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [233]:
type(matches)

pandas.core.series.Series

In [None]:
# expected error (AttributeError): matches holds booleans, not strings,
# so the .str accessor is not available
# matches.str

In [None]:
# expected error (AttributeError): .str is not available on the boolean matches Series
# matches.str[0]

In [None]:
# expected error (AttributeError): .str is not available on the boolean matches Series
# matches.str.get(1)
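Since `data.str.match` returns booleans here, the captured groups have to be obtained another way. One possible approach, sketched below under the assumption of a recent pandas version, is to index into the result of `str.findall`, or to use `str.extract`, which returns one column per capture group. The names `first_match`, `domains`, and `parts` are illustrative.

```python
import re
import numpy as np
import pandas as pd

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan}, dtype=object)

# findall gives a list of tuples per row; .str[0] takes the first tuple,
# and .str.get(1) then takes the domain from that tuple.
first_match = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
domains = first_match.str.get(1)

# extract returns a DataFrame with one column per capture group.
parts = data.str.extract(pattern, flags=re.IGNORECASE)

print(domains)
print(parts)
```

In both results the missing value for `Wes` stays missing, consistent with the other `.str` methods.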

In [None]:
data.str

In [234]:
data.str.get(0)

Dave       d
Steve      s
Rob        r
Wes      NaN
dtype: object

In [235]:
data.str.get(1)

Dave       a
Steve      t
Rob        o
Wes      NaN
dtype: object

In [236]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

In [None]:
pd.options.display.max_rows = PREVIOUS_MAX_ROWS

## Conclusion