**Handling Missing Data**

In [2]:
import numpy as np
import pandas as pd

All of the descriptive statistics on pandas objects exclude missing data by default.

In [3]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [6]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [7]:
string_data[0] = None


In [8]:
string_data.isnull()


0     True
1    False
2     True
3    False
dtype: bool
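
The cells above show that both NaN and None count as missing. A minimal sketch of what "excluded by default" means for descriptive statistics follows; the Series `s` and its values are made up for illustration, and `skipna=False` is shown only to contrast with the default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two real values, one NaN, one None
s = pd.Series([1.0, np.nan, 3.0, None])

total = s.sum()                # missing values are skipped
mean = s.mean()                # mean of the two non-null values
count = s.count()              # number of non-null values
strict = s.sum(skipna=False)   # opting out of the default propagates NaN
```

Here `count` reports only the non-null entries, which is why the mean is computed over two values rather than four.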

**Filtering Out Missing Data**

There are a few ways to filter out missing data. While you always have the option to
do it by hand using pandas.isnull and boolean indexing, the dropna method can be helpful.
On a Series, it returns the Series with only the non-null data and index values:

In [9]:
from numpy import nan as NA

In [10]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [13]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [14]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [15]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA or only those containing any NAs. dropna by default drops
any row containing a missing value:

In [16]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

In [17]:
cleaned = data.dropna()

In [18]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [19]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Passing how='all' will only drop rows that are all NA:

In [20]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass axis=1:

In [21]:
data[4] = NA

In [22]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [23]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose
you want to keep only rows containing at least a certain number of non-NA observations.
You can indicate this with the thresh argument:

In [24]:
df = pd.DataFrame(np.random.randn(7, 3))

In [25]:
df

Unnamed: 0,0,1,2
0,0.02449,0.638746,-2.005438
1,-1.163218,-0.131288,0.074368
2,1.615597,0.795579,-0.59378
3,0.706481,1.447198,0.098022
4,-0.01186,0.980902,1.483357
5,1.708013,1.367259,0.984747
6,-0.164845,-0.806634,-0.543366


In [27]:
df.iloc[:4, 1] = NA

In [28]:
df.iloc[:2, 2] = NA

In [29]:
df

Unnamed: 0,0,1,2
0,0.02449,,
1,-1.163218,,
2,1.615597,,-0.59378
3,0.706481,,0.098022
4,-0.01186,0.980902,1.483357
5,1.708013,1.367259,0.984747
6,-0.164845,-0.806634,-0.543366


In [30]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.01186,0.980902,1.483357
5,1.708013,1.367259,0.984747
6,-0.164845,-0.806634,-0.543366


In [31]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,1.615597,,-0.59378
3,0.706481,,0.098022
4,-0.01186,0.980902,1.483357
5,1.708013,1.367259,0.984747
6,-0.164845,-0.806634,-0.543366
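
One way to read thresh is as a minimum count of non-null values per row. The sketch below, on a small made-up frame, compares dropna(thresh=2) with the same filter written by hand; the by-hand version is only an illustration of the rule, not how pandas implements it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows have 1, 2, and 3 non-null values respectively
df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [np.nan, np.nan, 4.0],
                   2: [np.nan, 5.0, 6.0]})

by_thresh = df.dropna(thresh=2)

# Same rule spelled out: keep rows with at least 2 non-null entries
by_hand = df[df.notna().sum(axis=1) >= 2]
```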


**Filling In Missing Data**

In [32]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.02449,0.0,0.0
1,-1.163218,0.0,0.0
2,1.615597,0.0,-0.59378
3,0.706481,0.0,0.098022
4,-0.01186,0.980902,1.483357
5,1.708013,1.367259,0.984747
6,-0.164845,-0.806634,-0.543366


In [34]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,0.02449,0.5,0.0
1,-1.163218,0.5,0.0
2,1.615597,0.5,-0.59378
3,0.706481,0.5,0.098022
4,-0.01186,0.980902,1.483357
5,1.708013,1.367259,0.984747
6,-0.164845,-0.806634,-0.543366


In [35]:
_ = df.fillna(0, inplace=True)

In [36]:
df

Unnamed: 0,0,1,2
0,0.02449,0.0,0.0
1,-1.163218,0.0,0.0
2,1.615597,0.0,-0.59378
3,0.706481,0.0,0.098022
4,-0.01186,0.980902,1.483357
5,1.708013,1.367259,0.984747
6,-0.164845,-0.806634,-0.543366


In [37]:
df = pd.DataFrame(np.random.randn(6, 3))

In [38]:
df.iloc[2:, 1] = NA

In [39]:
df.iloc[4:, 2] = NA

In [40]:
df

Unnamed: 0,0,1,2
0,-0.47175,1.717186,1.006653
1,-0.208798,-2.099232,0.15071
2,1.234129,,0.616605
3,-0.059285,,0.299292
4,-0.971254,,
5,-0.953221,,


In [41]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-0.47175,1.717186,1.006653
1,-0.208798,-2.099232,0.15071
2,1.234129,-2.099232,0.616605
3,-0.059285,-2.099232,0.299292
4,-0.971254,-2.099232,0.299292
5,-0.953221,-2.099232,0.299292


In [42]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-0.47175,1.717186,1.006653
1,-0.208798,-2.099232,0.15071
2,1.234129,-2.099232,0.616605
3,-0.059285,-2.099232,0.299292
4,-0.971254,,0.299292
5,-0.953221,,0.299292
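
Recent pandas releases deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods. A minimal sketch of the same forward fill written that way, on a small made-up frame (the column names "a" and "b" are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                   "b": [2.0, 3.0, np.nan, np.nan]})

# Propagate the last valid observation forward without limit
filled = df.ffill()

# Fill at most 2 consecutive missing values per gap
limited = df.ffill(limit=2)
```

With limit=2, the third consecutive NaN in column "a" is left unfilled.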


In [43]:
data = pd.Series([1., NA, 3.5, NA, 7])

In [44]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

**Data Transformation**

**Removing Duplicates**

In [45]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1,1,2,3,3,4,4]})

In [46]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [47]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [48]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [49]:
data['v1'] = range(7)

In [50]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination.
Passing keep='last' will return the last one:

In [51]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
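
The effect of keep is easiest to see on a frame small enough to check by eye. This is a sketch with made-up data; the names `frame`, `k`, and `v` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({"k": ["a", "b", "a"], "v": [1, 2, 3]})

# Default: the first row seen for each value of k survives
first = frame.drop_duplicates(["k"])

# keep='last': the last row seen for each value of k survives
last = frame.drop_duplicates(["k"], keep="last")
```

In both cases the surviving rows keep their original index labels and order.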


**Transforming Data Using a Function or Mapping**

In [52]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
 'Pastrami', 'corned beef', 'Bacon',
 'pastrami', 'honey ham', 'nova lox'],
 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [53]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [54]:
meat_to_animal = {
 'bacon': 'pig',
 'pulled pork': 'pig',
 'pastrami': 'cow',
 'corned beef': 'cow',
 'honey ham': 'pig',
 'nova lox': 'salmon'
}

In [55]:
lowercased = data['food'].str.lower()

In [56]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [58]:
data['animal'] = lowercased.map(meat_to_animal)

In [59]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [60]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other
data cleaning–related operations.
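
The two approaches above differ when a value has no entry in the mapping: mapping with a dict yields NaN, while the lambda that indexes the dict would raise KeyError. A minimal sketch, assuming a made-up food ("tofu") that is not in the mapping and a hypothetical "unknown" fallback label:

```python
import pandas as pd

food = pd.Series(["Bacon", "pastrami", "tofu"])   # "tofu" is deliberately unmapped
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Dict-based map: missing keys become NaN
mapped = food.str.lower().map(meat_to_animal)

# Function-based map: dict.get supplies a fallback instead of raising KeyError
with_default = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))
```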

**Replacing Values**

In [61]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [62]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [64]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [65]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [67]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

**Renaming Axis Indexes**

In [68]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
     index=['Ohio', 'Colorado', 'New York'],
     columns=['one', 'two', 'three', 'four'])

In [69]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [70]:
transform = lambda x: x[:4].upper()

In [72]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [74]:
data.index = data.index.map(transform)

In [75]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the original,
a useful method is rename:


In [76]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [77]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [78]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [79]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [80]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


**Discretization and Binning**

In [81]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To
do so, you have to use cut, a function in pandas:


In [82]:
bins = [18, 25, 35, 60, 100]

In [83]:
cats = pd.cut(ages, bins)

In [84]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [85]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [86]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [87]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [89]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
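
Passing right=False changes which side of each bin is closed. A small sketch with values chosen to sit exactly on the bin edges (the shortened `bins` list is made up for illustration):

```python
import pandas as pd

ages = [25, 35]              # both values lie exactly on a bin edge
bins = [18, 25, 35, 60]

# Default: intervals are closed on the right, e.g. (18, 25]
right_closed = pd.cut(ages, bins)

# right=False: intervals are closed on the left, e.g. [25, 35)
left_closed = pd.cut(ages, bins, right=False)
```

With the default, 25 lands in the first bin; with right=False it moves to the second.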

You can also pass your own bin names by passing a list or array to the labels option:

In [90]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [98]:
pd.cut(ages, bins, labels=group_names)


[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute
equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data chopped into fourths:

In [100]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.28, 0.5], (0.5, 0.72], (0.065, 0.28], (0.72, 0.94], (0.5, 0.72], ..., (0.065, 0.28], (0.5, 0.72], (0.065, 0.28], (0.28, 0.5], (0.72, 0.94]]
Length: 20
Categories (4, interval[float64]): [(0.065, 0.28] < (0.28, 0.5] < (0.5, 0.72] < (0.72, 0.94]]
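
To inspect the edges that cut computes, you can ask for them with retbins=True. The sketch below uses made-up, non-random data so the edges are easy to check. The lowest edge comes out slightly below the minimum so that the minimum value falls inside the first bin; treat that detail as observed behavior rather than a guarantee.

```python
import numpy as np
import pandas as pd

data = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 8.0])

# retbins=True returns the computed bin edges alongside the categories
cats, edges = pd.cut(data, 4, retbins=True)

# Interior widths are (max - min) / 4 = 2.0
widths = np.diff(edges)[1:]
```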

In [105]:
data = np.random.randn(1000) # Normally distributed

In [106]:
cats = pd.qcut(data, 4) # Cut into quartiles

In [107]:
cats

[(-0.68, -0.0342], (0.634, 3.379], (-2.936, -0.68], (-0.68, -0.0342], (-0.68, -0.0342], ..., (0.634, 3.379], (-2.936, -0.68], (-2.936, -0.68], (0.634, 3.379], (-0.0342, 0.634]]
Length: 1000
Categories (4, interval[float64]): [(-2.936, -0.68] < (-0.68, -0.0342] < (-0.0342, 0.634] < (0.634, 3.379]]

In [108]:
pd.value_counts(cats)

(0.634, 3.379]      250
(-0.0342, 0.634]    250
(-0.68, -0.0342]    250
(-2.936, -0.68]     250
dtype: int64

In [109]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.264, -0.0342], (-0.0342, 1.255], (-1.264, -0.0342], (-1.264, -0.0342], (-1.264, -0.0342], ..., (1.255, 3.379], (-1.264, -0.0342], (-1.264, -0.0342], (-0.0342, 1.255], (-0.0342, 1.255]]
Length: 1000
Categories (4, interval[float64]): [(-2.936, -1.264] < (-1.264, -0.0342] < (-0.0342, 1.255] < (1.255, 3.379]]
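
The difference between cut and qcut shows up on skewed data: cut makes bins of equal width, while qcut makes bins holding roughly equal numbers of points. A minimal sketch on deterministic, made-up data (the squares of 0 through 99) rather than random draws:

```python
import numpy as np
import pandas as pd

data = np.arange(100) ** 2   # skewed toward small values

# Equal-width bins: most points fall into the first bin
width_counts = np.bincount(pd.cut(data, 4).codes).tolist()

# Quartiles: each bin holds the same number of points
quartile_counts = np.bincount(pd.qcut(data, 4).codes).tolist()

# Custom quantiles: bin sizes follow the 10% / 40% / 40% / 10% split
custom_counts = np.bincount(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0]).codes).tolist()
```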

**Detecting and Filtering Outliers**

In [110]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [113]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.004545,0.012808,0.027722,-0.00764
std,1.000368,0.982436,0.987769,0.997781
min,-3.389936,-2.973503,-3.048461,-3.274249
25%,-0.652617,-0.653309,-0.628602,-0.671558
50%,0.003758,-0.012654,0.045124,-0.026597
75%,0.616468,0.653285,0.66949,0.689576
max,3.910722,3.676616,3.318354,3.342404


In [114]:
col = data[2]

In [120]:
col[np.abs(col) > 3]

96     3.318354
213   -3.048461
Name: 2, dtype: float64

In [121]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
96,0.276567,-0.189797,3.318354,0.355647
100,0.411139,3.18813,1.084914,-0.068744
192,3.910722,-1.011153,-0.587083,-0.860689
213,0.201152,1.249765,-3.048461,-0.505693
495,-1.282495,-0.203286,1.328639,3.001185
556,1.422001,3.057087,0.104227,-0.035
596,0.960022,3.676616,-0.745256,-1.7683
610,-2.263748,0.415073,-0.62558,3.342404
727,1.505128,-0.360988,1.180316,-3.03307
866,-0.934026,3.262203,0.253835,0.725468


In [123]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [125]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.005066,0.011624,0.027453,-0.007676
std,0.99596,0.978536,0.986608,0.995723
min,-3.0,-2.973503,-3.0,-3.0
25%,-0.652617,-0.653309,-0.628602,-0.671558
50%,0.003758,-0.012654,0.045124,-0.026597
75%,0.616468,0.653285,0.66949,0.689576
max,3.0,3.0,3.0,3.0


In [126]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,1.0,1.0,-1.0
1,-1.0,1.0,1.0,1.0
2,-1.0,-1.0,-1.0,-1.0
3,-1.0,1.0,-1.0,-1.0
4,1.0,-1.0,1.0,-1.0
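
The capping step above sets every value outside the interval -3 to 3 to the nearest bound, using np.sign to pick the bound. The sketch below repeats it on a tiny made-up frame and compares the result with DataFrame.clip, which is swapped in here as an alternative spelling and is not part of the original cells.

```python
import numpy as np
import pandas as pd

# Hypothetical values, three of them outside [-3, 3]
data = pd.DataFrame({"a": [0.5, 3.7, -4.2], "b": [-3.1, 1.0, 2.9]})

# Sign-based capping, as in the cells above
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip to the same bounds in one call
clipped = data.clip(lower=-3.0, upper=3.0)
```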


**Permutation and Random Sampling**

In [127]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [128]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [129]:
sampler = np.random.permutation(5)

In [130]:
sampler

array([4, 0, 2, 3, 1])

In [134]:
# Re-indexing based on sampler
df.take(sampler)

Unnamed: 0,0,1,2,3
4,16,17,18,19
0,0,1,2,3
2,8,9,10,11
3,12,13,14,15
1,4,5,6,7


To select a random subset without replacement, you can use the sample method on
Series and DataFrame:

In [135]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
2,8,9,10,11
4,16,17,18,19
0,0,1,2,3


To generate a sample with replacement (to allow repeat choices), pass replace=True
to sample:

In [137]:
choices = pd.Series([5, 7, -1, 6, 4])

In [138]:
draws = choices.sample(n=10, replace=True)

In [139]:
draws

4    4
0    5
2   -1
0    5
0    5
0    5
0    5
0    5
0    5
0    5
dtype: int64
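
Because sample draws at random, its output changes from run to run. A minimal sketch of the same calls with a fixed random_state so the results are repeatable; the seed value 0 is arbitrary, and frac=1 is shown as one way to shuffle every row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Without replacement: each row can appear at most once
subset = df.sample(n=3, random_state=0)

# With replacement: more draws than values are allowed, so labels may repeat
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True, random_state=0)

# Sampling the whole frame (frac=1) returns all rows in shuffled order
shuffled = df.sample(frac=1, random_state=0)
```

Asking for more rows than exist without replace=True raises an error, which is why the ten draws above need replacement.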

**Computing Indicator/Dummy Variables**

In [140]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

In [141]:
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [143]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [144]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [145]:
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [146]:
df_with_dummy = df[['data1']].join(dummies)

In [147]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
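
In recent pandas releases, get_dummies returns True/False columns by default rather than the 0/1 integers shown above. A minimal sketch, assuming you want the integer form, passes dtype=int explicitly:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})

# dtype=int requests 0/1 indicator columns regardless of the library default
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)

df_with_dummy = df[["data1"]].join(dummies)
```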


In [148]:
mnames = ['movie_id', 'title', 'genres']

In [150]:
movies = pd.read_table('./book-support/datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')

In [151]:
movies[:10]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Adding indicator variables for each genre requires a little bit of wrangling. First, we
extract the list of unique genres in the dataset:

In [152]:
all_genres = []

In [153]:
for x in movies.genres:
    all_genres.extend(x.split('|'))

In [154]:
genres = pd.unique(all_genres)

In [155]:
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [157]:
zero_matrix = np.zeros((len(movies), len(genres)))

In [158]:
dummies = pd.DataFrame(zero_matrix, columns=genres)

In [161]:
gen = movies.genres[0]

In [163]:
gen.split('|')

['Animation', "Children's", 'Comedy']

In [166]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)

In [168]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [173]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [175]:
movies_windic


Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,3948,Meet the Parents (2000),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,3949,Requiem for a Dream (2000),Drama,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,3950,Tigerland (2000),Drama,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,3951,Two Family House (2000),Drama,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


For much larger data, this method of constructing indicator variables
with multiple membership is not especially speedy. It would
be better to write a lower-level function that writes directly to a
NumPy array, and then wrap the result in a DataFrame.
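
One possible shape for such a lower-level function is sketched below. The name `genre_indicators`, its parameters, and the three sample strings are hypothetical; this is one way to follow the advice above, not a benchmarked implementation.

```python
import numpy as np
import pandas as pd

def genre_indicators(genre_strings, sep="|"):
    """Build a 0/1 indicator DataFrame from delimiter-separated labels."""
    split = [g.split(sep) for g in genre_strings]
    # Unique labels in first-seen order (dict keys preserve insertion order)
    labels = list(dict.fromkeys(x for parts in split for x in parts))
    position = {label: j for j, label in enumerate(labels)}
    # Write directly into a preallocated NumPy array
    out = np.zeros((len(split), len(labels)), dtype=np.uint8)
    for i, parts in enumerate(split):
        for label in parts:
            out[i, position[label]] = 1
    # Wrap the finished array in a DataFrame only once, at the end
    return pd.DataFrame(out, columns=labels)

sample = ["Animation|Children's|Comedy", "Comedy|Romance", "Action"]
indicators = genre_indicators(sample)
```

The per-element writes go to the array rather than through DataFrame indexing, which is the point of the advice.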

A useful recipe for statistical applications is to combine get_dummies with a discretization
function like cut:

In [177]:
np.random.seed(12345)

In [178]:
values = np.random.rand(10)

In [179]:
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [180]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [181]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


We set the random seed with numpy.random.seed to make the example deterministic.

**String Manipulation**

**String Object Methods**

In [182]:
val = 'a,b, guido'

In [183]:
val.split(',')

['a', 'b', ' guido']

In [184]:
pieces = [x.strip() for x in val.split(',')]

In [185]:
pieces

['a', 'b', 'guido']

In [186]:
first, second, third = pieces

In [187]:
first + '::' + second + '::' + third

'a::b::guido'

In [188]:
'::'.join(pieces)

'a::b::guido'

In [189]:
'guido' in val

True

In [190]:
val.index(',')

1

In [192]:
val.find(':')

-1

In [193]:
val.count(',')

2

In [194]:
val.replace(',', '::')

'a::b:: guido'

In [195]:
val.replace(',', '')

'ab guido'
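
The cells above use both index and find to locate a substring. They differ when the substring is absent: find returns -1, while index raises ValueError. A minimal sketch:

```python
val = "a,b, guido"

# find signals "not found" with -1
found = val.find(":")

# index signals "not found" with an exception
try:
    val.index(":")
    raised = False
except ValueError:
    raised = True
```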

**Regular Expressions**

In [196]:
import re

In [197]:
text = "foo bar\t baz \tqux"

In [198]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and
then its split method is called on the passed text. You can compile the regex yourself
with re.compile, forming a reusable regex object:

In [199]:
regex = re.compile(r'\s+')

In [203]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [205]:
regex.findall(text)

[' ', '\t ', ' \t']

match and search are closely related to findall. While findall returns all matches
in a string, search returns only the first match. More rigidly, match only matches at
the beginning of the string. As a less trivial example, let’s consider a block of text and
a regular expression capable of identifying most email addresses:

In [206]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [207]:
regex = re.compile(pattern, flags=re.IGNORECASE)

In [208]:
regex

re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE|re.UNICODE)

In [209]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

search returns a special match object for the first email address in the text. For the
preceding regex, the match object can only tell us the start and end position of the
pattern in the string:


In [210]:
m = regex.search(text)

In [211]:
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [213]:
text[m.start():m.end()]

'dave@google.com'

In [216]:
print(regex.match(text))

None


In [217]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



Suppose you wanted to find email addresses and simultaneously segment each
address into its three components: username, domain name, and domain suffix. To
do this, put parentheses around the parts of the pattern to segment:

In [218]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [220]:
regex = re.compile(pattern, flags=re.IGNORECASE)

In [221]:
regex

re.compile(r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})',
re.IGNORECASE|re.UNICODE)

In [222]:
m = regex.match('wesm@bright.net')

In [223]:
m.groups()

('wesm', 'bright', 'net')

In [224]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [226]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)) # \1, \2, \3 refer to the captured groups

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
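
Numbered groups can be hard to keep track of as a pattern grows. A variant of the same pattern with named groups is sketched below; the group names `username`, `domain`, and `suffix` are chosen here for illustration.

```python
import re

# Same pattern as above, with each group given a name via (?P<name>...)
pattern = (r"(?P<username>[A-Z0-9._%+-]+)@"
           r"(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})")
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match("wesm@bright.net")
parts = m.groupdict()    # maps group names to matched text

# In sub, \g<name> refers to a named group
rewritten = regex.sub(r"\g<username> at \g<domain>", "Dave dave@google.com")
```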



**Vectorized String Functions in pandas**

In [227]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [228]:
data

{'Dave': 'dave@google.com',
 'Steve': 'steve@gmail.com',
 'Rob': 'rob@gmail.com',
 'Wes': nan}

In [229]:
data = pd.Series(data)

In [230]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [231]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

You can apply string and regular expression methods (passing a
lambda or other function) to each value using data.map, but it will fail on the NA
(null) values. To cope with this, Series has array-oriented methods for string operations
that skip NA values. These are accessed through Series’s str attribute; for example,
we could check whether each email address has 'gmail' in it with str.contains:

In [248]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
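
The failure mentioned above can be reproduced on a cut-down version of the same data. The sketch builds the Series with dtype=object so the missing entry stays a float NaN, and uses the na argument of str.contains, which is an assumption about what you want missing values to become.

```python
import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Wes": np.nan}, dtype=object)

# A plain Python test applied with map trips over the NaN entry
try:
    data.map(lambda x: "gmail" in x)
    map_failed = False
except TypeError:
    map_failed = True

# The str accessor skips NA; na=False fills the result for missing entries
has_gmail = data.str.contains("gmail", na=False)
```

With na=False the result is a plain boolean Series, which can be used directly for boolean indexing.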

In [233]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [234]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [235]:
matches = data.str.match(pattern, flags=re.IGNORECASE)

In [245]:
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [251]:
data.str.get(1)

Dave       a
Steve      t
Rob        o
Wes      NaN
dtype: object

In [250]:
data.str[0]

Dave       d
Steve      s
Rob        r
Wes      NaN
dtype: object

In [252]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object