# 7  Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Many researchers choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.

If you identify a type of data manipulation that isn’t anywhere in this book or elsewhere in the pandas library, feel free to share your use case on one of the Python mailing lists or on the pandas GitHub site. Indeed, much of the design and implementation of pandas has been driven by the needs of real-world applications.

In this chapter I discuss tools for missing data, duplicate data, string manipulation, and some other analytical data transformations. In the next chapter, I focus on combining and rearranging datasets in various ways.

## 7.1 Handling Missing Data

Missing data occurs commonly in many data analysis applications. **One of the goals of pandas is to make working with missing data as painless as possible**. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with `float64` dtype, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data.

We call this a *sentinel value*: when present, it indicates a missing (or *null*) value:



In [1]:
import pandas as pd
import numpy as np

In [2]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])

float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

The `isna` method gives us a Boolean Series with `True` where values are null:



In [3]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we've adopted a convention used in the R programming language by referring to missing data as NA, which stands for *not available*. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
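
As a first look at the missing data itself, it can help to count the nulls in each column. The sketch below uses a small made-up DataFrame; the name `survey` and the columns `age` and `city` are illustrative assumptions, not data from this chapter:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; names and values are illustrative only
survey = pd.DataFrame({"age": [34, np.nan, 29, np.nan],
                       "city": ["Oslo", "Lima", None, "Pune"]})

# Number of missing values in each column
na_counts = survey.isna().sum()

# Fraction of missing values in each column
na_share = survey.isna().mean()

print(na_counts)
print(na_share)
```

A column with an unusually high share of missing values is often a hint to check how that field was collected.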

The built-in Python `None` value is also treated as NA:

In [4]:
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])

string_data

0    aardvark
1         NaN
2        None
3     avocado
dtype: object

In [5]:
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

In [6]:
float_data = pd.Series([1, 2, None], dtype='float64')

float_data

0    1.0
1    2.0
2    NaN
dtype: float64

In [7]:
float_data.isna()

0    False
1    False
2     True
dtype: bool

The pandas project has attempted to make working with missing data consistent across data types. Functions like `pandas.isna` abstract away many of the annoying details. See **[Table 7.1](https://wesmckinney.com/book/data-cleaning#tbl-table_na_method)** for a list of some functions related to missing data handling.

One way to filter out missing data is the `dropna` method. On a Series, it returns the Series with only the non-null data and index values:



In [8]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])

data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is the same thing as doing:


In [9]:
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, there are different ways to remove missing data. You may want to drop rows or columns that are all NA, or only those rows or columns containing any NAs at all. `dropna` by default drops any row containing a missing value:



In [10]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [11]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing `how="all"` will drop only rows that are all NA:



In [12]:
data.dropna(how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


Keep in mind that these functions return new objects by default and do not modify the contents of the original object.

To drop columns in the same way, pass `axis="columns"`:

In [13]:
data[4] = np.nan

data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [14]:
data.dropna(axis="columns", how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Suppose you want to keep only rows containing at least a certain number of non-missing observations. You can indicate this with the `thresh` argument:



In [15]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))

df.iloc[:4, 1] = np.nan

df.iloc[:2, 2] = np.nan

df

Unnamed: 0,0,1,2
0,-0.641319,,
1,0.139777,,
2,0.262599,,0.332474
3,-2.080875,,-0.011066
4,-1.366748,0.199107,-0.546201
5,-0.699452,0.540735,0.024848
6,-0.438676,0.167101,0.018942


In [16]:
df.dropna()

Unnamed: 0,0,1,2
4,-1.366748,0.199107,-0.546201
5,-0.699452,0.540735,0.024848
6,-0.438676,0.167101,0.018942


In [17]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.262599,,0.332474
3,-2.080875,,-0.011066
4,-1.366748,0.199107,-0.546201
5,-0.699452,0.540735,0.024848
6,-0.438676,0.167101,0.018942
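
Note that `thresh` counts non-missing values: a row is kept when it has at least that many. A minimal sketch with made-up data (the name `frame` and its values are assumptions for illustration) compares `dropna(thresh=2)` with the equivalent filter written by hand:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0, np.nan],
                      "b": [np.nan, 5.0, np.nan],
                      "c": [np.nan, 6.0, 9.0]})

# Count the non-missing values in each row
non_na_per_row = frame.notna().sum(axis="columns")

# thresh=2 keeps rows with at least two non-missing values
kept = frame.dropna(thresh=2)

# Equivalent Boolean filter written by hand
kept_by_hand = frame[non_na_per_row >= 2]

print(kept)
```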


### Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:

In [18]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.641319,0.0,0.0
1,0.139777,0.0,0.0
2,0.262599,0.0,0.332474
3,-2.080875,0.0,-0.011066
4,-1.366748,0.199107,-0.546201
5,-0.699452,0.540735,0.024848
6,-0.438676,0.167101,0.018942


Calling `fillna` with a dictionary lets you use a different fill value for each column:



In [19]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,-0.641319,0.5,0.0
1,0.139777,0.5,0.0
2,0.262599,0.5,0.332474
3,-2.080875,0.5,-0.011066
4,-1.366748,0.199107,-0.546201
5,-0.699452,0.540735,0.024848
6,-0.438676,0.167101,0.018942


The same interpolation methods available for reindexing (see Table 5.3) can be used with `fillna`:

In [20]:
df = pd.DataFrame(np.random.standard_normal((6, 3)))

df.iloc[2:, 1] = np.nan

df.iloc[4:, 2] = np.nan

df

Unnamed: 0,0,1,2
0,0.094628,1.059562,-0.110476
1,-1.227231,-2.197856,0.170834
2,1.264887,,-0.12822
3,0.731001,,-1.564761
4,1.521138,,
5,1.241979,,


In [21]:
df.fillna(method="ffill")

Unnamed: 0,0,1,2
0,0.094628,1.059562,-0.110476
1,-1.227231,-2.197856,0.170834
2,1.264887,-2.197856,-0.12822
3,0.731001,-2.197856,-1.564761
4,1.521138,-2.197856,-1.564761
5,1.241979,-2.197856,-1.564761


In [22]:
df.fillna(method="ffill", limit=2)

Unnamed: 0,0,1,2
0,0.094628,1.059562,-0.110476
1,-1.227231,-2.197856,0.170834
2,1.264887,-2.197856,-0.12822
3,0.731001,-2.197856,-1.564761
4,1.521138,,-1.564761
5,1.241979,,-1.564761
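
Recent pandas releases (2.1 and later) deprecate the `method` argument of `fillna` in favor of the dedicated `ffill` and `bfill` methods. The following sketch, using a small made-up DataFrame named `frame`, shows the same forward-fill behavior with those methods:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, np.nan, np.nan],
                      "y": [np.nan, 2.0, np.nan, 4.0]})

# Forward fill: carry the last valid observation down each column
filled = frame.ffill()

# Fill at most one consecutive missing value per gap
filled_limited = frame.ffill(limit=1)

# Backward fill: pull the next valid observation up instead
back_filled = frame.bfill()

print(filled_limited)
```

A leading missing value (the first entry of `y` here) has nothing before it to carry forward, so forward filling leaves it missing.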


With `fillna` you can do lots of other things such as simple data imputation using the median or mean statistics:

In [23]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])

data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
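
The same idea extends to a DataFrame. Because `fillna` accepts a Series keyed by column label, passing the column medians fills each column with its own statistic. This is a sketch on made-up data; the columns `height` and `weight` are hypothetical:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"height": [150.0, np.nan, 170.0, 180.0],
                      "weight": [50.0, 60.0, np.nan, 90.0]})

# frame.median() is a Series indexed by column name, so fillna
# matches each column with its own median
imputed = frame.fillna(frame.median())

print(imputed)
```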

See **[Table 7.2](https://wesmckinney.com/book/data-cleaning#tbl-table_fillna_function)** for a reference on `fillna` function arguments.


## 7.2 Data Transformation

So far in this chapter we’ve been concerned with handling missing data. Filtering, cleaning, and other transformations are another class of important operations.

### Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:


In [24]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method `duplicated` returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row) or not:



In [25]:

data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, `drop_duplicates` returns a DataFrame containing only the rows where the `duplicated` array is `False`:

In [26]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the `"k1"` column:

In [27]:
data["v1"] = range(7)

data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [28]:
data.drop_duplicates(subset=["k1"])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


`duplicated` and `drop_duplicates` by default keep the first observed value combination. Passing `keep="last"` will return the last one:


In [29]:
data.drop_duplicates(["k1", "k2"], keep="last")

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
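
Besides `"first"` and `"last"`, the `keep` argument also accepts `False`, which treats every member of a duplicated group as a duplicate. A short sketch on made-up data (the DataFrame `frame` is an assumption for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                      "k2": [1, 1, 1, 2]})

# keep=False flags every member of a duplicated group
all_dupes = frame.duplicated(keep=False)

# Rows that are involved in any duplication
repeated_rows = frame[all_dupes]

# Rows whose value combination occurs exactly once
unique_rows = frame.drop_duplicates(keep=False)

print(repeated_rows)
print(unique_rows)
```

This is handy when you want to inspect all conflicting records before deciding which one to keep.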


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:



In [33]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                              "pastrami", "corned beef", "bacon",
                              "pastrami", "honey ham", "nova lox"],
                     "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:




In [34]:
meat_to_animal = {"bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

The `map` method on a Series (also discussed in Ch 5.2.5: Function Application and Mapping) accepts a function or dictionary-like object containing a mapping to do the transformation of values:



In [32]:
data["animal"] = data["food"].map(meat_to_animal)

data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also have passed a function that does all the work:



In [35]:
def get_animal(x): 
    return meat_to_animal[x]

data["food"].map(get_animal)

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using `map` is a convenient way to perform element-wise transformations and other data cleaning-related operations.
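
One detail worth knowing: when `map` is given a dictionary, values with no matching key become NaN, whereas a function that indexes the dictionary directly would raise a `KeyError`. The sketch below uses a shortened mapping and a made-up `foods` Series (including the unmapped value `"tofu"`) to show this, along with case normalization before mapping:

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
foods = pd.Series(["Bacon", "pastrami", "tofu"])

# Normalize case first so that "Bacon" matches the "bacon" key
normalized = foods.str.lower()

# With a dictionary, unmatched values ("tofu") become NaN
animals = normalized.map(meat_to_animal)

# dict.get lets a function supply an explicit fallback instead
animals_with_default = normalized.map(
    lambda x: meat_to_animal.get(x, "unknown")
)

print(animals)
print(animals_with_default)
```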


### Replacing Values

Filling in missing data with the `fillna` method is a special case of more general value replacement. As you've already seen, `map` can be used to modify a subset of values in an object, but `replace` provides a simpler and more flexible way to do so. Let’s consider this Series:




In [36]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The `-999` values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use `replace`, producing a new Series:



In [37]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the substitute value:




In [39]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:



In [40]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dictionary:



In [42]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

**Note:** The `data.replace` method is distinct from `data.str.replace`, which performs element-wise string substitution. We look at these string methods on Series later in the chapter.
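
To make the distinction concrete, here is a small sketch on a made-up Series of strings (the name `codes` and its values are illustrative). `replace` matches whole values by default, while `str.replace` substitutes inside each string:

```python
import pandas as pd

codes = pd.Series(["a-1", "b-2", "a"])

# Series.replace matches whole values: only the exact value "a" changes
whole_value = codes.replace("a", "z")

# Series.str.replace substitutes within each string
substring = codes.str.replace("a", "z")

print(whole_value.tolist())
print(substring.tolist())
```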



### Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure. Here’s a simple example:



In [43]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

Like a Series, the axis indexes have a `map` method:



In [44]:
def transform(x):
    return x[:4].upper()

data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign to the `index` attribute, modifying the DataFrame in place:



In [45]:
data.index = data.index.map(transform)

data


Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the original, a useful method is `rename`:



In [46]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, `rename` can be used in conjunction with a dictionary-like object, providing new values for a subset of the axis labels:



In [47]:
data.rename(index={"OHIO": "INDIANA"}, 
            columns={"three": "peakaboo"})

Unnamed: 0,one,two,peakaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


`rename` saves you from the chore of copying the DataFrame manually and assigning new values to its `index` and `columns` attributes.


### Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:


In [48]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use `pandas.cut`:



In [49]:
bins = [18, 25, 35, 60, 100]

age_categories = pd.cut(ages, bins)

age_categories

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see describes the bins computed by `pandas.cut`. Each bin is identified by a special (unique to pandas) interval value type containing the lower and upper limit of each bin:



In [50]:
age_categories.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [51]:
age_categories.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [52]:
age_categories.categories[0]

Interval(18, 25, closed='right')

In [53]:
pd.value_counts(age_categories)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

Note that `pd.value_counts(age_categories)` gives the bin counts for the result of `pandas.cut`.

In the string representation of an interval, a parenthesis means that the side is open (exclusive), while the square bracket means it is *closed* (inclusive). You can change which side is closed by passing `right=False`:

In [55]:
pd.cut(ages, bins, right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64, left]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
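
Which side is closed matters most for values that sit exactly on a bin edge, and values that fall outside every bin are marked as missing. The sketch below uses a few hand-picked ages (`sample_ages` is an assumption for illustration) and inspects the integer codes, where -1 denotes a missing category:

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
sample_ages = [18, 25, 26, 101]

# Default right=True: intervals are (18, 25], (25, 35], ...
# so 18 falls outside the first bin and 25 falls inside it
right_closed = pd.cut(sample_ages, bins)

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(sample_ages, bins, include_lowest=True)

# Values outside every bin get code -1 (a missing category)
print(right_closed.codes)
print(with_lowest.codes)
```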

You can override the default interval-based bin labeling by passing a list or array to the `labels` option:



In [56]:
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

If you pass an integer number of bins to `pandas.cut` instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:



In [58]:
data = np.random.uniform(size=20)

pd.cut(data, 4, precision=2)

[(0.75, 0.97], (0.75, 0.97], (0.32, 0.53], (0.75, 0.97], (0.53, 0.75], ..., (0.53, 0.75], (0.32, 0.53], (0.32, 0.53], (0.75, 0.97], (0.096, 0.32]]
Length: 20
Categories (4, interval[float64, right]): [(0.096, 0.32] < (0.32, 0.53] < (0.53, 0.75] < (0.75, 0.97]]

The `precision=2` option limits the decimal precision to two digits.

A closely related function, `pandas.qcut`, bins the data based on sample quantiles. Depending on the distribution of the data, using `pandas.cut` will not usually result in each bin having the same number of data points. Since `pandas.qcut` uses sample quantiles instead, you will obtain roughly equally sized bins:

In [59]:
data = np.random.standard_normal(1000)

quartiles = pd.qcut(data, 4, precision=2)

quartiles

[(-0.67, 0.011], (0.63, 2.79], (0.011, 0.63], (0.63, 2.79], (0.63, 2.79], ..., (-0.67, 0.011], (0.63, 2.79], (0.011, 0.63], (0.011, 0.63], (0.63, 2.79]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.51, -0.67] < (-0.67, 0.011] < (0.011, 0.63] < (0.63, 2.79]]

In [60]:
pd.value_counts(quartiles)

(-2.51, -0.67]    250
(-0.67, 0.011]    250
(0.011, 0.63]     250
(0.63, 2.79]      250
Name: count, dtype: int64


Similar to `pandas.cut`, you can pass your own quantiles (numbers between 0 and 1, inclusive):



In [63]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]).value_counts()

(-2.504, -1.231]    100
(-1.231, 0.0109]    400
(0.0109, 1.219]     400
(1.219, 2.787]      100
Name: count, dtype: int64
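
The difference between the two functions is easiest to see on skewed data. The following sketch uses deterministic, made-up values (the squares of 0 through 99, stored in a hypothetical array `skewed`) so the counts do not depend on a random seed:

```python
import numpy as np
import pandas as pd

# Deterministic, right-skewed data: squares of 0..99
skewed = np.arange(100) ** 2

# Equal-width bins: most points land in the lowest bin
width_counts = pd.cut(skewed, 4).value_counts()

# Quantile-based bins: each bin gets the same number of points
quantile_counts = pd.qcut(skewed, 4).value_counts()

print(width_counts)
print(quantile_counts)
```

Here `pandas.cut` puts half of the points in the first bin, while `pandas.qcut` assigns 25 points to each of the four bins.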

We’ll return to `pandas.cut` and `pandas.qcut` later in the book during our discussion of aggregation and group operations, as these discretization functions are especially useful for quantile and group analysis.



### Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:



In [64]:
data = pd.DataFrame(np.random.standard_normal((1000, 4)))

data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.021303,0.05455,-0.023899,-0.018161
std,0.998318,0.993144,0.956427,0.96708
min,-3.226613,-3.130076,-3.189194,-2.937579
25%,-0.710275,-0.635487,-0.681866,-0.6931
50%,-0.006763,0.026272,-0.042049,-0.048338
75%,0.705267,0.728037,0.614735,0.638792
max,3.229679,3.092536,3.026028,3.65632


Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:



In [65]:
col = data[2]

col[col.abs() > 3]

255    3.026028
353   -3.189194
Name: 2, dtype: float64

To select all rows containing a value greater than 3 or less than –3, you can use the `any` method on a Boolean DataFrame:



In [67]:
data[(data.abs() > 3).any(axis="columns")]

Unnamed: 0,0,1,2,3
105,-0.836679,3.054836,1.357564,-1.009602
255,0.835623,-0.125992,3.026028,-0.488679
267,-0.178834,-3.130076,-0.892829,-0.027366
325,0.499544,0.022759,0.038194,3.65632
352,-0.26079,3.092536,-0.321103,0.131501
353,-0.92992,0.813502,-3.189194,1.561575
483,3.229679,-0.878474,0.135831,-1.05793
746,3.050921,-1.099165,-1.542938,0.52474
873,1.189179,-0.091922,-0.334599,3.183661
943,-3.226613,1.568792,1.100335,-1.451164


The parentheses around `data.abs() > 3` are necessary in order to call the `any` method on the result of the comparison operation.

Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:

In [69]:
data[data.abs() > 3] = np.sign(data) * 3

data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.021249,0.054532,-0.023736,-0.019001
std,0.996739,0.992291,0.955736,0.964211
min,-3.0,-3.0,-3.0,-2.937579
25%,-0.710275,-0.635487,-0.681866,-0.6931
50%,-0.006763,0.026272,-0.042049,-0.048338
75%,0.705267,0.728037,0.614735,0.638792
max,3.0,3.0,3.0,3.0


The statement `np.sign(data)` produces 1 and –1 values based on whether the values in `data` are positive or negative:



In [70]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,1.0,-1.0,-1.0
1,1.0,1.0,-1.0,1.0
2,-1.0,-1.0,-1.0,1.0
3,-1.0,1.0,1.0,-1.0
4,-1.0,-1.0,-1.0,-1.0
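
For simple symmetric capping like this, the `clip` method on DataFrame and Series gives the same result in one call. A sketch on a tiny made-up DataFrame (`frame` and its values are illustrative) compares the two approaches:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [-4.2, 0.5, 3.7],
                      "b": [2.9, -3.1, 0.0]})

# Sign-based capping, applied to a copy so frame is left untouched
capped = frame.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# clip bounds every value to the closed interval [-3, 3]
clipped = frame.clip(lower=-3, upper=3)

print(clipped)
```

The sign-based form remains useful when the replacement value is something other than the bound itself.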


### Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is possible using the `numpy.random.permutation` function. Calling `permutation` with the length of the axis you want to permute produces an array of integers indicating the new ordering:



In [71]:
df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

df

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34


In [72]:
sampler = np.random.permutation(5)

sampler

array([1, 2, 3, 4, 0])

That array can then be used in `iloc`-based indexing or the equivalent `take` function:



In [74]:
df.take(sampler)

Unnamed: 0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34
0,0,1,2,3,4,5,6


In [75]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34
0,0,1,2,3,4,5,6


By invoking `take` with `axis="columns"`, we could also select a permutation of the columns:



In [76]:
column_sampler = np.random.permutation(7)

column_sampler

array([3, 6, 5, 0, 4, 2, 1])

In [77]:
df.take(column_sampler, axis="columns")

Unnamed: 0,3,6,5,0,4,2,1
0,3,6,5,0,4,2,1
1,10,13,12,7,11,9,8
2,17,20,19,14,18,16,15
3,24,27,26,21,25,23,22
4,31,34,33,28,32,30,29


To select a random subset without replacement (the same row cannot appear twice), you can use the `sample` method on Series and DataFrame:



In [78]:
df.sample(n=3)

Unnamed: 0,0,1,2,3,4,5,6
2,14,15,16,17,18,19,20
0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27


To generate a sample *with* replacement (to allow repeat choices), pass `replace=True` to `sample`:



In [79]:
choices = pd.Series([5, 7, -1, 6, 4])

choices.sample(n=10, replace=True)

4    4
1    7
3    6
2   -1
0    5
3    6
0    5
1    7
2   -1
0    5
dtype: int64
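
If you need the same shuffle or sample on every run, you can seed the random number source. The sketch below assumes NumPy's `default_rng` generator and the `random_state` argument of `sample`; the seed values are arbitrary:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

# A seeded Generator gives the same permutation on every run
rng = np.random.default_rng(12345)
order = rng.permutation(len(frame))
shuffled = frame.take(order)

# sample accepts random_state for the same purpose
subset_a = frame.sample(n=3, random_state=0)
subset_b = frame.sample(n=3, random_state=0)

# frac=1 draws every row once, which is another way to shuffle
shuffled_again = frame.sample(frac=1, random_state=0)

print(shuffled)
```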

### Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a *dummy* or *indicator* matrix. If a column in a DataFrame has `k` distinct values, you would derive a matrix or DataFrame with `k` columns containing all 1s and 0s. pandas has a `pandas.get_dummies` function for doing this, though you could also devise one yourself. Let’s consider an example DataFrame:



In [81]:
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [83]:
pd.get_dummies(df["key"], dtype=float)

Unnamed: 0,a,b,c
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,0.0


Here I passed `dtype=float` to change the output type from boolean (the default in more recent versions of pandas) to floating point.

In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. `pandas.get_dummies` has a `prefix` argument for doing this:

In [86]:
dummies = pd.get_dummies(df["key"], prefix="key", dtype=float)

df_with_dummy = df[["data1"]].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0.0,1.0,0.0
1,1,0.0,1.0,0.0
2,2,1.0,0.0,0.0
3,3,0.0,0.0,1.0
4,4,1.0,0.0,0.0
5,5,0.0,1.0,0.0
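
To see what `pandas.get_dummies` is doing, here is one way you might build the indicator columns by hand. This is only a sketch of the idea, not how pandas implements it; the variable names are assumptions:

```python
import pandas as pd

keys = pd.Series(["b", "b", "a", "c", "a", "b"])

# Sorted distinct values become the indicator columns
categories = sorted(keys.unique())

# One 0/1 column per category: 1.0 where the value matches
by_hand = pd.DataFrame(
    {cat: (keys == cat).astype(float) for cat in categories}
)

# For comparison, the built-in function
from_pandas = pd.get_dummies(keys, dtype=float)

print(by_hand)
```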


The `DataFrame.join` method will be explained in more detail in the next chapter.

If a row in a DataFrame belongs to multiple categories, we have to use a different approach to create the dummy variables. Let’s look at the MovieLens 1M dataset, which is investigated in more detail in Ch 13: Data Analysis Examples:



In [88]:
mnames = ["movie_id", "title", "genres"]

movies = pd.read_table("/Users/timl/PythonWork/Data:Jupyter/McKinneyBook/Chapter678/movies.dat", sep="::", 
                    header=None, names=mnames, engine="python")

In [89]:
movies[:10]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


pandas has implemented a special Series method `str.get_dummies` (methods that start with `str.` are discussed in more detail later in String Manipulation) that handles this scenario of multiple group membership encoded as a delimited string:



In [90]:
dummies = movies["genres"].str.get_dummies("|")

dummies.iloc[:10, :6]

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime
0,0,0,1,1,1,0
1,0,1,0,1,0,0
2,0,0,0,0,1,0
3,0,0,0,0,1,0
4,0,0,0,0,1,0
5,1,0,0,0,0,1
6,0,0,0,0,1,0
7,0,1,0,1,0,0
8,1,0,0,0,0,0
9,1,1,0,0,0,0
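
If you do not have the MovieLens file at hand, the same technique can be tried on a few rows typed in directly. The sketch below copies three title and genre pairs from the output above into a hypothetical DataFrame named `films`:

```python
import pandas as pd

# A few rows entered by hand so the example needs no data file
films = pd.DataFrame({
    "title": ["Toy Story (1995)", "Heat (1995)", "Sudden Death (1995)"],
    "genres": ["Animation|Children's|Comedy",
               "Action|Crime|Thriller",
               "Action"],
})

# One indicator column per distinct genre, split on "|"
genre_flags = films["genres"].str.get_dummies("|")

# Number of films in each genre
genre_totals = genre_flags.sum()

print(genre_flags)
print(genre_totals)
```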


Then, as before, you can combine this with `movies` while adding a `"Genre_"` prefix to the column names in the `dummies` DataFrame with the `add_prefix` method:



In [91]:
movies_windic = movies.join(dummies.add_prefix("Genre_"))

movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Action                                   0
Genre_Adventure                                0
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Crime                                    0
Genre_Documentary                              0
Genre_Drama                                    0
Genre_Fantasy                                  0
Genre_Film-Noir                                0
Genre_Horror                                   0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Romance                                  0
Genre_Sci-Fi                                   0
Genre_Thriller                                 0
Genre_War                                      0
Genre_Western                                  0
Name: 0, dtype: object

A useful recipe for statistical applications is to combine `pandas.get_dummies` with a discretization function like `pandas.cut`:



In [92]:
np.random.seed(12345) # to make the example repeatable

values = np.random.uniform(size=10)

values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])