# Chapter 7

# Data Cleaning and Preparation

In [1]:
import pandas as pd

In [2]:
import numpy as np

# 7.1 Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.
The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected:

In [3]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we’ve adopted a convention used in the R programming language by referring to missing data as NA, which stands for not available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

In [6]:
# The built-in Python None value is also treated as NA in object arrays:
string_data[0] = None

In [7]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
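
Since analyzing the missing data itself is often worthwhile, a first pass might simply count the NA values in each column. The following is a minimal sketch; the `survey` DataFrame and its column names are hypothetical and exist only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the names and values are illustrative only
survey = pd.DataFrame({'age': [34, np.nan, 29, np.nan],
                       'city': ['Lima', None, 'Oslo', 'Pune']})

# Number of missing entries in each column (NaN and None both count)
na_counts = survey.isnull().sum()

# Fraction of rows that have at least one missing entry
frac_incomplete = survey.isnull().any(axis=1).mean()
```

A column with an unusually high NA count is often a hint that something went wrong upstream in data collection.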

# Filtering Out Missing Data

In [8]:
from numpy import nan as NA

In [9]:
data = pd.Series([1,NA,3.5,NA,7])

In [10]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [11]:
# This is equivalent to:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [12]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

In [13]:
cleaned = data.dropna()

In [14]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [15]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [16]:
# Passing how='all' will only drop rows that are all NA:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [17]:
# To drop columns in the same way, pass axis=1:
data[4] = NA

In [18]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [19]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument:

In [20]:
df = pd.DataFrame(np.random.randn(7,3))

In [21]:
df.iloc[:4,1] = NA

In [22]:
df.iloc[:2,2] = NA

In [23]:
df 

Unnamed: 0,0,1,2
0,-0.708119,,
1,-1.413724,,
2,0.055258,,-0.453149
3,-0.512723,,-0.418453
4,0.93339,1.94157,-1.318099
5,-0.257618,0.028687,-0.92568
6,0.195392,1.044899,-0.243783


In [24]:
df.dropna()

Unnamed: 0,0,1,2
4,0.93339,1.94157,-1.318099
5,-0.257618,0.028687,-0.92568
6,0.195392,1.044899,-0.243783


In [25]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.055258,,-0.453149
3,-0.512723,,-0.418453
4,0.93339,1.94157,-1.318099
5,-0.257618,0.028687,-0.92568
6,0.195392,1.044899,-0.243783
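
The thresh argument counts non-NA values: a row is kept only if it contains at least that many observations. As a rough sketch on a small made-up frame (the name `frame` is an assumption, not part of the example above), the same filter can be written out with a boolean mask:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 3.0],
                      [4.0, 5.0, 6.0]])

# Keep rows that have at least 2 non-NA values
kept = frame.dropna(thresh=2)

# The same filter spelled out: count non-NA values per row, then compare
manual = frame[frame.notnull().sum(axis=1) >= 2]
```

Here the first row has only one observation, so both approaches drop it and keep the other two rows.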


# Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [26]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.708119,0.0,0.0
1,-1.413724,0.0,0.0
2,0.055258,0.0,-0.453149
3,-0.512723,0.0,-0.418453
4,0.93339,1.94157,-1.318099
5,-0.257618,0.028687,-0.92568
6,0.195392,1.044899,-0.243783


In [27]:
# Calling fillna with a dict, you can use a different fill value for each column:
df.fillna({1:0.5,2:0})

Unnamed: 0,0,1,2
0,-0.708119,0.5,0.0
1,-1.413724,0.5,0.0
2,0.055258,0.5,-0.453149
3,-0.512723,0.5,-0.418453
4,0.93339,1.94157,-1.318099
5,-0.257618,0.028687,-0.92568
6,0.195392,1.044899,-0.243783


In [28]:
# fillna returns a new object, but you can modify the existing object in-place:
_ = df.fillna(0, inplace=True)

In [29]:
df

Unnamed: 0,0,1,2
0,-0.708119,0.0,0.0
1,-1.413724,0.0,0.0
2,0.055258,0.0,-0.453149
3,-0.512723,0.0,-0.418453
4,0.93339,1.94157,-1.318099
5,-0.257618,0.028687,-0.92568
6,0.195392,1.044899,-0.243783


In [30]:
# The same interpolation methods available for reindexing can be used with fillna:
df = pd.DataFrame(np.random.randn(6,3))

In [31]:
df.iloc[2:,1] = NA

In [32]:
df.iloc[4:,2] = NA

In [33]:
df

Unnamed: 0,0,1,2
0,-1.171699,-0.741235,-0.960274
1,-1.225311,-0.912241,1.30871
2,1.162562,,-0.015181
3,0.494018,,-0.225116
4,-0.554228,,
5,1.251541,,


In [34]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-1.171699,-0.741235,-0.960274
1,-1.225311,-0.912241,1.30871
2,1.162562,-0.912241,-0.015181
3,0.494018,-0.912241,-0.225116
4,-0.554228,-0.912241,-0.225116
5,1.251541,-0.912241,-0.225116


In [35]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-1.171699,-0.741235,-0.960274
1,-1.225311,-0.912241,1.30871
2,1.162562,-0.912241,-0.015181
3,0.494018,-0.912241,-0.225116
4,-0.554228,,-0.225116
5,1.251541,,-0.225116


With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:

In [36]:
data = pd.Series([1.,NA,3.5,NA,7])

In [37]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
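
The same idea carries over to a DataFrame. Because fillna accepts a Series whose index matches the column labels, passing the column means fills each column with its own mean. This is a small sketch on made-up data; `frame` and its column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 4.0, 8.0]})

# frame.mean() is a Series indexed by column label, so each column
# is filled with its own mean (NA values are skipped when averaging)
filled = frame.fillna(frame.mean())
```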

# 7.2 Data Transformation

So far in this chapter we’ve been concerned with handling missing data. Filtering, cleaning, and other transformations are another class of important operations.

# Removing Duplicates

In [38]:
# Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})


In [39]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [40]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [41]:
# Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [42]:
data['v1'] = range(7)

In [43]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:

In [44]:
data.drop_duplicates(['k1','k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
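
Besides 'first' and 'last', the keep argument also accepts False, which treats every member of a duplicated group as a duplicate. A minimal sketch on a made-up frame (the data here is illustrative, not the frame used above):

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'k2': [1, 1, 1, 2]})

# Rows 0 and 2 are both ('one', 1); keep=False flags both of them
all_dupes = frame.duplicated(keep=False)

# Drop every row that has a duplicate anywhere in the frame
unique_only = frame.drop_duplicates(keep=False)
```

This can be handy when duplicated records are suspect and you would rather inspect or discard all copies than silently keep one.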


# Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [45]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [46]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [47]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

In [48]:
lowercased = data['food'].str.lower()

In [49]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [50]:
data['animal'] = lowercased.map(meat_to_animal)

In [51]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [52]:
# We could also have passed a function that does all the work:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other data cleaning–related operations.
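
One difference between the two approaches is worth keeping in mind: mapping with a dict yields NaN for values that have no matching key, whereas a function that indexes the dict directly would raise a KeyError. The sketch below uses a shortened, made-up version of the data ('tofu' is deliberately absent from the mapping) and a hypothetical 'unknown' default:

```python
import pandas as pd

foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Dict mapping: 'tofu' has no key, so the result is NaN for that element
mapped = foods.str.lower().map(meat_to_animal)

# Function mapping with an explicit default instead of a KeyError
mapped_default = foods.map(
    lambda x: meat_to_animal.get(x.lower(), 'unknown'))
```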

# Replacing Values

Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Let’s consider this Series:

In [53]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [54]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless you pass inplace=True):

In [55]:
data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the substitute value:

In [56]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [57]:
# To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [58]:
# The argument passed can also be a dict:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The data.replace method is distinct from data.str.replace, which performs string substitution element-wise. We look at these string methods on Series later in the chapter.
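
To make the distinction concrete, here is a small sketch on a made-up Series of strings (the name `codes` and its values are assumptions for illustration). With its default settings, replace matches whole values, while str.replace substitutes within each string:

```python
import pandas as pd

codes = pd.Series(['a-1', 'b-2', 'a-1'])

# replace matches entire values: every 'a-1' becomes 'missing'
whole = codes.replace('a-1', 'missing')

# No element is exactly '-', so this leaves the Series unchanged
unchanged = codes.replace('-', '_')

# str.replace works inside each string (regex=False: literal match)
part = codes.str.replace('-', '_', regex=False)
```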

# Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here’s a simple example:

In [59]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [60]:
# Like a Series, the axis indexes have a map method:
transform = lambda x: x[:4].upper()

In [61]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [62]:
# You can assign to index, modifying the DataFrame in-place:
data.index = data.index.map(transform)

In [63]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [64]:
# If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [65]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [66]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [67]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


# Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [68]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use cut, a function in pandas:

In [69]:
bins = [18, 25, 35, 60, 100]

In [70]:
cats = pd.cut(ages, bins)

In [71]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see describes the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin name; internally it contains a categories array specifying the distinct category names along with a labeling for the ages data in the codes attribute:

In [72]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [73]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [74]:
pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.
Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:

In [75]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
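
Which side is closed matters for values that sit exactly on a bin edge. In this small sketch (the three ages are made up to hit the edges), a value equal to the lowest edge falls outside every right-closed bin and is coded as missing:

```python
import pandas as pd

edge_ages = [18, 25, 26]
edges = [18, 25, 35]

# Right-closed bins (18, 25] and (25, 35]: 18 belongs to no bin
right_closed = pd.cut(edge_ages, edges)

# Left-closed bins [18, 25) and [25, 35): 18 is in the first bin,
# and 25 moves up to the second
left_closed = pd.cut(edge_ages, edges, right=False)

print(right_closed.codes)  # -1 marks a value outside all bins (NA)
print(left_closed.codes)
```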

In [76]:
# You can also pass your own bin names by passing a list or array to the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [77]:
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:

In [78]:
data = np.random.rand(20)

In [79]:
pd.cut(data, 4, precision=2)

[(0.48, 0.72], (0.25, 0.48], (0.012, 0.25], (0.72, 0.95], (0.012, 0.25], ..., (0.48, 0.72], (0.25, 0.48], (0.25, 0.48], (0.25, 0.48], (0.25, 0.48]]
Length: 20
Categories (4, interval[float64, right]): [(0.012, 0.25] < (0.25, 0.48] < (0.48, 0.72] < (0.72, 0.95]]

The precision=2 option limits the decimal precision to two digits.
A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [80]:
data = np.random.randn(1000) # Normally distributed

In [81]:
cats = pd.qcut(data, 4) # Cut into quartiles

In [82]:
cats

[(-2.654, -0.694], (-2.654, -0.694], (0.000661, 0.675], (-0.694, 0.000661], (0.000661, 0.675], ..., (0.675, 3.156], (0.000661, 0.675], (-0.694, 0.000661], (0.000661, 0.675], (0.675, 3.156]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.654, -0.694] < (-0.694, 0.000661] < (0.000661, 0.675] < (0.675, 3.156]]

In [83]:
pd.value_counts(cats)

(-2.654, -0.694]      250
(-0.694, 0.000661]    250
(0.000661, 0.675]     250
(0.675, 3.156]        250
dtype: int64

In [84]:
# Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.292, 0.000661], (-1.292, 0.000661], (0.000661, 1.327], (-1.292, 0.000661], (0.000661, 1.327], ..., (0.000661, 1.327], (0.000661, 1.327], (-1.292, 0.000661], (0.000661, 1.327], (0.000661, 1.327]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.654, -1.292] < (-1.292, 0.000661] < (0.000661, 1.327] < (1.327, 3.156]]
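
To see the difference between equal-width and equal-count binning without relying on random numbers, the following sketch bins a deterministic, skewed array (the squares of 0 through 999, chosen purely for illustration) both ways and counts the members of each bin from the codes attribute:

```python
import numpy as np
import pandas as pd

# Skewed but deterministic data: squares of 0..999
values = np.arange(1000) ** 2

equal_width = pd.cut(values, 4)    # four bins of equal length
equal_count = pd.qcut(values, 4)   # four bins split at the quartiles

# Count how many values landed in each bin
width_counts = np.bincount(equal_width.codes)
count_counts = np.bincount(equal_count.codes)
```

With cut, half of the values land in the first bin because the small squares are densely packed; with qcut, each bin receives 250 values.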

# Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [85]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [86]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.021408,-0.023783,-0.005978,-0.039652
std,0.995684,1.036721,1.00794,0.978362
min,-2.909981,-3.156364,-2.963091,-2.690095
25%,-0.674577,-0.735023,-0.70891,-0.703134
50%,0.00069,0.004721,0.008756,0.022231
75%,0.652474,0.686841,0.708318,0.600685
max,3.159057,3.015478,2.978359,3.329722


In [87]:
# Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
col = data[2]

In [88]:
col[np.abs(col) > 3]

Series([], Name: 2, dtype: float64)

In [89]:
# To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
238,0.336054,-3.069113,0.150945,1.368444
248,3.159057,0.245421,0.99076,2.96554
268,-0.001419,0.008697,1.291888,3.329722
276,-1.223937,3.015478,-1.146925,0.491471
305,0.056963,-3.156364,0.350028,-2.108143
687,-0.21572,0.484469,-0.671164,3.001899


In [90]:
# Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:
data[np.abs(data) > 3] = np.sign(data) * 3

In [91]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.021567,-0.023573,-0.005978,-0.039984
std,0.995188,1.036013,1.00794,0.977274
min,-2.909981,-3.0,-2.963091,-2.690095
25%,-0.674577,-0.735023,-0.70891,-0.703134
50%,0.00069,0.004721,0.008756,0.022231
75%,0.652474,0.686841,0.708318,0.600685
max,3.0,3.0,2.978359,3.0


In [92]:
# The statement np.sign(data) produces 1 and –1 values based on whether the values in data are positive or negative:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,1.0,1.0,-1.0
1,-1.0,-1.0,1.0,-1.0
2,-1.0,1.0,1.0,-1.0
3,1.0,-1.0,-1.0,-1.0
4,1.0,-1.0,1.0,-1.0
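
As an aside, the capping step can also be expressed with the clip method, which bounds every value to a given interval. The sketch below, on a small made-up frame, checks that clip gives the same result as the sign-based assignment shown above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [-4.5, 0.2, 3.7],
                      'y': [1.0, -3.1, 2.9]})

# The approach from the text: overwrite out-of-range values with +/-3
capped = frame.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# An alternative: clip bounds every value to the interval [-3, 3]
clipped = frame.clip(lower=-3, upper=3)
```

Note that clip returns a new object, so the original frame is left untouched.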


# Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [93]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [94]:
sampler = np.random.permutation(5)

In [95]:
sampler

array([1, 0, 4, 3, 2])

In [97]:
# That array can then be used in iloc-based indexing or the equivalent take function:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [98]:
df.take(sampler)

Unnamed: 0,0,1,2,3
1,4,5,6,7
0,0,1,2,3
4,16,17,18,19
3,12,13,14,15
2,8,9,10,11


In [99]:
# To select a random subset without replacement, you can use the sample method on Series and DataFrame:
df.sample(n=3)

Unnamed: 0,0,1,2,3
3,12,13,14,15
4,16,17,18,19
1,4,5,6,7


In [100]:
# To generate a sample with replacement (to allow repeat choices), pass replace=True to sample:
choices = pd.Series([5, 7, -1, 6, 4])

In [101]:
draws = choices.sample(n=10, replace=True)

In [102]:
draws

3    6
2   -1
2   -1
0    5
0    5
4    4
4    4
3    6
2   -1
1    7
dtype: int64
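
Because these operations are random, their results change from run to run. If you need a repeatable shuffle or sample, one option is to fix the seed. This is a minimal sketch; the seed value 0 is arbitrary:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# sample with frac=1 returns all rows in random order; random_state
# makes the ordering repeatable
shuffled = frame.sample(frac=1, random_state=0)
again = frame.sample(frac=1, random_state=0)

# The permutation-based approach with a seeded NumPy generator
rng = np.random.default_rng(0)
order = rng.permutation(len(frame))
permuted = frame.take(order)
```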

# Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let’s return to an earlier example DataFrame:

In [103]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

In [104]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:

In [105]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [106]:
df_with_dummy = df[['data1']].join(dummies)

In [107]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
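
To illustrate the earlier remark that devising an indicator matrix yourself is not difficult, here is one possible hand-rolled version next to get_dummies. It is only a sketch on a made-up Series. Passing dtype=int is an assumption made here so that the result holds 1s and 0s regardless of the pandas version, since the default dtype of the indicator columns has differed between releases:

```python
import pandas as pd

keys = pd.Series(['b', 'b', 'a', 'c'])

# Built-in: one indicator column per distinct value
dummies = pd.get_dummies(keys, prefix='key', dtype=int)

# Hand-rolled: compare the Series against each distinct value in turn
manual = pd.DataFrame({'key_' + k: (keys == k).astype(int)
                       for k in sorted(keys.unique())})
```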


If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let’s look at the MovieLens 1M dataset, which is investigated in more detail in Chapter 14:

In [108]:
mnames = ['movie_id', 'title', 'genres']

In [109]:
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')

Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset:

In [110]:
all_genres = []

In [111]:
for x in movies.genres:
    all_genres.extend(x.split('|'))

In [112]:
genres = pd.unique(all_genres)

In [113]:
genres

In [114]:
zero_matrix = np.zeros((len(movies), len(genres)))

In [115]:
dummies = pd.DataFrame(zero_matrix, columns=genres)

Now, iterate through each movie and set entries in each row of dummies to 1. To do this, we use the dummies.columns to compute the column indices for each genre:

In [116]:
gen = movies.genres[0]

In [117]:
gen.split('|')

In [118]:
dummies.columns.get_indexer(gen.split('|'))

In [119]:
# Then, we can use .iloc to set values based on these indices:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [120]:
# Then, as before, you can combine this with movies:
movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [121]:
movies_windic.iloc[0]

For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame.
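
One possible shape for such a lower-level function is sketched below. It is not the only way to do this, and it has not been benchmarked here. The helper name `indicator_frame` and the three genre strings are hypothetical stand-ins, not the MovieLens data. For comparison, the sketch also calls the string method str.get_dummies, which splits on a separator and builds the indicator columns in one step:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a genres column; not the MovieLens data
genre_strings = pd.Series(['Animation|Comedy', 'Adventure',
                           'Comedy|Romance'])

def indicator_frame(strings, sep='|'):
    """Build a 0/1 membership DataFrame by writing into a NumPy array."""
    split = [s.split(sep) for s in strings]
    labels = sorted({label for parts in split for label in parts})
    position = {label: j for j, label in enumerate(labels)}
    matrix = np.zeros((len(split), len(labels)), dtype=np.int8)
    for i, parts in enumerate(split):
        # Set all of this row's member columns to 1 at once
        matrix[i, [position[p] for p in parts]] = 1
    return pd.DataFrame(matrix, index=strings.index, columns=labels)

dummies = indicator_frame(genre_strings)

# Built-in alternative for delimiter-separated membership strings
builtin = genre_strings.str.get_dummies(sep='|')
```

Writing into a preallocated array avoids repeated label-based assignment into a DataFrame, which is where much of the overhead of the loop-based approach tends to come from.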

In [122]:
# A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345)