# Missing Data

When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases in the data caused by missing values.

Some functions related to missing data handling:
- dropna: Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
- fillna: Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
- isnull: Return boolean values indicating which values are missing/NA.
- notnull: Negation of isnull.
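
As a rough illustration of analyzing the missing data itself, the sketch below counts missing values per column with isnull. The frame and its column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
frame = pd.DataFrame({"height": [1.7, np.nan, 1.8, 1.6],
                      "weight": [np.nan, np.nan, 80.0, 55.0]})

# isnull gives a boolean frame; summing counts the True values per column
missing_counts = frame.isnull().sum()

# The mean of a boolean column is the fraction of missing entries
missing_share = frame.isnull().mean()
```

A column with an unexpectedly high share of missing values is often the first hint of a data collection problem.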

## Filtering Out Missing Data


There are a few ways to filter out missing data. While you always have the option to
do it by hand using pandas.isnull and boolean indexing, the dropna method can be helpful.
On a Series, it returns the Series with only the non-null data and index values:

In [1]:
import pandas as pd
import numpy as np
from numpy import nan as NA

In [2]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [3]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:


In [4]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, dropna by default drops any row containing a missing value:

In [5]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()

In [6]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [7]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' will only drop rows that are all NA:

In [8]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass axis=1:

In [9]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [10]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Suppose you want to keep only rows containing a certain number of observations. You can
indicate this with the thresh argument:

In [24]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,1.671786,,
1,-0.65444,,
2,1.015331,,-1.211759
3,-0.06222,,-0.613826
4,1.242877,1.20811,-0.236407
5,-2.276661,-1.435561,-1.731508
6,0.211602,0.653685,1.235571


In [13]:
df.dropna()

Unnamed: 0,0,1,2
4,1.242877,1.20811,-0.236407
5,-2.276661,-1.435561,-1.731508
6,0.211602,0.653685,1.235571


In [16]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,1.015331,,-1.211759
3,-0.06222,,-0.613826
4,1.242877,1.20811,-0.236407
5,-2.276661,-1.435561,-1.731508
6,0.211602,0.653685,1.235571
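
To make the meaning of thresh concrete, here is a small sketch on a hand-written frame (the values are invented for illustration). It shows that thresh counts non-NA values per row, and that the same filter can be written by hand with notnull.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 5.0],
                      [3.0, 4.0, 6.0]])

# thresh=2 keeps rows with at least two non-NA values
kept = frame.dropna(thresh=2)

# The same filter written by hand with notnull and boolean indexing
non_na_per_row = frame.notnull().sum(axis=1)
by_hand = frame[non_na_per_row >= 2]
```

Row 0 has only one observation and is dropped; rows 1 and 2 survive.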


## Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a
constant replaces missing values with that value:

In [17]:
df.fillna(0)

Unnamed: 0,0,1,2
0,1.671786,0.0,0.0
1,-0.65444,0.0,0.0
2,1.015331,0.0,-1.211759
3,-0.06222,0.0,-0.613826
4,1.242877,1.20811,-0.236407
5,-2.276661,-1.435561,-1.731508
6,0.211602,0.653685,1.235571


Calling fillna with a dict, you can use a different fill value for each column:

In [18]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,1.671786,0.5,0.0
1,-0.65444,0.5,0.0
2,1.015331,0.5,-1.211759
3,-0.06222,0.5,-0.613826
4,1.242877,1.20811,-0.236407
5,-2.276661,-1.435561,-1.731508
6,0.211602,0.653685,1.235571


fillna returns a new object, but you can modify the existing object in-place:

In [26]:
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,1.671786,0.0,0.0
1,-0.65444,0.0,0.0
2,1.015331,0.0,-1.211759
3,-0.06222,0.0,-0.613826
4,1.242877,1.20811,-0.236407
5,-2.276661,-1.435561,-1.731508
6,0.211602,0.653685,1.235571


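The same interpolation methods available for reindexing can be used with fillna. Recent pandas releases deprecate the method argument of fillna in favor of the equivalent ffill and bfill methods, as in the last example here: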


In [27]:
df = pd.DataFrame(np.random.randn(6, 3))

In [28]:
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.297646,0.645862,0.181841
1,0.580929,0.724424,1.056036
2,1.647219,,0.45611
3,-0.921719,,0.459025
4,1.084724,,
5,-1.559651,,


In [29]:
df.fillna(method='ffill')


Unnamed: 0,0,1,2
0,-0.297646,0.645862,0.181841
1,0.580929,0.724424,1.056036
2,1.647219,0.724424,0.45611
3,-0.921719,0.724424,0.459025
4,1.084724,0.724424,0.459025
5,-1.559651,0.724424,0.459025


In [32]:
df.fillna(method='ffill', limit=2)


Unnamed: 0,0,1,2
0,-0.297646,0.645862,0.181841
1,0.580929,0.724424,1.056036
2,1.647219,0.724424,0.45611
3,-0.921719,0.724424,0.459025
4,1.084724,,0.459025
5,-1.559651,,0.459025


In [33]:
df.ffill(limit=3)

Unnamed: 0,0,1,2
0,-0.297646,0.645862,0.181841
1,0.580929,0.724424,1.056036
2,1.647219,0.724424,0.45611
3,-0.921719,0.724424,0.459025
4,1.084724,0.724424,0.459025
5,-1.559651,,0.459025
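
The effect of limit is easier to see on a short Series. The sketch below, on invented values, fills forward and backward with limit=1, so only one value next to each known observation is filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: propagate the last valid value, at most one step
forward = s.ffill(limit=1)

# Backward fill: propagate the next valid value, at most one step
backward = s.bfill(limit=1)
```

The middle of the gap stays NA in both results because it is more than one step away from any observed value.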


With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

In [30]:
data = pd.Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [31]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
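
The same idea extends to a DataFrame. As a sketch on made-up data, passing the Series of column means to fillna fills each column with its own mean, since fillna matches the fill values to columns by label.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                      "y": [np.nan, 4.0, 8.0]})

# frame.mean() is a Series indexed by column name (NA values are skipped)
column_means = frame.mean()

# Each column's holes are filled with that column's mean
filled = frame.fillna(column_means)
```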

### fillna function arguments

- _value_ Scalar value or dict-like object to use to fill missing values
- _method_ Interpolation method ('ffill' or 'bfill') to use when no fill value is given
- _axis_ Axis to fill on; default axis=0
- _inplace_ Modify the calling object without producing a copy
- _limit_ For forward and backward filling, maximum number of consecutive periods to fill


# Data Transformation

## Removing Duplicates

In [34]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method duplicated returns a boolean Series indicating whether each
row is a duplicate (has been observed in a previous row) or not:

In [35]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is
False:

In [36]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider all of the columns; alternatively, you can
specify any subset of them to detect duplicates. Suppose we had an additional column
of values and wanted to filter duplicates only based on the 'k1' column:

In [40]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [38]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination. 
Passing keep='last' will return the last one:

In [39]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
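
Before dropping anything, it can help to look at every row that belongs to a duplicate group. The sketch below rebuilds the k1/k2 frame and uses keep=False, an option of duplicated not shown above, which flags all members of a group rather than only the repeats.

```python
import pandas as pd

frame = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                      "k2": [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks every row that has a twin, including the first occurrence
in_duplicate_group = frame.duplicated(keep=False)
duplicate_rows = frame[in_duplicate_group]

# With the default keep='first', only the later repeats are counted
n_extra = int(frame.duplicated().sum())
```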


## Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. 
Consider the following hypothetical data collected about various kinds of meat:

In [41]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                                'Pastrami', 'corned beef', 'Bacon',
                                'pastrami', 'honey ham', 'nova lox'],
                       'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. 
Let’s write down a mapping of each distinct meat type to the kind of animal:


In [42]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping,
but here we have a small problem in that some of the meats are capitalized and
others are not. Thus, we need to convert each value to lowercase using the str.lower
Series method:

In [43]:
lowercased = data['food'].str.lower()

In [44]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also have passed a function that does all the work:

In [45]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other
data cleaning–related operations.
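
One difference between the two approaches is what happens to a value with no entry in the mapping. In the sketch below, 'tofu' is a made-up entry added only to show this: mapping with a dict yields NA for it, while a function that indexes the dict directly would raise a KeyError, so the function version uses dict.get with a fallback.

```python
import pandas as pd

foods = pd.Series(["bacon", "Pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Dict mapping: values missing from the dict become NA
mapped = foods.str.lower().map(meat_to_animal)

# Function mapping: dict.get supplies a default instead of raising KeyError
via_function = foods.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))
```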

## Replacing Values


map can be used to modify a subset of values in
an object, but replace provides a simpler and more flexible way to do so:

In [46]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series (unless
you pass inplace=True):

In [47]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the
substitute value:

In [48]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [49]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [50]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The data.replace method is distinct from data.str.replace,
which performs string substitution element-wise.
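
A short sketch of that distinction, on invented strings: replace swaps values that are equal to the target as a whole, while str.replace substitutes inside each string.

```python
import pandas as pd

s = pd.Series(["n/a", "n/a yet", "ok"])

# replace: only elements equal to 'n/a' are changed
whole_values = s.replace("n/a", "missing")

# str.replace: every occurrence of the substring is changed;
# regex=False treats the pattern as a literal string
substrings = s.str.replace("n/a", "missing", regex=False)
```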

## Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. 
You can also modify the axes in-place without creating a new data structure.

In [51]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [54]:
transform = lambda x: x[:4].upper()
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign to index, modifying the DataFrame in-place:

In [56]:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:

In [57]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [58]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


rename saves you from the chore of copying the DataFrame manually and assigning
to its index and columns attributes. Should you wish to modify a dataset in-place, pass inplace=True:

In [59]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
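
To check that rename leaves the original untouched unless inplace=True is passed, here is a minimal sketch on a small made-up frame. It mixes a function for the index with a dict for the columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(4).reshape((2, 2)),
                     index=["ohio", "utah"],
                     columns=["one", "two"])

# A function is applied to every label; a dict changes only the labels it names
renamed = frame.rename(index=str.upper, columns={"one": "first"})

# The original labels are still in place on frame
original_index = list(frame.index)
```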


## Discretization and Binning


Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you want to group
them into discrete age buckets:

In [60]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older.
To do so, you have to use cut, a function in pandas:

In [61]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. 
The output you see describes the bins computed by pandas.cut. 
You can treat it like an array of strings indicating the bin name; 
internally it contains a categories array specifying the distinct category names along with a labeling for the ages data in the codes attribute:

In [62]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [63]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

Consistent with mathematical notation for intervals, a parenthesis means that the side
is open, while the square bracket means it is closed (inclusive). You can change which
side is closed by passing right=False:

In [64]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
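
The closed side matters only for values that sit exactly on a bin edge. As a sketch, the single age 25 lands in different bins depending on right; the edges here are a shortened version of the ones above, chosen for illustration.

```python
import pandas as pd

edges = [18, 25, 35]

# Default right=True: bins are (18, 25] and (25, 35], so 25 is in the first
right_closed = pd.cut([25], edges)

# right=False: bins are [18, 25) and [25, 35), so 25 moves to the second
left_closed = pd.cut([25], edges, right=False)

first_code = right_closed.codes[0]
second_code = left_closed.codes[0]
```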

You can also pass your own bin names by passing a list or array to the labels option:

In [65]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
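
Once the ages are labeled, counting the members of each bucket is a one-liner. The sketch below wraps the Categorical in a Series (one reasonable choice among several) and calls value_counts.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

labeled = pd.cut(ages, bins, labels=group_names)

# Number of people falling into each named bucket
counts = pd.Series(labeled).value_counts()
```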

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data chopped into fourths:

In [66]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.48, 0.71], (0.24, 0.48], (0.71, 0.95], (0.24, 0.48], (0.71, 0.95], ..., (0.48, 0.71], (0.48, 0.71], (0.24, 0.48], (0.48, 0.71], (0.71, 0.95]]
Length: 20
Categories (4, interval[float64, right]): [(0.0045, 0.24] < (0.24, 0.48] < (0.48, 0.71] < (0.71, 0.95]]

__The precision=2 option limits the decimal precision to two digits.__


A closely related function, qcut, bins the data based on sample quantiles. Depending
on the distribution of the data, using cut will not usually result in each bin having the
same number of data points. Since qcut uses sample quantiles instead, by definition
you will obtain roughly equal-size bins:

In [67]:
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats

[(-0.738, -0.0572], (0.621, 2.791], (-0.0572, 0.621], (-3.0789999999999997, -0.738], (-0.0572, 0.621], ..., (-3.0789999999999997, -0.738], (0.621, 2.791], (-0.0572, 0.621], (-3.0789999999999997, -0.738], (-3.0789999999999997, -0.738]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.0789999999999997, -0.738] < (-0.738, -0.0572] < (-0.0572, 0.621] < (0.621, 2.791]]

In [69]:
cats.value_counts()

(-3.0789999999999997, -0.738]    250
(-0.738, -0.0572]                250
(-0.0572, 0.621]                 250
(0.621, 2.791]                   250
Name: count, dtype: int64

Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):


In [70]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.366, -0.0572], (1.286, 2.791], (-0.0572, 1.286], (-1.366, -0.0572], (-0.0572, 1.286], ..., (-3.0789999999999997, -1.366], (-0.0572, 1.286], (-0.0572, 1.286], (-1.366, -0.0572], (-3.0789999999999997, -1.366]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.0789999999999997, -1.366] < (-1.366, -0.0572] < (-0.0572, 1.286] < (1.286, 2.791]]

__These discretization functions are especially useful for quantile and group analysis.__
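
The difference between cut and qcut shows up clearly on skewed data. The sketch below uses a deterministic, made-up sample (the squares of 1 to 100) so the bin counts can be compared directly.

```python
import numpy as np
import pandas as pd

# Skewed sample: small values are much denser than large ones
data = np.arange(1, 101) ** 2

# cut: four bins of equal width, so the counts are uneven
equal_width = pd.Series(pd.cut(data, 4)).value_counts().sort_index()

# qcut: four bins with edges at the sample quartiles, so counts are equal
equal_freq = pd.Series(pd.qcut(data, 4)).value_counts().sort_index()
```

Here half of the points fall into the first equal-width bin, while each quartile bin holds exactly 25 points.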

## Detecting and Filtering Outliers

In [71]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.020634,-0.034581,-0.005099,0.02464
std,1.011342,0.992569,0.995622,1.013271
min,-2.836871,-3.804096,-2.858322,-3.379322
25%,-0.714497,-0.683013,-0.633063,-0.673437
50%,-0.024782,-0.03227,0.004909,0.024138
75%,0.649325,0.619053,0.61575,0.676557
max,3.892747,3.416569,3.058031,3.187547


Suppose you wanted to find values in one of the columns exceeding 3 in absolute
value:

In [72]:
col = data[2]
col[np.abs(col) > 3]

653    3.022068
664    3.058031
Name: 2, dtype: float64

Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:

In [75]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.021549,-0.034194,-0.005179,0.024825
std,1.008208,0.988467,0.995378,1.011475
min,-2.836871,-3.0,-2.858322,-3.0
25%,-0.714497,-0.683013,-0.633063,-0.673437
50%,-0.024782,-0.03227,0.004909,0.024138
75%,0.649325,0.619053,0.61575,0.676557
max,3.0,3.0,3.0,3.0


__The statement np.sign(data) produces 1 and –1 values based on whether the values in data are positive or negative:__


In [76]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,-1.0,-1.0,1.0,1.0
1,-1.0,-1.0,1.0,-1.0
2,-1.0,1.0,1.0,-1.0
3,-1.0,1.0,1.0,1.0
4,-1.0,1.0,1.0,-1.0
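
As a sketch on a small hand-written frame, the capping above can be compared with the clip method, which is a more direct way to express the same operation. The any method is also used to select every row that contains at least one value beyond the threshold.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [-4.0, 0.5, 3.5, 1.0],
                      "b": [1.0, -3.2, 2.0, 0.0]})

# Rows with at least one value exceeding 3 in absolute value
outlier_rows = frame[(np.abs(frame) > 3).any(axis=1)]

# Capping with sign, as in the text
capped = frame.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# The same result using clip with lower and upper bounds
clipped = frame.clip(lower=-3, upper=3)
```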


## Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do
using the numpy.random.permutation function. Calling permutation with the length
of the axis you want to permute produces an array of integers indicating the new
ordering:


In [77]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler

array([4, 2, 0, 3, 1])

That array can then be used in iloc-based indexing or the equivalent take function:


In [78]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [79]:
df.take(sampler)

Unnamed: 0,0,1,2,3
4,16,17,18,19
2,8,9,10,11
0,0,1,2,3
3,12,13,14,15
1,4,5,6,7


To select a random subset without replacement, you can use the sample method on
Series and DataFrame:

In [80]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
4,16,17,18,19


To generate a sample with replacement (to allow repeat choices), pass replace=True
to sample:

In [81]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws

0    5
4    4
0    5
0    5
1    7
4    4
3    6
2   -1
2   -1
1    7
dtype: int64
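
Because these operations are random, results change from run to run. A minimal sketch of making them repeatable, assuming a seeded NumPy Generator for the permutation and the random_state argument of sample (the seed values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A seeded generator produces the same permutation on every run
rng = np.random.default_rng(12345)
sampler = rng.permutation(5)
shuffled = df.take(sampler)

# random_state fixes the rows chosen by sample
first_draw = df.sample(n=3, random_state=0)
second_draw = df.sample(n=3, random_state=0)
```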

## Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. 
If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. 
pandas has a _get_dummies_ function for doing this, though devising one yourself is not difficult:

In [82]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [83]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,False,True,False
1,False,True,False
2,True,False,False
3,False,False,True
4,True,False,False
5,False,True,False
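
To show what devising one yourself might look like, the sketch below builds the indicator columns by comparing the key column against each distinct value. It also passes dtype=int to get_dummies, which yields 1s and 0s instead of the booleans shown above. The helper names are illustrative only.

```python
import pandas as pd

keys = pd.Series(["b", "b", "a", "c", "a", "b"])

# One column per distinct value: 1 where the row has that value, else 0
categories = sorted(keys.unique())
manual = pd.DataFrame({c: (keys == c).astype(int) for c in categories})

# Built-in equivalent, requesting integer indicators
builtin = pd.get_dummies(keys, dtype=int)
```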


In some cases, you may want to add a prefix to the columns in the indicator DataFrame,
which can then be merged with the other data. _get_dummies_ has a prefix argument for doing this:

In [84]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy 

Unnamed: 0,data1,key_a,key_b,key_c
0,0,False,True,False
1,1,False,True,False
2,2,True,False,False
3,3,False,False,True
4,4,True,False,False
5,5,False,True,False


__A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:__

In [85]:
np.random.seed(12345)
values = np.random.rand(10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [86]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,False,False,False,False,True
1,False,True,False,False,False
2,True,False,False,False,False
3,False,True,False,False,False
4,False,False,True,False,False
5,False,False,True,False,False
6,False,False,False,False,True
7,False,False,False,True,False
8,False,False,False,True,False
9,False,False,False,True,False


# String Manipulation

## String Object Methods

In many string munging and scripting applications, built-in string methods are sufficient. 
As an example, a comma-separated string can be broken into pieces with split:

In [87]:
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including line breaks):

In [88]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using
addition:

In [89]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

But this isn’t a practical generic method. A faster and more Pythonic way is to pass a
list or tuple to the join method on the string '::':

In [90]:
'::'.join(pieces)

'a::b::guido'

__Other methods are concerned with locating substrings. Using Python’s in keyword is
the best way to detect a substring, though index and find can also be used:__

In [91]:
'guido' in val

True

In [92]:
val.index(',')

1

In [93]:
val.find(':')

-1

In [94]:
# count returns the number of occurrences of a particular substring:
val.count(',')

2

replace will substitute occurrences of one pattern for another. It is commonly used
to delete patterns, too, by passing an empty string:


In [95]:
val.replace(',', '')

'ab guido'

## Regex

In [97]:
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

__Creating a regex object with re.compile is highly recommended if you intend to
apply the same expression to many strings; doing so will save CPU cycles.__

While findall returns all matches in a string, search returns only the first match.
More rigidly, match only matches at the beginning of the string.
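
A small sketch of the difference, using a shortened version of the text above. It also shows sub, which returns a new string with each match replaced; 'REDACTED' is an arbitrary placeholder.

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com\n"
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
regex = re.compile(pattern, flags=re.IGNORECASE)

# search scans the string and returns a match object for the first hit
first = regex.search(text)
first_address = text[first.start():first.end()]

# match only succeeds if the pattern occurs at the very start of the string
at_start = regex.match(text)

# sub replaces every occurrence of the pattern
redacted = regex.sub("REDACTED", text)
```

Here match returns None because the text begins with a name rather than an email address.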

## Vectorized String Functions in pandas

Cleaning up a messy dataset for analysis often requires a lot of string munging and
regularization. To complicate matters, a column containing strings will sometimes
have missing data:

In [98]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [99]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

You can apply string and regular expression methods (passing a
lambda or other function) to each value using data.map, but it will fail on the NA
(null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values.
These are accessed through Series’s str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains:


In [100]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

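Regular expressions can be used, too, along with any re options like IGNORECASE. Here str.match tests whether each string matches the pattern at its beginning:


In [102]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute. Both expect elements that are strings or lists, so they do not apply to the boolean matches above and are left commented out:

In [106]:
# matches.str.get(1)
# matches.str[0]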
You can similarly slice strings using this syntax:


In [107]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

### Vectorized string methods

_cat_ Concatenate strings element-wise with optional delimiter

_contains_ Return boolean array indicating whether each string contains pattern/regex

_count_ Count occurrences of pattern

_extract_ Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group

_endswith_ Equivalent to x.endswith(pattern) for each element

_startswith_ Equivalent to x.startswith(pattern) for each element

_findall_ Compute list of all occurrences of pattern/regex for each string

_get_ Index into each element (retrieve i-th element)

_isalnum_ Equivalent to built-in str.isalnum

_isalpha_ Equivalent to built-in str.isalpha

_isdecimal_ Equivalent to built-in str.isdecimal

_isdigit_ Equivalent to built-in str.isdigit

_islower_ Equivalent to built-in str.islower

_isnumeric_ Equivalent to built-in str.isnumeric

_isupper_ Equivalent to built-in str.isupper

_join_ Join strings in each element of the Series with passed separator

_len_ Compute length of each string

_lower_, _upper_ Convert cases; equivalent to x.lower() or x.upper() for each element

_match_ Use re.match with the passed regular expression on each element, returning True or False for whether it matches

_pad_ Add whitespace to left, right, or both sides of strings

_center_ Equivalent to pad(side='both')

_repeat_ Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)

_replace_ Replace occurrences of pattern/regex with some other string

_slice_ Slice each string in the Series

_split_ Split strings on delimiter or regular expression

_strip_ Trim whitespace from both sides, including newlines

_rstrip_ Trim whitespace on right side

_lstrip_ Trim whitespace on left side
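
As an example of one entry from this list, the sketch below uses extract with a grouped version of the email pattern, which is an adaptation made for this illustration. Each parenthesized group becomes one column of the resulting DataFrame.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})

# Three groups: username, domain name, and domain suffix
grouped_pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# One row per element, one column per group; NA input gives an all-NA row
parts = data.str.extract(grouped_pattern, flags=re.IGNORECASE)
```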

# Data cleaning examples

# Cats

### Read csv

In [140]:
df = pd.read_csv('mcats.csv')

In [141]:
df

Unnamed: 0,ID,Breed,Location of origin,Type,Body type,Coat type and length,Coat pattern
0,1,Abyssinian,"Unspecified, but somewhere in Afro-Asia, likel...",Natural,Semi-foreign,Short,Ticked tabby
1,2,Aegean,Greece,Natural,Moderate,Semi-long,Multi-color
2,3,American Bobtail,United States,Mutation of shortened tail,Cobby,Semi-long,All
3,4,American Curl,United States,Mutation,Semi-foreign,Semi-long,All
4,5,American Ringtail,United States,Mutation,Foreign,Semi-long,All
...,...,...,...,...,...,...,...
100,101,York Chocolate,"New York, United States",Natural,Moderate,Long,"Solid chocolate, solid lilac and solid taupe o..."
101,101,York Chocolate,"New York, United States",Natural,Moderate,Long,"Solid chocolate, solid lilac and solid taupe o..."
102,101,York Chocolate,"New York, United States",Natural,Moderate,Long,"Solid chocolate, solid lilac and solid taupe o..."
103,99,Turkish Vankedisi (white variety of Turkish Van),"Lake Van, Turkey",Natural,Svelte,Long,Solid white


### Drop duplicates

In [142]:
df = df.drop_duplicates()

In [143]:
df.tail(10)

Unnamed: 0,ID,Breed,Location of origin,Type,Body type,Coat type and length,Coat pattern
91,92,"Thai or Traditional, Classic, or Old-style Sia...",Developed in Europe;,Natural,Moderate,Short,Colorpoint
92,93,"Thai Lilac, Thai Blue Point and Thai Lilac Point",Thailand,Color varieties of the Korat,Moderate,Short,Solid lilac and colorpoint (blue point and lil...
93,94,Tonkinese,"Canada, United States",Crossbreed between the Burmese and Siamese,Oriental,Short,"Colorpoint, mink, or solid"
94,95,Toybob,Russia,Mutation,Dwarf,Short,All
95,96,Toyger,United States,Crossbreed/hybrid between the Bengal and short...,Moderate,Short,Mackerel tabby
96,97,Turkish Angora,Turkey,Natural,Semi-cobby,Semi-long,All
97,98,Turkish Van,Developed in United Kingdom; foundation stock ...,Natural,Semi-cobby,Semi-long,Van pattern
98,99,Turkish Vankedisi (white variety of Turkish Van),"Lake Van, Turkey",Natural,Svelte,Long,Solid white
99,100,Ukrainian Levkoy,Ukraine,Crossbreed between the Donskoy and Scottish Fold,Moderate,Hairless,Solid gray
100,101,York Chocolate,"New York, United States",Natural,Moderate,Long,"Solid chocolate, solid lilac and solid taupe o..."
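
Note that the row labels keep their original values after the duplicates are dropped. The sketch below uses a tiny stand-in frame (not the real mcats.csv) to count the duplicates first and then renumber the rows with reset_index, which is optional.

```python
import pandas as pd

# Hypothetical miniature of the breeds table, with two repeated rows
breeds = pd.DataFrame({"ID": [1, 2, 3, 3, 2],
                       "Breed": ["Abyssinian", "Aegean", "York Chocolate",
                                 "York Chocolate", "Aegean"]})

# How many rows repeat an earlier row exactly
n_duplicates = int(breeds.duplicated().sum())

# Drop the repeats, then renumber the rows from 0 without keeping old labels
deduplicated = breeds.drop_duplicates().reset_index(drop=True)
```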


### Drop column 

In [144]:
df = df.drop(columns='ID')
df

Unnamed: 0,Breed,Location of origin,Type,Body type,Coat type and length,Coat pattern
0,Abyssinian,"Unspecified, but somewhere in Afro-Asia, likel...",Natural,Semi-foreign,Short,Ticked tabby
1,Aegean,Greece,Natural,Moderate,Semi-long,Multi-color
2,American Bobtail,United States,Mutation of shortened tail,Cobby,Semi-long,All
3,American Curl,United States,Mutation,Semi-foreign,Semi-long,All
4,American Ringtail,United States,Mutation,Foreign,Semi-long,All
...,...,...,...,...,...,...
96,Turkish Angora,Turkey,Natural,Semi-cobby,Semi-long,All
97,Turkish Van,Developed in United Kingdom; foundation stock ...,Natural,Semi-cobby,Semi-long,Van pattern
98,Turkish Vankedisi (white variety of Turkish Van),"Lake Van, Turkey",Natural,Svelte,Long,Solid white
99,Ukrainian Levkoy,Ukraine,Crossbreed between the Donskoy and Scottish Fold,Moderate,Hairless,Solid gray
