In [65]:
import pandas as pd
import numpy as np
from datetime import datetime

# Handling duplicate data

Pandas provides the `.duplicated()` method to facilitate finding duplicate data. This method returns a
Boolean `Series`, where each entry indicates whether or not the row is a duplicate. A `True` value means
that the row has appeared earlier in the `DataFrame`, with all column values identical.

Duplicate rows can be dropped from a DataFrame by using the `.drop_duplicates()` method. This method returns a copy of the DataFrame with the duplicate rows removed.

The default operation is to keep the first row of the duplicates. If you want to keep the last row of the duplicates, use the `keep='last'` parameter.

In [109]:
data = pd.DataFrame({'a': ['x'] * 3 + ['y'] * 4,
                     'b': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,a,b
0,x,1
1,x,1
2,x,2
3,y,3
4,y,3
5,y,4
6,y,4


In [110]:
# reports which rows are duplicates

data.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [111]:
# drop duplicate rows retaining first row of the duplicates

data.drop_duplicates()

Unnamed: 0,a,b
0,x,1
2,x,2
3,y,3
5,y,4
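
The cells above use the default, which keeps the first occurrence. As a minimal sketch on the same frame, the other `keep` options behave as follows: `keep='last'` retains the final occurrence, and `keep=False` drops every row that has a duplicate. The variable names are illustrative only.

```python
import pandas as pd

# Same frame as in the cells above
data = pd.DataFrame({'a': ['x'] * 3 + ['y'] * 4,
                     'b': [1, 1, 2, 3, 3, 4, 4]})

# keep='last' retains the final occurrence of each duplicated row
last_kept = data.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate anywhere in the frame
none_kept = data.drop_duplicates(keep=False)

# duplicated() returns a Boolean Series, so sum() counts the flagged rows
n_duplicates = int(data.duplicated().sum())

print(last_kept.index.tolist())   # [1, 2, 4, 6]
print(none_kept.index.tolist())   # [2]
print(n_duplicates)               # 3
```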


If you want to check for duplicates based on a smaller set of columns, you can pass a list of column names to `.duplicated()` or `.drop_duplicates()`.

In [112]:
# add a column c with unique values so that no row is fully duplicated

data['c'] = range(7)
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [114]:
data.duplicated(['a', 'b'])

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [113]:
# drop rows that are duplicates with respect to columns a and b only

data.drop_duplicates(['a', 'b'])

Unnamed: 0,a,b,c
0,x,1,0
2,x,2,2
3,y,3,3
5,y,4,5
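
A column subset can be combined with `keep`. The sketch below assumes that a larger `c` means a more recent record (an assumption made only for this illustration) and keeps the latest row for each `(a, b)` pair.

```python
import pandas as pd

# Same frame as above, including the extra column c
data = pd.DataFrame({'a': ['x'] * 3 + ['y'] * 4,
                     'b': [1, 1, 2, 3, 3, 4, 4],
                     'c': range(7)})

# Only columns a and b decide what counts as a duplicate;
# keep='last' retains the last row seen for each (a, b) pair
latest = data.drop_duplicates(subset=['a', 'b'], keep='last')

print(latest['c'].tolist())   # [1, 2, 4, 6]
```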


## Transforming data

Another part of tidying data involves transforming the existing data into another presentation. This may be
needed for the following reasons:

- Values are not in the correct units
- Values are qualitative and need to be converted to appropriate numeric values
- There is extraneous data that either wastes memory and processing time, or can affect results simply by being included

To address these situations, we can take one or more of the following actions:

- Map values to other values using a table lookup process with the `.map()` method
- Explicitly replace certain values with other values (or even another type of data) with the `.replace()` method
- Apply functions that transform the values based on an algorithm with the `.apply()` method
- Simply remove extraneous columns and rows
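
The four actions listed above can be sketched on a small frame. The column names and values here are hypothetical and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration
raw = pd.DataFrame({'size': ['small', 'large', 'medium'],
                    'length_cm': [120.0, -1.0, 250.0],
                    'note': ['ok', 'ok', 'ok']})

tidy = raw.copy()

# 1. map: turn qualitative labels into numeric codes via a lookup table
tidy['size'] = tidy['size'].map({'small': 1, 'medium': 2, 'large': 3})

# 2. replace: treat the sentinel value -1.0 as missing
tidy['length_cm'] = tidy['length_cm'].replace(-1.0, float('nan'))

# 3. apply: convert centimetres to metres with a function
tidy['length_m'] = tidy['length_cm'].apply(lambda cm: cm / 100)

# 4. drop: remove columns that are no longer needed
tidy = tidy.drop(columns=['note', 'length_cm'])

print(tidy)
```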

### Mapping data into different values

One of the basic tasks in data transformation is mapping one set of values to another. Pandas provides a
generic way to do this with the `.map()` method, which takes a lookup table in the form of a Python
dictionary or a pandas `Series`.

This method performs the mapping by first matching the values of the outer `Series` with the index labels of
the inner `Series`. It then returns a new `Series`, with the index labels of the outer `Series` but the values
from the inner `Series`.

If pandas does not find a map between the value of the outer `Series` and an index label of the inner `Series`, it fills the value with `NaN`.
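
As a minimal sketch of this matching rule, the example below uses a hypothetical lookup `Series` (the names `answers` and `scores` are invented here). The unmatched value becomes `NaN`, and `.fillna()` is one way to supply a default afterwards.

```python
import pandas as pd

# Hypothetical survey answers (the "outer" Series)
answers = pd.Series(['low', 'high', 'unknown', 'medium'])

# Lookup table (the "inner" Series): index labels are matched against the values above
scores = pd.Series({'low': 1, 'medium': 2, 'high': 3})

# 'unknown' has no matching index label, so it maps to NaN
mapped = answers.map(scores)

# Supply a default for unmatched values, then restore an integer dtype
with_default = mapped.fillna(0).astype(int)

print(mapped.isna().tolist())    # [False, False, True, False]
print(with_default.tolist())     # [1, 3, 0, 2]
```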

In [116]:
# mapping with a dict: keys are the values to look up, dict values are the new values; unmatched values become NaN

df.c1.map({3: 4, 9: 10, 15: 16})

a     NaN
b     4.0
c     NaN
d    10.0
e     NaN
f    16.0
g     NaN
Name: c1, dtype: float64

In [125]:
# mapping with the Series

s = pd.Series({0: 'a', 6: 'b', 12: 'c'})
s

0     a
6     b
12    c
dtype: object

In [126]:
df.c1.map(s)

a      a
b    NaN
c      b
d    NaN
e      c
f    NaN
g    NaN
Name: c1, dtype: object

In [127]:
# it also accepts a function

df.c1.map('I am a {}'.format)

a     I am a 0.0
b     I am a 3.0
c     I am a 6.0
d     I am a 9.0
e    I am a 12.0
f    I am a 15.0
g     I am a nan
Name: c1, dtype: object

In [130]:
# to skip missing values (leaving them as NaN) instead of passing them to the function, use na_action='ignore'

df.c2.map('I am a {}'.format, na_action='ignore')

a            NaN
b     I am a 4.0
c            NaN
d    I am a 10.0
e    I am a 13.0
f    I am a 16.0
g            NaN
Name: c2, dtype: object

### Replacing values

The most basic use of the `.replace()` method is to replace an individual value with another.

In [132]:
# replace a single value

df.c1.replace(6, 222)

a      0.0
b      3.0
c    222.0
d      9.0
e     12.0
f     15.0
g      NaN
Name: c1, dtype: float64

In [135]:
# replace several values; the two lists must have the same length

df.c1.replace([0, 6, 12, np.nan], ['I', 'am', 'feeling', 'good'])

a          I
b        3.0
c         am
d        9.0
e    feeling
f       15.0
g       good
Name: c1, dtype: object

In [138]:
# replace every listed value anywhere in the DataFrame with a single value

df.replace([12, 15, 10, 13, 16, 11, 14, 17, 20, 18], 256)

Unnamed: 0,c1,c2,c3,c4,c5,timestamp
a,0.0,,2.0,256.0,,NaT
b,3.0,4.0,5.0,,,2012-01-01
c,6.0,,8.0,,,NaT
d,9.0,256.0,256.0,,,2012-01-01
e,256.0,256.0,256.0,,,2012-01-01
f,256.0,256.0,256.0,256.0,,2012-01-01
g,,,,,,2012-01-01


In [140]:
# replace using entries in a dictionary

df.replace({12: 'twelve', 
            15: 'fifteen', 
            10: 'ten', 
            13: 'thirteen', 
            16: 'sixteen', 
            11: 'eleven', 
            14: 'fourteen'
           }
          )

Unnamed: 0,c1,c2,c3,c4,c5,timestamp
a,0.0,,2.0,20.0,,NaT
b,3.0,4.0,5.0,,,2012-01-01
c,6.0,,8.0,,,NaT
d,9.0,ten,eleven,,,2012-01-01
e,twelve,thirteen,fourteen,,,2012-01-01
f,fifteen,sixteen,17.0,18.0,,2012-01-01
g,,,,,,2012-01-01


In [147]:
# specify which value to replace in each column: 12 in c1 and 13 in c2 both become 256

df.replace({'c1': 12, 'c2': 13}, 256)

Unnamed: 0,c1,c2,c3,c4,c5,timestamp
a,0.0,,2.0,20.0,,NaT
b,3.0,4.0,5.0,,,2012-01-01
c,6.0,,8.0,,,NaT
d,9.0,10.0,11.0,,,2012-01-01
e,256.0,256.0,14.0,,,2012-01-01
f,15.0,16.0,17.0,18.0,,2012-01-01
g,,,,,,2012-01-01
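
The call above replaces a different value per column but uses the same new value everywhere. If each column also needs its own new value, a nested dictionary can be passed. The sketch below uses a small hypothetical frame in place of `df`.

```python
import pandas as pd

# Hypothetical stand-in for two columns of df
frame = pd.DataFrame({'c1': [9.0, 12.0, 15.0],
                      'c2': [10.0, 13.0, 12.0]})

# Outer keys are column names; each inner dict maps old value -> new value
replaced = frame.replace({'c1': {12: 256}, 'c2': {13: 512}})

print(replaced['c1'].tolist())   # [9.0, 256.0, 15.0]
print(replaced['c2'].tolist())   # [10.0, 512.0, 12.0]
```

Note that the 12 in column `c2` is left untouched, because the rule for 12 applies only to `c1`.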


In [150]:
# replace the values 1, 2, and 3 (values, not index labels) by padding with the most recent preceding value

df.c1.replace([1, 2, 3], method='pad')

a     0.0
b     0.0
c     6.0
d     9.0
e    12.0
f    15.0
g     NaN
Name: c1, dtype: float64
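
To the best of my knowledge, the `method` argument of `.replace()` is deprecated in recent pandas releases (2.1 and later), so the cell above may emit a warning or fail there. A possible equivalent, sketched on a hypothetical stand-in for `df.c1`, masks the unwanted values and forward-fills only those positions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.c1
c1 = pd.Series([0.0, 3.0, 6.0, 9.0, 12.0, 15.0, np.nan], index=list('abcdefg'))

to_fill = c1.isin([1, 2, 3])          # positions holding a value to replace
padded = c1.mask(to_fill).ffill()     # blank those positions, then forward-fill
result = c1.where(~to_fill, padded)   # use the filled value only at those positions

print(result)
```

Using `.where()` in the last step leaves the existing `NaN` at label `g` untouched, which matches the output shown above.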

### Applying functions to transform data

In situations where a direct mapping or substitution will not suffice, you can apply a function that computes the new values. Pandas can apply functions to individual items, entire columns, or entire rows, which makes transformations very flexible.

When given a Python function, the `.apply()` method of a `Series` calls the function once for each value in the `Series`. If applied to a `DataFrame`, pandas passes in each column as a `Series`, or, if applied with `axis=1`, a `Series` representing each row.

In [153]:
df.c1.apply(lambda x: x ** 2)

a      0.0
b      9.0
c     36.0
d     81.0
e    144.0
f    225.0
g      NaN
Name: c1, dtype: float64

When a function is applied to a `DataFrame`, the default is to apply it to each column. Pandas iterates through all the columns, passing each as a `Series` to your function. If the function returns a `Series` of the same length, as in the first example below, the result is a `DataFrame`. If the function returns a scalar, as in the `sum` examples, the result is a `Series` whose index labels match the column names.

In [156]:
df.drop(columns='timestamp').apply(lambda x: x ** 2)

Unnamed: 0,c1,c2,c3,c4,c5
a,0.0,,4.0,400.0,
b,9.0,16.0,25.0,,
c,36.0,,64.0,,
d,81.0,100.0,121.0,,
e,144.0,169.0,196.0,,
f,225.0,256.0,289.0,324.0,
g,,,,,


In [157]:
df.drop(columns='timestamp').apply(lambda x: x.sum())

c1    45.0
c2    43.0
c3    57.0
c4    38.0
c5     0.0
dtype: float64

In [164]:
df.drop(columns='timestamp').apply(np.sum)

c1    45.0
c2    43.0
c3    57.0
c4    38.0
c5     0.0
dtype: float64

In [158]:
# application of the function can be switched to the values from each row by specifying axis=1

df.drop(columns='timestamp').apply(lambda x: x.sum(), axis=1)

a    22.0
b    12.0
c    14.0
d    30.0
e    39.0
f    66.0
g     0.0
dtype: float64
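
The same function can be applied along either axis. The sketch below uses a small hypothetical frame and an invented helper, `value_range`, to show that the function receives a whole column by default and a whole row with `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
frame = pd.DataFrame({'c1': [0.0, 3.0, np.nan],
                      'c2': [np.nan, 4.0, np.nan],
                      'c3': [2.0, 5.0, np.nan]},
                     index=list('abc'))

def value_range(values):
    """Spread between the largest and smallest non-missing value."""
    return values.max() - values.min()

col_spread = frame.apply(value_range)           # one call per column (axis=0)
row_spread = frame.apply(value_range, axis=1)   # one call per row

print(col_spread.tolist())   # [3.0, 0.0, 3.0]
print(row_spread)
```

Row `c` contains only missing values, so its result is `NaN`.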

The `.applymap()` method of `DataFrame` applies a function that accepts and returns a scalar to every individual element of the `DataFrame`. In pandas 2.1 and later, the same operation is available as `DataFrame.map()`, and `.applymap()` is deprecated.

In [161]:
df.drop(columns='timestamp').applymap(lambda x: '%.2f' % x)

Unnamed: 0,c1,c2,c3,c4,c5
a,0.0,,2.0,20.0,
b,3.0,4.0,5.0,,
c,6.0,,8.0,,
d,9.0,10.0,11.0,,
e,12.0,13.0,14.0,,
f,15.0,16.0,17.0,18.0,
g,,,,,
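
Because the element-wise method has different names across pandas versions, one option is to pick whichever is available. This is a sketch on a small hypothetical frame, not the notebook's `df`; the helper name `fmt` is invented here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for a slice of df
frame = pd.DataFrame({'c1': [0.0, 3.0], 'c2': [np.nan, 4.0]}, index=['a', 'b'])

def fmt(x):
    """Format a single number with two decimal places."""
    return '%.2f' % x

# DataFrame.map is the newer name; fall back to applymap on older versions
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(fmt)

print(formatted)
```

Every element of the result is a string, so missing values appear as the text `'nan'` rather than as `NaN`.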
