In [65]:
import pandas as pd
import numpy as np
from datetime import datetime

# Pandas tidying data

While tidying data, we need to cope with the following cases:

- The names of the variables are different from what you require
- There is missing data
- Values are not in the units that you require
- The period of sampling of records is not what you need
- Variables are categorical and you need quantitative values
- There is noise in the data
- Information is of an incorrect type
- Data is organized around incorrect axes
- Data is at the wrong level of normalization
- Data is duplicated

## Working with missing data

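If you want to consider `inf` and `-inf` to be “NA” in computations, you can set `pandas.options.mode.use_inf_as_na = True`. Note that this option is deprecated in recent pandas releases, where converting infinities to `NaN` explicitly is the safer approach.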

Because `NaN` is a float, a column of *integers* with even one missing value is cast to a floating-point dtype.

For `datetime64[ns]` types, `NaT` represents missing values.
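
The dtype rules above can be checked with a minimal sketch. The variable names are illustrative only, and the nullable `Int64` extension dtype and the explicit `replace()` of infinities are shown as possible alternatives, not as something this notebook relies on:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                  # plain integer column
with_gap = pd.Series([1, 2, np.nan])         # one NaN forces a float dtype
nullable = pd.Series([1, 2, None], dtype="Int64")  # nullable integer dtype keeps integers

# None in datetime data becomes NaT
dates = pd.Series(pd.to_datetime(["2012-01-01", None]))

# infinities can be turned into NaN explicitly instead of relying on a global option
with_inf = pd.Series([1.0, np.inf, -np.inf])
cleaned = with_inf.replace([np.inf, -np.inf], np.nan)

print(ints.dtype, with_gap.dtype, nullable.dtype)
print(dates.isna().tolist())
print(cleaned.isna().sum())
```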

Create a test DataFrame containing:
- One row consisting only of NaN values
- One column consisting only of NaN values
- Several rows and columns consisting of both numeric values and NaN values

In [91]:
df = pd.DataFrame(np.arange(0, 15).reshape(5, 3),
                  index=list('abcde'),
                  columns=['c1', 'c2', 'c3'])
df = df.assign(c4=np.nan)
df.loc['f'] = np.arange(15, 19)
df = df.assign(c5=np.nan)
df.loc['g'] = np.nan
df.loc['a', 'c4'] = 20
df

Unnamed: 0,c1,c2,c3,c4,c5
a,0.0,1.0,2.0,20.0,
b,3.0,4.0,5.0,,
c,6.0,7.0,8.0,,
d,9.0,10.0,11.0,,
e,12.0,13.0,14.0,,
f,15.0,16.0,17.0,18.0,
g,,,,,


In [92]:
# NaT is the missing-value marker for datetime data

df['timestamp'] = pd.Timestamp('20120101')
df.loc[['a', 'c'], ['c2', 'timestamp']] = np.nan
df

Unnamed: 0,c1,c2,c3,c4,c5,timestamp
a,0.0,,2.0,20.0,,NaT
b,3.0,4.0,5.0,,,2012-01-01
c,6.0,,8.0,,,NaT
d,9.0,10.0,11.0,,,2012-01-01
e,12.0,13.0,14.0,,,2012-01-01
f,15.0,16.0,17.0,18.0,,2012-01-01
g,,,,,,2012-01-01


### Finding and counting missing data

To make detecting missing values easier (and across different array dtypes), pandas provides the `isna()` and `notna()` functions, which are also methods on `Series` and `DataFrame` objects.

In [72]:
pd.isna(df)

Unnamed: 0,c1,c2,c3,c4,c5
a,False,False,False,False,True
b,False,False,False,True,True
c,False,False,False,True,True
d,False,False,False,True,True
e,False,False,False,True,True
f,False,False,False,False,True
g,True,True,True,True,True


In [73]:
pd.notna(df)

Unnamed: 0,c1,c2,c3,c4,c5
a,True,True,True,True,False
b,True,True,True,False,False
c,True,True,True,False,False
d,True,True,True,False,False
e,True,True,True,False,False
f,True,True,True,True,False
g,False,False,False,False,False


In [74]:
df.isna()

Unnamed: 0,c1,c2,c3,c4,c5
a,False,False,False,False,True
b,False,False,False,True,True
c,False,False,False,True,True
d,False,False,False,True,True
e,False,False,False,True,True
f,False,False,False,False,True
g,True,True,True,True,True


In [71]:
# isnull() is an alias of isna() and finds the same NaN values

df.isnull()

Unnamed: 0,c1,c2,c3,c4,c5
a,False,False,False,False,True
b,False,False,False,True,True
c,False,False,False,True,True
d,False,False,False,True,True
e,False,False,False,True,True
f,False,False,False,False,True
g,True,True,True,True,True


In [25]:
# count the number of NaN's in each column

df.isnull().sum()

c1    1
c2    1
c3    1
c4    5
c5    7
dtype: int64

In [27]:
# total count of NaN values

df.isnull().sum().sum()

15

In [28]:
# another way: check the number of non-missing values in each column

df.count()

c1    6
c2    6
c3    6
c4    2
c5    0
dtype: int64

In [31]:
# equivalent to the previous cell

df.notnull().sum()

c1    6
c2    6
c3    6
c4    2
c5    0
dtype: int64

In [33]:
# finding non-null values in selected column

df.c4[df.c4.notnull()]

a    20.0
f    18.0
Name: c4, dtype: float64
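
The same boolean masks can also be aggregated per row or turned into a share of missing values. This is a small sketch on a hypothetical frame named `demo`, not part of the original notebook run:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan],
                     "y": [np.nan, np.nan, 6.0, 7.0]},
                    index=list("abcd"))

missing_per_row = demo.isna().sum(axis=1)      # count NaN values across each row
missing_share = demo.isna().mean()             # fraction of NaN values per column
rows_with_gaps = demo[demo.isna().any(axis=1)] # rows with at least one NaN

print(missing_per_row)
print(missing_share)
print(rows_with_gaps)
```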

### Dropping missing data

In [34]:
# .dropna() returns a copy without the missing entries; the original DataFrame is not changed

df.c4.dropna()

a    20.0
f    18.0
Name: c4, dtype: float64

In [36]:
# drops all rows from the DF object that have at least one NaN value

df.dropna()

Unnamed: 0,c1,c2,c3,c4,c5


In [37]:
# drop only rows where all values are NaNs

df.dropna(how='all')

Unnamed: 0,c1,c2,c3,c4,c5
a,0.0,1.0,2.0,20.0,
b,3.0,4.0,5.0,,
c,6.0,7.0,8.0,,
d,9.0,10.0,11.0,,
e,12.0,13.0,14.0,,
f,15.0,16.0,17.0,18.0,


In [40]:
# flip to drop columns where all values are NaNs instead of rows

df.dropna(how='all', axis=1)

Unnamed: 0,c1,c2,c3,c4
a,0.0,1.0,2.0,20.0
b,3.0,4.0,5.0,
c,6.0,7.0,8.0,
d,9.0,10.0,11.0,
e,12.0,13.0,14.0,
f,15.0,16.0,17.0,18.0
g,,,,


In [41]:
df2 = df.copy()
df2.loc['g', 'c1'] = 0
df2.loc['g', 'c3'] = 0
df2

Unnamed: 0,c1,c2,c3,c4,c5
a,0.0,1.0,2.0,20.0,
b,3.0,4.0,5.0,,
c,6.0,7.0,8.0,,
d,9.0,10.0,11.0,,
e,12.0,13.0,14.0,,
f,15.0,16.0,17.0,18.0,
g,0.0,,0.0,,


In [42]:
# drop columns with any NaN values with axis=1 parameter

df2.dropna(how='any', axis=1)

Unnamed: 0,c1,c3
a,0.0,2.0
b,3.0,5.0
c,6.0,8.0
d,9.0,11.0
e,12.0,14.0
f,15.0,17.0
g,0.0,0.0


In [43]:
# thresh=5 keeps only columns with at least 5 non-NaN values, so c4 and c5 are dropped

df.dropna(thresh=5, axis=1)

Unnamed: 0,c1,c2,c3
a,0.0,1.0,2.0
b,3.0,4.0,5.0
c,6.0,7.0,8.0
d,9.0,10.0,11.0
e,12.0,13.0,14.0
f,15.0,16.0,17.0
g,,,


In [44]:
# with inplace=True the DataFrame itself is modified instead of a copy being returned

df3 = df2.copy()
df3.dropna(thresh=5, axis=1, inplace=True)
df3

Unnamed: 0,c1,c2,c3
a,0.0,1.0,2.0
b,3.0,4.0,5.0
c,6.0,7.0,8.0
d,9.0,10.0,11.0
e,12.0,13.0,14.0
f,15.0,16.0,17.0
g,0.0,,0.0
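
Because `thresh` counts non-NaN values rather than NaN values, a small check can help. The following is a sketch on a hypothetical two-column frame; the `subset` parameter of `dropna()` is not used elsewhere in this notebook and is shown only as a related option:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"mostly_full": [1.0, 2.0, 3.0, np.nan],
                     "mostly_empty": [np.nan, np.nan, np.nan, 4.0]})

# keep only the columns that have at least 3 non-NaN values
kept = demo.dropna(thresh=3, axis=1)
print(list(kept.columns))

# drop rows only when the listed columns contain NaN
rows = demo.dropna(subset=["mostly_empty"])
print(list(rows.index))
```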


### Handling of NaN values in mathematical operations

NaN values are handled differently in pandas than in NumPy.

When a NumPy function encounters a NaN value, it returns NaN. Pandas functions typically ignore the NaN
values and continue processing as though the NaN values were not part of the Series object.

- `.mean()` ignores NaN values entirely and does not count them
- Summing of data treats NaN as 0
- If all values are NaN, `.mean()` returns NaN, while `.sum()` returns 0
- Methods like `.cumsum()` and `.cumprod()` ignore the NaN values, but preserve them in the resulting arrays
- The product of an empty or all-NA Series or column of a DataFrame is 1
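
The rules in this list can be verified with a short sketch. The `skipna` and `min_count` parameters are standard arguments of the pandas reductions, but they are not used elsewhere in this notebook and are shown here only as a way to opt out of the default behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
all_nan = pd.Series([np.nan, np.nan])

total = s.sum()                        # NaN is skipped
total_strict = s.sum(skipna=False)     # NaN propagates, as in NumPy

empty_sum = all_nan.sum()              # nothing to add up
empty_sum_strict = all_nan.sum(min_count=1)  # require at least one valid value
empty_mean = all_nan.mean()
empty_prod = all_nan.prod()

print(total, total_strict)
print(empty_sum, empty_sum_strict, empty_mean, empty_prod)
```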

In [45]:
# the mean of a NumPy array and a Series is different

a = np.array([1, 2, np.nan, 3])
s = pd.Series(a)
a.mean(), s.mean()

(nan, 2.0)

In [48]:
# sum handling of NaN

print(df.c4.sum())
df.c4

38.0


a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    18.0
g     NaN
Name: c4, dtype: float64

In [49]:
# mean handling of NaN

print(df.c4.mean())

19.0


In [50]:
# cumsum skips NaN values when accumulating, but preserves them in the resulting Series

df.c4.cumsum()

a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    38.0
g     NaN
Name: c4, dtype: float64

In [51]:
# in traditional math operations NaN will be propagated through to the result

df.c4 + 1

a    21.0
b     NaN
c     NaN
d     NaN
e     NaN
f    19.0
g     NaN
Name: c4, dtype: float64

In [78]:
# the product of an empty or all-NA Series or column of a DataFrame is 1

df.c5.prod()

1.0

In [80]:
# NaN values are ignored when calculating the product of Series values

df.c2.prod()

8320.0

In [93]:
df

Unnamed: 0,c1,c2,c3,c4,c5,timestamp
a,0.0,,2.0,20.0,,NaT
b,3.0,4.0,5.0,,,2012-01-01
c,6.0,,8.0,,,NaT
d,9.0,10.0,11.0,,,2012-01-01
e,12.0,13.0,14.0,,,2012-01-01
f,15.0,16.0,17.0,18.0,,2012-01-01
g,,,,,,2012-01-01


### Filling in missing data

In [52]:
# return a new DF with NaN's filled with 0 with fillna method

filled = df.fillna(0)
filled

Unnamed: 0,c1,c2,c3,c4,c5
a,0.0,1.0,2.0,20.0,0.0
b,3.0,4.0,5.0,0.0,0.0
c,6.0,7.0,8.0,0.0,0.0
d,9.0,10.0,11.0,0.0,0.0
e,12.0,13.0,14.0,0.0,0.0
f,15.0,16.0,17.0,18.0,0.0
g,0.0,0.0,0.0,0.0,0.0


In [53]:
# filling with 0 changes the resulting statistics: compare the means before and after

df.mean()

c1     7.5
c2     8.5
c3     9.5
c4    19.0
c5     NaN
dtype: float64

In [54]:
filled.mean()

c1    6.428571
c2    7.285714
c3    8.142857
c4    5.428571
c5    0.000000
dtype: float64

Gaps in data can be filled by propagating the non-NaN values forward or backward along a Series.

When working with time series data, forward filling is often referred to as using the "last known value".

In [55]:
# fills NaN's forward

df.c4.fillna(method='ffill')

a    20.0
b    20.0
c    20.0
d    20.0
e    20.0
f    18.0
g    18.0
Name: c4, dtype: float64

In [56]:
# fills NaN's backward

df.c4.fillna(method='bfill')

a    20.0
b    18.0
c    18.0
d    18.0
e    18.0
f    18.0
g     NaN
Name: c4, dtype: float64

In [59]:
# another common scenario is to fill all the NaN's in a column with the mean of the column

df.fillna(df.mean())

Unnamed: 0,c1,c2,c3,c4,c5
a,0.0,1.0,2.0,20.0,
b,3.0,4.0,5.0,19.0,
c,6.0,7.0,8.0,19.0,
d,9.0,10.0,11.0,19.0,
e,12.0,13.0,14.0,19.0,
f,15.0,16.0,17.0,18.0,
g,7.5,8.5,9.5,19.0,


In [99]:
df.fillna(df.mean()['c2':'c4'])


Unnamed: 0,c1,c2,c3,c4,c5,timestamp
a,0.0,10.75,2.0,20.0,,NaT
b,3.0,4.0,5.0,19.0,,2012-01-01
c,6.0,10.75,8.0,19.0,,NaT
d,9.0,10.0,11.0,19.0,,2012-01-01
e,12.0,13.0,14.0,19.0,,2012-01-01
f,15.0,16.0,17.0,18.0,,2012-01-01
g,,10.75,9.5,19.0,,2012-01-01


In [94]:
# If we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword

df['c4'].fillna(method='ffill', limit=3)

a    20.0
b    20.0
c    20.0
d    20.0
e     NaN
f    18.0
g    18.0
Name: c4, dtype: float64

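`ffill()` is equivalent to `fillna(method='ffill')` and `bfill()` is equivalent to `fillna(method='bfill')`. Recent pandas releases deprecate the `method` argument of `fillna()` in favor of these dedicated methods.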

In [95]:
df.c4.ffill()

a    20.0
b    20.0
c    20.0
d    20.0
e    20.0
f    18.0
g    18.0
Name: c4, dtype: float64

In [96]:
df.c4.bfill()

a    20.0
b    18.0
c    18.0
d    18.0
e    18.0
f    18.0
g     NaN
Name: c4, dtype: float64
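
Forward and backward filling can also be chained, so that a leading gap left by `ffill()` is closed by `bfill()`. This is a sketch on a hypothetical Series, using the `limit` argument of `ffill()` to cap how far a value is carried:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 20.0, np.nan, np.nan, 18.0, np.nan],
              index=list("abcdef"))

forward = s.ffill()          # the leading NaN at 'a' has nothing before it and stays
both = s.ffill().bfill()     # bfill then closes the remaining leading gap
limited = s.ffill(limit=1)   # carry each value forward by at most one position

print(forward)
print(both)
print(limited)
```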

### Filling using index labels

Data can be filled using the labels of a Series or the keys of a Python dictionary. This allows you to specify
different fill values for different elements, based upon the index label.

When filling a Series, the labels are matched against its index. When filling a DataFrame, the keys of the dict or the index of the Series must match the columns of the frame you wish to fill.

In [57]:
fill_values = pd.Series([100, 101, 102], index=['a', 'e', 'g'])
fill_values

a    100
e    101
g    102
dtype: int64

In [58]:
# only NaN values are filled; the existing value at 'a' is kept

df.c4.fillna(fill_values)

a     20.0
b      NaN
c      NaN
d      NaN
e    101.0
f     18.0
g    102.0
Name: c4, dtype: float64
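
For a whole DataFrame, the same idea works per column. The following is a minimal sketch with a hypothetical frame and arbitrary fill values, where the dict keys are matched against column labels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"c4": [20.0, np.nan, 18.0],
                      "c5": [np.nan, np.nan, np.nan]},
                     index=list("abc"))

# each key names a column, each value is the fill value for that column
filled_by_column = frame.fillna({"c4": 0.0, "c5": -1.0})
print(filled_by_column)
```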

### Performing interpolation of missing values

Both DataFrame and Series have an `.interpolate()` method that, by default, performs a linear interpolation of missing values.

This matters when the data follows a trend. Suppose your data represents steadily increasing values, such as the temperature
rising during the day. If the sensor stops responding for a few sample periods, the missing readings can be estimated through
interpolation with reasonable confidence, which is far better than setting the values to 0.

In [106]:
# linear interpolate the NaN values from 1 to 2

s = pd.Series([1, np.nan, np.nan, np.nan, 2], index=pd.date_range('2022-2-24', periods=5))
s.interpolate()

2022-02-24    1.00
2022-02-25    1.25
2022-02-26    1.50
2022-02-27    1.75
2022-02-28    2.00
Freq: D, dtype: float64

The `.interpolate()` method also accepts a `method` keyword that selects the kind of interpolation. Index-aware methods,
such as time-based interpolation, use the index values rather than just the position of each item.

The appropriate interpolation method will depend on the type of data you are working with. Several additional methods are delegated to `scipy`, which must be installed to use them.

In [67]:
# linear interpolation based on number of items

ts = pd.Series([1, np.nan, 2],
               index=[datetime(2014, 1, 1),
                      datetime(2014, 2, 2),
                      datetime(2014, 4, 1)])
ts.interpolate()

2014-01-01    1.0
2014-02-02    1.5
2014-04-01    2.0
dtype: float64

The default linear method treats the points as equally spaced and ignores the dates in the index, so the missing value is placed exactly halfway between 1 and 2.

The dates are not equally spaced, however: 2014-02-02 is 32 days after 2014-01-01, while 2014-04-01 is 90 days after it, and the series has no entry for 2014-03-01.

`method='time'` corrects the interpolation for 2014-02-02 by weighting it by elapsed time. Also note that no index label or value for 2014-03-01 is added to the Series; the gap is just factored into the interpolation.

In [68]:
# this accounts for the fact that we don't have an entry for 2014-03-01 

ts.interpolate(method='time')

2014-01-01    1.000000
2014-02-02    1.355556
2014-04-01    2.000000
dtype: float64
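
The time-weighted value can be reproduced by hand, which makes the weighting explicit. This sketch rebuilds the same three-point series; the names `fraction` and `manual` are illustrative:

```python
from datetime import datetime

import numpy as np
import pandas as pd

t0, t1, t2 = datetime(2014, 1, 1), datetime(2014, 2, 2), datetime(2014, 4, 1)

# share of the total time span that has elapsed at the missing point
fraction = (t1 - t0) / (t2 - t0)       # 32 days out of 90
manual = 1 + fraction * (2 - 1)        # start value plus the weighted difference

ts = pd.Series([1, np.nan, 2], index=[t0, t1, t2])
pandas_value = ts.interpolate(method="time").iloc[1]

print(manual, pandas_value)
```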

If we want to interpolate the value to be relative to the index value, when using numeric index labels, we can use `method='values'`.

In [69]:
# linear interpolation (ignores the index values and treats the items as equally spaced)

s = pd.Series([0, np.nan, 100], index=[0, 1, 10])
s.interpolate()

0       0.0
1      50.0
10    100.0
dtype: float64

In [70]:
# interpolate based upon the values in the index (using relative positioning)

s.interpolate(method='values')

0       0.0
1      10.0
10    100.0
dtype: float64
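
As a cross-check, the index-based result matches a plain one-dimensional linear interpolation over the index positions. This is a sketch using NumPy's `np.interp`, which is an assumption about how to verify the value, not a statement about how pandas computes it internally:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 100], index=[0, 1, 10])
by_values = s.interpolate(method="values")

# interpolate at x=1 between the known points (0, 0) and (10, 100)
manual = np.interp(1, [0, 10], [0, 100])

print(by_values.loc[1], manual)
```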