In [31]:
import pandas as pd
import numpy as np
import random

## Filter by Row Index

The index is convenient in many ways. For example, the target `y` and the training features `X` can be filtered with the same boolean mask, as long as their indices match. (Comment out `y.index = X.index` in the code below to see what happens when they don't.)

In [32]:
# in a real-world scenario, X and y would come from the same df and so share an index
X = pd.DataFrame(range(0,5,1), columns=['value'])
y = pd.Series([random.randint(0, 1) for _ in range(X.shape[0])])
X.index = range(1000, 1000+X.shape[0])
y.index = X.index # comment this out to break the code

print("X:\n", X)
# naturally we can filter X like below
filtered_X = X[X['value']>2]
# due to the shared index, we can filter y just the same 😎
filtered_y = y[X['value']>2]
# check filter results
print('\nfiltered y\n', filtered_y)

X:
       value
1000      0
1001      1
1002      2
1003      3
1004      4

filtered y
 1003    1
1004    0
dtype: int64
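
To make the effect of index alignment concrete, here is a minimal sketch with a fixed (non-random) `y`. The names `y_unaligned`, `y_by_label`, and `y_by_position` are illustrative and not part of the original notebook. It shows two ways to keep `X` and `y` matched: by label through the shared index, or by position through a plain NumPy boolean array when the indices differ.

```python
import pandas as pd

X = pd.DataFrame({'value': range(5)}, index=range(1000, 1005))
y = pd.Series([0, 1, 0, 1, 0], index=X.index)

mask = X['value'] > 2          # boolean Series labelled 1000..1004
filtered_X = X[mask]

# shared index: select y by the labels that survived the filter
y_by_label = y.loc[filtered_X.index]

# hypothetical case where y kept a default 0..4 index: a boolean Series
# cannot be aligned by label, so fall back to a positional boolean array
y_unaligned = y.reset_index(drop=True)
y_by_position = y_unaligned[mask.to_numpy()]

print(y_by_label.tolist(), y_by_position.tolist())
```

Both routes return the same two target values; only the index labels attached to the result differ.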


## Filter by location: `iloc[]`
The example below identifies the positions of infinite values in a df and filters `X` and `y` accordingly. (To achieve the same task with the row index, see the `Location vs Row Index` section below.)

In [33]:
import numpy as np
# Create a dummy DataFrame with infinite values
df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8], 
                   'col2': [6, 7, 8, np.inf, np.nan], 'target': [0,1,0,1,0]})
print(df)
# like much real-world data, this df has a custom index, so it is probably easier to use iloc to identify records
df.index = list(range(1999, # start
                      1999 + df.shape[0], # end
                      1 # step
                      ))

X = df.drop('target', axis = 1)
y = df['target']

# find the coordinates (row, col) with infinite values 
infinite_indices = np.where(np.isinf(X))
print('rows location index with infinit values:', infinite_indices[0])
print('location index of infinite values (row, col)')
print([f'({row}, {col})' for row, col in zip(infinite_indices[0], infinite_indices[1])])

# get the positions of all the records/rows (named all_rows to avoid shadowing the built-in all())
all_rows = list(range(df.shape[0]))
# use a set difference to exclude rows with infinite values, then sort back into a list to keep row order
rows_without_infinit_values = sorted(set(all_rows) - set(infinite_indices[0]))
# the very cool thing is we can easily filter for both X and y this way so they are still neatly matched
X2 = X.iloc[rows_without_infinit_values]
y2 = y.iloc[rows_without_infinit_values]

   col1  col2  target
0   NaN   6.0       0
1   2.0   7.0       1
2   inf   8.0       0
3   4.0   inf       1
4   8.0   NaN       0
rows location index with infinit values: [2 3]
location index of infinite values (row, col)
['(2, 0)', '(3, 1)']
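
The same filter can also be sketched without the set arithmetic, assuming all feature columns are numeric. `has_inf` and `keep_positions` are illustrative names, not part of the original notebook. `np.flatnonzero` returns positions in ascending order, so the original row order is preserved.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]},
                  index=range(1999, 2004))
X = df.drop(columns='target')
y = df['target']

# True for every row that holds at least one infinite value
has_inf = np.isinf(X.to_numpy()).any(axis=1)
# integer positions of the rows to keep, ready for iloc
keep_positions = np.flatnonzero(~has_inf)

X2 = X.iloc[keep_positions]
y2 = y.iloc[keep_positions]
print(X2.index.tolist())
```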


Alternatively, we can unpack the row and column positions directly from `np.where()`. The separate row and column arrays are handy for filtering. However, if we want to treat specific values (say, replace them with a certain value), the paired (row, col) positions are more useful.

In [34]:
# row and col index directly
row_index, col_index = np.where(df.isna())
print('null value row location index', row_index)
print('null value col location index', col_index)

null value row location index [0 4]
null value col location index [0 1]
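
As a sketch of why the paired positions matter, the loop below visits each (row, col) pair and overwrites only that cell with `iat`. The replacement value `-1.0` is an arbitrary placeholder chosen for illustration; a numeric value keeps the columns as floats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]})

row_index, col_index = np.where(df.isna())
# zip pairs the i-th row position with the i-th column position
for row, col in zip(row_index, col_index):
    df.iat[row, col] = -1.0   # touches exactly one cell per pair

print(df)
```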


### A note of caution
It is always prudent to test index arrays on a small dummy dataset first. `np.where` returns a tuple of arrays (row positions, column positions), and `iloc[]` does not treat them as (row, col) pairs: it selects every listed row crossed with every listed column. For example, if we use these arrays to replace values, we are in for a surprise!

In [35]:
# recall this is what the original df looks like
print(df)
# null_index through `np.where` and `isna()`
null_index = np.where(df.isna())
print(df.iloc[null_index])
# similarly print out indices with infinite values
infinite_indices = np.where(np.isinf(df))
print(df.iloc[infinite_indices])
# try using the indices in `iloc[]` to locate the values of interest and replace/label them
# notice that iloc affects every combination of the listed rows and cols, not just the null cells
# whereas replace targets only the exact cells of interest
df.iloc[null_index] = 'missing'
df.replace([-np.inf, np.inf], 'infinite', inplace = True)
print(df)

      col1  col2  target
1999   NaN   6.0       0
2000   2.0   7.0       1
2001   inf   8.0       0
2002   4.0   inf       1
2003   8.0   NaN       0
      col1  col2
1999   NaN   6.0
2003   8.0   NaN
      col1  col2
2001   inf   8.0
2002   4.0   inf
          col1      col2  target
1999   missing   missing       0
2000       2.0       7.0       1
2001  infinite       8.0       0
2002       4.0  infinite       1
2003   missing   missing       0
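
The surprise comes from how `iloc[]` interprets two arrays: it takes every listed row crossed with every listed column. The sketch below makes that visible on a fresh copy of the dummy data and shows `DataFrame.mask` as one cell-precise alternative. The names `block` and `labelled` and the placeholder `-1.0` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]},
                  index=range(1999, 2004))

rows, cols = np.where(df.isna())
block = df.iloc[rows, cols]              # rows {0, 4} x cols {0, 1}
n_null = int(df.isna().sum().sum())      # actual number of null cells
print(block.shape, n_null)

# mask() replaces only the cells where the condition is True
labelled = df.mask(df.isna(), -1.0)
print(labelled)
```

The selected block has four cells although only two are null, which is why assigning through `iloc[]` overwrote the neighbouring values as well.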


## Location vs Row Index

As we saw above, the indices returned from `np.where` are positions (locations), not index labels. To work with index labels instead, use a boolean mask, as shown below.

In [36]:
df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8], 
                   'col2': [6, 7, 8, np.inf, np.nan], 'target': [0,1,0,1,0]})  
df.index = range(1000, 1000+df.shape[0])                 
null_indices = np.where(df.isna())
null_mask = df.isna()
print('null_indices', null_indices)
print('null_mask', null_mask)

null_indices (array([0, 4], dtype=int64), array([0, 1], dtype=int64))
null_mask        col1   col2  target
1000   True  False   False
1001  False  False   False
1002  False  False   False
1003  False  False   False
1004  False   True   False


In [37]:
infinity_mask = (df==np.inf)
print(infinity_mask)

       col1   col2  target
1000  False  False   False
1001  False  False   False
1002   True  False   False
1003  False   True   False
1004  False  False   False


In [38]:
# easy to filter out all rows with infinite values in any column
filtered_df = df[~infinity_mask.any(axis=1)]
filtered_df.index


Int64Index([1000, 1001, 1004], dtype='int64')
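
Positions and labels can also be converted into each other directly. The sketch below is one way to do it, with `row_positions`, `row_labels`, and `positions_again` as illustrative names: indexing `df.index` with positions yields labels, and `Index.get_indexer` maps labels back to positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]},
                  index=range(1000, 1005))

row_positions, _ = np.where(df.isna())       # positions, for iloc
row_labels = df.index[row_positions]         # labels, for loc
positions_again = df.index.get_indexer(row_labels)

print(row_labels.tolist(), positions_again.tolist())
# both selections pick the same rows
same_rows = df.loc[row_labels].equals(df.iloc[row_positions])
```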

#### Row Index + Col Index

One great advantage of the row index (vs. location) is that we can easily combine it with the column index (which is just the column names) using `loc[]`, as shown below.

In [39]:
col1_infinity_mask = (df['col1']==np.inf)
print('col1_infinity_mask', col1_infinity_mask)
# now identify the value of col2 when col1 is of infinite value
df.loc[col1_infinity_mask, 'col2']

col1_infinity_mask 1000    False
1001    False
1002     True
1003    False
1004    False
Name: col1, dtype: bool


1002    8.0
Name: col2, dtype: float64

In [40]:
# of course we can also assign values to col2 based on the col1 mask, again using `loc`
df.loc[col1_infinity_mask, 'col2'] = 'infinity col1'
print(df)

      col1           col2  target
1000   NaN            6.0       0
1001   2.0            7.0       1
1002   inf  infinity col1       0
1003   4.0            inf       1
1004   8.0            NaN       0
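
Assigning a string, as above, mixes text into a float column and turns it into an object column. When a numeric result is needed, one option is to assign a number through the same `loc[mask, column]` pattern. The sketch below caps infinite `col1` values at the largest finite value; `finite_max` is an illustrative name and the capping rule is an assumption, not something the original notebook prescribes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]},
                  index=range(1000, 1005))

col1_infinity_mask = (df['col1'] == np.inf)
# largest finite value in col1 (np.isfinite is False for NaN and inf)
finite_max = df.loc[np.isfinite(df['col1']), 'col1'].max()
# row mask + column label: only col1 in the masked rows changes
df.loc[col1_infinity_mask, 'col1'] = finite_max
print(df)
```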
