## Filter by index

The index is convenient in many ways. For example, I can filter the target `y` and the training features `X` with the same condition, as long as their indices match.

In [None]:
import pandas as pd
# create one pandas DataFrame and one Series; the index is assigned automatically, ranging from 0 to 4
# in many real-world cases, X and y would come from the same df and therefore share its index
X = pd.DataFrame(range(1,6,1), columns=['value'])
y = pd.Series(range(6,11,1))

print("X:", X)
print("y:", y)
# naturally we can filter X like below
filtered_x = X[X['value']>3]
print(filtered_x)
# due to the shared index, we can filter y just the same 😎
filtered_y = y[X['value']>3]
print(filtered_y)
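
The same idea can be written with the index labels themselves. The sketch below reuses the toy `X` and `y` from above; `mask` is an illustrative name that does not appear in the original, and this is only one possible way to keep the two objects in step.

```python
import pandas as pd

X = pd.DataFrame(range(1, 6), columns=['value'])
y = pd.Series(range(6, 11))

# boolean mask built from X; it carries X's index labels
mask = X['value'] > 3

filtered_x = X[mask]
# select y by the labels that survived the filter on X
filtered_y = y.loc[filtered_x.index]

# both results keep the same index labels, so the rows still pair up
print(filtered_x.index.equals(filtered_y.index))
```

Selecting `y` through `filtered_x.index` makes the dependency on the shared index explicit, which can be easier to read when the filter is built several steps earlier.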


## Filter by location: `iloc[]`
The example below identifies the location of infinite values in a df and filters X and y accordingly.

In [None]:
import numpy as np
# Create a dummy DataFrame with infinite values
df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8], 
                   'col2': [6, 7, 8, np.inf, np.nan], 'target': [0,1,0,1,0]})
print(df)
# like much real-world data, this df has a custom index, so it is probably easier to use iloc to identify records
df.index = list(range(1999, # start
                      1999 + df.shape[0], # end
                      1 # step
                      ))

X = df.drop('target', axis = 1)
y = df['target']

# find the coordinates (row, col) of the infinite values
infinite_indices = np.where(np.isinf(X))
print('row positions with infinite values:', infinite_indices[0])
print('positions of infinite values (row, col)')
print([f'({row}, {col})' for row, col in zip(infinite_indices[0], infinite_indices[1])])

# get the positions of all the records/rows
all_rows = list(range(df.shape[0]))
# use sets to exclude rows with infinite values, then sort back into a list to keep the original row order
rows_without_infinite_values = sorted(set(all_rows) - set(infinite_indices[0]))
# the very cool thing is we can filter both X and y this way, so they stay neatly matched
X2 = X.iloc[rows_without_infinite_values]
y2 = y.iloc[rows_without_infinite_values]
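
As a point of comparison, the same rows can be dropped with a boolean mask instead of a list of positions. This is a different technique from the `iloc[]` approach above, shown as a minimal sketch on the same dummy data; `finite_rows` is an illustrative name, not something from the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]})
df.index = range(1999, 1999 + len(df))

X = df.drop('target', axis=1)
y = df['target']

# True for rows where no feature is +inf or -inf (NaN rows are kept)
finite_rows = ~np.isinf(X).any(axis=1)

# the mask shares the index with X and y, so both filter consistently
X2 = X[finite_rows]
y2 = y[finite_rows]
print(X2.index.tolist())
```

The mask avoids the set arithmetic, while the position list is handy when the row numbers themselves are needed later.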

Alternatively, we can unpack the row and column indices separately from `np.where()`. The separate row and column arrays are handy for filtering. However, if we want to treat specific values of interest (say, replace them with a certain value), the combined (row, col) pairs are more useful.

In [None]:
# row and col index directly
row_index, col_index = np.where(df.isna())
print('null value row location index', row_index)
print('null value col location index', col_index)
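
To illustrate the combined (row, col) pairs, the sketch below zips the two arrays and looks up each cell by position with `iat`. It rebuilds the dummy df without the custom index for brevity; `null_cells` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]})

row_index, col_index = np.where(df.isna())

# pair the positions into (row, col) coordinates
null_cells = [(int(r), int(c)) for r, c in zip(row_index, col_index)]
print(null_cells)

# look up each cell by position and report its index label and column name
for r, c in null_cells:
    print(df.index[r], df.columns[c], df.iat[r, c])
```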

### A note of caution
To make sure the index arrays work as expected, it is prudent to run a quick test on a simple dummy dataset first. `np.where()` returns a tuple of arrays holding row and column positions, and `iloc[]` does not treat them as (row, col) pairs: it selects every listed row crossed with every listed column. For example, if we use the tuple to replace values, we are in for a surprise!

In [None]:
# recall this is what the original df looks like
print(df)
# null_index through `np.where` and `isna()`
null_index = np.where(df.isna())
print(df.iloc[null_index])
# similarly print out indices with infinite values
infinite_indices = np.where(np.isinf(df))
print(df.iloc[infinite_indices])
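# try using the index in `iloc[]` to locate the values of interest and replace/label them
# we notice that iloc affects every row-column combination involved, not just the null cells
# whereas replace is more precise in targeting the exact values of interest
# cast the feature columns to object first (assumption: newer pandas versions warn about, and may reject, writing a string into a float column)
df = df.astype({'col1': object, 'col2': object})
df.iloc[null_index] = 'missing'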
df.replace([-np.inf, np.inf], 'infinite', inplace = True)
print(df)
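
If the goal is to label only the null cells, one option is to index the underlying NumPy array, where a pair of index arrays is matched element by element. This is a minimal sketch on a fresh copy of the dummy data, not the only way to do it; `values` and `labelled` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, np.inf, 4, 8],
                   'col2': [6, 7, 8, np.inf, np.nan],
                   'target': [0, 1, 0, 1, 0]})

rows, cols = np.where(df.isna())

# work on an object-dtype copy so strings can sit next to numbers
values = df.to_numpy(dtype=object)
# NumPy pairs rows[i] with cols[i], so only the two NaN cells change
values[rows, cols] = 'missing'

labelled = pd.DataFrame(values, index=df.index, columns=df.columns)
print(labelled)
```

Here the cells at (0, 1) and (4, 0) keep their original values, unlike the `iloc[]` assignment above, which overwrites the whole row-column block.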