In this kernel I try to fill some NaN values using one interesting observation.

*just random pic idk*
![](https://sun9-31.userapi.com/c857628/v857628861/4c59e/JiPqE9xmzjs.jpg)

In [None]:
import numpy as np
import pandas as pd
import warnings

warnings.simplefilter('ignore')

train = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv')
test = pd.read_csv('../input/ieee-fraud-detection/test_transaction.csv')

I've noticed that the values in some card columns depend on the values in other card columns. Using that dependency, we can fill some of the NaNs in the data. Let's look at the data!

In [None]:
card_features = ['card1', 'card2', 'card3', 'card4', 'card5', 'card6']

In [None]:
train[card_features].head()

Let's count the NaNs in every card column.

In [None]:
pd.concat([train[card_features].isna().sum(), test[card_features].isna().sum()], axis=1).rename(columns={0: 'train_NaNs', 1: 'test_NaNs'})

We can see that card2 has the most NaNs of the card features. What is more, card3, card4 and card6 have about twice as many NaNs in test as in train.

Let's look at the ratio of missing values to the total number of rows.

In [None]:
pd.concat([train[card_features].isna().sum() / train.shape[0], test[card_features].isna().sum() / test.shape[0]], axis=1).rename(columns={0: 'train_NaNs_%', 1: 'test_NaNs_%'})

Not very high ratios though.
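
As a small aside, the same ratios can be obtained with `isna().mean()`, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a toy frame (the `toy` data is invented for illustration and is not taken from the competition files):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the card columns; the values are invented for illustration.
toy = pd.DataFrame({
    'card2': [111.0, np.nan, 321.0, np.nan],
    'card3': [150.0, 150.0, np.nan, 150.0],
})

# Mean of a boolean mask == share of missing rows per column.
nan_ratio = toy.isna().mean()

# Explicit form, as used in the cell above: NaN count divided by row count.
nan_ratio_explicit = toy.isna().sum() / toy.shape[0]

print(nan_ratio)
```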

In [None]:
# Some useful functions

def count_uniques(train, test, pair):
    unique_train = []
    unique_test = []

    for value in train[pair[0]].unique():
        unique_train.append(train[pair[1]][train[pair[0]] == value].value_counts().shape[0])

    for value in test[pair[0]].unique():
        unique_test.append(test[pair[1]][test[pair[0]] == value].value_counts().shape[0])

    pair_values_train = pd.Series(data=unique_train, index=train[pair[0]].unique())
    pair_values_test = pd.Series(data=unique_test, index=test[pair[0]].unique())
    
    return pair_values_train, pair_values_test

def fill_card_nans(train, test, pair_values_train, pair_values_test, pair):
    print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
    print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )

    print('Filling train...')
    
    for value in pair_values_train[pair_values_train == 1].index:
        # Assign through .loc so the write hits the frame itself (no chained assignment)
        mask = train[pair[0]] == value
        train.loc[mask, pair[1]] = train.loc[mask, pair[1]].value_counts().index[0]
        
    print('Filling test...')

    for value in pair_values_test[pair_values_test == 1].index:
        mask = test[pair[0]] == value
        test.loc[mask, pair[1]] = test.loc[mask, pair[1]].value_counts().index[0]
        
    print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
    print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )
    
    return train, test

def nans_distribution(train, test, unique_train, unique_test, pair):
    train_nans_per_category = []
    test_nans_per_category = []

    for value in unique_train.unique():
        train_nans_per_category.append(train[train[pair[0]].isin(list(unique_train[unique_train == value].index))][pair[1]].isna().sum())

    for value in unique_test.unique():
        test_nans_per_category.append(test[test[pair[0]].isin(list(unique_test[unique_test == value].index))][pair[1]].isna().sum())

    pair_values_train = pd.Series(data=train_nans_per_category, index=unique_train.unique())
    pair_values_test = pd.Series(data=test_nans_per_category, index=unique_test.unique())
    
    return pair_values_train, pair_values_test

# Card1 and Card2

There is a dependency between card2 and card1 values.  

The dataset contains a lot of cases like the one below, where most of the card2 values for a given card1 value are the same, but some are missing. So we can assume that the NaN rows should hold the only value that occurs in that card1 category.

In [None]:
train[train['card1'] == 13926][['card1', 'card2']]

Let's count the unique card2 values for each card1 category.

In [None]:
unique_values_train, unique_values_test = count_uniques(train, test, ('card1', 'card2'))
pd.concat([unique_values_train.value_counts(), unique_values_test.value_counts()], axis=1).rename(columns={0: 'train', 1: 'test'})

We can see that most card1 categories have only one unique card2 value.
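
The per-value loop in `count_uniques` can also be written as a single `groupby(...).nunique()` call, which is usually much faster on a frame of this size. A minimal sketch on invented toy data (the `toy` frame and its values are assumptions for illustration). One difference to keep in mind: `groupby` drops rows whose key is NaN by default.

```python
import numpy as np
import pandas as pd

# Invented example: card1=1 has one card2 value, card1=2 has two, card1=3 has none.
toy = pd.DataFrame({
    'card1': [1, 1, 1, 2, 2, 3, 3],
    'card2': [100.0, np.nan, 100.0, 200.0, 300.0, np.nan, np.nan],
})

# Number of distinct non-NaN card2 values observed for each card1 value.
uniques_per_card1 = toy.groupby('card1')['card2'].nunique()

# How many card1 values have 0, 1, 2, ... distinct card2 values.
summary = uniques_per_card1.value_counts().sort_index()

print(uniques_per_card1)
print(summary)
```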

Now let's count the missing values, grouped by the number of unique values in the card1 category.

In [None]:
train_nan_dist, test_nan_dist = nans_distribution(train, test, unique_values_train, unique_values_test, ('card1', 'card2'))
pd.concat([train_nan_dist, test_nan_dist], axis=1).rename(columns={0: 'train', 1: 'test'})

Hm. There are a lot of missing values in categories that have no values at all (only NaNs) and in categories that have exactly one unique value.
So we can do this:

* Fill NaNs in the 1-amount categories (exactly one unique value) with the most frequent value
* Treat the NaNs in the 0-amount categories (only NaNs) as a single category. We can just encode it somehow.

But right now we will focus only on the 1-amount categories and fill their NaNs with the most frequent value in the card1 category.
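
The first option can also be sketched with `groupby(...).transform` instead of an explicit loop. This is a minimal illustration on made-up data, not the helper used in this kernel: the function name `fill_single_valued` and the toy values are assumptions. It fills a NaN only when its card1 group has exactly one distinct non-NaN value, and leaves groups with zero or several distinct values untouched.

```python
import numpy as np
import pandas as pd

def fill_single_valued(df, key, target):
    """Fill NaNs in `target` where the `key` group has exactly one distinct value."""
    grouped = df.groupby(key)[target]
    n_unique = grouped.transform('nunique')   # distinct non-NaN values per group
    only_value = grouped.transform('first')   # first non-NaN value per group
    fillable = df[target].isna() & (n_unique == 1)
    # Keep the original value unless the row is fillable.
    return df[target].where(~fillable, only_value)

toy = pd.DataFrame({
    'card1': [1, 1, 1, 2, 2, 2, 3],
    'card2': [100.0, np.nan, 100.0, 200.0, 300.0, np.nan, np.nan],
})
toy['card2'] = fill_single_valued(toy, 'card1', 'card2')
print(toy)
```

For the second option, one possibility is to fill the all-NaN groups with a sentinel value that does not occur in the column; which value is suitable depends on the model.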

In [None]:
train, test = fill_card_nans(train, test, unique_values_train, unique_values_test, ('card1', 'card2'))

# Card1 and Card3

Let's do the same for card3.

In [None]:
train[train['card1'] == 13926][['card1', 'card3']]

In [None]:
unique_values_train, unique_values_test = count_uniques(train, test, ('card1', 'card3'))
pd.concat([unique_values_train.value_counts(), unique_values_test.value_counts()], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train_nan_dist, test_nan_dist = nans_distribution(train, test, unique_values_train, unique_values_test, ('card1', 'card3'))
pd.concat([train_nan_dist, test_nan_dist], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train, test = fill_card_nans(train, test, unique_values_train, unique_values_test, ('card1', 'card3'))

So we filled almost all NaNs in card3.

# Card1 and Card4

In [None]:
train[train['card1'] == 13926][['card1', 'card4']]

Ok, here is the same dependency.

In [None]:
unique_values_train, unique_values_test = count_uniques(train, test, ('card1', 'card4'))
pd.concat([unique_values_train.value_counts(), unique_values_test.value_counts()], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train_nan_dist, test_nan_dist = nans_distribution(train, test, unique_values_train, unique_values_test, ('card1', 'card4'))
pd.concat([train_nan_dist, test_nan_dist], axis=1).rename(columns={0: 'train', 1: 'test'})

Here we have the same problem, and the same approach can solve it too.

In [None]:
train, test = fill_card_nans(train, test, unique_values_train, unique_values_test, ('card1', 'card4'))

# Card1 and Card5

In [None]:
train[train['card1'] == 13926][['card1', 'card5']]

In [None]:
unique_values_train, unique_values_test = count_uniques(train, test, ('card1', 'card5'))
pd.concat([unique_values_train.value_counts(), unique_values_test.value_counts()], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train_nan_dist, test_nan_dist = nans_distribution(train, test, unique_values_train, unique_values_test, ('card1', 'card5'))
pd.concat([train_nan_dist, test_nan_dist], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train, test = fill_card_nans(train, test, unique_values_train, unique_values_test, ('card1', 'card5'))

# Card1 and Card6

In [None]:
train[train['card1'] == 13926][['card1', 'card6']]

In [None]:
unique_values_train, unique_values_test = count_uniques(train, test, ('card1', 'card6'))
pd.concat([unique_values_train.value_counts(), unique_values_test.value_counts()], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train_nan_dist, test_nan_dist = nans_distribution(train, test, unique_values_train, unique_values_test, ('card1', 'card6'))
pd.concat([train_nan_dist, test_nan_dist], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train, test = fill_card_nans(train, test, unique_values_train, unique_values_test, ('card1', 'card6'))

### Ok. Let's look at the number of NaNs now.

In [None]:
pd.concat([train[card_features].isna().sum(), test[card_features].isna().sum()], axis=1).rename(columns={0: 'train_NaNs', 1: 'test_NaNs'})

There are still a lot of NaNs in card2 and card5. Let's try some other fill combinations.

In [None]:
train[card_features].head()

Let's find another dependent feature for card2.

In [None]:
print('Card3 == 150: ', train[train['card3'] == 150]['card2'].nunique())
print('Card4 == mastercard: ', train[train['card4'] == 'mastercard']['card2'].nunique())
print('Card5 == 102: ', train[train['card5'] == 102]['card2'].nunique())
print('Card6 == credit: ', train[train['card6'] == 'credit']['card2'].nunique())

We can see that there are too many unique values to apply this approach to the remaining NaNs in card2.
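
The checks above look at one hand-picked value per column. To compare candidate key columns over all of their values, one can measure the share of groups that have exactly one distinct card2 value. A rough sketch on invented data (the helper name `single_valued_share` and the toy frame are assumptions):

```python
import numpy as np
import pandas as pd

def single_valued_share(df, key, target):
    """Share of `key` groups with exactly one distinct non-NaN `target` value."""
    n_unique = df.groupby(key)[target].nunique()
    return (n_unique == 1).mean()

toy = pd.DataFrame({
    'card1': [1, 1, 2, 2, 3, 3],
    'card4': ['visa', 'visa', 'visa', 'mastercard', 'mastercard', 'mastercard'],
    'card2': [100.0, np.nan, 200.0, 200.0, 300.0, np.nan],
})

# A share close to 1 suggests the key is a good candidate for this kind of fill.
shares = {key: single_valued_share(toy, key, 'card2') for key in ['card1', 'card4']}
print(shares)
```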

Let's find another dependent feature for card5.

In [None]:
print('Card2 == 327: ', train[train['card2'] == 327]['card5'].nunique())
print('Card3 == 150: ', train[train['card3'] == 150]['card5'].nunique())
print('Card4 == mastercard: ', train[train['card4'] == 'mastercard']['card5'].nunique())
print('Card6 == credit: ', train[train['card6'] == 'credit']['card5'].nunique())

Same for card5.

Let's try some other features.

# Card1 and Addr2

In [None]:
train[train['card1'] == 13926][['card1', 'addr2']]

In [None]:
unique_values_train, unique_values_test = count_uniques(train, test, ('card1', 'addr2'))
pd.concat([unique_values_train.value_counts(), unique_values_test.value_counts()], axis=1).rename(columns={0: 'train', 1: 'test'})

In [None]:
train_nan_dist, test_nan_dist = nans_distribution(train, test, unique_values_train, unique_values_test, ('card1', 'addr2'))
pd.concat([train_nan_dist, test_nan_dist], axis=1).rename(columns={0: 'train', 1: 'test'})

There are really a lot of missing values in addr2, especially in the 1-amount categories.

In [None]:
train, test = fill_card_nans(train, test, unique_values_train, unique_values_test, ('card1', 'addr2'))

So, we can fill a lot of values using this approach. But we cannot be 100% sure that the missing values in a 1-amount category are really equal to its most frequent value.
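
One way to probe that uncertainty is to learn the card1-to-value mapping from the single-valued groups in train and check how often it agrees with the non-missing rows in test. A hedged sketch on toy data: the helper name `mapping_agreement` and all values are assumptions, and a high agreement rate is only indirect evidence that the filled NaNs are right.

```python
import numpy as np
import pandas as pd

def mapping_agreement(train, test, key, target):
    """Agreement rate between a train-derived key->target mapping and known test values."""
    grouped = train.groupby(key)[target]
    n_unique = grouped.nunique()
    # Keep only keys that map to exactly one distinct value in train.
    mapping = grouped.first()[n_unique == 1]
    # Compare only test rows with a known target and a key covered by the mapping.
    known = test.dropna(subset=[target])
    known = known[known[key].isin(mapping.index)]
    if known.empty:
        return np.nan
    predicted = known[key].map(mapping)
    return (predicted == known[target]).mean()

toy_train = pd.DataFrame({
    'card1': [1, 1, 2, 2, 3],
    'addr2': [87.0, np.nan, 60.0, 60.0, 87.0],
})
toy_test = pd.DataFrame({
    'card1': [1, 2, 2, 3, 4],
    'addr2': [87.0, 60.0, 96.0, np.nan, 87.0],
})

rate = mapping_agreement(toy_train, toy_test, 'card1', 'addr2')
print(rate)
```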

# Let's find all the features that depend on card1

In [None]:
train[train['card1'] == 13926]['addr2'].value_counts().shape[0] == 1

In [None]:
depend_features = []

for col in train.columns:
    if train[train['card1'] == 13926][col].value_counts().shape[0] == 1:
        depend_features.append(col)

print(depend_features)

There are a lot of columns that we can suspect of depending on card1, and some of them can be filled as above.
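
The loop above inspects a single card1 value (13926), so a column can look dependent by chance. A possible extension is to compute, for every column, the share of card1 groups that contain at most one distinct value. This is a sketch on toy data; `dependency_share` is a hypothetical helper name, and on the full train frame the call may take a while.

```python
import numpy as np
import pandas as pd

def dependency_share(df, key):
    """For each column, share of `key` groups with at most one distinct non-NaN value."""
    per_group = df.groupby(key).nunique()
    # Some pandas versions keep the key column in the result; drop it if present.
    per_group = per_group.drop(columns=key, errors='ignore')
    return (per_group <= 1).mean().sort_values(ascending=False)

toy = pd.DataFrame({
    'card1': [1, 1, 2, 2, 3, 3],
    'card3': [150.0, np.nan, 185.0, 185.0, 150.0, 150.0],
    'TransactionAmt': [10.0, 25.0, 10.0, 99.0, 5.0, 5.0],
})

# Columns with a share near 1 behave like functions of card1 in this sample.
shares = dependency_share(toy, 'card1')
print(shares)
```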

## If you want to apply my kernel you can use that function:

In [None]:
def fill_pairs(train, test, pairs):
    for pair in pairs:

        unique_train = []
        unique_test = []

        print(f'Pair: {pair}')
        print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
        print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )

        for value in train[pair[0]].unique():
            unique_train.append(train[pair[1]][train[pair[0]] == value].value_counts().shape[0])

        for value in test[pair[0]].unique():
            unique_test.append(test[pair[1]][test[pair[0]] == value].value_counts().shape[0])

        pair_values_train = pd.Series(data=unique_train, index=train[pair[0]].unique())
        pair_values_test = pd.Series(data=unique_test, index=test[pair[0]].unique())
        
        print('Filling train...')

        for value in pair_values_train[pair_values_train == 1].index:
            train.loc[train[pair[0]] == value, pair[1]] = train.loc[train[pair[0]] == value, pair[1]].value_counts().index[0]

        print('Filling test...')

        for value in pair_values_test[pair_values_test == 1].index:
            test.loc[test[pair[0]] == value, pair[1]] = test.loc[test[pair[0]] == value, pair[1]].value_counts().index[0]

        print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
        print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )
        
    return train, test

In [None]:
pairs = [('card1', 'card2'), ('card1', 'card3')]

train, test = fill_pairs(train, test, pairs)

If you find this kernel helpful please upvote!