# Combine the two kernels

## Clean Data

In [None]:
import os
if os.name=='nt':
    try:
        mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-8.1.0-posix-seh-rt_v6-rev0\\mingw64\\bin'
        os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
    except:
        pass
    
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tqdm

tqdm.tqdm.pandas()

%matplotlib inline

In [2]:
df = pd.read_csv('../input/train_ver2.csv')
df2 = df.copy()

  interactivity=interactivity, compiler=compiler, result=result)


In [None]:
df.to_hdf('../input/data.hdf', 'train_ver2', complib='blosc:lz4', complevel=9, format='t')

In [3]:
df.memory_usage(deep=True).sum()/2**30

15.323746562935412

Some columns have the same number of nulls, these are the same samples. Nulls are filled with 'unknown'

In [5]:
column_fillna_with_unkonwn = [c for c in dfnull.index if dfnull[c]==27734] # columns that NA is filled by 'unknown'
column_fillna_with_unkonwn.remove('fecha_alta')

In [7]:
for c in column_fillna_with_unkonwn:
    df.loc[df[c].isnull(), c] = 'unknown'

`sexo`: nulls are filled with 'unknown'

In [8]:
df.loc[df.sexo.isnull(), 'sexo'] = 'unknown'

### `ult_fec_cli_1t` 

Last date as primary customer (if he isn't at the end of the month)

Only very few samples in `ult_fec_cli_1t` is not NA, 24793 out of 13,647,309 samples have `ult_fec_cli_1t` not NA.

1. If the sample is not NA
    - `indrel_1mes==1`: the customer was primary at the beginning of the month, but left
    - But `indrel_1mes` can be values other than 1, following possibilities
        - The customer became primary at some time of the month and then left immediately in the same month
        - OR, `ult_fec_cli_1t` also records other changes of customer types
        - OR, the data is not consistent for a few samples
2. A customer can become and leave primary multiple times, `ult_fec_cli_1t` only records the change of the current primary membership
    - `df.ult_fec_cli_1t-df.fecha_dato<=-30 days`
    - `df.ult_fec_cli_1t-df.fecha_dato` can be positive, meaning the current membership changes in a future date
3. If the sample is NA
    - The customer is not primary up to the end of the month
    - The customer is still primary at the end of the month    

#### Findings in data analysis
1. `ult_fec_cli_1t` is defined a posteriori, it defines the termination of primary of each primary period
2. The definition of `indrel` does not match with data. It simply indicates if `ult_fec_cli_1t` is NA or not.
    - `ult_fec_cli_1t` is NA, `indrel==1`
    - `ult_fec_cli_1t` is not NA, `indrel==99`
3. 

In [9]:
a = df.loc[df.ult_fec_cli_1t.notnull(), ['ncodpers', 'fecha_dato', 'fecha_alta', 'ult_fec_cli_1t', 'indrel', 'indrel_1mes']]
a.fecha_dato = pd.to_datetime(a.fecha_dato)
a.fecha_alta = pd.to_datetime(a.fecha_alta)
a.ult_fec_cli_1t = pd.to_datetime(a.ult_fec_cli_1t)

#### `ult_fec_cli_1t` and `fecha_dato`

- `ult_fec_cli_1t-fecha_dato>=-27 days`
- `ult_fec_cli_1t-fecha_dato<=183 days`

`ult_fec_cli_1t` is always in the current or a future month. So it tells when this current primary membership is terminated **a posteriori**.

In [10]:
b = (a.ult_fec_cli_1t - a.fecha_dato).value_counts().sort_index()
c = b.copy()
c.index = c.index.astype('timedelta64[M]')
c = c.groupby(by=c.index).sum()

In [11]:
b.head()

-27 days    769
-26 days    725
-25 days    571
-24 days    686
-23 days    644
dtype: int64

In [12]:
b.tail()

177 days    37
180 days    24
181 days    37
182 days    26
183 days    25
dtype: int64

`fecha_dato` is no later than `ult_fec_cli_1t`, the -1 below means that they are in the same month.

In [13]:
c

-1    18236
 0     2557
 1      747
 2      829
 3      774
 4      887
 5      738
 6       25
dtype: int64

Conversely, if `ult_fec_cli_1t` is NA, change from primary to non-primary should never happen. If we compare `indrel_1mes` of this month with that of the next month for the same customer, it should never change from 1 to not 1.

### So there are three types of customers
1. `ult_fec_cli_1t` is always NA
    - Customer type should always be primary or not primary, never change between primary and not primary
    - 925932 customers
2. For customers with `ult_fec_cli_1t` always NA, 921072 have an unchanged `indrel_1mes`
    - Most of them have `indrel_1mes==1` 

|`indrel_1mes`|count     |
|-------------|----------|
|1            |920638    |
|2            |117       |
|3            |218       |
|4            |23        |
|P            |76        |

2. 4860 Customers with `ult_fec_cli_1t` always NA, and two `indrel_1mes` values
    - How about the other customers whose `indrel_1mes!=1`? Their `indrel_1mes` should change from other types to 1 and **do not change back**
        - **exceptions, `ult_fec_cli_1t` missing, but should have a number**
            - 23 customers whose `indrel_1mes` changes from 1 to 3 then to 1 again
            - 4 customers whose `indrel_1mes` changes from 1 to 3, 3 all occur at 2016-05, the last month
            - 1 customer's `indrel_1mes` changes from 3 to 1 to 3 to 1, `ncodpers==693087`
        - **others, become primary and stay**
            - 2741 customers' `indrel_1mes` changes from 3 to 1
            - 1120 customers `indrel_1mes` changes from 2 to 1
            - 730 customers `indrel_1mes` changes from 'P' to 1
            - 241 customers `indrel_1mes` changes from 4 to 1
4. `ult_fec_cli_1t` is always not NA
    - 3481 customers, all the them have been primary once and then changed types
    - 13 customers have two types of `indrel_1mes`, which change between 1 and other types (usually 3 indicating former primary or P, potential customer)
    - 3468 customers have only one  type of `indrel_1mes`, but they may have multiple samples. This indicates
        - either the data is wrong
        - or, they change to primary and then change back within the same month, so `indrel_1mes` do not change in the next month. 
        - For those customers fullfill the following condition, are they on purpose for some temporary discount/benefit?
            - with multiple samples 
            - `indrel_1mes` are the same 
            - `ult_fec_cli_1t` are different
    - we can count the unique values of `ult_fec_cli_1i` to determine how many times 
3. `ult_fec_cli_1t` changes between NA and not NA
    - 12546 customers
    - for these customers, first sort by `ncodpers` and `fecha_dato`, then fillna `ult_fec_cli_1t` with bfill
3. __Possible features for `ult_fec_cli_1t`__
    - If `ult_fec_cli_1t` is always NA, always not NA, or others
    - Number of records up to now
    - Number of unique `indrel_1mes` values
        - The customers can be classified by `indrel_1mes` into {1, 2, 3, 4, P, 21, 31, 41, P1}
            - If 1 `indrel_1mes` value, what is it
            - If 2 `indrel_1mes` value, what is the value other than 1
            - why not replace P with 0 in `indrel_1mes`?
            - the exception cases for those 28 customers that change `indrel_1mes` multiple times
        - When does the change of `indrel_1mes` happen
    - Number of unique `ult_fec_cli_1t` up to now
    - Mean encode within customers having the same `indrel_1mes` type in the same month
    - If primary is terminated/added in the month
    - Type of customer type changes in the month

In [14]:
d = df.loc[:, ['ncodpers', 'indrel_1mes', 'fecha_dato', 'ult_fec_cli_1t']]
d.fecha_dato = pd.to_datetime(d.fecha_dato)
d.ult_fec_cli_1t = pd.to_datetime(d.ult_fec_cli_1t)
d.sort_values(by=['ncodpers', 'fecha_dato'], ascending=[True, True], inplace=True)
d = d.loc[d.indrel_1mes.notnull(), :] # drop samples without indrel_1mes
d.indrel_1mes.unique()
indrel_map = {1.0: 1, '1': 1, '2': 2, 2.0: 2, '1.0': 1, '2.0': 2, 4.0: 4, '3.0': 3, 3.0: 3, '4.0': 4, 'P': 'P', '3': 3,
       '4': 4}
d.loc[:, 'indrel_1mes'] = d.indrel_1mes.map(indrel_map)

In [15]:
def count_null(x):
    return x.isnull().sum()

def count_all(x):
    return len(x)

e = d.groupby(['ncodpers'])['ult_fec_cli_1t'].agg([count_null, count_all])
e['diff'] = e.count_all-e.count_null
e.sort_values(by=['diff', 'count_all'], axis=0, ascending=[True, False], inplace=True)

Number of customers whose `ult_fec_cli_1t` is always NA is 925932

In [17]:
e.loc[(e['diff']==0) & (e['count_null']>0)].shape

(925932, 3)

Number of customers whose `ult_fec_cli_1t` is always not NA is 3481

In [18]:
e.count_null.value_counts().sort_index().iloc[0]

3481

Number of customers whose `ult_fec_cli_1t` changes between NA and not NA

In [19]:
e.loc[(e['diff']>0) & (e.count_null>0)].shape

(12546, 3)

All the cases are searched

In [20]:
12546+3481+925932-e.shape[0]

0

Save for future use

In [22]:
e.columns = ['count_ult_null', 'count_sample', 'count_ult_notnull']
e.to_hdf('../input/data.hdf', 'train_ult_null_count', complib='blosc:lz4', complevel=9, format='t')

##### Customers whose `ult_fec_cli_1t` is always NA

How their `indrel_1mes` change through different months?

In [28]:
e1 = e.loc[(e['count_ult_notnull']==0) & (e['count_ult_null']>0)].index.tolist() # list of costumer id with ult_fec_cli_1t always NA

In [29]:
d.columns

Index(['ncodpers', 'indrel_1mes', 'fecha_dato', 'ult_fec_cli_1t'], dtype='object')

In [30]:
d1 = d.loc[d.ncodpers.isin(e1), ['ncodpers', 'indrel_1mes', 'fecha_dato']]
d1.sort_values(['ncodpers', 'fecha_dato'], inplace=True)

In [31]:
d1.head(1)

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato
416965,15889,1,2015-01-28


In [32]:
d1_group = d1.groupby('ncodpers')
a = d1_group.get_group(15889)

In [46]:
def count_types(x):
    return len(x.unique())

def condense_types(x):
    return x.value_counts().index.tolist()

In [49]:
d1_count_types = d1_group['indrel_1mes'].apply(count_types) # count number of types for each customer





  6%|████▏                                                                    | 52631/925932 [00:32<02:45, 5281.95it/s]

In [50]:
d1_count_types.value_counts()

1    921072
2      4860
Name: indrel_1mes, dtype: int64

In [236]:
4860+921072-925932

0

Most of the customers whose `ult_fec_cli_1t` is always NA and `indrel_1mes` unchanged have `indrel_1mes==1`

In [51]:
d1_uniuque_type = d.loc[d.ncodpers.isin(d1_count_types.loc[d1_count_types==1].index)].groupby('ncodpers')['indrel_1mes'].apply(lambda x: x.iloc[0])

In [53]:
d1_uniuque_type.value_counts()

1    920638
3       218
2       117
P        76
4        23
Name: indrel_1mes, dtype: int64

In [245]:
d.loc[d.ncodpers.isin(d1_count_types.loc[d1_count_types==1].index), 'indrel_1mes'].value_counts()

1    13343385
3         218
2         117
P          76
4          23
Name: indrel_1mes, dtype: int64

Customers whose `ult_fec_cli_1t` is always NA and `indrel_1mes` changes

In [255]:
d1_changed = d1_count_types.loc[d1_count_types==2].index.tolist() # customers whose indrel_1mes changes
d1_changed = d1.loc[d1.ncodpers.isin(d1_changed)]

In [260]:
d1_changed_head_tail = d1_changed.groupby('ncodpers')['indrel_1mes'].apply(lambda x: (x.iloc[0], x.iloc[-1]))

In [261]:
d1_changed_head_tail.value_counts()

(3, 1)    2742
(2, 1)    1120
(P, 1)     730
(4, 1)     241
(1, 1)      23
(1, 3)       4
Name: indrel_1mes, dtype: int64

Customers whose 
- `ult_fec_cli_1t` is always NA
- `indrel_1mes` changes from 1 to 3 then to 1 again
- 23 customers

In [265]:
d11 = d1_changed_head_tail.loc[d1_changed_head_tail==(1,1)].index.tolist()
d1_changed.loc[d1_changed.ncodpers.isin(d11)].groupby('ncodpers').indrel_1mes.apply(lambda x: x.unique())

Customers whose
- `ult_fec_cli_1t` is always NA
- `indrel_1mes` changes from 1 to 3
- `indrel_1mes` changes to 3 at the last month in 2016 May
- 4 customers

In [282]:
d13 = d1_changed_head_tail.loc[d1_changed_head_tail==(1, 3)].index.tolist()
# d1_changed.loc[d1_changed.ncodpers.isin(d13)]

Customers whose 
- `ult_fec_cli_1t` is always NA
- `indrel_1mes` changes from other types to 1

In [284]:
d31 = d1_changed_head_tail.loc[d1_changed_head_tail==(3, 1)].index.tolist()
d31 = d1_changed.loc[d1_changed.ncodpers.isin(d31)].groupby('ncodpers').indrel_1mes.apply(lambda x: sum(x.diff().fillna(0)!=0))

In [290]:
d21 = d1_changed_head_tail.loc[d1_changed_head_tail==(2, 1)].index.tolist()
d21 = d1_changed.loc[d1_changed.ncodpers.isin(d21)].groupby('ncodpers').indrel_1mes.apply(lambda x: sum(x.diff().fillna(0)!=0))

In [302]:
d21.tail()

ncodpers
1540160    1
1540903    1
1541123    1
1542164    1
1542842    1
Name: indrel_1mes, dtype: int64

In [304]:
d1.loc[d1.ncodpers==1540160]

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato
11929417,1540160,2,2016-04-28
13327362,1540160,1,2016-05-28


In [307]:
d1_changed_head_tail.value_counts().index.tolist()[2]

('P', 1)

In [309]:
dp1 = d1_changed_head_tail.loc[d1_changed_head_tail==('P', 1)].index.tolist()

In [310]:
def p1(x):
    x = x.map({'P': 0, 1: 1})
    return sum(x.diff().fillna(0)!=0)

In [311]:
dp1 = d1_changed.loc[d1_changed.ncodpers.isin(dp1)].groupby('ncodpers').indrel_1mes.apply(p1)

In [312]:
dp1.value_counts()

1    730
Name: indrel_1mes, dtype: int64

In [314]:
dp1.head()

ncodpers
42865    1
43521    1
46374    1
46507    1
46732    1
Name: indrel_1mes, dtype: int64

In [318]:
d1.loc[d1.ncodpers==46374]

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato
4064552,46374,P,2015-07-28
5308797,46374,1,2015-08-28
5693723,46374,1,2015-09-28
6923255,46374,1,2015-10-28
7493046,46374,1,2015-11-28
8628757,46374,1,2015-12-28
9100545,46374,1,2016-01-28
10414974,46374,1,2016-02-28
11156395,46374,1,2016-03-28
12061700,46374,1,2016-04-28


In [319]:
d41 = d1_changed_head_tail.loc[d1_changed_head_tail==(4, 1)].index.tolist()
d41 = d1_changed.loc[d1_changed.ncodpers.isin(d41)].groupby('ncodpers').indrel_1mes.apply(lambda x: sum(x.diff().fillna(0)!=0))

In [320]:
d41.value_counts()

1    241
Name: indrel_1mes, dtype: int64

##### Customers whose `ult_fec_cli_1t` is always not NA

In [54]:
e2 = e.loc[e.count_ult_null==0].index.tolist() # customer ids whose ult_fec_cli_1t is always not NA

d2 = d.loc[d.ncodpers.isin(e2), ['ncodpers', 'indrel_1mes', 'fecha_dato', 'ult_fec_cli_1t']]
d2.sort_values(['ncodpers', 'fecha_dato'], inplace=True)

In [58]:
d2_group = d2.groupby('ncodpers')
a = d2_group.get_group(16817)

In [59]:
a

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
416600,16817,1,2015-01-28,2015-07-15
836621,16817,1,2015-02-28,2015-07-15
1674354,16817,1,2015-03-28,2015-07-15
2092577,16817,1,2015-04-28,2015-07-15
2933327,16817,1,2015-05-28,2015-07-15
3565387,16817,1,2015-06-28,2015-07-15
4079375,16817,1,2015-07-28,2015-07-15


In [62]:
d2_count_types = d2_group['indrel_1mes'].apply(count_types) # number of indrel_1mes types of each customer

In [63]:
d2_count_types.value_counts()

1    3468
2      13
Name: indrel_1mes, dtype: int64

13 customers 
- `ult_fec_cli_1t` is always not NA
- change between two customer types

In [None]:
for c in d2_count_types.loc[d2_count_types==2].index.tolist():
    print(d2_group.get_group(c))
    print()

In [83]:
d2.head()

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
416600,16817,1,2015-01-28,2015-07-15
836621,16817,1,2015-02-28,2015-07-15
1674354,16817,1,2015-03-28,2015-07-15
2092577,16817,1,2015-04-28,2015-07-15
2933327,16817,1,2015-05-28,2015-07-15


In [99]:
d2_one_type = d2_count_types.loc[d2_count_types==1].index.tolist() # list of ncodpers whose have only one type
d2_one_type_1 = d2.loc[d2.ncodpers.isin(d2_one_type)].groupby('ncodpers').indrel_1mes.agg({'count', lambda x: x.iloc[0]}) # customer types for customers only have one type
d2_one_type_2 = d2.loc[d2.ncodpers.isin(d2_one_type)].groupby('ncodpers').ult_fec_cli_1t.agg(lambda x: len(x.unique()))

In [100]:
d2_one_type_1.columns = ['indrel_1mes', 'count_sample']
d2_one_type_2 = pd.DataFrame(d2_one_type_2)

In [101]:
d2_one_type = d2_one_type_1.join(d2_one_type_2)

customers whose `ult_fec_cli_1t` is always not NA, and `indrel_1mes` has one value

In [102]:
d2_one_type.columns = ['indrel_1mes', 'count_sample', 'count_unique_ult']

In [103]:
d2_one_type.groupby(['indrel_1mes', 'count_sample', 'count_unique_ult']).count_sample.count()

indrel_1mes  count_sample  count_unique_ult
1            1             1                   2170
             2             1                     22
             3             1                     23
             4             1                     51
             5             1                     38
             6             1                     44
             7             1                    649
2            1             1                      9
             4             1                      1
3            1             1                    401
             2             2                     19
             3             3                      1
             5             5                      1
4            1             1                     14
P            1             1                     25
Name: count_sample, dtype: int64

In [96]:
d2_one_type.loc[(d2_one_type.indrel_1mes==1) & (d2_one_type.count_sample==7)].head()

Unnamed: 0_level_0,indrel_1mes,count_sample,count_unique_ult
ncodpers,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
16817,1,7,7
18352,1,7,7
21382,1,7,7
23638,1,7,7
26036,1,7,7


In [98]:
d2.loc[d2.ncodpers==16817]

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
416600,16817,1,2015-01-28,2015-07-15
836621,16817,1,2015-02-28,2015-07-15
1674354,16817,1,2015-03-28,2015-07-15
2092577,16817,1,2015-04-28,2015-07-15
2933327,16817,1,2015-05-28,2015-07-15
3565387,16817,1,2015-06-28,2015-07-15
4079375,16817,1,2015-07-28,2015-07-15


In [393]:
d2.loc[d2.ncodpers==18352]

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
417413,18352,1,2015-01-28,2015-07-03
835764,18352,1,2015-02-28,2015-07-03
1673520,18352,1,2015-03-28,2015-07-03
2091741,18352,1,2015-04-28,2015-07-03
2934199,18352,1,2015-05-28,2015-07-03
3566258,18352,1,2015-06-28,2015-07-03
4078578,18352,1,2015-07-28,2015-07-03


In [359]:
d2.loc[d2.ncodpers==24170]

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
5315823,24170,3,2015-08-28,2015-08-19


In [360]:
d2.loc[d2.ncodpers==26652]

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
8620173,26652,3,2015-12-28,2015-12-02


In [396]:
k3 = k2.loc[k2!=1].index.tolist() # customers who have only one customer type and the type is not 1

In [397]:
len(k3)

471

22 customers 
- have only one customer type but multiple rows
- either `indrel_1mes` wrong, or `indrel_1mes==2` include `indrel_1mes==1`
- some record are missing
- become primary and then leave multiple times

##### `ult_fec_cli_1t` is not always NA

In [104]:
e3 = e.loc[(e.count_ult_null>0) & (e.count_ult_notnull>0)].index.tolist() # customer ids whose ult_fec_cli_1t is not always NA

In [107]:
len(e3)

12546

In [108]:
d3 = d.loc[d.ncodpers.isin(e3), ['ncodpers', 'indrel_1mes', 'fecha_dato', 'ult_fec_cli_1t']]
d3.sort_values(['ncodpers', 'fecha_dato'], inplace=True)

In [109]:
d3.shape

(113965, 4)

In [111]:
d3.head(5)

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
4078994,15891,2,2015-07-28,NaT
5319232,15891,1,2015-08-28,2015-08-05
4079035,16136,1,2015-07-28,NaT
5319290,16136,1,2015-08-28,NaT
5705240,16136,1,2015-09-28,NaT


In [120]:
d3_count = d3.groupby('ncodpers').agg({'indrel_1mes': lambda x: len(x.unique()), 'ult_fec_cli_1t': lambda x: len(x.unique())})

In [121]:
d3_count.head()

Unnamed: 0_level_0,indrel_1mes,ult_fec_cli_1t
ncodpers,Unnamed: 1_level_1,Unnamed: 2_level_1
15891,2,2
16136,1,2
16137,1,2
16254,1,2
16283,2,2


In [122]:
d3_count.groupby(['indrel_1mes', 'ult_fec_cli_1t']).indrel_1mes.count()

indrel_1mes  ult_fec_cli_1t
1            2                 11664
             3                     5
2            2                   787
             3                    74
             4                     1
             5                     3
3            2                    10
             3                     1
             4                     1
Name: indrel_1mes, dtype: int64

In [126]:
d3.ult_fec_cli_1t = d3.ult_fec_cli_1t.fillna(method='bfill')

In [127]:
d3

Unnamed: 0,ncodpers,indrel_1mes,fecha_dato,ult_fec_cli_1t
4078994,15891,2,2015-07-28,2015-08-05
5319232,15891,1,2015-08-28,2015-08-05
4079035,16136,1,2015-07-28,2016-03-17
5319290,16136,1,2015-08-28,2016-03-17
5705240,16136,1,2015-09-28,2016-03-17
6909741,16136,1,2015-10-28,2016-03-17
7451203,16136,1,2015-11-28,2016-03-17
8618232,16136,1,2015-12-28,2016-03-17
9401330,16136,1,2016-01-28,2016-03-17
10425949,16136,1,2016-02-28,2016-03-17


#### `ult_fec_cli_1t` and `indrel`

`indrel`: 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)

1. `indrel` depends on `ult_fec_cli_1t`:
    - `ult_fec_cli_1t` not NA, `indrel==99`
    - `ult_fec_cli_1t` is NA, `indrel==1` or `indrel=='unknown'`
2. What is the definition of primary customer?
    - Almost all the customers are primary?
    - How about types indicated by `indrel_1mes`, like co-owner, potential, former primary, former co-owner?
3. `indrel` only indicates if `ult_fec_cli_1t` of the same row is NA or not
    - `ult_fec_cli_1t` is NA, `indrel==1`
    - `ult_fec_cli_1t` is not NA, `indrel==99`

However, `indrel==99` for all samples with `ult_fec_cli_1t` not NA, this is not consistent with the definition of `ult_fec_cli_1t`

In [136]:
a.indrel.value_counts()

99.0    24793
Name: indrel, dtype: int64

And for samples with `ult_fec_cli_1t` is NA, `indrel` is either 1 or 'unknown'

In [137]:
df.loc[df.ult_fec_cli_1t.isnull(), 'indrel'].value_counts()

1.0        13594782
unknown       27734
Name: indrel, dtype: int64

So, `indrel` is not very informative, or we can set `indrel=='unknown'` to `indrel=1`

In [134]:
a.indrel.value_counts()

99.0    24793
Name: indrel, dtype: int64

If a customer has both `ult_fec_cli_1t` NA and not NA, what is `indrel`?

In [154]:
a = df.loc[:, ['ncodpers', 'ult_fec_cli_1t', 'indrel']].copy()
a.ult_fec_cli_1t = a.ult_fec_cli_1t.isnull()

In [155]:
a.groupby(['ult_fec_cli_1t', 'indrel']).ncodpers.count()

ult_fec_cli_1t  indrel 
False           99.0          24793
True            1.0        13594782
                unknown       27734
Name: ncodpers, dtype: int64