# **CxC III - Infinite Investments Challenge**
### By the Secret Agents - Thirandie, Deepika, Lisa, Andy

Hi Judges! This is our first time ever interacting with anything related to Data Science, so we are just happy to be here and to learn more about the subject!

We have separated our notebook into a few sections:
1. Preliminary Thoughts and Research
2. Data Preprocessing
3. Running the Model
4. Generating Answers

Our process was to first lay out our data pipeline, running through all the steps up to fitting the model and getting a cross-validation score. Then we went through all the steps again and applied them to the test dataset simultaneously.

## Part 1: Preliminary Thoughts and Research

In [1]:
# From Starting Docs - Importing Relevant history.csv from downloaded location in Personal Drive

import pandas as pd
df = pd.read_csv("./history.csv")
test = pd.read_csv("./test.csv")


Let's get some quick summary statistics on our history data set.

In [2]:
# This tells us the rows and columns there are in the dataset
df.shape

(673339, 106)

In [3]:
# As mentioned in the provided Python notebook, there are a lot more churned customers than unchurned ones.
df.groupby(['label']).id.agg('count')

label
Churn       538444
No Churn    134895
Name: id, dtype: int64

In [4]:
# Another notable feature of this dataset is the number of empty values - some columns have little to no data at all!
pd.options.display.max_rows = 128
df.isna().sum()

id                                         0
type_code                                237
is_registered                              0
country_code                          669050
currency_code                              0
is_active                                  0
class_id                                3198
debit_code                              3198
last_trade_date                       307875
contract_type                          95823
inception_date                          3198
net_of_fees                                0
cashflows_custody_fee                 673339
fee_paid_separately                        0
custody_fee_withdrawal                     0
is_fee_exempt                              0
branch                                  3198
include_client_consolidation               0
use_client_address                         0
credit_limit_type                       3198
retail_plan                           316985
is_spousal                            313694
is_arp_loc

In [5]:
# In fact, there is no row in this file without at least one empty column!
df_no_null = df.dropna()
df_no_null

Unnamed: 0,id,type_code,is_registered,country_code,currency_code,is_active,class_id,debit_code,last_trade_date,contract_type,...,is_hrdc_resp,is_plan_grandfathered,resp_specimen_plan,inserted_at,updated_at,is_olob,retail_last_maintenance_time,retail_last_maintenance_user,visible_in_reports,label


Considering all the summary statistics, we have come to the following conclusions:

1. The missing values have to be dealt with, since without handling them there is very little data left to work with. We plan to use the [MICE algorithm](https://medium.com/@brijesh_soni/topic-9-mice-or-multivariate-imputation-with-chain-equation-f8fd435ca91#:~:text=MICE%20stands%20for%20Multivariate%20Imputation,produce%20a%20final%20imputed%20dataset.) for numerical and date data, and to simply use the mode for categorical data. (We recognize that a lot of the categorical data is missing, so the mode might not be fully accurate; however, it is a reasonable enough strategy.)
2. As for the imbalanced dataset, we plan to take the advice given in the provided notebook and balance the Churn/No Churn entries before training the model. Fortunately, there are Python packages that handle this for us, namely [Imbalanced Learn](https://imbalanced-learn.org/stable/index.html).
3. To further reduce variance and to combat the effects of potentially erroneous preprocessing, we will employ the Random Forest Classifier, since each of its trees is trained on a random sample of the rows and considers a random subset of the columns at each split. This should hopefully improve the model's accuracy and recall.
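
As a rough illustration of what balancing means in point 2, here is a minimal sketch of random undersampling in plain pandas. It is only a stand-in for the kind of resampling Imbalanced Learn offers, not that library's API, and the `toy` frame and its values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the history data: 8 "Churn" rows and 2 "No Churn" rows (made-up values).
toy = pd.DataFrame({
    "id": range(10),
    "label": ["Churn"] * 8 + ["No Churn"] * 2,
})

# Size of the smallest class; every class is sampled down to this many rows.
minority_size = toy["label"].value_counts().min()

# Randomly keep `minority_size` rows from each class (random undersampling).
balanced = toy.groupby("label").sample(n=minority_size, random_state=0)

print(balanced["label"].value_counts())
```

Undersampling throws away majority-class rows, so on the real data it trades information for balance; oversampling the minority class is the usual alternative.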

## Part 2: Data Preprocessing

### Section 0: The Types of Columns we are dealing with
Before starting, we should unify all missing values as `np.nan` for consistency. We will also look at each of the columns to determine its type.

In [6]:
import numpy as np

# Overwrite every missing entry with np.nan (the np.NaN alias was removed in NumPy 2.0)
null_mask = df.isna()
df[null_mask] = np.nan

test_null_mask = test.isna()
test[test_null_mask] = np.nan

df.head(3)

Unnamed: 0,id,type_code,is_registered,country_code,currency_code,is_active,class_id,debit_code,last_trade_date,contract_type,...,is_hrdc_resp,is_plan_grandfathered,resp_specimen_plan,inserted_at,updated_at,is_olob,retail_last_maintenance_time,retail_last_maintenance_user,visible_in_reports,label
0,893724,CASH SWEEP,f,,CAD,t,3.0,0,,,...,,,,2023-05-30 14:20:18.531115+00,2023-08-08 18:53:01.439561+00,f,,,f,Churn
1,1268094,RRIF,t,,CAD,t,5.0,C,2022-07-04,18.0,...,f,f,,2023-05-30 14:20:18.531115+00,2023-08-10 21:47:25.370403+00,f,2023-01-03 00:00:00,BATCH,t,Churn
2,606613,SPOUSAL RRSP,t,,CAD,t,5.0,A,2018-07-05,16.0,...,f,f,,2023-05-30 14:20:18.531115+00,2023-08-10 21:47:25.370403+00,f,2018-07-17 00:00:00,GUERINO,f,Churn


It looks the same, but we are now assured that all the missing values are actual NaNs. As for the columns:

In [7]:
# We determine which of the columns are numerical vs categorical with the following code:

# Looking at the test data, it seems that some of the fields that seem to be numerical are actually categorical, which means manual tweaking!
numerical_cols = ['id', 'class_id', 'contract_type', 'cashflows_custody_fee', 'credit_limit_type', 
                  'dividend_confirm_code', 'options_trading_type', 'rep_commission_rate', 'rep_commission_override', 
                  'interest_dividend_conversion_type', 'guarantee_gtor_type', 'deceased_fair_market_value', 'target_grantor_grantee_flag', 
                  'esir_number', 'portfolio_cost_method', 'portfolio_report_option',
                  'interactive_portfolio_code', 'mailing_consent', 'number_of_beneficiaries', 'resp_specimen_plan']
categorical_cols = [
    col for col in df.columns
    if col not in numerical_cols
]

print("Numerical Columns: ", numerical_cols)
print("Categorical Columns: ", categorical_cols)

print("Number of numerical columns in the DataFrame:", len(numerical_cols))
print("Number of categorical columns in the DataFrame:", len(categorical_cols))

Numerical Columns:  ['id', 'class_id', 'contract_type', 'cashflows_custody_fee', 'credit_limit_type', 'dividend_confirm_code', 'options_trading_type', 'rep_commission_rate', 'rep_commission_override', 'interest_dividend_conversion_type', 'guarantee_gtor_type', 'deceased_fair_market_value', 'target_grantor_grantee_flag', 'esir_number', 'portfolio_cost_method', 'portfolio_report_option', 'interactive_portfolio_code', 'mailing_consent', 'number_of_beneficiaries', 'resp_specimen_plan']
Categorical Columns:  ['type_code', 'is_registered', 'country_code', 'currency_code', 'is_active', 'debit_code', 'last_trade_date', 'inception_date', 'net_of_fees', 'fee_paid_separately', 'custody_fee_withdrawal', 'is_fee_exempt', 'branch', 'include_client_consolidation', 'use_client_address', 'retail_plan', 'is_spousal', 'is_arp_locked', 'arp_pension_origin', 'language_code', 'sss_location', 'sss_type', 'sss_agent', 'is_midwest_clearing_account', 'use_hand_delivery', 'use_mail', 'share_name_address_to_issue

Nominal columns are those that contain words or a small set of categories. Notably, some of these columns contain names or similar values and so would have thousands of categories. However, since a few of those values appear in a large number of rows, it is still important to keep that information. To handle this, we introduce an additional 'infrequent' category that collects every value without enough entries. The cutoff will be determined manually.
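
To make the 'infrequent' idea concrete, here is a small pandas sketch of grouping rare values under one label. The `users` series and the `min_count` cutoff are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical column: two common values and several one-off names.
users = pd.Series(["BATCH"] * 6 + ["T80"] * 4 + ["ALICE", "BOB", "CAROL"])

min_count = 3  # hypothetical cutoff for this toy example

# Count how often each value occurs and keep the ones at or above the cutoff.
counts = users.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent values as they are and relabel everything else as 'infrequent'.
grouped = users.where(users.isin(frequent), "infrequent")

print(grouped.value_counts())
```

Looking at `value_counts()` on a real column in this way is also a quick check on whether a given cutoff leaves a sensible number of categories.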

In [8]:
categorical_cols_nominals = ['type_code', 'country_code', 'currency_code', 'debit_code', 'branch', 'retail_plan', 'language_code', 'sss_agent', 'terminal_code', 'iso_funds_code',
                           'dup_trip_quad_code',  'special_tag', 'non_plan_book_value_flag', 'portfolio_summary_option',
                           'risk_tolerance', 'investment_objective', 'last_maintenance_user', 'portfolio_name_address_option',
                            'retail_last_maintenance_user', 'arp_pension_origin', 'conjunction', 'loan_limit_override', 'special_fee_code']
print("Number of categoricalcols_nominal in DataFrame:", len(categorical_cols_nominals))

Number of categoricalcols_nominal in DataFrame: 23


In [9]:
categorical_cols_bools = ['is_registered', 'is_active', 'tms_settlement_location',
                         'net_of_fees', 'fee_paid_separately', 'custody_fee_withdrawal',
                         'is_fee_exempt', 'include_client_consolidation', 'use_client_address',
                         'is_spousal',  'is_arp_locked', 'sss_location', 'sss_type', 'use_hand_delivery',
                          'use_mail', 'share_name_address_to_issuer', 'shareholder_instructions_received',
                          'rrsp_limit_reached',  'is_portfolio_account', 'has_no_min_commission',
                          'is_tms_eligible', 'is_agent_bbs_participant', 'is_parameters_account',
                          'is_spousal_transfer', 'spousal_age_flag', 'has_multiple_name',
                          'discretionary_trading_authorized', 'shareholder_language', 'title',
                          'function_code',  'receive_general_mailings', 'has_discrete_auth',
                          'is_non_objecting_beneficial_owner', 'is_objecting_to_disclose_info',
                          'consent_to_pay_for_mail', 'consent_to_email_delivery',
                          'has_received_instruction', 'is_broker_account',
                          'is_inventory_account', 'is_gl_account', 'is_control_account',
                          'is_extract_eligible', 'is_pledged', 'is_resp',
                          'use_original_date_for_payment_calc',  'is_family_resp',
                          'is_hrdc_resp',  'is_plan_grandfathered', 'is_olob', 'visible_in_reports', 'is_midwest_clearing_account']
print("Number of categoricalcols_bools in DataFrame:", len(categorical_cols_bools))

Number of categoricalcols_bools in DataFrame: 51


Boolean columns are those that hold simple true/false values, which is nice! The representation of true and false differs between columns (for example `t`/`f` versus `True`/`False`), but that is fine.

In [10]:
categorical_cols_date = ['last_trade_date', 'inception_date', 'last_update_date', 'last_maintenance_time',  'non_calendar_year_end',
                         'plan_effective_date', 'plan_end_date',  'rrif_original_date',
                        'inserted_at',  'updated_at', 'retail_last_maintenance_time' ]

print("Number of categoricalcols_date in DataFrame:", len(categorical_cols_date))

Number of categoricalcols_date in DataFrame: 11


These categorical columns are actually date columns, so we need to process them later. From the names, it seems that converting each one into a number relative to the current time would work as a transformation.
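
The transformation can be sketched on a few made-up values as follows. The `dates` series is hypothetical, and a fixed `reference` timestamp is used here instead of the current time purely so the example is reproducible.

```python
import pandas as pd

# Hypothetical date strings, plus one value that cannot be parsed.
dates = pd.Series(["2023-01-01", "2023-01-02", "not a date"])

# Fixed reference instant (an assumption for reproducibility).
reference = pd.Timestamp("2023-01-03", tz="UTC")

# Unparseable values become NaT, which then turns into NaN after the subtraction.
parsed = pd.to_datetime(dates, errors="coerce", utc=True)
seconds = (parsed - reference).dt.total_seconds()

print(seconds)
```

Dates before the reference come out negative, and anything that fails to parse ends up as a missing value that still has to be imputed.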

In [11]:
label_col = ['label']

Last but not least, we have our label column, which is the output.

### Section 1: Converting the Date Columns
Let's hope Python's date libraries are strong enough that we don't have to put too much effort into converting them.

In [12]:
# Take one UTC reference timestamp so every column, in both datasets, is measured against the same instant
now = pd.Timestamp.now(tz='UTC')

for col in categorical_cols_date:
  # Convert date string into date object, then subtract the reference time and get the seconds in between.
  # Anything that can't be parsed becomes NaT (and then NaN) instead of raising an error.
  df[col] = pd.to_datetime(df[col], errors='coerce', utc=True)
  df[col] = (df[col] - now).dt.total_seconds()

  test[col] = pd.to_datetime(test[col], errors='coerce', utc=True)
  test[col] = (test[col] - now).dt.total_seconds()

df


Unnamed: 0,id,type_code,is_registered,country_code,currency_code,is_active,class_id,debit_code,last_trade_date,contract_type,...,is_hrdc_resp,is_plan_grandfathered,resp_specimen_plan,inserted_at,updated_at,is_olob,retail_last_maintenance_time,retail_last_maintenance_user,visible_in_reports,label
0,893724,CASH SWEEP,f,,CAD,t,3.0,0,,,...,,,,-2.325738e+07,-1.719301e+07,f,,,f,Churn
1,1268094,RRIF,t,,CAD,t,5.0,C,-5.182099e+07,18.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-3.600980e+07,BATCH,t,Churn
2,606613,SPOUSAL RRSP,t,,CAD,t,5.0,A,-1.779650e+08,16.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-1.769282e+08,GUERINO,f,Churn
3,741930,CASH,f,,CAD,t,3.0,T,,12.0,...,,,,-2.325738e+07,-1.719301e+07,f,,,t,Churn
4,1137922,CASH,f,,CAD,t,3.0,T,,17.0,...,,,,-2.325738e+07,-1.718587e+07,f,,,f,Churn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,6827067,LIRA/LRSP/RLSP,t,CAN,CAD,t,5.0,C,,18.0,...,f,f,,-2.078281e+07,-1.700284e+07,f,-2.080340e+07,BATCH,t,Churn
673335,590659,REG RRSP,t,,CAD,f,5.0,A,-2.101058e+08,18.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-1.799522e+08,BATCH,f,No Churn
673336,1247774,CASH,f,,CAD,t,3.0,T,-7.627219e+07,16.0,...,,,,-2.325738e+07,-1.719301e+07,f,,,t,Churn
673337,1155640,RRIF,t,,CAD,t,5.0,C,,17.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-2.814740e+07,T80,t,Churn


### Section 2: Imputing the Non-numerical Columns
We are now ready to impute the missing values in the non-numerical columns. For the nominal and boolean columns, we fill each gap with the mode (the most frequent value) of that column.
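
As a quick illustration of mode imputation, here is a toy sketch with `SimpleImputer`. The `train` and `new` frames are made up; the point is that the mode is learned from the training data only and then reused on unseen data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and unseen data with one categorical column.
train = pd.DataFrame({"branch": ["IAVM", "IAVM", "HOLIS", np.nan]}, dtype=object)
new = pd.DataFrame({"branch": [np.nan, "HOLIS"]}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# fit_transform learns the mode from `train`; transform reuses it on `new`.
train_filled = imputer.fit_transform(train)
new_filled = imputer.transform(new)

print(train_filled.ravel().tolist())
print(new_filled.ravel().tolist())
```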

In [13]:
# Import the Simple Imputer for those Categorical items
from sklearn.impute import SimpleImputer

nominalToImpute = df[categorical_cols_nominals]
test_nominalToImpute = test[categorical_cols_nominals]

Imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
Imputer.set_output(transform="pandas")

nominalToImpute = Imputer.fit_transform(nominalToImpute)
test_nominalToImpute = Imputer.transform(test_nominalToImpute)
nominalToImpute

Unnamed: 0,type_code,country_code,currency_code,debit_code,branch,retail_plan,language_code,sss_agent,terminal_code,iso_funds_code,...,portfolio_summary_option,risk_tolerance,investment_objective,last_maintenance_user,portfolio_name_address_option,retail_last_maintenance_user,arp_pension_origin,conjunction,loan_limit_override,special_fee_code
0,CASH SWEEP,CAN,CAD,0,IAVM,RS,E,SEC 99999,BATCH,CAD,...,0.0,M99,G99,T80,0.0,BATCH,ON,NOT,CL,N
1,RRIF,CAN,CAD,C,IAVM,RI,E,SEC 99999,BATCH,CAD,...,0.0,H10M90,B05G85S10,BATCH,0.0,BATCH,ON,NOT,CL,N
2,SPOUSAL RRSP,CAN,CAD,A,IAVM,RS,F,SEC 99999,G023,CAD,...,0.0,H50M50S00,G50S50,GUERINO,0.0,GUERINO,ON,NOT,CL,N
3,CASH,CAN,CAD,T,IAVM,RS,E,SEC 99999,113C,CAD,...,0.0,M99,G99,T80,0.0,BATCH,ON,NOT,CL,N
4,CASH,CAN,CAD,T,IAVM,RS,E,SEC 99999,BATCH,CAD,...,0.0,M99,G99,H01,0.0,BATCH,ON,NOT,CL,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,LIRA/LRSP/RLSP,CAN,CAD,C,HOLIS,RS,E,SEC 99999,BATCH,CAD,...,0.0,M99,G99,BATCH,0.0,BATCH,FD,NOT,CL,N
673335,REG RRSP,CAN,CAD,A,IAVM,RS,F,SEC 99999,BATCH,CAD,...,0.0,M99,G99,ISLUSR,0.0,BATCH,ON,NOT,CL,N
673336,CASH,CAN,CAD,T,IAVM,RS,E,SEC 99999,BATCH,CAD,...,0.0,H25M75,G99,BATCH,0.0,BATCH,ON,NOT,CL,N
673337,RRIF,CAN,CAD,C,IAVM,RI,E,SEC 99999,107C,CAD,...,0.0,M99,G99,T80,0.0,T80,ON,NOT,CL,N


In [14]:
from sklearn.impute import SimpleImputer

boolsToImpute = df[categorical_cols_bools]
test_boolsToImpute = test[categorical_cols_bools]

boolImputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
boolImputer.set_output(transform="pandas")

boolsToImpute = boolImputer.fit_transform(boolsToImpute)
test_boolsToImpute = boolImputer.transform(test_boolsToImpute)
boolsToImpute

Unnamed: 0,is_registered,is_active,tms_settlement_location,net_of_fees,fee_paid_separately,custody_fee_withdrawal,is_fee_exempt,include_client_consolidation,use_client_address,is_spousal,...,is_extract_eligible,is_pledged,is_resp,use_original_date_for_payment_calc,is_family_resp,is_hrdc_resp,is_plan_grandfathered,is_olob,visible_in_reports,is_midwest_clearing_account
0,f,t,TOR,f,f,f,f,t,f,f,...,t,f,f,f,f,f,f,f,f,False
1,t,t,TOR,f,f,f,f,t,t,f,...,t,f,f,f,f,f,f,f,t,False
2,t,t,TOR,f,f,f,f,f,f,t,...,t,f,f,f,f,f,f,f,f,False
3,f,t,TOR,f,f,f,f,t,t,f,...,t,f,f,f,f,f,f,f,t,False
4,f,t,TOR,f,f,f,f,t,f,f,...,t,f,f,f,f,f,f,f,f,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,t,t,TOR,f,f,f,f,t,t,f,...,t,f,f,f,f,f,f,f,t,False
673335,t,f,TOR,f,f,f,f,t,f,f,...,t,f,f,f,f,f,f,f,f,False
673336,f,t,TOR,f,f,f,f,t,t,f,...,t,f,f,f,f,f,f,f,t,False
673337,t,t,TOR,f,f,f,f,t,t,f,...,t,f,f,f,f,f,f,f,t,False


In [15]:
# Merge them back in
df[categorical_cols_nominals] = nominalToImpute
df[categorical_cols_bools] = boolsToImpute

test[categorical_cols_nominals] = test_nominalToImpute
test[categorical_cols_bools] = test_boolsToImpute

df

Unnamed: 0,id,type_code,is_registered,country_code,currency_code,is_active,class_id,debit_code,last_trade_date,contract_type,...,is_hrdc_resp,is_plan_grandfathered,resp_specimen_plan,inserted_at,updated_at,is_olob,retail_last_maintenance_time,retail_last_maintenance_user,visible_in_reports,label
0,893724,CASH SWEEP,f,CAN,CAD,t,3.0,0,,,...,f,f,,-2.325738e+07,-1.719301e+07,f,,BATCH,f,Churn
1,1268094,RRIF,t,CAN,CAD,t,5.0,C,-5.182099e+07,18.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-3.600980e+07,BATCH,t,Churn
2,606613,SPOUSAL RRSP,t,CAN,CAD,t,5.0,A,-1.779650e+08,16.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-1.769282e+08,GUERINO,f,Churn
3,741930,CASH,f,CAN,CAD,t,3.0,T,,12.0,...,f,f,,-2.325738e+07,-1.719301e+07,f,,BATCH,t,Churn
4,1137922,CASH,f,CAN,CAD,t,3.0,T,,17.0,...,f,f,,-2.325738e+07,-1.718587e+07,f,,BATCH,f,Churn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,6827067,LIRA/LRSP/RLSP,t,CAN,CAD,t,5.0,C,,18.0,...,f,f,,-2.078281e+07,-1.700284e+07,f,-2.080340e+07,BATCH,t,Churn
673335,590659,REG RRSP,t,CAN,CAD,f,5.0,A,-2.101058e+08,18.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-1.799522e+08,BATCH,f,No Churn
673336,1247774,CASH,f,CAN,CAD,t,3.0,T,-7.627219e+07,16.0,...,f,f,,-2.325738e+07,-1.719301e+07,f,,BATCH,t,Churn
673337,1155640,RRIF,t,CAN,CAD,t,5.0,C,,17.0,...,f,f,,-2.325738e+07,-1.700975e+07,f,-2.814740e+07,T80,t,Churn


### Section 3: One-Hot Encoding
Before imputing numerical values, we first one-hot encode everything else. However, as mentioned earlier, some of the fields contain literally thousands of distinct values, so we group the categories that are too rare to be useful.

Of course, since we already backfilled our categorical data with the mode, we need to be careful about what the cutoff should be. Manual checking from earlier suggests a minimum frequency of about **500**. We also have to cap the maximum number of categories at 20; otherwise we get around 1000 columns, which isn't great. The value 20 was deliberately chosen because of `type_code`.

In [16]:
from sklearn.preprocessing import OneHotEncoder

nominalOneHot = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='infrequent_if_exist', min_frequency=500, max_categories=20)
nominalOneHot.set_output(transform="pandas")

nomOneHot = df[categorical_cols_nominals].astype(str)
test_nomOneHot = test[categorical_cols_nominals].astype(str)

nomOneHot = nominalOneHot.fit_transform(nomOneHot)
test_nomOneHot = nominalOneHot.transform(test_nomOneHot)

nomOneHot



Unnamed: 0,type_code_CASH SWEEP,type_code_COD,type_code_LIRA/LRSP/RLSP,type_code_MISSING,type_code_MRGN,type_code_OFFBOOK,type_code_RDSP,type_code_REG RRSP,type_code_RESP,type_code_RRIF,...,arp_pension_origin_FD,arp_pension_origin_MB,arp_pension_origin_NB,arp_pension_origin_ON,arp_pension_origin_QC,arp_pension_origin_SK,arp_pension_origin_infrequent_sklearn,conjunction_infrequent_sklearn,loan_limit_override_infrequent_sklearn,special_fee_code_infrequent_sklearn
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
673335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
673336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
673337,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
from sklearn.preprocessing import OneHotEncoder

booleanOneHot = OneHotEncoder(sparse_output=False, drop='if_binary', handle_unknown='infrequent_if_exist', min_frequency=500)
booleanOneHot.set_output(transform="pandas")

boolOneHot = df[categorical_cols_bools].astype(str)
test_boolOneHot = test[categorical_cols_bools].astype(str)

boolOneHot = booleanOneHot.fit_transform(boolOneHot)
test_boolOneHot = booleanOneHot.transform(test_boolOneHot)

boolOneHot



Unnamed: 0,is_registered_t,is_active_t,tms_settlement_location_DTC,tms_settlement_location_TOR,tms_settlement_location_infrequent_sklearn,net_of_fees_f,fee_paid_separately_f,custody_fee_withdrawal_f,is_fee_exempt_f,include_client_consolidation_t,...,is_extract_eligible_t,is_pledged_infrequent_sklearn,is_resp_t,use_original_date_for_payment_calc_t,is_family_resp_t,is_hrdc_resp_t,is_plan_grandfathered_f,is_olob_t,visible_in_reports_t,is_midwest_clearing_account_False
0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
2,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
4,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
673335,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
673336,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
673337,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0


In [18]:
# Merge them back in
df = pd.concat([df, nomOneHot, boolOneHot], axis=1)
df = df.drop(categorical_cols_nominals, axis=1).drop(categorical_cols_bools, axis=1)

test = pd.concat([test, test_nomOneHot, test_boolOneHot], axis=1)
test = test.drop(categorical_cols_nominals, axis=1).drop(categorical_cols_bools, axis=1)

df

Unnamed: 0,id,class_id,last_trade_date,contract_type,inception_date,cashflows_custody_fee,credit_limit_type,dividend_confirm_code,options_trading_type,rep_commission_rate,...,is_extract_eligible_t,is_pledged_infrequent_sklearn,is_resp_t,use_original_date_for_payment_calc_t,is_family_resp_t,is_hrdc_resp_t,is_plan_grandfathered_f,is_olob_t,visible_in_reports_t,is_midwest_clearing_account_False
0,893724,3.0,,,-2.056994e+08,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,1268094,5.0,-5.182099e+07,18.0,-1.670786e+08,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
2,606613,5.0,-1.779650e+08,16.0,-4.695650e+08,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,741930,3.0,,12.0,-9.899539e+07,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
4,1137922,3.0,,17.0,-2.081186e+08,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,6827067,5.0,,18.0,-2.088979e+07,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
673335,590659,5.0,-2.101058e+08,18.0,-5.444738e+08,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
673336,1247774,3.0,-7.627219e+07,16.0,-1.415906e+08,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
673337,1155640,5.0,,17.0,-2.081186e+08,,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0


### Section 4: MICE Imputing
Now for the rest of the columns, namely the numerical and date ones. MICE time.
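
For intuition, here is a toy sketch of what iterative imputation does: each column with gaps is modelled from the other columns, and the missing entries are filled with the predictions. The matrix `X` is made up, with the second column set to twice the first.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Made-up data where the second column is twice the first, with one gap in each column.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

imputer = IterativeImputer(random_state=0, initial_strategy="median")
X_filled = imputer.fit_transform(X)

# Observed values are left untouched; only the gaps are replaced by estimates.
print(X_filled)
```

The filled-in values are model estimates rather than exact answers, so they should land near the underlying relationship without necessarily matching it.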

In [19]:
# First, we need to cut off the label, since that isn't a number and MICE cannot handle it.
df_X = df.drop(['label'], axis=1)
df_Y = df['label']

# No label in Test!
test_X = test

In [20]:
# Apparently it's still an experimental feature within sklearn - but that won't stop us!
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

total_num_cols = np.concatenate((numerical_cols, categorical_cols_date))

numImpute = IterativeImputer(random_state=23950, initial_strategy='median')
numImpute.set_output(transform="pandas")

toNumImpute = df_X[total_num_cols]
toNumImpute = numImpute.fit_transform(toNumImpute)

toNumImpute



Unnamed: 0,id,class_id,contract_type,credit_limit_type,dividend_confirm_code,options_trading_type,rep_commission_rate,rep_commission_override,interest_dividend_conversion_type,guarantee_gtor_type,...,last_trade_date,inception_date,last_update_date,last_maintenance_time,plan_effective_date,plan_end_date,rrif_original_date,inserted_at,updated_at,retail_last_maintenance_time
0,893724.0,3.0,14.415885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.034469e+07,-2.056994e+08,-8.590901e+07,-6.910099e+07,-3.100958e+08,-4.821977e+09,-9.093112e+08,-2.325738e+07,-1.719301e+07,-5.706625e+09
1,1268094.0,5.0,18.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-5.182099e+07,-1.670786e+08,-4.611859e+07,-3.600979e+07,-1.670786e+08,-4.591844e+07,-1.670786e+08,-2.325738e+07,-1.700975e+07,-3.600980e+07
2,606613.0,5.0,16.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-1.779650e+08,-4.695650e+08,-3.105026e+08,-9.294739e+07,-4.695650e+08,-1.769282e+08,-4.695650e+08,-2.325738e+07,-1.700975e+07,-1.769282e+08
3,741930.0,3.0,12.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-8.337585e+07,-9.899539e+07,-9.899539e+07,-9.899539e+07,-8.497189e+07,-3.406795e+09,-8.029122e+08,-2.325738e+07,-1.719301e+07,-4.806123e+09
4,1137922.0,3.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.029380e+08,-2.081186e+08,-2.078594e+08,-2.088979e+07,-3.381639e+08,-4.578731e+09,-4.264582e+08,-2.325738e+07,-1.718587e+07,-4.509569e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,6827067.0,5.0,18.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-6.547118e+07,-2.088979e+07,-2.088979e+07,-2.080339e+07,-2.088979e+07,-5.017152e+07,-2.088979e+07,-2.078281e+07,-1.700284e+07,-2.080340e+07
673335,590659.0,5.0,18.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-2.101058e+08,-5.444738e+08,-4.973858e+08,-1.307906e+08,-5.444738e+08,-2.084642e+08,-5.444738e+08,-2.325738e+07,-1.700975e+07,-1.799522e+08
673336,1247774.0,3.0,16.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-7.627219e+07,-1.415906e+08,-1.415906e+08,-1.411586e+08,-1.588691e+08,-3.714318e+09,-6.693150e+08,-2.325738e+07,-1.719301e+07,-4.682246e+09
673337,1155640.0,5.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-8.624546e+07,-2.081186e+08,-1.631042e+08,-2.814739e+07,-2.081186e+08,-7.118907e+07,-2.081186e+08,-2.325738e+07,-1.700975e+07,-2.814740e+07


In [21]:
# The imputer dropped the columns that were entirely empty when it was fitted (e.g. cashflows_custody_fee), and transform drops the same columns from test
test_toNumImpute = test_X[total_num_cols]
test_toNumImpute = numImpute.transform(test_toNumImpute)

test_toNumImpute



Unnamed: 0,id,class_id,contract_type,credit_limit_type,dividend_confirm_code,options_trading_type,rep_commission_rate,rep_commission_override,interest_dividend_conversion_type,guarantee_gtor_type,...,last_trade_date,inception_date,last_update_date,last_maintenance_time,plan_effective_date,plan_end_date,rrif_original_date,inserted_at,updated_at,retail_last_maintenance_time
0,1155742.0,3.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.198729e+08,-2.081186e+08,-2.078594e+08,-5.095699e+07,-3.298518e+08,-4.485207e+09,-3.856433e+08,-2.325738e+07,-1.719301e+07,-4.366572e+09
1,1269359.0,5.0,18.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-1.041794e+08,-1.250018e+08,-1.250018e+08,-4.248979e+07,-1.250018e+08,-8.790960e+07,-1.250018e+08,-2.325738e+07,-1.700975e+07,-4.248980e+07
2,573181.0,3.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.220570e+07,-3.540499e+07,-3.540499e+07,-3.531859e+07,4.408194e+07,-2.805107e+09,-1.107351e+09,-2.325738e+07,-1.719301e+07,-5.201810e+09
3,967968.0,5.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-1.255202e+08,-2.081186e+08,-1.679426e+08,-8.050579e+07,-2.081186e+08,-1.243970e+08,-2.081186e+08,-2.325738e+07,-1.700975e+07,-8.050580e+07
4,595581.0,3.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-4.070978e+08,-4.737986e+08,-3.298579e+07,-2.520979e+07,-6.574632e+08,-4.476066e+09,1.153523e+08,-2.325738e+07,-1.718675e+07,-2.539596e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168330,1047726.0,3.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.809054e+08,-2.081186e+08,-2.078594e+08,-1.677698e+08,-3.021508e+08,-4.175154e+09,-2.229975e+08,-2.325738e+07,-1.719301e+07,-3.839366e+09
168331,888353.0,3.0,14.587338,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.335008e+07,-2.056994e+08,-7.120984e+07,-2.218579e+07,-3.079590e+08,-4.841048e+09,-1.015163e+09,-2.325738e+07,-1.718611e+07,-5.960363e+09
168332,1090674.0,3.0,12.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-3.559203e+07,-7.653139e+07,-7.653139e+07,-2.616019e+07,-4.507189e+07,-3.305579e+09,-1.024497e+09,-2.325738e+07,-1.718697e+07,-5.235874e+09
168333,1034923.0,5.0,17.000000,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,-2.022434e+08,-2.081186e+08,-2.078594e+08,-1.307042e+08,-2.081186e+08,-2.018114e+08,-2.081186e+08,-2.325738e+07,-1.700975e+07,-1.796930e+08


In [22]:
# Merge it in
df_X = df_X.drop(total_num_cols, axis=1)
df_X = pd.concat([df_X, toNumImpute], axis=1)

test_X = test_X.drop(total_num_cols, axis=1)
test_X = pd.concat([test_X, test_toNumImpute], axis=1)

df_X

Unnamed: 0,type_code_CASH SWEEP,type_code_COD,type_code_LIRA/LRSP/RLSP,type_code_MISSING,type_code_MRGN,type_code_OFFBOOK,type_code_RDSP,type_code_REG RRSP,type_code_RESP,type_code_RRIF,...,last_trade_date,inception_date,last_update_date,last_maintenance_time,plan_effective_date,plan_end_date,rrif_original_date,inserted_at,updated_at,retail_last_maintenance_time
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.034469e+07,-2.056994e+08,-8.590901e+07,-6.910099e+07,-3.100958e+08,-4.821977e+09,-9.093112e+08,-2.325738e+07,-1.719301e+07,-5.706625e+09
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-5.182099e+07,-1.670786e+08,-4.611859e+07,-3.600979e+07,-1.670786e+08,-4.591844e+07,-1.670786e+08,-2.325738e+07,-1.700975e+07,-3.600980e+07
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.779650e+08,-4.695650e+08,-3.105026e+08,-9.294739e+07,-4.695650e+08,-1.769282e+08,-4.695650e+08,-2.325738e+07,-1.700975e+07,-1.769282e+08
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-8.337585e+07,-9.899539e+07,-9.899539e+07,-9.899539e+07,-8.497189e+07,-3.406795e+09,-8.029122e+08,-2.325738e+07,-1.719301e+07,-4.806123e+09
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.029380e+08,-2.081186e+08,-2.078594e+08,-2.088979e+07,-3.381639e+08,-4.578731e+09,-4.264582e+08,-2.325738e+07,-1.718587e+07,-4.509569e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-6.547118e+07,-2.088979e+07,-2.088979e+07,-2.080339e+07,-2.088979e+07,-5.017152e+07,-2.088979e+07,-2.078281e+07,-1.700284e+07,-2.080340e+07
673335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,-2.101058e+08,-5.444738e+08,-4.973858e+08,-1.307906e+08,-5.444738e+08,-2.084642e+08,-5.444738e+08,-2.325738e+07,-1.700975e+07,-1.799522e+08
673336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-7.627219e+07,-1.415906e+08,-1.415906e+08,-1.411586e+08,-1.588691e+08,-3.714318e+09,-6.693150e+08,-2.325738e+07,-1.719301e+07,-4.682246e+09
673337,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-8.624546e+07,-2.081186e+08,-1.631042e+08,-2.814739e+07,-2.081186e+08,-7.118907e+07,-2.081186e+08,-2.325738e+07,-1.700975e+07,-2.814740e+07


That should be all of the nulls taken care of! Hopefully this didn't corrupt our data too much...

In [23]:
df = pd.concat([df_X, df_Y], axis=1)
df

Unnamed: 0,type_code_CASH SWEEP,type_code_COD,type_code_LIRA/LRSP/RLSP,type_code_MISSING,type_code_MRGN,type_code_OFFBOOK,type_code_RDSP,type_code_REG RRSP,type_code_RESP,type_code_RRIF,...,inception_date,last_update_date,last_maintenance_time,plan_effective_date,plan_end_date,rrif_original_date,inserted_at,updated_at,retail_last_maintenance_time,label
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-2.056994e+08,-8.590901e+07,-6.910099e+07,-3.100958e+08,-4.821977e+09,-9.093112e+08,-2.325738e+07,-1.719301e+07,-5.706625e+09,Churn
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-1.670786e+08,-4.611859e+07,-3.600979e+07,-1.670786e+08,-4.591844e+07,-1.670786e+08,-2.325738e+07,-1.700975e+07,-3.600980e+07,Churn
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-4.695650e+08,-3.105026e+08,-9.294739e+07,-4.695650e+08,-1.769282e+08,-4.695650e+08,-2.325738e+07,-1.700975e+07,-1.769282e+08,Churn
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-9.899539e+07,-9.899539e+07,-9.899539e+07,-8.497189e+07,-3.406795e+09,-8.029122e+08,-2.325738e+07,-1.719301e+07,-4.806123e+09,Churn
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-2.081186e+08,-2.078594e+08,-2.088979e+07,-3.381639e+08,-4.578731e+09,-4.264582e+08,-2.325738e+07,-1.718587e+07,-4.509569e+09,Churn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673334,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-2.088979e+07,-2.088979e+07,-2.080339e+07,-2.088979e+07,-5.017152e+07,-2.088979e+07,-2.078281e+07,-1.700284e+07,-2.080340e+07,Churn
673335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,-5.444738e+08,-4.973858e+08,-1.307906e+08,-5.444738e+08,-2.084642e+08,-5.444738e+08,-2.325738e+07,-1.700975e+07,-1.799522e+08,No Churn
673336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.415906e+08,-1.415906e+08,-1.411586e+08,-1.588691e+08,-3.714318e+09,-6.693150e+08,-2.325738e+07,-1.719301e+07,-4.682246e+09,Churn
673337,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-2.081186e+08,-1.631042e+08,-2.814739e+07,-2.081186e+08,-7.118907e+07,-2.081186e+08,-2.325738e+07,-1.700975e+07,-2.814740e+07,Churn


### Section 5: Encode the Y-Label as Well
Almost forgot - gotta label encode the Churn/No Churn!

In [24]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])

print(label_encoder.classes_)

# We basically did this split already, but redo it here so X and y come from the encoded df
X = df.drop(["label"], axis=1)
y = df["label"]

['Churn' 'No Churn']
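`LabelEncoder` sorts the class names alphabetically, which is why the output above lists `Churn` first (code 0) and `No Churn` second (code 1). A small sketch on made-up labels shows the mapping and the round trip back to strings; the `toy_labels` list and `toy_encoder` name are purely illustrative and not taken from `history.csv`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, not taken from history.csv
toy_labels = ["No Churn", "Churn", "Churn", "No Churn", "Churn"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, so position 0 is 'Churn' and position 1 is 'No Churn'
mapping = {name: int(code) for code, name in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
decoded = toy_encoder.inverse_transform(encoded)
print(list(decoded))
```

Knowing which class is 1 matters later, because several scikit-learn metrics treat label 1 as the positive class by default.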


### Section 6: Splitting the Dataset for Training
Hopefully this is good! Going to follow the notebook.

In [25]:
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
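
Since roughly 80% of the rows are `Churn`, it may be worth making the split stratified and reproducible, so that both pieces keep the same class ratio and a rerun gives the same rows. This is a minimal sketch on synthetic data; the `toy_X`/`toy_y` arrays and the seed value are assumptions for illustration, and the run above used the plain call without these options:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1000 rows, 80% class 0 and 20% class 1
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(1000, 4))
toy_y = np.array([0] * 800 + [1] * 200)

# stratify keeps the class ratio the same in both splits;
# random_state makes the split repeatable between runs
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=13
)

# Share of class 1 in each split
print(y_tr.mean(), y_va.mean())
```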

## Part 3: The Model
It's just the Balanced Random Forest Classifier from `imblearn` - just gonna quickly, uh, steal the provided code...

In [26]:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import cross_val_score

classifer = BalancedRandomForestClassifier(max_features='log2', bootstrap=True, verbose=1, random_state=13, replacement=True)
cv_scores = cross_val_score(estimator=classifer, X=X_train, y=y_train, cv=5)
print(f"Average CV Score: {sum(cv_scores)/len(cv_scores)}")

  warn(
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:  2.2min
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    1.2s
  warn(
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:   55.4s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    1.3s
  warn(
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:   56.6s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    1.2s
  warn(
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:   56.8s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    1.5s
  warn(
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:  1.0min
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    1.2s


Average CV Score: 0.9682849082402761
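
When no `scoring` argument is given, `cross_val_score` falls back to the estimator's default scorer, which for classifiers is accuracy. With about 80% of accounts labelled `Churn`, accuracy can look high even for a weak model, so it may help to look at a few metrics side by side. The sketch below uses synthetic data and a plain `RandomForestClassifier` with `class_weight="balanced"` as a stand-in; that model and all the `toy_*` names are assumptions, not the `BalancedRandomForestClassifier` from the run above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data (about 80% / 20%), purely illustrative
toy_X, toy_y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=13
)

# Stand-in model; swap in the real classifier when running on the actual data
toy_model = RandomForestClassifier(
    n_estimators=25, class_weight="balanced", random_state=13
)

# cross_validate accepts several scorers at once and returns one array per scorer
metric_names = ["accuracy", "balanced_accuracy", "f1"]
results = cross_validate(toy_model, toy_X, toy_y, cv=5, scoring=metric_names)

for name in metric_names:
    print(name, results[f"test_{name}"].mean())
```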


In [27]:
from sklearn.metrics import f1_score

classifer.fit(X_train, y_train)

y_pred = classifer.predict(X_val)
f1 = f1_score(y_val, y_pred)

print(f"F1 Score: {f1}")

  warn(
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:  1.2min
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    2.2s


F1 Score: 0.9266548636058923


Whoa, that's pretty good, especially since the provided notebook suggested aiming for an F1 score of around 75%! Now to make our predictions.
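
One thing to keep in mind when reading this number: called with its defaults, `f1_score` computes a binary F1 with `pos_label=1`, and under the label encoding above 1 is `No Churn`. The sketch below, on hypothetical toy labels, shows how the score differs depending on which class is treated as positive, and how `average="macro"` combines both:

```python
from sklearn.metrics import f1_score

# Toy labels using the same encoding as above: 0 = 'Churn', 1 = 'No Churn'
toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 0]

# Default call: binary F1 with pos_label=1, i.e. the 'No Churn' class
f1_no_churn = f1_score(toy_true, toy_pred)

# Same metric with 'Churn' (label 0) treated as the positive class
f1_churn = f1_score(toy_true, toy_pred, pos_label=0)

# Unweighted mean of the two per-class scores
f1_macro = f1_score(toy_true, toy_pred, average="macro")

print(f1_no_churn, f1_churn, f1_macro)
```

If the challenge scores the `Churn` class specifically, passing `pos_label=0` on the validation set would give the matching number.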

## Part 4: Apply Model to `test.csv`
Now to get our answers back out of the model. The classifier predicts the integer codes (0 and 1), so we run the label encoder's `inverse_transform` on the predicted output to turn them back into `Churn` / `No Churn`.

In [28]:
test_pred = classifer.predict(test_X)
test_pred

[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    2.5s


array([0, 0, 0, ..., 0, 1, 0])

In [29]:
# Use the Label Encoder to Convert it back into what it should be
test_pred = label_encoder.inverse_transform(test_pred)
test_pred

array(['Churn', 'Churn', 'Churn', ..., 'Churn', 'No Churn', 'Churn'],
      dtype=object)

In [30]:
predicted_series = pd.Series(test_pred)
predicted_series

0            Churn
1            Churn
2            Churn
3         No Churn
4            Churn
            ...   
168330       Churn
168331       Churn
168332       Churn
168333    No Churn
168334       Churn
Length: 168335, dtype: object

In [31]:
# Final Output
test_output = test_X['id']
test_output = pd.concat([test_output, predicted_series.to_frame()], axis= 1)

test_output = test_output.rename(columns={0:'predicted'})
test_output['id'] = test_output['id'].astype(int)

test_output

Unnamed: 0,id,predicted
0,1155742,Churn
1,1269359,Churn
2,573181,Churn
3,967968,No Churn
4,595581,Churn
...,...,...
168330,1047726,Churn
168331,888353,Churn
168332,1090674,Churn
168333,1034923,No Churn


In [40]:
test_output.to_csv('./submission.csv', index=False)
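
A note on the `pd.concat(..., axis=1)` step above: it lines rows up by index label, so it only pairs each id with the right prediction when both pieces share the same 0..n-1 index (which appears to be the case here). A sketch that pairs ids and predictions by position instead, with a few cheap checks before writing the file; the `toy_*` values are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: ids with a non-default index, plus string predictions
toy_ids = pd.Series([1155742.0, 1269359.0, 573181.0], index=[10, 11, 12], name="id")
toy_pred = np.array(["Churn", "Churn", "No Churn"])

# Building from raw arrays pairs rows by position instead of by index label
toy_submission = pd.DataFrame(
    {"id": toy_ids.to_numpy().astype(int), "predicted": toy_pred}
)

# Cheap sanity checks before writing the file
assert len(toy_submission) == len(toy_pred)
assert toy_submission["id"].is_unique
assert not toy_submission.isna().any().any()

print(toy_submission)
```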

#### Going to also create an alternate version with all columns raw

In [42]:
newTest = pd.read_csv('./test.csv')
newTest = pd.concat([newTest, predicted_series.to_frame()], axis= 1)
newTest = newTest.rename(columns={0:'predicted'})
newTest['id'] = newTest['id'].astype(int)

newTest


Unnamed: 0,id,type_code,is_registered,country_code,currency_code,is_active,class_id,debit_code,last_trade_date,contract_type,...,is_hrdc_resp,is_plan_grandfathered,resp_specimen_plan,inserted_at,updated_at,is_olob,retail_last_maintenance_time,retail_last_maintenance_user,visible_in_reports,predicted
0,1155742,CASH,f,,CAD,t,3.0,T,,17.0,...,,,,2023-05-30 14:20:18.531115+00,2023-08-08 18:53:01.439561+00,f,,,t,Churn
1,1269359,TFSA,t,,CAD,t,5.0,C,2020-11-05,18.0,...,f,f,,2023-05-30 14:20:18.531115+00,2023-08-10 21:47:25.370403+00,f,2022-10-20 00:00:00,ROMANAA,t,Churn
2,573181,CASH,f,,CAD,t,3.0,T,,17.0,...,,,,2023-05-30 14:20:18.531115+00,2023-08-08 18:53:01.439561+00,f,,,t,Churn
3,967968,TFSA,t,,CAD,t,5.0,A,2020-03-03,17.0,...,f,f,,2023-05-30 14:20:18.531115+00,2023-08-10 21:47:25.370403+00,f,2021-08-06 00:00:00,T80,f,No Churn
4,595581,CASH,f,,CAD,t,3.0,T,2011-04-01,17.0,...,,,,2023-05-30 14:20:18.531115+00,2023-08-08 20:37:22.511698+00,f,,,t,Churn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168330,1047726,CASH,f,,CAD,t,3.0,A,,17.0,...,,,,2023-05-30 14:20:18.531115+00,2023-08-08 18:53:01.439561+00,f,,,f,Churn
168331,888353,CASH SWEEP,f,,CAD,t,3.0,0,,,...,,,,2023-05-30 14:20:18.531115+00,2023-08-08 20:48:02.170633+00,f,,,f,Churn
168332,1090674,CASH,f,,CAD,t,3.0,T,,12.0,...,,,,2023-05-30 14:20:18.531115+00,2023-08-08 20:33:44.941805+00,f,,,t,Churn
168333,1034923,REG RRSP,t,,CAD,f,5.0,A,2017-09-27,17.0,...,f,f,,2023-05-30 14:20:18.531115+00,2023-08-10 21:47:25.370403+00,f,2018-06-15 00:00:00,BATCH,f,No Churn


In [None]:
newTest.to_csv('./test_predicted.csv', index=False)

## Part 5: Analysis and Marketing Strategy