# Credijusto Data Scientist Challenge 💻💰🚀

## Challenge notes:

- In this Jupyter Notebook (challenge instructions), table row counts and table displays such as `table.head()` are for demonstration only.
- Do not expect your challenge data sources to have the same row counts or the same values as the ones shown in this notebook. The challenge is generated automatically for each applicant.
    - What is **always constant**:
        - Relationship between tables.
        - Number of features and data types for all tables.
    - What **can change**:
        - Row counts:
            - For some candidates we want to test the skills needed to handle "big" datasets; for others this won't be tested.
        - Values:
            - All challenges will have the same feature columns, but values can change, for example:
                - All challenges will have a `name` feature column, but in one test `i` the first name in the list could be `Benjamin`, while in another test `j` it could be `Anne`.
            - The intention is that the model proposed by the candidate generalizes and is useful for making predictions on unseen data.

## Dataset description.

#### 1. Personal [data table]
- **client_id**
    - key to job table
    - key to bank table
    - key to transactional data table
- name
- address
- phone_number
- email_domain
- smoker
- is_married
- car_licence_plate
- age
- number_of_children
- years_of_education
- has_criminal_records

#### 2. Job [data table]
- **client_id**
    - key to personal table
    - key to bank table
    - key to transactional data table
- company
- phone_number
- address
- email_domain
- current_job
- car_licence_plate
- years_in_current_job
- salary

#### 3. Bank [data table]
- **client_id**
    - key to personal table
    - key to job table
    - key to transactional data table
- account_id
    - key to transactional data table
- number_of_credit_cards
- number_logs_per_day
- number_secret_keys_requested
- credit_card_number
- credit_card_expire
- credit_card_provider
- credit_score
- first_credit_card_application_date
- last_credit_card_application_date
- **defaulted_loan**
    - Variable to predict

#### 4. Transactional [data table]
- **transaction_id**
- **account_id**
    - key to bank table
- **client_id**
    - key to personal table
    - key to job table
    - key to bank data table
- duration_minutes
- amount
- type
- date

## Business question

#### Background

1. **Only the training set bank data table has the column defaulted_loan** which has two different outcomes:
    - True
        - Client defaulted (did not pay credit).
        - This is the *Positive class*
    - False
        - Client is OK (did pay credit).
        - This is the *Negative class*
2. You need to make a predictive model to **make predictions of the feature defaulted_loan on the test dataset**.
3. **The evaluation of this challenge relies only on the prediction scores on test dataset**.
    - Choose the evaluation metric for this challenge wisely.

## Technicals

Feel free to:

1. Combine all tables using the keys (all keys end in `_id`).
2. Follow any EDA (Exploratory Data Analysis) path or paths you want.
3. Create, modify, delete, or combine any features you want (**do not make use of external data**).
4. Use any predictive model or stack of models you want.
5. Use either Python (`version 3.*.*`) or R (`version 3.*.*`) to solve this data challenge.
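
As an illustration of point 1, here is a minimal sketch of joining the tables on their `_id` keys with pandas. The toy frames and the names `tx_features` and `client_table` are made up for this example. It assumes the personal, job and bank tables hold one row per `client_id`, which the matching row counts suggest but which you should verify on your own data. The transactional table has many rows per client, so it is aggregated first.

```python
import pandas as pd

# Toy stand-ins for the four tables (values are illustrative only).
personal = pd.DataFrame({"client_id": ["A", "B", "C"], "age": [29, 30, 36]})
job = pd.DataFrame({"client_id": ["A", "B", "C"], "salary": [9256, 5704, 7055]})
bank = pd.DataFrame({
    "client_id": ["A", "B", "C"],
    "account_id": ["X1", "X2", "X3"],
    "credit_score": [792, 1056, 611],
})
transactions = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3"],
    "account_id": ["X1", "X1", "X2"],
    "client_id": ["A", "A", "B"],
    "amount": [345, 346, 312],
})

# Collapse the one-to-many transactional table to one row per client.
tx_features = (
    transactions.groupby("client_id")["amount"]
    .agg(tx_count="count", tx_amount_sum="sum", tx_amount_mean="mean")
    .reset_index()
)

# validate="one_to_one" raises if client_id is duplicated on either side.
# The suffixes only matter for the real tables, where personal and job share
# column names such as phone_number and address.
client_table = (
    personal
    .merge(job, on="client_id", how="left",
           suffixes=("_personal", "_job"), validate="one_to_one")
    .merge(bank, on="client_id", how="left", validate="one_to_one")
    .merge(tx_features, on="client_id", how="left")
)

# Clients without transactions get NaN aggregates; a count of 0 reads better.
client_table["tx_count"] = client_table["tx_count"].fillna(0).astype(int)
print(client_table)
```

Left joins starting from the personal table keep every client even when a client has no transactions, which matters because the deliverable needs one prediction per client.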

## R Resources
- [Caret](https://topepo.github.io/caret/)
- [R for data analysis](http://r4ds.had.co.nz)

## Python Resources
- [Scikit-Learn](https://scikit-learn.org/stable/tutorial/index.html)
- [mlxtend](http://rasbt.github.io/mlxtend/)


## Deliverables

Two files:
- One CSV file.
- One Jupyter Notebook.

#### CSV File

- **The deliverable is one CSV (Comma Separated Value) file with only two columns**
    - client_id
    - defaulted_loan
- The file **must have a prediction label for each client in test dataset**, one row per client. There is an example at the bottom of this file.
- The file must be named as follows:
    - ds_challenge_[your_name_your_last_name].csv
    - Example for someone named John Smith:
        - ds_challenge_john_smith.csv
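
The requirements above can be checked before sending the file. The following is a minimal sketch, not an official validator: the function name `check_submission` and the specific checks are assumptions derived from the rules listed in this section.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, test_client_ids: pd.Series) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != ["client_id", "defaulted_loan"]:
        problems.append("columns must be exactly client_id, defaulted_loan")
        return problems  # the remaining checks need both columns
    if submission["client_id"].duplicated().any():
        problems.append("duplicate client_id rows")
    if set(submission["client_id"]) != set(test_client_ids):
        problems.append("client_id values do not match the test set")
    if submission["defaulted_loan"].isna().any():
        problems.append("missing prediction labels")
    return problems

# Two client ids taken from the test table preview in this notebook.
test_ids = pd.Series(["QLOO6872808638149", "LARO2418696890722"])
good = pd.DataFrame({"client_id": test_ids, "defaulted_loan": [True, False]})
bad = pd.DataFrame({
    "client_id": [test_ids[0], test_ids[0]],  # one client twice, one missing
    "defaulted_loan": [True, None],           # one label missing
})

print(check_submission(good, test_ids))
print(check_submission(bad, test_ids))
```

In practice you would read your exported CSV back with `pd.read_csv` and pass it in together with the `client_id` column of the bank test table.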
        
#### Jupyter Notebook

- Make sure that I can run this file on my computer without the need for external data sources or programs (use only R or Python).
- **This file must run locally and must export the results as a CSV. This CSV must be exactly the same as the one you're sending along with this challenge.**
- It is OK if you want to include charts and tables in this file. This is highly valuable for the EDA process.
- The file must be named as follows:
    - ds_challenge_[your_name_your_last_name].ipynb
    - Example for someone named John Smith:
        - ds_challenge_john_smith.ipynb

## Deadlines
- You will have one week (7 days) to send your results.
- Please send both files (CSV and ipynb) to my email: btovar@credijusto.com

### Set working environment

In [1]:
import pandas as pd
import numpy as np

### Load dataset

In [4]:
data = {
    'train': {
        'personal': pd.read_csv('data/client_personal_train.csv'),
        'job': pd.read_csv('data/client_job_train.csv'),
        'bank_data': pd.read_csv('data/client_bank_data_train.csv'),
        'transactional_data': pd.read_csv('data/client_transactional_data_train.csv')        
    },
    'test': {
        'personal': pd.read_csv('data/client_personal_test.csv'),
        'job': pd.read_csv('data/client_job_test.csv'),
        'bank_data': pd.read_csv('data/client_bank_data_test.csv'),
        'transactional_data': pd.read_csv('data/client_transactional_data_test.csv')
    }
}

### Train Data exploration

In [11]:
for name, table in data['train'].items():
    print('Dataset: ' + name + ' | Dataset dimension (rows, cols): ' + str(table.shape))

Dataset: personal | Dataset dimension (rows, cols): (86521, 12)
Dataset: job | Dataset dimension (rows, cols): (86521, 9)
Dataset: bank_data | Dataset dimension (rows, cols): (86521, 12)
Dataset: transactional_data | Dataset dimension (rows, cols): (6477031, 7)

#### 1) Personal datatable

In [12]:
data['train']['personal'].head()

Unnamed: 0,phone_number,number_of_children,has_criminal_records,client_id,address,name,age,years_of_education,is_married,car_licence_plate,email_domain,smoker
0,9797099700,2,False,HOYS2608458829640,"52901 Anna Estate Apt. 971\nHansenland, GA 79328",Ronald Ryan,29,12,True,OLW 106,hotmail.com,True
1,(268)263-7382x418,2,False,ZPWO4471671046144,"96959 Hannah Spur Suite 528\nMeganton, NV 87358",Virginia Hernandez,30,19,True,P27-77B,gmail.com,False
2,173-606-0390,2,False,LGKU2154168371152,USS Webb\nFPO AA 86921,Sheila Stewart,36,7,True,507 DNX,gmail.com,False
3,240-831-1677x2343,2,False,ESVS3759907342486,"8563 Wang Tunnel\nEast David, HI 47804",George Cox,33,15,False,BVR 153,yahoo.com,True
4,001-433-074-6320x64392,3,False,TRBA4826292353608,"747 Tom Flats Apt. 796\nSouth April, AR 60765",Lauren Lopez,35,9,False,426209,yahoo.com,False


#### 2) Job datatable

In [13]:
data['train']['job'].head()

Unnamed: 0,car_licence_plate,client_id,phone_number,email_domain,address,current_job,company,years_in_current_job,salary
0,YSE6755,LBBJ9025982900187,(805)271-7884,yahoo.com,"9108 Wright Run Apt. 397\nColinhaven, MT 77584",Chief Strategy Officer,"Burns, Lynch and Ferguson",13,9256
1,9NB2337,XOEW7300415693349,932.130.7419x61334,gmail.com,"485 David Union Apt. 987\nNew Sharonton, AK 03823",Chief Executive Officer,"Ferguson, Lloyd and Camacho",14,5704
2,DA5 8385,QSOJ2352235898099,(052)952-2214x4117,yahoo.com,"699 April Inlet Apt. 724\nWarnermouth, WV 44034","Surveyor, commercial/residential","Henderson, Bowers and Villegas",10,7055
3,5Q R1718,BTJU3236006682424,880-147-9640,yahoo.com,"23091 Maria Station\nSouth Williambury, DE 69511",Therapeutic radiographer,Howe LLC,10,5962
4,69J C93,JWDA5315944972672,483-988-2897x05522,gmail.com,"65065 Erica Dale Apt. 492\nSandovalmouth, NH 3...",Medical physicist,Coffey Group,16,7048


#### 3) Bank datatable
- Notice that this is the table that contains the variable to predict: **defaulted_loan**

In [14]:
data['train']['bank_data'].head()

Unnamed: 0,number_of_credit_cards,first_credit_card_application_date,client_id,credit_card_expire,credit_card_number,credit_score,number_secret_keys_requested,number_logs_per_day,credit_card_provider,account_id,last_credit_card_application_date,defaulted_loan
0,2,2015-12-10 01:30:24,IDQN7326217819942,10/26,4917791162965063,792,1,3,JCB 16 digit,SSEL8415528042273,2017-08-29 01:53:05,False
1,3,2016-06-03 08:07:30,CSTV4627055779590,12/26,30357439741339,1056,1,3,VISA 16 digit,MUEW8640780419916,2015-08-16 13:56:25,False
2,2,2016-02-19 10:28:43,LAPI8584925015195,03/25,4743404908886490,611,1,3,VISA 16 digit,NUXE8686777893744,2015-03-09 05:00:35,False
3,1,2010-10-26 17:19:22,QGCO7450958537677,02/21,4193019096594311,704,1,3,Mastercard,HPZV6827338198621,2014-05-01 15:16:00,False
4,1,2009-08-25 16:29:57,KTRF0650626151507,08/21,3533740926977068,860,1,3,VISA 19 digit,HDUL2479521679327,2014-06-01 12:28:18,False
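
The date-like columns in this table arrive as strings, so they usually need parsing before they can be used as features. Here is a small sketch of one possible approach, using two rows from the preview above. The derived column names are hypothetical, and reading `credit_card_expire` as `MM/YY` is an assumption based on how the values look.

```python
import pandas as pd

# Two rows copied from the bank table preview.
bank = pd.DataFrame({
    "first_credit_card_application_date": ["2015-12-10 01:30:24", "2016-06-03 08:07:30"],
    "last_credit_card_application_date": ["2017-08-29 01:53:05", "2015-08-16 13:56:25"],
    "credit_card_expire": ["10/26", "12/26"],
})

first = pd.to_datetime(bank["first_credit_card_application_date"])
last = pd.to_datetime(bank["last_credit_card_application_date"])

# Whole days between the two application dates. The value is negative when
# the "last" date precedes the "first" one, as in the second row.
bank["days_between_applications"] = (last - first).dt.days

# Assumption: credit_card_expire is month/two-digit-year.
expire = pd.to_datetime(bank["credit_card_expire"], format="%m/%y")
bank["expire_year"] = expire.dt.year
bank["expire_month"] = expire.dt.month

print(bank[["days_between_applications", "expire_year", "expire_month"]])
```

Note that the second preview row has a "last" application date earlier than its "first" one, so it is worth checking how often that happens before trusting either column.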


#### 4) Bank transactions datatable

In [15]:
data['train']['transactional_data'].head()

Unnamed: 0,account_id,amount,client_id,date,duration_minutes,transaction_id,type
0,MUUO1910562872132,345,MODL6196378885876,2017-03-06 17:42:05,13,NYNN6756993501458,Withdrawal
1,ACHE2092856499864,346,GJDO0663462186710,2018-01-28 14:26:28,10,VGUR1678924411888,Withdrawal
2,QKGF7117785518699,312,UUAS6736457187091,2017-10-16 19:49:24,10,VFCW6048243042000,Withdrawal
3,KDFS1871901481681,140,OGXP1651998264885,2017-07-06 13:02:14,10,MDZE8538477502306,Withdrawal
4,BHIK6017211474052,165,WNAS0589763978489,2017-06-29 21:29:28,13,BZPF2399381834004,Deposit
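
With millions of rows, this table may not fit comfortably in memory on every machine. One option, sketched here under the assumption that per-client sums and counts are the features you want, is to read the CSV in chunks and combine the partial aggregates. The in-memory `csv_text` stands in for `data/client_transactional_data_train.csv`, and `tx_totals` is a name made up for this example.

```python
import io
import pandas as pd

# Small stand-in for the real transactional CSV (values are illustrative).
csv_text = """client_id,amount,type
A,345,Withdrawal
B,312,Withdrawal
A,346,Withdrawal
B,140,Withdrawal
C,165,Deposit
"""

partials = []
# chunksize makes read_csv yield DataFrames of at most that many rows;
# usecols skips columns that the aggregation does not need.
for chunk in pd.read_csv(io.StringIO(csv_text),
                         usecols=["client_id", "amount"],
                         chunksize=2):
    partials.append(chunk.groupby("client_id")["amount"].agg(["sum", "count"]))

# A client can appear in several chunks, so the partial sums and counts
# are added up once more before deriving the mean.
tx_totals = pd.concat(partials).groupby(level=0).sum()
tx_totals["mean"] = tx_totals["sum"] / tx_totals["count"]
print(tx_totals)
```

Sums and counts can be combined across chunks exactly; statistics such as the median cannot, so those would need a different strategy.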


### Variable to predict stats
- **defaulted_loan**: if True, the client defaulted on the loan. If False, the client paid the loan.
- **Our interest is to predict whether a credit applicant (client_id) will default on the loan.**

In [21]:
data['train']['bank_data']['defaulted_loan'].value_counts()

False    83930
True      2591
Name: defaulted_loan, dtype: int64

##### Currently, only 3% of the portfolio has defaulted on the loan.

In [22]:
100 * np.round(data['train']['bank_data']['defaulted_loan'].value_counts() / data['train']['bank_data'].shape[0], 2)

False    97.0
True      3.0
Name: defaulted_loan, dtype: float64
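
This imbalance matters when picking an evaluation metric. As a rough illustration using the class counts shown above, a trivial "model" that never predicts a default already reaches about 97% accuracy while finding none of the defaulters. This sketch does not prescribe a metric; it only shows why accuracy alone can mislead here.

```python
import numpy as np

# Class counts taken from the value_counts() output above.
n_negative, n_positive = 83930, 2591

y_true = np.concatenate([
    np.zeros(n_negative, dtype=bool),  # clients who paid
    np.ones(n_positive, dtype=bool),   # clients who defaulted (positive class)
])

# A baseline that always predicts "no default".
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = np.sum(y_pred & y_true)
recall = true_positives / y_true.sum()

print(f"accuracy: {accuracy:.3f}")  # prints "accuracy: 0.970"
print(f"recall:   {recall:.3f}")    # prints "recall:   0.000"
```

Metrics that focus on the positive class, such as precision, recall, or F1, are commonly considered for problems with this kind of imbalance.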

## Test set particularities

- The bank test dataset **does not have the defaulted_loan feature**.
- The **defaulted_loan** column is the feature we need to predict.

In [25]:
data['test']['bank_data'].head()

Unnamed: 0,number_of_credit_cards,first_credit_card_application_date,client_id,credit_card_expire,credit_card_number,credit_score,number_secret_keys_requested,number_logs_per_day,credit_card_provider,account_id,last_credit_card_application_date
0,2,2014-08-17 09:15:31,QLOO6872808638149,03/21,6011107581395438,938,1,2,VISA 16 digit,CIQD7924293428249,2016-10-23 14:29:29
1,3,2017-07-13 17:42:38,LARO2418696890722,06/28,3583287285484582,974,1,3,Discover,ZHFB1331739147838,2015-12-15 13:49:15
2,3,2011-01-14 06:28:14,WLNZ9428625597800,02/21,4909285608552114,823,1,3,American Express,MCXO9264919957857,2017-10-02 03:35:27
3,1,2014-05-01 23:52:16,OTPX8452255444597,05/22,3573536236294405,954,1,3,JCB 16 digit,NQAR3462687861064,2016-04-24 23:11:23
4,3,2017-06-16 01:25:33,CPAQ6111541208111,08/21,6011045605327807,880,1,2,VISA 13 digit,QERG4184129464638,2014-11-17 21:57:30


In [31]:
pd.DataFrame(data['test']['bank_data'].columns)

Unnamed: 0,0
0,number_of_credit_cards
1,first_credit_card_application_date
2,client_id
3,credit_card_expire
4,credit_card_number
5,credit_score
6,number_secret_keys_requested
7,number_logs_per_day
8,credit_card_provider
9,account_id
10,last_credit_card_application_date


In [26]:
data['train']['bank_data'].head()

Unnamed: 0,number_of_credit_cards,first_credit_card_application_date,client_id,credit_card_expire,credit_card_number,credit_score,number_secret_keys_requested,number_logs_per_day,credit_card_provider,account_id,last_credit_card_application_date,defaulted_loan
0,2,2015-12-10 01:30:24,IDQN7326217819942,10/26,4917791162965063,792,1,3,JCB 16 digit,SSEL8415528042273,2017-08-29 01:53:05,False
1,3,2016-06-03 08:07:30,CSTV4627055779590,12/26,30357439741339,1056,1,3,VISA 16 digit,MUEW8640780419916,2015-08-16 13:56:25,False
2,2,2016-02-19 10:28:43,LAPI8584925015195,03/25,4743404908886490,611,1,3,VISA 16 digit,NUXE8686777893744,2015-03-09 05:00:35,False
3,1,2010-10-26 17:19:22,QGCO7450958537677,02/21,4193019096594311,704,1,3,Mastercard,HPZV6827338198621,2014-05-01 15:16:00,False
4,1,2009-08-25 16:29:57,KTRF0650626151507,08/21,3533740926977068,860,1,3,VISA 19 digit,HDUL2479521679327,2014-06-01 12:28:18,False


In [32]:
pd.DataFrame(data['train']['bank_data'].columns)

Unnamed: 0,0
0,number_of_credit_cards
1,first_credit_card_application_date
2,client_id
3,credit_card_expire
4,credit_card_number
5,credit_score
6,number_secret_keys_requested
7,number_logs_per_day
8,credit_card_provider
9,account_id
10,last_credit_card_application_date
11,defaulted_loan


# CSV output template

In [45]:
demo_output = pd.DataFrame(index=data['test']['bank_data'].index)
demo_output['client_id'] = data['test']['bank_data']['client_id']
# fill with random labels
demo_output['defaulted_loan'] = np.random.choice([True, False], size=data['test']['bank_data'].shape[0], replace=True, p=[0.5, 0.5])
demo_output.head()

Unnamed: 0,client_id,defaulted_loan
0,QLOO6872808638149,True
1,LARO2418696890722,False
2,WLNZ9428625597800,False
3,OTPX8452255444597,True
4,CPAQ6111541208111,False


In [None]:
# Export as CSV with only the two required columns (no index column)
demo_output.to_csv('ds_challenge_john_smith.csv', index=False)