# Credijusto Data Scientist Challenge 💻💰🚀

## Challenge notes:

- In this Jupyter Notebook (the challenge instructions), table row counts and table displays such as `table.head()` are for demonstration only.
- Do not expect your challenge data sources to have the same row counts or the same values as those shown in this notebook. The challenge is generated automatically for each applicant.
    - What is **always constant**:
        - Relationship between tables.
        - Number of features and data types for all tables.
    - What **can change**:
        - Row counts:
            - For some candidates we would like to test skills for handling "big" datasets; in other cases this won't be tested.
        - Values:
            - All challenges will have the same feature columns, but values can change, for example:
                - All challenges will have a `name` feature column. In one test `i` the first name in the list could be `Benjamin`; in another test `j` it could be `Anne`.
            - The intention is that the model proposed by the candidate generalizes and is useful for making predictions on unseen data.
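
If the transactional table turns out to be large, one possible approach is to read it in chunks rather than all at once. The sketch below assumes a `client_id` column as listed in the dataset description; the function name `count_transactions_in_chunks` and the inline CSV text are hypothetical stand-ins so that the example runs on its own.

```python
import io
from collections import Counter

import pandas as pd

# Tiny made-up stand-in for the transactional CSV file.
csv_text = """transaction_id,client_id,account_id,date,duration_minutes,amount,type
T1,C1,A1,2018-08-21 12:59:47,10,238,Withdrawal
T2,C2,A2,2018-07-28 21:21:51,16,387,Withdrawal
T3,C1,A1,2018-09-01 00:44:48,12,314,Withdrawal
T4,C3,A3,2018-05-10 07:50:26,16,229,Withdrawal
T5,C1,A1,2018-10-31 23:40:32,7,309,Withdrawal
"""


def count_transactions_in_chunks(source, chunksize):
    """Count transactions per client without loading the whole file at once."""
    totals = Counter()
    # usecols keeps only the needed column; chunksize makes read_csv return an iterator.
    for chunk in pd.read_csv(source, usecols=["client_id"], chunksize=chunksize):
        totals.update(chunk["client_id"].value_counts().to_dict())
    return pd.Series(totals, name="n_transactions").sort_index()


transactions_per_client = count_transactions_in_chunks(io.StringIO(csv_text), chunksize=2)
print(transactions_per_client)
```

On the real data, `source` would be the path to the transactional CSV and `chunksize` would be chosen to fit the available memory.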

## Dataset description.

#### 1. Personal [data table]
- **client_id**
    - key to job table
    - key to bank table
    - key to transactional data table
- name
- address
- phone_number
- email_domain
- smoker
- is_married
- car_licence_plate
- age
- number_of_children
- years_of_education
- has_criminal_records

#### 2. Job [data table]
- **client_id**
    - key to personal table
    - key to bank table
    - key to transactional data table
- company
- phone_number
- address
- email_domain
- current_job
- car_licence_plate
- years_in_current_job
- salary

#### 3. Bank [data table]
- **client_id**
    - key to personal table
    - key to job table
    - key to transactional data table
- account_id
    - key to transactional data table
- number_of_credit_cards
- number_logs_per_day
- number_secret_keys_requested
- credit_card_number
- credit_card_expire
- credit_card_provider
- credit_score
- first_credit_card_application_date
- last_credit_card_application_date
- **defaulted_loan**
    - Variable to predict

#### 4. Transactional [data table]
- **transaction_id**
- **account_id**
    - key to bank table
- **client_id**
    - key to personal table
    - key to job table
    - key to bank data table
- duration_minutes
- amount
- type
- date

## Business question

#### Background

1. **Only the training set bank data table has the column defaulted_loan** which has two different outcomes:
    - True
        - Client defaulted (did not pay credit).
        - This is the *Positive class*
    - False
        - Client is OK (did pay credit).
        - This is the *Negative class*
2. You need to build a predictive model to **predict the feature defaulted_loan on the test dataset**.
3. **The evaluation of this challenge relies only on the prediction scores on the test dataset**.
    - Choose the evaluation metric for this challenge wisely.
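
To illustrate why the choice of metric matters, the sketch below scores a trivial baseline that predicts "no default" for every client. It uses the class counts of the demonstration training set shown later in this notebook; your own counts may differ, and which metric is scored is not stated here, so treat this only as an illustration.

```python
# Class counts from the demonstration training set in this notebook.
n_negative = 65527  # defaulted_loan == False
n_positive = 3465   # defaulted_loan == True

# Confusion-matrix cells for a baseline that always predicts False.
true_positives = 0
false_positives = 0
false_negatives = n_positive
true_negatives = n_negative

accuracy = (true_positives + true_negatives) / (n_positive + n_negative)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

The baseline reaches an accuracy of roughly 95% while detecting none of the defaulting clients, so accuracy alone says little about the positive class.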

## Technicals

Feel free to:

1. Combine all tables using the keys (all keys end in `_id`).
2. Follow any EDA (Exploratory Data Analysis) approach you want.
3. Create, modify, delete, or combine any features you want (**do not use external data**).
4. Use any predictive model or stack of models you want.
5. Use either Python (`version 3.*.*`) or R (`version 3.*.*`) to solve this data challenge.
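
One way to combine the tables is sketched below with small made-up frames. It assumes that the personal, job, and bank tables hold one row per `client_id` and that the transactional table holds many rows per client; the feature names `n_transactions` and `mean_amount` are illustrative choices, not part of the challenge.

```python
import pandas as pd

# Tiny made-up tables that mimic the key structure described above.
personal = pd.DataFrame({"client_id": ["C1", "C2", "C3"], "age": [31, 26, 29]})
job = pd.DataFrame({"client_id": ["C1", "C2", "C3"], "salary": [6626, 8343, 6728]})
bank = pd.DataFrame({
    "client_id": ["C1", "C2", "C3"],
    "account_id": ["A1", "A2", "A3"],
    "credit_score": [814, 835, 1040],
})
transactions = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3", "T4"],
    "client_id": ["C1", "C1", "C2", "C1"],
    "account_id": ["A1", "A1", "A2", "A1"],
    "amount": [238, 387, 314, 229],
})

# Collapse the one-to-many transactional table to one row per client first,
# so the later merge cannot duplicate client rows.
tx_features = (
    transactions.groupby("client_id")
    .agg(n_transactions=("transaction_id", "count"), mean_amount=("amount", "mean"))
    .reset_index()
)

# validate="one_to_one" raises if a key is unexpectedly duplicated.
# The suffixes matter on the real data, where personal and job share column names.
features = (
    personal
    .merge(job, on="client_id", how="left", validate="one_to_one",
           suffixes=("_personal", "_job"))
    .merge(bank, on="client_id", how="left", validate="one_to_one")
    .merge(tx_features, on="client_id", how="left", validate="one_to_one")
)

# Clients without any transaction get an explicit zero count instead of NaN.
features["n_transactions"] = features["n_transactions"].fillna(0).astype(int)
print(features)
```

Aggregating before merging keeps the result at one row per client, which matches the one-prediction-per-client deliverable.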

## R Resources
- [Caret](https://topepo.github.io/caret/)
- [R for data analysis](http://r4ds.had.co.nz)

## Python Resources
- [Scikit-Learn](https://scikit-learn.org/stable/tutorial/index.html)
- [mlxtend](http://rasbt.github.io/mlxtend/)


## Deliverables

Two files:
- One CSV file.
- One Jupyter Notebook.

#### CSV File

- **The deliverable is one CSV (Comma Separated Value) file with only two columns**
    - client_id
    - defaulted_loan
- The file **must have a prediction label for each client in the test dataset**, one row per client. There is an example at the bottom of this file.
- The file must be named as follows:
    - ds_challenge_[your_name_your_last_name].csv
    - Example for someone named John Smith:
        - ds_challenge_john_smith.csv
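
Before sending the file, a quick self-check along these lines may help. This is only a sketch: `submission_filename` and `check_submission` are hypothetical helpers, and the check covers just the rules stated above (two columns, one row per test client).

```python
import csv
import io

REQUIRED_COLUMNS = ["client_id", "defaulted_loan"]


def submission_filename(first_name, last_name):
    """Hypothetical helper: build the required CSV name from a candidate's name."""
    return f"ds_challenge_{first_name.lower()}_{last_name.lower()}.csv"


def check_submission(csv_text, expected_client_ids):
    """Return a list of problems found in the submission text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    header = list(reader.fieldnames or [])
    problems = []
    if header != REQUIRED_COLUMNS:
        problems.append(f"unexpected columns: {header}")
    ids = [row.get("client_id") for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicated client_id values")
    if set(ids) != set(expected_client_ids):
        problems.append("client_id values do not match the test set")
    return problems


test_ids = ["SMMV5459810347915", "AHGN6820097391146"]
good = "client_id,defaulted_loan\nSMMV5459810347915,True\nAHGN6820097391146,False\n"
# Same content, but written with an extra unnamed index column.
with_index = ",client_id,defaulted_loan\n0,SMMV5459810347915,True\n1,AHGN6820097391146,False\n"

print(submission_filename("John", "Smith"))
print(check_submission(good, test_ids))
print(check_submission(with_index, test_ids))
```

The second check fails because an exported index adds a third, unnamed column to the file.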
        
#### Jupyter Notebook

- Make sure that I can run this file on my computer without external data sources or programs (use only R or Python).
- **This file must run locally and export the results as a CSV. This CSV must be exactly the same as the one you send with this challenge.**
- It is OK if you want to include charts and tables in this file. This is highly valuable for the EDA process.
- The file must be named as follows:
    - ds_challenge_[your_name_your_last_name].ipynb
    - Example for someone named John Smith:
        - ds_challenge_john_smith.ipynb

## Deadlines
- You will have one week (7 days) to send your results.
- Please send both files (CSV and ipynb) to my email: btovar@credijusto.com

### Set working environment

In [1]:
import pandas as pd
import numpy as np

### Load dataset

In [2]:
data = {
    'train': {
        'personal': pd.read_csv('data/client_personal_train.csv'),
        'job': pd.read_csv('data/client_job_train.csv'),
        'bank_data': pd.read_csv('data/client_bank_data_train.csv'),
        'transactional_data': pd.read_csv('data/client_transactional_data_train.csv')        
    },
    'test': {
        'personal': pd.read_csv('data/client_personal_test.csv'),
        'job': pd.read_csv('data/client_job_test.csv'),
        'bank_data': pd.read_csv('data/client_bank_data_test.csv'),
        'transactional_data': pd.read_csv('data/client_transactional_data_test.csv')
    }
}

### Train Data exploration

In [3]:
# Print the shape of every training table
for name, table in data['train'].items():
    print('Dataset: ' + name + ' | Dataset dimension (rows, cols): ' + str(table.shape))

Dataset: personal | Dataset dimension (rows, cols): (68992, 12)
Dataset: job | Dataset dimension (rows, cols): (68992, 9)
Dataset: bank_data | Dataset dimension (rows, cols): (68992, 12)
Dataset: transactional_data | Dataset dimension (rows, cols): (1517581, 7)

#### 1) Personal datatable

In [4]:
data['train']['personal'].head()

Unnamed: 0,is_married,age,name,number_of_children,car_licence_plate,address,has_criminal_records,smoker,years_of_education,phone_number,client_id,email_domain
0,False,31,Joel Herrera,2,8R 3A5NOQ,"0550 Tanya Ferry\nFergusonport, IA 41180",False,True,14,(444)128-8524x089,MUMR3875397452595,yahoo.com
1,False,26,Justin Burgess,3,QIE 2694,"080 Emily Springs Suite 947\nSerranostad, AZ 7...",True,True,17,419-736-1369x7810,LNFC4821269126830,hotmail.com
2,True,29,Samantha Brown,3,6KV R45,"59687 Alexander Walk\nEast David, AZ 21330",False,True,17,1209272743,PIGP5747447418648,gmail.com
3,True,34,Jason Ware,1,YL9 0751,"13474 Flores Mall Suite 952\nNorth Erinfort, N...",False,False,8,(282)819-4842,MDZY2927886938414,gmail.com
4,True,33,Ronald Hoffman,3,LPG 832,"31878 Heather Rapids Suite 933\nNorth Marie, A...",False,True,13,142-489-3506,WETM2827630477279,yahoo.com


#### 2) Job datatable

In [5]:
data['train']['job'].head()

Unnamed: 0,years_in_current_job,client_id,salary,phone_number,car_licence_plate,company,email_domain,address,current_job
0,10,LCQQ3834995242554,6626,+1-870-455-1656,AUJ 311,"Smith, Walton and Smith",hotmail.com,Unit 4250 Box 5536\nDPO AE 73809,Retail banker
1,10,BRCD7200842828050,8343,519-526-9913x6540,223R5,"Santos, Wilson and Hampton",yahoo.com,"PSC 4581, Box 0827\nAPO AE 63527",Intelligence analyst
2,15,RSAA3840744969487,6728,5360419904,NXV D21,Adkins-Mcneil,yahoo.com,"1036 Susan Roads\nEast Christophermouth, PA 99481",Multimedia specialist
3,8,KZWB7793929593940,7653,+1-245-845-9876x1778,165 6EL,Aguilar-Paul,hotmail.com,"72107 Hernandez Crossing Suite 699\nKnappstad,...","Development worker, community"
4,11,TETJ4085914615232,8437,739-916-7742,919 8NG,Jensen PLC,hotmail.com,"554 Flores Port\nKevinshire, FL 60356",Investment analyst


#### 3) Bank datatable
- Notice that this is the table that contains the variable to predict: **defaulted_loan**

In [6]:
data['train']['bank_data'].head()

Unnamed: 0,number_secret_keys_requested,credit_card_provider,number_logs_per_day,first_credit_card_application_date,last_credit_card_application_date,credit_score,credit_card_number,number_of_credit_cards,account_id,credit_card_expire,client_id,defaulted_loan
0,1,VISA 16 digit,3,2011-10-06 17:58:56,2015-10-31 02:43:10,814,3596118963565100,2,RSGD4569350483260,07/27,ZFUU9069197973171,False
1,1,VISA 16 digit,3,2017-07-18 09:22:24,2017-04-06 23:21:34,835,4036708575533672,2,UNKI9301808547977,04/26,EZZZ2264498911884,False
2,1,JCB 16 digit,2,2017-08-04 06:00:26,2015-10-06 07:11:36,1040,5187829527586586,3,PREO5042440106050,09/24,HTIX3716125146816,False
3,1,Mastercard,3,2017-05-12 21:45:03,2016-01-01 23:51:21,808,4069649723930,2,YFMX9024672103664,04/28,WVDK6716021964941,False
4,1,JCB 16 digit,3,2017-06-24 23:49:50,2018-08-12 03:12:38,523,4511324297912,3,VXZA6446374802774,08/24,GPHF8397791795583,False
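
The date-like columns arrive as text, so they usually need parsing before they can serve as features. The sketch below uses two rows copied from the output above; the derived column names (`days_between_applications`, `expire_year`, `expire_month`) are illustrative assumptions, not required features.

```python
import pandas as pd

# Two rows copied from the demonstration output above.
bank = pd.DataFrame({
    "first_credit_card_application_date": ["2011-10-06 17:58:56", "2017-07-18 09:22:24"],
    "last_credit_card_application_date": ["2015-10-31 02:43:10", "2017-04-06 23:21:34"],
    "credit_card_expire": ["07/27", "04/26"],
})

first = pd.to_datetime(bank["first_credit_card_application_date"])
last = pd.to_datetime(bank["last_credit_card_application_date"])

# Whole days between the two applications; negative when "first" is later than "last".
bank["days_between_applications"] = (last - first).dt.days

# "MM/YY" strings parse to the first day of the expiry month.
expire = pd.to_datetime(bank["credit_card_expire"], format="%m/%y")
bank["expire_year"] = expire.dt.year
bank["expire_month"] = expire.dt.month

print(bank[["days_between_applications", "expire_year", "expire_month"]])
```

In the second row the "first" application date is later than the "last" one, so it may be worth checking how often that happens in your own data before relying on the difference.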


#### 4) Bank transactions datatable

In [7]:
data['train']['transactional_data'].head()

Unnamed: 0,transaction_id,client_id,account_id,date,duration_minutes,amount,type
0,BIGS2655386520335,SQWI6088247113041,UUJG9330648144708,2018-08-21 12:59:47,10,238,Withdrawal
1,BVVC7567878745629,EVEL6951619336672,PZWH9597088886612,2018-07-28 21:21:51,16,387,Withdrawal
2,UUSU6640167293035,MLFH5670327424978,JHZB9470931550704,2018-09-01 00:44:48,12,314,Withdrawal
3,JGXJ2801880132165,VJRK3495233458723,FWHP6221647324126,2018-05-10 07:50:26,16,229,Withdrawal
4,HDQU8860240235988,MFGZ4978234012602,AVPD5598148116569,2018-10-31 23:40:32,7,309,Withdrawal


### Variable to predict stats
- **defaulted_loan**: if True, the client defaulted on the loan. If False, the client paid the loan.
- **Our interest is to predict whether a credit applicant (client_id) will default on the loan.**

In [8]:
data['train']['bank_data']['defaulted_loan'].value_counts()

False    65527
True      3465
Name: defaulted_loan, dtype: int64

##### Currently, about 5% of the portfolio has defaulted on the loan.

In [9]:
100 * np.round(data['train']['bank_data']['defaulted_loan'].value_counts() / data['train']['bank_data'].shape[0], 2)

False    95.0
True      5.0
Name: defaulted_loan, dtype: float64
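
With roughly 5% positives, one common option is to weight the classes inversely to their frequency during training. The sketch below applies the "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the counts shown above. Whether weighting helps depends on the model and on the metric you choose, so this is an illustration rather than a recommendation.

```python
# Class counts from the value_counts() output above.
counts = {False: 65527, True: 3465}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

Under this heuristic each class contributes the same total weight, so the minority class is not drowned out by the majority.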

## Test set particularities

- The bank test dataset **does not have the defaulted_loan feature**.
- The **defaulted_loan** column is the feature we need to predict.

In [25]:
data['test']['bank_data'].head()

Unnamed: 0,number_of_credit_cards,first_credit_card_application_date,client_id,credit_card_expire,credit_card_number,credit_score,number_secret_keys_requested,number_logs_per_day,credit_card_provider,account_id,last_credit_card_application_date
0,2,2014-08-17 09:15:31,QLOO6872808638149,03/21,6011107581395438,938,1,2,VISA 16 digit,CIQD7924293428249,2016-10-23 14:29:29
1,3,2017-07-13 17:42:38,LARO2418696890722,06/28,3583287285484582,974,1,3,Discover,ZHFB1331739147838,2015-12-15 13:49:15
2,3,2011-01-14 06:28:14,WLNZ9428625597800,02/21,4909285608552114,823,1,3,American Express,MCXO9264919957857,2017-10-02 03:35:27
3,1,2014-05-01 23:52:16,OTPX8452255444597,05/22,3573536236294405,954,1,3,JCB 16 digit,NQAR3462687861064,2016-04-24 23:11:23
4,3,2017-06-16 01:25:33,CPAQ6111541208111,08/21,6011045605327807,880,1,2,VISA 13 digit,QERG4184129464638,2014-11-17 21:57:30


In [31]:
pd.DataFrame(data['test']['bank_data'].columns)

Unnamed: 0,0
0,number_of_credit_cards
1,first_credit_card_application_date
2,client_id
3,credit_card_expire
4,credit_card_number
5,credit_score
6,number_secret_keys_requested
7,number_logs_per_day
8,credit_card_provider
9,account_id
10,last_credit_card_application_date


In [26]:
data['train']['bank_data'].head()

Unnamed: 0,number_of_credit_cards,first_credit_card_application_date,client_id,credit_card_expire,credit_card_number,credit_score,number_secret_keys_requested,number_logs_per_day,credit_card_provider,account_id,last_credit_card_application_date,defaulted_loan
0,2,2015-12-10 01:30:24,IDQN7326217819942,10/26,4917791162965063,792,1,3,JCB 16 digit,SSEL8415528042273,2017-08-29 01:53:05,False
1,3,2016-06-03 08:07:30,CSTV4627055779590,12/26,30357439741339,1056,1,3,VISA 16 digit,MUEW8640780419916,2015-08-16 13:56:25,False
2,2,2016-02-19 10:28:43,LAPI8584925015195,03/25,4743404908886490,611,1,3,VISA 16 digit,NUXE8686777893744,2015-03-09 05:00:35,False
3,1,2010-10-26 17:19:22,QGCO7450958537677,02/21,4193019096594311,704,1,3,Mastercard,HPZV6827338198621,2014-05-01 15:16:00,False
4,1,2009-08-25 16:29:57,KTRF0650626151507,08/21,3533740926977068,860,1,3,VISA 19 digit,HDUL2479521679327,2014-06-01 12:28:18,False


In [32]:
pd.DataFrame(data['train']['bank_data'].columns)

Unnamed: 0,0
0,number_of_credit_cards
1,first_credit_card_application_date
2,client_id
3,credit_card_expire
4,credit_card_number
5,credit_score
6,number_secret_keys_requested
7,number_logs_per_day
8,credit_card_provider
9,account_id
10,last_credit_card_application_date
11,defaulted_loan
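
The difference between the two bank tables can also be checked programmatically. The sketch below hard-codes the column names from the `head()` outputs above so that it runs on its own; with the real data you would use the `.columns` of each frame instead.

```python
# Column names copied from the train and test head() outputs above.
train_columns = [
    "number_of_credit_cards", "first_credit_card_application_date", "client_id",
    "credit_card_expire", "credit_card_number", "credit_score",
    "number_secret_keys_requested", "number_logs_per_day", "credit_card_provider",
    "account_id", "last_credit_card_application_date", "defaulted_loan",
]
test_columns = [
    "number_of_credit_cards", "first_credit_card_application_date", "client_id",
    "credit_card_expire", "credit_card_number", "credit_score",
    "number_secret_keys_requested", "number_logs_per_day", "credit_card_provider",
    "account_id", "last_credit_card_application_date",
]

# Columns present in one table but not in the other.
missing_in_test = sorted(set(train_columns) - set(test_columns))
extra_in_test = sorted(set(test_columns) - set(train_columns))

# Columns available in both tables, in training order.
shared_columns = [column for column in train_columns if column in test_columns]

print("Missing in test:", missing_in_test)
print("Extra in test:", extra_in_test)
```

Only the target column should be missing from the test table; anything else would point to a loading problem.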


# CSV output template

In [10]:
demo_output = pd.DataFrame(index=data['test']['bank_data'].index)
demo_output['client_id'] = data['test']['bank_data']['client_id']
# fill with random labels
demo_output['defaulted_loan'] = np.random.choice([True, False], size=data['test']['bank_data'].shape[0], replace=True, p=[0.5, 0.5])
demo_output.head()

Unnamed: 0,client_id,defaulted_loan
0,SMMV5459810347915,True
1,AHGN6820097391146,True
2,NCEY3530535538288,False
3,VTLP4686361759706,False
4,NRAD1936477485713,True


In [11]:
# Export as CSV (index=False keeps only the two required columns)
demo_output.to_csv('ds_challenge_john_smith.csv', index=False)