# Feature engineering

## 1. First impression
Under normal conditions, exploring a Credit Bureau report about 0.8k lines long (2.5k lines / 3 reports) would take 1.5-2 days with documentation in place. Writing code to process it is a separate task of at least 1 day, depending on the complexity of the proposed features.

My expectation for a 1h deadline: explore the document structure from top to bottom down to a certain nesting level, writing down comments along the way. I will probably have to stop at the loans list or payments list level.

By the way, my IDE counts 1001 typos in the document, which slows down reading: it would be nice if the field names used *snake_case* or *camelCase*. But this is a provider-side issue, I guess.

**Disclaimer**: without documentation and the corresponding risk stats, the following thoughts are purely my expert assumptions.

## 2. Document structure insights
The file contains 3 concatenated application reports.

Top-level structure (sections):  
``1. subjectlist`` ~ technical section.  
``2. accountrating`` ~ a point scale (0 to some maximum) describing credit history by segment. Probably an internal Bureau standard. Some of the scores may be fairly strong risk predictors.  
``3. enquirydetails`` ~ technical info about the current Bureau request. Unlikely to contain risk predictors.  
``4. guarantorcount`` ~ loan guarantors summary. Unlikely to contain risk predictors, but may be useful in fraud detection or collection.  
``5. guarantordetails`` ~ loan guarantors details. Unlikely to contain risk predictors, but may be useful in fraud detection or collection.  
``6. telephonehistory`` ~ historical info on the client's phone numbers. Unlikely to contain risk predictors, but may be useful in fraud detection or collection.  
``7. employmenthistory`` ~ client's employment history: economic sectors, some employer names, positions, dates. Probably useful for risk prediction. Correlates with age, education and income.  
``8. enquiryhistorytop`` ~ the most recent part of the client's history of loan requests to different credit institutions. Probably useful for risk prediction, and may be a strong predictor. Many more enquiries than loans in the history may mean that the client is mostly rejected by credit institutions, which is a high-risk marker.  
``9. creditaccountsummary`` ~ aggregated metrics of the client's credit history. Most of the strongest risk predictors are here.  
``10. deliquencyinformation`` ~ delinquent loans information. Strong risk predictors could be generated from here, especially by distinguishing delinquencies on historical loans from those on current loans. Too few fields, in my opinion. It would be nice to have dates, amounts etc.  
``11. creditagreementsummary`` ~ top-level info on the client's loan history. Strong aggregated features may also be generated from here, depending on how good the ones from ``creditaccountsummary`` are (did the Bureau already aggregate the strongest features?) and on their mutual correlation.  
``12. personaldetailssummary`` ~ personal info section. Medium-strength risk predictors such as age or city may be found here.  
``13. accountmonthlypaymenthistory`` ~ payment schedule details for the loans in the client's history. Hard to propose anything here without docs: it looks like some encoding is used for the payment info.  
``14. accountmonthlypaymenthistoryheader`` ~ some additional info for ``accountmonthlypaymenthistory``. I cannot quickly tell whether it is useful for risk prediction.  
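
The enquiries-versus-loans idea from ``enquiryhistorytop`` could be sketched roughly as follows. This is only an illustration under unverified assumptions: that ``enquiryhistorytop`` and ``creditagreementsummary`` each hold one record per enquiry or loan (a list of dicts, or a single dict when there is one record), and the ``id`` field in the sample is hypothetical.

```python
def enquiry_to_loan_ratio(report: dict) -> float:
    """Ratio of credit enquiries to credit agreements in one report.

    Assumption (not checked against provider docs): both sections are
    lists of dicts, or a single dict when there is only one record.
    """
    def as_list(section):
        # Normalise None / single dict / list into a list of records.
        if section is None:
            return []
        if isinstance(section, dict):
            return [section]
        return list(section)

    enquiries = as_list(report.get('enquiryhistorytop'))
    loans = as_list(report.get('creditagreementsummary'))
    # +1 in the denominator avoids division by zero for clients with no loans.
    return len(enquiries) / (len(loans) + 1)


# Hypothetical miniature report, for illustration only.
sample_report = {
    'enquiryhistorytop': [{'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}],
    'creditagreementsummary': [{'id': 10}],
}
ratio = enquiry_to_loan_ratio(sample_report)
```

Whether a high ratio really marks high risk would have to be checked against actual default stats.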

## 3. Processing code draft
In the time left I will try to draft the processing of the ``.json`` file.
### 3.1. Document reading

In [1]:
import os
import pandas as pd

In [2]:
DATA_DIR = 'data'
DATA_FILE_NAME = 'credit_report_sample.json'

In [3]:
data_file_path = os.path.join(DATA_DIR, DATA_FILE_NAME)

In [4]:
df = pd.read_json(data_file_path).set_index('application_id')

In [5]:
df.shape

(3, 1)

In [6]:
df.head()

| application_id | data |
|---|---|
| 9711360 | {'consumerfullcredit': {'subjectlist': {'refer... |
| 9714953 | {'consumerfullcredit': {'subjectlist': {'refer... |
| 9714978 | {'consumerfullcredit': {'subjectlist': {'refer... |


### 3.2. Structure unpacking draft

In [7]:
consumer_full_credit = df['data'].apply(lambda item: item['consumerfullcredit'])

In [8]:
df.drop(labels=['data'], axis=1, inplace=True)

In [9]:
df['subjectlist'] = consumer_full_credit.apply(lambda item: item['subjectlist'])

In [10]:
df['accountrating'] = consumer_full_credit.apply(lambda item: item['accountrating'])

In [11]:
df['guarantorcount'] = consumer_full_credit.apply(lambda item: item['guarantorcount'])

In [12]:
df['guarantordetails'] = consumer_full_credit.apply(lambda item: item['guarantordetails'])

In [13]:
df.head()

| application_id | subjectlist | accountrating | guarantorcount | guarantordetails |
|---|---|---|---|---|
| 9711360 | {'reference': '128566', 'consumerid': '128566'... | {'noofotheraccountsbad': '0', 'noofotheraccoun... | {'accounts': '0', 'guarantorssecured': '0'} | {'guarantorgender': None, 'guarantorotherid': ... |
| 9714953 | {'reference': '58793', 'consumerid': '58793', ... | {'noofotheraccountsbad': '0', 'noofotheraccoun... | {'accounts': '0', 'guarantorssecured': '0'} | {'guarantorgender': None, 'guarantorotherid': ... |
| 9714978 | {'reference': '17688366', 'consumerid': '17688... | {'noofotheraccountsbad': '0', 'noofotheraccoun... | {'accounts': '0', 'guarantorssecured': '0'} | {'guarantorgender': None, 'guarantorotherid': ... |
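
The per-section cells above leave each section as a nested dict. A possible next step, sketched here under the assumption that these four sections are flat 1:1 dicts (which the output suggests but the docs do not confirm), is to flatten them into ``section__field`` keys. The helper name ``flatten_sections``, the ``__`` separator and the trimmed sample values are illustrative choices, not part of the original draft.

```python
def flatten_sections(report: dict, sections: list) -> dict:
    """Flatten dict-valued sections into 'section__field' keys."""
    flat = {}
    for section in sections:
        value = report.get(section)
        if not isinstance(value, dict):
            # Lists (1:M sections) and missing sections are skipped here.
            continue
        for field, field_value in value.items():
            flat[f'{section}__{field}'] = field_value
    return flat


ONE_TO_ONE_SECTIONS = ['subjectlist', 'accountrating',
                       'guarantorcount', 'guarantordetails']

# Trimmed sample: only fields visible in the output above, plus a
# hypothetical list-valued section to show that it is skipped.
sample_report = {
    'subjectlist': {'reference': '128566', 'consumerid': '128566'},
    'guarantorcount': {'accounts': '0', 'guarantorssecured': '0'},
    'enquiryhistorytop': [{'x': 1}],
}
flat_row = flatten_sections(sample_report, ONE_TO_ONE_SECTIONS)
```

In the notebook, one such dict per application could then be collected into a list and passed to ``pd.DataFrame`` with ``df.index`` as the index.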


## 4. Conclusion
Okay, at this point the timer shows the 1h allotted for the task has been spent. Further unpacking of the 1:1 features is fairly straightforward; dealing with all the 1:M relationships and the possible feature engineering, however, is a separate problem that does not fit into the deadline.
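
As a rough indication of what the 1:M part might involve, the sketch below collapses a list of loan records into a few per-application aggregates. It is a minimal illustration only: the ``amount`` field name and the sample values are hypothetical, and numeric values are assumed to arrive as strings, as the ``'0'`` values in the outputs above suggest.

```python
def aggregate_records(records, amount_field):
    """Collapse a 1:M list of records into per-application aggregates."""
    records = records or []
    amounts = []
    for record in records:
        raw = record.get(amount_field)
        try:
            amounts.append(float(raw))
        except (TypeError, ValueError):
            # Missing or non-numeric values are ignored.
            continue
    return {
        'count': len(records),
        'amount_sum': sum(amounts),
        'amount_max': max(amounts) if amounts else None,
    }


# Hypothetical loan records, for illustration only.
sample_loans = [
    {'amount': '1000.50'},
    {'amount': '250'},
    {'amount': None},
]
loan_features = aggregate_records(sample_loans, 'amount')
```

Which aggregates are actually worth computing depends on the provider documentation and risk stats that were not available here.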