# List of all Notebooks and Resources for this Project
(https://drive.google.com/file/d/1Z8vPNZAcivWOxeh3UKFfeARbQCMkQ_NR/view?usp=sharing)


* Cleaning:
    * Application Train/Test: https://drive.google.com/file/d/17PZtY-xD-6AbF_9B0CkhxnD25WAUhnzP/view?usp=sharing
    * Bureau Balance: https://drive.google.com/file/d/17CWrXSq0UD59yT0LgF_VXaqc-nkVcdFM/view?usp=sharing
    * Bureau: https://drive.google.com/file/d/16oY7SJ5Gup31-BsuykAEf6UxrKtMOjy-/view?usp=sharing
    * Credit Card: https://drive.google.com/file/d/17Xt2BNZ_AbtZDq3u5fUCz20u-27kmxfp/view?usp=sharing
    * Installments Payments: https://drive.google.com/file/d/17QxdLEpcFDgRFi9W28VSJgVFDU6cPLi9/view?usp=sharing
    * POS CASH: https://drive.google.com/file/d/16n5gXaxqB59kyMqAB5thSLDFwrxoqg0u/view?usp=sharing
    * Previous Application: https://drive.google.com/file/d/16Pl8cB-basjgk0aGTTi7NnlYXGJX7yje/view?usp=sharing

* EDA:
    * Application Train: https://drive.google.com/file/d/1WXiFmq0IVBg7ALqsjeDDq1DGYdf1tCjy/view?usp=sharing
    * Bureau: https://drive.google.com/file/d/16Yrojb1GLrqCbb9b1ADqxFHryxDxxbyw/view?usp=sharing
    * Previous Application: https://drive.google.com/file/d/15y6WRC9rUUmH9cvzN8pTXuQIEaVX7tTl/view?usp=sharing

* Model:
    * Feature Selection: https://drive.google.com/file/d/17AJWJKRkDIhD8Xe89T-Vg8lw6Nwqooyq/view?usp=sharing
    * Model Application Train Only: https://drive.google.com/file/d/16NjK4XB-TDXoDYRgso8mEtj8FrmZtGhy/view?usp=sharing
    * Model All Tables: https://drive.google.com/file/d/17FsG6U-pZuVAhapyLvpVZ0RxtmDRb-vb/view?usp=sharing

* Python file with the functions used: https://drive.google.com/file/d/1IsRcGuolR4Hnu6bGe44GcS_0UQPFM59h/view?usp=sharing


# Processing Plan


Plan for data wrangling:

- The POS CASH, installments, and credit card tables belong to the previous applications table and record the monthly payments.
- Aggregate them per previous ID (mean, latest, fraction of time on specific status) and merge with previous application.
- Bureau Balance belongs to the bureau table and records the monthly payments. Treat it in the same manner as above.
- Create new features in bureau, current, and previous applications (e.g. annuity/credit, annuity/income, etc.)
- Aggregate Bureau and previous applications with their adjacent/merged subtables on current ID
- Merge both with the applications table and create features based on all tables (e.g. (total annuity)/income)
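
The wrangling steps above could be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the code from the notebooks. The column names (`SK_ID_PREV`, `SK_ID_CURR`, `MONTHS_BALANCE`, `SK_DPD`, `NAME_CONTRACT_STATUS`, `AMT_ANNUITY`, `AMT_CREDIT`, `AMT_INCOME_TOTAL`) are assumed from the data description, and the helper `aggregate_monthly` and all derived feature names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented for illustration.
pos_cash = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 11, 11],
    "MONTHS_BALANCE": [-3, -2, -1, -2, -1],
    "SK_DPD": [0, 5, 0, 0, 0],
    "NAME_CONTRACT_STATUS": ["Active", "Active", "Completed", "Active", "Active"],
})
previous = pd.DataFrame({
    "SK_ID_PREV": [10, 11],
    "SK_ID_CURR": [100, 100],
    "AMT_ANNUITY": [1000.0, 500.0],
    "AMT_CREDIT": [10000.0, 2000.0],
})
application = pd.DataFrame({
    "SK_ID_CURR": [100, 101],
    "AMT_INCOME_TOTAL": [30000.0, 50000.0],
})


def aggregate_monthly(monthly):
    """Aggregate a monthly table per previous ID: mean, latest, status fraction."""
    monthly = monthly.sort_values(["SK_ID_PREV", "MONTHS_BALANCE"])
    monthly = monthly.assign(IS_ACTIVE=monthly["NAME_CONTRACT_STATUS"].eq("Active"))
    agg = monthly.groupby("SK_ID_PREV").agg(
        POS_DPD_MEAN=("SK_DPD", "mean"),
        POS_DPD_LATEST=("SK_DPD", "last"),      # latest month after sorting
        POS_FRAC_ACTIVE=("IS_ACTIVE", "mean"),  # fraction of months with status "Active"
    )
    return agg.reset_index()


# Merge the monthly aggregates into previous applications and add a ratio feature.
prev = previous.merge(aggregate_monthly(pos_cash), on="SK_ID_PREV", how="left")
prev["ANNUITY_TO_CREDIT"] = prev["AMT_ANNUITY"] / prev["AMT_CREDIT"]

# Aggregate previous applications on the current ID.
prev_agg = prev.groupby("SK_ID_CURR").agg(
    PREV_ANNUITY_SUM=("AMT_ANNUITY", "sum"),
    PREV_ANNUITY_TO_CREDIT_MEAN=("ANNUITY_TO_CREDIT", "mean"),
    PREV_POS_DPD_MEAN=("POS_DPD_MEAN", "mean"),
).reset_index()

# Merge with the applications table and build a cross-table feature.
full = application.merge(prev_agg, on="SK_ID_CURR", how="left")
full["PREV_ANNUITY_TO_INCOME"] = full["PREV_ANNUITY_SUM"] / full["AMT_INCOME_TOTAL"]
```

A left merge keeps applicants without any previous application (here ID 101); their aggregated columns are simply missing, which tree-based models such as LightGBM can handle natively.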

EDA:  

- Run on bureau, current, and previous applications, each merged with the target from the applications table, to study correlations/dependencies.

Model:

- 2 extremes: no past credit information (model on app table only), and extensive past credit information (model on all tables)
- RFECV with LightGBM on both subsets
- study and select model for each subset with corresponding RFE-selected features and deploy
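
A minimal sketch of the recursive feature elimination step, assuming scikit-learn's `RFECV`. To keep the example light it uses synthetic data and a shallow `DecisionTreeClassifier` as the estimator. A LightGBM classifier with the sklearn API could presumably be passed as `estimator` in the same way. The scoring metric, fold count, and all hyperparameters here are illustrative assumptions, not the settings used in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for one of the two feature subsets.
X, y = make_classification(
    n_samples=300,
    n_features=12,
    n_informative=4,
    n_redundant=2,
    weights=[0.9, 0.1],
    random_state=0,
)

# Shallow tree as a fast estimator; it exposes feature_importances_,
# which RFECV needs to rank and drop features.
estimator = DecisionTreeClassifier(max_depth=4, random_state=0)

selector = RFECV(
    estimator=estimator,
    step=1,                          # drop one feature per iteration
    cv=StratifiedKFold(n_splits=3),  # keep the class ratio in each fold
    scoring="roc_auc",
    min_features_to_select=1,
)
selector.fit(X, y)

selected_mask = selector.support_    # boolean mask over the input columns
X_selected = selector.transform(X)   # data reduced to the selected features
```

The mask in `selected_mask` can be applied to the column names of the real feature table to obtain the feature list for the final model on each subset.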


# Objective

You and your friend came up with a brilliant startup idea - provide risk evaluation as a service for retail banks. As with most successful startup teams, both of you have your specialty. Your friend is responsible for sales and operations, while you are responsible for everything product-related, from planning to data analysis to building the solution. You have quickly identified that machine learning will be an essential part of your offering because you believe that the models can capture statistical patterns in the defaults on bank loans. You decide to start your investigation by downloading this dataset from Home Credit Group. You are not yet sure what the most crucial problem for your potential clients is, so you had a meeting with your friend to discuss what your proof-of-concept (POC) product should look like. After a lot of arguing, you both agreed to create a number of different models so that you have a robust and diversified offering when you get your first meeting with the potential clients. You are eager to investigate the dataset and see what you can predict, so you propose that you come up with interesting features to analyze and predict - this way, you'll focus on building a solid offering, and she can work on getting meetings with the banks.

Objectives for this Part

 *   Practice translating business requirements into data science tasks.
 *   Practice performing EDA.
 *   Practice applying statistical inference procedures.
 *   Practice using machine learning to solve business problems.
 *   Practice deploying multiple machine learning models.

Requirements

 *   Download the data from [here](https://storage.googleapis.com/341-home-credit-default/home-credit-default-risk.zip) and the data description from [here](https://storage.googleapis.com/341-home-credit-default/Home%20Credit%20Default%20Risk.pdf).
 *   Create a plan for your investigation, analysis, and POC building. This should include your assumptions, overall objectives, and objectives for each step in your plan. You are not expected to have a plan for the whole project but instead have a clear understanding of what you'll try to achieve in the next step and build the plan one step at a time.
 *   Perform exploratory data analysis. This should include creating statistical summaries and charts, testing for anomalies, checking for correlations and other relations between variables, and other EDA elements.
 *   Perform statistical inference. This should include defining the target population, forming multiple statistical hypotheses and constructing confidence intervals, setting the significance levels, conducting z or t-tests for these hypotheses.
 *   Use machine learning models to predict the target variables based on your proposed plan. You should use hyperparameter tuning, model ensembling, the analysis of model selection, and other methods. The decision of where to use and not to use these techniques is up to you; however, they should be aligned with your team's objectives.
 *   Deploy these machine learning models to Google Cloud Platform. You are free to choose any deployment option you wish as long as it can be called via an HTTP request.
 *   Provide clear explanations in your notebook. Your explanations should inform the reader what you are trying to achieve, what results you got, and what these results mean.
 *   Provide suggestions about how your analysis and models can be improved.


# Remarks

- RFECV (with LightGBM as estimator) sometimes returned very few features despite the large number of features available across all tables combined. Experiments with features chosen by LightGBM importance or by RFECV with LightGBM gave results similar to the initial feature selection from RFECV with a simple decision tree (used for the sake of speed).


- Painful lesson at deployment: a LightGBM model created through the sklearn API could not be loaded from its pickle file. Loading called LightGBM-native routines, which cannot read the pickle format that the sklearn wrapper was stored in. The ONNX module still needs to be explored as a way to translate the model.
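
One way to avoid this kind of mismatch is to save and load the model with the same mechanism on both sides. The sketch below shows a plain `pickle` round trip for an sklearn-API estimator. A `DecisionTreeClassifier` stands in for the LightGBM model so the example stays lightweight, and the file name is hypothetical. Whether this resolves the original deployment error has not been verified here.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in estimator; in the project this would be the fitted sklearn-API
# LightGBM model (assumption: it pickles like any other sklearn estimator).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Hypothetical file location for the serialized model.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save with pickle ...
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# ... and load with pickle as well, not with a library-native text loader.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored estimator should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())

# Alternative for LightGBM specifically (not executed here): export the native
# booster with model.booster_.save_model("model.txt") and load it in the
# service with lightgbm.Booster(model_file="model.txt").
```

With the pickle route, the serving environment needs the same library versions that were used when the model was saved. The native text format is less sensitive to that, but the loaded `Booster` returns raw probabilities from `predict` rather than class labels.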

