# LendingClub Prediction

List of all notebooks and resources for this project:

* Data Cleaning: https://drive.google.com/file/d/19I0eztUcGgw7AdexyUdb3lCEU5AAO4vZ/view?usp=sharing

* EDA: https://drive.google.com/file/d/19tQcpwYM8aNWtaAtDHlwleMyqQSlmpAM/view?usp=sharing

* Loan acceptance model: https://drive.google.com/file/d/1AK3SW55QoWe8Kwx-DipmdvwDRVLEFUHZ/view?usp=sharing

* Loan status model: https://drive.google.com/file/d/18r3YmFwSFsenTtg1i_a8gL7uiA0h45-k/view?usp=sharing

* Interest rate model: https://drive.google.com/file/d/19t7v80wOd2_AaWW540qVnbiX6OhUd-UR/view?usp=sharing


* Python file with the functions: https://drive.google.com/file/d/1J_5u7W6Cqx46L1HR3_LGiSjGtefHK-x1/view?usp=sharing


* App for loan acceptance prediction: https://drive.google.com/drive/folders/19XUhv1YnsRV-MOtSiwgyFLqi13UAxvqF?usp=sharing

  * Deployed on Google Cloud: https://default-service-xrnowswa4a-oe.a.run.app/ (append `docs` to the URL to open the service overview)  
  Callable from the shell with:

    ```
    curl -X 'POST' \
      'https://default-service-xrnowswa4a-oe.a.run.app/predict'       \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "Amount Requested": 20000,
      "Risk_Score": 630,
      "Employment Length": "7 years",
      "dti": 20
    }'
    ```
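
The same request can also be sent from Python. The sketch below is a minimal, standard-library-only equivalent of the curl call above. The names `build_request`, `predict`, and `EXAMPLE_PAYLOAD` are illustrative and not part of the project's code, and the response is assumed to be JSON since its exact shape is not documented here.

```python
import json
import urllib.request

# Endpoint and payload copied from the curl example above.
PREDICT_URL = "https://default-service-xrnowswa4a-oe.a.run.app/predict"
EXAMPLE_PAYLOAD = {
    "Amount Requested": 20000,
    "Risk_Score": 630,
    "Employment Length": "7 years",
    "dti": 20,
}


def build_request(payload: dict, url: str = PREDICT_URL) -> urllib.request.Request:
    """Build a POST request with the same headers and JSON body as the curl call."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict(payload: dict, timeout: float = 30.0):
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(payload), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Uncomment to call the live service (requires network access):
# print(predict(EXAMPLE_PAYLOAD))
```

Splitting request construction from sending keeps the payload easy to inspect without contacting the service.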






* Presentation: https://docs.google.com/presentation/d/1CGl5izJ2B6SjtdrLRXcppHgRqzPRtLLqE2Q1RXziYSs/edit?usp=sharing

# Objective


Imagine that you are a data scientist who was just hired by LendingClub. They want to fully automate their lending decisions, and they hired you to lead this project. Your team consists of a product manager to help you understand the business domain and a software engineer who will help you integrate your solution into their product.

During the initial investigations, you've found that there was a similar initiative in the past, and luckily for you, they have left a somewhat clean dataset of LendingClub's loan data. The dataset is located in a public bucket, linked under Requirements below (although you were wondering if having your client data in a public bucket is such a good idea). In the first meeting with your team, you all decided to use this dataset because it will allow you to skip months of work building a dataset from scratch.

In addition, you have decided to tackle this problem iteratively so that you can test your hypothesis that these decisions can be automated and get actual feedback from the users as soon as possible. For that, you have proposed a three-step plan for approaching this problem.

* The first step of your plan is to create a machine learning model to classify loans into accepted/rejected so that you can start learning if you have enough data to solve this simple problem adequately.

* The second step is to predict the grade for the loan.

* The third step is to predict the subgrade and the interest rate.

Your team likes the plan, especially because after every step, you'll have a fully-working deployed model that your company can use. Excitedly you get to work!


## Objectives for this Part

- Practice downloading datasets from external sources.
- Practice performing EDA.
- Practice applying statistical inference procedures.
- Practice using various types of machine learning models.
- Practice building ensembles of machine learning models.
- Practice using hyperparameter tuning.
- Practice using AutoML tools.
- Practice deploying machine learning models.
- Practice visualizing data with Matplotlib & Seaborn.
- Practice reading data, performing queries, and filtering data.

## Requirements

- Download the data from [here](https://storage.googleapis.com/335-lending-club/lending-club.zip).
- Perform exploratory data analysis. This should include creating statistical summaries and charts, testing for anomalies, checking for correlations and other relations between variables, and other EDA elements.
- Perform statistical inference. This should include defining the target population, forming multiple statistical hypotheses, constructing confidence intervals, setting the significance levels, and conducting z- or t-tests for these hypotheses.
- Apply various machine learning models to predict the target variables based on your proposed plan. You should use hyperparameter tuning, model ensembling, the analysis of model selection, and other methods. The decision where to use and not to use these techniques is up to you; however, it should be aligned with your team's objectives.
- Deploy these machine learning models to Google Cloud Platform. You are free to choose any deployment option you wish as long as it can be called via an HTTP request.
- Provide clear explanations in your notebook. Your explanations should inform the reader what you are trying to achieve, what results you got, and what these results mean.
- Provide suggestions about how your analysis and models can be improved.


# Remarks

- Feature selection was based on correlation strengths; next time I would run RFECV on all features instead.

- The grade and subgrade model failed to detect the grade classes, most likely because of too few features (see above), and possibly also because the sample was too small. The corresponding notebook is left out for now. Nevertheless, the interest rate shows a strong correlation with the grade, so it could be used as a proxy.

- For the first time, using Dask was required; otherwise the processing crashed from running out of memory.
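
The RFECV idea from the first remark could look roughly like this. It is a minimal sketch on synthetic data from `make_classification`, not the LendingClub features; the estimator, the scoring metric, and the number of folds are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the loan features (NOT the LendingClub data).
N_FEATURES = 8
X, y = make_classification(
    n_samples=300,
    n_features=N_FEATURES,
    n_informative=3,
    n_redundant=2,
    random_state=0,
)

# Recursive feature elimination with cross-validation: drop one feature
# per step and keep the subset with the best cross-validated score.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
    min_features_to_select=1,
)
selector.fit(X, y)

print("number of features kept:", selector.n_features_)
print("selection mask:", selector.support_)
```

Unlike a correlation filter, this evaluates features jointly through the model, at the cost of refitting the estimator many times.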

