# Welcome to your final stage before getting certified. Let's go!

# Supervised learning part

* Imagine that you work in the loan department of a company. At some point, the company receives more loan applications than anyone can review by eye. You propose a way to screen those applications and decide which ones your boss should consider further.
* The **dataset** you have at hand contains the profiles of applicants whose loan applications were previously accepted.
* Of course, not all of them are good borrowers. You can inspect the current status of each loan in the column named `loan_status`. `Fully Paid` means the borrower has repaid the loan in full (both principal and interest). `Charged Off` means the borrower has missed instalments for a long period and has defaulted; these are the borrowers the company wants to avoid.
* Below is a brief description of each column (source: https://www.kaggle.com/faressayah/lending-club-loan-defaulters-prediction).

Column name | Description
------------|------------
loan_amnt   | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
term        | The number of payments on the loan. Values are in months and can be either 36 or 60.
int_rate    | Interest Rate on the loan
installment | The monthly payment owed by the borrower if the loan originates.
grade       | LC assigned loan grade
sub_grade   | LC assigned loan subgrade
emp_title   | The job title supplied by the Borrower when applying for the loan
emp_length  | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
annual_inc | The self-reported annual income provided by the borrower during registration.
verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified
issue_d    | The month in which the loan was funded
loan_status | Current status of the loan
purpose    | A category provided by the borrower for the loan request
title | The loan title provided by the borrower
zip_code | The first 3 numbers of the zip code provided by the borrower in the loan application
addr_state | The state provided by the borrower in the loan application
dti | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
earliest_cr_line | The month the borrower's earliest reported credit line was opened
open_acc | The number of open credit lines in the borrower's credit file.
pub_rec | Number of derogatory public records
revol_bal | Total credit revolving balance
revol_util | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_acc | The total number of credit lines currently in the borrower's credit file
initial_list_status | The initial listing status of the loan. Possible values are – W, F
application_type | Indicates whether the loan is an individual application or a joint application with two co-borrowers
mort_acc | Number of mortgage accounts.
pub_rec_bankruptcies | Number of public record bankruptcies

In [3]:
# download datasets from github and unzip (google colab)
!wget https://raw.githubusercontent.com/Pataweepr/scb_TS_course/master/exam/supervised-unsupervised/lending_club_loan_two.csv.zip
!unzip lending_club_loan_two.csv.zip

In [None]:
# Read the data
import pandas as pd

df = pd.read_csv('lending_club_loan_two.csv')
df.head()

<font color='purple'>Q: Which column will you use as a target?

<font color='purple'>Q: Please explain why we can use the applicant profiles in the past to predict loan approval of the new applications. Would it be better to use the dataset of rejected and approved applications? Why?

## You are now trying to understand your data.

<font color='purple'>Q: Is this dataset imbalanced? Please describe and show the evidence here.
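
One possible way to gather the evidence is to count the classes in the target column. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real data); only the column name `loan_status` and its two values are taken from this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame(
    {"loan_status": ["Fully Paid"] * 8 + ["Charged Off"] * 2}
)

# Absolute counts and relative frequencies of each class.
counts = toy["loan_status"].value_counts()
shares = toy["loan_status"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on `df["loan_status"]` would show whether one class dominates.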

<font color='purple'>Q: Based on your domain knowledge, are there any columns that you don't want to further consider as features in your predictive models? Why? 

In [None]:
# Select only the related features here.

<font color='purple'>Q: Do you spot missing values or incomplete information? Which columns? How much? How do you plan to handle those missing values? Explain your plan here and implement it in the `prepare your data` section below.
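
A minimal sketch of a missing-value report, assuming pandas marks the gaps as `NaN`/`None`. The frame `toy` and its values are invented for illustration; the column names are borrowed from the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps.
toy = pd.DataFrame(
    {
        "mort_acc": [0.0, np.nan, 2.0, np.nan],
        "emp_title": ["Teacher", None, "Nurse", "Driver"],
        "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    }
)

# Count missing entries per column and express them as a percentage.
missing = toy.isna().sum()
report = pd.DataFrame(
    {"n_missing": missing, "pct_missing": 100 * missing / len(toy)}
)

# Keep only the columns that actually have gaps, worst first.
report = report[report["n_missing"] > 0].sort_values(
    "n_missing", ascending=False
)
print(report)
```

How to handle each column (dropping rows, dropping the column, or imputing) remains a judgment call that depends on how much is missing and on what the column means.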

<font color='purple'>Q: Investigate how strongly each feature (both numeric and categorical) is related to the target. You can use any statistical measurements or visualization methods. Which features do you think may be important to predict defaulters? 


## Prepare your data

### <font color='purple'>Encode the categorical features as follows.

Convert the following column into numbers
- term
    - 36 months -> 36, 60 months -> 60

Convert the following columns using label encoding
- grade
    - A B C D E F G
- sub_grade
    - A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 C1 C2 C3 C4 C5 D1 D2 D3 D4 D5 E1 E2 E3 E4 E5 F1 F2 F3 F4 F5 G1 G2 G3 G4 G5
- emp_length
    - < 1 year, 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10+ years
- verification_status
    - Not Verified, Source Verified, Verified
- loan_status
    - Charged Off, Fully Paid
- initial_list_status
    - f, w

Convert the following columns using one-hot encoding
- home_ownership
    - ANY, MORTGAGE, NONE, OTHER, OWN, RENT
- purpose
    - car, credit_card, debt_consolidation, educational, home_improvement, house, major_purchase, medical, moving, other, renewable_energy, small_business, vacation, wedding
- application_type
    - DIRECT_PAY, INDIVIDUAL, JOINT
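
The three kinds of encoding listed above can be sketched as follows. This is only an illustration on a made-up frame (`toy` is hypothetical), with one column per encoding type; it assumes integer codes are assigned in the order the values are listed, which is a choice rather than a requirement.

```python
import pandas as pd

# Hypothetical toy frame with one column per encoding type.
toy = pd.DataFrame(
    {
        "term": ["36 months", "60 months", "36 months"],
        "grade": ["B", "A", "G"],
        "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
        "home_ownership": ["RENT", "OWN", "MORTGAGE"],
    }
)

# Numeric conversion: pull the digits out of strings such as "36 months".
toy["term"] = toy["term"].str.extract(r"(\d+)", expand=False).astype(int)

# Label encoding with an explicit order (assumed: codes follow list order).
grade_order = ["A", "B", "C", "D", "E", "F", "G"]
toy["grade"] = toy["grade"].map({g: i for i, g in enumerate(grade_order)})
toy["loan_status"] = toy["loan_status"].map({"Charged Off": 0, "Fully Paid": 1})

# One-hot encoding: one 0/1 column per category present in the data.
toy = pd.get_dummies(toy, columns=["home_ownership"], dtype=int)
print(toy)
```

An explicit mapping keeps ordered categories such as `grade` in their natural order, which an alphabetical encoder would only preserve by coincidence.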


### <font color='purple'>Split your dataset
Split the data into a training set, a validation set, and a test set using a ratio of 80:10:10.
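
One common way to obtain an 80:10:10 split is to call scikit-learn's `train_test_split` twice. The sketch below runs on synthetic arrays (`X` and `y` are made up); stratifying on the target is an assumption that may be worth keeping if the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5 features, 20% / 80% classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 200 + [1] * 800)

# First cut: 80% for training, 20% held out.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Second cut: split the held-out 20% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```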

## Train, evaluate, and fine-tune your models

<font color='purple'>Q: What measures do you plan to use as evaluation metrics? Please give your reasons.

<font color='purple'>Implement the following experiments.

* Baseline: Use all possible features, run a logistic regression model with at least one regularization term, and fine-tune the hyperparameters.

* M1: Use all possible features, run a random forest model, and fine-tune the hyperparameters.

* M2: Use all possible features, run any model, and fine-tune the hyperparameters.

* M3: Use all possible features, apply a sampling-based technique to address the class imbalance, run the best model from the above (Baseline, M1, and M2), and fine-tune the hyperparameters.

* M4: Customize the feature set by using any feature engineering techniques, and follow the best method from the above.
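
As a rough illustration of the Baseline experiment, the sketch below tunes the regularization strength `C` of an L2-regularized logistic regression with a grid search. The data are synthetic (`make_classification`), and the grid values, the `roc_auc` scoring, and the 5-fold cross-validation are all assumptions rather than requirements of this exam.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the prepared training data.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=0
)

# Scale the features, then fit logistic regression (L2 penalty by default).
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern carries over to M1 and M2 by swapping the estimator and the parameter grid; tuning against the separate validation set instead of cross-validation is an equally valid option.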

<font color='purple'>Q: Plot ROC curves and precision-recall curves to compare your models (Baseline, M1, M2, M3, M4).
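
A minimal sketch of how such a comparison plot could be produced with `roc_curve` and `precision_recall_curve`. The two models and the synthetic data are placeholders; here class `1` is treated as the positive class, so if `Charged Off` is the class of interest, the labels or `pos_label` would need to be set accordingly.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a single held-out split.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder models; M2-M4 would be added to this dict the same way.
models = {
    "Baseline": LogisticRegression(max_iter=1000),
    "M1": RandomForestClassifier(n_estimators=50, random_state=0),
}

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
roc_aucs = {}
for name, model in models.items():
    # Predicted probability of the positive class (label 1).
    scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)
    precision, recall, _ = precision_recall_curve(y_te, scores)
    roc_aucs[name] = auc(fpr, tpr)
    ax_roc.plot(fpr, tpr, label=f"{name} (AUC={roc_aucs[name]:.2f})")
    ax_pr.plot(recall, precision, label=name)

ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC")
ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-recall")
ax_roc.legend()
ax_pr.legend()
```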

<font color='purple'>Q: What is the performance of your best model that you are going to report to your boss?

## Interpret your results from your best model

<font color='purple'>Q: What features are the most important? Do you think they make sense?

<font color='purple'>Q: Do you have any comments on these results for your boss, who will read this report? If so, you can leave them here.

# Unsupervised learning part

K-means

Drop the following columns from the dataset
- issue_d
- earliest_cr_line

Drop the following column from the dataset and use it as the true labels
- loan_status




<font color='purple'>Q: Find the best K for k-means.
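
One common heuristic for choosing K is to compare the inertia (elbow method) and the silhouette score over a range of candidates. The sketch below does this on synthetic blobs with three hand-picked, well-separated centers; the candidate range 2 to 6 and the use of the silhouette score as the selection rule are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated blobs (centers chosen by hand).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)
# k-means is distance based, so put the features on a common scale.
X = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia always falls as k grows; silhouette peaks at a good k.
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

best_k = max(results, key=lambda k: results[k][1])
for k, (inertia, sil) in results.items():
    print(k, round(inertia, 1), round(sil, 3))
print("best k by silhouette:", best_k)
```

On a large dataset the silhouette score can be slow to compute; `silhouette_score` accepts a `sample_size` argument to evaluate it on a random subset.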

<font color='purple'>Q: Plot the best clustering result.

<font color='purple'>Q: Find clusters using DBSCAN. Tune possible parameters.
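
DBSCAN's two main parameters are `eps` (neighbourhood radius) and `min_samples` (points needed to form a dense region). A simple way to explore them is a small grid, recording the number of clusters and noise points for each setting. The sketch below uses synthetic two-moons data, and the grid values are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Synthetic data with a non-convex shape, scaled to a common range.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

summary = []
for eps in (0.1, 0.2, 0.3, 0.5):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # DBSCAN marks noise points with the label -1.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        summary.append((eps, min_samples, n_clusters, n_noise))

for row in summary:
    print(row)
```

Settings that produce a single giant cluster or mostly noise are usually poor candidates; the remaining ones can then be compared with an internal metric or by visual inspection.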

<font color='purple'>Q: Plot the best clustering result.

<font color='purple'>Q: Let the true label be loan_status. Evaluate the best model of each algorithm using homogeneity, completeness, and v-measure.


https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_completeness_v_measure.html#sklearn.metrics.homogeneity_completeness_v_measure
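
The function linked above returns all three scores in one call. The sketch below applies it to two tiny hand-made labelings (both invented for illustration) to show that the scores ignore how clusters are numbered and that splitting a class into several pure clusters lowers completeness but not homogeneity.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

true_labels = [0, 0, 1, 1]

# Same grouping as the truth, only with the cluster ids swapped.
permuted = [1, 1, 0, 0]
h1, c1, v1 = homogeneity_completeness_v_measure(true_labels, permuted)

# Every point in its own cluster: each cluster is pure, but classes are split.
over_split = [0, 1, 2, 3]
h2, c2, v2 = homogeneity_completeness_v_measure(true_labels, over_split)

print(h1, c1, v1)
print(h2, c2, v2)
```

When evaluating DBSCAN this way, note that noise points (label `-1`) are treated as one more cluster by these metrics.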