# SET UP

## CREATING PROJECT ENVIRONMENT

To keep the project dependencies isolated from both the system and other projects, a dedicated Conda environment named pf_riskscoring is created in this section. Its specification is then exported to the pf_riskscoring.yml file.

```
conda create --name pf_riskscoring python numpy pandas matplotlib seaborn scikit-learn scipy sqlalchemy xgboost jupyter

conda activate pf_riskscoring

conda install -c conda-forge pyjanitor scikit-plot yellowbrick imbalanced-learn jupyter_contrib_nbextensions cloudpickle

conda install -c districtdatalabs yellowbrick

pip install category_encoders

pip install streamlit-echarts

pip install pipreqs

conda env export > pf_riskscoring.yml
```
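Once the environment is active, it can be useful to confirm which of the requested packages are actually installed and at which version. The following is a minimal sketch, not part of the original notebook: the helper name `installed_versions` and the `PACKAGES` subset are assumptions made for illustration, and the names are the distribution names used in the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical subset of the distributions requested in the commands above.
PACKAGES = ["numpy", "pandas", "scikit-learn", "xgboost", "pyjanitor"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the state of the currently active environment.
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

The printed versions can then be compared against the ones pinned in the exported .yml file.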

## IMPORTING PACKAGES

In [21]:
import os
import numpy as np
import pandas as pd
from janitor import clean_names

# To increase autocomplete response speed
%config IPCompleter.greedy=True

## CREATING PROJECT DIRECTORY

Defining root directory where the project is to be created:

In [15]:
# Normalise Windows backslashes to forward slashes (independent of the current OS)
root = r'C:\Users\pedro\PEDRO\DS\Portfolio'.replace('\\', '/')

Defining project name:

In [16]:
dir_name = 'RISK_SCORING'

### Creating the project directory and structure

In [14]:
path = root + '/' + dir_name

try:
    os.mkdir(path)
    os.mkdir(path + '/01_Documents')
    os.mkdir(path + '/02_Data')
    os.mkdir(path + '/02_Data/01_Originals')
    os.mkdir(path + '/02_Data/02_Validation')
    os.mkdir(path + '/02_Data/03_Work')
    os.mkdir(path + '/02_Data/04_Caches')
    os.mkdir(path + '/03_Notebooks')
    os.mkdir(path + '/03_Notebooks/01_Functions')
    os.mkdir(path + '/03_Notebooks/02_Development')
    os.mkdir(path + '/03_Notebooks/03_System')
    os.mkdir(path + '/04_Models')
    os.mkdir(path + '/05_Results')
    os.mkdir(path + '/09_Others')
    
except OSError:
    # Raised, for example, when one of the directories already exists
    print("Creation of the %s directory has failed." % path)
else:
    print("%s directory has been successfully created." % path)

C:/Users/pedro/PEDRO/DS/Portfolio/RISK_SCORING directory has been successfully created.
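The cell above stops at the first `os.mkdir` call that fails, for instance when the notebook is rerun and the project folder already exists. As a hedged alternative, the sketch below builds the same layout idempotently with `pathlib`. The names `SUBDIRS` and `create_structure` are hypothetical, and the demonstration runs in a temporary directory rather than in the real project root.

```python
import tempfile
from pathlib import Path

# Leaf folders of the layout above; parent folders are created implicitly.
SUBDIRS = (
    "01_Documents",
    "02_Data/01_Originals",
    "02_Data/02_Validation",
    "02_Data/03_Work",
    "02_Data/04_Caches",
    "03_Notebooks/01_Functions",
    "03_Notebooks/02_Development",
    "03_Notebooks/03_System",
    "04_Models",
    "05_Results",
    "09_Others",
)


def create_structure(project_path, subdirs=SUBDIRS):
    """Create every folder, silently skipping those that already exist."""
    project_path = Path(project_path)
    for sub in subdirs:
        (project_path / sub).mkdir(parents=True, exist_ok=True)
    return project_path


# Demonstration in a throwaway directory (assumption: not the real root).
with tempfile.TemporaryDirectory() as tmp:
    demo = create_structure(Path(tmp) / "RISK_SCORING")
    create_structure(demo)  # a second call does not raise
    created = sorted(
        p.relative_to(demo).as_posix() for p in demo.rglob("*") if p.is_dir()
    )
    print(created)
```

Because `exist_ok=True` tolerates existing folders, the cell can be rerun safely after new subfolders are added to the list.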


In [17]:
os.chdir(path)

### Creating the environment .yml file

The **pf_riskscoring.yml** file can be found in the '/01_Documents' folder of the project directory.

This file lists the specific versions of the packages used in the project and can be used to replicate the environment in the future if needed.

## CREATING INITIAL DATASETS

The original dataset, **Loans.csv**, can be found in the '/02_Data/01_Originals' folder. It was obtained from the official [LendingClub](https://www.lendingclub.com/) website and contains information about past loan applicants.

### Importing the data

Data file name and full path:

In [18]:
data_file_name = 'Loans.csv'
full_path = path + '/02_Data/01_Originals/' + data_file_name

Brief review of the file content:

In [19]:
open(full_path,'r',encoding='utf-8').readlines()[:2]

['Client Id,Employment Title,Employment Length,Annual Income,Income Verification,Scoring,DTI,Home Ownership,Nº Mortages,Nº Credit Lines,% Credit Cards Exceeding 75%,Revolving utilization,Nº Cancellations 12 Months,Nº Derogations,Nº Months Since Last Derrog,Loan Id,Description,Purpose,Loan Amount,Interest Rate,Term,Installment,Amortised Amount,Status,Recovered Amount\n',
 '137387967,Hvac technician ,3 years,54000.0,Source Verified,A,19.31,MORTGAGE,2.0,10.0,33.3,45.2,0.0,0.0,10.0,,,debt_consolidation,15000.0,7.21, 36 months,464.6,2669.06,Current,0.0\n']
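The preview above calls `readlines()`, which loads the whole file into memory and leaves the file handle open. For a file of this size, a lighter option is to read only the lines that are needed. The sketch below assumes a hypothetical helper named `peek_lines` and demonstrates it on a small temporary file rather than on Loans.csv.

```python
import os
import tempfile
from itertools import islice


def peek_lines(file_path, n=2, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    with open(file_path, "r", encoding=encoding) as handle:
        return list(islice(handle, n))


# Demonstration on a tiny made-up file (not the project data).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("a,b\n1,2\n3,4\n")
    preview = peek_lines(sample_path, n=2)

print(preview)
```

In the notebook, the equivalent call would be `peek_lines(full_path)`.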

Importing the data:

In [24]:
data = pd.read_csv(full_path)
data

| | Client Id | Employment Title | Employment Length | Annual Income | Income Verification | Scoring | DTI | Home Ownership | Nº Mortages | Nº Credit Lines | ... | Loan Id | Description | Purpose | Loan Amount | Interest Rate | Term | Installment | Amortised Amount | Status | Recovered Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 137387967 | Hvac technician | 3 years | 54000.0 | Source Verified | A | 19.31 | MORTGAGE | 2.0 | 10.0 | ... | | | debt_consolidation | 15000.0 | 7.21 | 36 months | 464.60 | 2669.06 | Current | 0.00 |
| 1 | 4798121 | Target Promotions and Marketing,Inc | 10+ years | 65000.0 | Not Verified | D | 25.40 | RENT | 1.0 | 15.0 | ... | | | debt_consolidation | 10000.0 | 17.77 | 36 months | 360.38 | 6362.96 | Charged Off | 0.00 |
| 2 | 46641215 | Banker | 5 years | 135000.0 | Verified | A | 14.68 | RENT | 0.0 | 19.0 | ... | | | debt_consolidation | 24000.0 | 6.39 | 36 months | 734.38 | 24000.00 | Fully Paid | 0.00 |
| 3 | 87998444 | executive director | 9 years | 188000.0 | Source Verified | B | 11.69 | MORTGAGE | 3.0 | 15.0 | ... | | | credit_card | 27000.0 | 8.99 | 60 months | 560.35 | 12443.00 | Current | 0.00 |
| 4 | 132883631 | Subsea Technician | 7 years | 125000.0 | Source Verified | B | 9.00 | MORTGAGE | 1.0 | 6.0 | ... | | | debt_consolidation | 22000.0 | 10.90 | 36 months | 719.22 | 22000.00 | Fully Paid | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199995 | 51876926 | Office Manager | 10+ years | 42000.0 | Not Verified | C | 20.85 | MORTGAGE | 6.0 | 9.0 | ... | | | debt_consolidation | 8000.0 | 12.29 | 36 months | 266.83 | 8000.00 | Fully Paid | 0.00 |
| 199996 | 121031962 | Owner & President | 6 years | 111697.0 | Verified | B | 16.63 | MORTGAGE | 2.0 | 10.0 | ... | | | other | 10000.0 | 9.44 | 36 months | 320.05 | 4388.51 | Current | 0.00 |
| 199997 | 135641397 | Sr. Field Engineer | 10+ years | 285000.0 | Source Verified | D | 6.02 | MORTGAGE | 3.0 | 9.0 | ... | | | small_business | 30000.0 | 17.47 | 36 months | 1076.62 | 5387.53 | Current | 0.00 |
| 199998 | 53664762 | Attorney | 8 years | 168000.0 | Source Verified | E | 4.69 | RENT | 0.0 | 8.0 | ... | | | small_business | 30050.0 | 18.25 | 60 months | 767.17 | 2964.44 | Charged Off | 2750.88 |


### Extracting and reserving the validation dataset for the production script

30% of the data is randomly set aside to simulate the unseen data that the model will receive once it is in production, so that its production performance can be checked.

In [25]:
val = data.sample(frac=0.3)

In [26]:
validation_file_name = 'validation.csv'
full_path = path + '/02_Data/02_Validation/' + validation_file_name

val.to_csv(full_path,index=False)

### Extracting and saving the work dataset

In [27]:
work = data.loc[~data.index.isin(val.index)]

In [28]:
work_file_name = 'work.csv'
full_path = path + '/02_Data/03_Work/' + work_file_name

work.to_csv(full_path,index=False)
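Since `data.sample(frac=0.3)` is called without a seed, rerunning the notebook produces a different validation/work split each time. The sketch below shows one way to make the split reproducible and to check that the two parts are disjoint and complete. The toy DataFrame, the variable names and the seed value `42` are assumptions for illustration only and are not taken from the original notebook.

```python
import pandas as pd

# Toy stand-in for the loans data; the real frame comes from Loans.csv.
toy = pd.DataFrame({"loan_amount": range(10)})

# random_state fixes the random draw; 42 is an arbitrary, assumed seed.
toy_val = toy.sample(frac=0.3, random_state=42)

# Everything not sampled for validation becomes the work set.
toy_work = toy.drop(toy_val.index)

# Sanity checks: no shared rows, and no rows lost.
assert toy_val.index.intersection(toy_work.index).empty
assert len(toy_val) + len(toy_work) == len(toy)

# Repeating the draw with the same seed selects the same rows.
same_again = toy.sample(frac=0.3, random_state=42).index.equals(toy_val.index)
print(len(toy_val), len(toy_work), same_again)
```

`DataFrame.drop` relies on the index being unique, which holds for the default integer index that `pd.read_csv` assigns.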