# 1.0 An end-to-end classification problem (ETL)



## 1.1 Dataset description

The notebooks focus on a **credit modeling problem** for borrowers. The dataset was obtained through a Dataquest project and is available at the link below. It comes from **Lending Club** and covers loans made from **2007 to 2011**. Lending Club is a marketplace for personal loans that matches borrowers seeking a loan with investors looking to lend money and earn a return. The **target variable**, that is, what we want to predict, is whether a borrower will repay the loan, given their history.

You can download the data from [Kaggle](https://www.kaggle.com/datasets/samaxtech/lending-club-20072011-data).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1fKGuR5U5ECf7On6Zo1UWzAIWZrMmZnGc"></center>

## 1.2 Install and load libraries

In [1]:
# !pip install wandb

In [2]:
import wandb
import pandas as pd

In [3]:
import warnings
warnings.filterwarnings('ignore')

## 1.3 Preprocessing

### 1.3.1 Log in to wandb


In [4]:
import os
from dotenv import load_dotenv
load_dotenv()

WANDB_API_KEY=os.environ.get('WANDB_API_KEY')
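`os.environ.get` returns `None` when the variable is missing, so a typo in the `.env` file would only surface later, at login. A minimal sketch of failing early, assuming a hypothetical helper `require_env` and a placeholder variable `DEMO_API_KEY`, neither of which is part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of wandb or python-dotenv.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in your shell")
    return value


# Demo with a placeholder variable so the sketch runs without real credentials
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook, the same check could be applied to `WANDB_API_KEY` right after `load_dotenv()`.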

In [6]:
# Login to Weights & Biases
!wandb login --relogin $WANDB_API_KEY

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/phamdinhkhanh/.netrc
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin


### Artifacts

In [None]:
input_artifact="project_name/preprocessed_data.csv:latest"
artifact_name="feature_engineered_data.csv"
artifact_type="feature_engineering"
artifact_description="Data after feature_engineering"

### Setup wandb project

In [None]:
# Create a new job_type
run = wandb.init(project="risk_credit", job_type="feature_engineering")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33maikhanhblog[0m ([33maikhanhblog-datascienceworld-kan[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [9]:
# Download the latest version of the input artifact (preprocessed_data.csv)
artifact = run.use_artifact(input_artifact)

# Create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [10]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [None]:
# Separate the features (x) from the target column (y)
x = df.drop('Diabetes_binary', axis=1)
y = df['Diabetes_binary']

Shape before cleaning:  (42538, 52)


### Choose the 15 most impactful features using chi2

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Score every feature against the target; chi2 requires non-negative feature values
chi2_selector = SelectKBest(chi2, k="all")
chi2_selector.fit(x, y)

chi2_scores = chi2_selector.scores_

chi2_results = pd.DataFrame({
    "Feature": x.columns,
    "Chi2 Score": chi2_scores
})

chi2_results = chi2_results.sort_values(by="Chi2 Score", ascending=False)

chi2_results
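To make the score less of a black box, here is a minimal pure-Python sketch of the chi-squared statistic for non-negative features against class labels. As far as I understand, it follows the same observed-versus-expected idea that scikit-learn's `chi2` uses, but the function `chi2_scores_sketch` and the toy data are illustrative assumptions, not the library's implementation:

```python
def chi2_scores_sketch(X, y):
    """Chi-squared score per feature.

    X: list of rows of non-negative numbers; y: list of class labels.
    Hypothetical helper for illustration only.
    """
    n_samples = len(X)
    n_features = len(X[0])
    # Total of each feature over all samples
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            # Expected total if the feature were independent of the class
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores


# Toy data: feature 0 differs between classes, feature 1 is constant
X_toy = [[1, 5], [0, 5], [4, 5], [5, 5]]
y_toy = [0, 0, 1, 1]
scores = chi2_scores_sketch(X_toy, y_toy)

# Indices of the k highest-scoring features, best first
k = 1
top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
```

A feature whose per-class totals match what class frequencies alone would predict scores zero, which is why the constant column in the toy data ranks last.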

In [None]:
selector = SelectKBest(score_func=chi2, k=15)
x_selected = selector.fit_transform(x, y)

# Get the selected feature names
selected_features = x.columns[selector.get_support()]
print("Selected Features:", selected_features)

0         Fully Paid
1        Charged Off
2         Fully Paid
3         Fully Paid
5         Fully Paid
            ...     
39781     Fully Paid
39782     Fully Paid
39783     Fully Paid
39784     Fully Paid
39785     Fully Paid
Name: loan_status, Length: 38770, dtype: object

In [None]:
x = df[selected_features]

loan_status
Fully Paid     33136
Charged Off     5634
Name: count, dtype: int64
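
The counts above show that the two classes are far from balanced. A quick sketch of the arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts output above
counts = {"Fully Paid": 33136, "Charged Off": 5634}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

With roughly 85% of loans fully paid, plain accuracy can look high for a model that rarely predicts a charge-off, so this ratio may be worth keeping in mind when splitting and evaluating the data.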

### Polynomial Features

In [None]:
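from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion: squares and pairwise products of the selected features.
# include_bias=False omits the constant "1" column, so it need not be dropped afterwards.
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
x = poly.fit_transform(x)
data = pd.DataFrame(x, columns=poly.get_feature_names_out())
# Use .values so the target is attached by position rather than by index label
data['Diabetes_binary'] = y.values
data.head()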
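
To make the expansion concrete, here is a minimal pure-Python sketch of what a degree-2 expansion produces for a single row, without the constant column. The helper `degree2_expand` and the feature names `a` and `b` are illustrative assumptions; the term ordering is meant to mirror scikit-learn's, but treat that as an assumption too:

```python
from itertools import combinations_with_replacement


def degree2_expand(names, values):
    """Return (names, values) for all degree-1 and degree-2 terms, without a bias column."""
    out_names = list(names)
    out_values = list(values)
    # Every unordered pair of feature positions, including a feature paired with itself
    for i, j in combinations_with_replacement(range(len(values)), 2):
        out_names.append(f"{names[i]}^2" if i == j else f"{names[i]} {names[j]}")
        out_values.append(values[i] * values[j])
    return out_names, out_values


new_names, new_values = degree2_expand(["a", "b"], [2.0, 3.0])

# 15 input features give 15 linear terms plus 15 * 16 / 2 = 120 degree-2 terms
n = 15
n_output = n + n * (n + 1) // 2
```

The column count grows quadratically with the number of inputs, which is one reason to select a small set of features before expanding them.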

In [None]:
# Write the feature-engineered data to a CSV file
data.to_csv(artifact_name, index=False)

In [34]:
# Create a new artifact and configure with the necessary arguments
artifact = wandb.Artifact(name=artifact_name,
                          type=artifact_type,
                          description=artifact_description)
artifact.add_file(artifact_name)

ArtifactManifestEntry(path='preprocessed_data.csv', digest='/nEo6o4VzA5+PJDHkPUQnQ==', size=4963598, local_path='/Users/phamdinhkhanh/Library/Application Support/wandb/artifacts/staging/tmp3j9ymrm0', skip_cache=False)

In [35]:
# Upload the artifact to Wandb
run.log_artifact(artifact)

<Artifact preprocessed_data.csv>

In [36]:
# close the run
# wait a moment after the previous cell finishes before running this one
run.finish()
