# 1.0 An end-to-end classification problem (ETL)



## 1.1 Dataset description

The notebooks focus on a borrower **credit modeling problem**. The dataset was obtained through a Dataquest project and is available at the link below. It comes from **Lending Club** and covers loans made from **2007 to 2011**. Lending Club is a marketplace for personal loans that matches borrowers seeking a loan with investors looking to lend money and earn a return. The **target variable**, that is, what we want to predict, is whether a person will repay the loan given their history.

You can download the data from [Kaggle](https://www.kaggle.com/datasets/samaxtech/lending-club-20072011-data).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1fKGuR5U5ECf7On6Zo1UWzAIWZrMmZnGc"></center>

In [1]:
# !pip install wandb

In [2]:
# import wandb
import pandas as pd

In [3]:
import warnings
warnings.filterwarnings('ignore')

## 1.3 Preprocessing

### 1.3.1 Log in to wandb


In [4]:
# import os
# from dotenv import load_dotenv
# load_dotenv()

# WANDB_API_KEY=os.environ.get('WANDB_API_KEY')

In [5]:
# Login to Weights & Biases
!wandb login --relogin $WANDB_API_KEY

zsh:1: command not found: wandb
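
The `command not found` message above means the shell could not find the `wandb` executable, which is what one would expect while the `pip install wandb` cell is still commented out. A minimal sketch, using only the standard library, for checking this before calling the CLI; the variable names are illustrative and not part of the original notebook:

```python
import importlib.util
import shutil

# True if the wandb Python package can be imported in this environment
wandb_importable = importlib.util.find_spec("wandb") is not None

# Path to the wandb command-line tool, or None if it is not on PATH
wandb_cli_path = shutil.which("wandb")

if not wandb_importable or wandb_cli_path is None:
    print("wandb is not available; run `pip install wandb` first")
else:
    print(f"wandb CLI found at {wandb_cli_path}")
```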


### Artifacts

In [6]:
input_artifact="project_name/preprocessed_data.csv:latest"
artifact_name="feature_engineered_data.csv"
artifact_type="feature_engineering"
artifact_description="Data after feature_engineering"

### Set up wandb project

In [7]:
# # Create a new job_type
# run = wandb.init(project="risk_credit", job_type="feature_engineering")

In [8]:
# # Download the latest version of the input artifact
# artifact = run.use_artifact(input_artifact)

# # Create a dataframe from the artifact
# df = pd.read_csv(artifact.file())

In [20]:
## When this notebook is integrated into the pipeline, load the data from wandb using the cell above instead
df = pd.read_csv('diabetes_binary_health_indicators_BRFSS2015.csv')

In [10]:
df.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


In [11]:
x = df.drop('Diabetes_binary', axis=1)
y = df['Diabetes_binary']

### Choose the 15 most impactful features using chi2

In [12]:
from sklearn.feature_selection import SelectKBest, chi2

# k="all" keeps every feature so that all chi2 scores can be inspected
chi2_selector = SelectKBest(chi2, k="all")
x_kbest = chi2_selector.fit_transform(x, y)


chi2_scores = chi2_selector.scores_

chi2_results = pd.DataFrame({
    "Feature": x.columns,
    "Chi2 Score": chi2_scores
})

chi2_results = chi2_results.sort_values(by="Chi2 Score", ascending=False)

chi2_results

Unnamed: 0,Feature,Chi2 Score
15,PhysHlth,133424.406534
14,MentHlth,21029.632228
3,BMI,18355.1664
16,DiffWalk,10059.506391
0,HighBP,10029.013935
13,GenHlth,9938.507776
18,Age,9276.141199
6,HeartDiseaseorAttack,7221.975378
1,HighChol,5859.710582
20,Income,4829.816361
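
One caveat when reading this ranking: scikit-learn's `chi2` works on the summed feature values per class, so multiplying a non-negative feature by a constant multiplies its score by the same constant. Features that take values well above 1 (such as `PhysHlth`, `MentHlth`, and `BMI` in the rows shown earlier) may therefore rank high partly because of their scale. The sketch below illustrates this on made-up toy data, not on the notebook's dataset:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy target: 100 samples of class 0 followed by 100 samples of class 1
y_toy = np.array([0] * 100 + [1] * 100)

# Binary flag that is set for 30 class-0 samples and 70 class-1 samples
flag = np.zeros(200)
flag[:30] = 1
flag[100:170] = 1

# Second column carries the same information, only rescaled by 30
X_toy = np.column_stack([flag, 30 * flag])

scores, p_values = chi2(X_toy, y_toy)
ratio = scores[1] / scores[0]
print(scores, ratio)  # the rescaled column scores 30 times higher
```

If scale effects are a concern, one option is to rescale or bin the features before scoring, or to compare against a scale-insensitive score such as `mutual_info_classif`.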


In [13]:
selector = SelectKBest(score_func=chi2, k=15)
x_selected = selector.fit_transform(x, y)

# Get the selected feature names
selected_features = x.columns[selector.get_support()]
print("Selected Features:", selected_features)

Selected Features: Index(['HighBP', 'HighChol', 'BMI', 'Smoker', 'Stroke', 'HeartDiseaseorAttack',
       'PhysActivity', 'HvyAlcoholConsump', 'GenHlth', 'MentHlth', 'PhysHlth',
       'DiffWalk', 'Age', 'Education', 'Income'],
      dtype='object')


In [14]:
x = df[selected_features]

### Polynomial Features

In [15]:
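from sklearn.preprocessing import PolynomialFeatures

# Expand the selected features with squares and pairwise products (degree 2)
poly = PolynomialFeatures(degree=2, interaction_only=False)
x = poly.fit_transform(x)
data = pd.DataFrame(x, columns=poly.get_feature_names_out())
# Drop the constant bias column, which PolynomialFeatures names '1'
data.drop(columns='1', inplace=True)
# Re-attach the target (both objects share the default integer index)
data['Diabetes_binary'] = y
data.head()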

Unnamed: 0,HighBP,HighChol,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,HvyAlcoholConsump,GenHlth,MentHlth,...,DiffWalk Age,DiffWalk Education,DiffWalk Income,Age^2,Age Education,Age Income,Education^2,Education Income,Income^2,Diabetes_binary
0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,5.0,18.0,...,9.0,4.0,3.0,81.0,36.0,27.0,16.0,12.0,9.0,0.0
1,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,3.0,0.0,...,0.0,0.0,0.0,49.0,42.0,7.0,36.0,6.0,1.0,0.0
2,1.0,1.0,28.0,0.0,0.0,0.0,0.0,0.0,5.0,30.0,...,9.0,4.0,8.0,81.0,36.0,72.0,16.0,32.0,64.0,0.0
3,1.0,0.0,27.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,...,0.0,0.0,0.0,121.0,33.0,66.0,9.0,18.0,36.0,0.0
4,1.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,2.0,3.0,...,0.0,0.0,0.0,121.0,55.0,44.0,25.0,20.0,16.0,0.0


## Upload the file to wandb

In [16]:
# # Generate the feature-engineered file from `data` (the expanded frame), not the raw `df`
# data.to_csv(artifact_name, index=False)

In [17]:
# # Create a new artifact and configure with the necessary arguments
# artifact = wandb.Artifact(name=artifact_name,
#                           type=artifact_type,
#                           description=artifact_description)
# artifact.add_file(artifact_name)

In [18]:
# # Upload the artifact to Wandb
# run.log_artifact(artifact)

In [19]:
# # Close the run
# # Wait a while after running the previous cell before executing this one
# run.finish()