In [8]:
%load_ext dotenv
%dotenv ../src/.env
import sys
sys.path.append("../src")
import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
from glob import glob
ft_file = os.getenv("CREDIT_DATA")
df_raw = pd.read_csv(ft_file)



# Data Source

+ For this example, we will use [Give Me Some Credit from Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit/data), a widely referenced example dataset.
 

# Feature Engineering

+ To engineer features is to transform the data in such a way that the information content is easily exposed to the model.
+ This statement can mean many things and depends heavily on what exactly "the model" is.
+ A flexible approach is to think of composable transformation pipelines.
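
As a minimal sketch of what "composable" can mean here, the snippet below chains two sklearn transformers into a single `Pipeline`. The toy array and the choice of a median imputer followed by a standard scaler are assumptions made for illustration; they are not taken from the credit dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy single-feature data with one missing value (illustrative only).
X = np.array([[1.0], [2.0], [np.nan], [5.0]])

# Each step is a (name, transformer) pair; the output of one step feeds the next.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scaler", StandardScaler()),                   # then center and scale
])

X_t = pipe.fit_transform(X)
print(X_t.ravel())
```

Because the pipeline is itself a transformer, it can be nested inside other pipelines or applied to a subset of columns.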

## Example Transforms in sklearn

Work with categorical variables:

+ `preprocessing.Binarizer(*[, threshold, copy])`: Binarize data (set feature values to 0 or 1) according to a threshold.
+ `preprocessing.KBinsDiscretizer([n_bins, ...])`:  Bin continuous data into intervals.
+ `preprocessing.LabelBinarizer(*[, neg_label, ...])`: Binarize labels in a one-vs-all fashion.
+ `preprocessing.LabelEncoder()`: Encode target labels with value between 0 and n_classes-1.
+ `preprocessing.MultiLabelBinarizer(*[, ...])`:  Transform between iterable of iterables and a multilabel format.
+ `preprocessing.OneHotEncoder(*[, categories, ...])`: Encode categorical features as a one-hot numeric array.
+ `preprocessing.OrdinalEncoder(*[, ...])`: Encode categorical features as an integer array.
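
To make the difference between two of these encoders concrete, here is a small sketch on a made-up `colors` column (a hypothetical example, not part of the credit data). `OneHotEncoder` produces one indicator column per category, while `OrdinalEncoder` maps each category to a single integer code.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature, shaped (n_samples, 1).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# One indicator column per category; categories are sorted alphabetically.
onehot = OneHotEncoder()
X_onehot = onehot.fit_transform(colors).toarray()  # sparse by default, so densify

# One integer code per category, following the same sorted order.
ordinal = OrdinalEncoder()
X_ordinal = ordinal.fit_transform(colors)

print(onehot.categories_[0])  # blue, green, red
print(X_onehot)
print(X_ordinal.ravel())
```

Ordinal codes imply an order between categories, so they tend to suit tree-based models or truly ordered categories better than linear models.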

Scale and normalize:

+ `preprocessing.StandardScaler(*[, copy, ...])`: Standardize features by removing the mean and scaling to unit variance.
+ `preprocessing.MaxAbsScaler(*[, copy])`: Scale each feature by its maximum absolute value.
+ `preprocessing.MinMaxScaler([feature_range, ...])`: Transform features by scaling each feature to a given range.
+ `preprocessing.Normalizer([norm, copy])`:  Normalize samples individually to unit norm.
+ `preprocessing.RobustScaler(*[, ...])`: Scale features using statistics that are robust to outliers.
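
The scalers above behave quite differently when a feature has extreme values. The sketch below compares three of them on a toy column with one large outlier; the values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy feature with one extreme value (illustrative only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

x_std = StandardScaler().fit_transform(x)  # (x - mean) / std; both are pulled by the outlier
x_mm = MinMaxScaler().fit_transform(x)     # maps [min, max] to [0, 1]; inliers are squeezed near 0
x_rob = RobustScaler().fit_transform(x)    # (x - median) / IQR; inliers keep a usable spread

print(x_std.ravel())
print(x_mm.ravel())
print(x_rob.ravel())
```

With `RobustScaler`, the four ordinary values remain evenly spaced around zero, while the outlier stays far away instead of compressing everything else.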


Nonlinear transforms:

+ `preprocessing.FunctionTransformer([func, ...])`: Constructs a transformer from an arbitrary callable.
+ `preprocessing.KernelCenterer()`: Center an arbitrary kernel matrix.
+ `preprocessing.PolynomialFeatures([degree, ...])`: Generate polynomial and interaction features.
+ `preprocessing.PowerTransformer([method, ...])`: Apply a power transform featurewise to make data more Gaussian-like.
+ `preprocessing.QuantileTransformer(*[, ...])`: Transform features using quantiles information.
+ `preprocessing.SplineTransformer([n_knots, ...])`: Generate univariate B-spline bases for features.
+ `preprocessing.TargetEncoder([categories, ...])`: Target Encoder for regression and classification targets.
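
As one example of a nonlinear transform, `FunctionTransformer` can wrap a plain NumPy function so that it can be used as a pipeline step. The sketch below applies `np.log1p` to a few heavily skewed, made-up values; whether a log transform suits any particular column is a modeling assumption, not something this notebook establishes.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap log(1 + x) and its inverse so the step can be undone later.
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)

# Skewed, non-negative toy values (illustrative only).
x = np.array([[0.0], [1.5], [477.0], [5710.0]])

x_log = log_tf.fit_transform(x)
x_back = log_tf.inverse_transform(x_log)

print(x_log.ravel())
print(x_back.ravel())
```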


## What are we doing?

![](./img/column_transform_1.png)

## Our data




In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   Unnamed: 0                            150000 non-null  int64  
 1   SeriousDlqin2yrs                      150000 non-null  int64  
 2   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 3   age                                   150000 non-null  int64  
 4   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 5   DebtRatio                             150000 non-null  float64
 6   MonthlyIncome                         120269 non-null  float64
 7   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 8   NumberOfTimes90DaysLate               150000 non-null  int64  
 9   NumberRealEstateLoansOrLines          150000 non-null  int64  
 10  NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 11  

## Transform in pandas or sklearn?

+ Depending on the perspective, the answer could be neither, pandas, or sklearn:

    - Neither: most joins and filtering should be done closer to the source, using a database or a parquet/Dask operation.
    - Pandas, Dask, or PySpark:
        * Renames and management tasks.
        * Use Python libraries like pandas, Dask, or PySpark to add contemporaneous features, manipulate time series (for example, adding lags), and run parallel computations (using Dask or PySpark).
        * Do not use these libraries for sample-dependent features.
    - sklearn or PyTorch:
        * Use Python libraries like sklearn or PyTorch to add sample-dependent features such as scaling and normalization, one-hot encoding, tokenization, and vectorization.
        * Model-dependent transformations: PCA, embeddings, etc.
        
+ Decisions must be guided by optimization criteria (time and resources) while avoiding data leakage.
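
To illustrate why sample-dependent features are a leakage risk, the sketch below fits a `StandardScaler` on the training rows only and then reuses the learned mean and variance on the test rows. The synthetic data and the 75/25 split are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for real features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn the mean and variance from the training rows only ...
scaler = StandardScaler().fit(X_train)

# ... and apply the same statistics to both splits.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0))  # approximately zero by construction
print(X_test_s.mean(axis=0))   # close to zero, but not exactly
```

Scaling the full dataset before splitting would let test-set statistics influence the training features, which is the kind of leakage the bullet above warns about.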

In [10]:
df = df_raw.drop(columns = ["Unnamed: 0"]).rename(
    columns = {
        'SeriousDlqin2yrs': 'delinquency',
        'RevolvingUtilizationOfUnsecuredLines': 'revolving_unsecured_line_utilization', 
        'age': 'age',
        'NumberOfTime30-59DaysPastDueNotWorse': 'num_30_59_days_late', 
        'DebtRatio': 'debt_ratio', 
        'MonthlyIncome': 'monthly_income',
        'NumberOfOpenCreditLinesAndLoans': 'num_open_credit_loans', 
        'NumberOfTimes90DaysLate':  'num_90_days_late',
        'NumberRealEstateLoansOrLines': 'num_real_estate_loans', 
        'NumberOfTime60-89DaysPastDueNotWorse': 'num_60_89_days_late',
        'NumberOfDependents': 'num_dependents'
    }
).assign(
    high_debt_ratio = lambda x: x['debt_ratio'] > 1, 
)

Build a pipeline that: 

+ Standardizes numerical features.
+ Applies one-hot encoding to `sector`.
+ Tokenizes `subsector`.
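
One way to combine the three steps listed above is sklearn's `ColumnTransformer`, sketched below on a tiny made-up frame. The `toy` DataFrame, its `price` column, and the use of `CountVectorizer` for tokenization are assumptions made for illustration; the notebook's own data and tokenizer may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one numeric, one categorical, and one text column.
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "sector": ["Utilities", "Health Care", "Utilities", "Financials"],
    "subsector": ["Electric Utilities", "Health Care Services",
                  "Multi-Utilities", "Regional Banks"],
})

preprocessor = ColumnTransformer(
    [
        # Numeric columns are passed as a list, so the step receives a 2D input.
        ("num", Pipeline([("standardizer", StandardScaler())]), ["price"]),
        ("sector", Pipeline([("one_hot", OneHotEncoder(handle_unknown="ignore"))]), ["sector"]),
        # Text vectorizers expect a 1D input, so the column is given as a plain string.
        ("subsector", CountVectorizer(binary=True), "subsector"),
    ],
    sparse_threshold=0,  # always return a dense array, for easier inspection
)

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Each transformer only sees its own columns, and the results are concatenated side by side in the order the steps are listed.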

In [14]:
df[df['debt_ratio'] > 1]

Unnamed: 0,delinquency,revolving_unsecured_line_utilization,age,num_30_59_days_late,debt_ratio,monthly_income,num_open_credit_loans,num_90_days_late,num_real_estate_loans,num_60_89_days_late,num_dependents
6,0,0.305682,57,0,5710.000000,,8,0,3,0,0.0
8,0,0.116951,27,0,46.000000,,2,0,0,0,
14,0,0.019657,76,0,477.000000,0.0,6,0,1,0,0.0
16,0,0.061086,78,0,2058.000000,,10,0,2,0,0.0
25,1,0.392248,50,0,1.595253,4676.0,14,0,3,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
149976,0,0.000627,76,0,60.000000,,5,0,0,0,0.0
149977,0,0.236450,29,0,349.000000,,3,0,0,0,0.0
149984,0,0.037548,84,0,25.000000,,5,0,0,0,0.0
149992,0,0.871976,50,0,4132.000000,,11,0,1,0,3.0


### Numerical pipeline

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_num = Pipeline([
    ('standardizer', StandardScaler())
])

### Sector Encoding Pipeline

In [9]:
from sklearn.preprocessing import OneHotEncoder
pipe_onehot = Pipeline([
    ('one_hot', OneHotEncoder())
])

### Subsector Feature Extraction 

+ In the case of `subsector`, we have a variable with 124 entries. 
+ We would like to tokenize each entry and transform the tokenized representation into one-hot encoding.
+ To tokenize is to transform a string into smaller components, for example, phrases into individual words.
+ To achieve this, we can use 

In [15]:
df['subsector'].unique().tolist()

['Telecom Tower REITs',
 'Semiconductors',
 'Multi-Utilities',
 'Managed Health Care',
 'Biotechnology',
 'Multi-Family Residential REITs',
 'Automobile Manufacturers',
 'Consumer Staples Merchandise Retail',
 'Electronic Equipment & Instruments',
 'Electric Utilities',
 'Health Care Services',
 'Investment Banking & Brokerage',
 'Health Care Equipment',
 'Technology Hardware, Storage & Peripherals',
 'Food Distributors',
 'Regional Banks',
 'Semiconductor Materials & Equipment',
 'Broadcasting',
 'Retail REITs',
 'Restaurants',
 'Passenger Airlines',
 'Air Freight & Logistics',
 'Oil & Gas Equipment & Services',
 'Life & Health Insurance',
 'Movies & Entertainment',
 'Building Products',
 'Electrical Components & Equipment',
 'Copper',
 'Paper & Plastic Packaging Products & Materials',
 'Hotels, Resorts & Cruise Lines',
 'Personal Care Products',
 'Asset Management & Custody Banks',
 'Industrial Conglomerates',
 'Electronic Components',
 'Financial Exchanges & Data',
 'IT Consulting &

In [6]:
pipe_

array([[-0.04612762, -0.06840328,  3.04755527, ..., -0.37409732,
        -0.20186597, -0.04845659],
       [-0.28838193,  0.12201217,  0.14000315, ..., -0.37409732,
        -0.20277346, -0.12272273],
       [-0.32875765,  0.01622581, -0.37291337, ..., -0.37293476,
        -0.1950787 , -0.02171777],
       ...,
       [ 1.84777221,  0.22866631, -0.76119768, ...,  5.82848849,
        -0.19516693, -0.01063677],
       [ 3.20438058,  4.82231501, -0.48342918, ...,  5.87895946,
        -0.19266503,  0.02073887],
       [ 3.0428777 ,  4.83586889, -0.52407759, ...,  5.9386689 ,
        -0.1917922 ,  0.01320264]])