_Lambda School Data Science — Practicing & Understanding Predictive Modeling_

# Categorical Encoding

### [category_encoders](http://contrib.scikit-learn.org/categorical-encoding/)

Install category_encoders, version >= 2.0.0
- Google Colab: `pip install category_encoders`
- Local, Anaconda: `conda install -c conda-forge category_encoders`

In [4]:
import category_encoders as ce
ce.__version__

'2.0.0'

### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- [Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)
- [Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

### Can you predict if peer-to-peer loans are charged off or fully paid?

[Lending Club says,](https://www.lendingclub.com/) _"Our mission is to transform the banking system to make credit more affordable and investing more rewarding."_ You can view their [loan statistics and visualizations](https://www.lendingclub.com/info/demand-and-credit-profile.action).

[According to Wikipedia,](https://en.wikipedia.org/wiki/Lending_Club) _Lending Club is the world's largest peer-to-peer lending platform._

>Lending Club enables borrowers to create unsecured personal loans between $1,000 and 40,000. The standard loan period is three years. Investors can search and browse the loan listings on Lending Club website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. Lending Club makes money by charging borrowers an origination fee and investors a service fee.

The data is a stratified sample of 100,000 Lending Club peer-to-peer loans with a loan status of "Charged Off" or "Fully Paid", issued from 2007 through 2018.

The set of variables included here are the intersection of what's available both when investors download historical data and when investors browse loans for manual investing.

Target: `charged_off`

Data dictionary: https://resources.lendingclub.com/LCDataDictionary.xlsx

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500
url = 'https://drive.google.com/uc?export=download&id=1AafT_i1dmfaxqKiyFofVndleKozbQw3l'
df = pd.read_csv(url)
df.shape

(100000, 104)

In [2]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='charged_off')
y = df['charged_off']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, test_size=0.20, stratify=y, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((80000, 103), (20000, 103), (80000,), (20000,))

In [3]:
y_train.value_counts(normalize=True)

0    0.80045
1    0.19955
Name: charged_off, dtype: float64

In [6]:
X_train.describe(include='number').T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,80000.0,1667101.0,384960.93752,1000000.0,1333199.0,1667750.5,1998223.0,2336172.0
member_id,0.0,,,,,,,
loan_amnt,80000.0,14385.93,8653.335024,1000.0,8000.0,12000.0,20000.0,40000.0
funded_amnt,80000.0,14378.77,8649.381479,1000.0,8000.0,12000.0,20000.0,40000.0
installment,80000.0,437.2868,260.270011,23.61,249.54,374.64,578.265,1566.8
annual_inc,80000.0,75872.64,59552.605399,0.0,45760.0,65000.0,90000.0,6000000.0
url,0.0,,,,,,,
dti,79988.0,18.36223,12.2945,0.0,11.88,17.635,24.12,999.0
delinq_2yrs,80000.0,0.3147875,0.86675,0.0,0.0,0.0,0.0,20.0
inq_last_6mths,80000.0,0.6546125,0.936913,0.0,0.0,0.0,1.0,7.0


In [7]:
X_train.describe(exclude='number').T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq
term,80000,2,36 months,60781
application_type,80000,2,Individual,78548
initial_list_status,80000,2,w,46513
disbursement_method,80000,2,Cash,79612
home_ownership,80000,6,MORTGAGE,39614
grade,80000,7,B,23583
emp_length,75387,11,10+ years,26395
purpose,80000,14,debt_consolidation,46509
sub_grade,80000,35,C1,5049
addr_state,80000,50,CA,11523


## Categorical exploration, 1 feature at a time

Change `feature`, then re-run these cells!

In [135]:
feature = 'initial_list_status'

In [136]:
X_train[feature].value_counts()

w    46513
f    33487
Name: initial_list_status, dtype: int64

In [137]:
X_train[[feature]].head()

Unnamed: 0,initial_list_status
25539,w
28968,w
34666,f
20864,w
75088,w


### One Hot Encoding

Warning: May run slow, or run out of memory, with high cardinality categoricals!

In [131]:
encoder = ce.OneHotEncoder(use_cat_names=True)
encoded = encoder.fit_transform(X_train[[feature]])
print(f'{len(encoded.columns)} columns')
encoded.head()

14 columns


Unnamed: 0,purpose_debt_consolidation,purpose_credit_card,purpose_home_improvement,purpose_major_purchase,purpose_moving,purpose_house,purpose_other,purpose_renewable_energy,purpose_wedding,purpose_vacation,purpose_small_business,purpose_car,purpose_medical,purpose_educational
25539,1,0,0,0,0,0,0,0,0,0,0,0,0,0
28968,0,1,0,0,0,0,0,0,0,0,0,0,0,0
34666,0,0,1,0,0,0,0,0,0,0,0,0,0,0
20864,1,0,0,0,0,0,0,0,0,0,0,0,0,0
75088,0,0,1,0,0,0,0,0,0,0,0,0,0,0


### Binary Encoding

In [138]:
encoder = ce.BinaryEncoder()
encoded = encoder.fit_transform(X_train[[feature]])
print(f'{len(encoded.columns)} columns')
encoded.head()

2 columns


Unnamed: 0,initial_list_status_0,initial_list_status_1
25539,0,1
28968,0,1
34666,1,0
20864,0,1
75088,0,1


In [11]:
2**7

128

In [12]:
import math
math.sqrt(50)

7.0710678118654755

### "Ordinal" Encoding

In [139]:
encoder = ce.OrdinalEncoder()
encoded = encoder.fit_transform(X_train[[feature]])
print(f'1 column, {encoded[feature].nunique()} unique values')
encoded.head()

1 column, 2 unique values


Unnamed: 0,initial_list_status
25539,1
28968,1
34666,2
20864,1
75088,1


### Target (Mean) Encoding

Warning: May overfit!

In [125]:
first5 = X_train[feature].head().values
train = pd.concat([X_train, y_train], axis='columns')
target_mean = train.groupby(feature)['charged_off'].mean()
target_mean[first5]

home_ownership
MORTGAGE    0.172490
MORTGAGE    0.172490
MORTGAGE    0.172490
MORTGAGE    0.172490
RENT        0.233511
Name: charged_off, dtype: float64

In [126]:
min_samples_leaf = 10
encoder = ce.TargetEncoder(min_samples_leaf=min_samples_leaf)
encoded = encoder.fit_transform(X_train[[feature]], y_train)
print(f'1 column, {encoded[feature].nunique()} unique values, min_samples_leaf={min_samples_leaf}')
encoded.head()

1 column, 6 unique values, min_samples_leaf=10


Unnamed: 0,home_ownership
25539,0.17249
28968,0.17249
34666,0.17249
20864,0.17249
75088,0.233511


In [18]:
min_samples_leaf = 100
encoder = ce.TargetEncoder(min_samples_leaf=min_samples_leaf)
encoded = encoder.fit_transform(X_train[[feature]], y_train)
print(f'1 column, {encoded[feature].nunique()} unique values, min_samples_leaf={min_samples_leaf}')
encoded.head()

1 column, 2 unique values, min_samples_leaf=100


Unnamed: 0,application_type
25539,0.198643
28968,0.198643
34666,0.198643
20864,0.198643
75088,0.198643


In [19]:
min_samples_leaf = 1000
encoder = ce.TargetEncoder(min_samples_leaf=min_samples_leaf)
encoded = encoder.fit_transform(X_train[[feature]], y_train)
print(f'1 column, {encoded[feature].nunique()} unique values, min_samples_leaf={min_samples_leaf}')
encoded.head()

1 column, 2 unique values, min_samples_leaf=1000


Unnamed: 0,application_type
25539,0.198643
28968,0.198643
34666,0.198643
20864,0.198643
75088,0.198643


### BONUS: Data Wrangling / Feature Engineering Example

In [5]:
X = df.drop(columns='charged_off')
y = df['charged_off']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, test_size=0.20, stratify=y, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((80000, 103), (20000, 103), (80000,), (20000,))

In [5]:
def wrangle(X):
    X = X.copy()
    
    # Drop some columns
    X = X.drop(columns='id')  # id is random
    X = X.drop(columns=['member_id', 'url', 'desc'])  # All null
    X = X.drop(columns='title')  # Duplicative of purpose
    X = X.drop(columns='grade')  # Duplicative of sub_grade
    
    # Transform sub_grade from "A1" - "G5" to 1.1 - 7.5
    def wrangle_sub_grade(x):
        first_digit = ord(x[0]) - 64
        second_digit = int(x[1])
        return first_digit + second_digit/10
    
    X['sub_grade'] = X['sub_grade'].apply(wrangle_sub_grade)

    # Convert percentages from strings to floats
    X['int_rate'] = X['int_rate'].str.strip('%').astype(float)
    X['revol_util'] = X['revol_util'].str.strip('%').astype(float)
        
    # Transform earliest_cr_line to an integer: how many days it's been open
    X['earliest_cr_line'] = pd.to_datetime(X['earliest_cr_line'], infer_datetime_format=True)
    X['earliest_cr_line'] = pd.Timestamp.today() - X['earliest_cr_line']
    X['earliest_cr_line'] = X['earliest_cr_line'].dt.days
    
    # Create features for three employee titles: teacher, manager, owner
    X['emp_title'] = X['emp_title'].str.lower()
    X['emp_title_teacher'] = X['emp_title'].str.contains('teacher', na=False)
    X['emp_title_manager'] = X['emp_title'].str.contains('manager', na=False)
    X['emp_title_owner']   = X['emp_title'].str.contains('owner', na=False)
    
    # Drop categoricals with highest cardinality
    X = X.drop(columns=['emp_title', 'zip_code'])
    
    #encode categoricals as Binary ecoders
    X['application_type'] = ce.BinaryEncoder().fit_transform(X['application_type'])
    X['disbursement_method'] = ce.BinaryEncoder().fit_transform(X['disbursement_method'])
    X['home_ownership']= ce.BinaryEncoder().fit_transform(X['home_ownership'])
    X['initial_list_status']= ce.BinaryEncoder().fit_transform(X['initial_list_status'])
    X['purpose'] = ce.BinaryEncoder().fit_transform(X['purpose'])
    X['addr_state'] = ce.BinaryEncoder().fit_transform(X['purpose'])

    
    #convert to simple int
    X['term'] = X['term'].str.strip(' months').astype(np.int8)
    
    # Transform features with many nulls to binary flags
    many_nulls = ['sec_app_mths_since_last_major_derog',
                  'sec_app_revol_util',
                  'sec_app_earliest_cr_line',
                  'sec_app_mort_acc',
                  'dti_joint',
                  'sec_app_collections_12_mths_ex_med',
                  'sec_app_chargeoff_within_12_mths',
                  'sec_app_num_rev_accts',
                  'sec_app_open_act_il',
                  'sec_app_open_acc',
                  'revol_bal_joint',
                  'annual_inc_joint',
                  'sec_app_inq_last_6mths',
                  'mths_since_last_record',
                  'mths_since_recent_bc_dlq',
                  'mths_since_last_major_derog',
                  'mths_since_recent_revol_delinq',
                  'mths_since_last_delinq',
                  'il_util',
                  'emp_length',
                  'mths_since_recent_inq',
                  'mo_sin_old_il_acct',
                  'mths_since_rcnt_il',
                  'num_tl_120dpd_2m',
                  'bc_util',
                  'percent_bc_gt_75',
                  'bc_open_to_buy',
                  'mths_since_recent_bc']

    for col in many_nulls:
        X[col] = X[col].isnull()
    
    # For features with few nulls, do mean imputation
    for col in X:
        if X[col].isnull().sum() > 0:
            X[col] = X[col].fillna(X[col].mean())
    
    # Return the wrangled dataframe
    return X

# wrangle data
X_train = wrangle(X_train)
X_test  = wrangle(X_test)
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((80000, 98), (20000, 98), (80000,), (20000,))

In [6]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score


params= {
    'learning_rate' : [.03, .07],
    #'nthread':[4],
    'n_estimators': [50, 100, 150],
    'max_depth': [2, 4, 6],
    'booster': [ 'dart', 'gbtree', 'gblinear']
}

gridsearch = RandomizedSearchCV(
    XGBClassifier(seed=42), #tree_method='gpu_hist'
    param_distributions = params,
    scoring='roc_auc',
    n_iter = 10,
    cv=5,
    n_jobs = -1,
    verbose=10,
    random_state=42,
    return_train_score=True
)

gridsearch.fit(X_train, y_train)



Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   18.4s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   30.8s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   41.5s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  3.8min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
       subsample=1),
          fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
          param_distributions={'learning_rate': [0.03, 0.07], 'n_estimators': [50, 100, 150], 'max_depth': [2, 4, 6], 'booster': ['dart', 'gbtree', 'gblinear']},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score=True, scoring='roc_auc', verbose=10)

In [7]:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values(by='rank_test_score').T

y_pred_proba = gridsearch.best_estimator_.predict_proba(X_test)[:,1]
test_roc = roc_auc_score(y_test, y_pred_proba)

gridsearch.best_estimator_, test_roc

(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.07, max_delta_step=0,
        max_depth=4, min_child_weight=1, missing=None, n_estimators=150,
        n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
        subsample=1), 0.715281231418327)

In [8]:
results.sort_values(by='rank_test_score').T


Unnamed: 0,9,6,5,3,8,0,1,7,4,2
mean_fit_time,21.5238,66.0383,39.5392,9.4237,8.77855,8.17585,3.57142,3.19616,4.62755,2.17603
std_fit_time,0.462933,0.920468,0.472999,0.200286,0.0544124,0.151241,0.109109,0.0477271,0.017849,0.14956
mean_score_time,0.180764,0.20856,0.160704,0.115912,0.106513,0.118986,0.106241,0.0930955,0.0961848,0.104358
std_score_time,0.00654401,0.00526414,0.00497458,0.00562023,0.00346978,0.00633256,0.00766825,0.000787299,0.0015477,0.00693722
param_n_estimators,150,150,150,50,50,100,100,100,150,50
param_max_depth,4,6,4,4,4,2,4,6,6,4
param_learning_rate,0.07,0.07,0.03,0.07,0.03,0.03,0.07,0.07,0.03,0.07
param_booster,gbtree,dart,dart,dart,dart,gbtree,gblinear,gblinear,gblinear,gblinear
params,"{'n_estimators': 150, 'max_depth': 4, 'learnin...","{'n_estimators': 150, 'max_depth': 6, 'learnin...","{'n_estimators': 150, 'max_depth': 4, 'learnin...","{'n_estimators': 50, 'max_depth': 4, 'learning...","{'n_estimators': 50, 'max_depth': 4, 'learning...","{'n_estimators': 100, 'max_depth': 2, 'learnin...","{'n_estimators': 100, 'max_depth': 4, 'learnin...","{'n_estimators': 100, 'max_depth': 6, 'learnin...","{'n_estimators': 150, 'max_depth': 6, 'learnin...","{'n_estimators': 50, 'max_depth': 4, 'learning..."
split0_test_score,0.726777,0.725816,0.723321,0.720956,0.714437,0.712012,0.689554,0.689554,0.680788,0.675465


In [21]:
# remove gblinear, focus on more esimators / deeper
params= {
    'learning_rate' : [.03, .07],
    #'nthread':[4],
    'n_estimators': [150, 200, 250],
    'max_depth': [4, 5, 6, 7],
    'booster': [ 'dart', 'gbtree']
}

gridsearch = RandomizedSearchCV(
    XGBClassifier(seed=42, tree_method='gpu_exact', n_jobs=-1), 
    param_distributions = params,
    scoring='roc_auc',
    n_iter = 10,
    cv=5,
    n_jobs = -1,
    verbose=10,
    random_state=42,
    return_train_score=True
)

gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   27.6s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   47.9s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  5.4min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=-1, nthread=None, objective='binary:logistic',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=42, silent=True, subsample=1, tree_method='gpu_exact'),
          fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
          param_distributions={'learning_rate': [0.03, 0.07], 'n_estimators': [150, 200, 250], 'max_depth': [4, 5, 6, 7], 'booster': ['dart', 'gbtree']},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score=True, scoring='roc_auc', verbose=10)

In [22]:
results.sort_values(by='rank_test_score').T


Unnamed: 0,9,6,5,3,8,0,1,7,4,2
mean_fit_time,22.0957,67.3448,40.9571,9.91946,9.11618,8.24537,3.53236,3.31123,4.87569,1.9991
std_fit_time,0.642369,0.675683,0.686521,0.257899,0.288283,0.252675,0.128728,0.0817902,0.228474,0.0796813
mean_score_time,0.182406,0.205002,0.169517,0.113825,0.106513,0.121191,0.110046,0.0955479,0.0953442,0.096084
std_score_time,0.0111649,0.00316888,0.0055635,0.00424181,0.00256035,0.00393897,0.0133416,0.00677632,0.00444275,0.0044957
param_n_estimators,150,150,150,50,50,100,100,100,150,50
param_max_depth,4,6,4,4,4,2,4,6,6,4
param_learning_rate,0.07,0.07,0.03,0.07,0.03,0.03,0.07,0.07,0.03,0.07
param_booster,gbtree,dart,dart,dart,dart,gbtree,gblinear,gblinear,gblinear,gblinear
params,"{'n_estimators': 150, 'max_depth': 4, 'learnin...","{'n_estimators': 150, 'max_depth': 6, 'learnin...","{'n_estimators': 150, 'max_depth': 4, 'learnin...","{'n_estimators': 50, 'max_depth': 4, 'learning...","{'n_estimators': 50, 'max_depth': 4, 'learning...","{'n_estimators': 100, 'max_depth': 2, 'learnin...","{'n_estimators': 100, 'max_depth': 4, 'learnin...","{'n_estimators': 100, 'max_depth': 6, 'learnin...","{'n_estimators': 150, 'max_depth': 6, 'learnin...","{'n_estimators': 50, 'max_depth': 4, 'learning..."
split0_test_score,0.727838,0.726744,0.723806,0.721425,0.714573,0.712102,0.690138,0.690138,0.681808,0.676796


In [23]:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values(by='rank_test_score').T

y_pred_proba = gridsearch.best_estimator_.predict_proba(X_test)[:,1]
test_roc = roc_auc_score(y_test, y_pred_proba)

gridsearch.best_estimator_, test_roc

(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.07, max_delta_step=0,
        max_depth=4, min_child_weight=1, missing=None, n_estimators=200,
        n_jobs=-1, nthread=None, objective='binary:logistic',
        random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
        seed=42, silent=True, subsample=1, tree_method='gpu_exact'),
 0.712043020651798)

(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1, gamma=0, learning_rate=0.07, max_delta_step=0,
        max_depth=4, min_child_weight=1, missing=None, n_estimators=200,
        n_jobs=-1, nthread=None, objective='binary:logistic',
        random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
        seed=42, silent=True, subsample=1, tree_method='gpu_exact'),
        
 0.712043020651798)

In [25]:
from sklearn.linear_model import LogisticRegression

params= {
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': [{0: 1, 1: 1}, #(equivalent to None)
{0: 1, 1: 2},
{0: 1, 1: 10}, #(roughly equivalent to 'balanced' for this dataset)
{0: 1, 1: 100},
{0: 1, 1: 10000}]

}

gridsearch = RandomizedSearchCV(
    LogisticRegression(n_jobs=-1), 
    param_distributions = params,
    scoring='roc_auc',
    n_iter = 10,
    cv=5,
    n_jobs = -1,
    verbose=10,
    random_state=42,
    return_train_score=True
)

gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   17.8s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  3.0min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=-1,
          penalty='l2', random_state=None, solver='warn', tol=0.0001,
          verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
          param_distributions={'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], 'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 10}, {0: 1, 1: 100}, {0: 1, 1: 10000}]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score=True, scoring='roc_auc', verbose=10)

In [26]:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values(by='rank_test_score').T

y_pred_proba = gridsearch.best_estimator_.predict_proba(X_test)[:,1]
test_roc = roc_auc_score(y_test, y_pred_proba)

gridsearch.best_estimator_, test_roc

(LogisticRegression(C=1.0, class_weight={0: 1, 1: 2}, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=-1, penalty='l2', random_state=None,
           solver='newton-cg', tol=0.0001, verbose=0, warm_start=False),
 0.7025240234214909)

(LogisticRegression(C=1.0, class_weight={0: 1, 1: 2}, dual=False,
           fit_intercept=True, intercept_scaling=1, max_iter=100,
           multi_class='warn', n_jobs=-1, penalty='l2', random_state=None,
           solver='newton-cg', tol=0.0001, verbose=0, warm_start=False),
 0.7025240234214909)