# Predicting bill_result in Challenge_8_AK_AZ_WA_prepared

### Notebook automatically generated from your model

Model LightGBM (s1), trained on 2023-04-09 07:53:54.

#### Generated on 2023-04-10 19:24:31.299858

This notebook reproduces the steps of a binary classification (BINARY_CLASSIFICATION) on Challenge_8_AK_AZ_WA_prepared.
The main objective is to predict the variable bill_result.

#### Warning

The goal of this notebook is to provide easily readable, explainable code that reproduces the main steps
of training the model. It is not complete: some of the preprocessing done by DSS visual machine learning is not
replicated here. As a result, this notebook will not give the same results and model performance as the DSS visual machine
learning model.

Let's start by importing the required libraries:

In [0]:
import sys
import dataiku
import numpy as np
import pandas as pd
import sklearn as sk
import dataiku.core.pandasutils as pdu
from dataiku.doctor.preprocessing import PCA
from collections import defaultdict, Counter

And tune pandas display options:

In [0]:
pd.set_option('display.width', 3000)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

#### Importing base data

The first step is to get our machine learning dataset:

In [0]:
# We apply the preparation that you defined. You should not modify this.
preparation_steps = []
preparation_output_schema = {'columns': [{'name': 'session', 'type': 'string'}, {'name': 'identifier', 'type': 'string'}, {'name': 'classification', 'type': 'string'}, {'name': 'jurisdiction.name', 'type': 'string'}, {'name': 'jurisdiction.classification', 'type': 'string'}, {'name': 'from_organization.name', 'type': 'string'}, {'name': 'from_organization.classification', 'type': 'string'}, {'name': 'state_party_affiliation', 'type': 'string'}, {'name': 'bill_result', 'type': 'bigint'}, {'name': 'bill_subject_1', 'type': 'string'}, {'name': 'bill_subject_2', 'type': 'string'}, {'name': 'bill_subject_3', 'type': 'string'}, {'name': 'bill_subject_4', 'type': 'string'}, {'name': 'bill_subject_5', 'type': 'string'}, {'name': 'bill_party_affiliation', 'type': 'string'}], 'userModified': False}

ml_dataset_handle = dataiku.Dataset('Challenge_8_AK_AZ_WA_prepared')
ml_dataset_handle.set_preparation_steps(preparation_steps, preparation_output_schema)
%time ml_dataset = ml_dataset_handle.get_dataframe(limit = 100000)

print ('Base data has %i rows and %i columns' % (ml_dataset.shape[0], ml_dataset.shape[1]))
# First five records
ml_dataset.head(5)

#### Initial data management

The preprocessing aims at making the dataset compatible with modeling.
At the end of this step, we will have a matrix of float numbers, with no missing values.
We'll use the features and the preprocessing steps defined in Models.

Let's keep only the selected features:

In [0]:
ml_dataset = ml_dataset[['identifier', 'session', 'classification', 'from_organization.classification', 'bill_result', 'bill_subject_3', 'bill_subject_4', 'bill_subject_5', 'state_party_affiliation', 'bill_subject_1', 'from_organization.name', 'bill_subject_2', 'jurisdiction.name', 'bill_party_affiliation']]

Let's first coerce categorical columns into unicode and numerical features into floats.

In [0]:
# astype('unicode') does not work as expected

def coerce_to_unicode(x):
    if sys.version_info < (3, 0):
        if isinstance(x, str):
            return unicode(x,'utf-8')
        else:
            return unicode(x)
    else:
        return str(x)


categorical_features = ['identifier', 'session', 'classification', 'from_organization.classification', 'bill_subject_3', 'bill_subject_4', 'bill_subject_5', 'state_party_affiliation', 'bill_subject_1', 'from_organization.name', 'bill_subject_2', 'jurisdiction.name', 'bill_party_affiliation']
numerical_features = []
text_features = []
from dataiku.doctor.utils import datetime_to_epoch
for feature in categorical_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in text_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in numerical_features:
    if ml_dataset[feature].dtype == np.dtype('M8[ns]') or (hasattr(ml_dataset[feature].dtype, 'base') and ml_dataset[feature].dtype.base == np.dtype('M8[ns]')):
        ml_dataset[feature] = datetime_to_epoch(ml_dataset[feature])
    else:
        ml_dataset[feature] = ml_dataset[feature].astype('double')

We are now going to handle the target variable and store it in a new variable:

In [0]:
target_map = {'0': 0, '1': 1}
ml_dataset['__target__'] = ml_dataset['bill_result'].map(str).map(target_map)
del ml_dataset['bill_result']


# Remove rows for which the target is unknown.
ml_dataset = ml_dataset[~ml_dataset['__target__'].isnull()]

ml_dataset['__target__'] = ml_dataset['__target__'].astype(np.int64)
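
The target handling above can be illustrated on a small example. This is a minimal sketch on a hypothetical toy column (not the real bill_result data): values are converted to strings, mapped through `target_map`, and any value absent from the map becomes missing and is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({"bill_result": [1, 0, 1, 2, 0]})

target_map = {"0": 0, "1": 1}

# str(1) -> "1" -> 1; the value 2 has no entry in target_map and becomes NaN.
toy["__target__"] = toy["bill_result"].map(str).map(target_map)
del toy["bill_result"]

# Drop rows with an unknown target, then restore an integer dtype.
toy = toy[~toy["__target__"].isnull()].copy()
toy["__target__"] = toy["__target__"].astype(np.int64)

print(toy["__target__"].tolist())
```

Note that this relies on the column holding integers: if the column contained missing values, pandas would store it as floats and `str(1.0)` would give `"1.0"`, which is not a key of `target_map`.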

#### Cross-validation strategy

The dataset needs to be split into two new sets: one used for training the model (train set)
and another used to test its generalization capability (test set).

This is a simple hold-out strategy: 80% of the rows go to the train set and the remaining 20% to the test set.

In [0]:
train, test = pdu.split_train_valid(ml_dataset, prop=0.8)
print ('Train data has %i rows and %i columns' % (train.shape[0], train.shape[1]))
print ('Test data has %i rows and %i columns' % (test.shape[0], test.shape[1]))
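
For readers without access to the Dataiku helper, an 80/20 split can be sketched with plain pandas. This is an assumption about the general idea only: `pdu.split_train_valid` may shuffle or seed differently, and the toy frame and seed below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook splits ml_dataset.
toy = pd.DataFrame({"x": range(10), "__target__": [0, 1] * 5})

# Randomly pick 80% of the rows for training (seed chosen arbitrarily).
toy_train = toy.sample(frac=0.8, random_state=1337)

# Everything not selected for training becomes the test set.
toy_test = toy.drop(toy_train.index)

print(toy_train.shape, toy_test.shape)
```

Because the test rows are obtained by dropping the training index, the two sets never share a row.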

#### Features preprocessing

The first thing to do at the feature level is to handle missing values.
Let's reuse the settings defined in the model:

In [0]:
drop_rows_when_missing = []
impute_when_missing = []

# Features for which we drop rows with missing values
for feature in drop_rows_when_missing:
    train = train[train[feature].notnull()]
    test = test[test[feature].notnull()]
    print ('Dropped missing records in %s' % feature)

# Features for which we impute missing values
for feature in impute_when_missing:
    if feature['impute_with'] == 'MEAN':
        v = train[feature['feature']].mean()
    elif feature['impute_with'] == 'MEDIAN':
        v = train[feature['feature']].median()
    elif feature['impute_with'] == 'CREATE_CATEGORY':
        v = 'NULL_CATEGORY'
    elif feature['impute_with'] == 'MODE':
        v = train[feature['feature']].value_counts().index[0]
    elif feature['impute_with'] == 'CONSTANT':
        v = feature['value']
    train[feature['feature']] = train[feature['feature']].fillna(v)
    test[feature['feature']] = test[feature['feature']].fillna(v)
    print ('Imputed missing values in feature %s with value %s' % (feature['feature'], coerce_to_unicode(v)))
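
Both settings lists are empty for this model, so the loops above do nothing here. To show what the imputation branch would do, here is a minimal sketch on a hypothetical numerical column named `amount` (not a feature of this dataset): the fill value is computed on the train set only and then applied to both sets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "amount" is not a feature of the real dataset.
toy_train = pd.DataFrame({"amount": [1.0, 3.0, np.nan, 5.0]})
toy_test = pd.DataFrame({"amount": [np.nan, 10.0]})

# Equivalent of impute_with == 'MEDIAN': statistic taken from train only.
fill_value = toy_train["amount"].median()

toy_train["amount"] = toy_train["amount"].fillna(fill_value)
toy_test["amount"] = toy_test["amount"].fillna(fill_value)

print(fill_value, toy_test["amount"].tolist())
```

Computing the statistic on the train set alone keeps information from the test set out of the preprocessing.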

We can now handle the categorical features (still using the settings defined in Models):

Let's dummy-encode the following features.
A binary column is created for each of the 100 most frequent values.

In [0]:
LIMIT_DUMMIES = 100

categorical_to_dummy_encode = ['identifier', 'session', 'classification', 'from_organization.classification', 'bill_subject_3', 'bill_subject_4', 'bill_subject_5', 'state_party_affiliation', 'bill_subject_1', 'from_organization.name', 'bill_subject_2', 'jurisdiction.name', 'bill_party_affiliation']

# Only keep the top 100 values
def select_dummy_values(train, features):
    dummy_values = {}
    # Iterate over the features passed in rather than the global list
    for feature in features:
        values = [
            value
            for (value, _) in Counter(train[feature]).most_common(LIMIT_DUMMIES)
        ]
        dummy_values[feature] = values
    return dummy_values

DUMMY_VALUES = select_dummy_values(train, categorical_to_dummy_encode)

def dummy_encode_dataframe(df):
    for (feature, dummy_values) in DUMMY_VALUES.items():
        for dummy_value in dummy_values:
            dummy_name = u'%s_value_%s' % (feature, coerce_to_unicode(dummy_value))
            df[dummy_name] = (df[feature] == dummy_value).astype(float)
        del df[feature]
        print ('Dummy-encoded feature %s' % feature)

dummy_encode_dataframe(train)

dummy_encode_dataframe(test)

# Rescaling is not required
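
The dummy-encoding step can be checked on a small example. This is a sketch with a hypothetical `party` column and a limit of 2 instead of 100; the helper name `encode` is illustrative and, unlike the notebook's function, returns a new frame instead of modifying its input.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data; "party" is an illustrative column name.
toy_train = pd.DataFrame({"party": ["D", "R", "D", "I", "D", "R"]})
toy_test = pd.DataFrame({"party": ["R", "G", "D"]})

# Keep the 2 most frequent training values (the notebook keeps 100).
limit = 2
top_values = [v for v, _ in Counter(toy_train["party"]).most_common(limit)]

def encode(df, feature, values):
    """Return a copy of df with one 0/1 column per kept value."""
    out = df.copy()
    for v in values:
        out["%s_value_%s" % (feature, v)] = (out[feature] == v).astype(float)
    return out.drop(columns=[feature])

encoded_test = encode(toy_test, "party", top_values)
print(encoded_test)
```

A test value that was not among the kept training values (here `G`) gets 0 in every dummy column, so train and test always end up with the same columns.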

#### Modeling

Before actually creating our model, we need to split the datasets into their features and labels parts:

In [0]:
X_train = train.drop('__target__', axis=1)
X_test = test.drop('__target__', axis=1)

y_train = np.array(train['__target__'])
y_test = np.array(test['__target__'])

Now we can finally create our model!

In [0]:
from lightgbm import LGBMClassifier
clf = LGBMClassifier(
                    boosting_type='gbdt',
                    num_leaves=242,
                    max_depth=-1,
                    learning_rate=0.29498492534883913,
                    n_estimators=19,
                    subsample_for_bin=200000,
                    min_split_gain=0.9150254046173107,
                    min_child_weight=0.04918686346291431,
                    min_child_samples=3,
                    subsample=0.75,
                    subsample_freq=2,
                    colsample_bytree=0.6107256674465726,
                    reg_alpha=0.1430428482327608,
                    reg_lambda=0.9492866548689445,
                    random_state=1337,
                    n_jobs=4
                  )

We set "class_weight" as the weighting strategy:

In [0]:
clf.class_weight = "balanced"
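
To give an intuition for the "balanced" setting: per the LightGBM and scikit-learn documentation, each class is weighted inversely to its frequency, as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a hypothetical label vector; it illustrates the formula and does not call LightGBM itself.

```python
import numpy as np

# Hypothetical imbalanced labels: six negatives, two positives.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

n_classes = len(np.unique(y))

# Weight of each class, indexed by class label.
weights = len(y) / (n_classes * np.bincount(y))

print(weights)
```

The rarer class receives the larger weight, so errors on it cost more during training.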

...and train the model:

In [0]:
%time clf.fit(X_train, y_train)

#### Build up our result dataset

The model is now trained, we can apply it to our test set:

In [0]:
%time _predictions = clf.predict(X_test)
%time _probas = clf.predict_proba(X_test)
predictions = pd.Series(data=_predictions, index=X_test.index, name='predicted_value')
cols = [
    u'probability_of_value_%s' % label
    for (_, label) in sorted([(int(target_map[label]), label) for label in target_map])
]
probabilities = pd.DataFrame(data=_probas, index=X_test.index, columns=cols)

# Build scored dataset
results_test = X_test.join(predictions, how='left')
results_test = results_test.join(probabilities, how='left')
results_test = results_test.join(test['__target__'], how='left')
results_test = results_test.rename(columns= {'__target__': 'bill_result'})
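
The joins above align everything on the row index of the test set. A minimal sketch with hypothetical arrays standing in for the model outputs shows why the predictions are first wrapped in index-aware pandas objects:

```python
import numpy as np
import pandas as pd

# Hypothetical test features with a non-sequential index, as after a split.
toy_X = pd.DataFrame({"f": [0.0, 1.0, 1.0]}, index=[10, 42, 7])

# Hypothetical model outputs (plain arrays carry no index).
toy_pred = np.array([0, 1, 1])
toy_proba = np.array([[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])

# Attach the test index so each prediction lines up with its row.
pred = pd.Series(toy_pred, index=toy_X.index, name="predicted_value")
proba = pd.DataFrame(
    toy_proba,
    index=toy_X.index,
    columns=["probability_of_value_0", "probability_of_value_1"],
)

scored = toy_X.join(pred, how="left").join(proba, how="left")
print(scored)
```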

#### Results

You can measure the model's performance with the area under the ROC curve (AUC):

In [0]:
from dataiku.doctor.utils.metrics import mroc_auc_score
y_test_ser = pd.Series(y_test)
 
print ('AUC value:', mroc_auc_score(y_test_ser, _probas))
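
As a reminder of what the AUC measures, here is a minimal NumPy sketch of its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `pairwise_auc` is hypothetical and this is not how `mroc_auc_score` is implemented; it is only meant for small examples, since it compares every pair.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the share of correctly ordered (positive, negative) pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Matrix of score differences for every positive/negative pair.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Three of the four pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

For the binary case in this notebook, the scores would be the second column of `_probas`, that is, the predicted probability of class 1.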

We can also view the predictions directly.
Since the classifier only predicts numerical values, the labels have been mapped to integers (0 and 1 here).
We need to 'reverse' the mapping to display the initial labels.

In [0]:
inv_map = { target_map[label] : label for label in target_map}
predictions.map(inv_map)

That's it. It's now up to you to tune your preprocessing, your algorithm, and your analysis!
