# Bankruptcy Prediction with LightGBM Classifier


## Introduction of LightGBM
[LightGBM](https://github.com/Microsoft/LightGBM) is an open-source, distributed, high-performance gradient boosting framework with following advantages: 
-   Composability: LightGBM models can be incorporated into existing
    SparkML Pipelines, and used for batch, streaming, and serving
    workloads.
-   Performance: LightGBM on Spark is 10-30% faster than SparkML on
    the Higgs dataset, and achieves a 15% increase in AUC.  [Parallel
    experiments](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#parallel-experiment)
    have verified that LightGBM can achieve a linear speed-up by using
    multiple machines for training in specific settings.
-   Functionality: LightGBM offers a wide array of [tunable
    parameters](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst),
    that one can use to customize their decision tree system. LightGBM on
    Spark also supports new types of problems such as quantile regression.
-   Cross platform：LightGBM on Spark is available on Spark (Scala) and PySpark (Python).



<img src="https://mmlspark.blob.core.windows.net/graphics/Documentation/bankruptcy image.png" width="800" style="float: center;"/>

In this example, we use LightGBM to build a classification model in order to predict bankruptcy.

## Read dataset

Get a sample data of financial statements for 6819 companies, 220 represents bankrupted companies while 6599 firms are not bankrupted. 

In [19]:
dataset = spark.read.format("csv")\
  .option("header", True)\
  .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv")

StatementMeta(SampleSpark1, 1, 19, Finished, Available)



## Exploratory data

Look at the data and evaluate its suitability for use in a model.

In [20]:
display(dataset.head(5))

StatementMeta(SampleSpark1, 1, 20, Finished, Available)

SynapseWidget(Synapse.DataFrame, dadc9bb6-02bb-43b0-84e7-4ea5791f1c0d)

In [21]:
# print dataset size
print("Total number of records: " + str(dataset.count()))

StatementMeta(SampleSpark1, 1, 21, Finished, Available)

Total number of records: 6819

In [22]:
# convert features to double type
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
for colName in dataset.columns:
  dataset = dataset.withColumn(colName, col(colName).cast(DoubleType()))
print("Schema: ")
dataset.printSchema()

StatementMeta(SampleSpark1, 1, 22, Finished, Available)

Schema: 
root
 |-- Bankrupt?: double (nullable = true)
 |--  ROA(C) before interest and depreciation before interest: double (nullable = true)
 |--  ROA(A) before interest and % after tax: double (nullable = true)
 |--  ROA(B) before interest and depreciation after tax: double (nullable = true)
 |--  Operating Gross Margin: double (nullable = true)
 |--  Realized Sales Gross Margin: double (nullable = true)
 |--  Operating Profit Rate: double (nullable = true)
 |--  Pre-tax net Interest Rate: double (nullable = true)
 |--  After-tax net Interest Rate: double (nullable = true)
 |--  Non-industry income and expenditure/revenue: double (nullable = true)
 |--  Continuous interest rate (after tax): double (nullable = true)
 |--  Operating Expense Rate: double (nullable = true)
 |--  Research and development expense rate: double (nullable = true)
 |--  Cash flow rate: double (nullable = true)
 |--  Interest-bearing debt interest rate: double (nullable = true)
 |--  Tax rate (A): double (null

## Generation of testing and training data sets

Simple split, 85% for training and 15% for testing the model. Playing with this ratio may result in different models.


In [23]:
# Split the dataset into train and test

train, test = dataset.randomSplit([0.70, 0.30], seed=1)

# Add featurizer to convert features to vector

from pyspark.ml.feature import VectorAssembler
feature_cols = dataset.columns[1:]
featurizer = VectorAssembler(
    inputCols=feature_cols,
    outputCol='features'
)
train_data = featurizer.transform(train)['Bankrupt?', 'features']
test_data = featurizer.transform(test)['Bankrupt?', 'features']

StatementMeta(SampleSpark1, 1, 23, Finished, Available)



In [24]:
# check if the data is unbalanced
train_data.groupBy("Bankrupt?").count().show()

StatementMeta(SampleSpark1, 1, 24, Finished, Available)

+---------+-----+
|Bankrupt?|count|
+---------+-----+
|      0.0| 4605|
|      1.0|  154|
+---------+-----+

## Train the model
Train the Classifier model.

In [25]:
from mmlspark.lightgbm import LightGBMClassifier

model = LightGBMClassifier(objective="binary", featuresCol="features", labelCol="Bankrupt?", isUnbalance=True)
model = model.fit(train_data)

StatementMeta(SampleSpark1, 1, 25, Finished, Available)



In [26]:
from mmlspark.lightgbm import LightGBMClassificationModel
model.saveNativeModel("/lgbmcmodel")
model = LightGBMClassificationModel.loadNativeModelFromFile("/lgbmcmodel")

StatementMeta(SampleSpark1, 1, 26, Finished, Available)



In [27]:
print(model.getFeatureImportances())

StatementMeta(SampleSpark1, 1, 27, Finished, Available)

[51.0, 24.0, 41.0, 32.0, 36.0, 5.0, 9.0, 3.0, 13.0, 5.0, 13.0, 19.0, 22.0, 19.0, 10.0, 53.0, 5.0, 6.0, 14.0, 21.0, 20.0, 15.0, 11.0, 13.0, 15.0, 21.0, 13.0, 8.0, 45.0, 13.0, 18.0, 16.0, 15.0, 24.0, 28.0, 22.0, 10.0, 19.0, 10.0, 31.0, 17.0, 15.0, 4.0, 25.0, 17.0, 21.0, 46.0, 19.0, 43.0, 46.0, 22.0, 41.0, 40.0, 40.0, 21.0, 24.0, 20.0, 32.0, 20.0, 25.0, 17.0, 38.0, 34.0, 18.0, 27.0, 17.0, 28.0, 19.0, 14.0, 74.0, 13.0, 28.0, 4.0, 29.0, 3.0, 19.0, 0.0, 13.0, 22.0, 63.0, 21.0, 24.0, 20.0, 19.0, 0.0, 14.0, 34.0, 12.0, 7.0, 37.0, 12.0, 38.0, 23.0, 0.0, 11.0]

## Model Performance Evaluation

After training the model, we evaluate the performance of the model using the test set.

In [28]:
predictions = model.transform(test_data)
#predictions.limit(10).toPandas()

StatementMeta(SampleSpark1, 1, 28, Finished, Available)



In [30]:
from mmlspark.train import ComputeModelStatistics

# Compute model performance metrics
metrics = ComputeModelStatistics(evaluationMetric="classification", 
                                 labelCol="prediction", 
                                 scoredLabelsCol="Bankrupt?").transform(predictions)
metrics.toPandas()

StatementMeta(SampleSpark1, 1, 30, Finished, Available)

  evaluation_type  ...       AUC
0  Classification  ...  0.574822

[1 rows x 6 columns]

## Clean up resources
To ensure the Spark instance is shut down, end any connected sessions(notebooks). The pool shuts down when the **idle time** specified in the Apache Spark pool is reached. You can also select **stop session** from the status bar at the upper right of the notebook.

![stopsession](https://adsnotebookrelease.blob.core.windows.net/adsnotebookrelease/adsnotebook/image/stopsession.png)

## Next steps

* [Check out Synapse sample notebooks](https://github.com/Azure-Samples/Synapse/tree/main/MachineLearning) 
* [MMLSpark GitHub Repo](https://github.com/Azure/mmlspark)