fbargaglistoffi/machine-learning-firm-dynamics

Supervised Learning for the Prediction of Firm Dynamics

This repository contains the additional material for the Chapter on "Supervised learning for the prediction of firm dynamics" by F.J. Bargagli-Stoffi, J. Niederreiter and M. Riccaboni in the book "Data Science for Economics and Finance: Methodologies and Applications" by S. Consoli, D. Reforgiato Recupero, M. Saisana.

In the first section of this repository we provide a step-by-step guide that helps readers who are new to machine learning design a supervised learning routine; in the second section we provide further details on the main algorithms used for prediction tasks at different stages of the company life cycle, together with simple examples of their implementation in R.

Here we show how to implement the supervised learning routine to predict firms' bankruptcy on a dataset of Italian firms' financial accounts. The dataset is a small, random sample of real firm level data used by F.J. Bargagli-Stoffi, M. Riccaboni and A. Rungi for the main analysis of the paper "Machine learning for zombie hunting. Firms' failures, financial constraints, and misallocation". For more details on the predictors, we refer the reader to the original paper.

1. A Simple Supervised Learning Routine

This simple step-by-step guide should aid the reader in designing a supervised learning (SL) routine to predict outcomes from input data.

  1. Check that information on the outcome of interest is available for the observations that are later used to train and test the SL algorithm, i.e. that the data set is labeled.

  2. Convert the matrix of input attributes into a machine-readable format.

  3. Choose how to split your data between training and testing set. Keep in mind that both sets have to remain sufficiently large to train the algorithm and to validate its performance, respectively. Use resampling techniques when the data set is small and stratified sampling whenever labels are highly unbalanced. If the data has a time dimension, make sure that the training set is formed by observations that occurred before the ones in the testing set.

  4. Choose the SL algorithm that best suits your needs. Possible dimensions to evaluate are prediction performance, simplicity of result interpretation and CPU runtime. Often a horse race between many algorithms is performed and the one with the highest prediction performance is chosen. Many algorithms are available "off the shelf" - consult this page for a comprehensive review of the main packages for machine learning in R.

  5. Train the algorithm using the training set only. If hyper-parameters of the algorithm need to be set, choose them using cross-validation on the training set or, better, set aside part of the training set exclusively for hyper-parameter tuning. Do not use the testing set until the algorithms are fully specified.

  6. Once the algorithm is trained, use it to predict the outcome on the testing set. Compare the predicted outcomes with the true outcomes.

  7. Choose the performance measure on which to evaluate the algorithm(s). Popular performance measures are Accuracy and the Area Under the Receiver Operating Characteristic Curve (AUC). If your data set is unbalanced, choose performance measures that are sensitive to class imbalance, such as Balanced Accuracy or the F-score.

  8. Once prediction performance has been assessed, the algorithm can be used to predict outcomes for observations for which the outcome is unknown. Note that valid predictions require that new observations contain similar features and are independent of the outcomes of the old ones.
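
The splitting and evaluation steps of the routine can be sketched in base R as follows. This is a minimal illustration on simulated data, not the routine used in the chapter: the data frame `firms`, its columns and the helper functions `stratified_split` and `class_metrics` are hypothetical names introduced here, and a logistic regression stands in for whichever SL algorithm is chosen.

```r
set.seed(1)

# Simulated, unbalanced data (hypothetical): only a minority of firms fail
n <- 1000
firms <- data.frame(leverage = rnorm(n), liquidity = rnorm(n))
p_fail <- plogis(-3 + 1.5 * firms$leverage - 1.0 * firms$liquidity)
firms$failed <- factor(rbinom(n, 1, p_fail), levels = c(0, 1))

# Stratified split: draw the same share of rows within each class
stratified_split <- function(y, train_share = 0.7) {
  idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = round(train_share * length(rows)))]
  }))
  sort(unname(idx))
}

train_idx <- stratified_split(firms$failed)
train <- firms[train_idx, ]
test  <- firms[-train_idx, ]

# Train on the training set only
fit <- glm(failed ~ leverage + liquidity, data = train, family = binomial)

# Predict on the testing set (0.5 probability threshold)
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))

# Performance measures computed from the confusion table
class_metrics <- function(true, pred, positive = "1") {
  tab <- table(true = true, pred = pred)
  negative <- setdiff(levels(true), positive)
  tp <- tab[positive, positive]; fn <- tab[positive, negative]
  tn <- tab[negative, negative]; fp <- tab[negative, positive]
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  c(accuracy = (tp + tn) / sum(tab),
    balanced_accuracy = (sensitivity + specificity) / 2,
    f_score = 2 * tp / (2 * tp + fp + fn))
}

metrics <- class_metrics(test$failed, pred)
print(round(metrics, 3))
```

On unbalanced data such as this, Accuracy is typically much higher than Balanced Accuracy or the F-score, which is why the latter two are the more informative measures here.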

2. Supervised Learning Algorithms

2.1 Decision Trees

Description

Decision trees commonly consist of a sequence of binary decision rules (nodes) on which the tree splits into branches (edges). At each final branch (leaf node) a decision regarding the outcome is estimated. The sequence of decision rules and the location of each cut-off point are chosen by minimizing a measure of node impurity (e.g., Gini index or entropy for classification tasks, mean squared error for regression tasks). Decision trees are easy to interpret but sensitive to changes in the feature space, which frequently lowers their out-of-sample performance (see Breiman 2017 for a detailed introduction).
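
To make the choice of a cut-off point concrete, the sketch below evaluates every candidate cut-off on a single predictor by the weighted Gini index of the two resulting child nodes and picks the lowest. The toy data and the helper functions `gini` and `split_impurity` are hypothetical and only illustrate the idea; tree packages implement this search far more efficiently.

```r
# Gini index of a set of class labels: 1 - sum_k p_k^2
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini index of the two child nodes produced by the rule x < cutoff
split_impurity <- function(x, labels, cutoff) {
  left  <- labels[x < cutoff]
  right <- labels[x >= cutoff]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Toy data (hypothetical): leverage ratio and failure status of eight firms
leverage <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
failed   <- c(0,   0,   0,   0,   1,   0,   1,   1)

# Candidate cut-offs: midpoints between consecutive sorted values
sorted_x <- sort(leverage)
cutoffs  <- head(sorted_x, -1) + diff(sorted_x) / 2
impurity <- sapply(cutoffs, function(cut) split_impurity(leverage, failed, cut))

best_cutoff <- cutoffs[which.min(impurity)]
print(data.frame(cutoff = cutoffs, impurity = round(impurity, 4)))
print(best_cutoff)  # 0.55, with a weighted Gini index of 0.1875
```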

Example usage in R

We focus on the function rpart in the R package rpart. The documentation can be found here.

  • formula: a formula describing the model to be fitted (e.g. outcome ~ predictor1 + predictor2 + etc.);
  • data: specifies the data frame;
  • method: "class" for a classification tree, "anova" for a regression tree;
  • control: optional parameters for controlling tree growth. For example, control=rpart.control(minsplit=30, cp=0.001) requires that the minimum number of observations in a node be 30 before attempting a split and that a split must decrease the overall lack of fit by a factor of 0.001 (cost complexity factor) before being attempted.

# Decision tree with rpart
install.packages(c("rpart", "pROC")) # if not already installed
library(rpart)
library(pROC)

# Grow the tree ("outcome" is the label column of train_data, "." uses all other columns as predictors)
dt <- rpart(outcome ~ ., data = train_data, method = "class", control = rpart.control(minsplit = 30, cp = 0.001))

printcp(dt) # display the results
plotcp(dt) # visualize cross-validation results
summary(dt) # detailed summary of splits

# Plot tree
plot(dt, uniform = TRUE, main = "Classification Tree")
text(dt, use.n = TRUE, all = TRUE, cex = .8)

# Create attractive postscript plot of tree
post(dt, file = "c:/tree.ps", title = "Classification Tree")

# Get predicted classes and predicted probabilities of the second class on the testing set
dt.pred <- predict(dt, newdata = test, type = "class")
dt.prob <- predict(dt, newdata = test, type = "prob")[, 2]

# Generate table that compares true outcomes of the testing set with predicted outcomes of the decision tree
dt_tab <- table(true = testoutcome, pred = dt.pred)
# Generate ROC object based on predicted probabilities in testing set
dt_roc <- roc(response = testoutcome, predictor = dt.prob)
# Calculate AUC value of predictions in testing set
dt_auc <- pROC::auc(dt_roc)
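
A common follow-up to inspecting the cross-validation results with printcp and plotcp is to prune the tree back to the complexity parameter with the lowest cross-validated error. The sketch below shows one way to do this; the kyphosis data set that ships with rpart stands in for the firm-level data (an assumption made only to keep the example runnable), and picking the minimum of xerror is just one possible selection rule.

```r
library(rpart)

set.seed(42)  # the cross-validation performed inside rpart is random

# kyphosis ships with rpart: Kyphosis (absent/present) ~ Age, Number, Start
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 10, cp = 0.001))

# The cp table holds one row per candidate subtree, with its
# cross-validated error in the column "xerror"
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the full tree back to the subtree selected by cross-validation
fit_pruned <- prune(fit, cp = best_cp)

printcp(fit_pruned)
```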

2.2 Random Forest

Description

Instead of estimating just one decision tree (DT), random forest resamples the training set observations to estimate multiple trees. For each tree, at each node, a sample of m predictors is chosen randomly from the feature space. To obtain the final prediction, the outcomes of all trees are averaged or, in classification tasks, chosen by majority vote (see also the original contribution of Breiman, 2001).
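
The resampling and voting logic can be sketched with a handful of rpart trees. This is a simplified illustration and not the random forest algorithm itself: the predictors are drawn once per tree rather than at each node, and the iris data set stands in for the firm-level data. The object names are hypothetical.

```r
library(rpart)

set.seed(7)

# iris stands in for the firm-level data (assumption)
test_idx <- sample(nrow(iris), 45)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

predictors <- setdiff(names(iris), "Species")
n_trees <- 25  # number of trees in the ensemble
m <- 2         # number of predictors drawn for each tree

# Fit one tree per bootstrap sample, each on a random subset of predictors
trees <- lapply(seq_len(n_trees), function(b) {
  boot_rows <- sample(nrow(train), replace = TRUE)
  vars <- sample(predictors, m)
  rpart(reformulate(vars, response = "Species"),
        data = train[boot_rows, ], method = "class")
})

# Collect the class predicted by every tree for every test observation
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = test, type = "class"))
})

# Majority vote across trees (ties go to the first class in table order)
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

ensemble_tab <- table(true = test$Species, pred = ensemble_pred)
print(ensemble_tab)
```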

Example usage in R

We focus on the function randomForest in the R package randomForest. The documentation can be found here.

A selection of the inputs the function takes:

  • x: the feature matrix of the training set (NxP);
  • y: the outcome variable of the training set (Nx1);
  • xtest: (optional) the feature matrix of the testing set (MxP);
  • ytest: (optional) the outcome variable of the testing set (Mx1);
  • mtry: (optional) number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(P), where P is the number of features) and regression (P/3);
  • ntree: (optional) number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times;
  • importance: (optional) Should importance of predictors be assessed? ;
  • keep.forest: (optional) Should the forest be stored in the object (for later prediction tasks)? ;
  • reproducibility: randomForest has no seed argument; call set.seed() with an arbitrary numerical value before training to make the RF estimation result reproducible.

The randomForest function returns an object which is a list containing information such as: the predicted values of the testing set in $test$predicted, importance measures in $importance and the entire forest in $forest if keep.forest=TRUE.

# Random forest with the randomForest package
install.packages(c("randomForest", "pROC")) # if not already installed
library("randomForest")
library("pROC")

# Train the random forest (set.seed makes the result reproducible)
set.seed(34)
obj_rf <- randomForest(x = trainfeatures, y = trainoutcome, xtest = testfeatures, ytest = testoutcome, mtry = 8, ntree = 500, importance = TRUE, keep.forest = FALSE)

# Generate table that compares true outcomes of the testing set with predicted outcomes of random forest
rf_tab <- table(true = testoutcome, pred = obj_rf$test$predicted)
# Generate ROC object based on the share of votes for the second class in testing set
rf_roc <- roc(response = testoutcome, predictor = obj_rf$test$votes[, 2])
# Calculate AUC value of predictions in testing set
rf_auc <- pROC::auc(rf_roc)
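
To clarify what the AUC measures, it can be computed by hand as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below uses the rank-based formula for this probability; the function `auc_rank` and the toy scores are hypothetical, and the code assumes that higher scores indicate the positive class.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count one half)
auc_rank <- function(true, score, positive = 1) {
  is_pos <- true == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(score)  # average ranks handle tied scores
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (hypothetical scores): 3 failed firms, 4 surviving firms
true  <- c(1, 1, 1, 0, 0, 0, 0)
score <- c(0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4)

auc_value <- auc_rank(true, score)
print(auc_value)  # 10 of the 12 positive-negative pairs are ordered correctly: 0.8333
```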

2.3 Support Vector Machines

Description

Support vector machine (SVM) algorithms estimate a hyperplane over the feature space to classify observations. The observations that lie closest to the hyperplane and determine its position are called support vectors. The hyperplane is chosen such that the distance (called margin) between the hyperplane and the closest data points, as well as the prediction accuracy, is maximized (see also Steinwart 2008).
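
The geometry can be illustrated with a hand-picked hyperplane in two dimensions. In the sketch below, the weights `w`, the intercept `b` and the toy observations are all hypothetical; an actual SVM would estimate `w` and `b` by solving an optimization problem rather than taking them as given.

```r
# A separating hyperplane in two dimensions: w[1]*x1 + w[2]*x2 + b = 0
# (w and b are hand-picked for illustration, not estimated)
w <- c(1, 1)
b <- -3

# Toy observations (hypothetical) with labels in {-1, +1}
X <- rbind(c(1, 1), c(0, 2), c(1, 0.5), c(2, 2), c(3, 1), c(2.5, 2.5))
y <- c(-1, -1, -1, 1, 1, 1)

# Decision value f(x) = w'x + b; the predicted class is its sign
f <- as.vector(X %*% w + b)
pred <- sign(f)

# Geometric distance of every observation to the hyperplane
dist_to_plane <- abs(f) / sqrt(sum(w^2))

# The observations closest to the hyperplane determine the margin
closest <- which(abs(dist_to_plane - min(dist_to_plane)) < 1e-12)

print(data.frame(X, y, f, pred, dist = round(dist_to_plane, 3)))
print(closest)  # observations 1, 2, 4 and 5 lie on the margin
```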

Example usage in R

We focus on the function svm in the R package e1071. The documentation can be found here.

  • formula: a formula describing the model to be fitted (e.g. outcome ~ predictor1 + predictor2 + etc.);
  • data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which ‘svm’ is called from;
  • scale: a logical vector indicating the variables to be scaled.
  • type: svm can be used as a classification machine, as a regression machine, or for novelty detection;
  • kernel: the kernel used in training and predicting. You might consider changing some of the kernel-specific parameters, depending on the kernel type.

# Support vector machine with the e1071 package
install.packages(c("e1071", "pROC")) # if not already installed
library("e1071")
library("pROC")

# Train the support vector machine (formula is a placeholder, e.g. outcome ~ predictor1 + predictor2, with a factor outcome)
obj_svm <- svm(formula, data = train)

# Predicted outcomes of the support vector machine, keeping the decision values as a numeric score
svm.pred <- predict(obj_svm, newdata = test, decision.values = TRUE)
svm.score <- attr(svm.pred, "decision.values")[, 1]

# Generate table that compares true outcomes of the testing set with predicted outcomes of the support vector machine
svm_tab <- table(true = testoutcome, pred = svm.pred)
# Generate ROC object based on decision values in testing set
svm_roc <- roc(response = testoutcome, predictor = svm.score)
# Calculate AUC value of predictions in testing set
svm_auc <- pROC::auc(svm_roc)

2.4 Artificial Neural Network

Description

Inspired by biological networks, every (deep) artificial neural network (ANN) consists of at least three layers: an input layer containing feature information, at least one hidden layer (deep ANNs are ANNs with more than one hidden layer), and an output layer returning the predicted values. Each layer consists of nodes (neurons) that are connected via edges across layers. During the learning process, edges that are more important are reinforced. Neurons may then only send a signal if the signal received is strong enough (see for example Hassoun 2016).
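
How a signal travels from the input layer through one hidden layer to the output layer can be sketched as a forward pass in base R. The weights below are random and hypothetical (training would adjust them), and the sigmoid activation is only one possible choice, a smooth version of the idea that a neuron passes on a signal only when its input is strong enough.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: input layer -> one hidden layer -> output layer
forward <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(W1 %*% x + b1)       # activations of the hidden neurons
  output <- sigmoid(W2 %*% hidden + b2)  # predicted probability
  list(hidden = as.vector(hidden), output = as.vector(output))
}

set.seed(3)
n_inputs <- 3
n_hidden <- 2

# Randomly initialized weights (hypothetical; training would adjust them)
W1 <- matrix(runif(n_hidden * n_inputs, -0.5, 0.5), nrow = n_hidden)
b1 <- rep(0, n_hidden)
W2 <- matrix(runif(n_hidden, -0.5, 0.5), nrow = 1)
b2 <- 0

x <- c(0.2, -1.0, 0.5)  # one observation with three features
result <- forward(x, W1, b1, W2, b2)
print(result)
```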

Example usage in R

We focus on the function nnet in the R package nnet. The documentation can be found here.

  • formula: a formula describing the model to be fitted (e.g. outcome ~ predictor1 + predictor2 + etc.);
  • data: specifies the data frame;
  • weights: weights for each example -- if missing defaults to 1;
  • size: number of units in the hidden layer. Can be zero if there are skip-layer units;
  • rang: initial random weights on [-rang, rang]. Value about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1;
  • decay: parameter for weight decay (default is 0);
  • maxit: maximum number of iterations (default is 100).

# Neural network with the nnet package
install.packages("nnet") # if not already installed
library(nnet)

# Train the neural network (formula is a placeholder, e.g. outcome ~ predictor1 + predictor2, with a factor outcome)
nnet_fit <- nnet(formula, data = train, size = 2, rang = 0.1, decay = 5e-4, maxit = 200)

# Predict classes on the testing set
nnet.pred <- predict(nnet_fit, newdata = test, type = "class")

# Generate table that compares true outcomes of the testing set with predicted outcomes of the neural network
nnet_tab <- table(true = testoutcome, pred = nnet.pred)
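
For a version that runs end to end, the sketch below applies the same nnet call to the iris data set, which stands in for the firm-level data (an assumption made only to keep the example self-contained). The split into 105 training and 45 testing observations is arbitrary.

```r
library(nnet)

set.seed(11)

# iris stands in for the firm-level data (assumption)
test_idx <- sample(nrow(iris), 45)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# One hidden layer with two units; trace = FALSE silences the iteration log
fit <- nnet(Species ~ ., data = train, size = 2, rang = 0.1,
            decay = 5e-4, maxit = 200, trace = FALSE)

# Predicted classes and confusion table on the testing set
pred <- predict(fit, newdata = test, type = "class")
tab <- table(true = test$Species,
             pred = factor(pred, levels = levels(iris$Species)))
print(tab)
```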
