# Executive Briefing

![](banner_exec2.jpg)

## Business Analytics

In a business situation, you can make a guess about what happened in the past, is happening now, or will happen in the future.  You can then decide what to do based on your guess.  You will consequently get a business result that comes about from your decision.  You might assume that when your guesses are based on patterns you see in some relevant data, then your guesses will be better, and so your decisions will be better, and so the business results will often be better.  Challenges arise, however, when the patterns in data are difficult to detect.

**Business analytics** combines statistics and computing to find patterns in data that are otherwise not easy to detect, so that you can make good guesses, so that you can make good decisions, so that you will often get good business results.  Use business analytics if you believe in the assumption that leveraging difficult-to-detect patterns in data often leads to better business results.

Business analytics enables you to find and quantify answers to questions like these:
* What are the patterns in the data?
* What guess can be made from these patterns?
* What are the probabilities that the guess is right or wrong, and what are the probabilities that the guess is wrong in any of several different ways?
* What decision best leverages the guess and the probabilities of how the guess could be wrong?
* What is the probability that a decision leads to a good business result, and what are the probabilities that it leads to any of several other different business results?

<br>

_Consider a business manager in charge of, say, a manufacturing process.  This business manager must decide how many workers to schedule based on a guess about what the demand for the company’s product this month will be.  Demand three months ago was 1 million units, two months ago was 2 million units, and last month was 4 million units.  Here are three ways the business manager could make a decision:_

* _**Decision (not based on data):**  Our business manager decides to keep the current number of workers, which is enough to manufacture 4 million units._
* _**Data-Driven Decision:**  One pattern easily seen in these data is that demand doubles every month.  So, our business manager guesses demand this month will be 8 million units.  From there, our business manager decides to schedule enough workers to manufacture 8 million units._
* _**Data-Driven Decision (using business analytics):**  Our business manager gathers much richer data – many months of data about lots of things in addition to just recent demand.  Patterns in the data are difficult to detect, but our business manager applies various methods that do find them.  Based on these patterns, our business manager guesses the demand this month will be 6 million units, but further knows there is a 30% probability it could be higher and a 10% probability it could be lower, and that the consequences of understaffing could cost the company up to \\$1 million in lost opportunity and overstaffing could cost the company more than \\$2 million in unnecessary worker costs.  Our business manager considers these and other factors, and decides to schedule enough workers to manufacture 7 million units._

_In all three cases, the business result will depend on what the demand actually turns out to be, but we assume that the data-driven decision using business analytics most likely leads to the best business result._

## Business Analytics in Practice

### Decision Models, Business Models, & Business Parameters

A **decision model** estimates the probabilities of various business results deriving from various business decisions given a business model, business parameter values, and cluster models or predictive models used to inform the business decisions.

A **business model** describes how a business converts products and services to money.

**Business parameters** detail various aspects of a business's operating environment.

An **influence diagram** illustrates relationships within a decision model, useful for communicating any assumptions upon which it is based.
<br><br>
<img src="decision_model.jpg" align="left" width="440"><br clear="all">
<br>

### Data Analytic Models

**Artificial intelligence (AI)** is about machines doing things that you would normally expect only humans could do.  A machine does this by following a formula developed by humans – that’s a **rule-based system**; or developed by the machine itself from patterns it finds in data – that’s a **data analytic model**.  The process of constructing a data analytic model is called **machine learning (ML)**, and relies on a variety of **methods**.

One kind of data analytic model specifies how to partition data into clusters - that's a **cluster model**. It could be constructed (also referred to as trained) using an **unsupervised method**, so called because unsupervised methods do not make use of examples of correct partitions.

Another kind of data analytic model specifies how to predict as yet unobserved examples not explicitly reflected in data – that's a **predictive model**.  It could be constructed (also referred to as trained) using a **supervised method**, so called because supervised methods make use of examples of correct predictions. A predictive model that predicts categorical values is called a **classifier**, the categorical values it predicts are called **classes**, and a method to construct it is called a **classifier construction method**.  A predictive model that predicts numeric values is called a **regressor**, the numeric values it predicts are called **outcomes**, and a method to construct it is called a **regressor construction method**.

Another kind of data analytic model specifies connection relationships between entities - that's a **social network model**.

A method comprising a combination of several other methods working in concert is called an **ensemble method**.

Any particular method is defined by its general approach and its **hyper-parameter** values.  Any particular model, constructed by a method, is defined by its general form and its **parameter** values.



### Data-to-Decision Lifecycle

Data retrieval, data representation, exploratory data analysis (EDA), descriptive data analysis or cluster analysis, predictive data analysis …

![](methodology.jpg)


## Data Analysis

### Data Exploration

After a dataset is retrieved and perhaps in response to transforming its representation, it may be useful to use descriptive statistics and data visualization to look for patterns in the data that can produce insights and inform decisions.

A **descriptive statistic** is a number that conveniently summarizes one or more variable distributions.  Here are some popular descriptive statistics:
* population size
* sample size
* arithmetic mean (average)
* geometric mean
* median
* variance
* standard deviation
* percentile
* weighted average
* correlation coefficient (r)
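
As a rough illustration, most of the descriptive statistics listed above can be computed with Python's standard library.  This is only a sketch: the `demand` and `ad_spend` figures, the weights, and the helper names `weighted_average` and `correlation_coefficient` are hypothetical, chosen just to make the example concrete.

```python
import math
import statistics

# Hypothetical data, for illustration only: monthly demand (millions of units)
# and advertising spend (arbitrary units) for the same five months.
demand = [1.0, 2.0, 4.0, 3.0, 5.0]
ad_spend = [10.0, 20.0, 35.0, 30.0, 50.0]


def weighted_average(values, weights):
    """Average in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient (r) of two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


summary = {
    "sample size": len(demand),
    "arithmetic mean": statistics.fmean(demand),
    "geometric mean": statistics.geometric_mean(demand),
    "median": statistics.median(demand),
    "variance": statistics.variance(demand),         # sample variance (divides by n - 1)
    "standard deviation": statistics.stdev(demand),  # square root of the sample variance
    "quartiles": statistics.quantiles(demand, n=4),  # three cut points between quarters
    "weighted average": weighted_average(demand, weights=[1, 1, 1, 2, 3]),
    "correlation coefficient (r)": correlation_coefficient(demand, ad_spend),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` would be the choice if the data were the whole population.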


A **data visualization** is a graphic representation of one or more variable distributions.  Here are some popular data visualizations:

* A **scatter plot** of data shows distributions of some numeric variables represented by points positioned along axes, and perhaps distributions of other categorical variables represented by color, size, shape, or pattern.


* A **scatter plot projection** of data is a scatter plot with 3 axes, projected onto a flat surface.


* A **line plot** of data shows distributions of some numeric variables represented by points positioned along axes, and further relationships between data represented by line segments connecting the points, and perhaps distributions of other categorical variables represented by color, size, shape, or pattern.


* A **bar chart** of data shows the distribution of a categorical variable represented by heights of adjacently positioned bars, and perhaps distributions of other categorical variables represented by color or pattern. 


* A **histogram** of data shows counts of particular ranges of values in the distribution of a numeric variable represented by heights of adjacently positioned bars. 


* A **density plot** of data shows an estimated proportion of values within any range of a numeric variable represented by the area under a curve.  The **kernel density estimation (KDE)** method is one way to produce a density plot.


* An **animation** of data shows the distribution of a variable, often indicating time, represented by a sequence of other data visualizations.
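
To make one of these concrete, the counting behind a histogram can be sketched in a few lines of Python.  This is a minimal sketch that draws the bars as text rather than graphics; the `order_sizes` values, the bin width, and the function names are hypothetical.

```python
def histogram_counts(values, bin_width, start=0):
    """Count the values falling in each range [low, low + bin_width).

    Returns a dict mapping each range's lower edge to its count,
    including empty ranges between the smallest and largest value.
    """
    indices = [int((v - start) // bin_width) for v in values]
    counts = {}
    for i in range(min(indices), max(indices) + 1):
        counts[start + i * bin_width] = 0
    for i in indices:
        counts[start + i * bin_width] += 1
    return counts


def text_histogram(counts, bin_width):
    """Render one text bar per range; bar length is the count for that range."""
    lines = []
    for low, count in counts.items():
        lines.append(f"[{low:>4}, {low + bin_width:>4})  " + "#" * count)
    return "\n".join(lines)


# Hypothetical order sizes, for illustration only.
order_sizes = [3, 7, 8, 12, 14, 15, 19, 21, 22, 38]
counts = histogram_counts(order_sizes, bin_width=10)
print(text_histogram(counts, bin_width=10))
```

The choice of bin width matters: narrower ranges show more detail but noisier counts, while wider ranges smooth the picture at the cost of detail.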


### Data Representation

After a dataset is retrieved and perhaps in response to data exploration, it may be useful to transform its representation so that various methods can be applied or made more effective.  Here are some popular data representation transformation methods: 

* Transformation by **synthesizing** data adds new variables whose values are constructed based only on information already captured in the original data.


* Transformation by **imputing** data fills in missing values with synthetic values, often based on descriptive statistics. 


* Transformation by **balancing** data duplicates or removes observations to make a particular categorical variable distribution reflect equal numbers of each possible value.  


* Transformation by **aligning** data expands or contracts observations to make a particular variable distribution match another dataset's particular variable distribution, often indicating time.


* The **principal component analysis** method applied to a dataset synthesizes a set of new variables, called **principal components**, that capture in entirety the same relationships between data as do the original variables, but concentrate variance disproportionately in just the first few of the new variables.  Transformation to principal components replaces the original variables with the principal components.


Often, principal component analysis is used in combination with forward feature selection, with the criterion being the total amount of variance captured in the selected variables.
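
Two of the simpler transformations described above, imputing and balancing, can be sketched as follows.  This is an illustration under stated assumptions rather than a definitive implementation: missing values are assumed to be marked with `None`, imputation uses the mean of the observed values, balancing is done by removing observations, and the function names and sample rows are hypothetical.

```python
import statistics


def impute_with_mean(values):
    """Fill in missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def balance_by_removal(rows, class_variable):
    """Remove observations so that every class has as many rows as the rarest class.

    For simplicity this keeps the first rows of each class; in practice the
    rows to keep would usually be chosen at random.
    """
    by_class = {}
    for row in rows:
        by_class.setdefault(row[class_variable], []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group[:target])
    return balanced


# Hypothetical observations, for illustration only.
monthly_spend = [120.0, None, 80.0, 100.0]
customers = [
    {"id": 1, "churned": "no"},
    {"id": 2, "churned": "no"},
    {"id": 3, "churned": "no"},
    {"id": 4, "churned": "yes"},
]

print(impute_with_mean(monthly_spend))
print(balance_by_removal(customers, "churned"))
```

Balancing by duplication would work the other way around, repeating rows of the rarer classes until each class matches the most common one.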

## Model Construction

### Cluster Model Construction Methods

How they work ...

Here are some popular unsupervised machine learning methods to construct cluster models:

* The **Gaussian mixture model by expectation-maximization** method constructs models based on ...


* The **hierarchical agglomeration** method constructs models based on ...


* The **k-means** method constructs models based on ...


Popular ways to measure dissimilarity between observations comprising all numeric variables include **Manhattan distance**, **Euclidean distance**, **cosine distance**, and others.  Measures of dissimilarity between clusters are calculated assuming some **linkage**, which can be **single**, **complete**, **centroid**, or **average** linkage.
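
The dissimilarity measures and linkages named above can be written out directly.  The following is a minimal sketch using only the standard library; the function names are hypothetical, observations are assumed to be tuples of numbers, and cosine distance is taken to be one minus cosine similarity.

```python
import math


def manhattan(a, b):
    """Sum of absolute differences along each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two observations."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_dissimilarity(cluster_a, cluster_b, linkage="single", distance=euclidean):
    """Dissimilarity between two clusters under the given linkage."""
    if linkage == "centroid":
        # Distance between the clusters' mean points.
        centroid_a = [sum(coords) / len(cluster_a) for coords in zip(*cluster_a)]
        centroid_b = [sum(coords) / len(cluster_b) for coords in zip(*cluster_b)]
        return distance(centroid_a, centroid_b)
    pairwise = [distance(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "single":      # closest pair of observations
        return min(pairwise)
    if linkage == "complete":    # farthest pair of observations
        return max(pairwise)
    if linkage == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


# Hypothetical clusters, for illustration only.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]
for linkage in ("single", "complete", "centroid", "average"):
    print(linkage, cluster_dissimilarity(cluster_a, cluster_b, linkage))
```

Different linkages can rank the same pairs of clusters differently, which is one reason the choice of linkage affects the clusters a method produces.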



### Binary Classifier Construction Methods

A **classifier construction method** examines data, called the training data, to construct a (classifier, predictive) model.

<br>
<img src="classification_train.jpg" align="left" width="440"><br clear="all">

<br>
<img src="classification_predict.jpg" align="left" width="560"><br clear="all">

<br><br>
Here are some popular supervised methods to construct classifiers:

* The **naïve Bayes** method constructs models based on probabilities and conditional probabilities of values appearing in the data.  Laplace smoothing is a technique incorporated into the naïve Bayes method that may improve its models’ predictions.  Hyper-parameters include Laplace smoothing factor and others.


* The **support vector machine (SVM)** method constructs models based on an optimal separation of datapoints in the data.  Hyper-parameters include cost, kernel, and others.


* The **neural network** method (classification version) constructs models based on how values appearing in the data can be progressively transformed throughout a network structure.  Hyper-parameters include number of levels, number of nodes in each level, and others.  The **perceptron** method is a simplified version of the neural network method.  **Deep learning** refers to constructing models using the neural network method.


* The **logistic regression** method constructs models based on the optimal S-shaped hyper-curve through datapoints.


* The **decision tree** method (classification version) constructs models based on the optimal splitting of the data into progressively smaller sets of data.  Hyper-parameters include maximum depth of tree, maximum number of nodes, and others.


* The **nearest neighbor (kNN)** method (classification version) constructs models based on how similar datapoints are to each other.  Hyper-parameters include number of nearest neighbors and others.  Measures of dissimilarity can be Manhattan distance, Euclidean distance, cosine distance, or others.
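
As one illustration of the methods listed above, the nearest neighbor idea for classification can be sketched in a few lines.  This sketch assumes Euclidean distance as the measure of dissimilarity and a simple majority vote among the `k` nearest training observations; the `knn_predict` name and the customer data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training_data, new_point, k=3):
    """Predict the class of new_point by majority vote of its k nearest neighbors.

    training_data is a list of (numeric variables, known class) pairs.
    """
    by_distance = sorted(
        training_data,
        key=lambda pair: math.dist(pair[0], new_point),  # Euclidean distance
    )
    nearest_classes = [known_class for _, known_class in by_distance[:k]]
    return Counter(nearest_classes).most_common(1)[0][0]


# Hypothetical training data: (support calls, months inactive) -> class.
training_data = [
    ((1.0, 1.0), "stay"),
    ((1.5, 2.0), "stay"),
    ((2.0, 1.5), "stay"),
    ((6.0, 6.0), "churn"),
    ((6.5, 7.0), "churn"),
    ((7.0, 6.5), "churn"),
]

print(knn_predict(training_data, (1.2, 1.4), k=3))
print(knn_predict(training_data, (6.4, 6.6), k=3))
```

Here the number of nearest neighbors `k` is the hyper-parameter, while the "model" is effectively the stored training data itself.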


### Multinomial Classifier Construction Methods

To be written ...

<br>


Here are some popular supervised methods that have multinomial forms:

* The **multinomial naïve Bayes** method ...


* The **multinomial neural network** method ...


* The **multinomial decision tree** method ...


* The **multinomial nearest neighbor** method ...

<br>

Here are two popular binary-to-multinomial form conversion methods: 

* The **one-versus-one** method ...


* The **one-versus-many** method ...

<br>


### Regressor Construction Methods

How they work ...

<br>
<img src="regression_train.jpg" align="left" width="440"><br clear="all">

<br>
<img src="regression_predict.jpg" align="left" width="360"><br clear="all">

<br><br>
Here are some popular supervised methods to construct regressors:

* The **linear regression** method constructs models based on the optimal line or hyper-plane through datapoints.


* The **support vector regression (SVR)** method constructs models based on an optimal separation of datapoints in the data.  Hyper-parameters include cost, kernel, and others.


* The **neural network** method (regression version) constructs models based on how values appearing in the data can be progressively transformed throughout a network structure.  Hyper-parameters include number of levels and number of nodes in each level.  **Deep learning** refers to constructing models using the neural network method.


* The **decision tree (DT)** method (regression version) constructs models based on the optimal splitting of the data into progressively smaller sets of data.  Hyper-parameters include maximum depth of tree, maximum number of nodes, and others.


* The **nearest neighbor (kNN)** method (regression version) constructs models based on how similar datapoints are to each other.  Hyper-parameters include number of nearest neighbors and others.  Measures of dissimilarity can be Manhattan distance, Euclidean distance, cosine distance, or others.


* The **random forest** method constructs models based on ...
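
As one illustration of the methods listed above, linear regression with a single predictor variable can be sketched as follows.  The sketch assumes that "optimal" means minimizing the sum of squared errors (ordinary least squares); the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the datapoints."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the constructed model (slope, intercept) to predict an outcome."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: sales staff on duty -> units sold.
staff = [1, 2, 3, 4]
units = [3, 5, 7, 9]

model = fit_line(staff, units)
print(model)
print(predict(model, 5))
```

The slope and intercept are the model's parameters; this simple form of the method has no hyper-parameters to set.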
 


### Ensemble Construction Methods

An **ensemble method** combines various predictive model construction methods working in concert to construct a predictive model with benefits derived from all the component methods.

Here are some popular ensemble methods to construct classifiers or regressors:

* The **bootstrap aggregating (bagging)** ensemble method constructs models based on a combination of ...


* The **boosting** ensemble method constructs models based on ... 


* The **stacking** ensemble method constructs models based on ...

## Model Evaluation

### Cluster Model Evaluation

Popular cluster model performance metrics include **dispersion ratio**, **Bayes information criterion (BIC)**, **Akaike information criterion (AIC)**, and others.  Each of these reflects the average dissimilarity between observations within a cluster and the average dissimilarity between clusters.


### Predictive Model Evaluation

Predictive model evaluation ...

<br>


#### Sampling for Predictive Model Evaluation

Here are three popular methods to evaluate predictive model performance:

**In-sample** performance evaluation of a model constructed from all the data works like this:<br>
Use the model to predict the classes/outcomes of the data, and compare those predictions to the known actual classes/outcomes based on some performance metric.  This method has the advantage that the model is constructed from all the data, but has the disadvantage that predictions are about data already known to the model.

<img src="classification_evaluate_insample.jpg" align="left" width="750"><br clear="all">
<br><br>

**Out-of-sample** performance evaluation of a model constructed from all the data works like this:<br>
Construct a new model from a subset of all the data (referred to as **training data**) to predict the classes/outcomes of the remaining data (referred to as **validation data**), and then compare those predictions to the known actual classes/outcomes based on some performance metric.  The new model is constructed using the same method and hyper-parameter values as for the model to be evaluated, but using different data.  It is assumed that the new model has performance closely approximating that of the model to be evaluated.  This method is also known as **holdout** performance evaluation.  This method has the disadvantage that the new model is constructed from only some of the data, but has the advantage that predictions are about data not known to the model.

<img src="classification_evaluate_outofsample.jpg" align="left" width="950"><br clear="all">
<br><br>

**Cross-validation** performance evaluation of a model constructed from all the data works like this:<br>
Partition the data into subsets - called **folds**, then average the results of out-of-sample performance evaluation applied to each combination of a model constructed from data not in a fold and predictions made about data that are in that fold.  It is assumed that such models on average have performance closely approximating that of the model to be evaluated.  This method has the advantage that the models taken together are constructed from all the data, and further has the advantage that predictions made by any particular model are about data not known to that model.
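
The cross-validation procedure described above can be sketched as follows.  This is a minimal sketch, not a definitive implementation: folds are formed by dealing observations out in turn rather than at random, and the stand-in method (always predict the most common class in the training data) and all function names are hypothetical, chosen only to keep the example short.

```python
from collections import Counter


def make_folds(n_observations, k):
    """Partition the indices 0..n-1 into k folds by dealing them out in turn."""
    return [list(range(i, n_observations, k)) for i in range(k)]


def cross_validate(data, k, construct, evaluate):
    """Average the out-of-sample performance over k folds."""
    scores = []
    for fold in make_folds(len(data), k):
        held_out = set(fold)
        training = [row for i, row in enumerate(data) if i not in held_out]
        validation = [data[i] for i in fold]
        model = construct(training)                  # built from data not in the fold
        scores.append(evaluate(model, validation))   # judged on data in the fold
    return sum(scores) / len(scores)


def construct_majority_classifier(training):
    """Stand-in method: the model is simply the most common class seen."""
    return Counter(known_class for _, known_class in training).most_common(1)[0][0]


def accuracy(model, validation):
    """Fraction of validation observations whose class the model predicts correctly."""
    correct = sum(1 for _, known_class in validation if known_class == model)
    return correct / len(validation)


# Hypothetical data: (observation id, known class).
data = [(i, c) for i, c in enumerate("aaabaabaab")]

score = cross_validate(data, k=5, construct=construct_majority_classifier, evaluate=accuracy)
print(score)
```

Holdout evaluation corresponds to running the loop body once, with a single split into training data and validation data.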

<br>


#### Classifier Performance Metrics

The **confusion matrix** for some model and dataset indicates the relationships between how the model would predict classes and the known actual classes.  For a binary classifier, its confusion matrix is 2x2 comprising four counts of predictions: predicted positive class that are actually positive class, predicted positive class that are actually negative class, predicted negative class that are actually positive class, predicted negative class that are actually negative class.

Classifier performance metrics are calculated from a confusion matrix.  Here are some popular classifier performance metrics:
* accuracy (correct prediction rate)
* true positive rate (aka sensitivity, recall)
* true negative rate (aka specificity)
* positive predictive value (aka precision)
* negative predictive value
* F1 score
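
These metrics follow directly from the four counts in a binary classifier's confusion matrix.  The sketch below uses the standard definitions; the function name and the example counts are hypothetical.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Compute performance metrics from the four confusion matrix counts.

    tp: predicted positive, actually positive
    fp: predicted positive, actually negative
    fn: predicted negative, actually positive
    tn: predicted negative, actually negative
    """
    true_positive_rate = tp / (tp + fn)         # sensitivity, recall
    positive_predictive_value = tp / (tp + fp)  # precision
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "true positive rate": true_positive_rate,
        "true negative rate": tn / (tn + fp),   # specificity
        "positive predictive value": positive_predictive_value,
        "negative predictive value": tn / (tn + fn),
        # F1 is the harmonic mean of precision and recall.
        "f1 score": 2 * positive_predictive_value * true_positive_rate
        / (positive_predictive_value + true_positive_rate),
    }


# Hypothetical confusion matrix counts, for illustration only.
metrics = classifier_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Which metric matters most depends on the business consequences of each kind of wrong prediction, which is where the decision model comes in.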


<br>

#### Regressor Performance Metrics

The **error table** for some model and dataset indicates the relationship between how the model would predict outcomes and the known actual outcomes.

Regressor performance metrics are calculated from an error table.  Here are some popular regressor performance metrics:
* root mean square error (RMSE)
* mean absolute percent error (MAPE)
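
Both metrics can be computed from paired lists of actual and predicted outcomes.  The sketch below uses the standard definitions and assumes no actual outcome is zero (otherwise MAPE is undefined); the function names and the figures are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the average squared error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mape(actual, predicted):
    """Mean absolute percent error, expressed as a percentage."""
    percent_errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(percent_errors) / len(percent_errors)


# Hypothetical outcomes, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print(rmse(actual, predicted))
print(mape(actual, predicted))
```

RMSE is expressed in the same units as the outcome and penalizes large errors more heavily, while MAPE is unit-free and so easier to compare across different outcomes.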

## Model Tuning

### Training, Validation, & Test

### Systematic Data Representation Selection

Transformation by **feature selection** removes some variables.  **Forward feature selection** does so by evaluating each variable individually according to some criterion, keeping the best variable, and then proceeding to evaluate the remaining possible pairs, triples, etc.  **Backward feature selection** does so by evaluating each set of variables less one according to some criterion, keeping the best set, and then proceeding to evaluate successive sets of variables less one.
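
Forward feature selection can be sketched as a greedy loop.  This is a sketch under stated assumptions: the criterion is supplied as a `score` function where higher is better, and the loop stops when adding another variable no longer improves the score (the stopping rule is an assumption made here).  The variable names and the toy scoring function are hypothetical; in practice the score would typically come from evaluating a model constructed with the candidate variables.

```python
def forward_feature_selection(variables, score):
    """Greedily grow a set of variables, adding the best one at each step.

    score takes a list of variables and returns a number (higher is better).
    Returns the selected variables and the score they achieve.
    """
    selected = []
    remaining = list(variables)
    best_score = float("-inf")
    while remaining:
        # Evaluate the current selection extended by each remaining variable.
        candidates = [(score(selected + [v]), v) for v in remaining]
        top_score, top_variable = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining variable improves the criterion
        selected.append(top_variable)
        remaining.remove(top_variable)
        best_score = top_score
    return selected, best_score


# Hypothetical criterion: each variable contributes some usefulness,
# and every variable included carries a fixed complexity penalty.
usefulness = {"price": 0.5, "season": 0.3, "weather": 0.05, "noise": 0.0}


def toy_score(subset):
    return sum(usefulness[v] for v in subset) - 0.1 * len(subset)


selected, best_score = forward_feature_selection(list(usefulness), toy_score)
print(selected, best_score)
```

Backward feature selection would follow the same pattern in reverse, starting from all variables and removing one at a time.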

### Systematic Hyper-Parameter Assignment


## Handling Special Data Types

### Text Data

Document-term matrix ...


### Time Series Data

Lookahead, lookback ...

**Forecast by Direct Prediction** ...

**Forecast by Recursive Prediction** ...

### Social Network Data

**Descriptive Statistics for Social Networks** ...

## Decision Making

Models, confusion matrix, population vs sample, decision model, ...

## Technology for Business Analytics

### Programming Languages

R ...

Python ...

SQL ...


### Products

Microsoft Excel ...

SAS, JMP, IBM SPSS, Tableau ...



<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised March 22, 2020
</span>
</p>