# Executive Overview

![](banner_exec.jpg)

## Data Science for Business

When making a business decision, you can guess about what happened in the past, is happening now, or will happen in the future.  You can then decide what to do based on your guess.  You will consequently get a business result that comes about from your decision.  You might assume that when your guesses are based on patterns you see in some relevant data, then your guesses will be better, and so your decisions will be better, and so the business results will be better.  Indeed, this is often the case.  Challenges arise, however, when the patterns in data are difficult to detect.

**Data science for business** combines statistics and computing to find patterns in data that would otherwise be hard to detect, so that you can make good guesses, and from them good decisions, and so more often get good business results.

Data science for business enables you to find and quantify answers to questions like these:
* What are the patterns in the data?
* What guess can be made from these patterns?
* What are the probabilities that the guess is right or wrong, and what are the probabilities that the guess is wrong in any of several different ways?
* What decision best leverages the guess and the probabilities of how the guess could be right or wrong?
* What is the probability that a decision leads to a good business result, and what are the probabilities that it leads to any of several other different business results?


_Consider a business manager in charge of, say, a manufacturing process.  This business manager must decide how many workers to schedule based on a guess about what the demand for the company’s product this month will be.  Demand three months ago was 1 million units, two months ago was 2 million units, and last month was 4 million units.  Here are three ways the business manager could make a decision:_

* _**Decision (not based on data):**  Our business manager decides to keep the current number of workers, which is enough to manufacture 4 million units._
* _**Data-Driven Decision:**  One pattern easily seen in these data is that demand has been doubling every month.  So, our business manager guesses demand this month will be 8 million units.  From there, our business manager decides to schedule enough workers to manufacture 8 million units._
* _**Data-Driven Decision Using Business Analytics:**  Our business manager considers much richer data, which includes information about more than just recent demand.  Patterns in such rich data are difficult to detect, but our business manager applies various methods that do find them.  Based on these patterns, our business manager guesses that the demand this month will be 6 million units, but further knows that there is a 30% probability it could be higher and a 10% probability it could be lower, and that the consequences of understaffing could cost the company up to \\$1 million in lost opportunity and that overstaffing could cost the company more than \\$2 million in unnecessary worker costs.  Our business manager considers these and other factors, and decides to schedule enough workers to manufacture 7 million units._ 

_In all three cases, the business result will depend on what the demand actually turns out to be, but we assume that the data-driven decision using business analytics will most likely lead to the best business result._

## Data-to-Decision Methodology

The **data-to-decision methodology** prescribes a way to make data-driven decisions by iteratively working through three stages of data analysis:

* **Data management** involves gathering, storing, and retrieving data for eventual use in decision making.<br><br>

* **Exploratory data analysis (EDA)** involves representing data in various ways and applying methods to expose patterns that may lead to non-obvious insights useful in decision making.<br><br>

* **Modeling** involves applying methods to further expose patterns in the data by estimating the underlying processes responsible for generating the data.  Such models may make predictions that reveal non-obvious insights useful in decision making.  

<table style="border:1px solid; margin-top:20px">
    <caption style="text-align:center">The Data-to-Decision Methodology</caption>
    <tr><td style="padding:20px; background-color:white"><img src="d2d_process.jpg"></td></tr>
</table><br clear=all>

## Decision Models

A **decision model** estimates business results that would come about from various decisions informed by a business model, business parameter values, and data analytic models.

A **business model** describes how a business converts products and services to money.

**Business parameters** detail various aspects of a business's operating environment.

A decision model can be conveniently expressed as an **influence diagram**, useful for communicating assumptions upon which the model is based.  An influence diagram comprises symbols for a decision, performance metrics associated with a data analytic model, business parameters, intermediate calculations, and a business result, with dependencies among these shown as directed links. 

<table style="border:1px solid; margin-top:20px; margin-bottom:20px">
    <caption style="text-align:center">Example of a Decision Model Represented as an Influence Diagram</caption>
    <tr><td style="padding:20px; background-color:white"><img src="decision_model.jpg" width=600></td></tr>
</table><br clear=all>
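
As a rough illustration of how a decision model turns a guess and its probabilities into an estimated business result, the sketch below scores a few candidate staffing decisions by expected cost. The demand scenarios, their probabilities, and the per-unit costs are hypothetical values chosen only for illustration; they are not taken from the influence diagram.

```python
# Hypothetical business parameters (assumptions for illustration only).
COST_PER_UNIT_UNDERSTAFFED = 0.50  # lost opportunity per unit of unmet demand, in dollars
COST_PER_UNIT_OVERSTAFFED = 0.75   # unnecessary worker cost per unit of idle capacity, in dollars

# Hypothetical output of a data analytic model: demand (units) -> probability.
demand_scenarios = {5_000_000: 0.10, 6_000_000: 0.60, 7_000_000: 0.30}


def expected_cost(capacity, scenarios):
    """Probability-weighted cost of scheduling enough workers for `capacity` units."""
    total = 0.0
    for demand, probability in scenarios.items():
        shortfall = max(demand - capacity, 0)   # demand that cannot be served
        excess = max(capacity - demand, 0)      # capacity that sits idle
        cost = shortfall * COST_PER_UNIT_UNDERSTAFFED + excess * COST_PER_UNIT_OVERSTAFFED
        total += probability * cost
    return total


# Evaluate each candidate decision and keep the one with the lowest expected cost.
candidate_decisions = [5_000_000, 6_000_000, 7_000_000]
costs = {capacity: expected_cost(capacity, demand_scenarios) for capacity in candidate_decisions}
best_decision = min(costs, key=costs.get)
print(best_decision, costs[best_decision])
```

With these made-up numbers, scheduling for 6 million units has the lowest expected cost. Different cost parameters or probabilities could favor a different decision, which is the point of making the assumptions explicit.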

## Data Analytic Models

**Artificial intelligence (AI)** is about machines doing things that you would normally expect only humans could do.  One way a machine could do these things is by following formulae developed by the machine itself from patterns it finds in data – that’s a **data analytic model**.  The process of constructing a data analytic model is called **machine learning (ML)** or **training a model**, and employs a variety of **methods**.

One kind of data analytic model specifies how to partition data into clusters - that's a **cluster model**. It could be constructed using an **unsupervised method**, so called because an unsupervised method does not make use of examples of correct partitions.

Another kind of data analytic model specifies how to predict as yet unobserved examples not explicitly reflected in data – that's a **predictive model**.  It could be constructed using a **supervised method**, so called because a supervised method makes use of examples of correct predictions. A predictive model that predicts categorical values is called a **classifier**, the categorical values it predicts are called **classes**, and a method to construct it is called a **classifier construction method**.  A predictive model that predicts numeric values is called a **regressor**, the numeric values it predicts are called **outcomes**, and a method to construct it is called a **regressor construction method**.

Another kind of data analytic model specifies connection relationships between entities - that's a **social network model**.

A method comprising a combination of several other methods working in concert is called an **ensemble method**.

Any particular model construction method is defined by its general approach and its specific **hyper-parameter** values.  Any particular model, constructed by a method, is defined by its general form and its specific **parameter** values.

## Data Management Methods

In practice, there are a vast number of methods to address the complexities around gathering, storing, and retrieving data.  Here we are interested primarily in simple **data retrieval** to ready data for exploratory data analysis and modeling.

## Exploratory Data Analysis Methods


### Data Exploration

After a dataset is retrieved and perhaps in response to transforming its representation, it may be useful to look for patterns in the data that can produce insights and inform decisions.  This is **data exploration**.

**Data extraction** involves the mechanics of slicing and dicing data to get at just a subset of interest, which could include a subset of observations and/or a subset of variables, arranged in various ways. 

A **descriptive statistic** is a number that conveniently summarizes some data, specifically one or more variable distributions.  Here are some popular descriptive statistics:
* size ($N$, $n$)
* probability, also known as relative frequency or proportion ($P$)
* arithmetic mean, also known as mean or average ($\mu$, $\bar{x}$)
* geometric mean
* median
* variance ($\sigma^2$, $s^2$)
* standard deviation ($\sigma$, $s$)
* percentile
* weighted average
* correlation coefficient ($r$)
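
Several of these descriptive statistics can be computed with Python's standard library. The sketch below is only illustrative: the `demand` and `ad_spend` values are invented, and the `correlation` helper is written out by hand rather than taken from a library.

```python
import math
import statistics

# Hypothetical sample: monthly demand (millions of units) and ad spend ($ thousands).
demand = [1.0, 2.0, 4.0, 3.0, 5.0]
ad_spend = [10.0, 20.0, 35.0, 30.0, 55.0]

n = len(demand)                                 # size
mean = statistics.mean(demand)                  # arithmetic mean
geo_mean = statistics.geometric_mean(demand)    # geometric mean
median = statistics.median(demand)              # median
sample_var = statistics.variance(demand)        # sample variance, s^2
sample_sd = statistics.stdev(demand)            # sample standard deviation, s
proportion_high = sum(1 for d in demand if d >= 4.0) / n   # relative frequency of demand >= 4


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


r = correlation(demand, ad_spend)
print(n, mean, median, sample_var, proportion_high, round(r, 3))
```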

A **cross-tabulation** is a table that conveniently summarizes some data, organized by rows and columns that correspond to variable values, and aggregated by various functions like mean, sum, or count. 
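
A minimal sketch of a cross-tabulation, assuming a small invented dataset of (region, product, units sold) observations and aggregation by sum:

```python
from collections import defaultdict

# Hypothetical observations: (region, product, units sold).
observations = [
    ("East", "A", 10),
    ("East", "B", 5),
    ("West", "A", 7),
    ("East", "A", 3),
    ("West", "B", 8),
]

# Rows correspond to region values, columns to product values; cells aggregate by sum.
crosstab = defaultdict(lambda: defaultdict(int))
for region, product, units in observations:
    crosstab[region][product] += units

rows = sorted(crosstab)
columns = sorted({product for _, product, _ in observations})
for region in rows:
    print(region, [crosstab[region][product] for product in columns])
```

Swapping the `+= units` aggregation for a count or a running mean gives the other aggregations mentioned above.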

A **data visualization** is a graphic representation of some data, specifically one or more variable distributions.  Here are some popular data visualizations:

* A **scatterplot** of data shows distributions of some numeric variables represented by points positioned along axes, and perhaps distributions of other categorical variables represented by color, size, shape, or pattern.<br><br>

* A **scatterplot projection** of data is a scatterplot with 3 axes, projected onto a flat surface.<br><br>

* A **lineplot** of data shows distributions of some numeric variables represented by points positioned along axes, and further relationships between data represented by line segments connecting the points, and perhaps distributions of other categorical variables represented by color, size, shape, or pattern.<br><br>

* A **barplot** of data shows the distribution of a categorical variable represented by heights of adjacently positioned bars, and perhaps distributions of other categorical variables represented by color or pattern.<br><br> 

* A **histogram** of data shows counts of particular ranges of values in the distribution of a numeric variable represented by heights of adjacently positioned bars.<br><br>

* A **density plot** of data shows an estimated proportion of values within any range of a numeric variable represented by the area under a curve.  The **kernel density estimation (KDE)** method is one way to produce a density plot.<br><br>

* An **animation** of data shows the distribution of a variable, often indicating time, represented by a sequence of other data visualizations. 
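
The counting behind a histogram can be sketched without any plotting library. The function name `histogram_counts`, the bin boundaries, and the sample values below are assumptions made for illustration; a plotting package would normally handle this step.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width ranges on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


values = [1, 2, 2, 3, 5, 8, 9, 10]   # hypothetical numeric variable
counts = histogram_counts(values, low=0, high=10, bins=5)

# Crude text rendering: one bar per range, bar length equal to the count.
for i, count in enumerate(counts):
    print(f"[{i * 2:>2}, {i * 2 + 2:>2}) " + "#" * count)
```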



### Data Representation

After a dataset is retrieved and perhaps in response to data exploration, it may be useful to transform its representation so that various methods can be applied or made more effective.  This is **data representation**.  Here are some popular data representation transformation methods: 

* Transformation by **synthesizing** data adds new variables whose values are constructed based only on information already captured in the original data.   


* Transformation by **imputing** data fills in missing values with synthetic values, often based on descriptive statistics. 


* Transformation by **balancing** data duplicates or removes observations to make a particular categorical variable distribution reflect equal numbers of each possible value.  


* Transformation by **aligning** data expands or contracts observations to make a particular variable distribution match another dataset's particular variable distribution, often indicating time.


* The **principal component analysis** method applied to a dataset synthesizes a set of new variables, called **principal components**, that capture in entirety the same relationships between data as do the original variables, but concentrate variance disproportionately in just the first few of the new variables.  Transformation to principal components replaces the original variables with the principal components.
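
Two of the transformations above, imputing and balancing, can be sketched in a few lines. This is one possible reading under stated assumptions: missing values are marked with `None`, imputation uses the mean, and balancing removes observations at random (rather than duplicating them) until every class matches the rarest one. The function names and sample data are hypothetical.

```python
import random
import statistics


def impute_mean(values):
    """Fill in missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def balance_by_downsampling(observations, label_of, seed=0):
    """Remove observations at random so every class has as many as the rarest class."""
    rng = random.Random(seed)
    groups = {}
    for obs in observations:
        groups.setdefault(label_of(obs), []).append(obs)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


ages = [34, None, 41, 29, None]
imputed_ages = impute_mean(ages)

customers = [("a", "churn"), ("b", "stay"), ("c", "stay"), ("d", "stay"), ("e", "churn")]
balanced_customers = balance_by_downsampling(customers, label_of=lambda obs: obs[1])
print(imputed_ages, balanced_customers)
```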


### Data Representation of Special Data Types

Many data analytic methods cannot be applied directly to text, time series, or social network data because of their special representations.  Often, though, data analytic methods can be applied to such data transformed to an appropriate alternative representation.   


* **Text data** can be transformed to **document-term matrix** form, where observations correspond to documents and variables correspond to occurrences of words.  Data analytic methods can then be applied as usual.


* **Time series data** can be transformed to **cross-sectional data**, where observations correspond to points in time and include variables with information about points in time past called **lookbacks** and points in time future called **lookaheads**.  Data analytic methods can then be applied to forecast by **direct prediction** some number of time steps ahead of a **viewpoint**, or by **recursive prediction** one timestep ahead of previous predictions.


* **Social network data** is often represented in a special form like an **adjacency matrix** or **link list**, and analysis requires special descriptive statistics, like **PageRank** used in the Google ranking algorithm.  Recommender systems based on **collaborative filtering** work on **bipartite graphs**, a special version of social network data.
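
The first two transformations can be sketched as follows. Both functions are simplified illustrations with hypothetical names: the document-term matrix splits on whitespace only, and the time series transformation treats the viewpoint as the moment just after the last lookback value.

```python
def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, matrix


def to_cross_sectional(series, lookbacks, lookahead):
    """Turn a time series into rows of (lookback values, lookahead value)."""
    rows = []
    for t in range(lookbacks, len(series) - lookahead + 1):
        past = series[t - lookbacks:t]        # values just before the viewpoint
        future = series[t + lookahead - 1]    # value `lookahead` steps after the viewpoint
        rows.append((past, future))
    return rows


vocabulary, matrix = document_term_matrix(["good product", "bad product bad service"])
rows = to_cross_sectional([1, 2, 4, 8, 16], lookbacks=2, lookahead=1)
print(vocabulary, matrix)
print(rows)
```

Each row produced by `to_cross_sectional` can then be treated as an ordinary observation, with the lookback values as predictor variables and the lookahead value as the outcome for direct prediction.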

## Model Construction Methods

### Cluster Model Construction

A **cluster model construction method** examines data and organizes observations into several classes.  Such an organization is called a **cluster model**, and can be useful for decisions involving market segmentation and other business applications. 

Here are some popular cluster model construction methods:

* The **Gaussian mixture model by expectation-maximization** method constructs models by assigning observations partial membership in all possible classes and then iteratively concentrating membership in the most expected classes.<br><br>

* The **hierarchical agglomeration** method constructs models by agglomerating observations into classes based on their dissimilarity to other observations.<br><br>

* The **k-means** method constructs models by tentatively organizing observations into several classes and then iteratively improving the organization based on the observations' dissimilarity to other observations.

Popular ways to measure dissimilarity between observations comprising all numeric variables include **Manhattan distance**, **Euclidean distance**, **cosine distance**, and others.  Measures of dissimilarity between clusters are calculated assuming some **linkage**, which can be **single**, **complete**, **centroid**, or **average** linkage.
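
The three dissimilarity measures and a bare-bones k-means loop can be sketched as below. This is a simplified illustration, not a production implementation: the initial centers are supplied by hand, the number of iterations is fixed, and the sample points are invented.

```python
import math


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def k_means(points, initial_centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    centers = [list(c) for c in initial_centers]
    assignments = []
    for _ in range(iterations):
        # Assign each observation to the nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: euclidean_distance(p, centers[k]))
            for p in points
        ]
        # Move each center to the mean of the observations assigned to it.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centers


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignments, centers = k_means(points, initial_centers=[points[0], points[3]])
print(assignments, centers)
```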



### Predictive Model Construction: Binary Classifiers

A **binary classifier construction method** examines data to construct a predictive model capable of estimating the probabilities that new observations should be classified as members of one class or another class.  From these probabilities, taken along with some chosen cutoff threshold, you can predict whether the new observations should be classified as members of one class or another class.  Such a predictive model is called a **binary classifier**, or often just a **classifier**.

<table style="border:1px solid; margin-top:20px">
    <caption style="text-align:center">Construct a Classifier</caption>
    <tr><td style="padding:20px; background-color:white"><img src="classification_train.jpg" width=440></td></tr>
</table><br clear=all>

<table style="border:1px solid; margin-top:20px; margin-bottom:20px">
    <caption style="text-align:center">Use a Classifier to Make Predictions</caption>
    <tr><td style="padding:20px; background-color:white"><img src="classification_predict.jpg" width=560></td></tr>
</table><br clear=all>
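
The step from estimated probabilities to predicted classes via a cutoff threshold can be sketched as follows. The probabilities are invented stand-ins for a classifier's output, and the function name `classify` is hypothetical.

```python
def classify(probabilities, cutoff=0.5, positive="positive", negative="negative"):
    """Predict the positive class wherever the estimated probability meets the cutoff."""
    return [positive if p >= cutoff else negative for p in probabilities]


# Hypothetical probabilities of the positive class for four new observations.
predicted_probabilities = [0.92, 0.35, 0.58, 0.10]

classes_default = classify(predicted_probabilities)             # cutoff 0.5
classes_strict = classify(predicted_probabilities, cutoff=0.6)  # stricter cutoff
print(classes_default)
print(classes_strict)
```

Raising the cutoff from 0.5 to 0.6 changes the third prediction from positive to negative, which shows how the cutoff choice alone can change predictions without changing the model.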

Here are some popular binary classifier construction methods:

* The **naïve Bayes** method constructs models based on probabilities and conditional probabilities of values appearing in the data.  **Laplace smoothing** is a technique incorporated into the naïve Bayes method that may improve its models’ predictions.  Hyper-parameters include Laplace smoothing factor and others.<br><br>

* The **support vector machine (SVM)** method constructs models based on an optimal separation of observations in variable space.  Hyper-parameters include cost, kernel, and others.<br><br>

* The **neural network** method (classification version) constructs models based on how values appearing in the data can be combined as they propagate throughout a network structure.  This method is inspired by, but different from, how biological brain neurons communicate with each other.  Hyper-parameters include number of levels, number of nodes in each level, and others.  The **perceptron** method is a simplified version of the neural network method.  **Deep learning** refers to constructing models using the neural network method with many levels.<br><br>

* The **logistic regression** method constructs models based on the optimal S-shaped hyper-curve through observations in variable space.<br><br>

* The **decision tree (DT)** method (classification version) constructs models based on the optimal splitting of the data into progressively smaller sets of data.  Hyper-parameters include maximum depth of tree, maximum number of nodes, and others.<br><br>

* The **nearest neighbor (kNN)** method (classification version) constructs models based on how similar observations are to each other.  Hyper-parameters include number of nearest neighbors and others.  Measures of dissimilarity can be Manhattan distance, Euclidean distance, cosine distance, or others.
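
As one concrete instance from this list, a nearest neighbor classifier can be sketched in a few lines. This sketch assumes Euclidean distance, numeric variables only, and a training set of hypothetical (features, label) pairs; it estimates the probability of the positive class as the share of the k nearest training observations that carry that label.

```python
import math
from collections import Counter


def knn_probability(train, new_observation, k=3, positive="yes"):
    """Estimate P(positive) from the labels of the k nearest training observations."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], new_observation))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels)[positive] / k


# Hypothetical training data: (features, label).
train = [
    ((1.0, 1.0), "no"), ((1.2, 0.8), "no"), ((0.9, 1.1), "no"),
    ((5.0, 5.0), "yes"), ((5.2, 4.8), "yes"), ((4.9, 5.1), "yes"),
]

p_near_yes = knn_probability(train, (4.5, 4.5), k=3)
p_near_no = knn_probability(train, (1.0, 1.0), k=3)
print(p_near_yes, p_near_no)
```

Here the number of nearest neighbors `k` is a hyper-parameter, while the stored training observations play the role of the model itself.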


### Predictive Model Construction: Multinomial Classifiers

A **multinomial classifier construction method** is a generalization of a binary classifier construction method, which constructs a predictive model capable of estimating the probabilities that new observations should be classified as members of any of several classes.  Such a predictive model is called a **multinomial classifier**, or often just a **classifier**.

Here are some popular classifier construction methods that have both binary and multinomial forms:

* naïve Bayes
* neural network
* decision tree
* nearest neighbor

Any binary classifier can be converted to a multinomial classifier with either of two binary-to-multinomial conversion methods: 

* The **one-versus-many** method reorganizes a dataset into several other datasets, one for each class versus all other classes treated as a single other class.  Models are constructed based on all these datasets, and predictions are made by majority rule or cutoff among the models.<br><br>

* The **one-versus-one** method reorganizes a dataset into several other datasets, one for each pair of classes, excluding the other classes.  Models are then constructed based on these datasets, and predictions are made by a round robin tournament among the models.
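
The two dataset reorganizations can be sketched as below: one builds a dataset per class against all other classes merged together, and the other builds a dataset per pair of classes. The function names and the tiny dataset are hypothetical, and the sketch covers only the reorganization step, not model construction or the final vote.

```python
from itertools import combinations


def class_versus_rest_datasets(observations):
    """One binary dataset per class: that class versus all others merged as 'other'."""
    classes = sorted({label for _, label in observations})
    return {
        cls: [(features, cls if label == cls else "other") for features, label in observations]
        for cls in classes
    }


def pairwise_datasets(observations):
    """One binary dataset per pair of classes, excluding observations of other classes."""
    classes = sorted({label for _, label in observations})
    return {
        (a, b): [(features, label) for features, label in observations if label in (a, b)]
        for a, b in combinations(classes, 2)
    }


# Hypothetical observations: (features, class).
observations = [((1.0,), "low"), ((2.0,), "low"), ((5.0,), "mid"), ((9.0,), "high")]
rest = class_versus_rest_datasets(observations)
pairs = pairwise_datasets(observations)
print(sorted(rest), sorted(pairs))
```

With three classes both reorganizations happen to yield three datasets; with more classes the pairwise approach grows faster, since it needs one dataset per pair.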


### Predictive Model Construction: Regressors

A **regressor construction method** examines data to construct a predictive model capable of predicting which numeric values should be associated with new observations.  Such a predictive model is called a **regressor**.

<table style="border:1px solid; margin-top:20px">
    <caption style="text-align:center">Construct a Regressor</caption>
    <tr><td style="padding:20px; background-color:white"><img src="regression_train.jpg" width=440></td></tr>
</table><br clear=all>

<table style="border:1px solid; margin-top:20px; margin-bottom:20px">
    <caption style="text-align:center">Use a Regressor to Make Predictions</caption>
    <tr><td style="padding:20px; background-color:white"><img src="regression_predict.jpg" width=380></td></tr>
</table><br clear=all>

Here are some popular regressor construction methods:

* The **linear regression** method constructs models based on the optimal line, plane, or hyper-plane fitted through observations in variable space.<br><br>

* The **support vector regression (SVR)** method constructs models based on a hyper-plane fitted through observations in variable space such that as many observations as possible fall within a margin of tolerance around it.  Hyper-parameters include cost, kernel, and others.<br><br>

* The **neural network** method (regression version) constructs models based on how values appearing in the data can be combined as they propagate throughout a network structure.  This method is inspired by, but different from, how biological brain neurons communicate with each other.  Hyper-parameters include number of levels and number of nodes in each level.  **Deep learning** refers to constructing models using the neural network method with many levels.<br><br>

* The **decision tree (DT)** method (regression version) constructs models based on the optimal splitting of the data into progressively smaller sets of data.  Hyper-parameters include maximum depth of tree, maximum number of nodes, and others.<br><br>

* The **nearest neighbor (kNN)** method (regression version) constructs models based on how similar observations are to each other.  Hyper-parameters include number of nearest neighbors and others.  Measures of dissimilarity can be Manhattan distance, Euclidean distance, cosine distance, or others.<br><br>
 
Some regressor construction methods involve finding optimal solutions to various formulae, using search algorithms like **gradient descent** or others.
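
To make the idea of gradient descent concrete, the sketch below fits a one-variable linear regression by repeatedly stepping the slope and intercept against the gradient of the mean squared error. The learning rate, step count, and data are assumptions chosen so that this small example converges; they are not general recommendations.

```python
def fit_line_by_gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    """Find slope and intercept minimizing mean squared error by gradient descent."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean squared error with respect to each parameter.
        grad_slope = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2.0 / n) * sum(errors)
        # Step a little in the direction that reduces the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # hypothetical data lying exactly on y = 2x + 1
slope, intercept = fit_line_by_gradient_descent(xs, ys)
print(round(slope, 3), round(intercept, 3))
```

In the vocabulary used earlier, `learning_rate` and `steps` are hyper-parameters of the method, while `slope` and `intercept` are parameters of the resulting model.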


### Predictive Model Construction: Ensembles

An **ensemble method** combines several predictive models working in concert to construct a predictive model with benefits derived from all the component methods.

Here are some popular ensemble methods to construct classifiers or regressors:

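* The **bootstrap aggregating (bagging)** ensemble method constructs models based on a single component method, but on several random samples of the data, typically drawn with replacement.  Predictions of the component models are combined by averaging or voting.  You can think of the result as a committee of experts.<br><br>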

* The **boosting** ensemble method constructs models based on a single component method, but on several random subsets of data, where each subset emphasizes observations predicted incorrectly by other models.  You can think of the result as a committee of experts, each member with increasingly specialized knowledge.<br><br>

* The **stacking** ensemble method constructs models based on several component methods, and on the predictions made by models constructed using those component methods.  You can think of the result as a committee of experts on other experts.<br><br>

* The **random forest** method is a variation on bootstrap aggregating.  It uses decision tree as the single component method, uses several random subsets of data, and also uses several randomly selected variables.
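
A minimal sketch of bootstrap aggregating for a regressor is shown below. It assumes samples drawn with replacement, uses a deliberately simple 1-nearest-neighbor regressor as the single component method, and averages the component predictions. All names and data are hypothetical.

```python
import random


def fit_nearest_neighbor(sample):
    """Component method: a 1-nearest-neighbor regressor over (x, y) pairs."""
    def predict(x_new):
        _, y = min(sample, key=lambda row: abs(row[0] - x_new))
        return y
    return predict


def bagging(data, fit, n_models=25, seed=0):
    """Construct several models on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]   # sample with replacement
        models.append(fit(bootstrap))

    def predict(x_new):
        predictions = [model(x_new) for model in models]
        return sum(predictions) / len(predictions)

    return predict


# Hypothetical training data: (x, y).
data = [(1.0, 10.0), (2.0, 12.0), (3.0, 15.0), (4.0, 19.0), (5.0, 24.0)]
ensemble = bagging(data, fit_nearest_neighbor)
prediction = ensemble(3.2)
print(prediction)
```

For a classifier, the averaging step would typically be replaced by a majority vote among the component models.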

## Model Evaluation Methods

### Cluster Model Evaluation

Popular cluster model performance metrics include **dispersion ratio**, **Bayesian information criterion (BIC)**, **Akaike information criterion (AIC)**, and others.  Each of these reflects the average dissimilarity between observations within a cluster and the average dissimilarity between clusters.


### Predictive Model Evaluation
<br>


#### Sampling for Predictive Model Evaluation

Here are three popular methods to evaluate predictive model performance:

**In-sample** performance evaluation of a model constructed from all the data works like this:<br>
Use the model to predict the classes/outcomes of the data, and compare those predictions to the known actual classes/outcomes based on some performance metric.  This method has the advantage that the model is constructed from all the data, but has the disadvantage that predictions are about data already known to the model.

<table style="border:1px solid; margin-top:20px">
    <caption style="text-align:center">In-sample Evaluation of a Classifier</caption>
    <tr><td style="padding:20px; background-color:white"><img src="classification_insample.jpg" width=750></td></tr>
</table><br clear=all>

<table style="border:1px solid; margin-top:20px; margin-bottom:20px">
    <caption style="text-align:center">In-sample Evaluation of a Regressor</caption>
    <tr><td style="padding:20px; background-color:white"><img src="regression_insample.jpg" width=750></td></tr>
</table><br clear=all>

**Out-of-Sample** performance evaluation of a model constructed from all the data works like this:<br>
Construct a new model from a subset of all the data (referred to as **training data**) to predict the classes/outcomes of the remaining data (referred to as **validation data**), and then compare those predictions to the known actual classes/outcomes based on some performance metric.  The new model is constructed using the same method and hyper-parameter values as for the model to be evaluated, but using different data.  It is assumed that the new model has performance closely approximating that of the model to be evaluated.  This method is also known as **holdout** performance evaluation.  This method has the disadvantage that the new model is constructed from only some of the data, but has the advantage that predictions are about data not known to the model.

<table style="border:1px solid; margin-top:20px">
    <caption style="text-align:center">Out-of-Sample Evaluation of a Classifier</caption>
    <tr><td style="padding:20px; background-color:white"><img src="classification_outofsample.jpg" width=950></td></tr>
</table><br clear=all>

<table style="border:1px solid; margin-top:20px; margin-bottom:20px">
    <caption style="text-align:center">Out-of-Sample Evaluation of a Regressor</caption>
    <tr><td style="padding:20px; background-color:white"><img src="regression_outofsample.jpg" width=950></td></tr>
</table><br clear=all>

**Cross-validation** performance evaluation of a model constructed from all the data works like this:<br>
Partition the data into subsets called **folds**, then average the results of out-of-sample performance evaluation applied to each combination of a model constructed from data not in a fold and predictions made about data that are in that fold.  It is assumed that such models on average have performance closely approximating that of the model to be evaluated.  This method has the advantage that the models taken together are constructed from all the data, and further has the advantage that predictions made by any particular model are about data not known to that model.

<table style="border:1px solid; margin-top:20px">
    <caption style="text-align:center">Cross-Validation Evaluation of a Classifier</caption>
    <tr><td style="padding:20px; background-color:white"><img src="classification_xval.jpg" width=950></td></tr>
</table><br clear=all>

<table style="border:1px solid; margin-top:20px; margin-bottom:20px">
    <caption style="text-align:center">Cross-Validation Evaluation of a Regressor</caption>
    <tr><td style="padding:20px; background-color:white"><img src="regression_xval.jpg" width=950></td></tr>
</table><br clear=all>
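
Cross-validation can be sketched as follows. To keep the example short, the folds are formed by interleaving observations rather than by random assignment, the "method" is a placeholder that always predicts the mean training outcome, and performance is measured by root mean square error. All names and data are hypothetical.

```python
import math


def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(xs, ys, k, fit, metric):
    """Average the out-of-sample metric over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        # Construct a model from the data not in the fold.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        # Predict the data in the fold and compare with known actual outcomes.
        predictions = [model(xs[i]) for i in fold]
        actuals = [ys[i] for i in fold]
        scores.append(metric(actuals, predictions))
    return sum(scores) / len(scores)


def fit_mean(train_x, train_y):
    """Placeholder method: always predict the mean training outcome."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def rmse(actuals, predictions):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals))


xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
cv_rmse = cross_validate(xs, ys, k=3, fit=fit_mean, metric=rmse)
print(round(cv_rmse, 4))
```

Any function with the same shape as `fit_mean` could be passed in instead, so the same loop can compare different methods or hyper-parameter values on equal footing.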


#### Classifier Performance Metrics

The **confusion matrix** for some predictive model, dataset, and cutoff threshold indicates the relationships between how the model would predict classes and the known actual classes.  For a binary classifier, the confusion matrix is 2x2 comprising four counts of predictions: predicted positive class that are actually positive class, predicted positive class that are actually negative class, predicted negative class that are actually positive class, and predicted negative class that are actually negative class.

Classifier performance metrics are calculated from values in a confusion matrix.  Here are some popular classifier performance metrics:
* accuracy (correct prediction rate)
* true positive rate (aka sensitivity, recall)
* true negative rate (aka specificity)
* false positive rate
* false negative rate
* positive predictive value (aka precision)
* negative predictive value
* F1 score
<br><br>
* business performance, calculated per a decision model
<br><br>
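
The metrics listed above follow directly from the four counts in a binary confusion matrix. The sketch below uses invented actual and predicted classes, and it assumes every denominator is non-zero; a more careful version would guard against division by zero.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def classifier_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "true_positive_rate": recall,
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "positive_predictive_value": precision,
        "negative_predictive_value": tn / (tn + fn),
        "f1_score": 2 * precision * recall / (precision + recall),
    }


P, N = "positive", "negative"
actual = [P, P, P, N, N, N, N, P]      # hypothetical known actual classes
predicted = [P, P, N, N, N, P, N, P]   # hypothetical predictions at some cutoff

tp, fp, fn, tn = confusion_counts(actual, predicted)
metrics = classifier_metrics(tp, fp, fn, tn)
print((tp, fp, fn, tn), metrics)
```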


#### Regressor Performance Metrics

The **error table** for some model and dataset indicates the relationship between how the model would predict outcomes and the known actual outcomes.

Regressor performance metrics are calculated from an error table.  Here are some popular regressor performance metrics:
* root mean square error (RMSE)
* mean absolute percent error (MAPE)
<br><br>
* business performance, calculated per a decision model
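
An error table and the two regressor metrics named above can be sketched as follows. The actual and predicted outcomes are invented, and the MAPE calculation assumes no actual outcome is zero.

```python
import math


def error_table(actuals, predictions):
    """Rows of (actual, predicted, error), where error = predicted - actual."""
    return [(a, p, p - a) for a, p in zip(actuals, predictions)]


def rmse(table):
    """Root mean square error."""
    return math.sqrt(sum(error ** 2 for _, _, error in table) / len(table))


def mape(table):
    """Mean absolute percent error, in percent."""
    return 100.0 * sum(abs(error / actual) for actual, _, error in table) / len(table)


actuals = [100.0, 200.0, 400.0]       # hypothetical known actual outcomes
predictions = [110.0, 180.0, 400.0]   # hypothetical predicted outcomes

table = error_table(actuals, predictions)
model_rmse = rmse(table)
model_mape = mape(table)
print(table, round(model_rmse, 2), round(model_mape, 2))
```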

## Model Tuning Methods

**Model tuning** is systematically exploring the effects of method, variables, hyper-parameter settings, and cutoff settings on model performance to determine the best combination.  

Here are three popular ways to explore the effects of variables on model performance:

* **Forward feature selection** does so by evaluating the effect of using each variable individually, keeping the best variable, and then proceeding to evaluate the effects of remaining possible pairs, triples, etc.
* **Backward feature selection** does so by evaluating the effect of each set of variables less one, keeping the best set, and then proceeding to evaluate the effects of successive sets of variables less one.
* **Exhaustive feature selection** does so by evaluating the effect of each possible combination of variables.  

Often, principal component analysis is used in combination with forward feature selection to reduce the number of variables.
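
Forward feature selection can be sketched as a greedy loop. The `evaluate` argument stands for any function that scores a set of variables, for example a cross-validated performance metric; the toy scoring function and variable names below are hypothetical, and stopping as soon as no added variable improves the score is one common choice rather than the only one.

```python
def forward_feature_selection(variables, evaluate):
    """Greedily add the variable that most improves the score until none helps."""
    selected, best_score = [], float("-inf")
    remaining = list(variables)
    while remaining:
        # Score each candidate set formed by adding one more variable.
        scored = [(evaluate(selected + [v]), v) for v in remaining]
        score, variable = max(scored)
        if score <= best_score:
            break   # no remaining variable improves performance
        selected.append(variable)
        remaining.remove(variable)
        best_score = score
    return selected, best_score


# Toy stand-in for model performance: each variable adds or subtracts a fixed amount.
hypothetical_gain = {"price": 0.30, "season": 0.20, "noise": -0.05}


def toy_evaluate(variable_set):
    return 0.5 + sum(hypothetical_gain[v] for v in variable_set)


selected, best_score = forward_feature_selection(["price", "season", "noise"], toy_evaluate)
print(selected, best_score)
```

Backward feature selection follows the same pattern in reverse, starting from all variables and removing one at a time.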

## Technology

R and Python are popular open-source programming languages often used for exploratory data analysis and modeling.  Both languages can be executed within integrated development environments (IDEs) or within a Jupyter notebook running on a local machine or JupyterHub server. 

SQL is a popular query language often used for data extraction and some other aspects of exploratory data analysis.

Microsoft Excel is a commercial spreadsheet product often used for data analysis and modeling.

SAS, JMP, IBM SPSS, Tableau, and other products provide convenient user interfaces to exploratory data analysis and modeling functionality, without necessarily requiring programming.

<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised July 18, 2020
</span>
</p>