# Review of Predictive Modeling

Our primary goal here is to describe the main capabilities and limitations of predictive modeling.

In business applications, data analytics methods are often categorized at a high level into three distinct types: 
-    descriptive, 
-    predictive, and 
-    prescriptive


**Descriptive analytics** refers to methods for data summarization, data quality assessment, and finding correlations.

**Predictive analytics** focuses on estimating the likelihood of a potential outcome by using data that are observed or known prior to the outcome.
e.g)
- demand forecasting
- propensity scoring, which estimates the likelihood that a customer responds to a promotion

**Prescriptive analytics** refers to modeling the dependency between decisions and future outcomes for optimal decision making.
e.g)
price optimization, where the profit is modeled as a function of the price, so that one can estimate how many dollars of profit would be generated by every dollar of price discount and determine the profit-optimal discount value.


# Steps

The first step we need to take as we begin to discuss the algorithmic approach is to translate the business language into more formal models that describe the objective we are trying to achieve, the space of possible actions, and the constraints we should meet.

Many marketing problems are naturally expressed as optimization problems:

- a business metric, such as revenue, is the objective
- the possible actions are, for example, campaigns or assortment adjustments
    (assortment: a group of different things or of different types of the same thing; a mixture)
- the optimal action has to be found among the various possible strategies


`Importantly`: there are several basic considerations that should be taken into account in any model design.

**First**, we need to define the business objective and express it as a numerical metric that can be a subject of optimization

_e.g) The design of the objective can be especially challenging if the objective represents a trade-off between the enterprise’s profit and the usefulness to the consumer_



**Second**, we should account for available data or address the data collection problem

_e.g) there is a trade-off between the cost of data acquisition and the value delivered by the acquired data_


**Third**, the model can be created at different levels of granularity

_e.g) this method assumes the availability of a powerful data processing infrastructure and high-resolution data, which enables more granular modeling_

**Finally**, the economic model that estimates the business outcomes from the distribution should be defined. This is an economic problem, rather than a machine learning one

**example:**

In 1998, 

`scenario 1`:
    Suppose a retailer sells a product with margin _m_, and $q_{u}$ is the monthly amount of this product purchased by customer _u_. The total gain is then a simple sum over customers (a dot product of the quantities with the margin),
    
$$G =\sum_{u}  q_{u}*m$$



 `scenario 2`:
    He wants to boost sales by a factor _k_ and is willing to pay the cost _c_ of each promotion. This naturally becomes an optimization problem in which the strategy $s=(k,c)$ plays the role of the hyperparameters.

$$ \underset{s}{\operatorname{max}}  \sum_{u} ( k q_{u} m - c ) $$
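To make scenarios 1 and 2 concrete, here is a minimal sketch that evaluates the gain for a handful of candidate strategies and picks the best one. The quantities, the margin, and the candidate list are made-up illustration values, and the cost _c_ is read as a per-customer promotion cost; none of these come from the notes.

```python
# Hypothetical data: monthly purchase quantities q_u for five customers.
quantities = [2.0, 1.0, 4.0, 3.0, 5.0]
margin = 1.5  # margin m per unit (assumed value)

def baseline_gain(q, m):
    """Scenario 1: G = sum_u q_u * m."""
    return sum(q_u * m for q_u in q)

def promo_gain(q, m, k, c):
    """Scenario 2: G(s) = sum_u (k * q_u * m - c) for a strategy s = (k, c)."""
    return sum(k * q_u * m - c for q_u in q)

# Candidate strategies (k, c). Assumption: a larger uplift k costs more per customer.
candidates = [(1.0, 0.0), (1.2, 0.5), (1.5, 2.0), (2.0, 6.0)]

# The "max over s" is a plain search over the candidate list.
best = max(candidates, key=lambda s: promo_gain(quantities, margin, *s))
```

In practice _k_ is not a free choice: it has to be predicted as a function of the promotion, which is where the predictive model comes in.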


`scenario 3`:
    He segments customers into two groups, $U_{i}$ and $U_{j}$, and applies a different strategy to each, $s_{i}=(k_{i},c_{i})$ and $s_{j}=(k_{j},c_{j})$, so now we have four hyperparameters and non-linearity.

$$\underset{s_{i},s_{j}}{\operatorname{max}}  \sum_{u \in U_{i}} (k_{i} q_{u} m - c_{i}) + \sum_{u \in U_{j}} ( k_{j} q_{u} m - c_{j}) $$



 

We might normally think that the bigger the discount, the more sales we get, but the effect saturates (plateaus) after some point. Also, if the price is too cheap, people might think the product is low quality, which affects other parameters; we'll discuss this later on.

Now we want to know the difference between the controlled (with discount) and uncontrolled (without discount) cases in the factors and metrics of interest, $P(Outcome| Invest)$.

So, in our simple model,
$G =\sum_{u} G( p(y| X(s))) $

$$ \underset{s}{\operatorname{max}}G =\sum_{u} G( p(y| X(s)))  $$

In the usual ML lingo, the design matrix is $D = [X | y]$,

where $X$ is the feature matrix of shape $(n \times m)$ and $y$ is the label vector of shape $(n \times 1)$:


$$ D  = [ X_{(n \times m)} | y_{(n \times 1)} ]$$


Fitting a model, e.g. by minimizing a divergence in classical ML or a contrastive loss in DL, gives the prediction $\hat{y} = y(x)$.




However, in many cases $X$ and $y$ are not explicitly present in the data, so we do a lot of feature engineering.

In the general case, $p(y|X)$ is the distribution estimated from the data, and $G$ is the model that turns this distribution into a business outcome.

`note:`
Finally, the economic model that estimates the business outcomes from the distribution should be defined. 
**This is an economic problem, rather than a machine learning one 😂**

`Agenda:`

# Supervised Learning
- Regression
- Classification

Other considerations:
- Parametric
- Non-parametric

Techniques:
- Linear Regression (MLE, MAP)
- Logistic Regression (binary)
- KNN
- Naive Bayes Classifier
- Nonlinear Models
    1. Feature mapping and kernel methods
    2. Adaptive basis and decision trees

# Unsupervised Learning
1. PCA (decorrelation / dimensionality reduction)
2. Clustering

# More Specialized Models
1. Consumer Choice Theory
    - Multinomial Logit Model
    - Estimation of the Multinomial Logit Model
2. Survival Analysis
    - Survival Function
    - Hazard Function
    - Survival Analysis Regression
3. Auction Theory

## **Supervised Learning**

**Optimization**
- minimization problem with an equality constraint $G(x,y)=C$

$$ \underset{x,y}{\operatorname{min}} \left( \sum_{i=0}^{n} F(x,y) + \lambda [ G(x,y)-C ] \right) $$

Model fitting is an optimization problem in its own right, so we need to specify an objective function that will be optimized

#### Linear regression 
You might have studied it via statistics or linear algebra, but it is an unconstrained optimization problem: find the weights $w$ that minimize the sum of squared errors.
    
    
$$SSE(w) = \sum_{i=1}^{n} ( y_{i} - w^T x_{i} )^2 $$
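As a small illustration of the SSE objective, the sketch below fits $w$ on synthetic data with NumPy's least-squares solver. The data, the true weights, and the noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a bias column plus two features (illustrative values only).
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
w_true = np.array([1.0, 2.0, -3.0])
y = X @ w_true + 0.01 * rng.normal(size=50)

def sse(w, X, y):
    """SSE(w) = sum_i (y_i - w^T x_i)^2."""
    residuals = y - X @ w
    return float(residuals @ residuals)

# np.linalg.lstsq returns the w that minimizes the SSE.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the problem is unconstrained and convex, the minimizer can also be found by gradient descent; the closed-form solver is simply the shortest route here.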


#### Logistic regression
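You might have studied it via statistics, linear algebra, or the logit model.

For the binary setup, with $\hat{y}_{i} = \sigma(w^T x_{i})$, the log-loss (negative log-likelihood) is

$$LL(w) = -\sum_{i=1}^{n} \left[ y_{i} \log(\hat{y}_{i}) + (1-y_{i}) \log(1-\hat{y}_{i}) \right] $$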
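The binary log-loss above has no closed-form minimizer, so it is usually minimized iteratively. The following is a minimal sketch using plain gradient descent; the learning rate, the step count, and the synthetic data are assumptions chosen for illustration, not tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y, eps=1e-12):
    """Binary log-loss; predictions are clipped to avoid log(0)."""
    y_hat = np.clip(sigmoid(X @ w), eps, 1 - eps)
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def fit_logistic(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y)   # gradient of the log-loss w.r.t. w
        w -= lr * grad / len(y)             # step on the mean gradient
    return w

# Synthetic binary labels generated from a logistic model (illustrative only).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ np.array([0.5, 2.0, -1.0]) + rng.logistic(size=200) > 0).astype(float)

w_fit = fit_logistic(X, y)
```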


For the multi-class setup,

- instead of the log-loss, use the softmax (categorical cross-entropy) loss
    
    
`note:` Linear methods can be an appropriate solution for many marketing applications and should **NOT** be underestimated.

#### Nonlinear Models
Use these methods when the features have an explicitly nonlinear relationship with the target.

Simple ways to introduce nonlinearity are polynomial features and KNN :)


**1. Feature Mapping**

The transformation of the original feature space into another space, typically of higher dimensionality (kernels), is referred to as _feature mapping_.
It is intuitively clear that the addition of more dimensions, specified as nonlinear functions of one or several existing features, provides more flexibility for the regression or classification algorithm that we are trying to improve.

Again, kernels that work well for consumer profiles might not be the best choice for textual product descriptions, and so on.
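A minimal sketch of feature mapping, assuming a single feature and a polynomial map: the same linear least-squares fit is run on the raw feature and on the mapped features. The quadratic target is a toy example invented for the illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D feature x to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(-2.0, 2.0, 40)
y = x ** 2 - 1.0                     # nonlinear, noise-free toy target

def fit_sse(Phi, y):
    """Fit a linear model on the mapped features and return its SSE."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.sum((y - Phi @ w) ** 2))

sse_linear = fit_sse(poly_features(x, 1), y)      # original space plus bias
sse_quadratic = fit_sse(poly_features(x, 2), y)   # one extra nonlinear dimension
```

The model stays linear in its weights; only the input space changes, which is why the extra dimension is enough to capture the curve.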


**2. Adaptive Basis**

A family of methods that build the model with a greedy heuristic algorithm.

e.g) a simple decision tree that chooses splits by entropy or Gini impurity
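The greedy step of a decision tree can be sketched as follows: compute an impurity measure and pick the threshold that minimizes the weighted impurity of the two children. This is a single-split illustration on hypothetical data (`ages`, `bought`), not a full tree learner.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels, impurity=gini):
    """Greedy step: try each midpoint, keep the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(pairs)
        best = min(best, (score, t))
    return best

# Hypothetical data: customer age and whether the customer bought the product.
ages = [22, 25, 31, 38, 45, 52]
bought = [0, 0, 0, 1, 1, 1]
score, threshold = best_threshold(ages, bought)
```

A full tree applies the same step recursively to each child, which is what makes the procedure greedy rather than globally optimal.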

As a simple example, the **KNN** algorithm is one of the most basic supervised learning methods; however, it can work quite well in many settings. It is widely used in **recommendation algorithms**, which we will discuss later. The shortcoming of this method is that the higher the dimensionality of the data, the sparser the space becomes, and we then have to look at neighbors that are very far away from the given point.

In other words, KNN looks only at local neighbours and does not learn a generalizable pattern in the data.
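A minimal KNN classifier sketch, assuming Euclidean distance and majority voting; the training points are toy values made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Two small, well-separated groups of points (illustrative only).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([0.05, 0.05]))
label_b = knn_predict(X_train, y_train, np.array([1.0, 0.95]))
```

There is no training phase: every prediction scans the stored points, which is also why the method depends so heavily on distances staying meaningful as the dimensionality grows.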

Nowadays people also use NCF (Neural Collaborative Filtering) algorithms.

### **Unsupervised learning**

$$ D = \{x_{i}\}_{i=1}^{n} ; \quad \text{s.t. } x_{i} \in \mathbb{R}^{d} $$

Unsupervised learning works with unlabeled data and is widely used in marketing applications for data exploration and analysis. A well-known example is the clustering of customer profiles and the interpretation of the obtained results, which is one of the most important techniques in marketing analytics.

In programmatic applications, however, we are more concerned with automation than with exploration and interactive analysis. **Representation learning** is one application of unsupervised learning methods that can be useful in this context, so we are focusing here specifically on representation learning aspects, rather than unsupervised learning in general.

- PCA

    In the linear algebra context it is just a variance maximization problem, or equivalently a decomposition problem:

    $$\underset{u}{\operatorname{max}} \; \operatorname{var}(u^{T} x_{i})_{i=1}^{n} ; \quad \text{s.t. } \|u\|=1$$


PCA relies on the eigendecomposition of the covariance matrix, which is symmetric. To generalize to an arbitrary rectangular matrix, use the SVD, $X= U \Sigma V^{T}$; to go beyond matrices, tensor decomposition works. Some people explain it in terms of low-rank matrix approximation as well.

In typical scenarios I don't see many applications that use tensor decomposition, though my exposure to ML is limited.
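The link between PCA and the SVD can be sketched in a few lines: center the data, take the SVD, and read the principal directions from $V^{T}$. The toy data below is invented so that most of the variance lies along one direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points stretched along the direction (1, 1), plus small noise.
z = rng.normal(size=200)
X = np.column_stack([z, z]) + 0.05 * rng.normal(size=(200, 2))

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

u1 = Vt[0]                           # first principal direction, unit length
explained = S ** 2 / np.sum(S ** 2)  # share of variance per component
projected = X_centered @ u1          # 1-D representation of every point
```

The variance of `projected` equals `S[0]**2 / n`, the largest variance achievable by any unit vector, which matches the maximization problem stated above.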

In marketing applications, data typically corresponds to

- observed inputs,
- properties, and
- outputs of a real marketing process
    
    
<img 
    style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 50%;"
    src="./pics/data observed from same real-life process.png" 
    alt="Our logo">
</img>



No one can observe the market process directly, but we can get insights from many small observations, such as
- the success rate of marketing campaigns,
- interactions between customers and products,
- the interplay of price and demand,
- etc.

These features tend to be correlated because they are all observations of the same underlying activity. But keep in mind: **correlation does not imply causation**


SVD dissects a matrix into three steps: _Rotate_, _Stretch_, _Rotate_.

What is the purpose of decomposition here?

It is especially useful in marketing for search and recommendation applications.
Let's say we have captured the interactions between customers and products: which customer bought which products.

Design matrix of shape $m \times n$, $D_{mn}$

| **$Products \over customers$** | p1 | p2|. |. |. |. |. |$p_{j}$|. |. |. |$p_{n}$|
|--|----|---|--|--|--|--|--|--|--|--|--|--|
|c1|---|---|--|--|--|--|--|--|--|--|--|--|
|c2|---|---|--|--|--|--|--|--|--|--|--|--|
|.|---|---|--|--|--|--|--|--|--|--|--|--|
|.|---|---|--|--|--|--|--|--|--|--|--|--|
|$c_{i}$|---|---|--|--|--|--|--|$I_{ij}$|--|--|--|--|
|.|---|---|--|--|--|--|--|--|--|--|--|--|
|.|---|---|--|--|--|--|--|--|--|--|--|--|
|.|---|---|--|--|--|--|--|--|--|--|--|--|
|.|---|---|--|--|--|--|--|--|--|--|--|--|
|$c_{m}$|---|---|--|--|--|--|--|--|--|--|--|$I_{mn}$|


This is a really sparse matrix:
- each cell contains the number of purchases
- the data is also highly correlated because many products are similar to each other
- many customers have similar shopping habits
- our promotions reach only certain categories
- our brand scope is limited to certain products and consumers



If we decompose $D_{mn} = C_{mk} P_{kn}$, in other words $\hat{I}_{ij}= c_{i}^{T} p_{j}$,

where

$C_{mk}$ is the matrix that holds the customer features, with $c_{i}$ its $i$-th row, and

$P_{kn}$ is the matrix that holds the product features, with $p_{j}$ its $j$-th column.
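One way to obtain such a factorization is a truncated SVD, sketched below on a tiny hypothetical purchase matrix. Note the assumption: a plain SVD treats empty cells as zeros rather than as missing values, which real recommenders usually handle differently.

```python
import numpy as np

# Hypothetical purchase counts: 4 customers x 5 products, two taste groups.
D = np.array([[5, 4, 0, 0, 1],
              [4, 5, 0, 1, 0],
              [0, 0, 5, 4, 4],
              [0, 1, 4, 5, 5]], dtype=float)

k = 2                                   # number of latent features (assumed)
U, S, Vt = np.linalg.svd(D, full_matrices=False)

C = U[:, :k] * S[:k]                    # customer features, shape (m, k)
P = Vt[:k, :]                           # product features, shape (k, n)
D_hat = C @ P                           # rank-k approximation of D

i, j = 2, 3
score_ij = C[i] @ P[:, j]               # predicted affinity of customer i for product j
```

Each predicted cell is a dot product of a customer vector and a product vector, so similar customers and similar products end up close together in the k-dimensional space.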

**Clustering**: by nature, it is a hard optimization problem.

Objective:
- points close to each other have high similarity
- points far apart have low similarity

<img 
    style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 15%;"
    src="./pics/clustering.png" 
    alt="Our logo">
</img>


Let $S_{i}$ be the $i^{th}$ cluster.

Each $x_{j}$ belongs to exactly one cluster, i.e. $S_{i} \cap S_{l}=\emptyset$ for $i \neq l$; in vector notation, with $s_{i}$ the indicator vector of cluster $S_{i}$, $s_{i}^{T} s_{l} = 0$ and $\sum_{i=1}^{k} s_{ij}=1$ for every $j$.

The centroid (mean) of a cluster is $c_{i} = \frac{1}{|S_{i}|} \sum_{x_{j} \in S_{i}} x_{j}$.


**Double Optimization equation**


$$ \underset{s_{1},s_{2},..s_{k}}{\operatorname{min}}  \sum_{i=1}^{k} \sum_{x_{j}\in S_{i}} ||x_{j}-c_{i} ||^{2} \quad \text{s.t.} \quad s_{i}^{T} s_{l}=0 \;(i \neq l), \quad \sum_{i=1}^{k} s_{ij}=1, \quad c_{i} = \frac{1}{|S_{i}|} \sum_{x_{j} \in S_{i}} x_{j}$$




\begin{equation}
  s_{ij}=\begin{cases}
    1, & \text{if $x_{j} \in s_{i}$}.\\
    0, & \text{otherwise}.
  \end{cases}
\end{equation}


It can be solved by integer programming; Lloyd's algorithm and expectation-maximization (EM) are well-known solutions.
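Lloyd's algorithm alternates between the two unknowns: assign every point to its nearest centroid, then recompute each centroid as the mean of its cluster. The sketch below is a bare-bones version with random initialization and a fixed number of iterations, both assumptions made for brevity; the two-blob data is synthetic.

```python
import numpy as np

def lloyd(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points (fancy indexing copies).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids

# Two well-separated synthetic blobs of 30 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
               rng.normal(3.0, 0.1, size=(30, 2))])

labels, centroids = lloyd(X, k=2)
```

Each step can only lower the within-cluster sum of squares, so the procedure converges, but only to a local optimum that depends on the initialization.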


Standard supervised and unsupervised learning methods can address most marketing applications.

`Key Takeaway:` whatever problem we are trying to solve, the model should reflect the nature of the process that we are modeling.

Some tasks cannot be easily solved by using standard ML methods. We need more specialized data analysis techniques or complex economic models that bridge the gap to the business objective.

### Consumer Choice Theory

Understanding and predicting consumer choice is fundamental to marketing. It helps answer important questions related to
- product design
- assortment planning (distribution can't be planned if the demand is not well understood)

In mathematics lingo, it's called the **discrete choice problem**.

It starts from something as simple as choosing coffee or tea.

In marketing applications, the question is why a consumer chooses $I_{j}$ among all available alternatives $I_{1},I_{2},I_{3},\ldots,I_{j},\ldots,I_{J}$.

The decision maker $n$ chooses the option $j$ whose value $Y_{nj}$ exceeds that of all the rest: $Y_{nj}>Y_{ni} \; \forall i,i\neq{j}$

Sometimes a counterintuitive choice also happens, and a choice model allows us to investigate it. The uncertainty may come from factors such as
- income,
- other properties of individuals that we do not understand or that are not inputs to our model.

Most importantly, a behavioural model should predict consumer choice.
Our decision model can be built only from the known properties of individuals and alternatives.


Our model (linear or nonlinear) is $$V(X_{nj}) = \hat{Y}(X_{nj})$$ whereas the true value is $$Y(X_{nj})= Y(X_{nj}, h_{nj})$$ where $X_{nj}$ denotes the known properties and $h_{nj}$ the hidden/unobserved factors.



So, $$Y_{ni}=\hat{Y}_{ni}+\epsilon_{ni}$$
$$Y_{ni}=V(X_{ni})+\epsilon_{ni}$$

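The model should tell the probability of decision maker _n_ choosing product _j_:

$$P_{nj} = \Pr(Y_{nj} > Y_{ni} \;\; \forall i \neq j) = \Pr(V(X_{nj})+\epsilon_{nj} > V(X_{ni})+\epsilon_{ni} \;\; \forall i \neq j)$$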
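The agenda lists the multinomial logit model as the standard way to turn this into numbers. Under the additional assumption that the errors $\epsilon_{ni}$ are i.i.d. Gumbel distributed (an assumption of the logit model, not something stated in these notes), the choice probability has the softmax form $P_{nj} = e^{V_{nj}} / \sum_{i} e^{V_{ni}}$. A minimal sketch with made-up values of $V$:

```python
import numpy as np

def choice_probabilities(V):
    """Multinomial logit: P_nj = exp(V_nj) / sum_i exp(V_ni), row by row."""
    shifted = V - V.max(axis=1, keepdims=True)   # stabilizes exp numerically
    expV = np.exp(shifted)
    return expV / expV.sum(axis=1, keepdims=True)

# Hypothetical values V(X_nj) for 2 decision makers and 3 alternatives.
V = np.array([[1.0, 1.0, 0.0],
              [0.0, 2.0, 0.0]])

P = choice_probabilities(V)   # each row is a probability distribution
```

Subtracting the row maximum does not change the result, because only differences between the $V_{ni}$ matter in this model.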



### Survival Analysis
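Survival analysis focuses on the probability of the event not happening up to a given time.
- The survival function $S(t)$ gives the probability that the event has not happened by time $t$
- The hazard function describes the risk of the event at time $t$, given that it has not happened yet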
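As a rough illustration of the two functions, the sketch below computes an empirical survival curve and a discrete-time hazard from observed durations. It assumes every customer's event is fully observed (no censoring), which real survival data rarely satisfies; the durations are hypothetical.

```python
def survival_and_hazard(durations, horizon):
    """Empirical S(t) and discrete hazard h(t) for t = 1..horizon, no censoring."""
    n = len(durations)
    survival, hazard = [], []
    for t in range(1, horizon + 1):
        at_risk = sum(1 for d in durations if d >= t)   # event not yet happened before t
        events = sum(1 for d in durations if d == t)    # event happens in period t
        survival.append(sum(1 for d in durations if d > t) / n)
        hazard.append(events / at_risk if at_risk else 0.0)
    return survival, hazard

# Hypothetical months until churn for eight customers.
months_to_churn = [1, 2, 2, 3, 5, 5, 5, 6]
S, h = survival_and_hazard(months_to_churn, horizon=6)
```

The two are linked: in discrete time, $S(t)$ is the product of $(1 - h(\tau))$ over all periods $\tau \le t$.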

### Auction Theory
     
       more complex theories :(

# `NOTE`

Keep the scale of the business in mind: when things scale up, the dynamics of the problem (business objectives) might also change.