# Targeting the Best Prospects for a Charity Mailing

Fundraising organizations (including those in universities) need to manage their budgets and the patience of their potential donors. In any given campaign segment, they would like to solicit from a “good” subset of the donors.


## The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces


**Decomposing the Business Problem:**

We realize that the response can vary—some people might donate $100 while others might donate $1. We need to take this into account and not only focus on the response rate.

Our ultimate focus is to maximize our donation profit—meaning the net after taking into account the costs.


**Recomposing the Solution Pieces:**

We can use **expected value** as a framework for structuring our approach to building a solution to the problem.


**Baseline comparison:** 

The baseline is not targeting anyone; comparing the expected benefit of targeting against this baseline tells us whether to target. The expected benefit of not targeting is simply zero, because we assume that consumers won't donate spontaneously without a solicitation.


**Expected value framework:**

Expected benefit of targeting = $p(R|x)*v_R(x) +[1-p(R|x)]*v_{NR}(x)$, where $p(R|x)$ is the probability of a response given consumer x; $v_R(x)$ is the profit we get if consumer x responds, which is the consumer’s donation $d_R(x)$ minus the solicitation cost $c$; and $v_{NR}(x)$ is the value we get if consumer x does not respond, which is $-c$, the solicitation cost counted as a loss.

We always want this benefit to be greater than zero, so:

\begin{align}
&p(R|x)*v_R(x) +[1-p(R|x)]*v_{NR}(x) > 0 \\
&p(R|x)*(d_R(x)-c) +[1-p(R|x)]*(-c) > 0 \\
&p(R|x)*d_R(x) - c > 0 \\
&p(R|x)*d_R(x) > c
\end{align}

which is, **the expected donation should be greater than the solicitation cost**.


**Donation estimation**: through a regression model. Looking at historical data on consumers who have been targeted, we can use regression modeling to estimate how much a consumer will donate if she responds, $d_R(x)$.
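
The decision rule above can be sketched in a few lines of Python. This is only an illustration: the prospect list, the probabilities, the donation estimates, and the mailing cost are all made-up values, and in practice $p(R|x)$ and $d_R(x)$ would come from the fitted models.

```python
def expected_targeting_benefit(p_response, donation_if_response, cost):
    """Expected benefit of soliciting one consumer: p*(d - c) + (1 - p)*(-c)."""
    return p_response * (donation_if_response - cost) + (1.0 - p_response) * (-cost)


def should_target(p_response, donation_if_response, cost):
    """Target only when the expected donation exceeds the solicitation cost."""
    return p_response * donation_if_response > cost


# Hypothetical prospects: (name, estimated p(R|x), estimated donation d_R(x)).
prospects = [
    ("frequent small donor", 0.20, 5.0),
    ("rare large donor", 0.02, 100.0),
    ("unlikely small donor", 0.01, 20.0),
]
COST = 1.50  # assumed cost of one mailing

targeted = [name for name, p, d in prospects if should_target(p, d, COST)]
print(targeted)
```

With these invented numbers, the prospect with the highest response rate is not worth mailing, while the rare large donor is, which is why the response rate alone is not enough.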


## Selection Bias

If we use past donation data to estimate the donation amount, we may get more precise predictions. However, the data may well be biased, meaning that they are not a random sample from the population of all donors: they come only from individuals who did respond in the past, while we want to apply the model to the general population to find good prospects. Fortunately, there are **data science techniques** to help modelers deal with selection bias; see Zadrozny & Elkan (2001) and Zadrozny (2004) for an illustration of dealing with selection bias in this exact donation solicitation case study.

# Churn Example

## Business Understanding

**Decomposing the Business Problem and Recomposing the Solution Pieces:**

We would like to limit the amount of money we are losing—not simply to keep the most customers. Therefore, as in the donation problem, we want to take the value of the customer into account.


**Expected value framework:**

Unlike the donation case, we can't simply assume that the expected benefit of not targeting is 0. If we do not target and the customer leaves, we lose some profit. And if the customer stays anyway, we actually achieve higher value because we did not expend the cost of the incentive.

- Let’s call $u_S(x)$ the **profit from customer x if she stays**, not including the incentive cost; and $u_{NS}(x)$ the **profit from customer x if she leaves**, not including the incentive cost.

- Furthermore, for simplicity, let’s assume that when we target we incur the **incentive cost c** whether the customer stays or leaves.

- We indicate the different estimated probabilities of staying **S** by conditioning the probability on the two possibilities (**target, T, or not target, notT**).

**The expected benefit of targeting is:**

\begin{align}
EB_T(x) = p(S|x, T)(u_S(x)-c)+[1-p(S|x,T)](u_{NS}(x)-c)
\end{align}


**The expected benefit of not targeting is:**

\begin{align}
EB_{notT}(x) = p(S|x, notT)u_S(x)+[1-p(S|x,notT)]u_{NS}(x)
\end{align}

So, to complete our business problem formulation, we would like to target those customers for whom targeting brings the greatest expected benefit, that is, those with **the largest value of targeting: $VT = EB_T(x) - EB_{notT}(x)$**.


**If we simplify the problem by assuming that we get no value from a customer who does not stay ($u_{NS}(x) = 0$), then:**

\begin{align}
VT &= p(S|x, T)u_S(x)-p(S|x, notT)u_S(x)-c \\
&= [p(S|x, T)-p(S|x, notT)]u_S(x)-c \\
&= \Delta(p)u_S(x)-c
\end{align}

where Δ(p) is the difference in the predicted probabilities of staying, depending on whether the customer is targeted or not. We want to **target those customers who can have the greatest changes in their probability of staying because of targeting**.
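
As a rough sketch, the value of targeting can be computed directly from the two expected benefits. The customer names, probabilities, profit, and incentive cost below are invented for illustration, and the code assumes that no incentive cost is paid when the customer is not targeted, which is what the simplified $VT$ expression implies.

```python
def expected_benefit(p_stay, u_stay, u_leave, cost):
    """p(S)*(u_S - cost) + (1 - p(S))*(u_NS - cost)."""
    return p_stay * (u_stay - cost) + (1.0 - p_stay) * (u_leave - cost)


def value_of_targeting(p_stay_t, p_stay_not_t, u_stay, cost, u_leave=0.0):
    """VT = EB_T - EB_notT; the incentive cost is only paid when targeting."""
    eb_t = expected_benefit(p_stay_t, u_stay, u_leave, cost)
    eb_not_t = expected_benefit(p_stay_not_t, u_stay, u_leave, 0.0)
    return eb_t - eb_not_t


# Hypothetical customers: (name, p(S|x,T), p(S|x,notT), u_S(x)).
customers = [
    ("loyal anyway", 0.95, 0.93, 500.0),
    ("persuadable", 0.70, 0.40, 500.0),
    ("lost cause", 0.12, 0.10, 500.0),
]
INCENTIVE_COST = 50.0  # assumed cost c of the retention offer

worth_targeting = [
    name
    for name, p_t, p_not_t, u_s in customers
    if value_of_targeting(p_t, p_not_t, u_s, INCENTIVE_COST) > 0
]
print(worth_targeting)
```

In this toy example only the persuadable customer has a positive value of targeting, even though the loyal customer has the highest probability of staying.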


**To find those customers with the largest potential gain**, we want to build two separate probability estimation models:

1. The probability that a customer will stay if targeted

2. The probability that a customer will stay anyway, even if not targeted
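
A minimal sketch of the two-model idea follows. In place of real probability estimation models it uses a deliberately simple stand-in, the observed stay frequency per customer segment; the segment names and the historical records are hypothetical.

```python
from collections import defaultdict


def fit_stay_rate_by_segment(records):
    """Estimate p(stay | segment) as the observed stay frequency per segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [stayed, total]
    for segment, stayed in records:
        counts[segment][0] += int(stayed)
        counts[segment][1] += 1
    return {seg: stayed / total for seg, (stayed, total) in counts.items()}


# Hypothetical history of customers whose contracts expired: (segment, stayed?).
targeted_history = [
    ("heavy_user", True), ("heavy_user", True), ("heavy_user", True), ("heavy_user", False),
    ("light_user", True), ("light_user", False), ("light_user", False), ("light_user", False),
]
untargeted_history = [
    ("heavy_user", True), ("heavy_user", False), ("heavy_user", False), ("heavy_user", False),
    ("light_user", True), ("light_user", False), ("light_user", False), ("light_user", False),
]

p_stay_targeted = fit_stay_rate_by_segment(targeted_history)      # model 1: p(S | x, T)
p_stay_untargeted = fit_stay_rate_by_segment(untargeted_history)  # model 2: p(S | x, notT)

# Difference in the probability of staying attributable to targeting.
delta_p = {seg: p_stay_targeted[seg] - p_stay_untargeted[seg] for seg in p_stay_targeted}
print(delta_p)
```

Any model that outputs probabilities could replace the frequency table; the point is only that the two models are fitted on different samples and then differenced.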


## Data Science Solution


**Data needed:**

To develop those two probability estimation models, we need samples of customers who have **reached contract expiration**, so that each can be labeled with certainty as a 'stayed customer' or a 'churned customer'.

- For the first model, we need a sample of customers who were targeted with the offer. If the offer is new and no such sample exists, there are two fallbacks:

    - Perhaps Marketing came up with a similar, but not identical, offer in the past. If that offer was made to customers in a similar situation (and recall the selection bias concern discussed above), it may be useful to build a model using it as a proxy label.
    - Oversimplify by setting $p(S|x, T) = 1$, that is, assume that everyone who is given the offer stays with certainty.

- For the second model, we need a sample of customers who were not targeted with the offer.

Ideally each sample is representative of the customer base to which the model will be applied. If selection bias is present, there are techniques to correct for it.


**Things we need to be sure of:**

1. Nothing substantial has changed in the business environment or the whole industry that would call into question the use of historical data for churn prediction (the introduction of the iPhone only to AT&T customers would have been such an event for the other phone companies).

2. None of our customers was made some other offer that would affect the likelihood of churning.


## What can we do next:

1. Compare this model to the baseline model, which is our alternative, simple churn model.
2. Expand the formulation to include multiple offers, and judge which gives the best value for any particular customer.
3. Parameterize the offers (for example with a variable discount amount) and then work to optimize what discount will yield the best expected value. This would likely involve additional investment in data, running experiments to judge different customers’ probabilities of staying or leaving at different offer levels.
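
The multiple-offer extension could look something like the sketch below, which reuses the simplified value of targeting, $\Delta(p)u_S(x)-c$, once per offer. The offer names, their costs, and the estimated probabilities of staying are hypothetical; in practice each probability would come from a model or an experiment for that offer.

```python
def best_offer(p_stay_untargeted, u_stay, offers):
    """Return (offer_name, value) maximizing delta_p * u_S - cost, or (None, 0.0).

    `offers` maps an offer name to (estimated p(S | x, offer), cost of the offer).
    """
    best_name, best_value = None, 0.0  # not targeting is the zero baseline
    for name, (p_stay_with_offer, cost) in offers.items():
        value = (p_stay_with_offer - p_stay_untargeted) * u_stay - cost
        if value > best_value:
            best_name, best_value = name, value
    return best_name, best_value


# Hypothetical offers for one customer: name -> (p(S | x, offer), cost).
offers = {"10% discount": (0.55, 30.0), "free handset": (0.80, 200.0)}

print(best_offer(0.40, 500.0, offers))    # modest-value customer
print(best_offer(0.40, 2000.0, offers))   # high-value customer
print(best_offer(0.79, 500.0, offers))    # customer who would probably stay anyway
```

With these made-up numbers the cheaper offer wins for the modest-value customer, the expensive one wins for the high-value customer, and no offer is worth making to the customer who is likely to stay anyway.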

# Market-Basket Analysis / Co-occurrence Grouping and Association Discovery: Finding Items That Go Together


**Why**


- If we can capture the associations between consumer preferences, we can increase revenue from **cross-selling**. It also could enhance the consumer experience, and thus leverage our data asset to create additional **customer loyalty**.

- We built regional distribution centers to reduce shipping expense, but in practice we see that for many orders we end up either having to ship from the main distribution center anyway, or to make multiple deliveries for many orders. If there are particular less-popular items that co-occur often with the most popular items, these also could be stocked in the regional distribution centers, achieving a substantial reduction in our **shipping costs**.


**How to conduct co-occurrence grouping:**

1. Consider **complexity control**: there are likely to be a tremendous number of co-occurrences, many of which might simply be due to chance rather than to a generalizable pattern. So we have to place some constraints to reduce the total number of groups.

    - Require that an association apply to some minimum percentage of the data (its support); for example, only consider associations that appear in at least **0.01% of all orders**.
    
    - Require a certain minimum degree of likelihood for the associations. **The probability that B occurs when A occurs, p(B|A), should be above some threshold**, such as 5% (so that 5% or more of the time, a buyer of A also buys B).
    


2. **Criterion for grouping:** how to differentiate whether things occur together just by chance or because of associations.

    - With Lift, we can find associations that occur much more frequently than chance would dictate. The lift of the co-occurrence of A and B is the probability that **we actually see the two together**, compared to the probability that **we would see the two together if they were unrelated to (independent of) each other.**
    
    \begin{align}
    Lift(A,B) = \frac{p(A,B)}{p(A)p(B)}
    \end{align}

    - An alternative is leverage, which takes the difference between the same two quantities rather than their ratio. The two measures can disagree: a pair of rare items can have a very high lift but a tiny leverage, whereas a pair of very popular items can have a sizable leverage even when the lift is modest.
    
    \begin{align}
    Leverage(A,B) = p(A,B)-p(A)p(B)
    \end{align}
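
The measures above can be computed from raw baskets in a few lines. The sketch below uses a tiny invented set of ten baskets, and the function names are illustrative rather than taken from any library.

```python
def probability(baskets, *items):
    """Fraction of baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(wanted <= basket for basket in baskets) / len(baskets)


def confidence(baskets, a, b):
    """p(B|A): how often a buyer of A also buys B."""
    return probability(baskets, a, b) / probability(baskets, a)


def lift(baskets, a, b):
    """Observed co-occurrence relative to what independence would predict."""
    return probability(baskets, a, b) / (probability(baskets, a) * probability(baskets, b))


def leverage(baskets, a, b):
    """Observed co-occurrence minus what independence would predict."""
    return probability(baskets, a, b) - probability(baskets, a) * probability(baskets, b)


# Hypothetical purchase data: each basket is a set of items.
baskets = (
    [{"beer", "lottery"}] * 2
    + [{"beer"}] * 1
    + [{"lottery"}] * 2
    + [{"milk"}] * 5
)

print(probability(baskets, "beer", "lottery"))  # support of the pair
print(confidence(baskets, "beer", "lottery"))
print(lift(baskets, "beer", "lottery"))
print(leverage(baskets, "beer", "lottery"))
```

Here beer appears in 30% of baskets, lottery tickets in 40%, and both together in 20%, so the pair occurs more often than the 12% that independence would predict.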
    
    
## Associations Among Facebook Likes: 

We want to find patterns in the things people Like: do certain Likes tend to co-occur more frequently than we would expect by chance?

- Find associations that give the highest lift or highest leverage, 
- While filtering out associations that cover too few cases to be interesting.
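
One way to sketch this search is to count every pair of Likes, drop pairs whose support is too low, and sort the rest by lift. The users, the Like names, and the `min_support` threshold below are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def rank_like_pairs(users_likes, min_support=0.3):
    """Return (like_a, like_b, support, lift) rows sorted by descending lift."""
    n = len(users_likes)
    single = Counter(like for likes in users_likes for like in likes)
    pair = Counter(p for likes in users_likes for p in combinations(sorted(likes), 2))
    ranked = []
    for (a, b), count in pair.items():
        support = count / n
        if support < min_support:
            continue  # covers too few users to be interesting
        pair_lift = support / ((single[a] / n) * (single[b] / n))
        ranked.append((a, b, support, pair_lift))
    return sorted(ranked, key=lambda row: row[3], reverse=True)


# Hypothetical data: one set of Likes per user.
users_likes = [
    {"band_x", "band_y"},
    {"band_x", "band_y", "news"},
    {"news", "tv"},
    {"news", "tv", "band_x"},
    {"news"},
]

for a, b, support, pair_lift in rank_like_pairs(users_likes):
    print(f"{a} & {b}: support={support:.2f}, lift={pair_lift:.2f}")
```

Sorting by leverage instead would only require changing the quantity stored in the last position of each row.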

# Link Prediction

## Social Recommendation


***To predict that a link should exist between two individuals:***



**First Solution:** Build a **function** that scores the existence or strength of a link between two individuals:

- Define a similarity measure between two individuals, for example based on their common friends, weighting the friends by the amount of communication, geographical proximity, or some other factor

- Find or devise a similarity function that takes these strengths into account

- Use this friend strength as one aspect of similarity while also including others, such as shared interests, shared demographics, etc.
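
As one possible reading of the first solution, the sketch below scores a pair of people by their common friends, weighting each common friend by the weaker of the two tie strengths. Both the data and the choice of `min` as the combining rule are assumptions made for illustration; the notes do not prescribe a particular function.

```python
def friend_similarity(strengths, a, b):
    """Sum, over common friends, of the weaker of the two tie strengths."""
    common = strengths.get(a, {}).keys() & strengths.get(b, {}).keys()
    return sum(min(strengths[a][f], strengths[b][f]) for f in common)


# Hypothetical tie strengths (for example, scaled amount of communication).
strengths = {
    "ann": {"carl": 0.9, "dana": 0.2},
    "bob": {"carl": 0.8, "dana": 0.5, "eve": 1.0},
    "fay": {"eve": 0.3},
}

print(friend_similarity(strengths, "ann", "bob"))  # two common friends
print(friend_similarity(strengths, "ann", "fay"))  # no common friends
```

Other aspects of similarity, such as shared interests or demographics, could be added as further terms in the same score.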


**Second Solution:** Since we want to predict the existence (or strength) of a link, we might well decide to cast the task as a **predictive modeling problem**:

- **Instance**: We want to predict the existence of a relationship between two people. So, an instance should be a pair of people

- **Target variable**: Whether the relationship exists, or would be formed if recommended

- **Data**: Get training data where links already do or do not exist, or if we wanted to be more careful, we could invest in acquiring labels specifically for the recommendation task

- **Features**: Features of the pair of people, such as how many common friends the two individuals have, what is the similarity in their interests, and so on
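
The predictive modeling formulation can be made concrete by building one instance per pair of people. The sketch below is illustrative only: the friendship and interest data are invented, and the two features (number of common friends and Jaccard similarity of interests) are just examples of pair features.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_pair_instances(friends, interests):
    """One row per unordered pair of people: features plus the link label."""
    rows = []
    for u, v in combinations(sorted(friends), 2):
        rows.append({
            "pair": (u, v),
            "common_friends": len(friends[u] & friends[v]),
            "interest_similarity": jaccard(interests[u], interests[v]),
            "linked": v in friends[u],  # target variable
        })
    return rows


# Hypothetical social graph and interests.
friends = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "carl"},
    "carl": {"ann", "bob", "dana"},
    "dana": {"carl"},
}
interests = {
    "ann": {"jazz", "hiking"},
    "bob": {"jazz", "chess"},
    "carl": {"hiking"},
    "dana": {"chess", "jazz"},
}

for row in build_pair_instances(friends, interests):
    print(row)
```

The resulting rows could be fed to any classifier; unlinked pairs with high predicted link probability, such as bob and dana here, would be candidates for a recommendation.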


# Movie Recommendation


**What would the features be for the relationship between a user and a movie?**


- One solution is to base the model on **latent** (relevant but not observed explicitly in the data) dimensions underlying the preferences. We can then **represent each movie** as a feature vector using the latent dimensions, and also **represent each user’s preferences** as a feature vector using the same latent dimensions.

> **Examples of latent dimensions of movie preference**: 

> - Possible characterizations like serious versus escapist, comedy versus drama, orientation towards children, or gender orientation
> - Ill-defined things like depth of character development or quirkiness

- Then it is easy to find movies to recommend to any user: compute a **similarity score** between the user and all the movies; the movies that best match the user’s preferences would be those movies most similar to the user, when both are represented by the **same latent dimensions**.

- **How to decide the latent dimension:** We represent the similarity calculation between a user and a movie as a function using some number d of as-yet-unknown latent dimensions. Each dimension would be represented by **a set of weights (the coefficients) on each movie** and **a set of weights on each customer**. A high weight would mean this dimension is strongly associated with the movie or the customer. The meaning of the dimension would be purely implicit in the weights on the movies and customers. 

> For example, we might look at the movies that are weighted highly on some dimension versus low-weighted movies and decide, “the highly rated movies are all ‘quirky.’” In this case, we could think of the dimension as the degree of quirkiness of the movie, although it is important to keep in mind that this interpretation of the dimension was imposed by us.
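
To make the scoring step concrete, here is a minimal sketch with d = 2 latent dimensions. The weights are invented rather than learned, the movie names are placeholders, and using the dot product as the similarity score is an assumption; another similarity function could be substituted.

```python
def similarity(user_vector, movie_vector):
    """Dot product of the user's and the movie's latent-dimension weights."""
    return sum(u * m for u, m in zip(user_vector, movie_vector))


# Hypothetical weights on two unnamed latent dimensions.
movie_weights = {
    "movie_a": (0.9, 0.1),
    "movie_b": (-0.8, 0.3),
    "movie_c": (0.2, 0.5),
}
user_weights = (0.7, 0.6)

# Rank movies from most to least similar to the user.
ranked = sorted(
    movie_weights,
    key=lambda title: similarity(user_weights, movie_weights[title]),
    reverse=True,
)
print(ranked)
```

A movie scores highly when it has large weights on the same dimensions, and with the same sign, as the user.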



<img src="./images/129.png" width=600>

- The latent dimension represented by the horizontal axis seems to separate the movies into drama-oriented films on the right and action-oriented films on the left. 

- A customer also would be placed somewhere in the space, based on the movies she has rented or rated. The closest movies to the position of the customer would be good candidates for making recommendations.

- Note that for making recommendations, as always we need to keep thinking back to our business understanding. For example, different movies have different profit margins, so we may want to combine this knowledge with the knowledge of the most similar movies.

