In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
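# In newer matplotlib, 'savefig.dpi' may be the string 'figure'; fall back to 'figure.dpi'.
base_dpi = matplotlib.rcParams['savefig.dpi']
if base_dpi == 'figure':
    base_dpi = matplotlib.rcParams['figure.dpi']
matplotlib.rcParams['savefig.dpi'] = 2 * base_dpi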

# The purpose of case studies
We're going to practice some "case studies" that combine thinking about business questions with data-driven (data science) solutions to them.  Depending on the form of the question, working through such a case will involve some or all of the following steps:

**Given:** A business question or problem.

**Step 1:** _Structure the problem_.  This means hypothesizing the key inputs to understanding the question or problem, and/or the levers and metrics that need to be tracked to address it.  This may be a place to draw a tree or write down a formula, to break something complicated into simpler pieces.

**Step 2:** _Plan the data_. Identify what existing (or hypothetical) data you will want to look at or collect.  Consider data availability / quality / cost / etc.

**Step 3:** _Plan the analysis_.  Is there a single model, or several related sub-models?  For each: Is it a regression or a classification problem, supervised or unsupervised?  What are some plausible features?  How will you test it?


We will talk through the following sorts of examples:
- Default risk
- Web user growth issues
- Value of a new feature
- Advertising
- Fraud

In [None]:
import mistune
from IPython.display import HTML

## An online dating platform

**The question**: Dater is an online dating platform.  They experienced a tremendous amount of user growth in their first two years but growth has recently declined.  Help them understand what's happening and how to turn it around.

**Spoilers / structure below**:

In [None]:
md_sec="""
####Dhrfgvbaf gb nfx:
1. Ner crbcyr ratntvat jvgu gur jrofvgr (jung ner gur evtug zrgevpf sbe ratntrzrag, cebsvyr ivrjf cre ivfvg? ivfvgf cre jrrx?)
1. Ubj qvq crbcyr hfr gb urne nobhg gur jrofvgr (creuncf erpbzzraq n sevraq sbe Qngre vf abj oebxra?)
1. Bs cerivbhf hfref jub wbvarq gur jrofvgr, jung qvq byq hfref qb va gurve svefg jrrx gung znqr gurz zber yvxryl gb or ybat grez npgvir zrzoref?  Jevgr n syvegngvbhf zrffntr?  Ohl tvsgf sbe zrzoref?  Ivrj ybgf bs cntrf?
1. Pna jr genpx hfre fngvfsnpgvba?  Ner hfref whfg abg fngvfsvrq jvgu gur freivpr naq jbeq vf trggvat bhg?  Ubj jbhyq lbh zrnfher fngvfsnpgvba?  Ba n qngvat jrofvgr, vg'f uneq gb xabj vs crbcyr tb ba n qngr.
1. Pna jr ernpgvingr yncfrq hfref?  Ybbx ng rkvfgvat yncfrq hfref jub unir er-ratntrq naq frr jung gurl ner qbvat.
1. Fyvpr gur ybffrf (rvgure sebz yncfrq hfref be fybjre hfre npdhvfvgvbaf) ol qrzbtencuvpf.  Vs gur ybffrf ner pbapragengrq va n fcrpvsvp qrzbtencuvp, ner gurer pbzcrgvgbef va gung fcnpr?
"""

# HTML(mistune.markdown(md_sec.encode('rot13')))

## A new feature on a website

We're launching a new feature on our website.  The feature is a recommendation engine that should get people to visit more of our content pages.  What is the value of this new feature?

In [None]:
md_sec="""
###Svefg fgno ng na nafjre:

Urer'f gur fvzcyrfg cbffvoyr zbqry -- jurer jr nffhzr gung gur dhrfgvba vf npghnyyl pbeerpg va vgf sbezhyngvba: 
Gung vf, gung gur znva vzcnpg bs gur erpbzzraqngvba ratvar vf vapernfvat gur ahzore bs cntrf ivfvgrq cre hfre.

####Guvatf lbh fubhyq nfx sbe:
- Ubj znal cntrf qvq n hfre glcvpnyyl ivfvg cre ivfvg gb gur fvgr, cevbe gb gur arj srngher? (10 cntrf - ubj jbhyq lbh svther guvf bhg?)
- Ubj znal cntrf qb lbh rkcrpg n hfre gb ivfvg jvgu gur arj srngher va cynpr?  (12 cntrf - ubj jbhyq lbh svther guvf bhg?)
- Vf gurer na vapernfr va serdhrapl bs ivfvgf ol gur fnzr hfre? (Jr xabj sebz cevbe rkcrevrapr gung hfref jub frr 12 cntrf pbzr onpx va bar zbagu engure guna gjb -- JNEAVAT, cbbe pbzcnenoyr.)
- Ubj qbrf gur jrofvgr zbargvmr?  (Jr'er na r-pbzzrepr cyngsbez, jr fryy fghss.)
- Jung'f gur chepunfr engr cre cntr ivrj (Vg'f pheeragyl ng 0.01% -- JNEAVAT, cbbe pbzcnenoyr.)
- Jung vf gur glcvpny onfxrg fvmr (\$100).
- Ubj znal ivfvgbef qb jr trg n zbagu?  (1 zvyyvba).

####Nafjre:
- Rkgen erirahr cre zbagu = \$100 k 0.01% k 1 zvyyvba ivfvgf k 10 ivrjf k (20% = yvsg sebz zber cntrf cre ivfvg) k (2 = yvsg sebz serdhrapl bs ergheaf) = \$40X.
- Jung vf jrnx nobhg gurfr nffhzcgvbaf?"""

# HTML(mistune.markdown(md_sec.encode('rot13')))

In [None]:
md_sec="""
## Freivpr inyhr pbafvqrengvbaf

_Yvsrgvzr inyhr bs npgvir hfre_ = (# bs ivfvgf) * (# pbairefvbaf cre ivfvg) * (# inyhr bs pbairefvba)

_Yvsrgvzr inyhr bs freivpr_ = (# bs npgvir hfref) * (Yvsrgvzr inyhr bs npgvir hfre)

_Qrzbtencuvpf_: traqre, enpr, vapbzr, trbtencul, ntr.
(r.t., vzntvar lbhe srngher vzcnpgrq bar qrzbtencuvp zber guna bguref)

_Pbairefvba_ = chepunfr be nq pyvpx, qrcraqvat ba zbqry.

_Pbfg gb nggnva n hfre_
- Pbhcbaf
- Yrnq trarengvba (zbfgyl Tbbtyr, FRB, pbairefvba engr sbe nq jbeqf)

_Rknzcyrf_
- Fryyvat n jvqtrg bayvar / zrqvn pbagrag pbzcnal.
- Svkrq naq inevnoyr pbfgf.
- Aba-svanapvny zrgevp
- Npgvir hfref if. aba-npgvir hfref.  Xrrc crbcyr sebz yrnivat.
- Pnaavonyvmvat lbhe bja srngherf
- Pbzcyrzragnel tbbq gb rkvfgvat srngherf
- Pbzcyrzragnel if pbzcrgvgvir sbe hfref.
- Hcfryyvat.

_Bgure Pbafvqrengvbaf_:

- Ivenyvgl pbzcbarag (gb freivprf, nqf).  
- Qverpg inyhr bs hfre qngn.
- Fjvgpuvat pbfg sbe hfref.
"""

# HTML(mistune.markdown(md_sec.encode('rot13')))

## Coupon profitability

Our company ran a trial ad offering online coupons for \$20 off our widgets. The trial run had 1000 redemptions.  Should we scale up the program?


In [None]:
md_sec = """
####Guvatf lbh fubhyq nfx sbe:
1. Ubj zhpu qvq gubfr bayvar nqf pbfg: pbfg cre npgvba vf \$5.
1. Jung vf gur znetvany cebsvg ba n jvqtrg: \$10
1. Ner hfref jr trg erpheevat?  Vs fb ubj znal zber chepunfrf qb jr rkcrpg gurz gb znxr: gur nirentr hfre znxrf 5 chepunfrf ba bhe jrofvgr (JNEAVAT - vf bhe uvfgbevpny hfre cbby n tbbq tnhtr)
1. Ner jr pnaavonyvmvat rkvfgvat fnyrf?  (Vs guvf vf n fznyy senpgvba bs fnyrf, guvf vf uneq gb gryy.  Vs gur pbhcba jnf zrnag gb gnetrg n arj qrzbtencuvp, lbh pbhyq frr vs gur erqrzcgvbaf jrer sebz crbcyr jub orybat gb gur qrzbtencuvp).
"""

# HTML(mistune.markdown(md_sec.encode('rot13')))

## Credit card insurance

A credit card company is considering issuing default insurance.  In exchange for a 1% monthly fee, they will offer to pay off your balance in the event of certain specific adverse events: if you lose your job or are disabled.  The question is: what is the profitability of this?

Follow-up question: What about the profitability of a plain credit card offering?


In [None]:
md_sec = """
#### Guvatf lbh fubhyq nfx sbe:
- Jung vf gur pbfg bs phfgbzre npdhvfvgba: 50 pragf sbe n znvyre naq 1% erfcbafr (ubj jbhyq lbh rfgvzngr gur erfcbafr engr)
- Jung vf gur pbfg bs pynvzf bire pbfg bs qrsnhyg: 5% pynvzf engr, qrsnhyg vf 3%.
- Jung vf gur grez bs gur ybna: 12 zbaguf
- Jung vf gur glcvpny zbaguyl onynapr: $1000

### Cebsvgnovyvgl: 
- cre obeebjre: \$120 erirahr - (\$20 bs jevgr-bssf + \$50 sbe phfgbzre npdhvfvgvba).
- Abgr: ab pbfg bs pncvgny
- Abgr: fubhyq zragvba zbeny unmneq ceboyrz.  Jvgu zbeny unmneq, lbh fubhyq nffhzr lbhe pbfgf tb jnl hc -- nyzbfg pregnvayl rabhtu gb xvyy gur cebqhpg.
"""

# HTML(mistune.markdown(md_sec.encode('rot13')))

## A LinkedIn ad campaign
-------------------------

Your company is considering using LinkedIn to send messages to prospective students.  A message costs \$1 to send.  We can also target messages based on user profiles and we know if they opened the message.  Who should we target the campaign to, how much should we expect to spend, and how many customers do we expect to get?



In [None]:
md_sec = """
####Nafjref
1. Fraq n fznyy pnzcnvta gb crbcyr lbh guvax ner yvxryl gb pbaireg onfrq ba nq-ubp ehyrf.  Hfr guvf nf genvavat qngn.  Ubj qb lbh guvax nobhg gur fvmr bs guvf pnzcnvta?
1. Eha n pynffvsvre nytbevguz gb vqragvsl jub'f zbfg yvxryl gb erfcbaq onfrq ba hfre cebsvyrf.
1. Nfx sbe gur yvsrgvzr inyhr bs n phfgbzre: $20.  Bayl fraq zrffntrf gb crbcyr nobir n pbairefvba cebonovyvgl guerfubyq (juvpu jvyy arire or yrff guna 5%).
1. Hfvat gur cerpvfvba genqrbss pheir, jr fubhyq or noyr gb pbzchgr gur nafjref sbe inevbhf crepragntr phgbssf.

####Guvatf lbh fubhyq abgr
1. N/O grfg zrffntrf
1. Fryrpgvba ovnf va gur qrfvta bs lbhe fznyy pnzcnvta:
    1. Lbh qba'g jnag gb fraq zrffntrf gb crbcyr jub ner abg yvxryl gb erfcbaq (orpnhfr lbh'er abg yrneavat nalguvat) ohg lbh zvtug nyfb zvff n tebhc bs sbyxf jub pbhyq erfcbaq jryy.
    1. Grzcbeny rssrpgf (erfcbafr ba qvssrerag qnlf bs gur jrrx inel).
1. Gurer'f n ybg bs crbcyr ba YvaxrqVa: gnetrg cerpvfvba, abg erpnyy vavgvnyyl gb inyvqngr gur pbaprcg.

####Rkgrafvbaf
1. Jung vs jr ner guvaxvat nobhg ohlvat Tbbtyr nq jbeqf?  Ebhtuyl fcrnxvat, guvf vf gur fnzr ceboyrz rkprcg nyy havgf ner cre havg gvzr.
1. Nq rssrpgvirarff tbrf qbja jvgu rkcbfherf fb lbhe ryvtvoyr cbchyngvba vf qrpernfvat.
"""

# HTML(mistune.markdown(md_sec.encode('rot13')))

## Fraud detection

What's the right metric?  Do you have ground truth?

People usually care about true positives versus false positives.
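
One way to make the metric question concrete is to compute several candidate metrics from the same confusion matrix.  The sketch below uses made-up counts (hypothetical, not from any real system) in which fraud is rare; `fraud_metrics` is an illustrative helper name, not part of any library.

```python
def fraud_metrics(tp, fp, fn, tn):
    """Return common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # share of flagged cases that are fraud
        "recall": tp / (tp + fn),               # share of fraud cases that get flagged
        "false_positive_rate": fp / (fp + tn),  # share of legitimate cases that get flagged
    }

# Hypothetical counts: 10,000 transactions, of which 100 are fraudulent.
metrics = fraud_metrics(tp=80, fp=120, fn=20, tn=9780)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, accuracy stays high even though most flagged transactions are legitimate, which is one reason accuracy alone is usually a poor choice when the positive class is rare.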

### More questions

1. You are an employer and you're finding that some of your employees are becoming dissatisfied and leaving.  How would you predict who is most at risk for leaving soon?  How would you use this information?

1. Online advertising works based on an auction where the auction bid (score) is the product of your predicted conversion rate and the ad unit's bid.  Write an algorithm to determine the optimal bid.

1. We are an advertising company that delivers ads based on a keyword system.  People type search terms into the browser and, based on pre-specified keywords the advertiser has chosen, the advertiser is eligible to show ads against certain queries.  We have a large advertiser coming onboard and our sales team needs to know how much of an advertising budget to request from them for the next quarter.  It's bad to under-deliver but we cannot ask for a larger advertising budget later in the quarter -- how large a budget should we ask for?

1. How would you track user engagement with an article on our website?

1. How would you build a recommender system for venues in FourSquare?

1. If you have historical drug spending data, how can you predict which consumers are likely to adhere to their drug regimens?

1. If you have predictions of treatment regimen adherence, how would you determine how much of a discount you should offer an individual?

1. How would you use social media data to improve offering of personal loans?

1. How would you find duplicate entries in a contacts book?

1. You are an online retailer that shows users items from a broad catalog.  How do you measure user fatigue / novelty of items, and how would you determine the optimal number of times to show someone an item to get them to purchase it online?

1. Given certain precision and recall numbers and the value of a mailer you send out, could you calculate the return?

### More general notes

## User growth problems
----------------------

**Main user segments:**
1. New users
1. Retaining existing users
1. Re-engaging lapsed users

**The general technique:** (for any of the 3 segments)  Look at historical users and what explained their success:
1. Demographic factors (gender, race, income, geography, age).  Most applications ask for gender and age.  Geography can sometimes be inferred from IP address.  Race and income are difficult.
1. Actions taken (or not taken) on the website immediately after sign-on (e.g. add friends, page views, "likes", check-ins, leave a review).
1. Social factors (in a social network), how active are their friends?
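
As a minimal sketch of this technique, one can compare long-term retention between historical users who did and did not take a given first-week action.  The records, the field names (`added_friend`, `retained`), and the helper `retention_by_action` are all hypothetical; in practice they would come from your own user logs.

```python
from collections import defaultdict

# Hypothetical historical users: one first-week action and a long-term outcome.
users = [
    {"added_friend": True, "retained": True},
    {"added_friend": True, "retained": True},
    {"added_friend": True, "retained": False},
    {"added_friend": False, "retained": False},
    {"added_friend": False, "retained": False},
    {"added_friend": False, "retained": True},
]

def retention_by_action(users, action):
    """Retention rate for users who did / did not take `action` in week one."""
    counts = defaultdict(lambda: [0, 0])  # action value -> [retained, total]
    for user in users:
        bucket = counts[user[action]]
        bucket[0] += int(user["retained"])
        bucket[1] += 1
    return {value: retained / total for value, (retained, total) in counts.items()}

rates = retention_by_action(users, "added_friend")
print(rates)
```

A gap between the two rates is only a correlation.  An A/B test of a prompt that encourages the action would be needed to argue that the action itself drives retention.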

**Actions you can take:**
1. Email users (targeted messages, do certain types of users respond better to messages?).
1. Customize the feed: most websites have a feed of content.  The content should be compelling to users.
1. Pop-ups and other attention-getting instruments to drive desired behavior (e.g. add a friend).
1. Social: get friends who are established on the website to engage with the user through the site.  Even if the action is not valuable for helping to retain the friend, it can be helpful to engage the new user.

**Important takeaway:** The global metric is clear (user growth).  However, you often need to break it down into intermediate local metrics (e.g. adding a friend, clicking on content etc.) and then understand how those local metrics impact the primary metric.

**General proviso:** You probably don't need machine learning.  Explainability is more important than precision in this kind of exercise, because the actionable insights gained have to be implemented by other teams.  A big exception is a "feed" where the recommendations can be machine-learned to optimize for a metric.

## What is the value of a customer?
-----------------------------------

Remember, the scientific metric is not always the business metric!

1. Credit customer question, Components:
    1. Cost of customer acquisition
    1. Default Rates, recovery amount
    1. Premiums earned
1. Consumer Internet User value question
    1. Ads shown * Click-through rate * average spend / Monthly Actives
    1. Given a new model, how would you project increase in revenue (answer: lift in precision @ 1, A/B Test).
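
The consumer internet formula above can be sketched as a small function.  All inputs below are hypothetical, and reading "average spend" as spend per click is an assumption made for illustration.

```python
def value_per_active_user(ads_shown, click_through_rate, average_spend, monthly_actives):
    """Monthly ad value per active user: ads shown * CTR * average spend / monthly actives."""
    return ads_shown * click_through_rate * average_spend / monthly_actives

# Hypothetical monthly figures.
value = value_per_active_user(
    ads_shown=5_000_000,
    click_through_rate=0.01,   # 1% of ads shown are clicked
    average_spend=0.50,        # dollars per click (assumed interpretation)
    monthly_actives=100_000,
)
print(f"Value per monthly active user: ${value:.2f}")
```

Writing the formula this way makes it easy to see which lever a proposed model is supposed to move, for example the click-through rate, while the other inputs are held fixed.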

### Exit Tickets
1. What are three common assumptions you should expect to run into?
1. How do you distinguish between the effects of your actions and external factors?
1. How do you calculate the value of a user?

### Spoiler generation...

In [None]:
import codecs

# ROT13-encode the spoiler text; in Python 3 the 'rot13' codec is text-to-text, so use codecs.encode.
print(codecs.encode("""
####Questions to ask:
1. Are people engaging with the website (what are the right metrics for engagement, profile views per visit? visits per week?)
1. How did people use to hear about the website (perhaps recommend a friend for Dater is now broken?)
1. Of previous users who joined the website, what did old users do in their first week that made them more likely to be long term active members?  Write a flirtatious message?  Buy gifts for members?  View lots of pages?
1. Can we track user satisfaction?  Are users just not satisfied with the service and word is getting out?  How would you measure satisfaction?  On a dating website, it's hard to know if people go on a date.
1. Can we reactivate lapsed users?  Look at existing lapsed users who have re-engaged and see what they are doing.
1. Slice the losses (either from lapsed users or slower user acquisitions) by demographics.  If the losses are concentrated in a specific demographic, are there competitors in that space?
""", 'rot13'))

In [None]:
import codecs

# ROT13-encode the spoiler text; the raw string keeps the markdown "\$" escapes intact.
print(codecs.encode(r"""
###First stab at an answer:

Here's the simplest possible model -- where we assume that the question is actually correct in its formulation: 
That is, that the main impact of the recommendation engine is increasing the number of pages visited per user.

####Things you should ask for:
- How many pages did a user typically visit per visit to the site, prior to the new feature? (10 pages - how would you figure this out?)
- How many pages do you expect a user to visit with the new feature in place?  (12 pages - how would you figure this out?)
- Is there an increase in frequency of visits by the same user? (We know from prior experience that users who see 12 pages come back in one month rather than two -- WARNING, poor comparable.)
- How does the website monetize?  (We're an e-commerce platform, we sell stuff.)
- What's the purchase rate per page view (It's currently at 0.01% -- WARNING, poor comparable.)
- What is the typical basket size (\$100).
- How many visitors do we get a month?  (1 million).

####Answer:
- Extra revenue per month = \$100 x 0.01% x 1 million visits x 10 views x (20% = lift from more pages per visit) x (2 = lift from frequency of returns) = \$40K.
- What is weak about these assumptions?
""", 'rot13'))
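
The arithmetic in the answer above can be checked with a few lines of Python.  This is only a sketch that reproduces the stated formula with the illustrative figures from the case; the variable names are made up for this example, and none of the inputs are measured data.

```python
# Illustrative figures from the case, not measured data.
basket_size = 100.0             # dollars per purchase
purchase_rate = 0.0001          # 0.01% of page views end in a purchase
monthly_visits = 1_000_000
pages_per_visit_before = 10
pages_per_visit_after = 12
frequency_lift = 2              # users return monthly instead of every two months

# Relative lift from more pages per visit: (12 - 10) / 10 = 20%.
pages_lift = (pages_per_visit_after - pages_per_visit_before) / pages_per_visit_before

baseline_revenue = basket_size * purchase_rate * monthly_visits * pages_per_visit_before
extra_revenue = baseline_revenue * pages_lift * frequency_lift

print(f"Baseline revenue per month: ${baseline_revenue:,.0f}")
print(f"Extra revenue per month:    ${extra_revenue:,.0f}")
```

Keeping each assumption as a named variable makes it easy to vary the weakest inputs, such as the purchase rate or the frequency lift, and see how sensitive the \$40K figure is to them.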

In [None]:
import codecs

# ROT13-encode the spoiler text; in Python 3 the 'rot13' codec is text-to-text, so use codecs.encode.
print(codecs.encode("""
## Service value considerations

_Lifetime value of active user_ = (# of visits) * (# conversions per visit) * (# value of conversion)

_Lifetime value of service_ = (# of active users) * (Lifetime value of active user)

_Demographics_: gender, race, income, geography, age.
(e.g., imagine your feature impacted one demographic more than others)

_Conversion_ = purchase or ad click, depending on model.

_Cost to attain a user_
- Coupons
- Lead generation (mostly Google, SEO, conversion rate for ad words)

_Examples_
- Selling a widget online / media content company.
- Fixed and variable costs.
- Non-financial metric
- Active users vs. non-active users.  Keep people from leaving.
- Cannibalizing your own features
- Complementary good to existing features
- Complementary vs competitive for users.
- Upselling.

_Other Considerations_:

- Virality component (to services, ads).  
- Direct value of user data.
- Switching cost for users.
""", 'rot13'))
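
The two lifetime value formulas above translate directly into code.  The function names and every number below are hypothetical, chosen only to show how the pieces multiply together.

```python
def lifetime_value_of_user(visits, conversions_per_visit, value_per_conversion):
    """Lifetime value of one active user."""
    return visits * conversions_per_visit * value_per_conversion

def lifetime_value_of_service(active_users, user_ltv):
    """Lifetime value of the whole service."""
    return active_users * user_ltv

# Hypothetical inputs: 50 lifetime visits, 2% conversion per visit, $30 per conversion.
user_ltv = lifetime_value_of_user(visits=50, conversions_per_visit=0.02,
                                  value_per_conversion=30.0)
service_ltv = lifetime_value_of_service(active_users=10_000, user_ltv=user_ltv)

print(f"Lifetime value per active user: ${user_ltv:,.2f}")
print(f"Lifetime value of the service:  ${service_ltv:,.0f}")
```

A new feature can then be valued by asking which of these inputs it changes and recomputing the product, while subtracting the cost to attain a user where that applies.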

In [None]:
import codecs

# ROT13-encode the spoiler text; the raw string keeps the markdown "\$" escapes intact.
print(codecs.encode(r"""

#### Things you should ask for:
1. How much did those online ads cost: cost per action is \$5.
1. What is the marginal profit on a widget: \$10
1. Are users we get recurring?  If so how many more purchases do we expect them to make: the average user makes 5 purchases on our website (WARNING - is our historical user pool a good gauge)
1. Are we cannibalizing existing sales?  (If this is a small fraction of sales, this is hard to tell.  If the coupon was meant to target a new demographic, you could see if the redemptions were from people who belong to the demographic).

""", 'rot13'))
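
One possible way to combine these inputs is sketched below.  It rests on several assumptions that the case does not state: the \$10 marginal profit is measured before the coupon discount, the coupon applies only to the first purchase, the 5 purchases include that first one, and a cannibalized redemption (an existing customer who would have bought anyway) brings no incremental purchases.  The parameter `cannibalized_share` is hypothetical.

```python
coupon_value = 20.0       # dollars off per redemption
cost_per_action = 5.0     # ad cost per redemption
marginal_profit = 10.0    # profit per widget, assumed to be before the coupon
purchases_per_user = 5    # assumed to include the couponed first purchase
redemptions = 1000

def net_profit_per_redemption(cannibalized_share):
    """Expected incremental profit per redemption under the assumptions above."""
    # New recurring customer: loses money on the first purchase, earns it back later.
    new_customer = (marginal_profit - coupon_value - cost_per_action
                    + (purchases_per_user - 1) * marginal_profit)
    # Existing customer who would have bought anyway: coupon and ad are pure cost.
    existing_customer = -(coupon_value + cost_per_action)
    return ((1 - cannibalized_share) * new_customer
            + cannibalized_share * existing_customer)

for share in (0.0, 0.25, 0.5):
    per_redemption = net_profit_per_redemption(share)
    print(f"cannibalized share {share:.0%}: "
          f"${per_redemption:,.2f} per redemption, "
          f"${per_redemption * redemptions:,.0f} for the trial")
```

Under these assumptions the program only pays off if customers really do come back, and the result is very sensitive to how many redemptions come from existing buyers.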

In [None]:
import codecs

# ROT13-encode the spoiler text; the raw string keeps the markdown "\$" escapes intact.
print(codecs.encode(r"""

#### Things you should ask for:
- What is the cost of customer acquisition: 50 cents for a mailer and 1% response (how would you estimate the response rate)
- What is the cost of claims over cost of default: 5% claims rate, default is 3%.
- What is the term of the loan: 12 months
- What is the typical monthly balance: $1000

#### Profitability: 
- per borrower: \$120 revenue - (\$20 of write-offs + \$50 for customer acquisition).
- Note: no cost of capital
- Note: should mention moral hazard problem.  With moral hazard, you should assume your costs go way up -- almost certainly enough to kill the product.

""", 'rot13'))
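
The per-borrower figures above can be rebuilt from the inputs as follows.  This is a sketch under one reading of the case: the \$20 of write-offs is taken to be the excess of the claims rate over the baseline default rate, applied to the typical balance.  The variable names are made up for this example.

```python
monthly_fee_rate = 0.01     # 1% monthly insurance fee
typical_balance = 1000.0    # dollars
term_months = 12
mailer_cost = 0.50          # dollars per mailer
response_rate = 0.01        # 1% of mailers convert
claims_rate = 0.05
default_rate = 0.03

revenue = monthly_fee_rate * typical_balance * term_months
acquisition_cost = mailer_cost / response_rate
# Assumed reading: only claims in excess of ordinary defaults are a new cost.
write_offs = (claims_rate - default_rate) * typical_balance
profit = revenue - (write_offs + acquisition_cost)

print(f"Revenue per borrower:          ${revenue:,.2f}")
print(f"Acquisition cost per borrower: ${acquisition_cost:,.2f}")
print(f"Write-offs per borrower:       ${write_offs:,.2f}")
print(f"Profit per borrower:           ${profit:,.2f}")
```

The calculation ignores the cost of capital and moral hazard, both of which the notes above flag; raising `claims_rate` is a quick way to see how fast moral hazard could erase the margin.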

In [None]:
import codecs

# ROT13-encode the spoiler text; in Python 3 the 'rot13' codec is text-to-text, so use codecs.encode.
print(codecs.encode("""
####Answers
1. Send a small campaign to people you think are likely to convert based on ad-hoc rules.  Use this as training data.  How do you think about the size of this campaign?
1. Run a classifier algorithm to identify who's most likely to respond based on user profiles.
1. Ask for the lifetime value of a customer: $20.  Only send messages to people above a conversion probability threshold (which will never be less than 5%).
1. Using the precision tradeoff curve, we should be able to compute the answers for various percentage cutoffs.

####Things you should note
1. A/B test messages
1. Selection bias in the design of your small campaign:
    1. You don't want to send messages to people who are not likely to respond (because you're not learning anything) but you might also miss a group of folks who could respond well.
    1. Temporal effects (response on different days of the week vary).
1. There's a lot of people on LinkedIn: target precision, not recall initially to validate the concept.

####Extensions
1. What if we are thinking about buying Google ad words?  Roughly speaking, this is the same problem except all units are per unit time.
1. Ad effectiveness goes down with exposures so your eligible population is decreasing.
""", 'rot13'))
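
The 5% floor in the answers above is the break-even point: a \$1 message is worth sending only if the conversion probability times the \$20 lifetime value is at least \$1.  The sketch below applies that rule to a handful of hypothetical classifier scores; `campaign_plan` and the scores are illustrative, not output from a real model.

```python
message_cost = 1.0     # dollars per message
customer_ltv = 20.0    # lifetime value of a converted customer

# Break-even: p * customer_ltv >= message_cost  =>  p >= 1 / 20 = 5%.
break_even_probability = message_cost / customer_ltv

def campaign_plan(probabilities, threshold=break_even_probability):
    """Spend and expected outcome when messaging everyone at or above `threshold`."""
    targeted = [p for p in probabilities if p >= threshold]
    spend = message_cost * len(targeted)
    expected_customers = sum(targeted)
    return {
        "messages_sent": len(targeted),
        "spend": spend,
        "expected_customers": expected_customers,
        "expected_profit": customer_ltv * expected_customers - spend,
    }

# Hypothetical predicted conversion probabilities for five prospects.
scores = [0.01, 0.03, 0.05, 0.08, 0.20]
plan = campaign_plan(scores)
print(plan)
```

Passing a higher `threshold` trades reach for precision, which matches the advice to target precision first while validating the concept.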

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*