# Project: Hypothesis Testing for Microtransactions
Brian is a Product Manager at FarmBurg, a company that makes a farming simulation social network game.  In the FarmBurg game, you can plow, plant, and harvest different crops.

Today, you will be acting as Brian's data analyst for an A/B Test that he has been conducting.

## Part 1: Testing for Significant Difference

Start by importing the following modules that you'll need for this project:
- `pandas` as `pd`

In [1]:
import pandas as pd

Brian tells you that he ran an A/B test with three different groups: A, B, and C.  You're kind of busy today, so you don't ask too many questions about the differences between A, B, and C.  Maybe they were shown three different versions of an ad.  Who cares?

(HINT: you will care later)

Brian gives you a CSV of results called `clicks.csv`.  It has the following columns:
- `user_id`: a unique id for each visitor to the FarmBurg site
- `group`: either `A`, `B`, or `C` depending on which group the visitor was assigned to
- `click_day`: only filled in *if* the user clicked on a link to purchase

Load `clicks.csv` into the variable `df`.

In [2]:
df = pd.read_csv('clicks.csv')

Define a new column called `is_purchased`, which is `Purchase` if `click_day` is not null and `No Purchase` if `click_day` is null (pandas loads empty cells as `NaN`, not `None`).  This will tell us if each visitor clicked on the Purchase link.

In [3]:
import numpy as np
df['is_purchased'] = np.where(pd.isnull(df['click_day']), 'No Purchase', 'Purchase')

We want to count the number of users who made a purchase from each group.  Use `groupby` to count the number of `Purchase` and `No Purchase` from each `group`.  Save your answer to the variable `purchase_counts`.

**Hint**: Group by `group` and `is_purchased`, then call the function `count` on the column `user_id`.

In [4]:
purchase_counts = df.groupby(['group','is_purchased'])['user_id'].count()
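To make the shape of `purchase_counts` easier to picture, here is a minimal sketch on a made-up six-row table (the `toy` data is an assumption for illustration, not the real `clicks.csv`). It repeats the two steps above and shows that the result is indexed first by group and then by purchase label. The `unstack(fill_value=0)` step is an optional extra, not part of the original project; it guards against a group that has no rows for one of the labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for clicks.csv: click_day is missing when no click happened.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'click_day': ['Mon', None, None, 'Tue', None, None],
})

# Label each visitor based on whether click_day is missing.
toy['is_purchased'] = np.where(toy['click_day'].isnull(), 'No Purchase', 'Purchase')

# Count visitors per (group, label) pair; the result is a Series with a two-level index.
counts = toy.groupby(['group', 'is_purchased'])['user_id'].count()
a_purchases = counts['A']['Purchase']

# Group C has no purchases here, so counts['C']['Purchase'] would raise a KeyError.
# Spreading the labels into columns and filling gaps with 0 avoids that.
table = counts.unstack(fill_value=0)
c_purchases = table.loc['C', 'Purchase']
```

On the real data every group presumably has both labels, so the plain `purchase_counts['A']['Purchase']` lookup works as written.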

This data is *categorical* and there are *more than 2* conditions, so we'll want to use a chi-squared test to see if there is a significant difference between the three conditions.

Start by filling in the contingency table below with the correct values:
```py
contingency = [[groupA_purchases, groupA_not_purchases],
               [groupB_purchases, groupB_not_purchases],
               [groupC_purchases, groupC_not_purchases]]
```

In [5]:
contingency = [[purchase_counts['A']['Purchase'], purchase_counts['A']['No Purchase']],
               [purchase_counts['B']['Purchase'], purchase_counts['B']['No Purchase']],
               [purchase_counts['C']['Purchase'], purchase_counts['C']['No Purchase']]]

Now import the function `chi2_contingency` from `scipy.stats` and perform the chi-squared test.

Recall that the *p-value* is the second output of `chi2_contingency`.

In [6]:
from scipy.stats import chi2_contingency

In [7]:
chi, p, dof, ex = chi2_contingency(contingency)
print("P Value: " + str(p))

P Value: 2.41262135467e-35


Great! The p-value is far below 0.05, so the purchase rates differ significantly between the three groups.  It looks like a greater portion of users from Group A made a purchase.
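If you are curious what the chi-squared test computes, the following is a rough sketch in plain Python 3, using made-up counts rather than the FarmBurg results. The helper name `chi_squared_statistic` and the `example` table are assumptions for illustration. It compares each observed count with the count expected if group and purchase outcome were independent. For a 3x2 table there are 2 degrees of freedom, and in that special case the p-value has the closed form $e^{-\chi^2/2}$.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and purchase outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts, NOT the FarmBurg data: [purchases, non-purchases] per group.
example = [[30, 70],
           [20, 80],
           [10, 90]]

stat = chi_squared_statistic(example)
dof = (len(example) - 1) * (len(example[0]) - 1)  # (3 - 1) * (2 - 1) = 2

# With exactly 2 degrees of freedom, the chi-squared tail probability
# simplifies to exp(-stat / 2). This shortcut does not hold for other dof.
p_value = math.exp(-stat / 2)
print("chi2 =", stat, "dof =", dof, "p =", p_value)
```

In practice `chi2_contingency` is the better choice, since it handles any table shape; the sketch only shows where the number comes from.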

## Part 2: Testing for Exceeding a Goal

Your day is a little less busy than you expected, so you decide to ask Brian about his test.

**You**: Hey Brian! What was that test you were running anyway?

**Brian**: It was awesome! We are trying to get users to purchase a small FarmBurg upgrade package.  It's called a microtransaction.  We're not sure how much to charge for it, so we tested three different price points: \$0.99, \$1.99, and \$4.99.  It looks like significantly more people bought the upgrade package for \$0.99, so I guess that's what we'll charge.

**You**: Oh no! I should have asked you this before we did that chi-squared test.  I don't think that this was the right test at all.  It's true that more people wanted to purchase the upgrade at \$0.99; you probably expected that.  What we really want to know is if each price point allows us to make enough money that we can exceed some target goal.  Brian, how much do you think it cost to build this feature?

**Brian**: Hmm.  I guess that we need to generate a minimum of \$1000 per week in order to justify this project.

**You**: We have some work to do!

How many visitors came to the site this week?

Hint: Look at the length of `df`.

In [8]:
numVisitors = df.user_id.count()
print("Number of Visitors: " + str(numVisitors))

Number of Visitors: 4998


Let's assume that this is how many visitors we generally get each week.  Given that, calculate the percent of visitors who would need to purchase the upgrade package at each price point (\$0.99, \$1.99, \$4.99) in order to generate \$1000 per week.

In [9]:
# Calculate the number of people who would need to purchase a $0.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
targetVisitorsPercA = 1000/0.99/numVisitors
print("Visitors needed to reach $1000 in group A: " + str(targetVisitorsPercA))

Visitors needed to reach $1000 in group A: 0.202101042437


In [10]:
# Calculate the number of people who would need to purchase a $1.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
targetVisitorsPercB = 1000/1.99/numVisitors
print("Visitors needed to reach $1000 in group B: " + str(targetVisitorsPercB))

Visitors needed to reach $1000 in group B: 0.100542729655


In [11]:
# Calculate the number of people who would need to purchase a $4.99 upgrade in order to generate $1000.
# Then divide by the number of people who visit the site each week.
targetVisitorsPercC = 1000/4.99/numVisitors
print("Visitors needed to reach $1000 in group C: " + str(targetVisitorsPercC))

Visitors needed to reach $1000 in group C: 0.0400961988002


Note that you need a smaller percentage of purchases for higher price points.
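The three calculations above follow one formula: target fraction = weekly goal / price / weekly visitors. As a small sketch, they can be folded into a single helper. The function name `required_conversion_rate` is an assumption for illustration; the 4998 visitors and the \$1000 goal come from the steps above.

```python
def required_conversion_rate(weekly_goal, price, weekly_visitors):
    """Fraction of visitors who must buy at `price` to earn `weekly_goal`."""
    purchases_needed = weekly_goal / price
    return purchases_needed / weekly_visitors


# Same inputs as the cells above: $1000 goal, 4998 weekly visitors.
rates = {price: required_conversion_rate(1000, price, 4998)
         for price in (0.99, 1.99, 4.99)}

for price, rate in rates.items():
    print(f"${price:.2f}: {rate:.1%} of visitors must purchase")
```

Each fraction is inversely proportional to the price, which is why the \$4.99 price point needs only about a fifth as many buyers as the \$0.99 one.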

Now, for each group, perform a binomial test using `binom_test` from `scipy.stats`.
- `x` will be the number of purchases for that group
- `n` will be the total number of visitors assigned to that group
- `p` will be the target percent of purchases for that price point (calculated above)

Recall that:
- Group `A` is the \$0.99 price point
- Group `B` is the \$1.99 price point
- Group `C` is the \$4.99 price point

In [12]:
# import the binomial test from scipy.stats here
from scipy.stats import binom_test

In [13]:
# Test group A here
totalAVisitors = purchase_counts['A']['Purchase']+purchase_counts['A']['No Purchase']
pValueA = binom_test(purchase_counts['A']['Purchase'], totalAVisitors, targetVisitorsPercA)
print("P Value Group A: " + str(pValueA))

P Value Group A: 0.21112872994


In [14]:
# Test group B here
totalBVisitors = purchase_counts['B']['Purchase']+purchase_counts['B']['No Purchase']
pValueB = binom_test(purchase_counts['B']['Purchase'], totalBVisitors, targetVisitorsPercB)
print("P Value Group B: " + str(pValueB))

P Value Group B: 0.206602092466


In [15]:
# Test group C here
totalCVisitors = purchase_counts['C']['Purchase']+purchase_counts['C']['No Purchase']
pValueC = binom_test(purchase_counts['C']['Purchase'], totalCVisitors, targetVisitorsPercC)
print("P Value Group C: " + str(pValueC))

P Value Group C: 0.0456236724772


If any group's binomial test gives $p < 0.05$ and that group's observed purchase rate is above its target, then we can be confident that enough people will buy the upgrade package at that price point to justify the feature.

Which price point should Brian go with?  Did this surprise you?

**Answer**: Brian should go with a price of \$4.99, as this was the only group with a p-value below 0.05. This is not terribly surprising, as Brian made the most money off of Group C in the test phase, with roughly the same number of players in each group.
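One caveat: a two-sided binomial test reports a difference from the target in either direction, so it is worth checking that the observed rate is above the target and not below it. A one-sided test asks the directional question directly. The sketch below computes the exact upper-tail probability $P(X \ge x)$ in plain Python 3; the helper name `binom_upper_tail` and the example counts are assumptions for illustration, not FarmBurg results. Recent SciPy releases offer the same test as `binomtest(x, n, p, alternative='greater')`.

```python
import math


def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so large n does not overflow.
    """
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for k in range(x, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Made-up example: 60 purchases out of 1000 visitors against a 4% target.
one_sided_p = binom_upper_tail(60, 1000, 0.04)
print("One-sided p-value:", one_sided_p)
```

A small one-sided p-value here would mean that a purchase count this high is unlikely if the true rate were only at the target.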

Data Structures - Answers
===============

•	**Linked Lists** – A linked list is a set of data in which each item points to the next item in the list. The exception is the last item, which points to a terminator. The benefit of this is that items can be added or removed without moving the rest of the list in memory. Given a reference to the position, add and remove operations therefore take the same amount of time regardless of the length of the list. However, finding an item can require traversing the entire list.

•	**Stacks** – Last in, first out is the guiding principle of stacks. If you think of the data as a tower, items are added on top of the existing items in the stack. The most recently added item is the one that is ready to be removed, and the first item added cannot be taken out until all the items above it are removed. Adding to and removing from the top take a constant amount of time, but searching for an item might go through each element before finishing, so search time is proportional to the size of the stack.

•	**Hash Maps** – A mapping of key-value pairs. Hash maps use a function, called a hash function, to map each key to the location where its value is stored. This allows looking up, inserting and removing data in a constant amount of time on average, regardless of the size (number of key-value pairs) of the map. The data is unordered, which makes hash maps useful for databases, where large amounts of data need to be accessed, inserted and removed.

•	**Trees** – A tree structures data into parents and children. Each item of data, also called a node, must have exactly one parent node (except for the root of the tree, which is parentless), and zero or more child nodes. These data structures have many well-defined search algorithms (BFS, DFS, A*), with a range of pros and cons depending on the data itself.

•	**Heaps** – A heap is a partially ordered form of a tree. It ensures that the value of each parent node is always greater than or equal to the values of its child nodes (a max-heap), or always less than or equal to them (a min-heap). This ordering makes the largest (or smallest) item available at the root immediately, but there is a tradeoff: unsorted data must first be arranged into heap order, and the heap must be reordered after each insertion or removal.
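The structures above can be sketched in a few lines of Python 3. This is only an illustration under simple assumptions: the `Node` class, the helper functions and all sample values are made up for the example, and Python's built-in `list`, `dict` and `heapq` stand in for the stack, hash map and heap.

```python
import heapq
from collections import deque


class Node:
    """One linked-list item: a value plus a pointer to the next item."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def linked_list_values(head):
    """Walk from the head to the terminator (None), collecting values."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


# Linked list: inserting after a known node only rewires pointers.
head = Node('a', Node('c'))
head.next = Node('b', head.next)   # now a -> b -> c
linked = linked_list_values(head)

# Stack: a list pushes and pops at the same end (last in, first out).
stack = []
for item in (1, 2, 3):
    stack.append(item)
popped = stack.pop()               # the most recently added item

# Hash map: dict hashes each key to locate its value.
stock = {'corn': 12, 'wheat': 7}
corn_count = stock['corn']

# Tree: each node maps to its list of children; leaves have none.
tree = {'root': ['left', 'right'], 'left': ['leaf'], 'right': [], 'leaf': []}


def bfs(tree, start):
    """Breadth-first traversal: visit nodes level by level."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order


bfs_order = bfs(tree, 'root')

# Heap: heapq keeps a min-heap inside a list; the smallest item sits at index 0.
heap = [5, 1, 4, 2]
heapq.heapify(heap)
smallest = heapq.heappop(heap)
```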
