# Case Study: Introduction to Brian's Problem
Brian ran an A/B test with three different groups: A, B, and C. He has provided us with a CSV file of his results named clicks.csv. It has the following columns:

- `user_id`: a unique id for each visitor to the FarmBurg site
- `group`: either 'A', 'B', or 'C', depending on which group the visitor was assigned to
- `is_purchase`: either 'Yes' if the visitor made a purchase or 'No' if they did not

## Task 1: Import the data and necessary libraries
- Inspect the data using the .head() method. 

In [2]:
import pandas as pd
import numpy as np

In [5]:
abdata = pd.read_csv('clicks.csv')

print(abdata.head())

    user_id group is_purchase
0  8e27bf9a     A          No
1  eb89e6f0     A          No
2  7119106a     A          No
3  e53781ff     A          No
4  02d48cf1     A         Yes


In [7]:
# I want to also ensure all samples are the same size.
A_len = len(abdata[abdata['group'] == 'A'])
B_len = len(abdata[abdata['group'] == 'B'])
C_len = len(abdata[abdata['group'] == 'C'])

print(A_len,B_len,C_len)

1666 1666 1666
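
The same size check can be written as a single `value_counts()` call. This is a minimal sketch on a toy DataFrame (`toy`, with hypothetical rows that are not Brian's data) that has the same columns as `clicks.csv`:

```python
import pandas as pd

# Hypothetical stand-in for clicks.csv; the rows are made up for illustration.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "group": ["A", "B", "C", "A", "B", "C"],
    "is_purchase": ["No", "Yes", "No", "No", "No", "Yes"],
})

# Count the rows in each group; sort_index() orders the result A, B, C.
group_sizes = toy["group"].value_counts().sort_index()
print(group_sizes)

# The groups are balanced when every count is the same value.
sizes_equal = group_sizes.nunique() == 1
print(sizes_equal)
```

On the real data, replacing `toy` with `abdata` should reproduce the three counts printed above.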


## Task 2: Perform a Chi Square Test

We have two categorical variables: `group` and `is_purchase`. We are interested in whether visitors are more likely to make a purchase if they are in any one group compared to the others. Because we want to know if there is an association between two categorical variables, we’ll start by using a Chi-Square test to address our question.

**Question 1: After creating a contingency table, which group appears to have the highest number of purchases?**

**Question 2: Save the p-value from the chi square test to a variable named pval and print the result. Using a significance threshold of 0.05, is there a significant difference in the purchase rate for groups A, B, and C?**

In [8]:
from scipy.stats import chi2_contingency

In [9]:
Xtab = pd.crosstab(abdata.group, abdata.is_purchase)

print(Xtab)

is_purchase    No  Yes
group                 
A            1350  316
B            1483  183
C            1583   83
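
Because Question 2 asks about purchase *rates*, it can help to turn the counts into proportions. The sketch below rebuilds the contingency table from the counts printed above (the name `xtab_counts` is introduced here for illustration) and divides the purchases by each group's total:

```python
import pandas as pd

# Counts copied from the contingency table printed above.
xtab_counts = pd.DataFrame(
    {"No": [1350, 1483, 1583], "Yes": [316, 183, 83]},
    index=pd.Index(["A", "B", "C"], name="group"),
)

# Purchase rate = purchases / all visitors in the group.
purchase_rate = xtab_counts["Yes"] / xtab_counts.sum(axis=1)
print(purchase_rate.round(4))
```

With the original data loaded, `pd.crosstab(abdata.group, abdata.is_purchase, normalize='index')` should give the same proportions directly.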


In [22]:
"""Answer 1: Group 'A' has the highest number of purchases at 316, \
followed by Group 'B' at 183; \
lastly, Group 'C' has 83 purchases."""

"Answer 1: Group 'A' has the highest number of purchases at 316, followed by Group 'B' at 183; lastly, Group 'C' has 83 purchases."

In [30]:
# Test for significance using a Chi-Square test of independence.
# The null hypothesis states that there is no association between group and
# purchasing behaviour. The alternative hypothesis states that there is an association.

chi2, pval, dof, expected = chi2_contingency(Xtab)

print("Significant" if pval <= 0.05 else "Not Significant", "p-value = ",pval)

Significant p-value =  2.4126213546684264e-35
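
To see what `chi2_contingency` is doing, the statistic can be computed by hand from the contingency table. This is a sketch rather than the library's implementation: expected counts come from the row and column totals under independence, and the p-value is the upper tail of a chi-square distribution. The variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Observed counts from the contingency table (columns: No, Yes; rows: A, B, C).
observed = np.array([[1350, 316], [1483, 183], [1583, 83]])

# Under independence, expected count = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_counts = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom.
chi2_stat = ((observed - expected_counts) ** 2 / expected_counts).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: probability of a statistic at least this large under the null.
p_manual = stats.chi2.sf(chi2_stat, dof_manual)
print(chi2_stat, dof_manual, p_manual)
```

Because the table has two degrees of freedom, no continuity correction is involved, so `p_manual` should agree with the `pval` printed above.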


In [32]:
"""Answer 2: The p-value of 2.4126213546684264e-35 is far below 0.05, so I can \
conclude that the purchase rate differs significantly across groups A, B, and C."""

'Answer 2: The p-value of 2.4126213546684264e-35 is far below 0.05, so I can conclude that the purchase rate differs significantly across groups A, B, and C.'

## Task 3
After concluding the chi-square test, we decide to present the findings back to Brian.

Us: Hey Brian! What was that test you were running anyway?

Brian: We are trying to get users to purchase a small FarmBurg upgrade package. It’s called a microtransaction. We’re not sure how much to charge for it, so we tested three different price points: \$0.99 (group 'A'), \$1.99 (group 'B'), and \$4.99 (group 'C'). It looks like significantly more people bought the upgrade package for \$0.99, so I guess that’s what we’ll charge.

Us: Oh no! We should have asked you this before we did that Chi-Square test. That wasn’t the right test at all. It’s true that more people wanted to purchase the upgrade at \$0.99; you probably expected that. What we really want to know is whether each price point allows us to make enough money that we can exceed some target goal. Brian, how much do you think it cost to build this feature?

Brian: Hmm. I guess that we need to generate a minimum of \$1000 in revenue per week in order to justify this project.

Us: We have some work to do!

In order to justify this feature, you will need to calculate the necessary purchase rate for each price point. Start by calculating the number of visitors to the site this week.

It turns out that Brian ran his original test over the course of a week, so the number of visitors in abdata is equal to the number of visitors in a typical week. 

Calculate the number of visitors in the data and save the value in a variable named `num_visits`. Make sure to print the value.


In [34]:
num_visits = len(abdata)
print(num_visits)

4998


Now that we know how many visitors we generally get each week (`num_visits`), we need to calculate the number of visitors who would need to purchase the upgrade package at each price point (\$0.99, \$1.99, \$4.99) in order to generate Brian’s minimum revenue target of \$1,000 per week.

- Calculate the number of sales that would be needed to reach \$1,000 of revenue at each price point. Save the result and print it out.

In [36]:
target = 1000
num_sales_needed_099 = np.ceil(target/0.99)
num_sales_needed_199 = np.ceil(target/1.99)
num_sales_needed_499 = np.ceil(target/4.99)

print(num_sales_needed_099)

1011.0
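
The cell above prints only the \$0.99 figure. A small sketch that computes the required sales for all three price points at once, using the price-to-group mapping Brian described (the dictionary names are introduced here for illustration):

```python
import math

target = 1000  # weekly revenue target in dollars, from Brian
prices = {"A": 0.99, "B": 1.99, "C": 4.99}

# Round up: a fractional sale cannot happen, so we need the next whole sale.
sales_needed = {group: math.ceil(target / price) for group, price in prices.items()}
print(sales_needed)  # {'A': 1011, 'B': 503, 'C': 201}
```

`math.ceil` returns an integer, whereas `np.ceil` in the cell above returns a float (`1011.0`); either works for the steps that follow.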


- Now that we know how many sales we need at each price point, calculate the proportion of weekly visitors who would need to make a purchase in order to meet that goal. Save the result and print it out.


In [41]:
p_sales_needed_099 = num_sales_needed_099/num_visits
p_sales_needed_199 = num_sales_needed_199/num_visits
p_sales_needed_499 = num_sales_needed_499/num_visits

print(f"The proportion of weekly visitors to break even if Brian prices the product at $0.99 is {p_sales_needed_099}")

The proportion of weekly visitors to break even if Brian prices the product at $0.99 is 0.20228091236494597
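
For reference, the required proportions at all three price points can be printed together. This sketch hard-codes the weekly visitor count (4998) and the required sales (1011, 503, 201, each being the target divided by the price and rounded up) so that it runs on its own:

```python
num_visits = 4998  # weekly visitors, from len(abdata)
sales_needed = {"0.99": 1011, "1.99": 503, "4.99": 201}

# Required purchase rate = required sales / weekly visitors.
p_needed = {price: n_sales / num_visits for price, n_sales in sales_needed.items()}

for price, proportion in p_needed.items():
    print(f"${price}: {proportion:.4f} of weekly visitors must purchase")
```

The higher the price, the smaller the share of visitors that has to buy in order to reach the same revenue target.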


### Now let’s return to Brian’s question. 

To start, we want to know if the proportion of Group A (the \$0.99 price point) that purchased an upgrade package is significantly greater than `p_sales_needed_099` (the proportion of visitors who need to buy an upgrade package at \$0.99 in order to make our minimum revenue target of \$1,000).

To answer this question, we want to focus on just the visitors in group A. Then, we want to compare the proportion of purchases in that group to `p_sales_needed_099`.

Start by saving the variables required for the hypothesis test that fits this scenario.

In [43]:
from scipy.stats import binomtest

In [44]:
# binomtest needs 4 arguments: k = number of observed purchases,
# n = sample size, p = target purchase rate, alternative = 'greater'

samp_size_099 = Xtab.loc['A'].sum()
sales_099 = Xtab.loc['A', 'Yes']

samp_size_199 = Xtab.loc['B'].sum()
sales_199 = Xtab.loc['B', 'Yes']

samp_size_499 = Xtab.loc['C'].sum()
sales_499 = Xtab.loc['C', 'Yes']

In [53]:
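# binomtest returns a BinomTestResult object; the p-value itself is pvalueA.pvalue
pvalueA = binomtest(sales_099, samp_size_099, p=p_sales_needed_099, alternative='greater')
pvalueA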

BinomTestResult(k=316, n=1666, alternative='greater', statistic=0.18967587034813926, pvalue=0.9058887362654593)

In [60]:
"""Answer: When the binomial test uses "greater" as the alternative parameter, \
the null hypothesis (H0) is that the true purchase rate is \
less than or equal to the target purchase rate. A p-value of 0.91 means there is not \
significant evidence to reject this hypothesis, so the observed purchase rate is \
not significantly greater than the required purchase rate, which indicates that a $0.99 \
price point may not reach the $1,000 weekly revenue target."""

'Answer: When the binomial test uses "greater" as the alternative parameter, the null hypothesis (H0) is that the true purchase rate is less than or equal to the target purchase rate. A p-value of 0.91 means there is not significant evidence to reject this hypothesis, so the observed purchase rate is not significantly greater than the required purchase rate, which indicates that a $0.99 price point may not reach the $1,000 weekly revenue target.'
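
The one-sided p-value can also be reproduced from the binomial distribution itself, which makes the meaning of `alternative='greater'` concrete: it is the probability of seeing at least the observed number of purchases if the true rate were exactly the target rate. A minimal sketch using the Group A numbers from above (316 purchases out of 1666 visitors, target rate 1011/4998):

```python
from scipy.stats import binom

k, n = 316, 1666          # observed purchases and visitors in Group A
p_needed = 1011 / 4998    # required purchase rate at the $0.99 price point

# P(X >= k) under Binomial(n, p_needed); sf(k - 1) is P(X > k - 1).
p_value_greater = binom.sf(k - 1, n, p_needed)
print(p_value_greater)
```

The result should match the `pvalue` field of the `BinomTestResult` shown above (about 0.91).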

- For Group B (\$1.99 price point), perform a binomial test to see if the observed purchase rate is significantly greater than `p_sales_needed_199`.

Save the results to `pvalueB`, and print its value.


In [62]:
pvalueB = binomtest(sales_199, samp_size_199, p=p_sales_needed_199, alternative='greater')
pvalueB

BinomTestResult(k=183, n=1666, alternative='greater', statistic=0.10984393757503001, pvalue=0.11441815431122217)

- For Group C (\$4.99 price point), perform a binomial test to see if the observed purchase rate is significantly greater than `p_sales_needed_499`.

Save the results to `pvalueC`, and print its value.

In [64]:
pvalueC = binomtest(sales_499, samp_size_499, p=p_sales_needed_499, alternative='greater')
print(pvalueC)

BinomTestResult(k=83, n=1666, alternative='greater', statistic=0.04981992797118848, pvalue=0.029642608610084057)
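
To compare the three results side by side, the tests can be run in a loop. This sketch hard-codes the counts from the contingency table and the required sales (target divided by price, rounded up); reusing the 0.05 threshold from Task 2 is an assumption, since the notebook does not state a threshold for the binomial tests.

```python
from scipy.stats import binomtest

num_visits = 4998
alpha = 0.05  # assumed: same threshold as the chi-square test in Task 2

# group: (purchases, group size, sales needed per week at that price)
groups = {"A": (316, 1666, 1011), "B": (183, 1666, 503), "C": (83, 1666, 201)}

pvalues = {}
for name, (purchases, size, needed) in groups.items():
    result = binomtest(purchases, size, p=needed / num_visits, alternative="greater")
    pvalues[name] = result.pvalue
    verdict = "significant" if result.pvalue <= alpha else "not significant"
    print(f"Group {name}: p-value = {result.pvalue:.4f} ({verdict})")
```

Under that assumed threshold, only Group C (the \$4.99 price point) has a purchase rate significantly greater than the rate it needs to reach the revenue target.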
