# Using Hypothesis Testing to Understand Drivers of Sales

You have been contracted by an e-commerce marketplace to help them understand what factors drive sales. They've collected data on 500 purchases made recently, including how much time was spent on the page, how many reviews the product had, and the product rating. They would like to understand what factors in their data are driving sales. 

### The data

The company has collected data on 500 purchases. They know:
* The amount of time in seconds an individual user spent on that page
* The number of product reviews
* The product rating
* Whether the user purchased the product or not.

### Your task

Use hypothesis testing to test whether the mean time spent on the page, the mean number of reviews, and the mean product rating are different in populations that purchased and didn't purchase. 

Exploratory Data Analysis:
* For each numerical column, plot a histogram for purchases and one for non-purchases.
  * *Extra:* use `plt.axvline` to indicate the means on the histograms
  * *Extra:* Do it for all the columns at once with subplots, including labeled histograms and means.
* Are your purchasers and non-purchasers relatively evenly balanced?
  * **HINT:** `dataframe[column].value_counts()`
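
The EDA steps above can be sketched on made-up data. This is only an illustration, not the solution: the values are synthetic, and the `"yes"`/`"no"` encoding of the `purchase` column is an assumption that may not match the real CSV.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook; drop this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data; the values and the yes/no encoding are assumptions.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "product_rating": rng.normal(4.0, 0.5, size=200),
    "purchase": rng.choice(["yes", "no"], size=200),
})

bought = toy_df.loc[toy_df["purchase"] == "yes", "product_rating"]
not_bought = toy_df.loc[toy_df["purchase"] == "no", "product_rating"]

# Overlay the two histograms on one set of axes.
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(bought, bins=20, alpha=0.5, label="purchase")
ax.hist(not_bought, bins=20, alpha=0.5, label="no purchase")

# Dashed vertical lines mark each group's mean.
ax.axvline(bought.mean(), color="C0", linestyle="--")
ax.axvline(not_bought.mean(), color="C1", linestyle="--")
ax.set_xlabel("product_rating")
ax.set_ylabel("count")
ax.legend()

# Class balance: number of rows in each group.
counts = toy_df["purchase"].value_counts()
print(counts)
```

If the two counts are very different, keep that in mind when comparing the histograms, since raw counts on the y-axis will make the smaller group look flatter.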

For each test,
* State your null and alternative hypotheses (in markdown!)
* Use masking in `pandas` to separate your populations.
* Conduct a t-test to test your null hypothesis.
* Interpret your p-value.
* Describe your findings about the column (in markdown!)
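
As a rough sketch of that workflow on synthetic data (not the marketplace data), the masking and t-test steps might look like the following. The group means, the `"yes"`/`"no"` encoding, and the 0.05 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: the "yes" group is deliberately drawn with a higher mean.
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({
    "num_product_reviews": np.concatenate(
        [rng.normal(60, 10, 150), rng.normal(50, 10, 150)]
    ),
    "purchase": ["yes"] * 150 + ["no"] * 150,
})

# Boolean masks split the rows into the two populations.
purchase_mask = toy_df["purchase"] == "yes"
no_purchase_mask = ~purchase_mask

# Two-sided independent-samples t-test comparing the group means.
result = stats.ttest_ind(
    toy_df.loc[purchase_mask, "num_product_reviews"],
    toy_df.loc[no_purchase_mask, "num_product_reviews"],
)

alpha = 0.05  # conventional significance level, assumed here
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4g}")
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```

By default `stats.ttest_ind` assumes equal variances in the two groups; passing `equal_var=False` runs Welch's t-test instead, which may be the safer choice when the group variances or sizes differ noticeably.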


The goal of your work is to make a recommendation for how to identify products that will sell better. **To conclude, provide a recommendation for identifying products that will perform well and products that will perform poorly in the marketplace.**

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

Load in the CSV using `pd.read_csv`

In [None]:
# Enter the name of the csv
results_df = pd.read_csv()

Display the head to check the data.

In [None]:
# Your code here

Separate the data into two DataFrames: one for those who purchased, and one for those who didn't.

In [None]:
# Create a boolean mask on your results_df['purchase'] column that's True if the customer bought and False if they didn't.
purchase_mask = None

In [None]:
# Create a boolean mask on your results_df['purchase'] column that's False if the customer bought and True if they didn't.
no_purchase_mask = None

In [None]:
# Use the masks to create your new DataFrames
purchase_df = results_df[purchase_mask]
no_purchase_df = results_df[no_purchase_mask]

Plot the distributions of each numerical column. 

In [None]:
fig = plt.figure(figsize=(12,15))
for i, col in enumerate(results_df.columns[:3]):
    fig.add_subplot(3,1,1+i)
    # Add a histogram for the column in the loop from the purchase DataFrame.
    
    # Add a histogram for the column in the loop from the no-purchase DataFrame.
    
    plt.ylabel('count')
    plt.xlabel(col)

Observations:
* The means of the time spent on each page appear the same for purchases and non-purchases.
* For number of reviews and product rating, the purchase mean seems higher than the no-purchase mean.

How many purchases? How many non-purchases?

In [None]:
# Your code here

## Hypothesis Testing

### `time_on_page_sec`

What's your null hypothesis?

**(Enter text here)**

What's your alternative hypothesis?

**(Enter text here)**

In [None]:
# Complete the code to obtain your pvalue.
pvalue = stats.ttest_ind().pvalue
pvalue

What's the interpretation of your p-value?

**(Enter text here)**

### `num_product_reviews`

What's your null hypothesis?

**(Enter text here)**

What's your alternative hypothesis?

**(Enter text here)**

In [None]:
# Complete the code to obtain your pvalue.
pvalue = stats.ttest_ind().pvalue
pvalue

What's the interpretation of your p-value?

**(Enter text here)**

### `product_rating`

What's your null hypothesis?

**(Enter text here)**

What's your alternative hypothesis?

**(Enter text here)**

In [None]:
# Complete the code to obtain your pvalue.
pvalue = stats.ttest_ind().pvalue
pvalue

What's the interpretation of your p-value?

**(Enter text here)**

### Conclusion
---

**(Enter text here)**


# Part 2: A/B Testing Using the chi-square test (optional)

This part will involve some independent learning. However, resources are provided here in the notebook and linked materials. The goal is to apply what you've learned about hypothesis testing and p-values with a new test and a few new techniques.

So far we've been working with the t-test, which tests differences in means. Another common hypothesis test is the chi-square test ($\chi^2$, from the Greek letter chi), which tests whether proportions differ between groups. We're going to use it to do some A/B testing.

The buy button on the e-commerce marketplace's product pages is red. However, they want to see if a yellow button will do better. You've designed an experiment to test this!

Here's the setup of the A/B test:

* Control Group: Red Button
* Experimental Group: Yellow Button

We will conduct this experiment with 1000 users. Each user will be randomly assigned to a group (experimental or control). Afterward, we will use a chi-square test to obtain a p-value.

#### Chi-square Test

A chi-square test tests the null hypothesis that the underlying proportions are the same across our groups; the observed proportions in a sample will almost always differ a little by chance. Here we have only one proportion to compare: the proportion of purchases. So our null hypothesis is:

> $H_0$: The proportion of purchases is the same in the control and experimental groups.

And our alternative hypothesis is:

> $H_1$: The proportion of purchases is not the same in the control and experimental groups.
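
For reference, the statistic behind this test compares each observed count $O_i$ in the group-by-purchase table with the count $E_i$ that would be expected if $H_0$ were true. This is the standard Pearson form, given here as background rather than taken from the course materials:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$

For a table with $r$ rows and $c$ columns the degrees of freedom are $(r-1)(c-1)$, so a table of two groups by two outcomes has 1 degree of freedom. Large values of $\chi^2$ mean the observed counts sit far from what $H_0$ predicts.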

1. Run the `generate_test_data` function and store the results in a variable, such as `abdata`. 
2. Use `pd.crosstab` of your `group` and `purchase` columns. This returns the observed frequencies of purchases for your control and experimental groups (this is called a contingency table).
3. Use `scipy.stats.chi2_contingency()`. You may pass your crosstab as an argument to this function. It returns four items: The chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies based on the observations in the table.
4. Interpret the p-value. Is there enough 'evidence' to reject our null hypothesis? 

Based on your observations and p-value, what would be the recommendation? Would you recommend sticking with the red button or moving to the yellow button?
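
To show the mechanics of steps 2 and 3 without giving away the exercise, here is a minimal sketch on a small hand-made table. The counts are invented for illustration and are not what `generate_test_data` produces.

```python
import pandas as pd
from scipy import stats

# Hand-made toy data (hypothetical counts): 100 users per group.
toy_ab = pd.DataFrame({
    "group": ["control"] * 100 + ["experimental"] * 100,
    "purchase": ["yes"] * 30 + ["no"] * 70 + ["yes"] * 45 + ["no"] * 55,
})

# Rows are groups, columns are outcomes, cells are observed counts.
table = pd.crosstab(toy_ab["group"], toy_ab["purchase"])
print(table)

# Unpack the four return values described in step 3.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
print(expected)
```

Note that for a 2×2 table `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic will be slightly smaller than one computed by hand from the plain Pearson formula.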

In [None]:
def generate_test_data(n_users = 1000, seed = 42):
    '''
    Randomly generates test groups. 
    Returns a DataFrame where the first column is the group and the second column is whether the customer purchased.
    '''
    np.random.seed(seed)
    data = []
    for user in range(n_users):
        if np.random.random(1) > .5:
            if np.random.random(1) < .3:
                data.append(('control','yes'))
            else:
                data.append(('control','no'))
        else:
            if np.random.random(1) < .35:
                data.append(('experimental','yes'))
            else:
                data.append(('experimental','no'))
    return pd.DataFrame(data, columns = ['group','purchase'])

In [None]:
abdata = generate_test_data()

In [None]:
abdata.head()

In [None]:
# Complete the crosstab to generate the contingency table. It should take two arguments, each should be a column.

contingency_table = pd.crosstab()
contingency_table

In [None]:
chistat, chipval, dof, exp_p = stats.chi2_contingency(contingency_table)

In [None]:
chipval

**(Interpret your p-value and make a recommendation on which button to move forward with)**