## Goal
Pricing optimization is, unsurprisingly, another area where data science can provide huge
value.

The goal here is to evaluate whether a pricing test running on the site has been successful.
As always, you should focus on user segmentation and provide insights about segments
that behave differently, as well as any other insights you might find.

## Challenge Description

Company XYZ sells software for \$39. Since revenue has been flat for some time, the VP of
Product has decided to run a test increasing the price. She hopes that this will increase
revenue. In the experiment, 66% of the users saw the old price (\$39), while a random
sample of 33% of the users saw a higher price (\$59).

The test has been running for some time and the VP of Product is interested in understanding
how it went and whether it would make sense to increase the price for all the users.

In particular, she asked you the following questions:
* Should the company sell its software for \$39 or \$59?
* The VP of Product is interested in having a holistic view into user behavior, especially
focusing on actionable insights that might increase conversion rate. What are your main
findings looking at the data?
* The VP of Product feels that the test has been running for too long and that she should
have been able to get statistically significant results in a shorter time. Do you agree with
her intuition? After how many days would you have stopped the test? Please explain
why.


In [1]:
import pandas as pd
import numpy as np

In [2]:
path_to_testresults = '../test_results.csv'
path_to_user_table = '../user_table.csv'

In [3]:
results = pd.read_csv(path_to_testresults)
users = pd.read_csv(path_to_user_table)


In [None]:
print(len(results))
results.head()

In [None]:
print(len(users))
users.head()

# EDA Summary
1) There are no repeat users. However, we cannot tell whether the same person logged on from another device
2) All users in the dataset are from the USA, spread across 923 different cities
3) There are more users than results, so not all users were part of this test
4) There are no missing data
5) 64% of the users were in the control group and saw the old price
6) 55 more people saw the increased price than were in the test group. This indicates a bug or a data entry problem.
7) There are 365 misclassified observations
8) Data was collected over a period of 3 months

In [4]:
len(set(results['user_id'])) == len(results)


In [5]:
country= set(users['country'])
cities = len(set(users['city']))

print("Our users are from {0}, and live in this many cities {1}".format(country,cities))


In [6]:
#Check whether there are fewer users than results
len(users) < len(results)


In [7]:
#Checking for missing data
sum(results.isnull().any(axis=1)) + sum(users.isnull().any(axis=1))


In [8]:
#Percentage of users in the control group (test == 0)
((len(results.test) - sum(results.test == 1)) / len(results.test)) * 100


In [9]:
test_group= (sum((results.test) == 1))
price_59=(sum((results.price) == 59))

print("This many people were randomly selected to be in the test group {0}, and this many people saw the appropriate price of $59: {1}.".format(test_group,price_59))
print("\nThe difference between the two is {0}".format(test_group-price_59))


In [10]:
len(results.loc[(results.price == 59) & (results.test == 0)])


In [136]:
print("There are this many misclassified observations: {0}".format(
    len(results.loc[(results.price == 59) & (results.test == 0)]) +
    len(results.loc[(results.price == 39) & (results.test == 1)])))

There are this many misclassified observations: 365
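
A cross-tabulation of assigned group against displayed price shows the same mismatch in a single table. The sketch below runs on a small made-up frame that only borrows the `test` and `price` column names from `results`; the rows are illustrative and are not the real data.

```python
import pandas as pd

# Toy stand-in for `results` (made-up rows, same column names)
toy = pd.DataFrame({
    "test":  [0, 0, 0, 1, 1, 1, 0, 1],
    "price": [39, 39, 59, 59, 59, 39, 39, 59],
})

# Rows: assigned group; columns: price actually shown
table = pd.crosstab(toy["test"], toy["price"])
print(table)

# Off-diagonal cells are the rows where group and price disagree
n_bad = int(table.loc[0, 59] + table.loc[1, 39])
print(n_bad)
```

On the real data, the two off-diagonal cells should add up to the misclassified count reported above.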


In [122]:
min(results.timestamp), max(results.timestamp)

('2015-03-02 00:04:12', '2015-05-31 23:59:45')

# Analysis of data
This section will answer the following questions:
1) Whether the company should sell its software for \$39 or \$59
2) Whether the test was run for too long
3) What are the actionable insights that might increase conversion rate
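
For the third question, one possible starting point is to compare conversion rates per segment and test group. The sketch below is only an illustration: the rows are made up, and `device` is a hypothetical segment column that may not match the real column names in `results`.

```python
import pandas as pd

# Made-up rows; `device` is a hypothetical segment column
toy = pd.DataFrame({
    "test":      [0, 0, 0, 0, 1, 1, 1, 1],
    "device":    ["web", "web", "mobile", "mobile"] * 2,
    "converted": [1, 0, 0, 0, 1, 0, 0, 0],
})

# Conversion rate and sample size per segment and test group
by_segment = (
    toy.groupby(["device", "test"])["converted"]
       .agg(conversion_rate="mean", n="size")
       .reset_index()
)
print(by_segment)
```

Keeping the sample size `n` next to each rate makes it easier to spot segments that are too small to support a conclusion.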


In [171]:
#Tracking observations that are misclassified
results['is_misclassified'] = (((results.price == 59) & (results.test == 0))
                               | ((results.price == 39) & (results.test == 1)))


In [172]:
#Create a single table with all the data
data = results.set_index('user_id').join(users.set_index('user_id'), how= 'outer')

In [173]:
#Flag rows that have no user record and label their missing location fields
data['has_location'] = data['country'] == 'USA'

data.loc[~data['has_location'], ['country', 'city']] = 'Not specified'
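
An alternative way to find rows without a matching user record is `merge` with `indicator=True`, which records where each row came from. This is a sketch on made-up frames (`results_toy` and `users_toy` are hypothetical names), and it uses a left join rather than the outer join above.

```python
import pandas as pd

# Made-up stand-ins for the two tables
results_toy = pd.DataFrame({"user_id": [1, 2, 3], "converted": [0, 1, 0]})
users_toy = pd.DataFrame({"user_id": [1, 3], "city": ["Austin", "Boston"]})

# Keep every test result; `_merge` says whether a user record was found
merged = results_toy.merge(users_toy, on="user_id", how="left", indicator=True)
merged["has_location"] = merged["_merge"] == "both"
merged["city"] = merged["city"].fillna("Not specified")

print(merged[["user_id", "city", "has_location"]])
```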




In [180]:
#Only look at data that was not misclassified (~ negates a boolean Series)
good_data = data.loc[~data.is_misclassified]

# 1) What's the optimal software price?

To answer this, we first compare the conversion rates of the two groups. We then run a t-test on revenue per user to see whether the difference in revenue between the groups is significant.

## Conclusion
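There is a significant difference between the two groups. The lower price converts a significantly higher share of users (1.99% vs. 1.56%), but the higher price brings in more revenue per user: about 0.0199 × \$39 ≈ \$0.78 in the control group versus 0.0156 × \$59 ≈ \$0.92 in the test group. The positive t-statistic (test minus control) shows that this revenue difference is significant. In sum, on revenue per user the \$59 price is the better choice. The client may also want to try the experiment with a \$49 price point.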

In [203]:
#Conversion rates for each of the groups
group1 = good_data.loc[good_data['test'] == 0, 'converted'].mean()
print("{:.2%} of the people in the non-test class were converted".format(group1))

group2 = good_data.loc[good_data['test'] == 1, 'converted'].mean()
print("{:.2%} of the people in the test class were converted".format(group2))

1.99% of the people in the non-test class were converted
1.56% of the people in the test class were converted
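
Conversion rate alone does not settle the pricing question, because each conversion is worth more at the higher price. As a rough worked example, the snippet below multiplies the rounded conversion rates printed above by the two prices to get expected revenue per visitor, and derives the conversion rate the \$59 price needs in order to break even. The variable names are illustrative.

```python
# Rounded conversion rates as printed above
conv_control, conv_test = 0.0199, 0.0156
price_control, price_test = 39, 59

# Expected revenue per visitor = conversion rate * price
rev_control = conv_control * price_control
rev_test = conv_test * price_test
print(f"Control: ${rev_control:.2f} per visitor")
print(f"Test:    ${rev_test:.2f} per visitor")

# Conversion rate at $59 that would match control revenue per visitor
break_even = rev_control / price_test
print(f"Break-even conversion at $59: {break_even:.2%}")
```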


In [219]:
# Test whether there is a significant difference between the revenues of the two groups
import scipy.stats as sp

# Revenue per user: the price paid if the user converted, 0 otherwise
control = good_data.loc[good_data.test == 0]
test = good_data.loc[good_data.test == 1]
x_control = (control['converted'] * control['price']).to_numpy()
x_test = (test['converted'] * test['price']).to_numpy()

# Welch's t-test (unequal variances); a positive statistic means test > control
sp.ttest_ind(x_test, x_control, equal_var=False)

Ttest_indResult(statistic=5.715224666463108, pvalue=1.0972577312420791e-08)

In [234]:
#Let's take a quantitative view to get a full picture of the situation
#Only converted users generate revenue; the groups differ in size, so also report revenue per user
control_rev = x_control.sum()
test_rev = x_test.sum()

print("The total revenue from the control condition was ${}".format(control_rev))
print("The total revenue from the test condition was ${}".format(test_rev))
print("Revenue per user: control ${:.2f}, test ${:.2f}".format(x_control.mean(), x_test.mean()))


# 2) Could we have shortened the length of testing?

Below, I keep only the first half of each group's observations and rerun the t-test. If the difference is still significant on half the data, the experiment could have been stopped in about half the time. This reading assumes the rows are in chronological order; if they are not, sort by `timestamp` before truncating.

In [241]:
#VP wants to know if we can observe the same results with less test time. Can we observe the same pattern with less data?

#Take the first half of each array (integer division on the length)
x_control_half = x_control[:len(x_control) // 2]
x_test_half = x_test[:len(x_test) // 2]

sp.ttest_ind(x_test_half, x_control_half, equal_var=False)
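
Another way to judge the test length is to estimate how many users each group needs before the test starts. The sketch below uses the standard normal-approximation formula for a two-sided two-proportion test with unequal group sizes. It is powered for the conversion-rate difference rather than for revenue. The significance level, the power, the 2:1 split and the daily traffic figure are all assumptions, and `required_sample_sizes` and `days_needed` are hypothetical helper names.

```python
from math import ceil
from scipy.stats import norm

def required_sample_sizes(p_control, p_test, alpha=0.05, power=0.8, ratio=0.5):
    """Users needed per group for a two-sided two-proportion z-test.

    ratio is n_test / n_control (0.5 mirrors the roughly 66/33 split).
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    # Variance of the difference in proportions, scaled to one control user
    variance = p_control * (1 - p_control) + p_test * (1 - p_test) / ratio
    n_control = (z_alpha + z_beta) ** 2 * variance / (p_control - p_test) ** 2
    return ceil(n_control), ceil(n_control * ratio)

def days_needed(n_control, daily_users, ratio=0.5):
    """Days until the control group reaches n_control users."""
    control_share = 1 / (1 + ratio)
    return ceil(n_control / (daily_users * control_share))

# Conversion rates observed in this test; alpha and power are assumed
n_c, n_t = required_sample_sizes(0.0199, 0.0156)
print(n_c, n_t)

# 3500 users per day is a hypothetical traffic figure
print(days_needed(n_c, daily_users=3500))
```

Fixing the sample size in advance avoids rerunning the t-test repeatedly as data arrives, which inflates the false-positive rate.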

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0, 39,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0, 39,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 39,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 39,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0