### Banner segment of an online store
#### using CI and p-value

#### intro:
We are an online store selling sports goods: clothing, shoes, accessories and sports nutrition. On the main page we show users banners to stimulate sales. Currently, one of 5 banners is displayed there at random. Each banner advertises either a specific product or the company as a whole. Our marketers believe that the response to banners can vary by segment, and that banner effectiveness may depend on user behavior.
The company's manager has received an offer from partners to sell this banner slot and advertise another service there (payment is assumed to follow the CPC model).
Help the manager make a decision.

In [20]:
import numpy as np 
import pandas as pd 
import scipy.stats as st
import math

##### data explanation
* order_id - unique purchase number (NA for banner clicks and impressions)
* user_id - unique identifier of the client
* page_id - unique page number for event bundle (NA for purchases)
* product - banner / purchase product
* site_version - version of the site (mobile or desktop)
* time - time of the action
* title - type of event (banner_show, banner_click or order)
* target - target class (1 for order events, 0 otherwise, as the aggregation later in this notebook shows)

In [2]:
# read data
# link: https://www.kaggle.com/datasets/podsyp/how-to-do-product-analytics?resource=download
df = pd.read_csv("../data/product.csv", sep=",")
df.head()

Unnamed: 0,order_id,user_id,page_id,product,site_version,time,title,target
0,cfcd208495d565ef66e7dff9f98764da,c81e728d9d4c2f636f067f89cc14862c,6f4922f45568161a8cdf4ad2299f6d23,sneakers,desktop,2019-01-11 09:24:43,banner_click,0
1,c4ca4238a0b923820dcc509a6f75849b,eccbc87e4b5ce2fe28308fd9f2a7baf3,4e732ced3463d06de0ca9a15b6153677,sneakers,desktop,2019-01-09 09:38:51,banner_show,0
2,c81e728d9d4c2f636f067f89cc14862c,eccbc87e4b5ce2fe28308fd9f2a7baf3,5c45a86277b8bf17bff6011be5cfb1b9,sports_nutrition,desktop,2019-01-09 09:12:45,banner_show,0
3,eccbc87e4b5ce2fe28308fd9f2a7baf3,eccbc87e4b5ce2fe28308fd9f2a7baf3,fb339ad311d50a229e497085aad219c7,company,desktop,2019-01-03 08:58:18,banner_show,0
4,a87ff679a2f3e71d9181a67b7542122c,eccbc87e4b5ce2fe28308fd9f2a7baf3,fb339ad311d50a229e497085aad219c7,company,desktop,2019-01-03 08:59:15,banner_click,0


##### Assumption: User site version affects user behaviour
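* H0 = user site version does not affect selection (the order rate is the same on desktop and mobile)
* H1 = user site version affects selection (the order rates differ)

significance level (alpha) = **0.05**, which corresponds to a **95% confidence interval**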

practical significance = **0.01**
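
For reference, the quantities computed by the `abStatistics` function defined below can be written out as follows. The notation is ours: $\hat{p}$ is the pooled order probability (`prob`), $n_1, n_2$ are the group sizes, $\bar{x}_1, \bar{x}_2$ the group means, and $s_1, s_2$ the group standard deviations.

$$
\begin{aligned}
SE &= \sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \\
CI &= (\bar{x}_1-\bar{x}_2) \pm z_{1-\alpha/2}\cdot SE \\
z &= \frac{\bar{x}_1-\bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \\
p &= 2\,\bigl(1-\Phi(|z|)\bigr)
\end{aligned}
$$

The confidence interval uses the pooled standard error, while the z-score uses the unpooled one. The null hypothesis is rejected when $p \le \alpha$, and the difference is treated as practically significant when the lower bound of the interval exceeds the practical significance threshold.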

In [3]:
def abStatistics(n1, n2, prob, alpha, x1, x2, practical_significance, std1, std2):
    # Pooled standard error of the difference between the two proportions
    SE = math.sqrt(prob * (1 - prob) * (1/n1 + 1/n2))
    # Two-sided critical value; stored separately so alpha is not overwritten
    z_crit = st.norm.ppf(1 - alpha/2)
    margin_of_error = SE * z_crit

    mean_diff = x1 - x2
    lower_bound = mean_diff - margin_of_error
    upper_bound = mean_diff + margin_of_error

    # Two-sided z-test using the unpooled standard error
    zscore = mean_diff / math.sqrt((math.pow(std1, 2)/n1) + (math.pow(std2, 2)/n2))
    pvalue = st.norm.sf(abs(zscore)) * 2

    # Practical significance: the whole CI must lie above the threshold
    if practical_significance < lower_bound:
        print("Reject null Hypothesis")
    else:
        print("Fail to reject null Hypothesis")

    # Statistical significance: compare the p-value with alpha itself
    if pvalue <= alpha:
        print("Reject null Hypothesis, Result is statistical significant")
    else:
        print("Fail to reject null Hypothesis, Result not statistical significant")

    print(f"Standard Error: {SE}, margin of error: {margin_of_error},\nCI ({lower_bound},{upper_bound})")

In [12]:
df2 = df.groupby(["site_version", "title"]).agg({"target":"sum", "user_id":"count"}).reset_index()
df2

Unnamed: 0,site_version,title,target,user_id
0,desktop,banner_click,0,115065
1,desktop,banner_show,0,2134639
2,desktop,order,133181,133181
3,mobile,banner_click,0,714119
4,mobile,banner_show,0,5258675
5,mobile,order,115541,115541
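
Since the partner offer is priced per click, it may help to turn the counts above into rates per site version. The following is a minimal sketch, not part of the original analysis: the counts are copied by hand from the table above, `segment_rates` is a hypothetical helper, and treating clicks divided by shows as the click-through rate assumes every click row corresponds to a recorded impression.

```python
# Event counts copied from the aggregated table above (site_version x title).
counts = {
    "desktop": {"banner_click": 115065, "banner_show": 2134639, "order": 133181},
    "mobile": {"banner_click": 714119, "banner_show": 5258675, "order": 115541},
}


def segment_rates(events):
    """Hypothetical helper: click-through rate and orders per event for one segment."""
    total_events = sum(events.values())
    return {
        # Assumption: CTR is clicks divided by impressions.
        "ctr": events["banner_click"] / events["banner_show"],
        # Same definition the notebook uses for the group means: orders / all events.
        "orders_per_event": events["order"] / total_events,
    }


rates = {version: segment_rates(events) for version, events in counts.items()}
for version, r in rates.items():
    print(f"{version}: CTR={r['ctr']:.4f}, orders per event={r['orders_per_event']:.4f}")
```

The `orders_per_event` figure uses the same denominator (all events of a site version) as the group means computed in the following cells, so the two can be compared directly.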


In [15]:
# Desktop: number of events, number of orders, orders per event, and std of the order indicator
total_desktop = df["target"][df["site_version"] == "desktop"].count()
total_desktop_order = df["target"][(df["site_version"] == "desktop") & (df["title"] == "order")].sum()
mean_desktop_order = total_desktop_order/total_desktop
std_desktop_order = df["target"][df["site_version"] == "desktop"].std()

In [16]:
# Mobile: number of events, number of orders, orders per event, and std of the order indicator
total_mobile = df["target"][df["site_version"] == "mobile"].count()
total_mobile_order = df["target"][(df["site_version"]=="mobile")&(df["title"]=="order")].sum()
mean_mobile_order = total_mobile_order/total_mobile
std_mobile_order = df["target"][df["site_version"] == "mobile"].std()

In [6]:
# Pooled probability of an order across both site versions
prob = (total_desktop_order+total_mobile_order)/ (total_mobile + total_desktop)
prob

0.029360824060761025

In [19]:
# Calculate the statistics
alpha = 0.05  # significance level (95% confidence level)
practical_sig = 0.01
abStatistics(total_desktop, total_mobile, prob, alpha, mean_desktop_order, mean_mobile_order, practical_sig, std_desktop_order, std_mobile_order)

Reject null Hypothesis
Reject null Hypothesis, Result is statistical significant
Standard Error: 0.00012899865925454382, margin of error: 0.00025283272619286044,
CI (0.0366603828279715,0.03716604828035722)
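
As a sanity check, the standard error, margin of error and confidence interval above can be recomputed from the aggregated counts alone, without the raw data. The sketch below uses only the Python standard library (`statistics.NormalDist` in place of `scipy.stats.norm`); `two_proportion_check` is a hypothetical helper, and the counts are copied by hand from the `df2` table. Its z-score uses the population variance p(1-p), whereas pandas' `.std()` uses the sample standard deviation, so the z-score may differ very slightly from the notebook's.

```python
import math
from statistics import NormalDist

# Counts copied from the df2 table earlier in the notebook.
n_desktop = 115065 + 2134639 + 133181   # all desktop events
n_mobile = 714119 + 5258675 + 115541    # all mobile events
orders_desktop, orders_mobile = 133181, 115541


def two_proportion_check(x1, n1, x2, n2, alpha=0.05):
    """Hypothetical helper: pooled-SE confidence interval and unpooled z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Pooled standard error, as used for the confidence interval in the notebook
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * se_pooled
    diff = p1 - p2
    # Unpooled standard error for the z-score
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = diff / se_unpooled
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "se": se_pooled,
        "margin": margin,
        "ci": (diff - margin, diff + margin),
        "z": z,
        "p_value": p_value,
    }


result = two_proportion_check(orders_desktop, n_desktop, orders_mobile, n_mobile)
print(result)
```

If the counts are right, `result["se"]`, `result["margin"]` and `result["ci"]` should agree with the values printed by `abStatistics` up to floating-point rounding.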
