#### Statistical Data Analysis

Dataset: 

- _fs_norm.csv_

Author: Luis Sergio Pastrana Lemus  
Date: 2025-09-09

# Statistical Inference Data Analysis – xxx Activity Dataset

## __1. Libraries__.

In [1]:
from pathlib import Path
import sys

# Define the project root dynamically: take the notebook's working directory and move one level up
project_root = Path.cwd().parent

# Add the project root to sys.path if it is not already there, so the src package can be imported
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import the project helpers (wildcard import; explicit names would be more controlled)
from src import *


from IPython.display import display, HTML
import os
import pandas as pd
import scipy.stats as st
from statsmodels.stats.proportion import proportions_ztest
import numpy as np


## __2. Path to Data file__.

In [2]:
# Build the path to the data file and load it
data_file_path = project_root / "data" / "processed" / "clean"
df_fs = load_dataset_from_csv(data_file_path, "fs_norm.csv", sep=',', header='infer')

## __3. Statistical Data Analysis__.

### 3.1  Inferential Tests.

#### 3.1.0  Data Analysis prior to A/B test

In [3]:
df_fs

Unnamed: 0,eventname,deviceidhash,datetime,expid,date,time
0,tutorial,37374620466...,2019-08-01 ...,246,2019-08-01,00:07:28
1,mainscreena...,37374620466...,2019-08-01 ...,246,2019-08-01,00:08:00
2,mainscreena...,37374620466...,2019-08-01 ...,246,2019-08-01,00:08:55
3,offersscree...,37374620466...,2019-08-01 ...,246,2019-08-01,00:08:58
4,mainscreena...,14338408838...,2019-08-01 ...,247,2019-08-01,00:08:59
...,...,...,...,...,...,...
240882,mainscreena...,45996283640...,2019-08-07 ...,247,2019-08-07,21:12:25
240883,mainscreena...,58498066124...,2019-08-07 ...,246,2019-08-07,21:13:59
240884,mainscreena...,57469699388...,2019-08-07 ...,246,2019-08-07,21:14:43
240885,mainscreena...,57469699388...,2019-08-07 ...,246,2019-08-07,21:14:58


In [4]:
# Number of unique users in each group
df_fs_users_group = df_fs.groupby('expid')['deviceidhash'].nunique().reset_index()
df_fs_users_group = df_fs_users_group.rename(columns={'expid': 'group', 'deviceidhash': 'users'})
df_fs_users_group

Unnamed: 0,group,users
0,246,2484
1,247,2513
2,248,2537
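
The group sizes above can be checked for a sample ratio mismatch before any rates are compared. The following is a minimal sketch, not part of the original notebook: it assumes the experiment intended an equal split across the three groups and applies a chi-square goodness-of-fit test using only the standard library.

```python
import math

# Unique users per experiment group, copied from the table above.
users_per_group = {246: 2484, 247: 2513, 248: 2537}

total_users = sum(users_per_group.values())

# Assumption: the intended allocation is an equal split across the three groups.
expected = total_users / len(users_per_group)

# Pearson chi-square goodness-of-fit statistic.
chi2_stat = sum((observed - expected) ** 2 / expected
                for observed in users_per_group.values())

# Three groups give 2 degrees of freedom, and for 2 degrees of freedom
# the chi-square survival function has the closed form exp(-x / 2).
srm_p_value = math.exp(-chi2_stat / 2)

print(f"chi2 = {chi2_stat:.3f}, p-value = {srm_p_value:.3f}")
```

A p-value well above 0.05 here is consistent with the equal-split assumption. The check says nothing about whether individual users appear in more than one group.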


#### 3.1.1  A/A Control groups Analysis

Hypothesis(0): There is no statistically significant difference between control group A1 (246) and control group A2 (247).  
Hypothesis(1): There is a statistically significant difference between control group A1 (246) and control group A2 (247).

In [5]:
# Select the most popular event. For each of the control groups, determine the number of users who performed this action. Calculate the percentage.
events = df_fs['eventname'].value_counts()
events.name = 'events'
events = events.reset_index()
events

Unnamed: 0,eventname,events
0,mainscreena...,117328
1,offersscree...,46333
2,cartscreena...,42303
3,paymentscre...,33918
4,tutorial,1005


In [6]:
popular_event = df_fs['eventname'].mode()[0]
popular_event

'mainscreenappear'

In [7]:
total_users_group = df_fs.groupby('expid')['deviceidhash'].nunique().reset_index()
total_users_group.columns = ['group', 'total_users']
total_users_group

Unnamed: 0,group,total_users
0,246,2484
1,247,2513
2,248,2537


In [8]:
event_users_group = df_fs.groupby(['expid', 'eventname'])['deviceidhash'].nunique().reset_index()
event_users_group.columns = ['group', 'eventname', 'event_users']
event_users_group

Unnamed: 0,group,eventname,event_users
0,246,cartscreena...,1266
1,246,mainscreena...,2450
2,246,offersscree...,1542
3,246,paymentscre...,1200
4,246,tutorial,278
5,247,cartscreena...,1238
6,247,mainscreena...,2476
7,247,offersscree...,1520
8,247,paymentscre...,1158
9,247,tutorial,283


In [9]:
df_event_group = event_users_group.merge(total_users_group, on='group')
df_event_group['eventrate'] = (df_event_group['event_users'] / df_event_group['total_users'] * 100).round(3)
df_event_group

Unnamed: 0,group,eventname,event_users,total_users,eventrate
0,246,cartscreena...,1266,2484,50.966
1,246,mainscreena...,2450,2484,98.631
2,246,offersscree...,1542,2484,62.077
3,246,paymentscre...,1200,2484,48.309
4,246,tutorial,278,2484,11.192
5,247,cartscreena...,1238,2513,49.264
6,247,mainscreena...,2476,2513,98.528
7,247,offersscree...,1520,2513,60.485
8,247,paymentscre...,1158,2513,46.08
9,247,tutorial,283,2513,11.261
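
The event rates above can be compared between two groups with a pooled two-proportion z-test. As a cross-check, here is a hand-computed sketch using only the standard library. The helper name `two_proportion_ztest` is hypothetical, and the counts are the `paymentscreensuccessful` rows of groups 246 and 247 from the table above.

```python
import math


def two_proportion_ztest(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b

    # Under H0 both groups share one rate, estimated from the pooled counts.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))

    z = (rate_a - rate_b) / std_error

    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# paymentscreensuccessful: 1200 of 2484 users (group 246) vs 1158 of 2513 users (group 247)
z_stat, p_val = two_proportion_ztest(1200, 2484, 1158, 2513)
print(f"z = {z_stat:.3f}, p-value = {p_val:.3f}")
```

With its default settings, `proportions_ztest` also uses a pooled variance estimate, so the two results should agree up to floating-point error.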


In [None]:
# α = 0.05
def control_groups_proportions_statistical_difference(df):
    """Run a two-sided two-proportion z-test per event between control groups A1 (246) and A2 (247)."""

    events = df['eventname'].unique()

    for event in events:

        # Rows of the current event for each control group
        group_a1 = df.loc[(df['eventname'] == event) & (df['group'] == 246)]
        group_a2 = df.loc[(df['eventname'] == event) & (df['group'] == 247)]

        # Users who performed the event (successes) and total users (observations) per group
        actions_event = np.array([group_a1['event_users'].iloc[0], group_a2['event_users'].iloc[0]])
        events_event = np.array([group_a1['total_users'].iloc[0], group_a2['total_users'].iloc[0]])

        display(HTML(f"> Event: <b>{event.upper()}</b>"))

        stats, p_value = proportions_ztest(actions_event, events_event, alternative="two-sided")
        display(HTML(f"> Z-statistic: {stats}"))
        display(HTML(f"> p-value: {p_value}"))

        if p_value <= 0.05:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>rejected</b>, meaning there is enough statistical evidence that the <b>conversion rates</b> of Group A1 (246) and Group A2 (247) are <b>different</b>."))
        else:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>not rejected</b>, meaning there is not enough statistical evidence that the <b>conversion rates</b> of Group A1 (246) and Group A2 (247) are different."))

        print()
 

In [11]:
control_groups_proportions_statistical_difference(df_event_group)

In [12]:
# Confirm that the groups were divided correctly
pivot_rates = df_event_group.pivot(index='eventname', columns='group', values='eventrate')
pivot_rates

eventname,246,247,248
cartscreenappear,50.966,49.264,48.482
mainscreenappear,98.631,98.528,98.266
offersscreenappear,62.077,60.485,60.347
paymentscreensuccessful,48.309,46.08,46.551
tutorial,11.192,11.261,10.997


#### 3.1.2  A1/B Control-Test groups Analysis

Hypothesis(0): There is no statistically significant difference between control group A1 (246) and test group B (248).  
Hypothesis(1): There is a statistically significant difference between control group A1 (246) and test group B (248).

In [13]:
def control_A1_test_proportions_statistical_difference(df):
    """Run a two-sided two-proportion z-test per event between control group A1 (246) and test group B (248)."""

    events = df['eventname'].unique()

    for event in events:

        # Rows of the current event for the control group and the test group
        group_a1 = df.loc[(df['eventname'] == event) & (df['group'] == 246)]
        group_b = df.loc[(df['eventname'] == event) & (df['group'] == 248)]

        # Users who performed the event (successes) and total users (observations) per group
        actions_event = np.array([group_a1['event_users'].iloc[0], group_b['event_users'].iloc[0]])
        events_event = np.array([group_a1['total_users'].iloc[0], group_b['total_users'].iloc[0]])

        display(HTML(f"> Event: <b>{event.upper()}</b>"))

        stats, p_value = proportions_ztest(actions_event, events_event, alternative="two-sided")
        display(HTML(f"> Z-statistic: {stats}"))
        display(HTML(f"> p-value: {p_value}"))

        if p_value <= 0.05:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>rejected</b>, meaning there is enough statistical evidence that the <b>conversion rates</b> of Group A1 (246) and Group B (248) are <b>different</b>."))
        else:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>not rejected</b>, meaning there is not enough statistical evidence that the <b>conversion rates</b> of Group A1 (246) and Group B (248) are different."))

        print()

In [14]:
control_A1_test_proportions_statistical_difference(df_event_group)

#### 3.1.3  A2/B Control-Test groups Analysis

Hypothesis(0): There is no statistically significant difference between control group A2 (247) and test group B (248).  
Hypothesis(1): There is a statistically significant difference between control group A2 (247) and test group B (248).

In [15]:
def control_A2_test_proportions_statistical_difference(df):
    """Run a two-sided two-proportion z-test per event between control group A2 (247) and test group B (248)."""

    events = df['eventname'].unique()

    for event in events:

        # Rows of the current event for the control group and the test group
        group_a2 = df.loc[(df['eventname'] == event) & (df['group'] == 247)]
        group_b = df.loc[(df['eventname'] == event) & (df['group'] == 248)]

        # Users who performed the event (successes) and total users (observations) per group
        actions_event = np.array([group_a2['event_users'].iloc[0], group_b['event_users'].iloc[0]])
        events_event = np.array([group_a2['total_users'].iloc[0], group_b['total_users'].iloc[0]])

        display(HTML(f"> Event: <b>{event.upper()}</b>"))

        stats, p_value = proportions_ztest(actions_event, events_event, alternative="two-sided")
        display(HTML(f"> Z-statistic: {stats}"))
        display(HTML(f"> p-value: {p_value}"))

        if p_value <= 0.05:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>rejected</b>, meaning there is enough statistical evidence that the <b>conversion rates</b> of Group A2 (247) and Group B (248) are <b>different</b>."))
        else:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>not rejected</b>, meaning there is not enough statistical evidence that the <b>conversion rates</b> of Group A2 (247) and Group B (248) are different."))

        print()

In [16]:
control_A2_test_proportions_statistical_difference(df_event_group)

#### 3.1.4  AU/B Control-Test groups Analysis

Hypothesis(0): There is no statistically significant difference between the unified control group AU (246/247) and test group B (248).  
Hypothesis(1): There is a statistically significant difference between the unified control group AU (246/247) and test group B (248).

In [17]:
df_event_group

Unnamed: 0,group,eventname,event_users,total_users,eventrate
0,246,cartscreena...,1266,2484,50.966
1,246,mainscreena...,2450,2484,98.631
2,246,offersscree...,1542,2484,62.077
3,246,paymentscre...,1200,2484,48.309
4,246,tutorial,278,2484,11.192
5,247,cartscreena...,1238,2513,49.264
6,247,mainscreena...,2476,2513,98.528
7,247,offersscree...,1520,2513,60.485
8,247,paymentscre...,1158,2513,46.08
9,247,tutorial,283,2513,11.261


In [18]:
df_event_group_AU = df_event_group.loc[(df_event_group['group'] == 246) | (df_event_group['group'] == 247), :]
df_event_group_AU

Unnamed: 0,group,eventname,event_users,total_users,eventrate
0,246,cartscreena...,1266,2484,50.966
1,246,mainscreena...,2450,2484,98.631
2,246,offersscree...,1542,2484,62.077
3,246,paymentscre...,1200,2484,48.309
4,246,tutorial,278,2484,11.192
5,247,cartscreena...,1238,2513,49.264
6,247,mainscreena...,2476,2513,98.528
7,247,offersscree...,1520,2513,60.485
8,247,paymentscre...,1158,2513,46.08
9,247,tutorial,283,2513,11.261


In [19]:
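# Pooled control-group size: count each group once (deduplicate by group, not by value,
# so that two groups of equal size would not collapse into one)
tu = df_event_group_AU.drop_duplicates('group')['total_users'].sum()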
tu

np.int64(4997)
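
The pooled denominator has to count each control group exactly once. A minimal sketch on a hypothetical toy frame (not the notebook's data) shows why deduplicating `total_users` by value is fragile: two groups that happen to have the same size collapse into one, while deduplicating by `group` does not.

```python
import pandas as pd

# Hypothetical toy frame: two groups that happen to have the same size.
toy = pd.DataFrame({
    'group': [1, 1, 2, 2],
    'eventname': ['a', 'b', 'a', 'b'],
    'total_users': [100, 100, 100, 100],
})

# Deduplicating by value keeps a single 100, so one group is lost.
by_value = toy['total_users'].unique().sum()

# Deduplicating by group keeps one row per group, so both sizes are summed.
by_group = toy.drop_duplicates('group')['total_users'].sum()

print(by_value, by_group)
```

For groups 246 and 247 the sizes differ (2484 and 2513), so both approaches give 4997 on this dataset.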

In [20]:
df_event_group_AU = df_event_group_AU.groupby('eventname')['event_users'].sum().reset_index()
df_event_group_AU['group'] = '246_247'
df_event_group_AU['total_users'] = tu
df_event_group_AU['eventrate'] = ((df_event_group_AU['event_users'] / df_event_group_AU['total_users']) * 100).round(3)
df_event_group_AU

Unnamed: 0,eventname,event_users,group,total_users,eventrate
0,cartscreena...,2504,246_247,4997,50.11
1,mainscreena...,4926,246_247,4997,98.579
2,offersscree...,3062,246_247,4997,61.277
3,paymentscre...,2358,246_247,4997,47.188
4,tutorial,561,246_247,4997,11.227


In [23]:
df_event_group_B = df_event_group.loc[(df_event_group['group'] == 248), :]
df_event_group_B = df_event_group_B.copy()
df_event_group_B['group'] = df_event_group_B['group'].astype(str)
df_event_group_B

Unnamed: 0,group,eventname,event_users,total_users,eventrate
10,248,cartscreena...,1230,2537,48.482
11,248,mainscreena...,2493,2537,98.266
12,248,offersscree...,1531,2537,60.347
13,248,paymentscre...,1181,2537,46.551
14,248,tutorial,279,2537,10.997


In [30]:
df_event_group_AB = pd.concat([df_event_group_AU, df_event_group_B], axis=0, ignore_index=True)
df_event_group_AB

Unnamed: 0,eventname,event_users,group,total_users,eventrate
0,cartscreena...,2504,246_247,4997,50.11
1,mainscreena...,4926,246_247,4997,98.579
2,offersscree...,3062,246_247,4997,61.277
3,paymentscre...,2358,246_247,4997,47.188
4,tutorial,561,246_247,4997,11.227
5,cartscreena...,1230,248,2537,48.482
6,mainscreena...,2493,248,2537,98.266
7,offersscree...,1531,248,2537,60.347
8,paymentscre...,1181,248,2537,46.551
9,tutorial,279,248,2537,10.997
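
Sections 3.1.1 to 3.1.4 repeat the same per-event comparison with different group labels. Below is a minimal sketch of a single parameterized helper that returns a results table instead of displaying HTML. The name `compare_groups` and its output columns are hypothetical. The z-test is written out by hand here to keep the sketch dependency-light; in the notebook itself `proportions_ztest` would take its place. The sample rows are copied from the combined table above.

```python
import math

import pandas as pd


def compare_groups(df, group_a, group_b, alpha=0.05):
    """Pooled two-proportion z-test per event between two groups of df."""
    rows = []
    for event in df['eventname'].unique():
        a = df[(df['eventname'] == event) & (df['group'] == group_a)].iloc[0]
        b = df[(df['eventname'] == event) & (df['group'] == group_b)].iloc[0]

        # Pooled rate under H0 and the standard error of the rate difference.
        pooled = (a['event_users'] + b['event_users']) / (a['total_users'] + b['total_users'])
        std_error = math.sqrt(pooled * (1 - pooled)
                              * (1 / a['total_users'] + 1 / b['total_users']))

        z = (a['event_users'] / a['total_users']
             - b['event_users'] / b['total_users']) / std_error
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

        rows.append({'event': event, 'z': z, 'p_value': p_value,
                     'reject_h0': p_value <= alpha})
    return pd.DataFrame(rows)


# Two events copied from the combined table above (pooled control vs test group).
sample = pd.DataFrame({
    'eventname': ['cartscreenappear', 'tutorial', 'cartscreenappear', 'tutorial'],
    'event_users': [2504, 561, 1230, 279],
    'group': ['246_247', '246_247', '248', '248'],
    'total_users': [4997, 4997, 2537, 2537],
})

results = compare_groups(sample, '246_247', '248')
print(results)
```

Returning a DataFrame keeps the statistics available for later steps, such as adjusting the significance level for the number of tests.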


In [31]:
def control_test_proportions_statistical_difference(df):
    """Run a two-sided two-proportion z-test per event between the unified control group (246_247) and test group B (248)."""

    events = df['eventname'].unique()

    for event in events:

        # Rows of the current event for the unified control group and the test group
        group_au = df.loc[(df['eventname'] == event) & (df['group'] == '246_247')]
        group_b = df.loc[(df['eventname'] == event) & (df['group'] == '248')]

        # Users who performed the event (successes) and total users (observations) per group
        actions_event = np.array([group_au['event_users'].iloc[0], group_b['event_users'].iloc[0]])
        events_event = np.array([group_au['total_users'].iloc[0], group_b['total_users'].iloc[0]])

        display(HTML(f"> Event: <b>{event.upper()}</b>"))

        stats, p_value = proportions_ztest(actions_event, events_event, alternative="two-sided")
        display(HTML(f"> Z-statistic: {stats}"))
        display(HTML(f"> p-value: {p_value}"))

        if p_value <= 0.05:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>rejected</b>, meaning there is enough statistical evidence that the <b>conversion rates</b> of Group A (246_247) and Group B (248) are <b>different</b>."))
        else:
            display(HTML("> Null Hypothesis (<i>H₀</i>) is <b>not rejected</b>, meaning there is not enough statistical evidence that the <b>conversion rates</b> of Group A (246_247) and Group B (248) are different."))

        print()

In [32]:
control_test_proportions_statistical_difference(df_event_group_AB)

### 3.2  Results

- Significance level used for the statistical hypothesis tests above: α = 0.05.
- Number of statistical hypothesis tests performed: 4 group comparisons (A/A, A1/B, A2/B, AU/B), each run once per event (5 events), for 20 individual z-tests in total.
- What should the significance level be? Please specify if you want to change it.
  Lowering it to α = 0.01 would not change the results: every test returned a p-value greater than 0.05, so each p-value is also greater than 0.01 and H₀ is not rejected in either case. A conclusion would change only if α were raised above one of the p-values. The event rates differ very little among the groups, which is why every p-value stays above α.
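
When choosing a significance level it helps to see how the number of tests affects the chance of at least one false positive. The sketch below assumes the four comparisons are each run on the five events (20 tests) and, as a simplification, treats the tests as independent. They are not strictly independent here, since the same groups are reused across comparisons. It computes the family-wise error rate at α = 0.05 and the per-test levels given by the Bonferroni and Šidák corrections.

```python
alpha = 0.05
n_tests = 4 * 5  # four group comparisons, five events each

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent.
fwer = 1 - (1 - alpha) ** n_tests

# Per-test significance levels that keep the family-wise error rate near alpha.
bonferroni_alpha = alpha / n_tests
sidak_alpha = 1 - (1 - alpha) ** (1 / n_tests)

print(f"FWER at alpha={alpha}: {fwer:.3f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
print(f"Sidak per-test alpha: {sidak_alpha:.4f}")
```

Because a stricter per-test level only makes rejection harder, any test that fails to reject H₀ at α = 0.05 also fails to reject it at the corrected level.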