# Introduction
- This notebook compares tool-window session durations for **manual** vs **auto** opens.
- It contains the following sections:
  - Loading the `sessions.csv` file
  - Computing descriptive statistics
  - Visualizing distributions (interactive Plotly charts)
  - Running normality / variance checks and the main hypothesis tests:
    - t-tests (Student’s and Welch’s, depending on variance)
    - Mann–Whitney U (nonparametric)
    - Kolmogorov–Smirnov (distributional comparison)
  - Computing effect sizes (Cohen’s d, Cliff’s delta)
  - Summarizing conclusions

# Imports

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats
import math

# Load sessions and quick checks

In [None]:
sessions = pd.read_csv(r"data/sessions.csv")

In [None]:
print("Columns:", ", ".join(sessions.columns))

In [None]:
sessions['open_type'].value_counts(dropna=False)

# Descriptive Statistics and Data Visualization

In [None]:
# Create the two series we'll analyze
manual_dur = sessions[sessions['open_type'] == 'manual']['duration_sec']
auto_dur   = sessions[sessions['open_type'] == 'auto']['duration_sec']

# Descriptive statistics table
def describe_series(s):
    return {
        'count': len(s),
        'mean': s.mean(),
        'median': s.median(),
        'std': s.std(),
        'min': s.min(),
        '25%': s.quantile(0.25),
        '75%': s.quantile(0.75),
        'max': s.max(),
        'skew': s.skew(),
        'kurtosis': s.kurtosis()
    }

desc = pd.DataFrame({
    'manual': describe_series(manual_dur),
    'auto':   describe_series(auto_dur)
}).T

print("Descriptive Statistics Table:")
desc.round(3)

In [None]:
# Histogram overlay
fig = px.histogram(sessions, x='duration_sec', color='open_type', barmode='overlay',
                   nbins=80, range_x=[0, sessions['duration_sec'].quantile(0.99)],
                   labels={'duration_sec':'Duration (s)', 'open_type':'Open Type'},
                   title='Duration histograms (trimmed at 99th percentile)')
fig.update_layout(legend_title_text='Open Type')
fig.show()

In [None]:
# Box plot on a linear axis (pass log_y=True to px.box if the skew compresses the boxes)
fig2 = px.box(sessions, x='open_type', y='duration_sec', color='open_type',
              labels={'open_type':'Open Type','duration_sec':'Duration (s)'},
              title='Boxplot of durations by open_type')
fig2.update_layout(legend_title_text='Open Type')
fig2.show()

In [None]:
# Violin plot
fig3 = px.violin(sessions, x='open_type', y='duration_sec', color='open_type', box=True, points='outliers',
                 labels={'open_type':'Open Type','duration_sec':'Duration (s)'},
                 title='Violin plot of durations by open_type')
fig3.update_layout(legend_title_text='Open Type')
fig3.show()

In [None]:
# ECDF plot
def ecdf(values):
    """Return sorted values and their cumulative proportions."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

mx, my = ecdf(manual_dur.values)
ax, ay = ecdf(auto_dur.values)
ecdf_fig = go.Figure()
ecdf_fig.add_trace(go.Scatter(x=mx, y=my, mode='lines', name='manual'))
ecdf_fig.add_trace(go.Scatter(x=ax, y=ay, mode='lines', name='auto'))
ecdf_fig.update_layout(title='Empirical CDF of durations', xaxis_title='Duration (s)', yaxis_title='ECDF', legend_title_text='Open Type')
ecdf_fig.show()

# Normality and Variance Tests

In [None]:
# Normality and variance checks (Shapiro-Wilk, D'Agostino K^2, Levene)
# Note: SciPy warns that the Shapiro-Wilk p-value may be inaccurate for n > 5000,
# so larger series are randomly subsampled to sample_n before testing
def shapiro_safe(series, sample_n=5000):
    n = len(series)
    if n > sample_n:
        sample = series.sample(sample_n, random_state=42)
        print(f"Shapiro test on random sample of {sample_n} (original n={n})")
        return stats.shapiro(sample)
    else:
        return stats.shapiro(series)

print("Shapiro (manual):", shapiro_safe(manual_dur))
print("Shapiro (auto):  ", shapiro_safe(auto_dur))

# D'Agostino's K^2 (ok for larger n)
print("\nD'Agostino's K^2 (manual):", stats.normaltest(manual_dur))
print("D'Agostino's K^2 (auto):  ", stats.normaltest(auto_dur))

# Variance equality (Levene)
levene_res = stats.levene(manual_dur, auto_dur)
print("\nLevene test for equal variances p-value:", levene_res.pvalue)

# Hypothesis Tests

In [None]:
# Hypothesis tests
alpha = 0.05

# T-tests
tt_equal = stats.ttest_ind(manual_dur, auto_dur, equal_var=True, nan_policy='omit')
tt_welch = stats.ttest_ind(manual_dur, auto_dur, equal_var=False, nan_policy='omit')
print("Student t-test (equal var) p-value:", tt_equal.pvalue)
print("Welch t-test (unequal var) p-value: ", tt_welch.pvalue)

# Mann-Whitney U (nonparametric)
mw = stats.mannwhitneyu(manual_dur, auto_dur, alternative='two-sided')
print("Mann-Whitney U p-value:", mw.pvalue)

# Kolmogorov-Smirnov (distributional difference)
ks = stats.ks_2samp(manual_dur, auto_dur)
print("KS two-sample p-value:", ks.pvalue)

# Effect Size

In [None]:
# Effect sizes

# Cohen's d (standardized by the root mean of the two group variances)
def cohens_d(a, b):
    ma, mb = a.mean(), b.mean()
    sa2, sb2 = a.var(ddof=1), b.var(ddof=1)
    # Unweighted average of the two variances; this is not the n-weighted pooled SD,
    # and the two differ when group sizes and variances are both unequal
    avg_sd = math.sqrt((sa2 + sb2) / 2.0)
    return (ma - mb) / avg_sd

print("Cohen's d (manual - auto):", round(cohens_d(manual_dur, auto_dur), 3))

# Cliff's delta (nonparametric effect size): P(a > b) - P(a < b), ranging from -1 to 1
def cliffs_delta(a, b):
    # pairwise sign count: each element of a is compared against all of b at once
    a = np.array(a)
    b = np.array(b)
    gt = 0
    lt = 0
    for x in a:
        gt += np.sum(x > b)
        lt += np.sum(x < b)
    n = len(a) * len(b)
    return (gt - lt) / n

print("Cliff's delta (manual vs auto):", round(cliffs_delta(manual_dur, auto_dur), 4))

# Robustness Checks

In [None]:
# Bootstrap 95% CI for median difference
def bootstrap_median_diff(a, b, n_boot=5000, seed=0):
    rng = np.random.RandomState(seed)
    diffs = []
    a_vals = np.array(a)
    b_vals = np.array(b)
    for _ in range(n_boot):
        s1 = rng.choice(a_vals, size=len(a_vals), replace=True)
        s2 = rng.choice(b_vals, size=len(b_vals), replace=True)
        diffs.append(np.median(s1) - np.median(s2))
    return np.percentile(diffs, [2.5, 97.5])

ci = bootstrap_median_diff(manual_dur.values, auto_dur.values)
print("Bootstrap 95% CI for median(manual) - median(auto):", ci)

# Compute per-user medians
user_medians = sessions.groupby(['user_id','open_type'])['duration_sec'].median().reset_index()
manual_user = user_medians[user_medians['open_type']=='manual']['duration_sec']
auto_user   = user_medians[user_medians['open_type']=='auto']['duration_sec']

# Descriptive
print("Users with manual median count:", len(manual_user))
print("Users with auto median count:", len(auto_user))

# Nonparametric test on per-user medians
print("Mann-Whitney U on per-user medians p-value:", stats.mannwhitneyu(manual_user, auto_user, alternative='two-sided').pvalue)

# Welch t-test on log1p-transformed durations (the log reduces the right skew)
log_manual = np.log1p(manual_dur)
log_auto   = np.log1p(auto_dur)

print("Welch t-test on log1p durations p-value:", stats.ttest_ind(log_manual, log_auto, equal_var=False).pvalue)

# Conclusion Summary on Statistical Analysis
- Both manual and automatic **session durations** are strongly **right-skewed** and **non-normal**.
- Statistical analysis consistently shows a **significant difference** between the two groups.
- **Automatic** tool window opens **remain active considerably longer** than **manual** ones.
- The **difference** is **robust** across multiple statistical approaches and **holds** at both **session** and **user** levels.
- Visualizations **confirm** the same pattern — **automatic** sessions show a **wider spread** and **higher median duration**.

# Detailed Conclusions on the findings for each chapter in the Statistical Analysis Notebook
## Descriptive Statistics and Visualization
- The dataset `sessions.csv` contains **1575** valid sessions:
    - **957** opened automatically
    - **618** opened manually
- Descriptive statistics table (all durations in seconds):

| Group  | Count | Mean       | Median | Std     | Min   | 25%   | 75%    | Max      | Skew | Kurtosis |
| ------ | ----- | ---------- | ------ | ------- | ----- | ----- | ------ | -------- | ---- | -------- |
| Manual | 618   | 575.67     | 11.86  | 2613.95 | 0.015 | 2.14  | 132.55 | 29639.70 | 7.88 | 72.33    |
| Auto   | 957   | 1588.78    | 159.60 | 4414.78 | 0.15  | 31.66 | 890.96 | 34665.65 | 4.75 | 25.49    |
- From the output of the descriptive statistics table we can conclude that:
    - Both groups are highly right-skewed (many short sessions and a few very long ones).
    - The mean duration is larger for automatic opens, and medians show a large practical difference ('**manual**' median ≈ **11.86 s**, '**auto**' median ≈ **159.60 s**). The large difference between mean and median indicates that a **small number of very long sessions inflate the mean**.
- Visualizations (histogram, boxplot, violin, ECDF) **confirmed these conclusions**: automatic sessions are generally longer and more spread out.
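
The gap between mean and median can be illustrated with a small synthetic example. The sketch below uses made-up durations (not values from `sessions.csv`) to show how a couple of very long sessions move the mean far more than the median.

```python
import statistics

# Hypothetical durations in seconds (illustrative only, not taken from sessions.csv)
typical = [2.0, 5.0, 8.0, 12.0, 20.0, 35.0, 60.0, 90.0]
long_tail = [15000.0, 29000.0]  # two very long sessions

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)

# Add the long sessions and recompute both summaries
with_tail = typical + long_tail
mean_after = statistics.mean(with_tail)
median_after = statistics.median(with_tail)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

The mean grows by more than two orders of magnitude while the median stays in the same range, which is why the medians are the more informative summary for these groups.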
## Normality and Variance Tests
| Test          | Group   | p-value       | Interpretation                 |
| ------------- |---------|---------------|--------------------------------|
| Shapiro–Wilk  | manual  | 9.19e-45      | Both p ≪ 0.05 → **Not normal** |
|               | auto    | 8.54e-49      |                                |
| D’Agostino K² | manual  | 2.39e-192     | Confirms **non-normality**     |
|               | auto    | 2.70e-204     |                                |
| Levene’s test |         | 7.98e-07      | **Variances unequal**          |

- Conclusions:
    - Both distributions are non-normal with unequal variances, so Student’s t-test, which assumes equal variances, is not reliable here.
    - Therefore, Welch’s t-test (for means) and nonparametric tests (for medians/ranks) are more appropriate.
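
To make the variance check more concrete, here is a minimal sketch of the statistic behind Levene's test with median centering (the Brown-Forsythe variant), which is the default `center='median'` in `scipy.stats.levene`. The function name and the sample values are illustrative assumptions. Only the statistic is computed; the p-value would come from an F distribution with (k - 1, N - k) degrees of freedom, which SciPy handles.

```python
import statistics

def brown_forsythe_stat(*groups):
    """Levene-type W statistic using absolute deviations from group medians."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group's median
    z = []
    for g in groups:
        med = statistics.median(g)
        z.append([abs(x - med) for x in g])
    z_means = [statistics.mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # One-way ANOVA decomposition applied to the deviations
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical samples: the second group is the first one scaled by 2
narrow = [1, 2, 3, 4, 5]
wide = [2, 4, 6, 8, 10]
w_stat = brown_forsythe_stat(narrow, wide)
print(f"W = {w_stat:.4f}")
```

Centering on the median rather than the mean makes the statistic less sensitive to the heavy right tails seen in both groups.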
## Hypothesis Tests
| Test                         | p-value  | Interpretation                                            |
| ---------------------------- | -------- |-----------------------------------------------------------|
| Student’s t-test (equal var) | 2.9e-07  | significant, but **unreliable**: the equal-variance assumption is violated |
| Welch’s t-test (unequal var) | 1.3e-08  | **significant** → mean durations differ                   |
| Mann–Whitney U               | 2.15e-59 | **highly significant** → median/rank distributions differ |
| Kolmogorov–Smirnov           | 3.50e-51 | strong evidence **distributions differ** overall          |
- Conclusion:
    - All tests indicate that the duration distributions for manual vs. auto opens are significantly different.
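    - Given non-normality and unequal variances, the Mann–Whitney U and KS tests are the most reliable here. Both are two-sided, so they establish that the groups differ; the direction (**automatic sessions** tend to be **longer**) follows from the medians and the sign of Cliff’s delta.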
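
As a rough cross-check, the Welch statistic can be recomputed from the rounded summary values in the descriptive statistics table. The sketch below is an approximation under two assumptions: the inputs are the rounded table values, and the p-value uses a normal approximation instead of the t distribution (reasonable only because the degrees of freedom are large). The helper name is hypothetical.

```python
import math

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2  # squared standard errors of each mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Rounded values from the descriptive statistics table (manual first, then auto)
t_stat, df = welch_from_summary(575.67, 2613.95, 618, 1588.78, 4414.78, 957)

# Two-sided p-value under a normal approximation to the t distribution
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.3f}, df = {df:.1f}, approx p = {p_approx:.2e}")
```

The approximate p-value lands in the same order of magnitude as the Welch p-value reported in the table, and the negative sign of `t` reflects the smaller manual mean.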
## Effect Sizes
| Metric        | Value  | Interpretation                              |
| ------------- | ------ |---------------------------------------------|
| Cohen’s d     | -0.279 | Small difference in means                   |
| Cliff’s delta | -0.484 | Medium-to-large effect size (auto > manual) |
- Conclusion:
    - Cohen’s d (mean-based) is small because means are influenced by extreme outliers.
    - Cliff’s delta (rank-based) ≈ −0.484 indicates a substantial practical difference: since δ = P(manual > auto) − P(manual < auto), and assuming ties are negligible, the automatic session is the longer one in roughly 74% of manual/auto pairs.
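
Both effect sizes can be sanity-checked with a short standard-library sketch. The function names are illustrative assumptions. `cohens_d_avg_var` uses the same average-variance denominator as the notebook and takes the rounded table values as input; `cliffs_delta_sorted` is an alternative to the pairwise loop that uses binary search on a sorted copy of the second sample (with NumPy arrays, `np.searchsorted` plays the same role).

```python
import math
from bisect import bisect_left, bisect_right

def cohens_d_avg_var(m1, s1, m2, s2):
    """Cohen's d standardized by the root of the average of the two variances."""
    return (m1 - m2) / math.sqrt((s1**2 + s2**2) / 2.0)

def cliffs_delta_sorted(a, b):
    """Cliff's delta via binary search on sorted b: O((n + m) log m)."""
    b_sorted = sorted(b)
    m = len(b_sorted)
    gt = lt = 0
    for x in a:
        gt += bisect_left(b_sorted, x)       # count of b values strictly below x
        lt += m - bisect_right(b_sorted, x)  # count of b values strictly above x
    return (gt - lt) / (len(a) * m)

def prob_b_longer(delta):
    """P(b > a) for a random pair, assuming ties are negligible."""
    return (1.0 - delta) / 2.0

# Cohen's d from the rounded table values (manual first, then auto)
d = cohens_d_avg_var(575.67, 2613.95, 1588.78, 4414.78)
print("Cohen's d:", round(d, 3))

# Probability that the auto session is longer, from the reported delta
print("P(auto > manual):", round(prob_b_longer(-0.484), 3))
```

Ties are counted in neither `gt` nor `lt`, which matches the strict comparisons in the pairwise definition.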
## Robustness Checks
- Robustness checks (bootstrap CI for the median difference, per-user median test, and Welch test on log-transformed durations) all confirm the main result: **automatic opens remain open significantly longer than manual opens** (median(manual) − median(auto) ≈ **−148 seconds**, bootstrap 95% CI ≈ [**−181, −124**]).