#### Statistical Data Analysis
Dataset: 

- _revenue_total

Author: Luis Sergio Pastrana Lemus  
Date: 2025-05-14

# Statistical Data Analysis – Purchasing Activity Dataset

## __1. Libraries__

In [6]:
from IPython.display import display, HTML
import os
import pandas as pd
from pathlib import Path
import scipy.stats as st
from scipy.stats import ttest_ind
import sys


# Define the project root dynamically: take the notebook's current working directory and move one level up
project_root = Path.cwd().parent

# Add the project root to sys.path if it is not already there, so that the src package can be imported
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import only the helper used in this notebook (more controlled than import *)
from src import load_dataset_from_csv

## __2. Path to Data file__

In [2]:
# Build the path to the data file and load it
data_file_path = project_root / "data" / "processed" / "activity"
df_revenue = load_dataset_from_csv(data_file_path, "revenue_total.csv", sep=',', header='infer')
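The `load_dataset_from_csv` helper lives in `src` and its code is not shown in this notebook. The following is only a minimal sketch of what such a loader might look like, assuming it joins a folder and a file name and forwards `sep` and `header` to `pandas.read_csv`. The function name `load_csv_sketch` and the demo file are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_sketch(folder, filename, sep=",", header="infer"):
    """Hypothetical stand-in for src.load_dataset_from_csv (real code not shown)."""
    file_path = Path(folder) / filename
    # Fail early with a clear message instead of a long pandas traceback
    if not file_path.is_file():
        raise FileNotFoundError(f"Data file not found: {file_path}")
    return pd.read_csv(file_path, sep=sep, header=header)


# Demo on a throwaway file so the sketch runs without the project data
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("plan,month_revenue_total\nsurf,20.0\nultimate,70.0\n")
    demo_df = load_csv_sketch(tmp_dir, "demo.csv")

print(demo_df)
```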


## __3. Statistical Data Analysis__

### 3.1  Inferential Tests

Hypotheses:

- The average revenues from Ultimate and Surf calling plan users are different.
- The average revenue of users in the NY-NJ area is different from that of users in other regions.

#### 3.1.1  Hypothesis testing: Ultimate and Surf plan revenues are different

In [None]:
# Hypothesis: The average revenue from Surf plan users is different from that of Ultimate plan users.

# 1. Hypotheses H0, H1
# H0: Average revenue of the Surf plan and the Ultimate plan are equal (==)
# H1: Average revenue of the Surf plan and the Ultimate plan are different (!=)

# Prepare data by plans
surf_revenue = df_revenue.loc[df_revenue['plan'] == 'surf', 'month_revenue_total']
ultimate_revenue = df_revenue.loc[df_revenue['plan'] == 'ultimate', 'month_revenue_total']

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

In [4]:
# Levene's test checks whether the variances of the two samples are equal.
# The result decides which variant of the t-test is appropriate (equal_var=True or equal_var=False).

levene_stat, levene_p = st.levene(surf_revenue, ultimate_revenue)
display(HTML(f"<b>Levene's Test</b> – Statistic: {levene_stat:.4f}, P-value: {levene_p:.4f}"))

# Decide whether equal variances can be assumed (same significance level as the main test)
if levene_p < alpha:
    equal_var = False
    display(HTML("<i>Levene's H₀ (equal variances) is rejected: the variances differ → use equal_var=False</i>"))
else:
    equal_var = True
    display(HTML("<i>Levene's H₀ (equal variances) is not rejected: no evidence that the variances differ → use equal_var=True</i>"))
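To illustrate how the Levene step drives the `equal_var` flag, here is a minimal sketch on synthetic data (not the Megaline data). The group names, means, and spreads are assumptions chosen only so that the variances clearly differ. Note that `scipy.stats.levene` uses `center='median'` by default, which is the variant usually recommended for skewed data such as revenue.

```python
import numpy as np
from scipy import stats

# Synthetic groups with the same mean but very different spread (illustrative only)
rng = np.random.default_rng(42)
narrow = rng.normal(loc=70.0, scale=10.0, size=500)
wide = rng.normal(loc=70.0, scale=50.0, size=500)

# Default is center='median'; center='mean' is the classic Levene statistic
stat_median, p_median = stats.levene(narrow, wide)
stat_mean, p_mean = stats.levene(narrow, wide, center="mean")

alpha = 0.05
# Reject equal variances when p < alpha, then pass the flag on to ttest_ind
equal_var = bool(p_median >= alpha)

print(f"median-centered: stat={stat_median:.2f}, p={p_median:.3g}")
print(f"mean-centered:   stat={stat_mean:.2f}, p={p_mean:.3g}")
print(f"equal_var={equal_var}")
```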

In [8]:
# 3. Compute the t statistic and p-value (Student's t-test if equal_var is True, Welch's t-test otherwise)

t_stat, p_val = ttest_ind(surf_revenue, ultimate_revenue, equal_var=equal_var)

display(HTML(f"T-statistic: <b>{t_stat:.15f}</b>"))
display(HTML(f"P-value: <b>{p_val:.15f}</b>"))

# 4. Decision and Conclusion

if p_val < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence that <b>the average revenues across the plans differ significantly</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>the average revenues across the plans differ significantly</b>."))
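For reference, when `ttest_ind` is called with `equal_var=False` it computes Welch's t statistic. With sample means $\bar{x}_1, \bar{x}_2$, sample variances $s_1^2, s_2^2$, and sample sizes $n_1, n_2$ (symbols introduced here for illustration, they do not appear in the notebook), the statistic and its approximate degrees of freedom are:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

The two-sided p-value is the probability that a Student's t variable with $\nu$ degrees of freedom exceeds $|t|$ in absolute value. H0 is rejected when that p-value is below `alpha`.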

#### Hypothesis Test validation

In [13]:
display(HTML(f"> <b>Total revenue</b> from Megaline services: \n\n"))
print(df_revenue["month_revenue_total"].describe())

count    2293.000000
mean       64.873676
std        47.417238
min        20.000000
25%        25.340000
50%        70.000000
75%        70.000000
max       596.770000
Name: month_revenue_total, dtype: float64


In [19]:
display(HTML(f"> Revenue, plan <b>(Surf)</b>, for Megaline services: \n\n"))
print(df_revenue.loc[df_revenue["plan"] == 'surf', 'month_revenue_total'].describe())

count    1573.000000
mean       61.437495
std        56.374580
min        20.000000
25%        20.000000
50%        40.330000
75%        86.400000
max       596.770000
Name: month_revenue_total, dtype: float64


In [21]:
display(HTML(f"> Revenue, plan <b>(Ultimate)</b>, for Megaline services: \n\n"))
print(df_revenue.loc[df_revenue["plan"] == 'ultimate', 'month_revenue_total'].describe())

count    720.000000
mean      72.380778
std       11.687146
min       70.000000
25%       70.000000
50%       70.000000
75%       70.000000
max      183.960000
Name: month_revenue_total, dtype: float64
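As a cross-check, the plan comparison can be approximately reproduced from the summary statistics printed above, without access to the raw data. This is a minimal sketch using `scipy.stats.ttest_ind_from_stats`. The means, standard deviations, and counts are copied from the `describe()` outputs, the variable names are illustrative, and small differences from the notebook's result are expected because the printed values are rounded.

```python
from scipy import stats

# Summary statistics copied from the describe() outputs above
surf_mean, surf_std, surf_n = 61.437495, 56.374580, 1573
ult_mean, ult_std, ult_n = 72.380778, 11.687146, 720

# Welch's t-test (equal_var=False), since the standard deviations differ widely
t_from_stats, p_from_stats = stats.ttest_ind_from_stats(
    surf_mean, surf_std, surf_n,
    ult_mean, ult_std, ult_n,
    equal_var=False,
)

alpha = 0.05
print(f"t = {t_from_stats:.4f}, p = {p_from_stats:.3g}")
print("Reject H0" if p_from_stats < alpha else "Do not reject H0")
```

A negative t statistic here means the first group (Surf) has the lower mean, which matches the printed means.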


#### 3.1.2  Hypothesis testing: The average revenue of users in the NY-NJ area is different from that of users in other regions.

In [27]:
# Hypothesis: The average revenue of users in the NY-NJ area is different from that of users in other regions.

# 1. Hypotheses H0, H1
# H0: Average user revenue in the NY-NJ area and in other regions are equal (==)
# H1: Average user revenue in the NY-NJ area and in other regions are different (!=)

# Prepare data by region
ny_nj_mask = df_revenue['city'] == 'new_york_newark_jersey_city,_ny_nj_pa_msa'
ny_nj_revenue = df_revenue.loc[ny_nj_mask, 'month_revenue_total']
other_revenue = df_revenue.loc[~ny_nj_mask, 'month_revenue_total']

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

In [28]:
# Levene's test checks whether the variances of the two samples are equal.
# The result decides which variant of the t-test is appropriate (equal_var=True or equal_var=False).

levene_stat, levene_p = st.levene(ny_nj_revenue, other_revenue)
display(HTML(f"<b>Levene's Test</b> – Statistic: {levene_stat:.4f}, P-value: {levene_p:.4f}"))

# Decide whether equal variances can be assumed (same significance level as the main test)
if levene_p < alpha:
    equal_var = False
    display(HTML("<i>Levene's H₀ (equal variances) is rejected: the variances differ → use equal_var=False</i>"))
else:
    equal_var = True
    display(HTML("<i>Levene's H₀ (equal variances) is not rejected: no evidence that the variances differ → use equal_var=True</i>"))

In [29]:
# 3. Compute the t statistic and p-value (Student's t-test if equal_var is True, Welch's t-test otherwise)

t_stat, p_val = ttest_ind(ny_nj_revenue, other_revenue, equal_var=equal_var)

display(HTML(f"T-statistic: <b>{t_stat:.15f}</b>"))
display(HTML(f"P-value: <b>{p_val:.15f}</b>"))

# 4. Decision and Conclusion

if p_val < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence that <b>the average revenues across locations differ significantly</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>the average revenues across locations differ significantly</b>."))
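Both hypothesis tests in this notebook follow the same three steps: Levene's test, t-test, decision. One possible way to avoid repeating them is a small helper. The sketch below is an assumption about how that could be organized, not part of the project's `src` package. The name `compare_group_means` and the synthetic demo data are hypothetical.

```python
import numpy as np
from scipy import stats


def compare_group_means(group_a, group_b, alpha=0.05):
    """Levene's test picks equal_var, then a two-sided independent t-test is run."""
    _, levene_p = stats.levene(group_a, group_b)
    # Assume equal variances only when Levene's test does not reject them
    equal_var = bool(levene_p >= alpha)
    t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
    return {
        "levene_p": float(levene_p),
        "equal_var": equal_var,
        "t_stat": float(t_stat),
        "p_val": float(p_val),
        "reject_h0": bool(p_val < alpha),
    }


# Synthetic demo data (illustrative only, not the Megaline revenue)
rng = np.random.default_rng(0)
demo_a = rng.normal(loc=60.0, scale=50.0, size=1500)
demo_b = rng.normal(loc=72.0, scale=10.0, size=700)

result = compare_group_means(demo_a, demo_b)
print(result)
```

With the notebook's data, the call would look like `compare_group_means(ny_nj_revenue, other_revenue, alpha)`.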

#### Hypothesis Test validation

In [32]:
display(HTML(f"> <b>NY - NJ</b> area revenue from Megaline services: \n\n"))
print(df_revenue.loc[df_revenue["city"] == 'new_york_newark_jersey_city,_ny_nj_pa_msa', 'month_revenue_total'].describe())

count    377.000000
mean      60.735040
std       44.302534
min       20.000000
25%       20.450000
50%       56.400000
75%       76.400000
max      286.400000
Name: month_revenue_total, dtype: float64


In [34]:
display(HTML(f"> <b>Other</b> area revenue from Megaline services: \n\n"))
print(df_revenue.loc[df_revenue["city"] != 'new_york_newark_jersey_city,_ny_nj_pa_msa', 'month_revenue_total'].describe())

count    1916.000000
mean       65.688011
std        47.975252
min        20.000000
25%        26.400000
50%        70.000000
75%        70.000000
max       596.770000
Name: month_revenue_total, dtype: float64
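The regional comparison can also be approximately reproduced from the two summary tables above. This minimal sketch runs the test both with pooled variances (`equal_var=True`) and as Welch's test (`equal_var=False`) to show how much the variance assumption matters. The values are copied from the `describe()` outputs, the variable names are illustrative, and results may differ slightly from the notebook's because the printed values are rounded.

```python
from scipy import stats

# Summary statistics copied from the describe() outputs above
ny_mean, ny_std, ny_n = 60.735040, 44.302534, 377
other_mean, other_std, other_n = 65.688011, 47.975252, 1916

# Student's t-test with pooled variance
t_pooled, p_pooled = stats.ttest_ind_from_stats(
    ny_mean, ny_std, ny_n, other_mean, other_std, other_n, equal_var=True
)
# Welch's t-test without the equal-variance assumption
t_welch, p_welch = stats.ttest_ind_from_stats(
    ny_mean, ny_std, ny_n, other_mean, other_std, other_n, equal_var=False
)

alpha = 0.05
for label, t_value, p_value in [("pooled", t_pooled, p_pooled), ("welch", t_welch, p_welch)]:
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"{label}: t = {t_value:.4f}, p = {p_value:.4f} -> {decision}")
```

If the two p-values fall on different sides of `alpha`, the conclusion is sensitive to the variance assumption and should be reported with that caveat.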


## 4. Conclusion of Statistical Data Analysis – Megaline plan revenue

After an exhaustive exploratory and statistical analysis of user behavior and the profitability of the Megaline plans, the following conclusions were reached:

1. User behavior: Despite the differences in the limits included, Surf and Ultimate users show similar patterns in call, text, and internet usage. Surf users tend to exceed data limits more frequently, which contributes to generating additional revenue.

2. Plan comparison: While the Ultimate plan offers more generous allowances, its added value is underutilized. The Surf plan generates more revenue from overages, especially in internet traffic.

3. Demographic patterns: The majority of high-usage (and high-revenue) users are concentrated in the New York and Newark areas and are over 33 years old, suggesting professional or urban usage profiles.

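4. Statistical hypothesis testing: The test comparing the mean monthly revenue of the Surf and Ultimate plans confirmed a difference at the 5% significance level (95% confidence). The test concerns revenue, not usage behavior: the summary statistics show a higher mean monthly revenue per record for Ultimate (72.38) than for Surf (61.44), while Surf generates more revenue for Megaline in aggregate because it has more than twice as many records (1,573 vs. 720). For the regional comparison, the summary statistics show a lower mean monthly revenue in the NY-NJ area (60.74) than in the other regions (65.69), so the data do not indicate that the NY-NJ area generates more revenue per record than the rest; whether that gap is statistically significant is decided by the t-test in section 3.1.2.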
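The statement that Surf generates the most revenue in total can be checked from the summary statistics printed in the validation cells. This is a minimal sketch, assuming each row is one billing record so that total revenue equals count × mean. The numbers are copied from the `describe()` outputs and the variable names are illustrative.

```python
# Counts and means copied from the describe() outputs in section 3.1.1
surf_n, surf_mean = 1573, 61.437495
ult_n, ult_mean = 720, 72.380778
all_n, all_mean = 2293, 64.873676

# Total revenue per plan = number of records * mean revenue per record
surf_total = surf_n * surf_mean
ult_total = ult_n * ult_mean
overall_total = all_n * all_mean

print(f"Surf total:     {surf_total:,.2f}")
print(f"Ultimate total: {ult_total:,.2f}")
print(f"Surf share:     {surf_total / (surf_total + ult_total):.1%}")
# Sanity check: the plan totals should add up to the overall total (up to rounding)
print(f"Difference from overall total: {surf_total + ult_total - overall_total:.4f}")
```

The comparison separates two questions: which plan earns more per record (the mean) and which plan earns more overall (the total).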

Recommendation: Megaline could consider:

- Promoting the Surf plan with incentives for intensive internet users.
- Reevaluating the Ultimate plan's price or the benefits it offers to increase its perceived value.
- Focusing marketing strategies on urban centers like New York and New Jersey, where high-usage users are concentrated.