# Practical Guide: Using Statistics to Answer Business Questions

Do you have a hypothesis? Statistics cannot prove it, but it can tell you whether your data support it or whether the pattern could plausibly be due to chance, which reduces risk in decision-making.

This guide demonstrates three common use cases for validating business questions with Python.

### Use Cases:
1.  **A vs. B Comparison:** "Did the new site page 'B' have a significantly better conversion rate than 'A'?" -> **T-test (Welch's t-test)**

2.  **Multiple Categories:** "Is the average order value of customers from the 'South,' 'Southeast,' and 'Northeast' the same?" -> **ANOVA**

3.  **Relationship (Association):** "Is coupon use associated with the product category a customer buys?" -> **Chi-Square Test**


In [6]:
import pandas as pd
import numpy as np
from scipy import stats

# Settings
np.random.seed(42) # For reproducible results
print("✅ Setup complete. Libraries imported.")


✅ Setup complete. Libraries imported.


## Use Case 1: A/B Test (T-test)

**Business Question:** "We ran an A/B test for a week. Did the new page 'B' (the treatment) really have a better conversion rate than the old page 'A' (the control)?"

In [7]:
# 1. Create simulated data for two web pages
N_A = 1000 # 1000 visitors on page A
N_B = 1000 # 1000 visitors on page B

# Conversion Rates (which we don't know in real life)
conv_A = 0.10 # 10%
conv_B = 0.12 # 12%

# Generate the data (0 = did not convert, 1 = converted)
conversions_A = np.random.binomial(1, conv_A, N_A)
conversions_B = np.random.binomial(1, conv_B, N_B)

print(f"Page A: {conversions_A.sum()} conversions from {N_A} visitors (Rate: {conversions_A.mean():.1%})")
print(f"Page B: {conversions_B.sum()} conversions from {N_B} visitors (Rate: {conversions_B.mean():.1%})")

Page A: 100 conversions from 1000 visitors (Rate: 10.0%)
Page B: 112 conversions from 1000 visitors (Rate: 11.2%)


In [8]:
# The null hypothesis (H0) is: "The mean of A is equal to that of B"
# The alternative hypothesis (H1) is: "The mean of A is different from that of B"

# We use ttest_ind (for independent samples); by default the test is two-sided
# equal_var=False (Welch's t-test) does not assume equal variances, so it is the safer choice
statistic, pvalue = stats.ttest_ind(conversions_A, conversions_B, equal_var=False)

print(f"--- T-test result ---")
print(f"T-Statistic: {statistic:.4f}")
print(f"P-value: {pvalue:.4f}")

--- T-test result ---
T-Statistic: -0.8714
P-value: 0.3836
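Because each visitor's outcome is 0 or 1, conversion data are often analyzed with a two-proportion z-test; with 1,000 visitors per page it should give nearly the same answer as the t-test. The sketch below recomputes the comparison from the counts in this run (100 of 1,000 and 112 of 1,000). The variable names are illustrative, and the one-sided p-value is included only because the business question asks specifically whether 'B' is better.

```python
import numpy as np
from scipy import stats

# Counts taken from the run shown above
conv_count_a, n_a = 100, 1000
conv_count_b, n_b = 112, 1000

rate_a = conv_count_a / n_a
rate_b = conv_count_b / n_b

# Pooled conversion rate under H0 (both pages share a single rate)
pooled = (conv_count_a + conv_count_b) / (n_a + n_b)
std_err = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

z_stat = (rate_b - rate_a) / std_err

# Two-sided: "the rates differ"; one-sided: "B converts better than A"
p_two_sided = 2 * stats.norm.sf(abs(z_stat))
p_one_sided = stats.norm.sf(z_stat)

print(f"Z-statistic: {z_stat:.4f}")
print(f"Two-sided p-value: {p_two_sided:.4f}")
print(f"One-sided p-value (B > A): {p_one_sided:.4f}")
```

If the two-sided p-value lands close to the t-test's, the choice between the two tests does not change the decision for this sample.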


### Interpreting the T-test (The Decision)

* **Result:** The p-value was $0.3836$ (in my run).
* **Rule:** We use a standard significance level of 5% (or 0.05).
* **Decision:** Since the p-value ($0.3836$) is **greater** than $0.05$, we **fail to reject the null hypothesis**.

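**Business Conclusion:** The observed difference (10.0% vs. 11.2%) is not statistically significant, so this test gives no reliable evidence that page 'B' outperforms 'A'. That is not proof that the pages perform equally: the simulation used true rates of 10% and 12%, and 1,000 visitors per page may simply have been too few to detect that gap. Keep page 'A' for now, and consider a longer test before ruling 'B' out.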
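A non-significant result is easier to act on when you know how many visitors the test needed in the first place. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function name `visitors_per_group` and the 80% power target are assumptions made for illustration, and the 10% and 12% rates are the simulated values from the cell above (in real life they would be estimates).

```python
import math
from scipy import stats

def visitors_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate visitors needed per page for a two-sided two-proportion test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_power = stats.norm.ppf(power)          # quantile matching the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

needed = visitors_per_group(0.10, 0.12)
print(f"Visitors needed per page: {needed}")
```

Under these assumptions the estimate comes out at roughly 3,800 visitors per page, well above the 1,000 used in this example.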

## Use Case 2: Multiple Categories (ANOVA)

**Business Question:** "Is the 'average order value' of customers from the 'Southeast,' 'Northeast,' and 'South' the same, or does any region spend *significantly* more or less than the others?"

(The T-test only compares 2 groups. For 3 or more, we use ANOVA.)


In [9]:
# 1. Create "fake" spending data by region
gastos_sudeste = np.random.normal(loc=150, scale=30, size=200) # Southeast spending, mean 150
gastos_nordeste = np.random.normal(loc=140, scale=30, size=200) # Northeast spending, mean 140
gastos_sul = np.random.normal(loc=155, scale=30, size=200) # South spending, mean 155

print(f"Average ticket Southeast: R$ {gastos_sudeste.mean():.2f}")
print(f"Average ticket Northeast: R$ {gastos_nordeste.mean():.2f}")
print(f"Average ticket South: R$ {gastos_sul.mean():.2f}")

# 2. Run the F_oneway (ANOVA) test
statistic, pvalue = stats.f_oneway(gastos_sudeste, gastos_nordeste, gastos_sul)

print(f"\n--- ANOVA Result ---")
print(f"F-Statistic: {statistic:.4f}")
print(f"P-Value: {pvalue:.4f}")

Average ticket Southeast: R$ 152.16
Average ticket Northeast: R$ 139.19
Average ticket South: R$ 154.22

--- ANOVA Result ---
F-Statistic: 15.4846
P-Value: 0.0000


### Interpreting the ANOVA (The Decision)

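* **Result:** The p-value printed as $0.0000$ (in my run), which means it is below $0.0001$; a p-value is never exactly zero.
* **Rule:** We use a standard significance level of 5% (or 0.05).
* **Decision:** Since the p-value ($< 0.0001$) is **less** than $0.05$, we **reject the null hypothesis** that all three regional means are equal.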

**Business Conclusion:** Yes, there is a statistically significant difference in average order value among the regions. Further analysis (post-hoc tests) can identify which specific regions differ.
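One simple way to run that follow-up is to compare every pair of regions with Welch's t-test and apply a Bonferroni correction for the number of comparisons. This is a minimal sketch, not the only option (Tukey's HSD is another common choice). The data are re-simulated with the same means and spread as above, so the numbers will not match the earlier run, and the dictionary `spending` is a name introduced here for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # hypothetical data, not the run shown above

# Simulated spending per region (same means and spread as the example)
spending = {
    "Southeast": rng.normal(loc=150, scale=30, size=200),
    "Northeast": rng.normal(loc=140, scale=30, size=200),
    "South": rng.normal(loc=155, scale=30, size=200),
}

pairs = list(combinations(spending, 2))  # every pair of region names
results = []
for region_1, region_2 in pairs:
    _, p_raw = stats.ttest_ind(spending[region_1], spending[region_2], equal_var=False)
    # Bonferroni: multiply by the number of comparisons, capped at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results.append((region_1, region_2, p_raw, p_adj))

for region_1, region_2, p_raw, p_adj in results:
    verdict = "differs" if p_adj < 0.05 else "no clear difference"
    print(f"{region_1} vs {region_2}: adjusted p = {p_adj:.4g} ({verdict})")
```

Bonferroni is conservative, which suits a small number of comparisons like the three pairs here.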


## Use Case 3: Relationship between Categories (Chi-Square Test)

**Business Question:** "Is there an *association* between the product category purchased and whether the customer used a discount coupon? (e.g., Do 'Electronics' customers use coupons more than 'Fashion' customers?)"

In [10]:
# 1. Create a "Contingency Table" (fake data)
# (This is what you would do with a pd.crosstab() on your real data)

#              Used Coupon | Not Used Coupon
# Electronics     [100,          150]
# Fashion         [ 50,          200]

tabela_contingencia = pd.DataFrame({
    'Used_Coupon': [100, 50],
    'Not_Used_Coupon': [150, 200]
}, index=['Electronics', 'Fashion'])

print("Observed Table (Our Data):")
print(tabela_contingencia)

# 2. Run the Chi-Squared Test
chi2, pvalue, dof, expected_table = stats.chi2_contingency(tabela_contingencia)

print(f"\n--- Chi-Squared Test Result ---")
print(f"P-value: {pvalue:.4f}")

Observed Table (Our Data):
             Used_Coupon  Not_Used_Coupon
Electronics          100              150
Fashion               50              200

--- Chi-Squared Test Result ---
P-value: 0.0000


### Interpreting the Chi-Square Test (The Decision)

* **Null Hypothesis (H0):** "There is no association. Coupon use is *independent* of product category."
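* **Result:** The p-value printed as $0.0000$ (in my run), which means it is below $0.0001$; a p-value is never exactly zero.
* **Rule:** We use a standard significance level of 5% (or 0.05).
* **Decision:** Since the p-value ($< 0.0001$) is **less** than $0.05$, we **reject the null hypothesis**.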

**Business Conclusion:** Yes, there is a statistically significant association. Coupon use is **not** independent of category. Looking at the data, customers in "Electronics" (40% used a coupon) seem to use coupons more frequently than customers in "Fashion" (20% used a coupon).
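With real data you would usually start from one row per order rather than a ready-made table. The sketch below rebuilds order-level rows that match the counts above, derives the contingency table with `pd.crosstab()`, and reruns the test. The DataFrame `orders` and its column names `category` and `coupon` are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical order-level data rebuilt to match the counts in the table above
orders = pd.DataFrame({
    "category": np.repeat(["Electronics", "Fashion"], [250, 250]),
    "coupon": np.repeat(
        ["Used_Coupon", "Not_Used_Coupon", "Used_Coupon", "Not_Used_Coupon"],
        [100, 150, 50, 200],
    ),
})

# Count orders for each (category, coupon) combination
observed = pd.crosstab(orders["category"], orders["coupon"])
print(observed)

chi2, pvalue, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: {chi2:.4f}, dof: {dof}, p-value: {pvalue:.2e}")

# Share of orders that used a coupon within each category
coupon_rate = pd.crosstab(orders["category"], orders["coupon"], normalize="index")["Used_Coupon"]
print(coupon_rate)
```

Printing the p-value in scientific notation (`:.2e`) avoids the misleading `0.0000` display, and the row-normalized table reproduces the 40% and 20% coupon rates quoted in the conclusion.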


## Final Conclusion

Statistics is not (just) academic theory. It is a fundamental business tool to:

* **Validate** A/B tests before launching a new feature.
* **Confirm** whether a difference between groups (e.g., regions) is real or just chance.
* **Discover** hidden associations in customer behaviors.

---
Ready to use data to validate your hypotheses and improve your business outcomes?

* **See my complete statistics guide:** [https://github.com/Lucas-Ker/stats_for_data_science](https://github.com/Lucas-Ker/stats_for_data_science)
