
<span style="color:cyan; font-size:35px; font-weight:bold;">
# Questions 

</span>




<span style="color:white; font-size:25px; font-weight:bold;">
1- What is the revenue by country?<br>
2- What is the monthly revenue trend?<br>
3- What is the distribution of the amount after Winsorization?
</span>



<span style="color:pink; font-size:35px; font-weight:bold;">
1. Setup + imports
</span>


In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import plotly.express as px

DATA = Path("data/processed/analytics_table.parquet")
FIGS = Path("reports/figures")
FIGS.mkdir(parents=True, exist_ok=True)

def save_fig(fig, path: Path, *, scale: int = 2) -> None:
    """Save a Plotly figure to disk (requires `kaleido`)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fig.write_image(str(path), scale=scale)


<span style="color:pink; font-size:35px; font-weight:bold;">
2. Load processed data
</span>

In [2]:
! uv pip install pyarrow


Using Python 3.11.4 environment at: C:\Users\Arwa7\OneDrive\سطح المكتب\Bootcamp\.venv
Audited 1 package in 6ms


In [3]:

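# Resolve paths from the project root. This assumes the notebook runs from a
# folder one level below the root, and it overrides the relative paths set in
# the setup cell.
ROOT = Path.cwd().parents[0]
DATA = ROOT / "data" / "processed" / "analytics_table.parquet"
FIGS = ROOT / "reports" / "figures"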







In [4]:
df = pd.read_parquet(DATA)
print("rows:", len(df), "cols:", len(df.columns))
print(df.dtypes.head(15))
missing = df.isna().sum().sort_values(ascending=False).head(10)
print(missing)

rows: 100 cols: 18
order_id               string[python]
user_id                string[python]
amount                        Float64
quantity                        Int64
created_at        datetime64[ns, UTC]
status                         object
status_clean                   object
amount__isna                     bool
quantity__isna                   bool
year                          float64
month                         float64
day                           float64
hour                          float64
dayofweek                     float64
country                        object
dtype: object
amount_winsor         12
amount                12
amount__is_outlier    12
quantity               8
dayofweek              7
hour                   7
day                    7
month                  7
created_at             7
year                   7
dtype: int64


<span style="color:pink; font-size:35px; font-weight:bold;">
3. Quick audit
</span>
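
This section has no cell of its own in the notebook; the row count, dtypes, and missing-value counts printed in the load cell are the only audit shown. A minimal sketch of a few extra checks that could sit here, assuming `order_id` is meant to be unique and `amount` is meant to be non-negative (both are assumptions, not stated in the source). `quick_audit` is a hypothetical helper name, and the toy frame only illustrates the call.

```python
import pandas as pd

def quick_audit(df: pd.DataFrame) -> dict:
    """Return a few basic data-quality counts (hypothetical helper)."""
    amount = pd.to_numeric(df["amount"], errors="coerce")
    return {
        "rows": len(df),
        # Assumption: each order_id should appear only once.
        "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
        "missing_amount": int(amount.isna().sum()),
        # Assumption: negative amounts are unexpected.
        "negative_amount": int((amount < 0).sum()),
    }

# Toy data for illustration only; not the notebook's table.
toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o2", "o3"],
    "amount": [10.0, None, 5.0, -1.0],
})
audit = quick_audit(toy)
print(audit)
```

In the notebook the same helper would be called as `quick_audit(df)` after the load cell.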

<span style="color:pink; font-size:35px; font-weight:bold;">
4. Questions + results
</span>

In [5]:
! uv pip install --upgrade kaleido


Using Python 3.11.4 environment at: C:\Users\Arwa7\OneDrive\سطح المكتب\Bootcamp\.venv
Resolved 12 packages in 3.50s
Audited 12 packages in 5ms


In [6]:
# Question 1: Revenue by country
rev = (
    df.groupby("country", dropna=False)
      .agg(
          n=("order_id","size"),
          revenue=("amount","sum"),
          aov=("amount","mean"),
      )
      .reset_index()
      .sort_values("revenue", ascending=False)
)

fig = px.bar(rev, x="country", y="revenue", title="Revenue by country (all data)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue (sum of amount)")
save_fig(fig, FIGS / "revenue_by_country.png")
fig






In [None]:
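# Question 2: Monthly revenue trend

print(df.columns)

# Group by calendar month taken from created_at (year + month), so the same
# month in different years is not merged. Rows without a timestamp are left
# out because they cannot be placed on the time axis.
month_start = df["created_at"].dt.tz_localize(None).dt.to_period("M").dt.to_timestamp()

trend = (
    df.assign(month_start=month_start)
      .groupby("month_start")
      .agg(n=("order_id", "size"), revenue=("amount", "sum"))
      .reset_index()
      .sort_values("month_start")
)

fig = px.line(
    trend, x="month_start", y="revenue", title="Revenue over time (monthly)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Month")
fig.update_yaxes(title_text="Revenue")
save_fig(fig, FIGS / "revenue_trend_monthly.png")
fig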



Index(['order_id', 'user_id', 'amount', 'quantity', 'created_at', 'status',
       'status_clean', 'amount__isna', 'quantity__isna', 'year', 'month',
       'day', 'hour', 'dayofweek', 'country', 'signup_date', 'amount_winsor',
       'amount__is_outlier'],
      dtype='object')


In [None]:
# Question 3: Amount distribution (winsorized)

fig = px.histogram(
    df, x="amount_winsor", nbins=30,
    title="Order amount distribution (winsorized)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Amount (winsorized)")
fig.update_yaxes(title_text="Number of orders")
save_fig(fig, FIGS / "amount_hist_winsor.png")
fig
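
The `amount_winsor` column is computed upstream, and the notebook does not show how. As a sketch of the general technique: winsorizing clips values below a lower quantile and above an upper quantile to those quantile values, instead of dropping them. The default 1st and 99th percentiles below are an assumption, and `winsorize` is a hypothetical helper, not the pipeline's actual function.

```python
import pandas as pd

def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values to the [lower, upper] quantile range; missing values stay missing."""
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s.clip(lower=lo, upper=hi)

# Toy data for illustration only: one extreme value and one missing value.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, None])
capped = winsorize(toy, lower=0.0, upper=0.8)
print(capped)
```

The extreme value is pulled down to the upper quantile, so the row is kept but no longer stretches the histogram's x-axis.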

<span style="color:pink; font-size:35px; font-weight:bold;">
5. Bootstrap comparison
</span>

In [None]:
def bootstrap_diff_means(a: pd.Series, b: pd.Series, *, n_boot: int = 2000, seed: int = 0) -> dict:
    """Bootstrap a 95% percentile interval for mean(a) - mean(b).

    Non-numeric and missing values are dropped before resampling.
    """
    rng = np.random.default_rng(seed)
    a = pd.to_numeric(a, errors="coerce").dropna().to_numpy()
    b = pd.to_numeric(b, errors="coerce").dropna().to_numpy()
    # Raise explicitly: `assert` is skipped when Python runs with -O.
    if len(a) == 0 or len(b) == 0:
        raise ValueError("Empty group after cleaning")

    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        sa = rng.choice(a, size=len(a), replace=True)
        sb = rng.choice(b, size=len(b), replace=True)
        diffs.append(sa.mean() - sb.mean())
    diffs = np.array(diffs)

    return {
        "diff_mean": float(a.mean() - b.mean()),
        "ci_low": float(np.quantile(diffs, 0.025)),
        "ci_high": float(np.quantile(diffs, 0.975)),
    }

# is_refund is a 0/1 flag, so its mean per country is that country's refund rate.
d = df.assign(is_refund=df["status_clean"].eq("refund").astype(int))
a = d.loc[d["country"].eq("SA"), "is_refund"]
b = d.loc[d["country"].eq("AE"), "is_refund"]
print("n_SA:", len(a), "n_AE:", len(b))
print(bootstrap_diff_means(a, b, n_boot=2000, seed=0))


n_SA: 76 n_AE: 24
{'diff_mean': -0.0899122807017544, 'ci_low': -0.2807017543859649, 'ci_high': 0.07461622807017514}
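
One way to read the output above: `diff_mean` is the refund rate in SA minus the refund rate in AE, and `ci_low` and `ci_high` bound a 95% percentile interval for that difference. A small sketch that checks whether such an interval contains zero; `ci_contains_zero` is a hypothetical helper, and the dictionary simply repeats the printed values.

```python
def ci_contains_zero(result: dict) -> bool:
    """True when the interval [ci_low, ci_high] includes 0."""
    return result["ci_low"] <= 0.0 <= result["ci_high"]

# Values copied from the printed output above.
result = {
    "diff_mean": -0.0899122807017544,
    "ci_low": -0.2807017543859649,
    "ci_high": 0.07461622807017514,
}
includes_zero = ci_contains_zero(result)
print(includes_zero)  # True: the interval spans zero
```

The interval is wide, which is expected with only 24 AE orders in the sample.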


<span style="color:pink; font-size:35px; font-weight:bold;">
6. Findings + caveats
</span>

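"Question 1: Revenue by Country"

Two countries appear in the data: SA and AE.

SA generates the most revenue, above 2,500.

AE follows with revenue above 600.

SA also has more orders (76 versus 24 for AE), so part of the gap in total revenue reflects order volume.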

"Question 2: Revenue Trend"

Monthly revenue increased consistently from one month to the next over the period covered.
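
A claim like this can be checked against the aggregated table instead of being read off the chart. A minimal sketch, assuming the revenue values are already sorted by month (as the `trend` table in Question 2 is); `increases_every_period` is a hypothetical helper and the two series are toy data.

```python
import pandas as pd

def increases_every_period(revenue: pd.Series) -> bool:
    """True when each value is strictly greater than the one before it."""
    steps = revenue.diff().dropna()  # change from one period to the next
    return bool((steps > 0).all())

# Toy data for illustration only.
toy_up = pd.Series([100.0, 120.0, 150.0])
toy_dip = pd.Series([100.0, 90.0, 150.0])
up_result = increases_every_period(toy_up)
dip_result = increases_every_period(toy_dip)
print(up_result, dip_result)
```

In the notebook this would be called as `increases_every_period(trend["revenue"])`.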

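"Question 3: Amount Distribution"

The histogram shows the distribution of winsorized order amounts.

Most amounts fall close together and the shape is roughly normal, although one value still stands apart from the rest.

"Bootstrap Comparison: Refund Rate, SA vs AE"

The observed difference in refund rate (SA minus AE) is about -0.09, with a 95% bootstrap interval from about -0.28 to 0.07. The interval includes zero, so this sample does not show a clear difference between the two countries.

"Caveats"

`amount` is missing in 12 of the 100 rows; these rows count as orders but add nothing to revenue.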