# Recommendations & Next Steps (Act)

## Purpose

This notebook translates the analysis into **business-ready conclusions** and
**actionable recommendations**.

It answers the business question:

**What characteristics are most strongly associated with books becoming and remaining Amazon bestsellers, based on price, genre, user ratings, and review volume?**

This notebook contains minimal code and focuses on:
- Key conclusions
- Recommendations
- Risks / limitations
- Next steps for decision-makers



In [1]:
from pathlib import Path
import pandas as pd

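# Resolve the repo root by walking up from the working directory
# until a folder containing requirements.txt is found
here = Path.cwd()
repo_root = next((p for p in [here, *here.parents] if (p / "requirements.txt").exists()), None)

if repo_root is None:
    raise FileNotFoundError(f"Repo root not found: no requirements.txt in {here} or its parents.")

# Load the cleaned dataset
clean_path = repo_root / "data_cleaned" / "amazon_books_cleaned.csv"
df = pd.read_csv(clean_path)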

df.head()


| | name | author | user_rating | reviews | price | year | genre |
|---|---|---|---|---|---|---|---|
| 0 | 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350 | 8 | 2016 | Non Fiction |
| 1 | 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011 | Fiction |
| 2 | 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979 | 15 | 2018 | Non Fiction |
| 3 | 1984 (Signet Classics) | George Orwell | 4.7 | 21424 | 6 | 2017 | Fiction |
| 4 | 5,000 Awesome Facts (About Everything!) (Natio... | National Geographic Kids | 4.8 | 7665 | 12 | 2019 | Non Fiction |


## Executive Summary

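Based on Amazon Top 50 bestseller data (2009–2019; 550 list entries, 351 unique books), bestsellers tend to cluster around:
- Two broad genre labels, with Non Fiction holding the larger share (56.4% of list entries vs 43.6% for Fiction)
- Strong customer satisfaction (median user rating of 4.7)
- High review volumes (median of 8,580 reviews, used as a proxy for popularity)
- Repeat author presence across multiple years (persistence signal): 248 unique authors fill the 550 list entries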

The strongest opportunities for a publishing / retail strategy team are to:
1) Prioritise genre categories with sustained bestseller share
2) Use pricing bands aligned with typical bestseller price points
3) Focus on high-rating + high-review “high-performing segments”
4) Identify repeat authors and build long-term partnerships around proven demand


In [2]:
summary = {
    "years_covered": (int(df["year"].min()), int(df["year"].max())),
    "rows": int(df.shape[0]),
    "unique_books": int(df[["name", "author"]].drop_duplicates().shape[0]),
    "unique_authors": int(df["author"].nunique()),
    "genre_share_pct": (
        df["genre"]
        .value_counts(normalize=True)
        .mul(100)
        .round(1)
        .to_dict()
    ),
    "price_median": float(df["price"].median()),
    "price_mean": float(df["price"].mean()),
    "rating_median": float(df["user_rating"].median()),
    "reviews_median": float(df["reviews"].median()),
}
summary


{'years_covered': (2009, 2019),
 'rows': 550,
 'unique_books': 351,
 'unique_authors': 248,
 'genre_share_pct': {'Non Fiction': 56.4, 'Fiction': 43.6},
 'price_median': 11.0,
 'price_mean': 13.1,
 'rating_median': 4.7,
 'reviews_median': 8580.0}

## Key Findings

### 1) Genre dominates bestseller composition
- Bestseller lists are not evenly split by genre: Non Fiction accounts for 56.4% of list entries and Fiction for 43.6%.
- The dataset records only these two broad genre labels, so each one covers a large share of top titles.
- The genre share trend over time shows whether demand is stable or shifting.

**Supporting visual:** `reports/genre_share_over_time.png`  
**Supporting visual:** `reports/overall_genre_share.png`
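
The share-over-time view can be sketched as follows. This is a minimal illustration, assuming the `year` and `genre` columns shown in the data preview. The helper name `genre_share_by_year` and the toy rows are hypothetical and are not taken from the project code or the dataset.

```python
import pandas as pd


def genre_share_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Return each genre's percentage share of list entries per year."""
    # normalize="index" divides every row (year) by its own total
    share = pd.crosstab(df["year"], df["genre"], normalize="index")
    return share.mul(100).round(1)


# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {
        "year": [2009, 2009, 2009, 2009, 2010, 2010],
        "genre": ["Fiction", "Fiction", "Fiction", "Non Fiction", "Fiction", "Non Fiction"],
    }
)
share = genre_share_by_year(toy)
print(share)
```

Calling `genre_share_by_year(df)` on the cleaned data would give one row per year, which makes it easy to check whether the overall 56.4% / 43.6% split is stable or drifting.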

---

### 2) Pricing clusters into a typical “bestseller band”
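- Bestsellers tend to sit within a narrow price range: the median price is 11.0 and the mean is 13.1, which suggests a cluster of lower-priced titles with a tail of more expensive ones.
- Some genres show systematically higher or lower typical prices.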

**Supporting visual:** `reports/price_distribution_by_genre.png`

---

### 3) Review volume is a strong popularity signal
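- Review volume varies widely across titles: the five rows previewed above already range from 2,052 to 21,424 reviews, against a median of 8,580.
- A high review count is a plausible signal of market visibility and reach.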

---

### 4) Ratings represent satisfaction but do not guarantee popularity alone
- High ratings are common among bestsellers (median 4.7), so ratings alone separate titles poorly; review volume is what differentiates top-performing titles.
- The strongest segment combines high ratings and high reviews.

**Supporting visual:** `reports/rating_vs_reviews.png`
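
One way to make the "high ratings and high reviews" segment concrete is a simple flag. The sketch below assumes the `user_rating` and `reviews` columns from the data preview. The function name `flag_high_performers` and the default cut-offs (the dataset medians, 4.7 and 8,580) are illustrative assumptions, not thresholds defined by this analysis.

```python
import pandas as pd


def flag_high_performers(
    df: pd.DataFrame, min_rating: float = 4.7, min_reviews: int = 8580
) -> pd.DataFrame:
    """Add a boolean column marking rows that meet both thresholds."""
    out = df.copy()
    out["high_performer"] = (out["user_rating"] >= min_rating) & (
        out["reviews"] >= min_reviews
    )
    return out


# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {
        "user_rating": [4.8, 4.7, 4.2, 4.9],
        "reviews": [20000, 500, 30000, 9000],
    }
)
flagged = flag_high_performers(toy)
print(flagged)
```

The cut-offs are parameters so that they can be tuned per genre or replaced with quantiles once a definition of "high-performing" is agreed.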

---

### 5) Repeat authors indicate persistence and lower risk
- Authors appearing across multiple years represent a proven demand base; Gary Chapman and Jeff Kinney appear in all 11 years covered.
- This can reduce risk when planning releases, promotions, or partnerships.

**Supporting visual:** `reports/top_repeat_authors.png`


In [3]:
top_authors = (
    df.groupby("author")
      .agg(years_active=("year", "nunique"))
      .sort_values("years_active", ascending=False)
      .head(10)
)

top_authors


| author | years_active |
|---|---|
| Gary Chapman | 11 |
| Jeff Kinney | 11 |
| American Psychological Association | 10 |
| Gallup | 9 |
| Dr. Seuss | 8 |
| Eric Carle | 7 |
| Stephen R. Covey | 7 |
| Bill O'Reilly | 6 |
| The College Board | 6 |
| Rick Riordan | 6 |
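
The same persistence idea can be applied to individual titles, since 550 list entries map to only 351 unique books. The sketch below is a hedged illustration. It assumes a title is identified by its `name` and `author` pair, as in the summary cell. The helper name `title_persistence` and the toy rows are hypothetical.

```python
import pandas as pd


def title_persistence(df: pd.DataFrame) -> pd.DataFrame:
    """Count the distinct years in which each (name, author) pair is listed."""
    return (
        df.groupby(["name", "author"])["year"]
        .nunique()
        .rename("years_listed")
        .sort_values(ascending=False)
        .reset_index()
    )


# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {
        "name": ["Book A", "Book A", "Book B", "Book A"],
        "author": ["X", "X", "Y", "X"],
        "year": [2009, 2010, 2010, 2010],
    }
)
persistence = title_persistence(toy)
print(persistence)
```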


## Recommendations

### Recommendation 1 — Focus acquisition/marketing on dominant genre categories
**Action:** Prioritise budget and promotional slots on the genres with the highest and most stable bestseller share.  
**Why:** These genres demonstrate sustained demand and are more likely to generate repeat bestseller candidates.

---

### Recommendation 2 — Use “bestseller price bands” rather than guessing pricing
**Action:** Define 2–3 recommended price bands based on historical bestseller pricing (e.g., low / mid / premium).  
**Why:** This aligns pricing strategy to market norms and reduces the risk of mispricing.
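
As a rough sketch of how such bands could be derived, the example below splits prices at the one-third and two-thirds quantiles. The tercile split, the band labels, and the helper name `assign_price_bands` are assumptions made for illustration. The analysis itself does not prescribe specific cut points.

```python
import pandas as pd


def assign_price_bands(prices: pd.Series) -> pd.Series:
    """Label each price as low / mid / premium using tercile cut points."""
    low_edge, high_edge = prices.quantile([1 / 3, 2 / 3])
    # Open-ended outer bins so every price falls into exactly one band
    bins = [float("-inf"), low_edge, high_edge, float("inf")]
    return pd.cut(prices, bins=bins, labels=["low", "mid", "premium"])


# Toy prices for illustration only (not taken from the dataset)
toy_prices = pd.Series([4, 6, 8, 10, 12, 14, 16, 20, 30])
bands = assign_price_bands(toy_prices)
print(pd.DataFrame({"price": toy_prices, "band": bands}))
```

If many titles share the same price, the two cut points can coincide and `pd.cut` will reject the duplicate bin edges. In that case fixed, business-defined thresholds may be the simpler choice.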

---

### Recommendation 3 — Build a “high-performing segment” scorecard for new titles
**Action:** Track early indicators for new releases using:
- rating thresholds
- review growth rate
- genre benchmarks  
**Why:** Early signals can guide promotion spend and inventory decisions.
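
A scorecard along these lines could start as a handful of pass/fail checks. Everything in the sketch below is hypothetical: the `TitleSnapshot` fields, the `scorecard` function, the 10% weekly growth threshold, and the example numbers. None of them come from the dataset, which has no weekly review history.

```python
from dataclasses import dataclass


@dataclass
class TitleSnapshot:
    title: str
    rating: float
    reviews_last_week: int
    reviews_this_week: int
    genre_median_reviews: float  # benchmark for the title's genre


def scorecard(s: TitleSnapshot, min_rating: float = 4.7, min_growth: float = 0.10) -> dict:
    """Return pass/fail checks plus a 0-3 score for one title."""
    # Week-over-week review growth; treated as 0 when there is no baseline
    if s.reviews_last_week > 0:
        growth = (s.reviews_this_week - s.reviews_last_week) / s.reviews_last_week
    else:
        growth = 0.0
    checks = {
        "rating_ok": s.rating >= min_rating,
        "growth_ok": growth >= min_growth,
        "above_genre_benchmark": s.reviews_this_week >= s.genre_median_reviews,
    }
    return {**checks, "review_growth": round(growth, 3), "score": sum(checks.values())}


# Made-up example title
result = scorecard(TitleSnapshot("Example Title", 4.8, 1000, 1200, 8580.0))
print(result)
```

Keeping the checks separate from the total score lets a team see which indicator is holding a title back, rather than only a single number.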

---

### Recommendation 4 — Identify repeat authors and build partnership strategy
**Action:** Create a shortlist of authors with multi-year appearances and prioritise:
- exclusive launches
- bundled promotions
- repeat-series marketing  
**Why:** Repeat authors represent proven demand and lower uncertainty.

---

### Recommendation 5 — Optimise visibility mechanics (reviews strategy)
**Action:** Improve review capture (ethically) through:
- post-purchase email prompts
- review reminders in packaging
- early-reader programmes  
**Why:** Review volume is a strong popularity proxy in this dataset and may support ranking visibility.


## Risks & Limitations

- This dataset does not include true sales volume, so **reviews are a proxy**, not a direct measure of revenue.
- The links between ratings, review volume, and bestseller status are **associations**, not proof of causation.
- The dataset is limited to the Top 50 books each year, so it does not represent the full Amazon catalogue and contains no non-bestsellers to compare against.
- External factors (marketing spend, author brand, publisher reach) are not included.

These constraints should be considered when applying recommendations to real commercial decisions.


## Next Steps

If this were a real strategy project, the next steps would be:

1) **Expand the dataset**
   - Include Top 500 or category-level bestseller data
   - Add publisher, format (hardcover/paperback/ebook), and marketing signals if available

2) **Add sales and profitability measures**
   - Integrate internal sales volume, margin, ad spend, and conversion metrics

3) **Model early success indicators**
   - Build a simple predictive model to estimate “bestseller likelihood” from early rating and review velocity

4) **Segment by sub-genres and formats**
   - Fiction vs Non-Fiction is broad; subcategories may reveal more targeted strategy opportunities

5) **Test strategy changes**
   - Run controlled experiments on pricing bands and review-generation tactics
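
For the controlled experiments in step 5, one common way to compare two variants is a two-proportion z-test on conversion rates. The sketch below uses only the standard library. The function name and the example counts are made up for illustration, and a real experiment would also need a pre-agreed sample size and significance level.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: variant A (current price band) vs variant B (test price band)
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```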
