Uses bing.com to fetch the top 10 news headlines, applies sentiment analysis and semantic matching against the news, then applies seasonal and strategy rules along with price adjustment strategies.


#  Semantic News-Driven Book Pricing System



The system adjusts book prices using four signals:

* Live news sentiment
* Semantic similarity (NLP embeddings)
* Seasonal strategy rules
* Book ratings

It scrapes **1,000 books** from *Books to Scrape*, matches them with **current news**, and adjusts prices accordingly.

---

##  1. Installation & Imports

### Libraries Used

* **sentence-transformers** → Semantic similarity using embeddings
* **nltk** → Text preprocessing & sentiment analysis (VADER)
* **requests** → Web requests
* **beautifulsoup4** → Web scraping
* **pandas** → Data processing
* **torch** → Tensor operations for similarity

```python
!pip install sentence-transformers nltk requests beautifulsoup4
```

---

##  2. NLTK Setup (Text & Sentiment Processing)

### Downloads

* WordNet → Lemmatization
* Stopwords → Noise removal
* VADER → Sentiment analysis

### Key Functions

#### `clean_text(text)`

* Removes stopwords
* Keeps meaningful words
* Lemmatizes tokens

#### `sentiment_score(text)`

* Uses **VADER** compound score
* Returns:

  * `1` → Positive
  * `0` → Neutral
  * `-1` → Negative
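
The mapping from a compound score to the three labels can be sketched without NLTK. This is a minimal illustration that assumes the compound score has already been computed and lies in [-1, 1]; the function name `label_from_compound` is hypothetical, and the ±0.05 cut-offs are the ones used in the notebook code.

```python
def label_from_compound(compound: float) -> int:
    """Map a VADER-style compound score in [-1, 1] to 1, 0, or -1."""
    if compound >= 0.05:
        return 1      # positive
    if compound <= -0.05:
        return -1     # negative
    return 0          # neutral band between the two cut-offs


# Illustrative scores only, not real VADER output
for score in (0.62, 0.01, -0.40):
    print(score, label_from_compound(score))
```

Scores inside the narrow band around zero are treated as neutral, so weakly worded headlines do not move the price.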

---

##  3. Fetch Top News from Bing RSS

### Source

```
https://www.bing.com/news/search?q=latest+news&format=rss
```

### What It Does

* Fetches latest news headlines
* Extracts:

  * Title
  * Link
  * Publish date
  * Sentiment score

Output is stored in a **DataFrame (`news_df`)**.
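
As a rough sketch of the extraction step, the snippet below parses a small RSS document with the standard library instead of BeautifulSoup. The feed content in `SAMPLE_RSS` is made up for illustration, and `parse_rss_items` is a hypothetical helper, not the notebook's function; sentiment scoring is left out.

```python
import xml.etree.ElementTree as ET

# Made-up feed content, used only to show the structure of RSS <item> elements
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Markets rally after strong earnings</title>
      <link>https://example.com/a</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Storm disrupts holiday travel</title>
      <link>https://example.com/b</link>
      <pubDate>Mon, 01 Jan 2024 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text, max_items=10):
    """Return up to max_items dicts with title, link, and published date."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        if len(rows) >= max_items:
            break
        rows.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return rows


for row in parse_rss_items(SAMPLE_RSS):
    print(row["published"], "|", row["title"])
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` to build a table shaped like `news_df`.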

---

## 4. Load Semantic Embedding Model

### Model Used

```
all-MiniLM-L6-v2
```

### Purpose

* Converts news headlines into **vector embeddings**
* Enables **semantic similarity matching** with book descriptions

---

##  5. Scrape All 1,000 Books

### Website

```
https://books.toscrape.com
```

### Data Collected Per Book

* Title
* Price
* Rating (1–5)
* Stock (static value)
* Category
* Description

### Scraping Strategy

* Iterates page by page
* Visits individual book pages for **category & description**
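
The per-book parsing on a listing page comes down to a few small conversions. The sketch below shows them in isolation with the standard library; the helper names `parse_price`, `parse_rating`, and `page_url` are hypothetical, while the URL pattern and the rating words come from the notebook code.

```python
import re

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text: str) -> float:
    """Strip the currency symbol (and any encoding debris) and keep the number."""
    return float(re.sub(r"[^\d.]", "", text))


def parse_rating(css_classes) -> int:
    """The rating is encoded as a CSS class such as ['star-rating', 'Three']."""
    for cls in css_classes:
        if cls in RATING_WORDS:
            return RATING_WORDS[cls]
    return 0  # unknown rating


def page_url(page: int) -> str:
    """Listing pages are numbered from 1."""
    return f"https://books.toscrape.com/catalogue/page-{page}.html"


print(parse_price("£51.77"), parse_rating(["star-rating", "Three"]), page_url(2))
```

Looking the rating up by membership rather than by position avoids depending on the order of the CSS classes.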

---

##  6. Semantic Matching with News

### How Matching Works

1. Clean book description
2. Generate embedding
3. Compare with all news embeddings
4. Select **most similar news headline**
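
Steps 3 and 4 amount to an argmax over cosine similarities. The sketch below shows that step with plain Python lists so it runs without a model; the three-dimensional vectors are toy values, not real `all-MiniLM-L6-v2` embeddings, and `best_match` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(book_vec, news_vecs):
    """Return (index, score) of the news vector most similar to the book."""
    scores = [cosine(book_vec, v) for v in news_vecs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


# Toy vectors standing in for headline and description embeddings
news_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
book_vec = [0.5, 0.9, 0.1]

idx, score = best_match(book_vec, news_vecs)
print(idx, round(score, 3))
```

Note that an argmax always returns some headline, even when every similarity is low, so a minimum-similarity cut-off may be worth considering before trusting a match.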

### Output Columns

* `matched_news`
* `news_sentiment`

This links **real-world events** to book themes.

---

##  7. Seasonal & Strategic Rules



| Season    | Keywords                  | Multiplier |
| --------- | ------------------------- | ---------- |
| Holiday   | gift, christmas, holiday  | 1.20       |
| Vacation  | travel, summer, journey   | 1.15       |
| Education | education, school, study  | 1.18       |
| Politics  | politics, government, war | 1.12       |

### Logic

* Matches keywords against:

  * Book category
  * Description
  * Related news
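
A minimal sketch of this rule lookup is shown below. It matches whole words rather than substrings, which is a deliberate variation for illustration: plain substring checks would let `war` fire on words such as "award" or "toward". The rule names and the function signature are assumptions; the multipliers come from the table above.

```python
import re

# First matching rule wins, in the order listed
SEASON_RULES = {
    "Holiday": (["gift", "christmas"], 1.20),
    "Vacation": (["travel", "summer"], 1.15),
    "Education": (["school", "study"], 1.18),
    "Politics": (["war", "government"], 1.12),
}


def detect_season(category, description, matched_news):
    """Return (multiplier, season) for the first rule with a keyword hit."""
    text = " ".join(p for p in (category, description, matched_news) if p).lower()
    tokens = set(re.findall(r"[a-z]+", text))  # whole words only
    for season, (keywords, multiplier) in SEASON_RULES.items():
        if tokens.intersection(keywords):
            return multiplier, season
    return 1.0, None


print(detect_season("Travel", "A journey across Europe", None))
print(detect_season("Fiction", "She walked toward the award ceremony", None))
```

The trade-off is that whole-word matching misses inflected forms such as "gifts" unless they are added to the keyword lists.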

---

##  8. Dynamic Price Adjustment Logic

### Pricing Factors

1. **News Sentiment**

   * Positive → +10%
   * Negative → −5%

2. **High Rating Bonus**

   * Rating ≥ 4 → +15%

3. **Seasonal Multiplier**

### Final Formula

```text
Adjusted Price = Base Price × Sentiment × Rating × Season
```
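
To make the multiplicative formula concrete, here is a small worked example. The function mirrors the three pricing factors described above; the book values passed in are made up for illustration.

```python
def adjust_price(base_price, news_sentiment, rating, seasonal_multiplier=1.0):
    """Apply the sentiment, rating, and seasonal factors in sequence."""
    price = base_price
    if news_sentiment > 0:
        price *= 1.10   # positive news: +10%
    elif news_sentiment < 0:
        price *= 0.95   # negative news: -5%
    if rating >= 4:
        price *= 1.15   # high-rating bonus: +15%
    price *= seasonal_multiplier
    return round(price, 2)


# 50.00 x 1.10 x 1.15 x 1.20
print(adjust_price(50.00, news_sentiment=1, rating=4, seasonal_multiplier=1.20))  # 75.9
# 20.00 x 0.95, with no rating bonus and no seasonal rule
print(adjust_price(20.00, news_sentiment=-1, rating=3))  # 19.0
```

Because the factors multiply, a book that triggers all three rules can end up more than 50% above its base price.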

---

## 9. Output

### CSV Generated

```
books_dynamic_pricing_top10_news.csv
```

### Contains

* Original price
* Adjusted price
* Book metadata
* Matched news
* Sentiment score
* Seasonal reason





In [4]:
# =====================================================
# 1️⃣ INSTALL & IMPORTS
# =====================================================
!pip install sentence-transformers nltk requests beautifulsoup4

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import nltk

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

from sentence_transformers import SentenceTransformer, util
import torch
from urllib.parse import urljoin

# =====================================================
# 2️⃣ NLTK SETUP
# =====================================================
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("stopwords")
nltk.download("vader_lexicon")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
sia = SentimentIntensityAnalyzer()

def clean_text(text, min_len=4):
    words = re.findall(r"[a-zA-Z]{%d,}" % min_len, text.lower())
    return " ".join(lemmatizer.lemmatize(w) for w in words if w not in stop_words)

def sentiment_score(text):
    score = sia.polarity_scores(text)["compound"]
    if score >= 0.05:
        return 1
    elif score <= -0.05:
        return -1
    return 0

# =====================================================
# 3️⃣ FETCH TOP 10 NEWS FROM BING RSS
# =====================================================
RSS_URL = "https://www.bing.com/news/search?q=latest+news&format=rss"

def fetch_top_news_rss(max_items=10):  # top 10 news
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get(RSS_URL, headers=headers, timeout=10)  # avoid hanging on a stalled connection
    r.raise_for_status()

    soup = BeautifulSoup(r.content, "xml")
    items = soup.find_all("item")

    news_list = []
    for item in items[:max_items]:
        news_list.append({
            "title": item.title.text if item.title else "",
            "link": item.link.text if item.link else "",
            "published": item.pubDate.text if item.pubDate else "",
            "sentiment": sentiment_score(item.title.text if item.title else "")
        })
    return news_list

news = fetch_top_news_rss(10)
news_df = pd.DataFrame(news)
print("✅ Top 10 news fetched and sentiment applied")
print(news_df[["title", "sentiment"]])

# =====================================================
# 4️⃣ LOAD EMBEDDING MODEL FOR TOP 10 NEWS
# =====================================================
model = SentenceTransformer("all-MiniLM-L6-v2")
news_embeddings = model.encode(news_df["title"].tolist(), convert_to_tensor=True)

# =====================================================
# 5️⃣ SCRAPE ALL 1000 BOOKS
# =====================================================
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"
BASE_SITE = "https://books.toscrape.com/catalogue/"

def rating_to_number(r):
    return {"One":1,"Two":2,"Three":3,"Four":4,"Five":5}.get(r,0)

def scrape_book_details(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    breadcrumb = soup.select(".breadcrumb li")
    category = breadcrumb[2].text.strip() if len(breadcrumb)>=3 else "Unknown"
    desc_tag = soup.select_one("#product_description")
    description = desc_tag.find_next("p").text.strip() if desc_tag else ""
    return category, description

def scrape_page(page):
    r = requests.get(BASE_URL.format(page))
    if r.status_code != 200:
        return []
    soup = BeautifulSoup(r.text, "html.parser")
    books = soup.select(".product_pod")
    data = []
    for b in books:
        title = b.h3.a["title"]
        price = float(re.sub(r"[^\d.]", "", b.select_one(".price_color").text))
        rating = rating_to_number(b.find("p", class_="star-rating")["class"][1])
        stock = 5
        link = urljoin(BASE_SITE, b.h3.a["href"])
        category, desc = scrape_book_details(link)
        data.append({
            "title": title,
            "price": price,
            "rating": rating,
            "stock": stock,
            "category": category,
            "description": desc
        })
    return data

books = []
page = 1
while True:
    page_books = scrape_page(page)
    if not page_books:
        break
    books.extend(page_books)
    print(f"✔ Page {page} scraped | Total books: {len(books)}")
    page += 1

df = pd.DataFrame(books)

# =====================================================
# 6️⃣ SEMANTIC MATCHING WITH TOP 10 NEWS
# =====================================================
def match_news_and_sentiment(desc):
    if not desc:
        return None, 0
    emb = model.encode(clean_text(desc), convert_to_tensor=True)
    scores = util.cos_sim(emb, news_embeddings)
    idx = torch.argmax(scores).item()
    return news_df.iloc[idx]["title"], news_df.iloc[idx]["sentiment"]

df[["matched_news", "news_sentiment"]] = df["description"].apply(
    lambda x: pd.Series(match_news_and_sentiment(x))
)

# =====================================================
# 7️⃣ SEASONAL / STRATEGY RULES
# =====================================================
SEASON_RULES = {
    "Holiday gifting season": (["gift","christmas","holiday"],1.20),
    "Vacation season": (["travel","summer","journey"],1.15),
    "Back-to-school season": (["education","school","study"],1.18),
    "Political awareness season": (["politics","government","war"],1.12)
}

def detect_season(row):
    text = f"{row['category']} {row['description']} {row['matched_news']}".lower()
    # Match whole words so that "war" does not fire on "award" or "toward"
    tokens = set(re.findall(r"[a-z]+", text))
    for reason,(keywords,multiplier) in SEASON_RULES.items():
        if any(k in tokens for k in keywords):
            return multiplier, reason
    return 1.0, None

df[["seasonal_multiplier","seasonal_reason"]] = df.apply(lambda r: pd.Series(detect_season(r)), axis=1)

# =====================================================
# 8️⃣ ADJUST PRICE BASED ON NEWS SENTIMENT & STRATEGY
# =====================================================
def adjust_price(row):
    price = row["price"]
    # positive news => increase, negative => decrease
    if row["news_sentiment"] > 0:
        price *= 1.10
    elif row["news_sentiment"] < 0:
        price *= 0.95
    # high rating multiplier
    if row["rating"] >= 4:
        price *= 1.15
    # seasonal multiplier
    price *= row["seasonal_multiplier"]
    return round(price,2)

df["adjusted_price"] = df.apply(adjust_price, axis=1)

# =====================================================
# 9️⃣ SAVE CSV
# =====================================================
df["top_10_news"] = " | ".join(news_df["title"].tolist())
df.to_csv("books_dynamic_pricing_top10_news.csv", index=False)
print("\n✅ CSV saved: books_dynamic_pricing_top10_news.csv")

# =====================================================
# 1️⃣0️⃣ DONE
# =====================================================
print("\n🎯 DONE: Semantic Top 10 News + Sentiment Pricing applied to books")





✅ Top 10 news fetched and sentiment applied
                                               title  sentiment
0  NFL Black Monday live updates: Latest rumors, ...          0
1  Transfer news LIVE: Man Utd split on Amorim re...          1
2  MLB Hot Stove tracker: Live updates on news, r...          1
3  NFL coaches, GMs fired: Black Monday live trac...         -1
4  Transfer news LIVE: Liverpool star 'open' to e...          0
5  NFL news, live updates ahead of wild card: Bro...         -1
6  Liverpool transfer news LIVE - Adam Wharton in...          1
7  Brock Purdy injury update: Latest news on 49er...         -1
8  Latest News on Cody Bellinger's MLB Free Agenc...          1
9  Will there be a $2,000 stimulus check? Latest ...          0
✔ Page 1 scraped | Total books: 20
✔ Page 2 scraped | Total books: 40
✔ Page 3 scraped | Total books: 60
✔ Page 4 scraped | Total books: 80
✔ Page 5 scraped | Total books: 100
✔ Page 6 scraped | Total books: 120
✔ Page 7 scraped | Total

Author popularity was then fetched using the Open Library search API:
 OPEN_LIBRARY_SEARCH_URL = "https://openlibrary.org/search.json"


In [None]:
# =====================================================
# 1️⃣ INSTALL REQUIRED LIBRARIES
# =====================================================
!pip install requests pandas tqdm

# =====================================================
# 2️⃣ IMPORTS
# =====================================================
import requests
import pandas as pd
import time
from tqdm import tqdm

# =====================================================
# 3️⃣ LOAD EXISTING DATASET
# =====================================================
# This file is generated by the previous pipeline
df = pd.read_csv("books_dynamic_pricing_top10_news.csv")
print(f"📚 Loaded {len(df)} books")

# =====================================================
# 4️⃣ OPEN LIBRARY API CONFIG
# =====================================================
OPEN_LIBRARY_SEARCH_URL = "https://openlibrary.org/search.json"

def get_author_from_openlibrary(title, sleep_time=0.25):
    """
    Fetch author name and edition count from Open Library using book title.
    Returns:
        author_name (str)
        edition_count (int)
    """
    try:
        params = {
            "q": title,
            "limit": 1
        }
        response = requests.get(
            OPEN_LIBRARY_SEARCH_URL,
            params=params,
            timeout=10
        )
        response.raise_for_status()
        data = response.json()

        if data.get("docs"):
            doc = data["docs"][0]
            author = doc.get("author_name", ["Unknown Author"])[0]
            edition_count = doc.get("edition_count", 1)
            return author, edition_count

        return "Unknown Author", 1

    except Exception:
        return "Unknown Author", 1

    finally:
        time.sleep(sleep_time)  # rate-limit protection

# =====================================================
# 5️⃣ FETCH AUTHOR DATA FOR EACH BOOK
# =====================================================
authors = []
edition_counts = []

print("🔎 Fetching author metadata from Open Library...")

for title in tqdm(df["title"]):
    author, editions = get_author_from_openlibrary(title)
    authors.append(author)
    edition_counts.append(editions)

df["author_name"] = authors
df["author_edition_count"] = edition_counts

# =====================================================
# 6️⃣ AUTHOR-LEVEL AGGREGATION
# =====================================================
author_stats = (
    df.groupby("author_name")
      .agg(
          author_book_count=("title", "count"),
          author_avg_rating=("rating", "mean"),
          author_avg_seasonal_demand=("seasonal_multiplier", "mean"),
          author_avg_edition_count=("author_edition_count", "mean")
      )
      .reset_index()
)

# =====================================================
# 7️⃣ NORMALIZATION FUNCTION
# =====================================================
def normalize(series):
    return (series - series.min()) / (series.max() - series.min() + 1e-9)

# =====================================================
# 8️⃣ AUTHOR POPULARITY INDEX (0–100)
# =====================================================
author_stats["author_popularity_index"] = (
      0.4 * normalize(author_stats["author_book_count"])
    + 0.3 * normalize(author_stats["author_avg_rating"])
    + 0.3 * normalize(author_stats["author_avg_edition_count"])
) * 100

author_stats["author_popularity_index"] = (
    author_stats["author_popularity_index"].round(2)
)

# =====================================================
# 9️⃣ ADD THREE ANALYTICAL COLUMNS
# =====================================================

# 🔹 1. Author Market Power
author_stats["author_market_power"] = (
    author_stats["author_popularity_index"] *
    author_stats["author_avg_rating"]
).round(2)

# 🔹 2. Author Visibility Score (review / discussion proxy)
author_stats["author_visibility_score"] = (
    author_stats["author_book_count"] *
    author_stats["author_avg_edition_count"]
).round(2)

# 🔹 3. Author Trend Status
def classify_author_trend(row):
    if row["author_popularity_index"] >= 70:
        return "Established Author"
    elif row["author_popularity_index"] >= 40:
        return "Emerging Author"
    else:
        return "Niche Author"

author_stats["author_trend_status"] = author_stats.apply(
    classify_author_trend, axis=1
)

# =====================================================
# 🔟 MERGE AUTHOR DATA BACK TO BOOK DATA
# =====================================================
df = df.merge(author_stats, on="author_name", how="left")

# =====================================================
# 1️⃣1️⃣ SAVE OUTPUT FILES
# =====================================================
df.to_csv("books_with_author_popularity.csv", index=False)
author_stats.to_csv("author_popularity_index.csv", index=False)

print("\n✅ FILES SAVED SUCCESSFULLY:")
print("📘 books_with_author_popularity.csv")
print("✍️ author_popularity_index.csv")

# =====================================================
# 1️⃣2️⃣ DONE
# =====================================================
print("\n🎯 DONE: Open Library author enrichment + popularity analysis completed")


After fetching author popularity, we fetched review-like text for each book, trying Open Library first and falling back to
  "https://www.googleapis.com/books/v1/volumes"

In [None]:
# =====================================================
# 0️⃣ INSTALL & IMPORTS
# =====================================================
!pip install pandas requests tqdm --quiet

import pandas as pd
import requests
import re
from tqdm import tqdm

# =====================================================
# 1️⃣ LOAD BOOK LIST
# =====================================================
BOOK_FILE = "books_with_author_popularity.csv"

bt_df = pd.read_csv(BOOK_FILE)

# Automatically detect title column
title_col = next(c for c in bt_df.columns if "title" in c.lower())

# =====================================================
# 2️⃣ CLEAN BOOK TITLES
# =====================================================
def clean_title(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"\(.*?\)", "", text)           # remove bracket content
    text = re.sub(r"[^a-z0-9\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()      # normalize spaces

bt_df["clean_title"] = bt_df[title_col].apply(clean_title)

# =====================================================
# 3️⃣ FETCH BOOK TEXT FROM APIs
# =====================================================
def fetch_openlibrary_text(title):
    try:
        r = requests.get(
            "https://openlibrary.org/search.json",
            params={"title": title, "limit": 1},
            timeout=10
        )
        data = r.json()

        if data.get("numFound", 0) == 0:
            return ""

        doc = data["docs"][0]
        parts = []

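        fs = doc.get("first_sentence", "")
        if isinstance(fs, dict):
            fs = fs.get("value", "")
        elif isinstance(fs, list):
            # Assumption: the field may arrive as a list of sentences
            fs = " ".join(str(s) for s in fs)
        if fs:
            parts.append(str(fs))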

        subjects = doc.get("subject", [])
        if subjects:
            parts.append(" ".join(subjects[:8]))

        return " ".join(parts).strip()

    except Exception:
        return ""


def fetch_google_text(title):
    try:
        r = requests.get(
            "https://www.googleapis.com/books/v1/volumes",
            params={"q": title, "maxResults": 1},
            timeout=10
        )
        data = r.json()

        if "items" not in data:
            return ""

        return data["items"][0]["volumeInfo"].get("description", "").strip()

    except Exception:
        return ""

# =====================================================
# 4️⃣ COLLECT & FILTER TEXT
# =====================================================
texts = []

for t in tqdm(bt_df["clean_title"], desc="Fetching book text"):
    text = fetch_openlibrary_text(t)

    # Fallback if OpenLibrary text is weak
    if len(text) < 80:
        text = fetch_google_text(t)

    texts.append(text)

bt_df["review_text"] = texts
bt_df["text_length"] = bt_df["review_text"].apply(len)

# Remove weak / useless descriptions
bt_df = bt_df[bt_df["text_length"] >= 80].reset_index(drop=True)

# =====================================================
# 5️⃣ DEMAND SIGNAL WORD SETS
# =====================================================
BUY_WORDS = {
    "buy", "bought", "purchase", "worth",
    "recommend", "recommended", "must", "gift"
}

HYPE_WORDS = {
    "amazing", "bestseller", "classic",
    "brilliant", "popular", "loved", "famous"
}

def count_word_hits(text, vocab):
    tokens = set(re.findall(r"\b[a-z]+\b", text.lower()))
    return sum(1 for w in vocab if w in tokens)

bt_df["buying_intent_score"] = bt_df["review_text"].apply(
    lambda x: count_word_hits(x, BUY_WORDS)
)

bt_df["hype_score"] = bt_df["review_text"].apply(
    lambda x: count_word_hits(x, HYPE_WORDS)
)

# =====================================================
# 6️⃣ DEMAND SCORE (ROBUST & REALISTIC)
# =====================================================
# Normalize text length (0–1 scale)
bt_df["length_norm"] = bt_df["text_length"] / bt_df["text_length"].max()

# Weighted demand score
bt_df["demand_score"] = (
    0.5 * bt_df["length_norm"] +
    0.3 * bt_df["buying_intent_score"] +
    0.2 * bt_df["hype_score"]
)

# High-demand threshold → top 10%
threshold = bt_df["demand_score"].quantile(0.90)

bt_df["high_demand"] = bt_df["demand_score"] >= threshold

# =====================================================
# 7️⃣ SAVE OUTPUT
# =====================================================
OUTPUT_FILE = "books_with_demand.csv"
bt_df.to_csv(OUTPUT_FILE, index=False)

# =====================================================
# 8️⃣ SUMMARY
# =====================================================
print("✅ DONE")
print("Books analyzed:", len(bt_df))
print("High-demand books:", bt_df["high_demand"].sum())
print("Output saved to:", OUTPUT_FILE)


In [2]:
!pip install pandas requests tqdm nltk sentence-transformers beautifulsoup4 vaderSentiment




In [3]:
# =====================================================
# 0️⃣ IMPORTS
# =====================================================
import requests
import pandas as pd
import re
import nltk
import torch
from tqdm import tqdm
from bs4 import BeautifulSoup
from urllib.parse import urljoin

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sentence_transformers import SentenceTransformer, util
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# =====================================================
# 1️⃣ NLTK SETUP
# =====================================================
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
sentiment_analyzer = SentimentIntensityAnalyzer()

# =====================================================
# 2️⃣ HELPER FUNCTIONS
# =====================================================
def clean_text(text):
    words = re.findall(r"[a-zA-Z]{4,}", str(text).lower())
    return " ".join(lemmatizer.lemmatize(w) for w in words if w not in stop_words)

def get_sentiment(text):
    return sentiment_analyzer.polarity_scores(str(text))["compound"]

def sentiment_label(score):
    if score >= 0.05:   # same cut-offs as sentiment_score in the first cell
        return "Positive"
    elif score <= -0.05:
        return "Negative"
    return "Neutral"

# =====================================================
# 3️⃣ FETCH TOP NEWS (FOR COSINE SIMILARITY)
# =====================================================
def fetch_top_news():
    rss = "https://www.bing.com/news/search?q=latest+news&format=rss"
    soup = BeautifulSoup(requests.get(rss).content, "xml")
    news = [item.title.text for item in soup.find_all("item")[:10]]
    return news

news_titles = fetch_top_news()

# =====================================================
# 4️⃣ LOAD EMBEDDING MODEL
# =====================================================
model = SentenceTransformer("all-MiniLM-L6-v2")
news_embeddings = model.encode(news_titles, convert_to_tensor=True)

# =====================================================
# 5️⃣ SCRAPE BOOKS.TOSCRAPE (1000 BOOKS)
# =====================================================
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"
BASE_SITE = "https://books.toscrape.com/catalogue/"

def rating_to_number(r):
    return {"One":1,"Two":2,"Three":3,"Four":4,"Five":5}.get(r,0)

def scrape_book_details(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    desc = soup.select_one("#product_description")
    desc = desc.find_next("p").text if desc else ""
    return desc

books = []
page = 1

while True:
    r = requests.get(BASE_URL.format(page))
    if r.status_code != 200:
        break
    soup = BeautifulSoup(r.text, "html.parser")

    for b in soup.select(".product_pod"):
        title = b.h3.a["title"]
        price = float(re.sub(r"[^\d.]", "", b.select_one(".price_color").text))
        rating = rating_to_number(b.find("p", class_="star-rating")["class"][1])
        link = urljoin(BASE_SITE, b.h3.a["href"])
        desc = scrape_book_details(link)

        books.append({
            "title": title,
            "price": price,
            "rating": rating,
            "description": desc
        })

    page += 1

df = pd.DataFrame(books)

# =====================================================
# 6️⃣ AUTHOR DATA (OPEN LIBRARY)
# =====================================================
def fetch_author(title):
    try:
        r = requests.get(
            "https://openlibrary.org/search.json",
            params={"title": title, "limit": 1},
            timeout=10,
        )
        r.raise_for_status()
        doc = r.json()["docs"][0]
        return doc.get("author_name", ["Unknown"])[0], doc.get("edition_count", 1)
    except Exception:
        # Network errors, bad JSON, or an empty result list fall back to defaults
        return "Unknown", 1

authors, editions = [], []

for t in tqdm(df["title"], desc="Fetching authors"):
    a, e = fetch_author(t)
    authors.append(a)
    editions.append(e)

df["author_name"] = authors
df["edition_count"] = editions

author_stats = df.groupby("author_name").agg(
    author_book_count=("title", "count"),
    author_avg_rating=("rating", "mean"),
    author_avg_edition_count=("edition_count", "mean")
).reset_index()

author_stats["author_popularity_index"] = (
    0.4 * (author_stats["author_book_count"] / author_stats["author_book_count"].max()) +
    0.4 * (author_stats["author_avg_rating"] / 5) +
    0.2 * (author_stats["author_avg_edition_count"] / author_stats["author_avg_edition_count"].max())
)

df = df.merge(author_stats, on="author_name", how="left")

# =====================================================
# 7️⃣ REVIEW PROXY + SENTIMENT
# =====================================================
df["review_text"] = df["description"]
df["review_length"] = df["review_text"].apply(len)
df["sentiment_score"] = df["review_text"].apply(get_sentiment)
df["sentiment_label"] = df["sentiment_score"].apply(sentiment_label)

# =====================================================
# 8️⃣ COSINE SIMILARITY WITH NEWS
# =====================================================
def match_news(desc):
    emb = model.encode(clean_text(desc), convert_to_tensor=True)
    scores = util.cos_sim(emb, news_embeddings)
    idx = torch.argmax(scores).item()
    return news_titles[idx], scores[0][idx].item()

df[["matched_news", "news_similarity_score"]] = df["description"].apply(
    lambda x: pd.Series(match_news(x))
)

# =====================================================
# 9️⃣ SEASONAL STRATEGY
# =====================================================
SEASON_RULES = {
    "Holiday": (["christmas", "gift"], 1.2),
    "Education": (["school", "study"], 1.15),
    "Politics": (["war", "government"], 1.1)
}

def detect_season(text):
    # Match whole words so that "war" does not fire on "award" or "toward"
    tokens = set(re.findall(r"[a-z]+", str(text).lower()))
    for name, (keys, mult) in SEASON_RULES.items():
        if any(k in tokens for k in keys):
            return name, mult
    return None, 1.0

df[["seasonal_reason", "seasonal_multiplier"]] = df["description"].apply(
    lambda x: pd.Series(detect_season(x))
)

# =====================================================
# 🔟 FINAL DEMAND SCORE
# =====================================================
df["final_demand_score"] = (
    0.25 * df["sentiment_score"] +
    0.20 * (df["review_length"] / df["review_length"].max()) +
    0.20 * df["news_similarity_score"] +
    0.20 * df["author_popularity_index"] +
    0.15 * (df["rating"] / 5)
)

threshold = df["final_demand_score"].quantile(0.9)
df["high_demand"] = df["final_demand_score"] >= threshold

# =====================================================
# 1️⃣1️⃣ FINAL PRICE
# =====================================================
df["adjusted_price"] = (
    df["price"] *
    (1 + df["final_demand_score"]) *
    df["seasonal_multiplier"]
).round(2)

# =====================================================
# 1️⃣2️⃣ SAVE FINAL OUTPUT
# =====================================================
df.to_csv("books_ai_pricing_full_pipeline.csv", index=False)

print("✅ FULL PIPELINE COMPLETE")
print("📁 Output: books_ai_pricing_full_pipeline.csv")



Fetching authors: 100%|██████████| 1000/1000 [05:29<00:00,  3.03it/s]


✅ FULL PIPELINE COMPLETE
📁 Output: books_ai_pricing_full_pipeline.csv



##  Features

- Scrapes **1000 books** from [BooksToScrape](https://books.toscrape.com) including title, price, rating, and description.
- Fetches **author data** from Open Library API (author name and edition count).
- Performs **sentiment analysis** on book descriptions.
- Computes **cosine similarity** between book descriptions and **latest news headlines**.
- Applies **seasonal strategies** to increase pricing during relevant seasons (e.g., Christmas gifts, educational books, political topics).
- Combines multiple factors to compute a **final demand score** for each book.
- Calculates **adjusted prices** dynamically based on demand, sentiment, and seasonal factors.
- Outputs a CSV file with **all enriched book data** and price recommendations.

---

## 🛠️ Libraries & Tools Used

| Library | Purpose |
|---------|---------|
| `requests` | HTTP requests for scraping and API calls |
| `pandas` | Data storage, transformation, and output |
| `BeautifulSoup` | HTML parsing for web scraping |
| `nltk` | Text preprocessing (stopwords, lemmatization) |
| `vaderSentiment` | Sentiment scoring of book descriptions |
| `sentence_transformers` | Semantic embeddings for text similarity |
| `torch` | Tensor computations and cosine similarity |
| `tqdm` | Progress bars for loops |

---

##  Pipeline Steps

### 1️⃣ NLTK Setup

- Downloads the stopwords and WordNet corpora.
- Initializes:
  - `stop_words` → words to ignore for text processing.
  - `lemmatizer` → reduces words to base forms.
  - `sentiment_analyzer` → scores text positivity/negativity.

---

### 2️⃣ Text Cleaning & Sentiment

- **clean_text**: removes non-alphabetic characters, words <4 letters, stopwords, and lemmatizes text.
- **get_sentiment**: returns a compound sentiment score (-1 to 1).
- **sentiment_label**: converts numeric score into "Positive", "Neutral", or "Negative".
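
The cleaning step can be approximated without NLTK, as in the sketch below. It keeps the same regular expression (alphabetic words of at least 4 letters) but uses a tiny hand-picked stopword set and skips lemmatization, so it is an illustration of the idea rather than a drop-in replacement; `clean_text_sketch` and `STOP_WORDS` are hypothetical names.

```python
import re

# Illustrative subset only; the pipeline uses NLTK's full English stopword list
STOP_WORDS = {"this", "that", "with", "from", "have", "about", "their"}


def clean_text_sketch(text, min_len=4):
    """Lowercase, keep alphabetic words of min_len+ letters, drop stopwords."""
    words = re.findall(r"[a-zA-Z]{%d,}" % min_len, str(text).lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


print(clean_text_sketch("This is a story about two wizards from 1998!"))
# story wizards
```

Short words, digits, and punctuation never reach the embedding model, which keeps the encoded text focused on content words.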

---

### 3️⃣ Fetch Latest News

- Scrapes **top 10 news titles** from Bing News RSS feed.
- News embeddings will be used to determine **book relevance to current events**.

---

### 4️⃣ Load Embedding Model

- Uses `SentenceTransformer("all-MiniLM-L6-v2")` to encode both:
  - **News headlines**
  - **Book descriptions**
- Enables **semantic similarity matching** between books and news.

---

### 5️⃣ Scrape Books Data

- Scrapes **BooksToScrape** (1000 books across multiple pages):
  - `title`, `price`, `rating`, `description`
- Converts **ratings from text to numeric**:
  - `"One" → 1, "Two" → 2, ..., "Five" → 5`

---

### 6️⃣ Fetch Author Data

- Uses **Open Library API** to fetch author name and edition count for each book title.
- Computes **author popularity index** based on:
  1. Number of books by author
  2. Average rating of their books
  3. Average edition count
- Formula for `author_popularity_index`:

  `author_popularity_index = 0.4 * normalized(book count) + 0.4 * normalized(avg rating) + 0.2 * normalized(avg editions)`

  In the code, book count and average editions are divided by their maximum across authors, and average rating is divided by 5.
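
A small worked example may help show how the weights combine. The sketch below uses plain dictionaries instead of a DataFrame and assumes, as the pipeline code does, that book count and average editions are scaled by their maximum across authors and average rating by 5; the two authors and their numbers are invented.

```python
# Invented per-author statistics, for illustration only
author_stats = {
    "Author A": {"book_count": 4, "avg_rating": 4.5, "avg_editions": 20.0},
    "Author B": {"book_count": 1, "avg_rating": 3.0, "avg_editions": 5.0},
}


def popularity_index(stats):
    """Weighted sum of three scaled signals, giving a value between 0 and 1."""
    max_books = max(a["book_count"] for a in stats.values())
    max_editions = max(a["avg_editions"] for a in stats.values())
    return {
        name: 0.4 * a["book_count"] / max_books
        + 0.4 * a["avg_rating"] / 5
        + 0.2 * a["avg_editions"] / max_editions
        for name, a in stats.items()
    }


for name, value in popularity_index(author_stats).items():
    print(name, round(value, 2))
```

One caveat of this design: every title with no Open Library match shares the same placeholder author, so that placeholder can end up with the largest book count and set the scale for everyone else.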



---

### 7️⃣ Review & Sentiment Scoring

- Uses **book description as proxy for reviews**.
- Computes:
  - `review_length` → length of description
  - `sentiment_score` → compound sentiment
  - `sentiment_label` → Positive/Neutral/Negative

---

### 8️⃣ Cosine Similarity with News

- Computes similarity between **cleaned book description** and **top news embeddings**.
- Adds:
  - `matched_news` → most similar news headline
  - `news_similarity_score` → cosine similarity score

---

### 9️⃣ Seasonal Strategy

- Applies **seasonal multipliers** for price adjustment:
  - `"Holiday"` → 1.2 (keywords: christmas, gift)
  - `"Education"` → 1.15 (keywords: school, study)
  - `"Politics"` → 1.1 (keywords: war, government)
- Adds columns:
  - `seasonal_reason` → detected season
  - `seasonal_multiplier` → multiplier value

---

### 🔟 Compute Final Demand Score

- Combines multiple factors for **overall book demand**:

  `final_demand_score = 0.25 * sentiment_score + 0.20 * (review_length / max(review_length)) + 0.20 * news_similarity_score + 0.20 * author_popularity_index + 0.15 * (rating / 5)`

- Flags books in the **top 10%** of demand:

  `df["high_demand"] = df["final_demand_score"] >= df["final_demand_score"].quantile(0.9)`
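
The weighted sum and the top-10% cut-off can be checked by hand with a short sketch. The function names and input values here are illustrative; the `quantile` helper uses linear interpolation between neighbouring sorted values, which is the default method of pandas `quantile`.

```python
def final_demand_score(sentiment, review_length, max_review_length,
                       news_similarity, author_popularity, rating):
    """Weighted sum of the five demand signals."""
    return (
        0.25 * sentiment
        + 0.20 * (review_length / max_review_length)
        + 0.20 * news_similarity
        + 0.20 * author_popularity
        + 0.15 * (rating / 5)
    )


def quantile(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# 0.25*0.8 + 0.20*0.5 + 0.20*0.3 + 0.20*0.96 + 0.15*0.8
score = final_demand_score(0.8, 500, 1000, 0.3, 0.96, 4)
print(round(score, 3))

# Made-up scores for ten books; flag those at or above the 90th percentile
scores = [i / 10 for i in range(1, 11)]
threshold = quantile(scores, 0.9)
high_demand = [s >= threshold for s in scores]
print(round(threshold, 2), sum(high_demand))
```

Since the sentiment score ranges from -1 to 1, a strongly negative description can pull the final score below zero.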


---

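### 1️⃣1️⃣ Calculate Adjusted Price

- Adjusted price formula:

  `adjusted_price = price * (1 + final_demand_score) * seasonal_multiplier`

- Multiplies the original price by the **demand factor** `(1 + final_demand_score)` and the **seasonal multiplier**, then rounds to 2 decimal places.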
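
As a quick check of the arithmetic, the sketch below applies the formula used in the pipeline code, `price * (1 + final_demand_score) * seasonal_multiplier`, to two made-up books. The standalone function is an illustration; the pipeline itself does this column-wise on the DataFrame.

```python
def adjusted_price(price, final_demand_score, seasonal_multiplier=1.0):
    """Scale the base price by the demand factor and the seasonal multiplier."""
    return round(price * (1 + final_demand_score) * seasonal_multiplier, 2)


# 20.00 x (1 + 0.5) x 1.2
print(adjusted_price(20.00, 0.5, 1.2))   # 36.0
# A negative demand score lowers the price: 20.00 x (1 - 0.1)
print(adjusted_price(20.00, -0.1))       # 18.0
```

A demand score of 0.5 already raises the price by half before any seasonal rule applies, so capping the demand factor may be worth considering in practice.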

---

### 1️⃣2️⃣ Save Output

- Saves the **enriched dataset** to CSV: `books_ai_pricing_full_pipeline.csv`



