# SEO Content Quality & Duplicate Detector

**Course / Assignment**: Data Science – SEO Content Quality & Duplicate Detection  
**Student**: _<Your Name Here>_  
**Program**: M.Sc. / M.Tech / MBA in Data Science / Analytics  
**Notebook Goal**: Build an end-to-end SEO content analysis pipeline that:

1. Parses raw HTML content and extracts clean text.
2. Engineers SEO-related text features (length, readability, TF–IDF, etc.).
3. Detects near-duplicate and thin content.
4. Trains a content quality classifier (Low / Medium / High).
5. Provides a real-time URL analysis demo that can be connected to a Streamlit UI.

## 0. Environment Setup & High-Level Design

In this notebook, we follow the **assignment structure** and map each section to the corresponding requirement:

1. **Data Collection & HTML Parsing (15%)**  
2. **Text Preprocessing & Feature Engineering (25%)**  
3. **Duplicate Detection & Thin Content (20%)**  
4. **Content Quality Scoring (25%)**  
5. **Real-Time Analysis Demo (15%)**  

The raw dataset is stored locally on my machine at:

```text
/Users/ajaykumark/Documents/UniArchive/MDS- Semesters/CodeWalnut/data.csv
```

This path is used only for reading; a project-local copy is saved under `./data/` to keep the repository portable.

In [None]:
from pathlib import Path
import re
import time
import json

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    confusion_matrix,
    classification_report
)

from scipy.sparse import save_npz
import pickle
import requests

# ------------------------------------------------------------------
# Project directories
# ------------------------------------------------------------------
BASE_DIR = Path().resolve()
DATA_DIR = BASE_DIR / "data"
MODELS_DIR = BASE_DIR / "models"

DATA_DIR.mkdir(exist_ok=True)
MODELS_DIR.mkdir(exist_ok=True)

# Absolute path to the raw CSV on my machine
ABS_INPUT_PATH = Path("/Users/ajaykumark/Documents/UniArchive/MDS- Semesters/CodeWalnut/data.csv")

ABS_INPUT_PATH, DATA_DIR, MODELS_DIR

### 0.1 Reusable Helper Functions

To keep the notebook clean and modular, we define all text-processing, readability, and HTML-parsing utilities in one place and reuse them throughout the pipeline.

In [None]:
def clean_text(s: str) -> str:
    """Replace non-breaking spaces and collapse whitespace runs into single spaces."""
    if not isinstance(s, str):
        return ""
    s = s.replace("\xa0", " ")
    s = re.sub(r"\s+", " ", s)
    return s.strip()


def sentence_tokenize(text: str):
    """Very simple sentence splitter based on punctuation."""
    if not isinstance(text, str):
        return []
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_tokenize(text: str):
    """Lowercase the text and split it into alphanumeric tokens (apostrophes are kept)."""
    if not isinstance(text, str):
        return []
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def count_syllables(word: str) -> int:
    """Rough syllable counter – sufficient for Flesch score."""
    word = word.lower()
    if len(word) <= 3:
        return 1
    vowels = "aeiouy"
    count = 0
    prev_vowel = False
    for ch in word:
        is_v = ch in vowels
        if is_v and not prev_vowel:
            count += 1
        prev_vowel = is_v
    if word.endswith("e"):
        count = max(1, count - 1)
    return max(1, count)


def flesch_reading_ease(text: str) -> float:
    """Compute Flesch Reading Ease score (higher = easier to read)."""
    words = word_tokenize(text)
    sentences = sentence_tokenize(text)
    if len(words) == 0 or len(sentences) == 0:
        return 0.0

    syllables = sum(count_syllables(w) for w in words)
    W = len(words)
    S = len(sentences)
    score = 206.835 - 1.015 * (W / S) - 84.6 * (syllables / W)
    return round(score, 2)


def extract_text_from_html(html: str):
    """
    Extract title + main body text from HTML using BeautifulSoup.

    Heuristic:
    1) Remove <script>, <style>, <noscript>.
    2) Prefer <main> or <article> blocks for the main content.
    3) Fallback: join all <p> tags.
    4) Last fallback: use the entire page text.
    """
    if not isinstance(html, str) or not html.strip():
        return "", ""

    soup = BeautifulSoup(html, "html.parser")

    for t in soup(["script", "style", "noscript"]):
        t.extract()

    title = soup.title.get_text(separator=" ", strip=True) if soup.title else ""

    candidates = []
    for tag_name in ["main", "article"]:
        tag = soup.find(tag_name)
        if tag:
            candidates.append(tag.get_text(separator=" ", strip=True))

    if not candidates:
        ps = [p.get_text(separator=" ", strip=True) for p in soup.find_all("p")]
        if len(ps) >= 2:
            candidates.append(" ".join(ps))
        else:
            candidates.append(soup.get_text(separator=" ", strip=True))

    body = max(candidates, key=len) if candidates else ""
    return clean_text(title), clean_text(body)


def label_quality(word_count: int, fre: float) -> str:
    """
    Synthetic SEO-style quality label:
    - High: long & comfortably readable
    - Low: very short or very hard to read
    - Medium: in-between
    """
    if (word_count > 1500) and (50 <= fre <= 70):
        return "High"
    if (word_count < 500) or (fre < 30):
        return "Low"
    return "Medium"
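
As a sanity check on the readability helper, the Flesch Reading Ease formula can be evaluated by hand for a tiny text. The sketch below is self-contained: `fre_from_counts` is a hypothetical helper (not part of the notebook) that takes the three counts directly, so the arithmetic is visible without depending on the tokenizers above. The syllable counts are assumed for illustration.

```python
def fre_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (hypothetical helper for illustration)."""
    if words == 0 or sentences == 0:
        return 0.0
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# "The cat sat. The dog ran." -> 6 words, 2 sentences, 6 one-syllable words
easy = fre_from_counts(words=6, sentences=2, syllables=6)

# Same length, but assume every word has three syllables (18 in total)
hard = fre_from_counts(words=6, sentences=2, syllables=18)

print(round(easy, 2), round(hard, 2))  # 119.19 -50.01
```

The score is not clamped to the 0 to 100 range: short, simple texts can exceed 100 and dense texts can go negative. The `50 <= fre <= 70` band used by `label_quality` is therefore a fairly narrow target.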

## 1. Data Collection & HTML Parsing (15%)

In this section, I:

1. Load the raw dataset from the given **absolute path** on my local machine.  
2. Store a copy inside the project (`./data/data.csv`) for reproducibility and Git version control.  
3. Parse the HTML stored in the `html_content` column to extract:
   - Page `title`
   - Cleaned `body_text` (main article content)
   - `word_count` based on tokenized text
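
To illustrate the extraction step, here is a minimal sketch on an inline HTML string. It follows the same idea as `extract_text_from_html` (drop scripts and styles, prefer `<article>`), but `toy_html` and `toy_extract` are made up for this example, and the notebook's own function remains the reference.

```python
from bs4 import BeautifulSoup

toy_html = """
<html>
  <head><title>Toy page</title><style>p { color: red; }</style></head>
  <body>
    <nav>Home | About</nav>
    <article><p>First paragraph.</p><p>Second paragraph.</p></article>
    <script>console.log("ignored");</script>
  </body>
</html>
"""

def toy_extract(html: str):
    """Return (title, body) using a simplified version of the heuristic."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove non-content tags before reading any text
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Prefer <article>; fall back to the whole page text
    block = soup.find("article")
    if block is None:
        block = soup
    # Join text nodes with spaces, then collapse any remaining whitespace
    body = " ".join(block.get_text(separator=" ", strip=True).split())
    return title, body

title, body = toy_extract(toy_html)
print(title)
print(body)
```

The navigation text and the script content do not appear in `body`, because only the `<article>` block is read once the non-content tags have been removed.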

In [None]:
# 1.1 Load raw dataset

if not ABS_INPUT_PATH.exists():
    raise FileNotFoundError(f"data.csv not found at: {ABS_INPUT_PATH}")

df_raw = pd.read_csv(ABS_INPUT_PATH)
df_raw.columns = [c.strip().lower() for c in df_raw.columns]  # normalize

print("Raw shape:", df_raw.shape)
df_raw.head()

In [None]:
# 1.2 Basic sanity checks: schema + missingness

df_raw.info()  # prints the schema itself and returns None, so no display() is needed
(df_raw.isna().mean() * 100).to_frame("missing_%")

In [None]:
# 1.3 Save a project-local copy for reproducibility

project_raw_path = DATA_DIR / "data.csv"
df_raw.to_csv(project_raw_path, index=False)
print("Copied raw data into:", project_raw_path)

In [None]:
# 1.4 HTML parsing → title, body_text, word_count

if "url" not in df_raw.columns:
    raise ValueError("Expected a 'url' column in the dataset.")
if "html_content" not in df_raw.columns:
    raise ValueError("Expected an 'html_content' column with HTML source.")

titles, bodies, word_counts = [], [], []

for _, row in df_raw.iterrows():
    html = row.get("html_content", "")
    try:
        title, body = extract_text_from_html(html)
    except Exception:
        title, body = "", ""
    titles.append(title)
    bodies.append(body)
    word_counts.append(len(word_tokenize(body)))

extracted = pd.DataFrame({
    "url": df_raw["url"],
    "title": titles,
    "body_text": bodies,
    "word_count": word_counts
})

print("Extracted content shape:", extracted.shape)
extracted.head()

In [None]:
# 1.5 Persist extracted content

extracted_path = DATA_DIR / "extracted_content.csv"
extracted.to_csv(extracted_path, index=False)
extracted_path

## 2. Text Preprocessing & Feature Engineering (25%)

Here I engineer a set of interpretable SEO features for each URL:

- `word_count` – content length signal  
- `sentence_count` – structural complexity  
- `flesch_reading_ease` – readability score  
- `is_thin` – thin content flag (`1` if word_count < 500)  
- `quality_label` – synthetic quality tier (High / Medium / Low) based on SEO-inspired rules  
- TF–IDF embeddings – for similarity and keyword extraction
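
The thin-content flag and the quality tiers are plain threshold rules, so they can be checked on a few hand-picked values. The sketch below restates the thresholds from `label_quality` in a hypothetical `toy_label` function and applies them to invented rows; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

def toy_label(word_count: int, fre: float) -> str:
    """Same thresholds as label_quality, restated so the block runs on its own."""
    if word_count > 1500 and 50 <= fre <= 70:
        return "High"
    if word_count < 500 or fre < 30:
        return "Low"
    return "Medium"

# Invented rows covering each branch of the rule
toy = pd.DataFrame({
    "word_count": [2000, 2000, 300, 800, 1800],
    "flesch_reading_ease": [60.0, 85.0, 60.0, 45.0, 20.0],
})
toy["is_thin"] = (toy["word_count"] < 500).astype(int)
toy["quality_label"] = [
    toy_label(wc, fre)
    for wc, fre in zip(toy["word_count"], toy["flesch_reading_ease"])
]
print(toy)
```

Note that the second row, a long page with a score above 70 (very easy to read), falls back to Medium under this rule, while the last row is Low despite its length because its score is below 30.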

In [None]:
# 2.1 Core features: sentence_count & readability

sentence_counts = [len(sentence_tokenize(t)) for t in extracted["body_text"]]
readability_scores = [flesch_reading_ease(t) for t in extracted["body_text"]]

features = pd.DataFrame({
    "url": extracted["url"],
    "word_count": extracted["word_count"],
    "sentence_count": sentence_counts,
    "flesch_reading_ease": readability_scores
})

# Thin content flag
features["is_thin"] = (features["word_count"] < 500).astype(int)

# Synthetic quality label
features["quality_label"] = [
    label_quality(wc, fre)
    for wc, fre in zip(features["word_count"], features["flesch_reading_ease"])
]

features.head()

In [None]:
# 2.2 TF–IDF embeddings (for similarity + keywords)

texts_for_tfidf = extracted["body_text"].fillna("").astype(str).tolist()

tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words="english"
)
X_tfidf = tfidf_vectorizer.fit_transform(texts_for_tfidf)

X_tfidf.shape

In [None]:
# 2.3 Example: top 5 keywords for first 3 URLs

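def top_keywords_for_doc(idx, top_k=5):
    """Return up to top_k terms with the highest non-zero TF-IDF weight."""
    row = X_tfidf[idx].toarray().ravel()
    if row.sum() == 0:
        return []
    feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
    top_idx = row.argsort()[-top_k:][::-1]
    top_idx = top_idx[row[top_idx] > 0]  # drop zero-weight terms in short documents
    return feature_names[top_idx].tolist()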

for i in range(min(3, X_tfidf.shape[0])):
    print("URL:", extracted.loc[i, "url"])
    print("Top keywords:", top_keywords_for_doc(i))
    print("-" * 80)

In [None]:
# 2.4 Persist features and embeddings

features_path = DATA_DIR / "features.csv"
features.to_csv(features_path, index=False)

emb_path = DATA_DIR / "tfidf_embeddings.npz"
save_npz(emb_path, X_tfidf)

features_path, emb_path

## 3. Duplicate Detection & Thin Content (20%)

This section focuses on **content reuse and cannibalisation** within the site:

1. Use TF–IDF embeddings to compute cosine similarity between every pair of pages.  
2. Treat pairs with similarity ≥ 0.80 as **near-duplicates**.  
3. Save all such pairs to `data/duplicates.csv`.  
4. Summarise:
   - Total pages
   - Number of duplicate pairs
   - Thin content percentage
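
The pairwise comparison can be sketched on a toy corpus. The three documents below are invented; the block uses the same `TfidfVectorizer` and `cosine_similarity` calls as the notebook, and reads the strict upper triangle of the similarity matrix with `np.triu_indices` so that each unordered pair is considered once.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented documents: the first two differ by a single trailing word
toy_docs = [
    "cheap flights to paris book cheap flights today",
    "cheap flights to paris book cheap flights today now",
    "how to bake sourdough bread at home",
]

toy_X = TfidfVectorizer().fit_transform(toy_docs)
toy_sim = cosine_similarity(toy_X)

# Indices of the strict upper triangle: each unordered pair once, no self-pairs
rows, cols = np.triu_indices(toy_sim.shape[0], k=1)
mask = toy_sim[rows, cols] >= 0.8
toy_pairs = [(int(i), int(j)) for i, j in zip(rows[mask], cols[mask])]
print(toy_pairs)
```

Only the pair `(0, 1)` passes the 0.80 threshold here; the third document shares just the word "to" with the others and scores far lower.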

In [None]:
# 3.1 Cosine similarity matrix and duplicate pairs

sim_matrix = cosine_similarity(X_tfidf)
n = sim_matrix.shape[0]
threshold = 0.8

pairs = []
for i in range(n):
    for j in range(i + 1, n):
        sim = sim_matrix[i, j]
        if sim >= threshold:
            pairs.append({
                "url1": extracted.loc[i, "url"],
                "url2": extracted.loc[j, "url"],
                "similarity": round(float(sim), 4)
            })

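# Explicit columns keep the CSV header intact even when no pair crosses the threshold
duplicates_df = pd.DataFrame(pairs, columns=["url1", "url2", "similarity"])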
duplicates_df.head()

In [None]:
# 3.2 Persist duplicate pairs + summary diagnostics

dup_path = DATA_DIR / "duplicates.csv"
duplicates_df.to_csv(dup_path, index=False)

summary = {
    "total_pages": int(len(extracted)),
    "duplicate_pairs": int(len(duplicates_df)),
    "thin_pages": int(features["is_thin"].sum()),
    "thin_pct": float(features["is_thin"].mean() * 100)
}

dup_path, summary

## 4. Content Quality Scoring (25%)

We now build a **supervised classifier** that predicts the content quality tier (Low / Medium / High).

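- Target variable: `quality_label` (generated by SEO-style rules from `word_count` and `flesch_reading_ease`, so the model is effectively learning to reproduce those rules).  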
- Features:
  - `word_count`
  - `sentence_count`
  - `flesch_reading_ease`
- Model: `RandomForestClassifier` (strong baseline for tabular features).  
- Baseline: a very simple *rule-only* classifier based on `word_count` thresholds (above 1000 → High, below 500 → Low, otherwise Medium).
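
Because `quality_label` is derived from two of the input features by fixed rules, a tree-based model can be expected to recover it closely, and a single train/test split may be unstable on a small dataset. One possible complement, sketched below on randomly generated stand-in data (not the assignment dataset), is to compare the forest against a majority-class `DummyClassifier` using stratified cross-validation. The feature ranges, sample size, and the `toy_label` helper are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 300
word_count = rng.integers(100, 3000, size=n)
fre = rng.uniform(0, 100, size=n)
# Stand-in for sentence_count: roughly one sentence per 20 words (assumption)
sentence_count = np.maximum(1, word_count // 20)

def toy_label(wc, score):
    """Same thresholds as label_quality, restated for a self-contained block."""
    if wc > 1500 and 50 <= score <= 70:
        return "High"
    if wc < 500 or score < 30:
        return "Low"
    return "Medium"

X_toy = np.column_stack([word_count, sentence_count, fre])
y_toy = np.array([toy_label(wc, s) for wc, s in zip(word_count, fre)])

# Same folds for both models so the scores are directly comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_toy, y_toy, cv=cv
)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X_toy, y_toy, cv=cv
)
print("majority baseline:", round(float(dummy_scores.mean()), 3))
print("random forest:    ", round(float(forest_scores.mean()), 3))
```

Scoring both models on the same folds keeps the comparison fair, which is not the case when a baseline is scored on all rows and the model only on a held-out split.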

In [None]:
# 4.1 Prepare data and compute baseline performance

X_model = features[["word_count", "sentence_count", "flesch_reading_ease"]].values
y = features["quality_label"].values

# Baseline rule: only word_count (scored on all rows, not on a held-out split)
baseline_pred = np.where(
    features["word_count"] > 1000,
    "High",
    np.where(features["word_count"] < 500, "Low", "Medium")
)

baseline_acc = accuracy_score(y, baseline_pred)
baseline_acc

In [None]:
# 4.2 Train/test split + RandomForest training

X_train, X_test, y_train, y_test = train_test_split(
    X_model, y, test_size=0.3, random_state=42, stratify=y
)

clf = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
cm = confusion_matrix(y_test, y_pred, labels=["Low", "Medium", "High"])
report = classification_report(
    y_test, y_pred, labels=["Low", "Medium", "High"], zero_division=0
)

acc, f1

In [None]:
# 4.3 Detailed evaluation & feature importances

print("Baseline accuracy (word_count rule):", baseline_acc)
print("Model accuracy:", acc)
print("Model weighted F1:", f1)
print("\nConfusion matrix (Low, Medium, High):\n", cm)
print("\nClassification report:\n", report)

feature_names = ["word_count", "sentence_count", "flesch_reading_ease"]
importances = clf.feature_importances_

print("\nFeature importances:")
for name, imp in sorted(zip(feature_names, importances), key=lambda x: -x[1]):
    print(f"- {name}: {imp:.3f}")

In [None]:
# 4.4 Persist trained model

model_path = MODELS_DIR / "quality_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

model_path

## 5. Real-Time Analysis Demo (15%)

Finally, I provide a **real-time analysis function** that can be reused directly
in a Streamlit app or an API endpoint.

Given a URL, `analyze_url(url)` will:

1. Fetch the HTML with a browser-like `User-Agent`, a 10-second timeout, and a short delay after the request.  
2. Extract clean text using the same parser as the offline pipeline.  
3. Compute:
   - `word_count`
   - `sentence_count`
   - `flesch_reading_ease`
4. Generate:
   - Rule-based quality label
   - Model-based quality label
   - Thin content flag
5. Compute similarity against the existing corpus and return top similar URLs.
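
`analyze_url` relies on objects that live in memory from earlier cells: the fitted `tfidf_vectorizer`, the corpus matrix `X_tfidf`, the `extracted` table of URLs, and the classifier `clf`. The notebook writes the classifier and the matrix to disk, but a separate Streamlit script would also need the fitted vectorizer. The following is a minimal sketch of one way to bundle and reload such objects with `pickle`, using a toy vectorizer and a temporary directory; the file name `artifacts.pkl` and the dictionary keys are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the notebook's corpus and URL list
toy_corpus = ["seo content quality", "duplicate content detection"]
toy_vectorizer = TfidfVectorizer().fit(toy_corpus)

with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "artifacts.pkl"  # hypothetical file name
    # Save everything the scoring function needs in one file
    with open(bundle_path, "wb") as f:
        pickle.dump({"vectorizer": toy_vectorizer, "urls": ["a", "b"]}, f)
    # Later (for example at app start-up), load the bundle back
    with open(bundle_path, "rb") as f:
        bundle = pickle.load(f)

# The restored vectorizer maps new text into the same 5-term vocabulary
restored = bundle["vectorizer"].transform(["content quality"])
print(restored.shape)
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that was used to save them.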

In [None]:
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/119.0.0.0 Safari/537.36"
    )
}

def analyze_url(url: str, sim_threshold: float = 0.5, delay: float = 1.5):
    """
    Real-time SEO analysis for a single URL.

    Returns a dictionary containing:
    - core metrics (length, readability)
    - rule & model quality labels
    - thin content flag
    - list of similar pages from the dataset
    """
    # 1) HTTP fetch
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        time.sleep(delay)
        if resp.status_code != 200:
            return {"url": url, "error": f"HTTP {resp.status_code}"}
    except Exception as e:
        return {"url": url, "error": str(e)}

    # 2) HTML parsing
    title, body = extract_text_from_html(resp.text)

    # 3) Feature computation
    wc = len(word_tokenize(body))
    sc = len(sentence_tokenize(body))
    fre = flesch_reading_ease(body)

    rule_label = label_quality(wc, fre)

    try:
        X_new = np.array([[wc, sc, fre]])
        model_label = str(clf.predict(X_new)[0])  # plain str keeps the result JSON-friendly
    except Exception:
        model_label = rule_label

    # 4) Similarity against corpus
    new_vec = tfidf_vectorizer.transform([body])
    sims = cosine_similarity(new_vec, X_tfidf).ravel()
    top_idx = sims.argsort()[::-1][:10]

    similar_pages = [
        {"url": extracted.loc[i, "url"], "similarity": round(float(sims[i]), 4)}
        for i in top_idx if sims[i] >= sim_threshold
    ]

    return {
        "url": url,
        "title": title,
        "word_count": wc,
        "sentence_count": sc,
        "readability": fre,
        "rule_quality_label": rule_label,
        "model_quality_label": model_label,
        "is_thin": wc < 500,
        "similar_pages": similar_pages
    }

In [None]:
# 5.1 Demo call (use a crawl-friendly URL; some sites may return HTTP 403)

test_url = "https://en.wikipedia.org/wiki/Data_science"
result = analyze_url(test_url)
print(json.dumps(result, indent=2))

## 6. Conclusion & Next Steps

This notebook demonstrates a complete **SEO content quality and duplicate detection pipeline**:

- Parsed raw HTML into clean title and body text.
- Engineered interpretable text features and thin-content flags.
- Detected near-duplicate URLs via TF–IDF + cosine similarity.
- Trained and evaluated a RandomForest-based quality classifier.
- Exposed a reusable `analyze_url()` function suitable for a Streamlit front-end.

**Possible extensions (for production):**
- Replace TF–IDF with transformer-based sentence embeddings (e.g., Sentence-BERT) for semantic similarity.
- Incorporate additional SEO signals (internal links, meta tags, heading structure, Core Web Vitals).
- Persist results into a database and schedule periodic re-crawling for monitoring content drift.