# SEO Content Quality & Duplicate Detector
## Complete ML Pipeline for Web Content Analysis

**Author:** LEAD_WALNUT Project  
**Date:** November 3, 2025  
**Objective:** Build a machine learning pipeline that analyzes web content for SEO quality assessment and duplicate detection.

---

## Project Overview

This notebook implements:
1. ✅ **HTML Parsing** - Extract clean text from HTML content
2. ✅ **Feature Engineering** - Calculate readability, keywords, embeddings
3. ✅ **Duplicate Detection** - Identify similar content using cosine similarity
4. ✅ **Quality Classification** - Train ML model to score content quality
5. ✅ **Real-Time Analysis** - Function to analyze any URL on-demand

---

In [55]:
# Core Libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# HTML Parsing
from bs4 import BeautifulSoup
import requests
from time import sleep

# NLP & Text Processing
import textstat
from sentence_transformers import SentenceTransformer
import nltk

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics.pairwise import cosine_similarity

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
import pickle
import json
import os
from pathlib import Path

print("All libraries imported successfully.")
print(f"Working directory: {os.getcwd()}")

All libraries imported successfully.
Working directory: c:\Users\kmgs4\Documents\Christ Uni\trimester-5\LEAD_WALNUT\notebooks


In [56]:
# Download required NLTK data
try:
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    print("NLTK data downloaded successfully.")
except Exception as e:
    print(f"Warning: NLTK download issue - {e}")

NLTK data downloaded successfully.


## 2. Load Dataset

Load the CSV file containing URLs and HTML content (65 rows sampled from the original dataset).

In [57]:
# Load dataset
df = pd.read_csv('../data/data.csv')

print(f"Dataset loaded successfully.")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nSample URLs:")
for i, url in enumerate(df['url'].head(3)):
    print(f"  {i+1}. {url[:80]}...")

Dataset loaded successfully.
Shape: (65, 2)
Columns: ['url', 'html_content']

Sample URLs:
  1. https://comfax.com/reviews/free-fax/...
  2. https://www.cm-alliance.com/cybersecurity-blog...
  3. https://en.wikipedia.org/wiki/Remote_desktop_software...


In [58]:
def parse_html(html_content):
    """
    Parse HTML content and extract clean text information.
    
    Args:
        html_content (str): Raw HTML content
        
    Returns:
        dict: Dictionary containing title, body_text, and word_count
              Returns None if parsing fails
    """
    try:
        soup = BeautifulSoup(html_content, 'lxml')
        
        # Extract page title
        title = soup.find('title')
        title_text = title.get_text().strip() if title else "No title"
        
        # Remove non-content elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside', 'form']):
            tag.decompose()
        
        # Extract body text
        body = soup.find('body')
        if body:
            text = body.get_text(separator=' ', strip=True)
        else:
            text = soup.get_text(separator=' ', strip=True)
        
        # Clean text by removing extra whitespace
        text = ' '.join(text.split())
        
        # Calculate word count
        word_count = len(text.split())
        
        return {
            'title': title_text,
            'body_text': text,
            'word_count': word_count
        }
    except Exception as e:
        print(f"Warning: Parsing error - {e}")
        return None

print("parse_html() function defined.")

parse_html() function defined.
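
The extraction logic above (drop non-content tags, collapse whitespace, count words) can be illustrated without any third-party parser. The following is a minimal sketch using Python's standard-library `html.parser` in place of BeautifulSoup; it is not the notebook's implementation, and the names `TextExtractor`, `extract_text`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Same non-content tags that parse_html() removes
SKIPPED_TAGS = {"script", "style", "nav", "footer", "header", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collects the title and visible text, ignoring skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a skipped tag
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        elif self.skip_depth == 0:
            self.text_parts.append(data)


def extract_text(html):
    """Return title, whitespace-normalized text, and word count."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split()) or "No title"
    body_text = " ".join(" ".join(parser.text_parts).split())
    return {"title": title, "body_text": body_text, "word_count": len(body_text.split())}


sample = """<html><head><title> Demo Page </title><style>p {color: red}</style></head>
<body><nav>Home | About</nav><p>Hello   world.</p><script>var x = 1;</script>
<footer>Copyright</footer><p>Second paragraph here.</p></body></html>"""

result = extract_text(sample)
print(result)
# {'title': 'Demo Page', 'body_text': 'Hello world. Second paragraph here.', 'word_count': 5}
```

Navigation, script, style, and footer text never reach the word count, which is why `word_count` reflects only the main content.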


In [59]:
# Parse HTML content for all pages
print("=" * 60)
print("HTML PARSING")
print("=" * 60)

extracted_data = []
failed_count = 0

for idx, row in df.iterrows():
    parsed = parse_html(row['html_content'])
    if parsed:
        extracted_data.append({
            'url': row['url'],
            'title': parsed['title'],
            'body_text': parsed['body_text'],
            'word_count': parsed['word_count']
        })
    else:
        failed_count += 1

# Create DataFrame from extracted data
extracted_df = pd.DataFrame(extracted_data)

# Save extracted content (without HTML for smaller file size)
extracted_df.to_csv('../data/extracted_content.csv', index=False)

print(f"\nParsing Results:")
print(f"  Successfully parsed: {len(extracted_data)} pages")
print(f"  Failed to parse: {failed_count} pages")
print(f"  Saved to: data/extracted_content.csv")

# Display sample of extracted content
print(f"\nSample extracted content:")
print(extracted_df[['url', 'title', 'word_count']].head())

HTML PARSING

Parsing Results:
  Successfully parsed: 57 pages
  Failed to parse: 8 pages
  Saved to: data/extracted_content.csv

Sample extracted content:
                                                 url  \
0               https://comfax.com/reviews/free-fax/   
1     https://www.cm-alliance.com/cybersecurity-blog   
2  https://en.wikipedia.org/wiki/Remote_desktop_s...   
3  https://nytlicensing.com/latest/trends/content...   
4      https://sign.dropbox.com/products/dropbox-fax   

                                               title  word_count  
0  Free Fax Review - Top 7 Choices Reviewed and T...        5094  
1                                Cyber Security Blog        2017  
2                Remote desktop software - Wikipedia        2165  
3  16 Best Practices for Content Marketing in 202...        3284  
4  Send and Receive Fax Online by Phone and Compu...         490  


In [60]:
# Feature extraction functions
def extract_sentence_count(text):
    """Extract sentence count from text using NLTK tokenizer."""
    try:
        sentences = nltk.sent_tokenize(text)
        return len(sentences)
    except Exception:
        # Fallback if the NLTK tokenizer is unavailable: naive split on periods
        return len([s for s in text.split('.') if s.strip()])

def extract_readability(text):
    """Calculate Flesch Reading Ease score for text."""
    try:
        return textstat.flesch_reading_ease(text)
    except Exception:
        # Neutral fallback score if textstat cannot process the text
        return 50.0

def extract_top_keywords(texts, n_keywords=5):
    """Extract top N keywords using TF-IDF vectorization."""
    try:
        vectorizer = TfidfVectorizer(
            max_features=100, 
            stop_words='english', 
            max_df=0.85, 
            min_df=2
        )
        tfidf_matrix = vectorizer.fit_transform(texts)
        feature_names = vectorizer.get_feature_names_out()
        
        keywords_list = []
        for doc_idx in range(tfidf_matrix.shape[0]):
            tfidf_scores = tfidf_matrix[doc_idx].toarray()[0]
            top_indices = tfidf_scores.argsort()[-n_keywords:][::-1]
            top_keywords = [feature_names[i] for i in top_indices if tfidf_scores[i] > 0]
            keywords_list.append('|'.join(top_keywords) if top_keywords else 'no_keywords')
        return keywords_list
    except Exception:
        # e.g. too few documents to satisfy min_df=2; return a placeholder per document
        return ['no_keywords'] * len(texts)

print("Extracting features...")

# Extract basic features
extracted_df['sentence_count'] = extracted_df['body_text'].apply(extract_sentence_count)
extracted_df['flesch_reading_ease'] = extracted_df['body_text'].apply(extract_readability)
extracted_df['top_keywords'] = extract_top_keywords(extracted_df['body_text'].tolist())

# Generate semantic embeddings
print("Generating embeddings (this may take a moment)...")
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(extracted_df['body_text'].tolist(), show_progress_bar=True, batch_size=16)
extracted_df['embedding'] = list(embeddings)

print(f"Feature extraction complete. Shape: {embeddings.shape}")

Extracting features...
Generating embeddings (this may take a moment)...
Feature extraction complete. Shape: (57, 384)
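
`extract_readability()` delegates to `textstat.flesch_reading_ease`. As a rough illustration of what that score measures, the sketch below applies the standard Flesch Reading Ease formula to raw counts. The helper name and the example counts are hypothetical, and textstat's own syllable counting may differ in detail.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease formula computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts chosen to show the range of the score
typical = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
dense = flesch_reading_ease(total_words=40, total_sentences=1, total_syllables=80)
simple = flesch_reading_ease(total_words=10, total_sentences=2, total_syllables=10)

print(round(typical, 3))  # 59.635
print(round(dense, 3))    # -2.965
print(round(simple, 3))   # 117.16
```

The formula is not clamped to 0-100: very long sentences with long words push it below zero, and very short sentences with one-syllable words push it above 100. That is consistent with the feature statistics in this notebook, where `flesch_reading_ease` ranges from about -3.26 to 102.68.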


In [61]:
# Create features dataframe and save to CSV
features_df = extracted_df[['url', 'word_count', 'sentence_count', 'flesch_reading_ease', 'top_keywords', 'embedding']].copy()

# Serialize each embedding as a string so it fits in a single CSV column
features_df['embedding_vector'] = features_df['embedding'].apply(lambda x: str(x.tolist()))
features_to_save = features_df.drop('embedding', axis=1)

# Save features (raw array column dropped; embeddings are kept as the string column 'embedding_vector')
features_to_save.to_csv('../data/features.csv', index=False)

print(f"Features saved to: data/features.csv")
print(f"\nFeature statistics:")
print(features_df[['word_count', 'sentence_count', 'flesch_reading_ease']].describe())

Features saved to: data/features.csv

Feature statistics:
         word_count  sentence_count  flesch_reading_ease
count     57.000000       57.000000            57.000000
mean    3084.859649      156.403509            38.769881
std     3611.009096      192.121600            21.489318
min       58.000000        5.000000            -3.257091
25%      863.000000       31.000000            24.939930
50%     2165.000000       90.000000            37.880668
75%     4091.000000      181.000000            52.003378
max    22520.000000      970.000000           102.680379
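
Because `embedding_vector` is stored as the string form of a Python list, it has to be parsed again after the CSV is read back. A minimal sketch of that round trip, using a short made-up vector in place of a real 384-dimensional embedding:

```python
import json

# Stand-in for x.tolist() on a real embedding (hypothetical values)
embedding = [0.25, -0.5, 1.0]

# What the notebook writes into the CSV cell
stored = str(embedding)

# The string form of a list of finite floats is also valid JSON
restored = json.loads(stored)

print(stored)                 # [0.25, -0.5, 1.0]
print(restored == embedding)  # True
```

This assumes every value is a finite float; `nan` or `inf` would be written as strings that `json.loads` rejects. For larger datasets a binary format such as NumPy's `.npy` is likely a better fit than strings in a CSV.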


In [62]:
# Compute cosine similarity matrix
print("Computing similarity matrix...")
similarity_matrix = cosine_similarity(np.array(features_df['embedding'].tolist()))

# Find duplicate pairs above threshold
SIMILARITY_THRESHOLD = 0.80
duplicates = []
urls = features_df['url'].tolist()

for i in range(len(urls)):
    for j in range(i+1, len(urls)):
        if similarity_matrix[i][j] > SIMILARITY_THRESHOLD:
            duplicates.append({
                'url1': urls[i],
                'url2': urls[j],
                'similarity': round(similarity_matrix[i][j], 3)
            })

# Detect thin content (pages with less than 500 words)
features_df['is_thin'] = features_df['word_count'] < 500
thin_count = features_df['is_thin'].sum()

# Save duplicate pairs to CSV
if duplicates:
    pd.DataFrame(duplicates).to_csv('../data/duplicates.csv', index=False)
else:
    pd.DataFrame(columns=['url1', 'url2', 'similarity']).to_csv('../data/duplicates.csv', index=False)

print(f"\nDuplicate Detection Summary:")
print(f"  Duplicate pairs found: {len(duplicates)}")
print(f"  Thin content pages: {thin_count} ({thin_count/len(features_df)*100:.1f}%)")
print(f"  Results saved to: data/duplicates.csv")

Computing similarity matrix...

Duplicate Detection Summary:
  Duplicate pairs found: 2
  Thin content pages: 10 (17.5%)
  Results saved to: data/duplicates.csv
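
To make the thresholding step concrete, here is a small pure-Python sketch of the same idea: compute cosine similarity for every unordered pair and keep the pairs above the threshold. The three-dimensional vectors and page names are made up for illustration; the notebook itself uses scikit-learn's `cosine_similarity` on 384-dimensional embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"
pages = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.9, 0.1, 0.0],
    "page_c": [0.0, 1.0, 0.0],
}

SIMILARITY_THRESHOLD = 0.80

# combinations() visits each unordered pair once, like the i < j loops above
duplicate_pairs = []
for first, second in combinations(pages, 2):
    score = cosine(pages[first], pages[second])
    if score > SIMILARITY_THRESHOLD:
        duplicate_pairs.append((first, second, round(score, 3)))

print(duplicate_pairs)  # [('page_a', 'page_b', 0.994)]
```

Only the pair pointing in nearly the same direction is flagged; orthogonal vectors score 0 and are ignored regardless of their length.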


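## 12. Baseline Comparison

Compare Random Forest performance against a simple rule-based baseline. This cell uses `X_test`, `y_test`, and `rf_accuracy`, which are defined in the Random Forest training cell (In [66]), so that cell must run first.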

In [63]:
def baseline_predict(word_count):
    """
    Simple rule-based baseline classifier using only word count.
    
    Args:
        word_count (int): Number of words in content
        
    Returns:
        str: Predicted quality label
    """
    if word_count > 1000:
        return 'High'
    elif word_count < 500:
        return 'Low'
    else:
        return 'Medium'

# Evaluate baseline model
baseline_preds = X_test['word_count'].apply(baseline_predict)
baseline_accuracy = accuracy_score(y_test, baseline_preds)

# Compare baseline vs Random Forest
print("Model Comparison:")
print(f"  Baseline Accuracy: {baseline_accuracy:.3f}")
print(f"  Random Forest Accuracy: {rf_accuracy:.3f}")
print(f"  Improvement: {((rf_accuracy - baseline_accuracy) / baseline_accuracy * 100):.1f}%")

Model Comparison:
  Baseline Accuracy: 0.444
  Random Forest Accuracy: 0.833
  Improvement: 87.5%
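
The "Improvement" figure is a relative gain over the baseline, not a difference in percentage points. The sketch below reproduces the arithmetic under the assumption that the rounded accuracies correspond to 8 and 15 correct predictions out of the 18 test pages; the helper name is hypothetical.

```python
def relative_improvement(baseline, model):
    """Relative gain of model over baseline, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (model - baseline) / baseline


# Assumed counts: 8/18 = 0.444 (baseline) and 15/18 = 0.833 (Random Forest)
baseline_acc = 8 / 18
rf_acc = 15 / 18

relative = relative_improvement(baseline_acc, rf_acc) * 100
absolute_points = (rf_acc - baseline_acc) * 100

print(f"Relative improvement: {relative:.1f}%")                 # 87.5%
print(f"Absolute difference: {absolute_points:.1f} points")     # 38.9 points
```

Reporting both numbers avoids ambiguity: the model gets about 39 percentage points more of the test set right, which is 87.5% more than the baseline's share.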


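## 11. Train Quality Classification Model

Train a Random Forest classifier to predict content quality. This cell requires the `quality_label` column created by `create_quality_labels()` (In [65]).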

In [66]:
# Prepare features and target for model training
X = features_df[['word_count', 'sentence_count', 'flesch_reading_ease']]
y = features_df['quality_label']

# Split data with stratification to maintain class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42, 
    stratify=y
)

# Train Random Forest classifier
rf_model = RandomForestClassifier(
    n_estimators=100, 
    random_state=42, 
    max_depth=10
)
rf_model.fit(X_train, y_train)

# Evaluate model performance
y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred)

print("Random Forest Model Performance:")
print(f"Accuracy: {rf_accuracy:.3f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Display feature importance
feat_imp = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feat_imp)

Random Forest Model Performance:
Accuracy: 0.833

Classification Report:
              precision    recall  f1-score   support

        High       1.00      0.50      0.67         2
         Low       0.80      1.00      0.89         8
      Medium       0.86      0.75      0.80         8

    accuracy                           0.83        18
   macro avg       0.89      0.75      0.79        18
weighted avg       0.85      0.83      0.82        18


Feature Importance:
               feature  importance
2  flesch_reading_ease    0.497291
0           word_count    0.283375
1       sentence_count    0.219334
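
The `macro avg` and `weighted avg` rows of the classification report can be recomputed from the per-class rows. This sketch uses the rounded precision and recall values printed above, so the results only match the report to two decimals.

```python
# Per-class values copied from the classification report above
report = {
    "High":   {"precision": 1.00, "recall": 0.50, "support": 2},
    "Low":    {"precision": 0.80, "recall": 1.00, "support": 8},
    "Medium": {"precision": 0.86, "recall": 0.75, "support": 8},
}


def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean weighted by support: larger classes count more."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


print(round(macro_avg("precision"), 2), round(weighted_avg("precision"), 2))  # 0.89 0.85
print(round(macro_avg("recall"), 2), round(weighted_avg("recall"), 2))        # 0.75 0.83
```

With only two `High` pages in the test set, the macro average is strongly affected by that class (recall 0.50), while the weighted average is dominated by `Low` and `Medium`.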


## 10. Create Quality Labels

Generate rule-based quality labels from word count and Flesch Reading Ease to serve as training targets for the classifier.

In [65]:
def create_quality_labels(row):
    """
    Create quality labels based on word count and readability metrics.
    
    Labeling criteria:
    - High: word_count > 1500 AND readability between 50-70
    - Low: word_count < 500 OR readability < 30
    - Medium: all other cases
    
    Args:
        row: DataFrame row with word_count and flesch_reading_ease columns
        
    Returns:
        str: Quality label ('High', 'Medium', or 'Low')
    """
    if row['word_count'] > 1500 and 50 <= row['flesch_reading_ease'] <= 70:
        return 'High'
    elif row['word_count'] < 500 or row['flesch_reading_ease'] < 30:
        return 'Low'
    else:
        return 'Medium'

# Apply labeling function to create quality labels
features_df['quality_label'] = features_df.apply(create_quality_labels, axis=1)

print("Quality labels created.")
print("\nLabel distribution:")
print(features_df['quality_label'].value_counts())

Quality labels created.

Label distribution:
quality_label
Medium    25
Low       25
High       7
Name: count, dtype: int64
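
A few worked cases show how the labeling rule behaves at its boundaries and where it disagrees with the word-count-only rule in `baseline_predict()`. Both functions below restate the notebook's rules on plain numbers; apart from the first row (the 5094-word page scored 60.13 in the demo output), the example pages are hypothetical.

```python
def quality_label(word_count, readability):
    """Same criteria as create_quality_labels(), on plain numbers."""
    if word_count > 1500 and 50 <= readability <= 70:
        return "High"
    if word_count < 500 or readability < 30:
        return "Low"
    return "Medium"


def baseline_label(word_count):
    """Same criteria as baseline_predict(): word count only."""
    if word_count > 1000:
        return "High"
    if word_count < 500:
        return "Low"
    return "Medium"


examples = [
    (5094, 60.13),  # long and readable
    (1200, 55.0),   # readable but under 1500 words
    (3000, 25.0),   # long but hard to read
    (700, 45.0),    # mid-length, moderate readability
    (400, 65.0),    # thin content
]

for word_count, readability in examples:
    label = quality_label(word_count, readability)
    baseline = baseline_label(word_count)
    print(f"{word_count:>5} words, FRE {readability:>5}: label={label:<6} baseline={baseline}")
```

The second and third rows are where the two rules diverge: the baseline calls any page over 1000 words `High`, while the labels also require readability between 50 and 70 and mark hard-to-read pages as `Low`.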


## 13. Save Trained Model

Persist the trained model to disk for future use.

In [67]:
# Save trained model
os.makedirs('../models', exist_ok=True)
with open('../models/quality_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)
    
print("Model saved to: models/quality_model.pkl")

Model saved to: models/quality_model.pkl
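
The save step can be checked with a quick round trip: write the object, read it back, and compare. The sketch below does this with a plain dictionary standing in for the trained model and a temporary directory instead of `../models`; both are assumptions made so the example runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in for rf_model (hypothetical contents)
toy_model = {"name": "toy_quality_model", "thresholds": [500, 1500]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "quality_model.pkl")

    # Serialize to disk
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Deserialize and confirm nothing was lost
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. A pickled scikit-learn model is also best loaded with the same library version that created it.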


## 14. Real-Time URL Analysis Function

Function to analyze any URL for content quality and find similar pages.

In [68]:
def analyze_url(url, existing_embeddings=None, existing_urls=None, model=None, similarity_threshold=0.75):
    """
    Analyze a URL for content quality and find similar pages.
    
    This function scrapes a given URL, extracts features, predicts content quality,
    and identifies similar pages based on semantic similarity.
    
    Args:
        url (str): URL to analyze
        existing_embeddings (np.array, optional): Embeddings for similarity comparison
        existing_urls (list, optional): List of existing URLs
        model (object, optional): Trained classification model
        similarity_threshold (float): Threshold for similarity detection (default: 0.75)
        
    Returns:
        dict: Analysis results including quality label, metrics, and similar pages
              Returns error dict if analysis fails
    """
    try:
        # Scrape URL with appropriate headers
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Parse HTML content
        parsed = parse_html(response.text)
        if not parsed:
            return {"error": "Failed to parse HTML content"}
        
        # Extract features
        sentence_count = extract_sentence_count(parsed['body_text'])
        readability = extract_readability(parsed['body_text'])
        word_count = parsed['word_count']
        
        # Generate embedding for similarity comparison
        embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        embedding = embedding_model.encode([parsed['body_text']])[0]
        
        # Load model if not provided
        if model is None:
            with open('../models/quality_model.pkl', 'rb') as f:
                model = pickle.load(f)
        
        # Predict quality
        features = pd.DataFrame({
            'word_count': [word_count],
            'sentence_count': [sentence_count],
            'flesch_reading_ease': [readability]
        })
        
        quality_label = model.predict(features)[0]
        quality_proba = model.predict_proba(features)[0]
        
        # Find similar content if embeddings provided
        similar_pages = []
        if existing_embeddings is not None and existing_urls is not None:
            similarities = cosine_similarity([embedding], existing_embeddings)[0]
            for idx, sim in enumerate(similarities):
                if sim > similarity_threshold and existing_urls[idx] != url:
                    similar_pages.append({
                        'url': existing_urls[idx],
                        'similarity': round(float(sim), 3)
                    })
            similar_pages = sorted(similar_pages, key=lambda x: x['similarity'], reverse=True)[:5]
        
        # Return analysis results
        return {
            'url': url,
            'title': parsed['title'],
            'word_count': word_count,
            'sentence_count': sentence_count,
            'readability': round(readability, 2),
            'quality_label': quality_label,
            'quality_confidence': {
                c: round(float(p), 3) 
                for c, p in zip(model.classes_, quality_proba)
            },
            'is_thin': word_count < 500,
            'similar_to': similar_pages
        }
        
    except Exception as e:
        return {"error": f"Analysis failed: {str(e)}"}

print("analyze_url() function defined.")

analyze_url() function defined.
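
The similar-page step inside `analyze_url()` filters by threshold, skips the page's own URL, and keeps the five best matches. A minimal sketch of just that step, with precomputed similarity scores and made-up `example.com` URLs; the helper name `rank_similar` is hypothetical.

```python
def rank_similar(similarities, urls, query_url, threshold=0.75, top_k=5):
    """Return up to top_k pages above the threshold, best match first."""
    matches = [
        {"url": url, "similarity": round(float(score), 3)}
        for url, score in zip(urls, similarities)
        if score > threshold and url != query_url  # skip the page itself
    ]
    return sorted(matches, key=lambda m: m["similarity"], reverse=True)[:top_k]


# Hypothetical corpus and similarity scores against the first URL
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
    "https://example.com/d",
]
scores = [1.0, 0.91, 0.42, 0.78]

similar = rank_similar(scores, urls, query_url="https://example.com/a")
print(similar)
# [{'url': 'https://example.com/b', 'similarity': 0.91},
#  {'url': 'https://example.com/d', 'similarity': 0.78}]
```

Excluding the query URL matters when the analyzed page is already in the corpus: its self-similarity is about 1.0 and would otherwise always rank first.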


## 15. Demo - Test Real-Time Analysis

Summarize the pipeline run (In [70]), then test the `analyze_url()` function with a sample URL from our dataset (In [69]).

In [70]:
print("=" * 60)
print("PIPELINE EXECUTION COMPLETE")
print("=" * 60)

print("\nDeliverables created:")
print("  1. data/extracted_content.csv - Parsed HTML content")
print("  2. data/features.csv - Extracted features")
print("  3. data/duplicates.csv - Duplicate pairs")
print("  4. models/quality_model.pkl - Trained model")
print("  5. analyze_url() function - Real-time analysis")

print(f"\nFinal Statistics:")
print(f"  Pages processed: {len(features_df)}")
print(f"  Duplicate pairs: {len(duplicates)}")
print(f"  Thin content: {thin_count} ({thin_count/len(features_df)*100:.1f}%)")
print(f"  Model accuracy: {rf_accuracy:.3f}")
print(f"  Baseline accuracy: {baseline_accuracy:.3f}")
print(f"  Improvement: {((rf_accuracy - baseline_accuracy) / baseline_accuracy * 100):.1f}%")

print("\nNext steps:")
print("  1. Review generated CSV files in data/ folder")
print("  2. Test analyze_url() with different URLs")
print("  3. Push to GitHub repository")

PIPELINE EXECUTION COMPLETE

Deliverables created:
  1. data/extracted_content.csv - Parsed HTML content
  2. data/features.csv - Extracted features
  3. data/duplicates.csv - Duplicate pairs
  4. models/quality_model.pkl - Trained model
  5. analyze_url() function - Real-time analysis

Final Statistics:
  Pages processed: 57
  Duplicate pairs: 2
  Thin content: 10 (17.5%)
  Model accuracy: 0.833
  Baseline accuracy: 0.444
  Improvement: 87.5%

Next steps:
  1. Review generated CSV files in data/ folder
  2. Test analyze_url() with different URLs
  3. Push to GitHub repository


In [69]:
# Prepare data for real-time analysis demo
existing_embeddings = np.array(features_df['embedding'].tolist())
existing_urls = features_df['url'].tolist()

# Test analyze_url() function with sample URL
test_url = features_df['url'].iloc[0]

print(f"Analyzing sample URL: {test_url}\n")

result = analyze_url(
    test_url, 
    existing_embeddings=existing_embeddings,
    existing_urls=existing_urls,
    model=rf_model,
    similarity_threshold=0.75
)

print("Analysis Results:")
print("=" * 60)
print(json.dumps(result, indent=2))

Analyzing sample URL: https://comfax.com/reviews/free-fax/

Analysis Results:
{
  "url": "https://comfax.com/reviews/free-fax/",
  "title": "Free Fax Review - Top 7 Choices Reviewed and Tested, 2025 Updated",
  "word_count": 5094,
  "sentence_count": 236,
  "readability": 60.13,
  "quality_label": "High",
  "quality_confidence": {
    "High": 0.78,
    "Low": 0.02,
    "Medium": 0.2
  },
  "is_thin": false,
  "similar_to": []
}
