<a href="https://colab.research.google.com/github/Abhiss123/AlmaBetter-Projects/blob/main/SEO_Visibility_Maximizer_Dynamic_Rendering_for_JavaScript_SEO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name: SEO Visibility Maximizer: Dynamic Rendering for JavaScript SEO**

---
# **Purpose of the Project:**

This project is designed to solve a significant problem that many modern websites face: **search engines often struggle to understand and rank websites built with JavaScript frameworks** (like React, Angular, or Vue.js). The purpose of this project is to make websites built with JavaScript more accessible and understandable to search engines, ensuring that they rank higher and attract more visitors.

---

### What Problem Does This Project Address?

1. **Search Engines Can't Always Understand JavaScript**:
   - Websites built with JavaScript are dynamic, meaning they load content on-the-fly rather than showing a preloaded page.
   - Search engine crawlers such as Googlebot can’t always process this type of content effectively, which can lead to poor rankings and reduced visibility.

2. **SEO Visibility Issues**:
   - If search engines can’t fully understand a webpage, they may fail to index it properly. This means your website might not show up in search results at all, or it might rank much lower than it should.

3. **Missed Opportunities for Website Owners**:
   - Websites with poor SEO visibility lose out on organic traffic, which is crucial for business growth, online presence, and lead generation.

---

### What Does This Project Do?

This project implements **Dynamic Rendering**, a modern technique to ensure search engines see the website the way real users do. It does this by:

1. **Detecting Who Is Visiting**:
   - The system identifies whether the visitor is a **search engine bot** or a **real user**.

2. **Serving Optimized Content**:
   - For search engines: The project generates a fully-loaded, SEO-friendly version of the website that is easy for bots to understand and index.
   - For real users: Visitors get the dynamic, interactive JavaScript experience as intended.

3. **Ensuring SEO Best Practices**:
   - Automatically optimizes critical SEO elements like:
     - **Title tags**: The clickable headline in search results.
     - **Meta descriptions**: The short summary under the headline in search results.
     - **Canonical URLs**: Prevents duplicate content issues by showing the preferred page version.
     - **Structured Data**: Provides search engines with extra context about the page content.

---

### Why Is This Important?

- **Improved Rankings**: Pages that search engines can fully read are more likely to be indexed correctly and to rank well.
- **More Organic Traffic**: With better rankings, websites attract more visitors from search engines.
- **Business Growth**: Increased visibility means more leads, more customers, and more success.
- **Cost Efficiency**: Fixing SEO visibility issues with dynamic rendering is more cost-effective than rebuilding a website.

---

### Key Features of This Project:

1. **Processing URLs**:
   - The project identifies each page on the website and ensures it is rendered correctly.

2. **Rendering for Search Engines**:
   - It generates a static HTML version of JavaScript-heavy pages, ensuring they are SEO-friendly.

3. **Validation Reports**:
   - The system checks each page to ensure it meets SEO standards and reports any errors or missing data.

4. **Deployment**:
   - Static HTML files are generated and deployed for search engines to crawl and index.

5. **Error Handling**:
   - Identifies common issues like missing structured data or incorrect canonical URLs and recommends fixes.
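
To make the error-handling feature more concrete, here is a minimal sketch of a canonical URL check. The function names `normalize` and `check_canonical` are hypothetical (they are not taken from the project code), and the sketch assumes that two URLs count as the same page when their scheme, host, and path match once letter case and a trailing slash are ignored.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> tuple:
    """Reduce a URL to (scheme, host, path), ignoring case and a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path)


def check_canonical(page_url: str, canonical_href) -> str:
    """Return a short status string describing the page's canonical link."""
    if not canonical_href:
        return "missing canonical"
    if normalize(page_url) != normalize(canonical_href):
        return "canonical points elsewhere"
    return "ok"


# Example: a trailing slash alone does not count as a mismatch.
print(check_canonical("https://example.com/page/", "https://example.com/page"))
```

A canonical that points to another URL is not always a mistake (it can be a deliberate choice for duplicate pages), so a result like this is best treated as a prompt for review rather than an error.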

---

### How Can It Benefit Website Owners?

1. **Make Your Website Visible**:
   - Ensures all critical pages are indexed and ranked on search engines.

2. **Boost Click-Through Rates (CTR)**:
   - Optimized titles and meta descriptions attract more clicks in search results.

3. **Fix Common SEO Errors**:
   - Automatically detects issues like missing metadata or invalid page structures and recommends fixes.

4. **Save Time and Resources**:
   - Dynamic rendering reduces the need for manual fixes and lengthy SEO audits.

5. **Reach a Broader Audience**:
   - Higher visibility brings more traffic, expanding your reach to potential customers.

---

### What Steps Should a Website Owner Take After Using This Project?

1. **Review the SEO Reports**:
   - Check which pages are optimized and if there are any errors to fix.

2. **Fix Missing Data**:
   - Ensure titles, descriptions, and structured data are added to all critical pages.

3. **Monitor Performance**:
   - Use tools like Google Search Console to track rankings and traffic after deployment.

4. **Update Regularly**:
   - Keep adding fresh, optimized content to maintain and improve rankings.

5. **Optimize Page Speed**:
   - Make sure the website loads quickly for both users and search engines.

---

### How Does This Help Different Users?

- **For Business Owners**:
  - They can focus on growing their business while the project ensures their website is visible and optimized.

- **For Developers**:
  - Simplifies the process of making JavaScript-heavy websites SEO-friendly.

- **For Marketers**:
  - Provides actionable insights into what is missing or needs improvement for better SEO performance.

---

## **Final Purpose:**

This project ensures that websites built with modern JavaScript technologies are easy for search engines to understand and rank. By fixing visibility issues and optimizing the website for search engines, it helps businesses attract more visitors, improve their online presence, and grow successfully. It’s a bridge between advanced web technologies and effective SEO.



# **Understanding Dynamic Rendering for JavaScript SEO**

---

Dynamic Rendering is a solution used to ensure that websites built with JavaScript frameworks are optimized for search engines and fully understood by their bots. Let's break this down step-by-step in plain English, so even someone with no technical background can understand.

---

### **What is Dynamic Rendering for JavaScript SEO?**

1. **Dynamic Rendering**:
   - A technique where a website serves two different versions of its content:
     - **For search engine bots**: A pre-rendered, static HTML version.
     - **For regular users**: A fully interactive JavaScript-powered version.

2. **Why is it Necessary?**:
   - Search engines, like Google, often struggle to process JavaScript-heavy websites. If the bot cannot understand the content, the page may not rank well.
   - Dynamic Rendering solves this issue by ensuring bots get a simplified, static version of the page that they can easily crawl and index.

3. **How Does it Work?**:
   - A website detects if a visitor is a **search engine bot** or a **human user**.
   - If it’s a bot, the website sends a pre-rendered static HTML version.
   - If it’s a user, the website sends the fully functional JavaScript version.
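
The detection step described above can be sketched in a few lines of Python. This is only an illustration: the crawler tokens in `BOT_PATTERN` are a hypothetical, non-exhaustive list, and the names `is_search_bot` and `choose_version` are assumptions made for this example.

```python
import re

# Hypothetical, non-exhaustive list of crawler tokens; extend it for your own traffic.
BOT_PATTERN = re.compile(r"googlebot|bingbot|duckduckbot|baiduspider|yandex", re.IGNORECASE)


def is_search_bot(user_agent: str) -> bool:
    """Return True when the User-Agent string contains a known crawler token."""
    return bool(user_agent) and BOT_PATTERN.search(user_agent) is not None


def choose_version(user_agent: str) -> str:
    """Pick which version of the page this visitor should receive."""
    return "prerendered-html" if is_search_bot(user_agent) else "javascript-app"


print(choose_version("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(choose_version("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0"))
```

Because any client can send any User-Agent string, a check like this is a convenience for routing, not a security control.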

---

### **Use Cases of Dynamic Rendering for JavaScript SEO**

1. **Improving SEO for JavaScript Websites**:
   - Websites built using frameworks like React, Angular, or Vue.js often rely on JavaScript to load content dynamically. Dynamic rendering ensures search engines can still access all the content.

2. **E-Commerce Websites**:
   - Online stores often have JavaScript-heavy pages with product filters, sorting options, etc. Dynamic rendering ensures product details are accessible to search engines.

3. **News Websites**:
   - News websites frequently update content and rely on JavaScript for interactive features. Dynamic rendering ensures the latest articles are indexed quickly.

4. **Single Page Applications (SPAs)**:
   - SPAs load content dynamically without full page reloads (and sometimes without changing the URL), which makes them harder for bots to crawl. Dynamic rendering addresses this by providing a static version of each view.

5. **Content-Driven Websites**:
   - Blogs, educational portals, and resource websites that heavily use JavaScript for animations or dynamic interactions benefit from improved indexing.

---

### **Real-Life Implementation of Dynamic Rendering**

1. **Google Search Engine**:
   - Googlebot sometimes fails to index JavaScript-heavy websites. Websites using Dynamic Rendering ensure Google sees an optimized version, improving their ranking.

2. **Retail Giants**:
   - Large retail sites with millions of product pages are typical candidates for dynamic rendering, because it helps those pages get crawled and indexed accurately.

3. **Travel Booking Websites**:
   - Travel portals with dynamic pricing and flight/hotel searches rely on JavaScript. Dynamic rendering ensures these pages rank well despite their complex nature.

---

### **What Kind of Data Does Dynamic Rendering Need?**

1. **Input Data**:
   - **URLs of Web Pages**:
     - Dynamic rendering needs to know the URLs of all pages that need to be optimized for search engines.
   - **Website Content**:
     - Text, images, meta information (like title and description), and structured data (e.g., JSON-LD or schema markup).
   - **JavaScript**:
     - The scripts responsible for loading dynamic content on the website.
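
As an illustration of the structured data mentioned above, the sketch below builds a small JSON-LD block in Python. The helper name `build_json_ld` and the choice of the schema.org `WebPage` type with three properties are assumptions made for this example; real pages usually need a type and properties that match their content.

```python
import json


def build_json_ld(name: str, url: str, description: str) -> str:
    """Return a <script> tag holding a minimal schema.org WebPage description."""
    data = {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "name": name,
        "url": url,
        "description": description,
    }
    # Escape "</" so that text inside the JSON cannot close the script tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(build_json_ld("Home", "https://example.com/", "An example landing page."))
```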



---

### **What Output Does Dynamic Rendering Provide?**

1. **Rendered HTML Output**:
   - A fully static, pre-rendered HTML version of your website optimized for search engines.

2. **SEO Analysis Report**:
   - A report highlighting:
     - Missing titles, meta descriptions, or structured data.
     - Errors in rendering specific pages.
     - Validation status (whether each page is ready for indexing).

3. **Deployment Files**:
   - Static HTML files for all pre-rendered pages, ready for deployment to your server or CDN (Content Delivery Network).

4. **Performance Metrics**:
   - Insights into how quickly the rendered pages load, ensuring they meet SEO speed standards.

---

### **How Does Dynamic Rendering Ensure Semantic Optimization?**

1. **Title and Meta Tags**:
   - Dynamic rendering ensures every page has optimized titles and meta descriptions, improving click-through rates.

2. **Canonical Tags**:
   - It prevents duplicate content issues by specifying the preferred version of a page.

3. **Structured Data**:
   - Adds JSON-LD or schema.org markup to provide additional context about the page to search engines, enabling features like rich snippets.

4. **Header Tags (H1, H2)**:
   - Ensures proper use of semantic HTML headers for better readability by search engines.
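
A simple way to check for the elements listed above is to scan the rendered HTML with Python's built-in `html.parser` module. The sketch below is a minimal audit, not the project's validator: the class `SeoAudit`, the function `audit`, and the exact set of checks are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SeoAudit(HTMLParser):
    """Record which basic SEO elements appear in an HTML document."""

    def __init__(self):
        super().__init__()
        self.found = {"title": False, "meta_description": False,
                      "canonical": False, "json_ld": False, "h1": False}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.found["title"] = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            # Only count the description when it has non-empty content.
            self.found["meta_description"] = bool(attrs.get("content"))
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.found["canonical"] = bool(attrs.get("href"))
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self.found["json_ld"] = True
        elif tag == "h1":
            self.found["h1"] = True


def audit(html: str) -> list:
    """Return a list of 'missing ...' messages for elements that were not found."""
    parser = SeoAudit()
    parser.feed(html)
    return [f"missing {name}" for name, ok in parser.found.items() if not ok]


page = ('<html><head><title>T</title><meta name="description" content="d">'
        '<link rel="canonical" href="https://example.com/"></head>'
        '<body><h1>H</h1></body></html>')
print(audit(page))  # ['missing json_ld']
```

The check only confirms that each element is present; it does not judge whether a title or description is well written.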

---

### **Steps to Implement Dynamic Rendering for a Website**

1. **Identify the Pages to Optimize**:
   - Gather a list of all the URLs on your website that need to be indexed by search engines.

2. **Pre-Render Pages**:
   - Use a pre-rendering service or software to generate static HTML versions of your JavaScript-heavy pages.

3. **Detect Bots vs. Users**:
   - Set up a system to differentiate between search engine bots and human visitors (based on the User-Agent header).

4. **Serve the Correct Version**:
   - Send pre-rendered pages to bots and dynamic JavaScript pages to regular users.

5. **Monitor Performance**:
   - Use tools like Google Search Console to track which pages are indexed and identify any errors.
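
The "serve the correct version" step can be sketched as a tiny WSGI application using only the standard calling convention. Everything here is illustrative: the `PRERENDERED` dictionary stands in for the static HTML files, `APP_SHELL` stands in for the JavaScript application, and the crawler token list is a hypothetical, non-exhaustive example.

```python
import re

# Hypothetical, non-exhaustive crawler tokens.
BOT_PATTERN = re.compile(r"googlebot|bingbot|duckduckbot", re.IGNORECASE)

# Stand-ins for the pre-rendered files and the normal JavaScript page.
PRERENDERED = {"/": "<html><body><h1>Static snapshot</h1></body></html>"}
APP_SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'


def app(environ, start_response):
    """WSGI app: bots get the snapshot when one exists, everyone else gets the app shell."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    path = environ.get("PATH_INFO", "/")
    snapshot = PRERENDERED.get(path)
    if snapshot is not None and BOT_PATTERN.search(user_agent):
        body = snapshot
    else:
        body = APP_SHELL
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # The response depends on the User-Agent, so caches must keep the variants apart.
        ("Vary", "User-Agent"),
    ]
    start_response("200 OK", headers)
    return [body.encode("utf-8")]
```

To try it locally, the app can be passed to `wsgiref.simple_server.make_server("", 8000, app)` from the standard library. Falling back to the app shell when no snapshot exists keeps bots from receiving an error for pages that have not been pre-rendered yet.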

---

### **Why is Dynamic Rendering Useful?**

1. **Boosts SEO Rankings**:
   - Search engines get a clean, optimized version of the website, improving indexing and rankings.

2. **Increases Website Visibility**:
   - Ensures all pages, even those heavily reliant on JavaScript, are visible in search results.

3. **Enhances User Experience**:
   - Human visitors still enjoy the interactive, JavaScript-powered experience.

4. **Saves Resources**:
   - No need to rebuild the website from scratch to make it SEO-friendly.

---

### **Conclusion**

Dynamic Rendering for JavaScript SEO bridges the gap between modern web technologies and search engine optimization. By serving search engines with pre-rendered, static HTML content while retaining the dynamic experience for users, it ensures websites rank better and attract more organic traffic. For website owners, this technique is a game-changer in achieving high visibility without compromising on advanced JavaScript functionality.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
# Importing pandas library, which is used to handle and manipulate structured data like CSV files
# Pandas is crucial for data analysis tasks in Python
import pandas as pd

# Step 1: Define the file path
# The file path is the location where the CSV file is stored.
# Here, we specify the path to the input CSV file.
input_csv_path = '/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/internal_all (1).csv'

# Step 2: Attempt to load the dataset and handle any errors gracefully
# This block ensures that we try to load the data and also handle potential issues like a missing file.
try:
    # Read the CSV file into a DataFrame
    # Pandas' read_csv() method reads a CSV (Comma-Separated Values) file and converts it into a tabular format.
    # This tabular data is stored as a DataFrame, which is similar to a spreadsheet or SQL table.
    data = pd.read_csv(input_csv_path)

    # Step 3: Preview the first 40 rows of the dataset
    # The head() method is used to view the first few rows of a DataFrame.
    # Here, we specify 40 rows to get a quick snapshot of the data for inspection or debugging.
    preview = data.head(40)

    # Displaying the preview in the console
    # Printing the data helps us ensure that the file was loaded correctly and that the data looks as expected.
    print("Preview of the first 40 rows:")
    print(preview)

# Step 4: Handle the case where the file is not found
# This exception will be raised if the specified file path does not exist.
except FileNotFoundError:
    # Provide a clear error message so the user knows what went wrong.
    print(f"Error: The file at '{input_csv_path}' was not found.")

# Step 5: Handle any other unexpected errors
# If any other error occurs, it will be caught here, and the error message will be displayed.
except Exception as e:
    # Print a message explaining that something went wrong and show the specific error message for debugging.
    print(f"An error occurred: {e}")


Preview of the first 40 rows:
                                              Address  \
0                                https://thatware.co/   
1                       https://thatware.co/basecamp/   
2   https://thatware.co/wp-content/uploads/2024/07...   
3         https://thatware.co/ai-based-seo-blueprint/   
4       https://thatware.co/google-page-title-update/   
5       https://thatware.co/app-development-services/   
6                   https://thatware.co/advanced-seo/   
7              https://thatware.co/seo-company-noida/   
8                      https://thatware.co/contact-us   
9     https://thatware.co/google-indexifembedded-tag/   
10        https://thatware.co/outsource-seo-services/   
11  https://thatware.co/branding-press-release-ser...   
12        https://thatware.co/link-building-services/   
13    https://thatware.co/digital-marketing-services/   
14  https://thatware.co/wp-content/cache/min/1/wp-...   
15  https://thatware.co/ai-powered-semantic-search...   
1

### **Understanding the Importance and Purpose of the Code Below**

This code is designed to **generate a sitemap.xml file** from a CSV dataset, which is particularly useful in the context of **Dynamic Rendering for JavaScript SEO**. Let me explain this step-by-step in a way that is simple to understand, even for someone without a technical background.

---

### **What is a Sitemap and Why is it Important?**

1. **Definition of a Sitemap**:
   - A sitemap is an XML file that lists all the important pages of a website.
   - It helps search engines like Google and Bing find and index your website’s pages efficiently.

2. **Purpose in SEO**:
   - Ensures all pages (including dynamically rendered ones) are discoverable by search engines.
   - Boosts the visibility of web pages by guiding search engine bots to them.

3. **Why is it Necessary for Dynamic Rendering**:
   - Websites using JavaScript frameworks often have pages that are not easily discoverable by search engines.
   - The sitemap bridges this gap by listing these pages explicitly, ensuring they are indexed properly.

---

### **Purpose of the Code**

The purpose of the code is to:
1. **Transform Screaming Frog SEO Spider Data**:
   - Use the CSV dataset generated by Screaming Frog (a popular SEO tool) to extract URLs and metadata.
2. **Generate an XML Sitemap**:
   - Create a sitemap.xml file that search engines can use to efficiently crawl and index the website.
3. **Improve SEO for Dynamic Websites**:
   - Ensure that even JavaScript-heavy pages are indexed correctly by search engines.

---

### **Key Components of the Code**

#### **Part 1: Generating the Sitemap**
**File Name Suggestion: `generate_sitemap_from_csv.py`**

1. **Loading the Dataset**:
   - Reads the Screaming Frog export (`internal_all (1).csv` in this notebook) using the `pandas` library.
   - The CSV contains columns like `URL Encoded Address` (the page URL) and `Crawl Timestamp` (when the page was crawled, used here as a stand-in for the last modification date).
   - This data is essential for creating an accurate sitemap.

2. **Building the Sitemap XML Structure**:
   - The `<urlset>` element is the container for all URLs in the sitemap.
   - For each row in the CSV:
     - A `<url>` element is added, representing a single webpage.
     - Inside each `<url>`:
       - `<loc>` specifies the URL.
       - `<lastmod>` indicates when the page was last modified.
       - `<changefreq>` suggests how often the page is updated.
       - `<priority>` indicates the importance of the page (default is 0.5).

3. **Saving the Sitemap**:
   - Converts the constructed XML structure into a string using `tostring`.
   - Formats the XML for readability using `minidom`.
   - Saves the sitemap as a file named `sitemap.xml`.

---

#### **Part 2: Previewing the Sitemap**
**File Name Suggestion: `preview_sitemap.py`**

1. **Parsing the Generated Sitemap**:
   - The generated `sitemap.xml` file is read and parsed using the `ElementTree` library.

2. **Previewing the Sitemap Entries**:
   - Extracts the first few entries (default: 10) for preview.
   - Displays the fields:
     - URL
     - Last Modified Date
     - Change Frequency
     - Priority
   - This helps verify that the sitemap is correct before deploying it.

---

### **Why is the Screaming Frog Dataset Important?**

1. **Source of URLs**:
   - Screaming Frog scans your website and identifies all the URLs (pages) that should be indexed by search engines.
   - These URLs form the backbone of the sitemap.

2. **Metadata Extraction**:
   - Columns like `Crawl Timestamp` provide additional details that enhance the sitemap (e.g., last modification date).

3. **Ensures Coverage**:
   - Using this dataset ensures no important page is missed in the sitemap, especially for JavaScript-heavy or dynamically rendered pages.

---

### **How Does This Help Dynamic Rendering for JavaScript SEO?**

1. **Improves Discoverability**:
   - Ensures search engines are aware of all pages, even those that rely on JavaScript for content loading.

2. **Optimizes Crawling**:
   - Provides metadata (like last modification dates) to help search engines prioritize which pages to crawl.

3. **Fixes JavaScript Visibility Issues**:
   - Many JavaScript-heavy websites suffer from poor indexing. Listing these pages in the sitemap makes them discoverable, although whether they appear in search results is still decided by the search engine.

4. **Supports Dynamic Rendering**:
   - Dynamic rendering gives bots a static version of JavaScript pages; the sitemap complements it by telling search engines which pages exist, so none are overlooked during crawling.

---

### **Output Generated by the Code**

1. **Sitemap File (`sitemap.xml`)**:
   - Contains all URLs from the CSV.
   - Each URL entry includes:
     - Location (`<loc>`): The page URL.
     - Last Modified (`<lastmod>`): The date the page was last updated.
     - Change Frequency (`<changefreq>`): A recommendation for how often the page changes (default: weekly).
     - Priority (`<priority>`): The relative importance of the page (default: 0.5).

2. **Preview of the Sitemap**:
   - Displays the first few rows of the sitemap for verification.
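
One detail worth checking in the output is the `<lastmod>` value. The sitemap protocol expects dates in W3C Datetime format (for example `2024-12-30`), while the crawl export in this notebook uses values like `30-12-2024 17:21`. The sketch below shows one possible conversion; the helper name `to_w3c_date` is hypothetical, and the input format is assumed from the preview output rather than from Screaming Frog documentation.

```python
from datetime import datetime


def to_w3c_date(crawl_timestamp: str) -> str:
    """Convert 'DD-MM-YYYY HH:MM' into the 'YYYY-MM-DD' form used in <lastmod>."""
    parsed = datetime.strptime(crawl_timestamp.strip(), "%d-%m-%Y %H:%M")
    return parsed.strftime("%Y-%m-%d")


print(to_w3c_date("30-12-2024 17:21"))  # 2024-12-30
```

If the export uses a different timestamp layout, `strptime` raises a `ValueError`, which makes a format mismatch visible instead of silently writing an invalid date.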

---

### **Steps for Using This Code**

1. **Input Preparation**:
   - Use Screaming Frog SEO Spider to generate a CSV file of your website’s URLs and metadata.
   - Ensure the file has columns like `URL Encoded Address` and `Crawl Timestamp`.

2. **Run the Code**:
   - Execute the first script (`generate_sitemap_from_csv.py`) to create the sitemap.
   - Run the second script (`preview_sitemap.py`) to verify the sitemap’s contents.

3. **Deploy the Sitemap**:
   - Upload the generated `sitemap.xml` file to the root directory of your website.

4. **Submit to Search Engines**:
   - Submit the sitemap to Google Search Console and Bing Webmaster Tools.

---

### **Conclusion**

This code is an integral part of the **Dynamic Rendering for JavaScript SEO** project. It ensures that all pages on a website, including JavaScript-heavy ones, are discoverable and indexable by search engines. By automating the generation of a sitemap using a Screaming Frog dataset, it simplifies the SEO process and significantly enhances website visibility. This tool is essential for website owners aiming to improve their SEO performance.

In [None]:
import pandas as pd  # For working with tabular data in CSV files
from xml.etree.ElementTree import Element, SubElement, tostring  # For creating XML structures
import xml.dom.minidom as minidom  # For formatting XML for readability

# Define file paths
# Input CSV file containing URLs and metadata
input_csv_path = '/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/internal_all (1).csv'
# Output file where the generated sitemap XML will be saved
output_sitemap_path = '/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/sitemap.xml'

# Step 1: Load the dataset from the provided CSV file
try:
    # Read the CSV file into a pandas DataFrame for easy data manipulation
    data = pd.read_csv(input_csv_path)

    # Step 2: Create the root XML structure for the sitemap
    # The <urlset> tag is required for all sitemaps; it defines the list of URLs.
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")

    # Step 3: Loop through each row in the dataset to add URLs to the sitemap
    for _, row in data.iterrows():
        # Skip rows where the URL is missing
        if pd.notna(row['URL Encoded Address']):
            # Create a <url> tag for each valid URL
            url = SubElement(urlset, "url")

            # Add the <loc> tag, which specifies the URL's location
            loc = SubElement(url, "loc")
            loc.text = row['URL Encoded Address']  # Add the URL value from the CSV

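            # Optional: Add the <lastmod> tag to indicate the last modification date
            # Note: the value is copied as-is from the export (e.g. "30-12-2024 17:21");
            # the sitemap protocol expects W3C Datetime such as "2024-12-30".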
            if pd.notna(row['Crawl Timestamp']):
                lastmod = SubElement(url, "lastmod")
                lastmod.text = row['Crawl Timestamp']  # Add timestamp from the CSV

            # Optional: Add the <changefreq> tag, defaulting to "weekly"
            # This tells search engines how frequently the page is expected to change
            changefreq = SubElement(url, "changefreq")
            changefreq.text = "weekly"

            # Optional: Add the <priority> tag, defaulting to 0.5
            # Priority is a hint about a page's relative importance; search engines may ignore it
            priority = SubElement(url, "priority")
            priority.text = "0.5"

    # Step 4: Convert the XML structure to a formatted string
    # Generate the raw XML string
    xml_str = tostring(urlset, encoding="utf-8", method="xml")
    # Use minidom to pretty-print the XML for better readability
    pretty_xml = minidom.parseString(xml_str).toprettyxml(indent="  ")

    # Step 5: Save the formatted XML to the specified file
    with open(output_sitemap_path, "w", encoding="utf-8") as f:
        f.write(pretty_xml)

    print(f"Sitemap.xml saved successfully at: {output_sitemap_path}")
except Exception as e:
    # Handle errors, such as file not found or invalid CSV format
    print(f"Error: {e}")

# Second Part: Preview the generated sitemap
import xml.etree.ElementTree as ET  # For parsing XML files

# Path to the generated sitemap file
sitemap_file_path = '/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/sitemap.xml'

def preview_sitemap(file_path, preview_rows=10):
    """
    Reads and previews the sitemap.xml file to verify its contents.

    Args:
        file_path (str): Path to the sitemap.xml file.
        preview_rows (int): Number of entries to preview from the sitemap.
    """
    try:
        # Parse the sitemap XML file
        tree = ET.parse(file_path)
        root = tree.getroot()

        # Define the namespace used in the sitemap
        namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

        # Collect data from the sitemap for preview
        rows = []
        for url_tag in root.findall('ns:url', namespace):
            # Extract each field: loc, lastmod, changefreq, priority
            loc = url_tag.find('ns:loc', namespace)
            lastmod = url_tag.find('ns:lastmod', namespace)
            changefreq = url_tag.find('ns:changefreq', namespace)
            priority = url_tag.find('ns:priority', namespace)

            # Append structured data to the rows list
            rows.append({
                'URL': loc.text if loc is not None else 'N/A',
                'Last Modified': lastmod.text if lastmod is not None else 'N/A',
                'Change Frequency': changefreq.text if changefreq is not None else 'N/A',
                'Priority': priority.text if priority is not None else 'N/A'
            })

            # Stop after collecting the requested number of rows
            if len(rows) >= preview_rows:
                break

        # Print a preview of the sitemap content
        print(f"Previewing first {preview_rows} rows of the sitemap:")
        print("=" * 80)
        for i, row in enumerate(rows, start=1):
            print(f"Row {i}:")
            print(f"  URL: {row['URL']}")
            print(f"  Last Modified: {row['Last Modified']}")
            print(f"  Change Frequency: {row['Change Frequency']}")
            print(f"  Priority: {row['Priority']}")
            print("-" * 80)

    except Exception as e:
        # Handle errors such as invalid XML or file not found
        print(f"Error reading the sitemap: {e}")

# Preview the first 20 entries of the sitemap (the function's default is 10)
preview_sitemap(sitemap_file_path, preview_rows=20)


Sitemap.xml saved successfully at: /content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/sitemap.xml
Previewing first 20 rows of the sitemap:
Row 1:
  URL: https://thatware.co/
  Last Modified: 30-12-2024 17:21
  Change Frequency: weekly
  Priority: 0.5
--------------------------------------------------------------------------------
Row 2:
  URL: https://thatware.co/basecamp/
  Last Modified: 30-12-2024 17:21
  Change Frequency: weekly
  Priority: 0.5
--------------------------------------------------------------------------------
Row 3:
  URL: https://thatware.co/wp-content/uploads/2024/07/Welby-Consulting-Top-Digital-Marketing-Company-GoodFirms-2019-150x150-1.webp
  Last Modified: 30-12-2024 17:21
  Change Frequency: weekly
  Priority: 0.5
--------------------------------------------------------------------------------
Row 4:
  URL: https://thatware.co/ai-based-seo-blueprint/
  Last Modified: 30-12-2024 17:21
  Change Frequency: weekly
  Priority: 0.5
--------------

---
# **1st Part: Sitemap Parsing and Dynamic Content Rendering**
**Purpose:**
This part of the code is designed to load a sitemap (an XML file containing website URLs), validate each URL, and render the content of JavaScript-heavy pages into static HTML using a headless Chrome browser driven by Selenium.

**Key Features:**
1. **Parse Sitemap:** Extracts URLs from the sitemap and validates them.
2. **Validate URLs:** Ensures only HTML pages are processed by filtering out unsupported files (e.g., images, CSS).
3. **Render Dynamic Pages:** Uses Selenium WebDriver to load and render dynamic content into static HTML files.
4. **Save Rendered HTML:** Saves the rendered HTML files with unique names (based on a hash of the URL) for later use.
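
The hash-based file naming mentioned in the last feature can be sketched as follows. This is an assumption-laden illustration: the helper name `html_filename_for`, the use of SHA-256, and the 16-character prefix are choices made for this example and may differ from what the project code does.

```python
import hashlib
import os


def html_filename_for(url: str, output_dir: str = "rendered_html") -> str:
    """Build a stable, filesystem-safe file path for a URL from a hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(output_dir, f"{digest}.html")


# The same URL always maps to the same file; different URLs map to different files.
print(html_filename_for("https://thatware.co/"))
print(html_filename_for("https://thatware.co/basecamp/"))
```

Hashing avoids problems with characters such as `/`, `?`, and `&` that appear in URLs but are awkward or invalid in file names.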

**Why It’s Important:**
Many websites use JavaScript to load content dynamically, which search engines may not index properly. This part ensures the content is fully rendered and saved as static HTML files, making it easier for search engines to crawl.

---


In [None]:
import os
import hashlib
import requests
import pandas as pd
from xml.etree import ElementTree as ET
from time import sleep
import random

# Selenium setup for rendering JavaScript-heavy webpages
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
except ImportError:
    # Install Selenium if it is missing
    print("Selenium is not installed. Installing it now...")
    os.system("pip install selenium")
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options

# File paths for necessary input/output files
SITEMAP_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/sitemap.xml"  # Location of the sitemap XML
OUTPUT_HTML_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendered_html"  # Directory to save rendered HTML
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendering_log.csv"  # File to save logs

# Constants for controlling request behavior
VALID_CONTENT_TYPES = ["text/html"]  # Only accept URLs with HTML content
EXCLUDED_EXTENSIONS = [".css", ".js", ".png", ".jpg", ".webp", ".pdf"]  # Ignore non-HTML file types
RETRY_COUNT = 3  # Number of retries for failed requests
RETRY_DELAY = 5  # Base delay (in seconds) between retries
TIMEOUT = 30  # Timeout for each request in seconds

# List of different User-Agent strings to mimic real browsers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]

def setup_webdriver():
    """
    Sets up Selenium WebDriver to render dynamic content (like JavaScript-heavy webpages).
    Why: Some websites dynamically generate content using JavaScript, which requires a browser environment to load.
    """
    print("Setting up Selenium WebDriver...")
    os.system("apt-get update")  # Update system packages
    os.system("apt-get install -y chromium-chromedriver")  # Install Chrome and ChromeDriver
    os.environ["PATH"] += ":/usr/bin/"  # Ensure ChromeDriver is accessible in the system PATH

    options = Options()
    options.add_argument("--headless")  # Run browser without a visible UI
    options.add_argument("--no-sandbox")  # Required for some environments like Colab
    options.add_argument("--disable-dev-shm-usage")  # Prevent memory issues in virtual environments
    return webdriver.Chrome(options=options)  # Return the configured WebDriver instance

def is_webpage(url):
    """
    Validates if a given URL is a meaningful webpage.
    Why:
    - Ensures only relevant URLs (like HTML pages) are processed.
    - Avoids wasting time on non-HTML resources like images or scripts.
    """
    for attempt in range(RETRY_COUNT):  # Retry failed requests up to RETRY_COUNT times
        try:
            headers = {"User-Agent": random.choice(USER_AGENTS)}  # Use a random User-Agent to mimic browser behavior

            # Exclude URLs with undesired file extensions
            if any(url.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS):
                return False

            # Make a HEAD request to check content type
            response = requests.head(url, headers=headers, timeout=TIMEOUT)
            content_type = response.headers.get("Content-Type", "")
            if not any(valid_type in content_type for valid_type in VALID_CONTENT_TYPES):
                return False

            # Make a GET request to ensure the page contains meaningful content
            response = requests.get(url, headers=headers, timeout=TIMEOUT)
            if "<title>" not in response.text or "<meta" not in response.text:
                return False

            return True  # If all checks pass, the URL is valid
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1}/{RETRY_COUNT}: Error validating URL {url}: {e}")
            sleep(RETRY_DELAY * (attempt + 1))  # Increase delay between retries (linear backoff: 5s, 10s, 15s)
    return False  # Return False if all attempts fail

def load_urls_from_sitemap(sitemap_path):
    """
    Parses and validates URLs from a sitemap XML file.
    Why: The sitemap provides a list of URLs to be processed, but only valid URLs should be rendered.
    """
    valid_urls = []
    try:
        print("Parsing sitemap...")
        tree = ET.parse(sitemap_path)  # Load and parse the XML file
        root = tree.getroot()  # Get the root element of the XML

        for url in root.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
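            loc = url.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")  # Locate the <loc> tag
            if loc is None or not loc.text:  # Skip entries without a usable <loc> value
                continue
            if is_webpage(loc.text.strip()):  # Validate the URL
                valid_urls.append(loc.text.strip())  # Add valid URLs to the list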
    except ET.ParseError as e:
        print(f"Error parsing sitemap: {e}")  # Handle invalid XML structure
    except Exception as e:
        print(f"Unexpected error: {e}")  # Handle other unexpected errors
    return valid_urls

def render_and_save_html(url, driver):
    """
    Renders a webpage and saves its HTML content.
    Why: Many modern websites rely on JavaScript to display content. Selenium renders these pages fully before saving.
    """
    try:
        print(f"Rendering: {url}")  # Log the URL being processed
        driver.get(url)  # Load the webpage in the browser
        sleep(3)  # Wait for JavaScript to load

        rendered_html = driver.page_source  # Extract the fully rendered HTML

        # Generate a unique filename based on the URL
        file_hash = hashlib.md5(url.encode()).hexdigest()
        file_path = os.path.join(OUTPUT_HTML_DIR, f"{file_hash}.html")
        os.makedirs(OUTPUT_HTML_DIR, exist_ok=True)  # Create output directory if it doesn't exist

        # Save the HTML content to the file
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(rendered_html)

        return {"url": url, "html_file": file_path, "status": "Success"}
    except Exception as e:
        print(f"Error rendering {url}: {e}")  # Log errors
        return {"url": url, "html_file": None, "status": f"Error: {e}"}

def process_urls_and_preview(urls):
    """
    Processes a list of URLs, renders each, and logs the results.
    Why:
    - Saves the rendered HTML for further use.
    - Provides a log of successes and failures for review.
    """
    driver = setup_webdriver()  # Initialize Selenium WebDriver
    results = []

    for url in urls:
        result = render_and_save_html(url, driver)  # Render and save each URL
        results.append(result)

    driver.quit()  # Close the WebDriver when done

    # Save the results to a CSV file
    log_df = pd.DataFrame(results)
    log_df.to_csv(LOG_FILE_PATH, index=False)
    print(f"Rendering log saved at: {LOG_FILE_PATH}")

    # Display a preview of the rendering log
    print("\nPreviewing first 10 rows of the rendering log:")
    print(log_df.head(10))

if __name__ == "__main__":
    # Main execution starts here
    urls = load_urls_from_sitemap(SITEMAP_PATH)  # Load and validate URLs from the sitemap
    if urls:
        process_urls_and_preview(urls)  # Render and log the validated URLs
    else:
        print("No valid URLs found in the sitemap.")


Parsing sitemap...
Attempt 1/3: Error validating URL https://thatware.co/local-business-seo-services/: HTTPSConnectionPool(host='thatware.co', port=443): Max retries exceeded with url: /local-business-seo-services/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7bd31c298a90>, 'Connection to thatware.co timed out. (connect timeout=30)'))
Attempt 2/3: Error validating URL https://thatware.co/local-business-seo-services/: HTTPSConnectionPool(host='thatware.co', port=443): Max retries exceeded with url: /local-business-seo-services/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7bd31c8809d0>, 'Connection to thatware.co timed out. (connect timeout=30)'))
Attempt 3/3: Error validating URL https://thatware.co/local-business-seo-services/: HTTPSConnectionPool(host='thatware.co', port=443): Max retries exceeded with url: /local-business-seo-services/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7b
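
The log above shows all three validation attempts timing out. As a rough sketch of how long the validator waits between attempts, the snippet below compares the schedule produced by `RETRY_DELAY * (attempt + 1)`, which grows linearly, with a doubling (exponential) schedule. The helper names `linear_delays` and `exponential_delays` are illustrative and not part of the original script.

```python
RETRY_COUNT = 3  # Same values as the configuration in the script
RETRY_DELAY = 5  # Base delay in seconds


def linear_delays(base, retries):
    """Delays produced by base * (attempt + 1), as used in is_webpage()."""
    return [base * (attempt + 1) for attempt in range(retries)]


def exponential_delays(base, retries):
    """Hypothetical alternative: double the delay after every failed attempt."""
    return [base * (2 ** attempt) for attempt in range(retries)]


print(linear_delays(RETRY_DELAY, RETRY_COUNT))       # [5, 10, 15]
print(exponential_delays(RETRY_DELAY, RETRY_COUNT))  # [5, 10, 20]
```

With `TIMEOUT = 30`, a URL that never connects can take roughly 3 x 30 seconds of waiting plus 5 + 10 + 15 seconds of sleeping, about two minutes in total, before `is_webpage` gives up.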

---
# **2nd Part: Metadata Analysis of Rendered HTML**
**Purpose:**
This code analyzes the static HTML files generated in the first part to validate and assess their metadata for SEO compliance.

**Key Features:**
1. **Title Analysis:** Checks if the `<title>` tag exists, its length, and whether it meets SEO guidelines (10–60 characters).
2. **Meta Description Validation:** Ensures the `<meta name="description">` tag exists and has a length of 50–160 characters.
3. **Canonical Link Check:** Confirms the presence of `<link rel="canonical">` tags to avoid duplicate content issues.
4. **Summary and Reporting:**
   - Saves a CSV report with detailed results for each file.
   - Generates a JSON summary with overall statistics, such as how many files are missing metadata.

**Why It’s Important:**
Metadata plays a crucial role in SEO. This part identifies gaps and issues, providing actionable insights to improve page rankings.
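
The length rules listed above can be sketched as a single helper. This is a minimal illustration rather than the notebook's exact code; the function name `check_length` and its return strings are assumptions made for this example.

```python
def check_length(text, min_len, max_len):
    """Classify text as missing, too short, too long, or valid by character count."""
    if text is None:  # A missing tag is represented here by None
        return "Invalid: missing"
    length = len(text.strip())
    if length < min_len:
        return f"Invalid: too short ({length} characters)"
    if length > max_len:
        return f"Invalid: too long ({length} characters)"
    return "Valid"


# Title guideline from this section: 10-60 characters
print(check_length("Home", 10, 60))
# Meta description guideline from this section: 50-160 characters
print(check_length("A" * 120, 50, 160))
# Missing tag
print(check_length(None, 10, 60))
```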

---


In [None]:
import os  # For file system operations like reading files and managing directories
import pandas as pd  # For organizing data into structured CSV reports
import json  # For saving summary data in a structured JSON format
from bs4 import BeautifulSoup  # For parsing and analyzing HTML content

# Define paths for input HTML directory and output reports
html_dir = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendered_html"
csv_output_path = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_report.csv"
json_output_path = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_summary.json"

def analyze_html_files():
    """
    Analyze and validate metadata of rendered HTML files.
    Generates:
    - A detailed CSV report with metadata analysis for each HTML file.
    - A JSON summary with overall statistics and actionable suggestions.
    """

    # Initialize data structures to store results and summary statistics
    analysis_results = []  # Will hold detailed analysis for each file
    summary = {
        "total_files": 0,  # Count of total HTML files processed
        "missing_titles": 0,  # Files missing <title> tag
        "invalid_titles": 0,  # Files with invalid <title> length
        "missing_meta_descriptions": 0,  # Files missing meta descriptions
        "invalid_meta_descriptions": 0,  # Files with invalid meta description length
        "missing_canonical_links": 0,  # Files missing canonical links
        "files_with_errors": 0  # Files that caused an error during processing
    }

    # Loop through all files in the specified HTML directory
    for file_name in os.listdir(html_dir):
        # Process only files with .html extension
        if file_name.endswith(".html"):
            file_path = os.path.join(html_dir, file_name)  # Full path to the file

            try:
                # Step 1: Read and parse the HTML file using BeautifulSoup
                # Why? To extract and analyze key metadata for SEO compliance.
                with open(file_path, "r", encoding="utf-8") as file:
                    html_content = file.read()
                soup = BeautifulSoup(html_content, "html.parser")

                # Step 2: Extract and validate the <title> tag
                # Why? Titles are crucial for SEO and should meet length guidelines (10-60 characters).
                # get_text() also handles an empty <title> or one with nested tags, where .string is None
                title = soup.title.get_text(strip=True) if soup.title else "Missing"
                title_length = len(title) if title != "Missing" else 0
                if title == "Missing":
                    title_valid = "Invalid: Title is missing"
                    title_suggestion = "Add a title with 10-60 characters to improve SEO."
                elif 10 <= title_length <= 60:
                    title_valid = "Valid"
                    title_suggestion = "No changes needed."
                elif title_length < 10:
                    title_valid = f"Invalid: Title too short ({title_length} characters)"
                    title_suggestion = "Increase the title length to at least 10 characters."
                else:
                    title_valid = f"Invalid: Title too long ({title_length} characters)"
                    title_suggestion = "Shorten the title to 60 characters or less."

                # Step 3: Extract and validate the meta description
                # Why? Meta descriptions summarize the page content and influence click-through rates.
                meta_tag = soup.find("meta", attrs={"name": "description"})
                meta_description = (
                    meta_tag["content"].strip()  # Treat a tag without a content attribute as missing
                    if meta_tag is not None and meta_tag.get("content") is not None else "Missing"
                )
                meta_desc_length = len(meta_description) if meta_description != "Missing" else 0
                if meta_description == "Missing":
                    meta_desc_valid = "Invalid: Meta description is missing"
                    meta_desc_suggestion = "Add a meta description with 50-160 characters."
                elif 50 <= meta_desc_length <= 160:
                    meta_desc_valid = "Valid"
                    meta_desc_suggestion = "No changes needed."
                elif meta_desc_length < 50:
                    meta_desc_valid = f"Invalid: Description too short ({meta_desc_length} characters)"
                    meta_desc_suggestion = "Increase the description length to at least 50 characters."
                else:
                    meta_desc_valid = f"Invalid: Description too long ({meta_desc_length} characters)"
                    meta_desc_suggestion = "Shorten the description to 160 characters or less."

                # Step 4: Extract the canonical link
                # Why? Canonical links prevent duplicate content issues and improve SEO.
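                canonical_tag = soup.find("link", attrs={"rel": "canonical"})
                canonical_link = (
                    canonical_tag["href"].strip()  # Treat a tag without an href attribute as missing
                    if canonical_tag is not None and canonical_tag.get("href") is not None else "Missing"
                )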
                canonical_suggestion = "Ensure a canonical link is added to prevent duplicate content issues." if canonical_link == "Missing" else "No changes needed."

                # Step 5: Check presence of key structural elements
                # Why? Ensuring the basic structure like <html>, <head>, and <body> is essential for a valid webpage.
                has_html = soup.html is not None
                has_head = soup.head is not None
                has_body = soup.body is not None

                # Update summary statistics
                summary["total_files"] += 1
                if title == "Missing":
                    summary["missing_titles"] += 1
                elif "Invalid" in title_valid:
                    summary["invalid_titles"] += 1
                if meta_description == "Missing":
                    summary["missing_meta_descriptions"] += 1
                elif "Invalid" in meta_desc_valid:
                    summary["invalid_meta_descriptions"] += 1
                if canonical_link == "Missing":
                    summary["missing_canonical_links"] += 1

                # Append results for this file to the analysis list
                analysis_results.append({
                    "File Name": file_name,
                    "Title": title,
                    "Title Validity": title_valid,
                    "Title Suggestion": title_suggestion,
                    "Meta Description": meta_description,
                    "Meta Description Validity": meta_desc_valid,
                    "Meta Description Suggestion": meta_desc_suggestion,
                    "Canonical Link": canonical_link,
                    "Canonical Suggestion": canonical_suggestion,
                    "Has <html>": has_html,
                    "Has <head>": has_head,
                    "Has <body>": has_body
                })

            except Exception as e:
                # Handle errors gracefully and log them in the summary
                summary["files_with_errors"] += 1
                analysis_results.append({
                    "File Name": file_name,
                    "Title": "Error",
                    "Meta Description": "Error",
                    "Canonical Link": "Error",
                    "Has <html>": False,
                    "Has <head>": False,
                    "Has <body>": False,
                    "Error": str(e)
                })
                print(f"Error processing file {file_name}: {e}")

    # Step 6: Save detailed analysis results to a CSV file
    # Why? To provide a structured and shareable report of the analysis.
    analysis_df = pd.DataFrame(analysis_results)
    analysis_df.to_csv(csv_output_path, index=False, encoding="utf-8")
    print(f"Detailed analysis saved to CSV: {csv_output_path}")

    # Step 7: Save summary statistics to a JSON file
    # Why? To provide a quick overview of the results and identify improvement areas.
    with open(json_output_path, "w", encoding="utf-8") as json_file:
        json.dump(summary, json_file, indent=4)
    print(f"Summary saved to JSON: {json_output_path}")

    # Display a preview of the first 40 rows of the analysis for quick review
    print("\n--- Analysis Preview (First 40 Rows) ---")
    print(analysis_df.head(40))

# Entry point for executing the script
if __name__ == "__main__":
    analyze_html_files()


Detailed analysis saved to CSV: /content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_report.csv
Summary saved to JSON: /content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_summary.json

--- Analysis Preview (First 40 Rows) ---
                    File Name  \
0    2445840446231693084.html   
1    3445600768952517889.html   
2   -6798089790693216076.html   
3   -5621794161926510500.html   
4    8496365001144611913.html   
5    3692490903250420304.html   
6    6808648066961521283.html   
7   -1610247714284567949.html   
8    3907671781176194564.html   
9    6925278034660345745.html   
10  -7582597486279110324.html   
11   4917508925371105299.html   
12  -8764603961112146386.html   
13   2608248413038825434.html   
14   4026303393469988296.html   
15  -6888828361509409421.html   
16   3449603455730655479.html   
17   7905355167294934904.html   
18   1602924848936222192.html   
19   6890641865127205189.html   
20  -8

---

### **Understanding the Output**

This output comes from analyzing the metadata of the rendered HTML files and writing the results to a report. The report covers SEO-related components such as `<title>`, `<meta name="description">`, and canonical tags for each processed HTML file. Each part of the output is described below:

---

#### **1. File Path for Saved Reports**
- **CSV Report Path:**
  - Location: `/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_report.csv`
  - What it contains: A detailed spreadsheet report with information about each HTML file's metadata (title, meta description, canonical links).
  - Use Case: This CSV can be opened in tools like Excel or Google Sheets for review and further analysis.

- **JSON Summary Path:**
  - Location: `/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_summary.json`
  - What it contains: A summary of overall statistics, like total files processed, missing titles, or missing canonical links.
  - Use Case: The JSON is useful for quickly understanding the health of your metadata without going through individual files.
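
As a hedged sketch of how the JSON summary might be consumed, the snippet below turns the counters into percentages of the files processed. The sample numbers are made up for illustration and are not results from this project. The keys match the `summary` dictionary built by `analyze_html_files()`, while `summarize_rates` is a hypothetical helper name.

```python
import json

# Made-up sample in the same shape as metadata_analysis_summary.json
sample_json = """
{
    "total_files": 40,
    "missing_titles": 0,
    "invalid_titles": 4,
    "missing_meta_descriptions": 1,
    "invalid_meta_descriptions": 6,
    "missing_canonical_links": 1,
    "files_with_errors": 0
}
"""


def summarize_rates(summary):
    """Return each issue counter as a percentage of total_files."""
    total = summary.get("total_files", 0)
    if total == 0:
        return {}  # Nothing was processed, so no rates can be computed
    # files_with_errors is skipped because those files are not counted in total_files
    skipped = {"total_files", "files_with_errors"}
    return {
        key: round(100 * count / total, 1)
        for key, count in summary.items()
        if key not in skipped
    }


rates = summarize_rates(json.loads(sample_json))
for key, pct in rates.items():
    print(f"{key}: {pct}%")
```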

---

#### **2. Preview of First 40 Rows**
This preview shows the first 40 rows of the report, one row per analyzed file. The key columns are:

1. **File Name:**
   - What it is: The name of the static HTML file (e.g., `2445840446231693084.html`).
   - Use Case: Identifies the saved file; the rendering log records which URL each file was generated from.

2. **Title:**
   - What it is: The content of the `<title>` tag in the HTML file. For example, "THATWARE® - Revolutionizing SEO with Hyper-Intelligence".
   - Use Case: Titles are important for search engines to understand the purpose of the page and attract users.

3. **Title Validity:**
   - What it is: Whether the `<title>` meets SEO guidelines (10–60 characters).
     - "Valid": Title meets length requirements.
     - "Invalid: ...": Title is missing, too short, or too long; the text after the colon gives the reason.
   - Use Case: Helps you spot titles that need improvement for better SEO rankings.

4. **Title Suggestion:**
   - What it is: Recommendations for fixing invalid titles.
     - Example: "Shorten the title to 60 characters or less."
   - Use Case: Provides clear action steps to improve titles for SEO.

5. **Meta Description:**
   - What it is: The content of the `<meta name="description">` tag, which summarizes the page.
     - Example: "THATWARE® is the world's first SEO agency to specialize in hyper-intelligence."
   - Use Case: Helps search engines and users understand the page content.

6. **Meta Description Validity:**
   - What it is: Whether the `<meta>` description meets SEO guidelines (50–160 characters).
     - "Valid": Description meets requirements.
     - "Invalid": Missing or not within the ideal length.
   - Use Case: Shows if your descriptions are ready for search engine display.

7. **Meta Description Suggestion:**
   - What it is: Recommendations for improving invalid meta descriptions.
     - Example: "Add a meta description with 50-160 characters."
   - Use Case: Provides specific fixes to enhance click-through rates.

8. **Canonical Link:**
   - What it is: The content of the `<link rel="canonical">` tag, which indicates the preferred URL for search engines.
     - Example: `https://thatware.co/seo-company-delhi/`
     - "Missing" if not found.
   - Use Case: Prevents duplicate content issues by pointing to the main version of the page.

9. **Canonical Suggestion:**
   - What it is: Recommendations for missing canonical links.
     - Example: "Ensure a canonical link is added to prevent duplicate content issues."
   - Use Case: Guides you in avoiding duplicate content issues, where several URLs with the same content compete with each other in search results.

10. **Has `<html>`, `<head>`, `<body>`:**
    - What it is: Confirms the presence of essential structural elements in the HTML file.
      - "True": The element exists.
      - "False": The element is missing.
    - Use Case: Ensures the basic structure of the webpage is intact.
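
To act on these columns, one option is to filter the CSV report down to the rows that need attention. The sketch below uses Python's standard `csv` module on a tiny made-up sample with a subset of the columns. The file names and values are illustrative only, and `rows_needing_fixes` is a hypothetical helper.

```python
import csv
import io

# Made-up sample using a subset of the report's column names
sample_csv = """File Name,Title Validity,Meta Description Validity,Canonical Link
a.html,Valid,Valid,https://example.com/a/
b.html,Invalid: Title too long (64 characters),Valid,https://example.com/b/
c.html,Valid,Invalid: Meta description is missing,Missing
"""


def rows_needing_fixes(rows):
    """Yield (file name, list of problems) for rows with at least one issue."""
    for row in rows:
        problems = []
        if row["Title Validity"] != "Valid":
            problems.append(row["Title Validity"])
        if row["Meta Description Validity"] != "Valid":
            problems.append(row["Meta Description Validity"])
        if row["Canonical Link"] == "Missing":
            problems.append("Canonical link is missing")
        if problems:
            yield row["File Name"], problems


reader = csv.DictReader(io.StringIO(sample_csv))
for file_name, problems in rows_needing_fixes(reader):
    print(file_name, "->", "; ".join(problems))
```

For the real report, the same function could be fed a `csv.DictReader` opened on `metadata_analysis_report.csv`.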

---

### **Key Observations from the Preview**
1. **Valid Titles:** Most titles are valid, but a few are too long (e.g., "Invalid: Title too long (64 characters)"). The suggestions guide you to fix them.
2. **Meta Description:** One file (`4449457235279430625.html`) is missing a meta description. The suggestion is to add one.
3. **Canonical Links:** Most files have valid canonical links, except one (`4449457235279430625.html`) where it is missing.
4. **HTML Structure:** All files have essential elements (`<html>`, `<head>`, `<body>`), ensuring the pages are structurally sound.

---

### **What This Output Tells Us**
1. **SEO Health:** This analysis provides a snapshot of your HTML files' SEO readiness.
   - Strong Points: Most titles and meta descriptions are valid, and all files have proper structure.
   - Areas to Improve: Fix a few titles that are too long and add missing meta descriptions and canonical links.

2. **Next Steps:**
   - Implement the suggestions from the report (e.g., shorten long titles, add meta descriptions).
   - Deploy the improved HTML files to improve SEO performance.

---

### **Use Cases of This Report**
1. **SEO Improvement:** Helps you identify and fix issues that could harm your website’s visibility on search engines.
2. **Client Reporting:** Share this analysis with your client to demonstrate the quality and readiness of their webpages.
3. **Future Optimizations:** Use this as a baseline to track improvements over time.


---
# **3rd Part: Optimizing and Validating Static HTML**
**Purpose:**
This code optimizes the rendered HTML files to ensure they meet SEO standards and are ready for deployment.

**Key Features:**
1. **Add Missing Metadata:**
   - Adds a `<meta name="description">` tag if it’s missing.
   - Adds a `<link rel="canonical">` tag to prevent duplicate content issues.
2. **SEO-Friendly Title Adjustment:**
   - Pads `<title>` text shorter than 30 characters with a suffix and truncates text longer than 60 characters.
3. **Enhance Accessibility:** Adds a placeholder `alt` attribute to images that lack one.
4. **Remove Redundant Elements:**
   - Deletes all `<script>` and `<noscript>` tags.
   - Removes lazy-loading attributes and preloader elements.
5. **Validation:** Checks the optimized HTML for compliance with SEO guidelines and logs any remaining issues.

**Why It’s Important:**
This part ensures that the static HTML files are optimized for search engine crawlers and free from common SEO pitfalls.
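
The title adjustment described above can be sketched as a small pure function. It mirrors the thresholds used in `optimize_html` (pad under 30 characters, truncate over 60), but `adjust_title` itself is a hypothetical helper written for illustration.

```python
def adjust_title(title, min_len=30, max_len=60, suffix=" - SEO Optimized"):
    """Pad short titles with a suffix and truncate long ones with an ellipsis."""
    text = title.strip()
    if len(text) < min_len:
        return text + suffix
    if len(text) > max_len:
        return text[: max_len - 3] + "..."  # Result is exactly max_len characters
    return text


print(adjust_title("Contact Us"))   # Contact Us - SEO Optimized
print(len(adjust_title("x" * 75)))  # 60
print(len(adjust_title("Home")))    # 20
```

Padding does not guarantee the 30-character minimum: a 4-character title grows to only 20 characters, which a 30-60 character validation check would still flag.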

---


In [None]:
import os  # Provides functions to interact with the operating system (e.g., creating directories, checking file existence)
import json  # Handles JSON data for logging purposes
import time  # Used for pausing the script (e.g., waiting for pages to load dynamically)
from bs4 import BeautifulSoup  # Library for parsing and manipulating HTML and XML
from selenium import webdriver  # Automates browser actions for rendering JavaScript-heavy pages
from selenium.webdriver.chrome.options import Options  # Configures options for the Chrome browser
from selenium.webdriver.common.by import By  # Provides methods to locate HTML elements
from selenium.webdriver.support.ui import WebDriverWait  # Waits for certain conditions to be met before continuing
from selenium.webdriver.support import expected_conditions as EC  # Pre-defined conditions for waiting
import pandas as pd  # Handles structured data in tabular form (e.g., CSV file input/output)

# File paths for input and output
CSV_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendering_log.csv"
STATIC_HTML_OUTPUT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/static_html"
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/render_log.json"
# 1. Function to ensure a directory exists
def ensure_directory_exists(directory_path):
    """
    Ensures that the specified directory exists. If not, creates it.
    Why: Files cannot be saved if the directory doesn't exist.
    This avoids runtime errors due to missing directories.
    """
    os.makedirs(directory_path, exist_ok=True)  # Create the directory (and parents) if missing; no-op if it exists

# 2. Function to load URLs from a CSV file
def load_urls(file_path):
    """
    Loads a list of URLs from a CSV file.
    Why: URLs are stored in a structured format (CSV) for scalability and reusability.
    """
    try:
        if not os.path.exists(file_path):  # Check if the file exists
            print(f"Error: File not found -> {file_path}")
            return []  # Return an empty list if the file is missing

        df = pd.read_csv(file_path)  # Read the CSV file into a DataFrame
        return df["url"].tolist()  # Convert the 'url' column into a list
    except Exception as e:
        print(f"Error loading URLs: {e}")  # Print any errors that occur
        return []  # Return an empty list if an error occurs

# 3. Function to save HTML content to a file
def save_html(content, file_path):
    """
    Saves the provided HTML content into a file at the specified path.
    Why: The rendered HTML files are saved for SEO purposes and later analysis.
    """
    ensure_directory_exists(os.path.dirname(file_path))  # Ensure the directory exists
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(content)  # Write the HTML content to the file

# 4. Function to save logs for debugging and tracking
def save_logs(logs, log_file_path):
    """
    Saves the rendering logs as a JSON file.
    Why: Logs help track rendering status and debug any issues during execution.
    """
    with open(log_file_path, "w", encoding="utf-8") as file:
        json.dump(logs, file, indent=4)  # Save logs in a readable JSON format

# 5. Function to optimize HTML for SEO
def optimize_html(soup, url):
    """
    Modifies the HTML to make it more SEO-friendly by adding or updating key elements.
    Why: Search engines rely on structured and complete HTML for indexing pages effectively.
    """
    # Add a meta description if it's missing
    if not soup.find("meta", {"name": "description"}):
        meta_tag = soup.new_tag("meta", attrs={"name": "description", "content": "SEO-optimized content."})
        soup.head.append(meta_tag)  # Append the meta tag to the head section

    # Add a canonical link if it's missing
    if not soup.find("link", {"rel": "canonical"}):
        canonical_tag = soup.new_tag("link", attrs={"rel": "canonical", "href": url})
        soup.head.append(canonical_tag)  # Append the canonical tag to the head section

    # Adjust the title tag to ensure it's SEO-optimized
    title_tag = soup.find("title")
    if title_tag:
        title_text = title_tag.text.strip()  # Get the current title text
        if len(title_text) < 30:  # Ensure the title is long enough
            title_tag.string = title_text + " - SEO Optimized"
        elif len(title_text) > 60:  # Truncate overly long titles
            title_tag.string = title_text[:57] + "..."

    # Add alt attributes to images that lack them
    for img_tag in soup.find_all("img"):
        if not img_tag.get("alt"):  # True when the alt attribute is missing or empty
            img_tag["alt"] = "image description missing"  # Add a placeholder alt attribute

    # Add href placeholders for anchor tags missing href attributes
    for a_tag in soup.find_all("a"):
        if not a_tag.get("href"):  # Check if the href attribute is missing
            a_tag["href"] = "#"  # Add a placeholder href attribute

    # Remove unnecessary script and noscript tags
    for script in soup.find_all(["script", "noscript"]):
        script.decompose()  # Completely remove these tags

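    # Remove lazy-loading attributes (e.g., loading="lazy")
    # Match on the "loading" attribute itself so the delete below cannot raise a KeyError
    for tag in soup.find_all(attrs={"loading": lambda value: value and "lazy" in value.lower()}):
        del tag.attrs["loading"]  # Delete the lazy-loading attribute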

    # Remove preloader elements by their classes or IDs
    preloader_classes = ["dots", "preloader", "preload-content", "animation"]
    for preloader in soup.find_all(
        lambda tag: any(cls in tag.get("class", []) for cls in preloader_classes) or
                    tag.get("id") in preloader_classes
    ):
        preloader.decompose()  # Completely remove these elements

    return soup  # Return the modified BeautifulSoup object


# 6. Validate the optimized HTML
def validate_html(soup):
    """
    Validates the optimized HTML for compliance with SEO standards.
    Why:
    - To ensure the generated HTML meets key SEO requirements.
    - To identify and log any issues that could affect SEO performance.
    """
    issues = []  # Initialize a list to store identified issues

    # Check if any <script> tags remain in the HTML
    # Script tags are often unnecessary in the final optimized HTML as they are for dynamic behavior
    if soup.find("script"):
        issues.append("Residual <script> tags found.")

    # Check if any elements still carry a lazy-loading attribute (loading="lazy")
    # Lazy-loading attributes can prevent search engines from fully loading content
    if soup.find(attrs={"loading": lambda value: value and "lazy" in value.lower()}):
        issues.append("Residual lazy-loading attributes found.")

    # Check for a meta description tag
    # Meta descriptions provide a summary of the page and are important for search engine indexing
    if not soup.find("meta", {"name": "description"}):
        issues.append("Missing <meta name='description'> tag.")

    # Check for a canonical link tag
    # Canonical tags prevent duplicate content issues by indicating the preferred version of a URL
    if not soup.find("link", {"rel": "canonical"}):
        issues.append("Missing <link rel='canonical'> tag.")

    # Check the <title> tag for proper length
    # Titles should be between 30 and 60 characters for optimal SEO
    title_tag = soup.find("title")
    if not title_tag or not (30 <= len(title_tag.text.strip()) <= 60):
        issues.append("Invalid or missing <title> tag.")

    # Return a tuple: validation status (True/False) and the list of issues
    return not issues, issues


import time  # For measuring time taken

def render_page_and_optimize(url, wait_selector=None):
    """
    Renders, optimizes, validates a webpage, and ensures accurate logging and file saving.
    """
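    logs = {"url": url, "status": "Error", "issues": [], "time_taken": 0}  # Initialize logs
    file_name = None  # Placeholder for the static HTML file name
    start_time = time.time()  # Fallback start time so the finally block works even if browser setup fails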

    try:
        # STEP 1: Configure Selenium WebDriver for automated rendering
        options = Options()
        options.add_argument("--headless")  # Run in headless mode (no GUI)
        options.add_argument("--disable-gpu")  # Disable GPU for better performance in headless mode
        options.add_argument("--no-sandbox")  # Required for running Chrome in some environments
        options.add_argument(
            "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
        )  # Use a realistic User-Agent to avoid being blocked

        driver = webdriver.Chrome(options=options)  # Initialize WebDriver

        # STEP 2: Navigate to the URL
        print(f"Processing URL: {url}")
        print(f"Rendering: {url}")
        start_time = time.time()  # Start the timer
        driver.get(url)  # Open the URL in the browser

        # STEP 3: Wait for the page to load dynamically
        try:
            WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
            if wait_selector:
                WebDriverWait(driver, 30).until(
                    EC.presence_of_element_located((By.XPATH, wait_selector))
                )  # Use the configurable selector if provided
        except Exception:
            logs["issues"].append(f"Dynamic element not found: {wait_selector}")
            print(f"Warning: Dynamic element not found for {url}. Proceeding with available content.")

        # STEP 4: Extract and parse the rendered HTML
        rendered_html = driver.page_source
        soup = BeautifulSoup(rendered_html, "html.parser")

        # STEP 5: Optimize the HTML for SEO
        optimized_soup = optimize_html(soup, url)

        # STEP 6: Validate the optimized HTML
        is_valid, issues = validate_html(optimized_soup)

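        # STEP 7: Generate a file name based on the URL
        # Note: hash() of a str is salted per Python process, so the same URL can map
        # to a different file name in a new session (unless PYTHONHASHSEED is fixed).
        file_name = f"{hash(url)}.html"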
        file_path = os.path.join(STATIC_HTML_OUTPUT_DIR, file_name)

        # STEP 8: Save the optimized HTML as a static file
        save_html(str(optimized_soup), file_path)

        # Confirm the file was saved successfully
        if os.path.exists(file_path):
            logs["file_name"] = file_name
        else:
            logs["issues"].append("Failed to save the HTML file.")

        # STEP 9: Update log status based on validation
        logs["status"] = "Valid" if is_valid else "Invalid"
        logs["issues"].extend(issues)

        # STEP 10: Show a preview of the rendered HTML
        print("\n===== Optimized HTML Preview =====")
        print(optimized_soup.prettify()[:5000])  # Show the first 5000 characters
        print("===================================")
        print(f"Validation Status: {'Valid' if is_valid else 'Invalid'}")

        # STEP 11: Print any validation issues found
        if issues:
            print("Issues Found:")
            for issue in issues:
                print(f"- {issue}")

    except Exception as e:
        # STEP 12: Handle any exceptions that occur during processing
        logs["issues"].append(f"Error: {str(e)}")  # Log the error message
        print(f"Error processing {url}: {str(e)}")  # Print the error for debugging

    finally:
        # STEP 13: Ensure the browser is closed and measure time
        end_time = time.time()  # End the timer
        logs["time_taken"] = round(end_time - start_time, 2)  # Calculate time taken
        if 'driver' in locals():  # Check if the driver exists
            try:
                logs["browser_logs"] = driver.get_log("browser")  # Capture browser console logs
            except Exception:
                logs["browser_logs"] = "Unable to capture browser logs."
            driver.quit()  # Quit the WebDriver

    return logs  # Return the log details for this URL



def process_pages():
    """
    Processes all URLs sequentially, ensuring accurate logging, rendering, and file saving.
    """
    ensure_directory_exists(STATIC_HTML_OUTPUT_DIR)  # Ensure output directory exists

    urls = load_urls(CSV_FILE_PATH)  # Load URLs from CSV
    if not urls:
        print("No URLs to process.")
        return

    all_logs = []  # Store logs for all URLs
    for url in urls:
        print(f"\nProcessing URL: {url}")  # Inform the user about the URL being processed
        log = render_page_and_optimize(url)  # Render and process the URL
        all_logs.append(log)  # Append logs to the list

        # Save logs incrementally after every URL for better debugging
        save_logs(all_logs, LOG_FILE_PATH)
        print(f"Log updated for URL: {url}")

    print(f"\nFinal logs saved to {LOG_FILE_PATH}")


# Execute the script
if __name__ == "__main__":
    process_pages()




Streaming output truncated to the last 5000 lines.
  <meta content="2024-05-06T12:10:41+00:00"/>
  <meta content="https://thatware.co/wp-content/uploads/2021/05/webdesign.jpg"/>
  <meta content="2500"/>
  <meta content="1309"/>
  <meta content="image/jpeg"/>
  <meta content="admin" name="author"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="Website Design Services - Hire Professional Web Designers" name="twitter:title"/>
  <meta content="Thatware is recognized as the leading website design services provider. Get your free sample today!" name="twitter:description"/>
  <meta content="Written by" name="twitter:label1"/>
  <meta content="admin" name="twitter:data1"/>
  <meta content="Est. reading time" name="twitter:label2"/>
  <meta content="1 minute" name="twitter:data2"/>
  <!-- / Yoast SEO Premium plugin. -->
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <link href="https://thatware.co/feed/" rel="alternate" 

---
# **What Does This Output Represent?**

The output shows:
1. **The URLs being processed**: Each URL corresponds to a webpage that is being analyzed for SEO improvements.
2. **Rendering HTML content**: The script uses tools like Selenium to load and analyze the HTML structure of these pages.
3. **Optimized HTML Preview**: The HTML is checked and modified for better SEO.
4. **Validation Status**: Each webpage is either marked as "Valid" or flagged with issues based on specific SEO checks.
5. **Log Updates**: Logs are updated after each URL is processed, storing information about the page's SEO quality.

Now let’s break down the output in detail.

---

### **Step-by-Step Explanation of the Output**

#### **1. Processing the URL**
Example:
```
Processing URL: https://thatware.co/law-firm-seo-services/
Rendering: https://thatware.co/law-firm-seo-services/
```

- **What is happening?**
  - The script starts processing a specific webpage. For example, `https://thatware.co/law-firm-seo-services/` is one of the pages.
  - It then renders the webpage using a browser-like tool. Rendering means the script loads the page as if a real user were visiting it in a browser.

- **Why is this step needed?**
  - Many websites generate content dynamically using JavaScript, which regular tools cannot analyze. Rendering ensures the script captures the final webpage as seen by users.

- **Use Case:**
  - This ensures the SEO analysis is based on the actual content users and search engines see, not just the raw code.

---

#### **2. Optimized HTML Preview**
Example:
```
===== Optimized HTML Preview =====
<html class="no-js" lang="en-US">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="upgrade-insecure-requests" http-equiv="Content-Security-Policy"/>
  ...
===================================
```

- **What is happening?**
  - The rendered HTML content of the webpage is displayed. This includes important elements like:
    - `<title>`: The title of the page shown in search results.
    - `<meta name="description">`: A summary of the page for search engines.
    - `<link rel="canonical">`: The preferred URL to avoid duplicate content issues.
    - Other metadata: Author information, image previews, and social media tags.

- **Why is this step needed?**
  - This helps identify if the webpage has all the necessary SEO elements and whether they meet best practices.

- **Use Case:**
  - You can see what improvements are being suggested, such as fixing a missing `<meta>` tag or improving the `<title>` length.

---

#### **3. Title Tag Analysis**
Example:
```
<title>
   Law Firm SEO Services | Best Legal SEO Services Agency
</title>
```

- **What is this?**
  - The title tag is a key SEO element that appears in search engine results.
  - In this example, the title is: "Law Firm SEO Services | Best Legal SEO Services Agency."

- **Why is it important?**
  - Titles should be concise, relevant, and between 10 and 60 characters long. They influence whether users click on the link.

- **Use Case:**
  - If the title is too long, short, or irrelevant, the script suggests changes to improve SEO.
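
A title-length check along these lines can be sketched in a few lines of Python. This is an illustrative sketch, not the notebook's `validate_html`: the helper name `check_title` and its messages are hypothetical, and the 10 to 60 character bounds are simply the ones quoted above.

```python
def check_title(title, min_len=10, max_len=60):
    """Return a list of issues for a <title> string (hypothetical helper)."""
    text = (title or "").strip()
    if not text:
        return ["Missing <title> tag."]
    if len(text) < min_len:
        return [f"Title too short ({len(text)} < {min_len} characters)."]
    if len(text) > max_len:
        return [f"Title too long ({len(text)} > {max_len} characters)."]
    return []  # An empty list means the title passed the length check


# The example title from this page passes; a three-letter title does not.
print(check_title("Law Firm SEO Services | Best Legal SEO Services Agency"))
print(check_title("SEO"))
```

Part 4 of this notebook applies a stricter 30 to 60 character range; passing `min_len=30` would mirror that rule.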

---

#### **4. Meta Description**
Example:
```
<meta content="Explore our Law Firm SEO Services and take the lead in the digital courtroom of search engines! Outshine competitors in the legal field." name="description"/>
```

- **What is this?**
  - The meta description provides a summary of the page content.
  - In this example, it’s: "Explore our Law Firm SEO Services and take the lead in the digital courtroom of search engines! Outshine competitors in the legal field."

- **Why is it important?**
  - It helps search engines and users understand the page. It should be between 50–160 characters for optimal performance.

- **Use Case:**
  - If the description is missing or too long, the script flags it and suggests improvements.
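
As a rough illustration of the check described above, the sketch below extracts the meta description with Python's standard-library `html.parser` and tests its length. The class and function names are hypothetical, the notebook itself uses BeautifulSoup, and the 50 to 160 bounds are the ones quoted in the text.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_description = (attr.get("name") or "").lower() == "description"
        if tag == "meta" and is_description and self.description is None:
            self.description = attr.get("content") or ""


def check_meta_description(html, min_len=50, max_len=160):
    """Return a list of issues for the page's meta description (hypothetical helper)."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    if parser.description is None:
        return ["Missing <meta name='description'> tag."]
    length = len(parser.description.strip())
    if length < min_len or length > max_len:
        return [f"Meta description length {length} is outside {min_len}-{max_len}."]
    return []


sample_html = (
    '<head><meta content="Explore our Law Firm SEO Services and take the lead '
    "in the digital courtroom of search engines! Outshine competitors in the "
    'legal field." name="description"/></head>'
)
print(check_meta_description(sample_html))
print(check_meta_description("<head></head>"))
```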

---

#### **5. Canonical Link**
Example:
```
<link href="https://thatware.co/law-firm-seo-services/" rel="canonical"/>
```

- **What is this?**
  - The canonical tag tells search engines the preferred version of the URL.
  - For example, `https://thatware.co/law-firm-seo-services/` is the main URL for this page.

- **Why is it important?**
  - It prevents search engines from indexing duplicate pages and ensures the correct page ranks higher.

- **Use Case:**
  - Missing canonical links can cause duplicate content issues, harming SEO.
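
A canonical check can be sketched the same way. The example below is an assumption-laden illustration using only the standard library: `CanonicalParser` and `check_canonical` are hypothetical names, and comparing the canonical against the page URL is an extra review aid rather than something the notebook is known to do.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def check_canonical(html, page_url):
    """Return a list of canonical-related issues (hypothetical helper)."""
    parser = CanonicalParser()
    parser.feed(html)
    if not parser.canonicals:
        return ["Missing <link rel='canonical'> tag."]
    issues = []
    if len(parser.canonicals) > 1:
        issues.append("Multiple canonical links found.")
    # A canonical that points elsewhere can be intentional, so treat this as a prompt to review.
    if parser.canonicals[0].rstrip("/") != page_url.rstrip("/"):
        issues.append(f"Canonical points elsewhere: {parser.canonicals[0]}")
    return issues


page = "https://thatware.co/law-firm-seo-services/"
html = '<head><link href="https://thatware.co/law-firm-seo-services/" rel="canonical"/></head>'
print(check_canonical(html, page))
```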

---

#### **6. Validation Status**
Example:
```
Validation Status: Valid
```

- **What is this?**
  - After analyzing the HTML, the script checks if the page meets SEO standards.
  - "Valid" means the page has no major issues.

- **Why is this step important?**
  - It confirms that the page is ready for deployment without any critical SEO problems.

- **Use Case:**
  - Valid pages can be deployed directly. Invalid pages need fixes before publishing.

---

#### **7. Log Updates**
Example:
```
Log updated for URL: https://thatware.co/law-firm-seo-services/
```

- **What is this?**
  - After processing each page, the script updates a log file with the results.

- **Why is it important?**
  - It keeps track of which pages were analyzed, their validation status, and any issues.

- **Use Case:**
  - This log can be shared with developers or clients for transparency and further action.
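
To make the log format concrete, here is a minimal sketch of one log entry and the incremental save. The field names mirror the `logs` dictionary built in `render_page_and_optimize`; the values are made up, and `save_logs_sketch` is a hypothetical stand-in because the body of `save_logs` is not shown in this part of the notebook.

```python
import json
import os
import tempfile


def save_logs_sketch(logs, path):
    """Write the full list of log entries as JSON (hypothetical stand-in for save_logs)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(logs, fh, indent=4)


# Illustrative entry: same keys as the Part 3 code, invented values.
entry = {
    "url": "https://example.com/page/",
    "status": "Valid",
    "issues": [],
    "time_taken": 1.23,
    "file_name": "example.html",
}

log_path = os.path.join(tempfile.mkdtemp(), "render_log.json")
all_logs = []
all_logs.append(entry)
save_logs_sketch(all_logs, log_path)  # Rewritten after every URL, as in process_pages

with open(log_path, encoding="utf-8") as fh:
    reloaded = json.load(fh)
print(reloaded[0]["status"])
```

Rewriting the whole file after each URL costs a little I/O but means a crash midway still leaves a readable log of everything processed so far.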

---

### **Summary of Key Steps**
1. **URL Processing:** Loading and analyzing each webpage.
2. **HTML Rendering:** Displaying the actual content and metadata.
3. **Title and Meta Analysis:** Ensuring these elements are optimized for search engines.
4. **Canonical Tag Check:** Preventing duplicate content issues.
5. **Validation:** Marking pages as "Valid" or flagging them for issues.
6. **Logging:** Documenting the results for tracking and reporting.

---

### **How to Use This Output?**
- **Fix Issues:** Review flagged issues like missing titles, long descriptions, or duplicate content and apply the suggested fixes.
- **Deploy Valid Pages:** Move validated pages to the live environment for improved SEO performance.
- **Monitor Progress:** Use the logs to track which pages were processed and their current SEO status.

---


---
# **4th Part: Comprehensive SEO Report Generation**
**Purpose:**
Analyzes the optimized static HTML files to generate a detailed SEO report, highlighting issues and providing actionable recommendations.

**Key Features:**
1. **Analyze Common Issues:** Identifies missing or invalid elements, such as:
   - `<meta name="description">` tags.
   - `<title>` tags that are too short or too long.
   - `<h1>` tags and structured data (JSON-LD).
2. **Generate Reports:**
   - A detailed log for each file, highlighting specific issues.
   - A summary report with overall statistics (e.g., total URLs, valid/invalid URLs, most common issues).
   - Actionable recommendations to fix the identified issues.

**Why It’s Important:**
The report helps track progress, measure SEO improvements, and prioritize fixes for common issues.

---


In [None]:
import os  # For file operations
import json  # For reading and writing JSON files
from bs4 import BeautifulSoup  # For parsing and analyzing static HTML files

# File paths for input and output
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/render_log.json"
STATIC_HTML_OUTPUT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/static_html"
FINAL_REPORT_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/seo_analysis_report.json"

def analyze_static_html(file_path):
    """
    Analyze a static HTML file for SEO improvements.
    Parameters:
        file_path (str): Path to the static HTML file.
    Returns:
        list: Insights and issues found in the HTML.
    """
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            soup = BeautifulSoup(file.read(), "html.parser")

        issues = []

        # Check for meta description
        if not soup.find("meta", {"name": "description"}):
            issues.append("Missing <meta name='description'> tag.")

        # Check that a title exists and is 30-60 characters long
        title_tag = soup.find("title")
        title_length = len(title_tag.text.strip()) if title_tag else 0
        if not 30 <= title_length <= 60:
            issues.append("Invalid or missing <title> tag.")

        # Check for missing alt attributes in images
        # (alt="" is valid for decorative images, so only an absent attribute is flagged)
        for img in soup.find_all("img"):
            if img.get("alt") is None:
                issues.append("Missing alt attribute in <img> tag.")

        # Check for missing canonical link
        if not soup.find("link", {"rel": "canonical"}):
            issues.append("Missing <link rel='canonical'> tag.")

        # Check for missing H1 tag
        if not soup.find("h1"):
            issues.append("Missing <h1> tag.")

        # Check for structured data
        if not soup.find("script", {"type": "application/ld+json"}):
            issues.append("Missing structured data (JSON-LD).")

        return issues
    except Exception as e:
        return [f"Error analyzing HTML: {str(e)}"]

def generate_seo_report(log_file_path, static_html_dir, output_report_path):
    """
    Generate a comprehensive SEO report based on rendering logs and static HTML files.
    Parameters:
        log_file_path (str): Path to the rendering log JSON file.
        static_html_dir (str): Directory containing static HTML files.
        output_report_path (str): Path to save the final report.
    """
    try:
        # Validate log file existence
        if not os.path.exists(log_file_path):
            print(f"Error: Log file not found -> {log_file_path}")
            return

        # Load rendering logs
        with open(log_file_path, "r", encoding="utf-8") as file:
            logs = json.load(file)

        # Initialize the SEO report structure
        report = {
            "total_urls": 0,
            "valid_urls": 0,
            "invalid_urls": 0,
            "common_issues": {},
            "detailed_logs": [],
        }

        issue_counts = {}  # Track frequency of issues
        for log in logs:
            url = log["url"]
            report["total_urls"] += 1
            if log["status"] == "Valid":
                report["valid_urls"] += 1
            else:
                report["invalid_urls"] += 1

            # Analyze the corresponding static HTML file
            static_file_path = os.path.join(static_html_dir, log.get("file_name", ""))
            if os.path.isfile(static_file_path):  # Ensure it's a valid file
                html_issues = analyze_static_html(static_file_path)
                log["html_issues"] = html_issues

                # Count issues
                for issue in html_issues:
                    issue_counts[issue] = issue_counts.get(issue, 0) + 1
            else:
                log["html_issues"] = ["Static HTML file missing or is a directory."]

            report["detailed_logs"].append(log)

        # Add common issues to the report
        report["common_issues"] = issue_counts

        # Generate actionable recommendations
        recommendations = []
        if "Missing <meta name='description'> tag." in issue_counts:
            recommendations.append("Add <meta name='description'> tags for better SEO.")
        if "Invalid or missing <title> tag." in issue_counts:
            recommendations.append("Ensure <title> tags are between 30-60 characters.")
        if "Missing alt attribute in <img> tag." in issue_counts:
            recommendations.append("Add alt attributes to all <img> tags.")
        if "Missing <link rel='canonical'> tag." in issue_counts:
            recommendations.append("Include <link rel='canonical'> tags to avoid duplicate content.")
        if "Missing <h1> tag." in issue_counts:
            recommendations.append("Include a primary <h1> tag for better SEO.")
        if "Missing structured data (JSON-LD)." in issue_counts:
            recommendations.append("Add structured data for richer search engine results.")

        report["recommendations"] = recommendations

        # Save the report to a JSON file
        with open(output_report_path, "w", encoding="utf-8") as file:
            json.dump(report, file, indent=4)

        # Display a summary preview
        print("\n===== SEO Report Summary =====")
        print(f"Total URLs Processed: {report['total_urls']}")
        print(f"Valid URLs: {report['valid_urls']}")
        print(f"Invalid URLs: {report['invalid_urls']}")
        print("\nCommon Issues:")
        for issue, count in report["common_issues"].items():
            print(f"- {issue}: {count} occurrences")
        print("\nRecommendations:")
        for rec in report["recommendations"]:
            print(f"- {rec}")
        print("===================================")
        print(f"SEO report saved to {output_report_path}")

    except Exception as e:
        print(f"Error generating SEO report: {e}")

# Execute the SEO analysis
if __name__ == "__main__":
    generate_seo_report(LOG_FILE_PATH, STATIC_HTML_OUTPUT_DIR, FINAL_REPORT_PATH)



===== SEO Report Summary =====
Total URLs Processed: 189
Valid URLs: 97
Invalid URLs: 92

Common Issues:
- Missing structured data (JSON-LD).: 97 occurrences

Recommendations:
- Add structured data for richer search engine results.
SEO report saved to /content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/seo_analysis_report.json


# **Understanding the Output**
---
# **SEO Report Summary**

#### **1. "Total URLs Processed: 189"**
- **What this means:**
  - The report covers **189 URLs**, one per entry in the rendering log. Part 3 loads these URLs from the input CSV file.
  - These URLs represent pages that were checked to see if they meet SEO standards.

- **Use Case:**
  - This tells you the total number of pages under consideration for improving their search engine visibility.

---

#### **2. "Valid URLs: 97"**
- **What this means:**
  - Out of 189 URLs, **97 pages were marked as 'Valid'**, meaning these pages passed the SEO checks and are considered optimized.
  - These pages have the required metadata, structure, and other SEO components.

- **Use Case:**
  - These pages passed the Part 3 validation and can be deployed. Note that the same report lists 97 occurrences of missing structured data, so passing validation does not mean there is nothing left to improve.

---

#### **3. "Invalid URLs: 92"**
- **What this means:**
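  - **92 URLs were not marked "Valid".** The report counts every log entry whose status is not `Valid`, which covers both pages that failed validation (`Invalid`) and pages that could not be processed at all (`Error`).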
  - These pages might have issues like:
    - Missing important metadata (e.g., titles, meta descriptions, structured data).
    - Problems rendering the content (e.g., server issues, slow loading, or dynamic content not loading properly).
    - Incomplete or improper HTML structure.

- **Real-World Cause:**
  - Some of these URLs failed because the server didn’t respond or load the content during the rendering process in **Part 3** of the code.
  - This can happen due to:
    - **Server issues:** Temporary downtime or unresponsiveness.
    - **Dynamic content:** Pages rely on JavaScript to load content, but the script couldn’t fully render them.

- **Use Case:**
  - These pages require attention to fix the specific issues preventing them from being valid.
  - If the issue is server-related, you can reprocess these URLs when the server is functioning properly.
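
Because the Part 3 code starts every log entry with the status `Error` and only switches it to `Valid` or `Invalid` after validation, the rendering log can be used to separate pages worth re-rendering from pages that need content fixes. A minimal sketch, assuming the log has been loaded into a list of dictionaries; the helper name and sample entries are hypothetical:

```python
def urls_to_reprocess(logs):
    """Split non-valid log entries into render errors and validation failures."""
    errors = [log["url"] for log in logs if log.get("status") == "Error"]
    invalid = [log["url"] for log in logs if log.get("status") == "Invalid"]
    return errors, invalid


# Made-up entries with the same keys the rendering log uses.
sample_logs = [
    {"url": "https://example.com/a/", "status": "Valid", "issues": []},
    {"url": "https://example.com/b/", "status": "Error", "issues": ["Error: timeout"]},
    {"url": "https://example.com/c/", "status": "Invalid", "issues": ["Missing title"]},
]

errors, invalid = urls_to_reprocess(sample_logs)
print("Re-render when the server is stable:", errors)
print("Fix the page content, then re-validate:", invalid)
```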

---

#### **4. "Common Issues: Missing structured data (JSON-LD): 97 occurrences"**
- **What this means:**
  - Out of all the URLs, **97 pages** are missing **structured data** in JSON-LD format.
  - **Structured data** is a special format of metadata that helps search engines understand the content of a page better.

- **Why this matters:**
  - Structured data can enable features like:
    - **Rich snippets** (e.g., star ratings, product prices) in search results.
    - **Better visibility** on Google.
    - Improved indexing and ranking by search engines.

- **Use Case:**
  - You should prioritize adding structured data to these pages to boost their SEO performance.
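
One detail about how "occurrences" are counted: in the Part 4 code, most issues are appended once per file, but the missing-alt issue is appended once per image, so its count can exceed the number of pages. The sketch below uses `collections.Counter` to show both ways of tallying; `tally_issues` and the sample data are hypothetical.

```python
from collections import Counter


def tally_issues(issues_per_file, per_page=False):
    """Count issue strings across files.

    With per_page=True each issue is counted at most once per file.
    """
    counts = Counter()
    for issues in issues_per_file:
        counts.update(set(issues) if per_page else issues)
    return counts


sample = [
    ["Missing structured data (JSON-LD)."],
    [
        "Missing structured data (JSON-LD).",
        "Missing alt attribute in <img> tag.",
        "Missing alt attribute in <img> tag.",
    ],
]

occurrences = tally_issues(sample)
pages_affected = tally_issues(sample, per_page=True)
for issue, count in occurrences.items():
    print(f"- {issue}: {count} occurrences on {pages_affected[issue]} page(s)")
```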

---

### **Recommendations:**

#### **1. Add Structured Data for Richer Search Engine Results**
- **What this recommendation means:**
  - You should add **JSON-LD structured data** to your web pages. JSON-LD is a format for describing your page content (like products, reviews, or articles) in a way that search engines can understand.

- **Why it is important:**
  - **Improves visibility:** Structured data helps search engines like Google display rich information (like FAQs, ratings, and reviews) directly on the search results page.
  - **Can support rankings indirectly:** Structured data does not guarantee a higher position, but it helps search engines interpret the page and makes it eligible for rich results.
  - **Increases click-through rates:** Rich snippets attract more clicks because they provide additional information at a glance.

---

### **Explaining JSON-LD and Why It’s Important**

#### **What is JSON-LD?**
- JSON-LD stands for **JavaScript Object Notation for Linked Data**.
- It’s a simple, lightweight way to describe the structure of your content in a format search engines understand.

#### **How JSON-LD Works:**
- It adds context to your webpage by describing key elements, such as:
  - **Title:** What the page is about.
  - **Description:** A summary of the content.
  - **Author:** Who created the page.
  - **Image:** Visual content on the page.
  - **Type of content:** Whether the page is an article, product, recipe, etc.

#### **Example of JSON-LD for a Web Page:**
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "name": "Law Firm SEO Services",
  "description": "Explore our Law Firm SEO Services and take the lead in the digital courtroom of search engines! Outshine competitors in the legal field.",
  "url": "https://thatware.co/law-firm-seo-services/"
}
</script>
```
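
A snippet like the one above can also be generated from Python rather than written by hand. The following is a minimal sketch, assuming a simple `WebPage` type with the same three properties; `build_webpage_jsonld` is a hypothetical helper and not part of the notebook.

```python
import json


def build_webpage_jsonld(name, description, url):
    """Return a <script type="application/ld+json"> block for a WebPage."""
    data = {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "name": name,
        "description": description,
        "url": url,
    }
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a value containing "</script>" cannot close the tag early.
    # "\/" is a valid JSON escape for "/", so the data is unchanged when parsed.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = build_webpage_jsonld(
    "Law Firm SEO Services",
    "Explore our Law Firm SEO Services and take the lead in the digital "
    "courtroom of search engines! Outshine competitors in the legal field.",
    "https://thatware.co/law-firm-seo-services/",
)
print(snippet)
```

Building the dictionary first and serializing it with `json.dumps` avoids hand-written quoting mistakes, which are a common reason structured data fails to parse.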

#### **Benefits of Adding JSON-LD:**
1. **Enables Rich Snippets:**
   - Displays additional information (like FAQs or product ratings) directly in search results.
2. **Improves Understanding:**
   - Search engines understand your content better and index it more accurately.
3. **May Support Search Rankings:**
   - Rich results can earn more clicks, although structured data by itself does not guarantee a higher position.
4. **Supports Voice Search:**
   - Structured data helps voice assistants like Alexa or Google Assistant retrieve accurate answers from your website.

---

### **What Should You Do Next?**

1. **Fix Invalid URLs:**
   - Re-run the rendering process for the invalid URLs (especially the ones affected by server issues) when the server is stable.
   - Analyze why these pages failed and fix any technical or SEO-related issues.

2. **Add Structured Data:**
   - For the 97 pages missing structured data, implement JSON-LD according to the type of content (e.g., articles, products, FAQs).
   - Use tools like Google’s [Structured Data Markup Helper](https://www.google.com/webmasters/markup-helper/) to generate JSON-LD.

3. **Reprocess and Validate:**
   - After making the fixes, reprocess the URLs to check if the structured data and other improvements are recognized.

4. **Share Recommendations with Developers:**
   - Provide the JSON report and this explanation to your developers or SEO team for implementation.
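
When reprocessing, it can help to confirm not only that a JSON-LD block is present but also that it parses as JSON. The sketch below does this with the standard library only; `JsonLdParser` and `check_jsonld` are hypothetical names, and the check covers syntax, not whether the schema.org properties are appropriate.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def check_jsonld(html):
    """Return issues for missing or syntactically invalid JSON-LD (hypothetical helper)."""
    parser = JsonLdParser()
    parser.feed(html)
    if not parser.blocks:
        return ["Missing structured data (JSON-LD)."]
    issues = []
    for raw in parser.blocks:
        try:
            json.loads(raw)
        except json.JSONDecodeError as exc:
            issues.append(f"Invalid JSON-LD: {exc.msg}")
    return issues


good = '<script type="application/ld+json">{"@type": "WebPage"}</script>'
broken = '<script type="application/ld+json">{"@type": }</script>'
print(check_jsonld(good))
print(check_jsonld(broken))
```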

---

### **Why Is This Output Useful?**

1. **Guides Prioritization:**
   - Shows which pages are ready to deploy and which need fixing.
   - Highlights the most common issues to focus on.

2. **Improves SEO Results:**
   - Following the recommendations (like adding JSON-LD) ensures better performance in search engines.

3. **Tracks Progress:**
   - The saved JSON report helps track which issues were resolved and which remain outstanding.

---


---
# **5th Part: Preparing Static Files for Deployment**
**Purpose:**
Filters and prepares validated static HTML files for deployment, ensuring only optimized files are deployed to the server.

**Key Features:**
1. **Filter Valid Files:** Only deploys files marked as "Valid" in the rendering log.
2. **Deployment Directory Setup:** Copies validated files to a separate deployment directory for easy management.
3. **Log Deployment Summary:** Provides a summary of the number of files deployed and skipped.

**Why It’s Important:**
This part ensures that only high-quality, SEO-optimized files are served to search engine bots, enhancing crawlability and indexability.

---



In [None]:
import os
import shutil
import json

# Paths for deployment
STATIC_HTML_OUTPUT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/static_html"
DEPLOYMENT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/deployment"
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/render_log.json"

def prepare_for_deployment(source_dir, deployment_dir, log_file_path):
    """
    Prepare validated static HTML files for deployment.
    - Filters files based on valid URLs from the rendering log.
    - Copies validated `.html` files to the deployment directory.
    - Leaves the source directory unchanged; other `.html` files are only counted as skipped.
    """
    # Check if source directory exists
    if not os.path.exists(source_dir):
        print(f"Error: Source directory not found -> {source_dir}")
        return

    # Check if rendering log exists
    if not os.path.exists(log_file_path):
        print(f"Error: Log file not found -> {log_file_path}")
        return

    # Load rendering logs
    with open(log_file_path, "r", encoding="utf-8") as file:
        logs = json.load(file)

    # Extract valid file names from logs
    valid_files = {log.get("file_name") for log in logs if log.get("status") == "Valid"}

    # Ensure deployment directory exists
    os.makedirs(deployment_dir, exist_ok=True)

    total_copied = 0
    total_skipped = 0

    # Copy only valid files to the deployment directory
    for file_name in os.listdir(source_dir):
        if file_name.endswith(".html"):
            source_path = os.path.join(source_dir, file_name)
            destination_path = os.path.join(deployment_dir, file_name)

            if file_name in valid_files:
                try:
                    shutil.copy(source_path, destination_path)
                    print(f"Deployed: {file_name}")
                    total_copied += 1
                except Exception as e:
                    print(f"Error deploying {file_name}: {e}")
            else:
                total_skipped += 1

    # Print deployment summary
    print("\n===== Deployment Summary =====")
    print(f"Static HTML files prepared for deployment in: {deployment_dir}")
    print(f"Total Files Deployed: {total_copied}")
    print(f"Total Files Skipped: {total_skipped}")

if __name__ == "__main__":
    prepare_for_deployment(STATIC_HTML_OUTPUT_DIR, DEPLOYMENT_DIR, LOG_FILE_PATH)


Deployed: -3386425097208935956.html
Deployed: -3943956137625275236.html
Deployed: -4209779895804168791.html
Deployed: -3590266196621279101.html
Deployed: 2708823915312850801.html
Deployed: -9136584530038659594.html
Deployed: 7965246990520139191.html
Deployed: -1222851262758936397.html
Deployed: -6442093507368630876.html
Deployed: -8437106891577471860.html
Deployed: -6883596211709326858.html
Deployed: 4497751070710947186.html
Deployed: 5346949977711324150.html
Deployed: -3182646748915117410.html
Deployed: -3014337555139088171.html
Deployed: -6891309348135913500.html
Deployed: 1012808543794521966.html
Deployed: -4021465429612074244.html
Deployed: -347125982437071754.html
Deployed: 1949078521591455488.html
Deployed: -261387407035723478.html
Deployed: 37304185910348639.html
Deployed: 7392761429958303643.html
Deployed: -2113772602602928879.html
Deployed: -8410259733810903180.html
Deployed: 1363021383528680333.html
Deployed: -6385119576274227100.html
Deployed: 133375392147165847.html
Deploye

# **Understanding the Deployment Summary Output**


---

### **Deployment Output**

#### **Deployed Files**
- **What this means:**
  - The list of deployed files (e.g., `-3386425097208935956.html`) represents the HTML pages that were successfully prepared for deployment.
  - These files are now available in the deployment directory at:
    ```
    /content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/deployment
    ```
  - Each file corresponds to a unique webpage (e.g., a blog post, service page, or product page) from your website.

- **Why these files were deployed:**
  - These files passed the SEO checks and validation processes in earlier steps.
  - They were deemed **ready** to be deployed because they:
    - Contain necessary metadata like `<title>`, `<meta name="description">`, and `<link rel="canonical">`.
    - Rendered successfully without any errors.

- **Use Case:**
  - These deployed files are ready to be served to search engines for indexing.
  - They are optimized and can improve your website’s visibility on Google and other search engines.
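
The numeric file names come from `hash(url)` in Part 3. Python salts string hashes per interpreter process unless `PYTHONHASHSEED` is fixed, so a new session can produce a different name for the same URL, which makes it hard to map a deployed file back to its page. One possible alternative, sketched here under the assumption that a 16-character hex prefix is unique enough for a site of this size, is to derive the name from a SHA-256 digest. `stable_file_name` is a hypothetical helper, not the notebook's code.

```python
import hashlib


def stable_file_name(url):
    """Derive a file name from the URL that is identical in every Python process."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{digest[:16]}.html"  # 16 hex characters keep the name short


url = "https://thatware.co/law-firm-seo-services/"
print(stable_file_name(url))
```

With a deterministic name, the request-time lookup (URL to static file) needs no stored mapping: the server can recompute the name from the requested URL.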

---

#### **"Static HTML files prepared for deployment in: /content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/deployment"**
- **What this means:**
  - This is the directory into which the validated, optimized HTML files were copied from the static HTML output directory.
  - Think of it as the final storage location for your web pages that will be uploaded to a server.

- **Use Case:**
  - You can take these files and upload them to your live website or a Content Delivery Network (CDN).
  - They are now ready to be accessed by visitors or crawled by search engines.

---

# **Understanding the Importance of "Successfully Deployed Pages"**

This section explains, in simple terms, the significance of these **97 successfully deployed pages**.

---

### **What Are These Deployed Pages?**
- These are web pages from your website that:
  - Passed all **validation checks** during processing.
  - Were optimized for **search engine friendliness**.
  - Are now **ready for deployment** (publishing or serving live on your website or a server).
  - Contain critical elements like a properly formatted `<title>`, `<meta name="description">`, and `<link rel="canonical">`.

---

### **Why Are These Pages Important?**
These 97 pages are **the heart of your project**. Here’s why:

#### 1. **SEO-Optimized Pages Help You Rank on Google**
   - These pages are tailored to meet search engine requirements (like Google’s SEO standards).
   - They include critical metadata (like a well-written `<title>` and `<meta name="description">`), which helps search engines understand:
     - What your page is about.
     - Whether it’s relevant to a user’s query.
   - Better understanding by search engines = **higher chances of ranking on Google**.

#### 2. **Improves User Experience (UX)**
   - These pages are structured to load quickly and display relevant content.
   - They are optimized to ensure the user sees the information they need (like product descriptions, blog articles, or service details) without confusion.
   - **Happier users mean they’re more likely to stay on your site longer.**

#### 3. **Foundation for Website Growth**
   - These pages act as a base for your website’s online presence.
   - You can expand and build new pages using the same optimized structure.

#### 4. **Prepared for Search Engine Indexing**
   - These files are ready to be crawled and indexed by search engines.
   - Once indexed, they’ll appear in search results for relevant keywords.
   - This leads to **organic traffic**—people finding your website for free by searching.

---

### **Where and How to Use These Pages?**

#### **1. Host Them on Your Website**
   - These pages are meant to be uploaded to your live server or Content Management System (CMS).
   - Examples:
     - If your website is hosted on platforms like **WordPress**, you can integrate these pages into the site.
     - If you use a custom server, upload them to the corresponding directories.

#### **2. Deploy Them on a CDN (Content Delivery Network)**
   - A CDN helps load pages faster for users across the world.
   - Example: Use platforms like **Cloudflare** or **AWS CloudFront** to serve these pages to users globally, improving their experience.

#### **3. Submit Them to Search Engines**
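   - Use tools like **Google Search Console** to submit these pages’ URLs or a sitemap that lists them.
   - Once submitted:
     - Google can crawl the pages and decide whether to index them; submission requests a crawl but does not guarantee indexing.
     - Indexed pages can appear in search results, driving organic traffic to your website.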

#### **4. Use Them in Marketing Campaigns**
   - Share the deployed pages as part of your advertising or social media campaigns.
   - Example:
     - A service page (e.g., `law-firm-seo-services.html`) can be used in email campaigns targeting law firms looking for SEO solutions.
     - Product pages can be shared on platforms like LinkedIn or Twitter to attract attention.

---

### **The Benefits of These Successfully Deployed Pages**

1. **Boosts Online Visibility:**
   - These pages are like billboards on Google’s highway—they’ll attract more visitors to your site.

2. **Increases Traffic to Your Website:**
   - When these pages rank higher in search results, more people click on them, leading to increased website traffic.

3. **Enhances Credibility:**
   - Optimized pages make your website look professional and trustworthy.
   - Visitors are more likely to convert into customers.

4. **Saves Time and Effort:**
   - Since these pages are validated and ready to go, you don’t have to spend additional time fixing issues.

5. **Drives Conversions and Revenue:**
   - High-quality, well-structured pages can convert visitors into paying customers by providing relevant information quickly.

---

### **How These Pages Relate to the Entire Project**

- **Purpose of Writing All That Code:**
  - The primary goal of the project was to take raw web pages (some incomplete or poorly formatted) and turn them into **high-quality, SEO-ready pages**.
  - Every step of the process—from reading URLs in a sitemap to rendering, validating, and deploying—was aimed at achieving this outcome.

- **Final Outcome:**
  - These 97 deployed pages are the result of all the hard work.
  - They represent the **optimized and error-free content** that search engines and users both love.

---

### **What to Do Next With These Pages?**

#### Step 1: **Upload to Your Server**
   - Use an FTP client or your hosting provider’s control panel to upload these HTML files.
   - Ensure they are accessible via your website’s URLs.

#### Step 2: **Test Live Pages**
   - After uploading, open the URLs in a browser to ensure everything works correctly.
   - Use a tool like **Lighthouse** in Chrome DevTools to confirm the pages display properly on all devices (Google has retired its standalone Mobile-Friendly Test).

#### Step 3: **Submit to Google**
   - Log in to **Google Search Console**.
   - Add your website’s sitemap containing these deployed pages.
   - This will notify Google to crawl and index your site.
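
If you do not already have a sitemap that lists the deployed pages, one can be generated with Python's standard library. The sketch below is only an illustration: the function name `build_sitemap` and the example URLs are assumptions, not part of this project's code.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(page_urls):
    """Return sitemap XML (UTF-8 bytes) with one <url><loc> entry per page URL."""
    ET.register_namespace("", SITEMAP_NS)  # serialize the namespace without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url in page_urls:
        url_element = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc_element = ET.SubElement(url_element, f"{{{SITEMAP_NS}}}loc")
        loc_element.text = page_url  # ElementTree escapes characters such as "&"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

# Hypothetical URLs; replace them with the addresses of your deployed pages
deployed = [
    "https://www.example.com/law-firm-seo-services.html",
    "https://www.example.com/contact.html",
]
print(build_sitemap(deployed).decode("utf-8"))
```

Write the returned bytes to `sitemap.xml` (opened in `"wb"` mode), upload the file next to the pages, and submit its URL in Search Console.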

#### Step 4: **Monitor Performance**
   - Use analytics tools like **Google Analytics** to track:
     - Traffic coming to these pages.
     - User behavior (time spent on page, bounce rates, etc.).

#### Step 5: **Iterate and Improve**
   - Use insights from the deployed pages to:
     - Fix skipped files (189 skipped pages).
     - Create new pages following the same optimization standards.

---

### **Final Thoughts**

The 97 successfully deployed pages are **the backbone of your website's SEO efforts**. They are now ready to help you:
- Attract more visitors.
- Improve your search rankings.
- Drive business growth.

Treat these pages as a **milestone**: a tangible outcome of the project’s success. Celebrate this achievement while working on the skipped files to ensure even more success in the future!

---

#### **"Total Files Skipped: 189"**
- **What this means:**
  - **189 files were skipped** because they didn’t pass the validation checks.
  - Reasons files might be skipped:
    1. **Rendering issues:**
       - Some files couldn’t be rendered properly, often due to server issues or dynamic content failing to load.
    2. **Missing metadata:**
       - These pages lacked key SEO elements like `<meta name="description">`, `<title>`, or `<link rel="canonical">`.
    3. **Errors in structure:**
       - The HTML structure was incomplete or contained errors.
    4. **Server or connection problems:**
       - The page didn’t load because the server was temporarily down or slow.

- **Why this happened:**
  - Many of these files were skipped because the **server didn’t respond properly** or the content couldn’t be loaded dynamically during the rendering phase (the third part of the process).
  - When the server doesn’t provide a response, the rendering tool skips those files.

- **Use Case:**
  - You need to review these skipped files, identify the specific issues, and fix them. For server-related issues, reprocess these URLs when the server is stable.
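
To see which problems are most common before fixing anything, you can tally the issues recorded for the skipped pages. This is a minimal sketch that assumes the log is a list of dictionaries with `url`, `status`, and `issues` keys, and that successful pages carry the status `"Success"`; adjust both assumptions to match your actual log file.

```python
from collections import Counter

def summarize_issues(log_entries):
    """Count how often each issue string appears across entries that did not succeed."""
    counts = Counter()
    for entry in log_entries:
        if entry.get("status") == "Success":
            continue  # only skipped or failed pages are of interest here
        issues = entry.get("issues") or ["(no issue recorded)"]
        counts.update(issues)
    return counts

# Hypothetical sample entries shaped like the per-URL log dictionaries
sample_log = [
    {"url": "https://www.example.com/a", "status": "Error",
     "issues": ["Missing <meta name='description'> tag."]},
    {"url": "https://www.example.com/b", "status": "Error",
     "issues": ["Missing <meta name='description'> tag.", "Invalid or missing <title> tag."]},
    {"url": "https://www.example.com/c", "status": "Success", "issues": []},
]

for issue, count in summarize_issues(sample_log).most_common():
    print(f"{count:>3}  {issue}")
```

Starting with the most frequent issue usually clears the largest share of skipped pages with the least effort.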

---

### **What To Do Next?**

#### 1. **Deploy the Valid Files (97 Files)**
   - These files are ready and can be uploaded to your web server.
   - They can improve your website’s SEO because each one passed the title, meta description, and canonical link checks.

#### 2. **Review and Fix Skipped Files (189 Files)**
   - Go through the logs or errors associated with these files to find out why they were skipped.
   - Common reasons to check:
     - Server connection issues: Ensure the server is responding.
     - Missing metadata: Add `<title>`, `<meta name="description">`, and other SEO elements.
     - Rendering issues: If JavaScript-heavy pages failed to load, debug them.

#### 3. **Reprocess the Skipped URLs**
   - After fixing the skipped files, rerun the rendering and validation process to generate optimized HTML for those pages.
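
One way to build the list of URLs to reprocess is to filter the rendering log. The sketch below assumes the CSV has the `url`, `html_file`, and `status` columns written by the rendering step, with `status` equal to `Success` for pages that rendered; the helper name `failed_urls` and the sample rows are hypothetical.

```python
import csv
import io

def failed_urls(csv_text):
    """Return the URLs whose status column is anything other than "Success"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["url"]
        for row in reader
        if (row.get("status") or "").strip() != "Success"
    ]

# Hypothetical log content using the url, html_file, status columns
sample_csv = """url,html_file,status
https://www.example.com/,/tmp/rendered_html/a.html,Success
https://www.example.com/pricing,,"Error: timeout, server did not respond"
https://www.example.com/blog,,Error: page crashed
"""

print(failed_urls(sample_csv))
```

For a real run, read the log file into a string (for example with `open(path, encoding="utf-8").read()`) and pass the resulting list back to the rendering step.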

#### 4. **Implement Recommendations for Skipped Files**
   - Focus on adding **structured data (JSON-LD)** to all pages, including the skipped ones.
   - Use Google’s **Rich Results Test** or the **Schema Markup Validator** (validator.schema.org), which replaced the retired Structured Data Testing Tool, to check your JSON-LD implementation.
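
Before sending pages to an online validator, a quick local check can catch the most basic JSON-LD mistakes. The sketch below uses only Python's standard library; the names `JsonLdExtractor` and `check_jsonld` are illustrative, and the check is deliberately rough (it only confirms that each block parses as JSON and has `@context` and `@type`), so it does not replace a real validator.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False

def check_jsonld(html):
    """Return a list of problems found in the page's JSON-LD (empty list means none found)."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ["No JSON-LD block found."]
    problems = []
    for index, raw in enumerate(parser.blocks, start=1):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"Block {index}: not valid JSON ({exc.msg}).")
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                problems.append(f"Block {index}: expected a JSON object.")
                continue
            for key in ("@context", "@type"):
                if key not in item:
                    problems.append(f"Block {index}: missing {key}.")
    return problems

page = '<head><script type="application/ld+json">{"@context": "https://schema.org", "@type": "Product"}</script></head>'
print(check_jsonld(page))
```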

---

### **JSON-LD and Recommendations**

#### **What is JSON-LD?**
- JSON-LD stands for **JavaScript Object Notation for Linked Data**.
- It’s a way of describing the content of your webpage in a format search engines can easily understand.

#### **Why Add JSON-LD?**
- **Richer Search Results:**
  - Pages with valid JSON-LD can be eligible for rich results in Google Search (e.g., star ratings, product details, FAQs), which make a listing more appealing.
- **Better Search Engine Understanding:**
  - JSON-LD helps search engines understand the page’s purpose and structure better.
- **Higher Click-Through Rates:**
  - Rich snippets lead to more clicks because they provide additional information upfront.
- **Improved Rankings:**
  - Structured data can indirectly improve your rankings by increasing click-through rates and engagement.

---

### **Example Use Case of JSON-LD**

For example, if a page provides information about a product, you can use JSON-LD to describe it:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "SEO Optimization Guide",
  "description": "Learn how to optimize your website for search engines.",
  "brand": {
    "@type": "Brand",
    "name": "ThatWare"
  },
  "offers": {
    "@type": "Offer",
    "price": "19.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>
```
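
If you have many product pages, writing this markup by hand gets tedious. A small helper can generate it from your data instead. The function name `product_jsonld_tag` and its parameters are assumptions made for illustration; the field values mirror the example shown here.

```python
import json

SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"

def product_jsonld_tag(name, description, brand, price, currency="USD",
                       availability="https://schema.org/InStock"):
    """Build a JSON-LD <script> tag describing a schema.org Product."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "description": description,
        "brand": {"@type": "Brand", "name": brand},
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
            "availability": availability,
        },
    }
    payload = json.dumps(data, indent=2)
    # Escape "</" so text inside the JSON cannot close the <script> element early
    payload = payload.replace("</", "<\\/")
    return f"{SCRIPT_OPEN}\n{payload}\n{SCRIPT_CLOSE}"

print(product_jsonld_tag(
    "SEO Optimization Guide",
    "Learn how to optimize your website for search engines.",
    "ThatWare",
    "19.99",
))
```

Generating the tag with `json.dumps` avoids hand-typed syntax errors such as trailing commas or unescaped quotes. The result can be placed inside the page's `<head>`.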

---

### **Summary**



1. **What Was Achieved:**
   - 97 pages were successfully optimized and are ready to improve your website’s performance on Google.
   - These pages are "SEO-ready" and can be published live.

2. **What Needs Attention:**
   - 189 pages were skipped because of errors, missing data, or server issues.
   - Fix these pages to ensure they meet the same quality as the deployed ones.

3. **Why JSON-LD Matters:**
   - Adding structured data (JSON-LD) will help Google better understand your pages and display them attractively in search results.
   - This can result in higher rankings, more clicks, and better engagement.

---





# **Dynamic Rendering for JavaScript SEO Model**

import os
import hashlib
import requests
import pandas as pd
from xml.etree import ElementTree as ET
from time import sleep
import random

# Selenium setup for rendering JavaScript-heavy webpages
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
except ImportError:
    # Install Selenium if it is missing
    print("Selenium is not installed. Installing it now...")
    os.system("pip install selenium")
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options

# File paths for necessary input/output files
SITEMAP_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/sitemap.xml"  # Location of the sitemap XML
OUTPUT_HTML_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendered_html"  # Directory to save rendered HTML
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendering_log.csv"  # File to save logs

# Constants for controlling request behavior
VALID_CONTENT_TYPES = ["text/html"]  # Only accept URLs with HTML content
EXCLUDED_EXTENSIONS = [".css", ".js", ".png", ".jpg", ".webp", ".pdf"]  # Ignore non-HTML file types
RETRY_COUNT = 3  # Number of retries for failed requests
RETRY_DELAY = 5  # Base delay (in seconds) between retries
TIMEOUT = 30  # Timeout for each request in seconds

# List of different User-Agent strings to mimic real browsers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]

def setup_webdriver():
    """
    Sets up Selenium WebDriver to render dynamic content (like JavaScript-heavy webpages).
    Why: Some websites dynamically generate content using JavaScript, which requires a browser environment to load.
    """
    print("Setting up Selenium WebDriver...")
    os.system("apt-get update")  # Update system packages
    os.system("apt-get install -y chromium-chromedriver")  # Install Chrome and ChromeDriver
    os.environ["PATH"] += ":/usr/bin/"  # Ensure ChromeDriver is accessible in the system PATH

    options = Options()
    options.add_argument("--headless")  # Run browser without a visible UI
    options.add_argument("--no-sandbox")  # Required for some environments like Colab
    options.add_argument("--disable-dev-shm-usage")  # Prevent memory issues in virtual environments
    return webdriver.Chrome(options=options)  # Return the configured WebDriver instance

def is_webpage(url):
    """
    Validates if a given URL is a meaningful webpage.
    Why:
    - Ensures only relevant URLs (like HTML pages) are processed.
    - Avoids wasting time on non-HTML resources like images or scripts.
    """
    for attempt in range(RETRY_COUNT):  # Retry failed requests up to RETRY_COUNT times
        try:
            headers = {"User-Agent": random.choice(USER_AGENTS)}  # Use a random User-Agent to mimic browser behavior

            # Exclude URLs with undesired file extensions
            if any(url.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS):
                return False

            # Make a HEAD request to check content type
            response = requests.head(url, headers=headers, timeout=TIMEOUT)
            content_type = response.headers.get("Content-Type", "")
            if not any(valid_type in content_type for valid_type in VALID_CONTENT_TYPES):
                return False

            # Make a GET request to ensure the page contains meaningful content
            response = requests.get(url, headers=headers, timeout=TIMEOUT)
            if "<title>" not in response.text or "<meta" not in response.text:
                return False

            return True  # If all checks pass, the URL is valid
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1}/{RETRY_COUNT}: Error validating URL {url}: {e}")
            sleep(RETRY_DELAY * (attempt + 1))  # Wait longer after each failed attempt (linear backoff: 5s, 10s, 15s)
    return False  # Return False if all attempts fail

def load_urls_from_sitemap(sitemap_path):
    """
    Parses and validates URLs from a sitemap XML file.
    Why: The sitemap provides a list of URLs to be processed, but only valid URLs should be rendered.
    """
    valid_urls = []
    try:
        print("Parsing sitemap...")
        tree = ET.parse(sitemap_path)  # Load and parse the XML file
        root = tree.getroot()  # Get the root element of the XML

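        for url in root.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
            loc_element = url.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")  # Locate the <loc> tag
            if loc_element is None or not loc_element.text:
                continue  # Skip <url> entries that have no usable <loc>
            loc = loc_element.text.strip()  # Extract the URL, dropping surrounding whitespace
            if is_webpage(loc):  # Validate the URL
                valid_urls.append(loc)  # Add valid URLs to the list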
    except ET.ParseError as e:
        print(f"Error parsing sitemap: {e}")  # Handle invalid XML structure
    except Exception as e:
        print(f"Unexpected error: {e}")  # Handle other unexpected errors
    return valid_urls

def render_and_save_html(url, driver):
    """
    Renders a webpage and saves its HTML content.
    Why: Many modern websites rely on JavaScript to display content. Selenium renders these pages fully before saving.
    """
    try:
        print(f"Rendering: {url}")  # Log the URL being processed
        driver.get(url)  # Load the webpage in the browser
        sleep(3)  # Wait for JavaScript to load

        rendered_html = driver.page_source  # Extract the fully rendered HTML

        # Generate a unique filename based on the URL
        file_hash = hashlib.md5(url.encode()).hexdigest()
        file_path = os.path.join(OUTPUT_HTML_DIR, f"{file_hash}.html")
        os.makedirs(OUTPUT_HTML_DIR, exist_ok=True)  # Create output directory if it doesn't exist

        # Save the HTML content to the file
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(rendered_html)

        return {"url": url, "html_file": file_path, "status": "Success"}
    except Exception as e:
        print(f"Error rendering {url}: {e}")  # Log errors
        return {"url": url, "html_file": None, "status": f"Error: {e}"}

def process_urls_and_preview(urls):
    """
    Processes a list of URLs, renders each, and logs the results.
    Why:
    - Saves the rendered HTML for further use.
    - Provides a log of successes and failures for review.
    """
    driver = setup_webdriver()  # Initialize Selenium WebDriver
    results = []

    for url in urls:
        result = render_and_save_html(url, driver)  # Render and save each URL
        results.append(result)

    driver.quit()  # Close the WebDriver when done

    # Save the results to a CSV file
    log_df = pd.DataFrame(results)
    log_df.to_csv(LOG_FILE_PATH, index=False)
    print(f"Rendering log saved at: {LOG_FILE_PATH}")

    # Display a preview of the rendering log
    print("\nPreviewing first 10 rows of the rendering log:")
    print(log_df.head(10))

if __name__ == "__main__":
    # Main execution starts here
    urls = load_urls_from_sitemap(SITEMAP_PATH)  # Load and validate URLs from the sitemap
    if urls:
        process_urls_and_preview(urls)  # Render and log the validated URLs
    else:
        print("No valid URLs found in the sitemap.")
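
The retry loop in `is_webpage` sleeps for `RETRY_DELAY * (attempt + 1)` seconds, so the wait grows linearly (5, 10, 15 seconds). If you would rather back off exponentially, which eases pressure on a struggling server more quickly, the delay could be computed as in the sketch below. The helper name `backoff_delay` and the `max_delay` and `jitter` parameters are assumptions, not part of the code above.

```python
import random

def backoff_delay(attempt, base_delay=5.0, max_delay=60.0, jitter=0.0):
    """Seconds to wait before retrying; `attempt` is 0-based.

    The delay doubles with every attempt (base_delay * 2**attempt), is capped
    at max_delay, and gets up to `jitter` extra seconds of random spread.
    """
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay + random.uniform(0, jitter)

# Print the schedule for five attempts without actually sleeping
for attempt in range(5):
    print(attempt, backoff_delay(attempt))
```

To use it, the `sleep(...)` call in the retry loop would become `sleep(backoff_delay(attempt, base_delay=RETRY_DELAY))`.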


import os  # For file system operations like reading files and managing directories
import pandas as pd  # For organizing data into structured CSV reports
import json  # For saving summary data in a structured JSON format
from bs4 import BeautifulSoup  # For parsing and analyzing HTML content

# Define paths for input HTML directory and output reports
html_dir = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendered_html"
csv_output_path = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_report.csv"
json_output_path = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/metadata_analysis_summary.json"

def analyze_html_files():
    """
    Analyze and validate metadata of rendered HTML files.
    Generates:
    - A detailed CSV report with metadata analysis for each HTML file.
    - A JSON summary with overall statistics and actionable suggestions.
    """

    # Initialize data structures to store results and summary statistics
    analysis_results = []  # Will hold detailed analysis for each file
    summary = {
        "total_files": 0,  # Count of total HTML files processed
        "missing_titles": 0,  # Files missing <title> tag
        "invalid_titles": 0,  # Files with invalid <title> length
        "missing_meta_descriptions": 0,  # Files missing meta descriptions
        "invalid_meta_descriptions": 0,  # Files with invalid meta description length
        "missing_canonical_links": 0,  # Files missing canonical links
        "files_with_errors": 0  # Files that caused an error during processing
    }

    # Loop through all files in the specified HTML directory
    for file_name in os.listdir(html_dir):
        # Process only files with .html extension
        if file_name.endswith(".html"):
            file_path = os.path.join(html_dir, file_name)  # Full path to the file

            try:
                # Step 1: Read and parse the HTML file using BeautifulSoup
                # Why? To extract and analyze key metadata for SEO compliance.
                with open(file_path, "r", encoding="utf-8") as file:
                    html_content = file.read()
                soup = BeautifulSoup(html_content, "html.parser")

                # Step 2: Extract and validate the <title> tag
                # Why? Titles are crucial for SEO and should meet length guidelines (10-60 characters).
                # get_text() also handles empty titles or titles with nested tags, where .string is None
                title = soup.title.get_text(strip=True) if soup.title else "Missing"
                title_length = len(title) if title != "Missing" else 0
                if title == "Missing":
                    title_valid = "Invalid: Title is missing"
                    title_suggestion = "Add a title with 10-60 characters to improve SEO."
                elif 10 <= title_length <= 60:
                    title_valid = "Valid"
                    title_suggestion = "No changes needed."
                elif title_length < 10:
                    title_valid = f"Invalid: Title too short ({title_length} characters)"
                    title_suggestion = "Increase the title length to at least 10 characters."
                else:
                    title_valid = f"Invalid: Title too long ({title_length} characters)"
                    title_suggestion = "Shorten the title to 60 characters or less."

                # Step 3: Extract and validate the meta description
                # Why? Meta descriptions summarize the page content and influence click-through rates.
                meta_tag = soup.find("meta", attrs={"name": "description"})
                meta_description = (
                    meta_tag.get("content", "").strip()  # .get() avoids a KeyError when content is absent
                    if meta_tag else "Missing"
                )
                meta_desc_length = len(meta_description) if meta_description != "Missing" else 0
                if meta_description == "Missing":
                    meta_desc_valid = "Invalid: Meta description is missing"
                    meta_desc_suggestion = "Add a meta description with 50-160 characters."
                elif 50 <= meta_desc_length <= 160:
                    meta_desc_valid = "Valid"
                    meta_desc_suggestion = "No changes needed."
                elif meta_desc_length < 50:
                    meta_desc_valid = f"Invalid: Description too short ({meta_desc_length} characters)"
                    meta_desc_suggestion = "Increase the description length to at least 50 characters."
                else:
                    meta_desc_valid = f"Invalid: Description too long ({meta_desc_length} characters)"
                    meta_desc_suggestion = "Shorten the description to 160 characters or less."

                # Step 4: Extract the canonical link
                # Why? Canonical links prevent duplicate content issues and improve SEO.
                canonical_tag = soup.find("link", attrs={"rel": "canonical"})
                canonical_link = (
                    (canonical_tag.get("href", "").strip() or "Missing")  # Treat an empty or absent href as missing
                    if canonical_tag else "Missing"
                )
                canonical_suggestion = "Ensure a canonical link is added to prevent duplicate content issues." if canonical_link == "Missing" else "No changes needed."

                # Step 5: Check presence of key structural elements
                # Why? Ensuring the basic structure like <html>, <head>, and <body> is essential for a valid webpage.
                has_html = soup.html is not None
                has_head = soup.head is not None
                has_body = soup.body is not None

                # Update summary statistics
                summary["total_files"] += 1
                if title == "Missing":
                    summary["missing_titles"] += 1
                elif "Invalid" in title_valid:
                    summary["invalid_titles"] += 1
                if meta_description == "Missing":
                    summary["missing_meta_descriptions"] += 1
                elif "Invalid" in meta_desc_valid:
                    summary["invalid_meta_descriptions"] += 1
                if canonical_link == "Missing":
                    summary["missing_canonical_links"] += 1

                # Append results for this file to the analysis list
                analysis_results.append({
                    "File Name": file_name,
                    "Title": title,
                    "Title Validity": title_valid,
                    "Title Suggestion": title_suggestion,
                    "Meta Description": meta_description,
                    "Meta Description Validity": meta_desc_valid,
                    "Meta Description Suggestion": meta_desc_suggestion,
                    "Canonical Link": canonical_link,
                    "Canonical Suggestion": canonical_suggestion,
                    "Has <html>": has_html,
                    "Has <head>": has_head,
                    "Has <body>": has_body
                })

            except Exception as e:
                # Handle errors gracefully and log them in the summary
                summary["files_with_errors"] += 1
                analysis_results.append({
                    "File Name": file_name,
                    "Title": "Error",
                    "Meta Description": "Error",
                    "Canonical Link": "Error",
                    "Has <html>": False,
                    "Has <head>": False,
                    "Has <body>": False,
                    "Error": str(e)
                })
                print(f"Error processing file {file_name}: {e}")

    # Step 6: Save detailed analysis results to a CSV file
    # Why? To provide a structured and shareable report of the analysis.
    analysis_df = pd.DataFrame(analysis_results)
    analysis_df.to_csv(csv_output_path, index=False, encoding="utf-8")
    print(f"Detailed analysis saved to CSV: {csv_output_path}")

    # Step 7: Save summary statistics to a JSON file
    # Why? To provide a quick overview of the results and identify improvement areas.
    with open(json_output_path, "w", encoding="utf-8") as json_file:
        json.dump(summary, json_file, indent=4)
    print(f"Summary saved to JSON: {json_output_path}")

    # Display a preview of the first 40 rows of the analysis for quick review
    print("\n--- Analysis Preview (First 40 Rows) ---")
    print(analysis_df.head(40))

# Entry point for executing the script
if __name__ == "__main__":
    analyze_html_files()
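
The title and meta description checks in `analyze_html_files` follow the same pattern: flag a missing value, then compare its length against a minimum and maximum. As a possible refactoring, that pattern could live in one helper. The sketch below is not part of the notebook's code; the name `check_length` is hypothetical, and it uses `None` rather than the string `"Missing"` to mark an absent value.

```python
def check_length(text, min_len, max_len, label):
    """Return (validity, suggestion) for a text field; pass None when the field is absent."""
    field = label.lower()
    if text is None:
        return (f"Invalid: {label} is missing",
                f"Add a {field} with {min_len}-{max_len} characters.")
    length = len(text)
    if length < min_len:
        return (f"Invalid: {label} too short ({length} characters)",
                f"Increase the {field} length to at least {min_len} characters.")
    if length > max_len:
        return (f"Invalid: {label} too long ({length} characters)",
                f"Shorten the {field} to {max_len} characters or less.")
    return ("Valid", "No changes needed.")

# The same helper covers both checks, using the limits from the analysis step
print(check_length("Home", 10, 60, "Title"))
print(check_length(None, 50, 160, "Meta description"))
```

Keeping the limits as arguments makes it easy to change them in one place if your guidelines differ.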



import os  # Provides functions to interact with the operating system (e.g., creating directories, checking file existence)
import json  # Handles JSON data for logging purposes
import time  # Used for pausing the script (e.g., waiting for pages to load dynamically)
from bs4 import BeautifulSoup  # Library for parsing and manipulating HTML and XML
from selenium import webdriver  # Automates browser actions for rendering JavaScript-heavy pages
from selenium.webdriver.chrome.options import Options  # Configures options for the Chrome browser
from selenium.webdriver.common.by import By  # Provides methods to locate HTML elements
from selenium.webdriver.support.ui import WebDriverWait  # Waits for certain conditions to be met before continuing
from selenium.webdriver.support import expected_conditions as EC  # Pre-defined conditions for waiting
import pandas as pd  # Handles structured data in tabular form (e.g., CSV file input/output)

# File paths for input and output
CSV_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/rendering_log.csv"
STATIC_HTML_OUTPUT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/static_html"
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/render_log.json"
# 1. Function to ensure a directory exists
def ensure_directory_exists(directory_path):
    """
    Ensures that the specified directory exists. If not, creates it.
    Why: Files cannot be saved if the directory doesn't exist.
    This avoids runtime errors due to missing directories.
    """
    os.makedirs(directory_path, exist_ok=True)  # Create the directory only if it doesn't already exist

# 2. Function to load URLs from a CSV file
def load_urls(file_path):
    """
    Loads a list of URLs from a CSV file.
    Why: URLs are stored in a structured format (CSV) for scalability and reusability.
    """
    try:
        if not os.path.exists(file_path):  # Check if the file exists
            print(f"Error: File not found -> {file_path}")
            return []  # Return an empty list if the file is missing

        df = pd.read_csv(file_path)  # Read the CSV file into a DataFrame
        return df["url"].tolist()  # Convert the 'url' column into a list
    except Exception as e:
        print(f"Error loading URLs: {e}")  # Print any errors that occur
        return []  # Return an empty list if an error occurs

# 3. Function to save HTML content to a file
def save_html(content, file_path):
    """
    Saves the provided HTML content into a file at the specified path.
    Why: The rendered HTML files are saved for SEO purposes and later analysis.
    """
    ensure_directory_exists(os.path.dirname(file_path))  # Ensure the directory exists
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(content)  # Write the HTML content to the file

# 4. Function to save logs for debugging and tracking
def save_logs(logs, log_file_path):
    """
    Saves the rendering logs as a JSON file.
    Why: Logs help track rendering status and debug any issues during execution.
    """
    with open(log_file_path, "w", encoding="utf-8") as file:
        json.dump(logs, file, indent=4)  # Save logs in a readable JSON format

# 5. Function to optimize HTML for SEO
def optimize_html(soup, url):
    """
    Modifies the HTML to make it more SEO-friendly by adding or updating key elements.
    Why: Search engines rely on structured and complete HTML for indexing pages effectively.
    """
    # Add a meta description if it's missing
    if not soup.find("meta", {"name": "description"}):
        meta_tag = soup.new_tag("meta", attrs={"name": "description", "content": "SEO-optimized content."})
        soup.head.append(meta_tag)  # Append the meta tag to the head section

    # Add a canonical link if it's missing
    if not soup.find("link", {"rel": "canonical"}):
        canonical_tag = soup.new_tag("link", attrs={"rel": "canonical", "href": url})
        soup.head.append(canonical_tag)  # Append the canonical tag to the head section

    # Adjust the title tag to ensure it's SEO-optimized
    title_tag = soup.find("title")
    if title_tag:
        title_text = title_tag.text.strip()  # Get the current title text
        if len(title_text) < 30:  # Ensure the title is long enough
            title_tag.string = title_text + " - SEO Optimized"
        elif len(title_text) > 60:  # Truncate overly long titles
            title_tag.string = title_text[:57] + "..."

    # Add alt attributes to images that lack them
    for img_tag in soup.find_all("img"):
        if not img_tag.get("alt"):  # Check if the alt attribute is missing
            img_tag["alt"] = "image description missing"  # Add a placeholder alt attribute

    # Add href placeholders for anchor tags missing href attributes
    for a_tag in soup.find_all("a"):
        if not a_tag.get("href"):  # Check if the href attribute is missing
            a_tag["href"] = "#"  # Add a placeholder href attribute

    # Remove script and noscript tags (note: this also removes JSON-LD <script type="application/ld+json"> blocks)
    for script in soup.find_all(["script", "noscript"]):
        script.decompose()  # Completely remove these tags

    # Remove native lazy-loading attributes (loading="lazy")
    for tag in soup.find_all(attrs={"loading": "lazy"}):
        del tag.attrs["loading"]  # Delete the lazy-loading attribute

    # Remove preloader elements by their classes or IDs
    preloader_classes = ["dots", "preloader", "preload-content", "animation"]
    for preloader in soup.find_all(
        lambda tag: any(cls in tag.get("class", []) for cls in preloader_classes) or
                    tag.get("id") in preloader_classes
    ):
        preloader.decompose()  # Completely remove these elements

    return soup  # Return the modified BeautifulSoup object


# 6. Validate the optimized HTML
def validate_html(soup):
    """
    Validates the optimized HTML for compliance with SEO standards.
    Why:
    - To ensure the generated HTML meets key SEO requirements.
    - To identify and log any issues that could affect SEO performance.
    """
    issues = []  # Initialize a list to store identified issues

    # Check if any <script> tags remain in the HTML
    # Script tags are often unnecessary in the final optimized HTML as they are for dynamic behavior
    if soup.find("script"):
        issues.append("Residual <script> tags found.")

    # Check if any elements still have a loading="lazy" attribute
    # Lazy-loading attributes can prevent search engines from fully loading content
    if soup.find(attrs={"loading": "lazy"}):
        issues.append("Residual lazy-loading attributes found.")

    # Check for a meta description tag
    # Meta descriptions provide a summary of the page and are important for search engine indexing
    if not soup.find("meta", {"name": "description"}):
        issues.append("Missing <meta name='description'> tag.")

    # Check for a canonical link tag
    # Canonical tags prevent duplicate content issues by indicating the preferred version of a URL
    if not soup.find("link", {"rel": "canonical"}):
        issues.append("Missing <link rel='canonical'> tag.")

    # Check the <title> tag for proper length
    # Titles should be between 30 and 60 characters for optimal SEO
    title_tag = soup.find("title")
    if not title_tag or not (30 <= len(title_tag.text.strip()) <= 60):
        issues.append("Invalid or missing <title> tag.")

    # Return a tuple: validation status (True/False) and the list of issues
    return not issues, issues


import time  # For measuring time taken

def render_page_and_optimize(url, wait_selector=None):
    """
    Renders, optimizes, validates a webpage, and ensures accurate logging and file saving.
    """
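    logs = {"url": url, "status": "Error", "issues": [], "time_taken": 0}  # Initialize logs
    file_name = None  # Placeholder for the static HTML file name
    start_time = time.time()  # Set before `try` so `finally` can compute time_taken even if WebDriver setup fails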

    try:
        # STEP 1: Configure Selenium WebDriver for automated rendering
        options = Options()
        options.add_argument("--headless")  # Run in headless mode (no GUI)
        options.add_argument("--disable-gpu")  # Disable GPU for better performance in headless mode
        options.add_argument("--no-sandbox")  # Required for running Chrome in some environments
        options.add_argument(
            "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
        )  # Use a realistic User-Agent to avoid being blocked

        driver = webdriver.Chrome(options=options)  # Initialize WebDriver

        # STEP 2: Navigate to the URL
        print(f"Processing URL: {url}")
        print(f"Rendering: {url}")
        start_time = time.time()  # Start the timer
        driver.get(url)  # Open the URL in the browser

        # STEP 3: Wait for the page to load dynamically
        try:
            WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
            if wait_selector:
                WebDriverWait(driver, 30).until(
                    EC.presence_of_element_located((By.XPATH, wait_selector))
                )  # Use the configurable selector if provided
        except Exception:
            logs["issues"].append(f"Dynamic element not found: {wait_selector}")
            print(f"Warning: Dynamic element not found for {url}. Proceeding with available content.")

        # STEP 4: Extract and parse the rendered HTML
        rendered_html = driver.page_source
        soup = BeautifulSoup(rendered_html, "html.parser")

        # STEP 5: Optimize the HTML for SEO
        optimized_soup = optimize_html(soup, url)

        # STEP 6: Validate the optimized HTML
        is_valid, issues = validate_html(optimized_soup)

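        # STEP 7: Generate a file name based on the URL
        # Note: built-in hash() is salted per Python process for strings (unless PYTHONHASHSEED is fixed),
        # so the same URL gets a different file name on each run and older files stay in the output directory
        file_name = f"{hash(url)}.html"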
        file_path = os.path.join(STATIC_HTML_OUTPUT_DIR, file_name)

        # STEP 8: Save the optimized HTML as a static file
        save_html(str(optimized_soup), file_path)

        # Confirm the file was saved successfully
        if os.path.exists(file_path):
            logs["file_name"] = file_name
        else:
            logs["issues"].append("Failed to save the HTML file.")

        # STEP 9: Update log status based on validation
        logs["status"] = "Valid" if is_valid else "Invalid"
        logs["issues"].extend(issues)

        # STEP 10: Show a preview of the rendered HTML
        print("\n===== Optimized HTML Preview =====")
        print(optimized_soup.prettify()[:5000])  # Show the first 5000 characters
        print("===================================")
        print(f"Validation Status: {'Valid' if is_valid else 'Invalid'}")

        # STEP 11: Print any validation issues found
        if issues:
            print("Issues Found:")
            for issue in issues:
                print(f"- {issue}")

    except Exception as e:
        # STEP 12: Handle any exceptions that occur during processing
        logs["issues"].append(f"Error: {str(e)}")  # Log the error message
        print(f"Error processing {url}: {str(e)}")  # Print the error for debugging

    finally:
        # STEP 13: Ensure the browser is closed and measure time
        end_time = time.time()  # End the timer
        logs["time_taken"] = round(end_time - start_time, 2)  # Calculate time taken
        if 'driver' in locals():  # Check if the driver exists
            try:
                logs["browser_logs"] = driver.get_log("browser")  # Capture browser console logs
            except Exception:
                logs["browser_logs"] = "Unable to capture browser logs."
            driver.quit()  # Quit the WebDriver

    return logs  # Return the log details for this URL



def process_pages():
    """
    Processes all URLs sequentially, ensuring accurate logging, rendering, and file saving.
    """
    ensure_directory_exists(STATIC_HTML_OUTPUT_DIR)  # Ensure output directory exists

    urls = load_urls(CSV_FILE_PATH)  # Load URLs from CSV
    if not urls:
        print("No URLs to process.")
        return

    all_logs = []  # Store logs for all URLs
    for url in urls:
        print(f"\nProcessing URL: {url}")  # Inform the user about the URL being processed
        log = render_page_and_optimize(url)  # Render and process the URL
        all_logs.append(log)  # Append logs to the list

        # Save logs incrementally after every URL for better debugging
        save_logs(all_logs, LOG_FILE_PATH)
        print(f"Log updated for URL: {url}")

    print(f"\nFinal logs saved to {LOG_FILE_PATH}")


# Execute the script
if __name__ == "__main__":
    process_pages()


import os  # For file operations
import json  # For reading and writing JSON files
from bs4 import BeautifulSoup  # For parsing and analyzing static HTML files

# File paths for input and output
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/render_log.json"
STATIC_HTML_OUTPUT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/static_html"
FINAL_REPORT_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/seo_analysis_report.json"

def analyze_static_html(file_path):
    """
    Analyze a static HTML file for SEO improvements.
    Parameters:
        file_path (str): Path to the static HTML file.
    Returns:
        list: Insights and issues found in the HTML.
    """
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            soup = BeautifulSoup(file.read(), "html.parser")

        issues = []

        # Check for meta description
        if not soup.find("meta", {"name": "description"}):
            issues.append("Missing <meta name='description'> tag.")

        # Check for title tag length
        title_tag = soup.find("title")
        if not title_tag or len(title_tag.text.strip()) < 30 or len(title_tag.text.strip()) > 60:
            issues.append("Invalid or missing <title> tag.")

        # Check for missing alt attributes in images
        for img in soup.find_all("img"):
            if not img.get("alt"):
                issues.append("Missing alt attribute in <img> tag.")

        # Check for missing canonical link
        if not soup.find("link", {"rel": "canonical"}):
            issues.append("Missing <link rel='canonical'> tag.")

        # Check for missing H1 tag
        if not soup.find("h1"):
            issues.append("Missing <h1> tag.")

        # Check for structured data
        if not soup.find("script", {"type": "application/ld+json"}):
            issues.append("Missing structured data (JSON-LD).")

        return issues
    except Exception as e:
        return [f"Error analyzing HTML: {str(e)}"]

def generate_seo_report(log_file_path, static_html_dir, output_report_path):
    """
    Generate a comprehensive SEO report based on rendering logs and static HTML files.
    Parameters:
        log_file_path (str): Path to the rendering log JSON file.
        static_html_dir (str): Directory containing static HTML files.
        output_report_path (str): Path to save the final report.
    """
    try:
        # Validate log file existence
        if not os.path.exists(log_file_path):
            print(f"Error: Log file not found -> {log_file_path}")
            return

        # Load rendering logs
        with open(log_file_path, "r", encoding="utf-8") as file:
            logs = json.load(file)

        # Initialize the SEO report structure
        report = {
            "total_urls": 0,
            "valid_urls": 0,
            "invalid_urls": 0,
            "common_issues": {},
            "detailed_logs": [],
        }

        issue_counts = {}  # Track frequency of issues
        for log in logs:
            url = log["url"]
            report["total_urls"] += 1
            if log["status"] == "Valid":
                report["valid_urls"] += 1
            else:
                report["invalid_urls"] += 1

            # Analyze the corresponding static HTML file
            static_file_path = os.path.join(static_html_dir, log.get("file_name", ""))
            if os.path.isfile(static_file_path):  # Ensure it's a valid file
                html_issues = analyze_static_html(static_file_path)
                log["html_issues"] = html_issues

                # Count issues
                for issue in html_issues:
                    if issue in issue_counts:
                        issue_counts[issue] += 1
                    else:
                        issue_counts[issue] = 1
            else:
                log["html_issues"] = ["Static HTML file missing or is a directory."]

            report["detailed_logs"].append(log)

        # Add common issues to the report
        report["common_issues"] = issue_counts

        # Generate actionable recommendations
        recommendations = []
        if "Missing <meta name='description'> tag." in issue_counts:
            recommendations.append("Add <meta name='description'> tags for better SEO.")
        if "Invalid or missing <title> tag." in issue_counts:
            recommendations.append("Ensure <title> tags are between 30-60 characters.")
        if "Missing alt attribute in <img> tag." in issue_counts:
            recommendations.append("Add alt attributes to all <img> tags.")
        if "Missing <link rel='canonical'> tag." in issue_counts:
            recommendations.append("Include <link rel='canonical'> tags to avoid duplicate content.")
        if "Missing <h1> tag." in issue_counts:
            recommendations.append("Include a primary <h1> tag for better SEO.")
        if "Missing structured data (JSON-LD)." in issue_counts:
            recommendations.append("Add structured data for richer search engine results.")

        report["recommendations"] = recommendations

        # Save the report to a JSON file
        with open(output_report_path, "w", encoding="utf-8") as file:
            json.dump(report, file, indent=4)

        # Display a summary preview
        print("\n===== SEO Report Summary =====")
        print(f"Total URLs Processed: {report['total_urls']}")
        print(f"Valid URLs: {report['valid_urls']}")
        print(f"Invalid URLs: {report['invalid_urls']}")
        print("\nCommon Issues:")
        for issue, count in report["common_issues"].items():
            print(f"- {issue}: {count} occurrences")
        print("\nRecommendations:")
        for rec in report["recommendations"]:
            print(f"- {rec}")
        print("===================================")
        print(f"SEO report saved to {output_report_path}")

    except Exception as e:
        print(f"Error generating SEO report: {e}")

# Execute the SEO analysis
if __name__ == "__main__":
    generate_seo_report(LOG_FILE_PATH, STATIC_HTML_OUTPUT_DIR, FINAL_REPORT_PATH)



import os
import shutil
import json

# Paths for deployment
STATIC_HTML_OUTPUT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/static_html"
DEPLOYMENT_DIR = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/deployment"
LOG_FILE_PATH = "/content/drive/MyDrive/Dataset For Dynamic Rendering for JavaScript SEO/render_log.json"

def prepare_for_deployment(source_dir, deployment_dir, log_file_path):
    """
    Prepare validated static HTML files for deployment.
    - Filters files based on valid URLs from the rendering log.
    - Copies validated `.html` files to the deployment directory.
    - Skips (but does not delete) `.html` files that are not marked valid in the log.
    """
    # Check if source directory exists
    if not os.path.exists(source_dir):
        print(f"Error: Source directory not found -> {source_dir}")
        return

    # Check if rendering log exists
    if not os.path.exists(log_file_path):
        print(f"Error: Log file not found -> {log_file_path}")
        return

    # Load rendering logs
    with open(log_file_path, "r", encoding="utf-8") as file:
        logs = json.load(file)

    # Extract valid file names from logs
    valid_files = {log.get("file_name") for log in logs if log.get("status") == "Valid"}

    # Ensure deployment directory exists
    if not os.path.exists(deployment_dir):
        os.makedirs(deployment_dir)

    total_copied = 0
    total_skipped = 0

    # Copy only valid files to the deployment directory
    for file_name in os.listdir(source_dir):
        if file_name.endswith(".html"):
            source_path = os.path.join(source_dir, file_name)
            destination_path = os.path.join(deployment_dir, file_name)

            if file_name in valid_files:
                try:
                    shutil.copy(source_path, destination_path)
                    print(f"Deployed: {file_name}")
                    total_copied += 1
                except Exception as e:
                    print(f"Error deploying {file_name}: {e}")
            else:
                total_skipped += 1

    # Print deployment summary
    print("\n===== Deployment Summary =====")
    print(f"Static HTML files prepared for deployment in: {deployment_dir}")
    print(f"Total Files Deployed: {total_copied}")
    print(f"Total Files Skipped: {total_skipped}")

if __name__ == "__main__":
    prepare_for_deployment(STATIC_HTML_OUTPUT_DIR, DEPLOYMENT_DIR, LOG_FILE_PATH)



Streaming output truncated to the last 5000 lines.
  <meta content="AI Based SEO Blueprint - AI SEO Mechanism - Thatware" name="twitter:title"/>
  <meta content="Here we provide you with an exclusive live sample on how AI based SEO blueprint works." name="twitter:description"/>
  <meta content="Est. reading time" name="twitter:label1"/>
  <meta content="1 minute" name="twitter:data1"/>
  <!-- / Yoast SEO Premium plugin. -->
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <link href="https://thatware.co/feed/" rel="alternate" title="Thatware » Feed" type="application/rss+xml"/>
  <link href="https://thatware.co/comments/feed/" rel="alternate" title="Thatware » Comments Feed" type="application/rss+xml"/>
  <link as="style" data-rocket-async="style" href="https://thatware.co/wp-includes/css/dist/block-library/style.min.css?ver=6.7.1" media="all" onerror="this.removeAttribute('data-rocket-async')" onload="this.onload=null;this.rel='stylesheet'" re

---
# **What is Dynamic Rendering for JavaScript SEO?**

1. **Problem Statement**:
   - Many modern websites use **JavaScript frameworks** (e.g., Angular, React, Vue.js).
   - Search engines, like Google, sometimes **struggle to fully process JavaScript-heavy websites**. This can lead to:
     - Pages not being indexed correctly.
     - Content not appearing in search results.

2. **Dynamic Rendering Solution**:
   - Dynamic rendering creates **two different versions** of your website:
     - A **search-engine-friendly version** (fully loaded HTML) optimized for bots.
     - The **JavaScript-heavy version** intended for real users.

3. **Benefit**:
   - Search engines see a complete, ready-to-index version of your page, improving SEO visibility.
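
The routing decision behind these two versions can be sketched in a few lines. This is a minimal illustration, not the project's actual detection logic: `BOT_PATTERN`, `is_search_bot`, and `choose_version` are hypothetical names, and the crawler list is a small, non-exhaustive sample.

```python
import re

# Hypothetical, non-exhaustive list of crawler tokens; extend it for the bots you care about.
BOT_PATTERN = re.compile(r"googlebot|bingbot|duckduckbot|baiduspider|yandex", re.IGNORECASE)


def is_search_bot(user_agent):
    """Return True when the User-Agent header looks like a known crawler."""
    return bool(user_agent and BOT_PATTERN.search(user_agent))


def choose_version(user_agent):
    """Pick which version of the page to serve for this visitor."""
    return "static_html" if is_search_bot(user_agent) else "javascript_app"


# A crawler gets the pre-rendered HTML, a regular browser gets the JavaScript app.
print(choose_version("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(choose_version("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/108.0.0.0 Safari/537.36"))
```

User-Agent strings can be spoofed, so a check like this is best treated as a first filter rather than proof that the visitor is a real crawler.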

---

### **What is This Output?**

The output is a **report generated during the dynamic rendering process**. It shows the status of each page as it is processed and optimized for search engines. This ensures that:
- Pages are being properly rendered.
- SEO-critical elements (like titles, descriptions, and canonical tags) are present and correct.

---

### **Detailed Breakdown of the Output**

#### **1. Processing URL**
   - **Example**: `Processing URL: https://thatware.co/google-page-title-update/`
   - **What it means**:
     - This shows the page being analyzed for SEO optimization.
     - It confirms that the system is reviewing this page to ensure it works for search engines.
   - **Why it’s important**:
     - Every critical page on your website must be reviewed to ensure search engines can process them without issues.
   - **Action for Website Owners**:
     - Ensure all important pages (like product, service, and blog pages) are being processed.

---

#### **2. Rendering URL**
   - **Example**: `Rendering: https://thatware.co/google-page-title-update/`
   - **What it means**:
     - This confirms that the dynamic rendering system has loaded the page and generated a **search-engine-friendly version**.
   - **Why it’s important**:
     - Rendering ensures that search engines see the fully loaded content of the page, not just incomplete JavaScript.
   - **Action for Website Owners**:
     - Verify that the rendered version contains all the necessary information (titles, descriptions, and content).

---

#### **3. Optimized HTML Preview**
This section contains the actual **HTML code** that search engines will see after the page is dynamically rendered. Let’s break down the key parts:

##### **a. `<title>` Tag**
   - **Example**: `<title>Google Title Tag Update: Ways To Ignore CTR Downfall - ThatWare</title>`
   - **What it means**:
     - The `<title>` tag is the title of your page, which appears on search engine results pages (SERPs).
   - **Why it’s important**:
     - A well-crafted title improves your **click-through rate (CTR)**.
   - **Action for Website Owners**:
     - Keep the title between 30 and 60 characters (the range this project's validator enforces), include keywords, and describe the page content accurately.

---

##### **b. `<meta>` Tags**
   - **Example**:
     ```html
     <meta content="Google title tag update - Google start giving priority to the H1 or H2 as a webpage's title which causes CTR drop. Read on the guide to fix the drop." name="description"/>
     ```
   - **What it means**:
     - Meta descriptions provide a brief summary of the page’s content. They show up under the title in search results.
   - **Why it’s important**:
     - A compelling meta description encourages users to click on your page.
   - **Action for Website Owners**:
     - Write unique, engaging meta descriptions for every page.

---

##### **c. `<link>` Tags**
   - **Example**:
     ```html
     <link href="https://thatware.co/google-page-title-update/" rel="canonical"/>
     ```
   - **What it means**:
     - Canonical tags indicate the **preferred version** of a page. This prevents duplicate content issues.
   - **Why it’s important**:
     - Duplicate content can hurt SEO rankings. Canonical tags help search engines understand which page to prioritize.
   - **Action for Website Owners**:
     - Ensure every page has a correct canonical tag pointing to the main URL.
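
To spot-check these three elements in a rendered file, a small script is enough. The sketch below is an assumption-laden illustration rather than part of the project: `SeoTagCollector` is a hypothetical helper, and it uses only Python's standard-library `html.parser` so it runs without the notebook's dependencies. The sample markup reuses the tag examples from this section.

```python
from html.parser import HTMLParser


class SeoTagCollector(HTMLParser):
    """Collects the <title> text, meta description, and canonical URL from an HTML string."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so append instead of assigning.
        if self._in_title:
            self.title += data


# Sample markup built from the tag examples in this section.
sample_html = """
<head>
  <title>Google Title Tag Update: Ways To Ignore CTR Downfall - ThatWare</title>
  <meta content="Google title tag update - Google start giving priority to the H1 or H2 as a webpage's title which causes CTR drop. Read on the guide to fix the drop." name="description"/>
  <link href="https://thatware.co/google-page-title-update/" rel="canonical"/>
</head>
"""

collector = SeoTagCollector()
collector.feed(sample_html)
title = collector.title.strip()
print("Title:", title, f"({len(title)} characters)")
print("Description present:", collector.description is not None)
print("Canonical:", collector.canonical)
```

The printed character count can be compared against the 30 to 60 character range that the project's validator applies to titles.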

---

#### **4. Validation Status**
   - **Example**: `Validation Status: Valid`
   - **What it means**:
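     - This confirms that the page passed every check in `validate_html`: no residual `<script>` tags or lazy-loading markers, a meta description, a canonical link, and a `<title>` of 30 to 60 characters.
   - **Why it’s important**:
     - A valid status means these baseline SEO elements are in place. It does not cover problems the validator never inspects, such as broken links or page speed.
   - **Action for Website Owners**:
     - If the status is invalid, read the "Issues Found" list printed for that page (e.g., a missing or out-of-range title, a missing canonical tag) and fix each item.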

---

### **How is This Output Beneficial?**

1. **Improves Search Engine Visibility**:
   - Search engines can correctly interpret and index your pages, leading to better rankings.

2. **Fixes JavaScript Rendering Issues**:
   - Ensures that JavaScript-heavy websites don’t lose visibility due to rendering problems.

3. **Boosts CTR**:
   - Optimized titles and meta descriptions make your page more attractive on search results.

4. **Avoids Duplicate Content Issues**:
   - Canonical tags tell search engines which URL to index when several URLs serve the same content, so duplicate pages do not compete with each other.

---

### **Steps for Website Owners After Receiving This Output**

#### **1. Review the Rendered HTML**
   - Check the `<title>` and `<meta>` tags:
     - Are they relevant to the page content?
     - Do they include keywords and entice users to click?

#### **2. Fix Missing Data**
   - Ensure:
     - Every page has a title, description, and canonical tag.
     - Headings (H1, H2) are clear and descriptive.

#### **3. Optimize Page Speed**
   - Use tools like **Google PageSpeed Insights** to ensure the rendered version loads quickly.

#### **4. Validate URLs**
   - Verify that all critical pages are being rendered and indexed correctly.

#### **5. Monitor for Errors**
   - Look for issues like:
     - `Error processing: 'loading'`
     - Fix these by checking server response times or JavaScript dependencies.

---

### **Conclusion**

This output is a roadmap for ensuring your website is SEO-friendly and optimized for search engines. By using dynamic rendering:
- Search engines can better understand your website.
- You can avoid technical issues with JavaScript.
- You improve your visibility, rankings, and traffic.

---


### **Understanding the Output of Dynamic Rendering for JavaScript SEO**
This output provides a detailed report on how pages are processed, validated, and optimized using the **Dynamic Rendering for JavaScript SEO Model**.

---

### **What is the Purpose of Dynamic Rendering?**

Dynamic Rendering is a process used to help **search engines** understand and rank websites that rely heavily on **JavaScript**. Sometimes, search engines like Google can’t read or process JavaScript correctly, which can hurt a website’s SEO. Dynamic Rendering solves this by:
1. **Generating static HTML versions of pages for search engines.**
2. **Ensuring all important SEO elements (titles, descriptions, structured data, etc.) are included.**

This is **critical for improving a website’s visibility in search engine results.**

---

### **What Does This Output Mean?**
The output is a detailed log of:
- How many pages were analyzed.
- Which pages were successfully processed and optimized.
- Common issues found and steps recommended for improvement.
- Deployment details of optimized HTML files.

Let’s break it down part by part.

---

### **1. Logs of URLs Processed**
- **Example**:  
  `Total URLs Processed: 152`  
  `Valid URLs: 75`  
  `Invalid URLs: 77`  

- **Meaning**:  
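  - **152** pages on your website were analyzed by the system.  
  - **75** of those pages passed validation and were logged with the status `Valid`.  
  - **77** pages were logged with any other status: either `Invalid` (a validation check failed) or `Error` (the page could not be rendered or saved).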

- **Action Steps**:  
  - Focus on fixing the **invalid URLs**. These may not be visible or indexed properly by search engines.  
  - Check the error logs for more details about what went wrong.  
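
A quick way to see which URLs need attention is to group the log entries by status. The sketch below is illustrative only: `summarize_render_log` is a hypothetical helper, and the sample entries merely imitate the shape of the dictionaries that `render_page_and_optimize()` returns.

```python
from collections import Counter


def summarize_render_log(logs):
    """Count entries per status and collect the issues of every non-valid URL."""
    status_counts = Counter(entry.get("status", "Error") for entry in logs)
    needs_attention = {
        entry["url"]: entry.get("issues", [])
        for entry in logs
        if entry.get("status") != "Valid"
    }
    return status_counts, needs_attention


# Hypothetical sample entries shaped like the ones in render_log.json.
sample_logs = [
    {"url": "https://example.com/a", "status": "Valid", "issues": []},
    {"url": "https://example.com/b", "status": "Invalid",
     "issues": ["Missing <link rel='canonical'> tag."]},
    {"url": "https://example.com/c", "status": "Error", "issues": ["Error: timeout"]},
]

counts, needs_attention = summarize_render_log(sample_logs)
print(dict(counts))
for url, issues in needs_attention.items():
    print(url, "->", issues)
```

To run it on a real log, load the file with `json.load` and pass the resulting list to `summarize_render_log`.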

---

### **2. Common Issues**
- **Example**:  
  `- Missing structured data (JSON-LD): 75 occurrences`

- **Meaning**:  
  - **Structured data** is additional information in your website’s code that helps search engines understand your content better.  
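  - Missing structured data means your pages are not eligible for rich results (e.g., product ratings, FAQs, images) in search results.
  - JSON-LD is delivered inside a `<script type="application/ld+json">` tag, so a rendering step that removes every `<script>` tag removes the structured data too. The count in this run (75) equals the number of valid pages, which suggests that for those pages the issue comes from script stripping, whether or not the live page has structured data.

- **Why It’s Important**:  
  - Without structured data, your pages cannot earn rich results. For example, your competitors might have eye-catching product listings, while your pages appear plain.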

- **Action Steps**:  
  1. **Add structured data** to the pages.  
     Use tools like [Google’s Structured Data Markup Helper](https://www.google.com/webmasters/markup-helper/) or ask a developer to add JSON-LD.  
  2. Validate structured data using the [Rich Results Test Tool](https://search.google.com/test/rich-results).
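
As a rough illustration of what "adding JSON-LD" means in practice, the sketch below builds a `<script type="application/ld+json">` tag for an article. `build_article_json_ld` is a hypothetical helper and the chosen fields (`headline`, `url`, `description`) are an assumed minimal set from schema.org; the right type and properties depend on the page.

```python
import json


def build_article_json_ld(headline, url, description):
    """Return a <script type="application/ld+json"> tag describing an Article."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "description": description,
    }
    # Escape "</" so text inside the JSON can never close the <script> tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Example values only; replace them with the real page data.
print(build_article_json_ld(
    "Sample headline",
    "https://example.com/post",
    "Short summary of the article.",
))
```

The generated tag belongs in the page's HTML, and the result can then be checked with the Rich Results Test mentioned above.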

---

### **3. Validation Status**
- **Example**:  
  `Validation Status: Valid`  

- **Meaning**:  
  - A **valid status** means the page passed all validation checks.  
  - If the status is **invalid**, at least one check failed (e.g., a missing meta description or canonical tag, a title outside 30 to 60 characters, or leftover `<script>` tags).

- **Action Steps**:  
  - Review **invalid pages** for missing SEO elements (titles, descriptions, canonical tags).  
  - Fix errors and re-run the rendering process to ensure all pages are validated.

---

### **4. Deployment Details**
- **Example**:
  ```
  Deployed: 7965246990520139191.html
  Total Files Deployed: 75
  Total Files Skipped: 211
  ```

- **Meaning**:  
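  - **75 files** were copied to the deployment folder as static HTML versions for search engines.  
  - **211 files** were skipped because their names are not marked `Valid` in the rendering log. Since the run above logged only 77 non-valid URLs, most of these are likely files left in the output folder by earlier runs.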

- **Why It’s Important**:  
  - Deployment ensures that optimized HTML files are ready for search engines.  
  - Skipped files may mean missed opportunities for SEO improvement.

- **Action Steps**:  
  1. Verify the deployed files:
     - Check if all important pages (home, products, blogs, etc.) were deployed.  
  2. Investigate skipped files:
     - Identify why they were skipped (e.g., errors, unnecessary files).  
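
To investigate skipped files, it helps to separate files that failed validation from files the current log does not mention at all. The sketch below is one possible approach: `find_skipped_files` is a hypothetical helper, and the demo uses a temporary folder with made-up file names instead of the real output directory.

```python
import os
import tempfile


def find_skipped_files(source_dir, logs):
    """Split skipped .html files into 'not valid in the log' and 'not in the log at all'."""
    status_by_file = {
        entry["file_name"]: entry.get("status")
        for entry in logs
        if entry.get("file_name")
    }
    not_valid, not_in_log = [], []
    for name in sorted(os.listdir(source_dir)):
        if not name.endswith(".html"):
            continue
        if name not in status_by_file:
            not_in_log.append(name)  # e.g. output left over from an earlier run
        elif status_by_file[name] != "Valid":
            not_valid.append(name)  # rendered in this run but failed validation
    return not_valid, not_in_log


# Demo with a temporary folder and hypothetical file names.
with tempfile.TemporaryDirectory() as source_dir:
    for name in ("a.html", "b.html", "old.html"):
        open(os.path.join(source_dir, name), "w").close()
    sample_logs = [
        {"url": "https://example.com/a", "status": "Valid", "file_name": "a.html"},
        {"url": "https://example.com/b", "status": "Invalid", "file_name": "b.html"},
    ]
    not_valid, not_in_log = find_skipped_files(source_dir, sample_logs)
    print("Failed validation:", not_valid)
    print("Not in the current log:", not_in_log)
```

Files in the second group are candidates for cleanup, while files in the first group point to pages whose issues are listed in the rendering log.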

---

### **5. Recommendations**
- **Example**:
  ```
  Recommendations:
  - Add structured data for richer search engine results.
  ```

- **Meaning**:  
  - The system identified specific improvements to make your website more SEO-friendly.  

- **Action Steps**:  
  - **Add structured data** to ensure your website shows rich results like FAQs, star ratings, or events in search engines.  
  - Prioritize adding structured data to the most important pages first (e.g., homepage, product pages).

---

### **Why is This Output Beneficial?**
1. **Improves Search Engine Ranking**:
   - By optimizing all pages and resolving errors, your website becomes more search-engine-friendly.  

2. **Fixes JavaScript Rendering Issues**:
   - Ensures that search engines can fully understand JavaScript-heavy websites.  

3. **Enhances Search Appearance**:
   - Adding structured data makes your pages more attractive in search results.  

4. **Identifies and Resolves Errors**:
   - Invalid pages and missing elements are flagged, giving you a clear roadmap to fix them.  

5. **Prepares Static Files for SEO**:
   - Optimized HTML files are ready to be indexed by search engines, reducing the risk of poor rankings.

---

### **Final Steps for Website Owners**
Here’s a clear, step-by-step guide:

#### **1. Review All Processed URLs**:
- Check the list of **valid and invalid URLs**.
- Focus on fixing invalid pages first.

#### **2. Add Missing Structured Data**:
- Prioritize pages flagged for missing structured data.
- Use JSON-LD to add rich snippets for:
  - Products
  - Articles
  - FAQs
  - Reviews

#### **3. Fix Validation Errors**:
- Review errors like:
  - Missing `<title>` or `<meta>` tags.
  - Incorrect canonical URLs.
- Update your pages to fix these issues.

#### **4. Optimize Skipped Pages**:
- Investigate skipped files to understand why they were not deployed.
- Re-run the rendering tool after fixing issues.
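
One thing to keep in mind when re-running: the renderer names each file with Python's built-in `hash(url)`, which is salted per process for strings, so a new run writes new file names and the old files remain. A minimal sketch of a stable alternative follows, assuming a content hash is acceptable as a file name; `stable_file_name` is a hypothetical helper, not part of the project.

```python
import hashlib


def stable_file_name(url, length=16):
    """Derive a file name from the URL that is identical in every Python process."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{digest[:length]}.html"


# The same URL always maps to the same name, so a re-run overwrites the earlier file.
name = stable_file_name("https://thatware.co/google-page-title-update/")
print(name)
```

Truncating the digest keeps names short; a longer `length` lowers the already small chance of two URLs sharing a name.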

#### **5. Monitor Performance**:
- Use tools like **Google Search Console** to check indexing and performance.
- Validate structured data with Google’s **Rich Results Test Tool**.

---

### **Conclusion**
This output is a roadmap for improving your website’s SEO. It highlights:
- Which pages need optimization.
- What errors need fixing.
- How to enhance your website’s visibility in search results.

By following these steps, you’ll ensure your website is **search-engine-friendly**, ranks higher, and attracts more traffic.
