<a href="https://colab.research.google.com/github/Abhiss123/AlmaBetter-Projects/blob/main/KBT_SEO_Analyzer_Building_Trust_Through_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name: KBT-SEO Analyzer: Building Trust Through Data**

---
# **What Is This Project About?**

The **KBT-SEO Analyzer: Building Trust Through Data** is a tool designed to help website owners, digital marketers, and SEO professionals analyze the quality of their website content.

Its goal is to **evaluate and improve trustworthiness, readability, and SEO performance** using natural language processing (NLP). This project applies **Knowledge-Based Trust (KBT)** principles, which means estimating how credible and trustworthy a webpage's content is from measurable factors such as grammar, sentiment, and citations.

---

### **Purpose of This Project**
The primary purpose of this project is to help **website owners** and **content creators** ensure their content is:

1. **Trustworthy**:
   - It checks if the content has references, is well-written, and avoids misleading or low-quality language.

2. **Readable**:
   - The tool ensures sentences are clear and easy to understand.
   - It flags long and complex sentences, making content more engaging for readers.

3. **SEO-Optimized**:
   - It evaluates keyword density to ensure the content is neither under-optimized nor overstuffed with keywords.
   - The tool helps balance keywords in a way that improves search engine rankings without penalization.

4. **Actionable**:
   - After analyzing the content, it provides clear suggestions and improvements, helping the user take specific actions to enhance the quality of their webpage.

---

### **Why Was This Project Created?**

1. **Problem It Solves**:
   - Website content often fails to rank on search engines because it lacks credibility or contains poorly written language.
   - Many webpages overuse keywords (keyword stuffing), which reduces their quality and can lead to penalties from search engines.
   - A lack of proper citations or references makes content appear less trustworthy to readers and search engines.

2. **How It Helps**:
   - The KBT-SEO Analyzer identifies these problems in your content and provides actionable insights to fix them.
   - It enhances the trust factor of your content, which is critical for building a loyal audience and ranking higher on search engines.

---

### **Who Is This Project For?**

1. **Website Owners**:
   - Ensures that their website content builds trust and ranks higher in search engine results.

2. **SEO Professionals**:
   - Helps them optimize their clients’ content for both readability and search engine trust.

3. **Content Creators**:
   - Offers insights into how to improve their writing for clarity, grammar, and sentiment.

4. **Digital Marketers**:
   - Provides detailed feedback on how content can be aligned with Knowledge-Based Trust principles to engage audiences effectively.

---

### **What Does This Project Analyze?**
Here are the key features of the **KBT-SEO Analyzer**:

1. **Sentiment Analysis**:
   - Analyzes the tone of the content (positive, negative, or neutral).
   - Helps improve the tone to make it more engaging for readers.

2. **Keyword Density Analysis**:
   - Checks how often specific keywords appear in the content.
   - Flags keyword stuffing, ensuring that the content is SEO-friendly.

3. **Grammar and Sentence Structure**:
   - Identifies sentences that are too long or use passive voice, making content harder to understand.
   - Recommends rewriting for better readability.

4. **Citation Count**:
   - Counts references and citations in the content.
   - Flags content that lacks proper citations, helping improve trustworthiness.

5. **Suggestions for Improvement**:
   - Provides actionable suggestions, such as simplifying sentences, improving tone, or reducing overuse of specific keywords.
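
As a rough illustration of the keyword density check described above, the sketch below counts word frequencies and reports each word's share of the text. The function name `keyword_density` and the 20% threshold are assumptions made for this example, not values taken from the analyzer; a real threshold would be tuned to the content.

```python
import re
from collections import Counter

def keyword_density(text, top_n=3):
    """Return (word, count, share of all words) for the most frequent words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    total = len(words)
    if total == 0:
        return []
    counts = Counter(words)
    return [(word, count, count / total) for word, count in counts.most_common(top_n)]

# Hypothetical threshold: flag any word that makes up more than 20% of the text.
STUFFING_THRESHOLD = 0.20

sample = "SEO tools help SEO teams. SEO audits find SEO issues fast."
for word, count, share in keyword_density(sample):
    flag = "possible stuffing" if share > STUFFING_THRESHOLD else "ok"
    print(f"{word}: {count} uses, {share:.0%} of words ({flag})")
```

On longer pages a much lower threshold would be appropriate, and common stopwords would usually be removed before counting.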

---

### **How Is It Beneficial?**
The **KBT-SEO Analyzer** ensures your content is ready for both readers and search engines. Here's how it benefits you:

1. **Improves Search Rankings**:
   - By optimizing your content for keywords, grammar, and readability, it becomes more likely to rank higher on Google and other search engines.

2. **Builds Trust**:
   - Content with proper citations, positive tone, and clear language establishes trust with your audience.

3. **Enhances User Experience**:
   - Simplifies complex sentences and reduces errors, making content enjoyable to read.

4. **Saves Time**:
   - Instead of manually checking your content, the tool quickly analyzes it and provides suggestions.

---

### **What Should You Do After Getting This Output?**

1. **Review the Issues**:
   - Look at the flagged issues in the **Issues** section. For example, check for sentences that are too long or for keywords that are overused.

2. **Follow the Suggestions**:
   - Use the actionable suggestions to rewrite and improve your content.

3. **Optimize Keyword Usage**:
   - Ensure keywords are used naturally and avoid overusing them.

4. **Check Tone and Sentiment**:
   - If the sentiment is flagged as "neutral" or "negative," rewrite sections to make them more engaging.

5. **Add Citations**:
   - If the citation count is low, include proper references to build credibility.

6. **Repeat the Process**:
   - After making changes, re-run the tool to ensure all issues are resolved.

---

### **Summary**
The **KBT-SEO Analyzer: Building Trust Through Data** is a powerful tool designed to help website owners and content creators optimize their content. By analyzing trustworthiness, readability, and SEO effectiveness, it ensures that your content engages readers and ranks higher on search engines.


---
# **What is Knowledge-Based Trust (KBT) in SEO?**
Knowledge-Based Trust (KBT) is an approach proposed by Google researchers for estimating how trustworthy and accurate the information on a website is. The trust score is based on:
- How well the facts presented on the website match publicly available, verified knowledge sources.
- Whether the information is free from misleading, false, or incomplete claims.

This matters for SEO because search engines aim to surface reliable and accurate information. KBT itself was published as research, and its exact role in Google's ranking is not publicly documented.

---

### Use Cases of KBT in SEO:
Here are the use cases of KBT in the context of a **website**:
1. **Improving Search Rankings**: Websites that present accurate, fact-based information are more likely to rank higher on Google.
2. **Building User Trust**: Users trust websites with reliable information, leading to higher engagement and lower bounce rates.
3. **Avoiding Penalties**: Misinformation or inaccuracies can lead to lower rankings or penalties by Google.
4. **Boosting Brand Credibility**: A website that aligns with Google's trust algorithms strengthens its brand image as a reliable source of information.

---

### Real-Life Implementations of KBT in SEO for Websites:
1. **News Websites**: They use KBT to ensure the facts they present are verified and match reputable sources. For example, Google News prioritizes trustworthy content.
2. **E-Commerce Websites**: Product descriptions, reviews, and specifications must be accurate to ensure trustworthiness.
3. **Educational Websites**: They cross-check their facts with known knowledge bases (e.g., Wikipedia, research papers) to ensure credibility.
4. **Healthcare Websites**: Medical websites ensure their content is fact-checked against reliable medical sources like PubMed or WHO guidelines.

---

### Data Required by a KBT Model:
A KBT model requires input data to analyze and determine the trustworthiness of the content. The data can be provided in two main formats:
1. **Website URLs**:  
   - The URLs of the web pages are fed into the model.  
   - The model crawls these pages to extract the text content for analysis.  
   - This approach is ideal for analyzing live content on a website.

2. **Structured Data in CSV Format**:  
   - This is used when the text content (e.g., page titles, descriptions, and main content) is already exported into a CSV file.  
   - Each row in the CSV file represents a web page, with columns for title, content, metadata, etc.
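
A minimal sketch of the CSV input format follows. The column names `URL` and `Content` and the two sample rows are assumptions for illustration (they match the columns the scraper in Part 1 writes); an export with title or metadata columns would be read the same way.

```python
import csv
import io

# Hypothetical two-row export; one row per web page.
csv_text = """URL,Content
https://example.com/page-a,"First page text, with a comma inside quotes."
https://example.com/page-b,Second page text.
"""

def load_pages(file_obj):
    """Read rows into a list of dicts, skipping rows with empty content."""
    reader = csv.DictReader(file_obj)
    return [row for row in reader if (row.get("Content") or "").strip()]

pages = load_pages(io.StringIO(csv_text))
for page in pages:
    print(page["URL"], "->", len(page["Content"].split()), "words")
```

Passing an open file handle instead of `io.StringIO` reads a real export; quoting the content field keeps commas inside the text from splitting the row.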

---

### How Does the KBT Model Process Data?
1. **Text Preprocessing**:  
   - The text is extracted from URLs or CSV files.  
   - The text is cleaned by removing HTML tags, special characters, and redundant formatting.
   
2. **Fact Checking**:  
   - The extracted content is cross-checked against trusted knowledge sources (e.g., Google Knowledge Graph, Wikipedia, medical journals).  
   - The model looks for factual inconsistencies or unverifiable claims.

3. **Trust Score Calculation**:  
   - Based on the accuracy and alignment of the content with known facts, the model assigns a trust score to each page.
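
The fact-checking and scoring steps above can be pictured with a toy example. Everything here is an assumption made for illustration: the tiny `KNOWLEDGE_BASE`, the claim format of (subject, attribute, value) tuples, and the scoring rule (share of checkable claims that agree). It is not Google's method, and it is not what the code in this notebook computes.

```python
# Toy knowledge base of (subject, attribute) -> value. Entirely hypothetical.
KNOWLEDGE_BASE = {
    ("water", "boiling_point_c"): "100",
    ("earth", "moons"): "1",
    ("python", "first_release_year"): "1991",
}

def trust_score(claims, knowledge_base):
    """Score 0-100: share of checkable claims that agree with the knowledge base.

    Claims whose key is missing from the knowledge base are returned as
    unverifiable and left out of the score.
    """
    correct, checked, unverifiable = 0, 0, []
    for subject, attribute, value in claims:
        expected = knowledge_base.get((subject, attribute))
        if expected is None:
            unverifiable.append((subject, attribute, value))
            continue
        checked += 1
        if expected == value:
            correct += 1
    score = 100.0 * correct / checked if checked else None
    return score, unverifiable

page_claims = [
    ("water", "boiling_point_c", "100"),  # agrees with the knowledge base
    ("earth", "moons", "2"),              # contradicts the knowledge base
    ("mars", "moons", "2"),               # not covered by the toy knowledge base
]
score, unverifiable = trust_score(page_claims, KNOWLEDGE_BASE)
print(f"Trust score: {score:.0f}/100, unverifiable claims: {len(unverifiable)}")
```

Keeping unverifiable claims separate from incorrect ones avoids penalizing a page simply because the knowledge base is incomplete.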

---

### Expected Output of a KBT Model:
Here’s what the KBT model provides as output:
1. **Trust Scores for Web Pages**:  
   - A numerical score (e.g., 0-100) indicating the trustworthiness of each page.
2. **Highlighted Issues**:  
   - Specific sections of content that may be misleading or unverified.  
   - Suggestions for improving factual accuracy.
3. **Recommendations**:  
   - Tips for aligning content with reliable sources to improve trustworthiness.  
   - Identifying missing citations or verifications.
4. **Insights on Metadata**:  
   - Suggestions for optimizing meta titles, descriptions, and schema markup to align with KBT principles.

---

### How is This Useful for Optimizing Website Content?
1. **Content Review**:  
   - Website owners can identify inaccurate or weak content and update it with verified information.
2. **Citation Management**:  
   - Add proper citations and references to back up claims on the website.
3. **Improved Rankings**:  
   - Higher trust scores may support better SEO rankings, since search engines aim to prioritize accurate content.
4. **User Retention**:  
   - Users are more likely to stay on and trust websites with accurate information, leading to better engagement metrics.

---

### Non-Tech Guide to Implementing KBT in SEO:
1. **Data Preparation**:  
   - Either provide the URLs of your website or export your content into a structured CSV file.
   
2. **Running the KBT Model**:  
   - Use a tool or script (often in Python) to analyze the data.
   - The model will preprocess the content, cross-check with verified knowledge bases, and calculate trust scores.

3. **Interpreting the Output**:  
   - Look at the trust scores and recommendations provided by the model.  
   - Update your website content based on the insights to align with KBT principles.

---



---
# **Part 1: Web Scraping Code**
**Purpose**: To fetch, clean, and save raw webpage content from specified URLs for further processing.

#### **Steps and Functionality**:
1. **`fetch_content(url)`**:  
   - Fetches raw HTML content from a webpage using the given URL.
   - **Why**: This function retrieves the initial webpage data which is essential for analysis.

2. **`extract_text_from_html(html_content)`**:  
   - Cleans the raw HTML content to extract readable text while removing unwanted elements like scripts and styles.
   - **Why**: Ensures only relevant information is passed to the next steps.

3. **`scrape_webpages(urls)`**:  
   - Iterates through a list of URLs, fetches the content, and cleans it using the above functions.
   - **Why**: Gathers all webpage data into a structured format for later use.

4. **`save_to_csv(data, filename)`**:  
   - Saves the scraped data (URL and cleaned text) into a CSV file.
   - **Why**: Provides a structured file format for further processing.

5. **`preview_data(data)`**:  
   - Displays a preview of the scraped data to verify its accuracy.
   - **Why**: Ensures that the scraped content is accurate before moving to the next step.
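
To make the text-extraction step concrete, here is a small sketch that uses only Python's standard-library `html.parser` rather than BeautifulSoup. The class name `VisibleTextParser`, the helper `visible_text`, and the sample HTML are assumptions for this example; the point is only that the contents of `<script>` and `<style>` must be skipped explicitly.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()

sample_html = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><h1>Services</h1><script>console.log("hidden");</script>
<p>We build software.</p></body></html>
"""
print(visible_text(sample_html))  # Demo Services We build software.
```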

---


# Importing required libraries
import requests  # To send HTTP requests and fetch webpage content
from bs4 import BeautifulSoup  # To parse HTML and extract readable text
import csv  # To save the scraped data in a CSV file
import pandas as pd  # To display a preview of the data

# List of URLs to scrape (provided by the user)
urls = [
    'https://thatware.co/software-development-services/',
    'https://thatware.co/business-intelligence-services/',
    'https://thatware.co/competitor-keyword-analysis/'
]

# Step 1: Function to fetch webpage content from a given URL
def fetch_content(url):
    """
    Fetch HTML content from a webpage.
    - Purpose: This function sends an HTTP request to the given URL to fetch the raw HTML content of the page.
    - Why: Without this step, we wouldn't have any data to process or analyze.
    """
    try:
        response = requests.get(url, timeout=10)  # Sending a GET request with a 10-second timeout
        response.raise_for_status()  # Ensures the request was successful; raises an error otherwise
        return response.text  # Return the raw HTML content of the page
    except requests.exceptions.RequestException as e:
        # Print an error message if the request fails
        print(f"Error fetching URL {url}: {e}")
        return None  # Return None so the process can continue even if one URL fails

# Step 2: Function to extract meaningful text from HTML content
def extract_text_from_html(html_content):
    """
    Extract visible text from HTML content.
    - Purpose: This function removes unnecessary elements like scripts, styles, and hidden text from the HTML.
    - Why: It ensures that only the main content of the webpage is extracted for analysis.
    """
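    soup = BeautifulSoup(html_content, 'html.parser')  # Parse the HTML content using BeautifulSoup
    # Drop non-visible elements first; get_text() would otherwise include their contents
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()
    # Combine the remaining visible text into a single string
    return soup.get_text(separator=' ', strip=True)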

# Step 3: Function to scrape all URLs and store their content
def scrape_webpages(urls):
    """
    Scrape webpage content for multiple URLs.
    - Purpose: This function loops through each URL, fetches its content, and cleans it.
    - Why: It structures the process of collecting and organizing webpage data for easy analysis later.
    """
    webpage_data = []  # List to store scraped data for each URL
    for url in urls:
        print(f"Scraping URL: {url}")  # Notify the user of the current URL being processed
        html_content = fetch_content(url)  # Step 1: Fetch HTML content
        if html_content:  # Ensure we have valid content before proceeding
            text_content = extract_text_from_html(html_content)  # Step 2: Extract clean text
            # Store the URL and cleaned content in a dictionary
            webpage_data.append({'URL': url, 'Content': text_content})
    return webpage_data  # Return the scraped data as a list of dictionaries

# Step 4: Save the scraped data to a CSV file
def save_to_csv(data, filename='webpage_content.csv'):
    """
    Save scraped data to a CSV file.
    - Purpose: This function saves the collected webpage data into a structured format (CSV).
    - Why: The CSV format is easy to open, share, and analyze using tools like Excel or Python.
    """
    try:
        # Open the CSV file in write mode with UTF-8 encoding to handle special characters
        with open(filename, mode='w', newline='', encoding='utf-8') as file:
            # Create a CSV writer object and define the columns
            writer = csv.DictWriter(file, fieldnames=['URL', 'Content'])
            writer.writeheader()  # Write the column headers to the file
            writer.writerows(data)  # Write each row of data
        print(f"Data successfully saved to {filename}.")  # Confirm success to the user
    except Exception as e:
        # Handle any file-saving issues
        print(f"Error saving data to CSV: {e}")

# Step 5: Display a preview of the scraped data
def preview_data(data, num_rows=5):
    """
    Display a preview of the scraped data in tabular format.
    - Purpose: Show a quick preview of the data to ensure it's correctly scraped before moving forward.
    - Why: This helps validate the data and catch issues early on.
    """
    try:
        df = pd.DataFrame(data)  # Convert the scraped data into a Pandas DataFrame for tabular representation
        print("\nPreview of Scraped Data:\n")
        print(df.head(num_rows))  # Display the first few rows of the data
    except Exception as e:
        # Handle issues with data preview
        print(f"Error displaying preview: {e}")

# Main process: This is where all the functions come together
if __name__ == "__main__":
    # Step 3: Scrape the URLs and get the data
    scraped_data = scrape_webpages(urls)

    # Step 4: Save the scraped data into a CSV file
    save_to_csv(scraped_data)

    # Step 5: Display a preview of the scraped data
    preview_data(scraped_data)


Scraping URL: https://thatware.co/software-development-services/
Scraping URL: https://thatware.co/business-intelligence-services/
Scraping URL: https://thatware.co/competitor-keyword-analysis/
Data successfully saved to webpage_content.csv.

Preview of Scraped Data:

                                                 URL  \
0  https://thatware.co/software-development-servi...   
1  https://thatware.co/business-intelligence-serv...   
2   https://thatware.co/competitor-keyword-analysis/   

                                             Content  
0  Custom Software Development Services - Softwar...  
1  Business Intelligence Services - Competitive A...  
2  SEO Competitor Keyword Analysis - Competitor R...  


---
# **Explanation of the Output**

This output shows the **data scraped from the webpages**: a structured table of the content collected from the listed URLs. Let's break it down step by step:

#### **Columns in the Data**
1. **URL**  
   - The `URL` column contains the web address of the pages from which the content was scraped.
   - These URLs are like digital addresses that point to specific webpages on the internet.
   - Example: `https://thatware.co/software-development-services/`  
     This URL is for a page about custom software development services.

2. **Content**  
   - The `Content` column contains the textual content found on each webpage.
   - This is the information that was visible on the webpage, such as titles, descriptions, and any other text.
   - Example:
     - "Custom Software Development Services - Software tailored to your needs."
     - This content is what users see when they visit the corresponding URL.

#### **Rows in the Data**
Each row in the output corresponds to one webpage:
- Row 1 (Index `0`):
  - URL: A webpage about software development services.
  - Content: Text from that page, which includes descriptions or promotional content about those services.
- Row 2 (Index `1`):
  - URL: A webpage about business intelligence services.
  - Content: Text from that page discussing competitive analysis and strategies.
- Row 3 (Index `2`):
  - URL: A webpage about competitor keyword analysis.
  - Content: Text describing SEO services related to competitor research.

---

### **Purpose of the Data**
This data serves as the **input for further analysis** in your Knowledge-Based Trust (KBT) SEO Model. Here's what it does:
1. **Extracts Information:**
   - Gathers textual content from specific URLs to understand what the page is about.

2. **Prepares for Analysis:**
   - This content will later be analyzed for issues like tone, grammar, and keyword density to improve the quality of the webpage.

3. **Client-Friendly View:**
   - This output shows the client what information has been collected from their webpages for review or further processing.

---

### **Why This Data Matters**
1. **SEO Insights:**
   - Helps analyze how webpages are written and whether they align with SEO best practices.
2. **Quality Assurance:**
   - Ensures that the content on the webpages is engaging, grammatically correct, and optimized for keywords.
3. **Data Transparency:**
   - Shows exactly what data was extracted from the webpages, ensuring there are no surprises for the client.



---

### **Conclusion**
This output represents the starting point of the KBT model. It captures the **webpage URLs** and their **content** to provide transparency and set the foundation for analysis. It ensures the client knows exactly what is being analyzed and why.



---
# **Part 2: Data Enhancement and Preprocessing**
**Purpose**: To clean, preprocess, and enrich the webpage content with NLP (Natural Language Processing) features.

#### **Steps and Functionality**:
1. **`initialize_nltk_resources()`**:  
   - Downloads necessary NLTK resources such as stopwords and tokenizers.
   - **Why**: Prepares the environment for advanced text preprocessing tasks.

2. **`initialize_spacy_model()`**:  
   - Loads SpaCy’s pre-trained language model for grammar and sentence analysis.
   - **Why**: Enables advanced NLP tasks like sentence segmentation and keyword extraction.

3. **`preprocess_text(text, nlp)`**:  
   - Cleans the text by removing special characters, converting to lowercase, and filtering out stopwords.
   - **Why**: Prepares the text for accurate NLP analysis.

4. **`extract_keywords(content, nlp)`**:  
   - Extracts keywords dynamically using SpaCy’s part-of-speech tagging.
   - **Why**: Identifies the most relevant terms in the text.

5. **`calculate_sentiment(content)`**:  
   - Analyzes the sentiment polarity (positive, neutral, or negative) of the content.
   - **Why**: Determines the emotional tone of the text, which is crucial for trust analysis.

6. **`sentence_metadata(content, nlp)`**:  
   - Provides metadata for each sentence, including its length and whether it uses passive voice.
   - **Why**: Identifies structural issues in the text.

7. **`count_citations(content)`**:  
   - Counts references and citations dynamically based on specific keywords like "source" or "report."
   - **Why**: Measures the credibility of the content.

8. **`process_data(input_file, output_file)`**:  
   - Applies all preprocessing steps to the webpage content and saves the enhanced data to a CSV file.
   - **Why**: Enhances the raw data with NLP insights for final analysis.
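
The flags derived from the sentiment score and citation count can be sketched in isolation. The thresholds below (0 for sentiment, 5 for citations) follow the ones used in this notebook's `process_data`; the helper names and the word-boundary pattern are assumptions for this example. Matching on word boundaries is one way to keep words such as "resources" or "reporting" from being counted as citations.

```python
import re

# Word-boundary pattern so that "resource" or "reporting" are not counted.
CITATION_PATTERN = re.compile(
    r"\b(?:sources?|references?|citations?|stud(?:y|ies)|reports?)\b",
    re.IGNORECASE,
)

def count_citation_terms(text):
    """Count whole-word occurrences of citation-like terms (a heuristic)."""
    return len(CITATION_PATTERN.findall(text))

def sentiment_flag(polarity):
    """Map a polarity score in [-1, 1] to a label."""
    if polarity < 0:
        return "Negative"
    if polarity > 0:
        return "Positive"
    return "Neutral"

def citation_flag(count, minimum=5):
    """Label content with fewer than `minimum` citation terms as low."""
    return "Low Citations" if count < minimum else "Sufficient Citations"

sample = (
    "According to the report, one study cites a source. "
    "Our resources and reporting tools are great."
)
n = count_citation_terms(sample)
print(n, citation_flag(n), sentiment_flag(0.35))
```

Counting terms like these is only a proxy for real citations; a page can mention "study" many times without linking to one.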

---


# Importing necessary libraries for text processing
import re  # To clean and normalize text by removing special characters
import pandas as pd  # To handle tabular data in a structured format
import nltk  # Natural Language Toolkit for language-related tasks
from nltk.corpus import stopwords  # For removing common stopwords (e.g., 'the', 'is', etc.)
from textblob import TextBlob  # For sentiment analysis
import spacy  # For advanced natural language processing
from collections import Counter  # For counting keyword frequencies

# Step 1: Download and ensure all necessary NLP resources are available
def initialize_nltk_resources():
    """
    Ensures all required NLTK resources are available for the program.
    - Purpose: Downloads tokenization resources, stopwords, and WordNet.
    - Why: These resources are essential for text cleaning and processing tasks.
    """
    try:
        nltk.download('punkt', force=True)  # For breaking sentences into words
        nltk.download('stopwords', force=True)  # To filter out common stopwords
        nltk.download('wordnet', force=True)  # For word synonym and lexical analysis
        nltk.download('omw-1.4', force=True)  # Additional support for synonyms
        print("All necessary NLTK resources downloaded successfully.")
    except Exception as e:
        print(f"Error downloading NLTK resources: {e}")
        raise e

# Step 2: Load SpaCy's language model
def initialize_spacy_model():
    """
    Loads SpaCy's English language model for advanced NLP tasks.
    - Purpose: Provides features like tokenization, named entity recognition, and more.
    - Why: SpaCy handles advanced linguistic features that enhance text processing.
    """
    try:
        return spacy.load("en_core_web_sm")  # Load the SpaCy English model
    except Exception as e:
        print(f"Error loading SpaCy model: {e}")
        raise e

# Step 3: Preprocess text by cleaning and removing noise
def preprocess_text(text, nlp):
    """
    Cleans and preprocesses text data by:
    - Removing special characters.
    - Converting text to lowercase.
    - Removing stopwords using NLTK and SpaCy.
    - Purpose: Ensures the text is clean and ready for analysis.
    """
    try:
        text = re.sub(r'[^\w\s]', '', text)  # Remove special characters like punctuation
        text = re.sub(r'\s+', ' ', text).strip()  # Normalize spaces
        text = text.lower()  # Convert to lowercase for consistency
        doc = nlp(text)  # Use SpaCy to tokenize text
        stop_words = set(stopwords.words('english'))  # Load stopwords
        filtered_tokens = [token.text for token in doc if token.text not in stop_words]
        return ' '.join(filtered_tokens)  # Return cleaned text
    except Exception as e:
        print(f"Error during text preprocessing: {e}")
        raise e

# Step 4: Extract keywords from the text dynamically
def extract_keywords(content, nlp):
    """
    Identifies keywords dynamically based on parts of speech (NOUN, PROPN).
    - Purpose: Highlight important concepts from the text.
    """
    try:
        doc = nlp(content)  # Analyze the text using SpaCy
        keywords = [token.text.lower() for token in doc if token.pos_ in ["NOUN", "PROPN"]]
        return dict(Counter(keywords))  # Count keyword frequencies and return as a dictionary
    except Exception as e:
        print(f"Error extracting keywords: {e}")
        raise e

# Step 5: Perform sentiment analysis on the text
def calculate_sentiment(content):
    """
    Calculates the sentiment polarity of the text.
    - Purpose: Determines whether the text is positive, negative, or neutral.
    - Output: Polarity score between -1 (negative) and 1 (positive).
    """
    try:
        blob = TextBlob(content)  # Use TextBlob for sentiment analysis
        return blob.sentiment.polarity  # Return the polarity score
    except Exception as e:
        print(f"Error calculating sentiment: {e}")
        raise e

# Step 6: Generate sentence-level metadata
def sentence_metadata(content, nlp):
    """
    Provides metadata about each sentence, such as:
    - Sentence length.
    - Whether the sentence is in passive voice.
    """
    try:
        doc = nlp(content)  # Tokenize and analyze the text
        metadata = []  # Store metadata for each sentence
        for sent in doc.sents:
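            # Check if the sentence uses passive voice: SpaCy labels the passive auxiliary
            # (e.g. "was" in "was written") as "auxpass" and a passive subject as "nsubjpass"
            is_passive = any(token.dep_ in ("auxpass", "nsubjpass") for token in sent)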
            metadata.append({
                "sentence": sent.text,
                "length": len(sent.text.split()),  # Word count in the sentence
                "is_passive": is_passive
            })
        return metadata
    except Exception as e:
        print(f"Error generating sentence metadata: {e}")
        raise e

# Step 7: Count citations or references in the text
def count_citations(content):
    """
    Counts the number of references or citations in the text.
    - Purpose: Identify if the content provides sufficient references.
    """
    # Match whole words only, so "resource" or "reporting" are not counted as citations
    citation_pattern = r'\b(?:sources?|references?|citations?|stud(?:y|ies)|reports?)\b'
    return len(re.findall(citation_pattern, content.lower()))

# Step 8: Load the CSV input file
def load_input_data(filename):
    """
    Loads the input CSV file containing webpage content.
    - Purpose: Prepares the data for processing.
    """
    try:
        data = pd.read_csv(filename)  # Load data into a Pandas DataFrame
        print(f"\nLoaded data from '{filename}'. Preview:")
        print(data.head())  # Display the first few rows
        return data
    except FileNotFoundError as e:
        print(f"Error: File '{filename}' not found.")
        raise e

# Step 9: Process the data for analysis
def process_data(input_file, output_file):
    """
    Enhances data with advanced NLP features like:
    - Cleaned content.
    - Keyword counts.
    - Sentiment scores.
    - Sentence metadata.
    - Citation counts.
    """
    initialize_nltk_resources()  # Step 1: Initialize NLP resources
    nlp = initialize_spacy_model()  # Step 2: Load SpaCy model
    data = load_input_data(input_file)  # Step 8: Load input data

    try:
        data.rename(columns={"Content": "Original_Content"}, inplace=True)  # Ensure consistent column names
        # Apply preprocessing and analysis functions
        data["Cleaned_Content"] = data["Original_Content"].apply(lambda x: preprocess_text(x, nlp))
        data["Keyword_Counts"] = data["Original_Content"].apply(lambda x: extract_keywords(x, nlp))
        data["Sentiment_Score"] = data["Original_Content"].apply(calculate_sentiment)
        data["Citations_Count"] = data["Original_Content"].apply(count_citations)
        data["Sentence_Metadata"] = data["Original_Content"].apply(lambda x: sentence_metadata(x, nlp))
        # Add flags based on sentiment and citation thresholds
        data["Sentiment_Flag"] = data["Sentiment_Score"].apply(
            lambda x: "Negative" if x < 0 else "Positive" if x > 0 else "Neutral"
        )
        data["Citation_Flag"] = data["Citations_Count"].apply(
            lambda x: "Low Citations" if x < 5 else "Sufficient Citations"
        )
        # Save the enhanced data
        data.to_csv(output_file, index=False)
        print(f"\nEnhanced data saved to '{output_file}'.")
        print("\nPreview of Enhanced Data:")
        print(data.head())  # Display the processed data
    except Exception as e:
        print(f"Error processing data: {e}")
        raise e

# Step 10: Run the workflow
if __name__ == "__main__":
    input_file = "webpage_content.csv"  # Input CSV file
    output_file = "enhanced_webpage_content.csv"  # Output CSV file
    process_data(input_file, output_file)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


All necessary NLTK resources downloaded successfully.

Loaded data from 'webpage_content.csv'. Preview:
                                                 URL  \
0  https://thatware.co/software-development-servi...   
1  https://thatware.co/business-intelligence-serv...   
2   https://thatware.co/competitor-keyword-analysis/   

                                             Content  
0  Custom Software Development Services - Softwar...  
1  Business Intelligence Services - Competitive A...  
2  SEO Competitor Keyword Analysis - Competitor R...  

Enhanced data saved to 'enhanced_webpage_content.csv'.

Preview of Enhanced Data:
                                                 URL  \
0  https://thatware.co/software-development-servi...   
1  https://thatware.co/business-intelligence-serv...   
2   https://thatware.co/competitor-keyword-analysis/   

                                    Original_Content  \
0  Custom Software Development Services - Softwar...   
1  Business Intelligence Servic

---
#  **Explanation of the Output**
This output represents **processed and enhanced data** from various webpages. The raw webpage content has been cleaned and analyzed to provide insights into its quality, tone, readability, and keyword usage. This enhanced data is saved in a CSV file called `'enhanced_webpage_content.csv'`.

It contains **several columns**, each representing a specific aspect of the analyzed content.

---

### **Column-by-Column Explanation**

#### **Column: `URL`**
- **What it is**:
  - This column lists the address of the webpage from which the data was collected.
- **Why it’s important**:
  - It tells us the source of the content, so we can trace each piece of information back to its webpage.
- **Example**:
  - `https://thatware.co/software-development-services/`
  - This URL links to a page about software development services.

---

#### **Column: `Original_Content`**
- **What it is**:
  - The exact text or content extracted from the webpage. This is how the content appears on the website.
- **Why it’s important**:
  - It provides the unaltered raw text for reference before any cleaning or processing.
- **Example**:
  - `"Custom Software Development Services - Software tailored to your needs."`
  - This is the raw content from the website’s page.

---

#### **Column: `Cleaned_Content`**
- **What it is**:
  - A cleaned and processed version of the `Original_Content`. This version removes unnecessary characters, punctuation, stop words, and formatting to make it easier to analyze.
- **Why it’s important**:
  - Cleaning the content ensures accurate analysis, especially for tasks like keyword density or sentiment analysis.
- **Example**:
  - `"custom software development services software tailored needs"`
  - This processed content is ready for deeper analysis.
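The cleaning step can be sketched in a few lines. This is a simplified illustration, not the pipeline's own function: the `STOP_WORDS` set is a hypothetical stand-in for NLTK's full English list, and a whitespace split replaces SpaCy tokenization.

```python
import re

# Hypothetical mini stop-word list; the pipeline uses NLTK's full English list.
STOP_WORDS = {"to", "your", "the", "is", "a", "of", "and"}

def clean_text(text, stop_words=STOP_WORDS):
    """Strip punctuation, normalize whitespace, lowercase, and drop stop words."""
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation and symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    tokens = text.lower().split()             # whitespace split stands in for SpaCy
    return " ".join(token for token in tokens if token not in stop_words)

cleaned = clean_text("Custom Software Development Services - Software tailored to your needs.")
print(cleaned)
```

With this small stop-word list, the raw example from `Original_Content` reduces to the cleaned string shown above.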

---

#### **Column: `Keyword_Counts`**
- **What it is**:
  - A breakdown of how often each keyword appears in the original content. Keywords are the nouns and proper nouns identified by SpaCy.
- **Why it’s important**:
  - It identifies the focus of the content and flags potential overuse of specific words (keyword stuffing), which can harm SEO rankings.
- **Example**:
  - `{ 'custom': 32, 'software': 105, 'development': 45 }`
  - This tells us the word `software` appears 105 times, which could be excessive.
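When the enhanced CSV is read back, a dictionary column such as `Keyword_Counts` arrives as a string, because the CSV stores the dictionary's text representation. A minimal sketch of turning it back into counts, assuming a cell shaped like the example above (`raw_cell` and `parse_keyword_counts` are hypothetical names):

```python
import ast
from collections import Counter

# Hypothetical cell value, shaped like the example above.
raw_cell = "{'custom': 32, 'software': 105, 'development': 45}"

def parse_keyword_counts(cell):
    """Turn the stringified dictionary stored in the CSV back into a Counter."""
    parsed = ast.literal_eval(cell)  # parses Python literals only, never runs code
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary of keyword counts")
    return Counter(parsed)

counts = parse_keyword_counts(raw_cell)
top_two = counts.most_common(2)  # the two most frequent keywords
print(top_two)
```

`ast.literal_eval` is used instead of `eval` so that a malformed or malicious cell cannot execute code.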

---

#### **Column: `Sentiment_Score`**
- **What it is**:
  - A numerical score from -1 (most negative) to 1 (most positive) that represents the emotional tone of the content. It is the polarity value computed by TextBlob.
- **Why it’s important**:
  - Content with a neutral or negative sentiment may not engage users effectively, while a positive tone is more appealing.
- **Example**:
  - `0.147029`
  - A low score like this suggests a neutral or slightly positive tone.

---

#### **Column: `Citations_Count`**
- **What it is**:
  - The number of times the citation-related words `source`, `reference`, `citation`, `study`, and `report` appear in the content. It is a keyword-based estimate, not a count of links or formal references.
- **Why it’s important**:
  - Citations establish the credibility and trustworthiness of the content. Pages with more citations are often considered more authoritative.
- **Example**:
  - `12`
  - This indicates 12 mentions of citation-related words in the content.
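The pipeline's `count_citations` function counts keyword substrings, so a word like "sources" also matches "source". The sketch below reproduces that heuristic and compares it with a stricter whole-word variant. The whole-word variant and the sample sentence are hypothetical and not part of the pipeline.

```python
import re

CITATION_KEYWORDS = ["source", "reference", "citation", "study", "report"]

def count_substrings(text):
    """Substring heuristic: 'sources' and 'reported' also count."""
    lowered = text.lower()
    return sum(lowered.count(keyword) for keyword in CITATION_KEYWORDS)

# Hypothetical stricter variant that matches whole words only.
WHOLE_WORD = re.compile(r"\b(?:" + "|".join(CITATION_KEYWORDS) + r")\b")

def count_whole_words(text):
    """Whole-word variant: plural and inflected forms are ignored."""
    return len(WHOLE_WORD.findall(text.lower()))

sample = "A 2020 study and one report are the sources."
loose = count_substrings(sample)
strict = count_whole_words(sample)
print(loose, strict)
```

Either way the number measures mentions of citation-related words, so a page that talks about "reports" a lot can score high without linking to any reference.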

---

#### **Column: `Sentence_Metadata`**
- **What it is**:
  - A per-sentence analysis of the content. Each entry records:
    - The sentence text
    - Sentence length in words
    - Whether it’s written in passive voice
- **Why it’s important**:
  - Helps improve readability by identifying overly complex or passive sentences.
- **Example**:
  ```
  [{'sentence': 'Custom Software Development Services - Software tailored to your needs.',
    'length': 10,
    'is_passive': False}]
  ```
  - This metadata tells us the first sentence is 10 words long (the length counts whitespace-separated tokens, so the standalone dash is included) and not written in passive voice.
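A short sketch of how these entries could be used to pick sentences for review. The records below are hypothetical, and the 20-word limit mirrors the threshold used later in the KBT analysis. When the column is read back from the CSV it is a string and needs parsing first (for example with `ast.literal_eval`).

```python
# Hypothetical records shaped like the Sentence_Metadata entries.
metadata = [
    {"sentence": "Our team builds custom software.", "length": 5, "is_passive": False},
    {"sentence": "The platform was designed by our engineers.", "length": 7, "is_passive": True},
    {"sentence": "word " * 24 + "end.", "length": 25, "is_passive": False},
]

def sentences_to_review(records, max_words=20):
    """Return sentences that are passive or longer than max_words."""
    return [
        record["sentence"]
        for record in records
        if record["is_passive"] or record["length"] > max_words
    ]

flagged = sentences_to_review(metadata)
print(len(flagged))
```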

---

#### **Column: `Sentiment_Flag`**
- **What it is**:
  - A simple label indicating whether the sentiment of the content is positive, neutral, or negative.
- **Why it’s important**:
  - Provides a quick overview of the emotional tone of the content.
- **Example**:
  - `"Positive"`
  - This means the content has a generally positive tone.
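A label like this can be derived from the polarity score with two cut-offs. The sketch below is only an illustration: the cut-off values are hypothetical, since the thresholds the pipeline actually applies are not shown in this section.

```python
# Hypothetical cut-offs; the pipeline's real thresholds are not shown here.
POSITIVE_CUTOFF = 0.05
NEGATIVE_CUTOFF = -0.05

def sentiment_flag(polarity):
    """Map a polarity score in [-1, 1] to a coarse label."""
    if polarity > POSITIVE_CUTOFF:
        return "Positive"
    if polarity < NEGATIVE_CUTOFF:
        return "Negative"
    return "Neutral"

label = sentiment_flag(0.147029)  # the example score from Sentiment_Score
print(label)
```

Under these assumed cut-offs, the example score of `0.147029` falls in the positive band.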

---

#### **Column: `Citation_Flag`**
- **What it is**:
  - A label indicating whether the content includes a sufficient number of citations.
- **Why it’s important**:
  - It ensures the content meets credibility standards, especially for professional or informational webpages.
- **Example**:
  - `"Sufficient Citations"`
  - This means the content includes enough citations to be considered credible.

---

### **3. Why Is This Data Useful?**

This enhanced data helps in several ways:

#### **A. Improving Content Quality**
- Identifies overly complex sentences, passive voice, and excessive keywords, allowing you to rewrite the content for better readability and user engagement.

#### **B. SEO Optimization**
- Highlights keyword usage patterns to ensure the content is optimized for search engines without being penalized for keyword stuffing.

#### **C. Sentiment Analysis**
- Ensures the tone of the content aligns with the target audience’s expectations. Positive sentiment is crucial for engaging users.

#### **D. Credibility Check**
- Assesses whether the content includes sufficient citations to establish trustworthiness.

#### **E. Actionable Suggestions**
- The data provides actionable feedback (e.g., reduce certain keywords, simplify sentences), making it easier to improve the content.

---
# **Part 3: Knowledge-Based Trust (KBT) Analysis**
**Purpose**: To analyze the enhanced data for trustworthiness, tone, keyword usage, and readability.

#### **Steps and Functionality**:
1. **`load_enhanced_data(filename)`**:  
   - Loads the enhanced CSV data into a structured format (DataFrame).
   - **Why**: Prepares the input data for analysis.

2. **`initialize_models()`**:  
   - Initializes the tokenizer, tone analysis model, and SpaCy grammar model.
   - **Why**: Provides the tools needed for detailed analysis of each webpage.

3. **`chunk_text_dynamically(text, tokenizer, max_tokens=512)`**:  
   - Splits large text into manageable chunks based on token limits.
   - **Why**: Ensures the text fits within the limitations of the NLP models.

4. **`analyze_chunk(chunk, tone_model, spacy_model)`**:  
   - Analyzes each chunk for:
     - Tone issues.
     - Grammar problems.
     - Keyword density.
   - **Why**: Provides actionable insights on the content's quality.

5. **`dynamic_analysis(row, tokenizer, tone_model, spacy_model)`**:  
   - Aggregates the analysis results from all chunks of a webpage.
   - **Why**: Consolidates findings into a single report for easier interpretation.

6. **`save_results(data, csv_file, json_file)`**:  
   - Saves the analysis results into both CSV and JSON formats.
   - **Why**: Makes the output accessible for different use cases.

7. **`process_trust_scores(input_file, csv_output_file, json_output_file)`**:  
   - Executes the full KBT analysis pipeline, combining all previous steps.
   - **Why**: Produces a final report highlighting key insights like issues, suggestions, and severity scores.
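The chunking in step 3 can be sketched without loading a tokenizer. The notebook's `chunk_text_dynamically` steps through the tokens in windows of `max_tokens - 50`. Because the step equals the window size, consecutive chunks do not share tokens. The `overlap` parameter below is a hypothetical extension showing what real overlap would look like. `chunk_tokens` and the dummy token list are illustrative names.

```python
def chunk_tokens(tokens, max_tokens=512, reserve=50, overlap=0):
    """Split a token list into windows of (max_tokens - reserve) tokens.

    overlap=0 reproduces consecutive, non-overlapping windows.
    A positive overlap repeats that many tokens at the start of the next chunk.
    """
    window = max_tokens - reserve
    if not 0 <= overlap < window:
        raise ValueError("overlap must be between 0 and the window size")
    stride = window - overlap  # how far the window start moves each step
    return [tokens[i:i + window] for i in range(0, len(tokens), stride)]

tokens = [f"tok{i}" for i in range(1000)]  # dummy tokens for illustration
plain = chunk_tokens(tokens)
overlapping = chunk_tokens(tokens, overlap=50)
print([len(chunk) for chunk in plain])
print([len(chunk) for chunk in overlapping])
```

With overlap, the last 50 tokens of one chunk reappear at the start of the next, which helps sentence-level checks that would otherwise be cut at a chunk boundary.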

---


In [None]:
pip install textstat




In [None]:
# Importing necessary libraries
import pandas as pd  # For handling tabular data and reading/writing CSV files
from transformers import AutoTokenizer, pipeline  # For advanced NLP tasks such as tone analysis
from collections import Counter  # To count and manage issues and keyword densities
import spacy  # For advanced natural language processing tasks like grammar analysis
import json  # For saving the results in JSON format

# Step 1: Load Enhanced Data
def load_enhanced_data(filename):
    """
    Loads the input CSV file containing cleaned webpage content.
    Purpose:
        - Converts the CSV into a structured format (DataFrame) for analysis.
    Why:
        - Structured input ensures efficient processing during analysis.
    """
    try:
        # Reading the input CSV file into a Pandas DataFrame
        data = pd.read_csv(filename)
        print(f"Data loaded successfully from '{filename}'.")
        print("Preview of the data:\n", data.head())  # Display the first few rows to verify content
        return data
    except FileNotFoundError:
        raise Exception(f"Error: File '{filename}' not found.")  # Raise an error if the file is missing

# Step 2: Initialize NLP Models
def initialize_models():
    """
    Initializes the required NLP models:
    - Tokenizer: Splits large text into manageable parts.
    - Tone Model: Determines the emotional tone of the content.
    - SpaCy Model: Analyzes grammar and sentence structure.
    Purpose:
        - Provides tools for chunking, analyzing tone, and identifying grammatical structures.
    """
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # Initialize the tokenizer
    tone_model = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion")  # Load tone model
    spacy_model = spacy.load("en_core_web_sm")  # Load SpaCy's English language model
    return tokenizer, tone_model, spacy_model

# Step 3: Chunk Text Dynamically with Token Limit Handling
def chunk_text_dynamically(text, tokenizer, max_tokens=512):
    """
    Splits long text into smaller chunks that fit within the model's token limit.
    Purpose:
        - Prevents errors when processing text that exceeds the model's token limit.
    """
    tokens = tokenizer.tokenize(text)  # Tokenize the text into smaller units
    chunks = []  # Store the resulting chunks
    for i in range(0, len(tokens), max_tokens - 50):  # Step through consecutive, non-overlapping windows; the 50-token margin leaves headroom below max_tokens
        chunk_tokens = tokens[i:i + max_tokens - 50]
        chunks.append(tokenizer.convert_tokens_to_string(chunk_tokens))  # Convert tokens back into text
    return chunks

# Step 4: Analyze Each Chunk
def analyze_chunk(chunk, tone_model, spacy_model):
    """
    Analyzes a single chunk of text for:
    - Tone issues.
    - Grammatical structure and sentence complexity.
    - Keyword density and potential keyword stuffing.
    Purpose:
        - Provides detailed insights for each chunk of content.
    """
    issues = Counter()  # Store detected issues
    suggestions = set()  # Store actionable suggestions
    severity_score = 0  # Overall score indicating the severity of issues in this chunk

    # Tone Analysis
    try:
        # Keep only the first 512 characters (characters, not tokens) as a coarse guard against over-long input
        truncated_chunk = chunk[:512]
        tone = tone_model(truncated_chunk)[0]['label'].lower()  # Perform tone analysis
        # This emotion model returns sadness, joy, love, anger, fear, or surprise; it never returns "negative" or "neutral"
        if tone in ["sadness", "anger", "fear"]:  # Treat these emotions as undesirable tones
            issues["Negative tone detected."] += 1
            suggestions.add("Improve the tone to make it more engaging.")
            severity_score += 2  # Assign higher weight to tone issues
    except Exception as e:
        print(f"Tone analysis failed: {e}")  # Handle tone analysis errors gracefully

    # Grammar and Sentence Analysis
    doc = spacy_model(chunk)  # Analyze text using SpaCy
    processed_sentences = set()  # Avoid duplicate sentence processing
    for sent in doc.sents:
        if sent.text not in processed_sentences:
            processed_sentences.add(sent.text)
            # Check for passive voice usage
            passive_voice_count = sum(1 for token in sent if token.dep_ == "auxpass")
            if passive_voice_count > 0:
                issues["Passive voice detected."] += passive_voice_count
                suggestions.add("Rewrite sentences in active voice.")
                severity_score += passive_voice_count  # Weight based on the count of passive constructions
            # Check for sentence complexity
            if len(sent.text.split()) > 20:
                issues["Contains long/complex sentences."] += 1
                suggestions.add("Simplify long sentences for better readability.")
                severity_score += 1

    # Keyword Density Analysis
    words = chunk.split()  # Split the chunk into individual words
    keyword_counts = Counter(words)  # Count occurrences of each word
    flagged_keywords = {}
    for keyword, count in keyword_counts.items():
        density = (count / len(words)) * 100  # Calculate keyword density as a percentage
        if density > 3:  # Flag keywords with high density
            issues[f"Keyword stuffing detected: {keyword}"] += 1
            flagged_keywords[keyword] = density
            suggestions.add(f"Reduce usage of keyword '{keyword}'.")
            severity_score += 1

    return issues, list(suggestions), flagged_keywords, severity_score

# Step 5: Aggregate Results
def dynamic_analysis(row, tokenizer, tone_model, spacy_model):
    """
    Aggregates analysis results across all chunks of a webpage's content.
    Purpose:
        - Combines insights to provide a holistic view of the content's quality.
    """
    aggregated_issues = Counter()  # Consolidate issues
    aggregated_suggestions = set()  # Consolidate suggestions
    keyword_densities = {}  # Store keyword densities
    severity_scores = []  # Track severity scores for each chunk

    text_chunks = chunk_text_dynamically(row["Cleaned_Content"], tokenizer)
    for chunk in text_chunks:
        chunk_issues, chunk_suggestions, flagged_keywords, chunk_severity_score = analyze_chunk(chunk, tone_model, spacy_model)
        aggregated_issues.update(chunk_issues)
        aggregated_suggestions.update(chunk_suggestions)
        keyword_densities.update(flagged_keywords)
        severity_scores.append(chunk_severity_score)

    # Print a summary for the aggregated results
    summary = {
        "Total Issues": sum(aggregated_issues.values()),
        "Severity Score": sum(severity_scores),
        "Top Issues": aggregated_issues.most_common(3),
        "Top Suggestions": list(aggregated_suggestions)[:3]
    }
    print("Summary for Content Analysis:", summary)

    return aggregated_issues, list(aggregated_suggestions), keyword_densities, sum(severity_scores)

# Step 6: Save Results
def save_results(data, csv_file, json_file):
    """
    Saves the analysis results into CSV and JSON formats.
    Purpose:
        - Enables the client to access results in user-friendly formats.
    """
    data.to_csv(csv_file, index=False)  # Save to CSV
    print(f"Results saved to CSV file: {csv_file}")

    json_output = data.to_dict(orient="records")  # Convert data to a JSON-compatible structure
    with open(json_file, "w") as json_file_obj:
        json.dump(json_output, json_file_obj, indent=4)  # Save as a JSON file
    print(f"Results saved to JSON file: {json_file_obj.name}")

# Step 7: Process and Save Analysis Results
def process_trust_scores(input_file, csv_output_file, json_output_file):
    """
    Processes input content, performs analysis, and saves the results.
    Purpose:
        - Executes the complete analysis pipeline.
    """
    data = load_enhanced_data(input_file)  # Step 1: Load data
    tokenizer, tone_model, spacy_model = initialize_models()  # Step 2: Initialize models

    # Apply analysis to each row of content
    data["Issues"], data["Suggestions"], data["Keyword_Densities"], data["Severity_Score"] = zip(
        *data.apply(lambda row: dynamic_analysis(row, tokenizer, tone_model, spacy_model), axis=1)
    )

    print("\nPreview of Analysis Results:")
    print(data[["Issues", "Suggestions", "Keyword_Densities", "Severity_Score"]].head())  # Show a preview

    save_results(data, csv_output_file, json_output_file)  # Save results

# Step 8: Execute Workflow
if __name__ == "__main__":
    input_file = "enhanced_webpage_content.csv"  # Input data
    csv_output_file = "kbt_refined_output.csv"  # CSV output file
    json_output_file = "kbt_refined_output.json"  # JSON output file
    process_trust_scores(input_file, csv_output_file, json_output_file)


Data loaded successfully from 'enhanced_webpage_content.csv'.
Preview of the data:
                                                  URL  \
0  https://thatware.co/software-development-servi...   
1  https://thatware.co/business-intelligence-serv...   
2   https://thatware.co/competitor-keyword-analysis/   

                                    Original_Content  \
0  Custom Software Development Services - Softwar...   
1  Business Intelligence Services - Competitive A...   
2  SEO Competitor Keyword Analysis - Competitor R...   

                                     Cleaned_Content  \
0  custom software development services software ...   
1  business intelligence services competitive ana...   
2  seo competitor keyword analysis competitor res...   

                                      Keyword_Counts  Sentiment_Score  \
0  {'custom': 32, 'software': 105, 'development':...         0.147029   
1  {'business': 35, 'intelligence': 20, 'services...         0.155916   
2  {'seo': 108, 'compe

Token indices sequence length is longer than the specified maximum sequence length for this model (3473 > 512). Running this sequence through the model will result in indexing errors


Summary for Content Analysis: {'Total Issues': 29, 'Severity Score': 29, 'Top Issues': [('Contains long/complex sentences.', 11), ('Keyword stuffing detected: software', 4), ('Keyword stuffing detected: development', 3)], 'Top Suggestions': ["Reduce usage of keyword 'services'.", 'Simplify long sentences for better readability.', "Reduce usage of keyword 'saas'."]}
Summary for Content Analysis: {'Total Issues': 21, 'Severity Score': 21, 'Top Issues': [('Contains long/complex sentences.', 9), ('Keyword stuffing detected: services', 3), ('Keyword stuffing detected: analysis', 2)], 'Top Suggestions': ["Reduce usage of keyword 'services'.", 'Simplify long sentences for better readability.', "Reduce usage of keyword 'upto'."]}
Summary for Content Analysis: {'Total Issues': 23, 'Severity Score': 23, 'Top Issues': [('Contains long/complex sentences.', 8), ('Keyword stuffing detected: seo', 3), ('Keyword stuffing detected: services', 3)], 'Top Suggestions': ["Reduce usage of keyword 'services'

---
# **Explanation of the Output**

This output is the result of analyzing webpage content for **quality and SEO performance**. It provides:
1. A **summary of issues** identified in the content.
2. A list of **actionable suggestions** to improve the content.
3. Insights into **keyword usage patterns**.
4. A **severity score** to indicate how urgent the issues are.

This analysis helps improve the readability, relevance, and SEO ranking of the content on your webpages.

---

### **Output Breakdown**

#### **1. Summary for Content Analysis**
Each row in this section summarizes the issues, suggestions, and keyword analysis for a specific webpage.

##### **Example Row:**
```
{'Total Issues': 29,
 'Severity Score': 29,
 'Top Issues': [('Contains long/complex sentences.', 11),
                ('Keyword stuffing detected: software', 4),
                ('Keyword stuffing detected: development', 3)],
 'Top Suggestions': ["Reduce usage of keyword 'services'.",
                      'Simplify long sentences for better readability.',
                      "Reduce usage of keyword 'saas'."]
}
```

##### **Explanation:**
1. **Total Issues**:
   - This shows the total number of problems identified in the content.
   - Example: `29` issues found.

2. **Severity Score**:
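   - This is a weighted count of the problems: each long sentence, each passive construction, and each keyword flagged in a chunk adds 1, and each tone issue adds 2. A higher score means more problems to fix.
   - Example: A severity score of `29` matches the `29` total issues, so every issue on this page carried a weight of 1.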

3. **Top Issues**:
   - This lists the most frequent problems.
   - Example:
     - `Contains long/complex sentences (11 times)`: The content has 11 sentences longer than 20 words.
     - `Keyword stuffing detected: software (4 times)`: The word "software" exceeded the 3% density threshold in 4 chunks of the content, which may harm SEO rankings.
     - `Keyword stuffing detected: development (3 times)`: The word "development" exceeded the threshold in 3 chunks.

4. **Top Suggestions**:
   - These are actionable steps to fix the issues. The summary shows three suggestions taken from an unordered set, so they are a sample and not a ranking.
   - Example:
     - `"Reduce usage of keyword 'services'."`: This suggests cutting down on the overuse of the word "services."
     - `"Simplify long sentences for better readability."`: Break complex sentences into shorter ones.
     - `"Reduce usage of keyword 'saas'."`: Lower the frequency of the word "saas" to avoid keyword stuffing.
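The relationship between the two numbers can be sketched directly from the weights in `analyze_chunk`: a tone issue adds 2 to the severity score, while each passive construction, long sentence, and flagged keyword adds 1. The counts passed in below are hypothetical.

```python
def total_issues(tone_issues=0, passive=0, long_sentences=0, flagged_keywords=0):
    """Every detected problem counts once toward Total Issues."""
    return tone_issues + passive + long_sentences + flagged_keywords

def severity_score(tone_issues=0, passive=0, long_sentences=0, flagged_keywords=0):
    """Tone issues are weighted 2; every other problem is weighted 1."""
    return 2 * tone_issues + passive + long_sentences + flagged_keywords

# Hypothetical counts for one page.
with_tone = dict(tone_issues=1, passive=2, long_sentences=3, flagged_keywords=4)
without_tone = dict(tone_issues=0, passive=2, long_sentences=3, flagged_keywords=4)

print(total_issues(**with_tone), severity_score(**with_tone))
print(total_issues(**without_tone), severity_score(**without_tone))
```

The severity score exceeds the issue count by exactly the number of tone issues, so the two values are equal whenever no tone issue was recorded.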

---

#### **2. Preview of Analysis Results**
This section shows detailed results for each analyzed webpage.

##### **Example Columns:**

1. **Issues**:
   - Contains detailed counts of problems in the content.
   - Example:
     ```
     {'Contains long/complex sentences.': 11,
      'Keyword stuffing detected: software': 4,
      'Keyword stuffing detected: development': 3}
     ```
     - Long sentences are a common issue (`11` instances).
     - Overuse of the keywords "software" and "development."

2. **Suggestions**:
   - Lists recommendations to improve the content.
   - Example:
     ```
     ["Reduce usage of keyword 'services'.",
      'Simplify long sentences for better readability.',
      "Reduce usage of keyword 'saas'."]
     ```
     - Each suggestion corresponds to an issue recorded for the page, though not necessarily one of its three most frequent issues.

3. **Keyword_Densities**:
   - Shows the density of each flagged keyword, meaning its share of the words in one chunk of the content, as a percentage. Only keywords above the 3% threshold are recorded.
   - Example:
     ```
     {'custom': 3.13,
      'software': 3.98,
      'development': 3.50}
     ```
     - The keyword "software" makes up `3.98%` of the words in a chunk, which is above the 3% threshold.

4. **Severity_Score**:
   - Indicates how serious the issues are for each webpage.
   - Example:
     - `29` is the highest severity score of the three analyzed pages (the others are `21` and `23`), so this page needs the most work.
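The keyword-density check behind `Keyword_Densities` can be sketched as follows. It follows the same idea as the analysis code: count each word in a chunk, convert the count to a percentage of the chunk's words, and keep only words above 3%. The function name and the sample chunk are hypothetical.

```python
from collections import Counter

def flag_dense_keywords(chunk, threshold=3.0):
    """Return {word: density %} for words above the density threshold."""
    words = chunk.split()
    if not words:  # an empty chunk has nothing to flag
        return {}
    flagged = {}
    for word, count in Counter(words).items():
        density = count * 100 / len(words)  # share of the chunk, in percent
        if density > threshold:
            flagged[word] = density
    return flagged

# Hypothetical 100-word chunk in which "software" appears 4 times.
chunk = " ".join(["software"] * 4 + [f"filler{i}" for i in range(96)])
flagged = flag_dense_keywords(chunk)
print(flagged)
```

Because density is measured per chunk, a keyword can be flagged in one part of a page and not in another.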

---

### **How to Present This to a Client**

#### **Key Points to Emphasize:**
1. **Purpose**:
   - "This analysis highlights the strengths and weaknesses of your webpage content. It helps you understand where improvements are needed for better readability, SEO, and user engagement."

2. **Explanation**:
   - **Total Issues**:
     - "The number of problems found in the content, such as long sentences or overused keywords."
   - **Severity Score**:
     - "A measure of how urgent the fixes are. A higher score means the content needs more attention."
   - **Top Issues**:
     - "The most frequent problems, like keyword stuffing or overly complex sentences."
   - **Top Suggestions**:
     - "Specific recommendations to improve the content."

3. **Value**:
   - "By addressing these issues, your content will be more engaging, SEO-friendly, and easier to read."

---

### **Final Summary**
This output provides a comprehensive evaluation of webpage content. It identifies issues, prioritizes them by severity, and offers actionable suggestions. Using this data, you can make informed decisions to optimize your content for better user engagement and search engine rankings.
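One way to act on the results is to rank pages by severity and start with the worst one. The sketch below uses the three severity scores reported in the summaries above, paired with the URLs in their input order. The `rank_by_severity` helper is a hypothetical name.

```python
# Severity scores from the summaries above, paired with URLs in input order.
results = [
    {"URL": "https://thatware.co/software-development-services/", "Severity_Score": 29},
    {"URL": "https://thatware.co/business-intelligence-services/", "Severity_Score": 21},
    {"URL": "https://thatware.co/competitor-keyword-analysis/", "Severity_Score": 23},
]

def rank_by_severity(records):
    """Return the records sorted from highest to lowest severity."""
    return sorted(records, key=lambda record: record["Severity_Score"], reverse=True)

ranked = rank_by_severity(results)
for record in ranked:
    print(record["Severity_Score"], record["URL"])
```

The same function should work on the records in `kbt_refined_output.json` after loading them with `json.load`, since that file is saved as a list of per-page dictionaries.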

In [None]:
# Importing required libraries
import requests  # To send HTTP requests and fetch webpage content
from bs4 import BeautifulSoup  # To parse HTML and extract readable text
import csv  # To save the scraped data in a CSV file
import pandas as pd  # To display a preview of the data

# List of URLs to scrape (provided by the user)
urls = [
    'https://thatware.co/software-development-services/',
    'https://thatware.co/business-intelligence-services/',
    'https://thatware.co/competitor-keyword-analysis/'
]

# Step 1: Function to fetch webpage content from a given URL
def fetch_content(url):
    """
    Fetch HTML content from a webpage.
    - Purpose: This function sends an HTTP request to the given URL to fetch the raw HTML content of the page.
    - Why: Without this step, we wouldn't have any data to process or analyze.
    """
    try:
        response = requests.get(url, timeout=10)  # Sending a GET request with a 10-second timeout
        response.raise_for_status()  # Ensures the request was successful; raises an error otherwise
        return response.text  # Return the raw HTML content of the page
    except requests.exceptions.RequestException as e:
        # Print an error message if the request fails
        print(f"Error fetching URL {url}: {e}")
        return None  # Return None so the process can continue even if one URL fails

# Step 2: Function to extract meaningful text from HTML content
def extract_text_from_html(html_content):
    """
    Extract visible text from HTML content.
    - Purpose: This function removes non-visible elements like scripts and styles from the HTML.
    - Why: It ensures that only the main content of the webpage is extracted for analysis.
    """
    soup = BeautifulSoup(html_content, 'html.parser')  # Parse the HTML content using BeautifulSoup
    # Drop script and style elements explicitly so their contents never reach the extracted text
    for element in soup(['script', 'style']):
        element.decompose()
    # Combine the remaining visible text into a single space-separated string
    return soup.get_text(separator=' ', strip=True)

# Step 3: Function to scrape all URLs and store their content
def scrape_webpages(urls):
    """
    Scrape webpage content for multiple URLs.
    - Purpose: This function loops through each URL, fetches its content, and cleans it.
    - Why: It structures the process of collecting and organizing webpage data for easy analysis later.
    """
    webpage_data = []  # List to store scraped data for each URL
    for url in urls:
        print(f"Scraping URL: {url}")  # Notify the user of the current URL being processed
        html_content = fetch_content(url)  # Step 1: Fetch HTML content
        if html_content:  # Ensure we have valid content before proceeding
            text_content = extract_text_from_html(html_content)  # Step 2: Extract clean text
            # Store the URL and cleaned content in a dictionary
            webpage_data.append({'URL': url, 'Content': text_content})
    return webpage_data  # Return the scraped data as a list of dictionaries

# Step 4: Save the scraped data to a CSV file
def save_to_csv(data, filename='webpage_content.csv'):
    """
    Save scraped data to a CSV file.
    - Purpose: This function saves the collected webpage data into a structured format (CSV).
    - Why: The CSV format is easy to open, share, and analyze using tools like Excel or Python.
    """
    try:
        # Open the CSV file in write mode with UTF-8 encoding to handle special characters
        with open(filename, mode='w', newline='', encoding='utf-8') as file:
            # Create a CSV writer object and define the columns
            writer = csv.DictWriter(file, fieldnames=['URL', 'Content'])
            writer.writeheader()  # Write the column headers to the file
            writer.writerows(data)  # Write each row of data
        print(f"Data successfully saved to {filename}.")  # Confirm success to the user
    except Exception as e:
        # Handle any file-saving issues
        print(f"Error saving data to CSV: {e}")

# Step 5: Display a preview of the scraped data
def preview_data(data, num_rows=5):
    """
    Display a preview of the scraped data in tabular format.
    - Purpose: Show a quick preview of the data to ensure it's correctly scraped before moving forward.
    - Why: This helps validate the data and catch issues early on.
    """
    try:
        df = pd.DataFrame(data)  # Convert the scraped data into a Pandas DataFrame for tabular representation
        print("\nPreview of Scraped Data:\n")
        print(df.head(num_rows))  # Display the first few rows of the data
    except Exception as e:
        # Handle issues with data preview
        print(f"Error displaying preview: {e}")

# Main process: This is where all the functions come together
if __name__ == "__main__":
    # Step 3: Scrape the URLs and get the data
    scraped_data = scrape_webpages(urls)

    # Step 4: Save the scraped data into a CSV file
    save_to_csv(scraped_data)

    # Step 5: Display a preview of the scraped data
    preview_data(scraped_data)


# Importing necessary libraries for text processing
import re  # To clean and normalize text by removing special characters
import pandas as pd  # To handle tabular data in a structured format
import nltk  # Natural Language Toolkit for language-related tasks
from nltk.corpus import stopwords  # For removing common stopwords (e.g., 'the', 'is', etc.)
from textblob import TextBlob  # For sentiment analysis
import spacy  # For advanced natural language processing
from collections import Counter  # For counting keyword frequencies

# Step 1: Download and ensure all necessary NLP resources are available
def initialize_nltk_resources():
    """
    Ensures all required NLTK resources are available for the program.
    - Purpose: Downloads tokenization resources, stopwords, and WordNet.
    - Why: These resources are essential for text cleaning and processing tasks.
    """
    try:
        nltk.download('punkt', force=True)  # For breaking sentences into words
        nltk.download('stopwords', force=True)  # To filter out common stopwords
        nltk.download('wordnet', force=True)  # For word synonym and lexical analysis
        nltk.download('omw-1.4', force=True)  # Additional support for synonyms
        print("All necessary NLTK resources downloaded successfully.")
    except Exception as e:
        print(f"Error downloading NLTK resources: {e}")
        raise e

# Step 2: Load SpaCy's language model
def initialize_spacy_model():
    """
    Loads SpaCy's English language model for advanced NLP tasks.
    - Purpose: Provides features like tokenization, named entity recognition, and more.
    - Why: SpaCy handles advanced linguistic features that enhance text processing.
    """
    try:
        return spacy.load("en_core_web_sm")  # Load the SpaCy English model
    except Exception as e:
        print(f"Error loading SpaCy model: {e}")
        raise e

# Step 3: Preprocess text by cleaning and removing noise
def preprocess_text(text, nlp):
    """
    Cleans and preprocesses text data by:
    - Removing special characters.
    - Converting text to lowercase.
    - Removing stopwords (SpaCy tokenizes the text, NLTK supplies the stopword list).
    - Purpose: Ensures the text is clean and ready for analysis.
    """
    try:
        text = re.sub(r'[^\w\s]', '', text)  # Remove special characters like punctuation
        text = re.sub(r'\s+', ' ', text).strip()  # Normalize spaces
        text = text.lower()  # Convert to lowercase for consistency
        doc = nlp(text)  # Use SpaCy to tokenize text
        stop_words = set(stopwords.words('english'))  # Load stopwords
        filtered_tokens = [token.text for token in doc if token.text not in stop_words]
        return ' '.join(filtered_tokens)  # Return cleaned text
    except Exception as e:
        print(f"Error during text preprocessing: {e}")
        raise e

# Step 4: Extract keywords from the text dynamically
def extract_keywords(content, nlp):
    """
    Identifies keywords dynamically based on parts of speech (NOUN, PROPN).
    - Purpose: Highlight important concepts from the text.
    """
    try:
        doc = nlp(content)  # Analyze the text using SpaCy
        keywords = [token.text.lower() for token in doc if token.pos_ in ["NOUN", "PROPN"]]
        return dict(Counter(keywords))  # Count keyword frequencies and return as a dictionary
    except Exception as e:
        print(f"Error extracting keywords: {e}")
        raise e

# Step 5: Perform sentiment analysis on the text
def calculate_sentiment(content):
    """
    Calculates the sentiment polarity of the text.
    - Purpose: Determines whether the text is positive, negative, or neutral.
    - Output: Polarity score between -1 (negative) and 1 (positive).
    """
    try:
        blob = TextBlob(content)  # Use TextBlob for sentiment analysis
        return blob.sentiment.polarity  # Return the polarity score
    except Exception as e:
        print(f"Error calculating sentiment: {e}")
        raise e

# Step 6: Generate sentence-level metadata
def sentence_metadata(content, nlp):
    """
    Provides metadata about each sentence, such as:
    - Sentence length.
    - Whether the sentence is in passive voice.
    """
    try:
        doc = nlp(content)  # Tokenize and analyze the text
        metadata = []  # Store metadata for each sentence
        for sent in doc.sents:
            # A sentence counts as passive if any token is a passive auxiliary (dependency label "auxpass")
            is_passive = any(token.dep_ == "auxpass" for token in sent)
            metadata.append({
                "sentence": sent.text,
                "length": len(sent.text.split()),  # Word count in the sentence
                "is_passive": is_passive
            })
        return metadata
    except Exception as e:
        print(f"Error generating sentence metadata: {e}")
        raise e

# Step 7: Count citations or references in the text
def count_citations(content):
    """
    Counts the number of references or citations in the text.
    - Purpose: Identify if the content provides sufficient references.
    """
    citation_keywords = ["source", "reference", "citation", "study", "report"]
    return sum(content.lower().count(keyword) for keyword in citation_keywords)

# Step 8: Load the CSV input file
def load_input_data(filename):
    """
    Loads the input CSV file containing webpage content.
    - Purpose: Prepares the data for processing.
    """
    try:
        data = pd.read_csv(filename)  # Load data into a Pandas DataFrame
        print(f"\nLoaded data from '{filename}'. Preview:")
        print(data.head())  # Display the first few rows
        return data
    except FileNotFoundError as e:
        print(f"Error: File '{filename}' not found.")
        raise e

# Step 9: Process the data for analysis
def process_data(input_file, output_file):
    """
    Enhances data with advanced NLP features like:
    - Cleaned content.
    - Keyword counts.
    - Sentiment scores.
    - Sentence metadata.
    - Citation counts.
    """
    initialize_nltk_resources()  # Step 1: Initialize NLP resources
    nlp = initialize_spacy_model()  # Step 2: Load SpaCy model
    data = load_input_data(input_file)  # Step 8: Load input data

    try:
        data.rename(columns={"Content": "Original_Content"}, inplace=True)  # Ensure consistent column names
        # Apply preprocessing and analysis functions
        data["Cleaned_Content"] = data["Original_Content"].apply(lambda x: preprocess_text(x, nlp))
        data["Keyword_Counts"] = data["Original_Content"].apply(lambda x: extract_keywords(x, nlp))
        data["Sentiment_Score"] = data["Original_Content"].apply(calculate_sentiment)
        data["Citations_Count"] = data["Original_Content"].apply(count_citations)
        data["Sentence_Metadata"] = data["Original_Content"].apply(lambda x: sentence_metadata(x, nlp))
        # Add flags based on sentiment and citation thresholds
        data["Sentiment_Flag"] = data["Sentiment_Score"].apply(
            lambda x: "Negative" if x < 0 else "Positive" if x > 0 else "Neutral"
        )
        data["Citation_Flag"] = data["Citations_Count"].apply(
            lambda x: "Low Citations" if x < 5 else "Sufficient Citations"
        )
        # Save the enhanced data
        data.to_csv(output_file, index=False)
        print(f"\nEnhanced data saved to '{output_file}'.")
        print("\nPreview of Enhanced Data:")
        print(data.head())  # Display the processed data
    except Exception as e:
        print(f"Error processing data: {e}")
        raise e

# Step 10: Run the workflow
if __name__ == "__main__":
    input_file = "webpage_content.csv"  # Input CSV file
    output_file = "enhanced_webpage_content.csv"  # Output CSV file
    process_data(input_file, output_file)



# Importing necessary libraries
import pandas as pd  # For handling tabular data and reading/writing CSV files
from transformers import AutoTokenizer, pipeline  # For advanced NLP tasks such as tone analysis
from collections import Counter  # To count and manage issues and keyword densities
import spacy  # For advanced natural language processing tasks like grammar analysis
import json  # For saving the results in JSON format

# Step 1: Load Enhanced Data
def load_enhanced_data(filename):
    """
    Loads the input CSV file containing cleaned webpage content.
    Purpose:
        - Converts the CSV into a structured format (DataFrame) for analysis.
    Why:
        - Structured input ensures efficient processing during analysis.
    """
    try:
        # Reading the input CSV file into a Pandas DataFrame
        data = pd.read_csv(filename)
        print(f"Data loaded successfully from '{filename}'.")
        print("Preview of the data:\n", data.head())  # Display the first few rows to verify content
        return data
    except FileNotFoundError:
        raise FileNotFoundError(f"Error: File '{filename}' not found.")  # Raise a specific error if the file is missing

# Step 2: Initialize NLP Models
def initialize_models():
    """
    Initializes the required NLP models:
    - Tokenizer: Splits large text into manageable parts.
    - Tone Model: Determines the emotional tone of the content.
    - SpaCy Model: Analyzes grammar and sentence structure.
    Purpose:
        - Provides tools for chunking, analyzing tone, and identifying grammatical structures.
    """
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # Initialize the tokenizer
    tone_model = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion")  # Load tone model
    spacy_model = spacy.load("en_core_web_sm")  # Load SpaCy's English language model
    return tokenizer, tone_model, spacy_model

# Step 3: Chunk Text Dynamically with Token Limit Handling
def chunk_text_dynamically(text, tokenizer, max_tokens=512):
    """
    Splits long text into smaller chunks that fit within the model's token limit.
    Purpose:
        - Prevents errors when processing text that exceeds the model's token limit.
    """
    tokens = tokenizer.tokenize(text)  # Tokenize the text into smaller units
    chunks = []  # Store the resulting chunks
    step = max_tokens - 50  # Stay below the model limit, leaving headroom (e.g., for special tokens)
    for i in range(0, len(tokens), step):  # Split into consecutive, non-overlapping chunks
        chunk_tokens = tokens[i:i + step]
        chunks.append(tokenizer.convert_tokens_to_string(chunk_tokens))  # Convert tokens back into text
    return chunks

# Step 4: Analyze Each Chunk
def analyze_chunk(chunk, tone_model, spacy_model):
    """
    Analyzes a single chunk of text for:
    - Tone issues.
    - Grammatical structure and sentence complexity.
    - Keyword density and potential keyword stuffing.
    Purpose:
        - Provides detailed insights for each chunk of content.
    """
    issues = Counter()  # Store detected issues
    suggestions = set()  # Store actionable suggestions
    severity_score = 0  # Overall score indicating the severity of issues in this chunk

    # Tone Analysis
    try:
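        truncated_chunk = chunk[:512]  # Rough guard: keeps the first 512 characters (not tokens)
        tone = tone_model(truncated_chunk)[0]['label'].lower()  # Perform tone analysis
        # This emotion model returns labels such as "sadness", "anger", and "fear"
        # (not "negative" or "neutral"), so check for the negative emotions
        if tone in ["sadness", "anger", "fear"]:  # Identify undesirable tones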
            issues["Negative/neutral tone detected."] += 1
            suggestions.add("Improve the tone to make it more engaging.")
            severity_score += 2  # Assign higher weight to tone issues
    except Exception as e:
        print(f"Tone analysis failed: {e}")  # Handle tone analysis errors gracefully

    # Grammar and Sentence Analysis
    doc = spacy_model(chunk)  # Analyze text using SpaCy
    processed_sentences = set()  # Avoid duplicate sentence processing
    for sent in doc.sents:
        if sent.text not in processed_sentences:
            processed_sentences.add(sent.text)
            # Check for passive voice usage
            passive_voice_count = sum(1 for token in sent if token.dep_ == "auxpass")
            if passive_voice_count > 0:
                issues["Passive voice detected."] += passive_voice_count
                suggestions.add("Rewrite sentences in active voice.")
                severity_score += passive_voice_count  # Weight based on the count of passive constructions
            # Check for sentence complexity
            if len(sent.text.split()) > 20:
                issues["Contains long/complex sentences."] += 1
                suggestions.add("Simplify long sentences for better readability.")
                severity_score += 1

    # Keyword Density Analysis
    words = chunk.split()  # Split the chunk into individual words
    keyword_counts = Counter(words)  # Count occurrences of each word
    flagged_keywords = {}
    for keyword, count in keyword_counts.items():
        density = (count / len(words)) * 100  # Calculate keyword density as a percentage
        if density > 3:  # Flag keywords with high density
            issues[f"Keyword stuffing detected: {keyword}"] += 1
            flagged_keywords[keyword] = density
            suggestions.add(f"Reduce usage of keyword '{keyword}'.")
            severity_score += 1

    return issues, list(suggestions), flagged_keywords, severity_score

# Step 5: Aggregate Results
def dynamic_analysis(row, tokenizer, tone_model, spacy_model):
    """
    Aggregates analysis results across all chunks of a webpage's content.
    Purpose:
        - Combines insights to provide a holistic view of the content's quality.
    """
    aggregated_issues = Counter()  # Consolidate issues
    aggregated_suggestions = set()  # Consolidate suggestions
    keyword_densities = {}  # Store keyword densities
    severity_scores = []  # Track severity scores for each chunk

    text_chunks = chunk_text_dynamically(row["Cleaned_Content"], tokenizer)
    for chunk in text_chunks:
        chunk_issues, chunk_suggestions, flagged_keywords, chunk_severity_score = analyze_chunk(chunk, tone_model, spacy_model)
        aggregated_issues.update(chunk_issues)
        aggregated_suggestions.update(chunk_suggestions)
        keyword_densities.update(flagged_keywords)
        severity_scores.append(chunk_severity_score)

    # Print a summary for the aggregated results
    summary = {
        "Total Issues": sum(aggregated_issues.values()),
        "Severity Score": sum(severity_scores),
        "Top Issues": aggregated_issues.most_common(3),
        "Top Suggestions": list(aggregated_suggestions)[:3]
    }
    print("Summary for Content Analysis:", summary)

    return aggregated_issues, list(aggregated_suggestions), keyword_densities, sum(severity_scores)

# Step 6: Save Results
def save_results(data, csv_file, json_file):
    """
    Saves the analysis results into CSV and JSON formats.
    Purpose:
        - Enables the client to access results in user-friendly formats.
    """
    data.to_csv(csv_file, index=False)  # Save to CSV
    print(f"Results saved to CSV file: {csv_file}")

    json_output = data.to_dict(orient="records")  # Convert data to a JSON-compatible structure
    with open(json_file, "w") as json_file_obj:
        json.dump(json_output, json_file_obj, indent=4)  # Save as a JSON file
    print(f"Results saved to JSON file: {json_file_obj.name}")

# Step 7: Process and Save Analysis Results
def process_trust_scores(input_file, csv_output_file, json_output_file):
    """
    Processes input content, performs analysis, and saves the results.
    Purpose:
        - Executes the complete analysis pipeline.
    """
    data = load_enhanced_data(input_file)  # Step 1: Load data
    tokenizer, tone_model, spacy_model = initialize_models()  # Step 2: Initialize models

    # Apply analysis to each row of content
    data["Issues"], data["Suggestions"], data["Keyword_Densities"], data["Severity_Score"] = zip(
        *data.apply(lambda row: dynamic_analysis(row, tokenizer, tone_model, spacy_model), axis=1)
    )

    print("\nPreview of Analysis Results:")
    print(data[["Issues", "Suggestions", "Keyword_Densities", "Severity_Score"]].head())  # Show a preview

    save_results(data, csv_output_file, json_output_file)  # Save results

# Step 8: Execute Workflow
if __name__ == "__main__":
    input_file = "enhanced_webpage_content.csv"  # Input data
    csv_output_file = "kbt_refined_output.csv"  # CSV output file
    json_output_file = "kbt_refined_output.json"  # JSON output file
    process_trust_scores(input_file, csv_output_file, json_output_file)


Scraping URL: https://thatware.co/software-development-services/
Scraping URL: https://thatware.co/business-intelligence-services/
Scraping URL: https://thatware.co/competitor-keyword-analysis/
Data successfully saved to webpage_content.csv.

Preview of Scraped Data:

                                                 URL  \
0  https://thatware.co/software-development-servi...   
1  https://thatware.co/business-intelligence-serv...   
2   https://thatware.co/competitor-keyword-analysis/   

                                             Content  
0  Custom Software Development Services - Softwar...  
1  Business Intelligence Services - Competitive A...  
2  SEO Competitor Keyword Analysis - Competitor R...  


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


All necessary NLTK resources downloaded successfully.

Loaded data from 'webpage_content.csv'. Preview:
                                                 URL  \
0  https://thatware.co/software-development-servi...   
1  https://thatware.co/business-intelligence-serv...   
2   https://thatware.co/competitor-keyword-analysis/   

                                             Content  
0  Custom Software Development Services - Softwar...  
1  Business Intelligence Services - Competitive A...  
2  SEO Competitor Keyword Analysis - Competitor R...  

Enhanced data saved to 'enhanced_webpage_content.csv'.

Preview of Enhanced Data:
                                                 URL  \
0  https://thatware.co/software-development-servi...   
1  https://thatware.co/business-intelligence-serv...   
2   https://thatware.co/competitor-keyword-analysis/   

                                    Original_Content  \
0  Custom Software Development Services - Softwar...   
1  Business Intelligence Servic

Token indices sequence length is longer than the specified maximum sequence length for this model (3473 > 512). Running this sequence through the model will result in indexing errors


Summary for Content Analysis: {'Total Issues': 29, 'Severity Score': 29, 'Top Issues': [('Contains long/complex sentences.', 11), ('Keyword stuffing detected: software', 4), ('Keyword stuffing detected: development', 3)], 'Top Suggestions': ["Reduce usage of keyword 'services'.", 'Simplify long sentences for better readability.', "Reduce usage of keyword 'saas'."]}
Summary for Content Analysis: {'Total Issues': 21, 'Severity Score': 21, 'Top Issues': [('Contains long/complex sentences.', 9), ('Keyword stuffing detected: services', 3), ('Keyword stuffing detected: analysis', 2)], 'Top Suggestions': ["Reduce usage of keyword 'services'.", 'Simplify long sentences for better readability.', "Reduce usage of keyword 'upto'."]}
Summary for Content Analysis: {'Total Issues': 23, 'Severity Score': 23, 'Top Issues': [('Contains long/complex sentences.', 8), ('Keyword stuffing detected: seo', 3), ('Keyword stuffing detected: services', 3)], 'Top Suggestions': ["Reduce usage of keyword 'services'

This output is an **analysis of webpage content** produced by the preprocessing step. It covers keyword usage, sentiment, sentence structure, and citations. Here's a column-by-column explanation of the output and how this data can benefit the website owner.

---

### **Column-by-Column Explanation**

1. **`Keyword_Counts`**:
   - **What it is**:  
     This column holds a dictionary that maps each noun and proper noun in the webpage's content to the number of times it appears. For example:
     - Row 0: Keywords like "custom" appear 32 times, "software" appears 105 times, and "development" appears frequently as well.
     - Row 1: Keywords like "business" (35 times) and "intelligence" (20 times) dominate the content.
     - Row 2: Keywords like "SEO" (108 times), "competitor" (34 times), and "keyword" (55 times) are the most repeated.
   - **Purpose**:  
     This helps identify the main focus or theme of the content. Keywords that are repeated too often might indicate "keyword stuffing," which can hurt SEO rankings.
   - **Actionable Steps for Website Owners**:  
     - Use this data to ensure no keywords are overused or underused.
     - Optimize content by spreading keywords naturally throughout the text.
     - Consider introducing synonyms or related terms for more variety.
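
Because `Keyword_Counts` is saved to CSV, each dictionary comes back as a string when the file is reloaded. The sketch below shows one way to parse such a cell and list its most frequent keywords. `top_keywords` is a hypothetical helper, not part of the notebook, and the sample counts are the Row 2 values quoted above.

```python
import ast
from collections import Counter

def top_keywords(cell, n=3):
    """Return the n most frequent keywords from a Keyword_Counts cell.

    `cell` may be a dict or the string form left by a CSV round trip.
    """
    counts = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return Counter(counts).most_common(n)

# Sample cell in string form, using the Row 2 counts quoted above
row_2 = "{'seo': 108, 'competitor': 34, 'keyword': 55}"
print(top_keywords(row_2, n=2))  # [('seo', 108), ('keyword', 55)]
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for cells read from a file.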

---

2. **`Sentiment_Score`**:
   - **What it is**:  
     A numerical value measuring the sentiment or emotional tone of the content. Scores range from -1 (negative tone) to +1 (positive tone).
     - Row 0: Score of 0.147029 indicates a mildly positive tone.
     - Row 1: Score of 0.155916 suggests a slightly more positive tone.
     - Row 2: Score of 0.199583 reflects the most positive tone among the rows.
   - **Purpose**:  
     This measures how the content might be perceived by readers. A positive sentiment encourages trust and engagement, while a neutral or negative tone might discourage readers.
   - **Actionable Steps for Website Owners**:  
     - If sentiment scores are low (neutral or negative), rewrite sections to include more positive language.
     - Focus on words that convey benefits, trust, and clarity to engage readers better.

---

3. **`Citations_Count`**:
   - **What it is**:  
     The number of times citation-related words ("source," "reference," "citation," "study," or "report") appear in the content. These are keyword matches, not verified references. For example:
     - Row 0: Contains 12 matches.
     - Row 1: Contains 11 matches.
     - Row 2: Contains 7 matches, which is lower but still above the threshold of 5.
   - **Purpose**:  
     Citations enhance the credibility and authority of the content, especially when discussing technical topics or research-based insights.
   - **Actionable Steps for Website Owners**:  
     - Ensure at least 5 citations in every article to meet "Sufficient Citations" criteria.
     - Add hyperlinks or references to external credible sources to improve trustworthiness.
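
The notebook's `count_citations` uses plain substring counting, so a word like "resources" also counts as a match for "source". The following is a minimal sketch of a stricter whole-word variant. `count_citation_terms` is a hypothetical name, and treating a trailing "s" as a plural is an assumption made for illustration.

```python
import re

CITATION_TERMS = ["source", "reference", "citation", "study", "report"]

def count_citation_terms(content, terms=CITATION_TERMS):
    """Count whole-word matches of citation terms, allowing an optional plural 's'."""
    pattern = r"\b(?:" + "|".join(map(re.escape, terms)) + r")s?\b"
    return len(re.findall(pattern, content.lower()))

text = "Our resources cite one study and two reports from a trusted source."
print(count_citation_terms(text))  # 3: "study", "reports", "source"
```

Irregular plurals such as "studies" are not matched by this pattern, so the term list would need extending if those matter.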

---

4. **`Sentence_Metadata`**:
   - **What it is**:  
     A detailed breakdown of individual sentences in the content. For each sentence, it shows:
       - The text of the sentence.
       - The length of the sentence (in words).
       - Whether the sentence is written in passive voice.
     - Example from Row 0:  
       - Sentence: "Custom Software Development Services."  
       - Length: 5 words.  
       - Passive Voice: No.
   - **Purpose**:  
     Identifies structural issues, such as long sentences that are hard to read or sentences written in passive voice, which can feel impersonal.
   - **Actionable Steps for Website Owners**:  
     - Rewrite long sentences (over 20 words) into shorter ones for better readability.
     - Convert passive sentences into active voice for a more engaging tone.
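
As a rough illustration of how this column can drive edits, the sketch below filters a metadata list for sentences that are long or passive. `sentences_to_revise` is a hypothetical helper, and the second sample sentence is invented for the example.

```python
def sentences_to_revise(metadata, max_words=20):
    """Return sentences that exceed max_words or are passive, with the reasons."""
    flagged = []
    for item in metadata:
        reasons = []
        if item["length"] > max_words:
            reasons.append("long")
        if item["is_passive"]:
            reasons.append("passive")
        if reasons:
            flagged.append({"sentence": item["sentence"], "reasons": reasons})
    return flagged

# Sample metadata in the shape that sentence_metadata() produces
sample = [
    {"sentence": "Custom Software Development Services.", "length": 5, "is_passive": False},
    {"sentence": "The report was written by our team.", "length": 7, "is_passive": True},
]
print(sentences_to_revise(sample))
# [{'sentence': 'The report was written by our team.', 'reasons': ['passive']}]
```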

---

5. **`Sentiment_Flag`**:
   - **What it is**:  
     A simple label for sentiment:
       - "Positive" for positive sentiment scores.
       - "Neutral" for a score of exactly 0.
       - "Negative" for negative scores.
     - All rows in this example are flagged as "Positive."
   - **Purpose**:  
     Provides a quick, easy-to-understand summary of the sentiment.
   - **Actionable Steps for Website Owners**:  
     - Ensure all content has a positive sentiment flag.
     - If flagged as "Neutral" or "Negative," revise to include optimistic, motivating language.
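
In the notebook code, only a score of exactly 0 is labeled "Neutral". The sketch below reproduces that rule and adds an optional `tolerance` parameter, which is an assumption introduced here to show how near-zero scores could also be treated as neutral.

```python
def sentiment_flag(score, tolerance=0.0):
    """Label a polarity score; scores within +/- tolerance of 0 count as Neutral."""
    if score > tolerance:
        return "Positive"
    if score < -tolerance:
        return "Negative"
    return "Neutral"

print(sentiment_flag(0.147029))              # Positive (Row 0 score)
print(sentiment_flag(0.03, tolerance=0.05))  # Neutral
```

With the default `tolerance=0.0`, the function behaves like the lambda used for `Sentiment_Flag`.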

---

6. **`Citation_Flag`**:
   - **What it is**:  
     A label indicating whether the content has enough citations:
       - "Sufficient Citations" for citation counts >= 5.
       - "Low Citations" for counts < 5.
     - All rows in this example are flagged as "Sufficient Citations."
   - **Purpose**:  
     Ensures content meets credibility standards.
   - **Actionable Steps for Website Owners**:  
     - Strive to maintain "Sufficient Citations" for all pages.
     - Add references if any pages are flagged as "Low Citations."

---

### **What Does This Output Convey?**
1. **Content Focus**:
   - The `Keyword_Counts` column highlights the main focus of each webpage. For example:
     - Row 0 focuses on "custom software development."
     - Row 1 emphasizes "business intelligence services."
     - Row 2 targets "SEO competitor keyword analysis."
   - This helps ensure the content aligns with the intended message and SEO goals.

2. **Readability and Tone**:
   - Sentiment scores and sentence metadata help measure how approachable and engaging the content is.
   - Shorter, active-voice sentences with a positive tone are more likely to keep readers interested.

3. **Credibility**:
   - Citation counts and flags indicate whether the content is supported by reliable sources, making it trustworthy.

---

### **How Is This Beneficial for Website Owners?**
1. **Improves SEO**:  
   - Balanced keyword usage and clear, positive language can support better search engine rankings.
   - Well-referenced content increases the page's authority.

2. **Increases Engagement**:  
   - Positive sentiment and readability improvements make content more engaging.
   - Readers are more likely to trust and share credible, well-structured content.

3. **Enhances Conversion Rates**:  
   - Clear, positive, and factual content can convert casual readers into customers or clients.

---

### **Steps to Take After Getting This Output**
1. **Review the Keywords**:  
   - Check `Keyword_Counts` for overused or underused keywords.
   - Balance keyword usage to prevent keyword stuffing penalties.

2. **Edit Sentences**:  
   - Use `Sentence_Metadata` to identify long or passive sentences.
   - Rewrite them for clarity and engagement.

3. **Add More Citations (if needed)**:  
   - Ensure `Citation_Flag` remains "Sufficient Citations."
   - Add more references if the flag ever shows "Low Citations."

4. **Boost Sentiment**:  
   - If any content has a negative or neutral sentiment, revise it with more positive language to improve engagement.

---

### **Summary**
This output is a **comprehensive analysis** of webpage content that highlights:
- Keyword focus.
- Tone and sentiment.
- Readability and structure.
- Credibility through citations.

Using this data, website owners can enhance their content for better audience engagement, improved trustworthiness, and higher search engine rankings.

### **Detailed Explanation of the Output**

This output is a **content analysis report** designed to help website owners improve their webpage content for readability, SEO (Search Engine Optimization), and user engagement.

---

### **Understanding the Output Columns**

#### 1. **`Summary for Content Analysis`**:
   - **What it is**:  
     A concise summary printed for each webpage (console output rather than a saved column), including:
       - **`Total Issues`**: The number of issues identified in the webpage's content.
       - **`Severity Score`**: A weighted count of the issues, where higher scores indicate more serious problems.
       - **`Top Issues`**: The three most frequent issues in the content.
       - **`Top Suggestions`**: Up to three suggestions for resolving issues. Suggestions are collected in an unordered set, so these three are not ranked and may not correspond to the top issues.

   - **Example (Row 0)**:
     - **`Total Issues`**: 29 (indicating the content has significant room for improvement).
     - **`Severity Score`**: 29 (equal to the issue count here, because every detected issue carried a weight of 1).
     - **`Top Issues`**:
       - **Contains long/complex sentences**: There are 11 sentences that are too long or complicated.
       - **Keyword stuffing detected: software**: The keyword "software" appears excessively.
       - **Keyword stuffing detected: development**: The keyword "development" is overused.
     - **`Top Suggestions`**:
       - Reduce the usage of keywords like "services" and "saas" to avoid keyword stuffing.
       - Simplify long sentences to improve readability.

   - **Why It’s Important**:  
     This helps website owners prioritize improvements based on the most critical issues.

   - **Actions to Take**:
     - Rewrite sentences to make them shorter and easier to understand.
     - Reduce the repetition of overused keywords to avoid being penalized by search engines.
     - Follow the suggestions to ensure your content is engaging and SEO-friendly.

---

#### 2. **`Issues`**:
   - **What it is**:  
     A detailed breakdown of all issues detected in the content, with counts for each type.
   - **Example (Row 0)**:
     ```json
     {
       "Contains long/complex sentences.": 11,
       "Keyword stuffing detected: software": 4,
       "Keyword stuffing detected: development": 3
     }
     ```
     - **Contains long/complex sentences**: 11 sentences exceed 20 words.
     - **Keyword stuffing detected: software**: The keyword "software" exceeded the 3% density threshold in 4 chunks.
     - **Keyword stuffing detected: development**: The keyword "development" exceeded the threshold in 3 chunks.

   - **Why It’s Important**:  
     Long or complex sentences make content hard to read, and keyword stuffing can harm SEO rankings.

   - **Actions to Take**:
     - Focus on simplifying complex sentences to improve readability.
     - Limit keyword usage to a natural level to avoid search engine penalties.

---

#### 3. **`Suggestions`**:
   - **What it is**:  
     A list of recommendations to address the identified issues.
   - **Example (Row 0)**:
     ```json
     [
       "Reduce usage of keyword 'services'.",
       "Simplify long sentences for better readability.",
       "Reduce usage of keyword 'saas'."
     ]
     ```
     - Suggestion to reduce the use of specific keywords to make the content more balanced.
     - Suggestion to rewrite long sentences for improved clarity.

   - **Why It’s Important**:  
     These are actionable steps to make content more engaging, readable, and SEO-friendly.

   - **Actions to Take**:
     - Follow the suggestions directly. For example:
       - Identify sentences with keywords like "services" or "saas" and replace or reduce them.
       - Break long sentences into shorter, simpler ones.

---

#### 4. **`Keyword_Densities`**:
   - **What it is**:  
     A dictionary of the keywords flagged for possible stuffing, mapped to their percentage density within the chunk where they were flagged.
   - **Example (Row 0)**:
     ```json
     {
       "custom": 3.132530120481928,
       "software": 3.980582524271845,
       "development": 2.9125
     }
     ```
     - The keyword "software" makes up 3.98% of the content.
     - The keyword "custom" contributes 3.13%.

   - **Why It’s Important**:  
     Keyword density helps ensure that the content is optimized for search engines without crossing into keyword stuffing.

   - **Actions to Take**:
     - Maintain a keyword density of **1-3%** for critical terms.
     - Replace repetitive keywords with synonyms to reduce density if it exceeds 3%.
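
The density check can be sketched as follows. This is a simplified, standalone version rather than the notebook's exact code: `flag_dense_keywords` is a hypothetical name, it lowercases the text itself, and it rounds the result for display.

```python
from collections import Counter

def flag_dense_keywords(text, threshold=3.0):
    """Return {word: density %} for words whose share of all words exceeds threshold."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    return {
        word: round(count / len(words) * 100, 2)
        for word, count in counts.items()
        if count / len(words) * 100 > threshold
    }

# 50 words in total: "seo" and "audit" are 2% each, "content" is 96%
sample = "seo audit " + "content " * 48
print(flag_dense_keywords(sample))  # {'content': 96.0}
```

Density is relative to the length of the text passed in, so the same word can be flagged in a short chunk and pass in a longer one.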

---

#### 5. **`Severity_Score`**:
   - **What it is**:  
     A numeric score representing the severity of issues in the content. Each long sentence, passive construction, and flagged keyword (per chunk) adds 1, and each tone issue adds 2.
   - **Example**:
     - Row 0: **29** (high severity, indicating significant improvement is needed).
     - Row 1: **21** (moderate severity).
     - Row 2: **23** (moderate severity).
   - **Why It’s Important**:  
     Helps prioritize which pages need the most attention.

   - **Actions to Take**:
     - Focus on pages with higher severity scores first, as they require more work to meet quality standards.
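
Reading the weights from `analyze_chunk`, a tone issue adds 2 to the score and every other issue adds 1 per occurrence. The sketch below recomputes a score from an issues dictionary under that reading. `severity_from_issues` is a hypothetical helper, and the tone entry in the sample is invented, since no tone issue appears in the output above.

```python
TONE_ISSUE = "Negative/neutral tone detected."

def severity_from_issues(issues):
    """Weighted sum of issue counts: tone issues weigh 2, all other issues weigh 1."""
    return sum(count * (2 if name == TONE_ISSUE else 1) for name, count in issues.items())

issues = {
    "Contains long/complex sentences.": 11,
    "Keyword stuffing detected: software": 4,
    TONE_ISSUE: 1,  # hypothetical entry for illustration
}
print(severity_from_issues(issues))  # 11 + 4 + 2 = 17
```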

---

### **What Does This Output Convey?**

1. **Content Quality**:  
   - It provides a detailed view of where your content excels (e.g., positive sentiment, sufficient citations) and where it needs improvement (e.g., long sentences, keyword stuffing).

2. **Readability and Engagement**:  
   - Long or passive sentences and excessive keyword usage can hurt readability. The suggestions focus on making the content clear and engaging.

3. **SEO Optimization**:  
   - Overused keywords and poor readability can negatively affect your search engine rankings. The output gives actionable insights to fix these issues.

4. **Credibility**:  
   - Higher citation counts suggest the content is backed by sources, although the referenced facts still need to be verified.

---

### **How Is This Beneficial for Website Owners?**

1. **Improved SEO Rankings**:  
   - Balancing keyword usage and improving readability can help your content rank higher on search engines.

2. **Better User Experience**:  
   - Clear, concise content is more likely to engage readers and reduce bounce rates.

3. **Increased Trust**:  
   - Sufficient citations and a positive tone make your content more credible and appealing to users.

---

### **Steps to Take After Getting This Output**

1. **Simplify Content**:  
   - Use the suggestions to rewrite long or complex sentences and remove passive voice.

2. **Balance Keyword Usage**:  
   - Adjust keyword densities to fall within the recommended range (1-3%).

3. **Enhance Sentiment**:  
   - Rewrite any content flagged with neutral or negative sentiment to make it more positive and engaging.

4. **Validate Citations**:  
   - Ensure all facts and references are accurate and add more credible sources if needed.

5. **Prioritize Pages**:  
   - Start with the pages that have the highest severity scores and fix the critical issues first.
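
Prioritizing pages amounts to sorting by severity score. A minimal sketch using plain Python records with the URLs and scores from the output above (`prioritize` is a hypothetical helper):

```python
pages = [
    {"URL": "https://thatware.co/software-development-services/", "Severity_Score": 29},
    {"URL": "https://thatware.co/business-intelligence-services/", "Severity_Score": 21},
    {"URL": "https://thatware.co/competitor-keyword-analysis/", "Severity_Score": 23},
]

def prioritize(pages):
    """Sort pages so the highest severity score comes first."""
    return sorted(pages, key=lambda page: page["Severity_Score"], reverse=True)

for page in prioritize(pages):
    print(page["Severity_Score"], page["URL"])
```

On the pandas DataFrame itself, `data.sort_values("Severity_Score", ascending=False)` gives the same ordering.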

---

### **Conclusion**
This output provides actionable insights into how to improve the quality, readability, and SEO performance of your webpage content. By following the recommendations, website owners can create engaging, trustworthy, and search-engine-optimized content.