<a href="https://colab.research.google.com/github/Abhiss123/Capstoneproject2/blob/main/AI_Voice_Search_Optimizer_Revolutionizing_SEO_for_Conversations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**: **AI Voice Search Optimizer: Revolutionizing SEO for Conversations**

---

### **Purpose of the Project**

This project, **AI Voice Search Optimizer**, is designed to address a growing need in the digital landscape: optimizing websites to perform better for **voice search queries**. With the rise of virtual assistants like **Google Assistant, Siri, Alexa**, and others, people are increasingly asking questions in natural, conversational language instead of typing keywords. This shift has created new challenges for website owners to ensure their content ranks effectively for these voice-based queries. The **AI Voice Search Optimizer** solves this problem by using **Artificial Intelligence (AI)** to analyze, improve, and recommend changes to websites, making them more compatible with how people search today.

---

### **Detailed Explanation of the Project**

1. **What the Project Does**:
   - **Scrapes Web Data**: It collects data (like headings, paragraphs, and meta descriptions) from websites.
   - **Analyzes Content**: It uses AI to identify gaps in the content and assesses how well the website answers conversational questions.
   - **Generates Insights**:
     - Finds keywords and phrases that are frequently searched using voice assistants.
     - Identifies areas where website metadata (like titles and descriptions) can be improved.
   - **Recommends Improvements**:
     - Suggests changes to content to make it more conversational and voice-search-friendly.
     - Highlights complex paragraphs to simplify them for better understanding.

2. **Why This is Important**:
   - **User Behavior is Changing**: People now use natural phrases like "What is the best digital marketing strategy?" instead of typing "best digital marketing strategy."
   - **Business Benefits**:
     - Websites optimized for voice search can reach a broader audience.
     - Helps businesses stay competitive in the age of conversational AI.
     - Improves the likelihood of a website appearing in **featured snippets** or **top search results**, which are often prioritized for voice queries.

3. **How It Works**:
   - **Step 1: Data Collection**: The tool extracts content like headings, paragraphs, and meta descriptions from target websites.
   - **Step 2: Text Cleaning and Normalization**: It removes unnecessary clutter (like stopwords) and formats the text for analysis.
   - **Step 3: Keyword and Bigram Analysis**: Finds the most used keywords and common two-word phrases, helping to understand the website’s focus.
   - **Step 4: FAQ Generation**:
     - Identifies potential frequently asked questions based on the content.
     - Uses AI to rank these questions by relevance.
   - **Step 5: Recommendations**:
     - Provides actionable tips to improve metadata, content structure, and readability.
     - Suggests ways to expand content for voice queries.
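
As a rough illustration of Step 4, the sketch below ranks candidate questions by how strongly their words overlap with the page text. This is a simple frequency-based stand-in, not the AI ranking the project itself uses; the function name `rank_questions` and the small `IGNORED` word set are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical helper words to ignore so that "what", "is", etc. do not inflate scores
IGNORED = {"what", "is", "how", "does", "do", "an", "a", "the", "where", "your"}

def rank_questions(questions, page_text, top_n=3):
    """Return the top_n questions whose words occur most often in page_text."""
    # Word frequencies of the page (lowercased word tokens)
    freq = Counter(re.findall(r"\b\w+\b", page_text.lower()))
    scored = []
    for question in questions:
        tokens = set(re.findall(r"\b\w+\b", question.lower())) - IGNORED
        score = sum(freq[token] for token in tokens)
        scored.append((score, question))
    # Highest score first; the sort is stable, so ties keep their input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [question for _, question in scored[:top_n]]

page_text = "SEO services improve rankings. Our SEO audit covers link building."
candidates = [
    "What is link building?",
    "How does an SEO audit work?",
    "Where is your office?",
]
print(rank_questions(candidates, page_text, top_n=2))
```

A frequency score like this favors questions about topics the page already covers, which is one plausible reading of "relevance"; a production ranking would likely use semantic similarity instead.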

---

### **Use Case in the Context of a Website**

For a website, the **AI Voice Search Optimizer** offers these key benefits:
1. **Improved Voice Search Ranking**:
   - Improves the website’s chances of appearing near the top of voice search results by restructuring and optimizing the content.
   
2. **Enhanced User Experience**:
   - Provides answers to user questions in a conversational tone, making it easier for users to engage with the website.
   
3. **Business Growth**:
   - Helps businesses attract more traffic by aligning their content with voice-based queries.
   - Supports content creation strategies by identifying gaps and opportunities.

---

### **Non-Technical Explanation of the Workflow**

1. The tool **collects content** from the website and breaks it down into smaller parts (headings, paragraphs, etc.).
2. It cleans the content and removes any extra noise, ensuring only useful text is processed.
3. The tool checks how well the website’s content aligns with common questions people might ask (e.g., "How does this product work?").
4. AI is used to **recommend changes**, such as:
   - Simplifying overly complicated sentences.
   - Adding questions or sections to fill content gaps.
   - Making metadata more attractive for search engines.
5. Finally, the tool presents all its findings in a structured format (e.g., top keywords, FAQs, content gaps), allowing the website owner to make informed improvements.

---

### **What Steps to Take After Getting This Output**

1. **Optimize Content**:
   - Rewrite or expand paragraphs flagged as too short or too complex.
   - Add conversational FAQs to align with voice search trends.
   
2. **Update Metadata**:
   - Condense long titles or descriptions flagged in the "Metadata Recommendations."
   - Include voice-search-focused keywords in meta descriptions.

3. **Enhance SEO**:
   - Incorporate high-ranking keywords and bigrams (e.g., "digital marketing," "SEO services") into your content strategically.

4. **Monitor Progress**:
   - Use tools like **Google Analytics** and **Google Search Console** to track improvements in traffic and search rankings.
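
The "too short or too complex" flags mentioned in step 1 could come from a check along the lines of the sketch below. The function name `flag_paragraph` and both thresholds (20 words, 25 words per sentence) are illustrative assumptions, not values taken from this project.

```python
import re

def flag_paragraph(paragraph, min_words=20, max_avg_sentence_len=25):
    """Classify a paragraph as 'too short', 'too complex', or 'ok'."""
    words = re.findall(r"\b\w+\b", paragraph)
    # Split on sentence-ending punctuation and drop empty fragments
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    if len(words) < min_words:
        return "too short"
    # Average words per sentence as a crude complexity signal
    avg_len = len(words) / max(len(sentences), 1)
    if avg_len > max_avg_sentence_len:
        return "too complex"
    return "ok"

print(flag_paragraph("We offer SEO."))  # a three-word paragraph is flagged as too short
```

Average sentence length is only a proxy for readability; it is used here because it needs no external library.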

---

### **Why This Project Matters**

With this tool, businesses can stay ahead in the competitive digital world. Voice search is no longer just a trend; it’s a necessity for engaging users and growing online presence. This project equips website owners with the insights they need to succeed, bridging the gap between traditional SEO and the demands of voice-based queries.


---
# **What is AI-Powered Voice Search Optimization?**
AI-Powered Voice Search Optimization is the process of adapting your content (like website text, blogs, or product descriptions) so that it appears in results when people search using voice assistants (like Siri, Alexa, or Google Assistant). Since voice queries are conversational and natural, AI is used to understand how people speak, predict their intent, and optimize content to answer their queries effectively.

---

### **What are its Use Cases?**
1. **Improving Voice Search Visibility:** Optimizing a website to rank for questions asked via voice assistants (e.g., "Where can I buy organic coffee near me?").
2. **Enhanced Customer Experience:** Providing quick answers for voice-based queries, leading to better user engagement.
3. **Local Search Optimization:** Businesses like restaurants, clinics, or stores can target local audiences, since many voice searches are location-based (e.g., "near me" queries).
4. **Voice-Powered E-Commerce:** Optimizing product descriptions so customers can easily find products via voice search.
5. **FAQ Optimization:** Ensuring FAQs on your website are structured in a way that aligns with natural spoken language.

---

### **Real-Life Implementations**
1. **E-commerce Websites:** Amazon uses voice search optimization to make shopping easy via Alexa.
2. **Restaurants:** Local cafes optimize their menus and contact details for voice search to attract more customers nearby.
3. **News and Information Portals:** News and information sites structure their content to match popular voice queries (e.g., "What's the weather today?").
4. **Healthcare:** Clinics optimize for voice searches like “Find a doctor near me” to attract patients using voice assistants.

---

### **Use Case for Websites**
In the context of websites, **AI-Powered Voice Search Optimization** involves adapting your content to match how users typically speak their queries. For example:
- Instead of targeting keywords like "Best pizza New York," voice optimization would target phrases like "Where can I find the best pizza in New York?"
- Using AI, the system analyzes your website’s text to identify gaps and optimize it to align with natural language patterns.

---

### **Technical Details: What Data Does the Model Need?**
1. **Website URLs or CSV Data?**
   - **URLs of Webpages:** If your website has multiple pages, list their URLs; the tool fetches each page, extracts the content, and preprocesses it.
   - **CSV Format:** If you have structured data (like product descriptions, FAQs, or other website content) in a CSV file, the AI model can process that too. For smaller projects, CSV data might suffice, but larger websites benefit from directly crawling URLs.
   
   **You can choose either method based on your preference or the scale of the project.**

2. **Types of Data Required:**
   - Website text content (product descriptions, blogs, FAQs).
   - User behavior data (what users are searching for).
   - Contextual information (location-specific details, brand tone).
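
To make the URL-versus-CSV choice concrete, here is a minimal sketch that classifies an input source and reads one text column from CSV data. The names `classify_source` and `read_csv_text`, and the `question` column in the usage example, are hypothetical and not part of the project's code.

```python
import csv
import io
from urllib.parse import urlparse

def classify_source(source):
    """Return 'url' for http(s) addresses, 'csv' for .csv paths, else 'unknown'."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if source.lower().endswith(".csv"):
        return "csv"
    return "unknown"

def read_csv_text(csv_text, column):
    """Collect the non-empty values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # (row.get(column) or "") guards against a missing column or short rows
    return [
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    ]

print(classify_source("https://thatware.co/"))
print(read_csv_text("question,answer\nHow do I order?,Online\n", "question"))
```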

---

### **How AI Models Work to Optimize Content**
1. **Understanding Natural Language:** Natural language processing (NLP) models analyze human speech patterns, synonyms, and conversational tone.
2. **Identifying Gaps:** It identifies gaps in your website's content where voice-friendly terms are missing.
3. **Generating Optimized Suggestions:** AI suggests rephrased content tailored to voice search, such as converting “affordable shoes” into “Where can I buy affordable shoes nearby?”
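
The keyword-to-question rewriting in step 3 can be approximated with plain templates, as in the sketch below. An AI model would generate more varied phrasing; the `TEMPLATES` mapping, its intent labels, and the function `to_voice_query` are assumptions made for illustration.

```python
# Hypothetical intent-to-template mapping; a real system would learn or generate these
TEMPLATES = {
    "buy": "Where can I buy {} nearby?",
    "find": "Where can I find {}?",
    "learn": "What is {}?",
}

def to_voice_query(keyword, intent="learn"):
    """Wrap a typed keyword phrase in a conversational question template."""
    phrase = " ".join(keyword.split())  # normalize internal whitespace, keep casing
    return TEMPLATES[intent].format(phrase)

print(to_voice_query("affordable shoes", intent="buy"))
print(to_voice_query("the best pizza in New York", intent="find"))
```

Casing is left untouched so that proper nouns such as "New York" survive the rewrite.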

---

### **Expected Output**
1. **Optimized Content Recommendations:**
   - Suggestions to rewrite headlines, product descriptions, or FAQs in a conversational tone.
2. **Structured FAQ Suggestions:**
   - The AI may recommend adding voice-friendly FAQs like “How do I order from your website?” to increase visibility.
3. **Keyword Insights:**
   - A list of popular voice search terms relevant to your website.
4. **Actionable Recommendations:**
   - Guidance to improve metadata (titles, descriptions) for better ranking in voice searches.
5. **Content Gap Analysis:**
   - Insights into missing content that could better answer common voice queries.
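
One way to hold the five output categories above in a single structure is a small dataclass, sketched below. The class name `VoiceSearchReport` and its field names are assumptions chosen to mirror the list; the project's actual output format may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class VoiceSearchReport:
    """Container for the per-page findings listed under Expected Output."""
    url: str
    content_suggestions: List[str] = field(default_factory=list)
    faq_suggestions: List[str] = field(default_factory=list)
    top_keywords: List[str] = field(default_factory=list)
    metadata_recommendations: List[str] = field(default_factory=list)
    content_gaps: List[str] = field(default_factory=list)

report = VoiceSearchReport(url="https://thatware.co/")
report.faq_suggestions.append("How do I order from your website?")
print(asdict(report))  # plain dict, convenient for writing to JSON or CSV
```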

---

### **Step-by-Step Process**
1. **Input Data:**
   - Provide website URLs or upload CSV files containing your content.
2. **AI Preprocessing:**
   - The tool cleans and structures the data (lowercasing, removing special characters and stopwords) so it can be analyzed consistently.
3. **Generate Output:**
   - The system generates a report with optimized content and actionable insights.
4. **Implementation:**
   - You or your team implement these recommendations on the website.

---

### **Why Is This Useful for Voice Search?**
Voice search is growing because many people find speaking a query quicker and more convenient than typing it. By optimizing your website for voice, you:
- Attract more traffic.
- Provide users with quicker and better results.
- Gain a competitive edge in local and e-commerce searches.

---
# **Part 1: Web Scraper for Content Extraction**
**Title**: **"Extracting Web Content for Analysis"**  
**Purpose**: This part of the code focuses on scraping data from a list of web pages. It retrieves key elements like headings, paragraphs, and meta descriptions, cleans the text, and saves the results in a structured format.

**Key Steps**:
1. **Defining the URLs**: The URLs represent the web pages we want to analyze for SEO or content insights.
2. **Cleaning Text**: A function cleans the text by lowercasing it, removing special characters and extra spaces, and optionally removing stopwords (common words like "and", "the").
3. **Scraping Data**: The `scrape_webpage` function fetches webpage content using `requests`, processes it with `BeautifulSoup`, and extracts useful parts like headings and paragraphs.
4. **Saving Results**: The extracted and cleaned data is saved in a CSV file for further analysis.

---


In [1]:
import requests  # For making HTTP requests to fetch webpage content
from bs4 import BeautifulSoup  # For parsing HTML content of webpages
import pandas as pd  # For structuring and saving data in tabular format
import re  # For cleaning and normalizing text data

# Step 1: Define the list of URLs to scrape
# Purpose: The URLs represent webpages we want to scrape to extract useful data such as headings, paragraphs, and meta descriptions.
urls = [
    'https://thatware.co/',  # Homepage URL of the target site
    'https://thatware.co/digital-marketing-services/',  # Digital Marketing Services page
    'https://thatware.co/business-intelligence-services/',  # Business Intelligence Services page
    'https://thatware.co/link-building-services/',  # Link Building Services page
    'https://thatware.co/branding-press-release-services/',  # Branding and Press Release Services page
    'https://thatware.co/advanced-seo-services/',  # Advanced SEO Services page
    # Add more URLs if additional pages need to be scraped
]

# Step 2: Define a function to clean and normalize text
# Purpose: Raw text often contains unwanted characters, extra spaces, or stopwords. This function removes such noise to prepare the text for analysis.
def clean_text(text, remove_stopwords=True):
    """
    Cleans and normalizes text data.
    - Removes extra spaces, converts text to lowercase, and removes special characters.
    - Optionally removes common words ("stopwords") like 'the', 'and', 'to', etc., that don't add much meaning.

    Args:
        text (str): The raw text to clean.
        remove_stopwords (bool): Whether to remove stopwords from the text.

    Returns:
        str: The cleaned text.
    """
    # Convert all characters to lowercase for uniformity
    text = text.lower()
    # Remove special characters like punctuation marks
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse whitespace last, so gaps left by removed punctuation are normalized too
    text = re.sub(r'\s+', ' ', text).strip()

    # If stopwords need to be removed
    if remove_stopwords:
        # Define a set of common stopwords
        stop_words = {
            'the', 'and', 'to', 'for', 'of', 'in', 'a', 'is', 'on', 'with', 'that',
            'as', 'it', 'you', 'your', 'our', 'this', 'by', 'at', 'be', 'are', 'can', 'an'
        }
        # Filter out stopwords from the text
        text = ' '.join(word for word in text.split() if word not in stop_words)

    return text  # Return the cleaned text

# Step 3: Define a function to scrape data from a single webpage
# Purpose: This function fetches data like headings, paragraphs, and meta descriptions from a given webpage.
def scrape_webpage(url):
    """
    Scrapes data from a given URL, including headings, paragraphs, and meta descriptions.

    Args:
        url (str): The URL of the webpage to scrape.

    Returns:
        dict: A dictionary containing the URL, headings, paragraphs, and meta descriptions.
    """
    try:
        # Make an HTTP GET request to fetch the webpage content
        response = requests.get(url, timeout=10)  # Timeout after 10 seconds if no response
        response.raise_for_status()  # Raise an error if the request fails

        # Parse the webpage content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove unwanted tags like <script>, <style>, and <noscript> to clean the HTML content
        for tag in soup(['script', 'style', 'noscript']):
            tag.decompose()

        # Extract headings (H1, H2, H3 tags) from the webpage
        headings = ' '.join(tag.get_text(strip=True) for tag in soup.find_all(['h1', 'h2', 'h3']))

        # Extract paragraphs (<p> tags) from the webpage
        paragraphs = ' '.join(tag.get_text(strip=True) for tag in soup.find_all('p'))

        # Extract the meta description content
        meta_tag = soup.find('meta', attrs={'name': 'description'})
        meta_description = meta_tag['content'] if meta_tag and 'content' in meta_tag.attrs else "No meta description available"

        # Return the extracted data in a dictionary format
        return {
            "URL": url,
            "Headings": headings,
            "Paragraphs": paragraphs,
            "Meta Description": meta_description
        }

    except Exception as e:
        # Handle errors that occur during scraping
        print(f"Error scraping {url}: {e}")
        return {
            "URL": url,
            "Headings": "",
            "Paragraphs": "",
            "Meta Description": "Error fetching meta description"
        }

# Step 4: Scrape content from all URLs in the list
# Purpose: Iterate through all URLs, scrape their content, and clean the extracted text for analysis.
scraped_data = []
for url in urls:
    print(f"Scraping content from: {url}")  # Inform the user which URL is being processed
    # Scrape data from the current URL
    scraped = scrape_webpage(url)
    # Clean the extracted text and add it to the result
    scraped['Cleaned Headings'] = clean_text(scraped['Headings'], remove_stopwords=True)
    scraped['Cleaned Paragraphs'] = clean_text(scraped['Paragraphs'], remove_stopwords=True)
    scraped['Cleaned Meta Description'] = clean_text(scraped['Meta Description'], remove_stopwords=True)
    scraped_data.append(scraped)  # Append the result to the list

# Step 5: Save the scraped data to a CSV file
# Purpose: Save the extracted and cleaned data in a structured format for further analysis or sharing.
if scraped_data:
    # Convert the list of dictionaries to a Pandas DataFrame for tabular representation
    df = pd.DataFrame(scraped_data)
    # Save the DataFrame to a CSV file
    df.to_csv("scraped_content_with_cleaned_text.csv", index=False)
    print("\nScraped content has been saved to 'scraped_content_with_cleaned_text.csv'.")  # Notify the user

    # Step 6: Preview the scraped content
    # Purpose: Show the first few rows of the scraped data as a quick preview to ensure everything worked as expected.
    print("\nPreview of Scraped Content:")
    print(df.head())  # Display the first 5 rows of the DataFrame
else:
    # Notify the user if no content was scraped
    print("No content was scraped.")


Scraping content from: https://thatware.co/
Scraping content from: https://thatware.co/digital-marketing-services/
Scraping content from: https://thatware.co/business-intelligence-services/
Scraping content from: https://thatware.co/link-building-services/
Scraping content from: https://thatware.co/branding-press-release-services/
Scraping content from: https://thatware.co/advanced-seo-services/

Scraped content has been saved to 'scraped_content_with_cleaned_text.csv'.

Preview of Scraped Content:
                                                 URL  \
0                               https://thatware.co/   
1    https://thatware.co/digital-marketing-services/   
2  https://thatware.co/business-intelligence-serv...   
3        https://thatware.co/link-building-services/   
4  https://thatware.co/branding-press-release-ser...   

                                            Headings  \
0  Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARK...   
1  Advanced Digital Marketing Services GET A FR

# **Understanding the Output in Simple Terms**
---

### 1. **`URL` Column**
- **What it is**:
  This column shows the specific web page address (link) from which the content was extracted. Each URL represents a webpage where headings, paragraphs, and meta descriptions were collected.
- **Example Explanation**:
  For instance, `https://thatware.co/` is the homepage of the Thatware website, while `https://thatware.co/digital-marketing-services/` is the webpage dedicated to digital marketing services.
- **Why it's useful**:
  By knowing the source URL, we can trace the extracted content back to its original page for further context or validation.

---

### 2. **`Headings` Column**
- **What it is**:
  This column contains all the headings (like H1, H2, and H3 tags) extracted from the webpage. These headings summarize the main sections or topics covered on the webpage.
- **Example Explanation**:
  From the homepage (`https://thatware.co/`), the extracted headings include phrases like "Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARKETING STRATEGY FOR YOUR BUSINESS."
- **Why it's useful**:
  Headings are critical for both user readability and search engine optimization (SEO). They give an overview of the content structure.

---

### 3. **`Paragraphs` Column**
- **What it is**:
  This column contains all the paragraph text extracted from the webpage. It includes the main body content visible on the site.
- **Example Explanation**:
  From the homepage, the paragraphs might include details about the services offered by Thatware, like revenue generation through SEO or advanced digital marketing strategies.
- **Why it's useful**:
  Paragraphs contain the detailed information users and search engines read. This is the core content that needs to be analyzed for gaps, relevance, and keyword optimization.

---

### 4. **`Meta Description` Column**
- **What it is**:
  This column contains the meta description tag from the webpage's HTML. The meta description is a summary of the page's content, often displayed in search engine results.
- **Example Explanation**:
  For the homepage, the meta description is: "THATWARE® is the world's first SEO agency to seamlessly integrate AI into its services."
- **Why it's useful**:
  Meta descriptions are essential for SEO because they influence click-through rates on search engine results pages (SERPs). If they’re too short, too long, or irrelevant, they can be improved.
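
A length check for meta descriptions might look like the sketch below. The bounds of 50 and 160 characters are common rules of thumb used here as assumptions, not limits published by any search engine, and `check_meta_description` is a hypothetical helper. The placeholder string matches the one `scrape_webpage` returns when no tag is found.

```python
def check_meta_description(description, min_len=50, max_len=160):
    """Classify a meta description as 'missing', 'too short', 'too long', or 'ok'."""
    text = " ".join(description.split())  # normalize whitespace before measuring
    if not text or text == "No meta description available":
        return "missing"
    if len(text) < min_len:
        return "too short"
    if len(text) > max_len:
        return "too long"
    return "ok"

homepage_meta = (
    "THATWARE® is the world's first SEO agency to seamlessly "
    "integrate AI into its services."
)
print(check_meta_description(homepage_meta))
```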

---

### 5. **`Cleaned Headings` Column**
- **What it is**:
  This column contains the headings from the `Headings` column, but they’ve been cleaned up. Cleaning lowercases the text and removes unnecessary spaces, special characters, and stopwords (like "and", "the").
- **Example Explanation**:
  The cleaned version of the homepage's headings is: "home get customized seo audit digital marketing strategy business."
- **Why it's useful**:
  Cleaned headings are easier to analyze for patterns and keywords. They help focus on the most meaningful terms for SEO and readability improvements.
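
To see how the homepage heading turns into its cleaned form, the sketch below applies the cleaning operations one at a time using only the standard library. `STOP_WORDS` copies the custom list from `clean_text` in Part 1; the intermediate variable names are illustrative.

```python
import re

# Same custom stopword list as clean_text in Part 1
STOP_WORDS = {
    'the', 'and', 'to', 'for', 'of', 'in', 'a', 'is', 'on', 'with', 'that',
    'as', 'it', 'you', 'your', 'our', 'this', 'by', 'at', 'be', 'are', 'can', 'an'
}

raw = "Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARKETING STRATEGY FOR YOUR BUSINESS"

lowered = raw.lower()                            # uniform casing
no_specials = re.sub(r"[^\w\s]", "", lowered)    # drops "&" and other punctuation
# split() discards extra whitespace, so joining the kept words yields single spaces
cleaned = " ".join(w for w in no_specials.split() if w not in STOP_WORDS)

print(cleaned)
# home get customized seo audit digital marketing strategy business
```

Only "a", "for", and "your" are dropped here because they are the only words of this heading in the stopword list.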

---

### 6. **`Cleaned Paragraphs` Column**
- **What it is**:
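  This column contains the cleaned version of the `Paragraphs` column. As with the cleaned headings, the text is lowercased and special characters and stopwords are removed.
- **Example Explanation**:
  For example, the cleaned text from the homepage might include phrases like "revenuegenerated via seo qualified leadsgenerated 11 years ago journey unrav." Fused tokens such as "revenuegenerated" appear because `get_text(strip=True)` joins the text of nested tags without a separator; calling `get_text(separator=' ', strip=True)` would keep those words apart.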
- **Why it's useful**:
  Cleaning removes noise, making it easier to focus on the actual content for keyword analysis and content improvement.

---

### 7. **`Cleaned Meta Description` Column**
- **What it is**:
  This column is the cleaned version of the `Meta Description` column. It follows the same cleaning process applied to headings and paragraphs.
- **Example Explanation**:
  For the homepage, the cleaned meta description is: "thatware worlds first seo agency seamlessly integrate ai into its services" ("into" and "its" remain because they are not in the custom stopword list).
- **Why it's useful**:
  Cleaned meta descriptions help in identifying the most critical keywords and checking their relevance to the page's content.

---

### **Key Insights from the Output**
- **What has been achieved**:
  We have successfully extracted structured data (headings, paragraphs, meta descriptions) from multiple webpages. The data is also cleaned and ready for further analysis.
- **How it helps**:
  This output lays the foundation for deeper analysis, like identifying content gaps, improving SEO elements, generating FAQs, and analyzing keyword patterns.

---
# **Part 2: Keyword Extraction and Analysis**
**Title**: **"Analyzing Keywords from Scraped Data"**  
**Purpose**: This code processes the cleaned content to extract meaningful keywords and common phrases. It uses techniques like tokenization and filtering to focus on the most important terms.

**Key Steps**:
1. **Loading Data**: Reads the cleaned data from the CSV file to ensure structured input for analysis.
2. **Tokenization**: Breaks the combined content into individual words for further processing.
3. **Filtering Stopwords**: Removes common words that don't add significant meaning, such as "is", "and", or "the".
4. **Counting Keywords**: Counts the frequency of each keyword and identifies the top recurring terms.
5. **Generating Bigrams**: Extracts two-word phrases (like "digital marketing") to understand common patterns and themes.
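
The tokenizing, filtering, counting, and bigram steps can be tried on a toy sentence before running them on the scraped data. This is a minimal sketch: the sample sentence and the three-word `stop_words` set are stand-ins (the notebook itself uses NLTK's English stopword list).

```python
import re
from collections import Counter

text = "Digital marketing services and SEO services for digital marketing teams"

# Tokenization: lowercase, then pull out word tokens
tokens = re.findall(r"\b\w+\b", text.lower())

# Filtering: drop stopwords and words of two characters or fewer
stop_words = {"and", "for", "the"}  # tiny stand-in for NLTK's list
filtered = [t for t in tokens if t not in stop_words and len(t) > 2]

# Counting keywords
keyword_counts = Counter(filtered)

# Bigrams: pair each word with its successor in the filtered list
bigrams = Counter(" ".join(pair) for pair in zip(filtered, filtered[1:]))

print(keyword_counts.most_common(3))
print(bigrams.most_common(1))  # [('digital marketing', 2)]
```

Because bigrams are built after filtering, a pair such as "services digital" is counted even though "for" sat between the two words in the original sentence. This is worth keeping in mind when reading the top-bigram list.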

---


In [2]:
import pandas as pd  # For working with tabular data (like spreadsheets)
import re  # For cleaning and manipulating text using regular expressions
from collections import Counter  # For counting occurrences of words and phrases
from nltk.corpus import stopwords  # For filtering out common "stopwords" like "the", "and", etc.
import nltk  # Natural Language Toolkit for text processing

# Step 1: Ensure stopwords are available for filtering common words
# Purpose: Stopwords are common words like "the", "is", and "and" that don't contribute much meaning.
# Removing these helps focus on more meaningful terms in the text analysis.
def ensure_stopwords():
    """
    Download stopwords if not already available.
    This step ensures that we have the necessary resources to filter out common English words.
    """
    try:
        # nltk.download returns False instead of raising when the download fails,
        # so check the return value explicitly
        if not nltk.download('stopwords', quiet=True):
            raise RuntimeError("nltk.download('stopwords') returned False.")
        print("Stopwords successfully validated.")  # Confirm the resource is available
    except Exception as e:
        print(f"Error downloading stopwords: {e}")  # Notify if there's an issue
        raise RuntimeError("Failed to initialize stopwords.") from e

# Ensure the stopwords are downloaded before proceeding
ensure_stopwords()

# Step 2: Load the cleaned scraped content
# Purpose: Load the structured data (cleaned text) from a CSV file for further processing.
def load_scraped_data(file_path):
    """
    Load the cleaned content from a CSV file.
    Args:
        file_path (str): Path to the CSV file.
    Returns:
        pd.DataFrame: A DataFrame containing the cleaned scraped data.
    """
    try:
        # Load the data into a pandas DataFrame for easy manipulation
        data = pd.read_csv(file_path)
        print(f"Data loaded successfully from '{file_path}'.")  # Confirm successful loading
        print(data.head())  # Show the first few rows to validate the structure
        return data
    except Exception as e:
        # Notify and return an empty DataFrame if there's an error
        print(f"Error loading file: {e}")
        return pd.DataFrame()

# Load the cleaned scraped content
scraped_data = load_scraped_data("scraped_content_with_cleaned_text.csv")

# Step 3: Process the content for keyword extraction if data is available
if not scraped_data.empty:
    # Combine cleaned headings and paragraphs into one text block for analysis
    # Purpose: To consolidate all meaningful text data for unified processing.
    combined_content = ' '.join(scraped_data['Cleaned Headings'].fillna('')) + ' ' + ' '.join(scraped_data['Cleaned Paragraphs'].fillna(''))
    print("\nCombined content length:", len(combined_content))  # Show the length of the combined text

    # Tokenize the combined content into individual words
    # Purpose: Break the text into smaller units (tokens) for analysis.
    words = re.findall(r'\b\w+\b', combined_content.lower())  # Extract words ignoring case
    print("\nTokenized Words Sample:", words[:20])  # Display the first 20 words for review

    # Step 4: Remove stopwords from the tokenized words
    # Purpose: Eliminate commonly used words (e.g., "the", "and") that do not add meaningful insights.
    stop_words = set(stopwords.words('english'))  # Load the list of stopwords in English
    filtered_words = [word for word in words if word not in stop_words and len(word) > 2]  # Exclude stopwords and very short words
    print("\nFiltered Words Sample:", filtered_words[:20])  # Display the first 20 filtered words for review

    # Step 5: Count keyword frequencies
    # Purpose: Identify the most frequently used meaningful words in the content.
    keyword_counts = Counter(filtered_words)  # Count occurrences of each word
    print("\nTop Keywords:")  # Display the most common keywords
    for word, count in keyword_counts.most_common(10):  # Show the top 10 keywords
        print(f" - {word}: {count} occurrences")

    # Step 6: Extract bigrams (two-word phrases)
    # Purpose: Identify commonly used phrases (pairs of words) for better conversational insights.
    def extract_ngrams(tokens, n):
        """
        Generate n-grams (phrases of n words) from a list of tokens.
        Args:
            tokens (list): List of individual words (tokens).
            n (int): Size of the n-grams to generate (e.g., 2 for bigrams).
        Returns:
            list: List of n-grams as strings.
        """
        # Use zip to create n-grams by shifting tokens
        ngrams = zip(*[tokens[i:] for i in range(n)])
        return [' '.join(ngram) for ngram in ngrams]  # Join the words in each n-gram

    # Extract bigrams from the filtered words
    bigrams = Counter(extract_ngrams(filtered_words, 2))  # Create and count bigrams
    print("\nTop Bigrams (Two-Word Phrases):")  # Display the most common bigrams
    for phrase, count in bigrams.most_common(10):  # Show the top 10 bigrams
        print(f" - {phrase}: {count} occurrences")
else:
    # Notify if no content is available for processing
    print("No content available for analysis.")


Stopwords successfully validated.
Data loaded successfully from 'scraped_content_with_cleaned_text.csv'.
                                                 URL  \
0                               https://thatware.co/   
1    https://thatware.co/digital-marketing-services/   
2  https://thatware.co/business-intelligence-serv...   
3        https://thatware.co/link-building-services/   
4  https://thatware.co/branding-press-release-ser...   

                                            Headings  \
0  Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARK...   
1  Advanced Digital Marketing Services GET A FREE...   
2  Business Intelligence & Consultation Services ...   
3  Advanced Link Building Services GET A FREE CUS...   
4  Branding, Media planning & Paid Marketing WHY ...   

                                          Paragraphs  \
0  $ RevenueGenerated via SEO Qualified LeadsGene...   
1    Thatware is your go-to advanced digital mark...   
2    Thatwareenables you to take strategic decisi... 

# **Clear and Simple Explanation of Each Part of the Output**
---

### 1. **Stopwords Successfully Validated**
- **What it means**:
  This step ensures that a list of commonly used words (like "the", "and", "is") is available for filtering. These words, called stopwords, don’t add much meaning and are removed during keyword analysis.
- **Why it’s important**:
  Removing these words helps focus on more meaningful terms like "SEO" or "digital marketing" that are relevant to your content.

---

### 2. **Data Loaded Successfully**
- **What it means**:
  The program successfully read the scraped content (headings, paragraphs, meta descriptions, etc.) from a file named `scraped_content_with_cleaned_text.csv`.
- **Why it’s important**:
  This file contains all the structured data collected from the webpages. It serves as the input for further analysis.

---

### 3. **Preview of Scraped Content**

#### Columns in the Data
1. **`URL`**:
   - **What it is**: The webpage address where the content was scraped from.
   - **Example**: `https://thatware.co/digital-marketing-services/` shows content from the Digital Marketing Services page.
   - **Why it’s important**: Knowing the URL helps trace the content back to its source.

2. **`Headings`**:
   - **What it is**: All the headings (H1, H2, and H3 tags) collected from the webpage.
   - **Example**: "Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARKETING STRATEGY FOR YOUR BUSINESS."
   - **Why it’s important**: Headings summarize the structure of a webpage, making it easier to understand its main sections.

3. **`Paragraphs`**:
   - **What it is**: The main body text from the webpage.
   - **Example**: "Thatware is your go-to advanced digital marketing agency..."
   - **Why it’s important**: This is the detailed content users and search engines read. It’s crucial for SEO and user experience.

4. **`Meta Description`**:
   - **What it is**: A short summary of the webpage content, usually shown in search engine results.
   - **Example**: "THATWARE® is the world's first SEO agency to seamlessly integrate AI into its services."
   - **Why it’s important**: Meta descriptions influence whether users click on a link in search results.

5. **`Cleaned Headings`**:
   - **What it is**: A lowercased version of the `Headings` column with stopwords and special characters removed.
   - **Example**: "home get customized seo audit digital marketing strategy business."
   - **Why it’s important**: Cleaning focuses on the most meaningful words for analysis.

6. **`Cleaned Paragraphs`**:
   - **What it is**: A lowercased version of the `Paragraphs` column with stopwords and special characters removed.
   - **Example**: "revenuegenerated via seo qualified leadsgenerated 11 years ago..."
   - **Why it’s important**: This helps in identifying the core message of the content.

7. **`Cleaned Meta Description`**:
   - **What it is**: A cleaned version of the `Meta Description`.
   - **Example**: "thatware worlds first seo agency seamlessly integrate ai services."
   - **Why it’s important**: Cleaning highlights the most relevant keywords.

---

### 4. **Combined Content Length**
- **What it means**:
  The program combined all the cleaned text (headings, paragraphs, and meta descriptions) into one block of text for analysis. The total length was 97,787 characters.
- **Why it’s important**:
  Combining content ensures a comprehensive analysis of the entire text without missing any part.

---

### 5. **Tokenized Words Sample**
- **What it means**:
  The program split the combined text into individual words for analysis.
  - **Example**: ["home", "get", "customized", "seo", "audit", "digital", "marketing"]
- **Why it’s important**:
  Tokenizing breaks down the content into smaller units, making it easier to identify frequently used words or phrases.

---

### 6. **Filtered Words Sample**
- **What it means**:
  The program removed stopwords (common words) from the tokenized words.
  - **Example**: ["seo", "digital", "marketing", "strategy"]
- **Why it’s important**:
  Filtering out stopwords focuses on the most meaningful words relevant to your content.
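
The tokenizing and filtering steps described above can be sketched in a few lines. This is a minimal illustration, not the notebook's exact code: the `STOPWORDS` set is a small assumed sample (the notebook validates a fuller list), and the helper names `tokenize` and `remove_stopwords` are hypothetical.

```python
import re

# Assumed, deliberately small stopword sample for illustration only.
STOPWORDS = {"a", "the", "and", "for", "your", "is", "to", "of"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\b\w+\b", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

sample = "Get a customized SEO audit and digital marketing strategy for your business"
tokens = tokenize(sample)
filtered = remove_stopwords(tokens)
print(filtered)
```

With this sample list, words such as "a", "and", "for", and "your" are dropped, while "seo", "audit", and "marketing" are kept.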

---

### 7. **Top Keywords**
- **What it is**:
  A list of the most frequently used words in the cleaned content, along with the number of times each word appears.
  - **Example**:
    - `seo`: 313 occurrences
    - `services`: 274 occurrences
    - `marketing`: 190 occurrences
- **Why it’s important**:
  Keywords show what the content emphasizes most. Frequent use of relevant terms like "SEO" or "marketing" suggests that the content is aligned with your website's purpose.

---

### 8. **Top Bigrams (Two-Word Phrases)**
- **What it is**:
  Commonly used pairs of words extracted from the content.
  - **Example**:
    - `digital marketing`: 99 occurrences
    - `advanced seo`: 62 occurrences
    - `link building`: 61 occurrences
- **Why it’s important**:
  Bigrams reveal common phrases that users might search for, helping improve SEO and conversational tone.
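
One common way to count bigrams is to pair each word with the word that follows it. The sketch below assumes the tokens are already cleaned and in their original order; `count_bigrams` is a hypothetical helper, and the notebook's own bigram extraction may differ.

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent word pairs in an ordered list of tokens."""
    pairs = zip(tokens, tokens[1:])  # (word_i, word_i+1) for every position
    return Counter(" ".join(pair) for pair in pairs)

tokens = ["advanced", "seo", "digital", "marketing",
          "link", "building", "digital", "marketing"]
bigram_counts = count_bigrams(tokens)
print(bigram_counts.most_common(3))
```

Because stopwords are removed before pairing, a bigram here can join two words that were not strictly adjacent in the original sentence.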

---

### **Key Insights from the Output**
1. **Comprehensive Data**:
   All headings, paragraphs, and meta descriptions are now structured and cleaned for analysis.
2. **Keyword Analysis**:
   Top keywords and phrases like "SEO", "digital marketing", and "link building" align well with the site's purpose.
3. **Optimization Opportunities**:
   The data helps identify gaps or areas for improvement in content, meta descriptions, and SEO strategy.

---


---
# **Part 3: FAQ Generation Using AI**
**Title**: **"Generating FAQs for Better User Engagement"**  
**Purpose**: This section extracts potential FAQs from the content by identifying question-like sentences and ranks them based on semantic relevance using a pre-trained AI model.

**Key Steps**:
1. **Loading Content**: Reads raw paragraphs from the CSV to preserve sentence structures needed for FAQ identification.
2. **Splitting Sentences**: Breaks paragraphs into smaller sentences using regex for better analysis.
3. **Identifying FAQs**: Keeps sentences of more than five words that start with a question keyword such as "What", "How", or "Why".
4. **Ranking FAQs**: Uses a semantic similarity model (`SentenceTransformer`) to rank FAQ candidates based on their relevance to a predefined query.
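
To make the ranking step concrete, here is a small sketch of cosine-similarity ranking using hand-written two-dimensional vectors in place of real sentence embeddings. The vectors, the example questions, and the helper names are all assumptions for illustration; in the notebook the embeddings come from `SentenceTransformer`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, candidates):
    """Sort candidate texts from most to least similar to the query vector."""
    return sorted(
        candidates,
        key=lambda text: cosine_similarity(query_vec, candidates[text]),
        reverse=True,
    )

# Toy "embeddings": the first axis loosely stands for marketing-related meaning.
query_vec = [1.0, 0.0]
toy_embeddings = {
    "How does SEO work?": [0.9, 0.1],
    "Where is our office?": [0.1, 0.9],
    "Why invest in digital marketing?": [0.7, 0.3],
}
ranked = rank_by_similarity(query_vec, toy_embeddings)
print(ranked)
```

The candidate whose vector points in nearly the same direction as the query ranks first, regardless of vector length.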

---


In [3]:
from sentence_transformers import SentenceTransformer, util  # For semantic similarity and ranking
import pandas as pd  # For data manipulation
import re  # For splitting text into sentences using patterns

# Step 1: Load raw content from the specified column
# Purpose: Extract the text content from a specific column in the CSV file for further processing.
def load_raw_content(file_path, column_name):
    """
    Load raw content from a specified column in the provided CSV file.
    Args:
        file_path (str): The file path to the cleaned CSV file.
        column_name (str): The name of the column to extract content from.
    Returns:
        str: Combined text from the specified column.
    """
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)
        # Combine all non-empty rows in the specified column into one string
        content = ' '.join(data[column_name].dropna())
        print(f"Loaded content from '{column_name}' in '{file_path}'. Combined length: {len(content)} characters.")
        return content  # Return the combined text
    except Exception as e:
        print(f"Error loading content from '{column_name}': {e}")
        return ""  # Return an empty string if loading fails

# Load the text content from the 'Paragraphs' column
raw_content = load_raw_content("scraped_content_with_cleaned_text.csv", "Paragraphs")

# Step 2: Split content into sentences
# Purpose: Break the text into smaller units (sentences) for easier analysis and FAQ identification.
def split_into_sentences(content):
    """
    Split combined content into sentences using regex.
    Args:
        content (str): The full text to split into sentences.
    Returns:
        list: List of sentences.
    """
    # Split on whitespace that follows a period or question mark,
    # skipping abbreviations such as "e.g." and titles such as "Dr."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', content)
    print(f"Extracted {len(sentences)} sentences.")  # Display how many sentences were extracted
    return sentences  # Return the list of sentences

# Break the combined content into sentences
sentences = split_into_sentences(raw_content)

# Step 3: Filter FAQ candidate sentences
# Purpose: Identify sentences that are likely to be useful as FAQs (start with "What", "How", etc.).
def filter_faq_candidates(sentences):
    """
    Identify potential FAQ sentences based on keywords and sentence length.
    Args:
        sentences (list): List of sentences.
    Returns:
        list: Filtered sentences that qualify as FAQ candidates.
    """
    # Define keywords that typically indicate a question
    faq_keywords = ["what", "how", "why", "where", "when", "who", "can", "should", "is", "are", "does", "do"]
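    # Keep sentences longer than five words that start with a FAQ keyword.
    # Note: startswith() is a plain prefix test, so "However" matches "how"
    # and "Downloadable" matches "do"; such statements also become candidates.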
    candidates = [
        s for s in sentences
        if len(s.split()) > 5 and any(s.lower().startswith(word) for word in faq_keywords)
    ]
    print(f"Filtered {len(candidates)} FAQ candidate sentences.")  # Display the count of FAQ candidates
    return candidates  # Return the filtered list of FAQ sentences

# Extract potential FAQ candidates
faq_candidates = filter_faq_candidates(sentences)

# Step 4: Rank FAQs using semantic similarity
# Purpose: Use AI to rank FAQ candidates based on how similar they are to the given query.
def rank_faqs(candidates, query):
    """
    Rank FAQ candidates by their semantic similarity to a given query.
    Args:
        candidates (list): List of potential FAQ sentences.
        query (str): The query to compare the candidates against.
    Returns:
        list: Ranked FAQ candidates.
    """
    if not candidates:  # If there are no FAQ candidates, return default FAQs
        print("No FAQ candidates found. Returning placeholder FAQs.")
        return ["What is SEO?", "How does advanced SEO improve rankings?", "Why choose AI-powered SEO?"]

    # Load a pre-trained semantic model
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    # Encode the query and candidates into embeddings (numerical representations)
    query_embedding = model.encode(query, convert_to_tensor=True)
    candidate_embeddings = model.encode(candidates, convert_to_tensor=True)

    # Calculate similarity scores between the query and each candidate
    scores = util.pytorch_cos_sim(query_embedding, candidate_embeddings).squeeze().tolist()
    if isinstance(scores, float):  # If there's only one candidate, ensure scores is a list
        scores = [scores]

    # Sort candidates by their similarity scores in descending order
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_faqs = [candidates[i] for i in ranked_indices[:10]]  # Select the top 10 FAQs
    return ranked_faqs  # Return the ranked FAQ candidates

# Define a query to rank FAQs against
query = "Generate FAQs for SEO and digital marketing"
ranked_faqs = rank_faqs(faq_candidates, query)

# Step 5: Display structured FAQ suggestions
# Purpose: Show the final list of ranked FAQs in a user-friendly format.
def display_faqs(faqs):
    """
    Display the list of FAQ suggestions in a structured format.
    Args:
        faqs (list): List of FAQ sentences.
    """
    print("\nStructured FAQ Suggestions:")  # Title for the FAQ suggestions
    for idx, faq in enumerate(faqs, 1):  # Enumerate to number each FAQ
        print(f"{idx}. {faq}")  # Print each FAQ with its rank

# Display the top-ranked FAQs
display_faqs(ranked_faqs)




Loaded content from 'Paragraphs' in 'scraped_content_with_cleaned_text.csv'. Combined length: 114142 characters.
Extracted 865 sentences.
Filtered 27 FAQ candidate sentences.



Structured FAQ Suggestions:
1. However, technical on-site SEO stands out as one of the most powerful tools in your digital marketing arsenal.
2. HOW DOES OUR DIGITAL MARKETING SYSTEM WORK?
3. When comparing SEO with offline advertising or other digital marketing approaches, one notable advantage of SEO is its ability to deliver a high ROI.
4. When you are committing to digital marketing, you can inquire the specialists to look into your competitor’s online policies.
5. When searchers see keywords they were actively searching for featured on your product pages, they are more likely to consider your business as the right one to engage with.
6. When executed effectively, on-page SEO can yield a high return on investment.
7. Does digital marketing look as unfamiliar as a binary code to you?
8. Downloadable content offerings are resources provided by businesses for users to download.
9. When these strategies are put together, they create the perfect way to make your business more profitabl

# **Detailed Explanation of the Output**
---

### **1. Loaded Content from the 'Paragraphs' Column**
- **What it means**:
  The program extracted all the text from the `Paragraphs` column in the file `scraped_content_with_cleaned_text.csv`. This column contains the main body of text from the scraped webpages.
- **Combined Length**:
  The total length of this text is 114,142 characters.
- **Why it’s important**:
  By consolidating all the content, you can analyze the website's core message without jumping between multiple files or sections. This ensures a comprehensive view of your website's information.

---

### **2. Extracted 865 Sentences**
- **What it means**:
  The combined text was split into 865 sentences. Each sentence is a self-contained piece of information.
- **Why it’s important**:
  Analyzing individual sentences helps identify which ones are relevant as potential FAQs or need optimization. Breaking text into smaller chunks makes it easier to review and process.
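
The splitting behaviour can be checked on a short sample. The sketch below reuses the regular expression from `split_into_sentences`; the sample text is made up for illustration.

```python
import re

# Same pattern as in split_into_sentences above.
SENTENCE_BOUNDARY = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'

sample = "Mr. Roy leads the team. How does SEO work? It depends."
parts = re.split(SENTENCE_BOUNDARY, sample)
for part in parts:
    print(part)
```

The title "Mr." does not trigger a split because of the second lookbehind. Exclamation marks are not treated as sentence endings by this pattern, so text such as "Great! Next." stays in one piece.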

---

### **3. Filtered 27 FAQ Candidate Sentences**
- **What it means**:
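  From the 865 sentences, the program identified 27 that are potential candidates for FAQs (Frequently Asked Questions). These sentences were selected because they:
  - Start with a question keyword such as "what", "how", "when", or "does". The check is a simple prefix match, so statements like "However, ..." (which begins with "how") and "When executed effectively, ..." are also selected.
  - Contain more than five words, so they are long enough to provide meaningful information.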
- **Why it’s important**:
  FAQ-style content is critical for:
  - Answering user queries effectively.
  - Enhancing user experience.
  - Improving SEO rankings, as search engines favor content that directly answers user questions.
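
Because the filter uses a prefix match, statements such as "However, ..." and "Downloadable content ..." appear among the candidates. One possible refinement, shown here only as a sketch, is to require a word boundary after the keyword and optionally a trailing question mark. The function `is_question_like` and its `require_question_mark` parameter are hypothetical and not part of the notebook.

```python
import re

FAQ_KEYWORDS = ["what", "how", "why", "where", "when", "who",
                "can", "should", "is", "are", "does", "do"]

# \b after the keyword stops "how" from matching "However".
STARTS_WITH_QUESTION_WORD = re.compile(
    r"^(?:%s)\b" % "|".join(FAQ_KEYWORDS), re.IGNORECASE
)

def is_question_like(sentence, require_question_mark=False):
    """Return True if the sentence starts with a whole question keyword."""
    text = sentence.strip()
    if not STARTS_WITH_QUESTION_WORD.match(text):
        return False
    return text.endswith("?") or not require_question_mark

print(is_question_like("However, technical on-site SEO stands out."))
print(is_question_like("HOW DOES OUR DIGITAL MARKETING SYSTEM WORK?"))
print(is_question_like("When executed effectively, on-page SEO can yield a high return.",
                       require_question_mark=True))
```

Setting `require_question_mark=True` is stricter and would also drop "When ..." statements, which may or may not be desirable depending on how the candidates are later rewritten.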

---

### **4. Structured FAQ Suggestions**
Here’s a list of the top 10 suggested FAQs from the filtered sentences:

1. **"However, technical on-site SEO stands out as one of the most powerful tools in your digital marketing arsenal."**
   - **Use**: Could be rewritten as an FAQ like: "Why is technical on-site SEO a powerful tool?"
   - **Action**: Expand on this point with examples or statistics about how technical SEO improves results.

2. **"HOW DOES OUR DIGITAL MARKETING SYSTEM WORK?"**
   - **Use**: Directly usable as an FAQ.
   - **Action**: Provide a clear and concise answer explaining your digital marketing process.

3. **"When comparing SEO with offline advertising or other digital marketing approaches, one notable advantage of SEO is its ability to deliver a high ROI."**
   - **Use**: Could be rewritten as: "What are the benefits of SEO compared to offline advertising?"
   - **Action**: Highlight real-world examples and ROI comparisons.

4. **"When you are committing to digital marketing, you can inquire the specialists to look into your competitor’s online policies."**
   - **Use**: Could be rewritten as: "Why should businesses analyze competitors' online strategies in digital marketing?"
   - **Action**: Offer practical steps or tools for competitive analysis.

5. **"When searchers see keywords they were actively searching for featured on your product pages, they are more likely to consider your business as the right one to engage with."**
   - **Use**: Could be rewritten as: "How do targeted keywords improve customer engagement?"
   - **Action**: Provide a guide on keyword research and placement.

6. **"When executed effectively, on-page SEO can yield a high return on investment."**
   - **Use**: Could be rewritten as: "What is the ROI of effective on-page SEO?"
   - **Action**: Share examples or case studies showing ROI improvements.

7. **"Does digital marketing look as unfamiliar as a binary code to you?"**
   - **Use**: Could be rewritten as: "What makes digital marketing easy to understand?"
   - **Action**: Create beginner-friendly guides or explainer videos.

8. **"Downloadable content offerings are resources provided by businesses for users to download."**
   - **Use**: Could be rewritten as: "What are downloadable content offerings?"
   - **Action**: Explain examples like eBooks, whitepapers, and how they benefit users.

9. **"When these strategies are put together, they create the perfect way to make your business more profitable."**
   - **Use**: Could be rewritten as: "How do combined digital strategies increase profitability?"
   - **Action**: Offer insights into creating integrated strategies.

10. **"When your business consistently ranks at the top of search results, it reinforces your position as an industry leader."**
    - **Use**: Could be rewritten as: "How does ranking high on search results enhance your brand's reputation?"
    - **Action**: Highlight the benefits of top search rankings.

---

### **How This Output Is Useful**
1. **Improved User Experience**:
   - By addressing these questions, you directly answer what users are searching for, improving engagement.

2. **SEO Optimization**:
   - FAQ content with keywords improves your chances of ranking for voice and text-based search queries.

3. **Content Gaps**:
   - Highlights areas where additional information or better explanations are needed.

4. **Customer Trust**:
   - Providing clear answers builds trust and positions your brand as an expert in the field.

---

### **Next Steps After Getting This Output**
1. **Rewriting FAQs**:
   - Take the structured FAQ suggestions and rewrite them into concise, user-friendly questions.
   - Ensure answers are detailed, easy to understand, and value-packed.

2. **Integrating into Website**:
   - Add the FAQs to relevant pages, such as service pages or a dedicated FAQ section.

3. **Expanding Content**:
   - For each FAQ, create detailed blog posts, videos, or downloadable resources to further engage users.

4. **Voice Search Optimization**:
   - Use these FAQs as part of your strategy for voice search queries, as they mirror how users phrase questions.

5. **Track and Optimize**:
   - Use analytics tools to track how users interact with these FAQs and refine them based on feedback.

---


---
# **Part 4: Metadata and Content Optimization**
**Title**: **"Improving Metadata and Content Quality"**  
**Purpose**: This part of the code focuses on analyzing and improving meta descriptions, identifying content gaps, and optimizing paragraphs for better readability.

**Key Steps**:
1. **Metadata Recommendations**: Suggests changes to meta descriptions and headings to ensure they are neither too short nor too long.
2. **Content Gap Analysis**: Scans paragraphs for missing FAQs, placeholders like "coming soon", and overly short content that needs expansion.
3. **Content Optimization**: Recommends splitting long paragraphs and simplifying complex sentences for better user engagement.
4. **Keyword Insights**: Analyzes and lists the most common keywords in the content to help refine SEO strategies.

---



In [4]:
import pandas as pd
from collections import Counter
import re

# Step 1: Load the scraped and cleaned data
# Purpose: Load the structured data (headings, paragraphs, etc.) from a CSV file for analysis.
def load_scraped_data(file_path):
    """
    Load the scraped content from a CSV file.
    Args:
        file_path (str): Path to the CSV file containing the cleaned scraped content.
    Returns:
        pd.DataFrame: DataFrame containing the structured data.
    """
    try:
        # Read the CSV file into a pandas DataFrame
        data = pd.read_csv(file_path)
        print(f"Loaded data from '{file_path}'.")
        return data  # Return the DataFrame
    except Exception as e:
        # Handle errors during file loading
        print(f"Error loading data: {e}")
        return pd.DataFrame()  # Return an empty DataFrame if an error occurs

scraped_data = load_scraped_data("scraped_content_with_cleaned_text.csv")

# Step 2: Generate Metadata Recommendations
# Purpose: Analyze meta descriptions and headings to provide suggestions for improvement.
def generate_metadata_recommendations(dataframe):
    """
    Provide recommendations to improve meta descriptions and headings for better visibility.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Metadata improvement suggestions.
    """
    recommendations = []
    for idx, row in dataframe.iterrows():
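        # Extract meta description and headings from the row;
        # treat missing (NaN) cells as empty strings so .split() does not fail
        meta_desc = row.get("Meta Description", "")
        meta_desc = meta_desc if isinstance(meta_desc, str) else ""
        headings = row.get("Headings", "")
        headings = headings if isinstance(headings, str) else ""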

        # Check if the meta description is too short or too long
        if len(meta_desc.split()) < 10:
            recommendations.append(f"Expand meta description: '{meta_desc}' (too short).")
        elif len(meta_desc.split()) > 30:
            recommendations.append(f"Condense meta description: '{meta_desc}' (too long).")

        # Check if the heading is too long for readability
        if len(headings.split()) > 15:
            recommendations.append(f"Condense heading: '{headings}' (too long for readability).")
    return recommendations  # Return the list of recommendations

metadata_recommendations = generate_metadata_recommendations(scraped_data)

# Step 3: Content Gap Analysis
# Purpose: Identify missing or incomplete content to improve website relevance and user experience.
def perform_content_gap_analysis(dataframe):
    """
    Identify potential content gaps based on paragraph analysis.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Insights on content gaps.
    """
    gaps = []
    faq_keywords = ["what", "how", "why", "where", "when", "who", "can", "should", "is", "are", "does", "do"]

    # Skip missing (NaN) cells, which would otherwise break re.split()
    for paragraph in dataframe["Paragraphs"].dropna().astype(str):
        # Split the paragraph into sentences
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
        for sentence in sentences:
            # Check if the sentence is a potential FAQ
            if any(sentence.lower().startswith(word) for word in faq_keywords):
                gaps.append(f"Expand on this FAQ: '{sentence.strip()}'")

        # Identify placeholder content
        if "coming soon" in paragraph.lower() or "click here" in paragraph.lower():
            gaps.append(f"Replace placeholder content: '{paragraph.strip()}'")

        # Flag very short paragraphs
        if len(paragraph.split()) < 10:
            gaps.append(f"Expand short paragraph: '{paragraph.strip()}'")
    return gaps  # Return the list of content gaps

content_gaps = perform_content_gap_analysis(scraped_data)

# Step 4: Optimized Content Recommendations
# Purpose: Provide suggestions to improve readability and make content more user-friendly.
def generate_optimized_content_recommendations(dataframe):
    """
    Suggest ways to optimize paragraphs for readability and conversational tone.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Optimized content recommendations.
    """
    recommendations = []
    for idx, row in dataframe.iterrows():
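        paragraph = row.get("Paragraphs", "")
        # Treat missing (NaN) cells as empty text
        paragraph = paragraph if isinstance(paragraph, str) else ""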
        # Suggest splitting long paragraphs
        if len(paragraph.split()) > 50:
            recommendations.append(f"Split long paragraph: '{paragraph[:100]}...'")
        # Suggest simplifying complex sentences
        if len(paragraph.split()) > 30 and any(word in paragraph.lower() for word in ["therefore", "hence", "however"]):
            recommendations.append(f"Simplify complex paragraph: '{paragraph[:100]}...'")
    return recommendations  # Return the list of recommendations

optimized_content_recommendations = generate_optimized_content_recommendations(scraped_data)

# Step 5: Extract Keyword Insights
# Purpose: Identify the most frequently used keywords in the content.
def extract_keyword_insights(dataframe):
    """
    Extract and count popular keywords from the content.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Top 10 keywords with their counts.
    """
    # Combine all cleaned paragraphs into one string
    text = ' '.join(dataframe["Cleaned Paragraphs"].fillna(""))
    # Tokenize the text into individual words
    words = re.findall(r'\b\w+\b', text.lower())
    # Remove common stopwords
    stop_words = set(["the", "and", "to", "for", "of", "in", "a", "is", "on", "with", "as", "it", "by", "an", "or"])
    keywords = [word for word in words if word not in stop_words]
    # Count keyword frequencies
    keyword_counts = Counter(keywords)
    return keyword_counts.most_common(10)  # Return the top 10 keywords

keyword_insights = extract_keyword_insights(scraped_data)

# Step 6: Save and Display All Results
# Purpose: Save the analysis results to a CSV file and display a preview for validation.
def save_and_display_results(metadata, gaps, content_recs, keywords):
    """
    Save all recommendations and insights to a CSV file and display a preview.
    Args:
        metadata (list): Metadata recommendations.
        gaps (list): Content gap analysis results.
        content_recs (list): Optimized content recommendations.
        keywords (list): Top keywords with their counts.
    """
    # Create a dictionary of results
    results = {
        "Metadata Recommendations": metadata,
        "Content Gaps": gaps,
        "Optimized Content": content_recs,
        "Keyword Insights": [f"{k}: {v}" for k, v in keywords]
    }
    # Convert the results into a DataFrame
    results_df = pd.DataFrame({k: pd.Series(v) for k, v in results.items()})
    # Save the DataFrame to a CSV file
    results_df.to_csv("voice_search_optimization_results.csv", index=False)
    print("\nResults saved to 'voice_search_optimization_results.csv'.")
    # Display a preview of the results
    # (DataFrame.to_markdown() requires the optional 'tabulate' package)
    print("\nPreview:")
    print(results_df.head().to_markdown())

# Save and display the results
save_and_display_results(metadata_recommendations, content_gaps, optimized_content_recommendations, keyword_insights)


Loaded data from 'scraped_content_with_cleaned_text.csv'.

Results saved to 'voice_search_optimization_results.csv'.

Preview:
|    | Metadata Recommendations | Content Gaps

# **Detailed Explanation of the Output Table**

---

### **1. Metadata Recommendations**

- **What It Means**:
  - This column highlights issues with headings and meta descriptions that may impact readability and SEO.
  - Example:
    - "Condense heading: 'Home GET A CUSTOMIZED SEO AUDIT…'" suggests that the heading is too long to be user-friendly or effective in search results.
  
- **Why It’s Useful**:
  - Long or unclear headings/meta descriptions can confuse users and harm your SEO ranking. Search engines prefer concise, relevant, and well-structured headings.

- **Action Steps**:
  - **Shorten overly long headings**: Make them direct, relevant, and user-focused.
    - Example: Change "Home GET A CUSTOMIZED SEO AUDIT…" to "Get Your Customized SEO Audit Today."
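  - **Revise meta descriptions**: Ensure they are concise (50-160 characters), descriptive, and include keywords. Note that the code above measures length in words and flags descriptions shorter than 10 or longer than 30 words.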
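
The notebook's check counts words (fewer than 10 or more than 30), while the guideline above is given in characters (50-160). The sketch below applies both limits to a single description. `check_meta_description` is a hypothetical helper, and its default thresholds are simply the numbers quoted in this notebook, not authoritative limits.

```python
def check_meta_description(text, min_words=10, max_words=30,
                           min_chars=50, max_chars=160):
    """Return a list of length issues found in one meta description."""
    # Missing values (for example NaN from pandas) are treated as empty text.
    text = text.strip() if isinstance(text, str) else ""
    issues = []
    word_count = len(text.split())
    if word_count < min_words:
        issues.append("too few words")
    elif word_count > max_words:
        issues.append("too many words")
    if len(text) < min_chars:
        issues.append("too few characters")
    elif len(text) > max_chars:
        issues.append("too many characters")
    return issues

description = ("THATWARE® is the world's first SEO agency to seamlessly "
               "integrate AI into its services.")
print(check_meta_description(description))
print(check_meta_description("SEO agency"))
```

An empty list means the description passes both the word-based and the character-based check.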

---

### **2. Content Gaps**

- **What It Means**:
  - This column identifies areas where the content is incomplete, unclear, or missing crucial details.
  - Example:
    - "Expand on this FAQ: 'What distinguishes us is the 927+ AI algorithms...'" suggests you elaborate on how these algorithms are unique and valuable.

- **Why It’s Useful**:
  - Filling content gaps ensures your website answers user questions comprehensively, improving user experience and search engine rankings.

- **Action Steps**:
  - **Expand FAQs**: Provide detailed answers to questions flagged as incomplete.
    - Example: For "What distinguishes us is the 927+ AI algorithms...," explain the unique benefits and applications of these algorithms.
  - **Eliminate placeholder text**: Replace "coming soon" or vague content with real, actionable information.

---

### **3. Optimized Content**

- **What It Means**:
  - Highlights paragraphs that are too long or complex, suggesting ways to make them clearer and more user-friendly.
  - Example:
    - "Split long paragraph: '$ RevenueGenerated via SEO Qualified LeadsGenerated...'" indicates the need to break down this dense block of text into smaller, digestible parts.

- **Why It’s Useful**:
  - Users (and search engines) prefer clear and concise content. Long or complex paragraphs can overwhelm readers and lead to poor engagement.

- **Action Steps**:
  - **Split long paragraphs**: Break content into smaller sections with clear subheadings.
    - Example: Convert "Thatware enables you to take strategic decisions…" into 2-3 shorter paragraphs.
  - **Simplify complex sentences**: Avoid jargon or overly technical language; focus on clarity.
    - Example: Instead of "Therefore, the metrics calculated by our AI improve ROI significantly," say "Our AI boosts ROI with precise metrics."
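
Splitting a long paragraph can be partly automated by grouping whole sentences into chunks that stay under a word limit. This is a rough sketch under stated assumptions: `split_paragraph` is a hypothetical helper, the 50-word default mirrors the threshold used in the code above, and the simple sentence pattern does not handle abbreviations.

```python
import re

def split_paragraph(paragraph, max_words=50):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk.
    """
    if not paragraph.strip():
        return []
    sentences = re.split(r"(?<=[.?!])\s+", paragraph.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five six. Seven eight nine."
chunks = split_paragraph(text, max_words=6)
print(chunks)
```

The resulting chunks are only a starting point; each one still needs a human edit to read well and, where useful, its own subheading.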

---

### **4. Keyword Insights**

- **What It Means**:
  - Lists the most frequently used keywords on your website along with their occurrence counts.
  - Example:
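    - "SEO: 281 occurrences" shows that "SEO" is the most frequent keyword, followed by "services," "marketing," etc. The count is lower than the 313 reported in the earlier keyword analysis, most likely because this step counts only the `Cleaned Paragraphs` column.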

- **Why It’s Useful**:
  - Helps identify whether your content aligns with your target audience’s search behavior.
  - Shows whether you’re overusing or underutilizing certain keywords.

- **Action Steps**:
  - **Optimize for target keywords**: Ensure the top keywords like "SEO" and "services" are used naturally in headings, meta descriptions, and body text.
  - **Avoid keyword stuffing**: If a keyword appears excessively, ensure it doesn’t feel forced or redundant.
  - **Leverage related keywords**: For example, if "SEO services" is prominent, include phrases like "SEO strategies" or "SEO tools" to cover related terms.
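
To judge whether a keyword is overused, it can help to look at its share of all words rather than its raw count. The sketch below computes that share for the most frequent words; `keyword_density` is a hypothetical helper, the sample text is made up, and no particular density threshold is implied.

```python
import re
from collections import Counter

def keyword_density(text, top_n=3):
    """Return (keyword, count, share of all words) for the top_n words."""
    words = re.findall(r"\b\w+\b", text.lower())
    total = len(words)
    if total == 0:
        return []
    counts = Counter(words)
    return [(word, n, n / total) for word, n in counts.most_common(top_n)]

sample = "seo services seo marketing seo services audit link"
density = keyword_density(sample, top_n=2)
for word, count, share in density:
    print(f"{word}: {count} occurrences ({share:.1%} of words)")
```

On real data, the input would be the cleaned text (for example the joined `Cleaned Paragraphs` column), so that stopwords do not dominate the list.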

---

### **How This Output Helps**

1. **Improves Readability and User Experience**:
   - Users appreciate clear, concise, and actionable content. By addressing long headings and complex paragraphs, you make your site more user-friendly.

2. **Boosts SEO Rankings**:
   - Search engines prioritize websites that provide relevant answers, use keywords effectively, and maintain clean metadata.

3. **Identifies Opportunities for Engagement**:
   - FAQs and optimized content ensure users find the answers they’re looking for, increasing the likelihood of conversion or engagement.

4. **Actionable Insights**:
   - You now know exactly where your content falls short and how to fix it. This roadmap helps prioritize tasks like rewriting headings, adding FAQs, or simplifying content.

---

### **Next Steps**

1. **Review Each Recommendation**:
   - Use the metadata recommendations to refine headings and meta descriptions.
   - Use content gaps to create or expand FAQs.

2. **Rewrite Content**:
   - Break down long paragraphs and simplify dense sections as highlighted.

3. **Keyword Optimization**:
   - Ensure the most frequent relevant keywords are incorporated naturally into new or revised content.

4. **Update Your Website**:
   - Implement changes directly on your site and monitor user engagement or SEO improvements over time.

5. **Repeat Analysis**:
   - Periodically run this tool again after updates to measure progress and refine further.

---


# **AI-Powered Voice Search Optimization Model Code**

import requests  # For making HTTP requests to fetch webpage content
from bs4 import BeautifulSoup  # For parsing HTML content of webpages
import pandas as pd  # For structuring and saving data in tabular format
import re  # For cleaning and normalizing text data

# Step 1: Define the list of URLs to scrape
# Purpose: The URLs represent webpages we want to scrape to extract useful data such as headings, paragraphs, and meta descriptions.
urls = [
    'https://thatware.co/',  # Homepage URL of the target site
    'https://thatware.co/digital-marketing-services/',  # Digital Marketing Services page
    'https://thatware.co/business-intelligence-services/',  # Business Intelligence Services page
    'https://thatware.co/link-building-services/',  # Link Building Services page
    'https://thatware.co/branding-press-release-services/',  # Branding and Press Release Services page
    'https://thatware.co/advanced-seo-services/',  # Advanced SEO Services page
    # Add more URLs if additional pages need to be scraped
]

# Step 2: Define a function to clean and normalize text
# Purpose: Raw text often contains unwanted characters, extra spaces, or stopwords. This function removes such noise to prepare the text for analysis.
def clean_text(text, remove_stopwords=True):
    """
    Cleans and normalizes text data.
    - Removes extra spaces, converts text to lowercase, and removes special characters.
    - Optionally removes common words ("stopwords") like 'the', 'and', 'to', etc., that don't add much meaning.

    Args:
        text (str): The raw text to clean.
        remove_stopwords (bool): Whether to remove stopwords from the text.

    Returns:
        str: The cleaned text.
    """
    # Replace multiple spaces with a single space for better readability
    text = re.sub(r'\s+', ' ', text)
    # Convert all characters to lowercase for uniformity
    text = text.lower()
    # Remove special characters like punctuation marks
    text = re.sub(r'[^\w\s]', '', text)

    # If stopwords need to be removed
    if remove_stopwords:
        # Define a set of common stopwords
        stop_words = set([
            'the', 'and', 'to', 'for', 'of', 'in', 'a', 'is', 'on', 'with', 'that',
            'as', 'it', 'you', 'your', 'our', 'this', 'by', 'at', 'be', 'are', 'can', 'an'
        ])
        # Filter out stopwords from the text
        text = ' '.join(word for word in text.split() if word not in stop_words)

    return text  # Return the cleaned text

# Step 3: Define a function to scrape data from a single webpage
# Purpose: This function fetches data like headings, paragraphs, and meta descriptions from a given webpage.
def scrape_webpage(url):
    """
    Scrapes data from a given URL, including headings, paragraphs, and meta descriptions.

    Args:
        url (str): The URL of the webpage to scrape.

    Returns:
        dict: A dictionary containing the URL, headings, paragraphs, and meta descriptions.
    """
    try:
        # Make an HTTP GET request to fetch the webpage content
        response = requests.get(url, timeout=10)  # Timeout after 10 seconds if no response
        response.raise_for_status()  # Raise an error if the request fails

        # Parse the webpage content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove unwanted tags like <script>, <style>, and <noscript> to clean the HTML content
        for tag in soup(['script', 'style', 'noscript']):
            tag.decompose()

        # Extract headings (H1, H2, H3 tags) from the webpage
        headings = ' '.join(tag.get_text(strip=True) for tag in soup.find_all(['h1', 'h2', 'h3']))

        # Extract paragraphs (<p> tags) from the webpage
        paragraphs = ' '.join(tag.get_text(strip=True) for tag in soup.find_all('p'))

        # Extract the meta description content
        meta_tag = soup.find('meta', attrs={'name': 'description'})
        meta_description = meta_tag['content'] if meta_tag and 'content' in meta_tag.attrs else "No meta description available"

        # Return the extracted data in a dictionary format
        return {
            "URL": url,
            "Headings": headings,
            "Paragraphs": paragraphs,
            "Meta Description": meta_description
        }

    except Exception as e:
        # Handle errors that occur during scraping
        print(f"Error scraping {url}: {e}")
        return {
            "URL": url,
            "Headings": "",
            "Paragraphs": "",
            "Meta Description": "Error fetching meta description"
        }

# Step 4: Scrape content from all URLs in the list
# Purpose: Iterate through all URLs, scrape their content, and clean the extracted text for analysis.
scraped_data = []
for url in urls:
    print(f"Scraping content from: {url}")  # Inform the user which URL is being processed
    # Scrape data from the current URL
    scraped = scrape_webpage(url)
    # Clean the extracted text and add it to the result
    scraped['Cleaned Headings'] = clean_text(scraped['Headings'], remove_stopwords=True)
    scraped['Cleaned Paragraphs'] = clean_text(scraped['Paragraphs'], remove_stopwords=True)
    scraped['Cleaned Meta Description'] = clean_text(scraped['Meta Description'], remove_stopwords=True)
    scraped_data.append(scraped)  # Append the result to the list

# Step 5: Save the scraped data to a CSV file
# Purpose: Save the extracted and cleaned data in a structured format for further analysis or sharing.
if scraped_data:
    # Convert the list of dictionaries to a Pandas DataFrame for tabular representation
    df = pd.DataFrame(scraped_data)
    # Save the DataFrame to a CSV file
    df.to_csv("scraped_content_with_cleaned_text.csv", index=False)
    print("\nScraped content has been saved to 'scraped_content_with_cleaned_text.csv'.")  # Notify the user

    # Step 6: Preview the scraped content
    # Purpose: Show the first few rows of the scraped data as a quick preview to ensure everything worked as expected.
    print("\nPreview of Scraped Content:")
    print(df.head())  # Display the first 5 rows of the DataFrame
else:
    # Notify the user if no content was scraped
    print("No content was scraped.")


import pandas as pd  # For working with tabular data (like spreadsheets)
import re  # For cleaning and manipulating text using regular expressions
from collections import Counter  # For counting occurrences of words and phrases
from nltk.corpus import stopwords  # For filtering out common "stopwords" like "the", "and", etc.
import nltk  # Natural Language Toolkit for text processing

# Step 1: Ensure stopwords are available for filtering common words
# Purpose: Stopwords are common words like "the", "is", and "and" that don't contribute much meaning.
# Removing these helps focus on more meaningful terms in the text analysis.
def ensure_stopwords():
    """
    Download stopwords if not already available.
    This step ensures that we have the necessary resources to filter out common English words.
    """
    try:
        # nltk.download() returns False instead of raising when the download fails
        if not nltk.download('stopwords', quiet=True):
            raise RuntimeError("nltk.download('stopwords') returned False.")
        print("Stopwords successfully validated.")  # Confirm the resource is available
    except Exception as e:
        print(f"Error downloading stopwords: {e}")  # Notify if there's an issue
        raise RuntimeError("Failed to initialize stopwords.") from e

# Ensure the stopwords are downloaded before proceeding
ensure_stopwords()

# Step 2: Load the cleaned scraped content
# Purpose: Load the structured data (cleaned text) from a CSV file for further processing.
def load_scraped_data(file_path):
    """
    Load the cleaned content from a CSV file.
    Args:
        file_path (str): Path to the CSV file.
    Returns:
        pd.DataFrame: A DataFrame containing the cleaned scraped data.
    """
    try:
        # Load the data into a pandas DataFrame for easy manipulation
        data = pd.read_csv(file_path)
        print(f"Data loaded successfully from '{file_path}'.")  # Confirm successful loading
        print(data.head())  # Show the first few rows to validate the structure
        return data
    except Exception as e:
        # Notify and return an empty DataFrame if there's an error
        print(f"Error loading file: {e}")
        return pd.DataFrame()

# Load the cleaned scraped content
scraped_data = load_scraped_data("scraped_content_with_cleaned_text.csv")

# Step 3: Process the content for keyword extraction if data is available
if not scraped_data.empty:
    # Combine cleaned headings and paragraphs into one text block for analysis
    # Purpose: To consolidate all meaningful text data for unified processing.
    combined_content = ' '.join(scraped_data['Cleaned Headings'].fillna('')) + ' ' + ' '.join(scraped_data['Cleaned Paragraphs'].fillna(''))
    print("\nCombined content length:", len(combined_content))  # Show the length of the combined text

    # Tokenize the combined content into individual words
    # Purpose: Break the text into smaller units (tokens) for analysis.
    words = re.findall(r'\b\w+\b', combined_content.lower())  # Extract words ignoring case
    print("\nTokenized Words Sample:", words[:20])  # Display the first 20 words for review

    # Step 4: Remove stopwords from the tokenized words
    # Purpose: Eliminate commonly used words (e.g., "the", "and") that do not add meaningful insights.
    stop_words = set(stopwords.words('english'))  # Load the list of stopwords in English
    filtered_words = [word for word in words if word not in stop_words and len(word) > 2]  # Exclude stopwords and very short words
    print("\nFiltered Words Sample:", filtered_words[:20])  # Display the first 20 filtered words for review

    # Step 5: Count keyword frequencies
    # Purpose: Identify the most frequently used meaningful words in the content.
    keyword_counts = Counter(filtered_words)  # Count occurrences of each word
    print("\nTop Keywords:")  # Display the most common keywords
    for word, count in keyword_counts.most_common(10):  # Show the top 10 keywords
        print(f" - {word}: {count} occurrences")

    # Step 6: Extract bigrams (two-word phrases)
    # Purpose: Identify commonly used phrases (pairs of words) for better conversational insights.
    def extract_ngrams(tokens, n):
        """
        Generate n-grams (phrases of n words) from a list of tokens.
        Args:
            tokens (list): List of individual words (tokens).
            n (int): Size of the n-grams to generate (e.g., 2 for bigrams).
        Returns:
            list: List of n-grams as strings.
        """
        # Use zip to create n-grams by shifting tokens
        ngrams = zip(*[tokens[i:] for i in range(n)])
        return [' '.join(ngram) for ngram in ngrams]  # Join the words in each n-gram

    # Extract bigrams from the filtered words
    bigrams = Counter(extract_ngrams(filtered_words, 2))  # Create and count bigrams
    print("\nTop Bigrams (Two-Word Phrases):")  # Display the most common bigrams
    for phrase, count in bigrams.most_common(10):  # Show the top 10 bigrams
        print(f" - {phrase}: {count} occurrences")
else:
    # Notify if no content is available for processing
    print("No content available for analysis.")


from sentence_transformers import SentenceTransformer, util  # For semantic similarity and ranking
import pandas as pd  # For data manipulation
import re  # For splitting text into sentences using patterns

# Step 1: Load raw content from the specified column
# Purpose: Extract the text content from a specific column in the CSV file for further processing.
def load_raw_content(file_path, column_name):
    """
    Load raw content from a specified column in the provided CSV file.
    Args:
        file_path (str): The file path to the cleaned CSV file.
        column_name (str): The name of the column to extract content from.
    Returns:
        str: Combined text from the specified column.
    """
    try:
        # Read the CSV file
        data = pd.read_csv(file_path)
        # Combine all non-empty rows in the specified column into one string
        # (astype(str) guards against non-text cells such as numbers)
        content = ' '.join(data[column_name].dropna().astype(str))
        print(f"Loaded content from '{column_name}' in '{file_path}'. Combined length: {len(content)} characters.")
        return content  # Return the combined text
    except Exception as e:
        print(f"Error loading content from '{column_name}': {e}")
        return ""  # Return an empty string if loading fails

# Load the text content from the 'Paragraphs' column
raw_content = load_raw_content("scraped_content_with_cleaned_text.csv", "Paragraphs")

# Step 2: Split content into sentences
# Purpose: Break the text into smaller units (sentences) for easier analysis and FAQ identification.
def split_into_sentences(content):
    """
    Split combined content into sentences using regex.
    Args:
        content (str): The full text to split into sentences.
    Returns:
        list: List of sentences.
    """
    # Use regex to split text where periods or question marks are followed by a space
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', content)
    print(f"Extracted {len(sentences)} sentences.")  # Display how many sentences were extracted
    return sentences  # Return the list of sentences

# Break the combined content into sentences
sentences = split_into_sentences(raw_content)

# Step 3: Filter FAQ candidate sentences
# Purpose: Identify sentences that are likely to be useful as FAQs (start with "What", "How", etc.).
def filter_faq_candidates(sentences):
    """
    Identify potential FAQ sentences based on keywords and sentence length.
    Args:
        sentences (list): List of sentences.
    Returns:
        list: Filtered sentences that qualify as FAQ candidates.
    """
    # Define keywords that typically indicate a question
    faq_keywords = ["what", "how", "why", "where", "when", "who", "can", "should", "is", "are", "does", "do"]
    # Filter sentences that start with a FAQ keyword and are reasonably long
    candidates = [
        s for s in sentences
        if len(s.split()) > 5 and any(s.lower().startswith(word) for word in faq_keywords)
    ]
    print(f"Filtered {len(candidates)} FAQ candidate sentences.")  # Display the count of FAQ candidates
    return candidates  # Return the filtered list of FAQ sentences

# Extract potential FAQ candidates
faq_candidates = filter_faq_candidates(sentences)

# Step 4: Rank FAQs using semantic similarity
# Purpose: Use AI to rank FAQ candidates based on how similar they are to the given query.
def rank_faqs(candidates, query):
    """
    Rank FAQ candidates by their semantic similarity to a given query.
    Args:
        candidates (list): List of potential FAQ sentences.
        query (str): The query to compare the candidates against.
    Returns:
        list: Ranked FAQ candidates.
    """
    if not candidates:  # If there are no FAQ candidates, return default FAQs
        print("No FAQ candidates found. Returning placeholder FAQs.")
        return ["What is SEO?", "How does advanced SEO improve rankings?", "Why choose AI-powered SEO?"]

    # Load a pre-trained semantic model
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    # Encode the query and candidates into embeddings (numerical representations)
    query_embedding = model.encode(query, convert_to_tensor=True)
    candidate_embeddings = model.encode(candidates, convert_to_tensor=True)

    # Calculate similarity scores between the query and each candidate
    scores = util.pytorch_cos_sim(query_embedding, candidate_embeddings).squeeze().tolist()
    if isinstance(scores, float):  # If there's only one candidate, ensure scores is a list
        scores = [scores]

    # Sort candidates by their similarity scores in descending order
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_faqs = [candidates[i] for i in ranked_indices[:10]]  # Select the top 10 FAQs
    return ranked_faqs  # Return the ranked FAQ candidates

# Define a query to rank FAQs against
query = "Generate FAQs for SEO and digital marketing"
ranked_faqs = rank_faqs(faq_candidates, query)

# Step 5: Display structured FAQ suggestions
# Purpose: Show the final list of ranked FAQs in a user-friendly format.
def display_faqs(faqs):
    """
    Display the list of FAQ suggestions in a structured format.
    Args:
        faqs (list): List of FAQ sentences.
    """
    print("\nStructured FAQ Suggestions:")  # Title for the FAQ suggestions
    for idx, faq in enumerate(faqs, 1):  # Enumerate to number each FAQ
        print(f"{idx}. {faq}")  # Print each FAQ with its rank

# Display the top-ranked FAQs
display_faqs(ranked_faqs)



import pandas as pd
from collections import Counter
import re

# Step 1: Load the scraped and cleaned data
# Purpose: Load the structured data (headings, paragraphs, etc.) from a CSV file for analysis.
def load_scraped_data(file_path):
    """
    Load the scraped content from a CSV file.
    Args:
        file_path (str): Path to the CSV file containing the cleaned scraped content.
    Returns:
        pd.DataFrame: DataFrame containing the structured data.
    """
    try:
        # Read the CSV file into a pandas DataFrame
        data = pd.read_csv(file_path)
        print(f"Loaded data from '{file_path}'.")
        return data  # Return the DataFrame
    except Exception as e:
        # Handle errors during file loading
        print(f"Error loading data: {e}")
        return pd.DataFrame()  # Return an empty DataFrame if an error occurs

scraped_data = load_scraped_data("scraped_content_with_cleaned_text.csv")

# Step 2: Generate Metadata Recommendations
# Purpose: Analyze meta descriptions and headings to provide suggestions for improvement.
def generate_metadata_recommendations(dataframe):
    """
    Provide recommendations to improve meta descriptions and headings for better visibility.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Metadata improvement suggestions.
    """
    recommendations = []
    for idx, row in dataframe.iterrows():
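        # Extract meta description and headings from the row.
        # Empty CSV cells are loaded as NaN (a float), so coerce them to "" before calling .split()
        meta_desc = row.get("Meta Description", "")
        meta_desc = "" if pd.isna(meta_desc) else str(meta_desc)
        headings = row.get("Headings", "")
        headings = "" if pd.isna(headings) else str(headings)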

        # Check if the meta description is too short or too long
        if len(meta_desc.split()) < 10:
            recommendations.append(f"Expand meta description: '{meta_desc}' (too short).")
        elif len(meta_desc.split()) > 30:
            recommendations.append(f"Condense meta description: '{meta_desc}' (too long).")

        # Check if the heading is too long for readability
        if len(headings.split()) > 15:
            recommendations.append(f"Condense heading: '{headings}' (too long for readability).")
    return recommendations  # Return the list of recommendations

metadata_recommendations = generate_metadata_recommendations(scraped_data)

# Step 3: Content Gap Analysis
# Purpose: Identify missing or incomplete content to improve website relevance and user experience.
def perform_content_gap_analysis(dataframe):
    """
    Identify potential content gaps based on paragraph analysis.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Insights on content gaps.
    """
    gaps = []
    faq_keywords = ["what", "how", "why", "where", "when", "who", "can", "should", "is", "are", "does", "do"]

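    # Empty CSV cells are loaded as NaN, so replace them with "" before string processing
    for paragraph in dataframe["Paragraphs"].fillna("").astype(str):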
        # Split the paragraph into sentences
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
        for sentence in sentences:
            # Check if the sentence is a potential FAQ
            if any(sentence.lower().startswith(word) for word in faq_keywords):
                gaps.append(f"Expand on this FAQ: '{sentence.strip()}'")

        # Identify placeholder content
        if "coming soon" in paragraph.lower() or "click here" in paragraph.lower():
            gaps.append(f"Replace placeholder content: '{paragraph.strip()}'")

        # Flag very short paragraphs
        if len(paragraph.split()) < 10:
            gaps.append(f"Expand short paragraph: '{paragraph.strip()}'")
    return gaps  # Return the list of content gaps

content_gaps = perform_content_gap_analysis(scraped_data)

# Step 4: Optimized Content Recommendations
# Purpose: Provide suggestions to improve readability and make content more user-friendly.
def generate_optimized_content_recommendations(dataframe):
    """
    Suggest ways to optimize paragraphs for readability and conversational tone.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Optimized content recommendations.
    """
    recommendations = []
    for idx, row in dataframe.iterrows():
        paragraph = row.get("Paragraphs", "")
        paragraph = "" if pd.isna(paragraph) else str(paragraph)  # Empty CSV cells load as NaN
        # Suggest splitting long paragraphs
        if len(paragraph.split()) > 50:
            recommendations.append(f"Split long paragraph: '{paragraph[:100]}...'")
        # Suggest simplifying complex sentences
        if len(paragraph.split()) > 30 and any(word in paragraph.lower() for word in ["therefore", "hence", "however"]):
            recommendations.append(f"Simplify complex paragraph: '{paragraph[:100]}...'")
    return recommendations  # Return the list of recommendations

optimized_content_recommendations = generate_optimized_content_recommendations(scraped_data)

# Step 5: Extract Keyword Insights
# Purpose: Identify the most frequently used keywords in the content.
def extract_keyword_insights(dataframe):
    """
    Extract and count popular keywords from the content.
    Args:
        dataframe (pd.DataFrame): DataFrame containing scraped data.
    Returns:
        list: Top 10 keywords with their counts.
    """
    # Combine all cleaned paragraphs into one string
    text = ' '.join(dataframe["Cleaned Paragraphs"].fillna(""))
    # Tokenize the text into individual words
    words = re.findall(r'\b\w+\b', text.lower())
    # Remove common stopwords
    stop_words = set(["the", "and", "to", "for", "of", "in", "a", "is", "on", "with", "as", "it", "by", "an", "or"])
    keywords = [word for word in words if word not in stop_words]
    # Count keyword frequencies
    keyword_counts = Counter(keywords)
    return keyword_counts.most_common(10)  # Return the top 10 keywords

keyword_insights = extract_keyword_insights(scraped_data)

# Step 6: Save and Display All Results
# Purpose: Save the analysis results to a CSV file and display a preview for validation.
def save_and_display_results(metadata, gaps, content_recs, keywords):
    """
    Save all recommendations and insights to a CSV file and display a preview.
    Args:
        metadata (list): Metadata recommendations.
        gaps (list): Content gap analysis results.
        content_recs (list): Optimized content recommendations.
        keywords (list): Top keywords with their counts.
    """
    # Create a dictionary of results
    results = {
        "Metadata Recommendations": metadata,
        "Content Gaps": gaps,
        "Optimized Content": content_recs,
        "Keyword Insights": [f"{k}: {v}" for k, v in keywords]
    }
    # Convert the results into a DataFrame
    results_df = pd.DataFrame({k: pd.Series(v) for k, v in results.items()})
    # Save the DataFrame to a CSV file
    results_df.to_csv("voice_search_optimization_results.csv", index=False)
    print("\nResults saved to 'voice_search_optimization_results.csv'.")
    # Display a preview of the results (to_markdown() requires the optional 'tabulate' package)
    print("\nPreview:")
    print(results_df.head().to_markdown())

# Save and display the results
save_and_display_results(metadata_recommendations, content_gaps, optimized_content_recommendations, keyword_insights)


Scraping content from: https://thatware.co/
Scraping content from: https://thatware.co/digital-marketing-services/
Scraping content from: https://thatware.co/business-intelligence-services/
Scraping content from: https://thatware.co/link-building-services/
Scraping content from: https://thatware.co/branding-press-release-services/
Scraping content from: https://thatware.co/advanced-seo-services/

Scraped content has been saved to 'scraped_content_with_cleaned_text.csv'.

Preview of Scraped Content:
                                                 URL  \
0                               https://thatware.co/   
1    https://thatware.co/digital-marketing-services/   
2  https://thatware.co/business-intelligence-serv...   
3        https://thatware.co/link-building-services/   
4  https://thatware.co/branding-press-release-ser...   

                                            Headings  \
0  Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARK...   
1  Advanced Digital Marketing Services GET A FR

# **Explanation of the Output**
---

### **What Is This Output About?**

1. **Purpose of FAQ Suggestions**:
   - This output identifies sentences on your site that could become FAQs: questions visitors might ask, or statements that answer them.
   - These FAQs can be displayed on your website to:
     - Improve user experience by answering common queries.
     - Enhance SEO by targeting long-tail keywords and natural language queries that people use in search engines.

2. **How Were These Suggestions Generated?**:
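   - The system split the scraped paragraph text into sentences and kept those that:
     - Begin with a question-word prefix (e.g., "How," "What," "Why," "When," "Do"). The match is on the first letters only, so statements such as "However, ..." or "Downloadable ..." also qualify.
     - Contain more than five words.
   - The remaining candidates were ranked by semantic similarity to the query "Generate FAQs for SEO and digital marketing", and the top 10 were kept.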
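
Because `filter_faq_candidates` uses a plain `startswith` test, "However" matches "how" and "Downloadable" matches "do". A stricter variant could require the question word to be a whole word and, optionally, require the sentence to end with a question mark. This is a sketch of an alternative, not the filter that produced the suggestions explained in this section; the function name and the `require_question_mark` flag are assumptions.

```python
import re

FAQ_WORDS = ["what", "how", "why", "where", "when", "who",
             "can", "should", "is", "are", "does", "do"]

# \b after the alternation lets "how" match "How does..." but not "However, ...".
FAQ_PATTERN = re.compile(r"^\s*(?:" + "|".join(FAQ_WORDS) + r")\b", re.IGNORECASE)


def filter_faq_candidates_strict(sentences, min_words=6, require_question_mark=False):
    """Keep sentences that open with a whole question word and are long enough."""
    candidates = []
    for sentence in sentences:
        if len(sentence.split()) < min_words:
            continue  # Same length rule as the original filter (more than five words)
        if not FAQ_PATTERN.match(sentence):
            continue
        if require_question_mark and not sentence.rstrip().endswith("?"):
            continue
        candidates.append(sentence)
    return candidates


samples = [
    "However, technical on-site SEO stands out as one of the most powerful tools.",
    "HOW DOES OUR DIGITAL MARKETING SYSTEM WORK?",
    "Downloadable content offerings are resources provided by businesses.",
    "When executed effectively, on-page SEO can yield a high return on investment.",
]
for sentence in filter_faq_candidates_strict(samples):
    print(sentence)
```

With the default settings, the "However" and "Downloadable" sentences are dropped while the "When ..." statement is kept; passing `require_question_mark=True` keeps only the genuine question.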

---

### **Detailed Explanation of Each FAQ Suggestion**

#### **1. "However, technical on-site SEO stands out as one of the most powerful tools in your digital marketing arsenal."**
   - **What It Means**: This is a statement emphasizing the importance of on-site SEO (optimizing individual web pages to rank higher in search results).
   - **Why It’s Useful**: Visitors might want to know why on-site SEO is critical and how it can help their website.
   - **What to Do**: Expand this into a question-answer format, like:
     - **Q**: Why is technical on-site SEO important?
     - **A**: Technical on-site SEO improves website performance and helps search engines understand your content, leading to higher rankings.

#### **2. "HOW DOES OUR DIGITAL MARKETING SYSTEM WORK?"**
   - **What It Means**: This is a direct question extracted from your content, which users might naturally ask.
   - **Why It’s Useful**: Explaining how your digital marketing system works builds trust and helps potential clients understand your process.
   - **What to Do**: Provide a detailed, step-by-step explanation of your marketing approach in response to this question.

#### **3. "When comparing SEO with offline advertising or other digital marketing approaches, one notable advantage of SEO is its ability to deliver a high ROI."**
   - **What It Means**: This statement compares SEO to other marketing methods, highlighting its cost-effectiveness.
   - **Why It’s Useful**: Users might wonder why they should invest in SEO rather than traditional marketing.
   - **What to Do**: Reframe this as a question, like:
     - **Q**: How does SEO compare to offline advertising?
     - **A**: SEO delivers a higher return on investment because it targets users actively searching for your products or services.

#### **4. "When you are committing to digital marketing, you can inquire the specialists to look into your competitor’s online policies."**
   - **What It Means**: This suggests that analyzing competitors is an essential part of a digital marketing strategy.
   - **Why It’s Useful**: Many users are curious about how competitive analysis works and why it’s important.
   - **What to Do**: Turn this into an FAQ, like:
     - **Q**: How does competitor analysis help in digital marketing?
     - **A**: It identifies your competitors’ strategies, strengths, and weaknesses, helping you create a more effective marketing plan.

#### **5. "When searchers see keywords they were actively searching for featured on your product pages, they are more likely to consider your business as the right one to engage with."**
   - **What It Means**: This explains how keyword optimization increases the chances of user engagement.
   - **Why It’s Useful**: Users may want to understand how keywords impact their site’s visibility and conversion rates.
   - **What to Do**: Create an FAQ like:
     - **Q**: Why is keyword optimization important for my website?
     - **A**: Keywords help match your content to user searches, increasing visibility and engagement.

#### **6. "When executed effectively, on-page SEO can yield a high return on investment."**
   - **What It Means**: On-page SEO, when done right, offers significant benefits.
   - **Why It’s Useful**: It assures users that focusing on on-page SEO can directly impact their profitability.
   - **What to Do**: Expand this into an FAQ:
     - **Q**: What is the ROI of on-page SEO?
     - **A**: On-page SEO improves visibility, attracts organic traffic, and drives conversions, resulting in a high return on investment.

#### **7. "Does digital marketing look as unfamiliar as a binary code to you?"**
   - **What It Means**: This is a question targeting users who may feel overwhelmed or confused by digital marketing.
   - **Why It’s Useful**: It empathizes with users and offers to simplify complex concepts.
   - **What to Do**: Follow up with an answer like:
     - **Q**: I’m new to digital marketing. Can you help me understand it?
     - **A**: Absolutely! Our team breaks down digital marketing into simple, actionable steps tailored to your business needs.

#### **8. "Downloadable content offerings are resources provided by businesses for users to download."**
   - **What It Means**: This defines downloadable content as a marketing tool.
   - **Why It’s Useful**: Users might want to know how downloadable resources can benefit their business.
   - **What to Do**: Create an FAQ like:
     - **Q**: What is downloadable content, and how does it help?
     - **A**: Downloadable content, like eBooks or guides, attracts leads by offering valuable information in exchange for user details.

#### **9. "When these strategies are put together, they create the perfect way to make your business more profitable."**
   - **What It Means**: This explains how combining different marketing strategies can drive results.
   - **Why It’s Useful**: Users often look for an integrated approach to achieve profitability.
   - **What to Do**: Frame an FAQ like:
     - **Q**: How do integrated marketing strategies increase profitability?
     - **A**: They combine SEO, content marketing, and social media to maximize your reach and ROI.

#### **10. "When your business consistently ranks at the top of search results, it reinforces your position as an industry leader."**
   - **What It Means**: Ranking high in search results enhances credibility and authority.
   - **Why It’s Useful**: Users want to know the benefits of achieving top search rankings.
   - **What to Do**: Create an FAQ like:
     - **Q**: How does top search ranking benefit my business?
     - **A**: High rankings boost visibility, attract more customers, and establish your business as an industry leader.

---

### **What This Output Conveys**

- It provides a set of potential FAQs and content ideas based on your website’s current text.
- These FAQs can:
  - Enhance user engagement.
  - Address common questions your audience might have.
  - Improve your website’s SEO by targeting natural language queries.

---

### **Next Steps**

1. **Review and Edit the Suggestions**:
   - Ensure the FAQs align with your business offerings and goals.
   - Edit or expand the answers to make them informative and actionable.

2. **Add FAQs to Your Website**:
   - Create an FAQ section or page.
   - Organize the FAQs into categories if needed (e.g., SEO, Digital Marketing, Strategy).

3. **Optimize for SEO**:
   - Use keywords from the FAQ suggestions to improve visibility.
   - Ensure the answers address user intent.

4. **Monitor Performance**:
   - Track how these FAQs impact your website traffic and user engagement.
   - Update or add new FAQs based on user behavior and feedback.
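
One possible way to publish the reviewed FAQs is to also expose them as schema.org `FAQPage` structured data. The sketch below is an assumption about how that could be done and is not part of this notebook's pipeline; the `build_faq_jsonld` helper is hypothetical, and the sample question and answer are copied from suggestion 10 above.

```python
import json

def build_faq_jsonld(faqs):
    """Build a schema.org FAQPage dict from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in faqs
        ],
    }

# Sample FAQ taken from the suggestions above; replace with your edited FAQs.
faqs = [
    (
        "How does top search ranking benefit my business?",
        "High rankings boost visibility, attract more customers, and "
        "establish your business as an industry leader.",
    ),
]

faq_jsonld = build_faq_jsonld(faqs)
# The printed JSON can be placed inside a <script type="application/ld+json"> tag.
print(json.dumps(faq_jsonld, indent=2))
```

Whether search engines show this markup as a rich result depends on their current policies, so treat it as optional.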

---

This output is a blueprint for improving your website’s relevance and engagement through structured content that answers user questions directly.

### **Explanation of the Output**

This output is the result of analyzing a website's content to generate actionable insights and recommendations. It has four main sections: **Metadata Recommendations**, **Content Gaps**, **Optimized Content**, and **Keyword Insights**. These insights are aimed at improving the website’s performance, user engagement, and search engine optimization (SEO). Below is a simple explanation of each part of the output, what it conveys, and the next steps you should take.

---

### **1. Metadata Recommendations**

#### **What Does It Mean?**
- **Metadata** includes the headings and meta descriptions of your website, which are critical for SEO and user engagement.
- The output provides suggestions to **shorten or optimize headings and meta descriptions** that are too long and may negatively impact readability or search rankings.

#### **Example from Output:**
- **"Condense heading: 'Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARKETING STRATEGY FOR YOUR BUSINESS...'"**
  - This heading is flagged as too long. Long headings can confuse users and, when reused as page titles, may be truncated in search results.

#### **What Should You Do?**
- Review the suggested headings and make them shorter and more precise. For example:
  - Original: "Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARKETING STRATEGY FOR YOUR BUSINESS..."
  - Condensed: "Customized SEO Audit & Digital Marketing Strategy"
- Keep meta descriptions between **50-160 characters**. Google does not publish a fixed limit, but longer descriptions are usually truncated in search results.
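
The length checks described above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual implementation: the function names are hypothetical, the 60-character heading limit is an assumed threshold, and the 50-160 range comes from the recommendation above. The sample heading uses only the visible part of the flagged text.

```python
def check_meta_description(text, min_len=50, max_len=160):
    """Return a recommendation string, or None if the length is in range."""
    length = len(text.strip())
    if length < min_len:
        return f"Expand meta description ({length} chars, minimum {min_len})"
    if length > max_len:
        return f"Condense meta description ({length} chars, maximum {max_len})"
    return None

def check_heading(text, max_len=60):
    """Flag headings longer than max_len characters (an assumed threshold)."""
    heading = text.strip()
    if len(heading) > max_len:
        return f"Condense heading: '{heading[:max_len]}...'"
    return None

original = "Home GET A CUSTOMIZED SEO AUDIT & DIGITAL MARKETING STRATEGY FOR YOUR BUSINESS"
condensed = "Customized SEO Audit & Digital Marketing Strategy"

print(check_heading(original))   # flagged: longer than 60 characters
print(check_heading(condensed))  # None: within the assumed limit
```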

#### **Why Is It Useful?**
- Optimized metadata improves your **click-through rate (CTR)** on search engines.
- Concise and meaningful headings/meta descriptions make it easier for users to understand your content.

---

### **2. Content Gaps**

#### **What Does It Mean?**
- **Content gaps** highlight missing, incomplete, or underdeveloped information on your website.
- The output suggests expanding on specific topics or FAQs to better address user queries.

#### **Example from Output:**
- **"Expand on this FAQ: 'What distinguishes us is the 927+ AI algorithms we've developed over the past 11 years.' "**
  - This indicates that the content briefly mentions the AI algorithms but doesn’t explain how they work or why they’re valuable.

#### **What Should You Do?**
- For each suggested FAQ or content gap:
  - Write a clear, detailed explanation or answer.
  - Add examples, benefits, or case studies to make it more informative.
  - Example: Expand on “927+ AI algorithms” by explaining their impact, industries they’ve benefited, and unique features.

#### **Why Is It Useful?**
- Filling content gaps makes your website more **authoritative** and **relevant** for visitors.
- It also helps rank for **long-tail keywords** (specific queries like "AI-powered SEO benefits").

---

### **3. Optimized Content**

#### **What Does It Mean?**
- This section identifies **long or complex paragraphs** that need to be split or simplified for better readability.
- Content that is easier to read keeps users engaged and reduces bounce rates.

#### **Example from Output:**
- **"Split long paragraph: '$ RevenueGenerated via SEO Qualified LeadsGenerated 11 years ago...'"**
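  - This block was flagged as too long and hard to follow. The run-together words (for example, "RevenueGenerated") suggest it may be several scraped page elements merged into one, so check the source page first; then split it into shorter sections with subheadings to improve clarity.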

- **"Simplify complex paragraph: 'Thatware is your go-to advanced digital marketing agency for the digital marketing services requir...'"**
  - This indicates that the language is too complex and needs to be simplified for a broader audience.

#### **What Should You Do?**
- Break down large blocks of text into smaller paragraphs (3-5 sentences each).
- Use bullet points, subheadings, and simpler language to convey key points.
- Example: Rewrite a jargon-heavy phrase like “advanced digital marketing agency” as a plain sentence such as “We specialize in advanced marketing strategies tailored to your needs.”
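
Splitting a long block into shorter paragraphs can be partly automated. The sketch below assumes a naive sentence splitter (break after `.`, `!`, or `?`) and a hypothetical `split_paragraph` helper; it will mis-split abbreviations such as "e.g.", so review the result by hand.

```python
import re

def split_paragraph(paragraph, max_sentences=4):
    """Split a paragraph into chunks of at most max_sentences sentences."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

# Placeholder text with nine sentences, standing in for a long scraped paragraph.
long_text = " ".join(f"This is sentence number {n}." for n in range(1, 10))
chunks = split_paragraph(long_text, max_sentences=4)
for chunk in chunks:
    print(chunk, end="\n\n")
```

The `max_sentences` value of 4 sits inside the 3-5 sentence range suggested above; adjust it to taste.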

#### **Why Is It Useful?**
- Improves **readability**, making the content accessible to a wider audience.
- Keeps users on the page longer, improving **engagement metrics** like average time on site.

---

### **4. Keyword Insights**

#### **What Does It Mean?**
- This section identifies the **most frequently used keywords** on your website and their usage frequency.
- It also highlights **popular two-word phrases (bigrams)**, which are crucial for targeting voice and search engine queries.

#### **Examples from Output:**
- **Top Keywords:**
  - **"seo: 281 occurrences"** → Indicates SEO is a primary focus on the website.
  - **"services: 274 occurrences"** → Suggests the website discusses services extensively.

- **Top Bigrams:**
  - **"digital marketing: 99 occurrences"** → A common phrase likely to attract users searching for marketing services.

#### **What Should You Do?**
- Ensure the identified keywords are integrated naturally into headings, meta descriptions, and body text.
- Use **bigrams** like “digital marketing” in FAQs, blog titles, and product descriptions.
- Avoid keyword stuffing; focus on providing value around these terms.

#### **Why Is It Useful?**
- Helps you align your website’s content with **popular search terms**, improving SEO rankings.
- Informs you which topics are overemphasized or underutilized, so you can adjust your focus accordingly.

---

### **How Does This Output Help Your Website?**

1. **Boost SEO**:
   - By optimizing metadata and filling content gaps, your website becomes more search-engine friendly.
   - The use of relevant keywords and phrases helps target specific user queries.

2. **Improve User Engagement**:
   - Concise and relevant FAQs address common user concerns.
   - Simplified content increases readability and keeps users on the site longer.

3. **Build Trust and Authority**:
   - Detailed explanations and well-structured content position your website as a reliable source of information.

4. **Increase Conversion Rates**:
   - Clear calls to action and engaging content encourage users to explore services and make inquiries.

---

### **Steps to Take After This Output**

1. **Implement Metadata Changes**:
   - Shorten and rewrite headings and meta descriptions based on recommendations.
   - Test how the changes affect click-through rates.

2. **Expand FAQs and Content**:
   - Address the suggested content gaps by creating new sections, detailed FAQs, or blog posts.
   - Use real-life examples or client testimonials for credibility.

3. **Optimize Content for Readability**:
   - Simplify complex paragraphs and split long blocks of text.
   - Add visuals like images or infographics to break up dense content.

4. **Incorporate Keywords Strategically**:
   - Ensure keywords and bigrams are used in headings, subheadings, and body text.
   - Create new content targeting these keywords for better SEO performance.

5. **Monitor Results**:
   - Use tools like Google Analytics to track improvements in organic traffic, bounce rates, and user engagement.

---


### **Explanation of the Output**

The output provides insights into **top keywords** and **top bigrams (two-word phrases)** used on the website. These insights help you understand what your content focuses on and how well it aligns with your target audience's search queries. The sections below explain each part and outline the steps to take based on this information.

---

### **1. Top Keywords**

**What Does It Mean?**
- This list shows the most frequently used **single words (keywords)** on your website, along with how many times they appear.
- These keywords indicate the main topics and themes of your content.
- Example:
  - **"seo: 313 occurrences"** means the word "SEO" appears 313 times across your website content.
  - **"services: 274 occurrences"** means "services" appears 274 times.

#### **Breakdown of Top Keywords:**
- **seo (313 occurrences):**
  - Your website heavily focuses on Search Engine Optimization (SEO), indicating it’s a primary offering or topic of discussion.
- **services (274 occurrences):**
  - Suggests your website emphasizes the services you provide, likely related to SEO and marketing.
- **marketing (190 occurrences) & digital (163 occurrences):**
  - Indicates a strong emphasis on digital marketing as a core offering.
- **advanced (125 occurrences):**
  - Suggests that your content promotes advanced-level expertise or tools, potentially setting you apart from competitors.
- **business (111 occurrences):**
  - Suggests that your content addresses businesses seeking your services.
- **link (96 occurrences) & building (89 occurrences):**
  - Indicates a focus on link-building strategies, which are essential for SEO.
- **search (75 occurrences) & website (69 occurrences):**
  - Highlights content related to improving search engine performance and website optimization.
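
Counts like these can be reproduced with a simple word-frequency tally. The sketch below is an assumption about the general approach, not the notebook's exact code: `top_keywords` is a hypothetical helper, the stop-word list is deliberately tiny, and the sample text is made up for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "our", "the", "to", "with", "your"}

def top_keywords(text, n=10):
    """Count lowercase word tokens, excluding stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS).most_common(n)

# Made-up sample text standing in for scraped website content.
sample = (
    "Our SEO services include advanced SEO audits. "
    "Digital marketing and SEO services for your business."
)
for word, count in top_keywords(sample, n=3):
    print(f"{word}: {count} occurrences")
```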

---

#### **What Steps Should You Take?**
1. **Validate the Keywords:**
   - Are these the keywords you want to rank for? If yes, continue building content around them. If no, adjust your content strategy.
   - Example: If you want to emphasize "AI-powered SEO" but it’s missing here, create more content targeting that keyword.

2. **Optimize Keyword Placement:**
   - Place these keywords strategically in headings, meta descriptions, subheadings, and content body. Ensure they appear naturally.
   - Avoid overusing keywords (keyword stuffing), as search engines penalize this.

3. **Explore Long-Tail Keywords:**
   - Build on these keywords by targeting specific variations.
   - Example: For "SEO," create content targeting "local SEO tips" or "SEO for small businesses."
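
To get a rough sense of whether a keyword is overused on a page, you can compute its density, meaning its share of all words on the page. The sketch below uses a hypothetical `keyword_density` helper and made-up sample text. There is no official threshold for keyword stuffing, so use the number only to compare pages against each other.

```python
import re

def keyword_density(text, keyword):
    """Return the share of word tokens in text that equal keyword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)

# Made-up sample sentence in which "SEO" is clearly overused.
page_text = "SEO tips for SEO beginners who want better SEO results"
density = keyword_density(page_text, "seo")
print(f"'seo' makes up {density:.0%} of the words on this page")
```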

---

### **2. Top Bigrams (Two-Word Phrases)**

**What Does It Mean?**
- Bigrams are pairs of words that frequently appear together. They reveal common phrases or topics discussed in your content.

#### **Breakdown of Top Bigrams:**
- **digital marketing (99 occurrences):**
  - Suggests your content heavily emphasizes digital marketing as a primary service.
- **advanced seo (62 occurrences):**
  - Highlights that your content promotes expertise in advanced SEO techniques.
- **link building (61 occurrences):**
  - Indicates a focus on creating strategies to improve website authority through backlinks.
- **seo services (56 occurrences):**
  - Suggests that "SEO services" is a core offering and likely a term your audience searches for.
- **advanced link (23 occurrences):**
  - Likely a fragment of the longer phrase "advanced link building," pointing to a niche offering around advanced link-building strategies.
- **search engine (22 occurrences) & search engines (18 occurrences):**
  - Indicates your content discusses optimizing websites for search engines like Google.
- **marketing strategy (19 occurrences):**
  - Suggests an emphasis on creating effective marketing strategies for clients.
- **get touch (19 occurrences):**
  - Most likely the phrase "get in touch" with the common word "in" filtered out during analysis. It is part of a call-to-action encouraging users to contact your team.
- **per month (17 occurrences):**
  - Could indicate pricing plans or recurring services.
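
The bigram counts above can be approximated by pairing adjacent words after stop words are dropped. That stop-word step is an assumption, but it would explain why "get in touch" shows up as "get touch". The `top_bigrams` helper, the stop-word list, and the sample text below are all illustrative.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "our", "the", "to", "with", "your"}

def top_bigrams(text, n=10):
    """Count adjacent word pairs after dropping stop words."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    # Note: this simple version also pairs words across sentence boundaries.
    pairs = zip(tokens, tokens[1:])
    return Counter(" ".join(p) for p in pairs).most_common(n)

# Made-up sample text; "in" is removed, so "get in touch" becomes "get touch".
sample = "Get in touch for digital marketing. Get in touch with our digital marketing team."
bigram_counts = dict(top_bigrams(sample))
print(bigram_counts)
```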

---

#### **What Steps Should You Take?**
1. **Focus on Bigram Relevance:**
   - Ensure these bigrams align with your core business goals. For example:
     - If "digital marketing" is your focus, develop more blog posts or service pages around this topic.
     - Highlight your expertise in "advanced SEO" by showcasing case studies or technical capabilities.

2. **Use Bigrams in Content Optimization:**
   - Integrate these phrases into page titles, FAQs, and internal links.
   - Example: Use "digital marketing" in blog post titles like "10 Digital Marketing Trends for 2024."

3. **Expand FAQs and Blog Topics:**
   - Use these phrases to create new blog posts or FAQs.
   - Example:
     - FAQ: "What are advanced SEO techniques?"
     - Blog: "The Role of Link Building in Advanced SEO."

4. **Leverage for Paid Campaigns:**
   - Use these keywords and phrases in your Google Ads or social media campaigns to attract the right audience.

---

### **3. What Does This Output Convey?**

- The **keyword and bigram analysis** shows the themes dominating your website’s content. It provides clarity on what topics you’re covering and how they align with user search behavior.
- It helps identify areas where you’re strong (like "digital marketing" and "SEO services") and areas you might need to improve or expand.

---

### **4. Next Steps to Increase Business**

1. **Target Audience Alignment:**
   - Ensure these keywords and bigrams resonate with the services your audience is looking for. If not, revise your content to better align with their needs.

2. **Content Gap Analysis:**
   - Look for missing keywords or phrases you want to target but aren’t appearing here.
   - Example: If you want to promote "AI-driven SEO," start creating content on this topic.

3. **Create High-Value Content:**
   - Build in-depth blog posts, whitepapers, or videos targeting the top keywords and phrases.
   - Example: A guide titled "Mastering Digital Marketing: Advanced SEO Techniques" targets several of the top keywords and bigrams at once.

4. **Monitor Performance:**
   - Use tools like Google Analytics or SEMrush to track how your rankings and traffic improve after implementing changes.

5. **Improve Internal Linking:**
   - Use these keywords and bigrams as anchor text to link between related pages on your website.
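
The content gap check in step 2 can be sketched as a simple lookup: list the phrases you want to rank for and see which ones never appear in your page text. The `find_missing_phrases` helper, the sample page text, and the target list are hypothetical.

```python
def find_missing_phrases(page_text, target_phrases):
    """Return the target phrases that never appear in page_text (case-insensitive)."""
    haystack = page_text.lower()
    return [p for p in target_phrases if p.lower() not in haystack]

# Made-up page text and target list for illustration.
page_text = "We offer digital marketing, advanced SEO, and link building services."
targets = ["digital marketing", "AI-driven SEO", "link building"]

missing = find_missing_phrases(page_text, targets)
print("Phrases to create content for:", missing)
```

This is a plain substring match, so it will not catch variations such as "AI driven SEO" without the hyphen; normalize the text first if that matters.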

---

### **How Does This Help Your Website?**

- **Improves SEO Rankings:**
  - By focusing on popular keywords and phrases, your website becomes more relevant to search engines.
- **Increases Traffic:**
  - Targeting the right terms helps attract users who are searching for your services.
- **Boosts Conversions:**
  - Clear and targeted content aligned with user queries improves engagement and encourages action, such as contacting your business or making a purchase.

---

### **Conclusion**

This output is a **content performance report**. It highlights the keywords and phrases that dominate your website, allowing you to fine-tune your strategy for better visibility and engagement. By implementing the suggested steps, you can make your website more appealing to both search engines and users, ultimately driving more traffic and conversions.