<a href="https://colab.research.google.com/github/Abhiss123/AlmaBetter-Projects/blob/main/Federated_Learning_for_Personalized_SEO_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name: Federated Learning for Personalized SEO Recommendations**

### **Purpose Of The Project:**

**Overview of the Project Purpose:**
The project "Federated Learning for Personalized SEO Recommendations" applies federated learning, a form of decentralized machine learning, to website engagement data. Its goal is to generate personalized recommendations that improve a website's visibility, engagement, and overall performance in search engine results. It is aimed at website owners, digital marketers, and content strategists who need actionable insights tailored to the needs and behavior of their target audience.

**Key Objectives and Benefits:**
1. **Personalized Content Recommendations:**
   - The project creates tailored suggestions to help website owners optimize their content based on user behavior and engagement data. For example, if a specific webpage has high user engagement but low traffic, the model might recommend promoting that page more widely to attract more visitors.
   
2. **SEO Insights and Optimization:**
   - By analyzing metrics like views, bounce rates, average session durations, and user engagement, the project generates insights on which types of content, keywords, and topics perform best. This helps website owners refine their SEO strategies, improve their ranking in search results, and better serve their target audience.
   
3. **Enhanced User Experience:**
   - The recommendations aim to enhance the overall user experience on a website by suggesting improvements to page structure, content relevance, loading speed, and interactive features based on data-driven insights.

**Why Use Federated Learning?**
- **Privacy-Preserving Data Analysis:** Unlike traditional models that collect and analyze data on a central server, federated learning works by training models locally on user devices or in separate data silos, ensuring that sensitive user data remains private. This makes it particularly appealing for analyzing user engagement without compromising privacy.
- **Improved Accuracy:** Because the model can learn from multiple sources that could not otherwise be pooled in one place, it sees more varied data than any single source provides. This can improve the accuracy of its insights while the raw data stays where it was generated.

### How the Project Achieves Its Purpose:
The project involves three main parts, each with a specific role in achieving the overall purpose:

1. **Part 1 - Data Collection and Cleaning:**
   - This step involves scraping relevant content from specified URLs. It collects data such as webpage meta descriptions, keywords, and main text content. The data is then cleaned and standardized for further analysis.
   
2. **Part 2 - Data Integration and Merging:**
   - This step merges the scraped content data with engagement and user interaction data. By standardizing and combining data from multiple sources (e.g., webpage views, user sessions, and bounce rates), the project creates a comprehensive dataset ready for analysis.
   
3. **Part 3 - Content Recommendation Generation:**
   - In the final step, the merged data is analyzed to generate personalized recommendations based on user engagement metrics, views, bounce rates, etc. The generated recommendations provide actionable insights that website owners can use to enhance content quality, optimize SEO, and improve overall engagement.


### **1. What is Federated Learning (FL) in SEO?**

Federated Learning is a method in machine learning where models are trained on data that remains on users’ devices or on local servers, rather than collecting all data in a central location. In SEO (Search Engine Optimization), applying FL means the model learns from data distributed across multiple sources (e.g., users visiting different pages on your website, user interactions with the website’s content, or even from various websites) without needing to centralize or collect this data in one place. This is particularly useful for personalization in SEO, where the goal is to make search results or website content more relevant to each user without compromising their privacy.

### **2. How Does Federated Learning Work in SEO?**

In traditional SEO, personalization can involve collecting user data (like search queries, clicks, and preferences) and then analyzing it centrally to improve search results and recommendations. Federated Learning changes this by keeping the data on the devices (or on the website servers where it's generated) and only sharing the model’s learned updates, not the actual user data. For example:
- **Step 1**: A central server sends the current model to each participant (a user device or a website server), where interaction data such as clicks or keyword interest is already stored.
- **Step 2**: Each participant trains its copy of the model on its own local data.
- **Step 3**: Participants send only their model updates to the central server, which aggregates them (for example, by averaging) to improve the shared model without seeing any raw user data.
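
The three steps above can be sketched as a toy federated averaging loop. This is a minimal illustration rather than the project's implementation: the one-feature linear model, the synthetic client data, and the names `local_update` and `federated_average` are all assumptions made for the example.

```python
import random

def local_update(weights, data, lr=0.05, epochs=5):
    """Client step: gradient descent for y ~ w*x + b using this client's data only."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]

def federated_average(client_weights, client_sizes):
    """Server step: average client parameters, weighted by local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(cw[i] * size for cw, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

def make_client(n):
    """Synthetic local dataset (hypothetical): y = 3x + 0.5, never shared with the server."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 0.5) for x in xs]

random.seed(0)
clients = [make_client(n) for n in (30, 50, 20)]

global_weights = [0.0, 0.0]
for _ in range(200):  # communication rounds
    # Steps 1 and 2: each client starts from the global model and trains locally
    updates = [local_update(global_weights, data) for data in clients]
    # Step 3: the server sees only the updated parameters, not the (x, y) pairs
    global_weights = federated_average(updates, [len(d) for d in clients])

print(global_weights)
```

Weighting by sample count lets clients with more data pull the shared model harder; an unweighted mean would be a reasonable alternative when clients are similar in size.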

### **3. Use Cases for Federated Learning in SEO (Website Context)**

In the context of a website, Federated Learning can be used for:
- **Personalized Content Recommendations**: Tailoring the articles, products, or resources shown to each user based on their previous interactions, without storing their data on a central server.
- **Improved Search Rankings within the Website**: The model can improve how it ranks or recommends content on the website itself based on user behavior (like click-through rates or time spent on each page).
- **Predicting User Intent**: By learning from previous user interactions, the model can predict what each visitor is looking for and guide them to relevant content faster.
- **Content Optimization**: Understanding which keywords or page types (blogs, product pages, etc.) perform best for different user segments and adjusting site content accordingly, enhancing SEO.

### **4. Real-Life Implementations of Federated Learning in SEO**

While Federated Learning is relatively new in SEO, it’s being adopted in areas like:
- **Personalized Recommendations on News Websites**: Federated models help personalize content based on how users interact with articles, without sharing their reading habits.
- **E-commerce Product Recommendations**: Online stores are beginning to use FL for customized product suggestions based on previous interactions, maintaining user privacy.
- **Search Engines with Localized Results**: Search providers have experimented with learning from device-level interactions. Google's early Federated Learning work, for example, trained Gboard keyboard suggestions on-device; the same pattern could be applied to personalizing search results.

### **5. What Kind of Data Does a Federated Learning Model Need?**

For Federated Learning in SEO, the model typically needs:
- **User Interaction Data**: Information on clicks, page visits, time spent on different sections, or search queries within the website.
- **Content Data**: Information about the content on the site, like keywords, meta tags, and other SEO-related elements.
- **Behavioral Data**: How users navigate and interact with different content types (blogs, articles, product pages).
  
The data can be in various formats, but a **CSV file** with structured data (e.g., columns for URL, keywords, user interaction metrics) often works best. Alternatively, the model can work directly with website content URLs if it has the capability to process the text from these URLs directly.
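
As a rough illustration of such a structured file, the sketch below builds a two-row CSV in memory and loads it with pandas. The column names are taken from the datasets used later in this notebook; the rows themselves are made-up placeholder values, not real measurements.

```python
import io
import pandas as pd

# Hypothetical sample rows: one line per page, one column per attribute or metric
sample_csv = io.StringIO(
    "URL,Keywords,Views,Bounce rate,Average session duration\n"
    "/medical-seo-service,medical seo,120,0.42,65.0\n"
    "/gaming-seo,gaming seo,80,0.55,40.5\n"
)

pages = pd.read_csv(sample_csv)

# Numeric columns are parsed as numbers, so they can be used directly as model inputs
print(pages.dtypes)
```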

### **6. Do You Need URLs or Just Data in CSV Format?**

You have two options:
- **URL-based Processing**: If the Federated Learning model can fetch and analyze webpage content directly, then you would only need to provide URLs of your website’s pages. This approach allows the model to process the text and extract keywords and other elements by crawling the page.
- **CSV-based Data**: If you already have structured data (like user interaction metrics, page keywords, content categories) in a CSV, you can provide this file. The model will then use this data to learn without needing to access the website directly.

For non-technical users, providing structured data in a CSV format is often simpler and more practical than configuring URL-based processing.

### **7. How Federated Learning Enhances Personalized Search Results**

Federated Learning can enhance SEO by personalizing user experiences without collecting personal data. For example, as users interact with the website, the model learns which types of content or keywords are most relevant to different audiences. It can then adjust search result rankings or content recommendations to suit each user’s interests, offering a more tailored experience. Since FL doesn’t need to centralize user data, it’s also a privacy-friendly approach, making it suitable for companies sensitive to data privacy concerns.

### **8. Expected Output from Federated Learning in SEO**

The output from a Federated Learning SEO model generally includes:
- **Personalized Content Recommendations**: A list of suggested pages or articles tailored for each user based on past behavior.
- **Optimized Internal Search Results**: Improved ranking of pages within the site’s search feature to show more relevant results to individual users.
- **SEO Insights**: Analysis on which keywords or content types are performing best, helping website owners adjust their strategy.
  
In website contexts, this output could take the form of **recommendation lists** or **search result adjustments** on the site itself. For instance, when a visitor searches for "beginner SEO tips," the model might suggest articles related to "SEO for beginners" or "simple SEO hacks" based on previous interactions.
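
To make the idea of a recommendation list concrete, here is a small sketch that ranks page titles against a search query. It is only a stand-in: plain keyword overlap (Jaccard similarity) takes the place of a trained model, and the function name `rank_pages` and the page titles are assumptions for the example.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())

def rank_pages(query, titles, top_n=2):
    """Return up to top_n titles that share words with the query, best match first."""
    q = tokenize(query)
    scored = []
    for title in titles:
        t = tokenize(title)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0  # Jaccard similarity
        scored.append((score, title))
    # Sorting is stable, so titles with equal scores keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]

titles = ["SEO for beginners", "simple SEO hacks", "banking compliance checklist"]
suggestions = rank_pages("beginner SEO tips", titles)
print(suggestions)
```

Exact word matching treats "beginner" and "beginners" as different words, so only "seo" overlaps here. A real system would add stemming and, as described above, signals from previous interactions.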


---


In [None]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


---


#### **Part 1: Web Scraping and Content Extraction**
- **Purpose**: This part of the code is responsible for extracting content from specified web pages. It collects meta descriptions, keywords, and main textual content, which will later be analyzed for SEO improvements.
- **What It Does**:
  - **Import Libraries**: Imports necessary libraries for web scraping (like requests, BeautifulSoup) and text processing.
  - **Define `clean_text` Function**: This function cleans the text by removing non-letter characters, converting it to lowercase, and removing common words (stopwords) that do not add much meaning (like "the", "is").
  - **Specify URLs**: A list of URLs to scrape is defined. These URLs point to different web pages related to SEO services.
  - **Scrape Content**: For each URL, the code sends a request to fetch the page content. It extracts meta descriptions, keywords (if available), and text from paragraphs and headings.
  - **Clean the Content**: The extracted text is cleaned using the `clean_text` function.
  - **Save Data**: The scraped data is saved into a CSV file named `final_scraped_webtool_content.csv` for use in later steps.



In [None]:
# Importing necessary libraries for web scraping, data processing, and text cleaning
import requests  # To send HTTP requests to URLs and fetch web content
from bs4 import BeautifulSoup  # For parsing HTML content from webpages
import pandas as pd  # For handling data in tabular (spreadsheet-like) format
import re  # Regular expressions for text cleaning (e.g., removing unwanted characters)
from nltk.corpus import stopwords  # To remove commonly used English words that do not add much meaning
import nltk  # Natural Language Toolkit for text processing

# Downloading stopwords if they are not already available
nltk.download('stopwords')

# Step 1: Function to clean and process text content
def clean_text(text):
    """
    Cleans input text by:
    - Removing non-alphabetic characters (e.g., numbers, punctuation)
    - Converting text to lowercase
    - Removing common English stopwords (e.g., 'and', 'the', etc.)

    Args:
    - text (str): The text content to clean

    Returns:
    - str: The cleaned text
    """
    # Using regular expressions to keep only letters and spaces (removes digits, punctuation, etc.)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Converting text to lowercase (ensures uniformity)
    text = text.lower()
    # Removing common English words that do not carry much meaning (stopwords)
    stop_words = set(stopwords.words('english'))
    # Splitting the text into words, removing stopwords, and joining them back together
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Step 2: List of URLs we want to scrape content from
urls = [
    'https://webtool.co/fitness-based-seo-service/',
    'https://webtool.co/attorney-based-seo-service/',
    'https://webtool.co/medical-seo-service/',
    'https://webtool.co/photography-seo-service/',
    'https://webtool.co/banking-seo-service/',
    'https://webtool.co/fashion-based-seo-service/',
    'https://webtool.co/real-estate-seo-service/',
    'https://webtool.co/adult-seo-service/',
    'https://webtool.co/cbd-seo-service/',
    'https://webtool.co/crypto-seo-service/',
    'https://webtool.co/ecommerce-seo-service/',
    'https://webtool.co/education-based-seo/',
    'https://webtool.co/gaming-seo/',
    'https://webtool.co/igaming-seo-service/',
    'https://webtool.co/cosmetics-seo-service/',
    'https://webtool.co/glass-wall-seo/',
    'https://webtool.co/cora/',
    'https://webtool.co/cosine-similarity/',
    'https://webtool.co/bagofwords/',
    'https://webtool.co/lda/',
    'https://webtool.co/tf-idf-checker/',
    'https://webtool.co/cooccurence/',
    'https://webtool.co/keydensity/',
    'https://webtool.co/proximity/',
    'https://webtool.co/semantic/',
    'https://webtool.co/n-gram/',
    'https://webtool.co/sentiment/',
    'https://webtool.co/advanced-seo-service/'
]

# Step 3: List to store data scraped from each URL
scraped_data = []

# Step 4: Function to scrape data from a single URL
def scrape_content(url):
    """
    Scrapes data from a given URL:
    - Extracts meta description and keywords (if present)
    - Extracts main text content (e.g., paragraphs, headings)
    - Cleans the extracted text using the clean_text function

    Args:
    - url (str): The URL to scrape

    Returns:
    - dict: A dictionary containing URL, cleaned meta description, keywords, and main text
    """
    try:
        # Sending a request to the URL to get its content
        response = requests.get(url, timeout=10)
        # Check if the request was successful (status code 200 means success)
        if response.status_code != 200:
            return {'URL': url, 'Meta Description': 'Error fetching data', 'Keywords': 'N/A', 'Cleaned Text': 'Error fetching data'}

        # Parsing the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extracting meta description (provides a brief summary of the webpage)
        meta_description = soup.find('meta', attrs={'name': 'description'})
        # .get() avoids a KeyError when the tag exists but has no 'content' attribute
        meta_content = meta_description.get('content', 'No description') if meta_description else 'No description'

        # Extracting meta keywords (keywords for SEO purposes, if available)
        meta_keywords = soup.find('meta', attrs={'name': 'keywords'})
        if meta_keywords and 'content' in meta_keywords.attrs:
            keywords_content = meta_keywords['content'].strip()  # Extracting the actual keywords content
        else:
            keywords_content = None  # If no keywords are found

        # Extracting main text content from paragraphs and headings (provides most of the page's content)
        paragraphs = soup.find_all('p')
        headings = soup.find_all(['h1', 'h2'])
        # Combining text from paragraphs and headings into a single string
        main_text = ' '.join([p.get_text() for p in paragraphs + headings])

        # Cleaning the extracted text using the clean_text function
        cleaned_text = clean_text(main_text)

        # Returning the data in a dictionary
        return {
            'URL': url,
            # Clean real descriptions only; cleaning the placeholder would strip the stopword 'no'
            'Meta Description': clean_text(meta_content) if meta_content != 'No description' else meta_content,
            'Keywords': clean_text(keywords_content) if keywords_content else 'N/A',  # Cleaned keywords or 'N/A'
            'Cleaned Text': cleaned_text  # Cleaned main text
        }
    except Exception as e:
        # Handling any errors during the scraping process gracefully, while reporting what went wrong
        print(f"Failed to scrape {url}: {e}")
        return {
            'URL': url,
            'Meta Description': 'Error fetching data',
            'Keywords': 'N/A',
            'Cleaned Text': 'Error fetching data'
        }

# Step 5: Looping through the list of URLs to scrape content
for url in urls:
    # Scraping data from each URL and adding it to the list
    scraped_data.append(scrape_content(url))

# Step 6: Converting the list of scraped data into a DataFrame for easier processing and analysis
scraped_df = pd.DataFrame(scraped_data)

# Displaying the first few rows of the DataFrame to verify the scraped data
print("Final Scraped Data (Preview):")
print(scraped_df.head())

# Step 7: Saving the DataFrame to a CSV file for later use (e.g., merging with other datasets)
scraped_df.to_csv('final_scraped_webtool_content.csv', index=False)

# End of the first part
# This code handles web scraping and prepares the data, which will be used for merging in the next part.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Final Scraped Data (Preview):
                                              URL  \
0   https://webtool.co/fitness-based-seo-service/   
1  https://webtool.co/attorney-based-seo-service/   
2         https://webtool.co/medical-seo-service/   
3     https://webtool.co/photography-seo-service/   
4         https://webtool.co/banking-seo-service/   

                                    Meta Description Keywords  \
0  transform fitness brand expert seo services ba...      N/A   
1  looking way boost law firms online visibility ...      N/A   
2  medical seo services help improve healthcare o...      N/A   
3  looking boost photography businesss online vis...      N/A   
4  improve banks online visibility reach professi...      N/A   

                                        Cleaned Text  
0  fitness based seo service quick enquiry seo se...  
1  attorney based seo service quick enquiry seo s...  
2  medical seo service quick enquiry medical base...  
3  photography including techniques levi

---

### Explanation of the Output:

1. **Downloading Package Stopwords**:
   - This part of the output shows that the code is downloading or verifying the presence of a package called `stopwords` from the Natural Language Toolkit (NLTK).
   - **Why is this needed?** Stopwords are common words like "is", "and", "the" that do not contribute much meaning when analyzing text. They are removed from text data to make the analysis more focused and meaningful.
   - The line `[nltk_data] Downloading package stopwords to /root/nltk_data...` shows where the stopwords are stored, and `Package stopwords is already up-to-date!` confirms they were already present, so nothing new was downloaded.

2. **Final Scraped Data (Preview)**:
   - This section shows a preview of the data that was extracted (or scraped) from various URLs (web pages). This data represents web content and meta-information relevant to SEO analysis.
   - **What does it contain?** The preview displays a few columns for the first few rows of the dataset. Here’s what each column represents:
     - **URL**: This is the web address of the page that was scraped.
       - Example: `https://webtool.co/fitness-based-seo-service/` is the URL for a webpage related to fitness-based SEO services.
     - **Meta Description**: This is a brief summary or description of the webpage, usually provided in the meta tags of the HTML source code. It helps search engines understand the page’s content.
       - Example: The description for the fitness-based SEO service page mentions improving online visibility for fitness brands.
     - **Keywords**: These are keywords extracted from the page's meta tags. They provide an idea of the focus topics or themes of the page. In this example, the value is shown as `N/A`, which means keywords were not found or extracted.
     - **Cleaned Text**: This column contains the main text content extracted from the webpage, which has been cleaned. Cleaning involves converting text to lowercase, removing punctuation, and getting rid of common stopwords to make the text more relevant for analysis.
       - Example: For the fitness-based SEO service page, it shows a snippet of the cleaned text that mentions “fitness based seo service quick enquiry seo service...”.
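
The cleaning steps also explain oddities in the preview such as `businesss`: punctuation is deleted rather than replaced with a space, so an apostrophe joins the letters on either side of it. The sketch below reproduces this with the same regular expression as `clean_text`. To stay self-contained it uses a tiny hard-coded stopword set instead of NLTK's list, and the input sentence is made up.

```python
import re

# Assumption: a small stand-in for NLTK's English stopword list
STOP_WORDS = {"your", "the", "is", "and", "a", "for", "in"}

def clean_text_demo(text):
    """Same steps as clean_text, with a hard-coded stopword set."""
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # drop digits and punctuation
    text = text.lower()
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

cleaned = clean_text_demo("Boost your photography business's online visibility in 2024!")
print(cleaned)  # boost photography businesss online visibility
```

Replacing removed characters with a space instead of an empty string would keep such words apart, at the cost of leaving stray fragments like a lone `s` for the stopword filter to catch.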

### What This Data Means:

- **Purpose**: This data is a preliminary step in preparing information for SEO analysis. It helps us understand what content is present on each webpage and allows us to analyze it further for optimization.
- **Use Case**: By having the URL, meta description, keywords, and cleaned text, you can evaluate how well-optimized each page is for search engines. For instance:
  - If a page has a weak or missing meta description, you can recommend creating a more engaging and keyword-rich description.
  - If no relevant keywords are found, it may indicate that the page needs better SEO optimization.
  - The cleaned text can help identify the main topics or keywords being emphasized on the page.

### Why This Matters for SEO:

- **Meta Descriptions and Keywords**: These elements play a critical role in search engine optimization. The meta description can affect click-through rates, and the right keywords help improve a page’s visibility in search results.
- **Content Analysis**: By analyzing the main content (cleaned text), you can assess whether the page is focused on the right topics, if it’s optimized for specific keywords, or if it needs improvements like better structure, additional content, or different formatting.

### Next Steps Based on This Output:

1. **Analyze Meta Descriptions and Keywords**: Identify pages that have missing or weak descriptions and provide recommendations for improvements.
2. **Content Quality Check**: Assess the cleaned text to determine if the content is relevant, engaging, and keyword-rich.
3. **Optimization Recommendations**: Suggest ways to improve the on-page SEO elements based on the analysis, such as adding or refining meta descriptions, optimizing content for target keywords, and enhancing user engagement.
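
The first of these steps could be sketched as follows. This is an illustration under assumptions: the rows are invented, the five-word threshold `MIN_WORDS` is arbitrary, and the placeholder strings are the ones written by the scraper in Part 1.

```python
import pandas as pd

# Hypothetical rows shaped like the Part 1 output
scraped = pd.DataFrame({
    "URL": ["/a", "/b", "/c"],
    "Meta Description": [
        "transform fitness brand expert seo services",
        "No description",
        "seo",
    ],
    "Keywords": ["N/A", "N/A", "seo tools"],
})

MIN_WORDS = 5  # assumption: descriptions shorter than this count as weak
PLACEHOLDERS = ["No description", "Error fetching data"]

# Count the words in each cleaned description
word_counts = scraped["Meta Description"].str.split().str.len()

# A description is weak if it is a placeholder or too short
scraped["weak_description"] = (
    scraped["Meta Description"].isin(PLACEHOLDERS) | (word_counts < MIN_WORDS)
)
scraped["missing_keywords"] = scraped["Keywords"].eq("N/A")

print(scraped[["URL", "weak_description", "missing_keywords"]])
```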

---



#### **Part 2: Data Merging and Standardization**
- **Purpose**: This part merges the scraped content data with other provided datasets (Landing Pages and Pageviews Engagement). The goal is to combine information from different sources to get a comprehensive view of each web page's performance.
- **What It Does**:
  - **Load Data**: Loads the scraped content data (from Part 1) and two other datasets: Landing Pages and Pageviews Engagement data.
  - **Standardize URLs**: Ensures that URLs in all datasets have a consistent format by converting to lowercase, removing trailing slashes, and trimming any whitespace. This makes merging possible.
  - **Merge Data**: Merges the three datasets using the `URL` as the key to bring together content, engagement, and pageview metrics for each page.
  - **Handle Missing Data**: Fills in missing values (e.g., setting missing numerical values to 0) to ensure data consistency.
  - **Save the Merged Data**: Saves the merged data into a CSV file named `final_merged_data.csv` for further processing.
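
Because the merge keeps only URLs present in every dataset, it can be worth checking which pages drop out before relying on the result. The sketch below shows one way to do that with the `indicator` option of `pd.merge`. The two small DataFrames are invented for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped content and an engagement export
content = pd.DataFrame({"URL": ["/gaming-seo", "/lda", "/cora"]})
engagement = pd.DataFrame({"URL": ["/gaming-seo", "/lda", "/blog"], "Views": [80, 15, 7]})

# An outer merge with indicator=True adds a '_merge' column that records
# whether each URL came from the left table, the right table, or both
check = pd.merge(content, engagement, on="URL", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["URL", "_merge"]])

# The inner merge keeps only the URLs marked 'both'
matched = pd.merge(content, engagement, on="URL", how="inner")
print(len(matched), "pages survive the inner merge")
```

URLs flagged as `left_only` or `right_only` often point to a formatting mismatch that the standardization step did not cover, or to pages that have no recorded traffic.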


In [None]:
# Importing necessary libraries for data handling and merging
import pandas as pd  # This library is used for data manipulation and analysis, especially working with datasets in table formats (DataFrames)

# File paths for the provided datasets
# Paths to the CSV files containing different datasets. These should match the location of your data files.
landing_page_path = '/content/drive/MyDrive/Federated Learning Datasets/Webtool_landing_pages.csv'
pageviews_engagement_path = '/content/drive/MyDrive/Federated Learning Datasets/Webtool_pageviews_user_engagement.csv'
scraped_content_path = 'final_scraped_webtool_content.csv'  # This is the output file from the previous web scraping code

# Step 1: Load datasets into DataFrames
# Reading data from CSV files into pandas DataFrames for easy data manipulation
scraped_content_df = pd.read_csv(scraped_content_path)  # This contains data scraped from various URLs
landing_page_df = pd.read_csv(landing_page_path)  # This contains data related to landing pages
pageviews_engagement_df = pd.read_csv(pageviews_engagement_path)  # This contains data on user engagement and pageviews

# Display previews of the datasets to understand their structure and contents
print("Final Scraped Webtool Content DataFrame (Preview):")
print(scraped_content_df.head())  # Displays the first 5 rows of the scraped content data
print("\nLanding Page DataFrame (Preview):")
print(landing_page_df.head())  # Displays the first 5 rows of the landing page data
print("\nPageviews Engagement DataFrame (Preview):")
print(pageviews_engagement_df.head())  # Displays the first 5 rows of the pageviews engagement data

# Step 2: Standardize URLs for consistency across datasets
def standardize_url(url):
    """
    Standardizes URLs by converting them to lowercase, trimming whitespace, and removing trailing slashes.

    URLs may differ in terms of cases, extra spaces, or trailing slashes. This function ensures that all URLs
    have a consistent format, making it easier to match URLs across datasets during merging.

    Args:
    - url (str): The URL to be standardized.

    Returns:
    - str: The standardized URL.
    """
    if pd.isna(url):  # Handle missing (NaN) values gracefully to avoid errors
        return url  # Return as-is if the value is NaN
    # Convert the URL to lowercase, remove leading/trailing whitespace, and remove any trailing slashes
    url = url.lower().strip().rstrip('/')
    return url

# Apply the URL standardization function to the URLs in all datasets
scraped_content_df['URL'] = scraped_content_df['URL'].str.replace('https://webtool.co', '', regex=False).apply(standardize_url)
# Renaming the column 'Page path and screen class' to 'URL' for consistent merging keys
landing_page_df.rename(columns={'Page path and screen class': 'URL'}, inplace=True)
landing_page_df['URL'] = landing_page_df['URL'].apply(standardize_url)
# Similarly, renaming and standardizing URLs in the pageviews engagement dataset
pageviews_engagement_df.rename(columns={'Page path and screen class': 'URL'}, inplace=True)
pageviews_engagement_df['URL'] = pageviews_engagement_df['URL'].apply(standardize_url)

# Step 3: Merging the datasets
# The goal is to combine information from different sources based on matching URLs
# Merge the scraped content DataFrame with the landing page DataFrame using 'URL' as the common key
merged_data = pd.merge(scraped_content_df, landing_page_df, on='URL', how='inner')  # 'inner' ensures only matching URLs are kept

# Merge the result with the pageviews engagement DataFrame on 'URL'
final_merged_data = pd.merge(merged_data, pageviews_engagement_df, on='URL', how='inner')

# Display the merged data to verify that the merging process worked correctly
print("\nFinal Merged DataFrame (Preview) with all three datasets:")
print(final_merged_data.head())  # Display the first 5 rows of the final merged DataFrame

# Step 4: Handle Missing Data (if any)
# Fill missing values with default values for consistency and to avoid issues in later analysis
# This ensures there are no empty or missing values for critical fields
final_merged_data.fillna({
    'Engaged sessions': 0,  # Replace missing values with 0 for numerical fields
    'Engagement rate': 0.0,
    'Bounce rate': 0.0,
    'Average session duration': 0.0,
    'Engaged sessions per active user': 0.0,
    'Sessions': 0,
    'Views': 0,
    'Active users': 0,
    'Views per active user': 0.0,
    'Average engagement time per active user': 0.0,
    'Event count': 0,
    'Key events': 0,
    'Total revenue': 0.0
}, inplace=True)

# Step 5: Save the final merged data to a CSV file for further use
final_merged_data.to_csv('final_merged_data.csv', index=False)  # Save the cleaned and merged data as a new CSV file
print("\nFinal merged data successfully saved as 'final_merged_data.csv'.")

# Displaying a confirmation message
print("The final merged dataset has been saved and is ready for further analysis.")


Final Scraped Webtool Content DataFrame (Preview):
                                              URL  \
0   https://webtool.co/fitness-based-seo-service/   
1  https://webtool.co/attorney-based-seo-service/   
2         https://webtool.co/medical-seo-service/   
3     https://webtool.co/photography-seo-service/   
4         https://webtool.co/banking-seo-service/   

                                    Meta Description Keywords  \
0  transform fitness brand expert seo services ba...      NaN   
1  looking way boost law firms online visibility ...      NaN   
2  medical seo services help improve healthcare o...      NaN   
3  looking boost photography businesss online vis...      NaN   
4  improve banks online visibility reach professi...      NaN   

                                        Cleaned Text  
0  fitness based seo service quick enquiry seo se...  
1  attorney based seo service quick enquiry seo s...  
2  medical seo service quick enquiry medical base...  
3  photography incl

---

### Explanation of the Output:

#### 1. **Final Scraped Webtool Content DataFrame (Preview)**
   - **What it represents**: This is the data extracted from the URLs of web pages. It shows information collected through web scraping in the first part of your code.
   - **Columns in this DataFrame**:
     - **URL**: The web address of the page that was scraped.
       - Example: `https://webtool.co/fitness-based-seo-service/` is a URL for a fitness SEO service.
     - **Meta Description**: This is a brief description of the content found on the page. It comes from the meta tag in the webpage's HTML. It's important for SEO because it gives search engines and users a snapshot of what the page is about.
       - Example: The description for the fitness-based SEO service page talks about transforming fitness brands with expert SEO services.
     - **Keywords**: Keywords are extracted from the page’s meta tag and can indicate what the page is optimized for. Here, it shows as `NaN` (pandas' marker for a missing value) because Part 1 wrote the placeholder `N/A` for pages without a keywords meta tag, and `pd.read_csv` interprets the string `N/A` as missing by default.
     - **Cleaned Text**: This column contains the main text content from the page, processed to remove unnecessary characters like punctuation and stopwords (common words like "the" or "and" that don’t add much meaning).
       - Example: "fitness based seo service quick enquiry seo service" is part of the cleaned text.
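
The change from `N/A` in Part 1 to `NaN` here comes from how pandas reads CSV files: by default, `read_csv` treats strings such as `N/A` as missing values. The short sketch below shows the default behavior and how `keep_default_na=False` preserves the literal text. The one-row CSV is made up for the example.

```python
import io
import pandas as pd

csv_text = "URL,Keywords\n/lda,N/A\n"

# Default: the string 'N/A' is parsed as a missing value (NaN)
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False turns off the built-in list of missing-value strings
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["Keywords"].isna().tolist())
print(literal_read["Keywords"].tolist())
```

Either representation is workable; what matters is that later steps test for the one actually present (`isna()` for `NaN`, a string comparison for `N/A`).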

#### 2. **Landing Page DataFrame (Preview)**
   - **What it represents**: This DataFrame contains data about user engagement metrics for different web pages.
   - **Columns in this DataFrame**:
     - **Page path and screen class**: This column shows the page path or URL suffix for different pages.
       - Example: `/adult-seo-service/` is the path for a page about adult SEO services.
     - **Engaged sessions**: The number of sessions where users were actively engaged with the page.
       - Example: 39 sessions were engaged for the `/adult-seo-service/` page.
     - **Engagement rate**: The proportion of sessions where users were actively engaged.
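     - **Bounce rate**: The proportion of sessions that were not engaged. If this export comes from Google Analytics 4, as the column names suggest, bounce rate is the complement of the engagement rate (1 - engagement rate) rather than the share of single-page visits.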
     - **Average session duration**: The average time spent by users on the page, measured in seconds.
     - **Engaged sessions per active user**, **Sessions**, etc.: These columns provide further metrics about how users interact with the page.
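
The two rate columns can be related with a little arithmetic. The sketch below assumes Google Analytics 4 style definitions, where engagement rate is engaged sessions divided by sessions and bounce rate is its complement. The session total of 60 is a hypothetical figure chosen for the example.

```python
def engagement_rate(engaged_sessions, sessions):
    """Share of sessions that were engaged; 0.0 when there were no sessions."""
    return engaged_sessions / sessions if sessions else 0.0

def bounce_rate(engaged_sessions, sessions):
    """Assumed GA4-style bounce rate: the share of sessions that were not engaged."""
    return 1.0 - engagement_rate(engaged_sessions, sessions) if sessions else 0.0

# 39 engaged sessions out of a hypothetical 60 sessions in total
print(engagement_rate(39, 60))
print(bounce_rate(39, 60))
```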

#### 3. **Pageviews Engagement DataFrame (Preview)**
   - **What it represents**: This DataFrame provides data on page views and user engagement, indicating how often pages are viewed and by how many active users.
   - **Columns in this DataFrame**:
     - **Page path and screen class**: Similar to the previous DataFrame, it shows the page path.
     - **Views**: Number of times the page was viewed.
     - **Active users**: Number of unique users who interacted with the page.
     - **Views per active user**: Average number of views per active user.
     - **Average engagement time per active user**: The average time users spent engaging with the page.
     - **Event count**, **Key events**, **Total revenue**: These columns provide more detailed metrics about user activity and interactions.

#### 4. **Final Merged DataFrame (Preview) with all three datasets**
   - **What it represents**: This is the merged data that combines information from all three previous datasets. It allows for a comprehensive analysis by bringing together scraped data, user engagement metrics, and page views.
   - **Columns in this DataFrame**:
     - **URL**: The page’s URL after standardization.
     - **Meta Description, Keywords, Cleaned Text**: These columns come from the scraped content.
     - **Engaged sessions, Engagement rate, Bounce rate**, etc.: These columns come from the engagement and pageviews datasets.
   - **Why this is useful**: By merging data, you can analyze and compare engagement metrics, views, and content optimization on the same page. For example:
     - A page with high views but low engagement may need content improvement to retain visitors.
     - A high bounce rate may indicate that a page lacks relevant content or offers a poor user experience.
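Because the notebook's merge step uses inner joins on the standardized `URL`, any page whose path does not match exactly is dropped without warning. A minimal sketch of how to check for that follows. The two tiny tables are made up, `standardize` mirrors the notebook's `standardize_url`, and the `indicator=True` option of `pd.merge` labels each row by where it came from.

```python
import pandas as pd

def standardize(url):
    """Lowercase, trim whitespace, and drop trailing slashes."""
    return url.lower().strip().rstrip("/")

# Hypothetical miniature versions of the scraped and engagement tables
scraped = pd.DataFrame({"URL": ["https://webtool.co/Gaming-SEO/", "https://webtool.co/lda/"]})
engagement = pd.DataFrame({"URL": ["/gaming-seo", "/semantic/"], "Views": [120, 45]})

scraped["URL"] = (
    scraped["URL"].str.replace("https://webtool.co", "", regex=False).map(standardize)
)
engagement["URL"] = engagement["URL"].map(standardize)

# An outer merge with indicator=True shows which rows an inner merge would drop
check = pd.merge(scraped, engagement, on="URL", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["URL", "_merge"]])
```

Rows marked `left_only` were scraped but have no engagement data, and rows marked `right_only` have engagement data but were not scraped.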

### What This Data Means for SEO Improvement:
- **Data-driven SEO Recommendations**: This merged data allows you to generate recommendations to improve SEO. For example, you can identify which pages need better meta descriptions, more engaging content, or optimization for specific keywords.
- **User Engagement Insights**: You can see which pages perform well and which need improvement based on user engagement metrics.
- **Personalized Content Recommendations**: With this data, you can provide specific recommendations tailored to each URL, helping to improve user engagement and search engine ranking.
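One way to turn these ideas into a filter is sketched below. The table is made up, the rates are assumed to be stored as fractions between 0 and 1, and the cut-offs (`100` views, `0.5` engagement rate) are arbitrary placeholders rather than values from this project.

```python
import pandas as pd

# Hypothetical slice of the merged data
pages = pd.DataFrame({
    "URL": ["/page-a", "/page-b", "/page-c"],
    "Views": [900, 40, 300],
    "Engagement rate": [0.35, 0.80, 0.65],
    "Bounce rate": [0.65, 0.20, 0.35],
})

# High traffic but weak engagement: candidates for content improvement
needs_content_work = pages[(pages["Views"] >= 100) & (pages["Engagement rate"] < 0.5)]

# Strong engagement but little traffic: candidates for promotion
needs_promotion = pages[(pages["Views"] < 100) & (pages["Engagement rate"] >= 0.5)]

print(needs_content_work["URL"].tolist())
print(needs_promotion["URL"].tolist())
```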

---

### Next Steps Based on This Output:
1. **Analyze Pages with High Bounce Rates**: Consider why users leave without further interaction. Improve the content, make it more engaging, or ensure it matches user intent.
2. **Optimize Meta Descriptions and Keywords**: Pages with missing or poor meta descriptions and keywords need attention.
3. **Enhance Content for Low Engagement Pages**: Use the cleaned text data to see if the content aligns with user needs and SEO goals.

---


In [None]:
import os
print("Current Working Directory:", os.getcwd())
print("Files in Directory:", os.listdir())  # List files in the current directory


Current Working Directory: /content
Files in Directory: ['.config', 'cleaned_merged_webtool_data.csv', 'drive', 'final_scraped_webtool_content.csv', 'final_merged_data.csv', 'sample_data']


#### **Part 3: Generating Content Recommendations**
- **Purpose**: This part generates content recommendations based on the merged data. It analyzes metrics like views, engagement rate, and bounce rate to provide actionable insights for improving each page's SEO performance.
- **What It Does**:
  - **Load Merged Data**: Loads the data merged in Part 2.
  - **Generate Recommendations**: For each page, it checks various engagement metrics and generates recommendations based on specific conditions. For example, if a page has a high bounce rate and low engagement, it suggests revising the content. If a page has excellent engagement, it recommends expanding the content.
  - **Save Recommendations**: The generated recommendations, along with key metrics (e.g., views, engagement rate, bounce rate), are saved into a CSV file named `content_recommendations.csv`.
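The rules are evaluated in order and the first match wins, so the scale of the rate columns matters. The sketch below is a condensed stand-in for the rule chain, not the notebook's function: `pick_rule` and its labels are hypothetical names, and the metric values are illustrative. It shows how the same page lands in a different branch depending on whether rates are passed as fractions (0.83) or percentages (83).

```python
def pick_rule(views, engagement_rate, bounce_rate, avg_session_duration, active_users):
    """Return a short label for the first rule that matches (thresholds in percent)."""
    if engagement_rate > 70 and views > 500:
        return "expand"
    if engagement_rate > 50 and bounce_rate < 30:
        return "promote"
    if avg_session_duration > 180:
        return "deepen"
    if views < 50 and active_users < 20:
        return "revisit_seo"
    return "review"

# Illustrative page: few views, but most sessions are engaged
page = {"views": 16, "engagement_rate": 0.83, "bounce_rate": 0.16,
        "avg_session_duration": 25.92, "active_users": 8}

as_fraction = pick_rule(**page)

# Same page with the two rates rescaled from fractions to percentages
as_percent = pick_rule(**{**page,
                          "engagement_rate": page["engagement_rate"] * 100,
                          "bounce_rate": page["bounce_rate"] * 100})

print(as_fraction, as_percent)
```

If the source data stores rates as fractions, one option is to rescale them before applying the rules; another is to express the thresholds as 0.70, 0.50, and 0.30.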

---


In [None]:
# Importing necessary libraries for data handling
import pandas as pd  # Pandas is used for working with datasets in a tabular format (similar to spreadsheets)

# Load the merged data from the previous steps
final_merged_data_path = 'final_merged_data.csv'  # This file should contain the merged data from previous code parts
final_merged_data = pd.read_csv(final_merged_data_path)  # Loading the CSV file into a DataFrame for processing

# Display the first few rows of the data to understand its structure
print("Final Merged DataFrame (Preview):")
print(final_merged_data.head())  # Display the first 5 rows of the DataFrame for reference

# Step 1: Function to generate detailed content recommendations
def generate_content_recommendations(data):
    """
    Generates detailed content recommendations based on engagement metrics like views,
    engagement rate, bounce rate, average session duration, etc., to provide actionable insights.

    This function iterates through each URL in the data and generates a recommendation based on
    the engagement metrics of that page.

    Args:
    - data (DataFrame): A DataFrame containing URLs and relevant engagement metrics.

    Returns:
    - recommendations_df (DataFrame): A new DataFrame with columns for URL, recommendations, and key metrics.
    """
    recommendations = []  # Creating an empty list to store generated recommendations

    # Iterate over each row in the data (each row represents data for one URL)
    for _, row in data.iterrows():
        # Extracting data for the current row (URL)
        url = row.get('URL', 'N/A')  # Get the URL or set 'N/A' if it's missing
        views = row.get('Views', 0)  # Get the number of views or set to 0 if missing
        engagement_rate = row.get('Engagement rate', 0)  # Get engagement rate or 0 if missing
        bounce_rate = row.get('Bounce rate', 100)  # Set a high default bounce rate (100%) if missing
        avg_session_duration = row.get('Average session duration', 0)  # Get average session duration or 0 if missing
        active_users = row.get('Active users', 0)  # Get the number of active users or set to 0 if missing

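        # Generating recommendations based on specific conditions
        # The conditions are checked in order, and the first one that matches decides the recommendation
        # NOTE: the rate thresholds below (70, 50, 30) assume percentages. If the rate columns
        # hold fractions between 0 and 1, the first two branches never match.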
        if engagement_rate > 70 and views > 500:
            recommendation = f"The page '{url}' shows excellent engagement. Consider expanding its content, adding interactive elements (like videos or infographics), and optimizing high-performing keywords."
        elif engagement_rate > 50 and bounce_rate < 30:
            recommendation = f"The page '{url}' has moderate engagement with a low bounce rate. Boost its visibility through social media promotion, backlinks, and other marketing efforts."
        elif avg_session_duration > 180:
            recommendation = f"The page '{url}' has a high average session duration. Consider breaking down content into subtopics, adding in-depth content, or incorporating related links to retain users."
        elif views < 50 and active_users < 20:
            recommendation = f"The page '{url}' has low views and user engagement. Revisit its SEO strategies, enhance meta tags, and consider re-optimizing the content with targeted keywords to increase visibility. Leverage social media promotion and backlinks."
        else:
            recommendation = f"Review and update the content for '{url}' to improve engagement by focusing on relevance, structure, and user experience."

        # Append the URL, generated recommendation, and key metrics to the list
        recommendations.append({
            'URL': url,
            'Recommendation': recommendation,
            'Views': views,
            'Engagement Rate (%)': engagement_rate,
            'Bounce Rate (%)': bounce_rate,
            'Average Session Duration (seconds)': avg_session_duration,
            'Active Users': active_users
        })

    # Convert the list of recommendations into a DataFrame for easier handling and storage
    recommendations_df = pd.DataFrame(recommendations)
    return recommendations_df

# Step 2: Generate content recommendations using the function
recommendations_df = generate_content_recommendations(final_merged_data)  # Call the function and pass the merged data

# Displaying the generated recommendations to verify correctness
print("\nGenerated Content Recommendations (Preview):")
print(recommendations_df.head(15))  # Display the first 15 recommendations for review

# Step 3: Save the recommendations DataFrame to a CSV file for further use
recommendations_df.to_csv('content_recommendations.csv', index=False)  # Save the recommendations as a CSV file

# End of the code
# This code generates content recommendations based on engagement metrics such as views, engagement rate, and bounce rate.
# It offers actionable insights for each URL, helping optimize website content and user engagement.


Final Merged DataFrame (Preview):
                           URL  \
0   /fitness-based-seo-service   
1  /attorney-based-seo-service   
2         /medical-seo-service   
3     /photography-seo-service   
4         /banking-seo-service   

                                    Meta Description Keywords  \
0  transform fitness brand expert seo services ba...      NaN   
1  looking way boost law firms online visibility ...      NaN   
2  medical seo services help improve healthcare o...      NaN   
3  looking boost photography businesss online vis...      NaN   
4  improve banks online visibility reach professi...      NaN   

                                        Cleaned Text  Engaged sessions  \
0  fitness based seo service quick enquiry seo se...                 5   
1  attorney based seo service quick enquiry seo s...                 0   
2  medical seo service quick enquiry medical base...                 1   
3  photography including techniques levitra gener...                 0   


In [None]:
# Importing necessary libraries for web scraping, data processing, and text cleaning
import requests  # To send HTTP requests to URLs
from bs4 import BeautifulSoup  # For parsing HTML content from webpages
import pandas as pd  # For handling and storing data in tabular format
import re  # For text cleaning using regular expressions
from nltk.corpus import stopwords  # To remove common English stopwords
import nltk  # Natural Language Toolkit for text processing tasks

# Downloading stopwords (necessary for text cleaning) if not already available
nltk.download('stopwords')

# Step 1: Defining a function to clean text content
def clean_text(text):
    """
    Cleans input text by removing non-alphabetic characters, converting to lowercase,
    and removing common English stopwords.

    Args:
    - text (str): Input text to be cleaned.

    Returns:
    - str: Cleaned and processed text.
    """
    # Remove digits and punctuation, keeping only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert text to lowercase for uniformity
    text = text.lower()
    # Remove common English stopwords (words like "and", "the", "is", etc.)
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Step 2: Defining a list of URLs to scrape content from
urls = [
    'https://webtool.co/fitness-based-seo-service/',
    'https://webtool.co/attorney-based-seo-service/',
    'https://webtool.co/medical-seo-service/',
    'https://webtool.co/photography-seo-service/',
    'https://webtool.co/banking-seo-service/',
    'https://webtool.co/fashion-based-seo-service/',
    'https://webtool.co/real-estate-seo-service/',
    'https://webtool.co/adult-seo-service/',
    'https://webtool.co/cbd-seo-service/',
    'https://webtool.co/crypto-seo-service/',
    'https://webtool.co/ecommerce-seo-service/',
    'https://webtool.co/education-based-seo/',
    'https://webtool.co/gaming-seo/',
    'https://webtool.co/igaming-seo-service/',
    'https://webtool.co/cosmetics-seo-service/',
    'https://webtool.co/glass-wall-seo/',
    'https://webtool.co/cora/',
    'https://webtool.co/cosine-similarity/',
    'https://webtool.co/bagofwords/',
    'https://webtool.co/lda/',
    'https://webtool.co/tf-idf-checker/',
    'https://webtool.co/cooccurence/',
    'https://webtool.co/keydensity/',
    'https://webtool.co/proximity/',
    'https://webtool.co/semantic/',
    'https://webtool.co/n-gram/',
    'https://webtool.co/sentiment/',
    'https://webtool.co/advanced-seo-service/'
]

# Step 3: Creating an empty list to store scraped data
scraped_data = []

# Step 4: Function to scrape content from each URL
def scrape_content(url):
    """
    Scrapes meta descriptions, keywords (if available), and main text content from the given URL.

    Args:
    - url (str): The URL to scrape content from.

    Returns:
    - dict: A dictionary containing the URL, extracted meta description, optional keywords, and cleaned main text.
    """
    try:
        # Sending an HTTP GET request to fetch the page content
        response = requests.get(url, timeout=10)
        # Checking if the response status code indicates a successful request (200 OK)
        if response.status_code != 200:
            return {'URL': url, 'Meta Description': 'Error fetching data', 'Keywords': 'N/A', 'Cleaned Text': 'Error fetching data'}

        # Parsing the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extracting the meta description from the page (if available)
        meta_description = soup.find('meta', attrs={'name': 'description'})
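        # Use .get so a description tag without a 'content' attribute falls back to the default
        meta_content = meta_description.get('content', 'No description') if meta_description else 'No description'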

        # Extracting keywords from the meta tag (if available)
        meta_keywords = soup.find('meta', attrs={'name': 'keywords'})
        if meta_keywords and 'content' in meta_keywords.attrs:
            keywords_content = meta_keywords['content'].strip()  # Extracting keywords content
        else:
            keywords_content = None  # No keywords available

        # Extracting main text content from paragraph and heading elements (if available)
        paragraphs = soup.find_all('p')
        headings = soup.find_all(['h1', 'h2'])
        # Combining text from paragraphs and headings into one string
        main_text = ' '.join([p.get_text() for p in paragraphs + headings])

        # Cleaning the extracted main text content using the clean_text function
        cleaned_text = clean_text(main_text)

        # Returning the extracted and cleaned data in a dictionary format
        return {
            'URL': url,
            'Meta Description': clean_text(meta_content),  # Cleaned meta description
            'Keywords': clean_text(keywords_content) if keywords_content else 'N/A',  # Cleaned keywords or 'N/A' if not present
            'Cleaned Text': cleaned_text  # Cleaned main text content
        }
    except Exception as e:
        # Handling errors gracefully by returning an error message for this URL
        return {
            'URL': url,
            'Meta Description': 'Error fetching data',
            'Keywords': 'N/A',
            'Cleaned Text': 'Error fetching data'
        }

# Step 5: Looping through URLs to scrape content and collect results
for url in urls:
    # Calling the scrape_content function for each URL and storing the result
    scraped_data.append(scrape_content(url))

# Step 6: Converting the collected data into a DataFrame for further processing and analysis
scraped_df = pd.DataFrame(scraped_data)

# Displaying the scraped data to verify the extraction
print("Final Scraped Data (Preview):")
print(scraped_df.head())

# Step 7: Saving the scraped data to a CSV file for further use in merging and analysis
scraped_df.to_csv('final_scraped_webtool_content.csv', index=False)

# End of the first part
# This part focuses on web scraping and preparing content data, which will later be merged with other datasets in subsequent parts.

# Importing necessary libraries for data handling and merging
import pandas as pd  # For data manipulation

# File paths for the provided datasets
landing_page_path = '/content/drive/MyDrive/Federated Learning Datasets/Webtool_landing_pages.csv'
pageviews_engagement_path = '/content/drive/MyDrive/Federated Learning Datasets/Webtool_pageviews_user_engagement.csv'
scraped_content_path = 'final_scraped_webtool_content.csv'  # This should have been generated in the first part

# Step 1: Load datasets into DataFrames
scraped_content_df = pd.read_csv(scraped_content_path)
landing_page_df = pd.read_csv(landing_page_path)
pageviews_engagement_df = pd.read_csv(pageviews_engagement_path)

# Display previews of the datasets for reference
print("Final Scraped Webtool Content DataFrame (Preview):")
print(scraped_content_df.head())
print("\nLanding Page DataFrame (Preview):")
print(landing_page_df.head())
print("\nPageviews Engagement DataFrame (Preview):")
print(pageviews_engagement_df.head())

# Step 2: Standardize URLs for consistency across datasets
def standardize_url(url):
    """
    Standardizes URLs by converting them to lowercase, trimming whitespace, and removing trailing slashes.

    Args:
    - url (str): The URL to be standardized.

    Returns:
    - str: Standardized URL.
    """
    if pd.isna(url):  # Handle missing (NaN) values gracefully
        return url
    url = url.lower().strip().rstrip('/')  # Convert to lowercase, trim whitespace, remove trailing slashes
    return url

# Apply URL standardization to the datasets
scraped_content_df['URL'] = scraped_content_df['URL'].str.replace('https://webtool.co', '', regex=False).apply(standardize_url)
landing_page_df.rename(columns={'Page path and screen class': 'URL'}, inplace=True)
landing_page_df['URL'] = landing_page_df['URL'].apply(standardize_url)
pageviews_engagement_df.rename(columns={'Page path and screen class': 'URL'}, inplace=True)
pageviews_engagement_df['URL'] = pageviews_engagement_df['URL'].apply(standardize_url)

# Step 3: Merging the datasets
# Merge Final Scraped Content with the Landing Pages dataset
merged_data = pd.merge(scraped_content_df, landing_page_df, on='URL', how='inner')

# Merge the result with the Pageviews Engagement dataset
final_merged_data = pd.merge(merged_data, pageviews_engagement_df, on='URL', how='inner')

# Display the merged data to verify the merges
print("\nFinal Merged DataFrame (Preview) with all three datasets:")
print(final_merged_data.head())

# Step 4: Handle Missing Data (if any)
# Fill missing values to maintain data consistency
final_merged_data.fillna({
    'Engaged sessions': 0,
    'Engagement rate': 0.0,
    'Bounce rate': 0.0,
    'Average session duration': 0.0,
    'Engaged sessions per active user': 0.0,
    'Sessions': 0,
    'Views': 0,
    'Active users': 0,
    'Views per active user': 0.0,
    'Average engagement time per active user': 0.0,
    'Event count': 0,
    'Key events': 0,
    'Total revenue': 0.0
}, inplace=True)

# Step 5: Save the final merged data to a CSV file for further use
final_merged_data.to_csv('final_merged_data.csv', index=False)
print("\nFinal merged data successfully saved as 'final_merged_data.csv'.")

# Displaying a confirmation message
print("The final merged dataset has been saved and is ready for further analysis.")


# Importing necessary libraries for data handling
import pandas as pd

# Load the merged data from previous steps
final_merged_data_path = 'final_merged_data.csv'  # Ensure this file is available
final_merged_data = pd.read_csv(final_merged_data_path)

# Display the structure of the data for reference
print("Final Merged DataFrame (Preview):")
print(final_merged_data.head())

# Step 1: Function to generate detailed content recommendations
def generate_content_recommendations(data):
    """
    Generates detailed content recommendations based on engagement metrics like views,
    engagement rate, bounce rate, average session duration, etc., to provide actionable insights.

    Args:
    - data (DataFrame): DataFrame containing URLs and relevant metrics.

    Returns:
    - recommendations_df (DataFrame): DataFrame with URL, recommendations, and key metrics.
    """
    recommendations = []  # List to store recommendation data

    # Iterate over each row in the data
    for _, row in data.iterrows():
        url = row.get('URL', 'N/A')
        views = row.get('Views', 0)
        engagement_rate = row.get('Engagement rate', 0)
        bounce_rate = row.get('Bounce rate', 100)  # High default bounce rate if missing
        avg_session_duration = row.get('Average session duration', 0)
        active_users = row.get('Active users', 0)

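        # Generating recommendations based on conditions (checked in order; the first match wins)
        # NOTE: the rate thresholds below (70, 50, 30) assume percentages. If the rate columns
        # hold fractions between 0 and 1, the first two branches never match.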
        if engagement_rate > 70 and views > 500:
            recommendation = f"The page '{url}' shows excellent engagement. Consider expanding its content, adding interactive elements (like videos or infographics), and optimizing high-performing keywords."
        elif engagement_rate > 50 and bounce_rate < 30:
            recommendation = f"The page '{url}' has moderate engagement with a low bounce rate. Boost its visibility through social media promotion, backlinks, and other marketing efforts."
        elif avg_session_duration > 180:
            recommendation = f"The page '{url}' has a high average session duration. Consider breaking down content into subtopics, adding in-depth content, or incorporating related links to retain users."
        elif views < 50 and active_users < 20:
            recommendation = f"The page '{url}' has low views and user engagement. Revisit its SEO strategies, enhance meta tags, and consider re-optimizing the content with targeted keywords to increase visibility. Leverage social media promotion and backlinks."
        else:
            recommendation = f"Review and update the content for '{url}' to improve engagement by focusing on relevance, structure, and user experience."

        # Append the URL, recommendation, and key metrics to the list
        recommendations.append({
            'URL': url,
            'Recommendation': recommendation,
            'Views': views,
            'Engagement Rate (%)': engagement_rate,
            'Bounce Rate (%)': bounce_rate,
            'Average Session Duration (seconds)': avg_session_duration,
            'Active Users': active_users
        })

    # Convert the recommendations list to a DataFrame
    recommendations_df = pd.DataFrame(recommendations)
    return recommendations_df

# Step 2: Generate recommendations based on the final merged data
recommendations_df = generate_content_recommendations(final_merged_data)

# Display the generated recommendations for verification
print("\nGenerated Content Recommendations (Preview):")
print(recommendations_df.head(15))

# Step 3: Save the recommendations dataset to a CSV file for client use
recommendations_df.to_csv('content_recommendations.csv', index=False)

# End of the third part of the code
# This part generates a dataset containing content recommendations and key metrics like views, engagement rate, bounce rate, etc., to offer detailed insights for each URL.



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Final Scraped Data (Preview):
                                              URL  \
0   https://webtool.co/fitness-based-seo-service/   
1  https://webtool.co/attorney-based-seo-service/   
2         https://webtool.co/medical-seo-service/   
3     https://webtool.co/photography-seo-service/   
4         https://webtool.co/banking-seo-service/   

                                    Meta Description Keywords  \
0  transform fitness brand expert seo services ba...      N/A   
1  looking way boost law firms online visibility ...      N/A   
2  medical seo services help improve healthcare o...      N/A   
3  looking boost photography businesss online vis...      N/A   
4  improve banks online visibility reach professi...      N/A   

                                        Cleaned Text  
0  fitness based seo service quick enquiry seo se...  
1  attorney based seo service quick enquiry seo s...  
2  medical seo service quick enquiry medical base...  
3  photography including techniques levi

### Explanation of the Output


 **Final Scraped Data (Preview)**:
   - This section shows a preview of the data that was extracted (or scraped) from various URLs (web pages). This data represents web content and meta-information relevant to SEO analysis.
   - **What does it contain?** The preview displays a few columns for the first few rows of the dataset. Here’s what each column represents:
     - **URL**: This is the web address of the page that was scraped.
       - Example: `https://webtool.co/fitness-based-seo-service/` is the URL for a webpage related to fitness-based SEO services. At this stage the full address is stored; it is shortened to a path only later, during merging.
     - **Meta Description**: This is a brief summary or description of the webpage, usually provided in the meta tags of the HTML source code. It helps search engines understand the page’s content.
       - Example: The cleaned description for the fitness-based SEO service page begins “transform fitness brand expert seo services...”.
     - **Keywords**: These are keywords extracted from the page's meta tags. They provide an idea of the focus topics or themes of the page. In this example, the value is shown as `N/A`, which means keywords were not found or extracted.
     - **Cleaned Text**: This column contains the main text content extracted from the webpage, which has been cleaned. Cleaning involves converting text to lowercase, removing punctuation, and getting rid of common stopwords to make the text more relevant for analysis.
       - Example: For the fitness-based SEO service page, it shows a snippet of the cleaned text that mentions “fitness based seo service quick enquiry seo service...”.
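A stripped-down sketch of this cleaning step is shown below. To stay self-contained it uses a tiny hard-coded stopword set (`SMALL_STOPWORDS`, an assumption for illustration) instead of the NLTK list the notebook downloads, so results on real text will differ.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
SMALL_STOPWORDS = {"the", "and", "is", "for", "your", "with", "a", "to", "of"}

def clean_text_sketch(text):
    """Keep letters and whitespace, lowercase, and drop stopwords."""
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # removes digits and punctuation
    words = text.lower().split()
    return " ".join(w for w in words if w not in SMALL_STOPWORDS)

sample = "Transform your fitness brand with expert SEO services!"
print(clean_text_sketch(sample))
```

Because punctuation is deleted rather than replaced with a space, a word such as `business's` becomes `businesss`, which is visible in the meta description preview above.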

### What This Data Means:

- **Purpose**: This data is a preliminary step in preparing information for SEO analysis. It helps us understand what content is present on each webpage and allows us to analyze it further for optimization.
- **Use Case**: By having the URL, meta description, keywords, and cleaned text, you can evaluate how well-optimized each page is for search engines. For instance:
  - If a page has a weak or missing meta description, you can recommend creating a more engaging and keyword-rich description.
  - If no relevant keywords are found, it may indicate that the page needs better SEO optimization.
  - The cleaned text can help identify the main topics or keywords being emphasized on the page.
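As a rough illustration of that last point, term frequencies over the cleaned text give a quick view of what a page emphasizes. The snippet string below is invented for the example, and `top_terms` is a hypothetical helper. On real data the input would be a value from the `Cleaned Text` column.

```python
from collections import Counter

def top_terms(cleaned_text, n=3):
    """Return the n most frequent words in an already-cleaned text string."""
    return Counter(cleaned_text.split()).most_common(n)

# Invented snippet in the style of the Cleaned Text column
snippet = "fitness seo service quick enquiry seo service fitness seo"
print(top_terms(snippet))
```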

### Why This Matters for SEO:

- **Meta Descriptions and Keywords**: These elements play a critical role in search engine optimization. The meta description can affect click-through rates, and the right keywords help improve a page’s visibility in search results.
- **Content Analysis**: By analyzing the main content (cleaned text), you can assess whether the page is focused on the right topics, if it’s optimized for specific keywords, or if it needs improvements like better structure, additional content, or different formatting.

### Next Steps Based on This Output:

1. **Analyze Meta Descriptions and Keywords**: Identify pages that have missing or weak descriptions and provide recommendations for improvements.
2. **Content Quality Check**: Assess the cleaned text to determine if the content is relevant, engaging, and keyword-rich.
3. **Optimization Recommendations**: Suggest ways to improve the on-page SEO elements based on the analysis, such as adding or refining meta descriptions, optimizing content for target keywords, and enhancing user engagement.
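The first step can be sketched as a simple audit. Everything here is an assumption for illustration: the sample rows, the `audit_meta` helper, and the 50-character minimum, which is an arbitrary cut-off rather than an SEO rule. The placeholder strings are the ones the scraper writes when a fetch fails or no keywords tag is found.

```python
def _text(value):
    """Return a stripped string, treating non-string values (e.g. NaN) as empty."""
    return value.strip() if isinstance(value, str) else ""

def audit_meta(rows, min_length=50):
    """Return (url, issue) pairs for pages with missing or short meta fields."""
    placeholders = {"", "N/A", "Error fetching data"}
    issues = []
    for row in rows:
        description = _text(row.get("Meta Description"))
        keywords = _text(row.get("Keywords"))
        if description in placeholders:
            issues.append((row["URL"], "missing meta description"))
        elif len(description) < min_length:
            issues.append((row["URL"], "short meta description"))
        if keywords in placeholders:
            issues.append((row["URL"], "missing keywords"))
    return issues

# Hypothetical rows shaped like the scraped dataset
rows = [
    {"URL": "/page-a", "Meta Description": "seo help", "Keywords": "N/A"},
    {"URL": "/page-b", "Meta Description": "Error fetching data", "Keywords": "N/A"},
]
for url, issue in audit_meta(rows):
    print(url, "->", issue)
```

With a DataFrame, the same helper could be fed `df.to_dict("records")`.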

---

1. **Final Merged DataFrame (Preview)**:
    - This table shows the merged data from various sources, including URLs and their associated SEO metrics.
    - Columns explained:
        - **URL**: This column lists the unique URLs or web pages that have been analyzed.
        - **Meta Description**: This column contains a brief description of the content on the page, extracted during the scraping process. A `NaN` in a scraped column (as in `Keywords` in this preview) means the value was treated as missing when the CSV was loaded.
        - **Cleaned Text**: This column shows the cleaned main text content from the page.
        - **Engaged Sessions, Engagement Rate, Bounce Rate, etc.**: These columns show various metrics related to user engagement, such as:
            - **Engaged Sessions**: The number of sessions where users were actively engaged.
            - **Engagement Rate**: The share of sessions in which users interacted meaningfully with the page, stored as a fraction (for example, 0.83 means 83%).
            - **Bounce Rate**: The share of sessions in which users left the page without engaging further, stored as a fraction in the same way.
            - **Average Session Duration**: The average time users spent on the page.
            - **Views, Active Users, Event Count, etc.**: Additional metrics that provide insights into the user behavior on each page.

    - **Use Case**: This table allows you to understand how each URL performs based on user engagement metrics, which is critical for evaluating SEO performance.

2. **Generated Content Recommendations (Preview)**:
    - This table is the output of the content recommendation system based on the analyzed data. Each row corresponds to a specific URL and provides actionable recommendations.
    - Columns explained:
        - **URL**: The web page for which the recommendation is made.
        - **Recommendation**: A detailed suggestion tailored to improve the page's engagement, visibility, or other metrics. For example:
            - If the engagement rate is low or the views are minimal, the recommendation might suggest revisiting SEO strategies, enhancing content, or improving user targeting.
            - High engagement might prompt suggestions to expand content or add interactive elements to retain users further.
        - **Views**: The number of times the page was viewed.
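        - **Engagement Rate (%)**: The share of sessions in which users engaged with the page. Despite the `(%)` label, the values are stored as fractions (0.83 means 83%).
        - **Bounce Rate (%)**: The share of sessions in which users left without engaging further, also stored as a fraction.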
        - **Average Session Duration (seconds)**: The average time spent by users on the page.
        - **Active Users**: The number of unique users who actively engaged with the page.

    - **Use Case**: This table serves as a guide for improving content performance. By examining the metrics and recommendations, you can identify pages that need content optimization, increased visibility, or restructuring to reduce bounce rates.


3. **Understanding the Columns and Their Relevance**:
   - **URL**: This identifies the web page being analyzed. Each URL is unique, representing a specific page on your site.
   - **Recommendation**: This column provides actionable insights based on the data for that URL. The recommendations are generated based on the page's metrics, such as views, engagement rate, bounce rate, and session duration. The goal of these recommendations is to improve the performance of the page by addressing specific issues or optimizing areas where it is performing well.
     - For example, if both views and active users are low, the recommendation suggests revisiting the page's SEO strategy, enhancing meta tags, and targeting better keywords.
   - **Views**: This represents how many times the page was viewed by users. A high number of views indicates good visibility, while a low number suggests the need to improve the page's reach or relevance.
   - **Engagement Rate (%)**: This shows how effectively the page engages users. A higher engagement rate indicates that users are interacting more with the content. If this value is low, it may suggest that the content is not capturing user interest.
   - **Bounce Rate (%)**: A high bounce rate indicates that many users leave the page after viewing it without engaging further. This could point to a lack of engaging content or relevance to the user's intent.
   - **Average Session Duration (seconds)**: This metric shows the average amount of time users spend on the page. A longer duration generally suggests that users find the content engaging or useful, while a shorter duration may indicate a lack of interest or relevance.
   - **Active Users**: This indicates the number of unique users who actively engaged with the page. More active users can signify that the page has a broad appeal.

4. **Example Interpretations**:
   - **Row 0 (URL: /fitness-based-seo-service)**:
     - **Recommendation**: The recommendation flags low views and low user engagement. It advises revisiting SEO strategies, enhancing meta tags, and re-optimizing content with targeted keywords to increase visibility and engagement.
     - **Views**: 16 views, indicating that the page receives little traffic.
     - **Engagement Rate (%)**: 0.83 (or 83%), a relatively high level of engagement among the users who do visit. The low engagement flagged in the recommendation therefore most likely reflects the small number of visitors rather than how those visitors behave.
     - **Bounce Rate (%)**: 0.16 (or 16%) is low, indicating that users who visit tend to stay and engage with the page.
     - **Average Session Duration**: 25.92 seconds is the average time users spent on the page.
     - **Active Users**: 8 unique users engaged with the page.

   - **Row 7 (URL: /adult-seo-service)**:
     - **Recommendation**: The recommendation highlights that the page has a high average session duration and good engagement. It suggests breaking the content down further or adding in-depth analysis to retain users longer.
     - **Views**: 172 views, showing a relatively high traffic level.
     - **Engagement Rate (%)**: 0.71 (or 71%), indicating strong engagement.
     - **Bounce Rate (%)**: 0.29 (or 29%), meaning fewer than a third of users leave without further engagement.
     - **Average Session Duration**: 244.99 seconds (about four minutes), a long stay that indicates high engagement.
     - **Active Users**: 66 unique users, a good number of active users.

5. **Using the Output**:
   - **Identify and Prioritize Improvements**: Use the recommendations to focus your efforts on improving specific pages. For example, pages with low engagement or high bounce rates should be optimized for better content relevance, faster loading times, or improved user targeting.
   - **Measure Impact Over Time**: After implementing changes based on the recommendations, track these metrics to see if engagement, views, or other aspects improve.
   - **Optimize Content**: Pages with high engagement might be expanded or enhanced to retain users further, while pages with low performance should be re-evaluated for content quality and relevance.
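
As a rough illustration of the prioritization step above, the sketch below ranks pages by bounce rate and prints the rates as percentages. It uses plain Python rather than the notebook's DataFrame so it runs on its own. The dictionary keys and the helper names `as_percent` and `worst_first` are assumptions made for this example; only the metric values for Rows 0 and 7 come from the table.

```python
# Rows 0 and 7 from the recommendations table. Rates are stored as
# fractions between 0 and 1 (0.83 means 83%). Key names are illustrative.
pages = [
    {"url": "/fitness-based-seo-service", "views": 16, "engagement_rate": 0.83,
     "bounce_rate": 0.16, "avg_session_seconds": 25.92, "active_users": 8},
    {"url": "/adult-seo-service", "views": 172, "engagement_rate": 0.71,
     "bounce_rate": 0.29, "avg_session_seconds": 244.99, "active_users": 66},
]


def as_percent(fraction):
    """Format a 0-1 fraction as a whole-number percentage string."""
    return f"{fraction * 100:.0f}%"


def worst_first(pages):
    """Order pages by highest bounce rate first; ties go to fewer views."""
    return sorted(pages, key=lambda p: (-p["bounce_rate"], p["views"]))


for page in worst_first(pages):
    print(
        page["url"],
        "| bounce:", as_percent(page["bounce_rate"]),
        "| engagement:", as_percent(page["engagement_rate"]),
        "| views:", page["views"],
    )
```

Ranking on a single metric is deliberately crude. In practice you would probably combine several metrics, since a page such as `/fitness-based-seo-service` has a low bounce rate but still needs attention because of its low view count.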


### Key Takeaways from the Output
- The **"Final Merged DataFrame"** provides a comprehensive view of all the metrics related to each URL. This helps in understanding the current state of SEO performance.
- The **"Generated Content Recommendations"** table offers actionable insights for each page. The goal is to improve user engagement, reduce bounce rates, and optimize content based on how users currently interact with each page.
- **Example Interpretations**:
    - For a page with low views and high bounce rates, the recommendation might suggest revisiting SEO strategies or improving content relevance.
    - A page with high engagement and long average session duration may prompt expanding the content to retain user interest.
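
The two example interpretations above can be read as simple rules over a page's metrics. The following is a minimal sketch of such rules, not the notebook's actual recommendation logic: the function name `recommend`, the threshold constants, and the wording of the returned strings are all assumptions chosen for illustration.

```python
# Hypothetical thresholds for illustration; the notebook's real rules may differ.
LOW_VIEWS = 50
HIGH_BOUNCE_RATE = 0.50
HIGH_ENGAGEMENT_RATE = 0.60
LONG_SESSION_SECONDS = 120


def recommend(views, engagement_rate, bounce_rate, avg_session_seconds):
    """Map one page's metrics to a short recommendation string.

    Rates are fractions between 0 and 1, as in the table (0.29 means 29%).
    """
    if views < LOW_VIEWS and bounce_rate > HIGH_BOUNCE_RATE:
        return "Revisit the SEO strategy and improve content relevance."
    if views < LOW_VIEWS:
        return "Revisit the SEO strategy to increase visibility."
    if (engagement_rate >= HIGH_ENGAGEMENT_RATE
            and avg_session_seconds >= LONG_SESSION_SECONDS):
        return "Expand the content to retain user interest."
    if bounce_rate > HIGH_BOUNCE_RATE:
        return "Improve content relevance to reduce the bounce rate."
    return "No change suggested by these rules."


# Metrics for Rows 0 and 7 from the recommendations table.
print(recommend(16, 0.83, 0.16, 25.92))
print(recommend(172, 0.71, 0.29, 244.99))
```

With these assumed thresholds, Row 0 falls under the low-views rule and Row 7 under the expand-content rule, which matches the direction of the recommendations shown in the table.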

### Next Steps
1. **Use the Recommendations**: Implement the recommendation for each URL to improve SEO performance.
2. **Track Metrics**: After implementing changes, track the metrics over time to assess the impact.
3. **Iterative Improvements**: Continue to refine and optimize content based on updated metrics and user behavior insights.
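
One way to approach the tracking step is to compare a metrics snapshot taken before a change with one taken afterwards. The sketch below assumes each snapshot is a plain dictionary. The helper `metric_changes` is hypothetical, and the "after" values are invented purely to demonstrate the calculation; they are not results from this project.

```python
def metric_changes(before, after):
    """Return the absolute and percentage change for each metric in `before`."""
    changes = {}
    for name, old in before.items():
        new = after[name]
        delta = new - old
        # Percentage change is undefined when the starting value is zero.
        pct_change = None if old == 0 else delta / old * 100
        changes[name] = {"before": old, "after": new,
                         "delta": delta, "pct_change": pct_change}
    return changes


# "before" uses Row 0 from the table; "after" is made-up example data.
before = {"views": 16, "bounce_rate": 0.16}
after = {"views": 24, "bounce_rate": 0.12}

for name, change in metric_changes(before, after).items():
    print(f"{name}: {change['before']} -> {change['after']} "
          f"({change['pct_change']:+.1f}%)")
```

Re-running the comparison after each round of changes gives a simple record of whether a recommendation had the intended effect.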

### Summary
This output provides a comprehensive overview of each URL's performance together with actionable steps to improve it. By following the recommendations, you can increase user engagement, improve content relevance, and refine the SEO strategy for each page on your website. The goal is a tailored approach to optimizing your site's content based on real user behavior and engagement data.
