<a href="https://colab.research.google.com/github/Abhiss123/AlmaBetter-Projects/blob/main/AI_Powered_Semantic_Search_Intent_Framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name: AI-Powered Semantic Search Intent Framework**

---
# **Purpose of the Project:**

This project, **AI-Powered Semantic Search Intent Framework**, is designed to help websites and businesses understand what users are looking for when they search online. It uses **Artificial Intelligence (AI)** and **Machine Learning (ML)** techniques to interpret and classify the "intent" behind search queries. The term "intent" refers to the goal or purpose a user has when they type something into a search bar.

#### **Why Is This Project Important?**
1. **User Experience Enhancement**:
   - Many websites struggle to give users the information they are truly looking for.
   - For example, if a user searches for "best laptops under $500," they might expect a comparison page with budget laptops. If the website instead shows them a random article about laptops, the user might leave the site.
   - This project ensures that websites can understand what users want and provide the most relevant content, improving user satisfaction.

2. **Search Engine Optimization (SEO)**:
   - For businesses, appearing at the top of search engine results is crucial. By understanding what users are searching for and what they want, businesses can create better content that aligns with those needs, increasing their chances of ranking higher.

3. **Revenue Growth**:
   - For e-commerce sites, understanding search intent can boost sales. For example, knowing when users are ready to buy (commercial intent) versus when they are just researching (informational intent) allows businesses to present the right call-to-action, such as "Buy Now" or "Learn More."

---

#### **What Does This Project Do?**
The **AI-Powered Semantic Search Intent Framework** performs three main tasks:

1. **Semantic Understanding**:
   - It analyzes the words in search queries and website content to understand their true meaning, not just the literal text.
   - For instance, the phrases "affordable phones" and "cheap smartphones" mean nearly the same thing, and this system recognizes that similarity.

2. **Classifying Intent**:
   - It categorizes search queries into three main types of intent:
     - **Informational Intent**: When users are looking for knowledge or research (e.g., "How does SEO work?").
     - **Navigational Intent**: When users are trying to find a specific page or resource (e.g., "YouTube login page").
     - **Commercial Intent**: When users are ready to make a purchase or take action (e.g., "Buy iPhone 14 Pro").
   - This helps businesses tailor their content to meet user expectations.

3. **Providing Actionable Insights**:
   - The framework generates recommendations to improve the relevance of website content. For example:
     - If the system detects missing call-to-action buttons, it suggests adding them.
     - If content lacks internal links or pricing information, it provides improvement tips.
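
The kinds of checks described above can be sketched with a few rule-based tests. This is a minimal illustration, not the framework's actual implementation: the function name `suggest_improvements`, the CTA phrase list, and the regular expressions are all assumptions chosen for the example.

```python
import re

# Hypothetical CTA phrases; a real site would tune this list.
CTA_PHRASES = ("buy now", "shop now", "learn more", "get started", "contact us")

def suggest_improvements(html: str) -> list[str]:
    """Return simple content recommendations for one page's HTML."""
    text = html.lower()
    suggestions = []

    # Rule 1: no recognizable call-to-action phrase anywhere on the page.
    if not any(phrase in text for phrase in CTA_PHRASES):
        suggestions.append("Add a clear call-to-action button.")

    # Rule 2: no relative internal link (href="/..." but not href="//...").
    # Simplification: absolute same-domain links and single-quoted hrefs are ignored.
    if not re.search(r'href="/(?!/)', text):
        suggestions.append("Add internal links to related pages.")

    # Rule 3: no price-like pattern such as $499 or $1,299.00.
    if not re.search(r"\$\s?\d[\d,]*(\.\d{2})?", text):
        suggestions.append("Add pricing information.")

    return suggestions

page = '<h1>Gaming Laptops</h1><p>Fast machines for every budget.</p>'
print(suggest_improvements(page))
```

The sample page has no CTA, no internal link, and no price, so all three suggestions are returned. A page that passes every rule yields an empty list.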

---

#### **How Does It Work?**
1. **Data Collection**:
   - The system gathers data from websites, including URLs, titles, descriptions, and the main content of pages.

2. **Text Analysis**:
   - The collected data is cleaned and analyzed using term-weighting techniques such as TF-IDF (Term Frequency-Inverse Document Frequency), which highlights terms that are frequent on one page but rare across the others.

3. **Machine Learning**:
   - It uses unsupervised models such as KMeans clustering to group similar search queries, so that each group can be labeled with an intent.

4. **Dynamic Recommendations**:
   - Based on the insights, the framework generates actionable steps, such as improving meta descriptions, adding FAQs, or targeting high-volume keywords.
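
To make the text-analysis step concrete, here is a minimal TF-IDF sketch in plain Python. It is illustrative only: the sample documents and the `tf_idf` helper are assumptions, and it uses the basic formula tf × log(N / df). Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    """Compute basic TF-IDF weights: tf(term, doc) * log(N / df(term))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up sample documents standing in for page text or queries.
docs = [
    "cheap gaming laptops deals",
    "cheap smartphones",
    "gaming laptops review",
]
for doc, w in zip(docs, tf_idf(docs)):
    top_term = max(w, key=w.get)
    print(f"{doc!r}: most distinctive term = {top_term}")
```

Terms that appear in only one document ("deals", "smartphones", "review") receive the highest weights, which is the property that makes TF-IDF useful for spotting what a page is specifically about.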

---

#### **Who Can Benefit From This Project?**
1. **Website Owners**:
   - They can optimize their content to make it more relevant to users, leading to higher traffic and better user retention.

2. **Marketers and SEO Experts**:
   - They can use the framework to create data-driven strategies for improving search engine rankings and conversion rates.

3. **E-Commerce Businesses**:
   - They can identify and cater to potential buyers by creating targeted campaigns.

4. **Educational Platforms**:
   - They can improve informational content to help students or researchers find the knowledge they are seeking.

---

#### **Real-World Example**
Imagine an e-commerce website selling laptops:
- A user searches for "best gaming laptops under $1500."

- Using this framework, the system identifies the intent as **Commercial**.
- It then recommends:
  - Adding comparison tables for gaming laptops.
  - Including clear CTAs like "Shop Now."
  - Improving the meta description to include phrases like "Top Gaming Laptops Under $1500."

As a result, the website becomes more relevant to the user, increasing the chances of making a sale.

---

### **Why Is This Framework Unique?**
1. **AI Integration**:
   - Automating the analysis with AI lets the framework process large volumes of data consistently and at scale.
2. **Dynamic Insights**:
   - The system doesn’t just classify data; it actively provides suggestions for improvement.
3. **Broad Application**:
   - Whether it’s a blog, e-commerce site, or educational platform, this framework can adapt to various industries.

---
# **1. What is Semantic Search Intent?**

Semantic Search Intent refers to understanding **why** a user is searching for something, not just the words they type. It goes beyond keyword matching to determine the **meaning** and **purpose** behind a search.

There are three main types of search intents:
- **Informational**: The user wants to learn something (e.g., "What is Semantic Search Intent?").
- **Navigational**: The user is looking for a specific website or brand (e.g., "Facebook login").
- **Commercial**: The user wants to make a purchase or is researching products/services (e.g., "Best laptops under $1000").

Semantic Search uses natural language processing (NLP) and machine learning to connect the user's intent with relevant content, even if they don’t use the exact words found in that content.
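
A very rough way to see the three intent types in action is a keyword heuristic. The sketch below is an assumption-heavy illustration, not the NLP model the framework uses: the cue lists and the `classify_intent` name are hypothetical, and real queries need a trained model to handle ambiguity.

```python
# Hypothetical cue words for each intent; real systems learn these from data.
INTENT_CUES = {
    "commercial": ("buy", "price", "best", "cheap", "under $", "deal"),
    "navigational": ("login", "sign in", "contact", "homepage", "official site"),
    "informational": ("what", "how", "why", "guide", "tutorial"),
}

def classify_intent(query: str) -> str:
    """Return the first intent whose cue appears in the query."""
    q = query.lower()
    for intent, cues in INTENT_CUES.items():
        if any(cue in q for cue in cues):
            return intent
    # Default when no cue matches: treat the query as a request for information.
    return "informational"

for query in ("What is Semantic Search Intent?",
              "Facebook login",
              "Best laptops under $1000"):
    print(f"{query!r} -> {classify_intent(query)}")
```

Because this matches plain substrings, it misfires easily (for example, "how" also matches "show"). That weakness is the motivation for the semantic, model-based approach described above.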

---

#### **2. Use Cases of Semantic Search Intent**

**General Use Cases**:
- **Search Engines**: Google uses semantic search to rank pages based on user intent, not just keywords.
- **Customer Support**: Virtual assistants (like Alexa or Siri) understand user queries and provide context-based answers.
- **E-commerce**: Platforms like Amazon suggest products based on what a user is likely to buy.
- **Content Recommendation**: Apps like YouTube or Netflix suggest videos or shows based on user preferences.

**Website Context Use Cases**:
For website owners, Semantic Search Intent can:
1. **Improve Content Relevance**: Ensure your website content matches what users are searching for.
2. **Enhance SEO**: Align pages with user search intent to rank higher on Google.
3. **Boost User Engagement**: Provide answers or solutions directly addressing the user’s needs.
4. **Increase Conversions**: Help users find what they need faster, leading to more sales or sign-ups.

---

#### **3. Real-Life Implementations**

- **Google Search**: Uses semantic search to understand queries like "best Italian restaurants near me," even if "restaurants" is not explicitly mentioned on the page.
- **Amazon**: Suggests products based on a combination of user searches and intent.
- **Netflix**: Predicts what you might want to watch next by analyzing your past behavior and intent.
- **E-commerce Sites**: Categorize search results into intent types (informational blogs, product pages, or FAQs).

---

#### **4. Website-Specific Use Cases of Semantic Search Intent**

For your client's website, Semantic Search Intent can:
1. **Help Categorize Pages**:
   - Match pages with **informational** intent (e.g., blogs, guides).
   - Match pages with **navigational** intent (e.g., contact page, product catalog).
   - Match pages with **commercial** intent (e.g., checkout, product details).

2. **Optimize Website Structure**:
   - Use the output to structure pages so users find the right content more quickly.

3. **Improve Targeted Marketing**:
   - Align product pages with commercial intent searches.
   - Use blog posts for informational intent queries to attract traffic.

4. **Content Gap Analysis**:
   - Find which intents are not being served by existing pages and create new content.
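
Content gap analysis reduces to comparing the intents users search with against the intents existing pages serve. A minimal sketch, assuming intent labels have already been assigned (the sample queries, URLs, and the `find_intent_gaps` helper are hypothetical):

```python
from collections import Counter

def find_intent_gaps(query_intents: dict[str, str],
                     page_intents: dict[str, str]) -> dict[str, int]:
    """Count queries per intent for intents that no existing page serves."""
    served = set(page_intents.values())
    demand = Counter(query_intents.values())
    return {intent: n for intent, n in demand.items() if intent not in served}

# Made-up labels: query -> intent, and page path -> intent.
query_intents = {
    "how does seo work": "informational",
    "seo pricing plans": "commercial",
    "buy seo audit": "commercial",
}
page_intents = {
    "/blog/seo-basics": "informational",
    "/contact": "navigational",
}
print(find_intent_gaps(query_intents, page_intents))  # {'commercial': 2}
```

Here two commercial queries have no matching page, which would suggest creating a pricing or product page.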

---

#### **5. What Data Does the Model Need?**

To function, a Semantic Search Intent model typically needs:
- **Input Data**:
  - Text content from the website pages (scraped from URLs or uploaded as a CSV).
  - Metadata (titles, descriptions, tags).
  - User search queries (from site logs or tools like Google Search Console).

- **Output**:
  - A classification of intents for each page or search query.
  - Recommendations for aligning content with user intent.

---

#### **6. Expected Output in Website Context**

1. **Classification of Intent**:
   - The model will group pages or queries by their primary intent: informational, navigational, or commercial.

2. **Content Recommendations**:
   - Suggestions for improving the page to match user intent better (e.g., adding a call-to-action for commercial intent).

3. **Insights for SEO**:
   - Which keywords or queries to target for different types of intent.

4. **Content Gaps**:
   - Identify missing content to attract more traffic.

---

#### **7. Step-by-Step Workflow**

**For the Website Context**:
1. **Data Collection**:
   - Gather website URLs or download website data as a CSV.
   - Collect search query data (from logs or tools).

2. **Preprocessing**:
   - Extract and clean text from the content.
   - Structure data (titles, headers, keywords).

3. **Semantic Intent Mapping**:
   - Use an NLP model to classify the content or queries into intent types.

4. **Output & Alignment**:
   - Provide insights and recommendations:
     - What type of content to add.
     - How to align current pages to user intent.

---
# **What Are User Search Queries?**

**User search queries** are the exact words or phrases people type into search engines (like Google) when looking for information. For example:
- Someone searching "best smartphones under $500" is expressing a **commercial intent**: they want to compare, and likely buy, a smartphone.
- Another user searching "contact Facebook support" has a **navigational intent**.

Understanding these queries helps website owners:
1. **Learn User Intent**: What visitors want (e.g., information, product, or help).
2. **Optimize Content**: Create pages that directly address these queries.
3. **Improve SEO**: Rank higher in search engines by aligning content with what users are searching for.

---

### Why Do You Need Search Query Data?

- **Identify Traffic Sources**: Understand how people find your website.
- **Target Relevant Keywords**: Find terms users are searching for to optimize content.
- **Improve Website Navigation**: Address gaps in your website by adding relevant pages.
- **Increase Conversions**: Help users find exactly what they’re looking for, which increases sales or engagement.

---

### Where Can You Get Search Query Data?

You can get search query data from two main sources:

#### **1. Google Search Console (Recommended)**
Google Search Console (GSC) is a free tool provided by Google to help website owners monitor and maintain their site's presence in Google Search results. It gives detailed insights into:
- What queries brought users to your site.
- How many clicks and impressions each query got.
- Your website’s ranking position for specific queries.

#### **2. Server Logs**
Your web server keeps an access log of every request. If your website has a search function that places the search term in the URL, these logs show:
- What users searched for on your website.
- The time of each search and the visitor's IP address, from which an approximate location can be inferred.

---

### How to Get Search Query Data

#### **Method 1: Google Search Console**
1. **Set Up Google Search Console**:
   - Go to [Google Search Console](https://search.google.com/search-console/).
   - Sign in with a Google account.
   - Add your website and verify ownership (using methods like DNS, HTML file upload, or Google Analytics).

2. **Access Search Query Data**:
   - Navigate to the **Performance** tab.
   - You’ll see a list of queries, along with metrics like clicks, impressions, and average position.

3. **Download Search Query Data**:
   - Click the **Export** button (usually at the top of the page).
   - Choose a format (CSV is best for further processing).
   - Save the file to your computer.

#### **Method 2: Server Logs**
1. **Access Server Logs**:
   - Log into your web hosting control panel (e.g., cPanel).
   - Go to the **Logs** section and look for "Access Logs" or "Search Logs."
   - Download the log files.

2. **Extract Search Queries**:
   - If your website has a search box, search terms will often be included in the log files. Look for lines with terms like `q=search-term`.
   - Use tools like Python or log analyzers to process the data.
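
As one way to do this in Python, the sketch below pulls the `q` parameter out of access-log lines. The log lines are made-up examples in the common Apache/Nginx format, and the parameter name `q` is an assumption; some sites use `s`, `query`, or something else.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches the request part of a combined-format log line: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

def extract_search_terms(log_lines, param="q"):
    """Yield the decoded value of `param` from each request URL that has one."""
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        query_string = urlsplit(match.group(1)).query
        # parse_qs decodes percent-escapes and turns "+" into spaces.
        for term in parse_qs(query_string).get(param, []):
            yield term

# Made-up log lines for illustration.
sample_log = [
    '203.0.113.5 - - [18/Dec/2024:10:01:22 +0000] "GET /search?q=seo+services HTTP/1.1" 200 5120',
    '203.0.113.9 - - [18/Dec/2024:10:02:10 +0000] "GET /about HTTP/1.1" 200 2048',
    '198.51.100.7 - - [18/Dec/2024:10:03:45 +0000] "GET /search?q=ai%20seo%20agency&page=2 HTTP/1.1" 200 4711',
]
print(list(extract_search_terms(sample_log)))  # ['seo services', 'ai seo agency']
```

Using `urllib.parse` rather than a hand-written regex for the query string means URL-encoded characters are decoded correctly.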

---

### What Do You Do With the Data?

Once you have the data (from GSC or logs), here’s how you can use it:

1. **Preprocessing**:
   - Remove duplicates and irrelevant queries.
   - Group queries by intent (informational, navigational, commercial).

2. **Analysis**:
   - Identify high-volume queries to create or optimize pages.
   - Spot gaps where no relevant content exists.

3. **Align Content**:
   - Create new blog posts, FAQs, or landing pages for high-traffic queries.
   - Update existing content to rank higher for specific queries.
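
The preprocessing and analysis steps might look like the following sketch. It uses only the standard library and an inline sample shaped like a Search Console export (`Top queries`, `Clicks`, `Impressions`). The sample rows, the thresholds, and the `load_queries` and `find_opportunities` helpers are assumptions for illustration.

```python
import csv
import io

# Inline sample in the shape of a Search Console "Queries" export.
SAMPLE_CSV = """Top queries,Clicks,Impressions
thatware,2046,3706
seo services in india,39,6398
SEO services in India ,39,6398
ai seo agency,37,2471
"""

def load_queries(csv_text):
    """Parse rows, normalize query text, and drop duplicate queries."""
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        query = row["Top queries"].strip().lower()
        if query in seen:
            continue  # keep only the first occurrence of each query
        seen.add(query)
        rows.append({
            "query": query,
            "clicks": int(row["Clicks"]),
            "impressions": int(row["Impressions"]),
        })
    return rows

def find_opportunities(rows, min_impressions=1000, max_ctr=0.02):
    """Return high-volume queries whose click-through rate is low."""
    return [
        r["query"] for r in rows
        if r["impressions"] >= min_impressions
        # Compare clicks to max_ctr * impressions to avoid dividing by zero.
        and r["clicks"] <= max_ctr * r["impressions"]
    ]

rows = load_queries(SAMPLE_CSV)
print(find_opportunities(rows))  # ['seo services in india', 'ai seo agency']
```

Queries with many impressions but few clicks are candidates for new or improved pages, while the branded query already converts well and is filtered out.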

---

### Why Is This Data Important?

- It tells you **what users are looking for**.
- It helps you **create better content**.
- It improves your website’s visibility in search engines.

---
# What is Website Ownership Verification?

Ownership verification is a process that proves you are the rightful owner or administrator of a website. It ensures that only authorized people can access sensitive data or make changes in Google Search Console.

---

### Why is Ownership Verification Necessary?

- **Access Search Data**: You can see which queries bring traffic to your website.
- **Monitor Performance**: Track impressions, clicks, and ranking.
- **Fix Issues**: Google alerts you about errors like broken pages or security problems.

---

### Methods for Verifying Website Ownership

Here’s a breakdown of the three main methods you can use:

---

#### **1. DNS Verification (Domain Name System)**

This is done by adding a special verification code to your website's domain records. Follow these steps:

1. **Log in to Google Search Console**:
   - Go to [Google Search Console](https://search.google.com/search-console/).
   - Click **Start Now** and log in using your Google account.

2. **Add Your Website as a "Domain" Property**:
   - In the property selector, choose the **Domain** property type.
   - Enter the bare domain name (e.g., `example.com`), without `https://` or `www`. A Domain property covers all subdomains and both HTTP and HTTPS.

3. **Copy the TXT Record**:
   - Google will provide a unique TXT record for verification.

4. **Access Your Domain Registrar**:
   - Log in to the website where you bought your domain (e.g., GoDaddy, Namecheap, Bluehost).
   - Look for **DNS Management** or **DNS Settings**.

5. **Add the TXT Record**:
   - In the DNS settings, find the option to add a new **TXT Record**.
   - Copy the TXT code provided by Google and paste it in the "Value" field.
   - Save the changes.

6. **Verify in Google Search Console**:
   - Return to Google Search Console and click **Verify**.
   - It might take a few hours for the changes to propagate.

---

#### **2. HTML File Upload**

This method involves uploading a special file to your website's root directory. Here’s how:

1. **Log in to Google Search Console**:
   - Go to [Google Search Console](https://search.google.com/search-console/).
   - Add your website URL.

2. **Choose HTML File Upload**:
   - Google will give you a file to download (e.g., `google12345abc.html`).

3. **Access Your Website’s Hosting Panel**:
   - Log in to your web hosting control panel (e.g., cPanel).
   - Go to the **File Manager** section.

4. **Upload the File**:
   - Navigate to the root directory of your website (usually called `public_html`).
   - Upload the file you downloaded from Google.

5. **Verify Ownership**:
   - Go back to Google Search Console and click **Verify**.
   - Google will check if the file exists on your website.

---

#### **3. Google Analytics**

This method requires that your website already has Google Analytics installed.

1. **Log in to Google Search Console**:
   - Add your website URL.

2. **Choose Google Analytics**:
   - Select this option if you’ve already set up Analytics.

3. **Grant Access**:
   - Ensure the Google account you’re using has admin-level permissions in Google Analytics.

4. **Verify**:
   - Click **Verify** in Google Search Console, and it will connect to your Analytics account.

---

### Which Method Should You Choose?

- **Use DNS Verification** if you manage your domain or have access to the DNS settings.
- **Use HTML File Upload** if you can easily upload files to your website.
- **Use Google Analytics** if Analytics is already set up for your site.

---

### What Happens After Verification?

Once ownership is verified:
1. Google grants you access to data about your website’s performance.
2. You can monitor search queries, clicks, and impressions.
3. You’ll receive alerts about issues like mobile usability or page indexing problems.

---



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import necessary libraries
import pandas as pd

# Step 1: Define file paths for datasets
# These file paths specify where the datasets are stored.
# Each file contains data that we want to analyze and merge later for insights.
countries_file = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/Countries.csv'
dates_file = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/Dates.csv'
devices_file = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/Devices.csv'
pages_file = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/Pages Of Thatware.csv'
queries_file = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/Queries.csv'
search_appearances_file = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/Search Appearances.csv'

# Step 2: Load datasets into pandas DataFrames
# pandas is a Python library used for handling tabular data (like spreadsheets).
# Each dataset is loaded into a pandas DataFrame, which makes it easier to work with.
countries_data = pd.read_csv(countries_file)  # Data about website traffic from different countries
dates_data = pd.read_csv(dates_file)          # Data containing traffic information for specific dates
devices_data = pd.read_csv(devices_file)      # Data about traffic based on device types (e.g., mobile, desktop)
pages_data = pd.read_csv(pages_file)          # Data about website pages and their performance
queries_data = pd.read_csv(queries_file)      # Data about search queries that led to the website
search_appearances_data = pd.read_csv(search_appearances_file)  # Data about how the website appeared in search results

# Step 3: Define a function to inspect a dataset
# This function helps us quickly understand the structure and content of each dataset.
# It prints the name of the dataset, the column names, and the first 10 rows of data for a quick preview.
def inspect_dataset(name, data):
    """
    Inspects a given dataset by printing its name, columns, and the first 10 rows.

    Parameters:
    - name (str): The name of the dataset (for identification).
    - data (DataFrame): The pandas DataFrame containing the dataset.
    """
    print(f"\n--- {name} Dataset ---")  # Print the dataset name for context
    print("Columns:", data.columns.tolist())  # Display all column names in the dataset
    print("First 10 rows:")  # Preview the first 10 rows of data
    print(data.head(10))  # Output the first 10 rows for review

# Step 4: Inspect all datasets
# Use the inspect_dataset function to check the structure of each dataset.
# This step ensures we know what data is available and how it is organized.
inspect_dataset("Countries", countries_data)  # Inspect the countries dataset
inspect_dataset("Dates", dates_data)          # Inspect the dates dataset
inspect_dataset("Devices", devices_data)      # Inspect the devices dataset
inspect_dataset("Pages", pages_data)          # Inspect the pages dataset
inspect_dataset("Queries", queries_data)      # Inspect the queries dataset
inspect_dataset("Search Appearances", search_appearances_data)  # Inspect the search appearances dataset



--- Countries Dataset ---
Columns: ['Country', 'Clicks', 'Impressions', 'CTR', 'Position']
First 10 rows:
                Country  Clicks  Impressions    CTR  Position
0                 India    7574       744774  1.02%     39.70
1              Pakistan    1377        98188  1.40%     49.23
2         United States    1140      2967202  0.04%     57.96
3        United Kingdom     477       590687  0.08%     59.06
4            Bangladesh     458        73279  0.63%     57.38
5             Indonesia     439       111348  0.39%     51.51
6                Canada     220       131045  0.17%     52.76
7             Australia     182        81330  0.22%     48.79
8  United Arab Emirates     170        28426  0.60%     51.30
9               Vietnam     160       201373  0.08%     51.39

--- Dates Dataset ---
Columns: ['Date', 'Clicks', 'Impressions', 'CTR', 'Position']
First 10 rows:
         Date  Clicks  Impressions    CTR  Position
0  18-12-2024     215       100298  0.21%     52.62
1  17-1

In [None]:
# Import the pandas library
# pandas is a library used for working with datasets in table format.
import pandas as pd

# Step 1: Define the folder path where all datasets are stored
# This specifies the location of the datasets in Google Drive.
# We'll use this path to construct the full file paths for each dataset.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'

# Step 2: Define the file paths for all datasets
# Each dataset contains specific information (e.g., countries, dates, devices, pages, queries, and search appearances).
# These file paths tell the program where to find the respective datasets.
countries_file = folder_path + 'Countries.csv'  # Traffic data from various countries
dates_file = folder_path + 'Dates.csv'          # Traffic data for specific dates
devices_file = folder_path + 'Devices.csv'      # Traffic data for different devices
pages_file = folder_path + 'Pages Of Thatware.csv'  # Performance data for website pages
queries_file = folder_path + 'Queries.csv'      # Performance data for search queries
search_appearances_file = folder_path + 'Search Appearances.csv'  # Data about how the site appears in search results

# Step 3: Load all datasets into pandas DataFrames
# We load each dataset into a "DataFrame", which is like a table or spreadsheet.
# This makes it easier to work with the data for analysis and merging.
countries_data = pd.read_csv(countries_file)  # Load the countries dataset
dates_data = pd.read_csv(dates_file)          # Load the dates dataset
devices_data = pd.read_csv(devices_file)      # Load the devices dataset
pages_data = pd.read_csv(pages_file)          # Load the pages dataset
queries_data = pd.read_csv(queries_file)      # Load the queries dataset
search_appearances_data = pd.read_csv(search_appearances_file)  # Load the search appearances dataset

# Step 4: Merge Queries, Pages, and Search Appearances into Content Performance
# We combine multiple datasets related to website content into a single dataset called "Content Performance".
# This step unifies data about queries, pages, and search appearances for easier analysis.
content_performance = pd.concat([
    # Rename columns to have a consistent "Identifier" for merging and analysis.
    queries_data.rename(columns={'Top queries': 'Identifier'}).assign(Context='Queries'),
    pages_data.rename(columns={'Top pages': 'Identifier'}).assign(Context='Pages'),
    search_appearances_data.rename(columns={'Search Appearance': 'Identifier'}).assign(Context='Search Appearances')
], ignore_index=True)  # Combine all rows into one dataset, ignoring the original index.

# Step 5: Merge Countries, Dates, and Devices into Traffic Segmentation
# We combine datasets related to traffic data (countries, dates, and devices) into "Traffic Segmentation".
# This provides a consolidated view of traffic patterns based on different factors.
traffic_segmentation = pd.concat([
    # Rename columns to have a consistent "Identifier" for merging and analysis.
    countries_data.rename(columns={'Country': 'Identifier'}).assign(Context='Countries'),
    dates_data.rename(columns={'Date': 'Identifier'}).assign(Context='Dates'),
    devices_data.rename(columns={'Device': 'Identifier'}).assign(Context='Devices')
], ignore_index=True)  # Combine all rows into one dataset, ignoring the original index.

# Step 6: Save the merged datasets into the same folder as the original datasets
# Save the combined datasets into new CSV files for future use and analysis.
content_performance_file = folder_path + 'Merged_Content_Performance.csv'  # Path for the Content Performance file
traffic_segmentation_file = folder_path + 'Merged_Traffic_Segmentation.csv'  # Path for the Traffic Segmentation file

# Save the Content Performance dataset
content_performance.to_csv(content_performance_file, index=False)

# Save the Traffic Segmentation dataset
traffic_segmentation.to_csv(traffic_segmentation_file, index=False)

# Step 7: Display the first 17 rows of each merged dataset for validation
# We display the first few rows of the merged datasets to confirm the data looks as expected.
print("\n--- Content Performance Dataset Preview ---")
print(content_performance.head(17))  # Display the first 17 rows of the Content Performance dataset

print("\n--- Traffic Segmentation Dataset Preview ---")
print(traffic_segmentation.head(17))  # Display the first 17 rows of the Traffic Segmentation dataset

# Step 8: Print file save confirmation
# Confirm that the merged datasets have been successfully saved.
print(f"\nContent Performance saved at: {content_performance_file}")
print(f"Traffic Segmentation saved at: {traffic_segmentation_file}")



--- Content Performance Dataset Preview ---
                                  Identifier  Clicks  Impressions     CTR  \
0                                   thatware    2046         3706  55.21%   
1                               thatware llp     522         1736  30.07%   
2                                    textise     507        43599   1.16%   
3                               thatware seo     200          342  58.48%   
4                                  that ware     164          235  69.79%   
5                                   thatwere      48           67  71.64%   
6                      seo services in india      39         6398   0.61%   
7   difference between marketing and selling      38         3441   1.10%   
8                              ai seo agency      37         2471   1.50%   
9              how to add exif data to image      29          112  25.89%   
10                     seo agency in kolkata      28         2977   0.94%   
11                    add exif 

---
# **Part 1: Web Content Scraper**
**What it does:**
- Scrapes the content, title, and description from the provided URLs of the website (ThatWare).
- Cleans the extracted content by removing unwanted or repetitive text.
- Saves the cleaned data into a structured CSV file for further analysis.

**Use Case:**
This part is essential for gathering raw data from the website, which will later be analyzed to understand its intent and performance.

---


In [None]:
# Importing necessary libraries
import pandas as pd  # pandas is used to handle and manipulate tabular data efficiently
from bs4 import BeautifulSoup  # BeautifulSoup is used for parsing HTML content
import requests  # requests is used to send HTTP requests to websites and retrieve their content

# Step 1: Define a list of URLs to scrape
# These URLs represent pages from the ThatWare website, which we want to analyze for content and metadata.
urls = [
    'https://thatware.co/',
    'https://thatware.co/advanced-seo-services/',
    'https://thatware.co/digital-marketing-services/',
    'https://thatware.co/business-intelligence-services/',
    'https://thatware.co/link-building-services/',
    'https://thatware.co/branding-press-release-services/',
    'https://thatware.co/conversion-rate-optimization/',
    'https://thatware.co/social-media-marketing/',
    'https://thatware.co/content-proofreading-services/',
    'https://thatware.co/website-design-services/',
    'https://thatware.co/web-development-services/',
    'https://thatware.co/app-development-services/',
    'https://thatware.co/website-maintenance-services/',
    'https://thatware.co/bug-testing-services/',
    'https://thatware.co/software-development-services/',
    'https://thatware.co/competitor-keyword-analysis/'
]

# Step 2: Define a function to scrape content and metadata from URLs
def scrape_relevant_content(url_list):
    """
    This function scrapes metadata (title, description) and main content from each URL in the given list.
    It processes the content to remove unnecessary text and organizes the data into a structured format.

    Parameters:
    - url_list (list): List of URLs to scrape.

    Returns:
    - pd.DataFrame: A structured table with URL, Title, Description, and Cleaned Content columns.
    """
    # Create an empty list to store data extracted from each URL
    scraped_data = []

    # Loop through each URL to fetch and process its content
    for url in url_list:
        try:
            # Step 3: Send an HTTP GET request to fetch the web page's HTML content
            response = requests.get(url, timeout=10)  # Timeout ensures the script doesn't hang indefinitely
            response.raise_for_status()  # Raise an error if the HTTP request fails

            # Step 4: Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')  # Parse the HTML response into a structured object

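            # Step 5: Extract the metadata
            # Title: The text of the <title> tag, shown in browser tabs and often in search results
            # get_text() also handles titles where .string would be None (e.g., nested tags)
            title = (soup.title.get_text(strip=True) if soup.title else "") or "No Title Found"

            # Description: A brief summary of the page's content, usually in the meta tag
            description_tag = soup.find("meta", attrs={"name": "description"})
            # .get() avoids a KeyError when the tag exists but has no "content" attribute
            description = description_tag.get("content", "No Description Found") if description_tag else "No Description Found"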

            # Step 6: Extract the main content of the page
            # The <main> or <article> tags often hold the main content of a page
            main_content = soup.find('main') or soup.find('article')

            # If <main> or <article> are not found, try using common identifiers for content
            if not main_content:
                main_content = soup.find(attrs={'class': 'content'}) or soup.find(attrs={'id': 'content'})

            # If no specific content section is found, use the entire <body> tag as a last resort
            if not main_content:
                main_content = soup.body

            # Extract text from the identified section and clean it
            content = main_content.get_text(separator=' ', strip=True) if main_content else "No Content Found"

            # Step 7: Filter out unwanted or repetitive text
            # Remove boilerplate phrases that don't add value for analysis.
            # Note: str.replace is case-sensitive, so only these exact uppercase forms are removed.
            unwanted_phrases = [
                'GET A FREE CUSTOMIZED', 'FILL OUT THE FORM BELOW',
                'SERVICES', 'COPYRIGHT', 'TERMS OF SERVICE'
            ]
            for phrase in unwanted_phrases:
                content = content.replace(phrase, '')

            # Append the processed data for this URL to the list
            scraped_data.append({
                'URL': url,              # The original URL of the page
                'Title': title,          # Extracted title of the page
                'Description': description,  # Extracted description of the page
                'Content': content       # Cleaned main content of the page
            })

        except Exception as e:
            # Handle any errors during the scraping process and log them
            print(f"Failed to scrape {url}: {e}")
            scraped_data.append({
                'URL': url,
                'Title': "Error",
                'Description': "Error",
                'Content': "Error"
            })

    # Step 8: Convert the scraped data into a DataFrame for easier analysis
    return pd.DataFrame(scraped_data)

# Step 9: Call the function to scrape content from the list of URLs
scraped_data = scrape_relevant_content(urls)  # Execute the scraping function on the list of URLs

# Step 10: Save the scraped data into a CSV file
# This ensures the data can be reused later without re-running the scraping process.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'  # Folder path to save the file
scraped_file_path = folder_path + 'Enhanced_Scraped_Website_Content.csv'  # Define the output file path
scraped_data.to_csv(scraped_file_path, index=False)  # Save the scraped data to a CSV file

# Step 11: Display a preview of the first few rows of the scraped data
# This step allows us to verify that the scraping was successful and the data looks as expected.
print("\n--- Enhanced Scraped Data Preview ---")
print(scraped_data.head(10))  # Display the first 10 rows of the data

# Step 12: Confirm the file save location
# Notify the user where the output file has been saved for further analysis.
print(f"\nEnhanced scraped data saved at: {scraped_file_path}")



--- Enhanced Scraped Data Preview ---
                                                 URL  \
0                               https://thatware.co/   
1         https://thatware.co/advanced-seo-services/   
2    https://thatware.co/digital-marketing-services/   
3  https://thatware.co/business-intelligence-serv...   
4        https://thatware.co/link-building-services/   
5  https://thatware.co/branding-press-release-ser...   
6  https://thatware.co/conversion-rate-optimization/   
7        https://thatware.co/social-media-marketing/   
8  https://thatware.co/content-proofreading-servi...   
9       https://thatware.co/website-design-services/   

                                               Title  \
0  THATWARE® - Revolutionizing SEO with Hyper-Int...   
1  Advanced SEO Services - Professional SEO Agenc...   
2  Digital Marketing Services - Advanced Digital ...   
3  Business Intelligence Services - Competitive A...   
4  Link Building Services - Off Page SEO Agency -...   
5  PPC P

---
# Explanation of the Output

This output is **the result of scraping data from a website**, which means it’s a collection of information extracted from different pages of the website. Let’s go through each column and row to understand what it all means.

---

#### **Columns in the Dataset**

1. **`URL`**:
   - This column lists the web addresses (URLs) of the pages from the website that were scraped.
   - Example: `https://thatware.co/` is the homepage, while `https://thatware.co/advanced-seo-services/` is a page about advanced SEO services.

   **Purpose**:  
   - The URL tells you where the information comes from.
   - It helps you locate specific content on the website.

---

2. **`Title`**:
   - This column contains the title of each webpage, taken from the page’s `<title>` tag. This is the text shown in the browser tab and as the clickable headline in search engine results.
   - Example: For the homepage, the title is "THATWARE® - Revolutionizing SEO with Hyper-Intelligence."

   **Purpose**:  
   - Titles summarize what each page is about.
   - They matter for SEO (Search Engine Optimization) because search engines use titles as one signal when ranking pages and display them to users in results.

---

3. **`Description`**:
   - This column contains a brief summary of what the page is about, taken from the page’s `<meta name="description">` tag.
   - Example: For the page `https://thatware.co/`, the description is, "THATWARE® is the world's first SEO agency to specialize in hyper-intelligent search engine strategies."

   **Purpose**:  
   - Descriptions are essential for providing context to users and search engines about the page’s content.
   - They often appear below the title in search engine results.

---

4. **`Content`**:
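   - This column contains the visible text of each page’s main section (the scraper looks for `<main>` or `<article>`, then an element with class or id `content`, and falls back to the whole `<body>`). It can include headings, body copy, and calls-to-action (e.g., "GET A FREE CONVERSION RATE OPTIMISATION STRATEGY NOW!").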
   - Example: For the `https://thatware.co/conversion-rate-optimization/` page, the content discusses strategies to improve sales funnels.

   **Purpose**:  
   - The content helps in understanding the detailed message of the page.
   - It can be analyzed for keywords, tone, and structure to improve SEO or user engagement.

---

#### **Rows in the Dataset**

Each row in the dataset represents a single webpage on the website. Row numbers below follow the DataFrame index shown in the preview, which starts at 0. Let’s go over a few examples:

1. **Row 0** (`https://thatware.co/` - Homepage):
   - **Title**: "THATWARE® - Revolutionizing SEO with Hyper-Intelligence."
   - **Description**: Briefly explains what Thatware does as a company.
   - **Content**: Contains key highlights like the revenue they’ve generated and their unique approach to SEO.

   **Purpose**:  
   - This row summarizes the homepage, which acts as the gateway for all visitors.

---

2. **Row 1** (`https://thatware.co/advanced-seo-services/`):
   - **Title**: "Advanced SEO Services - Professional SEO Agency."
   - **Description**: Explains the advanced SEO services offered by Thatware.
   - **Content**: Mentions specific services like "Advanced SEO Audit."

   **Purpose**:  
   - This row provides insights into the advanced SEO services page, useful for identifying gaps or improving SEO performance.

---

3. **Row 8** (`https://thatware.co/content-proofreading-services/`):
   - **Title**: "Content Proofreading Services | Hire Content Professionals."
   - **Description**: Talks about Thatware’s expertise in proofreading services.
   - **Content**: Contains details about how users can hire content professionals.

   **Purpose**:  
   - This row is specific to a niche service, which might be useful for targeting certain customers.

---

#### **Enhanced Data**

The final line of the output confirms that the scraped data has been saved as a file (`Enhanced_Scraped_Website_Content.csv`).

**Purpose**:  
- This file can now be used for further analysis or as input for other tools like content strategy models, SEO audits, or machine learning algorithms.
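
When a page cannot be fetched, the scraper stores the literal string "Error" in `Title`, `Description`, and `Content`. Those placeholder rows are worth separating out before any text analysis, otherwise the word "Error" is treated as page content. A minimal sketch, using a small made-up DataFrame in place of the saved CSV (the example URLs and values are placeholders, not real scraped data):

```python
import pandas as pd

# Toy stand-in for Enhanced_Scraped_Website_Content.csv (values are made up).
scraped = pd.DataFrame({
    'URL': ['https://example.com/', 'https://example.com/broken/'],
    'Title': ['Example Home', 'Error'],
    'Description': ['An example page.', 'Error'],
    'Content': ['Welcome to the example site.', 'Error'],
})

# A row is a failed scrape when all three text fields hold the placeholder.
failed_mask = (scraped[['Title', 'Description', 'Content']] == 'Error').all(axis=1)

failed_pages = scraped[failed_mask]    # pages to retry or investigate
usable_pages = scraped[~failed_mask]   # pages safe to pass to later steps

print(f"{len(failed_pages)} failed, {len(usable_pages)} usable")
```

With the real file, `scraped` would come from `pd.read_csv(scraped_file_path)` instead of the inline data.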

---


---
# **Part 2: Intent Classifier**
**What it does:**
- Prepares the scraped content by cleaning it further (e.g., removing special characters).
- Converts the cleaned text into numerical features using the TF-IDF technique, which helps in understanding the importance of words.
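- Groups the web pages into three clusters using KMeans clustering and labels each cluster with an intent category (Informational, Navigational, or Commercial). KMeans is unsupervised, so the labels come from a fixed cluster-to-intent mapping that should be checked against the pages in each cluster.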
- Maps each web page to its corresponding intent and saves the output in a new CSV file.

**Use Case:**
This part identifies the purpose of each webpage, which is critical for understanding whether it is performing as intended (e.g., a commercial page should drive sales).
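
To make the TF-IDF step more concrete, the sketch below computes the weights by hand for three tiny made-up documents. It follows the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, but uses plain whitespace tokenization, so it is an illustration rather than a drop-in replacement. The function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Scale each document vector to unit length; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({term: w / norm for term, w in raw.items()})
    return weights

# Made-up example documents.
docs = [
    "seo services pricing",
    "seo guide for beginners",
    "seo audit checklist",
]
for doc, doc_weights in zip(docs, tfidf_weights(docs)):
    print(doc, {term: round(w, 2) for term, w in doc_weights.items()})
```

Because "seo" appears in every document, it receives the lowest weight in each one, while terms unique to a single document score higher. That is the sense in which TF-IDF captures the importance of words.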

---


In [None]:
# Import necessary libraries
# pandas: for handling and manipulating tabular data.
# TfidfVectorizer: for converting text into numerical data based on term frequency.
# KMeans: for clustering data into distinct groups.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Step 1: Define file paths
# Specify where the input data is stored and where the output should be saved.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'  # Path to the folder
scraped_file_path = folder_path + 'Enhanced_Scraped_Website_Content.csv'  # Path to the scraped content file

# Step 2: Load the scraped data into a pandas DataFrame
# This reads the scraped website content into a structured tabular format.
scraped_data = pd.read_csv(scraped_file_path)

# Step 3: Preprocess the text data for classification
import re  # Regular expressions library for text cleaning

def preprocess_text(text):
    """
    Prepares text for analysis by:
    1. Converting it to lowercase: This ensures consistency in text matching.
    2. Removing special characters and numbers: Keeps only alphabets to focus on meaningful words.
    3. Stripping extra spaces: Cleans up unnecessary whitespace for better processing.

    Parameters:
        text (str): The original text content. Non-string values (e.g., NaN from an empty CSV cell) are treated as empty text.
    Returns:
        str: The cleaned and preprocessed text.
    """
    if not isinstance(text, str):  # Empty cells are read back from the CSV as NaN (a float)
        return ''
    text = text.lower()  # Convert text to lowercase for uniformity
    text = re.sub(r'[^a-z\s]', '', text)  # Remove special characters and numbers
    text = re.sub(r'\s+', ' ', text)  # Replace any run of whitespace with a single space
    return text.strip()  # Remove leading and trailing spaces

# Apply the preprocessing function to the 'Content' column
# This cleans the text data for further analysis.
scraped_data['Cleaned_Content'] = scraped_data['Content'].apply(preprocess_text)

# Step 4: Convert text into numerical features using TF-IDF Vectorizer
# TF-IDF (Term Frequency-Inverse Document Frequency) captures the importance of words in a document.
vectorizer = TfidfVectorizer(max_features=500)  # Keep only the 500 most frequent terms across all pages to limit the feature space
tfidf_matrix = vectorizer.fit_transform(scraped_data['Cleaned_Content'])  # Transform text into numerical vectors

# Step 5: Apply KMeans clustering to group pages by intent
# KMeans is an unsupervised machine learning algorithm that groups similar pages into clusters.
# Note: the cluster numbers (0, 1, 2) are arbitrary IDs; KMeans does not know which cluster matches which intent.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # 3 clusters for 3 intents; n_init is set explicitly so results don't depend on the scikit-learn version's default
scraped_data['Intent_Cluster'] = kmeans.fit_predict(tfidf_matrix)  # Assign each page to a cluster

# Step 6: Map the cluster labels to human-readable intent categories
def map_cluster_to_intent(cluster_label):
    """
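    Maps numerical cluster labels to intent categories:
    0: Informational
    1: Navigational
    2: Commercial

    Note: KMeans numbers its clusters arbitrarily, so this fixed mapping is an assumption.
    Review sample pages from each cluster and adjust the mapping if it does not fit.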

    Parameters:
        cluster_label (int): The numerical label assigned by KMeans.
    Returns:
        str: The corresponding intent category.
    """
    intent_mapping = {
        0: 'Informational',  # For pages providing detailed information
        1: 'Navigational',   # For pages guiding users to specific destinations
        2: 'Commercial'      # For pages designed to sell or promote products/services
    }
    return intent_mapping.get(cluster_label, 'Unknown')  # Default to 'Unknown' if no match is found

# Apply the mapping function to the cluster labels
scraped_data['Intent'] = scraped_data['Intent_Cluster'].apply(map_cluster_to_intent)

# Step 7: Save the classified data with intents to a new CSV file
# This stores the output for future use and analysis.
classified_file_path = folder_path + 'Classified_Content_With_Intent.csv'
scraped_data.to_csv(classified_file_path, index=False)  # Save without row indices for a cleaner file

# Step 8: Display a preview of the classified data
# Show the first 10 rows to validate the output.
print("\n--- Classified Data Preview ---")
print(scraped_data[['URL', 'Title', 'Description', 'Intent']].head(10))

# Step 9: Confirm where the file was saved
# Provide the path to the saved file for verification.
print(f"\nClassified data saved at: {classified_file_path}")



--- Classified Data Preview ---
                                                 URL  \
0                               https://thatware.co/   
1         https://thatware.co/advanced-seo-services/   
2    https://thatware.co/digital-marketing-services/   
3  https://thatware.co/business-intelligence-serv...   
4        https://thatware.co/link-building-services/   
5  https://thatware.co/branding-press-release-ser...   
6  https://thatware.co/conversion-rate-optimization/   
7        https://thatware.co/social-media-marketing/   
8  https://thatware.co/content-proofreading-servi...   
9       https://thatware.co/website-design-services/   

                                               Title  \
0  THATWARE® - Revolutionizing SEO with Hyper-Int...   
1  Advanced SEO Services - Professional SEO Agenc...   
2  Digital Marketing Services - Advanced Digital ...   
3  Business Intelligence Services - Competitive A...   
4  Link Building Services - Off Page SEO Agency -...   
5  PPC Paid Ma

---
# What Is This Output About?

This output is a **classified dataset of website pages**, where each page has been categorized based on its purpose (also known as **"intent"**).

For example:
- Some pages aim to sell services or products (commercial intent).
- Some pages are meant to provide information or answers (informational intent).
- Some pages guide visitors to specific content or locations on the site (navigational intent).

This classification helps you see **what purpose the content of each page appears to serve** and allows you to optimize these pages accordingly. It is derived from the page text alone, not from visitor behavior.

---

### Detailed Explanation of Each Column

#### 1. **`URL`**
   - This column lists the website links (web addresses) of the pages.
   - Each URL corresponds to a specific page on your website.
   - Example: `https://thatware.co/advanced-seo-services/` points to a page about advanced SEO services.

   **Purpose**:  
   - The URL shows which page is being analyzed and allows you to directly locate it.

---

#### 2. **`Title`**
   - The `Title` column contains the title of each webpage, taken from its `<title>` tag. Titles are what users see in search engine results and in the browser tab.
   - Example: The title for the homepage (`https://thatware.co/`) is "THATWARE® - Revolutionizing SEO with Hyper-Intelligence."

   **Purpose**:  
   - Titles summarize what the page is about and help users decide whether to click on the page.

---

#### 3. **`Description`**
   - The `Description` column provides a short summary of the content on each page.
   - Example: For `https://thatware.co/`, the description reads, "THATWARE® is the world's first SEO agency to specialize in hyper-intelligent search engine strategies."

   **Purpose**:  
   - Descriptions give users and search engines more information about the page’s purpose.
   - They appear below the title in search engine results.

---

#### 4. **`Intent`**
   - This is the most important column in the dataset. It classifies each page based on its **intent**, which means the purpose or reason why the page exists.
   - There are three types of intent in this dataset:
     - **Commercial**: Pages designed to promote or sell a service/product. Example: "Advanced SEO Services."
     - **Informational**: Pages providing helpful information or answering questions. Example: "Business Intelligence Services."
     - **Navigational**: Pages meant to direct users to specific parts of the website. Example: "Digital Marketing Services."

   **Purpose**:  
   - Knowing the intent helps you understand how to structure and optimize each page for better user engagement and search engine performance.
   - Example:
     - A **commercial page** should have strong calls to action and a clear focus on selling.
     - An **informational page** should include detailed, helpful content to answer user queries.
     - A **navigational page** should make it easy for users to find what they’re looking for.

---

#### **Classified Data File**
   - The last line mentions that this classified data has been saved as a file named:
     ```
     Classified_Content_With_Intent.csv
     ```
   - This means you now have a structured dataset that can be shared or analyzed further.

   **Purpose**:
   - This file allows you to work on the dataset offline or use it in tools like Excel, Tableau, or a machine learning model for further insights.

---

### How Is This Useful?

This classified data provides **valuable insights** for improving the website. Here’s how:

1. **Understanding User Behavior**:
   - By knowing the intent of each page, you can align it with the needs of your visitors.
   - Example: Ensure informational pages provide clear answers and commercial pages have strong CTAs (calls to action).

2. **Improving Content Strategy**:
   - Analyze which intents dominate your website. Are you missing content for a specific intent?
   - Example: If you lack informational pages, add FAQs or blogs to address user questions.

3. **Optimizing User Experience (UX)**:
   - Make it easier for visitors to achieve their goals (e.g., finding information or making a purchase).
   - Example: Ensure navigational pages have simple layouts and clear links.

4. **Enhancing SEO**:
   - Search engines prioritize pages that match user intent.
   - Example: Optimize titles and descriptions to better reflect the page’s purpose.
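
The question of which intents dominate the website (point 2 above) can be answered with a simple count. A minimal sketch, using made-up rows in place of `Classified_Content_With_Intent.csv`:

```python
import pandas as pd

# Toy stand-in for the classified dataset (URLs and labels are made up).
classified = pd.DataFrame({
    'URL': [
        'https://example.com/',
        'https://example.com/services/',
        'https://example.com/contact/',
        'https://example.com/blog/what-is-seo/',
    ],
    'Intent': ['Commercial', 'Navigational', 'Navigational', 'Informational'],
})

expected_intents = ['Informational', 'Navigational', 'Commercial']

# Count pages per intent; reindex so an intent with no pages shows up as 0.
intent_counts = (
    classified['Intent']
    .value_counts()
    .reindex(expected_intents, fill_value=0)
)
intent_share = (intent_counts / len(classified)).round(2)

print(pd.DataFrame({'pages': intent_counts, 'share': intent_share}))
```

An intent with a count of 0 or a very small share points to a possible content gap.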

---


---
# **Part 3: Dynamic Recommendations Generator**
**What it does:**
- Analyzes the content of each web page based on its intent.
- Generates recommendations dynamically to improve the effectiveness of each page.
  - For commercial pages: Add CTAs (calls to action) when none are detected, and expand pages with fewer than 300 words.
  - For informational pages: Add depth (e.g., FAQs, examples) to short pages, and add internal links.
  - For navigational pages: Add clear navigation CTAs when none are detected.
- Saves the recommendations in a new CSV file.

**Use Case:**
This part provides actionable suggestions for improving each page’s performance based on its intent, ensuring alignment with business goals.

---


In [None]:
# Import necessary libraries
import pandas as pd  # For handling and manipulating tabular data.
import re  # For pattern matching and text searching.

# Step 1: Define the file path and load the data
# This step specifies where the input data is located and reads it into a structured DataFrame.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'  # Path to the folder containing the data.
classified_file_path = folder_path + 'Classified_Content_With_Intent.csv'  # File path for the classified content.

# Load the data into a pandas DataFrame
# The DataFrame will hold the structured content data for analysis.
classified_data = pd.read_csv(classified_file_path)

# Step 2: Define Helper Functions for Content Analysis
# These functions will analyze the content dynamically and provide insights.

def contains_cta(content):
    """
    Checks if the content contains common call-to-action (CTA) phrases.
    Use Case:
    - CTAs encourage user actions like contacting or purchasing. If missing, the content may not drive conversions effectively.
    """
    cta_phrases = ['contact us', 'buy now', 'sign up', 'get started', 'learn more']  # Common CTAs in marketing.
    # Check if any CTA phrase is present in the content (case-insensitive).
    return any(phrase in content.lower() for phrase in cta_phrases)

def content_depth(content):
    """
    Analyzes the depth of the content based on word count.
    Use Case:
    - Ensures the content is detailed enough to engage users and provide value.
    - Returns 'shallow' for less than 300 words, 'adequate' otherwise.
    """
    word_count = len(content.split())  # Count the number of words in the content.
    return 'shallow' if word_count < 300 else 'adequate'

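def contains_internal_links(content):
    """
    Checks if the content contains internal links (links pointing to other pages on the same site).
    Use Case:
    - Internal links improve navigation and SEO by connecting related content.
    Limitation:
    - This check only works on raw HTML. The 'Content' column holds plain text extracted
      with get_text(), which contains no <a> tags, so the check returns False for it.
    """
    # Look for anchor tags (<a>) whose href attribute contains "thatware.co".
    # Other attributes may appear before href (e.g., <a class="btn" href="...">).
    return bool(re.search(r'<a\b[^>]*\bhref="[^"]*thatware\.co[^"]*"', content))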

# Step 3: Generate Dynamic Recommendations
def generate_dynamic_recommendations(row):
    """
    Generates dynamic recommendations for each piece of content based on its intent and characteristics.
    Use Case:
    - Provides actionable suggestions to improve the content's performance.
    """
    intent = row['Intent']  # Extract the intent (e.g., Commercial, Informational, Navigational).
    content = row['Content'] if isinstance(row['Content'], str) else ''  # Extract the content text; treat missing values (NaN) as empty.

    recommendations = []  # Initialize an empty list to hold the recommendations.

    # Recommendations for Commercial content
    if intent == 'Commercial':
        if not contains_cta(content):  # Check for missing CTAs.
            recommendations.append('Add prominent CTAs (e.g., Buy Now, Contact Us).')
        if content_depth(content) == 'shallow':  # Check if content is too short.
            recommendations.append('Expand the content to provide more details.')

    # Recommendations for Informational content
    elif intent == 'Informational':
        if content_depth(content) == 'shallow':  # Check for insufficient depth in informational content.
            recommendations.append('Add more depth to the content (e.g., FAQs, examples).')
        if not contains_internal_links(content):  # Check for missing internal links.
            recommendations.append('Add internal links to related articles or pages.')

    # Recommendations for Navigational content
    elif intent == 'Navigational':
        if not contains_cta(content):  # Ensure CTAs guide users to the intended destination.
            recommendations.append('Add clear navigation CTAs (e.g., Visit Services, Explore Options).')

    # Return the recommendations as a single string
    return ' '.join(recommendations) if recommendations else 'No specific recommendations needed.'

# Apply the function to generate recommendations for each row
# This dynamically creates tailored suggestions for improvement.
classified_data['Recommendations'] = classified_data.apply(generate_dynamic_recommendations, axis=1)

# Step 4: Save the data with dynamic recommendations
# Save the updated DataFrame to a new CSV file for further use or analysis.
dynamic_recommendation_file_path = folder_path + 'Classified_Content_With_Dynamic_Recommendations.csv'
classified_data.to_csv(dynamic_recommendation_file_path, index=False)  # Save without row indices for cleaner output.

# Step 5: Display a preview of the updated data
# Show the first 10 rows to validate that the recommendations were generated correctly.
print("\n--- Data with Dynamic Recommendations Preview ---")
print(classified_data[['URL', 'Intent', 'Recommendations']].head(10))

# Step 6: Confirm file save location
# Print the path to the saved file for easy reference and verification.
print(f"\nDynamic recommendations saved at: {dynamic_recommendation_file_path}")



--- Data with Dynamic Recommendations Preview ---
                                                 URL         Intent  \
0                               https://thatware.co/     Commercial   
1         https://thatware.co/advanced-seo-services/     Commercial   
2    https://thatware.co/digital-marketing-services/   Navigational   
3  https://thatware.co/business-intelligence-serv...  Informational   
4        https://thatware.co/link-building-services/   Navigational   
5  https://thatware.co/branding-press-release-ser...  Informational   
6  https://thatware.co/conversion-rate-optimization/   Navigational   
7        https://thatware.co/social-media-marketing/  Informational   
8  https://thatware.co/content-proofreading-servi...   Navigational   
9       https://thatware.co/website-design-services/   Navigational   

                                     Recommendations  
0                No specific recommendations needed.  
1                No specific recommendations needed.  
2 

---
# What Is This Output About?

This output is a **classified dataset of website pages**, where each page has:
- **Intent classification**: The reason why this page exists (e.g., to sell, to inform, or to navigate users).
- **Dynamic recommendations**: Specific suggestions for improving each page based on its intent and content.

This helps you optimize your website by:
1. Ensuring each page fulfills its intended purpose.
2. Improving user experience and search engine performance.
3. Making navigation, content, and calls-to-action (CTAs) clearer.

---

### Detailed Explanation of Each Column

#### **1. `URL`**
   - This is the link (web address) for each page of your website.
   - Example: `https://thatware.co/advanced-seo-services/` is the page about advanced SEO services.

   **Purpose**:
   - This column tells you which specific page is being analyzed.

---

#### **2. `Intent`**
   - This column classifies each page based on its purpose, or "intent."
   - There are three types of intent in this dataset:
     - **Commercial**: Pages that promote or sell services/products.  
       Example: `https://thatware.co/advanced-seo-services/`.
     - **Informational**: Pages that provide helpful content or answer questions.  
       Example: `https://thatware.co/business-intelligence-services/`.
     - **Navigational**: Pages that guide users to specific sections of your website.  
       Example: `https://thatware.co/digital-marketing-services/`.

   **Purpose**:
   - Understanding the intent helps you ensure that each page is aligned with user expectations and serves its purpose effectively.

---

#### **3. `Recommendations`**
   - This column provides specific, actionable suggestions for each page based on its intent and content.
   - Example:
     - For informational pages, it suggests adding **internal links** to related articles or pages. This helps users find additional information easily.
     - For navigational pages, it suggests adding **clear navigation CTAs (calls-to-action)** like "Visit Services" or "Explore Options" when no common CTA phrase is found.

   **Purpose**:
   - Recommendations guide you in improving the structure, navigation, and content of each page.
   - By following these suggestions, you can enhance user engagement, improve SEO, and make your website more effective.
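
One caveat about the internal-link suggestion: `contains_internal_links` searches for `<a href="...">` markup, but the `Content` column holds plain text produced by `get_text()`, which contains no tags. The check therefore cannot find links in that column. A possible workaround is to count internal links from the raw HTML while scraping. The sketch below uses Python's built-in `html.parser` rather than the notebook's BeautifulSoup, and `count_internal_links` is a hypothetical helper; the HTML snippet is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

def count_internal_links(html, page_url):
    """Count links that resolve to the same host as page_url."""
    collector = _LinkCollector()
    collector.feed(html)
    site_host = urlparse(page_url).netloc.lower()
    return sum(
        1 for href in collector.hrefs
        # urljoin turns relative links such as "/about/" into absolute URLs.
        if urlparse(urljoin(page_url, href)).netloc.lower() == site_host
    )

# Made-up HTML snippet: two internal links and one external link.
html = '''
<main>
  <a href="/advanced-seo-services/">SEO</a>
  <a class="btn" href="https://thatware.co/contact-us/">Contact</a>
  <a href="https://example.com/">Elsewhere</a>
</main>
'''
internal_link_count = count_internal_links(html, 'https://thatware.co/')
print(internal_link_count)
```

Storing such a count as an extra column at scrape time would let the recommendation step test `count > 0` instead of searching plain text for tags.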

---

### Explanation of the Output Data

Here’s what each column in the dataset represents:

| **Column**       | **Explanation**                                                                 |
|-------------------|---------------------------------------------------------------------------------|
| **URL**          | The link to a specific webpage.                                                 |
| **Intent**       | The purpose of the page: selling (Commercial), informing (Informational), or guiding (Navigational). |
| **Recommendations** | Suggestions for improving each page based on its intent.                     |

Now let’s look at the dataset examples:

1. **Row 0 (Homepage)**:
   - **URL**: `https://thatware.co/`
   - **Intent**: Commercial (the homepage is designed to promote the business).
   - **Recommendations**: No specific recommendations needed. For a commercial page, this means the checks found at least one common CTA phrase and at least 300 words of content.

2. **Row 1 (Advanced SEO Services)**:
   - **URL**: `https://thatware.co/advanced-seo-services/`
   - **Intent**: Commercial (this page is promoting the "Advanced SEO Services").
   - **Recommendations**: No specific recommendations needed; the page passed the same CTA and word-count checks.

3. **Row 3 (Business Intelligence Services)**:
   - **URL**: `https://thatware.co/business-intelligence-services/`
   - **Intent**: Informational (the page provides information about business intelligence services).
   - **Recommendations**: Suggests adding **internal links** to related articles or services.  
     Example: Link to "Digital Marketing Services" or "Advanced SEO Services" pages.

4. **Row 6 (Conversion Rate Optimization)**:
   - **URL**: `https://thatware.co/conversion-rate-optimization/`
   - **Intent**: Navigational (this page guides users to learn about CRO services).
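   - **Recommendations**: For navigational pages, the generator suggests adding **clear navigation CTAs** (e.g., "Visit Services" or "Explore Options") whenever it finds none of the common CTA phrases in the content.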

---

### Key Takeaways for the Client

This dataset provides:
1. **A Clear Overview**:
   - It shows the purpose (intent) of every page on your website.
   - It flags pages where simple checks (CTA phrases, word count, internal links) suggest room for improvement.

2. **Actionable Insights**:
   - Each recommendation is tailored to the page’s intent, making it easy to improve the page.

3. **Improved User Experience**:
   - Following the recommendations ensures that users can easily find what they’re looking for and take the desired action.

4. **SEO Benefits**:
   - Adding internal links and clear CTAs can improve your website’s ranking and engagement.

---

### Next Steps for the Client

1. **Review Each Page**:
   - Compare the "intent" with the current content. Does the content match the page’s purpose?

2. **Implement Recommendations**:
   - Add internal links on informational pages to help users explore related topics.
   - Add CTAs on navigational pages to make it easier for users to find what they need.

3. **Track Performance**:
   - Use tools like Google Analytics to monitor the performance of each page after making changes.

4. **Utilize the Classified File**:
   - Open the saved file `/content/drive/MyDrive/Dataset For Semantic Search Intent Model/Classified_Content_With_Dynamic_Recommendations.csv` in Excel or another tool to share it with your team or analyze further.

---


---
# **Part 4: Dataset Merger**
**What it does:**
- Combines the dynamic recommendations with content performance metrics (e.g., CTR, clicks, impressions).
- Ensures data consistency by normalizing identifiers (e.g., converting URLs to lowercase, removing extra spaces).
- Merges the datasets to create a comprehensive view of recommendations and performance for each web page.

**Use Case:**
This part enriches the recommendations with performance data, helping to prioritize which pages need optimization based on their performance.
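
A left join keeps every recommendation row, but it can silently leave pages without metrics (when identifiers don't match) or duplicate pages (when the metrics file lists an identifier more than once). A minimal sketch of how the `indicator` and `validate` options of `pd.merge` can surface both cases; the toy frames and values are made up, while the column names follow the notebook:

```python
import pandas as pd

# Made-up stand-ins for the recommendations and performance files.
recommendations = pd.DataFrame({
    'URL': ['https://example.com/', 'https://example.com/Pricing/ '],
    'Recommendations': ['No specific recommendations needed.', 'Add prominent CTAs.'],
})
performance = pd.DataFrame({
    'Identifier': ['https://example.com/', 'https://example.com/blog/'],
    'Clicks': [120, 45],
})

# Same normalization as the merge step: trim whitespace and lowercase.
recommendations['Identifier_Content'] = recommendations['URL'].str.strip().str.lower()
performance['Identifier'] = performance['Identifier'].str.strip().str.lower()

merged = pd.merge(
    recommendations,
    performance,
    left_on='Identifier_Content',
    right_on='Identifier',
    how='left',
    validate='many_to_one',  # raises MergeError if an Identifier repeats in performance
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Pages that kept their recommendations but found no performance row.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'URL'].tolist()
print(f"Rows: {len(merged)}, without performance data: {len(unmatched)}")
```

Pages listed in `unmatched` have empty performance columns, so they should be reviewed before any prioritization based on clicks or CTR.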

---


In [None]:
# Import the pandas library
import pandas as pd  # This library is used to handle structured data in tabular formats like CSV files.

# Step 1: Define the file paths for the datasets
# These file paths specify where the datasets are stored. The folder_path points to the Google Drive location.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'
enhanced_data_path = folder_path + 'Classified_Content_With_Dynamic_Recommendations.csv'  # File with dynamic recommendations.
content_data_path = folder_path + 'Merged_Content_Performance.csv'  # File with content performance metrics.

# Step 2: Load the datasets into pandas DataFrames
# The data is loaded into a structured format (DataFrame) for analysis.
enhanced_data = pd.read_csv(enhanced_data_path)  # Enhanced data contains dynamic recommendations for URLs.
content_data = pd.read_csv(content_data_path)  # Content data contains performance metrics for various identifiers.

# Step 3: Normalize identifiers to ensure consistent matching during merging
# Use Case:
# Identifiers in different datasets may have inconsistent formats (e.g., uppercase, trailing spaces).
# Normalization ensures that the comparison is case-insensitive and ignores extra spaces.

# Normalize 'URL' column in enhanced_data for consistency
enhanced_data['Identifier_Content'] = enhanced_data['URL'].str.strip().str.lower()
# Normalize 'Identifier' column in content_data for consistency
content_data['Identifier'] = content_data['Identifier'].str.strip().str.lower()

# Step 4: Merge the datasets
# Use Case:
# Combining the datasets allows us to enrich the enhanced_data with additional performance metrics from content_data.
# This gives a comprehensive view of recommendations and performance.

merged_data = pd.merge(
    enhanced_data,  # Left dataset: Contains URLs with recommendations.
    content_data,   # Right dataset: Contains performance metrics for identifiers.
    left_on='Identifier_Content',  # Match normalized URLs from enhanced_data.
    right_on='Identifier',         # Match normalized identifiers from content_data.
    how='left',  # Use 'left' join to retain all rows from enhanced_data even if no match is found in content_data.
    suffixes=('', '_Content')  # Add suffixes to columns from content_data to avoid name conflicts.
)

# Step 5: Validate the merging process
# Use Case:
# Validation ensures that the merge worked as intended and all necessary data is present in the final dataset.

print("\n--- Post-Merge Validation ---")
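# Display the total number of rows. A left join keeps every row of enhanced_data, so this should equal
# len(enhanced_data) unless content_data contains duplicate identifiers, which would repeat rows.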
print(f"Total rows in merged data: {len(merged_data)}")
# Display all column names to confirm the structure of the merged dataset.
print(f"Columns in merged data: {merged_data.columns.tolist()}")

# Step 6: Save the final merged dataset
# Use Case:
# Saving the final output allows us to use it in later steps of the model pipeline or analysis.

final_merged_output_path = folder_path + 'Final_Merged_Without_Traffic.csv'  # Path for the final merged dataset.
merged_data.to_csv(final_merged_output_path, index=False)  # Save the DataFrame to a CSV file without row indices.
print(f"Final merged output saved at: {final_merged_output_path}")  # Confirm the file save location.

# Step 7: Preview the final merged dataset
# Use Case:
# Previewing the first few rows helps validate that the merged dataset contains the expected data.

print("\n--- Final Output Preview ---")
print(merged_data.head(17))  # Display the first 17 rows to inspect the merged dataset's content and structure.



--- Post-Merge Validation ---
Total rows in merged data: 16
Columns in merged data: ['URL', 'Title', 'Description', 'Content', 'Cleaned_Content', 'Intent_Cluster', 'Intent', 'Recommendations', 'Identifier_Content', 'Identifier', 'Clicks', 'Impressions', 'CTR', 'Position', 'Context']
Final merged output saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Final_Merged_Without_Traffic.csv

--- Final Output Preview ---
                                                  URL  \
0                                https://thatware.co/   
1          https://thatware.co/advanced-seo-services/   
2     https://thatware.co/digital-marketing-services/   
3   https://thatware.co/business-intelligence-serv...   
4         https://thatware.co/link-building-services/   
5   https://thatware.co/branding-press-release-ser...   
6   https://thatware.co/conversion-rate-optimization/   
7         https://thatware.co/social-media-marketing/   
8   https://thatware.co/content-proofreading-

---
# What Does This Output Represent?

This is a **website performance report** based on search engine data. It provides insights into how specific web pages on the site are performing in terms of visibility, user engagement, and search rankings.

It helps answer questions like:
- How often are your pages shown to users in search results?
- Are users clicking on these pages after seeing them in search results?
- What is the average ranking of your pages on Google?

This data is useful for improving **SEO (Search Engine Optimization)** and helps the website owner make better decisions to increase visibility and user engagement.

---

### Detailed Explanation of Each Column

Let’s go through each column step by step.

---

#### **1. Identifier (URL)**
   - This column lists the URLs (links) of the web pages being analyzed.
   - Each row corresponds to a specific web page on the website.

**Example**:  
   - `https://thatware.co/` → Refers to the homepage.
   - `https://thatware.co/advanced-seo-services/` → Refers to the "Advanced SEO Services" page.

**Purpose**:  
   - This tells you which page the data belongs to. You can easily locate the performance metrics of specific pages.

---

#### **2. Clicks**
   - This shows the **number of times users clicked on the page** after it appeared in search results.

**Example**:  
   - The homepage (`https://thatware.co/`) received **3420 clicks**, meaning 3420 users clicked on it after seeing it in search results.
   - The "Advanced SEO Services" page received **109 clicks**.

**Why it matters**:  
   - Clicks are a direct indicator of how engaging the page’s title and description are in search results.
   - Pages with low clicks may need better titles, descriptions, or content to attract more users.

---

#### **3. Impressions**
   - This shows the **number of times the page appeared in search results**, regardless of whether users clicked on it or not.

**Example**:  
   - The "Advanced SEO Services" page appeared **203,570 times** in search results.
   - The homepage appeared **116,255 times**.

**Why it matters**:  
   - Impressions show how often your page is visible to users in searches.  
   - If a page has high impressions but low clicks, it suggests that users see the page but are not interested in clicking.

---

#### **4. CTR (Click-Through Rate)**
   - This is the **percentage of impressions that resulted in clicks**.  
     **Formula**: CTR = (Clicks / Impressions) × 100

**Example**:  
   - The homepage has a CTR of **2.94%** (3420 ÷ 116,255 × 100), meaning roughly 3 of every 100 impressions led to a click.
   - The "Advanced SEO Services" page has a CTR of **0.05%**, meaning it’s not attracting many clicks despite high impressions.

**Why it matters**:  
   - A low CTR indicates that the page’s title or description might not be compelling enough to encourage users to click.  
   - Improving the title, description, or content relevance can increase CTR.
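
The CTR formula above can be checked against the click and impression figures quoted in this section. This is a small sketch: the function name `click_through_rate` is hypothetical, and returning `0.0` for a page with no impressions is an assumption made to avoid division by zero.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage; 0.0 when there are no impressions (assumption)."""
    if impressions == 0:
        return 0.0
    return clicks / impressions * 100


# Figures quoted above for the homepage and the "Advanced SEO Services" page.
homepage_ctr = click_through_rate(3420, 116_255)
seo_services_ctr = click_through_rate(109, 203_570)

print(f"Homepage CTR: {homepage_ctr:.2f}%")                   # Homepage CTR: 2.94%
print(f"Advanced SEO Services CTR: {seo_services_ctr:.2f}%")  # Advanced SEO Services CTR: 0.05%
```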

---

#### **5. Position**
   - This shows the **average ranking of the page** in search results.  
   - Lower numbers mean better rankings (e.g., Positions 1–10 usually fall on the first page, Positions 11–20 on the second, and so on).

**Example**:  
   - The homepage has an average position of **73.14**, meaning it appears on the 8th page of search results (since 10 results are shown per page).
   - The "Advanced SEO Services" page has a position of **68.45**.

**Why it matters**:  
   - Higher-ranked pages (Positions 1–10) get more visibility and clicks.  
   - Pages with low positions need better SEO strategies to rank higher.
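
The page estimate used in the example can be reproduced with a one-line calculation. This sketch assumes 10 results per page, as stated above, and the helper name `results_page` is hypothetical. Because the position is an average over many queries, the result is only an approximation of where the page typically appears.

```python
import math


def results_page(position: float, results_per_page: int = 10) -> int:
    """Estimate which results page an average position falls on."""
    return math.ceil(position / results_per_page)


print(results_page(73.14))  # 8 (homepage)
print(results_page(68.45))  # 7 ("Advanced SEO Services" page)
print(results_page(11))     # 2 (first result on the second page)
```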

---

#### **6. Context**
   - This column indicates that the data is specific to "Pages" (web pages of the site).

**Why it matters**:  
   - It provides clarity that the metrics shown are for individual web pages, not for other elements like videos, images, or ads.

---

### Key Insights From This Data

1. **Homepage Performance**:  
   - The homepage gets the most clicks (3420) but has a poor average position (73.14).  
   - **Action**: Improve the SEO of the homepage by adding relevant keywords and improving content to rank higher.

2. **"Advanced SEO Services" Page**:  
   - This page has very high impressions (203,570) but a low CTR (0.05%).  
   - **Action**: Rewrite the page’s title and meta description to make it more engaging for users.

3. **Pages With Low Clicks**:  
   - Pages like "Bug Testing Services" and "Competitor Keyword Analysis" have very few clicks (e.g., 8 and 10, respectively).
   - **Action**: Enhance these pages with better content and keywords to attract more users.

4. **High-Impression Pages With Low Rankings**:  
   - Many pages have high impressions but poor rankings (e.g., Position 68–75).  
   - **Action**: Focus on improving on-page SEO, adding backlinks, and creating high-quality, relevant content.

5. **CTR Improvements**:  
   - Pages with a CTR below 1% need urgent attention. Titles and descriptions should be more compelling and relevant to user queries.
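
One way to surface the "high impressions, low CTR" pages mentioned above is a simple pandas filter. The sketch below uses only the two pages whose figures are quoted in this section. The 1% CTR cut-off comes from the text, while the `HIGH_IMPRESSIONS` value of 100,000 is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Clicks and impressions quoted above for two pages.
pages = pd.DataFrame({
    "URL": [
        "https://thatware.co/",
        "https://thatware.co/advanced-seo-services/",
    ],
    "Clicks": [3420, 109],
    "Impressions": [116_255, 203_570],
})
pages["CTR_pct"] = pages["Clicks"] / pages["Impressions"] * 100

HIGH_IMPRESSIONS = 100_000  # hypothetical cut-off for "very visible"
LOW_CTR_PCT = 1.0           # CTR below 1% is flagged in the text above

# Pages that are shown often but rarely clicked: candidates for a new
# title and meta description.
needs_new_snippet = pages[
    (pages["Impressions"] >= HIGH_IMPRESSIONS) & (pages["CTR_pct"] < LOW_CTR_PCT)
]
print(needs_new_snippet["URL"].tolist())
```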

---

### Steps the Website Owner Should Take

1. **Optimize Titles and Descriptions**:
   - For pages with low CTR, rewrite the title and description to match what users are searching for.
   - Example: Use phrases like "Get Professional SEO Services Today!" to make it more attractive.

2. **Improve Rankings**:
   - Focus on pages with high impressions but poor rankings (e.g., Positions 68+).  
   - Add relevant keywords, create fresh content, and build backlinks to improve rankings.

3. **Enhance Content**:
   - Pages with low clicks or impressions need better-quality content that aligns with user intent.  
   - Example: Add FAQs, case studies, or visuals to engage users.

4. **Track Performance**:
   - Regularly monitor metrics like clicks, impressions, CTR, and rankings to see if your changes are effective.

5. **Focus on High-Performing Pages**:
   - Pages with high clicks and impressions, like the homepage, should be maintained and further optimized to retain their traffic.

---
# **Part 5: Insight Generator**
**What it does:**
- Analyzes the merged dataset to generate three types of insights for each page:
  1. **Recommendations:** Suggestions for improving user engagement or conversions.
  2. **SEO Insights:** Specific SEO actions, such as improving meta descriptions or targeting better keywords.
  3. **Content Gaps:** Missing elements in the content, such as FAQs, pricing, or internal links.
- Saves the enhanced dataset with these insights for further action.

**Use Case:**
This part provides a detailed, structured plan for improving each web page's content, SEO, and user experience.

---


In [None]:
# Import the pandas library for handling tabular data
import pandas as pd

# Step 1: Load the Merged Dataset
# Use Case:
# The merged dataset combines information about URLs, their content, and performance metrics (e.g., CTR, Impressions, Clicks).
# Loading this dataset into a DataFrame allows us to manipulate and analyze it.

# Define the folder path where the dataset is stored
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'

# Define the file path for the merged dataset
merged_data_path = folder_path + 'Final_Merged_Without_Traffic.csv'

# Read the merged dataset into a pandas DataFrame
merged_data = pd.read_csv(merged_data_path)

# Step 2: Add Recommendations, SEO Insights, and Content Gaps Columns
# Use Case:
# These columns will store dynamically generated suggestions to improve the performance of each URL.
# 'SEO_Insights' and 'Content_Gaps' are new columns; 'Recommendations' already exists in the merged file
# and is reset here so that it can be regenerated below.

merged_data['Recommendations'] = ''  # Placeholder for improvement recommendations
merged_data['SEO_Insights'] = ''  # Placeholder for actionable SEO insights
merged_data['Content_Gaps'] = ''  # Placeholder for identifying missing content opportunities

# Define thresholds for analysis
# Use Case:
# Setting thresholds helps us decide when specific recommendations or insights should be triggered.
CTR_THRESHOLD = 1.0  # Minimum acceptable click-through rate (CTR) percentage
IMPRESSION_THRESHOLD = 1000  # Minimum impressions for meaningful visibility

# Step 3: Iterate Through Rows to Populate Insights
# Use Case:
# For each row (representing a URL), we analyze its intent, CTR, impressions, and content to generate tailored recommendations.

# Loop through each row in the DataFrame
for index, row in merged_data.iterrows():
    intent = row['Intent']  # Identify the intent of the current page (e.g., Commercial, Informational, Navigational)
    recommendations = []  # Initialize a list to store recommendations for this row
    seo_insights = []  # Initialize a list to store SEO insights for this row
    content_gaps = []  # Initialize a list to store identified content gaps for this row

    # Generate Recommendations Based on Intent
    # Use Case:
    # Tailored recommendations based on the intent of the page help improve its effectiveness for users.
    # Lowercase the page text once so every keyword check is case-insensitive; missing content is treated as empty text
    content_text = row['Content'].lower() if isinstance(row['Content'], str) else ''
    if intent == 'Commercial':  # If the intent is to drive sales or conversions
        recommendations.append("Add clear CTAs to encourage user actions.")  # Suggest adding strong Call-to-Actions
        recommendations.append("Highlight unique selling points (USPs) prominently.")  # Emphasize key product/service benefits
        if 'pricing' not in content_text:  # Check if pricing information is missing
            content_gaps.append("Include detailed pricing or discount information.")  # Suggest adding pricing details
    elif intent == 'Informational':  # If the intent is to educate or inform users
        recommendations.append("Add internal links to related pages or guides.")  # Suggest linking to related topics
        recommendations.append("Provide detailed, value-driven content such as FAQs or tutorials.")  # Suggest adding FAQs
        if 'faq' not in content_text:  # Check if FAQs are missing (matches "FAQ", "FAQs", "faq", ...)
            content_gaps.append("Consider adding FAQs to address common user questions.")  # Highlight FAQ as a content gap
    elif intent == 'Navigational':  # If the intent is to help users navigate
        recommendations.append("Ensure smooth navigation to the target resource.")  # Suggest improving navigation
        recommendations.append("Add breadcrumbs or context-specific navigation options.")  # Suggest adding breadcrumbs
        if 'contact' not in content_text:  # Check if a contact section is missing
            content_gaps.append("Add a clear contact or inquiry section.")  # Suggest adding a contact section

    # Generate SEO Insights
    # Use Case:
    # SEO insights help identify specific actions to improve search visibility and user engagement.
    ctr_value = float(row['CTR'].strip('%'))  # Convert CTR (stored as a string with '%') into a float value
    impressions = row['Impressions']  # Extract the number of impressions for the URL
    clicks = row['Clicks']  # Extract the number of clicks for the URL

    if ctr_value < CTR_THRESHOLD:  # If the CTR is below the defined threshold
        seo_insights.append("Improve meta descriptions or headlines to boost CTR.")  # Suggest optimizing meta descriptions
    if impressions < IMPRESSION_THRESHOLD:  # If impressions are below the threshold
        seo_insights.append("Focus on high-volume keywords to increase impressions.")  # Suggest targeting high-volume keywords
    if clicks == 0:  # If the URL has received no clicks
        seo_insights.append("No clicks received; reassess keyword strategy or content relevance.")  # Suggest revisiting keyword strategy

    # Populate the DataFrame with the dynamically generated values
    # Use Case:
    # Adding the generated recommendations, insights, and content gaps into the DataFrame for each URL.
    merged_data.at[index, 'Recommendations'] = '; '.join(recommendations)  # Join the list of recommendations into a single string
    merged_data.at[index, 'SEO_Insights'] = '; '.join(seo_insights)  # Join the list of SEO insights into a single string
    merged_data.at[index, 'Content_Gaps'] = '; '.join(content_gaps)  # Join the list of content gaps into a single string

# Step 4: Save the Updated Dataset
# Use Case:
# Saving the updated dataset ensures that all the dynamically generated insights are preserved for further analysis or use.

# Define the file path for the updated dataset
final_output_path = folder_path + 'Improved_Dynamic_Final_Output.csv'

# Save the DataFrame to a CSV file
merged_data.to_csv(final_output_path, index=False)  # Export the DataFrame without row indices
print(f"Improved dataset with all columns saved at: {final_output_path}")  # Confirm the save location

# Step 5: Display a Preview of the Updated Dataset
# Use Case:
# Previewing the updated dataset allows for quick validation of the generated insights and recommendations.

print("\n--- Final Output Preview with All Columns ---")
print(merged_data.head(17))  # Display the first 17 rows of the updated dataset


Improved dataset with all columns saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Improved_Dynamic_Final_Output.csv

--- Final Output Preview with All Columns ---
                                                  URL  \
0                                https://thatware.co/   
1          https://thatware.co/advanced-seo-services/   
2     https://thatware.co/digital-marketing-services/   
3   https://thatware.co/business-intelligence-serv...   
4         https://thatware.co/link-building-services/   
5   https://thatware.co/branding-press-release-ser...   
6   https://thatware.co/conversion-rate-optimization/   
7         https://thatware.co/social-media-marketing/   
8   https://thatware.co/content-proofreading-servi...   
9        https://thatware.co/website-design-services/   
10      https://thatware.co/web-development-services/   
11      https://thatware.co/app-development-services/   
12  https://thatware.co/website-maintenance-services/   
13          h

### Understanding the Output

This output provides a detailed **SEO (Search Engine Optimization) analysis** of specific web pages on a website. It evaluates **click performance**, **search engine impressions**, and suggests improvements in areas like **meta descriptions**, **content gaps**, and user engagement. The output is designed to help the website owner understand how their pages are performing and what steps to take to improve visibility and user interaction.

---

### Detailed Explanation of Each Column

Let’s go through each column one by one.

---

#### **1. Identifier (URL)**

- This is the **web address** (link) for a specific page on the website.
- Each row corresponds to one page.

**Example**:
- `https://thatware.co/` refers to the homepage.
- `https://thatware.co/advanced-seo-services/` refers to the "Advanced SEO Services" page.

**Purpose**:
- This column identifies the specific page being analyzed so you can link performance data back to the right page.

---

#### **2. Clicks**

- This shows how many times users **clicked on the page link** after seeing it in search results.

**Example**:
- The homepage (`https://thatware.co/`) received **3420 clicks**.
- The "Advanced SEO Services" page received only **109 clicks**.

**Why It Matters**:
- A higher number of clicks indicates that users find the page attractive or relevant when they see it in search results.
- Pages with low clicks might need better content or more compelling meta descriptions to attract attention.

---

#### **3. Impressions**

- This represents the number of times the page appeared in **search engine results** (like Google), regardless of whether users clicked on it or not.

**Example**:
- The "Link Building Services" page appeared **196,223 times** in search results.
- The "Bug Testing Services" page appeared only **2319 times**.

**Why It Matters**:
- High impressions show that the page is visible in search engines, but if clicks are low, the page might not be attractive enough to users.

---

#### **4. CTR (Click-Through Rate)**

- CTR is the **percentage of impressions that turned into clicks**.  
  **Formula**: CTR = (Clicks ÷ Impressions) × 100

**Example**:
- The homepage has a CTR of **2.94%**, meaning out of every 100 times it appears in search results, about 3 users click on it.
- The "Competitor Keyword Analysis" page has a CTR reported as **0%**, meaning virtually none of its impressions turned into clicks.

**Why It Matters**:
- Low CTR means the page needs better titles or descriptions in search results to make users click on it.

---

#### **5. Position**

- This is the **average ranking** of the page in search results. Lower numbers are better.

**Example**:
- The homepage is ranked at **73.14**, meaning it often appears on the 8th page of search results (10 results per page).
- The "Competitor Keyword Analysis" page has a position of **74.92**, which is an even worse ranking.

**Why It Matters**:
- A high (bad) position number means users are unlikely to see the page. You need to improve the page’s content and backlinks to rank higher.

---

#### **6. Context**

- This column simply notes that the data applies to **pages** (not videos, images, or ads).

---

#### **7. SEO Insights**

- These are **recommendations** for improving the page’s visibility and engagement.  

**Example**:
- For the "Advanced SEO Services" page:  
  _“Improve meta descriptions or headlines to boost CTR.”_  
  This means the title and description shown in search results should be rewritten to attract more clicks.

**Why It Matters**:
- Following these insights can directly improve the CTR, impressions, and ranking of the page.
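
The rules behind this column can be summarized as a small function. The thresholds and messages are taken from the Part 5 code; wrapping them in a function named `seo_insights` is an assumption made for this sketch so the rules can be tried on individual pages.

```python
CTR_THRESHOLD = 1.0          # minimum acceptable CTR, in percent
IMPRESSION_THRESHOLD = 1000  # minimum impressions for meaningful visibility


def seo_insights(ctr: str, impressions: int, clicks: int) -> list[str]:
    """Return the SEO insights triggered by one page's metrics."""
    insights = []
    ctr_value = float(ctr.strip().rstrip("%"))  # "0.05%" -> 0.05
    if ctr_value < CTR_THRESHOLD:
        insights.append("Improve meta descriptions or headlines to boost CTR.")
    if impressions < IMPRESSION_THRESHOLD:
        insights.append("Focus on high-volume keywords to increase impressions.")
    if clicks == 0:
        insights.append("No clicks received; reassess keyword strategy or content relevance.")
    return insights


# "Advanced SEO Services": many impressions, CTR far below 1%.
print(seo_insights("0.05%", 203_570, 109))
# Homepage: CTR above the threshold, so no insight is triggered.
print(seo_insights("2.94%", 116_255, 3420))
```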

---

#### **8. Content Gaps**

- These are **suggestions for missing content** that can make the page more useful or attractive to users.  

**Example**:
- For the homepage:  
  _“Include detailed pricing or discount information.”_  
  This means adding pricing details might attract users who are searching for cost-related information.

- For the "Web Development Services" page:  
  _“Add a clear contact or inquiry section.”_  
  This suggests making it easier for users to contact you.

**Why It Matters**:
- Filling these gaps can improve user satisfaction and engagement, which can lead to better search rankings and more conversions.
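
The gap detection itself is a keyword check on the page text. The sketch below is one possible way to express it, not the notebook's exact code: the `GAP_RULES` table and the `content_gaps` function are hypothetical names, the check is made case-insensitive, and missing content is treated as empty text. Note that a plain substring check is only a rough heuristic; a page that mentions "pricing" once in passing would not be flagged.

```python
import pandas as pd

# One keyword and one suggestion per intent, mirroring the messages in Part 5.
GAP_RULES = {
    "Commercial": ("pricing", "Include detailed pricing or discount information."),
    "Informational": ("faq", "Consider adding FAQs to address common user questions."),
    "Navigational": ("contact", "Add a clear contact or inquiry section."),
}


def content_gaps(intent: str, content) -> list[str]:
    """Return the content-gap suggestion for a page, if its keyword is absent."""
    rule = GAP_RULES.get(intent)
    if rule is None:  # unknown intent: nothing to check
        return []
    keyword, suggestion = rule
    # Treat missing content (NaN/None) as empty text and ignore case.
    text = "" if pd.isna(content) else str(content).lower()
    return [] if keyword in text else [suggestion]


print(content_gaps("Commercial", "See our Pricing page for details"))  # []
print(content_gaps("Navigational", "About our team"))
```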

---

### Key Insights From the Data

1. **Homepage Strength**:
   - The homepage gets the most clicks (3420) and has a decent CTR (2.94%).  
   - **Action**: Maintain its performance but focus on improving its ranking from Position 73.14.

2. **Pages Needing Attention**:
   - Pages like "Competitor Keyword Analysis" and "Advanced SEO Services" have high impressions but very low CTRs.  
   - **Action**: Rewrite their titles and meta descriptions to make them more engaging and relevant.

3. **Content Gaps**:
   - Several pages are missing key elements like FAQs or contact sections.  
   - **Action**: Add these missing elements to make the pages more comprehensive and user-friendly.

4. **Improving Rankings**:
   - Most pages are ranked very low (Positions 60+).  
   - **Action**: Invest in SEO strategies like:
     - Adding high-quality content.
     - Building backlinks.
     - Targeting relevant keywords.

---

### Benefits of This Data for the Website Owner

1. **Actionable Insights**:
   - The data clearly shows which pages are underperforming and why.
   - It provides specific steps to fix issues, like improving titles, adding content, or targeting better keywords.

2. **Increased Visibility**:
   - By following the recommendations, pages can rank higher in search results, increasing impressions and clicks.

3. **Better Engagement**:
   - Filling content gaps (e.g., adding FAQs or contact sections) can make the site more useful to users, keeping them engaged longer.

4. **Higher Conversions**:
   - Optimized pages attract more clicks, which can lead to more inquiries, purchases, or other desired actions.

---

### Steps to Take After Reviewing This Output

1. **Focus on Pages With High Impressions but Low CTR**:
   - Example: "Advanced SEO Services" page.
   - Rewrite the title and description to make it more appealing.

2. **Fill Content Gaps**:
   - Example: Add FAQs to pages like "Social Media Marketing" or "Bug Testing Services."

3. **Boost Rankings**:
   - Improve on-page SEO (e.g., using keywords strategically, improving content quality).
   - Build backlinks to improve authority.

4. **Track Performance**:
   - After making changes, monitor clicks, impressions, and rankings to see if the improvements are effective.

5. **Enhance User Experience**:
   - Add clear navigation, pricing details, or contact forms to make it easier for users to find what they need.

---
# **Part 6: Priority Scorer**
**What it does:**
- Calculates a priority score for each web page based on factors such as:
  - The intent of the page (commercial pages get higher priority).
  - CTR (pages with low CTR are prioritized for improvement).
  - Presence of content gaps and actionable SEO insights.
- Sorts the web pages by priority score, highlighting which pages need immediate attention.
- Saves the ranked dataset for review.

**Use Case:**
This part helps focus efforts on the most critical pages, ensuring limited resources are used effectively to improve performance and achieve business goals.
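
One detail to watch when the insights file from the previous part is reloaded: pandas writes an empty string as an empty CSV field and, with default settings, reads it back as `NaN`. Since `NaN` is truthy in Python, a bare `if value:` test treats an empty cell as if it contained text. The toy round trip below illustrates this; the two-row DataFrame is invented for the example.

```python
import io

import pandas as pd

# Two pages: one with a content gap, one with none (empty string).
df = pd.DataFrame({"URL": ["a", "b"], "Content_Gaps": ["Add FAQs.", ""]})

# Round-trip through CSV in memory, as happens between Part 5 and Part 6.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as a missing value (NaN).
print(reloaded["Content_Gaps"].isna().tolist())

# A naive truthiness test counts NaN as "has content gaps" ...
naive = [bool(value) for value in reloaded["Content_Gaps"]]
# ... while filling missing values first gives the intended answer.
safe = reloaded["Content_Gaps"].fillna("").str.strip().ne("").tolist()

print(naive)
print(safe)
```

Filling missing values with `''` right after loading, or testing with `isinstance(value, str)`, keeps empty cells from adding points to the score.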

---


In [None]:
# Importing the pandas library to handle tabular data
import pandas as pd

# Step 1: Load the Improved Dataset
# Use Case:
# The improved dataset contains details about URLs, their intents, content, SEO insights, and content gaps.
# Loading the dataset allows us to process it further for scoring and ranking URLs based on their priority.

# Define the folder path where the dataset is stored
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'

# Define the file path for the improved dataset
improved_dataset_path = folder_path + 'Improved_Dynamic_Final_Output.csv'

# Read the dataset into a pandas DataFrame
improved_data = pd.read_csv(improved_dataset_path)

# Step 2: Initialize Scoring Columns
# Use Case:
# Adding a new column "Priority_Score" to store the calculated scores for each URL based on predefined factors.
# Initializing the score to 0 ensures consistency across all rows before calculation.

improved_data['Priority_Score'] = 0  # Set initial score to 0 for all rows

# Step 3: Scoring Mechanism
# Use Case:
# Calculate a priority score for each URL based on factors such as intent, CTR (click-through rate), content gaps, and SEO insights.
# This helps identify which URLs need immediate attention for optimization.

# Loop through each row in the dataset to calculate scores
for index, row in improved_data.iterrows():
    score = 0  # Start with a score of 0 for each row

    # Factor 1: Intent-Based Weight
    # Use Case:
    # Assign weights based on the intent of the URL.
    # Commercial pages get the highest weight because they directly impact revenue.
    # Navigational pages get medium weight as they guide users to important resources.
    # Informational pages get the lowest weight as their primary role is to educate or inform.
    if row['Intent'] == 'Commercial':  # Check if the intent is Commercial
        score += 50  # Add 50 points for Commercial intent
    elif row['Intent'] == 'Navigational':  # Check if the intent is Navigational
        score += 30  # Add 30 points for Navigational intent
    elif row['Intent'] == 'Informational':  # Check if the intent is Informational
        score += 20  # Add 20 points for Informational intent

    # Factor 2: CTR Improvement Opportunity
    # Use Case:
    # URLs with low CTR need urgent optimization to improve visibility and user engagement.
    ctr = float(row['CTR'].strip('%'))  # Convert the CTR value (stored as a percentage string) to a float
    if ctr < 1.0:  # If CTR is less than 1.0%
        score += 40  # Add 40 points for low CTR to prioritize improvement
    elif 1.0 <= ctr < 3.0:  # If CTR is between 1.0% and 3.0%
        score += 20  # Add 20 points for medium CTR

    # Factor 3: Content Gaps
    # Use Case:
    # Pages with content gaps (e.g., missing FAQs, pricing details) need improvement to better serve user needs.
    # Empty cells are read back from the CSV as NaN, and NaN is truthy, so test for a non-empty string explicitly.
    gaps_text = row['Content_Gaps']
    if isinstance(gaps_text, str) and gaps_text.strip():  # Check if the Content_Gaps column is not empty
        score += 30  # Add 30 points for identified content gaps

    # Factor 4: SEO Insights
    # Use Case:
    # URLs with actionable SEO insights indicate potential for optimization.
    insights_text = row['SEO_Insights']
    if isinstance(insights_text, str) and insights_text.strip():  # Check if the SEO_Insights column is not empty
        score += 20  # Add 20 points for actionable SEO insights

    # Update the Priority_Score column with the calculated score
    improved_data.at[index, 'Priority_Score'] = score

# Step 4: Rank URLs by Priority Score
# Use Case:
# Sorting the dataset by the Priority_Score in descending order highlights the most important URLs for optimization.
ranked_data = improved_data.sort_values(by='Priority_Score', ascending=False)  # Sort by score in descending order

# Step 5: Save the Ranked Dataset
# Use Case:
# Save the ranked dataset to a new file for further review and action.
# This ensures the calculated rankings are preserved and can be shared with other teams or stakeholders.

# Define the file path for the ranked dataset
ranked_output_path = folder_path + 'Ranked_Final_Model_Output.csv'

# Save the ranked dataset to a CSV file
ranked_data.to_csv(ranked_output_path, index=False)  # Save without row indices
print(f"Ranked dataset with priority scores saved at: {ranked_output_path}")  # Confirm the save location

# Step 6: Display Preview for Validation
# Use Case:
# Displaying the top rows of the ranked dataset helps validate the scoring mechanism and ensures the output is as expected.

print("\n--- Ranked Dataset Preview ---")
print(ranked_data[['URL', 'Intent', 'CTR', 'Priority_Score']].head(17))  # Show the top 17 rows of the ranked dataset


Ranked dataset with priority scores saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Ranked_Final_Model_Output.csv

--- Ranked Dataset Preview ---
                                                  URL         Intent    CTR  \
1          https://thatware.co/advanced-seo-services/     Commercial  0.05%   
0                                https://thatware.co/     Commercial  2.94%   
2     https://thatware.co/digital-marketing-services/   Navigational  0.19%   
4         https://thatware.co/link-building-services/   Navigational  0.03%   
6   https://thatware.co/conversion-rate-optimization/   Navigational  0.07%   
8   https://thatware.co/content-proofreading-servi...   Navigational  0.04%   
9        https://thatware.co/website-design-services/   Navigational  0.01%   
10      https://thatware.co/web-development-services/   Navigational  0.24%   
11      https://thatware.co/app-development-services/   Navigational  0.07%   
12  https://thatware.co/website-mainte

### Understanding the Output

This output is a **ranked dataset** that provides a prioritized list of pages from a website. It helps the website owner identify which pages to focus on for **SEO improvement**, based on metrics like **CTR (Click-Through Rate)** and a calculated **Priority Score**. Let’s go through each column and explain what it means and how it helps.

---

### Detailed Explanation of Each Column

#### **1. URL**
- The **web address** of a specific page on your website.
- Each row corresponds to a single page.

**Example**:
- `https://thatware.co/advanced-seo-services/`: The "Advanced SEO Services" page.
- `https://thatware.co/`: The homepage.

**Purpose**:
- This identifies which page is being analyzed so you can link the performance metrics and recommendations back to the correct page.

---

#### **2. Intent**
- This indicates the **purpose** of the page or what users are looking for when they land on it.
  - **Commercial**: Pages designed to sell products or services.
  - **Navigational**: Pages users visit to find specific information, often related to navigation through the website.
  - **Informational**: Pages that provide knowledge or insights, such as blogs or articles.

**Example**:
- `https://thatware.co/advanced-seo-services/` is **Commercial**, meaning it’s intended to convert users into customers by selling SEO services.
- `https://thatware.co/social-media-marketing/` is **Informational**, meaning it’s designed to educate users about social media marketing.

**Why It Matters**:
- Understanding intent helps determine if the page is performing well based on its purpose. For example:
  - Commercial pages should have a high **CTR** because they directly impact revenue; the scorer also gives them the largest intent weight.
  - Informational pages should attract users through useful content.

---

#### **3. CTR (Click-Through Rate)**
- CTR is the percentage of people who clicked on the link after seeing it in search results.  
  **Formula**: CTR = (Clicks ÷ Impressions) × 100.

**Example**:
- The homepage (`https://thatware.co/`) has a **CTR of 2.94%**, meaning nearly 3 out of 100 people clicked on it.
- The "Competitor Keyword Analysis" page has a **CTR reported as 0%**, meaning virtually none of its appearances in search results led to a click.

**Why It Matters**:
- A low CTR indicates that the page’s title or description in search results might not be engaging or relevant.
- Pages with a high CTR are performing well in attracting users.

---

#### **4. Priority Score**
- This is a calculated score that determines how important it is to improve or focus on a particular page.
  - Higher scores mean the page is more critical to the website’s success and requires immediate attention.

**Example**:
- The "Advanced SEO Services" page has the highest **Priority Score** of **140**, meaning it’s very important for the business.
- Other pages like "Business Intelligence Services" have a **Priority Score** of **110**, meaning they are less critical but still require attention.

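**How It’s Calculated**:
- In the scoring step of this notebook, the Priority Score is the sum of four components:
  - Intent weight: Commercial +50, Navigational +30, Informational +20.
  - CTR opportunity: +40 if CTR is below 1%, +20 if CTR is between 1% and 3%, otherwise +0 (low CTR increases the priority of improvement).
  - Content gaps: +30 if at least one content gap was identified (e.g., missing pricing, FAQs, or contact details).
  - SEO insights: +20 if at least one SEO insight was generated (CTR below 1%, fewer than 1,000 impressions, or zero clicks).
- The maximum possible score is therefore 140.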
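
The scoring loop later in this notebook can be condensed into a single function. This is a sketch rather than the notebook's exact code: the name `priority_score` and its boolean arguments are hypothetical, while the weights mirror those used in the scoring step.

```python
def priority_score(intent: str, ctr_percent: float,
                   has_content_gaps: bool, has_seo_insights: bool) -> int:
    """Sum the four weighted factors into a single priority score."""
    # Factor 1: intent weight (unknown intents contribute nothing)
    intent_weights = {"Commercial": 50, "Navigational": 30, "Informational": 20}
    score = intent_weights.get(intent, 0)

    # Factor 2: the lower the CTR, the larger the improvement opportunity
    if ctr_percent < 1.0:
        score += 40
    elif ctr_percent < 3.0:
        score += 20

    # Factors 3 and 4: flags for identified content gaps and SEO insights
    if has_content_gaps:
        score += 30
    if has_seo_insights:
        score += 20
    return score


# A commercial page with 0.05% CTR, content gaps, and SEO insights
print(priority_score("Commercial", 0.05, True, True))  # prints 140
```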

**Why It Matters**:
- This score helps website owners focus on the pages that need the most improvement and can bring the highest ROI.

---

### What This Output Conveys

1. **Performance of Each Page**:
   - It provides an overview of how well each page is performing in terms of user engagement (CTR) and importance to the business (Priority Score).

2. **Areas of Focus**:
   - Pages with high Priority Scores but low CTRs need immediate optimization. For example:
     - The "Advanced SEO Services" page has a **high Priority Score (140)** but a **low CTR (0.05%)**.  
       This means it’s critical for the business but is failing to attract clicks.  

3. **Insights Into Intent**:
   - It aligns each page with its intent, helping owners understand whether a page’s performance matches its purpose.
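
To make the "Areas of Focus" idea concrete, the following sketch filters a list of rows for pages that combine a high Priority Score with a low CTR. The function names, the thresholds (`min_score=100`, `max_ctr=1.0`), and the `example.com` rows are all hypothetical; only the column names follow the output described above.

```python
def parse_ctr(value: str) -> float:
    """Convert a CTR string such as '0.05%' into a float (0.05)."""
    return float(value.strip().rstrip("%"))


def pages_needing_attention(rows, min_score=100, max_ctr=1.0):
    """Return rows with a high score and a low CTR, highest score first."""
    flagged = [
        row for row in rows
        if row["Priority_Score"] >= min_score and parse_ctr(row["CTR"]) < max_ctr
    ]
    return sorted(flagged, key=lambda row: row["Priority_Score"], reverse=True)


# Hypothetical rows, not real project data
sample_rows = [
    {"URL": "https://example.com/services/", "CTR": "0.05%", "Priority_Score": 140},
    {"URL": "https://example.com/blog/", "CTR": "2.50%", "Priority_Score": 110},
    {"URL": "https://example.com/about/", "CTR": "0.40%", "Priority_Score": 70},
]

for row in pages_needing_attention(sample_rows):
    print(row["URL"], row["Priority_Score"], row["CTR"])
```

With these sample rows only the `/services/` page is flagged: the blog page has a CTR above the cut-off and the about page scores below it.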

---

### Benefits for the Website Owner

1. **Improved Prioritization**:
   - The Priority Scores allow owners to focus on improving the most critical pages first, saving time and resources.

2. **Better Search Rankings**:
   - By improving CTR and content, pages can rank higher on search engines, increasing visibility and attracting more users.

3. **Increased Conversions**:
   - Commercial pages with higher CTRs are more likely to convert visitors into customers, directly impacting revenue.

4. **Enhanced User Experience**:
   - Informational and navigational pages can be optimized to provide a better experience, keeping users on the site longer.

---

### Steps to Take After Reviewing This Output

1. **Focus on High Priority Pages**:
   - Start with pages that have a high Priority Score but low CTR, such as:
     - `https://thatware.co/advanced-seo-services/` (Score: 140, CTR: 0.05%).

2. **Rewrite Titles and Meta Descriptions**:
   - Optimize the text that appears in search results for low-CTR pages to make them more engaging.

3. **Enhance Content**:
   - For pages with informational intent, add FAQs, improve readability, or include more relevant keywords.

4. **Track Improvements**:
   - After making changes, monitor the CTR, impressions, and rankings to ensure the changes are effective.

5. **Optimize Navigation**:
   - Ensure that navigational pages, such as "Website Design Services," are easy to use and provide clear pathways to other parts of the website.
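
One simple way to track improvements (step 4) is to compare CTR snapshots taken before and after a change. The sketch below assumes CTR values have already been parsed into floats; the function name `ctr_changes` and the sample figures are hypothetical.

```python
def ctr_changes(before: dict, after: dict) -> dict:
    """Return the CTR change in percentage points for URLs present in both snapshots."""
    return {
        url: round(after[url] - before[url], 2)
        for url in before
        if url in after
    }


# Hypothetical snapshots: URL -> CTR in percent
before = {"https://example.com/services/": 0.05, "https://example.com/blog/": 2.50}
after = {"https://example.com/services/": 1.20, "https://example.com/blog/": 2.10}

for url, delta in ctr_changes(before, after).items():
    direction = "improved" if delta > 0 else "declined or unchanged"
    print(f"{url}: {delta:+.2f} percentage points ({direction})")
```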

---

### Final Summary

This ranked dataset is a roadmap for improving your website’s performance. By addressing pages with low CTRs and focusing on high-priority areas, you can boost traffic, improve search engine rankings, and increase customer engagement.

# **Semantic Search Intent Model**

In [None]:
# Importing necessary libraries
import pandas as pd  # pandas is used for handling tabular data and creating structured datasets
from bs4 import BeautifulSoup  # BeautifulSoup is used for parsing and extracting specific parts of HTML content
import requests  # requests is used to send HTTP requests and retrieve web page content

# Step 1: Define a list of URLs to scrape
# These URLs represent different pages from the ThatWare website that we need to analyze.
# Each URL contains content and metadata that can provide valuable insights for intent classification.
urls = [
    'https://thatware.co/',
    'https://thatware.co/advanced-seo-services/',
    'https://thatware.co/digital-marketing-services/',
    'https://thatware.co/business-intelligence-services/',
    'https://thatware.co/link-building-services/',
    'https://thatware.co/branding-press-release-services/',
    'https://thatware.co/conversion-rate-optimization/',
    'https://thatware.co/social-media-marketing/',
    'https://thatware.co/content-proofreading-services/',
    'https://thatware.co/website-design-services/',
    'https://thatware.co/web-development-services/',
    'https://thatware.co/app-development-services/',
    'https://thatware.co/website-maintenance-services/',
    'https://thatware.co/bug-testing-services/',
    'https://thatware.co/software-development-services/',
    'https://thatware.co/competitor-keyword-analysis/'
]

# Step 2: Define a function to scrape content and metadata from URLs
def scrape_relevant_content(url_list):
    """
    This function scrapes metadata (title, description) and main content from each URL in the given list.
    It cleans and organizes the content into a structured format for further analysis.

    Parameters:
    - url_list (list): A list of website URLs to scrape.

    Returns:
    - pd.DataFrame: A structured table with columns for URL, Title, Description, and Cleaned Content.
    """
    # Initialize an empty list to store data extracted from each URL
    scraped_data = []

    # Loop through each URL in the provided list
    for url in url_list:
        try:
            # Step 3: Send an HTTP GET request to fetch the web page content
            # The timeout ensures that the request doesn't hang indefinitely in case of delays.
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise an error if the HTTP request fails

            # Step 4: Parse the HTML content of the web page using BeautifulSoup
            # BeautifulSoup creates a structured object from the HTML for easy access to elements like <title>, <meta>, etc.
            soup = BeautifulSoup(response.text, 'html.parser')

            # Step 5: Extract the metadata (title and description)
            # Title: The text of the <title> tag, often displayed in browser tabs or search results.
            # get_text() is used because .string returns None when <title> contains nested tags.
            title = soup.title.get_text(strip=True) if soup.title else "No Title Found"

            # Description: A brief summary of the page's content, typically stored in the <meta> tag with name="description".
            description_tag = soup.find("meta", attrs={"name": "description"})
            # .get() avoids a KeyError when the tag exists but has no "content" attribute.
            description = description_tag.get("content", "No Description Found") if description_tag else "No Description Found"

            # Step 6: Extract the main content of the page
            # Check for <main> or <article> tags, as these often hold the primary content of the page.
            main_content = soup.find('main') or soup.find('article')

            # If no <main> or <article> tag is found, check for tags with common identifiers like 'content'.
            if not main_content:
                main_content = soup.find(attrs={'class': 'content'}) or soup.find(attrs={'id': 'content'})

            # If no specific content section is found, use the entire <body> tag as a fallback.
            if not main_content:
                main_content = soup.body

            # Extract text from the identified section and clean it
            # `get_text` extracts text from HTML, removing tags. The `separator=' '` ensures proper spacing.
            content = main_content.get_text(separator=' ', strip=True) if main_content else "No Content Found"

            # Step 7: Remove unwanted or repetitive text
            # Boilerplate banner phrases add noise to the analysis and are removed.
            # Matching is case-sensitive, so only the upper-case forms listed below are stripped.
            unwanted_phrases = [
                'GET A FREE CUSTOMIZED', 'FILL OUT THE FORM BELOW',
                'SERVICES', 'COPYRIGHT', 'TERMS OF SERVICE'
            ]
            for phrase in unwanted_phrases:
                content = content.replace(phrase, '')  # Replace unwanted text with an empty string

            # Append the cleaned and structured data for this URL to the list
            scraped_data.append({
                'URL': url,              # The original URL of the page
                'Title': title,          # Extracted title of the page
                'Description': description,  # Extracted description of the page
                'Content': content       # Cleaned main content of the page
            })

        except Exception as e:
            # Handle any errors during the scraping process
            # Log the error and append a placeholder entry for the failed URL.
            print(f"Failed to scrape {url}: {e}")
            scraped_data.append({
                'URL': url,
                'Title': "Error",
                'Description': "Error",
                'Content': "Error"
            })

    # Step 8: Convert the scraped data into a pandas DataFrame
    # This structured format makes it easier to analyze, manipulate, and save the data.
    return pd.DataFrame(scraped_data)

# Step 9: Call the function to scrape content from the list of URLs
# This executes the scraping process and returns the cleaned and structured data.
scraped_data = scrape_relevant_content(urls)

# Step 10: Save the scraped data into a CSV file
# Saving the data ensures it can be reused later without repeating the scraping process.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'  # Define the folder path for saving files
scraped_file_path = folder_path + 'Enhanced_Scraped_Website_Content.csv'  # Define the file name for the scraped data
scraped_data.to_csv(scraped_file_path, index=False)  # Save the data into a CSV file without adding row numbers

# Step 11: Confirm the file save location
# Notify the user where the output file has been saved for further verification or analysis.
print(f"\nEnhanced scraped data saved at: {scraped_file_path}")


# Importing necessary libraries
import pandas as pd  # For handling and manipulating tabular data efficiently
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text data into numerical representations
from sklearn.cluster import KMeans  # Clusters data into meaningful groups using machine learning

# Step 1: Define file paths
# Specify the location of input data (scraped content) and the output file.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'  # Path where data files are stored
scraped_file_path = folder_path + 'Enhanced_Scraped_Website_Content.csv'  # File containing scraped website content

# Step 2: Load the scraped data
# Read the scraped content file into a pandas DataFrame to organize it as a table.
# The file contains columns like URL, Title, Description, and Content.
scraped_data = pd.read_csv(scraped_file_path)  # Load data for processing

# Step 3: Preprocess the text data for classification
# Define a function to clean and prepare the textual content.
def preprocess_text(text):
    """
    This function preprocesses text data by:
    - Converting all text to lowercase to ensure uniformity.
    - Removing special characters and numbers to keep only meaningful words.
    - Stripping extra spaces for cleaner input.

    Parameters:
        text (str): Original unprocessed text content.
    Returns:
        str: Preprocessed and cleaned text.
    """
    import re  # Import regular expressions for cleaning text
    text = text.lower()  # Convert all text to lowercase for consistency
    text = re.sub(r'[^a-z\s]', '', text)  # Remove all special characters and numbers
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    return text.strip()  # Remove leading and trailing spaces

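# Apply the preprocessing function to the 'Content' column
# This cleans the main content of each URL and prepares it for numerical analysis.
# Empty cells are read back from the CSV as NaN, so they are replaced with '' before cleaning.
scraped_data['Cleaned_Content'] = scraped_data['Content'].fillna('').astype(str).apply(preprocess_text)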

# Step 4: Convert text into numerical features using TF-IDF Vectorizer
# TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is in a document relative to a collection of documents.
# This step converts text data into a numerical format that can be used by machine learning algorithms.
vectorizer = TfidfVectorizer(max_features=500)  # Limit to the top 500 most important words to reduce noise
tfidf_matrix = vectorizer.fit_transform(scraped_data['Cleaned_Content'])  # Convert text into numerical vectors

# Step 5: Apply KMeans clustering to group pages by intent
# KMeans is an unsupervised machine learning algorithm that groups similar data points into clusters.
# In this case, we group web pages into three clusters based on their textual content.
# n_init is set explicitly so the result does not depend on the default of the installed scikit-learn version.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # Define 3 clusters for the three intents (Informational, Navigational, Commercial)
scraped_data['Intent_Cluster'] = kmeans.fit_predict(tfidf_matrix)  # Assign each page to one of the clusters

# Step 6: Map the cluster labels to human-readable intent categories
# KMeans numbers its clusters arbitrarily, so the fixed 0/1/2 mapping below is a manual assumption.
# Inspect the pages that fall into each cluster and adjust the mapping if the labels do not fit.
# Define a function to map numerical cluster labels (assigned by KMeans) to meaningful intent categories.
def map_cluster_to_intent(cluster_label):
    """
    Maps cluster labels to meaningful intent categories:
    - 0: Informational (pages providing detailed information).
    - 1: Navigational (pages guiding users to specific destinations or resources).
    - 2: Commercial (pages designed to sell or promote products/services).

    Parameters:
        cluster_label (int): Numerical label assigned by KMeans clustering.
    Returns:
        str: Corresponding intent category.
    """
    intent_mapping = {
        0: 'Informational',  # For pages with in-depth content, guides, or tutorials
        1: 'Navigational',   # For pages helping users navigate to specific services or sections
        2: 'Commercial'      # For pages focused on sales, promotions, or product details
    }
    return intent_mapping.get(cluster_label, 'Unknown')  # Default to 'Unknown' if label doesn't match

# Apply the mapping function to the cluster labels
# This converts numerical cluster labels into descriptive intent categories.
scraped_data['Intent'] = scraped_data['Intent_Cluster'].apply(map_cluster_to_intent)

# Step 7: Save the classified data with intents to a new CSV file
# Store the processed and classified data in a CSV file for future use.
# The new file contains the scraped columns (URL, Title, Description, Content) plus Cleaned_Content, Intent_Cluster, and Intent.
classified_file_path = folder_path + 'Classified_Content_With_Intent.csv'  # Define output file path
scraped_data.to_csv(classified_file_path, index=False)  # Save data without row indices for cleaner output

# Step 8: Confirmation of file save location
# Notify the user where the classified data has been saved for verification and further analysis.
print(f"\nClassified content with intents saved at: {classified_file_path}")



# Importing necessary libraries
import pandas as pd  # For handling tabular data (structured data like Excel files or databases)
import re  # For searching and manipulating text patterns

# Step 1: Define the file path and load the data
# Specify where the input dataset is located and load it into a pandas DataFrame.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'  # Path to folder containing the dataset
classified_file_path = folder_path + 'Classified_Content_With_Intent.csv'  # File containing classified content data

# Load the dataset into a pandas DataFrame
# A DataFrame is a table-like structure that makes it easy to manipulate and analyze data.
classified_data = pd.read_csv(classified_file_path)

# Step 2: Define helper functions for analyzing content
# These functions will analyze the textual content and provide specific insights or observations.

def contains_cta(content):
    """
    Check if the content contains common Call-to-Action (CTA) phrases.
    CTAs are prompts designed to encourage users to take an action (e.g., "Contact Us", "Buy Now").
    If CTAs are missing, it may reduce the content's ability to drive conversions.

    Parameters:
        content (str): The textual content to analyze.
    Returns:
        bool: True if any CTA is found, otherwise False.
    """
    cta_phrases = ['contact us', 'buy now', 'sign up', 'get started', 'learn more']  # Common CTA phrases
    return any(phrase in content.lower() for phrase in cta_phrases)  # Check if any CTA is present (case-insensitive)

def content_depth(content):
    """
    Evaluate the depth of the content based on the word count.
    Deeper content (more words) is often more informative and engaging.
    If the content is too short, it may be marked as "shallow".

    Parameters:
        content (str): The textual content to analyze.
    Returns:
        str: 'shallow' if word count < 300, 'adequate' otherwise.
    """
    word_count = len(content.split())  # Count the number of words in the content
    return 'shallow' if word_count < 300 else 'adequate'  # Categorize content based on word count

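def contains_internal_links(content):
    """
    Check if the content contains internal links to other pages on the website.
    Internal links help with navigation and improve SEO (search engine optimization).

    Note: the pattern looks for raw <a href="..."> markup. The scraping step stores plain text
    extracted with get_text(), which strips HTML tags, so this check only finds links when
    raw HTML is passed in; for the scraped 'Content' column it will normally return False.

    Parameters:
        content (str): The textual content to analyze.
    Returns:
        bool: True if internal links are found, otherwise False.
    """
    return bool(re.search(r'<a href="[^"]*thatware\.co[^"]*">', content))  # Look for internal links pointing to "thatware.co"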

# Step 3: Generate dynamic recommendations
# Create tailored recommendations for improving the content based on its intent and characteristics.

def generate_dynamic_recommendations(row):
    """
    Generate actionable recommendations for improving content quality and performance.
    The recommendations are based on the intent of the content and its characteristics (e.g., depth, CTAs).

    Parameters:
        row (pd.Series): A row of data representing a single piece of content.
    Returns:
        str: A list of recommendations joined into a single string.
    """
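    intent = row['Intent']  # The intent of the content (e.g., Commercial, Informational, Navigational)
    # The actual text content to analyze; missing values (NaN after loading the CSV) are treated as empty text
    content = row['Content'] if isinstance(row['Content'], str) else ''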

    recommendations = []  # Initialize an empty list to store recommendations

    # Generate recommendations for Commercial content
    if intent == 'Commercial':
        if not contains_cta(content):  # Check if CTAs are missing
            recommendations.append('Add prominent CTAs (e.g., Buy Now, Contact Us).')
        if content_depth(content) == 'shallow':  # Check if content is too short
            recommendations.append('Expand the content to provide more details.')

    # Generate recommendations for Informational content
    elif intent == 'Informational':
        if content_depth(content) == 'shallow':  # Ensure content provides enough information
            recommendations.append('Add more depth to the content (e.g., FAQs, examples).')
        if not contains_internal_links(content):  # Check if internal links are missing
            recommendations.append('Add internal links to related articles or pages.')

    # Generate recommendations for Navigational content
    elif intent == 'Navigational':
        if not contains_cta(content):  # Ensure content provides clear navigation CTAs
            recommendations.append('Add clear navigation CTAs (e.g., Visit Services, Explore Options).')

    # Combine all recommendations into a single string, or provide a default message if no recommendations
    return ' '.join(recommendations) if recommendations else 'No specific recommendations needed.'

# Apply the recommendation function to each row in the dataset
# This dynamically generates recommendations for improving content based on its characteristics.
classified_data['Recommendations'] = classified_data.apply(generate_dynamic_recommendations, axis=1)

# Step 4: Save the data with dynamic recommendations
# Save the updated dataset into a new CSV file for future use or further analysis.
dynamic_recommendation_file_path = folder_path + 'Classified_Content_With_Dynamic_Recommendations.csv'  # Output file path
classified_data.to_csv(dynamic_recommendation_file_path, index=False)  # Save the dataset without row indices

# Step 5: Confirm file save location
# Inform the user where the updated dataset has been saved.
print(f"\nDynamic recommendations saved at: {dynamic_recommendation_file_path}")

# Import necessary libraries
import pandas as pd  # pandas is used for handling structured data in a tabular format like CSV files.

# Step 1: Define the file paths for the datasets
# These file paths point to the location where the input data is stored.
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'  # Base folder path
# File containing enhanced data with dynamic recommendations
enhanced_data_path = folder_path + 'Classified_Content_With_Dynamic_Recommendations.csv'
# File containing content performance metrics
content_data_path = folder_path + 'Merged_Content_Performance.csv'

# Step 2: Load the datasets into pandas DataFrames
# Use Case:
# Loading data into a DataFrame structure makes it easy to analyze, manipulate, and combine datasets.
enhanced_data = pd.read_csv(enhanced_data_path)  # Load enhanced data with recommendations
content_data = pd.read_csv(content_data_path)  # Load content performance data

# Step 3: Normalize identifiers to ensure consistent matching during merging
# Use Case:
# Identifiers may have inconsistent formats (e.g., uppercase vs. lowercase, trailing spaces).
# Normalization ensures that identifiers from both datasets can be compared accurately.

# Normalize the 'URL' column in enhanced_data to create a standardized identifier
# str.strip() removes leading/trailing spaces, and str.lower() converts text to lowercase.
enhanced_data['Identifier_Content'] = enhanced_data['URL'].str.strip().str.lower()

# Normalize the 'Identifier' column in content_data for consistency
content_data['Identifier'] = content_data['Identifier'].str.strip().str.lower()

# Step 4: Merge the datasets
# Use Case:
# Merging combines related information from two datasets into a single dataset for a comprehensive view.
# Here, we enrich the enhanced_data with performance metrics from content_data.

# Perform a left join to retain all rows from enhanced_data.
# Match 'Identifier_Content' from enhanced_data with 'Identifier' from content_data.
merged_data = pd.merge(
    enhanced_data,  # The left dataset containing enhanced recommendations.
    content_data,   # The right dataset containing content performance metrics.
    left_on='Identifier_Content',  # Column from enhanced_data for matching.
    right_on='Identifier',         # Column from content_data for matching.
    how='left',  # 'Left' join retains all rows from enhanced_data even if no match is found in content_data.
    suffixes=('', '_Content')  # Add suffixes to avoid column name conflicts.
)

# Step 5: Save the final merged dataset
# Use Case:
# Saving the merged dataset ensures that the enriched data is available for further analysis or processing steps.

# Define the output file path for the final merged dataset
final_merged_output_path = folder_path + 'Final_Merged_Without_Traffic.csv'

# Save the merged dataset to a CSV file
# index=False ensures that the row numbers are not saved in the CSV file for a cleaner format.
merged_data.to_csv(final_merged_output_path, index=False)

# Confirm the save location to the user
print(f"Final merged output saved at: {final_merged_output_path}")


# Import the pandas library for handling tabular data
import pandas as pd

# Step 1: Load the Merged Dataset
# Use Case:
# The merged dataset combines information about URLs, their content, and performance metrics (e.g., CTR, Impressions, Clicks).
# Loading this dataset into a DataFrame allows us to manipulate and analyze it.

# Define the folder path where the dataset is stored
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'

# Define the file path for the merged dataset
merged_data_path = folder_path + 'Final_Merged_Without_Traffic.csv'

# Read the merged dataset into a pandas DataFrame
merged_data = pd.read_csv(merged_data_path)

# Step 2: Add Recommendations, SEO Insights, and Content Gaps Columns
# Use Case:
# These columns will store dynamically generated suggestions to improve the performance of each URL.
# Note: 'Recommendations' already exists from the previous step; resetting it here replaces those
# content-based recommendations with the intent-based ones generated below.
# 'SEO_Insights' and 'Content_Gaps' are new columns, created empty so they can be populated row by row.

merged_data['Recommendations'] = ''  # Reset; earlier recommendations are overwritten
merged_data['SEO_Insights'] = ''  # Placeholder for actionable SEO insights
merged_data['Content_Gaps'] = ''  # Placeholder for identifying missing content opportunities

# Define thresholds for analysis
# Use Case:
# Setting thresholds helps us decide when specific recommendations or insights should be triggered.
CTR_THRESHOLD = 1.0  # Minimum acceptable click-through rate (CTR) percentage
IMPRESSION_THRESHOLD = 1000  # Minimum impressions for meaningful visibility

# Step 3: Iterate Through Rows to Populate Insights
# Use Case:
# For each row (representing a URL), we analyze its intent, CTR, impressions, and content to generate tailored recommendations.

# Loop through each row in the DataFrame
for index, row in merged_data.iterrows():
    intent = row['Intent']  # Identify the intent of the current page (e.g., Commercial, Informational, Navigational)
    # Lowercase the content once; missing values (NaN after loading the CSV) are treated as empty text
    content_lower = row['Content'].lower() if isinstance(row['Content'], str) else ''
    recommendations = []  # Initialize a list to store recommendations for this row
    seo_insights = []  # Initialize a list to store SEO insights for this row
    content_gaps = []  # Initialize a list to store identified content gaps for this row

    # Generate Recommendations Based on Intent
    # Use Case:
    # Tailored recommendations based on the intent of the page help improve its effectiveness for users.
    if intent == 'Commercial':  # If the intent is to drive sales or conversions
        recommendations.append("Add clear CTAs to encourage user actions.")  # Suggest adding strong Call-to-Actions
        recommendations.append("Highlight unique selling points (USPs) prominently.")  # Emphasize key product/service benefits
        if 'pricing' not in content_lower:  # Check if pricing information is missing
            content_gaps.append("Include detailed pricing or discount information.")  # Suggest adding pricing details
    elif intent == 'Informational':  # If the intent is to educate or inform users
        recommendations.append("Add internal links to related pages or guides.")  # Suggest linking to related topics
        recommendations.append("Provide detailed, value-driven content such as FAQs or tutorials.")  # Suggest adding FAQs
        if 'faq' not in content_lower:  # Check if FAQs are missing (case-insensitive, like the other checks)
            content_gaps.append("Consider adding FAQs to address common user questions.")  # Highlight FAQ as a content gap
    elif intent == 'Navigational':  # If the intent is to help users navigate
        recommendations.append("Ensure smooth navigation to the target resource.")  # Suggest improving navigation
        recommendations.append("Add breadcrumbs or context-specific navigation options.")  # Suggest adding breadcrumbs
        if 'contact' not in content_lower:  # Check if a contact section is missing
            content_gaps.append("Add a clear contact or inquiry section.")  # Suggest adding a contact section

    # Generate SEO Insights
    # Use Case:
    # SEO insights help identify specific actions to improve search visibility and user engagement.
    # str() lets the conversion handle both '2.94%' strings and plain numbers. A missing CTR becomes NaN,
    # which fails every threshold comparison below and therefore triggers no CTR insight.
    ctr_value = float(str(row['CTR']).strip().rstrip('%'))  # Convert CTR into a float value
    impressions = row['Impressions']  # Extract the number of impressions for the URL
    clicks = row['Clicks']  # Extract the number of clicks for the URL

    if ctr_value < CTR_THRESHOLD:  # If the CTR is below the defined threshold
        seo_insights.append("Improve meta descriptions or headlines to boost CTR.")  # Suggest optimizing meta descriptions
    if impressions < IMPRESSION_THRESHOLD:  # If impressions are below the threshold
        seo_insights.append("Focus on high-volume keywords to increase impressions.")  # Suggest targeting high-volume keywords
    if clicks == 0:  # If the URL has received no clicks
        seo_insights.append("No clicks received; reassess keyword strategy or content relevance.")  # Suggest revisiting keyword strategy

    # Populate the DataFrame with the dynamically generated values
    # Use Case:
    # Adding the generated recommendations, insights, and content gaps into the DataFrame for each URL.
    merged_data.at[index, 'Recommendations'] = '; '.join(recommendations)  # Join the list of recommendations into a single string
    merged_data.at[index, 'SEO_Insights'] = '; '.join(seo_insights)  # Join the list of SEO insights into a single string
    merged_data.at[index, 'Content_Gaps'] = '; '.join(content_gaps)  # Join the list of content gaps into a single string

# Step 4: Save the Updated Dataset
# Use Case:
# Saving the updated dataset ensures that all the dynamically generated insights are preserved for further analysis or use.

# Define the file path for the updated dataset
final_output_path = folder_path + 'Improved_Dynamic_Final_Output.csv'

# Save the DataFrame to a CSV file
merged_data.to_csv(final_output_path, index=False)  # Export the DataFrame without row indices
print(f"Improved dataset with all columns saved at: {final_output_path}")  # Confirm the save location

# Step 5: Display a Preview of the Updated Dataset
# Use Case:
# Previewing the updated dataset allows for quick validation of the generated insights and recommendations.

print("\n--- Final Output Preview with All Columns ---")
print(merged_data.head(17))  # Display the first 17 rows of the updated dataset


# Importing the pandas library to handle tabular data
import pandas as pd

# Step 1: Load the Improved Dataset
# Use Case:
# The improved dataset contains details about URLs, their intents, content, SEO insights, and content gaps.
# Loading the dataset allows us to process it further for scoring and ranking URLs based on their priority.

# Define the folder path where the dataset is stored
folder_path = '/content/drive/MyDrive/Dataset For Semantic Search Intent Model/'

# Define the file path for the improved dataset
improved_dataset_path = folder_path + 'Improved_Dynamic_Final_Output.csv'

# Read the dataset into a pandas DataFrame
improved_data = pd.read_csv(improved_dataset_path)

# Step 2: Initialize Scoring Columns
# Use Case:
# Adding a new column "Priority_Score" to store the calculated scores for each URL based on predefined factors.
# Initializing the score to 0 ensures consistency across all rows before calculation.

improved_data['Priority_Score'] = 0  # Set initial score to 0 for all rows

# Step 3: Scoring Mechanism
# Use Case:
# Calculate a priority score for each URL based on factors such as intent, CTR (click-through rate), content gaps, and SEO insights.
# This helps identify which URLs need immediate attention for optimization.

# Loop through each row in the dataset to calculate scores
for index, row in improved_data.iterrows():
    score = 0  # Start with a score of 0 for each row

    # Factor 1: Intent-Based Weight
    # Use Case:
    # Assign weights based on the intent of the URL.
    # Commercial pages get the highest weight because they directly impact revenue.
    # Navigational pages get medium weight as they guide users to important resources.
    # Informational pages get the lowest weight as their primary role is to educate or inform.
    if row['Intent'] == 'Commercial':  # Check if the intent is Commercial
        score += 50  # Add 50 points for Commercial intent
    elif row['Intent'] == 'Navigational':  # Check if the intent is Navigational
        score += 30  # Add 30 points for Navigational intent
    elif row['Intent'] == 'Informational':  # Check if the intent is Informational
        score += 20  # Add 20 points for Informational intent

    # Factor 2: CTR Improvement Opportunity
    # Use Case:
    # URLs with low CTR need urgent optimization to improve visibility and user engagement.
    ctr = float(str(row['CTR']).strip().rstrip('%'))  # Convert the CTR value (e.g., "2.94%") to a float; str() guards against values already parsed as numbers
    if ctr < 1.0:  # If CTR is less than 1.0%
        score += 40  # Add 40 points for low CTR to prioritize improvement
    elif 1.0 <= ctr < 3.0:  # If CTR is between 1.0% and 3.0%
        score += 20  # Add 20 points for medium CTR

    # Factor 3: Content Gaps
    # Use Case:
    # Pages with content gaps (e.g., missing FAQs, pricing details) need improvement to better serve user needs.
    if pd.notna(row['Content_Gaps']) and str(row['Content_Gaps']).strip():  # Empty CSV cells load as NaN, which is truthy, so test for a real value explicitly
        score += 30  # Add 30 points for identified content gaps

    # Factor 4: SEO Insights
    # Use Case:
    # URLs with actionable SEO insights indicate potential for optimization.
    if pd.notna(row['SEO_Insights']) and str(row['SEO_Insights']).strip():  # Empty CSV cells load as NaN, which is truthy, so test for a real value explicitly
        score += 20  # Add 20 points for actionable SEO insights

    # Update the Priority_Score column with the calculated score
    improved_data.at[index, 'Priority_Score'] = score

# Step 4: Rank URLs by Priority Score
# Use Case:
# Sorting the dataset by the Priority_Score in descending order highlights the most important URLs for optimization.
ranked_data = improved_data.sort_values(by='Priority_Score', ascending=False)  # Sort by score in descending order

# Step 5: Save the Ranked Dataset
# Use Case:
# Save the ranked dataset to a new file for further review and action.
# This ensures the calculated rankings are preserved and can be shared with other teams or stakeholders.

# Define the file path for the ranked dataset
ranked_output_path = folder_path + 'Ranked_Final_Model_Output.csv'

# Save the ranked dataset to a CSV file
ranked_data.to_csv(ranked_output_path, index=False)  # Save without row indices
print(f"Ranked dataset with priority scores saved at: {ranked_output_path}")  # Confirm the save location

# Step 6: Display Preview for Validation
# Use Case:
# Displaying the top rows of the ranked dataset helps validate the scoring mechanism and ensures the output is as expected.

print("\n--- Ranked Dataset Preview ---")
print(ranked_data[['URL', 'Intent', 'CTR', 'Priority_Score']].head(17))  # Show the top 17 rows of the ranked dataset




Enhanced scraped data saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Enhanced_Scraped_Website_Content.csv

Classified content with intents saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Classified_Content_With_Intent.csv

Dynamic recommendations saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Classified_Content_With_Dynamic_Recommendations.csv
Final merged output saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Final_Merged_Without_Traffic.csv
Improved dataset with all columns saved at: /content/drive/MyDrive/Dataset For Semantic Search Intent Model/Improved_Dynamic_Final_Output.csv

--- Final Output Preview with All Columns ---
                                                  URL  \
0                                https://thatware.co/   
1          https://thatware.co/advanced-seo-services/   
2     https://thatware.co/digital-marketing-services/   
3   https://thatware.co/busin

---
# What is this output?

This is a **data table** that provides insights into how different pages of a website (in this case, ThatWare) are performing in terms of:
1. **Clicks** (How many times users clicked on these pages in search results).
2. **Impressions** (How often these pages were shown in search results).
3. **CTR (Click-Through Rate)** (How many impressions turned into clicks, expressed as a percentage).
4. **Position** (Average ranking of these pages in search results).
5. **SEO Insights** (Suggestions to improve the page's performance in search engines).
6. **Content Gaps** (Missing elements or information that, if added, could improve user engagement or search rankings).

Each row in the table represents a specific page of the website.

---

### Step-by-Step Explanation of the Output

#### 1. **Identifier**
- **What it is:** The URL of the webpage being analyzed.
- **Use Case:** This tells you which specific page the data corresponds to. For example:
  - `https://thatware.co/` represents the homepage.
  - `https://thatware.co/advanced-seo-services/` represents the "Advanced SEO Services" page.
- **What to do:** Identify which pages are important for your business goals (e.g., product pages, service pages, or informational pages).

---

#### 2. **Clicks**
- **What it is:** The total number of clicks this page received in search engine results.
- **Use Case:** More clicks mean more users are visiting your site through search engines. A page with high clicks is performing well in attracting users.
- **What to do:**
  - For pages with **low clicks**, consider improving the title and description (meta tags) to make them more engaging.
  - For pages with **high clicks**, maintain and optimize them further to sustain performance.

---

#### 3. **Impressions**
- **What it is:** The number of times this page was shown in search results.
- **Use Case:** High impressions mean the page appears frequently in search results, but it doesn't necessarily mean users are clicking on it.
- **What to do:**
  - If a page has **high impressions but low clicks**, it may need better meta descriptions or titles to attract users.
  - If impressions are **low**, consider optimizing the content or targeting better keywords.

---

#### 4. **CTR (Click-Through Rate)**
- **What it is:** The percentage of impressions that resulted in clicks. For example:
  - If a page has 100 impressions and 10 clicks, the CTR is 10%.
- **Use Case:** A higher CTR indicates that the page is attractive to users in search results.
- **What to do:**
  - Pages with **low CTR** (e.g., below 1%) should have their meta tags optimized. Use clear, actionable language in titles like "Buy Now" or "Learn More."
  - Pages with **high CTR** (e.g., above 2%) are performing well and can be further enhanced to maintain their position.
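
The CTR arithmetic above can be sketched in pandas. This is a minimal illustration under assumptions: the sample rows and the `CTR_pct` and `Low_CTR` column names are hypothetical, and only the `Clicks` and `Impressions` columns come from the output described here.

```python
import pandas as pd

# Hypothetical sample rows; only the Clicks/Impressions column names
# are taken from the output described above.
pages = pd.DataFrame({
    "URL": ["https://example.com/a", "https://example.com/b"],
    "Clicks": [10, 3],
    "Impressions": [100, 600],
})

# CTR (%) = clicks / impressions * 100
pages["CTR_pct"] = (pages["Clicks"] / pages["Impressions"] * 100).round(2)

# Flag pages below the 1% threshold mentioned above
pages["Low_CTR"] = pages["CTR_pct"] < 1.0

print(pages)
```

A page with zero impressions would produce a division by zero (`NaN` or `inf` in pandas), so real data may need those rows filtered out first.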

---

#### 5. **Position**
- **What it is:** The average position of the page in search results. A lower number (e.g., 1 or 2) means the page is ranking higher.
- **Use Case:** Higher-ranking pages (positions 1–3) get more clicks.
- **What to do:**
  - For pages ranked **worse than position 10** (Position > 10), improve their content and keyword targeting.
  - For pages ranked **in the top 3**, maintain their performance with regular updates.

---

#### 6. **Context**
- **What it is:** Describes the type of data being analyzed (e.g., "Pages").
- **Use Case:** Helps categorize the data for further segmentation or analysis.

---

#### 7. **SEO Insights**
- **What it is:** Specific suggestions to improve the page's performance in search engines. Examples include:
  - "Improve meta descriptions or headlines to boost CTR."
- **Use Case:** These actionable suggestions highlight areas where the page is underperforming in terms of SEO.
- **What to do:** Follow the insights for each page. For example:
  - Rewrite meta descriptions to make them more engaging and keyword-rich.
  - Use attention-grabbing headlines to attract clicks.

---

#### 8. **Content Gaps**
- **What it is:** Missing elements or opportunities in the content that could enhance user experience or search performance. Examples include:
  - "Include detailed pricing or discount information."
  - "Add a clear contact or inquiry section."
- **Use Case:** These are opportunities to improve the content so that it aligns better with user expectations and search intent.
- **What to do:**
  - Fill the identified gaps in content. For example:
    - Add pricing details to product pages.
    - Include a contact form or FAQs where relevant.

---

### Benefits of This Output for Website Owners

1. **Improved Clicks and Engagement:** By acting on the recommendations, pages with low clicks and CTR can attract more users.
2. **Better SEO Rankings:** Filling content gaps and improving meta descriptions can boost rankings, leading to more impressions and clicks.
3. **Enhanced User Experience:** Addressing content gaps ensures that users find relevant information, increasing their satisfaction and likelihood to convert.
4. **Targeted Optimization:** The insights and gaps allow owners to focus on underperforming pages rather than applying blanket changes across the site.

---

### What Steps Should Be Taken After Getting This Output?

1. **Prioritize Pages:**
   - Focus on pages with low clicks, low CTR, or poor rankings (Position > 10).
   - Pages with high impressions but low clicks need immediate attention.

2. **Address SEO Insights:**
   - Rewrite meta descriptions and titles for the suggested pages.
   - Use keywords that align with user search queries.

3. **Fill Content Gaps:**
   - Add missing elements like pricing, FAQs, or internal links.
   - Ensure the content is detailed and addresses user needs.

4. **Track Progress:**
   - After making changes, monitor the performance of these pages using tools like Google Search Console.
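
The prioritization rule in step 1 can be sketched as a few boolean filters. The sample data and the `CTR_pct`, `Needs_Attention`, and `Urgent` column names are hypothetical, and treating "high impressions" as "at or above the median" is an assumption made for illustration, not a rule from the notebook.

```python
import pandas as pd

# Hypothetical sample data for illustration only
pages = pd.DataFrame({
    "URL": ["/home", "/services", "/blog", "/contact"],
    "Impressions": [5000, 4000, 150, 90],
    "CTR_pct": [4.2, 0.6, 0.8, 3.5],
    "Position": [2.1, 8.4, 15.3, 22.0],
})

low_ctr = pages["CTR_pct"] < 1.0      # low CTR threshold used in this notebook
poor_rank = pages["Position"] > 10    # ranked worse than position 10
# Assumption: "high impressions" means at or above the median
high_impressions = pages["Impressions"] >= pages["Impressions"].median()

# Pages worth reviewing: low CTR or poor ranking
pages["Needs_Attention"] = low_ctr | poor_rank
# Pages needing immediate attention: shown often but rarely clicked
pages["Urgent"] = low_ctr & high_impressions

print(pages[["URL", "Needs_Attention", "Urgent"]])
```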

---

### Final Thoughts

This output provides a **comprehensive roadmap** for improving the performance of a website's pages. By acting on the SEO insights and filling the identified content gaps, a website owner can:
- Attract more users.
- Improve search rankings.
- Enhance the overall user experience.


---
# **What is this output?**

This is a **ranked dataset** of web pages from a website (in this case, ThatWare). Each row in the dataset represents one webpage, and the columns provide details about that page, such as:
1. **URL:** The address of the webpage.
2. **Intent:** The purpose or goal of the webpage (e.g., to provide information, sell a product, or guide users to a resource).
3. **CTR (Click-Through Rate):** The percentage of impressions that turned into clicks.
4. **Priority Score:** A calculated score that ranks the importance or urgency of optimizing this page.

This output helps the website owner understand how their pages are performing in search results and which pages they should prioritize for improvement.

---

### **Step-by-Step Explanation of the Output**

#### **1. URL**
- **What it is:**
  - This is the web address of the page being analyzed.
  - For example:
    - `https://thatware.co/advanced-seo-services/` is a page about "Advanced SEO Services."
    - `https://thatware.co/website-maintenance-services/` is about "Website Maintenance Services."
- **Why it matters:**
  - Knowing which specific page the data corresponds to allows you to make targeted improvements.
  - Each page serves a different purpose (e.g., product promotion, information sharing), so optimizing them requires different strategies.

---

#### **2. Intent**
- **What it is:**
  - The **intent** explains the purpose of the webpage. In this output, there are three types of intent:
    1. **Commercial:** Pages that aim to sell a product or service (e.g., Advanced SEO Services).
    2. **Navigational:** Pages designed to guide users to a specific resource or service (e.g., Website Maintenance Services).
    3. **Informational:** Pages that provide information or answer questions (e.g., Business Intelligence Services).
- **Why it matters:**
  - Intent helps you decide the best way to optimize a page:
    - **Commercial pages:** Focus on adding clear call-to-action (CTA) buttons like "Buy Now" or "Contact Us."
    - **Navigational pages:** Ensure smooth navigation to other important pages on the website.
    - **Informational pages:** Provide detailed, high-quality content that answers user queries.
- **Actions to take:**
  - Review each page's content to ensure it aligns with its intent:
    - Add pricing details or testimonials to **commercial pages**.
    - Add breadcrumbs or links to other pages on **navigational pages**.
    - Add FAQs, guides, or tutorials to **informational pages**.

---

#### **3. CTR (Click-Through Rate)**
- **What it is:**
  - CTR is the percentage of impressions (times a page was shown in search results) that resulted in a click.
  - For example:
    - `2.94%` CTR means 2.94 out of every 100 impressions led to a click.
- **Why it matters:**
  - Higher CTR indicates the page is attractive to users in search results.
  - Low CTR may mean the page’s meta description or title isn’t appealing enough.
- **Actions to take:**
  - Pages with **low CTR (e.g., below 1%)** need better optimization:
    - Rewrite the meta title to include attention-grabbing words like "Best," "Free," or "Easy."
    - Improve the meta description to make it more engaging and informative.

---

#### **4. Priority Score**
- **What it is:**
  - This is a calculated score that ranks how important it is to optimize a page. Higher scores indicate greater urgency.
  - For example:
    - A score of **140** (e.g., Advanced SEO Services) means this page is very important and needs immediate attention.
    - A score of **120** (e.g., Website Design Services) indicates moderate importance.
- **Why it matters:**
  - Website owners often have limited resources to optimize all pages at once. The priority score helps focus on the most critical pages first.
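- **How it is calculated:**
  - The score is the sum of four factors:
    - Page intent: Commercial adds 50 points, Navigational 30, and Informational 20.
    - CTR: below 1% adds 40 points, from 1% up to (but not including) 3% adds 20, and 3% or higher adds nothing.
    - Content gaps: 30 points if any gap is listed for the page.
    - SEO insights: 20 points if any insight is listed for the page.
  - The maximum possible score is therefore 140 (50 + 40 + 30 + 20).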
- **Actions to take:**
  - Start optimizing pages with the highest scores (e.g., those with **140** or **120**) before moving to lower-priority ones.
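
As a worked example, the scoring rules can be condensed into a single function. The weights mirror the scoring loop earlier in this notebook; the function name `priority_score` and its parameter names are hypothetical, introduced only for this sketch.

```python
def priority_score(intent, ctr_pct, has_content_gaps, has_seo_insights):
    """Sum the four scoring factors for one page (hypothetical helper)."""
    # Factor 1: intent-based weight (unknown intents contribute nothing)
    intent_weights = {"Commercial": 50, "Navigational": 30, "Informational": 20}
    score = intent_weights.get(intent, 0)

    # Factor 2: CTR improvement opportunity
    if ctr_pct < 1.0:
        score += 40
    elif ctr_pct < 3.0:
        score += 20

    # Factor 3: content gaps
    if has_content_gaps:
        score += 30

    # Factor 4: SEO insights
    if has_seo_insights:
        score += 20

    return score


# Commercial page, CTR 0.5%, with gaps and insights: 50 + 40 + 30 + 20
print(priority_score("Commercial", 0.5, True, True))        # 140
# Commercial page, CTR 2.94%, with gaps and insights: 50 + 20 + 30 + 20
print(priority_score("Commercial", 2.94, True, True))       # 120
# Informational page, CTR 3.5%, no gaps or insights: 20 + 0 + 0 + 0
print(priority_score("Informational", 3.5, False, False))   # 20
```

The first case shows how a page reaches the maximum of 140, and the second shows one of the combinations that produces 120.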

---

### **What Steps Should Be Taken After Getting This Output?**

1. **Review the URLs and their Intent:**
   - Understand what each page is trying to achieve.
   - Ensure the content matches the intent:
     - For **commercial pages:** Highlight products/services, add CTAs like "Get a Quote."
     - For **informational pages:** Provide detailed answers to user questions.
     - For **navigational pages:** Ensure smooth links to other resources or services.

2. **Improve Meta Titles and Descriptions:**
   - Focus on pages with **low CTR** (e.g., less than 1%).
   - Write compelling meta descriptions with action-oriented words like "Learn More" or "Explore Now."
   - Include keywords that match what users search for.

3. **Address Content Gaps:**
   - Review the "Content Gaps" column from previous outputs.
   - For example:
     - If a gap says "Add detailed pricing information," include clear pricing details.
     - If it says "Add a contact section," ensure the page has a visible "Contact Us" form.

4. **Track Progress:**
   - After making changes, monitor the pages over time.
   - Use tools like Google Search Console to see if clicks and CTR improve.

5. **Focus on High-Priority Pages First:**
   - Start with pages that have a **Priority Score of 140** or **120** as they are more critical for your business.
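
Steps 1 and 5 can be combined into a short work queue: keep only the high-priority pages and attach a first action based on intent. This is a sketch under assumptions: the sample rows, the `THRESHOLD` constant, and the `Next_Action` column are hypothetical, while the `URL`, `Intent`, and `Priority_Score` columns match the ranked dataset.

```python
import pandas as pd

# Hypothetical sample rows shaped like the ranked dataset
ranked = pd.DataFrame({
    "URL": ["/pricing", "/seo-services", "/blog/guide", "/sitemap"],
    "Intent": ["Commercial", "Commercial", "Informational", "Navigational"],
    "Priority_Score": [120, 140, 70, 100],
})

# First action per intent, paraphrasing step 1 above
next_action = {
    "Commercial": "Highlight the service and add a CTA such as 'Get a Quote'",
    "Informational": "Provide detailed answers to user questions",
    "Navigational": "Check links to other resources or services",
}

THRESHOLD = 120  # assumption: treat scores of 120 and above as high priority

work_queue = (
    ranked[ranked["Priority_Score"] >= THRESHOLD]
    .sort_values("Priority_Score", ascending=False)
    .assign(Next_Action=lambda df: df["Intent"].map(next_action))
    .reset_index(drop=True)
)

print(work_queue)
```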

---

### **How Is This Beneficial for the Website Owner?**

1. **Maximizing Returns:** By focusing on high-priority pages, the website owner can quickly improve the performance of their most important pages.
2. **Improving User Engagement:** Enhancing meta descriptions, adding CTAs, and filling content gaps will attract more users and keep them engaged.
3. **Boosting Search Rankings:** Optimized pages are more likely to rank higher in search engines, leading to increased visibility.
4. **Driving Conversions:** Well-optimized commercial pages can turn more visitors into customers, increasing revenue.

---

### **Final Thoughts**

This output provides a **data-driven roadmap** for optimizing a website. It highlights which pages need attention, why they need it, and how to prioritize them. By following the outlined steps:
- The website owner can improve the visibility and performance of their pages.
- User experience will improve, leading to higher conversions and better rankings.
