<a href="https://colab.research.google.com/github/Abhiss123/AlmaBetter-Projects/blob/main/Advanced_Content_Optimization_with_BERT_for_SEO_Enhancement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Name: Advanced Content Optimization with BERT for SEO Enhancement

**Summary:** **BERT** is a language model that helps search engines like Google understand the context of words in a sentence, so people searching are more likely to find exactly what they're looking for. For you, as someone creating content or running a website, **BERT** means you should focus on writing clear, helpful, and natural content. The better your content reads in context, the more accurately it can be matched to the right searches, which improves your SEO and helps your website perform better in search results.

# What is the purpose of this project?
* The purpose of this project is to utilize **BERT (Bidirectional Encoder Representations from Transformers)** to analyze and optimize the content on a website. By generating embeddings that represent the meaning of the website's content, the project allows for identifying redundant content, uncovering gaps in the content strategy, and improving overall SEO performance. The goal is to ensure that the website is delivering unique, relevant, and contextually rich content that aligns with search engine algorithms and user intent.








# What is BERT?

*  **BERT stands for Bidirectional Encoder Representations from Transformers.** While the name sounds complex, let's break it down into simple, non-technical terms so you can easily understand what it is and how it helps.

**1. Understanding BERT in Simple Terms**

*  **Context Matters:** Imagine you're reading a book, and you come across the word "bank." Depending on the sentence, "bank" could mean the side of a river or a place where you keep your money. How do you know which meaning is correct? You look at the words around it to understand the context. BERT works in a similar way. It reads the entire sentence, both forwards and backwards, to understand the meaning of each word based on the words around it.

*  **Bidirectional:** Most traditional language models read text in one direction—either from left to right (like how we read English) or right to left. BERT, however, reads the text in both directions at the same time. This allows it to have a much better understanding of the context because it considers all the words in a sentence together.

*  **Transformers:** This is the type of technology that powers BERT. Think of transformers as a kind of brain that processes and understands language. They help BERT focus on the important parts of a sentence and figure out how words relate to each other.

**2. How Does BERT Help with Search Engines and SEO?**

*  **Improving Search Results:** When you type a query into Google, **BERT** helps the search engine understand what you're really asking for. For example, if you search for "how to catch a cold," **BERT** helps Google understand that you're not asking for tips on getting sick, but rather how people typically get colds. This means you'll get more relevant results.

*  **Content Relevance:** **BERT** also helps search engines match your website content to what people are searching for. If your content is well-written and clear, BERT can understand it better and match it to relevant search queries. This makes your website more likely to show up in search results when someone searches for a topic you cover.

**3. Why is BERT Important for SEO?**

*  **Understanding User Intent:** One of the biggest challenges in SEO (Search Engine Optimization) is making sure your website content matches what people are looking for. BERT helps with this by understanding the intent behind search queries. It can differentiate between similar phrases with different meanings and ensure that users get results that truly answer their questions.

*  **Long-Tail Keywords:** In SEO, long-tail keywords are longer and more specific search phrases. BERT is especially good at understanding these more complex queries. For example, instead of just understanding "SEO tips," BERT can help with a search like "how to improve my website's SEO ranking without technical skills."

*  **Better Content Creation:** Knowing how BERT works can help you create content that is more likely to rank well on search engines. Since BERT understands context, it's important to write naturally and clearly, rather than just stuffing keywords into your content.

*  **Think of BERT as a Smart Assistant:** Imagine you have a really smart assistant who understands everything you say, even if it's complex or vague. BERT is like that assistant for search engines. It helps them understand what you mean and what you're looking for, so they can give you the best possible results.

*  **Focus on Quality Content:** As a website owner or content creator, your job is to write content that clearly answers questions or provides valuable information. You don't need to worry about cramming in exact keywords all the time. Instead, focus on making your content easy to read and informative, and BERT will help match it to the right searches.


# How does this project improve SEO performance?

*  The project improves SEO performance by ensuring that the content on the website is unique, contextually rich, and aligned with user intent. By avoiding redundant content and filling gaps in the content strategy, the website is better positioned to rank higher on search engine results pages (SERPs). Additionally, the use of BERT helps in understanding and targeting long-tail keywords and semantic variations, which are increasingly important in modern SEO.

# What is the significance of content similarity in SEO?

*  Content similarity is crucial in SEO because search engines aim to provide the most relevant and diverse results to users. If a website has multiple pages with highly similar content, it can lead to content cannibalization, where these pages compete against each other in search rankings. By identifying and addressing content similarity, the project helps to avoid this issue, ensuring that each page on the website has a unique purpose and targets specific keywords or user intents, thereby improving overall site performance.

# How does the project help in identifying content gaps?

*  By generating and comparing embeddings for various content pieces on the website, the project can highlight areas where related topics are not well-covered or linked. If certain topics that should be related have low similarity in their embeddings, it indicates that the content strategy might be lacking in connecting these topics. The project then suggests creating new content to bridge these gaps, ensuring comprehensive coverage of all relevant topics.
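As a rough sketch of the comparison described above, the snippet below scores every pair of pages with cosine similarity and labels very similar pairs as possible redundancy and very dissimilar pairs as possible gaps. Cosine similarity is an assumed choice of metric (the project description does not name one), and the page names, the toy 3-dimensional vectors, and both thresholds are made up for illustration; real BERT-base embeddings have 768 dimensions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_pairs(page_embeddings, redundant_at=0.95, gap_at=0.30):
    """Label each pair of pages as 'redundant', 'gap', or 'ok'.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    labels = {}
    for (name_a, vec_a), (name_b, vec_b) in combinations(page_embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= redundant_at:
            labels[(name_a, name_b)] = "redundant"
        elif score <= gap_at:
            labels[(name_a, name_b)] = "gap"
        else:
            labels[(name_a, name_b)] = "ok"
    return labels


# Hypothetical pages with toy 3-dimensional embeddings.
pages = {
    "seo-basics": [0.9, 0.1, 0.0],
    "seo-intro": [0.88, 0.12, 0.01],
    "link-building": [0.0, 0.2, 0.95],
}
pair_labels = classify_pairs(pages)
```

A "gap" label here is only a candidate: it matters when the two topics are ones you expect to be related. Mean-pooled BERT embeddings often score fairly high for most text pairs, so the thresholds would need calibrating on your own pages.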

# What are the practical benefits of this project for a website owner?

**For a website owner, this project offers several practical benefits:**

*  **Improved SEO:** By ensuring that the content is unique and contextually relevant, the website is more likely to rank higher in search results.

*  **Content Strategy Optimization:** The project identifies redundant content and gaps, helping the owner create a more effective content strategy.


*  **Enhanced User Experience:** With well-structured, relevant content, users are more likely to engage with the website, leading to better conversion rates.

# 1. requests Library
**What is it?**

* The requests library is a popular Python library used to send HTTP requests to web servers. It allows you to interact with websites programmatically, enabling you to retrieve web pages, send data to a server, and more.

**Why Do We Need It?**

*  **Web Scraping:** In this project, you're retrieving content from a website to analyze it. The requests library is used to send a GET request to the website's server, which responds by sending back the HTML content of the page.

*  **Example Use:** When you want to scrape the text content from a webpage (like getting all the paragraphs or headings), requests helps you fetch the raw HTML data of that page.

  **response = requests.get(url)**  # Sends a GET request to the specified URL and retrieves the web page content.

# 2. BeautifulSoup from bs4

**What is it?**
*  BeautifulSoup is a Python library used for parsing HTML and XML documents. It creates a parse tree that makes it easy to navigate and search through the HTML content.

**Why Do We Need It?**
*  **HTML Parsing:** After retrieving the raw HTML content of a webpage using requests, you need to extract useful information from it, such as the text inside paragraph tags (`<p>`). BeautifulSoup allows you to do this efficiently by parsing the HTML and providing simple methods to locate elements within it.

*  **Example Use:** If you want to extract all the text inside `<p>` tags, BeautifulSoup helps you easily find and extract this content.
   
   **soup = BeautifulSoup(response.text, 'html.parser')**  # Parses the HTML content retrieved by `requests`.

   **paragraphs = soup.find_all('p')**  # Finds all paragraph tags in the HTML.

# 3. re (Regular Expressions) Library

**What is it?**

*  The re library in Python is used for working with regular expressions, which are a powerful tool for searching, matching, and manipulating text.

**Why Do We Need It?**

*  **Text Preprocessing:** When you scrape content from a webpage, the raw text often contains unwanted characters, like punctuation, numbers, or special symbols, that you may want to remove before analysis. Regular expressions allow you to clean and preprocess this text by defining patterns to match and remove or replace unwanted parts.

*  **Example Use:** If you want to remove everything except letters and spaces from the text, re makes it possible with a simple pattern.

   **text = re.sub(r'[^a-zA-Z\s]', '', text)**  # Removes all characters that are not letters or spaces.

# 4. torch Library (PyTorch)

**What is it?**

*  torch is the core library of PyTorch, a popular open-source deep learning framework. It provides powerful tools for working with tensors (multi-dimensional arrays) and building neural networks.

**Why Do We Need It?**

*  **Running Neural Networks:** In this project, you're using a pre-trained BERT model, which is a type of neural network. PyTorch provides the infrastructure to load, manipulate, and run this model, allowing you to generate embeddings from text data.

*  **Example Use:** When you want to use BERT to convert text into numerical embeddings, PyTorch manages the entire process, from loading the model to performing the computations needed to generate the embeddings.

   **import torch**  # Required to work with the BERT model, which is implemented in PyTorch.

# 5. transformers from Hugging Face

**What is it?**
*  transformers is a library provided by Hugging Face that makes it easy to use pre-trained models like BERT, GPT, and others. It includes tools for tokenizing text (converting words into numerical tokens that the model can understand) and loading the models themselves.

**Why Do We Need It?**
*  **BERT Model and Tokenization:** BERT requires text to be tokenized before it can generate embeddings. The transformers library provides a BertTokenizer to handle this tokenization and a BertModel to load and run the BERT model.

*  **Example Use:** When you input text, the tokenizer converts it into tokens that the BERT model can process. The model then takes these tokens and generates embeddings, which are numerical representations of the text.

   **from transformers import BertTokenizer, BertModel**

   **tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')**  # Loads the BERT tokenizer.

   **model = BertModel.from_pretrained('bert-base-uncased')**  # Loads the pre-trained BERT model.


In [None]:
import requests  # Library to make HTTP requests to access website content
from bs4 import BeautifulSoup  # Library to parse HTML content from websites
import re  # Regular expressions library to clean and preprocess text
import torch  # PyTorch library, used for working with neural networks
from transformers import BertTokenizer, BertModel  # Hugging Face Transformers library, used for the BERT model and tokenizer
import numpy as np  # NumPy library, used for handling arrays and numerical operations


In [None]:
url = "https://thatware.co/"
# Send an HTTP request to the provided URL
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
soup

<!DOCTYPE html>

<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en-US"> <![endif]-->
<!--[if gt IE 9]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head><meta charset="utf-8"/><script>if(navigator.userAgent.match(/MSIE|Internet Explorer/i)||navigator.userAgent.match(/Trident\/7\..*?rv:11/i)){var href=document.location.href;if(!href.match(/[?&]nowprocket/)){if(href.indexOf("?")==-1){if(href.indexOf("#")==-1){document.location.href=href+"?nowprocket=1"}else{document.location.href=href.replace("#","?nowprocket=1#")}}else{if(href.indexOf("#")==-1){document.location.href=href+"&nowprocket=1"}else{document.location.href=href.replace("#","&nowprocket=1#")}}}}</script><script>(()=>{class RocketLazyLoadScripts{constructor(){this.v="1.2.6",this.triggerEvents=["keydown","mousedown","mousemove","touchmove","touchstart","touchend","wheel"],this.userEventHandler=this.t.bind(this),this.touchStartHandler=this.i.bind(this),this.touchMoveHandler=this.o.bind(this),this.touchEndHandler=th

# paragraphs = soup.find_all('p')

This line of code is using the BeautifulSoup library in Python to parse HTML content and extract all paragraph elements from a webpage. Here's what's happening:

**1. soup:** This is an object that represents the parsed HTML content of the webpage. It's created by using the BeautifulSoup library to load the HTML content into a Python object.

**2. find_all():** This is a method of the soup object that finds all occurrences of a specific HTML element or elements in the parsed content.

**3. ('p'):** This is the argument passed to the find_all() method, specifying that we want to find all paragraph elements (`<p>`) in the HTML content.

So, when we combine them,
**soup.find_all('p')** returns a list of all paragraph elements found in the HTML content.

*  The resulting paragraphs variable will be a list of BeautifulSoup Tag objects, each representing a paragraph element. You can then iterate over this list to access the text content of each paragraph, like this:

  **for paragraph in paragraphs: print(paragraph.text)**

  



In [None]:
# Find all paragraphs in the webpage
paragraphs = soup.find_all('p')
paragraphs

[<p>$ Revenue<br/>Generated via SEO</p>,
 <p>Qualified Leads<br/>Generated</p>,
 <p aria-atomic="true" aria-live="polite" role="status"></p>,
 <p><span class="wpcf7-form-control-wrap" data-name="full-name"><input aria-invalid="false" aria-required="true" class="wpcf7-form-control wpcf7-text wpcf7-validates-as-required" name="full-name" placeholder="Name" size="40" type="text" value=""/></span><span class="wpcf7-form-control-wrap" data-name="email-address"><input aria-invalid="false" aria-required="true" class="wpcf7-form-control wpcf7-email wpcf7-validates-as-required wpcf7-text wpcf7-validates-as-email" name="email-address" placeholder="Email Address" size="40" type="email" value=""/></span><span class="wpcf7-form-control-wrap" data-name="web-url"><input aria-invalid="false" aria-required="true" class="wpcf7-form-control wpcf7-text wpcf7-validates-as-required" name="web-url" placeholder="Website URL" size="40" type="text" value=""/></span><input class="wpcf7-form-control wpcf7-submit 

# text_content = ' '.join([para.get_text() for para in paragraphs])

**1. [para.get_text() for para in paragraphs]:** This is a list comprehension that iterates over each paragraph element in the paragraphs list.

**2. para.get_text():** For each paragraph element, this method extracts the text content of the element, without any HTML tags.

**3. [...]:** The list comprehension creates a new list containing the text content of each paragraph element.

**4. ' '.join(...):** This method takes the list of text content and joins them together into a single string, separated by spaces (' ').

* The resulting text_content variable will be a single string containing the text content of all paragraph elements, separated by spaces.

*  For example, if the paragraphs list contains three paragraph elements with the following text content:

  **['This is paragraph 1.', 'This is paragraph 2.', 'This is paragraph 3.']**

*  The text_content variable will be:

   **'This is paragraph 1. This is paragraph 2. This is paragraph 3.'**


In [None]:

# Combine all paragraph texts into one large string
text_content = ' '.join([para.get_text() for para in paragraphs])
text_content

'$ RevenueGenerated via SEO Qualified LeadsGenerated  \n 8 years ago, we embarked on a journey to unravel the intricacies of the Google algorithm—a cryptic enigma begging to be deciphered. Consider it akin to unlocking a closely guarded secret, comparable only to the recipe of Coca Cola or the security measures surrounding the Crown Jewels of London. To traverse the Google maze, we decided to rewrite the rules and carve our own path. Our strategy? Develop proprietary AI algorithms to adeptly monitor and navigate the evolving landscape of the Google algorithm. To date, we\'ve pioneered an impressive portfolio of 753+ unique AI SEO algorithms, elevating the effectiveness and efficiency of our work. While SEO teams globally have traditionally relied on three key strategies—on-site SEO optimization, backlink building, and content creation and optimization—we at Thatware AI SEO have rewritten the playbook. Picture this scenario: Your company aspires to secure a coveted spot on page 1 for a 

# Removing Special Characters and Digits

**text = re.sub(r'[^a-zA-Z\s]', '', text)**

**Explanation:**

*  **re.sub():** This function from the re (regular expression) module is used to search for patterns in a string and replace them with something else.

*  **r'[^a-zA-Z\s]':** This is a regular expression pattern. Let’s break it down:

*  **^:** Means "not" in regular expressions, so [^...] means "anything that is not..."

*  **a-zA-Z:** Represents all uppercase and lowercase letters.

*  **\s:** Represents whitespace characters (like spaces, tabs, and newlines).

*    **'' (empty string):** This tells re.sub() to replace any character that is not a letter or space with nothing, effectively removing it.

**Why It’s Useful:**

*  This step removes anything from the text that isn’t a letter or space. This includes **punctuation** **(like commas, periods)**, **digits (like 123)**, and **special characters** **(like @, #, $).**

**Example:**
*  **Input: "Hello, World! 2023 is the year of AI."**

*  **Output: "Hello World  is the year of AI"** (two spaces remain where "2023" was removed; the next step collapses them)


In [None]:
# Remove everything that isn't a letter or space
text = re.sub(r'[^a-zA-Z\s]', '', text_content)
text

' RevenueGenerated via SEO Qualified LeadsGenerated  \n  years ago we embarked on a journey to unravel the intricacies of the Google algorithma cryptic enigma begging to be deciphered Consider it akin to unlocking a closely guarded secret comparable only to the recipe of Coca Cola or the security measures surrounding the Crown Jewels of London To traverse the Google maze we decided to rewrite the rules and carve our own path Our strategy Develop proprietary AI algorithms to adeptly monitor and navigate the evolving landscape of the Google algorithm To date weve pioneered an impressive portfolio of  unique AI SEO algorithms elevating the effectiveness and efficiency of our work While SEO teams globally have traditionally relied on three key strategiesonsite SEO optimization backlink building and content creation and optimizationwe at Thatware AI SEO have rewritten the playbook Picture this scenario Your company aspires to secure a coveted spot on page  for a strategic keyword Like clock

# Removing Extra Spaces
**text = re.sub(r'\s+', ' ', text).strip()**

**Explanation:**

*  **re.sub(r'\s+', ' ', text):** This replaces one or more whitespace characters (like spaces, tabs, newlines) with a single space.

*  **\s+:** The + means "one or more", so this pattern matches any sequence of one or more whitespace characters.

*  **' ' (single space):** Replaces the matched sequence with a single space.

*  **.strip():** This removes any leading or trailing whitespace from the text.

**Why It’s Useful:**

*  This step ensures that the text is neatly formatted with only single spaces between words and no unnecessary spaces at the beginning or end of the text.

**Example:**

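*   **Input: "  Hello   World   AI  "**

*   **Output: "Hello World AI"** (the code cell below also lowercases the text first, so there the result would be "hello world ai")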
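The cleaning steps can be sketched as one helper. This is a minimal sketch, not the notebook's own function: the name `clean_text` and its `replacement` parameter are assumptions added for illustration. With `replacement=""` it mirrors the notebook; with `replacement=" "` it shows an alternative that keeps words apart when punctuation was the only thing between them, which is why the scraped output contains fused words such as "algorithma".

```python
import re


def clean_text(raw, replacement=""):
    """Keep letters and whitespace, lowercase, then collapse whitespace.

    replacement="" mirrors the notebook; replacement=" " is an alternative
    that avoids gluing together words separated only by punctuation.
    """
    letters_only = re.sub(r"[^a-zA-Z\s]", replacement, raw)
    lowered = letters_only.lower()
    return re.sub(r"\s+", " ", lowered).strip()


# \u2014 is the long dash that appears in the scraped page text.
sample = "the Google algorithm\u2014a cryptic enigma, 753+ times!"

as_in_notebook = clean_text(sample)
# 'the google algorithma cryptic enigma times'

with_spaces = clean_text(sample, replacement=" ")
# 'the google algorithm a cryptic enigma times'
```

Which variant is preferable depends on the text: substituting a space also splits contractions ("we've" becomes "we ve"), while removing the character fuses them ("weve").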

In [None]:

# Convert all text to lowercase to ensure uniformity
text = text.lower()

# Remove extra spaces and trim the text
text = re.sub(r'\s+', ' ', text).strip()

text

'revenuegenerated via seo qualified leadsgenerated years ago we embarked on a journey to unravel the intricacies of the google algorithma cryptic enigma begging to be deciphered consider it akin to unlocking a closely guarded secret comparable only to the recipe of coca cola or the security measures surrounding the crown jewels of london to traverse the google maze we decided to rewrite the rules and carve our own path our strategy develop proprietary ai algorithms to adeptly monitor and navigate the evolving landscape of the google algorithm to date weve pioneered an impressive portfolio of unique ai seo algorithms elevating the effectiveness and efficiency of our work while seo teams globally have traditionally relied on three key strategiesonsite seo optimization backlink building and content creation and optimizationwe at thatware ai seo have rewritten the playbook picture this scenario your company aspires to secure a coveted spot on page for a strategic keyword like clockwork you

# Loading the BERT Model and Tokenizer
Loading the tokenizer and the model prepares the two tools needed to use **BERT (Bidirectional Encoder Representations from Transformers)** for natural language processing tasks.

**Here's how each step works:**

**1. Loading the BERT Tokenizer**

*  **tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')**

**Explanation:**

*  **BertTokenizer:** This is a class provided by the Hugging Face Transformers library. The tokenizer is responsible for converting raw text into tokens that the BERT model can process.

*  **.from_pretrained('bert-base-uncased'):** This method loads a pre-trained BERT tokenizer from Hugging Face's model hub.


*  **'bert-base-uncased':** Refers to the specific version of BERT being loaded. "Base" indicates it's the smaller version of BERT (12 layers, 110 million parameters), and "uncased" means the tokenizer doesn't differentiate between uppercase and lowercase letters (it converts everything to lowercase).

**Why It’s Useful:**

BERT cannot directly process raw text. The tokenizer splits the text into smaller units called "tokens," which can be words, subwords, or even characters. These tokens are then mapped to unique numerical IDs that the model understands.

**Example:**

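* **Input Text:** "Machine learning is fascinating."
* **Tokenizer Output:** ['machine', 'learning', 'is', 'fascinating', '.']
*  **Token IDs:** [2535, 4084, 2003, 13136, 1012] (illustrative values; run the tokenizer to see the actual IDs)

   Each word or punctuation mark is converted into one or more tokens, and each token is then mapped to a unique ID. When the tokenizer is called on a text, it also adds the special tokens [CLS] (ID 101) at the start and [SEP] (ID 102) at the end.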
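To show how a word can end up as several subword tokens, here is a toy version of the greedy longest-match-first splitting that WordPiece tokenizers such as BERT's use. It is a simplified sketch, not the library's code: the function name `wordpiece` and the tiny `toy_vocab` are hypothetical, and "fascinating" is deliberately left out of the toy vocabulary so that it has to be split.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical mini-vocabulary; the real bert-base-uncased one has 30,522 entries.
toy_vocab = {"machine", "learning", "is", "fascinat", "##ing", "un", "##aff", "##able", "."}

words = ["machine", "learning", "is", "fascinating", "."]
tokens = [piece for word in words for piece in wordpiece(word, toy_vocab)]
# ['machine', 'learning', 'is', 'fascinat', '##ing', '.']
```

The `##` prefix marks a piece that continues the previous one, so the original word can be reassembled from its tokens.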

   






In [None]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer


BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

#  Loading the Pre-Trained BERT Model

*  **model = BertModel.from_pretrained('bert-base-uncased')**

**Explanation:**

*  **BertModel:** This is a class from the Hugging Face Transformers library that represents the BERT model itself.

*  **.from_pretrained('bert-base-uncased'):** This method loads the pre-trained BERT model corresponding to the tokenizer.


*  **'bert-base-uncased':** This matches the tokenizer version to ensure compatibility between the tokens generated and the model’s expectations.

**Why It’s Useful:**

*  The BERT model, once loaded, can take the token IDs generated by the tokenizer and convert them into embeddings (dense vectors). These embeddings capture the meaning of the words in context.

**Example:**

*  **Input Token IDs:** [2535, 4084, 2003, 13136, 1012] (illustrative values)

*  **Model Output:** A set of vectors (one for each token) representing the meaning of each token in the context of the sentence. These vectors are used for various tasks like text classification, similarity measurement, or sentiment analysis.

In [None]:
# Load the pre-trained BERT model
model = BertModel.from_pretrained('bert-base-uncased')
model


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

#  Tokenizing the Input Text

**inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)**

**Explanation:**

*  **tokenizer(text, return_tensors='pt', max_length=512, truncation=True):**

*  **tokenizer(text):** The tokenizer, which was loaded earlier, processes the input text. It breaks down the text into smaller units (tokens) and converts them into numerical IDs that the BERT model can understand.

*  **return_tensors='pt':** This specifies that the tokenizer should return the tokens in the format required by PyTorch (hence 'pt'). PyTorch is the deep learning framework used by the BERT model.

*  **max_length=512:** BERT models can only process up to 512 tokens at a time. If the text is longer, it will be truncated.

*  **truncation=True:** If the input text exceeds the maximum length, it will be cut off at 512 tokens to avoid errors.
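To make the effect of `max_length=512` with `truncation=True` concrete, the sketch below imitates it on plain lists of token IDs. It is a simplified stand-in for a single text with right-side truncation, not the tokenizer's implementation; the function name `build_input_ids` and the example IDs are made up. The point it illustrates is that the 512 limit includes the two special tokens, so at most 510 content tokens survive and everything after them is dropped.

```python
CLS_ID = 101  # [CLS], as listed in the tokenizer output above
SEP_ID = 102  # [SEP]


def build_input_ids(token_ids, max_length=512):
    """Wrap token IDs in [CLS] ... [SEP], dropping tokens from the right
    so the total length never exceeds max_length."""
    room_for_tokens = max_length - 2  # two slots are reserved for [CLS] and [SEP]
    kept = token_ids[:room_for_tokens]
    return [CLS_ID] + kept + [SEP_ID]


long_page = list(range(1000, 1600))  # 600 made-up token IDs
short_page = [2003, 1012]            # a short input that needs no truncation

long_ids = build_input_ids(long_page)    # 512 IDs: the last 90 content tokens are gone
short_ids = build_input_ids(short_page)  # [101, 2003, 1012, 102]
```

For a long web page this means the resulting embedding reflects only the beginning of the text.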

**Why It’s Useful:**

This step converts raw text into a format (tokens) that the BERT model can process. Without tokenization, BERT wouldn’t understand the input text.

**Example:**

*  **Input Text:** "Machine learning is fascinating."

**Tokenization Output:**

*  **Tokens:** ['[CLS]', 'machine', 'learning', 'is', 'fascinating', '.', '[SEP]']

*  **Token IDs:** [101, 2535, 4084, 2003, 13136, 1012, 102] (illustrative values; 101 and 102 are the special tokens [CLS] and [SEP] that the tokenizer adds)
*  **Tensor Format:** The token IDs are returned as a PyTorch tensor, ready for input to the BERT model.




In [None]:
# Tokenize the input text
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
inputs

{'input_ids': tensor([[  101,  6599,  6914, 16848,  3081, 27457,  4591,  5260,  6914, 16848,
          2086,  3283,  2057, 11299,  2006,  1037,  4990,  2000,  4895, 22401,
          2140,  1996, 20014, 14735,  9243,  1997,  1996,  8224,  9896,  2050,
         26483, 26757, 12858,  2000,  2022, 11703, 11514, 27190,  5136,  2009,
         17793,  2000, 19829,  2075,  1037,  4876, 13802,  3595, 12435,  2069,
          2000,  1996, 17974,  1997, 16787, 15270,  2030,  1996,  3036,  5761,
          4193,  1996,  4410, 15565,  1997,  2414,  2000, 20811,  1996,  8224,
         15079,  2057,  2787,  2000,  2128, 26373,  1996,  3513,  1998,  2482,
          3726,  2256,  2219,  4130,  2256,  5656,  4503, 16350,  9932, 13792,
          2000, 26398,  2135,  8080,  1998, 22149,  1996, 20607,  5957,  1997,
          1996,  8224,  9896,  2000,  3058,  2057,  3726, 16193,  2019,  8052,
         11103,  1997,  4310,  9932, 27457, 13792,  3449, 13331,  3436,  1996,
         12353,  1998,  8122,  1997,  

#  Disabling Gradient Calculation for Efficiency

**with torch.no_grad():**

**Explanation:**

*  **torch.no_grad():** This context manager in PyTorch temporarily disables gradient calculations. Gradients are used during model training to update weights, but since we're only interested in using the model for inference (not training it), we don’t need gradients.

*  **Efficiency:** Disabling gradients reduces memory usage and speeds up the computation, making the process more efficient.

**Why It’s Useful:**

*  When you’re only using the model for prediction or generating embeddings, calculating gradients is unnecessary and slows down the process. This step ensures that the model runs faster and uses less memory.

**Example:**
*  **Without no_grad():** The model would perform extra calculations to track gradients, which aren't needed here.

*  **With no_grad():** The model runs more efficiently, focusing only on generating the embeddings.

#   Getting the Output from BERT

**`outputs = model(**inputs)`**

**Explanation:**

*  **`model(**inputs)`:** The tokenized text is fed into the BERT model. The `**inputs` syntax unpacks the tokenizer's dictionary-like output into keyword arguments (`input_ids`, `token_type_ids`, and `attention_mask`), which is the form the model expects.


*  **Model Output:** BERT produces several outputs, but the one we're interested in is last_hidden_state, which contains the embeddings for each token in the input.
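The `**` unpacking is plain Python and can be tried without loading BERT. In this small sketch, `toy_model` is a hypothetical stand-in that only reports what it received, and the inputs are ordinary lists rather than tensors; the three keys are the ones the BERT tokenizer returns by default.

```python
def toy_model(input_ids, token_type_ids=None, attention_mask=None):
    """Stand-in for a model's forward call; it only reports what it received."""
    return {
        "num_tokens": len(input_ids),
        "got_mask": attention_mask is not None,
        "got_type_ids": token_type_ids is not None,
    }


# Same three keys the BERT tokenizer returns, with plain lists instead of tensors.
inputs = {
    "input_ids": [101, 2003, 1012, 102],
    "token_type_ids": [0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1],
}

unpacked = toy_model(**inputs)

# Equivalent call with every keyword argument written out by hand.
explicit = toy_model(
    input_ids=inputs["input_ids"],
    token_type_ids=inputs["token_type_ids"],
    attention_mask=inputs["attention_mask"],
)
```

Both calls pass exactly the same arguments; `**inputs` simply saves writing them out.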

**Why It’s Useful:**

*  This is where BERT processes the input text and generates the embeddings, which are dense vectors that represent the meaning of the text.

**Example:**
*  **Input Token IDs:** [2535, 4084, 2003, 13136, 1012] (illustrative values for "Machine learning is fascinating.")

*  **Model Output:** A set of vectors (one for each token) representing the meaning of each token in the context of the sentence.

#  Averaging the Embeddings to Get a Single Vector

**embeddings = outputs.last_hidden_state.mean(dim=1).numpy()**

**Explanation:**

*  **outputs.last_hidden_state:** This contains the embeddings for all tokens in the input. It’s a tensor of shape (batch size, number of tokens, 768), so each token’s embedding is a 768-dimensional vector.

*  **mean(dim=1):** This averages the embeddings across the sequence dimension (i.e., it averages the vectors for all tokens in the input). The result is a single vector that represents the entire input text.

*  **.numpy():** Converts the tensor to a NumPy array for easier manipulation and compatibility with other Python libraries.

**Why It’s Useful:**

*  Averaging the token embeddings gives a single, fixed-size vector that captures the overall meaning of the entire text. This is useful for comparing different texts, feeding the vector into another model, or performing tasks like similarity measurement.

**Example:**

*  **Token Embeddings:** Let's say the embeddings for the tokens were [v1, v2, v3, v4, v5].
*  **Averaged Embedding:** The function calculates the average of these vectors to get a single vector v_mean, which summarizes the meaning of the whole sentence.
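
The averaging step can be checked by hand on a toy array. This is a minimal sketch with made-up numbers: three tokens with four values each instead of BERT's 768, and NumPy's `mean(axis=1)` standing in for PyTorch's equivalent `mean(dim=1)`.

```python
import numpy as np

# Toy stand-in for last_hidden_state: 1 text, 3 tokens, 4 numbers per token
token_embeddings = np.array([[
    [1.0, 2.0, 3.0, 4.0],   # v1
    [3.0, 2.0, 1.0, 0.0],   # v2
    [2.0, 2.0, 2.0, 2.0],   # v3
]])
print(token_embeddings.shape)    # (1, 3, 4)

# Average across the token axis: one vector per text
sentence_embedding = token_embeddings.mean(axis=1)
print(sentence_embedding.shape)  # (1, 4)
print(sentence_embedding)        # [[2. 2. 2. 2.]]
```

The token axis disappears, so texts of any length end up as vectors of the same size.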

In [None]:
# Disable gradient calculation for efficiency (we don't need to train the model)
with torch.no_grad():
    # Get the output from BERT
    outputs = model(**inputs)
    # Average the embeddings to get a single vector for the text
    embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
embeddings

array([[-2.08914220e-01,  2.90331960e-01,  3.72634113e-01,
         1.07679982e-02,  4.10373807e-01,  1.01679064e-01,
         2.69965917e-01,  1.27174631e-01,  1.06049284e-01,
        -2.27307796e-01,  1.94186941e-01, -2.52300382e-01,
        -2.01816067e-01,  1.00820232e-02, -2.20977738e-02,
         5.86546063e-01,  1.65676743e-01,  5.36501184e-02,
        -2.33653396e-01,  4.48684514e-01,  1.93905815e-01,
         6.12458624e-02,  1.19915858e-01,  5.77895999e-01,
         3.81071746e-01,  2.28283238e-02, -2.01666743e-01,
        -3.91541719e-01, -2.67810166e-01,  2.46322472e-02,
         6.13871574e-01,  8.24265853e-02,  4.71602157e-02,
        -3.21649581e-01,  9.78195816e-02,  8.13781172e-02,
        -2.65979767e-01, -1.40730381e-01, -7.00944886e-02,
         3.04888844e-01, -5.15511572e-01, -2.26617634e-01,
        -3.42631936e-02,  3.44502442e-02, -3.45854878e-01,
        -5.27447224e-01, -8.45835060e-02,  1.75630935e-02,
         2.77050555e-01,  3.85022163e-02, -6.10150874e-0

**Embeddings Code:** Although it is more technical, the embeddings code enables deeper analysis of content similarity. It’s useful when the website owner or their technical team wants to examine the differences between pages at a semantic level.

In [None]:
# Import necessary libraries
import requests  # Library to make HTTP requests to access website content
from bs4 import BeautifulSoup  # Library to parse HTML content from websites
import re  # Regular expressions library to clean and preprocess text
import torch  # PyTorch library, used for working with neural networks
from transformers import BertTokenizer, BertModel  # Huggingface Transformers library, used for BERT model and tokenizer
import numpy as np  # NumPy library, used for handling arrays and numerical operations

# Step 1: Web Scraping - Get content from the website
def scrape_website(url):
    """
    This function takes a URL, sends a request to fetch the content,
    and extracts the text from all paragraph tags (<p>) in the HTML.
    """
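    response = requests.get(url, timeout=30)  # Send an HTTP request to the provided URL (give up after 30 seconds)
    response.raise_for_status()  # Stop early if the server returned an error instead of the page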
    soup = BeautifulSoup(response.text, 'html.parser')  # Parse the HTML content using BeautifulSoup
    paragraphs = soup.find_all('p')  # Find all paragraphs in the webpage
    text_content = ' '.join([para.get_text() for para in paragraphs])  # Combine all paragraph texts into one large string
    return text_content  # Return the combined text

# Step 2: Text Preprocessing - Clean up the text for analysis
def preprocess_text(text):
    """
    This function takes raw text as input and cleans it by removing
    special characters, digits, and extra spaces. It also converts
    the text to lowercase.
    """
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove everything that isn't a letter or space
    text = text.lower()  # Convert all text to lowercase to ensure uniformity
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces and trim the text
    return text  # Return the cleaned text

# Step 3: Load BERT Model and Tokenizer
def load_bert_model():
    """
    This function loads the pre-trained BERT model and its tokenizer.
    The tokenizer converts text into tokens that BERT can understand,
    and the model generates embeddings (vectors) from the tokens.
    """
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # Load the BERT tokenizer
    model = BertModel.from_pretrained('bert-base-uncased')  # Load the pre-trained BERT model
    return tokenizer, model  # Return both the tokenizer and model

# Step 4: Generate BERT Embeddings for the Text
def generate_bert_embeddings(text, tokenizer, model):
    """
    This function takes preprocessed text, tokenizes it using BERT tokenizer,
    and then uses the BERT model to generate embeddings (numerical representations)
    for the text. These embeddings capture the meaning of the words in context.
    """
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)  # Tokenize the input text
    with torch.no_grad():  # Disable gradient calculation for efficiency (we don't need to train the model)
        outputs = model(**inputs)  # Get the output from BERT
    embeddings = outputs.last_hidden_state.mean(dim=1).numpy()  # Average the embeddings to get a single vector for the text
    return embeddings  # Return the generated embeddings

# Step 5: Analyze the Content using BERT
def analyze_content(url):
    """
    This function combines all the previous steps to scrape the website,
    preprocess the content, generate BERT embeddings, and then print
    the embeddings for further analysis.
    """
    print(f"Processing website: {url}")

    raw_text = scrape_website(url)  # Fetch the text content from the website
    cleaned_text = preprocess_text(raw_text)  # Clean the text to make it ready for analysis

    tokenizer, model = load_bert_model()  # Load the BERT model and tokenizer

    embeddings = generate_bert_embeddings(cleaned_text, tokenizer, model)  # Generate BERT embeddings for the text

    print(f"BERT Embeddings for {url}:")
    print(embeddings)  # Print the BERT embeddings, which are vectors representing the text

    # At this point, you can use the embeddings for various NLP tasks,
    # such as similarity checks, clustering, or sentiment analysis.
    # This code focuses on generating the embeddings.

# Step 6: Main Function - Run the Analysis
def main():
    url = "https://thatware.co/"  # This is the website we want to analyze using BERT
    analyze_content(url)  # Run the content analysis for the provided URL

if __name__ == "__main__":
    main()  # Run the main function to execute the program


Processing website: https://thatware.co/
BERT Embeddings for https://thatware.co/:
[[-2.08914220e-01  2.90331960e-01  3.72634113e-01  1.07679982e-02
   4.10373807e-01  1.01679064e-01  2.69965917e-01  1.27174631e-01
   1.06049284e-01 -2.27307796e-01  1.94186941e-01 -2.52300382e-01
  -2.01816067e-01  1.00820232e-02 -2.20977738e-02  5.86546063e-01
   1.65676743e-01  5.36501184e-02 -2.33653396e-01  4.48684514e-01
   1.93905815e-01  6.12458624e-02  1.19915858e-01  5.77895999e-01
   3.81071746e-01  2.28283238e-02 -2.01666743e-01 -3.91541719e-01
  -2.67810166e-01  2.46322472e-02  6.13871574e-01  8.24265853e-02
   4.71602157e-02 -3.21649581e-01  9.78195816e-02  8.13781172e-02
  -2.65979767e-01 -1.40730381e-01 -7.00944886e-02  3.04888844e-01
  -5.15511572e-01 -2.26617634e-01 -3.42631936e-02  3.44502442e-02
  -3.45854878e-01 -5.27447224e-01 -8.45835060e-02  1.75630935e-02
   2.77050555e-01  3.85022163e-02 -6.10150874e-01  1.41489103e-01
  -6.14626110e-02 -1.49571896e-01  2.29616731e-01  7.469931

#  What is the Output?
**The output you received is a BERT embedding.**

*  **BERT Embedding:** Think of this as a numerical representation of the content on your website. BERT takes the text from your website, processes it, and transforms it into a vector, which is a list of numbers. No single number has a meaning on its own; taken together, they encode the meaning and context of the content.

#  What Does This Output Mean?
*  **Contextual Understanding:** The BERT model has taken the paragraph text of the page (up to the first 512 tokens, because the code truncates longer input) and condensed it into its core meaning. The embedding (the list of numbers) is a summary of that content. Unlike older models that might just count words, BERT takes into account the relationships between words, the context in which they appear, and the overall meaning of sentences and paragraphs.

*  **Embeddings:** The numbers are the coordinates of a point in a high-dimensional space (768 dimensions for bert-base-uncased). Content that is similar in meaning produces embeddings that lie close together in this space.

#  Understanding Embeddings and Content Similarity

**What Are Embeddings?**

*  **Embeddings:** Think of embeddings as a way to represent words, sentences, or entire pieces of content as numbers (specifically, as vectors). These numbers capture the meaning of the content in a way that computers can understand.

*  **Example:** Imagine you have two sentences: "I love coffee" and "I enjoy coffee." While the sentences use different words ("love" vs. "enjoy"), their meanings are very similar. Embeddings for these sentences would be similar because they capture the context and meaning, not just the words themselves.

**How Can You Compare Embeddings?**

*  **Similarity of Embeddings:** When we talk about comparing embeddings, we're looking at how "close" these numerical representations are to each other in a mathematical space. If two pieces of content have similar embeddings, it means that they are similar in meaning.


*  **Cosine Similarity:** A common way to measure how similar two embeddings are is by calculating the cosine similarity.

**This is a number between -1 and 1:**
*  **1:** The two embeddings point in the same direction, so the content is very similar.
*  **0:** The embeddings are unrelated (neither similar nor dissimilar).
*  **-1:** The embeddings point in opposite directions, so the content is very dissimilar.
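
The three reference values can be reproduced with small hand-made vectors. This is a minimal sketch using the dot-product-over-norms formula; the two-dimensional vectors are invented for illustration and are not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 2], [2, 4])    # one is a scaled copy of the other
perpendicular = cosine_similarity([1, 0], [0, 1])     # no overlap at all
opposite = cosine_similarity([1, 2], [-1, -2])        # exact mirror image

print(round(same_direction, 4), round(perpendicular, 4), round(opposite, 4))
# 1.0 0.0 -1.0
```

Note that `[1, 2]` and `[2, 4]` score 1 even though their lengths differ: cosine similarity compares direction only, not magnitude.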

**Example of Comparing Embeddings**

**Step-by-Step:**

*  **Generate Embeddings:** First, you generate embeddings for different pages on your website.

*  **Calculate Similarity:** Next, you calculate the cosine similarity between the embeddings of two different pages.

**Interpret the Similarity:**

*  **High Similarity:** If the cosine similarity is close to 1, the pages are very similar in content.
*  **Low Similarity:** If it's closer to 0, the pages are less similar.

# Why This Matters for a Website Owner

*  **Avoiding Redundancy:** If two pages have very high similarity, they may be covering the same ground, which could confuse search engines and users. In this case, you might consider merging the content or making one page focus on a different aspect of the topic.

*  **Identifying Gaps:** If pages on two topics that should be related show low similarity, your site may lack content that connects them. This signals an opportunity to create new content that bridges the gap.

#  Practical Steps to Avoid Redundancy and Identify Gaps
**a) Avoiding Redundancy**

*  **Step 1: Generate Embeddings for Your Pages:** Use the BERT model to create embeddings for the key pages on your website. This process turns your content into numerical vectors that capture their meaning.

*  **Step 2: Compare Pages:** Calculate the cosine similarity between the embeddings of different pages.

**Step 3: Analyze Similarity:**

*  **High Similarity:** If two pages are too similar (high cosine similarity), you should review them. Ask yourself:

    Are these pages targeting the same keywords or user intent?

    Can these pages be combined into one more comprehensive page?

*  **Action:** If you find that two pages are nearly identical in meaning, consider merging them. Alternatively, differentiate the content by focusing on different aspects or adding unique information to each page.
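
The redundancy check above can be sketched as a loop over every pair of pages. Everything specific in this example is an assumption made for illustration: the three-number "embeddings" are invented (real BERT vectors have 768 values), and the 0.95 cut-off is a hypothetical threshold that you would tune for your own site.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings keyed by page path (illustrative values only)
page_embeddings = {
    "/": [0.9, 0.1, 0.3],
    "/services/": [0.2, 0.8, 0.5],
    "/contact-us/": [0.21, 0.79, 0.52],
}

REDUNDANCY_THRESHOLD = 0.95  # hypothetical cut-off, not a standard value

# Collect every pair of pages whose similarity reaches the threshold
redundant_pairs = []
for page_a, page_b in combinations(page_embeddings, 2):
    score = cosine_similarity(page_embeddings[page_a], page_embeddings[page_b])
    if score >= REDUNDANCY_THRESHOLD:
        redundant_pairs.append((page_a, page_b))

print(redundant_pairs)  # [('/services/', '/contact-us/')]
```

The flagged pairs are candidates for manual review, not an automatic verdict that the pages should be merged.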

**b) Identifying Content Gaps**

*  **Step 1: Generate Embeddings for Existing Content:** As before, create embeddings for all your key pages.

*  **Step 2: Compare Across Topics:** Look at the similarity between different topics or sections of your site.

**Step 3: Look for Low Similarity Scores:**

*  **Low Similarity:** If two topics that should be related have a low similarity, this might indicate a gap in your content.

*  **Example:** If your site covers "SEO strategies" and "content marketing" but the embeddings show low similarity, it might mean that you're not effectively linking these two topics. This could be an opportunity to create content that connects these ideas.

*  **Action:** Create new content to bridge these gaps. This could be an article, a guide, or a series of posts that link these topics together.
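
One way to sketch the gap check is to list the topic pairs you expect to be related and flag those that score low. The topic names, the three-number vectors, the `expected_related` list, and the 0.6 threshold below are all hypothetical values chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for three topic pages (illustrative values only)
topic_embeddings = {
    "seo-strategies": [0.9, 0.2, 0.1],
    "content-marketing": [0.1, 0.3, 0.9],
    "link-building": [0.8, 0.3, 0.2],
}

# Pairs that, in your editorial judgement, ought to be closely related (an assumption)
expected_related = [
    ("seo-strategies", "content-marketing"),
    ("seo-strategies", "link-building"),
]

GAP_THRESHOLD = 0.6  # hypothetical cut-off, not a standard value

# A pair that should be related but scores low is a candidate content gap
content_gaps = []
for topic_a, topic_b in expected_related:
    score = cosine_similarity(topic_embeddings[topic_a], topic_embeddings[topic_b])
    if score < GAP_THRESHOLD:
        content_gaps.append((topic_a, topic_b))

print(content_gaps)  # [('seo-strategies', 'content-marketing')]
```

The list of expected pairs carries the domain knowledge here: a low score between two topics that were never meant to be related is not a gap.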


# Page Embedding Similarity Code (Cosine Similarity):

*  This code plays a critical role in evaluating the similarity between different pages on your website. It is designed to help you identify whether different pages on your site contain similar or redundant content. Understanding the level of similarity between pages is crucial for a number of reasons, particularly in the areas of Search Engine Optimization (SEO) and user experience.



In [None]:
import requests
from bs4 import BeautifulSoup
import re
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
from numpy import dot
from numpy.linalg import norm

# Step 1: Function to extract and clean the text from a webpage
def extract_text_from_url(url):
    """
    Extract the main content text from a given URL.
    """
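    response = requests.get(url, timeout=30)  # Give up if the server does not respond within 30 seconds
    response.raise_for_status()  # Stop early if the server returned an error instead of the page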
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract text content from <p> tags and join them together
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])

    return preprocess_text(text)

# Step 2: Text Preprocessing - Clean up the text for analysis
def preprocess_text(text):
    """
    Clean the extracted text by removing non-alphabet characters and digits.
    """
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove everything that isn't a letter or space
    text = text.lower()  # Convert all text to lowercase to ensure uniformity
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces and trim the text
    return text

# Step 3: Load BERT Model and Tokenizer
def load_bert_model():
    """
    Load the pre-trained BERT model and tokenizer.
    """
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    return tokenizer, model

# Step 4: Generate BERT Embeddings for the Text
def generate_bert_embeddings(text, tokenizer, model):
    """
    Generate BERT embeddings for the given text.
    """
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
    with torch.no_grad():  # Disable gradient calculation for efficiency
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).numpy().flatten()  # Flatten the embeddings to 1D array
    return embeddings

# Step 5: Calculate Cosine Similarity Between Two Embeddings
def calculate_cosine_similarity(embeddings1, embeddings2):
    """
    Calculate the cosine similarity between two sets of embeddings.
    """
    cosine_similarity = dot(embeddings1, embeddings2) / (norm(embeddings1) * norm(embeddings2))
    return cosine_similarity

# Step 6: Main function to compare the similarity of multiple web pages
def compare_pages_similarity(urls):
    """
    Compare the similarity between multiple pages by calculating cosine similarity of their embeddings.
    """
    tokenizer, model = load_bert_model()
    embeddings_list = []

    for url in urls:
        print(f"Processing website: {url}")
        text = extract_text_from_url(url)
        embeddings = generate_bert_embeddings(text, tokenizer, model)
        embeddings_list.append(embeddings)

    for i in range(len(embeddings_list)):
        for j in range(i + 1, len(embeddings_list)):
            similarity = calculate_cosine_similarity(embeddings_list[i], embeddings_list[j])
            print(f"Cosine Similarity between {urls[i]} and {urls[j]}: {similarity}")

# Replace these URLs with the actual URLs you'd like to compare
urls = [
    'https://thatware.co/',  # Page 1 URL
    'https://thatware.co/services/',  # Page 2 URL
    'https://thatware.co/contact-us/'  # Page 3 URL (optional)
]

# Run the comparison
compare_pages_similarity(urls)


Processing website: https://thatware.co/
Processing website: https://thatware.co/services/
Processing website: https://thatware.co/contact-us/
Cosine Similarity between https://thatware.co/ and https://thatware.co/services/: 0.8258922100067139
Cosine Similarity between https://thatware.co/ and https://thatware.co/contact-us/: 0.8225039839744568
Cosine Similarity between https://thatware.co/services/ and https://thatware.co/contact-us/: 0.992397129535675


# Analyzing the Output and Next Steps for a Website Owner

**1. Understanding the Cosine Similarity Results:**

**Cosine Similarity between https://thatware.co/ and https://thatware.co/services/: 0.8259**
*  This indicates that the homepage and the services page are fairly similar in meaning. They are not identical, but they overlap enough that some of their content may be redundant.

**Cosine Similarity between https://thatware.co/ and https://thatware.co/contact-us/: 0.8225**

*  The homepage and the contact page also show a high similarity score. Although the contact page should generally contain unique content focused on contact information, a high similarity score suggests that it may contain redundant content similar to the homepage.

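**Cosine Similarity between https://thatware.co/services/ and https://thatware.co/contact-us/: 0.9924**
*  This extremely high similarity score indicates that the text analyzed for the services and contact pages is almost identical, which is problematic as it can confuse both users and search engines. Keep in mind that the code embeds only the paragraph (&lt;p&gt;) text of each page, truncated to the first 512 tokens, so paragraphs shared by both pages can account for much of this score.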

**2. Recommendations for the Website Owner:**

**Content Differentiation:**

*  **Reduce Redundancy:** Given the high similarity scores, it's important to differentiate the content across these pages. The services and contact pages, in particular, need unique content that is distinct from each other and from the homepage.

*  **Homepage vs. Services Page:** The homepage should provide a broad overview of what the business offers, while the services page should delve into specific services in more detail. Avoid repeating the same content; instead, focus on unique value propositions and details that are service-specific.

*  **Contact Page:** Ensure that the contact page contains primarily contact information and location-specific details. If additional content is necessary, it should be focused on how to reach the company, possibly including a contact form, map, and specific instructions on how to contact or visit.




In [None]:
import requests
from bs4 import BeautifulSoup
import re
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
from numpy import dot
from numpy.linalg import norm

# Function to extract and clean the text from a webpage
def extract_text_from_url(url):
    """
    Extract the main content text from a given URL.
    """
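    response = requests.get(url, timeout=30)  # Give up if the server does not respond within 30 seconds
    response.raise_for_status()  # Stop early if the server returned an error instead of the page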
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract text content from <p> tags and join them together
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])

    return preprocess_text(text)

# Text Preprocessing - Clean up the text for analysis
def preprocess_text(text):
    """
    Clean the extracted text by removing non-alphabet characters and digits.
    """
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove everything that isn't a letter or space
    text = text.lower()  # Convert all text to lowercase to ensure uniformity
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces and trim the text
    return text

# Load BERT Model and Tokenizer
def load_bert_model():
    """
    Load the pre-trained BERT model and tokenizer.
    """
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    return tokenizer, model

# Generate BERT Embeddings for the Text
def generate_bert_embeddings(text, tokenizer, model):
    """
    Generate BERT embeddings for the given text.
    """
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
    with torch.no_grad():  # Disable gradient calculation for efficiency
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).numpy().flatten()  # Flatten the embeddings to 1D array
    return embeddings

# Calculate Cosine Similarity Between Two Embeddings
def calculate_cosine_similarity(embeddings1, embeddings2):
    """
    Calculate the cosine similarity between two sets of embeddings.
    """
    cosine_similarity = dot(embeddings1, embeddings2) / (norm(embeddings1) * norm(embeddings2))
    return cosine_similarity

# Function to compare similarity within multiple pages of each website
def compare_website_pages_similarity(website_urls):
    """
    Compare the similarity between different pages of each website.
    """
    tokenizer, model = load_bert_model()

    for site, urls in website_urls.items():
        print(f"Processing pages for website: {site}")
        embeddings_list = []

        for url in urls:
            print(f"Processing page: {url}")
            text = extract_text_from_url(url)
            embeddings = generate_bert_embeddings(text, tokenizer, model)
            embeddings_list.append((url, embeddings))

        for i in range(len(embeddings_list)):
            for j in range(i + 1, len(embeddings_list)):
                similarity = calculate_cosine_similarity(embeddings_list[i][1], embeddings_list[j][1])
                print(f"Cosine Similarity between {embeddings_list[i][0]} and {embeddings_list[j][0]}: {similarity:.4f}")
        print("\n" + "-"*50 + "\n")

# URLs of different pages for each website
website_urls = {
    'ThatWare': [
        'https://thatware.co/',
        'https://thatware.co/services/',
        'https://thatware.co/contact-us/'
    ],
    'Incrementors': [
        'https://www.incrementors.com/',
        'https://www.incrementors.com/services/',
        'https://www.incrementors.com/contact-us/'
    ],
    'Techwebers': [
        'https://www.techwebers.com/',
        'https://www.techwebers.com/seo-services/',
        'https://www.techwebers.com/contact-us/'
    ],
    'SEO Tech Experts': [
        'https://www.seotechexperts.com/seo-agency-india.html',
        'https://www.seotechexperts.com/contact-us.html',
        'https://www.seotechexperts.com/services.html'
    ]
}

# Run the comparison
compare_website_pages_similarity(website_urls)


Processing pages for website: ThatWare
Processing page: https://thatware.co/
Processing page: https://thatware.co/services/
Processing page: https://thatware.co/contact-us/
Cosine Similarity between https://thatware.co/ and https://thatware.co/services/: 0.8259
Cosine Similarity between https://thatware.co/ and https://thatware.co/contact-us/: 0.8225
Cosine Similarity between https://thatware.co/services/ and https://thatware.co/contact-us/: 0.9924

--------------------------------------------------

Processing pages for website: Incrementors
Processing page: https://www.incrementors.com/
Processing page: https://www.incrementors.com/services/
Processing page: https://www.incrementors.com/contact-us/
Cosine Similarity between https://www.incrementors.com/ and https://www.incrementors.com/services/: 0.9659
Cosine Similarity between https://www.incrementors.com/ and https://www.incrementors.com/contact-us/: 0.8982
Cosine Similarity between https://www.incrementors.com/services/ and https

# Understanding Cosine Similarity

**Cosine Similarity is a number between -1 and 1** that tells you how similar two pieces of content are. All the scores in this analysis are positive, so they can be read on a scale from 0 to 1:

* 1 means the contents are very similar or almost identical.

* 0 means the contents are unrelated.

# What the Scores Mean

**1. Cosine Similarity between Homepage and Services Page: 0.8259**

* **What it means:** The homepage and the services page have quite a bit of similar content. This isn’t always a bad thing, but if the content is too similar, it might confuse search engines and users.
* **Why it matters:** You want each page to have unique, focused content. If the services page repeats a lot of what’s on the homepage, it might be less effective at targeting specific keywords related to your services.

**2. Cosine Similarity between Homepage and Contact Page: 0.8225**

*  **What it means:** The homepage and the contact page also have a lot of similar content. Usually, the contact page should be more unique, focusing on how users can get in touch with you, rather than repeating what’s on the homepage.
* **Why it matters:** If your contact page has too much content similar to the homepage, it might dilute its purpose. The contact page should clearly provide contact details and a form or information on how to reach you.

**3. Cosine Similarity between Services Page and Contact Page: 0.9924**

*  **What it means:** The services and contact pages are almost identical in content. This is very high and likely indicates that the content on these pages is nearly the same.
* **Why it matters:** This is problematic because these pages should serve different purposes. Having almost identical content on both pages could hurt your SEO (Search Engine Optimization) and confuse visitors who expect different information on each page.

**What’s Good and What’s Not**

**Good Similarity:**

*  Similar pages can be good if they share some common elements but also have distinct, focused content. For example, it’s okay if the homepage and services page have a few overlapping sections (like a brief introduction), but most of the content should be different to target different user intents.

**Bad Similarity:**

*  Too much similarity between pages is bad because it can lead to content redundancy. This means that search engines might have a hard time figuring out which page to show for specific search queries. It can also lead to a poor user experience if visitors see the same information repeated on different pages.

**Good Dissimilarity:**

*  Dissimilarity is good when it reflects that each page has a unique purpose. **For example,** the contact page should be quite different from the services page, focusing solely on how to reach the business.



# What Should You Do?

* **Review the Content:** Look at the services and contact pages and consider how they can be made more distinct from each other and from the homepage. Each page should have a clear, unique focus.

* **Optimize for SEO:** Ensure that each page targets different keywords or phrases that are relevant to its specific content.

* **Improve User Experience:** Make sure that when users visit different pages, they find unique and relevant information that matches their needs at that point in their journey on your website.


**The Keyword Similarity Code** is designed to analyze the most important keywords on a website and determine how similar or different they are to each other. This analysis is crucial for website owners and SEO specialists because it helps in understanding how effectively the top keywords are being used across the site, ensuring that the content is well-optimized for search engines without unnecessary repetition or redundancy.




In [None]:
import requests  # Library to make HTTP requests to access website content
from bs4 import BeautifulSoup  # Library to parse HTML content from websites
import re  # Regular expressions library to clean and preprocess text
import torch  # PyTorch library, used for working with neural networks
from transformers import BertTokenizer, BertModel  # Huggingface Transformers library, used for BERT model and tokenizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # Predefined list of English stopwords
from collections import Counter  # Used for counting the frequency of words
from numpy import dot  # Used for calculating the dot product of two vectors
from numpy.linalg import norm  # Used for calculating the norm (magnitude) of a vector

# Step 1: Extract and clean the text from a webpage
def extract_text_from_url(url):
    """
    Extract the main content text from a given URL.

    1. Send an HTTP request to the website using the URL.
    2. Parse the website's HTML content.
    3. Find all the paragraph tags (<p>) in the HTML.
    4. Extract and combine the text from these paragraphs into a single string.
    5. Return the cleaned text by calling the preprocess_text function.
    """
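    response = requests.get(url, timeout=30)  # Make a request to the website (give up after 30 seconds)
    response.raise_for_status()  # Raise an error if the page could not be fetched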
    soup = BeautifulSoup(response.text, 'html.parser')  # Parse the HTML content

    # Extract text from all <p> tags and join them together into a single string
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])

    # Preprocess the extracted text to remove unwanted characters and words
    return preprocess_text(text)

# Step 2: Text Preprocessing - Clean up the text for analysis
def preprocess_text(text):
    """
    Clean the extracted text by removing non-alphabet characters, stopwords, and digits.

    1. Remove all characters that are not letters or spaces.
    2. Convert the text to lowercase for uniformity.
    3. Split the text into individual words.
    4. Remove common stopwords (e.g., "the", "and") that don't carry significant meaning.
    5. Join the cleaned words back into a single string.
    6. Return the cleaned text.
    """
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and spaces
    text = text.lower()  # Convert to lowercase to standardize the text
    words = text.split()  # Split the text into a list of words
    words = [word for word in words if word not in ENGLISH_STOP_WORDS]  # Remove stopwords
    text = ' '.join(words)  # Join the words back into a single string
    return text

# Step 3: Load BERT Model and Tokenizer
def load_bert_model():
    """
    Load the pre-trained BERT model and tokenizer.

    1. Load the BERT tokenizer, which converts text into tokens that the BERT model can process.
    2. Load the BERT model, which generates embeddings (numerical representations) for the tokens.
    3. Return both the tokenizer and the model.
    """
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # Load the BERT tokenizer
    model = BertModel.from_pretrained('bert-base-uncased')  # Load the pre-trained BERT model
    return tokenizer, model  # Return both the tokenizer and model

# Step 4: Generate BERT Embeddings for a word
def generate_bert_embeddings(word, tokenizer, model):
    """
    Generate BERT embeddings for the given word.

    1. Tokenize the input word using the BERT tokenizer.
    2. Pass the tokens through the BERT model to get the embeddings.
    3. Average the embeddings across all tokens to get a single vector representation.
    4. Return the flattened (1D) vector of the word embeddings.
    """
    inputs = tokenizer(word, return_tensors='pt', max_length=512, truncation=True)  # Tokenize the word
    with torch.no_grad():  # Disable gradient calculation for efficiency
        outputs = model(**inputs)  # Get the output from BERT
    embeddings = outputs.last_hidden_state.mean(dim=1).numpy().flatten()  # Flatten to 1D vector
    return embeddings

# Step 5: Calculate Cosine Similarity Between Two Embeddings
def calculate_cosine_similarity(embedding1, embedding2):
    """
    Calculate the cosine similarity between two sets of embeddings.

    1. Compute the dot product of the two embedding vectors.
    2. Compute the norm (magnitude) of each embedding vector.
    3. Divide the dot product by the product of the norms to get the cosine similarity.
    4. Return the cosine similarity score.
    """
    cosine_similarity = dot(embedding1, embedding2) / (norm(embedding1) * norm(embedding2))
    return cosine_similarity

# Step 6: Extract Top Keywords from the Text
def extract_top_keywords(text, limit=10):
    """
    Extract top keywords based on their frequency in the text.

    1. Split the cleaned text into individual words.
    2. Count the frequency of each word using the Counter class.
    3. Select the top 'limit' keywords based on their frequency.
    4. Return the list of top keywords.
    """
    words = text.split()  # Split the text into words
    word_freq = Counter(words)  # Count the frequency of each word
    top_keywords = [word for word, freq in word_freq.most_common(limit)]  # Get the most common words
    return top_keywords

# Step 7: Compare Similarity and Dissimilarity Between Top Keywords
def compare_top_keyword_similarity(url, keyword_limit=10):
    """
    Compare the similarity and dissimilarity between top keywords extracted from the website.

    1. Load the BERT model and tokenizer.
    2. Extract and preprocess the text from the given URL.
    3. Extract the top keywords from the cleaned text based on the specified limit.
    4. For each pair of top keywords:
        a. Generate BERT embeddings for both keywords.
        b. Calculate the cosine similarity between the embeddings.
        c. Categorize the similarity score as high, moderate, or low.
    5. Print the similarity or dissimilarity between each pair of keywords.
    """
    tokenizer, model = load_bert_model()  # Load BERT model and tokenizer

    # Extract and preprocess text from the website
    print(f"Processing website: {url}")
    text = extract_text_from_url(url)

    # Extract top keywords based on the specified limit
    top_keywords = extract_top_keywords(text, limit=keyword_limit)
    print(f"Top Keywords: {top_keywords}")

    # Generate each keyword's embedding once, rather than once per pair
    embeddings = {kw: generate_bert_embeddings(kw, tokenizer, model) for kw in top_keywords}

    # Compare each pair of top keywords for similarity and dissimilarity
    for i in range(len(top_keywords)):
        for j in range(i + 1, len(top_keywords)):
            kw1, kw2 = top_keywords[i], top_keywords[j]
            similarity = calculate_cosine_similarity(embeddings[kw1], embeddings[kw2])

            # Above 0.5 is reported as high, below 0.2 as dissimilar, otherwise moderate
            if similarity > 0.5:
                label = "High Similarity"
            elif similarity < 0.2:
                label = "High Dissimilarity"
            else:
                label = "Moderate Similarity"
            print(f"Similarity between '{kw1}' and '{kw2}': {similarity:.4f} ({label})")

# Run the comparison for a specific website with a specified number of top keywords
website_url = 'https://thatware.co/'  # Replace this with the desired website URL
keyword_limit = 10  # Set the limit for the number of top keywords
compare_top_keyword_similarity(website_url, keyword_limit)


Processing website: https://thatware.co/
Top Keywords: ['seo', 'services', 'ai', 'advanced', 'algorithms', 'search', 'marketing', 'data', 'thatware', 'online']
Similarity between 'seo' and 'services': 0.6185 (High Similarity)
Similarity between 'seo' and 'ai': 0.8150 (High Similarity)
Similarity between 'seo' and 'advanced': 0.6690 (High Similarity)
Similarity between 'seo' and 'algorithms': 0.7094 (High Similarity)
Similarity between 'seo' and 'search': 0.8106 (High Similarity)
Similarity between 'seo' and 'marketing': 0.6889 (High Similarity)
Similarity between 'seo' and 'data': 0.6464 (High Similarity)
Similarity between 'seo' and 'thatware': 0.6807 (High Similarity)
Similarity between 'seo' and 'online': 0.7074 (High Similarity)
Similarity between 'services' and 'ai': 0.6401 (High Similarity)
Similarity between 'services' and 'advanced': 0.6572 (High Similarity)
Similarity between 'services' and 'algorithms': 0.6946 (High Similarity)
Similarity between 'services' and 'search': 0.68

# Example Output Analysis:

**Keywords and Their Similarity Scores:**

*  Similarity between 'seo' and 'services': 0.6185 (High Similarity)
*  Similarity between 'ai' and 'algorithms': 0.7945 (High Similarity)
*  Similarity between 'data' and 'development': 0.8101 (High Similarity)
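
The labels in parentheses follow from the thresholds in `compare_top_keyword_similarity`: a score above 0.5 is reported as high similarity, a score below 0.2 as high dissimilarity, and anything in between as moderate. A minimal sketch of that rule, applied to the three scores listed above, might look like this (the helper name `categorize_similarity` is an assumption for illustration and is not part of the original code):

```python
def categorize_similarity(score, high=0.5, low=0.2):
    """Map a cosine similarity score to the label printed by the comparison loop."""
    if score > high:
        return "High Similarity"
    if score < low:
        return "High Dissimilarity"
    return "Moderate Similarity"

# Scores copied from the example output above
example_scores = {
    ("seo", "services"): 0.6185,
    ("ai", "algorithms"): 0.7945,
    ("data", "development"): 0.8101,
}

for (kw1, kw2), score in example_scores.items():
    print(f"'{kw1}' and '{kw2}': {score:.4f} ({categorize_similarity(score)})")
```

Because the cut-offs are plain parameters, they can be tightened or relaxed to suit a particular site without touching the rest of the logic.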

**1. Good Keyword Similarity Examples:**

**'SEO' and 'Services' (0.6185):**

* **Why It's Good:** These keywords are closely related, especially if your business offers SEO services. Having content that highlights both keywords is beneficial because it aligns with user intent. When people search for "SEO services," they expect to find content that discusses both SEO and the services offered. This similarity indicates that your content is focused and relevant to the specific topic.

*  **How to Leverage It:** Ensure your content highlights how your services relate to SEO, and create detailed pages or posts that explore these topics together.

**'AI' and 'Algorithms' (0.7945):**

*  **Why It's Good:** AI and algorithms are naturally linked. If your content discusses AI, it's logical that algorithms would also be a significant topic. This high similarity is good if your goal is to be seen as an authority on AI technologies, as algorithms are a core component of AI.
*  **How to Leverage It:** Continue creating content that explains the relationship between AI and algorithms, perhaps through case studies, white papers, or technical blogs that go deeper into how algorithms drive AI solutions.

**2. Problematic Keyword Similarity Examples:**

**'SEO' and 'AI' (0.8150):**

*  **Why It Might Be a Problem:** While both SEO and AI are important topics, they serve different purposes and appeal to different audiences. If your content is highly similar between these two keywords, it might indicate that your content isn't well differentiated. This could confuse your audience or dilute the focus of your content, especially if you're trying to target specific user intents.

*  **What to Do:** Consider creating separate content strategies for SEO and AI. While they can intersect (e.g., using AI in SEO), make sure you have distinct content that clearly addresses each topic's unique aspects.

**'Data' and 'Development' (0.8101):**

*  **Why It Might Be a Problem:** If your content heavily overlaps between these two keywords, it could mean that your pages are not sufficiently distinct. Data matters in development, but both are broad topics that should be addressed separately to maximize their SEO potential.

*  **What to Do:** Ensure that your "data" content focuses on aspects like data analytics and big data, while your "development" content focuses on areas such as software development and web development. By separating these, you avoid competing against yourself in search engine rankings.


**3. The Concept of Dissimilarity:**

**When Dissimilarity is Good:**
*  **Example:** If you had a keyword like "SEO" and another like "Customer Support," you would expect low similarity because they are unrelated topics. This is good as it shows your website covers a broad range of services or topics, catering to different needs without overlapping or confusing content.

**When Dissimilarity Might Be Bad:**
*  **Example:** If you find a very low similarity between "SEO" and "Search," it could indicate a missed opportunity to connect closely related topics. If your content doesn't link these two well, you might be missing out on better keyword targeting and user engagement.
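
To make "high" and "low" similarity concrete, the sketch below applies the same formula as `calculate_cosine_similarity` (dot product divided by the product of the norms) to toy two-dimensional vectors. The vectors are hypothetical values chosen for illustration, not BERT embeddings, and the zero-vector guard is an added assumption that the original function does not include.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Assumption: treat a zero vector as having no similarity
    return dot_product / (norm_a * norm_b)

# Vectors pointing in the same direction score close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Vectors at right angles score 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(f"Same direction: {same_direction:.4f}")
print(f"Orthogonal:     {orthogonal:.4f}")
```

Cosine similarity depends only on the angle between two vectors, not on their length, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as essentially identical.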


# Final Thoughts:

*  **High Similarity is generally good** when it indicates that related keywords are well-integrated into your content. However, too much similarity between unrelated keywords might suggest that your content is not focused enough, potentially confusing users or search engines.

*  **Dissimilarity is expected and beneficial** when dealing with unrelated keywords, ensuring that your content strategy is diverse and comprehensive.
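
One possible way to act on these two points is to compare each score against whether you expect the keywords to be related in the first place. The sketch below assumes a hand-curated set of related pairs (`expected_related`, a hypothetical editorial judgment) and reuses the 0.5 and 0.2 thresholds from the comparison code. The function name `review_pairs` is likewise an assumption and not part of the original project.

```python
def review_pairs(scores, expected_related, high=0.5, low=0.2):
    """Return (pair, score, note) for pairs whose score contradicts expectations."""
    flagged = []
    for pair, score in scores.items():
        related = frozenset(pair) in expected_related
        if score > high and not related:
            flagged.append((pair, score, "high similarity between unrelated keywords"))
        elif score < low and related:
            flagged.append((pair, score, "low similarity between related keywords"))
    return flagged

# Scores copied from the example output; the related set is a hypothetical judgment call
scores = {
    ("seo", "services"): 0.6185,
    ("seo", "ai"): 0.8150,
    ("seo", "search"): 0.8106,
}
expected_related = {frozenset(("seo", "services")), frozenset(("seo", "search"))}

for pair, score, note in review_pairs(scores, expected_related):
    print(pair, f"{score:.4f}", note)
```

With these inputs only the 'seo' and 'ai' pair is flagged, which matches the "problematic similarity" case discussed earlier. Pairs are stored as `frozenset` objects so that keyword order does not matter when checking the expected set.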
